chore: upgrade to 3.1.0+v2.0.0 (pgautoupgrade + resilient ClickHouse entrypoint) #3

Open
autonomic-bot wants to merge 5 commits from upgrade-3.1.0+v2.0.0 into main

Author SHA1 Message Date
270c8404ce fix: make restore correct under a live app (CI restore + custom tiers)
All checks were successful
cc-ci/testme cc-ci: success
Three independent bugs made `abra app restore` leave the stack broken:

1. ClickHouse: schema_migrations is a TinyLog table and clickhouse-backup
   can only FREEZE MergeTree data - it backed up the table schema but
   not its rows, so a restore emptied the migration ledger. The next app
   boot re-ran every IngestRepo migration against the fully-built tables
   and crash-looped (DUPLICATE_COLUMN: utm_medium) - the post-restore 502
   in CI build 237. Fix: export the ledger to TSV into the backup dir
   (rides in the snapshotted backup/events path) in the backup pre-hook,
   reload it in the restore post-hook.

2. App restart policy: condition was on-failure, but when postgres is
   disrupted under the app the BEAM supervision tree escalates and Erlang
   exits GRACEFULLY (status 0) - swarm marks the task Complete and never
   restarts it (reproduced: app stranded at 0/1). Fix: condition any.

3. pg_restore: --clean without --if-exists exits 1 when a dropped object
   is absent ("errors ignored"), killing the && chain and leaving the
   dump behind. Fix: --if-exists, plus pg_terminate_backend afterwards so
   the app's pooled connections reconnect against the recreated objects.
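
The pg_restore fix in item 3 could look roughly like the sketch below. The
function names, the `DUMP` path and the `PG_RESTORE` override are assumptions
made for illustration; the recipe's real hook is not shown in this message.

```shell
#!/bin/sh
# Sketch only: restore_db, reset_connections, DUMP and PG_RESTORE are
# hypothetical names, and the dump path is an assumed location.
DUMP="${DUMP:-/var/lib/postgresql/data/postgres.dump.gz}"
PG_RESTORE="${PG_RESTORE:-pg_restore}"   # overridable for a dry run

restore_db() {
    # The pipeline's status is pg_restore's. With --clean alone, a missing
    # object makes it exit 1 and the rm after && never runs; --if-exists
    # lets the drop of an absent object pass quietly.
    gunzip -c "$DUMP" \
        | "$PG_RESTORE" --clean --if-exists -U "$POSTGRES_USER" -d "$POSTGRES_DB" \
        && rm -f "$DUMP"
}

reset_connections() {
    # Drop every other session on the restored database so pooled clients
    # reconnect and see the recreated objects.
    psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c \
        "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
         WHERE datname = current_database() AND pid <> pg_backend_pid();"
}
```

Calling `restore_db && reset_connections` keeps the same property as the
original chain: the sessions are only reset once the restore itself succeeded.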

Validated on a dev deploy: marker + truncated ClickHouse events both
return on restore, migration ledger intact (17 rows), post-restore event
ingestion for a new site works, and an app reboot after restore migrates
cleanly. Known cosmetic caveat: until the app is restarted, its Postgrex
type cache holds stale OIDs and background Oban jobs log "cache lookup
failed for type" - ingestion and serving are unaffected; an operator
restart after a restore clears it.
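
The ledger export and reload from item 1 can be sketched as two small hook
helpers. `CH_CLIENT`, `CH_DB`, `LEDGER_FILE` and both function names are
assumptions for illustration, as is the database name; only the
`schema_migrations` table and the backup/events location come from this change.

```shell
#!/bin/sh
# Sketch only: variable names, database name and file path are assumed.
CH_CLIENT="${CH_CLIENT:-clickhouse-client}"
CH_DB="${CH_DB:-plausible_events_db}"
LEDGER_FILE="${LEDGER_FILE:-/var/lib/clickhouse/backup/events/schema_migrations.tsv}"

# Backup pre-hook: dump the TinyLog ledger rows next to the
# clickhouse-backup output so they ride along in the snapshot.
export_ledger() {
    "$CH_CLIENT" --query \
        "SELECT * FROM ${CH_DB}.schema_migrations FORMAT TabSeparated" \
        > "$LEDGER_FILE"
}

# Restore post-hook: reload the rows. Assumes the restored table exists
# and is empty, which is the failure mode described above.
import_ledger() {
    [ -s "$LEDGER_FILE" ] || return 0    # nothing exported, nothing to load
    "$CH_CLIENT" --query \
        "INSERT INTO ${CH_DB}.schema_migrations FORMAT TabSeparated" \
        < "$LEDGER_FILE"
}
```

Running `import_ledger` twice would duplicate rows, so a real hook may want to
truncate the table first; that step is left out here because it is not
described in this change.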
2026-06-09 23:01:24 +00:00
4cab6b5146 fix: backup labels to backup-bot-two v2 volume syntax (restore was a no-op)
Some checks failed
cc-ci/testme cc-ci: failure
backup-bot-two 2.4.0 snapshots paths INSIDE named volumes
(backupbot.backup.volumes.<vol>.path, relative to the volume root) and
IGNORES the old backupbot.backup.path label. The db pre-hook wrote
/postgres.dump.gz to the container's ephemeral root fs — outside every
volume — so the dump never reached the snapshot and the restore post-hook
failed on a missing file (gzip: /postgres.dump.gz: No such file).

- db: dump into the db-data volume (transient; hooks remove it) and
  snapshot only that file via backupbot.backup.volumes.db-data.path —
  same pattern as keycloak, which passes backup/restore on this CI.
  Also use $POSTGRES_DB in the restore hook: the previous $PLAUSIBLE_DB
  is defined nowhere, so the hook only connected because libpq falls
  back to a database named after the user.
- clickhouse: snapshot only backup/events (the clickhouse-backup output)
  inside the event-data volume instead of the whole volume — restoring
  raw data files under a running server is unsafe; the post-hook performs
  the logical restore.
2026-06-09 21:53:18 +00:00
9f8bcbc9e3 fix: clickhouse-backup install must succeed loudly, never silently degrade
Some checks failed
cc-ci/testme cc-ci: failure
Replaces the previous best-effort (|| true) approach: a deploy without
clickhouse-backup would have silently broken backup/restore, so the
entrypoint now hard-fails (visibly, in service logs) if the tool truly
cannot be installed — but makes that case effectively unreachable:

- cache the VERIFIED binary on the persistent clickhouse volume, keyed
  by version: downloaded at most once per app; container restarts never
  re-fetch (kills the re-download amplification that turned a GitHub
  throttle into a permanent crash-loop)
- canonical Altinity release URL (project moved; old path is a redirect)
- bounded retries with backoff + wget read timeout (a stalled connection
  can no longer hang the deploy)
- verify the binary executes before trusting or caching it (catches
  truncated downloads and a corrupt cache)
- compose: fix app depends_on to the real service name
  (plausible_events_db) — docker compose config was failing on it, which
  disabled CI image prepull and pushed pulls into the deploy window
- bump CLICKHOUSE_ENTRYPOINT_VERSION v4 -> v5 (swarm configs immutable)
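
The cache, retry and verify steps listed above can be sketched as follows.
This is not the entrypoint itself: `CACHE_DIR`, `CB_VERSION`, `MAX_TRIES`,
`BACKOFF_START` and `fetch_binary` are hypothetical names, the version is a
placeholder, and using `--version` as the execution check is an assumption.

```shell
#!/bin/sh
# Sketch only: all names and defaults below are illustrative assumptions.
CACHE_DIR="${CACHE_DIR:-/var/lib/clickhouse/.cache}"
CB_VERSION="${CB_VERSION:-0.0.0}"     # placeholder, not a real release
MAX_TRIES="${MAX_TRIES:-5}"

# Placeholder download step; a real one would call wget with a read
# timeout against the release URL and write to the path in $1.
fetch_binary() {
    echo "fetch_binary is a placeholder" >&2
    return 1
}

# A binary is trusted only if it actually runs.
verify_binary() { [ -x "$1" ] && "$1" --version >/dev/null 2>&1; }

install_cached() {
    bin="$CACHE_DIR/clickhouse-backup-$CB_VERSION"
    if verify_binary "$bin"; then      # cache hit: no network access
        echo "$bin"; return 0
    fi
    mkdir -p "$CACHE_DIR"
    try=1
    delay="${BACKOFF_START:-2}"
    while [ "$try" -le "$MAX_TRIES" ]; do
        if fetch_binary "$bin.tmp" && chmod +x "$bin.tmp" \
            && verify_binary "$bin.tmp"; then
            mv "$bin.tmp" "$bin"       # publish only a verified binary
            echo "$bin"; return 0
        fi
        rm -f "$bin.tmp"
        if [ "$try" -lt "$MAX_TRIES" ]; then sleep "$delay"; fi
        try=$((try + 1))
        delay=$((delay * 2))           # exponential backoff
    done
    echo "clickhouse-backup $CB_VERSION could not be installed" >&2
    return 1                           # hard failure, visible in logs
}
```

Downloading to a `.tmp` name and renaming only after verification means a
truncated download never lands under the cached name, and a corrupt cache
entry fails `verify_binary` and is fetched again.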

Verified on a dev deploy: fresh download path, cached-restart path,
clickhouse-backup create/list/delete, and /api/health all green.
2026-06-09 19:09:13 +00:00
b90a8c4239 fix: clickhouse entrypoint - backup download is best-effort (server must start regardless)
Some checks failed
cc-ci/testme cc-ci: failure
The previous entrypoint treated clickhouse-backup as required: a download failure
(rate limit or transient network error) made install_clickhouse_backup return 1, which
under set -e exited the entrypoint before /entrypoint.sh ran. ClickHouse never started,
swarm restarted the task, and each restart retried the download, amplifying the throttle
into a crash-loop and finally a deploy timeout (cc-ci Q4.7b).

Fix: install_clickhouse_backup || true, so the server starts even if the backup tool
cannot be fetched. Backup/restore stays unavailable until a later restart fetches it.
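
The difference between the two behaviours can be reproduced in isolation. The
sketch below uses a stand-in `install_clickhouse_backup` that always fails and
runs each variant in its own `sh` process; the variable names are illustrative
and none of this is the actual entrypoint.

```shell
#!/bin/sh
# Stand-in entrypoint bodies: the install function always fails, imitating
# a throttled download. Each body runs in a separate sh process so that
# set -e applies as it would in a real entrypoint.
STRICT='set -e
install_clickhouse_backup() { return 1; }
install_clickhouse_backup
echo "server started"'

LENIENT='set -e
install_clickhouse_backup() { return 1; }
install_clickhouse_backup || true
echo "server started"'

strict_rc=0;  strict_out=$(sh -c "$STRICT")   || strict_rc=$?
lenient_rc=0; lenient_out=$(sh -c "$LENIENT") || lenient_rc=$?

echo "strict:  rc=$strict_rc out='$strict_out'"
echo "lenient: rc=$lenient_rc out='$lenient_out'"
```

The strict variant exits with status 1 before reaching the `echo`, while the
lenient variant prints "server started" and exits 0.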

Also: fix stray trailing quote in backupbot.restore.post-hook; bump
CLICKHOUSE_ENTRYPOINT_VERSION v3->v4 (config content changed).
2026-06-09 18:30:18 +00:00
50a3715caa chore: upgrade to 3.1.0+v2.0.0
Some checks failed
cc-ci/testme cc-ci: failure
Minor bump — no operator action required (Postgres/ClickHouse changes are automatic).

- Postgres: use pgautoupgrade/pgautoupgrade:18-alpine in place of the custom
  pg_upgrade entrypoint. The existing cluster is upgraded in place automatically
  on deploy; PGDATA pinned to the legacy path; adds a pg_isready healthcheck.
  Removes entrypoint.postgres.sh.tmpl and DB_ENTRYPOINT_VERSION.
- ClickHouse backup fetch: cache the clickhouse-backup binary on the persistent
  volume and retry with backoff to avoid the download crash-loop. The tool is
  required — if it can't be installed after retries the entrypoint aborts and
  the server does not start, rather than coming up without backup/restore.
- Add CLICKHOUSE_DATABASE_URL; bump the clickhouse entrypoint config version.
- Remove a stray broken link reference in the README.
2026-06-09 15:46:28 +00:00