fix: make restore correct under a live app (CI restore + custom tiers)

Three independent bugs made `abra app restore` leave the stack broken:

1. ClickHouse: schema_migrations is a TinyLog table, and clickhouse-backup can only FREEZE MergeTree data. It backed up the table schema but not its rows, so a restore emptied the migration ledger. The next app boot re-ran every IngestRepo migration against the fully-built tables and crash-looped (DUPLICATE_COLUMN: utm_medium); this was the post-restore 502 in CI build 237. Fix: the backup pre-hook exports the ledger as TSV into the backup dir (so it rides in the snapshotted backup/events path), and the restore post-hook reloads it.

2. App restart policy: the condition was on-failure, but when postgres is disrupted under the app the BEAM supervision tree escalates and Erlang exits GRACEFULLY (status 0). Swarm then marks the task Complete and never restarts it (reproduced: app stranded at 0/1). Fix: condition any.

3. pg_restore: --clean without --if-exists exits 1 when an object to be dropped is absent ("errors ignored"), which kills the && chain and leaves the dump behind. Fix: add --if-exists, plus pg_terminate_backend afterwards so the app's pooled connections reconnect against the recreated objects.

Validated on a dev deploy: the marker and the truncated ClickHouse events both return on restore, the migration ledger is intact (17 rows), post-restore event ingestion for a new site works, and an app reboot after restore migrates cleanly.

Known cosmetic caveat: until the app is restarted, its Postgrex type cache holds stale OIDs and background Oban jobs log "cache lookup failed for type". Ingestion and serving are unaffected; an operator restart after a restore clears it.
2026-06-09 23:01:24 +00:00
parent 4cab6b5146
commit 270c8404ce
1 changed files with 17 additions and 4 deletions
--- a/compose.yml
+++ b/compose.yml
@@ -26,7 +26,10 @@ services:
      - internal
    deploy:
      restart_policy:
-        condition: on-failure
+        # `any`, not `on-failure`: when postgres is disrupted under the app (e.g. a restore),
+        # the BEAM supervision tree escalates and Erlang shuts down GRACEFULLY (exit 0) — with
+        # on-failure swarm marks the task Complete and never restarts it, leaving the app down.
+        condition: any
      labels:
        - "traefik.enable=true"
        - "traefik.http.services.${STACK_NAME}.loadbalancer.server.port=8000"
@@ -62,7 +65,12 @@ services:
        backupbot.backup.pre-hook: sh -c 'pg_dump -U "$$POSTGRES_USER" -Fc "$$POSTGRES_DB" | gzip > /var/lib/postgresql/data/postgres.dump.gz'
        backupbot.backup.post-hook: "rm -f /var/lib/postgresql/data/postgres.dump.gz"
        backupbot.restore: "true"
-        backupbot.restore.post-hook: sh -c 'gzip -d /var/lib/postgresql/data/postgres.dump.gz && pg_restore --clean -U "$$POSTGRES_USER" --dbname="$$POSTGRES_DB" < /var/lib/postgresql/data/postgres.dump && rm -f /var/lib/postgresql/data/postgres.dump'
+        # --if-exists: without it the DROPs error on objects absent from the live db and
+        # pg_restore exits 1 ("errors ignored"), killing the && chain (dump left behind).
+        # pg_terminate_backend afterwards: pg_restore --clean recreates objects under the live
+        # app, so its pooled connections keep stale type-OID caches ('cache lookup failed for
+        # type ...' crash loops, e.g. Oban) — terminating them makes Ecto reconnect fresh.
+        backupbot.restore.post-hook: sh -c 'gzip -d /var/lib/postgresql/data/postgres.dump.gz && pg_restore --clean --if-exists -U "$$POSTGRES_USER" --dbname="$$POSTGRES_DB" < /var/lib/postgresql/data/postgres.dump && rm -f /var/lib/postgresql/data/postgres.dump && psql -U "$$POSTGRES_USER" -d "$$POSTGRES_DB" -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = current_database() AND pid <> pg_backend_pid();"'

  plausible_events_db:
    image: clickhouse/clickhouse-server:23.4.2.11-alpine
@@ -86,10 +94,15 @@ services:
        # inside the event-data volume — not the live raw data files (restoring those under a
        # running server is unsafe; the restore post-hook performs the logical restore instead).
        backupbot.backup.volumes.event-data.path: "backup/events"
-        backupbot.backup.pre-hook: clickhouse-backup create events
+        # schema_migrations is a TinyLog table — clickhouse-backup can only FREEZE MergeTree
+        # data, so it backs up that table's SCHEMA but not its rows, and a restore would leave
+        # the migration ledger empty: the next app boot then re-runs every ClickHouse migration
+        # against the fully-built tables and crash-loops (DUPLICATE_COLUMN). Export its rows
+        # into the backup dir alongside the clickhouse-backup output, and reload them on restore.
+        backupbot.backup.pre-hook: sh -c 'clickhouse-backup create events && clickhouse-client --query "SELECT * FROM plausible_events_db.schema_migrations FORMAT TSV" > /var/lib/clickhouse/backup/events/schema_migrations.tsv'
        backupbot.backup.post-hook: "rm -rf /var/lib/clickhouse/backup/events"
        backupbot.restore: "true"
-        backupbot.restore.post-hook: clickhouse-backup restore --rm events && rm -rf /var/lib/clickhouse/backup/events
+        backupbot.restore.post-hook: sh -c 'clickhouse-backup restore --rm events && clickhouse-client --query "TRUNCATE TABLE plausible_events_db.schema_migrations" && clickhouse-client --query "INSERT INTO plausible_events_db.schema_migrations FORMAT TSV" < /var/lib/clickhouse/backup/events/schema_migrations.tsv && rm -rf /var/lib/clickhouse/backup/events'

volumes:
  db-data:
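
The pg_restore failure mode described in the commit message can be reproduced without a database. The following is a minimal sketch, not the actual hook: `restore_without_if_exists`, `restore_with_if_exists`, and `run_chain` are hypothetical stand-ins that only mimic the exit statuses involved (1 for a restore that hit "errors ignored", 0 for a clean one).

```shell
#!/bin/sh
# Hypothetical stand-ins: they only reproduce the exit status, not the restore.
restore_without_if_exists() { return 1; }  # pg_restore --clean hitting absent objects
restore_with_if_exists() { return 0; }     # pg_restore --clean --if-exists

run_chain() {
  # Same shape as the post-hook: decompress && restore && cleanup.
  # `true` stands in for the gzip step; "$1" is the restore stand-in.
  true && "$1" && echo "cleanup ran"
  echo "chain exit status: $?"
}

run_chain restore_without_if_exists
run_chain restore_with_if_exists
```

With the failing stand-in, the cleanup step never runs and the chain reports status 1, which corresponds to the dump being left behind. With the succeeding stand-in, the cleanup runs and the chain reports status 0.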