chore: use pgautoupgrade postgres 18 to match upstream coop-cloud/plausible#10

chore: upgrade to 3.1.0+v2.0.0 (pgautoupgrade postgres 13 to 16 + resilient backup/restore)
refactor: extract backup/restore into config scripts, trim comments
6 changed files with 109 additions and 50 deletions
--- a/README.md
+++ b/README.md
@@ -24,14 +24,5 @@
 5. `abra app deploy YOURAPPDOMAIN`
 6. Open the configured domain in your browser to finish set-up

-## Postgres upgrades
-
-The `db` service uses the
-[`pgautoupgrade`](https://github.com/pgautoupgrade/pgautoupgrade) image, so when
-the recipe bumps the Postgres major version the existing cluster is upgraded in
-place automatically on the next `deploy` — no manual migration steps. As with
-any major database upgrade, **take a backup of the `<stack_name>_db` volume
-first** (e.g. `abra app backup <domain>`).
-
 [`abra`]: https://git.coopcloud.tech/coop-cloud/abra
 [`coop-cloud/traefik`]: https://git.coopcloud.tech/coop-cloud/traefik
--- a/abra.sh
+++ b/abra.sh
@@ -1,3 +1,5 @@
 export CLICKHOUSE_CONF_VERSION=v2
 export CLICKHOUSE_USER_CONF_VERSION=v2
-export CLICKHOUSE_ENTRYPOINT_VERSION=v3
+export CLICKHOUSE_ENTRYPOINT_VERSION=v6
+export PG_BACKUP_VERSION=v1
+export CLICKHOUSE_BACKUP_SCRIPT_VERSION=v1
--- a/clickhouse_backup.sh
+++ b/clickhouse_backup.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+
+set -e
+
+# clickhouse-backup output lives inside the event-data volume (snapshotted via
+# backupbot.backup.volumes.event-data.path). Restoring the raw data files under a
+# running server is unsafe, so restore performs a logical restore instead.
+BACKUP_DIR=/var/lib/clickhouse/backup/events
+MIGRATIONS_TSV="$BACKUP_DIR/schema_migrations.tsv"
+
+backup() {
+  clickhouse-backup create events
+  # schema_migrations is a TinyLog table — clickhouse-backup only FREEZEs MergeTree
+  # data, so its rows aren't captured. Export them alongside the backup, else a restore
+  # leaves the ledger empty and the next boot re-runs every migration (DUPLICATE_COLUMN).
+  clickhouse-client --query "SELECT * FROM plausible_events_db.schema_migrations FORMAT TSV" > "$MIGRATIONS_TSV"
+}
+
+backup_cleanup() {
+  rm -rf "$BACKUP_DIR"
+}
+
+restore() {
+  clickhouse-backup restore --rm events
+  clickhouse-client --query "TRUNCATE TABLE plausible_events_db.schema_migrations"
+  clickhouse-client --query "INSERT INTO plausible_events_db.schema_migrations FORMAT TSV" < "$MIGRATIONS_TSV"
+  rm -rf "$BACKUP_DIR"
+}
+
+"$@"
--- a/compose.yml
+++ b/compose.yml
@@ -7,7 +7,7 @@ services:
    command: sh -c "sleep 10 && /entrypoint.sh db createdb && /entrypoint.sh db migrate && /entrypoint.sh run"
    depends_on:
      - db
-      - events_db
+      - plausible_events_db
    environment:
      - BASE_URL=https://$DOMAIN
      - SECRET_KEY_BASE
@@ -26,14 +26,16 @@ services:
      - internal
    deploy:
      restart_policy:
-        condition: on-failure
+        # `any`, not `on-failure`: a restore disrupts postgres under the app and Erlang then
+        # shuts down gracefully (exit 0), which on-failure treats as done and never restarts.
+        condition: any
      labels:
        - "traefik.enable=true"
        - "traefik.http.services.${STACK_NAME}.loadbalancer.server.port=8000"
        - "traefik.http.routers.${STACK_NAME}.rule=Host(`${DOMAIN}`${EXTRA_DOMAINS})"
        - "traefik.http.routers.${STACK_NAME}.entrypoints=web-secure"
        - "traefik.http.routers.${STACK_NAME}.tls.certresolver=${LETS_ENCRYPT_ENV}"
-        - coop-cloud.${STACK_NAME}.version=4.0.0+v2.0.0
+        - coop-cloud.${STACK_NAME}.version=3.1.0+v2.0.0
  db:
    image: pgautoupgrade/pgautoupgrade:18-alpine
    volumes:
@@ -51,14 +53,18 @@ services:
      interval: 5s
      timeout: 5s
      retries: 60
+    configs:
+      - source: pg_backup
+        target: /pg_backup.sh
+        mode: 0555
    deploy:
      labels:
        backupbot.backup: "true"
-        backupbot.backup.pre-hook: sh -c 'pg_dump -U "$$POSTGRES_USER" -Fc "$$POSTGRES_DB" | gzip > "/postgres.dump.gz"'
-        backupbot.backup.path: "/postgres.dump.gz"
-        backupbot.backup.post-hook: "rm -f /postgres.dump.gz"
+        backupbot.backup.volumes.db-data.path: "postgres.dump.gz"
+        backupbot.backup.pre-hook: "/pg_backup.sh backup"
+        backupbot.backup.post-hook: "/pg_backup.sh backup_cleanup"
        backupbot.restore: "true"
-        backupbot.restore.post-hook: sh -c 'gzip -d /postgres.dump.gz && pg_restore --clean -U "$$POSTGRES_USER" --dbname="$$PLAUSIBLE_DB" < /postgres.dump && rm -f /postgres.dump'
+        backupbot.restore.post-hook: "/pg_backup.sh restore"

  plausible_events_db:
    image: clickhouse/clickhouse-server:23.4.2.11-alpine
@@ -73,16 +79,19 @@ services:
      - source: clickhouse_entrypoint
        target: /custom-entrypoint.sh
        mode: 0555
+      - source: clickhouse_backup
+        target: /clickhouse_backup.sh
+        mode: 0555
    networks:
      - internal
    deploy:
      labels:
        backupbot.backup: "true"
-        backupbot.backup.pre-hook: clickhouse-backup create events
-        backupbot.backup.path: "/var/lib/clickhouse/backup/events"
-        backupbot.backup.post-hook: "rm -rf /var/lib/clickhouse/backup/events"
+        backupbot.backup.volumes.event-data.path: "backup/events"
+        backupbot.backup.pre-hook: "/clickhouse_backup.sh backup"
+        backupbot.backup.post-hook: "/clickhouse_backup.sh backup_cleanup"
        backupbot.restore: "true"
-        backupbot.restore.post-hook: clickhouse-backup restore --rm events && rm -rf /var/lib/clickhouse/backup/events"
+        backupbot.restore.post-hook: "/clickhouse_backup.sh restore"

 volumes:
  db-data:
@@ -103,3 +112,9 @@ configs:
  clickhouse_entrypoint:
    name: ${STACK_NAME}_clickhouse_entrypoint_${CLICKHOUSE_ENTRYPOINT_VERSION}
    file: entrypoint.clickhouse.sh
+  pg_backup:
+    name: ${STACK_NAME}_pg_backup_${PG_BACKUP_VERSION}
+    file: pg_backup.sh
+  clickhouse_backup:
+    name: ${STACK_NAME}_clickhouse_backup_${CLICKHOUSE_BACKUP_SCRIPT_VERSION}
+    file: clickhouse_backup.sh
--- a/entrypoint.clickhouse.sh
+++ b/entrypoint.clickhouse.sh
@@ -1,21 +1,9 @@
 #!/bin/bash
-# clickhouse entrypoint (cc-ci Q4.7b hardening — recipe-PR for recipe-maintainers/plausible).
-#
-# clickhouse-backup is the BACKUP tool (backupbot pre/post-hooks: `clickhouse-backup create/restore`).
-# It is NOT required for clickhouse-SERVER (`/entrypoint.sh`) to run. The published recipe fetched it
-# with `set -ex` + a single silenced no-retry wget to ephemeral /tmp, so ANY transient failure of the
-# 22 MB GitHub download (rate-limit / network) exited the container BEFORE the server started → swarm
-# restarted it → re-downloaded → amplified the throttle → crash-loop → deploy timeout (cc-ci Q4.7).
-#
-# Hardening (no behaviour change when the download succeeds first try):
-#   - cache the binary on the PERSISTENT clickhouse data volume (/var/lib/clickhouse) so it is fetched
-#     at most once and reused on every container restart (no re-download amplification);
-#   - retry with backoff to ride out transient GitHub failures;
-#   - un-silenced so a failure is diagnosable in `docker service logs`.
-#
-# Policy: clickhouse-backup is REQUIRED. If it cannot be installed after all retries the entrypoint
-# aborts (non-zero exit) and the server is NOT started — we deliberately fail the deploy loudly rather
-# than come up silently without backup/restore capability.
+# Install clickhouse-backup (powers this recipe's backup/restore hooks) before starting the
+# server. The binary is cached on the persistent volume keyed by version (downloaded at most
+# once per app) and fetched with bounded retries + a read timeout; the binary is verified before
+# being trusted or cached. If it truly cannot be installed the deploy fails loudly rather than
+# silently shipping broken backups.

 set -e

@@ -35,33 +23,37 @@ elif [[ $ARCH =~ "x86_64" ]]; then
 fi

 CACHE_DIR=/var/lib/clickhouse/.ccci-bin
-CACHED="${CACHE_DIR}/clickhouse-backup"
+CACHED="${CACHE_DIR}/clickhouse-backup-v${CLICKHOUSE_BACKUP_VERSION}"
 BIN=/usr/local/bin/clickhouse-backup
-URL="https://github.com/AlexAkulov/clickhouse-backup/releases/download/v${CLICKHOUSE_BACKUP_VERSION}/clickhouse-backup-linux-${ARCH}.tar.gz"
+URL="https://github.com/Altinity/clickhouse-backup/releases/download/v${CLICKHOUSE_BACKUP_VERSION}/clickhouse-backup-linux-${ARCH}.tar.gz"
+
+binary_ok() {
+  "$1" --version >/dev/null 2>&1
+}

 install_clickhouse_backup() {
  mkdir -p "$CACHE_DIR"
-  if [ -x "$CACHED" ]; then
+  if [ -x "$CACHED" ] && binary_ok "$CACHED"; then
    cp -f "$CACHED" "$BIN"
-    echo "clickhouse-backup: restored from persistent cache ($CACHED)"
+    echo "clickhouse-backup: using verified cached binary ($CACHED)"
    return 0
  fi
+  rm -f "$CACHED" # absent or fails to execute — re-fetch
  for attempt in 1 2 3 4 5; do
-    if wget --continue --output-document=/tmp/clickhouse-backup.tar.gz "$URL" \
-       && tar -xf /tmp/clickhouse-backup.tar.gz --directory=/usr/local/bin --strip-components=3; then
+    if wget -T 30 --continue --output-document=/tmp/clickhouse-backup.tar.gz "$URL" \
+       && tar -xf /tmp/clickhouse-backup.tar.gz --directory=/usr/local/bin --strip-components=3 \
+       && binary_ok "$BIN"; then
      cp -f "$BIN" "$CACHED" 2>/dev/null || true
-      echo "clickhouse-backup: downloaded + cached (attempt ${attempt})"
+      echo "clickhouse-backup: downloaded, verified + cached (attempt ${attempt})"
      return 0
    fi
-    echo "clickhouse-backup: fetch attempt ${attempt} failed; backing off $((attempt * 10))s" >&2
-    sleep $((attempt * 10))
+    echo "clickhouse-backup: fetch attempt ${attempt}/5 failed" >&2
+    [ "$attempt" -lt 5 ] && sleep $((attempt * 10))
  done
-  echo "clickhouse-backup: fetch FAILED after all retries — aborting; clickhouse-server will NOT start (backup tool is required)" >&2
+  echo "clickhouse-backup: could not install after 5 attempts — failing the deploy (without it backup/restore would be silently broken)" >&2
  return 1
 }

-# Required: if the backup tool cannot be installed after retries, abort (set -e) so the deploy fails
-# loudly instead of coming up without backup/restore capability.
 install_clickhouse_backup

 exec /entrypoint.sh
--- a/pg_backup.sh
+++ b/pg_backup.sh
@@ -0,0 +1,29 @@
+#!/bin/sh
+
+set -e
+
+# The dump lives at the db-data volume root: backup-bot-two v2 snapshots paths inside
+# named volumes (backupbot.backup.volumes.db-data.path), not the container root fs.
+DUMP=/var/lib/postgresql/data/postgres.dump
+
+backup() {
+  pg_dump -U "$POSTGRES_USER" -Fc "$POSTGRES_DB" | gzip > "$DUMP.gz"
+}
+
+backup_cleanup() {
+  rm -f "$DUMP.gz"
+}
+
+restore() {
+  gzip -d "$DUMP.gz"
+  # --if-exists: otherwise DROPs on objects absent from the live db error out and
+  # pg_restore exits 1, killing the chain and leaving the dump behind.
+  pg_restore --clean --if-exists -U "$POSTGRES_USER" --dbname="$POSTGRES_DB" < "$DUMP"
+  rm -f "$DUMP"
+  # pg_restore --clean recreates objects under the live app, so its pooled connections
+  # keep stale type-OID caches ('cache lookup failed for type ...' crash loops, e.g.
+  # Oban). Terminate them so Ecto reconnects fresh.
+  psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = current_database() AND pid <> pg_backend_pid();"
+}
+
+"$@"
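Both new hook scripts end in `"$@"`, so the argument given in the backupbot label (for example `/pg_backup.sh backup`) is executed as a function name. The dispatch can be sketched as below, with stub function bodies standing in for the real ones. The `dispatch` wrapper and its allow-list are hypothetical additions for illustration and are not part of this PR.

```shell
#!/bin/sh
# Sketch of the "$@" subcommand dispatch used by pg_backup.sh and
# clickhouse_backup.sh. Function bodies are stubs; the dispatch() allow-list
# is an assumption added here, the PR itself uses a bare "$@".
set -e

backup()         { echo "backup called"; }
backup_cleanup() { echo "backup_cleanup called"; }
restore()        { echo "restore called"; }

dispatch() {
  case "$1" in
    # Only the three hook entry points may be invoked.
    backup|backup_cleanup|restore) "$@" ;;
    *)
      echo "usage: $0 {backup|backup_cleanup|restore}" >&2
      return 2
      ;;
  esac
}

# Run a subcommand only when one was given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
  dispatch "$@"
fi
```

A bare `"$@"` runs whatever command name it is handed; the allow-list restricts the script to the three hook names and prints a usage line otherwise. Whether that guard is worth the extra lines in this recipe is a judgment call.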
Author	SHA1	Message	Date
notplants	d77adba469	chore: use pgautoupgrade postgres 18 to match upstream coop-cloud/plausible#10 [cc-ci/testme: success]	2026-06-15 17:22:31 +00:00
autonomic-bot	709a294d9a	chore: upgrade to 3.1.0+v2.0.0 (pgautoupgrade postgres 13 to 16 + resilient backup/restore) [cc-ci/testme: success]	2026-06-12 04:28:40 +00:00
notplants	13458fac56	refactor: extract backup/restore into config scripts, trim comments [cc-ci/testme: success] Move the postgres and clickhouse backup/restore hook logic out of inline compose labels into dedicated pg_backup.sh / clickhouse_backup.sh config scripts (the pattern other recipes use), and trim the verbose explanatory comments down to the essential rationale, now living in the scripts.	2026-06-10 16:55:20 +00:00
autonomic-bot	270c8404ce	fix: make restore correct under a live app (CI restore + custom tiers) [cc-ci/testme: success] Three independent bugs made `abra app restore` leave the stack broken: 1. ClickHouse: schema_migrations is a TinyLog table and clickhouse-backup can only FREEZE MergeTree data - it backed up the table schema but not its rows, so a restore emptied the migration ledger. The next app boot re-ran every IngestRepo migration against the fully-built tables and crash-looped (DUPLICATE_COLUMN: utm_medium) - the post-restore 502 in CI build 237. Fix: export the ledger to TSV into the backup dir (rides in the snapshotted backup/events path) in the backup pre-hook, reload it in the restore post-hook. 2. App restart policy: condition was on-failure, but when postgres is disrupted under the app the BEAM supervision tree escalates and Erlang exits GRACEFULLY (status 0) - swarm marks the task Complete and never restarts it (reproduced: app stranded at 0/1). Fix: condition any. 3. pg_restore: --clean without --if-exists exits 1 when a dropped object is absent ("errors ignored"), killing the && chain and leaving the dump behind. Fix: --if-exists, plus pg_terminate_backend afterwards so the app's pooled connections reconnect against the recreated objects. Validated on a dev deploy: marker + truncated ClickHouse events both return on restore, migration ledger intact (17 rows), post-restore event ingestion for a new site works, and an app reboot after restore migrates cleanly. Known cosmetic caveat: until the app is restarted, its Postgrex type cache holds stale OIDs and background Oban jobs log "cache lookup failed for type" - ingestion and serving are unaffected; an operator restart after a restore clears it.	2026-06-09 23:01:24 +00:00
autonomic-bot	4cab6b5146	fix: backup labels to backup-bot-two v2 volume syntax (restore was a no-op) [cc-ci/testme: failure] backup-bot-two 2.4.0 snapshots paths INSIDE named volumes (backupbot.backup.volumes.<vol>.path, relative to the volume root) and IGNORES the old backupbot.backup.path label. The db pre-hook wrote /postgres.dump.gz to the container's ephemeral root fs (outside every volume), so the dump never reached the snapshot and the restore post-hook failed on a missing file (gzip: /postgres.dump.gz: No such file). - db: dump into the db-data volume (transient; hooks remove it) and snapshot only that file via backupbot.backup.volumes.db-data.path, the same pattern as keycloak, which passes backup/restore on this CI. Also use $POSTGRES_DB in the restore hook: the previous $PLAUSIBLE_DB is defined nowhere and only connected via libpq's username fallback. - clickhouse: snapshot only backup/events (the clickhouse-backup output) inside the event-data volume instead of the whole volume, because restoring raw data files under a running server is unsafe; the post-hook performs the logical restore.	2026-06-09 21:53:18 +00:00
autonomic-bot	9f8bcbc9e3	fix: clickhouse-backup install must succeed loudly, never silently degrade [cc-ci/testme: failure] Replaces the previous best-effort (|| true) approach: a deploy without clickhouse-backup would have silently broken backup/restore, so the entrypoint now hard-fails (visibly, in service logs) if the tool truly cannot be installed, but makes that case effectively unreachable: - cache the VERIFIED binary on the persistent clickhouse volume, keyed by version: downloaded at most once per app; container restarts never re-fetch (kills the re-download amplification that turned a GitHub throttle into a permanent crash-loop) - canonical Altinity release URL (project moved; old path is a redirect) - bounded retries with backoff + wget read timeout (a stalled connection can no longer hang the deploy) - verify the binary executes before trusting or caching it (catches truncated downloads and a corrupt cache) - compose: fix app depends_on to the real service name (plausible_events_db); docker compose config was failing on it, which disabled CI image prepull and pushed pulls into the deploy window - bump CLICKHOUSE_ENTRYPOINT_VERSION v4 -> v5 (swarm configs immutable) Verified on a dev deploy: fresh download path, cached-restart path, clickhouse-backup create/list/delete, and /api/health all green.	2026-06-09 19:09:13 +00:00
autonomic-bot	b90a8c4239	fix: clickhouse entrypoint - backup download is best-effort (server must start regardless) [cc-ci/testme: failure] The previous entrypoint treated clickhouse-backup as required: a download failure (rate-limit or transient network) caused install_clickhouse_backup to return 1 which with set -e exited the entrypoint before /entrypoint.sh ran. ClickHouse never started, the swarm restarted it, the download was retried, amplifying the throttle -> crash-loop -> deploy timeout (cc-ci Q4.7b). Fix: install_clickhouse_backup || true, so the server starts even if the backup tool cannot be fetched. Backup/restore degrades until a later restart fetches it. Also: fix stray trailing quote in backupbot.restore.post-hook; bump CLICKHOUSE_ENTRYPOINT_VERSION v3->v4 (config content changed).	2026-06-09 18:30:18 +00:00
notplants	50a3715caa	chore: upgrade to 3.1.0+v2.0.0 [cc-ci/testme: failure] Minor bump: no operator action required (Postgres/ClickHouse changes are automatic). - Postgres: use pgautoupgrade/pgautoupgrade:18-alpine in place of the custom pg_upgrade entrypoint. The existing cluster is upgraded in place automatically on deploy; PGDATA pinned to the legacy path; adds a pg_isready healthcheck. Removes entrypoint.postgres.sh.tmpl and DB_ENTRYPOINT_VERSION. - ClickHouse backup fetch: cache the clickhouse-backup binary on the persistent volume and retry with backoff to avoid the download crash-loop. The tool is required: if it can't be installed after retries the entrypoint aborts and the server does not start, rather than coming up without backup/restore. - Add CLICKHOUSE_DATABASE_URL; bump the clickhouse entrypoint config version. - Remove a stray broken link reference in the README.	2026-06-09 15:46:28 +00:00
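
The install flow described in commit 9f8bcbc9e3 (verified cache, bounded retries, verify before caching) can be exercised without network access using the sketch below. It makes several assumptions: `fetch_tool` is a hypothetical stand-in for the `wget` + `tar` step and deliberately fails once, the paths live in a temporary directory, and `MAX_ATTEMPTS` and `BACKOFF_SECONDS` are illustrative names that do not appear in the recipe.

```shell
#!/bin/bash
# Sketch only: mirrors the shape of install_clickhouse_backup from the diff,
# with the network fetch replaced by a stub so the control flow can be run
# offline. All names and paths below are assumptions for illustration.
set -e

WORK_DIR="$(mktemp -d)"
trap 'rm -rf "$WORK_DIR"' EXIT
CACHED="$WORK_DIR/cache/tool-v1"      # version-keyed cache entry
BIN="$WORK_DIR/bin/tool"              # install target
MAX_ATTEMPTS=3
BACKOFF_SECONDS="${BACKOFF_SECONDS:-0}"  # the recipe backs off attempt * 10s
FETCH_CALLS=0

# Hypothetical fetch: fails on the first call, then writes a tiny
# executable that exits 0 when run.
fetch_tool() {
  FETCH_CALLS=$((FETCH_CALLS + 1))
  [ "$FETCH_CALLS" -ge 2 ] || return 1
  printf '#!/bin/sh\necho "tool 1.0"\n' > "$BIN"
  chmod +x "$BIN"
}

# A binary is trusted only if it actually executes.
binary_ok() {
  "$1" --version >/dev/null 2>&1
}

install_tool() {
  mkdir -p "$(dirname "$CACHED")" "$(dirname "$BIN")"
  if [ -x "$CACHED" ] && binary_ok "$CACHED"; then
    cp -f "$CACHED" "$BIN"
    echo "using verified cached binary"
    return 0
  fi
  rm -f "$CACHED"   # absent or broken cache entry: fetch again
  local attempt=1
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    if fetch_tool && binary_ok "$BIN"; then
      cp -f "$BIN" "$CACHED"
      echo "downloaded, verified + cached (attempt ${attempt})"
      return 0
    fi
    echo "fetch attempt ${attempt}/${MAX_ATTEMPTS} failed" >&2
    # No sleep after the final attempt.
    if [ "$attempt" -lt "$MAX_ATTEMPTS" ]; then
      sleep $((attempt * BACKOFF_SECONDS))
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

Calling `install_tool` twice in the same shell shows both paths: the first call fails one fetch, succeeds on the second attempt and populates the cache, and the second call is served from the cache without fetching.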