chore: upgrade to 3.1.0+v2.0.0 (pgautoupgrade + resilient ClickHouse entrypoint) #3
Recipe upgrade for plausible (supersedes #1 and #2).
Version: `3.1.0+v2.0.0`, a minor bump; no operator action required (the Postgres and ClickHouse changes are fully automatic).

Changes
- Replaces `entrypoint.postgres.sh.tmpl` (apt-installed old PG binaries + manual `pg_upgrade --link`) with the `pgautoupgrade/pgautoupgrade:18-alpine` image. The existing cluster is upgraded in place automatically; `PGDATA` is pinned to the legacy path. Adds a `pg_isready` healthcheck.
- Caches the `clickhouse-backup` binary on the persistent volume and retries the download with backoff to avoid the download crash-loop. The tool is required: if it can't be installed after the retries, the entrypoint aborts (the server does not start) rather than coming up without backup/restore.
- … `CLICKHOUSE_DATABASE_URL` and bumps the ClickHouse entrypoint config version.

Testing
Verified on a test instance: fresh deploy healthy; PG 13 → 18 in-place upgrade confirmed (data + full plausible schema intact, app serving); ClickHouse comes up with the backup tool cached.
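The cache-plus-retry behaviour listed under Changes can be sketched roughly as follows. This is a Python model of the logic only: the recipe's real entrypoint is a shell template, and `ensure_tool`, `download`, the attempt count and the delay values are hypothetical names and numbers chosen for illustration, not taken from the recipe.

```python
import time
from pathlib import Path
from typing import Callable


def ensure_tool(
    cache_path: Path,
    download: Callable[[], bytes],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Path:
    """Return the cached binary, downloading it with retries if it is absent.

    Raises RuntimeError when every attempt fails, mirroring the hard-fail
    behaviour described above (no silent start without the backup tool).
    """
    # Cached on the persistent volume: a restart needs no network at all.
    if cache_path.exists() and cache_path.stat().st_size > 0:
        return cache_path

    last_error = None
    for attempt in range(attempts):
        try:
            data = download()
            if not data:
                raise OSError("empty download")
            # Write to a temporary name first, then rename, so that a
            # partial download is never mistaken for a valid cached binary.
            tmp = cache_path.with_name(cache_path.name + ".tmp")
            tmp.write_bytes(data)
            tmp.replace(cache_path)
            return cache_path
        except OSError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)

    raise RuntimeError(
        f"clickhouse-backup could not be installed after {attempts} attempts"
    ) from last_error
```

Passing `download` and `sleep` in as parameters is a design choice of this sketch: it lets the retry logic be exercised without network access or real waiting.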
!testme
🌻 cc-ci: plausible@60a611d1 ❌ failure → https://drone.ci.commoninternet.net/recipe-maintainers/cc-ci/218 (summary card unavailable; see the run for details)
60a611d1fd to 50a3715caa
!testme
🌻 cc-ci: plausible@50a3715c ❌ failure
!testme
🌻 cc-ci: plausible@50a3715c ❌ failure → https://drone.ci.commoninternet.net/recipe-maintainers/cc-ci/220 (summary card unavailable; see the run for details)
!testme
🌻 cc-ci: plausible@50a3715c ❌ killed → https://drone.ci.commoninternet.net/recipe-maintainers/cc-ci/223 (summary card unavailable; see the run for details)
!testme
🌻 cc-ci: plausible@b90a8c42 ❌ failure → https://drone.ci.commoninternet.net/recipe-maintainers/cc-ci/227 (summary card unavailable; see the run for details)
!testme
🌻 cc-ci: plausible@9f8bcbc9 ❌ failure
@notplants — root cause of the ClickHouse crash-loop found + fixed on the recipe side; CI needs a one-line test-config change to go green.
Root cause (the real one)
The crash-looping `plausible_events_db` in CI is not the PR's tree; it is the published `3.0.0+v2.0.0` base that the upgrade tier deploys first. The 3.0.0 entrypoint's ARCH mapping has no `x86_64` branch (added later in 3.0.1, commit `6c73753` "Add x86_64 support"), so on this amd64 host it requests `clickhouse-backup-linux-x86_64.tar.gz` → HTTP 404, deterministically (the asset is named `…-amd64.tar.gz`; verified: x86_64 → 404, amd64 → 200). With `wget --quiet … 2>/dev/null` + `set -e` the container exits 1 without printing a byte (hence the empty `docker service logs`) and crash-loops until the deploy times out (1200 s) ⇒ install RED, everything else skipped.

The harness picks the base as `recipe_versions[-2]`: published tags are `…, 3.0.0+v2.0.0, 3.0.1+v2.0.0`, so `[-2]` = 3.0.0, exactly the version that can never start on amd64. This PR adds 3.1.0 above the newest published tag, which is the documented case where the default is wrong and the correct base is `[-1]` (3.0.1). The harness provides an explicit override for this: `UPGRADE_BASE_VERSION = "3.0.1+v2.0.0"`.

This run is in DEFAULT (recipe-only) mode, so the cc-ci tests are out of bounds; I am flagging it here instead of editing the gate. With that line, the upgrade tier tests the real upgrade path `3.0.1 → 3.1.0` and the broken 3.0.0 base is never deployed.

Recipe-side fix (pushed, `9f8bcbc`)
Replaces the earlier best-effort (`|| true`) approach: a deploy without `clickhouse-backup` would have silently broken backup/restore. The entrypoint now hard-fails loudly if the tool truly cannot be installed, but makes that case effectively unreachable:
- `wget -T 30` read timeout (a stalled connection can't hang the deploy);
- … `docker service logs`;
- `app.depends_on` referenced nonexistent `events_db` → fixed to `plausible_events_db`, which un-breaks `docker compose config` and re-enables CI image prepull.

Verified on a dev deploy (since torn down): fresh-download path, cached-restart path, `clickhouse-backup create/list/delete`, and `/api/health` `{"clickhouse":"ok","postgres":"ok","sites_cache":"ok"}` all green.

Also noted while debugging (pre-existing on main, left out of this PR's scope): `.env.sample` ships `DISABLE_AUTH=replace-me` / `DISABLE_REGISTRATION=replace-me`; plausible atomizes these (`String.to_existing_atom`), so an operator who deploys without changing them gets an app crash-loop (`binary_to_existing_atom("replace-me")`).

!testme
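To make the ARCH-mapping root cause described above concrete, here is a small Python model of how a mapping without an `x86_64` branch produces the wrong asset name. The real entrypoint is a shell script; `asset_name` and the two mapping tables are hypothetical, and the `aarch64` entry is an assumption added only so the table is not empty. The only facts taken from the thread are that an unmapped `x86_64` ends up in the file name and that the published asset uses `amd64`.

```python
# Asset name pattern as quoted in the thread:
# clickhouse-backup-linux-<arch>.tar.gz
ASSET_TEMPLATE = "clickhouse-backup-linux-{arch}.tar.gz"

# Assumption: the older mapping handled some machine names but not x86_64.
MAPPING_WITHOUT_X86_64 = {"aarch64": "arm64"}

# Same table with the missing branch added.
MAPPING_WITH_X86_64 = {"aarch64": "arm64", "x86_64": "amd64"}


def asset_name(machine: str, mapping: dict[str, str]) -> str:
    """Build the release asset name for a `uname -m` style machine string.

    Machine names absent from the mapping pass through unchanged, which is
    how an unmapped x86_64 leaks into the download URL.
    """
    arch = mapping.get(machine, machine)
    return ASSET_TEMPLATE.format(arch=arch)


broken = asset_name("x86_64", MAPPING_WITHOUT_X86_64)
fixed = asset_name("x86_64", MAPPING_WITH_X86_64)
print(broken)  # the name reported above as returning HTTP 404
print(fixed)   # the name reported above as returning HTTP 200
```

Because the broken name is built deterministically from the host architecture, retrying the download cannot help, which matches the crash-loop seen on the amd64 CI host.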
🌻 cc-ci: plausible@4cab6b51 ❌ failure
Three independent bugs made `abra app restore` leave the stack broken:

1. ClickHouse: `schema_migrations` is a TinyLog table and clickhouse-backup can only FREEZE MergeTree data - it backed up the table schema but not its rows, so a restore emptied the migration ledger. The next app boot re-ran every IngestRepo migration against the fully-built tables and crash-looped (`DUPLICATE_COLUMN: utm_medium`) - the post-restore 502 in CI build 237. Fix: export the ledger to TSV into the backup dir (rides in the snapshotted backup/events path) in the backup pre-hook, reload it in the restore post-hook.
2. App restart policy: condition was `on-failure`, but when postgres is disrupted under the app the BEAM supervision tree escalates and Erlang exits GRACEFULLY (status 0) - swarm marks the task Complete and never restarts it (reproduced: app stranded at 0/1). Fix: condition `any`.
3. `pg_restore`: `--clean` without `--if-exists` exits 1 when a dropped object is absent ("errors ignored"), killing the `&&` chain and leaving the dump behind. Fix: `--if-exists`, plus `pg_terminate_backend` afterwards so the app's pooled connections reconnect against the recreated objects.

Validated on a dev deploy: marker + truncated ClickHouse events both return on restore, migration ledger intact (17 rows), post-restore event ingestion for a new site works, and an app reboot after restore migrates cleanly.

Known cosmetic caveat: until the app is restarted, its Postgrex type cache holds stale OIDs and background Oban jobs log "cache lookup failed for type" - ingestion and serving are unaffected; an operator restart after a restore clears it.

!testme
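The restart-policy bug above comes down to how the swarm restart condition treats a clean exit. A minimal Python model of that decision, assuming only the three documented condition values (`none`, `on-failure`, `any`); the function name `swarm_restarts` is hypothetical:

```python
def swarm_restarts(condition: str, exit_code: int) -> bool:
    """Return True if a task that exited with `exit_code` would be restarted
    under the given restart_policy condition."""
    if condition == "none":
        return False
    if condition == "on-failure":
        # Only a non-zero exit counts as a failure.
        return exit_code != 0
    if condition == "any":
        return True
    raise ValueError(f"unknown restart condition: {condition!r}")


# A graceful BEAM shutdown exits with status 0, so on-failure leaves the
# task stopped, while any brings it back.
graceful_on_failure = swarm_restarts("on-failure", 0)
graceful_any = swarm_restarts("any", 0)
```

This is why switching the app service to `any` is enough to recover from the graceful exit that follows a postgres disruption.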
🌻 cc-ci: plausible@270c8404 ✅ passed
@notplants — GREEN ✅ at level 4 (max): https://drone.ci.commoninternet.net/recipe-maintainers/cc-ci/247 — install, upgrade (3.0.1→3.1.0 with the in-place postgres 13→18 pgautoupgrade), backup, restore, and the functional event-tracking tier all pass; clean teardown, no secret leaks.
What it took (each verified on a dev deploy before pushing, all dev deploys torn down):
- `9f8bcbc`: ClickHouse entrypoint: clickhouse-backup install is cached on the persistent volume, retried with timeout, binary-verified, canonical Altinity URL, and hard-fails loudly instead of silently degrading. Fixed `depends_on` service name (re-enables CI image prepull).
- `4cab6b5`: backup labels migrated to backup-bot-two v2 volume syntax (the old `backup.path` label is ignored by 2.4.0; the pg dump never reached snapshots, so restore was a silent no-op).
- `270c840`: restore correctness under a live app: ClickHouse `schema_migrations` (TinyLog, not FREEZEable) round-trips via TSV in the backup dir; app `restart_policy: any` (graceful BEAM exit-0 stranded the task under on-failure); `pg_restore --clean --if-exists` + `pg_terminate_backend` so app connections reconnect against recreated objects.

cc-ci side (merged): `UPGRADE_BASE_VERSION = "3.0.1+v2.0.0"` (the harness's default base, 3.0.0, can never start on amd64: its entrypoint lacks an x86_64 mapping → 404 → silent exit 1, the original "empty logs" crash-loop) and `psql -q` in the event-tracking site registration (cc-ci#9).

Operator note for the release: after an `abra app restore`, restart the app (`abra app restart <domain> app`); until then background Oban jobs log stale type-OID errors (ingestion and serving are unaffected).

!testme
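The base-version choice on the cc-ci side can be illustrated with a short sketch. It assumes only what the thread states: the default base is `recipe_versions[-2]`, and an explicit `UPGRADE_BASE_VERSION` takes precedence. The function `pick_upgrade_base` is a hypothetical stand-in for the harness's own code, and the tag list is shortened to the two entries quoted in the thread (earlier tags omitted).

```python
from typing import Optional, Sequence


def pick_upgrade_base(
    recipe_versions: Sequence[str],
    override: Optional[str] = None,
) -> str:
    """Choose the version the upgrade tier deploys before upgrading.

    Default: the second-newest published tag. An explicit override wins.
    """
    if override is not None:
        return override
    if len(recipe_versions) < 2:
        raise ValueError("need at least two published versions for a default base")
    return recipe_versions[-2]


# Published tags as quoted in the thread (older ones left out here).
published = ["3.0.0+v2.0.0", "3.0.1+v2.0.0"]

default_base = pick_upgrade_base(published)
overridden_base = pick_upgrade_base(published, override="3.0.1+v2.0.0")
print(default_base)     # the base that cannot start on amd64
print(overridden_base)  # the base the upgrade tier should use for this PR
```

With the override set, the upgrade tier exercises 3.0.1 → 3.1.0 and never deploys the 3.0.0 base.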
🌻 cc-ci: plausible@270c8404 ❌ failure
!testme
🌻 cc-ci: plausible@270c8404 ✅ passed
!testme
🌻 cc-ci: plausible@270c8404 ✅ passed