fix(db): make pg_upgrade migration idempotent & crash-safe #3
Makes the postgres pg_upgrade major-version migration idempotent and crash-safe: a state-driven guard on the old_data/new_data scratch dirs replaces the marker file, so an interrupted migration auto-recovers (empty leftovers) or stops cleanly for manual recovery (non-empty) instead of crash-looping on "mkdir: File exists" or silently re-initdb-ing over live data. Bumps DB_ENTRYPOINT_VERSION v1->v2 so swarm reloads the config.

The postgres major-version migration in the db entrypoint was not safe to re-run. If the container was killed mid-migration it could crash-loop forever ("mkdir: cannot create directory .../old_data: File exists") or silently initdb a fresh empty cluster over the live data once PG_VERSION had been moved out of $PGDATA but before the in-progress marker was written. Replace the marker file with a state-driven guard keyed on the scratch dirs: empty old_data/new_data means the run was interrupted before any data moved, so discard and retry (idempotent); non-empty means data may only live there, so stop for manual recovery. Bump DB_ENTRYPOINT_VERSION v1->v2 so swarm picks up the new (immutable) config.

Tested on cctest by running the new entrypoint in pgvector/pgvector:pg17 against a real seeded postgres 13 cluster:
(1) full 13→17 migration completes with data intact;
(2) a stuck deploy with empty old_data/new_data auto-recovers and migrates;
(3) a non-empty scratch dir exits FATAL without deleting the data.

Added commit 57f5ee2: pg_upgrade was hardcoded to -U $POSTGRES_USER (discourse), which fails the "database user is the install user" check on clusters whose bootstrap superuser is postgres with discourse as a separate app role. It now detects the old cluster's real install user (briefly starts it and reads pg_roles where oid = 10) and uses that for both the new cluster's initdb and pg_upgrade -U. Verified on cctest against a prod-like v13 cluster (superuser postgres, non-superuser discourse role): 13→17 completes, data intact.

!testme
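For readers skimming the thread, the scratch-dir guard described in the PR description could look roughly like the sketch below. This is not the entrypoint code from the PR: the function names classify_scratch and guard_scratch, the state labels, and the FATAL message text are all made up for illustration. Only the decision rule (absent dirs mean a clean start, empty leftovers are discarded and retried, non-empty leftovers stop the run) comes from the description.

```shell
# Hypothetical helper (name is an assumption, not from the PR).
# Classifies the scratch dirs passed as arguments and prints one of:
#   clean  - no leftovers, nothing was interrupted
#   retry  - only empty leftovers, safe to discard and run again
#   manual - a leftover has content, data may only live there
classify_scratch() {
    state=clean
    for dir in "$@"; do
        [ -e "$dir" ] || continue            # absent: nothing left over
        if [ -d "$dir" ] && [ -z "$(ls -A "$dir")" ]; then
            state=retry                       # empty leftover dir
        else
            echo manual                       # non-empty (or not a dir)
            return 0
        fi
    done
    echo "$state"
}

# Hypothetical guard: removes empty leftovers, refuses to touch anything else.
# Returns 0 when the migration may (re)start, 1 when a human must step in.
guard_scratch() {
    case "$(classify_scratch "$@")" in
        retry)
            for dir in "$@"; do
                # rmdir only succeeds on empty dirs, so data is never deleted.
                if [ -d "$dir" ]; then rmdir "$dir"; fi
            done
            ;;
        manual)
            echo "FATAL: scratch dir not empty, manual recovery needed" >&2
            return 1
            ;;
    esac
}
```

A caller would presumably run something like guard_scratch "$PGDATA/old_data" "$PGDATA/new_data" before creating the scratch dirs; the exact paths are an assumption, since the thread does not show where the entrypoint keeps them. Using rmdir rather than rm -rf keeps the cleanup safe even if a dir gains content between the check and the removal.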
🌻 cc-ci: discourse@bd5f1817 ✅ passed
bd5f181737 to 33add86dd3
33add86dd3 to 5d71fc560d
5d71fc560d to 6ae2d2cf51
6ae2d2cf51 to a9f08eed28

Auto-closed by cc-ci canonical sweep: its changes are already in upstream main (merged upstream); mirror main re-synced.
Pull request closed