1 Commits

Author SHA1 Message Date
5d71fc560d fix(db): make the postgres major-version migration safe and correct
The in-place pg_upgrade in the db entrypoint could crash-loop or fail on real
clusters. This reworks it:

- Idempotent, crash-safe: replace the fragile migration_in_progress marker with
  a state-driven guard on the old_data/new_data scratch dirs. An empty leftover
  means a run was interrupted before any data moved (data still intact at
  $PGDATA) so it is discarded and retried; a non-empty one means data may live
  only there, so it stops for manual recovery. Removes both the
  "mkdir: File exists" crash-loop and the silent fresh-initdb-over-live-data
  window.

- Correct install user: pg_upgrade must run as the old cluster's bootstrap
  superuser (oid 10), and the new cluster must be initialised with that same
  user. It is not necessarily $POSTGRES_USER (clusters created with the default
  "postgres" superuser plus a separate app role are common). Detect it from the
  old cluster (briefly start it and read pg_roles where oid = 10) and use it for
  both the new cluster's initdb and the pg_upgrade -U argument.

- Bump DB_ENTRYPOINT_VERSION v1->v2 so swarm reloads the (immutable) config.
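
The state-driven guard from the first bullet could look roughly like the sketch below. It is illustrative only and is not the entrypoint's actual code: the function names `scratch_state` and `prepare_scratch` are hypothetical, and the scratch directories are passed as arguments because their real location is not shown here.

```shell
#!/bin/sh
# Hypothetical helper: classify a scratch directory left by an earlier run.
# Prints "absent", "empty", or "nonempty".
scratch_state() {
    if [ ! -d "$1" ]; then
        echo absent
    elif [ -z "$(ls -A "$1" 2>/dev/null)" ]; then
        echo empty
    else
        echo nonempty
    fi
}

# Hypothetical guard: returns 0 if a migration may (re)start,
# 1 if a scratch directory still holds data and needs manual recovery.
prepare_scratch() {
    # First pass: refuse before touching anything if any directory has content.
    for scratch in "$@"; do
        if [ "$(scratch_state "$scratch")" = "nonempty" ]; then
            echo "refusing to continue: $scratch is not empty; recover manually" >&2
            return 1
        fi
    done
    # Second pass: empty leftovers mean no data moved yet, so discard them.
    for scratch in "$@"; do
        if [ "$(scratch_state "$scratch")" = "empty" ]; then
            rmdir "$scratch"
        fi
    done
    return 0
}
```

An entrypoint might call something like `prepare_scratch "$old_data" "$new_data" || exit 1` before creating the directories again, so that a retry after an interruption never hits "mkdir: File exists" and never runs initdb over data that has already moved.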

Verified on cctest: clean 13->17, interrupted-then-retried, and prod-like
clusters whose install user is "postgres" with a separate "discourse" app role.
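
For the install-user detection, a minimal sketch under stated assumptions: the old cluster is already running and reachable over the local socket, a database named "postgres" exists, and the role passed as the first argument can log in. The function name `detect_install_user` is hypothetical; how the entrypoint actually starts the old cluster and connects to it is not shown here.

```shell
#!/bin/sh
# Hypothetical helper: print the name of the bootstrap superuser (oid 10)
# of the cluster that psql connects to.
# $1 is a role that is allowed to connect (an assumption, e.g. $POSTGRES_USER).
detect_install_user() {
    # -A: unaligned output, -t: rows only, so just the role name is printed.
    psql -U "$1" -d postgres -At \
        -c "SELECT rolname FROM pg_roles WHERE oid = 10"
}

# Intended use (not executed here): feed the detected name to both tools.
#   install_user=$(detect_install_user "$POSTGRES_USER")
#   initdb -U "$install_user" ...        # new cluster
#   pg_upgrade -U "$install_user" ...    # migration itself
```

On a cluster created with the default superuser and a separate "discourse" app role, the query would be expected to return "postgres" rather than the app role.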
2026-06-16 18:26:37 +00:00
2 changed files with 3 additions and 3 deletions

@@ -1,2 +1,2 @@
-export DB_ENTRYPOINT_VERSION=v3
+export DB_ENTRYPOINT_VERSION=v2
 export PG_BACKUP_VERSION=v2

@@ -43,7 +43,7 @@ services:
 #- "traefik.http.routers.${STACK_NAME}.middlewares=${STACK_NAME}-redirect"
 #- "traefik.http.middlewares.${STACK_NAME}-redirect.headers.SSLForceHost=true"
 #- "traefik.http.middlewares.${STACK_NAME}-redirect.headers.SSLHost=${DOMAIN}"
-- "coop-cloud.${STACK_NAME}.version=0.8.1+3.5.0"
+- "coop-cloud.${STACK_NAME}.version=0.8.0+3.5.0"
 healthcheck:
 test: "ruby -e \"require 'uri'; require 'net/http'; uri = URI('http://localhost:3000/srv/status'); res = Net::HTTP.get_response(uri); if res.is_a?(Net::HTTPSuccess) then exit (0) else exit (1) end\""
 interval: 30s
@@ -80,7 +80,7 @@ services:
 backupbot.restore.post-hook: "/pg_backup.sh restore"
 redis:
-image: redis:8.8-alpine
+image: redis:7.4-alpine
 networks:
 - internal
 volumes: