journal+decisions(2): ghost migration-lock deadlock root cause + healthcheck-overlay fix + abra +U chaos-version normalization

2026-05-30 05:54:54 +01:00
parent 1570ccb698
commit 8ff5ad246a
2 changed files with 62 additions and 0 deletions

@@ -932,3 +932,32 @@ script, `backup` = pg_dump|gzip → backup.sql, `backupbot.backup.volumes.postgr
(archive only the dump), `restore` post-hook = terminate connections + DROP DATABASE FORCE + createdb
+ reimport. postgres:15 plain dump → no special handling (mechanism already proven generic on immich).
Validated: `RECIPE=mattermost-lts PR=1` full lifecycle GREEN, restore tier PASSES (ci_marker survives).
## 2026-05-30 — ghost MySQL cold-boot: healthcheck start_period overlay + abra +U marker normalization
**Context (Phase 2 Q4.4 ghost).** The current ghost recipe (1.x+6.21.2) uses **MySQL** (not sqlite).
Its fresh-DB first boot runs a ~6-9min schema migration (dozens of CREATE TABLEs, each a separate
MySQL round-trip → round-trip-bound, NOT CPU-bound; reproduced on both 2- and 4-vCPU hosts). The upstream
recipe healthcheck `start_period:1m` (+10×30s ≈ 6min grace) kills the still-migrating task, leaving a
stale `migrations_lock` behind → every later task deadlocks on it (`MigrationsAreLockedError`).
**Decision 1 — cc-ci deploy overlay for the healthcheck (NOT a recipe change, NOT a test change).**
`tests/ghost/compose.ccci-health.yml` raises ONLY the app healthcheck `start_period` to 900s, applied
via `recipe_meta.EXTRA_ENV COMPOSE_FILE=compose.yml:compose.ccci-health.yml` + `install_steps.sh`
(copies the overlay into the recipe checkout) + `CHAOS_BASE_DEPLOY=True` (the untracked overlay trips
abra's pinned clean-tree check). Rationale: failures are ignored during start_period but a PASS still
marks healthy immediately, so the fresh migration finishes + releases the lock with no other change;
the app's real healthcheck still gates readiness — no assertion weakened. Only the install tier's
fresh migration needs it (the upgrade redeploys on the populated DB → fast boot). The overlay is an
untracked file, so `git checkout -f` (upgrade re-checkout) preserves it → COMPOSE_FILE keeps
resolving across install AND upgrade. This is the general pattern for any recipe whose upstream
healthcheck is too tight for its own first-boot migration on cc-ci infra.
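
The rationale above can be checked with a rough timeline model. This is a minimal sketch, not the Docker implementation: it assumes the "10×30s" in the context note means `retries: 10` at `interval: 30s`, treats each probe as instantaneous, and uses a hypothetical `simulate` helper with a hypothetical `ready_at_s` (the moment the migration finishes and the probe would first pass).

```python
def simulate(ready_at_s, start_period_s, interval_s=30, retries=10, horizon_s=3600):
    """Simplified healthcheck timeline; returns (state, seconds_since_start).

    Assumed semantics: probes fire every interval_s; a failing probe inside
    start_period is not counted; a passing probe marks the task healthy at
    once; `retries` counted failures mark it unhealthy.
    """
    failures = 0
    t = interval_s
    while t <= horizon_s:
        if t >= ready_at_s:
            # First passing probe: healthy immediately, even inside start_period.
            return ("healthy", t)
        if t > start_period_s:
            # Failures only count once the start period has elapsed.
            failures += 1
            if failures >= retries:
                return ("unhealthy", t)
        t += interval_s
    return ("starting", horizon_s)


# Upstream start_period of 1m against an 8-minute migration: killed at ~6 min.
print(simulate(ready_at_s=480, start_period_s=60))
# Overlay start_period of 900s: same migration finishes and goes healthy.
print(simulate(ready_at_s=480, start_period_s=900))
```

Under these assumptions the 900s value costs nothing on fast boots: a task that passes its first probe is marked healthy at that probe regardless of how much start period remains.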
**Decision 2 — assert_upgraded normalizes abra's `+U` working-tree marker (shared harness).**
A cc-ci overlay sitting in the recipe checkout as an untracked file makes abra stamp
`coop-cloud.<stack>.chaos-version='<commit>+U'` (U=untracked). The COMMIT still equals head_ref (HC1
satisfied) but the `+U` suffix broke `assert_upgraded`'s exact-prefix match → spurious upgrade FAIL.
Fix: strip the `+...` working-tree-state marker before the commit match (`chaos.split('+',1)[0]`).
HC1 is preserved — the underlying commit must still equal head_ref; a stale prev-checkout chaos
redeploy stamps prev's commit (also `+U` if overlaid) and still won't match. General: every future
cc-ci overlay recipe (untracked overlay + CHAOS_BASE_DEPLOY) would otherwise hit this.
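
A minimal sketch of the normalization in Decision 2. Only the `chaos.split('+',1)[0]` expression comes from the fix itself; the function names, the sample commit ids, and the assumption that the label carries an abbreviated commit that must prefix `head_ref` are illustrative, and the real `assert_upgraded` may compare differently.

```python
def normalize_chaos_version(chaos: str) -> str:
    """Drop abra's working-tree-state marker, e.g. 'abc1234+U' -> 'abc1234'."""
    return chaos.split("+", 1)[0]


def commit_matches_head(chaos: str, head_ref: str) -> bool:
    """Hypothetical HC1-style check: the stamped commit must still be head_ref.

    Assumes the label holds a (possibly abbreviated) commit id that is a
    prefix of the full head_ref.
    """
    commit = normalize_chaos_version(chaos)
    # An empty commit would prefix-match anything, so reject it explicitly.
    return bool(commit) and head_ref.startswith(commit)


head_ref = "abc1234def"  # hypothetical commit id
print(commit_matches_head("abc1234+U", head_ref))  # overlay present, same commit
print(commit_matches_head("0dd5678+U", head_ref))  # stale prev commit, still rejected
```

Stripping only the suffix keeps the check as strict as before on the commit itself; the marker is discarded, never the commit comparison.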