decisions(2): ghost P4 restore dead-end + root cause (abra backup intermittently omits mysql volume; restore post-hook silent no-op); fix plan

autonomic-bot
2026-05-30 20:52:19 +00:00
parent 1aca09d4db
commit b9b7293298

@@ -1054,3 +1054,43 @@ sub-plan's own lasuite-drive collabora "start_period [KEYSTONE]" recipe-PR.
- **discourse**: recipe-PR `recipe-maintainers/discourse#1` sets `start_period: 20m` (covers the
15-25min Rails first-boot; default was 5m). cc-ci recipe_meta no longer sets APP_START_PERIOD.
- **ghost (E1)**: must use the SAME literal-bump approach, NOT an env var (same abra limitation).
## 2026-05-30 — ghost P4 restore: 3rd-failure DEAD-END + root cause (backup omits mysql volume)
**Dead-end (stop per §guardrail, 3 identical fails):** ghost full5/full6/full7 (REF=ae43ffe, db-grace
overlay db@15m start_period) ALL failed P4 restore — `ci_marker` table absent post-restore — after
full3 (app-only overlay, db@native 1m) passed it once. Re-running unchanged is futile.
**Definitive root cause (restic snapshot inspection, repo /backups/restic in backupbot container):**
`abra app backup create` is INTERMITTENTLY OMITTING the mysql volume from the snapshot.
- full5 backup snapshot `b6200e44`: contains `…_mysql/_data/backup.sql.gz` ✓
- full6 `7daac418` + full7 `410a63b9` (latest): contain ONLY `…_ghost_content/_data` + `/secrets` —
**no mysql volume / no backup.sql.gz**.
`abra app restore` restores the LATEST snapshot → for full6/7 that snapshot has no db dump →
`/var/lib/mysql/backup.sql.gz` absent at restore → the recipe restore post-hook
`gunzip -c …/backup.sql.gz | mysql -u root` reads nothing. mysql_backup.sh has `set -e` but NOT
`set -o pipefail`, so the pipeline's exit status is mysql's alone: gunzip fails on the missing file,
mysql gets empty stdin and exits 0 → restore "succeeds" SILENTLY while reimporting nothing → the
dropped ci_marker never returns. (cc-ci P4 correctly caught a real data-loss path; the generic
test_restore_healthy passed because it only checks that the app is up, which masked it. This is
exactly why P4 exists.)
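The exit-status behaviour described above can be reproduced in isolation. This is a minimal sketch, not the recipe's script: `cat >/dev/null` stands in for `mysql -u root` (an assumption, chosen because it also accepts empty stdin and exits 0), and the dump path is a temporary one that deliberately does not exist.

```shell
#!/usr/bin/env bash
# Stand-in demo: `cat >/dev/null` plays the role of `mysql -u root`.
tmpdir="$(mktemp -d)"
missing="$tmpdir/backup.sql.gz"   # deliberately never created

# Without pipefail: the pipeline status is that of the LAST command only,
# so the gunzip failure is swallowed even though `set -e` is active.
bash -c "set -e; gunzip -c '$missing' 2>/dev/null | cat >/dev/null"
rc_without=$?

# With pipefail: any failing stage makes the whole pipeline fail.
bash -c "set -e; set -o pipefail; gunzip -c '$missing' 2>/dev/null | cat >/dev/null"
rc_with=$?

echo "without pipefail: rc=$rc_without"
echo "with pipefail:    rc=$rc_with"
rm -rf "$tmpdir"
```

The first invocation reports success despite the missing file; only the second surfaces the failure to the caller.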
**Two distinct defects, both in the ghost recipe-PR (recipe-maintainers/ghost#1), NOT cc-ci tests:**
1. **Unreliable backup capture.** The db service uses `backupbot.backup.volumes.mysql.path: backup.sql.gz`
— a schema whose volume capture is intermittent here (worked full5, silently skipped full6/7;
appears to correlate with the db-grace overlay disturbing db-container settle timing around the
upgrade chaos-redeploy + backup; unconfirmed, see fix plan (a)). The PROVEN pattern in this project is mattermost-lts#1's
`backupbot.backup.path: "/var/lib/postgresql/data/"` (container-path schema, pre-hook dumps into the
dir, post-hook rm's the dump, physical restore) — never intermittent.
2. **Silent no-op restore.** mysql_backup.sh restore lacks `set -o pipefail` + a backup-file existence
guard, so a missing/empty dump silently restores nothing instead of failing loud.
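One possible shape for the restore guard in defect 2, as a sketch only: the function name `restore_dump` and the `DUMP` / `MYSQL_CMD` variables are hypothetical and are not taken from the actual mysql_backup.sh; the default path is the one quoted above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical knobs; the real script may name and wire these differently.
DUMP="${DUMP:-/var/lib/mysql/backup.sql.gz}"
MYSQL_CMD="${MYSQL_CMD:-mysql -u root}"

restore_dump() {
  local dump="$1"
  # -s is true only if the file exists AND is larger than zero bytes.
  if [ ! -s "$dump" ]; then
    echo "restore: $dump missing or empty, refusing to continue" >&2
    return 1
  fi
  # gzip -t checks archive integrity without writing any output.
  if ! gzip -t "$dump" 2>/dev/null; then
    echo "restore: $dump is not a valid gzip archive" >&2
    return 1
  fi
  # With pipefail set above, a gunzip failure fails the whole pipeline.
  # MYSQL_CMD is intentionally unquoted so its arguments are word-split.
  gunzip -c "$dump" | $MYSQL_CMD
}
```

The script's restore branch would then call `restore_dump "$DUMP"`. Note that a valid gzip of zero bytes of SQL still passes both checks, so a stricter guard on the decompressed size may be worth adding.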
**Fix plan (next tick, methodical — not a blind re-run):**
(a) Re-run ONCE with a FIXED diagnostic watcher that samples db `.State.Health.Status` + the snapshot
volume list DURING the backup tier, to confirm the "db unsettled at backup → volume skipped" link.
(b) Recipe-PR ghost#1: harden mysql_backup.sh (`set -o pipefail`; restore fails loud if backup.sql.gz
missing/empty) AND switch the db backup to the proven mattermost `backupbot.backup.path` schema so
the dump is reliably archived+restored. Keep logical dump+reimport (physical InnoDB copy of a live
data dir is consistency-risky). Then full ghost run green incl upgrade-to-latest → claim.
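For step (a), the watcher could look roughly like the sketch below. It is an assumption about shape, not the harness's actual watcher: the function names, the overridable `DOCKER` variable, and the sampling parameters are hypothetical. It relies only on `docker inspect --format` and on a snapshot file listing piped in on stdin (for example from `restic ls <snapshot-id>`, run wherever the repo is configured).

```shell
#!/usr/bin/env bash
# Diagnostic sketch. DOCKER is overridable so the sampler can be exercised
# without a running daemon; container names are placeholders.

sample_health() {
  # Print one timestamped health sample for the given container.
  local container="$1" status
  status="$("${DOCKER:-docker}" inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)" \
    || status="unknown"
  printf '%s %s health=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$container" "$status"
}

snapshot_has_dump() {
  # Read a snapshot file listing on stdin; succeed only if the mysql
  # dump path is listed.
  grep -q '_mysql/_data/backup\.sql\.gz$'
}

watch_health() {
  # Sample health COUNT times, INTERVAL seconds apart (default 5).
  local container="$1" count="$2" interval="${3:-5}" i
  for ((i = 0; i < count; i++)); do
    sample_health "$container"
    sleep "$interval"
  done
}
```

Running `watch_health <db-container> 60 5` in the background during the backup tier, then piping the new snapshot's listing into `snapshot_has_dump`, would put the health timeline next to the presence or absence of the dump.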
NB also flagged: snapshots accumulate across runs under the SAME deterministic domain hash
(ghos-9431a1) — restore picks the global LATEST, which can be a prior run's; teardown removes app+
volumes but not the restic repo. Confirm the harness restore targets THIS run's snapshot (likely fine
since each run's latest is its own backup, but worth a harness check).