decisions(2): ghost P4 restore dead-end + root cause (abra backup intermittently omits mysql volume; restore post-hook silent no-op); fix plan

2026-05-30 20:52:19 +00:00
parent 1aca09d4db
commit b9b7293298
1 changed files with 40 additions and 0 deletions
--- a/machine-docs/DECISIONS.md
+++ b/machine-docs/DECISIONS.md
@@ -1054,3 +1054,43 @@ sub-plan's own lasuite-drive collabora "start_period [KEYSTONE]" recipe-PR.
 - **discourse**: recipe-PR `recipe-maintainers/discourse#1` sets `start_period: 20m` (covers the
  15-25min Rails first-boot; default was 5m). cc-ci recipe_meta no longer sets APP_START_PERIOD.
 - **ghost (E1)**: must use the SAME literal-bump approach, NOT an env var (same abra limitation).
+
+## 2026-05-30 — ghost P4 restore: 3rd-failure DEAD-END + root cause (backup omits mysql volume)
+
+**Dead-end (stop per §guardrail, 3 identical fails):** ghost full5/full6/full7 (REF=ae43ffe, db-grace
+overlay db@15m start_period) ALL failed P4 restore — `ci_marker` table absent post-restore — after
+full3 (app-only overlay, db@native 1m) passed it once. Re-running unchanged is futile.
+
+**Definitive root cause (restic snapshot inspection, repo /backups/restic in backupbot container):**
+`abra app backup create` is INTERMITTENTLY OMITTING the mysql volume from the snapshot.
+- full5 backup snapshot `b6200e44`: contains `…_mysql/_data/backup.sql.gz` ✓
+- full6 `7daac418` + full7 `410a63b9` (latest): contain ONLY `…_ghost_content/_data` + `/secrets` —
+  **no mysql volume / no backup.sql.gz**.
+`abra app restore` restores the LATEST snapshot → for full6/7 that snapshot has no db dump →
+`/var/lib/mysql/backup.sql.gz` absent at restore → the recipe restore post-hook
+`gunzip -c …/backup.sql.gz | mysql -u root` reads nothing. mysql_backup.sh has `set -e` but NOT
+`set -o pipefail`, so only mysql's exit status counts: gunzip's failure is ignored, mysql reads empty input
+and exits 0 → restore "succeeds" SILENTLY while reimporting nothing → dropped ci_marker never returns.
+(cc-ci P4 correctly caught a real data-loss path; generic test_restore_healthy passed on app-up alone, masking it: exactly why P4 exists.)
+
+**Two distinct defects, both in the ghost recipe-PR (recipe-maintainers/ghost#1), NOT cc-ci tests:**
+1. **Unreliable backup capture.** The db service uses `backupbot.backup.volumes.mysql.path: backup.sql.gz`
+   — a schema whose volume capture is intermittent here (worked full5, silently skipped full6/7;
+   correlates with the db-grace overlay disturbing db-container settle timing around the upgrade
+   chaos-redeploy + backup). The PROVEN pattern in this project is mattermost-lts#1's
+   `backupbot.backup.path: "/var/lib/postgresql/data/"` (container-path schema, pre-hook dumps into the
+   dir, post-hook removes the dump, physical restore), never observed to be intermittent.
+2. **Silent no-op restore.** mysql_backup.sh restore lacks `set -o pipefail` + a backup-file existence
+   guard, so a missing/empty dump silently restores nothing instead of failing loud.
+
+**Fix plan (next tick, methodical — not a blind re-run):**
+(a) Re-run ONCE with a FIXED diagnostic watcher that samples db `.State.Health.Status` + the snapshot
+    volume list DURING the backup tier, to confirm the "db unsettled at backup → volume skipped" link.
+(b) Recipe-PR ghost#1: harden mysql_backup.sh (`set -o pipefail`; restore fails loud if backup.sql.gz
+    missing/empty) AND switch the db backup to the proven mattermost `backupbot.backup.path` schema so
+    the dump is reliably archived+restored. Keep logical dump+reimport (physical InnoDB copy of a live
+    data dir is consistency-risky). Then full ghost run green incl upgrade-to-latest → claim.
+NB also flagged: snapshots accumulate across runs under the SAME deterministic domain hash
+(ghos-9431a1), and restore picks the global LATEST, which can be a prior run's; teardown removes the
+app and its volumes but not the restic repo. Confirm the harness restore targets THIS run's snapshot
+(likely fine since each run's latest is its own backup, but worth a harness check).
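
The fail-loud restore called for in defect 2 and fix step (b) could look roughly like the sketch below. It assumes bash and is not the actual mysql_backup.sh from ghost#1: `restore_dump` is a hypothetical helper name, and the import command is passed as arguments so that `mysql -u root` can be swapped for any stdin reader when trying it out.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the real mysql_backup.sh; restore_dump is a hypothetical name.
# pipefail makes a pipeline fail if ANY stage fails, not just the last one.
set -euo pipefail

# Usage: restore_dump DUMP_FILE IMPORT_CMD [ARGS...]
# IMPORT_CMD reads SQL on stdin (for example: mysql -u root).
restore_dump() {
  local dump="$1"
  shift

  # -s is true only if the file exists AND is non-empty,
  # so a snapshot without the dump fails here instead of restoring nothing.
  if [ ! -s "$dump" ]; then
    echo "restore: $dump missing or empty, aborting" >&2
    return 1
  fi

  # Check gzip integrity before anything is fed to the importer.
  if ! gzip -t "$dump" 2>/dev/null; then
    echo "restore: $dump is not a valid gzip archive, aborting" >&2
    return 1
  fi

  # With pipefail set above, a gunzip error fails the whole pipeline
  # even if the importer itself exits 0 on empty input.
  gunzip -c "$dump" | "$@"
}
```

In a restore post-hook this might be called as `restore_dump /var/lib/mysql/backup.sql.gz mysql -u root`; a missing, empty, or corrupt dump then returns non-zero rather than letting the restore report success.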