From b9b729329845f8aba1fed8677cb687958f915dea Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Sat, 30 May 2026 20:52:19 +0000
Subject: [PATCH] decisions(2): ghost P4 restore dead-end + root cause (abra backup intermittently omits mysql volume; restore post-hook silent no-op); fix plan

---
 machine-docs/DECISIONS.md | 40 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/machine-docs/DECISIONS.md b/machine-docs/DECISIONS.md
index 6c4375e..3e17996 100644
--- a/machine-docs/DECISIONS.md
+++ b/machine-docs/DECISIONS.md
@@ -1054,3 +1054,43 @@ sub-plan's own lasuite-drive collabora "start_period [KEYSTONE]" recipe-PR.
 - **discourse**: recipe-PR `recipe-maintainers/discourse#1` sets `start_period: 20m` (covers the
   15-25min Rails first-boot; default was 5m). cc-ci recipe_meta no longer sets APP_START_PERIOD.
 - **ghost (E1)**: must use the SAME literal-bump approach, NOT an env var (same abra limitation).
+
+## 2026-05-30 — ghost P4 restore: 3rd-failure DEAD-END + root cause (backup omits mysql volume)
+
+**Dead-end (stop per §guardrail, 3 identical fails):** ghost full5/full6/full7 (REF=ae43ffe, db-grace
+overlay db@15m start_period) ALL failed P4 restore — `ci_marker` table absent post-restore — after
+full3 (app-only overlay, db@native 1m) passed it once. Re-running unchanged is futile.
+
+**Definitive root cause (restic snapshot inspection, repo /backups/restic in backupbot container):**
+`abra app backup create` is INTERMITTENTLY OMITTING the mysql volume from the snapshot.
+- full5 backup snapshot `b6200e44`: contains `…_mysql/_data/backup.sql.gz` ✓
+- full6 `7daac418` + full7 `410a63b9` (latest): contain ONLY `…_ghost_content/_data` + `/secrets` —
+  **no mysql volume / no backup.sql.gz**.
+`abra app restore` restores the LATEST snapshot → for full6/7 that snapshot has no db dump →
+`/var/lib/mysql/backup.sql.gz` absent at restore → the recipe restore post-hook
+`gunzip -c …/backup.sql.gz | mysql -u root` reads nothing. mysql_backup.sh has `set -e` but NOT
+`set -o pipefail`, so a failed/empty gunzip pipes empty input to mysql (rc 0) → restore "succeeds"
+SILENTLY while reimporting nothing → dropped ci_marker never returns. (cc-ci P4 correctly caught a
+real data-loss path; generic test_restore_healthy passed = app up, masking it — exactly why P4 exists.)
+
+**Two distinct defects, both in the ghost recipe-PR (recipe-maintainers/ghost#1), NOT cc-ci tests:**
+1. **Unreliable backup capture.** The db service uses `backupbot.backup.volumes.mysql.path: backup.sql.gz`
+   — a schema whose volume capture is intermittent here (worked full5, silently skipped full6/7;
+   correlates with the db-grace overlay disturbing db-container settle timing around the upgrade
+   chaos-redeploy + backup). The PROVEN pattern in this project is mattermost-lts#1's
+   `backupbot.backup.path: "/var/lib/postgresql/data/"` (container-path schema, pre-hook dumps into the
+   dir, post-hook rm's the dump, physical restore) — never intermittent.
+2. **Silent no-op restore.** mysql_backup.sh restore lacks `set -o pipefail` + a backup-file existence
+   guard, so a missing/empty dump silently restores nothing instead of failing loud.
+
+**Fix plan (next tick, methodical — not a blind re-run):**
+(a) Re-run ONCE with a FIXED diagnostic watcher that samples db `.State.Health.Status` + the snapshot
+    volume list DURING the backup tier, to confirm the "db unsettled at backup → volume skipped" link.
+(b) Recipe-PR ghost#1: harden mysql_backup.sh (`set -o pipefail`; restore fails loud if backup.sql.gz
+    missing/empty) AND switch the db backup to the proven mattermost `backupbot.backup.path` schema so
+    the dump is reliably archived+restored. Keep logical dump+reimport (physical InnoDB copy of a live
+    data dir is consistency-risky). Then full ghost run green incl upgrade-to-latest → claim.
+NB also flagged: snapshots accumulate across runs under the SAME deterministic domain hash
+(ghos-9431a1) — restore picks the global LATEST, which can be a prior run's; teardown removes app+
+volumes but not the restic repo. Confirm the harness restore targets THIS run's snapshot (likely fine
+since each run's latest is its own backup, but worth a harness check).
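
Note on the silent no-op in the root-cause paragraph: a minimal sketch of why a `gunzip | consumer` pipeline can report success when the dump is absent. This is an illustration, not the recipe's script. `cat >/dev/null` stands in for `mysql -u root` (assuming, as the entry states, that mysql exits 0 on empty input), and the function names and temp path are hypothetical.

```shell
#!/usr/bin/env bash
# Demo: exit status of a decompress | consume pipeline, with and without pipefail.
# `cat >/dev/null` is a stand-in for `mysql -u root` (assumption for illustration).

# A path that is guaranteed not to exist, mimicking the missing dump.
missing="$(mktemp -d)/backup.sql.gz"

# Prints the pipeline's exit status when only the last command counts.
status_without_pipefail() {
  ( set +o pipefail; gunzip -c "$missing" 2>/dev/null | cat >/dev/null; echo $? )
}

# Prints the pipeline's exit status when any failing stage counts.
status_with_pipefail() {
  ( set -o pipefail; gunzip -c "$missing" 2>/dev/null | cat >/dev/null; echo $? )
}

echo "without pipefail: $(status_without_pipefail)"
echo "with pipefail:    $(status_with_pipefail)"
```

Without `pipefail` the status is that of the last stage, so `set -e` alone never sees the failed gunzip; with `pipefail` the gunzip failure propagates and the status is non-zero.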
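
Sketch for fix-plan item (b), the fail-loud restore: one possible shape for the guard, under stated assumptions. This is not the actual mysql_backup.sh. `check_dump`, `restore`, and the `DUMP` variable are hypothetical names, and the `gzip -t` integrity check goes slightly beyond the "missing/empty" condition in the plan.

```shell
#!/usr/bin/env bash
# Sketch of a fail-loud restore step; NOT the recipe's real mysql_backup.sh.
set -euo pipefail

# Hypothetical default taken from the path named in the entry; override via env.
DUMP="${DUMP:-/var/lib/mysql/backup.sql.gz}"

# Fail unless the dump exists, is non-empty, and passes a gzip integrity test.
check_dump() {
  local dump="$1"
  if [ ! -s "$dump" ]; then
    echo "restore: dump missing or empty: $dump" >&2
    return 1
  fi
  if ! gzip -t "$dump" 2>/dev/null; then
    echo "restore: dump is not valid gzip: $dump" >&2
    return 1
  fi
}

restore() {
  check_dump "$DUMP" || return 1
  # With pipefail set above, a gunzip failure fails the whole pipeline
  # even if mysql itself exits 0.
  gunzip -c "$DUMP" | mysql -u root
}

# Only run the restore when invoked as: <script> restore
if [ "${1:-}" = "restore" ]; then
  restore
fi
```

The guard runs before the pipeline so that a missing dump produces an explicit error message rather than relying on gunzip's exit status alone.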
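
Sketch for the snapshot inspection and the diagnostic watcher in (a): a small filter that decides from a snapshot file listing whether the db dump was captured. The function name is hypothetical, and the commented usage assumes the restic repository and password are already configured in the backupbot container's environment.

```shell
#!/usr/bin/env bash
# Sketch: check a snapshot file listing for the db dump.
set -euo pipefail

# Reads a file listing on stdin (one path per line) and succeeds only if some
# path ends in /backup.sql.gz. Deliberately not `grep -q`: reading all input
# avoids an early exit that could break the upstream writer under pipefail.
listing_has_dump() {
  grep -E '/backup\.sql\.gz$' >/dev/null
}

# Assumed usage inside the backupbot container:
#   restic ls latest | listing_has_dump \
#     || { echo "latest snapshot has no db dump" >&2; exit 1; }
```

Running such a check right after the backup tier would surface a skipped mysql volume before the restore step consumes the snapshot.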