journal(2): ghost full8 instrumented — DEFINITIVE root cause = db container cycled by backup op, racing backupbot volume capture (not OOM/not-healthcheck); next: read backupbot backup flow

2026-05-30 21:17:44 +00:00
parent 427b8ff8c7
commit ad7b3d0e8c
1 changed files with 27 additions and 0 deletions
--- a/machine-docs/JOURNAL-2.md
+++ b/machine-docs/JOURNAL-2.md
@ -1407,3 +1407,30 @@ So the reimport post-hook is configured + present yet ci_marker doesn't return O
 start_period=15m. Mechanism unclear by reasoning (start_period shouldn't keep a ready mysql
 "starting"). Next: full7 with the restore tier WATCHED LIVE — db health state, `abra app restore`
 output, backup.sql.gz presence, ci_marker immediately post-restore — to get the actual mechanism.
+
+## 2026-05-30T21:15Z — ghost full8 INSTRUMENTED: DEFINITIVE root cause = db container cycles DURING backup op
+
+Fixed the diag watcher (NixOS has no /bin/bash → must `bash gdw3.sh`) and captured db state every 4s
+through full8's backup+restore tiers (`/root/ghost-diag8.log`). Decisive timeline (backup op):
+  21:08:43–51 db cid=93865743 repl=1/1 healthy
+  21:08:58    db cid=93865743 repl=1/1 **unhealthy**
+  21:09:03    repl=0/1 cid= (container GONE)
+  21:09:07–19 repl=0/1 **starting**, NEW cid=784ec680
+  21:09:24→   repl=1/1 healthy (new cid), stays healthy through the whole restore tier
+So the db container is REPLACED during the BACKUP op (abra app backup create), well before the restore
+tier. This races backupbot's volume enumeration (which resolves each volume path from running service
+specs at backup time) → the mysql volume is intermittently omitted from the snapshot (proven earlier:
+full5 snapshot had …_mysql/backup.sql.gz; full6/7 had only ghost_content). Restore then restores a
+dump-less snapshot → reimport reads nothing → silent no-op (hook lacks `set -o pipefail`) → ci_marker lost.
+
+Ruled OUT: not OOM (db has NO mem_limit; host 5.8G free), not healthcheck-timing (base hc retries=10 ×
+interval=30s = 5min to unhealthy — impossible in the observed ~16s window; merged hc keeps test+retries).
+So the cycle is driven BY the backup op, not a crash/healthcheck. The db-grace start_period overlay was
+a RED HERRING for restore (the cycle is past start_period). Likely abra/backup-bot-two stops/restarts
+the db to take a consistent volume snapshot; the omission is a timing race in that flow.
+
+NEXT (precise): read backup-bot-two `/usr/bin/backup` backup flow (does it stop/cycle containers? how
+does it enumerate volumes relative to that?) to confirm the cycle+capture interaction, THEN fix:
+candidate = harness verifies the backup snapshot contains the db volume and retries if not, AND/OR the
+recipe-PR backup is made resilient (+ `set -o pipefail` + fail-loud on missing dump so it can never be
+silent again). 5 ghost runs done (full4 timeout-fixed; full5/6/7/8 restore race) — stop blind re-runs.