journal(2): ghost full8 instrumented — DEFINITIVE root cause = db container cycled by backup op, racing backupbot volume capture (not OOM/not-healthcheck); next: read backupbot backup flow

2026-05-30 21:17:44 +00:00
parent 427b8ff8c7
commit ad7b3d0e8c
1 changed files with 27 additions and 0 deletions
--- a/machine-docs/JOURNAL-2.md
+++ b/machine-docs/JOURNAL-2.md
@ -1407,3 +1407,30 @@ So the reimport post-hook is configured + present yet ci_marker doesn't return O
 start_period=15m. Mechanism unclear by reasoning (start_period shouldn't keep a ready mysql
 "starting"). Next: full7 with the restore tier WATCHED LIVE — db health state, `abra app restore`
 output, backup.sql.gz presence, ci_marker immediately post-restore — to get the actual mechanism.
 ## 2026-05-30T21:15Z — ghost full8 INSTRUMENTED: DEFINITIVE root cause = db container cycles DURING backup op
 Fixed the diag watcher (NixOS has no /bin/bash → must `bash gdw3.sh`) and captured db state every 4s
 through full8's backup+restore tiers (`/root/ghost-diag8.log`). Decisive timeline (backup op):
  21:08:43–51 db cid=93865743 repl=1/1 healthy
  21:08:58    db cid=93865743 repl=1/1 **unhealthy**
  21:09:03    repl=0/1 cid= (container GONE)
  21:09:07–19 repl=0/1 **starting**, NEW cid=784ec680
  21:09:24→   repl=1/1 healthy (new cid), stays healthy through the whole restore tier
 So the db container is REPLACED during the BACKUP op (abra app backup create), well before the restore
 tier. This races backupbot's volume enumeration (which resolves each volume path from running service
 specs at backup time) → the mysql volume is intermittently omitted from the snapshot (proven earlier:
 full5 snapshot had …_mysql/backup.sql.gz; full6/7 had only ghost_content). Restore then restores a
 dump-less snapshot → reimport reads nothing → silent no-op (hook lacks `set -o pipefail`) → ci_marker lost.
 Ruled OUT: not OOM (db has NO mem_limit; host 5.8G free), not healthcheck-timing (base hc retries=10 ×
 interval=30s = 5min to unhealthy — impossible in the observed ~16s window; merged hc keeps test+retries).
 So the cycle is driven BY the backup op, not a crash/healthcheck. The db-grace start_period overlay was
 a RED HERRING for restore (the cycle is past start_period). Likely abra/backup-bot-two stops/restarts
 the db to take a consistent volume snapshot; the omission is a timing race in that flow.
 NEXT (precise): read backup-bot-two `/usr/bin/backup` backup flow (does it stop/cycle containers? how
 does it enumerate volumes relative to that?) to confirm the cycle+capture interaction, THEN fix:
 candidate = harness verifies the backup snapshot contains the db volume and retries if not, AND/OR the
 recipe-PR backup is made resilient (+ `set -o pipefail` + fail-loud on missing dump so it can never be
 silent again). 5 ghost runs done (full4 timeout-fixed; full5/6/7/8 restore race) — stop blind re-runs.