diff --git a/machine-docs/JOURNAL-2.md b/machine-docs/JOURNAL-2.md index 4572bbf..a7e260c 100644 --- a/machine-docs/JOURNAL-2.md +++ b/machine-docs/JOURNAL-2.md @@ -1407,3 +1407,30 @@ So the reimport post-hook is configured + present yet ci_marker doesn't return O start_period=15m. Mechanism unclear by reasoning (start_period shouldn't keep a ready mysql "starting"). Next: full7 with the restore tier WATCHED LIVE — db health state, `abra app restore` output, backup.sql.gz presence, ci_marker immediately post-restore — to get the actual mechanism. + +## 2026-05-30T21:15Z — ghost full8 INSTRUMENTED: DEFINITIVE root cause = db container cycles DURING backup op + +Fixed the diag watcher (NixOS has no /bin/bash → must `bash gdw3.sh`) and captured db state every 4s +through full8's backup+restore tiers (`/root/ghost-diag8.log`). Decisive timeline (backup op): + 21:08:43–51 db cid=93865743 repl=1/1 healthy + 21:08:58 db cid=93865743 repl=1/1 **unhealthy** + 21:09:03 repl=0/1 cid= (container GONE) + 21:09:07–19 repl=0/1 **starting**, NEW cid=784ec680 + 21:09:24→ repl=1/1 healthy (new cid), stays healthy through the whole restore tier +So the db container is REPLACED during the BACKUP op (abra app backup create), well before the restore +tier. This races backupbot's volume enumeration (which resolves each volume path from running service +specs at backup time) → the mysql volume is intermittently omitted from the snapshot (proven earlier: +full5 snapshot had …_mysql/backup.sql.gz; full6/7 had only ghost_content). Restore then restores a +dump-less snapshot → reimport reads nothing → silent no-op (hook lacks `set -o pipefail`) → ci_marker lost. + +Ruled OUT: not OOM (db has NO mem_limit; host 5.8G free), not healthcheck-timing (base hc retries=10 × +interval=30s = 5min to unhealthy — impossible in the observed ~16s window; merged hc keeps test+retries). +So the cycle is driven BY the backup op, not a crash/healthcheck. The db-grace start_period overlay was +a RED HERRING for restore (the cycle is past start_period). Likely abra/backup-bot-two stops/restarts +the db to take a consistent volume snapshot; the omission is a timing race in that flow. + +NEXT (precise): read backup-bot-two `/usr/bin/backup` backup flow (does it stop/cycle containers? how +does it enumerate volumes relative to that?) to confirm the cycle+capture interaction, THEN fix: +candidate = harness verifies the backup snapshot contains the db volume and retries if not, AND/OR the +recipe-PR backup is made resilient (+ `set -o pipefail` + fail-loud on missing dump so it can never be +silent again). 5 ghost runs done (full4 timeout-fixed; full5/6/7/8 restore race) — stop blind re-runs.