journal(2): ghost full8 instrumented — DEFINITIVE root cause = db container cycled by backup op, racing backupbot volume capture (not OOM/not-healthcheck); next: read backupbot backup flow

This commit is contained in:
autonomic-bot
2026-05-30 21:17:44 +00:00
parent 427b8ff8c7
commit ad7b3d0e8c

View File

@ -1407,3 +1407,30 @@ So the reimport post-hook is configured + present yet ci_marker doesn't return O
start_period=15m. Mechanism unclear by reasoning (start_period shouldn't keep a ready mysql
"starting"). Next: full7 with the restore tier WATCHED LIVE — db health state, `abra app restore`
output, backup.sql.gz presence, ci_marker immediately post-restore — to get the actual mechanism.
## 2026-05-30T21:15Z — ghost full8 INSTRUMENTED: DEFINITIVE root cause = db container cycles DURING backup op
Fixed the diag watcher (NixOS has no /bin/bash → must `bash gdw3.sh`) and captured db state every 4s
through full8's backup+restore tiers (`/root/ghost-diag8.log`). Decisive timeline (backup op):
21:08:4351 db cid=93865743 repl=1/1 healthy
21:08:58 db cid=93865743 repl=1/1 **unhealthy**
21:09:03 repl=0/1 cid= (container GONE)
21:09:0719 repl=0/1 **starting**, NEW cid=784ec680
21:09:24→ repl=1/1 healthy (new cid), stays healthy through the whole restore tier
So the db container is REPLACED during the BACKUP op (abra app backup create), well before the restore
tier. This races backupbot's volume enumeration (which resolves each volume path from running service
specs at backup time) → the mysql volume is intermittently omitted from the snapshot (proven earlier:
full5 snapshot had …_mysql/backup.sql.gz; full6/7 had only ghost_content). Restore then restores a
dump-less snapshot → reimport reads nothing → silent no-op (hook lacks `set -o pipefail`) → ci_marker lost.
Ruled OUT: not OOM (db has NO mem_limit; host 5.8G free), not healthcheck-timing (base hc retries=10 ×
interval=30s = 5min to unhealthy — impossible in the observed ~16s window; merged hc keeps test+retries).
So the cycle is driven BY the backup op, not a crash/healthcheck. The db-grace start_period overlay was
a RED HERRING for restore (the cycle is past start_period). Likely abra/backup-bot-two stops/restarts
the db to take a consistent volume snapshot; the omission is a timing race in that flow.
NEXT (precise): read backup-bot-two `/usr/bin/backup` backup flow (does it stop/cycle containers? how
does it enumerate volumes relative to that?) to confirm the cycle+capture interaction, THEN fix:
candidate = harness verifies the backup snapshot contains the db volume and retries if not, AND/OR the
recipe-PR backup is made resilient (+ `set -o pipefail` + fail-loud on missing dump so it can never be
silent again). 5 ghost runs done (full4 timeout-fixed; full5/6/7/8 restore race) — stop blind re-runs.