journal(2): ghost full8 instrumented — DEFINITIVE root cause = db container cycled by backup op, racing backupbot volume capture (not OOM/not-healthcheck); next: read backupbot backup flow
This commit is contained in:
@ -1407,3 +1407,30 @@ So the reimport post-hook is configured + present yet ci_marker doesn't return O
|
||||
start_period=15m. Mechanism unclear by reasoning (start_period shouldn't keep a ready mysql
|
||||
"starting"). Next: full7 with the restore tier WATCHED LIVE — db health state, `abra app restore`
|
||||
output, backup.sql.gz presence, ci_marker immediately post-restore — to get the actual mechanism.
|
||||
|
||||
## 2026-05-30T21:15Z — ghost full8 INSTRUMENTED: DEFINITIVE root cause = db container cycles DURING backup op
|
||||
|
||||
Fixed the diag watcher (NixOS has no /bin/bash → must `bash gdw3.sh`) and captured db state every 4s
|
||||
through full8's backup+restore tiers (`/root/ghost-diag8.log`). Decisive timeline (backup op):
|
||||
21:08:43–51 db cid=93865743 repl=1/1 healthy
|
||||
21:08:58 db cid=93865743 repl=1/1 **unhealthy**
|
||||
21:09:03 repl=0/1 cid= (container GONE)
|
||||
21:09:07–19 repl=0/1 **starting**, NEW cid=784ec680
|
||||
21:09:24→ repl=1/1 healthy (new cid), stays healthy through the whole restore tier
|
||||
So the db container is REPLACED during the BACKUP op (abra app backup create), well before the restore
|
||||
tier. This races backupbot's volume enumeration (which resolves each volume path from running service
|
||||
specs at backup time) → the mysql volume is intermittently omitted from the snapshot (proven earlier:
|
||||
full5 snapshot had …_mysql/backup.sql.gz; full6/7 had only ghost_content). Restore then restores a
|
||||
dump-less snapshot → reimport reads nothing → silent no-op (hook lacks `set -o pipefail`) → ci_marker lost.
|
||||
|
||||
Ruled OUT: not OOM (db has NO mem_limit; host 5.8G free), not healthcheck-timing (base hc retries=10 ×
|
||||
interval=30s = 5min to unhealthy — impossible in the observed ~16s window; merged hc keeps test+retries).
|
||||
So the cycle is driven BY the backup op, not a crash/healthcheck. The db-grace start_period overlay was
|
||||
a RED HERRING for restore (the cycle is past start_period). Likely abra/backup-bot-two stops/restarts
|
||||
the db to take a consistent volume snapshot; the omission is a timing race in that flow.
|
||||
|
||||
NEXT (precise): read backup-bot-two `/usr/bin/backup` backup flow (does it stop/cycle containers? how
|
||||
does it enumerate volumes relative to that?) to confirm the cycle+capture interaction, THEN fix:
|
||||
candidate = harness verifies the backup snapshot contains the db volume and retries if not, AND/OR the
|
||||
recipe-PR backup is made resilient (+ `set -o pipefail` + fail-loud on missing dump so it can never be
|
||||
silent again). 5 ghost runs done (full4 timeout-fixed; full5/6/7/8 restore race) — stop blind re-runs.
|
||||
|
||||
Reference in New Issue
Block a user