journal(2): ghost full8 instrumented — DEFINITIVE root cause = db container cycled by backup op, racing backupbot volume capture (not OOM/not-healthcheck); next: read backupbot backup flow
This commit is contained in:
@ -1407,3 +1407,30 @@ So the reimport post-hook is configured + present yet ci_marker doesn't return O
|
|||||||
start_period=15m. Mechanism unclear by reasoning (start_period shouldn't keep a ready mysql
|
start_period=15m. Mechanism unclear by reasoning (start_period shouldn't keep a ready mysql
|
||||||
"starting"). Next: full7 with the restore tier WATCHED LIVE — db health state, `abra app restore`
|
"starting"). Next: full7 with the restore tier WATCHED LIVE — db health state, `abra app restore`
|
||||||
output, backup.sql.gz presence, ci_marker immediately post-restore — to get the actual mechanism.
|
output, backup.sql.gz presence, ci_marker immediately post-restore — to get the actual mechanism.
|
||||||
|
|
||||||
|
## 2026-05-30T21:15Z — ghost full8 INSTRUMENTED: DEFINITIVE root cause = db container cycles DURING backup op
|
||||||
|
|
||||||
|
Fixed the diag watcher (NixOS has no /bin/bash → must `bash gdw3.sh`) and captured db state every 4s
|
||||||
|
through full8's backup+restore tiers (`/root/ghost-diag8.log`). Decisive timeline (backup op):
|
||||||
|
21:08:43–51 db cid=93865743 repl=1/1 healthy
|
||||||
|
21:08:58 db cid=93865743 repl=1/1 **unhealthy**
|
||||||
|
21:09:03 repl=0/1 cid= (container GONE)
|
||||||
|
21:09:07–19 repl=0/1 **starting**, NEW cid=784ec680
|
||||||
|
21:09:24→ repl=1/1 healthy (new cid), stays healthy through the whole restore tier
|
||||||
|
So the db container is REPLACED during the BACKUP op (abra app backup create), well before the restore
|
||||||
|
tier. This races backupbot's volume enumeration (which resolves each volume path from running service
|
||||||
|
specs at backup time) → the mysql volume is intermittently omitted from the snapshot (proven earlier:
|
||||||
|
full5 snapshot had …_mysql/backup.sql.gz; full6/7 had only ghost_content). Restore then restores a
|
||||||
|
dump-less snapshot → reimport reads nothing → silent no-op (hook lacks `set -o pipefail`) → ci_marker lost.
|
||||||
|
|
||||||
|
Ruled OUT: not OOM (db has NO mem_limit; host 5.8G free), not healthcheck-timing (base hc retries=10 ×
|
||||||
|
interval=30s = 5min to unhealthy — impossible in the observed ~16s window; merged hc keeps test+retries).
|
||||||
|
So the cycle is driven BY the backup op, not a crash/healthcheck. The db-grace start_period overlay was
|
||||||
|
a RED HERRING for restore (the cycle is past start_period). Likely abra/backup-bot-two stops/restarts
|
||||||
|
the db to take a consistent volume snapshot; the omission is a timing race in that flow.
|
||||||
|
|
||||||
|
NEXT (precise): read backup-bot-two `/usr/bin/backup` backup flow (does it stop/cycle containers? how
|
||||||
|
does it enumerate volumes relative to that?) to confirm the cycle+capture interaction, THEN fix:
|
||||||
|
candidate = harness verifies the backup snapshot contains the db volume and retries if not, AND/OR the
|
||||||
|
recipe-PR backup is made resilient (+ `set -o pipefail` + fail-loud on missing dump so it can never be
|
||||||
|
silent again). 5 ghost runs done (full4 timeout-fixed; full5/6/7/8 restore race) — stop blind re-runs.
|
||||||
|
|||||||
Reference in New Issue
Block a user