fix(2): ghost F2-14b — harness BACKUP_VERIFY hook + retry; close the backup-capture race
Root cause (instrumented, DECISIONS 2026-05-30): a DB recipe dumps its data in a backupbot pre-hook, but if the DB container cycles mid-dump (intermittent on the loaded CI node — full5/6/7 RED, full8 green; NOT OOM/NOT healthcheck) the dump is truncated/absent and restic snapshots an empty path — abra app backup 'succeeds' yet a later restore silently loses the data (ghost ci_marker). Fix (additive, recipe-scoped via meta like READY_PROBE): recipe_meta may define BACKUP_VERIFY(domain) -> bool, a READ-ONLY post-backup integrity probe. When it returns False the harness re-runs the whole backup (fresh snapshot, re-stabilised db) up to 3x. Recipes without the hook are unaffected. ghost's BACKUP_VERIFY confirms /var/lib/mysql/backup.sql.gz is a valid non-empty gzip. Weakens no assertion — it only retries a flaky CAPTURE so P4 restore is RELIABLY exercised, not luck-dependent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@ -45,3 +45,30 @@ EXTRA_ENV = {
|
||||
"TIMEOUT": "2400",
|
||||
"COMPOSE_FILE": "compose.yml:compose.ccci.yml",
|
||||
}
|
||||
|
||||
|
||||
def BACKUP_VERIFY(domain):
|
||||
"""Post-backup integrity check (F2-14b). The recipe's backupbot db pre-hook dumps the ghost MySQL
|
||||
DB to `/var/lib/mysql/backup.sql.gz` (then restic captures that path). On the loaded single CI node
|
||||
the db container intermittently CYCLES mid-dump (observed: full5/6/7 RED, full8 green — pure race;
|
||||
NOT OOM, NOT healthcheck — db hc retries=10), so the dump is truncated/never written and restic
|
||||
snapshots an empty mysql path → a later restore reimports nothing → the seeded ci_marker is lost
|
||||
(P4 RED). This proves the dump completed: backup.sql.gz exists, is a VALID gzip, and is non-empty.
|
||||
Returning False makes the harness re-run the whole backup with a re-stabilised db (run_recipe_ci
|
||||
_perform_op). It is a READ-ONLY probe — it weakens no assertion; it only retries a flaky CAPTURE."""
|
||||
import os
|
||||
import sys
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
|
||||
from harness import lifecycle # noqa: E402
|
||||
|
||||
try:
|
||||
out = lifecycle.exec_in_app(
|
||||
domain,
|
||||
["sh", "-c", "gzip -t /var/lib/mysql/backup.sql.gz && wc -c < /var/lib/mysql/backup.sql.gz"],
|
||||
service="db",
|
||||
timeout=60,
|
||||
).strip()
|
||||
except Exception: # noqa: BLE001 — exec fails if the db is mid-cycle: treat as not-yet-captured
|
||||
return False
|
||||
return out.isdigit() and int(out) > 0
|
||||
|
||||
Reference in New Issue
Block a user