fix(2): ghost F2-14b — harness BACKUP_VERIFY hook + retry; close the backup-capture race
Root cause (instrumented, DECISIONS 2026-05-30): a DB recipe dumps its data in a backupbot pre-hook, but if the DB container cycles mid-dump (intermittent on the loaded CI node: full5/6/7 RED, full8 green; NOT OOM/NOT healthcheck) the dump is truncated/absent and restic snapshots an empty path. abra app backup 'succeeds', yet a later restore silently loses the data (ghost ci_marker).

Fix (additive, recipe-scoped via meta like READY_PROBE): recipe_meta may define BACKUP_VERIFY(domain) -> bool, a READ-ONLY post-backup integrity probe. When it returns False the harness re-runs the whole backup (fresh snapshot, re-stabilised db), for up to 3 attempts in total. Recipes without the hook are unaffected. ghost's BACKUP_VERIFY confirms /var/lib/mysql/backup.sql.gz is a valid non-empty gzip.

Weakens no assertion: it only retries a flaky CAPTURE so P4 restore is RELIABLY exercised, not luck-dependent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
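The "valid non-empty gzip" check described above can be sketched as follows. This is a minimal local-file illustration, not ghost's actual hook: the function name `is_valid_nonempty_gzip` is hypothetical, and the real BACKUP_VERIFY takes a domain and presumably inspects the file inside the app's container rather than on the local filesystem.

```python
import gzip
import zlib


def is_valid_nonempty_gzip(path: str, chunk_size: int = 1 << 16) -> bool:
    """Return True only if `path` is a complete gzip stream with a non-empty payload.

    Hypothetical helper for illustration; read-only, like the probe in the commit.
    """
    total = 0
    try:
        with gzip.open(path, "rb") as fh:
            # Reading to EOF forces the end-of-stream and CRC checks, so a dump
            # that was cut off mid-write raises instead of passing silently.
            while chunk := fh.read(chunk_size):
                total += len(chunk)
    except (OSError, EOFError, zlib.error):
        # Missing file, bad magic bytes, truncated stream, or corrupt data.
        return False
    # A zero-byte file or a gzip of an empty payload both count as "nothing captured".
    return total > 0
```

Decompressing the whole stream costs more than checking the magic bytes, but a header-only check would accept exactly the truncated dump this commit is guarding against.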
@@ -194,7 +194,7 @@ def _load_meta(recipe: str) -> dict:
     ns: dict = {}
     with open(path) as fh:
         exec(compile(fh.read(), path, "exec"), ns)  # noqa: S102 (trusted, in-repo)
-    for k in list(meta) + ["BACKUP_CAPABLE", "SKIP_GENERIC", "OIDC_AT_INSTALL", "READY_PROBE", "UPGRADE_BASE_VERSION"]:
+    for k in list(meta) + ["BACKUP_CAPABLE", "SKIP_GENERIC", "OIDC_AT_INSTALL", "READY_PROBE", "UPGRADE_BASE_VERSION", "BACKUP_VERIFY"]:
         if k in ns:
             meta[k] = ns[k]
     return meta
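To illustrate how a recipe-defined hook reaches the harness through the key list in this hunk, here is a minimal sketch. It loads from a source string instead of a file path, and `load_meta_from_source`, `HOOK_KEYS`, and the example recipe body are hypothetical names for illustration, not the harness's actual API.

```python
# Subset of the key list in the hunk above; names outside this list are ignored.
HOOK_KEYS = ["READY_PROBE", "BACKUP_VERIFY"]


def load_meta_from_source(source: str, defaults: dict) -> dict:
    """Execute trusted recipe-meta source and copy recognised names into a meta dict."""
    meta = dict(defaults)
    ns: dict = {}
    exec(compile(source, "<recipe_meta>", "exec"), ns)  # trusted, in-repo input only
    for k in list(defaults) + HOOK_KEYS:
        if k in ns:
            meta[k] = ns[k]
    return meta


# Hypothetical recipe meta: a plain constant plus a callable hook.
EXAMPLE_META = '''
BACKUP_CAPABLE = True

def BACKUP_VERIFY(domain):
    # Placeholder probe; a real one would inspect the dump for this app.
    return domain.endswith(".example.com")

UNLISTED = "never copied"
'''

meta = load_meta_from_source(EXAMPLE_META, {"BACKUP_CAPABLE": False})
```

Because only listed keys are copied, a recipe that defines BACKUP_VERIFY was silently ignored before this change; adding the key to the list is what makes the hook visible to `_perform_op`.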
@@ -253,7 +253,28 @@ def _perform_op(
         before = generic.perform_upgrade(domain, recipe, head_ref, deploy_timeout=deploy_timeout, meta=meta)
         op_state["upgrade"] = {"before": before, "head_ref": head_ref}
     elif op == "backup":
-        op_state["backup"] = {"snapshot_id": generic.perform_backup(domain)}
+        # Backup integrity + retry (F2-14b). A recipe may define BACKUP_VERIFY(domain) -> bool that
+        # confirms the backup actually captured the recipe's critical data AFTER the op. This guards a
+        # real race: a DB recipe dumps its data in a backupbot pre-hook, but if the DB container cycles
+        # mid-dump (intermittent under host load) the dump is truncated/absent, so restic snapshots an
+        # empty path — `abra app backup create` still "succeeds", yet a later restore silently loses the
+        # data (ghost: backup.sql.gz never written → restore can't reimport → seeded row gone). When
+        # verify fails we re-run the WHOLE backup (fresh restic snapshot) with a re-stabilised DB, up to
+        # 3 attempts. Recipes without BACKUP_VERIFY are unaffected (single backup, as before).
+        snap = generic.perform_backup(domain)
+        verify = meta.get("BACKUP_VERIFY") if meta else None
+        attempt = 1
+        while callable(verify) and not verify(domain) and attempt < 3:
+            attempt += 1
+            print(
+                f" backup-verify FAILED (attempt {attempt - 1}/3) — backup did not capture the "
+                f"recipe's critical data (e.g. DB cycled mid-dump); re-running backup",
+                flush=True,
+            )
+            snap = generic.perform_backup(domain)
+        if callable(verify) and not verify(domain):
+            print(f" !! backup-verify still FAILED after {attempt} attempts — backup is incomplete", flush=True)
+        op_state["backup"] = {"snapshot_id": snap}
     elif op == "restore":
         generic.perform_restore(domain)
     # install: already deployed; no op
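The verify-and-retry flow in this hunk can also be written as a standalone helper, which makes the attempt accounting easier to test. This is a sketch under assumptions, not the committed code: `backup_with_verify` and its parameters are hypothetical, and unlike the committed loop (which probes the final snapshot once more after the `while` exits) it runs the probe exactly once per backup.

```python
from typing import Callable, Optional, Tuple


def backup_with_verify(
    perform_backup: Callable[[str], str],
    verify: Optional[Callable[[str], bool]],
    domain: str,
    max_attempts: int = 3,
) -> Tuple[str, bool, int]:
    """Run a backup, re-running it while the read-only probe reports a bad capture.

    Returns (snapshot_id, verified, attempts). With no probe, a single backup
    is taken and treated as verified, matching recipes that define no hook.
    """
    attempts = 0
    while True:
        attempts += 1
        snapshot_id = perform_backup(domain)  # fresh snapshot on every attempt
        if not callable(verify) or verify(domain):
            return snapshot_id, True, attempts
        if attempts >= max_attempts:
            # Out of retries: report the incomplete backup instead of hiding it.
            return snapshot_id, False, attempts
        print(f"backup-verify FAILED (attempt {attempts}/{max_attempts}); re-running backup", flush=True)
```

Returning the verification result lets the caller decide whether an unverified snapshot should fail the run or only be logged, which is a choice the committed code makes by printing a warning and continuing.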