decisions(2): SETTLED — harness BACKUP_VERIFY hook + backup retry closes the backup-capture race (recipe-scoped, additive)

This commit is contained in:
autonomic-bot
2026-05-30 21:30:47 +00:00
parent 68a7c79668
commit 16c9241e0c

View File

@ -1094,3 +1094,22 @@ NB also flagged: snapshots accumulate across runs under the SAME deterministic d
(ghos-9431a1) — restore picks the global LATEST, which can be a prior run's; teardown removes app+ (ghos-9431a1) — restore picks the global LATEST, which can be a prior run's; teardown removes app+
volumes but not the restic repo. Confirm the harness restore targets THIS run's snapshot (likely fine volumes but not the restic repo. Confirm the harness restore targets THIS run's snapshot (likely fine
since each run's latest is its own backup, but worth a harness check). since each run's latest is its own backup, but worth a harness check).
## 2026-05-30 — SETTLED: harness BACKUP_VERIFY hook + backup retry (backup-capture integrity)
New shared-harness capability (run_recipe_ci `_perform_op` backup branch + meta allowlist): a recipe's
`recipe_meta.py` MAY define `BACKUP_VERIFY(domain) -> bool`, a READ-ONLY probe run AFTER the backup op
that confirms the backup actually captured the recipe's critical data. If it returns False, the harness
re-runs the entire `abra app backup create` (fresh restic snapshot, re-stabilised services) up to 3
attempts. Recipes that don't define it keep the single-backup behaviour (zero blast radius).
Rationale: DB recipes dump their data in a backupbot `backup.pre-hook`; backup-bot-two enumerates the
volume paths, runs the pre-hook (dump), then restic-captures the paths. If the DB container cycles
mid-dump (observed intermittent on the loaded single CI node — ghost full5/6/7 RED, full8 green; ruled
out OOM + healthcheck), the dump is truncated/absent → restic snapshots an empty path → `abra app
backup create` still returns success → a later restore reimports nothing → silent data loss that P4
catches as RED. A per-restic `--retries` cannot help (paths enumerated once, up-front). The verify+retry
closes the race generally. It is NOT a test weakening: BACKUP_VERIFY is read-only and only retries a
flaky CAPTURE so the P4 restore assertion is exercised reliably instead of luck-dependently. Companion
recipe-PR hardening (mysql_backup.sh `set -o pipefail` + fail-loud on missing dump) still wanted so the
reimport can never silently no-op. ghost BACKUP_VERIFY: backup.sql.gz is a valid non-empty gzip.