decisions(2): SETTLED — harness BACKUP_VERIFY hook + backup retry closes the backup-capture race (recipe-scoped, additive)

This commit is contained in:
autonomic-bot
2026-05-30 21:30:47 +00:00
parent 68a7c79668
commit 16c9241e0c

View File

@ -1094,3 +1094,22 @@ NB also flagged: snapshots accumulate across runs under the SAME deterministic d
(ghos-9431a1) — restore picks the global LATEST, which can be a prior run's; teardown removes app+
volumes but not the restic repo. Confirm the harness restore targets THIS run's snapshot (likely fine
since each run's latest is its own backup, but worth a harness check).
## 2026-05-30 — SETTLED: harness BACKUP_VERIFY hook + backup retry (backup-capture integrity)
New shared-harness capability (run_recipe_ci `_perform_op` backup branch + meta allowlist): a recipe's
`recipe_meta.py` MAY define `BACKUP_VERIFY(domain) -> bool`, a READ-ONLY probe run AFTER the backup op
that confirms the backup actually captured the recipe's critical data. If it returns False, the harness
re-runs the entire `abra app backup create` (fresh restic snapshot, re-stabilised services) up to 3
attempts. Recipes that don't define it keep the single-backup behaviour (zero blast radius).
Rationale: DB recipes dump their data in a backupbot `backup.pre-hook`; backup-bot-two enumerates the
volume paths, runs the pre-hook (dump), then restic-captures the paths. If the DB container cycles
mid-dump (observed intermittent on the loaded single CI node — ghost full5/6/7 RED, full8 green; ruled
out OOM + healthcheck), the dump is truncated/absent → restic snapshots an empty path → `abra app
backup create` still returns success → a later restore reimports nothing → silent data loss that P4
catches as RED. A per-restic `--retries` cannot help (paths enumerated once, up-front). The verify+retry
closes the race generally. It is NOT a test weakening: BACKUP_VERIFY is read-only and only retries a
flaky CAPTURE so the P4 restore assertion is exercised reliably instead of luck-dependently. Companion
recipe-PR hardening (mysql_backup.sh `set -o pipefail` + fail-loud on missing dump) still wanted so the
reimport can never silently no-op. ghost BACKUP_VERIFY: backup.sql.gz is a valid non-empty gzip.