fix(1e): F1e-1 exec_in_app race + HC1 head_ref/move hardening

F1e-1 (Adversary): exec_in_app silently returned '' on a failed docker exec, flipping a healthy
recipe RED under opt-out (post-backup container cycle, no readiness buffer). Now polls (re-resolve
container + re-exec) until rc==0 or 90s, then RAISES — never masks an exec failure as empty data.
No assertion weakened. Verified: opt-out install,backup,restore on custom-html now PASS.

HC1: head_ref = ref or recipe_head_commit (prefer explicit PR head sha $REF — robust, no git race;
production !testme always sets REF). assert_upgraded, when head_ref known, REQUIRES the deployed
chaos-version commit to MATCH head_ref (direct + non-vacuous proof the PR-head code was deployed; a
stale prev-checkout chaos redeploy fails). Falls back to version/image/chaos move check otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:41:42 +01:00
parent 4334e19a7b
commit 6eabfdc0fb
5 changed files with 149 additions and 35 deletions

@ -76,3 +76,28 @@ Next: confirm opt-out result, claim E1/HC3 gate, then E2 (HC1 chaos-to-PR-head).
leftover custom-html stack). Log: /root/ccci-1e-optout.log.
- HC3 proven both ways: default = generic+overlay additive on one deployment (op once); opt-out =
generic floor skipped, overlay still runs. Gate E1/HC3 CLAIMED for Adversary.
## 2026-05-28 — Adversary F1e-1 (HC3 opt-out race) + HC1 hardening
- **F1e-1 (E1/HC3 FAIL withheld):** under `CCCI_SKIP_GENERIC=1`, `test_backup_captures_state` flaked
`'' == 'original'`. Root cause (valid): `lifecycle.exec_in_app` returned `proc.stdout` WITHOUT
checking returncode — when backup-bot cycles the app container, `docker exec` fails and the empty
stdout was silently returned as data; the generic pytest spawn (~1s) had been an accidental timing
buffer that opt-out removes. **Fix (no assertion weakened):** `exec_in_app` now polls — re-resolves
the container + re-execs until returncode==0 or a 90s timeout, then RAISES. A container-cycle race
now waits-and-succeeds; a genuine exec failure is loud, never masquerades as empty data. This makes
the backup/restore overlays robust to the post-op cycle independent of the generic timing buffer, so
opt-out is behavior-neutral.
- **HC1 hardening (my own findings from E2 e2e):**
  - `head_ref` capture was racy (returned None under a concurrent run wiping the shared recipe dir),
    and a chaos-redeploy of the SAME prev checkout falsely "moved" via the chaos label alone. Fixes:
    `head_ref = ref or recipe_head_commit(recipe)` (prefer the explicit PR head sha $REF: robust, no
    git race; production `!testme` always sets REF); store head_ref in op_state.
  - `assert_upgraded` now, when head_ref is known, REQUIRES the deployed `chaos-version` commit to
    MATCH head_ref. This is direct proof the PR-head code under test was deployed, and non-vacuous (a
    stale prev-checkout chaos redeploy stamps prev's commit ≠ head_ref → FAIL). Falls back to the
    version/image/chaos move check only when head_ref is unknown.
- **Coordination note:** my E2 manual custom-html e2e ran concurrently with the Adversary's E1
cold-verify. Both share `/root/.abra/recipes/custom-html` and (at PR=0) the same run domain, so they
collided; that explains my non-deterministic upgrade results (1.10→1.11 vs 1.10→1.10) and the None
head_ref. Manual ad-hoc runs bypass Drone's capacity=1 queue. Going forward I serialize: don't run a
recipe manually while a gate is under Adversary verification; verify only when `pgrep run_recipe_ci`
is clear.
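
A minimal sketch of the two fixes above, for reference only. It is not the actual `lifecycle` code: `exec_with_retry`, `pick_head_ref`, `check_upgraded` and `ExecFailed` are hypothetical names, the container lookup and the `docker exec` call are injected as callables so the sketch runs without Docker, and the 1s poll interval and exact-sha comparison are assumptions (the real check may need to handle abbreviated shas).

```python
import time
from typing import Callable, Optional, Tuple


class ExecFailed(RuntimeError):
    """Raised when the command never exits 0 before the timeout."""


def exec_with_retry(
    resolve_container: Callable[[], Optional[str]],
    run_in: Callable[[str], Tuple[int, str]],
    timeout: float = 90.0,
    interval: float = 1.0,  # assumed poll interval, not from the source
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the command exits 0, re-resolving the container each attempt.

    run_in(container) returns (returncode, stdout). A non-zero returncode is
    never returned as data: it is retried, and raised once the timeout passes.
    """
    deadline = clock() + timeout
    last_rc: Optional[int] = None
    while True:
        container = resolve_container()  # may be None while the container cycles
        if container is not None:
            last_rc, out = run_in(container)
            if last_rc == 0:
                return out
        if clock() >= deadline:
            raise ExecFailed(
                f"exec did not succeed within {timeout}s (last rc={last_rc})"
            )
        sleep(interval)


def pick_head_ref(ref: Optional[str], fallback: Callable[[], Optional[str]]) -> Optional[str]:
    """Prefer the explicit PR head sha; only consult git when it is missing."""
    return ref or fallback()


def check_upgraded(deployed_commit: str, head_ref: Optional[str], moved: bool) -> bool:
    """With a known head_ref, require an exact commit match; else use the move check."""
    if head_ref:
        return deployed_commit == head_ref  # assumption: both are the same sha form
    return moved
```

With this shape, a container-cycle race shows up as one or more retried attempts, and a stale prev-checkout redeploy fails `check_upgraded` even when the move check alone would have passed.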