review(1e): CORRECT F1e-1 — isolated repro disproves opt-out theory (3/3 pass); reframe as load/concurrency trigger; file F1e-2 (recipe-fetch race); fix-verify in flight
@@ -63,12 +63,33 @@ deploy-count stays 1; two e2e (default + opt-out) "clean."
overlay-only, **deploy-count=1** ✓, **but backup=FAIL**: `test_backup_captures_state` →
`AssertionError: '' == 'original'`. Same code/recipe; the only difference is the opt-out flag.

**Verdict: FAIL (opt-out is not behavior-neutral).** Opting out of the generic removes an accidental
~1s timing buffer (the generic pytest spawn) and surfaces a real race: the backup/restore overlays
read the marker via `exec_in_app` immediately after a container-cycling op with no readiness/retry, and
`exec_in_app` silently returns empty stdout on a failed `docker exec` (returncode ignored). A healthy
recipe can thus be reported RED under opt-out. Filed **F1e-1 [adversary]** (BACKLOG-1e) with root cause,
repro, and fix direction (check exec returncode + bounded readiness retry; do NOT weaken the assertion).
Isolated (no-concurrency) reproduction in flight to rule out the parallel-Builder-run confound, which
would itself be a concurrency-collision finding. **HC3 PASS withheld until F1e-1 is fixed + re-verified
cold under opt-out.**
**Interim verdict (commit 4334e19): FAIL (opt-out flipped backup RED)**; the theorised cause was the
opt-out path removing an accidental ~1s generic-pytest timing buffer. **Filed F1e-1.**

### CORRECTION @2026-05-28 (isolated repro disproved the opt-out theory)
Isolated, no-concurrency repro of `STAGES=install,backup,restore` on custom-html:
- **opt-out × 3** (`CCCI_SKIP_GENERIC=1`): backup PASS, restore PASS, deploy-count=1. **3/3.**
- **default × 1**: backup PASS, restore PASS, deploy-count=1.

So opting out of the generic is **NOT** what flips the backup RED: the original symptom occurred while
the Builder was running concurrent custom-html e2e on the same node. The real trigger is **load /
concurrency** putting the post-backup container cycle into a window where `exec_in_app`'s `docker exec`
fails. The **static defect stays the same** (and the fix direction in F1e-1 is still correct):
`exec_in_app` silently returns empty stdout on a failed exec (returncode ignored), and there is no
readiness retry.
F1e-1 reframed in BACKLOG-1e; my earlier "opt-out is not behavior-neutral" framing is **withdrawn**.

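To make the static defect concrete, here is a minimal sketch of how an ignored returncode turns a failed exec into `''`. The helper names `exec_ignoring_rc` and `exec_checked` are hypothetical, and a failing Python subprocess stands in for `docker exec` hitting a cycling container; this is not the project's actual `exec_in_app`.

```python
import subprocess
import sys


def exec_ignoring_rc(cmd):
    """Hypothetical stand-in for the defective pattern: returncode is never inspected."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.stdout.strip()


def exec_checked(cmd):
    """Same call, but a non-zero exit is surfaced instead of being returned as empty data."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(f"exec failed rc={proc.returncode}: {proc.stderr.strip()}")
    return proc.stdout.strip()


# Assumption: a command that exits 1 without output models a failed `docker exec`.
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]

marker = exec_ignoring_rc(failing)
print(repr(marker))  # the caller sees empty data, not an error
```

With the unchecked variant, a later `assert marker == 'original'` fails as `'' == 'original'`, which matches the shape of the reported symptom even though the recipe itself may be healthy.
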
### Builder's fix (commit 6eabfdc), verification pending
`exec_in_app` now polls (re-resolves container + re-execs) until `rc==0` or 90s, then **raises**;
it never masks a failed exec as empty data. No assertion weakened. Same commit also lands HC1 plumbing
(`chaos_redeploy`, `recipe_head_commit`, `.chaos-version` parsing in `deployed_identity`, head_ref
match in `assert_upgraded`), which is out of scope for this re-verification; will check at E2 claim.

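The polling behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in 6eabfdc: `poll_until_ok` and its parameters are hypothetical, and `attempt` is a caller-supplied callable that would re-resolve the container and re-run the exec, returning `(rc, stdout)`.

```python
import time


def poll_until_ok(attempt, timeout_s=90.0, interval_s=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Call attempt() until it reports rc == 0 or the deadline passes, then raise.

    attempt: hypothetical callable returning (rc, stdout) for one exec try.
    A failed exec is never returned as data; it either retries or raises.
    """
    deadline = clock() + timeout_s
    while True:
        rc, out = attempt()
        if rc == 0:
            return out
        if clock() >= deadline:
            raise TimeoutError(f"exec still failing after {timeout_s}s (last rc={rc})")
        sleep(interval_s)


# Usage with canned results: two failed tries, then success.
responses = iter([(1, ""), (1, ""), (0, "original")])
print(poll_until_ok(lambda: next(responses), timeout_s=5.0, interval_s=0.0))
```

Injecting `clock` and `sleep` is only a testing convenience in this sketch; it lets the bound be exercised without waiting 90 real seconds.
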
Fix-verify in flight on `/tmp/adv-fix` (HEAD 6eabfdc shipped): opt-out install,backup,restore on
custom-html. Will close F1e-1 + finalise E1/HC3 verdict once verified.

### Separate observation while testing (NOT F1e-1)
A controlled 2-concurrent same-recipe test (PR=8001/PR=8002, both custom-html) on the **OLD** code
showed run-a dying in `abra recipe fetch custom-html -n` (rc=1): the concurrent rm-rf + abra-fetch on
the same `~/.abra/recipes/custom-html` collide. Pre-existing (in 1d too), orthogonal to E1/HC3, not the
F1e-1 trigger. Filing separately as **F1e-2 [adversary]** for HC4 visibility (§6 D-gate requires
concurrent runs to be safe). Drone caps `MAX_TESTS=1-2` today, so practical impact is bounded.

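No fix direction for F1e-2 is recorded here. Purely as an illustration of one way the rm-rf + fetch pair could be serialised per recipe, the sketch below wraps both steps in an advisory file lock. Everything in it is an assumption: `recipe_lock`, `refresh_recipe`, and the `.lock` sibling path are hypothetical, `fetch` is a caller-supplied callable standing in for the `abra recipe fetch` step, and `fcntl.flock` is Unix-only.

```python
import fcntl
import os
import shutil
from contextlib import contextmanager


@contextmanager
def recipe_lock(recipe_dir):
    """Hold an exclusive advisory lock on a sibling '<recipe_dir>.lock' file."""
    lock_path = recipe_dir.rstrip("/") + ".lock"
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other holders release
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def refresh_recipe(recipe_dir, fetch):
    """Remove and re-fetch a recipe checkout with no other run interleaving.

    fetch: hypothetical callable that recreates recipe_dir.
    """
    with recipe_lock(recipe_dir):
        shutil.rmtree(recipe_dir, ignore_errors=True)
        fetch()
```

The lock file lives next to the recipe directory rather than inside it, so the rm-rf step cannot delete the lock while another run is waiting on it.
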