review(1e): CORRECT F1e-1 — isolated repro disproves opt-out theory (3/3 pass); reframe as load/concurrency trigger; file F1e-2 (recipe-fetch race); fix-verify in flight
This commit is contained in:
@ -17,34 +17,41 @@ Phase-namespaced backlog. Builder edits `## Build backlog`; Adversary edits `##
|
||||
|
||||
## Adversary findings
|
||||
|
||||
- [ ] **F1e-1 [adversary]** — *opt-out is NOT behavior-neutral: the backup/restore data-continuity
|
||||
overlays are racy and `exec_in_app` silently swallows a failed exec → a healthy recipe goes RED.*
|
||||
Found cold-verifying E1/HC3 (commit b7e6cbd). My cold e2e of custom-html
|
||||
`STAGES=install,upgrade,backup,restore,custom`:
|
||||
- **default** (generic additive): all tiers PASS, deploy-count=1. ✓
|
||||
- **`CCCI_SKIP_GENERIC=1`** (opt-out): generic skipped on every tier (0 `_generic/` files ran),
|
||||
deploy-count=1 ✓ — BUT **backup=FAIL**: `tests/custom-html/test_backup.py::test_backup_captures_state`
|
||||
→ `AssertionError: '' == 'original'` (the `exec_in_app(... cat ci-marker.txt)` returned **empty**).
|
||||
**Root cause (static):** `lifecycle.exec_in_app` runs `docker exec <cid> …` and returns
|
||||
`proc.stdout` **without checking `returncode`**. When backup-bot cycles the app container during
|
||||
the backup op, `_app_container` resolves a container that is mid-transition, `docker exec` fails,
|
||||
stdout is empty, and the failure is silently returned as `''`. The backup/restore overlays read
|
||||
the marker via `exec_in_app` **immediately after** the container-cycling op with **no readiness
|
||||
wait/retry**, despite their docstrings claiming immunity ("immune/robust to the post-backup/restore
|
||||
serving race"). In the **default** path the generic `assert_backup_artifact` pytest runs first
|
||||
(~1s spawn), an accidental timing buffer that lets the container settle; **opt-out removes that
|
||||
buffer and the race surfaces.** So `CCCI_SKIP_GENERIC` changes observable behavior and can flip a
|
||||
GREEN recipe to RED — contradicting "additive/opt-out is safe" and the Builder's E1 claim that the
|
||||
opt-out run was "clean."
|
||||
**Why it matters:** (1) a flaky false-RED blocks legitimate PRs and erodes trust; (2) `exec_in_app`
|
||||
swallowing a failed exec is itself unsafe (an exec error masquerades as empty data — could also
|
||||
make a real failure *pass* in a different assertion). Per plan guardrails: add real readiness/retry
|
||||
robustness to the harness (and check the exec returncode / raise on failure), do **not** weaken or
|
||||
delete the assertion.
|
||||
**Repro:** `cd <repo> && CCCI_SKIP_GENERIC=1 RECIPE=custom-html STAGES=install,backup,restore
|
||||
cc-ci-run runner/run_recipe_ci.py` → backup tier intermittently `'' == 'original'`.
|
||||
**Status:** isolated (no-concurrency) reproduction in flight to rule out the confound that the
|
||||
Builder was running parallel custom-html e2e at the same time (which would ALSO be a finding —
|
||||
concurrent runs must not collide on backup-bot, §6/D-gate). Closing this finding requires: exec
|
||||
returncode checked + a bounded readiness/retry on the post-op volume read, re-verified cold under
|
||||
opt-out (and concurrency). **E1/HC3 PASS withheld pending fix.**
|
||||
- [ ] **F1e-1 [adversary]** — *`lifecycle.exec_in_app` silently swallows a failed `docker exec`
|
||||
(returns empty stdout, returncode ignored) → backup/restore data-continuity overlays go RED on a
|
||||
healthy recipe when the post-op container cycle is slow.* Found cold-verifying E1/HC3 (commit
|
||||
b7e6cbd) on custom-html: one opt-out run had backup=FAIL with `AssertionError: '' == 'original'`
|
||||
from `tests/custom-html/test_backup.py::test_backup_captures_state` — the marker `cat` returned
|
||||
empty. **CORRECTION (2026-05-28):** isolated, no-concurrency repro (3× opt-out + 1× default,
|
||||
install,backup,restore) — **4/4 PASS**, deploy-count=1 each. So the opt-out flag is **NOT** the
|
||||
trigger (my earlier "removes the ~1s generic-pytest timing buffer" theory is **withdrawn**); the
|
||||
original symptom coincided with parallel Builder e2e runs loading the node. Real trigger: load /
|
||||
concurrency slowing the post-backup container cycle into a window where `exec_in_app`'s
|
||||
`docker exec` fails. The **static defect is the same** regardless of trigger.
|
||||
**Root cause (static):** `exec_in_app` runs `docker exec <cid> …` and returns `proc.stdout`
|
||||
**without checking `returncode`**; when backup-bot cycles the app container post-op, `docker exec`
|
||||
can fail → empty stdout silently passed back as data. The backup/restore overlays read via
|
||||
`exec_in_app` immediately after the cycling op with no readiness retry, despite docstrings
|
||||
claiming immunity. (Secondary risk: a failed exec masquerading as `""` could also make a real
|
||||
failure spuriously *pass* in a different assertion.)
|
||||
**Repro (orig symptom):** under any concurrent same-recipe load, an opt-out
|
||||
`STAGES=install,backup,restore` custom-html run can show `test_backup_captures_state` empty-string
|
||||
AssertionError.
|
||||
**Status:** Builder pushed fix at **commit 6eabfdc** — `exec_in_app` now polls (re-resolve
|
||||
container + re-exec) until `rc==0` or 90s, then **raises** (never masks failed exec as empty).
|
||||
No assertion weakened. Adversary fix-verification in flight on `/tmp/adv-fix`. **Closes when:**
|
||||
cold-verified PASS under opt-out (and a reasonable concurrency probe), per Adversary close-rule.
|
||||
|
||||
- [ ] **F1e-2 [adversary]** — *Two concurrent same-recipe runs collide on `~/.abra/recipes/<recipe>`
|
||||
(rm-rf + abra-fetch race).* Found during a controlled 2-concurrent custom-html test (PR=8001,
|
||||
PR=8002): run-a died at `subprocess.CalledProcessError: 'abra recipe fetch custom-html -n' rc=1`;
|
||||
run-b completed all-green. Cause: `runner/run_recipe_ci.py::fetch_recipe` does `rm -rf
|
||||
~/.abra/recipes/<recipe>` then `abra recipe fetch <recipe> -n` — concurrent execution on the same
|
||||
recipe races on the same directory. Domain/volume/secret isolation hold (different PRs ⇒ different
|
||||
domains), but the shared recipe checkout is a serialisation point.
|
||||
**Why it matters:** §6/D-gate requires "two concurrent !testme runs don't collide." Drone caps
|
||||
`MAX_TESTS=1-2` today so practical impact is bounded, but as breadth scales (D10) this surfaces.
|
||||
Pre-existing in 1d; orthogonal to E1/HC3; not blocking E1.
|
||||
**Fix direction:** per-run recipe snapshot dir (`~/.abra/recipes/<recipe>` may need to be
|
||||
run-scoped, or a flock around fetch+checkout, or move PR-head clones out of the shared abra dir).
|
||||
**Status:** Filed for HC4 / no-regression scope.
|
||||
|
||||
Reference in New Issue
Block a user