diff --git a/machine-docs/BACKLOG-1e.md b/machine-docs/BACKLOG-1e.md index aa6e28b..5da4af7 100644 --- a/machine-docs/BACKLOG-1e.md +++ b/machine-docs/BACKLOG-1e.md @@ -17,34 +17,41 @@ Phase-namespaced backlog. Builder edits `## Build backlog`; Adversary edits `## ## Adversary findings -- [ ] **F1e-1 [adversary]** — *opt-out is NOT behavior-neutral: the backup/restore data-continuity - overlays are racy and `exec_in_app` silently swallows a failed exec → a healthy recipe goes RED.* - Found cold-verifying E1/HC3 (commit b7e6cbd). My cold e2e of custom-html - `STAGES=install,upgrade,backup,restore,custom`: - - **default** (generic additive): all tiers PASS, deploy-count=1. ✓ - - **`CCCI_SKIP_GENERIC=1`** (opt-out): generic skipped on every tier (0 `_generic/` files ran), - deploy-count=1 ✓ — BUT **backup=FAIL**: `tests/custom-html/test_backup.py::test_backup_captures_state` - → `AssertionError: '' == 'original'` (the `exec_in_app(... cat ci-marker.txt)` returned **empty**). - **Root cause (static):** `lifecycle.exec_in_app` runs `docker exec …` and returns - `proc.stdout` **without checking `returncode`**. When backup-bot cycles the app container during - the backup op, `_app_container` resolves a container that is mid-transition, `docker exec` fails, - stdout is empty, and the failure is silently returned as `''`. The backup/restore overlays read - the marker via `exec_in_app` **immediately after** the container-cycling op with **no readiness - wait/retry**, despite their docstrings claiming immunity ("immune/robust to the post-backup/restore - serving race"). In the **default** path the generic `assert_backup_artifact` pytest runs first - (~1s spawn), an accidental timing buffer that lets the container settle; **opt-out removes that - buffer and the race surfaces.** So `CCCI_SKIP_GENERIC` changes observable behavior and can flip a - GREEN recipe to RED — contradicting "additive/opt-out is safe" and the Builder's E1 claim that the - opt-out run was "clean." - **Why it matters:** (1) a flaky false-RED blocks legitimate PRs and erodes trust; (2) `exec_in_app` - swallowing a failed exec is itself unsafe (an exec error masquerades as empty data — could also - make a real failure *pass* in a different assertion). Per plan guardrails: add real readiness/retry - robustness to the harness (and check the exec returncode / raise on failure), do **not** weaken or - delete the assertion. - **Repro:** `cd && CCCI_SKIP_GENERIC=1 RECIPE=custom-html STAGES=install,backup,restore - cc-ci-run runner/run_recipe_ci.py` → backup tier intermittently `'' == 'original'`. - **Status:** isolated (no-concurrency) reproduction in flight to rule out the confound that the - Builder was running parallel custom-html e2e at the same time (which would ALSO be a finding — - concurrent runs must not collide on backup-bot, §6/D-gate). Closing this finding requires: exec - returncode checked + a bounded readiness/retry on the post-op volume read, re-verified cold under - opt-out (and concurrency). **E1/HC3 PASS withheld pending fix.** +- [ ] **F1e-1 [adversary]** — *`lifecycle.exec_in_app` silently swallows a failed `docker exec` + (returns empty stdout, returncode ignored) → backup/restore data-continuity overlays go RED on a + healthy recipe when the post-op container cycle is slow.* Found cold-verifying E1/HC3 (commit + b7e6cbd) on custom-html: one opt-out run had backup=FAIL with `AssertionError: '' == 'original'` + from `tests/custom-html/test_backup.py::test_backup_captures_state` — the marker `cat` returned + empty. **CORRECTION (2026-05-28):** isolated, no-concurrency repro (3× opt-out + 1× default, + install,backup,restore) — **4/4 PASS**, deploy-count=1 each. So the opt-out flag is **NOT** the + trigger (my earlier "removes the ~1s generic-pytest timing buffer" theory is **withdrawn**); the + original symptom coincided with parallel Builder e2e runs loading the node. Real trigger: load / + concurrency slowing the post-backup container cycle into a window where `exec_in_app`'s + `docker exec` fails. The **static defect is the same** regardless of trigger. + **Root cause (static):** `exec_in_app` runs `docker exec …` and returns `proc.stdout` + **without checking `returncode`**; when backup-bot cycles the app container post-op, `docker exec` + can fail → empty stdout silently passed back as data. The backup/restore overlays read via + `exec_in_app` immediately after the cycling op with no readiness retry, despite docstrings + claiming immunity. (Secondary risk: a failed exec masquerading as `""` could also make a real + failure spuriously *pass* in a different assertion.) + **Repro (orig symptom):** under any concurrent same-recipe load, an opt-out + `STAGES=install,backup,restore` custom-html run can show `test_backup_captures_state` empty-string + AssertionError. + **Status:** Builder pushed fix at **commit 6eabfdc** — `exec_in_app` now polls (re-resolve + container + re-exec) until `rc==0` or 90s, then **raises** (never masks failed exec as empty). + No assertion weakened. Adversary fix-verification in flight on `/tmp/adv-fix`. **Closes when:** + cold-verified PASS under opt-out (and a reasonable concurrency probe), per Adversary close-rule. + +- [ ] **F1e-2 [adversary]** — *Two concurrent same-recipe runs collide on `~/.abra/recipes/` + (rm-rf + abra-fetch race).* Found during a controlled 2-concurrent custom-html test (PR=8001, + PR=8002): run-a died at `subprocess.CalledProcessError: 'abra recipe fetch custom-html -n' rc=1`; + run-b completed all-green. Cause: `runner/run_recipe_ci.py::fetch_recipe` does `rm -rf + ~/.abra/recipes/` then `abra recipe fetch -n` — concurrent execution on the same + recipe races on the same directory. Domain/volume/secret isolation hold (different PRs ⇒ different + domains), but the shared recipe checkout is a serialisation point. + **Why it matters:** §6/D-gate requires "two concurrent !testme runs don't collide." Drone caps + `MAX_TESTS=1-2` today so practical impact is bounded, but as breadth scales (D10) this surfaces. + Pre-existing in 1d; orthogonal to E1/HC3; not blocking E1. + **Fix direction:** per-run recipe snapshot dir (`~/.abra/recipes/` may need to be + run-scoped, or a flock around fetch+checkout, or move PR-head clones out of the shared abra dir). + **Status:** Filed for HC4 / no-regression scope. diff --git a/machine-docs/REVIEW-1e.md b/machine-docs/REVIEW-1e.md index 59d51d6..9ad4bf8 100644 --- a/machine-docs/REVIEW-1e.md +++ b/machine-docs/REVIEW-1e.md @@ -63,12 +63,33 @@ deploy-count stays 1; two e2e (default + opt-out) "clean." overlay-only, **deploy-count=1** ✓ — **but backup=FAIL**: `test_backup_captures_state` → `AssertionError: '' == 'original'`. Same code/recipe; only diff is the opt-out flag. -**Verdict: FAIL — opt-out is not behavior-neutral.** Opting out of the generic removes an accidental -~1s timing buffer (the generic pytest spawn) and surfaces a real race: the backup/restore overlays -read the marker via `exec_in_app` immediately after a container-cycling op with no readiness/retry, and -`exec_in_app` silently returns empty stdout on a failed `docker exec` (returncode ignored). A healthy -recipe can thus be reported RED under opt-out. Filed **F1e-1 [adversary]** (BACKLOG-1e) with root cause -+ repro + fix direction (check exec returncode + bounded readiness retry; do NOT weaken the assertion). -Isolated (no-concurrency) reproduction in flight to rule out the parallel-Builder-run confound — which -would itself be a concurrency-collision finding. **HC3 PASS withheld until F1e-1 is fixed + re-verified -cold under opt-out.** +**Interim verdict (commit 4334e19): FAIL — opt-out flipped backup RED**, theorised cause was the +opt-out path removing an accidental ~1s generic-pytest timing buffer. **Filed F1e-1.** + +### CORRECTION @2026-05-28 (isolated repro disproved the opt-out theory) +Isolated, no-concurrency repro of `STAGES=install,backup,restore` on custom-html: +- **opt-out × 3** (`CCCI_SKIP_GENERIC=1`): backup PASS, restore PASS, deploy-count=1. **3/3.** +- **default × 1**: backup PASS, restore PASS, deploy-count=1. + +So opting out of the generic is **NOT** what flips the backup RED — the original symptom occurred while +the Builder was running concurrent custom-html e2e on the same node. The real trigger is **load / +concurrency** putting the post-backup container cycle into a window where `exec_in_app`'s `docker exec` +fails. The **static defect stays the same** (and the fix direction in F1e-1 is still correct): +`exec_in_app` silently returns empty stdout on a failed exec (returncode ignored) + no readiness retry. +F1e-1 reframed in BACKLOG-1e; my earlier "opt-out is not behavior-neutral" framing is **withdrawn**. + +### Builder's fix (commit 6eabfdc) — verification pending +`exec_in_app` now polls (re-resolves container + re-execs) until `rc==0` or 90s, then **raises** — +never masks a failed exec as empty data. No assertion weakened. Same commit also lands HC1 plumbing +(`chaos_redeploy`, `recipe_head_commit`, `.chaos-version` parsing in `deployed_identity`, head_ref +match in `assert_upgraded`) — out-of-scope for this re-verification, will check at E2 claim. + +Fix-verify in flight on `/tmp/adv-fix` (HEAD 6eabfdc shipped): opt-out install,backup,restore on +custom-html. Will close F1e-1 + finalise E1/HC3 verdict once verified. + +### Separate observation while testing (NOT F1e-1) +A controlled 2-concurrent same-recipe test (PR=8001/PR=8002, both custom-html) on the **OLD** code +showed run-a die in `abra recipe fetch custom-html -n` (rc=1) — concurrent rm-rf + abra-fetch on the +same `~/.abra/recipes/custom-html` collide. Pre-existing (in 1d too), orthogonal to E1/HC3, not the +F1e-1 trigger. Filing separately as **F1e-2 [adversary]** for HC4 visibility (§6 D-gate requires +concurrent runs to be safe). Drone caps `MAX_TESTS=1-2` today, so practical impact is bounded.