diff --git a/BACKLOG-conc.md b/BACKLOG-conc.md
index e3efe0e..a1337a8 100644
--- a/BACKLOG-conc.md
+++ b/BACKLOG-conc.md
@@ -19,4 +19,42 @@ ## Adversary findings
-(adversary-owned)
+### [adversary] CONC-A1 — double-!testme same domain corrupts the shared deploy-count file (M2(c) FAIL)
+
+**Severity:** blocks M2(c). Both runs of a same-domain double-!testme go RED.
+
+**Root cause (two coupled defects, one shared root):**
+1. The DG4.1 deploy-counter file is keyed by DOMAIN in the *shared* system tempdir, NOT per-run:
+   `run_recipe_ci.py:930 countfile = /tmp/ccci-deploys-`. P3 isolated `ABRA_DIR` per run
+   but this per-run state file was missed — it predates the restructure (ef44d46) and the OLD
+   recipe-flock used to serialize same-recipe runs end-to-end, incidentally masking it.
+2. `lifecycle.deploy_app()` calls `_record_deploy()` (lifecycle.py:250) BEFORE
+   `acquire_app_lock(domain)` (lifecycle.py:254, introduced by P2 b302f3a). So the counter
+   increment happens OUTSIDE the serialization window — a second same-domain run bumps the
+   shared counter before it ever blocks on the lock.
+
+**Observed (live, builds 279 + 281, immich PR#2, same domain immi-ad3e33, 2026-06-10T05:04Z):**
+- Lock serialization itself WORKS: 281 logged `== app lock: ... in flight — waiting ==` at 2s,
+  then `== app lock: acquired ==` at 194s — exactly when 279 exited (279 finished 05:07:35).
+- 279 RED: `!! deploy-count 2 != 1 (DG4.1 violation)`. The `2` = 281's pre-lock `_record_deploy`
+  (fired ~2s, before 281 blocked) polluting the shared counter 279 was actively using.
+- 281 RED: `FileNotFoundError: /tmp/ccci-deploys-immi-ad3e33...` at run_recipe_ci.py:1213 —
+  279's end-of-run `os.remove(countfile)` (line 1215) deleted the shared file out from under 281,
+  whose single `_record_deploy` had already fired at 2s and never recreates it.
+- Control: isolated immich (build 275, same fixed wrapper) → `deploy-count = 1`, GREEN. So this
+  is concurrency-specific, not a pre-existing immich/wrapper issue.
+
+**Repro:** two `!testme` comments on the same recipe PR (same domain) in quick succession on the
+deployed main harness → both builds RED (one DG4.1 false-violation, one FileNotFoundError).
+
+**Fix direction (Builder owns):** key the deploy-counter per RUN, not per domain — e.g. put it in
+`/var/lib/cc-ci-runs//` (alongside the per-run artifacts) or include the build/run id in the
+filename, and export that path via `CCCI_DEPLOY_COUNT_FILE`. Per-run keying fixes BOTH defects at
+once (no cross-run pollution; no shared remove). Moving `_record_deploy()` after `acquire_app_lock`
+alone is INSUFFICIENT — the shared `os.remove`/`FileNotFoundError` collision survives. Add a
+tests/concurrency case: two same-domain runs serialized on the app lock → each sees its own
+deploy-count, neither removes the other's file (this is the gap vs the 19 planned cases — case 4
+serialises acquire but never asserts deploy-count isolation across the two).
+
+**Closure:** adversary-owned. Re-test the (c) double-!testme live (both GREEN, visible block line,
+zero leakage) + the new unit case before this clears. Only I close it.
diff --git a/REVIEW-conc.md b/REVIEW-conc.md
index 728d760..7a5049b 100644
--- a/REVIEW-conc.md
+++ b/REVIEW-conc.md
@@ -258,3 +258,49 @@ real drone exec shell). main now = d3fe9e2 + this .drone.yml wrapper fix; the fi
 Open for the formal M2 verdict: re-confirm lint green on the new .drone.yml (yamllint), the push build green, and live (a) cancel-no-leak / (b) parallel both-green / (c) double-!testme blocks / (d) one full green run — cold, once the Builder posts the M2 claim with evidence.
+
+## M2(c): FAIL @2026-06-10T08:10Z — double-!testme same domain corrupts shared deploy-count → both runs RED + VETO
+
+Proactive cold break-it probe of the live M2 evidence (M2 not yet formally `claim(conc)`'d — the
+Builder's JOURNAL shows (c) "triggered" but NOT evidenced as PASS; I went straight to the Drone API
+to verify the in-flight (c) runs independently, not to the JOURNAL narrative). I found a REAL defect
+that breaks M2(c). Filed as BACKLOG-conc CONC-A1.
+
+EVIDENCE (Drone API, recipe-maintainers/cc-ci, cold via /run/secrets/bridge_drone_token — my own
+access path, not the Builder's word):
+- (c) = builds **279 + 281**, both `event=custom PR=2 RECIPE=immich REF=a92b28d…` → SAME domain
+  `immi-ad3e33.ci.commoninternet.net`. Both `status=failure` (step `ci` exit_code=1).
+- 281 (the blocked run): log `== app lock: ... in flight — waiting ==` @2s → `== acquired ==` @194s,
+  which is exactly when 279's process exited (279 finished 05:07:35Z). **Lock serialisation + the
+  visible block line WORK** — that half of (c) is fine.
+- 279 RED: `!! deploy-count 2 != 1 (DG4.1 violation)`.
+- 281 RED: `FileNotFoundError: /tmp/ccci-deploys-immi-ad3e33….ci.commoninternet.net` at
+  run_recipe_ci.py:1213.
+- Control build 275 (isolated immich, same fixed wrapper) → `deploy-count = 1`, GREEN. Confirms the
+  failure is concurrency-specific, NOT a pre-existing immich/wrapper regression.
+
+ROOT CAUSE (code, confirmed):
+- DG4.1 counter file is DOMAIN-keyed in shared /tmp, not per-run: `run_recipe_ci.py:930
+  /tmp/ccci-deploys-`. P3 isolated ABRA_DIR per run but this per-run state file was missed
+  (predates the restructure, ef44d46; the old recipe-flock serialised same-recipe runs end-to-end,
+  masking it).
+- `deploy_app()` calls `_record_deploy()` (lifecycle.py:250) BEFORE `acquire_app_lock()` (:254,
+  introduced by P2 b302f3a) → the increment races OUTSIDE the lock. 281's single pre-lock
+  `_record_deploy` (@2s) bumps the shared counter 279 is using (→2, false violation), and 279's
+  end-of-run `os.remove(countfile)` (:1215) deletes the file under 281 → FileNotFoundError.
+- Interleaving is fully reconstructed and self-consistent with the build timestamps (see CONC-A1).
+
+This is squarely in M2(c) scope: the plan's DoD (c) requires the second run to "block … then RUN"
+(implicitly green), and the phase's whole premise is "two concurrent !testme don't collide on
+domain/volume/secrets." This is a domain-keyed-state collision — the restructure's narrower domain
+lock no longer covers the deploy-count file. M1 (code/suites/lint/diff of d3fe9e2) is unaffected —
+this is a live concurrency behavior M1's checks could not exercise; the tests/concurrency suite has
+the matching blind spot (case 4 serialises acquire but never asserts deploy-count isolation across
+two same-domain runs).
+
+## VETO — M2 may NOT be marked DONE until CONC-A1 is fixed and I log a fresh (c) PASS
+Forbidding `## DONE` in STATUS-conc until: (1) deploy-counter keyed per-run; (2) a tests/concurrency
+case asserts same-domain deploy-count isolation; (3) live (c) re-run shows BOTH builds GREEN with
+the visible block line and zero leakage; (4) (a),(b),(d) re-confirmed unaffected. Only I clear this.
+(After this verdict I may consult JOURNAL-conc to contextualise — noting I had NOT read the (c)
+journal reasoning before forming this FAIL; I verified from the Drone API + code directly.)
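
The per-run keying proposed in CONC-A1's fix direction can be sketched roughly as below. This is an illustration under stated assumptions, not the harness's code: `count_file_for_run`, `record_deploy`, `finish_run`, and `simulate_double_testme` are hypothetical names, and the one-directory-per-run layout is assumed. The sketch replays the observed interleaving (the second run records its deploy before the first run finishes and cleans up) and shows that each run ends with its own count.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def count_file_for_run(runs_root: Path, run_id: str) -> Path:
    """Hypothetical helper: one counter file per run, never shared across runs."""
    run_dir = runs_root / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir / "deploy-count"


def record_deploy(count_file: Path) -> int:
    """Increment this run's counter and return the new value."""
    current = int(count_file.read_text()) if count_file.exists() else 0
    count_file.write_text(str(current + 1))
    return current + 1


def finish_run(count_file: Path) -> int:
    """Read the final count, then remove only this run's own file."""
    count = int(count_file.read_text()) if count_file.exists() else 0
    count_file.unlink(missing_ok=True)
    return count


def simulate_double_testme(runs_root: Path) -> tuple[int, int]:
    """Replay the interleaving from the report with per-run counter files."""
    file_a = count_file_for_run(runs_root, "run-a")  # stands in for the first build
    file_b = count_file_for_run(runs_root, "run-b")  # stands in for the second build
    record_deploy(file_a)         # first run deploys while holding the app lock
    record_deploy(file_b)         # second run records before it blocks on the lock
    count_a = finish_run(file_a)  # first run's cleanup touches only its own file
    count_b = finish_run(file_b)  # second run's file is still present
    return count_a, count_b


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(simulate_double_testme(Path(tmp)))  # prints (1, 1)
```

With a single domain-keyed file, the same sequence would give the first run a count of 2 and leave the second run with no file to read, which matches the two failures reported. In the harness the per-run path would presumably be handed over via `CCCI_DEPLOY_COUNT_FILE`, as the fix direction suggests; the sketch passes the path explicitly to stay self-contained.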