Files
cc-ci/REVIEW-conc.md
autonomic-bot 9987fba4b6
All checks were successful
continuous-integration/drone/push Build is passing
review(conc): M2 PASS — merged + live-verified (a)-(d) on final main 139e319; M1+M2 both fresh PASS, no open veto — DONE unblocked
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:55:19 +00:00

32 KiB
Raw Blame History

REVIEW-conc.md — Adversary ledger, concurrency-restructure phase

Append-only. Verdicts: <gate>: PASS @<ts> + evidence, or FAIL + [adversary] finding in BACKLOG-conc.md. SSOT for what is verified: /srv/cc-ci/cc-ci-plan/concurrency-restructure-full-plan.md.

2026-06-10T04:00Z — Adversary online; baseline pre-read (no gate pending)

Pulled main @5b65c6c. No STATUS-conc.md, no restructure/concurrency branch — nothing claimed yet. Pre-read the CURRENT system (docs/concurrency.md @5b65c6c + lifecycle.py/run_recipe_ci.py) to anchor my later diff review in the as-is code, not the Builder's narrative.

Current-system facts I will hold the restructure against:

  • Registry symbols slated for deletion (will grep for dangling refs at M1): register_run_app (lifecycle.py:69, call site :283), unregister_run_app (:78, call sites :723, :766), _run_owner_state (:83), ACTIVE_RUN_DIR (:43), CCCI_JANITOR_MAX_AGE (janitor :738), acquire_recipe_lock (:46, call site run_recipe_ci.py:843), RECIPE_LOCK_DIR (:42).
  • Must survive untouched: RUN_APP_RE (lifecycle.py:26) allowlist semantics (warm/canonical apps never probed), services_converged() paused-is-settled logic, docker-service sweep discovery, teardown_app(verify=False) idempotence.
  • M1 verification plan (cold, my clone): checkout branch; pytest tests/unit -q, pytest tests/concurrency -q, scripts/lint.sh; full diff review hunting: probe-vs-acquire ordering races, signal-handler reentrancy (SIGTERM during teardown / SIGALRM during SIGTERM), teardown-during-teardown, lock-fd lifetime (object dropped → GC closes fd → lock silently released), symlinked servers/ write conflicts, janitor unlink-vs-reacquire race (unlink while a waiter blocks on the old inode → two "held" locks on different inodes for one domain), PDEATHSIG-after-fork ordering (prctl before ppid check), alarm(0) vs teardown duration, setsid wrapper trap semantics under drone cancel, test-suite blind spots vs the 19 planned cases.
  • Tests/concurrency must NOT be wired into the default pytest tests/unit gate (plan decision).
  • M2 (post-merge, live): cancel-mid-run leak check, parallel immich#2+plausible#3, double-!testme same PR blocks visibly, one full green run. NEVER merge/push recipe mirror repos.

No verdict yet — waiting for Builder bootstrap/claim.

2026-06-10T04:05Z — cold-verify environment established (prep, no gate)

Builder seeded STATUS/BACKLOG/JOURNAL-conc; STATUS says P1 in flight, no gate claimed. Mapped the test-execution environment I'll use for the M1 cold run so a time-sensitive gate isn't spent debugging tooling:

  • Local VM devshell (nix develop) has only lintTools (no pytest). So pytest does NOT run here.
  • pytest 8.3.3 + playwright live in the host pyEnv (nix/modules/harness.nix) exposed as cc-ci-run on cc-ci. cc-ci-run -m pytest <path> -q works as the real harness interpreter (verified: cc-ci-run -c "import pytest" -> 8.3.3).
  • .drone.yml lint stage runs nix develop .#lint --command bash scripts/lint.sh.
  • COLD M1 PLAN: fresh git clone/checkout of restructure/concurrency into a throwaway dir ON cc-ci → cc-ci-run -m pytest tests/unit -q + cc-ci-run -m pytest tests/concurrency -q + nix develop .#lint --command bash scripts/lint.sh, all from that clean checkout (not the Builder's working tree). Then adversarial diff review per my baseline hit-list.
  • Baseline .drone.yml on main is still the pre-restructure version (concurrency.limit=2, acquire_recipe_lock / /run/cc-ci-active registry referenced) — confirms P1/P4 edits are branch-only so far. Good.

2026-06-10T04:23Z — early pre-review of P1+P2 (branch @b302f3a, NO gate claimed — NOT a verdict)

Builder has pushed P1 (b492f99) + P2 (b302f3a) to restructure/concurrency; P3/P4/P5/tests still pending, so M1 is not claimable and this is NOT a PASS — it's pre-review to front-load the M1 diff audit and avoid re-doing it under gate time pressure. Read code/diff + git only; did NOT read JOURNAL (anti-anchoring intact). I actively tried to break the following and each concern was REFUTED:

  1. Green-on-red via the .drone.yml EXIT trap (my lead hypothesis). The wrapper is setsid cc-ci-run … & PID=$!; trap 'kill -TERM -- -$PID' TERM EXIT; wait $PID. I worried the EXIT trap's final kill status would override the harness exit code and mask a failing run. EMPIRICALLY TESTED (4 bash repros incl. failing harness with a lingering group member that makes kill succeed=0): bash PRESERVES the pre-trap exit status when the EXIT trap doesn't call exit. Exit code propagates correctly in all cases (RED stays RED, GREEN stays GREEN). Refuted.
  2. P2 unlink/reacquire inode race (janitor unlinks a reaped orphan's lockfile while a new run blocks on the old inode). Handled: both acquire_app_lock and _probe_and_reap recheck fstat(fd).st_ino == stat(path).st_ino after acquiring and retry/bail on mismatch — a lock on an unlinked (anonymous) inode is never treated as authoritative, and the path's lockfile is never unlinked out from under a newer run. Refuted.
  3. Half-reaped/new-app coexistence. Reap runs WHILE HOLDING the probe lock; a new same-domain run blocks in acquire_app_lock until reap completes. The pre-deploy window (lock held, app not yet created) is covered: the stale-lockfile sweep sees the held lock (BlockingIOError) and leaves it. Refuted.
  4. Signal mid-normal-teardown aborting cleanup. begin_teardown() is the FIRST line of BOTH finally blocks (run_recipe_ci.py:663 run_quick, :1134 main); the _funnel_handler swallows (logs+returns) any SIGTERM/SIGALRM once tearing_down is set, so a second signal can't abort the cleanup the first asked for. install_lifetime_guards() is the FIRST statement of main() (:829), before any abra/lock call, with prctl→ppid==1 recheck in the correct order. Refuted.

Open items to confirm AT M1 (cold, full suite) — NOT defects, just unverified-until-then:

  • datetime import removed from lifecycle.py along with _stack_age_seconds — grep for any remaining datetime use (ruff would catch an undefined name; confirm import truly orphaned).
  • _stack_name / age-fallback deadcode after the janitor rewrite — confirm no dangling refs.
  • Registry-symbol deletion is only PARTIAL on this commit: acquire_recipe_lock still present (P3 deletes it); register/unregister/_run_owner_state/ACTIVE_RUN_DIR/CCCI_JANITOR_MAX_AGE are gone — full dangling-ref grep belongs at M1 once P3 lands.
  • setsid-fork edge: if setsid ever forks (only when it's a pgrp leader; not the case for a backgrounded job in a non-job-control drone shell), $PID would be the intermediate and the harness would reparent to ppid==1 and self-abort. Live-verify the trap+cancel path at M2(a).
  • begin_teardown is process-global module state (lifetime._state) — fine for one harness process; the tests/concurrency suite must not import-share it across in-process cases (verify at M1).

2026-06-10T04:32Z — pre-review P3+P4 (branch @91d3cc7, NO gate claimed — NOT a verdict)

Builder pushed P3 (17ebdf3 per-run ABRA_DIR) + P4 (91d3cc7 config cleanup). tests/concurrency + P5 docs still pending, so M1 still not claimable. Continued the front-loaded diff audit (code/git only; JOURNAL still unread). Findings — all CLEAN:

  • Dangling-ref grep across runner/bridge/dashboard/nix = ZERO hits for all 9 deleted symbols: acquire_recipe_lock, register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE, RECIPE_LOCK_DIR, _stack_age_seconds, _registry_path. The orphaned datetime import is also gone from lifecycle.py. Clean deletion.
  • Path centralization: all ~/.abra/recipes/<recipe> literals replaced by abra.recipe_dir() (resolves $ABRA_DIR else ~/.abra) across abra.py (recipe_checkout, has_lightweight_version_tags, recipe_head_commit, recipe_versions), generic._recipe_dir, lifecycle.prepull_images, snapshot_recipe_tests, fetch_recipe. prepull's env_path stays canonical ~/.abra/servers/... which is correct (servers/ is the shared symlink target).
  • Ordering verified (main(), the only structural risk): install_lifetime_guards() is the FIRST stmt (873); between it and setup_run_abra_dir() (891) there are ONLY env reads + a print — no abra call; ABRA_DIR is exported at 891 BEFORE fetch_recipe (892) and before the first path-helper recipe_head_commit (895). The --quick dispatch (run_quick, ~908) is AFTER 891, so the quick lane inherits the per-run ABRA_DIR too. No tree is touched before ABRA_DIR is set.
  • Manual-run isolation: rid=="manual" → "manual-" so two hand-runs don't share a tree.

Open items to confirm AT M1 (cold) — not defects:

  • setup_run_abra_dir symlink idempotency: if not os.path.islink(link): os.symlink(...) — if a NON-symlink file pre-exists at servers/catalogue (reused run dir from a crashed partial), symlink raises FileExistsError. Low risk (fresh run-id per Drone build) but worth a glance.
  • CCCI_SKIP_FETCH=1 now rm -rf dest + copytree(canonical, dest, symlinks=True) — confirm the --quick rollback-proof staging tests still pass (they set CCCI_SKIP_FETCH).
  • tests/{ghost,discourse}/install_steps.sh RECIPE_DIR=${ABRA_DIR:-$HOME/.abra} mechanical path fix — confirm it changed NO assertion/gate (guardrail: never weaken recipe-test gates). Diff-check.

Net: the entire P1P4 diff has been pre-audited and is clean against my break-it hit-list. M1 cold run, once claimed (after tests/concurrency + P5 land), reduces to: fresh checkout on cc-ci → cc-ci-run -m pytest tests/unit -q + cc-ci-run -m pytest tests/concurrency -q + lint, plus a focused review of only the tests/concurrency suite (vs the 19 planned cases) and the P5 doc delta.

M1: PASS @2026-06-10T04:38Z — implementation verified (branch restructure/concurrency @d3fe9e2)

Verdict formed from the plan (SSOT), the code/git, the STATUS claim's verify recipe, and my own COLD acceptance run — WITHOUT reading JOURNAL first (anti-anchoring honored; noting here that I had NOT consulted JOURNAL-conc at verdict time).

COLD ENVIRONMENT: fresh git clone --branch restructure/concurrency into /tmp/adv-m1 on cc-ci (NOT the Builder's tree); git rev-parse HEAD == d3fe9e26bb0fbaedb37383539ba3973bc1c80aff (matches claim), git status clean. Ran via the host cc-ci-run pyEnv (pytest 8.3.3 + playwright) and the pinned .#lint devshell.

ACCEPTANCE RESULTS (expected → observed):

  • cc-ci-run -m pytest tests/unit -q → 138 passed in 4.72s ✓ (claim: 138 passed)
  • cc-ci-run -m pytest tests/concurrency -q → 20 passed in 9.91s ✓ (claim: 20 passed)
  • nix develop .#lint --command bash scripts/lint.shlint: PASS
  • pytest tests/unit --collect-only concurrency items → 0 ✓ (suite NOT in default gate)
  • dangling-ref grep (register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE, acquire_recipe_lock, RECIPE_LOCK_DIR, _stack_age_seconds) over .py/.nix/.yml/.sh → ZERO hits outside docs/ ✓

GATE-INTEGRITY (guardrails honored):

  • RUN_APP_RE regex unchanged (lifecycle.py:26, identical pattern); warm/canonical apps still never become probe candidates (test_11 asserts no lockfiles even created for warm names).
  • services_converged() / paused-is-settled / backup_app() waits: NOT in the code diff — all RUN_APP_RE/services_converged/paused diff hits are docs/concurrency.md prose (P5 rewrite).
  • teardown_app ordering untouched; only its trailing unregister call removed (registry gone).
  • Only tests/<recipe>/ change is the mechanical RECIPE_DIR=${ABRA_DIR:-$HOME/.abra}/... line in ghost+discourse install_steps.sh — NO assertion/gate touched (diff-confirmed). Guardrail "never weaken recipe-test gates / touch tests// content" honored.
  • P4: concurrency.limit block removed from .drone.yml; drone-runner.nix comment makes DRONE_RUNNER_CAPACITY the single knob.

ADVERSARIAL DIFF REVIEW (P1P4 pre-audited in the two notes above; refuted: green-on-red exit-code masking [empirically tested], unlink/reacquire inode race [fstat==stat identity recheck], half-reaped coexistence [reap-under-probe-lock], signal-mid-teardown reentrancy [begin_teardown first line of both finally blocks], guard/ABRA_DIR/fetch ordering [no abra call pre-export]).

TEST-SUITE AUDIT vs the 19 plan cases: real kernel flocks, NEVER mocked (only teardown_app + abra-discovery stubbed, both disclosed). Coverage complete: cases 14 test_locks, 512 test_janitor, 1316 test_lifetime, 1719 test_abra_dir, +test_18b (manual-pid isolation) = 20. Assertions are substantive, not tautological: exact funnel exit codes 142/143 (test_15/16), reap-vs-new-run timestamp ordering + fresh-inode lock_state=="held" (test_7), two-janitor arbitration via separate open()s (test_8 — valid: flock binds the open file description, so threads-with-distinct-fds model processes), long-held mtime-backdate flag-not-steal (test_10), PEP 446 fd non-inheritance with a surviving child (test_3), divergent per-run trees + canonical untouched (test_18).

INDEPENDENT PROBE (my own driver, NOT the Builder's helpers.py): drove the real lifecycle.acquire_app_lock from a standalone script with a sandbox CCCI_APP_LOCK_DIR on cc-ci → state held after acquire; a second acquirer BLOCKED while the first held (no ack2 after 1.5s); after SIGKILL of the holder the second acquired within 10s (kernel auto-release). Core invariant confirmed against the real code, not just the Builder's tests.

NON-BLOCKING NOTES (carry to M2 live-verify; none gate M1):

  • setsid-fork edge in the .drone.yml trap wrapper: if setsid ever forks (only when it's a pgrp leader — not the case for a backgrounded job in a non-job-control drone shell), $PID would be the intermediate and the harness could reparent (ppid==1) and self-abort. MUST be live-verified by the actual drone-cancel path at M2(a) — the plan already flags this ("verify drone exec runner signal delivery; the trap must fire on drone cancel"). Not unit-testable here.
  • End-of-janitor stale-lockfile tidy sweep (appless leftover lockfile unlink) is not directly covered by a named test (not one of the 19); low risk (tidiness only). Noted, not a defect.
  • test_14 (ppid race) depends on the helper reparenting to pid 1; under a subreaper it marks NEVER_REPARENTED and FAILS VISIBLY (never false-passes). Passed in this env.

CONCLUSION: M1 — implementation verified — PASS. M2 (merge to main + live verification ad) is unblocked. Reminder for both loops: recipe-mirror PRs are !testme targets only — never merge/push them. (After this verdict I may consult JOURNAL-conc to contextualize, per §6.1.)

2026-06-10T04:49Z — M2 merge integrity pre-check (M2 NOT yet claimed — not a verdict)

Builder merged the branch to main (merge commit bb5eb3d, 2 parents 83a6c6e∘d3fe9e2, no force) after my M1 PASS, and is mid-M2 live verification (journal: M2(a) cancel-mid-run evidence, (b) parallel runs triggered). No claim(conc): M2 commit yet; STATUS-conc still shows the stale M1 line (Builder's file — will update at the M2 claim). Independent merge check:

  • git diff bb5eb3d d3fe9e2 -- runner/ .drone.yml docs/concurrency.md tests/ nix/ = EMPTY → the merge preserved EXACTLY the code I cold-verified at M1. No conflict-resolution drift introduced.
  • git merge-base --is-ancestor d3fe9e2 bb5eb3d = true. So deployed main == M1-verified tree. At the M2 claim I therefore re-verify only LIVE behavior + the push build, not the code again: push build green; (a) cancel mid-run → no leaked python/lock, next janitor reaps the app, zero leakage; (b) two parallel !testme (immich#2 + plausible#3) → both green, zero leakage; (c) double-!testme same PR → 2nd blocks on the app lock (visible in its drone log) then runs; (d) one full green end-to-end run. Evidence to come from Drone build logs + cc-ci state (abra app ls / lslocks / docker), cold from my own access path.

2026-06-10T05:00Z — wrapper exit-code fix verified + CORRECTION to my P1 pre-review (inbox consumed)

Consumed ADVERSARY-INBOX.md (deleted) — Builder reported an M2 live-verify finding + fix. Folded in:

The defect (real, Builder-found, build 269 plausible#3): the drone exec step shell is set -e. On a NORMAL (green) harness exit the P1 EXIT trap still fired and its kill -TERM -- -$PID of the already-exited process group returned ESRCH (exit 1), which under set -e poisoned the step's exit status to 1 — a fully GREEN run (all tiers pass, level=4) reported RED.

CORRECTION — my P1 pre-review was wrong on this point. In my 04:23Z pre-review I claimed to have "empirically tested" green-on-red exit-code masking and REFUTED it. That test was run with plain bash -c WITHOUT set -e — the wrong shell mode. The real drone step runs set -e, where the bug manifests. I re-ran the matrix correctly now (bash -e), reproducing the bug (old wrapper + green + set -e → exit 1) and confirming I had the shell mode wrong. Lesson: model the EXACT runtime (set -e) for shell-trap behavior. The Builder caught this live; I did not. Owning it. NB the failure direction was false-RED (green reported red) — fail-safe-ish, not a green-on-red (no failing run was ever reported green); still a real defect.

The fix (e1c4198 on branch, merged to main b7a009c) — independently verified by me, cold under set -e (the correct mode this time):

setsid cc-ci-run runner/run_recipe_ci.py & PID=$!
trap 'kill -TERM -- "-$PID" 2>/dev/null || true' TERM EXIT
rc=0; wait "$PID" || rc=$?
trap - TERM EXIT
exit "$rc"

My 4-path matrix (all under bash -e, exact-shape repros):

  • A green harness → step exit 0 ✓ (poisoning gone: || true on the trap kill + trap - EXIT before exit)
  • B red harness (exit 7) → step exit 7 ✓ — NOT masked to green. Critical false-GREEN check PASSES: wait || rc=$? captures the real rc and exit "$rc" propagates it. The "failing PR must report RED" gate is preserved by the fix.
  • C old wrapper + green + set -e → exit 1 ✓ (bug reproduced — root-cause confirmed)
  • D cancel (TERM to wrapper mid-wait) → wrapper exits 143 AND the child received TERM (CHILD_GOT_TERM logged) ✓ — cancel-forwarding semantics unchanged; the trap - TERM EXIT runs only AFTER wait returns (post-forward), so it can't disarm the forward during a real cancel.

Verdict on the fix: CORRECT and SAFE — resolves the false-RED poisoning without introducing false-GREEN, and preserves cancel forwarding. Folds cleanly into the pending M2 review.

M1 status unaffected: M1 PASS was for the code/suites/lint/diff of d3fe9e2; this wrapper exit-code-under-set-e is a LIVE behavior M1's checks could not exercise (the trap only runs in the real drone exec shell). main now = d3fe9e2 + this .drone.yml wrapper fix; the fix is verified above. Open for the formal M2 verdict: re-confirm lint green on the new .drone.yml (yamllint), the push build green, and live (a) cancel-no-leak / (b) parallel both-green / (c) double-!testme blocks / (d) one full green run — cold, once the Builder posts the M2 claim with evidence.

M2(c): FAIL @2026-06-10T08:10Z — double-!testme same domain corrupts shared deploy-count → both runs RED + VETO

Proactive cold break-it probe of the live M2 evidence (M2 not yet formally claim(conc)'d — the Builder's JOURNAL shows (c) "triggered" but NOT evidenced as PASS; I went straight to the Drone API to verify the in-flight (c) runs independently, not to the JOURNAL narrative). I found a REAL defect that breaks M2(c). Filed as BACKLOG-conc CONC-A1.

EVIDENCE (Drone API, recipe-maintainers/cc-ci, cold via /run/secrets/bridge_drone_token — my own access path, not the Builder's word):

  • (c) = builds 279 + 281, both event=custom PR=2 RECIPE=immich REF=a92b28d… → SAME domain immi-ad3e33.ci.commoninternet.net. Both status=failure (step ci exit_code=1).
  • 281 (the blocked run): log == app lock: ... in flight — waiting == @2s → == acquired == @194s, which is exactly when 279's process exited (279 finished 05:07:35Z). Lock serialisation + the visible block line WORK — that half of (c) is fine.
  • 279 RED: !! deploy-count 2 != 1 (DG4.1 violation).
  • 281 RED: FileNotFoundError: /tmp/ccci-deploys-immi-ad3e33….ci.commoninternet.net at run_recipe_ci.py:1213.
  • Control build 275 (isolated immich, same fixed wrapper) → deploy-count = 1, GREEN. Confirms the failure is concurrency-specific, NOT a pre-existing immich/wrapper regression.

ROOT CAUSE (code, confirmed):

  • DG4.1 counter file is DOMAIN-keyed in shared /tmp, not per-run: run_recipe_ci.py:930 /tmp/ccci-deploys-<domain>. P3 isolated ABRA_DIR per run but this per-run state file was missed (predates the restructure, ef44d46; the old recipe-flock serialised same-recipe runs end-to-end, masking it).
  • deploy_app() calls _record_deploy() (lifecycle.py:250) BEFORE acquire_app_lock() (:254, introduced by P2 b302f3a) → the increment races OUTSIDE the lock. 281's single pre-lock _record_deploy (@2s) bumps the shared counter 279 is using (→2, false violation), and 279's end-of-run os.remove(countfile) (:1215) deletes the file under 281 → FileNotFoundError.
  • Interleaving is fully reconstructed and self-consistent with the build timestamps (see CONC-A1).

This is squarely in M2(c) scope: the plan's DoD (c) requires the second run to "block … then RUN" (implicitly green), and the phase's whole premise is "two concurrent !testme don't collide on domain/volume/secrets." This is a domain-keyed-state collision — the restructure's narrower domain lock no longer covers the deploy-count file. M1 (code/suites/lint/diff of d3fe9e2) is unaffected — this is a live concurrency behavior M1's checks could not exercise; the tests/concurrency suite has the matching blind spot (case 4 serialises acquire but never asserts deploy-count isolation across two same-domain runs).

VETO — M2 may NOT be marked DONE until CONC-A1 is fixed and I log a fresh (c) PASS

Forbidding ## DONE in STATUS-conc until: (1) deploy-counter keyed per-run; (2) a tests/concurrency case asserts same-domain deploy-count isolation; (3) live (c) re-run shows BOTH builds GREEN with the visible block line and zero leakage; (4) (a),(b),(d) re-confirmed unaffected. Only I clear this. (After this verdict I may consult JOURNAL-conc to contextualise — noting I had NOT read the (c) journal reasoning before forming this FAIL; I verified from the Drone API + code directly.)

2026-06-10T08:20Z — CONC-A1 fix CODE-verified (veto conditions 1+2 met; 3+4 still pending — NOT cleared)

Builder fixed CONC-A1 (b6e12ef, merged main 139e319) and is re-running M2 live (a)(d). I cold-verified the FIX CODE from my own clone + a fresh checkout on cc-ci (not the Builder's word):

  • Condition (1) per-run keying — MET. run_recipe_ci._run_state_path(name) keys all four run-scoped state files (deploys, opstate, deps, depskip) by run_id() + os.getpid(), never domain. Grep: ZERO residual ccci-<state>-{domain} literals in prod code (only the app-LOCK path stays domain-keyed, which is correct). All consumers env-read CCCI_*_FILE (lifecycle:148, deps:72/155, generic:134) — no path re-derivation. Uniqueness holds even in the manual fallback (run_id()→domain) because the +pid suffix separates two processes.
  • Condition (2) same-domain isolation test — MET, and proven non-tautological. tests/concurrency/test_run_state.py adds test_20/20b/20c. test_20c drives REAL processes + the REAL lock + real _run_state_path/_record_deploy, reproducing the 279/281 interleaving: run A reads COUNT 1 (NOT polluted to 2 by B's pre-lock increment) and B's file survives A's remove (no FileNotFoundError). Mutation check (my own): reverting _run_state_path to domain-keying in a throwaway cc-ci clone → all 3 test_run_state cases FAIL (incl. test_20c). So the test genuinely guards the fix.
  • Suites cold (fresh clone @4f6c955 on cc-ci): unit 138 passed, concurrency 23 passed (was 20), concurrency still NOT collected by the default pytest tests/unit run (0). lint not re-run here (no .drone.yml/nix change in the fix; will confirm at the M2 claim).

VETO NOT cleared. Conditions (3) live (c) re-run BOTH builds GREEN + visible block line + zero leakage, and (4) (a)/(b)/(d) re-confirmed on the fixed harness, still require the Builder's live evidence (in flight). The code fix strongly predicts a (c) pass but M2 is a LIVE gate — I will re-verify the (c) double-!testme cold from the Drone API once the Builder posts the M2 claim, and only then clear the veto.

2026-06-10T08:43Z — live (c) round-2 (builds 290+291): serialization CONFIRMED via lslocks; delay is an immich-ML flake, NOT the restructure (not a verdict)

(b)+(d) re-passed on the fixed harness (builds 287 immich#2 + 288 plausible#3, parallel, both success — I'll re-confirm at the M2 claim). (c) round 2 = builds 290+291 (both custom PR=2 immich, same domain immi-ad3e33), started 08:22:30Z. I inspected the LIVE host state cold (my own ssh):

  • CORE INVARIANT DIRECTLY OBSERVED in the kernel lock table — strongest possible proof of the double-!testme serialization: lslocks: pid 739163 (build 290) holds WRITE on cc-ci-app-immi-ad3e33….lock; pid 739341 (build 291) is blocked WRITE* on the SAME lock. Exactly one holder, one waiter, one inode.
  • 290 (holder) is sleeping in services_converged() poll (hrtimer_nanosleep, no abra child) because immich-machine-learning is stuck 0/1: its container repeatedly fails the healthcheck (non-zero exit (143): dockerexec: unhealthy container, swarm restarting every 16 min). Current attempt (08:43) has gunicorn up, health starting — slow/flaky ML readiness, not a deploy break.
  • NOT caused by the restructure / teardown: 290's immich volumes (model-cache/postgres/uploads) + .env are all from 290's OWN fresh deploy (08:23), not inherited from the earlier same-domain run 287. ML image present (1.36GB, no pull), host healthy (5.2Gi mem free, 65G disk). So this is an immich-ML healthcheck flake, orthogonal to concurrency.

Bearing on M2(c): the SERIALIZATION mechanism under test is verified working live. The "both GREEN" half of condition (3) is not yet demonstrated only because 290 is flake-blocked on immich-ML; if 290 REDs on deploy-timeout, (c) needs a clean re-run (flake, not a code fault). VETO unchanged — I still require one clean (c) where both same-domain builds go GREEN with the block line + zero leakage. Continuing to watch 290/291 to terminal.

M2(c): PASS @2026-06-10T09:05Z — double-!testme same domain, CONC-A1 fixed; VETO LIFTED

(c) round-2 builds 290+291 (both custom PR=2 immich, same domain immi-ad3e33, on CONC-A1-fixed main) both reached terminal status=success. Cold-verified from the Drone API + live host (my own access path), not the Builder's word:

  • Both GREEN: 290 success, 291 success (Drone API).
  • Visible block line (the (c) requirement): 291 log — == app lock: another run of immi-ad3e33….ci.commoninternet.net is in flight — waiting == then == app lock: acquired … ==. I ALSO observed the serialization directly in the kernel lock table mid-run (lslocks: 290 held WRITE, 291 blocked WRITE* on the same inode; after 290 exited, 291 held it). Strongest possible proof of the double-!testme serialization invariant.
  • CONC-A1 regression GONE — the two exact round-1 failure points are now clean:
    • 290 (round-1 build 279 got false deploy-count 2 != 1) → now deploy-count = 1 (expect 1), all 5 tiers pass, level=4. Its run-keyed counter was NOT polluted by 291's concurrent pre-lock _record_deploy.
    • 291 (round-1 build 281 crashed FileNotFoundError at run_recipe_ci.py:1213) → now deploy-count = 1 (expect 1), all tiers pass, level=4, no traceback. Its own run-keyed countfile survived 290's end-of-run remove.
  • Zero leakage after both: 0 harness procs, 0 immich apps / services / volumes / secrets, no held cc-ci locks. One unheld 0-byte leftover lockfile (mtime 08:46, 291's acquisition touch) — reaped on sight by the next janitor probe, harmless by design.
  • The ~20-min runtime each was an immich-machine-learning healthcheck slowness/flake (ML eventually converged), NOT the restructure — already diagnosed in the 08:43Z note; serialization + isolation both verified correct regardless.

VETO LIFTED. The CONC-A1 veto ("no DONE until CONC-A1 fixed + a fresh (c) PASS") is cleared: conditions (1) per-run keying [code + mutation-proven], (2) same-domain isolation test [non-tautological], and (3) live (c) both-GREEN + block line + zero leakage are ALL met. CONC-A1 closed in BACKLOG-conc.

Still required before DONE (full M2 gate, not the CONC-A1 veto): the Builder must post the formal M2 claim in STATUS-conc with consolidated evidence, and I re-confirm condition (4) — specifically M2(a) cancel-mid-run re-run on the CONC-A1-fixed harness (b+d already re-confirmed: builds 287+288 parallel both success on fixed main; a's only prior evidence (build 267) was on the pre-CONC-A1, pre-wrapper-fix harness) — plus the push build green on current main. (a) re-run had not yet appeared in Drone as of this verdict (Builder sequenced it after (c)). I will verify it cold when it lands.

M2: PASS @2026-06-10T08:55Z — merged + live-verified (a)(d) on final main 139e319/74ed240

Formal M2 gate verdict against the Builder's M2 claim (STATUS-conc, commit 74ed240). Formed from the plan (SSOT), the code/git, the claim's verify recipe, and my OWN cold re-runs from my own clone

  • fresh checkouts/Drone-API on cc-ci — not the Builder's narrative. All seven claim items confirmed:
  1. Merge integritygit diff 139e319 b6e12ef -- runner/ tests/ docs/ .drone.yml nix/ = 0 lines; b6e12ef ⊆ 139e319; merge parents 2173894 ∘ b6e12ef. So deployed main code == the CONC-A1 tree I code-verified + mutation-proofed. No force-push (history linear). NB the claim mis-states the first parent as 4ad55ed (actual 2173894, my M2(c)-FAIL commit) — immaterial: that's a state- file commit, and the code-diff-empty check is authoritative.
  2. Push build green — Drone push builds 283298 on main all status=success; no red push since the merge.
  3. Suites + lint (cold, fresh clone on cc-ci) — unit 138 passed, concurrency 23 passed (concurrency NOT in the default unit gate), lint: PASS on final main 74ed240. test_run_state mutation-proofed (reverting to domain-keying fails all 3 cases).
  4. (a) cancel-mid-run on fixed harness — build 295 (custom immich#2): lockfile mtime 08:50:17 proves it acquired the app lock 7s in → canceled @08:51:05 MID-DEPLOY. After cancel (verified cold ~1 min later): 0 harness procs (no leaked python — old §8.1 gap stays closed), no held locks (lock released), no immich app/.env/containers(even stopped)/services/volumes/secrets → ZERO leakage, full teardown. Killed-step logs not API-retrievable (Drone truncates), but the end-state is the actual test and it is clean.
  5. (b) parallel runs — builds 287 (immich#2) + 288 (plausible#3), parallel, both status=success, both deploy-count = 1 (expect 1), level=4; host after = zero leakage.
  6. (c) double-!testme same PR — builds 290 + 291 (same immich domain): both success, 291 logged the block line then acquired, both deploy-count = 1, zero leakage. Serialization also observed directly in the kernel lock table mid-run (lslocks). Covered in detail by my M2(c) PASS @09:05Z.
  7. (d) full green e2e — build 287 (and 290): complete immich run, all 5 tiers pass, level=4.

Both M2-found fixes are folded in and independently verified: wrapper exit-code-under-set-e (e1c4198/b7a009c, my 05:00Z note — red still propagates) and CONC-A1 run-keyed state files (b6e12ef/139e319, my 09:05Z M2(c) PASS + mutation proof). The ~20-min (c) runtimes were an immich-ML healthcheck flake (converged within DEPLOY_TIMEOUT=1500s), orthogonal to the restructure (diagnosed 08:43Z). Unheld 0-byte leftover lockfiles are by-design (next-janitor tidy-sweep).

GUARDRAILS honored end-to-end: recipe-mirror PRs (immich#2, plausible#3) used as !testme targets only, never merged/pushed; cc-ci main touched only by the gated merges (no force-push); no secrets in any commit. RUN_APP_RE / services_converged / warm-canonical flows untouched (M1 diff review).

CONCLUSION: M2 — merged + live-verified — PASS. M1 PASS (04:38Z) + M2 PASS (here) are both fresh in REVIEW-conc; no open VETO (CONC-A1 lifted). Per the phase DoD the Builder may now write ## DONE to STATUS-conc. (Post-verdict I may consult JOURNAL-conc to contextualize; I had NOT read its M2 reasoning before forming this verdict — verified from plan + code/git + Drone API + my own cold runs.)