Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
32 KiB
REVIEW-conc.md — Adversary ledger, concurrency-restructure phase
Append-only. Verdicts: <gate>: PASS @<ts> + evidence, or FAIL + [adversary] finding in
BACKLOG-conc.md. SSOT for what is verified: /srv/cc-ci/cc-ci-plan/concurrency-restructure-full-plan.md.
2026-06-10T04:00Z — Adversary online; baseline pre-read (no gate pending)
Pulled main @5b65c6c. No STATUS-conc.md, no restructure/concurrency branch — nothing claimed yet.
Pre-read the CURRENT system (docs/concurrency.md @5b65c6c + lifecycle.py/run_recipe_ci.py) to
anchor my later diff review in the as-is code, not the Builder's narrative.
Current-system facts I will hold the restructure against:
- Registry symbols slated for deletion (will grep for dangling refs at M1):
register_run_app(lifecycle.py:69, call site :283),unregister_run_app(:78, call sites :723, :766),_run_owner_state(:83),ACTIVE_RUN_DIR(:43),CCCI_JANITOR_MAX_AGE(janitor :738),acquire_recipe_lock(:46, call site run_recipe_ci.py:843),RECIPE_LOCK_DIR(:42). - Must survive untouched:
RUN_APP_RE(lifecycle.py:26) allowlist semantics (warm/canonical apps never probed),services_converged()paused-is-settled logic, docker-service sweep discovery,teardown_app(verify=False)idempotence. - M1 verification plan (cold, my clone): checkout branch;
pytest tests/unit -q,pytest tests/concurrency -q,scripts/lint.sh; full diff review hunting: probe-vs-acquire ordering races, signal-handler reentrancy (SIGTERM during teardown / SIGALRM during SIGTERM), teardown-during-teardown, lock-fd lifetime (object dropped → GC closes fd → lock silently released), symlinked servers/ write conflicts, janitor unlink-vs-reacquire race (unlink while a waiter blocks on the old inode → two "held" locks on different inodes for one domain), PDEATHSIG-after-fork ordering (prctl before ppid check), alarm(0) vs teardown duration, setsid wrapper trap semantics under drone cancel, test-suite blind spots vs the 19 planned cases. - Tests/concurrency must NOT be wired into the default
pytest tests/unitgate (plan decision). - M2 (post-merge, live): cancel-mid-run leak check, parallel immich#2+plausible#3, double-!testme same PR blocks visibly, one full green run. NEVER merge/push recipe mirror repos.
No verdict yet — waiting for Builder bootstrap/claim.
2026-06-10T04:05Z — cold-verify environment established (prep, no gate)
Builder seeded STATUS/BACKLOG/JOURNAL-conc; STATUS says P1 in flight, no gate claimed. Mapped the test-execution environment I'll use for the M1 cold run so a time-sensitive gate isn't spent debugging tooling:
- Local VM devshell (
nix develop) has only lintTools (no pytest). So pytest does NOT run here. - pytest 8.3.3 + playwright live in the host
pyEnv(nix/modules/harness.nix) exposed ascc-ci-runon cc-ci.cc-ci-run -m pytest <path> -qworks as the real harness interpreter (verified:cc-ci-run -c "import pytest" -> 8.3.3). .drone.ymllint stage runsnix develop .#lint --command bash scripts/lint.sh.- COLD M1 PLAN: fresh
git clone/checkout ofrestructure/concurrencyinto a throwaway dir ON cc-ci →cc-ci-run -m pytest tests/unit -q+cc-ci-run -m pytest tests/concurrency -q+nix develop .#lint --command bash scripts/lint.sh, all from that clean checkout (not the Builder's working tree). Then adversarial diff review per my baseline hit-list. - Baseline
.drone.ymlon main is still the pre-restructure version (concurrency.limit=2, acquire_recipe_lock / /run/cc-ci-active registry referenced) — confirms P1/P4 edits are branch-only so far. Good.
2026-06-10T04:23Z — early pre-review of P1+P2 (branch @b302f3a, NO gate claimed — NOT a verdict)
Builder has pushed P1 (b492f99) + P2 (b302f3a) to restructure/concurrency; P3/P4/P5/tests still
pending, so M1 is not claimable and this is NOT a PASS — it's pre-review to front-load the M1 diff
audit and avoid re-doing it under gate time pressure. Read code/diff + git only; did NOT read
JOURNAL (anti-anchoring intact). I actively tried to break the following and each concern was
REFUTED:
- Green-on-red via the .drone.yml EXIT trap (my lead hypothesis). The wrapper is
setsid cc-ci-run … & PID=$!; trap 'kill -TERM -- -$PID' TERM EXIT; wait $PID. I worried the EXIT trap's finalkillstatus would override the harness exit code and mask a failing run. EMPIRICALLY TESTED (4 bash repros incl. failing harness with a lingering group member that makes kill succeed=0): bash PRESERVES the pre-trap exit status when the EXIT trap doesn't callexit. Exit code propagates correctly in all cases (RED stays RED, GREEN stays GREEN). Refuted. - P2 unlink/reacquire inode race (janitor unlinks a reaped orphan's lockfile while a new run
blocks on the old inode). Handled: both acquire_app_lock and _probe_and_reap recheck
fstat(fd).st_ino == stat(path).st_inoafter acquiring and retry/bail on mismatch — a lock on an unlinked (anonymous) inode is never treated as authoritative, and the path's lockfile is never unlinked out from under a newer run. Refuted. - Half-reaped/new-app coexistence. Reap runs WHILE HOLDING the probe lock; a new same-domain run blocks in acquire_app_lock until reap completes. The pre-deploy window (lock held, app not yet created) is covered: the stale-lockfile sweep sees the held lock (BlockingIOError) and leaves it. Refuted.
- Signal mid-normal-teardown aborting cleanup. begin_teardown() is the FIRST line of BOTH finally blocks (run_recipe_ci.py:663 run_quick, :1134 main); the _funnel_handler swallows (logs+returns) any SIGTERM/SIGALRM once tearing_down is set, so a second signal can't abort the cleanup the first asked for. install_lifetime_guards() is the FIRST statement of main() (:829), before any abra/lock call, with prctl→ppid==1 recheck in the correct order. Refuted.
Open items to confirm AT M1 (cold, full suite) — NOT defects, just unverified-until-then:
datetimeimport removed from lifecycle.py along with _stack_age_seconds — grep for any remaining datetime use (ruff would catch an undefined name; confirm import truly orphaned)._stack_name/ age-fallback deadcode after the janitor rewrite — confirm no dangling refs.- Registry-symbol deletion is only PARTIAL on this commit: acquire_recipe_lock still present (P3 deletes it); register/unregister/_run_owner_state/ACTIVE_RUN_DIR/CCCI_JANITOR_MAX_AGE are gone — full dangling-ref grep belongs at M1 once P3 lands.
- setsid-fork edge: if
setsidever forks (only when it's a pgrp leader; not the case for a backgrounded job in a non-job-control drone shell), $PID would be the intermediate and the harness would reparent to ppid==1 and self-abort. Live-verify the trap+cancel path at M2(a). - begin_teardown is process-global module state (lifetime._state) — fine for one harness process; the tests/concurrency suite must not import-share it across in-process cases (verify at M1).
2026-06-10T04:32Z — pre-review P3+P4 (branch @91d3cc7, NO gate claimed — NOT a verdict)
Builder pushed P3 (17ebdf3 per-run ABRA_DIR) + P4 (91d3cc7 config cleanup). tests/concurrency +
P5 docs still pending, so M1 still not claimable. Continued the front-loaded diff audit (code/git
only; JOURNAL still unread). Findings — all CLEAN:
- Dangling-ref grep across runner/bridge/dashboard/nix = ZERO hits for all 9 deleted symbols:
acquire_recipe_lock, register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR,
CCCI_JANITOR_MAX_AGE, RECIPE_LOCK_DIR, _stack_age_seconds, _registry_path. The orphaned
datetimeimport is also gone from lifecycle.py. Clean deletion. - Path centralization: all
~/.abra/recipes/<recipe>literals replaced byabra.recipe_dir()(resolves$ABRA_DIR else ~/.abra) across abra.py (recipe_checkout, has_lightweight_version_tags, recipe_head_commit, recipe_versions), generic._recipe_dir, lifecycle.prepull_images, snapshot_recipe_tests, fetch_recipe. prepull's env_path stays canonical~/.abra/servers/...which is correct (servers/ is the shared symlink target). - Ordering verified (main(), the only structural risk): install_lifetime_guards() is the FIRST
stmt (873); between it and setup_run_abra_dir() (891) there are ONLY env reads + a print — no
abra call; ABRA_DIR is exported at 891 BEFORE fetch_recipe (892) and before the first path-helper
recipe_head_commit (895). The
--quickdispatch (run_quick, ~908) is AFTER 891, so the quick lane inherits the per-run ABRA_DIR too. No tree is touched before ABRA_DIR is set. - Manual-run isolation: rid=="manual" → "manual-" so two hand-runs don't share a tree.
Open items to confirm AT M1 (cold) — not defects:
- setup_run_abra_dir symlink idempotency:
if not os.path.islink(link): os.symlink(...)— if a NON-symlink file pre-exists at servers/catalogue (reused run dir from a crashed partial), symlink raises FileExistsError. Low risk (fresh run-id per Drone build) but worth a glance. - CCCI_SKIP_FETCH=1 now
rm -rf dest+ copytree(canonical, dest, symlinks=True) — confirm the --quick rollback-proof staging tests still pass (they set CCCI_SKIP_FETCH). - tests/{ghost,discourse}/install_steps.sh RECIPE_DIR=${ABRA_DIR:-$HOME/.abra} mechanical path fix — confirm it changed NO assertion/gate (guardrail: never weaken recipe-test gates). Diff-check.
Net: the entire P1–P4 diff has been pre-audited and is clean against my break-it hit-list. M1 cold
run, once claimed (after tests/concurrency + P5 land), reduces to: fresh checkout on cc-ci →
cc-ci-run -m pytest tests/unit -q + cc-ci-run -m pytest tests/concurrency -q + lint, plus a
focused review of only the tests/concurrency suite (vs the 19 planned cases) and the P5 doc delta.
M1: PASS @2026-06-10T04:38Z — implementation verified (branch restructure/concurrency @d3fe9e2)
Verdict formed from the plan (SSOT), the code/git, the STATUS claim's verify recipe, and my own COLD acceptance run — WITHOUT reading JOURNAL first (anti-anchoring honored; noting here that I had NOT consulted JOURNAL-conc at verdict time).
COLD ENVIRONMENT: fresh git clone --branch restructure/concurrency into /tmp/adv-m1 on cc-ci
(NOT the Builder's tree); git rev-parse HEAD == d3fe9e26bb0fbaedb37383539ba3973bc1c80aff (matches
claim), git status clean. Ran via the host cc-ci-run pyEnv (pytest 8.3.3 + playwright) and the
pinned .#lint devshell.
ACCEPTANCE RESULTS (expected → observed):
cc-ci-run -m pytest tests/unit -q→ 138 passed in 4.72s ✓ (claim: 138 passed)cc-ci-run -m pytest tests/concurrency -q→ 20 passed in 9.91s ✓ (claim: 20 passed)nix develop .#lint --command bash scripts/lint.sh→lint: PASS✓pytest tests/unit --collect-onlyconcurrency items → 0 ✓ (suite NOT in default gate)- dangling-ref grep (register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE, acquire_recipe_lock, RECIPE_LOCK_DIR, _stack_age_seconds) over .py/.nix/.yml/.sh → ZERO hits outside docs/ ✓
GATE-INTEGRITY (guardrails honored):
RUN_APP_REregex unchanged (lifecycle.py:26, identical pattern); warm/canonical apps still never become probe candidates (test_11 asserts no lockfiles even created for warm names).services_converged()/ paused-is-settled /backup_app()waits: NOT in the code diff — all RUN_APP_RE/services_converged/paused diff hits are docs/concurrency.md prose (P5 rewrite).teardown_appordering untouched; only its trailing unregister call removed (registry gone).- Only
tests/<recipe>/change is the mechanicalRECIPE_DIR=${ABRA_DIR:-$HOME/.abra}/...line in ghost+discourse install_steps.sh — NO assertion/gate touched (diff-confirmed). Guardrail "never weaken recipe-test gates / touch tests// content" honored. - P4:
concurrency.limitblock removed from .drone.yml; drone-runner.nix comment makes DRONE_RUNNER_CAPACITY the single knob.
ADVERSARIAL DIFF REVIEW (P1–P4 pre-audited in the two notes above; refuted: green-on-red exit-code masking [empirically tested], unlink/reacquire inode race [fstat==stat identity recheck], half-reaped coexistence [reap-under-probe-lock], signal-mid-teardown reentrancy [begin_teardown first line of both finally blocks], guard/ABRA_DIR/fetch ordering [no abra call pre-export]).
TEST-SUITE AUDIT vs the 19 plan cases: real kernel flocks, NEVER mocked (only teardown_app +
abra-discovery stubbed, both disclosed). Coverage complete: cases 1–4 test_locks, 5–12
test_janitor, 13–16 test_lifetime, 17–19 test_abra_dir, +test_18b (manual-pid isolation) = 20.
Assertions are substantive, not tautological: exact funnel exit codes 142/143 (test_15/16),
reap-vs-new-run timestamp ordering + fresh-inode lock_state=="held" (test_7), two-janitor
arbitration via separate open()s (test_8 — valid: flock binds the open file description, so
threads-with-distinct-fds model processes), long-held mtime-backdate flag-not-steal (test_10),
PEP 446 fd non-inheritance with a surviving child (test_3), divergent per-run trees + canonical
untouched (test_18).
INDEPENDENT PROBE (my own driver, NOT the Builder's helpers.py): drove the real
lifecycle.acquire_app_lock from a standalone script with a sandbox CCCI_APP_LOCK_DIR on cc-ci →
state held after acquire; a second acquirer BLOCKED while the first held (no ack2 after 1.5s);
after SIGKILL of the holder the second acquired within 10s (kernel auto-release). Core invariant
confirmed against the real code, not just the Builder's tests.
NON-BLOCKING NOTES (carry to M2 live-verify; none gate M1):
- setsid-fork edge in the .drone.yml trap wrapper: if
setsidever forks (only when it's a pgrp leader — not the case for a backgrounded job in a non-job-control drone shell), $PID would be the intermediate and the harness could reparent (ppid==1) and self-abort. MUST be live-verified by the actual drone-cancel path at M2(a) — the plan already flags this ("verify drone exec runner signal delivery; the trap must fire on drone cancel"). Not unit-testable here. - End-of-janitor stale-lockfile tidy sweep (appless leftover lockfile unlink) is not directly covered by a named test (not one of the 19); low risk (tidiness only). Noted, not a defect.
- test_14 (ppid race) depends on the helper reparenting to pid 1; under a subreaper it marks NEVER_REPARENTED and FAILS VISIBLY (never false-passes). Passed in this env.
CONCLUSION: M1 — implementation verified — PASS. M2 (merge to main + live verification a–d) is unblocked. Reminder for both loops: recipe-mirror PRs are !testme targets only — never merge/push them. (After this verdict I may consult JOURNAL-conc to contextualize, per §6.1.)
2026-06-10T04:49Z — M2 merge integrity pre-check (M2 NOT yet claimed — not a verdict)
Builder merged the branch to main (merge commit bb5eb3d, 2 parents 83a6c6e∘d3fe9e2, no force)
after my M1 PASS, and is mid-M2 live verification (journal: M2(a) cancel-mid-run evidence, (b)
parallel runs triggered). No claim(conc): M2 commit yet; STATUS-conc still shows the stale M1
line (Builder's file — will update at the M2 claim). Independent merge check:
git diff bb5eb3d d3fe9e2 -- runner/ .drone.yml docs/concurrency.md tests/ nix/= EMPTY → the merge preserved EXACTLY the code I cold-verified at M1. No conflict-resolution drift introduced.git merge-base --is-ancestor d3fe9e2 bb5eb3d= true. So deployed main == M1-verified tree. At the M2 claim I therefore re-verify only LIVE behavior + the push build, not the code again: push build green; (a) cancel mid-run → no leaked python/lock, next janitor reaps the app, zero leakage; (b) two parallel !testme (immich#2 + plausible#3) → both green, zero leakage; (c) double-!testme same PR → 2nd blocks on the app lock (visible in its drone log) then runs; (d) one full green end-to-end run. Evidence to come from Drone build logs + cc-ci state (abra app ls / lslocks / docker), cold from my own access path.
2026-06-10T05:00Z — wrapper exit-code fix verified + CORRECTION to my P1 pre-review (inbox consumed)
Consumed ADVERSARY-INBOX.md (deleted) — Builder reported an M2 live-verify finding + fix. Folded in:
The defect (real, Builder-found, build 269 plausible#3): the drone exec step shell is set -e.
On a NORMAL (green) harness exit the P1 EXIT trap still fired and its kill -TERM -- -$PID of the
already-exited process group returned ESRCH (exit 1), which under set -e poisoned the step's exit
status to 1 — a fully GREEN run (all tiers pass, level=4) reported RED.
CORRECTION — my P1 pre-review was wrong on this point. In my 04:23Z pre-review I claimed to have
"empirically tested" green-on-red exit-code masking and REFUTED it. That test was run with plain
bash -c WITHOUT set -e — the wrong shell mode. The real drone step runs set -e, where the bug
manifests. I re-ran the matrix correctly now (bash -e), reproducing the bug (old wrapper + green +
set -e → exit 1) and confirming I had the shell mode wrong. Lesson: model the EXACT runtime
(set -e) for shell-trap behavior. The Builder caught this live; I did not. Owning it.
NB the failure direction was false-RED (green reported red) — fail-safe-ish, not a green-on-red
(no failing run was ever reported green); still a real defect.
The fix (e1c4198 on branch, merged to main b7a009c) — independently verified by me, cold under
set -e (the correct mode this time):
setsid cc-ci-run runner/run_recipe_ci.py & PID=$!
trap 'kill -TERM -- "-$PID" 2>/dev/null || true' TERM EXIT
rc=0; wait "$PID" || rc=$?
trap - TERM EXIT
exit "$rc"
My 4-path matrix (all under bash -e, exact-shape repros):
- A green harness → step exit 0 ✓ (poisoning gone:
|| trueon the trap kill +trap - EXITbefore exit) - B red harness (exit 7) → step exit 7 ✓ — NOT masked to green. Critical false-GREEN check
PASSES:
wait || rc=$?captures the real rc andexit "$rc"propagates it. The "failing PR must report RED" gate is preserved by the fix. - C old wrapper + green + set -e → exit 1 ✓ (bug reproduced — root-cause confirmed)
- D cancel (TERM to wrapper mid-wait) → wrapper exits 143 AND the child received TERM
(CHILD_GOT_TERM logged) ✓ — cancel-forwarding semantics unchanged; the
trap - TERM EXITruns only AFTERwaitreturns (post-forward), so it can't disarm the forward during a real cancel.
Verdict on the fix: CORRECT and SAFE — resolves the false-RED poisoning without introducing false-GREEN, and preserves cancel forwarding. Folds cleanly into the pending M2 review.
M1 status unaffected: M1 PASS was for the code/suites/lint/diff of d3fe9e2; this wrapper
exit-code-under-set-e is a LIVE behavior M1's checks could not exercise (the trap only runs in the
real drone exec shell). main now = d3fe9e2 + this .drone.yml wrapper fix; the fix is verified above.
Open for the formal M2 verdict: re-confirm lint green on the new .drone.yml (yamllint), the push
build green, and live (a) cancel-no-leak / (b) parallel both-green / (c) double-!testme blocks /
(d) one full green run — cold, once the Builder posts the M2 claim with evidence.
M2(c): FAIL @2026-06-10T08:10Z — double-!testme same domain corrupts shared deploy-count → both runs RED + VETO
Proactive cold break-it probe of the live M2 evidence (M2 not yet formally claim(conc)'d — the
Builder's JOURNAL shows (c) "triggered" but NOT evidenced as PASS; I went straight to the Drone API
to verify the in-flight (c) runs independently, not to the JOURNAL narrative). I found a REAL defect
that breaks M2(c). Filed as BACKLOG-conc CONC-A1.
EVIDENCE (Drone API, recipe-maintainers/cc-ci, cold via /run/secrets/bridge_drone_token — my own access path, not the Builder's word):
- (c) = builds 279 + 281, both
event=custom PR=2 RECIPE=immich REF=a92b28d…→ SAME domainimmi-ad3e33.ci.commoninternet.net. Bothstatus=failure(stepciexit_code=1). - 281 (the blocked run): log
== app lock: ... in flight — waiting ==@2s →== acquired ==@194s, which is exactly when 279's process exited (279 finished 05:07:35Z). Lock serialisation + the visible block line WORK — that half of (c) is fine. - 279 RED:
!! deploy-count 2 != 1 (DG4.1 violation). - 281 RED:
FileNotFoundError: /tmp/ccci-deploys-immi-ad3e33….ci.commoninternet.netat run_recipe_ci.py:1213. - Control build 275 (isolated immich, same fixed wrapper) →
deploy-count = 1, GREEN. Confirms the failure is concurrency-specific, NOT a pre-existing immich/wrapper regression.
ROOT CAUSE (code, confirmed):
- DG4.1 counter file is DOMAIN-keyed in shared /tmp, not per-run:
run_recipe_ci.py:930 /tmp/ccci-deploys-<domain>. P3 isolated ABRA_DIR per run but this per-run state file was missed (predates the restructure, ef44d46; the old recipe-flock serialised same-recipe runs end-to-end, masking it). deploy_app()calls_record_deploy()(lifecycle.py:250) BEFOREacquire_app_lock()(:254, introduced by P2b302f3a) → the increment races OUTSIDE the lock. 281's single pre-lock_record_deploy(@2s) bumps the shared counter 279 is using (→2, false violation), and 279's end-of-runos.remove(countfile)(:1215) deletes the file under 281 → FileNotFoundError.- Interleaving is fully reconstructed and self-consistent with the build timestamps (see CONC-A1).
This is squarely in M2(c) scope: the plan's DoD (c) requires the second run to "block … then RUN"
(implicitly green), and the phase's whole premise is "two concurrent !testme don't collide on
domain/volume/secrets." This is a domain-keyed-state collision — the restructure's narrower domain
lock no longer covers the deploy-count file. M1 (code/suites/lint/diff of d3fe9e2) is unaffected —
this is a live concurrency behavior M1's checks could not exercise; the tests/concurrency suite has
the matching blind spot (case 4 serialises acquire but never asserts deploy-count isolation across
two same-domain runs).
VETO — M2 may NOT be marked DONE until CONC-A1 is fixed and I log a fresh (c) PASS
Forbidding ## DONE in STATUS-conc until: (1) deploy-counter keyed per-run; (2) a tests/concurrency
case asserts same-domain deploy-count isolation; (3) live (c) re-run shows BOTH builds GREEN with
the visible block line and zero leakage; (4) (a),(b),(d) re-confirmed unaffected. Only I clear this.
(After this verdict I may consult JOURNAL-conc to contextualise — noting I had NOT read the (c)
journal reasoning before forming this FAIL; I verified from the Drone API + code directly.)
2026-06-10T08:20Z — CONC-A1 fix CODE-verified (veto conditions 1+2 met; 3+4 still pending — NOT cleared)
Builder fixed CONC-A1 (b6e12ef, merged main 139e319) and is re-running M2 live (a)–(d). I
cold-verified the FIX CODE from my own clone + a fresh checkout on cc-ci (not the Builder's word):
- Condition (1) per-run keying — MET.
run_recipe_ci._run_state_path(name)keys all four run-scoped state files (deploys,opstate,deps,depskip) byrun_id()+os.getpid(), never domain. Grep: ZERO residualccci-<state>-{domain}literals in prod code (only the app-LOCK path stays domain-keyed, which is correct). All consumers env-readCCCI_*_FILE(lifecycle:148, deps:72/155, generic:134) — no path re-derivation. Uniqueness holds even in the manual fallback (run_id()→domain) because the+pidsuffix separates two processes. - Condition (2) same-domain isolation test — MET, and proven non-tautological.
tests/concurrency/test_run_state.py adds test_20/20b/20c. test_20c drives REAL processes + the
REAL lock + real
_run_state_path/_record_deploy, reproducing the 279/281 interleaving: run A readsCOUNT 1(NOT polluted to 2 by B's pre-lock increment) and B's file survives A's remove (no FileNotFoundError). Mutation check (my own): reverting_run_state_pathto domain-keying in a throwaway cc-ci clone → all 3 test_run_state cases FAIL (incl. test_20c). So the test genuinely guards the fix. - Suites cold (fresh clone @4f6c955 on cc-ci): unit 138 passed, concurrency 23 passed (was 20),
concurrency still NOT collected by the default
pytest tests/unitrun (0). lint not re-run here (no .drone.yml/nix change in the fix; will confirm at the M2 claim).
VETO NOT cleared. Conditions (3) live (c) re-run BOTH builds GREEN + visible block line + zero leakage, and (4) (a)/(b)/(d) re-confirmed on the fixed harness, still require the Builder's live evidence (in flight). The code fix strongly predicts a (c) pass but M2 is a LIVE gate — I will re-verify the (c) double-!testme cold from the Drone API once the Builder posts the M2 claim, and only then clear the veto.
2026-06-10T08:43Z — live (c) round-2 (builds 290+291): serialization CONFIRMED via lslocks; delay is an immich-ML flake, NOT the restructure (not a verdict)
(b)+(d) re-passed on the fixed harness (builds 287 immich#2 + 288 plausible#3, parallel, both success — I'll re-confirm at the M2 claim). (c) round 2 = builds 290+291 (both custom PR=2 immich, same domain immi-ad3e33), started 08:22:30Z. I inspected the LIVE host state cold (my own ssh):
- CORE INVARIANT DIRECTLY OBSERVED in the kernel lock table — strongest possible proof of the
double-!testme serialization:
lslocks: pid 739163 (build 290) holdsWRITEon cc-ci-app-immi-ad3e33….lock; pid 739341 (build 291) is blockedWRITE*on the SAME lock. Exactly one holder, one waiter, one inode. - 290 (holder) is sleeping in
services_converged()poll (hrtimer_nanosleep, no abra child) becauseimmich-machine-learningis stuck 0/1: its container repeatedly fails the healthcheck (non-zero exit (143): dockerexec: unhealthy container, swarm restarting every 1–6 min). Current attempt (08:43) has gunicorn up, healthstarting— slow/flaky ML readiness, not a deploy break. - NOT caused by the restructure / teardown: 290's immich volumes (model-cache/postgres/uploads) + .env are all from 290's OWN fresh deploy (08:23), not inherited from the earlier same-domain run 287. ML image present (1.36GB, no pull), host healthy (5.2Gi mem free, 65G disk). So this is an immich-ML healthcheck flake, orthogonal to concurrency.
Bearing on M2(c): the SERIALIZATION mechanism under test is verified working live. The "both GREEN" half of condition (3) is not yet demonstrated only because 290 is flake-blocked on immich-ML; if 290 REDs on deploy-timeout, (c) needs a clean re-run (flake, not a code fault). VETO unchanged — I still require one clean (c) where both same-domain builds go GREEN with the block line + zero leakage. Continuing to watch 290/291 to terminal.
M2(c): PASS @2026-06-10T09:05Z — double-!testme same domain, CONC-A1 fixed; VETO LIFTED
(c) round-2 builds 290+291 (both custom PR=2 immich, same domain immi-ad3e33, on CONC-A1-fixed
main) both reached terminal status=success. Cold-verified from the Drone API + live host (my own
access path), not the Builder's word:
- Both GREEN: 290 success, 291 success (Drone API).
- Visible block line (the (c) requirement): 291 log —
== app lock: another run of immi-ad3e33….ci.commoninternet.net is in flight — waiting ==then== app lock: acquired … ==. I ALSO observed the serialization directly in the kernel lock table mid-run (lslocks: 290 held WRITE, 291 blocked WRITE* on the same inode; after 290 exited, 291 held it). Strongest possible proof of the double-!testme serialization invariant. - CONC-A1 regression GONE — the two exact round-1 failure points are now clean:
- 290 (round-1 build 279 got false
deploy-count 2 != 1) → nowdeploy-count = 1 (expect 1), all 5 tiers pass, level=4. Its run-keyed counter was NOT polluted by 291's concurrent pre-lock_record_deploy. - 291 (round-1 build 281 crashed
FileNotFoundErrorat run_recipe_ci.py:1213) → nowdeploy-count = 1 (expect 1), all tiers pass, level=4, no traceback. Its own run-keyed countfile survived 290's end-of-run remove.
- 290 (round-1 build 279 got false
- Zero leakage after both: 0 harness procs, 0 immich apps / services / volumes / secrets, no held cc-ci locks. One unheld 0-byte leftover lockfile (mtime 08:46, 291's acquisition touch) — reaped on sight by the next janitor probe, harmless by design.
- The ~20-min runtime each was an immich-machine-learning healthcheck slowness/flake (ML eventually converged), NOT the restructure — already diagnosed in the 08:43Z note; serialization + isolation both verified correct regardless.
VETO LIFTED. The CONC-A1 veto ("no DONE until CONC-A1 fixed + a fresh (c) PASS") is cleared: conditions (1) per-run keying [code + mutation-proven], (2) same-domain isolation test [non-tautological], and (3) live (c) both-GREEN + block line + zero leakage are ALL met. CONC-A1 closed in BACKLOG-conc.
Still required before DONE (full M2 gate, not the CONC-A1 veto): the Builder must post the formal M2 claim in STATUS-conc with consolidated evidence, and I re-confirm condition (4) — specifically M2(a) cancel-mid-run re-run on the CONC-A1-fixed harness (b+d already re-confirmed: builds 287+288 parallel both success on fixed main; a's only prior evidence (build 267) was on the pre-CONC-A1, pre-wrapper-fix harness) — plus the push build green on current main. (a) re-run had not yet appeared in Drone as of this verdict (Builder sequenced it after (c)). I will verify it cold when it lands.
M2: PASS @2026-06-10T08:55Z — merged + live-verified (a)–(d) on final main 139e319/74ed240
Formal M2 gate verdict against the Builder's M2 claim (STATUS-conc, commit 74ed240). Formed from
the plan (SSOT), the code/git, the claim's verify recipe, and my OWN cold re-runs from my own clone
- fresh checkouts/Drone-API on cc-ci — not the Builder's narrative. All seven claim items confirmed:
- Merge integrity —
git diff 139e319 b6e12ef -- runner/ tests/ docs/ .drone.yml nix/= 0 lines;b6e12ef ⊆ 139e319; merge parents2173894 ∘ b6e12ef. So deployed main code == the CONC-A1 tree I code-verified + mutation-proofed. No force-push (history linear). NB the claim mis-states the first parent as4ad55ed(actual2173894, my M2(c)-FAIL commit) — immaterial: that's a state- file commit, and the code-diff-empty check is authoritative. - Push build green — Drone push builds 283–298 on main all
status=success; no red push since the merge. - Suites + lint (cold, fresh clone on cc-ci) — unit 138 passed, concurrency 23 passed
(concurrency NOT in the default unit gate),
lint: PASSon final main74ed240. test_run_state mutation-proofed (reverting to domain-keying fails all 3 cases). - (a) cancel-mid-run on fixed harness — build 295 (custom immich#2): lockfile mtime 08:50:17 proves it acquired the app lock 7s in → canceled @08:51:05 MID-DEPLOY. After cancel (verified cold ~1 min later): 0 harness procs (no leaked python — old §8.1 gap stays closed), no held locks (lock released), no immich app/.env/containers(even stopped)/services/volumes/secrets → ZERO leakage, full teardown. Killed-step logs not API-retrievable (Drone truncates), but the end-state is the actual test and it is clean.
- (b) parallel runs — builds 287 (immich#2) + 288 (plausible#3), parallel, both
status=success, bothdeploy-count = 1 (expect 1), level=4; host after = zero leakage. - (c) double-!testme same PR — builds 290 + 291 (same immich domain): both success, 291 logged
the block line then
acquired, bothdeploy-count = 1, zero leakage. Serialization also observed directly in the kernel lock table mid-run (lslocks). Covered in detail by my M2(c) PASS @09:05Z. - (d) full green e2e — build 287 (and 290): complete immich run, all 5 tiers pass, level=4.
Both M2-found fixes are folded in and independently verified: wrapper exit-code-under-set-e (e1c4198/b7a009c, my 05:00Z note — red still propagates) and CONC-A1 run-keyed state files (b6e12ef/139e319, my 09:05Z M2(c) PASS + mutation proof). The ~20-min (c) runtimes were an immich-ML healthcheck flake (converged within DEPLOY_TIMEOUT=1500s), orthogonal to the restructure (diagnosed 08:43Z). Unheld 0-byte leftover lockfiles are by-design (next-janitor tidy-sweep).
GUARDRAILS honored end-to-end: recipe-mirror PRs (immich#2, plausible#3) used as !testme targets only, never merged/pushed; cc-ci main touched only by the gated merges (no force-push); no secrets in any commit. RUN_APP_RE / services_converged / warm-canonical flows untouched (M1 diff review).
CONCLUSION: M2 — merged + live-verified — PASS. M1 PASS (04:38Z) + M2 PASS (here) are both fresh
in REVIEW-conc; no open VETO (CONC-A1 lifted). Per the phase DoD the Builder may now write ## DONE
to STATUS-conc. (Post-verdict I may consult JOURNAL-conc to contextualize; I had NOT read its M2
reasoning before forming this verdict — verified from plan + code/git + Drone API + my own cold runs.)