All checks were successful
continuous-integration/drone/push Build is passing
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
443 lines
32 KiB
Markdown
443 lines
32 KiB
Markdown
# REVIEW-conc.md — Adversary ledger, concurrency-restructure phase
|
||
|
||
Append-only. Verdicts: `<gate>: PASS @<ts>` + evidence, or `FAIL` + [adversary] finding in
|
||
BACKLOG-conc.md. SSOT for what is verified: /srv/cc-ci/cc-ci-plan/concurrency-restructure-full-plan.md.
|
||
|
||
## 2026-06-10T04:00Z — Adversary online; baseline pre-read (no gate pending)
|
||
|
||
Pulled main @5b65c6c. No STATUS-conc.md, no `restructure/concurrency` branch — nothing claimed yet.
|
||
Pre-read the CURRENT system (docs/concurrency.md @5b65c6c + lifecycle.py/run_recipe_ci.py) to
|
||
anchor my later diff review in the as-is code, not the Builder's narrative.
|
||
|
||
Current-system facts I will hold the restructure against:
|
||
- Registry symbols slated for deletion (will grep for dangling refs at M1):
|
||
`register_run_app` (lifecycle.py:69, call site :283), `unregister_run_app` (:78, call sites :723, :766),
|
||
`_run_owner_state` (:83), `ACTIVE_RUN_DIR` (:43), `CCCI_JANITOR_MAX_AGE` (janitor :738),
|
||
`acquire_recipe_lock` (:46, call site run_recipe_ci.py:843), `RECIPE_LOCK_DIR` (:42).
|
||
- Must survive untouched: `RUN_APP_RE` (lifecycle.py:26) allowlist semantics (warm/canonical apps
|
||
never probed), `services_converged()` paused-is-settled logic, docker-service sweep discovery,
|
||
`teardown_app(verify=False)` idempotence.
|
||
- M1 verification plan (cold, my clone): checkout branch; `pytest tests/unit -q`,
|
||
`pytest tests/concurrency -q`, `scripts/lint.sh`; full diff review hunting: probe-vs-acquire
|
||
ordering races, signal-handler reentrancy (SIGTERM during teardown / SIGALRM during SIGTERM),
|
||
teardown-during-teardown, lock-fd lifetime (object dropped → GC closes fd → lock silently
|
||
released), symlinked servers/ write conflicts, janitor unlink-vs-reacquire race (unlink while a
|
||
waiter blocks on the old inode → two "held" locks on different inodes for one domain),
|
||
PDEATHSIG-after-fork ordering (prctl before ppid check), alarm(0) vs teardown duration,
|
||
setsid wrapper trap semantics under drone cancel, test-suite blind spots vs the 19 planned cases.
|
||
- Tests/concurrency must NOT be wired into the default `pytest tests/unit` gate (plan decision).
|
||
- M2 (post-merge, live): cancel-mid-run leak check, parallel immich#2+plausible#3, double-!testme
|
||
same PR blocks visibly, one full green run. NEVER merge/push recipe mirror repos.
|
||
|
||
No verdict yet — waiting for Builder bootstrap/claim.
|
||
|
||
## 2026-06-10T04:05Z — cold-verify environment established (prep, no gate)
|
||
|
||
Builder seeded STATUS/BACKLOG/JOURNAL-conc; STATUS says P1 in flight, no gate claimed. Mapped the
|
||
test-execution environment I'll use for the M1 cold run so a time-sensitive gate isn't spent
|
||
debugging tooling:
|
||
- Local VM devshell (`nix develop`) has only lintTools (no pytest). So pytest does NOT run here.
|
||
- pytest 8.3.3 + playwright live in the host `pyEnv` (nix/modules/harness.nix) exposed as
|
||
`cc-ci-run` on cc-ci. `cc-ci-run -m pytest <path> -q` works as the real harness interpreter
|
||
(verified: `cc-ci-run -c "import pytest" -> 8.3.3`).
|
||
- `.drone.yml` lint stage runs `nix develop .#lint --command bash scripts/lint.sh`.
|
||
- COLD M1 PLAN: fresh `git clone`/checkout of `restructure/concurrency` into a throwaway dir ON
|
||
cc-ci → `cc-ci-run -m pytest tests/unit -q` + `cc-ci-run -m pytest tests/concurrency -q` +
|
||
`nix develop .#lint --command bash scripts/lint.sh`, all from that clean checkout (not the
|
||
Builder's working tree). Then adversarial diff review per my baseline hit-list.
|
||
- Baseline `.drone.yml` on main is still the pre-restructure version (concurrency.limit=2,
|
||
acquire_recipe_lock / /run/cc-ci-active registry referenced) — confirms P1/P4 edits are
|
||
branch-only so far. Good.
|
||
|
||
## 2026-06-10T04:23Z — early pre-review of P1+P2 (branch @b302f3a, NO gate claimed — NOT a verdict)
|
||
|
||
Builder has pushed P1 (b492f99) + P2 (b302f3a) to restructure/concurrency; P3/P4/P5/tests still
|
||
pending, so M1 is not claimable and this is NOT a PASS — it's pre-review to front-load the M1 diff
|
||
audit and avoid re-doing it under gate time pressure. Read code/diff + git only; did NOT read
|
||
JOURNAL (anti-anchoring intact). I actively tried to break the following and each concern was
|
||
REFUTED:
|
||
|
||
1. **Green-on-red via the .drone.yml EXIT trap** (my lead hypothesis). The wrapper is
|
||
`setsid cc-ci-run … & PID=$!; trap 'kill -TERM -- -$PID' TERM EXIT; wait $PID`. I worried the
|
||
EXIT trap's final `kill` status would override the harness exit code and mask a failing run.
|
||
EMPIRICALLY TESTED (4 bash repros incl. failing harness with a lingering group member that
|
||
makes kill succeed=0): bash PRESERVES the pre-trap exit status when the EXIT trap doesn't call
|
||
`exit`. Exit code propagates correctly in all cases (RED stays RED, GREEN stays GREEN). Refuted.
|
||
2. **P2 unlink/reacquire inode race** (janitor unlinks a reaped orphan's lockfile while a new run
|
||
blocks on the old inode). Handled: both acquire_app_lock and _probe_and_reap recheck
|
||
`fstat(fd).st_ino == stat(path).st_ino` after acquiring and retry/bail on mismatch — a lock on
|
||
an unlinked (anonymous) inode is never treated as authoritative, and the path's lockfile is
|
||
never unlinked out from under a newer run. Refuted.
|
||
3. **Half-reaped/new-app coexistence.** Reap runs WHILE HOLDING the probe lock; a new same-domain
|
||
run blocks in acquire_app_lock until reap completes. The pre-deploy window (lock held, app not
|
||
yet created) is covered: the stale-lockfile sweep sees the held lock (BlockingIOError) and
|
||
leaves it. Refuted.
|
||
4. **Signal mid-normal-teardown aborting cleanup.** begin_teardown() is the FIRST line of BOTH
|
||
finally blocks (run_recipe_ci.py:663 run_quick, :1134 main); the _funnel_handler swallows
|
||
(logs+returns) any SIGTERM/SIGALRM once tearing_down is set, so a second signal can't abort the
|
||
cleanup the first asked for. install_lifetime_guards() is the FIRST statement of main() (:829),
|
||
before any abra/lock call, with prctl→ppid==1 recheck in the correct order. Refuted.
|
||
|
||
Open items to confirm AT M1 (cold, full suite) — NOT defects, just unverified-until-then:
|
||
- `datetime` import removed from lifecycle.py along with _stack_age_seconds — grep for any
|
||
remaining datetime use (ruff would catch an undefined name; confirm import truly orphaned).
|
||
- `_stack_name` / age-fallback deadcode after the janitor rewrite — confirm no dangling refs.
|
||
- Registry-symbol deletion is only PARTIAL on this commit: acquire_recipe_lock still present
|
||
(P3 deletes it); register/unregister/_run_owner_state/ACTIVE_RUN_DIR/CCCI_JANITOR_MAX_AGE are
|
||
gone — full dangling-ref grep belongs at M1 once P3 lands.
|
||
- setsid-fork edge: if `setsid` ever forks (only when it's a pgrp leader; not the case for a
|
||
backgrounded job in a non-job-control drone shell), $PID would be the intermediate and the
|
||
harness would reparent to ppid==1 and self-abort. Live-verify the trap+cancel path at M2(a).
|
||
- begin_teardown is process-global module state (lifetime._state) — fine for one harness process;
|
||
the tests/concurrency suite must not import-share it across in-process cases (verify at M1).
|
||
|
||
## 2026-06-10T04:32Z — pre-review P3+P4 (branch @91d3cc7, NO gate claimed — NOT a verdict)
|
||
|
||
Builder pushed P3 (17ebdf3 per-run ABRA_DIR) + P4 (91d3cc7 config cleanup). tests/concurrency +
|
||
P5 docs still pending, so M1 still not claimable. Continued the front-loaded diff audit (code/git
|
||
only; JOURNAL still unread). Findings — all CLEAN:
|
||
|
||
- **Dangling-ref grep across runner/bridge/dashboard/nix = ZERO hits** for all 9 deleted symbols:
|
||
acquire_recipe_lock, register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR,
|
||
CCCI_JANITOR_MAX_AGE, RECIPE_LOCK_DIR, _stack_age_seconds, _registry_path. The orphaned
|
||
`datetime` import is also gone from lifecycle.py. Clean deletion.
|
||
- **Path centralization**: all `~/.abra/recipes/<recipe>` literals replaced by `abra.recipe_dir()`
|
||
(resolves `$ABRA_DIR else ~/.abra`) across abra.py (recipe_checkout, has_lightweight_version_tags,
|
||
recipe_head_commit, recipe_versions), generic._recipe_dir, lifecycle.prepull_images,
|
||
snapshot_recipe_tests, fetch_recipe. prepull's env_path stays canonical `~/.abra/servers/...`
|
||
which is correct (servers/ is the shared symlink target).
|
||
- **Ordering verified** (main(), the only structural risk): install_lifetime_guards() is the FIRST
|
||
stmt (873); between it and setup_run_abra_dir() (891) there are ONLY env reads + a print — no
|
||
abra call; ABRA_DIR is exported at 891 BEFORE fetch_recipe (892) and before the first path-helper
|
||
recipe_head_commit (895). The `--quick` dispatch (run_quick, ~908) is AFTER 891, so the quick lane
|
||
inherits the per-run ABRA_DIR too. No tree is touched before ABRA_DIR is set.
|
||
- **Manual-run isolation**: rid=="manual" → "manual-<pid>" so two hand-runs don't share a tree.
|
||
|
||
Open items to confirm AT M1 (cold) — not defects:
|
||
- setup_run_abra_dir symlink idempotency: `if not os.path.islink(link): os.symlink(...)` — if a
|
||
NON-symlink file pre-exists at servers/catalogue (reused run dir from a crashed partial), symlink
|
||
raises FileExistsError. Low risk (fresh run-id per Drone build) but worth a glance.
|
||
- CCCI_SKIP_FETCH=1 now `rm -rf dest` + copytree(canonical, dest, symlinks=True) — confirm the
|
||
--quick rollback-proof staging tests still pass (they set CCCI_SKIP_FETCH).
|
||
- tests/{ghost,discourse}/install_steps.sh RECIPE_DIR=${ABRA_DIR:-$HOME/.abra} mechanical path fix
|
||
— confirm it changed NO assertion/gate (guardrail: never weaken recipe-test gates). Diff-check.
|
||
|
||
Net: the entire P1–P4 diff has been pre-audited and is clean against my break-it hit-list. M1 cold
|
||
run, once claimed (after tests/concurrency + P5 land), reduces to: fresh checkout on cc-ci →
|
||
`cc-ci-run -m pytest tests/unit -q` + `cc-ci-run -m pytest tests/concurrency -q` + lint, plus a
|
||
focused review of only the tests/concurrency suite (vs the 19 planned cases) and the P5 doc delta.
|
||
|
||
## M1: PASS @2026-06-10T04:38Z — implementation verified (branch restructure/concurrency @d3fe9e2)
|
||
|
||
Verdict formed from the plan (SSOT), the code/git, the STATUS claim's verify recipe, and my own
|
||
COLD acceptance run — WITHOUT reading JOURNAL first (anti-anchoring honored; noting here that I had
|
||
NOT consulted JOURNAL-conc at verdict time).
|
||
|
||
COLD ENVIRONMENT: fresh `git clone --branch restructure/concurrency` into /tmp/adv-m1 on cc-ci
|
||
(NOT the Builder's tree); `git rev-parse HEAD == d3fe9e26bb0fbaedb37383539ba3973bc1c80aff` (matches
|
||
claim), `git status` clean. Ran via the host `cc-ci-run` pyEnv (pytest 8.3.3 + playwright) and the
|
||
pinned `.#lint` devshell.
|
||
|
||
ACCEPTANCE RESULTS (expected → observed):
|
||
- `cc-ci-run -m pytest tests/unit -q` → 138 passed in 4.72s ✓ (claim: 138 passed)
|
||
- `cc-ci-run -m pytest tests/concurrency -q` → 20 passed in 9.91s ✓ (claim: 20 passed)
|
||
- `nix develop .#lint --command bash scripts/lint.sh` → `lint: PASS` ✓
|
||
- `pytest tests/unit --collect-only` concurrency items → 0 ✓ (suite NOT in default gate)
|
||
- dangling-ref grep (register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR,
|
||
CCCI_JANITOR_MAX_AGE, acquire_recipe_lock, RECIPE_LOCK_DIR, _stack_age_seconds) over
|
||
*.py/*.nix/*.yml/*.sh → ZERO hits outside docs/ ✓
|
||
|
||
GATE-INTEGRITY (guardrails honored):
|
||
- `RUN_APP_RE` regex unchanged (lifecycle.py:26, identical pattern); warm/canonical apps still
|
||
never become probe candidates (test_11 asserts no lockfiles even created for warm names).
|
||
- `services_converged()` / paused-is-settled / `backup_app()` waits: NOT in the code diff — all
|
||
RUN_APP_RE/services_converged/paused diff hits are docs/concurrency.md prose (P5 rewrite).
|
||
- `teardown_app` ordering untouched; only its trailing unregister call removed (registry gone).
|
||
- Only `tests/<recipe>/` change is the mechanical `RECIPE_DIR=${ABRA_DIR:-$HOME/.abra}/...` line
|
||
in ghost+discourse install_steps.sh — NO assertion/gate touched (diff-confirmed). Guardrail
|
||
"never weaken recipe-test gates / touch tests/<recipe>/ content" honored.
|
||
- P4: `concurrency.limit` block removed from .drone.yml; drone-runner.nix comment makes
|
||
DRONE_RUNNER_CAPACITY the single knob.
|
||
|
||
ADVERSARIAL DIFF REVIEW (P1–P4 pre-audited in the two notes above; refuted: green-on-red exit-code
|
||
masking [empirically tested], unlink/reacquire inode race [fstat==stat identity recheck],
|
||
half-reaped coexistence [reap-under-probe-lock], signal-mid-teardown reentrancy [begin_teardown
|
||
first line of both finally blocks], guard/ABRA_DIR/fetch ordering [no abra call pre-export]).
|
||
|
||
TEST-SUITE AUDIT vs the 19 plan cases: real kernel flocks, NEVER mocked (only teardown_app +
|
||
abra-discovery stubbed, both disclosed). Coverage complete: cases 1–4 test_locks, 5–12
|
||
test_janitor, 13–16 test_lifetime, 17–19 test_abra_dir, +test_18b (manual-pid isolation) = 20.
|
||
Assertions are substantive, not tautological: exact funnel exit codes 142/143 (test_15/16),
|
||
reap-vs-new-run timestamp ordering + fresh-inode `lock_state=="held"` (test_7), two-janitor
|
||
arbitration via separate open()s (test_8 — valid: flock binds the open file description, so
|
||
threads-with-distinct-fds model processes), long-held mtime-backdate flag-not-steal (test_10),
|
||
PEP 446 fd non-inheritance with a surviving child (test_3), divergent per-run trees + canonical
|
||
untouched (test_18).
|
||
|
||
INDEPENDENT PROBE (my own driver, NOT the Builder's helpers.py): drove the real
|
||
`lifecycle.acquire_app_lock` from a standalone script with a sandbox CCCI_APP_LOCK_DIR on cc-ci →
|
||
state `held` after acquire; a second acquirer BLOCKED while the first held (no ack2 after 1.5s);
|
||
after `SIGKILL` of the holder the second acquired within 10s (kernel auto-release). Core invariant
|
||
confirmed against the real code, not just the Builder's tests.
|
||
|
||
NON-BLOCKING NOTES (carry to M2 live-verify; none gate M1):
|
||
- setsid-fork edge in the .drone.yml trap wrapper: if `setsid` ever forks (only when it's a pgrp
|
||
leader — not the case for a backgrounded job in a non-job-control drone shell), $PID would be the
|
||
intermediate and the harness could reparent (ppid==1) and self-abort. MUST be live-verified by
|
||
the actual drone-cancel path at M2(a) — the plan already flags this ("verify drone exec runner
|
||
signal delivery; the trap must fire on drone cancel"). Not unit-testable here.
|
||
- End-of-janitor stale-lockfile tidy sweep (appless leftover lockfile unlink) is not directly
|
||
covered by a named test (not one of the 19); low risk (tidiness only). Noted, not a defect.
|
||
- test_14 (ppid race) depends on the helper reparenting to pid 1; under a subreaper it marks
|
||
NEVER_REPARENTED and FAILS VISIBLY (never false-passes). Passed in this env.
|
||
|
||
CONCLUSION: M1 — implementation verified — PASS. M2 (merge to main + live verification a–d) is
|
||
unblocked. Reminder for both loops: recipe-mirror PRs are !testme targets only — never merge/push
|
||
them. (After this verdict I may consult JOURNAL-conc to contextualize, per §6.1.)
|
||
|
||
## 2026-06-10T04:49Z — M2 merge integrity pre-check (M2 NOT yet claimed — not a verdict)
|
||
|
||
Builder merged the branch to main (merge commit `bb5eb3d`, 2 parents 83a6c6e∘d3fe9e2, no force)
|
||
after my M1 PASS, and is mid-M2 live verification (journal: M2(a) cancel-mid-run evidence, (b)
|
||
parallel runs triggered). No `claim(conc): M2` commit yet; STATUS-conc still shows the stale M1
|
||
line (Builder's file — will update at the M2 claim). Independent merge check:
|
||
- `git diff bb5eb3d d3fe9e2 -- runner/ .drone.yml docs/concurrency.md tests/ nix/` = EMPTY → the
|
||
merge preserved EXACTLY the code I cold-verified at M1. No conflict-resolution drift introduced.
|
||
- `git merge-base --is-ancestor d3fe9e2 bb5eb3d` = true.
|
||
So deployed main == M1-verified tree. At the M2 claim I therefore re-verify only LIVE behavior +
|
||
the push build, not the code again:
|
||
push build green; (a) cancel mid-run → no leaked python/lock, next janitor reaps the app, zero
|
||
leakage; (b) two parallel !testme (immich#2 + plausible#3) → both green, zero leakage; (c)
|
||
double-!testme same PR → 2nd blocks on the app lock (visible in its drone log) then runs; (d) one
|
||
full green end-to-end run. Evidence to come from Drone build logs + cc-ci state (abra app ls /
|
||
lslocks / docker), cold from my own access path.
|
||
|
||
## 2026-06-10T05:00Z — wrapper exit-code fix verified + CORRECTION to my P1 pre-review (inbox consumed)
|
||
|
||
Consumed ADVERSARY-INBOX.md (deleted) — Builder reported an M2 live-verify finding + fix. Folded in:
|
||
|
||
**The defect (real, Builder-found, build 269 plausible#3):** the drone exec step shell is `set -e`.
|
||
On a NORMAL (green) harness exit the P1 EXIT trap still fired and its `kill -TERM -- -$PID` of the
|
||
already-exited process group returned ESRCH (exit 1), which under `set -e` poisoned the step's exit
|
||
status to 1 — a fully GREEN run (all tiers pass, level=4) reported RED.
|
||
|
||
**CORRECTION — my P1 pre-review was wrong on this point.** In my 04:23Z pre-review I claimed to have
|
||
"empirically tested" green-on-red exit-code masking and REFUTED it. That test was run with plain
|
||
`bash -c` WITHOUT `set -e` — the wrong shell mode. The real drone step runs `set -e`, where the bug
|
||
manifests. I re-ran the matrix correctly now (bash -e), reproducing the bug (old wrapper + green +
|
||
set -e → exit 1) and confirming I had the shell mode wrong. Lesson: model the EXACT runtime
|
||
(set -e) for shell-trap behavior. The Builder caught this live; I did not. Owning it.
|
||
NB the failure direction was false-RED (green reported red) — fail-safe-ish, not a green-on-red
|
||
(no failing run was ever reported green); still a real defect.
|
||
|
||
**The fix (e1c4198 on branch, merged to main b7a009c) — independently verified by me, cold under
|
||
`set -e` (the correct mode this time):**
|
||
```
|
||
setsid cc-ci-run runner/run_recipe_ci.py & PID=$!
|
||
trap 'kill -TERM -- "-$PID" 2>/dev/null || true' TERM EXIT
|
||
rc=0; wait "$PID" || rc=$?
|
||
trap - TERM EXIT
|
||
exit "$rc"
|
||
```
|
||
My 4-path matrix (all under `bash -e`, exact-shape repros):
|
||
- A green harness → step exit 0 ✓ (poisoning gone: `|| true` on the trap kill + `trap - EXIT` before exit)
|
||
- B **red harness (exit 7) → step exit 7 ✓ — NOT masked to green.** Critical false-GREEN check
|
||
PASSES: `wait || rc=$?` captures the real rc and `exit "$rc"` propagates it. The
|
||
"failing PR must report RED" gate is preserved by the fix.
|
||
- C old wrapper + green + set -e → exit 1 ✓ (bug reproduced — root-cause confirmed)
|
||
- D cancel (TERM to wrapper mid-wait) → wrapper exits 143 AND the child received TERM
|
||
(CHILD_GOT_TERM logged) ✓ — cancel-forwarding semantics unchanged; the `trap - TERM EXIT` runs
|
||
only AFTER `wait` returns (post-forward), so it can't disarm the forward during a real cancel.
|
||
|
||
Verdict on the fix: CORRECT and SAFE — resolves the false-RED poisoning without introducing
|
||
false-GREEN, and preserves cancel forwarding. Folds cleanly into the pending M2 review.
|
||
|
||
**M1 status unaffected:** M1 PASS was for the code/suites/lint/diff of d3fe9e2; this wrapper
|
||
exit-code-under-set-e is a LIVE behavior M1's checks could not exercise (the trap only runs in the
|
||
real drone exec shell). main now = d3fe9e2 + this .drone.yml wrapper fix; the fix is verified above.
|
||
Open for the formal M2 verdict: re-confirm lint green on the new .drone.yml (yamllint), the push
|
||
build green, and live (a) cancel-no-leak / (b) parallel both-green / (c) double-!testme blocks /
|
||
(d) one full green run — cold, once the Builder posts the M2 claim with evidence.
|
||
|
||
## M2(c): FAIL @2026-06-10T08:10Z — double-!testme same domain corrupts shared deploy-count → both runs RED + VETO
|
||
|
||
Proactive cold break-it probe of the live M2 evidence (M2 not yet formally `claim(conc)`'d — the
|
||
Builder's JOURNAL shows (c) "triggered" but NOT evidenced as PASS; I went straight to the Drone API
|
||
to verify the in-flight (c) runs independently, not to the JOURNAL narrative). I found a REAL defect
|
||
that breaks M2(c). Filed as BACKLOG-conc CONC-A1.
|
||
|
||
EVIDENCE (Drone API, recipe-maintainers/cc-ci, cold via /run/secrets/bridge_drone_token — my own
|
||
access path, not the Builder's word):
|
||
- (c) = builds **279 + 281**, both `event=custom PR=2 RECIPE=immich REF=a92b28d…` → SAME domain
|
||
`immi-ad3e33.ci.commoninternet.net`. Both `status=failure` (step `ci` exit_code=1).
|
||
- 281 (the blocked run): log `== app lock: ... in flight — waiting ==` @2s → `== acquired ==` @194s,
|
||
which is exactly when 279's process exited (279 finished 05:07:35Z). **Lock serialisation + the
|
||
visible block line WORK** — that half of (c) is fine.
|
||
- 279 RED: `!! deploy-count 2 != 1 (DG4.1 violation)`.
|
||
- 281 RED: `FileNotFoundError: /tmp/ccci-deploys-immi-ad3e33….ci.commoninternet.net` at
|
||
run_recipe_ci.py:1213.
|
||
- Control build 275 (isolated immich, same fixed wrapper) → `deploy-count = 1`, GREEN. Confirms the
|
||
failure is concurrency-specific, NOT a pre-existing immich/wrapper regression.
|
||
|
||
ROOT CAUSE (code, confirmed):
|
||
- DG4.1 counter file is DOMAIN-keyed in shared /tmp, not per-run: `run_recipe_ci.py:930
|
||
/tmp/ccci-deploys-<domain>`. P3 isolated ABRA_DIR per run but this per-run state file was missed
|
||
(predates the restructure, ef44d46; the old recipe-flock serialised same-recipe runs end-to-end,
|
||
masking it).
|
||
- `deploy_app()` calls `_record_deploy()` (lifecycle.py:250) BEFORE `acquire_app_lock()` (:254,
|
||
introduced by P2 b302f3a) → the increment races OUTSIDE the lock. 281's single pre-lock
|
||
`_record_deploy` (@2s) bumps the shared counter 279 is using (→2, false violation), and 279's
|
||
end-of-run `os.remove(countfile)` (:1215) deletes the file under 281 → FileNotFoundError.
|
||
- Interleaving is fully reconstructed and self-consistent with the build timestamps (see CONC-A1).
|
||
|
||
This is squarely in M2(c) scope: the plan's DoD (c) requires the second run to "block … then RUN"
|
||
(implicitly green), and the phase's whole premise is "two concurrent !testme don't collide on
|
||
domain/volume/secrets." This is a domain-keyed-state collision — the restructure's narrower domain
|
||
lock no longer covers the deploy-count file. M1 (code/suites/lint/diff of d3fe9e2) is unaffected —
|
||
this is a live concurrency behavior M1's checks could not exercise; the tests/concurrency suite has
|
||
the matching blind spot (case 4 serialises acquire but never asserts deploy-count isolation across
|
||
two same-domain runs).
|
||
|
||
## VETO — M2 may NOT be marked DONE until CONC-A1 is fixed and I log a fresh (c) PASS
|
||
Forbidding `## DONE` in STATUS-conc until: (1) deploy-counter keyed per-run; (2) a tests/concurrency
|
||
case asserts same-domain deploy-count isolation; (3) live (c) re-run shows BOTH builds GREEN with
|
||
the visible block line and zero leakage; (4) (a),(b),(d) re-confirmed unaffected. Only I clear this.
|
||
(After this verdict I may consult JOURNAL-conc to contextualise — noting I had NOT read the (c)
|
||
journal reasoning before forming this FAIL; I verified from the Drone API + code directly.)
|
||
|
||
## 2026-06-10T08:20Z — CONC-A1 fix CODE-verified (veto conditions 1+2 met; 3+4 still pending — NOT cleared)
|
||
|
||
Builder fixed CONC-A1 (b6e12ef, merged main 139e319) and is re-running M2 live (a)–(d). I
|
||
cold-verified the FIX CODE from my own clone + a fresh checkout on cc-ci (not the Builder's word):
|
||
|
||
- **Condition (1) per-run keying — MET.** `run_recipe_ci._run_state_path(name)` keys all four
|
||
run-scoped state files (`deploys`, `opstate`, `deps`, `depskip`) by `run_id()` + `os.getpid()`,
|
||
never domain. Grep: ZERO residual `ccci-<state>-{domain}` literals in prod code (only the
|
||
app-LOCK path stays domain-keyed, which is correct). All consumers env-read `CCCI_*_FILE`
|
||
(lifecycle:148, deps:72/155, generic:134) — no path re-derivation. Uniqueness holds even in the
|
||
manual fallback (`run_id()`→domain) because the `+pid` suffix separates two processes.
|
||
- **Condition (2) same-domain isolation test — MET, and proven non-tautological.**
|
||
tests/concurrency/test_run_state.py adds test_20/20b/20c. test_20c drives REAL processes + the
|
||
REAL lock + real `_run_state_path`/`_record_deploy`, reproducing the 279/281 interleaving: run A
|
||
reads `COUNT 1` (NOT polluted to 2 by B's pre-lock increment) and B's file survives A's remove
|
||
(no FileNotFoundError). **Mutation check (my own):** reverting `_run_state_path` to domain-keying
|
||
in a throwaway cc-ci clone → all 3 test_run_state cases FAIL (incl. test_20c). So the test
|
||
genuinely guards the fix.
|
||
- **Suites cold (fresh clone @4f6c955 on cc-ci):** unit 138 passed, concurrency 23 passed (was 20),
|
||
concurrency still NOT collected by the default `pytest tests/unit` run (0). lint not re-run here
|
||
(no .drone.yml/nix change in the fix; will confirm at the M2 claim).
|
||
|
||
**VETO NOT cleared.** Conditions (3) live (c) re-run BOTH builds GREEN + visible block line + zero
|
||
leakage, and (4) (a)/(b)/(d) re-confirmed on the fixed harness, still require the Builder's live
|
||
evidence (in flight). The code fix strongly predicts a (c) pass but M2 is a LIVE gate — I will
|
||
re-verify the (c) double-!testme cold from the Drone API once the Builder posts the M2 claim, and
|
||
only then clear the veto.
|
||
|
||
## 2026-06-10T08:43Z — live (c) round-2 (builds 290+291): serialization CONFIRMED via lslocks; delay is an immich-ML flake, NOT the restructure (not a verdict)
|
||
|
||
(b)+(d) re-passed on the fixed harness (builds 287 immich#2 + 288 plausible#3, parallel, both
|
||
success — I'll re-confirm at the M2 claim). (c) round 2 = builds 290+291 (both custom PR=2 immich,
|
||
same domain immi-ad3e33), started 08:22:30Z. I inspected the LIVE host state cold (my own ssh):
|
||
|
||
- **CORE INVARIANT DIRECTLY OBSERVED in the kernel lock table** — strongest possible proof of the
|
||
double-!testme serialization:
|
||
`lslocks`: pid 739163 (build 290) holds `WRITE` on cc-ci-app-immi-ad3e33….lock; pid 739341
|
||
(build 291) is blocked `WRITE*` on the SAME lock. Exactly one holder, one waiter, one inode.
|
||
- 290 (holder) is sleeping in `services_converged()` poll (hrtimer_nanosleep, no abra child) because
|
||
`immich-machine-learning` is stuck 0/1: its container repeatedly fails the healthcheck
|
||
(`non-zero exit (143): dockerexec: unhealthy container`, swarm restarting every 1–6 min). Current
|
||
attempt (08:43) has gunicorn up, health `starting` — slow/flaky ML readiness, not a deploy break.
|
||
- NOT caused by the restructure / teardown: 290's immich volumes (model-cache/postgres/uploads) +
|
||
.env are all from 290's OWN fresh deploy (08:23), not inherited from the earlier same-domain run
|
||
287. ML image present (1.36GB, no pull), host healthy (5.2Gi mem free, 65G disk). So this is an
|
||
immich-ML healthcheck flake, orthogonal to concurrency.
|
||
|
||
Bearing on M2(c): the SERIALIZATION mechanism under test is verified working live. The "both GREEN"
|
||
half of condition (3) is not yet demonstrated only because 290 is flake-blocked on immich-ML; if 290
|
||
REDs on deploy-timeout, (c) needs a clean re-run (flake, not a code fault). VETO unchanged — I still
|
||
require one clean (c) where both same-domain builds go GREEN with the block line + zero leakage.
|
||
Continuing to watch 290/291 to terminal.
|
||
|
||
## M2(c): PASS @2026-06-10T09:05Z — double-!testme same domain, CONC-A1 fixed; VETO LIFTED
|
||
|
||
(c) round-2 builds 290+291 (both `custom PR=2 immich`, same domain immi-ad3e33, on CONC-A1-fixed
|
||
main) both reached terminal **status=success**. Cold-verified from the Drone API + live host (my own
|
||
access path), not the Builder's word:
|
||
|
||
- **Both GREEN:** 290 success, 291 success (Drone API).
|
||
- **Visible block line (the (c) requirement):** 291 log —
|
||
`== app lock: another run of immi-ad3e33….ci.commoninternet.net is in flight — waiting ==`
|
||
then `== app lock: acquired … ==`. I ALSO observed the serialization directly in the kernel lock
|
||
table mid-run (lslocks: 290 held WRITE, 291 blocked WRITE* on the same inode; after 290 exited,
|
||
291 held it). Strongest possible proof of the double-!testme serialization invariant.
|
||
- **CONC-A1 regression GONE — the two exact round-1 failure points are now clean:**
|
||
- 290 (round-1 build 279 got false `deploy-count 2 != 1`) → now `deploy-count = 1 (expect 1)`,
|
||
all 5 tiers pass, level=4. Its run-keyed counter was NOT polluted by 291's concurrent pre-lock
|
||
`_record_deploy`.
|
||
- 291 (round-1 build 281 crashed `FileNotFoundError` at run_recipe_ci.py:1213) → now
|
||
`deploy-count = 1 (expect 1)`, all tiers pass, level=4, no traceback. Its own run-keyed countfile
|
||
survived 290's end-of-run remove.
|
||
- **Zero leakage after both:** 0 harness procs, 0 immich apps / services / volumes / secrets, no held
|
||
cc-ci locks. One unheld 0-byte leftover lockfile (mtime 08:46, 291's acquisition touch) — reaped
|
||
on sight by the next janitor probe, harmless by design.
|
||
- The ~20-min runtime each was an immich-machine-learning healthcheck slowness/flake (ML eventually
|
||
converged), NOT the restructure — already diagnosed in the 08:43Z note; serialization + isolation
|
||
both verified correct regardless.
|
||
|
||
**VETO LIFTED.** The CONC-A1 veto ("no DONE until CONC-A1 fixed + a fresh (c) PASS") is cleared:
|
||
conditions (1) per-run keying [code + mutation-proven], (2) same-domain isolation test
|
||
[non-tautological], and (3) live (c) both-GREEN + block line + zero leakage are ALL met. CONC-A1
|
||
closed in BACKLOG-conc.
|
||
|
||
**Still required before DONE (full M2 gate, not the CONC-A1 veto):** the Builder must post the formal
|
||
M2 claim in STATUS-conc with consolidated evidence, and I re-confirm condition (4) — specifically
|
||
**M2(a) cancel-mid-run re-run on the CONC-A1-fixed harness** (b+d already re-confirmed: builds
|
||
287+288 parallel both success on fixed main; a's only prior evidence (build 267) was on the
|
||
pre-CONC-A1, pre-wrapper-fix harness) — plus the push build green on current main. (a) re-run had
|
||
not yet appeared in Drone as of this verdict (Builder sequenced it after (c)). I will verify it cold
|
||
when it lands.
|
||
|
||
## M2: PASS @2026-06-10T08:55Z — merged + live-verified (a)–(d) on final main 139e319/74ed240
|
||
|
||
Formal M2 gate verdict against the Builder's M2 claim (STATUS-conc, commit 74ed240). Formed from
|
||
the plan (SSOT), the code/git, the claim's verify recipe, and my OWN cold re-runs from my own clone
|
||
+ fresh checkouts/Drone-API on cc-ci — not the Builder's narrative. All seven claim items confirmed:
|
||
|
||
1. **Merge integrity** — `git diff 139e319 b6e12ef -- runner/ tests/ docs/ .drone.yml nix/` = 0 lines;
|
||
`b6e12ef ⊆ 139e319`; merge parents `2173894 ∘ b6e12ef`. So deployed main code == the CONC-A1 tree
|
||
I code-verified + mutation-proofed. No force-push (history linear). NB the claim mis-states the
|
||
first parent as `4ad55ed` (actual `2173894`, my M2(c)-FAIL commit) — immaterial: that's a state-
|
||
file commit, and the code-diff-empty check is authoritative.
|
||
2. **Push build green** — Drone push builds 283–298 on main all `status=success`; no red push since
|
||
the merge.
|
||
3. **Suites + lint (cold, fresh clone on cc-ci)** — unit 138 passed, concurrency 23 passed
|
||
(concurrency NOT in the default unit gate), `lint: PASS` on final main 74ed240. test_run_state
|
||
mutation-proofed (reverting to domain-keying fails all 3 cases).
|
||
4. **(a) cancel-mid-run on fixed harness** — build 295 (custom immich#2): lockfile mtime 08:50:17
|
||
proves it acquired the app lock 7s in → canceled @08:51:05 MID-DEPLOY. After cancel (verified cold
|
||
~1 min later): 0 harness procs (no leaked python — old §8.1 gap stays closed), no held locks (lock
|
||
released), no immich app/.env/containers(even stopped)/services/volumes/secrets → ZERO leakage,
|
||
full teardown. Killed-step logs not API-retrievable (Drone truncates), but the end-state is the
|
||
actual test and it is clean.
|
||
5. **(b) parallel runs** — builds 287 (immich#2) + 288 (plausible#3), parallel, both
|
||
`status=success`, both `deploy-count = 1 (expect 1)`, level=4; host after = zero leakage.
|
||
6. **(c) double-!testme same PR** — builds 290 + 291 (same immich domain): both success, 291 logged
|
||
the block line then `acquired`, both `deploy-count = 1`, zero leakage. Serialization also observed
|
||
directly in the kernel lock table mid-run (lslocks). Covered in detail by my M2(c) PASS @09:05Z.
|
||
7. **(d) full green e2e** — build 287 (and 290): complete immich run, all 5 tiers pass, level=4.
|
||
|
||
Both M2-found fixes are folded in and independently verified: wrapper exit-code-under-set-e
|
||
(e1c4198/b7a009c, my 05:00Z note — red still propagates) and CONC-A1 run-keyed state files
|
||
(b6e12ef/139e319, my 09:05Z M2(c) PASS + mutation proof). The ~20-min (c) runtimes were an
|
||
immich-ML healthcheck flake (converged within DEPLOY_TIMEOUT=1500s), orthogonal to the restructure
|
||
(diagnosed 08:43Z). Unheld 0-byte leftover lockfiles are by-design (next-janitor tidy-sweep).
|
||
|
||
GUARDRAILS honored end-to-end: recipe-mirror PRs (immich#2, plausible#3) used as !testme targets
|
||
only, never merged/pushed; cc-ci main touched only by the gated merges (no force-push); no secrets in
|
||
any commit. RUN_APP_RE / services_converged / warm-canonical flows untouched (M1 diff review).
|
||
|
||
CONCLUSION: **M2 — merged + live-verified — PASS.** M1 PASS (04:38Z) + M2 PASS (here) are both fresh
|
||
in REVIEW-conc; no open VETO (CONC-A1 lifted). Per the phase DoD the Builder may now write `## DONE`
|
||
to STATUS-conc. (Post-verdict I may consult JOURNAL-conc to contextualize; I had NOT read its M2
|
||
reasoning before forming this verdict — verified from plan + code/git + Drone API + my own cold runs.)
|