Files
cc-ci/REVIEW-conc.md
autonomic-bot 56723ae0ec
All checks were successful
continuous-integration/drone/push Build is passing
chore(conc): M2 merge-integrity pre-check — merged main == M1-verified tree (not a verdict)
2026-06-10 04:49:55 +00:00

214 lines
16 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# REVIEW-conc.md — Adversary ledger, concurrency-restructure phase
Append-only. Verdicts: `<gate>: PASS @<ts>` + evidence, or `FAIL` + [adversary] finding in
BACKLOG-conc.md. SSOT for what is verified: /srv/cc-ci/cc-ci-plan/concurrency-restructure-full-plan.md.
## 2026-06-10T04:00Z — Adversary online; baseline pre-read (no gate pending)
Pulled main @5b65c6c. No STATUS-conc.md, no `restructure/concurrency` branch — nothing claimed yet.
Pre-read the CURRENT system (docs/concurrency.md @5b65c6c + lifecycle.py/run_recipe_ci.py) to
anchor my later diff review in the as-is code, not the Builder's narrative.
Current-system facts I will hold the restructure against:
- Registry symbols slated for deletion (will grep for dangling refs at M1):
`register_run_app` (lifecycle.py:69, call site :283), `unregister_run_app` (:78, call sites :723, :766),
`_run_owner_state` (:83), `ACTIVE_RUN_DIR` (:43), `CCCI_JANITOR_MAX_AGE` (janitor :738),
`acquire_recipe_lock` (:46, call site run_recipe_ci.py:843), `RECIPE_LOCK_DIR` (:42).
- Must survive untouched: `RUN_APP_RE` (lifecycle.py:26) allowlist semantics (warm/canonical apps
never probed), `services_converged()` paused-is-settled logic, docker-service sweep discovery,
`teardown_app(verify=False)` idempotence.
- M1 verification plan (cold, my clone): checkout branch; `pytest tests/unit -q`,
`pytest tests/concurrency -q`, `scripts/lint.sh`; full diff review hunting: probe-vs-acquire
ordering races, signal-handler reentrancy (SIGTERM during teardown / SIGALRM during SIGTERM),
teardown-during-teardown, lock-fd lifetime (object dropped → GC closes fd → lock silently
released), symlinked servers/ write conflicts, janitor unlink-vs-reacquire race (unlink while a
waiter blocks on the old inode → two "held" locks on different inodes for one domain),
PDEATHSIG-after-fork ordering (prctl before ppid check), alarm(0) vs teardown duration,
setsid wrapper trap semantics under drone cancel, test-suite blind spots vs the 19 planned cases.
- Tests/concurrency must NOT be wired into the default `pytest tests/unit` gate (plan decision).
- M2 (post-merge, live): cancel-mid-run leak check, parallel immich#2+plausible#3, double-!testme
same PR blocks visibly, one full green run. NEVER merge/push recipe mirror repos.
No verdict yet — waiting for Builder bootstrap/claim.
## 2026-06-10T04:05Z — cold-verify environment established (prep, no gate)
Builder seeded STATUS/BACKLOG/JOURNAL-conc; STATUS says P1 in flight, no gate claimed. Mapped the
test-execution environment I'll use for the M1 cold run so a time-sensitive gate isn't spent
debugging tooling:
- Local VM devshell (`nix develop`) has only lintTools (no pytest). So pytest does NOT run here.
- pytest 8.3.3 + playwright live in the host `pyEnv` (nix/modules/harness.nix) exposed as
`cc-ci-run` on cc-ci. `cc-ci-run -m pytest <path> -q` works as the real harness interpreter
(verified: `cc-ci-run -c "import pytest" -> 8.3.3`).
- `.drone.yml` lint stage runs `nix develop .#lint --command bash scripts/lint.sh`.
- COLD M1 PLAN: fresh `git clone`/checkout of `restructure/concurrency` into a throwaway dir ON
cc-ci → `cc-ci-run -m pytest tests/unit -q` + `cc-ci-run -m pytest tests/concurrency -q` +
`nix develop .#lint --command bash scripts/lint.sh`, all from that clean checkout (not the
Builder's working tree). Then adversarial diff review per my baseline hit-list.
- Baseline `.drone.yml` on main is still the pre-restructure version (concurrency.limit=2,
acquire_recipe_lock / /run/cc-ci-active registry referenced) — confirms P1/P4 edits are
branch-only so far. Good.
## 2026-06-10T04:23Z — early pre-review of P1+P2 (branch @b302f3a, NO gate claimed — NOT a verdict)
Builder has pushed P1 (b492f99) + P2 (b302f3a) to restructure/concurrency; P3/P4/P5/tests still
pending, so M1 is not claimable and this is NOT a PASS — it's pre-review to front-load the M1 diff
audit and avoid re-doing it under gate time pressure. Read code/diff + git only; did NOT read
JOURNAL (anti-anchoring intact). I actively tried to break the following and each concern was
REFUTED:
1. **Green-on-red via the .drone.yml EXIT trap** (my lead hypothesis). The wrapper is
`setsid cc-ci-run … & PID=$!; trap 'kill -TERM -- -$PID' TERM EXIT; wait $PID`. I worried the
EXIT trap's final `kill` status would override the harness exit code and mask a failing run.
EMPIRICALLY TESTED (4 bash repros incl. failing harness with a lingering group member that
makes kill succeed=0): bash PRESERVES the pre-trap exit status when the EXIT trap doesn't call
`exit`. Exit code propagates correctly in all cases (RED stays RED, GREEN stays GREEN). Refuted.
2. **P2 unlink/reacquire inode race** (janitor unlinks a reaped orphan's lockfile while a new run
blocks on the old inode). Handled: both acquire_app_lock and _probe_and_reap recheck
`fstat(fd).st_ino == stat(path).st_ino` after acquiring and retry/bail on mismatch — a lock on
an unlinked (anonymous) inode is never treated as authoritative, and the path's lockfile is
never unlinked out from under a newer run. Refuted.
3. **Half-reaped/new-app coexistence.** Reap runs WHILE HOLDING the probe lock; a new same-domain
run blocks in acquire_app_lock until reap completes. The pre-deploy window (lock held, app not
yet created) is covered: the stale-lockfile sweep sees the held lock (BlockingIOError) and
leaves it. Refuted.
4. **Signal mid-normal-teardown aborting cleanup.** begin_teardown() is the FIRST line of BOTH
finally blocks (run_recipe_ci.py:663 run_quick, :1134 main); the _funnel_handler swallows
(logs+returns) any SIGTERM/SIGALRM once tearing_down is set, so a second signal can't abort the
cleanup the first asked for. install_lifetime_guards() is the FIRST statement of main() (:829),
before any abra/lock call, with prctl→ppid==1 recheck in the correct order. Refuted.
Open items to confirm AT M1 (cold, full suite) — NOT defects, just unverified-until-then:
- `datetime` import removed from lifecycle.py along with _stack_age_seconds — grep for any
remaining datetime use (ruff would catch an undefined name; confirm import truly orphaned).
- `_stack_name` / age-fallback deadcode after the janitor rewrite — confirm no dangling refs.
- Registry-symbol deletion is only PARTIAL on this commit: acquire_recipe_lock still present
(P3 deletes it); register/unregister/_run_owner_state/ACTIVE_RUN_DIR/CCCI_JANITOR_MAX_AGE are
gone — full dangling-ref grep belongs at M1 once P3 lands.
- setsid-fork edge: if `setsid` ever forks (only when it's a pgrp leader; not the case for a
backgrounded job in a non-job-control drone shell), $PID would be the intermediate and the
harness would reparent to ppid==1 and self-abort. Live-verify the trap+cancel path at M2(a).
- begin_teardown is process-global module state (lifetime._state) — fine for one harness process;
the tests/concurrency suite must not import-share it across in-process cases (verify at M1).
## 2026-06-10T04:32Z — pre-review P3+P4 (branch @91d3cc7, NO gate claimed — NOT a verdict)
Builder pushed P3 (17ebdf3 per-run ABRA_DIR) + P4 (91d3cc7 config cleanup). tests/concurrency +
P5 docs still pending, so M1 still not claimable. Continued the front-loaded diff audit (code/git
only; JOURNAL still unread). Findings — all CLEAN:
- **Dangling-ref grep across runner/bridge/dashboard/nix = ZERO hits** for all 9 deleted symbols:
acquire_recipe_lock, register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR,
CCCI_JANITOR_MAX_AGE, RECIPE_LOCK_DIR, _stack_age_seconds, _registry_path. The orphaned
`datetime` import is also gone from lifecycle.py. Clean deletion.
- **Path centralization**: all `~/.abra/recipes/<recipe>` literals replaced by `abra.recipe_dir()`
(resolves `$ABRA_DIR else ~/.abra`) across abra.py (recipe_checkout, has_lightweight_version_tags,
recipe_head_commit, recipe_versions), generic._recipe_dir, lifecycle.prepull_images,
snapshot_recipe_tests, fetch_recipe. prepull's env_path stays canonical `~/.abra/servers/...`
which is correct (servers/ is the shared symlink target).
- **Ordering verified** (main(), the only structural risk): install_lifetime_guards() is the FIRST
stmt (873); between it and setup_run_abra_dir() (891) there are ONLY env reads + a print — no
abra call; ABRA_DIR is exported at 891 BEFORE fetch_recipe (892) and before the first path-helper
recipe_head_commit (895). The `--quick` dispatch (run_quick, ~908) is AFTER 891, so the quick lane
inherits the per-run ABRA_DIR too. No tree is touched before ABRA_DIR is set.
- **Manual-run isolation**: rid=="manual" → "manual-<pid>" so two hand-runs don't share a tree.
Open items to confirm AT M1 (cold) — not defects:
- setup_run_abra_dir symlink idempotency: `if not os.path.islink(link): os.symlink(...)` — if a
NON-symlink file pre-exists at servers/catalogue (reused run dir from a crashed partial), symlink
raises FileExistsError. Low risk (fresh run-id per Drone build) but worth a glance.
- CCCI_SKIP_FETCH=1 now `rm -rf dest` + copytree(canonical, dest, symlinks=True) — confirm the
--quick rollback-proof staging tests still pass (they set CCCI_SKIP_FETCH).
- tests/{ghost,discourse}/install_steps.sh RECIPE_DIR=${ABRA_DIR:-$HOME/.abra} mechanical path fix
— confirm it changed NO assertion/gate (guardrail: never weaken recipe-test gates). Diff-check.
Net: the entire P1P4 diff has been pre-audited and is clean against my break-it hit-list. M1 cold
run, once claimed (after tests/concurrency + P5 land), reduces to: fresh checkout on cc-ci →
`cc-ci-run -m pytest tests/unit -q` + `cc-ci-run -m pytest tests/concurrency -q` + lint, plus a
focused review of only the tests/concurrency suite (vs the 19 planned cases) and the P5 doc delta.
## M1: PASS @2026-06-10T04:38Z — implementation verified (branch restructure/concurrency @d3fe9e2)
Verdict formed from the plan (SSOT), the code/git, the STATUS claim's verify recipe, and my own
COLD acceptance run — WITHOUT reading JOURNAL first (anti-anchoring honored; noting here that I had
NOT consulted JOURNAL-conc at verdict time).
COLD ENVIRONMENT: fresh `git clone --branch restructure/concurrency` into /tmp/adv-m1 on cc-ci
(NOT the Builder's tree); `git rev-parse HEAD == d3fe9e26bb0fbaedb37383539ba3973bc1c80aff` (matches
claim), `git status` clean. Ran via the host `cc-ci-run` pyEnv (pytest 8.3.3 + playwright) and the
pinned `.#lint` devshell.
ACCEPTANCE RESULTS (expected → observed):
- `cc-ci-run -m pytest tests/unit -q` → 138 passed in 4.72s ✓ (claim: 138 passed)
- `cc-ci-run -m pytest tests/concurrency -q` → 20 passed in 9.91s ✓ (claim: 20 passed)
- `nix develop .#lint --command bash scripts/lint.sh``lint: PASS`
- `pytest tests/unit --collect-only` concurrency items → 0 ✓ (suite NOT in default gate)
- dangling-ref grep (register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR,
CCCI_JANITOR_MAX_AGE, acquire_recipe_lock, RECIPE_LOCK_DIR, _stack_age_seconds) over
*.py/*.nix/*.yml/*.sh → ZERO hits outside docs/ ✓
GATE-INTEGRITY (guardrails honored):
- `RUN_APP_RE` regex unchanged (lifecycle.py:26, identical pattern); warm/canonical apps still
never become probe candidates (test_11 asserts no lockfiles even created for warm names).
- `services_converged()` / paused-is-settled / `backup_app()` waits: NOT in the code diff — all
RUN_APP_RE/services_converged/paused diff hits are docs/concurrency.md prose (P5 rewrite).
- `teardown_app` ordering untouched; only its trailing unregister call removed (registry gone).
- Only `tests/<recipe>/` change is the mechanical `RECIPE_DIR=${ABRA_DIR:-$HOME/.abra}/...` line
in ghost+discourse install_steps.sh — NO assertion/gate touched (diff-confirmed). Guardrail
"never weaken recipe-test gates / touch tests/<recipe>/ content" honored.
- P4: `concurrency.limit` block removed from .drone.yml; drone-runner.nix comment makes
DRONE_RUNNER_CAPACITY the single knob.
ADVERSARIAL DIFF REVIEW (P1P4 pre-audited in the two notes above; refuted: green-on-red exit-code
masking [empirically tested], unlink/reacquire inode race [fstat==stat identity recheck],
half-reaped coexistence [reap-under-probe-lock], signal-mid-teardown reentrancy [begin_teardown
first line of both finally blocks], guard/ABRA_DIR/fetch ordering [no abra call pre-export]).
TEST-SUITE AUDIT vs the 19 plan cases: real kernel flocks, NEVER mocked (only teardown_app +
abra-discovery stubbed, both disclosed). Coverage complete: cases 14 test_locks, 512
test_janitor, 1316 test_lifetime, 1719 test_abra_dir, +test_18b (manual-pid isolation) = 20.
Assertions are substantive, not tautological: exact funnel exit codes 142/143 (test_15/16),
reap-vs-new-run timestamp ordering + fresh-inode `lock_state=="held"` (test_7), two-janitor
arbitration via separate open()s (test_8 — valid: flock binds the open file description, so
threads-with-distinct-fds model processes), long-held mtime-backdate flag-not-steal (test_10),
PEP 446 fd non-inheritance with a surviving child (test_3), divergent per-run trees + canonical
untouched (test_18).
INDEPENDENT PROBE (my own driver, NOT the Builder's helpers.py): drove the real
`lifecycle.acquire_app_lock` from a standalone script with a sandbox CCCI_APP_LOCK_DIR on cc-ci →
state `held` after acquire; a second acquirer BLOCKED while the first held (no ack2 after 1.5s);
after `SIGKILL` of the holder the second acquired within 10s (kernel auto-release). Core invariant
confirmed against the real code, not just the Builder's tests.
NON-BLOCKING NOTES (carry to M2 live-verify; none gate M1):
- setsid-fork edge in the .drone.yml trap wrapper: if `setsid` ever forks (only when it's a pgrp
leader — not the case for a backgrounded job in a non-job-control drone shell), $PID would be the
intermediate and the harness could reparent (ppid==1) and self-abort. MUST be live-verified by
the actual drone-cancel path at M2(a) — the plan already flags this ("verify drone exec runner
signal delivery; the trap must fire on drone cancel"). Not unit-testable here.
- End-of-janitor stale-lockfile tidy sweep (appless leftover lockfile unlink) is not directly
covered by a named test (not one of the 19); low risk (tidiness only). Noted, not a defect.
- test_14 (ppid race) depends on the helper reparenting to pid 1; under a subreaper it marks
NEVER_REPARENTED and FAILS VISIBLY (never false-passes). Passed in this env.
CONCLUSION: M1 — implementation verified — PASS. M2 (merge to main + live verification ad) is
unblocked. Reminder for both loops: recipe-mirror PRs are !testme targets only — never merge/push
them. (After this verdict I may consult JOURNAL-conc to contextualize, per §6.1.)
## 2026-06-10T04:49Z — M2 merge integrity pre-check (M2 NOT yet claimed — not a verdict)
Builder merged the branch to main (merge commit `bb5eb3d`, 2 parents 83a6c6e∘d3fe9e2, no force)
after my M1 PASS, and is mid-M2 live verification (journal: M2(a) cancel-mid-run evidence, (b)
parallel runs triggered). No `claim(conc): M2` commit yet; STATUS-conc still shows the stale M1
line (Builder's file — will update at the M2 claim). Independent merge check:
- `git diff bb5eb3d d3fe9e2 -- runner/ .drone.yml docs/concurrency.md tests/ nix/` = EMPTY → the
merge preserved EXACTLY the code I cold-verified at M1. No conflict-resolution drift introduced.
- `git merge-base --is-ancestor d3fe9e2 bb5eb3d` = true.
So deployed main == M1-verified tree. At the M2 claim I therefore re-verify only LIVE behavior +
the push build, not the code again:
push build green; (a) cancel mid-run → no leaked python/lock, next janitor reaps the app, zero
leakage; (b) two parallel !testme (immich#2 + plausible#3) → both green, zero leakage; (c)
double-!testme same PR → 2nd blocks on the app lock (visible in its drone log) then runs; (d) one
full green end-to-end run. Evidence to come from Drone build logs + cc-ci state (abra app ls /
lslocks / docker), cold from my own access path.