197 lines
14 KiB
Markdown
197 lines
14 KiB
Markdown
# REVIEW-conc.md — Adversary ledger, concurrency-restructure phase
|
||
|
||
Append-only. Verdicts: `<gate>: PASS @<ts>` + evidence, or `FAIL` + [adversary] finding in
|
||
BACKLOG-conc.md. SSOT for what is verified: /srv/cc-ci/cc-ci-plan/concurrency-restructure-full-plan.md.
|
||
|
||
## 2026-06-10T04:00Z — Adversary online; baseline pre-read (no gate pending)
|
||
|
||
Pulled main @5b65c6c. No STATUS-conc.md, no `restructure/concurrency` branch — nothing claimed yet.
|
||
Pre-read the CURRENT system (docs/concurrency.md @5b65c6c + lifecycle.py/run_recipe_ci.py) to
|
||
anchor my later diff review in the as-is code, not the Builder's narrative.
|
||
|
||
Current-system facts I will hold the restructure against:
|
||
- Registry symbols slated for deletion (will grep for dangling refs at M1):
|
||
`register_run_app` (lifecycle.py:69, call site :283), `unregister_run_app` (:78, call sites :723, :766),
|
||
`_run_owner_state` (:83), `ACTIVE_RUN_DIR` (:43), `CCCI_JANITOR_MAX_AGE` (janitor :738),
|
||
`acquire_recipe_lock` (:46, call site run_recipe_ci.py:843), `RECIPE_LOCK_DIR` (:42).
|
||
- Must survive untouched: `RUN_APP_RE` (lifecycle.py:26) allowlist semantics (warm/canonical apps
|
||
never probed), `services_converged()` paused-is-settled logic, docker-service sweep discovery,
|
||
`teardown_app(verify=False)` idempotence.
|
||
- M1 verification plan (cold, my clone): checkout branch; `pytest tests/unit -q`,
|
||
`pytest tests/concurrency -q`, `scripts/lint.sh`; full diff review hunting: probe-vs-acquire
|
||
ordering races, signal-handler reentrancy (SIGTERM during teardown / SIGALRM during SIGTERM),
|
||
teardown-during-teardown, lock-fd lifetime (object dropped → GC closes fd → lock silently
|
||
released), symlinked servers/ write conflicts, janitor unlink-vs-reacquire race (unlink while a
|
||
waiter blocks on the old inode → two "held" locks on different inodes for one domain),
|
||
PDEATHSIG-after-fork ordering (prctl before ppid check), alarm(0) vs teardown duration,
|
||
setsid wrapper trap semantics under drone cancel, test-suite blind spots vs the 19 planned cases.
|
||
- Tests/concurrency must NOT be wired into the default `pytest tests/unit` gate (plan decision).
|
||
- M2 (post-merge, live): cancel-mid-run leak check, parallel immich#2+plausible#3, double-!testme
|
||
same PR blocks visibly, one full green run. NEVER merge/push recipe mirror repos.
|
||
|
||
No verdict yet — waiting for Builder bootstrap/claim.
|
||
|
||
## 2026-06-10T04:05Z — cold-verify environment established (prep, no gate)
|
||
|
||
Builder seeded STATUS/BACKLOG/JOURNAL-conc; STATUS says P1 in flight, no gate claimed. Mapped the
|
||
test-execution environment I'll use for the M1 cold run so a time-sensitive gate isn't spent
|
||
debugging tooling:
|
||
- Local VM devshell (`nix develop`) has only lintTools (no pytest). So pytest does NOT run here.
|
||
- pytest 8.3.3 + playwright live in the host `pyEnv` (nix/modules/harness.nix) exposed as
|
||
`cc-ci-run` on cc-ci. `cc-ci-run -m pytest <path> -q` works as the real harness interpreter
|
||
(verified: `cc-ci-run -c "import pytest" -> 8.3.3`).
|
||
- `.drone.yml` lint stage runs `nix develop .#lint --command bash scripts/lint.sh`.
|
||
- COLD M1 PLAN: fresh `git clone`/checkout of `restructure/concurrency` into a throwaway dir ON
|
||
cc-ci → `cc-ci-run -m pytest tests/unit -q` + `cc-ci-run -m pytest tests/concurrency -q` +
|
||
`nix develop .#lint --command bash scripts/lint.sh`, all from that clean checkout (not the
|
||
Builder's working tree). Then adversarial diff review per my baseline hit-list.
|
||
- Baseline `.drone.yml` on main is still the pre-restructure version (concurrency.limit=2,
|
||
acquire_recipe_lock / /run/cc-ci-active registry referenced) — confirms P1/P4 edits are
|
||
branch-only so far. Good.
|
||
|
||
## 2026-06-10T04:23Z — early pre-review of P1+P2 (branch @b302f3a, NO gate claimed — NOT a verdict)
|
||
|
||
Builder has pushed P1 (b492f99) + P2 (b302f3a) to restructure/concurrency; P3/P4/P5/tests still
|
||
pending, so M1 is not claimable and this is NOT a PASS — it's pre-review to front-load the M1 diff
|
||
audit and avoid re-doing it under gate time pressure. Read code/diff + git only; did NOT read
|
||
JOURNAL (anti-anchoring intact). I actively tried to break the following and each concern was
|
||
REFUTED:
|
||
|
||
1. **Green-on-red via the .drone.yml EXIT trap** (my lead hypothesis). The wrapper is
|
||
`setsid cc-ci-run … & PID=$!; trap 'kill -TERM -- -$PID' TERM EXIT; wait $PID`. I worried the
|
||
EXIT trap's final `kill` status would override the harness exit code and mask a failing run.
|
||
EMPIRICALLY TESTED (4 bash repros incl. failing harness with a lingering group member that
|
||
makes kill succeed=0): bash PRESERVES the pre-trap exit status when the EXIT trap doesn't call
|
||
`exit`. Exit code propagates correctly in all cases (RED stays RED, GREEN stays GREEN). Refuted.
|
||
2. **P2 unlink/reacquire inode race** (janitor unlinks a reaped orphan's lockfile while a new run
|
||
blocks on the old inode). Handled: both acquire_app_lock and _probe_and_reap recheck
|
||
`fstat(fd).st_ino == stat(path).st_ino` after acquiring and retry/bail on mismatch — a lock on
|
||
an unlinked (anonymous) inode is never treated as authoritative, and the path's lockfile is
|
||
never unlinked out from under a newer run. Refuted.
|
||
3. **Half-reaped/new-app coexistence.** Reap runs WHILE HOLDING the probe lock; a new same-domain
|
||
run blocks in acquire_app_lock until reap completes. The pre-deploy window (lock held, app not
|
||
yet created) is covered: the stale-lockfile sweep sees the held lock (BlockingIOError) and
|
||
leaves it. Refuted.
|
||
4. **Signal mid-normal-teardown aborting cleanup.** begin_teardown() is the FIRST line of BOTH
|
||
finally blocks (run_recipe_ci.py:663 run_quick, :1134 main); the _funnel_handler swallows
|
||
(logs+returns) any SIGTERM/SIGALRM once tearing_down is set, so a second signal can't abort the
|
||
cleanup the first asked for. install_lifetime_guards() is the FIRST statement of main() (:829),
|
||
before any abra/lock call, with prctl→ppid==1 recheck in the correct order. Refuted.
|
||
|
||
Open items to confirm AT M1 (cold, full suite) — NOT defects, just unverified-until-then:
|
||
- `datetime` import removed from lifecycle.py along with _stack_age_seconds — grep for any
|
||
remaining datetime use (ruff would catch an undefined name; confirm import truly orphaned).
|
||
- `_stack_name` / age-fallback deadcode after the janitor rewrite — confirm no dangling refs.
|
||
- Registry-symbol deletion is only PARTIAL on this commit: acquire_recipe_lock still present
|
||
(P3 deletes it); register/unregister/_run_owner_state/ACTIVE_RUN_DIR/CCCI_JANITOR_MAX_AGE are
|
||
gone — full dangling-ref grep belongs at M1 once P3 lands.
|
||
- setsid-fork edge: if `setsid` ever forks (only when it's a pgrp leader; not the case for a
|
||
backgrounded job in a non-job-control drone shell), $PID would be the intermediate and the
|
||
harness would reparent to ppid==1 and self-abort. Live-verify the trap+cancel path at M2(a).
|
||
- begin_teardown is process-global module state (lifetime._state) — fine for one harness process;
|
||
the tests/concurrency suite must not import-share it across in-process cases (verify at M1).
|
||
|
||
## 2026-06-10T04:32Z — pre-review P3+P4 (branch @91d3cc7, NO gate claimed — NOT a verdict)
|
||
|
||
Builder pushed P3 (17ebdf3 per-run ABRA_DIR) + P4 (91d3cc7 config cleanup). tests/concurrency +
|
||
P5 docs still pending, so M1 still not claimable. Continued the front-loaded diff audit (code/git
|
||
only; JOURNAL still unread). Findings — all CLEAN:
|
||
|
||
- **Dangling-ref grep across runner/bridge/dashboard/nix = ZERO hits** for all 9 deleted symbols:
|
||
acquire_recipe_lock, register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR,
|
||
CCCI_JANITOR_MAX_AGE, RECIPE_LOCK_DIR, _stack_age_seconds, _registry_path. The orphaned
|
||
`datetime` import is also gone from lifecycle.py. Clean deletion.
|
||
- **Path centralization**: all `~/.abra/recipes/<recipe>` literals replaced by `abra.recipe_dir()`
|
||
(resolves `$ABRA_DIR else ~/.abra`) across abra.py (recipe_checkout, has_lightweight_version_tags,
|
||
recipe_head_commit, recipe_versions), generic._recipe_dir, lifecycle.prepull_images,
|
||
snapshot_recipe_tests, fetch_recipe. prepull's env_path stays canonical `~/.abra/servers/...`
|
||
which is correct (servers/ is the shared symlink target).
|
||
- **Ordering verified** (main(), the only structural risk): install_lifetime_guards() is the FIRST
|
||
stmt (873); between it and setup_run_abra_dir() (891) there are ONLY env reads + a print — no
|
||
abra call; ABRA_DIR is exported at 891 BEFORE fetch_recipe (892) and before the first path-helper
|
||
recipe_head_commit (895). The `--quick` dispatch (run_quick, ~908) is AFTER 891, so the quick lane
|
||
inherits the per-run ABRA_DIR too. No tree is touched before ABRA_DIR is set.
|
||
- **Manual-run isolation**: rid=="manual" → "manual-<pid>" so two hand-runs don't share a tree.
|
||
|
||
Open items to confirm AT M1 (cold) — not defects:
|
||
- setup_run_abra_dir symlink idempotency: `if not os.path.islink(link): os.symlink(...)` — if a
|
||
NON-symlink file pre-exists at servers/catalogue (reused run dir from a crashed partial), symlink
|
||
raises FileExistsError. Low risk (fresh run-id per Drone build) but worth a glance.
|
||
- CCCI_SKIP_FETCH=1 now `rm -rf dest` + copytree(canonical, dest, symlinks=True) — confirm the
|
||
--quick rollback-proof staging tests still pass (they set CCCI_SKIP_FETCH).
|
||
- tests/{ghost,discourse}/install_steps.sh RECIPE_DIR=${ABRA_DIR:-$HOME/.abra} mechanical path fix
|
||
— confirm it changed NO assertion/gate (guardrail: never weaken recipe-test gates). Diff-check.
|
||
|
||
Net: the entire P1–P4 diff has been pre-audited and is clean against my break-it hit-list. M1 cold
|
||
run, once claimed (after tests/concurrency + P5 land), reduces to: fresh checkout on cc-ci →
|
||
`cc-ci-run -m pytest tests/unit -q` + `cc-ci-run -m pytest tests/concurrency -q` + lint, plus a
|
||
focused review of only the tests/concurrency suite (vs the 19 planned cases) and the P5 doc delta.
|
||
|
||
## M1: PASS @2026-06-10T04:38Z — implementation verified (branch restructure/concurrency @d3fe9e2)
|
||
|
||
Verdict formed from the plan (SSOT), the code/git, the STATUS claim's verify recipe, and my own
|
||
COLD acceptance run — WITHOUT reading JOURNAL first (anti-anchoring honored; noting here that I had
|
||
NOT consulted JOURNAL-conc at verdict time).
|
||
|
||
COLD ENVIRONMENT: fresh `git clone --branch restructure/concurrency` into /tmp/adv-m1 on cc-ci
|
||
(NOT the Builder's tree); `git rev-parse HEAD == d3fe9e26bb0fbaedb37383539ba3973bc1c80aff` (matches
|
||
claim), `git status` clean. Ran via the host `cc-ci-run` pyEnv (pytest 8.3.3 + playwright) and the
|
||
pinned `.#lint` devshell.
|
||
|
||
ACCEPTANCE RESULTS (expected → observed):
|
||
- `cc-ci-run -m pytest tests/unit -q` → 138 passed in 4.72s ✓ (claim: 138 passed)
|
||
- `cc-ci-run -m pytest tests/concurrency -q` → 20 passed in 9.91s ✓ (claim: 20 passed)
|
||
- `nix develop .#lint --command bash scripts/lint.sh` → `lint: PASS` ✓
|
||
- `pytest tests/unit --collect-only` concurrency items → 0 ✓ (suite NOT in default gate)
|
||
- dangling-ref grep (register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR,
|
||
CCCI_JANITOR_MAX_AGE, acquire_recipe_lock, RECIPE_LOCK_DIR, _stack_age_seconds) over
|
||
*.py/*.nix/*.yml/*.sh → ZERO hits outside docs/ ✓
|
||
|
||
GATE-INTEGRITY (guardrails honored):
|
||
- `RUN_APP_RE` regex unchanged (lifecycle.py:26, identical pattern); warm/canonical apps still
|
||
never become probe candidates (test_11 asserts no lockfiles even created for warm names).
|
||
- `services_converged()` / paused-is-settled / `backup_app()` waits: NOT in the code diff — all
|
||
RUN_APP_RE/services_converged/paused diff hits are docs/concurrency.md prose (P5 rewrite).
|
||
- `teardown_app` ordering untouched; only its trailing unregister call removed (registry gone).
|
||
- Only `tests/<recipe>/` change is the mechanical `RECIPE_DIR=${ABRA_DIR:-$HOME/.abra}/...` line
|
||
in ghost+discourse install_steps.sh — NO assertion/gate touched (diff-confirmed). Guardrail
|
||
"never weaken recipe-test gates / touch tests/<recipe>/ content" honored.
|
||
- P4: `concurrency.limit` block removed from .drone.yml; drone-runner.nix comment makes
|
||
DRONE_RUNNER_CAPACITY the single knob.
|
||
|
||
ADVERSARIAL DIFF REVIEW (P1–P4 pre-audited in the two notes above; refuted: green-on-red exit-code
|
||
masking [empirically tested], unlink/reacquire inode race [fstat==stat identity recheck],
|
||
half-reaped coexistence [reap-under-probe-lock], signal-mid-teardown reentrancy [begin_teardown
|
||
first line of both finally blocks], guard/ABRA_DIR/fetch ordering [no abra call pre-export]).
|
||
|
||
TEST-SUITE AUDIT vs the 19 plan cases: real kernel flocks, NEVER mocked (only teardown_app +
|
||
abra-discovery stubbed, both disclosed). Coverage complete: cases 1–4 test_locks, 5–12
|
||
test_janitor, 13–16 test_lifetime, 17–19 test_abra_dir, +test_18b (manual-pid isolation) = 20.
|
||
Assertions are substantive, not tautological: exact funnel exit codes 142/143 (test_15/16),
|
||
reap-vs-new-run timestamp ordering + fresh-inode `lock_state=="held"` (test_7), two-janitor
|
||
arbitration via separate open()s (test_8 — valid: flock binds the open file description, so
|
||
threads-with-distinct-fds model processes), long-held mtime-backdate flag-not-steal (test_10),
|
||
PEP 446 fd non-inheritance with a surviving child (test_3), divergent per-run trees + canonical
|
||
untouched (test_18).
|
||
|
||
INDEPENDENT PROBE (my own driver, NOT the Builder's helpers.py): drove the real
|
||
`lifecycle.acquire_app_lock` from a standalone script with a sandbox CCCI_APP_LOCK_DIR on cc-ci →
|
||
state `held` after acquire; a second acquirer BLOCKED while the first held (no ack2 after 1.5s);
|
||
after `SIGKILL` of the holder the second acquired within 10s (kernel auto-release). Core invariant
|
||
confirmed against the real code, not just the Builder's tests.
|
||
|
||
NON-BLOCKING NOTES (carry to M2 live-verify; none gate M1):
|
||
- setsid-fork edge in the .drone.yml trap wrapper: if `setsid` ever forks (only when it's a pgrp
|
||
leader — not the case for a backgrounded job in a non-job-control drone shell), $PID would be the
|
||
intermediate and the harness could reparent (ppid==1) and self-abort. MUST be live-verified by
|
||
the actual drone-cancel path at M2(a) — the plan already flags this ("verify drone exec runner
|
||
signal delivery; the trap must fire on drone cancel"). Not unit-testable here.
|
||
- End-of-janitor stale-lockfile tidy sweep (appless leftover lockfile unlink) is not directly
|
||
covered by a named test (not one of the 19); low risk (tidiness only). Noted, not a defect.
|
||
- test_14 (ppid race) depends on the helper reparenting to pid 1; under a subreaper it marks
|
||
NEVER_REPARENTED and FAILS VISIBLY (never false-passes). Passed in this env.
|
||
|
||
CONCLUSION: M1 — implementation verified — PASS. M2 (merge to main + live verification a–d) is
|
||
unblocked. Reminder for both loops: recipe-mirror PRs are !testme targets only — never merge/push
|
||
them. (After this verdict I may consult JOURNAL-conc to contextualize, per §6.1.)
|