# Concurrency restructure — full implementation plan Expands `concurrency-restructure-plan.md` (operator-approved 2026-06-10). Reference spec of current system: cc-ci `docs/concurrency.md`. ## Decisions locked (operator + orchestrator) | Question | Decision | |---|---| | Hard deadline | **60 min** (`signal.alarm(60*60)`); recipes keep their own smaller per-tier deadlines | | `servers/` in per-run ABRA_DIR | **symlink** to `/root/.abra/servers` — app `.env` files land in the shared canonical path, so the janitor's `abra app ls` discovery and env-based teardown work unchanged (kills the P3 orphan-env risk). Per-domain filenames + app-domain lock prevent write conflicts. `catalogue/` also symlinked (read-mostly). `recipes/` is fresh per-run — that's the isolation that matters. | | Landing | one branch `restructure/concurrency`, one commit per phase (P1..P5 + tests), merged to main only after adversary pass + lint green; live verification on main afterwards | | Concurrency test suite | `tests/concurrency/` — NOT in the default `pytest tests/unit` run; invoked explicitly (`pytest tests/concurrency -q`); documented in docs/concurrency.md | ## Target invariant chain ``` lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min deadline ``` Never steal a held lock; manage the holder's lifetime. Held lock = live owner (kernel-guaranteed). --- ## P1 — Lock-lifetime hardening `runner/run_recipe_ci.py` (startup, before any abra call): 1. `PR_SET_PDEATHSIG(SIGTERM)` via `ctypes.CDLL("libc.so.6", use_errno=True).prctl(1, signal.SIGTERM)`; check return code; then `if os.getppid() == 1: sys.exit("parent died before prctl — refusing to run orphaned")`. 2. SIGTERM handler: run teardown (reuse the existing finally/teardown funnel — raise SystemExit from the handler so `finally:` blocks run), exit non-zero. 3. Self-deadline: `signal.alarm(60*60)` + SIGALRM handler → same teardown funnel, distinct log line (`== run exceeded 60-minute hard deadline — tearing down ==`), exit non-zero. `.drone.yml` recipe-ci step: wrap the harness invocation: ```sh setsid python3 runner/run_recipe_ci.py ... & PID=$!; trap 'kill -TERM -- -$PID 2>/dev/null' TERM EXIT; wait $PID ``` (builder: verify drone exec runner signal delivery semantics; the trap must fire on drone cancel.) Also: one-line comment on the lock `open()` noting PEP 446 makes the fd non-inheritable (subprocess children never carry the lock). ## P2 — Flock-probe janitor (replaces the pidfile registry) `runner/harness/lifecycle.py`: 1. New: `acquire_app_lock(domain)` — `/run/lock/cc-ci-app-.lock`, exclusive flock, non-blocking try first → on `BlockingIOError` log `== app lock: another run of is in flight — waiting ==` and block. Returns the kept-alive file object. Touch mtime at acquisition (mtime = lock age for the long-held flag). 2. Call site: in `deploy_app()` exactly where `register_run_app()` is today (BEFORE app creation); the file object must be stored on a module-level/owner object so it lives for the process lifetime. 3. Janitor rewrite (`janitor()`): for each candidate domain (discovery unchanged: `abra app ls` matched against `RUN_APP_RE` + docker-service sweep): - open/create its lockfile, try `LOCK_EX|LOCK_NB`: - **acquired** → orphan → `teardown_app(domain, verify=False)` WHILE HOLDING the probe lock (closes janitor-vs-new-run race: a new run of that domain blocks on the lock until the reap finishes), then release + unlink lockfile. - **held** → live run → leave it; if lockfile mtime older than 2× deadline (120 min), log: `!! lock for held >120min — possible leaked run; inspect with lslocks` (flag, never steal). 4. Delete: `register_run_app`, `unregister_run_app`, `_run_owner_state`, `ACTIVE_RUN_DIR`, `CCCI_JANITOR_MAX_AGE` handling + age fallback, pid-reuse guard, and their call sites (`deploy_app`, `teardown_app`, both janitor call sites keep working — only internals change). 5. `teardown_app()` no longer unregisters; normal-exit lock release stays implicit (process exit). Janitor unlinks lockfiles only for reaped orphans; a clean run's stale lockfile (0 bytes, unlocked) is harmless and reaped-on-sight by the next janitor probe (unlink after acquiring probe lock, even when no app exists for it — keep /run/lock tidy). Edge semantics to preserve/verify: - Two janitors racing: both probe; one wins the flock and reaps; the loser sees "held" and leaves. Reap must be idempotent (`teardown_app(verify=False)` already tolerates half-gone stacks). - Post-reboot: lockfiles gone (tmpfs) → probe trivially acquires → immediate reap. (Improvement over the 2h age fallback — note in spec.) - Warm/canonical apps never match `RUN_APP_RE` → never probed. Unchanged. DO NOT touch. ## P3 — Per-run ABRA_DIR (deletes the per-recipe flock) `runner/run_recipe_ci.py` / `runner/harness/`: 1. Early in main(): build `run_abra_dir = /var/lib/cc-ci-runs//abra`; create it; symlink `servers` → `/root/.abra/servers`, `catalogue` → `/root/.abra/catalogue` (builder: verify the minimal set abra needs — probe showed it fails only on missing `servers/`); fresh empty `recipes/`. Set `os.environ["ABRA_DIR"] = run_abra_dir` before ANY abra/harness call. 2. `fetch_recipe()` becomes a plain clone into `$ABRA_DIR/recipes/` (no rm-rf of a shared tree; keep the checkout-of-ref logic). 3. Delete: `acquire_recipe_lock`, `RECIPE_LOCK_DIR` recipe-lock usage, its call site in main(), and the "must be taken before fetch_recipe" ordering rule. 4. `.drone.yml`: HOME=/root comment updated — abra config still found via symlinked servers; recipe trees now per-run. 5. Janitor / warm-app / canonical flows that read `/root/.abra/recipes/...` (if any — builder: grep) must be checked; they run outside a per-run ABRA_DIR and keep using the default. 6. Run-dir cleanup: `/var/lib/cc-ci-runs//` retention already exists — abra dir rides along. ## P4 — Config cleanup - Remove `concurrency.limit` from `.drone.yml` recipe-ci pipeline; `DRONE_RUNNER_CAPACITY` (nix/modules/drone-runner.nix `maxTests`) is the single knob. Update its comment. ## P5 — Spec rewrite - Rewrite `docs/concurrency.md` to the new model: one mechanism (per-app-domain flock), the invariant chain (§ lock-lifetime management), flock-probe janitor decision table, per-run ABRA_DIR isolation, updated failure-mode table (cancel now kills harness; reboot reaps immediately; deadline bounds everything), updated file/symbol index, and how to run `tests/concurrency`. --- ## Test suite: `tests/concurrency/` Real-kernel tests (no mocking of flock itself — spawn helper subprocesses that hold locks; use tmp dirs for lock/run dirs via env/monkeypatch). NOT part of `pytest tests/unit`. Run with `pytest tests/concurrency -q` (document in docs; CI does not run it on every build — optional manual/nightly invocation). Lock fundamentals: 1. acquire → child process SIGKILL'd → lock immediately acquirable (kernel auto-release). 2. probe (`LOCK_NB`) against a held lock raises `BlockingIOError`; against unheld succeeds. 3. lock fd NOT inherited: holder spawns a `subprocess` child, holder dies, child still alive → lock acquirable (PEP 446). 4. second same-domain acquire blocks until first holder exits (double-!testme serialisation). Janitor / probe semantics: 5. orphan (no holder) → reaped, lockfile unlinked; teardown invoked exactly once (stub teardown). 6. live (held by helper process) → never reaped, logged as live. 7. reap-under-probe-lock race: janitor holds probe lock mid-reap → a new run's acquire of the same domain blocks until reap completes (no window where new app coexists with half-reaped one). 8. two concurrent janitors → exactly one reaps (flock arbitration), reap idempotent. 9. reboot simulation: app exists, lockfile absent → treated as orphan immediately (no age wait). 10. long-held flag: held lock with mtime backdated >120min → warning logged, NOT reaped. 11. `RUN_APP_RE` allowlist: warm/canonical-shaped names never become candidates. 12. garbled/unwritable lockfile, missing lock dir → janitor degrades safely (skip + log, never crash). Lifetime hardening: 13. PDEATHSIG: spawn wrapper-parent + harness-child (helper script calling the real prctl setup); kill parent → child receives SIGTERM and exits, lock released. 14. ppid race: helper started with already-dead parent (ppid 1) exits immediately. 15. deadline: helper with `alarm(2)` → teardown hook fires, exit non-zero, lock released. 16. SIGTERM → teardown funnel runs (`finally:` executed), lock released, non-zero exit. ABRA_DIR isolation: 17. per-run dir built correctly: `servers`/`catalogue` symlinks resolve, `recipes/` empty + writable, `ABRA_DIR` exported before first abra call (assert via stub abra recording its env). 18. two concurrent fetch_recipe of the SAME recipe into different ABRA_DIRs → both trees correct, no cross-talk (the old corruption scenario, now safe without a lock). 19. app `.env` written through the servers symlink lands in the canonical shared path (janitor discovery keeps working). Each test must clean up helper processes in `finally`/fixtures (no leaked children in the test VM). ## Roles, gates & Definition of Done (loop protocol — plan.md §6.1 applies) This phase runs as Builder/Adversary loops with phase-namespaced state files (STATUS-conc.md, REVIEW-conc.md, BACKLOG-conc.md, JOURNAL-conc.md). **Builder**: implements P1→P5 + the tests/concurrency suite on branch `restructure/concurrency` in YOUR clone; one commit per phase; before each commit ALL of: `pytest tests/unit -q`, `pytest tests/concurrency -q` (once it exists), `scripts/lint.sh` — green. Push the branch (NOT main — main merge is gated below). Claim gates via `claim(conc): ...` commits + STATUS-conc.md per protocol. **Adversary**: cold-verify from your own clone. For M1: check out the branch, run both suites + lint yourself, then adversarially review the full diff — hunt races (probe vs acquire ordering, signal-handler reentrancy, teardown-during-teardown), deleted-code fallout (grep for dangling references to registry symbols: register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE, acquire_recipe_lock), gate integrity (recipe tests / RUN_APP_RE / warm apps untouched), and test-suite blind spots vs the 19 cases above. Default-skeptical; findings to REVIEW-conc.md. **Gates:** - **M1 — implementation verified.** Branch complete (P1–P5 + tests), both suites + lint green on the Adversary's cold clone, adversarial diff review PASS in REVIEW-conc.md. - **M2 — merged + live-verified.** After M1 PASS only, Builder merges the branch to main (merge commit, never force) and confirms the push build green. Then live verification on the real CI host (evidence in STATUS-conc.md, Adversary re-checks): a. trigger an immich !testme, cancel mid-run via drone API → harness pid dies (no leaked python), lock released, next janitor reaps the app — zero leakage; b. two parallel !testme runs (immich PR#2 + plausible PR#3) → both green, zero leakage; c. double-!testme same PR → second blocks on the app lock (visible in its drone log), then runs; d. one full green run end-to-end. **## DONE in STATUS-conc.md** only when M1 and M2 both show a fresh Adversary PASS in REVIEW-conc.md. NOTE for both loops: recipe-mirror PRs (immich#2, plausible#3) are used as !testme targets only — NEVER merge or push to recipe mirror repos. ## Guardrails (builder + adversary MUST honor) - NEVER weaken recipe-test gates or touch `tests//` content. - cc-ci main is touched ONLY by the M2 merge after M1 PASS; all other work stays on the branch. Never force-push. NEVER merge/push recipe mirror repos. - No secrets in commits; reference `.testenv` / `/run/secrets` locations only. - Don't touch `services_converged()` / paused-is-settled logic except where the plan says. - Match repo commit style (`feat(...)`/`fix(harness)`/`test(...)`/`docs:`). ## Rollback Single revert of the merge commit restores the registry+recipe-flock world; no data migration (lock/registry state is tmpfs-ephemeral, old and new harness can't run simultaneously anyway — deploys land from main atomically per build).