diff --git a/cc-ci-plan/concurrency-restructure-full-plan.md b/cc-ci-plan/concurrency-restructure-full-plan.md new file mode 100644 index 0000000..c08ef76 --- /dev/null +++ b/cc-ci-plan/concurrency-restructure-full-plan.md @@ -0,0 +1,185 @@ +# Concurrency restructure — full implementation plan + +Expands `concurrency-restructure-plan.md` (operator-approved 2026-06-10). +Reference spec of current system: cc-ci `docs/concurrency.md`. + +## Decisions locked (operator + orchestrator) + +| Question | Decision | +|---|---| +| Hard deadline | **60 min** (`signal.alarm(60*60)`); recipes keep their own smaller per-tier deadlines | +| `servers/` in per-run ABRA_DIR | **symlink** to `/root/.abra/servers` — app `.env` files land in the shared canonical path, so the janitor's `abra app ls` discovery and env-based teardown work unchanged (kills the P3 orphan-env risk). Per-domain filenames + app-domain lock prevent write conflicts. `catalogue/` also symlinked (read-mostly). `recipes/` is fresh per-run — that's the isolation that matters. | +| Landing | one branch `restructure/concurrency`, one commit per phase (P1..P5 + tests), merged to main only after adversary pass + lint green; live verification on main afterwards | +| Concurrency test suite | `tests/concurrency/` — NOT in the default `pytest tests/unit` run; invoked explicitly (`pytest tests/concurrency -q`); documented in docs/concurrency.md | + +## Target invariant chain + +``` +lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min deadline +``` + +Never steal a held lock; manage the holder's lifetime. Held lock = live owner (kernel-guaranteed). + +--- + +## P1 — Lock-lifetime hardening + +`runner/run_recipe_ci.py` (startup, before any abra call): + +1. `PR_SET_PDEATHSIG(SIGTERM)` via `ctypes.CDLL("libc.so.6", use_errno=True).prctl(1, signal.SIGTERM)`; + check return code; then `if os.getppid() == 1: sys.exit("parent died before prctl — refusing to run orphaned")`. +2. SIGTERM handler: run teardown (reuse the existing finally/teardown funnel — raise SystemExit + from the handler so `finally:` blocks run), exit non-zero. +3. Self-deadline: `signal.alarm(60*60)` + SIGALRM handler → same teardown funnel, distinct log line + (`== run exceeded 60-minute hard deadline — tearing down ==`), exit non-zero. + +`.drone.yml` recipe-ci step: wrap the harness invocation: +```sh +setsid python3 runner/run_recipe_ci.py ... & +PID=$!; trap 'kill -TERM -- -$PID 2>/dev/null' TERM EXIT; wait $PID +``` +(builder: verify drone exec runner signal delivery semantics; the trap must fire on drone cancel.) + +Also: one-line comment on the lock `open()` noting PEP 446 makes the fd non-inheritable +(subprocess children never carry the lock). + +## P2 — Flock-probe janitor (replaces the pidfile registry) + +`runner/harness/lifecycle.py`: + +1. New: `acquire_app_lock(domain)` — `/run/lock/cc-ci-app-.lock`, exclusive flock, + non-blocking try first → on `BlockingIOError` log + `== app lock: another run of is in flight — waiting ==` and block. Returns the kept-alive + file object. Touch mtime at acquisition (mtime = lock age for the long-held flag). +2. Call site: in `deploy_app()` exactly where `register_run_app()` is today (BEFORE app creation); + the file object must be stored on a module-level/owner object so it lives for the process lifetime. +3. Janitor rewrite (`janitor()`): for each candidate domain (discovery unchanged: `abra app ls` + matched against `RUN_APP_RE` + docker-service sweep): + - open/create its lockfile, try `LOCK_EX|LOCK_NB`: + - **acquired** → orphan → `teardown_app(domain, verify=False)` WHILE HOLDING the probe lock + (closes janitor-vs-new-run race: a new run of that domain blocks on the lock until the reap + finishes), then release + unlink lockfile. + - **held** → live run → leave it; if lockfile mtime older than 2× deadline (120 min), log: + `!! lock for held >120min — possible leaked run; inspect with lslocks` (flag, never steal). +4. Delete: `register_run_app`, `unregister_run_app`, `_run_owner_state`, `ACTIVE_RUN_DIR`, + `CCCI_JANITOR_MAX_AGE` handling + age fallback, pid-reuse guard, and their call sites + (`deploy_app`, `teardown_app`, both janitor call sites keep working — only internals change). +5. `teardown_app()` no longer unregisters; normal-exit lock release stays implicit (process exit). + Janitor unlinks lockfiles only for reaped orphans; a clean run's stale lockfile (0 bytes, unlocked) + is harmless and reaped-on-sight by the next janitor probe (unlink after acquiring probe lock, + even when no app exists for it — keep /run/lock tidy). + +Edge semantics to preserve/verify: +- Two janitors racing: both probe; one wins the flock and reaps; the loser sees "held" and leaves. + Reap must be idempotent (`teardown_app(verify=False)` already tolerates half-gone stacks). +- Post-reboot: lockfiles gone (tmpfs) → probe trivially acquires → immediate reap. (Improvement + over the 2h age fallback — note in spec.) +- Warm/canonical apps never match `RUN_APP_RE` → never probed. Unchanged. DO NOT touch. + +## P3 — Per-run ABRA_DIR (deletes the per-recipe flock) + +`runner/run_recipe_ci.py` / `runner/harness/`: + +1. Early in main(): build `run_abra_dir = /var/lib/cc-ci-runs//abra`; create it; + symlink `servers` → `/root/.abra/servers`, `catalogue` → `/root/.abra/catalogue` (builder: + verify the minimal set abra needs — probe showed it fails only on missing `servers/`); + fresh empty `recipes/`. Set `os.environ["ABRA_DIR"] = run_abra_dir` before ANY abra/harness call. +2. `fetch_recipe()` becomes a plain clone into `$ABRA_DIR/recipes/` (no rm-rf of a shared + tree; keep the checkout-of-ref logic). +3. Delete: `acquire_recipe_lock`, `RECIPE_LOCK_DIR` recipe-lock usage, its call site in main(), + and the "must be taken before fetch_recipe" ordering rule. +4. `.drone.yml`: HOME=/root comment updated — abra config still found via symlinked servers; + recipe trees now per-run. +5. Janitor / warm-app / canonical flows that read `/root/.abra/recipes/...` (if any — builder: + grep) must be checked; they run outside a per-run ABRA_DIR and keep using the default. +6. Run-dir cleanup: `/var/lib/cc-ci-runs//` retention already exists — abra dir rides along. + +## P4 — Config cleanup + +- Remove `concurrency.limit` from `.drone.yml` recipe-ci pipeline; `DRONE_RUNNER_CAPACITY` + (nix/modules/drone-runner.nix `maxTests`) is the single knob. Update its comment. + +## P5 — Spec rewrite + +- Rewrite `docs/concurrency.md` to the new model: one mechanism (per-app-domain flock), the + invariant chain (§ lock-lifetime management), flock-probe janitor decision table, per-run + ABRA_DIR isolation, updated failure-mode table (cancel now kills harness; reboot reaps + immediately; deadline bounds everything), updated file/symbol index, and how to run + `tests/concurrency`. + +--- + +## Test suite: `tests/concurrency/` + +Real-kernel tests (no mocking of flock itself — spawn helper subprocesses that hold locks; use +tmp dirs for lock/run dirs via env/monkeypatch). NOT part of `pytest tests/unit`. Run with +`pytest tests/concurrency -q` (document in docs; CI does not run it on every build — optional +manual/nightly invocation). + +Lock fundamentals: +1. acquire → child process SIGKILL'd → lock immediately acquirable (kernel auto-release). +2. probe (`LOCK_NB`) against a held lock raises `BlockingIOError`; against unheld succeeds. +3. lock fd NOT inherited: holder spawns a `subprocess` child, holder dies, child still alive → + lock acquirable (PEP 446). +4. second same-domain acquire blocks until first holder exits (double-!testme serialisation). + +Janitor / probe semantics: +5. orphan (no holder) → reaped, lockfile unlinked; teardown invoked exactly once (stub teardown). +6. live (held by helper process) → never reaped, logged as live. +7. reap-under-probe-lock race: janitor holds probe lock mid-reap → a new run's acquire of the + same domain blocks until reap completes (no window where new app coexists with half-reaped one). +8. two concurrent janitors → exactly one reaps (flock arbitration), reap idempotent. +9. reboot simulation: app exists, lockfile absent → treated as orphan immediately (no age wait). +10. long-held flag: held lock with mtime backdated >120min → warning logged, NOT reaped. +11. `RUN_APP_RE` allowlist: warm/canonical-shaped names never become candidates. +12. garbled/unwritable lockfile, missing lock dir → janitor degrades safely (skip + log, never crash). + +Lifetime hardening: +13. PDEATHSIG: spawn wrapper-parent + harness-child (helper script calling the real prctl setup); + kill parent → child receives SIGTERM and exits, lock released. +14. ppid race: helper started with already-dead parent (ppid 1) exits immediately. +15. deadline: helper with `alarm(2)` → teardown hook fires, exit non-zero, lock released. +16. SIGTERM → teardown funnel runs (`finally:` executed), lock released, non-zero exit. + +ABRA_DIR isolation: +17. per-run dir built correctly: `servers`/`catalogue` symlinks resolve, `recipes/` empty + writable, + `ABRA_DIR` exported before first abra call (assert via stub abra recording its env). +18. two concurrent fetch_recipe of the SAME recipe into different ABRA_DIRs → both trees correct, + no cross-talk (the old corruption scenario, now safe without a lock). +19. app `.env` written through the servers symlink lands in the canonical shared path (janitor + discovery keeps working). + +Each test must clean up helper processes in `finally`/fixtures (no leaked children in the test VM). + +## Roles & flow + +1. **Builder** (fable — main-loop model, no override): implements P1→P5 + test suite on branch + `restructure/concurrency` in a fresh clone; one commit per phase; runs + `pytest tests/unit -q`, `pytest tests/concurrency -q`, `scripts/lint.sh` before each commit; + pushes the branch (never main). Commit author `autonomic-bot `. +2. **Adversary** (opus): adversarial review of the full branch diff — hunts races (probe vs + acquire ordering, signal-handler reentrancy, teardown-during-teardown), deleted-code fallout + (grep for dangling references to registry symbols), gate integrity (recipe tests/`RUN_APP_RE`/ + warm apps untouched), and test-suite blind spots. Produces findings list; default-skeptical. +3. Builder addresses findings; adversary re-checks. Orchestrator merges to main only when: + adversary pass + both test suites green + lint green + branch push build green. +4. **Live verification** (orchestrator, post-merge): + a. trigger an immich !testme, cancel mid-run via drone API → assert harness pid dies (no leak), + lock released, next janitor reaps the app; + b. two parallel !testme runs (immich + plausible) → both green, zero leakage; + c. double-!testme same PR → second blocks on app lock, then runs; + d. confirm a full green run end-to-end (the standing regression canary). + +## Guardrails (builder + adversary MUST honor) + +- NEVER weaken recipe-test gates or touch `tests//` content. +- NEVER push cc-ci main; branch only. Never force-push. +- No secrets in commits; reference `.testenv` / `/run/secrets` locations only. +- Don't touch `services_converged()` / paused-is-settled logic except where the plan says. +- Match repo commit style (`feat(...)`/`fix(harness)`/`test(...)`/`docs:`). + +## Rollback + +Single revert of the merge commit restores the registry+recipe-flock world; no data migration +(lock/registry state is tmpfs-ephemeral, old and new harness can't run simultaneously anyway — +deploys land from main atomically per build).