diff --git a/cc-ci-plan/concurrency-restructure-plan.md b/cc-ci-plan/concurrency-restructure-plan.md
new file mode 100644
index 0000000..58d3a3a
--- /dev/null
+++ b/cc-ci-plan/concurrency-restructure-plan.md
@@ -0,0 +1,60 @@
+# Concurrency restructure — concise plan (for review)
+
+Goal: collapse the two concurrency mechanisms (per-recipe flock + pidfile registry/janitor)
+into **one primitive** — a per-app-domain kernel flock — and make lock lifetime bounded by
+construction. Reference: cc-ci `docs/concurrency.md` §8.
+
+## Target invariant chain
+
+```
+lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ hard deadline
+```
+
+Never steal a held lock; manage the holder's lifetime. A held lock = live owner (kernel-guaranteed).
+
+## Phases (each independently landable, in order)
+
+### P1 — Lock-lifetime hardening (no behavior change to locking model)
+- Harness startup: `PR_SET_PDEATHSIG(SIGTERM)` + post-prctl `getppid()==1` race check
+  → drone cancel now kills the harness, fixing the leaked-python gap.
+- `.drone.yml` step wrapper: `setsid` + trap killing the process group on TERM/EXIT.
+- Self-deadline: `signal.alarm(90*60)` → teardown + exit. Hard upper bound on any run.
+- Verify: trigger a build, cancel via API, assert harness pid gone + app torn down by next janitor.
+
+### P2 — Flock-probe janitor (replace the pidfile registry)
+- Run start: exclusive flock on `/run/lock/cc-ci-app-.lock` **before app creation**,
+  held for process lifetime (also serialises double-`!testme` of the same PR/ref).
+- Janitor: probe each candidate's lockfile with `LOCK_EX|LOCK_NB`.
+  Acquirable → orphan → reap **while holding the probe lock** (closes janitor-vs-new-run race).
+  Held → live run → leave; if lockfile mtime > 2× max run time, log loudly (`lslocks` hint) — flag, never steal.
+- Delete: `register_run_app`, `unregister_run_app`, `_run_owner_state`, `ACTIVE_RUN_DIR`,
+  `CCCI_JANITOR_MAX_AGE` + age fallback, pid-reuse guard.
+- Bonus: post-reboot orphans reaped immediately (locks gone = acquirable) instead of after 2h.
+- Keep: `RUN_APP_RE` allowlist, candidate discovery (abra ls + docker-service sweep).
+
+### P3 — Per-run ABRA_DIR (delete the per-recipe flock)
+- Each build: `ABRA_DIR=/var/lib/cc-ci-runs//abra`, seeded with the server config
+  (verified: abra honors `ABRA_DIR`; fails only on missing `servers/`).
+- `fetch_recipe` becomes a plain clone into a fresh dir (no rm-rf of a shared tree).
+- Delete: `acquire_recipe_lock` + `/run/lock/cc-ci-recipe-*` + the "before fetch_recipe" ordering rule.
+- Wins: same-recipe PRs truly parallel (kills head-of-line blocking §8.2); manual operator
+  activity on `/root/.abra` can no longer corrupt CI runs (§8.3/§8.4 moot).
+- Risk to resolve: orphaned app's `.env` lives in the dead run's tree — janitor must rely on the
+  existing env-less teardown path (docker-service sweep). Confirm `teardown_app(verify=False)`
+  fully works without the `.env`.
+
+### P4 — Config cleanup
+- Drop `concurrency.limit` from `.drone.yml`; `DRONE_RUNNER_CAPACITY` (drone-runner.nix)
+  becomes the single knob (§8.6).
+
+### P5 — Spec rewrite
+- Rewrite `docs/concurrency.md` to the new model (one mechanism, one table) + §10 lock-lifetime chain.
+
+## Out of scope (do not touch)
+- `services_converged()` / paused-is-settled logic; `RUN_APP_RE`; recipe-test gates; warm/canonical apps.
+
+## Open questions for review
+1. P3 server-config seeding: copy vs symlink of `$ABRA_DIR/servers/`? (symlink simpler;
+   copy isolates harder — does abra write into `servers/` during a run? e.g. app `.env` lives there → copy needed, per-run env file then dies with the run dir, see P3 risk.)
+2. Self-deadline value: 90 min ok? (longest observed green run ~35 min incl. flock wait.)
+3. Land P1+P2 together or separately? (P2 testable solo; P1 is pure hardening — could merge.)
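
Review note on P2: the flock probe described in the plan could look roughly like the sketch below. This is a minimal illustration, not the harness code. The names `try_probe`, `janitor_pass`, and the `reap` callback are hypothetical, and the demo uses a temporary lockfile rather than the real `/run/lock/cc-ci-app-.lock` path. It assumes Linux `flock(2)` semantics, where each `open()` gets its own open file description and so contends for the lock independently, even inside one process.

```python
import fcntl
import os
import tempfile


def try_probe(lock_path):
    """Try to take an exclusive, non-blocking flock on lock_path.

    Returns an open fd that holds the lock if it was acquirable (no live
    owner), or None if some other open file description already holds it.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Lock is held: a live run owns this app. Never steal it.
        os.close(fd)
        return None
    return fd


def janitor_pass(lock_path, reap):
    """Probe one candidate; reap it only while holding the probe lock."""
    fd = try_probe(lock_path)
    if fd is None:
        return "live"
    try:
        # A new run for the same app would block on the lock until we are
        # done, which is what closes the janitor-vs-new-run race.
        reap()
    finally:
        os.close(fd)  # closing the fd releases the flock
    return "reaped"


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "demo-app.lock")  # hypothetical lockfile

    run_fd = try_probe(path)  # stands in for a harness taking its lock
    print(janitor_pass(path, lambda: print("reaping")))  # lock is held

    os.close(run_fd)  # the "run" exits, the kernel drops its lock
    print(janitor_pass(path, lambda: print("reaping")))  # now acquirable
```

Because the kernel releases a flock when the last fd on its open file description closes, a crashed or killed harness leaves an acquirable lock behind without any pidfile bookkeeping, which is the property P2 relies on.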