plan: concurrency restructure — flock-probe janitor, per-run ABRA_DIR, lock-lifetime chain

autonomic-bot
2026-06-10 03:41:05 +00:00
parent 335ea1d7c1
commit 0d169c2a20


@ -0,0 +1,60 @@
# Concurrency restructure — concise plan (for review)
Goal: collapse the two concurrency mechanisms (per-recipe flock + pidfile registry/janitor)
into **one primitive** — a per-app-domain kernel flock — and make lock lifetime bounded by
construction. Reference: cc-ci `docs/concurrency.md` §8.
## Target invariant chain
```
lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ hard deadline
```
Never steal a held lock; manage the holder's lifetime. A held lock = live owner (kernel-guaranteed).
## Phases (each independently landable, in order)
### P1 — Lock-lifetime hardening (no behavior change to locking model)
- Harness startup: `PR_SET_PDEATHSIG(SIGTERM)` + post-prctl `getppid()==1` race check
→ drone cancel now kills the harness, fixing the leaked-python gap.
- `.drone.yml` step wrapper: `setsid` + trap killing the process group on TERM/EXIT.
- Self-deadline: `signal.alarm(90*60)` → teardown + exit. Hard upper bound on any run.
- Verify: trigger a build, cancel via API, assert harness pid gone + app torn down by next janitor.
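A minimal sketch of the two in-process pieces of P1 (parent-death signal and self-deadline), assuming the harness is a Linux Python process. The function names, the `teardown` callback, and exit code 124 are hypothetical and not taken from the harness; the `prctl` call goes through `ctypes` because the standard library has no wrapper for it.

```python
import ctypes
import os
import signal
import sys

PR_SET_PDEATHSIG = 1        # constant from <linux/prctl.h>
DEADLINE_SECONDS = 90 * 60  # self-deadline value proposed in this plan


def arm_parent_death_signal(sig: int = signal.SIGTERM) -> bool:
    """Ask the kernel to send `sig` to this process when its parent dies.

    Linux only. Returns True if the parent was already gone by the time
    prctl took effect, in which case no signal will ever arrive and the
    caller should shut down on its own.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(sig)), 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # Race check from the plan: a parent that died before prctl ran leaves
    # us reparented to pid 1 (assuming no subreaper sits in between).
    return os.getppid() == 1


def arm_deadline(teardown, seconds: int = DEADLINE_SECONDS) -> None:
    """Install a SIGALRM handler that tears down and exits after `seconds`."""

    def _on_alarm(signum, frame):
        try:
            teardown()
        finally:
            # Hypothetical exit code; exit even if teardown raised.
            sys.exit(124)

    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
```

At startup the harness would call `arm_parent_death_signal()` first, exit immediately if it returns True, and then call `arm_deadline(...)` with its real teardown routine. `signal.signal` must be called from the main thread.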
### P2 — Flock-probe janitor (replace the pidfile registry)
- Run start: exclusive flock on `/run/lock/cc-ci-app-<domain>.lock` **before app creation**,
held for process lifetime (also serialises double-`!testme` of the same PR/ref).
- Janitor: probe each candidate's lockfile with `LOCK_EX|LOCK_NB`.
Acquirable → orphan → reap **while holding the probe lock** (closes janitor-vs-new-run race).
Held → live run → leave; if the lockfile's age (now minus mtime) exceeds 2× max run time, log loudly (`lslocks` hint); flag, never steal.
- Delete: `register_run_app`, `unregister_run_app`, `_run_owner_state`, `ACTIVE_RUN_DIR`,
`CCCI_JANITOR_MAX_AGE` + age fallback, pid-reuse guard.
- Bonus: post-reboot orphans reaped immediately (locks gone = acquirable) instead of after 2h.
- Keep: `RUN_APP_RE` allowlist, candidate discovery (abra ls + docker-service sweep).
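The run-start lock and the janitor probe described above could look roughly like this. It is a sketch under assumptions: `acquire_run_lock`, `probe_and_reap`, the `reap` callback and the `lock_dir` parameter are hypothetical names; only the lockfile naming and the `LOCK_EX|LOCK_NB` probe come from the plan.

```python
import fcntl
import os

LOCK_DIR = "/run/lock"


def lock_path(domain: str, lock_dir: str = LOCK_DIR) -> str:
    return os.path.join(lock_dir, f"cc-ci-app-{domain}.lock")


def acquire_run_lock(domain: str, lock_dir: str = LOCK_DIR) -> int:
    """Run start: take the exclusive flock before the app is created.

    Blocks while another run holds it. The returned fd must stay open
    for the whole process lifetime; the kernel drops the lock when the
    process exits, however it exits.
    """
    fd = os.open(lock_path(domain, lock_dir), os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
    except BaseException:
        os.close(fd)
        raise
    return fd


def probe_and_reap(domain: str, reap, lock_dir: str = LOCK_DIR) -> bool:
    """Janitor: reap `domain` only if its lock is acquirable (orphan).

    Returns True if the app was reaped, False if a live run holds the lock.
    """
    fd = os.open(lock_path(domain, lock_dir), os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # held, so the owner is alive: never steal
        # Reap while still holding the probe lock, so a new run for the
        # same domain blocks in acquire_run_lock until teardown is done.
        reap(domain)
        return True
    finally:
        os.close(fd)  # closing the fd releases the probe lock
```

Because the probe opens the file with `O_CREAT`, a lockfile that vanished across a reboot is simply recreated and found acquirable, which matches the post-reboot behavior listed under Bonus.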
### P3 — Per-run ABRA_DIR (delete the per-recipe flock)
- Each build: `ABRA_DIR=/var/lib/cc-ci-runs/<n>/abra`, seeded with the server config
(verified: abra honors `ABRA_DIR`; fails only on missing `servers/`).
- `fetch_recipe` becomes a plain clone into a fresh dir (no rm-rf of a shared tree).
- Delete: `acquire_recipe_lock` + `/run/lock/cc-ci-recipe-*` + the "before fetch_recipe" ordering rule.
- Wins: same-recipe PRs truly parallel (kills head-of-line blocking §8.2); manual operator
activity on `/root/.abra` can no longer corrupt CI runs (§8.3/§8.4 moot).
- Risk to resolve: orphaned app's `.env` lives in the dead run's tree — janitor must rely on the
existing env-less teardown path (docker-service sweep). Confirm `teardown_app(verify=False)`
fully works without the `.env`.
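One possible shape for the per-run `ABRA_DIR` setup, sketched under assumptions: `prepare_abra_dir` and its parameters are hypothetical, and the sketch picks "copy" for the server config only for illustration, since copy vs symlink is still an open question in this plan.

```python
import os
import shutil
from pathlib import Path


def prepare_abra_dir(run_root, shared_abra_dir, server: str) -> dict:
    """Create <run_root>/abra seeded with one server's config.

    Returns an environment mapping with ABRA_DIR pointing at the fresh
    tree, suitable for passing as env= to subprocess calls that run abra.
    """
    abra_dir = Path(run_root) / "abra"
    src = Path(shared_abra_dir) / "servers" / server
    dst = abra_dir / "servers" / server

    # Refuse to reuse an existing run tree: every build gets a fresh one.
    dst.parent.mkdir(parents=True, exist_ok=False)
    # Copy rather than symlink, so nothing written during the run can
    # leak back into the shared tree (assumption, see open question 1).
    shutil.copytree(src, dst)

    return dict(os.environ, ABRA_DIR=str(abra_dir))
```

With this layout, recipe fetching needs no lock at all: each run clones into its own tree, and deleting the run directory removes everything the run created on disk.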
### P4 — Config cleanup
- Drop `concurrency.limit` from `.drone.yml`; `DRONE_RUNNER_CAPACITY` (drone-runner.nix)
becomes the single knob (§8.6).
### P5 — Spec rewrite
- Rewrite `docs/concurrency.md` to the new model (one mechanism, one table) + §10 lock-lifetime chain.
## Out of scope (do not touch)
- `services_converged()` / paused-is-settled logic; `RUN_APP_RE`; recipe-test gates; warm/canonical apps.
## Open questions for review
1. P3 server-config seeding: copy vs symlink of `$ABRA_DIR/servers/<server>`? (Symlink is simpler;
copy isolates better. Does abra write into `servers/` during a run? E.g. the app `.env` lives there, so a copy is needed; the per-run env file then dies with the run dir, see P3 risk.)
2. Self-deadline value: 90 min ok? (longest observed green run ~35 min incl. flock wait.)
3. Land P1+P2 together or separately? (P2 testable solo; P1 is pure hardening — could merge.)