# Concurrency restructure — concise plan (for review)
Goal: collapse the two concurrency mechanisms (per-recipe flock + pidfile registry/janitor)
into **one primitive** — a per-app-domain kernel flock — and make lock lifetime bounded by
construction. Reference: cc-ci `docs/concurrency.md` §8.
## Target invariant chain
```
lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ hard deadline
```
Never steal a held lock; manage the holder's lifetime. A held lock = live owner (kernel-guaranteed).
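The "held lock = live owner" claim rests on the kernel releasing a flock when the holder's last file descriptor closes, which includes process death. The sketch below is a rough demonstration of that property, not harness code: a child process takes the lock, the parent sees it as held, and after the child is killed the lock is acquirable again. The names `HOLDER`, `is_held` and `demo` are hypothetical, and the lockfile lives in a temporary directory.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Child program: take an exclusive flock, announce it, then sleep.
HOLDER = """
import fcntl, os, sys, time
fd = os.open(sys.argv[1], os.O_RDWR | os.O_CREAT, 0o644)
fcntl.flock(fd, fcntl.LOCK_EX)
print("held", flush=True)
time.sleep(60)
"""


def is_held(path):
    """Return True if another open file description holds the flock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return False
    except BlockingIOError:
        return True
    finally:
        os.close(fd)  # closing also drops the probe lock if we got it


def demo():
    """Return (held while owner alive, held after owner killed)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.lock")
        holder = subprocess.Popen(
            [sys.executable, "-c", HOLDER, path],
            stdout=subprocess.PIPE,
            text=True,
        )
        holder.stdout.readline()  # block until the child reports "held"
        while_alive = is_held(path)
        holder.kill()             # SIGKILL: the child gets no chance to clean up
        holder.wait()
        holder.stdout.close()
        after_death = is_held(path)
        return while_alive, after_death
```

Because the release happens in the kernel even on SIGKILL, the remaining work in this plan is bounding how long the owner can live, not detecting dead owners.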
## Phases (each independently landable, in order)
### P1 — Lock-lifetime hardening (no behavior change to locking model)
- Harness startup: `PR_SET_PDEATHSIG(SIGTERM)` + post-prctl `getppid()==1` race check
→ drone cancel now kills the harness, fixing the leaked-python gap.
- `.drone.yml` step wrapper: `setsid` + trap killing the process group on TERM/EXIT.
- Self-deadline: `signal.alarm(90*60)` → teardown + exit. Hard upper bound on any run.
- Verify: trigger a build, cancel via API, assert the harness pid is gone and the app is torn down by the next janitor pass.
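The harness-side pieces of P1 could look roughly like the sketch below, assuming a Linux host and a Python harness. `set_parent_death_signal`, `arm_self_deadline` and the `teardown` callback are hypothetical names, not existing cc-ci functions; `PR_SET_PDEATHSIG` is the constant from `<linux/prctl.h>`, called through `ctypes` because the standard library has no prctl wrapper.

```python
import ctypes
import os
import signal
import sys

PR_SET_PDEATHSIG = 1  # value from <linux/prctl.h>


def set_parent_death_signal(sig=signal.SIGTERM):
    """Ask the kernel to deliver `sig` to this process when its parent dies.

    Returns True if the parent was already gone when prctl ran, which is
    the race the plan's post-prctl getppid()==1 check is meant to catch.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, int(sig), 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return os.getppid() == 1


def arm_self_deadline(seconds, teardown):
    """Run `teardown` and exit non-zero once `seconds` have elapsed."""
    def on_alarm(signum, frame):
        teardown()
        sys.exit(1)

    # SIGALRM handlers can only be installed from the main thread.
    signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(seconds)
```

At startup the harness would call `set_parent_death_signal()` first and exit immediately if it returns `True`, then `arm_self_deadline(90 * 60, teardown)`. The `setsid` + trap wrapper in `.drone.yml` is shell and is not shown here.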
### P2 — Flock-probe janitor (replace the pidfile registry)
- Run start: exclusive flock on `/run/lock/cc-ci-app-<domain>.lock` **before app creation**,
held for process lifetime (also serialises double-`!testme` of the same PR/ref).
- Janitor: probe each candidate's lockfile with `LOCK_EX|LOCK_NB`.
Acquirable → orphan → reap **while holding the probe lock** (closes janitor-vs-new-run race).
Held → live run → leave; if the lockfile's mtime is more than 2× max run time in the past, log loudly (`lslocks` hint). Flag, never steal.
- Delete: `register_run_app`, `unregister_run_app`, `_run_owner_state`, `ACTIVE_RUN_DIR`,
`CCCI_JANITOR_MAX_AGE` + age fallback, pid-reuse guard.
- Bonus: post-reboot orphans reaped immediately (locks gone = acquirable) instead of after 2h.
- Keep: `RUN_APP_RE` allowlist, candidate discovery (abra ls + docker-service sweep).
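A minimal sketch of the two lock operations P2 describes, assuming Python's `fcntl.flock` and the lockfile naming above. `acquire_run_lock`, `probe_and_reap` and the `reap` callback are hypothetical names; the `lock_dir` parameter exists only so the sketch can run outside `/run/lock`. The stale-lock logging is omitted.

```python
import fcntl
import os

LOCK_DIR = "/run/lock"  # directory named in the plan


def lock_path(domain, lock_dir=LOCK_DIR):
    return os.path.join(lock_dir, f"cc-ci-app-{domain}.lock")


def acquire_run_lock(domain, lock_dir=LOCK_DIR):
    """Run start: block until we own the domain, before app creation.

    The caller keeps the returned fd open for the whole process lifetime;
    the kernel drops the lock when the process exits.
    """
    fd = os.open(lock_path(domain, lock_dir), os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd, fcntl.LOCK_EX)
    return fd


def probe_and_reap(domain, reap, lock_dir=LOCK_DIR):
    """Janitor: return True if an orphan was reaped, False if a run is live."""
    fd = os.open(lock_path(domain, lock_dir), os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # held: a live run owns this domain, leave it
        # Acquirable: no live owner. Reap while still holding the probe
        # lock so a new run for the same domain blocks until we finish.
        reap(domain)
        return True
    finally:
        os.close(fd)  # closing the fd releases the probe lock
```

Opening with `O_CREAT` means a lockfile that vanished with a reboot is simply recreated and found acquirable, which is what makes the post-reboot case immediate.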
### P3 — Per-run ABRA_DIR (delete the per-recipe flock)
- Each build: `ABRA_DIR=/var/lib/cc-ci-runs/<n>/abra`, seeded with the server config
(verified: abra honors `ABRA_DIR`; fails only on missing `servers/`).
- `fetch_recipe` becomes a plain clone into a fresh dir (no rm-rf of a shared tree).
- Delete: `acquire_recipe_lock` + `/run/lock/cc-ci-recipe-*` + the "before fetch_recipe" ordering rule.
- Wins: same-recipe PRs truly parallel (kills head-of-line blocking §8.2); manual operator
activity on `/root/.abra` can no longer corrupt CI runs (§8.3/§8.4 moot).
- Risk to resolve: orphaned app's `.env` lives in the dead run's tree — janitor must rely on the
existing env-less teardown path (docker-service sweep). Confirm `teardown_app(verify=False)`
fully works without the `.env`.
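Seeding a per-run `ABRA_DIR` might be sketched as follows. `make_run_abra_dir` and `run_env` are hypothetical helpers, and the copy-versus-symlink switch mirrors the open question on server-config seeding rather than a decided design. The only abra behaviour assumed is what the plan states: it honors `ABRA_DIR` and needs `servers/` to exist.

```python
import os
import shutil


def make_run_abra_dir(run_root, shared_abra_dir, copy_servers=True):
    """Create <run_root>/abra seeded with the shared server config.

    run_root would be /var/lib/cc-ci-runs/<n> and shared_abra_dir the
    operator's tree (for example /root/.abra).
    """
    abra_dir = os.path.join(run_root, "abra")
    os.makedirs(abra_dir)  # fails if the run dir is reused, by design
    src = os.path.join(shared_abra_dir, "servers")
    dst = os.path.join(abra_dir, "servers")
    if copy_servers:
        # Copy: the run is fully isolated from later operator edits.
        shutil.copytree(src, dst, symlinks=True)
    else:
        # Symlink: cheaper, but writes under servers/ hit the shared tree.
        os.symlink(src, dst)
    return abra_dir


def run_env(abra_dir):
    """Environment for abra subprocesses belonging to this run."""
    env = dict(os.environ)
    env["ABRA_DIR"] = abra_dir
    return env
```

Every abra invocation in the run would then be launched with `env=run_env(abra_dir)`, so no run shares a recipe checkout with another run.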
### P4 — Config cleanup
- Drop `concurrency.limit` from `.drone.yml`; `DRONE_RUNNER_CAPACITY` (drone-runner.nix)
becomes the single knob (§8.6).
### P5 — Spec rewrite
- Rewrite `docs/concurrency.md` to the new model (one mechanism, one table) + §10 lock-lifetime chain.
## Out of scope (do not touch)
- `services_converged()` / paused-is-settled logic; `RUN_APP_RE`; recipe-test gates; warm/canonical apps.
## Open questions for review
1. P3 server-config seeding: copy or symlink `$ABRA_DIR/servers/<server>`? Symlink is simpler; copy isolates better.
Does abra write into `servers/` during a run? If the app `.env` lives there, a copy is needed, and the per-run env file then dies with the run dir (see P3 risk).
2. Self-deadline value: 90 min ok? (longest observed green run ~35 min incl. flock wait.)
3. Land P1+P2 together or separately? (P2 testable solo; P1 is pure hardening — could merge.)