Concurrency restructure — concise plan (for review)
Goal: collapse the two concurrency mechanisms (per-recipe flock + pidfile registry/janitor)
into one primitive — a per-app-domain kernel flock — and make lock lifetime bounded by
construction. Reference: cc-ci docs/concurrency.md §8.
Target invariant chain
lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ hard deadline
Never steal a held lock; manage the holder's lifetime. A held lock = live owner (kernel-guaranteed).
Phases (each independently landable, in order)
P1 — Lock-lifetime hardening (no behavior change to locking model)
- Harness startup: PR_SET_PDEATHSIG(SIGTERM) + post-prctl getppid()==1 race check → drone cancel now kills the harness, fixing the leaked-python gap.
- .drone.yml step wrapper: setsid + trap killing the process group on TERM/EXIT.
- Self-deadline: signal.alarm(90*60) → teardown + exit. Hard upper bound on any run.
- Verify: trigger a build, cancel via API, assert harness pid gone + app torn down by next janitor.
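The harness-side pieces of P1 can be sketched as below. This is a minimal sketch, not the harness's actual code: the function names `arm_parent_death_signal` and `install_self_deadline`, the `teardown` callback, and the exit code are hypothetical. It assumes a Linux host, since `prctl` is reached through `ctypes`, and that both functions are called from the main thread.

```python
import ctypes
import os
import signal
import sys

PR_SET_PDEATHSIG = 1        # constant from <linux/prctl.h>
DEADLINE_SECONDS = 90 * 60  # self-deadline value proposed in this plan


def arm_parent_death_signal(sig=signal.SIGTERM):
    """Ask the kernel to deliver `sig` to this process when its parent exits.

    Returns True if the parent still looked alive after arming, False if the
    process had already been reparented to pid 1 (the race the plan checks).
    """
    libc = ctypes.CDLL(None, use_errno=True)
    rc = libc.prctl(
        PR_SET_PDEATHSIG,
        ctypes.c_ulong(int(sig)),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
    )
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # If the parent died before prctl took effect, no signal will ever come,
    # so the caller has to notice the reparenting itself.
    return os.getppid() != 1


def install_self_deadline(teardown, seconds=DEADLINE_SECONDS):
    """Arm SIGALRM so the run tears down and exits after `seconds`."""

    def _on_alarm(signum, frame):
        try:
            teardown()   # hypothetical callback that tears the app down
        finally:
            sys.exit(1)  # exit code is an assumption

    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
```

At startup the harness would call `arm_parent_death_signal()` and exit immediately on a `False` result, then call `install_self_deadline(...)` before creating any app. Returning a boolean instead of exiting inside the helper keeps the race check testable.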
P2 — Flock-probe janitor (replace the pidfile registry)
- Run start: exclusive flock on /run/lock/cc-ci-app-<domain>.lock before app creation, held for process lifetime (also serialises double-!testme of the same PR/ref).
- Janitor: probe each candidate's lockfile with LOCK_EX|LOCK_NB. Acquirable → orphan → reap while holding the probe lock (closes janitor-vs-new-run race). Held → live run → leave; if the lockfile's mtime is older than 2× max run time, log loudly (lslocks hint). Flag, never steal.
- Delete: register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE + age fallback, pid-reuse guard.
- Bonus: post-reboot orphans reaped immediately (locks gone = acquirable) instead of after 2h.
- Keep: RUN_APP_RE allowlist, candidate discovery (abra ls + docker-service sweep).
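The run-start lock and the janitor probe described in P2 could look roughly like this. It is a sketch under assumptions: `hold_run_lock`, `probe_and_reap`, the `reap` callback, and the status strings are hypothetical names, the default `max_run_seconds` borrows the 90-minute self-deadline from P1, and refreshing the lockfile's mtime at run start is an added assumption so that the staleness check has something meaningful to compare against.

```python
import fcntl
import os
import sys
import time

LOCK_DIR = "/run/lock"


def lock_path(domain, lock_dir=LOCK_DIR):
    return os.path.join(lock_dir, f"cc-ci-app-{domain}.lock")


def hold_run_lock(domain, lock_dir=LOCK_DIR):
    """Run start: take the exclusive lock; keep the returned fd open for life."""
    path = lock_path(domain, lock_dir)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd, fcntl.LOCK_EX)  # blocks behind a concurrent run of the same domain
    os.utime(path)                  # assumption: mtime marks when this run started
    return fd


def probe_and_reap(domain, reap, lock_dir=LOCK_DIR, max_run_seconds=90 * 60):
    """Janitor: reap the app only if nobody holds its lock."""
    path = lock_path(domain, lock_dir)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Held: a live owner exists. Flag suspiciously old holders, never steal.
            age = time.time() - os.fstat(fd).st_mtime
            if age > 2 * max_run_seconds:
                print(f"WARNING: {path} held for {age:.0f}s; inspect with lslocks",
                      file=sys.stderr)
                return "live-stale"
            return "live"
        # Acquired: no owner. Reap while still holding the probe lock, so a new
        # run of the same domain waits in hold_run_lock() until teardown is done.
        reap(domain)
        return "reaped"
    finally:
        os.close(fd)  # closing the fd releases the flock
```

Because flock locks belong to the open file description, the kernel drops the run's lock when the harness process dies for any reason, which is what lets the janitor treat "acquirable" as "orphan" without a pid registry.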
P3 — Per-run ABRA_DIR (delete the per-recipe flock)
- Each build: ABRA_DIR=/var/lib/cc-ci-runs/<n>/abra, seeded with the server config (verified: abra honors ABRA_DIR; fails only on missing servers/).
- fetch_recipe becomes a plain clone into a fresh dir (no rm-rf of a shared tree).
- Delete: acquire_recipe_lock + /run/lock/cc-ci-recipe-* + the "before fetch_recipe" ordering rule.
- Wins: same-recipe PRs truly parallel (kills head-of-line blocking §8.2); manual operator activity on /root/.abra can no longer corrupt CI runs (§8.3/§8.4 moot).
- Risk to resolve: an orphaned app's .env lives in the dead run's tree, so the janitor must rely on the existing env-less teardown path (docker-service sweep). Confirm teardown_app(verify=False) fully works without the .env.
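One way the per-run ABRA_DIR setup might be sketched is shown below. The function names, the `run_id` parameter (standing in for `<n>`), and the seed location passed by the caller are hypothetical. The sketch copies `servers/` rather than symlinking it, which is only one of the two options this plan leaves open.

```python
import os
import shutil


def prepare_run_abra_dir(run_root, run_id, seed_abra_dir):
    """Create <run_root>/<run_id>/abra and seed it with the server config."""
    abra_dir = os.path.join(run_root, str(run_id), "abra")
    os.makedirs(abra_dir)  # raises if the dir exists: every run must start fresh
    # abra fails without servers/, so the seed is mandatory. Copying (rather
    # than symlinking) keeps anything written there private to this run.
    shutil.copytree(
        os.path.join(seed_abra_dir, "servers"),
        os.path.join(abra_dir, "servers"),
    )
    return abra_dir


def run_env(abra_dir, base=None):
    """Environment for abra subprocesses of this run; the parent env is untouched."""
    env = dict(os.environ if base is None else base)
    env["ABRA_DIR"] = abra_dir
    return env
```

A run would then pass `env=run_env(abra_dir)` to every abra subprocess, for example with `run_root="/var/lib/cc-ci-runs"` and a seed taken from the operator's existing abra directory. Since nothing is shared between runs, no lock is needed around the recipe clone.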
P4 — Config cleanup
- Drop concurrency.limit from .drone.yml; DRONE_RUNNER_CAPACITY (drone-runner.nix) becomes the single knob (§8.6).
P5 — Spec rewrite
- Rewrite docs/concurrency.md to the new model (one mechanism, one table) + §10 lock-lifetime chain.
Out of scope (do not touch)
services_converged() / paused-is-settled logic; RUN_APP_RE; recipe-test gates; warm/canonical apps.
Open questions for review
- P3 server-config seeding: copy vs symlink of $ABRA_DIR/servers/<server>? (symlink simpler; copy isolates harder. Does abra write into servers/ during a run? e.g. app .env lives there → copy needed, per-run env file then dies with the run dir, see P3 risk.)
- Self-deadline value: 90 min ok? (longest observed green run ~35 min incl. flock wait.)
- Land P1+P2 together or separately? (P2 testable solo; P1 is pure hardening — could merge.)