Concurrency restructure — concise plan (for review)

Goal: collapse the two concurrency mechanisms (per-recipe flock + pidfile registry/janitor) into one primitive — a per-app-domain kernel flock — and make lock lifetime bounded by construction. Reference: cc-ci docs/concurrency.md §8.

Target invariant chain

lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ hard deadline

Never steal a held lock; manage the holder's lifetime. A held lock = live owner (kernel-guaranteed).

Phases (each independently landable, in order)

P1 — Lock-lifetime hardening (no behavior change to locking model)

  • Harness startup: PR_SET_PDEATHSIG(SIGTERM) + post-prctl getppid()==1 race check → drone cancel now kills the harness, fixing the leaked-python gap.
  • .drone.yml step wrapper: setsid + trap killing the process group on TERM/EXIT.
  • Self-deadline: signal.alarm(90*60) → teardown + exit. Hard upper bound on any run.
  • Verify: trigger a build, cancel it via the API, then assert the harness pid is gone and the app is torn down by the next janitor pass.
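
A minimal sketch of the two harness-side pieces of P1 (parent-death signal and self-deadline), assuming the harness is a Linux Python process running in its main thread. The names arm_parent_death_signal, run_with_deadline, DeadlineExceeded and the exit code 124 are hypothetical, not taken from the cc-ci codebase. The getppid()==1 check follows the bullet above; under a subreaper, or where PID 1 is the legitimate parent inside a container, that test would need adjusting.

```python
import ctypes
import os
import signal

PR_SET_PDEATHSIG = 1          # constant from <linux/prctl.h>
DEADLINE_SECONDS = 90 * 60    # the 90-minute bound proposed in P1


def arm_parent_death_signal(sig: int = signal.SIGTERM) -> bool:
    """Ask the kernel to send `sig` to this process when its parent dies.

    Returns False if the parent appears to have exited already (we were
    reparented to pid 1 before prctl took effect), in which case no signal
    will ever arrive and the caller should tear down and exit itself.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(sig)), 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return os.getppid() != 1


class DeadlineExceeded(Exception):
    """Raised in the main thread when the self-deadline fires."""


def _on_alarm(signum, frame):
    raise DeadlineExceeded()


def run_with_deadline(run, teardown, seconds: int = DEADLINE_SECONDS) -> int:
    """Run `run()` under a SIGALRM deadline; always call `teardown()`.

    Returns 0 on normal completion and 124 (hypothetical choice, borrowed
    from timeout(1)) if the deadline fired. The alarm is cancelled before
    teardown, so teardown itself is not bounded by this sketch.
    """
    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        run()
        return 0
    except DeadlineExceeded:
        return 124
    finally:
        signal.alarm(0)
        teardown()
```

A harness entry point might call arm_parent_death_signal() first, exit immediately if it returns False, and then wrap the whole run in run_with_deadline so that both SIGTERM from the dying parent and the deadline lead to the same teardown path.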

P2 — Flock-probe janitor (replace the pidfile registry)

  • Run start: exclusive flock on /run/lock/cc-ci-app-<domain>.lock before app creation, held for process lifetime (also serialises double-!testme of the same PR/ref).
  • Janitor: probe each candidate's lockfile with LOCK_EX|LOCK_NB. Acquirable → orphan → reap while holding the probe lock (closes janitor-vs-new-run race). Held → live run → leave; if the lockfile's age (now minus mtime) exceeds 2× max run time, log loudly (lslocks hint): flag, never steal.
  • Delete: register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE + age fallback, pid-reuse guard.
  • Bonus: post-reboot orphans reaped immediately (locks gone = acquirable) instead of after 2h.
  • Keep: RUN_APP_RE allowlist, candidate discovery (abra ls + docker-service sweep).
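
The run-start lock and the janitor probe described above could look roughly like the following sketch. It assumes Linux flock semantics via Python's fcntl module. The function names (lock_path, acquire_run_lock, probe), the return strings, and the MAX_RUN_SECONDS value are hypothetical. It also assumes the run refreshes the lockfile mtime on acquisition, so that mtime approximates when the current holder started; the plan does not say how mtime is maintained.

```python
import fcntl
import os
import sys
import time
from pathlib import Path

LOCK_DIR = Path("/run/lock")
MAX_RUN_SECONDS = 90 * 60   # assumption: reuse the P1 self-deadline


def lock_path(domain: str, lock_dir=LOCK_DIR) -> Path:
    return Path(lock_dir) / f"cc-ci-app-{domain}.lock"


def acquire_run_lock(domain: str, lock_dir=LOCK_DIR) -> int:
    """Block until this run owns the per-app-domain lock; return the fd.

    The caller keeps the fd open for the life of the process. os.open()
    returns a non-inheritable fd, so child processes do not extend the
    lock's lifetime past the harness.
    """
    fd = os.open(lock_path(domain, lock_dir), os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd, fcntl.LOCK_EX)
    os.utime(fd)  # assumption: mtime marks when the current holder started
    return fd


def probe(domain: str, reap, lock_dir=LOCK_DIR) -> str:
    """Janitor probe: reap if the lock is free, otherwise leave the run alone."""
    path = lock_path(domain, lock_dir)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Held: a live owner exists. Flag suspiciously old locks, never steal.
            age = time.time() - os.fstat(fd).st_mtime
            if age > 2 * MAX_RUN_SECONDS:
                print(f"WARNING: {path} held for {age:.0f}s; inspect with lslocks",
                      file=sys.stderr)
                return "stale-held"
            return "live"
        # Acquirable: no owner. Reap while still holding the probe lock so a
        # new run for the same domain blocks until the reap has finished.
        reap(domain)
        return "reaped"
    finally:
        os.close(fd)  # closing the fd releases the probe lock, if held
```

Because the probe creates the lockfile when it is missing, a candidate whose lockfile vanished (for example after a reboot cleared /run/lock) is treated as acquirable and reaped on the first pass.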

P3 — Per-run ABRA_DIR (delete the per-recipe flock)

  • Each build: ABRA_DIR=/var/lib/cc-ci-runs/<n>/abra, seeded with the server config (verified: abra honors ABRA_DIR; fails only on missing servers/).
  • fetch_recipe becomes a plain clone into a fresh dir (no rm-rf of a shared tree).
  • Delete: acquire_recipe_lock + /run/lock/cc-ci-recipe-* + the "before fetch_recipe" ordering rule.
  • Wins: same-recipe PRs truly parallel (kills head-of-line blocking §8.2); manual operator activity on /root/.abra can no longer corrupt CI runs (§8.3/§8.4 moot).
  • Risk to resolve: orphaned app's .env lives in the dead run's tree — janitor must rely on the existing env-less teardown path (docker-service sweep). Confirm teardown_app(verify=False) fully works without the .env.
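
As an illustration of the per-run ABRA_DIR setup, here is a sketch that seeds a fresh tree by copying the server config, which is one of the two options still open in the review questions. It assumes `<n>` is the build number and that the server config lives under servers/<server> of the shared tree. prepare_abra_dir and abra_env are hypothetical helpers, not existing harness functions.

```python
import os
import shutil
from pathlib import Path

RUNS_ROOT = Path("/var/lib/cc-ci-runs")


def prepare_abra_dir(build_number, server: str, source_abra_dir,
                     runs_root=RUNS_ROOT) -> Path:
    """Create <runs_root>/<build_number>/abra seeded with servers/<server>.

    Raises FileExistsError if the per-run tree already exists, so a run
    never silently reuses another run's directory.
    """
    abra_dir = Path(runs_root) / str(build_number) / "abra"
    src = Path(source_abra_dir) / "servers" / server
    dst = abra_dir / "servers" / server
    dst.parent.mkdir(parents=True, exist_ok=False)
    # Copy variant; a symlink variant would replace this line with dst.symlink_to(src).
    shutil.copytree(src, dst, symlinks=True)
    return abra_dir


def abra_env(abra_dir, base=None) -> dict:
    """Environment for abra subprocesses, pointing ABRA_DIR at the per-run tree."""
    env = dict(os.environ if base is None else base)
    env["ABRA_DIR"] = str(abra_dir)
    return env
```

The harness would then pass abra_env(...) as the env of every abra subprocess, and fetch_recipe would clone into the per-run tree. Whether the copy should exclude other apps' .env files from the shared servers/ directory is left open here.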

P4 — Config cleanup

  • Drop concurrency.limit from .drone.yml; DRONE_RUNNER_CAPACITY (drone-runner.nix) becomes the single knob (§8.6).

P5 — Spec rewrite

  • Rewrite docs/concurrency.md to the new model (one mechanism, one table) + §10 lock-lifetime chain.

Out of scope (do not touch)

  • services_converged() / paused-is-settled logic; RUN_APP_RE; recipe-test gates; warm/canonical apps.

Open questions for review

  1. P3 server-config seeding: copy or symlink $ABRA_DIR/servers/<server>? Symlink is simpler; copy isolates better. Does abra write into servers/ during a run? If so (e.g. the app .env lives there), a copy is needed, and the per-run env file then dies with the run dir (see P3 risk).
  2. Self-deadline value: is 90 min acceptable? (Longest observed green run: ~35 min, including flock wait.)
  3. Land P1+P2 together or separately? (P2 testable solo; P1 is pure hardening — could merge.)