Files
cc-ci-orchestrator/cc-ci-plan/concurrency-restructure-full-plan.md

11 KiB
Raw Blame History

Concurrency restructure — full implementation plan

Expands concurrency-restructure-plan.md (operator-approved 2026-06-10). Reference spec of current system: cc-ci docs/concurrency.md.

Decisions locked (operator + orchestrator)

Question Decision
Hard deadline 60 min (signal.alarm(60*60)); recipes keep their own smaller per-tier deadlines
servers/ in per-run ABRA_DIR symlink to /root/.abra/servers — app .env files land in the shared canonical path, so the janitor's abra app ls discovery and env-based teardown work unchanged (kills the P3 orphan-env risk). Per-domain filenames + app-domain lock prevent write conflicts. catalogue/ also symlinked (read-mostly). recipes/ is fresh per-run — that's the isolation that matters.
Landing one branch restructure/concurrency, one commit per phase (P1..P5 + tests), merged to main only after adversary pass + lint green; live verification on main afterwards
Concurrency test suite tests/concurrency/ — NOT in the default pytest tests/unit run; invoked explicitly (pytest tests/concurrency -q); documented in docs/concurrency.md

Target invariant chain

lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min deadline

Never steal a held lock; manage the holder's lifetime. Held lock = live owner (kernel-guaranteed).


P1 — Lock-lifetime hardening

runner/run_recipe_ci.py (startup, before any abra call):

  1. PR_SET_PDEATHSIG(SIGTERM) via ctypes.CDLL("libc.so.6", use_errno=True).prctl(1, signal.SIGTERM); check return code; then if os.getppid() == 1: sys.exit("parent died before prctl — refusing to run orphaned").
  2. SIGTERM handler: run teardown (reuse the existing finally/teardown funnel — raise SystemExit from the handler so finally: blocks run), exit non-zero.
  3. Self-deadline: signal.alarm(60*60) + SIGALRM handler → same teardown funnel, distinct log line (== run exceeded 60-minute hard deadline — tearing down ==), exit non-zero.

.drone.yml recipe-ci step: wrap the harness invocation:

setsid python3 runner/run_recipe_ci.py ... &
PID=$!; trap 'kill -TERM -- -$PID 2>/dev/null' TERM EXIT; wait $PID

(builder: verify drone exec runner signal delivery semantics; the trap must fire on drone cancel.)

Also: one-line comment on the lock open() noting PEP 446 makes the fd non-inheritable (subprocess children never carry the lock).

P2 — Flock-probe janitor (replaces the pidfile registry)

runner/harness/lifecycle.py:

  1. New: acquire_app_lock(domain)/run/lock/cc-ci-app-<domain>.lock, exclusive flock, non-blocking try first → on BlockingIOError log == app lock: another run of <domain> is in flight — waiting == and block. Returns the kept-alive file object. Touch mtime at acquisition (mtime = lock age for the long-held flag).
  2. Call site: in deploy_app() exactly where register_run_app() is today (BEFORE app creation); the file object must be stored on a module-level/owner object so it lives for the process lifetime.
  3. Janitor rewrite (janitor()): for each candidate domain (discovery unchanged: abra app ls matched against RUN_APP_RE + docker-service sweep):
    • open/create its lockfile, try LOCK_EX|LOCK_NB:
      • acquired → orphan → teardown_app(domain, verify=False) WHILE HOLDING the probe lock (closes janitor-vs-new-run race: a new run of that domain blocks on the lock until the reap finishes), then release + unlink lockfile.
      • held → live run → leave it; if lockfile mtime older than 2× deadline (120 min), log: !! lock for <domain> held >120min — possible leaked run; inspect with lslocks (flag, never steal).
  4. Delete: register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE handling + age fallback, pid-reuse guard, and their call sites (deploy_app, teardown_app, both janitor call sites keep working — only internals change).
  5. teardown_app() no longer unregisters; normal-exit lock release stays implicit (process exit). Janitor unlinks lockfiles only for reaped orphans; a clean run's stale lockfile (0 bytes, unlocked) is harmless and reaped-on-sight by the next janitor probe (unlink after acquiring probe lock, even when no app exists for it — keep /run/lock tidy).

Edge semantics to preserve/verify:

  • Two janitors racing: both probe; one wins the flock and reaps; the loser sees "held" and leaves. Reap must be idempotent (teardown_app(verify=False) already tolerates half-gone stacks).
  • Post-reboot: lockfiles gone (tmpfs) → probe trivially acquires → immediate reap. (Improvement over the 2h age fallback — note in spec.)
  • Warm/canonical apps never match RUN_APP_RE → never probed. Unchanged. DO NOT touch.

P3 — Per-run ABRA_DIR (deletes the per-recipe flock)

runner/run_recipe_ci.py / runner/harness/:

  1. Early in main(): build run_abra_dir = /var/lib/cc-ci-runs/<build>/abra; create it; symlink servers/root/.abra/servers, catalogue/root/.abra/catalogue (builder: verify the minimal set abra needs — probe showed it fails only on missing servers/); fresh empty recipes/. Set os.environ["ABRA_DIR"] = run_abra_dir before ANY abra/harness call.
  2. fetch_recipe() becomes a plain clone into $ABRA_DIR/recipes/<recipe> (no rm-rf of a shared tree; keep the checkout-of-ref logic).
  3. Delete: acquire_recipe_lock, RECIPE_LOCK_DIR recipe-lock usage, its call site in main(), and the "must be taken before fetch_recipe" ordering rule.
  4. .drone.yml: HOME=/root comment updated — abra config still found via symlinked servers; recipe trees now per-run.
  5. Janitor / warm-app / canonical flows that read /root/.abra/recipes/... (if any — builder: grep) must be checked; they run outside a per-run ABRA_DIR and keep using the default.
  6. Run-dir cleanup: /var/lib/cc-ci-runs/<n>/ retention already exists — abra dir rides along.

P4 — Config cleanup

  • Remove concurrency.limit from .drone.yml recipe-ci pipeline; DRONE_RUNNER_CAPACITY (nix/modules/drone-runner.nix maxTests) is the single knob. Update its comment.

P5 — Spec rewrite

  • Rewrite docs/concurrency.md to the new model: one mechanism (per-app-domain flock), the invariant chain (§ lock-lifetime management), flock-probe janitor decision table, per-run ABRA_DIR isolation, updated failure-mode table (cancel now kills harness; reboot reaps immediately; deadline bounds everything), updated file/symbol index, and how to run tests/concurrency.

Test suite: tests/concurrency/

Real-kernel tests (no mocking of flock itself — spawn helper subprocesses that hold locks; use tmp dirs for lock/run dirs via env/monkeypatch). NOT part of pytest tests/unit. Run with pytest tests/concurrency -q (document in docs; CI does not run it on every build — optional manual/nightly invocation).

Lock fundamentals:

  1. acquire → child process SIGKILL'd → lock immediately acquirable (kernel auto-release).
  2. probe (LOCK_NB) against a held lock raises BlockingIOError; against unheld succeeds.
  3. lock fd NOT inherited: holder spawns a subprocess child, holder dies, child still alive → lock acquirable (PEP 446).
  4. second same-domain acquire blocks until first holder exits (double-!testme serialisation).

Janitor / probe semantics: 5. orphan (no holder) → reaped, lockfile unlinked; teardown invoked exactly once (stub teardown). 6. live (held by helper process) → never reaped, logged as live. 7. reap-under-probe-lock race: janitor holds probe lock mid-reap → a new run's acquire of the same domain blocks until reap completes (no window where new app coexists with half-reaped one). 8. two concurrent janitors → exactly one reaps (flock arbitration), reap idempotent. 9. reboot simulation: app exists, lockfile absent → treated as orphan immediately (no age wait). 10. long-held flag: held lock with mtime backdated >120min → warning logged, NOT reaped. 11. RUN_APP_RE allowlist: warm/canonical-shaped names never become candidates. 12. garbled/unwritable lockfile, missing lock dir → janitor degrades safely (skip + log, never crash).

Lifetime hardening: 13. PDEATHSIG: spawn wrapper-parent + harness-child (helper script calling the real prctl setup); kill parent → child receives SIGTERM and exits, lock released. 14. ppid race: helper started with already-dead parent (ppid 1) exits immediately. 15. deadline: helper with alarm(2) → teardown hook fires, exit non-zero, lock released. 16. SIGTERM → teardown funnel runs (finally: executed), lock released, non-zero exit.

ABRA_DIR isolation: 17. per-run dir built correctly: servers/catalogue symlinks resolve, recipes/ empty + writable, ABRA_DIR exported before first abra call (assert via stub abra recording its env). 18. two concurrent fetch_recipe of the SAME recipe into different ABRA_DIRs → both trees correct, no cross-talk (the old corruption scenario, now safe without a lock). 19. app .env written through the servers symlink lands in the canonical shared path (janitor discovery keeps working).

Each test must clean up helper processes in finally/fixtures (no leaked children in the test VM).

Roles & flow

  1. Builder (fable — main-loop model, no override): implements P1→P5 + test suite on branch restructure/concurrency in a fresh clone; one commit per phase; runs pytest tests/unit -q, pytest tests/concurrency -q, scripts/lint.sh before each commit; pushes the branch (never main). Commit author autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>.
  2. Adversary (opus): adversarial review of the full branch diff — hunts races (probe vs acquire ordering, signal-handler reentrancy, teardown-during-teardown), deleted-code fallout (grep for dangling references to registry symbols), gate integrity (recipe tests/RUN_APP_RE/ warm apps untouched), and test-suite blind spots. Produces findings list; default-skeptical.
  3. Builder addresses findings; adversary re-checks. Orchestrator merges to main only when: adversary pass + both test suites green + lint green + branch push build green.
  4. Live verification (orchestrator, post-merge): a. trigger an immich !testme, cancel mid-run via drone API → assert harness pid dies (no leak), lock released, next janitor reaps the app; b. two parallel !testme runs (immich + plausible) → both green, zero leakage; c. double-!testme same PR → second blocks on app lock, then runs; d. confirm a full green run end-to-end (the standing regression canary).

Guardrails (builder + adversary MUST honor)

  • NEVER weaken recipe-test gates or touch tests/<recipe>/ content.
  • NEVER push cc-ci main; branch only. Never force-push.
  • No secrets in commits; reference .testenv / /run/secrets locations only.
  • Don't touch services_converged() / paused-is-settled logic except where the plan says.
  • Match repo commit style (feat(...)/fix(harness)/test(...)/docs:).

Rollback

Single revert of the merge commit restores the registry+recipe-flock world; no data migration (lock/registry state is tmpfs-ephemeral, old and new harness can't run simultaneously anyway — deploys land from main atomically per build).