11 KiB
Concurrency restructure — full implementation plan
Expands concurrency-restructure-plan.md (operator-approved 2026-06-10).
Reference spec of current system: cc-ci docs/concurrency.md.
Decisions locked (operator + orchestrator)
| Question | Decision |
|---|---|
| Hard deadline | 60 min (signal.alarm(60*60)); recipes keep their own smaller per-tier deadlines |
servers/ in per-run ABRA_DIR |
symlink to /root/.abra/servers — app .env files land in the shared canonical path, so the janitor's abra app ls discovery and env-based teardown work unchanged (kills the P3 orphan-env risk). Per-domain filenames + app-domain lock prevent write conflicts. catalogue/ also symlinked (read-mostly). recipes/ is fresh per-run — that's the isolation that matters. |
| Landing | one branch restructure/concurrency, one commit per phase (P1..P5 + tests), merged to main only after adversary pass + lint green; live verification on main afterwards |
| Concurrency test suite | tests/concurrency/ — NOT in the default pytest tests/unit run; invoked explicitly (pytest tests/concurrency -q); documented in docs/concurrency.md |
Target invariant chain
lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min deadline
Never steal a held lock; manage the holder's lifetime. Held lock = live owner (kernel-guaranteed).
P1 — Lock-lifetime hardening
runner/run_recipe_ci.py (startup, before any abra call):
PR_SET_PDEATHSIG(SIGTERM)viactypes.CDLL("libc.so.6", use_errno=True).prctl(1, signal.SIGTERM); check return code; thenif os.getppid() == 1: sys.exit("parent died before prctl — refusing to run orphaned").- SIGTERM handler: run teardown (reuse the existing finally/teardown funnel — raise SystemExit
from the handler so
finally:blocks run), exit non-zero. - Self-deadline:
signal.alarm(60*60)+ SIGALRM handler → same teardown funnel, distinct log line (== run exceeded 60-minute hard deadline — tearing down ==), exit non-zero.
.drone.yml recipe-ci step: wrap the harness invocation:
setsid python3 runner/run_recipe_ci.py ... &
PID=$!; trap 'kill -TERM -- -$PID 2>/dev/null' TERM EXIT; wait $PID
(builder: verify drone exec runner signal delivery semantics; the trap must fire on drone cancel.)
Also: one-line comment on the lock open() noting PEP 446 makes the fd non-inheritable
(subprocess children never carry the lock).
P2 — Flock-probe janitor (replaces the pidfile registry)
runner/harness/lifecycle.py:
- New:
acquire_app_lock(domain)—/run/lock/cc-ci-app-<domain>.lock, exclusive flock, non-blocking try first → onBlockingIOErrorlog== app lock: another run of <domain> is in flight — waiting ==and block. Returns the kept-alive file object. Touch mtime at acquisition (mtime = lock age for the long-held flag). - Call site: in
deploy_app()exactly whereregister_run_app()is today (BEFORE app creation); the file object must be stored on a module-level/owner object so it lives for the process lifetime. - Janitor rewrite (
janitor()): for each candidate domain (discovery unchanged:abra app lsmatched againstRUN_APP_RE+ docker-service sweep):- open/create its lockfile, try
LOCK_EX|LOCK_NB:- acquired → orphan →
teardown_app(domain, verify=False)WHILE HOLDING the probe lock (closes janitor-vs-new-run race: a new run of that domain blocks on the lock until the reap finishes), then release + unlink lockfile. - held → live run → leave it; if lockfile mtime older than 2× deadline (120 min), log:
!! lock for <domain> held >120min — possible leaked run; inspect with lslocks(flag, never steal).
- acquired → orphan →
- open/create its lockfile, try
- Delete:
register_run_app,unregister_run_app,_run_owner_state,ACTIVE_RUN_DIR,CCCI_JANITOR_MAX_AGEhandling + age fallback, pid-reuse guard, and their call sites (deploy_app,teardown_app, both janitor call sites keep working — only internals change). teardown_app()no longer unregisters; normal-exit lock release stays implicit (process exit). Janitor unlinks lockfiles only for reaped orphans; a clean run's stale lockfile (0 bytes, unlocked) is harmless and reaped-on-sight by the next janitor probe (unlink after acquiring probe lock, even when no app exists for it — keep /run/lock tidy).
Edge semantics to preserve/verify:
- Two janitors racing: both probe; one wins the flock and reaps; the loser sees "held" and leaves.
Reap must be idempotent (
teardown_app(verify=False)already tolerates half-gone stacks). - Post-reboot: lockfiles gone (tmpfs) → probe trivially acquires → immediate reap. (Improvement over the 2h age fallback — note in spec.)
- Warm/canonical apps never match
RUN_APP_RE→ never probed. Unchanged. DO NOT touch.
P3 — Per-run ABRA_DIR (deletes the per-recipe flock)
runner/run_recipe_ci.py / runner/harness/:
- Early in main(): build
run_abra_dir = /var/lib/cc-ci-runs/<build>/abra; create it; symlinkservers→/root/.abra/servers,catalogue→/root/.abra/catalogue(builder: verify the minimal set abra needs — probe showed it fails only on missingservers/); fresh emptyrecipes/. Setos.environ["ABRA_DIR"] = run_abra_dirbefore ANY abra/harness call. fetch_recipe()becomes a plain clone into$ABRA_DIR/recipes/<recipe>(no rm-rf of a shared tree; keep the checkout-of-ref logic).- Delete:
acquire_recipe_lock,RECIPE_LOCK_DIRrecipe-lock usage, its call site in main(), and the "must be taken before fetch_recipe" ordering rule. .drone.yml: HOME=/root comment updated — abra config still found via symlinked servers; recipe trees now per-run.- Janitor / warm-app / canonical flows that read
/root/.abra/recipes/...(if any — builder: grep) must be checked; they run outside a per-run ABRA_DIR and keep using the default. - Run-dir cleanup:
/var/lib/cc-ci-runs/<n>/retention already exists — abra dir rides along.
P4 — Config cleanup
- Remove
concurrency.limitfrom.drone.ymlrecipe-ci pipeline;DRONE_RUNNER_CAPACITY(nix/modules/drone-runner.nixmaxTests) is the single knob. Update its comment.
P5 — Spec rewrite
- Rewrite
docs/concurrency.mdto the new model: one mechanism (per-app-domain flock), the invariant chain (§ lock-lifetime management), flock-probe janitor decision table, per-run ABRA_DIR isolation, updated failure-mode table (cancel now kills harness; reboot reaps immediately; deadline bounds everything), updated file/symbol index, and how to runtests/concurrency.
Test suite: tests/concurrency/
Real-kernel tests (no mocking of flock itself — spawn helper subprocesses that hold locks; use
tmp dirs for lock/run dirs via env/monkeypatch). NOT part of pytest tests/unit. Run with
pytest tests/concurrency -q (document in docs; CI does not run it on every build — optional
manual/nightly invocation).
Lock fundamentals:
- acquire → child process SIGKILL'd → lock immediately acquirable (kernel auto-release).
- probe (
LOCK_NB) against a held lock raisesBlockingIOError; against unheld succeeds. - lock fd NOT inherited: holder spawns a
subprocesschild, holder dies, child still alive → lock acquirable (PEP 446). - second same-domain acquire blocks until first holder exits (double-!testme serialisation).
Janitor / probe semantics:
5. orphan (no holder) → reaped, lockfile unlinked; teardown invoked exactly once (stub teardown).
6. live (held by helper process) → never reaped, logged as live.
7. reap-under-probe-lock race: janitor holds probe lock mid-reap → a new run's acquire of the
same domain blocks until reap completes (no window where new app coexists with half-reaped one).
8. two concurrent janitors → exactly one reaps (flock arbitration), reap idempotent.
9. reboot simulation: app exists, lockfile absent → treated as orphan immediately (no age wait).
10. long-held flag: held lock with mtime backdated >120min → warning logged, NOT reaped.
11. RUN_APP_RE allowlist: warm/canonical-shaped names never become candidates.
12. garbled/unwritable lockfile, missing lock dir → janitor degrades safely (skip + log, never crash).
Lifetime hardening:
13. PDEATHSIG: spawn wrapper-parent + harness-child (helper script calling the real prctl setup);
kill parent → child receives SIGTERM and exits, lock released.
14. ppid race: helper started with already-dead parent (ppid 1) exits immediately.
15. deadline: helper with alarm(2) → teardown hook fires, exit non-zero, lock released.
16. SIGTERM → teardown funnel runs (finally: executed), lock released, non-zero exit.
ABRA_DIR isolation:
17. per-run dir built correctly: servers/catalogue symlinks resolve, recipes/ empty + writable,
ABRA_DIR exported before first abra call (assert via stub abra recording its env).
18. two concurrent fetch_recipe of the SAME recipe into different ABRA_DIRs → both trees correct,
no cross-talk (the old corruption scenario, now safe without a lock).
19. app .env written through the servers symlink lands in the canonical shared path (janitor
discovery keeps working).
Each test must clean up helper processes in finally/fixtures (no leaked children in the test VM).
Roles & flow
- Builder (fable — main-loop model, no override): implements P1→P5 + test suite on branch
restructure/concurrencyin a fresh clone; one commit per phase; runspytest tests/unit -q,pytest tests/concurrency -q,scripts/lint.shbefore each commit; pushes the branch (never main). Commit authorautonomic-bot <autonomic-bot@noreply.git.autonomic.zone>. - Adversary (opus): adversarial review of the full branch diff — hunts races (probe vs
acquire ordering, signal-handler reentrancy, teardown-during-teardown), deleted-code fallout
(grep for dangling references to registry symbols), gate integrity (recipe tests/
RUN_APP_RE/ warm apps untouched), and test-suite blind spots. Produces findings list; default-skeptical. - Builder addresses findings; adversary re-checks. Orchestrator merges to main only when: adversary pass + both test suites green + lint green + branch push build green.
- Live verification (orchestrator, post-merge): a. trigger an immich !testme, cancel mid-run via drone API → assert harness pid dies (no leak), lock released, next janitor reaps the app; b. two parallel !testme runs (immich + plausible) → both green, zero leakage; c. double-!testme same PR → second blocks on app lock, then runs; d. confirm a full green run end-to-end (the standing regression canary).
Guardrails (builder + adversary MUST honor)
- NEVER weaken recipe-test gates or touch
tests/<recipe>/content. - NEVER push cc-ci main; branch only. Never force-push.
- No secrets in commits; reference
.testenv//run/secretslocations only. - Don't touch
services_converged()/ paused-is-settled logic except where the plan says. - Match repo commit style (
feat(...)/fix(harness)/test(...)/docs:).
Rollback
Single revert of the merge commit restores the registry+recipe-flock world; no data migration (lock/registry state is tmpfs-ephemeral, old and new harness can't run simultaneously anyway — deploys land from main atomically per build).