Documents the capacity=2 concurrent-run system as landed inc0df77d,68ef0f8,e6d55b5: config knobs, isolation model, per-recipe flock, active-run registry + three-way janitor, convergence interactions, failure-mode guarantees, and known limitations / restructuring candidates.
11 KiB
Concurrency: how parallel recipe CI runs stay safe
Spec of the concurrent-run system as of 2026-06-10 (commits c0df77d, 68ef0f8, e6d55b5).
Written for review/restructuring — it documents what IS, including known limitations.
1. Goal and design summary
Two recipe CI builds may run at the same time on the single cc-ci host (e.g. immich and plausible under active development at once). Safety is enforced by the harness, not by serialising everything:
| Rule | Mechanism |
|---|---|
| Different recipes run in parallel | nothing blocks them (isolation, §3) |
| Same-recipe runs serialise | per-recipe flock (§4) |
| A starting run never reaps a live concurrent run's app | active-run registry + three-way janitor (§5) |
| A crashed run's leftovers still get reaped | registry owner-dead detection, age fallback (§5) |
There is no daemon and no shared state service: both mechanisms are kernel/file primitives
under /run, scoped to the harness process lifetime, so a SIGKILL'd run can never leak a stale
lock or a stale "I'm alive" claim.
2. Configuration knobs
| Knob | Where | Current | Meaning |
|---|---|---|---|
DRONE_RUNNER_CAPACITY (aka MAX_TESTS) |
nix/modules/drone-runner.nix (maxTests let-binding) |
2 |
THE cap. Max builds the exec runner executes at once; Drone queues the rest in its native pending queue. Change requires nixos-rebuild switch on cc-ci. |
concurrency.limit |
.drone.yml, recipe-ci pipeline |
2 |
Server-side cap on concurrent recipe-ci pipelines. Kept equal to capacity; redundant belt (the push pipeline shares runner capacity too, so lint builds can interleave). |
CCCI_JANITOR_MAX_AGE |
env, read in lifecycle.janitor() |
unset → 7200s |
Age fallback only — applies solely to apps with no registry entry (§5 case 3). The capacity=1-era "0" override in .drone.yml is GONE; do not reintroduce it (it made a starting build reap in-flight runs). |
RECIPE_LOCK_DIR |
lifecycle.py constant |
/run/lock |
Where per-recipe lock files live. |
ACTIVE_RUN_DIR |
lifecycle.py constant |
/run/cc-ci-active |
Where active-run pidfiles live. |
Memory budget rationale for capacity=2 (Hetzner cpx22, ~7.6 GiB): a full immich stack measured
~1 GiB; two concurrent recipes fit. Revert maxTests to "1" if OOM/disk-I/O contention appears.
3. Isolation model: what is shared, what is per-run
Per-run (no conflict possible):
- App + stack + volumes + secrets. The run app domain is deterministic and unique per
(recipe, pr, ref):
naming.app_domain()→<recipe[:4]>-<sha1(recipe|pr|ref)[:6]>.ci.commoninternet.net. Everything abra creates is namespaced by it. Run apps are recognised byRUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$; canonical/warm apps (e.g.warm-keycloak...) deliberately do NOT match, so the janitor never touches them. - Drone build workspace. The exec runner gives each build its own clone under
/var/lib/drone-runner/drone-<id>/— harness code and test files are per-build. - Run artifacts.
/var/lib/cc-ci-runs/<build-number>/.
Shared (the two hazards the mechanisms exist for):
~/.abra/recipes/<recipe>— ONE working tree per recipe (abra's own layout). The harnessfetch_recipe()rm -rf's + reclones it at run start, and the upgrade tiergit checkouts it mid-run for the chaos redeploy. Two same-recipe runs would corrupt each other's deploy tree (observed: immich builds 229/230 deployed a tree missing its config). → per-recipe flock (§4).HOME=/root— forced in.drone.ymlso abra finds its server config under/root/.abra. Safe given the above: app names are unique and same-recipe runs serialise, so no two builds touch the same recipe checkout or app env file.
4. Mechanism 1: per-recipe flock
Code: lifecycle.acquire_recipe_lock(recipe); taken in run_recipe_ci.main() before
fetch_recipe() (the first shared-tree mutation).
- Lock file:
/run/lock/cc-ci-recipe-<recipe>.lock, exclusivefcntl.flock. - Non-blocking attempt first; on
BlockingIOErrorit logs== recipe lock: another <recipe> run is in flight — waiting ... ==and blocks. The waiting run is visibly "stuck" in its drone log at that line — that is by design. - The open file object is returned and kept alive (
_recipe_lock = ... # noqa: F841) for the whole process lifetime. Release is implicit: the kernel drops a flock when the fd closes — including on crash or SIGKILL. There is no stale-lock failure mode and no unlock code path. - Scope: serialises only runs of the SAME recipe. Different recipes never contend.
5. Mechanism 2: active-run registry + three-way janitor
Why: every run starts with lifecycle.janitor() (called from both the cold path and the
warm/quick path in run_recipe_ci.py) to reap orphans left by crashed/SIGKILL'd runs (whose
finally: teardown never ran). Under capacity=2 "any run app that isn't mine" may be a LIVE
concurrent run — age alone can't tell. The registry makes ownership explicit.
Registry protocol (all in lifecycle.py):
register_run_app(domain)— writes/run/cc-ci-active/<domain>containing the harness pid. Called insidedeploy_app()before the app is created, so no window exists where a concurrent janitor can see the app without its registration.unregister_run_app(domain)— removes the pidfile. Called at the end ofteardown_app()(every exit path funnels through teardown) and by the janitor after reaping._run_owner_state(domain)— classifies the owner:- reads the pid; missing/garbled file →
"unknown" /proc/<pid>/cmdlinegone →"dead"- cmdline must contain
run_recipe_ci→"alive", else"dead"(pid-reuse guard: a recycled pid won't look like a harness run)
- reads the pid; missing/garbled file →
Janitor decision table (lifecycle.janitor()):
| Owner state | Meaning | Action |
|---|---|---|
alive |
live concurrent run | never reap (logs "is a live concurrent run — leaving it") |
dead |
crashed run's definite orphan | reap immediately (teardown_app(verify=False)) + unregister |
unknown |
pre-registry app, or post-reboot (/run is tmpfs) |
age fallback: reap only if stack age ≥ CCCI_JANITOR_MAX_AGE (default 2h) |
Candidate discovery is unchanged from before: abra app ls matches against RUN_APP_RE, plus a
docker-service sweep that reconstructs domains for stacks whose .env was already deleted.
6. Where convergence fits (adjacent, landed with this work)
Parallel runs surfaced two swarm-convergence bugs that look like concurrency bugs but aren't —
documented here because any restructuring must keep them fixed (services_converged() in
lifecycle.py):
- N/N replicas ≠ converged during a stop-first rolling update: the update is registered
instantly but the OLD task still shows 1/1 until swarm cycles it (build 238: backupbot exec'd a
pre-hook into a container killed seconds later → 409 → empty snapshot).
services_converged()therefore also inspects each service'sUpdateStatus.State. pausedpersists forever: swarm's defaultupdate-failure-action: pausesets it on one task flicker and it never clears, even at N/N healthy (build 241 hung 22 min). Onlyupdatingandrollback_startedblock convergence;paused/rollback_paused/completedare settled — the HTTP-health and tier assertions still gate actual app correctness.backup_app()additionally waits (bounded, 300s) forservices_converged()beforeabra app backup create, as defence in depth for the backupbot race.
7. Failure-mode guarantees
| Event | Outcome |
|---|---|
| Run crashes / SIGKILL mid-run | flock auto-released by kernel; pidfile remains but owner is dead → next janitor (any run's start) reaps app + pidfile |
| Drone build canceled via API | known gap: cancel kills the step's sh wrapper but can LEAK the python harness child — it keeps running (holding lock + registry) until killed by hand. See §8. |
| Host reboot | /run is tmpfs → locks and registry vanish (correct: no processes survived either). All surviving apps become unknown → 2h age fallback governs. |
Two same-recipe !testmes |
second blocks on the flock at run start (before touching the shared tree), runs after the first finishes |
| Janitor vs. app being created | impossible to mis-reap: registration happens before app creation, and an alive owner is never reaped |
| Pid reuse after crash | cmdline check (run_recipe_ci) classifies as dead, orphan still reaped |
8. Known limitations / restructuring candidates
- Drone cancel leaks the harness process. The exec runner kills the step shell, not the
process tree; the leaked python continues deploying/holding the lock. Fix ideas: run the step
under
setsid+ a trap that kills the process group, or have the harness watchDRONE_BUILD_STATUS/parent-death (PR_SET_PDEATHSIG). - Head-of-line blocking on same-recipe serialisation. A run waiting on the recipe flock
still occupies one of the 2 runner slots, so two builds of the SAME recipe temporarily starve
all other recipes. Alternatives: a Drone-level per-recipe fan-out (one pipeline
concurrencygroup per recipe is not natively expressible), or detect-and-requeue in the harness. - The lock protects harness runs only. Manual
abra/gitactivity on~/.abra/recipes/<recipe>(operator or another agent) bypasses the flock — the "park the checkout, then hands off" discipline is still required. A restructure could make the harness deploy from a per-run copy of the recipe tree instead of the shared checkout (eliminates the lock entirely, at the cost of diverging from abra's expected layout). HOME=/rootis still shared. Safe today by argument (§3), not by enforcement. Per-buildABRA_DIRwith a shared read-only server config would make isolation structural.- Registry is advisory. Nothing stops a non-harness actor from creating run-app-shaped stacks the janitor will eventually age-reap; conversely the janitor trusts pidfiles it can parse. Acceptable on a single-purpose CI host.
- Capacity is configured in two places (
drone-runner.nix+.drone.yml) that must be kept in step by hand.
9. File / symbol index
| What | Where |
|---|---|
maxTests / DRONE_RUNNER_CAPACITY |
nix/modules/drone-runner.nix |
concurrency.limit, HOME=/root, env |
.drone.yml (recipe-ci pipeline) |
| Lock + registry constants & helpers | runner/harness/lifecycle.py (top, after TeardownError) |
acquire_recipe_lock call site |
runner/run_recipe_ci.py main(), before fetch_recipe() |
register_run_app call site |
lifecycle.deploy_app() (before app creation) |
unregister_run_app call sites |
lifecycle.teardown_app(), lifecycle.janitor() |
| Janitor + decision table | lifecycle.janitor(), _run_owner_state() |
| Run-app naming | runner/harness/naming.py (app_domain), RUN_APP_RE in lifecycle.py |
| Convergence (adjacent) | lifecycle.services_converged(), lifecycle.backup_app() |