
docs: concurrency spec — how parallel recipe runs stay safe (for review/restructuring)

Documents the capacity=2 concurrent-run system as landed in c0df77d,
68ef0f8, e6d55b5: config knobs, isolation model, per-recipe flock,
active-run registry + three-way janitor, convergence interactions,
failure-mode guarantees, and known limitations / restructuring
candidates.

2026-06-10 03:05:20 +00:00


Concurrency: how parallel recipe CI runs stay safe

Spec of the concurrent-run system as of 2026-06-10 (commits c0df77d, 68ef0f8, e6d55b5). Written for review/restructuring: it documents the system as it currently is, including known limitations.

1. Goal and design summary

Two recipe CI builds may run at the same time on the single cc-ci host (e.g. immich and plausible under active development at once). Safety is enforced by the harness, not by serialising everything:

| Rule | Mechanism |
| --- | --- |
| Different recipes run in parallel | nothing blocks them (isolation, §3) |
| Same-recipe runs serialise | per-recipe `flock` (§4) |
| A starting run never reaps a live concurrent run's app | active-run registry + three-way janitor (§5) |
| A crashed run's leftovers still get reaped | registry owner-dead detection, age fallback (§5) |

There is no daemon and no shared state service: both mechanisms are kernel/file primitives under /run, scoped to the harness process lifetime. A SIGKILL'd run therefore cannot leave behind a lock or a liveness claim that is still honoured: the kernel drops its flock, and its leftover pidfile is classified as dead (§5).

2. Configuration knobs

| Knob | Where | Current | Meaning |
| --- | --- | --- | --- |
| `DRONE_RUNNER_CAPACITY` (aka `MAX_TESTS`) | `nix/modules/drone-runner.nix` (`maxTests` let-binding) | `2` | THE cap. Max builds the exec runner executes at once; Drone queues the rest in its native pending queue. Change requires `nixos-rebuild switch` on cc-ci. |
| `concurrency.limit` | `.drone.yml`, `recipe-ci` pipeline | `2` | Server-side cap on concurrent `recipe-ci` pipelines. Kept equal to capacity; redundant belt (the push pipeline shares runner capacity too, so lint builds can interleave). |
| `CCCI_JANITOR_MAX_AGE` | env, read in `lifecycle.janitor()` | unset → `7200`s | Age fallback only: applies solely to apps with no registry entry (§5 case 3). The capacity=1-era `"0"` override in `.drone.yml` is GONE; do not reintroduce it (it made a starting build reap in-flight runs). |
| `RECIPE_LOCK_DIR` | `lifecycle.py` constant | `/run/lock` | Where per-recipe lock files live. |
| `ACTIVE_RUN_DIR` | `lifecycle.py` constant | `/run/cc-ci-active` | Where active-run pidfiles live. |

Memory budget rationale for capacity=2 (Hetzner cpx22, ~7.6 GiB): a full immich stack measured ~1 GiB; two concurrent recipes fit. Revert maxTests to "1" if OOM/disk-I/O contention appears.

3. Isolation model: what is shared, what is per-run

Per-run (no conflict possible):

App + stack + volumes + secrets. The run app domain is deterministic and unique per (recipe, pr, ref): naming.app_domain() → <recipe[:4]>-<sha1(recipe|pr|ref)[:6]>.ci.commoninternet.net. Everything abra creates is namespaced by it. Run apps are recognised by RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$; canonical/warm apps (e.g. warm-keycloak...) deliberately do NOT match, so the janitor never touches them.
Drone build workspace. The exec runner gives each build its own clone under /var/lib/drone-runner/drone-<id>/ — harness code and test files are per-build.
Run artifacts. /var/lib/cc-ci-runs/<build-number>/.
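
The naming rule above can be sketched as follows. This is a minimal illustration, not the code in `naming.py`: it assumes the three fields are joined with a literal `|` and hashed as UTF-8, which the text implies but does not spell out, and `is_run_app` is a hypothetical helper name.

```python
import hashlib
import re

# Pattern quoted in the text above; canonical/warm apps must not match it.
RUN_APP_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$")


def app_domain(recipe: str, pr: str, ref: str) -> str:
    """Deterministic domain for one (recipe, pr, ref) triple.

    Assumption: fields are joined with "|" and hashed as UTF-8; the real
    naming.app_domain() may encode them differently.
    """
    digest = hashlib.sha1(f"{recipe}|{pr}|{ref}".encode("utf-8")).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}.ci.commoninternet.net"


def is_run_app(domain: str) -> bool:
    """True only for domains the janitor is allowed to consider (hypothetical helper)."""
    return RUN_APP_RE.match(domain) is not None
```

Because the domain is a pure function of its inputs, a re-run of the same (recipe, pr, ref) lands on the same app name, while the fixed-width hex suffix keeps anything without that shape outside the janitor's reach.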

Shared (the two hazards the mechanisms exist for):

~/.abra/recipes/<recipe>: ONE working tree per recipe (abra's own layout). The harness's fetch_recipe() deletes it (rm -rf) and re-clones it at run start, and the upgrade tier runs git checkout in it mid-run for the chaos redeploy. Two same-recipe runs would corrupt each other's deploy tree (observed: immich builds 229/230 deployed a tree missing its config). → per-recipe flock (§4).
HOME=/root — forced in .drone.yml so abra finds its server config under /root/.abra. Safe given the above: app names are unique and same-recipe runs serialise, so no two builds touch the same recipe checkout or app env file.

4. Mechanism 1: per-recipe flock

Code: lifecycle.acquire_recipe_lock(recipe); taken in run_recipe_ci.main() before fetch_recipe() (the first shared-tree mutation).

Lock file: /run/lock/cc-ci-recipe-<recipe>.lock, exclusive fcntl.flock.
Non-blocking attempt first; on BlockingIOError it logs == recipe lock: another <recipe> run is in flight — waiting ... == and blocks. The waiting run is visibly "stuck" in its drone log at that line — that is by design.
The open file object is returned and kept alive (_recipe_lock = ... # noqa: F841) for the whole process lifetime. Release is implicit: the kernel drops a flock once the last descriptor for that open file is closed, which includes process exit by crash or SIGKILL. There is no stale-lock failure mode and no unlock code path.
Scope: serialises only runs of the SAME recipe. Different recipes never contend.
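
A minimal sketch of the try-then-block pattern described above, assuming only the lock path layout and the `fcntl.flock` semantics given in this section. The `lock_dir` parameter is hypothetical (added so the sketch can run outside `/run/lock`), and the log wording is approximate rather than copied from `lifecycle.py`.

```python
import fcntl
import os
from typing import IO

RECIPE_LOCK_DIR = "/run/lock"  # value given in §2


def acquire_recipe_lock(recipe: str, lock_dir: str = RECIPE_LOCK_DIR) -> IO[str]:
    """Take the exclusive per-recipe lock, waiting if another run holds it.

    The caller must keep the returned file object referenced for the whole
    process lifetime; closing it (or exiting) is the only release path.
    """
    path = os.path.join(lock_dir, f"cc-ci-recipe-{recipe}.lock")
    lock_file = open(path, "w")
    try:
        # Fast path: succeed immediately if nobody holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Approximate log line; the real message is quoted in the text above.
        print(f"== recipe lock: another {recipe} run is in flight, waiting ==",
              flush=True)
        # Slow path: block until the current holder exits or closes its fd.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
    return lock_file
```

A caller would bind the result to a module-level name (as the `_recipe_lock = ...` assignment above does) so the file object is never garbage-collected while the run is in progress.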

5. Mechanism 2: active-run registry + three-way janitor

Why: every run starts with lifecycle.janitor() (called from both the cold path and the warm/quick path in run_recipe_ci.py) to reap orphans left by crashed/SIGKILL'd runs (whose finally: teardown never ran). Under capacity=2, "any run app that isn't mine" may be a LIVE concurrent run, and age alone cannot distinguish it from an orphan. The registry makes ownership explicit.

Registry protocol (all in lifecycle.py):

register_run_app(domain) — writes /run/cc-ci-active/<domain> containing the harness pid. Called inside deploy_app() before the app is created, so no window exists where a concurrent janitor can see the app without its registration.
unregister_run_app(domain) — removes the pidfile. Called at the end of teardown_app() (every exit path funnels through teardown) and by the janitor after reaping.
_run_owner_state(domain) — classifies the owner:
- reads the pid; missing/garbled file → "unknown"
- /proc/<pid>/cmdline gone → "dead"
- cmdline must contain run_recipe_ci → "alive", else "dead" (pid-reuse guard: a recycled pid won't look like a harness run)
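
The registry protocol above might look roughly like this. It is a sketch under stated assumptions, not the code in `lifecycle.py`: the `run_dir` and `proc_root` parameters are hypothetical (added so the logic can be exercised against a fake `/proc`), and the owner-state function is named without its leading underscore.

```python
import os

ACTIVE_RUN_DIR = "/run/cc-ci-active"  # value given in §2
HARNESS_MARKER = "run_recipe_ci"      # substring expected in a harness cmdline


def register_run_app(domain: str, run_dir: str = ACTIVE_RUN_DIR) -> None:
    """Record this process as the owner of `domain` before the app exists."""
    os.makedirs(run_dir, exist_ok=True)
    with open(os.path.join(run_dir, domain), "w") as fh:
        fh.write(str(os.getpid()))


def unregister_run_app(domain: str, run_dir: str = ACTIVE_RUN_DIR) -> None:
    """Drop the claim; a missing pidfile is not an error."""
    try:
        os.remove(os.path.join(run_dir, domain))
    except FileNotFoundError:
        pass


def run_owner_state(domain: str, run_dir: str = ACTIVE_RUN_DIR,
                    proc_root: str = "/proc") -> str:
    """Classify the registered owner as "alive", "dead" or "unknown"."""
    try:
        with open(os.path.join(run_dir, domain)) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return "unknown"  # no pidfile, or contents that are not a pid
    try:
        with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as fh:
            cmdline = fh.read()
    except OSError:
        return "dead"  # the process no longer exists
    # Pid-reuse guard: only a harness process counts as a live owner.
    return "alive" if HARNESS_MARKER.encode() in cmdline else "dead"
```

Note that the guard only distinguishes harness from non-harness processes: if a recycled pid happened to belong to another harness run, this sketch would report "alive" until that run exits.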

Janitor decision table (lifecycle.janitor()):

| Owner state | Meaning | Action |
| --- | --- | --- |
| `alive` | live concurrent run | never reap (logs "is a live concurrent run — leaving it") |
| `dead` | crashed run's definite orphan | reap immediately (`teardown_app(verify=False)`) + unregister |
| `unknown` | pre-registry app, or post-reboot (`/run` is tmpfs) | age fallback: reap only if stack age ≥ `CCCI_JANITOR_MAX_AGE` (default 2h) |
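
The decision table reduces to a small pure function. The sketch below assumes nothing beyond the three states and the `CCCI_JANITOR_MAX_AGE` default stated in §2; `should_reap` and `janitor_max_age` are hypothetical names, and the real `lifecycle.janitor()` also performs the teardown and unregistration that this omits.

```python
import os

DEFAULT_MAX_AGE_S = 7200  # default stated in §2 (2h)


def janitor_max_age(environ=os.environ) -> int:
    """Read the age fallback in seconds; unset means the 2h default."""
    return int(environ.get("CCCI_JANITOR_MAX_AGE", DEFAULT_MAX_AGE_S))


def should_reap(owner_state: str, stack_age_s: float, max_age_s: int) -> bool:
    """Apply the three-way decision table to one candidate run app."""
    if owner_state == "alive":
        return False  # live concurrent run: never reap, whatever its age
    if owner_state == "dead":
        return True   # definite orphan: reap regardless of age
    # "unknown": no usable registry entry, so fall back to stack age.
    return stack_age_s >= max_age_s
```

The sketch also shows why the old `"0"` override was harmful: with `max_age_s == 0`, every `unknown` app is reaped immediately, and before the registry existed every app was effectively `unknown`.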

Candidate discovery is unchanged from before: abra app ls matches against RUN_APP_RE, plus a docker-service sweep that reconstructs domains for stacks whose .env was already deleted.

6. Where convergence fits (adjacent, landed with this work)

Parallel runs surfaced two swarm-convergence bugs that look like concurrency bugs but aren't — documented here because any restructuring must keep them fixed (services_converged() in lifecycle.py):

N/N replicas ≠ converged during a stop-first rolling update: the update is registered instantly but the OLD task still shows 1/1 until swarm cycles it (build 238: backupbot exec'd a pre-hook into a container killed seconds later → 409 → empty snapshot). services_converged() therefore also inspects each service's UpdateStatus.State.
paused persists forever: swarm's default update-failure-action: pause sets it on one task flicker and it never clears, even at N/N healthy (build 241 hung 22 min). Only updating and rollback_started block convergence; paused/rollback_paused/completed are settled — the HTTP-health and tier assertions still gate actual app correctness.
backup_app() additionally waits (bounded, 300s) for services_converged() before abra app backup create, as defence in depth for the backupbot race.
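
The convergence rule and the bounded wait can be sketched as below. This is an illustration only: it assumes each service has already been reduced to a dict with hypothetical keys `running`, `desired` and `update_state` (the real `services_converged()` inspects swarm directly), and the 5-second poll interval is an assumption; only the 300s bound comes from the text.

```python
import time
from typing import Callable, Iterable, Mapping, Optional

# Update states that mean swarm is still cycling tasks (per the text above).
BLOCKING_UPDATE_STATES = {"updating", "rollback_started"}


def service_converged(running: int, desired: int,
                      update_state: Optional[str]) -> bool:
    """N/N replicas AND no in-progress update.

    paused, rollback_paused and completed are treated as settled, as is a
    service that has never been updated (update_state is None).
    """
    if update_state in BLOCKING_UPDATE_STATES:
        return False
    return running == desired


def services_converged(services: Iterable[Mapping]) -> bool:
    """True when every service in the stack has settled."""
    return all(
        service_converged(s["running"], s["desired"], s.get("update_state"))
        for s in services
    )


def wait_converged(poll: Callable[[], Iterable[Mapping]],
                   timeout_s: float = 300.0,
                   interval_s: float = 5.0) -> bool:
    """Poll until converged or until the bound expires; never waits forever."""
    deadline = time.monotonic() + timeout_s
    while True:
        if services_converged(poll()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Checking the update state before the replica count is what closes the build 238 race: a service showing 1/1 while still `updating` is reported as not converged.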

7. Failure-mode guarantees

| Event | Outcome |
| --- | --- |
| Run crashes / SIGKILL mid-run | flock auto-released by kernel; pidfile remains but owner is `dead` → next janitor (any run's start) reaps app + pidfile |
| Drone build canceled via API | known gap: cancel kills the step's `sh` wrapper but can LEAK the python harness child; it keeps running (holding lock + registry) until killed by hand. See §8. |
| Host reboot | `/run` is tmpfs → locks and registry vanish (correct: no processes survived either). All surviving apps become `unknown` → 2h age fallback governs. |
| Two same-recipe `!testme`s | second blocks on the flock at run start (before touching the shared tree), runs after the first finishes |
| Janitor vs. app being created | impossible to mis-reap: registration happens before app creation, and an `alive` owner is never reaped |
| Pid reuse after crash | cmdline check (`run_recipe_ci`) classifies as `dead`, orphan still reaped |
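
The first guarantee (the kernel releases the flock on SIGKILL) can be checked in isolation with a small experiment. The sketch below is not part of the harness; `try_lock` and `demo_sigkill_releases_lock` are hypothetical names, and the child is a stand-in for a harness run that holds the lock and is then killed.

```python
import fcntl
import signal
import subprocess
import sys

# Stand-in for a harness run: take the lock, report it, then hang.
_CHILD = (
    "import fcntl, sys, time\n"
    "f = open(sys.argv[1], 'w')\n"
    "fcntl.flock(f, fcntl.LOCK_EX)\n"
    "print('locked', flush=True)\n"
    "time.sleep(60)\n"
)


def try_lock(path: str) -> bool:
    """Return True if an exclusive flock on `path` is free right now."""
    with open(path, "w") as fh:
        try:
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        return True  # the probe lock is dropped again when `fh` closes


def demo_sigkill_releases_lock(path: str) -> tuple:
    """Hold the lock in a child, SIGKILL it, and probe before and after."""
    child = subprocess.Popen([sys.executable, "-c", _CHILD, path],
                             stdout=subprocess.PIPE, text=True)
    try:
        child.stdout.readline()  # wait until the child holds the lock
        free_while_alive = try_lock(path)
        child.send_signal(signal.SIGKILL)
        child.wait()
        free_after_kill = try_lock(path)
    finally:
        if child.poll() is None:
            child.kill()
            child.wait()
        child.stdout.close()
    return free_while_alive, free_after_kill
```

The lock is expected to be busy while the child lives and free once it has been killed, with no cleanup code having run in the child.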

8. Known limitations / restructuring candidates

Drone cancel leaks the harness process. The exec runner kills the step shell, not the process tree; the leaked python continues deploying/holding the lock. Fix ideas: run the step under setsid + a trap that kills the process group, or have the harness watch DRONE_BUILD_STATUS/parent-death (PR_SET_PDEATHSIG).
Head-of-line blocking on same-recipe serialisation. A run waiting on the recipe flock still occupies one of the 2 runner slots, so two builds of the SAME recipe temporarily starve all other recipes. Alternatives: a Drone-level per-recipe fan-out (one pipeline concurrency group per recipe is not natively expressible), or detect-and-requeue in the harness.
The lock protects harness runs only. Manual abra/git activity on ~/.abra/recipes/<recipe> (operator or another agent) bypasses the flock — the "park the checkout, then hands off" discipline is still required. A restructure could make the harness deploy from a per-run copy of the recipe tree instead of the shared checkout (eliminates the lock entirely, at the cost of diverging from abra's expected layout).
HOME=/root is still shared. Safe today by argument (§3), not by enforcement. Per-build ABRA_DIR with a shared read-only server config would make isolation structural.
Registry is advisory. Nothing stops a non-harness actor from creating run-app-shaped stacks the janitor will eventually age-reap; conversely the janitor trusts pidfiles it can parse. Acceptable on a single-purpose CI host.
Capacity is configured in two places (drone-runner.nix + .drone.yml) that must be kept in step by hand.
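
For the first item, the PR_SET_PDEATHSIG idea could be prototyped from inside the harness roughly as follows. This is a hedged sketch, not existing harness code: `set_parent_death_signal` is a hypothetical helper, it calls the Linux `prctl(2)` syscall through `ctypes`, and whether it fully closes the cancel leak depends on the harness's direct parent being the step shell that Drone kills.

```python
import ctypes
import os
import signal
import sys

PR_SET_PDEATHSIG = 1  # option number from <linux/prctl.h>


def set_parent_death_signal(sig: int = signal.SIGTERM) -> bool:
    """Ask the kernel to send `sig` to this process when its parent dies.

    Linux-only: returns False on other platforms. Passing 0 clears the
    setting. Raises OSError if the kernel rejects the request.
    """
    if not sys.platform.startswith("linux"):
        return False
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(sig))) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return True
```

Two caveats apply to this approach. The request only covers a parent death that happens after the call, so a harness would call it first thing in `main()` and then compare `os.getppid()` with the expected parent. The default action for SIGTERM also skips `finally:` blocks unless a handler converts it into an exception, so teardown would still rely on the next run's janitor.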

9. File / symbol index

| What | Where |
| --- | --- |
| `maxTests` / `DRONE_RUNNER_CAPACITY` | `nix/modules/drone-runner.nix` |
| `concurrency.limit`, `HOME=/root`, env | `.drone.yml` (`recipe-ci` pipeline) |
| Lock + registry constants & helpers | `runner/harness/lifecycle.py` (top, after `TeardownError`) |
| `acquire_recipe_lock` call site | `runner/run_recipe_ci.py` `main()`, before `fetch_recipe()` |
| `register_run_app` call site | `lifecycle.deploy_app()` (before app creation) |
| `unregister_run_app` call sites | `lifecycle.teardown_app()`, `lifecycle.janitor()` |
| Janitor + decision table | `lifecycle.janitor()`, `_run_owner_state()` |
| Run-app naming | `runner/harness/naming.py` (`app_domain`), `RUN_APP_RE` in `lifecycle.py` |
| Convergence (adjacent) | `lifecycle.services_converged()`, `lifecycle.backup_app()` |
