Files
cc-ci/docs/concurrency.md
autonomic-bot 5b65c6caa3
All checks were successful
continuous-integration/drone/push Build is passing
docs: concurrency spec — how parallel recipe runs stay safe (for review/restructuring)
Documents the capacity=2 concurrent-run system as landed in c0df77d,
68ef0f8, e6d55b5: config knobs, isolation model, per-recipe flock,
active-run registry + three-way janitor, convergence interactions,
failure-mode guarantees, and known limitations / restructuring
candidates.
2026-06-10 03:05:20 +00:00

11 KiB

Concurrency: how parallel recipe CI runs stay safe

Spec of the concurrent-run system as of 2026-06-10 (commits c0df77d, 68ef0f8, e6d55b5). Written for review/restructuring — it documents what IS, including known limitations.

1. Goal and design summary

Two recipe CI builds may run at the same time on the single cc-ci host (e.g. immich and plausible under active development at once). Safety is enforced by the harness, not by serialising everything:

Rule Mechanism
Different recipes run in parallel nothing blocks them (isolation, §3)
Same-recipe runs serialise per-recipe flock (§4)
A starting run never reaps a live concurrent run's app active-run registry + three-way janitor (§5)
A crashed run's leftovers still get reaped registry owner-dead detection, age fallback (§5)

There is no daemon and no shared state service: both mechanisms are kernel/file primitives under /run, scoped to the harness process lifetime, so a SIGKILL'd run can never leak a stale lock or a stale "I'm alive" claim.

2. Configuration knobs

Knob Where Current Meaning
DRONE_RUNNER_CAPACITY (aka MAX_TESTS) nix/modules/drone-runner.nix (maxTests let-binding) 2 THE cap. Max builds the exec runner executes at once; Drone queues the rest in its native pending queue. Change requires nixos-rebuild switch on cc-ci.
concurrency.limit .drone.yml, recipe-ci pipeline 2 Server-side cap on concurrent recipe-ci pipelines. Kept equal to capacity; redundant belt (the push pipeline shares runner capacity too, so lint builds can interleave).
CCCI_JANITOR_MAX_AGE env, read in lifecycle.janitor() unset → 7200s Age fallback only — applies solely to apps with no registry entry (§5 case 3). The capacity=1-era "0" override in .drone.yml is GONE; do not reintroduce it (it made a starting build reap in-flight runs).
RECIPE_LOCK_DIR lifecycle.py constant /run/lock Where per-recipe lock files live.
ACTIVE_RUN_DIR lifecycle.py constant /run/cc-ci-active Where active-run pidfiles live.

Memory budget rationale for capacity=2 (Hetzner cpx22, ~7.6 GiB): a full immich stack measured ~1 GiB; two concurrent recipes fit. Revert maxTests to "1" if OOM/disk-I/O contention appears.

3. Isolation model: what is shared, what is per-run

Per-run (no conflict possible):

  • App + stack + volumes + secrets. The run app domain is deterministic and unique per (recipe, pr, ref): naming.app_domain()<recipe[:4]>-<sha1(recipe|pr|ref)[:6]>.ci.commoninternet.net. Everything abra creates is namespaced by it. Run apps are recognised by RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$; canonical/warm apps (e.g. warm-keycloak...) deliberately do NOT match, so the janitor never touches them.
  • Drone build workspace. The exec runner gives each build its own clone under /var/lib/drone-runner/drone-<id>/ — harness code and test files are per-build.
  • Run artifacts. /var/lib/cc-ci-runs/<build-number>/.

Shared (the two hazards the mechanisms exist for):

  • ~/.abra/recipes/<recipe> — ONE working tree per recipe (abra's own layout). The harness fetch_recipe() rm -rf's + reclones it at run start, and the upgrade tier git checkouts it mid-run for the chaos redeploy. Two same-recipe runs would corrupt each other's deploy tree (observed: immich builds 229/230 deployed a tree missing its config). → per-recipe flock (§4).
  • HOME=/root — forced in .drone.yml so abra finds its server config under /root/.abra. Safe given the above: app names are unique and same-recipe runs serialise, so no two builds touch the same recipe checkout or app env file.

4. Mechanism 1: per-recipe flock

Code: lifecycle.acquire_recipe_lock(recipe); taken in run_recipe_ci.main() before fetch_recipe() (the first shared-tree mutation).

  • Lock file: /run/lock/cc-ci-recipe-<recipe>.lock, exclusive fcntl.flock.
  • Non-blocking attempt first; on BlockingIOError it logs == recipe lock: another <recipe> run is in flight — waiting ... == and blocks. The waiting run is visibly "stuck" in its drone log at that line — that is by design.
  • The open file object is returned and kept alive (_recipe_lock = ... # noqa: F841) for the whole process lifetime. Release is implicit: the kernel drops a flock when the fd closes — including on crash or SIGKILL. There is no stale-lock failure mode and no unlock code path.
  • Scope: serialises only runs of the SAME recipe. Different recipes never contend.

5. Mechanism 2: active-run registry + three-way janitor

Why: every run starts with lifecycle.janitor() (called from both the cold path and the warm/quick path in run_recipe_ci.py) to reap orphans left by crashed/SIGKILL'd runs (whose finally: teardown never ran). Under capacity=2 "any run app that isn't mine" may be a LIVE concurrent run — age alone can't tell. The registry makes ownership explicit.

Registry protocol (all in lifecycle.py):

  1. register_run_app(domain) — writes /run/cc-ci-active/<domain> containing the harness pid. Called inside deploy_app() before the app is created, so no window exists where a concurrent janitor can see the app without its registration.
  2. unregister_run_app(domain) — removes the pidfile. Called at the end of teardown_app() (every exit path funnels through teardown) and by the janitor after reaping.
  3. _run_owner_state(domain) — classifies the owner:
    • reads the pid; missing/garbled file → "unknown"
    • /proc/<pid>/cmdline gone → "dead"
    • cmdline must contain run_recipe_ci"alive", else "dead" (pid-reuse guard: a recycled pid won't look like a harness run)

Janitor decision table (lifecycle.janitor()):

Owner state Meaning Action
alive live concurrent run never reap (logs "is a live concurrent run — leaving it")
dead crashed run's definite orphan reap immediately (teardown_app(verify=False)) + unregister
unknown pre-registry app, or post-reboot (/run is tmpfs) age fallback: reap only if stack age ≥ CCCI_JANITOR_MAX_AGE (default 2h)

Candidate discovery is unchanged from before: abra app ls matches against RUN_APP_RE, plus a docker-service sweep that reconstructs domains for stacks whose .env was already deleted.

6. Where convergence fits (adjacent, landed with this work)

Parallel runs surfaced two swarm-convergence bugs that look like concurrency bugs but aren't — documented here because any restructuring must keep them fixed (services_converged() in lifecycle.py):

  • N/N replicas ≠ converged during a stop-first rolling update: the update is registered instantly but the OLD task still shows 1/1 until swarm cycles it (build 238: backupbot exec'd a pre-hook into a container killed seconds later → 409 → empty snapshot). services_converged() therefore also inspects each service's UpdateStatus.State.
  • paused persists forever: swarm's default update-failure-action: pause sets it on one task flicker and it never clears, even at N/N healthy (build 241 hung 22 min). Only updating and rollback_started block convergence; paused/rollback_paused/completed are settled — the HTTP-health and tier assertions still gate actual app correctness.
  • backup_app() additionally waits (bounded, 300s) for services_converged() before abra app backup create, as defence in depth for the backupbot race.

7. Failure-mode guarantees

Event Outcome
Run crashes / SIGKILL mid-run flock auto-released by kernel; pidfile remains but owner is dead → next janitor (any run's start) reaps app + pidfile
Drone build canceled via API known gap: cancel kills the step's sh wrapper but can LEAK the python harness child — it keeps running (holding lock + registry) until killed by hand. See §8.
Host reboot /run is tmpfs → locks and registry vanish (correct: no processes survived either). All surviving apps become unknown → 2h age fallback governs.
Two same-recipe !testmes second blocks on the flock at run start (before touching the shared tree), runs after the first finishes
Janitor vs. app being created impossible to mis-reap: registration happens before app creation, and an alive owner is never reaped
Pid reuse after crash cmdline check (run_recipe_ci) classifies as dead, orphan still reaped

8. Known limitations / restructuring candidates

  1. Drone cancel leaks the harness process. The exec runner kills the step shell, not the process tree; the leaked python continues deploying/holding the lock. Fix ideas: run the step under setsid + a trap that kills the process group, or have the harness watch DRONE_BUILD_STATUS/parent-death (PR_SET_PDEATHSIG).
  2. Head-of-line blocking on same-recipe serialisation. A run waiting on the recipe flock still occupies one of the 2 runner slots, so two builds of the SAME recipe temporarily starve all other recipes. Alternatives: a Drone-level per-recipe fan-out (one pipeline concurrency group per recipe is not natively expressible), or detect-and-requeue in the harness.
  3. The lock protects harness runs only. Manual abra/git activity on ~/.abra/recipes/<recipe> (operator or another agent) bypasses the flock — the "park the checkout, then hands off" discipline is still required. A restructure could make the harness deploy from a per-run copy of the recipe tree instead of the shared checkout (eliminates the lock entirely, at the cost of diverging from abra's expected layout).
  4. HOME=/root is still shared. Safe today by argument (§3), not by enforcement. Per-build ABRA_DIR with a shared read-only server config would make isolation structural.
  5. Registry is advisory. Nothing stops a non-harness actor from creating run-app-shaped stacks the janitor will eventually age-reap; conversely the janitor trusts pidfiles it can parse. Acceptable on a single-purpose CI host.
  6. Capacity is configured in two places (drone-runner.nix + .drone.yml) that must be kept in step by hand.

9. File / symbol index

What Where
maxTests / DRONE_RUNNER_CAPACITY nix/modules/drone-runner.nix
concurrency.limit, HOME=/root, env .drone.yml (recipe-ci pipeline)
Lock + registry constants & helpers runner/harness/lifecycle.py (top, after TeardownError)
acquire_recipe_lock call site runner/run_recipe_ci.py main(), before fetch_recipe()
register_run_app call site lifecycle.deploy_app() (before app creation)
unregister_run_app call sites lifecycle.teardown_app(), lifecycle.janitor()
Janitor + decision table lifecycle.janitor(), _run_owner_state()
Run-app naming runner/harness/naming.py (app_domain), RUN_APP_RE in lifecycle.py
Convergence (adjacent) lifecycle.services_converged(), lifecycle.backup_app()