docs: concurrency spec — how parallel recipe runs stay safe (for review/restructuring)
All checks were successful
continuous-integration/drone/push Build is passing
All checks were successful
continuous-integration/drone/push Build is passing
Documents the capacity=2 concurrent-run system as landed inc0df77d,68ef0f8,e6d55b5: config knobs, isolation model, per-recipe flock, active-run registry + three-way janitor, convergence interactions, failure-mode guarantees, and known limitations / restructuring candidates.
This commit is contained in:
167
docs/concurrency.md
Normal file
167
docs/concurrency.md
Normal file
@ -0,0 +1,167 @@
|
||||
# Concurrency: how parallel recipe CI runs stay safe
|
||||
|
||||
Spec of the concurrent-run system as of 2026-06-10 (commits `c0df77d`, `68ef0f8`, `e6d55b5`).
|
||||
Written for review/restructuring — it documents what IS, including known limitations.
|
||||
|
||||
## 1. Goal and design summary
|
||||
|
||||
Two recipe CI builds may run **at the same time** on the single cc-ci host (e.g. immich and
|
||||
plausible under active development at once). Safety is enforced by the **harness**, not by
|
||||
serialising everything:
|
||||
|
||||
| Rule | Mechanism |
|
||||
|---|---|
|
||||
| Different recipes run in parallel | nothing blocks them (isolation, §3) |
|
||||
| Same-recipe runs serialise | per-recipe `flock` (§4) |
|
||||
| A starting run never reaps a live concurrent run's app | active-run registry + three-way janitor (§5) |
|
||||
| A crashed run's leftovers still get reaped | registry owner-dead detection, age fallback (§5) |
|
||||
|
||||
There is **no daemon and no shared state service**: both mechanisms are kernel/file primitives
|
||||
under `/run`, scoped to the harness process lifetime, so a `SIGKILL`'d run can never leak a stale
|
||||
lock or a stale "I'm alive" claim.
|
||||
|
||||
## 2. Configuration knobs
|
||||
|
||||
| Knob | Where | Current | Meaning |
|
||||
|---|---|---|---|
|
||||
| `DRONE_RUNNER_CAPACITY` (aka `MAX_TESTS`) | `nix/modules/drone-runner.nix` (`maxTests` let-binding) | `2` | **THE cap.** Max builds the exec runner executes at once; Drone queues the rest in its native pending queue. Change requires `nixos-rebuild switch` on cc-ci. |
|
||||
| `concurrency.limit` | `.drone.yml`, `recipe-ci` pipeline | `2` | Server-side cap on concurrent `recipe-ci` pipelines. Kept equal to capacity; redundant belt (the push pipeline shares runner capacity too, so lint builds can interleave). |
|
||||
| `CCCI_JANITOR_MAX_AGE` | env, read in `lifecycle.janitor()` | unset → `7200`s | **Age fallback only** — applies solely to apps with no registry entry (§5 case 3). The capacity=1-era `"0"` override in `.drone.yml` is GONE; do not reintroduce it (it made a starting build reap in-flight runs). |
|
||||
| `RECIPE_LOCK_DIR` | `lifecycle.py` constant | `/run/lock` | Where per-recipe lock files live. |
|
||||
| `ACTIVE_RUN_DIR` | `lifecycle.py` constant | `/run/cc-ci-active` | Where active-run pidfiles live. |
|
||||
|
||||
Memory budget rationale for capacity=2 (Hetzner cpx22, ~7.6 GiB): a full immich stack measured
|
||||
~1 GiB; two concurrent recipes fit. Revert `maxTests` to `"1"` if OOM/disk-I/O contention appears.
|
||||
|
||||
## 3. Isolation model: what is shared, what is per-run
|
||||
|
||||
Per-run (no conflict possible):
|
||||
|
||||
- **App + stack + volumes + secrets.** The run app domain is deterministic and unique per
|
||||
(recipe, pr, ref): `naming.app_domain()` → `<recipe[:4]>-<sha1(recipe|pr|ref)[:6]>.ci.commoninternet.net`.
|
||||
Everything abra creates is namespaced by it. Run apps are recognised by
|
||||
`RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$`; canonical/warm apps
|
||||
(e.g. `warm-keycloak...`) deliberately do NOT match, so the janitor never touches them.
|
||||
- **Drone build workspace.** The exec runner gives each build its own clone under
|
||||
`/var/lib/drone-runner/drone-<id>/` — harness code and test files are per-build.
|
||||
- **Run artifacts.** `/var/lib/cc-ci-runs/<build-number>/`.
|
||||
|
||||
Shared (the two hazards the mechanisms exist for):
|
||||
|
||||
- **`~/.abra/recipes/<recipe>`** — ONE working tree per recipe (abra's own layout). The harness
|
||||
`fetch_recipe()` `rm -rf`'s + reclones it at run start, and the upgrade tier `git checkout`s it
|
||||
mid-run for the chaos redeploy. Two same-recipe runs would corrupt each other's deploy tree
|
||||
(observed: immich builds 229/230 deployed a tree missing its config). → per-recipe flock (§4).
|
||||
- **`HOME=/root`** — forced in `.drone.yml` so abra finds its server config under `/root/.abra`.
|
||||
Safe *given* the above: app names are unique and same-recipe runs serialise, so no two builds
|
||||
touch the same recipe checkout or app env file.
|
||||
|
||||
## 4. Mechanism 1: per-recipe flock
|
||||
|
||||
Code: `lifecycle.acquire_recipe_lock(recipe)`; taken in `run_recipe_ci.main()` **before**
|
||||
`fetch_recipe()` (the first shared-tree mutation).
|
||||
|
||||
- Lock file: `/run/lock/cc-ci-recipe-<recipe>.lock`, exclusive `fcntl.flock`.
|
||||
- Non-blocking attempt first; on `BlockingIOError` it logs
|
||||
`== recipe lock: another <recipe> run is in flight — waiting ... ==` and blocks. The waiting
|
||||
run is visibly "stuck" in its drone log at that line — that is by design.
|
||||
- The open file object is returned and kept alive (`_recipe_lock = ... # noqa: F841`) for the
|
||||
**whole process lifetime**. Release is implicit: the kernel drops a flock when the fd closes —
|
||||
including on crash or SIGKILL. **There is no stale-lock failure mode and no unlock code path.**
|
||||
- Scope: serialises only runs of the SAME recipe. Different recipes never contend.
|
||||
|
||||
## 5. Mechanism 2: active-run registry + three-way janitor
|
||||
|
||||
Why: every run starts with `lifecycle.janitor()` (called from both the cold path and the
|
||||
warm/quick path in `run_recipe_ci.py`) to reap orphans left by crashed/SIGKILL'd runs (whose
|
||||
`finally:` teardown never ran). Under capacity=2 "any run app that isn't mine" may be a LIVE
|
||||
concurrent run — age alone can't tell. The registry makes ownership explicit.
|
||||
|
||||
Registry protocol (all in `lifecycle.py`):
|
||||
|
||||
1. `register_run_app(domain)` — writes `/run/cc-ci-active/<domain>` containing the harness pid.
|
||||
Called inside `deploy_app()` **before** the app is created, so no window exists where a
|
||||
concurrent janitor can see the app without its registration.
|
||||
2. `unregister_run_app(domain)` — removes the pidfile. Called at the end of `teardown_app()`
|
||||
(every exit path funnels through teardown) and by the janitor after reaping.
|
||||
3. `_run_owner_state(domain)` — classifies the owner:
|
||||
- reads the pid; missing/garbled file → `"unknown"`
|
||||
- `/proc/<pid>/cmdline` gone → `"dead"`
|
||||
- cmdline must contain `run_recipe_ci` → `"alive"`, else `"dead"` (**pid-reuse guard**: a
|
||||
recycled pid won't look like a harness run)
|
||||
|
||||
Janitor decision table (`lifecycle.janitor()`):
|
||||
|
||||
| Owner state | Meaning | Action |
|
||||
|---|---|---|
|
||||
| `alive` | live concurrent run | **never reap** (logs "is a live concurrent run — leaving it") |
|
||||
| `dead` | crashed run's definite orphan | reap immediately (`teardown_app(verify=False)`) + unregister |
|
||||
| `unknown` | pre-registry app, or post-reboot (`/run` is tmpfs) | age fallback: reap only if stack age ≥ `CCCI_JANITOR_MAX_AGE` (default 2h) |
|
||||
|
||||
Candidate discovery is unchanged from before: `abra app ls` matches against `RUN_APP_RE`, plus a
|
||||
docker-service sweep that reconstructs domains for stacks whose `.env` was already deleted.
|
||||
|
||||
## 6. Where convergence fits (adjacent, landed with this work)
|
||||
|
||||
Parallel runs surfaced two swarm-convergence bugs that look like concurrency bugs but aren't —
|
||||
documented here because any restructuring must keep them fixed (`services_converged()` in
|
||||
`lifecycle.py`):
|
||||
|
||||
- **N/N replicas ≠ converged** during a stop-first rolling update: the update is *registered*
|
||||
instantly but the OLD task still shows 1/1 until swarm cycles it (build 238: backupbot exec'd a
|
||||
pre-hook into a container killed seconds later → 409 → empty snapshot). `services_converged()`
|
||||
therefore also inspects each service's `UpdateStatus.State`.
|
||||
- **`paused` persists forever**: swarm's default `update-failure-action: pause` sets it on one
|
||||
task flicker and it never clears, even at N/N healthy (build 241 hung 22 min). Only `updating`
|
||||
and `rollback_started` block convergence; `paused`/`rollback_paused`/`completed` are settled —
|
||||
the HTTP-health and tier assertions still gate actual app correctness.
|
||||
- `backup_app()` additionally waits (bounded, 300s) for `services_converged()` before
|
||||
`abra app backup create`, as defence in depth for the backupbot race.
|
||||
|
||||
## 7. Failure-mode guarantees
|
||||
|
||||
| Event | Outcome |
|
||||
|---|---|
|
||||
| Run crashes / SIGKILL mid-run | flock auto-released by kernel; pidfile remains but owner is `dead` → next janitor (any run's start) reaps app + pidfile |
|
||||
| Drone build canceled via API | **known gap**: cancel kills the step's `sh` wrapper but can LEAK the python harness child — it keeps running (holding lock + registry) until killed by hand. See §8. |
|
||||
| Host reboot | `/run` is tmpfs → locks and registry vanish (correct: no processes survived either). All surviving apps become `unknown` → 2h age fallback governs. |
|
||||
| Two same-recipe `!testme`s | second blocks on the flock at run start (before touching the shared tree), runs after the first finishes |
|
||||
| Janitor vs. app being created | impossible to mis-reap: registration happens before app creation, and an `alive` owner is never reaped |
|
||||
| Pid reuse after crash | cmdline check (`run_recipe_ci`) classifies as `dead`, orphan still reaped |
|
||||
|
||||
## 8. Known limitations / restructuring candidates
|
||||
|
||||
1. **Drone cancel leaks the harness process.** The exec runner kills the step shell, not the
|
||||
process tree; the leaked python continues deploying/holding the lock. Fix ideas: run the step
|
||||
under `setsid` + a trap that kills the process group, or have the harness watch
|
||||
`DRONE_BUILD_STATUS`/parent-death (`PR_SET_PDEATHSIG`).
|
||||
2. **Head-of-line blocking on same-recipe serialisation.** A run waiting on the recipe flock
|
||||
still occupies one of the 2 runner slots, so two builds of the SAME recipe temporarily starve
|
||||
all other recipes. Alternatives: a Drone-level per-recipe fan-out (one pipeline `concurrency`
|
||||
group per recipe is not natively expressible), or detect-and-requeue in the harness.
|
||||
3. **The lock protects harness runs only.** Manual `abra`/`git` activity on
|
||||
`~/.abra/recipes/<recipe>` (operator or another agent) bypasses the flock — the
|
||||
"park the checkout, then hands off" discipline is still required. A restructure could make the
|
||||
harness deploy from a per-run copy of the recipe tree instead of the shared checkout
|
||||
(eliminates the lock entirely, at the cost of diverging from abra's expected layout).
|
||||
4. **`HOME=/root` is still shared.** Safe today by argument (§3), not by enforcement. Per-build
|
||||
`ABRA_DIR` with a shared read-only server config would make isolation structural.
|
||||
5. **Registry is advisory.** Nothing stops a non-harness actor from creating run-app-shaped
|
||||
stacks the janitor will eventually age-reap; conversely the janitor trusts pidfiles it can
|
||||
parse. Acceptable on a single-purpose CI host.
|
||||
6. **Capacity is configured in two places** (`drone-runner.nix` + `.drone.yml`) that must be
|
||||
kept in step by hand.
|
||||
|
||||
## 9. File / symbol index
|
||||
|
||||
| What | Where |
|
||||
|---|---|
|
||||
| `maxTests` / `DRONE_RUNNER_CAPACITY` | `nix/modules/drone-runner.nix` |
|
||||
| `concurrency.limit`, `HOME=/root`, env | `.drone.yml` (`recipe-ci` pipeline) |
|
||||
| Lock + registry constants & helpers | `runner/harness/lifecycle.py` (top, after `TeardownError`) |
|
||||
| `acquire_recipe_lock` call site | `runner/run_recipe_ci.py` `main()`, before `fetch_recipe()` |
|
||||
| `register_run_app` call site | `lifecycle.deploy_app()` (before app creation) |
|
||||
| `unregister_run_app` call sites | `lifecycle.teardown_app()`, `lifecycle.janitor()` |
|
||||
| Janitor + decision table | `lifecycle.janitor()`, `_run_owner_state()` |
|
||||
| Run-app naming | `runner/harness/naming.py` (`app_domain`), `RUN_APP_RE` in `lifecycle.py` |
|
||||
| Convergence (adjacent) | `lifecycle.services_converged()`, `lifecycle.backup_app()` |
|
||||
Reference in New Issue
Block a user