cc-ci/docs/concurrency.md

# Concurrency: how parallel recipe CI runs stay safe

Spec of the concurrent-run system after the 2026-06-10 restructure (branch
`restructure/concurrency`; plan: cc-ci-plan `concurrency-restructure-full-plan.md`). The previous
registry + per-recipe-flock model is documented in this file's git history (`5b65c6c`).

## 1. Goal and design summary

Two recipe CI builds may run **at the same time** on the single cc-ci host. Safety is enforced by
the **harness**, not by serialising everything, and rests on ONE locking mechanism plus ONE
structural isolation:

| Rule | Mechanism |
|---|---|
| Different recipes run in parallel | nothing blocks them (isolation, §3) |
| Same-RECIPE runs run in parallel too | per-run `ABRA_DIR` recipe trees (§4) — no shared tree, no lock |
| Same-DOMAIN runs (double-`!testme` of one PR) serialise | per-app-domain `flock` (§5) |
| A starting run never reaps a live concurrent run's app | janitor probes the app lock; held = live (§6) |
| A crashed/canceled/rebooted run's leftovers get reaped | lock auto-released by the kernel → probe acquires → reap (§6) |

The invariant chain that makes "held lock = live owner" sound:

```
lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min hard deadline
```

- **lock ⊆ process**: locks are kernel flocks on fds the process holds (and PEP 446 makes those
  fds non-inheritable, so abra/docker/pytest children never carry them). The kernel releases them
  on process death, however it dies. There is no unlock code path and no stale-lock failure mode.
- **process ⊆ step**: `PR_SET_PDEATHSIG(SIGTERM)` + the `.drone.yml` setsid/trap wrap (§2) — a
  dead or canceled build cannot leak a running harness.
- **step ⊆ 60 min**: `signal.alarm(3600)` self-deadline (§2).

Never steal a held lock; manage the holder's lifetime. There is **no daemon and no shared state
service** — everything is kernel/file primitives under `/run/lock` and per-run directories.

## 2. Mechanism 0: run-lifetime hardening (`runner/harness/lifetime.py`)

`run_recipe_ci.main()` calls `lifetime.install_lifetime_guards()` before ANY abra call or lock
acquisition:

1. **`PR_SET_PDEATHSIG(SIGTERM)`** (ctypes prctl, return code checked): if the parent — the drone
   step shell — dies, the kernel TERMs the harness. A post-prctl `ppid == 1` re-check closes the
   start race: a harness whose parent died *before* the prctl armed would never get the signal,
   so it refuses to run orphaned.
2. **SIGTERM handler**: logs, then raises `SystemExit(143)` so the run's `finally:` teardown
   funnel executes and the process exits non-zero. Re-entrant signals during teardown are logged
   and IGNORED (`lifetime.begin_teardown()`, also set at the top of the run's `finally:` blocks)
   so a second signal can't abort the cleanup the first one asked for.
3. **`signal.alarm(3600)` hard deadline**: SIGALRM funnels into the same teardown path with a
   distinct log line (`== run exceeded 60-minute hard deadline — tearing down ==`), exit 142.
   Recipes keep their own smaller per-tier timeouts; this bounds the whole run. Teardown time
   after the deadline is deliberately not alarm-bounded — the janitor is the backstop if a
   teardown wedges and the process is killed harder.

The `.drone.yml` recipe-ci step runs the harness as `setsid cc-ci-run … &` with a
`trap 'kill -TERM -- "-$PID"' TERM EXIT; wait "$PID"` — a drone **cancel** (TERM to the step
shell) is forwarded to the harness's whole process group instead of leaking it (the exec runner
only kills the step shell). PDEATHSIG backstops the no-trap paths.

## 3. Isolation model: what is shared, what is per-run

Per-run (no conflict possible):

- **App + stack + volumes + secrets.** Run app domain = `naming.app_domain()` →
  `<recipe[:4]>-<sha1(recipe|pr|ref)[:6]>.ci.commoninternet.net`, unique per (recipe, pr, ref);
  everything abra creates is namespaced by it. Run apps are recognised by
  `RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$`; warm/canonical apps
  (e.g. `warm-keycloak...`) deliberately do NOT match → the janitor never probes them.
- **Recipe working trees** — `$ABRA_DIR/recipes/<recipe>`, per run (§4). NEW in the restructure.
- **Drone build workspace** (`/var/lib/drone-runner/drone-<id>/`) and **run artifacts**
  (`/var/lib/cc-ci-runs/<run-id>/`).
- **Run-scoped state files** (`/tmp/ccci-{deploys,opstate,deps,depskip}-<run-id>-<pid>…`) —
  keyed by run id + harness pid via `run_recipe_ci._run_state_path()`, NEVER by app domain.
  A second run of the same domain executes its `main()` preamble before blocking at the app
  lock (§5), so domain-keyed files would be reset/removed underneath the live first run
  (live finding, M2(c) double-`!testme`: false DG4.1 deploy-count in run 1, countfile
  `FileNotFoundError` in run 2). Tier/hook children get the exact paths via the
  `CCCI_*_FILE` env vars; removed on normal run exit.

Shared (by design, conflict-free):

- **`/root/.abra/servers`** — app `.env` files, one per domain. The per-run `ABRA_DIR` symlinks
  `servers/` here, so .env files land in the canonical path: janitor discovery (`abra app ls`)
  and out-of-run tooling see every app. Per-domain filenames + the app-domain lock prevent write
  conflicts.
- **`/root/.abra/catalogue`** — read-mostly, symlinked into each per-run dir.
- **`HOME=/root`** (forced in `.drone.yml`) — safe: nothing recipe-mutable lives under `~/.abra`
  for a run anymore except through the two symlinks above.

## 4. Mechanism 1: per-run `ABRA_DIR` (replaces the per-recipe flock)

`run_recipe_ci.setup_run_abra_dir()` — called first thing in `main()`, before any abra call —
builds `<runs_dir>/<run-id>/abra/` (run-id = Drone build number; `manual-<pid>` for hand runs):

```
abra/
  servers/    -> /root/.abra/servers     (symlink; canonical shared .env path)
  catalogue/  -> /root/.abra/catalogue   (symlink; read-mostly)
  recipes/    fresh, empty               (THE isolation that matters)
```

and exports it as `$ABRA_DIR` — honored by the abra CLI itself and by every harness path helper
(`abra.abra_dir()` / `abra.recipe_dir()`; `generic._recipe_dir`, `prepull_images`,
`snapshot_recipe_tests`, `warm_reconcile._recipe_dir` all route through the same rule:
`$ABRA_DIR` if set, else `~/.abra`).

- `fetch_recipe()` is now a plain clone into `$ABRA_DIR/recipes/<recipe>` (PR-head clone+checkout
  or `abra recipe fetch`); the upgrade tier's mid-run `git checkout`s happen in the run's own
  tree. Two same-recipe runs can no longer corrupt each other — structurally, with no lock. The
  old observed failure (immich builds 229/230 deploying a tree missing its config) is impossible.
- `CCCI_SKIP_FETCH=1` (test/Adversary staging) copies the canonically-staged
  `~/.abra/recipes/<recipe>` clone into the per-run tree.
- Out-of-run flows (warm_reconcile's systemd timer, manual abra) set no `ABRA_DIR` and keep using
  the canonical `/root/.abra` unchanged. In-run flows that touch canonical state on purpose
  (warm/canonical .env files) go through `servers/` and are unaffected.
- The per-run dir rides along the existing `/var/lib/cc-ci-runs/<run-id>/` retention. abra
  auto-clones any recipe it needs to resolve (e.g. during `app ls`) into the per-run `recipes/` —
  a few seconds of git per run, gone with the run dir.

## 5. Mechanism 2: per-app-domain flock (`lifecycle.acquire_app_lock`)

- Lock file: `/run/lock/cc-ci-app-<domain>.lock` (dir overridable via `CCCI_APP_LOCK_DIR` for the
  test suite), exclusive `fcntl.flock`, taken in `deploy_app()` **before the app is created** — a
  concurrent janitor can never see a run app without its held lock.
- Blocks (with a log line: `== app lock: another run of <domain> is in flight — waiting ==`) when
  another run of the SAME domain is in flight — the double-`!testme` serialisation point; the
  waiting run is visibly parked at that line in its drone log, by design.
- The returned file object is ALSO retained in module-level `_held_app_locks` — if a caller
  dropped it, GC would close the fd and silently release the lock.
- mtime is touched at acquisition: lock age feeds the janitor's long-held flag (§6).
- **Unlink/recreate race guard**: the janitor unlinks reaped lockfiles, so after EVERY
  acquisition the locked fd is verified to still be the inode the path names
  (`fstat().st_ino == stat().st_ino`); a waiter that won a just-unlinked inode closes it and
  retries on the live path. (A lock on an unlinked inode protects nothing: a later opener gets a
  fresh inode and would acquire "the same" lock.)
- Release is implicit: process exit (any kind). `teardown_app()` does NOT release or unlink —
  a clean run's leftover lockfile is unheld and is unlinked on sight by the next janitor sweep.

## 6. The flock-probe janitor (`lifecycle.janitor`)

Runs at every run start (cold + quick paths) and in the warm/upgrade sweeps. Candidate discovery
is unchanged from the old model: `abra app ls` + a docker-service sweep (catches stacks whose
`.env` is already gone), both matched against `RUN_APP_RE` — warm/canonical apps never match and
are never probed.

Decision table (per candidate domain, `_probe_and_reap`):

| Probe (`LOCK_EX\|LOCK_NB`) | Meaning | Action |
|---|---|---|
| acquires (+ inode identity OK) | nobody holds it → owner died (kernel-guaranteed) | **reap**: `teardown_app(verify=False)` WHILE HOLDING the probe lock, then unlink the lockfile, then release |
| acquires, inode stale | another janitor reaped + unlinked while we raced | skip (reap already done; unlinking now would hit a newer run's file) |
| `BlockingIOError` (held) | live concurrent run | leave it; if lockfile mtime > 120 min (2× the hard deadline): `!! lock for <domain> held >120min — possible leaked run; inspect with lslocks` — flag, **never steal** |
| `open()` fails (`OSError`) | garbled/unopenable lockfile | skip + log, never crash |

- Reaping under the probe lock closes the janitor-vs-new-run race: a new run of that domain
  blocks in `acquire_app_lock` until the reap finishes — no window where a fresh app coexists
  with a half-reaped one.
- Two racing janitors arbitrate on the flock: one reaps, the other sees "held" and leaves; reaps
  are idempotent (`teardown_app(verify=False)` tolerates half-gone stacks).
- After the candidates, a tidy sweep unlinks stale **unheld** `cc-ci-app-*.lock` files with no
  app behind them (under their own probe lock + identity check), keeping `/run/lock` clean.
- **Post-reboot**: `/run/lock` is tmpfs → lockfiles gone → every surviving app probes as an
  orphan → reaped immediately. (Improvement over the old 2-hour age fallback; there IS no age
  logic anymore.)

## 7. Failure-mode guarantees

| Event | Outcome |
|---|---|
| Run crashes / SIGKILL mid-run | flock auto-released by kernel → next janitor probe reaps app + lockfile |
| Drone build canceled via API | step trap TERMs the harness process group → SIGTERM funnel runs the run's own teardown (exit 143); if anything still leaks, PDEATHSIG + janitor reap (the old "cancel leaks the harness" gap is CLOSED) |
| Run exceeds 60 min | SIGALRM → distinct log line → own teardown → exit 142 |
| Host reboot | locks and lockfiles vanish (tmpfs, correct: no owners survived) → all surviving run apps reaped at the next run start, immediately |
| Two same-recipe `!testme`s (different PRs) | run in parallel — separate domains, separate per-run recipe trees |
| Double-`!testme` (same PR → same domain) | second blocks on the app lock before creating anything, visibly in its drone log, runs after the first finishes |
| Janitor vs. app being created | impossible to mis-reap: the lock is held before `app new`, and a held lock is never touched |
| Janitor unlink vs. blocked waiter | inode identity re-check on every acquisition → waiter retries on the live path |
| Lock held implausibly long (>120 min) | flagged loudly for a human (`lslocks`), never stolen |

## 8. Where convergence fits (adjacent; unchanged by the restructure)

Two swarm-convergence behaviors in `services_converged()` look like concurrency bugs but aren't —
any future work must keep them fixed:

- **N/N replicas ≠ converged** during a stop-first rolling update — `UpdateStatus.State` is also
  inspected (build 238: backupbot exec'd into a container killed seconds later).
- **`paused` persists forever** (swarm's default `update-failure-action`) — only `updating` and
  `rollback_started` block convergence; `paused`/`rollback_paused` are settled (build 241).
- `backup_app()` additionally waits (bounded 300s) for convergence before `backup create`.

## 9. Configuration knobs

| Knob | Where | Current | Meaning |
|---|---|---|---|
| `DRONE_RUNNER_CAPACITY` (aka `MAX_TESTS`) | `nix/modules/drone-runner.nix` (`maxTests`) | `2` | **THE single concurrency knob.** Max builds the exec runner executes at once; Drone queues the rest. (The `.drone.yml` `concurrency.limit` duplicate was removed.) Change requires `nixos-rebuild switch`. |
| `CCCI_APP_LOCK_DIR` | env, read at call time | unset → `/run/lock` | App-domain lockfile dir override — used by `tests/concurrency` to sandbox locks. Never set in production. |
| hard deadline | `lifetime.HARD_DEADLINE_SECONDS` | 3600 s | the whole-run alarm; long-held flag threshold is 2× this (`LONG_HELD_LOCK_SECONDS`) |

## 10. Testing: `tests/concurrency/`

Real-kernel suite (19 planned cases + companions): helper subprocesses hold REAL flocks and
install the REAL prctl/signal/alarm guards — flock itself is never mocked; the janitor runs with
injected candidates + stubbed teardown but probes real locks. **Not part of the default
`pytest tests/unit` gate** (it spawns processes and sleeps); run it explicitly:

```
cc-ci-run -m pytest tests/concurrency -q
```

Covers: kernel auto-release on SIGKILL; LOCK_NB probe semantics; PEP 446 fd non-inheritance;
same-domain serialisation; orphan reap + unlink; live-run protection; reap-under-probe-lock
blocking; two-janitor arbitration; reboot-immediate reap; long-held flag; RUN_APP_RE allowlist;
degrade-on-garbage; PDEATHSIG; ppid start race; deadline + SIGTERM funnels; per-run ABRA_DIR
construction/export; concurrent same-recipe fetch isolation; symlinked-servers .env canonicality;
run-keyed (never domain-keyed) run-scoped state files (M2(c) regression, `test_run_state.py`).

## 11. File / symbol index

| What | Where |
|---|---|
| lifetime guards (PDEATHSIG, signal funnels, deadline) | `runner/harness/lifetime.py`; installed in `run_recipe_ci.main()` |
| setsid/trap cancel forwarding | `.drone.yml` (`recipe-ci` step) |
| `acquire_app_lock`, `_held_app_locks`, `_app_lock_path` | `runner/harness/lifecycle.py` |
| `acquire_app_lock` call site | `lifecycle.deploy_app()` (before app creation) |
| janitor + probe (`janitor`, `_probe_and_reap`, `LONG_HELD_LOCK_SECONDS`) | `runner/harness/lifecycle.py` |
| per-run ABRA_DIR (`setup_run_abra_dir`, `fetch_recipe`) | `runner/run_recipe_ci.py` |
| path resolution (`abra_dir`, `recipe_dir`) | `runner/harness/abra.py` (used by `generic`, `lifecycle.prepull_images`, `warm_reconcile`) |
| run-app naming | `runner/harness/naming.py` (`app_domain`), `RUN_APP_RE` in `lifecycle.py` |
| capacity knob | `nix/modules/drone-runner.nix` (`maxTests`) |
| convergence (adjacent) | `lifecycle.services_converged()`, `lifecycle.backup_app()` |
| the test suite | `tests/concurrency/` (`helpers.py` subprocess entrypoints, `concutil.py` probes) |

Deleted in the restructure (grep should find NOTHING): `register_run_app`, `unregister_run_app`,
`_run_owner_state`, `ACTIVE_RUN_DIR`, `CCCI_JANITOR_MAX_AGE`, `_stack_age_seconds`,
`acquire_recipe_lock`, `RECIPE_LOCK_DIR`.