Files

autonomic-bot b6e12ef428

continuous-integration/drone/push Build is passing

Details

fix(harness): run-keyed run-scoped state files — CONC-A1 (same-domain runs corrupted shared deploy-count)

The four CCCI state files (deploys countfile, opstate, deps, depskip) were keyed
by app domain in shared /tmp. A second run of the same domain executes its main()
preamble + deploy_app's pre-lock _record_deploy BEFORE blocking at the app lock,
so it reset/polluted the live first run's counter (false DG4.1 deploy-count=2,
build 279) and the first run's end-of-run os.remove crashed the second
(FileNotFoundError, build 281). Masked pre-restructure by the end-to-end recipe
flock. Now keyed by run id + harness pid via _run_state_path(); children receive
exact paths via the CCCI_*_FILE env vars, so domain keying was never load-bearing.

tests/concurrency/test_run_state.py: path-invariant cases + a real-process
regression (helpers.py deploy-count-run) reproducing the live interleaving —
verified to FAIL under simulated shared keying. docs/concurrency.md §3 updated.

2026-06-10 08:16:09 +00:00

15 KiB

Raw Blame History

Concurrency: how parallel recipe CI runs stay safe

Spec of the concurrent-run system after the 2026-06-10 restructure (branch restructure/concurrency; plan: cc-ci-plan concurrency-restructure-full-plan.md). The previous registry + per-recipe-flock model is documented in this file's git history (5b65c6c).

1. Goal and design summary

Two recipe CI builds may run at the same time on the single cc-ci host. Safety is enforced by the harness, not by serialising everything, and rests on ONE locking mechanism plus ONE structural isolation:

Rule	Mechanism
Different recipes run in parallel	nothing blocks them (isolation, §3)
Same-RECIPE runs run in parallel too	per-run `ABRA_DIR` recipe trees (§4) — no shared tree, no lock
Same-DOMAIN runs (double-`!testme` of one PR) serialise	per-app-domain `flock` (§5)
A starting run never reaps a live concurrent run's app	janitor probes the app lock; held = live (§6)
A crashed/canceled/rebooted run's leftovers get reaped	lock auto-released by the kernel → probe acquires → reap (§6)

The invariant chain that makes "held lock = live owner" sound:

lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min hard deadline

lock ⊆ process: locks are kernel flocks on fds the process holds (and PEP 446 makes those fds non-inheritable, so abra/docker/pytest children never carry them). The kernel releases them on process death, however it dies. There is no unlock code path and no stale-lock failure mode.
process ⊆ step: PR_SET_PDEATHSIG(SIGTERM) + the .drone.yml setsid/trap wrap (§2) — a dead or canceled build cannot leak a running harness.
step ⊆ 60 min: signal.alarm(3600) self-deadline (§2).

Never steal a held lock; manage the holder's lifetime. There is no daemon and no shared state service — everything is kernel/file primitives under /run/lock and per-run directories.

2. Mechanism 0: run-lifetime hardening (`runner/harness/lifetime.py`)

run_recipe_ci.main() calls lifetime.install_lifetime_guards() before ANY abra call or lock acquisition:

PR_SET_PDEATHSIG(SIGTERM) (ctypes prctl, return code checked): if the parent — the drone step shell — dies, the kernel TERMs the harness. A post-prctl ppid == 1 re-check closes the start race: a harness whose parent died before the prctl armed would never get the signal, so it refuses to run orphaned.
SIGTERM handler: logs, then raises SystemExit(143) so the run's finally: teardown funnel executes and the process exits non-zero. Re-entrant signals during teardown are logged and IGNORED (lifetime.begin_teardown(), also set at the top of the run's finally: blocks) so a second signal can't abort the cleanup the first one asked for.
signal.alarm(3600) hard deadline: SIGALRM funnels into the same teardown path with a distinct log line (== run exceeded 60-minute hard deadline — tearing down ==), exit 142. Recipes keep their own smaller per-tier timeouts; this bounds the whole run. Teardown time after the deadline is deliberately not alarm-bounded — the janitor is the backstop if a teardown wedges and the process is killed harder.

The .drone.yml recipe-ci step runs the harness as setsid cc-ci-run … & with a trap 'kill -TERM -- "-$PID"' TERM EXIT; wait "$PID" — a drone cancel (TERM to the step shell) is forwarded to the harness's whole process group instead of leaking it (the exec runner only kills the step shell). PDEATHSIG backstops the no-trap paths.

3. Isolation model: what is shared, what is per-run

Per-run (no conflict possible):

App + stack + volumes + secrets. Run app domain = naming.app_domain() → <recipe[:4]>-<sha1(recipe|pr|ref)[:6]>.ci.commoninternet.net, unique per (recipe, pr, ref); everything abra creates is namespaced by it. Run apps are recognised by RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$; warm/canonical apps (e.g. warm-keycloak...) deliberately do NOT match → the janitor never probes them.
Recipe working trees — $ABRA_DIR/recipes/<recipe>, per run (§4). NEW in the restructure.
Drone build workspace (/var/lib/drone-runner/drone-<id>/) and run artifacts (/var/lib/cc-ci-runs/<run-id>/).
Run-scoped state files (/tmp/ccci-{deploys,opstate,deps,depskip}-<run-id>-<pid>…) — keyed by run id + harness pid via run_recipe_ci._run_state_path(), NEVER by app domain. A second run of the same domain executes its main() preamble before blocking at the app lock (§5), so domain-keyed files would be reset/removed underneath the live first run (live finding, M2(c) double-!testme: false DG4.1 deploy-count in run 1, countfile FileNotFoundError in run 2). Tier/hook children get the exact paths via the CCCI_*_FILE env vars; removed on normal run exit.

Shared (by design, conflict-free):

/root/.abra/servers — app .env files, one per domain. The per-run ABRA_DIR symlinks servers/ here, so .env files land in the canonical path: janitor discovery (abra app ls) and out-of-run tooling see every app. Per-domain filenames + the app-domain lock prevent write conflicts.
/root/.abra/catalogue — read-mostly, symlinked into each per-run dir.
HOME=/root (forced in .drone.yml) — safe: nothing recipe-mutable lives under ~/.abra for a run anymore except through the two symlinks above.

4. Mechanism 1: per-run `ABRA_DIR` (replaces the per-recipe flock)

run_recipe_ci.setup_run_abra_dir() — called first thing in main(), before any abra call — builds <runs_dir>/<run-id>/abra/ (run-id = Drone build number; manual-<pid> for hand runs):

abra/
  servers/    -> /root/.abra/servers     (symlink; canonical shared .env path)
  catalogue/  -> /root/.abra/catalogue   (symlink; read-mostly)
  recipes/    fresh, empty               (THE isolation that matters)

and exports it as $ABRA_DIR — honored by the abra CLI itself and by every harness path helper (abra.abra_dir() / abra.recipe_dir(); generic._recipe_dir, prepull_images, snapshot_recipe_tests, warm_reconcile._recipe_dir all route through the same rule: $ABRA_DIR if set, else ~/.abra).

fetch_recipe() is now a plain clone into $ABRA_DIR/recipes/<recipe> (PR-head clone+checkout or abra recipe fetch); the upgrade tier's mid-run git checkouts happen in the run's own tree. Two same-recipe runs can no longer corrupt each other — structurally, with no lock. The old observed failure (immich builds 229/230 deploying a tree missing its config) is impossible.
CCCI_SKIP_FETCH=1 (test/Adversary staging) copies the canonically-staged ~/.abra/recipes/<recipe> clone into the per-run tree.
Out-of-run flows (warm_reconcile's systemd timer, manual abra) set no ABRA_DIR and keep using the canonical /root/.abra unchanged. In-run flows that touch canonical state on purpose (warm/canonical .env files) go through servers/ and are unaffected.
The per-run dir rides along the existing /var/lib/cc-ci-runs/<run-id>/ retention. abra auto-clones any recipe it needs to resolve (e.g. during app ls) into the per-run recipes/ — a few seconds of git per run, gone with the run dir.

5. Mechanism 2: per-app-domain flock (`lifecycle.acquire_app_lock`)

Lock file: /run/lock/cc-ci-app-<domain>.lock (dir overridable via CCCI_APP_LOCK_DIR for the test suite), exclusive fcntl.flock, taken in deploy_app() before the app is created — a concurrent janitor can never see a run app without its held lock.
Blocks (with a log line: == app lock: another run of <domain> is in flight — waiting ==) when another run of the SAME domain is in flight — the double-!testme serialisation point; the waiting run is visibly parked at that line in its drone log, by design.
The returned file object is ALSO retained in module-level _held_app_locks — if a caller dropped it, GC would close the fd and silently release the lock.
mtime is touched at acquisition: lock age feeds the janitor's long-held flag (§6).
Unlink/recreate race guard: the janitor unlinks reaped lockfiles, so after EVERY acquisition the locked fd is verified to still be the inode the path names (fstat().st_ino == stat().st_ino); a waiter that won a just-unlinked inode closes it and retries on the live path. (A lock on an unlinked inode protects nothing: a later opener gets a fresh inode and would acquire "the same" lock.)
Release is implicit: process exit (any kind). teardown_app() does NOT release or unlink — a clean run's leftover lockfile is unheld and is unlinked on sight by the next janitor sweep.

6. The flock-probe janitor (`lifecycle.janitor`)

Runs at every run start (cold + quick paths) and in the warm/upgrade sweeps. Candidate discovery is unchanged from the old model: abra app ls + a docker-service sweep (catches stacks whose .env is already gone), both matched against RUN_APP_RE — warm/canonical apps never match and are never probed.

Decision table (per candidate domain, _probe_and_reap):

Probe (`LOCK_EX\|LOCK_NB`)	Meaning	Action
acquires (+ inode identity OK)	nobody holds it → owner died (kernel-guaranteed)	reap: `teardown_app(verify=False)` WHILE HOLDING the probe lock, then unlink the lockfile, then release
acquires, inode stale	another janitor reaped + unlinked while we raced	skip (reap already done; unlinking now would hit a newer run's file)
`BlockingIOError` (held)	live concurrent run	leave it; if lockfile mtime > 120 min (2× the hard deadline): `!! lock for <domain> held >120min — possible leaked run; inspect with lslocks` — flag, never steal
`open()` fails (`OSError`)	garbled/unopenable lockfile	skip + log, never crash

Reaping under the probe lock closes the janitor-vs-new-run race: a new run of that domain blocks in acquire_app_lock until the reap finishes — no window where a fresh app coexists with a half-reaped one.
Two racing janitors arbitrate on the flock: one reaps, the other sees "held" and leaves; reaps are idempotent (teardown_app(verify=False) tolerates half-gone stacks).
After the candidates, a tidy sweep unlinks stale unheld cc-ci-app-*.lock files with no app behind them (under their own probe lock + identity check), keeping /run/lock clean.
Post-reboot: /run/lock is tmpfs → lockfiles gone → every surviving app probes as an orphan → reaped immediately. (Improvement over the old 2-hour age fallback; there IS no age logic anymore.)

7. Failure-mode guarantees

Event	Outcome
Run crashes / SIGKILL mid-run	flock auto-released by kernel → next janitor probe reaps app + lockfile
Drone build canceled via API	step trap TERMs the harness process group → SIGTERM funnel runs the run's own teardown (exit 143); if anything still leaks, PDEATHSIG + janitor reap (the old "cancel leaks the harness" gap is CLOSED)
Run exceeds 60 min	SIGALRM → distinct log line → own teardown → exit 142
Host reboot	locks and lockfiles vanish (tmpfs, correct: no owners survived) → all surviving run apps reaped at the next run start, immediately
Two same-recipe `!testme`s (different PRs)	run in parallel — separate domains, separate per-run recipe trees
Double-`!testme` (same PR → same domain)	second blocks on the app lock before creating anything, visibly in its drone log, runs after the first finishes
Janitor vs. app being created	impossible to mis-reap: the lock is held before `app new`, and a held lock is never touched
Janitor unlink vs. blocked waiter	inode identity re-check on every acquisition → waiter retries on the live path
Lock held implausibly long (>120 min)	flagged loudly for a human (`lslocks`), never stolen

8. Where convergence fits (adjacent; unchanged by the restructure)

Two swarm-convergence behaviors in services_converged() look like concurrency bugs but aren't — any future work must keep them fixed:

N/N replicas ≠ converged during a stop-first rolling update — UpdateStatus.State is also inspected (build 238: backupbot exec'd into a container killed seconds later).
paused persists forever (swarm's default update-failure-action) — only updating and rollback_started block convergence; paused/rollback_paused are settled (build 241).
backup_app() additionally waits (bounded 300s) for convergence before backup create.

9. Configuration knobs

Knob	Where	Current	Meaning
`DRONE_RUNNER_CAPACITY` (aka `MAX_TESTS`)	`nix/modules/drone-runner.nix` (`maxTests`)	`2`	THE single concurrency knob. Max builds the exec runner executes at once; Drone queues the rest. (The `.drone.yml` `concurrency.limit` duplicate was removed.) Change requires `nixos-rebuild switch`.
`CCCI_APP_LOCK_DIR`	env, read at call time	unset → `/run/lock`	App-domain lockfile dir override — used by `tests/concurrency` to sandbox locks. Never set in production.
hard deadline	`lifetime.HARD_DEADLINE_SECONDS`	3600 s	the whole-run alarm; long-held flag threshold is 2× this (`LONG_HELD_LOCK_SECONDS`)

10. Testing: `tests/concurrency/`

Real-kernel suite (19 planned cases + companions): helper subprocesses hold REAL flocks and install the REAL prctl/signal/alarm guards — flock itself is never mocked; the janitor runs with injected candidates + stubbed teardown but probes real locks. Not part of the default pytest tests/unit gate (it spawns processes and sleeps); run it explicitly:

cc-ci-run -m pytest tests/concurrency -q

Covers: kernel auto-release on SIGKILL; LOCK_NB probe semantics; PEP 446 fd non-inheritance; same-domain serialisation; orphan reap + unlink; live-run protection; reap-under-probe-lock blocking; two-janitor arbitration; reboot-immediate reap; long-held flag; RUN_APP_RE allowlist; degrade-on-garbage; PDEATHSIG; ppid start race; deadline + SIGTERM funnels; per-run ABRA_DIR construction/export; concurrent same-recipe fetch isolation; symlinked-servers .env canonicality; run-keyed (never domain-keyed) run-scoped state files (M2(c) regression, test_run_state.py).

11. File / symbol index

What	Where
lifetime guards (PDEATHSIG, signal funnels, deadline)	`runner/harness/lifetime.py`; installed in `run_recipe_ci.main()`
setsid/trap cancel forwarding	`.drone.yml` (`recipe-ci` step)
`acquire_app_lock`, `_held_app_locks`, `_app_lock_path`	`runner/harness/lifecycle.py`
`acquire_app_lock` call site	`lifecycle.deploy_app()` (before app creation)
janitor + probe (`janitor`, `_probe_and_reap`, `LONG_HELD_LOCK_SECONDS`)	`runner/harness/lifecycle.py`
per-run ABRA_DIR (`setup_run_abra_dir`, `fetch_recipe`)	`runner/run_recipe_ci.py`
path resolution (`abra_dir`, `recipe_dir`)	`runner/harness/abra.py` (used by `generic`, `lifecycle.prepull_images`, `warm_reconcile`)
run-app naming	`runner/harness/naming.py` (`app_domain`), `RUN_APP_RE` in `lifecycle.py`
capacity knob	`nix/modules/drone-runner.nix` (`maxTests`)
convergence (adjacent)	`lifecycle.services_converged()`, `lifecycle.backup_app()`
the test suite	`tests/concurrency/` (`helpers.py` subprocess entrypoints, `concutil.py` probes)

Deleted in the restructure (grep should find NOTHING): register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE, _stack_age_seconds, acquire_recipe_lock, RECIPE_LOCK_DIR.

15 KiB Raw Blame History Unescape Escape