Merge branch 'restructure/concurrency': concurrency restructure (P1-P5 + tests/concurrency)
Some checks failed
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is failing

M1 Adversary-verified PASS (REVIEW-conc.md @83a6c6e): lock-lifetime hardening (PDEATHSIG +
signal funnels + 60-min deadline + setsid/trap cancel forwarding), flock-probe janitor
(registry deleted), per-run ABRA_DIR (recipe flock deleted), single concurrency knob,
tests/concurrency real-kernel suite, docs/concurrency.md rewrite.
This commit is contained in:
autonomic-bot
2026-06-10 04:40:00 +00:00
18 changed files with 1362 additions and 300 deletions

View File

@ -35,12 +35,12 @@ steps:
# the comment-bridge). Deploys the recipe at the PR head, runs install/upgrade/backup + any # the comment-bridge). Deploys the recipe at the PR head, runs install/upgrade/backup + any
# recipe-local tests via the shared harness, then guarantees teardown (plan §4.2/§4.3). # recipe-local tests via the shared harness, then guarantees teardown (plan §4.2/§4.3).
# #
# Resource safety (plan §4.2/§4.3): DRONE_RUNNER_CAPACITY=2 (nix/modules/drone-runner.nix) + # Resource safety (plan §4.2/§4.3): DRONE_RUNNER_CAPACITY=2 (nix/modules/drone-runner.nix, the
# concurrency.limit=2 below allow two recipe runs in parallel. Concurrent-run safety is enforced by # single concurrency knob) allows two recipe runs in parallel. Concurrent-run safety is enforced by
# the harness, not by serialisation: same-recipe runs serialise on a per-recipe flock # the harness, not by serialisation: every run holds an exclusive flock on its app domain
# (lifecycle.acquire_recipe_lock — the shared ~/.abra/recipes/<recipe> checkout is the conflict), # (/run/lock/cc-ci-app-<domain>.lock) for its whole process lifetime, the run-start janitor probes
# and every run registers its app domain + pid in /run/cc-ci-active so the run-start janitor only # that lock to reap only orphans (held lock = live run, never touched), and recipe working trees
# reaps orphans whose owning run is DEAD (alive → never touched; unknown → age fallback, default 2h). # are per-run ($ABRA_DIR/recipes — no shared checkout, no recipe lock). See docs/concurrency.md.
kind: pipeline kind: pipeline
type: exec type: exec
name: recipe-ci name: recipe-ci
@ -53,21 +53,31 @@ trigger:
event: event:
- custom - custom
concurrency: # NB deliberately NO `concurrency.limit` here: DRONE_RUNNER_CAPACITY (nix/modules/drone-runner.nix
limit: 2 # maxTests) is the single concurrency knob (P4 — two knobs in two files drifted).
steps: steps:
- name: ci - name: ci
environment: environment:
STAGES: install,upgrade,backup,restore,custom STAGES: install,upgrade,backup,restore,custom
# The exec runner points HOME at a per-build workspace; force it to /root so abra finds its # The exec runner points HOME at a per-build workspace; force it to /root so abra's server
# server config + recipes under /root/.abra (as the manual M4/M5 runs did). Safe with # config is found via the per-run ABRA_DIR's servers/ symlink -> /root/.abra/servers.
# capacity=2: app names are unique per (recipe,pr,ref) and same-recipe runs serialise on the # Recipe trees are PER-RUN ($ABRA_DIR/recipes, exported by run_recipe_ci before any abra
# per-recipe flock, so concurrent builds never touch the same recipe checkout or app. # call), so concurrent builds never share a recipe checkout; app .env files are per-domain
# in the shared canonical servers/ path, guarded by the app-domain flock.
HOME: /root HOME: /root
commands: commands:
# RECIPE/REF/PR/SRC (+ CCCI_QUICK for `!testme --quick`) are injected as env vars from the # RECIPE/REF/PR/SRC (+ CCCI_QUICK for `!testme --quick`) are injected as env vars from the
# build's custom params. CCCI_QUICK=1 makes run_recipe_ci take the opt-in fast lane (WC7); # build's custom params. CCCI_QUICK=1 makes run_recipe_ci take the opt-in fast lane (WC7);
# absent => full cold (default). run_quick ignores STAGES (always upgrade+custom). # absent => full cold (default). run_quick ignores STAGES (always upgrade+custom).
- 'echo "recipe-ci: RECIPE=$RECIPE REF=$REF PR=$PR SRC=$SRC stages=$STAGES quick=${CCCI_QUICK:-0}"' - 'echo "recipe-ci: RECIPE=$RECIPE REF=$REF PR=$PR SRC=$SRC stages=$STAGES quick=${CCCI_QUICK:-0}"'
- cc-ci-run runner/run_recipe_ci.py # P1 lock-lifetime hardening: run the harness in its own session/process group (setsid) and
# forward a drone cancel (TERM to this step shell) to the WHOLE group, so the harness's
# SIGTERM handler runs its teardown funnel instead of being leaked (the exec runner kills
# only the step shell, not the tree). PDEATHSIG inside the harness backstops the case where
# this shell dies without the trap firing. `wait` propagates the harness exit code.
- |
setsid cc-ci-run runner/run_recipe_ci.py &
PID=$!
trap 'kill -TERM -- "-$PID" 2>/dev/null' TERM EXIT
wait "$PID"

View File

@ -1,167 +1,228 @@
# Concurrency: how parallel recipe CI runs stay safe # Concurrency: how parallel recipe CI runs stay safe
Spec of the concurrent-run system as of 2026-06-10 (commits `c0df77d`, `68ef0f8`, `e6d55b5`). Spec of the concurrent-run system after the 2026-06-10 restructure (branch
Written for review/restructuring — it documents what IS, including known limitations. `restructure/concurrency`; plan: cc-ci-plan `concurrency-restructure-full-plan.md`). The previous
registry + per-recipe-flock model is documented in this file's git history (`5b65c6c`).
## 1. Goal and design summary ## 1. Goal and design summary
Two recipe CI builds may run **at the same time** on the single cc-ci host (e.g. immich and Two recipe CI builds may run **at the same time** on the single cc-ci host. Safety is enforced by
plausible under active development at once). Safety is enforced by the **harness**, not by the **harness**, not by serialising everything, and rests on ONE locking mechanism plus ONE
serialising everything: structural isolation:
| Rule | Mechanism | | Rule | Mechanism |
|---|---| |---|---|
| Different recipes run in parallel | nothing blocks them (isolation, §3) | | Different recipes run in parallel | nothing blocks them (isolation, §3) |
| Same-recipe runs serialise | per-recipe `flock` (§4) | | Same-RECIPE runs run in parallel too | per-run `ABRA_DIR` recipe trees (§4) — no shared tree, no lock |
| A starting run never reaps a live concurrent run's app | active-run registry + three-way janitor (§5) | | Same-DOMAIN runs (double-`!testme` of one PR) serialise | per-app-domain `flock` (§5) |
| A crashed run's leftovers still get reaped | registry owner-dead detection, age fallback5) | | A starting run never reaps a live concurrent run's app | janitor probes the app lock; held = live6) |
| A crashed/canceled/rebooted run's leftovers get reaped | lock auto-released by the kernel → probe acquires → reap (§6) |
There is **no daemon and no shared state service**: both mechanisms are kernel/file primitives The invariant chain that makes "held lock = live owner" sound:
under `/run`, scoped to the harness process lifetime, so a `SIGKILL`'d run can never leak a stale
lock or a stale "I'm alive" claim.
## 2. Configuration knobs ```
lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min hard deadline
```
| Knob | Where | Current | Meaning | - **lock ⊆ process**: locks are kernel flocks on fds the process holds (and PEP 446 makes those
|---|---|---|---| fds non-inheritable, so abra/docker/pytest children never carry them). The kernel releases them
| `DRONE_RUNNER_CAPACITY` (aka `MAX_TESTS`) | `nix/modules/drone-runner.nix` (`maxTests` let-binding) | `2` | **THE cap.** Max builds the exec runner executes at once; Drone queues the rest in its native pending queue. Change requires `nixos-rebuild switch` on cc-ci. | on process death, however it dies. There is no unlock code path and no stale-lock failure mode.
| `concurrency.limit` | `.drone.yml`, `recipe-ci` pipeline | `2` | Server-side cap on concurrent `recipe-ci` pipelines. Kept equal to capacity; redundant belt (the push pipeline shares runner capacity too, so lint builds can interleave). | - **process ⊆ step**: `PR_SET_PDEATHSIG(SIGTERM)` + the `.drone.yml` setsid/trap wrap (§2) — a
| `CCCI_JANITOR_MAX_AGE` | env, read in `lifecycle.janitor()` | unset → `7200`s | **Age fallback only** — applies solely to apps with no registry entry (§5 case 3). The capacity=1-era `"0"` override in `.drone.yml` is GONE; do not reintroduce it (it made a starting build reap in-flight runs). | dead or canceled build cannot leak a running harness.
| `RECIPE_LOCK_DIR` | `lifecycle.py` constant | `/run/lock` | Where per-recipe lock files live. | - **step ⊆ 60 min**: `signal.alarm(3600)` self-deadline (§2).
| `ACTIVE_RUN_DIR` | `lifecycle.py` constant | `/run/cc-ci-active` | Where active-run pidfiles live. |
Memory budget rationale for capacity=2 (Hetzner cpx22, ~7.6 GiB): a full immich stack measured Never steal a held lock; manage the holder's lifetime. There is **no daemon and no shared state
~1 GiB; two concurrent recipes fit. Revert `maxTests` to `"1"` if OOM/disk-I/O contention appears. service** — everything is kernel/file primitives under `/run/lock` and per-run directories.
## 2. Mechanism 0: run-lifetime hardening (`runner/harness/lifetime.py`)
`run_recipe_ci.main()` calls `lifetime.install_lifetime_guards()` before ANY abra call or lock
acquisition:
1. **`PR_SET_PDEATHSIG(SIGTERM)`** (ctypes prctl, return code checked): if the parent — the drone
step shell — dies, the kernel TERMs the harness. A post-prctl `ppid == 1` re-check closes the
start race: a harness whose parent died *before* the prctl armed would never get the signal,
so it refuses to run orphaned.
2. **SIGTERM handler**: logs, then raises `SystemExit(143)` so the run's `finally:` teardown
funnel executes and the process exits non-zero. Re-entrant signals during teardown are logged
and IGNORED (`lifetime.begin_teardown()`, also set at the top of the run's `finally:` blocks)
so a second signal can't abort the cleanup the first one asked for.
3. **`signal.alarm(3600)` hard deadline**: SIGALRM funnels into the same teardown path with a
distinct log line (`== run exceeded 60-minute hard deadline — tearing down ==`), exit 142.
Recipes keep their own smaller per-tier timeouts; this bounds the whole run. Teardown time
after the deadline is deliberately not alarm-bounded — the janitor is the backstop if a
teardown wedges and the process is killed harder.
The `.drone.yml` recipe-ci step runs the harness as `setsid cc-ci-run … &` with a
`trap 'kill -TERM -- "-$PID"' TERM EXIT; wait "$PID"` — a drone **cancel** (TERM to the step
shell) is forwarded to the harness's whole process group instead of leaking it (the exec runner
only kills the step shell). PDEATHSIG backstops the no-trap paths.
## 3. Isolation model: what is shared, what is per-run ## 3. Isolation model: what is shared, what is per-run
Per-run (no conflict possible): Per-run (no conflict possible):
- **App + stack + volumes + secrets.** The run app domain is deterministic and unique per - **App + stack + volumes + secrets.** Run app domain = `naming.app_domain()`
(recipe, pr, ref): `naming.app_domain()` `<recipe[:4]>-<sha1(recipe|pr|ref)[:6]>.ci.commoninternet.net`. `<recipe[:4]>-<sha1(recipe|pr|ref)[:6]>.ci.commoninternet.net`, unique per (recipe, pr, ref);
Everything abra creates is namespaced by it. Run apps are recognised by everything abra creates is namespaced by it. Run apps are recognised by
`RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$`; canonical/warm apps `RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$`; warm/canonical apps
(e.g. `warm-keycloak...`) deliberately do NOT match, so the janitor never touches them. (e.g. `warm-keycloak...`) deliberately do NOT match the janitor never probes them.
- **Drone build workspace.** The exec runner gives each build its own clone under - **Recipe working trees** — `$ABRA_DIR/recipes/<recipe>`, per run (§4). NEW in the restructure.
`/var/lib/drone-runner/drone-<id>/` — harness code and test files are per-build. - **Drone build workspace** (`/var/lib/drone-runner/drone-<id>/`) and **run artifacts**
- **Run artifacts.** `/var/lib/cc-ci-runs/<build-number>/`. (`/var/lib/cc-ci-runs/<run-id>/`).
Shared (the two hazards the mechanisms exist for): Shared (by design, conflict-free):
- **`~/.abra/recipes/<recipe>`** — ONE working tree per recipe (abra's own layout). The harness - **`/root/.abra/servers`** — app `.env` files, one per domain. The per-run `ABRA_DIR` symlinks
`fetch_recipe()` `rm -rf`'s + reclones it at run start, and the upgrade tier `git checkout`s it `servers/` here, so .env files land in the canonical path: janitor discovery (`abra app ls`)
mid-run for the chaos redeploy. Two same-recipe runs would corrupt each other's deploy tree and out-of-run tooling see every app. Per-domain filenames + the app-domain lock prevent write
(observed: immich builds 229/230 deployed a tree missing its config). → per-recipe flock (§4). conflicts.
- **`HOME=/root`** — forced in `.drone.yml` so abra finds its server config under `/root/.abra`. - **`/root/.abra/catalogue`** — read-mostly, symlinked into each per-run dir.
Safe *given* the above: app names are unique and same-recipe runs serialise, so no two builds - **`HOME=/root`** (forced in `.drone.yml`) — safe: nothing recipe-mutable lives under `~/.abra`
touch the same recipe checkout or app env file. for a run anymore except through the two symlinks above.
## 4. Mechanism 1: per-recipe flock ## 4. Mechanism 1: per-run `ABRA_DIR` (replaces the per-recipe flock)
Code: `lifecycle.acquire_recipe_lock(recipe)`; taken in `run_recipe_ci.main()` **before** `run_recipe_ci.setup_run_abra_dir()` — called first thing in `main()`, before any abra call —
`fetch_recipe()` (the first shared-tree mutation). builds `<runs_dir>/<run-id>/abra/` (run-id = Drone build number; `manual-<pid>` for hand runs):
- Lock file: `/run/lock/cc-ci-recipe-<recipe>.lock`, exclusive `fcntl.flock`. ```
- Non-blocking attempt first; on `BlockingIOError` it logs abra/
`== recipe lock: another <recipe> run is in flight — waiting ... ==` and blocks. The waiting servers/ -> /root/.abra/servers (symlink; canonical shared .env path)
run is visibly "stuck" in its drone log at that line — that is by design. catalogue/ -> /root/.abra/catalogue (symlink; read-mostly)
- The open file object is returned and kept alive (`_recipe_lock = ... # noqa: F841`) for the recipes/ fresh, empty (THE isolation that matters)
**whole process lifetime**. Release is implicit: the kernel drops a flock when the fd closes — ```
including on crash or SIGKILL. **There is no stale-lock failure mode and no unlock code path.**
- Scope: serialises only runs of the SAME recipe. Different recipes never contend.
## 5. Mechanism 2: active-run registry + three-way janitor and exports it as `$ABRA_DIR` — honored by the abra CLI itself and by every harness path helper
(`abra.abra_dir()` / `abra.recipe_dir()`; `generic._recipe_dir`, `prepull_images`,
`snapshot_recipe_tests`, `warm_reconcile._recipe_dir` all route through the same rule:
`$ABRA_DIR` if set, else `~/.abra`).
Why: every run starts with `lifecycle.janitor()` (called from both the cold path and the - `fetch_recipe()` is now a plain clone into `$ABRA_DIR/recipes/<recipe>` (PR-head clone+checkout
warm/quick path in `run_recipe_ci.py`) to reap orphans left by crashed/SIGKILL'd runs (whose or `abra recipe fetch`); the upgrade tier's mid-run `git checkout`s happen in the run's own
`finally:` teardown never ran). Under capacity=2 "any run app that isn't mine" may be a LIVE tree. Two same-recipe runs can no longer corrupt each other — structurally, with no lock. The
concurrent run — age alone can't tell. The registry makes ownership explicit. old observed failure (immich builds 229/230 deploying a tree missing its config) is impossible.
- `CCCI_SKIP_FETCH=1` (test/Adversary staging) copies the canonically-staged
`~/.abra/recipes/<recipe>` clone into the per-run tree.
- Out-of-run flows (warm_reconcile's systemd timer, manual abra) set no `ABRA_DIR` and keep using
the canonical `/root/.abra` unchanged. In-run flows that touch canonical state on purpose
(warm/canonical .env files) go through `servers/` and are unaffected.
- The per-run dir rides along the existing `/var/lib/cc-ci-runs/<run-id>/` retention. abra
auto-clones any recipe it needs to resolve (e.g. during `app ls`) into the per-run `recipes/`
a few seconds of git per run, gone with the run dir.
Registry protocol (all in `lifecycle.py`): ## 5. Mechanism 2: per-app-domain flock (`lifecycle.acquire_app_lock`)
1. `register_run_app(domain)` — writes `/run/cc-ci-active/<domain>` containing the harness pid. - Lock file: `/run/lock/cc-ci-app-<domain>.lock` (dir overridable via `CCCI_APP_LOCK_DIR` for the
Called inside `deploy_app()` **before** the app is created, so no window exists where a test suite), exclusive `fcntl.flock`, taken in `deploy_app()` **before the app is created** a
concurrent janitor can see the app without its registration. concurrent janitor can never see a run app without its held lock.
2. `unregister_run_app(domain)` — removes the pidfile. Called at the end of `teardown_app()` - Blocks (with a log line: `== app lock: another run of <domain> is in flight — waiting ==`) when
(every exit path funnels through teardown) and by the janitor after reaping. another run of the SAME domain is in flight — the double-`!testme` serialisation point; the
3. `_run_owner_state(domain)` — classifies the owner: waiting run is visibly parked at that line in its drone log, by design.
- reads the pid; missing/garbled file → `"unknown"` - The returned file object is ALSO retained in module-level `_held_app_locks` — if a caller
- `/proc/<pid>/cmdline` gone → `"dead"` dropped it, GC would close the fd and silently release the lock.
- cmdline must contain `run_recipe_ci``"alive"`, else `"dead"` (**pid-reuse guard**: a - mtime is touched at acquisition: lock age feeds the janitor's long-held flag (§6).
recycled pid won't look like a harness run) - **Unlink/recreate race guard**: the janitor unlinks reaped lockfiles, so after EVERY
acquisition the locked fd is verified to still be the inode the path names
(`fstat().st_ino == stat().st_ino`); a waiter that won a just-unlinked inode closes it and
retries on the live path. (A lock on an unlinked inode protects nothing: a later opener gets a
fresh inode and would acquire "the same" lock.)
- Release is implicit: process exit (any kind). `teardown_app()` does NOT release or unlink —
a clean run's leftover lockfile is unheld and is unlinked on sight by the next janitor sweep.
Janitor decision table (`lifecycle.janitor()`): ## 6. The flock-probe janitor (`lifecycle.janitor`)
| Owner state | Meaning | Action | Runs at every run start (cold + quick paths) and in the warm/upgrade sweeps. Candidate discovery
is unchanged from the old model: `abra app ls` + a docker-service sweep (catches stacks whose
`.env` is already gone), both matched against `RUN_APP_RE` — warm/canonical apps never match and
are never probed.
Decision table (per candidate domain, `_probe_and_reap`):
| Probe (`LOCK_EX\|LOCK_NB`) | Meaning | Action |
|---|---|---| |---|---|---|
| `alive` | live concurrent run | **never reap** (logs "is a live concurrent run — leaving it") | | acquires (+ inode identity OK) | nobody holds it → owner died (kernel-guaranteed) | **reap**: `teardown_app(verify=False)` WHILE HOLDING the probe lock, then unlink the lockfile, then release |
| `dead` | crashed run's definite orphan | reap immediately (`teardown_app(verify=False)`) + unregister | | acquires, inode stale | another janitor reaped + unlinked while we raced | skip (reap already done; unlinking now would hit a newer run's file) |
| `unknown` | pre-registry app, or post-reboot (`/run` is tmpfs) | age fallback: reap only if stack age ≥ `CCCI_JANITOR_MAX_AGE` (default 2h) | | `BlockingIOError` (held) | live concurrent run | leave it; if lockfile mtime > 120 min (2× the hard deadline): `!! lock for <domain> held >120min — possible leaked run; inspect with lslocks` — flag, **never steal** |
| `open()` fails (`OSError`) | garbled/unopenable lockfile | skip + log, never crash |
Candidate discovery is unchanged from before: `abra app ls` matches against `RUN_APP_RE`, plus a - Reaping under the probe lock closes the janitor-vs-new-run race: a new run of that domain
docker-service sweep that reconstructs domains for stacks whose `.env` was already deleted. blocks in `acquire_app_lock` until the reap finishes — no window where a fresh app coexists
with a half-reaped one.
## 6. Where convergence fits (adjacent, landed with this work) - Two racing janitors arbitrate on the flock: one reaps, the other sees "held" and leaves; reaps
are idempotent (`teardown_app(verify=False)` tolerates half-gone stacks).
Parallel runs surfaced two swarm-convergence bugs that look like concurrency bugs but aren't — - After the candidates, a tidy sweep unlinks stale **unheld** `cc-ci-app-*.lock` files with no
documented here because any restructuring must keep them fixed (`services_converged()` in app behind them (under their own probe lock + identity check), keeping `/run/lock` clean.
`lifecycle.py`): - **Post-reboot**: `/run/lock` is tmpfs → lockfiles gone → every surviving app probes as an
orphan → reaped immediately. (Improvement over the old 2-hour age fallback; there IS no age
- **N/N replicas ≠ converged** during a stop-first rolling update: the update is *registered* logic anymore.)
instantly but the OLD task still shows 1/1 until swarm cycles it (build 238: backupbot exec'd a
pre-hook into a container killed seconds later → 409 → empty snapshot). `services_converged()`
therefore also inspects each service's `UpdateStatus.State`.
- **`paused` persists forever**: swarm's default `update-failure-action: pause` sets it on one
task flicker and it never clears, even at N/N healthy (build 241 hung 22 min). Only `updating`
and `rollback_started` block convergence; `paused`/`rollback_paused`/`completed` are settled —
the HTTP-health and tier assertions still gate actual app correctness.
- `backup_app()` additionally waits (bounded, 300s) for `services_converged()` before
`abra app backup create`, as defence in depth for the backupbot race.
## 7. Failure-mode guarantees ## 7. Failure-mode guarantees
| Event | Outcome | | Event | Outcome |
|---|---| |---|---|
| Run crashes / SIGKILL mid-run | flock auto-released by kernel; pidfile remains but owner is `dead` → next janitor (any run's start) reaps app + pidfile | | Run crashes / SIGKILL mid-run | flock auto-released by kernel → next janitor probe reaps app + lockfile |
| Drone build canceled via API | **known gap**: cancel kills the step's `sh` wrapper but can LEAK the python harness child — it keeps running (holding lock + registry) until killed by hand. See §8. | | Drone build canceled via API | step trap TERMs the harness process group → SIGTERM funnel runs the run's own teardown (exit 143); if anything still leaks, PDEATHSIG + janitor reap (the old "cancel leaks the harness" gap is CLOSED) |
| Host reboot | `/run` is tmpfs → locks and registry vanish (correct: no processes survived either). All surviving apps become `unknown`2h age fallback governs. | | Run exceeds 60 min | SIGALRM → distinct log line → own teardown → exit 142 |
| Two same-recipe `!testme`s | second blocks on the flock at run start (before touching the shared tree), runs after the first finishes | | Host reboot | locks and lockfiles vanish (tmpfs, correct: no owners survived) → all surviving run apps reaped at the next run start, immediately |
| Janitor vs. app being created | impossible to mis-reap: registration happens before app creation, and an `alive` owner is never reaped | | Two same-recipe `!testme`s (different PRs) | run in parallel — separate domains, separate per-run recipe trees |
| Pid reuse after crash | cmdline check (`run_recipe_ci`) classifies as `dead`, orphan still reaped | | Double-`!testme` (same PR → same domain) | second blocks on the app lock before creating anything, visibly in its drone log, runs after the first finishes |
| Janitor vs. app being created | impossible to mis-reap: the lock is held before `app new`, and a held lock is never touched |
| Janitor unlink vs. blocked waiter | inode identity re-check on every acquisition → waiter retries on the live path |
| Lock held implausibly long (>120 min) | flagged loudly for a human (`lslocks`), never stolen |
## 8. Known limitations / restructuring candidates ## 8. Where convergence fits (adjacent; unchanged by the restructure)
1. **Drone cancel leaks the harness process.** The exec runner kills the step shell, not the Two swarm-convergence behaviors in `services_converged()` look like concurrency bugs but aren't —
process tree; the leaked python continues deploying/holding the lock. Fix ideas: run the step any future work must keep them fixed:
under `setsid` + a trap that kills the process group, or have the harness watch
`DRONE_BUILD_STATUS`/parent-death (`PR_SET_PDEATHSIG`).
2. **Head-of-line blocking on same-recipe serialisation.** A run waiting on the recipe flock
still occupies one of the 2 runner slots, so two builds of the SAME recipe temporarily starve
all other recipes. Alternatives: a Drone-level per-recipe fan-out (one pipeline `concurrency`
group per recipe is not natively expressible), or detect-and-requeue in the harness.
3. **The lock protects harness runs only.** Manual `abra`/`git` activity on
`~/.abra/recipes/<recipe>` (operator or another agent) bypasses the flock — the
"park the checkout, then hands off" discipline is still required. A restructure could make the
harness deploy from a per-run copy of the recipe tree instead of the shared checkout
(eliminates the lock entirely, at the cost of diverging from abra's expected layout).
4. **`HOME=/root` is still shared.** Safe today by argument (§3), not by enforcement. Per-build
`ABRA_DIR` with a shared read-only server config would make isolation structural.
5. **Registry is advisory.** Nothing stops a non-harness actor from creating run-app-shaped
stacks the janitor will eventually age-reap; conversely the janitor trusts pidfiles it can
parse. Acceptable on a single-purpose CI host.
6. **Capacity is configured in two places** (`drone-runner.nix` + `.drone.yml`) that must be
kept in step by hand.
## 9. File / symbol index - **N/N replicas ≠ converged** during a stop-first rolling update — `UpdateStatus.State` is also
inspected (build 238: backupbot exec'd into a container killed seconds later).
- **`paused` persists forever** (swarm's default `update-failure-action`) — only `updating` and
`rollback_started` block convergence; `paused`/`rollback_paused` are settled (build 241).
- `backup_app()` additionally waits (bounded 300s) for convergence before `backup create`.
## 9. Configuration knobs
| Knob | Where | Current | Meaning |
|---|---|---|---|
| `DRONE_RUNNER_CAPACITY` (aka `MAX_TESTS`) | `nix/modules/drone-runner.nix` (`maxTests`) | `2` | **THE single concurrency knob.** Max builds the exec runner executes at once; Drone queues the rest. (The `.drone.yml` `concurrency.limit` duplicate was removed.) Change requires `nixos-rebuild switch`. |
| `CCCI_APP_LOCK_DIR` | env, read at call time | unset → `/run/lock` | App-domain lockfile dir override — used by `tests/concurrency` to sandbox locks. Never set in production. |
| hard deadline | `lifetime.HARD_DEADLINE_SECONDS` | 3600 s | the whole-run alarm; long-held flag threshold is 2× this (`LONG_HELD_LOCK_SECONDS`) |
## 10. Testing: `tests/concurrency/`
Real-kernel suite (19 planned cases + companions): helper subprocesses hold REAL flocks and
install the REAL prctl/signal/alarm guards — flock itself is never mocked; the janitor runs with
injected candidates + stubbed teardown but probes real locks. **Not part of the default
`pytest tests/unit` gate** (it spawns processes and sleeps); run it explicitly:
```
cc-ci-run -m pytest tests/concurrency -q
```
Covers: kernel auto-release on SIGKILL; LOCK_NB probe semantics; PEP 446 fd non-inheritance;
same-domain serialisation; orphan reap + unlink; live-run protection; reap-under-probe-lock
blocking; two-janitor arbitration; reboot-immediate reap; long-held flag; RUN_APP_RE allowlist;
degrade-on-garbage; PDEATHSIG; ppid start race; deadline + SIGTERM funnels; per-run ABRA_DIR
construction/export; concurrent same-recipe fetch isolation; symlinked-servers .env canonicality.
## 11. File / symbol index
| What | Where | | What | Where |
|---|---| |---|---|
| `maxTests` / `DRONE_RUNNER_CAPACITY` | `nix/modules/drone-runner.nix` | | lifetime guards (PDEATHSIG, signal funnels, deadline) | `runner/harness/lifetime.py`; installed in `run_recipe_ci.main()` |
| `concurrency.limit`, `HOME=/root`, env | `.drone.yml` (`recipe-ci` pipeline) | | setsid/trap cancel forwarding | `.drone.yml` (`recipe-ci` step) |
| Lock + registry constants & helpers | `runner/harness/lifecycle.py` (top, after `TeardownError`) | | `acquire_app_lock`, `_held_app_locks`, `_app_lock_path` | `runner/harness/lifecycle.py` |
| `acquire_recipe_lock` call site | `runner/run_recipe_ci.py` `main()`, before `fetch_recipe()` | | `acquire_app_lock` call site | `lifecycle.deploy_app()` (before app creation) |
| `register_run_app` call site | `lifecycle.deploy_app()` (before app creation) | | janitor + probe (`janitor`, `_probe_and_reap`, `LONG_HELD_LOCK_SECONDS`) | `runner/harness/lifecycle.py` |
| `unregister_run_app` call sites | `lifecycle.teardown_app()`, `lifecycle.janitor()` | | per-run ABRA_DIR (`setup_run_abra_dir`, `fetch_recipe`) | `runner/run_recipe_ci.py` |
| Janitor + decision table | `lifecycle.janitor()`, `_run_owner_state()` | | path resolution (`abra_dir`, `recipe_dir`) | `runner/harness/abra.py` (used by `generic`, `lifecycle.prepull_images`, `warm_reconcile`) |
| Run-app naming | `runner/harness/naming.py` (`app_domain`), `RUN_APP_RE` in `lifecycle.py` | | run-app naming | `runner/harness/naming.py` (`app_domain`), `RUN_APP_RE` in `lifecycle.py` |
| Convergence (adjacent) | `lifecycle.services_converged()`, `lifecycle.backup_app()` | | capacity knob | `nix/modules/drone-runner.nix` (`maxTests`) |
| convergence (adjacent) | `lifecycle.services_converged()`, `lifecycle.backup_app()` |
| the test suite | `tests/concurrency/` (`helpers.py` subprocess entrypoints, `concutil.py` probes) |
Deleted in the restructure (grep should find NOTHING): `register_run_app`, `unregister_run_app`,
`_run_owner_state`, `ACTIVE_RUN_DIR`, `CCCI_JANITOR_MAX_AGE`, `_stack_age_seconds`,
`acquire_recipe_lock`, `RECIPE_LOCK_DIR`.

View File

@ -8,18 +8,18 @@
{ pkgs, config, lib, ... }: { pkgs, config, lib, ... }:
let let
# MAX_TESTS (plan §4.2/§4.3 resource safety): max CI builds the exec runner runs at once. Drone # MAX_TESTS (plan §4.2/§4.3 resource safety): max CI builds the exec runner runs at once. Drone
# queues the rest in its native pending-build queue (no custom queue). THE concurrency cap that # queues the rest in its native pending-build queue (no custom queue). THE SINGLE concurrency
# bounds how many test apps can be live at once. # knob — nothing else caps recipe-ci parallelism (the .drone.yml concurrency.limit was removed:
# one knob, one place). Bounds how many test apps can be live at once.
# #
# Raised to 2 (operator request 2026-06-09) so two recipes can be tested in parallel (e.g. immich # Raised to 2 (operator request 2026-06-09) so two recipes can be tested in parallel (e.g. immich
# and plausible under active development at once). Verified safe on the current node (Hetzner cpx22, # and plausible under active development at once). Verified safe on the current node (Hetzner cpx22,
# ~7.6 GiB / 4 vCPU — NOTE: smaller than the original 28 GiB this was written for): a full immich CI # ~7.6 GiB / 4 vCPU — NOTE: smaller than the original 28 GiB this was written for): a full immich CI
# stack measured ~1 GiB (server+ML+pg+redis) with multiple GiB free, so two concurrent recipes fit. # stack measured ~1 GiB (server+ML+pg+redis) with multiple GiB free, so two concurrent recipes fit.
# The concurrency PRECONDITION holds: the run-start janitor is age-based (default 2h) + run-app-name # Concurrent-run safety is the harness's job at ANY capacity (docs/concurrency.md): per-run
# scoped, so it never reaps a concurrent in-flight run (harness.lifecycle.janitor). TRADE-OFF: with # ABRA_DIR recipe trees, per-app-domain flocks, and a flock-probe janitor that reaps a crashed
# capacity>1 a SIGKILL'd build (no teardown) leaves an orphan the run-start sweep can't reap # build's orphan immediately (held lock = live run, never touched). Revert to "1" if OOM /
# immediately (it might be a live run) — bounded instead by the 2h janitor + the /upgrade-all # disk-I/O contention is observed under load.
# start/end reap + sweep-orphans. Revert to "1" if OOM / disk-I/O contention is observed under load.
maxTests = "2"; maxTests = "2";
in in
{ {

View File

@ -10,6 +10,7 @@ Bakes in the known abra gotchas (re-verify per installed abra version, currently
from __future__ import annotations from __future__ import annotations
import json import json
import os
import subprocess import subprocess
ABRA = "abra" ABRA = "abra"
@ -19,6 +20,20 @@ class AbraError(RuntimeError):
pass pass
def abra_dir() -> str:
"""abra's state dir, resolved the same way the abra CLI resolves it: $ABRA_DIR if set, else
~/.abra. Inside a CI run, run_recipe_ci exports a PER-RUN $ABRA_DIR (fresh recipes/, shared
servers/+catalogue/ symlinks) before any abra call, so every helper here and every abra
subprocess agree on the same tree; outside a run (warm_reconcile's systemd timer, manual use)
both fall back to the canonical /root/.abra."""
return os.environ.get("ABRA_DIR") or os.path.expanduser("~/.abra")
def recipe_dir(recipe: str) -> str:
"""The current ABRA_DIR's working tree for a recipe (per-run inside a CI run)."""
return os.path.join(abra_dir(), "recipes", recipe)
def _run_pty( def _run_pty(
args: list[str], timeout: int = 900, check: bool = True args: list[str], timeout: int = 900, check: bool = True
) -> subprocess.CompletedProcess: ) -> subprocess.CompletedProcess:
@ -77,9 +92,7 @@ def recipe_checkout(recipe: str, version: str) -> None:
a chaos (`-C`) deploy ignores ENV VERSION and uses the current checkout — together that silently a chaos (`-C`) deploy ignores ENV VERSION and uses the current checkout — together that silently
deployed LATEST for a 'previous-version' base, making the upgrade a no-op (Adversary F1d-2). With deployed LATEST for a 'previous-version' base, making the upgrade a no-op (Adversary F1d-2). With
this checkout + a non-chaos deploy, a pinned deploy genuinely deploys that version.""" this checkout + a non-chaos deploy, a pinned deploy genuinely deploys that version."""
import os path = recipe_dir(recipe)
path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
# -f (force): the version-pinning checkout must yield the EXACT ref tree. Without it, a cc-ci # -f (force): the version-pinning checkout must yield the EXACT ref tree. Without it, a cc-ci
# install_steps-provided overlay (e.g. discourse's compose.ccci.yml, copied into the pinned base) # install_steps-provided overlay (e.g. discourse's compose.ccci.yml, copied into the pinned base)
# is an UNTRACKED file that collides with the same path TRACKED in a later ref, and # is an UNTRACKED file that collides with the same path TRACKED in a later ref, and
@ -100,9 +113,7 @@ def has_lightweight_version_tags(recipe: str) -> bool:
'reference not found'.) The caller (deploy_app) uses this to fall back to a chaos base deploy 'reference not found'.) The caller (deploy_app) uses this to fall back to a chaos base deploy
(which skips lint and deploys the explicitly-checked-out pinned version — see lifecycle.deploy_app). (which skips lint and deploys the explicitly-checked-out pinned version — see lifecycle.deploy_app).
Read-only: just `git tag` + `cat-file -t`; no fetch/mutation, so it can't trigger abra's revert.""" Read-only: just `git tag` + `cat-file -t`; no fetch/mutation, so it can't trigger abra's revert."""
import os path = recipe_dir(recipe)
path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
tags = subprocess.run( tags = subprocess.run(
["git", "-C", path, "tag", "-l"], capture_output=True, text=True ["git", "-C", path, "tag", "-l"], capture_output=True, text=True
).stdout.split() ).stdout.split()
@ -231,9 +242,7 @@ def recipe_head_commit(recipe: str) -> str | None:
"""The current HEAD commit of the recipe checkout — captured right after fetch (the PR head, or """The current HEAD commit of the recipe checkout — captured right after fetch (the PR head, or
the catalogue current) so the upgrade tier can re-checkout it for the chaos redeploy after the the catalogue current) so the upgrade tier can re-checkout it for the chaos redeploy after the
prev-tag base deploy reset the working tree (HC1).""" prev-tag base deploy reset the working tree (HC1)."""
import os path = recipe_dir(recipe)
path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
proc = subprocess.run(["git", "-C", path, "rev-parse", "HEAD"], capture_output=True, text=True) proc = subprocess.run(["git", "-C", path, "rev-parse", "HEAD"], capture_output=True, text=True)
out = proc.stdout.strip() out = proc.stdout.strip()
return out or None return out or None
@ -241,10 +250,7 @@ def recipe_head_commit(recipe: str) -> str | None:
def recipe_versions(recipe: str) -> list[str]: def recipe_versions(recipe: str) -> list[str]:
"""Published versions of a recipe, oldest→newest (from the recipe git tags).""" """Published versions of a recipe, oldest→newest (from the recipe git tags)."""
import os path = recipe_dir(recipe)
import subprocess
path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
proc = subprocess.run( proc = subprocess.run(
["git", "-C", path, "tag", "--sort=creatordate"], capture_output=True, text=True ["git", "-C", path, "tag", "--sort=creatordate"], capture_output=True, text=True
) )

View File

@ -25,7 +25,7 @@ _BACKUPBOT_RE = re.compile(r"backupbot\.backup\b[^\n]*\btrue\b", re.IGNORECASE)
def _recipe_dir(recipe: str) -> str: def _recipe_dir(recipe: str) -> str:
return os.path.expanduser(f"~/.abra/recipes/{recipe}") return abra.recipe_dir(recipe) # the per-run tree inside a CI run ($ABRA_DIR)
def backup_capable(recipe: str, meta: dict | None = None) -> bool: def backup_capable(recipe: str, meta: dict | None = None) -> bool:

View File

@ -7,8 +7,8 @@ next run. Callers wrap deploy()/teardown() in try/finally (or a pytest finalizer
from __future__ import annotations from __future__ import annotations
import contextlib import contextlib
import datetime
import fcntl import fcntl
import glob
import json import json
import os import os
import re import re
@ -18,7 +18,7 @@ import subprocess
import time import time
import urllib.request import urllib.request
from . import abra from . import abra, lifetime
GATEWAY_IP = "143.244.213.108" # *.ci.commoninternet.net -> gateway (TLS passthrough to cc-ci) GATEWAY_IP = "143.244.213.108" # *.ci.commoninternet.net -> gateway (TLS passthrough to cc-ci)
# A run app domain is "<recipe[:4]>-<6hex>.ci.commoninternet.net" (see DECISIONS.md). Used by the # A run app domain is "<recipe[:4]>-<6hex>.ci.commoninternet.net" (see DECISIONS.md). Used by the
@ -31,72 +31,67 @@ class TeardownError(RuntimeError):
# --- Concurrent-run safety (capacity=2) ------------------------------------------------------- # --- Concurrent-run safety (capacity=2) -------------------------------------------------------
# Two cooperating mechanisms, both process-lifetime-scoped so SIGKILL can't leak a stale lock: # ONE mechanism, process-lifetime-scoped so SIGKILL can't leak a stale claim: every run holds an
# 1. Per-recipe flock: ~/.abra/recipes/<recipe> is ONE shared working tree that fetch_recipe # exclusive kernel flock on its app DOMAIN (/run/lock/cc-ci-app-<domain>.lock) for the whole run.
# rm-rf's/reclones and the upgrade tier git-checkouts mid-run. Concurrent runs of the SAME # A held lock implies a live owner — the kernel releases a flock when the holding process dies,
# recipe would corrupt each other's deploy tree (observed: immich builds 229/230 deployed a # however it dies. The janitor probes the lock (LOCK_NB) to tell a live concurrent run (held
# tree missing its config), so they serialise on an exclusive flock; different recipes run in # leave it) from a crashed run's orphan (acquirable → reap it); it never inspects pids and never
# parallel. The kernel drops a flock when the holder dies, however it dies. # steals a held lock. Recipe-tree corruption between same-recipe runs is gone structurally (each
# 2. Active-run registry: each run registers its app domain + pid before creating the app, so the # run deploys from its own per-run ABRA_DIR — there is no shared recipe tree and no recipe lock),
# janitor can tell a live concurrent run from a crashed run's orphan (see janitor()). # and same-domain runs (double-!testme of one PR) serialise on this app lock.
RECIPE_LOCK_DIR = "/run/lock" # See docs/concurrency.md.
ACTIVE_RUN_DIR = "/run/cc-ci-active"
# Acquired app-lock file objects are retained here for the REMAINING PROCESS LIFETIME: if the
# caller drops the returned file object, GC would close the fd and silently release the lock —
# this list is the lock's owner of record. Never cleared; release is process exit.
_held_app_locks: list = []
def acquire_recipe_lock(recipe: str): def _app_lock_dir() -> str:
"""Take the per-recipe exclusive lock; blocks (with a log line) if another run of the same """The app-domain lockfile dir. /run/lock (tmpfs: a reboot clears locks AND lockfiles, so
recipe is in flight. Returns the open lock file — the CALLER must keep a reference for the post-reboot apps probe as orphans and are reaped immediately). Env-overridable so the
whole run; the lock is released only when the process exits and the fd closes.""" tests/concurrency suite (and its helper subprocesses) can use a sandbox dir."""
path = os.path.join(RECIPE_LOCK_DIR, f"cc-ci-recipe-{recipe}.lock") return os.environ.get("CCCI_APP_LOCK_DIR", "/run/lock")
f = open(path, "w") # noqa: SIM115 — deliberately held for the lifetime of the run
try:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB) def _app_lock_path(domain: str) -> str:
except BlockingIOError: return os.path.join(_app_lock_dir(), f"cc-ci-app-{domain}.lock")
print(
f"== recipe lock: another {recipe} run is in flight — waiting for {path} "
"(shared ~/.abra/recipes checkout) ==", def acquire_app_lock(domain: str):
flush=True, """Take the per-app-domain exclusive lock; blocks (with a log line) if another run of the
) same domain is in flight (double-!testme serialisation). Returns the open lock file, which is
fcntl.flock(f, fcntl.LOCK_EX) ALSO retained in _held_app_locks so the flock lives exactly as long as the process.
print(f"== recipe lock: acquired {path} ==", flush=True)
Unlink/recreate race guard: the janitor unlinks a reaped orphan's lockfile while holding its
flock, so a waiter blocked on the OLD inode can win a lock no later opener can observe (a new
open() at the path creates a FRESH inode). After every acquisition, verify the locked fd is
still the file at the path (st_ino match); if not, drop it and retry on the live path."""
path = _app_lock_path(domain)
waited = False
while True:
# PEP 446: the fd is non-inheritable, so subprocess children never carry the lock.
f = open(path, "a") # noqa: SIM115 — deliberately held for the rest of the process
try:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
if not waited:
print(f"== app lock: another run of {domain} is in flight — waiting ==", flush=True)
waited = True
fcntl.flock(f, fcntl.LOCK_EX)
try:
if os.fstat(f.fileno()).st_ino == os.stat(path).st_ino:
break # we hold the lock on the inode the path names — done
except FileNotFoundError:
pass
f.close() # locked a stale (unlinked) inode — retry on the live path
os.utime(f.fileno()) # mtime = acquisition time = lock age (janitor's long-held flag)
_held_app_locks.append(f)
if waited:
print(f"== app lock: acquired {path} ==", flush=True)
return f return f
def _registry_path(domain: str) -> str:
return os.path.join(ACTIVE_RUN_DIR, domain)
def register_run_app(domain: str) -> None:
"""Record this process as the live owner of a run app (called BEFORE the app is created, so a
concurrent run's janitor can never observe the app without its registration)."""
with contextlib.suppress(OSError):
os.makedirs(ACTIVE_RUN_DIR, exist_ok=True)
with open(_registry_path(domain), "w") as f:
f.write(str(os.getpid()))
def unregister_run_app(domain: str) -> None:
with contextlib.suppress(OSError):
os.remove(_registry_path(domain))
def _run_owner_state(domain: str) -> str:
"""'alive' if the registered owner is a live run_recipe_ci process, 'dead' if registered but
gone (definite orphan), 'unknown' if never registered (pre-registry code or post-reboot)."""
try:
with open(_registry_path(domain)) as f:
pid = int(f.read().strip())
except (OSError, ValueError):
return "unknown"
try:
with open(f"/proc/{pid}/cmdline", "rb") as f:
cmdline = f.read().decode(errors="replace").replace("\0", " ")
except OSError:
return "dead"
# Guard against pid reuse: the owner must still look like a harness run.
return "alive" if "run_recipe_ci" in cmdline else "dead"
def _docker_names(kind: str, stack: str) -> list[str]: def _docker_names(kind: str, stack: str) -> list[str]:
"""docker <kind> ls names filtered to a stack (kind: service|volume|secret).""" """docker <kind> ls names filtered to a stack (kind: service|volume|secret)."""
proc = subprocess.run( proc = subprocess.run(
@ -116,31 +111,6 @@ def _residual(domain: str) -> dict:
} }
def _stack_age_seconds(stack: str) -> float | None:
"""Age of the stack's oldest service, or None if not present."""
svcs = _docker_names("service", stack)
if not svcs:
return None
oldest = None
for s in svcs:
p = subprocess.run(
["docker", "service", "inspect", s, "--format", "{{.CreatedAt}}"],
capture_output=True,
text=True,
)
ts = p.stdout.strip()
try:
# docker emits e.g. 2026-05-27 00:12:33.123 +0000 UTC -> take the leading 19 chars
dt = datetime.datetime.strptime(ts[:19], "%Y-%m-%d %H:%M:%S").replace(
tzinfo=datetime.UTC
)
except ValueError:
continue
age = (datetime.datetime.now(datetime.UTC) - dt).total_seconds()
oldest = age if oldest is None else max(oldest, age)
return oldest
def _recipe_extra_env(recipe: str, domain: str) -> dict[str, str]: def _recipe_extra_env(recipe: str, domain: str) -> dict[str, str]:
"""Per-recipe extra .env keys, applied at every deploy (install + upgrade's old_app) so a recipe """Per-recipe extra .env keys, applied at every deploy (install + upgrade's old_app) so a recipe
with multi-domain / config needs is enrolled with NO shared-harness change (D5/M6.5). A recipe with multi-domain / config needs is enrolled with NO shared-harness change (D5/M6.5). A recipe
@ -217,9 +187,9 @@ def prepull_images(recipe: str, domain: str) -> None:
app-INIT time (slow-init apps like collabora/immich still need their recipe healthcheck/READY_PROBE). app-INIT time (slow-init apps like collabora/immich still need their recipe healthcheck/READY_PROBE).
Best-effort on resolution failure (skip + let the deploy pull as usual); HARD-fails on a real Best-effort on resolution failure (skip + let the deploy pull as usual); HARD-fails on a real
pull error (don't mask it).""" pull error (don't mask it)."""
import os recipe_dir = abra.recipe_dir(recipe) # per-run tree inside a CI run
# The app .env lives in the CANONICAL servers path (the per-run ABRA_DIR's servers/ is a
recipe_dir = os.path.expanduser(f"~/.abra/recipes/{recipe}") # symlink to it, so abra and this path agree on the same file).
env_path = os.path.expanduser(f"~/.abra/servers/default/{domain}.env") env_path = os.path.expanduser(f"~/.abra/servers/default/{domain}.env")
if not os.path.isdir(recipe_dir) or not os.path.isfile(env_path): if not os.path.isdir(recipe_dir) or not os.path.isfile(env_path):
print(f" prepull: recipe dir or .env missing for {recipe} — skipping", flush=True) print(f" prepull: recipe dir or .env missing for {recipe} — skipping", flush=True)
@ -278,9 +248,10 @@ def deploy_app(
past the 900s default. abra's INTERNAL TIMEOUT (recipe's TIMEOUT env, default 300s) is set via past the 900s default. abra's INTERNAL TIMEOUT (recipe's TIMEOUT env, default 300s) is set via
EXTRA_ENV; this is the Python subprocess wrapper's timeout so abra doesn't get SIGKILLed mid-deploy.""" EXTRA_ENV; this is the Python subprocess wrapper's timeout so abra doesn't get SIGKILLed mid-deploy."""
_record_deploy() _record_deploy()
# Register BEFORE the app exists: a concurrent run's janitor must never see this app without # Lock BEFORE the app exists: a concurrent run's janitor must never see this app without a
# its live-owner registration (it would reap an in-flight deploy). # held app lock (it would probe it as an orphan and reap an in-flight deploy). Also the
register_run_app(domain) # double-!testme serialisation point: a second run of the same domain blocks here.
acquire_app_lock(domain)
abra.app_config_remove(domain) # clear any stale .env from a prior crashed run abra.app_config_remove(domain) # clear any stale .env from a prior crashed run
abra.app_new(recipe, domain, version=version, secrets=secrets) abra.app_new(recipe, domain, version=version, secrets=secrets)
# A pinned version must actually deploy that version: check the recipe out to the tag so the # A pinned version must actually deploy that version: check the recipe out to the tag so the
@ -719,23 +690,84 @@ def teardown_app(domain: str, verify: bool = True) -> None:
residual = _residual(domain) residual = _residual(domain)
if any(residual.values()): if any(residual.values()):
raise TeardownError(f"teardown left residual for {domain}: {residual}") raise TeardownError(f"teardown left residual for {domain}: {residual}")
# The app is gone — drop its active-run registration (janitor() also clears it when reaping). # No unregistration step: the app lock releases implicitly at process exit. The clean run's
unregister_run_app(domain) # leftover lockfile (unheld) is unlinked on sight by the next janitor's stale-lockfile sweep.
def janitor(max_age_seconds: int | None = None) -> None: # A lock held longer than 2x the 60-min hard deadline can only be a leaked run (the deadline
"""Reap orphaned run apps from crashed/rebooted runs. Matches the real naming scheme. Safe under # bounds every healthy run). Flag it for a human — NEVER steal a held lock.
CONCURRENT runs (capacity=2): every harness run registers its app in the active-run registry LONG_HELD_LOCK_SECONDS = 2 * lifetime.HARD_DEADLINE_SECONDS
(register_run_app), so the janitor distinguishes the three cases instead of using age alone:
- registered + owner run_recipe_ci process ALIVE -> in-flight concurrent run: never reap
- registered + owner DEAD (crashed/SIGKILLed run) -> definite orphan: reap immediately
- no registry entry (pre-registry code, reboot) -> fall back to the age threshold
Reaps via docker primitives so it works even when the .env is gone (A2/A3). Age fallback default
2h, env-overridable via CCCI_JANITOR_MAX_AGE."""
import os
if max_age_seconds is None:
max_age_seconds = int(os.environ.get("CCCI_JANITOR_MAX_AGE", "7200")) def _probe_and_reap(domain: str) -> None:
"""Probe one run app's lock; reap iff nobody holds it (kernel-guaranteed orphan).
Reaping happens WHILE HOLDING the probe lock, closing the janitor-vs-new-run race: a new run
of the same domain blocks in acquire_app_lock until the reap finishes, so a fresh app never
coexists with a half-reaped one. The lockfile is unlinked before release (still holding the
lock); a waiter that blocked on the unlinked inode re-checks identity and retries. Two racing
janitors arbitrate on the same flock: one reaps, the other sees 'held' and leaves —
teardown_app(verify=False) is idempotent either way."""
path = _app_lock_path(domain)
try:
# PEP 446: non-inheritable fd, same as acquire_app_lock.
f = open(path, "a") # noqa: SIM115 — closed in the finally below, lock released with it
except OSError as e:
print(f"!! janitor: cannot open lockfile {path} ({e}) — skipping {domain}", flush=True)
return
try:
try:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
# Held -> live run. Never steal; flag if it has been held implausibly long.
try:
held_for = time.time() - os.stat(path).st_mtime
except OSError:
held_for = 0
if held_for > LONG_HELD_LOCK_SECONDS:
print(
f"!! lock for {domain} held >{LONG_HELD_LOCK_SECONDS // 60}min — possible "
"leaked run; inspect with lslocks",
flush=True,
)
else:
print(
f" janitor: {domain} lock held — live concurrent run, leaving it", flush=True
)
return
# Acquired — but only the inode the PATH names counts (another janitor may have reaped
# and unlinked this inode while we raced; a lock on an unlinked inode protects nothing
# and unlinking the path now would delete a NEWER run's lockfile).
try:
if os.fstat(f.fileno()).st_ino != os.stat(path).st_ino:
return
except FileNotFoundError:
return
# Orphan: no live owner (the kernel released the lock when the owner died). Reap while
# holding the probe lock, then unlink the lockfile before releasing.
print(f" janitor: {domain} lock acquirable — orphan, reaping", flush=True)
with contextlib.suppress(Exception):
teardown_app(domain, verify=False)
with contextlib.suppress(OSError):
os.unlink(path)
finally:
f.close()
def janitor() -> None:
"""Reap orphaned run apps from crashed/rebooted runs; the kernel flock is the only liveness
oracle. For every candidate run app, probe its app-domain lock (LOCK_NB):
acquirable -> nobody holds it -> orphan -> reap under the probe lock + unlink lockfile
held -> live concurrent run -> leave it (warn if held >2x the hard deadline)
Candidate discovery is unchanged: `abra app ls` + a docker-service sweep (catches stacks
whose .env is already gone), both matched against RUN_APP_RE — warm/canonical apps never
match and are never probed. Post-reboot, /run/lock (tmpfs) is empty, so every surviving app
probes as an orphan and is reaped immediately (no age threshold). Stale lockfiles with no
app behind them are unlinked on sight. Degrades safely: an unreadable lockfile/dir is
skipped with a log line, never a crash. Reaps via docker primitives so it works even when
the .env is gone (A2/A3)."""
seen = set() seen = set()
for app in abra.app_ls(): for app in abra.app_ls():
name = app.get("appName") or app.get("domain") or "" name = app.get("appName") or app.get("domain") or ""
@ -749,18 +781,22 @@ def janitor(max_age_seconds: int | None = None) -> None:
seen.add(f"{m.group(1)}.ci.commoninternet.net") seen.add(f"{m.group(1)}.ci.commoninternet.net")
for name in seen: for name in seen:
owner = _run_owner_state(name) _probe_and_reap(name)
if owner == "alive":
print(f" janitor: {name} is a live concurrent run — leaving it", flush=True) # Tidy /run/lock: a clean run's leftover lockfile is unheld and appless — unlink it (under
continue # its own probe lock, with the same identity check as above).
if owner == "unknown": with contextlib.suppress(OSError):
# No registry entry (manual run on pre-registry code, or post-reboot): only the age for path in glob.glob(os.path.join(_app_lock_dir(), "cc-ci-app-*.lock")):
# threshold protects it, as before. domain = os.path.basename(path)[len("cc-ci-app-") : -len(".lock")]
stack = _stack_name(name) if domain in seen:
age = _stack_age_seconds(stack) continue # handled (or deliberately left) above
if age is not None and age < max_age_seconds: with contextlib.suppress(OSError):
continue # young and of unknown provenance; leave it f = open(path, "a") # noqa: SIM115 — closed below, lock released with it
# owner == "dead" (a crashed/killed run's definite orphan) or old enough -> reap try:
with contextlib.suppress(Exception): fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
teardown_app(name, verify=False) if os.fstat(f.fileno()).st_ino == os.stat(path).st_ino:
unregister_run_app(name) os.unlink(path)
except (BlockingIOError, FileNotFoundError):
pass # held (live run pre-deploy) or already gone — leave it
finally:
f.close()

View File

@ -0,0 +1,95 @@
"""Run-lifetime hardening (concurrency restructure P1).
The concurrency model's invariant chain is:
lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min hard deadline
Locks are kernel flocks released on process exit, so the only thing that needs managing is the
PROCESS lifetime. Three guards, installed at run startup (before any abra call) by
`install_lifetime_guards()`:
1. `PR_SET_PDEATHSIG(SIGTERM)`: if the parent (the drone step shell) dies — cancel, runner
crash, host shutdown of the step — the kernel delivers SIGTERM to the harness, so a dead
build can never leak a running harness that holds locks. Paired with a ppid==1 re-check
AFTER the prctl: a parent that died BEFORE the prctl took effect would never trigger the
death signal, so a harness that finds itself already reparented refuses to run.
2. SIGTERM handler: raise SystemExit so the run's `finally:` teardown funnel executes and the
process exits non-zero. Re-entrant deliveries during teardown are logged and IGNORED so a
second signal can't abort the cleanup the first one asked for (`begin_teardown()` guards
this; the run's own `finally:` blocks also call it so a signal landing mid-normal-teardown
can't abort that either).
3. `signal.alarm(3600)`: self-imposed hard deadline. SIGALRM funnels into the same teardown
path with a distinct log line. Teardown time after the deadline is not alarm-bounded —
interrupting a teardown buys nothing; the janitor (flock probe) is the backstop if a
teardown wedges and the process is killed harder.
"""
from __future__ import annotations
import ctypes
import os
import signal
import sys
HARD_DEADLINE_SECONDS = 60 * 60
_PR_SET_PDEATHSIG = 1 # linux/prctl.h
_state = {"tearing_down": False}
def begin_teardown() -> None:
"""Mark the teardown funnel as running. From here on SIGTERM/SIGALRM must NOT raise — it
would abort the very cleanup it asks for — so the handlers log and return instead. Called by
the handlers themselves before raising, and at the top of the run's `finally:` blocks."""
_state["tearing_down"] = True
def _funnel_handler(log_line: str, exit_code: int):
"""A signal handler that routes into the teardown funnel exactly once: log, then raise
SystemExit (propagates through the run's try/finally → teardown executes → non-zero exit).
While teardown is already running, further signals are logged and swallowed."""
def handler(signum: int, frame) -> None: # noqa: ARG001
print(log_line, flush=True)
if _state["tearing_down"]:
print(
f"== signal {signum} during teardown — ignored (teardown continues, "
"exit stays non-zero) ==",
flush=True,
)
return
begin_teardown()
raise SystemExit(exit_code)
return handler
def install_lifetime_guards(deadline_seconds: int = HARD_DEADLINE_SECONDS) -> None:
"""Install all three lifetime guards (see module docstring). Must run at harness startup,
before any abra call and before any lock is taken."""
libc = ctypes.CDLL("libc.so.6", use_errno=True)
if libc.prctl(_PR_SET_PDEATHSIG, signal.SIGTERM, 0, 0, 0) != 0:
err = ctypes.get_errno()
raise OSError(err, f"prctl(PR_SET_PDEATHSIG, SIGTERM) failed: {os.strerror(err)}")
# The prctl is armed now — but only fires for a parent death AFTER this point. If the parent
# already died, we are reparented (ppid 1) and would never get the signal: refuse to run, an
# orphaned harness would hold locks/apps with nothing managing its lifetime.
if os.getppid() == 1:
sys.exit("parent died before prctl(PR_SET_PDEATHSIG) — refusing to run orphaned")
signal.signal(
signal.SIGTERM,
_funnel_handler(
"== SIGTERM received (drone cancel / parent death) — tearing down ==",
128 + signal.SIGTERM,
),
)
minutes = deadline_seconds // 60
signal.signal(
signal.SIGALRM,
_funnel_handler(
f"== run exceeded {minutes}-minute hard deadline — tearing down ==",
128 + signal.SIGALRM,
),
)
signal.alarm(deadline_seconds)

View File

@ -47,6 +47,7 @@ from harness import ( # noqa: E402
discovery, discovery,
generic, generic,
lifecycle, lifecycle,
lifetime,
naming, naming,
warm, warm,
warmsnap, warmsnap,
@ -137,18 +138,62 @@ def _gitea_token() -> str | None:
return tok or None return tok or None
def setup_run_abra_dir() -> str:
"""P3: build + export this run's PER-RUN ABRA_DIR — structural isolation of recipe trees.
`<runs_dir>/<run-id>/abra/` with:
servers/ -> symlink to the canonical ~/.abra/servers. App .env files land in the shared
canonical path, so janitor discovery (`abra app ls`) and env-based teardown
work unchanged from any process; per-domain filenames + the app-domain lock
prevent write conflicts.
catalogue/ -> symlink to the canonical ~/.abra/catalogue (read-mostly).
recipes/ fresh + empty — THE isolation that matters: each run clones and git-checkouts
its own recipe trees, so concurrent runs (same recipe included) can never
corrupt each other's deploy tree. Replaces the per-recipe flock.
Exported as $ABRA_DIR — honored by the abra CLI and by every harness path helper
(abra.abra_dir()) — BEFORE any abra call. Rides along the existing run-dir retention."""
canonical = os.path.expanduser("~/.abra")
rid = results_mod.run_id()
if rid == "manual":
rid = f"manual-{os.getpid()}" # two concurrent hand-runs must not share a tree
run_abra_dir = os.path.join(results_mod.runs_dir(), rid, "abra")
os.makedirs(os.path.join(run_abra_dir, "recipes"), exist_ok=True)
for shared in ("servers", "catalogue"):
link = os.path.join(run_abra_dir, shared)
if not os.path.islink(link):
os.symlink(os.path.join(canonical, shared), link)
os.environ["ABRA_DIR"] = run_abra_dir
print(
f"== per-run ABRA_DIR: {run_abra_dir} (servers/catalogue -> canonical; fresh recipes/) ==",
flush=True,
)
return run_abra_dir
def fetch_recipe(recipe: str, ref: str | None, src: str | None) -> None: def fetch_recipe(recipe: str, ref: str | None, src: str | None) -> None:
"""Make the recipe available at the code under test. If SRC+REF point at the mirror PR, """Make the recipe available at the code under test in THIS RUN's recipe tree
($ABRA_DIR/recipes/<recipe>): a plain clone — no locking needed, no rm-rf of any shared
state (the rm below only clears this run's own leftovers, e.g. a janitor-triggered
`abra app ls` auto-clone or a Drone build-number reuse). If SRC+REF point at the mirror PR,
clone it at that ref; otherwise fetch the catalogue copy. Private mirror repos need the bot clone it at that ref; otherwise fetch the catalogue copy. Private mirror repos need the bot
token — passed via a per-command http.extraHeader (not persisted in .git/config, not printed).""" token — passed via a per-command http.extraHeader (not persisted in .git/config, not printed)."""
recipes_dir = os.path.expanduser("~/.abra/recipes") dest = abra.recipe_dir(recipe)
os.makedirs(recipes_dir, exist_ok=True) os.makedirs(os.path.dirname(dest), exist_ok=True)
dest = os.path.join(recipes_dir, recipe) # CCCI_SKIP_FETCH=1: use the locally STAGED recipe clone as-is (lets a test/Adversary stage a
# CCCI_SKIP_FETCH=1: use the local recipe clone as-is (lets a test/Adversary stage a fake/broken # fake/broken ref — e.g. a simulated broken PR head for the --quick rollback proof — without it
# ref — e.g. a simulated broken PR head for the --quick rollback proof — without it being clobbered # being clobbered by a re-fetch). Staging happens in the canonical ~/.abra/recipes/<recipe>;
# by a re-fetch). Never set in production CI. # copy it into the per-run tree so the rest of the run reads the staged state. Never set in
# production CI.
if os.environ.get("CCCI_SKIP_FETCH") == "1": if os.environ.get("CCCI_SKIP_FETCH") == "1":
print(f"[fetch] CCCI_SKIP_FETCH=1 — using local {recipe} recipe clone as-is", flush=True) canonical = os.path.expanduser(f"~/.abra/recipes/{recipe}")
subprocess.run(["rm", "-rf", dest], check=False)
if os.path.isdir(canonical):
shutil.copytree(canonical, dest, symlinks=True)
print(
f"[fetch] CCCI_SKIP_FETCH=1 — using staged {recipe} clone as-is "
f"(copied {canonical} -> per-run tree)",
flush=True,
)
return return
if src and ref: if src and ref:
url = f"https://git.autonomic.zone/{src}.git" url = f"https://git.autonomic.zone/{src}.git"
@ -177,7 +222,7 @@ def fetch_recipe(recipe: str, ref: str | None, src: str | None) -> None:
def snapshot_recipe_tests(recipe: str) -> str | None: def snapshot_recipe_tests(recipe: str) -> str | None:
"""Copy the recipe-shipped tests/ to a stable temp dir, immune to abra re-checking-out the """Copy the recipe-shipped tests/ to a stable temp dir, immune to abra re-checking-out the
recipe to a version tag during the run. Returns the snapshot path, or None if no tests/.""" recipe to a version tag during the run. Returns the snapshot path, or None if no tests/."""
src = os.path.expanduser(f"~/.abra/recipes/{recipe}/tests") src = os.path.join(abra.recipe_dir(recipe), "tests")
if not os.path.isdir(src): if not os.path.isdir(src):
return None return None
has_overlay = glob.glob(os.path.join(src, "test_*.py")) or os.path.isfile( has_overlay = glob.glob(os.path.join(src, "test_*.py")) or os.path.isfile(
@ -658,6 +703,8 @@ def run_quick(
results["upgrade"] = "fail" results["upgrade"] = "fail"
results["custom"] = "skip" results["custom"] = "skip"
finally: finally:
# Teardown funnel running: further SIGTERM/SIGALRM are logged + ignored (lifetime.py).
lifetime.begin_teardown()
# F2-11 skip count (read before deciding pass/fail) # F2-11 skip count (read before deciding pass/fail)
requires_deps_skipped = 0 requires_deps_skipped = 0
try: try:
@ -821,6 +868,9 @@ def promote_canonical(recipe: str, head_ref: str | None) -> None:
def main() -> int: def main() -> int:
# P1 lock-lifetime hardening: PDEATHSIG + SIGTERM/SIGALRM teardown funnel + 60-min hard
# deadline, armed before ANY abra call or lock acquisition (see harness/lifetime.py).
lifetime.install_lifetime_guards()
recipe = os.environ.get("RECIPE") recipe = os.environ.get("RECIPE")
if not recipe: if not recipe:
print("RECIPE env is required", file=sys.stderr) print("RECIPE env is required", file=sys.stderr)
@ -835,12 +885,10 @@ def main() -> int:
print( print(
f"== cc-ci run: recipe={recipe} ref={ref} pr={os.environ.get('PR', '0')} stages={sorted(stages)}" f"== cc-ci run: recipe={recipe} ref={ref} pr={os.environ.get('PR', '0')} stages={sorted(stages)}"
) )
# Concurrent-run safety: runs of the SAME recipe serialise on a per-recipe flock — they share # Concurrent-run safety is structural: this run's recipe trees live in its own ABRA_DIR
# ONE ~/.abra/recipes/<recipe> working tree which fetch_recipe (below) rm-rf's/reclones and the # (exported here, before ANY abra call), so no recipe-tree lock exists; same-DOMAIN runs
# upgrade tier git-checkouts mid-run. Must be taken BEFORE fetch_recipe. Different recipes run # serialise on the app-domain flock taken in deploy_app (see docs/concurrency.md).
# in parallel (capacity=2). The reference must stay alive for the whole run: the kernel drops setup_run_abra_dir()
# the flock when the fd closes (including on any crash/SIGKILL — no stale-lock failure mode).
_recipe_lock = lifecycle.acquire_recipe_lock(recipe) # noqa: F841
fetch_recipe(recipe, ref, src) fetch_recipe(recipe, ref, src)
# The PR-head commit the upgrade tier re-checks out for the chaos redeploy to the code under test # The PR-head commit the upgrade tier re-checks out for the chaos redeploy to the code under test
# (HC1). Prefer the explicit PR head sha ($REF) — robust + exact; fall back to the recipe checkout # (HC1). Prefer the explicit PR head sha ($REF) — robust + exact; fall back to the recipe checkout
@ -1123,6 +1171,9 @@ def main() -> int:
if op in stages: if op in stages:
results[op] = "skip" results[op] = "skip"
finally: finally:
# From here the teardown funnel runs: a SIGTERM/SIGALRM landing now is logged + ignored
# (lifetime.py) so a second signal can't abort the cleanup the first one asked for.
lifetime.begin_teardown()
# Teardown the recipe under test FIRST, then deps in reverse declaration order. # Teardown the recipe under test FIRST, then deps in reverse declaration order.
# Parent verify=False (Phase 1d): keep as-is so a parent residual doesn't mask a tier # Parent verify=False (Phase 1d): keep as-is so a parent residual doesn't mask a tier
# failure. Dep teardown uses verify=True via teardown_deps (F2-5 fix); failures are # failure. Dep teardown uses verify=True via teardown_deps (F2-5 fix); failures are

View File

@ -199,7 +199,13 @@ def _run(cmd, timeout=120, check=False):
def _recipe_dir(recipe: str) -> str: def _recipe_dir(recipe: str) -> str:
return os.path.expanduser(f"~/.abra/recipes/{recipe}") # Resolve like the abra CLI does: $ABRA_DIR (the per-run tree when imported by a CI run,
# e.g. promote_canonical) else the canonical ~/.abra (this module's own systemd-timer runs,
# which set no ABRA_DIR). Keeps fetch_recipe (an `abra` subprocess) and the git readers
# below pointed at the SAME tree in both contexts.
return os.path.join(
os.environ.get("ABRA_DIR") or os.path.expanduser("~/.abra"), "recipes", recipe
)
def recipe_tags(recipe: str) -> list[str]: def recipe_tags(recipe: str) -> list[str]:

View File

@ -0,0 +1,108 @@
"""Shared utilities for the real-kernel concurrency suite (imported by the test modules; the
fixtures in conftest.py wrap these). No flock mocking anywhere — probes use real LOCK_NB."""
from __future__ import annotations
import contextlib
import fcntl
import os
import signal
import subprocess
import sys
import time
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import lifecycle # noqa: E402
HELPERS = os.path.join(os.path.dirname(__file__), "helpers.py")
DOMAIN = "test-abc123.ci.commoninternet.net" # matches RUN_APP_RE
class HelperPool:
"""Spawns helpers.py subprocesses and GUARANTEES their cleanup (incl. recorded grandchild
pids from `hold-with-child`/`wrapper` markers) — no leaked children in the test VM."""
def __init__(self, out_dir: str):
self.out_dir = out_dir
self.procs: list[subprocess.Popen] = []
self.extra_pids: list[int] = []
self._n = 0
def spawn(self, *args: str, env_extra: dict | None = None) -> tuple[subprocess.Popen, str]:
"""Start `helpers.py <args...>`; returns (proc, marker_file)."""
self._n += 1
out = os.path.join(self.out_dir, f"helper-{self._n}.out")
env = dict(os.environ, CCCI_HELPER_OUT=out, **(env_extra or {}))
p = subprocess.Popen( # noqa: S603
[sys.executable, HELPERS, *args],
env=env,
stdout=subprocess.DEVNULL,
stderr=subprocess.STDOUT,
)
self.procs.append(p)
return p, out
def track_pid(self, pid: int) -> None:
self.extra_pids.append(pid)
def cleanup(self) -> None:
for p in self.procs:
if p.poll() is None:
p.kill()
with contextlib.suppress(subprocess.TimeoutExpired):
p.wait(timeout=10)
for pid in self.extra_pids:
with contextlib.suppress(OSError):
os.kill(pid, signal.SIGKILL)
def wait_marker(out: str, token: str, timeout: float = 15.0) -> str | None:
"""Poll a helper's marker file for a line containing `token`; returns the line or None."""
deadline = time.time() + timeout
while time.time() < deadline:
try:
with open(out) as f:
for line in f:
if token in line:
return line.strip()
except OSError:
pass
time.sleep(0.1)
return None
def lock_state(domain: str) -> str:
"""'held' | 'free' | 'absent' for the domain's lockfile, probed with a REAL LOCK_NB."""
path = lifecycle._app_lock_path(domain) # noqa: SLF001
if not os.path.exists(path):
return "absent"
with open(path, "a") as f:
try:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
return "free"
except BlockingIOError:
return "held"
def wait_lock_state(domain: str, want: str, timeout: float = 10.0) -> str:
"""Poll until lock_state(domain) == want (kernel release on process death is fast, but give
the scheduler room). Returns the final observed state."""
deadline = time.time() + timeout
state = lock_state(domain)
while state != want and time.time() < deadline:
time.sleep(0.1)
state = lock_state(domain)
return state
def pid_alive(pid: int) -> bool:
return os.path.exists(f"/proc/{pid}")
def wait_pid_gone(pid: int, timeout: float = 15.0) -> bool:
deadline = time.time() + timeout
while time.time() < deadline:
if not pid_alive(pid):
return True
time.sleep(0.1)
return False

View File

@ -0,0 +1,34 @@
"""Fixtures for the real-kernel concurrency suite (concurrency-restructure plan, 19 cases).
NOT part of the default `pytest tests/unit` gate — run explicitly with `pytest tests/concurrency
-q` (docs/concurrency.md). Locks live in a per-test tmp dir (CCCI_APP_LOCK_DIR); helper
subprocesses hold REAL flocks / install the REAL prctl+signal guards and are always reaped in
fixture finalizers (no leaked children in the test VM).
"""
from __future__ import annotations
import os
import sys
import pytest
sys.path.insert(0, os.path.dirname(__file__))
from concutil import HelperPool # noqa: E402
@pytest.fixture
def lock_dir(tmp_path, monkeypatch):
"""Sandbox lock dir, exported so BOTH this process's lifecycle calls and helper subprocesses
(which inherit os.environ) resolve their lockfiles here — never /run/lock."""
d = tmp_path / "locks"
d.mkdir()
monkeypatch.setenv("CCCI_APP_LOCK_DIR", str(d))
return str(d)
@pytest.fixture
def pool(tmp_path):
hp = HelperPool(str(tmp_path))
yield hp
hp.cleanup()

View File

@ -0,0 +1,120 @@
#!/usr/bin/env python3
"""Subprocess helpers for tests/concurrency — REAL kernel locks and the REAL lifetime guards in
separate processes (flock/prctl are never mocked; tests assert on actual kernel behavior).
Invoked as: python3 helpers.py <command> <args...>
Env contract (set by the spawning test):
CCCI_APP_LOCK_DIR sandbox lock dir (never /run/lock in tests)
CCCI_HELPER_OUT marker file this helper APPENDS progress lines to (ACQUIRED/READY/...)
Commands:
hold <domain> acquire the app lock, mark `ACQUIRED <ts>`, sleep forever
hold-with-child <domain> acquire the lock, spawn a plain sleeping subprocess child, mark
`ACQUIRED <ts>` + `CHILD <pid>` (PEP 446: the child must NOT
inherit the lock fd), sleep forever
guarded <domain> <deadline> install the REAL lifetime guards (alarm=<deadline>s), acquire the
lock, mark `READY`; when the teardown funnel runs (`finally:`),
mark `TEARDOWN` before exiting
wrapper <domain> spawn `guarded <domain> 3600` as MY child, mark `WRAPPED <pid>`,
sleep — the test kills me to prove PDEATHSIG TERMs the child
orphan-probe wait (bounded) until reparented (ppid==1), then install the
guards; mark `REFUSED` if they exit (expected) or `GUARDS_OK`
fetch-checkout <recipe> <ref> run run_recipe_ci.fetch_recipe (the test sets CCCI_SKIP_FETCH=1
+ a per-"run" ABRA_DIR), git-checkout <ref>, mark
`RESULT <head> <data.txt content>`
"""
from __future__ import annotations
import os
import subprocess
import sys
import time
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "..", "runner"))
from harness import abra, lifecycle, lifetime # noqa: E402
OUT = os.environ.get("CCCI_HELPER_OUT")
def mark(line: str) -> None:
if OUT:
with open(OUT, "a") as f:
f.write(line + "\n")
f.flush()
print(line, flush=True)
def cmd_hold(domain: str) -> None:
lifecycle.acquire_app_lock(domain)
mark(f"ACQUIRED {time.time()}")
time.sleep(3600)
def cmd_hold_with_child(domain: str) -> None:
lifecycle.acquire_app_lock(domain)
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(3600)"])
mark(f"ACQUIRED {time.time()}")
mark(f"CHILD {child.pid}")
time.sleep(3600)
def cmd_guarded(domain: str, deadline: str) -> None:
lifetime.install_lifetime_guards(deadline_seconds=int(deadline))
lifecycle.acquire_app_lock(domain)
mark("READY")
try:
time.sleep(3600)
finally:
mark("TEARDOWN")
def cmd_wrapper(domain: str) -> None:
p = subprocess.Popen( # noqa: S603
[sys.executable, os.path.abspath(__file__), "guarded", domain, "3600"],
env=os.environ.copy(),
)
mark(f"WRAPPED {p.pid}")
time.sleep(3600)
def cmd_orphan_probe() -> None:
# Our spawner exits immediately after fork; wait (bounded) until we are reparented so the
# prctl is installed with the parent ALREADY dead — the exact race the ppid check closes.
for _ in range(200):
if os.getppid() == 1:
break
time.sleep(0.05)
else:
mark("NEVER_REPARENTED") # e.g. a subreaper environment — test will fail visibly
return
try:
lifetime.install_lifetime_guards()
except SystemExit:
mark("REFUSED")
raise
mark("GUARDS_OK")
def cmd_fetch_checkout(recipe: str, ref: str) -> None:
import run_recipe_ci
run_recipe_ci.fetch_recipe(recipe, None, None)
abra.recipe_checkout(recipe, ref)
head = abra.recipe_head_commit(recipe)
with open(os.path.join(abra.recipe_dir(recipe), "data.txt")) as f:
content = f.read().strip()
mark(f"RESULT {head} {content}")
if __name__ == "__main__":
cmd, *args = sys.argv[1:]
{
"hold": cmd_hold,
"hold-with-child": cmd_hold_with_child,
"guarded": cmd_guarded,
"wrapper": cmd_wrapper,
"orphan-probe": cmd_orphan_probe,
"fetch-checkout": cmd_fetch_checkout,
}[cmd](*args)

View File

@ -0,0 +1,175 @@
"""Per-run ABRA_DIR isolation (concurrency-restructure plan, cases 17-19). Real directories,
real symlinks, real git — abra itself is replaced by a recording stub where a CLI call is
involved (case 17), because these cases test OUR dir/env plumbing, not abra."""
from __future__ import annotations
import os
import stat
import subprocess
import sys
sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
import run_recipe_ci # noqa: E402
from concutil import wait_marker # noqa: E402
from harness import abra # noqa: E402
RECIPE = "fakerecipe"
def _git(cwd, *args):
subprocess.run(
["git", "-c", "user.email=t@t", "-c", "user.name=t", *args],
cwd=cwd,
check=True,
capture_output=True,
)
def _make_fake_home(tmp_path):
"""A fake $HOME with a canonical ~/.abra: servers/default + catalogue dirs, and a recipe git
repo with two tags whose data.txt differs (v1 -> 'one', v2 -> 'two', HEAD at v2)."""
home = tmp_path / "home"
(home / ".abra" / "servers" / "default").mkdir(parents=True)
(home / ".abra" / "catalogue").mkdir(parents=True)
repo = home / ".abra" / "recipes" / RECIPE
repo.mkdir(parents=True)
_git(repo, "init", "-q")
(repo / "data.txt").write_text("one\n")
_git(repo, "add", "data.txt")
_git(repo, "commit", "-qm", "v1")
_git(repo, "tag", "v1")
(repo / "data.txt").write_text("two\n")
_git(repo, "add", "data.txt")
_git(repo, "commit", "-qm", "v2")
_git(repo, "tag", "v2")
return home
def test_17_per_run_dir_built_and_exported_before_abra(tmp_path, monkeypatch):
"""Case 17: setup_run_abra_dir builds the per-run dir correctly (servers/catalogue symlinks
resolve to the canonical tree, recipes/ empty + writable) and $ABRA_DIR is exported before
the first abra call — proven by a stub `abra` on PATH that records the env it saw."""
home = _make_fake_home(tmp_path)
monkeypatch.setenv("HOME", str(home))
monkeypatch.setenv("CCCI_RUNS_DIR", str(tmp_path / "runs"))
monkeypatch.setenv("DRONE_BUILD_NUMBER", "777")
monkeypatch.setenv("ABRA_DIR", "sentinel-to-be-overwritten") # so monkeypatch restores it
d = run_recipe_ci.setup_run_abra_dir()
assert d == str(tmp_path / "runs" / "777" / "abra")
assert os.environ["ABRA_DIR"] == d
assert os.readlink(os.path.join(d, "servers")) == str(home / ".abra" / "servers")
assert os.readlink(os.path.join(d, "catalogue")) == str(home / ".abra" / "catalogue")
# symlinks RESOLVE (targets exist) and recipes/ is empty + writable
assert os.path.isdir(os.path.join(d, "servers", "default"))
assert os.path.isdir(os.path.join(d, "catalogue"))
assert os.listdir(os.path.join(d, "recipes")) == []
probe = os.path.join(d, "recipes", ".write-probe")
open(probe, "w").close()
os.remove(probe)
# idempotent re-entry (Drone build-number retry): must not raise on existing symlinks
assert run_recipe_ci.setup_run_abra_dir() == d
# stub abra records $ABRA_DIR at call time; fetch_recipe's catalogue branch invokes it
stub_dir = tmp_path / "bin"
stub_dir.mkdir()
log = tmp_path / "abra-env.log"
stub = stub_dir / "abra"
stub.write_text(f'#!/bin/sh\necho "$ABRA_DIR" >> {log}\nexit 0\n')
stub.chmod(stub.stat().st_mode | stat.S_IEXEC)
monkeypatch.setenv("PATH", f"{stub_dir}{os.pathsep}{os.environ['PATH']}")
monkeypatch.delenv("CCCI_SKIP_FETCH", raising=False)
run_recipe_ci.fetch_recipe(RECIPE, None, None)
assert log.read_text().strip() == d, "abra was called without the per-run ABRA_DIR exported"
def test_18_concurrent_same_recipe_fetch_no_cross_talk(tmp_path, monkeypatch, pool):
"""Case 18: two CONCURRENT fetch+checkout flows of the SAME recipe into different ABRA_DIRs
produce two correct, divergent trees (v1 vs v2) — the old shared-tree corruption scenario,
now structurally safe with no lock. The canonical staged clone is untouched."""
home = _make_fake_home(tmp_path)
canonical_repo = home / ".abra" / "recipes" / RECIPE
head_before = subprocess.run(
["git", "-C", canonical_repo, "rev-parse", "HEAD"], capture_output=True, text=True
).stdout.strip()
runs = {}
for name, ref in (("runA", "v1"), ("runB", "v2")):
abra_dir = tmp_path / name / "abra"
abra_dir.mkdir(parents=True)
_, out = pool.spawn(
"fetch-checkout",
RECIPE,
ref,
env_extra={
"HOME": str(home),
"ABRA_DIR": str(abra_dir),
"CCCI_SKIP_FETCH": "1",
},
)
runs[name] = (out, ref, abra_dir)
expect = {"v1": "one", "v2": "two"}
for name, (out, ref, abra_dir) in runs.items():
line = wait_marker(out, "RESULT", timeout=30)
assert line, f"{name} never produced a RESULT"
_, head, content = line.split()
assert content == expect[ref], f"{name}@{ref}: tree content {content!r}"
tree = abra_dir / "recipes" / RECIPE
assert (tree / "data.txt").read_text().strip() == expect[ref]
assert (
head
== subprocess.run(
["git", "-C", tree, "rev-parse", "HEAD"], capture_output=True, text=True
).stdout.strip()
)
# the two trees genuinely diverge AND the canonical staged clone is untouched
a = (runs["runA"][2] / "recipes" / RECIPE / "data.txt").read_text()
b = (runs["runB"][2] / "recipes" / RECIPE / "data.txt").read_text()
assert a != b
head_after = subprocess.run(
["git", "-C", canonical_repo, "rev-parse", "HEAD"], capture_output=True, text=True
).stdout.strip()
assert head_after == head_before, "canonical clone must not be touched by per-run fetches"
def test_19_env_written_through_servers_symlink_lands_canonical(tmp_path, monkeypatch):
"""Case 19: an app .env written through the per-run servers/ symlink (what abra does under
$ABRA_DIR) lands in the CANONICAL shared path — so janitor discovery and every
expanduser('~/.abra/servers/...') reader keep working unchanged."""
home = _make_fake_home(tmp_path)
monkeypatch.setenv("HOME", str(home))
monkeypatch.setenv("CCCI_RUNS_DIR", str(tmp_path / "runs"))
monkeypatch.setenv("DRONE_BUILD_NUMBER", "778")
monkeypatch.setenv("ABRA_DIR", "sentinel-to-be-overwritten")
d = run_recipe_ci.setup_run_abra_dir()
domain = "test-abc123.ci.commoninternet.net"
via_symlink = os.path.join(d, "servers", "default", f"{domain}.env")
with open(via_symlink, "w") as f:
f.write("TYPE=fakerecipe:1.0.0\nDOMAIN=placeholder\n")
canonical = home / ".abra" / "servers" / "default" / f"{domain}.env"
assert canonical.is_file(), ".env written via the symlink must land in the canonical path"
# the canonical-path readers/writers (abra.env_get/env_set use ~/.abra) see the same file
assert abra.env_get(domain, "TYPE") == "fakerecipe:1.0.0"
abra.env_set(domain, "DOMAIN", domain)
with open(via_symlink) as f:
assert f"DOMAIN={domain}" in f.read()
def test_18b_run_id_manual_fallback_is_per_process(tmp_path, monkeypatch):
"""Companion to case 18: two concurrent MANUAL runs (no DRONE_BUILD_NUMBER) must not share an
abra dir either — the manual fallback is pid-suffixed."""
home = _make_fake_home(tmp_path)
monkeypatch.setenv("HOME", str(home))
monkeypatch.setenv("CCCI_RUNS_DIR", str(tmp_path / "runs"))
monkeypatch.delenv("DRONE_BUILD_NUMBER", raising=False)
monkeypatch.delenv("CCCI_APP_DOMAIN", raising=False)
monkeypatch.delenv("CCCI_RUN_ID", raising=False)
monkeypatch.setenv("ABRA_DIR", "sentinel-to-be-overwritten")
d = run_recipe_ci.setup_run_abra_dir()
assert f"manual-{os.getpid()}" in d

View File

@ -0,0 +1,189 @@
"""Janitor / flock-probe semantics (concurrency-restructure plan, cases 5-12).
The janitor runs IN-PROCESS with its discovery monkeypatched (candidates injected via a stubbed
abra.app_ls + empty docker sweep) and teardown_app stubbed to record calls — but the LOCKS are
real kernel flocks, held by real helper subprocesses where a live owner is needed."""
from __future__ import annotations
import os
import sys
import threading
import time
sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from concutil import DOMAIN, lock_state, wait_marker # noqa: E402
from harness import lifecycle # noqa: E402
def _inject_candidates(monkeypatch, domains):
"""Point janitor discovery at exactly `domains`: abra lists them, docker sweep is empty.
teardown_app is stubbed to a recorder; returns the calls list."""
calls = []
monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": d} for d in domains])
monkeypatch.setattr(lifecycle, "_docker_names", lambda kind, stack: [])
monkeypatch.setattr(lifecycle, "teardown_app", lambda d, verify=True: calls.append(d))
return calls
def test_5_orphan_reaped_lockfile_unlinked(lock_dir, pool, monkeypatch):
"""Case 5: an orphan (lockfile exists, no holder — its run was SIGKILL'd) is reaped exactly
once and its lockfile unlinked."""
p, out = pool.spawn("hold", DOMAIN)
assert wait_marker(out, "ACQUIRED")
p.kill()
p.wait(timeout=10)
calls = _inject_candidates(monkeypatch, [DOMAIN])
lifecycle.janitor()
assert calls == [DOMAIN], f"teardown calls: {calls} (expected exactly one)"
assert lock_state(DOMAIN) == "absent", "reaped orphan's lockfile must be unlinked"
def test_6_live_run_never_reaped(lock_dir, pool, monkeypatch, capsys):
"""Case 6: a held lock (live helper) is never reaped and is logged as live."""
p, out = pool.spawn("hold", DOMAIN)
assert wait_marker(out, "ACQUIRED")
calls = _inject_candidates(monkeypatch, [DOMAIN])
lifecycle.janitor()
assert calls == []
assert "live concurrent run" in capsys.readouterr().out
assert lock_state(DOMAIN) == "held"
def test_7_new_run_blocks_until_reap_finishes(lock_dir, pool, monkeypatch):
"""Case 7: the janitor reaps WHILE HOLDING the probe lock, so a new run of the same domain
blocks in acquire_app_lock until the reap completes — no window where a fresh app coexists
with a half-reaped one."""
# Make an orphan.
p, out = pool.spawn("hold", DOMAIN)
assert wait_marker(out, "ACQUIRED")
p.kill()
p.wait(timeout=10)
state = {"teardown_end": None, "acquirer_out": None}
def slow_teardown(domain, verify=True):
# While the janitor holds the probe lock mid-reap, a new run starts acquiring.
_, aout = pool.spawn("hold", DOMAIN)
state["acquirer_out"] = aout
time.sleep(2.0)
state["teardown_end"] = time.time()
monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": DOMAIN}])
monkeypatch.setattr(lifecycle, "_docker_names", lambda kind, stack: [])
monkeypatch.setattr(lifecycle, "teardown_app", slow_teardown)
lifecycle.janitor()
line = wait_marker(state["acquirer_out"], "ACQUIRED", timeout=15)
assert line, "new run never acquired after the reap"
acquired_ts = float(line.split()[1])
assert (
acquired_ts >= state["teardown_end"]
), f"new run acquired at {acquired_ts} BEFORE the reap finished at {state['teardown_end']}"
# The new run must hold a lock the next probe can SEE (fresh inode at the path).
assert lock_state(DOMAIN) == "held"
def test_8_two_janitors_exactly_one_reaps(lock_dir, pool, monkeypatch):
"""Case 8: two concurrent janitors arbitrate on the probe flock — exactly one reaps (the
other sees 'held' and leaves). Teardown is slowed so the runs genuinely overlap."""
p, out = pool.spawn("hold", DOMAIN)
assert wait_marker(out, "ACQUIRED")
p.kill()
p.wait(timeout=10)
calls = []
calls_lock = threading.Lock()
def slow_teardown(domain, verify=True):
with calls_lock:
calls.append(domain)
time.sleep(2.0)
monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": DOMAIN}])
monkeypatch.setattr(lifecycle, "_docker_names", lambda kind, stack: [])
monkeypatch.setattr(lifecycle, "teardown_app", slow_teardown)
barrier = threading.Barrier(2)
def run_janitor():
barrier.wait()
lifecycle.janitor()
t1, t2 = threading.Thread(target=run_janitor), threading.Thread(target=run_janitor)
t1.start(), t2.start()
t1.join(timeout=30), t2.join(timeout=30)
assert calls == [DOMAIN], f"expected exactly one reap, got {calls}"
assert lock_state(DOMAIN) == "absent"
def test_9_reboot_lockfile_absent_reaped_immediately(lock_dir, monkeypatch):
"""Case 9: post-reboot simulation — the app exists but its lockfile is gone (/run/lock is
tmpfs). The probe trivially acquires -> immediate reap, NO age threshold (improvement over
the old 2h fallback)."""
assert lock_state(DOMAIN) == "absent"
calls = _inject_candidates(monkeypatch, [DOMAIN])
t0 = time.time()
lifecycle.janitor()
assert calls == [DOMAIN]
assert time.time() - t0 < 5, "reap must be immediate (no age wait)"
def test_10_long_held_lock_flagged_never_stolen(lock_dir, pool, monkeypatch, capsys):
"""Case 10: a lock held with mtime older than 120min is flagged as a possible leaked run —
and NOT reaped (never steal a held lock)."""
p, out = pool.spawn("hold", DOMAIN)
assert wait_marker(out, "ACQUIRED")
path = lifecycle._app_lock_path(DOMAIN) # noqa: SLF001
backdate = time.time() - (130 * 60)
os.utime(path, (backdate, backdate))
calls = _inject_candidates(monkeypatch, [DOMAIN])
lifecycle.janitor()
assert calls == []
out_text = capsys.readouterr().out
assert "possible leaked run" in out_text and "lslocks" in out_text
assert lock_state(DOMAIN) == "held"
def test_11_warm_canonical_names_never_probed(lock_dir, monkeypatch):
"""Case 11: RUN_APP_RE allowlist — warm/canonical-shaped names never become candidates, so
they are never probed (no lockfile is even created for them) and never reaped."""
warmish = [
"warm-keycloak.ci.commoninternet.net",
"keycloak.ci.commoninternet.net",
"warm-hedgedoc.ci.commoninternet.net",
"drone.ci.commoninternet.net",
]
calls = []
monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": d} for d in warmish])
monkeypatch.setattr(
lifecycle,
"_docker_names",
lambda kind, stack: ["warm-keycloak_ci_commoninternet_net_app"]
if kind == "service"
else [],
)
monkeypatch.setattr(lifecycle, "teardown_app", lambda d, verify=True: calls.append(d))
lifecycle.janitor()
assert calls == []
lockdir = os.environ["CCCI_APP_LOCK_DIR"]
assert [
f for f in os.listdir(lockdir) if f.startswith("cc-ci-app-")
] == [], "janitor must not create lockfiles for non-run-app names"
def test_12_degrades_safely_on_bad_lockfile_and_missing_dir(lock_dir, monkeypatch, capsys):
"""Case 12: a garbled/unopenable lockfile (here: a DIRECTORY at the lockfile path) is skipped
with a log line; a missing lock dir doesn't crash the janitor either. Never a crash."""
path = lifecycle._app_lock_path(DOMAIN) # noqa: SLF001
os.makedirs(path) # open(path, "a") -> IsADirectoryError (an OSError)
calls = _inject_candidates(monkeypatch, [DOMAIN])
lifecycle.janitor() # must not raise
assert calls == []
assert "skipping" in capsys.readouterr().out
os.rmdir(path)
monkeypatch.setenv("CCCI_APP_LOCK_DIR", os.path.join(os.environ["CCCI_APP_LOCK_DIR"], "gone"))
lifecycle.janitor() # missing dir: probe open fails -> skip; tidy glob -> empty. No crash.
assert calls == []

View File

@ -0,0 +1,82 @@
"""Lifetime hardening (concurrency-restructure plan, cases 13-16): the REAL prctl/signal/alarm
guards installed by helper subprocesses; tests assert teardown ran, exit was non-zero, and the
lock was released."""
from __future__ import annotations
import os
import signal
import sys
sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from concutil import ( # noqa: E402
DOMAIN,
wait_lock_state,
wait_marker,
wait_pid_gone,
)
def test_13_pdeathsig_parent_kill_terms_harness(lock_dir, pool):
"""Case 13: wrapper-parent spawns a guarded harness-child; the parent is SIGKILL'd (the
harness gets no courtesy signal) -> the kernel's PDEATHSIG TERMs the child, its teardown
funnel runs, it exits, and the lock is released."""
p, out = pool.spawn("wrapper", DOMAIN)
line = wait_marker(out, "WRAPPED")
assert line, "wrapper never spawned its child"
child_pid = int(line.split()[1])
pool.track_pid(child_pid)
assert wait_marker(out, "READY"), "guarded child never got ready"
p.kill() # parent dies WITHOUT signalling the child — only PDEATHSIG can save us
p.wait(timeout=10)
assert wait_pid_gone(child_pid), "guarded child must exit on parent death (PDEATHSIG)"
assert wait_marker(out, "TEARDOWN", timeout=5), "teardown funnel did not run"
assert wait_lock_state(DOMAIN, "free") == "free"
def test_14_already_orphaned_helper_refuses_to_run(lock_dir, pool):
"""Case 14 (ppid race): a helper whose parent died BEFORE the prctl was armed (it starts
already reparented to pid 1) must refuse to run — PDEATHSIG would never fire for it."""
# Spawn an intermediate parent that forks orphan-probe and exits immediately.
import subprocess
out = os.path.join(pool.out_dir, "orphan.out")
intermediate = (
"import subprocess, sys, os; "
"subprocess.Popen([sys.executable, os.environ['CCCI_HELPERS'], 'orphan-probe']); "
)
env = dict(
os.environ,
CCCI_HELPER_OUT=out,
CCCI_HELPERS=os.path.join(os.path.dirname(__file__), "helpers.py"),
)
subprocess.run([sys.executable, "-c", intermediate], env=env, timeout=15, check=True)
line = wait_marker(out, "REFUSED", timeout=20)
assert line, "orphaned helper did not refuse to run (or never reparented to pid 1)"
def test_15_deadline_alarm_fires_teardown_and_releases(lock_dir, pool):
"""Case 15: the self-deadline (alarm). A guarded helper with a 2s deadline tears down via
the funnel (finally: ran), exits NON-zero, and its lock is released."""
p, out = pool.spawn("guarded", DOMAIN, "2")
assert wait_marker(out, "READY")
rc = p.wait(timeout=20)
assert rc != 0, f"deadline exit must be non-zero (got {rc})"
assert rc == 128 + signal.SIGALRM, f"expected 142 (128+SIGALRM), got {rc}"
assert wait_marker(out, "TEARDOWN", timeout=5), "teardown funnel did not run on deadline"
assert wait_lock_state(DOMAIN, "free") == "free"
def test_16_sigterm_runs_teardown_funnel_and_releases(lock_dir, pool):
"""Case 16: SIGTERM (drone cancel path) -> the finally: teardown funnel runs, exit is
non-zero, lock released."""
p, out = pool.spawn("guarded", DOMAIN, "3600")
assert wait_marker(out, "READY")
p.send_signal(signal.SIGTERM)
rc = p.wait(timeout=20)
assert rc != 0, f"SIGTERM exit must be non-zero (got {rc})"
assert rc == 128 + signal.SIGTERM, f"expected 143 (128+SIGTERM), got {rc}"
assert wait_marker(out, "TEARDOWN", timeout=5), "teardown funnel did not run on SIGTERM"
assert wait_lock_state(DOMAIN, "free") == "free"

View File

@ -0,0 +1,85 @@
"""Lock fundamentals (concurrency-restructure plan, cases 1-4). Real kernel flocks held by real
subprocesses — nothing mocked."""
from __future__ import annotations
import fcntl
import os
import sys
import time
sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from concutil import ( # noqa: E402
DOMAIN,
lock_state,
wait_lock_state,
wait_marker,
)
from harness import lifecycle # noqa: E402
def test_1_sigkill_releases_lock(lock_dir, pool):
"""Case 1: acquire -> holder SIGKILL'd -> lock immediately acquirable (kernel auto-release).
The exact property the old pidfile registry approximated with /proc checks."""
p, out = pool.spawn("hold", DOMAIN)
assert wait_marker(out, "ACQUIRED"), "holder never acquired"
assert lock_state(DOMAIN) == "held"
p.kill()
p.wait(timeout=10)
assert wait_lock_state(DOMAIN, "free") == "free"
def test_2_nb_probe_held_vs_unheld(lock_dir, pool):
"""Case 2: LOCK_NB probe raises BlockingIOError against a held lock; succeeds when unheld."""
p, out = pool.spawn("hold", DOMAIN)
assert wait_marker(out, "ACQUIRED")
path = lifecycle._app_lock_path(DOMAIN) # noqa: SLF001
with open(path, "a") as f:
try:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
raise AssertionError("LOCK_NB succeeded against a held lock")
except BlockingIOError:
pass
p.kill()
p.wait(timeout=10)
assert wait_lock_state(DOMAIN, "free") == "free"
with open(path, "a") as f:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB) # must not raise now
def test_3_lock_fd_not_inherited_by_children(lock_dir, pool):
"""Case 3 (PEP 446): the holder spawns a subprocess child, the holder dies, the child lives —
and the lock is STILL released (the child never inherited the lock fd). This is what makes
'held lock == live HARNESS owner' sound even though runs spawn abra/docker/pytest children."""
p, out = pool.spawn("hold-with-child", DOMAIN)
assert wait_marker(out, "ACQUIRED")
child_line = wait_marker(out, "CHILD")
assert child_line, "holder never reported its child pid"
child_pid = int(child_line.split()[1])
pool.track_pid(child_pid)
p.kill()
p.wait(timeout=10)
assert os.path.exists(f"/proc/{child_pid}"), "child should outlive the holder"
assert (
wait_lock_state(DOMAIN, "free") == "free"
), "lock must release on holder death even with a live child (PEP 446 non-inheritable fd)"
def test_4_second_acquire_blocks_until_first_exits(lock_dir, pool):
"""Case 4: a second same-domain acquire blocks until the first holder exits — the
double-!testme serialisation property."""
p1, out1 = pool.spawn("hold", DOMAIN)
assert wait_marker(out1, "ACQUIRED")
p2, out2 = pool.spawn("hold", DOMAIN)
# p2 must NOT acquire while p1 holds.
time.sleep(1.5)
assert wait_marker(out2, "ACQUIRED", timeout=0.1) is None, "second acquire did not block"
t_kill = time.time()
p1.kill()
p1.wait(timeout=10)
line = wait_marker(out2, "ACQUIRED", timeout=15)
assert line, "second acquire never completed after first holder exited"
acquired_ts = float(line.split()[1])
assert acquired_ts >= t_kill - 0.05, "second holder acquired before the first exited"
assert lock_state(DOMAIN) == "held"

View File

@ -15,7 +15,9 @@ set -euo pipefail
: "${CCCI_RECIPE:?missing CCCI_RECIPE}" : "${CCCI_RECIPE:?missing CCCI_RECIPE}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
RECIPE_DIR="${HOME}/.abra/recipes/${CCCI_RECIPE}" # Resolve the recipe tree the way abra does: $ABRA_DIR (the per-run tree inside a CI run) else
# the canonical ~/.abra — the overlay must land in the tree this run actually deploys from.
RECIPE_DIR="${ABRA_DIR:-${HOME}/.abra}/recipes/${CCCI_RECIPE}"
if [ ! -d "$RECIPE_DIR" ]; then if [ ! -d "$RECIPE_DIR" ]; then
echo " discourse install_steps: recipe dir $RECIPE_DIR missing — cannot provide compose.ccci.yml" >&2 echo " discourse install_steps: recipe dir $RECIPE_DIR missing — cannot provide compose.ccci.yml" >&2

View File

@ -15,7 +15,9 @@ set -euo pipefail
: "${CCCI_RECIPE:?missing CCCI_RECIPE}" : "${CCCI_RECIPE:?missing CCCI_RECIPE}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
RECIPE_DIR="${HOME}/.abra/recipes/${CCCI_RECIPE}" # Resolve the recipe tree the way abra does: $ABRA_DIR (the per-run tree inside a CI run) else
# the canonical ~/.abra — the overlay must land in the tree this run actually deploys from.
RECIPE_DIR="${ABRA_DIR:-${HOME}/.abra}/recipes/${CCCI_RECIPE}"
if [ ! -d "$RECIPE_DIR" ]; then if [ ! -d "$RECIPE_DIR" ]; then
echo " ghost install_steps: recipe dir $RECIPE_DIR missing — cannot provide compose.ccci.yml" >&2 echo " ghost install_steps: recipe dir $RECIPE_DIR missing — cannot provide compose.ccci.yml" >&2