fix(harness): run-keyed run-scoped state files — CONC-A1 (same-domain runs corrupted shared deploy-count)

The four CCCI state files (deploys countfile, opstate, deps, depskip) were keyed by app domain in shared /tmp. A second run of the same domain executes its main() preamble + deploy_app's pre-lock _record_deploy BEFORE blocking at the app lock, so it reset/polluted the live first run's counter (false DG4.1 deploy-count=2, build 279) and the first run's end-of-run os.remove crashed the second (FileNotFoundError, build 281). Masked pre-restructure by the end-to-end recipe flock. Now keyed by run id + harness pid via _run_state_path(); children receive exact paths via the CCCI_*_FILE env vars, so domain keying was never load-bearing. tests/concurrency/test_run_state.py: path-invariant cases + a real-process regression (helpers.py deploy-count-run) reproducing the live interleaving — verified to FAIL under simulated shared keying. docs/concurrency.md §3 updated.
fix(ci): recipe-ci wrapper — capture harness rc, clear traps before exit (green runs no longer exit 1)
2026-06-10 08:16:09 +00:00 · 2026-06-10 04:54:40 +00:00 · 2026-06-10 04:32:54 +00:00 · 2026-06-10 04:29:36 +00:00 · 2026-06-10 04:19:35 +00:00 · 2026-06-10 04:18:33 +00:00
142 changed files with 3106 additions and 944 deletions
--- a/.drone.yml
+++ b/.drone.yml
@@ -35,10 +35,12 @@ steps:
 # the comment-bridge). Deploys the recipe at the PR head, runs install/upgrade/backup + any
 # recipe-local tests via the shared harness, then guarantees teardown (plan §4.2/§4.3).
 #
-# Resource safety (plan §4.2/§4.3): MAX_TESTS=DRONE_RUNNER_CAPACITY=1 (nix/modules/drone-runner.nix) is
+# Resource safety (plan §4.2/§4.3): DRONE_RUNNER_CAPACITY=2 (nix/modules/drone-runner.nix, the
-# the primary concurrency cap; concurrency.limit below is a redundant belt. CCCI_JANITOR_MAX_AGE=0
+# single concurrency knob) allows two recipe runs in parallel. Concurrent-run safety is enforced by
-# makes the run-start janitor reap ANY orphaned run app before deploying — safe because capacity=1
+# the harness, not by serialisation: every run holds an exclusive flock on its app domain
-# means no concurrent run exists (a SIGKILL'd/timed-out build leaves an orphan with no teardown).
+# (/run/lock/cc-ci-app-<domain>.lock) for its whole process lifetime, the run-start janitor probes
 # that lock to reap only orphans (held lock = live run, never touched), and recipe working trees
 # are per-run ($ABRA_DIR/recipes — no shared checkout, no recipe lock). See docs/concurrency.md.
 kind: pipeline
 type: exec
 name: recipe-ci
@@ -51,21 +53,37 @@ trigger:
  event:
    - custom
-concurrency:
+# NB deliberately NO `concurrency.limit` here: DRONE_RUNNER_CAPACITY (nix/modules/drone-runner.nix
-  limit: 1
+# maxTests) is the single concurrency knob (P4 — two knobs in two files drifted).
 steps:
  - name: ci
    environment:
      STAGES: install,upgrade,backup,restore,custom
-      CCCI_JANITOR_MAX_AGE: "0"
+      # The exec runner points HOME at a per-build workspace; force it to /root so abra's server
-      # The exec runner points HOME at a per-build workspace; force it to /root so abra finds its
+      # config is found via the per-run ABRA_DIR's servers/ symlink -> /root/.abra/servers.
-      # server config + recipes under /root/.abra (as the manual M4/M5 runs did). Safe: capacity=1
+      # Recipe trees are PER-RUN ($ABRA_DIR/recipes, exported by run_recipe_ci before any abra
-      # means no concurrent build shares /root/.abra.
+      # call), so concurrent builds never share a recipe checkout; app .env files are per-domain
      # in the shared canonical servers/ path, guarded by the app-domain flock.
      HOME: /root
    commands:
      # RECIPE/REF/PR/SRC (+ CCCI_QUICK for `!testme --quick`) are injected as env vars from the
      # build's custom params. CCCI_QUICK=1 makes run_recipe_ci take the opt-in fast lane (WC7);
      # absent => full cold (default). run_quick ignores STAGES (always upgrade+custom).
      - 'echo "recipe-ci: RECIPE=$RECIPE REF=$REF PR=$PR SRC=$SRC stages=$STAGES quick=${CCCI_QUICK:-0}"'
-      - cc-ci-run runner/run_recipe_ci.py
+      # P1 lock-lifetime hardening: run the harness in its own session/process group (setsid) and
      # forward a drone cancel (TERM to this step shell) to the WHOLE group, so the harness's
      # SIGTERM handler runs its teardown funnel instead of being leaked (the exec runner kills
      # only the step shell, not the tree). PDEATHSIG inside the harness backstops the case where
      # this shell dies without the trap firing. The harness exit code is captured explicitly and
      # the traps cleared before exiting: the runner shell is `set -e`, and an EXIT-trap kill of
      # the already-gone process group returns ESRCH, which otherwise poisons a GREEN run's exit
      # status to 1 (observed live, build 269: all tiers pass, step exit 1).
      - |
        setsid cc-ci-run runner/run_recipe_ci.py &
        PID=$!
        trap 'kill -TERM -- "-$PID" 2>/dev/null || true' TERM EXIT
        rc=0
        wait "$PID" || rc=$?
        trap - TERM EXIT
        exit "$rc"
--- a/BACKLOG-conc.md
+++ b/BACKLOG-conc.md
@@ -0,0 +1,22 @@
 # BACKLOG — sub-phase conc
 ## Build backlog
 - [ ] P1 lock-lifetime hardening: prctl PDEATHSIG + ppid race check + SIGTERM handler →
      teardown funnel + signal.alarm(3600) hard deadline; .drone.yml setsid/trap wrap;
      PEP 446 comment on lock open()
 - [ ] P2 flock-probe janitor: acquire_app_lock(domain) at register_run_app's call site;
      janitor probes per-domain lockfiles (acquired→reap under probe lock, held→leave,
      >120min mtime→warn); delete registry symbols
 - [ ] P3 per-run ABRA_DIR: /var/lib/cc-ci-runs/<build>/abra with servers+catalogue symlinks,
      fresh recipes/; fetch_recipe = plain clone; delete acquire_recipe_lock; route harness
      recipe paths through ABRA_DIR
 - [ ] P4 config cleanup: remove concurrency.limit from .drone.yml; maxTests is the single knob
 - [ ] tests/concurrency suite (19 cases, real-kernel flock, explicit invocation only)
 - [ ] P5 docs/concurrency.md rewrite to the new model
 - [ ] M1 claim (branch complete, both suites + lint green)
 - [ ] M2: merge to main after M1 PASS, push build green, live verification a–d
 ## Adversary findings
 (adversary-owned)
--- a/JOURNAL-conc.md
+++ b/JOURNAL-conc.md
@@ -0,0 +1,24 @@
 # JOURNAL — sub-phase conc (Builder, append-only)
 ## 2026-06-10 — bootstrap
 Read concurrency-restructure-full-plan.md (SSOT) + plan.md §6.1/§7/§9. Oriented on the code:
 - `runner/harness/lifecycle.py` — recipe flock (l.46), registry (l.65–97), deploy_app
  registration (l.283), teardown unregister (l.723), three-way janitor (l.726).
 - `runner/run_recipe_ci.py` — `acquire_recipe_lock` call site (l.843), `fetch_recipe` (l.140,
  rm-rf + reclone of the shared tree), janitor call sites (l.600 quick, l.932 cold).
 - `.drone.yml` — recipe-ci step runs `cc-ci-run runner/run_recipe_ci.py` bare (P1 wraps it),
  `concurrency.limit: 1` (P4 removes).
 - Greps for P3 fallout: `~/.abra/recipes` referenced in abra.py (recipe_checkout,
  has_lightweight_version_tags, recipe_head_commit, recipe_versions), generic.py:28,
  lifecycle.prepull_images, run_recipe_ci (fetch_recipe, snapshot_recipe_tests, comment),
  warm_reconcile.py:202 (runs OUTSIDE per-run context — keeps default), and
  tests/ghost+discourse install_steps.sh (`${HOME}/.abra/recipes/...` — these run INSIDE a
  run and copy compose.ccci.yml into the deploy tree, so they must resolve the per-run dir).
 - `~/.abra/servers/...` paths are unaffected by design (servers/ is symlinked to the canonical
  /root/.abra/servers, so both resolutions land on the same file).
 Working setup: state files on main in this clone; code on branch `restructure/concurrency`
 via a git worktree at ../cc-ci-conc; test runs on the cc-ci host via /root/builder-clone
 (`cc-ci-run -m pytest ...`, `nix develop .#lint`).
--- a/REVIEW-conc.md
+++ b/REVIEW-conc.md
@@ -0,0 +1,32 @@
 # REVIEW-conc.md — Adversary ledger, concurrency-restructure phase
 Append-only. Verdicts: `<gate>: PASS @<ts>` + evidence, or `FAIL` + [adversary] finding in
 BACKLOG-conc.md. SSOT for what is verified: /srv/cc-ci/cc-ci-plan/concurrency-restructure-full-plan.md.
 ## 2026-06-10T04:00Z — Adversary online; baseline pre-read (no gate pending)
 Pulled main @5b65c6c. No STATUS-conc.md, no `restructure/concurrency` branch — nothing claimed yet.
 Pre-read the CURRENT system (docs/concurrency.md @5b65c6c + lifecycle.py/run_recipe_ci.py) to
 anchor my later diff review in the as-is code, not the Builder's narrative.
 Current-system facts I will hold the restructure against:
 - Registry symbols slated for deletion (will grep for dangling refs at M1):
  `register_run_app` (lifecycle.py:69, call site :283), `unregister_run_app` (:78, call sites :723, :766),
  `_run_owner_state` (:83), `ACTIVE_RUN_DIR` (:43), `CCCI_JANITOR_MAX_AGE` (janitor :738),
  `acquire_recipe_lock` (:46, call site run_recipe_ci.py:843), `RECIPE_LOCK_DIR` (:42).
 - Must survive untouched: `RUN_APP_RE` (lifecycle.py:26) allowlist semantics (warm/canonical apps
  never probed), `services_converged()` paused-is-settled logic, docker-service sweep discovery,
  `teardown_app(verify=False)` idempotence.
 - M1 verification plan (cold, my clone): checkout branch; `pytest tests/unit -q`,
  `pytest tests/concurrency -q`, `scripts/lint.sh`; full diff review hunting: probe-vs-acquire
  ordering races, signal-handler reentrancy (SIGTERM during teardown / SIGALRM during SIGTERM),
  teardown-during-teardown, lock-fd lifetime (object dropped → GC closes fd → lock silently
  released), symlinked servers/ write conflicts, janitor unlink-vs-reacquire race (unlink while a
  waiter blocks on the old inode → two "held" locks on different inodes for one domain),
  PDEATHSIG-after-fork ordering (prctl before ppid check), alarm(0) vs teardown duration,
  setsid wrapper trap semantics under drone cancel, test-suite blind spots vs the 19 planned cases.
 - Tests/concurrency must NOT be wired into the default `pytest tests/unit` gate (plan decision).
 - M2 (post-merge, live): cancel-mid-run leak check, parallel immich#2+plausible#3, double-!testme
  same PR blocks visibly, one full green run. NEVER merge/push recipe mirror repos.
 No verdict yet — waiting for Builder bootstrap/claim.
--- a/STATUS-conc.md
+++ b/STATUS-conc.md
@@ -0,0 +1,19 @@
 # STATUS — sub-phase conc (concurrency restructure)
 Plan: /srv/cc-ci/cc-ci-plan/concurrency-restructure-full-plan.md (SSOT for this phase)
 ## Phase state
 - Phase: conc — concurrency restructure (P1–P5 + tests/concurrency)
 - Builder branch: `restructure/concurrency` (code lands there; main untouched until M2 merge)
 - In flight: P1 (lock-lifetime hardening)
 - Gate: none claimed yet
 ## Gates
 - M1 (implementation verified): NOT CLAIMED
 - M2 (merged + live-verified): NOT CLAIMED — blocked on M1 PASS
 ## Blockers
 (none)
--- a/bridge/bridge.py
+++ b/bridge/bridge.py
@@ -64,6 +64,8 @@ def parse_trigger(body):
    if s == f"{TRIGGER} --quick":
        return True, True
    return False, False
 ALLOWLIST = {u.strip() for u in os.environ.get("AUTH_ALLOWLIST", "").split(",") if u.strip()}
@@ -167,8 +169,12 @@ def post_commit_status(owner, repo, sha, state, target_url, description=""):
        f"{GITEA_API}/repos/{owner}/{repo}/statuses/{sha}",
        GITEA_TOKEN,
        method="POST",
-        data={"state": state, "target_url": target_url,
+        data={
-              "description": description, "context": "cc-ci/testme"},
+            "state": state,
            "target_url": target_url,
            "description": description,
            "context": "cc-ci/testme",
        },
    )
@@ -217,7 +223,9 @@ def result_comment_body(recipe, sha, num, run_url, status):
        if artifact_available(badge_url):
            body += f"\n\n[![level]({badge_url})]({run_url})"
        return f"{body}\n\n{links}"
-    return f"{header} → {run_url}\n\n_(summary card unavailable — see the run for details.)_ {links}"
+    return (
        f"{header} → {run_url}\n\n_(summary card unavailable — see the run for details.)_ {links}"
    )
 def watch_and_reflect(owner, name, number, num, recipe, sha, comment_id, run_url):
--- a/dashboard/dashboard.py
+++ b/dashboard/dashboard.py
@@ -66,8 +66,13 @@ _COLORS = {
 # Level → colour ramp, kept in sync with runner/harness/card.py LEVEL_COLOR (the dashboard is a
 # standalone stdlib service that doesn't import the runner harness, so the small map is duplicated).
 _LEVEL_COLOR = {
-    0: "#e5534b", 1: "#e0823d", 2: "#e0823d", 3: "#d9b343",
+    0: "#e5534b",
-    4: "#a0b93f", 5: "#57ab5a", 6: "#3fb950",
+    1: "#e0823d",
    2: "#e0823d",
    3: "#d9b343",
    4: "#a0b93f",
    5: "#57ab5a",
    6: "#3fb950",
 }
@@ -269,7 +274,11 @@ def _card(r):
            f'<a class="shot" href="{run_url}" title="open run">'
            f'<span class="ph">no screenshot</span>{_level_pill(r["level"])}</a>'
        )
-    cap = f'<div class="cap">{html.escape(r["level_cap_reason"])}</div>' if r["level_cap_reason"] else ""
+    cap = (
        f'<div class="cap">{html.escape(r["level_cap_reason"])}</div>'
        if r["level_cap_reason"]
        else ""
    )
    return (
        f'<div class="card">{shot}<div class="body">'
        f'<div class="name">{html.escape(r["recipe"])}</div>'
@@ -307,7 +316,11 @@ def render_history(recipe, rows):
    trs = []
    for r in rows:
        color = _COLORS.get(r["status"], "#8b949e")
-        lvl = "—" if r["level"] is None else f'<b style="color:{level_color(r["level"])}">L{int(r["level"])}</b>'
+        lvl = (
            "—"
            if r["level"] is None
            else f'<b style="color:{level_color(r["level"])}">L{int(r["level"])}</b>'
        )
        shot = f'<a href="/runs/{r["number"]}/summary.png">card</a>' if r["has_screenshot"] else "—"
        trs.append(
            f'<tr><td><a href="{html.escape(r["url"])}">#{r["number"]}</a></td>'
@@ -317,7 +330,7 @@ def render_history(recipe, rows):
        )
    body = "\n".join(trs) or '<tr><td colspan="6">no runs for this recipe yet</td></tr>'
    inner = (
-        f'<h1>{_FLOWER} {html.escape(recipe)} — run history</h1>'
+        f"<h1>{_FLOWER} {html.escape(recipe)} — run history</h1>"
        '<p class="sub"><a href="/">← all recipes</a> · every <code>!testme</code> run, newest first.</p>'
        "<table><thead><tr><th>Run</th><th>Status</th><th>Level</th><th>Version</th>"
        "<th>When</th><th>Card</th></tr></thead><tbody>"
--- a/docs/concurrency.md
+++ b/docs/concurrency.md
@@ -0,0 +1,236 @@
 # Concurrency: how parallel recipe CI runs stay safe
 Spec of the concurrent-run system after the 2026-06-10 restructure (branch
 `restructure/concurrency`; plan: cc-ci-plan `concurrency-restructure-full-plan.md`). The previous
 registry + per-recipe-flock model is documented in this file's git history (`5b65c6c`).
 ## 1. Goal and design summary
 Two recipe CI builds may run **at the same time** on the single cc-ci host. Safety is enforced by
 the **harness**, not by serialising everything, and rests on ONE locking mechanism plus ONE
 structural isolation:
 | Rule | Mechanism |
 |---|---|
 | Different recipes run in parallel | nothing blocks them (isolation, §3) |
 | Same-RECIPE runs run in parallel too | per-run `ABRA_DIR` recipe trees (§4) — no shared tree, no lock |
 | Same-DOMAIN runs (double-`!testme` of one PR) serialise | per-app-domain `flock` (§5) |
 | A starting run never reaps a live concurrent run's app | janitor probes the app lock; held = live (§6) |
 | A crashed/canceled/rebooted run's leftovers get reaped | lock auto-released by the kernel → probe acquires → reap (§6) |
 The invariant chain that makes "held lock = live owner" sound:
 ```
 lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min hard deadline
 ```
 - **lock ⊆ process**: locks are kernel flocks on fds the process holds (and PEP 446 makes those
  fds non-inheritable, so abra/docker/pytest children never carry them). The kernel releases them
  on process death, however it dies. There is no unlock code path and no stale-lock failure mode.
 - **process ⊆ step**: `PR_SET_PDEATHSIG(SIGTERM)` + the `.drone.yml` setsid/trap wrap (§2) — a
  dead or canceled build cannot leak a running harness.
 - **step ⊆ 60 min**: `signal.alarm(3600)` self-deadline (§2).
 Never steal a held lock; manage the holder's lifetime. There is **no daemon and no shared state
 service** — everything is kernel/file primitives under `/run/lock` and per-run directories.
 ## 2. Mechanism 0: run-lifetime hardening (`runner/harness/lifetime.py`)
 `run_recipe_ci.main()` calls `lifetime.install_lifetime_guards()` before ANY abra call or lock
 acquisition:
 1. **`PR_SET_PDEATHSIG(SIGTERM)`** (ctypes prctl, return code checked): if the parent — the drone
   step shell — dies, the kernel TERMs the harness. A post-prctl `ppid == 1` re-check closes the
   start race: a harness whose parent died *before* the prctl armed would never get the signal,
   so it refuses to run orphaned.
 2. **SIGTERM handler**: logs, then raises `SystemExit(143)` so the run's `finally:` teardown
   funnel executes and the process exits non-zero. Re-entrant signals during teardown are logged
   and IGNORED (`lifetime.begin_teardown()`, also set at the top of the run's `finally:` blocks)
   so a second signal can't abort the cleanup the first one asked for.
 3. **`signal.alarm(3600)` hard deadline**: SIGALRM funnels into the same teardown path with a
   distinct log line (`== run exceeded 60-minute hard deadline — tearing down ==`), exit 142.
   Recipes keep their own smaller per-tier timeouts; this bounds the whole run. Teardown time
   after the deadline is deliberately not alarm-bounded — the janitor is the backstop if a
   teardown wedges and the process is killed harder.
 The `.drone.yml` recipe-ci step runs the harness as `setsid cc-ci-run … &` with a
 `trap 'kill -TERM -- "-$PID"' TERM EXIT; wait "$PID"` — a drone **cancel** (TERM to the step
 shell) is forwarded to the harness's whole process group instead of leaking it (the exec runner
 only kills the step shell). PDEATHSIG backstops the no-trap paths.
 ## 3. Isolation model: what is shared, what is per-run
 Per-run (no conflict possible):
 - **App + stack + volumes + secrets.** Run app domain = `naming.app_domain()` →
  `<recipe[:4]>-<sha1(recipe|pr|ref)[:6]>.ci.commoninternet.net`, unique per (recipe, pr, ref);
  everything abra creates is namespaced by it. Run apps are recognised by
  `RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$`; warm/canonical apps
  (e.g. `warm-keycloak...`) deliberately do NOT match → the janitor never probes them.
 - **Recipe working trees** — `$ABRA_DIR/recipes/<recipe>`, per run (§4). NEW in the restructure.
 - **Drone build workspace** (`/var/lib/drone-runner/drone-<id>/`) and **run artifacts**
  (`/var/lib/cc-ci-runs/<run-id>/`).
 - **Run-scoped state files** (`/tmp/ccci-{deploys,opstate,deps,depskip}-<run-id>-<pid>…`) —
  keyed by run id + harness pid via `run_recipe_ci._run_state_path()`, NEVER by app domain.
  A second run of the same domain executes its `main()` preamble before blocking at the app
  lock (§5), so domain-keyed files would be reset/removed underneath the live first run
  (live finding, M2(c) double-`!testme`: false DG4.1 deploy-count in run 1, countfile
  `FileNotFoundError` in run 2). Tier/hook children get the exact paths via the
  `CCCI_*_FILE` env vars; removed on normal run exit.
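
A minimal sketch of the run-keyed path rule from the last bullet, assuming a helper shaped like `_run_state_path()`. The signature, the `tmp_dir` parameter and the `STATE_KINDS` tuple are hypothetical, and the trailing part of the real filename (elided as `…` above) is deliberately left out.

```python
import os

# The four state-file kinds named in the text.
STATE_KINDS = ("deploys", "opstate", "deps", "depskip")


def run_state_path(kind, run_id, pid=None, tmp_dir="/tmp"):
    """Return a state-file path keyed by run id + harness pid, never by domain."""
    if kind not in STATE_KINDS:
        raise ValueError(f"unknown state kind: {kind!r}")
    pid = os.getpid() if pid is None else pid
    # No app domain appears in the name, so two runs of the same domain
    # (a double !testme) can never share or delete each other's files.
    return os.path.join(tmp_dir, f"ccci-{kind}-{run_id}-{pid}")
```

Because the domain is absent from the key, the second run's preamble can reset or remove only its own files while it waits on the app lock.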
 Shared (by design, conflict-free):
 - **`/root/.abra/servers`** — app `.env` files, one per domain. The per-run `ABRA_DIR` symlinks
  `servers/` here, so .env files land in the canonical path: janitor discovery (`abra app ls`)
  and out-of-run tooling see every app. Per-domain filenames + the app-domain lock prevent write
  conflicts.
 - **`/root/.abra/catalogue`** — read-mostly, symlinked into each per-run dir.
 - **`HOME=/root`** (forced in `.drone.yml`) — safe: nothing recipe-mutable lives under `~/.abra`
  for a run anymore except through the two symlinks above.
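
The run-app naming rule from the per-run list above can be illustrated like this. It is a sketch, not `naming.app_domain()` itself: the exact byte string fed to SHA-1 (here the three fields joined with a literal `|`) is an assumption read off the `sha1(recipe|pr|ref)` notation, and the warm-app hostname in the usage note is a made-up example.

```python
import hashlib
import re

# Pattern quoted from the text: only run apps match, warm/canonical apps do not.
RUN_APP_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$")


def app_domain(recipe, pr, ref):
    """Derive a per-(recipe, pr, ref) domain; the hash input format is assumed."""
    digest = hashlib.sha1(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}.ci.commoninternet.net"


def is_run_app(domain):
    """True only for domains the janitor is allowed to probe."""
    return RUN_APP_RE.match(domain) is not None
```

With these definitions `is_run_app(app_domain("immich", 2, "abc123"))` holds, while a hypothetical `warm-keycloak.ci.commoninternet.net` fails the six-hex-digit part of the pattern and is never probed.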
 ## 4. Mechanism 1: per-run `ABRA_DIR` (replaces the per-recipe flock)
 `run_recipe_ci.setup_run_abra_dir()` — called first thing in `main()`, before any abra call —
 builds `<runs_dir>/<run-id>/abra/` (run-id = Drone build number; `manual-<pid>` for hand runs):
 ```
 abra/
  servers/    -> /root/.abra/servers     (symlink; canonical shared .env path)
  catalogue/  -> /root/.abra/catalogue   (symlink; read-mostly)
  recipes/    fresh, empty               (THE isolation that matters)
 ```
 and exports it as `$ABRA_DIR` — honored by the abra CLI itself and by every harness path helper
 (`abra.abra_dir()` / `abra.recipe_dir()`; `generic._recipe_dir`, `prepull_images`,
 `snapshot_recipe_tests`, `warm_reconcile._recipe_dir` all route through the same rule:
 `$ABRA_DIR` if set, else `~/.abra`).
 - `fetch_recipe()` is now a plain clone into `$ABRA_DIR/recipes/<recipe>` (PR-head clone+checkout
  or `abra recipe fetch`); the upgrade tier's mid-run `git checkout`s happen in the run's own
  tree. Two same-recipe runs can no longer corrupt each other — structurally, with no lock. The
  old observed failure (immich builds 229/230 deploying a tree missing its config) is impossible.
 - `CCCI_SKIP_FETCH=1` (test/Adversary staging) copies the canonically-staged
  `~/.abra/recipes/<recipe>` clone into the per-run tree.
 - Out-of-run flows (warm_reconcile's systemd timer, manual abra) set no `ABRA_DIR` and keep using
  the canonical `/root/.abra` unchanged. In-run flows that touch canonical state on purpose
  (warm/canonical .env files) go through `servers/` and are unaffected.
 - The per-run dir rides along the existing `/var/lib/cc-ci-runs/<run-id>/` retention. abra
  auto-clones any recipe it needs to resolve (e.g. during `app ls`) into the per-run `recipes/` —
  a few seconds of git per run, gone with the run dir.
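
The per-run layout and the "`$ABRA_DIR` if set, else `~/.abra`" rule can be sketched as below. The function names mirror those in the text, but the signatures (explicit `runs_dir` and `canonical` arguments) are assumptions made so the example can run against a temporary directory.

```python
import os
from pathlib import Path


def setup_run_abra_dir(runs_dir, run_id, canonical):
    """Build <runs_dir>/<run_id>/abra and export it as ABRA_DIR (sketch)."""
    abra = Path(runs_dir) / str(run_id) / "abra"
    # recipes/ is a real, empty directory: the per-run isolation.
    (abra / "recipes").mkdir(parents=True, exist_ok=True)
    # servers/ and catalogue/ are symlinks to the shared canonical paths.
    for shared in ("servers", "catalogue"):
        link = abra / shared
        if not link.is_symlink():
            link.symlink_to(Path(canonical) / shared)
    os.environ["ABRA_DIR"] = str(abra)
    return abra


def abra_dir():
    """Resolve the abra root: $ABRA_DIR if set, else ~/.abra."""
    return Path(os.environ.get("ABRA_DIR") or Path.home() / ".abra")


def recipe_dir(recipe):
    """Path of a recipe working tree under the resolved abra root."""
    return abra_dir() / "recipes" / recipe
```

A file written through the run's `servers/` symlink lands in the canonical directory, which is why janitor discovery and out-of-run tooling still see every app.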
 ## 5. Mechanism 2: per-app-domain flock (`lifecycle.acquire_app_lock`)
 - Lock file: `/run/lock/cc-ci-app-<domain>.lock` (dir overridable via `CCCI_APP_LOCK_DIR` for the
  test suite), exclusive `fcntl.flock`, taken in `deploy_app()` **before the app is created** — a
  concurrent janitor can never see a run app without its held lock.
 - Blocks (with a log line: `== app lock: another run of <domain> is in flight — waiting ==`) when
  another run of the SAME domain is in flight — the double-`!testme` serialisation point; the
  waiting run is visibly parked at that line in its drone log, by design.
 - The returned file object is ALSO retained in module-level `_held_app_locks` — if a caller
  dropped it, GC would close the fd and silently release the lock.
 - mtime is touched at acquisition: lock age feeds the janitor's long-held flag (§6).
 - **Unlink/recreate race guard**: the janitor unlinks reaped lockfiles, so after EVERY
  acquisition the locked fd is verified to still be the inode the path names
  (`fstat().st_ino == stat().st_ino`); a waiter that won a just-unlinked inode closes it and
  retries on the live path. (A lock on an unlinked inode protects nothing: a later opener gets a
  fresh inode and would acquire "the same" lock.)
 - Release is implicit: process exit (any kind). `teardown_app()` does NOT release or unlink —
  a clean run's leftover lockfile is unheld and is unlinked on sight by the next janitor sweep.
 ## 6. The flock-probe janitor (`lifecycle.janitor`)
 Runs at every run start (cold + quick paths) and in the warm/upgrade sweeps. Candidate discovery
 is unchanged from the old model: `abra app ls` + a docker-service sweep (catches stacks whose
 `.env` is already gone), both matched against `RUN_APP_RE` — warm/canonical apps never match and
 are never probed.
 Decision table (per candidate domain, `_probe_and_reap`):
 | Probe (`LOCK_EX\|LOCK_NB`) | Meaning | Action |
 |---|---|---|
 | acquires (+ inode identity OK) | nobody holds it → owner died (kernel-guaranteed) | **reap**: `teardown_app(verify=False)` WHILE HOLDING the probe lock, then unlink the lockfile, then release |
 | acquires, inode stale | another janitor reaped + unlinked while we raced | skip (reap already done; unlinking now would hit a newer run's file) |
 | `BlockingIOError` (held) | live concurrent run | leave it; if lockfile mtime > 120 min (2× the hard deadline): `!! lock for <domain> held >120min — possible leaked run; inspect with lslocks` — flag, **never steal** |
 | `open()` fails (`OSError`) | garbled/unopenable lockfile | skip + log, never crash |
 - Reaping under the probe lock closes the janitor-vs-new-run race: a new run of that domain
  blocks in `acquire_app_lock` until the reap finishes — no window where a fresh app coexists
  with a half-reaped one.
 - Two racing janitors arbitrate on the flock: one reaps, the other sees "held" and leaves; reaps
  are idempotent (`teardown_app(verify=False)` tolerates half-gone stacks).
 - After the candidates, a tidy sweep unlinks stale **unheld** `cc-ci-app-*.lock` files with no
  app behind them (under their own probe lock + identity check), keeping `/run/lock` clean.
 - **Post-reboot**: `/run/lock` is tmpfs → lockfiles gone → every surviving app probes as an
  orphan → reaped immediately. (Improvement over the old 2-hour age fallback; there IS no age
  logic anymore.)
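
The decision table above can be condensed into one function. This is a hedged sketch rather than `_probe_and_reap` itself: the string return values and the `reap` callback (standing in for `teardown_app(verify=False)`) are hypothetical, and logging is left out.

```python
import fcntl
import os
import time

LONG_HELD_LOCK_SECONDS = 2 * 3600  # 2x the hard deadline, as in the text


def probe_and_reap(lock_path, reap):
    """Probe one lockfile; reap only if nobody holds it (sketch)."""
    try:
        fh = open(lock_path, "a")
    except OSError:
        return "unopenable"  # garbled lockfile: skip, never crash
    with fh:
        try:
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Held means a live run: leave it, at most flag a long holder.
            try:
                age = time.time() - os.stat(lock_path).st_mtime
            except OSError:
                age = 0
            return "held-long" if age > LONG_HELD_LOCK_SECONDS else "held"
        try:
            live = os.fstat(fh.fileno()).st_ino == os.stat(lock_path).st_ino
        except FileNotFoundError:
            live = False
        if not live:
            return "stale"  # another janitor already reaped and unlinked
        reap()                # teardown while still holding the probe lock
        os.unlink(lock_path)  # then unlink; leaving the with-block releases
        return "reaped"
```

Opening with mode `"a"` creates a missing lockfile, which matches the post-reboot case: the probe acquires immediately and the surviving app is reaped.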
 ## 7. Failure-mode guarantees
 | Event | Outcome |
 |---|---|
 | Run crashes / SIGKILL mid-run | flock auto-released by kernel → next janitor probe reaps app + lockfile |
 | Drone build canceled via API | step trap TERMs the harness process group → SIGTERM funnel runs the run's own teardown (exit 143); if anything still leaks, PDEATHSIG + janitor reap (the old "cancel leaks the harness" gap is CLOSED) |
 | Run exceeds 60 min | SIGALRM → distinct log line → own teardown → exit 142 |
 | Host reboot | locks and lockfiles vanish (tmpfs, correct: no owners survived) → all surviving run apps reaped at the next run start, immediately |
 | Two same-recipe `!testme`s (different PRs) | run in parallel — separate domains, separate per-run recipe trees |
 | Double-`!testme` (same PR → same domain) | second blocks on the app lock before creating anything, visibly in its drone log, runs after the first finishes |
 | Janitor vs. app being created | impossible to mis-reap: the lock is held before `app new`, and a held lock is never touched |
 | Janitor unlink vs. blocked waiter | inode identity re-check on every acquisition → waiter retries on the live path |
 | Lock held implausibly long (>120 min) | flagged loudly for a human (`lslocks`), never stolen |
 ## 8. Where convergence fits (adjacent; unchanged by the restructure)
 Two swarm-convergence behaviors in `services_converged()` look like concurrency bugs but aren't —
 any future work must keep them fixed:
 - **N/N replicas ≠ converged** during a stop-first rolling update — `UpdateStatus.State` is also
  inspected (build 238: backupbot exec'd into a container killed seconds later).
 - **`paused` persists forever** (swarm's default `update-failure-action`) — only `updating` and
  `rollback_started` block convergence; `paused`/`rollback_paused` are settled (build 241).
 - `backup_app()` additionally waits (bounded 300s) for convergence before `backup create`.
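
The two convergence rules reduce to a small predicate. The sketch below assumes the replica counts and the `UpdateStatus.State` string have already been extracted from the service; the function name and signature are hypothetical and do not reproduce `services_converged()`.

```python
# Only these update states mean swarm is still actively changing the service.
BLOCKING_UPDATE_STATES = {"updating", "rollback_started"}


def service_settled(running, desired, update_state=None):
    """True when replicas match AND no update or rollback is in progress."""
    if update_state in BLOCKING_UPDATE_STATES:
        return False  # N/N replicas during a rolling update is not converged
    # paused / rollback_paused persist forever, so they count as settled.
    return running == desired
```

For instance, `service_settled(1, 1, "updating")` is false even though the replica count matches, while `service_settled(1, 1, "paused")` is true.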
 ## 9. Configuration knobs
 | Knob | Where | Current | Meaning |
 |---|---|---|---|
 | `DRONE_RUNNER_CAPACITY` (aka `MAX_TESTS`) | `nix/modules/drone-runner.nix` (`maxTests`) | `2` | **THE single concurrency knob.** Max builds the exec runner executes at once; Drone queues the rest. (The `.drone.yml` `concurrency.limit` duplicate was removed.) Change requires `nixos-rebuild switch`. |
 | `CCCI_APP_LOCK_DIR` | env, read at call time | unset → `/run/lock` | App-domain lockfile dir override — used by `tests/concurrency` to sandbox locks. Never set in production. |
 | hard deadline | `lifetime.HARD_DEADLINE_SECONDS` | 3600 s | the whole-run alarm; long-held flag threshold is 2× this (`LONG_HELD_LOCK_SECONDS`) |
 ## 10. Testing: `tests/concurrency/`
 Real-kernel suite (19 planned cases + companions): helper subprocesses hold REAL flocks and
 install the REAL prctl/signal/alarm guards — flock itself is never mocked; the janitor runs with
 injected candidates + stubbed teardown but probes real locks. **Not part of the default
 `pytest tests/unit` gate** (it spawns processes and sleeps); run it explicitly:
 ```
 cc-ci-run -m pytest tests/concurrency -q
 ```
 Covers: kernel auto-release on SIGKILL; LOCK_NB probe semantics; PEP 446 fd non-inheritance;
 same-domain serialisation; orphan reap + unlink; live-run protection; reap-under-probe-lock
 blocking; two-janitor arbitration; reboot-immediate reap; long-held flag; RUN_APP_RE allowlist;
 degrade-on-garbage; PDEATHSIG; ppid start race; deadline + SIGTERM funnels; per-run ABRA_DIR
 construction/export; concurrent same-recipe fetch isolation; symlinked-servers .env canonicality;
 run-keyed (never domain-keyed) run-scoped state files (M2(c) regression, `test_run_state.py`).
 ## 11. File / symbol index
 | What | Where |
 |---|---|
 | lifetime guards (PDEATHSIG, signal funnels, deadline) | `runner/harness/lifetime.py`; installed in `run_recipe_ci.main()` |
 | setsid/trap cancel forwarding | `.drone.yml` (`recipe-ci` step) |
 | `acquire_app_lock`, `_held_app_locks`, `_app_lock_path` | `runner/harness/lifecycle.py` |
 | `acquire_app_lock` call site | `lifecycle.deploy_app()` (before app creation) |
 | janitor + probe (`janitor`, `_probe_and_reap`, `LONG_HELD_LOCK_SECONDS`) | `runner/harness/lifecycle.py` |
 | per-run ABRA_DIR (`setup_run_abra_dir`, `fetch_recipe`) | `runner/run_recipe_ci.py` |
 | path resolution (`abra_dir`, `recipe_dir`) | `runner/harness/abra.py` (used by `generic`, `lifecycle.prepull_images`, `warm_reconcile`) |
 | run-app naming | `runner/harness/naming.py` (`app_domain`), `RUN_APP_RE` in `lifecycle.py` |
 | capacity knob | `nix/modules/drone-runner.nix` (`maxTests`) |
 | convergence (adjacent) | `lifecycle.services_converged()`, `lifecycle.backup_app()` |
 | the test suite | `tests/concurrency/` (`helpers.py` subprocess entrypoints, `concutil.py` probes) |
 Deleted in the restructure (grep should find NOTHING): `register_run_app`, `unregister_run_app`,
 `_run_owner_state`, `ACTIVE_RUN_DIR`, `CCCI_JANITOR_MAX_AGE`, `_stack_age_seconds`,
 `acquire_recipe_lock`, `RECIPE_LOCK_DIR`.
--- a/flake.nix
+++ b/flake.nix
@@ -31,34 +31,36 @@
      ];
    in
    {
-      # Canonical live host target: the Hetzner cc-ci server.
+      nixosConfigurations = {
-      # Use `.#cc-ci` for the current production host.
+        # Canonical live host target: the Hetzner cc-ci server.
-      nixosConfigurations.cc-ci = nixpkgs.lib.nixosSystem {
+        # Use `.#cc-ci` for the current production host.
-        inherit system;
+        cc-ci = nixpkgs.lib.nixosSystem {
-        modules = [
+          inherit system;
-          sops-nix.nixosModules.sops
+          modules = [
-          ./nix/hosts/cc-ci-hetzner/configuration.nix
+            sops-nix.nixosModules.sops
-        ];
+            ./nix/hosts/cc-ci-hetzner/configuration.nix
-      };
+          ];
        };
-      # Legacy Incus VM host definition retained only for historical comparison and fallback.
+        # Legacy Incus VM host definition retained only for historical comparison and fallback.
-      # Do NOT use this target on the live Hetzner server.
+        # Do NOT use this target on the live Hetzner server.
-      nixosConfigurations.cc-ci-incus = nixpkgs.lib.nixosSystem {
+        cc-ci-incus = nixpkgs.lib.nixosSystem {
-        inherit system;
+          inherit system;
-        modules = [
+          modules = [
-          sops-nix.nixosModules.sops
+            sops-nix.nixosModules.sops
-          ./nix/hosts/cc-ci/configuration.nix
+            ./nix/hosts/cc-ci/configuration.nix
-        ];
+          ];
-      };
+        };
-      # Explicit alias for the live Hetzner host. Kept alongside `cc-ci` so the intended host target
+        # Explicit alias for the live Hetzner host. Kept alongside `cc-ci` so the intended host
-      # remains obvious in recovery/migration workflows.
+        # target remains obvious in recovery/migration workflows.
-      nixosConfigurations.cc-ci-hetzner = nixpkgs.lib.nixosSystem {
+        cc-ci-hetzner = nixpkgs.lib.nixosSystem {
-        inherit system;
+          inherit system;
-        modules = [
+          modules = [
-          sops-nix.nixosModules.sops
+            sops-nix.nixosModules.sops
-          ./nix/hosts/cc-ci-hetzner/configuration.nix
+            ./nix/hosts/cc-ci-hetzner/configuration.nix
-        ];
+          ];
+        };
      };
      devShells.${system} = {
--- a/nix/hosts/cc-ci-hetzner/configuration.nix
+++ b/nix/hosts/cc-ci-hetzner/configuration.nix
@ -7,7 +7,7 @@
 #   git clone --recursive https://git.autonomic.zone/recipe-maintainers/cc-ci.git /etc/cc-ci
 #   install -m600 <age-private-key> /var/lib/sops-nix/key.txt
 #   nixos-rebuild switch --flake /etc/cc-ci#cc-ci-hetzner
-{ pkgs, lib, ... }:
+{ pkgs, ... }:
 {
  imports = [
    ./hardware.nix
--- a/nix/hosts/cc-ci-hetzner/hardware.nix
+++ b/nix/hosts/cc-ci-hetzner/hardware.nix
@ -11,13 +11,17 @@
 {
  imports = [ (modulesPath + "/profiles/qemu-guest.nix") ];
-  boot.loader = {
+  boot = {
-    efi.efiSysMountPoint = "/boot/efi";
+    loader = {
-    grub = {
+      efi.efiSysMountPoint = "/boot/efi";
-      efiSupport = true;
+      grub = {
-      efiInstallAsRemovable = true;
+        efiSupport = true;
-      device = "nodev";
+        efiInstallAsRemovable = true;
+        device = "nodev";
+      };
     };
+    initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "xen_blkfront" "vmw_pvscsi" ];
+    initrd.kernelModules = [ "nvme" ];
  };
  fileSystems."/boot/efi" = {
@ -25,9 +29,6 @@
    fsType = "vfat";
  };
-  boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "xen_blkfront" "vmw_pvscsi" ];
-  boot.initrd.kernelModules = [ "nvme" ];
  fileSystems."/" = {
    device = "/dev/sda1";
    fsType = "ext4";
--- a/nix/modules/drone-runner.nix
+++ b/nix/modules/drone-runner.nix
@ -8,14 +8,19 @@
 { pkgs, config, lib, ... }:
 let
  # MAX_TESTS (plan §4.2/§4.3 resource safety): max CI builds the exec runner runs at once. Drone
-  # queues the rest in its native pending-build queue (no custom queue). THE concurrency cap that
+  # queues the rest in its native pending-build queue (no custom queue). THE SINGLE concurrency
-  # bounds how many test apps can be live at once — kept LOW (1) on this single 28GiB node since
+  # knob — nothing else caps recipe-ci parallelism (the .drone.yml concurrency.limit was removed:
-  # recipes are heavy (immich/matrix large volumes). With capacity=1 there is never a concurrent
+  # one knob, one place). Bounds how many test apps can be live at once.
-  # in-flight run, so the run-start janitor can safely reap *any* orphan (a SIGKILL'd build runs no
+  #
-  # teardown) and the "at most MAX_TESTS apps live" bound holds exactly. Raise to 2 only if the node
+  # Raised to 2 (operator request 2026-06-09) so two recipes can be tested in parallel (e.g. immich
-  # is shown to handle two light recipes at once (then the janitor MUST stay age-based to avoid
+  # and plausible under active development at once). Verified safe on the current node (Hetzner cpx22,
-  # reaping a concurrent run — see DECISIONS.md "Resource safety").
+  # ~7.6 GiB / 4 vCPU — NOTE: smaller than the original 28 GiB this was written for): a full immich CI
-  maxTests = "1";
+  # stack measured ~1 GiB (server+ML+pg+redis) with multiple GiB free, so two concurrent recipes fit.
+  # Concurrent-run safety is the harness's job at ANY capacity (docs/concurrency.md): per-run
+  # ABRA_DIR recipe trees, per-app-domain flocks, and a flock-probe janitor that reaps a crashed
+  # build's orphan immediately (held lock = live run, never touched). Revert to "1" if OOM /
+  # disk-I/O contention is observed under load.
+  maxTests = "2";
 in
 {
  # Drone ships under the Polyform Small Business license (nixpkgs marks it unfree);
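Reviewer note: the drone-runner.nix comment above relies on a "flock-probe janitor" (held lock = live run, acquirable lock = crashed run's orphan). A minimal Python sketch of that probe follows, under stated assumptions: the function names `hold_app_lock` and `lock_is_held` are hypothetical, and the sketch leaves out the unlink/recreate inode check that the harness's `acquire_app_lock` docstring describes.

```python
import fcntl
import os


def hold_app_lock(path):
    """Take an exclusive flock on `path`; the caller must keep the returned file open.

    The kernel drops the lock when the file is closed or the process dies,
    so a SIGKILL cannot leave a stale claim behind.
    """
    f = open(path, "a")
    fcntl.flock(f, fcntl.LOCK_EX)
    return f


def lock_is_held(path):
    """Non-blocking probe: True if some live holder has the flock on `path`."""
    if not os.path.exists(path):
        return False  # no lockfile at all: nothing can be holding it
    with open(path, "a") as probe:
        try:
            fcntl.flock(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True  # held elsewhere: a live run, leave it alone
        # Acquired, so nobody held it; release before returning.
        fcntl.flock(probe, fcntl.LOCK_UN)
        return False
```

Because flock ownership follows the open file description, the probe never needs to inspect pids: a janitor built this way would reap an app only when `lock_is_held` returns False for its lockfile.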
--- a/nix/modules/nightly-sweep.nix
+++ b/nix/modules/nightly-sweep.nix
@ -29,7 +29,7 @@ in
    serviceConfig = {
      Type = "oneshot";
      # A full sweep across several recipes (each a cold deploy/test/teardown) is long; bound it.
-      TimeoutStartSec = "21600";  # 6h ceiling
+      TimeoutStartSec = "21600"; # 6h ceiling
      ExecStart = "${sweep}/bin/cc-ci-nightly-sweep";
    };
  };
@ -39,7 +39,7 @@ in
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnCalendar = "*-*-* 03:00:00";
-      Persistent = true;   # catch up a missed nightly after downtime
+      Persistent = true; # catch up a missed nightly after downtime
      RandomizedDelaySec = "600";
    };
  };
--- a/nix/modules/reports.nix
+++ b/nix/modules/reports.nix
@ -3,10 +3,49 @@
 # no secrets — just static files behind traefik + the wildcard TLS (same pattern as dashboard.nix,
 # but a plain nginx:alpine since there's nothing to render server-side). Content is updated by writing
 # files into /var/lib/cc-ci-reports; nginx serves them live (no redeploy needed).
+#
+# It ALSO serves a same-origin realtime PR-status proxy at /pr/<recipe>/<n>: the report's STATUS
+# column fetches it client-side to show each PR's live state (open vs. ✓). Same-origin means no
+# dependency on the Gitea CORS allow-list; the recipe mirrors are public so no token is needed. The
+# proxy is pinned to recipe-maintainers + a safe recipe-name charset and is read-only (GET/HEAD).
 { pkgs, ... }:
 let
  reportsDir = "/var/lib/cc-ci-reports";
+  # Custom nginx server: static report files + the /pr/<recipe>/<n> → Gitea-API proxy. Replaces the
+  # stock /etc/nginx/conf.d/default.conf (which the image's nginx.conf includes inside http{}).
+  nginxConf = pkgs.writeText "cc-ci-reports-default.conf" ''
+    server {
+        listen 80;
+        server_name _;
+        root /usr/share/nginx/html;
+        index index.html;
+        # Realtime PR-status proxy for the Recipe Report STATUS column.
+        # GET /pr/<recipe>/<n> -> the PUBLIC Gitea PR JSON ({state, merged, ...}). Same-origin from
+        # the browser's view, so no CORS dependency; unauthenticated, since the recipe mirrors are
+        # public. The repo owner is hard-pinned to recipe-maintainers and the recipe name to a
+        # slashless charset, so the proxied path can only ever address recipe-maintainers/<name>/pulls
+        # (it cannot be coerced to another org or path). Only safe read methods are allowed.
+        location ~ ^/pr/([a-z0-9._-]+)/([0-9]+)$ {
+            limit_except GET HEAD { deny all; }
+            resolver 127.0.0.11 ipv6=off valid=30s;   # docker embedded DNS (forwards external names)
+            proxy_ssl_server_name on;
+            proxy_set_header Host git.autonomic.zone;
+            proxy_set_header Accept "application/json";
+            proxy_pass https://git.autonomic.zone/api/v1/repos/recipe-maintainers/$1/pulls/$2;
+            proxy_intercept_errors off;
+            proxy_connect_timeout 5s;
+            proxy_read_timeout 10s;
+            add_header Cache-Control "no-store" always;  # always fetch live state, never cache in the browser
+        }
+        location / {
+            try_files $uri $uri/ =404;
+        }
+    }
+  '';
  stack = pkgs.writeText "cc-ci-reports-stack.yml" ''
    version: "3.8"
    services:
@ -17,6 +56,10 @@ let
            source: ${reportsDir}
            target: /usr/share/nginx/html
            read_only: true
+          - type: bind
+            source: ${nginxConf}
+            target: /etc/nginx/conf.d/default.conf
+            read_only: true
        networks:
          - proxy
        deploy:
--- a/runner/harness/abra.py
+++ b/runner/harness/abra.py
@ -10,6 +10,7 @@ Bakes in the known abra gotchas (re-verify per installed abra version, currently
 from __future__ import annotations
 import json
+import os
 import subprocess
 ABRA = "abra"
@ -19,6 +20,20 @@ class AbraError(RuntimeError):
    pass
+def abra_dir() -> str:
+    """abra's state dir, resolved the same way the abra CLI resolves it: $ABRA_DIR if set, else
+    ~/.abra. Inside a CI run, run_recipe_ci exports a PER-RUN $ABRA_DIR (fresh recipes/, shared
+    servers/+catalogue/ symlinks) before any abra call, so every helper here and every abra
+    subprocess agree on the same tree; outside a run (warm_reconcile's systemd timer, manual use)
+    both fall back to the canonical /root/.abra."""
+    return os.environ.get("ABRA_DIR") or os.path.expanduser("~/.abra")
+def recipe_dir(recipe: str) -> str:
+    """The current ABRA_DIR's working tree for a recipe (per-run inside a CI run)."""
+    return os.path.join(abra_dir(), "recipes", recipe)
 def _run_pty(
    args: list[str], timeout: int = 900, check: bool = True
 ) -> subprocess.CompletedProcess:
@ -77,9 +92,7 @@ def recipe_checkout(recipe: str, version: str) -> None:
    a chaos (`-C`) deploy ignores ENV VERSION and uses the current checkout — together that silently
    deployed LATEST for a 'previous-version' base, making the upgrade a no-op (Adversary F1d-2). With
    this checkout + a non-chaos deploy, a pinned deploy genuinely deploys that version."""
-    import os
+    path = recipe_dir(recipe)
-    path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
    # -f (force): the version-pinning checkout must yield the EXACT ref tree. Without it, a cc-ci
    # install_steps-provided overlay (e.g. discourse's compose.ccci.yml, copied into the pinned base)
    # is an UNTRACKED file that collides with the same path TRACKED in a later ref, and
@ -100,9 +113,7 @@ def has_lightweight_version_tags(recipe: str) -> bool:
    'reference not found'.) The caller (deploy_app) uses this to fall back to a chaos base deploy
    (which skips lint and deploys the explicitly-checked-out pinned version — see lifecycle.deploy_app).
    Read-only: just `git tag` + `cat-file -t`; no fetch/mutation, so it can't trigger abra's revert."""
-    import os
+    path = recipe_dir(recipe)
-    path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
    tags = subprocess.run(
        ["git", "-C", path, "tag", "-l"], capture_output=True, text=True
    ).stdout.split()
@ -168,7 +179,9 @@ def secret_generate(domain: str, timeout: int = 300) -> None:
    )
-def deploy(domain: str, chaos: bool = True, timeout: int = 900, no_converge_checks: bool = False) -> None:
+def deploy(
+    domain: str, chaos: bool = True, timeout: int = 900, no_converge_checks: bool = False
+) -> None:
    args = ["app", "deploy", domain, "-o", "-n"]
    if chaos:
        args.append("-C")
@ -203,7 +216,10 @@ def backup_create(domain: str, timeout: int = 900) -> str:
    # remote and fails "authentication required: Unauthorized". Returns the captured output, whose
    # restic JSON summary line carries the produced "snapshot_id" (the backup artifact, DG3) — note
    # `abra app backup snapshots` needs a TTY and is awkward to script, so we read the create output.
-    out = _run_pty(["app", "backup", "create", domain, "-n", "-C", "-o"], timeout=timeout).stdout or ""
+    out = (
+        _run_pty(["app", "backup", "create", domain, "-n", "-C", "-o"], timeout=timeout).stdout
+        or ""
+    )
    # Echo the backup output (incl. backupbot's pre-hook run / any "Failed to run command" or
    # "Container ... not running" ERROR) into the run log. Backup is otherwise opaque: a pre-hook that
    # fails to register/run leaves the DB dump out of the snapshot, surfacing only as a downstream
@ -226,9 +242,7 @@ def recipe_head_commit(recipe: str) -> str | None:
    """The current HEAD commit of the recipe checkout — captured right after fetch (the PR head, or
    the catalogue current) so the upgrade tier can re-checkout it for the chaos redeploy after the
    prev-tag base deploy reset the working tree (HC1)."""
-    import os
+    path = recipe_dir(recipe)
-    path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
    proc = subprocess.run(["git", "-C", path, "rev-parse", "HEAD"], capture_output=True, text=True)
    out = proc.stdout.strip()
    return out or None
@ -236,10 +250,7 @@ def recipe_head_commit(recipe: str) -> str | None:
 def recipe_versions(recipe: str) -> list[str]:
    """Published versions of a recipe, oldest→newest (from the recipe git tags)."""
-    import os
+    path = recipe_dir(recipe)
-    import subprocess
-    path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
    proc = subprocess.run(
        ["git", "-C", path, "tag", "--sort=creatordate"], capture_output=True, text=True
    )
--- a/runner/harness/browser.py
+++ b/runner/harness/browser.py
@ -13,8 +13,15 @@ from __future__ import annotations
 import time
-def goto_with_retry(page, url, *, deadline_seconds: int = 120, accept_statuses=(200, 304),
+def goto_with_retry(
-                    goto_timeout_ms: int = 30_000, wait_until: str = "domcontentloaded"):
+    page,
+    url,
+    *,
+    deadline_seconds: int = 120,
+    accept_statuses=(200, 304),
+    goto_timeout_ms: int = 30_000,
+    wait_until: str = "domcontentloaded",
+):
    """Poll `page.goto(url)` until status is in `accept_statuses` OR the deadline expires.
    Returns the final Playwright response. Raises AssertionError if the deadline expires without
--- a/runner/harness/canonical.py
+++ b/runner/harness/canonical.py
@ -55,7 +55,9 @@ def enrolled_recipes() -> list[str]:
    out = []
    try:
        for name in sorted(os.listdir(tests_dir)):
-            if os.path.isfile(os.path.join(tests_dir, name, "recipe_meta.py")) and is_enrolled(name):
+            if os.path.isfile(os.path.join(tests_dir, name, "recipe_meta.py")) and is_enrolled(
+                name
+            ):
                out.append(name)
    except OSError:
        pass
@ -122,11 +124,15 @@ def deploy_canonical(recipe: str, timeout: int = 900) -> None:
    abra.recipe_checkout(recipe, version)
    r = subprocess.run(
        ["abra", "app", "deploy", domain, version, "-o", "-n", "-f"],
-        capture_output=True, text=True, timeout=timeout,
+        capture_output=True,
+        text=True,
+        timeout=timeout,
    )
    if r.returncode != 0:
-        raise RuntimeError(f"deploy canonical {domain} {version} failed: "
+        raise RuntimeError(
-                           f"{(r.stderr + ' ' + r.stdout).strip()[:300]}")
+            f"deploy canonical {domain} {version} failed: "
+            f"{(r.stderr + ' ' + r.stdout).strip()[:300]}"
+        )
    _set_status(recipe, "warm")
--- a/runner/harness/card.py
+++ b/runner/harness/card.py
@ -79,10 +79,44 @@ def render_badge_svg(label: str, message: str, color: str) -> str:
    )
-def level_badge_svg(level: int, cap_reason: str = "") -> str:
+# Third-segment colours for the level badge: amber = an UNINTENTIONAL skip (a rung skipped but not
-    """Per-recipe/-run LEVEL badge: 'cc-ci | level N'. Colour by level (R6)."""
+# in the recipe's intentional list — likely missing coverage) capped the climb; muted = an
-    msg = f"level {int(level)}"
+# INTENTIONAL skip (declared in recipe_meta.EXPECTED_NA — nothing to fix). Font-safe text labels
-    return render_badge_svg("cc-ci", msg, level_color(level))
+# (no emoji) so the SVG renders anywhere.
+GAP_COLOR = "#d29922"
+EXPECT_COLOR = "#6e7681"
+def level_badge_svg(level: int, cap_reason: str = "", cap_skip: str = "") -> str:
+    """Per-recipe/-run LEVEL badge: 'cc-ci | level N' coloured by level (R6), with a THIRD segment
+    that differentiates *why* the climb stopped when a SKIP capped it (`cap_skip`):
+      - "unintentional" (a rung skipped but not in the recipe's intentional list): amber 'gap?'.
+      - "intentional"   (a skip declared in recipe_meta.EXPECTED_NA): muted 'expected'.
+      - "" (clean cap / full climb / a real failure): no third segment (the level + card carry it).
+    The badge never inflates — it only annotates the cap the level already reflects."""
+    label, msg = "cc-ci", f"level {int(level)}"
+    lw, mw = _text_width(label), _text_width(msg)
+    third: tuple[str, str] | None = None
+    if cap_skip == "unintentional":
+        third = ("gap?", GAP_COLOR)
+    elif cap_skip == "intentional":
+        third = ("expected", EXPECT_COLOR)
+    if third is None:
+        return render_badge_svg(label, msg, level_color(level))
+    txt, tcolor = third
+    tw = _text_width(txt)
+    w = lw + mw + tw
+    return (
+        f'<svg xmlns="http://www.w3.org/2000/svg" width="{w}" height="20" role="img" '
+        f'aria-label="{html.escape(label)}: {html.escape(msg)} ({html.escape(txt)})">'
+        f'<rect width="{lw}" height="20" fill="#555"/>'
+        f'<rect x="{lw}" width="{mw}" height="20" fill="{level_color(level)}"/>'
+        f'<rect x="{lw + mw}" width="{tw}" height="20" fill="{tcolor}"/>'
+        f'<g fill="#fff" font-family="Verdana,Geneva,sans-serif" font-size="11">'
+        f'<text x="6" y="14">{html.escape(label)}</text>'
+        f'<text x="{lw + 6}" y="14">{html.escape(msg)}</text>'
+        f'<text x="{lw + mw + 6}" y="14">{html.escape(txt)}</text></g></svg>'
+    )
 def _stage_rows(stages: list[dict]) -> str:
@ -107,6 +141,45 @@ def _stage_rows(stages: list[dict]) -> str:
    return "\n".join(rows) or '<tr><td colspan="3">no stages</td></tr>'
+# Friendly rung labels for the skip rows (the four essential rungs).
+RUNG_LABEL = {
+    "install": "install",
+    "upgrade": "upgrade",
+    "backup_restore": "backup/restore",
+    "functional": "functional",
+}
+SKIP_GREEN = (
+    "#57ab5a"  # muted green — an intentional skip reads like a pass (but labelled, never inflating)
+)
+def _skip_rows(skips: dict) -> str:
+    """Render SKIPPED rungs as stage-like rows. An intentional (declared) skip looks like a pass row
+    but its status says 'INTENTIONAL SKIP' (muted green) with the declared reason on the line below;
+    an unintentional skip is amber 'UNINTENTIONAL SKIP' with a prompt to add a test or declare it."""
+    rows = []
+    for rung, reason in (skips.get("intentional") or {}).items():
+        rows.append(
+            f'<tr class="stage"><td colspan="2"><span class="mark" style="color:{SKIP_GREEN}">⊘</span>'
+            f"<b>{html.escape(RUNG_LABEL.get(rung, rung))}</b></td>"
+            f'<td class="st" style="color:{SKIP_GREEN}">intentional skip</td></tr>'
+        )
+        rows.append(
+            f'<tr class="skipreason"><td></td><td colspan="2">{html.escape(reason)}</td></tr>'
+        )
+    for rung in skips.get("unintentional") or []:
+        rows.append(
+            f'<tr class="stage"><td colspan="2"><span class="mark" style="color:{GAP_COLOR}">⊘</span>'
+            f"<b>{html.escape(RUNG_LABEL.get(rung, rung))}</b></td>"
+            f'<td class="st" style="color:{GAP_COLOR}">unintentional skip</td></tr>'
+        )
+        rows.append(
+            '<tr class="skipreason"><td></td><td colspan="2">not declared in EXPECTED_NA — add the '
+            "missing test/label, or declare the skip with a reason</td></tr>"
+        )
+    return "\n".join(rows)
 def render_card_html(data: dict, screenshot_rel: str | None = "screenshot.png") -> str:
    """Build the summary-card HTML from a results.json dict. `screenshot_rel` is the relative path to
    the screenshot PNG (same dir as the card) — omitted from the card if None / absent.
@ -116,7 +189,9 @@ def render_card_html(data: dict, screenshot_rel: str | None = "screenshot.png")
    recipe = html.escape(str(data.get("recipe", "?")))
    version = html.escape(str(data.get("version") or data.get("ref") or ""))
    level = int(data.get("level", 0))
-    cap = html.escape(str(data.get("level_cap_reason") or ""))
+    cap_reason = str(data.get("level_cap_reason") or "")
+    cap = html.escape(cap_reason)
+    sk = data.get("skips", {}) or {}
    color = level_color(level)
    flags = data.get("flags", {}) or {}
    flag_bits = []
@ -132,7 +207,7 @@ def render_card_html(data: dict, screenshot_rel: str | None = "screenshot.png")
        if show_shot
        else '<div class="shot noshot">no screenshot</div>'
    )
-    rows = _stage_rows(data.get("stages", []))
+    rows = _stage_rows(data.get("stages", [])) + "\n" + _skip_rows(sk)
    return f"""<!doctype html><html><head><meta charset="utf-8"><style>
 *{{box-sizing:border-box}}
 body{{margin:0;font-family:system-ui,-apple-system,Segoe UI,sans-serif;background:#0d1117;color:#c9d1d9}}
@ -157,6 +232,7 @@ tr.stage td{{padding-top:.5rem;border-bottom:1px solid #30363d}}
 .test .tmark{{width:1.4rem;text-align:center}}
 .test .tname{{color:#c9d1d9;font-family:ui-monospace,monospace;font-size:.8rem}}
 .test .tms{{text-align:right;color:#8b949e;font-size:.74rem;width:5rem}}
+tr.skipreason td{{color:#8b949e;font-size:.78rem;font-style:italic;padding-top:0;padding-bottom:.45rem;border-bottom:1px solid #21262d}}
 .shot{{width:360px;flex:none;border:1px solid #30363d;border-radius:8px;overflow:hidden;background:#0d1117}}
 .shot img{{width:100%;display:block}}
 .shot.noshot{{display:flex;align-items:center;justify-content:center;height:225px;color:#8b949e;font-size:.85rem}}
@ -167,7 +243,7 @@ tr.stage td{{padding-top:.5rem;border-bottom:1px solid #30363d}}
 <div class="hd">{FLOWER_SVG}
 <div class="title"><h1>{recipe}</h1><span class="ver">{version}</span></div>
 <div class="lvl"><span class="num">{level}</span><span class="lbl">level</span></div></div>
-<div class="cap">{("<b>capped:</b> " + cap) if cap else "<b>full clean climb</b> — top level (6)"}</div>
+<div class="cap">{("<b>capped:</b> " + cap) if cap else "<b>full clean climb</b> — top level (4)"}</div>
 <div class="body"><div class="tbl"><table>{rows}</table></div>{shot_html}</div>
 <div class="flags">{"".join(flag_bits)}</div>
 </div></body></html>"""
--- a/runner/harness/deps.py
+++ b/runner/harness/deps.py
@ -28,7 +28,7 @@ from __future__ import annotations
 import contextlib
 import json
 import os
-from typing import Iterable
+from collections.abc import Iterable
 from . import lifecycle, naming
@ -36,9 +36,7 @@ from . import lifecycle, naming
 def declared_deps(recipe: str) -> list[str]:
    """Read `DEPS` from `tests/<recipe>/recipe_meta.py` — a list of recipe names this recipe needs
    deployed alongside it. Returns [] if none."""
-    path = os.path.join(
+    path = os.path.join(os.path.dirname(__file__), "..", "..", "tests", recipe, "recipe_meta.py")
-        os.path.dirname(__file__), "..", "..", "tests", recipe, "recipe_meta.py"
-    )
    if not os.path.exists(path):
        return []
    ns: dict = {}
--- a/runner/harness/generic.py
+++ b/runner/harness/generic.py
@ -25,7 +25,7 @@ _BACKUPBOT_RE = re.compile(r"backupbot\.backup\b[^\n]*\btrue\b", re.IGNORECASE)
 def _recipe_dir(recipe: str) -> str:
-    return os.path.expanduser(f"~/.abra/recipes/{recipe}")
+    return abra.recipe_dir(recipe)  # the per-run tree inside a CI run ($ABRA_DIR)
 def backup_capable(recipe: str, meta: dict | None = None) -> bool:
@ -222,7 +222,11 @@ def assert_restore_healthy(domain: str, meta: dict) -> None:
 def perform_upgrade(
-    domain: str, recipe: str, head_ref: str | None, deploy_timeout: int = 900, meta: dict | None = None
+    domain: str,
+    recipe: str,
+    head_ref: str | None,
+    deploy_timeout: int = 900,
+    meta: dict | None = None,
 ) -> dict[str, str | None]:
    """Perform the UPGRADE op once, in place, to the PR-HEAD code under test (HC1): re-checkout the
    PR head (the prev-tag base deploy reset the recipe working tree), then `abra app deploy --chaos`
@ -267,7 +271,9 @@ def perform_upgrade(
        deploy_timeout=int(meta.get("DEPLOY_TIMEOUT", deploy_timeout)),
        http_timeout=int(meta.get("HTTP_TIMEOUT", 300)),
    )
-    lifecycle.wait_ready_probes(meta, domain, timeout=int(meta.get("DEPLOY_TIMEOUT", deploy_timeout)))
+    lifecycle.wait_ready_probes(
+        meta, domain, timeout=int(meta.get("DEPLOY_TIMEOUT", deploy_timeout))
+    )
    after = lifecycle.deployed_identity(domain)
    # Evidence (HC1): the chaos-version label = the deployed recipe commit; it should match the
    # PR-head we checked out — proving the upgrade deployed the code under test, not a published tag.
--- a/runner/harness/http.py
+++ b/runner/harness/http.py
@ -73,7 +73,7 @@ def http_post(
    `data` is JSON-encoded if content_type='application/json',
    form-encoded if 'application/x-www-form-urlencoded' (the OIDC token endpoint form),
    or sent raw bytes if data is already bytes."""
-    if isinstance(data, (bytes, bytearray)):
+    if isinstance(data, bytes | bytearray):
        body: bytes | None = bytes(data)
    elif content_type == "application/json" and data is not None:
        body = json.dumps(data).encode()
@ -107,7 +107,7 @@ def http_request(
 ) -> tuple[int, object | None]:
    """Arbitrary-method HTTP (PUT/DELETE/PATCH) for parity tests that mutate. Same shape as
    http_post (returns (status, json_or_None))."""
-    if isinstance(data, (bytes, bytearray)):
+    if isinstance(data, bytes | bytearray):
        body: bytes | None = bytes(data)
    elif content_type == "application/json" and data is not None:
        body = json.dumps(data).encode()
@ -142,7 +142,7 @@ def post_with_headers(
    """Like http_post but ALSO returns the response headers as a dict — for APIs that hand back an
    auth token in a response header rather than the body (e.g. mattermost login → `Token` header).
    Returns (status, parsed_json_or_None, response_headers). status=0 + {} on transport failure."""
-    if isinstance(data, (bytes, bytearray)):
+    if isinstance(data, bytes | bytearray):
        body: bytes | None = bytes(data)
    elif content_type == "application/json" and data is not None:
        body = json.dumps(data).encode()
@ -252,13 +252,16 @@ def retry_http_post(
 ) -> tuple[int, object | None]:
    """POST with retry until expect_fn(status, json) is truthy. Defaults to any 2xx."""
    if expect_fn is None:
        def expect_fn(s, _j):  # noqa: ARG001
            return 200 <= s < 300
    result: list[tuple[int, object | None]] = [(0, None)]
    def _check():
-        s, j = http_post(url, data=data, headers=headers, content_type=content_type, timeout=timeout)
+        s, j = http_post(
+            url, data=data, headers=headers, content_type=content_type, timeout=timeout
+        )
        result[0] = (s, j)
        return expect_fn(s, j)
--- a/runner/harness/level.py
+++ b/runner/harness/level.py
@ -5,37 +5,39 @@ YunoHost semantics: **a gap caps the level** — you only earn level L if every
 PASS. The first rung that is not a clean PASS (a real FAIL *or* genuinely N/A for this recipe) stops
 the climb; `cap_reason` records why. This is deliberately conservative: presentation must NEVER make
 a run look greener than its tests (plan §6 cardinal guardrail), so an N/A rung caps just like a fail
-(the L5 example in §4.1 — "recipes with no integration surface cap at L4 by definition" — is exactly
+— with a recorded reason so the level is *fair*, not inflated.
-this: N/A caps, with a recorded reason so the level is *fair*, not inflated).
-The ladder (§4.1):
+The ladder is the FOUR essential rungs every recipe is held to:
  L0 — install failed / app never became healthy.
  L1 — Installs: deploys + passes health/readiness.
  L2 — Upgrades: previous published version → PR version, stays healthy, data intact.
  L3 — Backup/restore: seeded data survives backup → wipe → restore.
  L4 — Functional: recipe-specific functional tests pass.
-  L5 — Integration: SSO/OIDC + cross-app integration tests pass.
+
-  L6 — Recipe-local: the recipe repo's own tests/ (D4) pass and are merged.
+Integration (SSO/OIDC + cross-app) and recipe-local (the recipe repo's own tests/) are **OPTIONAL**
+capabilities — they are NOT part of the level ladder and never cap it. They still run when present
+(and SSO is still enforced for the run VERDICT via the deps/SSO checks in run_recipe_ci.py), but a
+recipe without an SSO surface or without repo-local tests is simply not penalised on the level.
 This module is PURE (no I/O) so it is cheaply unit-testable and the Adversary can re-run the unit
 test cold (`cc-ci-run -m pytest tests/unit/test_level.py -q`). The orchestrator
-(`run_recipe_ci.py`) is responsible for translating its raw per-tier results + deps/SSO signals into
+(`run_recipe_ci.py`) is responsible for translating its raw per-tier results into the rung-status
-the rung-status dict this function consumes; that mapping is documented in DECISIONS.md (Phase 3).
+dict this function consumes; that mapping is documented in DECISIONS.md (Phase 3).
 Rung status vocabulary (each rung ∈ these three):
  "pass" — the rung was exercised and passed.
  "fail" — the rung was exercised and failed.
  "na"   — the rung does not apply to this recipe (e.g. only one published version → no upgrade;
-           not backup-capable; no SSO/integration surface; no recipe-local tests). N/A is NOT a
+           not backup-capable). N/A is NOT a failure, but it DOES cap the climb (with a distinct
-           failure, but it DOES cap the climb (with a distinct cap_reason) so the level never
+           cap_reason) so the level never overstates what was actually verified.
-           overstates what was actually verified.
 """
 from __future__ import annotations
 # The climbable rungs in ascending order. install (L1) is the foundation; L0 means install itself
-# did not pass. Each later rung requires every earlier rung to be a clean PASS.
+# did not pass. Each later rung requires every earlier rung to be a clean PASS. These four are the
-RUNGS = ("install", "upgrade", "backup_restore", "functional", "integration", "recipe_local")
+# ESSENTIAL rungs — integration/recipe-local are optional and deliberately NOT in this tuple.
+RUNGS = ("install", "upgrade", "backup_restore", "functional")
 # Human-readable label per rung level, for cap_reason + the summary card.
 RUNG_LABEL = {
@ -43,22 +45,20 @@ RUNG_LABEL = {
    2: "upgrade (prev published → PR)",
    3: "backup/restore (data integrity)",
    4: "functional (recipe-specific tests)",
-    5: "integration (SSO/OIDC + cross-app)",
-    6: "recipe-local (recipe repo tests/)",
 }
 VALID = {"pass", "fail", "na"}
 def compute_level(rungs: dict[str, str]) -> tuple[int, str]:
-    """Map a rung-status dict → (level 0..6, cap_reason).
+    """Map a rung-status dict → (level 0..4, cap_reason).
    `rungs` must contain a status in {"pass","fail","na"} for every name in RUNGS. The level is the
    highest L such that rungs[1..L] are all "pass"; the first non-"pass" rung caps the climb. L0 is
    returned when the install rung itself is not "pass" (install failed / never healthy).
    cap_reason explains where the climb stopped:
-      - "" (empty) when the recipe earned the top rung (L6, full clean climb).
+      - "" (empty) when the recipe earned the top rung (L4, full clean climb).
      - "L<k> <label> FAILED" when a rung was exercised and failed.
      - "L<k> <label> N/A" when a rung does not apply to this recipe.
    Returns the reason for the FIRST rung that stopped the climb (the binding constraint).
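Reviewer note: the docstring above fully specifies the climb, so it can be sketched in a few lines. This is an illustration of the documented semantics, not the function body from level.py: the name `compute_level_sketch` is hypothetical, and the L1 label is a placeholder because the `RUNG_LABEL` entry for rung 1 is not visible in this diff.

```python
RUNGS = ("install", "upgrade", "backup_restore", "functional")

RUNG_LABEL = {
    1: "install",  # placeholder: the real L1 label is not shown in the diff
    2: "upgrade (prev published → PR)",
    3: "backup/restore (data integrity)",
    4: "functional (recipe-specific tests)",
}

VALID = {"pass", "fail", "na"}


def compute_level_sketch(rungs):
    """Map a rung-status dict to (level 0..4, cap_reason), per the docstring."""
    level = 0
    for k, name in enumerate(RUNGS, start=1):
        status = rungs[name]  # KeyError if a rung is missing: every rung is required
        if status not in VALID:
            raise ValueError(f"invalid status for {name}: {status!r}")
        if status == "pass":
            level = k
            continue
        # First non-pass rung is the binding constraint; later rungs are ignored.
        suffix = "FAILED" if status == "fail" else "N/A"
        return level, f"L{k} {RUNG_LABEL[k]} {suffix}"
    return level, ""  # full clean climb: top rung earned, no cap reason
```

The early return is what makes a gap cap the level: a recipe with a passing functional rung but an N/A upgrade rung still stops at L1.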
--- a/runner/harness/lifecycle.py
+++ b/runner/harness/lifecycle.py
@ -7,7 +7,8 @@ next run. Callers wrap deploy()/teardown() in try/finally (or a pytest finalizer
 from __future__ import annotations
 import contextlib
-import datetime
+import fcntl
 import glob
 import json
 import os
 import re
@ -17,7 +18,7 @@ import subprocess
 import time
 import urllib.request
-from . import abra
+from . import abra, lifetime
 GATEWAY_IP = "143.244.213.108"  # *.ci.commoninternet.net -> gateway (TLS passthrough to cc-ci)
 # A run app domain is "<recipe[:4]>-<6hex>.ci.commoninternet.net" (see DECISIONS.md). Used by the
@ -29,6 +30,68 @@ class TeardownError(RuntimeError):
    pass
 # --- Concurrent-run safety (capacity=2) -------------------------------------------------------
 # ONE mechanism, process-lifetime-scoped so SIGKILL can't leak a stale claim: every run holds an
 # exclusive kernel flock on its app DOMAIN (/run/lock/cc-ci-app-<domain>.lock) for the whole run.
 # A held lock implies a live owner — the kernel releases a flock when the holding process dies,
 # however it dies. The janitor probes the lock (LOCK_NB) to tell a live concurrent run (held →
 # leave it) from a crashed run's orphan (acquirable → reap it); it never inspects pids and never
 # steals a held lock. Recipe-tree corruption between same-recipe runs is gone structurally (each
 # run deploys from its own per-run ABRA_DIR — there is no shared recipe tree and no recipe lock),
 # and same-domain runs (double-!testme of one PR) serialise on this app lock.
 # See docs/concurrency.md.
 # Acquired app-lock file objects are retained here for the REMAINING PROCESS LIFETIME: if the
 # caller drops the returned file object, GC would close the fd and silently release the lock —
 # this list is the lock's owner of record. Never cleared; release is process exit.
 _held_app_locks: list = []
 def _app_lock_dir() -> str:
    """The app-domain lockfile dir. /run/lock (tmpfs: a reboot clears locks AND lockfiles, so
    post-reboot apps probe as orphans and are reaped immediately). Env-overridable so the
    tests/concurrency suite (and its helper subprocesses) can use a sandbox dir."""
    return os.environ.get("CCCI_APP_LOCK_DIR", "/run/lock")
 def _app_lock_path(domain: str) -> str:
    return os.path.join(_app_lock_dir(), f"cc-ci-app-{domain}.lock")
 def acquire_app_lock(domain: str):
    """Take the per-app-domain exclusive lock; blocks (with a log line) if another run of the
    same domain is in flight (double-!testme serialisation). Returns the open lock file, which is
    ALSO retained in _held_app_locks so the flock lives exactly as long as the process.
    Unlink/recreate race guard: the janitor unlinks a reaped orphan's lockfile while holding its
    flock, so a waiter blocked on the OLD inode can win a lock no later opener can observe (a new
    open() at the path creates a FRESH inode). After every acquisition, verify the locked fd is
    still the file at the path (st_ino match); if not, drop it and retry on the live path."""
    path = _app_lock_path(domain)
    waited = False
    while True:
        # PEP 446: the fd is non-inheritable, so subprocess children never carry the lock.
        f = open(path, "a")  # noqa: SIM115 — deliberately held for the rest of the process
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            if not waited:
                print(f"== app lock: another run of {domain} is in flight — waiting ==", flush=True)
                waited = True
            fcntl.flock(f, fcntl.LOCK_EX)
        try:
            if os.fstat(f.fileno()).st_ino == os.stat(path).st_ino:
                break  # we hold the lock on the inode the path names — done
        except FileNotFoundError:
            pass
        f.close()  # locked a stale (unlinked) inode — retry on the live path
    os.utime(f.fileno())  # mtime = acquisition time = lock age (janitor's long-held flag)
    _held_app_locks.append(f)
    if waited:
        print(f"== app lock: acquired {path} ==", flush=True)
    return f
 def _docker_names(kind: str, stack: str) -> list[str]:
    """docker <kind> ls names filtered to a stack (kind: service|volume|secret)."""
    proc = subprocess.run(
@ -48,31 +111,6 @@ def _residual(domain: str) -> dict:
    }
 def _stack_age_seconds(stack: str) -> float | None:
    """Age of the stack's oldest service, or None if not present."""
    svcs = _docker_names("service", stack)
    if not svcs:
        return None
    oldest = None
    for s in svcs:
        p = subprocess.run(
            ["docker", "service", "inspect", s, "--format", "{{.CreatedAt}}"],
            capture_output=True,
            text=True,
        )
        ts = p.stdout.strip()
        try:
            # docker emits e.g. 2026-05-27 00:12:33.123 +0000 UTC -> take the leading 19 chars
            dt = datetime.datetime.strptime(ts[:19], "%Y-%m-%d %H:%M:%S").replace(
                tzinfo=datetime.UTC
            )
        except ValueError:
            continue
        age = (datetime.datetime.now(datetime.UTC) - dt).total_seconds()
        oldest = age if oldest is None else max(oldest, age)
    return oldest
 def _recipe_extra_env(recipe: str, domain: str) -> dict[str, str]:
    """Per-recipe extra .env keys, applied at every deploy (install + upgrade's old_app) so a recipe
    with multi-domain / config needs is enrolled with NO shared-harness change (D5/M6.5). A recipe
@ -149,9 +187,9 @@ def prepull_images(recipe: str, domain: str) -> None:
    app-INIT time (slow-init apps like collabora/immich still need their recipe healthcheck/READY_PROBE).
    Best-effort on resolution failure (skip + let the deploy pull as usual); HARD-fails on a real
    pull error (don't mask it)."""
-    import os
-
-    recipe_dir = os.path.expanduser(f"~/.abra/recipes/{recipe}")
+    recipe_dir = abra.recipe_dir(recipe)  # per-run tree inside a CI run
+    # The app .env lives in the CANONICAL servers path (the per-run ABRA_DIR's servers/ is a
+    # symlink to it, so abra and this path agree on the same file).
    env_path = os.path.expanduser(f"~/.abra/servers/default/{domain}.env")
    if not os.path.isdir(recipe_dir) or not os.path.isfile(env_path):
        print(f"  prepull: recipe dir or .env missing for {recipe} — skipping", flush=True)
@ -161,7 +199,8 @@ def prepull_images(recipe: str, domain: str) -> None:
    # --env-file supplies $VERSION-style interpolation so pinned tags resolve correctly.
    cf = subprocess.run(
        ["bash", "-c", f'set -a; . "{env_path}"; printf "%s" "${{COMPOSE_FILE:-compose.yml}}"'],
-        capture_output=True, text=True,
+        capture_output=True,
+        text=True,
    ).stdout.strip()
    files = [f for f in cf.split(":") if f] or ["compose.yml"]
    args = ["docker", "compose", "--env-file", env_path]
@ -209,6 +248,10 @@ def deploy_app(
    past the 900s default. abra's INTERNAL TIMEOUT (recipe's TIMEOUT env, default 300s) is set via
    EXTRA_ENV; this is the Python subprocess wrapper's timeout so abra doesn't get SIGKILLed mid-deploy."""
    _record_deploy()
    # Lock BEFORE the app exists: a concurrent run's janitor must never see this app without a
    # held app lock (it would probe it as an orphan and reap an in-flight deploy). Also the
    # double-!testme serialisation point: a second run of the same domain blocks here.
    acquire_app_lock(domain)
    abra.app_config_remove(domain)  # clear any stale .env from a prior crashed run
    abra.app_new(recipe, domain, version=version, secrets=secrets)
    # A pinned version must actually deploy that version: check the recipe out to the tag so the
@ -268,18 +311,22 @@ def _stack_name(domain: str) -> str:
 def services_converged(domain: str) -> bool:
-    """True when every service in the stack reports replicas N/N (N>0)."""
+    """True when every service in the stack reports replicas N/N (N>0) AND no service is
+    mid-rolling-update (swarm UpdateStatus settled)."""
    stack = _stack_name(domain)
    proc = subprocess.run(
-        ["docker", "stack", "services", stack, "--format", "{{.Replicas}}"],
+        ["docker", "stack", "services", stack, "--format", "{{.Name}} {{.Replicas}}"],
        capture_output=True,
        text=True,
    )
    rows = [r for r in proc.stdout.split("\n") if r.strip()]
    if not rows:
        return False
+    names = []
    for r in rows:
-        cur, _, want = r.partition("/")
+        name, _, replicas = r.partition(" ")
+        names.append(name)
+        cur, _, want = replicas.partition("/")
        # A service at its DESIRED replica count is converged — including a `replicas: 0`
        # on-demand one-shot (e.g. lasuite-drive's `minio-createbuckets`, which is scaled up
        # manually only when buckets need (re)creating), which reports "0/0". The earlier
@ -288,6 +335,34 @@ def services_converged(domain: str) -> bool:
        # still spinning up shows e.g. "0/1" (cur != want) and is correctly not-yet-converged.
        if not want or cur != want:
            return False
    # N/N alone is NOT convergence during a stop-first rolling update: a chaos redeploy that changes
    # a non-app service image (e.g. immich's db pin) registers the update immediately, but swarm may
    # not have cycled that service's task yet — the OLD task still shows 1/1, then dies seconds later
    # (immich CI 238: backupbot exec'd the db pre-hook into the just-killed container → 409). Require
    # every service's UpdateStatus to be settled too, so the wait spans the whole rolling update.
    proc = subprocess.run(
        [
            "docker",
            "service",
            "inspect",
            *names,
            "--format",
            "{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}",
        ],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        return False  # a service vanished mid-check — not settled
    for state in proc.stdout.split("\n"):
        # Only ACTIVE states block convergence. 'paused'/'rollback_paused' are terminal-without-
        # intervention: swarm's default update-failure-action pauses the update on one task flicker
        # and the flag then persists FOREVER (immich CI 241: app service 'paused' from a restart
        # during restore, service back at 1/1 and healthy — the wait hung to its deadline). With
        # N/N already required above, a paused update is settled for our purposes; the HTTP-health
        # and tier assertions still gate whether the app actually works.
        if state.strip() in ("updating", "rollback_started"):
            return False
    return True
@ -415,7 +490,9 @@ def recipe_checkout_ref(recipe: str, ref: str) -> None:
    abra.recipe_checkout(recipe, ref)
-def chaos_redeploy(domain: str, deploy_timeout: int = 900, no_converge_checks: bool = False) -> None:
+def chaos_redeploy(
+    domain: str, deploy_timeout: int = 900, no_converge_checks: bool = False
+) -> None:
    """In-place `abra app deploy --chaos`: redeploy the running app at the CURRENT recipe checkout
    (HC1: the PR-head code under test). This is the upgrade op, not a fresh install — it does NOT go
    through deploy_app, so the deploy-count guard (DG4.1) is not incremented.
@ -498,6 +575,16 @@ def wait_ready_probes(meta: dict, domain: str, timeout: int = 600) -> None:
 def backup_app(domain: str) -> str:
    """Create a backup; return the abra/restic output (carries the produced snapshot_id)."""
    # Never back up a stack that is still converging/rolling-updating: backupbot resolves each
    # service's hook container ONCE up front, so a task that cycles between that lookup and the
    # pre-hook exec crashes the whole backup with a 409 (immich CI 238). Bounded wait — on timeout
    # we still attempt the backup and let the tier's assertion deliver the verdict.
    deadline = time.time() + 300
    while time.time() < deadline and not services_converged(domain):
        print(
            f"  backup: {domain} stack not settled yet — waiting before backup create", flush=True
        )
        time.sleep(5)
    return abra.backup_create(domain)
@ -603,17 +690,84 @@ def teardown_app(domain: str, verify: bool = True) -> None:
        residual = _residual(domain)
        if any(residual.values()):
            raise TeardownError(f"teardown left residual for {domain}: {residual}")
    # No unregistration step: the app lock releases implicitly at process exit. The clean run's
    # leftover lockfile (unheld) is unlinked on sight by the next janitor's stale-lockfile sweep.
-def janitor(max_age_seconds: int | None = None) -> None:
-    """Reap orphaned run apps from crashed/rebooted runs. Matches the real naming scheme and only
-    reaps apps older than max_age_seconds (so concurrent in-flight runs are never killed). Reaps via
-    docker primitives so it works even when the .env is gone (A2/A3). Default 2h, env-overridable
-    via CCCI_JANITOR_MAX_AGE (e.g. 0 to reap all matching orphans immediately)."""
-    import os
-    if max_age_seconds is None:
-        max_age_seconds = int(os.environ.get("CCCI_JANITOR_MAX_AGE", "7200"))
+# A lock held longer than 2x the 60-min hard deadline can only be a leaked run (the deadline
+# bounds every healthy run). Flag it for a human — NEVER steal a held lock.
+LONG_HELD_LOCK_SECONDS = 2 * lifetime.HARD_DEADLINE_SECONDS
+
+def _probe_and_reap(domain: str) -> None:
    """Probe one run app's lock; reap iff nobody holds it (kernel-guaranteed orphan).
    Reaping happens WHILE HOLDING the probe lock, closing the janitor-vs-new-run race: a new run
    of the same domain blocks in acquire_app_lock until the reap finishes, so a fresh app never
    coexists with a half-reaped one. The lockfile is unlinked before release (still holding the
    lock); a waiter that blocked on the unlinked inode re-checks identity and retries. Two racing
    janitors arbitrate on the same flock: one reaps, the other sees 'held' and leaves —
    teardown_app(verify=False) is idempotent either way."""
    path = _app_lock_path(domain)
    try:
        # PEP 446: non-inheritable fd, same as acquire_app_lock.
        f = open(path, "a")  # noqa: SIM115 — closed in the finally below, lock released with it
    except OSError as e:
        print(f"!! janitor: cannot open lockfile {path} ({e}) — skipping {domain}", flush=True)
        return
    try:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Held -> live run. Never steal; flag if it has been held implausibly long.
            try:
                held_for = time.time() - os.stat(path).st_mtime
            except OSError:
                held_for = 0
            if held_for > LONG_HELD_LOCK_SECONDS:
                print(
                    f"!! lock for {domain} held >{LONG_HELD_LOCK_SECONDS // 60}min — possible "
                    "leaked run; inspect with lslocks",
                    flush=True,
                )
            else:
                print(
                    f"  janitor: {domain} lock held — live concurrent run, leaving it", flush=True
                )
            return
        # Acquired — but only the inode the PATH names counts (another janitor may have reaped
        # and unlinked this inode while we raced; a lock on an unlinked inode protects nothing
        # and unlinking the path now would delete a NEWER run's lockfile).
        try:
            if os.fstat(f.fileno()).st_ino != os.stat(path).st_ino:
                return
        except FileNotFoundError:
            return
        # Orphan: no live owner (the kernel released the lock when the owner died). Reap while
        # holding the probe lock, then unlink the lockfile before releasing.
        print(f"  janitor: {domain} lock acquirable — orphan, reaping", flush=True)
        with contextlib.suppress(Exception):
            teardown_app(domain, verify=False)
        with contextlib.suppress(OSError):
            os.unlink(path)
    finally:
        f.close()
 def janitor() -> None:
    """Reap orphaned run apps from crashed/rebooted runs; the kernel flock is the only liveness
    oracle. For every candidate run app, probe its app-domain lock (LOCK_NB):
      acquirable -> nobody holds it -> orphan -> reap under the probe lock + unlink lockfile
      held       -> live concurrent run -> leave it (warn if held >2x the hard deadline)
    Candidate discovery is unchanged: `abra app ls` + a docker-service sweep (catches stacks
    whose .env is already gone), both matched against RUN_APP_RE — warm/canonical apps never
    match and are never probed. Post-reboot, /run/lock (tmpfs) is empty, so every surviving app
    probes as an orphan and is reaped immediately (no age threshold). Stale lockfiles with no
    app behind them are unlinked on sight. Degrades safely: an unreadable lockfile/dir is
    skipped with a log line, never a crash. Reaps via docker primitives so it works even when
    the .env is gone (A2/A3)."""
    seen = set()
    for app in abra.app_ls():
        name = app.get("appName") or app.get("domain") or ""
@ -627,9 +781,22 @@ def janitor(max_age_seconds: int | None = None) -> None:
            seen.add(f"{m.group(1)}.ci.commoninternet.net")
    for name in seen:
-        stack = _stack_name(name)
-        age = _stack_age_seconds(stack)
-        if age is not None and age < max_age_seconds:
-            continue  # likely a concurrent in-flight run; leave it
-        with contextlib.suppress(Exception):
-            teardown_app(name, verify=False)
+        _probe_and_reap(name)
+
+    # Tidy /run/lock: a clean run's leftover lockfile is unheld and appless — unlink it (under
+    # its own probe lock, with the same identity check as above).
+    with contextlib.suppress(OSError):
+        for path in glob.glob(os.path.join(_app_lock_dir(), "cc-ci-app-*.lock")):
+            domain = os.path.basename(path)[len("cc-ci-app-") : -len(".lock")]
+            if domain in seen:
+                continue  # handled (or deliberately left) above
+            with contextlib.suppress(OSError):
+                f = open(path, "a")  # noqa: SIM115 — closed below, lock released with it
+                try:
+                    fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
+                    if os.fstat(f.fileno()).st_ino == os.stat(path).st_ino:
+                        os.unlink(path)
+                except (BlockingIOError, FileNotFoundError):
+                    pass  # held (live run pre-deploy) or already gone — leave it
+                finally:
+                    f.close()
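The probe that `_probe_and_reap` and `acquire_app_lock` rely on can be tried in isolation. The sketch below is an illustration under assumptions, not the harness code: `probe`, `names_live_path` and `demo` are hypothetical names, and the lockfile lives in a throwaway temp dir rather than /run/lock. It shows the two checks the comments describe: a non-blocking flock tells "held" from "acquirable", and an st_ino comparison detects a lock taken on an inode the path no longer names.

```python
import fcntl
import os
import tempfile


def probe(path):
    """Return "held" if some open file description holds the flock, else "orphan"."""
    f = open(path, "a")
    try:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return "held"
        return "orphan"
    finally:
        f.close()  # closing the fd also releases a lock the probe acquired


def names_live_path(f, path):
    """True iff the open file f is still the inode that path names."""
    try:
        return os.fstat(f.fileno()).st_ino == os.stat(path).st_ino
    except FileNotFoundError:
        return False


def demo():
    path = os.path.join(tempfile.mkdtemp(), "cc-ci-app-demo.lock")
    owner = open(path, "a")
    fcntl.flock(owner, fcntl.LOCK_EX)  # stands in for a live run holding the lock
    while_held = probe(path)
    live_before = names_live_path(owner, path)
    os.unlink(path)  # what a reaping janitor does to the lockfile
    open(path, "a").close()  # a later opener creates a fresh inode at the same path
    live_after = names_live_path(owner, path)
    owner.close()  # stands in for the owning process exiting
    after_exit = probe(path)
    return while_held, live_before, live_after, after_exit
```

flock locks belong to the open file description, so a second `open()` of the same path conflicts with the first even inside one process. That is why the demo needs no subprocess.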
--- a/runner/harness/lifetime.py
+++ b/runner/harness/lifetime.py
@ -0,0 +1,95 @@
 """Run-lifetime hardening (concurrency restructure P1).
 The concurrency model's invariant chain is:
    lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min hard deadline
 Locks are kernel flocks released on process exit, so the only thing that needs managing is the
 PROCESS lifetime. Three guards, installed at run startup (before any abra call) by
 `install_lifetime_guards()`:
  1. `PR_SET_PDEATHSIG(SIGTERM)`: if the parent (the drone step shell) dies — cancel, runner
     crash, host shutdown of the step — the kernel delivers SIGTERM to the harness, so a dead
     build can never leak a running harness that holds locks. Paired with a ppid==1 re-check
     AFTER the prctl: a parent that died BEFORE the prctl took effect would never trigger the
     death signal, so a harness that finds itself already reparented refuses to run.
  2. SIGTERM handler: raise SystemExit so the run's `finally:` teardown funnel executes and the
     process exits non-zero. Re-entrant deliveries during teardown are logged and IGNORED so a
     second signal can't abort the cleanup the first one asked for (`begin_teardown()` guards
     this; the run's own `finally:` blocks also call it so a signal landing mid-normal-teardown
     can't abort that either).
  3. `signal.alarm(3600)`: self-imposed hard deadline. SIGALRM funnels into the same teardown
     path with a distinct log line. Teardown time after the deadline is not alarm-bounded —
     interrupting a teardown buys nothing; the janitor (flock probe) is the backstop if a
     teardown wedges and the process is killed harder.
 """
 from __future__ import annotations
 import ctypes
 import os
 import signal
 import sys
 HARD_DEADLINE_SECONDS = 60 * 60
 _PR_SET_PDEATHSIG = 1  # linux/prctl.h
 _state = {"tearing_down": False}
 def begin_teardown() -> None:
    """Mark the teardown funnel as running. From here on SIGTERM/SIGALRM must NOT raise — it
    would abort the very cleanup it asks for — so the handlers log and return instead. Called by
    the handlers themselves before raising, and at the top of the run's `finally:` blocks."""
    _state["tearing_down"] = True
 def _funnel_handler(log_line: str, exit_code: int):
    """A signal handler that routes into the teardown funnel exactly once: log, then raise
    SystemExit (propagates through the run's try/finally → teardown executes → non-zero exit).
    While teardown is already running, further signals are logged and swallowed."""
    def handler(signum: int, frame) -> None:  # noqa: ARG001
        print(log_line, flush=True)
        if _state["tearing_down"]:
            print(
                f"== signal {signum} during teardown — ignored (teardown continues, "
                "exit stays non-zero) ==",
                flush=True,
            )
            return
        begin_teardown()
        raise SystemExit(exit_code)
    return handler
 def install_lifetime_guards(deadline_seconds: int = HARD_DEADLINE_SECONDS) -> None:
    """Install all three lifetime guards (see module docstring). Must run at harness startup,
    before any abra call and before any lock is taken."""
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    if libc.prctl(_PR_SET_PDEATHSIG, signal.SIGTERM, 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"prctl(PR_SET_PDEATHSIG, SIGTERM) failed: {os.strerror(err)}")
    # The prctl is armed now — but only fires for a parent death AFTER this point. If the parent
    # already died, we are reparented (ppid 1) and would never get the signal: refuse to run, an
    # orphaned harness would hold locks/apps with nothing managing its lifetime.
    if os.getppid() == 1:
        sys.exit("parent died before prctl(PR_SET_PDEATHSIG) — refusing to run orphaned")
    signal.signal(
        signal.SIGTERM,
        _funnel_handler(
            "== SIGTERM received (drone cancel / parent death) — tearing down ==",
            128 + signal.SIGTERM,
        ),
    )
    minutes = deadline_seconds // 60
    signal.signal(
        signal.SIGALRM,
        _funnel_handler(
            f"== run exceeded {minutes}-minute hard deadline — tearing down ==",
            128 + signal.SIGALRM,
        ),
    )
    signal.alarm(deadline_seconds)
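The funnel behaviour of guard 2 (first signal raises into the `finally:` path, later signals during teardown are swallowed) can be exercised in one process. This is a reduced sketch under assumptions: `demo` and its local `state`/`events` are hypothetical, it uses `signal.raise_signal` to deliver SIGTERM to itself, and it must run on the main thread because `signal.signal` requires it.

```python
import signal


def demo():
    events = []
    state = {"tearing_down": False}

    def handler(signum, frame):
        if state["tearing_down"]:
            # Teardown is already running: log and return, never raise again.
            events.append("signal ignored during teardown")
            return
        state["tearing_down"] = True
        events.append("signal funnelled")
        raise SystemExit(128 + signum)

    previous = signal.signal(signal.SIGTERM, handler)
    code = None
    try:
        try:
            signal.raise_signal(signal.SIGTERM)  # first delivery: SystemExit propagates
        finally:
            state["tearing_down"] = True  # what begin_teardown() does in the finally block
            signal.raise_signal(signal.SIGTERM)  # second delivery: swallowed
            events.append("teardown finished")
    except SystemExit as exc:
        code = exc.code
    finally:
        signal.signal(signal.SIGTERM, previous)  # restore whatever was installed before
    return code, events
```

The exit code follows the shell convention the diff uses, 128 plus the signal number.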
--- a/runner/harness/results.py
+++ b/runner/harness/results.py
@ -2,7 +2,14 @@
 Turns a run's per-tier pytest outcomes into a single `results.json` artifact carrying, per the plan:
  { recipe, version, pr, ref, run_id, finished, stages:[{name,status,tests:[{name,status,ms}]}],
-    level, level_cap_reason, rungs, flags:{clean_teardown,no_secret_leak}, screenshot, summary_card }
+    level, level_cap_reason, level_cap_rung, rungs,
+    skips:{intentional:{rung:reason}, unintentional:[rung]},
+    flags:{clean_teardown,no_secret_leak}, screenshot, summary_card }
 `skips` splits the N/A (skipped) rungs by a simple rule: a skip is INTENTIONAL iff the recipe lists
 it (with a reason) in `recipe_meta.EXPECTED_NA = {rung: reason}`; any rung skipped but not listed is
 UNINTENTIONAL (a coverage gap to fill or declare). Skips still cap the level either way — the harness
 never claims a rung it did not verify; this only labels *why* a skip happened.
 The per-test breakdown comes from JUnit XML emitted by each tier's pytest invocation (`--junitxml`),
 parsed here with the stdlib (no new dep). The integer **level** is computed by harness.level from a
@ -127,41 +134,24 @@ def collect_stages(records: list[dict]) -> list[dict]:
    return stages
 def _has_repo_local(records: list[dict]) -> bool:
    return any(r.get("source") == "repo-local" for r in records)
 def _repo_local_passed(records: list[dict]) -> bool:
    repo = [r for r in records if r.get("source") == "repo-local"]
    return bool(repo) and all(r.get("rc", 1) == 0 for r in repo)
 def derive_rungs(
    results: dict[str, str],
    *,
    backup_capable: bool,
    declared: list[str] | None,
    deps_ready: bool,
    sso_unverified: bool,
    has_custom: bool,
    has_repo_local: bool,
    repo_local_passed: bool,
 ) -> dict[str, str]:
-    """Translate the orchestrator's tier results + deps/SSO signals into the rung-status dict
-    harness.level consumes. Documented in DECISIONS.md (Phase 3). Conservative by design — never
-    reports a rung 'pass' it can't substantiate (cardinal guardrail: presentation never inflates).
+    """Translate the orchestrator's tier results into the rung-status dict harness.level consumes —
+    the FOUR essential rungs only. Conservative by design — never reports a rung 'pass' it can't
+    substantiate (cardinal guardrail: presentation never inflates).
      L1 install    : install tier pass.
      L2 upgrade    : upgrade tier (skip → N/A: only one published version).
      L3 backup/res : backup AND restore tiers pass (N/A if not backup-capable).
-      L4 functional : the recipe-specific functional (non-deps) tests pass — the custom tier, minus
-                      its SSO/integration tests. N/A if the recipe has no custom tests at all.
-      L5 integration: SSO/OIDC + cross-app. Applies ONLY if the recipe declares deps (else N/A — the
-                      "no integration surface caps at L4" rule, §4.1). pass iff deps wired
-                      (deps_ready) and not sso_unverified and the custom tier didn't fail.
-      L6 recipe-loc : the recipe repo's own tests/ (repo-local source) ran and passed (N/A if none).
+      L4 functional : recipe-specific functional tests pass — the custom tier. N/A if none ran.
+
+    Integration (SSO/OIDC) and recipe-local are OPTIONAL and intentionally NOT rungs here — they
+    never cap the level (SSO is still enforced for the run VERDICT in run_recipe_ci.py).
    """
    declared = declared or []
    rungs: dict[str, str] = {}
    rungs["install"] = level_mod.tier_to_rung(results.get("install"))
    rungs["upgrade"] = level_mod.tier_to_rung(results.get("upgrade"))
@ -170,36 +160,34 @@ def derive_rungs(
    )
    custom = results.get("custom")
    # Functional rung (L4): the non-deps custom tests.
    if not has_custom or custom == "skip" or custom is None:
        rungs["functional"] = "na"
    elif custom == "fail":
        # A custom test failed. With declared deps we cannot cheaply tell functional-vs-SSO apart, so
        # conservatively fail the functional rung (caps at L3) — never inflate.
        rungs["functional"] = "fail"
    else:  # custom == "pass"
        rungs["functional"] = "pass"
    # Integration rung (L5): only recipes with an SSO/integration surface (declared deps) can climb.
    if not declared:
        rungs["integration"] = "na"
    elif sso_unverified or not deps_ready or custom == "fail":
        # SSO not wired/verified, or a custom test failed → integration not verified.
        rungs["integration"] = "fail"
    elif custom == "pass":
        rungs["integration"] = "pass"
    else:
        # declared deps but no custom tests ran — can't claim integration verified
        rungs["integration"] = "na"
    # Recipe-local rung (L6).
    if not has_repo_local:
        rungs["recipe_local"] = "na"
    else:
        rungs["recipe_local"] = "pass" if repo_local_passed else "fail"
    return rungs
 def skips(rungs: dict[str, str], expected_na: dict | None) -> dict:
    """Split the SKIPPED (N/A) rungs into intentional vs unintentional (operator model).
    A recipe lists the rungs it intentionally skips, each with a reason, in
    `recipe_meta.EXPECTED_NA = {rung: reason}`. The rule is dead simple: a skipped rung is
    **intentional** iff it is in that list; any rung that is skipped and NOT in the list is
    **unintentional** (a coverage gap someone should either fill or declare). N/A still caps the
    level either way — the harness never claims a rung it did not verify — this only labels *why* a
    skip happened. Returns:
      { "intentional": {rung: reason, ...},   # skipped AND declared in EXPECTED_NA
        "unintentional": [rung, ...] }         # skipped but NOT declared
    """
    expected = {str(k): str(v) for k, v in (expected_na or {}).items()}
    na = [r for r, st in rungs.items() if st == "na"]
    intentional = {r: expected[r] for r in na if r in expected}
    unintentional = sorted(r for r in na if r not in expected)
    return {"intentional": intentional, "unintentional": unintentional}
 def build_results(
    *,
    recipe: str,
@ -209,30 +197,24 @@ def build_results(
    records: list[dict],
    results: dict[str, str],
    backup_capable: bool,
    declared: list[str] | None,
    deps_ready: bool,
    sso_unverified: bool,
    clean_teardown: bool,
    no_secret_leak: bool,
    finished_ts: float | None,
    screenshot: str | None = None,
    summary_card: str | None = None,
    expected_na: dict | None = None,
 ) -> dict:
    """Assemble the full results.json dict (no I/O). `finished_ts` is passed in (the orchestrator
-    stamps it) so this stays pure and deterministic for unit tests."""
+    stamps it) so this stays pure and deterministic for unit tests. `expected_na` is the recipe's
+    declared intentional-skip map (recipe_meta.EXPECTED_NA) used to distinguish a deliberate skip from
+    accidentally-missing coverage."""
    stages = collect_stages(records)
    has_custom = any(r["tier"] == "custom" for r in records)
-    rungs = derive_rungs(
-        results,
-        backup_capable=backup_capable,
-        declared=declared,
-        deps_ready=deps_ready,
-        sso_unverified=sso_unverified,
-        has_custom=has_custom,
-        has_repo_local=_has_repo_local(records),
-        repo_local_passed=_repo_local_passed(records),
-    )
+    rungs = derive_rungs(results, backup_capable=backup_capable, has_custom=has_custom)
    lvl, cap_reason = level_mod.compute_level(rungs)
    # The rung that capped the climb (lowest non-pass), or None on a full climb — lets a consumer
    # (card/badge) tell whether the cap was an intentional skip, an unintentional one, or a failure.
    capped = level_mod.RUNGS[lvl] if cap_reason else None
    return {
        "schema": 1,
        "run_id": run_id(),
@ -243,7 +225,9 @@ def build_results(
        "finished": finished_ts,
        "level": lvl,
        "level_cap_reason": cap_reason,
        "level_cap_rung": capped,
        "rungs": rungs,
        "skips": skips(rungs, expected_na),
        "stages": stages,
        "results": results,
        "flags": {
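A small worked example of the intentional/unintentional split that the new `skips` key carries. The function body mirrors the `skips()` logic in this hunk; the rung names and the `EXPECTED_NA` reason string below are made up for illustration, and `split_skips` is a hypothetical stand-in name.

```python
def split_skips(rungs, expected_na):
    """Label each N/A rung: declared in EXPECTED_NA means intentional, otherwise unintentional."""
    expected = {str(k): str(v) for k, v in (expected_na or {}).items()}
    na = [r for r, st in rungs.items() if st == "na"]
    return {
        "intentional": {r: expected[r] for r in na if r in expected},
        "unintentional": sorted(r for r in na if r not in expected),
    }


# Hypothetical run: two rungs skipped, only one of them declared by the recipe.
example_rungs = {
    "install": "pass",
    "upgrade": "na",
    "backup_restore": "na",
    "functional": "pass",
}
example_expected_na = {"upgrade": "only one published version"}
example = split_skips(example_rungs, example_expected_na)
```

Both skipped rungs still cap the level. The split only records whether the recipe declared the skip.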
--- a/runner/harness/warmsnap.py
+++ b/runner/harness/warmsnap.py
@ -113,7 +113,9 @@ def _assert_undeployed(domain: str) -> None:
        )
-def snapshot(recipe: str, domain: str, commit: str | None = None, version: str | None = None) -> dict:
+def snapshot(
+    recipe: str, domain: str, commit: str | None = None, version: str | None = None
+) -> dict:
    """Take a last-known-good snapshot of every data volume of <domain>'s stack. The app MUST be
    undeployed. Atomically replaces the prior last-good. Returns the written meta dict."""
    _assert_undeployed(domain)
@ -169,7 +171,9 @@ def restore(recipe: str, domain: str) -> dict:
    for vol in meta.get("volumes", []):
        tar_path = os.path.join(volumes_dir(recipe), f"{vol}.tar")
        if vol not in current:
-            raise SnapshotError(f"snapshot volume {vol} absent from current stack {sorted(current)}")
+            raise SnapshotError(
+                f"snapshot volume {vol} absent from current stack {sorted(current)}"
+            )
        mp = _volume_mountpoint(vol)
        # Clear the volume contents (incl. dotfiles) without removing the mountpoint itself.
        r = _run(["sh", "-c", f'rm -rf -- "{mp}"/* "{mp}"/.[!.]* "{mp}"/..?* 2>/dev/null; true'])
--- a/runner/nightly_sweep.py
+++ b/runner/nightly_sweep.py
@ -60,14 +60,17 @@ def sweep() -> int:
    for r in recipes:
        print(f"\n===== nightly: full-cold {r} (latest) =====", flush=True)
        env = dict(os.environ, RECIPE=r)
-        env.pop("REF", None)      # latest, not a PR head
+        env.pop("REF", None)  # latest, not a PR head
        env.pop("CCCI_QUICK", None)
        env.pop("MODE", None)
        rc = subprocess.run(
            [sys.executable, os.path.join(_here(), "run_recipe_ci.py")], env=env
        ).returncode
        results[r] = rc
-        print(f"nightly: {r} rc={rc} ({'green→canonical refreshed' if rc == 0 else 'red'})", flush=True)
+        print(
+            f"nightly: {r} rc={rc} ({'green→canonical refreshed' if rc == 0 else 'red'})",
+            flush=True,
+        )
    # WC8 disk hygiene: drop warm data for de-enrolled canonicals; log the disk budget.
    pruned = canonical.prune_stale()
    if pruned:
--- a/runner/run_recipe_ci.py
+++ b/runner/run_recipe_ci.py
@ -44,17 +44,26 @@ sys.path.insert(0, os.path.join(ROOT, "runner"))
 from harness import (  # noqa: E402
    abra,
    canonical,
-    card as card_mod,
-    deps as deps_mod,
    discovery,
    generic,
    lifecycle,
    lifetime,
    naming,
-    results as results_mod,
-    screenshot as screenshot_mod,
    warm,
    warmsnap,
 )
 from harness import (  # noqa: E402
    card as card_mod,
 )
 from harness import (  # noqa: E402
    deps as deps_mod,
 )
 from harness import (  # noqa: E402
    results as results_mod,
 )
 from harness import (  # noqa: E402
    screenshot as screenshot_mod,
 )
 ALL_STAGES = ("install", "upgrade", "backup", "restore", "custom")
@ -129,18 +138,73 @@ def _gitea_token() -> str | None:
    return tok or None
 def _run_state_path(name: str) -> str:
    """Run-scoped state file in the tempdir, keyed by run id + harness pid — NEVER by app domain.
    A second run of the SAME domain overlaps this process (its main() preamble executes before it
    blocks at the app lock inside deploy_app), so domain-keyed files get reset/removed under the
    live run: M2(c) double-!testme produced a false DG4.1 deploy-count=2 in run 1 and a countfile
    FileNotFoundError crash in run 2. Children never re-derive these paths — they receive them
    via the CCCI_*_FILE env vars, so the key only has to be unique per harness process."""
    rid = results_mod.run_id()
    return os.path.join(tempfile.gettempdir(), f"ccci-{name}-{rid}-{os.getpid()}")
 def setup_run_abra_dir() -> str:
    """P3: build + export this run's PER-RUN ABRA_DIR — structural isolation of recipe trees.
    `<runs_dir>/<run-id>/abra/` with:
      servers/   -> symlink to the canonical ~/.abra/servers. App .env files land in the shared
                    canonical path, so janitor discovery (`abra app ls`) and env-based teardown
                    work unchanged from any process; per-domain filenames + the app-domain lock
                    prevent write conflicts.
      catalogue/ -> symlink to the canonical ~/.abra/catalogue (read-mostly).
      recipes/   fresh + empty — THE isolation that matters: each run clones and git-checkouts
                 its own recipe trees, so concurrent runs (same recipe included) can never
                 corrupt each other's deploy tree. Replaces the per-recipe flock.
    Exported as $ABRA_DIR — honored by the abra CLI and by every harness path helper
    (abra.abra_dir()) — BEFORE any abra call. Rides along the existing run-dir retention."""
    canonical = os.path.expanduser("~/.abra")
    rid = results_mod.run_id()
    if rid == "manual":
        rid = f"manual-{os.getpid()}"  # two concurrent hand-runs must not share a tree
    run_abra_dir = os.path.join(results_mod.runs_dir(), rid, "abra")
    os.makedirs(os.path.join(run_abra_dir, "recipes"), exist_ok=True)
    for shared in ("servers", "catalogue"):
        link = os.path.join(run_abra_dir, shared)
        if not os.path.islink(link):
            os.symlink(os.path.join(canonical, shared), link)
    os.environ["ABRA_DIR"] = run_abra_dir
    print(
        f"== per-run ABRA_DIR: {run_abra_dir} (servers/catalogue -> canonical; fresh recipes/) ==",
        flush=True,
    )
    return run_abra_dir
 def fetch_recipe(recipe: str, ref: str | None, src: str | None) -> None:
-    """Make the recipe available at the code under test. If SRC+REF point at the mirror PR,
+    """Make the recipe available at the code under test in THIS RUN's recipe tree
    ($ABRA_DIR/recipes/<recipe>): a plain clone — no locking needed, no rm-rf of any shared
    state (the rm below only clears this run's own leftovers, e.g. a janitor-triggered
    `abra app ls` auto-clone or a Drone build-number reuse). If SRC+REF point at the mirror PR,
    clone it at that ref; otherwise fetch the catalogue copy. Private mirror repos need the bot
    token — passed via a per-command http.extraHeader (not persisted in .git/config, not printed)."""
-    recipes_dir = os.path.expanduser("~/.abra/recipes")
-    os.makedirs(recipes_dir, exist_ok=True)
-    dest = os.path.join(recipes_dir, recipe)
-    # CCCI_SKIP_FETCH=1: use the local recipe clone as-is (lets a test/Adversary stage a fake/broken
-    # ref — e.g. a simulated broken PR head for the --quick rollback proof — without it being clobbered
-    # by a re-fetch). Never set in production CI.
+    dest = abra.recipe_dir(recipe)
+    os.makedirs(os.path.dirname(dest), exist_ok=True)
+    # CCCI_SKIP_FETCH=1: use the locally STAGED recipe clone as-is (lets a test/Adversary stage a
+    # fake/broken ref — e.g. a simulated broken PR head for the --quick rollback proof — without it
+    # being clobbered by a re-fetch). Staging happens in the canonical ~/.abra/recipes/<recipe>;
+    # copy it into the per-run tree so the rest of the run reads the staged state. Never set in
+    # production CI.
    if os.environ.get("CCCI_SKIP_FETCH") == "1":
-        print(f"[fetch] CCCI_SKIP_FETCH=1 — using local {recipe} recipe clone as-is", flush=True)
+        canonical = os.path.expanduser(f"~/.abra/recipes/{recipe}")
        subprocess.run(["rm", "-rf", dest], check=False)
        if os.path.isdir(canonical):
            shutil.copytree(canonical, dest, symlinks=True)
        print(
            f"[fetch] CCCI_SKIP_FETCH=1 — using staged {recipe} clone as-is "
            f"(copied {canonical} -> per-run tree)",
            flush=True,
        )
        return
    if src and ref:
        url = f"https://git.autonomic.zone/{src}.git"
@ -169,7 +233,7 @@ def fetch_recipe(recipe: str, ref: str | None, src: str | None) -> None:
 def snapshot_recipe_tests(recipe: str) -> str | None:
    """Copy the recipe-shipped tests/ to a stable temp dir, immune to abra re-checking-out the
    recipe to a version tag during the run. Returns the snapshot path, or None if no tests/."""
-    src = os.path.expanduser(f"~/.abra/recipes/{recipe}/tests")
+    src = os.path.join(abra.recipe_dir(recipe), "tests")
    if not os.path.isdir(src):
        return None
    has_overlay = glob.glob(os.path.join(src, "test_*.py")) or os.path.isfile(
@ -200,6 +264,7 @@ def _load_meta(recipe: str) -> dict:
        for k in list(meta) + [
            "BACKUP_CAPABLE",
            "SKIP_GENERIC",
            "EXPECTED_NA",
            "OIDC_AT_INSTALL",
            "READY_PROBE",
            "UPGRADE_BASE_VERSION",
@ -565,15 +630,15 @@ def run_quick(
        flush=True,
    )
-    statefile = os.path.join(tempfile.gettempdir(), f"ccci-opstate-{domain}.json")
+    statefile = _run_state_path("opstate") + ".json"
    with open(statefile, "w") as f:
        json.dump({}, f)
    os.environ["CCCI_OP_STATE_FILE"] = statefile
-    depsfile = os.path.join(tempfile.gettempdir(), f"ccci-deps-{domain}.json")
+    depsfile = _run_state_path("deps") + ".json"
    with open(depsfile, "w") as f:
        json.dump({}, f)
    os.environ["CCCI_DEPS_FILE"] = depsfile
-    skipfile = os.path.join(tempfile.gettempdir(), f"ccci-depskip-{domain}.txt")
+    skipfile = _run_state_path("depskip") + ".txt"
    with contextlib.suppress(OSError):
        os.remove(skipfile)
    os.environ["CCCI_DEPS_SKIP_REPORT"] = skipfile
@ -649,6 +714,8 @@ def run_quick(
            results["upgrade"] = "fail"
            results["custom"] = "skip"
    finally:
        # Teardown funnel running: further SIGTERM/SIGALRM are logged + ignored (lifetime.py).
        lifetime.begin_teardown()
        # F2-11 skip count (read before deciding pass/fail)
        requires_deps_skipped = 0
        try:
@ -812,6 +879,9 @@ def promote_canonical(recipe: str, head_ref: str | None) -> None:
 def main() -> int:
    # P1 lock-lifetime hardening: PDEATHSIG + SIGTERM/SIGALRM teardown funnel + 60-min hard
    # deadline, armed before ANY abra call or lock acquisition (see harness/lifetime.py).
    lifetime.install_lifetime_guards()
    recipe = os.environ.get("RECIPE")
    if not recipe:
        print("RECIPE env is required", file=sys.stderr)
@ -826,6 +896,10 @@ def main() -> int:
    print(
        f"== cc-ci run: recipe={recipe} ref={ref} pr={os.environ.get('PR', '0')} stages={sorted(stages)}"
    )
    # Concurrent-run safety is structural: this run's recipe trees live in its own ABRA_DIR
    # (exported here, before ANY abra call), so no recipe-tree lock exists; same-DOMAIN runs
    # serialise on the app-domain flock taken in deploy_app (see docs/concurrency.md).
    setup_run_abra_dir()
    fetch_recipe(recipe, ref, src)
    # The PR-head commit the upgrade tier re-checks out for the chaos redeploy to the code under test
    # (HC1). Prefer the explicit PR head sha ($REF) — robust + exact; fall back to the recipe checkout
@ -864,7 +938,7 @@ def main() -> int:
    hook = discovery.install_steps(recipe, repo_local)
    # Deploy-count guard (DG4.1): exactly one deploy_app() per run.
-    countfile = os.path.join(tempfile.gettempdir(), f"ccci-deploys-{domain}")
+    countfile = _run_state_path("deploys")
    with open(countfile, "w") as f:
        f.write("0")
    os.environ["CCCI_DEPLOY_COUNT_FILE"] = countfile
@ -880,7 +954,7 @@ def main() -> int:
    # Run-scoped op state (HC3): the orchestrator records op results (pre-upgrade identity, backup
    # snapshot_id) here for the assertion tiers (generic + overlay) to read via generic.op_state().
-    statefile = os.path.join(tempfile.gettempdir(), f"ccci-opstate-{domain}.json")
+    statefile = _run_state_path("opstate") + ".json"
    with open(statefile, "w") as f:
        json.dump({}, f)
    os.environ["CCCI_OP_STATE_FILE"] = statefile
@ -891,12 +965,12 @@ def main() -> int:
    # cannot break the generic-tier signal. The `setup_custom_tests` step deploys each dep + runs
    # `tests/<recipe>/setup_custom_tests.sh` to wire OIDC env via in-place redeploy.
    # `$CCCI_DEPS_FILE` is written with the full creds dict the hook script needs (jq-readable).
-    depsfile = os.path.join(tempfile.gettempdir(), f"ccci-deps-{domain}.json")
+    depsfile = _run_state_path("deps") + ".json"
    with open(depsfile, "w") as f:
        json.dump({}, f)
    os.environ["CCCI_DEPS_FILE"] = depsfile
    # F2-11: conftest appends the count of requires_deps tests it skips (deps-not-ready) here.
-    skipfile = os.path.join(tempfile.gettempdir(), f"ccci-depskip-{domain}.txt")
+    skipfile = _run_state_path("depskip") + ".txt"
    with contextlib.suppress(OSError):
        os.remove(skipfile)
    os.environ["CCCI_DEPS_SKIP_REPORT"] = skipfile
@ -1108,6 +1182,9 @@ def main() -> int:
                if op in stages:
                    results[op] = "skip"
    finally:
        # From here the teardown funnel runs: a SIGTERM/SIGALRM landing now is logged + ignored
        # (lifetime.py) so a second signal can't abort the cleanup the first one asked for.
        lifetime.begin_teardown()
        # Teardown the recipe under test FIRST, then deps in reverse declaration order.
        # Parent verify=False (Phase 1d): keep as-is so a parent residual doesn't mask a tier
        # failure. Dep teardown uses verify=True via teardown_deps (F2-5 fix); failures are
@ -1224,7 +1301,6 @@ def main() -> int:
    # a failure here NEVER changes `overall` (R7 — cosmetics never block the pipeline). ----
    data: dict | None = None
    try:
        sso_unverified = sso_dep_unverified(declared, deps_ready, requires_deps_skipped)
        clean_teardown = (deploy_count == expected_deploy_count) and not dep_teardown_error
        data = results_mod.build_results(
            recipe=recipe,
@ -1234,13 +1310,11 @@ def main() -> int:
            records=records,
            results=results,
            backup_capable=backup_cap,
            declared=declared,
            deps_ready=deps_ready,
            sso_unverified=sso_unverified,
            clean_teardown=clean_teardown,
            no_secret_leak=True,  # narrowed below by an actual scan of the serialised artifact
            screenshot=screenshot_rel,  # Phase 3 U1 (R4): relative PNG name iff capture succeeded
            finished_ts=time.time(),
            expected_na=meta.get("EXPECTED_NA"),  # declared intentional-skip map (recipe_meta)
        )
        # Real (if narrow) leak check: no known infra-secret value may appear in the artifact (R7).
        blob = json.dumps(data)
@ -1257,6 +1331,15 @@ def main() -> int:
            f"{' — ' + data['level_cap_reason'] if data['level_cap_reason'] else ''})",
            flush=True,
        )
        # Surface UNINTENTIONAL skips in the CI log (non-blocking, R7): a rung that was skipped (N/A)
        # but is not in the recipe's intentional list — either add the missing coverage or declare it.
        for rung in data.get("skips", {}).get("unintentional", []):
            print(
                f"⚠ coverage: rung '{rung}' was skipped (N/A) but is not declared intentional — add "
                f"the missing test/label, or list it in tests/{recipe}/recipe_meta.py "
                f"EXPECTED_NA = {{'{rung}': '<why>'}}.",
                flush=True,
            )
    except Exception as e:  # noqa: BLE001 — results assembly is cosmetic; never fail a run on it (R7)
        print(
            f"!! results.json assembly failed (non-fatal, verdict unaffected): {_scrub(str(e))}",
@ -1275,8 +1358,21 @@ def main() -> int:
            with open(html_path, "w", encoding="utf-8") as f:
                f.write(card_mod.render_card_html(data, screenshot_rel=data.get("screenshot")))
            png = card_mod.render_card_png(html_path, os.path.join(run_artifact_dir, "summary.png"))
            capped = data.get("level_cap_rung")
            sk = data.get("skips", {})
            cap_skip = (
                "intentional"
                if capped in (sk.get("intentional") or {})
                else "unintentional"
                if capped in (sk.get("unintentional") or [])
                else ""
            )
            with open(os.path.join(run_artifact_dir, "badge.svg"), "w", encoding="utf-8") as f:
-                f.write(card_mod.level_badge_svg(data["level"], data.get("level_cap_reason", "")))
+                f.write(
                    card_mod.level_badge_svg(
                        data["level"], data.get("level_cap_reason", ""), cap_skip
                    )
                )
            print(
                f"summary card {'rendered ' + png if png else '(PNG render unavailable)'} + "
                f"badge.svg written into {run_artifact_dir}",
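Reviewer sketch (not part of the commit): the keying change made by _run_state_path() above can be illustrated in isolation. The names run_state_path and domain_keyed_path below are hypothetical stand-ins, and the run id is passed in explicitly rather than read from results_mod.run_id(); this is a minimal sketch of the idea, assuming only that the key must be unique per harness process.

```python
from __future__ import annotations

import os
import tempfile


def run_state_path(name: str, run_id: str, pid: int | None = None) -> str:
    """Hypothetical stand-in for _run_state_path(): key by run id + pid, never by domain."""
    pid = os.getpid() if pid is None else pid
    return os.path.join(tempfile.gettempdir(), f"ccci-{name}-{run_id}-{pid}")


def domain_keyed_path(name: str, domain: str) -> str:
    """The pre-fix keying, shown only for contrast: same domain means same file."""
    return os.path.join(tempfile.gettempdir(), f"ccci-{name}-{domain}")


# Two overlapping runs of the same domain (different harness pids) no longer share a file.
first = run_state_path("deploys", "281", pid=100)
second = run_state_path("deploys", "281", pid=200)
print(first != second)
```

Under domain keying both runs would resolve to the same countfile, which is the collision the commit message describes; with run id + pid the second run's preamble cannot reset or remove the first run's state.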
--- a/runner/warm_reconcile.py
+++ b/runner/warm_reconcile.py
@ -43,11 +43,16 @@ def _traefik_setup(recipe: str, domain: str, version: str) -> None:
    ssl_cert/ssl_key swarm secrets; NO ACME). Uses the proven abra.env_set (newline-safe, unlike the
    bash set_env that bit keycloak)."""
    cert_dir = "/var/lib/ci-certs/live"
-    if not (os.path.isfile(f"{cert_dir}/fullchain.pem") and os.path.isfile(f"{cert_dir}/privkey.pem")):
+    if not (
        os.path.isfile(f"{cert_dir}/fullchain.pem") and os.path.isfile(f"{cert_dir}/privkey.pem")
    ):
        raise RuntimeError(f"FATAL: wildcard cert missing at {cert_dir} (sops decrypt broken?)")
    if not os.path.isfile(env_file(domain)):
-        _run(["abra", "app", "new", recipe, "-s", "default", "-D", domain, version, "-o", "-n"],
-             timeout=120, check=True)
+        _run(
+            ["abra", "app", "new", recipe, "-s", "default", "-D", domain, version, "-o", "-n"],
+            timeout=120,
+            check=True,
+        )
    abra.env_set(domain, "DOMAIN", domain)
    abra.env_set(domain, "LETS_ENCRYPT_ENV", "")
    abra.env_set(domain, "WILDCARDS_ENABLED", "1")
@ -61,11 +66,39 @@ def _traefik_setup(recipe: str, domain: str, version: str) -> None:
        return any(s.endswith(f"_{name}_v1") for s in have)
    if not _has("ssl_cert"):
-        _run(["abra", "app", "secret", "insert", domain, "ssl_cert", "v1",
-              f"{cert_dir}/fullchain.pem", "-f", "-n"], timeout=120, check=True)
+        _run(
+            [
+                "abra",
+                "app",
+                "secret",
+                "insert",
+                domain,
+                "ssl_cert",
+                "v1",
+                f"{cert_dir}/fullchain.pem",
+                "-f",
+                "-n",
+            ],
+            timeout=120,
+            check=True,
+        )
    if not _has("ssl_key"):
-        _run(["abra", "app", "secret", "insert", domain, "ssl_key", "v1",
-              f"{cert_dir}/privkey.pem", "-f", "-n"], timeout=120, check=True)
+        _run(
+            [
+                "abra",
+                "app",
+                "secret",
+                "insert",
+                domain,
+                "ssl_key",
+                "v1",
+                f"{cert_dir}/privkey.pem",
+                "-f",
+                "-n",
+            ],
+            timeout=120,
+            check=True,
+        )
 SPECS: dict[str, dict] = {
@ -166,7 +199,13 @@ def _run(cmd, timeout=120, check=False):
 def _recipe_dir(recipe: str) -> str:
-    return os.path.expanduser(f"~/.abra/recipes/{recipe}")
+    # Resolve like the abra CLI does: $ABRA_DIR (the per-run tree when imported by a CI run,
    # e.g. promote_canonical) else the canonical ~/.abra (this module's own systemd-timer runs,
    # which set no ABRA_DIR). Keeps fetch_recipe (an `abra` subprocess) and the git readers
    # below pointed at the SAME tree in both contexts.
    return os.path.join(
        os.environ.get("ABRA_DIR") or os.path.expanduser("~/.abra"), "recipes", recipe
    )
 def recipe_tags(recipe: str) -> list[str]:
@ -218,8 +257,17 @@ def health_code(spec: dict) -> int:
    domain = spec.get("health_domain", spec["domain"])
    r = _run(
        [
-            "curl", "-sk", "-o", "/dev/null", "-w", "%{http_code}", "--max-time", "10",
-            "--resolve", f"{domain}:443:127.0.0.1", f"https://{domain}{spec['health_path']}",
+            "curl",
+            "-sk",
+            "-o",
+            "/dev/null",
+            "-w",
+            "%{http_code}",
+            "--max-time",
+            "10",
+            "--resolve",
+            f"{domain}:443:127.0.0.1",
+            f"https://{domain}{spec['health_path']}",
        ],
        timeout=20,
    )
@ -230,7 +278,6 @@ def health_code(spec: dict) -> int:
 def wait_healthy(spec: dict, timeout: int | None = None) -> bool:
-    domain = spec["domain"]
    deadline = time.time() + (timeout or spec["health_timeout"])
    while time.time() < deadline:
        if health_code(spec) in tuple(spec["health_ok"]):
@ -325,15 +372,18 @@ def ensure_server() -> None:
 def ensure_app_config(recipe: str, domain: str, version: str) -> None:
    if not os.path.isfile(env_file(domain)):
-        _run(["abra", "app", "new", recipe, "-s", "default", "-D", domain, version, "-o", "-n"],
-             timeout=120, check=True)
+        _run(
+            ["abra", "app", "new", recipe, "-s", "default", "-D", domain, version, "-o", "-n"],
+            timeout=120,
+            check=True,
+        )
    abra.env_set(domain, "DOMAIN", domain)
    abra.env_set(domain, "LETS_ENCRYPT_ENV", "")
 def ensure_secrets(domain: str) -> None:
    stack = lifecycle._stack_name(domain)  # noqa: SLF001
-    have = {n for n in lifecycle._docker_names("secret", stack)}  # noqa: SLF001
+    have = set(lifecycle._docker_names("secret", stack))  # noqa: SLF001
    if not any(n.endswith("_admin_password_v1") for n in have):
        abra.secret_generate(domain)
@ -393,8 +443,9 @@ def reconcile(app: str) -> str:
        write_alert(app, "held-major", current=current, latest=latest, release_notes=notes[:4000])
        return f"held-major:{current}->{latest}"
    if notes_flag_manual_migration(notes):
-        write_alert(app, "held-manual-migration", current=current, latest=latest,
-                    release_notes=notes[:4000])
+        write_alert(
+            app, "held-manual-migration", current=current, latest=latest, release_notes=notes[:4000]
+        )
        return f"held-manual-migration:{current}->{latest}"
    # WC1.1 health-gated upgrade with rollback.
@ -428,8 +479,14 @@ def reconcile(app: str) -> str:
        warmsnap.restore(recipe, domain)
    deploy_version(recipe, domain, last_good, dt)
    recovered = wait_healthy(spec)
-    write_alert(app, "rollback", last_good=last_good, attempted=latest, recovered=recovered,
-                release_notes=notes[:2000])
+    write_alert(
+        app,
+        "rollback",
+        last_good=last_good,
+        attempted=latest,
+        recovered=recovered,
+        release_notes=notes[:2000],
+    )
    if not recovered:
        raise RuntimeError(f"{app} rollback to {last_good} did not become healthy")
    return f"rolled-back:{latest}->{last_good}"
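Reviewer sketch (not part of the commit): the $ABRA_DIR fallback that _recipe_dir() adopts above can be checked without touching the real environment. The function below is a hypothetical variant that takes the environment as an argument so the two contexts (a CI run with a per-run tree, a timer run with no ABRA_DIR) can be compared side by side; the example paths are made up.

```python
from __future__ import annotations

import os


def recipe_dir(recipe: str, environ: dict[str, str] | None = None) -> str:
    """Resolve <base>/recipes/<recipe>, where base is $ABRA_DIR if set and non-empty,
    else the canonical ~/.abra. Hypothetical helper mirroring the diff's logic."""
    env = os.environ if environ is None else environ
    base = env.get("ABRA_DIR") or os.path.expanduser("~/.abra")
    return os.path.join(base, "recipes", recipe)


# CI run: the per-run tree wins. Timer run (no ABRA_DIR): canonical tree.
print(recipe_dir("traefik", {"ABRA_DIR": "/tmp/runs/42/abra"}))
print(recipe_dir("traefik", {}))
```

Using `or` rather than a default argument to get() means an empty ABRA_DIR also falls back to the canonical tree, which matches the expression in the diff.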
--- a/tests/bluesky-pds/_p4.py
+++ b/tests/bluesky-pds/_p4.py
@ -15,7 +15,8 @@ import shlex
 import sys
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
-from harness import http as harness_http, lifecycle  # noqa: E402
+from harness import http as harness_http  # noqa: E402
 from harness import lifecycle
 PDS_HOST_LOCAL = "http://localhost:3000"
 _PW = "ccci-P4-marker-pw-2026"
--- a/tests/bluesky-pds/functional/test_account_and_post.py
+++ b/tests/bluesky-pds/functional/test_account_and_post.py
@ -27,6 +27,7 @@ CRUD). A wedged PDS subsystem fails AT its layer.
 from __future__ import annotations
 import contextlib
 import os
 import re
 import secrets
@ -35,7 +36,8 @@ import sys
 import uuid
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", "runner"))
-from harness import http as harness_http, lifecycle  # noqa: E402
+from harness import http as harness_http  # noqa: E402
 from harness import lifecycle
 PDS_HOST_LOCAL = "http://localhost:3000"
@ -58,14 +60,18 @@ def _goat_admin(domain: str, args: str) -> str:
    return _in_container(domain, cmd)
-def _xrpc_post(domain: str, nsid: str, data: dict, token: str | None = None) -> tuple[int, dict | None]:
+def _xrpc_post(
    domain: str, nsid: str, data: dict, token: str | None = None
 ) -> tuple[int, dict | None]:
    headers = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return harness_http.http_post(f"https://{domain}/xrpc/{nsid}", data=data, headers=headers)
-def _xrpc_get(domain: str, nsid: str, query: str, token: str | None = None) -> tuple[int, dict | None]:
+def _xrpc_get(
    domain: str, nsid: str, query: str, token: str | None = None
 ) -> tuple[int, dict | None]:
    headers = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"
@ -82,9 +88,9 @@ def test_account_lifecycle_and_post_roundtrip(live_app):
    # Step 1: PDS describe via goat — recipe self-identifies as did:web:<domain>
    out = _in_container(domain, f"goat pds describe {PDS_HOST_LOCAL} 2>&1")
-    assert f"did:web:{domain}" in out, (
-        f"goat pds describe did not contain expected DID 'did:web:{domain}'. Output:\n{out[:500]!r}"
-    )
+    assert (
+        f"did:web:{domain}" in out
+    ), f"goat pds describe did not contain expected DID 'did:web:{domain}'. Output:\n{out[:500]!r}"
    # Step 2: Create account (UUID-suffixed handle = no run-to-run collision)
    out = _goat_admin(
@ -127,9 +133,9 @@ def test_account_lifecycle_and_post_roundtrip(live_app):
        assert s == 200, f"createRecord HTTP {s}: {body!r}"
        record_uri = (body or {}).get("uri", "")
        # URI format: at://<did>/app.bsky.feed.post/<rkey>
-        assert record_uri.startswith(f"at://{new_did}/app.bsky.feed.post/"), (
-            f"unexpected record uri: {record_uri!r}"
-        )
+        assert record_uri.startswith(
+            f"at://{new_did}/app.bsky.feed.post/"
+        ), f"unexpected record uri: {record_uri!r}"
        rkey = record_uri.rsplit("/", 1)[-1]
        assert rkey, f"no rkey in uri: {record_uri!r}"
@ -142,15 +148,13 @@ def test_account_lifecycle_and_post_roundtrip(live_app):
        )
        assert s == 200, f"getRecord HTTP {s}: {body!r}"
        record_value = (body or {}).get("value", {})
-        assert record_value.get("text") == marker, (
-            f"post text did not round-trip: created={marker!r}, fetched={record_value.get('text')!r}"
-        )
+        assert (
+            record_value.get("text") == marker
+        ), f"post text did not round-trip: created={marker!r}, fetched={record_value.get('text')!r}"
        assert record_value.get("$type") == "app.bsky.feed.post"
    finally:
        # Step 6: Best-effort cleanup. (The per-run domain teardown will discard the volume
        # too, but we exercise the delete-account path because it's part of §4.3.)
        if cleanup_did:
-            try:
-                _goat_admin(domain, f"account delete {cleanup_did}")
-            except Exception:  # noqa: BLE001
-                pass
+            with contextlib.suppress(Exception):
+                _goat_admin(domain, f"account delete {cleanup_did}")
--- a/tests/bluesky-pds/functional/test_describe_server.py
+++ b/tests/bluesky-pds/functional/test_describe_server.py
@ -26,6 +26,6 @@ def test_describe_server_returns_atproto_envelope(live_app):
    # At least one of these atproto-spec fields must be present
    expected_any = ("availableUserDomains", "inviteCodeRequired", "links", "did")
    present = [k for k in expected_any if k in body]
-    assert present, (
-        f"describe-server missing all of {expected_any}; got keys: {sorted(body.keys())[:20]}"
-    )
+    assert (
+        present
+    ), f"describe-server missing all of {expected_any}; got keys: {sorted(body.keys())[:20]}"
--- a/tests/bluesky-pds/functional/test_health_check.py
+++ b/tests/bluesky-pds/functional/test_health_check.py
@ -17,6 +17,6 @@ def test_pds_health_returns_version(live_app):
    url = f"https://{live_app}/xrpc/_health"
    status, body = harness_http.retry_http_get(url, expect_status=200, max_wait=60, interval=3)
    assert status == 200, f"GET {url} HTTP {status} (expected 200)"
-    assert isinstance(body, dict) and isinstance(body.get("version"), str) and body["version"], (
-        f"GET {url} response is not the expected health envelope: {body!r}"
-    )
+    assert (
+        isinstance(body, dict) and isinstance(body.get("version"), str) and body["version"]
+    ), f"GET {url} response is not the expected health envelope: {body!r}"
--- a/tests/bluesky-pds/functional/test_session_auth.py
+++ b/tests/bluesky-pds/functional/test_session_auth.py
@ -30,6 +30,6 @@ def test_get_session_requires_auth(live_app):
        f"body: {body!r}"
    )
    # The XRPC error envelope is JSON with an `error` field per the atproto spec.
-    assert isinstance(body, dict) and body.get("error"), (
-        f"expected XRPC JSON error envelope; got: {body!r}"
-    )
+    assert isinstance(body, dict) and body.get(
+        "error"
+    ), f"expected XRPC JSON error envelope; got: {body!r}"
--- a/tests/bluesky-pds/install_steps.sh
+++ b/tests/bluesky-pds/install_steps.sh
@ -22,12 +22,12 @@ echo "  bluesky-pds install_steps: generating secp256k1 PLC rotation key..."
 # same shape the PDS expects (32-byte hex). Equivalent for atproto PDS bootstrap.
 KEY_HEX=$(cc-ci-run -c 'import secrets; print(secrets.token_bytes(32).hex())')
 if [ -z "${KEY_HEX}" ] || [ "${#KEY_HEX}" != "64" ]; then
-    echo "  install_steps: failed to generate PLC rotation key (KEY_HEX length=${#KEY_HEX})" >&2
-    exit 1
+  echo "  install_steps: failed to generate PLC rotation key (KEY_HEX length=${#KEY_HEX})" >&2
+  exit 1
 fi
 # Insert via abra under TTY-wrap (`abra app secret insert` requires a TTY on this version).
 # We DON'T log the key value — abra also doesn't print it.
 script -qec "abra app secret insert ${CCCI_APP_DOMAIN} pds_plc_rotation_key v1 ${KEY_HEX} --no-input" /dev/null \
-    >/dev/null 2>&1
+  >/dev/null 2>&1
 echo "  bluesky-pds install_steps: PLC rotation key inserted (v1)."
--- a/tests/bluesky-pds/test_restore.py
+++ b/tests/bluesky-pds/test_restore.py
@ -11,6 +11,6 @@ import _p4  # noqa: E402
 def test_restore_returns_state(live_app):
-    assert _p4.account_exists(live_app), (
-        "restore did not bring back the seeded marker account (PDS data did not survive restore)"
-    )
+    assert _p4.account_exists(
+        live_app
+    ), "restore did not bring back the seeded marker account (PDS data did not survive restore)"
--- a/tests/concurrency/concutil.py
+++ b/tests/concurrency/concutil.py
@ -0,0 +1,108 @@
 """Shared utilities for the real-kernel concurrency suite (imported by the test modules; the
 fixtures in conftest.py wrap these). No flock mocking anywhere — probes use real LOCK_NB."""
 from __future__ import annotations
 import contextlib
 import fcntl
 import os
 import signal
 import subprocess
 import sys
 import time
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
 from harness import lifecycle  # noqa: E402
 HELPERS = os.path.join(os.path.dirname(__file__), "helpers.py")
 DOMAIN = "test-abc123.ci.commoninternet.net"  # matches RUN_APP_RE
 class HelperPool:
    """Spawns helpers.py subprocesses and GUARANTEES their cleanup (incl. recorded grandchild
    pids from `hold-with-child`/`wrapper` markers) — no leaked children in the test VM."""
    def __init__(self, out_dir: str):
        self.out_dir = out_dir
        self.procs: list[subprocess.Popen] = []
        self.extra_pids: list[int] = []
        self._n = 0
    def spawn(self, *args: str, env_extra: dict | None = None) -> tuple[subprocess.Popen, str]:
        """Start `helpers.py <args...>`; returns (proc, marker_file)."""
        self._n += 1
        out = os.path.join(self.out_dir, f"helper-{self._n}.out")
        env = dict(os.environ, CCCI_HELPER_OUT=out, **(env_extra or {}))
        p = subprocess.Popen(  # noqa: S603
            [sys.executable, HELPERS, *args],
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.STDOUT,
        )
        self.procs.append(p)
        return p, out
    def track_pid(self, pid: int) -> None:
        self.extra_pids.append(pid)
    def cleanup(self) -> None:
        for p in self.procs:
            if p.poll() is None:
                p.kill()
            with contextlib.suppress(subprocess.TimeoutExpired):
                p.wait(timeout=10)
        for pid in self.extra_pids:
            with contextlib.suppress(OSError):
                os.kill(pid, signal.SIGKILL)
 def wait_marker(out: str, token: str, timeout: float = 15.0) -> str | None:
    """Poll a helper's marker file for a line containing `token`; returns the line or None."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(out) as f:
                for line in f:
                    if token in line:
                        return line.strip()
        except OSError:
            pass
        time.sleep(0.1)
    return None
 def lock_state(domain: str) -> str:
    """'held' | 'free' | 'absent' for the domain's lockfile, probed with a REAL LOCK_NB."""
    path = lifecycle._app_lock_path(domain)  # noqa: SLF001
    if not os.path.exists(path):
        return "absent"
    with open(path, "a") as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return "free"
        except BlockingIOError:
            return "held"
 def wait_lock_state(domain: str, want: str, timeout: float = 10.0) -> str:
    """Poll until lock_state(domain) == want (kernel release on process death is fast, but give
    the scheduler room). Returns the final observed state."""
    deadline = time.time() + timeout
    state = lock_state(domain)
    while state != want and time.time() < deadline:
        time.sleep(0.1)
        state = lock_state(domain)
    return state
 def pid_alive(pid: int) -> bool:
    return os.path.exists(f"/proc/{pid}")
 def wait_pid_gone(pid: int, timeout: float = 15.0) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not pid_alive(pid):
            return True
        time.sleep(0.1)
    return False
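Reviewer sketch (not part of the commit): the non-blocking probe that lock_state() performs above can be exercised in a single process, because on Linux flock locks belong to the open file description, so a second open() of the same path conflicts with a lock held through the first. The function name probe_lock and the temporary lockfile are assumptions for illustration; the real helper resolves its path through lifecycle._app_lock_path().

```python
import fcntl
import os
import tempfile


def probe_lock(path: str) -> str:
    """Return 'absent', 'held' or 'free' for a lockfile, using a real LOCK_NB attempt.
    Hypothetical standalone version of lock_state(); takes a path instead of a domain."""
    if not os.path.exists(path):
        return "absent"
    with open(path, "a") as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return "held"
        # The probe briefly owns the lock; closing the file on exit releases it.
        return "free"


lock_path = os.path.join(tempfile.mkdtemp(), "app.lock")
print(probe_lock(lock_path))  # no file yet

holder = open(lock_path, "a")
fcntl.flock(holder, fcntl.LOCK_EX)
print(probe_lock(lock_path))  # a second descriptor cannot take the lock

holder.close()
print(probe_lock(lock_path))  # released when the holder closes
```

Note that a "free" result means the probe itself took and then dropped the lock, so it is a point-in-time observation only; that is presumably why the suite pairs it with the polling wait_lock_state() helper.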
--- a/tests/concurrency/conftest.py
+++ b/tests/concurrency/conftest.py
@ -0,0 +1,34 @@
 """Fixtures for the real-kernel concurrency suite (concurrency-restructure plan, 19 cases).
 NOT part of the default `pytest tests/unit` gate — run explicitly with `pytest tests/concurrency
 -q` (docs/concurrency.md). Locks live in a per-test tmp dir (CCCI_APP_LOCK_DIR); helper
 subprocesses hold REAL flocks / install the REAL prctl+signal guards and are always reaped in
 fixture finalizers (no leaked children in the test VM).
 """
 from __future__ import annotations
 import os
 import sys
 import pytest
 sys.path.insert(0, os.path.dirname(__file__))
 from concutil import HelperPool  # noqa: E402
@pytest.fixture
 def lock_dir(tmp_path, monkeypatch):
    """Sandbox lock dir, exported so BOTH this process's lifecycle calls and helper subprocesses
    (which inherit os.environ) resolve their lockfiles here — never /run/lock."""
    d = tmp_path / "locks"
    d.mkdir()
    monkeypatch.setenv("CCCI_APP_LOCK_DIR", str(d))
    return str(d)
@pytest.fixture
 def pool(tmp_path):
    hp = HelperPool(str(tmp_path))
    yield hp
    hp.cleanup()
--- a/tests/concurrency/helpers.py
+++ b/tests/concurrency/helpers.py
@ -0,0 +1,149 @@
 #!/usr/bin/env python3
 """Subprocess helpers for tests/concurrency — REAL kernel locks and the REAL lifetime guards in
 separate processes (flock/prctl are never mocked; tests assert on actual kernel behavior).
 Invoked as:  python3 helpers.py <command> <args...>
 Env contract (set by the spawning test):
  CCCI_APP_LOCK_DIR  sandbox lock dir (never /run/lock in tests)
  CCCI_HELPER_OUT    marker file this helper APPENDS progress lines to (ACQUIRED/READY/...)
 Commands:
  hold <domain>                 acquire the app lock, mark `ACQUIRED <ts>`, sleep forever
  hold-with-child <domain>      acquire the lock, spawn a plain sleeping subprocess child, mark
                                `ACQUIRED <ts>` + `CHILD <pid>` (PEP 446: the child must NOT
                                inherit the lock fd), sleep forever
  guarded <domain> <deadline>   install the REAL lifetime guards (alarm=<deadline>s), acquire the
                                lock, mark `READY`; when the teardown funnel runs (`finally:`),
                                mark `TEARDOWN` before exiting
  wrapper <domain>              spawn `guarded <domain> 3600` as MY child, mark `WRAPPED <pid>`,
                                sleep — the test kills me to prove PDEATHSIG TERMs the child
  orphan-probe                  wait (bounded) until reparented (ppid==1), then install the
                                guards; mark `REFUSED` if they exit (expected) or `GUARDS_OK`
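  fetch-checkout <recipe> <ref> run run_recipe_ci.fetch_recipe (the test sets CCCI_SKIP_FETCH=1
                                + a per-"run" ABRA_DIR), git-checkout <ref>, mark
                                `RESULT <head> <data.txt content>`
  deploy-count-run <domain> <gate>
                                init this run's own countfile, fire the pre-lock _record_deploy
                                (mark `PRELOCK`), acquire the lock (mark `ACQUIRED`, no ts), wait
                                for the <gate> file ('' = no wait), then read + remove the
                                countfile and mark `COUNT <n>` (or `COUNT_FILE_MISSING`)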
 """
 from __future__ import annotations
 import os
 import subprocess
 import sys
 import time
 sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "..", "runner"))
 from harness import abra, lifecycle, lifetime  # noqa: E402
 OUT = os.environ.get("CCCI_HELPER_OUT")
 def mark(line: str) -> None:
    if OUT:
        with open(OUT, "a") as f:
            f.write(line + "\n")
            f.flush()
    print(line, flush=True)
 def cmd_hold(domain: str) -> None:
    lifecycle.acquire_app_lock(domain)
    mark(f"ACQUIRED {time.time()}")
    time.sleep(3600)
 def cmd_hold_with_child(domain: str) -> None:
    lifecycle.acquire_app_lock(domain)
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(3600)"])
    mark(f"ACQUIRED {time.time()}")
    mark(f"CHILD {child.pid}")
    time.sleep(3600)
 def cmd_guarded(domain: str, deadline: str) -> None:
    lifetime.install_lifetime_guards(deadline_seconds=int(deadline))
    lifecycle.acquire_app_lock(domain)
    mark("READY")
    try:
        time.sleep(3600)
    finally:
        mark("TEARDOWN")
 def cmd_wrapper(domain: str) -> None:
    p = subprocess.Popen(  # noqa: S603
        [sys.executable, os.path.abspath(__file__), "guarded", domain, "3600"],
        env=os.environ.copy(),
    )
    mark(f"WRAPPED {p.pid}")
    time.sleep(3600)
 def cmd_orphan_probe() -> None:
    # Our spawner exits immediately after fork; wait (bounded) until we are reparented so the
    # prctl is installed with the parent ALREADY dead — the exact race the ppid check closes.
    for _ in range(200):
        if os.getppid() == 1:
            break
        time.sleep(0.05)
    else:
        mark("NEVER_REPARENTED")  # e.g. a subreaper environment — test will fail visibly
        return
    try:
        lifetime.install_lifetime_guards()
    except SystemExit:
        mark("REFUSED")
        raise
    mark("GUARDS_OK")
 def cmd_fetch_checkout(recipe: str, ref: str) -> None:
    import run_recipe_ci
    run_recipe_ci.fetch_recipe(recipe, None, None)
    abra.recipe_checkout(recipe, ref)
    head = abra.recipe_head_commit(recipe)
    with open(os.path.join(abra.recipe_dir(recipe), "data.txt")) as f:
        content = f.read().strip()
    mark(f"RESULT {head} {content}")
 def cmd_deploy_count_run(domain: str, gate: str) -> None:
    """Mirror the REAL run flow for the DG4.1 counter (CONC-A1 regression): countfile init
    (main() preamble) → _record_deploy (deploy_app fires it BEFORE the app lock) → acquire
    the app lock → wait for `gate` (file path; '' = no wait) → read + remove own countfile.
    Two of these on the SAME domain must each see COUNT 1 and never lose their file."""
    import run_recipe_ci
    countfile = run_recipe_ci._run_state_path("deploys")
    with open(countfile, "w") as f:
        f.write("0")
    os.environ["CCCI_DEPLOY_COUNT_FILE"] = countfile
    lifecycle._record_deploy()  # pre-lock, exactly like lifecycle.deploy_app()
    mark("PRELOCK")
    lifecycle.acquire_app_lock(domain)
    mark("ACQUIRED")
    if gate:
        deadline = time.time() + 15
        while not os.path.exists(gate) and time.time() < deadline:
            time.sleep(0.05)
    try:
        with open(countfile) as f:
            n = int(f.read().strip() or "0")
        os.remove(countfile)
        mark(f"COUNT {n}")
    except FileNotFoundError:
        mark("COUNT_FILE_MISSING")
 if __name__ == "__main__":
    cmd, *args = sys.argv[1:]
    {
        "hold": cmd_hold,
        "hold-with-child": cmd_hold_with_child,
        "guarded": cmd_guarded,
        "wrapper": cmd_wrapper,
        "orphan-probe": cmd_orphan_probe,
        "fetch-checkout": cmd_fetch_checkout,
        "deploy-count-run": cmd_deploy_count_run,
    }[cmd](*args)
--- a/tests/concurrency/test_abra_dir.py
+++ b/tests/concurrency/test_abra_dir.py
@ -0,0 +1,175 @@
 """Per-run ABRA_DIR isolation (concurrency-restructure plan, cases 17-19). Real directories,
 real symlinks, real git — abra itself is replaced by a recording stub where a CLI call is
 involved (case 17), because these cases test OUR dir/env plumbing, not abra."""
 from __future__ import annotations
 import os
 import stat
 import subprocess
 import sys
 sys.path.insert(0, os.path.dirname(__file__))
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
 import run_recipe_ci  # noqa: E402
 from concutil import wait_marker  # noqa: E402
 from harness import abra  # noqa: E402
 RECIPE = "fakerecipe"
 def _git(cwd, *args):
    subprocess.run(
        ["git", "-c", "user.email=t@t", "-c", "user.name=t", *args],
        cwd=cwd,
        check=True,
        capture_output=True,
    )
 def _make_fake_home(tmp_path):
    """A fake $HOME with a canonical ~/.abra: servers/default + catalogue dirs, and a recipe git
    repo with two tags whose data.txt differs (v1 -> 'one', v2 -> 'two', HEAD at v2)."""
    home = tmp_path / "home"
    (home / ".abra" / "servers" / "default").mkdir(parents=True)
    (home / ".abra" / "catalogue").mkdir(parents=True)
    repo = home / ".abra" / "recipes" / RECIPE
    repo.mkdir(parents=True)
    _git(repo, "init", "-q")
    (repo / "data.txt").write_text("one\n")
    _git(repo, "add", "data.txt")
    _git(repo, "commit", "-qm", "v1")
    _git(repo, "tag", "v1")
    (repo / "data.txt").write_text("two\n")
    _git(repo, "add", "data.txt")
    _git(repo, "commit", "-qm", "v2")
    _git(repo, "tag", "v2")
    return home
 def test_17_per_run_dir_built_and_exported_before_abra(tmp_path, monkeypatch):
    """Case 17: setup_run_abra_dir builds the per-run dir correctly (servers/catalogue symlinks
    resolve to the canonical tree, recipes/ empty + writable) and $ABRA_DIR is exported before
    the first abra call — proven by a stub `abra` on PATH that records the env it saw."""
    home = _make_fake_home(tmp_path)
    monkeypatch.setenv("HOME", str(home))
    monkeypatch.setenv("CCCI_RUNS_DIR", str(tmp_path / "runs"))
    monkeypatch.setenv("DRONE_BUILD_NUMBER", "777")
    monkeypatch.setenv("ABRA_DIR", "sentinel-to-be-overwritten")  # so monkeypatch restores it
    d = run_recipe_ci.setup_run_abra_dir()
    assert d == str(tmp_path / "runs" / "777" / "abra")
    assert os.environ["ABRA_DIR"] == d
    assert os.readlink(os.path.join(d, "servers")) == str(home / ".abra" / "servers")
    assert os.readlink(os.path.join(d, "catalogue")) == str(home / ".abra" / "catalogue")
    # symlinks RESOLVE (targets exist) and recipes/ is empty + writable
    assert os.path.isdir(os.path.join(d, "servers", "default"))
    assert os.path.isdir(os.path.join(d, "catalogue"))
    assert os.listdir(os.path.join(d, "recipes")) == []
    probe = os.path.join(d, "recipes", ".write-probe")
    open(probe, "w").close()
    os.remove(probe)
    # idempotent re-entry (Drone build-number retry): must not raise on existing symlinks
    assert run_recipe_ci.setup_run_abra_dir() == d
    # stub abra records $ABRA_DIR at call time; fetch_recipe's catalogue branch invokes it
    stub_dir = tmp_path / "bin"
    stub_dir.mkdir()
    log = tmp_path / "abra-env.log"
    stub = stub_dir / "abra"
    stub.write_text(f'#!/bin/sh\necho "$ABRA_DIR" >> {log}\nexit 0\n')
    stub.chmod(stub.stat().st_mode | stat.S_IEXEC)
    monkeypatch.setenv("PATH", f"{stub_dir}{os.pathsep}{os.environ['PATH']}")
    monkeypatch.delenv("CCCI_SKIP_FETCH", raising=False)
    run_recipe_ci.fetch_recipe(RECIPE, None, None)
    assert log.read_text().strip() == d, "abra was called without the per-run ABRA_DIR exported"
 def test_18_concurrent_same_recipe_fetch_no_cross_talk(tmp_path, monkeypatch, pool):
    """Case 18: two CONCURRENT fetch+checkout flows of the SAME recipe into different ABRA_DIRs
    produce two correct, divergent trees (v1 vs v2) — the old shared-tree corruption scenario,
    now structurally safe with no lock. The canonical staged clone is untouched."""
    home = _make_fake_home(tmp_path)
    canonical_repo = home / ".abra" / "recipes" / RECIPE
     head_before = subprocess.run(
         ["git", "-C", canonical_repo, "rev-parse", "HEAD"],
         capture_output=True,
         text=True,
         check=True,  # a failed rev-parse must not compare as "" == "" below
     ).stdout.strip()
    runs = {}
    for name, ref in (("runA", "v1"), ("runB", "v2")):
        abra_dir = tmp_path / name / "abra"
        abra_dir.mkdir(parents=True)
        _, out = pool.spawn(
            "fetch-checkout",
            RECIPE,
            ref,
            env_extra={
                "HOME": str(home),
                "ABRA_DIR": str(abra_dir),
                "CCCI_SKIP_FETCH": "1",
            },
        )
        runs[name] = (out, ref, abra_dir)
    expect = {"v1": "one", "v2": "two"}
    for name, (out, ref, abra_dir) in runs.items():
        line = wait_marker(out, "RESULT", timeout=30)
        assert line, f"{name} never produced a RESULT"
        _, head, content = line.split()
        assert content == expect[ref], f"{name}@{ref}: tree content {content!r}"
        tree = abra_dir / "recipes" / RECIPE
        assert (tree / "data.txt").read_text().strip() == expect[ref]
        assert (
            head
            == subprocess.run(
                ["git", "-C", tree, "rev-parse", "HEAD"], capture_output=True, text=True
            ).stdout.strip()
        )
    # the two trees genuinely diverge AND the canonical staged clone is untouched
    a = (runs["runA"][2] / "recipes" / RECIPE / "data.txt").read_text()
    b = (runs["runB"][2] / "recipes" / RECIPE / "data.txt").read_text()
    assert a != b
     head_after = subprocess.run(
         ["git", "-C", canonical_repo, "rev-parse", "HEAD"],
         capture_output=True,
         text=True,
         check=True,  # a failed rev-parse must not compare as "" == "" with head_before
     ).stdout.strip()
    assert head_after == head_before, "canonical clone must not be touched by per-run fetches"
 def test_19_env_written_through_servers_symlink_lands_canonical(tmp_path, monkeypatch):
    """Case 19: an app .env written through the per-run servers/ symlink (what abra does under
    $ABRA_DIR) lands in the CANONICAL shared path — so janitor discovery and every
    expanduser('~/.abra/servers/...') reader keep working unchanged."""
    home = _make_fake_home(tmp_path)
    monkeypatch.setenv("HOME", str(home))
    monkeypatch.setenv("CCCI_RUNS_DIR", str(tmp_path / "runs"))
    monkeypatch.setenv("DRONE_BUILD_NUMBER", "778")
    monkeypatch.setenv("ABRA_DIR", "sentinel-to-be-overwritten")
    d = run_recipe_ci.setup_run_abra_dir()
    domain = "test-abc123.ci.commoninternet.net"
    via_symlink = os.path.join(d, "servers", "default", f"{domain}.env")
    with open(via_symlink, "w") as f:
        f.write("TYPE=fakerecipe:1.0.0\nDOMAIN=placeholder\n")
    canonical = home / ".abra" / "servers" / "default" / f"{domain}.env"
    assert canonical.is_file(), ".env written via the symlink must land in the canonical path"
    # the canonical-path readers/writers (abra.env_get/env_set use ~/.abra) see the same file
    assert abra.env_get(domain, "TYPE") == "fakerecipe:1.0.0"
    abra.env_set(domain, "DOMAIN", domain)
    with open(via_symlink) as f:
        assert f"DOMAIN={domain}" in f.read()
 def test_18b_run_id_manual_fallback_is_per_process(tmp_path, monkeypatch):
    """Companion to case 18: two concurrent MANUAL runs (no DRONE_BUILD_NUMBER) must not share an
    abra dir either — the manual fallback is pid-suffixed."""
    home = _make_fake_home(tmp_path)
    monkeypatch.setenv("HOME", str(home))
    monkeypatch.setenv("CCCI_RUNS_DIR", str(tmp_path / "runs"))
    monkeypatch.delenv("DRONE_BUILD_NUMBER", raising=False)
    monkeypatch.delenv("CCCI_APP_DOMAIN", raising=False)
    monkeypatch.delenv("CCCI_RUN_ID", raising=False)
    monkeypatch.setenv("ABRA_DIR", "sentinel-to-be-overwritten")
    d = run_recipe_ci.setup_run_abra_dir()
    assert f"manual-{os.getpid()}" in d
--- a/tests/concurrency/test_janitor.py
+++ b/tests/concurrency/test_janitor.py
@ -0,0 +1,189 @@
 """Janitor / flock-probe semantics (concurrency-restructure plan, cases 5-12).
 The janitor runs IN-PROCESS with its discovery monkeypatched (candidates injected via a stubbed
 abra.app_ls + empty docker sweep) and teardown_app stubbed to record calls — but the LOCKS are
 real kernel flocks, held by real helper subprocesses where a live owner is needed."""
 from __future__ import annotations
 import os
 import sys
 import threading
 import time
 sys.path.insert(0, os.path.dirname(__file__))
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
 from concutil import DOMAIN, lock_state, wait_marker  # noqa: E402
 from harness import lifecycle  # noqa: E402
 def _inject_candidates(monkeypatch, domains):
    """Point janitor discovery at exactly `domains`: abra lists them, docker sweep is empty.
    teardown_app is stubbed to a recorder; returns the calls list."""
    calls = []
    monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": d} for d in domains])
    monkeypatch.setattr(lifecycle, "_docker_names", lambda kind, stack: [])
    monkeypatch.setattr(lifecycle, "teardown_app", lambda d, verify=True: calls.append(d))
    return calls
 def test_5_orphan_reaped_lockfile_unlinked(lock_dir, pool, monkeypatch):
    """Case 5: an orphan (lockfile exists, no holder — its run was SIGKILL'd) is reaped exactly
    once and its lockfile unlinked."""
    p, out = pool.spawn("hold", DOMAIN)
    assert wait_marker(out, "ACQUIRED")
    p.kill()
    p.wait(timeout=10)
    calls = _inject_candidates(monkeypatch, [DOMAIN])
    lifecycle.janitor()
    assert calls == [DOMAIN], f"teardown calls: {calls} (expected exactly one)"
    assert lock_state(DOMAIN) == "absent", "reaped orphan's lockfile must be unlinked"
 def test_6_live_run_never_reaped(lock_dir, pool, monkeypatch, capsys):
    """Case 6: a held lock (live helper) is never reaped and is logged as live."""
    p, out = pool.spawn("hold", DOMAIN)
    assert wait_marker(out, "ACQUIRED")
    calls = _inject_candidates(monkeypatch, [DOMAIN])
    lifecycle.janitor()
    assert calls == []
    assert "live concurrent run" in capsys.readouterr().out
    assert lock_state(DOMAIN) == "held"
 def test_7_new_run_blocks_until_reap_finishes(lock_dir, pool, monkeypatch):
    """Case 7: the janitor reaps WHILE HOLDING the probe lock, so a new run of the same domain
    blocks in acquire_app_lock until the reap completes — no window where a fresh app coexists
    with a half-reaped one."""
    # Make an orphan.
    p, out = pool.spawn("hold", DOMAIN)
    assert wait_marker(out, "ACQUIRED")
    p.kill()
    p.wait(timeout=10)
    state = {"teardown_end": None, "acquirer_out": None}
    def slow_teardown(domain, verify=True):
        # While the janitor holds the probe lock mid-reap, a new run starts acquiring.
        _, aout = pool.spawn("hold", DOMAIN)
        state["acquirer_out"] = aout
        time.sleep(2.0)
        state["teardown_end"] = time.time()
    monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": DOMAIN}])
    monkeypatch.setattr(lifecycle, "_docker_names", lambda kind, stack: [])
    monkeypatch.setattr(lifecycle, "teardown_app", slow_teardown)
    lifecycle.janitor()
    line = wait_marker(state["acquirer_out"], "ACQUIRED", timeout=15)
    assert line, "new run never acquired after the reap"
    acquired_ts = float(line.split()[1])
    assert (
        acquired_ts >= state["teardown_end"]
    ), f"new run acquired at {acquired_ts} BEFORE the reap finished at {state['teardown_end']}"
    # The new run must hold a lock the next probe can SEE (fresh inode at the path).
    assert lock_state(DOMAIN) == "held"
 def test_8_two_janitors_exactly_one_reaps(lock_dir, pool, monkeypatch):
    """Case 8: two concurrent janitors arbitrate on the probe flock — exactly one reaps (the
    other sees 'held' and leaves). Teardown is slowed so the runs genuinely overlap."""
    p, out = pool.spawn("hold", DOMAIN)
    assert wait_marker(out, "ACQUIRED")
    p.kill()
    p.wait(timeout=10)
    calls = []
    calls_lock = threading.Lock()
    def slow_teardown(domain, verify=True):
        with calls_lock:
            calls.append(domain)
        time.sleep(2.0)
    monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": DOMAIN}])
    monkeypatch.setattr(lifecycle, "_docker_names", lambda kind, stack: [])
    monkeypatch.setattr(lifecycle, "teardown_app", slow_teardown)
    barrier = threading.Barrier(2)
    def run_janitor():
        barrier.wait()
        lifecycle.janitor()
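     t1, t2 = threading.Thread(target=run_janitor), threading.Thread(target=run_janitor)
     t1.start()
     t2.start()
     t1.join(timeout=30)
     t2.join(timeout=30)
     # join(timeout=...) returns silently on timeout, so check that both janitors really finished
     assert not (t1.is_alive() or t2.is_alive()), "a janitor thread did not finish within 30s"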
    assert calls == [DOMAIN], f"expected exactly one reap, got {calls}"
    assert lock_state(DOMAIN) == "absent"
 def test_9_reboot_lockfile_absent_reaped_immediately(lock_dir, monkeypatch):
    """Case 9: post-reboot simulation — the app exists but its lockfile is gone (/run/lock is
    tmpfs). The probe trivially acquires -> immediate reap, NO age threshold (improvement over
    the old 2h fallback)."""
    assert lock_state(DOMAIN) == "absent"
    calls = _inject_candidates(monkeypatch, [DOMAIN])
    t0 = time.time()
    lifecycle.janitor()
    assert calls == [DOMAIN]
    assert time.time() - t0 < 5, "reap must be immediate (no age wait)"
 def test_10_long_held_lock_flagged_never_stolen(lock_dir, pool, monkeypatch, capsys):
    """Case 10: a lock held with mtime older than 120min is flagged as a possible leaked run —
    and NOT reaped (never steal a held lock)."""
    p, out = pool.spawn("hold", DOMAIN)
    assert wait_marker(out, "ACQUIRED")
    path = lifecycle._app_lock_path(DOMAIN)  # noqa: SLF001
    backdate = time.time() - (130 * 60)
    os.utime(path, (backdate, backdate))
    calls = _inject_candidates(monkeypatch, [DOMAIN])
    lifecycle.janitor()
    assert calls == []
    out_text = capsys.readouterr().out
    assert "possible leaked run" in out_text and "lslocks" in out_text
    assert lock_state(DOMAIN) == "held"
 def test_11_warm_canonical_names_never_probed(lock_dir, monkeypatch):
    """Case 11: RUN_APP_RE allowlist — warm/canonical-shaped names never become candidates, so
    they are never probed (no lockfile is even created for them) and never reaped."""
    warmish = [
        "warm-keycloak.ci.commoninternet.net",
        "keycloak.ci.commoninternet.net",
        "warm-hedgedoc.ci.commoninternet.net",
        "drone.ci.commoninternet.net",
    ]
    calls = []
    monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": d} for d in warmish])
    monkeypatch.setattr(
        lifecycle,
        "_docker_names",
        lambda kind, stack: ["warm-keycloak_ci_commoninternet_net_app"]
        if kind == "service"
        else [],
    )
    monkeypatch.setattr(lifecycle, "teardown_app", lambda d, verify=True: calls.append(d))
    lifecycle.janitor()
    assert calls == []
    lockdir = os.environ["CCCI_APP_LOCK_DIR"]
    assert [
        f for f in os.listdir(lockdir) if f.startswith("cc-ci-app-")
    ] == [], "janitor must not create lockfiles for non-run-app names"
 def test_12_degrades_safely_on_bad_lockfile_and_missing_dir(lock_dir, monkeypatch, capsys):
    """Case 12: a garbled/unopenable lockfile (here: a DIRECTORY at the lockfile path) is skipped
    with a log line; a missing lock dir doesn't crash the janitor either. Never a crash."""
    path = lifecycle._app_lock_path(DOMAIN)  # noqa: SLF001
    os.makedirs(path)  # open(path, "a") -> IsADirectoryError (an OSError)
    calls = _inject_candidates(monkeypatch, [DOMAIN])
    lifecycle.janitor()  # must not raise
    assert calls == []
    assert "skipping" in capsys.readouterr().out
    os.rmdir(path)
    monkeypatch.setenv("CCCI_APP_LOCK_DIR", os.path.join(os.environ["CCCI_APP_LOCK_DIR"], "gone"))
    lifecycle.janitor()  # missing dir: probe open fails -> skip; tidy glob -> empty. No crash.
    assert calls == []
--- a/tests/concurrency/test_lifetime.py
+++ b/tests/concurrency/test_lifetime.py
@ -0,0 +1,82 @@
 """Lifetime hardening (concurrency-restructure plan, cases 13-16): the REAL prctl/signal/alarm
 guards installed by helper subprocesses; tests assert teardown ran, exit was non-zero, and the
 lock was released."""
 from __future__ import annotations
 import os
 import signal
 import sys
 sys.path.insert(0, os.path.dirname(__file__))
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
 from concutil import (  # noqa: E402
    DOMAIN,
    wait_lock_state,
    wait_marker,
    wait_pid_gone,
 )
 def test_13_pdeathsig_parent_kill_terms_harness(lock_dir, pool):
    """Case 13: wrapper-parent spawns a guarded harness-child; the parent is SIGKILL'd (the
    harness gets no courtesy signal) -> the kernel's PDEATHSIG TERMs the child, its teardown
    funnel runs, it exits, and the lock is released."""
    p, out = pool.spawn("wrapper", DOMAIN)
    line = wait_marker(out, "WRAPPED")
    assert line, "wrapper never spawned its child"
    child_pid = int(line.split()[1])
    pool.track_pid(child_pid)
    assert wait_marker(out, "READY"), "guarded child never got ready"
    p.kill()  # parent dies WITHOUT signalling the child — only PDEATHSIG can save us
    p.wait(timeout=10)
    assert wait_pid_gone(child_pid), "guarded child must exit on parent death (PDEATHSIG)"
    assert wait_marker(out, "TEARDOWN", timeout=5), "teardown funnel did not run"
    assert wait_lock_state(DOMAIN, "free") == "free"
 def test_14_already_orphaned_helper_refuses_to_run(lock_dir, pool):
    """Case 14 (ppid race): a helper whose parent died BEFORE the prctl was armed (it starts
    already reparented to pid 1) must refuse to run — PDEATHSIG would never fire for it."""
    # Spawn an intermediate parent that forks orphan-probe and exits immediately.
    import subprocess
    out = os.path.join(pool.out_dir, "orphan.out")
    intermediate = (
        "import subprocess, sys, os; "
        "subprocess.Popen([sys.executable, os.environ['CCCI_HELPERS'], 'orphan-probe']); "
    )
    env = dict(
        os.environ,
        CCCI_HELPER_OUT=out,
        CCCI_HELPERS=os.path.join(os.path.dirname(__file__), "helpers.py"),
    )
    subprocess.run([sys.executable, "-c", intermediate], env=env, timeout=15, check=True)
    line = wait_marker(out, "REFUSED", timeout=20)
    assert line, "orphaned helper did not refuse to run (or never reparented to pid 1)"
 def test_15_deadline_alarm_fires_teardown_and_releases(lock_dir, pool):
    """Case 15: the self-deadline (alarm). A guarded helper with a 2s deadline tears down via
    the funnel (finally: ran), exits NON-zero, and its lock is released."""
    p, out = pool.spawn("guarded", DOMAIN, "2")
    assert wait_marker(out, "READY")
    rc = p.wait(timeout=20)
    assert rc != 0, f"deadline exit must be non-zero (got {rc})"
    assert rc == 128 + signal.SIGALRM, f"expected 142 (128+SIGALRM), got {rc}"
    assert wait_marker(out, "TEARDOWN", timeout=5), "teardown funnel did not run on deadline"
    assert wait_lock_state(DOMAIN, "free") == "free"
 def test_16_sigterm_runs_teardown_funnel_and_releases(lock_dir, pool):
    """Case 16: SIGTERM (drone cancel path) -> the finally: teardown funnel runs, exit is
    non-zero, lock released."""
    p, out = pool.spawn("guarded", DOMAIN, "3600")
    assert wait_marker(out, "READY")
    p.send_signal(signal.SIGTERM)
    rc = p.wait(timeout=20)
    assert rc != 0, f"SIGTERM exit must be non-zero (got {rc})"
    assert rc == 128 + signal.SIGTERM, f"expected 143 (128+SIGTERM), got {rc}"
    assert wait_marker(out, "TEARDOWN", timeout=5), "teardown funnel did not run on SIGTERM"
    assert wait_lock_state(DOMAIN, "free") == "free"
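Reviewer note on cases 15/16: the exit codes asserted above (142 and 143) are consistent with a signal handler that converts the signal into `SystemExit(128 + signum)`, which unwinds the stack so `finally:` blocks run. The following is a minimal sketch of that pattern only, not the actual `lifetime.install_lifetime_guards` implementation (which is not shown in this diff); `CHILD`, `_exit_via_funnel` and `run_child` are hypothetical names, and the 1-second deadline is chosen just to keep the demo short.

```python
from __future__ import annotations

import signal
import subprocess
import sys

# Source of a throwaway child process (hypothetical, for illustration only).
CHILD = r"""
import signal, sys, time

def _exit_via_funnel(signum, frame):
    # sys.exit raises SystemExit, which unwinds the stack so finally: blocks run.
    sys.exit(128 + signum)

signal.signal(signal.SIGTERM, _exit_via_funnel)
signal.signal(signal.SIGALRM, _exit_via_funnel)
signal.alarm(1)  # self-deadline: the kernel delivers SIGALRM after 1 second
try:
    time.sleep(30)
finally:
    print("TEARDOWN", flush=True)
"""


def run_child() -> tuple[int, str]:
    """Run the child to completion; return (exit code, stripped stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
        timeout=20,
    )
    return proc.returncode, proc.stdout.strip()


if __name__ == "__main__":
    print(run_child())
```

Because the child exits through `sys.exit` rather than dying from the signal, the parent sees a positive exit code (128 + signal number) instead of a negative `returncode`, which is what the `rc == 128 + signal.SIGALRM` assertions rely on.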
--- a/tests/concurrency/test_locks.py
+++ b/tests/concurrency/test_locks.py
@ -0,0 +1,85 @@
 """Lock fundamentals (concurrency-restructure plan, cases 1-4). Real kernel flocks held by real
 subprocesses — nothing mocked."""
 from __future__ import annotations
 import fcntl
 import os
 import sys
 import time
 sys.path.insert(0, os.path.dirname(__file__))
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
 from concutil import (  # noqa: E402
    DOMAIN,
    lock_state,
    wait_lock_state,
    wait_marker,
 )
 from harness import lifecycle  # noqa: E402
 def test_1_sigkill_releases_lock(lock_dir, pool):
    """Case 1: acquire -> holder SIGKILL'd -> lock immediately acquirable (kernel auto-release).
    The exact property the old pidfile registry approximated with /proc checks."""
    p, out = pool.spawn("hold", DOMAIN)
    assert wait_marker(out, "ACQUIRED"), "holder never acquired"
    assert lock_state(DOMAIN) == "held"
    p.kill()
    p.wait(timeout=10)
    assert wait_lock_state(DOMAIN, "free") == "free"
 def test_2_nb_probe_held_vs_unheld(lock_dir, pool):
    """Case 2: LOCK_NB probe raises BlockingIOError against a held lock; succeeds when unheld."""
    p, out = pool.spawn("hold", DOMAIN)
    assert wait_marker(out, "ACQUIRED")
    path = lifecycle._app_lock_path(DOMAIN)  # noqa: SLF001
    with open(path, "a") as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            raise AssertionError("LOCK_NB succeeded against a held lock")
        except BlockingIOError:
            pass
    p.kill()
    p.wait(timeout=10)
    assert wait_lock_state(DOMAIN, "free") == "free"
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)  # must not raise now
 def test_3_lock_fd_not_inherited_by_children(lock_dir, pool):
    """Case 3 (PEP 446): the holder spawns a subprocess child, the holder dies, the child lives —
    and the lock is STILL released (the child never inherited the lock fd). This is what makes
    'held lock == live HARNESS owner' sound even though runs spawn abra/docker/pytest children."""
    p, out = pool.spawn("hold-with-child", DOMAIN)
    assert wait_marker(out, "ACQUIRED")
    child_line = wait_marker(out, "CHILD")
    assert child_line, "holder never reported its child pid"
    child_pid = int(child_line.split()[1])
    pool.track_pid(child_pid)
    p.kill()
    p.wait(timeout=10)
    assert os.path.exists(f"/proc/{child_pid}"), "child should outlive the holder"
    assert (
        wait_lock_state(DOMAIN, "free") == "free"
    ), "lock must release on holder death even with a live child (PEP 446 non-inheritable fd)"
 def test_4_second_acquire_blocks_until_first_exits(lock_dir, pool):
    """Case 4: a second same-domain acquire blocks until the first holder exits — the
    double-!testme serialisation property."""
    p1, out1 = pool.spawn("hold", DOMAIN)
    assert wait_marker(out1, "ACQUIRED")
    p2, out2 = pool.spawn("hold", DOMAIN)
    # p2 must NOT acquire while p1 holds.
    time.sleep(1.5)
    assert wait_marker(out2, "ACQUIRED", timeout=0.1) is None, "second acquire did not block"
    t_kill = time.time()
    p1.kill()
    p1.wait(timeout=10)
    line = wait_marker(out2, "ACQUIRED", timeout=15)
    assert line, "second acquire never completed after first holder exited"
    acquired_ts = float(line.split()[1])
    assert acquired_ts >= t_kill - 0.05, "second holder acquired before the first exited"
    assert lock_state(DOMAIN) == "held"
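Reviewer note on cases 1-4: the probe semantics these tests rely on can be reproduced in a single process, because `flock` locks belong to the open file description, so a second `open()` of the same path contends with the first. A minimal sketch, assuming a Linux or BSD-style `flock`; `probe` and `demo` are hypothetical names that mirror `concutil.lock_state` but are not part of the suite.

```python
from __future__ import annotations

import fcntl
import os
import tempfile


def probe(path: str) -> str:
    """Return 'absent', 'held' or 'free' using a non-blocking exclusive flock."""
    if not os.path.exists(path):
        return "absent"
    with open(path, "a") as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return "held"
        return "free"  # leaving the with-block closes f and drops the probe's own lock


def demo() -> list[str]:
    """Observe the lockfile before creation, while held, and after the holder closes."""
    states = []
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "demo.lock")
        states.append(probe(path))
        holder = open(path, "a")  # creates the lockfile
        fcntl.flock(holder, fcntl.LOCK_EX)
        states.append(probe(path))
        holder.close()  # closing the last fd for the description releases the lock
        states.append(probe(path))
    return states


if __name__ == "__main__":
    print(demo())
```

The real suite holds the lock in a separate process instead, since that is the only way to exercise release on SIGKILL and the non-inheritance of the lock fd.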
--- a/tests/concurrency/test_run_state.py
+++ b/tests/concurrency/test_run_state.py
@ -0,0 +1,79 @@
 """Run-scoped state files — M2(c) live-verify regression (not one of the 19 plan cases).
 The four CCCI state files (deploys countfile, opstate, deps, depskip) must be keyed by
 run id + harness pid, NEVER by app domain: a second run of the SAME domain executes its
 main() preamble (state-file init, deploy_app's _record_deploy) BEFORE it blocks at the
 app lock, so domain-keyed files in the shared tempdir get reset/removed underneath the
 live first run. Observed live (builds 279/281): false DG4.1 deploy-count=2 in run 1,
 countfile FileNotFoundError crash in run 2. Children never re-derive these paths — they
 receive them via the CCCI_*_FILE env vars, so per-process uniqueness is sufficient.
 """
 from __future__ import annotations
 import os
 import sys
 import tempfile
 sys.path.insert(0, os.path.dirname(__file__))
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
 import run_recipe_ci  # noqa: E402
 from concutil import wait_marker  # noqa: E402
 DOMAIN = "fake-abc123.ci.commoninternet.net"
 def test_20_state_paths_keyed_by_run_and_pid_never_by_domain(monkeypatch):
    domain = "immi-ad3e33.ci.commoninternet.net"
    monkeypatch.setenv("CCCI_APP_DOMAIN", domain)
    monkeypatch.setenv("DRONE_BUILD_NUMBER", "279")
    p279 = run_recipe_ci._run_state_path("deploys")
    monkeypatch.setenv("DRONE_BUILD_NUMBER", "281")
    p281 = run_recipe_ci._run_state_path("deploys")
    # the double-!testme invariant: two runs (same domain) share NO state file
    assert p279 != p281
    # keyed by run id + pid, under the tempdir
    base = os.path.basename(p279)
    assert base == f"ccci-deploys-279-{os.getpid()}"
    assert os.path.dirname(p279) == tempfile.gettempdir()
    # the app domain must not appear in the path at all
    assert domain not in p279 and domain not in p281
 def test_20c_same_domain_runs_each_keep_their_own_count(tmp_path, lock_dir, pool):
    """The live CONC-A1 interleaving, with REAL processes + the REAL lock and counter code:
    run A holds the app lock; run B (same domain) fires its pre-lock _record_deploy and
    blocks; A then reads its counter — must still be 1 (not polluted by B) — and removes
    its own file; B acquires and must find ITS file intact (no FileNotFoundError)."""
    gate = tmp_path / "gate"
    env_a = {"TMPDIR": str(tmp_path), "DRONE_BUILD_NUMBER": "9001"}
    env_b = {"TMPDIR": str(tmp_path), "DRONE_BUILD_NUMBER": "9002"}
    pa, out_a = pool.spawn("deploy-count-run", DOMAIN, str(gate), env_extra=env_a)
    assert wait_marker(out_a, "ACQUIRED")
    pb, out_b = pool.spawn("deploy-count-run", DOMAIN, "", env_extra=env_b)
    # B's main()-preamble + pre-lock increment have fired; B is now blocked on the app lock
    assert wait_marker(out_b, "PRELOCK")
    assert wait_marker(out_b, "ACQUIRED", timeout=1.0) is None  # still serialised behind A
    gate.touch()  # let A read its counter only AFTER B's pre-lock work landed
    line_a = wait_marker(out_a, "COUNT")
    assert line_a is not None and line_a.strip() == "COUNT 1", line_a  # not 2: B didn't pollute A
    pa.wait(timeout=15)
    line_b = wait_marker(out_b, "COUNT")
    assert (
        line_b is not None and line_b.strip() == "COUNT 1"
    ), line_b  # B's file survived A's remove
    pb.wait(timeout=15)
 def test_20b_manual_runs_distinct_via_pid(monkeypatch):
    # no DRONE_BUILD_NUMBER and no domain/run-id env → run_id() falls back to "manual";
    # the pid suffix still separates two concurrent hand-runs of the same domain.
    for var in ("DRONE_BUILD_NUMBER", "CCCI_APP_DOMAIN", "CCCI_RUN_ID"):
        monkeypatch.delenv(var, raising=False)
    p = run_recipe_ci._run_state_path("opstate")
    assert os.path.basename(p) == f"ccci-opstate-manual-{os.getpid()}"
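Note on the path shape the tests above pin down: a minimal sketch of a run-keyed state path, inferred only from the assertions in test_20 and test_20b (basename `ccci-<kind>-<run id>-<pid>` under the tempdir, with "manual" as the fallback run id). The names `sketch_run_id` and `sketch_run_state_path` are hypothetical stand-ins, not the harness's actual `run_id()` / `_run_state_path()`, and the real run-id lookup may consult more variables than DRONE_BUILD_NUMBER.

```python
import os
import tempfile


def sketch_run_id() -> str:
    # Assumption: only DRONE_BUILD_NUMBER is consulted here; the real harness
    # may also look at other env vars before falling back to "manual".
    return os.environ.get("DRONE_BUILD_NUMBER") or "manual"


def sketch_run_state_path(kind: str) -> str:
    # Keyed by run id + pid, never by app domain, so two runs of the same
    # domain (or two hand-runs with no build number) cannot share a file.
    name = f"ccci-{kind}-{sketch_run_id()}-{os.getpid()}"
    return os.path.join(tempfile.gettempdir(), name)
```

Because the path is derived once per process, child processes would still need the exact path handed to them (the CCCI_*_FILE env vars in the commit message) rather than recomputing it with their own pid.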
--- a/tests/conftest.py
+++ b/tests/conftest.py
@ -13,7 +13,8 @@ import sys
 import pytest
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "runner"))
-from harness import deps as deps_mod, lifecycle, naming  # noqa: E402
+from harness import deps as deps_mod  # noqa: E402
+from harness import lifecycle, naming
 def _short(s: str, n: int = 8) -> str:
--- a/tests/cryptpad/playwright/test_pad_content_roundtrip.py
+++ b/tests/cryptpad/playwright/test_pad_content_roundtrip.py
@ -26,6 +26,7 @@ Transient `net::ERR_NETWORK_CHANGED` is handled by the shared `goto_with_retry`
 from __future__ import annotations
+import contextlib
 import os
 import sys
 import uuid
@ -39,7 +40,11 @@ def _open_pad(ctx, url):
    bar once CryptPad has created/loaded the fragment-keyed pad (`#/2/pad/edit/<key>/`)."""
    page = ctx.new_page()
    harness_browser.goto_with_retry(
-        page, url, accept_statuses=(200,), goto_timeout_ms=60_000, wait_until="load",
+        page,
+        url,
+        accept_statuses=(200,),
+        goto_timeout_ms=60_000,
+        wait_until="load",
        deadline_seconds=150,
    )
    pad_url = url
@ -53,13 +58,15 @@ def _open_pad(ctx, url):
            pad_url = page.url
            break
        if i == 40:
-            try:
+            with contextlib.suppress(Exception):  # best-effort unstick
                harness_browser.goto_with_retry(
-                    page, url, accept_statuses=(200,), goto_timeout_ms=60_000,
-                    wait_until="load", deadline_seconds=120,
+                    page,
+                    url,
+                    accept_statuses=(200,),
+                    goto_timeout_ms=60_000,
+                    wait_until="load",
+                    deadline_seconds=120,
                )
-            except Exception:  # noqa: BLE001 — best-effort unstick
-                pass
    return page, pad_url
@ -74,18 +81,22 @@ def _ckeditor_frame(page, deadline_polls=90, reload_at=22, reload_url=None):
            if "ckeditor-inner" in f.url:
                return f
        if i == reload_at and reload_url is not None:
-            try:
+            with contextlib.suppress(Exception):  # reload is a best-effort unstick
                harness_browser.goto_with_retry(
-                    page, reload_url, accept_statuses=(200,), goto_timeout_ms=60_000,
-                    wait_until="load", deadline_seconds=120,
+                    page,
+                    reload_url,
+                    accept_statuses=(200,),
+                    goto_timeout_ms=60_000,
+                    wait_until="load",
+                    deadline_seconds=120,
                )
-            except Exception:  # noqa: BLE001 — reload is a best-effort unstick
-                pass
        page.wait_for_timeout(2000)
    return None
-def _poll_any_frame_for_text(page, needle, deadline_polls=120, reload_at=(20, 45, 75, 100), reload_url=None):
+def _poll_any_frame_for_text(
+    page, needle, deadline_polls=120, reload_at=(20, 45, 75, 100), reload_url=None
+):
    """Robust read-back (F2-13): poll EVERY frame's body text for `needle`, returning True as soon as
    it appears. The fresh cold-cache read-back context's deeply-nested CKEditor frame is slow/flaky to
    *attach* by URL (the prior `_ckeditor_frame` wait timed out on the Adversary's cold run), but the
@ -101,13 +112,15 @@ def _poll_any_frame_for_text(page, needle, deadline_polls=120, reload_at=(20, 45
            except Exception:  # noqa: BLE001 — frame not ready / detached; keep polling
                pass
        if reload_url and i in reload_at:
-            try:
+            with contextlib.suppress(Exception):  # best-effort unstick
                harness_browser.goto_with_retry(
-                    page, reload_url, accept_statuses=(200,), goto_timeout_ms=60_000,
-                    wait_until="load", deadline_seconds=120,
+                    page,
+                    reload_url,
+                    accept_statuses=(200,),
+                    goto_timeout_ms=60_000,
+                    wait_until="load",
+                    deadline_seconds=120,
                )
-            except Exception:  # noqa: BLE001 — best-effort unstick
-                pass
        page.wait_for_timeout(2000)
    return False
@ -137,9 +150,9 @@ def test_cryptpad_pad_content_survives_fresh_session(live_app):
            # --- session 1: create the pad + write the marker ---
            ctx1 = browser.new_context(ignore_https_errors=True)
            page, pad_url = _open_pad(ctx1, f"https://{live_app}/pad/")
-            assert "#/2/pad/edit/" in pad_url, (
+            assert (
-                f"CryptPad did not create a fragment-keyed pad URL; got {pad_url!r}"
+                "#/2/pad/edit/" in pad_url
-            )
+            ), f"CryptPad did not create a fragment-keyed pad URL; got {pad_url!r}"
            ck = _ckeditor_frame(page, reload_url=pad_url)
            assert ck is not None, "CKEditor content frame never attached (pad editor not ready)"
            _dismiss_store_modal(page)
@ -148,9 +161,9 @@ def test_cryptpad_pad_content_survives_fresh_session(live_app):
            page.wait_for_timeout(1000)
            body.type(marker, delay=40)
            page.wait_for_timeout(12000)  # let CryptPad encrypt + sync the update to the server
-            assert marker in ck.locator("body").inner_text(), (
+            assert (
-                "marker not present in the editor after typing — type did not land"
+                marker in ck.locator("body").inner_text()
-            )
+            ), "marker not present in the editor after typing — type did not land"
            ctx1.close()
            # --- session 2: FRESH context (no shared storage/localStorage) reads the pad back by URL.
--- a/tests/cryptpad/playwright/test_pad_create.py
+++ b/tests/cryptpad/playwright/test_pad_create.py
@ -51,9 +51,9 @@ def test_cryptpad_spa_renders_with_no_console_errors(live_app):
            title = (page.title() or "").lower()
            body = page.content()
            blower = body.lower()
-            assert "cryptpad" in title or "cryptpad" in blower, (
+            assert (
-                f"CryptPad SPA does not carry brand. title={title!r}, body excerpt: {body[:200]!r}"
+                "cryptpad" in title or "cryptpad" in blower
-            )
+            ), f"CryptPad SPA does not carry brand. title={title!r}, body excerpt: {body[:200]!r}"
            # Canonical CryptPad asset references in the rendered DOM
            canonical = ("/customize/", "/components/", "main.js", "/api/broadcast")
--- a/tests/cryptpad/test_install.py
+++ b/tests/cryptpad/test_install.py
@ -8,7 +8,8 @@ import os
 import sys
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
-from harness import browser as harness_browser, generic, lifecycle  # noqa: E402
+from harness import browser as harness_browser  # noqa: E402
+from harness import generic, lifecycle
 def test_serving_and_content(live_app, meta):
--- a/tests/custom-html-bkp-bad/test_backup.py
+++ b/tests/custom-html-bkp-bad/test_backup.py
@ -20,7 +20,9 @@ def test_backup_captures_state(live_app):
    Since custom-html-bkp-bad has no ops.py::pre_backup to seed the marker, this file does NOT
    exist at backup time — exec_in_app returns empty or raises → assertion fails → backup tier RED.
    This models a recipe that declares backup capability but omits the data-seeding hook."""
-    result = lifecycle.exec_in_app(live_app, ["sh", "-c", f"cat {MARKER_PATH} 2>/dev/null || echo MISSING"]).strip()
+    result = lifecycle.exec_in_app(
+        live_app, ["sh", "-c", f"cat {MARKER_PATH} 2>/dev/null || echo MISSING"]
+    ).strip()
    assert result == "original", (
        f"backup did not capture the expected marker at {MARKER_PATH}: got {result!r}. "
        "Expected 'original' (seeded by pre_backup). If the marker is 'MISSING', the pre_backup "
--- a/tests/custom-html-tiny/functional/test_serves_content.py
+++ b/tests/custom-html-tiny/functional/test_serves_content.py
@ -0,0 +1,87 @@
 """custom-html-tiny — recipe-specific functional test (static-web-server).
 Proves the deployed static-web-server is *actually serving files from its `content` volume* with real
 file-server semantics, not merely returning 200 from a Traefik fallback or a generic stub:
  1. exact-byte round-trip — write a uniquely-named file with random content into the served volume,
     fetch it over HTTPS, and assert the bytes come back verbatim. Non-vacuous: the content is random
     per run, so only a server that reads this file off the volume can pass.
  2. real 404 — a random non-existent path returns 404, proving directory/file semantics (a
     200-everything stub or mis-routed host would not 404).
 The recipe's image (joseluisq/static-web-server) is shell-less (scratch-based) and its content volume
 is seeded via the install_steps.sh host-mountpoint mechanism — so this test writes its probe file the
 same way (resolve the swarm volume's mountpoint with `docker volume inspect`, write directly) rather
 than `docker exec`-ing in a container that has no shell.
 Runs in the custom tier against the shared post-install deployment (the `live_app` fixture is its
 per-run domain). Mirrors install_steps.sh: the app's content volume is named `<stack>_content`, where
 `stack` is the domain with dots replaced by underscores; HTTP_SUBDIR is empty, so the volume root is
 served at `/`.
 """
 from __future__ import annotations
 import contextlib
 import os
 import ssl
 import subprocess
 import urllib.error
 import urllib.request
 import uuid
 def _served_dir(domain: str) -> str:
    """Host mountpoint of the app's served `content` volume (same naming as install_steps.sh)."""
    vol = f"{domain.replace('.', '_')}_content"
    out = subprocess.run(
        ["docker", "volume", "inspect", vol, "--format", "{{.Mountpoint}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    mountpoint = out.stdout.strip()
    assert mountpoint, f"could not resolve mountpoint for volume {vol!r}"
    return mountpoint
 def _get(url: str) -> tuple[int, bytes]:
    """GET the URL; return (status, body). A 4xx/5xx is returned, not raised (we assert on the code).
    TLS verification is relaxed: the served wildcard cert is validated separately by the infra check;
    here we care only about the app's response."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=20, context=ctx) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as e:
        return e.code, e.read()
 def test_static_file_roundtrip_and_404(live_app):
    """Write a random file into the served volume → fetch it → bytes match; and a missing path 404s."""
    served = _served_dir(live_app)
    token = uuid.uuid4().hex
    name = f"ccci-probe-{token}.txt"
    body = f"cc-ci-functional-{token}\n".encode()
    path = os.path.join(served, name)
    with open(path, "wb") as fh:
        fh.write(body)
    try:
        status, got = _get(f"https://{live_app}/{name}")
        assert status == 200, f"served probe file returned {status} (expected 200)"
        assert got == body, (
            f"content round-trip mismatch: served {got!r}, wrote {body!r} "
            "(static-web-server not serving the content volume?)"
        )
        # A random non-existent path must 404 — proves real static-file semantics, distinguishing a
        # working server from a 200-everything stub or a mis-routed Traefik fallback.
        miss_status, _ = _get(f"https://{live_app}/ccci-missing-{uuid.uuid4().hex}.txt")
        assert (
            miss_status == 404
        ), f"missing path returned {miss_status} (expected 404 — generic 200-returner / mis-route?)"
    finally:
        with contextlib.suppress(OSError):
            os.remove(path)
--- a/tests/custom-html-tiny/recipe_meta.py
+++ b/tests/custom-html-tiny/recipe_meta.py
@ -3,3 +3,14 @@
 # (DG5) is detected quickly instead of waiting the default 300s HTTP timeout.
 DEPLOY_TIMEOUT = 120
 HTTP_TIMEOUT = 90
+# Rungs this recipe INTENTIONALLY skips, each with a reason. Any essential rung skipped (N/A) and NOT
+# listed here is reported as an *unintentional* skip (a coverage gap to fill or declare). A skip still
+# caps the level either way — the harness never claims a rung it did not verify; this only records
+# that the skip is deliberate. (The level ladder is the four essential rungs install/upgrade/
+# backup_restore/functional; integration + recipe-local are optional and not leveled.)
+# custom-html-tiny is a stateless static-web-server, so it has no backup surface:
+EXPECTED_NA = {
+    "backup_restore": "stateless static file server: serves an ephemeral content volume seeded at "
+    "deploy, with no persistent/user data to back up or restore (no backupbot.backup label)",
+}
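Note on the EXPECTED_NA comment above: a minimal sketch of how skipped essential rungs could be split into intentional and unintentional, based only on the rules the comment states. `classify_skips` and `ESSENTIAL_RUNGS` are hypothetical names for illustration; the harness's real reporting code is not shown in this diff.

```python
from typing import Dict, Iterable, List, Tuple

# The four essential rungs named in the recipe_meta.py comment.
ESSENTIAL_RUNGS = ("install", "upgrade", "backup_restore", "functional")


def classify_skips(
    skipped: Iterable[str], expected_na: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Return (intentional skips with their reasons, unintentional skips)."""
    skipped_set = set(skipped)
    intentional: Dict[str, str] = {}
    unintentional: List[str] = []
    for rung in ESSENTIAL_RUNGS:
        if rung not in skipped_set:
            continue  # rung was actually verified
        if rung in expected_na:
            intentional[rung] = expected_na[rung]  # declared, with a reason
        else:
            unintentional.append(rung)  # coverage gap to fill or declare
    return intentional, unintentional
```

Either kind of skip would still cap the level; the classification only changes how the skip is reported.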
--- a/tests/custom-html/functional/test_content_roundtrip.py
+++ b/tests/custom-html/functional/test_content_roundtrip.py
@ -15,7 +15,8 @@ import sys
 import uuid
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", "runner"))
-from harness import http as harness_http, lifecycle  # noqa: E402
+from harness import http as harness_http  # noqa: E402
+from harness import lifecycle
 def test_content_roundtrip(live_app):
--- a/tests/custom-html/functional/test_content_type_header.py
+++ b/tests/custom-html/functional/test_content_type_header.py
@ -53,9 +53,9 @@ def test_content_type_html_and_txt(live_app):
    ct_txt = h_txt.get("content-type", "")
    # nginx default: "text/html" for .html and "text/plain" for .txt (may include "; charset=utf-8")
-    assert ct_html.startswith("text/html"), (
+    assert ct_html.startswith(
-        f"{html_name} Content-Type={ct_html!r}, expected text/html (nginx MIME config broken?)"
+        "text/html"
-    )
+    ), f"{html_name} Content-Type={ct_html!r}, expected text/html (nginx MIME config broken?)"
-    assert ct_txt.startswith("text/plain"), (
+    assert ct_txt.startswith(
-        f"{txt_name} Content-Type={ct_txt!r}, expected text/plain (nginx MIME config broken?)"
+        "text/plain"
-    )
+    ), f"{txt_name} Content-Type={ct_txt!r}, expected text/plain (nginx MIME config broken?)"
--- a/tests/custom-html/test_install.py
+++ b/tests/custom-html/test_install.py
@ -9,7 +9,8 @@ import os
 import sys
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
-from harness import browser as harness_browser, generic  # noqa: E402
+from harness import browser as harness_browser  # noqa: E402
+from harness import generic
 def test_serving_and_content(live_app, meta):
--- a/tests/discourse/functional/_discourse.py
+++ b/tests/discourse/functional/_discourse.py
@ -53,7 +53,7 @@ def mint_admin(domain: str) -> tuple[str, str]:
    cmd = (
        "cd /opt/bitnami/discourse && "
        "RUBY=$(command -v ruby || echo /opt/bitnami/ruby/bin/ruby) && "
-        f"RAILS_ENV=production \"$RUBY\" bin/rails runner \"{_BOOTSTRAP_RB}\""
+        f'RAILS_ENV=production "$RUBY" bin/rails runner "{_BOOTSTRAP_RB}"'
    )
    out = lifecycle.exec_in_app(domain, ["bash", "-c", cmd], service="app", timeout=240)
    key = user = None
@ -63,9 +63,9 @@ def mint_admin(domain: str) -> tuple[str, str]:
            key = line.split("=", 1)[1].strip()
        elif line.startswith("CCCI_API_USER="):
            user = line.split("=", 1)[1].strip()
-    assert key and user, (
+    assert (
-        f"could not bootstrap discourse admin/API key; rails output tail:\n{out[-1000:]}"
+        key and user
-    )
+    ), f"could not bootstrap discourse admin/API key; rails output tail:\n{out[-1000:]}"
    return key, user
--- a/tests/discourse/functional/test_create_topic.py
+++ b/tests/discourse/functional/test_create_topic.py
@ -48,21 +48,23 @@ def test_create_topic_roundtrip(live_app):
        headers=hdrs,
        timeout=60,
    )
-    assert status in (200, 201) and isinstance(body, dict), (
+    assert status in (200, 201) and isinstance(
-        f"create topic failed: HTTP {status}, body={body!r}"
+        body, dict
-    )
+    ), f"create topic failed: HTTP {status}, body={body!r}"
    topic_id = body.get("topic_id")
    assert topic_id, f"create topic returned no topic_id: {body!r}"
    # 4) Read the topic back and assert title + first-post body round-trip.
    status, got = harness_http.http_get(f"{base}/t/{topic_id}.json", headers=hdrs, timeout=30)
-    assert status == 200 and isinstance(got, dict), f"read topic failed: HTTP {status}, body={got!r}"
-    assert got.get("title") == title, (
-        f"topic title did not round-trip: sent {title!r}, got {got.get('title')!r}"
-    )
+    assert status == 200 and isinstance(
+        got, dict
+    ), f"read topic failed: HTTP {status}, body={got!r}"
+    assert (
+        got.get("title") == title
+    ), f"topic title did not round-trip: sent {title!r}, got {got.get('title')!r}"
    posts = (got.get("post_stream") or {}).get("posts") or []
    assert posts, f"topic has no posts on read-back: {got!r}"
    first_cooked = posts[0].get("cooked", "")
-    assert marker in first_cooked, (
+    assert (
-        f"topic body did not round-trip: marker {marker!r} not in first post {first_cooked!r}"
+        marker in first_cooked
-    )
+    ), f"topic body did not round-trip: marker {marker!r} not in first post {first_cooked!r}"
--- a/tests/discourse/functional/test_site_basic.py
+++ b/tests/discourse/functional/test_site_basic.py
@ -20,12 +20,12 @@ def test_site_json_has_discourse_config(live_app):
    status, body = harness_http.retry_http_get(
        f"https://{live_app}/site.json", expect_status=200, max_wait=120, interval=5
    )
-    assert status == 200 and isinstance(body, dict), (
+    assert status == 200 and isinstance(
-        f"GET /site.json failed: HTTP {status}, body type={type(body).__name__}"
+        body, dict
-    )
+    ), f"GET /site.json failed: HTTP {status}, body type={type(body).__name__}"
    # /site.json carries Discourse-specific structure — `categories` (a list) and `groups` are always
    # present in a booted Discourse. A non-Discourse 200 (placeholder page) would not parse to this.
    assert "categories" in body, f"/site.json missing 'categories' key: keys={list(body)[:20]}"
-    assert isinstance(body["categories"], list), (
+    assert isinstance(
-        f"/site.json 'categories' not a list: {type(body['categories']).__name__}"
+        body["categories"], list
-    )
+    ), f"/site.json 'categories' not a list: {type(body['categories']).__name__}"
--- a/tests/discourse/install_steps.sh
+++ b/tests/discourse/install_steps.sh
@ -15,7 +15,9 @@ set -euo pipefail
 : "${CCCI_RECIPE:?missing CCCI_RECIPE}"
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-RECIPE_DIR="${HOME}/.abra/recipes/${CCCI_RECIPE}"
+# Resolve the recipe tree the way abra does: $ABRA_DIR (the per-run tree inside a CI run) else
+# the canonical ~/.abra — the overlay must land in the tree this run actually deploys from.
+RECIPE_DIR="${ABRA_DIR:-${HOME}/.abra}/recipes/${CCCI_RECIPE}"
 if [ ! -d "$RECIPE_DIR" ]; then
  echo "  discourse install_steps: recipe dir $RECIPE_DIR missing — cannot provide compose.ccci.yml" >&2
--- a/tests/discourse/ops.py
+++ b/tests/discourse/ops.py
@ -15,8 +15,7 @@ from harness import lifecycle  # noqa: E402
 def _psql(domain, sql):
    cmd = (
-        'PGPASSWORD=$(cat /run/secrets/db_password) '
-        f'psql -U discourse -d discourse -tAc "{sql}"'
+        "PGPASSWORD=$(cat /run/secrets/db_password) " f'psql -U discourse -d discourse -tAc "{sql}"'
    )
    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
@ -42,6 +41,7 @@ def pre_backup(domain, meta):
 def pre_restore(domain, meta):
    # diverge from the backup so a successful restore is observable
    _psql(domain, "DROP TABLE IF EXISTS ci_marker;")
-    assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in ("", "NULL"), (
-        "drop did not take"
-    )
+    assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
+        "",
+        "NULL",
+    ), "drop did not take"
--- a/tests/discourse/recipe_meta.py
+++ b/tests/discourse/recipe_meta.py
@ -6,7 +6,9 @@
 # app is actually serving (the canonical "is discourse up" signal — NOT "/", which may redirect to setup).
 HEALTH_PATH = "/srv/status"
 HEALTH_OK = (200,)
-DEPLOY_TIMEOUT = 3600  # slow Rails cold boot (15-25min) on the 7-GiB single node; bumped 2400→3600 for
+DEPLOY_TIMEOUT = (
+    3600  # slow Rails cold boot (15-25min) on the 7-GiB single node; bumped 2400→3600 for
+)
 # headroom after full4's base deploy timed out at 2400s (RAM/CPU-constrained boot + image re-pull).
 HTTP_TIMEOUT = 1200
@ -59,7 +61,11 @@ def BACKUP_VERIFY(domain):
    try:
        out = lifecycle.exec_in_app(
            domain,
-            ["sh", "-c", "gzip -t /var/lib/postgresql/data/backup.sql && wc -c < /var/lib/postgresql/data/backup.sql"],
+            [
+                "sh",
+                "-c",
+                "gzip -t /var/lib/postgresql/data/backup.sql && wc -c < /var/lib/postgresql/data/backup.sql",
+            ],
            service="db",
            timeout=60,
        ).strip()
--- a/tests/discourse/test_backup.py
+++ b/tests/discourse/test_backup.py
@ -14,13 +14,12 @@ from harness import lifecycle  # noqa: E402
 def _psql(domain, sql):
    cmd = (
-        'PGPASSWORD=$(cat /run/secrets/db_password) '
-        f'psql -U discourse -d discourse -tAc "{sql}"'
+        "PGPASSWORD=$(cat /run/secrets/db_password) " f'psql -U discourse -d discourse -tAc "{sql}"'
    )
    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
 def test_backup_captures_state(live_app):
-    assert _psql(live_app, "SELECT v FROM ci_marker;") == "original", (
+    assert (
-        "the seeded discourse postgres state was not present at backup time"
+        _psql(live_app, "SELECT v FROM ci_marker;") == "original"
-    )
+    ), "the seeded discourse postgres state was not present at backup time"
--- a/tests/discourse/test_restore.py
+++ b/tests/discourse/test_restore.py
@ -14,13 +14,12 @@ from harness import lifecycle  # noqa: E402
 def _psql(domain, sql):
    cmd = (
-        'PGPASSWORD=$(cat /run/secrets/db_password) '
-        f'psql -U discourse -d discourse -tAc "{sql}"'
+        "PGPASSWORD=$(cat /run/secrets/db_password) " f'psql -U discourse -d discourse -tAc "{sql}"'
    )
    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
 def test_restore_returns_state(live_app):
-    assert _psql(live_app, "SELECT v FROM ci_marker;") == "original", (
+    assert (
-        "restore did not return the pre-mutation discourse postgres state (data-integrity failure)"
+        _psql(live_app, "SELECT v FROM ci_marker;") == "original"
-    )
+    ), "restore did not return the pre-mutation discourse postgres state (data-integrity failure)"
--- a/tests/ghost/functional/_ghost.py
+++ b/tests/ghost/functional/_ghost.py
@ -93,9 +93,10 @@ class GhostAdmin:
        status, body = self.req(
            "POST", "/session/", {"username": ADMIN_EMAIL, "password": ADMIN_PW}
        )
-        assert status in (200, 201), (
-            f"ghost admin session login failed: HTTP {status}, body={body!r}"
-        )
+        assert status in (
+            200,
+            201,
+        ), f"ghost admin session login failed: HTTP {status}, body={body!r}"
    def create_post(self, title: str, html: str) -> dict:
        status, body = self.req(
--- a/tests/ghost/functional/test_admin_redirect.py
+++ b/tests/ghost/functional/test_admin_redirect.py
@ -53,13 +53,15 @@ def test_ghost_admin_route_is_wired(live_app):
        return None
    status_body = harness_http.assert_converges(
-        _ready, f"GET {url} returns Ghost admin (200) or setup redirect (302)",
-        max_wait=60, interval=3,
+        _ready,
+        f"GET {url} returns Ghost admin (200) or setup redirect (302)",
+        max_wait=60,
+        interval=3,
    )
    status, body = status_body
    assert status in (200, 302), f"unexpected status: {status}"
    if status == 200:
        # The admin SPA references /ghost-assets/ or contains "ghost" in title/body
-        assert "ghost" in body.lower(), (
+        assert (
-            f"GET {url} 200 but body has no Ghost markers: {body[:200]!r}"
+            "ghost" in body.lower()
-        )
+        ), f"GET {url} 200 but body has no Ghost markers: {body[:200]!r}"
--- a/tests/ghost/functional/test_content_api.py
+++ b/tests/ghost/functional/test_content_api.py
@ -35,10 +35,10 @@ def test_content_api_settings_endpoint(live_app):
    assert body is not None, f"GET {url} returned non-JSON body"
    # On success: {"settings": {...}}. On error: {"errors": [...]}. Either shape is valid.
    if status == 200:
-        assert isinstance(body, dict) and "settings" in body, (
+        assert (
-            f"200 response missing 'settings' envelope: {body!r}"
+            isinstance(body, dict) and "settings" in body
-        )
+        ), f"200 response missing 'settings' envelope: {body!r}"
    else:
-        assert isinstance(body, dict) and ("errors" in body or "message" in body or body), (
+        assert isinstance(body, dict) and (
-            f"error response not a proper Ghost error envelope: {body!r}"
+            "errors" in body or "message" in body or body
-        )
+        ), f"error response not a proper Ghost error envelope: {body!r}"
--- a/tests/ghost/functional/test_post_roundtrip.py
+++ b/tests/ghost/functional/test_post_roundtrip.py
@ -43,17 +43,17 @@ def test_create_post_roundtrip(live_app):
    title = f"ccci-marker-{uniq}"
    marker = f"ccci-body-marker-{uniq}-roundtrip"
    created = admin.create_post(title, f"<p>{marker}</p>")
-    assert created.get("title") == title, (
+    assert (
-        f"created post title mismatch: sent {title!r}, got {created.get('title')!r}"
+        created.get("title") == title
-    )
+    ), f"created post title mismatch: sent {title!r}, got {created.get('title')!r}"
    # 4) Read it back by id and assert the post survived the round-trip (title always returned;
    #    html returned because we requested ?formats=html).
    got = admin.get_post(created["id"])
-    assert got.get("title") == title, (
+    assert (
-        f"post title did not round-trip: sent {title!r}, got {got.get('title')!r}"
+        got.get("title") == title
-    )
+    ), f"post title did not round-trip: sent {title!r}, got {got.get('title')!r}"
    html = got.get("html") or ""
-    assert marker in html, (
+    assert (
-        f"post body did not round-trip: marker {marker!r} not in read-back html {html!r}"
+        marker in html
-    )
+    ), f"post body did not round-trip: marker {marker!r} not in read-back html {html!r}"
--- a/tests/ghost/install_steps.sh
+++ b/tests/ghost/install_steps.sh
@ -15,7 +15,9 @@ set -euo pipefail
 : "${CCCI_RECIPE:?missing CCCI_RECIPE}"
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-RECIPE_DIR="${HOME}/.abra/recipes/${CCCI_RECIPE}"
+# Resolve the recipe tree the way abra does: $ABRA_DIR (the per-run tree inside a CI run) else
+# the canonical ~/.abra — the overlay must land in the tree this run actually deploys from.
+RECIPE_DIR="${ABRA_DIR:-${HOME}/.abra}/recipes/${CCCI_RECIPE}"
 if [ ! -d "$RECIPE_DIR" ]; then
  echo "  ghost install_steps: recipe dir $RECIPE_DIR missing — cannot provide compose.ccci.yml" >&2
--- a/tests/ghost/ops.py
+++ b/tests/ghost/ops.py
@ -22,10 +22,7 @@ from harness import lifecycle  # noqa: E402
 def _mysql(domain, sql):
-    cmd = (
-        'MYSQL_PWD="$(cat /run/secrets/db_password)" '
-        f'mysql -u root -N -s ghost -e "{sql}"'
-    )
+    cmd = 'MYSQL_PWD="$(cat /run/secrets/db_password)" ' f'mysql -u root -N -s ghost -e "{sql}"'
    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
--- a/tests/ghost/recipe_meta.py
+++ b/tests/ghost/recipe_meta.py
@ -63,7 +63,11 @@ def BACKUP_VERIFY(domain):
    try:
        out = lifecycle.exec_in_app(
            domain,
-            ["sh", "-c", "gzip -t /var/lib/mysql/backup.sql.gz && wc -c < /var/lib/mysql/backup.sql.gz"],
+            [
+                "sh",
+                "-c",
+                "gzip -t /var/lib/mysql/backup.sql.gz && wc -c < /var/lib/mysql/backup.sql.gz",
+            ],
            service="db",
            timeout=60,
        ).strip()
--- a/tests/ghost/test_backup.py
+++ b/tests/ghost/test_backup.py
@ -15,14 +15,11 @@ from harness import lifecycle  # noqa: E402
 def _mysql(domain, sql):
-    cmd = (
-        'MYSQL_PWD="$(cat /run/secrets/db_password)" '
-        f'mysql -u root -N -s ghost -e "{sql}"'
-    )
+    cmd = 'MYSQL_PWD="$(cat /run/secrets/db_password)" ' f'mysql -u root -N -s ghost -e "{sql}"'
    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
 def test_backup_captures_state(live_app):
-    assert _mysql(live_app, "SELECT v FROM ci_marker;") == "original", (
+    assert (
-        "the seeded ghost MySQL marker was not present at backup time"
+        _mysql(live_app, "SELECT v FROM ci_marker;") == "original"
-    )
+    ), "the seeded ghost MySQL marker was not present at backup time"
--- a/tests/ghost/test_restore.py
+++ b/tests/ghost/test_restore.py
@ -22,10 +22,7 @@ from harness import lifecycle  # noqa: E402
 def _mysql(domain, sql):
-    cmd = (
-        'MYSQL_PWD="$(cat /run/secrets/db_password)" '
-        f'mysql -u root -N -s ghost -e "{sql}"'
-    )
+    cmd = 'MYSQL_PWD="$(cat /run/secrets/db_password)" ' f'mysql -u root -N -s ghost -e "{sql}"'
    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
--- a/tests/ghost/test_upgrade.py
+++ b/tests/ghost/test_upgrade.py
@ -14,14 +14,11 @@ from harness import lifecycle  # noqa: E402
 def _mysql(domain, sql):
-    cmd = (
-        'MYSQL_PWD="$(cat /run/secrets/db_password)" '
-        f'mysql -u root -N -s ghost -e "{sql}"'
-    )
+    cmd = 'MYSQL_PWD="$(cat /run/secrets/db_password)" ' f'mysql -u root -N -s ghost -e "{sql}"'
    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
 def test_upgrade_preserves_state(live_app):
-    assert _mysql(live_app, "SELECT v FROM ci_marker;") == "upgrade-survives", (
-        "the seeded ghost MySQL marker did not survive the upgrade redeploy (data loss on upgrade)"
-    )
+    assert (
+        _mysql(live_app, "SELECT v FROM ci_marker;") == "upgrade-survives"
+    ), "the seeded ghost MySQL marker did not survive the upgrade redeploy (data loss on upgrade)"
--- a/tests/hedgedoc/functional/test_branding.py
+++ b/tests/hedgedoc/functional/test_branding.py
@ -14,7 +14,6 @@ import urllib.request
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", "runner"))
 from harness import http as harness_http  # noqa: E402
 _CTX = ssl.create_default_context()
 _CTX.check_hostname = False
 _CTX.verify_mode = ssl.CERT_NONE
--- a/tests/hedgedoc/functional/test_health_check.py
+++ b/tests/hedgedoc/functional/test_health_check.py
@ -15,7 +15,5 @@ from harness import http as harness_http  # noqa: E402
 def test_hedgedoc_root_serves(live_app):
    """GET / → 200 or 302 (login/new redirect)."""
    url = f"https://{live_app}/"
-    status, _ = harness_http.retry_http_get(
-        url, expect_status=(200, 302), max_wait=90, interval=5
-    )
+    status, _ = harness_http.retry_http_get(url, expect_status=(200, 302), max_wait=90, interval=5)
    assert status in (200, 302), f"GET {url} HTTP {status} (expected 200 or 302)"
--- a/tests/immich/functional/test_asset_processing.py
+++ b/tests/immich/functional/test_asset_processing.py
@ -111,13 +111,13 @@ def test_immich_processes_uploaded_asset_metadata_and_statistics(live_app):
        if exif and exif.get("exifImageWidth"):
            break
        time.sleep(5)
-    assert exif and exif.get("exifImageWidth") == 1 and exif.get("exifImageHeight") == 1, (
-        f"immich metadata-extraction did not populate the 1x1 PNG dimensions in exifInfo: {exif!r}"
-    )
+    assert (
+        exif and exif.get("exifImageWidth") == 1 and exif.get("exifImageHeight") == 1
+    ), f"immich metadata-extraction did not populate the 1x1 PNG dimensions in exifInfo: {exif!r}"
    # the asset is catalogued into the owner's library statistics (list-back in aggregate)
    sst, stats = harness_http.http_request("GET", f"{base}/api/assets/statistics", headers=auth)
    assert sst == 200 and isinstance(stats, dict), f"statistics HTTP {sst}: {stats!r}"
-    assert stats.get("images", 0) >= 1 and stats.get("total", 0) >= 1, (
-        f"uploaded asset not reflected in library statistics: {stats!r}"
-    )
+    assert (
+        stats.get("images", 0) >= 1 and stats.get("total", 0) >= 1
+    ), f"uploaded asset not reflected in library statistics: {stats!r}"
--- a/tests/immich/functional/test_asset_upload.py
+++ b/tests/immich/functional/test_asset_upload.py
@ -121,6 +121,6 @@ def test_immich_upload_asset_readback_and_thumbnail(live_app):
        if thumb == 200:
            break
        time.sleep(5)
-    assert thumb == 200, (
-        f"immich did not generate a thumbnail/derivative for the uploaded asset (last HTTP {thumb})"
-    )
+    assert (
+        thumb == 200
+    ), f"immich did not generate a thumbnail/derivative for the uploaded asset (last HTTP {thumb})"
--- a/tests/immich/functional/test_health_check.py
+++ b/tests/immich/functional/test_health_check.py
@ -16,5 +16,11 @@ from harness import http as harness_http  # noqa: E402
 def test_immich_returns_200(live_app):
    url = f"https://{live_app}/"
-    status, _ = harness_http.retry_http_get(url, expect_status=(200, 301, 302), max_wait=60, interval=3)
-    assert status in (200, 301, 302), f"immich at {url} returned HTTP {status} (expected 200/301/302)"
+    status, _ = harness_http.retry_http_get(
+        url, expect_status=(200, 301, 302), max_wait=60, interval=3
+    )
+    assert status in (
+        200,
+        301,
+        302,
+    ), f"immich at {url} returned HTTP {status} (expected 200/301/302)"
--- a/tests/immich/ops.py
+++ b/tests/immich/ops.py
@ -35,4 +35,7 @@ def pre_backup(domain, meta):
 def pre_restore(domain, meta):
    _psql(domain, "DROP TABLE ci_marker;")
-    assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in ("", "NULL"), "drop did not take"
+    assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
+        "",
+        "NULL",
+    ), "drop did not take"
--- a/tests/immich/test_backup.py
+++ b/tests/immich/test_backup.py
@ -14,4 +14,6 @@ def _psql(domain, sql):
 def test_backup_captures_state(live_app):
-    assert _psql(live_app, "SELECT v FROM ci_marker;") == "original", "seeded postgres state not present at backup time"
+    assert (
+        _psql(live_app, "SELECT v FROM ci_marker;") == "original"
+    ), "seeded postgres state not present at backup time"
--- a/tests/immich/test_install.py
+++ b/tests/immich/test_install.py
@ -7,7 +7,8 @@ import os
 import sys
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
-from harness import browser as harness_browser, generic, lifecycle  # noqa: E402
+from harness import browser as harness_browser  # noqa: E402
+from harness import generic, lifecycle
 def test_serving_and_frontend(live_app, meta):
@ -25,7 +26,11 @@ def test_serving_and_frontend(live_app, meta):
            resp = harness_browser.goto_with_retry(
                page, url, accept_statuses=(200, 301, 302), goto_timeout_ms=60_000
            )
-            assert resp is not None and resp.status in (200, 301, 302), f"page status {resp and resp.status}"
+            assert resp is not None and resp.status in (
+                200,
+                301,
+                302,
+            ), f"page status {resp and resp.status}"
            assert "<html" in page.content().lower(), "no HTML served by the immich frontend"
        finally:
            browser.close()
--- a/tests/immich/test_restore.py
+++ b/tests/immich/test_restore.py
@ -14,4 +14,6 @@ def _psql(domain, sql):
 def test_restore_returns_state(live_app):
-    assert _psql(live_app, "SELECT v FROM ci_marker;") == "original", "restore did not return the pre-mutation postgres state"
+    assert (
+        _psql(live_app, "SELECT v FROM ci_marker;") == "original"
+    ), "restore did not return the pre-mutation postgres state"
--- a/tests/immich/test_upgrade.py
+++ b/tests/immich/test_upgrade.py
@ -14,4 +14,6 @@ def _psql(domain, sql):
 def test_upgrade_preserves_data(live_app):
-    assert _psql(live_app, "SELECT v FROM ci_marker;") == "upgrade-survives", "postgres data did not survive the upgrade"
+    assert (
+        _psql(live_app, "SELECT v FROM ci_marker;") == "upgrade-survives"
+    ), "postgres data did not survive the upgrade"
--- a/tests/keycloak/functional/test_create_client_and_use.py
+++ b/tests/keycloak/functional/test_create_client_and_use.py
@ -120,9 +120,9 @@ def test_create_confidential_client_and_obtain_token(live_app):
        "clientId": client_id,
        "enabled": True,
        "secret": client_secret,
-        "publicClient": False,            # confidential client
+        "publicClient": False,  # confidential client
-        "serviceAccountsEnabled": True,    # required for client_credentials grant
+        "serviceAccountsEnabled": True,  # required for client_credentials grant
-        "standardFlowEnabled": False,      # not needed for service-account-only client
+        "standardFlowEnabled": False,  # not needed for service-account-only client
        "directAccessGrantsEnabled": False,
        "protocol": "openid-connect",
    }
@ -144,25 +144,25 @@ def test_create_confidential_client_and_obtain_token(live_app):
        # Use the client to obtain its own token (client_credentials grant)
        tok_status, tok_resp = _client_credentials_token(live_app, client_id, client_secret)
-        assert tok_status == 200, (
-            f"client_credentials token returned HTTP {tok_status}: {tok_resp!r}"
-        )
+        assert (
+            tok_status == 200
+        ), f"client_credentials token returned HTTP {tok_status}: {tok_resp!r}"
        access_token = tok_resp.get("access_token") if isinstance(tok_resp, dict) else None
-        assert isinstance(access_token, str) and access_token.count(".") == 2, (
-            f"client_credentials access_token not a JWT: {access_token!r}"
-        )
+        assert (
+            isinstance(access_token, str) and access_token.count(".") == 2
+        ), f"client_credentials access_token not a JWT: {access_token!r}"
        # Decode the JWT payload; assert azp matches the new client
        payload = json.loads(_b64url_decode(access_token.split(".")[1]))
-        assert payload.get("azp") == client_id, (
-            f"client_credentials JWT azp={payload.get('azp')!r} != client_id={client_id!r}"
-        )
+        assert (
+            payload.get("azp") == client_id
+        ), f"client_credentials JWT azp={payload.get('azp')!r} != client_id={client_id!r}"
        # Service-account token does NOT carry a session-scoped user (azp + clientId differ from
        # admin-cli token). The presence of azp + iss == per-run-domain proves the issuance flow.
        expected_iss = f"https://{live_app}/realms/master"
-        assert payload.get("iss") == expected_iss, (
-            f"JWT iss={payload.get('iss')!r} != {expected_iss!r}"
-        )
+        assert (
+            payload.get("iss") == expected_iss
+        ), f"JWT iss={payload.get('iss')!r} != {expected_iss!r}"
    finally:
        # Idempotent cleanup
        if cleanup_id:
--- a/tests/keycloak/functional/test_password_grant_token.py
+++ b/tests/keycloak/functional/test_password_grant_token.py
@ -43,22 +43,20 @@ def test_password_grant_issues_valid_jwt(live_app):
    token = kc_admin.admin_token(live_app, password)
    # Shape: a JWT is exactly 3 base64url segments
-    assert isinstance(token, str) and token.count(".") == 2, (
-        f"access_token does not look like a JWT (no 3 segments): len={len(token) if token else 0}"
-    )
+    assert (
+        isinstance(token, str) and token.count(".") == 2
+    ), f"access_token does not look like a JWT (no 3 segments): len={len(token) if token else 0}"
    payload = _decode_jwt_payload(token)
    # iss = the issuer URL, must be the per-run domain's /realms/master endpoint
    expected_iss = f"https://{live_app}/realms/master"
-    assert payload.get("iss") == expected_iss, (
-        f"JWT iss claim {payload.get('iss')!r} != {expected_iss!r}"
-    )
+    assert (
+        payload.get("iss") == expected_iss
+    ), f"JWT iss claim {payload.get('iss')!r} != {expected_iss!r}"
    # azp = authorized party (which client requested this token)
-    assert payload.get("azp") == "admin-cli", (
-        f"JWT azp claim {payload.get('azp')!r} != 'admin-cli'"
-    )
+    assert payload.get("azp") == "admin-cli", f"JWT azp claim {payload.get('azp')!r} != 'admin-cli'"
    # typ = token type
    assert payload.get("typ") == "Bearer", f"JWT typ claim {payload.get('typ')!r} != 'Bearer'"
@ -70,6 +68,6 @@ def test_password_grant_issues_valid_jwt(live_app):
    # iat (issued at) is also a standard claim
    iat = payload.get("iat")
-    assert isinstance(iat, int) and iat <= time.time() + 60, (
-        f"JWT iat {iat!r} not a reasonable past timestamp"
-    )
+    assert (
+        isinstance(iat, int) and iat <= time.time() + 60
+    ), f"JWT iat {iat!r} not a reasonable past timestamp"
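Reviewer note (illustrative, not part of the patch): the assertions in this file call a `_decode_jwt_payload` helper that the hunks above do not show. Below is a minimal sketch of what such a helper might look like, assuming standard unpadded base64url segments. The names `b64url_decode`, `decode_jwt_payload`, and `make_unsigned_token` are hypothetical, and the signature is deliberately not verified, so this is only suitable for inspecting claims in a test.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the '=' padding that JWTs omit."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def decode_jwt_payload(token: str) -> dict:
    """Return the claims of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected 3 dot-separated segments, got {len(parts)}")
    return json.loads(b64url_decode(parts[1]))


def make_unsigned_token(payload: dict) -> str:
    """Build a throwaway token (placeholder signature) for exercising the decoder."""

    def enc(obj: dict) -> str:
        raw = json.dumps(obj, separators=(",", ":")).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return f"{enc({'alg': 'none'})}.{enc(payload)}.sig"


if __name__ == "__main__":
    claims = {"iss": "https://app.example.test/realms/master", "azp": "admin-cli"}
    print(decode_jwt_payload(make_unsigned_token(claims)))
```

The three-segment check mirrors the `token.count(".") == 2` assertion in the test; the claim checks (`iss`, `azp`, `typ`, `iat`) then operate on the returned dict.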
--- a/tests/keycloak/recipe_meta.py
+++ b/tests/keycloak/recipe_meta.py
@ -2,5 +2,7 @@
 # conftest — enrolling this recipe needs NO change to runner/harness code (D5).
 HEALTH_PATH = "/realms/master"  # 200 JSON once keycloak is up (not "/", which redirects)
 HEALTH_OK = (200,)
-DEPLOY_TIMEOUT = 900  # JVM + DB migration are slow on a 2-vCPU VM; observed 502 fallback up to ~10min
+DEPLOY_TIMEOUT = (
+    900  # JVM + DB migration are slow on a 2-vCPU VM; observed 502 fallback up to ~10min
+)
 HTTP_TIMEOUT = 900
--- a/tests/keycloak/test_install.py
+++ b/tests/keycloak/test_install.py
@ -8,7 +8,8 @@ import os
 import sys
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
-from harness import browser as harness_browser, generic, lifecycle  # noqa: E402
+from harness import browser as harness_browser  # noqa: E402
+from harness import generic, lifecycle
 def test_serving_and_admin_console(live_app, meta):
--- a/tests/lasuite-docs/functional/test_auth_required.py
+++ b/tests/lasuite-docs/functional/test_auth_required.py
@ -28,9 +28,7 @@ def test_users_me_requires_auth(live_app):
    url = f"https://{live_app}/api/v1.0/users/me/"
    # Retry with broad acceptance: any 4xx (or specific 401) indicates the route exists + auth is
    # required. Reject 200 (anonymous access) and 5xx (broken backend).
-    status, _ = harness_http.retry_http_get(
-        url, expect_status=(401, 403), max_wait=60, interval=3
-    )
+    status, _ = harness_http.retry_http_get(url, expect_status=(401, 403), max_wait=60, interval=3)
    assert status in (401, 403), (
        f"GET {url} returned {status}, expected 401 (auth required). "
        f"200 = anonymous access leaked; 404 = route missing; 5xx = backend broken."
--- a/tests/lasuite-docs/functional/test_create_doc.py
+++ b/tests/lasuite-docs/functional/test_create_doc.py
@ -27,7 +27,8 @@ import uuid
 import pytest
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", "runner"))
-from harness import http as harness_http, sso  # noqa: E402
+from harness import http as harness_http  # noqa: E402
+from harness import sso
@pytest.mark.requires_deps
@ -36,13 +37,15 @@ def test_create_doc_and_read_back(live_app, deps_creds):
    kc = deps_creds["keycloak"]
    # Obtain a JWT via OIDC password grant
-    access_token = sso.oidc_password_grant({
-        "client_id": kc["client_id"],
-        "client_secret": kc["client_secret"],
-        "user": kc["user"],
-        "password": kc["password"],
-        "token_url": kc["token_url"],
-    })
+    access_token = sso.oidc_password_grant(
+        {
+            "client_id": kc["client_id"],
+            "client_secret": kc["client_secret"],
+            "user": kc["user"],
+            "password": kc["password"],
+            "token_url": kc["token_url"],
+        }
+    )
    auth = {"Authorization": f"Bearer {access_token}"}
    # Create a doc with a unique title
@ -56,9 +59,9 @@ def test_create_doc_and_read_back(live_app, deps_creds):
    assert isinstance(body, dict), f"unexpected response shape: {body!r}"
    doc_id = body.get("id")
    assert doc_id, f"created doc has no id: {body!r}"
-    assert body.get("title") == title, (
-        f"created doc title mismatch: created={title!r}, response={body.get('title')!r}"
-    )
+    assert (
+        body.get("title") == title
+    ), f"created doc title mismatch: created={title!r}, response={body.get('title')!r}"
    # Fetch it back via the dedicated GET endpoint
    s, fetched = harness_http.http_get(
@ -66,9 +69,10 @@ def test_create_doc_and_read_back(live_app, deps_creds):
    )
    assert s == 200, f"GET /api/v1.0/documents/{doc_id}/ HTTP {s}: {fetched!r}"
    assert isinstance(fetched, dict), f"unexpected GET response: {fetched!r}"
-    assert fetched.get("id") in (doc_id, str(doc_id)), (
-        f"fetched id mismatch: created={doc_id!r}, fetched={fetched.get('id')!r}"
-    )
-    assert fetched.get("title") == title, (
-        f"fetched title mismatch: created={title!r}, fetched={fetched.get('title')!r}"
-    )
+    assert fetched.get("id") in (
+        doc_id,
+        str(doc_id),
+    ), f"fetched id mismatch: created={doc_id!r}, fetched={fetched.get('id')!r}"
+    assert (
+        fetched.get("title") == title
+    ), f"fetched title mismatch: created={title!r}, fetched={fetched.get('title')!r}"
--- a/tests/lasuite-docs/functional/test_health_check.py
+++ b/tests/lasuite-docs/functional/test_health_check.py
@ -22,7 +22,11 @@ def test_lasuite_docs_returns_200(live_app):
    url = f"https://{live_app}/"
    # accept 200 (frontend SPA shell) — lasuite-docs serves the SPA at root unauthenticated;
    # the SPA itself bootstraps via /api/v1.0/users/me/ which requires OIDC (separate test).
-    status, _ = harness_http.retry_http_get(url, expect_status=(200, 301, 302), max_wait=60, interval=3)
-    assert status in (200, 301, 302), (
-        f"lasuite-docs at {url} returned HTTP {status} (expected 200/301/302)"
+    status, _ = harness_http.retry_http_get(
+        url, expect_status=(200, 301, 302), max_wait=60, interval=3
     )
+    assert status in (
+        200,
+        301,
+        302,
+    ), f"lasuite-docs at {url} returned HTTP {status} (expected 200/301/302)"
--- a/tests/lasuite-docs/functional/test_oidc_login.py
+++ b/tests/lasuite-docs/functional/test_oidc_login.py
@ -25,7 +25,8 @@ import urllib.request
 import pytest
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", "runner"))
-from harness import http as harness_http, sso  # noqa: E402
+from harness import http as harness_http  # noqa: E402
+from harness import sso
 _CTX = ssl.create_default_context()
 _CTX.check_hostname = False
@ -61,9 +62,9 @@ def test_oidc_login_via_keycloak(live_app, deps_creds):
    # 302 redirect. Both are valid "auth-required" indicators — accept either, but if a
    # redirect is returned it must point at the dep keycloak realm.
    if status in (301, 302, 303, 307, 308):
-        assert expected_prefix in (redirect or ""), (
-            f"Docs redirected to {redirect!r}, expected to start with {expected_prefix!r}"
-        )
+        assert expected_prefix in (
+            redirect or ""
+        ), f"Docs redirected to {redirect!r}, expected to start with {expected_prefix!r}"
    else:
        assert status in (401, 403), (
            f"GET /api/v1.0/users/me/ unauth: HTTP {status}; expected redirect to keycloak "
@ -88,6 +89,6 @@ def test_oidc_login_via_keycloak(live_app, deps_creds):
    )
    assert status == 200, f"GET /api/v1.0/users/me/ with token HTTP {status}: {body!r}"
    assert isinstance(body, dict), f"unexpected response: {body!r}"
-    assert body.get("email") == kc["email"], (
-        f"unexpected user email: got {body.get('email')!r}, expected {kc['email']!r}"
-    )
+    assert (
+        body.get("email") == kc["email"]
+    ), f"unexpected user email: got {body.get('email')!r}, expected {kc['email']!r}"
--- a/tests/lasuite-docs/functional/test_oidc_with_keycloak.py
+++ b/tests/lasuite-docs/functional/test_oidc_with_keycloak.py
@ -42,9 +42,9 @@ def test_oidc_password_grant_against_dep_keycloak(live_app, deps_creds):
    # Sanity-check the creds shape — orchestrator-written
    assert kc["domain"]
    # WC1: realm is per-run namespaced "<parent>-<6hex>" so concurrent dependents never collide.
-    assert re.fullmatch(r"lasuite-docs-[0-9a-f]{6}", kc["realm"]), (
-        f"realm {kc['realm']!r} not the per-run namespaced form lasuite-docs-<6hex>"
-    )
+    assert re.fullmatch(
+        r"lasuite-docs-[0-9a-f]{6}", kc["realm"]
+    ), f"realm {kc['realm']!r} not the per-run namespaced form lasuite-docs-<6hex>"
    assert kc["client_id"] == "lasuite-docs"
    assert isinstance(kc["client_secret"], str) and len(kc["client_secret"]) >= 16
    assert isinstance(kc["password"], str) and len(kc["password"]) >= 16
@ -74,16 +74,14 @@ def test_oidc_password_grant_against_dep_keycloak(live_app, deps_creds):
    # Password grant → real JWT
    token = sso.oidc_password_grant(creds)
-    assert isinstance(token, str) and token.count(".") == 2, (
-        f"access_token is not a JWT: {token!r}"
-    )
+    assert isinstance(token, str) and token.count(".") == 2, f"access_token is not a JWT: {token!r}"
    payload = json.loads(_b64url_decode(token.split(".")[1]))
    assert payload.get("iss") == expected_iss, f"JWT iss={payload.get('iss')!r} != {expected_iss!r}"
-    assert payload.get("azp") == kc["client_id"], (
-        f"JWT azp={payload.get('azp')!r} != {kc['client_id']!r}"
-    )
+    assert (
+        payload.get("azp") == kc["client_id"]
+    ), f"JWT azp={payload.get('azp')!r} != {kc['client_id']!r}"
    assert payload.get("typ") == "Bearer", f"JWT typ={payload.get('typ')!r} != 'Bearer'"
    exp = payload.get("exp")
-    assert isinstance(exp, int) and exp > time.time(), (
-        f"JWT exp={exp!r} not a future timestamp (now={time.time():.0f})"
-    )
+    assert (
+        isinstance(exp, int) and exp > time.time()
+    ), f"JWT exp={exp!r} not a future timestamp (now={time.time():.0f})"
--- a/tests/lasuite-docs/setup_custom_tests.sh
+++ b/tests/lasuite-docs/setup_custom_tests.sh
@ -21,15 +21,24 @@ set -euo pipefail
 : "${CCCI_APP_DOMAIN:?missing}"
 : "${CCCI_DEPS_FILE:?missing}"
-test -s "$CCCI_DEPS_FILE" || { echo "  setup_custom_tests: deps file empty"; exit 1; }
+test -s "$CCCI_DEPS_FILE" || {
+  echo "  setup_custom_tests: deps file empty"
+  exit 1
+}
 # Read keycloak dep info via jq
-KC_DOMAIN=$(jq -r '.keycloak.domain'         "$CCCI_DEPS_FILE")
+KC_DOMAIN=$(jq -r '.keycloak.domain' "$CCCI_DEPS_FILE")
-KC_REALM=$( jq -r '.keycloak.realm'          "$CCCI_DEPS_FILE")
+KC_REALM=$(jq -r '.keycloak.realm' "$CCCI_DEPS_FILE")
-KC_CLIENT=$(jq -r '.keycloak.client_id'      "$CCCI_DEPS_FILE")
+KC_CLIENT=$(jq -r '.keycloak.client_id' "$CCCI_DEPS_FILE")
-KC_SECRET=$(jq -r '.keycloak.client_secret'  "$CCCI_DEPS_FILE")
+KC_SECRET=$(jq -r '.keycloak.client_secret' "$CCCI_DEPS_FILE")
-[ -n "$KC_DOMAIN" ] && [ "$KC_DOMAIN" != "null" ] || { echo "  setup_custom_tests: no keycloak.domain in deps"; exit 1; }
-[ -n "$KC_SECRET" ] && [ "$KC_SECRET" != "null" ] || { echo "  setup_custom_tests: no keycloak.client_secret"; exit 1; }
+if [ -z "$KC_DOMAIN" ] || [ "$KC_DOMAIN" = "null" ]; then
+  echo "  setup_custom_tests: no keycloak.domain in deps"
+  exit 1
+fi
+if [ -z "$KC_SECRET" ] || [ "$KC_SECRET" = "null" ]; then
+  echo "  setup_custom_tests: no keycloak.client_secret"
+  exit 1
+fi
 echo "  lasuite-docs setup_custom_tests: wiring OIDC against keycloak dep ${KC_DOMAIN}"
@ -39,12 +48,15 @@ echo "  lasuite-docs setup_custom_tests: wiring OIDC against keycloak dep ${KC_D
 # update SECRET_OIDC_RPCS_VERSION in the .env to point at the new one.
 ENV_PATH="$HOME/.abra/servers/default/${CCCI_APP_DOMAIN}.env"
 CUR_VER=$(grep -E '^\s*SECRET_OIDC_RPCS_VERSION=' "$ENV_PATH" | tail -1 | cut -d= -f2 | tr -d '"\r' || echo "v1")
-NEW_NUM=$(( ${CUR_VER#v} + 1 ))
+NEW_NUM=$((${CUR_VER#v} + 1))
 NEW_VER="v${NEW_NUM}"
-INSERT_LOG=$(abra app secret insert $CCCI_APP_DOMAIN oidc_rpcs $NEW_VER $KC_SECRET --no-input -C -o 2>&1) \
-  || INSERT_LOG=$(script -qec "abra app secret insert $CCCI_APP_DOMAIN oidc_rpcs $NEW_VER $KC_SECRET --no-input -C -o" /dev/null 2>&1) \
-  || { echo "  setup_custom_tests: abra app secret insert oidc_rpcs@$NEW_VER failed: $INSERT_LOG"; exit 1; }
+INSERT_LOG=$(abra app secret insert "$CCCI_APP_DOMAIN" oidc_rpcs "$NEW_VER" "$KC_SECRET" --no-input -C -o 2>&1) ||
+  INSERT_LOG=$(script -qec "abra app secret insert $CCCI_APP_DOMAIN oidc_rpcs $NEW_VER $KC_SECRET --no-input -C -o" /dev/null 2>&1) ||
+  {
+    echo "  setup_custom_tests: abra app secret insert oidc_rpcs@$NEW_VER failed: $INSERT_LOG"
+    exit 1
+  }
 # Repoint the env var to the new version
 sed -i "s|^\s*SECRET_OIDC_RPCS_VERSION=.*|SECRET_OIDC_RPCS_VERSION=$NEW_VER|" "$ENV_PATH"
 echo "  setup_custom_tests: oidc_rpcs secret inserted at $NEW_VER (was $CUR_VER)"
@ -52,25 +64,25 @@ echo "  setup_custom_tests: oidc_rpcs secret inserted at $NEW_VER (was $CUR_VER)
 # 2) Write OIDC env vars to the app's .env (names per lasuite-docs's .env.sample).
 # Ensure the file ends with a newline FIRST so our appends don't concatenate onto the last line
 # (we saw `TIMEOUT=900OIDC_REALM=...` malformed by a missing-trailing-newline file).
-[ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >> "$ENV_PATH"
+[ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >>"$ENV_PATH"
-write_env () {
+write_env() {
  local key="$1" val="$2"
  # remove any existing key (commented or live) then append the live key=val
  sed -i "/^\s*#\?\s*${key}=/d" "$ENV_PATH"
  # Re-ensure trailing newline after each delete (sed may leave the file without one)
-  [ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >> "$ENV_PATH"
+  [ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >>"$ENV_PATH"
-  printf '%s=%s\n' "$key" "$val" >> "$ENV_PATH"
+  printf '%s=%s\n' "$key" "$val" >>"$ENV_PATH"
 }
-write_env OIDC_REALM                       "$KC_REALM"
+write_env OIDC_REALM "$KC_REALM"
-write_env OIDC_OP_DISCOVERY_ENDPOINT       "https://${KC_DOMAIN}/realms/${KC_REALM}/.well-known/openid-configuration"
+write_env OIDC_OP_DISCOVERY_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/.well-known/openid-configuration"
-write_env OIDC_OP_AUTHORIZATION_ENDPOINT   "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/auth"
+write_env OIDC_OP_AUTHORIZATION_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/auth"
-write_env OIDC_OP_TOKEN_ENDPOINT           "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/token"
+write_env OIDC_OP_TOKEN_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/token"
-write_env OIDC_OP_USER_ENDPOINT            "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/userinfo"
+write_env OIDC_OP_USER_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/userinfo"
-write_env OIDC_OP_LOGOUT_ENDPOINT          "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/logout"
+write_env OIDC_OP_LOGOUT_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/logout"
-write_env OIDC_OP_JWKS_ENDPOINT            "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/certs"
+write_env OIDC_OP_JWKS_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/certs"
-write_env OIDC_RP_CLIENT_ID                "$KC_CLIENT"
+write_env OIDC_RP_CLIENT_ID "$KC_CLIENT"
-write_env OIDC_RP_SIGN_ALGO                "RS256"
+write_env OIDC_RP_SIGN_ALGO "RS256"
-write_env OIDC_RP_SCOPES                   "openid email profile"
+write_env OIDC_RP_SCOPES "openid email profile"
 # 3) Trigger an in-place redeploy so the env update takes effect. --force re-deploys even when
 # the recipe hasn't changed; --chaos avoids the chaos prompt; --no-input non-interactive.
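Reviewer note (illustrative, not part of the patch): the `write_env` shell function above deletes any existing `KEY=` line and appends a fresh one, taking care that the file ends with a newline first. The same idea can be sketched in Python for comparison. This version rewrites the whole file instead of appending, which sidesteps the missing-trailing-newline case by construction. The function name and the sample keys are hypothetical, and the regex is only an approximation of the `sed` pattern in the script.

```python
import os
import re
import tempfile


def write_env(env_path: str, key: str, val: str) -> None:
    """Drop any existing live or commented `key=` line, then add `key=val` last."""
    # Roughly mirrors the sed pattern: optional whitespace, optional '#', key, '='.
    pattern = re.compile(rf"^\s*#?\s*{re.escape(key)}=")
    with open(env_path, "r", encoding="utf-8") as fh:
        # splitlines() copes with a final line that lacks a trailing newline.
        lines = fh.read().splitlines()
    kept = [line for line in lines if not pattern.match(line)]
    kept.append(f"{key}={val}")
    with open(env_path, "w", encoding="utf-8") as fh:
        # Every line, including the last, is newline-terminated on write.
        fh.write("\n".join(kept) + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.env")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("# OIDC_REALM=old\nTIMEOUT=900")  # note: no trailing newline
        write_env(path, "OIDC_REALM", "docs-abc123")
        with open(path, encoding="utf-8") as fh:
            print(fh.read(), end="")
```

Because each line is re-terminated on write, a later key can never be glued onto `TIMEOUT=900`, which is the malformed `TIMEOUT=900OIDC_REALM=...` case the script comment describes.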
--- a/tests/lasuite-docs/test_install.py
+++ b/tests/lasuite-docs/test_install.py
@ -10,7 +10,8 @@ import os
 import sys
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
-from harness import browser as harness_browser, generic, lifecycle  # noqa: E402
+from harness import browser as harness_browser  # noqa: E402
+from harness import generic, lifecycle
 def test_serving_and_frontend(live_app, meta):
--- a/tests/lasuite-drive/functional/test_health_check.py
+++ b/tests/lasuite-drive/functional/test_health_check.py
@ -25,6 +25,8 @@ def test_lasuite_drive_returns_200(live_app):
    status, _ = harness_http.retry_http_get(
        url, expect_status=(200, 301, 302), max_wait=60, interval=3
    )
-    assert status in (200, 301, 302), (
-        f"lasuite-drive at {url} returned HTTP {status} (expected 200/301/302)"
-    )
+    assert status in (
+        200,
+        301,
+        302,
+    ), f"lasuite-drive at {url} returned HTTP {status} (expected 200/301/302)"
--- a/tests/lasuite-drive/functional/test_minio_storage.py
+++ b/tests/lasuite-drive/functional/test_minio_storage.py
@ -29,8 +29,8 @@ BUCKET = "drive-media-storage"
 def _mc(domain: str, script: str) -> str:
    """Run an `mc` shell script inside the minio container (root creds from /run/secrets)."""
    prelude = (
-        'set -e; '
+        "set -e; "
-        'U=$(cat /run/secrets/minio_ru); P=$(cat /run/secrets/minio_rp); '
+        "U=$(cat /run/secrets/minio_ru); P=$(cat /run/secrets/minio_rp); "
        'mc alias set ccci http://localhost:9000 "$U" "$P" >/dev/null 2>&1; '
    )
    return lifecycle.exec_in_app(domain, ["sh", "-c", prelude + script], service="minio")
@ -49,13 +49,13 @@ def test_minio_bucket_present_and_object_roundtrip(live_app):
        domain,
        # upload via stdin; list the object; read it back (tagged); then delete.
        f'printf %s "{marker}" | mc pipe ccci/{BUCKET}/{key} >/dev/null 2>&1; '
-        f'mc ls ccci/{BUCKET}/{key}; '
+        f"mc ls ccci/{BUCKET}/{key}; "
        f'echo "READBACK:$(mc cat ccci/{BUCKET}/{key})"; '
-        f'mc rm ccci/{BUCKET}/{key} >/dev/null 2>&1',
+        f"mc rm ccci/{BUCKET}/{key} >/dev/null 2>&1",
    )
    # The object was listed (its key appears) and its content round-tripped intact.
    assert f"{marker}.txt" in out, f"uploaded object not listed in bucket: {out!r}"
-    assert f"READBACK:{marker}" in out, (
-        f"object content did not round-trip through MinIO; got: {out!r}"
-    )
+    assert (
+        f"READBACK:{marker}" in out
+    ), f"object content did not round-trip through MinIO; got: {out!r}"
--- a/tests/lasuite-drive/functional/test_oidc_with_keycloak.py
+++ b/tests/lasuite-drive/functional/test_oidc_with_keycloak.py
@ -46,9 +46,9 @@ def test_oidc_password_grant_against_dep_keycloak(live_app, deps_creds):
    # Creds shape. WC1: realm is per-run namespaced "<parent>-<6hex>"; client_id stays the parent.
    assert kc["domain"]
-    assert re.fullmatch(r"lasuite-drive-[0-9a-f]{6}", kc["realm"]), (
-        f"realm {kc['realm']!r} not the per-run namespaced form lasuite-drive-<6hex>"
-    )
+    assert re.fullmatch(
+        r"lasuite-drive-[0-9a-f]{6}", kc["realm"]
+    ), f"realm {kc['realm']!r} not the per-run namespaced form lasuite-drive-<6hex>"
    assert kc["client_id"] == "lasuite-drive"
    assert isinstance(kc["client_secret"], str) and len(kc["client_secret"]) >= 16
    assert isinstance(kc["password"], str) and len(kc["password"]) >= 16
@ -77,16 +77,14 @@ def test_oidc_password_grant_against_dep_keycloak(live_app, deps_creds):
    # Password grant → real JWT
    token = sso.oidc_password_grant(creds)
-    assert isinstance(token, str) and token.count(".") == 2, (
-        f"access_token is not a JWT: {token!r}"
-    )
+    assert isinstance(token, str) and token.count(".") == 2, f"access_token is not a JWT: {token!r}"
    payload = json.loads(_b64url_decode(token.split(".")[1]))
    assert payload.get("iss") == expected_iss, f"JWT iss={payload.get('iss')!r} != {expected_iss!r}"
-    assert payload.get("azp") == kc["client_id"], (
-        f"JWT azp={payload.get('azp')!r} != {kc['client_id']!r}"
-    )
+    assert (
+        payload.get("azp") == kc["client_id"]
+    ), f"JWT azp={payload.get('azp')!r} != {kc['client_id']!r}"
    assert payload.get("typ") == "Bearer", f"JWT typ={payload.get('typ')!r} != 'Bearer'"
    exp = payload.get("exp")
-    assert isinstance(exp, int) and exp > time.time(), (
-        f"JWT exp={exp!r} not a future timestamp (now={time.time():.0f})"
-    )
+    assert (
+        isinstance(exp, int) and exp > time.time()
+    ), f"JWT exp={exp!r} not a future timestamp (now={time.time():.0f})"
--- a/tests/lasuite-drive/install_steps.sh
+++ b/tests/lasuite-drive/install_steps.sh
@ -28,7 +28,7 @@ if [ -z "${CCCI_DEPS_FILE:-}" ] || [ ! -s "${CCCI_DEPS_FILE}" ]; then
  exit 0
 fi
 KC_DOMAIN=$(jq -r '.keycloak.domain        // empty' "$CCCI_DEPS_FILE")
-KC_REALM=$( jq -r '.keycloak.realm         // empty' "$CCCI_DEPS_FILE")
+KC_REALM=$(jq -r '.keycloak.realm         // empty' "$CCCI_DEPS_FILE")
 KC_CLIENT=$(jq -r '.keycloak.client_id     // empty' "$CCCI_DEPS_FILE")
 KC_SECRET=$(jq -r '.keycloak.client_secret // empty' "$CCCI_DEPS_FILE")
 if [ -z "$KC_DOMAIN" ] || [ -z "$KC_SECRET" ]; then
@ -43,35 +43,38 @@ echo "  lasuite-drive install_steps: wiring OIDC at install against keycloak ${K
 # point SECRET_OIDC_RPCS_VERSION at it. (The app is not deployed yet — a swarm secret can be created
 # independently of a running stack — so the single deploy below picks up v2.)
 CUR_VER=$(grep -E '^\s*SECRET_OIDC_RPCS_VERSION=' "$ENV_PATH" | tail -1 | cut -d= -f2 | tr -d '"\r' || echo "v1")
-NEW_NUM=$(( ${CUR_VER#v} + 1 ))
+NEW_NUM=$((${CUR_VER#v} + 1))
 NEW_VER="v${NEW_NUM}"
-INSERT_LOG=$(abra app secret insert "$CCCI_APP_DOMAIN" oidc_rpcs "$NEW_VER" "$KC_SECRET" --no-input -C -o 2>&1) \
-  || INSERT_LOG=$(script -qec "abra app secret insert $CCCI_APP_DOMAIN oidc_rpcs $NEW_VER $KC_SECRET --no-input -C -o" /dev/null 2>&1) \
-  || { echo "  install_steps: abra app secret insert oidc_rpcs@$NEW_VER failed: $INSERT_LOG"; exit 1; }
+INSERT_LOG=$(abra app secret insert "$CCCI_APP_DOMAIN" oidc_rpcs "$NEW_VER" "$KC_SECRET" --no-input -C -o 2>&1) ||
+  INSERT_LOG=$(script -qec "abra app secret insert $CCCI_APP_DOMAIN oidc_rpcs $NEW_VER $KC_SECRET --no-input -C -o" /dev/null 2>&1) ||
+  {
+    echo "  install_steps: abra app secret insert oidc_rpcs@$NEW_VER failed: $INSERT_LOG"
+    exit 1
+  }
 sed -i "s|^\s*SECRET_OIDC_RPCS_VERSION=.*|SECRET_OIDC_RPCS_VERSION=$NEW_VER|" "$ENV_PATH"
 echo "  install_steps: oidc_rpcs secret inserted at $NEW_VER (was $CUR_VER)"
 # 2) Write the OIDC env vars (explicit endpoints — deterministic, no reliance on ${AUTH_DOMAIN}
 # expansion). Mirrors the recipe-maintainer impress/La Suite OIDC env contract.
-write_env () {
+write_env() {
  local key="$1" val="$2"
  sed -i "/^\s*#\?\s*${key}=/d" "$ENV_PATH"
-  [ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >> "$ENV_PATH"
+  [ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >>"$ENV_PATH"
-  printf '%s=%s\n' "$key" "$val" >> "$ENV_PATH"
+  printf '%s=%s\n' "$key" "$val" >>"$ENV_PATH"
 }
-write_env AUTH_DOMAIN                      "$KC_DOMAIN"
+write_env AUTH_DOMAIN "$KC_DOMAIN"
-write_env OIDC_REALM                       "$KC_REALM"
+write_env OIDC_REALM "$KC_REALM"
-write_env OIDC_OP_JWKS_ENDPOINT            "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/certs"
+write_env OIDC_OP_JWKS_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/certs"
-write_env OIDC_OP_AUTHORIZATION_ENDPOINT   "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/auth"
+write_env OIDC_OP_AUTHORIZATION_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/auth"
-write_env OIDC_OP_TOKEN_ENDPOINT           "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/token"
+write_env OIDC_OP_TOKEN_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/token"
-write_env OIDC_OP_USER_ENDPOINT            "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/userinfo"
+write_env OIDC_OP_USER_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/userinfo"
-write_env OIDC_OP_LOGOUT_ENDPOINT          "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/logout"
+write_env OIDC_OP_LOGOUT_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/logout"
-write_env OIDC_RP_CLIENT_ID                "$KC_CLIENT"
+write_env OIDC_RP_CLIENT_ID "$KC_CLIENT"
-write_env OIDC_RP_SIGN_ALGO                "RS256"
+write_env OIDC_RP_SIGN_ALGO "RS256"
-write_env OIDC_RP_SCOPES                   "openid email profile"
+write_env OIDC_RP_SCOPES "openid email profile"
-write_env OIDC_REDIRECT_ALLOWED_HOSTS      "[\"https://${KC_DOMAIN}\", \"https://${CCCI_APP_DOMAIN}\"]"
+write_env OIDC_REDIRECT_ALLOWED_HOSTS "[\"https://${KC_DOMAIN}\", \"https://${CCCI_APP_DOMAIN}\"]"
 # The recipe default acr_values=eidas1 is FranceConnect-specific; keycloak can't satisfy it and it
 # would break the interactive auth flow. Clear it so the keycloak OIDC client works.
-write_env OIDC_AUTH_REQUEST_EXTRA_PARAMS   "{}"
+write_env OIDC_AUTH_REQUEST_EXTRA_PARAMS "{}"
 echo "  lasuite-drive install_steps: OIDC env wired into .env (deploy will pick it up, no reconverge)"
--- a/tests/lasuite-drive/setup_custom_tests.sh
+++ b/tests/lasuite-drive/setup_custom_tests.sh
@ -29,7 +29,7 @@ docker service scale --detach "${STACK}_minio-createbuckets=1" >/dev/null 2>&1 |
 for i in $(seq 1 30); do
  MC_CID=$(docker ps -q -f "name=${STACK}_minio.1" | head -1)
  if [ -n "$MC_CID" ] && docker exec "$MC_CID" sh -c \
-       'mc alias set _c http://localhost:9000 "$(cat /run/secrets/minio_ru)" "$(cat /run/secrets/minio_rp)" >/dev/null 2>&1 && mc ls _c/drive-media-storage >/dev/null 2>&1'; then
+    'mc alias set _c http://localhost:9000 "$(cat /run/secrets/minio_ru)" "$(cat /run/secrets/minio_rp)" >/dev/null 2>&1 && mc ls _c/drive-media-storage >/dev/null 2>&1'; then
    echo "  setup: bucket drive-media-storage present after ${i} poll(s)"
    break
  fi
--- a/tests/lasuite-drive/test_install.py
+++ b/tests/lasuite-drive/test_install.py
@ -10,7 +10,8 @@ import os
 import sys
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
-from harness import browser as harness_browser, generic, lifecycle  # noqa: E402
+from harness import browser as harness_browser  # noqa: E402
+from harness import generic, lifecycle
 def test_serving_and_frontend(live_app, meta):
--- a/tests/lasuite-meet/functional/test_health_check.py
+++ b/tests/lasuite-meet/functional/test_health_check.py
@ -21,6 +21,8 @@ def test_lasuite_meet_returns_200(live_app):
    status, _ = harness_http.retry_http_get(
        url, expect_status=(200, 301, 302), max_wait=60, interval=3
    )
-    assert status in (200, 301, 302), (
-        f"lasuite-meet at {url} returned HTTP {status} (expected 200/301/302)"
-    )
+    assert status in (
+        200,
+        301,
+        302,
+    ), f"lasuite-meet at {url} returned HTTP {status} (expected 200/301/302)"
Author	SHA1	Message	Date
autonomic-bot	b6e12ef428	fix(harness): run-keyed run-scoped state files — CONC-A1 (same-domain runs corrupted shared deploy-count) All checks were successful continuous-integration/drone/push Build is passing Details The four CCCI state files (deploys countfile, opstate, deps, depskip) were keyed by app domain in shared /tmp. A second run of the same domain executes its main() preamble + deploy_app's pre-lock _record_deploy BEFORE blocking at the app lock, so it reset/polluted the live first run's counter (false DG4.1 deploy-count=2, build 279) and the first run's end-of-run os.remove crashed the second (FileNotFoundError, build 281). Masked pre-restructure by the end-to-end recipe flock. Now keyed by run id + harness pid via _run_state_path(); children receive exact paths via the CCCI_*_FILE env vars, so domain keying was never load-bearing. tests/concurrency/test_run_state.py: path-invariant cases + a real-process regression (helpers.py deploy-count-run) reproducing the live interleaving — verified to FAIL under simulated shared keying. docs/concurrency.md §3 updated.	2026-06-10 08:16:09 +00:00
autonomic-bot	e1c4198c08	fix(ci): recipe-ci wrapper — capture harness rc, clear traps before exit (green runs no longer exit 1) All checks were successful continuous-integration/drone/push Build is passing Details The drone exec runner's step shell is set -e. On a NORMAL harness exit the EXIT trap still fired and its kill of the already-exited process group failed with ESRCH, poisoning the script's exit status: build 269 (plausible#3) ran fully GREEN (all tiers pass, level=4) but the step exited 1. Reproduced minimally with sh -e and bash -e on the host; the fixed wrapper verified for all three paths: green rc=0, red rc=7 (propagated), TERM-to-shell -> child gets TERM and wrapper exits 143. Cancel forwarding semantics unchanged.	2026-06-10 04:54:40 +00:00
autonomic-bot	d3fe9e26bb	docs: P5 concurrency spec rewrite — one lock, one structural isolation, the invariant chain All checks were successful continuous-integration/drone/push Build is passing Details Rewritten to the restructured model: lifetime-hardening guards (PDEATHSIG/SIGTERM/SIGALRM + setsid/trap), per-run ABRA_DIR isolation (same-recipe runs now parallel), per-app-domain flock (double-!testme serialisation), flock-probe janitor decision table (incl. the inode-identity race rows), updated failure-mode table (cancel now tears down via the harness's own funnel; reboot reaps immediately; 60-min deadline bounds everything), single-knob config table, how to run tests/concurrency, fresh file/symbol index + deleted-symbol list for grep verification. Also drops the last stale concurrency.limit mention from the .drone.yml header comment.	2026-06-10 04:32:54 +00:00
autonomic-bot	84d90fb655	test(concurrency): real-kernel suite for the restructured model — 20 tests, 19 plan cases All checks were successful continuous-integration/drone/push Build is passing Details tests/concurrency/ — NOT in the default `pytest tests/unit` gate; run explicitly with `pytest tests/concurrency -q`. flock/prctl/alarm are never mocked: helper subprocesses (helpers.py) hold real locks and install the real lifetime guards; locks live in a per-test tmp dir via CCCI_APP_LOCK_DIR; every helper (and recorded grandchild) is reaped by fixture cleanup. - test_locks.py (cases 1-4): SIGKILL auto-release; LOCK_NB held/unheld semantics; PEP 446 fd-not-inherited (holder's child survives, lock still releases); same-domain second acquire blocks until first holder exits. - test_janitor.py (cases 5-12): orphan reaped once + lockfile unlinked; live holder never reaped + logged; new-run acquire blocks until a slow reap completes (reap-under-probe-lock); two overlapping janitors -> exactly one reaps (flock arbitration); reboot sim (no lockfile) reaps immediately with no age wait; >120min-held lock flagged 'possible leaked run' and NOT stolen; warm/canonical names never probed (no lockfile even created); directory-as-lockfile and missing lock dir degrade to skip+log, never crash. - test_lifetime.py (cases 13-16): PDEATHSIG (wrapper parent SIGKILL'd -> guarded child TERM'd, teardown marker, lock released); already-orphaned helper REFUSES to run (ppid race); 2s deadline alarm -> teardown + exit 142 + lock released; SIGTERM -> teardown + exit 143 + lock released. - test_abra_dir.py (cases 17-19 + 18b): per-run dir built + $ABRA_DIR exported before the first abra call (recording stub abra on PATH); two CONCURRENT same-recipe fetch+checkout flows into different ABRA_DIRs -> divergent correct trees, canonical staged clone untouched; .env written through the servers/ symlink lands in the canonical path (env_get/env_set agree); manual runs get pid-suffixed dirs. 
On cc-ci: pytest tests/concurrency -q -> 20 passed; tests/unit -> 138 passed; lint PASS.	2026-06-10 04:29:36 +00:00
autonomic-bot	91d3cc7e99	chore(ci): P4 config cleanup — DRONE_RUNNER_CAPACITY is the single concurrency knob All checks were successful continuous-integration/drone/push Build is passing Details Remove concurrency.limit from the recipe-ci pipeline (.drone.yml): it duplicated DRONE_RUNNER_CAPACITY (nix/modules/drone-runner.nix maxTests) and the two had to be kept in step by hand (docs/concurrency.md §8.6). maxTests comment updated to state it is the single knob and to describe the new safety model.	2026-06-10 04:19:35 +00:00
autonomic-bot	17ebdf39ac	feat(harness): P3 per-run ABRA_DIR — structural recipe-tree isolation, recipe flock deleted All checks were successful continuous-integration/drone/push Build is passing Details - run_recipe_ci.setup_run_abra_dir(): builds <runs_dir>/<run-id>/abra with servers/ and catalogue/ symlinked to the canonical ~/.abra (app .env files keep landing in the shared canonical path, so janitor discovery and env-based teardown are unchanged; per-domain filenames + the P2 app-domain lock prevent write conflicts) and a FRESH empty recipes/ — each run clones + checkouts its own recipe trees. Exported as $ABRA_DIR (honored by the abra CLI, verified on-host) before ANY abra call. Manual runs get manual-<pid> isolation. - fetch_recipe(): plain clone into $ABRA_DIR/recipes/<recipe> — no shared-tree rm-rf, no lock. CCCI_SKIP_FETCH=1 now copies the canonically-staged clone into the per-run tree (same staging workflow, run reads staged state). - abra.abra_dir()/recipe_dir(): single resolution rule ($ABRA_DIR else ~/.abra), used by recipe_checkout, has_lightweight_version_tags, recipe_head_commit, recipe_versions, generic._recipe_dir, lifecycle.prepull_images, snapshot_recipe_tests, and warm_reconcile._recipe_dir (which keeps the canonical default for its own systemd runs but follows the per-run tree when imported by promote_canonical inside a run). - deleted: lifecycle.acquire_recipe_lock, RECIPE_LOCK_DIR, the main() call site and the must-lock-before-fetch ordering rule. - tests/{ghost,discourse}/install_steps.sh: RECIPE_DIR resolves ${ABRA_DIR:-$HOME/.abra} so the compose.ccci.yml overlay lands in the tree the run actually deploys from (mechanical path fix required by per-run trees; no assertion/gate touched — see DECISIONS.md). - .drone.yml comments updated (HOME=/root rationale now via the servers symlink).	2026-06-10 04:18:33 +00:00
autonomic-bot	b302f3ab63	feat(harness): P2 flock-probe janitor — the kernel flock IS the liveness oracle All checks were successful continuous-integration/drone/push Build is passing Details - acquire_app_lock(domain): exclusive flock on /run/lock/cc-ci-app-<domain>.lock, taken in deploy_app exactly where register_run_app was (BEFORE app creation); blocks with a log line when another run of the same domain is in flight (double-!testme serialisation). The file object is retained in module-level _held_app_locks so GC can never close the fd and silently release the lock. mtime is touched at acquisition (lock age for the long-held flag). - janitor(): probes each candidate's lock (discovery unchanged: abra app ls + docker-service sweep vs RUN_APP_RE). Acquirable -> orphan -> teardown_app(verify=False) WHILE HOLDING the probe lock (a new same-domain run blocks until the reap finishes), then unlink before release. Held -> live run -> leave it; held >120min (2x hard deadline) -> warn, never steal. Stale unheld lockfiles with no app are unlinked on sight. Unreadable lockfile -> skip + log. - unlink/recreate race guard (both sides): after ANY acquisition, verify the locked fd still is the inode the path names (fstat vs stat); a waiter that won a just-unlinked inode retries on the live path, and a probe that won one skips (unlinking now would hit a newer run's file). - deleted: register_run_app, unregister_run_app, _run_owner_state, _registry_path, ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE + age fallback, _stack_age_seconds, pid-reuse guard. teardown_app no longer unregisters (release is process exit). janitor() takes no args now. - post-reboot: /run/lock is tmpfs -> lockfiles gone -> probe trivially acquires -> immediate reap (improvement over the old 2h age fallback).	2026-06-10 04:11:31 +00:00
autonomic-bot	b492f995bd	feat(harness): P1 lock-lifetime hardening — PDEATHSIG + SIGTERM/SIGALRM teardown funnel + 60-min hard deadline All checks were successful continuous-integration/drone/push Build is passing Details - new harness/lifetime.py: install_lifetime_guards() arms PR_SET_PDEATHSIG(SIGTERM) (with post-prctl ppid==1 orphan refusal), a SIGTERM handler raising SystemExit through the run's finally: teardown funnel (exit 143), and signal.alarm(3600) funnelling SIGALRM the same way with a distinct deadline log line (exit 142). Re-entrant signals during teardown are logged and ignored (begin_teardown guard) so a second signal can't abort the running cleanup. - run_recipe_ci.main(): guards installed first thing, before any abra call/lock; both teardown finally: blocks (cold + quick) mark begin_teardown(). - .drone.yml recipe-ci step: harness runs under setsid in its own process group; a trap forwards the step shell's TERM/EXIT to the whole group so drone cancel reaches the harness instead of leaking it (docs/concurrency.md §8.1). - PEP 446 note on the recipe-lock open(): the fd is non-inheritable, children never carry it.	2026-06-10 04:04:28 +00:00
autonomic-bot	45afccbef5	status(conc): bootstrap phase state files — P1 in flight on branch restructure/concurrency All checks were successful continuous-integration/drone/push Build is passing Details	2026-06-10 04:00:12 +00:00
autonomic-bot	48d03d8405	chore(conc): seed REVIEW-conc.md — adversary online, baseline pre-read (no verdict) All checks were successful continuous-integration/drone/push Build is passing Details	2026-06-10 03:56:26 +00:00
autonomic-bot	5b65c6caa3	docs: concurrency spec — how parallel recipe runs stay safe (for review/restructuring) All checks were successful continuous-integration/drone/push Build is passing Details Documents the capacity=2 concurrent-run system as landed in `c0df77d`, `68ef0f8`, `e6d55b5`: config knobs, isolation model, per-recipe flock, active-run registry + three-way janitor, convergence interactions, failure-mode guarantees, and known limitations / restructuring candidates.	2026-06-10 03:05:20 +00:00
autonomic-bot	157d06dc77	Merge pull request 'test(plausible): psql -q in _register_site — -t does not suppress command tags' (#9) from test/plausible-psql-quiet into main All checks were successful continuous-integration/drone/push Build is passing Details continuous-integration/drone Build is passing Details	2026-06-09 23:12:37 +00:00
autonomic-bot	e6d55b53c7	fix(harness): a paused swarm update is settled — only active states block convergence All checks were successful continuous-integration/drone/push Build is passing Details continuous-integration/drone Build is passing Details `68ef0f8` made services_converged() require UpdateStatus settled, treating 'paused' as in flight. But swarm's default update-failure-action pauses the update on a single task flicker and the flag persists FOREVER (until the next update): immich CI 241 had the app service 'paused' from a restart during restore while the service was back at 1/1 and healthy — every subsequent wait hung to its deadline and the run had to be killed. Only 'updating' and 'rollback_started' now block convergence: those are the states swarm is actively driving (the 238 stop-first race lives in 'updating'). 'paused'/'rollback_paused' make no progress without intervention, so waiting on them is pointless — N/N replicas is already required, and the HTTP-health and tier assertions still gate whether the app actually works. lint: PASS, unit tests: 138 passed.	2026-06-09 23:07:36 +00:00
autonomic-bot	79c652ddd3	test(plausible): psql -q in _register_site — -t does not suppress command tags All checks were successful continuous-integration/drone/push Build is passing Details psql -tAc still prints INSERT/CREATE command tags (e.g. "INSERT 0 1"), so _register_site asserted out == site against "INSERT 0 1\nsite" and both event-tracking roundtrip tests failed on their very first run (build 237 — the custom tier had never executed before; install always failed earlier). -q suppresses the tags; verified against the recipe db container.	2026-06-09 22:50:55 +00:00
autonomic-bot	68ef0f84fb	fix(harness): convergence must span stop-first rolling updates (immich 238 backup 409) Some checks reported errors continuous-integration/drone/push Build is passing Details continuous-integration/drone Build was killed Details services_converged() accepted N/N replicas as converged — but a chaos redeploy that changes a non-app service image (immich PR #2 moves the db to the vectorchord pin) registers a stop-first rolling update that swarm may not have STARTED yet: the OLD task still shows 1/1, the wait passes, and the task dies seconds later. Build 238: backupbot resolved the db hook container, the task was killed in the gap, and the pre-hook exec crashed the whole backup with a 409 -> no dump in the snapshot -> restore had nothing -> RED. - services_converged() now also requires every service's swarm UpdateStatus to be settled ('', completed, rollback_completed) — updating/paused/rollback in flight is NOT converged. Strictly stricter: no gate is weakened. - backup_app() gains a bounded (300s) settle-wait before 'abra app backup create' as defence in depth; on timeout the backup still runs and the tier's assertion delivers the verdict. lint: PASS, unit tests: 138 passed.	2026-06-09 22:10:55 +00:00
autonomic-bot	c828f6cdd0	Merge remote-tracking branch 'origin/test/plausible-upgrade-base-3.0.1' Some checks failed continuous-integration/drone/push Build is passing Details continuous-integration/drone Build is failing Details	2026-06-09 21:57:39 +00:00
autonomic-bot	c0df77d0d9	fix(harness): make concurrent recipe runs safe (per-recipe flock + active-run registry) All checks were successful continuous-integration/drone/push Build is passing Details capacity=2 went live with three stale capacity=1-era assumptions that corrupted concurrent runs (immich 229/230 '/pg_backup.sh: No such file'): - ~/.abra/recipes/<recipe> is ONE shared working tree that fetch_recipe rm-rf's/ reclones and the upgrade tier git-checkouts mid-run. Same-recipe runs now serialise on an exclusive flock (/run/lock/cc-ci-recipe-<recipe>.lock), taken in main() BEFORE fetch_recipe and held for the whole run; the kernel releases it on any process death, so there is no stale-lock failure mode. Different recipes still run in parallel. - CCCI_JANITOR_MAX_AGE=0 made a starting build reap ANY in-flight run app. Every run now registers its app domain + pid in /run/cc-ci-active/<domain> before app creation; the janitor checks the owner: alive (pid is a live run_recipe_ci process) -> never reaped; dead -> reaped immediately; unknown (pre-registry or post-reboot) -> age fallback (default 2h). The MAX_AGE=0 env override is gone from .drone.yml. - .drone.yml: concurrency.limit 1 -> 2 to match DRONE_RUNNER_CAPACITY=2; the 'safe because capacity=1' comments now describe the flock+registry model. lint: PASS, unit tests: 138 passed.	2026-06-09 21:56:25 +00:00
autonomic-bot	9a7772563a	style: repo-wide lint pass — make the lint gate green again Push builds have been RED on the lint step since ~build 209 from accumulated formatting drift. This is the mechanical cleanup: ruff format + ruff --fix (UP038 isinstance unions, SIM105 contextlib.suppress, UP031 f-strings, SIM115 tempfile context manager), shfmt -i 2 -ci, nixpkgs-fmt/statix/deadnix (merged attrsets, dropped unused lib args), yamllint, and shell quoting fixes in tests/lasuite-docs/setup_custom_tests.sh. No behaviour changes intended; lint: PASS, unit tests: 138 passed.	2026-06-09 21:56:15 +00:00
autonomic-bot	1ba0d961a3	test(plausible): pin UPGRADE_BASE_VERSION to 3.0.1+v2.0.0 (newest published) Some checks failed continuous-integration/drone/push Build is failing Details The harness default base (recipe_versions[-2]) resolves to 3.0.0+v2.0.0 for the open 3.1.0 upgrade PR. That release predates x86_64 support in the clickhouse entrypoint (added 3.0.1): on this amd64 host it downloads clickhouse-backup-linux-x86_64.tar.gz — a deterministic HTTP 404 — and with set -e + a silenced wget the container exits 1 before logging anything, crash-looping until the deploy times out. The base therefore can never converge, regardless of the PR content (the published tag is immutable). This is exactly the case the harness documents for UPGRADE_BASE_VERSION: a PR adding its version ABOVE the newest published tag, where the true predecessor is [-1] (3.0.1+v2.0.0), not [-2]. The upgrade tier then tests the real operator path 3.0.1 -> 3.1.0. Pairs with recipe-maintainers/plausible#3 (its !testme can only go green once this lands).	2026-06-09 19:24:21 +00:00
autonomic-bot	e76d4005ab	chore(runner): raise CI concurrency to 2 (parallel recipe testing) (#8) Some checks reported errors continuous-integration/drone/push Build is failing Details continuous-integration/drone Build was killed Details	2026-06-09 18:35:19 +00:00
autonomic-bot	c32e6105d0	feat(reports): same-origin /pr proxy for the Recipe Report live STATUS column (#7) Some checks failed continuous-integration/drone/push Build is failing Details continuous-integration/drone Build is failing Details	2026-06-09 13:16:12 +00:00
autonomic-bot	c51cd84159	feat(harness): intentional skips + custom-html-tiny functional test; 4-rung ladder (#6) Some checks failed continuous-integration/drone/push Build is failing Details Declare intentional skips + custom-html-tiny functional test; 4-rung level ladder - recipe_meta.EXPECTED_NA = {rung: reason} lists intentionally-skipped rungs; any essential rung skipped and not listed is unintentional. Skips still cap the level (never inflate). results.json: skips:{intentional,unintentional} + level_cap_rung. - Level ladder = the four essential rungs (install, upgrade, backup/restore, functional; top = L4). integration & recipe-local are optional, not leveled (SSO still enforced for the run verdict, unchanged). - Card shows skipped rungs as INTENTIONAL SKIP (green, reason below) / UNINTENTIONAL SKIP (amber); level badge gains an expected/gap? third segment. - custom-html-tiny: functional serve test (exact-byte round-trip + 404); declares backup_restore intentionally skipped (stateless static server). Independently verified by the adversary: 138 unit tests pass cold; live full-stage run on custom-html-tiny green (upgrade tier ran; level 2; correct skips/badge); clean teardown.	2026-06-09 03:12:11 +00:00
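
The flock-probe idea from commit b302f3ab63 (the kernel flock as liveness oracle, plus the fstat-vs-stat identity check after every acquisition) can be sketched roughly as below. This is an illustration only, not the harness code: the function names `lock_path`, `acquire`, and `probe`, the `lock_dir` parameter, and the return strings are hypothetical, and the real janitor tears the app down while holding the probe lock, which is omitted here.

```python
import fcntl
import os


def lock_path(lock_dir: str, domain: str) -> str:
    # File naming follows the commit message: cc-ci-app-<domain>.lock
    return os.path.join(lock_dir, f"cc-ci-app-{domain}.lock")


def _same_inode(fd: int, path: str) -> bool:
    """True if the locked fd is still the file that `path` names."""
    try:
        on_disk = os.stat(path)
    except FileNotFoundError:
        return False
    held = os.fstat(fd)
    return (held.st_dev, held.st_ino) == (on_disk.st_dev, on_disk.st_ino)


def acquire(lock_dir: str, domain: str):
    """Blocking exclusive lock; retry if the file was unlinked while waiting."""
    path = lock_path(lock_dir, domain)
    while True:
        f = open(path, "a")
        fcntl.flock(f, fcntl.LOCK_EX)
        if _same_inode(f.fileno(), path):
            # Caller must keep this object alive: closing it releases the lock.
            return f
        f.close()  # won a just-unlinked inode, so retry on the live path


def probe(lock_dir: str, domain: str) -> str:
    """Classify a candidate: 'live' (lock held), 'orphan' (acquirable), 'skip' (identity race)."""
    path = lock_path(lock_dir, domain)
    with open(path, "a") as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return "live"  # another open file description holds the lock
        if not _same_inode(f.fileno(), path):
            return "skip"  # unlinking now could hit a newer run's file
        # (hypothetical) teardown of the orphaned app would happen here
        os.unlink(path)  # unlink before the lock is released on close
        return "orphan"
```

With a holder in place, `probe` reports `live`; once the holder's file object is closed (or its process dies), the next `probe` acquires the lock, reports `orphan`, and removes the lockfile. The lock directory is passed in so the sketch can run against a temporary directory rather than `/run/lock`.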