The drone exec runner's step shell is set -e. On a NORMAL harness exit the EXIT trap still
fired and its kill of the already-exited process group failed with ESRCH, poisoning the
script's exit status: build 269 (plausible#3) ran fully GREEN (all tiers pass, level=4) but
the step exited 1. Reproduced minimally with sh -e and bash -e on the host; the fixed wrapper
verified for all three paths: green rc=0, red rc=7 (propagated), TERM-to-shell -> child gets
TERM and wrapper exits 143. Cancel forwarding semantics unchanged.
Rewritten to the restructured model: lifetime-hardening guards (PDEATHSIG/SIGTERM/SIGALRM +
setsid/trap), per-run ABRA_DIR isolation (same-recipe runs now parallel), per-app-domain flock
(double-!testme serialisation), flock-probe janitor decision table (incl. the inode-identity
race rows), updated failure-mode table (cancel now tears down via the harness's own funnel;
reboot reaps immediately; 60-min deadline bounds everything), single-knob config table, how to
run tests/concurrency, fresh file/symbol index + deleted-symbol list for grep verification.
Also drops the last stale concurrency.limit mention from the .drone.yml header comment.
Remove concurrency.limit from the recipe-ci pipeline (.drone.yml): it duplicated
DRONE_RUNNER_CAPACITY (nix/modules/drone-runner.nix maxTests) and the two had to be kept in
step by hand (docs/concurrency.md §8.6). maxTests comment updated to state it is the single
knob and to describe the new safety model.
- run_recipe_ci.setup_run_abra_dir(): builds <runs_dir>/<run-id>/abra with servers/ and
catalogue/ symlinked to the canonical ~/.abra (app .env files keep landing in the shared
canonical path, so janitor discovery and env-based teardown are unchanged; per-domain
filenames + the P2 app-domain lock prevent write conflicts) and a FRESH empty recipes/ —
each run clones + checkouts its own recipe trees. Exported as $ABRA_DIR (honored by the
abra CLI, verified on-host) before ANY abra call. Manual runs get manual-<pid> isolation.
- fetch_recipe(): plain clone into $ABRA_DIR/recipes/<recipe> — no shared-tree rm-rf, no lock.
CCCI_SKIP_FETCH=1 now copies the canonically-staged clone into the per-run tree (same staging
workflow, run reads staged state).
- abra.abra_dir()/recipe_dir(): single resolution rule ($ABRA_DIR else ~/.abra), used by
recipe_checkout, has_lightweight_version_tags, recipe_head_commit, recipe_versions,
generic._recipe_dir, lifecycle.prepull_images, snapshot_recipe_tests, and
warm_reconcile._recipe_dir (which keeps the canonical default for its own systemd runs but
follows the per-run tree when imported by promote_canonical inside a run).
- deleted: lifecycle.acquire_recipe_lock, RECIPE_LOCK_DIR, the main() call site and the
must-lock-before-fetch ordering rule.
- tests/{ghost,discourse}/install_steps.sh: RECIPE_DIR resolves ${ABRA_DIR:-$HOME/.abra} so the
compose.ccci.yml overlay lands in the tree the run actually deploys from (mechanical path fix
required by per-run trees; no assertion/gate touched — see DECISIONS.md).
- .drone.yml comments updated (HOME=/root rationale now via the servers symlink).
- acquire_app_lock(domain): exclusive flock on /run/lock/cc-ci-app-<domain>.lock, taken in
deploy_app exactly where register_run_app was (BEFORE app creation); blocks with a log line
when another run of the same domain is in flight (double-!testme serialisation). The file
object is retained in module-level _held_app_locks so GC can never close the fd and silently
release the lock. mtime is touched at acquisition (lock age for the long-held flag).
- janitor(): probes each candidate's lock (discovery unchanged: abra app ls + docker-service
sweep vs RUN_APP_RE). Acquirable -> orphan -> teardown_app(verify=False) WHILE HOLDING the
probe lock (a new same-domain run blocks until the reap finishes), then unlink before release.
Held -> live run -> leave it; held >120min (2x hard deadline) -> warn, never steal. Stale
unheld lockfiles with no app are unlinked on sight. Unreadable lockfile -> skip + log.
- unlink/recreate race guard (both sides): after ANY acquisition, verify the locked fd still is
the inode the path names (fstat vs stat); a waiter that won a just-unlinked inode retries on
the live path, and a probe that won one skips (unlinking now would hit a newer run's file).
- deleted: register_run_app, unregister_run_app, _run_owner_state, _registry_path,
ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE + age fallback, _stack_age_seconds, pid-reuse guard.
teardown_app no longer unregisters (release is process exit). janitor() takes no args now.
- post-reboot: /run/lock is tmpfs -> lockfiles gone -> probe trivially acquires -> immediate
reap (improvement over the old 2h age fallback).
- new harness/lifetime.py: install_lifetime_guards() arms PR_SET_PDEATHSIG(SIGTERM) (with
post-prctl ppid==1 orphan refusal), a SIGTERM handler raising SystemExit through the run's
finally: teardown funnel (exit 143), and signal.alarm(3600) funnelling SIGALRM the same way
with a distinct deadline log line (exit 142). Re-entrant signals during teardown are logged
and ignored (begin_teardown guard) so a second signal can't abort the running cleanup.
- run_recipe_ci.main(): guards installed first thing, before any abra call/lock; both teardown
finally: blocks (cold + quick) mark begin_teardown().
- .drone.yml recipe-ci step: harness runs under setsid in its own process group; a trap forwards
the step shell's TERM/EXIT to the whole group so drone cancel reaches the harness instead of
leaking it (docs/concurrency.md §8.1).
- PEP 446 note on the recipe-lock open(): the fd is non-inheritable, children never carry it.
capacity=2 went live with three stale capacity=1-era assumptions that corrupted
concurrent runs (immich 229/230 '/pg_backup.sh: No such file'):
- ~/.abra/recipes/<recipe> is ONE shared working tree that fetch_recipe rm-rf's/
reclones and the upgrade tier git-checkouts mid-run. Same-recipe runs now
serialise on an exclusive flock (/run/lock/cc-ci-recipe-<recipe>.lock), taken
in main() BEFORE fetch_recipe and held for the whole run; the kernel releases
it on any process death, so there is no stale-lock failure mode. Different
recipes still run in parallel.
- CCCI_JANITOR_MAX_AGE=0 made a starting build reap ANY in-flight run app. Every
run now registers its app domain + pid in /run/cc-ci-active/<domain> before
app creation; the janitor checks the owner: alive (pid is a live run_recipe_ci
process) -> never reaped; dead -> reaped immediately; unknown (pre-registry or
post-reboot) -> age fallback (default 2h). The MAX_AGE=0 env override is gone
from .drone.yml.
- .drone.yml: concurrency.limit 1 -> 2 to match DRONE_RUNNER_CAPACITY=2; the
'safe because capacity=1' comments now describe the flock+registry model.
lint: PASS, unit tests: 138 passed.
bridge/bridge.py: parse_trigger(body) → (is_trigger, quick); accepts exactly
'!testme' (cold, default) and '!testme --quick' (opt-in fast lane), rejects
'!testmexyz'/'!testme foo'/etc. Threaded through both poll + webhook paths and
process_testme → trigger_build adds the CCCI_QUICK=1 Drone param (auto-exposed
to run_recipe_ci). PR comment labels a quick run lower-confidence. .drone.yml
echoes quick=. +3 unit tests (incl. the !testmexyz negative). 64 unit pass.
WC7: default !testme stays full cold; --quick opt-in, never gates merge.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
flake.nix/flake.lock STAY at root so the build ref #cc-ci is unchanged; only flake's internal
configuration.nix path updated. Root-relative refs inside moved modules re-based ../X -> ../../X
(secrets/bridge/dashboard); configuration.nix's ../../modules imports unchanged (both dirs under nix/).
Living docs (README, architecture/install/secrets/enroll) + .drone.yml comment updated to nix/...;
append-only history logs left as-is. DECISIONS.md records RL5 + the deferred-coordinated RL6.
Verified on cc-ci: nixos-rebuild build 'path:#cc-ci' -> toplevel 8i3jcad9 (BYTE-IDENTICAL to the
pre-move build — store derivations are content-addressed on file contents, module .nix not in the
runtime closure); scripts/lint.sh -> lint: PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The exec runner sets HOME to a per-build workspace, leaving ~/.abra empty
(FATA directory is empty: .../home/drone/.abra/servers). Force HOME=/root in the
step so abra and the harness's ~/.abra/recipes resolve to the real config, as the
manual runs did. Safe at capacity=1 (no concurrent build shares /root/.abra).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits .drone.yml into a push-triggered self-test pipeline and a custom-triggered
recipe-ci pipeline. The bridge fires event=custom builds with RECIPE/REF/PR/SRC
params; recipe-ci runs the shared harness (install/upgrade/backup + recipe-local)
with STAGES set and CCCI_JANITOR_MAX_AGE=0 (safe at capacity=1), concurrency limit 1.
Connects the verified !testme trigger to actual recipe CI (D2/D10 path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>