Complete reference for per-recipe CI customization: all 18 recipe_meta keys
(incl. the base pin UPGRADE_BASE_VERSION), the six divergent meta loaders,
every hook file (test_<op>.py overlays, ops.py, install_steps.sh,
setup_custom_tests.sh, compose.ccci.yml), env contracts, and §8 known
limitations / restructuring candidates (R1 loader drift, R2 dead SCREENSHOT
knob, R6 silent-typo hazard, ...). Written for operator review ahead of a
possible restructure.
Builds 290+291 (same immich domain) both success: 291 logged block line + acquired,
both deploy-count=1 (290 no false-2, 291 no FileNotFoundError), zero leakage.
Serialization also observed live in lslocks. CONC-A1 conditions 1-3 met; veto lifted.
Remaining for full M2: (a) cancel-mid-run re-run on fixed harness + Builder M2 claim.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The four CCCI state files (deploys countfile, opstate, deps, depskip) were keyed
by app domain in shared /tmp. A second run of the same domain executes its main()
preamble + deploy_app's pre-lock _record_deploy BEFORE blocking at the app lock,
so it reset/polluted the live first run's counter (false DG4.1 deploy-count=2,
build 279) and the first run's end-of-run os.remove crashed the second
(FileNotFoundError, build 281). Masked pre-restructure by the end-to-end recipe
flock. Now keyed by run id + harness pid via _run_state_path(); children receive
exact paths via the CCCI_*_FILE env vars, so domain keying was never load-bearing.
tests/concurrency/test_run_state.py: path-invariant cases + a real-process
regression (helpers.py deploy-count-run) reproducing the live interleaving —
verified to FAIL under simulated shared keying. docs/concurrency.md §3 updated.
Builds 279+281 (immich#2, same domain immi-ad3e33) both RED: 279 false DG4.1
'deploy-count 2!=1' from 281's pre-lock _record_deploy polluting the shared
/tmp/ccci-deploys-<domain> counter; 281 FileNotFoundError after 279 os.remove'd it.
Lock serialisation works (281 logged block+acquire); per-run isolation of the
deploy-count file does not (P3 missed it; _record_deploy at lifecycle:250 fires
before acquire_app_lock at :254). Control build 275 (isolated) green.
Veto DONE until counter keyed per-run + same-domain test + live (c) both-green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The drone exec runner's step shell is set -e. On a NORMAL harness exit the EXIT trap still
fired and its kill of the already-exited process group failed with ESRCH, poisoning the
script's exit status: build 269 (plausible#3) ran fully GREEN (all tiers pass, level=4) but
the step exited 1. Reproduced minimally with sh -e and bash -e on the host; the fixed wrapper
verified for all three paths: green rc=0, red rc=7 (propagated), TERM-to-shell -> child gets
TERM and wrapper exits 143. Cancel forwarding semantics unchanged.
Rewritten to the restructured model: lifetime-hardening guards (PDEATHSIG/SIGTERM/SIGALRM +
setsid/trap), per-run ABRA_DIR isolation (same-recipe runs now parallel), per-app-domain flock
(double-!testme serialisation), flock-probe janitor decision table (incl. the inode-identity
race rows), updated failure-mode table (cancel now tears down via the harness's own funnel;
reboot reaps immediately; 60-min deadline bounds everything), single-knob config table, how to
run tests/concurrency, fresh file/symbol index + deleted-symbol list for grep verification.
Also drops the last stale concurrency.limit mention from the .drone.yml header comment.
tests/concurrency/ — NOT in the default `pytest tests/unit` gate; run explicitly with
`pytest tests/concurrency -q`. flock/prctl/alarm are never mocked: helper subprocesses
(helpers.py) hold real locks and install the real lifetime guards; locks live in a per-test
tmp dir via CCCI_APP_LOCK_DIR; every helper (and recorded grandchild) is reaped by fixture
cleanup.
- test_locks.py (cases 1-4): SIGKILL auto-release; LOCK_NB held/unheld semantics; PEP 446
fd-not-inherited (holder's child survives, lock still releases); same-domain second acquire
blocks until first holder exits.
- test_janitor.py (cases 5-12): orphan reaped once + lockfile unlinked; live holder never
reaped + logged; new-run acquire blocks until a slow reap completes (reap-under-probe-lock);
two overlapping janitors -> exactly one reaps (flock arbitration); reboot sim (no lockfile)
reaps immediately with no age wait; >120min-held lock flagged 'possible leaked run' and NOT
stolen; warm/canonical names never probed (no lockfile even created); directory-as-lockfile
and missing lock dir degrade to skip+log, never crash.
- test_lifetime.py (cases 13-16): PDEATHSIG (wrapper parent SIGKILL'd -> guarded child TERM'd,
teardown marker, lock released); already-orphaned helper REFUSES to run (ppid race); 2s
deadline alarm -> teardown + exit 142 + lock released; SIGTERM -> teardown + exit 143 +
lock released.
- test_abra_dir.py (cases 17-19 + 18b): per-run dir built + $ABRA_DIR exported before the
first abra call (recording stub abra on PATH); two CONCURRENT same-recipe fetch+checkout
flows into different ABRA_DIRs -> divergent correct trees, canonical staged clone untouched;
.env written through the servers/ symlink lands in the canonical path (env_get/env_set
agree); manual runs get pid-suffixed dirs.
On cc-ci: pytest tests/concurrency -q -> 20 passed; tests/unit -> 138 passed; lint PASS.
Remove concurrency.limit from the recipe-ci pipeline (.drone.yml): it duplicated
DRONE_RUNNER_CAPACITY (nix/modules/drone-runner.nix maxTests) and the two had to be kept in
step by hand (docs/concurrency.md §8.6). maxTests comment updated to state it is the single
knob and to describe the new safety model.
- run_recipe_ci.setup_run_abra_dir(): builds <runs_dir>/<run-id>/abra with servers/ and
catalogue/ symlinked to the canonical ~/.abra (app .env files keep landing in the shared
canonical path, so janitor discovery and env-based teardown are unchanged; per-domain
filenames + the P2 app-domain lock prevent write conflicts) and a FRESH empty recipes/ —
each run clones + checkouts its own recipe trees. Exported as $ABRA_DIR (honored by the
abra CLI, verified on-host) before ANY abra call. Manual runs get manual-<pid> isolation.
- fetch_recipe(): plain clone into $ABRA_DIR/recipes/<recipe> — no shared-tree rm-rf, no lock.
CCCI_SKIP_FETCH=1 now copies the canonically-staged clone into the per-run tree (same staging
workflow, run reads staged state).
- abra.abra_dir()/recipe_dir(): single resolution rule ($ABRA_DIR else ~/.abra), used by
recipe_checkout, has_lightweight_version_tags, recipe_head_commit, recipe_versions,
generic._recipe_dir, lifecycle.prepull_images, snapshot_recipe_tests, and
warm_reconcile._recipe_dir (which keeps the canonical default for its own systemd runs but
follows the per-run tree when imported by promote_canonical inside a run).
- deleted: lifecycle.acquire_recipe_lock, RECIPE_LOCK_DIR, the main() call site and the
must-lock-before-fetch ordering rule.
- tests/{ghost,discourse}/install_steps.sh: RECIPE_DIR resolves ${ABRA_DIR:-$HOME/.abra} so the
compose.ccci.yml overlay lands in the tree the run actually deploys from (mechanical path fix
required by per-run trees; no assertion/gate touched — see DECISIONS.md).
- .drone.yml comments updated (HOME=/root rationale now via the servers symlink).
- acquire_app_lock(domain): exclusive flock on /run/lock/cc-ci-app-<domain>.lock, taken in
deploy_app exactly where register_run_app was (BEFORE app creation); blocks with a log line
when another run of the same domain is in flight (double-!testme serialisation). The file
object is retained in module-level _held_app_locks so GC can never close the fd and silently
release the lock. mtime is touched at acquisition (lock age for the long-held flag).
- janitor(): probes each candidate's lock (discovery unchanged: abra app ls + docker-service
sweep vs RUN_APP_RE). Acquirable -> orphan -> teardown_app(verify=False) WHILE HOLDING the
probe lock (a new same-domain run blocks until the reap finishes), then unlink before release.
Held -> live run -> leave it; held >120min (2x hard deadline) -> warn, never steal. Stale
unheld lockfiles with no app are unlinked on sight. Unreadable lockfile -> skip + log.
- unlink/recreate race guard (both sides): after ANY acquisition, verify the locked fd still is
the inode the path names (fstat vs stat); a waiter that won a just-unlinked inode retries on
the live path, and a probe that won one skips (unlinking now would hit a newer run's file).
- deleted: register_run_app, unregister_run_app, _run_owner_state, _registry_path,
ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE + age fallback, _stack_age_seconds, pid-reuse guard.
teardown_app no longer unregisters (release is process exit). janitor() takes no args now.
- post-reboot: /run/lock is tmpfs -> lockfiles gone -> probe trivially acquires -> immediate
reap (improvement over the old 2h age fallback).
- new harness/lifetime.py: install_lifetime_guards() arms PR_SET_PDEATHSIG(SIGTERM) (with
post-prctl ppid==1 orphan refusal), a SIGTERM handler raising SystemExit through the run's
finally: teardown funnel (exit 143), and signal.alarm(3600) funnelling SIGALRM the same way
with a distinct deadline log line (exit 142). Re-entrant signals during teardown are logged
and ignored (begin_teardown guard) so a second signal can't abort the running cleanup.
- run_recipe_ci.main(): guards installed first thing, before any abra call/lock; both teardown
finally: blocks (cold + quick) mark begin_teardown().
- .drone.yml recipe-ci step: harness runs under setsid in its own process group; a trap forwards
the step shell's TERM/EXIT to the whole group so drone cancel reaches the harness instead of
leaking it (docs/concurrency.md §8.1).
- PEP 446 note on the recipe-lock open(): the fd is non-inheritable, children never carry it.