The drone exec runner's step shell is set -e. On a NORMAL harness exit the EXIT trap still
fired and its kill of the already-exited process group failed with ESRCH, poisoning the
script's exit status: build 269 (plausible#3) ran fully GREEN (all tiers pass, level=4) but
the step exited 1. Reproduced minimally with sh -e and bash -e on the host; the fixed wrapper
verified for all three paths: green rc=0, red rc=7 (propagated), TERM-to-shell -> child gets
TERM and wrapper exits 143. Cancel forwarding semantics unchanged.
Rewritten to the restructured model: lifetime-hardening guards (PDEATHSIG/SIGTERM/SIGALRM +
setsid/trap), per-run ABRA_DIR isolation (same-recipe runs now parallel), per-app-domain flock
(double-!testme serialisation), flock-probe janitor decision table (incl. the inode-identity
race rows), updated failure-mode table (cancel now tears down via the harness's own funnel;
reboot reaps immediately; 60-min deadline bounds everything), single-knob config table, how to
run tests/concurrency, fresh file/symbol index + deleted-symbol list for grep verification.
Also drops the last stale concurrency.limit mention from the .drone.yml header comment.
tests/concurrency/ — NOT in the default `pytest tests/unit` gate; run explicitly with
`pytest tests/concurrency -q`. flock/prctl/alarm are never mocked: helper subprocesses
(helpers.py) hold real locks and install the real lifetime guards; locks live in a per-test
tmp dir via CCCI_APP_LOCK_DIR; every helper (and recorded grandchild) is reaped by fixture
cleanup.
- test_locks.py (cases 1-4): SIGKILL auto-release; LOCK_NB held/unheld semantics; PEP 446
fd-not-inherited (holder's child survives, lock still releases); same-domain second acquire
blocks until first holder exits.
- test_janitor.py (cases 5-12): orphan reaped once + lockfile unlinked; live holder never
reaped + logged; new-run acquire blocks until a slow reap completes (reap-under-probe-lock);
two overlapping janitors -> exactly one reaps (flock arbitration); reboot sim (no lockfile)
reaps immediately with no age wait; >120min-held lock flagged 'possible leaked run' and NOT
stolen; warm/canonical names never probed (no lockfile even created); directory-as-lockfile
and missing lock dir degrade to skip+log, never crash.
- test_lifetime.py (cases 13-16): PDEATHSIG (wrapper parent SIGKILL'd -> guarded child TERM'd,
teardown marker, lock released); already-orphaned helper REFUSES to run (ppid race); 2s
deadline alarm -> teardown + exit 142 + lock released; SIGTERM -> teardown + exit 143 +
lock released.
- test_abra_dir.py (cases 17-19 + 18b): per-run dir built + $ABRA_DIR exported before the
first abra call (recording stub abra on PATH); two CONCURRENT same-recipe fetch+checkout
flows into different ABRA_DIRs -> divergent correct trees, canonical staged clone untouched;
.env written through the servers/ symlink lands in the canonical path (env_get/env_set
agree); manual runs get pid-suffixed dirs.
On cc-ci: pytest tests/concurrency -q -> 20 passed; tests/unit -> 138 passed; lint PASS.
Remove concurrency.limit from the recipe-ci pipeline (.drone.yml): it duplicated
DRONE_RUNNER_CAPACITY (nix/modules/drone-runner.nix maxTests) and the two had to be kept in
step by hand (docs/concurrency.md §8.6). maxTests comment updated to state it is the single
knob and to describe the new safety model.
- run_recipe_ci.setup_run_abra_dir(): builds <runs_dir>/<run-id>/abra with servers/ and
catalogue/ symlinked to the canonical ~/.abra (app .env files keep landing in the shared
canonical path, so janitor discovery and env-based teardown are unchanged; per-domain
filenames + the P2 app-domain lock prevent write conflicts) and a FRESH empty recipes/ —
each run clones + checkouts its own recipe trees. Exported as $ABRA_DIR (honored by the
abra CLI, verified on-host) before ANY abra call. Manual runs get manual-<pid> isolation.
- fetch_recipe(): plain clone into $ABRA_DIR/recipes/<recipe> — no shared-tree rm-rf, no lock.
CCCI_SKIP_FETCH=1 now copies the canonically-staged clone into the per-run tree (same staging
workflow, run reads staged state).
- abra.abra_dir()/recipe_dir(): single resolution rule ($ABRA_DIR else ~/.abra), used by
recipe_checkout, has_lightweight_version_tags, recipe_head_commit, recipe_versions,
generic._recipe_dir, lifecycle.prepull_images, snapshot_recipe_tests, and
warm_reconcile._recipe_dir (which keeps the canonical default for its own systemd runs but
follows the per-run tree when imported by promote_canonical inside a run).
- deleted: lifecycle.acquire_recipe_lock, RECIPE_LOCK_DIR, the main() call site and the
must-lock-before-fetch ordering rule.
- tests/{ghost,discourse}/install_steps.sh: RECIPE_DIR resolves ${ABRA_DIR:-$HOME/.abra} so the
compose.ccci.yml overlay lands in the tree the run actually deploys from (mechanical path fix
required by per-run trees; no assertion/gate touched — see DECISIONS.md).
- .drone.yml comments updated (HOME=/root rationale now via the servers symlink).
- acquire_app_lock(domain): exclusive flock on /run/lock/cc-ci-app-<domain>.lock, taken in
deploy_app exactly where register_run_app was (BEFORE app creation); blocks with a log line
when another run of the same domain is in flight (double-!testme serialisation). The file
object is retained in module-level _held_app_locks so GC can never close the fd and silently
release the lock. mtime is touched at acquisition (lock age for the long-held flag).
- janitor(): probes each candidate's lock (discovery unchanged: abra app ls + docker-service
sweep vs RUN_APP_RE). Acquirable -> orphan -> teardown_app(verify=False) WHILE HOLDING the
probe lock (a new same-domain run blocks until the reap finishes), then unlink before release.
Held -> live run -> leave it; held >120min (2x hard deadline) -> warn, never steal. Stale
unheld lockfiles with no app are unlinked on sight. Unreadable lockfile -> skip + log.
- unlink/recreate race guard (both sides): after ANY acquisition, verify the locked fd still is
the inode the path names (fstat vs stat); a waiter that won a just-unlinked inode retries on
the live path, and a probe that won one skips (unlinking now would hit a newer run's file).
- deleted: register_run_app, unregister_run_app, _run_owner_state, _registry_path,
ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE + age fallback, _stack_age_seconds, pid-reuse guard.
teardown_app no longer unregisters (release is process exit). janitor() takes no args now.
- post-reboot: /run/lock is tmpfs -> lockfiles gone -> probe trivially acquires -> immediate
reap (improvement over the old 2h age fallback).
- new harness/lifetime.py: install_lifetime_guards() arms PR_SET_PDEATHSIG(SIGTERM) (with
post-prctl ppid==1 orphan refusal), a SIGTERM handler raising SystemExit through the run's
finally: teardown funnel (exit 143), and signal.alarm(3600) funnelling SIGALRM the same way
with a distinct deadline log line (exit 142). Re-entrant signals during teardown are logged
and ignored (begin_teardown guard) so a second signal can't abort the running cleanup.
- run_recipe_ci.main(): guards installed first thing, before any abra call/lock; both teardown
finally: blocks (cold + quick) mark begin_teardown().
- .drone.yml recipe-ci step: harness runs under setsid in its own process group; a trap forwards
the step shell's TERM/EXIT to the whole group so drone cancel reaches the harness instead of
leaking it (docs/concurrency.md §8.1).
- PEP 446 note on the recipe-lock open(): the fd is non-inheritable, children never carry it.
68ef0f8 made services_converged() require UpdateStatus settled, treating
'paused' as in flight. But swarm's default update-failure-action pauses the
update on a single task flicker and the flag persists FOREVER (until the next
update): immich CI 241 had the app service 'paused' from a restart during
restore while the service was back at 1/1 and healthy — every subsequent wait
hung to its deadline and the run had to be killed.
Only 'updating' and 'rollback_started' now block convergence: those are the
states swarm is actively driving (the 238 stop-first race lives in 'updating').
'paused'/'rollback_paused' make no progress without intervention, so waiting on
them is pointless — N/N replicas is already required, and the HTTP-health and
tier assertions still gate whether the app actually works.
lint: PASS, unit tests: 138 passed.
psql -tAc still prints INSERT/CREATE command tags (e.g. "INSERT 0 1"), so
_register_site asserted out == site against "INSERT 0 1\nsite" and both
event-tracking roundtrip tests failed on their very first run (build 237 —
the custom tier had never executed before; install always failed earlier).
-q suppresses the tags; verified against the recipe db container.
services_converged() accepted N/N replicas as converged — but a chaos redeploy
that changes a non-app service image (immich PR #2 moves the db to the
vectorchord pin) registers a stop-first rolling update that swarm may not have
STARTED yet: the OLD task still shows 1/1, the wait passes, and the task dies
seconds later. Build 238: backupbot resolved the db hook container, the task
was killed in the gap, and the pre-hook exec crashed the whole backup with a
409 -> no dump in the snapshot -> restore had nothing -> RED.
- services_converged() now also requires every service's swarm UpdateStatus to
be settled ('', completed, rollback_completed) — updating/paused/rollback in
flight is NOT converged. Strictly stricter: no gate is weakened.
- backup_app() gains a bounded (300s) settle-wait before 'abra app backup
create' as defence in depth; on timeout the backup still runs and the tier's
assertion delivers the verdict.
lint: PASS, unit tests: 138 passed.
capacity=2 went live with three stale capacity=1-era assumptions that corrupted
concurrent runs (immich 229/230 '/pg_backup.sh: No such file'):
- ~/.abra/recipes/<recipe> is ONE shared working tree that fetch_recipe rm-rf's/
reclones and the upgrade tier git-checkouts mid-run. Same-recipe runs now
serialise on an exclusive flock (/run/lock/cc-ci-recipe-<recipe>.lock), taken
in main() BEFORE fetch_recipe and held for the whole run; the kernel releases
it on any process death, so there is no stale-lock failure mode. Different
recipes still run in parallel.
- CCCI_JANITOR_MAX_AGE=0 made a starting build reap ANY in-flight run app. Every
run now registers its app domain + pid in /run/cc-ci-active/<domain> before
app creation; the janitor checks the owner: alive (pid is a live run_recipe_ci
process) -> never reaped; dead -> reaped immediately; unknown (pre-registry or
post-reboot) -> age fallback (default 2h). The MAX_AGE=0 env override is gone
from .drone.yml.
- .drone.yml: concurrency.limit 1 -> 2 to match DRONE_RUNNER_CAPACITY=2; the
'safe because capacity=1' comments now describe the flock+registry model.
lint: PASS, unit tests: 138 passed.
Push builds have been RED on the lint step since ~build 209 from accumulated
formatting drift. This is the mechanical cleanup: ruff format + ruff --fix
(UP038 isinstance unions, SIM105 contextlib.suppress, UP031 f-strings, SIM115
tempfile context manager), shfmt -i 2 -ci, nixpkgs-fmt/statix/deadnix (merged
attrsets, dropped unused lib args), yamllint, and shell quoting fixes in
tests/lasuite-docs/setup_custom_tests.sh. No behaviour changes intended;
lint: PASS, unit tests: 138 passed.
The harness default base (recipe_versions[-2]) resolves to 3.0.0+v2.0.0 for
the open 3.1.0 upgrade PR. That release predates x86_64 support in the
clickhouse entrypoint (added 3.0.1): on this amd64 host it downloads
clickhouse-backup-linux-x86_64.tar.gz — a deterministic HTTP 404 — and with
set -e + a silenced wget the container exits 1 before logging anything,
crash-looping until the deploy times out. The base therefore can never
converge, regardless of the PR content (the published tag is immutable).
This is exactly the case the harness documents for UPGRADE_BASE_VERSION:
a PR adding its version ABOVE the newest published tag, where the true
predecessor is [-1] (3.0.1+v2.0.0), not [-2]. The upgrade tier then tests
the real operator path 3.0.1 -> 3.1.0.
Pairs with recipe-maintainers/plausible#3 (its !testme can only go green
once this lands).
Declare intentional skips + custom-html-tiny functional test; 4-rung level ladder
- recipe_meta.EXPECTED_NA = {rung: reason} lists intentionally-skipped rungs; any
essential rung skipped and not listed is unintentional. Skips still cap the level
(never inflate). results.json: skips:{intentional,unintentional} + level_cap_rung.
- Level ladder = the four essential rungs (install, upgrade, backup/restore,
functional; top = L4). integration & recipe-local are optional, not leveled
(SSO still enforced for the run verdict, unchanged).
- Card shows skipped rungs as INTENTIONAL SKIP (green, reason below) / UNINTENTIONAL
SKIP (amber); level badge gains an expected/gap? third segment.
- custom-html-tiny: functional serve test (exact-byte round-trip + 404); declares
backup_restore intentionally skipped (stateless static server).
Independently verified by the adversary: 138 unit tests pass cold; live full-stage
run on custom-html-tiny green (upgrade tier ran; level 2; correct skips/badge);
clean teardown.
nginx:alpine swarm service serving /var/lib/cc-ci-reports behind traefik
(Host(report.ci.commoninternet.net) + wildcard TLS), deployed by a reconcile
oneshot mirroring dashboard.nix. The /recipe-report skill writes the weekly
HTML pages there; nginx serves them live. report.ci.* already resolves
(wildcard *.ci DNS) and is covered by the wildcard cert.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Operator preference: each !testme should get its own comment response so a
re-run is visible in the PR timeline. process_testme now always posts a fresh
⏳ placeholder comment; watch_and_reflect edits THAT comment to the result.
(Was: reuse/edit a single marker comment in place — which made re-runs on an
unchanged head invisible, only updating commit status.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Having test_backup.py in custom-html-bkp-bad caused both bad-backup and bad-restore
to fail at the backup tier. Create custom-html-rst-bad with its own cc-ci test dir
that has ops.py+test_restore.py but NO test_backup.py, so:
- backup: only generic test_backup_artifact → PASS (snapshot exists)
- restore: pre_restore writes 'mutated', marker stays 'mutated' after restore → FAIL
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
backup-bot-two ignores backupbot.backup.path labels and always backs up
the full volume, making path-based restore-RED infeasible.
New approach: custom-html-bkp-bad has no pre_backup → marker never seeded
→ backup snapshot has no ci-marker.txt. pre_restore writes 'mutated'.
After restore: marker is MISSING or 'mutated' → test_restore_returns_state FAILS.
upgrade=skip (no version tags) is acceptable since passing_tiers_before=[install,backup].
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No ops.py::pre_backup for custom-html-bkp-bad → ci-marker.txt never seeded.
test_backup_captures_state asserts marker=='original' → MISSING → FAIL → backup=RED.
This works regardless of backupbot label behavior.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
backupbot-two ignores nonexistent backup paths and backs up the whole
volume, making the bad-path approach unreliable. New approach:
- Create recipe-maintainers/custom-html-bkp-bad on Gitea (custom-html
without backupbot.backup=true label) — SHA 4e584063a99a
- Add tests/custom-html-bkp-bad/recipe_meta.py with BACKUP_CAPABLE=True
so the harness runs the backup tier despite auto-detect returning False
- Without a labeled container, backup-bot-two produces no snapshot →
parse_snapshot_id=None → test_backup_artifact fails → backup=RED ✓
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both compose.yml uploads had empty files due to a bash encoding bug.
Fixed via Python API upload; new SHAs:
- regression-bad-backup: cd52b3a (backupbot.backup.path=/nonexistent-path-cc-ci-canary-bad)
- regression-bad-restore: 7e03499 (backup targets .backup-data subdir + command creates it)
Adversary confirmed bad-install ✓ and bad-upgrade ✓ from run artifacts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>