Placement RULE: discovery.custom_tests covers ONLY functional/ + playwright/ —
the top-level test_*.py glob for recipe dirs is removed (top level is reserved
for lifecycle overlays; zero in-repo users of top-level custom tests, verified
by sweep). Lifecycle-name exclusion inside the subdirs stays as the double-run
safety net. HC2 default-deny unchanged (repo-local custom now pinned via
functional/ in the gate test).
New conftest fixture op_state: parses $CCCI_OP_STATE_FILE (op context: versions,
artifact paths), skipping with a clear reason when unset/absent/unparseable —
overlay tests read op facts from the fixture instead of hand-parsing env (zero
existing hand-parsers found; the fixture is the documented path forward). deps
fixture landed in P2d.
Unit tests: placement-rule discovery tests (top-level custom NOT discovered;
functional/playwright are; misfiled lifecycle names excluded), op_state fixture
contract (reads file / skips without env / skips on missing file), deps fixture
attribute sugar.
Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 184 passed; scripts/lint.sh -> PASS.
harness.meta.HookCtx (frozen): .domain, .base_url, .meta (RecipeMeta), .deps
(provisioned dep creds from $CCCI_DEPS_FILE or None), .op (current lifecycle op
or None); built via meta.hook_ctx() at each hook call site.
All recipe callables now take ctx: EXTRA_ENV(ctx), UPGRADE_EXTRA_ENV(ctx),
READY_PROBE(ctx), BACKUP_VERIFY(ctx), SCREENSHOT(page, ctx), ops.py pre_<op>(ctx).
Dict-valued EXTRA_ENV/UPGRADE_EXTRA_ENV unchanged (only the callable signature
moved). Call sites converted: deploy_app env shaping, perform_upgrade,
wait_ready_probes (gains op=), _perform_op BACKUP_VERIFY, screenshot.capture,
_run_pre_hook.
Legacy signatures fail FAST with a clear migration message: the registry carries
hook_params per hook key, enforced at meta.load() (MetaError names the old vs new
signature); ops.py pre-op hooks get the same check at the orchestrator call site
(meta.check_hook_signature) — no silent TypeError mid-run.
Migrated every in-repo user mechanically (17 ops.py files; cryptpad/lasuite-*/
mailu EXTRA_ENV; mumble+lasuite-drive READY_PROBE; ghost/discourse BACKUP_VERIFY)
— seeded values, probes and assertions byte-identical (domain -> ctx.domain;
keycloak pre_restore's meta arg -> ctx.meta).
Unit tests: hook_ctx field contract, ctx.deps from the run deps file, legacy-
signature MetaError (READY_PROBE/EXTRA_ENV/SCREENSHOT + pre-op checker), ctx
signatures accepted. Docs table regenerated (signature docs in key docs).
Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 180 passed; scripts/lint.sh -> PASS.
a) compose.ccci.yml is FIRST-CLASS: the harness auto-copies tests/<recipe>/
compose.ccci.yml into the run's recipe checkout (ABRA_DIR-aware, lifecycle.
provide_ccci_overlay) and auto-chaoses the pinned base deploy on its presence
(kills the R7 implicit coupling). ghost/discourse install_steps.sh (copy-only
boilerplate) deleted; CHAOS_BASE_DEPLOY removed from both metas + the registry.
b) install-time deps wiring is the ONLY mode: deps with DEPS provision BEFORE the
single deploy; legacy post-deploy provisioning + the setup_custom_tests.sh
invocation machinery deleted. lasuite-docs migrated to install_steps.sh OIDC
wiring (same env names/values as the old hook — only the timing moved);
lasuite-drive's remaining post-deploy MinIO bucket one-shot moved to ops.py
pre_install; both setup_custom_tests.sh files deleted; OIDC_AT_INSTALL removed
from drive/meet metas + the registry.
c) SKIP_GENERIC meta key deleted (zero users). Env form CCCI_SKIP_GENERIC* stays
as the documented dev-only escape hatch; when active in a drone CI run the
orchestrator prints a loud !! warning (manifest embedding lands in P5).
d) conftest cleanup: dead pre-deploy-once fixtures deployed/deployed_app deleted
(zero users), app_domain + _short + _wait_healthy dropped (only users were the
deleted fixtures); deps_apps+deps_creds consolidated into ONE deps fixture
(entries expose .domain etc. as attributes; dict access intact); the 6 lasuite
test files renamed deps_creds->deps (fixture name only — assertions and flows
byte-identical). requires_deps marker + F2-11 skip-report plumbing unchanged.
Registry is now exactly the 14 final keys; docs §4 table regenerated. Stale
setup_custom_tests/OIDC_AT_INSTALL prose in docstrings/comments/assert MESSAGES
updated (no assert logic or expected value touched).
Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 175 passed; scripts/lint.sh -> PASS.
One loader: runner/harness/meta.py::load(recipe) -> RecipeMeta (frozen dataclass,
attribute access), backed by the declarative KEYS registry (14 final keys + 3
P2-deprecated). The ONLY exec() of tests/<recipe>/recipe_meta.py. Validation per
the locked decision: unknown ALL-CAPS top-level name or type mismatch = MetaError
(hard error at load); underscore-prefixed names recipe-private; callables only on
hook-typed keys.
Migrated all six legacy loaders (spec §4 L1–L6):
- run_recipe_ci.py::_load_meta deleted; orchestrator loads once, passes meta down
- tests/conftest.py::_recipe_meta deleted; meta fixture returns full RecipeMeta (R3)
- lifecycle.py::_recipe_extra_env/_recipe_meta_flag deleted; deploy_app takes meta
- deps.py::declared_deps deleted; callers read meta.DEPS
- canonical.py::is_enrolled reads through meta.load()
- screenshot.py now actually receives SCREENSHOT through the orchestrator path (R2
fix; proven by unit test through the real load path)
Mumble private constants underscore-prefixed (_WELCOME_TEXT_MARKER/_MAX_USERS) +
importers fixed. New tests/unit/test_meta.py (all-recipes-load-clean typo gate,
MetaError cases, spec §2 baseline defaults, underscore exemption, doc sync). Docs
§4 key table now GENERATED from the registry (scripts/gen-meta-docs.py); drift
fails CI.
Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 175 passed; scripts/lint.sh -> PASS.
Complete reference for per-recipe CI customization: all 18 recipe_meta keys
(incl. the base pin UPGRADE_BASE_VERSION), the six divergent meta loaders,
every hook file (test_<op>.py overlays, ops.py, install_steps.sh,
setup_custom_tests.sh, compose.ccci.yml), env contracts, and §8 known
limitations / restructuring candidates (R1 loader drift, R2 dead SCREENSHOT
knob, R6 silent-typo hazard, ...). Written for operator review ahead of a
possible restructure.
Builds 290+291 (same immich domain) both success: 291 logged block line + acquired,
both deploy-count=1 (290 no false-2, 291 no FileNotFoundError), zero leakage.
Serialization also observed live in lslocks. CONC-A1 conditions 1-3 met; veto lifted.
Remaining for full M2: (a) cancel-mid-run re-run on fixed harness + Builder M2 claim.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The four CCCI state files (deploys countfile, opstate, deps, depskip) were keyed
by app domain in shared /tmp. A second run of the same domain executes its main()
preamble + deploy_app's pre-lock _record_deploy BEFORE blocking at the app lock,
so it reset/polluted the live first run's counter (false DG4.1 deploy-count=2,
build 279) and the first run's end-of-run os.remove crashed the second
(FileNotFoundError, build 281). Masked pre-restructure by the end-to-end recipe
flock. Now keyed by run id + harness pid via _run_state_path(); children receive
exact paths via the CCCI_*_FILE env vars, so domain keying was never load-bearing.
tests/concurrency/test_run_state.py: path-invariant cases + a real-process
regression (helpers.py deploy-count-run) reproducing the live interleaving —
verified to FAIL under simulated shared keying. docs/concurrency.md §3 updated.
Builds 279+281 (immich#2, same domain immi-ad3e33) both RED: 279 false DG4.1
'deploy-count 2!=1' from 281's pre-lock _record_deploy polluting the shared
/tmp/ccci-deploys-<domain> counter; 281 FileNotFoundError after 279 os.remove'd it.
Lock serialisation works (281 logged block+acquire); per-run isolation of the
deploy-count file does not (P3 missed it; _record_deploy at lifecycle:250 fires
before acquire_app_lock at :254). Control build 275 (isolated) green.
Veto DONE until counter keyed per-run + same-domain test + live (c) both-green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The drone exec runner's step shell is set -e. On a NORMAL harness exit the EXIT trap still
fired and its kill of the already-exited process group failed with ESRCH, poisoning the
script's exit status: build 269 (plausible#3) ran fully GREEN (all tiers pass, level=4) but
the step exited 1. Reproduced minimally with sh -e and bash -e on the host; the fixed wrapper
verified for all three paths: green rc=0, red rc=7 (propagated), TERM-to-shell -> child gets
TERM and wrapper exits 143. Cancel forwarding semantics unchanged.
Rewritten to the restructured model: lifetime-hardening guards (PDEATHSIG/SIGTERM/SIGALRM +
setsid/trap), per-run ABRA_DIR isolation (same-recipe runs now parallel), per-app-domain flock
(double-!testme serialisation), flock-probe janitor decision table (incl. the inode-identity
race rows), updated failure-mode table (cancel now tears down via the harness's own funnel;
reboot reaps immediately; 60-min deadline bounds everything), single-knob config table, how to
run tests/concurrency, fresh file/symbol index + deleted-symbol list for grep verification.
Also drops the last stale concurrency.limit mention from the .drone.yml header comment.
tests/concurrency/ — NOT in the default `pytest tests/unit` gate; run explicitly with
`pytest tests/concurrency -q`. flock/prctl/alarm are never mocked: helper subprocesses
(helpers.py) hold real locks and install the real lifetime guards; locks live in a per-test
tmp dir via CCCI_APP_LOCK_DIR; every helper (and recorded grandchild) is reaped by fixture
cleanup.
- test_locks.py (cases 1-4): SIGKILL auto-release; LOCK_NB held/unheld semantics; PEP 446
fd-not-inherited (holder's child survives, lock still releases); same-domain second acquire
blocks until first holder exits.
- test_janitor.py (cases 5-12): orphan reaped once + lockfile unlinked; live holder never
reaped + logged; new-run acquire blocks until a slow reap completes (reap-under-probe-lock);
two overlapping janitors -> exactly one reaps (flock arbitration); reboot sim (no lockfile)
reaps immediately with no age wait; >120min-held lock flagged 'possible leaked run' and NOT
stolen; warm/canonical names never probed (no lockfile even created); directory-as-lockfile
and missing lock dir degrade to skip+log, never crash.
- test_lifetime.py (cases 13-16): PDEATHSIG (wrapper parent SIGKILL'd -> guarded child TERM'd,
teardown marker, lock released); already-orphaned helper REFUSES to run (ppid race); 2s
deadline alarm -> teardown + exit 142 + lock released; SIGTERM -> teardown + exit 143 +
lock released.
- test_abra_dir.py (cases 17-19 + 18b): per-run dir built + $ABRA_DIR exported before the
first abra call (recording stub abra on PATH); two CONCURRENT same-recipe fetch+checkout
flows into different ABRA_DIRs -> divergent correct trees, canonical staged clone untouched;
.env written through the servers/ symlink lands in the canonical path (env_get/env_set
agree); manual runs get pid-suffixed dirs.
On cc-ci: pytest tests/concurrency -q -> 20 passed; tests/unit -> 138 passed; lint PASS.
Remove concurrency.limit from the recipe-ci pipeline (.drone.yml): it duplicated
DRONE_RUNNER_CAPACITY (nix/modules/drone-runner.nix maxTests) and the two had to be kept in
step by hand (docs/concurrency.md §8.6). maxTests comment updated to state it is the single
knob and to describe the new safety model.
- run_recipe_ci.setup_run_abra_dir(): builds <runs_dir>/<run-id>/abra with servers/ and
catalogue/ symlinked to the canonical ~/.abra (app .env files keep landing in the shared
canonical path, so janitor discovery and env-based teardown are unchanged; per-domain
filenames + the P2 app-domain lock prevent write conflicts) and a FRESH empty recipes/ —
each run clones + checkouts its own recipe trees. Exported as $ABRA_DIR (honored by the
abra CLI, verified on-host) before ANY abra call. Manual runs get manual-<pid> isolation.
- fetch_recipe(): plain clone into $ABRA_DIR/recipes/<recipe> — no shared-tree rm-rf, no lock.
CCCI_SKIP_FETCH=1 now copies the canonically-staged clone into the per-run tree (same staging
workflow, run reads staged state).
- abra.abra_dir()/recipe_dir(): single resolution rule ($ABRA_DIR else ~/.abra), used by
recipe_checkout, has_lightweight_version_tags, recipe_head_commit, recipe_versions,
generic._recipe_dir, lifecycle.prepull_images, snapshot_recipe_tests, and
warm_reconcile._recipe_dir (which keeps the canonical default for its own systemd runs but
follows the per-run tree when imported by promote_canonical inside a run).
- deleted: lifecycle.acquire_recipe_lock, RECIPE_LOCK_DIR, the main() call site and the
must-lock-before-fetch ordering rule.
- tests/{ghost,discourse}/install_steps.sh: RECIPE_DIR resolves ${ABRA_DIR:-$HOME/.abra} so the
compose.ccci.yml overlay lands in the tree the run actually deploys from (mechanical path fix
required by per-run trees; no assertion/gate touched — see DECISIONS.md).
- .drone.yml comments updated (HOME=/root rationale now via the servers symlink).
- acquire_app_lock(domain): exclusive flock on /run/lock/cc-ci-app-<domain>.lock, taken in
deploy_app exactly where register_run_app was (BEFORE app creation); blocks with a log line
when another run of the same domain is in flight (double-!testme serialisation). The file
object is retained in module-level _held_app_locks so GC can never close the fd and silently
release the lock. mtime is touched at acquisition (lock age for the long-held flag).
- janitor(): probes each candidate's lock (discovery unchanged: abra app ls + docker-service
sweep vs RUN_APP_RE). Acquirable -> orphan -> teardown_app(verify=False) WHILE HOLDING the
probe lock (a new same-domain run blocks until the reap finishes), then unlink before release.
Held -> live run -> leave it; held >120min (2x hard deadline) -> warn, never steal. Stale
unheld lockfiles with no app are unlinked on sight. Unreadable lockfile -> skip + log.
- unlink/recreate race guard (both sides): after ANY acquisition, verify the locked fd still is
the inode the path names (fstat vs stat); a waiter that won a just-unlinked inode retries on
the live path, and a probe that won one skips (unlinking now would hit a newer run's file).
- deleted: register_run_app, unregister_run_app, _run_owner_state, _registry_path,
ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE + age fallback, _stack_age_seconds, pid-reuse guard.
teardown_app no longer unregisters (release is process exit). janitor() takes no args now.
- post-reboot: /run/lock is tmpfs -> lockfiles gone -> probe trivially acquires -> immediate
reap (improvement over the old 2h age fallback).
- new harness/lifetime.py: install_lifetime_guards() arms PR_SET_PDEATHSIG(SIGTERM) (with
post-prctl ppid==1 orphan refusal), a SIGTERM handler raising SystemExit through the run's
finally: teardown funnel (exit 143), and signal.alarm(3600) funnelling SIGALRM the same way
with a distinct deadline log line (exit 142). Re-entrant signals during teardown are logged
and ignored (begin_teardown guard) so a second signal can't abort the running cleanup.
- run_recipe_ci.main(): guards installed first thing, before any abra call/lock; both teardown
finally: blocks (cold + quick) mark begin_teardown().
- .drone.yml recipe-ci step: harness runs under setsid in its own process group; a trap forwards
the step shell's TERM/EXIT to the whole group so drone cancel reaches the harness instead of
leaking it (docs/concurrency.md §8.1).
- PEP 446 note on the recipe-lock open(): the fd is non-inheritable, children never carry it.
68ef0f8 made services_converged() require UpdateStatus settled, treating
'paused' as in flight. But swarm's default update-failure-action pauses the
update on a single task flicker and the flag persists FOREVER (until the next
update): immich CI 241 had the app service 'paused' from a restart during
restore while the service was back at 1/1 and healthy — every subsequent wait
hung to its deadline and the run had to be killed.
Only 'updating' and 'rollback_started' now block convergence: those are the
states swarm is actively driving (the 238 stop-first race lives in 'updating').
'paused'/'rollback_paused' make no progress without intervention, so waiting on
them is pointless — N/N replicas is already required, and the HTTP-health and
tier assertions still gate whether the app actually works.
lint: PASS, unit tests: 138 passed.