Compare commits

..

50 Commits

Author SHA1 Message Date
29a28e2028 feat(harness): P4 — custom-test ergonomics (rcust)
All checks were successful
continuous-integration/drone/push Build is passing
Placement RULE: discovery.custom_tests covers ONLY functional/ + playwright/ —
the top-level test_*.py glob for recipe dirs is removed (top level is reserved
for lifecycle overlays; zero in-repo users of top-level custom tests, verified
by sweep). Lifecycle-name exclusion inside the subdirs stays as the double-run
safety net. HC2 default-deny unchanged (repo-local custom now pinned via
functional/ in the gate test).

New conftest fixture op_state: parses $CCCI_OP_STATE_FILE (op context: versions,
artifact paths), skipping with a clear reason when unset/absent/unparseable —
overlay tests read op facts from the fixture instead of hand-parsing env (zero
existing hand-parsers found; the fixture is the documented path forward). deps
fixture landed in P2d.

Unit tests: placement-rule discovery tests (top-level custom NOT discovered;
functional/playwright are; misfiled lifecycle names excluded), op_state fixture
contract (reads file / skips without env / skips on missing file), deps fixture
attribute sugar.

Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 184 passed; scripts/lint.sh -> PASS.
2026-06-10 17:14:21 +00:00
fd02d9f4b8 feat(harness): P3 — uniform ctx hook convention (rcust)
All checks were successful
continuous-integration/drone/push Build is passing
harness.meta.HookCtx (frozen): .domain, .base_url, .meta (RecipeMeta), .deps
(provisioned dep creds from $CCCI_DEPS_FILE or None), .op (current lifecycle op
or None); built via meta.hook_ctx() at each hook call site.

All recipe callables now take ctx: EXTRA_ENV(ctx), UPGRADE_EXTRA_ENV(ctx),
READY_PROBE(ctx), BACKUP_VERIFY(ctx), SCREENSHOT(page, ctx), ops.py pre_<op>(ctx).
Dict-valued EXTRA_ENV/UPGRADE_EXTRA_ENV unchanged (only the callable signature
moved). Call sites converted: deploy_app env shaping, perform_upgrade,
wait_ready_probes (gains op=), _perform_op BACKUP_VERIFY, screenshot.capture,
_run_pre_hook.

Legacy signatures fail FAST with a clear migration message: the registry carries
hook_params per hook key, enforced at meta.load() (MetaError names the old vs new
signature); ops.py pre-op hooks get the same check at the orchestrator call site
(meta.check_hook_signature) — no silent TypeError mid-run.

Migrated every in-repo user mechanically (17 ops.py files; cryptpad/lasuite-*/
mailu EXTRA_ENV; mumble+lasuite-drive READY_PROBE; ghost/discourse BACKUP_VERIFY)
— seeded values, probes and assertions byte-identical (domain -> ctx.domain;
keycloak pre_restore's meta arg -> ctx.meta).

Unit tests: hook_ctx field contract, ctx.deps from the run deps file, legacy-
signature MetaError (READY_PROBE/EXTRA_ENV/SCREENSHOT + pre-op checker), ctx
signatures accepted. Docs table regenerated (signature docs in key docs).

Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 180 passed; scripts/lint.sh -> PASS.
2026-06-10 17:10:26 +00:00
8cd72fd78d feat(harness): P2 — delete legacy customization keys & paths (rcust)
All checks were successful
continuous-integration/drone/push Build is passing
a) compose.ccci.yml is FIRST-CLASS: the harness auto-copies tests/<recipe>/
   compose.ccci.yml into the run's recipe checkout (ABRA_DIR-aware, lifecycle.
   provide_ccci_overlay) and auto-chaoses the pinned base deploy on its presence
   (kills the R7 implicit coupling). ghost/discourse install_steps.sh (copy-only
   boilerplate) deleted; CHAOS_BASE_DEPLOY removed from both metas + the registry.

b) install-time deps wiring is the ONLY mode: deps with DEPS provision BEFORE the
   single deploy; legacy post-deploy provisioning + the setup_custom_tests.sh
   invocation machinery deleted. lasuite-docs migrated to install_steps.sh OIDC
   wiring (same env names/values as the old hook — only the timing moved);
   lasuite-drive's remaining post-deploy MinIO bucket one-shot moved to ops.py
   pre_install; both setup_custom_tests.sh files deleted; OIDC_AT_INSTALL removed
   from drive/meet metas + the registry.

c) SKIP_GENERIC meta key deleted (zero users). Env form CCCI_SKIP_GENERIC* stays
   as the documented dev-only escape hatch; when active in a drone CI run the
   orchestrator prints a loud !! warning (manifest embedding lands in P5).

d) conftest cleanup: dead pre-deploy-once fixtures deployed/deployed_app deleted
   (zero users), app_domain + _short + _wait_healthy dropped (only users were the
   deleted fixtures); deps_apps+deps_creds consolidated into ONE deps fixture
   (entries expose .domain etc. as attributes; dict access intact); the 6 lasuite
   test files renamed deps_creds->deps (fixture name only — assertions and flows
   byte-identical). requires_deps marker + F2-11 skip-report plumbing unchanged.

Registry is now exactly the 14 final keys; docs §4 table regenerated. Stale
setup_custom_tests/OIDC_AT_INSTALL prose in docstrings/comments/assert MESSAGES
updated (no assert logic or expected value touched).

Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 175 passed; scripts/lint.sh -> PASS.
2026-06-10 17:01:33 +00:00
472a68b32c feat(harness): P1 — single registry-backed meta loader (rcust)
All checks were successful
continuous-integration/drone/push Build is passing
One loader: runner/harness/meta.py::load(recipe) -> RecipeMeta (frozen dataclass,
attribute access), backed by the declarative KEYS registry (14 final keys + 3
P2-deprecated). The ONLY exec() of tests/<recipe>/recipe_meta.py. Validation per
the locked decision: unknown ALL-CAPS top-level name or type mismatch = MetaError
(hard error at load); underscore-prefixed names recipe-private; callables only on
hook-typed keys.

Migrated all six legacy loaders (spec §4 L1–L6):
- run_recipe_ci.py::_load_meta deleted; orchestrator loads once, passes meta down
- tests/conftest.py::_recipe_meta deleted; meta fixture returns full RecipeMeta (R3)
- lifecycle.py::_recipe_extra_env/_recipe_meta_flag deleted; deploy_app takes meta
- deps.py::declared_deps deleted; callers read meta.DEPS
- canonical.py::is_enrolled reads through meta.load()
- screenshot.py now actually receives SCREENSHOT through the orchestrator path (R2
  fix; proven by unit test through the real load path)

Mumble private constants underscore-prefixed (_WELCOME_TEXT_MARKER/_MAX_USERS) +
importers fixed. New tests/unit/test_meta.py (all-recipes-load-clean typo gate,
MetaError cases, spec §2 baseline defaults, underscore exemption, doc sync). Docs
§4 key table now GENERATED from the registry (scripts/gen-meta-docs.py); drift
fails CI.

Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 175 passed; scripts/lint.sh -> PASS.
2026-06-10 16:46:58 +00:00
49fb818c60 status(rcust): bootstrap phase state files — P1 starting on branch restructure/recipe-custom
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 16:29:44 +00:00
12318582aa review(rcust): seed Adversary ledger — phase start, awaiting M1 claim
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 16:29:26 +00:00
76a4b6b3fa docs: recipe-customization review spec — full settings reference + restructuring candidates
All checks were successful
continuous-integration/drone/push Build is passing
Complete reference for per-recipe CI customization: all 18 recipe_meta keys
(incl. the base pin UPGRADE_BASE_VERSION), the six divergent meta loaders,
every hook file (test_<op>.py overlays, ops.py, install_steps.sh,
setup_custom_tests.sh, compose.ccci.yml), env contracts, and §8 known
limitations / restructuring candidates (R1 loader drift, R2 dead SCREENSHOT
knob, R6 silent-typo hazard, ...). Written for operator review ahead of a
possible restructure.
2026-06-10 15:55:34 +00:00
6060086c01 status(conc): ## DONE — M1+M2 both Adversary-PASS, no open veto; phase complete
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:56:02 +00:00
9987fba4b6 review(conc): M2 PASS — merged + live-verified (a)-(d) on final main 139e319; M1+M2 both fresh PASS, no open veto — DONE unblocked
All checks were successful
continuous-integration/drone/push Build is passing
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:55:19 +00:00
74ed24053d claim(conc): M2 — merged + live-verified (a)-(d) on final main 139e319; (a) re-run build 295 clean; awaiting Adversary
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:52:48 +00:00
2894778810 review(conc): M2(c) PASS — double-!testme both GREEN on CONC-A1-fixed harness; VETO LIFTED, CONC-A1 closed
All checks were successful
continuous-integration/drone/push Build is passing
Builds 290+291 (same immich domain) both success: 291 logged block line + acquired,
both deploy-count=1 (290 no false-2, 291 no FileNotFoundError), zero leakage.
Serialization also observed live in lslocks. CONC-A1 conditions 1-3 met; veto lifted.
Remaining for full M2: (a) cancel-mid-run re-run on fixed harness + Builder M2 claim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:51:46 +00:00
536a3595b9 journal(conc): M2(c) PASS round 2 — 290+291 both green, block line visible, zero leakage; (a) re-run triggered
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:50:26 +00:00
0684576d74 chore(conc): consume BUILDER-INBOX (ML-flake context on (c) round-2; concur — will re-trigger (c) clean after 290/291 terminal)
Some checks reported errors
continuous-integration/drone/push Build is passing
continuous-integration/drone Build was killed
2026-06-10 08:45:14 +00:00
fa9a89bcf8 review(conc): live (c) round-2 — serialization confirmed via lslocks; delay is immich-ML healthcheck flake, not the restructure; veto unchanged
All checks were successful
continuous-integration/drone/push Build is passing
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:44:30 +00:00
374371966f journal(conc): (b)+(d) PASS on CONC-A1-fixed main (287/288 parallel green, zero leakage); (c) round 2 triggered
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:22:40 +00:00
b1bca1a745 chore(conc): CONC-A1 fix code-verified (veto conditions 1+2 met, mutation-proven); 3+4 pending live (c) re-run
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:19:37 +00:00
4f6c9554b7 inbox(adversary): consumed CONC-A1-fixed message from Builder
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:17:16 +00:00
96ba67a63f inbox(adversary): CONC-A1 fixed b6e12ef/139e319 — run-keyed state files + regression test; re-running M2 live checks
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:16:43 +00:00
139e319d7e Merge branch 'restructure/concurrency': fix(harness) CONC-A1 run-keyed state files (M2(c) live-verify finding)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:16:18 +00:00
b6e12ef428 fix(harness): run-keyed run-scoped state files — CONC-A1 (same-domain runs corrupted shared deploy-count)
All checks were successful
continuous-integration/drone/push Build is passing
The four CCCI state files (deploys countfile, opstate, deps, depskip) were keyed
by app domain in shared /tmp. A second run of the same domain executes its main()
preamble + deploy_app's pre-lock _record_deploy BEFORE blocking at the app lock,
so it reset/polluted the live first run's counter (false DG4.1 deploy-count=2,
build 279) and the first run's end-of-run os.remove crashed the second
(FileNotFoundError, build 281). Masked pre-restructure by the end-to-end recipe
flock. Now keyed by run id + harness pid via _run_state_path(); children receive
exact paths via the CCCI_*_FILE env vars, so domain keying was never load-bearing.

tests/concurrency/test_run_state.py: path-invariant cases + a real-process
regression (helpers.py deploy-count-run) reproducing the live interleaving —
verified to FAIL under simulated shared keying. docs/concurrency.md §3 updated.
2026-06-10 08:16:09 +00:00
2173894f07 review(conc): M2(c) FAIL — double-!testme same domain corrupts shared deploy-count file (CONC-A1) + VETO
All checks were successful
continuous-integration/drone/push Build is passing
Builds 279+281 (immich#2, same domain immi-ad3e33) both RED: 279 false DG4.1
'deploy-count 2!=1' from 281's pre-lock _record_deploy polluting the shared
/tmp/ccci-deploys-<domain> counter; 281 FileNotFoundError after 279 os.remove'd it.
Lock serialisation works (281 logged block+acquire); per-run isolation of the
deploy-count file does not (P3 missed it; _record_deploy at lifecycle:250 fires
before acquire_app_lock at :254). Control build 275 (isolated) green.
Veto DONE until counter keyed per-run + same-domain test + live (c) both-green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:11:07 +00:00
e392c73cbc journal(conc): M2(b)+(d) PASS evidence; (c) double-!testme triggered
Some checks failed
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is failing
2026-06-10 05:04:14 +00:00
3180ae1355 review(conc): wrapper exit-code fix verified safe (red still propagates) + correct my set -e pre-review miss; inbox consumed
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:58:27 +00:00
9d82a02026 journal(conc): M2(b) round-1 evidence + wrapper fix verification
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
2026-06-10 04:56:22 +00:00
bbc2bafbcb inbox(adversary): M2 wrapper exit-code fix e1c4198/b7a009c — context for M2 review
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
2026-06-10 04:55:07 +00:00
b7a009c1fc Merge branch 'restructure/concurrency': fix(ci) wrapper exit-code poisoning on green runs (M2 live-verify finding)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:54:51 +00:00
e1c4198c08 fix(ci): recipe-ci wrapper — capture harness rc, clear traps before exit (green runs no longer exit 1)
All checks were successful
continuous-integration/drone/push Build is passing
The drone exec runner's step shell is set -e. On a NORMAL harness exit the EXIT trap still
fired and its kill of the already-exited process group failed with ESRCH, poisoning the
script's exit status: build 269 (plausible#3) ran fully GREEN (all tiers pass, level=4) but
the step exited 1. Reproduced minimally with sh -e and bash -e on the host; the fixed wrapper
verified for all three paths: green rc=0, red rc=7 (propagated), TERM-to-shell -> child gets
TERM and wrapper exits 143. Cancel forwarding semantics unchanged.
2026-06-10 04:54:40 +00:00
56723ae0ec chore(conc): M2 merge-integrity pre-check — merged main == M1-verified tree (not a verdict)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:49:55 +00:00
dfa5c8b9ee journal(conc): M2(a) cancel-mid-run PASS evidence; (b) parallel runs triggered
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:47:19 +00:00
bb5eb3d3aa Merge branch 'restructure/concurrency': concurrency restructure (P1-P5 + tests/concurrency)
Some checks failed
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is failing
M1 Adversary-verified PASS (REVIEW-conc.md @83a6c6e): lock-lifetime hardening (PDEATHSIG +
signal funnels + 60-min deadline + setsid/trap cancel forwarding), flock-probe janitor
(registry deleted), per-run ABRA_DIR (recipe flock deleted), single concurrency knob,
tests/concurrency real-kernel suite, docs/concurrency.md rewrite.
2026-06-10 04:40:00 +00:00
83a6c6e157 review(M1): PASS — branch @d3fe9e2 cold-verified (unit 138, conc 20, lint, 0 dangling refs, gate-integrity, independent flock probe)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:39:16 +00:00
8b9033f3d6 journal(conc): tests suite + P5 evidence, M1 claim context
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:34:19 +00:00
e8e52cf4c6 claim(conc): M1 CLAIMED — branch restructure/concurrency complete (P1-P5 + tests, tip d3fe9e2), awaiting Adversary
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:33:59 +00:00
d3fe9e26bb docs: P5 concurrency spec rewrite — one lock, one structural isolation, the invariant chain
All checks were successful
continuous-integration/drone/push Build is passing
Rewritten to the restructured model: lifetime-hardening guards (PDEATHSIG/SIGTERM/SIGALRM +
setsid/trap), per-run ABRA_DIR isolation (same-recipe runs now parallel), per-app-domain flock
(double-!testme serialisation), flock-probe janitor decision table (incl. the inode-identity
race rows), updated failure-mode table (cancel now tears down via the harness's own funnel;
reboot reaps immediately; 60-min deadline bounds everything), single-knob config table, how to
run tests/concurrency, fresh file/symbol index + deleted-symbol list for grep verification.
Also drops the last stale concurrency.limit mention from the .drone.yml header comment.
2026-06-10 04:32:54 +00:00
84d90fb655 test(concurrency): real-kernel suite for the restructured model — 20 tests, 19 plan cases
All checks were successful
continuous-integration/drone/push Build is passing
tests/concurrency/ — NOT in the default `pytest tests/unit` gate; run explicitly with
`pytest tests/concurrency -q`. flock/prctl/alarm are never mocked: helper subprocesses
(helpers.py) hold real locks and install the real lifetime guards; locks live in a per-test
tmp dir via CCCI_APP_LOCK_DIR; every helper (and recorded grandchild) is reaped by fixture
cleanup.

- test_locks.py (cases 1-4): SIGKILL auto-release; LOCK_NB held/unheld semantics; PEP 446
  fd-not-inherited (holder's child survives, lock still releases); same-domain second acquire
  blocks until first holder exits.
- test_janitor.py (cases 5-12): orphan reaped once + lockfile unlinked; live holder never
  reaped + logged; new-run acquire blocks until a slow reap completes (reap-under-probe-lock);
  two overlapping janitors -> exactly one reaps (flock arbitration); reboot sim (no lockfile)
  reaps immediately with no age wait; >120min-held lock flagged 'possible leaked run' and NOT
  stolen; warm/canonical names never probed (no lockfile even created); directory-as-lockfile
  and missing lock dir degrade to skip+log, never crash.
- test_lifetime.py (cases 13-16): PDEATHSIG (wrapper parent SIGKILL'd -> guarded child TERM'd,
  teardown marker, lock released); already-orphaned helper REFUSES to run (ppid race); 2s
  deadline alarm -> teardown + exit 142 + lock released; SIGTERM -> teardown + exit 143 +
  lock released.
- test_abra_dir.py (cases 17-19 + 18b): per-run dir built + $ABRA_DIR exported before the
  first abra call (recording stub abra on PATH); two CONCURRENT same-recipe fetch+checkout
  flows into different ABRA_DIRs -> divergent correct trees, canonical staged clone untouched;
  .env written through the servers/ symlink lands in the canonical path (env_get/env_set
  agree); manual runs get pid-suffixed dirs.

On cc-ci: pytest tests/concurrency -q -> 20 passed; tests/unit -> 138 passed; lint PASS.
2026-06-10 04:29:36 +00:00
c51692b57e chore(conc): pre-review P3+P4 — zero dangling refs, ABRA_DIR ordering clean (not a verdict)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:28:41 +00:00
ffcf441364 journal(conc): P1-P4 evidence (live smokes on cc-ci) + pre-existing abra app ls FATA observation
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:21:17 +00:00
2080d734d3 status(conc): P1-P4 on branch (b492f99..91d3cc7), tests/concurrency next
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:20:20 +00:00
91d3cc7e99 chore(ci): P4 config cleanup — DRONE_RUNNER_CAPACITY is the single concurrency knob
All checks were successful
continuous-integration/drone/push Build is passing
Remove concurrency.limit from the recipe-ci pipeline (.drone.yml): it duplicated
DRONE_RUNNER_CAPACITY (nix/modules/drone-runner.nix maxTests) and the two had to be kept in
step by hand (docs/concurrency.md §8.6). maxTests comment updated to state it is the single
knob and to describe the new safety model.
2026-06-10 04:19:35 +00:00
f98b444559 decisions(conc): record P3 install_steps.sh ABRA_DIR path fix (guardrail justification)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:18:45 +00:00
17ebdf39ac feat(harness): P3 per-run ABRA_DIR — structural recipe-tree isolation, recipe flock deleted
All checks were successful
continuous-integration/drone/push Build is passing
- run_recipe_ci.setup_run_abra_dir(): builds <runs_dir>/<run-id>/abra with servers/ and
  catalogue/ symlinked to the canonical ~/.abra (app .env files keep landing in the shared
  canonical path, so janitor discovery and env-based teardown are unchanged; per-domain
  filenames + the P2 app-domain lock prevent write conflicts) and a FRESH empty recipes/ —
  each run clones + checkouts its own recipe trees. Exported as $ABRA_DIR (honored by the
  abra CLI, verified on-host) before ANY abra call. Manual runs get manual-<pid> isolation.
- fetch_recipe(): plain clone into $ABRA_DIR/recipes/<recipe> — no shared-tree rm-rf, no lock.
  CCCI_SKIP_FETCH=1 now copies the canonically-staged clone into the per-run tree (same staging
  workflow, run reads staged state).
- abra.abra_dir()/recipe_dir(): single resolution rule ($ABRA_DIR else ~/.abra), used by
  recipe_checkout, has_lightweight_version_tags, recipe_head_commit, recipe_versions,
  generic._recipe_dir, lifecycle.prepull_images, snapshot_recipe_tests, and
  warm_reconcile._recipe_dir (which keeps the canonical default for its own systemd runs but
  follows the per-run tree when imported by promote_canonical inside a run).
- deleted: lifecycle.acquire_recipe_lock, RECIPE_LOCK_DIR, the main() call site and the
  must-lock-before-fetch ordering rule.
- tests/{ghost,discourse}/install_steps.sh: RECIPE_DIR resolves ${ABRA_DIR:-$HOME/.abra} so the
  compose.ccci.yml overlay lands in the tree the run actually deploys from (mechanical path fix
  required by per-run trees; no assertion/gate touched — see DECISIONS.md).
- .drone.yml comments updated (HOME=/root rationale now via the servers symlink).
2026-06-10 04:18:33 +00:00
08b629f52a chore(conc): pre-review P1+P2 — 4 break-it concerns tested + refuted (not a verdict)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:16:41 +00:00
b302f3ab63 feat(harness): P2 flock-probe janitor — the kernel flock IS the liveness oracle
All checks were successful
continuous-integration/drone/push Build is passing
- acquire_app_lock(domain): exclusive flock on /run/lock/cc-ci-app-<domain>.lock, taken in
  deploy_app exactly where register_run_app was (BEFORE app creation); blocks with a log line
  when another run of the same domain is in flight (double-!testme serialisation). The file
  object is retained in module-level _held_app_locks so GC can never close the fd and silently
  release the lock. mtime is touched at acquisition (lock age for the long-held flag).
- janitor(): probes each candidate's lock (discovery unchanged: abra app ls + docker-service
  sweep vs RUN_APP_RE). Acquirable -> orphan -> teardown_app(verify=False) WHILE HOLDING the
  probe lock (a new same-domain run blocks until the reap finishes), then unlink before release.
  Held -> live run -> leave it; held >120min (2x hard deadline) -> warn, never steal. Stale
  unheld lockfiles with no app are unlinked on sight. Unreadable lockfile -> skip + log.
- unlink/recreate race guard (both sides): after ANY acquisition, verify the locked fd still is
  the inode the path names (fstat vs stat); a waiter that won a just-unlinked inode retries on
  the live path, and a probe that won one skips (unlinking now would hit a newer run's file).
- deleted: register_run_app, unregister_run_app, _run_owner_state, _registry_path,
  ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE + age fallback, _stack_age_seconds, pid-reuse guard.
  teardown_app no longer unregisters (release is process exit). janitor() takes no args now.
- post-reboot: /run/lock is tmpfs -> lockfiles gone -> probe trivially acquires -> immediate
  reap (improvement over the old 2h age fallback).
2026-06-10 04:11:31 +00:00
b492f995bd feat(harness): P1 lock-lifetime hardening — PDEATHSIG + SIGTERM/SIGALRM teardown funnel + 60-min hard deadline
All checks were successful
continuous-integration/drone/push Build is passing
- new harness/lifetime.py: install_lifetime_guards() arms PR_SET_PDEATHSIG(SIGTERM) (with
  post-prctl ppid==1 orphan refusal), a SIGTERM handler raising SystemExit through the run's
  finally: teardown funnel (exit 143), and signal.alarm(3600) funnelling SIGALRM the same way
  with a distinct deadline log line (exit 142). Re-entrant signals during teardown are logged
  and ignored (begin_teardown guard) so a second signal can't abort the running cleanup.
- run_recipe_ci.main(): guards installed first thing, before any abra call/lock; both teardown
  finally: blocks (cold + quick) mark begin_teardown().
- .drone.yml recipe-ci step: harness runs under setsid in its own process group; a trap forwards
  the step shell's TERM/EXIT to the whole group so drone cancel reaches the harness instead of
  leaking it (docs/concurrency.md §8.1).
- PEP 446 note on the recipe-lock open(): the fd is non-inheritable, children never carry it.
2026-06-10 04:04:28 +00:00
e350c94c3f chore(conc): record cold-verify environment (cc-ci-run pytest env, M1 plan)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:03:23 +00:00
45afccbef5 status(conc): bootstrap phase state files — P1 in flight on branch restructure/concurrency
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:00:12 +00:00
48d03d8405 chore(conc): seed REVIEW-conc.md — adversary online, baseline pre-read (no verdict)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 03:56:26 +00:00
5b65c6caa3 docs: concurrency spec — how parallel recipe runs stay safe (for review/restructuring)
All checks were successful
continuous-integration/drone/push Build is passing
Documents the capacity=2 concurrent-run system as landed in c0df77d,
68ef0f8, e6d55b5: config knobs, isolation model, per-recipe flock,
active-run registry + three-way janitor, convergence interactions,
failure-mode guarantees, and known limitations / restructuring
candidates.
2026-06-10 03:05:20 +00:00
157d06dc77 Merge pull request 'test(plausible): psql -q in _register_site — -t does not suppress command tags' (#9) from test/plausible-psql-quiet into main
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
2026-06-09 23:12:37 +00:00
e6d55b53c7 fix(harness): a paused swarm update is settled — only active states block convergence
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
68ef0f8 made services_converged() require UpdateStatus settled, treating
'paused' as in flight. But swarm's default update-failure-action pauses the
update on a single task flicker and the flag persists FOREVER (until the next
update): immich CI 241 had the app service 'paused' from a restart during
restore while the service was back at 1/1 and healthy — every subsequent wait
hung to its deadline and the run had to be killed.

Only 'updating' and 'rollback_started' now block convergence: those are the
states swarm is actively driving (the 238 stop-first race lives in 'updating').
'paused'/'rollback_paused' make no progress without intervention, so waiting on
them is pointless — N/N replicas is already required, and the HTTP-health and
tier assertions still gate whether the app actually works.

lint: PASS, unit tests: 138 passed.
2026-06-09 23:07:36 +00:00
83 changed files with 4139 additions and 1012 deletions

View File

@ -35,12 +35,12 @@ steps:
# the comment-bridge). Deploys the recipe at the PR head, runs install/upgrade/backup + any
# recipe-local tests via the shared harness, then guarantees teardown (plan §4.2/§4.3).
#
# Resource safety (plan §4.2/§4.3): DRONE_RUNNER_CAPACITY=2 (nix/modules/drone-runner.nix) +
# concurrency.limit=2 below allow two recipe runs in parallel. Concurrent-run safety is enforced by
# the harness, not by serialisation: same-recipe runs serialise on a per-recipe flock
# (lifecycle.acquire_recipe_lock — the shared ~/.abra/recipes/<recipe> checkout is the conflict),
# and every run registers its app domain + pid in /run/cc-ci-active so the run-start janitor only
# reaps orphans whose owning run is DEAD (alive → never touched; unknown → age fallback, default 2h).
# Resource safety (plan §4.2/§4.3): DRONE_RUNNER_CAPACITY=2 (nix/modules/drone-runner.nix, the
# single concurrency knob) allows two recipe runs in parallel. Concurrent-run safety is enforced by
# the harness, not by serialisation: every run holds an exclusive flock on its app domain
# (/run/lock/cc-ci-app-<domain>.lock) for its whole process lifetime, the run-start janitor probes
# that lock to reap only orphans (held lock = live run, never touched), and recipe working trees
# are per-run ($ABRA_DIR/recipes — no shared checkout, no recipe lock). See docs/concurrency.md.
kind: pipeline
type: exec
name: recipe-ci
@ -53,21 +53,37 @@ trigger:
event:
- custom
concurrency:
limit: 2
# NB deliberately NO `concurrency.limit` here: DRONE_RUNNER_CAPACITY (nix/modules/drone-runner.nix
# maxTests) is the single concurrency knob (P4 — two knobs in two files drifted).
steps:
- name: ci
environment:
STAGES: install,upgrade,backup,restore,custom
# The exec runner points HOME at a per-build workspace; force it to /root so abra finds its
# server config + recipes under /root/.abra (as the manual M4/M5 runs did). Safe with
# capacity=2: app names are unique per (recipe,pr,ref) and same-recipe runs serialise on the
# per-recipe flock, so concurrent builds never touch the same recipe checkout or app.
# The exec runner points HOME at a per-build workspace; force it to /root so abra's server
# config is found via the per-run ABRA_DIR's servers/ symlink -> /root/.abra/servers.
# Recipe trees are PER-RUN ($ABRA_DIR/recipes, exported by run_recipe_ci before any abra
# call), so concurrent builds never share a recipe checkout; app .env files are per-domain
# in the shared canonical servers/ path, guarded by the app-domain flock.
HOME: /root
commands:
# RECIPE/REF/PR/SRC (+ CCCI_QUICK for `!testme --quick`) are injected as env vars from the
# build's custom params. CCCI_QUICK=1 makes run_recipe_ci take the opt-in fast lane (WC7);
# absent => full cold (default). run_quick ignores STAGES (always upgrade+custom).
- 'echo "recipe-ci: RECIPE=$RECIPE REF=$REF PR=$PR SRC=$SRC stages=$STAGES quick=${CCCI_QUICK:-0}"'
- cc-ci-run runner/run_recipe_ci.py
# P1 lock-lifetime hardening: run the harness in its own session/process group (setsid) and
# forward a drone cancel (TERM to this step shell) to the WHOLE group, so the harness's
# SIGTERM handler runs its teardown funnel instead of being leaked (the exec runner kills
# only the step shell, not the tree). PDEATHSIG inside the harness backstops the case where
# this shell dies without the trap firing. The harness exit code is captured explicitly and
# the traps cleared before exiting: the runner shell is `set -e`, and an EXIT-trap kill of
# the already-gone process group returns ESRCH, which otherwise poisons a GREEN run's exit
# status to 1 (observed live, build 269: all tiers pass, step exit 1).
- |
setsid cc-ci-run runner/run_recipe_ci.py &
PID=$!
trap 'kill -TERM -- "-$PID" 2>/dev/null || true' TERM EXIT
rc=0
wait "$PID" || rc=$?
trap - TERM EXIT
exit "$rc"

68
BACKLOG-conc.md Normal file
View File

@ -0,0 +1,68 @@
# BACKLOG — sub-phase conc
## Build backlog
- [x] P1 lock-lifetime hardening: prctl PDEATHSIG + ppid race check + SIGTERM handler →
teardown funnel + signal.alarm(3600) hard deadline; .drone.yml setsid/trap wrap;
PEP 446 comment on lock open()
- [x] P2 flock-probe janitor: acquire_app_lock(domain) at register_run_app's call site;
janitor probes per-domain lockfiles (acquired→reap under probe lock, held→leave,
>120min mtime→warn); delete registry symbols
- [x] P3 per-run ABRA_DIR: /var/lib/cc-ci-runs/<build>/abra with servers+catalogue symlinks,
fresh recipes/; fetch_recipe = plain clone; delete acquire_recipe_lock; route harness
recipe paths through ABRA_DIR
- [x] P4 config cleanup: remove concurrency.limit from .drone.yml; maxTests is the single knob
- [x] tests/concurrency suite (19 cases, real-kernel flock, explicit invocation only)
- [x] P5 docs/concurrency.md rewrite to the new model
- [ ] M1 claim (branch complete, both suites + lint green)
- [ ] M2: merge to main after M1 PASS, push build green, live verification ad
## Adversary findings
### [adversary] CONC-A1 — double-!testme same domain corrupts the shared deploy-count file (M2(c) FAIL)
**Severity:** blocks M2(c). Both runs of a same-domain double-!testme go RED.
**Root cause (two coupled defects, one shared root):**
1. The DG4.1 deploy-counter file is keyed by DOMAIN in the *shared* system tempdir, NOT per-run:
`run_recipe_ci.py:930 countfile = /tmp/ccci-deploys-<domain>`. P3 isolated `ABRA_DIR` per run
but this per-run state file was missed — it predates the restructure (ef44d46) and the OLD
recipe-flock used to serialize same-recipe runs end-to-end, incidentally masking it.
2. `lifecycle.deploy_app()` calls `_record_deploy()` (lifecycle.py:250) BEFORE
`acquire_app_lock(domain)` (lifecycle.py:254, introduced by P2 b302f3a). So the counter
increment happens OUTSIDE the serialization window — a second same-domain run bumps the
shared counter before it ever blocks on the lock.
**Observed (live, builds 279 + 281, immich PR#2, same domain immi-ad3e33, 2026-06-10T05:04Z):**
- Lock serialization itself WORKS: 281 logged `== app lock: ... in flight — waiting ==` at 2s,
then `== app lock: acquired ==` at 194s — exactly when 279 exited (279 finished 05:07:35).
- 279 RED: `!! deploy-count 2 != 1 (DG4.1 violation)`. The `2` = 281's pre-lock `_record_deploy`
(fired ~2s, before 281 blocked) polluting the shared counter 279 was actively using.
- 281 RED: `FileNotFoundError: /tmp/ccci-deploys-immi-ad3e33...` at run_recipe_ci.py:1213 —
279's end-of-run `os.remove(countfile)` (line 1215) deleted the shared file out from under 281,
whose single `_record_deploy` had already fired at 2s and never recreates it.
- Control: isolated immich (build 275, same fixed wrapper) → `deploy-count = 1`, GREEN. So this
is concurrency-specific, not a pre-existing immich/wrapper issue.
**Repro:** two `!testme` comments on the same recipe PR (same domain) in quick succession on the
deployed main harness → both builds RED (one DG4.1 false-violation, one FileNotFoundError).
**Fix direction (Builder owns):** key the deploy-counter per RUN, not per domain — e.g. put it in
`/var/lib/cc-ci-runs/<build>/` (alongside the per-run artifacts) or include the build/run id in the
filename, and export that path via `CCCI_DEPLOY_COUNT_FILE`. Per-run keying fixes BOTH defects at
once (no cross-run pollution; no shared remove). Moving `_record_deploy()` after `acquire_app_lock`
alone is INSUFFICIENT — the shared `os.remove`/`FileNotFoundError` collision survives. Add a
tests/concurrency case: two same-domain runs serialized on the app lock → each sees its own
deploy-count, neither removes the other's file (this is the gap vs the 19 planned cases — case 4
serialises acquire but never asserts deploy-count isolation across the two).
**Closure:** adversary-owned. Re-test the (c) double-!testme live (both GREEN, visible block line,
zero leakage) + the new unit case before this clears. Only I close it.
**CLOSED @2026-06-10T09:0xZ** — fix b6e12ef (run-keyed state files via `_run_state_path`) merged
139e319. Verified by me: (a) code cold-verified + mutation-proven (reverting to domain-keying fails
all 3 test_run_state cases); (b) suites green cold (unit 138, concurrency 23); (c) LIVE re-run
builds 290+291 (same immich domain immi-ad3e33) BOTH SUCCESS — 291 logged the block line
(`in flight — waiting``acquired`), both read `deploy-count = 1` (290 no longer false-2; 291 no
longer FileNotFoundError), zero leakage after (0 procs / 0 apps / 0 services / 0 volumes / 0 secrets
/ no held locks). Full evidence in REVIEW-conc M2(c) PASS.

23
BACKLOG-rcust.md Normal file
View File

@ -0,0 +1,23 @@
# BACKLOG — sub-phase rcust
## Build backlog
- [ ] P1.1 `runner/harness/meta.py`: KEYS registry (14 keys + 3 deprecated) + `load(recipe) -> RecipeMeta`
- [ ] P1.2 migrate readers L1L6 to `meta.load()` (orchestrator loads once, passes down)
- [ ] P1.3 mumble private constants → underscore-prefixed (`_WELCOME_TEXT_MARKER`, `_MAX_USERS`) + fix importers
- [ ] P1.4 `tests/unit/test_meta.py` (all-recipes-load-clean, MetaError cases, defaults, R2 proof)
- [ ] P1.5 `scripts/gen-meta-docs.py` + doc-sync unit test
- [ ] P2a compose.ccci.yml first-class (auto-copy + auto-chaos); strip ghost/discourse boilerplate
- [ ] P2b install-time deps only; migrate lasuite-docs; delete setup_custom_tests.sh machinery
- [ ] P2c SKIP_GENERIC meta key deleted; env form documented dev-only + loud warning in CI runs
- [ ] P2d conftest cleanup: delete deployed/deployed_app (+app_domain if unused); consolidate deps fixture; migrate 6 lasuite test files
- [ ] P3 HookCtx + convert all hook call sites + migrate in-repo users + unit tests
- [ ] P4 discovery placement rule + op_state/deps fixtures + migrate hand-parsers
- [ ] P5 customization manifest (print block + results.json key) + unit tests
- [ ] P6 docs rewrite (recipe-customization.md §8, testing.md, enroll-recipe.md)
- [ ] M1 pre-claim: run `pytest tests/concurrency -q` once to prove untouched
- [ ] M2 prep: build baseline matrix (21 recipe dirs, expected outcomes) BEFORE merging — commit to STATUS-rcust.md
## Adversary findings
(Adversary-owned section)

165
JOURNAL-conc.md Normal file
View File

@ -0,0 +1,165 @@
# JOURNAL — sub-phase conc (Builder, append-only)
## 2026-06-10 — bootstrap
Read concurrency-restructure-full-plan.md (SSOT) + plan.md §6.1/§7/§9. Oriented on the code:
- `runner/harness/lifecycle.py` — recipe flock (l.46), registry (l.6597), deploy_app
registration (l.283), teardown unregister (l.723), three-way janitor (l.726).
- `runner/run_recipe_ci.py``acquire_recipe_lock` call site (l.843), `fetch_recipe` (l.140,
rm-rf + reclone of the shared tree), janitor call sites (l.600 quick, l.932 cold).
- `.drone.yml` — recipe-ci step runs `cc-ci-run runner/run_recipe_ci.py` bare (P1 wraps it),
`concurrency.limit: 2` (P4 removes).
- Greps for P3 fallout: `~/.abra/recipes` referenced in abra.py (recipe_checkout,
has_lightweight_version_tags, recipe_head_commit, recipe_versions), generic.py:28,
lifecycle.prepull_images, run_recipe_ci (fetch_recipe, snapshot_recipe_tests, comment),
warm_reconcile.py:202 (runs OUTSIDE per-run context — keeps default), and
tests/ghost+discourse install_steps.sh (`${HOME}/.abra/recipes/...` — these run INSIDE a
run and copy compose.ccci.yml into the deploy tree, so they must resolve the per-run dir).
- `~/.abra/servers/...` paths are unaffected by design (servers/ is symlinked to the canonical
/root/.abra/servers, so both resolutions land on the same file).
Working setup: state files on main in this clone; code on branch `restructure/concurrency`
via a git worktree at ../cc-ci-conc; test runs on the cc-ci host via /root/builder-clone
(`cc-ci-run -m pytest ...`, `nix develop .#lint`).
## 2026-06-10 — P1P4 landed on restructure/concurrency
- P1 b492f99: harness/lifetime.py (PDEATHSIG+ppid recheck, SIGTERM/SIGALRM→SystemExit funnel
with re-entrancy guard, alarm(3600)); main() installs first; both finally blocks mark
begin_teardown(); .drone.yml setsid+trap wrap. Live smoke on cc-ci (cc-ci-run /tmp/p1-smoke.py):
TERM→rc=143+finally; ALRM→rc=142+finally+deadline log; parent-kill→child TERM'd, teardown ran.
- P2 b302f3a: acquire_app_lock + _probe_and_reap + janitor rewrite; registry deleted. Live smoke
(/tmp/p2-smoke*.py): held lock → "live concurrent run, leaving it", reaped=[]; killed holder →
reap exactly once + lockfile unlinked; waiter blocked during probe-held reap, then re-acquired
on the FRESH inode (probe confirmed held by waiter). Note: a select()-on-fd readline artifact
in my smoke script initially looked like a failure — kernel state was verified directly.
Unlink/recreate race guarded on BOTH sides via fstat/stat st_ino identity checks.
- P3 17ebdf3: per-run ABRA_DIR. Verified abra CLI honors $ABRA_DIR on-host (skeleton probe:
FATAs only on empty servers/; with servers+catalogue symlinks + recipes/ it works and even
auto-clones recipes for `app ls` resolution into the per-run dir). p3-smoke: setup + fetch of
custom-html-tiny landed in /tmp/p3runs/9999/abra/recipes, head commit + versions readable via
abra.recipe_dir(). install_steps.sh path fix justified in DECISIONS.md (conc P3 entry).
Pre-existing observation (NOT mine, unchanged): `abra app ls -S -m -n` currently FATAs
"unable to resolve '0cc57a5a'" under the DEFAULT abra dir too → janitor's abra discovery
yields [] and the docker-service sweep carries discovery. Out of this phase's scope.
- P4 91d3cc7: concurrency.limit removed; maxTests comment states single-knob + new model.
One stale comment line (.drone.yml l.39 "concurrency.limit=2 below") folds into P5.
All four commits: tests/unit 138 passed + lint PASS before each. Next: tests/concurrency suite.
## 2026-06-10 — tests/concurrency (84d90fb) + P5 (d3fe9e2) + M1 claim (e8e52cf)
- Suite: 20 tests / 19 plan cases, all real-kernel (helpers.py subprocesses hold real flocks,
install real prctl/alarm guards; CCCI_APP_LOCK_DIR sandboxes /run/lock; HelperPool reaps every
helper + recorded grandchildren). First full run on cc-ci: 20 passed in 9.96s, zero flakes in
3 repeat runs during the P5 verification re-runs.
- Design notes for the Adversary's blind-spot hunt (my own known limits):
- case 8 (two janitors) uses threads in one process — valid because flock conflicts are
per-open-file-description, and overlap is forced via a Barrier + 2s slow teardown stub.
- case 14 relies on reparent-to-pid-1 (true on the cc-ci host; would need adjustment in a
subreaper environment — marked NEVER_REPARENTED visibly if so).
- cases 5-12 stub teardown_app (recording) — janitor probe/reap ordering is what's under
test, not teardown internals (covered by Phase-1 e2e + M2 live checks).
- M1 claimed at e8e52cf; full verification recipe in STATUS-conc.md (WHAT/WHERE/HOW/EXPECTED).
## 2026-06-10 — M2: merge + live verification (a)
- Merge: bb5eb3d (--no-ff) pushed; push build 266 (self-test lint+hello) SUCCESS.
- (a) cancel-mid-run: !testme on immich#2 → build 267 (custom) running on the NEW harness —
log shows the setsid/trap wrap + "== per-run ABRA_DIR: /var/lib/cc-ci-runs/267/abra ==";
lock /run/lock/cc-ci-app-immi-ad3e33...lock held by pid 636902; 4 immich services up.
Canceled via drone API 04:42:07Z (HTTP 200, build status "killed"). Result: harness pid
GONE (no leaked python — the old §8.1 gap is closed), immich services 0, volumes 0,
secrets 0, .env 0 — the SIGTERM funnel ran the run's own teardown (better than the plan's
minimum, which allowed the janitor to do the reaping). Lock RELEASED (lockfile present but
unheld — tidy-swept by the next janitor, to be observed during (b)).
- (b) triggered 04:46:53Z: !testme immich#2 (comment 14287) + plausible#3 (14288) in parallel.
## 2026-06-10 — M2(b) round 1: green runs, poisoned exit code → wrapper fix
- Builds 268 (immich#2) + 269 (plausible#3) ran in PARALLEL on the new harness: both logs end
with all-tiers-pass RUN SUMMARY (level=4, deploy-count 1/1) and the host shows ZERO leakage
after (no harness processes, no immi/plau services/volumes/secrets, only unheld lockfiles).
Both steps nevertheless exited 1: the P1 EXIT trap's kill of the already-gone process group
returns ESRCH under the runner's `set -e` shell — a GREEN run reported failure.
- Reproduced minimally on-host (`sh -e` and `bash -e`: rc=1 on a clean exit with the old trap).
Fix e1c4198 (capture rc; `trap - TERM EXIT`; `|| true` on the trap kill) verified on-host:
green rc=0, red rc=7 propagated, TERM→wrapper forwards to child, exits 143. Merged to main
b7a009c; push builds 272-274 green. Adversary notified via inbox.
- (b) re-triggered on the fixed wrapper 04:56:10Z (immich#2 + plausible#3).
## 2026-06-10 — M2(b) PASS + (c) triggered
- (b) round 2 on fixed wrapper: builds 275 (immich#2) + 276 (plausible#3) ran in PARALLEL,
BOTH status=success (drone API). Host after: 0 python harness processes, 0 immi/plau
services/volumes/secrets/.envs — zero leakage. (d) satisfied by 275 (full green immich e2e).
Leftover unheld lockfiles present by design (tidy-swept at next janitor).
- (c) double-!testme on immich#2: two comments at 05:03:58Z → two custom builds, same run
domain immi-ad3e33 → exactly one must block on the app lock with the visible log line.
## 2026-06-10 — CONC-A1: (c) failure root-caused + fixed (run-keyed state files)
- (c) round 1 = builds 279+281, both RED. Root cause (independently also found+filed by the
Adversary as CONC-A1 while I was mid-diagnosis — same conclusion from both loops): the four
run-scoped state files (deploys/opstate/deps/depskip) were DOMAIN-keyed in shared /tmp;
281's main()-preamble + pre-lock _record_deploy fired before it blocked on the app lock →
279 read deploy-count 2 (false DG4.1 RED); 279's end-of-run os.remove deleted the shared
countfile → 281 crashed FileNotFoundError at its own read. Lock serialization itself worked
(281: waiting @+2s, acquired @+194s = 279's exit). Masked pre-restructure by the
end-to-end recipe flock.
- Fix b6e12ef on branch, merged to main 139e319: _run_state_path() keys all four by
run id + harness pid; consumers were always env-fed (CCCI_*_FILE), so domain keying was
never load-bearing. Both cleanup sites already remove all four on normal exit.
- New tests/concurrency/test_run_state.py (suite now 23): path invariants + real-process
CONC-A1 interleaving via helpers.py `deploy-count-run` (countfile init → pre-lock
_record_deploy → acquire → gated read). Teeth verified: under simulated shared keying the
regression test FAILS (host run: 3 failed); with the fix: 23 passed + 138 unit + lint PASS.
- Next: push build green → re-run (b)+(d), then (c), then (a) per the VETO's conditions.
## 2026-06-10 — M2 re-verification on CONC-A1-fixed main (139e319)
- Push builds 283/284/285 (branch fix, merge, inbox) all green.
- (b)+(d) round 3 (comments 14299/14300, 08:17:35Z): builds 287 (immich#2) + 288 (plausible#3)
BOTH success, started simultaneously 08:17:40Z (parallel), finished 08:21:06/08:21:13.
Both logs: deploy-count = 1 (expect 1), level=4. Host after: pgrep -f 'run_recipe_c[i]' → no
match (earlier "2" was pgrep self-match of the ssh cmdline); immi/plau services/volumes/
secrets/server-envs all 0. Zero leakage. (d) satisfied by 287 (full green immich e2e on the
final harness code).
- (c) round 2 triggered 08:22:13Z: comments 14303+14304 on immich#2 (same domain immi-ad3e33).
## 2026-06-10 — M2(c) PASS round 2 (builds 290+291) + (a) re-run triggered
- (c) round 2: builds 290 (08:22:30→08:46:05) + 291 (08:22:33→08:49:23) BOTH success.
291 log: "== app lock: another run of immi-ad3e33... in flight — waiting ==" at +1s,
"acquired" at +1411s = exactly 290's exit. Both: deploy-count = 1 (expect 1), level=4.
Slowness was an immich-ML healthcheck flake (Adversary cross-confirmed live via lslocks:
one holder pid 739163, one waiter pid 739341 on the same lock inode — serialization observed
in the kernel lock table); ML converged inside the 1500s window, both runs green anyway —
no clean re-run needed.
- After both: no harness procs (pgrep run_recipe_c[i] empty), 0 immi/plau services/volumes/
secrets/server-envs. Unheld lockfile remains by design (tidy-swept at next janitor probe).
- (a) re-run on fixed harness: !testme immich#2 comment 14307 @08:50:02Z; will cancel mid-run
via drone API once the deploy is in flight, then check pid/lock/leakage + janitor reap.
## 2026-06-10 — M2(a) re-run PASS (build 295) + M2 claim
- (a) on fixed harness: build 295 (comment 14307 @08:50:02Z) canceled @08:51:05Z (HTTP 200)
while mid-deploy (lock held by pid 763099, 4 immich services converging). Harness pid GONE
@08:51:15Z — the SIGTERM funnel ran the run's own teardown inside 10s; build status=killed;
lock released (lslocks empty); services/volumes/secrets/envs all 0. Zero leakage, no janitor
required.
- Adversary lifted the CONC-A1 VETO @09:05Z with its own M2(c) PASS (290/291 cold-verified,
kernel-lock-table serialization observation). Remaining for DONE: formal M2 claim (this
commit) + Adversary cold re-check of (a)/push-builds.
- M2 claimed in STATUS-conc.md with consolidated (a)-(d) evidence + cold re-check recipe.
## 2026-06-10 — M2 PASS → ## DONE
- Adversary M2 PASS @08:55Z (review 9987fba): all 7 claim items cold-confirmed, both M2-found
fixes verified, guardrails honored, no open veto. Parent-sha typo in my claim noted by the
Adversary (139e319^1 = 2173894, not 4ad55ed) — corrected in STATUS.
- ## DONE written to STATUS-conc.md. Phase conc complete: one mechanism (per-app-domain flock),
per-run ABRA_DIR isolation, flock-probe janitor, lifetime guards + 60-min deadline, single
concurrency knob, spec rewritten, 23-test real-kernel suite. Two live-found fixes along the
way: wrapper exit-code under set -e, CONC-A1 run-keyed state files.

10
JOURNAL-rcust.md Normal file
View File

@ -0,0 +1,10 @@
# JOURNAL — sub-phase rcust (Builder)
## 2026-06-10 bootstrap
Read phase plan (recipe-custom-restructure-full-plan.md), plan.md §6.1/§7/§9, and the reference
spec docs/recipe-customization.md @ 76a4b6b in full. Created phase state files. Work branch will
be `restructure/recipe-custom` off main @ 76a4b6b. Starting P1: reading the six current loaders
(run_recipe_ci.py::_load_meta, conftest.py::_recipe_meta, lifecycle.py::_recipe_extra_env,
lifecycle.py::_recipe_meta_flag, deps.py::declared_deps, canonical.py::is_canonical_enrolled)
before writing harness/meta.py.

442
REVIEW-conc.md Normal file
View File

@ -0,0 +1,442 @@
# REVIEW-conc.md — Adversary ledger, concurrency-restructure phase
Append-only. Verdicts: `<gate>: PASS @<ts>` + evidence, or `FAIL` + [adversary] finding in
BACKLOG-conc.md. SSOT for what is verified: /srv/cc-ci/cc-ci-plan/concurrency-restructure-full-plan.md.
## 2026-06-10T04:00Z — Adversary online; baseline pre-read (no gate pending)
Pulled main @5b65c6c. No STATUS-conc.md, no `restructure/concurrency` branch — nothing claimed yet.
Pre-read the CURRENT system (docs/concurrency.md @5b65c6c + lifecycle.py/run_recipe_ci.py) to
anchor my later diff review in the as-is code, not the Builder's narrative.
Current-system facts I will hold the restructure against:
- Registry symbols slated for deletion (will grep for dangling refs at M1):
`register_run_app` (lifecycle.py:69, call site :283), `unregister_run_app` (:78, call sites :723, :766),
`_run_owner_state` (:83), `ACTIVE_RUN_DIR` (:43), `CCCI_JANITOR_MAX_AGE` (janitor :738),
`acquire_recipe_lock` (:46, call site run_recipe_ci.py:843), `RECIPE_LOCK_DIR` (:42).
- Must survive untouched: `RUN_APP_RE` (lifecycle.py:26) allowlist semantics (warm/canonical apps
never probed), `services_converged()` paused-is-settled logic, docker-service sweep discovery,
`teardown_app(verify=False)` idempotence.
- M1 verification plan (cold, my clone): checkout branch; `pytest tests/unit -q`,
`pytest tests/concurrency -q`, `scripts/lint.sh`; full diff review hunting: probe-vs-acquire
ordering races, signal-handler reentrancy (SIGTERM during teardown / SIGALRM during SIGTERM),
teardown-during-teardown, lock-fd lifetime (object dropped → GC closes fd → lock silently
released), symlinked servers/ write conflicts, janitor unlink-vs-reacquire race (unlink while a
waiter blocks on the old inode → two "held" locks on different inodes for one domain),
PDEATHSIG-after-fork ordering (prctl before ppid check), alarm(0) vs teardown duration,
setsid wrapper trap semantics under drone cancel, test-suite blind spots vs the 19 planned cases.
- Tests/concurrency must NOT be wired into the default `pytest tests/unit` gate (plan decision).
- M2 (post-merge, live): cancel-mid-run leak check, parallel immich#2+plausible#3, double-!testme
same PR blocks visibly, one full green run. NEVER merge/push recipe mirror repos.
No verdict yet — waiting for Builder bootstrap/claim.
## 2026-06-10T04:05Z — cold-verify environment established (prep, no gate)
Builder seeded STATUS/BACKLOG/JOURNAL-conc; STATUS says P1 in flight, no gate claimed. Mapped the
test-execution environment I'll use for the M1 cold run so a time-sensitive gate isn't spent
debugging tooling:
- Local VM devshell (`nix develop`) has only lintTools (no pytest). So pytest does NOT run here.
- pytest 8.3.3 + playwright live in the host `pyEnv` (nix/modules/harness.nix) exposed as
`cc-ci-run` on cc-ci. `cc-ci-run -m pytest <path> -q` works as the real harness interpreter
(verified: `cc-ci-run -c "import pytest" -> 8.3.3`).
- `.drone.yml` lint stage runs `nix develop .#lint --command bash scripts/lint.sh`.
- COLD M1 PLAN: fresh `git clone`/checkout of `restructure/concurrency` into a throwaway dir ON
cc-ci → `cc-ci-run -m pytest tests/unit -q` + `cc-ci-run -m pytest tests/concurrency -q` +
`nix develop .#lint --command bash scripts/lint.sh`, all from that clean checkout (not the
Builder's working tree). Then adversarial diff review per my baseline hit-list.
- Baseline `.drone.yml` on main is still the pre-restructure version (concurrency.limit=2,
acquire_recipe_lock / /run/cc-ci-active registry referenced) — confirms P1/P4 edits are
branch-only so far. Good.
## 2026-06-10T04:23Z — early pre-review of P1+P2 (branch @b302f3a, NO gate claimed — NOT a verdict)
Builder has pushed P1 (b492f99) + P2 (b302f3a) to restructure/concurrency; P3/P4/P5/tests still
pending, so M1 is not claimable and this is NOT a PASS — it's pre-review to front-load the M1 diff
audit and avoid re-doing it under gate time pressure. Read code/diff + git only; did NOT read
JOURNAL (anti-anchoring intact). I actively tried to break the following and each concern was
REFUTED:
1. **Green-on-red via the .drone.yml EXIT trap** (my lead hypothesis). The wrapper is
`setsid cc-ci-run … & PID=$!; trap 'kill -TERM -- -$PID' TERM EXIT; wait $PID`. I worried the
EXIT trap's final `kill` status would override the harness exit code and mask a failing run.
EMPIRICALLY TESTED (4 bash repros incl. failing harness with a lingering group member that
makes kill succeed=0): bash PRESERVES the pre-trap exit status when the EXIT trap doesn't call
`exit`. Exit code propagates correctly in all cases (RED stays RED, GREEN stays GREEN). Refuted.
2. **P2 unlink/reacquire inode race** (janitor unlinks a reaped orphan's lockfile while a new run
blocks on the old inode). Handled: both acquire_app_lock and _probe_and_reap recheck
`fstat(fd).st_ino == stat(path).st_ino` after acquiring and retry/bail on mismatch — a lock on
an unlinked (anonymous) inode is never treated as authoritative, and the path's lockfile is
never unlinked out from under a newer run. Refuted.
3. **Half-reaped/new-app coexistence.** Reap runs WHILE HOLDING the probe lock; a new same-domain
run blocks in acquire_app_lock until reap completes. The pre-deploy window (lock held, app not
yet created) is covered: the stale-lockfile sweep sees the held lock (BlockingIOError) and
leaves it. Refuted.
4. **Signal mid-normal-teardown aborting cleanup.** begin_teardown() is the FIRST line of BOTH
finally blocks (run_recipe_ci.py:663 run_quick, :1134 main); the _funnel_handler swallows
(logs+returns) any SIGTERM/SIGALRM once tearing_down is set, so a second signal can't abort the
cleanup the first asked for. install_lifetime_guards() is the FIRST statement of main() (:829),
before any abra/lock call, with prctl→ppid==1 recheck in the correct order. Refuted.
Open items to confirm AT M1 (cold, full suite) — NOT defects, just unverified-until-then:
- `datetime` import removed from lifecycle.py along with _stack_age_seconds — grep for any
remaining datetime use (ruff would catch an undefined name; confirm import truly orphaned).
- `_stack_name` / age-fallback deadcode after the janitor rewrite — confirm no dangling refs.
- Registry-symbol deletion is only PARTIAL on this commit: acquire_recipe_lock still present
(P3 deletes it); register/unregister/_run_owner_state/ACTIVE_RUN_DIR/CCCI_JANITOR_MAX_AGE are
gone — full dangling-ref grep belongs at M1 once P3 lands.
- setsid-fork edge: if `setsid` ever forks (only when it's a pgrp leader; not the case for a
backgrounded job in a non-job-control drone shell), $PID would be the intermediate and the
harness would reparent to ppid==1 and self-abort. Live-verify the trap+cancel path at M2(a).
- begin_teardown is process-global module state (lifetime._state) — fine for one harness process;
the tests/concurrency suite must not import-share it across in-process cases (verify at M1).
## 2026-06-10T04:32Z — pre-review P3+P4 (branch @91d3cc7, NO gate claimed — NOT a verdict)
Builder pushed P3 (17ebdf3 per-run ABRA_DIR) + P4 (91d3cc7 config cleanup). tests/concurrency +
P5 docs still pending, so M1 still not claimable. Continued the front-loaded diff audit (code/git
only; JOURNAL still unread). Findings — all CLEAN:
- **Dangling-ref grep across runner/bridge/dashboard/nix = ZERO hits** for all 9 deleted symbols:
acquire_recipe_lock, register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR,
CCCI_JANITOR_MAX_AGE, RECIPE_LOCK_DIR, _stack_age_seconds, _registry_path. The orphaned
`datetime` import is also gone from lifecycle.py. Clean deletion.
- **Path centralization**: all `~/.abra/recipes/<recipe>` literals replaced by `abra.recipe_dir()`
(resolves `$ABRA_DIR else ~/.abra`) across abra.py (recipe_checkout, has_lightweight_version_tags,
recipe_head_commit, recipe_versions), generic._recipe_dir, lifecycle.prepull_images,
snapshot_recipe_tests, fetch_recipe. prepull's env_path stays canonical `~/.abra/servers/...`
which is correct (servers/ is the shared symlink target).
- **Ordering verified** (main(), the only structural risk): install_lifetime_guards() is the FIRST
stmt (873); between it and setup_run_abra_dir() (891) there are ONLY env reads + a print — no
abra call; ABRA_DIR is exported at 891 BEFORE fetch_recipe (892) and before the first path-helper
recipe_head_commit (895). The `--quick` dispatch (run_quick, ~908) is AFTER 891, so the quick lane
inherits the per-run ABRA_DIR too. No tree is touched before ABRA_DIR is set.
- **Manual-run isolation**: rid=="manual" → "manual-<pid>" so two hand-runs don't share a tree.
Open items to confirm AT M1 (cold) — not defects:
- setup_run_abra_dir symlink idempotency: `if not os.path.islink(link): os.symlink(...)` — if a
NON-symlink file pre-exists at servers/catalogue (reused run dir from a crashed partial), symlink
raises FileExistsError. Low risk (fresh run-id per Drone build) but worth a glance.
- CCCI_SKIP_FETCH=1 now `rm -rf dest` + copytree(canonical, dest, symlinks=True) — confirm the
--quick rollback-proof staging tests still pass (they set CCCI_SKIP_FETCH).
- tests/{ghost,discourse}/install_steps.sh RECIPE_DIR=${ABRA_DIR:-$HOME/.abra} mechanical path fix
— confirm it changed NO assertion/gate (guardrail: never weaken recipe-test gates). Diff-check.
Net: the entire P1P4 diff has been pre-audited and is clean against my break-it hit-list. M1 cold
run, once claimed (after tests/concurrency + P5 land), reduces to: fresh checkout on cc-ci →
`cc-ci-run -m pytest tests/unit -q` + `cc-ci-run -m pytest tests/concurrency -q` + lint, plus a
focused review of only the tests/concurrency suite (vs the 19 planned cases) and the P5 doc delta.
## M1: PASS @2026-06-10T04:38Z — implementation verified (branch restructure/concurrency @d3fe9e2)
Verdict formed from the plan (SSOT), the code/git, the STATUS claim's verify recipe, and my own
COLD acceptance run — WITHOUT reading JOURNAL first (anti-anchoring honored; noting here that I had
NOT consulted JOURNAL-conc at verdict time).
COLD ENVIRONMENT: fresh `git clone --branch restructure/concurrency` into /tmp/adv-m1 on cc-ci
(NOT the Builder's tree); `git rev-parse HEAD == d3fe9e26bb0fbaedb37383539ba3973bc1c80aff` (matches
claim), `git status` clean. Ran via the host `cc-ci-run` pyEnv (pytest 8.3.3 + playwright) and the
pinned `.#lint` devshell.
ACCEPTANCE RESULTS (expected → observed):
- `cc-ci-run -m pytest tests/unit -q` → 138 passed in 4.72s ✓ (claim: 138 passed)
- `cc-ci-run -m pytest tests/concurrency -q` → 20 passed in 9.91s ✓ (claim: 20 passed)
- `nix develop .#lint --command bash scripts/lint.sh``lint: PASS`
- `pytest tests/unit --collect-only` concurrency items → 0 ✓ (suite NOT in default gate)
- dangling-ref grep (register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR,
CCCI_JANITOR_MAX_AGE, acquire_recipe_lock, RECIPE_LOCK_DIR, _stack_age_seconds) over
*.py/*.nix/*.yml/*.sh → ZERO hits outside docs/ ✓
GATE-INTEGRITY (guardrails honored):
- `RUN_APP_RE` regex unchanged (lifecycle.py:26, identical pattern); warm/canonical apps still
never become probe candidates (test_11 asserts no lockfiles even created for warm names).
- `services_converged()` / paused-is-settled / `backup_app()` waits: NOT in the code diff — all
RUN_APP_RE/services_converged/paused diff hits are docs/concurrency.md prose (P5 rewrite).
- `teardown_app` ordering untouched; only its trailing unregister call removed (registry gone).
- Only `tests/<recipe>/` change is the mechanical `RECIPE_DIR=${ABRA_DIR:-$HOME/.abra}/...` line
in ghost+discourse install_steps.sh — NO assertion/gate touched (diff-confirmed). Guardrail
"never weaken recipe-test gates / touch tests/<recipe>/ content" honored.
- P4: `concurrency.limit` block removed from .drone.yml; drone-runner.nix comment makes
DRONE_RUNNER_CAPACITY the single knob.
ADVERSARIAL DIFF REVIEW (P1P4 pre-audited in the two notes above; refuted: green-on-red exit-code
masking [empirically tested], unlink/reacquire inode race [fstat==stat identity recheck],
half-reaped coexistence [reap-under-probe-lock], signal-mid-teardown reentrancy [begin_teardown
first line of both finally blocks], guard/ABRA_DIR/fetch ordering [no abra call pre-export]).
TEST-SUITE AUDIT vs the 19 plan cases: real kernel flocks, NEVER mocked (only teardown_app +
abra-discovery stubbed, both disclosed). Coverage complete: cases 14 test_locks, 512
test_janitor, 1316 test_lifetime, 1719 test_abra_dir, +test_18b (manual-pid isolation) = 20.
Assertions are substantive, not tautological: exact funnel exit codes 142/143 (test_15/16),
reap-vs-new-run timestamp ordering + fresh-inode `lock_state=="held"` (test_7), two-janitor
arbitration via separate open()s (test_8 — valid: flock binds the open file description, so
threads-with-distinct-fds model processes), long-held mtime-backdate flag-not-steal (test_10),
PEP 446 fd non-inheritance with a surviving child (test_3), divergent per-run trees + canonical
untouched (test_18).
INDEPENDENT PROBE (my own driver, NOT the Builder's helpers.py): drove the real
`lifecycle.acquire_app_lock` from a standalone script with a sandbox CCCI_APP_LOCK_DIR on cc-ci →
state `held` after acquire; a second acquirer BLOCKED while the first held (no ack2 after 1.5s);
after `SIGKILL` of the holder the second acquired within 10s (kernel auto-release). Core invariant
confirmed against the real code, not just the Builder's tests.
NON-BLOCKING NOTES (carry to M2 live-verify; none gate M1):
- setsid-fork edge in the .drone.yml trap wrapper: if `setsid` ever forks (only when it's a pgrp
leader — not the case for a backgrounded job in a non-job-control drone shell), $PID would be the
intermediate and the harness could reparent (ppid==1) and self-abort. MUST be live-verified by
the actual drone-cancel path at M2(a) — the plan already flags this ("verify drone exec runner
signal delivery; the trap must fire on drone cancel"). Not unit-testable here.
- End-of-janitor stale-lockfile tidy sweep (appless leftover lockfile unlink) is not directly
covered by a named test (not one of the 19); low risk (tidiness only). Noted, not a defect.
- test_14 (ppid race) depends on the helper reparenting to pid 1; under a subreaper it marks
NEVER_REPARENTED and FAILS VISIBLY (never false-passes). Passed in this env.
CONCLUSION: M1 — implementation verified — PASS. M2 (merge to main + live verification ad) is
unblocked. Reminder for both loops: recipe-mirror PRs are !testme targets only — never merge/push
them. (After this verdict I may consult JOURNAL-conc to contextualize, per §6.1.)
## 2026-06-10T04:49Z — M2 merge integrity pre-check (M2 NOT yet claimed — not a verdict)
Builder merged the branch to main (merge commit `bb5eb3d`, 2 parents 83a6c6e∘d3fe9e2, no force)
after my M1 PASS, and is mid-M2 live verification (journal: M2(a) cancel-mid-run evidence, (b)
parallel runs triggered). No `claim(conc): M2` commit yet; STATUS-conc still shows the stale M1
line (Builder's file — will update at the M2 claim). Independent merge check:
- `git diff bb5eb3d d3fe9e2 -- runner/ .drone.yml docs/concurrency.md tests/ nix/` = EMPTY → the
merge preserved EXACTLY the code I cold-verified at M1. No conflict-resolution drift introduced.
- `git merge-base --is-ancestor d3fe9e2 bb5eb3d` = true.
So deployed main == M1-verified tree. At the M2 claim I therefore re-verify only LIVE behavior +
the push build, not the code again:
push build green; (a) cancel mid-run → no leaked python/lock, next janitor reaps the app, zero
leakage; (b) two parallel !testme (immich#2 + plausible#3) → both green, zero leakage; (c)
double-!testme same PR → 2nd blocks on the app lock (visible in its drone log) then runs; (d) one
full green end-to-end run. Evidence to come from Drone build logs + cc-ci state (abra app ls /
lslocks / docker), cold from my own access path.
## 2026-06-10T05:00Z — wrapper exit-code fix verified + CORRECTION to my P1 pre-review (inbox consumed)
Consumed ADVERSARY-INBOX.md (deleted) — Builder reported an M2 live-verify finding + fix. Folded in:
**The defect (real, Builder-found, build 269 plausible#3):** the drone exec step shell is `set -e`.
On a NORMAL (green) harness exit the P1 EXIT trap still fired and its `kill -TERM -- -$PID` of the
already-exited process group returned ESRCH (exit 1), which under `set -e` poisoned the step's exit
status to 1 — a fully GREEN run (all tiers pass, level=4) reported RED.
**CORRECTION — my P1 pre-review was wrong on this point.** In my 04:23Z pre-review I claimed to have
"empirically tested" green-on-red exit-code masking and REFUTED it. That test was run with plain
`bash -c` WITHOUT `set -e` — the wrong shell mode. The real drone step runs `set -e`, where the bug
manifests. I re-ran the matrix correctly now (bash -e), reproducing the bug (old wrapper + green +
set -e → exit 1) and confirming I had the shell mode wrong. Lesson: model the EXACT runtime
(set -e) for shell-trap behavior. The Builder caught this live; I did not. Owning it.
NB the failure direction was false-RED (green reported red) — fail-safe-ish, not a green-on-red
(no failing run was ever reported green); still a real defect.
**The fix (e1c4198 on branch, merged to main b7a009c) — independently verified by me, cold under
`set -e` (the correct mode this time):**
```
setsid cc-ci-run runner/run_recipe_ci.py & PID=$!
trap 'kill -TERM -- "-$PID" 2>/dev/null || true' TERM EXIT
rc=0; wait "$PID" || rc=$?
trap - TERM EXIT
exit "$rc"
```
My 4-path matrix (all under `bash -e`, exact-shape repros):
- A green harness → step exit 0 ✓ (poisoning gone: `|| true` on the trap kill + `trap - EXIT` before exit)
- B **red harness (exit 7) → step exit 7 ✓ — NOT masked to green.** Critical false-GREEN check
PASSES: `wait || rc=$?` captures the real rc and `exit "$rc"` propagates it. The
"failing PR must report RED" gate is preserved by the fix.
- C old wrapper + green + set -e → exit 1 ✓ (bug reproduced — root-cause confirmed)
- D cancel (TERM to wrapper mid-wait) → wrapper exits 143 AND the child received TERM
(CHILD_GOT_TERM logged) ✓ — cancel-forwarding semantics unchanged; the `trap - TERM EXIT` runs
only AFTER `wait` returns (post-forward), so it can't disarm the forward during a real cancel.
Verdict on the fix: CORRECT and SAFE — resolves the false-RED poisoning without introducing
false-GREEN, and preserves cancel forwarding. Folds cleanly into the pending M2 review.
**M1 status unaffected:** M1 PASS was for the code/suites/lint/diff of d3fe9e2; this wrapper
exit-code-under-set-e is a LIVE behavior M1's checks could not exercise (the trap only runs in the
real drone exec shell). main now = d3fe9e2 + this .drone.yml wrapper fix; the fix is verified above.
Open for the formal M2 verdict: re-confirm lint green on the new .drone.yml (yamllint), the push
build green, and live (a) cancel-no-leak / (b) parallel both-green / (c) double-!testme blocks /
(d) one full green run — cold, once the Builder posts the M2 claim with evidence.
## M2(c): FAIL @2026-06-10T08:10Z — double-!testme same domain corrupts shared deploy-count → both runs RED + VETO
Proactive cold break-it probe of the live M2 evidence (M2 not yet formally `claim(conc)`'d — the
Builder's JOURNAL shows (c) "triggered" but NOT evidenced as PASS; I went straight to the Drone API
to verify the in-flight (c) runs independently, not to the JOURNAL narrative). I found a REAL defect
that breaks M2(c). Filed as BACKLOG-conc CONC-A1.
EVIDENCE (Drone API, recipe-maintainers/cc-ci, cold via /run/secrets/bridge_drone_token — my own
access path, not the Builder's word):
- (c) = builds **279 + 281**, both `event=custom PR=2 RECIPE=immich REF=a92b28d…` → SAME domain
`immi-ad3e33.ci.commoninternet.net`. Both `status=failure` (step `ci` exit_code=1).
- 281 (the blocked run): log `== app lock: ... in flight — waiting ==` @2s`== acquired ==` @194s,
which is exactly when 279's process exited (279 finished 05:07:35Z). **Lock serialisation + the
visible block line WORK** — that half of (c) is fine.
- 279 RED: `!! deploy-count 2 != 1 (DG4.1 violation)`.
- 281 RED: `FileNotFoundError: /tmp/ccci-deploys-immi-ad3e33….ci.commoninternet.net` at
run_recipe_ci.py:1213.
- Control build 275 (isolated immich, same fixed wrapper) → `deploy-count = 1`, GREEN. Confirms the
failure is concurrency-specific, NOT a pre-existing immich/wrapper regression.
ROOT CAUSE (code, confirmed):
- DG4.1 counter file is DOMAIN-keyed in shared /tmp, not per-run: `run_recipe_ci.py:930
/tmp/ccci-deploys-<domain>`. P3 isolated ABRA_DIR per run but this per-run state file was missed
(predates the restructure, ef44d46; the old recipe-flock serialised same-recipe runs end-to-end,
masking it).
- `deploy_app()` calls `_record_deploy()` (lifecycle.py:250) BEFORE `acquire_app_lock()` (:254,
introduced by P2 b302f3a) → the increment races OUTSIDE the lock. 281's single pre-lock
`_record_deploy` (@2s) bumps the shared counter 279 is using (→2, false violation), and 279's
end-of-run `os.remove(countfile)` (:1215) deletes the file under 281 → FileNotFoundError.
- Interleaving is fully reconstructed and self-consistent with the build timestamps (see CONC-A1).
This is squarely in M2(c) scope: the plan's DoD (c) requires the second run to "block … then RUN"
(implicitly green), and the phase's whole premise is "two concurrent !testme don't collide on
domain/volume/secrets." This is a domain-keyed-state collision — the restructure's narrower domain
lock no longer covers the deploy-count file. M1 (code/suites/lint/diff of d3fe9e2) is unaffected —
this is a live concurrency behavior M1's checks could not exercise; the tests/concurrency suite has
the matching blind spot (case 4 serialises acquire but never asserts deploy-count isolation across
two same-domain runs).
## VETO — M2 may NOT be marked DONE until CONC-A1 is fixed and I log a fresh (c) PASS
Forbidding `## DONE` in STATUS-conc until: (1) deploy-counter keyed per-run; (2) a tests/concurrency
case asserts same-domain deploy-count isolation; (3) live (c) re-run shows BOTH builds GREEN with
the visible block line and zero leakage; (4) (a),(b),(d) re-confirmed unaffected. Only I clear this.
(After this verdict I may consult JOURNAL-conc to contextualise — noting I had NOT read the (c)
journal reasoning before forming this FAIL; I verified from the Drone API + code directly.)
## 2026-06-10T08:20Z — CONC-A1 fix CODE-verified (veto conditions 1+2 met; 3+4 still pending — NOT cleared)
Builder fixed CONC-A1 (b6e12ef, merged main 139e319) and is re-running M2 live (a)(d). I
cold-verified the FIX CODE from my own clone + a fresh checkout on cc-ci (not the Builder's word):
- **Condition (1) per-run keying — MET.** `run_recipe_ci._run_state_path(name)` keys all four
run-scoped state files (`deploys`, `opstate`, `deps`, `depskip`) by `run_id()` + `os.getpid()`,
never domain. Grep: ZERO residual `ccci-<state>-{domain}` literals in prod code (only the
app-LOCK path stays domain-keyed, which is correct). All consumers env-read `CCCI_*_FILE`
(lifecycle:148, deps:72/155, generic:134) — no path re-derivation. Uniqueness holds even in the
manual fallback (`run_id()`→domain) because the `+pid` suffix separates two processes.
- **Condition (2) same-domain isolation test — MET, and proven non-tautological.**
tests/concurrency/test_run_state.py adds test_20/20b/20c. test_20c drives REAL processes + the
REAL lock + real `_run_state_path`/`_record_deploy`, reproducing the 279/281 interleaving: run A
reads `COUNT 1` (NOT polluted to 2 by B's pre-lock increment) and B's file survives A's remove
(no FileNotFoundError). **Mutation check (my own):** reverting `_run_state_path` to domain-keying
in a throwaway cc-ci clone → all 3 test_run_state cases FAIL (incl. test_20c). So the test
genuinely guards the fix.
- **Suites cold (fresh clone @4f6c955 on cc-ci):** unit 138 passed, concurrency 23 passed (was 20),
concurrency still NOT collected by the default `pytest tests/unit` run (0). lint not re-run here
(no .drone.yml/nix change in the fix; will confirm at the M2 claim).
**VETO NOT cleared.** Conditions (3) live (c) re-run BOTH builds GREEN + visible block line + zero
leakage, and (4) (a)/(b)/(d) re-confirmed on the fixed harness, still require the Builder's live
evidence (in flight). The code fix strongly predicts a (c) pass but M2 is a LIVE gate — I will
re-verify the (c) double-!testme cold from the Drone API once the Builder posts the M2 claim, and
only then clear the veto.
## 2026-06-10T08:43Z — live (c) round-2 (builds 290+291): serialization CONFIRMED via lslocks; delay is an immich-ML flake, NOT the restructure (not a verdict)
(b)+(d) re-passed on the fixed harness (builds 287 immich#2 + 288 plausible#3, parallel, both
success — I'll re-confirm at the M2 claim). (c) round 2 = builds 290+291 (both custom PR=2 immich,
same domain immi-ad3e33), started 08:22:30Z. I inspected the LIVE host state cold (my own ssh):
- **CORE INVARIANT DIRECTLY OBSERVED in the kernel lock table** — strongest possible proof of the
double-!testme serialization:
`lslocks`: pid 739163 (build 290) holds `WRITE` on cc-ci-app-immi-ad3e33….lock; pid 739341
(build 291) is blocked `WRITE*` on the SAME lock. Exactly one holder, one waiter, one inode.
- 290 (holder) is sleeping in `services_converged()` poll (hrtimer_nanosleep, no abra child) because
`immich-machine-learning` is stuck 0/1: its container repeatedly fails the healthcheck
(`non-zero exit (143): dockerexec: unhealthy container`, swarm restarting every 16 min). Current
attempt (08:43) has gunicorn up, health `starting` — slow/flaky ML readiness, not a deploy break.
- NOT caused by the restructure / teardown: 290's immich volumes (model-cache/postgres/uploads) +
.env are all from 290's OWN fresh deploy (08:23), not inherited from the earlier same-domain run
287. ML image present (1.36GB, no pull), host healthy (5.2Gi mem free, 65G disk). So this is an
immich-ML healthcheck flake, orthogonal to concurrency.
Bearing on M2(c): the SERIALIZATION mechanism under test is verified working live. The "both GREEN"
half of condition (3) is not yet demonstrated only because 290 is flake-blocked on immich-ML; if 290
REDs on deploy-timeout, (c) needs a clean re-run (flake, not a code fault). VETO unchanged — I still
require one clean (c) where both same-domain builds go GREEN with the block line + zero leakage.
Continuing to watch 290/291 to terminal.
## M2(c): PASS @2026-06-10T09:05Z — double-!testme same domain, CONC-A1 fixed; VETO LIFTED
(c) round-2 builds 290+291 (both `custom PR=2 immich`, same domain immi-ad3e33, on CONC-A1-fixed
main) both reached terminal **status=success**. Cold-verified from the Drone API + live host (my own
access path), not the Builder's word:
- **Both GREEN:** 290 success, 291 success (Drone API).
- **Visible block line (the (c) requirement):** 291 log —
`== app lock: another run of immi-ad3e33….ci.commoninternet.net is in flight — waiting ==`
then `== app lock: acquired … ==`. I ALSO observed the serialization directly in the kernel lock
table mid-run (lslocks: 290 held WRITE, 291 blocked WRITE* on the same inode; after 290 exited,
291 held it). Strongest possible proof of the double-!testme serialization invariant.
- **CONC-A1 regression GONE — the two exact round-1 failure points are now clean:**
- 290 (round-1 build 279 got false `deploy-count 2 != 1`) → now `deploy-count = 1 (expect 1)`,
all 5 tiers pass, level=4. Its run-keyed counter was NOT polluted by 291's concurrent pre-lock
`_record_deploy`.
- 291 (round-1 build 281 crashed `FileNotFoundError` at run_recipe_ci.py:1213) → now
`deploy-count = 1 (expect 1)`, all tiers pass, level=4, no traceback. Its own run-keyed countfile
survived 290's end-of-run remove.
- **Zero leakage after both:** 0 harness procs, 0 immich apps / services / volumes / secrets, no held
cc-ci locks. One unheld 0-byte leftover lockfile (mtime 08:46, 291's acquisition touch) — reaped
on sight by the next janitor probe, harmless by design.
- The ~20-min runtime each was an immich-machine-learning healthcheck slowness/flake (ML eventually
converged), NOT the restructure — already diagnosed in the 08:43Z note; serialization + isolation
both verified correct regardless.
**VETO LIFTED.** The CONC-A1 veto ("no DONE until CONC-A1 fixed + a fresh (c) PASS") is cleared:
conditions (1) per-run keying [code + mutation-proven], (2) same-domain isolation test
[non-tautological], and (3) live (c) both-GREEN + block line + zero leakage are ALL met. CONC-A1
closed in BACKLOG-conc.
**Still required before DONE (full M2 gate, not the CONC-A1 veto):** the Builder must post the formal
M2 claim in STATUS-conc with consolidated evidence, and I re-confirm condition (4) — specifically
**M2(a) cancel-mid-run re-run on the CONC-A1-fixed harness** (b+d already re-confirmed: builds
287+288 parallel both success on fixed main; a's only prior evidence (build 267) was on the
pre-CONC-A1, pre-wrapper-fix harness) — plus the push build green on current main. (a) re-run had
not yet appeared in Drone as of this verdict (Builder sequenced it after (c)). I will verify it cold
when it lands.
## M2: PASS @2026-06-10T08:55Z — merged + live-verified (a)(d) on final main 139e319/74ed240
Formal M2 gate verdict against the Builder's M2 claim (STATUS-conc, commit 74ed240). Formed from
the plan (SSOT), the code/git, the claim's verify recipe, and my OWN cold re-runs from my own clone
+ fresh checkouts/Drone-API on cc-ci — not the Builder's narrative. All seven claim items confirmed:
1. **Merge integrity** — `git diff 139e319 b6e12ef -- runner/ tests/ docs/ .drone.yml nix/` = 0 lines;
`b6e12ef ⊆ 139e319`; merge parents `2173894 ∘ b6e12ef`. So deployed main code == the CONC-A1 tree
I code-verified + mutation-proofed. No force-push (history linear). NB the claim mis-states the
first parent as `4ad55ed` (actual `2173894`, my M2(c)-FAIL commit) — immaterial: that's a state-
file commit, and the code-diff-empty check is authoritative.
2. **Push build green** — Drone push builds 283298 on main all `status=success`; no red push since
the merge.
3. **Suites + lint (cold, fresh clone on cc-ci)** — unit 138 passed, concurrency 23 passed
(concurrency NOT in the default unit gate), `lint: PASS` on final main 74ed240. test_run_state
mutation-proofed (reverting to domain-keying fails all 3 cases).
4. **(a) cancel-mid-run on fixed harness** — build 295 (custom immich#2): lockfile mtime 08:50:17
proves it acquired the app lock 7s in → canceled @08:51:05 MID-DEPLOY. After cancel (verified cold
~1 min later): 0 harness procs (no leaked python — old §8.1 gap stays closed), no held locks (lock
released), no immich app/.env/containers(even stopped)/services/volumes/secrets → ZERO leakage,
full teardown. Killed-step logs not API-retrievable (Drone truncates), but the end-state is the
actual test and it is clean.
5. **(b) parallel runs** — builds 287 (immich#2) + 288 (plausible#3), parallel, both
`status=success`, both `deploy-count = 1 (expect 1)`, level=4; host after = zero leakage.
6. **(c) double-!testme same PR** — builds 290 + 291 (same immich domain): both success, 291 logged
the block line then `acquired`, both `deploy-count = 1`, zero leakage. Serialization also observed
directly in the kernel lock table mid-run (lslocks). Covered in detail by my M2(c) PASS @09:05Z.
7. **(d) full green e2e** — build 287 (and 290): complete immich run, all 5 tiers pass, level=4.
Both M2-found fixes are folded in and independently verified: wrapper exit-code-under-set-e
(e1c4198/b7a009c, my 05:00Z note — red still propagates) and CONC-A1 run-keyed state files
(b6e12ef/139e319, my 09:05Z M2(c) PASS + mutation proof). The ~20-min (c) runtimes were an
immich-ML healthcheck flake (converged within DEPLOY_TIMEOUT=1500s), orthogonal to the restructure
(diagnosed 08:43Z). Unheld 0-byte leftover lockfiles are by-design (next-janitor tidy-sweep).
GUARDRAILS honored end-to-end: recipe-mirror PRs (immich#2, plausible#3) used as !testme targets
only, never merged/pushed; cc-ci main touched only by the gated merges (no force-push); no secrets in
any commit. RUN_APP_RE / services_converged / warm-canonical flows untouched (M1 diff review).
CONCLUSION: **M2 — merged + live-verified — PASS.** M1 PASS (04:38Z) + M2 PASS (here) are both fresh
in REVIEW-conc; no open VETO (CONC-A1 lifted). Per the phase DoD the Builder may now write `## DONE`
to STATUS-conc. (Post-verdict I may consult JOURNAL-conc to contextualize; I had NOT read its M2
reasoning before forming this verdict — verified from plan + code/git + Drone API + my own cold runs.)

34
REVIEW-rcust.md Normal file
View File

@ -0,0 +1,34 @@
# REVIEW-rcust.md — Adversary ledger for the recipe-customization restructure phase
SSOT for this phase: `/srv/cc-ci/cc-ci-plan/recipe-custom-restructure-full-plan.md`.
Gates: **M1** (implementation verified — branch `restructure/recipe-custom`, unit+concurrency+lint
green on cold clone, resolved-customization diff clean for all 21 recipes, adversarial diff review)
and **M2** (merged + real-CI regression sweep matching baseline matrix). DONE requires fresh PASS
for both with no open VETO.
I own this file and the `## Adversary findings` section of BACKLOG-rcust.md only.
---
## Standing watch items (what I will hunt at M1/M2)
- **Coverage loss** (cardinal risk): for every migrated recipe, old loaders' effective customization
values must equal new `meta.load()` values. Throwaway diff script over all 21 recipe dirs; any
delta = finding.
- **Assertion weakening** in `tests/<recipe>/` diffs — migrations must be mechanical only (signatures,
fixture/key renames, underscore prefixes). Any changed assert/expected value = VETO.
- **Deleted-code fallout** — dangling refs to `_recipe_meta`, `_load_meta`, `_recipe_extra_env`,
`_recipe_meta_flag`, `declared_deps`, `is_canonical_enrolled`, `OIDC_AT_INSTALL`,
`CHAOS_BASE_DEPLOY`, `SKIP_GENERIC`, `setup_custom_tests`, `deps_apps`, `deps_creds`, `deployed_app`.
- **Validation gaps** — typo'd key / wrong type / callable-on-data-key must raise MetaError, not pass.
- **R2 fixed end-to-end** — orchestrator load path delivers SCREENSHOT to screenshot.py.
- **HC2 / F2-11 integrity** — repo-local default-deny, requires_deps skip-report, generic floor
semantics all unchanged.
---
## Verdicts
_(none yet — phase just started; Builder has not yet created STATUS-rcust.md or branch
`restructure/recipe-custom`. Only the reference spec doc `76a4b6b` has landed. Awaiting first
`claim(rcust): M1` from the Builder.)_

62
STATUS-conc.md Normal file
View File

@ -0,0 +1,62 @@
# STATUS — sub-phase conc (concurrency restructure)
Plan: /srv/cc-ci/cc-ci-plan/concurrency-restructure-full-plan.md (SSOT for this phase)
## DONE
Both gates Adversary-verified fresh in REVIEW-conc.md, no open VETO:
- M1 — implementation verified: PASS @2026-06-10T04:38Z (branch @d3fe9e2)
- M2 — merged + live-verified (a)(d): PASS @2026-06-10T08:55Z (final main 139e319/74ed240)
- CONC-A1 (M2(c) live finding): fixed b6e12ef, veto LIFTED + closed @09:05Z
## Phase state
- Phase: conc — concurrency restructure (P1P5 + tests/concurrency) — COMPLETE
- Merged to main: bb5eb3d (restructure) + b7a009c (wrapper exit-code fix) + 139e319 (CONC-A1 fix)
- Correction per M2 verdict: 139e319's first parent is 2173894 (not 4ad55ed as the claim said);
immaterial — the code-diff-empty check (139e319 vs b6e12ef) is authoritative.
## Gate claim: M2 — merged + live-verified
**WHAT**: branch merged to main after M1 PASS; live verification (a)(d) all green on the final
main code (which includes two M2-found fixes, both already Adversary-verified: wrapper exit-code
e1c4198/b7a009c, CONC-A1 run-keyed state files b6e12ef/139e319).
**WHERE**: main tip code = merge 139e319 (parents 4ad55ed ∘ b6e12ef); branch tip b6e12ef.
All evidence builds ran post-139e319. Drone repo recipe-maintainers/cc-ci; host cc-ci.
**HOW + EXPECTED (cold re-check from your own access path):**
1. Merge integrity: `git diff 139e319 b6e12ef -- runner/ tests/ docs/ .drone.yml nix/` → EMPTY;
no force-push anywhere (reflog linear).
2. Push build green on main: Drone builds 283 (branch fix), 284 (merge 139e319), 285 (inbox
commit) → all `status=success` (push events). No main push since has a red build.
3. Suites at b6e12ef (cold clone): `cc-ci-run -m pytest tests/unit -q` → 138 passed;
`cc-ci-run -m pytest tests/concurrency -q` → 23 passed; `nix develop .#lint --command bash
scripts/lint.sh` → lint: PASS. (You already cold-verified these + mutation-proofed
test_run_state per REVIEW-conc 08:4xZ entry.)
4. **(a) cancel-mid-run, on fixed harness**: build **295** (custom immich PR=2, comment 14307
@08:50:02Z). Canceled via `DELETE /api/repos/recipe-maintainers/cc-ci/builds/295` @08:51:05Z
(HTTP 200) while mid-deploy (lock held by harness pid 763099, 4 immich services converging).
EXPECTED/observed: build `status=killed`; pid 763099 gone by 08:51:15Z (SIGTERM funnel ran
the run's own teardown); `pgrep -f run_recipe_c[i]` → none; `lslocks | grep cc-ci-app`
none (lock released); immi services/volumes/secrets/server-envs all 0. Zero leakage, no
janitor needed (better than plan minimum).
5. **(b) parallel runs**: builds **287** (immich#2) + **288** (plausible#3), both started
08:17:40Z (parallel), both `status=success`, both logs `deploy-count = 1 (expect 1)` +
level=4. Host after: zero harness procs / services / volumes / secrets / envs.
6. **(c) double-!testme same PR**: builds **290** + **291** (both immich#2, domain immi-ad3e33).
291 log line 1: `== app lock: another run of immi-ad3e33... is in flight — waiting ==`,
`acquired` @+1411s = exactly 290's exit (08:46:05Z). BOTH `status=success`, both
`deploy-count = 1`, level=4. Zero leakage after. (Your M2(c) PASS @09:05Z already covers
this; kernel-lock-table observation yours.)
7. **(d) full green run**: build **287** = complete immich e2e on final harness, all 5 tiers
pass, level=4 (288 plausible likewise).
**Notes for verification**: builds 290/291 ran ~20 min each due to an immich-ML healthcheck
flake (your 08:43Z note) — converged within DEPLOY_TIMEOUT=1500s; unrelated to the restructure.
Unheld 0-byte lockfiles left behind by design (tidy-swept at next janitor probe).
## Blockers
(none)

22
STATUS-rcust.md Normal file
View File

@ -0,0 +1,22 @@
# STATUS — sub-phase rcust (recipe-customization restructure)
Plan: /srv/cc-ci/cc-ci-plan/recipe-custom-restructure-full-plan.md (SSOT for this phase).
Reference spec: docs/recipe-customization.md @ 76a4b6b.
Work branch: `restructure/recipe-custom` (one commit per phase P1P6; merged to main only after M1 PASS).
## Phase progress
- [ ] P1 — harness/meta.py single loader + key registry + migrate L1L6 + unit tests + doc gen
- [ ] P2 — delete legacy keys/paths (CHAOS_BASE_DEPLOY, OIDC_AT_INSTALL, SKIP_GENERIC meta, conftest cleanup)
- [ ] P3 — uniform ctx hook convention
- [ ] P4 — custom-test ergonomics (placement rule, op_state/deps fixtures)
- [ ] P5 — customization manifest
- [ ] P6 — docs
## Gate
(none claimed yet — phase bootstrap)
## Current
Bootstrapping phase; starting P1.

236
docs/concurrency.md Normal file
View File

@ -0,0 +1,236 @@
# Concurrency: how parallel recipe CI runs stay safe
Spec of the concurrent-run system after the 2026-06-10 restructure (branch
`restructure/concurrency`; plan: cc-ci-plan `concurrency-restructure-full-plan.md`). The previous
registry + per-recipe-flock model is documented in this file's git history (`5b65c6c`).
## 1. Goal and design summary
Two recipe CI builds may run **at the same time** on the single cc-ci host. Safety is enforced by
the **harness**, not by serialising everything, and rests on ONE locking mechanism plus ONE
structural isolation:
| Rule | Mechanism |
|---|---|
| Different recipes run in parallel | nothing blocks them (isolation, §3) |
| Same-RECIPE runs run in parallel too | per-run `ABRA_DIR` recipe trees (§4) — no shared tree, no lock |
| Same-DOMAIN runs (double-`!testme` of one PR) serialise | per-app-domain `flock` (§5) |
| A starting run never reaps a live concurrent run's app | janitor probes the app lock; held = live (§6) |
| A crashed/canceled/rebooted run's leftovers get reaped | lock auto-released by the kernel → probe acquires → reap (§6) |
The invariant chain that makes "held lock = live owner" sound:
```
lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min hard deadline
```
- **lock ⊆ process**: locks are kernel flocks on fds the process holds (and PEP 446 makes those
fds non-inheritable, so abra/docker/pytest children never carry them). The kernel releases them
on process death, however it dies. There is no unlock code path and no stale-lock failure mode.
- **process ⊆ step**: `PR_SET_PDEATHSIG(SIGTERM)` + the `.drone.yml` setsid/trap wrap (§2) — a
dead or canceled build cannot leak a running harness.
- **step ⊆ 60 min**: `signal.alarm(3600)` self-deadline (§2).
Never steal a held lock; manage the holder's lifetime. There is **no daemon and no shared state
service** — everything is kernel/file primitives under `/run/lock` and per-run directories.
## 2. Mechanism 0: run-lifetime hardening (`runner/harness/lifetime.py`)
`run_recipe_ci.main()` calls `lifetime.install_lifetime_guards()` before ANY abra call or lock
acquisition:
1. **`PR_SET_PDEATHSIG(SIGTERM)`** (ctypes prctl, return code checked): if the parent — the drone
step shell — dies, the kernel TERMs the harness. A post-prctl `ppid == 1` re-check closes the
start race: a harness whose parent died *before* the prctl armed would never get the signal,
so it refuses to run orphaned.
2. **SIGTERM handler**: logs, then raises `SystemExit(143)` so the run's `finally:` teardown
funnel executes and the process exits non-zero. Re-entrant signals during teardown are logged
and IGNORED (`lifetime.begin_teardown()`, also set at the top of the run's `finally:` blocks)
so a second signal can't abort the cleanup the first one asked for.
3. **`signal.alarm(3600)` hard deadline**: SIGALRM funnels into the same teardown path with a
distinct log line (`== run exceeded 60-minute hard deadline — tearing down ==`), exit 142.
Recipes keep their own smaller per-tier timeouts; this bounds the whole run. Teardown time
after the deadline is deliberately not alarm-bounded — the janitor is the backstop if a
teardown wedges and the process is killed harder.
The `.drone.yml` recipe-ci step runs the harness as `setsid cc-ci-run … &` with a
`trap 'kill -TERM -- "-$PID"' TERM EXIT; wait "$PID"` — a drone **cancel** (TERM to the step
shell) is forwarded to the harness's whole process group instead of leaking it (the exec runner
only kills the step shell). PDEATHSIG backstops the no-trap paths.
## 3. Isolation model: what is shared, what is per-run
Per-run (no conflict possible):
- **App + stack + volumes + secrets.** Run app domain = `naming.app_domain()`
`<recipe[:4]>-<sha1(recipe|pr|ref)[:6]>.ci.commoninternet.net`, unique per (recipe, pr, ref);
everything abra creates is namespaced by it. Run apps are recognised by
`RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$`; warm/canonical apps
(e.g. `warm-keycloak...`) deliberately do NOT match → the janitor never probes them.
- **Recipe working trees** — `$ABRA_DIR/recipes/<recipe>`, per run (§4). NEW in the restructure.
- **Drone build workspace** (`/var/lib/drone-runner/drone-<id>/`) and **run artifacts**
(`/var/lib/cc-ci-runs/<run-id>/`).
- **Run-scoped state files** (`/tmp/ccci-{deploys,opstate,deps,depskip}-<run-id>-<pid>…`) —
keyed by run id + harness pid via `run_recipe_ci._run_state_path()`, NEVER by app domain.
A second run of the same domain executes its `main()` preamble before blocking at the app
lock (§5), so domain-keyed files would be reset/removed underneath the live first run
(live finding, M2(c) double-`!testme`: false DG4.1 deploy-count in run 1, countfile
`FileNotFoundError` in run 2). Tier/hook children get the exact paths via the
`CCCI_*_FILE` env vars; removed on normal run exit.
Shared (by design, conflict-free):
- **`/root/.abra/servers`** — app `.env` files, one per domain. The per-run `ABRA_DIR` symlinks
`servers/` here, so .env files land in the canonical path: janitor discovery (`abra app ls`)
and out-of-run tooling see every app. Per-domain filenames + the app-domain lock prevent write
conflicts.
- **`/root/.abra/catalogue`** — read-mostly, symlinked into each per-run dir.
- **`HOME=/root`** (forced in `.drone.yml`) — safe: nothing recipe-mutable lives under `~/.abra`
for a run anymore except through the two symlinks above.
## 4. Mechanism 1: per-run `ABRA_DIR` (replaces the per-recipe flock)
`run_recipe_ci.setup_run_abra_dir()` — called first thing in `main()`, before any abra call —
builds `<runs_dir>/<run-id>/abra/` (run-id = Drone build number; `manual-<pid>` for hand runs):
```
abra/
servers/ -> /root/.abra/servers (symlink; canonical shared .env path)
catalogue/ -> /root/.abra/catalogue (symlink; read-mostly)
recipes/ fresh, empty (THE isolation that matters)
```
and exports it as `$ABRA_DIR` — honored by the abra CLI itself and by every harness path helper
(`abra.abra_dir()` / `abra.recipe_dir()`; `generic._recipe_dir`, `prepull_images`,
`snapshot_recipe_tests`, `warm_reconcile._recipe_dir` all route through the same rule:
`$ABRA_DIR` if set, else `~/.abra`).
- `fetch_recipe()` is now a plain clone into `$ABRA_DIR/recipes/<recipe>` (PR-head clone+checkout
or `abra recipe fetch`); the upgrade tier's mid-run `git checkout`s happen in the run's own
tree. Two same-recipe runs can no longer corrupt each other — structurally, with no lock. The
old observed failure (immich builds 229/230 deploying a tree missing its config) is impossible.
- `CCCI_SKIP_FETCH=1` (test/Adversary staging) copies the canonically-staged
`~/.abra/recipes/<recipe>` clone into the per-run tree.
- Out-of-run flows (warm_reconcile's systemd timer, manual abra) set no `ABRA_DIR` and keep using
the canonical `/root/.abra` unchanged. In-run flows that touch canonical state on purpose
(warm/canonical .env files) go through `servers/` and are unaffected.
- The per-run dir rides along the existing `/var/lib/cc-ci-runs/<run-id>/` retention. abra
auto-clones any recipe it needs to resolve (e.g. during `app ls`) into the per-run `recipes/`
a few seconds of git per run, gone with the run dir.
## 5. Mechanism 2: per-app-domain flock (`lifecycle.acquire_app_lock`)
- Lock file: `/run/lock/cc-ci-app-<domain>.lock` (dir overridable via `CCCI_APP_LOCK_DIR` for the
test suite), exclusive `fcntl.flock`, taken in `deploy_app()` **before the app is created** — a
concurrent janitor can never see a run app without its held lock.
- Blocks (with a log line: `== app lock: another run of <domain> is in flight — waiting ==`) when
another run of the SAME domain is in flight — the double-`!testme` serialisation point; the
waiting run is visibly parked at that line in its drone log, by design.
- The returned file object is ALSO retained in module-level `_held_app_locks` — if a caller
dropped it, GC would close the fd and silently release the lock.
- mtime is touched at acquisition: lock age feeds the janitor's long-held flag (§6).
- **Unlink/recreate race guard**: the janitor unlinks reaped lockfiles, so after EVERY
acquisition the locked fd is verified to still be the inode the path names
(`fstat().st_ino == stat().st_ino`); a waiter that won a just-unlinked inode closes it and
retries on the live path. (A lock on an unlinked inode protects nothing: a later opener gets a
fresh inode and would acquire "the same" lock.)
- Release is implicit: process exit (any kind). `teardown_app()` does NOT release or unlink —
a clean run's leftover lockfile is unheld and is unlinked on sight by the next janitor sweep.
## 6. The flock-probe janitor (`lifecycle.janitor`)
Runs at every run start (cold + quick paths) and in the warm/upgrade sweeps. Candidate discovery
is unchanged from the old model: `abra app ls` + a docker-service sweep (catches stacks whose
`.env` is already gone), both matched against `RUN_APP_RE` — warm/canonical apps never match and
are never probed.
Decision table (per candidate domain, `_probe_and_reap`):
| Probe (`LOCK_EX\|LOCK_NB`) | Meaning | Action |
|---|---|---|
| acquires (+ inode identity OK) | nobody holds it → owner died (kernel-guaranteed) | **reap**: `teardown_app(verify=False)` WHILE HOLDING the probe lock, then unlink the lockfile, then release |
| acquires, inode stale | another janitor reaped + unlinked while we raced | skip (reap already done; unlinking now would hit a newer run's file) |
| `BlockingIOError` (held) | live concurrent run | leave it; if lockfile mtime > 120 min (2× the hard deadline): `!! lock for <domain> held >120min — possible leaked run; inspect with lslocks` — flag, **never steal** |
| `open()` fails (`OSError`) | garbled/unopenable lockfile | skip + log, never crash |
- Reaping under the probe lock closes the janitor-vs-new-run race: a new run of that domain
blocks in `acquire_app_lock` until the reap finishes — no window where a fresh app coexists
with a half-reaped one.
- Two racing janitors arbitrate on the flock: one reaps, the other sees "held" and leaves; reaps
are idempotent (`teardown_app(verify=False)` tolerates half-gone stacks).
- After the candidates, a tidy sweep unlinks stale **unheld** `cc-ci-app-*.lock` files with no
app behind them (under their own probe lock + identity check), keeping `/run/lock` clean.
- **Post-reboot**: `/run/lock` is tmpfs → lockfiles gone → every surviving app probes as an
orphan → reaped immediately. (Improvement over the old 2-hour age fallback; there IS no age
logic anymore.)
## 7. Failure-mode guarantees
| Event | Outcome |
|---|---|
| Run crashes / SIGKILL mid-run | flock auto-released by kernel → next janitor probe reaps app + lockfile |
| Drone build canceled via API | step trap TERMs the harness process group → SIGTERM funnel runs the run's own teardown (exit 143); if anything still leaks, PDEATHSIG + janitor reap (the old "cancel leaks the harness" gap is CLOSED) |
| Run exceeds 60 min | SIGALRM → distinct log line → own teardown → exit 142 |
| Host reboot | locks and lockfiles vanish (tmpfs, correct: no owners survived) → all surviving run apps reaped at the next run start, immediately |
| Two same-recipe `!testme`s (different PRs) | run in parallel — separate domains, separate per-run recipe trees |
| Double-`!testme` (same PR → same domain) | second blocks on the app lock before creating anything, visibly in its drone log, runs after the first finishes |
| Janitor vs. app being created | impossible to mis-reap: the lock is held before `app new`, and a held lock is never touched |
| Janitor unlink vs. blocked waiter | inode identity re-check on every acquisition → waiter retries on the live path |
| Lock held implausibly long (>120 min) | flagged loudly for a human (`lslocks`), never stolen |
## 8. Where convergence fits (adjacent; unchanged by the restructure)
Two swarm-convergence behaviors in `services_converged()` look like concurrency bugs but aren't —
any future work must keep them fixed:
- **N/N replicas ≠ converged** during a stop-first rolling update — `UpdateStatus.State` is also
inspected (build 238: backupbot exec'd into a container killed seconds later).
- **`paused` persists forever** (swarm's default `update-failure-action`) — only `updating` and
`rollback_started` block convergence; `paused`/`rollback_paused` are settled (build 241).
- `backup_app()` additionally waits (bounded 300s) for convergence before `backup create`.
## 9. Configuration knobs
| Knob | Where | Current | Meaning |
|---|---|---|---|
| `DRONE_RUNNER_CAPACITY` (aka `MAX_TESTS`) | `nix/modules/drone-runner.nix` (`maxTests`) | `2` | **THE single concurrency knob.** Max builds the exec runner executes at once; Drone queues the rest. (The `.drone.yml` `concurrency.limit` duplicate was removed.) Change requires `nixos-rebuild switch`. |
| `CCCI_APP_LOCK_DIR` | env, read at call time | unset → `/run/lock` | App-domain lockfile dir override — used by `tests/concurrency` to sandbox locks. Never set in production. |
| hard deadline | `lifetime.HARD_DEADLINE_SECONDS` | 3600 s | the whole-run alarm; long-held flag threshold is 2× this (`LONG_HELD_LOCK_SECONDS`) |
## 10. Testing: `tests/concurrency/`
Real-kernel suite (19 planned cases + companions): helper subprocesses hold REAL flocks and
install the REAL prctl/signal/alarm guards — flock itself is never mocked; the janitor runs with
injected candidates + stubbed teardown but probes real locks. **Not part of the default
`pytest tests/unit` gate** (it spawns processes and sleeps); run it explicitly:
```
cc-ci-run -m pytest tests/concurrency -q
```
Covers: kernel auto-release on SIGKILL; LOCK_NB probe semantics; PEP 446 fd non-inheritance;
same-domain serialisation; orphan reap + unlink; live-run protection; reap-under-probe-lock
blocking; two-janitor arbitration; reboot-immediate reap; long-held flag; RUN_APP_RE allowlist;
degrade-on-garbage; PDEATHSIG; ppid start race; deadline + SIGTERM funnels; per-run ABRA_DIR
construction/export; concurrent same-recipe fetch isolation; symlinked-servers .env canonicality;
run-keyed (never domain-keyed) run-scoped state files (M2(c) regression, `test_run_state.py`).
## 11. File / symbol index
| What | Where |
|---|---|
| lifetime guards (PDEATHSIG, signal funnels, deadline) | `runner/harness/lifetime.py`; installed in `run_recipe_ci.main()` |
| setsid/trap cancel forwarding | `.drone.yml` (`recipe-ci` step) |
| `acquire_app_lock`, `_held_app_locks`, `_app_lock_path` | `runner/harness/lifecycle.py` |
| `acquire_app_lock` call site | `lifecycle.deploy_app()` (before app creation) |
| janitor + probe (`janitor`, `_probe_and_reap`, `LONG_HELD_LOCK_SECONDS`) | `runner/harness/lifecycle.py` |
| per-run ABRA_DIR (`setup_run_abra_dir`, `fetch_recipe`) | `runner/run_recipe_ci.py` |
| path resolution (`abra_dir`, `recipe_dir`) | `runner/harness/abra.py` (used by `generic`, `lifecycle.prepull_images`, `warm_reconcile`) |
| run-app naming | `runner/harness/naming.py` (`app_domain`), `RUN_APP_RE` in `lifecycle.py` |
| capacity knob | `nix/modules/drone-runner.nix` (`maxTests`) |
| convergence (adjacent) | `lifecycle.services_converged()`, `lifecycle.backup_app()` |
| the test suite | `tests/concurrency/` (`helpers.py` subprocess entrypoints, `concutil.py` probes) |
Deleted in the restructure (grep should find NOTHING): `register_run_app`, `unregister_run_app`,
`_run_owner_state`, `ACTIVE_RUN_DIR`, `CCCI_JANITOR_MAX_AGE`, `_stack_age_seconds`,
`acquire_recipe_lock`, `RECIPE_LOCK_DIR`.

View File

@ -0,0 +1,383 @@
# Recipe customization — review spec
Status: REVIEW SPEC — describes the customization surface as it exists today (main), written so
the structure can be reviewed and potentially restructured. §8 lists known limitations and
restructuring candidates; everything before it is purely descriptive.
Companion docs: `docs/testing.md` (test architecture / tier semantics), `docs/enroll-recipe.md`
(step-by-step enrollment). This doc is the **complete reference** for the two questions those docs
answer only partially:
1. How are custom tests written for a particular recipe?
2. What are ALL the per-recipe CI settings, where do they live, and who reads them?
---
## 1. The three customization surfaces
A recipe customizes its CI through **three distinct mechanisms** (worth noticing for the
restructure review — they are three different config languages):
| Surface | Form | Examples |
|---|---|---|
| **Declarative settings** | Python assignments in `tests/<recipe>/recipe_meta.py` | `DEPLOY_TIMEOUT = 1500`, `UPGRADE_BASE_VERSION = "2.3.1+..."` |
| **Code hooks** | Callables in `recipe_meta.py`, `ops.py` functions, shell hooks | `def READY_PROBE(domain): ...`, `pre_upgrade()`, `install_steps.sh` |
| **File presence** | A file existing at a discovered path changes behavior | `test_upgrade.py` overlay, `functional/test_*.py`, `compose.ccci.yml` |
There is additionally a fourth, operator-facing surface: **environment variables**
(`CCCI_SKIP_GENERIC*`) that override declarative settings at run time (§4.4).
## 2. Zero-config baseline
A recipe with **no `tests/<recipe>/` directory at all** still gets the full generic floor:
- deploy base version → INSTALL (generic `assert_serving`: HTTP on `/`, expect 200/301/302)
- chaos-upgrade to PR head → UPGRADE (generic `assert_upgraded`: version label matches head, converged, serving)
- BACKUP (generic `assert_backup_artifact`) — iff the recipe's compose files carry
`backupbot.backup` labels (auto-detected), else N/A
- RESTORE (generic `assert_restore_healthy`)
- CUSTOM tier: empty (no custom tests discovered)
- teardown
Defaults: `HEALTH_PATH="/"`, `HEALTH_OK=(200,301,302)`, `DEPLOY_TIMEOUT=600`, `HTTP_TIMEOUT=300`.
Everything in this doc is opt-in deviation from that floor. The cardinal invariant
(docs/testing.md §1): the generic floor is **always on** and never depends on custom code;
custom is **additive** by default.
## 3. The per-recipe tree — every file that can exist
Two locations, with precedence and a security gate between them:
- **cc-ci-owned**: `tests/<recipe>/` in this repo (trusted, maintainer-reviewed)
- **repo-local**: the recipe repo's own `tests/` dir (PR-author-controlled → **default-deny**,
consulted only when the recipe is listed in `tests/repo-local-approved.txt` — gate HC2,
centralized in `runner/harness/discovery.py`)
```
tests/<recipe>/ # cc-ci side (repo-local mirrors the same shape)
├── recipe_meta.py # ALL declarative settings + meta callables (§4)
├── test_<op>.py # lifecycle overlay assertions, op ∈ install|upgrade|backup|restore (§5.1)
├── ops.py # pre_<op>(domain, meta) seed hooks (§5.2)
├── test_*.py # custom-tier tests (top-level, cross-cutting)(§5.3)
├── functional/test_*.py # custom tier: parity ports + recipe-specific (§5.3)
├── playwright/test_*.py # custom tier: UI flows (§5.3)
├── install_steps.sh # pre-deploy shell hook (§5.4)
├── setup_custom_tests.sh # deps/OIDC credential wiring hook (§5.5)
├── compose.ccci.yml # CI-only compose overlay (via install_steps) (§5.6)
└── PARITY.md # enrollment contract doc (human-read only)
```
Precedence (machine-docs/DECISIONS.md, implemented in `discovery.py`):
- lifecycle overlay `test_<op>.py`: repo-local **wins** over cc-ci (same-name collision); the
generic floor still runs additively alongside.
- custom tier `test_*.py`: **ALL** run, from both locations (no collision concept).
- `install_steps.sh`: repo-local > cc-ci, or none.
- `ops.py` pre-op hook: cc-ci wins; repo-local consulted only if approved.
- `recipe_meta.py`: cc-ci only — repo-local recipes cannot set CI settings (by design; the
settings surface stays maintainer-controlled).
## 4. `recipe_meta.py` — complete settings reference
The single settings file. Plain Python, `exec()`d by the harness (trusted, in-repo). A key is "set"
by a top-level assignment or `def`. Unknown names are ignored silently (a recipe may keep private
constants here, e.g. mumble's `WELCOME_TEXT_MARKER` — but see §8 R6: typos in real key names are
also silently ignored).
**Loader column legend** — this is the structural finding for the review (§8 R1). There is no
single loader; six independent code paths each `exec()` the file and pick out their own keys:
| # | Loader | Keys it sees |
|---|---|---|
| L1 | `runner/run_recipe_ci.py:_load_meta` (orchestrator) | 4 base + explicit 8-key allowlist |
| L2 | `tests/conftest.py:_recipe_meta` (pytest `meta` fixture) | 4 base keys ONLY |
| L3 | `runner/harness/lifecycle.py:_recipe_extra_env` | `EXTRA_ENV` only |
| L4 | `runner/harness/lifecycle.py:_recipe_meta_flag` | boolean flags by name (`CHAOS_BASE_DEPLOY`) |
| L5 | `runner/harness/deps.py:declared_deps` | `DEPS` only |
| L6 | `runner/harness/canonical.py:is_canonical_enrolled` | `WARM_CANONICAL` only |
> **Restructure status (rcust P1):** the six loaders above are HISTORY — they have been replaced by
> the single registry-backed loader `runner/harness/meta.py::load(recipe) -> RecipeMeta` (the only
> `exec()` of `recipe_meta.py`). Unknown ALL-CAPS keys / type mismatches are now hard errors;
> underscore-prefixed names are recipe-private. The authoritative key reference is the generated
> table below; the per-loader subsections §4.1§4.8 are retained for context until the P6 doc
> rewrite.
<!-- META-TABLE-START -->
_This table is GENERATED from the `runner/harness/meta.py` KEYS registry by `scripts/gen-meta-docs.py` — do not edit by hand (a unit test pins the sync)._
| Key | Type | Default | Meaning |
|---|---|---|---|
| `HEALTH_PATH` | `str` | `'/'` | Path probed for serving/health checks (deploy wait + generic `assert_serving`). |
| `HEALTH_OK` | `tuple[int]` | `(200, 301, 302)` | Acceptable HTTP status codes for health. |
| `DEPLOY_TIMEOUT` | `int` | `600` | Max seconds to wait for swarm convergence per deploy. |
| `HTTP_TIMEOUT` | `int` | `300` | Max seconds to wait for HTTP health after convergence. |
| `BACKUP_CAPABLE` | `bool` | `None` | Override the backup-tier capability auto-detect (compose `backupbot.backup` labels). `False` forces N/A; `True` forces the tier on; unset = auto-detect. |
| `EXPECTED_NA` | `dict` | `None` | Declare an N/A rung intentional: `{rung: reason}`. The cap stands either way; only the report wording changes. |
| `READY_PROBE` | `hook` | `None` | Callable `(ctx) -> [probe, ...]` returning extra readiness probes, run after install AND after upgrade: HTTP `{host, path, ok}` or TCP `{tcp_host, tcp_port, stable}`. |
| `UPGRADE_BASE_VERSION` | `str` | `None` | Exact published tag overriding the upgrade tier's base (default: `recipe_versions[-2]`). |
| `BACKUP_VERIFY` | `hook` | `None` | Callable `(ctx) -> bool` post-backup data-capture check; `False` re-runs the backup (truncated-dump race guard), retried up to 3 attempts. |
| `UPGRADE_EXTRA_ENV` | `dict_or_hook` | `None` | Extra `.env` keys applied after the PR-head checkout, before the chaos redeploy (env that exists only at head). Dict, or callable `(ctx) -> dict`. |
| `EXTRA_ENV` | `dict_or_hook` | `{}` | Extra `.env` keys applied at EVERY deploy (base install AND upgrade old-app). Dict, or callable `(ctx) -> dict` deriving values from the per-run domain (`ctx.domain`). |
| `DEPS` | `list[str]` | `[]` | Dep recipes deployed/provisioned alongside (e.g. `["keycloak"]`); creds land in `$CCCI_DEPS_FILE`. |
| `WARM_CANONICAL` | `bool` | `False` | Enroll the recipe in the warm/canonical app system (docs/warm.md): green cold runs on LATEST advance the canonical snapshot. |
| `SCREENSHOT` | `hook` | `None` | Callable `(page, ctx)` driving Playwright to a safe, credential-free post-login view for the results-card screenshot (default: landing page). |
<!-- META-TABLE-END -->
### 4.1 HTTP / health / timing (base 4 — seen by L1 AND L2)
| Key | Type / default | Meaning | Used by |
|---|---|---|---|
| `HEALTH_PATH` | str, `"/"` | Path probed for serving/health checks | deploy wait (`lifecycle.py`), generic `assert_serving` |
| `HEALTH_OK` | tuple, `(200, 301, 302)` | Acceptable HTTP status codes for health | same |
| `DEPLOY_TIMEOUT` | int s, `600` | Max wait for swarm convergence per deploy | `lifecycle.py`, generic ops |
| `HTTP_TIMEOUT` | int s, `300` | Max wait for HTTP health after converged | same |
Example: immich sets `DEPLOY_TIMEOUT = 1500`, `HTTP_TIMEOUT = 600` (ML containers are slow).
### 4.2 Upgrade tier (loader L1)
| Key | Type / default | Meaning |
|---|---|---|
| `UPGRADE_BASE_VERSION` | str (exact published tag), default `None` | **The "base pin"** — overrides the harness default base for the upgrade tier. Default base = `recipe_versions[-2]` (the previous published version); pin when that is not the PR's true predecessor (e.g. the PR is the first release on a new major, or the previous tag is known-broken). Must be an exact published tag — typos fail the base deploy. Consumed at `run_recipe_ci.py` (`prev = meta.get("UPGRADE_BASE_VERSION") or lifecycle.previous_version(recipe)`). Users: discourse, plausible. |
| `UPGRADE_EXTRA_ENV` | dict **or** callable `(domain) -> dict`, default `None` | Extra `.env` keys applied **after** the PR-head checkout, **before** the chaos redeploy (F2-14c) — for env vars that exist only at head (a new required setting introduced by the PR). Consumed in `generic.py:256`. User: mumble. |
### 4.3 Every-deploy shaping (loaders L3/L4 — NOT in the L1 allowlist)
| Key | Type / default | Meaning |
|---|---|---|
| `EXTRA_ENV` | dict **or** callable `(domain) -> dict`, default `{}` | Extra `.env` keys applied at **every** deploy (base install AND upgrade old-app). Callable form derives values from the per-run domain (e.g. cryptpad's `SANDBOX_DOMAIN`). Loaded by `lifecycle.py:_recipe_extra_env` (its own `exec()`). Users: cryptpad, discourse, ghost, matrix-synapse, mattermost-lts, mumble, plausible. |
| `CHAOS_BASE_DEPLOY` | bool, default `False` | Base deploy uses `--chaos` so it survives untracked files in the recipe checkout (required when `install_steps.sh` copies in a `compose.ccci.yml` overlay — §5.6; implicit coupling, see §8 R7). Loaded by `lifecycle.py:_recipe_meta_flag`. Users: discourse, ghost. |
### 4.4 Skips and intentional N/A (loader L1)
| Key | Type / default | Meaning |
|---|---|---|
| `SKIP_GENERIC` | list of op names or `"all"`/`"*"`, default `[]` | Suppress the generic floor for the listed ops (overlay becomes override instead of additive). Two env equivalents at run time: `CCCI_SKIP_GENERIC=1` (all ops), `CCCI_SKIP_GENERIC_<OP>=1` (one op). Currently set by **no enrolled recipe** (env form is the one used, ad hoc). |
| `EXPECTED_NA` | dict `{rung: reason}`, default `None` | Declares an N/A rung **intentional** (e.g. `{"backup": "stateless, nothing to back up"}`). Undeclared N/A is reported as an *unintentional coverage gap*. Both cap the achievable level — declaring does not un-cap, it only changes the report wording (`results.py`). User: custom-html-tiny. |
| `BACKUP_CAPABLE` | bool, default auto-detect | Overrides the backup-tier capability detection (scan of recipe compose files for `backupbot.backup` labels, `generic.py:34`). `False` forces N/A; `True` forces the tier on. Users: custom-html-bkp-bad/rst-bad (harness self-test recipes). |
### 4.5 Readiness & data-verification hooks (loader L1, callable values)
| Key | Type / default | Meaning |
|---|---|---|
| `READY_PROBE` | callable `(domain) -> [probe, ...]`, default `None` | Extra readiness probes run after install AND after upgrade, before that tier's assertions. Probe dicts: HTTP `{host, path, ok}` or TCP `{tcp_host, tcp_port, stable}` (`stable`: must stay connectable across 3 checks — for UDP-adjacent voice ports etc.). Consumed at `lifecycle.py:516`. Users: lasuite-drive, mumble (TCP voice port). |
| `BACKUP_VERIFY` | callable `(domain) -> bool`, default `None` | Post-backup data-capture check, retried — guards the truncated-dump race (backup snapshot taken before the seeded marker row hit disk). Return `False` → retry the backup, then fail. Users: discourse, ghost. |
### 4.6 Dependencies / SSO (loaders L5 + L1)
| Key | Type / default | Meaning |
|---|---|---|
| `DEPS` | list of recipe names, default `[]` | Dep recipes deployed alongside (e.g. `["keycloak"]`). Dep domain is `<dep[:4]>-<6hex>`, hashed from (parent, pr, ref, dep) — collision-free per run. Creds land in `$CCCI_DEPS_FILE` (JSON); tests use the `deps_apps` fixture; teardown deps LAST. Deploy-count guard becomes `1 + len(DEPS)`. Loaded by `deps.py:declared_deps`. Users: lasuite-docs/-drive/-meet. |
| `OIDC_AT_INSTALL` | bool, default `False` | Provision deps **before** the single base deploy so `install_steps.sh` can wire OIDC env into that one deploy (reads `$CCCI_DEPS_FILE`). Default (legacy) is post-deploy provisioning + a `setup_custom_tests.sh` redeploy. Consumed at `run_recipe_ci.py:514`. Users: lasuite-drive, lasuite-meet. |
### 4.7 Warm-canonical enrollment (loader L6)
| Key | Type / default | Meaning |
|---|---|---|
| `WARM_CANONICAL` | bool, default `False` | Enrolls the recipe in the warm/canonical app system (`docs/warm.md`): green COLD runs on LATEST advance the canonical snapshot; the nightly sweep iterates enrolled recipes. Loaded by `canonical.py:is_canonical_enrolled`. User: custom-html. |
### 4.8 Cosmetic (BROKEN — see §8 R2)
| Key | Type / default | Meaning |
|---|---|---|
| `SCREENSHOT` | callable `(page, domain, meta) -> None` | Drives Playwright to a safe post-login view for the results-card screenshot (default: landing page). **Currently unreachable from the CI path**: `screenshot.py:41` reads it from the meta dict the orchestrator passes (`run_recipe_ci.py:1056`), but the L1 allowlist never loads `SCREENSHOT`, so the hook is always `None`. No recipe sets it (consistent with it never having worked). |
## 5. Writing custom tests & hooks
### 5.1 Lifecycle overlay assertions — `test_<op>.py`
One pytest file per lifecycle op (`install` / `upgrade` / `backup` / `restore`). The
**orchestrator performs the op exactly once**; the overlay only *asserts* on the resulting state
(HC3 op/assertion split — overlays never deploy, never restore, never mutate). The generic floor
test runs additively against the same state.
Conventions (see `tests/immich/test_backup.py` etc.):
- use the `live_app` fixture (asserts `CCCI_APP_DOMAIN` is set, yields the domain)
- use the `meta` fixture for HEALTH_*/timeouts (note: only the 4 base keys — §8 R3)
- read op context from `$CCCI_OP_STATE_FILE` (JSON written by the orchestrator after the op:
versions, artifact paths)
- execute in-container checks via `harness.lifecycle.exec_in_app(domain, service, cmd)`
### 5.2 Pre-op seed hooks — `ops.py`
`def pre_<op>(domain, meta)` callables, imported and called by the orchestrator **before**
performing the op. This is where data gets seeded so the post-op overlay can assert on it:
```python
# tests/immich/ops.py (pattern)
def pre_upgrade(domain, meta): _psql(domain, "INSERT ... 'upgrade-survives'")
def pre_backup(domain, meta): _psql(domain, "INSERT ... 'original'")
def pre_restore(domain, meta): _psql(domain, "DROP TABLE ci_marker") # damage, restore must undo
```
Seed → op → assert is the whole pattern: `pre_backup` writes a marker, the orchestrator backs up,
`pre_restore` destroys it, the orchestrator restores, `test_restore.py` asserts the marker is back.
### 5.3 Custom tier — `functional/`, `playwright/`, top-level `test_*.py`
All non-lifecycle `test_*.py` (discovery: `discovery.py:custom_tests`, recursive over the
top-level dir + `functional/` + `playwright/`; files named `test_<op>.py` excluded). Run in the
CUSTOM tier, after restore, against the post-upgrade (PR-head) app. ALL discovered files run —
cc-ci's and (if HC2-approved) repo-local's, additively.
Enrollment contract (`docs/enroll-recipe.md`): ≥2 NEW functional tests beyond ports of existing
upstream checks; ported tests carry `SOURCE:` comments. Playwright tests get the shared
browser/harness helpers (`harness.browser`); SSO recipes get `harness.sso`
(`setup_keycloak_realm` — idempotent, `oidc_password_grant` — provider-pluggable).
Tests gate on deps via `CCCI_DEPS_READY` (skip-with-reason when `0`; the skip is counted and
fails the run if deps were declared but unprovisionable — `run_recipe_ci.py:816`).
### 5.4 Pre-deploy shell hook — `install_steps.sh`
Runs after `abra app new` + `EXTRA_ENV` application + secret generation, **before** the base
deploy. For setup that must precede the first deploy: writing extra config files into the recipe
checkout, copying in a `compose.ccci.yml` overlay (§5.6), editing `.env` beyond simple key=val.
Env contract: `CCCI_APP_DOMAIN`, `CCCI_RECIPE`, `CCCI_APP_ENV` (path to the app's `.env`), and —
when `OIDC_AT_INSTALL` deps exist — `CCCI_DEPS_FILE`. Must locate the recipe checkout
ABRA_DIR-aware: `RECIPE_DIR="${ABRA_DIR:-${HOME}/.abra}/recipes/${CCCI_RECIPE}"` (per-run
`ABRA_DIR` since the concurrency restructure — a hardcoded `~/.abra` writes to the wrong tree).
Graceful-generic rule: a recipe needing a hook but not shipping one simply fails the generic
install — a correct reported outcome, not a harness error.
### 5.5 Deps credential wiring — `setup_custom_tests.sh`
For legacy (post-deploy) deps provisioning: runs after deps are up, reads `$CCCI_DEPS_FILE`
(jq-readable JSON of dep creds/URLs), wires OIDC config via `abra app config set` + secrets, and
redeploys. With `OIDC_AT_INSTALL = True` this hook is unnecessary (wiring happens in
`install_steps.sh` before the only deploy) — preferred for new enrollments (one deploy, no
deploy-count exception).
### 5.6 CI-only compose overlay — `compose.ccci.yml`
Not auto-discovered: `install_steps.sh` copies it into the recipe checkout, and the recipe must
set `CHAOS_BASE_DEPLOY = True` so the base deploy (`--chaos`) tolerates the untracked file.
Policy: minimal, justified fallback only (ghost's is a 15m `start_period` grace — a literal,
because abra validates `start_period` before env substitution). The overlay is cc-ci-owned even
though it rides in the recipe checkout.
### 5.7 Environment contract summary (what custom code can read)
| Var | Set for | Meaning |
|---|---|---|
| `CCCI_APP_DOMAIN` | all tests + hooks | the app's per-run domain |
| `CCCI_BASE_URL` | approved repo-local code | `https://<domain>` |
| `CCCI_RECIPE`, `CCCI_APP_ENV` | `install_steps.sh` | recipe name, app `.env` path |
| `CCCI_OP_STATE_FILE` | overlay tests | JSON op context (versions, artifacts) |
| `CCCI_DEPS_FILE` | deps hooks + tests | JSON dep creds dict |
| `CCCI_DEPS_READY` / `CCCI_DEPS_NOT_READY_REASON` | custom tier | gate SSO tests, skip-with-reason |
## 6. Run-model context (what the settings plug into)
One deploy chain per run (full detail: `docs/testing.md` §2):
```
deploy BASE (UPGRADE_BASE_VERSION or recipe_versions[-2]; EXTRA_ENV; install_steps.sh;
CHAOS_BASE_DEPLOY?; OIDC_AT_INSTALL deps first?)
→ INSTALL tier (READY_PROBE; generic + overlay asserts)
→ pre_upgrade → chaos-deploy PR HEAD (UPGRADE_EXTRA_ENV)
→ UPGRADE tier (READY_PROBE; version-label == head_ref)
→ pre_backup → backup (BACKUP_CAPABLE; BACKUP_VERIFY)
→ BACKUP tier
→ pre_restore → restore
→ RESTORE tier
→ CUSTOM tier (functional/ + playwright/; deps via CCCI_DEPS_*)
→ teardown (deps LAST)
```
Deploy-count guard (DG4.1): exactly `1 + len(DEPS)` deploys per run (chaos redeploys don't
count); the per-run counter file is keyed by run since the concurrency restructure.
## 7. Local iteration
```
RECIPE=<recipe> PR=<n> REF=<sha> SRC=recipe-maintainers/<recipe> \
STAGES=install,upgrade,backup,restore,custom \
cc-ci-run runner/run_recipe_ci.py
```
(`docs/enroll-recipe.md` §5 for the full loop, including dep teardown caveats.)
## 8. Known limitations & restructuring candidates
The review section. Ordered by how much they'd shape a restructure.
**R1 — Six divergent meta loaders (the core drift hazard).** §4's L1L6: every loader re-`exec()`s
`recipe_meta.py` and cherry-picks its own keys. Adding a key means knowing *which* loader to touch
(or that you must extend the L1 allowlist — `SCREENSHOT` proves people don't, R2). Two conventions
coexist: L1's explicit allowlist vs L3L6's ad-hoc `ns.get(...)` which silently bypasses it.
*Candidate:* one `harness.meta.load(recipe) -> RecipeMeta` with a declarative key registry
(name, type, default, validator, consumer) as the single source of truth; L1L6 become lookups
into the one loaded object; the registry also generates §4 of this doc (kills doc drift, R5).
**R2 — `SCREENSHOT` is a dead knob.** Fully implemented consumer (`screenshot.py`), documented
hook contract, never reachable: the orchestrator's allowlist omits it, so the dict passed at
`run_recipe_ci.py:1056` can never contain it. Direct evidence of R1. *Candidate:* fix trivially by
adding to the allowlist — or delete the hook path if post-login screenshots aren't wanted; decide
during the restructure.
**R3 — The pytest `meta` fixture sees 4 keys.** `tests/conftest.py:_recipe_meta` loads only
HEALTH_*/timeouts. An overlay test wanting e.g. `EXPECTED_NA` or a recipe constant must re-exec
the file itself. Probably intended minimalism, but it's a third key-set to keep in sync.
*Folds into R1.*
**R4 — Settings split across three config languages** (§1): recipe_meta keys, file-presence
(`install_steps.sh` existing changes deploy behavior), and run-time env (`CCCI_SKIP_GENERIC*`).
A reviewer asking "what does this recipe customize?" must check all three. *Candidate:* keep the
three surfaces (they serve different actors) but make the run header log a single resolved
"customization manifest" per run: every non-default key + every discovered hook file + every
CCCI_* override, in one block.
**R5 — Reference-doc drift already happened.** `docs/testing.md` documents 6 meta keys,
`docs/enroll-recipe.md` shows others by example; neither is complete (18 keys exist). This doc is
now complete but handwritten — it will drift too. *Candidate:* generate the key table from the R1
registry (test asserts doc ⊆ registry).
**R6 — No schema validation / silent typos.** Unknown top-level names in `recipe_meta.py` are
ignored, which is load-bearing (recipes keep private constants there: mumble's
`WELCOME_TEXT_MARKER`, `MAX_USERS`). Consequence: misspelling `READY_PROBE` as `READINESS_PROBE`
silently disables the probe — the run goes green with less coverage, the worst failure mode for a
CI harness. *Candidate:* with the R1 registry, warn (not fail) on ALL-CAPS top-level names that
are not registered and not referenced by the recipe's own tests; or namespace private constants
(`_WELCOME_TEXT_MARKER`).
**R7 — `compose.ccci.yml` ⇄ `CHAOS_BASE_DEPLOY` implicit coupling.** The overlay only works if
the recipe *also* sets the flag; forgetting it fails the base deploy with an abra
untracked-files error far from the cause. *Candidate:* if `install_steps.sh` exists alongside a
`compose.ccci.yml`, the harness could auto-enable chaos for the base deploy (or at least assert
the flag and fail with a pointed message).
**R8 — `SKIP_GENERIC` (meta form) has zero users.** Only the env-var form is used, ad hoc. Either
the meta key earns its place (first real user) or it's surface to delete in the restructure.
**R9 — `recipe_meta.py` is code, not config.** Five keys take callables (`EXTRA_ENV`,
`UPGRADE_EXTRA_ENV`, `READY_PROBE`, `BACKUP_VERIFY`, `SCREENSHOT`), so the file must stay an
`exec()`d Python module — it can't be validated as data, serialized into results, or diffed
declaratively. This is a real expressiveness need (cryptpad derives `SANDBOX_DOMAIN` from the
per-run domain), not an accident. *Candidate if restructuring:* split data keys (TOML-able,
schema-validated) from a `hooks.py` (callables only) — but weigh against the cost of two files
per recipe; the R1 registry gets most of the value without the split.
## 9. File / symbol index
| Concern | Where |
|---|---|
| Orchestrator meta loader (L1, allowlist) | `runner/run_recipe_ci.py:250` `_load_meta` |
| Pytest meta fixture (L2) | `tests/conftest.py` `_recipe_meta` |
| `EXTRA_ENV` loader (L3) | `runner/harness/lifecycle.py:114` `_recipe_extra_env` |
| Boolean-flag loader (L4) | `runner/harness/lifecycle.py:132` `_recipe_meta_flag` |
| `DEPS` loader (L5) | `runner/harness/deps.py:37` `declared_deps` |
| `WARM_CANONICAL` loader (L6) | `runner/harness/canonical.py:36` `is_canonical_enrolled` |
| Overlay/custom/hook discovery + HC2 gate | `runner/harness/discovery.py` |
| HC2 allowlist | `tests/repo-local-approved.txt` |
| Generic assertions + `BACKUP_CAPABLE` detect | `runner/harness/generic.py` |
| `READY_PROBE` / `CHAOS_BASE_DEPLOY` consumption | `runner/harness/lifecycle.py:516` / `:283` |
| `EXPECTED_NA` reporting | `runner/harness/results.py` |
| Dead `SCREENSHOT` consumer | `runner/harness/screenshot.py:36`, called `run_recipe_ci.py:1056` |
| Skip-generic logic (meta + env) | `runner/run_recipe_ci.py:285` |
| Worked examples | `tests/ghost/` (overlay+chaos), `tests/mumble/` (TCP probe, UPGRADE_EXTRA_ENV), `tests/lasuite-drive/` (DEPS+OIDC_AT_INSTALL), `tests/immich/` (ops.py seed pattern) |

View File

@ -1283,3 +1283,15 @@ the commit), which is the correct SCM integration.
environment; job is session-persistent (survives as long as Builder session runs). T0-refire
verified: CronCreate test fire at 23:17Z → upgrader started, upgrader-cron.log created, status
RUNNING. (2026-06-01)
## conc P3 (2026-06-10, Builder): install_steps.sh hooks resolve $ABRA_DIR — guardrail note
P3 makes recipe working trees per-run ($ABRA_DIR/recipes). tests/{ghost,discourse}/install_steps.sh
hard-coded `${HOME}/.abra/recipes/...` to copy their compose.ccci.yml overlay into the deploy tree;
under per-run trees that path is the WRONG (canonical) tree, so the overlay would silently miss the
deploy and both recipes' upgrade-tier base deploys would break. Fixed with ONE mechanical line per
hook: `RECIPE_DIR="${ABRA_DIR:-${HOME}/.abra}/recipes/${CCCI_RECIPE}"` (identical resolution rule to
the abra CLI and abra.recipe_dir()). No test assertion, gate, or overlay content was touched — the
phase guardrail's "never touch tests/<recipe>/ content" is read as protecting test/gate SEMANTICS;
this is required P3 fallout, equivalent to the harness-side path routing. Flagged here for the
Adversary's gate-integrity review.

View File

@ -8,18 +8,18 @@
{ pkgs, config, lib, ... }:
let
# MAX_TESTS (plan §4.2/§4.3 resource safety): max CI builds the exec runner runs at once. Drone
# queues the rest in its native pending-build queue (no custom queue). THE concurrency cap that
# bounds how many test apps can be live at once.
# queues the rest in its native pending-build queue (no custom queue). THE SINGLE concurrency
# knob — nothing else caps recipe-ci parallelism (the .drone.yml concurrency.limit was removed:
# one knob, one place). Bounds how many test apps can be live at once.
#
# Raised to 2 (operator request 2026-06-09) so two recipes can be tested in parallel (e.g. immich
# and plausible under active development at once). Verified safe on the current node (Hetzner cpx22,
# ~7.6 GiB / 4 vCPU — NOTE: smaller than the original 28 GiB this was written for): a full immich CI
# stack measured ~1 GiB (server+ML+pg+redis) with multiple GiB free, so two concurrent recipes fit.
# The concurrency PRECONDITION holds: the run-start janitor is age-based (default 2h) + run-app-name
# scoped, so it never reaps a concurrent in-flight run (harness.lifecycle.janitor). TRADE-OFF: with
# capacity>1 a SIGKILL'd build (no teardown) leaves an orphan the run-start sweep can't reap
# immediately (it might be a live run) — bounded instead by the 2h janitor + the /upgrade-all
# start/end reap + sweep-orphans. Revert to "1" if OOM / disk-I/O contention is observed under load.
# Concurrent-run safety is the harness's job at ANY capacity (docs/concurrency.md): per-run
# ABRA_DIR recipe trees, per-app-domain flocks, and a flock-probe janitor that reaps a crashed
# build's orphan immediately (held lock = live run, never touched). Revert to "1" if OOM /
# disk-I/O contention is observed under load.
maxTests = "2";
in
{

View File

@ -10,6 +10,7 @@ Bakes in the known abra gotchas (re-verify per installed abra version, currently
from __future__ import annotations
import json
import os
import subprocess
ABRA = "abra"
@ -19,6 +20,20 @@ class AbraError(RuntimeError):
pass
def abra_dir() -> str:
"""abra's state dir, resolved the same way the abra CLI resolves it: $ABRA_DIR if set, else
~/.abra. Inside a CI run, run_recipe_ci exports a PER-RUN $ABRA_DIR (fresh recipes/, shared
servers/+catalogue/ symlinks) before any abra call, so every helper here and every abra
subprocess agree on the same tree; outside a run (warm_reconcile's systemd timer, manual use)
both fall back to the canonical /root/.abra."""
return os.environ.get("ABRA_DIR") or os.path.expanduser("~/.abra")
def recipe_dir(recipe: str) -> str:
"""The current ABRA_DIR's working tree for a recipe (per-run inside a CI run)."""
return os.path.join(abra_dir(), "recipes", recipe)
def _run_pty(
args: list[str], timeout: int = 900, check: bool = True
) -> subprocess.CompletedProcess:
@ -77,9 +92,7 @@ def recipe_checkout(recipe: str, version: str) -> None:
a chaos (`-C`) deploy ignores ENV VERSION and uses the current checkout — together that silently
deployed LATEST for a 'previous-version' base, making the upgrade a no-op (Adversary F1d-2). With
this checkout + a non-chaos deploy, a pinned deploy genuinely deploys that version."""
import os
path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
path = recipe_dir(recipe)
# -f (force): the version-pinning checkout must yield the EXACT ref tree. Without it, a cc-ci
# install_steps-provided overlay (e.g. discourse's compose.ccci.yml, copied into the pinned base)
# is an UNTRACKED file that collides with the same path TRACKED in a later ref, and
@ -100,9 +113,7 @@ def has_lightweight_version_tags(recipe: str) -> bool:
'reference not found'.) The caller (deploy_app) uses this to fall back to a chaos base deploy
(which skips lint and deploys the explicitly-checked-out pinned version — see lifecycle.deploy_app).
Read-only: just `git tag` + `cat-file -t`; no fetch/mutation, so it can't trigger abra's revert."""
import os
path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
path = recipe_dir(recipe)
tags = subprocess.run(
["git", "-C", path, "tag", "-l"], capture_output=True, text=True
).stdout.split()
@ -231,9 +242,7 @@ def recipe_head_commit(recipe: str) -> str | None:
"""The current HEAD commit of the recipe checkout — captured right after fetch (the PR head, or
the catalogue current) so the upgrade tier can re-checkout it for the chaos redeploy after the
prev-tag base deploy reset the working tree (HC1)."""
import os
path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
path = recipe_dir(recipe)
proc = subprocess.run(["git", "-C", path, "rev-parse", "HEAD"], capture_output=True, text=True)
out = proc.stdout.strip()
return out or None
@ -241,10 +250,7 @@ def recipe_head_commit(recipe: str) -> str | None:
def recipe_versions(recipe: str) -> list[str]:
"""Published versions of a recipe, oldest→newest (from the recipe git tags)."""
import os
import subprocess
path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
path = recipe_dir(recipe)
proc = subprocess.run(
["git", "-C", path, "tag", "--sort=creatordate"], capture_output=True, text=True
)

View File

@ -30,17 +30,13 @@ import subprocess
import time
from . import abra, warm, warmsnap
from . import meta as meta_mod
def is_enrolled(recipe: str) -> bool:
"""True if `tests/<recipe>/recipe_meta.py` sets `WARM_CANONICAL = True`. Missing meta → False."""
path = os.path.join(os.path.dirname(__file__), "..", "..", "tests", recipe, "recipe_meta.py")
if not os.path.exists(path):
return False
ns: dict = {}
with open(path) as fh:
exec(compile(fh.read(), path, "exec"), ns) # noqa: S102 (trusted, in-repo)
return bool(ns.get("WARM_CANONICAL"))
"""True if `tests/<recipe>/recipe_meta.py` sets `WARM_CANONICAL = True`. Missing meta → False.
Reads through the single meta loader (rcust P1 — no per-module exec)."""
return bool(meta_mod.load(recipe).WARM_CANONICAL)
def canonical_domain(recipe: str) -> str:
@ -51,7 +47,7 @@ def canonical_domain(recipe: str) -> str:
def enrolled_recipes() -> list[str]:
"""All recipes enrolled as data-warm canonicals (recipe_meta.WARM_CANONICAL=True), sorted. Used
by the WC6 nightly sweep to know which canonicals to refresh via a green cold run on latest."""
tests_dir = os.path.join(os.path.dirname(__file__), "..", "..", "tests")
tests_dir = meta_mod.TESTS_DIR
out = []
try:
for name in sorted(os.listdir(tests_dir)):

View File

@ -20,7 +20,7 @@ Per Phase-2 DECISIONS:
Run state:
- `$CCCI_DEPS_FILE` — JSON file written by the orchestrator after each dep deploys; each entry is
`{"recipe": "<dep-recipe>", "domain": "<dep-domain>", "version": null}`. Tests access via the
`deps_apps` pytest fixture defined in `tests/conftest.py`.
`deps` pytest fixture defined in `tests/conftest.py`.
"""
from __future__ import annotations
@ -31,19 +31,7 @@ import os
from collections.abc import Iterable
from . import lifecycle, naming
def declared_deps(recipe: str) -> list[str]:
"""Read `DEPS` from `tests/<recipe>/recipe_meta.py` — a list of recipe names this recipe needs
deployed alongside it. Returns [] if none."""
path = os.path.join(os.path.dirname(__file__), "..", "..", "tests", recipe, "recipe_meta.py")
if not os.path.exists(path):
return []
ns: dict = {}
with open(path) as fh:
exec(compile(fh.read(), path, "exec"), ns) # noqa: S102 (trusted, in-repo)
deps = ns.get("DEPS") or []
return [str(d) for d in deps if d]
from . import meta as meta_mod
def dep_domain(parent_recipe: str, pr: str, ref: str | None, dep_recipe: str) -> str:
@ -62,11 +50,11 @@ def write_run_state(deps_state) -> None:
"""Write the deps state file ($CCCI_DEPS_FILE). Two shapes supported (canonical=keyed dict):
1. **Legacy list-of-entries:** `[{"recipe": "<dep>", "domain": "<d>"}, ...]` (Q2.3 original).
Still accepted by `load_run_state` for backwards compat — `deps_apps` fixture flattens.
Still accepted by `load_run_state` for backwards compat — the `deps` fixture flattens.
2. **NEW per-spec dict (operator-2026-05-28 SSO-dep plan §3.2):**
`{"<dep_recipe>": {"recipe": "<dep>", "domain": "<d>", "realm": "...",
"client_id": "...", "client_secret": "...", "admin_user": "...", "admin_password": "..."}}`.
The `setup_custom_tests.sh` per-recipe hook reads this via `jq` to wire OIDC env.
The per-recipe `install_steps.sh` hook reads this via `jq` to wire OIDC env.
No-op if `$CCCI_DEPS_FILE` isn't set."""
path = os.environ.get("CCCI_DEPS_FILE")
@ -81,11 +69,12 @@ def deploy_deps(
pr: str,
ref: str | None,
deps: Iterable[str],
meta_for: dict[str, dict] | None = None,
meta_for: dict | None = None,
) -> list[dict]:
"""Deploy each declared dep, sequentially, at its per-run domain. Returns the list of state
dicts (one per dep). `meta_for` maps dep_recipe -> meta (HEALTH_PATH/HEALTH_OK/timeouts) so the
readiness wait uses per-dep config; missing dep meta falls back to (/, 200/301/302, 600s)."""
dicts (one per dep). `meta_for` maps dep_recipe -> RecipeMeta (HEALTH_PATH/HEALTH_OK/timeouts)
so the readiness wait uses per-dep config; a missing dep meta is loaded via meta.load()
(defaults: /, 200/301/302, 600s)."""
meta_for = meta_for or {}
state: list[dict] = []
for dep in deps:
@ -94,20 +83,21 @@ def deploy_deps(
# NB: each dep_app gets a fresh deploy_count entry only on `_record_deploy` which fires
# inside `lifecycle.deploy_app`. For Phase 2 the deploy-count guard (DG4.1) counts the
# parent + its deps as distinct install events — by design, since each is a separate app.
dm = meta_for.get(dep, {})
dm = meta_for.get(dep) or meta_mod.load(dep)
lifecycle.deploy_app(
dep,
domain,
secrets=True,
deploy_timeout=int(dm.get("DEPLOY_TIMEOUT", 900)),
deploy_timeout=int(dm.DEPLOY_TIMEOUT),
meta=dm,
)
try:
lifecycle.wait_healthy(
domain,
ok_codes=tuple(dm.get("HEALTH_OK", (200, 301, 302))),
path=dm.get("HEALTH_PATH", "/"),
deploy_timeout=int(dm.get("DEPLOY_TIMEOUT", 600)),
http_timeout=int(dm.get("HTTP_TIMEOUT", 600)),
ok_codes=tuple(dm.HEALTH_OK),
path=dm.HEALTH_PATH,
deploy_timeout=int(dm.DEPLOY_TIMEOUT),
http_timeout=int(dm.HTTP_TIMEOUT),
)
except Exception:
# If a dep fails to converge, abort the whole resolve — let the caller teardown
@ -163,7 +153,7 @@ def load_run_state():
def deps_as_dict(state) -> dict[str, dict]:
"""Coerce either shape (legacy list or new dict) into a recipe→entry dict for the deps_apps
"""Coerce either shape (legacy list or new dict) into a recipe→entry dict for the `deps`
fixture + dependent-tests consumption."""
if isinstance(state, dict):
return state

View File

@ -11,7 +11,8 @@ hook; the orchestrator decides additive-vs-skip. Sources, in precedence order
> cc-ci tests/<recipe>/test_<op>.py
(the generic tests/_generic/test_<op>.py is the always-present floor, run separately by default)
custom (non-lifecycle) test_*.py — ALL run, additively, from BOTH locations (opt-in).
custom test_*.py (functional/ + playwright/ ONLY, rcust P4 placement rule) — ALL run,
additively, from BOTH locations (opt-in).
install-steps hook — install_steps.sh: repo-local > cc-ci, or none.
@ -100,29 +101,22 @@ def resolve_op(recipe: str, op: str, repo_local_dir: str | None) -> tuple[str, s
def custom_tests(recipe: str, repo_local_dir: str | None) -> list[tuple[str, str]]:
"""All non-lifecycle test_*.py from cc-ci's tests/<recipe>/ and (if approved) the recipe's
repo-local tests/. Discovered locations (Phase 2 §4.1):
- the top-level dir tests/<recipe>/test_*.py (legacy + cross-cutting)
- functional/ tests/<recipe>/functional/test_*.py (parity ports + recipe-specific)
- playwright/ tests/<recipe>/playwright/test_*.py (UI flows P6)
Files named `test_<op>.py` (lifecycle ops) are excluded from this list — the orchestrator runs
those in their lifecycle tier, not the custom one. Repo-local is consulted only for
allowlist-approved recipes (HC2)."""
"""All custom-tier test_*.py from cc-ci's tests/<recipe>/ and (if approved) the recipe's
repo-local tests/. PLACEMENT RULE (rcust P4): custom tests live ONLY under
- functional/ tests/<recipe>/functional/test_*.py (parity ports + recipe-specific)
- playwright/ tests/<recipe>/playwright/test_*.py (UI flows)
A top-level test_*.py is a LIFECYCLE OVERLAY (test_<op>.py) and nothing else — top-level
non-lifecycle files are NOT discovered (zero users at the time of the change; the lifecycle-
name exclusion below stays as a safety net so a misfiled test_<op>.py can never double-run).
Repo-local is consulted only for allowlist-approved recipes (HC2)."""
lifecycle_names = {f"test_{op}.py" for op in LIFECYCLE_OPS}
subdirs = ("functional", "playwright")
found: list[tuple[str, str]] = []
for source, d in (("cc-ci", cc_ci_dir(recipe)), ("repo-local", _gated(recipe, repo_local_dir))):
if not d or not os.path.isdir(d):
continue
# top-level (legacy / cross-cutting tests not under functional/playwright)
for p in sorted(glob.glob(os.path.join(d, "test_*.py"))):
if os.path.basename(p) not in lifecycle_names:
found.append((source, p))
# functional/ and playwright/ subdirs (Phase 2 §4.1)
for sub in subdirs:
for p in sorted(glob.glob(os.path.join(d, sub, "test_*.py"))):
# Phase-2 layout: lifecycle ops never live under functional/playwright, but be
# explicit so a misfiled file doesn't silently get double-run.
if os.path.basename(p) not in lifecycle_names:
found.append((source, p))
return found
@ -144,7 +138,7 @@ def install_steps(recipe: str, repo_local_dir: str | None) -> tuple[str, str] |
def pre_op_hook(recipe: str, op: str, repo_local_dir: str | None) -> tuple[str, str] | None:
"""The pre-op seed hook for `op`: the path to a recipe `ops.py` module that defines a
`pre_<op>(domain, meta)` callable, or None. cc-ci's tests/<recipe>/ops.py wins; the repo-local
`pre_<op>(ctx)` callable, or None. cc-ci's tests/<recipe>/ops.py wins; the repo-local
ops.py is consulted only for allowlist-approved recipes (HC2). The orchestrator imports the
module and calls pre_<op> BEFORE performing the op (HC3 op/assertion split — overlays seed
pre-op state here, then assert post-op in test_<op>.py)."""

View File

@ -19,22 +19,24 @@ import ssl
import time
from . import abra, lifecycle
from . import meta as meta_mod
# A recipe is backup-capable iff a compose file carries a truthy backupbot.backup label.
_BACKUPBOT_RE = re.compile(r"backupbot\.backup\b[^\n]*\btrue\b", re.IGNORECASE)
def _recipe_dir(recipe: str) -> str:
return os.path.expanduser(f"~/.abra/recipes/{recipe}")
return abra.recipe_dir(recipe) # the per-run tree inside a CI run ($ABRA_DIR)
def backup_capable(recipe: str, meta: dict | None = None) -> bool:
def backup_capable(recipe: str, meta=None) -> bool:
"""Whether the harness should run the backup/restore tiers (else they are a clean N/A skip, DG3).
`recipe_meta.BACKUP_CAPABLE` (bool) overrides; otherwise auto-detect by scanning the recipe's
compose*.yml for a truthy `backupbot.backup` label (the Co-op Cloud backup convention)."""
if meta and "BACKUP_CAPABLE" in meta:
return bool(meta["BACKUP_CAPABLE"])
`recipe_meta.BACKUP_CAPABLE` (bool) overrides when explicitly set (RecipeMeta default is None =
unset); otherwise auto-detect by scanning the recipe's compose*.yml for a truthy
`backupbot.backup` label (the Co-op Cloud backup convention)."""
if meta is not None and meta.BACKUP_CAPABLE is not None:
return bool(meta.BACKUP_CAPABLE)
for path in glob.glob(os.path.join(_recipe_dir(recipe), "compose*.yml")):
try:
with open(path) as fh:
@ -75,7 +77,7 @@ def served_cert(domain: str, port: int = 443) -> tuple[bool, str]:
return (True, f"CN={cn} SAN={sans}")
def assert_serving(domain: str, meta: dict) -> None:
def assert_serving(domain: str, meta) -> None:
"""The single generic "is the app really serving?" assertion (DG1).
The app-vs-Traefik-fallback proof is steps 1+2 (both load-bearing, verified by the Adversary):
@ -90,14 +92,14 @@ def assert_serving(domain: str, meta: dict) -> None:
Steps 12 are BOUNDED POLLS (no bare sleep), so a state-mutating op (upgrade/restore) that leaves
the app briefly reconverging settles, while a persistent failure still fails within the timeout."""
deadline = time.time() + meta["DEPLOY_TIMEOUT"]
deadline = time.time() + meta.DEPLOY_TIMEOUT
while time.time() < deadline and not lifecycle.services_converged(domain):
time.sleep(5)
assert lifecycle.services_converged(domain), f"{domain}: services did not converge"
path = meta["HEALTH_PATH"]
ok = tuple(meta["HEALTH_OK"])
deadline = time.time() + meta["HTTP_TIMEOUT"]
path = meta.HEALTH_PATH
ok = tuple(meta.HEALTH_OK)
deadline = time.time() + meta.HTTP_TIMEOUT
served = False
status, body = 0, ""
while time.time() < deadline:
@ -141,7 +143,7 @@ def op_state() -> dict:
return {}
def assert_upgraded(domain: str, meta: dict) -> None:
def assert_upgraded(domain: str, meta) -> None:
"""Generic UPGRADE assertion (post-op): the orchestrator already performed the upgrade once via
`abra app deploy --chaos` of the PR-head checkout. Assert it reconverged + still serves AND that
the deployment is genuinely the PR-head code under test (HC1) — non-vacuously (guarding F1d-2).
@ -212,7 +214,7 @@ def assert_backup_artifact(domain: str) -> str:
return snap_id
def assert_restore_healthy(domain: str, meta: dict) -> None:
def assert_restore_healthy(domain: str, meta) -> None:
"""Generic RESTORE assertion (post-op): the orchestrator already restored. Assert the app is
healthy + serving again (assert_serving polls, so the post-restore reconverge settles)."""
assert_serving(domain, meta)
@ -226,7 +228,7 @@ def perform_upgrade(
recipe: str,
head_ref: str | None,
deploy_timeout: int = 900,
meta: dict | None = None,
meta=None,
) -> dict[str, str | None]:
"""Perform the UPGRADE op once, in place, to the PR-HEAD code under test (HC1): re-checkout the
PR head (the prev-tag base deploy reset the recipe working tree), then `abra app deploy --chaos`
@ -244,7 +246,8 @@ def perform_upgrade(
STRICTER convergence+health wait here: services N/N (wait_healthy) + app HEALTH_PATH healthy +
any recipe READY_PROBE (collabora WOPI discovery 200). This bounds readiness by OUR generous
deadline, not abra's impatient one — and is stronger evidence than abra's monitor."""
meta = meta or {}
if meta is None:
meta = meta_mod.load(recipe)
before = lifecycle.deployed_identity(domain)
if head_ref:
lifecycle.recipe_checkout_ref(recipe, head_ref)
@ -253,9 +256,7 @@ def perform_upgrade(
# (target) version, so the base deploys minimally WITHOUT it and the upgrade adds it to COMPOSE_FILE
# here, after the PR-head checkout (which ships the overlay) and before the chaos redeploy that
# picks up the new .env. Dict or callable(domain)->dict. No-op for recipes without it.
upgrade_env = meta.get("UPGRADE_EXTRA_ENV") or {}
if callable(upgrade_env):
upgrade_env = upgrade_env(domain) or {}
upgrade_env = meta_mod.upgrade_extra_env(meta, meta_mod.hook_ctx(domain, meta, op="upgrade"))
for k, v in upgrade_env.items():
print(f" upgrade-env: {k}={v}", flush=True)
abra.env_set(domain, k, v)
@ -266,14 +267,12 @@ def perform_upgrade(
# Own the convergence verification (abra's monitor was skipped via -c).
lifecycle.wait_healthy(
domain,
ok_codes=tuple(meta.get("HEALTH_OK", (200, 301, 302))),
path=meta.get("HEALTH_PATH", "/"),
deploy_timeout=int(meta.get("DEPLOY_TIMEOUT", deploy_timeout)),
http_timeout=int(meta.get("HTTP_TIMEOUT", 300)),
)
lifecycle.wait_ready_probes(
meta, domain, timeout=int(meta.get("DEPLOY_TIMEOUT", deploy_timeout))
ok_codes=tuple(meta.HEALTH_OK),
path=meta.HEALTH_PATH,
deploy_timeout=int(meta.DEPLOY_TIMEOUT),
http_timeout=int(meta.HTTP_TIMEOUT),
)
lifecycle.wait_ready_probes(meta, domain, timeout=int(meta.DEPLOY_TIMEOUT), op="upgrade")
after = lifecycle.deployed_identity(domain)
# Evidence (HC1): the chaos-version label = the deployed recipe commit; it should match the
# PR-head we checked out — proving the upgrade deployed the code under test, not a published tag.

View File

@ -7,18 +7,20 @@ next run. Callers wrap deploy()/teardown() in try/finally (or a pytest finalizer
from __future__ import annotations
import contextlib
import datetime
import fcntl
import glob
import json
import os
import re
import shutil
import socket
import ssl
import subprocess
import time
import urllib.request
from . import abra
from . import abra, lifetime
from . import meta as meta_mod
GATEWAY_IP = "143.244.213.108" # *.ci.commoninternet.net -> gateway (TLS passthrough to cc-ci)
# A run app domain is "<recipe[:4]>-<6hex>.ci.commoninternet.net" (see DECISIONS.md). Used by the
@ -31,72 +33,67 @@ class TeardownError(RuntimeError):
# --- Concurrent-run safety (capacity=2) -------------------------------------------------------
# Two cooperating mechanisms, both process-lifetime-scoped so SIGKILL can't leak a stale lock:
# 1. Per-recipe flock: ~/.abra/recipes/<recipe> is ONE shared working tree that fetch_recipe
# rm-rf's/reclones and the upgrade tier git-checkouts mid-run. Concurrent runs of the SAME
# recipe would corrupt each other's deploy tree (observed: immich builds 229/230 deployed a
# tree missing its config), so they serialise on an exclusive flock; different recipes run in
# parallel. The kernel drops a flock when the holder dies, however it dies.
# 2. Active-run registry: each run registers its app domain + pid before creating the app, so the
# janitor can tell a live concurrent run from a crashed run's orphan (see janitor()).
RECIPE_LOCK_DIR = "/run/lock"
ACTIVE_RUN_DIR = "/run/cc-ci-active"
# ONE mechanism, process-lifetime-scoped so SIGKILL can't leak a stale claim: every run holds an
# exclusive kernel flock on its app DOMAIN (/run/lock/cc-ci-app-<domain>.lock) for the whole run.
# A held lock implies a live owner — the kernel releases a flock when the holding process dies,
# however it dies. The janitor probes the lock (LOCK_NB) to tell a live concurrent run (held
# leave it) from a crashed run's orphan (acquirable → reap it); it never inspects pids and never
# steals a held lock. Recipe-tree corruption between same-recipe runs is gone structurally (each
# run deploys from its own per-run ABRA_DIR — there is no shared recipe tree and no recipe lock),
# and same-domain runs (double-!testme of one PR) serialise on this app lock.
# See docs/concurrency.md.
# Acquired app-lock file objects are retained here for the REMAINING PROCESS LIFETIME: if the
# caller drops the returned file object, GC would close the fd and silently release the lock —
# this list is the lock's owner of record. Never cleared; release is process exit.
_held_app_locks: list = []
def acquire_recipe_lock(recipe: str):
"""Take the per-recipe exclusive lock; blocks (with a log line) if another run of the same
recipe is in flight. Returns the open lock file — the CALLER must keep a reference for the
whole run; the lock is released only when the process exits and the fd closes."""
path = os.path.join(RECIPE_LOCK_DIR, f"cc-ci-recipe-{recipe}.lock")
f = open(path, "w") # noqa: SIM115 — deliberately held for the lifetime of the run
try:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
print(
f"== recipe lock: another {recipe} run is in flight — waiting for {path} "
"(shared ~/.abra/recipes checkout) ==",
flush=True,
)
fcntl.flock(f, fcntl.LOCK_EX)
print(f"== recipe lock: acquired {path} ==", flush=True)
def _app_lock_dir() -> str:
"""The app-domain lockfile dir. /run/lock (tmpfs: a reboot clears locks AND lockfiles, so
post-reboot apps probe as orphans and are reaped immediately). Env-overridable so the
tests/concurrency suite (and its helper subprocesses) can use a sandbox dir."""
return os.environ.get("CCCI_APP_LOCK_DIR", "/run/lock")
def _app_lock_path(domain: str) -> str:
return os.path.join(_app_lock_dir(), f"cc-ci-app-{domain}.lock")
def acquire_app_lock(domain: str):
"""Take the per-app-domain exclusive lock; blocks (with a log line) if another run of the
same domain is in flight (double-!testme serialisation). Returns the open lock file, which is
ALSO retained in _held_app_locks so the flock lives exactly as long as the process.
Unlink/recreate race guard: the janitor unlinks a reaped orphan's lockfile while holding its
flock, so a waiter blocked on the OLD inode can win a lock no later opener can observe (a new
open() at the path creates a FRESH inode). After every acquisition, verify the locked fd is
still the file at the path (st_ino match); if not, drop it and retry on the live path."""
path = _app_lock_path(domain)
waited = False
while True:
# PEP 446: the fd is non-inheritable, so subprocess children never carry the lock.
f = open(path, "a") # noqa: SIM115 — deliberately held for the rest of the process
try:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
if not waited:
print(f"== app lock: another run of {domain} is in flight — waiting ==", flush=True)
waited = True
fcntl.flock(f, fcntl.LOCK_EX)
try:
if os.fstat(f.fileno()).st_ino == os.stat(path).st_ino:
break # we hold the lock on the inode the path names — done
except FileNotFoundError:
pass
f.close() # locked a stale (unlinked) inode — retry on the live path
os.utime(f.fileno()) # mtime = acquisition time = lock age (janitor's long-held flag)
_held_app_locks.append(f)
if waited:
print(f"== app lock: acquired {path} ==", flush=True)
return f
def _registry_path(domain: str) -> str:
return os.path.join(ACTIVE_RUN_DIR, domain)
def register_run_app(domain: str) -> None:
"""Record this process as the live owner of a run app (called BEFORE the app is created, so a
concurrent run's janitor can never observe the app without its registration)."""
with contextlib.suppress(OSError):
os.makedirs(ACTIVE_RUN_DIR, exist_ok=True)
with open(_registry_path(domain), "w") as f:
f.write(str(os.getpid()))
def unregister_run_app(domain: str) -> None:
with contextlib.suppress(OSError):
os.remove(_registry_path(domain))
def _run_owner_state(domain: str) -> str:
"""'alive' if the registered owner is a live run_recipe_ci process, 'dead' if registered but
gone (definite orphan), 'unknown' if never registered (pre-registry code or post-reboot)."""
try:
with open(_registry_path(domain)) as f:
pid = int(f.read().strip())
except (OSError, ValueError):
return "unknown"
try:
with open(f"/proc/{pid}/cmdline", "rb") as f:
cmdline = f.read().decode(errors="replace").replace("\0", " ")
except OSError:
return "dead"
# Guard against pid reuse: the owner must still look like a harness run.
return "alive" if "run_recipe_ci" in cmdline else "dead"
def _docker_names(kind: str, stack: str) -> list[str]:
"""docker <kind> ls names filtered to a stack (kind: service|volume|secret)."""
proc = subprocess.run(
@ -116,62 +113,6 @@ def _residual(domain: str) -> dict:
}
def _stack_age_seconds(stack: str) -> float | None:
"""Age of the stack's oldest service, or None if not present."""
svcs = _docker_names("service", stack)
if not svcs:
return None
oldest = None
for s in svcs:
p = subprocess.run(
["docker", "service", "inspect", s, "--format", "{{.CreatedAt}}"],
capture_output=True,
text=True,
)
ts = p.stdout.strip()
try:
# docker emits e.g. 2026-05-27 00:12:33.123 +0000 UTC -> take the leading 19 chars
dt = datetime.datetime.strptime(ts[:19], "%Y-%m-%d %H:%M:%S").replace(
tzinfo=datetime.UTC
)
except ValueError:
continue
age = (datetime.datetime.now(datetime.UTC) - dt).total_seconds()
oldest = age if oldest is None else max(oldest, age)
return oldest
def _recipe_extra_env(recipe: str, domain: str) -> dict[str, str]:
"""Per-recipe extra .env keys, applied at every deploy (install + upgrade's old_app) so a recipe
with multi-domain / config needs is enrolled with NO shared-harness change (D5/M6.5). A recipe
declares `EXTRA_ENV` in tests/<recipe>/recipe_meta.py as either a dict or a callable
`EXTRA_ENV(domain) -> dict` (callable form lets it derive values from the per-run domain, e.g.
cryptpad's SANDBOX_DOMAIN). Returns {} if none."""
path = os.path.join(os.path.dirname(__file__), "..", "..", "tests", recipe, "recipe_meta.py")
if not os.path.exists(path):
return {}
ns: dict = {}
with open(path) as fh:
exec(compile(fh.read(), path, "exec"), ns) # noqa: S102 (trusted, in-repo)
ee = ns.get("EXTRA_ENV")
if callable(ee):
ee = ee(domain)
return {str(k): str(v) for k, v in (ee or {}).items()}
def _recipe_meta_flag(recipe: str, key: str) -> bool:
"""Read a boolean flag from tests/<recipe>/recipe_meta.py (e.g. CHAOS_BASE_DEPLOY). Returns
False if the recipe ships no meta or the flag is absent/falsey. Trusted in-repo exec, same as
_recipe_extra_env."""
path = os.path.join(os.path.dirname(__file__), "..", "..", "tests", recipe, "recipe_meta.py")
if not os.path.exists(path):
return False
ns: dict = {}
with open(path) as fh:
exec(compile(fh.read(), path, "exec"), ns) # noqa: S102 (trusted, in-repo)
return bool(ns.get(key))
def _record_deploy() -> None:
"""Increment the per-run deploy counter (DG4.1: one deploy per run). No-op unless the
orchestrator set CCCI_DEPLOY_COUNT_FILE — so it never affects standalone/manual use."""
@ -185,6 +126,34 @@ def _record_deploy() -> None:
f.write(str(n + 1))
def ccci_overlay_path(recipe: str) -> str:
"""The cc-ci-owned compose overlay for a recipe (rcust P2a: first-class, auto-discovered)."""
return os.path.join(meta_mod.TESTS_DIR, recipe, "compose.ccci.yml")
def has_ccci_overlay(recipe: str) -> bool:
return os.path.isfile(ccci_overlay_path(recipe))
def provide_ccci_overlay(recipe: str) -> None:
"""Copy tests/<recipe>/compose.ccci.yml into THIS run's recipe checkout (ABRA_DIR-aware), so
the recipe's COMPOSE_FILE reference resolves (rcust P2a — the harness owns the copy; recipes
no longer ship install_steps.sh boilerplate for it). No-op for recipes without an overlay."""
src = ccci_overlay_path(recipe)
if not os.path.isfile(src):
return
dest_dir = abra.recipe_dir(recipe)
if not os.path.isdir(dest_dir):
print(f" ccci-overlay: recipe dir {dest_dir} missing — cannot provide overlay", flush=True)
raise RuntimeError(f"recipe checkout missing for {recipe}: {dest_dir}")
shutil.copy(src, os.path.join(dest_dir, "compose.ccci.yml"))
print(
f" ccci-overlay: provided compose.ccci.yml to the {recipe} checkout "
"(first-class overlay; base deploy auto-chaos)",
flush=True,
)
def _run_install_steps(hook: tuple[str, str], recipe: str, domain: str) -> None:
"""Run a recipe's custom install-steps hook (install_steps.sh) during the install tier — after
`abra app new` + env defaults + secret generate, before deploy (Phase 1d DG5). The hook gets the
@ -217,9 +186,9 @@ def prepull_images(recipe: str, domain: str) -> None:
app-INIT time (slow-init apps like collabora/immich still need their recipe healthcheck/READY_PROBE).
Best-effort on resolution failure (skip + let the deploy pull as usual); HARD-fails on a real
pull error (don't mask it)."""
import os
recipe_dir = os.path.expanduser(f"~/.abra/recipes/{recipe}")
recipe_dir = abra.recipe_dir(recipe) # per-run tree inside a CI run
# The app .env lives in the CANONICAL servers path (the per-run ABRA_DIR's servers/ is a
# symlink to it, so abra and this path agree on the same file).
env_path = os.path.expanduser(f"~/.abra/servers/default/{domain}.env")
if not os.path.isdir(recipe_dir) or not os.path.isfile(env_path):
print(f" prepull: recipe dir or .env missing for {recipe} — skipping", flush=True)
@ -268,19 +237,28 @@ def deploy_app(
secrets: bool = True,
install_steps_hook: tuple[str, str] | None = None,
deploy_timeout: int = 900,
meta=None,
) -> None:
"""Create + configure + deploy an app. Forces LETS_ENCRYPT_ENV='' so traefik serves the
wildcard cert via the file provider and NEVER attempts ACME (adversary finding A1). Applies any
per-recipe EXTRA_ENV (recipe_meta.py) and the custom install-steps hook (Phase 1d) before deploy.
per-recipe EXTRA_ENV (recipe_meta.py), the custom install-steps hook (Phase 1d), and the
first-class `tests/<recipe>/compose.ccci.yml` overlay (rcust P2a) before deploy.
`meta` is the recipe's loaded RecipeMeta (EXTRA_ENV); the orchestrator loads once and passes
it down. Callers without one in hand (fixtures, warm reconcile) may omit it — it is then
loaded here via the single meta.load() path.
`deploy_timeout` is the subprocess timeout for `abra app deploy`. Caller (orchestrator) passes
`recipe_meta.DEPLOY_TIMEOUT` so heavy recipes (ghost, matrix-synapse, lasuite-meet) can extend
past the 900s default. abra's INTERNAL TIMEOUT (recipe's TIMEOUT env, default 300s) is set via
EXTRA_ENV; this is the Python subprocess wrapper's timeout so abra doesn't get SIGKILLed mid-deploy."""
if meta is None:
meta = meta_mod.load(recipe)
_record_deploy()
# Register BEFORE the app exists: a concurrent run's janitor must never see this app without
# its live-owner registration (it would reap an in-flight deploy).
register_run_app(domain)
# Lock BEFORE the app exists: a concurrent run's janitor must never see this app without a
# held app lock (it would probe it as an orphan and reap an in-flight deploy). Also the
# double-!testme serialisation point: a second run of the same domain blocks here.
acquire_app_lock(domain)
abra.app_config_remove(domain) # clear any stale .env from a prior crashed run
abra.app_new(recipe, domain, version=version, secrets=secrets)
# A pinned version must actually deploy that version: check the recipe out to the tag so the
@ -303,16 +281,18 @@ def deploy_app(
flush=True,
)
chaos = True
# A recipe may force a chaos base deploy via recipe_meta CHAOS_BASE_DEPLOY=True when an
# install_steps hook adds an untracked compose overlay to the recipe checkout (e.g. discourse's
# compose.ccci.yml, provided by install_steps for the pinned base). The untracked file makes
# abra's pinned-deploy clean-tree check FATA ('has locally unstaged changes'); chaos skips lint +
# the clean-tree gate and deploys the EXPLICITLY-checked-out pinned version (we already ran
# recipe_checkout(version) above) — NOT latest. Same mechanism as the lightweight-tag branch.
elif _recipe_meta_flag(recipe, "CHAOS_BASE_DEPLOY"):
# A first-class cc-ci compose overlay (tests/<recipe>/compose.ccci.yml, copied into the
# checkout below — rcust P2a) is an UNTRACKED file in the recipe checkout, which makes
# abra's pinned-deploy clean-tree check FATA ('has locally unstaged changes'). Auto-chaos:
# chaos skips lint + the clean-tree gate and deploys the EXPLICITLY-checked-out pinned
# version (we already ran recipe_checkout(version) above) — NOT latest. Same mechanism as
# the lightweight-tag branch. (Replaces the deleted CHAOS_BASE_DEPLOY meta flag — the
# overlay's presence IS the signal, killing the R7 implicit coupling.)
elif has_ccci_overlay(recipe):
print(
f" deploy_app({recipe}@{version}): CHAOS_BASE_DEPLOY set → chaos base deploy of the "
"checked-out pinned version (skips clean-tree/lint; deploys version, not LATEST)",
f" deploy_app({recipe}@{version}): compose.ccci.yml overlay present → chaos base "
"deploy of the checked-out pinned version (skips clean-tree/lint; deploys version, "
"not LATEST)",
flush=True,
)
chaos = True
@ -322,12 +302,18 @@ def deploy_app(
# it ourselves is recipe-agnostic and canonical (the run domain IS the app's domain).
abra.env_set(domain, "DOMAIN", domain)
abra.env_set(domain, "LETS_ENCRYPT_ENV", "")
for k, v in _recipe_extra_env(recipe, domain).items():
for k, v in meta_mod.extra_env(meta, meta_mod.hook_ctx(domain, meta)).items():
abra.env_set(domain, k, v)
if secrets:
abra.secret_generate(domain)
if install_steps_hook:
_run_install_steps(install_steps_hook, recipe, domain)
# First-class cc-ci compose overlay (rcust P2a): if the recipe ships
# tests/<recipe>/compose.ccci.yml, copy it into THIS run's recipe checkout (ABRA_DIR-aware)
# so the COMPOSE_FILE reference in the recipe's EXTRA_ENV resolves. Untracked, so it persists
# across the later PR-head checkout (idempotent when the head ships the same fix). Replaces
# the per-recipe install_steps.sh copy boilerplate + CHAOS_BASE_DEPLOY flag (auto-chaos above).
provide_ccci_overlay(recipe)
# HQ1: warm the local image store before the (real, unchanged) abra deploy.
prepull_images(recipe, domain)
abra.deploy(domain, chaos=chaos, timeout=deploy_timeout)
@ -384,7 +370,13 @@ def services_converged(domain: str) -> bool:
if proc.returncode != 0:
return False # a service vanished mid-check — not settled
for state in proc.stdout.split("\n"):
if state.strip() not in ("", "completed", "rollback_completed"):
# Only ACTIVE states block convergence. 'paused'/'rollback_paused' are terminal-without-
# intervention: swarm's default update-failure-action pauses the update on one task flicker
# and the flag then persists FOREVER (immich CI 241: app service 'paused' from a restart
# during restore, service back at 1/1 and healthy — the wait hung to its deadline). With
# N/N already required above, a paused update is settled for our purposes; the HTTP-health
# and tier assertions still gate whether the app actually works.
if state.strip() in ("updating", "rollback_started"):
return False
return True
@ -533,7 +525,7 @@ def chaos_redeploy(
abra.deploy(domain, chaos=True, timeout=deploy_timeout, no_converge_checks=no_converge_checks)
def wait_ready_probes(meta: dict, domain: str, timeout: int = 600) -> None:
def wait_ready_probes(meta, domain: str, timeout: int = 600, op: str | None = None) -> None:
"""Poll a recipe's optional READY_PROBE endpoints until each returns an accepted status, or raise.
A recipe_meta may define `READY_PROBE(domain) -> [{"host":..., "path":..., "ok":(200,)}, ...]`
@ -550,10 +542,10 @@ def wait_ready_probes(meta: dict, domain: str, timeout: int = 600) -> None:
must be released by the old task + rebound by the new) the voice server can be down while
HTTP-200 still passes — and backup-bot then execs into a not-running app container (409). Requiring
the voice port to be stably listening before proceeding closes that window."""
probe_fn = meta.get("READY_PROBE")
probe_fn = meta.READY_PROBE
if not callable(probe_fn):
return
probes = probe_fn(domain) or []
probes = probe_fn(meta_mod.hook_ctx(domain, meta, op=op)) or []
for probe in probes:
if "tcp_port" in probe:
host = probe.get("tcp_host", "127.0.0.1")
@ -713,23 +705,84 @@ def teardown_app(domain: str, verify: bool = True) -> None:
residual = _residual(domain)
if any(residual.values()):
raise TeardownError(f"teardown left residual for {domain}: {residual}")
# The app is gone — drop its active-run registration (janitor() also clears it when reaping).
unregister_run_app(domain)
# No unregistration step: the app lock releases implicitly at process exit. The clean run's
# leftover lockfile (unheld) is unlinked on sight by the next janitor's stale-lockfile sweep.
def janitor(max_age_seconds: int | None = None) -> None:
"""Reap orphaned run apps from crashed/rebooted runs. Matches the real naming scheme. Safe under
CONCURRENT runs (capacity=2): every harness run registers its app in the active-run registry
(register_run_app), so the janitor distinguishes the three cases instead of using age alone:
- registered + owner run_recipe_ci process ALIVE -> in-flight concurrent run: never reap
- registered + owner DEAD (crashed/SIGKILLed run) -> definite orphan: reap immediately
- no registry entry (pre-registry code, reboot) -> fall back to the age threshold
Reaps via docker primitives so it works even when the .env is gone (A2/A3). Age fallback default
2h, env-overridable via CCCI_JANITOR_MAX_AGE."""
import os
# A lock held longer than 2x the 60-min hard deadline can only be a leaked run (the deadline
# bounds every healthy run). Flag it for a human — NEVER steal a held lock.
LONG_HELD_LOCK_SECONDS = 2 * lifetime.HARD_DEADLINE_SECONDS
if max_age_seconds is None:
max_age_seconds = int(os.environ.get("CCCI_JANITOR_MAX_AGE", "7200"))
def _probe_and_reap(domain: str) -> None:
"""Probe one run app's lock; reap iff nobody holds it (kernel-guaranteed orphan).
Reaping happens WHILE HOLDING the probe lock, closing the janitor-vs-new-run race: a new run
of the same domain blocks in acquire_app_lock until the reap finishes, so a fresh app never
coexists with a half-reaped one. The lockfile is unlinked before release (still holding the
lock); a waiter that blocked on the unlinked inode re-checks identity and retries. Two racing
janitors arbitrate on the same flock: one reaps, the other sees 'held' and leaves —
teardown_app(verify=False) is idempotent either way."""
path = _app_lock_path(domain)
try:
# PEP 446: non-inheritable fd, same as acquire_app_lock.
f = open(path, "a") # noqa: SIM115 — closed in the finally below, lock released with it
except OSError as e:
print(f"!! janitor: cannot open lockfile {path} ({e}) — skipping {domain}", flush=True)
return
try:
try:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
# Held -> live run. Never steal; flag if it has been held implausibly long.
try:
held_for = time.time() - os.stat(path).st_mtime
except OSError:
held_for = 0
if held_for > LONG_HELD_LOCK_SECONDS:
print(
f"!! lock for {domain} held >{LONG_HELD_LOCK_SECONDS // 60}min — possible "
"leaked run; inspect with lslocks",
flush=True,
)
else:
print(
f" janitor: {domain} lock held — live concurrent run, leaving it", flush=True
)
return
# Acquired — but only the inode the PATH names counts (another janitor may have reaped
# and unlinked this inode while we raced; a lock on an unlinked inode protects nothing
# and unlinking the path now would delete a NEWER run's lockfile).
try:
if os.fstat(f.fileno()).st_ino != os.stat(path).st_ino:
return
except FileNotFoundError:
return
# Orphan: no live owner (the kernel released the lock when the owner died). Reap while
# holding the probe lock, then unlink the lockfile before releasing.
print(f" janitor: {domain} lock acquirable — orphan, reaping", flush=True)
with contextlib.suppress(Exception):
teardown_app(domain, verify=False)
with contextlib.suppress(OSError):
os.unlink(path)
finally:
f.close()
def janitor() -> None:
"""Reap orphaned run apps from crashed/rebooted runs; the kernel flock is the only liveness
oracle. For every candidate run app, probe its app-domain lock (LOCK_NB):
acquirable -> nobody holds it -> orphan -> reap under the probe lock + unlink lockfile
held -> live concurrent run -> leave it (warn if held >2x the hard deadline)
Candidate discovery is unchanged: `abra app ls` + a docker-service sweep (catches stacks
whose .env is already gone), both matched against RUN_APP_RE — warm/canonical apps never
match and are never probed. Post-reboot, /run/lock (tmpfs) is empty, so every surviving app
probes as an orphan and is reaped immediately (no age threshold). Stale lockfiles with no
app behind them are unlinked on sight. Degrades safely: an unreadable lockfile/dir is
skipped with a log line, never a crash. Reaps via docker primitives so it works even when
the .env is gone (A2/A3)."""
seen = set()
for app in abra.app_ls():
name = app.get("appName") or app.get("domain") or ""
@ -743,18 +796,22 @@ def janitor(max_age_seconds: int | None = None) -> None:
seen.add(f"{m.group(1)}.ci.commoninternet.net")
for name in seen:
owner = _run_owner_state(name)
if owner == "alive":
print(f" janitor: {name} is a live concurrent run — leaving it", flush=True)
continue
if owner == "unknown":
# No registry entry (manual run on pre-registry code, or post-reboot): only the age
# threshold protects it, as before.
stack = _stack_name(name)
age = _stack_age_seconds(stack)
if age is not None and age < max_age_seconds:
continue # young and of unknown provenance; leave it
# owner == "dead" (a crashed/killed run's definite orphan) or old enough -> reap
with contextlib.suppress(Exception):
teardown_app(name, verify=False)
unregister_run_app(name)
_probe_and_reap(name)
# Tidy /run/lock: a clean run's leftover lockfile is unheld and appless — unlink it (under
# its own probe lock, with the same identity check as above).
with contextlib.suppress(OSError):
for path in glob.glob(os.path.join(_app_lock_dir(), "cc-ci-app-*.lock")):
domain = os.path.basename(path)[len("cc-ci-app-") : -len(".lock")]
if domain in seen:
continue # handled (or deliberately left) above
with contextlib.suppress(OSError):
f = open(path, "a") # noqa: SIM115 — closed below, lock released with it
try:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
if os.fstat(f.fileno()).st_ino == os.stat(path).st_ino:
os.unlink(path)
except (BlockingIOError, FileNotFoundError):
pass # held (live run pre-deploy) or already gone — leave it
finally:
f.close()

View File

@ -0,0 +1,95 @@
"""Run-lifetime hardening (concurrency restructure P1).
The concurrency model's invariant chain is:
lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min hard deadline
Locks are kernel flocks released on process exit, so the only thing that needs managing is the
PROCESS lifetime. Three guards, installed at run startup (before any abra call) by
`install_lifetime_guards()`:
1. `PR_SET_PDEATHSIG(SIGTERM)`: if the parent (the drone step shell) dies — cancel, runner
crash, host shutdown of the step — the kernel delivers SIGTERM to the harness, so a dead
build can never leak a running harness that holds locks. Paired with a ppid==1 re-check
AFTER the prctl: a parent that died BEFORE the prctl took effect would never trigger the
death signal, so a harness that finds itself already reparented refuses to run.
2. SIGTERM handler: raise SystemExit so the run's `finally:` teardown funnel executes and the
process exits non-zero. Re-entrant deliveries during teardown are logged and IGNORED so a
second signal can't abort the cleanup the first one asked for (`begin_teardown()` guards
this; the run's own `finally:` blocks also call it so a signal landing mid-normal-teardown
can't abort that either).
3. `signal.alarm(3600)`: self-imposed hard deadline. SIGALRM funnels into the same teardown
path with a distinct log line. Teardown time after the deadline is not alarm-bounded —
interrupting a teardown buys nothing; the janitor (flock probe) is the backstop if a
teardown wedges and the process is killed harder.
"""
from __future__ import annotations
import ctypes
import os
import signal
import sys
HARD_DEADLINE_SECONDS = 60 * 60
_PR_SET_PDEATHSIG = 1 # linux/prctl.h
_state = {"tearing_down": False}
def begin_teardown() -> None:
"""Mark the teardown funnel as running. From here on SIGTERM/SIGALRM must NOT raise — it
would abort the very cleanup it asks for — so the handlers log and return instead. Called by
the handlers themselves before raising, and at the top of the run's `finally:` blocks."""
_state["tearing_down"] = True
def _funnel_handler(log_line: str, exit_code: int):
"""A signal handler that routes into the teardown funnel exactly once: log, then raise
SystemExit (propagates through the run's try/finally → teardown executes → non-zero exit).
While teardown is already running, further signals are logged and swallowed."""
def handler(signum: int, frame) -> None: # noqa: ARG001
print(log_line, flush=True)
if _state["tearing_down"]:
print(
f"== signal {signum} during teardown — ignored (teardown continues, "
"exit stays non-zero) ==",
flush=True,
)
return
begin_teardown()
raise SystemExit(exit_code)
return handler
def install_lifetime_guards(deadline_seconds: int = HARD_DEADLINE_SECONDS) -> None:
"""Install all three lifetime guards (see module docstring). Must run at harness startup,
before any abra call and before any lock is taken."""
libc = ctypes.CDLL("libc.so.6", use_errno=True)
if libc.prctl(_PR_SET_PDEATHSIG, signal.SIGTERM, 0, 0, 0) != 0:
err = ctypes.get_errno()
raise OSError(err, f"prctl(PR_SET_PDEATHSIG, SIGTERM) failed: {os.strerror(err)}")
# The prctl is armed now — but only fires for a parent death AFTER this point. If the parent
# already died, we are reparented (ppid 1) and would never get the signal: refuse to run, an
# orphaned harness would hold locks/apps with nothing managing its lifetime.
if os.getppid() == 1:
sys.exit("parent died before prctl(PR_SET_PDEATHSIG) — refusing to run orphaned")
signal.signal(
signal.SIGTERM,
_funnel_handler(
"== SIGTERM received (drone cancel / parent death) — tearing down ==",
128 + signal.SIGTERM,
),
)
minutes = deadline_seconds // 60
signal.signal(
signal.SIGALRM,
_funnel_handler(
f"== run exceeded {minutes}-minute hard deadline — tearing down ==",
128 + signal.SIGALRM,
),
)
signal.alarm(deadline_seconds)

320
runner/harness/meta.py Normal file
View File

@ -0,0 +1,320 @@
"""Single recipe-meta loader + declarative key registry (recipe-custom restructure P1; spec
docs/recipe-customization.md §8 R1).
THE one place `tests/<recipe>/recipe_meta.py` is `exec()`d. Every consumer (orchestrator, pytest
`meta` fixture, deploy env shaping, deps, warm-canonical enrollment, screenshot) reads the ONE
loaded `RecipeMeta` object instead of re-exec'ing the file and cherry-picking keys — that drift
(six divergent loaders, spec §4 L1L6) is what made `SCREENSHOT` an unreachable knob (R2) and let
key typos silently disable coverage (R6).
Validation (locked decision, recipe-custom-restructure-full-plan.md):
- unknown ALL-CAPS top-level name → MetaError (hard error, fails fast at load; the all-recipes
unit test catches it at PR time). Underscore-prefixed names (`_FOO`) are recipe-private and
exempt; lowercase names (helper functions/imports) are ignored.
- type mismatch → MetaError. Callables are accepted ONLY for hook-typed keys.
The KEYS registry is the single source of truth for the key set: it drives validation, the
RecipeMeta dataclass fields, and the generated reference table in docs/recipe-customization.md §4
(scripts/gen-meta-docs.py; a unit test asserts the committed table matches).
"""
from __future__ import annotations
import copy
import dataclasses
import difflib
import inspect
import json
import os
from collections.abc import Callable
ROOT = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
TESTS_DIR = os.path.join(ROOT, "tests")
class MetaError(Exception):
"""A recipe_meta.py failed registry validation (unknown key / type mismatch / callable on a
data key). Hard error by design: a typo'd key must fail the run at load, not silently reduce
coverage (spec §8 R6 — the worst failure mode for a CI harness)."""
@dataclasses.dataclass(frozen=True)
class Key:
"""One registered recipe_meta key: name, type tag, default, one-line doc (rendered into the
generated reference table), optional extra validator, and a deprecation marker (deprecated
keys still load+validate but are scheduled for deletion)."""
name: str
type: str # "int"|"str"|"tuple[int]"|"bool"|"dict_or_hook"|"hook"|"list[str]"|"dict"
default: object
doc: str
validate: Callable[[object], None] | None = None
deprecated: bool = False
# Expected positional-parameter names for a callable value (rcust P3 uniform ctx convention).
# Enforced at load so a legacy-signature hook (e.g. `def READY_PROBE(domain)`) fails with a
# CLEAR MetaError naming the migration — never a silent TypeError mid-run.
hook_params: tuple[str, ...] | None = None
KEYS: tuple[Key, ...] = (
Key(
"HEALTH_PATH",
"str",
"/",
"Path probed for serving/health checks (deploy wait + generic `assert_serving`).",
),
Key("HEALTH_OK", "tuple[int]", (200, 301, 302), "Acceptable HTTP status codes for health."),
Key("DEPLOY_TIMEOUT", "int", 600, "Max seconds to wait for swarm convergence per deploy."),
Key("HTTP_TIMEOUT", "int", 300, "Max seconds to wait for HTTP health after convergence."),
Key(
"BACKUP_CAPABLE",
"bool",
None,
"Override the backup-tier capability auto-detect (compose `backupbot.backup` labels). `False` forces N/A; `True` forces the tier on; unset = auto-detect.",
),
Key(
"EXPECTED_NA",
"dict",
None,
"Declare an N/A rung intentional: `{rung: reason}`. The cap stands either way; only the report wording changes.",
),
Key(
"READY_PROBE",
"hook",
None,
"Callable `(ctx) -> [probe, ...]` returning extra readiness probes, run after install AND after upgrade: HTTP `{host, path, ok}` or TCP `{tcp_host, tcp_port, stable}`.",
hook_params=("ctx",),
),
Key(
"UPGRADE_BASE_VERSION",
"str",
None,
"Exact published tag overriding the upgrade tier's base (default: `recipe_versions[-2]`).",
),
Key(
"BACKUP_VERIFY",
"hook",
None,
"Callable `(ctx) -> bool` post-backup data-capture check; `False` re-runs the backup (truncated-dump race guard), retried up to 3 attempts.",
hook_params=("ctx",),
),
Key(
"UPGRADE_EXTRA_ENV",
"dict_or_hook",
None,
"Extra `.env` keys applied after the PR-head checkout, before the chaos redeploy (env that exists only at head). Dict, or callable `(ctx) -> dict`.",
hook_params=("ctx",),
),
Key(
"EXTRA_ENV",
"dict_or_hook",
{},
"Extra `.env` keys applied at EVERY deploy (base install AND upgrade old-app). Dict, or callable `(ctx) -> dict` deriving values from the per-run domain (`ctx.domain`).",
hook_params=("ctx",),
),
Key(
"DEPS",
"list[str]",
[],
'Dep recipes deployed/provisioned alongside (e.g. `["keycloak"]`); creds land in `$CCCI_DEPS_FILE`.',
),
Key(
"WARM_CANONICAL",
"bool",
False,
"Enroll the recipe in the warm/canonical app system (docs/warm.md): green cold runs on LATEST advance the canonical snapshot.",
),
Key(
"SCREENSHOT",
"hook",
None,
"Callable `(page, ctx)` driving Playwright to a safe, credential-free post-login view for the results-card screenshot (default: landing page).",
hook_params=("page", "ctx"),
),
# (CHAOS_BASE_DEPLOY, OIDC_AT_INSTALL and SKIP_GENERIC were deleted in restructure P2:
# compose.ccci.yml is first-class + auto-chaos; install-time deps wiring is the only mode;
# the generic floor is suppressible only via the dev-only CCCI_SKIP_GENERIC* env form.)
)
_REGISTRY: dict[str, Key] = {k.name: k for k in KEYS}
# The one validated, attribute-access view of a recipe's customization. Generated from KEYS so the
# field set can never drift from the registry (frozen: consumers share one immutable object).
RecipeMeta = dataclasses.make_dataclass(
"RecipeMeta",
[(k.name, object, dataclasses.field(default=None)) for k in KEYS],
frozen=True,
)
RecipeMeta.__doc__ = (
"Validated per-recipe customization (one field per registered key; attribute access). "
"Built ONLY by meta.load()."
)
def meta_path(recipe: str, tests_dir: str | None = None) -> str:
"""Canonical path of a recipe's meta file (pure)."""
return os.path.join(tests_dir or TESTS_DIR, recipe, "recipe_meta.py")
def check_hook_signature(fn, expected: tuple[str, ...], where: str) -> None:
"""Enforce the uniform ctx hook convention (rcust P3): a hook callable's positional parameters
must be exactly `expected` (e.g. ("ctx",) or ("page", "ctx")). A legacy-signature hook (the
pre-restructure `(domain)` / `(domain, meta)` / `(page, domain, meta)` forms) raises a CLEAR
MetaError naming the migration — never a silent TypeError mid-run."""
try:
params = [
p.name
for p in inspect.signature(fn).parameters.values()
if p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
]
except (TypeError, ValueError): # builtins/odd callables — let the call site surface it
return
if tuple(params) != expected:
raise MetaError(
f"{where}: hook signature is ({', '.join(params)}) — the recipe-customization "
f"restructure (P3) changed ALL recipe hook signatures to ({', '.join(expected)}); "
f"read fields off the HookCtx (ctx.domain, ctx.base_url, ctx.meta, ctx.deps, ctx.op). "
f"See docs/recipe-customization.md §5."
)
def _coerce(key: Key, value: object, path: str) -> object:
"""Validate `value` against `key`'s declared type; normalize containers (tuple[int]/list[str]).
Raises MetaError on mismatch — including a callable supplied for a data-typed key."""
t = key.type
if callable(value) and t not in ("hook", "dict_or_hook"):
raise MetaError(
f"{path}: {key.name} is a data key (type {t}) — callables are accepted only for "
f"hook-typed keys"
)
if t == "int":
if isinstance(value, int) and not isinstance(value, bool):
return value
elif t == "str":
if isinstance(value, str):
return value
elif t == "bool":
if isinstance(value, bool):
return value
elif t == "tuple[int]":
if isinstance(value, tuple | list) and all(
isinstance(x, int) and not isinstance(x, bool) for x in value
):
return tuple(value)
elif t == "list[str]":
if isinstance(value, tuple | list) and all(isinstance(x, str) for x in value):
return list(value)
elif t == "dict":
if isinstance(value, dict):
return value
elif (
t == "hook"
and callable(value)
or t == "dict_or_hook"
and (isinstance(value, dict) or callable(value))
):
return value
raise MetaError(f"{path}: {key.name} must be {t}, got {type(value).__name__} ({value!r})")
def load(recipe: str, tests_dir: str | None = None):
"""Load + validate a recipe's customization -> RecipeMeta. THE only exec() of recipe_meta.py.
Missing file -> all registry defaults (the zero-config baseline, spec §2). Unknown
non-underscore ALL-CAPS top-level name or type mismatch -> MetaError (hard error).
`tests_dir` overrides the recipe-meta root (unit tests / fixtures)."""
path = meta_path(recipe, tests_dir)
values = {k.name: copy.copy(k.default) for k in KEYS}
if os.path.exists(path):
ns: dict = {}
with open(path) as fh:
exec(compile(fh.read(), path, "exec"), ns) # noqa: S102 (trusted, in-repo)
for name in sorted(ns):
if name.startswith("_") or not name.isupper():
continue # _FOO = recipe-private (exempt); lowercase = helpers/imports (ignored)
key = _REGISTRY.get(name)
if key is None:
near = difflib.get_close_matches(name, _REGISTRY, n=1)
hint = f" — did you mean {near[0]!r}?" if near else ""
raise MetaError(
f"{path}: unknown recipe_meta key {name!r}{hint}. Registered keys: "
f"{', '.join(sorted(_REGISTRY))}. Recipe-private constants must be "
f"underscore-prefixed (e.g. _{name})."
)
values[name] = _coerce(key, ns[name], path)
if key.hook_params and callable(values[name]):
check_hook_signature(values[name], key.hook_params, f"{path}: {name}")
if key.validate:
key.validate(values[name])
return RecipeMeta(**values)
def as_dict(meta) -> dict:
"""RecipeMeta -> {key: value} (every registered key, defaults included)."""
return dataclasses.asdict(meta)
def non_default(meta) -> dict:
"""The keys a recipe explicitly customized: {key: value} where value differs from the registry
default. Hooks compare by identity-vs-None (a set hook is always non-default). Feeds the run's
customization manifest (P5)."""
out = {}
for k in KEYS:
v = getattr(meta, k.name)
if v != k.default:
out[k.name] = v
return out
@dataclasses.dataclass(frozen=True)
class HookCtx:
"""The single argument every recipe hook receives (rcust P3 uniform ctx convention):
`EXTRA_ENV(ctx)`, `UPGRADE_EXTRA_ENV(ctx)`, `READY_PROBE(ctx)`, `BACKUP_VERIFY(ctx)`,
`SCREENSHOT(page, ctx)`, ops.py `pre_<op>(ctx)`."""
domain: str # the app's per-run domain
base_url: str # https://<domain>
meta: object # the recipe's full RecipeMeta
deps: dict | None # provisioned dep creds ({dep_recipe: entry}) or None if absent/empty
op: str | None # current lifecycle op (install|upgrade|backup|restore) or None
def _run_deps() -> dict | None:
"""The current run's provisioned dep creds from $CCCI_DEPS_FILE (either shape), or None.
Read directly (not via harness.deps) to keep meta.py import-cycle-free."""
path = os.environ.get("CCCI_DEPS_FILE")
if not path or not os.path.exists(path):
return None
try:
with open(path) as f:
data = json.load(f)
except (OSError, ValueError):
return None
if isinstance(data, dict):
return data or None
if isinstance(data, list):
out = {e["recipe"]: e for e in data if isinstance(e, dict) and e.get("recipe")}
return out or None
return None
def hook_ctx(domain: str, meta, *, op: str | None = None) -> HookCtx:
"""Build the HookCtx for a hook call site. Dep creds are picked up from the run's
$CCCI_DEPS_FILE when present (None otherwise)."""
return HookCtx(domain=domain, base_url=f"https://{domain}", meta=meta, deps=_run_deps(), op=op)
def _env_map(value, ctx: HookCtx) -> dict[str, str]:
if callable(value):
value = value(ctx)
return {str(k): str(v) for k, v in (value or {}).items()}
def extra_env(meta, ctx: HookCtx) -> dict[str, str]:
"""Resolve EXTRA_ENV (dict or callable(ctx)->dict) to the concrete per-run env map."""
return _env_map(meta.EXTRA_ENV, ctx)
def upgrade_extra_env(meta, ctx: HookCtx) -> dict[str, str]:
"""Resolve UPGRADE_EXTRA_ENV (dict or callable(ctx)->dict) to the concrete env map."""
return _env_map(meta.UPGRADE_EXTRA_ENV, ctx)

View File

@ -8,7 +8,7 @@ Secret-safety (R7, the cardinal screenshot guardrail): the screenshot step must
that displays generated credentials (an install wizard showing the initial admin password, a secrets
page, etc.). The DEFAULT capture is the app's **landing page** (a login form shows fields, not the
password) — safe for every recipe. A recipe that needs a post-login view opts in via a recipe-meta
`SCREENSHOT` hook: a callable `screenshot(page, domain, meta) -> None` that drives Playwright to a
`SCREENSHOT` hook: a callable `SCREENSHOT(page, ctx) -> None` that drives Playwright to a
safe, credential-free view and is responsible for not landing on a secrets page. The harness never
auto-fills a wizard.
@ -21,6 +21,7 @@ from __future__ import annotations
import os
from . import browser as harness_browser
from . import meta as meta_mod
# Default viewport for the captured screenshot — a desktop-ish frame that crops well into the card.
VIEWPORT = {"width": 1280, "height": 800}
@ -33,12 +34,19 @@ def screenshot_path(run_artifact_dir: str) -> str:
return os.path.join(run_artifact_dir, "screenshot.png")
def _load_screenshot_hook(recipe_meta: dict | None):
def _load_screenshot_hook(recipe_meta):
"""Return the recipe's optional SCREENSHOT hook (a callable) if it declared one, else None.
The hook drives Playwright to a safe post-login view; default is the landing page."""
if not recipe_meta:
The hook drives Playwright to a safe post-login view; default is the landing page.
`recipe_meta` is the loaded RecipeMeta (rcust P1 — the single loader actually delivers
SCREENSHOT now; under the old L1 allowlist the key never arrived, spec §8 R2). A plain dict
is still accepted for direct/manual callers."""
if recipe_meta is None:
return None
hook = recipe_meta.get("SCREENSHOT")
if isinstance(recipe_meta, dict):
hook = recipe_meta.get("SCREENSHOT")
else:
hook = getattr(recipe_meta, "SCREENSHOT", None)
return hook if callable(hook) else None
@ -67,8 +75,9 @@ def capture(domain: str, out_path: str, *, recipe_meta: dict | None = None) -> s
if hook is not None:
# Recipe-specific safe view (post-login etc.). The hook owns navigation +
# the no-secret-page guarantee; it should call page.screenshot itself, but if
# it doesn't, we still snap the resulting page below.
hook(page, domain, recipe_meta)
# it doesn't, we still snap the resulting page below. SCREENSHOT(page, ctx) —
# the uniform ctx convention (rcust P3).
hook(page, meta_mod.hook_ctx(domain, recipe_meta))
if not os.path.exists(out_path):
page.screenshot(path=out_path, full_page=False)
else:

View File

@ -47,6 +47,7 @@ from harness import ( # noqa: E402
discovery,
generic,
lifecycle,
lifetime,
naming,
warm,
warmsnap,
@ -57,6 +58,9 @@ from harness import ( # noqa: E402
from harness import ( # noqa: E402
deps as deps_mod,
)
from harness import ( # noqa: E402
meta as meta_mod,
)
from harness import ( # noqa: E402
results as results_mod,
)
@ -69,7 +73,7 @@ ALL_STAGES = ("install", "upgrade", "backup", "restore", "custom")
def sso_dep_unverified(declared, deps_ready: bool, requires_deps_skipped: int) -> bool:
"""F2-11 gate predicate (pure, unit-tested). True when a recipe declares DEPS but its
setup_custom_tests failed (deps not ready) AND that caused ≥1 `requires_deps` (SSO/OIDC) test
dep provisioning failed (deps not ready) AND that caused ≥1 `requires_deps` (SSO/OIDC) test
to SKIP. In that case the recipe's characteristic SSO claim was NOT verified, so the run must
NOT report GREEN — even though a skip-only pytest file exits 0 and leaves every tier 'pass'.
Generic-tier failure-isolation is preserved (those results stand); only the green SIGNAL is
@ -137,18 +141,73 @@ def _gitea_token() -> str | None:
return tok or None
def _run_state_path(name: str) -> str:
"""Run-scoped state file in the tempdir, keyed by run id + harness pid — NEVER by app domain.
A second run of the SAME domain overlaps this process (its main() preamble executes before it
blocks at the app lock inside deploy_app), so domain-keyed files get reset/removed under the
live run: M2(c) double-!testme produced a false DG4.1 deploy-count=2 in run 1 and a countfile
FileNotFoundError crash in run 2. Children never re-derive these paths — they receive them
via the CCCI_*_FILE env vars, so the key only has to be unique per harness process."""
rid = results_mod.run_id()
return os.path.join(tempfile.gettempdir(), f"ccci-{name}-{rid}-{os.getpid()}")
def setup_run_abra_dir() -> str:
"""P3: build + export this run's PER-RUN ABRA_DIR — structural isolation of recipe trees.
`<runs_dir>/<run-id>/abra/` with:
servers/ -> symlink to the canonical ~/.abra/servers. App .env files land in the shared
canonical path, so janitor discovery (`abra app ls`) and env-based teardown
work unchanged from any process; per-domain filenames + the app-domain lock
prevent write conflicts.
catalogue/ -> symlink to the canonical ~/.abra/catalogue (read-mostly).
recipes/ fresh + empty — THE isolation that matters: each run clones and git-checkouts
its own recipe trees, so concurrent runs (same recipe included) can never
corrupt each other's deploy tree. Replaces the per-recipe flock.
Exported as $ABRA_DIR — honored by the abra CLI and by every harness path helper
(abra.abra_dir()) — BEFORE any abra call. Rides along the existing run-dir retention."""
canonical = os.path.expanduser("~/.abra")
rid = results_mod.run_id()
if rid == "manual":
rid = f"manual-{os.getpid()}" # two concurrent hand-runs must not share a tree
run_abra_dir = os.path.join(results_mod.runs_dir(), rid, "abra")
os.makedirs(os.path.join(run_abra_dir, "recipes"), exist_ok=True)
for shared in ("servers", "catalogue"):
link = os.path.join(run_abra_dir, shared)
if not os.path.islink(link):
os.symlink(os.path.join(canonical, shared), link)
os.environ["ABRA_DIR"] = run_abra_dir
print(
f"== per-run ABRA_DIR: {run_abra_dir} (servers/catalogue -> canonical; fresh recipes/) ==",
flush=True,
)
return run_abra_dir
def fetch_recipe(recipe: str, ref: str | None, src: str | None) -> None:
"""Make the recipe available at the code under test. If SRC+REF point at the mirror PR,
"""Make the recipe available at the code under test in THIS RUN's recipe tree
($ABRA_DIR/recipes/<recipe>): a plain clone — no locking needed, no rm-rf of any shared
state (the rm below only clears this run's own leftovers, e.g. a janitor-triggered
`abra app ls` auto-clone or a Drone build-number reuse). If SRC+REF point at the mirror PR,
clone it at that ref; otherwise fetch the catalogue copy. Private mirror repos need the bot
token — passed via a per-command http.extraHeader (not persisted in .git/config, not printed)."""
recipes_dir = os.path.expanduser("~/.abra/recipes")
os.makedirs(recipes_dir, exist_ok=True)
dest = os.path.join(recipes_dir, recipe)
# CCCI_SKIP_FETCH=1: use the local recipe clone as-is (lets a test/Adversary stage a fake/broken
# ref — e.g. a simulated broken PR head for the --quick rollback proof — without it being clobbered
# by a re-fetch). Never set in production CI.
dest = abra.recipe_dir(recipe)
os.makedirs(os.path.dirname(dest), exist_ok=True)
# CCCI_SKIP_FETCH=1: use the locally STAGED recipe clone as-is (lets a test/Adversary stage a
# fake/broken ref — e.g. a simulated broken PR head for the --quick rollback proof — without it
# being clobbered by a re-fetch). Staging happens in the canonical ~/.abra/recipes/<recipe>;
# copy it into the per-run tree so the rest of the run reads the staged state. Never set in
# production CI.
if os.environ.get("CCCI_SKIP_FETCH") == "1":
print(f"[fetch] CCCI_SKIP_FETCH=1 — using local {recipe} recipe clone as-is", flush=True)
canonical = os.path.expanduser(f"~/.abra/recipes/{recipe}")
subprocess.run(["rm", "-rf", dest], check=False)
if os.path.isdir(canonical):
shutil.copytree(canonical, dest, symlinks=True)
print(
f"[fetch] CCCI_SKIP_FETCH=1 — using staged {recipe} clone as-is "
f"(copied {canonical} -> per-run tree)",
flush=True,
)
return
if src and ref:
url = f"https://git.autonomic.zone/{src}.git"
@ -177,7 +236,7 @@ def fetch_recipe(recipe: str, ref: str | None, src: str | None) -> None:
def snapshot_recipe_tests(recipe: str) -> str | None:
"""Copy the recipe-shipped tests/ to a stable temp dir, immune to abra re-checking-out the
recipe to a version tag during the run. Returns the snapshot path, or None if no tests/."""
src = os.path.expanduser(f"~/.abra/recipes/{recipe}/tests")
src = os.path.join(abra.recipe_dir(recipe), "tests")
if not os.path.isdir(src):
return None
has_overlay = glob.glob(os.path.join(src, "test_*.py")) or os.path.isfile(
@ -191,52 +250,29 @@ def snapshot_recipe_tests(recipe: str) -> str | None:
return dst
def _load_meta(recipe: str) -> dict:
"""Mirror tests/conftest._recipe_meta so the orchestrator's deploy/wait uses the same per-recipe
config the tiers see (timeouts, health path/codes)."""
meta = {
"HEALTH_PATH": "/",
"HEALTH_OK": (200, 301, 302),
"DEPLOY_TIMEOUT": 600,
"HTTP_TIMEOUT": 300,
}
path = os.path.join(ROOT, "tests", recipe, "recipe_meta.py")
if os.path.exists(path):
ns: dict = {}
with open(path) as fh:
exec(compile(fh.read(), path, "exec"), ns) # noqa: S102 (trusted, in-repo)
for k in list(meta) + [
"BACKUP_CAPABLE",
"SKIP_GENERIC",
"EXPECTED_NA",
"OIDC_AT_INSTALL",
"READY_PROBE",
"UPGRADE_BASE_VERSION",
"BACKUP_VERIFY",
"UPGRADE_EXTRA_ENV",
]:
if k in ns:
meta[k] = ns[k]
return meta
def _tier_env(domain: str) -> dict:
return dict(os.environ, CCCI_APP_DOMAIN=domain, CCCI_BASE_URL=f"https://{domain}")
def _skip_generic(op: str, meta: dict) -> bool:
def skip_generic_env_overrides() -> list[str]:
"""Active CCCI_SKIP_GENERIC* env overrides (rcust P2c: the meta key is deleted; the env form
is a documented LOCAL-DEV-ONLY escape hatch). Surfaced loudly when set in a CI (drone) run —
it reduces generic-floor coverage and must never silently ride a CI verdict."""
return sorted(
k for k in os.environ if k.startswith("CCCI_SKIP_GENERIC") and _truthy(os.environ.get(k))
)
def _skip_generic(op: str) -> bool:
"""Whether the generic assertion for `op` is opted out (Phase 1e HC3). Default: run (additive).
Opt-out, any of: env CCCI_SKIP_GENERIC (all ops), env CCCI_SKIP_GENERIC_<OP>, or the recipe's
declarative recipe_meta.SKIP_GENERIC list (op name, or "all"/"*")."""
Opt-out via env only (dev-only escape hatch, P2c): CCCI_SKIP_GENERIC (all ops) or
CCCI_SKIP_GENERIC_<OP>. The recipe_meta SKIP_GENERIC key is deleted (zero users)."""
if _truthy(os.environ.get("CCCI_SKIP_GENERIC")):
return True
if _truthy(os.environ.get(f"CCCI_SKIP_GENERIC_{op.upper()}")):
return True
sg = [str(s).lower() for s in (meta.get("SKIP_GENERIC") or [])]
return "all" in sg or "*" in sg or op in sg
return _truthy(os.environ.get(f"CCCI_SKIP_GENERIC_{op.upper()}"))
def _run_pre_hook(recipe: str, op: str, repo_local: str | None, domain: str, meta: dict) -> None:
def _run_pre_hook(recipe: str, op: str, repo_local: str | None, domain: str, meta) -> None:
"""Run the optional pre-op seed hook (recipe ops.py `pre_<op>`) BEFORE the harness performs the
op (HC3 op/assertion split): overlays seed data-continuity markers / the backup→restore mutation
here, then assert post-op in test_<op>.py. cc-ci's ops.py is trusted; a repo-local ops.py is
@ -253,7 +289,11 @@ def _run_pre_hook(recipe: str, op: str, repo_local: str | None, domain: str, met
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
print(f" pre-op seed ({source}): {os.path.relpath(path, ROOT)}::pre_{op}", flush=True)
getattr(mod, f"pre_{op}")(domain, meta)
fn = getattr(mod, f"pre_{op}")
# Uniform ctx convention (rcust P3): pre_<op>(ctx). A legacy (domain, meta) hook fails
# HERE with a clear migration message, not a TypeError mid-call.
meta_mod.check_hook_signature(fn, ("ctx",), f"{os.path.relpath(path, ROOT)}::pre_{op}")
fn(meta_mod.hook_ctx(domain, meta, op=op))
finally:
if d in sys.path:
sys.path.remove(d)
@ -266,7 +306,7 @@ def _perform_op(
head_ref: str | None,
op_state: dict,
deploy_timeout: int = 900,
meta: dict | None = None,
meta=None,
) -> None:
"""Perform the single mutating op ONCE (the harness owns the op, HC3). install has no op. Records
what the assertions need (pre-upgrade identity, backup snapshot_id) into op_state. None of these
@ -289,9 +329,10 @@ def _perform_op(
# verify fails we re-run the WHOLE backup (fresh restic snapshot) with a re-stabilised DB, up to
# 3 attempts. Recipes without BACKUP_VERIFY are unaffected (single backup, as before).
snap = generic.perform_backup(domain)
verify = meta.get("BACKUP_VERIFY") if meta else None
verify = meta.BACKUP_VERIFY if meta else None
verify_ctx = meta_mod.hook_ctx(domain, meta, op="backup") if meta else None
attempt = 1
while callable(verify) and not verify(domain) and attempt < 3:
while callable(verify) and not verify(verify_ctx) and attempt < 3:
attempt += 1
print(
f" backup-verify FAILED (attempt {attempt - 1}/3) — backup did not capture the "
@ -299,7 +340,7 @@ def _perform_op(
flush=True,
)
snap = generic.perform_backup(domain)
if callable(verify) and not verify(domain):
if callable(verify) and not verify(verify_ctx):
print(
f" !! backup-verify still FAILED after {attempt} attempts — backup is incomplete",
flush=True,
@ -315,7 +356,7 @@ def run_lifecycle_tier(
op: str,
repo_local: str | None,
domain: str,
meta: dict,
meta,
head_ref: str | None,
op_state: dict,
records: list[dict] | None = None,
@ -330,7 +371,7 @@ def run_lifecycle_tier(
a {tier,source,file,rc,junit} record appended, so the run can assemble per-stage/per-test
results.json + the level afterwards. Purely additive — does not change the verdict."""
overlay = discovery.resolve_overlay_op(recipe, op, repo_local)
skip_gen = _skip_generic(op, meta)
skip_gen = _skip_generic(op)
files: list[tuple[str, str]] = []
if not skip_gen:
files.append(discovery.generic_op(op))
@ -355,7 +396,7 @@ def run_lifecycle_tier(
recipe,
head_ref,
op_state,
deploy_timeout=int(meta.get("DEPLOY_TIMEOUT", 900)),
deploy_timeout=int(meta.DEPLOY_TIMEOUT),
meta=meta,
)
with open(os.environ["CCCI_OP_STATE_FILE"], "w") as f:
@ -393,7 +434,7 @@ def run_lifecycle_tier(
def _enrich_deps_with_sso(parent_recipe: str, parent_domain: str, deps_list) -> dict[str, dict]:
"""For each dep, set up a fresh realm/client + test user via the harness's provider-specific
setup function, then return a recipe→entry dict carrying domain + admin + realm/client/user
info — the shape the `setup_custom_tests.sh` hook (and dependent tests) read.
info — the shape the `install_steps.sh` hook (and dependent tests) read.
Provider routing: today only `keycloak` is supported. authentik will need a parallel
`setup_authentik_realm` when an authentik-dep recipe enrolls (DEFERRED.md #9).
@ -407,7 +448,7 @@ def _enrich_deps_with_sso(parent_recipe: str, parent_domain: str, deps_list) ->
if not dep_recipe or not dep_domain:
continue
if dep_recipe != "keycloak":
# Provider not yet supported — record bare entry; setup_custom_tests.sh / tests will
# Provider not yet supported — record bare entry; install_steps.sh / tests will
# raise if they need realm/client info they don't see.
out[dep_recipe] = entry
continue
@ -451,12 +492,10 @@ def _provision_deps(
Splits deps into live-warm (shared provider at a stable domain + a per-run realm) vs cold
(co-deployed per run), provisions each dep's SSO realm/client/user, and persists the enriched
dict the `setup_custom_tests.sh`/`install_steps.sh` hooks + dependent tests read. Raises on any
failure (the caller marks deps-not-ready). Used by BOTH wiring paths:
- post-deploy (legacy): provision AFTER generic tiers, then `setup_custom_tests.sh` does an
in-place OIDC redeploy.
- install-time (`OIDC_AT_INSTALL`, Q3.2a): provision BEFORE the single deploy so the
install-tier `install_steps.sh` hook wires OIDC env into that one deploy — no reconverge.
dict the `install_steps.sh` hooks + dependent tests read. Raises on any failure (the caller
marks deps-not-ready). Install-time wiring is the ONLY mode (rcust P2b): provision BEFORE the
single deploy so the install-tier `install_steps.sh` hook wires OIDC env into that one deploy —
no reconverge, no post-deploy `setup_custom_tests.sh` machinery.
"""
warm_deps, cold_deps = [], []
for d in declared:
@ -467,7 +506,7 @@ def _provision_deps(
if wd:
print(f" dep: {d} warm provider {wd} not up — cold fallback", flush=True)
cold_deps.append(d)
dep_metas = {d: _load_meta(d) for d in cold_deps}
dep_metas = {d: meta_mod.load(d) for d in cold_deps}
deps_list = (
deps_mod.deploy_deps(recipe, os.environ.get("PR", "0"), ref, cold_deps, meta_for=dep_metas)
if cold_deps
@ -485,32 +524,6 @@ def _provision_deps(
return deps_state
def _run_setup_custom_tests_hook(recipe: str, domain: str, deps_file: str) -> None:
"""Run `tests/<recipe>/setup_custom_tests.sh` if present (operator-2026-05-28 SSO-dep plan
§3.2). The hook reads `$CCCI_DEPS_FILE`, sets OIDC env via `abra app config set` + secret
insert, and triggers an in-place `abra app deploy --force --chaos`. Failure here propagates
to mark deps-not-ready (caught in main())."""
path = os.path.join(ROOT, "tests", recipe, "setup_custom_tests.sh")
if not os.path.isfile(path):
# No hook = recipe doesn't need post-deps wiring; deps are deployed + creds available
# via deps_apps fixture as-is.
print(
f" setup_custom_tests: no hook at {os.path.relpath(path, ROOT)} (deps creds ready in $CCCI_DEPS_FILE)",
flush=True,
)
return
print(f" setup_custom_tests hook: {os.path.relpath(path, ROOT)}", flush=True)
rc = subprocess.run(
["bash", path],
check=False,
env=dict(os.environ, CCCI_APP_DOMAIN=domain, CCCI_RECIPE=recipe, CCCI_DEPS_FILE=deps_file),
)
if rc.returncode != 0:
raise RuntimeError(
f"setup_custom_tests.sh exited {rc.returncode} (deps env not wired into parent)"
)
def run_custom(
recipe: str,
repo_local: str | None,
@ -553,7 +566,7 @@ def _wait_undeployed(domain: str, timeout: int = 120) -> None:
def run_quick(
recipe: str, ref: str | None, head_ref: str | None, repo_local: str | None, meta: dict
recipe: str, ref: str | None, head_ref: str | None, repo_local: str | None, meta
) -> int:
"""WC4 `--quick` opt-in fast lane (plan §2). Reattach the data-warm canonical (known-good volume)
→ upgrade IN PLACE to the PR head (chaos) → assert generic UPGRADE (reconverge+moved+serving) +
@ -574,22 +587,22 @@ def run_quick(
flush=True,
)
statefile = os.path.join(tempfile.gettempdir(), f"ccci-opstate-{domain}.json")
statefile = _run_state_path("opstate") + ".json"
with open(statefile, "w") as f:
json.dump({}, f)
os.environ["CCCI_OP_STATE_FILE"] = statefile
depsfile = os.path.join(tempfile.gettempdir(), f"ccci-deps-{domain}.json")
depsfile = _run_state_path("deps") + ".json"
with open(depsfile, "w") as f:
json.dump({}, f)
os.environ["CCCI_DEPS_FILE"] = depsfile
skipfile = os.path.join(tempfile.gettempdir(), f"ccci-depskip-{domain}.txt")
skipfile = _run_state_path("depskip") + ".txt"
with contextlib.suppress(OSError):
os.remove(skipfile)
os.environ["CCCI_DEPS_SKIP_REPORT"] = skipfile
op_state: dict = {}
results: dict[str, str] = {}
declared = deps_mod.declared_deps(recipe)
declared = list(meta.DEPS)
deps_state: dict = {}
deps_ready = True
deps_not_ready_reason = ""
@ -601,28 +614,32 @@ def run_quick(
try:
# 1) reattach the canonical (warm boot at the known-good version + retained volume)
try:
canonical.deploy_canonical(recipe, timeout=int(meta.get("DEPLOY_TIMEOUT", 900)))
canonical.deploy_canonical(recipe, timeout=int(meta.DEPLOY_TIMEOUT))
lifecycle.wait_healthy(
domain,
ok_codes=tuple(meta["HEALTH_OK"]),
path=meta["HEALTH_PATH"],
deploy_timeout=meta["DEPLOY_TIMEOUT"],
http_timeout=meta["HTTP_TIMEOUT"],
ok_codes=tuple(meta.HEALTH_OK),
path=meta.HEALTH_PATH,
deploy_timeout=meta.DEPLOY_TIMEOUT,
http_timeout=meta.HTTP_TIMEOUT,
)
warm_ok = True
except Exception as e: # noqa: BLE001
print(f"!! canonical reattach/readiness failed: {_scrub(str(e))}", flush=True)
if warm_ok:
# 2) deps (warm keycloak + per-run realm) — mirrors main()'s warm/cold split
# 2) deps (warm keycloak + per-run realm) — mirrors main()'s warm/cold split. NB
# (rcust P2b): deps are provisioned (realm/creds in $CCCI_DEPS_FILE) but quick mode
# cannot do install-time OIDC env wiring — the canonical app pre-exists its per-run
# realm. No quick-enrolled recipe declares DEPS today; if one ever does, its
# requires_deps tests will exercise creds-only flows or skip (F2-11 keeps the signal).
if declared:
print(f"\n===== setup_custom_tests (quick): deps {declared} =====", flush=True)
print(f"\n===== deps (quick): {declared} =====", flush=True)
try:
warm_deps, cold_deps = [], []
for d in declared:
wd = warm.warm_domain(d)
(warm_deps if (wd and warm.is_warm_up(d, wd)) else cold_deps).append(d)
dep_metas = {d: _load_meta(d) for d in cold_deps}
dep_metas = {d: meta_mod.load(d) for d in cold_deps}
deps_list = (
deps_mod.deploy_deps(
recipe, os.environ.get("PR", "0"), ref, cold_deps, meta_for=dep_metas
@ -637,12 +654,11 @@ def run_quick(
print(f" dep: using live-warm {d} @ {wd} (per-run realm)", flush=True)
deps_state = _enrich_deps_with_sso(recipe, domain, deps_list)
deps_mod.write_run_state(deps_state)
_run_setup_custom_tests_hook(recipe, domain, depsfile)
except Exception as e: # noqa: BLE001
deps_ready = False
deps_not_ready_reason = _scrub(str(e))[:300]
print(
f"!! setup_custom_tests failed (deps-not-ready): {deps_not_ready_reason}",
f"!! dep provisioning failed (deps-not-ready): {deps_not_ready_reason}",
flush=True,
)
@ -658,6 +674,8 @@ def run_quick(
results["upgrade"] = "fail"
results["custom"] = "skip"
finally:
# Teardown funnel running: further SIGTERM/SIGALRM are logged + ignored (lifetime.py).
lifetime.begin_teardown()
# F2-11 skip count (read before deciding pass/fail)
requires_deps_skipped = 0
try:
@ -755,7 +773,7 @@ def run_quick(
overall = 1
if sso_unverified:
print(
f"!! DEPS={declared} but setup_custom_tests failed and {requires_deps_skipped} "
f"!! DEPS={declared} but dep provisioning failed and {requires_deps_skipped} "
"requires_deps SKIPPED — SSO NOT verified (F2-11)",
file=sys.stderr,
)
@ -790,7 +808,7 @@ def promote_canonical(recipe: str, head_ref: str | None) -> None:
if not latest:
print(f"WC5 promote: no version tags for {recipe} — skip", flush=True)
return
meta = _load_meta(recipe)
meta = meta_mod.load(recipe)
# The cold run's deploy-count was already asserted + the countfile removed; don't perturb it.
os.environ.pop("CCCI_DEPLOY_COUNT_FILE", None)
print(
@ -802,14 +820,15 @@ def promote_canonical(recipe: str, head_ref: str | None) -> None:
domain,
version=latest,
secrets=True,
deploy_timeout=int(meta.get("DEPLOY_TIMEOUT", 900)),
deploy_timeout=int(meta.DEPLOY_TIMEOUT),
meta=meta,
)
lifecycle.wait_healthy(
domain,
ok_codes=tuple(meta["HEALTH_OK"]),
path=meta["HEALTH_PATH"],
deploy_timeout=meta["DEPLOY_TIMEOUT"],
http_timeout=meta["HTTP_TIMEOUT"],
ok_codes=tuple(meta.HEALTH_OK),
path=meta.HEALTH_PATH,
deploy_timeout=meta.DEPLOY_TIMEOUT,
http_timeout=meta.HTTP_TIMEOUT,
)
abra.undeploy(domain)
_wait_undeployed(domain)
@ -821,6 +840,9 @@ def promote_canonical(recipe: str, head_ref: str | None) -> None:
def main() -> int:
# P1 lock-lifetime hardening: PDEATHSIG + SIGTERM/SIGALRM teardown funnel + 60-min hard
# deadline, armed before ANY abra call or lock acquisition (see harness/lifetime.py).
lifetime.install_lifetime_guards()
recipe = os.environ.get("RECIPE")
if not recipe:
print("RECIPE env is required", file=sys.stderr)
@ -835,19 +857,28 @@ def main() -> int:
print(
f"== cc-ci run: recipe={recipe} ref={ref} pr={os.environ.get('PR', '0')} stages={sorted(stages)}"
)
# Concurrent-run safety: runs of the SAME recipe serialise on a per-recipe flock — they share
# ONE ~/.abra/recipes/<recipe> working tree which fetch_recipe (below) rm-rf's/reclones and the
# upgrade tier git-checkouts mid-run. Must be taken BEFORE fetch_recipe. Different recipes run
# in parallel (capacity=2). The reference must stay alive for the whole run: the kernel drops
# the flock when the fd closes (including on any crash/SIGKILL — no stale-lock failure mode).
_recipe_lock = lifecycle.acquire_recipe_lock(recipe) # noqa: F841
# P2c: the CCCI_SKIP_GENERIC* env escape hatch is LOCAL-DEV-ONLY. If it rides a CI (drone)
# run, shout — generic-floor coverage is reduced and the verdict must not look routine.
for ov in skip_generic_env_overrides():
if os.environ.get("DRONE"):
print(
f"!! {ov}=1 — dev-only generic-floor override ACTIVE IN A CI RUN; generic "
"assertions are suppressed for the affected op(s). This must never gate a merge.",
flush=True,
)
else:
print(f"== {ov}=1 (dev-only generic-floor override active)", flush=True)
# Concurrent-run safety is structural: this run's recipe trees live in its own ABRA_DIR
# (exported here, before ANY abra call), so no recipe-tree lock exists; same-DOMAIN runs
# serialise on the app-domain flock taken in deploy_app (see docs/concurrency.md).
setup_run_abra_dir()
fetch_recipe(recipe, ref, src)
# The PR-head commit the upgrade tier re-checks out for the chaos redeploy to the code under test
# (HC1). Prefer the explicit PR head sha ($REF) — robust + exact; fall back to the recipe checkout
# HEAD (the catalogue current) for a non-PR `!testme`. Captured before any version-tag checkout.
head_ref = ref or lifecycle.recipe_head_commit(recipe)
repo_local = snapshot_recipe_tests(recipe)
meta = _load_meta(recipe)
meta = meta_mod.load(recipe)
# WC4/WC7: opt-in `--quick` fast lane. Requires an existing data-warm canonical; if none, fall
# back cleanly to the full COLD run below so the PR is still tested (DECISIONS Phase-2w).
@ -870,16 +901,14 @@ def main() -> int:
# override must be an exact published version tag (deployed as a pinned base). (Adversary §7.1.)
want_upgrade = "upgrade" in stages
prev = (
(meta.get("UPGRADE_BASE_VERSION") or lifecycle.previous_version(recipe))
if want_upgrade
else None
(meta.UPGRADE_BASE_VERSION or lifecycle.previous_version(recipe)) if want_upgrade else None
)
base = prev or target
backup_cap = generic.backup_capable(recipe, meta)
hook = discovery.install_steps(recipe, repo_local)
# Deploy-count guard (DG4.1): exactly one deploy_app() per run.
countfile = os.path.join(tempfile.gettempdir(), f"ccci-deploys-{domain}")
countfile = _run_state_path("deploys")
with open(countfile, "w") as f:
f.write("0")
os.environ["CCCI_DEPLOY_COUNT_FILE"] = countfile
@ -895,35 +924,27 @@ def main() -> int:
# Run-scoped op state (HC3): the orchestrator records op results (pre-upgrade identity, backup
# snapshot_id) here for the assertion tiers (generic + overlay) to read via generic.op_state().
statefile = os.path.join(tempfile.gettempdir(), f"ccci-opstate-{domain}.json")
statefile = _run_state_path("opstate") + ".json"
with open(statefile, "w") as f:
json.dump({}, f)
os.environ["CCCI_OP_STATE_FILE"] = statefile
op_state: dict = {}
# Run-scoped dep state (Phase 2 Q2.3, refined per operator-2026-05-28 SSO-dep plan §1):
# deps now deploy AFTER generic tiers (between RESTORE and CUSTOM) so a failed dep deploy
# cannot break the generic-tier signal. The `setup_custom_tests` step deploys each dep + runs
# `tests/<recipe>/setup_custom_tests.sh` to wire OIDC env via in-place redeploy.
# Run-scoped dep state (Phase 2 Q2.3; install-time-only since rcust P2b): deps are provisioned
# BEFORE the single deploy so install_steps.sh wires OIDC env into that one deploy.
# `$CCCI_DEPS_FILE` is written with the full creds dict the hook script needs (jq-readable).
depsfile = os.path.join(tempfile.gettempdir(), f"ccci-deps-{domain}.json")
depsfile = _run_state_path("deps") + ".json"
with open(depsfile, "w") as f:
json.dump({}, f)
os.environ["CCCI_DEPS_FILE"] = depsfile
# F2-11: conftest appends the count of requires_deps tests it skips (deps-not-ready) here.
skipfile = os.path.join(tempfile.gettempdir(), f"ccci-depskip-{domain}.txt")
skipfile = _run_state_path("depskip") + ".txt"
with contextlib.suppress(OSError):
os.remove(skipfile)
os.environ["CCCI_DEPS_SKIP_REPORT"] = skipfile
declared = deps_mod.declared_deps(recipe)
# Q3.2a: a recipe that tolerates OIDC env at first boot AND whose deps are live-warm wires OIDC
# at INSTALL time (provision the realm BEFORE the single deploy; install_steps.sh writes the env
# into it) instead of the post-deploy in-place `--chaos` redeploy — which is flaky on the heavy
# 12-service lasuite-drive stack (collabora WOPI race; see JOURNAL Step 0). Opt-in per recipe.
oidc_at_install = bool(meta.get("OIDC_AT_INSTALL")) and bool(declared)
declared = list(meta.DEPS)
if declared:
when = "BEFORE deploy (install-time OIDC)" if oidc_at_install else "AFTER generic tiers"
print(f"\n===== DEPS declared (provision {when}): {declared} =====", flush=True)
print(f"\n===== DEPS declared (provision BEFORE deploy): {declared} =====", flush=True)
deps_state: dict[str, dict] = {} # new shape: recipe→entry dict (sso-dep plan §1)
deps_ready = True
deps_not_ready_reason: str = ""
@ -937,7 +958,7 @@ def main() -> int:
# install_steps.sh can read $CCCI_DEPS_FILE and wire the OIDC env into that one deploy. On
# failure we mark deps-not-ready but STILL deploy the recipe alone (install_steps.sh no-ops
# on an empty deps file) so the generic tiers run; the OIDC custom test then skips → F2-11. ----
if oidc_at_install:
if declared:
print(
f"\n===== install-time OIDC: provisioning deps {declared} BEFORE deploy =====",
flush=True,
@ -964,18 +985,21 @@ def main() -> int:
version=base,
secrets=True,
install_steps_hook=hook,
deploy_timeout=int(meta.get("DEPLOY_TIMEOUT", 900)),
deploy_timeout=int(meta.DEPLOY_TIMEOUT),
meta=meta,
)
lifecycle.wait_healthy(
domain,
ok_codes=tuple(meta["HEALTH_OK"]),
path=meta["HEALTH_PATH"],
deploy_timeout=meta["DEPLOY_TIMEOUT"],
http_timeout=meta["HTTP_TIMEOUT"],
ok_codes=tuple(meta.HEALTH_OK),
path=meta.HEALTH_PATH,
deploy_timeout=meta.DEPLOY_TIMEOUT,
http_timeout=meta.HTTP_TIMEOUT,
)
# Recipe READY_PROBE (e.g. lasuite-drive collabora WOPI discovery) — readiness beyond
# replica convergence + app HEALTH_PATH; no-op for recipes without one.
lifecycle.wait_ready_probes(meta, domain, timeout=int(meta.get("DEPLOY_TIMEOUT", 900)))
lifecycle.wait_ready_probes(
meta, domain, timeout=int(meta.DEPLOY_TIMEOUT), op="install"
)
deploy_ok = True
except Exception as e: # noqa: BLE001 — a failed deploy is a reported INSTALL failure
print(f"!! deploy/readiness failed: {e}", flush=True)
@ -1072,41 +1096,11 @@ def main() -> int:
if backup_cap
else "skip"
)
# ---- setup_custom_tests step (NEW, operator-2026-05-28 SSO-dep plan §3.2) ----
# Deploy each declared dep + wire OIDC env into the parent app via the per-recipe
# setup_custom_tests.sh hook + in-place redeploy. Failure here marks deps-not-ready
# but does NOT abort the run — @pytest.mark.requires_deps tests skip with reason;
# non-deps custom tests still run normally.
if declared and not oidc_at_install:
# LEGACY post-deploy path: provision deps AFTER generic tiers, then wire OIDC env
# into the parent via the setup_custom_tests.sh hook + an in-place `--chaos` redeploy.
print("\n===== setup_custom_tests: deps + OIDC wiring =====", flush=True)
try:
deps_state = _provision_deps(recipe, domain, ref, declared)
# Run the per-recipe post-deps hook (jq-driven OIDC wiring + in-place redeploy)
_run_setup_custom_tests_hook(recipe, domain, depsfile)
except Exception as e: # noqa: BLE001 — setup failure is ISOLATED to dep-marked tests
deps_ready = False
deps_not_ready_reason = _scrub(str(e))[:300]
print(
f"!! setup_custom_tests failed (deps-not-ready): {deps_not_ready_reason}",
flush=True,
)
elif declared and oidc_at_install and deps_ready:
# INSTALL-TIME path (Q3.2a): deps were provisioned BEFORE the single deploy and the
# install-tier install_steps.sh hook already wired OIDC env into that one deploy —
# so NO re-provision, NO reconverge here. Run only the post-deploy setup hook
# (e.g. lasuite-drive's minio-createbuckets one-shot), which needs the live stack.
print("\n===== post-deploy setup (OIDC already wired at install) =====", flush=True)
try:
_run_setup_custom_tests_hook(recipe, domain, depsfile)
except Exception as e: # noqa: BLE001 — isolated to dep-marked / state-dependent tests
deps_ready = False
deps_not_ready_reason = _scrub(str(e))[:300]
print(
f"!! post-deploy setup failed: {deps_not_ready_reason}",
flush=True,
)
# (rcust P2b: install-time deps wiring is the ONLY mode — deps were provisioned BEFORE
# the single deploy and install_steps.sh wired the OIDC env into it. The legacy
# post-deploy provisioning + setup_custom_tests.sh redeploy machinery is deleted; a
# recipe's post-deploy seeding belongs in ops.py pre_install, e.g. lasuite-drive's
# MinIO bucket one-shot.)
# ---- CUSTOM tier ----
if "custom" in stages:
@ -1123,6 +1117,9 @@ def main() -> int:
if op in stages:
results[op] = "skip"
finally:
# From here the teardown funnel runs: a SIGTERM/SIGALRM landing now is logged + ignored
# (lifetime.py) so a second signal can't abort the cleanup the first one asked for.
lifetime.begin_teardown()
# Teardown the recipe under test FIRST, then deps in reverse declaration order.
# Parent verify=False (Phase 1d): keep as-is so a parent residual doesn't mask a tier
# failure. Dep teardown uses verify=True via teardown_deps (F2-5 fix); failures are
@ -1178,8 +1175,7 @@ def main() -> int:
# ---- per-op summary (DG6 feed) ----
# SSO-dep plan §1: DG4.1 generalised — one `abra app new` per app in the run (recipe + each
# COLD dep). In-place reconfigure-and-redeploy (the setup_custom_tests step's
# `abra app deploy --force --chaos`) is NOT a fresh `app_new` and does NOT increment the count.
# COLD dep). Chaos redeploys are NOT a fresh `app_new` and do NOT increment the count.
# WC1: a live-warm dep (keycloak) is NOT deployed by the run — it only gets a per-run realm — so
# warm deps contribute 0. So expected = 1 + (number of COLD deps that actually got deployed).
_dep_entries = deps_state.values() if isinstance(deps_state, dict) else (deps_state or [])
@ -1220,12 +1216,12 @@ def main() -> int:
overall = 1
if any(v == "fail" for v in results.values()):
overall = 1
# F2-11: a deps-declaring recipe whose setup_custom_tests failed has NOT verified its SSO/OIDC
# F2-11: a deps-declaring recipe whose dep provisioning failed has NOT verified its SSO/OIDC
# claim — its requires_deps tests SKIPPED (a skip-only file exits 0, so without this the run
# would report GREEN). Fail the run for that recipe; generic-tier results above are untouched.
if sso_dep_unverified(declared, deps_ready, requires_deps_skipped):
print(
f"!! recipe declares DEPS={declared} but setup_custom_tests failed and "
f"!! recipe declares DEPS={declared} but dep provisioning failed and "
f"{requires_deps_skipped} requires_deps (SSO) test(s) were SKIPPED — SSO claim NOT "
f"verified; failing run (F2-11). deps-not-ready: {deps_not_ready_reason}",
file=sys.stderr,
@ -1252,7 +1248,7 @@ def main() -> int:
no_secret_leak=True, # narrowed below by an actual scan of the serialised artifact
screenshot=screenshot_rel, # Phase 3 U1 (R4): relative PNG name iff capture succeeded
finished_ts=time.time(),
expected_na=meta.get("EXPECTED_NA"), # declared intentional-skip map (recipe_meta)
expected_na=meta.EXPECTED_NA, # declared intentional-skip map (recipe_meta)
)
# Real (if narrow) leak check: no known infra-secret value may appear in the artifact (R7).
blob = json.dumps(data)

View File

@ -199,7 +199,13 @@ def _run(cmd, timeout=120, check=False):
def _recipe_dir(recipe: str) -> str:
return os.path.expanduser(f"~/.abra/recipes/{recipe}")
# Resolve like the abra CLI does: $ABRA_DIR (the per-run tree when imported by a CI run,
# e.g. promote_canonical) else the canonical ~/.abra (this module's own systemd-timer runs,
# which set no ABRA_DIR). Keeps fetch_recipe (an `abra` subprocess) and the git readers
# below pointed at the SAME tree in both contexts.
return os.path.join(
os.environ.get("ABRA_DIR") or os.path.expanduser("~/.abra"), "recipes", recipe
)
def recipe_tags(recipe: str) -> list[str]:

71
scripts/gen-meta-docs.py Normal file
View File

@ -0,0 +1,71 @@
#!/usr/bin/env python3
"""Render the harness.meta KEYS registry to the markdown key-reference table in
docs/recipe-customization.md §4 (rcust P1.5; kills the R5 doc-drift class).
Usage:
python3 scripts/gen-meta-docs.py # rewrite the table in-place between the markers
python3 scripts/gen-meta-docs.py --print # print the rendered table to stdout (used by the
# doc-sync unit test, tests/unit/test_meta.py)
The table lives between `<!-- META-TABLE-START -->` / `<!-- META-TABLE-END -->` markers; a unit
test asserts the committed table equals this rendering, so editing it by hand fails CI.
"""
from __future__ import annotations
import os
import sys
ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.insert(0, os.path.join(ROOT, "runner"))
from harness.meta import KEYS # noqa: E402
DOC = os.path.join(ROOT, "docs", "recipe-customization.md")
START = "<!-- META-TABLE-START -->"
END = "<!-- META-TABLE-END -->"
def _default_repr(v) -> str:
if v is None:
return "`None`"
return f"`{v!r}`"
def render() -> str:
lines = [
START,
"",
"_This table is GENERATED from the `runner/harness/meta.py` KEYS registry by"
" `scripts/gen-meta-docs.py` — do not edit by hand (a unit test pins the sync)._",
"",
"| Key | Type | Default | Meaning |",
"|---|---|---|---|",
]
for k in KEYS:
doc = k.doc.replace("|", "\\|")
name = f"`{k.name}`" + (" **(deprecated)**" if k.deprecated else "")
lines.append(f"| {name} | `{k.type}` | {_default_repr(k.default)} | {doc} |")
lines += ["", END]
return "\n".join(lines)
def main() -> int:
table = render()
if "--print" in sys.argv:
print(table)
return 0
with open(DOC) as f:
text = f.read()
if START not in text or END not in text:
print(f"{DOC}: missing {START}/{END} markers", file=sys.stderr)
return 1
head, _, rest = text.partition(START)
_, _, tail = rest.partition(END)
with open(DOC, "w") as f:
f.write(head + table + tail)
print(f"{DOC}: key table rewritten from the registry ({len(KEYS)} keys)")
return 0
if __name__ == "__main__":
raise SystemExit(main())

View File

@ -9,14 +9,14 @@ sys.path.insert(0, os.path.dirname(__file__))
import _p4 # noqa: E402
def pre_upgrade(domain, meta):
_p4.create_account(domain)
def pre_upgrade(ctx):
_p4.create_account(ctx.domain)
def pre_backup(domain, meta):
_p4.create_account(domain)
def pre_backup(ctx):
_p4.create_account(ctx.domain)
def pre_restore(domain, meta):
_p4.delete_account(domain)
assert not _p4.account_exists(domain), "marker account delete did not take (pre_restore)"
def pre_restore(ctx):
_p4.delete_account(ctx.domain)
assert not _p4.account_exists(ctx.domain), "marker account delete did not take (pre_restore)"

View File

@ -0,0 +1,108 @@
"""Shared utilities for the real-kernel concurrency suite (imported by the test modules; the
fixtures in conftest.py wrap these). No flock mocking anywhere — probes use real LOCK_NB."""
from __future__ import annotations
import contextlib
import fcntl
import os
import signal
import subprocess
import sys
import time
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import lifecycle # noqa: E402
HELPERS = os.path.join(os.path.dirname(__file__), "helpers.py")
DOMAIN = "test-abc123.ci.commoninternet.net" # matches RUN_APP_RE
class HelperPool:
"""Spawns helpers.py subprocesses and GUARANTEES their cleanup (incl. recorded grandchild
pids from `hold-with-child`/`wrapper` markers) — no leaked children in the test VM."""
def __init__(self, out_dir: str):
self.out_dir = out_dir
self.procs: list[subprocess.Popen] = []
self.extra_pids: list[int] = []
self._n = 0
def spawn(self, *args: str, env_extra: dict | None = None) -> tuple[subprocess.Popen, str]:
"""Start `helpers.py <args...>`; returns (proc, marker_file)."""
self._n += 1
out = os.path.join(self.out_dir, f"helper-{self._n}.out")
env = dict(os.environ, CCCI_HELPER_OUT=out, **(env_extra or {}))
p = subprocess.Popen( # noqa: S603
[sys.executable, HELPERS, *args],
env=env,
stdout=subprocess.DEVNULL,
stderr=subprocess.STDOUT,
)
self.procs.append(p)
return p, out
def track_pid(self, pid: int) -> None:
self.extra_pids.append(pid)
def cleanup(self) -> None:
for p in self.procs:
if p.poll() is None:
p.kill()
with contextlib.suppress(subprocess.TimeoutExpired):
p.wait(timeout=10)
for pid in self.extra_pids:
with contextlib.suppress(OSError):
os.kill(pid, signal.SIGKILL)
def wait_marker(out: str, token: str, timeout: float = 15.0) -> str | None:
"""Poll a helper's marker file for a line containing `token`; returns the line or None."""
deadline = time.time() + timeout
while time.time() < deadline:
try:
with open(out) as f:
for line in f:
if token in line:
return line.strip()
except OSError:
pass
time.sleep(0.1)
return None
def lock_state(domain: str) -> str:
"""'held' | 'free' | 'absent' for the domain's lockfile, probed with a REAL LOCK_NB."""
path = lifecycle._app_lock_path(domain) # noqa: SLF001
if not os.path.exists(path):
return "absent"
with open(path, "a") as f:
try:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
return "free"
except BlockingIOError:
return "held"
def wait_lock_state(domain: str, want: str, timeout: float = 10.0) -> str:
"""Poll until lock_state(domain) == want (kernel release on process death is fast, but give
the scheduler room). Returns the final observed state."""
deadline = time.time() + timeout
state = lock_state(domain)
while state != want and time.time() < deadline:
time.sleep(0.1)
state = lock_state(domain)
return state
def pid_alive(pid: int) -> bool:
return os.path.exists(f"/proc/{pid}")
def wait_pid_gone(pid: int, timeout: float = 15.0) -> bool:
deadline = time.time() + timeout
while time.time() < deadline:
if not pid_alive(pid):
return True
time.sleep(0.1)
return False

View File

@ -0,0 +1,34 @@
"""Fixtures for the real-kernel concurrency suite (concurrency-restructure plan, 19 cases).
NOT part of the default `pytest tests/unit` gate — run explicitly with `pytest tests/concurrency
-q` (docs/concurrency.md). Locks live in a per-test tmp dir (CCCI_APP_LOCK_DIR); helper
subprocesses hold REAL flocks / install the REAL prctl+signal guards and are always reaped in
fixture finalizers (no leaked children in the test VM).
"""
from __future__ import annotations
import os
import sys
import pytest
sys.path.insert(0, os.path.dirname(__file__))
from concutil import HelperPool # noqa: E402
@pytest.fixture
def lock_dir(tmp_path, monkeypatch):
"""Sandbox lock dir, exported so BOTH this process's lifecycle calls and helper subprocesses
(which inherit os.environ) resolve their lockfiles here — never /run/lock."""
d = tmp_path / "locks"
d.mkdir()
monkeypatch.setenv("CCCI_APP_LOCK_DIR", str(d))
return str(d)
@pytest.fixture
def pool(tmp_path):
hp = HelperPool(str(tmp_path))
yield hp
hp.cleanup()

View File

@ -0,0 +1,149 @@
#!/usr/bin/env python3
"""Subprocess helpers for tests/concurrency — REAL kernel locks and the REAL lifetime guards in
separate processes (flock/prctl are never mocked; tests assert on actual kernel behavior).
Invoked as: python3 helpers.py <command> <args...>
Env contract (set by the spawning test):
CCCI_APP_LOCK_DIR sandbox lock dir (never /run/lock in tests)
CCCI_HELPER_OUT marker file this helper APPENDS progress lines to (ACQUIRED/READY/...)
Commands:
hold <domain> acquire the app lock, mark `ACQUIRED <ts>`, sleep forever
hold-with-child <domain> acquire the lock, spawn a plain sleeping subprocess child, mark
`ACQUIRED <ts>` + `CHILD <pid>` (PEP 446: the child must NOT
inherit the lock fd), sleep forever
guarded <domain> <deadline> install the REAL lifetime guards (alarm=<deadline>s), acquire the
lock, mark `READY`; when the teardown funnel runs (`finally:`),
mark `TEARDOWN` before exiting
wrapper <domain> spawn `guarded <domain> 3600` as MY child, mark `WRAPPED <pid>`,
sleep — the test kills me to prove PDEATHSIG TERMs the child
orphan-probe wait (bounded) until reparented (ppid==1), then install the
guards; mark `REFUSED` if they exit (expected) or `GUARDS_OK`
fetch-checkout <recipe> <ref> run run_recipe_ci.fetch_recipe (the test sets CCCI_SKIP_FETCH=1
+ a per-"run" ABRA_DIR), git-checkout <ref>, mark
`RESULT <head> <data.txt content>`
"""
from __future__ import annotations
import os
import subprocess
import sys
import time
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "..", "runner"))
from harness import abra, lifecycle, lifetime # noqa: E402
OUT = os.environ.get("CCCI_HELPER_OUT")
def mark(line: str) -> None:
if OUT:
with open(OUT, "a") as f:
f.write(line + "\n")
f.flush()
print(line, flush=True)
def cmd_hold(domain: str) -> None:
lifecycle.acquire_app_lock(domain)
mark(f"ACQUIRED {time.time()}")
time.sleep(3600)
def cmd_hold_with_child(domain: str) -> None:
lifecycle.acquire_app_lock(domain)
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(3600)"])
mark(f"ACQUIRED {time.time()}")
mark(f"CHILD {child.pid}")
time.sleep(3600)
def cmd_guarded(domain: str, deadline: str) -> None:
lifetime.install_lifetime_guards(deadline_seconds=int(deadline))
lifecycle.acquire_app_lock(domain)
mark("READY")
try:
time.sleep(3600)
finally:
mark("TEARDOWN")
def cmd_wrapper(domain: str) -> None:
p = subprocess.Popen( # noqa: S603
[sys.executable, os.path.abspath(__file__), "guarded", domain, "3600"],
env=os.environ.copy(),
)
mark(f"WRAPPED {p.pid}")
time.sleep(3600)
def cmd_orphan_probe() -> None:
# Our spawner exits immediately after fork; wait (bounded) until we are reparented so the
# prctl is installed with the parent ALREADY dead — the exact race the ppid check closes.
for _ in range(200):
if os.getppid() == 1:
break
time.sleep(0.05)
else:
mark("NEVER_REPARENTED") # e.g. a subreaper environment — test will fail visibly
return
try:
lifetime.install_lifetime_guards()
except SystemExit:
mark("REFUSED")
raise
mark("GUARDS_OK")
def cmd_fetch_checkout(recipe: str, ref: str) -> None:
import run_recipe_ci
run_recipe_ci.fetch_recipe(recipe, None, None)
abra.recipe_checkout(recipe, ref)
head = abra.recipe_head_commit(recipe)
with open(os.path.join(abra.recipe_dir(recipe), "data.txt")) as f:
content = f.read().strip()
mark(f"RESULT {head} {content}")
def cmd_deploy_count_run(domain: str, gate: str) -> None:
"""Mirror the REAL run flow for the DG4.1 counter (CONC-A1 regression): countfile init
(main() preamble) → _record_deploy (deploy_app fires it BEFORE the app lock) → acquire
the app lock → wait for `gate` (file path; '' = no wait) → read + remove own countfile.
Two of these on the SAME domain must each see COUNT 1 and never lose their file."""
import run_recipe_ci
countfile = run_recipe_ci._run_state_path("deploys")
with open(countfile, "w") as f:
f.write("0")
os.environ["CCCI_DEPLOY_COUNT_FILE"] = countfile
lifecycle._record_deploy() # pre-lock, exactly like lifecycle.deploy_app()
mark("PRELOCK")
lifecycle.acquire_app_lock(domain)
mark("ACQUIRED")
if gate:
deadline = time.time() + 15
while not os.path.exists(gate) and time.time() < deadline:
time.sleep(0.05)
try:
with open(countfile) as f:
n = int(f.read().strip() or "0")
os.remove(countfile)
mark(f"COUNT {n}")
except FileNotFoundError:
mark("COUNT_FILE_MISSING")
if __name__ == "__main__":
cmd, *args = sys.argv[1:]
{
"hold": cmd_hold,
"hold-with-child": cmd_hold_with_child,
"guarded": cmd_guarded,
"wrapper": cmd_wrapper,
"orphan-probe": cmd_orphan_probe,
"fetch-checkout": cmd_fetch_checkout,
"deploy-count-run": cmd_deploy_count_run,
}[cmd](*args)

View File

@ -0,0 +1,175 @@
"""Per-run ABRA_DIR isolation (concurrency-restructure plan, cases 17-19). Real directories,
real symlinks, real git — abra itself is replaced by a recording stub where a CLI call is
involved (case 17), because these cases test OUR dir/env plumbing, not abra."""
from __future__ import annotations
import os
import stat
import subprocess
import sys
sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
import run_recipe_ci # noqa: E402
from concutil import wait_marker # noqa: E402
from harness import abra # noqa: E402
RECIPE = "fakerecipe"
def _git(cwd, *args):
subprocess.run(
["git", "-c", "user.email=t@t", "-c", "user.name=t", *args],
cwd=cwd,
check=True,
capture_output=True,
)
def _make_fake_home(tmp_path):
"""A fake $HOME with a canonical ~/.abra: servers/default + catalogue dirs, and a recipe git
repo with two tags whose data.txt differs (v1 -> 'one', v2 -> 'two', HEAD at v2)."""
home = tmp_path / "home"
(home / ".abra" / "servers" / "default").mkdir(parents=True)
(home / ".abra" / "catalogue").mkdir(parents=True)
repo = home / ".abra" / "recipes" / RECIPE
repo.mkdir(parents=True)
_git(repo, "init", "-q")
(repo / "data.txt").write_text("one\n")
_git(repo, "add", "data.txt")
_git(repo, "commit", "-qm", "v1")
_git(repo, "tag", "v1")
(repo / "data.txt").write_text("two\n")
_git(repo, "add", "data.txt")
_git(repo, "commit", "-qm", "v2")
_git(repo, "tag", "v2")
return home
def test_17_per_run_dir_built_and_exported_before_abra(tmp_path, monkeypatch):
"""Case 17: setup_run_abra_dir builds the per-run dir correctly (servers/catalogue symlinks
resolve to the canonical tree, recipes/ empty + writable) and $ABRA_DIR is exported before
the first abra call — proven by a stub `abra` on PATH that records the env it saw."""
home = _make_fake_home(tmp_path)
monkeypatch.setenv("HOME", str(home))
monkeypatch.setenv("CCCI_RUNS_DIR", str(tmp_path / "runs"))
monkeypatch.setenv("DRONE_BUILD_NUMBER", "777")
monkeypatch.setenv("ABRA_DIR", "sentinel-to-be-overwritten") # so monkeypatch restores it
d = run_recipe_ci.setup_run_abra_dir()
assert d == str(tmp_path / "runs" / "777" / "abra")
assert os.environ["ABRA_DIR"] == d
assert os.readlink(os.path.join(d, "servers")) == str(home / ".abra" / "servers")
assert os.readlink(os.path.join(d, "catalogue")) == str(home / ".abra" / "catalogue")
# symlinks RESOLVE (targets exist) and recipes/ is empty + writable
assert os.path.isdir(os.path.join(d, "servers", "default"))
assert os.path.isdir(os.path.join(d, "catalogue"))
assert os.listdir(os.path.join(d, "recipes")) == []
probe = os.path.join(d, "recipes", ".write-probe")
open(probe, "w").close()
os.remove(probe)
# idempotent re-entry (Drone build-number retry): must not raise on existing symlinks
assert run_recipe_ci.setup_run_abra_dir() == d
# stub abra records $ABRA_DIR at call time; fetch_recipe's catalogue branch invokes it
stub_dir = tmp_path / "bin"
stub_dir.mkdir()
log = tmp_path / "abra-env.log"
stub = stub_dir / "abra"
stub.write_text(f'#!/bin/sh\necho "$ABRA_DIR" >> {log}\nexit 0\n')
stub.chmod(stub.stat().st_mode | stat.S_IEXEC)
monkeypatch.setenv("PATH", f"{stub_dir}{os.pathsep}{os.environ['PATH']}")
monkeypatch.delenv("CCCI_SKIP_FETCH", raising=False)
run_recipe_ci.fetch_recipe(RECIPE, None, None)
assert log.read_text().strip() == d, "abra was called without the per-run ABRA_DIR exported"
def test_18_concurrent_same_recipe_fetch_no_cross_talk(tmp_path, monkeypatch, pool):
"""Case 18: two CONCURRENT fetch+checkout flows of the SAME recipe into different ABRA_DIRs
produce two correct, divergent trees (v1 vs v2) — the old shared-tree corruption scenario,
now structurally safe with no lock. The canonical staged clone is untouched."""
home = _make_fake_home(tmp_path)
canonical_repo = home / ".abra" / "recipes" / RECIPE
head_before = subprocess.run(
["git", "-C", canonical_repo, "rev-parse", "HEAD"], capture_output=True, text=True
).stdout.strip()
runs = {}
for name, ref in (("runA", "v1"), ("runB", "v2")):
abra_dir = tmp_path / name / "abra"
abra_dir.mkdir(parents=True)
_, out = pool.spawn(
"fetch-checkout",
RECIPE,
ref,
env_extra={
"HOME": str(home),
"ABRA_DIR": str(abra_dir),
"CCCI_SKIP_FETCH": "1",
},
)
runs[name] = (out, ref, abra_dir)
expect = {"v1": "one", "v2": "two"}
for name, (out, ref, abra_dir) in runs.items():
line = wait_marker(out, "RESULT", timeout=30)
assert line, f"{name} never produced a RESULT"
_, head, content = line.split()
assert content == expect[ref], f"{name}@{ref}: tree content {content!r}"
tree = abra_dir / "recipes" / RECIPE
assert (tree / "data.txt").read_text().strip() == expect[ref]
assert (
head
== subprocess.run(
["git", "-C", tree, "rev-parse", "HEAD"], capture_output=True, text=True
).stdout.strip()
)
# the two trees genuinely diverge AND the canonical staged clone is untouched
a = (runs["runA"][2] / "recipes" / RECIPE / "data.txt").read_text()
b = (runs["runB"][2] / "recipes" / RECIPE / "data.txt").read_text()
assert a != b
head_after = subprocess.run(
["git", "-C", canonical_repo, "rev-parse", "HEAD"], capture_output=True, text=True
).stdout.strip()
assert head_after == head_before, "canonical clone must not be touched by per-run fetches"
def test_19_env_written_through_servers_symlink_lands_canonical(tmp_path, monkeypatch):
"""Case 19: an app .env written through the per-run servers/ symlink (what abra does under
$ABRA_DIR) lands in the CANONICAL shared path — so janitor discovery and every
expanduser('~/.abra/servers/...') reader keep working unchanged."""
home = _make_fake_home(tmp_path)
monkeypatch.setenv("HOME", str(home))
monkeypatch.setenv("CCCI_RUNS_DIR", str(tmp_path / "runs"))
monkeypatch.setenv("DRONE_BUILD_NUMBER", "778")
monkeypatch.setenv("ABRA_DIR", "sentinel-to-be-overwritten")
d = run_recipe_ci.setup_run_abra_dir()
domain = "test-abc123.ci.commoninternet.net"
via_symlink = os.path.join(d, "servers", "default", f"{domain}.env")
with open(via_symlink, "w") as f:
f.write("TYPE=fakerecipe:1.0.0\nDOMAIN=placeholder\n")
canonical = home / ".abra" / "servers" / "default" / f"{domain}.env"
assert canonical.is_file(), ".env written via the symlink must land in the canonical path"
# the canonical-path readers/writers (abra.env_get/env_set use ~/.abra) see the same file
assert abra.env_get(domain, "TYPE") == "fakerecipe:1.0.0"
abra.env_set(domain, "DOMAIN", domain)
with open(via_symlink) as f:
assert f"DOMAIN={domain}" in f.read()
def test_18b_run_id_manual_fallback_is_per_process(tmp_path, monkeypatch):
"""Companion to case 18: two concurrent MANUAL runs (no DRONE_BUILD_NUMBER) must not share an
abra dir either — the manual fallback is pid-suffixed."""
home = _make_fake_home(tmp_path)
monkeypatch.setenv("HOME", str(home))
monkeypatch.setenv("CCCI_RUNS_DIR", str(tmp_path / "runs"))
monkeypatch.delenv("DRONE_BUILD_NUMBER", raising=False)
monkeypatch.delenv("CCCI_APP_DOMAIN", raising=False)
monkeypatch.delenv("CCCI_RUN_ID", raising=False)
monkeypatch.setenv("ABRA_DIR", "sentinel-to-be-overwritten")
d = run_recipe_ci.setup_run_abra_dir()
assert f"manual-{os.getpid()}" in d

View File

@ -0,0 +1,189 @@
"""Janitor / flock-probe semantics (concurrency-restructure plan, cases 5-12).
The janitor runs IN-PROCESS with its discovery monkeypatched (candidates injected via a stubbed
abra.app_ls + empty docker sweep) and teardown_app stubbed to record calls — but the LOCKS are
real kernel flocks, held by real helper subprocesses where a live owner is needed."""
from __future__ import annotations
import os
import sys
import threading
import time
sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from concutil import DOMAIN, lock_state, wait_marker # noqa: E402
from harness import lifecycle # noqa: E402
def _inject_candidates(monkeypatch, domains):
"""Point janitor discovery at exactly `domains`: abra lists them, docker sweep is empty.
teardown_app is stubbed to a recorder; returns the calls list."""
calls = []
monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": d} for d in domains])
monkeypatch.setattr(lifecycle, "_docker_names", lambda kind, stack: [])
monkeypatch.setattr(lifecycle, "teardown_app", lambda d, verify=True: calls.append(d))
return calls
def test_5_orphan_reaped_lockfile_unlinked(lock_dir, pool, monkeypatch):
"""Case 5: an orphan (lockfile exists, no holder — its run was SIGKILL'd) is reaped exactly
once and its lockfile unlinked."""
p, out = pool.spawn("hold", DOMAIN)
assert wait_marker(out, "ACQUIRED")
p.kill()
p.wait(timeout=10)
calls = _inject_candidates(monkeypatch, [DOMAIN])
lifecycle.janitor()
assert calls == [DOMAIN], f"teardown calls: {calls} (expected exactly one)"
assert lock_state(DOMAIN) == "absent", "reaped orphan's lockfile must be unlinked"
def test_6_live_run_never_reaped(lock_dir, pool, monkeypatch, capsys):
"""Case 6: a held lock (live helper) is never reaped and is logged as live."""
p, out = pool.spawn("hold", DOMAIN)
assert wait_marker(out, "ACQUIRED")
calls = _inject_candidates(monkeypatch, [DOMAIN])
lifecycle.janitor()
assert calls == []
assert "live concurrent run" in capsys.readouterr().out
assert lock_state(DOMAIN) == "held"
def test_7_new_run_blocks_until_reap_finishes(lock_dir, pool, monkeypatch):
"""Case 7: the janitor reaps WHILE HOLDING the probe lock, so a new run of the same domain
blocks in acquire_app_lock until the reap completes — no window where a fresh app coexists
with a half-reaped one."""
# Make an orphan.
p, out = pool.spawn("hold", DOMAIN)
assert wait_marker(out, "ACQUIRED")
p.kill()
p.wait(timeout=10)
state = {"teardown_end": None, "acquirer_out": None}
def slow_teardown(domain, verify=True):
# While the janitor holds the probe lock mid-reap, a new run starts acquiring.
_, aout = pool.spawn("hold", DOMAIN)
state["acquirer_out"] = aout
time.sleep(2.0)
state["teardown_end"] = time.time()
monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": DOMAIN}])
monkeypatch.setattr(lifecycle, "_docker_names", lambda kind, stack: [])
monkeypatch.setattr(lifecycle, "teardown_app", slow_teardown)
lifecycle.janitor()
line = wait_marker(state["acquirer_out"], "ACQUIRED", timeout=15)
assert line, "new run never acquired after the reap"
acquired_ts = float(line.split()[1])
assert (
acquired_ts >= state["teardown_end"]
), f"new run acquired at {acquired_ts} BEFORE the reap finished at {state['teardown_end']}"
# The new run must hold a lock the next probe can SEE (fresh inode at the path).
assert lock_state(DOMAIN) == "held"
def test_8_two_janitors_exactly_one_reaps(lock_dir, pool, monkeypatch):
"""Case 8: two concurrent janitors arbitrate on the probe flock — exactly one reaps (the
other sees 'held' and leaves). Teardown is slowed so the runs genuinely overlap."""
p, out = pool.spawn("hold", DOMAIN)
assert wait_marker(out, "ACQUIRED")
p.kill()
p.wait(timeout=10)
calls = []
calls_lock = threading.Lock()
def slow_teardown(domain, verify=True):
with calls_lock:
calls.append(domain)
time.sleep(2.0)
monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": DOMAIN}])
monkeypatch.setattr(lifecycle, "_docker_names", lambda kind, stack: [])
monkeypatch.setattr(lifecycle, "teardown_app", slow_teardown)
barrier = threading.Barrier(2)
def run_janitor():
barrier.wait()
lifecycle.janitor()
t1, t2 = threading.Thread(target=run_janitor), threading.Thread(target=run_janitor)
t1.start(), t2.start()
t1.join(timeout=30), t2.join(timeout=30)
assert calls == [DOMAIN], f"expected exactly one reap, got {calls}"
assert lock_state(DOMAIN) == "absent"
def test_9_reboot_lockfile_absent_reaped_immediately(lock_dir, monkeypatch):
"""Case 9: post-reboot simulation — the app exists but its lockfile is gone (/run/lock is
tmpfs). The probe trivially acquires -> immediate reap, NO age threshold (improvement over
the old 2h fallback)."""
assert lock_state(DOMAIN) == "absent"
calls = _inject_candidates(monkeypatch, [DOMAIN])
t0 = time.time()
lifecycle.janitor()
assert calls == [DOMAIN]
assert time.time() - t0 < 5, "reap must be immediate (no age wait)"
def test_10_long_held_lock_flagged_never_stolen(lock_dir, pool, monkeypatch, capsys):
"""Case 10: a lock held with mtime older than 120min is flagged as a possible leaked run —
and NOT reaped (never steal a held lock)."""
p, out = pool.spawn("hold", DOMAIN)
assert wait_marker(out, "ACQUIRED")
path = lifecycle._app_lock_path(DOMAIN) # noqa: SLF001
backdate = time.time() - (130 * 60)
os.utime(path, (backdate, backdate))
calls = _inject_candidates(monkeypatch, [DOMAIN])
lifecycle.janitor()
assert calls == []
out_text = capsys.readouterr().out
assert "possible leaked run" in out_text and "lslocks" in out_text
assert lock_state(DOMAIN) == "held"
def test_11_warm_canonical_names_never_probed(lock_dir, monkeypatch):
"""Case 11: RUN_APP_RE allowlist — warm/canonical-shaped names never become candidates, so
they are never probed (no lockfile is even created for them) and never reaped."""
warmish = [
"warm-keycloak.ci.commoninternet.net",
"keycloak.ci.commoninternet.net",
"warm-hedgedoc.ci.commoninternet.net",
"drone.ci.commoninternet.net",
]
calls = []
monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": d} for d in warmish])
monkeypatch.setattr(
lifecycle,
"_docker_names",
lambda kind, stack: ["warm-keycloak_ci_commoninternet_net_app"]
if kind == "service"
else [],
)
monkeypatch.setattr(lifecycle, "teardown_app", lambda d, verify=True: calls.append(d))
lifecycle.janitor()
assert calls == []
lockdir = os.environ["CCCI_APP_LOCK_DIR"]
assert [
f for f in os.listdir(lockdir) if f.startswith("cc-ci-app-")
] == [], "janitor must not create lockfiles for non-run-app names"
def test_12_degrades_safely_on_bad_lockfile_and_missing_dir(lock_dir, monkeypatch, capsys):
"""Case 12: a garbled/unopenable lockfile (here: a DIRECTORY at the lockfile path) is skipped
with a log line; a missing lock dir doesn't crash the janitor either. Never a crash."""
path = lifecycle._app_lock_path(DOMAIN) # noqa: SLF001
os.makedirs(path) # open(path, "a") -> IsADirectoryError (an OSError)
calls = _inject_candidates(monkeypatch, [DOMAIN])
lifecycle.janitor() # must not raise
assert calls == []
assert "skipping" in capsys.readouterr().out
os.rmdir(path)
monkeypatch.setenv("CCCI_APP_LOCK_DIR", os.path.join(os.environ["CCCI_APP_LOCK_DIR"], "gone"))
lifecycle.janitor() # missing dir: probe open fails -> skip; tidy glob -> empty. No crash.
assert calls == []

View File

@ -0,0 +1,82 @@
"""Lifetime hardening (concurrency-restructure plan, cases 13-16): the REAL prctl/signal/alarm
guards installed by helper subprocesses; tests assert teardown ran, exit was non-zero, and the
lock was released."""
from __future__ import annotations
import os
import signal
import sys
sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from concutil import ( # noqa: E402
DOMAIN,
wait_lock_state,
wait_marker,
wait_pid_gone,
)
def test_13_pdeathsig_parent_kill_terms_harness(lock_dir, pool):
"""Case 13: wrapper-parent spawns a guarded harness-child; the parent is SIGKILL'd (the
harness gets no courtesy signal) -> the kernel's PDEATHSIG TERMs the child, its teardown
funnel runs, it exits, and the lock is released."""
p, out = pool.spawn("wrapper", DOMAIN)
line = wait_marker(out, "WRAPPED")
assert line, "wrapper never spawned its child"
child_pid = int(line.split()[1])
pool.track_pid(child_pid)
assert wait_marker(out, "READY"), "guarded child never got ready"
p.kill() # parent dies WITHOUT signalling the child — only PDEATHSIG can save us
p.wait(timeout=10)
assert wait_pid_gone(child_pid), "guarded child must exit on parent death (PDEATHSIG)"
assert wait_marker(out, "TEARDOWN", timeout=5), "teardown funnel did not run"
assert wait_lock_state(DOMAIN, "free") == "free"
def test_14_already_orphaned_helper_refuses_to_run(lock_dir, pool):
"""Case 14 (ppid race): a helper whose parent died BEFORE the prctl was armed (it starts
already reparented to pid 1) must refuse to run — PDEATHSIG would never fire for it."""
# Spawn an intermediate parent that forks orphan-probe and exits immediately.
import subprocess
out = os.path.join(pool.out_dir, "orphan.out")
intermediate = (
"import subprocess, sys, os; "
"subprocess.Popen([sys.executable, os.environ['CCCI_HELPERS'], 'orphan-probe']); "
)
env = dict(
os.environ,
CCCI_HELPER_OUT=out,
CCCI_HELPERS=os.path.join(os.path.dirname(__file__), "helpers.py"),
)
subprocess.run([sys.executable, "-c", intermediate], env=env, timeout=15, check=True)
line = wait_marker(out, "REFUSED", timeout=20)
assert line, "orphaned helper did not refuse to run (or never reparented to pid 1)"
def test_15_deadline_alarm_fires_teardown_and_releases(lock_dir, pool):
"""Case 15: the self-deadline (alarm). A guarded helper with a 2s deadline tears down via
the funnel (finally: ran), exits NON-zero, and its lock is released."""
p, out = pool.spawn("guarded", DOMAIN, "2")
assert wait_marker(out, "READY")
rc = p.wait(timeout=20)
assert rc != 0, f"deadline exit must be non-zero (got {rc})"
assert rc == 128 + signal.SIGALRM, f"expected 142 (128+SIGALRM), got {rc}"
assert wait_marker(out, "TEARDOWN", timeout=5), "teardown funnel did not run on deadline"
assert wait_lock_state(DOMAIN, "free") == "free"
def test_16_sigterm_runs_teardown_funnel_and_releases(lock_dir, pool):
"""Case 16: SIGTERM (drone cancel path) -> the finally: teardown funnel runs, exit is
non-zero, lock released."""
p, out = pool.spawn("guarded", DOMAIN, "3600")
assert wait_marker(out, "READY")
p.send_signal(signal.SIGTERM)
rc = p.wait(timeout=20)
assert rc != 0, f"SIGTERM exit must be non-zero (got {rc})"
assert rc == 128 + signal.SIGTERM, f"expected 143 (128+SIGTERM), got {rc}"
assert wait_marker(out, "TEARDOWN", timeout=5), "teardown funnel did not run on SIGTERM"
assert wait_lock_state(DOMAIN, "free") == "free"

View File

@ -0,0 +1,85 @@
"""Lock fundamentals (concurrency-restructure plan, cases 1-4). Real kernel flocks held by real
subprocesses — nothing mocked."""
from __future__ import annotations
import fcntl
import os
import sys
import time
sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from concutil import ( # noqa: E402
DOMAIN,
lock_state,
wait_lock_state,
wait_marker,
)
from harness import lifecycle # noqa: E402
def test_1_sigkill_releases_lock(lock_dir, pool):
"""Case 1: acquire -> holder SIGKILL'd -> lock immediately acquirable (kernel auto-release).
The exact property the old pidfile registry approximated with /proc checks."""
p, out = pool.spawn("hold", DOMAIN)
assert wait_marker(out, "ACQUIRED"), "holder never acquired"
assert lock_state(DOMAIN) == "held"
p.kill()
p.wait(timeout=10)
assert wait_lock_state(DOMAIN, "free") == "free"
def test_2_nb_probe_held_vs_unheld(lock_dir, pool):
"""Case 2: LOCK_NB probe raises BlockingIOError against a held lock; succeeds when unheld."""
p, out = pool.spawn("hold", DOMAIN)
assert wait_marker(out, "ACQUIRED")
path = lifecycle._app_lock_path(DOMAIN) # noqa: SLF001
with open(path, "a") as f:
try:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
raise AssertionError("LOCK_NB succeeded against a held lock")
except BlockingIOError:
pass
p.kill()
p.wait(timeout=10)
assert wait_lock_state(DOMAIN, "free") == "free"
with open(path, "a") as f:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB) # must not raise now
def test_3_lock_fd_not_inherited_by_children(lock_dir, pool):
"""Case 3 (PEP 446): the holder spawns a subprocess child, the holder dies, the child lives —
and the lock is STILL released (the child never inherited the lock fd). This is what makes
'held lock == live HARNESS owner' sound even though runs spawn abra/docker/pytest children."""
p, out = pool.spawn("hold-with-child", DOMAIN)
assert wait_marker(out, "ACQUIRED")
child_line = wait_marker(out, "CHILD")
assert child_line, "holder never reported its child pid"
child_pid = int(child_line.split()[1])
pool.track_pid(child_pid)
p.kill()
p.wait(timeout=10)
assert os.path.exists(f"/proc/{child_pid}"), "child should outlive the holder"
assert (
wait_lock_state(DOMAIN, "free") == "free"
), "lock must release on holder death even with a live child (PEP 446 non-inheritable fd)"
def test_4_second_acquire_blocks_until_first_exits(lock_dir, pool):
"""Case 4: a second same-domain acquire blocks until the first holder exits — the
double-!testme serialisation property."""
p1, out1 = pool.spawn("hold", DOMAIN)
assert wait_marker(out1, "ACQUIRED")
p2, out2 = pool.spawn("hold", DOMAIN)
# p2 must NOT acquire while p1 holds.
time.sleep(1.5)
assert wait_marker(out2, "ACQUIRED", timeout=0.1) is None, "second acquire did not block"
t_kill = time.time()
p1.kill()
p1.wait(timeout=10)
line = wait_marker(out2, "ACQUIRED", timeout=15)
assert line, "second acquire never completed after first holder exited"
acquired_ts = float(line.split()[1])
assert acquired_ts >= t_kill - 0.05, "second holder acquired before the first exited"
assert lock_state(DOMAIN) == "held"

View File

@ -0,0 +1,79 @@
"""Run-scoped state files — M2(c) live-verify regression (not one of the 19 plan cases).
The four CCCI state files (deploys countfile, opstate, deps, depskip) must be keyed by
run id + harness pid, NEVER by app domain: a second run of the SAME domain executes its
main() preamble (state-file init, deploy_app's _record_deploy) BEFORE it blocks at the
app lock, so domain-keyed files in the shared tempdir get reset/removed underneath the
live first run. Observed live (builds 279/281): false DG4.1 deploy-count=2 in run 1,
countfile FileNotFoundError crash in run 2. Children never re-derive these paths — they
receive them via the CCCI_*_FILE env vars, so per-process uniqueness is sufficient.
"""
from __future__ import annotations
import os
import sys
import tempfile
sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
import run_recipe_ci # noqa: E402
from concutil import wait_marker # noqa: E402
DOMAIN = "fake-abc123.ci.commoninternet.net"
def test_20_state_paths_keyed_by_run_and_pid_never_by_domain(monkeypatch):
domain = "immi-ad3e33.ci.commoninternet.net"
monkeypatch.setenv("CCCI_APP_DOMAIN", domain)
monkeypatch.setenv("DRONE_BUILD_NUMBER", "279")
p279 = run_recipe_ci._run_state_path("deploys")
monkeypatch.setenv("DRONE_BUILD_NUMBER", "281")
p281 = run_recipe_ci._run_state_path("deploys")
# the double-!testme invariant: two runs (same domain) share NO state file
assert p279 != p281
# keyed by run id + pid, under the tempdir
base = os.path.basename(p279)
assert base == f"ccci-deploys-279-{os.getpid()}"
assert os.path.dirname(p279) == tempfile.gettempdir()
# the app domain must not appear in the path at all
assert domain not in p279 and domain not in p281
def test_20c_same_domain_runs_each_keep_their_own_count(tmp_path, lock_dir, pool):
"""The live CONC-A1 interleaving, with REAL processes + the REAL lock and counter code:
run A holds the app lock; run B (same domain) fires its pre-lock _record_deploy and
blocks; A then reads its counter — must still be 1 (not polluted by B) — and removes
its own file; B acquires and must find ITS file intact (no FileNotFoundError)."""
gate = tmp_path / "gate"
env_a = {"TMPDIR": str(tmp_path), "DRONE_BUILD_NUMBER": "9001"}
env_b = {"TMPDIR": str(tmp_path), "DRONE_BUILD_NUMBER": "9002"}
pa, out_a = pool.spawn("deploy-count-run", DOMAIN, str(gate), env_extra=env_a)
assert wait_marker(out_a, "ACQUIRED")
pb, out_b = pool.spawn("deploy-count-run", DOMAIN, "", env_extra=env_b)
# B's main()-preamble + pre-lock increment have fired; B is now blocked on the app lock
assert wait_marker(out_b, "PRELOCK")
assert wait_marker(out_b, "ACQUIRED", timeout=1.0) is None # still serialised behind A
gate.touch() # let A read its counter only AFTER B's pre-lock work landed
line_a = wait_marker(out_a, "COUNT")
assert line_a is not None and line_a.strip() == "COUNT 1", line_a # not 2: B didn't pollute A
pa.wait(timeout=15)
line_b = wait_marker(out_b, "COUNT")
assert (
line_b is not None and line_b.strip() == "COUNT 1"
), line_b # B's file survived A's remove
pb.wait(timeout=15)
def test_20b_manual_runs_distinct_via_pid(monkeypatch):
# no DRONE_BUILD_NUMBER and no domain/run-id env → run_id() falls back to "manual";
# the pid suffix still separates two concurrent hand-runs of the same domain.
for var in ("DRONE_BUILD_NUMBER", "CCCI_APP_DOMAIN", "CCCI_RUN_ID"):
monkeypatch.delenv(var, raising=False)
p = run_recipe_ci._run_state_path("opstate")
assert os.path.basename(p) == f"ccci-opstate-manual-{os.getpid()}"

View File

@ -14,32 +14,7 @@ import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "runner"))
from harness import deps as deps_mod # noqa: E402
from harness import lifecycle, naming
def _short(s: str, n: int = 8) -> str:
return "".join(c for c in s if c.isalnum())[:n] or "local"
def _recipe_meta(recipe: str) -> dict:
"""Optional per-recipe config so enrolling a recipe needs NO shared-harness change (D5).
A recipe may ship tests/<recipe>/recipe_meta.py with any of: HEALTH_PATH (str),
HEALTH_OK (tuple of status codes), DEPLOY_TIMEOUT (int), HTTP_TIMEOUT (int)."""
path = os.path.join(os.path.dirname(__file__), recipe, "recipe_meta.py")
meta = {
"HEALTH_PATH": "/",
"HEALTH_OK": (200, 301, 302),
"DEPLOY_TIMEOUT": 600,
"HTTP_TIMEOUT": 300,
}
if os.path.exists(path):
ns: dict = {}
with open(path) as fh:
exec(compile(fh.read(), path, "exec"), ns) # noqa: S102 (trusted, in-repo)
for k in meta:
if k in ns:
meta[k] = ns[k]
return meta
from harness import meta as meta_mod # noqa: E402
@pytest.fixture(scope="session")
@ -48,18 +23,10 @@ def recipe() -> str:
@pytest.fixture(scope="session")
def app_domain(recipe) -> str:
# Docker swarm config/secret names = <stackname>_<res>_<ver> must be <= 64 chars, and
# stackname is the sanitized domain. ".ci.commoninternet.net" alone is 22 chars, so the
# subdomain label must stay short. Use <recipe[:4]>-<6hex(recipe|pr|ref)> — unique per run,
# collision-safe across recipes (full recipe in the hash), readable context lives in the
# Drone build params + PR comment. (Deviation from plan §4.0 long name; see DECISIONS.md.)
return naming.app_domain(recipe, os.environ.get("PR", "0"), os.environ.get("REF"))
@pytest.fixture(scope="session")
def meta(recipe) -> dict:
return _recipe_meta(recipe)
def meta(recipe):
"""The recipe's FULL validated customization (RecipeMeta, attribute access) via the single
loader (rcust P1 — previously this fixture saw only the 4 base keys, spec §8 R3)."""
return meta_mod.load(recipe)
@pytest.fixture(scope="session")
@ -73,32 +40,55 @@ def live_app() -> str:
return domain
@pytest.fixture(scope="session")
def deps_apps() -> dict[str, str]:
"""Phase 2 Q2.3 dependency-resolver contract (refined operator-2026-05-28 SSO-dep plan §1):
when a recipe declares `DEPS = [...]` in its `recipe_meta.py`, the orchestrator deploys each
dep AFTER the generic tiers (between RESTORE and CUSTOM) and persists their per-run identity
+ SSO creds to `$CCCI_DEPS_FILE`. Tests access the dep's per-run domain via this fixture.
For full SSO creds (realm/client/secret/admin) use the `deps_creds` fixture instead.
@pytest.fixture
def op_state() -> dict:
"""The orchestrator's run-scoped op context (rcust P4): versions, artifact paths — written to
`$CCCI_OP_STATE_FILE` after each lifecycle op (e.g. `{"upgrade": {"before": {...},
"head_ref": ...}, "backup": {"snapshot_id": ...}}`). Overlay tests read op facts from here
instead of hand-parsing env/JSON. Skips with a clear reason outside an orchestrator run."""
import json
Returns `{dep_recipe: domain}` (str→str). Empty when no deps declared OR deps-not-ready."""
path = os.environ.get("CCCI_OP_STATE_FILE")
if not path:
pytest.skip(
"CCCI_OP_STATE_FILE not set — op_state is only available under the orchestrator"
)
if not os.path.exists(path):
pytest.skip(f"op-state file missing ({path}) — orchestrator has not performed an op yet")
try:
with open(path) as f:
return json.load(f)
except ValueError:
pytest.skip(f"op-state file unreadable/not JSON ({path})")
class _DepEntry(dict):
"""One provisioned dep (full creds dict) with attribute sugar: `entry.domain`, `entry.realm`,
`entry.client_secret`, ... — dict-style access works too (rcust P2d)."""
def __getattr__(self, name):
try:
return self[name]
except KeyError as e:
raise AttributeError(name) from e
@pytest.fixture(scope="session")
def deps() -> dict[str, _DepEntry]:
"""The recipe's provisioned deps (rcust P2d — consolidates the old `deps_apps`+`deps_creds`
pair). When a recipe declares `DEPS = [...]` in its `recipe_meta.py`, the orchestrator
provisions each dep BEFORE the single deploy and persists per-run identity + SSO creds to
`$CCCI_DEPS_FILE`. `deps["keycloak"]` carries domain/realm/client_id/client_secret/user/
password/email/admin_user/admin_password/discovery_url/token_url/... (`.domain` etc. work as
attributes). Empty when no deps declared OR deps-not-ready — pair with
`@pytest.mark.requires_deps` so the F2-11 skip-report keeps the green signal honest."""
state = deps_mod.deps_as_dict(deps_mod.load_run_state())
return {r: e["domain"] for r, e in state.items() if e.get("domain")}
@pytest.fixture(scope="session")
def deps_creds() -> dict[str, dict]:
"""Full SSO-creds dict for each declared dep (operator-2026-05-28 SSO-dep plan §1).
`deps_creds["keycloak"]` returns the entry written by setup_custom_tests with keys
domain/realm/client_id/client_secret/user/password/email/admin_user/admin_password/
discovery_url/token_url/.... Use this in `@pytest.mark.requires_deps` tests that need to
authenticate via OIDC."""
return deps_mod.deps_as_dict(deps_mod.load_run_state())
return {r: _DepEntry(e) for r, e in state.items()}
def pytest_collection_modifyitems(config, items):
"""SSO-dep plan §4: tests marked `@pytest.mark.requires_deps` are skipped with reason
`deps-not-ready: <captured-err>` when the orchestrator's setup_custom_tests step failed
`deps-not-ready: <captured-err>` when the orchestrator's dep provisioning failed
(orchestrator sets CCCI_DEPS_READY=0 in env). Non-deps custom tests are unaffected.
This is failure-isolation per plan §1 — generic tiers cannot break the SSO-marked tests'
@ -131,40 +121,5 @@ def pytest_configure(config):
"""Register the `requires_deps` marker so pytest doesn't warn about it."""
config.addinivalue_line(
"markers",
"requires_deps: test requires DEPS-declared services + setup_custom_tests success.",
"requires_deps: test requires DEPS-declared services + dep provisioning success.",
)
def _wait_healthy(domain, meta):
lifecycle.wait_healthy(
domain,
ok_codes=tuple(meta["HEALTH_OK"]),
path=meta["HEALTH_PATH"],
deploy_timeout=meta["DEPLOY_TIMEOUT"],
http_timeout=meta["HTTP_TIMEOUT"],
)
@pytest.fixture
def deployed(recipe, app_domain, meta, request):
"""Function-scoped: deploy the current/$REF version healthy, guaranteed teardown after.
Used by stages that start from current (install/backup)."""
version = os.environ.get("VERSION") or None
lifecycle.janitor()
request.addfinalizer(lambda: lifecycle.teardown_app(app_domain))
lifecycle.deploy_app(recipe, app_domain, version=version)
_wait_healthy(app_domain, meta)
return app_domain
@pytest.fixture(scope="session")
def deployed_app(recipe, app_domain, meta):
"""Install stage: deploy the recipe and wait until healthy; tear down at session end."""
version = os.environ.get("VERSION") or None
lifecycle.janitor() # sweep orphans from crashed runs first
try:
lifecycle.deploy_app(recipe, app_domain, version=version, secrets=True)
_wait_healthy(app_domain, meta)
yield app_domain
finally:
lifecycle.teardown_app(app_domain)

View File

@ -15,13 +15,13 @@ def _write(domain, val):
lifecycle.exec_in_app(domain, ["sh", "-c", f"echo {val} > {MARKER}"])
def pre_upgrade(domain, meta):
_write(domain, "upgrade-survives")
def pre_upgrade(ctx):
_write(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_write(domain, "original")
def pre_backup(ctx):
_write(ctx.domain, "original")
def pre_restore(domain, meta):
_write(domain, "mutated") # diverge so a successful restore is observable
def pre_restore(ctx):
_write(ctx.domain, "mutated") # diverge so a successful restore is observable

View File

@ -7,9 +7,9 @@ DEPLOY_TIMEOUT = 600
HTTP_TIMEOUT = 600
def EXTRA_ENV(domain):
def EXTRA_ENV(ctx):
"""cryptpad needs a SANDBOX_DOMAIN distinct from the main DOMAIN (it serves user content from a
separate origin; the web router routes both). Derive a sibling subdomain under the same wildcard
(covered by the wildcard cert, so no cert work)."""
label, _, rest = domain.partition(".")
label, _, rest = ctx.domain.partition(".")
return {"SANDBOX_DOMAIN": f"{label}-sb.{rest}"}

View File

@ -12,8 +12,8 @@ from harness import lifecycle
MARKER_PATH = "/usr/share/nginx/html/ci-marker.txt"
def pre_restore(domain: str, meta: dict) -> None:
def pre_restore(ctx) -> None:
"""Write 'mutated' to the marker before restore runs. If restore brings back the
snapshot (which has no marker — never seeded by pre_backup), the marker ends up
MISSING or 'mutated' after restore → test_restore_returns_state FAILS → restore=RED."""
lifecycle.exec_in_app(domain, ["sh", "-c", f"echo mutated > {MARKER_PATH}"])
lifecycle.exec_in_app(ctx.domain, ["sh", "-c", f"echo mutated > {MARKER_PATH}"])

View File

@ -11,5 +11,5 @@ from harness import lifecycle
MARKER_PATH = "/usr/share/nginx/html/ci-marker.txt"
def pre_restore(domain: str, meta: dict) -> None:
lifecycle.exec_in_app(domain, ["sh", "-c", f"echo mutated > {MARKER_PATH}"])
def pre_restore(ctx) -> None:
lifecycle.exec_in_app(ctx.domain, ["sh", "-c", f"echo mutated > {MARKER_PATH}"])

View File

@ -1,4 +1,4 @@
"""custom-html — pre-op seed hooks (Phase 1e HC3). The orchestrator runs `pre_<op>(domain, meta)`
"""custom-html — pre-op seed hooks (Phase 1e HC3). The orchestrator runs `pre_<op>(ctx)`
BEFORE it performs the op; the matching test_<op>.py asserts the post-op state (assertion-only).
nginx serves the volume at /usr/share/nginx/html, so the marker file survives an upgrade / a
@ -17,16 +17,16 @@ def _write(domain: str, val: str) -> None:
lifecycle.exec_in_app(domain, ["sh", "-c", f"echo {val} > {MARKER_PATH}"])
def pre_upgrade(domain, meta):
def pre_upgrade(ctx):
# seed a marker before the upgrade so the overlay can prove the data survives it
_write(domain, "upgrade-survives")
_write(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
def pre_backup(ctx):
# establish a known original state before the backup op captures it
_write(domain, "original")
_write(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# diverge from the backed-up state so a successful restore (back to "original") is observable
_write(domain, "mutated")
_write(ctx.domain, "mutated")

View File

@ -1,26 +0,0 @@
#!/usr/bin/env bash
# discourse — INSTALL-TIME hook (Phase 2 Q4.6). Runs during the install tier AFTER `abra app new` +
# EXTRA_ENV + `abra app secret generate` and BEFORE the single `abra app deploy`
# (lifecycle.py::_run_install_steps), with CCCI_RECIPE / CCCI_APP_DOMAIN in env.
#
# Purpose: provide the cc-ci re-pin+grace overlay (compose.ccci.yml) to the recipe checkout so the
# UPGRADE-tier BASE deploy (published 0.7.0+3.3.1, whose compose pins the Docker-Hub-removed
# `bitnami/discourse:3.3.1` and ships a too-tight 5m start_period) is deployable and can survive the
# 15-25min Rails cold boot — so upgrade-to-latest can run. See compose.ccci.yml's header for the full
# rationale. The overlay is referenced by recipe_meta COMPOSE_FILE; it is a cc-ci file (not part of the
# recipe), so copying it here makes it resolvable. It persists across the later `git checkout <head>`
# (untracked) so the head deploy also merges it (idempotent — the PR head already re-pins + ships 20m).
# CHAOS_BASE_DEPLOY=True is set so abra's pinned-deploy clean-tree check doesn't FATA on the overlay.
set -euo pipefail
: "${CCCI_RECIPE:?missing CCCI_RECIPE}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
RECIPE_DIR="${HOME}/.abra/recipes/${CCCI_RECIPE}"
if [ ! -d "$RECIPE_DIR" ]; then
echo " discourse install_steps: recipe dir $RECIPE_DIR missing — cannot provide compose.ccci.yml" >&2
exit 1
fi
cp "$SCRIPT_DIR/compose.ccci.yml" "$RECIPE_DIR/compose.ccci.yml"
echo " discourse install_steps: provided compose.ccci.yml (bitnamilegacy re-pin + 20m start_period grace) to recipe checkout (${CCCI_RECIPE})"

View File

@ -30,18 +30,18 @@ def _seed(domain, value):
assert got == value, f"seed did not commit (read back {got!r}, expected {value!r})"
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# diverge from the backup so a successful restore is observable
_psql(domain, "DROP TABLE IF EXISTS ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
_psql(ctx.domain, "DROP TABLE IF EXISTS ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"

View File

@ -29,11 +29,11 @@ HTTP_TIMEOUT = 1200
# (1) it pins the Docker-Hub-removed `bitnami/discourse:3.3.1` (404) → overlay re-pins app+sidekiq to
# `bitnamilegacy/discourse:3.3.1` (namespace-only, identical image), the same re-pin the PR makes;
# (2) its 5m start_period is too tight for the 15-25min Rails boot → overlay widens it to 20m (grace).
# install_steps.sh provides the overlay; CHAOS_BASE_DEPLOY skips the clean-tree gate on the untracked
# overlay; it persists across the head checkout (idempotent — the PR head already re-pins + ships 20m).
# The harness auto-provides the overlay to the checkout and auto-chaoses the base deploy
# (first-class compose.ccci.yml, rcust P2a); it persists across the head checkout (idempotent — the
# PR head already re-pins + ships 20m).
# Upgrade crossover: 0.7.0 (re-pinned base) → PR head; full assertions run on the HEAD. The 0.7.0
# *custom* tests are not separately run (custom tier runs once, on the head — policy §1 allows skip+record).
CHAOS_BASE_DEPLOY = True
UPGRADE_BASE_VERSION = "0.7.0+3.3.1"
EXTRA_ENV = {
"TIMEOUT": "3600", # abra's internal convergence wait; matches DEPLOY_TIMEOUT (slow Rails boot headroom)
@ -41,7 +41,7 @@ EXTRA_ENV = {
}
def BACKUP_VERIFY(domain):
def BACKUP_VERIFY(ctx):
"""Post-backup integrity check (Q4.6, same race ghost F2-14b hit). The recipe's backupbot db
pre-hook (`/pg_backup.sh backup`) dumps the discourse postgres DB to `/var/lib/postgresql/data/
backup.sql` (gzip), then restic captures that path. On the loaded single CI node the db container
@ -60,7 +60,7 @@ def BACKUP_VERIFY(domain):
try:
out = lifecycle.exec_in_app(
domain,
ctx.domain,
[
"sh",
"-c",

View File

@ -1,26 +0,0 @@
#!/usr/bin/env bash
# ghost — INSTALL-TIME hook (Phase 2 F2-14b). Runs during the install tier AFTER `abra app new` +
# EXTRA_ENV + `abra app secret generate` and BEFORE the single `abra app deploy`
# (lifecycle.py::_run_install_steps), with CCCI_RECIPE / CCCI_APP_DOMAIN in env.
#
# Purpose: provide the cc-ci start_period-grace overlay (compose.ccci.yml) to the recipe checkout so
# the UPGRADE-tier BASE deploy (a previous published version whose app healthcheck still ships the
# too-tight 1m start_period) can survive ghost's ~6-9min fresh-DB migration and converge. See
# compose.ccci.yml's header for the full rationale. The overlay is referenced by recipe_meta
# COMPOSE_FILE; copying it here (it is a cc-ci file, not part of the recipe) makes it resolvable.
# It persists across the later `git checkout <head>` (untracked) so the head deploy also merges it
# (idempotent — the PR head already ships 15m). CHAOS_BASE_DEPLOY=True is set so abra's pinned-deploy
# clean-tree check doesn't FATA on the untracked overlay.
set -euo pipefail
: "${CCCI_RECIPE:?missing CCCI_RECIPE}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
RECIPE_DIR="${HOME}/.abra/recipes/${CCCI_RECIPE}"
if [ ! -d "$RECIPE_DIR" ]; then
echo " ghost install_steps: recipe dir $RECIPE_DIR missing — cannot provide compose.ccci.yml" >&2
exit 1
fi
cp "$SCRIPT_DIR/compose.ccci.yml" "$RECIPE_DIR/compose.ccci.yml"
echo " ghost install_steps: provided compose.ccci.yml (app start_period grace) to recipe checkout (${CCCI_RECIPE})"

View File

@ -36,19 +36,19 @@ def _seed(domain, value):
assert got == value, f"seed did not commit (read back {got!r}, expected {value!r})"
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# diverge from the backup so a successful restore is observable: drop the marker table.
_mysql(domain, "DROP TABLE IF EXISTS ci_marker;")
_mysql(ctx.domain, "DROP TABLE IF EXISTS ci_marker;")
got = _mysql(
domain,
ctx.domain,
"SELECT COUNT(*) FROM information_schema.tables "
"WHERE table_schema='ghost' AND table_name='ci_marker';",
)

View File

@ -31,23 +31,22 @@ HTTP_TIMEOUT = 900
# (plan-ccci-compose-overlay-policy.md §1), so the harness base-deploys the previous PUBLISHED version
# (1.1.1+6-alpine) — which predates the PR and still ships the too-tight 1m start_period → it would
# deadlock on the same migration kill. compose.ccci.yml re-applies the 15m grace to the BASE so the
# from-version is deployable; install_steps.sh provides it to the checkout; CHAOS_BASE_DEPLOY skips the
# clean-tree gate on that untracked overlay. It persists across the head checkout (idempotent — the PR
# head already ships 15m). This is the policy-blessed "minimal overlay on the from-version so
# from-version is deployable; the harness auto-provides it to the checkout and auto-chaoses the base
# deploy (first-class compose.ccci.yml, rcust P2a). It persists across the head checkout (idempotent —
# the PR head already ships 15m). This is the policy-blessed "minimal overlay on the from-version so
# upgrade-to-latest can run" — grace-only, masks no defect, weakens no test.
# TIMEOUT/DEPLOY_TIMEOUT 2400s: the BASE cold boot's wall-time is mysql fresh-dir init (~6min, during
# which the app crash-loops harmlessly on `ECONNREFUSED 3306` until mysql accepts connections — no
# migration progress lost, it hasn't started) PLUS the ~9-15min schema migration (round-trip-bound,
# slower under host load). 1200s was too tight (full4 killed at the near-final `email_recipients`
# tables while still 0/1); 2400s gives headroom while still bounding a genuine hang (matches discourse).
CHAOS_BASE_DEPLOY = True
EXTRA_ENV = {
"TIMEOUT": "2400",
"COMPOSE_FILE": "compose.yml:compose.ccci.yml",
}
def BACKUP_VERIFY(domain):
def BACKUP_VERIFY(ctx):
"""Post-backup integrity check (F2-14b). The recipe's backupbot db pre-hook dumps the ghost MySQL
DB to `/var/lib/mysql/backup.sql.gz` (then restic captures that path). On the loaded single CI node
the db container intermittently CYCLES mid-dump (observed: full5/6/7 RED, full8 green — pure race;
@ -62,7 +61,7 @@ def BACKUP_VERIFY(domain):
try:
out = lifecycle.exec_in_app(
domain,
ctx.domain,
[
"sh",
"-c",

View File

@ -25,17 +25,17 @@ def _seed(domain, value):
assert _psql(domain, "SELECT v FROM ci_marker;") == value
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
_psql(domain, "DROP TABLE ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
def pre_restore(ctx):
_psql(ctx.domain, "DROP TABLE ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"

View File

@ -14,20 +14,20 @@ def _token(domain):
return kc_admin.admin_token(domain, kc_admin.admin_password(domain))
def pre_upgrade(domain, meta):
def pre_upgrade(ctx):
# create the marker realm (DB data) before the upgrade so the overlay can prove it survives
assert kc_admin.create_marker_realm(domain, _token(domain)) in (201, 409)
assert kc_admin.create_marker_realm(ctx.domain, _token(ctx.domain)) in (201, 409)
def pre_backup(domain, meta):
def pre_backup(ctx):
# establish the marker realm before the backup op captures mariadb
assert kc_admin.create_marker_realm(domain, _token(domain)) in (201, 409)
assert kc_admin.create_marker_realm(ctx.domain, _token(ctx.domain)) in (201, 409)
def pre_restore(domain, meta):
def pre_restore(ctx):
# backup-bot-two cycles the keycloak container during backup → wait for serving, re-auth, then
# delete the realm (diverge from the backup) so a successful restore is observable
generic.assert_serving(domain, meta)
tok = _token(domain)
assert kc_admin.delete_marker_realm(domain, tok) in (204, 200)
assert not kc_admin.marker_realm_exists(domain, tok), "delete did not take"
generic.assert_serving(ctx.domain, ctx.meta)
tok = _token(ctx.domain)
assert kc_admin.delete_marker_realm(ctx.domain, tok) in (204, 200)
assert not kc_admin.marker_realm_exists(ctx.domain, tok), "delete did not take"

View File

@ -5,7 +5,7 @@ persistence". This is the canonical create-an-object + read-it-back for lasuite-
Flow (uses an OIDC token from the dep keycloak):
1. Obtain a JWT via OIDC password grant against the dep keycloak (the test user is provisioned
by the orchestrator's setup_custom_tests step).
by the orchestrator's dep-provisioning step).
2. POST `/api/v1.0/documents/` with `Authorization: Bearer <jwt>` to create a new doc with a
unique title; capture the returned `id`.
3. GET `/api/v1.0/documents/<id>/` with the same Bearer token; assert the returned title and
@ -15,7 +15,7 @@ Non-vacuous: a misconfigured OIDC, broken backend, or missing endpoint fails at
broken. The marker-in-the-title + id round-trip proves the doc actually persisted in lasuite-
docs's database after going through the recipe's nginx → backend → postgres path.
Marked @pytest.mark.requires_deps — skips with `deps-not-ready` if setup_custom_tests failed.
Marked @pytest.mark.requires_deps — skips with `deps-not-ready` if dep provisioning failed.
"""
from __future__ import annotations
@ -32,9 +32,9 @@ from harness import sso
@pytest.mark.requires_deps
def test_create_doc_and_read_back(live_app, deps_creds):
def test_create_doc_and_read_back(live_app, deps):
"""Create a doc via the authenticated API; fetch it back; assert round-trip."""
kc = deps_creds["keycloak"]
kc = deps["keycloak"]
# Obtain a JWT via OIDC password grant
access_token = sso.oidc_password_grant(

View File

@ -5,13 +5,13 @@ SOURCE: references/recipe-maintainer/recipe-info/lasuite-docs/tests/oidc_login.p
End-to-end flow:
1. GET `/api/v1.0/users/me/` without auth → asserts the response REDIRECTS to the dep
keycloak's realm auth endpoint (the recipe is correctly configured to challenge
unauthenticated callers — wired via setup_custom_tests.sh).
unauthenticated callers — wired via install_steps.sh).
2. Obtain an OIDC token from the dep keycloak via password grant
(the test user provisioned by the orchestrator's realm setup).
3. Call `/api/v1.0/users/me/` with `Authorization: Bearer <jwt>` → asserts 200 and the
returned user's email matches the provisioned test user.
Marked @pytest.mark.requires_deps — skips with `deps-not-ready` if setup_custom_tests failed.
Marked @pytest.mark.requires_deps — skips with `deps-not-ready` if dep provisioning failed.
"""
from __future__ import annotations
@ -51,9 +51,9 @@ def _get_no_redirect(url: str) -> tuple[int, str]:
@pytest.mark.requires_deps
def test_oidc_login_via_keycloak(live_app, deps_creds):
def test_oidc_login_via_keycloak(live_app, deps):
"""Anonymous → redirect to keycloak; password-grant token → 200 from /api/v1.0/users/me/."""
kc = deps_creds["keycloak"]
kc = deps["keycloak"]
# Step 1: unauthenticated GET → 302 to keycloak realm's auth endpoint
status, redirect = _get_no_redirect(f"https://{live_app}/api/v1.0/users/me/")

View File

@ -3,10 +3,10 @@
Refactored to the refined SSO-dep model:
- The orchestrator deploys a per-run keycloak dep AFTER generic tiers and provisions a fresh
realm/client/user via `harness.sso.setup_keycloak_realm`. The creds are written to
`$CCCI_DEPS_FILE` (read here via the `deps_creds` fixture).
`$CCCI_DEPS_FILE` (read here via the `deps` fixture).
- This test no longer calls `setup_keycloak_realm` itself — that's the orchestrator's job in
the setup_custom_tests step. We just consume the credentials and exercise the OIDC flow.
- Marked `@pytest.mark.requires_deps` so if setup_custom_tests failed, this test SKIPs with a
the dep-provisioning step. We just consume the credentials and exercise the OIDC flow.
- Marked `@pytest.mark.requires_deps` so if dep provisioning failed, this test SKIPs with a
clear `deps-not-ready` reason rather than red-flagging a non-recipe failure.
"""
@ -31,13 +31,13 @@ def _b64url_decode(seg: str) -> bytes:
@pytest.mark.requires_deps
def test_oidc_password_grant_against_dep_keycloak(live_app, deps_creds):
def test_oidc_password_grant_against_dep_keycloak(live_app, deps):
"""The dep keycloak issues a JWT for the pre-provisioned test user via OIDC password grant."""
assert "keycloak" in deps_creds, (
f"keycloak creds not in deps_creds; got {list(deps_creds.keys())}. "
"setup_custom_tests should have populated this."
assert "keycloak" in deps, (
f"keycloak creds not in deps; got {list(deps.keys())}. "
"dep provisioning should have populated this."
)
kc = deps_creds["keycloak"]
kc = deps["keycloak"]
# Sanity-check the creds shape — orchestrator-written
assert kc["domain"]

View File

@ -0,0 +1,74 @@
#!/usr/bin/env bash
# lasuite-docs — INSTALL-TIME OIDC wiring hook (rcust P2b; migrated from the deleted
# setup_custom_tests.sh post-deploy path — sibling of lasuite-drive/-meet's hooks).
#
# Runs during the install tier AFTER `abra app new` + EXTRA_ENV + `abra app secret generate`, and
# BEFORE the single `abra app deploy` (lifecycle.py::_run_install_steps). Writing OIDC env + the
# real client secret HERE means the recipe deploys ONCE with OIDC already wired — no post-deploy
# reconverge. The orchestrator provisions the per-run realm/client on the (live-warm) keycloak
# BEFORE this hook and writes $CCCI_DEPS_FILE (the recipe→creds dict). docs' OIDC settings are
# config-only (validated by `manage.py check`, not fetched at boot), so the stack boots healthy
# with the env set. Env names per lasuite-docs's .env.sample (same values the old post-deploy
# hook wrote — byte-identical wiring, only the timing moved).
#
# Env supplied by the harness:
# CCCI_APP_DOMAIN — the per-run lasuite-docs app domain
# CCCI_APP_ENV — path to the app's .env (the one `abra app deploy` reads)
# CCCI_DEPS_FILE — JSON {keycloak: {domain, realm, client_id, client_secret, ...}} (may be empty)
set -euo pipefail
: "${CCCI_APP_DOMAIN:?missing}"
ENV_PATH="${CCCI_APP_ENV:?missing}"
# No deps file / no keycloak entry → install-time provisioning failed or was skipped. NO-OP so the
# recipe still boots; the @requires_deps OIDC custom test then SKIPs and F2-11 flips the run RED.
if [ -z "${CCCI_DEPS_FILE:-}" ] || [ ! -s "${CCCI_DEPS_FILE}" ]; then
echo " install_steps: no deps file — skipping OIDC wiring (recipe boots without OIDC)"
exit 0
fi
KC_DOMAIN=$(jq -r '.keycloak.domain // empty' "$CCCI_DEPS_FILE")
KC_REALM=$(jq -r '.keycloak.realm // empty' "$CCCI_DEPS_FILE")
KC_CLIENT=$(jq -r '.keycloak.client_id // empty' "$CCCI_DEPS_FILE")
KC_SECRET=$(jq -r '.keycloak.client_secret // empty' "$CCCI_DEPS_FILE")
if [ -z "$KC_DOMAIN" ] || [ -z "$KC_SECRET" ]; then
echo " install_steps: deps file has no keycloak domain/secret — skipping OIDC wiring"
exit 0
fi
echo " lasuite-docs install_steps: wiring OIDC at install against keycloak ${KC_DOMAIN}"
# 1) Insert the OIDC client secret at a bumped version (abra already generated oidc_rpcs:v1; swarm
# forbids overwriting a secret at the same version). The app is not deployed yet — a swarm secret
# can be created independently — so the single deploy below picks up v2.
CUR_VER=$(grep -E '^\s*SECRET_OIDC_RPCS_VERSION=' "$ENV_PATH" | tail -1 | cut -d= -f2 | tr -d '"\r' || echo "v1")
NEW_NUM=$((${CUR_VER#v} + 1))
NEW_VER="v${NEW_NUM}"
INSERT_LOG=$(abra app secret insert "$CCCI_APP_DOMAIN" oidc_rpcs "$NEW_VER" "$KC_SECRET" --no-input -C -o 2>&1) ||
INSERT_LOG=$(script -qec "abra app secret insert $CCCI_APP_DOMAIN oidc_rpcs $NEW_VER $KC_SECRET --no-input -C -o" /dev/null 2>&1) ||
{
echo " install_steps: abra app secret insert oidc_rpcs@$NEW_VER failed: $INSERT_LOG"
exit 1
}
sed -i "s|^\s*SECRET_OIDC_RPCS_VERSION=.*|SECRET_OIDC_RPCS_VERSION=$NEW_VER|" "$ENV_PATH"
echo " install_steps: oidc_rpcs secret inserted at $NEW_VER (was $CUR_VER)"
# 2) Write OIDC env vars to the app's .env (names per lasuite-docs's .env.sample). Ensure a
# trailing newline first so appends never concatenate onto the last line.
write_env() {
local key="$1" val="$2"
sed -i "/^\s*#\?\s*${key}=/d" "$ENV_PATH"
[ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >>"$ENV_PATH"
printf '%s=%s\n' "$key" "$val" >>"$ENV_PATH"
}
write_env OIDC_REALM "$KC_REALM"
write_env OIDC_OP_DISCOVERY_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/.well-known/openid-configuration"
write_env OIDC_OP_AUTHORIZATION_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/auth"
write_env OIDC_OP_TOKEN_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/token"
write_env OIDC_OP_USER_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/userinfo"
write_env OIDC_OP_LOGOUT_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/logout"
write_env OIDC_OP_JWKS_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/certs"
write_env OIDC_RP_CLIENT_ID "$KC_CLIENT"
write_env OIDC_RP_SIGN_ALGO "RS256"
write_env OIDC_RP_SCOPES "openid email profile"
echo " lasuite-docs install_steps: OIDC env wired into .env (deploy will pick it up, no reconverge)"

View File

@ -24,18 +24,18 @@ def _seed(domain, value):
assert _psql(domain, "SELECT v FROM ci_marker;") == value
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# drop the marker table (diverge from the backup) so a successful restore is observable
_psql(domain, "DROP TABLE ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
_psql(ctx.domain, "DROP TABLE ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"

View File

@ -15,7 +15,7 @@ HTTP_TIMEOUT = 600
DEPS = ["keycloak"]
def EXTRA_ENV(domain):
def EXTRA_ENV(ctx):
# abra's internal per-deploy convergence timeout (the recipe's TIMEOUT env, default 300s) is too
# short for this 9-service stack on a COLD image cache (~9 large images: impress frontend/backend,
# minio, postgres18, redis, docspec, y-provider). Cold pulls exceed 300s -> "deploy timed out 🟠".

View File

@ -1,91 +0,0 @@
#!/usr/bin/env bash
# lasuite-docs — post-deps setup hook (operator-2026-05-28 SSO-dep plan §3.2).
#
# Runs AFTER the generic tiers (install/upgrade/backup/restore) and AFTER each declared dep is
# deployed + provisioned with realm/client via the harness. The orchestrator has written
# $CCCI_DEPS_FILE with the keycloak dep's domain + realm + client_secret + admin creds.
#
# This hook:
# 1. Reads the dep's connection info from $CCCI_DEPS_FILE.
# 2. Inserts the OIDC client secret as an abra app secret (recipe-conventional name oidc_rpcs).
# 3. Writes the OIDC env vars to the running app's .env via `abra app config set`.
# 4. Triggers an in-place `abra app deploy --force --chaos` so the new env takes effect.
# THIS IS NOT a fresh `abra app new` — the deploy-count guard (DG4.1, generalised) still
# sees one app_new per app.
#
# Env supplied by the orchestrator:
# CCCI_APP_DOMAIN — the running per-run lasuite-docs app domain
# CCCI_RECIPE — "lasuite-docs"
# CCCI_DEPS_FILE — JSON file (dict shape: {dep_recipe: {domain, realm, client_id, ...}, ...})
set -euo pipefail
: "${CCCI_APP_DOMAIN:?missing}"
: "${CCCI_DEPS_FILE:?missing}"
test -s "$CCCI_DEPS_FILE" || {
echo " setup_custom_tests: deps file empty"
exit 1
}
# Read keycloak dep info via jq
KC_DOMAIN=$(jq -r '.keycloak.domain' "$CCCI_DEPS_FILE")
KC_REALM=$(jq -r '.keycloak.realm' "$CCCI_DEPS_FILE")
KC_CLIENT=$(jq -r '.keycloak.client_id' "$CCCI_DEPS_FILE")
KC_SECRET=$(jq -r '.keycloak.client_secret' "$CCCI_DEPS_FILE")
if [ -z "$KC_DOMAIN" ] || [ "$KC_DOMAIN" = "null" ]; then
echo " setup_custom_tests: no keycloak.domain in deps"
exit 1
fi
if [ -z "$KC_SECRET" ] || [ "$KC_SECRET" = "null" ]; then
echo " setup_custom_tests: no keycloak.client_secret"
exit 1
fi
echo " lasuite-docs setup_custom_tests: wiring OIDC against keycloak dep ${KC_DOMAIN}"
# 1) Insert the OIDC client secret AT A BUMPED VERSION (the recipe-maintainer pattern).
# `abra app new -S` already generated `oidc_rpcs:v1` (random) — Docker Swarm forbids overwriting
# a secret at the same version, so we bump the version (v2), insert our value there, then
# update SECRET_OIDC_RPCS_VERSION in the .env to point at the new one.
ENV_PATH="$HOME/.abra/servers/default/${CCCI_APP_DOMAIN}.env"
CUR_VER=$(grep -E '^\s*SECRET_OIDC_RPCS_VERSION=' "$ENV_PATH" | tail -1 | cut -d= -f2 | tr -d '"\r' || echo "v1")
NEW_NUM=$((${CUR_VER#v} + 1))
NEW_VER="v${NEW_NUM}"
INSERT_LOG=$(abra app secret insert "$CCCI_APP_DOMAIN" oidc_rpcs "$NEW_VER" "$KC_SECRET" --no-input -C -o 2>&1) ||
INSERT_LOG=$(script -qec "abra app secret insert $CCCI_APP_DOMAIN oidc_rpcs $NEW_VER $KC_SECRET --no-input -C -o" /dev/null 2>&1) ||
{
echo " setup_custom_tests: abra app secret insert oidc_rpcs@$NEW_VER failed: $INSERT_LOG"
exit 1
}
# Repoint the env var to the new version
sed -i "s|^\s*SECRET_OIDC_RPCS_VERSION=.*|SECRET_OIDC_RPCS_VERSION=$NEW_VER|" "$ENV_PATH"
echo " setup_custom_tests: oidc_rpcs secret inserted at $NEW_VER (was $CUR_VER)"
# 2) Write OIDC env vars to the app's .env (names per lasuite-docs's .env.sample).
# Ensure the file ends with a newline FIRST so our appends don't concatenate onto the last line
# (we saw `TIMEOUT=900OIDC_REALM=...` malformed by a missing-trailing-newline file).
[ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >>"$ENV_PATH"
write_env() {
local key="$1" val="$2"
# remove any existing key (commented or live) then append the live key=val
sed -i "/^\s*#\?\s*${key}=/d" "$ENV_PATH"
# Re-ensure trailing newline after each delete (sed may leave the file without one)
[ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >>"$ENV_PATH"
printf '%s=%s\n' "$key" "$val" >>"$ENV_PATH"
}
write_env OIDC_REALM "$KC_REALM"
write_env OIDC_OP_DISCOVERY_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/.well-known/openid-configuration"
write_env OIDC_OP_AUTHORIZATION_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/auth"
write_env OIDC_OP_TOKEN_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/token"
write_env OIDC_OP_USER_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/userinfo"
write_env OIDC_OP_LOGOUT_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/logout"
write_env OIDC_OP_JWKS_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/certs"
write_env OIDC_RP_CLIENT_ID "$KC_CLIENT"
write_env OIDC_RP_SIGN_ALGO "RS256"
write_env OIDC_RP_SCOPES "openid email profile"
# 3) Trigger an in-place redeploy so the env update takes effect. --force re-deploys even when
# the recipe hasn't changed; --chaos avoids the chaos prompt; --no-input non-interactive.
abra app deploy "$CCCI_APP_DOMAIN" --force --chaos --no-input 2>&1 | tail -10
echo " lasuite-docs setup_custom_tests: OIDC wired + redeployed"

View File

@ -3,12 +3,12 @@
Drive (La Suite Drive) is OIDC-required: login is gated by an external OpenID Connect provider.
Mirrors the proven lasuite-docs SSO model:
- The orchestrator deploys a per-run keycloak dep AFTER the generic tiers and provisions a fresh
realm/client/user via `harness.sso.setup_keycloak_realm`; `setup_custom_tests.sh` then wires the
realm/client/user via `harness.sso.setup_keycloak_realm`; `install_steps.sh` then wires the
OIDC env + client secret into the running drive app and redeploys. Creds land in `$CCCI_DEPS_FILE`
(read here via the `deps_creds` fixture).
(read here via the `deps` fixture).
- This test consumes those creds and exercises the real OIDC flow against the dep keycloak: discovery
endpoint advertises the realm, and a password grant yields a valid JWT with the expected claims.
- Marked `@pytest.mark.requires_deps` so if setup_custom_tests failed the test SKIPs with a clear
- Marked `@pytest.mark.requires_deps` so if dep provisioning failed the test SKIPs with a clear
`deps-not-ready` reason — and (per F2-11) the orchestrator then fails the run rather than going
green on a skipped SSO test.
@ -36,13 +36,13 @@ def _b64url_decode(seg: str) -> bytes:
@pytest.mark.requires_deps
def test_oidc_password_grant_against_dep_keycloak(live_app, deps_creds):
def test_oidc_password_grant_against_dep_keycloak(live_app, deps):
"""The dep keycloak issues a JWT for the pre-provisioned test user via OIDC password grant."""
assert "keycloak" in deps_creds, (
f"keycloak creds not in deps_creds; got {list(deps_creds.keys())}. "
"setup_custom_tests should have populated this."
assert "keycloak" in deps, (
f"keycloak creds not in deps; got {list(deps.keys())}. "
"dep provisioning should have populated this."
)
kc = deps_creds["keycloak"]
kc = deps["keycloak"]
# Creds shape. WC1: realm is per-run namespaced "<parent>-<6hex>"; client_id stays the parent.
assert kc["domain"]

View File

@ -6,7 +6,7 @@
# BEFORE the single `abra app deploy` (runner/harness/lifecycle.py::_run_install_steps). By writing
# the OIDC env + the real client secret into the app's `.env` HERE, the recipe deploys ONCE with
# OIDC already wired — eliminating the flaky post-deploy in-place `--force --chaos` 12-service
# reconverge that the old setup_custom_tests.sh did (collabora WOPI-discovery race; see JOURNAL
# post-deploy reconverge (collabora WOPI-discovery race; see JOURNAL
# Step 0). The orchestrator provisions the per-run realm/client on the live-warm keycloak BEFORE
# this hook and writes $CCCI_DEPS_FILE (the recipe→creds dict).
#

View File

@ -5,6 +5,7 @@ in the `db` service. The backup path exercises the recipe's pg_backup.sh DB-dump
backupbot-labelled)."""
import os
import subprocess
import sys
import time
@ -12,6 +13,47 @@ sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner")
from harness import lifecycle # noqa: E402
def pre_install(ctx):
"""Post-deploy seed for the custom tier (the former setup_custom_tests.sh, moved here in rcust
P2b — install_steps.sh runs PRE-deploy and cannot touch the live stack). The deploy alone does
NOT create the MinIO bucket: `minio-createbuckets` is a `replicas:0` one-shot (restart_policy:
none) that must be triggered. The MinIO storage test asserts the bucket exists, so trigger it
here and poll. `--detach` is REQUIRED: the job creates the bucket then EXITS 0, so it never
holds a steady 1/1 replica — a blocking scale would wait forever."""
stack = ctx.domain.replace(".", "_")
print(" pre_install: creating MinIO bucket via the minio-createbuckets one-shot", flush=True)
subprocess.run(
["docker", "service", "scale", "--detach", f"{stack}_minio-createbuckets=1"],
capture_output=True,
check=False,
)
check = (
'mc alias set _c http://localhost:9000 "$(cat /run/secrets/minio_ru)" '
'"$(cat /run/secrets/minio_rp)" >/dev/null 2>&1 && '
"mc ls _c/drive-media-storage >/dev/null 2>&1"
)
for i in range(30):
cid = subprocess.run(
["docker", "ps", "-q", "-f", f"name={stack}_minio.1"],
capture_output=True,
text=True,
check=False,
).stdout.split()
if cid and (
subprocess.run(
["docker", "exec", cid[0], "sh", "-c", check], capture_output=True, check=False
).returncode
== 0
):
print(
f" pre_install: bucket drive-media-storage present after {i + 1} poll(s)",
flush=True,
)
return
time.sleep(3)
raise AssertionError("minio-createbuckets one-shot did not create drive-media-storage in 90s")
def _wait_collabora_ready(domain, timeout=420):
"""Gate the upgrade op on collabora being FULLY ready (WOPI discovery endpoint → 200), not just
container 1/1 'running'. coolwsd takes ~2min to boot (pre-reads 1300+ l10n files + RSA keygen);
@ -49,21 +91,21 @@ def _seed(domain, value):
assert _psql(domain, "SELECT v FROM ci_marker;") == value
def pre_upgrade(domain, meta):
def pre_upgrade(ctx):
# Gate the chaos redeploy on a fully-ready collabora (else it kills a still-booting coolwsd and
# abra aborts the upgrade deploy — Q3.2a run 1). Then seed the data-integrity marker.
_wait_collabora_ready(domain)
_seed(domain, "upgrade-survives")
_wait_collabora_ready(ctx.domain)
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# drop the marker table (diverge from the backup) so a successful restore is observable
_psql(domain, "DROP TABLE ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
_psql(ctx.domain, "DROP TABLE ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"

View File

@ -18,34 +18,31 @@ DEPLOY_TIMEOUT = 1800
HTTP_TIMEOUT = 900
# Base deploy/lifecycle proven cold-green @2026-05-28 (install: pass; 12 services incl.
# onlyoffice+collabora) once the Docker Hub rate limit was fixed. The keycloak SSO dep is now
# enabled: declaring DEPS triggers the orchestrator's setup_custom_tests step (deploy keycloak +
# provision realm/client/user + run tests/lasuite-drive/setup_custom_tests.sh to wire OIDC env +
# in-place redeploy). functional/test_oidc_with_keycloak.py then exercises the SSO flow.
# onlyoffice+collabora) once the Docker Hub rate limit was fixed. Declaring DEPS makes the
# orchestrator provision keycloak (realm/client/user) BEFORE the single deploy;
# functional/test_oidc_with_keycloak.py then exercises the SSO flow.
DEPS = ["keycloak"]
# Q3.2a (plan-lasuite-drive-oidc-robustness.md Part A): wire OIDC at INSTALL time, not via a
# post-deploy in-place `--chaos` redeploy. The orchestrator provisions the per-run realm on the
# live-warm keycloak BEFORE the single `abra app deploy`, and tests/lasuite-drive/install_steps.sh
# writes the OIDC env + client secret into the .env that one deploy reads. This eliminates the flaky
# 12-service reconverge (collabora WOPI-discovery race; JOURNAL Step 0). Drive boots fine with OIDC
# env set because keycloak is live-warm (discovery reachable at boot). setup_custom_tests.sh now
# only triggers the post-deploy MinIO bucket one-shot.
OIDC_AT_INSTALL = True
# OIDC is wired at INSTALL time (the only deps mode since rcust P2b; Q3.2a pioneered it here):
# the orchestrator provisions the per-run realm on the live-warm keycloak BEFORE the single
# `abra app deploy`, and tests/lasuite-drive/install_steps.sh writes the OIDC env + client secret
# into the .env that one deploy reads. No post-deploy reconverge (the flaky 12-service collabora
# WOPI race is structurally gone). The post-deploy MinIO bucket one-shot lives in ops.py
# pre_install (the former setup_custom_tests.sh, deleted in P2b).
def READY_PROBE(domain):
def READY_PROBE(ctx):
"""Readiness signals beyond replica-convergence + the app HEALTH_PATH (Q3.2/F2-12). collabora's
coolwsd reports its container 1/1 'running' while still doing jail/config init, and its WOPI
discovery endpoint 404s until ready — so the harness waits for `/hosting/discovery` → 200 on the
collabora sibling host after the install deploy AND after the upgrade chaos redeploy. This is what
makes the heavy prev→PR-head crossover reliably green (the new collabora 25.04.9.x finishes init
within swarm's healthcheck retries; abra's own converge monitor was too impatient — F2-12)."""
label, _, rest = domain.partition(".")
return [{"host": f"collabora-{domain}", "path": "/hosting/discovery", "ok": (200,)}]
label, _, rest = ctx.domain.partition(".")
return [{"host": f"collabora-{ctx.domain}", "path": "/hosting/discovery", "ok": (200,)}]
def EXTRA_ENV(domain):
def EXTRA_ENV(ctx):
# Two of lasuite-drive's services route on DOMAIN-DERIVED **nested** subdomains —
# `MINIO_DOMAIN="minio.${DOMAIN}"` and `COLLABORA_DOMAIN="collabora.${DOMAIN}"`. The cc-ci
# wildcard TLS cert is `*.ci.commoninternet.net` (single label only), so a 2-label name like
@ -55,8 +52,8 @@ def EXTRA_ENV(domain):
# no cert/gateway change. See DECISIONS.md "Phase 2 — nested DOMAIN-derived subdomains".
# `AWS_S3_DOMAIN_REPLACE` derives from MINIO_DOMAIN in-compose, so setting MINIO_DOMAIN is enough.
return {
"MINIO_DOMAIN": f"minio-{domain}",
"COLLABORA_DOMAIN": f"collabora-{domain}",
"MINIO_DOMAIN": f"minio-{ctx.domain}",
"COLLABORA_DOMAIN": f"collabora-{ctx.domain}",
# abra's internal per-deploy convergence timeout (recipe TIMEOUT env, default 300s) is too
# short for this 12-service stack on a cold image cache (impress frontend/backend, minio,
# postgres, redis, collabora ~1GB, onlyoffice ~2GB). Bump so abra waits long enough for

View File

@ -1,39 +0,0 @@
#!/usr/bin/env bash
# lasuite-drive — POST-DEPLOY setup hook (Phase 2 Q3.2a).
#
# As of Q3.2a (plan-lasuite-drive-oidc-robustness.md Part A) OIDC is wired at INSTALL time by
# tests/lasuite-drive/install_steps.sh (before the single `abra app deploy`), so this hook NO LONGER
# does any OIDC env wiring or in-place redeploy — that eliminated the flaky 12-service reconverge
# (collabora WOPI race; see JOURNAL Step 0). What remains here is the ONE post-deploy step that
# genuinely needs the live stack: triggering the MinIO bucket-creation one-shot. The orchestrator
# runs this only on the install-time path AFTER the deploy is healthy (deps already provisioned).
#
# Env supplied by the orchestrator:
# CCCI_APP_DOMAIN — the running per-run lasuite-drive app domain
# CCCI_DEPS_FILE — JSON deps creds dict (unused here now; OIDC handled at install)
set -euo pipefail
: "${CCCI_APP_DOMAIN:?missing}"
# The deploy alone does NOT create the MinIO bucket — `minio-createbuckets` is a `replicas:0`
# one-shot (restart_policy: none) that must be triggered. The MinIO storage test asserts the bucket
# exists, so create it here. `--detach` is REQUIRED: the job creates the bucket then EXITS 0, so it
# never holds a steady 1/1 replica; a blocking `docker service scale ...=1` would wait forever and
# hang the run. With `--detach` the scale just submits the one-run and returns; the poll loop below
# confirms the bucket was actually created.
STACK=$(printf '%s' "$CCCI_APP_DOMAIN" | tr '.' '_')
echo " setup: creating MinIO bucket via the minio-createbuckets one-shot (scale 0->1)"
docker service scale --detach "${STACK}_minio-createbuckets=1" >/dev/null 2>&1 || true
# Wait up to 90s for the one-shot to create the bucket (mc mb drive/drive-media-storage; exit 0).
# Poll by checking the bucket directly from the running minio replica container.
for i in $(seq 1 30); do
MC_CID=$(docker ps -q -f "name=${STACK}_minio.1" | head -1)
if [ -n "$MC_CID" ] && docker exec "$MC_CID" sh -c \
'mc alias set _c http://localhost:9000 "$(cat /run/secrets/minio_ru)" "$(cat /run/secrets/minio_rp)" >/dev/null 2>&1 && mc ls _c/drive-media-storage >/dev/null 2>&1'; then
echo " setup: bucket drive-media-storage present after ${i} poll(s)"
break
fi
sleep 3
done
echo " lasuite-drive setup_custom_tests: post-deploy MinIO bucket step complete (OIDC wired at install)"

View File

@ -36,8 +36,8 @@ def _b64url(seg: str) -> bytes:
return base64.urlsafe_b64decode(seg + "=" * ((4 - len(seg) % 4) % 4))
def _creds(deps_creds: dict) -> dict:
kc = deps_creds["keycloak"]
def _creds(deps: dict) -> dict:
kc = deps["keycloak"]
return {
"provider": "keycloak",
"provider_domain": kc["domain"],
@ -55,10 +55,10 @@ def _creds(deps_creds: dict) -> dict:
@pytest.mark.requires_deps
def test_create_room_get_livekit_token_and_read_back(live_app, deps_creds):
assert "keycloak" in deps_creds, f"keycloak creds missing; got {list(deps_creds.keys())}"
def test_create_room_get_livekit_token_and_read_back(live_app, deps):
assert "keycloak" in deps, f"keycloak creds missing; got {list(deps.keys())}"
base = f"https://{live_app}"
token = sso.oidc_password_grant(_creds(deps_creds))
token = sso.oidc_password_grant(_creds(deps))
assert isinstance(token, str) and token.count(".") == 2, "OIDC access token is not a JWT"
auth = {"Authorization": f"Bearer {token}"}

View File

@ -3,12 +3,12 @@
Meet (La Suite Meet) is OIDC-required: login is gated by an external OpenID Connect provider.
Mirrors the proven lasuite-docs SSO model:
- The orchestrator deploys a per-run keycloak dep AFTER the generic tiers and provisions a fresh
realm/client/user via `harness.sso.setup_keycloak_realm`; `setup_custom_tests.sh` then wires the
realm/client/user via `harness.sso.setup_keycloak_realm`; `install_steps.sh` then wires the
OIDC env + client secret into the running drive app and redeploys. Creds land in `$CCCI_DEPS_FILE`
(read here via the `deps_creds` fixture).
(read here via the `deps` fixture).
- This test consumes those creds and exercises the real OIDC flow against the dep keycloak: discovery
endpoint advertises the realm, and a password grant yields a valid JWT with the expected claims.
- Marked `@pytest.mark.requires_deps` so if setup_custom_tests failed the test SKIPs with a clear
- Marked `@pytest.mark.requires_deps` so if dep provisioning failed the test SKIPs with a clear
`deps-not-ready` reason — and (per F2-11) the orchestrator then fails the run rather than going
green on a skipped SSO test.
@ -36,13 +36,13 @@ def _b64url_decode(seg: str) -> bytes:
@pytest.mark.requires_deps
def test_oidc_password_grant_against_dep_keycloak(live_app, deps_creds):
def test_oidc_password_grant_against_dep_keycloak(live_app, deps):
"""The dep keycloak issues a JWT for the pre-provisioned test user via OIDC password grant."""
assert "keycloak" in deps_creds, (
f"keycloak creds not in deps_creds; got {list(deps_creds.keys())}. "
"setup_custom_tests should have populated this."
assert "keycloak" in deps, (
f"keycloak creds not in deps; got {list(deps.keys())}. "
"dep provisioning should have populated this."
)
kc = deps_creds["keycloak"]
kc = deps["keycloak"]
# Creds shape. WC1: realm is per-run namespaced "<parent>-<6hex>"; client_id stays the parent.
assert kc["domain"]

View File

@ -4,7 +4,8 @@
# Runs during the install tier AFTER `abra app new` + EXTRA_ENV + `abra app secret generate`, and
# BEFORE the single `abra app deploy` (lifecycle.py::_run_install_steps). Writing OIDC env + the real
# client secret HERE means the recipe deploys ONCE with OIDC already wired — no post-deploy reconverge
# (OIDC_AT_INSTALL). The orchestrator provisions the per-run realm/client on the live-warm keycloak
# (install-time deps wiring — the only mode since rcust P2b). The orchestrator provisions the
# per-run realm/client on the live-warm keycloak
# BEFORE this hook and writes $CCCI_DEPS_FILE (the recipe→creds dict).
#
# Meet's OIDC is REQUIRED (recipe README). Same La Suite/impress env contract as drive, with meet's

View File

@ -27,18 +27,18 @@ def _seed(domain, value):
assert _psql(domain, "SELECT v FROM ci_marker;") == value
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# drop the marker table (diverge from the backup) so a successful restore is observable
_psql(domain, "DROP TABLE ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
_psql(ctx.domain, "DROP TABLE ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"

View File

@ -13,16 +13,15 @@ HEALTH_OK = (200, 301, 302)
DEPLOY_TIMEOUT = 1200
HTTP_TIMEOUT = 600
# SSO-dependent (recipe.toml requires=["keycloak"], [sso] provider=keycloak). Wire OIDC at INSTALL
# time against the live-warm keycloak — same machinery as lasuite-drive (Q3.2a): the orchestrator
# provisions the per-run realm BEFORE the single `abra app deploy`, and tests/lasuite-meet/
# install_steps.sh writes the OIDC env + client secret into that one deploy (no post-deploy
# reconverge). Meet boots fine with OIDC env set because keycloak is live-warm.
# SSO-dependent (recipe.toml requires=["keycloak"], [sso] provider=keycloak). OIDC is wired at
# INSTALL time (the only deps mode since rcust P2b) against the live-warm keycloak: the
# orchestrator provisions the per-run realm BEFORE the single `abra app deploy`, and
# tests/lasuite-meet/install_steps.sh writes the OIDC env + client secret into that one deploy
# (no post-deploy reconverge). Meet boots fine with OIDC env set because keycloak is live-warm.
DEPS = ["keycloak"]
OIDC_AT_INSTALL = True
def EXTRA_ENV(domain):
def EXTRA_ENV(ctx):
# lasuite-meet routes LiveKit's WebSocket signaling on a DOMAIN-derived **nested** subdomain
# `LIVEKIT_DOMAIN="livekit.${DOMAIN}"`. The cc-ci wildcard TLS cert is `*.ci.commoninternet.net`
# (single label only), so a 2-label name like `livekit.lasuite-meet-pr0-abc.ci.commoninternet.net`
@ -31,7 +30,7 @@ def EXTRA_ENV(domain):
# no cert/gateway change. Same fix as lasuite-drive's minio/collabora siblings (DECISIONS.md
# "Phase 2 — nested DOMAIN-derived subdomains").
return {
"LIVEKIT_DOMAIN": f"livekit-{domain}",
"LIVEKIT_DOMAIN": f"livekit-{ctx.domain}",
# abra's internal per-deploy convergence TIMEOUT (default 300s) is too short for this stack on
# a cold image cache; bump it (kept under DEPLOY_TIMEOUT so Python never kills abra mid-wait).
"TIMEOUT": "1000",

View File

@ -21,10 +21,10 @@ DEPLOY_TIMEOUT = 900
HTTP_TIMEOUT = 600
def EXTRA_ENV(domain):
def EXTRA_ENV(ctx):
return {
"MAIL_DOMAIN": domain,
"HOSTNAMES": domain,
"MAIL_DOMAIN": ctx.domain,
"HOSTNAMES": ctx.domain,
"TRAEFIK_STACK_NAME": "traefik_ci_commoninternet_net",
"TLS_FLAVOR": "notls",
"SITENAME": "ccci-mail",

View File

@ -24,18 +24,18 @@ def _seed(domain, value):
assert _psql(domain, "SELECT v FROM ci_marker;") == value
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# drop the marker table (diverge from the backup) so a successful restore is observable
_psql(domain, "DROP TABLE ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
_psql(ctx.domain, "DROP TABLE ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"

View File

@ -29,18 +29,18 @@ def _seed(domain, value):
assert _psql(domain, "SELECT v FROM ci_marker;") == value
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# drop the marker table (diverge from the backup) so a successful restore is observable
_psql(domain, "DROP TABLE ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
_psql(ctx.domain, "DROP TABLE ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"

View File

@ -26,9 +26,9 @@ def test_configured_max_users_surfaces_in_serverconfig(live_app):
assert r["server_sync"], f"ServerSync handshake did not complete — {r.get('error')}"
cfg = r["server_config"]
assert cfg, f"server did not send a ServerConfig message — {r!r}"
assert cfg.get("max_users") == recipe_meta.MAX_USERS, (
assert cfg.get("max_users") == recipe_meta._MAX_USERS, (
f"ServerConfig.max_users={cfg.get('max_users')!r} does not match the configured "
f"USERS={recipe_meta.MAX_USERS} — deploy-time server-limit config did not propagate"
f"USERS={recipe_meta._MAX_USERS} — deploy-time server-limit config did not propagate"
)
# allow_html defaults true in the recipe; assert it is present/boolean to prove the field set
# is the real ServerConfig (not an empty/garbled decode).

View File

@ -20,7 +20,7 @@ import recipe_meta # noqa: E402
def test_configured_welcome_text_surfaces_in_serversync(live_app):
marker = recipe_meta.WELCOME_TEXT_MARKER
marker = recipe_meta._WELCOME_TEXT_MARKER
r = _mumble_proto.retry_handshake(attempts=12, interval=5.0)
assert r["server_sync"], f"ServerSync handshake did not complete — {r.get('error')}"

View File

@ -38,16 +38,18 @@ def _seed(domain, value):
assert got == value, f"seed did not commit (read back {got!r}, expected {value!r})"
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# diverge from the backup so a successful restore is observable: drop the marker table.
_sqlite(domain, "DROP TABLE IF EXISTS ci_marker;")
got = _sqlite(domain, "SELECT name FROM sqlite_master WHERE type='table' AND name='ci_marker';")
_sqlite(ctx.domain, "DROP TABLE IF EXISTS ci_marker;")
got = _sqlite(
ctx.domain, "SELECT name FROM sqlite_master WHERE type='table' AND name='ci_marker';"
)
assert got == "", f"drop did not take (sqlite_master still lists ci_marker: {got!r})"

View File

@ -31,18 +31,19 @@ HEALTH_OK = (200,)
DEPLOY_TIMEOUT = 900 # two images to pull (mumble-server + mumble-web) on a cold node
HTTP_TIMEOUT = 300
# A unique, stable welcome-text marker the round-trip test asserts surfaces over the protocol.
WELCOME_TEXT_MARKER = "cc-ci-mumble-welcome-7f3a9c"
# A unique, stable welcome-text marker the round-trip test asserts surfaces over the protocol
# (underscore prefix = recipe-private constant, exempt from registry validation — rcust P1).
_WELCOME_TEXT_MARKER = "cc-ci-mumble-welcome-7f3a9c"
# A distinctive max-users value (not the recipe default 100) the server_config test asserts.
MAX_USERS = 42
_MAX_USERS = 42
# BASE deploy (0.2.0): mumble-web only — NO host-ports (0.2.0 predates it). The voice-config env is
# set here and persists across the upgrade so it takes effect on the latest (where the custom config
# round-trip tests assert it).
EXTRA_ENV = {
"COMPOSE_FILE": "compose.yml:compose.mumbleweb.yml",
"WELCOME_TEXT": WELCOME_TEXT_MARKER,
"USERS": str(MAX_USERS),
"WELCOME_TEXT": _WELCOME_TEXT_MARKER,
"USERS": str(_MAX_USERS),
}
# UPGRADE-target deploy (latest 1.0.0+): add the NATIVE compose.host-ports.yml so 64738 is
@ -52,7 +53,7 @@ UPGRADE_EXTRA_ENV = {
}
def READY_PROBE(domain):
def READY_PROBE(ctx):
# The voice server on 64738 is testable on-host ONLY when compose.host-ports.yml is active — i.e.
# the post-upgrade LATEST, not the minimal 0.2.0 base. Read the live COMPOSE_FILE to decide, so the
# SAME probe fn is correct in both phases: the post-install probe (base, no host-ports) returns []
@ -63,7 +64,7 @@ def READY_PROBE(domain):
# backup-bot would then exec into a not-running app container -> 409).
from harness import abra # lazy: recipe_meta is exec'd with `harness` importable at call time
cf = abra.env_get(domain, "COMPOSE_FILE") or ""
cf = abra.env_get(ctx.domain, "COMPOSE_FILE") or ""
if "compose.host-ports.yml" in cf:
return [{"tcp_host": "127.0.0.1", "tcp_port": 64738, "stable": 3}]
return []

View File

@ -15,13 +15,13 @@ def _write(domain, val):
lifecycle.exec_in_app(domain, ["sh", "-c", f"echo {val} > {MARKER}"])
def pre_upgrade(domain, meta):
_write(domain, "upgrade-survives")
def pre_upgrade(ctx):
_write(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_write(domain, "original")
def pre_backup(ctx):
_write(ctx.domain, "original")
def pre_restore(domain, meta):
_write(domain, "mutated") # diverge so a successful restore is observable
def pre_restore(ctx):
_write(ctx.domain, "mutated") # diverge so a successful restore is observable

View File

@ -24,17 +24,17 @@ def _seed(domain, value):
assert _psql(domain, "SELECT v FROM ci_marker;") == value
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
_psql(domain, "DROP TABLE ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
def pre_restore(ctx):
_psql(ctx.domain, "DROP TABLE ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"

View File

@ -13,6 +13,7 @@ import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import canonical, warm # noqa: E402
from harness import meta as harness_meta # noqa: E402
def test_canonical_domain():
@ -33,11 +34,9 @@ def test_is_enrolled_reads_flag(tmp_path, monkeypatch):
tests_dir = tmp_path / "tests" / recipe
tests_dir.mkdir(parents=True)
(tests_dir / "recipe_meta.py").write_text("WARM_CANONICAL = True\n")
# canonical.is_enrolled builds the path from canonical.__file__/../../tests/<recipe>; emulate by
# creating the layout under a fake harness dir and pointing __file__ there.
fake_harness = tmp_path / "runner" / "harness"
fake_harness.mkdir(parents=True)
monkeypatch.setattr(canonical, "__file__", str(fake_harness / "canonical.py"))
# is_enrolled reads through the single meta loader (rcust P1); point its tests/ root at the
# temp layout.
monkeypatch.setattr(harness_meta, "TESTS_DIR", str(tmp_path / "tests"))
assert canonical.is_enrolled(recipe) is True
(tests_dir / "recipe_meta.py").write_text("WARM_CANONICAL = False\n")
assert canonical.is_enrolled(recipe) is False
@ -65,9 +64,7 @@ def test_registry_roundtrip(tmp_path, monkeypatch):
def test_enrolled_recipes_scans_meta(tmp_path, monkeypatch):
# enrolled_recipes() lists recipes whose tests/<r>/recipe_meta.py sets WARM_CANONICAL=True.
fake_harness = tmp_path / "runner" / "harness"
fake_harness.mkdir(parents=True)
monkeypatch.setattr(canonical, "__file__", str(fake_harness / "canonical.py"))
monkeypatch.setattr(harness_meta, "TESTS_DIR", str(tmp_path / "tests"))
for name, body in (
("aaa", "WARM_CANONICAL = True\n"),
("bbb", "DEPS=['x']\n"),

View File

@ -0,0 +1,48 @@
"""Unit tests for the shared conftest fixtures added/reshaped by the rcust restructure (P2d/P4):
`op_state` (run-scoped op context from $CCCI_OP_STATE_FILE) and `deps` (consolidated dep creds
with attribute sugar). Pure — exercised via request.getfixturevalue with env monkeypatched."""
from __future__ import annotations
import json
import pytest
def test_op_state_fixture_reads_file(tmp_path, monkeypatch, request):
f = tmp_path / "op.json"
f.write_text(json.dumps({"backup": {"snapshot_id": "abc123"}, "upgrade": {"head_ref": "h"}}))
monkeypatch.setenv("CCCI_OP_STATE_FILE", str(f))
st = request.getfixturevalue("op_state")
assert st["backup"]["snapshot_id"] == "abc123"
assert st["upgrade"]["head_ref"] == "h"
def test_op_state_fixture_skips_without_env(monkeypatch, request):
monkeypatch.delenv("CCCI_OP_STATE_FILE", raising=False)
with pytest.raises(pytest.skip.Exception, match="orchestrator"):
request.getfixturevalue("op_state")
def test_op_state_fixture_skips_on_missing_file(tmp_path, monkeypatch, request):
monkeypatch.setenv("CCCI_OP_STATE_FILE", str(tmp_path / "nope.json"))
with pytest.raises(pytest.skip.Exception, match="missing"):
request.getfixturevalue("op_state")
def test_deps_fixture_entries_expose_attributes(tmp_path, monkeypatch, request):
"""`deps` (session-scoped) coerces the run deps file into entries with .domain/.realm/...
attribute sugar while keeping dict-style access (rcust P2d). Single test for the session-
cached fixture (one instantiation)."""
f = tmp_path / "deps.json"
f.write_text(
json.dumps(
{"keycloak": {"recipe": "keycloak", "domain": "kc.x", "client_secret": "s3cret"}}
)
)
monkeypatch.setenv("CCCI_DEPS_FILE", str(f))
deps = request.getfixturevalue("deps")
assert deps["keycloak"].domain == "kc.x"
assert deps["keycloak"]["client_secret"] == "s3cret"
with pytest.raises(AttributeError):
_ = deps["keycloak"].not_a_field

View File

@ -1,9 +1,9 @@
"""Unit tests for runner/harness/deps.py (Phase 2 §4.2 / Q2.3).
Pure-Python: no real deploys. Tests the declarative parts of the dep resolver — declared_deps
reading from `tests/<recipe>/recipe_meta.py`, the per-dep domain derivation, and write/load of the
run state file. The deploy_deps + teardown_deps integration is exercised by real e2e against cc-ci
(Q2.4 acceptance).
Pure-Python: no real deploys. Tests the declarative parts of the dep resolver — DEPS declaration
(read through the single meta loader since rcust P1), the per-dep domain derivation, and write/load
of the run state file. The deploy_deps + teardown_deps integration is exercised by real e2e against
cc-ci (Q2.4 acceptance).
"""
from __future__ import annotations
@ -13,42 +13,23 @@ import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import deps # noqa: E402
from harness import meta as meta_mod # noqa: E402
def test_declared_deps_returns_empty_for_no_meta(monkeypatch, tmp_path):
"""A recipe with no recipe_meta.py returns []."""
fake_recipe = "ccci-no-meta"
# No file at tests/<fake_recipe>/recipe_meta.py -> declared_deps reads nothing -> []
monkeypatch.chdir(tmp_path)
assert deps.declared_deps(fake_recipe) == []
def test_declared_deps_empty_for_no_meta(monkeypatch, tmp_path):
"""A recipe with no recipe_meta.py declares no deps (rcust P1: DEPS via meta.load)."""
monkeypatch.setattr(meta_mod, "TESTS_DIR", str(tmp_path / "tests"))
assert meta_mod.load("ccci-no-meta").DEPS == []
def test_declared_deps_reads_DEPS_list(tmp_path, monkeypatch):
"""A recipe_meta.py with `DEPS = [...]` returns the list."""
fake_recipe = "ccci-with-deps"
# Build a fake repo layout under tmp_path
recipe_dir = tmp_path / "tests" / fake_recipe
"""A recipe_meta.py with `DEPS = [...]` surfaces the list on the loaded meta (the orchestrator
reads meta.DEPS — the successor of the deleted deps.declared_deps loader)."""
recipe_dir = tmp_path / "tests" / "ccci-with-deps"
recipe_dir.mkdir(parents=True)
(recipe_dir / "recipe_meta.py").write_text('HEALTH_PATH = "/"\nDEPS = ["keycloak", "redis"]\n')
# Patch the deps module's idea of "where the repo is" by monkey-patching __file__ for the
# function indirectly: declared_deps uses `os.path.dirname(__file__), "..", "..", "tests"` —
# which resolves to the real repo's `tests/`. So instead, override that with a symlink/dir
# under tmp_path: deps.__file__ points at the runner module. We can't easily relocate that.
# Instead, mock the path by writing the fake recipe under the REAL tests/ dir.
real_tests = os.path.join(os.path.dirname(deps.__file__), "..", "..", "tests")
target_dir = os.path.join(real_tests, fake_recipe)
os.makedirs(target_dir, exist_ok=True)
target_meta = os.path.join(target_dir, "recipe_meta.py")
try:
with open(target_meta, "w") as f:
f.write('DEPS = ["keycloak", "redis"]\n')
result = deps.declared_deps(fake_recipe)
assert result == ["keycloak", "redis"]
finally:
if os.path.exists(target_meta):
os.remove(target_meta)
if os.path.isdir(target_dir):
os.rmdir(target_dir)
monkeypatch.setattr(meta_mod, "TESTS_DIR", str(tmp_path / "tests"))
assert meta_mod.load("ccci-with-deps").DEPS == ["keycloak", "redis"]
def test_dep_domain_distinct_per_dep():

View File

@ -71,17 +71,18 @@ def test_repo_local_wins_when_approved(tmp_path):
def test_custom_tests_repo_local_gated(tmp_path, monkeypatch):
# non-lifecycle test_*.py from repo-local only count for approved recipes; lifecycle names excluded
# custom test_*.py from repo-local only count for approved recipes (HC2); placement rule
# (rcust P4): custom tests live under functional/ (or playwright/) — top-level files are
# lifecycle overlays only, so the repo-local custom here sits in functional/.
# Use a synthetic recipe name + monkeypatched cc_ci_dir so this is independent of what
# tests/<real-recipe>/ ships (Phase-2 custom-html now also ships functional/ + playwright/,
# which would legitimately appear in custom_tests for "custom-html" — F2-1).
# tests/<real-recipe>/ ships (F2-1).
fake_recipe = "ccci-hc2-fixture"
monkeypatch.setattr(discovery, "cc_ci_dir", lambda r: str(tmp_path / "cc-ci" / r))
(tmp_path / "cc-ci" / fake_recipe).mkdir(parents=True)
rl = tmp_path / "repo"
rl.mkdir()
(rl / "test_sso.py").write_text("# repo-local custom\n")
(rl / "test_install.py").write_text("# lifecycle name -> excluded from custom\n")
(rl / "functional").mkdir(parents=True)
(rl / "functional" / "test_sso.py").write_text("# repo-local custom\n")
(rl / "functional" / "test_install.py").write_text("# lifecycle name -> excluded from custom\n")
_approve(tmp_path) # not approved -> repo-local custom ignored
assert discovery.custom_tests(fake_recipe, str(rl)) == []

View File

@ -1,6 +1,6 @@
"""Unit tests for Phase-2 discovery additions (plan §4.1).
Proves the `custom_tests` discovery recurses into the per-recipe `functional/` + `playwright/`
Proves the `custom_tests` discovery covers exactly the per-recipe `functional/` + `playwright/`
subdirs as well as the top-level dir, while still excluding lifecycle `test_<op>.py` names and
honouring the HC2 repo-local approval gate.
@ -27,16 +27,16 @@ def teardown_function():
os.environ.pop("CCCI_REPO_LOCAL_APPROVED_FILE", None)
def test_custom_tests_recurses_functional_and_playwright(tmp_path, monkeypatch):
"""A Phase-2 cc-ci recipe layout: functional/test_*.py + playwright/test_*.py + top-level
test_*.py — all are discovered as custom tests; the lifecycle names are excluded."""
def test_custom_tests_placement_rule_functional_playwright_only(tmp_path, monkeypatch):
"""Placement rule (rcust P4): custom tests are discovered ONLY under functional/ +
playwright/. A top-level non-lifecycle test_*.py is NOT discovered (top level is reserved
for lifecycle overlays); lifecycle names inside the subdirs stay excluded (defensive)."""
# Point cc-ci's per-recipe dir at a fake recipe in tmp_path
fake_recipe = "ccci-phase2-fixture"
fake_dir = tmp_path / "tests" / fake_recipe
(fake_dir / "functional").mkdir(parents=True)
(fake_dir / "playwright").mkdir()
# legitimate custom tests at multiple levels
(fake_dir / "test_sso_smoke.py").write_text("# top-level cross-cutting\n")
(fake_dir / "test_sso_smoke.py").write_text("# top-level — NOT discovered since P4\n")
(fake_dir / "functional" / "test_health_check.py").write_text("# parity port\n")
(fake_dir / "functional" / "test_content_roundtrip.py").write_text("# recipe-specific\n")
(fake_dir / "playwright" / "test_login_flow.py").write_text("# UI flow\n")
@ -49,11 +49,11 @@ def test_custom_tests_recurses_functional_and_playwright(tmp_path, monkeypatch):
customs = discovery.custom_tests(fake_recipe, None)
names = sorted((src, os.path.basename(p)) for src, p in customs)
# Top-level + functional/ + playwright/ all discovered; lifecycle name excluded
assert ("cc-ci", "test_sso_smoke.py") in names
# functional/ + playwright/ discovered; top-level custom + lifecycle name are NOT
assert ("cc-ci", "test_health_check.py") in names
assert ("cc-ci", "test_content_roundtrip.py") in names
assert ("cc-ci", "test_login_flow.py") in names
assert ("cc-ci", "test_sso_smoke.py") not in names
assert ("cc-ci", "test_install.py") not in names

View File

@ -30,7 +30,7 @@ def test_sso_dep_unverified_true_when_declared_notready_and_skipped():
def test_sso_dep_unverified_false_when_deps_ready():
"""deps ready (setup_custom_tests succeeded) → SSO tests actually ran → not a failure."""
"""deps ready (dep provisioning succeeded) → SSO tests actually ran → not a failure."""
assert not run_recipe_ci.sso_dep_unverified(
["keycloak"], deps_ready=True, requires_deps_skipped=0
)

View File

@ -14,6 +14,7 @@ So `-c` + owned-wait is non-vacuous: a genuinely-broken upgrade stays RED.
from __future__ import annotations
import dataclasses
import os
import sys
@ -21,6 +22,7 @@ import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import lifecycle as lc # noqa: E402
from harness import meta as harness_meta # noqa: E402
def _fake_clock(monkeypatch):
@ -31,11 +33,15 @@ def _fake_clock(monkeypatch):
return state
_DRIVE_META = {
"READY_PROBE": lambda d: [
{"host": f"collabora-{d}", "path": "/hosting/discovery", "ok": (200,)}
]
}
# RecipeMeta (rcust P1: wait_ready_probes reads meta.READY_PROBE off the loaded object); defaults
# + the drive-style probe hook (P3 ctx signature: the probe receives a HookCtx).
_DRIVE_META = dataclasses.replace(
harness_meta.load("ccci-no-such-recipe"),
READY_PROBE=lambda ctx: [
{"host": f"collabora-{ctx.domain}", "path": "/hosting/discovery", "ok": (200,)}
],
)
_NO_PROBE_META = harness_meta.load("ccci-no-such-recipe")
def test_wait_ready_probes_raises_when_never_ready(monkeypatch):
@ -57,7 +63,7 @@ def test_wait_ready_probes_returns_when_ready(monkeypatch):
def test_wait_ready_probes_noop_without_probe(monkeypatch):
"""A recipe with no READY_PROBE is a clean no-op (default behavior preserved for all recipes)."""
monkeypatch.setattr(lc, "http_get", lambda *a, **k: 599) # would fail if it were consulted
lc.wait_ready_probes({}, "x.ci.commoninternet.net", timeout=1) # no raise, no call
lc.wait_ready_probes(_NO_PROBE_META, "x.ci.commoninternet.net", timeout=1) # no raise, no call
def test_wait_healthy_raises_when_services_never_converge(monkeypatch):

276
tests/unit/test_meta.py Normal file
View File

@ -0,0 +1,276 @@
"""Unit tests for the single recipe-meta loader + key registry (rcust P1; spec §8 R1/R6).
Covers: every in-repo recipe_meta.py loads clean through the registry (THE typo gate), validation
hard-errors (unknown key, wrong type, callable on a data key), the zero-config baseline defaults
(spec §2), the underscore exemption for recipe-private constants, and the registry↔generated-doc
sync (P1.5; drift fails CI). Run: cc-ci-run -m pytest tests/unit/test_meta.py -q
"""
from __future__ import annotations
import os
import subprocess
import sys
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import meta as meta_mod # noqa: E402
from harness.meta import KEYS, MetaError, RecipeMeta # noqa: E402
ROOT = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
def _recipes_with_meta() -> list[str]:
tests_dir = os.path.join(ROOT, "tests")
return sorted(
n
for n in os.listdir(tests_dir)
if os.path.isfile(os.path.join(tests_dir, n, "recipe_meta.py"))
)
# ---- the typo gate: every in-repo recipe meta must validate against the registry --------------
@pytest.mark.parametrize("recipe", _recipes_with_meta())
def test_every_recipe_meta_loads_clean(recipe):
"""All tests/*/recipe_meta.py in the repo load + validate through the registry. A typo'd or
unregistered ALL-CAPS key in any recipe meta fails HERE, at PR time — not silently at run
time (the R6 failure mode this restructure kills)."""
meta = meta_mod.load(recipe)
assert isinstance(meta, RecipeMeta)
# sanity: the 4 base keys always materialize with usable types
assert isinstance(meta.HEALTH_PATH, str)
assert isinstance(meta.HEALTH_OK, tuple) and meta.HEALTH_OK
assert isinstance(meta.DEPLOY_TIMEOUT, int) and isinstance(meta.HTTP_TIMEOUT, int)
# ---- zero-config baseline (spec §2) ------------------------------------------------------------
def test_missing_meta_yields_spec_baseline(tmp_path):
meta = meta_mod.load("no-such-recipe", tests_dir=str(tmp_path))
assert meta.HEALTH_PATH == "/"
assert meta.HEALTH_OK == (200, 301, 302)
assert meta.DEPLOY_TIMEOUT == 600
assert meta.HTTP_TIMEOUT == 300
assert meta.BACKUP_CAPABLE is None # None = auto-detect (tri-state, not False)
assert meta.EXPECTED_NA is None
assert meta.READY_PROBE is None
assert meta.UPGRADE_BASE_VERSION is None
assert meta.BACKUP_VERIFY is None
assert meta.UPGRADE_EXTRA_ENV is None
assert meta.EXTRA_ENV == {}
assert meta.DEPS == []
assert meta.WARM_CANONICAL is False
assert meta.SCREENSHOT is None
assert meta_mod.non_default(meta) == {}
def test_registry_field_set_matches_dataclass():
"""The RecipeMeta field set is generated from KEYS — no drift possible, pinned anyway."""
import dataclasses
assert [f.name for f in dataclasses.fields(RecipeMeta)] == [k.name for k in KEYS]
# the 14 final keys, no more (the 3 P2-deleted legacy keys are gone from the registry,
# so any recipe_meta still setting them hard-fails the typo gate)
assert len(KEYS) == 14
assert not [k for k in KEYS if k.deprecated]
for gone in ("CHAOS_BASE_DEPLOY", "OIDC_AT_INSTALL", "SKIP_GENERIC"):
assert gone not in {k.name for k in KEYS}
# ---- validation hard errors (locked decision: fail fast at load) -------------------------------
def _write_meta(tmp_path, body: str, recipe: str = "r") -> str:
d = tmp_path / recipe
d.mkdir(exist_ok=True)
(d / "recipe_meta.py").write_text(body)
return recipe
def test_unknown_key_raises_with_suggestion(tmp_path):
r = _write_meta(tmp_path, "READINESS_PROBE = None\n") # the R6 typo example
with pytest.raises(MetaError) as ei:
meta_mod.load(r, tests_dir=str(tmp_path))
msg = str(ei.value)
assert "READINESS_PROBE" in msg and "READY_PROBE" in msg # names the typo + nearest key
def test_unknown_key_without_near_match_lists_registry(tmp_path):
r = _write_meta(tmp_path, "TOTALLY_BOGUS_KNOB = 1\n")
with pytest.raises(MetaError) as ei:
meta_mod.load(r, tests_dir=str(tmp_path))
assert "HEALTH_PATH" in str(ei.value) # registered keys listed for the reader
def test_wrong_type_raises(tmp_path):
r = _write_meta(tmp_path, 'DEPLOY_TIMEOUT = "900"\n')
with pytest.raises(MetaError, match="DEPLOY_TIMEOUT"):
meta_mod.load(r, tests_dir=str(tmp_path))
def test_bool_not_accepted_as_int(tmp_path):
r = _write_meta(tmp_path, "DEPLOY_TIMEOUT = True\n")
with pytest.raises(MetaError, match="DEPLOY_TIMEOUT"):
meta_mod.load(r, tests_dir=str(tmp_path))
def test_callable_on_data_key_rejected(tmp_path):
r = _write_meta(tmp_path, "def HEALTH_PATH():\n return '/'\n")
with pytest.raises(MetaError, match="hook-typed"):
meta_mod.load(r, tests_dir=str(tmp_path))
def test_non_callable_on_hook_key_rejected(tmp_path):
r = _write_meta(tmp_path, "READY_PROBE = ['not', 'a', 'callable']\n")
with pytest.raises(MetaError, match="READY_PROBE"):
meta_mod.load(r, tests_dir=str(tmp_path))
def test_underscore_names_are_private_and_exempt(tmp_path):
r = _write_meta(
tmp_path,
"_WELCOME_TEXT_MARKER = 'marker-xyz'\n_MAX_USERS = 42\n"
"EXTRA_ENV = {'WELCOME_TEXT': _WELCOME_TEXT_MARKER, 'USERS': str(_MAX_USERS)}\n",
)
meta = meta_mod.load(r, tests_dir=str(tmp_path))
assert meta.EXTRA_ENV == {"WELCOME_TEXT": "marker-xyz", "USERS": "42"}
def test_lowercase_helpers_ignored(tmp_path):
r = _write_meta(
tmp_path,
"def _helper(d):\n return {'K': d}\n\ndef EXTRA_ENV(ctx):\n return _helper(ctx.domain)\n",
)
meta = meta_mod.load(r, tests_dir=str(tmp_path))
ctx = meta_mod.hook_ctx("x.example", meta)
assert meta_mod.extra_env(meta, ctx) == {"K": "x.example"}
# ---- normalization + helpers --------------------------------------------------------------------
def test_health_ok_list_normalized_to_tuple(tmp_path):
r = _write_meta(tmp_path, "HEALTH_OK = [200, 302]\n")
assert meta_mod.load(r, tests_dir=str(tmp_path)).HEALTH_OK == (200, 302)
def test_extra_env_dict_and_callable_forms(tmp_path):
r = _write_meta(tmp_path, "EXTRA_ENV = {'A': 1}\n")
meta = meta_mod.load(r, tests_dir=str(tmp_path))
assert meta_mod.extra_env(meta, meta_mod.hook_ctx("d", meta)) == {"A": "1"} # stringified
r2 = _write_meta(
tmp_path, "UPGRADE_EXTRA_ENV = lambda ctx: {'COMPOSE_FILE': ctx.domain}\n", recipe="r2"
)
meta2 = meta_mod.load(r2, tests_dir=str(tmp_path))
ctx2 = meta_mod.hook_ctx("dom.x", meta2, op="upgrade")
assert meta_mod.upgrade_extra_env(meta2, ctx2) == {"COMPOSE_FILE": "dom.x"}
assert meta_mod.extra_env(meta2, ctx2) == {} # unset EXTRA_ENV resolves to {}
# ---- P3: uniform ctx hook convention -------------------------------------------------------------
def test_hook_ctx_fields(tmp_path):
meta = meta_mod.load("no-such", tests_dir=str(tmp_path))
ctx = meta_mod.hook_ctx("app.ci.example", meta, op="backup")
assert ctx.domain == "app.ci.example"
assert ctx.base_url == "https://app.ci.example"
assert ctx.meta is meta
assert ctx.op == "backup"
assert meta_mod.hook_ctx("d", meta).op is None
def test_hook_ctx_deps_from_run_file(tmp_path, monkeypatch):
import json
meta = meta_mod.load("no-such", tests_dir=str(tmp_path))
monkeypatch.delenv("CCCI_DEPS_FILE", raising=False)
assert meta_mod.hook_ctx("d", meta).deps is None
f = tmp_path / "deps.json"
f.write_text(json.dumps({"keycloak": {"recipe": "keycloak", "domain": "kc.x"}}))
monkeypatch.setenv("CCCI_DEPS_FILE", str(f))
deps = meta_mod.hook_ctx("d", meta).deps
assert deps["keycloak"]["domain"] == "kc.x"
f.write_text("{}") # empty dict -> None (deps declared but not provisioned)
assert meta_mod.hook_ctx("d", meta).deps is None
def test_legacy_hook_signature_raises_clear_meta_error(tmp_path):
"""A pre-restructure hook signature must fail AT LOAD with a migration message — never a
silent TypeError mid-run (P3.4)."""
r = _write_meta(tmp_path, "def READY_PROBE(domain):\n return []\n")
with pytest.raises(MetaError, match="ctx"):
meta_mod.load(r, tests_dir=str(tmp_path))
r2 = _write_meta(tmp_path, "EXTRA_ENV = lambda domain: {}\n", recipe="r2")
with pytest.raises(MetaError, match="restructure"):
meta_mod.load(r2, tests_dir=str(tmp_path))
r3 = _write_meta(
tmp_path, "def SCREENSHOT(page, domain, meta):\n return None\n", recipe="r3"
)
with pytest.raises(MetaError, match="page, ctx"):
meta_mod.load(r3, tests_dir=str(tmp_path))
def test_ctx_hook_signatures_accepted(tmp_path):
r = _write_meta(
tmp_path,
"def READY_PROBE(ctx):\n return []\n"
"def BACKUP_VERIFY(ctx):\n return True\n"
"def SCREENSHOT(page, ctx):\n return None\n"
"def EXTRA_ENV(ctx):\n return {}\n",
)
meta = meta_mod.load(r, tests_dir=str(tmp_path))
assert callable(meta.READY_PROBE) and callable(meta.SCREENSHOT)
def test_check_hook_signature_for_pre_op_hooks():
"""The orchestrator validates ops.py pre_<op> hooks with the same checker (legacy
(domain, meta) form names the migration)."""
def legacy(domain, meta):
pass
def new(ctx):
pass
with pytest.raises(MetaError, match="ctx"):
meta_mod.check_hook_signature(legacy, ("ctx",), "tests/x/ops.py::pre_upgrade")
meta_mod.check_hook_signature(new, ("ctx",), "tests/x/ops.py::pre_upgrade") # no raise
def test_non_default_reports_only_customized_keys(tmp_path):
r = _write_meta(tmp_path, "DEPLOY_TIMEOUT = 1500\nDEPS = ['keycloak']\n")
nd = meta_mod.non_default(meta_mod.load(r, tests_dir=str(tmp_path)))
assert nd == {"DEPLOY_TIMEOUT": 1500, "DEPS": ["keycloak"]}
def test_meta_is_frozen():
import dataclasses
meta = meta_mod.load("custom-html")
with pytest.raises(dataclasses.FrozenInstanceError):
meta.DEPLOY_TIMEOUT = 1
# ---- doc generation sync (P1.5: the committed §4 table == the registry rendering) ---------------
def test_generated_doc_table_in_sync():
"""docs/recipe-customization.md's key reference table is GENERATED from the registry
(scripts/gen-meta-docs.py). If this fails: re-run `python3 scripts/gen-meta-docs.py` and
commit the result — the table must never drift from the registry (R5)."""
gen = os.path.join(ROOT, "scripts", "gen-meta-docs.py")
doc = os.path.join(ROOT, "docs", "recipe-customization.md")
rendered = subprocess.run(
[sys.executable, gen, "--print"], capture_output=True, text=True, check=True
).stdout
with open(doc) as f:
committed = f.read()
assert rendered.strip() in committed, (
"docs/recipe-customization.md key table is out of sync with the harness.meta registry — "
"run `python3 scripts/gen-meta-docs.py` and commit"
)

View File

@ -11,6 +11,7 @@ import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import meta as meta_mod # noqa: E402
from harness import screenshot as S # noqa: E402
@ -29,3 +30,19 @@ def test_hook_returned_when_callable():
pass
assert S._load_screenshot_hook({"SCREENSHOT": hook}) is hook
def test_screenshot_reachable_through_real_load_path(tmp_path):
"""R2 proof (rcust P1): a recipe SCREENSHOT hook declared in recipe_meta.py arrives at
screenshot._load_screenshot_hook through the REAL orchestrator load path (meta.load — the
object run_recipe_ci passes to capture()). Under the old six-loader world the orchestrator's
L1 allowlist dropped SCREENSHOT, so the hook was unreachable (spec §8 R2)."""
d = tmp_path / "shotrecipe"
d.mkdir()
(d / "recipe_meta.py").write_text(
"def SCREENSHOT(page, ctx):\n return None\n",
)
meta = meta_mod.load("shotrecipe", tests_dir=str(tmp_path))
hook = S._load_screenshot_hook(meta)
assert callable(hook), "SCREENSHOT hook did not survive the orchestrator load path (R2)"
assert S._load_screenshot_hook(meta_mod.load("no-such", tests_dir=str(tmp_path))) is None