Compare commits

94 Commits

Author SHA1 Message Date
ce50f641cc feat(shot): harness default capture fix — bounded networkidle settle after domcontentloaded + blank-frame retry (≤60s wait budget, R7 best-effort preserved); 6 unit tests; lint PASS, 205 unit tests pass via cc-ci-run
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 01:31:03 +00:00
ae10b553b0 review(shot): M1 PASS — audit matrix 19/19 cold-verified (enrolled set complete, no omissions), all non-OK root-causes evidence-backed (plausible 500-by-design via drone build-357 log; bluesky deploy-gated; BLANK/LOADING=domcontentloaded paint race; mumble NOT N/A via mumble-web), 11 PNGs independently Read incl plausible+multiple 4801B, every matrix read matched reality. N/A args agreed (bluesky justified, mumble denied). No VETO.
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 01:29:55 +00:00
e005897cb9 claim(shot): M1 — audit matrix 19/19 (every PNG visually inspected), all non-OK rows root-caused with evidence (plausible 500-by-design via drone build-357 log; blank/loading = domcontentloaded paint race, 4801B fingerprint; bluesky-pds deploy-gated N/A; mumble NOT N/A), N/A candidates argued
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 01:26:50 +00:00
8978fa6ae3 status(shot): phase open — P1 audit matrix complete (19/19 recipes, every PNG visually inspected) + P2 root causes (plausible /-500s-by-design via build-357 log; blank/loading = domcontentloaded paint race; bluesky-pds deploy-gated; mumble has real web UI; custom-html nginx-welcome is honest fresh-install content)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 01:26:23 +00:00
4f3a74759d review(shot): phase open — independent cold pre-audit ground truth (immich/n8n/cryptpad blank 4801-2B, keycloak/lasuite-docs loading-spinner, plausible null); awaiting M1 claim
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 01:19:52 +00:00
1bcb2ed8fe status(rcust): ## DONE — M1 (01f9f70) + M2 (3245150) both PASS, no VETO; phase complete
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 01:16:27 +00:00
3245150982 review(rcust): M2 PASS — merged-main regression sweep cold-verified. Canaries 7/7 (re-ran myself incl. false-green detector); all 21 recipes reconciled (every baseline deviation proven rcust-neutral via same-ref old-vs-new A/B or stale-schema w/ coverage preserved, all in DEFERRED); drone-path 356/357 custom success; customizations execute (manifest 21/21, mumble tcp, ghost overlay+chaos, immich seeds); zero leaks; both fix-forwards cleared. M1+M2 both PASS → DoD handshake satisfied, Builder may write ## DONE. No VETO.
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 01:15:45 +00:00
f7b9b6f167 status(rcust): Current section → M2 CLAIMED
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 01:07:13 +00:00
d7f85c3f28 claim(rcust): M2 — merge+2 approved fix-forwards green, canaries 7/7, 21/21 reconciled vs corrected baseline (3 lasuite via accepted L5≡L4+OIDC equivalence, bluesky-pds justified exclusion), drone path covered (356/357), zero leaks
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 01:06:48 +00:00
89dec5188f inbox(rcust): consumed 01:12Z be2026a-cleared note; bluesky-pds filed in DEFERRED.md as non-rcust upstream image breakage (justified M2 exclusion, A/B-proven harness-neutral)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 01:00:32 +00:00
24a203a098 review(rcust): be2026a fix-forward CLEARED (all 3 conditions met, independently verified) + ACCEPT L5≡L4+OIDC-pass equivalence — lasuite-* L5 baselines stale (c51cd84 4-rung predates rcust, git-proven), rcust innocent, OIDC coverage preserved. Consumed 01:10Z inbox. M2 still open: bluesky upstream-breakage note, drone-path runs, zero-leak, my sample re-check
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 00:59:29 +00:00
f359069d40 inbox(rcust): m2p2 GREEN rc=0 3m19s (both fix-forwards exercised end-to-end; OIDC+MinIO pass) — level=4 vs condition-1 'L5' explained: 6-rung ladder removed on MAINLINE 06-09 (46e2cdb/c51cd84 PR#6) pre-merge; equivalence proposed (L4 all-pass + requires_deps OIDC PASSED)
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
2026-06-11 00:57:12 +00:00
a13a83a775 status(rcust): discourse A/B CLOSED — old==new byte-identical upgrade-HC1 at baseline ref+invocation (harness-neutral, env drift since 06-05; branch-tip/tag/abra-pin drift eliminated); m2p2 lasuite-drive binding proof started
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 00:51:10 +00:00
4428e76f48 review(rcust): be2026a merge cold-verified — merged lifecycle.py + test file byte-identical to branch (condition #2 met); m2p-lasuite-drive L0 = diagnosed pre-fix symptom; awaiting discourse A/B + post-fix L5
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 00:42:54 +00:00
b4505acbbd status(rcust): disclosed SIGINT shortcut of doomed m2p overlay install burn (KeyboardInterrupt at the diagnosed converge line); m2p2 is the binding proof
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 00:39:44 +00:00
9715ab5c50 status(rcust): be2026a merged as 6cabbe7 (build 350 green on 914c166); m2p2-lasuite-drive post-fix proof queued behind discourse A/B
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 00:38:06 +00:00
914c1663b5 inbox(rcust): consumed 00:31Z conditional APPROVE — merging be2026a, post-merge lasuite-drive re-run queued behind discourse A/B pair
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 00:33:07 +00:00
6cabbe73b7 fix(harness): merge fix/converged-oneshot @ be2026a — services_converged completed-one-shot rule (rcust M2 fix-forward #2, Adversary-approved a531746)
2026-06-11 00:33:07 +00:00
a531746e53 review(rcust): APPROVE fix-forward be2026a (services_converged completed-one-shot rule) — cold-verified diff+7 tests+199 unit+lint on fresh checkout, no false-green path (HTTP floor + minio custom test independent); conditional on post-merge lasuite-drive L5 + merged-diff==branch-diff + discourse PR=2 A/B cold re-check. Consumed 00:40Z inbox
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 00:31:54 +00:00
49d796d9ac status(rcust): m2p-lasuite-drive WILL land L0 — second P2b regression (completed one-shot 0/1 vs services_converged) root-caused live; fix on branch be2026a awaiting approval
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 00:28:33 +00:00
73421dabb4 inbox(rcust): lasuite-drive SECOND P2b regression root-caused live (completed one-shot 0/1 poisons services_converged after hook moved pre-assert) — fix-forward on branch fix/converged-oneshot @ be2026a, 199 unit + lint green, awaiting approval
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 00:27:49 +00:00
be2026aafb fix(harness): services_converged — a replica deficit explained entirely by Complete tasks is converged (triggered one-shot, rcust M2 lasuite-drive root cause)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 00:26:53 +00:00
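One way to read the rule in be2026aafb: a service whose running replicas fall short of the desired count can still be treated as converged when every missing replica is accounted for by a task in the Complete state (the triggered one-shot case). The sketch below assumes plain integer counts; the function name and signature are hypothetical and are not the harness's actual services_converged API.

```python
def replicas_converged(running: int, desired: int, complete: int) -> bool:
    """Return True when a service can be treated as converged.

    Hypothetical helper: `running`, `desired` and `complete` are assumed to be
    task counts already extracted from the orchestrator's service listing.
    """
    deficit = desired - running
    if deficit <= 0:
        # All desired replicas are running: converged in the usual sense.
        return True
    # A deficit is acceptable only if Complete tasks explain all of it.
    return complete >= deficit
```

Under these assumptions a finished one-shot showing 0/1 with one Complete task passes, while 0/1 with no Complete task still counts as not converged.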
77a9415b37 inbox(rcust): consumed Builder 00:20Z reply — proof runs confirmed queued; m2b-discourse/sidekiq/bluesky facts noted for independent cold-verify (not taken on trust)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 00:06:42 +00:00
4dcfb5ba96 review(rcust): M2 proof in flight — Builder running discourse PR=2 A/B (new vs old main) + lasuite-drive post-fix; self-correct my m2b L1 finding (PR=0 confound on HC1 re-checkout) — awaiting PR=2 results to cold-verify
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 00:06:16 +00:00
1ec0e772e8 inbox(rcust): consumed 23:53Z asks — lasuite-drive proof RUNNING, discourse same-ref 2x2 queued (new-main PR=2 + old-main PR=2 @7ae7b0f); m2b-discourse HC1 facts pinned (re-checkout persisted, eb96de94=base tag, sidekiq line benign); bluesky-pds = upstream image breakage (MODULE_NOT_FOUND x3, harness-neutral)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-11 00:06:13 +00:00
40b59b356b review(rcust): M2 proof-run cold analysis — 3/6 (immich/mattermost/plausible) reproduce baseline L4 at baseline ref on merged main (restructure innocent); discourse L4->L1 upgrade-HC1 at baseline ref UNexplained (A/B was at wrong ref) + lasuite-drive needs fresh L5 post-fix-forward; M2 OPEN
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 23:54:36 +00:00
5c0676b7d0 note(rcust): M2-prep hook-port audit — only lasuite-drive flipped best-effort->fatal (fix approved); lasuite-docs exit1->exit0 is intentional P2b (F2-11-gated); all other ops.py pure mechanical ctx migration. Closes M1-method gap (key-diff missed hook bodies)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 20:55:01 +00:00
efd7efc32b inbox(rcust): consumed 20:53Z approval — fix-forward pushed as 57c66ad; proof re-run at baseline REF queued behind tests 2+3
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 20:53:52 +00:00
1357544301 fix(tests): restore best-effort semantics of lasuite-drive pre_install bucket trigger (rcust M2 regression)
All checks were successful
continuous-integration/drone/push Build is passing
The P2b port of setup_custom_tests.sh -> ops.py::pre_install made the 90s bucket-poll timeout a
fatal AssertionError; the original shell hook fell through on timeout BY DESIGN (best-effort) and
the custom-tier MinIO storage test is the real gate for a genuinely missing bucket. Live evidence:
in both M2 sweep failures the bucket landed just after the window and every later tier including
the custom MinIO test passed. Warn loudly + continue, exactly the old semantics.

Adversary-approved fix-forward (REVIEW-rcust 57c66ad, scoped to this raise).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 20:53:31 +00:00
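The best-effort semantics restored in 1357544301 can be sketched as a poll that warns and returns on timeout instead of raising. This is an illustration only: `poll_best_effort`, its parameters, and the warning text are assumptions, and the real pre_install hook is not shown in the log.

```python
import sys
import time


def poll_best_effort(check, timeout_s=90.0, interval_s=2.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `check()` until it is truthy or the time budget runs out.

    On timeout this warns on stderr and returns False rather than raising,
    so a later, stricter test remains the real gate.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            break
        sleep(interval_s)
    print(f"!! condition not met within {timeout_s:.0f}s; continuing (best-effort)",
          file=sys.stderr)
    return False
```

A caller would ignore the return value (or log it) and proceed, leaving the failure decision to the downstream storage test.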
57c66add51 review(rcust): APPROVE lasuite-drive pre_install fix-forward (scoped to line-54 bucket-poll raise→best-effort; verified old=best-effort, custom MinIO test is real gate, no coverage loss); conditioned on L5 re-run + my diff re-verify. Auditing other shell->python hook ports for same drift
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 20:52:53 +00:00
a95fad4fa0 inbox(rcust): lasuite-drive P2b port regression root-caused (best-effort poll became fatal assert) — trivial fix-forward proposed, awaiting Adversary approval
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 20:50:31 +00:00
b9abf48116 inbox(rcust): consumed 20:33Z ACK — ref-mismatch independently confirmed; tests 2+3 concurred; proceeding
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 20:34:36 +00:00
4cb1f57e2c inbox(rcust): consumed Builder 20:35Z ref-mismatch heads-up + ACK — independently confirmed sweep ran default-branch heads (7d53d4ec/da159375) != baseline PR refs; concur tests 2+3 separate harness×content; will run own cold A/B at claim
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 20:33:56 +00:00
e30a414ce1 inbox(rcust): heads-up — restore cluster is a REF-mismatch vs baseline (sweep ran old default heads; baselines were PR-head runs); baseline-REF re-runs + old-main A/B queued
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 20:32:33 +00:00
41033b4500 inbox(rcust): consumed 20:15Z follow-up — restore cluster confirmed pre-existing, VETO threat withdrawn; proceeding to satisfy the 4 M2 PASS conditions (re-runs at baseline, canary+zero-leak, log sample, !testme x2)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 20:19:12 +00:00
a7a558ada3 note(rcust): M2 follow-up — confirmed restore cluster is the PRE-EXISTING truncated-dump race (documented in discourse BACKUP_VERIFY docstring on pre-merge 49fb818); VETO-threat withdrawn; stated M2 PASS conditions (re-runs at baseline + spot-checks)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 20:18:26 +00:00
37dcfab07d inbox(rcust): consumed Adversary 20:13Z restore-cluster heads-up — ACK: serial re-runs of all 6 already in flight (/root/m2-rerun-logs/, results m2rr-*); will ALSO run immich on OLD main (pre-merge c2508c7) serially in the same env as the requested A/B regardless of re-run outcome; no M2 claim until both legs are documented in STATUS
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 20:18:13 +00:00
ffc88848f3 note(rcust): M2 heads-up — restore-failure cluster (discourse/immich/plausible/mattermost ci_marker-missing) blocks M2 PASS; evidence says infra/pre-existing not restructure (restore orchestration unchanged, no BACKUP_VERIFY correlation, peers pass); suggest A/B vs old main (NOT a verdict)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 20:17:14 +00:00
85d14101ef status(rcust): M2 sweep first pass — canaries 7/7, 15/21 at baseline, 6 flake-shaped reds re-running serially; spot-grep evidence + zero leaks
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 20:14:05 +00:00
9aa0c5d624 status(rcust): fix stale Current section — M2 in progress
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 19:33:23 +00:00
4d342a2c5d status(rcust): M1 PASS — merged to main 01e6d49, push build 326 green; M2 canaries running, sweep driver staged
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 19:33:05 +00:00
01e6d497ba Merge branch 'restructure/recipe-custom' — recipe-customization restructure (rcust M1 PASS @858e0f5, REVIEW-rcust 01f9f70)
All checks were successful
continuous-integration/drone/push Build is passing
Single registry-backed meta loader, legacy key/path deletion, uniform ctx hooks, custom-test
placement rule + fixtures, customization manifest, docs. M2 real-CI regression sweep follows.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 19:28:38 +00:00
01f9f70970 review(rcust): M1 PASS @858e0f5 — cold unit 192+conc 23+lint PASS; coverage diff 0 real deltas/21 (mumble byte-identical, deleted keys all accounted); 18=18 asserts no weakening (no VETO); validation gaps closed; R2 delivered end-to-end; HC2/F2-11/generic-floor intact; manifest secret-redaction verified surgical. DONE still gated on M2 (real-CI sweep).
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 19:27:49 +00:00
c2508c7fd2 claim(rcust): M1 — P1–P6 complete on restructure/recipe-custom @ 858e0f5; unit 192 + concurrency 23 + lint PASS; baseline matrix committed
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 19:13:36 +00:00
8984b57b35 status(rcust): P6 complete (da558ca) + Adversary inbox consumed — manifest redaction landed (858e0f5); M1 prep starting
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 19:10:00 +00:00
858e0f582f fix(harness): redact secret-named meta values in the customization manifest (rcust)
All checks were successful
continuous-integration/drone/push Build is passing
Adversary heads-up (inbox 2026-06-10T19:06Z): meta values are repo-public by construction, but
the manifest lands on the dashboard — a field literally named SECRET_KEY_BASE showing a value
(plausible's committed CI dummy) is needless secret-scan noise. Mask values whose key NAME is
secret-shaped (SECRET|PASSWORD|TOKEN|CREDENTIAL|word-segment KEY), top-level and nested dict
keys; the key name stays visible. Unit test pins redacted vs passthrough (KEYCLOAK_URL).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 19:09:09 +00:00
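The masking rule from 858e0f582f (redact by key name, keep the key visible, recurse into nested dicts) might look roughly like the sketch below. The regular expression is one plausible reading of "SECRET|PASSWORD|TOKEN|CREDENTIAL|word-segment KEY"; the placeholder string and the function name are assumptions.

```python
import re

# Assumed reading of the rule: SECRET/PASSWORD/TOKEN/CREDENTIAL anywhere in the
# name, or KEY as a whole underscore-delimited segment (so KEYCLOAK_URL passes).
_SECRET_NAME = re.compile(
    r"SECRET|PASSWORD|TOKEN|CREDENTIAL|(?:^|_)KEY(?:_|$)", re.IGNORECASE
)
REDACTED = "<redacted>"  # hypothetical placeholder text


def redact(value, key=None):
    """Return a copy of `value` with secret-named entries masked."""
    if isinstance(key, str) and _SECRET_NAME.search(key):
        return REDACTED
    if isinstance(value, dict):
        # Recurse so nested dict keys get the same treatment.
        return {k: redact(v, k) for k, v in value.items()}
    return value
```

For example, `redact({"SECRET_KEY_BASE": "dummy", "KEYCLOAK_URL": "https://kc"})` masks the first value and passes the second through unchanged.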
da558ca946 docs: P6 — rewrite customization docs to the restructured end state (rcust)
All checks were successful
continuous-integration/drone/push Build is passing
recipe-customization.md: review spec -> reference. Single registry-backed loader + validation
rules + HookCtx convention (§4); generated key table kept byte-identical (sync test); §5 end-state
shape (op_state/deps fixtures, ctx ops.py, placement rule, first-class compose.ccci.yml, no
setup_custom_tests.sh); §7 manifest block + dev-only CCCI_SKIP_GENERIC*; §8 rewritten as
restructure outcomes (R1/R2/R3/R5/R6/R7/R8 resolved + how, R4 mitigated by manifest, R9
rejected-by-decision); §9 index updated to the new symbols.

testing.md: install-time deps isolation replaces the setup_custom_tests step in the invariant
(generic still never depends on custom — failure isolation via requires_deps/F2-11); ops.py
example to pre_<op>(ctx); placement rule; generic opt-out now documented LOCAL-DEV-ONLY env with
CI !! warning (declarative SKIP_GENERIC gone); partial key list points at the generated table.

enroll-recipe.md: tree + worked examples updated (lasuite-docs install-time OIDC wiring +
install_steps.sh; mumble post-F2-14c shape — UPGRADE_EXTRA_ENV native overlay, private _
constants, no CHAOS_BASE_DEPLOY); deps fixture (entry.domain) replaces deps_apps; ctx hook
signatures; compose.ccci.yml first-class bullet; key list points at the generated table.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 19:07:41 +00:00
5ccc0d1c34 note(rcust): interim pre-review of frozen P5 (68954be) — cold unit 191 + lint PASS reproduced; manifest exposes NO generated/real secrets (HC2-honoring, pure presentation); one non-blocking heads-up re plausible SECRET_KEY_BASE public-dummy on dashboard (NOT an M1 verdict)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 19:07:24 +00:00
52f5266dfb status(rcust): P5 complete on branch (68954be) — unit 191 green + lint PASS; starting P6
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 18:58:33 +00:00
68954be53e feat(harness): P5 — customization manifest (rcust)
All checks were successful
continuous-integration/drone/push Build is passing
One block at run start answering "what does this recipe customize?" across every surface
(non-default recipe_meta keys, ops.py pre-ops, install_steps.sh, compose.ccci.yml, lifecycle
overlays by source, custom-test counts, active CCCI_SKIP_GENERIC* env overrides — !!-flagged when
riding a CI run, P2c), printed to the run log and embedded verbatim in results.json under
"customization". Pure presentation — building/printing it never influences a verdict; the
manifest honors the HC2 repo-local gate so it never advertises code the run will not execute.

Unit tests: synthetic recipe exercising every surface -> complete + deterministic + JSON-clean;
HC2 invisibility; env-override flagging; render golden lines; build_results threads the dict
verbatim (key always present, None when absent).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:57:26 +00:00
270476beb3 note(rcust): interim pre-review of frozen P4 (29a28e2) — cold unit 184 + lint PASS reproduced; placement-rule claim holds (0 non-lifecycle top-level customs), HC2 intact, tests strengthened not weakened (NOT an M1 verdict)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 18:53:32 +00:00
ff09c4075b status(rcust): P4 complete on branch (29a28e2) — unit 184 green + lint PASS; starting P5
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 17:14:38 +00:00
63befd05b0 note(rcust): interim pre-review of frozen P3 — mechanical migration held (0 changed asserts), HookCtx complete, legacy-sig guard live-probed PASS, coverage diff still 0/21 (NOT M1)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 17:14:37 +00:00
29a28e2028 feat(harness): P4 — custom-test ergonomics (rcust)
All checks were successful
continuous-integration/drone/push Build is passing
Placement RULE: discovery.custom_tests covers ONLY functional/ + playwright/ —
the top-level test_*.py glob for recipe dirs is removed (top level is reserved
for lifecycle overlays; zero in-repo users of top-level custom tests, verified
by sweep). Lifecycle-name exclusion inside the subdirs stays as the double-run
safety net. HC2 default-deny unchanged (repo-local custom now pinned via
functional/ in the gate test).

New conftest fixture op_state: parses $CCCI_OP_STATE_FILE (op context: versions,
artifact paths), skipping with a clear reason when unset/absent/unparseable —
overlay tests read op facts from the fixture instead of hand-parsing env (zero
existing hand-parsers found; the fixture is the documented path forward). deps
fixture landed in P2d.

Unit tests: placement-rule discovery tests (top-level custom NOT discovered;
functional/playwright are; misfiled lifecycle names excluded), op_state fixture
contract (reads file / skips without env / skips on missing file), deps fixture
attribute sugar.

Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 184 passed; scripts/lint.sh -> PASS.
2026-06-10 17:14:21 +00:00
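The placement rule in 29a28e2028 (custom tests only under functional/ and playwright/, with lifecycle-named files excluded as a safety net) can be illustrated with a small discovery function. The set of lifecycle op names below is an assumption, and this is not the harness's actual discovery.custom_tests implementation.

```python
from pathlib import Path

# Assumption: lifecycle overlays are named test_<op>.py for these ops.
LIFECYCLE_OPS = ("install", "upgrade", "backup", "restore")
LIFECYCLE_FILES = {f"test_{op}.py" for op in LIFECYCLE_OPS}
CUSTOM_SUBDIRS = ("functional", "playwright")


def custom_tests(recipe_dir):
    """List custom test files for a recipe under the placement rule."""
    found = []
    for sub in CUSTOM_SUBDIRS:
        subdir = Path(recipe_dir) / sub
        if not subdir.is_dir():
            continue
        for path in sorted(subdir.glob("test_*.py")):
            if path.name in LIFECYCLE_FILES:
                # Misfiled lifecycle overlay: skip it to avoid a double run.
                continue
            found.append(path)
    # Top-level test_*.py files are deliberately never scanned.
    return found
```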
802b2792a7 note(rcust): interim pre-review of frozen P1+P2 — fallout clean, typo gate PASS, coverage diff 0/21 deltas, validation gaps closed (NOT an M1 verdict; M1 unclaimed)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 17:11:41 +00:00
0264af72c7 status(rcust): P3 complete on branch (fd02d9f) — unit 180 green + lint PASS; starting P4
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 17:10:45 +00:00
fd02d9f4b8 feat(harness): P3 — uniform ctx hook convention (rcust)
All checks were successful
continuous-integration/drone/push Build is passing
harness.meta.HookCtx (frozen): .domain, .base_url, .meta (RecipeMeta), .deps
(provisioned dep creds from $CCCI_DEPS_FILE or None), .op (current lifecycle op
or None); built via meta.hook_ctx() at each hook call site.

All recipe callables now take ctx: EXTRA_ENV(ctx), UPGRADE_EXTRA_ENV(ctx),
READY_PROBE(ctx), BACKUP_VERIFY(ctx), SCREENSHOT(page, ctx), ops.py pre_<op>(ctx).
Dict-valued EXTRA_ENV/UPGRADE_EXTRA_ENV unchanged (only the callable signature
moved). Call sites converted: deploy_app env shaping, perform_upgrade,
wait_ready_probes (gains op=), _perform_op BACKUP_VERIFY, screenshot.capture,
_run_pre_hook.

Legacy signatures fail FAST with a clear migration message: the registry carries
hook_params per hook key, enforced at meta.load() (MetaError names the old vs new
signature); ops.py pre-op hooks get the same check at the orchestrator call site
(meta.check_hook_signature) — no silent TypeError mid-run.

Migrated every in-repo user mechanically (17 ops.py files; cryptpad/lasuite-*/
mailu EXTRA_ENV; mumble+lasuite-drive READY_PROBE; ghost/discourse BACKUP_VERIFY)
— seeded values, probes and assertions byte-identical (domain -> ctx.domain;
keycloak pre_restore's meta arg -> ctx.meta).

Unit tests: hook_ctx field contract, ctx.deps from the run deps file, legacy-
signature MetaError (READY_PROBE/EXTRA_ENV/SCREENSHOT + pre-op checker), ctx
signatures accepted. Docs table regenerated (signature docs in key docs).

Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 180 passed; scripts/lint.sh -> PASS.
2026-06-10 17:10:26 +00:00
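A minimal sketch of the two ideas in fd02d9f4b8: a frozen context object passed to every hook, and a fail-fast signature check that names the old and new signatures. The field names follow the commit message; the field types, the `expected` parameter, and the error wording are assumptions rather than the harness's real code.

```python
import inspect
from dataclasses import dataclass
from typing import Any, Optional


class MetaError(Exception):
    """Raised when recipe metadata or a hook signature is invalid."""


@dataclass(frozen=True)
class HookCtx:
    domain: str
    base_url: str
    meta: Any                    # RecipeMeta in the harness; left opaque here
    deps: Optional[dict] = None  # provisioned dep creds, or None
    op: Optional[str] = None     # current lifecycle op, or None


def check_hook_signature(key, fn, expected=("ctx",)):
    """Reject hooks whose parameter names differ from the expected ones."""
    actual = tuple(inspect.signature(fn).parameters)
    if actual != tuple(expected):
        raise MetaError(
            f"{key}: got signature ({', '.join(actual)}), "
            f"expected ({', '.join(expected)}); migrate the hook to take ctx"
        )
```

Checking at load time means a legacy `READY_PROBE(domain)` is reported immediately with a migration hint, instead of surfacing as a TypeError partway through a run.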
8945d13674 status(rcust): P2 complete on branch (8cd72fd) — unit 175 green + lint PASS; starting P3
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 17:01:58 +00:00
8cd72fd78d feat(harness): P2 — delete legacy customization keys & paths (rcust)
All checks were successful
continuous-integration/drone/push Build is passing
a) compose.ccci.yml is FIRST-CLASS: the harness auto-copies tests/<recipe>/
   compose.ccci.yml into the run's recipe checkout (ABRA_DIR-aware, lifecycle.
   provide_ccci_overlay) and auto-chaoses the pinned base deploy on its presence
   (kills the R7 implicit coupling). ghost/discourse install_steps.sh (copy-only
   boilerplate) deleted; CHAOS_BASE_DEPLOY removed from both metas + the registry.

b) install-time deps wiring is the ONLY mode: deps with DEPS provision BEFORE the
   single deploy; legacy post-deploy provisioning + the setup_custom_tests.sh
   invocation machinery deleted. lasuite-docs migrated to install_steps.sh OIDC
   wiring (same env names/values as the old hook — only the timing moved);
   lasuite-drive's remaining post-deploy MinIO bucket one-shot moved to ops.py
   pre_install; both setup_custom_tests.sh files deleted; OIDC_AT_INSTALL removed
   from drive/meet metas + the registry.

c) SKIP_GENERIC meta key deleted (zero users). Env form CCCI_SKIP_GENERIC* stays
   as the documented dev-only escape hatch; when active in a drone CI run the
   orchestrator prints a loud !! warning (manifest embedding lands in P5).

d) conftest cleanup: dead pre-deploy-once fixtures deployed/deployed_app deleted
   (zero users), app_domain + _short + _wait_healthy dropped (only users were the
   deleted fixtures); deps_apps+deps_creds consolidated into ONE deps fixture
   (entries expose .domain etc. as attributes; dict access intact); the 6 lasuite
   test files renamed deps_creds->deps (fixture name only — assertions and flows
   byte-identical). requires_deps marker + F2-11 skip-report plumbing unchanged.

Registry is now exactly the 14 final keys; docs §4 table regenerated. Stale
setup_custom_tests/OIDC_AT_INSTALL prose in docstrings/comments/assert MESSAGES
updated (no assert logic or expected value touched).

Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 175 passed; scripts/lint.sh -> PASS.
2026-06-10 17:01:33 +00:00
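Item (d) of 8cd72fd78d says deps entries expose .domain and similar fields as attributes while dict access stays intact. One simple way to get both, shown purely as an assumption about the shape (the class name `DepEntry` is hypothetical), is a dict subclass with attribute fallback:

```python
class DepEntry(dict):
    """Dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


entry = DepEntry(domain="kc.example.test", realm="ci")
```

Here `entry.domain` and `entry["domain"]` return the same value, so tests written against either style keep working.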
f5119a9703 status(rcust): P1 complete on branch (472a68b) — unit 175 green + lint PASS; starting P2
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
2026-06-10 16:47:35 +00:00
472a68b32c feat(harness): P1 — single registry-backed meta loader (rcust)
All checks were successful
continuous-integration/drone/push Build is passing
One loader: runner/harness/meta.py::load(recipe) -> RecipeMeta (frozen dataclass,
attribute access), backed by the declarative KEYS registry (14 final keys + 3
P2-deprecated). The ONLY exec() of tests/<recipe>/recipe_meta.py. Validation per
the locked decision: unknown ALL-CAPS top-level name or type mismatch = MetaError
(hard error at load); underscore-prefixed names recipe-private; callables only on
hook-typed keys.

Migrated all six legacy loaders (spec §4 L1–L6):
- run_recipe_ci.py::_load_meta deleted; orchestrator loads once, passes meta down
- tests/conftest.py::_recipe_meta deleted; meta fixture returns full RecipeMeta (R3)
- lifecycle.py::_recipe_extra_env/_recipe_meta_flag deleted; deploy_app takes meta
- deps.py::declared_deps deleted; callers read meta.DEPS
- canonical.py::is_enrolled reads through meta.load()
- screenshot.py now actually receives SCREENSHOT through the orchestrator path (R2
  fix; proven by unit test through the real load path)

Mumble private constants underscore-prefixed (_WELCOME_TEXT_MARKER/_MAX_USERS) +
importers fixed. New tests/unit/test_meta.py (all-recipes-load-clean typo gate,
MetaError cases, spec §2 baseline defaults, underscore exemption, doc sync). Docs
§4 key table now GENERATED from the registry (scripts/gen-meta-docs.py); drift
fails CI.

Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 175 passed; scripts/lint.sh -> PASS.
2026-06-10 16:46:58 +00:00
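The validation rules in 472a68b32c (unknown ALL-CAPS name or type mismatch is a hard error, underscore-prefixed names are recipe-private, callables only on hook-typed keys) can be sketched against a toy registry. The three registry entries and their types below are illustrative assumptions; the real KEYS registry has 14 final keys and is not reproduced in the log.

```python
class MetaError(Exception):
    """Raised at load time for invalid recipe metadata."""


# Hypothetical subset of a registry: key -> (value type, may be a callable hook)
KEYS = {
    "DEPS": (list, False),
    "EXTRA_ENV": (dict, True),
    "UPGRADE_BASE_VERSION": (str, False),
}


def validate(namespace):
    """Validate an exec()'d recipe_meta namespace and return the known keys."""
    out = {}
    for name, value in namespace.items():
        if name.startswith("_") or not name.isupper():
            continue  # recipe-private constant, import, or helper
        if name not in KEYS:
            raise MetaError(f"unknown key {name!r} (typo?)")
        typ, hook_ok = KEYS[name]
        if callable(value):
            if not hook_ok:
                raise MetaError(f"{name}: callables are only allowed on hook keys")
        elif not isinstance(value, typ):
            raise MetaError(f"{name}: expected {typ.__name__}, "
                            f"got {type(value).__name__}")
        out[name] = value
    return out
```

With this shape a misspelled key such as `DEPZ` fails at load instead of being silently ignored, which is the typo hazard the restructure set out to close.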
49fb818c60 status(rcust): bootstrap phase state files — P1 starting on branch restructure/recipe-custom
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 16:29:44 +00:00
12318582aa review(rcust): seed Adversary ledger — phase start, awaiting M1 claim
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 16:29:26 +00:00
76a4b6b3fa docs: recipe-customization review spec — full settings reference + restructuring candidates
All checks were successful
continuous-integration/drone/push Build is passing
Complete reference for per-recipe CI customization: all 18 recipe_meta keys
(incl. the base pin UPGRADE_BASE_VERSION), the six divergent meta loaders,
every hook file (test_<op>.py overlays, ops.py, install_steps.sh,
setup_custom_tests.sh, compose.ccci.yml), env contracts, and §8 known
limitations / restructuring candidates (R1 loader drift, R2 dead SCREENSHOT
knob, R6 silent-typo hazard, ...). Written for operator review ahead of a
possible restructure.
2026-06-10 15:55:34 +00:00
6060086c01 status(conc): ## DONE — M1+M2 both Adversary-PASS, no open veto; phase complete
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:56:02 +00:00
9987fba4b6 review(conc): M2 PASS — merged + live-verified (a)-(d) on final main 139e319; M1+M2 both fresh PASS, no open veto — DONE unblocked
All checks were successful
continuous-integration/drone/push Build is passing
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:55:19 +00:00
74ed24053d claim(conc): M2 — merged + live-verified (a)-(d) on final main 139e319; (a) re-run build 295 clean; awaiting Adversary
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:52:48 +00:00
2894778810 review(conc): M2(c) PASS — double-!testme both GREEN on CONC-A1-fixed harness; VETO LIFTED, CONC-A1 closed
All checks were successful
continuous-integration/drone/push Build is passing
Builds 290+291 (same immich domain) both success: 291 logged block line + acquired,
both deploy-count=1 (290 no false-2, 291 no FileNotFoundError), zero leakage.
Serialization also observed live in lslocks. CONC-A1 conditions 1-3 met; veto lifted.
Remaining for full M2: (a) cancel-mid-run re-run on fixed harness + Builder M2 claim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:51:46 +00:00
536a3595b9 journal(conc): M2(c) PASS round 2 — 290+291 both green, block line visible, zero leakage; (a) re-run triggered
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:50:26 +00:00
0684576d74 chore(conc): consume BUILDER-INBOX (ML-flake context on (c) round-2; concur — will re-trigger (c) clean after 290/291 terminal)
Some checks reported errors
continuous-integration/drone/push Build is passing
continuous-integration/drone Build was killed
2026-06-10 08:45:14 +00:00
fa9a89bcf8 review(conc): live (c) round-2 — serialization confirmed via lslocks; delay is immich-ML healthcheck flake, not the restructure; veto unchanged
All checks were successful
continuous-integration/drone/push Build is passing
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:44:30 +00:00
374371966f journal(conc): (b)+(d) PASS on CONC-A1-fixed main (287/288 parallel green, zero leakage); (c) round 2 triggered
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:22:40 +00:00
b1bca1a745 chore(conc): CONC-A1 fix code-verified (veto conditions 1+2 met, mutation-proven); 3+4 pending live (c) re-run
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:19:37 +00:00
4f6c9554b7 inbox(adversary): consumed CONC-A1-fixed message from Builder
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:17:16 +00:00
96ba67a63f inbox(adversary): CONC-A1 fixed b6e12ef/139e319 — run-keyed state files + regression test; re-running M2 live checks
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:16:43 +00:00
139e319d7e Merge branch 'restructure/concurrency': fix(harness) CONC-A1 run-keyed state files (M2(c) live-verify finding)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:16:18 +00:00
2173894f07 review(conc): M2(c) FAIL — double-!testme same domain corrupts shared deploy-count file (CONC-A1) + VETO
All checks were successful
continuous-integration/drone/push Build is passing
Builds 279+281 (immich#2, same domain immi-ad3e33) both RED: 279 false DG4.1
'deploy-count 2!=1' from 281's pre-lock _record_deploy polluting the shared
/tmp/ccci-deploys-<domain> counter; 281 FileNotFoundError after 279 os.remove'd it.
Lock serialisation works (281 logged block+acquire); per-run isolation of the
deploy-count file does not (P3 missed it; _record_deploy at lifecycle:250 fires
before acquire_app_lock at :254). Control build 275 (isolated) green.
Veto DONE until counter keyed per-run + same-domain test + live (c) both-green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:11:07 +00:00
e392c73cbc journal(conc): M2(b)+(d) PASS evidence; (c) double-!testme triggered
Some checks failed
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is failing
2026-06-10 05:04:14 +00:00
3180ae1355 review(conc): wrapper exit-code fix verified safe (red still propagates) + correct my set -e pre-review miss; inbox consumed
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:58:27 +00:00
9d82a02026 journal(conc): M2(b) round-1 evidence + wrapper fix verification
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
2026-06-10 04:56:22 +00:00
bbc2bafbcb inbox(adversary): M2 wrapper exit-code fix e1c4198/b7a009c — context for M2 review
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
2026-06-10 04:55:07 +00:00
b7a009c1fc Merge branch 'restructure/concurrency': fix(ci) wrapper exit-code poisoning on green runs (M2 live-verify finding)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:54:51 +00:00
56723ae0ec chore(conc): M2 merge-integrity pre-check — merged main == M1-verified tree (not a verdict)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:49:55 +00:00
dfa5c8b9ee journal(conc): M2(a) cancel-mid-run PASS evidence; (b) parallel runs triggered
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:47:19 +00:00
bb5eb3d3aa Merge branch 'restructure/concurrency': concurrency restructure (P1-P5 + tests/concurrency)
Some checks failed
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is failing
M1 Adversary-verified PASS (REVIEW-conc.md @83a6c6e): lock-lifetime hardening (PDEATHSIG +
signal funnels + 60-min deadline + setsid/trap cancel forwarding), flock-probe janitor
(registry deleted), per-run ABRA_DIR (recipe flock deleted), single concurrency knob,
tests/concurrency real-kernel suite, docs/concurrency.md rewrite.
2026-06-10 04:40:00 +00:00
83a6c6e157 review(M1): PASS — branch @d3fe9e2 cold-verified (unit 138, conc 20, lint, 0 dangling refs, gate-integrity, independent flock probe)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:39:16 +00:00
8b9033f3d6 journal(conc): tests suite + P5 evidence, M1 claim context
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:34:19 +00:00
e8e52cf4c6 claim(conc): M1 CLAIMED — branch restructure/concurrency complete (P1-P5 + tests, tip d3fe9e2), awaiting Adversary
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:33:59 +00:00
c51692b57e chore(conc): pre-review P3+P4 — zero dangling refs, ABRA_DIR ordering clean (not a verdict)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:28:41 +00:00
ffcf441364 journal(conc): P1-P4 evidence (live smokes on cc-ci) + pre-existing abra app ls FATA observation
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:21:17 +00:00
2080d734d3 status(conc): P1-P4 on branch (b492f99..91d3cc7), tests/concurrency next
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:20:20 +00:00
f98b444559 decisions(conc): record P3 install_steps.sh ABRA_DIR path fix (guardrail justification)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:18:45 +00:00
08b629f52a chore(conc): pre-review P1+P2 — 4 break-it concerns tested + refuted (not a verdict)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:16:41 +00:00
e350c94c3f chore(conc): record cold-verify environment (cc-ci-run pytest env, M1 plan)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:03:23 +00:00
81 changed files with 4579 additions and 925 deletions


@@ -2,21 +2,67 @@
## Build backlog
- [ ] P1 lock-lifetime hardening: prctl PDEATHSIG + ppid race check + SIGTERM handler →
- [x] P1 lock-lifetime hardening: prctl PDEATHSIG + ppid race check + SIGTERM handler →
teardown funnel + signal.alarm(3600) hard deadline; .drone.yml setsid/trap wrap;
PEP 446 comment on lock open()
- [ ] P2 flock-probe janitor: acquire_app_lock(domain) at register_run_app's call site;
- [x] P2 flock-probe janitor: acquire_app_lock(domain) at register_run_app's call site;
janitor probes per-domain lockfiles (acquired→reap under probe lock, held→leave,
>120min mtime→warn); delete registry symbols
- [ ] P3 per-run ABRA_DIR: /var/lib/cc-ci-runs/<build>/abra with servers+catalogue symlinks,
- [x] P3 per-run ABRA_DIR: /var/lib/cc-ci-runs/<build>/abra with servers+catalogue symlinks,
fresh recipes/; fetch_recipe = plain clone; delete acquire_recipe_lock; route harness
recipe paths through ABRA_DIR
- [ ] P4 config cleanup: remove concurrency.limit from .drone.yml; maxTests is the single knob
- [ ] tests/concurrency suite (19 cases, real-kernel flock, explicit invocation only)
- [ ] P5 docs/concurrency.md rewrite to the new model
- [x] P4 config cleanup: remove concurrency.limit from .drone.yml; maxTests is the single knob
- [x] tests/concurrency suite (19 cases, real-kernel flock, explicit invocation only)
- [x] P5 docs/concurrency.md rewrite to the new model
- [ ] M1 claim (branch complete, both suites + lint green)
- [ ] M2: merge to main after M1 PASS, push build green, live verification a-d
## Adversary findings
(adversary-owned)
### [adversary] CONC-A1 — double-!testme same domain corrupts the shared deploy-count file (M2(c) FAIL)
**Severity:** blocks M2(c). Both runs of a same-domain double-!testme go RED.
**Root cause (two coupled defects, one shared root):**
1. The DG4.1 deploy-counter file is keyed by DOMAIN in the *shared* system tempdir, NOT per-run:
`run_recipe_ci.py:930 countfile = /tmp/ccci-deploys-<domain>`. P3 isolated `ABRA_DIR` per run
but this per-run state file was missed — it predates the restructure (ef44d46) and the OLD
recipe-flock used to serialize same-recipe runs end-to-end, incidentally masking it.
2. `lifecycle.deploy_app()` calls `_record_deploy()` (lifecycle.py:250) BEFORE
`acquire_app_lock(domain)` (lifecycle.py:254, introduced by P2 b302f3a). So the counter
increment happens OUTSIDE the serialization window — a second same-domain run bumps the
shared counter before it ever blocks on the lock.
**Observed (live, builds 279 + 281, immich PR#2, same domain immi-ad3e33, 2026-06-10T05:04Z):**
- Lock serialization itself WORKS: 281 logged `== app lock: ... in flight — waiting ==` at 2s,
then `== app lock: acquired ==` at 194s — exactly when 279 exited (279 finished 05:07:35).
- 279 RED: `!! deploy-count 2 != 1 (DG4.1 violation)`. The `2` = 281's pre-lock `_record_deploy`
(fired ~2s, before 281 blocked) polluting the shared counter 279 was actively using.
- 281 RED: `FileNotFoundError: /tmp/ccci-deploys-immi-ad3e33...` at run_recipe_ci.py:1213 —
279's end-of-run `os.remove(countfile)` (line 1215) deleted the shared file out from under 281,
whose single `_record_deploy` had already fired at 2s and never recreates it.
- Control: isolated immich (build 275, same fixed wrapper) → `deploy-count = 1`, GREEN. So this
is concurrency-specific, not a pre-existing immich/wrapper issue.
**Repro:** two `!testme` comments on the same recipe PR (same domain) in quick succession on the
deployed main harness → both builds RED (one DG4.1 false-violation, one FileNotFoundError).
**Fix direction (Builder owns):** key the deploy-counter per RUN, not per domain — e.g. put it in
`/var/lib/cc-ci-runs/<build>/` (alongside the per-run artifacts) or include the build/run id in the
filename, and export that path via `CCCI_DEPLOY_COUNT_FILE`. Per-run keying fixes BOTH defects at
once (no cross-run pollution; no shared remove). Moving `_record_deploy()` after `acquire_app_lock`
alone is INSUFFICIENT — the shared `os.remove`/`FileNotFoundError` collision survives. Add a
tests/concurrency case: two same-domain runs serialized on the app lock → each sees its own
deploy-count, neither removes the other's file (this is the gap vs the 19 planned cases — case 4
serialises acquire but never asserts deploy-count isolation across the two).
**Closure:** adversary-owned. Re-test the (c) double-!testme live (both GREEN, visible block line,
zero leakage) + the new unit case before this clears. Only I close it.
**CLOSED @2026-06-10T09:0xZ** — fix b6e12ef (run-keyed state files via `_run_state_path`) merged
139e319. Verified by me: (a) code cold-verified + mutation-proven (reverting to domain-keying fails
all 3 test_run_state cases); (b) suites green cold (unit 138, concurrency 23); (c) LIVE re-run
builds 290+291 (same immich domain immi-ad3e33) BOTH SUCCESS — 291 logged the block line
(`in flight — waiting` → `acquired`), both read `deploy-count = 1` (290 no longer false-2; 291 no
longer FileNotFoundError), zero leakage after (0 procs / 0 apps / 0 services / 0 volumes / 0 secrets
/ no held locks). Full evidence in REVIEW-conc M2(c) PASS.
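A minimal sketch of why per-run keying clears both defects, replaying the 279/281 interleaving against temp files. The helper names (`domain_keyed_path`, `run_state_path`, `record_deploy`, `simulate`) and the filename layout are illustrative assumptions, not the harness's actual `_run_state_path` code.

```python
import os
import tempfile


def domain_keyed_path(kind: str, domain: str, base: str) -> str:
    """Old scheme (illustrative): every run of one domain shares a single file."""
    return os.path.join(base, f"ccci-{kind}-{domain}")


def run_state_path(kind: str, run_id: str, pid: int, base: str) -> str:
    """Hypothetical per-run scheme: run id plus harness pid make the path unique."""
    return os.path.join(base, f"ccci-{kind}-{run_id}-{pid}")


def record_deploy(countfile: str) -> None:
    """Append one marker line per deploy (a stand-in for _record_deploy)."""
    with open(countfile, "a", encoding="utf-8") as fh:
        fh.write("deploy\n")


def deploy_count(countfile: str) -> int:
    """Count the recorded deploys in a counter file."""
    with open(countfile, encoding="utf-8") as fh:
        return sum(1 for _ in fh)


def simulate(path_a: str, path_b: str) -> tuple[int, bool]:
    """Replay the CONC-A1 ordering; return (run A's count, run B's file survives)."""
    record_deploy(path_a)            # run A deploys
    record_deploy(path_b)            # run B records pre-lock, then blocks on the lock
    count_a = deploy_count(path_a)   # run A's end-of-run deploy-count check
    os.remove(path_a)                # run A's end-of-run cleanup
    return count_a, os.path.exists(path_b)
```

With both runs pointed at the same domain-keyed path, run A reads 2 and run B's file is gone; with per-run paths, A reads 1 and B's file is untouched.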

23
BACKLOG-rcust.md Normal file

@@ -0,0 +1,23 @@
# BACKLOG — sub-phase rcust
## Build backlog
- [ ] P1.1 `runner/harness/meta.py`: KEYS registry (14 keys + 3 deprecated) + `load(recipe) -> RecipeMeta`
- [ ] P1.2 migrate readers L1-L6 to `meta.load()` (orchestrator loads once, passes down)
- [ ] P1.3 mumble private constants → underscore-prefixed (`_WELCOME_TEXT_MARKER`, `_MAX_USERS`) + fix importers
- [ ] P1.4 `tests/unit/test_meta.py` (all-recipes-load-clean, MetaError cases, defaults, R2 proof)
- [ ] P1.5 `scripts/gen-meta-docs.py` + doc-sync unit test
- [ ] P2a compose.ccci.yml first-class (auto-copy + auto-chaos); strip ghost/discourse boilerplate
- [ ] P2b install-time deps only; migrate lasuite-docs; delete setup_custom_tests.sh machinery
- [ ] P2c SKIP_GENERIC meta key deleted; env form documented dev-only + loud warning in CI runs
- [ ] P2d conftest cleanup: delete deployed/deployed_app (+app_domain if unused); consolidate deps fixture; migrate 6 lasuite test files
- [ ] P3 HookCtx + convert all hook call sites + migrate in-repo users + unit tests
- [ ] P4 discovery placement rule + op_state/deps fixtures + migrate hand-parsers
- [ ] P5 customization manifest (print block + results.json key) + unit tests
- [ ] P6 docs rewrite (recipe-customization.md §8, testing.md, enroll-recipe.md)
- [ ] M1 pre-claim: run `pytest tests/concurrency -q` once to prove untouched
- [ ] M2 prep: build baseline matrix (21 recipe dirs, expected outcomes) BEFORE merging — commit to STATUS-rcust.md
## Adversary findings
(Adversary-owned section)

78
BACKLOG-shot.md Normal file

@@ -0,0 +1,78 @@
# BACKLOG-shot.md — phase `shot` (recipe screenshot audit & repair)
SSOT: /srv/cc-ci/cc-ci-plan/plan-phase-shot-screenshots.md. Gates: M1 (audit+diagnosis), M2 (all OK / agreed N/A).
## Build backlog
### P1 — Audit matrix (status: complete, all 19 PNGs visually inspected 2026-06-11)
Enrolled set (19) = `tests/<r>/recipe_meta.py` minus fixtures (`_generic`, `regression`, `concurrency`,
`custom-html-bkp-bad`, `custom-html-rst-bad`). Evidence: `/var/lib/cc-ci-runs/<run>/` on cc-ci;
PNGs pulled to /tmp/shot-audit/ on the builder host and each one Read (visually).
| recipe | latest run w/ artifacts | screenshot field | PNG bytes | visual content (I looked) | class |
|---|---|---|---|---|---|
| bluesky-pds | ab-bluesky-pds-oldmain | null | — | no PNG; install=fail level=0 (upstream image breakage, rcust DEFERRED) → capture correctly skipped (`if deploy_ok`) | N-A-candidate (blocked upstream) |
| cryptpad | m2r-cryptpad | screenshot.png | 4802 | solid light-grey frame, nothing else | BLANK |
| custom-html | m2r-custom-html | screenshot.png | 35707 | "Welcome to nginx!" default page | OK? (diagnose: is this the recipe's true fresh-install content?) |
| custom-html-tiny | m2r-custom-html-tiny | screenshot.png | 12950 | seeded CI content ("cc-ci custom-html-tiny … DG5") | OK |
| discourse | m2p-discourse | screenshot.png | 66121 | real forum UI, welcome topic, Sign Up/Log In | OK |
| ghost | m2r-ghost | screenshot.png | 444183 | real blog landing ("Thoughts, stories and ideas") | OK |
| hedgedoc | m2r-hedgedoc | screenshot.png | 131967 | real landing (logo, Sign In, feature intro) | OK |
| immich | 356 | screenshot.png | 4801 | pure white frame | BLANK |
| keycloak | m2r-keycloak | screenshot.png | 8764 | spinner + "Loading the Administration Console" | LOADING |
| lasuite-docs | m2r-lasuite-docs | screenshot.png | 6022 | lone spinner on white | LOADING |
| lasuite-drive | m2p2-lasuite-drive | screenshot.png | 5895 | lone spinner on white | LOADING |
| lasuite-meet | m2r-lasuite-meet | screenshot.png | 4801 | pure white frame | BLANK |
| mailu | m2r-mailu | screenshot.png | 33800 | real sign-in page (empty fields) | OK |
| matrix-synapse | m2r-matrix-synapse | screenshot.png | 33296 | "It works! Synapse is running" landing | OK |
| mattermost-lts | m2b-mattermost-lts | screenshot.png | 242139 | brand splash/loading screen (logo on blue), NOT the login form | LOADING (borderline — brand-recognizable but a loading state) |
| mumble | m2r-mumble | screenshot.png | 7913 | spinner on grey — a web page IS served on the domain | LOADING (diagnose what serves it; N/A may NOT be justified) |
| n8n | m2r-n8n | screenshot.png | 4801 | off-white blank frame. Flaky: run 197 (30256 B) shows the real "Set up owner account" form (empty fields, credential-free) | BLANK (flaky) |
| plausible | 357 | null | — | no PNG on ANY run (122→357) | NULL |
| uptime-kuma | m2r-uptime-kuma | screenshot.png | 30858 | real "Create your admin account" setup form (empty fields) | OK |
PNG-size note: 4801/4802 B at 1280×800 is a byte-stable blank-frame fingerprint (3 different apps, same size).
### P2 — Root-cause diagnoses
- [x] **NULL — plausible** (evidence: Drone build 357 ci-step log, t=73s):
`screenshot: capture failed (non-fatal, verdict unaffected): page.goto(https://plau-b51425.ci.commoninternet.net/) never returned a status in (200, 301, 302, 303, 401, 403) after 15 attempts (45s); last status=500`.
Plausible's `/` 500s **by design** under `DISABLE_AUTH=true` (auth_controller; documented in
`tests/plausible/functional/test_health_check.py` docstring and recipe_meta — that's why HEALTH_PATH
is `/api/health`). Default landing-page capture can NEVER succeed → needs a per-recipe SCREENSHOT
hook to a path that actually renders (probe live: e.g. /login or /sites).
- [x] **NULL — bluesky-pds**: install fails (level=0) before the app is up → `if deploy_ok:` gate in
runner/run_recipe_ci.py:1024 correctly skips capture. Not a screenshot defect; upstream image
breakage already filed in machine-docs/DEFERRED.md (rcust). → documented N/A while upstream is broken.
- [x] **BLANK class — immich, lasuite-meet, n8n(flaky), cryptpad**: SPA paint race. capture() navigates
with `wait_until="domcontentloaded"` (runner/harness/screenshot.py:91) and screenshots immediately;
SPA shell HTML has loaded but JS hasn't painted → solid 4801-2 B frame. n8n flakiness = same race,
sometimes JS wins (run 197 captured the real form).
- [x] **LOADING class — keycloak, lasuite-docs, lasuite-drive, mumble, mattermost-lts(borderline)**:
same race, caught mid-paint (spinner/splash rendered, app JS still loading/connecting).
- [x] **mumble** web stack identified: recipe deploys a `web` service (mumble-web client) on the domain —
spinner is its connecting state; landing renders a connect dialog once JS settles. NOT an N/A.
- [x] **custom-html** nginx-welcome question: the recipe's fresh install genuinely serves the nginx
default page at `/` (no content seeded for this recipe's install; only custom-html-tiny seeds via
install_steps.sh). Screenshot is an honest representative view of a fresh install. → OK as-is.
### P3 — Fixes
- [ ] Harness default improvement (fixes BLANK+LOADING classes): after domcontentloaded nav, bounded
network-idle/paint wait + blank-frame detect (tiny PNG → one retry with stronger wait), all within
NAV_DEADLINE_S=45 / step worst-case ≤ ~60s. Unit tests in tests/unit/test_screenshot.py.
- [ ] plausible SCREENSHOT hook (tests/plausible/recipe_meta.py) to a rendering, credential-free path.
- [ ] Re-audit mattermost-lts / mumble / keycloak / lasuite-* after harness fix; per-recipe hooks only
where the default still can't work.
- [ ] bluesky-pds: document N/A in matrix (Adversary agreement at M1/M2).
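The blank-frame retry planned in P3 could look roughly like the sketch below. It assumes the capture step can be passed in as a callable taking a `strong_wait` flag; `capture_with_retry`, `looks_blank`, and the 5000-byte cut-off are illustrative choices, not the harness's real interface. A pure size check only catches the BLANK class (4801/4802 B); the larger LOADING frames would still depend on the settle wait.

```python
from typing import Callable

# Assumption: frames at or below this size count as blank. The audit saw
# 4801/4802 B blank frames at 1280x800; the exact cut-off is a tunable guess.
BLANK_MAX_BYTES = 5000


def looks_blank(png: bytes, max_bytes: int = BLANK_MAX_BYTES) -> bool:
    """Size-only heuristic for a solid, unpainted frame."""
    return len(png) <= max_bytes


def capture_with_retry(capture: Callable[[bool], bytes]) -> bytes:
    """Call capture(strong_wait); retry once with a stronger wait if blank."""
    png = capture(False)
    if looks_blank(png):
        retry = capture(True)
        # Best-effort: keep the larger frame so a failed retry never loses data.
        if len(retry) > len(png):
            png = retry
    return png
```

Keeping the retry to a single attempt bounds the worst-case step time, which matters given the stated wait budget.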
### P4 — Proof runs
- [ ] Fresh real-CI run per fixed recipe (immich, lasuite-meet, n8n, cryptpad, keycloak, lasuite-docs,
lasuite-drive, mumble, mattermost-lts, plausible), ≥2 via drone `!testme`; visual check each PNG;
card + dashboard render. Healthy class: cite existing artifact + visual check (done in P1).
## Adversary findings
(Adversary-owned section.)


@@ -22,3 +22,144 @@ Read concurrency-restructure-full-plan.md (SSOT) + plan.md §6.1/§7/§9. Orient
Working setup: state files on main in this clone; code on branch `restructure/concurrency`
via a git worktree at ../cc-ci-conc; test runs on the cc-ci host via /root/builder-clone
(`cc-ci-run -m pytest ...`, `nix develop .#lint`).
## 2026-06-10 — P1-P4 landed on restructure/concurrency
- P1 b492f99: harness/lifetime.py (PDEATHSIG+ppid recheck, SIGTERM/SIGALRM→SystemExit funnel
with re-entrancy guard, alarm(3600)); main() installs first; both finally blocks mark
begin_teardown(); .drone.yml setsid+trap wrap. Live smoke on cc-ci (cc-ci-run /tmp/p1-smoke.py):
TERM→rc=143+finally; ALRM→rc=142+finally+deadline log; parent-kill→child TERM'd, teardown ran.
- P2 b302f3a: acquire_app_lock + _probe_and_reap + janitor rewrite; registry deleted. Live smoke
(/tmp/p2-smoke*.py): held lock → "live concurrent run, leaving it", reaped=[]; killed holder →
reap exactly once + lockfile unlinked; waiter blocked during probe-held reap, then re-acquired
on the FRESH inode (probe confirmed held by waiter). Note: a select()-on-fd readline artifact
in my smoke script initially looked like a failure — kernel state was verified directly.
Unlink/recreate race guarded on BOTH sides via fstat/stat st_ino identity checks.
- P3 17ebdf3: per-run ABRA_DIR. Verified abra CLI honors $ABRA_DIR on-host (skeleton probe:
FATAs only on empty servers/; with servers+catalogue symlinks + recipes/ it works and even
auto-clones recipes for `app ls` resolution into the per-run dir). p3-smoke: setup + fetch of
custom-html-tiny landed in /tmp/p3runs/9999/abra/recipes, head commit + versions readable via
abra.recipe_dir(). install_steps.sh path fix justified in DECISIONS.md (conc P3 entry).
Pre-existing observation (NOT mine, unchanged): `abra app ls -S -m -n` currently FATAs
"unable to resolve '0cc57a5a'" under the DEFAULT abra dir too → janitor's abra discovery
yields [] and the docker-service sweep carries discovery. Out of this phase's scope.
- P4 91d3cc7: concurrency.limit removed; maxTests comment states single-knob + new model.
One stale comment line (.drone.yml l.39 "concurrency.limit=2 below") folds into P5.
All four commits: tests/unit 138 passed + lint PASS before each. Next: tests/concurrency suite.
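The P2 probe with its inode identity guard can be sketched as follows, assuming Linux `flock` semantics. `try_acquire` and `reap_and_release` are hypothetical names for illustration; the real `acquire_app_lock` / `_probe_and_reap` may differ in structure and error handling.

```python
import fcntl
import os


def try_acquire(path: str):
    """Non-blocking probe: return an fd holding the lock, or None if held or raced."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None  # a live run holds the lock: leave it alone
    # Guard the unlink/recreate race: the path must still name the inode we locked.
    try:
        same_inode = os.stat(path).st_ino == os.fstat(fd).st_ino
    except FileNotFoundError:
        same_inode = False
    if not same_inode:
        os.close(fd)  # we locked a stale file that was already unlinked
        return None
    return fd


def reap_and_release(path: str, fd: int) -> None:
    """Unlink the lockfile while still holding the lock, then release it."""
    os.unlink(path)
    os.close(fd)
```

A waiter that was blocked on the old inode would run the same `st_ino` comparison after acquiring and, on mismatch, reopen the path to lock the fresh inode.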
## 2026-06-10 — tests/concurrency (84d90fb) + P5 (d3fe9e2) + M1 claim (e8e52cf)
- Suite: 20 tests / 19 plan cases, all real-kernel (helpers.py subprocesses hold real flocks,
install real prctl/alarm guards; CCCI_APP_LOCK_DIR sandboxes /run/lock; HelperPool reaps every
helper + recorded grandchildren). First full run on cc-ci: 20 passed in 9.96s, zero flakes in
3 repeat runs during the P5 verification re-runs.
- Design notes for the Adversary's blind-spot hunt (my own known limits):
- case 8 (two janitors) uses threads in one process — valid because flock conflicts are
per-open-file-description, and overlap is forced via a Barrier + 2s slow teardown stub.
- case 14 relies on reparent-to-pid-1 (true on the cc-ci host; would need adjustment in a
subreaper environment — marked NEVER_REPARENTED visibly if so).
- cases 5-12 stub teardown_app (recording) — janitor probe/reap ordering is what's under
test, not teardown internals (covered by Phase-1 e2e + M2 live checks).
- M1 claimed at e8e52cf; full verification recipe in STATUS-conc.md (WHAT/WHERE/HOW/EXPECTED).
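The case 8 design note relies on `flock` conflicts being per open file description rather than per process. A small demonstration of that property, assuming Linux semantics (the `conflicts` and `demo` helpers are illustrative only):

```python
import fcntl
import os
import tempfile


def conflicts(fd: int) -> bool:
    """True if an exclusive non-blocking flock on fd is refused."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True
    return False


def demo() -> tuple[bool, bool]:
    """Return (second open() conflicts, dup()'d fd conflicts) within one process."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.lock")
        holder = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        fcntl.flock(holder, fcntl.LOCK_EX)
        second_open = os.open(path, os.O_RDWR)  # new open file description
        duplicate = os.dup(holder)              # same open file description
        try:
            return conflicts(second_open), conflicts(duplicate)
        finally:
            for fd in (holder, second_open, duplicate):
                os.close(fd)
```

Because a second `open()` in the same process is refused, two janitor threads that each open the lockfile themselves contend just as two separate processes would.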
## 2026-06-10 — M2: merge + live verification (a)
- Merge: bb5eb3d (--no-ff) pushed; push build 266 (self-test lint+hello) SUCCESS.
- (a) cancel-mid-run: !testme on immich#2 → build 267 (custom) running on the NEW harness —
log shows the setsid/trap wrap + "== per-run ABRA_DIR: /var/lib/cc-ci-runs/267/abra ==";
lock /run/lock/cc-ci-app-immi-ad3e33...lock held by pid 636902; 4 immich services up.
Canceled via drone API 04:42:07Z (HTTP 200, build status "killed"). Result: harness pid
GONE (no leaked python — the old §8.1 gap is closed), immich services 0, volumes 0,
secrets 0, .env 0 — the SIGTERM funnel ran the run's own teardown (better than the plan's
minimum, which allowed the janitor to do the reaping). Lock RELEASED (lockfile present but
unheld — tidy-swept by the next janitor, to be observed during (b)).
- (b) triggered 04:46:53Z: !testme immich#2 (comment 14287) + plausible#3 (14288) in parallel.
## 2026-06-10 — M2(b) round 1: green runs, poisoned exit code → wrapper fix
- Builds 268 (immich#2) + 269 (plausible#3) ran in PARALLEL on the new harness: both logs end
with all-tiers-pass RUN SUMMARY (level=4, deploy-count 1/1) and the host shows ZERO leakage
after (no harness processes, no immi/plau services/volumes/secrets, only unheld lockfiles).
Both steps nevertheless exited 1: the P1 EXIT trap's kill of the already-gone process group
returns ESRCH under the runner's `set -e` shell — a GREEN run reported failure.
- Reproduced minimally on-host (`sh -e` and `bash -e`: rc=1 on a clean exit with the old trap).
Fix e1c4198 (capture rc; `trap - TERM EXIT`; `|| true` on the trap kill) verified on-host:
green rc=0, red rc=7 propagated, TERM→wrapper forwards to child, exits 143. Merged to main
b7a009c; push builds 272-274 green. Adversary notified via inbox.
- (b) re-triggered on the fixed wrapper 04:56:10Z (immich#2 + plausible#3).
## 2026-06-10 — M2(b) PASS + (c) triggered
- (b) round 2 on fixed wrapper: builds 275 (immich#2) + 276 (plausible#3) ran in PARALLEL,
BOTH status=success (drone API). Host after: 0 python harness processes, 0 immi/plau
services/volumes/secrets/.envs — zero leakage. (d) satisfied by 275 (full green immich e2e).
Leftover unheld lockfiles present by design (tidy-swept at next janitor).
- (c) double-!testme on immich#2: two comments at 05:03:58Z → two custom builds, same run
domain immi-ad3e33 → exactly one must block on the app lock with the visible log line.
## 2026-06-10 — CONC-A1: (c) failure root-caused + fixed (run-keyed state files)
- (c) round 1 = builds 279+281, both RED. Root cause (independently also found+filed by the
Adversary as CONC-A1 while I was mid-diagnosis — same conclusion from both loops): the four
run-scoped state files (deploys/opstate/deps/depskip) were DOMAIN-keyed in shared /tmp;
281's main()-preamble + pre-lock _record_deploy fired before it blocked on the app lock →
279 read deploy-count 2 (false DG4.1 RED); 279's end-of-run os.remove deleted the shared
countfile → 281 crashed FileNotFoundError at its own read. Lock serialization itself worked
(281: waiting @+2s, acquired @+194s = 279's exit). Masked pre-restructure by the
end-to-end recipe flock.
- Fix b6e12ef on branch, merged to main 139e319: _run_state_path() keys all four by
run id + harness pid; consumers were always env-fed (CCCI_*_FILE), so domain keying was
never load-bearing. Both cleanup sites already remove all four on normal exit.
- New tests/concurrency/test_run_state.py (suite now 23): path invariants + real-process
CONC-A1 interleaving via helpers.py `deploy-count-run` (countfile init → pre-lock
_record_deploy → acquire → gated read). Teeth verified: under simulated shared keying the
regression test FAILS (host run: 3 failed); with the fix: 23 passed + 138 unit + lint PASS.
- Next: push build green → re-run (b)+(d), then (c), then (a) per the VETO's conditions.
## 2026-06-10 — M2 re-verification on CONC-A1-fixed main (139e319)
- Push builds 283/284/285 (branch fix, merge, inbox) all green.
- (b)+(d) round 3 (comments 14299/14300, 08:17:35Z): builds 287 (immich#2) + 288 (plausible#3)
BOTH success, started simultaneously 08:17:40Z (parallel), finished 08:21:06/08:21:13.
Both logs: deploy-count = 1 (expect 1), level=4. Host after: pgrep -f 'run_recipe_c[i]' → no
match (earlier "2" was pgrep self-match of the ssh cmdline); immi/plau services/volumes/
secrets/server-envs all 0. Zero leakage. (d) satisfied by 287 (full green immich e2e on the
final harness code).
- (c) round 2 triggered 08:22:13Z: comments 14303+14304 on immich#2 (same domain immi-ad3e33).
## 2026-06-10 — M2(c) PASS round 2 (builds 290+291) + (a) re-run triggered
- (c) round 2: builds 290 (08:22:30→08:46:05) + 291 (08:22:33→08:49:23) BOTH success.
291 log: "== app lock: another run of immi-ad3e33... in flight — waiting ==" at +1s,
"acquired" at +1411s = exactly 290's exit. Both: deploy-count = 1 (expect 1), level=4.
Slowness was an immich-ML healthcheck flake (Adversary cross-confirmed live via lslocks:
one holder pid 739163, one waiter pid 739341 on the same lock inode — serialization observed
in the kernel lock table); ML converged inside the 1500s window, both runs green anyway —
no clean re-run needed.
- After both: no harness procs (pgrep run_recipe_c[i] empty), 0 immi/plau services/volumes/
secrets/server-envs. Unheld lockfile remains by design (tidy-swept at next janitor probe).
- (a) re-run on fixed harness: !testme immich#2 comment 14307 @08:50:02Z; will cancel mid-run
via drone API once the deploy is in flight, then check pid/lock/leakage + janitor reap.
## 2026-06-10 — M2(a) re-run PASS (build 295) + M2 claim
- (a) on fixed harness: build 295 (comment 14307 @08:50:02Z) canceled @08:51:05Z (HTTP 200)
while mid-deploy (lock held by pid 763099, 4 immich services converging). Harness pid GONE
@08:51:15Z — the SIGTERM funnel ran the run's own teardown inside 10s; build status=killed;
lock released (lslocks empty); services/volumes/secrets/envs all 0. Zero leakage, no janitor
required.
- Adversary lifted the CONC-A1 VETO @09:05Z with its own M2(c) PASS (290/291 cold-verified,
kernel-lock-table serialization observation). Remaining for DONE: formal M2 claim (this
commit) + Adversary cold re-check of (a)/push-builds.
- M2 claimed in STATUS-conc.md with consolidated (a)-(d) evidence + cold re-check recipe.
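The SIGTERM/SIGALRM funnel that these cancel checks exercise can be sketched like this. It is an approximation of the idea described for harness/lifetime.py, not its code: `install_guards` and `_funnel` are assumed names, and the PDEATHSIG and ppid recheck parts are omitted.

```python
import signal

_in_teardown = False


def begin_teardown() -> None:
    """Mark teardown as started so later signals do not interrupt it."""
    global _in_teardown
    _in_teardown = True


def _funnel(signum, _frame) -> None:
    """Turn a signal into SystemExit so the caller's finally blocks run."""
    if _in_teardown:
        return  # re-entrancy guard: let the running teardown finish
    raise SystemExit(128 + signum)


def install_guards(deadline_s: int = 3600) -> None:
    """Route SIGTERM and SIGALRM through the funnel and arm the hard deadline."""
    signal.signal(signal.SIGTERM, _funnel)
    signal.signal(signal.SIGALRM, _funnel)
    signal.alarm(deadline_s)
```

Raising `SystemExit(128 + signum)` matches the exit codes seen in the P1 smoke (143 for TERM, 142 for ALRM) while letting teardown run from the normal `finally` path.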
## 2026-06-10 — M2 PASS → ## DONE
- Adversary M2 PASS @08:55Z (review 9987fba): all 7 claim items cold-confirmed, both M2-found
fixes verified, guardrails honored, no open veto. Parent-sha typo in my claim noted by the
Adversary (139e319^1 = 2173894, not 4ad55ed) — corrected in STATUS.
- ## DONE written to STATUS-conc.md. Phase conc complete: one mechanism (per-app-domain flock),
per-run ABRA_DIR isolation, flock-probe janitor, lifetime guards + 60-min deadline, single
concurrency knob, spec rewritten, 23-test real-kernel suite. Two live-found fixes along the
way: wrapper exit-code under set -e, CONC-A1 run-keyed state files.

307
JOURNAL-rcust.md Normal file

@@ -0,0 +1,307 @@
# JOURNAL — sub-phase rcust (Builder)
## 2026-06-10 bootstrap
Read phase plan (recipe-custom-restructure-full-plan.md), plan.md §6.1/§7/§9, and the reference
spec docs/recipe-customization.md @ 76a4b6b in full. Created phase state files. Work branch will
be `restructure/recipe-custom` off main @ 76a4b6b. Starting P1: reading the six current loaders
(run_recipe_ci.py::_load_meta, conftest.py::_recipe_meta, lifecycle.py::_recipe_extra_env,
lifecycle.py::_recipe_meta_flag, deps.py::declared_deps, canonical.py::is_canonical_enrolled)
before writing harness/meta.py.
## 2026-06-10 P1 — single loader + registry (branch 472a68b)
Wrote runner/harness/meta.py: KEYS registry (14 keys + CHAOS_BASE_DEPLOY/OIDC_AT_INSTALL/
SKIP_GENERIC kept registered as deprecated=True so P1 lands green before P2 deletes them),
RecipeMeta generated from KEYS via dataclasses.make_dataclass (frozen; field set cannot drift from
the registry), load() = the only exec() of recipe_meta.py, MetaError on unknown ALL-CAPS/type
mismatch/callable-on-data-key, difflib suggestion in the unknown-key message. BACKUP_CAPABLE keeps
its tri-state via default None (None = auto-detect — preserves the old `"BACKUP_CAPABLE" in meta`
semantics in generic.backup_capable).
Migrations: orchestrator loads once + passes meta down (deploy_app/perform_upgrade/_perform_op/
run_lifecycle_tier all take the object); conftest meta fixture returns full RecipeMeta (R3 closed);
lifecycle._recipe_extra_env/_recipe_meta_flag and deps.declared_deps deleted; canonical.is_enrolled
+ enrolled_recipes go through meta.load (tests monkeypatch meta.TESTS_DIR now instead of
canonical.__file__); screenshot._load_screenshot_hook reads the attribute (R2 fixed — unit test
proves SCREENSHOT survives the real orchestrator load path). deploy_app keeps an optional
meta=None fallback (loads via the single loader) for fixture/manual callers — exec still happens
in exactly one function.
Effective-value safety check before committing: dumped non_default() for all 21 recipe dirs through
the new loader — every recipe's customized key set matches its recipe_meta.py source (e.g. mumble:
DEPLOY_TIMEOUT/EXTRA_ENV/HEALTH_OK/READY_PROBE/UPGRADE_EXTRA_ENV). One intentional delta class:
deps.deploy_deps' fallback timeouts for a MISSING dep meta change from literal 900/600 to loading
the dep's real meta (orchestrator path always supplied metas, so CI behavior is identical).
Verified on cc-ci (rsynced working tree before committing):
cc-ci-run -m pytest tests/unit -q -> 175 passed
nix develop .#lint --command scripts/lint.sh -> lint: PASS
Three pre-existing f212 unit tests passed dicts to wait_ready_probes — updated mechanically to
construct RecipeMeta via dataclasses.replace (assertions untouched).
Next: P2a compose.ccci.yml first-class + auto-chaos.
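A reduced sketch of the registry-driven loader described in this entry: a `KEYS` table, a frozen dataclass generated from it, and a `MetaError` with a difflib suggestion. The three keys are a subset, their default values are made up for illustration, and `load_from_namespace` is a hypothetical stand-in for the validation half of `load()` (the `exec()` of recipe_meta.py is left out).

```python
import dataclasses
import difflib
from typing import Any


class MetaError(Exception):
    """Raised when a recipe_meta namespace fails validation."""


# Illustrative subset of a registry: key -> (type, default). Defaults are invented.
KEYS: dict[str, tuple[type, Any]] = {
    "DEPLOY_TIMEOUT": (int, 900),
    "HEALTH_PATH": (str, "/"),
    "EXTRA_ENV": (dict, None),
}

# Generating the dataclass from KEYS means the field set cannot drift from it.
RecipeMeta = dataclasses.make_dataclass(
    "RecipeMeta",
    [(name, typ, dataclasses.field(default=default)) for name, (typ, default) in KEYS.items()],
    frozen=True,
)


def load_from_namespace(ns: dict[str, Any]) -> Any:
    """Validate the public ALL-CAPS names of a namespace and build a RecipeMeta."""
    values = {}
    for name, value in ns.items():
        if name.startswith("_") or not name.isupper():
            continue  # private constants, helpers, and imports are not meta keys
        if name not in KEYS:
            hint = difflib.get_close_matches(name, list(KEYS), n=1)
            suffix = f" (did you mean {hint[0]}?)" if hint else ""
            raise MetaError(f"unknown key {name}{suffix}")
        expected = KEYS[name][0]
        if not isinstance(value, expected):
            raise MetaError(f"{name}: expected {expected.__name__}, got {type(value).__name__}")
        values[name] = value
    return RecipeMeta(**values)
```

A typo such as `DEPLOY_TIMOUT` then fails loudly at load time with a suggestion, instead of being silently ignored by a reader that only looks up the keys it knows.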
## 2026-06-10 P2 — legacy keys & paths deleted (branch 8cd72fd)
P2a: lifecycle.provide_ccci_overlay copies tests/<recipe>/compose.ccci.yml into the per-run
checkout (after install_steps hook, before prepull/deploy); pinned base deploys auto-chaos on
overlay presence (has_ccci_overlay replaces the meta.CHAOS_BASE_DEPLOY elif). ghost/discourse
install_steps.sh were copy-only -> deleted whole; their metas keep COMPOSE_FILE in EXTRA_ENV
(unchanged wiring, the harness now owns the copy).
P2b: oidc_at_install condition removed — `if declared:` provisions before the single deploy,
legacy post-deploy block + _run_setup_custom_tests_hook deleted. lasuite-docs install_steps.sh is
the meet/drive hook with docs' exact env names (diffed against the deleted setup_custom_tests.sh:
same keys incl. OIDC_OP_DISCOVERY_ENDPOINT + scopes 'openid email profile'; secret-insert bump
identical; only the abra-redeploy step is gone — the single deploy reads the env instead).
lasuite-drive's MinIO bucket one-shot -> ops.py pre_install (runs at install-tier start, post-
deploy; bucket lives in the minio volume so it survives upgrade/restore; same scale --detach +
30x3s poll as the shell version). run_quick: deps still provision (realm/creds), hook call gone —
no quick-enrolled recipe declares DEPS today; noted inline.
P2c: SKIP_GENERIC out of the registry; _skip_generic(op) env-only; skip_generic_env_overrides()
prints a `!!` warning when active under DRONE (P5 will embed in the manifest).
P2d: conftest deps fixture = dict of _DepEntry (dict subclass w/ attribute sugar) — the 6 lasuite
files only ever used deps_creds, renamed param to deps, zero assertion changes. NOTE for Adversary:
some assert MESSAGE strings ('setup_custom_tests should have populated this.' -> 'dep
provisioning...') and docstrings updated — message text only, no assert logic/expected values.
Verified on cc-ci (rsync of working tree): cc-ci-run -m pytest tests/unit -q -> 175 passed;
nix develop .#lint --command scripts/lint.sh -> PASS. Doc table regenerated to the 14-key registry
(doc-sync unit test pins it).
Next: P3 — HookCtx + ctx-hook signatures everywhere.
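The P2d "dict subclass w/ attribute sugar" idea can be sketched roughly as below. This is a minimal illustration, not the harness's `_DepEntry`: the class name `DepEntry` and the `domain`/`realm` keys are assumptions made up for the example.

```python
class DepEntry(dict):
    """Hypothetical stand-in: a dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # __getattr__ only runs after normal lookup fails, so real dict
        # methods (keys, items, get, ...) are never shadowed by entries.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# Illustrative deps fixture shape: recipe name -> entry (keys are assumed).
deps = {"keycloak": DepEntry(domain="kc.example.test", realm="ci")}
```

Because the entry is still a plain dict, tests that index it (`deps["keycloak"]["realm"]`) and tests that use attribute access (`deps["keycloak"].realm`) read the same value.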
## 2026-06-10 P3 — uniform ctx hook convention (branch fd02d9f)
HookCtx frozen dataclass + hook_ctx() constructor in harness/meta.py; ctx.deps read straight from
$CCCI_DEPS_FILE (json, both shapes) — meta.py stays import-cycle-free (deps.py imports lifecycle
which imports meta). Registry keys carry hook_params; meta.load() enforces the expected positional
names per hook key (READY_PROBE/BACKUP_VERIFY/EXTRA_ENV/UPGRADE_EXTRA_ENV=(ctx,),
SCREENSHOT=(page, ctx)); _run_pre_hook applies meta.check_hook_signature(fn, ("ctx",)) to ops.py
hooks before calling. Conversion of 17 ops.py + 8 recipe_meta hooks was scripted (def-line regex +
bare `domain` -> `ctx.domain` inside the pre_*/hook function bodies only) and diff-reviewed; the
only manual fixes: keycloak pre_restore passed `meta` -> `ctx.meta`, and two comment lines in
lasuite-drive/-meet metas that the regex over-replaced were restored. wait_ready_probes gained
op= (install/upgrade call sites pass it) so probes can know the phase.
Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 180 passed; lint PASS.
Next: P4 — discovery placement rule + op_state/deps fixtures + migrate hand-parsers.
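A positional-name check like the one described for `meta.check_hook_signature(fn, ("ctx",))` could look roughly like this. Only the call shape comes from the entry above; the body, the error type, and the message are assumptions.

```python
import inspect


def check_hook_signature(fn, expected):
    """Raise TypeError unless fn's positional parameter names equal `expected`."""
    positional_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    names = tuple(
        p.name
        for p in inspect.signature(fn).parameters.values()
        if p.kind in positional_kinds
    )
    if names != tuple(expected):
        raise TypeError(
            f"{fn.__name__} takes {names}, expected {tuple(expected)}"
        )


def pre_restore(ctx):  # new-convention hook: passes the check
    return ctx


def legacy_pre_restore(domain):  # old-convention hook: rejected
    return domain
```

Checking names rather than just arity is what would catch a hook that was left on the old `domain` parameter after a scripted conversion.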
## 2026-06-10 P4 — custom-test ergonomics (branch 29a28e2)
Pre-change sweeps confirmed the plan's zero-users claims: no top-level non-lifecycle test_*.py in
any recipe dir; no recipe test file reads os.environ / CCCI_OP_STATE_FILE directly (the only
op-state consumers are the generic assertions via harness.generic.op_state — harness-side, fine).
So P4 = discovery glob removal + new op_state fixture + pinning tests; no test migrations needed.
test_discovery.py's HC2 gate test moved its repo-local custom fixture under functional/ (the rule);
test_discovery_phase2.py now asserts top-level custom is NOT discovered. op_state fixture skips
(clear reason) when env unset / file missing / unparseable; tested via request.getfixturevalue.
Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 184 passed; lint PASS.
Next: P5 — customization manifest (print block + results.json key).
## 2026-06-10 P5 — customization manifest (branch 68954be)
(Resumed after a usage-limit pause mid-P5; working tree carried the in-flight manifest.py.)
New runner/harness/manifest.py: build() collects {meta_non_default, hooks, overlays, custom_tests,
env_overrides} via the SAME discovery/meta functions the run uses (so the manifest can never
disagree with what actually executes — incl. the HC2 _gated() repo-local gate), render() prints
the block. Orchestrator builds+prints right after meta load / repo-local snapshot, BEFORE the
quick-lane branch (both lanes get the block); the dict rides into build_results(customization=...)
verbatim. run_quick writes no results.json, so the single build_results call site covers all.
Hooks render as "<hook>", tuples as lists (JSON-clean); ops.py pre-ops listed by cheap source
scan (same approach as discovery._module_defines — no import at manifest time).
Lint flagged: C408 dict() literal, import-block order (manifest after deps), ruff-format on the
new test file — all fixed. Verified on cc-ci (rsync of working tree): cc-ci-run -m pytest
tests/unit -q -> 191 passed; nix develop .#lint --command scripts/lint.sh -> lint: PASS.
Next: P6 docs, then M1 prep (tests/concurrency proof run + 21-recipe baseline matrix).
## 2026-06-10 P6 — docs (branch da558ca) + inbox response (858e0f5)
Rewrote the three docs to the restructured end state; kept the generated §4 table byte-identical
(doc-sync test pins it). recipe-customization.md flipped from review spec to reference; §8 is now
the R1-R9 resolution ledger. Facts double-checked against code before writing: R2 proof lives in
test_screenshot.py::test_screenshot_reachable_through_real_load_path (not test_meta.py — fixed a
first-draft error); mumble's post-F2-14c shape has NO install_steps.sh/CHAOS_BASE_DEPLOY (base =
mumbleweb-only COMPOSE_FILE, host-ports added at head via UPGRADE_EXTRA_ENV); lasuite-docs now
ships install_steps.sh (P2b migration); deps file shape is dict recipe->entry; custom_tests
discovery is NON-recursive over functional/+playwright/ (old doc said recursive — corrected).
Adversary inbox (19:06Z, non-blocking): manifest dumps meta values verbatim -> dashboard shows a
field named SECRET_KEY_BASE (plausible's committed CI dummy — public, no real leak). Took the
redaction option: _jsonable masks values whose key NAME matches
SECRET|PASSWORD|TOKEN|CREDENTIAL|word-segment-KEY, recursing into dict values (the plausible case
is a NESTED key under EXTRA_ENV); names stay visible. KEYCLOAK_URL deliberately not matched
(word-segment KEY). Unit test pins redacted+passthrough both.
Verified on cc-ci (rsync of working tree): cc-ci-run -m pytest tests/unit -q -> 192 passed;
nix develop .#lint --command scripts/lint.sh -> lint: PASS.
Next: M1 prep — tests/concurrency proof run on the branch + the 21-dir baseline matrix.
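The key-name redaction rule from the inbox response can be sketched as follows. The function name `redact`, the exact regex, and the `<redacted>` mask string are assumptions; only the matching rule (SECRET, PASSWORD, TOKEN, CREDENTIAL, or KEY as a whole word segment, recursing into nested dicts) comes from the entry above.

```python
import re

# KEY must be a whole underscore-delimited segment, so KEYCLOAK_URL is not matched.
_SENSITIVE = re.compile(
    r"SECRET|PASSWORD|TOKEN|CREDENTIAL|(?:^|_)KEY(?:_|$)", re.IGNORECASE
)


def redact(value, key=None):
    """Mask values whose key NAME looks sensitive; key names stay visible."""
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # List items inherit the key they are stored under.
        return [redact(v, key) for v in value]
    if key is not None and _SENSITIVE.search(str(key)):
        return "<redacted>"
    return value


meta = {
    "DEPLOY_TIMEOUT": 900,
    "EXTRA_ENV": {"SECRET_KEY_BASE": "ci-dummy", "KEYCLOAK_URL": "https://kc.example.test"},
}
masked = redact(meta)
```

Matching on the name rather than the value keeps the manifest useful (a reader still sees which keys are customized) while hiding what they are set to.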
## 2026-06-10 M1 prep + claim
Concurrency proof run on branch head 858e0f5 (rsynced tree on cc-ci): cc-ci-run -m pytest
tests/concurrency -q -> 23 passed in 11.46s (suite untouched by the restructure, as planned).
Baseline matrix: pulled every /var/lib/cc-ci-runs/*/results.json (141 files) and took the most
recent per recipe. 19/21 dirs covered by results.json; mumble's last full run predates the
results system (log ~/ccci-mumble-f214c.log, 5 tiers pass 05-31); bluesky-pds likewise
(Adversary Phase-2 cold verify e45e0ee). plausible's weekly-report RED was its PR branch
(pg13->14, build 200); its default-branch baseline is run 308 (06-10) L4 — runs 307/308 are
today's, from the conc-phase M2 sweep. Bad canaries recorded at their designed-fail tier.
Claimed M1. While waiting: nothing else unblocked in this phase (M2 is gated on M1) — will hold
with short fallback polls per §7 case 2.
## 2026-06-11 M2 reconciliation — discourse upgrade-HC1 root-cause hunt + bluesky re-characterization
Resumed after a loop stall (~21:18Z-23:50Z): the m2b/ab sweeps had finished but nothing processed
them. Adversary's 23:53Z inbox asked for (1) a same-ref A/B for the m2b-discourse upgrade-HC1 L1
and (2) a fresh post-fix lasuite-drive L5 at baseline ref — both now queued/running.
Discourse dig (why I don't yet have a mechanism): first hypothesis was my own invocation error —
m2b ran PR=0 where baseline 184 ran PR=2, and I guessed the PR-head sha was unreachable without
the PR fetch. WRONG: fetch_recipe clones all mirror branches and `git checkout <sha>` is check=True
— and the preserved per-run clone sits at HEAD=7ae7b0f, so the re-checkout ran AND persisted.
Second hypothesis (prepull resets the checkout): also wrong — prepull_images is pure
`docker compose config --images` in cwd, never touches git. The scary
`service "sidekiq" depends on undefined service "discourse"` line turned out benign: it appears in
the PASSING m2r/m2rr upgrade sections verbatim (the published compose ships a dangling depends_on;
swarm ignores it — documented in the overlay NOTE). What's left: abra stamped the PREV-TAG commit
(eb96de94 = 0.7.0+3.3.1) on the chaos redeploy while the tree was at 7ae7b0f. One live hypothesis:
the cc-ci overlay clamps app+sidekiq images to bitnamilegacy/discourse:3.3.1; at this PR head
(0.9.0+3.5.0 bump) the redeploy spec may end up close enough to the base spec that the label
update path degenerates — but that requires abra-internals knowledge I can't verify analytically,
and m2r at 7d53d4ec (which also post-dates the 3.5.0 bump?) stamped correctly with the same
overlay, so content-difference-between-refs is doing SOMETHING. Decision: stop theorizing, let the
2x2 complete — m2p-discourse (new main, PR=2, @7ae7b0f) distinguishes PR=0-artifact/race from
deterministic; ab-discourse-7ae7b0f-oldmain (old main, PR=2, @7ae7b0f) distinguishes regression
from pre-existing. Run 184 left no orchestrator log (drone-side), so its chaos stamp is unknowable
— the old-main re-run stands in for it.
lifecycle.py diff c2508c7..main re-read for the upgrade path: overlay copy moved from per-recipe
install_steps.sh to first-class auto-chaos (P2a) but the copied FILE and its untracked-persistence
semantics are byte-identical; run_upgrade order (checkout → upgrade_env → prepull → chaos
redeploy -c → own wait_healthy) unchanged from old main. Nothing jumps out as the delta.
bluesky-pds: pulled the swarm service logs from all three failed runs — identical
`Cannot find module '/app/index.js'` crash-loop (Node v24.15.0) on new main @ mirror head, new
main serial re-run, AND old main @ old default head. The earlier "deploy timed out during
concurrent image pulls" guess in STATUS was wrong (the 600s timeout was the SYMPTOM; the ~2min
A/B failure exposed the crash-loop). Upstream re-published the pinned tag with a different image
layout — no harness can deploy it. Filed in STATUS as restructure-neutral with grep-able evidence.
## 2026-06-11 lasuite-drive root cause #2 — completed one-shot poisons convergence (caught live)
Watching the m2p proof run instead of just waiting paid off: the fix-forward's best-effort line
printed (so #1 is fixed), but the install assert then sat in pytest for 25+ minutes. Live state:
app serving 200, every service 1/1 EXCEPT minio-createbuckets 0/1 with its task **Complete 28
minutes ago**. services_converged demands cur==want for every service; a completed
restart_policy-none one-shot never returns to 1/1, so the bounded converge poll (DEPLOY_TIMEOUT
1800s for this recipe) was always going to burn to the deadline and fail install.
Why nobody ever saw this before P2b: the old setup_custom_tests.sh ran AFTER the install asserts
(post-deploy hook path), so converge never observed desired=1 on the one-shot, and the upgrade
tier's chaos redeploy reapplied the compose spec (replicas: 0) before its own converge checks.
P2b folded the trigger into ops.py pre_install — which the orchestrator runs BEFORE the generic
install assert. Also explains m2rr's odd "install fail but upgrade/backup/restore/custom all pass"
shape exactly (redeploy resets the spec).
Fix options weighed: (a) hook scales the one-shot back to 0 after the poll — rejected: on the
timeout path the task is typically still Preparing (image pull) and scale-to-0 CANCELS it, so the
observed "bucket lands just after the window" runs would become custom-tier RED, i.e. strictly
worse than baseline; (b) move the trigger to a post-assert hook point — no such hook exists in the
new convention and inventing one mid-M2 is scope creep; (c) teach services_converged that a
replica deficit consisting entirely of Complete tasks IS converged — chosen: semantically correct
(the one-shot did its job), restores baseline behavior for any triggered one-shot, and the
converge window doubles as the late-landing grace. Disclosed delta: a genuinely FAILING one-shot
now reds at install (converge timeout) instead of at the custom bucket test — both red, no false
green. Guard: Failed/mixed/spinning-up/no-tasks-yet still block (unit-pinned, 7 cases).
Branch fix/converged-oneshot @ be2026a, proposal in ADVERSARY-INBOX, awaiting approval per the M2
fix-forward protocol. Unit suite 199 passed + lint PASS from the cc-ci working-tree rsync.
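Option (c) can be sketched as a pure predicate. The data shape here (a `current`/`desired` replica pair plus a list of task state strings per service) and the function names are assumptions for illustration; the real `services_converged` reads swarm state and may differ.

```python
def service_settled(current, desired, task_states):
    """True if replicas match, or the whole deficit is finished one-shot work."""
    if current == desired:
        return True
    states = [s.lower() for s in task_states]
    # A deficit only counts as converged when there are tasks and every one
    # of them is Complete; Failed, mixed, still-preparing, or no-tasks-yet block.
    return current < desired and bool(states) and all(s == "complete" for s in states)


def services_converged(services):
    return all(
        service_settled(s["current"], s["desired"], s["task_states"])
        for s in services
    )


stack = [
    {"name": "app", "current": 1, "desired": 1, "task_states": ["running"]},
    {"name": "minio-createbuckets", "current": 0, "desired": 1, "task_states": ["complete"]},
]
```

With this rule the stack above counts as converged, while the same one-shot in `failed` or `preparing` state keeps the poll waiting until its deadline.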
## 2026-06-11 ~01:00Z — merge landed, queue shortened
be2026a approved (REVIEW a531746, cold-verified independently) and merged as 6cabbe7; drone build
350 green on the push head 914c166. Merged diff verified == branch diff (empty git diff be2026a..
main for the two files). Post-fix proof m2p2-lasuite-drive queued from a FRESH clone
/root/m2-postfix @6cabbe7 rather than git-updating /root/m2-sweep, because the serial queue's
discourse runs exec from m2-sweep and swapping code under an active/imminent run is how you get
unexplainable results. The discourse A/B therefore runs at 5c0676b (pre-converge-fix) — irrelevant
to discourse (no one-shots), and the Adversary's approval explicitly noted that.
Shortened the doomed m2p run: the generic install assert had already burned its 1800s converge
deadline and failed; the overlay install test then started an IDENTICAL second 1800s burn (same
assert_serving). SIGINT'd the overlay pytest child only — KeyboardInterrupt surfaced at
generic.py:97, the exact diagnosed converge-poll line (a nice live confirmation), and the
orchestrator advanced to the upgrade tier on its normal path. Teardown semantics untouched.
Disclosed in STATUS so the log's KeyboardInterrupt is pre-explained.
Drone API note for future me: no token on disk; fastest read-only check is docker cp the drone
sqlite out and query builds (documented in STATUS). The Gitea statuses API returned empty for
these shas (drone evidently doesn't post commit statuses here).
## 2026-06-11 ~00:55Z — discourse A/B closed (harness-neutral), mechanism still unattributed
m2p-discourse (new main, PR=2, @7ae7b0f) and ab-discourse-7ae7b0f-oldmain (old main, PR=2, same
ref) failed the upgrade IDENTICALLY: HC1, chaos-version=eb96de94+U, all other tiers pass, L2.
Same invocation as baseline 184 which was L4 five days ago. So: deterministic, harness-neutral,
and something outside both harnesses drifted since 06-05. Eliminated: branch-tip existence (7ae7b0f
still tips upgrade-0.8.0+3.5.0 + pr/2), upstream tag set (0.7.0+3.3.1 still latest), abra pin
(flake.lock untouched by the restructure). Not eliminated: abra-internal interaction with repo/app
state (the chaos stamp lands on the prev-base TAG commit despite the tree being at the PR head —
my best guess remains something in how abra resolves the version/commit for the chaos label when
COMPOSE_FILE includes the overlay and the project normalizes invalid, but m2r at 7d53d4ec stamping
correctly with the same dangling depends_on kills the simple version of that theory). The
`service "sidekiq" depends on...` line appears in passing AND failing upgrades, position-identical,
so it discriminates nothing. M2-wise the question is settled — the restructure is exonerated by
byte-identical old==new failure; chasing abra's stamp resolution further is post-phase work, filed
as a DEFERRED note rather than burning more M2 wall-clock on a non-rcust mechanism.
m2p2-lasuite-drive (the binding post-fix proof) auto-started at 00:48:58Z from /root/m2-postfix
@6cabbe7. Watching for: no 1800s converge burn after the one-shot completes, then L5.
## 2026-06-11 ~01:10Z — m2p2 green; "L5" turned out to be a moved goalpost (mainline, not ours)
m2p2-lasuite-drive: rc=0, 3m19s, all stages pass, OIDC + MinIO custom tests green, and the
fix-forward pair demonstrably exercised (one-shot overshot 90s again → best-effort line → late
Complete → converge fix admitted it). But results.json said level=4 where the binding condition
said L5 — heart-stopper until the git archaeology: run 189's level-5 + "L6 recipe-local N/A" cap
didn't match ANY derive_rungs I could find in either world, because the 6-rung ladder was removed
on MAIN by 46e2cdb+c51cd84 (PR #6) on 06-09, between the baseline runs and the merge — by the
mirror/report phase, not rcust. The merge didn't touch level.py (checked 01e6d49^1..01e6d49), and
run 204 on 06-09 (hours pre-deploy of the refactor) still shows 6 rungs — clean timeline. So the
baseline matrix's "L5" rows need a schema-equivalence reading, declared in STATUS BEFORE the claim
rather than negotiated after the Adversary trips on it. Lesson re-learned: a baseline matrix
should pin the SCHEMA VERSION of its evidence, not just the level number.
## 2026-06-11 ~01:30Z — M2 claim assembled
Drone-path runs landed green (356 immich#2 L4, 357 plausible#3 L4, both with embedded
customization manifests + clean flags, triggered by real !testme comments). Zero-leak verified
after everything. Plausible's missing screenshot.png checked against its other runs — it never
produces one (no screenshot surface), so not a capture regression. Claimed M2 with the full
21-recipe reconciliation table against the corrected baseline; the three lasuite rows ride the
Adversary-accepted L5≡L4+OIDC equivalence, bluesky-pds is the one justified exclusion, discourse
is reconciled as env-drift with byte-identical old==new evidence. Nothing else unblocked in this
phase while the verdict is out — holding per §7 case 2.
## 2026-06-11 ~01:20Z — M2 PASS → ## DONE
Adversary cold-verified the whole claim independently (re-ran the canaries themselves, jq'd all 21
run dirs, re-checked the drone DB and the zero-leak state) and passed M2 with no findings and no
VETO. M1 + M2 both stand; ## DONE written. Phase summary: 6 plan phases landed on one branch,
merged after M1; the real-CI sweep then caught exactly TWO genuine regressions (both in the same
lasuite-drive P2b hook port: raise-on-timeout, and one-shot-vs-converge ordering), both root-caused
live, fixed forward under approval, and proven end-to-end — plus it surfaced two pre-existing
environment drifts (discourse upgrade-HC1, bluesky-pds upstream image) that the A/B discipline
kept from being misattributed to the restructure. The sweep-as-safety-net worked as designed.

JOURNAL-shot.md Normal file

@ -0,0 +1,40 @@
# JOURNAL-shot.md — Builder journal, phase `shot`
## 2026-06-11 ~01:17-01:35Z — phase open, P1+P2 in one sweep
Read the phase plan + plan.md §6.1/§7/§9. Enumerated enrolled recipes (19). Pulled per-recipe
latest-run data off cc-ci (`results.json` screenshot field + PNG size for all ~190 run dirs),
scp'd 18 PNGs to /tmp/shot-audit/ and Read every one of them.
Findings vs the orchestrator pre-audit: all four 4801-2B suspects are indeed blank frames
(immich pure white, lasuite-meet white, n8n off-white, cryptpad grey). keycloak 8.7KB is a
"Loading the Administration Console" spinner — NOT a sparse login page as §2 guessed.
lasuite-docs/drive ~5.9KB are lone spinners. Two surprises: (1) mattermost-lts 242KB, classed
healthy by size, is actually the brand splash/loading screen, not the login form — size
heuristics lie in both directions; (2) mumble serves a real web page (mumble-web client per
compose.mumbleweb.yml, deployed since Phase 2 for HTTP health) showing its connecting spinner —
so mumble is fixable, not an N/A.
plausible root cause: traced via Drone sqlite (no python3 on host; ran alpine+sqlite3 against
the drone data volume). Build 357 log t=73s: capture failed, last status=500 after 45s. Cross-ref
tests/plausible/functional/test_health_check.py: `/` 500s via auth_controller under
DISABLE_AUTH=true — permanent, not an init race. So the default landing capture can never work;
plausible needs a SCREENSHOT hook to a path that renders (will probe /login, /sites on a live
deploy during P3).
bluesky-pds: null because install fails at level 0 (upstream image breakage, already in
DEFERRED.md from rcust) — capture gated on deploy_ok, correctly skipped. N/A while upstream broken.
custom-html nginx-welcome: verified no install-time seeding exists for this recipe (custom-html-tiny
has install_steps.sh; custom-html only seeds in pre_backup/pre_upgrade ops, after capture). The
nginx default page IS the honest fresh-install view. Leaving OK; flagged in matrix for Adversary.
Adversary opened REVIEW-shot.md with its own cold pre-audit (4f3a747) before my first push —
good: my visual reads agree with theirs on every overlapping row.
Design thinking for P3 (next iteration): default-path improvement = after goto(domcontentloaded),
try a bounded `wait_for_load_state("networkidle")` (~10-15s cap) and/or wait for a non-trivial
painted body, then screenshot; then a blank-detect (PNG < ~6KB or near-uniform) → one retry with
a longer settle. Keep total ≤ ~60s worst case, all inside the existing capture() try/except so R7
(cosmetics never block) is preserved. Unit tests: blank-detector pure function + retry logic with
a fake page. Per-recipe hooks only for plausible (500 root) + whatever the re-audit still shows.
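The design above (bounded settle, blank-detect, one longer retry, testable with a fake page) might be sketched like this. It covers only the size half of the blank check, not the near-uniform-pixel half. The threshold, the two timeouts, and all names are assumptions; `wait_for_load_state` and `screenshot` mirror the Playwright sync `Page` methods of the same names.

```python
BLANK_MAX_BYTES = 6 * 1024  # assumed cut-off; the blank frames seen were ~4.8KB


def looks_blank(png, max_bytes=BLANK_MAX_BYTES):
    """Size-only blank heuristic: tiny PNGs are treated as unpainted frames."""
    return len(png) < max_bytes


def capture_with_retry(page, settle_ms=(10_000, 30_000)):
    """Settle, screenshot, and retry once with a longer settle if blank."""
    png = b""
    for timeout in settle_ms:
        try:
            page.wait_for_load_state("networkidle", timeout=timeout)
        except Exception:
            pass  # settle is best-effort; a page that never idles still gets shot
        png = page.screenshot()
        if not looks_blank(png):
            break
    return png


class FakePage:
    """Test double that returns pre-canned frames in order."""

    def __init__(self, frames):
        self._frames = list(frames)
        self.waits = []

    def wait_for_load_state(self, state, timeout=None):
        self.waits.append((state, timeout))

    def screenshot(self):
        return self._frames.pop(0)
```

The two settle budgets sum to 40s, which leaves headroom under a ~60s worst case; the caller would still wrap the whole thing in the existing best-effort try/except.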


@ -30,3 +30,413 @@ Current-system facts I will hold the restructure against:
same PR blocks visibly, one full green run. NEVER merge/push recipe mirror repos.
No verdict yet — waiting for Builder bootstrap/claim.
## 2026-06-10T04:05Z — cold-verify environment established (prep, no gate)
Builder seeded STATUS/BACKLOG/JOURNAL-conc; STATUS says P1 in flight, no gate claimed. Mapped the
test-execution environment I'll use for the M1 cold run so a time-sensitive gate isn't spent
debugging tooling:
- Local VM devshell (`nix develop`) has only lintTools (no pytest). So pytest does NOT run here.
- pytest 8.3.3 + playwright live in the host `pyEnv` (nix/modules/harness.nix) exposed as
`cc-ci-run` on cc-ci. `cc-ci-run -m pytest <path> -q` works as the real harness interpreter
(verified: `cc-ci-run -c "import pytest" -> 8.3.3`).
- `.drone.yml` lint stage runs `nix develop .#lint --command bash scripts/lint.sh`.
- COLD M1 PLAN: fresh `git clone`/checkout of `restructure/concurrency` into a throwaway dir ON
cc-ci → `cc-ci-run -m pytest tests/unit -q` + `cc-ci-run -m pytest tests/concurrency -q` +
`nix develop .#lint --command bash scripts/lint.sh`, all from that clean checkout (not the
Builder's working tree). Then adversarial diff review per my baseline hit-list.
- Baseline `.drone.yml` on main is still the pre-restructure version (concurrency.limit=2,
acquire_recipe_lock / /run/cc-ci-active registry referenced) — confirms P1/P4 edits are
branch-only so far. Good.
## 2026-06-10T04:23Z — early pre-review of P1+P2 (branch @b302f3a, NO gate claimed — NOT a verdict)
Builder has pushed P1 (b492f99) + P2 (b302f3a) to restructure/concurrency; P3/P4/P5/tests still
pending, so M1 is not claimable and this is NOT a PASS — it's pre-review to front-load the M1 diff
audit and avoid re-doing it under gate time pressure. Read code/diff + git only; did NOT read
JOURNAL (anti-anchoring intact). I actively tried to break the following and each concern was
REFUTED:
1. **Green-on-red via the .drone.yml EXIT trap** (my lead hypothesis). The wrapper is
`setsid cc-ci-run … & PID=$!; trap 'kill -TERM -- -$PID' TERM EXIT; wait $PID`. I worried the
EXIT trap's final `kill` status would override the harness exit code and mask a failing run.
EMPIRICALLY TESTED (4 bash repros incl. failing harness with a lingering group member that
makes kill succeed=0): bash PRESERVES the pre-trap exit status when the EXIT trap doesn't call
`exit`. Exit code propagates correctly in all cases (RED stays RED, GREEN stays GREEN). Refuted.
2. **P2 unlink/reacquire inode race** (janitor unlinks a reaped orphan's lockfile while a new run
blocks on the old inode). Handled: both acquire_app_lock and _probe_and_reap recheck
`fstat(fd).st_ino == stat(path).st_ino` after acquiring and retry/bail on mismatch — a lock on
an unlinked (anonymous) inode is never treated as authoritative, and the path's lockfile is
never unlinked out from under a newer run. Refuted.
3. **Half-reaped/new-app coexistence.** Reap runs WHILE HOLDING the probe lock; a new same-domain
run blocks in acquire_app_lock until reap completes. The pre-deploy window (lock held, app not
yet created) is covered: the stale-lockfile sweep sees the held lock (BlockingIOError) and
leaves it. Refuted.
4. **Signal mid-normal-teardown aborting cleanup.** begin_teardown() is the FIRST line of BOTH
finally blocks (run_recipe_ci.py:663 run_quick, :1134 main); the _funnel_handler swallows
(logs+returns) any SIGTERM/SIGALRM once tearing_down is set, so a second signal can't abort the
cleanup the first asked for. install_lifetime_guards() is the FIRST statement of main() (:829),
before any abra/lock call, with prctl→ppid==1 recheck in the correct order. Refuted.
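The inode identity recheck from item 2 can be illustrated with a small standalone sketch. This is not `acquire_app_lock`; the function name and retry loop are assumptions, and it relies only on `fcntl.flock`, `os.fstat`, and `os.stat` on a Unix host.

```python
import fcntl
import os


def acquire_lock(path, blocking=True):
    """Take an exclusive flock on path, retrying if the file was swapped underneath us."""
    flags = fcntl.LOCK_EX | (0 if blocking else fcntl.LOCK_NB)
    while True:
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            fcntl.flock(fd, flags)
        except BaseException:
            os.close(fd)
            raise
        try:
            # The lock is only authoritative if the path still names the
            # inode we locked; a janitor may have unlinked it while we waited.
            same = os.fstat(fd).st_ino == os.stat(path).st_ino
        except FileNotFoundError:
            same = False
        if same:
            return fd
        os.close(fd)  # lock on an unlinked (anonymous) inode: drop it and retry
```

Without the recheck, a waiter that finally gets the lock on an already-unlinked file would believe it holds the app lock while a newer run locks a fresh file at the same path.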
Open items to confirm AT M1 (cold, full suite) — NOT defects, just unverified-until-then:
- `datetime` import removed from lifecycle.py along with _stack_age_seconds — grep for any
remaining datetime use (ruff would catch an undefined name; confirm import truly orphaned).
- `_stack_name` / age-fallback deadcode after the janitor rewrite — confirm no dangling refs.
- Registry-symbol deletion is only PARTIAL on this commit: acquire_recipe_lock still present
(P3 deletes it); register/unregister/_run_owner_state/ACTIVE_RUN_DIR/CCCI_JANITOR_MAX_AGE are
gone — full dangling-ref grep belongs at M1 once P3 lands.
- setsid-fork edge: if `setsid` ever forks (only when it's a pgrp leader; not the case for a
backgrounded job in a non-job-control drone shell), $PID would be the intermediate and the
harness would reparent to ppid==1 and self-abort. Live-verify the trap+cancel path at M2(a).
- begin_teardown is process-global module state (lifetime._state) — fine for one harness process;
the tests/concurrency suite must not import-share it across in-process cases (verify at M1).
## 2026-06-10T04:32Z — pre-review P3+P4 (branch @91d3cc7, NO gate claimed — NOT a verdict)
Builder pushed P3 (17ebdf3 per-run ABRA_DIR) + P4 (91d3cc7 config cleanup). tests/concurrency +
P5 docs still pending, so M1 still not claimable. Continued the front-loaded diff audit (code/git
only; JOURNAL still unread). Findings — all CLEAN:
- **Dangling-ref grep across runner/bridge/dashboard/nix = ZERO hits** for all 9 deleted symbols:
acquire_recipe_lock, register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR,
CCCI_JANITOR_MAX_AGE, RECIPE_LOCK_DIR, _stack_age_seconds, _registry_path. The orphaned
`datetime` import is also gone from lifecycle.py. Clean deletion.
- **Path centralization**: all `~/.abra/recipes/<recipe>` literals replaced by `abra.recipe_dir()`
(resolves `$ABRA_DIR else ~/.abra`) across abra.py (recipe_checkout, has_lightweight_version_tags,
recipe_head_commit, recipe_versions), generic._recipe_dir, lifecycle.prepull_images,
snapshot_recipe_tests, fetch_recipe. prepull's env_path stays canonical `~/.abra/servers/...`
which is correct (servers/ is the shared symlink target).
- **Ordering verified** (main(), the only structural risk): install_lifetime_guards() is the FIRST
stmt (873); between it and setup_run_abra_dir() (891) there are ONLY env reads + a print — no
abra call; ABRA_DIR is exported at 891 BEFORE fetch_recipe (892) and before the first path-helper
recipe_head_commit (895). The `--quick` dispatch (run_quick, ~908) is AFTER 891, so the quick lane
inherits the per-run ABRA_DIR too. No tree is touched before ABRA_DIR is set.
- **Manual-run isolation**: rid=="manual" → "manual-<pid>" so two hand-runs don't share a tree.
Open items to confirm AT M1 (cold) — not defects:
- setup_run_abra_dir symlink idempotency: `if not os.path.islink(link): os.symlink(...)` — if a
NON-symlink file pre-exists at servers/catalogue (reused run dir from a crashed partial), symlink
raises FileExistsError. Low risk (fresh run-id per Drone build) but worth a glance.
- CCCI_SKIP_FETCH=1 now `rm -rf dest` + copytree(canonical, dest, symlinks=True) — confirm the
--quick rollback-proof staging tests still pass (they set CCCI_SKIP_FETCH).
- tests/{ghost,discourse}/install_steps.sh RECIPE_DIR=${ABRA_DIR:-$HOME/.abra} mechanical path fix
— confirm it changed NO assertion/gate (guardrail: never weaken recipe-test gates). Diff-check.
Net: the entire P1-P4 diff has been pre-audited and is clean against my break-it hit-list. M1 cold
run, once claimed (after tests/concurrency + P5 land), reduces to: fresh checkout on cc-ci →
`cc-ci-run -m pytest tests/unit -q` + `cc-ci-run -m pytest tests/concurrency -q` + lint, plus a
focused review of only the tests/concurrency suite (vs the 19 planned cases) and the P5 doc delta.
## M1: PASS @2026-06-10T04:38Z — implementation verified (branch restructure/concurrency @d3fe9e2)
Verdict formed from the plan (SSOT), the code/git, the STATUS claim's verify recipe, and my own
COLD acceptance run — WITHOUT reading JOURNAL first (anti-anchoring honored; noting here that I had
NOT consulted JOURNAL-conc at verdict time).
COLD ENVIRONMENT: fresh `git clone --branch restructure/concurrency` into /tmp/adv-m1 on cc-ci
(NOT the Builder's tree); `git rev-parse HEAD == d3fe9e26bb0fbaedb37383539ba3973bc1c80aff` (matches
claim), `git status` clean. Ran via the host `cc-ci-run` pyEnv (pytest 8.3.3 + playwright) and the
pinned `.#lint` devshell.
ACCEPTANCE RESULTS (expected → observed):
- `cc-ci-run -m pytest tests/unit -q` → 138 passed in 4.72s ✓ (claim: 138 passed)
- `cc-ci-run -m pytest tests/concurrency -q` → 20 passed in 9.91s ✓ (claim: 20 passed)
- `nix develop .#lint --command bash scripts/lint.sh` → `lint: PASS` ✓
- `pytest tests/unit --collect-only` concurrency items → 0 ✓ (suite NOT in default gate)
- dangling-ref grep (register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR,
CCCI_JANITOR_MAX_AGE, acquire_recipe_lock, RECIPE_LOCK_DIR, _stack_age_seconds) over
*.py/*.nix/*.yml/*.sh → ZERO hits outside docs/ ✓
GATE-INTEGRITY (guardrails honored):
- `RUN_APP_RE` regex unchanged (lifecycle.py:26, identical pattern); warm/canonical apps still
never become probe candidates (test_11 asserts no lockfiles even created for warm names).
- `services_converged()` / paused-is-settled / `backup_app()` waits: NOT in the code diff — all
RUN_APP_RE/services_converged/paused diff hits are docs/concurrency.md prose (P5 rewrite).
- `teardown_app` ordering untouched; only its trailing unregister call removed (registry gone).
- Only `tests/<recipe>/` change is the mechanical `RECIPE_DIR=${ABRA_DIR:-$HOME/.abra}/...` line
in ghost+discourse install_steps.sh — NO assertion/gate touched (diff-confirmed). Guardrail
"never weaken recipe-test gates / touch tests/<recipe>/ content" honored.
- P4: `concurrency.limit` block removed from .drone.yml; drone-runner.nix comment makes
DRONE_RUNNER_CAPACITY the single knob.
ADVERSARIAL DIFF REVIEW (P1-P4 pre-audited in the two notes above; refuted: green-on-red exit-code
masking [empirically tested], unlink/reacquire inode race [fstat==stat identity recheck],
half-reaped coexistence [reap-under-probe-lock], signal-mid-teardown reentrancy [begin_teardown
first line of both finally blocks], guard/ABRA_DIR/fetch ordering [no abra call pre-export]).
TEST-SUITE AUDIT vs the 19 plan cases: real kernel flocks, NEVER mocked (only teardown_app +
abra-discovery stubbed, both disclosed). Coverage complete: cases 1-4 test_locks, 5-12
test_janitor, 13-16 test_lifetime, 17-19 test_abra_dir, +test_18b (manual-pid isolation) = 20.
Assertions are substantive, not tautological: exact funnel exit codes 142/143 (test_15/16),
reap-vs-new-run timestamp ordering + fresh-inode `lock_state=="held"` (test_7), two-janitor
arbitration via separate open()s (test_8 — valid: flock binds the open file description, so
threads-with-distinct-fds model processes), long-held mtime-backdate flag-not-steal (test_10),
PEP 446 fd non-inheritance with a surviving child (test_3), divergent per-run trees + canonical
untouched (test_18).
INDEPENDENT PROBE (my own driver, NOT the Builder's helpers.py): drove the real
`lifecycle.acquire_app_lock` from a standalone script with a sandbox CCCI_APP_LOCK_DIR on cc-ci →
state `held` after acquire; a second acquirer BLOCKED while the first held (no ack2 after 1.5s);
after `SIGKILL` of the holder the second acquired within 10s (kernel auto-release). Core invariant
confirmed against the real code, not just the Builder's tests.
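A minimal sketch of the invariant probed above, not the harness's `acquire_app_lock`: an exclusive `flock` with the fstat==stat identity recheck that the diff review credits for closing the unlink/reacquire race. The function name, lock path and retry loop are assumptions made for illustration.

```python
import fcntl
import os
import tempfile


def acquire_lock(path, blocking=True):
    """Take an exclusive flock on `path`; return the open fd, or None if busy.

    After locking, recheck that the fd still refers to the file currently at
    `path` (fstat vs stat). If a concurrent tidy-up unlinked and recreated the
    file, the lock sits on an orphaned inode and the acquire must be retried.
    """
    flags = fcntl.LOCK_EX | (0 if blocking else fcntl.LOCK_NB)
    while True:
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            fcntl.flock(fd, flags)
        except BlockingIOError:
            os.close(fd)
            return None
        try:
            on_disk = os.stat(path)
        except FileNotFoundError:
            on_disk = None
        held = os.fstat(fd)
        if on_disk and (on_disk.st_dev, on_disk.st_ino) == (held.st_dev, held.st_ino):
            return fd
        os.close(fd)  # stale inode: drop it and try again


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.mkdtemp(), "example-app.lock")
    first = acquire_lock(lock_path)
    # A second, independent open() is a distinct open file description, so it
    # contends with the first even inside the same process.
    print("second while held:", acquire_lock(lock_path, blocking=False))
    os.close(first)  # closing the fd releases the lock, as process death would
    second = acquire_lock(lock_path, blocking=False)
    print("second after release:", second is not None)
    os.close(second)
```

Because the lock belongs to the open file description, closing the last descriptor (including when the kernel does so for a killed process) is what releases it; no cleanup code has to run in the holder.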
NON-BLOCKING NOTES (carry to M2 live-verify; none gate M1):
- setsid-fork edge in the .drone.yml trap wrapper: if `setsid` ever forks (only when it's a pgrp
leader — not the case for a backgrounded job in a non-job-control drone shell), $PID would be the
intermediate and the harness could reparent (ppid==1) and self-abort. MUST be live-verified by
the actual drone-cancel path at M2(a) — the plan already flags this ("verify drone exec runner
signal delivery; the trap must fire on drone cancel"). Not unit-testable here.
- End-of-janitor stale-lockfile tidy sweep (appless leftover lockfile unlink) is not directly
covered by a named test (not one of the 19); low risk (tidiness only). Noted, not a defect.
- test_14 (ppid race) depends on the helper reparenting to pid 1; under a subreaper it marks
NEVER_REPARENTED and FAILS VISIBLY (never false-passes). Passed in this env.
CONCLUSION: M1 — implementation verified — PASS. M2 (merge to main + live verification a-d) is
unblocked. Reminder for both loops: recipe-mirror PRs are !testme targets only — never merge/push
them. (After this verdict I may consult JOURNAL-conc to contextualize, per §6.1.)
## 2026-06-10T04:49Z — M2 merge integrity pre-check (M2 NOT yet claimed — not a verdict)
Builder merged the branch to main (merge commit `bb5eb3d`, 2 parents 83a6c6e∘d3fe9e2, no force)
after my M1 PASS, and is mid-M2 live verification (journal: M2(a) cancel-mid-run evidence, (b)
parallel runs triggered). No `claim(conc): M2` commit yet; STATUS-conc still shows the stale M1
line (Builder's file — will update at the M2 claim). Independent merge check:
- `git diff bb5eb3d d3fe9e2 -- runner/ .drone.yml docs/concurrency.md tests/ nix/` = EMPTY → the
merge preserved EXACTLY the code I cold-verified at M1. No conflict-resolution drift introduced.
- `git merge-base --is-ancestor d3fe9e2 bb5eb3d` = true.
So deployed main == M1-verified tree. At the M2 claim I therefore re-verify only LIVE behavior +
the push build, not the code again:
push build green; (a) cancel mid-run → no leaked python/lock, next janitor reaps the app, zero
leakage; (b) two parallel !testme (immich#2 + plausible#3) → both green, zero leakage; (c)
double-!testme same PR → 2nd blocks on the app lock (visible in its drone log) then runs; (d) one
full green end-to-end run. Evidence to come from Drone build logs + cc-ci state (abra app ls /
lslocks / docker), cold from my own access path.
## 2026-06-10T05:00Z — wrapper exit-code fix verified + CORRECTION to my P1 pre-review (inbox consumed)
Consumed ADVERSARY-INBOX.md (deleted) — Builder reported an M2 live-verify finding + fix. Folded in:
**The defect (real, Builder-found, build 269 plausible#3):** the drone exec step shell is `set -e`.
On a NORMAL (green) harness exit the P1 EXIT trap still fired and its `kill -TERM -- -$PID` of the
already-exited process group returned ESRCH (exit 1), which under `set -e` poisoned the step's exit
status to 1 — a fully GREEN run (all tiers pass, level=4) reported RED.
**CORRECTION — my P1 pre-review was wrong on this point.** In my 04:23Z pre-review I claimed to have
"empirically tested" green-on-red exit-code masking and REFUTED it. That test was run with plain
`bash -c` WITHOUT `set -e` — the wrong shell mode. The real drone step runs `set -e`, where the bug
manifests. I re-ran the matrix correctly now (bash -e), reproducing the bug (old wrapper + green +
set -e → exit 1) and confirming I had the shell mode wrong. Lesson: model the EXACT runtime
(set -e) for shell-trap behavior. The Builder caught this live; I did not. Owning it.
NB the failure direction was false-RED (green reported red) — fail-safe-ish, not a green-on-red
(no failing run was ever reported green); still a real defect.
**The fix (e1c4198 on branch, merged to main b7a009c) — independently verified by me, cold under
`set -e` (the correct mode this time):**
```
setsid cc-ci-run runner/run_recipe_ci.py & PID=$!
trap 'kill -TERM -- "-$PID" 2>/dev/null || true' TERM EXIT
rc=0; wait "$PID" || rc=$?
trap - TERM EXIT
exit "$rc"
```
My 4-path matrix (all under `bash -e`, exact-shape repros):
- A green harness → step exit 0 ✓ (poisoning gone: `|| true` on the trap kill + `trap - EXIT` before exit)
- B **red harness (exit 7) → step exit 7 ✓ — NOT masked to green.** Critical false-GREEN check
PASSES: `wait || rc=$?` captures the real rc and `exit "$rc"` propagates it. The
"failing PR must report RED" gate is preserved by the fix.
- C old wrapper + green + set -e → exit 1 ✓ (bug reproduced — root-cause confirmed)
- D cancel (TERM to wrapper mid-wait) → wrapper exits 143 AND the child received TERM
(CHILD_GOT_TERM logged) ✓ — cancel-forwarding semantics unchanged; the `trap - TERM EXIT` runs
only AFTER `wait` returns (post-forward), so it can't disarm the forward during a real cancel.
Verdict on the fix: CORRECT and SAFE — resolves the false-RED poisoning without introducing
false-GREEN, and preserves cancel forwarding. Folds cleanly into the pending M2 review.
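The same three properties (real exit code propagated, TERM forwarded to the whole group, nothing poisoned on a normal exit) can be sketched in Python. This is an analogue for reasoning about the wrapper, not a replacement for the `.drone.yml` shell; `run_forwarding` is a hypothetical name.

```python
import os
import signal
import subprocess
import sys
import threading


def run_forwarding(cmd):
    """Run `cmd` in its own session and return its real exit code."""
    child = subprocess.Popen(cmd, start_new_session=True)  # like `setsid ... &`

    def forward(signum, _frame):
        try:
            os.killpg(child.pid, signum)  # like `kill -TERM -- "-$PID"`
        except ProcessLookupError:
            pass  # group already gone: the `|| true` of the shell version

    # Python only allows signal handlers to be installed from the main thread.
    in_main = threading.current_thread() is threading.main_thread()
    previous = signal.signal(signal.SIGTERM, forward) if in_main else None
    try:
        rc = child.wait()  # like `rc=0; wait "$PID" || rc=$?`
    finally:
        if in_main:
            # like `trap - TERM EXIT`: only disarm after the wait has returned
            restore = previous if previous is not None else signal.SIG_DFL
            signal.signal(signal.SIGTERM, restore)
    # Popen reports death-by-signal as a negative number; map it to the
    # shell convention of 128 + signal number (143 for SIGTERM).
    return 128 - rc if rc < 0 else rc


if __name__ == "__main__":
    green = run_forwarding([sys.executable, "-c", "raise SystemExit(0)"])
    red = run_forwarding([sys.executable, "-c", "raise SystemExit(7)"])
    print("green:", green, "red:", red)
```

Paths A and B of the matrix correspond to the two demo calls: a green child yields 0 and a red child yields its own non-zero code unchanged.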
**M1 status unaffected:** M1 PASS was for the code/suites/lint/diff of d3fe9e2; this wrapper
exit-code-under-set-e is a LIVE behavior M1's checks could not exercise (the trap only runs in the
real drone exec shell). main now = d3fe9e2 + this .drone.yml wrapper fix; the fix is verified above.
Open for the formal M2 verdict: re-confirm lint green on the new .drone.yml (yamllint), the push
build green, and live (a) cancel-no-leak / (b) parallel both-green / (c) double-!testme blocks /
(d) one full green run — cold, once the Builder posts the M2 claim with evidence.
## M2(c): FAIL @2026-06-10T08:10Z — double-!testme same domain corrupts shared deploy-count → both runs RED + VETO
Proactive cold break-it probe of the live M2 evidence (M2 not yet formally `claim(conc)`'d — the
Builder's JOURNAL shows (c) "triggered" but NOT evidenced as PASS; I went straight to the Drone API
to verify the in-flight (c) runs independently, not to the JOURNAL narrative). I found a REAL defect
that breaks M2(c). Filed as BACKLOG-conc CONC-A1.
EVIDENCE (Drone API, recipe-maintainers/cc-ci, cold via /run/secrets/bridge_drone_token — my own
access path, not the Builder's word):
- (c) = builds **279 + 281**, both `event=custom PR=2 RECIPE=immich REF=a92b28d…` → SAME domain
`immi-ad3e33.ci.commoninternet.net`. Both `status=failure` (step `ci` exit_code=1).
- 281 (the blocked run): log `== app lock: ... in flight — waiting ==` @2s → `== acquired ==` @194s,
which is exactly when 279's process exited (279 finished 05:07:35Z). **Lock serialisation + the
visible block line WORK** — that half of (c) is fine.
- 279 RED: `!! deploy-count 2 != 1 (DG4.1 violation)`.
- 281 RED: `FileNotFoundError: /tmp/ccci-deploys-immi-ad3e33….ci.commoninternet.net` at
run_recipe_ci.py:1213.
- Control build 275 (isolated immich, same fixed wrapper) → `deploy-count = 1`, GREEN. Confirms the
failure is concurrency-specific, NOT a pre-existing immich/wrapper regression.
ROOT CAUSE (code, confirmed):
- DG4.1 counter file is DOMAIN-keyed in shared /tmp, not per-run: `run_recipe_ci.py:930
/tmp/ccci-deploys-<domain>`. P3 isolated ABRA_DIR per run but this per-run state file was missed
(predates the restructure, ef44d46; the old recipe-flock serialised same-recipe runs end-to-end,
masking it).
- `deploy_app()` calls `_record_deploy()` (lifecycle.py:250) BEFORE `acquire_app_lock()` (:254,
introduced by P2 b302f3a) → the increment races OUTSIDE the lock. 281's single pre-lock
`_record_deploy` (@2s) bumps the shared counter 279 is using (→2, false violation), and 279's
end-of-run `os.remove(countfile)` (:1215) deletes the file under 281 → FileNotFoundError.
- Interleaving is fully reconstructed and self-consistent with the build timestamps (see CONC-A1).
This is squarely in M2(c) scope: the plan's DoD (c) requires the second run to "block … then RUN"
(implicitly green), and the phase's whole premise is "two concurrent !testme don't collide on
domain/volume/secrets." This is a domain-keyed-state collision — the restructure's narrower domain
lock no longer covers the deploy-count file. M1 (code/suites/lint/diff of d3fe9e2) is unaffected —
this is a live concurrency behavior M1's checks could not exercise; the tests/concurrency suite has
the matching blind spot (case 4 serialises acquire but never asserts deploy-count isolation across
two same-domain runs).
## VETO — M2 may NOT be marked DONE until CONC-A1 is fixed and I log a fresh (c) PASS
Forbidding `## DONE` in STATUS-conc until: (1) deploy-counter keyed per-run; (2) a tests/concurrency
case asserts same-domain deploy-count isolation; (3) live (c) re-run shows BOTH builds GREEN with
the visible block line and zero leakage; (4) (a),(b),(d) re-confirmed unaffected. Only I clear this.
(After this verdict I may consult JOURNAL-conc to contextualise — noting I had NOT read the (c)
journal reasoning before forming this FAIL; I verified from the Drone API + code directly.)
## 2026-06-10T08:20Z — CONC-A1 fix CODE-verified (veto conditions 1+2 met; 3+4 still pending — NOT cleared)
Builder fixed CONC-A1 (b6e12ef, merged main 139e319) and is re-running M2 live (a)-(d). I
cold-verified the FIX CODE from my own clone + a fresh checkout on cc-ci (not the Builder's word):
- **Condition (1) per-run keying — MET.** `run_recipe_ci._run_state_path(name)` keys all four
run-scoped state files (`deploys`, `opstate`, `deps`, `depskip`) by `run_id()` + `os.getpid()`,
never domain. Grep: ZERO residual `ccci-<state>-{domain}` literals in prod code (only the
app-LOCK path stays domain-keyed, which is correct). All consumers env-read `CCCI_*_FILE`
(lifecycle:148, deps:72/155, generic:134) — no path re-derivation. Uniqueness holds even in the
manual fallback (`run_id()`→domain) because the `+pid` suffix separates two processes.
- **Condition (2) same-domain isolation test — MET, and proven non-tautological.**
tests/concurrency/test_run_state.py adds test_20/20b/20c. test_20c drives REAL processes + the
REAL lock + real `_run_state_path`/`_record_deploy`, reproducing the 279/281 interleaving: run A
reads `COUNT 1` (NOT polluted to 2 by B's pre-lock increment) and B's file survives A's remove
(no FileNotFoundError). **Mutation check (my own):** reverting `_run_state_path` to domain-keying
in a throwaway cc-ci clone → all 3 test_run_state cases FAIL (incl. test_20c). So the test
genuinely guards the fix.
- **Suites cold (fresh clone @4f6c955 on cc-ci):** unit 138 passed, concurrency 23 passed (was 20),
concurrency still NOT collected by the default `pytest tests/unit` run (0). lint not re-run here
(no .drone.yml/nix change in the fix; will confirm at the M2 claim).
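
To make the keying argument concrete, here is a small replay of the 279/281 interleaving under both schemes. It is a sketch with made-up helper names (`state_path`, `record_deploy`, `replay`) and made-up run keys, not the harness's `_run_state_path` or `_record_deploy`.

```python
import os
import tempfile


def state_path(root, name, key):
    """Path of a run-scoped state file; `key` decides who shares it."""
    return os.path.join(root, f"ccci-{name}-{key}")


def record_deploy(path):
    """Increment the deploy counter stored at `path` and return the new value."""
    try:
        with open(path) as fh:
            count = int(fh.read() or 0)
    except FileNotFoundError:
        count = 0
    with open(path, "w") as fh:
        fh.write(str(count + 1))
    return count + 1


def replay(root, key_a, key_b):
    """Replay the interleaving for two runs A and B on the same domain.

    B's pre-lock increment lands while A is still running; A then reads its
    counter and removes its file; finally B reads its own counter.
    Returns (count seen by A, count seen by B or None if B's file vanished).
    """
    path_a = state_path(root, "deploys", key_a)
    path_b = state_path(root, "deploys", key_b)
    record_deploy(path_a)  # A deploys
    record_deploy(path_b)  # B increments before it blocks on the app lock
    with open(path_a) as fh:
        seen_a = int(fh.read())  # A's end-of-run deploy-count check
    os.remove(path_a)  # A's end-of-run cleanup
    try:
        with open(path_b) as fh:
            seen_b = int(fh.read())
    except FileNotFoundError:
        seen_b = None
    return seen_a, seen_b


if __name__ == "__main__":
    domain = "app.example.test"
    # Domain keying: both runs share one file.
    print(replay(tempfile.mkdtemp(), domain, domain))  # (2, None)
    # Per-run keying: run id plus pid (values here are invented).
    print(replay(tempfile.mkdtemp(), "run-a-101", "run-b-202"))  # (1, 1)
```

The first call reproduces both round-1 symptoms at once (A sees 2, B's file is gone); the second shows why a key that differs per process removes the collision without needing the lock to cover the counter.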
**VETO NOT cleared.** Conditions (3) live (c) re-run BOTH builds GREEN + visible block line + zero
leakage, and (4) (a)/(b)/(d) re-confirmed on the fixed harness, still require the Builder's live
evidence (in flight). The code fix strongly predicts a (c) pass but M2 is a LIVE gate — I will
re-verify the (c) double-!testme cold from the Drone API once the Builder posts the M2 claim, and
only then clear the veto.
## 2026-06-10T08:43Z — live (c) round-2 (builds 290+291): serialization CONFIRMED via lslocks; delay is an immich-ML flake, NOT the restructure (not a verdict)
(b)+(d) re-passed on the fixed harness (builds 287 immich#2 + 288 plausible#3, parallel, both
success — I'll re-confirm at the M2 claim). (c) round 2 = builds 290+291 (both custom PR=2 immich,
same domain immi-ad3e33), started 08:22:30Z. I inspected the LIVE host state cold (my own ssh):
- **CORE INVARIANT DIRECTLY OBSERVED in the kernel lock table** — strongest possible proof of the
double-!testme serialization:
`lslocks`: pid 739163 (build 290) holds `WRITE` on cc-ci-app-immi-ad3e33….lock; pid 739341
(build 291) is blocked `WRITE*` on the SAME lock. Exactly one holder, one waiter, one inode.
- 290 (holder) is sleeping in `services_converged()` poll (hrtimer_nanosleep, no abra child) because
`immich-machine-learning` is stuck 0/1: its container repeatedly fails the healthcheck
(`non-zero exit (143): dockerexec: unhealthy container`, swarm restarting every 16 min). Current
attempt (08:43) has gunicorn up, health `starting` — slow/flaky ML readiness, not a deploy break.
- NOT caused by the restructure / teardown: 290's immich volumes (model-cache/postgres/uploads) +
.env are all from 290's OWN fresh deploy (08:23), not inherited from the earlier same-domain run
287. ML image present (1.36GB, no pull), host healthy (5.2Gi mem free, 65G disk). So this is an
immich-ML healthcheck flake, orthogonal to concurrency.
Bearing on M2(c): the SERIALIZATION mechanism under test is verified working live. The "both GREEN"
half of condition (3) is not yet demonstrated only because 290 is flake-blocked on immich-ML; if 290
REDs on deploy-timeout, (c) needs a clean re-run (flake, not a code fault). VETO unchanged — I still
require one clean (c) where both same-domain builds go GREEN with the block line + zero leakage.
Continuing to watch 290/291 to terminal.
## M2(c): PASS @2026-06-10T09:05Z — double-!testme same domain, CONC-A1 fixed; VETO LIFTED
(c) round-2 builds 290+291 (both `custom PR=2 immich`, same domain immi-ad3e33, on CONC-A1-fixed
main) both reached terminal **status=success**. Cold-verified from the Drone API + live host (my own
access path), not the Builder's word:
- **Both GREEN:** 290 success, 291 success (Drone API).
- **Visible block line (the (c) requirement):** 291 log —
`== app lock: another run of immi-ad3e33….ci.commoninternet.net is in flight — waiting ==`
then `== app lock: acquired … ==`. I ALSO observed the serialization directly in the kernel lock
table mid-run (lslocks: 290 held WRITE, 291 blocked WRITE* on the same inode; after 290 exited,
291 held it). Strongest possible proof of the double-!testme serialization invariant.
- **CONC-A1 regression GONE — the two exact round-1 failure points are now clean:**
- 290 (round-1 build 279 got false `deploy-count 2 != 1`) → now `deploy-count = 1 (expect 1)`,
all 5 tiers pass, level=4. Its run-keyed counter was NOT polluted by 291's concurrent pre-lock
`_record_deploy`.
- 291 (round-1 build 281 crashed `FileNotFoundError` at run_recipe_ci.py:1213) → now
`deploy-count = 1 (expect 1)`, all tiers pass, level=4, no traceback. Its own run-keyed countfile
survived 290's end-of-run remove.
- **Zero leakage after both:** 0 harness procs, 0 immich apps / services / volumes / secrets, no held
cc-ci locks. One unheld 0-byte leftover lockfile (mtime 08:46, 291's acquisition touch) — reaped
on sight by the next janitor probe, harmless by design.
- The ~20-min runtime each was an immich-machine-learning healthcheck slowness/flake (ML eventually
converged), NOT the restructure — already diagnosed in the 08:43Z note; serialization + isolation
both verified correct regardless.
**VETO LIFTED.** The CONC-A1 veto ("no DONE until CONC-A1 fixed + a fresh (c) PASS") is cleared:
conditions (1) per-run keying [code + mutation-proven], (2) same-domain isolation test
[non-tautological], and (3) live (c) both-GREEN + block line + zero leakage are ALL met. CONC-A1
closed in BACKLOG-conc.
**Still required before DONE (full M2 gate, not the CONC-A1 veto):** the Builder must post the formal
M2 claim in STATUS-conc with consolidated evidence, and I re-confirm condition (4) — specifically
**M2(a) cancel-mid-run re-run on the CONC-A1-fixed harness** (b+d already re-confirmed: builds
287+288 parallel both success on fixed main; a's only prior evidence (build 267) was on the
pre-CONC-A1, pre-wrapper-fix harness) — plus the push build green on current main. (a) re-run had
not yet appeared in Drone as of this verdict (Builder sequenced it after (c)). I will verify it cold
when it lands.
## M2: PASS @2026-06-10T08:55Z — merged + live-verified (a)-(d) on final main 139e319/74ed240
Formal M2 gate verdict against the Builder's M2 claim (STATUS-conc, commit 74ed240). Formed from
the plan (SSOT), the code/git, the claim's verify recipe, and my OWN cold re-runs from my own clone
+ fresh checkouts/Drone-API on cc-ci — not the Builder's narrative. All seven claim items confirmed:
1. **Merge integrity** — `git diff 139e319 b6e12ef -- runner/ tests/ docs/ .drone.yml nix/` = 0 lines;
`b6e12ef ⊆ 139e319`; merge parents `2173894 ∘ b6e12ef`. So deployed main code == the CONC-A1 tree
I code-verified + mutation-proofed. No force-push (history linear). NB the claim mis-states the
first parent as `4ad55ed` (actual `2173894`, my M2(c)-FAIL commit) — immaterial: that's a state-
file commit, and the code-diff-empty check is authoritative.
2. **Push build green** — Drone push builds 283-298 on main all `status=success`; no red push since
the merge.
3. **Suites + lint (cold, fresh clone on cc-ci)** — unit 138 passed, concurrency 23 passed
(concurrency NOT in the default unit gate), `lint: PASS` on final main 74ed240. test_run_state
mutation-proofed (reverting to domain-keying fails all 3 cases).
4. **(a) cancel-mid-run on fixed harness** — build 295 (custom immich#2): lockfile mtime 08:50:17
proves it acquired the app lock 7s in → canceled @08:51:05 MID-DEPLOY. After cancel (verified cold
~1 min later): 0 harness procs (no leaked python — old §8.1 gap stays closed), no held locks (lock
released), no immich app/.env/containers(even stopped)/services/volumes/secrets → ZERO leakage,
full teardown. Killed-step logs not API-retrievable (Drone truncates), but the end-state is the
actual test and it is clean.
5. **(b) parallel runs** — builds 287 (immich#2) + 288 (plausible#3), parallel, both
`status=success`, both `deploy-count = 1 (expect 1)`, level=4; host after = zero leakage.
6. **(c) double-!testme same PR** — builds 290 + 291 (same immich domain): both success, 291 logged
the block line then `acquired`, both `deploy-count = 1`, zero leakage. Serialization also observed
directly in the kernel lock table mid-run (lslocks). Covered in detail by my M2(c) PASS @09:05Z.
7. **(d) full green e2e** — build 287 (and 290): complete immich run, all 5 tiers pass, level=4.
Both M2-found fixes are folded in and independently verified: wrapper exit-code-under-set-e
(e1c4198/b7a009c, my 05:00Z note — red still propagates) and CONC-A1 run-keyed state files
(b6e12ef/139e319, my 09:05Z M2(c) PASS + mutation proof). The ~20-min (c) runtimes were an
immich-ML healthcheck flake (converged within DEPLOY_TIMEOUT=1500s), orthogonal to the restructure
(diagnosed 08:43Z). Unheld 0-byte leftover lockfiles are by-design (next-janitor tidy-sweep).
GUARDRAILS honored end-to-end: recipe-mirror PRs (immich#2, plausible#3) used as !testme targets
only, never merged/pushed; cc-ci main touched only by the gated merges (no force-push); no secrets in
any commit. RUN_APP_RE / services_converged / warm-canonical flows untouched (M1 diff review).
CONCLUSION: **M2 — merged + live-verified — PASS.** M1 PASS (04:38Z) + M2 PASS (here) are both fresh
in REVIEW-conc; no open VETO (CONC-A1 lifted). Per the phase DoD the Builder may now write `## DONE`
to STATUS-conc. (Post-verdict I may consult JOURNAL-conc to contextualize; I had NOT read its M2
reasoning before forming this verdict — verified from plan + code/git + Drone API + my own cold runs.)

REVIEW-rcust.md Normal file

@ -0,0 +1,541 @@
# REVIEW-rcust.md — Adversary ledger for the recipe-customization restructure phase
SSOT for this phase: `/srv/cc-ci/cc-ci-plan/recipe-custom-restructure-full-plan.md`.
Gates: **M1** (implementation verified — branch `restructure/recipe-custom`, unit+concurrency+lint
green on cold clone, resolved-customization diff clean for all 21 recipes, adversarial diff review)
and **M2** (merged + real-CI regression sweep matching baseline matrix). DONE requires fresh PASS
for both with no open VETO.
I own this file and the `## Adversary findings` section of BACKLOG-rcust.md only.
---
## Standing watch items (what I will hunt at M1/M2)
- **Coverage loss** (cardinal risk): for every migrated recipe, old loaders' effective customization
values must equal new `meta.load()` values. Throwaway diff script over all 21 recipe dirs; any
delta = finding.
- **Assertion weakening** in `tests/<recipe>/` diffs — migrations must be mechanical only (signatures,
fixture/key renames, underscore prefixes). Any changed assert/expected value = VETO.
- **Deleted-code fallout** — dangling refs to `_recipe_meta`, `_load_meta`, `_recipe_extra_env`,
`_recipe_meta_flag`, `declared_deps`, `is_canonical_enrolled`, `OIDC_AT_INSTALL`,
`CHAOS_BASE_DEPLOY`, `SKIP_GENERIC`, `setup_custom_tests`, `deps_apps`, `deps_creds`, `deployed_app`.
- **Validation gaps** — typo'd key / wrong type / callable-on-data-key must raise MetaError, not pass.
- **R2 fixed end-to-end** — orchestrator load path delivers SCREENSHOT to screenshot.py.
- **HC2 / F2-11 integrity** — repo-local default-deny, requires_deps skip-report, generic floor
semantics all unchanged.
---
## Verdicts
_(no GATE verdict yet — M1 is not claimed. M1 only claims after P1-P6 are all on the branch;
Builder has landed P1 (472a68b) + P2 (8cd72fd) and is mid-P3. The interim pre-review below is
front-loaded break-it work on the FROZEN P1/P2 commits — NOT an M1 PASS.)_
### Interim pre-review of frozen P1+P2 (branch @ 8cd72fd) — @2026-06-10, cold from upstream clone
Done as idle-time break-it work while no gate is pending. P1/P2 phase commits won't be rewritten
(Builder adds P3+ on top), so reviewing them now is non-wasted and front-loads M1. Cold clone of
`origin/restructure/recipe-custom` into `/tmp/rcust-verify` from the true upstream remote.
**No defects found so far.** Results:
1. **Deleted-code fallout — CLEAN.** Grepped `runner/ tests/ scripts/` for live refs to every deleted
symbol (`_recipe_meta`, `_load_meta`, `_recipe_extra_env`, `_recipe_meta_flag`, `declared_deps`,
`is_canonical_enrolled`, `OIDC_AT_INSTALL`, `CHAOS_BASE_DEPLOY`, `SKIP_GENERIC`,
`setup_custom_tests`, `deps_apps`, `deps_creds`, `deployed_app`). All hits are comments/docstrings
explaining the deletion, test names, or the intentionally-RETAINED `CCCI_SKIP_GENERIC*` env form
(kept per P2c). Zero live call-sites. `setup_custom_tests.sh` files gone.
2. **All-recipes-load-clean (typo gate) — PASS, independently.** Ran `meta.load()` (pure stdlib) over
all 21 recipe dirs cold via plain python3 (did NOT trust the Builder's test_meta.py). All 21 load;
non-default key sets sane. Every ALL-CAPS key used in any recipe_meta.py is in the 14-key registry.
3. **Coverage-loss diff (CARDINAL check) — ZERO deltas on data keys + hook presence.** Throwaway
harness (`/tmp/diff_meta.py`) reproduces main's six-loader effective resolution (`_load_meta`,
`declared_deps`, `is_enrolled`, `_recipe_extra_env`) from MAIN's recipe_meta files and diffs vs the
BRANCH's `meta.load()` for all 21 recipes. After correcting one harness artifact (EXTRA_ENV default
is `{}` not None), **0/21 recipes show any delta** for HEALTH_PATH/HEALTH_OK/DEPLOY_TIMEOUT/
HTTP_TIMEOUT/BACKUP_CAPABLE/EXPECTED_NA/UPGRADE_BASE_VERSION/DEPS/WARM_CANONICAL + presence of
READY_PROBE/BACKUP_VERIFY/UPGRADE_EXTRA_ENV/EXTRA_ENV/SCREENSHOT.
4. **Validation gaps — CLOSED.** Crafted tmp recipe_metas: typo'd key → MetaError (with "did you mean
DEPLOY_TIMEOUT?"); wrong type (`DEPLOY_TIMEOUT="str"`) → MetaError; callable on data key
(`DEPLOY_TIMEOUT=lambda ctx:...`) → MetaError; `_PRIVATE`/lowercase-helper → loads clean (exemption
works). All four behave per the locked decision.
5. **meta.py read** — single `exec()`, frozen `RecipeMeta` generated from `KEYS`, `_coerce` rejects
bool-as-int and callable-on-data-key; `non_default` compares vs registry default. No issues.
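
The four validation behaviours probed in item 4 can be sketched as follows. This is an illustration under assumptions, not the branch's `meta.py`: the three-key registry, its defaults and the `validate` name are invented here (the real registry has 14 keys), and the "did you mean" hint is produced with `difflib`, which may differ from the actual implementation.

```python
import difflib

# Hypothetical registry: data key name -> (expected type, default).
KEYS = {
    "DEPLOY_TIMEOUT": (int, 600),
    "HEALTH_PATH": (str, "/"),
    "BACKUP_CAPABLE": (bool, True),
}
HOOKS = {"READY_PROBE", "SCREENSHOT"}  # keys that must be callables


class MetaError(Exception):
    """Raised when a recipe_meta namespace fails validation."""


def validate(namespace):
    """Validate an exec()'d recipe_meta namespace and return resolved values."""
    resolved = {name: default for name, (_typ, default) in KEYS.items()}
    for name, value in namespace.items():
        # Private and lowercase names are helpers, exempt from the registry.
        if name.startswith("_") or not name.isupper():
            continue
        if name in HOOKS:
            if not callable(value):
                raise MetaError(f"{name} must be callable")
            resolved[name] = value
            continue
        if name not in KEYS:
            known = list(KEYS) + sorted(HOOKS)
            close = difflib.get_close_matches(name, known, n=1)
            hint = f" (did you mean {close[0]}?)" if close else ""
            raise MetaError(f"unknown key {name}{hint}")
        expected, _default = KEYS[name]
        if callable(value):
            raise MetaError(f"{name} is a data key and cannot be callable")
        # bool is a subclass of int, so reject it explicitly for int keys.
        if expected is int and isinstance(value, bool):
            raise MetaError(f"{name} must be int, got bool")
        if not isinstance(value, expected):
            raise MetaError(f"{name} must be {expected.__name__}")
        resolved[name] = value
    return resolved
```

Raising at load time rather than at first use is the point of the design: a typo'd key fails the run immediately instead of silently falling back to a default.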
**Still UNVERIFIED for M1 (do NOT treat above as M1 PASS):** full `pytest tests/unit -q` +
`pytest tests/concurrency -q` + `scripts/lint.sh` cold on the cc-ci host; R2 end-to-end through the
real orchestrator screenshot path; P3 ctx-hook signature migration (assert byte-identical, legacy
`lambda domain:` raises clear MetaError); P4/P5/P6; re-run the coverage diff on the FINAL branch
(P3 changes hook signatures); recipe-test diffs are mechanical-only (no assertion weakening);
HC2/F2-11/generic-floor integrity. These wait for the `claim(rcust): M1`.
### Interim pre-review of frozen P3 (branch @ fd02d9f) — @2026-06-10, cold from upstream clone
Builder landed P3 (uniform ctx hook convention) and moved to P4, so P3 is frozen. Pre-reviewed it.
**No defects found.**
1. **Mechanical-migration discipline — HELD (no VETO trigger).** `git diff 8cd72fd..fd02d9f` over
`tests/*/` shows ZERO changed assert/expected literals. Every hook change is purely
`def HOOK(domain[, meta])` → `def HOOK(ctx)` + `domain` → `ctx.domain` in the body. Spot-checked
cryptpad/mumble/ghost/lasuite-drive recipe_meta.py + lasuite-drive ops.py: seeded values, return
dicts, paths, status codes, and the `pre_restore` `assert _psql(...) in (...)` are byte-identical
apart from the `ctx.` deref.
2. **HookCtx — present + complete.** `meta.HookCtx` frozen dataclass has all 5 documented fields
(`.domain`, `.base_url`, `.meta`, `.deps`, `.op`); `meta.hook_ctx(domain, meta, op=…)` factory
builds it and pulls `deps` from `$CCCI_DEPS_FILE`. All call sites migrated: run_recipe_ci
`pre_<op>`, BACKUP_VERIFY; lifecycle `extra_env` + READY_PROBE; screenshot `SCREENSHOT(page, ctx)`.
(NB my first pass falsely flagged "no HookCtx" — that was a STALE WORKTREE at P2; corrected by
checking out fd02d9f. Logged here for honesty.)
3. **Legacy-signature guard (P3.4) — PRESENT + works, live-probed.** `meta.check_hook_signature`
exact-matches positional params and raises a CLEAR MetaError naming the P3 migration + HookCtx
fields. Wired into both `load()` (recipe_meta hooks; SCREENSHOT expects `(page, ctx)`, rest
`(ctx)`) and the orchestrator (ops.py `pre_<op>`). Crafted tmp metas: legacy `READY_PROBE(domain)`,
`SCREENSHOT(page, domain, meta)`, `EXTRA_ENV(domain)` all → MetaError at load; `READY_PROBE(ctx)`
loads clean. No silent mid-run TypeError path.
4. **Coverage diff re-run at P3 head — still 0/21 deltas** (hook presence + all data keys unchanged).
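
A sketch of the convention reviewed in items 2 and 3, assuming the guard compares positional parameter names exactly. The field types, defaults and error wording are guesses for illustration; only the five field names and the `(ctx)` / `(page, ctx)` shapes come from the notes above.

```python
import inspect
from dataclasses import dataclass, field


class MetaError(Exception):
    """Raised when a hook does not follow the ctx convention."""


@dataclass(frozen=True)
class HookCtx:
    # The five documented fields; types and defaults are assumptions.
    domain: str
    base_url: str
    meta: object = None
    deps: dict = field(default_factory=dict)
    op: str = ""


def check_hook_signature(name, fn, expected=("ctx",)):
    """Reject hooks whose positional parameters are not exactly `expected`."""
    positional = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    params = tuple(
        p.name
        for p in inspect.signature(fn).parameters.values()
        if p.kind in positional
    )
    if params != tuple(expected):
        raise MetaError(
            f"{name}({', '.join(params)}) uses a legacy signature; "
            f"expected ({', '.join(expected)}), with the domain, base_url, "
            f"meta, deps and op fields read from ctx"
        )


if __name__ == "__main__":
    def READY_PROBE(ctx):
        return ctx.domain

    check_hook_signature("READY_PROBE", READY_PROBE)
    ctx = HookCtx(domain="app.example.test", base_url="https://app.example.test")
    print(READY_PROBE(ctx))
```

Checking the signature when the file is loaded turns what would otherwise be a mid-run `TypeError` into an early, descriptive failure.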
Net: P1+P2+P3 all clean under cold adversarial probing. M1 still gated on full unit+concurrency+lint
on the cc-ci host, P4-P6, R2 end-to-end via the real screenshot orchestrator path, and a final
coverage re-diff. No findings filed; no VETO.
### Interim pre-review of frozen P4 (branch @ 29a28e2) — @2026-06-10T18:55Z, cold from fresh host clone
Builder landed P4 (custom-test ergonomics) and moved to P5, so P4 is frozen. Pre-reviewed it cold.
**No defects found.** NOT an M1 verdict — M1 stays gated (see "Still UNVERIFIED" below).
Cold acceptance (fresh `git clone` on cc-ci host at 29a28e2, my own checkout — not the Builder's):
- `cc-ci-run -m pytest tests/unit -q` → **184 passed** (exact match to claim; full suite, no
cross-fixture pollution from the session-scoped `deps` fixture).
- `cc-ci-run -m pytest tests/unit/test_discovery.py test_discovery_phase2.py
test_conftest_fixtures.py -q` → 14 passed.
- `nix develop .#lint --command scripts/lint.sh` → **lint: PASS** (ruff format/check, deadnix,
shfmt, shellcheck, yamllint all clean).
Correctness probes:
1. **Placement-rule claim ("zero in-repo users of top-level custom tests") — HOLDS.** Filesystem
sweep of every `tests/<recipe>/test_*.py`: ALL are lifecycle names (test_{install,upgrade,
backup,restore}.py). No top-level non-lifecycle custom exists in-repo, so dropping the top-level
glob in `discovery.custom_tests` loses ZERO coverage. The lifecycle-name exclusion is retained
inside functional/playwright as the double-run safety net.
2. **Discovery diff — clean.** Top-level `glob(test_*.py)` branch removed; functional/ + playwright/
subdir globs retained with `basename not in lifecycle_names` guard. Docstring + module header
updated to state the placement RULE.
3. **Test changes are adaptation + strengthening, NOT weakening (no VETO trigger).**
- `test_discovery_phase2`: renamed to `..._placement_rule_...`; now ASSERTS the top-level
`test_sso_smoke.py` is `not in names` (new negative assertion proving the behavior change),
while functional/playwright customs are still `in names` and lifecycle name excluded.
- `test_discovery::test_custom_tests_repo_local_gated`: repo-local custom moved from top-level
into `functional/`; HC2 default-deny (`== []` when unapproved) and approved-case
(`functional/test_sso.py in names`, `test_install.py` excluded) both INTACT. HC2 integrity
preserved.
4. **op_state fixture — correct.** Skips with clear reason on unset env / missing file / non-JSON
(`except ValueError` catches JSONDecodeError); reads & returns parsed dict otherwise. Tests
cover 3 of 4 paths (the non-JSON skip path is untested — minor coverage gap, not a defect; the
branch is trivially correct by inspection).
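
For reference, the four paths of the `op_state` fixture described in item 4 can be sketched as a plain function. The environment variable name `CCCI_OPSTATE_FILE` and the function name are assumptions; in a real pytest fixture the returned reason would presumably be passed to `pytest.skip`.

```python
import json
import os


def load_op_state(environ=os.environ, var="CCCI_OPSTATE_FILE"):
    """Return (state, None) on success, or (None, reason) when the caller should skip."""
    path = environ.get(var)
    if not path:
        return None, f"{var} is not set"
    try:
        with open(path) as fh:
            return json.load(fh), None
    except FileNotFoundError:
        return None, f"{path} does not exist"
    except ValueError:
        # json.JSONDecodeError subclasses ValueError, so this covers bad JSON.
        return None, f"{path} is not valid JSON"
```

The non-JSON branch noted above as untested is the last `except` clause here; exercising it only needs a temporary file containing invalid JSON.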
Net: P1+P2+P3+P4 all clean under cold adversarial probing; both halves of every phase claim
(unit count + lint) reproduced cold on a fresh clone. No findings filed; no VETO.
**Still UNVERIFIED for M1 (do NOT treat above as M1 PASS):** P5 (manifest) + P6 (docs);
`pytest tests/concurrency -q` cold; R2 end-to-end through the real orchestrator screenshot path;
final coverage re-diff on the COMPLETE branch (P1-P6, all 21 recipes, effective customization set
unchanged); recipe-test diffs mechanical-only across the whole branch; HC2/F2-11/generic-floor
integrity at the final head. These wait for `claim(rcust): M1`.
### Interim pre-review of frozen P5 (branch @ 68954be) — @2026-06-10T19:06Z, cold from fresh host clone
Builder landed P5 (customization manifest) and moved to P6, so P5 is frozen. Pre-reviewed it cold.
**No blocking defect; one secret-SURFACE observation raised (heads-up to Builder, NOT a VETO, NOT
an M1 secret-leak failure).** NOT an M1 verdict.
Cold acceptance (fresh `git clone` on cc-ci host at 68954be, my own checkout):
- `cc-ci-run -m pytest tests/unit -q` → **191 passed** (exact match to claim).
- `nix develop .#lint --command scripts/lint.sh` → **lint: PASS**.
Primary adversarial target — SECRET LEAKAGE via the new manifest surface (D-gate: published logs +
dashboard contain NO secrets, incl. generated app passwords):
1. **Generated/runtime secrets — NOT exposed (gate holds).** `manifest.build` collects only:
`meta_non_default` (static recipe_meta), hook NAMES (pre-ops/install_steps.sh/compose.ccci.yml),
overlay FILENAMES, custom-test COUNTS, and env-override KEY names (printed `KEY=1`, value never
rendered). It never touches `deps` (client_secret), `op_state`, abra-generated app passwords, or
any env VALUE. The cardinal concern — generated app passwords on the dashboard — is structurally
absent from this surface.
2. **Cold all-recipes sweep.** Built+rendered the manifest for all 21 recipes on the host; grepped
the rendered blocks AND the results.json `customization` payload for secret/password/token/key/
credential and for any 32+ char high-entropy string. The ONLY hit, across every recipe, is
plausible's `EXTRA_ENV.SECRET_KEY_BASE` =
`"ccciplausibletestkeybase64charsexactlyforCIephemeral4567890123"`.
3. **OBSERVATION (not a leak):** that value is a HARDCODED, committed, PUBLIC dummy CI constant
(tests/plausible/recipe_meta.py, in the open-source repo) — not a generated or real secret.
`meta_non_default` dumps EXTRA_ENV literal dicts verbatim into the log AND results.json (→
dashboard), so a field literally named `SECRET_KEY_BASE` with a value now appears on the
dashboard. No real secret is exposed (it's public), so this is NOT a D-gate failure and does NOT
block P5. BUT it's a standing surface: (a) a dashboard secret-scan gets a true-positive-shaped
hit on a public dummy (noise that could mask a real leak), and (b) if any recipe ever set a real
secret-ish literal in a meta dict, the manifest would surface it unredacted. Flagged to Builder
via BUILDER-INBOX as a heads-up to consider redacting values of sensitive-named meta keys before
M1. Will re-examine on the real dashboard at the M1 cold-verify.
4. **HC2-honoring — confirmed.** Manifest routes ALL repo-local reads through `discovery._gated`
(ops.py loop direct; `install_steps`/`resolve_overlay_op`/`custom_tests` each call `_gated`
internally). An unapproved repo-local recipe contributes nothing to the manifest.
5. **Pure presentation — holds.** `build()` only reads files/env and returns a dict; `render()`
formats a string. Called at run_recipe_ci.py:889-890 (print) + embedded at :1261 into results;
no state mutation, no verdict influence. `_jsonable` renders callables as `'<hook>'` (so a
callable EXTRA_ENV/READY_PROBE never leaks closure internals) and tuples→lists for JSON.
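The item-2 sweep can be approximated as below. This is a minimal sketch, not the throwaway script actually used: the name pattern mirrors the grep words listed above, while the entropy cutoff and the helper names (`shannon_entropy`, `sweep`) are assumptions for illustration. The name match is deliberately broad, so a key such as `KEYCLOAK_URL` would also be flagged for a human to dismiss.

```python
import math
import re
from collections import Counter

# Words the sweep grepped for; matching is deliberately broad (substring).
SECRET_NAME = re.compile(r"secret|password|token|key|credential", re.IGNORECASE)
MIN_LEN = 32        # the "32+ char" threshold from the sweep
MIN_ENTROPY = 3.5   # assumed cutoff, in bits per character


def shannon_entropy(text):
    """Bits per character of the empirical character distribution."""
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def sweep(node, path=""):
    """Yield (path, reason) for every secret-shaped leaf in a nested payload."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            if SECRET_NAME.search(str(key)) and isinstance(value, str):
                yield child, "secret-named key"
            yield from sweep(value, child)
    elif isinstance(node, (list, tuple)):
        for index, item in enumerate(node):
            yield from sweep(item, f"{path}[{index}]")
    elif isinstance(node, str):
        if len(node) >= MIN_LEN and shannon_entropy(node) >= MIN_ENTROPY:
            yield path, "high-entropy string"
```

Running `list(sweep(payload))` over each recipe's `customization` payload gives the hit list to review by hand; a hit is a prompt to look, not a verdict.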
Net: P1–P5 all clean under cold adversarial probing; every phase claim (unit count + lint)
reproduced cold. No findings filed; no VETO. One non-blocking secret-surface heads-up sent.
**Still UNVERIFIED for M1:** P6 (docs); `pytest tests/concurrency -q` cold; R2 end-to-end via the
real orchestrator screenshot path; final coverage re-diff on the COMPLETE branch (all 21 recipes,
effective customization unchanged); recipe-test diffs mechanical-only across the whole branch;
HC2/F2-11/generic-floor integrity at final head; AND — at the M1 dashboard check — confirm the
SECRET_KEY_BASE-named field on the real dashboard is the accepted public dummy (or redacted).
These wait for `claim(rcust): M1`.
## M1 — implementation verified: **PASS** @2026-06-10T19:27Z (branch `restructure/recipe-custom` @ 858e0f5)
Cold-verified from TWO fresh clones on the cc-ci host (NEW=858e0f5, OLD=main pre-restructure;
merge-base 49fb818 confirmed → `main..858e0f5` is exactly P1–P6). Verdict formed from the phase plan
(SSOT), the code/git history, the STATUS verification facts, and my own cold re-runs — NOT from
JOURNAL rationale (isolation discipline; I did not need to consult JOURNAL).
**All M1 Definition-of-Done items PASS:**
1. **Cold test suites — match claim exactly.** Fresh clone @858e0f5:
`cc-ci-run -m pytest tests/unit -q` → **192 passed**; `tests/concurrency -q` → **23 passed**
(untouched by this plan, proven); `nix develop .#lint --command scripts/lint.sh` → **lint: PASS**.
2. **Coverage diff (cardinal risk) — 0 REAL deltas / 21 recipes.** Wrote throwaway extractors that
resolve EVERY recipe's effective customization in BOTH worlds — OLD via the legacy loaders
(`_load_meta` + `lifecycle._recipe_extra_env` + `deps.declared_deps` + `_recipe_meta_flag`),
NEW via `meta.load()` + `meta.extra_env/upgrade_extra_env` — for the common keys (HEALTH_*,
timeouts, DEPS, EXTRA_ENV resolved at a fixed domain, UPGRADE_EXTRA_ENV, BACKUP_CAPABLE,
EXPECTED_NA, UPGRADE_BASE_VERSION, READY_PROBE/BACKUP_VERIFY presence). Diff = **0 behavioral
deltas**; the only raw diffs were 20× `UPGRADE_EXTRA_ENV: None→{}` (unset default representation,
behaviorally identical) and mumble (most-customized: callable EXTRA_ENV→dict, UPGRADE_EXTRA_ENV,
READY_PROBE) is **byte-identical** old↔new.
Deleted keys accounted for (no silent loss): `SKIP_GENERIC` (0 recipe users); `CHAOS_BASE_DEPLOY`
→ overlay-presence (discourse+ghost, exactly the two shipping compose.ccci.yml — perfect 1:1, no
change either direction); `OIDC_AT_INSTALL` → install-time made universal (drive+meet were
already install-time). **lasuite-docs** declared DEPS but NOT OIDC_AT_INSTALL → OLD post-install,
NEW install-time: an INTENTIONAL P2b consolidation, not a drop — flagged below for M2 validation.
3. **Assertion weakening (VETO-class) — NONE.** Full branch diff over all recipe test files
(excl. harness unit/concurrency/regression): 18 removed asserts, 18 added. After mechanical
normalization (`domain`→`ctx.domain`, `deps_creds`→`deps`, `MAX_USERS`→`_MAX_USERS`, whitespace)
the removed and added assert sets are **IDENTICAL** — zero unmatched in either direction. Every
change is a pure signature/fixture/constant rename; no expected value altered, no assert deleted.
Spot-confirmed discourse/ghost `_psql(domain,…ci_marker…) in (…)` → `ctx.domain` only (expected
tuple + SQL byte-identical). **No VETO.**
4. **Deleted-code fallout — clean.** No dangling LIVE refs to any of the 13 deleted symbols
(`_recipe_meta`/`_load_meta`/`_recipe_extra_env`/`_recipe_meta_flag`/`declared_deps`/
`is_canonical_enrolled`/`OIDC_AT_INSTALL`/`CHAOS_BASE_DEPLOY`/`SKIP_GENERIC`/`setup_custom_tests`/
`deps_apps`/`deps_creds`/`deployed_app`). Only residue: stale DOC/comment mentions of
`OIDC_AT_INSTALL` + `setup_custom_tests.sh` in PARITY.md files (non-blocking P6 cosmetic nit).
5. **Validation gaps — closed.** Cold-probed `meta.load()` with synthetic bad metas: typo'd key,
str-on-int, bool-as-int, callable-on-data-key, legacy hook sig `READY_PROBE(domain)`, and unknown
key ALL → `MetaError` (clear, names the offending file/key). Clean + underscore-private-helper
metas load fine (no false positives). No silent pass.
6. **R2 fixed end-to-end.** Cold proof through the REAL load path: a recipe declaring
`def SCREENSHOT(page, ctx)` is surfaced by `meta.load()` and resolved callable by
`screenshot._load_screenshot_hook` (old L1 allowlist dropped it — now arrives); orchestrator wires
it `run_recipe_ci.py:1029 capture(…, recipe_meta=meta)` → `hook(page, hook_ctx(domain, meta))`.
Absent recipe → None (default landing-page path). Legacy `SCREENSHOT(page, domain, meta)` sig
rejected at load.
7. **HC2 / F2-11 / generic-floor integrity — preserved.** Cold-probed `discovery.custom_tests` +
`install_steps`: UNAPPROVED repo-local → `[]` / `None` (default-deny holds); APPROVED → surfaced.
`sso_dep_unverified` (F2-11) logic UNCHANGED (only a comment edited) — a deps-not-ready run that
skips ≥1 `requires_deps` test still suppresses the green signal. Generic floor `_skip_generic`
default = run (additive); opt-out now env-only (same env vars as before; the 0-user meta key
removed) and surfaced LOUDLY in CI + flagged `!!` in the manifest — strictly stronger, never
silent.
8. **(Bonus) P5 secret-surface heads-up RESOLVED + verified.** The Builder landed `858e0f5`
redacting secret-named meta values in the manifest (my P5 BUILDER-INBOX ask). Cold-verified:
`plausible.EXTRA_ENV.SECRET_KEY_BASE` → `<redacted>` in BOTH the log block and results.json;
recursive into nested dict keys; word-segment `(^|_)KEY(_|$)` regex avoids over-match
(KEYCLOAK_* passes). All-21-recipe sweep: exactly 1 redaction, ZERO over-redaction, ZERO
under-redaction (no secret-shaped value remains). Regression test
`test_manifest_redacts_sensitive_named_values` present.
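The redaction behaviour verified in item 8 can be sketched as follows. Only the `KEY` word-segment rule and the `<redacted>` placeholder come from the entry above; the other sensitive words in the pattern and the function name `redact` are assumptions, and the harness's real implementation may differ.

```python
import re

# Word-segment match: the word must be bounded by "_" or the string edges,
# so KEYCLOAK_URL is left alone. Only KEY is confirmed by the review entry;
# SECRET, PASSWORD and TOKEN are assumed for illustration.
SENSITIVE = re.compile(r"(^|_)(KEY|SECRET|PASSWORD|TOKEN)(_|$)", re.IGNORECASE)
REDACTED = "<redacted>"


def redact(node):
    """Return a copy of node with the values of sensitive-named keys masked."""
    if isinstance(node, dict):
        return {
            key: REDACTED if SENSITIVE.search(str(key)) else redact(value)
            for key, value in node.items()
        }
    if isinstance(node, (list, tuple)):
        # Tuples become lists, which is what a JSON payload needs anyway.
        return [redact(item) for item in node]
    return node
```

Matching on the key name rather than the value keeps the rule cheap and predictable; the trade-off is that a secret stored under an innocuous name would pass through, which is why the all-recipe sweep for secret-shaped values still matters.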
**Verdict: M1 PASS.** No findings filed, no VETO.
**This does NOT clear `## DONE`.** Per the phase DoD, DONE requires a fresh Adversary PASS for BOTH
M1 *and* M2. M2 (merged-main real-CI regression sweep vs the committed baseline matrix) is still
unverified. M2 watch-items I will specifically re-check from run logs:
- **lasuite-docs OIDC is now install-time** (post→install change above) — must pass a real run with
OIDC wired at install (skip-count 0 on its `requires_deps` tests).
- the customization spot-checks the plan §M2.4 enumerates (mumble READY_PROBE tcp lines, cryptpad
SANDBOX_DOMAIN, ghost/discourse BACKUP_VERIFY + overlay copy + auto-chaos base deploy, lasuite-*
deps provisioning + OIDC tests ran, immich ops.py seeds, manifest block present in every log,
screenshot.png where capture succeeded).
- canary suite (RED canaries still caught at intended tier) + per-recipe level == baseline matrix.
- zero leaked apps after teardown.
### M2-prep — independent hook-port audit (shell→python / best-effort↔fatal drift) @2026-06-10T20:55Z
Triggered by the lasuite-drive regression (below), which my M1 PASS MISSED: my M1 coverage diff
compared recipe_meta KEYS (resolved values), not ops.py hook BODIES, and my assertion scan matched
`assert ` not `raise AssertionError`. So a hook that flipped best-effort→fatal was invisible to my
M1 method. M2 (real-CI sweep) caught it — the safety net working as designed. I then audited ALL
hook ports cold (`git diff c2508c7..origin/main` per recipe ops.py + the 2 setup_custom_tests.sh
ports), filtering for non-mechanical error-handling (raise/assert/except/exit/timeout/poll changes):
- **lasuite-drive `pre_install`** — GENUINE rcust regression (Builder-disclosed, I confirmed):
OLD setup_custom_tests.sh bucket poll fell through on 90s timeout (best-effort, no failure; the
custom-tier `test_minio_storage.py` upload→list→download is the real gate); NEW port added a
terminal `raise AssertionError` → deterministic install RED when the bucket appears just after
90s. Fix-forward APPROVED (restore best-effort print+return, scoped to line-54 only; conditioned
on an L5 re-run + my diff re-verify). See approval entry in BUILDER-INBOX history (commit 57c66ad).
- **lasuite-docs `install_steps.sh`** — INTENTIONAL P2b change, NOT a defect: OLD setup_custom_tests
did `exit 1` on missing deps/null KC creds; NEW does `exit 0` (no-op) for missing-deps (gated now
by F2-11: the `@requires_deps` OIDC test skips → `sso_dep_unverified` suppresses green) BUT
preserves `exit 1` on secret-insert failure. Consistent with the install-time-deps redesign.
WATCH-ITEM (residual): the missing-deps path now relies entirely on F2-11; the sweep didn't
exercise it (deps were ready, skip-count 0). Mechanism verified present at M1; not blocking.
- **All other ops.py** (cryptpad, discourse, ghost, immich, keycloak, lasuite-meet, matrix-synapse,
mattermost-lts, mumble, n8n, plausible, custom-html) — pure mechanical ctx migration
(`domain`→`ctx.domain`, `meta`→`ctx.meta`); expected tuples/strings byte-identical (spot-checked
keycloak 201/409 + 204/200, discourse/ghost _psql ci_marker). No error-handling drift.
Net: exactly ONE accidental hook-port regression (lasuite-drive), now under approved fix. No other
best-effort↔fatal flips. This audit closes the M1-method gap for the hook bodies.
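The M1-method gap (matching `assert ` but not `raise AssertionError`) is the kind of thing a syntax-aware scan avoids. A minimal sketch, assuming the hook bodies are Python source; the function name `count_checks` is hypothetical and this is not the scan used for either M1 or this audit:

```python
import ast


def count_checks(source):
    """Count assert statements and explicit `raise AssertionError` in source."""
    counts = {"assert": 0, "raise AssertionError": 0}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assert):
            counts["assert"] += 1
        elif isinstance(node, ast.Raise) and node.exc is not None:
            # Covers both `raise AssertionError` and `raise AssertionError("msg")`.
            target = node.exc.func if isinstance(node.exc, ast.Call) else node.exc
            if isinstance(target, ast.Name) and target.id == "AssertionError":
                counts["raise AssertionError"] += 1
    return counts
```

Comparing the counts for the old and new version of each hook file flags any file where a fatal check appeared or disappeared, which then gets a manual read. It does not help for shell-to-Python ports, where the old side has to be read by hand.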
---
### M2 proof-run independent analysis (cold, Adversary) @2026-06-10T23:53Z
M2 is NOT yet claimed by the Builder; this is my independent read of the proof runs sitting on
cc-ci (`/var/lib/cc-ci-runs/{m2b-*,ab-*-oldmain}`), parsed myself via jq (NOT trusting Builder
narrative). The 6 first-sweep mismatches break down as follows.
**Confirmed root fact — REF MISMATCH is real (I verified, not taken on faith).** Every baseline
matrix run used a *PR-head* ref; the first M2.3 sweep used each mirror's *default-branch head* — a
different commit. Independently confirmed via `results.json.ref`:
| recipe | baseline run/ref/level | sweep ref/level |
|---|---|---|
| discourse | 184 / 7ae7b0f76efb / L4 | 7d53d4ec390f / L2 |
| plausible | 308 / 13458fac56a1 / L4 | da159375d89a / L2 |
| mattermost-lts | 196 / a333e31a6002 / L4 | 41c9eb8e5f34 / L2 |
| immich | 307 / 107d7220adce / L4 | 7eb3937a82d0 / L2 |
| lasuite-drive | 189 / ffa7d585afa2 / L5 | f4135d78201e / L0 |
So the sweep was NOT apples-to-apples vs the baseline matrix. Reconciliation requires either
(a) re-run at the baseline ref on new main == baseline level, or (b) A/B same-ref old-vs-new main
== same level. Status per recipe:
- **immich** — m2b-immich (new main, baseline ref 107d7220adce) = **L4 == baseline L4. CLEAN.**
- **mattermost-lts** — m2b (new main, a333e31a6002) = **L4 == baseline L4. CLEAN.**
- **plausible** — m2b (new main, 13458fac56a1) = **L4 == baseline L4. CLEAN.**
→ these three: restructure proven INNOCENT (baseline ref reproduces baseline level on merged main).
- **bluesky-pds** — ab-bluesky-pds-oldmain (OLD main, b2d86efba3f1) = L0 == new-main sweep L0 at
same ref → restructure-NEUTRAL at the sweep ref. (Baseline is "L4-equiv, pre-results-era", no run
id — softer baseline; A/B neutrality is the available evidence.)
- **discourse — NOT yet clean. OPEN.** Two *distinct* flake modes seen, and the A/B was run at the
wrong ref to close the gap:
- baseline 184 (OLD main, 7ae7b0f): all pass → L4.
- m2b-discourse (NEW main, SAME ref 7ae7b0f): **upgrade FAILED**, HC1 guard fired —
"upgrade deployed chaos commit 'eb96de94+U', not intended PR-head '7ae7b0f76efb' — re-checkout
to code-under-test failed (HC1)" → L1. ← same-ref old=L4 vs new=L1 discrepancy, UNexplained.
- ab-discourse-oldmain (OLD main, 7d53d4ec): **restore FAILED** (ci_marker truncated-dump race)
→ L2 == new-main sweep L2 at that ref → neutrality proven, but for the RESTORE mode at the
DEFAULT-head ref, NOT for the L1/upgrade-HC1 mode at the baseline ref.
- Net: the clean A/B (ref 7ae7b0f on OLD main vs NEW main) that would explain L4→L1 was NOT run.
The upgrade re-checkout/HC1 path lives in run_recipe_ci.py/lifecycle which the meta-param
threading DID touch — so "pre-existing flake" is plausible but UNPROVEN here. To clear: run
discourse @7ae7b0f on OLD main (does it deterministically reproduce L4, or also flake to L1?),
and/or repeat @7ae7b0f on new main to characterise the HC1 re-checkout as a race. The HC1 guard
FIRING (not silently passing the wrong commit) is the safety net working — good — but it means
the upgrade did not exercise the PR code, so the run is inconclusive, not a clean baseline match.
- **lasuite-drive** — fix-forward 1357544 (restore best-effort bucket poll) landed; needs a fresh
L5 run at the baseline ref ffa7d585afa2 on merged main to confirm baseline. m2rr/earlier runs
predate or used the default head — NOT yet a clean baseline match. OPEN.
**M2 disposition: still OPEN — no PASS.** 3/6 cleanly reconciled (immich/mattermost/plausible);
bluesky neutral-at-sweep-ref; discourse + lasuite-drive NOT yet closed. I will require, at the M2
claim: (1) discourse same-ref A/B (or repeat) explaining L4→L1; (2) a clean lasuite-drive L5 at
baseline ref; (3) my own cold re-parse of every per-recipe level vs baseline; (4) the M2.4
customization-executed spot-greps; (5) zero leaked apps. Recorded a BUILDER-INBOX heads-up on the
discourse-HC1 gap so it is addressed in the claim, not glossed as "the restore flake".
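The apples-to-apples rule applied above can be stated as a small classifier. A sketch under assumptions: each result is reduced to a dict with `ref` and `level` keys (`ref` is a real `results.json` field per the check above; the `level` key name and the function name `classify` are assumed).

```python
def classify(baseline, sweep):
    """Compare one recipe's sweep result against its baseline matrix row.

    Both arguments are dicts with 'ref' and 'level' keys (assumed shape).
    """
    if baseline["ref"] != sweep["ref"]:
        # Different commit under test: the levels are not comparable.
        return "ref-mismatch"
    if baseline["level"] != sweep["level"]:
        # Same commit, different outcome: needs an A/B or a repeat run.
        return "level-mismatch"
    return "match"
```

With the table's values, the first discourse sweep is a `ref-mismatch` (so its L2 says nothing about the restructure), the immich re-run at the baseline ref is a `match`, and m2b-discourse is a `level-mismatch` that stays open.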
### M2 proof-run progress + self-correction @2026-06-11T00:05Z
Builder is running (independently, matching my inbox ask) the decisive A/B serially on the box:
`m2-proof.sh` → lasuite-drive @ffa7d585afa2 PR=1 (post-fix-forward 1357544) on merged main 5c0676b,
then discourse @7ae7b0f76efb **PR=2** on merged main (m2p-discourse); `m2-proof2.sh` (queued) →
discourse @7ae7b0f76efb **PR=2** on OLD main (/root/m2-oldmain, ab-discourse-7ae7b0f-oldmain).
**Self-correction to my 23:53Z discourse analysis:** my m2b-discourse run used **PR=0**, but the
upgrade HC1 guard resolves the *PR head* for the re-checkout. The L1 failure message ("deployed
chaos commit 'eb96de94+U', not PR-head 7ae7b0f — re-checkout failed") is plausibly a **PR=0
artifact** (no real PR to resolve the head from), NOT a restructure regression. The Builder's proof
runs correctly use PR=2 (matching baseline run 184's pr=2). So the apples-to-apples comparison I
need is m2p-discourse (PR=2, new main) vs ab-discourse-7ae7b0f-oldmain (PR=2, old main) vs baseline
184 (PR=2, old main, L4). I will cold-verify those three when they land; my L4→L1 concern is on
hold pending the PR=2 result, not yet a confirmed regression. Live lasu-f68b63 stack = active
lasuite-drive proof run (expected, not a leak).
### M2 fix-forward APPROVE: be2026a (services_converged completed-one-shot rule) @2026-06-11T00:31Z
Builder proposed a 2nd lasuite-drive P2b fix on branch `fix/converged-oneshot @ be2026a` and asked
approval before merging to main (M2 "trivial fix-forward w/ Adversary approval" path). Cold-verified
independently (fresh clone of be2026a at /root/adv-be2026a on cc-ci, NOT the Builder's working tree):
- **Diff** (`git diff origin/main..be2026a runner/harness/lifecycle.py`, read myself): in
`services_converged`, a `cur != want` deficit now passes ONLY if `docker service ps <svc>` shows
ALL task states == `Complete`. Conservative: any Running/Preparing/Pending (spinning up) or
Failed/Rejected (broken) in the deficit still returns False; no-tasks-yet still False; plain N/N
and 0/0 unchanged. Targeted addition, not a rewrite.
- **False-green analysis (my own):** only `restart_policy:none` one-shots ever show `Complete`; a
normal crashed service shows Failed/Running(restarting), never Complete. Even if converge passed
on a completed-but-ineffective one-shot, two INDEPENDENT gates still catch it — the generic
`test_serving` HTTP floor and the custom-tier functional test (lasuite-drive
`test_minio_storage.py` upload→list→download is the real bucket gate). Defense-in-depth holds; I
could not construct a false-green path.
- **Tests** `tests/unit/test_converged_oneshot.py` (read + cold-ran): 7 cases pin exactly the
non-vacuity criteria — completed→converged, Failed→NOT, mixed Complete+Failed→NOT (covers the
`docker service ps` history concern), Preparing→NOT, no-tasks→NOT, N/N→converged, 0/0→converged.
- **Cold suite+lint from fresh be2026a checkout:** `cc-ci-run -m pytest tests/unit -q` → **199
passed**; the 7 new tests pass alone; `nix develop .#lint --command scripts/lint.sh` → **lint:
PASS**. Matches Builder's claim.
- **Root cause judged genuine P2b regression** (hook moved into ops.py pre_install runs BEFORE the
install assert; the completed one-shot's 0/1 then burns DEPLOY_TIMEOUT in the converge poll). The
fix accepts a genuinely-healthy deploy (HTTP 200, all other services 1/1) the old `cur!=want`
wrongly rejected — correction, not masking.
- **Not on main** — confirmed `all(s == "Complete")` absent from origin/main; Builder held the gate.
- **Disclosed semantic delta** (a failing one-shot now blocks install convergence earlier vs later
at custom-tier): ACCEPTED — both paths RED, no false-green, no enrolled recipe has a
baseline-failing one-shot.
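The rule reviewed above reduces to a small pure function. This is a sketch of the decision logic only, assuming the task states have already been collected from `docker service ps` and are passed in as strings; the name `converged` and its signature are hypothetical, not the harness's `services_converged`.

```python
def converged(cur, want, task_states):
    """Decide convergence for one service.

    cur and want are the current and desired replica counts; task_states is
    the list of task state strings for the service (empty if no task exists).
    """
    if cur == want:
        return True    # plain N/N and 0/0 are unchanged
    if not task_states:
        return False   # a deficit with no tasks yet is still spinning up
    # A deficit passes only when every task finished as a one-shot.
    return all(state == "Complete" for state in task_states)
```

The seven pinned cases map directly onto it: `converged(0, 1, ["Complete"])` is the only deficit that passes, and any Failed, Preparing or mixed history keeps the service unconverged.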
**VERDICT: fix-forward be2026a APPROVED, conditional on:**
1. Post-merge lasuite-drive proof re-run @ffa7d585afa2 PR=1 lands **L5** (binding end-to-end proof
the fix resolves the converge hang — if it doesn't, the diagnosis was wrong and approval voids).
2. I re-verify the MERGED diff == be2026a diff (no extra change sneaks in at merge).
3. discourse PR=2 A/B pair (m2p-discourse / ab-discourse-7ae7b0f-oldmain — no one-shots, unaffected
by this fix) completes and I cold-verify those levels too.
This APPROVE does NOT clear M2; M2 still needs all per-recipe levels reconciled + my independent
sample re-check + zero-leak teardown.
### be2026a merge cold-verify — condition #2 SATISFIED @2026-06-11T00:42Z
Builder merged be2026a as 6cabbe7 (build 350 green, origin/main now b4505ac). Independently checked:
`diff origin/main:runner/harness/lifecycle.py be2026a:...` → **IDENTICAL**; the merged
`tests/unit/test_converged_oneshot.py` → **IDENTICAL** to be2026a. Clean merge, no extra change
slipped in — approval condition #2 met. m2p-lasuite-drive (pre-fix) landed L0 (install/converge
timeout) = the diagnosed symptom (Builder disclosed in b4505ac that it SIGINT-shortcut the doomed burn;
binding proof is the post-fix m2p2 re-run). REMAINING be2026a conditions: #1 post-fix lasuite-drive
L5, #3 discourse PR=2 A/B cold-check — both pending (m2p-discourse running, then ab-oldmain, then
m2p2-lasuite-drive).
### be2026a conditions CLEARED + SSO-baseline staleness finding (independent) @2026-06-11T01:12Z
Reached the conclusions below COLD (own git archaeology + run-dir jq) BEFORE reading the Builder's
01:10Z inbox — which then concurred. Anti-anchoring preserved (no JOURNAL read; inbox read after my
own derivation).
**be2026a fix-forward — ALL 3 CONDITIONS SATISFIED → fix-forward FULLY CLEARED:**
1. **Post-fix lasuite-drive (m2p2, merged main 6cabbe7, ffa7d585afa2, PR=1): L4, rc=0, 3m19s.**
Independently verified: flags clean_teardown=true + no_secret_leak=true; all 4 essential rungs
pass; `test_minio_storage::...object_roundtrip` PASSED; `test_oidc_..._keycloak` PASSED. The
install converge no longer hangs — both fix-forwards (1357544 best-effort poll + 6cabbe7
completed-one-shot converge) exercised in one run. The literal "L5" in my condition is
**unmeetable on current code and NOT an rcust effect** — see staleness finding below; I accept
the L4-equivalence. Fix works end-to-end.
2. **Merged diff == branch diff** — verified earlier (4428e76): lifecycle.py + test file
byte-identical to be2026a.
3. **discourse A/B — restructure-NEUTRAL.** m2p-discourse (NEW main, 7ae7b0f, PR=2) = L1 and
ab-discourse-7ae7b0f-oldmain (OLD main, SAME ref, SAME PR=2) = L1, SAME stage (upgrade), SAME
message (`eb96de94+U` HC1 re-checkout). old==new byte-identical → rcust did NOT regress discourse.
The L4(184)→L1 vs baseline is pre-existing env drift since 06-05 (filed below), not rcust.
**FINDING [adversary] — M2 baseline matrix has 3 STALE L5 entries (lasuite-docs/drive/meet).**
Independently established: the level ladder dropped 6-rung(L5)→4-rung(max L4, integration &
recipe-local now OPTIONAL/non-laddered) in mainline PR#6 (c51cd84 "4-rung ladder", + 46e2cdb),
which `git merge-base --is-ancestor c51cd84 01e6d49^` confirms is an ANCESTOR OF PRE-RCUST MAIN.
The rcust merge touches level.py NOT AT ALL and results.py by +4 cosmetic P5 lines; compute_level
+ derive_rungs are byte-identical old-main↔merged-main. So NO current-code run (rcust or pre-rcust)
can produce L5; baselines 188/189/204 (L5, integration:pass) were recorded under the OLD schema
(run 204 ran 06-09 hours before the refactor deployed). **rcust is INNOCENT of L4≠L5.** Integration
coverage is NOT lost: the requires_deps OIDC tests EXECUTE and PASS (skip-count 0) on current code —
verified in m2p2 AND the sweep's m2r-lasuite-docs (`test_oidc_login_via_keycloak` +
`test_oidc_password_grant_...` PASSED) and m2r-lasuite-meet (`...password_grant...` PASSED).
ACCEPTED equivalence for the M2 matrix: **old L5 ≡ new L4 (all 4 essential rungs pass) + requires_deps
OIDC test PASSED (skip-count 0)**. Under this, lasuite-docs (m2r L4) / lasuite-meet (m2r L4) /
lasuite-drive (m2p2 L4) all MATCH. (Note: this validates — but corrects the basis of — the Builder's
first-sweep "lasuite-docs/meet matched baseline"; they are L4+OIDC, not numeric L5.) This is a
matrix-staleness correction, NOT a rcust regression; no VETO.
**Still OPEN for the M2 verdict (my side):** (a) per-recipe levels reconciled vs the CORRECTED
baseline for all 21; (b) bluesky-pds is L0 on BOTH old & new main (upstream image
`Cannot find module index.js`) — restructure-neutral but also cannot match its L4-equiv baseline on
ANY current run → needs a DECISIONS/DEFERRED note as non-rcust upstream breakage, not a silent
mismatch; (c) the 2 drone-path !testme runs (immich#2/plausible#3); (d) zero-leak teardown sweep;
(e) my own independent re-check of ≥5 recipes' logs + ALL mismatches before any M2 PASS.
---
## M2 — merged-main real-CI regression sweep: **PASS** @2026-06-11T01:15Z
Cold-verified the M2 claim (STATUS gate "M2 CLAIMED ~01:30Z") from my own clone + direct on cc-ci,
re-running/re-parsing rather than trusting Builder logs. Every M2.0–M2.4 item holds.
**M2.2 canaries — cold RE-RAN myself** from a fresh `origin/main` checkout (/root/adv-be2026a @
origin/main): `cc-ci-run -m pytest tests/regression/ -m canary -v` → **7/7 passed (301s)**, incl.
`bad-false-green` (the false-green detector) + all four RED canaries (bad-install/upgrade/backup/
restore) caught at their designed tier. The level system is NOT inflating. (log /root/adv-canary.log)
**M2.3 per-recipe — all 21 reconciled (cold jq on each run dir):**
- 13 clean: cryptpad/custom-html/ghost/hedgedoc/keycloak/matrix-synapse/n8n/uptime-kuma = L4;
mailu/custom-html-tiny = L2 (backup_restore N/A); mumble = L4 (deploy-count=1) — all == baseline,
clean_teardown=true.
- 2 designed-bad canaries genuinely exercised: bkp-bad rungs backup_restore=**fail** (backup=fail);
rst-bad backup_restore=**fail** (backup=pass→restore=fail). The L1 cap is upgrade-N/A ladder
semantics; the designed failure is recorded in the rung (verified — NOT a coincidental
level-match).
- immich/mattermost-lts/plausible: **L4 @ exact baseline refs** (m2b-*) — baseline REPRODUCED on the
restructured harness (cold-verified earlier this session).
- discourse: m2p-discourse (NEW main) == ab-discourse-7ae7b0f-oldmain (OLD main) — SAME ref/PR=2,
SAME stage, SAME upgrade-HC1 message (`eb96de94+U`), SAME L1. **old==new ⇒ rcust-neutral**; the
L4(184)→L1 is pre-existing env drift since 06-05 (DEFERRED.md), NOT caused by the restructure.
- lasuite-docs/-meet/-drive: L4 all-rungs-pass + requires_deps OIDC test PASSED (skip-count 0)
[lasuite-drive m2p2 also MinIO PASSED, post-both-fixes, rc=0]. Their "L5" baselines are STALE:
the 6→4-rung ladder landed in mainline c51cd84 (PR#6), which `git merge-base --is-ancestor
c51cd84 01e6d49^` confirms PREDATES the rcust merge; level.py untouched by the merge, derive_rungs
byte-identical old↔new. **rcust-innocent; integration coverage preserved** (OIDC tests execute &
pass). Accepted equivalence old L5 ≡ new L4-all-pass + OIDC-pass.
- bluesky-pds: EXCLUDED — `Cannot find module /app/index.js` crash-loop on BOTH old & new main at
every ref → upstream image breakage, rcust-neutral. DEFERRED.md note present.
**M2.3 drone→harness path:** drone builds **356 (immich) + 357 (plausible)** = `build_event=custom`
(bridge-triggered; distinct from push builds 358-361), trigger=autonomic-bot, both **success**
(verified in drone sqlite DB); run dirs 356/357 = immich L4 pr=2 / plausible L4 pr=3, customization
manifest present, clean_teardown=true.
**M2.4 customizations actually executed (cold-grep):** manifest block **21/21** logs; mumble
`ready-probe OK (tcp 3x) 127.0.0.1:64738`; ghost `ccci-overlay: provided compose.ccci.yml ...
base deploy auto-chaos` (P2a first-class path live); cryptpad `EXTRA_ENV='<hook>'`; immich
`ops.py[pre_backup,pre_restore,pre_upgrade]` + `pre-op seed` lines (migrated ctx hooks run).
**Teardown:** `docker stack ls` = infra (backups/bridge/dashboard/reports/drone/traefik) +
warm-keycloak ONLY, **zero leaked app stacks** (checked after ALL runs incl. drone-path).
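The zero-leak check is a set difference against the expected infra stacks. A minimal sketch, assuming the stack names have already been read from `docker stack ls` (for example with `--format '{{.Name}}'`); the exact stack names in `INFRA` are taken from the list above and the function name is hypothetical.

```python
# Stacks that are allowed to remain after teardown (from the entry above).
INFRA = {"backups", "bridge", "dashboard", "reports", "drone", "traefik", "warm-keycloak"}


def leaked_stacks(stack_names, allowed=INFRA):
    """Return the stack names that are not part of the allowed infra set."""
    return sorted(set(stack_names) - set(allowed))
```

An empty result is the pass condition; anything returned is either a leak or a run still in flight and has to be told apart by hand.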
**Fix-forwards (both Adversary-approved, additive):** 1357544 (lasuite-drive best-effort poll, appr
57c66ad) + be2026a/6cabbe7 (services_converged completed-one-shot, appr a531746) — merged diff ==
branch diff, all 3 be2026a conditions cleared (24a203a). Cold unit suite on post-fix main = 199
passed, lint PASS.
**VERDICT: M2 PASS.** No regression CAUSED BY the restructure: every deviation from the baseline
matrix is proven rcust-neutral by same-ref old-vs-new A/B (discourse, bluesky) or is a pre-rcust
stale-schema artifact with coverage preserved (3 lasuite), all documented in DEFERRED.md — not a
silent mismatch. The false-green detector is green on my own cold canary run. No findings filed,
no VETO.
**M1 PASS (01f9f70) + M2 PASS (this entry) both stand** → the phase DoD handshake is satisfied; the
Builder may write `## DONE` to STATUS-rcust.md. (M1's unit+lint acceptance still holds on post-fix
main: 199 passed / lint PASS, the fix-forwards being additive + separately approved.)

REVIEW-shot.md Normal file
@ -0,0 +1,113 @@
# REVIEW-shot.md — Adversary verdicts, phase `shot` (recipe screenshot audit & repair)
Owner: Adversary loop. Append-only verdict log. Gates: M1 (audit+diagnosis), M2 (all working).
SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-shot-screenshots.md`.
No gate CLAIMED yet (phase just opened; Builder has not bootstrapped STATUS-shot.md). Doing
independent cold ground-truth prep below so M1/M2 cold-verify is fast and un-anchored.
---
## Independent cold pre-audit (Adversary, @2026-06-11T01:20Z)
Method: ssh cc-ci, scanned `/var/lib/cc-ci-runs/*/results.json` for recipe + `screenshot` field +
on-disk `screenshot.png` size; scp'd suspect PNGs locally and **looked at them** (Read tool).
This is MY ground truth, formed before any Builder claim — to compare against the Builder's matrix.
PNG sizes from latest representative runs (m2r-* sweep + numbered drone runs):
| recipe | PNG bytes | my visual read | class |
|---|---|---|---|
| immich | 4801 | pure blank white frame | **BLANK** |
| n8n | 4801 | blank near-white frame | **BLANK** |
| lasuite-meet | 4801 | (size-identical to immich/n8n 4801B — blank tell) | BLANK (to confirm visually) |
| cryptpad | 4802 | blank light-grey frame | **BLANK** |
| keycloak | 8764 | spinner + "Loading the Administration Console" — paint-race loading state, NOT a real login form | **BLANK/LOADING** (not the "genuine sparse login" §2 guessed) |
| lasuite-docs | 6022 | bare spinner on white | **BLANK/LOADING** |
| lasuite-drive | ~5.9K | (size sibling of lasuite-docs — likely same spinner) | BLANK (to confirm) |
| plausible | null / NO PNG | every run null (122→357 incl. 357); run dir has no screenshot.png; capture stdout not in run dir (goes to Drone build log) — root cause still to trace | **NULL** |
| ghost | 444183 | (reference healthy, §2) | OK (visual-confirm at M2) |
| mattermost-lts | 242139 | reference healthy | OK |
| hedgedoc | 131967 | reference healthy | OK |
| discourse | 66-67K | reference healthy | OK |
| custom-html | 35707 | reference healthy | OK |
| mailu | 33800 | reference healthy | OK |
| matrix-synapse | 33296 | reference healthy | OK |
| uptime-kuma | 30858 | reference healthy | OK |
| custom-html-tiny | 12950 | reference healthy | OK |
| mumble | 7913 | voice server — web-UI N/A candidate (confirm) | N/A? |
Confirmed defect classes match the orchestrator pre-audit (§2): SPA paint-race (domcontentloaded
fires before JS paints) → immich/n8n/cryptpad fully blank, keycloak/lasuite-docs/-drive caught at
loading spinner; plausible never captures (null on every run). **The 4801B byte-identical size is a
reliable blank-frame fingerprint.**
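The size fingerprint lends itself to a first-pass triage before any PNG is opened. A sketch under assumptions: the 5000-byte cutoff is chosen here only because it sits just above the 4801-4802 B blanks in the table, and the function names are hypothetical.

```python
from pathlib import Path

BLANK_MAX_BYTES = 5000  # assumed cutoff, just above the 4801-4802 B blanks


def png_size(run_dir):
    """Size of the run's screenshot.png in bytes, or None if it is missing."""
    png = Path(run_dir) / "screenshot.png"
    return png.stat().st_size if png.is_file() else None


def classify_size(size_bytes):
    """First-pass class from file size alone; never a substitute for looking."""
    if size_bytes is None:
        return "NULL"
    if size_bytes <= BLANK_MAX_BYTES:
        return "BLANK?"
    return "INSPECT"
```

Size only catches the fully blank frames: the keycloak spinner at 8764 B lands in `INSPECT`, so every artifact above the cutoff still needs a visual read.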
Open items I must still resolve when verifying:
- plausible NULL root cause — need the Drone build log for a plausible run (capture stdout: "capture
failed" vs "produced no file" vs step never reached). Run dir alone doesn't have it.
- lasuite-meet / lasuite-drive / mumble — visual confirm.
- Authoritative enrolled-recipe set: every `tests/<recipe>/recipe_meta.py` minus fixtures
(`_generic`, `regression`, `concurrency`, `custom-html-bkp-bad`, `custom-html-rst-bad`).
No verdict yet. Awaiting `claim(shot): M1`.
---
## M1: PASS @2026-06-11T01:38Z (audit + diagnosis complete)
Claim: `claim(shot): M1` commit e005897; matrix+diagnoses at 8978fa6. STATUS-shot.md "M1 claim".
Verified COLD from my own clone + ssh cc-ci, **without reading JOURNAL-shot.md** (anti-anchoring).
My independent pre-audit (commit 4f3a747, formed BEFORE reading the Builder's matrix) already
agreed on every BLANK/LOADING/NULL read I had pre-formed — no anchoring.
**Enrolled set — complete, no omissions.** `ls tests/*/recipe_meta.py` = 21. Minus the two harness
canaries `custom-html-bkp-bad`, `custom-html-rst-bad` (plan §2 explicitly excludes both) = **19**.
The 19 matrix rows are *exactly* that set (diffed by hand) and exactly the plan §2 expected set.
`_generic`/`regression`/`concurrency`/`unit` have no recipe_meta.py → correctly absent. ✓
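The enrolled-set derivation can be reproduced mechanically. A minimal sketch, assuming a checkout whose `tests/` directory follows the layout described above; the function name `enrolled_recipes` is hypothetical.

```python
from pathlib import Path

# The two harness canaries that plan §2 excludes.
CANARIES = {"custom-html-bkp-bad", "custom-html-rst-bad"}


def enrolled_recipes(tests_root):
    """Recipe directories that ship a recipe_meta.py, minus the canaries."""
    found = {p.parent.name for p in Path(tests_root).glob("*/recipe_meta.py")}
    return sorted(found - CANARIES)
```

Diffing the returned list against the matrix rows replaces the by-hand comparison; directories without a `recipe_meta.py` drop out on their own.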
**Every non-OK row has evidence-backed root cause (independently re-derived):**
- plausible NULL — ran the Builder's drone-log command myself: build 357 step log shows
`capture failed … page.goto(https://plau-…/) never returned a status in (200,301,302,303,401,403)
after 15 attempts (45s); last status=500`. `/` 500s by design (DISABLE_AUTH) → default landing
capture can never succeed; needs a SCREENSHOT hook to a rendering path. Confirmed. ✓
- bluesky-pds NULL — capture is `if deploy_ok:`-gated, OUTSIDE the deploy try/except
(runner/run_recipe_ci.py:1024, read it). install=fail level=0 → capture correctly skipped. Not a
screenshot defect; upstream image breakage already in DEFERRED.md (rcust). ✓
- BLANK/LOADING — screenshot.py:84-93 navigates `wait_until="domcontentloaded"` then screenshots
immediately, no paint wait; accept_statuses excludes 500 (plausible mechanism). Read the code. ✓
- mumble NOT N/A — tests/mumble/recipe_meta.py header: deploys `compose.mumbleweb.yml`, a mumble-web
HTTP client routed through Traefik, HEALTH_PATH "/". A real web surface IS served → correctly the
HARDER (non-N/A) call. ✓
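For the BLANK/LOADING mechanism, one possible shape of a bounded paint wait is sketched below. It is an assumption-laden illustration, not the contents of screenshot.py: `capture_with_settle` and `settle_ms` are hypothetical names, and `page` is assumed to behave like a Playwright sync `Page` (`goto`, `wait_for_load_state`, `screenshot`).

```python
def capture_with_settle(page, url, out_path, settle_ms=10_000):
    """Navigate, give the page a bounded chance to go network-idle, then screenshot.

    `page` is expected to behave like a Playwright sync Page; this function and
    its parameters are illustrative, not the harness's own API.
    """
    page.goto(url, wait_until="domcontentloaded")
    try:
        # Bounded: a page that never goes idle costs at most settle_ms.
        page.wait_for_load_state("networkidle", timeout=settle_ms)
    except Exception:
        # Best-effort: a settle timeout must not abort the capture.
        settled = False
    else:
        settled = True
    page.screenshot(path=out_path)
    return settled
```

Catching a broad `Exception` keeps the sketch import-free; real code would more likely catch Playwright's own timeout error. The return value only reports whether the page settled, so a caller could log it without letting it influence a verdict.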
**Independent visual spot-checks (Read tool) — 11 artifacts, matrix matched reality on every one:**
immich 4801B = pure white; n8n 4801B = blank; cryptpad 4802B = blank grey; lasuite-meet 4801B =
pure white; keycloak 8764B = "Loading the Administration Console" spinner (NOT a real login — the
§2 "might be a genuine login" guess was wrong, Builder classed it LOADING correctly); lasuite-docs
6022B = bare spinner; mumble 7913B = spinner ring on grey; mattermost-lts 242139B = blue brand
splash + logo, NO login form (correctly LOADING despite large size — size alone is NOT a sufficient
signal, good catch); n8n run 197 30256B = real "Set up owner account" form, empty fields,
credential-free (flaky-pass + secret-safe, confirmed); custom-html 35707B = genuine "Welcome to
nginx!" (honest fresh-install view for a bare static host — OK); plausible = NULL via drone log.
Includes plausible ✓ and multiple 4801B cases ✓ (M1 minimum was ≥5 incl. those — exceeded).
**N/A arguments — agreed:**
- bluesky-pds → justified N/A (deploy-gated: can't screenshot what can't deploy; upstream breakage
is pre-existing/DEFERRED, not a screenshot defect). Agreed, contingent on the upstream image still
being broken at M2 — if it becomes deployable, it re-enters as a real recipe.
- mumble → NOT N/A. Agreed (real mumble-web surface, evidence above).
No omissions, no fabricated visual reads, diagnoses are causal not symptomatic. **M1 PASS.**
Watch-list for M2 (so the Builder has it early — NOT blocking M1):
1. Harness default-wait fix must stay within NAV_DEADLINE_S=45 / step worst-case ≤~60s and must
NEVER affect a verdict on screenshot failure (R7) — I will test the failure path has teeth but
no verdict impact, and compare pre/post run durations.
2. plausible SCREENSHOT hook must land on a credential-free *rendering* path (not /login showing a
generated secret; not a 500 page).
3. mattermost-lts proof: a bigger PNG is NOT acceptance — I will visually confirm the real login,
not a brand splash.
4. Secret-safety: every final PNG must show no generated credentials (install wizards, secrets
pages). n8n's "Set up owner account" with EMPTY fields is the safe shape; a pre-filled one is not.
5. M2 requires ≥2 proof runs via the drone `!testme` path + me Reading *every* final PNG.
Did not read JOURNAL-shot.md before this verdict. No finding filed (audit is accurate). No VETO.


@ -2,17 +2,60 @@
Plan: /srv/cc-ci/cc-ci-plan/concurrency-restructure-full-plan.md (SSOT for this phase)
## DONE
Both gates Adversary-verified fresh in REVIEW-conc.md, no open VETO:
- M1 — implementation verified: PASS @2026-06-10T04:38Z (branch @d3fe9e2)
- M2 — merged + live-verified (a)–(d): PASS @2026-06-10T08:55Z (final main 139e319/74ed240)
- CONC-A1 (M2(c) live finding): fixed b6e12ef, veto LIFTED + closed @09:05Z
## Phase state
- Phase: conc — concurrency restructure (P1–P5 + tests/concurrency)
- Builder branch: `restructure/concurrency` (code lands there; main untouched until M2 merge)
- In flight: P1 (lock-lifetime hardening)
- Gate: none claimed yet
- Phase: conc — concurrency restructure (P1–P5 + tests/concurrency) — COMPLETE
- Merged to main: bb5eb3d (restructure) + b7a009c (wrapper exit-code fix) + 139e319 (CONC-A1 fix)
- Correction per M2 verdict: 139e319's first parent is 2173894 (not 4ad55ed as the claim said);
immaterial — the code-diff-empty check (139e319 vs b6e12ef) is authoritative.
## Gates
## Gate claim: M2 — merged + live-verified
- M1 (implementation verified): NOT CLAIMED
- M2 (merged + live-verified): NOT CLAIMED — blocked on M1 PASS
**WHAT**: branch merged to main after M1 PASS; live verification (a)–(d) all green on the final
main code (which includes two M2-found fixes, both already Adversary-verified: wrapper exit-code
e1c4198/b7a009c, CONC-A1 run-keyed state files b6e12ef/139e319).
**WHERE**: main tip code = merge 139e319 (parents 4ad55ed ∘ b6e12ef); branch tip b6e12ef.
All evidence builds ran post-139e319. Drone repo recipe-maintainers/cc-ci; host cc-ci.
**HOW + EXPECTED (cold re-check from your own access path):**
1. Merge integrity: `git diff 139e319 b6e12ef -- runner/ tests/ docs/ .drone.yml nix/` → EMPTY;
no force-push anywhere (reflog linear).
2. Push build green on main: Drone builds 283 (branch fix), 284 (merge 139e319), 285 (inbox
commit) → all `status=success` (push events). No main push since has a red build.
3. Suites at b6e12ef (cold clone): `cc-ci-run -m pytest tests/unit -q` → 138 passed;
`cc-ci-run -m pytest tests/concurrency -q` → 23 passed; `nix develop .#lint --command bash
scripts/lint.sh` → lint: PASS. (You already cold-verified these + mutation-proofed
test_run_state per REVIEW-conc 08:4xZ entry.)
4. **(a) cancel-mid-run, on fixed harness**: build **295** (custom immich PR=2, comment 14307
@08:50:02Z). Canceled via `DELETE /api/repos/recipe-maintainers/cc-ci/builds/295` @08:51:05Z
(HTTP 200) while mid-deploy (lock held by harness pid 763099, 4 immich services converging).
EXPECTED/observed: build `status=killed`; pid 763099 gone by 08:51:15Z (SIGTERM funnel ran
the run's own teardown); `pgrep -f run_recipe_c[i]` → none; `lslocks | grep cc-ci-app` →
none (lock released); immi services/volumes/secrets/server-envs all 0. Zero leakage, no
janitor needed (better than plan minimum).
5. **(b) parallel runs**: builds **287** (immich#2) + **288** (plausible#3), both started
08:17:40Z (parallel), both `status=success`, both logs `deploy-count = 1 (expect 1)` +
level=4. Host after: zero harness procs / services / volumes / secrets / envs.
6. **(c) double-!testme same PR**: builds **290** + **291** (both immich#2, domain immi-ad3e33).
291 log line 1: `== app lock: another run of immi-ad3e33... is in flight — waiting ==`,
`acquired` @+1411s = exactly 290's exit (08:46:05Z). BOTH `status=success`, both
`deploy-count = 1`, level=4. Zero leakage after. (Your M2(c) PASS @09:05Z already covers
this; kernel-lock-table observation yours.)
7. **(d) full green run**: build **287** = complete immich e2e on final harness, all 5 tiers
pass, level=4 (288 plausible likewise).
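The per-app lock behaviour in steps 4 and 6 (one holder at a time, lock visible in the kernel lock table, 0-byte lockfile left behind after release) is consistent with an advisory `flock`. The sketch below assumes that mechanism; it is not taken from the harness, and `try_app_lock` / `release_app_lock` are hypothetical names.

```python
import fcntl
import os


def try_app_lock(lock_path):
    """Try to take an exclusive, non-blocking flock on lock_path.

    Returns the open fd on success (the caller keeps it open for the run's
    lifetime) or None when another holder already has the lock.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release_app_lock(fd):
    """Drop the lock; the 0-byte lockfile itself is left on disk."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A waiting run, like build 291, would either retry `try_app_lock` in a loop or take the lock without `LOCK_NB`. Because the kernel drops a `flock` when its holder exits, a killed run cannot leave a held lock behind, only the empty file.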
**Notes for verification**: builds 290/291 ran ~20 min each due to an immich-ML healthcheck
flake (your 08:43Z note) — converged within DEPLOY_TIMEOUT=1500s; unrelated to the restructure.
Unheld 0-byte lockfiles left behind by design (tidy-swept at next janitor probe).
## Blockers

STATUS-rcust.md Normal file

@ -0,0 +1,293 @@
# STATUS — sub-phase rcust (recipe-customization restructure)
## DONE
Phase complete 2026-06-11: M1 PASS (REVIEW-rcust.md 01f9f70, 2026-06-10) + M2 PASS (REVIEW-rcust.md
3245150, 2026-06-11) — both fresh, Adversary-verified, no standing VETO. Restructure merged to main
(01e6d49 + approved fix-forwards 1357544, 6cabbe7); all 21 recipes reconciled vs corrected
baseline; canaries 7/7 (Adversary's own cold run); drone path covered; zero leaked apps.
Non-rcust follow-ups filed in machine-docs/DEFERRED.md (discourse abra-stamp env drift,
bluesky-pds upstream image breakage re-pin).
Plan: /srv/cc-ci/cc-ci-plan/recipe-custom-restructure-full-plan.md (SSOT for this phase).
Reference spec: docs/recipe-customization.md @ 76a4b6b.
Work branch: `restructure/recipe-custom` (one commit per phase P1–P6; merged to main only after M1 PASS).
## Phase progress
- [x] P1 — single loader + key registry + migrate L1–L6 + unit tests + doc gen
(branch commit 472a68b)
- [x] P2 — delete legacy keys/paths: compose.ccci.yml first-class+auto-chaos; install-time deps only
(lasuite-docs migrated, setup_custom_tests.sh gone); SKIP_GENERIC meta deleted (env dev-only +
loud CI warning); conftest cleanup (deployed/deployed_app/app_domain gone, one `deps` fixture)
(branch commit 8cd72fd)
- [x] P3 — uniform ctx hook convention: HookCtx(.domain/.base_url/.meta/.deps/.op); all hooks
take ctx; legacy signatures raise MetaError at load naming the migration (branch fd02d9f)
- [x] P4 — custom-test ergonomics: placement rule (custom under functional/+playwright/ only),
op_state fixture, deps fixture tests (branch 29a28e2)
- [x] P5 — customization manifest: one block at run start (non-default meta keys, hooks, overlays,
custom-test counts, active CCCI_SKIP_GENERIC* env overrides with !! CI flag) printed +
embedded verbatim in results.json under "customization"; pure presentation, HC2-honoring
(branch commit 68954be — new runner/harness/manifest.py + tests/unit/test_manifest.py)
- [x] P6 — docs rewritten to the end state: recipe-customization.md is now the REFERENCE (was
review spec) — §8 records R1–R9 resolutions, §4 keeps the generated table + HookCtx, §5 the
end-state shapes; testing.md invariant updated to install-time-deps isolation, generic
opt-out documented dev-only; enroll-recipe.md worked examples (lasuite-docs install-time
OIDC, mumble post-F2-14c), deps fixture, ctx signatures (branch commit da558ca)
- [x] Adversary inbox 19:06Z (P5 manifest dashboard hygiene) — addressed: secret-NAMED meta
values (top-level + nested dict keys) render as '<redacted>' in manifest + results.json;
key names stay visible; unit-test pinned (branch commit 858e0f5)
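The manifest redaction rule from the 19:06Z inbox item (secret-named keys masked at top level and inside nested dicts, key names kept) can be sketched roughly as follows. The pattern for what counts as "secret-named" is a guess made for illustration, and `redact` is a hypothetical name; the real rule lives in the harness and its pinned unit test.

```python
import re

# Assumption: which key names count as "secret-named" is a guess for illustration.
_SECRET_NAME = re.compile(r"(PASSWORD|PASSWD|SECRET|TOKEN|KEY)", re.IGNORECASE)


def redact(meta):
    """Return a copy of meta where values under secret-named keys are masked."""
    out = {}
    for key, value in meta.items():
        if _SECRET_NAME.search(str(key)):
            out[key] = "<redacted>"  # key name stays visible, value does not
        elif isinstance(value, dict):
            out[key] = redact(value)  # nested dict keys get the same treatment
        else:
            out[key] = value
    return out
```

Building a copy rather than mutating in place means the same masked structure can feed both the printed manifest and results.json while the live meta stays usable for the deploy.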
## P1–P6 verification facts (for the eventual M1 cold-verify)
- WHERE: branch `restructure/recipe-custom`, P1=472a68b, P2=8cd72fd, P3=fd02d9f, P4=29a28e2,
P5=68954be, P6=da558ca, manifest-redaction fix=858e0f5 (branch head).
- HOW: `cc-ci-run -m pytest tests/unit -q` and `nix develop .#lint --command scripts/lint.sh`
from a clean checkout of the branch.
- EXPECTED: 192 passed; `lint: PASS`.
- New single loader: `runner/harness/meta.py::load()`; all-recipes typo gate + R2 proof in
`tests/unit/test_meta.py`; docs §4 table generated by `scripts/gen-meta-docs.py` (sync pinned
by unit test).
## M2 baseline matrix (built BEFORE merge, per plan M2.1)
Expected outcome per recipe dir for the post-merge regression sweep = most recent known-good
evidence. Levels are results.json `level`; evidence = run id under /var/lib/cc-ci-runs/<id>/
(on cc-ci) unless noted. Bad canaries are EXPECTED to fail at their designed tier.
| Recipe | Expected | Evidence |
|---|---|---|
| bluesky-pds | full lifecycle green: 5 tiers + 4 custom pass, deploy-count=1 (L4-equiv; pre-results-era) | Adversary cold run, REVIEW e45e0ee (Phase 2 Q4.3); weekly 06-05: up-to-date |
| cryptpad | L4 (all four essential rungs pass) | run 181 (06-05) |
| custom-html | L4 | run 182 (06-05) |
| custom-html-bkp-bad | DESIGNED-BAD: backup tier fail → backup_restore=fail, L1 | run regression-bad-restore-2 (06-02) |
| custom-html-rst-bad | DESIGNED-BAD: restore tier fail → backup_restore=fail, L1 | run regression-bad-restore-3 (06-02) |
| custom-html-tiny | L2 (backup_restore N/A — declared EXPECTED_NA; functional N/A) | run 205 (06-09) |
| discourse | L4 | run 184 (06-05) |
| ghost | L4 | run 185 (06-05) |
| hedgedoc | L4 | run 113 (06-02) |
| immich | L4 | run 307 (06-10) |
| keycloak | L4 | run 187 (06-05) |
| lasuite-docs | L5 (integration pass) | run 188 (06-05) |
| lasuite-drive | L5 (integration pass) | run 189 (06-05) |
| lasuite-meet | L5 (integration pass) | run 204 (06-09) |
| mailu | L2 (backup_restore N/A — no backupbot labels; functional pass) | run 191 (06-05) |
| matrix-synapse | L4 | run 203 (06-08) |
| mattermost-lts | L4 | run 196 (06-05) |
| mumble | all 5 tiers pass, deploy-count=1 (L4-equiv; pre-results-era) | log ~/ccci-mumble-f214c.log on cc-ci (05-31) |
| n8n | L4 | run 197 (06-05) |
| plausible | L4 | run 308 (06-10) |
| uptime-kuma | L4 | run 165 (06-02) |
Customization-executed spot-greps for M2.4 (mumble READY_PROBE tcp lines, cryptpad
SANDBOX_DOMAIN, ghost/discourse BACKUP_VERIFY + overlay copy + chaos base, lasuite-* deps
provisioning + OIDC skip-count 0, immich ops.py seeds, manifest block in every log) apply on the
sweep runs, not retroactively here.
## Gate
**Gate: M2 CLAIMED 2026-06-11 ~01:30Z, awaiting Adversary.**
### M2 claim — WHAT / HOW / EXPECTED / WHERE
WHAT: plan M2.0–M2.4 complete on merged main. Merge 01e6d49 (build 326 green) + two
Adversary-approved fix-forwards: 1357544 (lasuite-drive best-effort bucket poll, approval 57c66ad)
and 6cabbe7 = merge of be2026a (services_converged completed-one-shot rule, approval a531746,
build 350 green on 914c166, merged-diff==branch-diff verified 4428e76). Canaries 7/7. All 21
recipe dirs reconciled vs the CORRECTED baseline (the Adversary-accepted L5≡L4+OIDC equivalence
for the three stale lasuite-* rows; one justified exclusion: bluesky-pds, non-rcust upstream image
breakage, DEFERRED.md). Drone→harness path covered (2 PR !testme runs green). Zero leaked apps.
RECONCILIATION (final evidence per recipe; run dirs under /var/lib/cc-ci-runs/):
| Recipe | Baseline | Final evidence | Match |
|---|---|---|---|
| bluesky-pds | full green (pre-results-era) | m2r L0 == m2rr L0 == ab-oldmain L0, all `Cannot find module /app/index.js` crash-loop | EXCLUDED: upstream image breakage, harness-neutral (DEFERRED.md) |
| cryptpad | L4 | m2r-cryptpad L4 | ✓ |
| custom-html | L4 | m2r-custom-html L4 | ✓ |
| custom-html-bkp-bad | designed backup fail, L1 | m2r: backup fail exactly | ✓ |
| custom-html-rst-bad | designed restore fail, L1 | m2r: backup pass → restore fail exactly | ✓ |
| custom-html-tiny | L2 (declared EXPECTED_NA) | m2r-custom-html-tiny L2 | ✓ |
| discourse | L4 (184, 06-05) | m2r/m2b/m2p + ab-oldmain×2: ALL deviations byte-identical old==new harness (restore race @default head: L2==L2; upgrade-HC1 @baseline ref PR=2: L1==L1, stamp eb96de94+U both) | env drift since 06-05, rcust-neutral (Adversary-verified, condition 3 of a531746) |
| ghost | L4 | m2r-ghost L4 | ✓ |
| hedgedoc | L4 | m2r-hedgedoc L4 | ✓ |
| immich | L4 | m2b-immich L4 @baseline ref + drone-path run 356 L4 | ✓ |
| keycloak | L4 | m2r-keycloak L4 | ✓ |
| lasuite-docs | L5 (stale schema) | m2r-lasuite-docs L4 all-pass + OIDC PASSED skip-0 | ✓ (accepted equivalence) |
| lasuite-drive | L5 (stale schema) | m2p2-lasuite-drive L4 all-pass + OIDC + MinIO PASSED, rc=0, post-both-fixes | ✓ (accepted equivalence) |
| lasuite-meet | L5 (stale schema) | m2r-lasuite-meet L4 all-pass + OIDC PASSED | ✓ (accepted equivalence) |
| mailu | L2 | m2r-mailu L2 | ✓ |
| matrix-synapse | L4 | m2r-matrix-synapse L4 | ✓ |
| mattermost-lts | L4 | m2b-mattermost-lts L4 @baseline ref | ✓ |
| mumble | all 5 tiers (pre-results-era) | m2r-mumble all tiers pass, deploy-count=1 | ✓ |
| n8n | L4 | m2r-n8n L4 | ✓ |
| plausible | L4 | m2b-plausible L4 @baseline ref + drone-path run 357 L4 | ✓ |
| uptime-kuma | L4 | m2r-uptime-kuma L4 | ✓ |
HOW (cold, from the Adversary's own clone / direct on cc-ci):
- per-recipe: `jq '{recipe,level,rungs,flags}' /var/lib/cc-ci-runs/<id>/results.json` for every id
above; logs in /root/m2-logs/, /root/m2-baseline-logs/, /root/m2-proof-logs/, /root/m2-ab-logs/.
- canaries: /root/m2-canary.log (7/7, fresh clone of merged main).
- drone path: builds 356 (immich#2) + 357 (plausible#3) `custom` events SUCCESS in drone DB
(`docker cp <drone_cid>:/data/database.sqlite` + sqlite query, as documented above); run dirs
356/357 carry `customization` manifest keys + clean flags; triggered by real `!testme` comments
(gitea comment ids 14317/14318).
- M2.4 spot-greps: section above (manifest 21/21, mumble tcp probe, ghost/discourse overlay+
BACKUP_VERIFY, lasuite deps+OIDC, immich seeds, cryptpad EXTRA_ENV hook+playwright).
- zero-leak: `docker stack ls` on cc-ci → infra (backups/bridge/dashboard/reports/drone/traefik)
+ warm-keycloak ONLY (checked 01:27Z, after ALL runs incl. drone-path).
- tree: origin/main, working tree clean, every claim-referenced commit pushed.
EXPECTED: every check above reproduces as stated; no recipe regresses vs the corrected baseline.
WHERE: origin/main @ (this commit); REVIEW-rcust.md holds M1 PASS (01f9f70), be2026a approval +
all-conditions-cleared (a531746, 24a203a); DEFERRED.md holds the two non-rcust follow-ups
(discourse abra-stamp mechanism, bluesky-pds upstream re-pin).
**Gate history: M2 IN PROGRESS** — M1 PASS in REVIEW-rcust.md (01f9f70, 2026-06-10).
- M2.0 merge: `restructure/recipe-custom` merged to main as 01e6d49 (merge commit, no force);
push build green: drone build **326 success** on 01e6d49 (API-verified).
- M2.2 canary suite: **7/7 PASSED** in 286s (fresh clone of merged main at /root/m2-sweep on
cc-ci, log /root/m2-canary.log) — green canaries pass, all four RED canaries still caught at
their designed tiers (bad-install/bad-upgrade/bad-backup/bad-restore).
- M2.3 per-recipe sweep (driver /root/m2-driver.sh, 2 concurrent, REF = mirror heads; logs
/root/m2-logs/<r>.log; results /var/lib/cc-ci-runs/m2r-<r>/): first pass **15/21 matched
baseline** —
hedgedoc/custom-html/custom-html-tiny/uptime-kuma/n8n/cryptpad/ghost/keycloak/mumble/mailu/
matrix-synapse/lasuite-docs/lasuite-meet at baseline level; both DESIGNED-BAD canaries failed
at exactly their designed tier (bkp-bad: backup fail; rst-bad: backup pass→restore fail).
6 below baseline, ALL flake-shaped (known modes, not new assertion semantics):
discourse+plausible+mattermost-lts+immich restore data-integrity (the documented pre-existing
truncated-dump capture race — discourse BACKUP_VERIFY honestly failed 3/3 attempts, its
docstring + the 06-05 weekly report record this exact mode pre-restructure; seeds verified
committed by ops.py read-back asserts, i.e. the migrated ctx hooks executed correctly);
bluesky-pds abra `FATA deploy timed out` at default 600s during concurrent image pulls;
lasuite-drive pre_install MinIO one-shot 90s timeout (bucket appeared later — every
subsequent tier passed). Serial re-runs (MAX=1, /root/m2-rerun.sh, logs /root/m2-rerun-logs/,
results m2rr-<r>/) completed 20:44Z — but ran default heads, not baseline refs (superseded by
the targeted runs below).
- M2.3 reconciliation runs (serial, MAX=1):
- **Baseline-ref re-runs on merged main** (/root/m2-baseline-runs.sh, logs /root/m2-baseline-logs/,
results m2b-<r>/): **plausible L4, mattermost-lts L4, immich L4** at their exact baseline refs —
baseline REPRODUCED on the restructured harness; restore-race cluster closed for those three.
m2b-discourse @7ae7b0f (ran PR=0; baseline run 184 was PR=2): **L1, NEW mode** — upgrade HC1
`deployed chaos commit 'eb96de94+U', not PR-head '7ae7b0f76efb'`. Investigated facts (cold-checkable
in /var/lib/cc-ci-runs/m2b-discourse/): `eb96de94` IS the prev-base tag commit `0.7.0+3.3.1`
(`git -C .../abra/recipes/discourse rev-list -n1 0.7.0+3.3.1`); the preserved per-run clone HEAD =
7ae7b0f (the upgrade re-checkout DID run and persist); the
`service "sidekiq" depends on undefined service "discourse"` log line is benign noise (appears
verbatim in the PASSING m2r/m2rr upgrade sections too; published compose ships a dangling
depends_on — see tests/discourse/compose.ccci.yml NOTE). So the chaos redeploy itself left the
base stamp in place at this ref. NOT folded into the restore-flake cluster; discriminating runs
queued (below).
- **Old-main A/B at the m2r ref** (/root/m2-ab.sh, /root/m2-ab-logs/, results ab-<r>-oldmain/):
discourse @7d53d4ec on OLD main = **L2 restore fail** == new-main m2r L2 at the same ref →
restore race harness-neutral at that ref. bluesky-pds @b2d86ef on OLD main = **L0 install fail**.
- **bluesky-pds re-characterized (not a pull timeout)**: the app container crash-loops
`Error: Cannot find module '/app/index.js'` (MODULE_NOT_FOUND, Node v24.15.0) in ALL THREE
failures — m2r (new main @ mirror head), m2rr (new main, serial), ab-oldmain (OLD main @ old
default head b2d86ef). Same pinned tag, both harnesses, both refs → upstream image content moved
under the tag; recipe cannot deploy on ANY harness. Evidence:
`grep -r MODULE_NOT_FOUND /var/lib/cc-ci-runs/{m2r,m2rr,ab}-bluesky-pds*/abra/logs/default/`.
Restructure-neutral (old==new L0).
- M2.3 in-flight proof runs (serial queue /root/m2-proof.sh + /root/m2-proof2.sh, logs
/root/m2-proof-logs/, driver /root/m2-proof-logs/driver.log):
1. **lasuite-drive @baseline ref ffa7d585afa2 PR=1 on merged main @5c0676b** (post-fix-forward
1357544) → run id m2p-lasuite-drive: **WILL LAND L0 — second P2b regression found via this
run, root-caused LIVE.** The 1357544 best-effort path WORKED (`!!` warn + continue in the
log); the one-shot task went **Complete** ~3min in (bucket created); but a completed
restart_policy-none one-shot reports replicas 0/1 FOREVER, and services_converged requires
cur==want → the install assert burned DEPLOY_TIMEOUT (1800s) and failed. Old world never saw
this: setup_custom_tests.sh ran POST-install-assert (its own header: orchestrator runs it
after the deploy is healthy); P2b moved the trigger to ops.py pre_install = PRE-assert.
Verified live during the run: app HTTP 200, all other services 1/1,
`docker service ps ..._minio-createbuckets` = Complete, pytest in converge loop 27+ min.
**Fix-forward proposed, awaiting Adversary approval: branch `fix/converged-oneshot` @
be2026a** — services_converged treats a replica deficit explained ENTIRELY by Complete tasks
as converged (Failed/mixed/spinning-up/no-tasks still block; 0/0 + N/N unchanged); pinned by
tests/unit/test_converged_oneshot.py (7 cases). Proof: working tree on cc-ci
`cc-ci-run -m pytest tests/unit -q` → 199 passed; lint PASS.
**APPROVED (REVIEW a531746) and MERGED to main as 6cabbe7** (merge commit, no force);
merged diff == be2026a diff (`git diff be2026a..main -- runner/harness/lifecycle.py
tests/unit/test_converged_oneshot.py` = empty). Push build green: drone build **350
success** on 914c166 (branch head incl. the merge; verify on cc-ci:
`docker cp <drone_cid>:/data/database.sqlite /tmp/d.sqlite && sqlite3 /tmp/d.sqlite
"select build_number,build_status,build_after from builds order by build_id desc limit 5"`).
Post-fix re-run QUEUED: /root/m2-proof3.sh waits for the discourse A/B pair to drain, then
runs lasuite-drive @ffa7d585afa2 PR=1 from fresh clone /root/m2-postfix @6cabbe7
CCCI_RUN_ID=m2p2-lasuite-drive, log /root/m2-proof-logs/lasuite-drive-postfix.log.
EXPECTED **L5** (binding condition 1 of the approval).
DISCLOSED INTERVENTION: in the doomed pre-fix m2p run, after the GENERIC install assert had
already failed at the 1800s converge deadline, the OVERLAY install test entered a second
identical 1800s converge burn — Builder sent it (pytest pid only) SIGINT at ~01:00Z to skip
the redundant 20+ min wait. The log therefore shows `KeyboardInterrupt` at generic.py:97
(the converge poll — the exact diagnosed line). The orchestrator's own exit paths/teardown
untouched; run continued to upgrade/backup/restore/custom normally. The m2p result is
diagnostic evidence of the bug, not a baseline data point — the binding proof is m2p2.
2. **discourse @7ae7b0f PR=2 on merged main** (exact baseline-184 invocation) → m2p-discourse:
**COMPLETE — L2, upgrade HC1 fail, chaos-version=eb96de94+U** (identical to m2b: stamp = the
prev-base tag commit). Deterministic at this ref on new main; NOT a PR=0 artifact, NOT a race.
install/backup/restore/custom all pass.
3. **discourse @7ae7b0f PR=2 on OLD main** → ab-discourse-7ae7b0f-oldmain: **COMPLETE — L2,
upgrade HC1 fail, chaos-version=eb96de94+U — BYTE-IDENTICAL failure to the new-main run.**
**DISCOURSE A/B CLOSED: old harness == new harness at the baseline ref + baseline invocation
(PR=2). The upgrade-HC1 mode is HARNESS-NEUTRAL — not an rcust regression.** Baseline 184's
L4 (06-05) vs today's identical-both-worlds failure = environment/content drift since 06-05,
outside both harnesses. Drift candidates checked and ELIMINATED: 7ae7b0f is still a live
branch tip in the mirror (`refs/heads/upgrade-0.8.0+3.5.0` + `refs/pull/2/head` — git
ls-remote), and upstream's latest release tag is unchanged (0.7.0+3.3.1 = eb96de94, no new
tag since 06-05). flake.lock (abra pin) identical in both worlds. HC1 firing rather than
false-greening is the guard working as designed.
Cold-verify: results.json + full logs at /var/lib/cc-ci-runs/{m2p-discourse,
ab-discourse-7ae7b0f-oldmain}/ + /root/m2-proof-logs/discourse{,-oldmain}.log.
4. **lasuite-drive @ffa7d585afa2 PR=1 on merged main @6cabbe7 (post-converge-fix)** →
m2p2-lasuite-drive: **COMPLETE in 3m19s, rc=0 — all 5 stages pass, deploy-count=1,
`test_oidc_password_grant_against_dep_keycloak` PASSED (requires_deps skip-count 0),
`test_minio_bucket_present_and_object_roundtrip` PASSED, clean_teardown+no_secret_leak
flags true. NO converge burn: the one-shot again exceeded its 90s window (`!!` best-effort
line), completed late, and the install assert passed straight through — both fix-forwards
proven end-to-end.** results.json `level=4`, NOT 5 — see schema note below.
- **BASELINE SCHEMA NOTE (affects lasuite-docs/-drive/-meet expected "L5")**: the 6-rung ladder
(L5 integration / L6 recipe-local) was REMOVED from main by the deliberate mainline refactor
46e2cdb + c51cd84 ("four essential rungs only — integration & recipe-local are optional",
PR #6, 2026-06-09 ~03:00Z) — BEFORE the rcust merge and NOT part of it (merge diff
01e6d49^1..01e6d49 touches level.py not at all and results.py by +4 lines; current
derive_rungs/compute_level are byte-equal to the pre-merge main versions). Every post-06-09 run
caps at L4 BY DESIGN; the integration (OIDC) test now counts inside the functional/custom rung.
Timeline evidence: run 204 (lasuite-meet, 06-09 pre-deploy) = 6-rung level 5; all later runs =
4-rung. EQUIVALENCE for the baseline matrix: old "L5 (integration pass)" ≡ new "L4 all-rungs
pass + the requires_deps OIDC test PASSED (skip-count 0)". m2p2-lasuite-drive meets it; the
m2r sweep's lasuite-docs + lasuite-meet L4-all-pass results (with their OIDC PASSED lines,
already in M2.4 spot-greps) meet it identically.
- M2.4 spot-greps (customizations actually executed — log evidence in /root/m2-logs/):
manifest block present 21/21; mumble `ready-probe OK (tcp 3x): 127.0.0.1:64738`; ghost+discourse
`ccci-overlay: provided compose.ccci.yml ... auto-chaos` (P2a first-class path live);
discourse BACKUP_VERIFY hook live (3 verify lines); lasuite-docs `install-time OIDC:
provisioning deps ['keycloak'] BEFORE deploy` + `test_oidc_login_via_keycloak PASSED`
(requires_deps skip-count 0); immich ops.py pre_upgrade/pre_backup/pre_restore seed lines;
cryptpad EXTRA_ENV='<hook>' in manifest + its 4 overlays + playwright green (hook applied);
19 screenshot.png across m2r-* dirs.
- Teardown: `docker stack ls` after the full 21-recipe sweep = infra stacks + warm-keycloak only,
**zero leaked apps**.
- Drone→harness path: !testme on two open recipe PRs pending after the re-runs.
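The completed-one-shot rule from proof run 1 (a replica deficit explained entirely by Complete tasks counts as converged; Failed, mixed, spinning-up and no-tasks cases still block; 0/0 and N/N unchanged) can be written as a pure predicate. This is a sketch under assumptions, not the code in runner/harness/lifecycle.py: the function name, its arguments and the lowercase state strings are hypothetical.

```python
def service_converged(current, desired, task_states):
    """Decide convergence for one service.

    current/desired are the running and wanted replica counts; task_states
    lists one state string per task (e.g. "running", "complete", "failed").
    """
    if current == desired:
        return True  # N/N and 0/0 are unchanged
    deficit = desired - current
    if deficit < 0 or not task_states:
        return False  # over-provisioned, or no tasks to explain the gap
    not_running = [s for s in task_states if s != "running"]
    # Converged only if every non-running task completed and they cover the deficit.
    return len(not_running) >= deficit and all(s == "complete" for s in not_running)
```

For the MinIO bucket one-shot, `service_converged(0, 1, ["complete"])` is true, while a failed or still-preparing task keeps the install assert waiting as before.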
**Gate history: M1 CLAIMED 2026-06-10 → PASS** (branch head 858e0f5)
- WHAT: P1–P6 complete on branch `restructure/recipe-custom` (P1=472a68b, P2=8cd72fd, P3=fd02d9f,
P4=29a28e2, P5=68954be, P6=da558ca, +858e0f5 manifest redaction). Working tree clean, all pushed.
- HOW (cold, from a fresh clone of the branch):
- `cc-ci-run -m pytest tests/unit -q` → EXPECTED: **192 passed**
- `cc-ci-run -m pytest tests/concurrency -q` → EXPECTED: **23 passed** (untouched by this plan;
Builder proof run 2026-06-10 on branch head: 23 passed in 11.46s)
- `nix develop .#lint --command scripts/lint.sh` → EXPECTED: **lint: PASS**
- resolved-customization diff old-vs-new for all 21 recipe dirs (Adversary's own script) →
EXPECTED: 0 deltas
- adversarial review of the full diff `main..restructure/recipe-custom`
- WHERE: origin branch `restructure/recipe-custom` @ 858e0f5; baseline matrix above (M2 prep,
committed pre-merge per plan).
## Current
M2 CLAIMED (see Gate above) — awaiting Adversary cold-verify. No other unblocked work in this
phase; DONE follows the M2 PASS handshake.

STATUS-shot.md Normal file

@ -0,0 +1,38 @@
# STATUS-shot.md — Builder status, phase `shot`
SSOT: /srv/cc-ci/cc-ci-plan/plan-phase-shot-screenshots.md
## Current section
Gate: M1 CLAIMED, awaiting Adversary.
P1 audit matrix COMPLETE (all 19 enrolled recipes, every PNG visually inspected).
P2 diagnoses COMPLETE (see BACKLOG-shot.md P2 — each with evidence).
Meanwhile working (unblocked, pre-M2): P3 harness default-wait improvement + unit tests.
## M1 claim — verification map (WHAT/HOW/EXPECTED/WHERE)
WHAT: M1 = full audit matrix (19/19 enrolled recipes, BACKLOG-shot.md "P1 — Audit matrix") +
root-cause diagnosis with evidence for every non-OK row (BACKLOG-shot.md "P2") + N/A candidates
argued (bluesky-pds: blocked-upstream N/A; mumble: explicitly NOT an N/A — real web UI).
Claimed at commit 8978fa6 (matrix+diagnoses) — claim commit follows.
- Enrolled set (19): `ls tests/*/recipe_meta.py` minus fixtures `_generic, regression, concurrency,
custom-html-bkp-bad, custom-html-rst-bad` (those first three have no recipe_meta.py; the two
`-bad` ones do but are harness canaries).
- Matrix: BACKLOG-shot.md "P1 — Audit matrix". Reproduce any row:
`ssh cc-ci 'grep -o "\"screenshot\": *[^,}]*" /var/lib/cc-ci-runs/<run>/results.json; stat -c%s /var/lib/cc-ci-runs/<run>/screenshot.png'`
then scp the PNG and Read it. Run ids are in the matrix "latest run" column.
- plausible NULL evidence: Drone sqlite, build 357 ci step (step_id 947):
`ssh cc-ci 'docker run --rm -v drone_ci_commoninternet_net_data:/data alpine sh -c "apk add -q sqlite; sqlite3 /data/database.sqlite \"select log_data from logs where log_id=947\"" | grep -o "screenshot[^\"]*"'`
EXPECTED: `capture failed … last status=500` after 15 attempts/45s.
- bluesky-pds NULL evidence: `grep '"install"' /var/lib/cc-ci-runs/m2rr-bluesky-pds/results.json`
→ fail, level=0; capture is gated on deploy_ok (runner/run_recipe_ci.py:1024).
- Default capture path under audit: runner/harness/screenshot.py:84-93 (domcontentloaded, no paint
wait) — the BLANK/LOADING mechanism; accept_statuses excludes 500 — the plausible mechanism.
- mumble web UI exists: tests/mumble/recipe_meta.py header (compose.mumbleweb.yml, HEALTH_PATH "/").
- custom-html fresh install serves nginx default: no install_steps.sh in tests/custom-html/ (only
pre_backup/pre_upgrade seeds in ops.py, which run AFTER the capture moment).
## Blocked
(nothing)


@ -14,8 +14,9 @@ those are discovered and run against the live app (D4 — see below).
```
tests/<recipe>/
├── recipe_meta.py # optional per-recipe harness config (see below)
├── install_steps.sh # optional custom install-steps hook (pre-deploy setup)
├── ops.py # optional pre-op seed hooks (pre_install/pre_upgrade/pre_backup/pre_restore)
├── install_steps.sh # optional custom install-steps hook (pre-deploy setup + deps env wiring)
├── compose.ccci.yml # optional CI-only compose overlay (harness-copied, auto-chaos base deploy)
├── ops.py # optional pre_<op>(ctx) seed hooks (install/upgrade/backup/restore)
├── test_install.py # optional install overlay (runs ADDITIVELY alongside generic)
├── test_upgrade.py # optional upgrade overlay (runs ADDITIVELY alongside generic)
├── test_backup.py # optional backup overlay (runs ADDITIVELY alongside generic)
@ -39,11 +40,14 @@ To add recipe-specific coverage, drop a `tests/<recipe>/test_<op>.py` **overlay*
**ALONGSIDE** the generic for that op (HC3 additive, Phase 1e); the generic floor is never silently
dropped. Overlays are **assertion-only** against the shared live deployment (the `live_app` fixture;
they never perform the op or deploy/teardown — the orchestrator owns those). If the overlay needs to
SEED pre-op state (data-continuity markers, the backup→restore divergence), put `pre_<op>(domain,
meta)` callables in `tests/<recipe>/ops.py` — the orchestrator runs them BEFORE the op. Copy an
SEED pre-op state (data-continuity markers, the backup→restore divergence), put `pre_<op>(ctx)`
callables in `tests/<recipe>/ops.py` — the orchestrator runs them BEFORE the op (`ctx` is the
uniform `HookCtx` every hook receives — `docs/recipe-customization.md` §4.1). Copy an
existing recipe (`tests/custom-html/` simple/volume marker; `tests/keycloak/` admin-API; `tests/
matrix-synapse/` `db`-service psql marker). **Do not edit the shared `tests/conftest.py` /
`runner/harness/` to add a recipe** — set per-recipe knobs in `recipe_meta.py`:
`runner/harness/` to add a recipe** — set per-recipe knobs in `recipe_meta.py` (the COMPLETE key
reference is the generated table in `docs/recipe-customization.md` §4; unknown ALL-CAPS keys are
hard errors, recipe-private constants are underscore-prefixed `_FOO`):
```python
HEALTH_PATH = "/realms/master" # path that returns a healthy status (default "/")
@ -51,9 +55,7 @@ HEALTH_OK = (200,) # acceptable status codes (default 200/301/302)
DEPLOY_TIMEOUT = 600 # seconds for services to converge (default 600)
HTTP_TIMEOUT = 600 # seconds for the app to answer (default 300)
BACKUP_CAPABLE = True # override backup-capability auto-detect (default: scan compose)
EXTRA_ENV = {"KEY": "value"} # or EXTRA_ENV(domain) -> dict; extra .env keys set at deploy
SKIP_GENERIC = ["upgrade"] # per-recipe opt-out from the generic floor for the listed ops
# ("all"/"*" = every op); rarely needed — generic is the floor
EXTRA_ENV = {"KEY": "value"} # or EXTRA_ENV(ctx) -> dict; extra .env keys set at deploy
```
Useful `harness.lifecycle` helpers for overlays: `http_get`, `http_fetch`, `http_body`,
@ -76,9 +78,10 @@ Beyond the lifecycle overlays, each recipe carries (plan §4.1):
- **`playwright/`** — browser flows where the recipe's core UX is a UI (P6).
The orchestrator's **custom** tier discovers `test_*.py` in `tests/<recipe>/{functional,playwright}/`
(recursive, via `runner/harness/discovery.custom_tests`) and runs each as its own pytest against
the same `live_app` shared deployment. Lifecycle-named files (`test_install.py`/etc.) are
**excluded** from the custom tier — they live at the top level and run as lifecycle overlays.
ONLY (the placement rule, via `runner/harness/discovery.custom_tests` — a top-level `test_*.py`
is a lifecycle overlay and nothing else) and runs each as its own pytest against the same
`live_app` shared deployment. Lifecycle-named files (`test_install.py`/etc.) are **excluded**
from the custom tier even inside those subdirs (safety net against double-running).
### 2.2 Recipe-test dependencies — DEPS = [...] (Phase 2 Q2.3)
@ -89,23 +92,28 @@ them in `recipe_meta.py`:
DEPS = ["keycloak"] # one entry per dep recipe name (cc-ci tests/<dep>/ must exist + work)
```
The orchestrator (plan §4.2):
1. Reads `DEPS` BEFORE deploying the recipe under test.
2. Deploys each dep at a per-run domain `<dep[:4]>-<6hex>.ci.commoninternet.net` (the 6hex is
hashed from `parent_recipe + pr + ref + dep_recipe` so two recipes' deps of the same kind do
not collide on a single node).
3. Waits each dep healthy using its own `recipe_meta.py` (HEALTH_PATH/HEALTH_OK/timeouts).
4. Persists `[{"recipe": "<dep>", "domain": "<dep-domain>"}, ...]` to `$CCCI_DEPS_FILE`.
5. Deploys + tests the recipe under test as usual.
6. Tears down the dep LAST in `finally` (reverse declaration order, with `verify=True` — leaked
The orchestrator (plan §4.2; install-time provisioning is the ONLY mode):
1. Reads `DEPS` and provisions every dep **BEFORE the single deploy** of the recipe under test:
each dep at a per-run domain `<dep[:4]>-<6hex>.ci.commoninternet.net` (the 6hex is hashed from
`parent_recipe + pr + ref + dep_recipe` so two recipes' deps of the same kind do not collide on
a single node), waited healthy using the dep's own `recipe_meta.py`.
2. Persists the full per-dep identity + SSO creds dict to `$CCCI_DEPS_FILE` (jq-readable JSON,
`{"<dep>": {"domain": ..., "realm": ..., "client_secret": ..., ...}}`).
3. Deploys the recipe under test — its `install_steps.sh` reads `$CCCI_DEPS_FILE` and wires
OIDC env into that ONE deploy (no post-deploy redeploy). A dep-provisioning failure does NOT
block the run: the recipe deploys alone, generic tiers run, and `requires_deps` tests skip
with a counted reason (F2-11).
4. Tears down the dep LAST in `finally` (reverse declaration order, with `verify=True` — leaked
deps fail the run loudly per §9 teardown sacred / F2-5 fix).
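The per-run dep domain described in the list above can be sketched as follows. This is only an illustration: the hash function and the way the four inputs are joined are not stated here, so SHA-256 over a `|`-joined string is an assumption, and `dep_domain` is a hypothetical helper name rather than a harness function.

```python
import hashlib

CI_ZONE = "ci.commoninternet.net"


def dep_domain(parent_recipe: str, pr: str, ref: str, dep_recipe: str) -> str:
    """Return `<dep[:4]>-<6hex>.<zone>` for one dep of one run.

    ASSUMPTION: SHA-256 over the "|"-joined inputs; the real harness may
    hash or join the inputs differently.
    """
    seed = "|".join((parent_recipe, pr, ref, dep_recipe))
    six_hex = hashlib.sha256(seed.encode("utf-8")).hexdigest()[:6]
    return f"{dep_recipe[:4]}-{six_hex}.{CI_ZONE}"
```

Because the parent recipe is part of the hashed seed, two recipes that both declare `keycloak` get different `keyc-<6hex>` hosts on the same node, while a rerun of the same recipe, PR and ref maps to the same host.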
Tests access dep domains via the **`deps_apps` pytest fixture** (`tests/conftest.py`):
Tests access deps via the **`deps` pytest fixture** (`tests/conftest.py`) — entries expose
`.domain` plus the full creds dict (attribute or dict-style):
```python
def test_my_recipe_uses_keycloak(live_app, deps_apps):
    assert "keycloak" in deps_apps, f"keycloak dep not deployed; {deps_apps}"
    kc_domain = deps_apps["keycloak"]
@pytest.mark.requires_deps
def test_my_recipe_uses_keycloak(live_app, deps):
    assert "keycloak" in deps, f"keycloak dep not deployed; {deps}"
    kc_domain = deps["keycloak"].domain
```
@ -120,7 +128,7 @@ For OIDC-dependent recipes, the shared `runner/harness/sso.py` provides:
from harness import sso
creds = sso.setup_keycloak_realm(
    kc_domain, # = deps_apps["keycloak"]
    kc_domain, # = deps["keycloak"].domain
    realm="my-realm",
    client_id="my-client",
    redirect_uris=[f"https://{live_app}/*"],
@ -144,10 +152,10 @@ ARE provider-pluggable.
Not every recipe is a single HTTP app. `recipe_meta.py` + a few harness mechanisms cover the harder
shapes (proven on mumble, mailu, and the SSO-dependent suite):
- **`EXTRA_ENV`** — a dict **or** a `callable(domain) -> dict`. The callable form derives values from
the per-run domain (e.g. `MAIL_DOMAIN`/`HOSTNAMES` for mailu, `SANDBOX_DOMAIN` for cryptpad). Applied
at every deploy (`abra.env_set`), so a recipe enrolls with NO shared-harness change.
- **`READY_PROBE(domain) -> [...]`** — readiness signals beyond replica-convergence + the app's
- **`EXTRA_ENV`** — a dict **or** a `callable(ctx) -> dict`. The callable form derives values from
the per-run domain (`ctx.domain` — e.g. `MAIL_DOMAIN`/`HOSTNAMES` for mailu, `SANDBOX_DOMAIN` for
cryptpad). Applied at every deploy (`abra.env_set`), so a recipe enrolls with NO shared-harness change.
- **`READY_PROBE(ctx) -> [...]`** — readiness signals beyond replica-convergence + the app's
`HEALTH_PATH`. Two probe shapes:
- HTTP: `{"host": "...", "path": "/...", "ok": (200,)}` (e.g. lasuite-drive collabora WOPI discovery).
- **TCP**: `{"tcp_host": "127.0.0.1", "tcp_port": 64738, "stable": 3}` — polls a socket connect N
@ -155,16 +163,16 @@ shapes (proven on mumble, mailu, and the SSO-dependent suite):
service (mumble: the mumble-web sidecar serves HTTP 200 while the voice server on 64738 is still
rebinding after an upgrade redeploy — the TCP probe gates the backup tier until the voice server is
actually up). Runs after install AND after the upgrade chaos redeploy.
- **`CHAOS_BASE_DEPLOY = True`** — make the pinned base deploy use `--chaos` (skips abra's clean-tree +
lint gates, still deploys the explicitly-checked-out pinned version, NOT latest). Needed when an
`install_steps.sh` adds an UNTRACKED file to the recipe checkout (e.g. mumble copies a
`compose.host-ports.yml` into versions that predate it) — abra's pinned-deploy clean-tree check would
otherwise FATA. `abra.recipe_checkout` force-checks-out (`-f`) so the upgrade tier's re-checkout to
PR-head overwrites such overlays cleanly.
- **`compose.ccci.yml`** (first-class at `tests/<recipe>/compose.ccci.yml`) — a CI-only compose
overlay the harness itself copies into the recipe checkout before the base deploy, automatically
using `--chaos` for that deploy (the untracked file would otherwise trip abra's pinned-deploy
clean-tree check). Reference it from `EXTRA_ENV`'s `COMPOSE_FILE`. Minimal, justified fallback
only (e.g. ghost's 15m `start_period` grace). `abra.recipe_checkout` force-checks-out (`-f`) so
the upgrade tier's re-checkout to PR-head overwrites such overlays cleanly.
- **`install_steps.sh`** (auto-discovered at `tests/<recipe>/install_steps.sh`) — runs after
`abra app new` + EXTRA_ENV + secret-generate, BEFORE the single deploy, with `CCCI_APP_DOMAIN` /
`CCCI_APP_ENV` / `CCCI_RECIPE` (and `CCCI_DEPS_FILE` when DEPS are provisioned at install). Use it to
drop a cc-ci-owned compose overlay into the checkout, wire dep-derived env/secrets, etc.
`CCCI_APP_ENV` / `CCCI_RECIPE` (and `CCCI_DEPS_FILE` when the recipe declares DEPS — deps are
always provisioned before the deploy). Use it to wire dep-derived env/secrets, seed config, etc.
**Non-HTTP protocol tests (mumble).** Reach a TCP service published `mode: host` (via a host-ports
overlay) at `127.0.0.1:<port>` — cc-ci runs tests on-host (cc-ci-run). mumble ships a stdlib protocol
@ -227,9 +235,10 @@ RECIPE=<recipe> PR=<n> REF=<sha-or-branch> SRC=recipe-maintainers/<recipe> \
```
tests/lasuite-docs/
├── recipe_meta.py # HEALTH_PATH="/", DEPLOY_TIMEOUT=900, EXTRA_ENV(domain) for cold-pull,
├── recipe_meta.py # HEALTH_PATH="/", DEPLOY_TIMEOUT=900, EXTRA_ENV(ctx) for cold-pull,
│ # DEPS=["keycloak"] ← Phase 2 dep declaration
├── ops.py # pre_<op> seed hooks (volume marker for backup/restore data-integrity)
├── install_steps.sh # wires OIDC env from $CCCI_DEPS_FILE into the single deploy
├── ops.py # pre_<op>(ctx) seed hooks (volume marker for backup/restore data-integrity)
├── test_install.py # lifecycle install overlay (Playwright frontend SPA load)
├── test_upgrade.py # lifecycle upgrade overlay (marker survives chaos redeploy)
├── test_backup.py # lifecycle backup overlay (marker captured)
@ -239,12 +248,14 @@ tests/lasuite-docs/
├── test_health_check.py # parity port (SOURCE comment cites recipe-info file)
├── test_auth_required.py # specific: /api/v1.0/users/me/ → 401 without auth
└── test_oidc_with_keycloak.py # specific: full OIDC flow against the dep keycloak (uses
# harness.sso primitives + deps_apps["keycloak"])
# harness.sso primitives + the `deps` fixture)
```
`!testme` on a lasuite-docs PR drives the orchestrator to:
1. Deploy the per-run keycloak dep (`keyc-<6hex>.ci.commoninternet.net`) and wait healthy.
2. Deploy lasuite-docs (`lasu-<6hex>.ci.commoninternet.net`).
1. Provision the per-run keycloak dep (`keyc-<6hex>.ci.commoninternet.net`), wait healthy, write
creds to `$CCCI_DEPS_FILE` — BEFORE the recipe deploy.
2. Deploy lasuite-docs (`lasu-<6hex>.ci.commoninternet.net`); `install_steps.sh` wires the OIDC
env into that one deploy.
3. Run install / upgrade / backup / restore + the 3 functional tests against the shared
deployment (custom tier).
4. Teardown lasuite-docs, then the keycloak dep (LAST), both with verify=True.
@ -254,12 +265,13 @@ tests/lasuite-docs/
### Other shapes (concrete references)
- **TCP / voice recipe — `tests/mumble/`**: `recipe_meta.py` (EXTRA_ENV sets
`COMPOSE_FILE=compose.yml:compose.mumbleweb.yml:compose.host-ports.yml`, `WELCOME_TEXT`/`USERS`
markers, `CHAOS_BASE_DEPLOY=True`, `READY_PROBE` TCP 64738), `install_steps.sh` (provides the
host-ports overlay to older versions), `functional/_mumble_proto.py` + the protocol/config-round-trip
`COMPOSE_FILE=compose.yml:compose.mumbleweb.yml` for the base; `UPGRADE_EXTRA_ENV` adds the
native `compose.host-ports.yml` at PR-head so 64738 is host-published on latest; private
`_WELCOME_TEXT_MARKER`/`_MAX_USERS` constants; `READY_PROBE(ctx)` TCP 64738 — phase-aware via
the live COMPOSE_FILE), `functional/_mumble_proto.py` + the protocol/config-round-trip
tests, `ops.py`/`test_backup.py`/`test_restore.py` (sqlite P4). See §2.4.
- **Multi-service, dep-less, in-container functional — `tests/mailu/`**: `recipe_meta.py`
(`EXTRA_ENV(domain)` with `TLS_FLAVOR=notls` + `MAIL_DOMAIN`/`HOSTNAMES`/`TRAEFIK_STACK_NAME`),
(`EXTRA_ENV(ctx)` with `TLS_FLAVOR=notls` + `MAIL_DOMAIN`/`HOSTNAMES`/`TRAEFIK_STACK_NAME`),
`functional/_mailu.py` (flask-CLI helpers), `test_mailbox.py` (create→config-export read-back),
`test_mail_flow.py` (in-container sendmail→doveadm delivery). No backupbot → P4 N/A (PARITY.md +
DEFERRED.md). See §2.4.


@ -0,0 +1,360 @@
# Recipe customization — reference
Status: REFERENCE — describes the customization system as restructured on branch
`restructure/recipe-custom` (the "rcust" restructure). The pre-restructure system and its defects
are documented in this file's history (commit `76a4b6b`, the review spec whose §8 R1-R9 drove the
restructure); §8 below records how each was resolved.
Companion docs: `docs/testing.md` (test architecture / tier semantics), `docs/enroll-recipe.md`
(step-by-step enrollment). This doc is the **complete reference** for the two questions those docs
answer only partially:
1. How are custom tests written for a particular recipe?
2. What are ALL the per-recipe CI settings, where do they live, and who reads them?
---
## 1. The three customization surfaces
A recipe customizes its CI through **three distinct mechanisms**:
| Surface | Form | Examples |
|---|---|---|
| **Declarative settings** | Python assignments in `tests/<recipe>/recipe_meta.py` | `DEPLOY_TIMEOUT = 1500`, `UPGRADE_BASE_VERSION = "2.3.1+..."` |
| **Code hooks** | Callables in `recipe_meta.py`, `ops.py` functions, one shell hook | `def READY_PROBE(ctx): ...`, `pre_upgrade(ctx)`, `install_steps.sh` |
| **File presence** | A file existing at a discovered path changes behavior | `test_upgrade.py` overlay, `functional/test_*.py`, `compose.ccci.yml` |
There is additionally a fourth, **operator-facing, local-dev-only** surface: environment variables
(`CCCI_SKIP_GENERIC*`) that suppress the generic floor at run time (§7). Whatever a run resolves
from all four surfaces is printed at run start as the **customization manifest** and embedded in
`results.json` under `"customization"` (§7) — one block answers "what does this recipe customize?".
## 2. Zero-config baseline
A recipe with **no `tests/<recipe>/` directory at all** still gets the full generic floor:
- deploy base version → INSTALL (generic `assert_serving`: HTTP on `/`, expect 200/301/302)
- chaos-upgrade to PR head → UPGRADE (generic `assert_upgraded`: version label matches head, converged, serving)
- BACKUP (generic `assert_backup_artifact`) — iff the recipe's compose files carry
`backupbot.backup` labels (auto-detected), else N/A
- RESTORE (generic `assert_restore_healthy`)
- CUSTOM tier: empty (no custom tests discovered)
- teardown
Defaults: `HEALTH_PATH="/"`, `HEALTH_OK=(200,301,302)`, `DEPLOY_TIMEOUT=600`, `HTTP_TIMEOUT=300`.
Everything in this doc is opt-in deviation from that floor. The cardinal invariant
(docs/testing.md §1): the generic floor is **always on** and never depends on custom code;
custom is **additive** by default.
## 3. The per-recipe tree — every file that can exist
Two locations, with precedence and a security gate between them:
- **cc-ci-owned**: `tests/<recipe>/` in this repo (trusted, maintainer-reviewed)
- **repo-local**: the recipe repo's own `tests/` dir (PR-author-controlled → **default-deny**,
consulted only when the recipe is listed in `tests/repo-local-approved.txt` — gate HC2,
centralized in `runner/harness/discovery.py`)
```
tests/<recipe>/ # cc-ci side (repo-local mirrors the same shape)
├── recipe_meta.py # THE config file: registry-validated keys + ctx-hooks (§4)
├── test_<op>.py # lifecycle overlay assertions, op ∈ install|upgrade|backup|restore (§5.1)
├── ops.py # pre_<op>(ctx) seed hooks (§5.2)
├── functional/test_*.py # custom tier: parity ports + recipe-specific (§5.3)
├── playwright/test_*.py # custom tier: UI flows (§5.3)
├── install_steps.sh # pre-deploy shell hook (the ONLY shell hook) (§5.4)
├── compose.ccci.yml # CI-only compose overlay (first-class) (§5.5)
└── PARITY.md # enrollment contract doc (human-read only)
```
**Placement rule (custom tests):** ALL custom-tier tests live under `functional/` or
`playwright/`. A top-level `test_*.py` is a lifecycle overlay (`test_<op>.py`) and nothing else —
top-level non-lifecycle files are NOT discovered (`discovery.custom_tests`; the lifecycle-name
exclusion stays as a safety net so a misfiled `test_<op>.py` can never double-run).
Precedence (machine-docs/DECISIONS.md, implemented in `discovery.py`):
- lifecycle overlay `test_<op>.py`: repo-local **wins** over cc-ci (same-name collision); the
generic floor still runs additively alongside.
- custom tier (`functional/` + `playwright/`): **ALL** run, from both locations (no collision
concept).
- `install_steps.sh`: repo-local > cc-ci, or none.
- `ops.py` pre-op hook: cc-ci wins; repo-local consulted only if approved.
- `recipe_meta.py` and `compose.ccci.yml`: cc-ci only — repo-local recipes cannot set CI settings
or compose overlays (by design; those surfaces stay maintainer-controlled).
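The overlay and custom-tier precedence rules above can be sketched as a small resolver. This is a minimal illustration under stated assumptions, not the code in `discovery.py`: `pick_overlay` and `collect_custom` are hypothetical names, and the HC2 allowlist lookup is reduced to a boolean `approved` argument.

```python
from pathlib import Path
from typing import Optional

LIFECYCLE_NAMES = {f"test_{op}.py" for op in ("install", "upgrade", "backup", "restore")}


def pick_overlay(op: str, cc_ci_dir: Path, repo_local_dir: Optional[Path],
                 approved: bool) -> Optional[Path]:
    """Choose the single `test_<op>.py` overlay for one lifecycle op.

    Repo-local wins a same-name collision, but only when the recipe is approved.
    Returning None means only the generic floor runs for this op.
    """
    name = f"test_{op}.py"
    if approved and repo_local_dir is not None and (repo_local_dir / name).is_file():
        return repo_local_dir / name
    if (cc_ci_dir / name).is_file():
        return cc_ci_dir / name
    return None


def collect_custom(cc_ci_dir: Path, repo_local_dir: Optional[Path],
                   approved: bool) -> list:
    """Collect ALL custom-tier tests from both locations (no collision concept)."""
    roots = [cc_ci_dir]
    if approved and repo_local_dir is not None:
        roots.append(repo_local_dir)
    found = []
    for root in roots:
        for sub in ("functional", "playwright"):
            sub_dir = root / sub
            if not sub_dir.is_dir():
                continue
            for path in sorted(sub_dir.rglob("test_*.py")):
                # Safety net: a misfiled lifecycle overlay must never double-run.
                if path.name not in LIFECYCLE_NAMES:
                    found.append(path)
    return found
```

Note the asymmetry the sketch makes visible: overlays resolve to at most one file per op, whereas the custom tier is a union, so an unapproved repo-local tree contributes nothing to either.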
## 4. `recipe_meta.py` — complete settings reference
The single settings file. Plain Python, `exec()`d by the harness in exactly ONE place: the
registry-backed loader `runner/harness/meta.py::load(recipe) -> RecipeMeta`. Every consumer — the
orchestrator (which loads once and passes the object down), the pytest `meta` fixture, lifecycle,
deps, canonical, screenshot — reads from that one loaded object.
**Validation (hard errors at load, before any deploy):**
- A key is "set" by a top-level ALL-CAPS assignment or `def`. Unknown ALL-CAPS top-level names
raise `MetaError` listing the unknown name and the nearest registered key (typo gate —
misspelling `READY_PROBE` can no longer silently disable the probe).
- Type mismatches raise `MetaError`; callables are accepted only for hook-typed keys.
- **Underscore-prefixed names (`_FOO`) are recipe-private and exempt** — that's where private
constants live (e.g. mumble's `_WELCOME_TEXT_MARKER`). Lowercase names (helpers/imports) are
ignored.
- Hook callables must have the registered signature (below); a legacy-signature hook raises a
`MetaError` naming the migration, never a silent `TypeError` mid-run.
A unit test (`tests/unit/test_meta.py`) loads every `tests/*/recipe_meta.py` through the registry,
so a typo'd key fails at PR time, not at run time.
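The typo gate described above can be illustrated with a short sketch. It is a stand-in rather than the real loader: `REGISTERED_KEYS` lists only a subset of the registry, `check_names` is a hypothetical helper, and using `difflib` to find the nearest key is an assumption about how the suggestion is produced.

```python
import difflib

# Subset of the registry, for illustration only.
REGISTERED_KEYS = {
    "HEALTH_PATH", "HEALTH_OK", "DEPLOY_TIMEOUT", "HTTP_TIMEOUT",
    "READY_PROBE", "EXTRA_ENV", "DEPS",
}


class MetaError(Exception):
    """Stand-in for the harness's MetaError."""


def check_names(namespace: dict) -> None:
    """Reject unknown ALL-CAPS top-level names; ignore private and lowercase ones."""
    for name in namespace:
        if name.startswith("_") or not name.isupper():
            continue  # `_FOO` is recipe-private; lowercase names are helpers/imports
        if name in REGISTERED_KEYS:
            continue
        nearest = difflib.get_close_matches(name, sorted(REGISTERED_KEYS), n=1)
        hint = f" (did you mean {nearest[0]}?)" if nearest else ""
        raise MetaError(f"unknown recipe_meta key {name}{hint}")
```

With this shape, a misspelled `READY_PROB` fails at load with a pointer to `READY_PROBE` instead of silently leaving the probe unset.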
<!-- META-TABLE-START -->
_This table is GENERATED from the `runner/harness/meta.py` KEYS registry by `scripts/gen-meta-docs.py` — do not edit by hand (a unit test pins the sync)._
| Key | Type | Default | Meaning |
|---|---|---|---|
| `HEALTH_PATH` | `str` | `'/'` | Path probed for serving/health checks (deploy wait + generic `assert_serving`). |
| `HEALTH_OK` | `tuple[int]` | `(200, 301, 302)` | Acceptable HTTP status codes for health. |
| `DEPLOY_TIMEOUT` | `int` | `600` | Max seconds to wait for swarm convergence per deploy. |
| `HTTP_TIMEOUT` | `int` | `300` | Max seconds to wait for HTTP health after convergence. |
| `BACKUP_CAPABLE` | `bool` | `None` | Override the backup-tier capability auto-detect (compose `backupbot.backup` labels). `False` forces N/A; `True` forces the tier on; unset = auto-detect. |
| `EXPECTED_NA` | `dict` | `None` | Declare an N/A rung intentional: `{rung: reason}`. The cap stands either way; only the report wording changes. |
| `READY_PROBE` | `hook` | `None` | Callable `(ctx) -> [probe, ...]` returning extra readiness probes, run after install AND after upgrade: HTTP `{host, path, ok}` or TCP `{tcp_host, tcp_port, stable}`. |
| `UPGRADE_BASE_VERSION` | `str` | `None` | Exact published tag overriding the upgrade tier's base (default: `recipe_versions[-2]`). |
| `BACKUP_VERIFY` | `hook` | `None` | Callable `(ctx) -> bool` post-backup data-capture check; `False` re-runs the backup (truncated-dump race guard), retried up to 3 attempts. |
| `UPGRADE_EXTRA_ENV` | `dict_or_hook` | `None` | Extra `.env` keys applied after the PR-head checkout, before the chaos redeploy (env that exists only at head). Dict, or callable `(ctx) -> dict`. |
| `EXTRA_ENV` | `dict_or_hook` | `{}` | Extra `.env` keys applied at EVERY deploy (base install AND upgrade old-app). Dict, or callable `(ctx) -> dict` deriving values from the per-run domain (`ctx.domain`). |
| `DEPS` | `list[str]` | `[]` | Dep recipes deployed/provisioned alongside (e.g. `["keycloak"]`); creds land in `$CCCI_DEPS_FILE`. |
| `WARM_CANONICAL` | `bool` | `False` | Enroll the recipe in the warm/canonical app system (docs/warm.md): green cold runs on LATEST advance the canonical snapshot. |
| `SCREENSHOT` | `hook` | `None` | Callable `(page, ctx)` driving Playwright to a safe, credential-free post-login view for the results-card screenshot (default: landing page). |
<!-- META-TABLE-END -->
### 4.1 The uniform hook convention — `HookCtx`
Every recipe callable takes a single `ctx` argument (`harness/meta.py::HookCtx`, frozen):
| Field | Meaning |
|---|---|
| `ctx.domain` | the app's per-run domain |
| `ctx.base_url` | `https://<domain>` |
| `ctx.meta` | the recipe's full `RecipeMeta` |
| `ctx.deps` | provisioned dep creds (`{dep_recipe: entry}`) or `None` |
| `ctx.op` | current lifecycle op (`install`/`upgrade`/`backup`/`restore`) or `None` |
Signatures: `EXTRA_ENV(ctx)`, `UPGRADE_EXTRA_ENV(ctx)`, `READY_PROBE(ctx)`, `BACKUP_VERIFY(ctx)`,
`SCREENSHOT(page, ctx)`, ops.py `pre_<op>(ctx)`. Dict-valued `EXTRA_ENV`/`UPGRADE_EXTRA_ENV`
(non-callable) are still fine — only the callable form takes ctx. The loader enforces the
parameter names at load time (a pre-restructure `(domain)`/`(domain, meta)` hook gets a pointed
`MetaError`, not a mid-run crash).
Worked hook examples: cryptpad (`EXTRA_ENV(ctx)` derives `SANDBOX_DOMAIN` from `ctx.domain`),
mumble (`READY_PROBE(ctx)` TCP voice-port probe, `UPGRADE_EXTRA_ENV(ctx)` adds a head-only compose
overlay), ghost/discourse (`BACKUP_VERIFY(ctx)` dump-capture check).
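A minimal sketch of the convention, assuming a frozen dataclass with the fields listed in the table above. The class below is a stand-in for `harness/meta.py::HookCtx` (modelling `base_url` as a derived property is an assumption), and the `SANDBOX_DOMAIN` value is illustrative only since the real cryptpad derivation is not shown here.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class HookCtx:
    """Stand-in for the harness's HookCtx (same field names as the table)."""
    domain: str
    meta: Any = None
    deps: Optional[dict] = None
    op: Optional[str] = None

    @property
    def base_url(self) -> str:
        # ASSUMPTION: derived from the domain rather than stored separately.
        return f"https://{self.domain}"


# What a recipe_meta.py might contain: every hook takes the single `ctx`.
def EXTRA_ENV(ctx):
    # Illustrative derivation from the per-run domain.
    return {"SANDBOX_DOMAIN": f"sandbox-{ctx.domain}"}


def READY_PROBE(ctx):
    # TCP probe shape, using the mumble voice port mentioned in this doc.
    return [{"tcp_host": "127.0.0.1", "tcp_port": 64738, "stable": 3}]
```

Since every hook receives the same object, adding a field to the context later does not change any hook signature, which is the main benefit of the single-argument convention.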
## 5. Writing custom tests & hooks
### 5.1 Lifecycle overlay assertions — `test_<op>.py`
One pytest file per lifecycle op (`install` / `upgrade` / `backup` / `restore`). The
**orchestrator performs the op exactly once**; the overlay only *asserts* on the resulting state
(HC3 op/assertion split — overlays never deploy, never restore, never mutate). The generic floor
test runs additively against the same state.
Conventions (see `tests/immich/test_backup.py` etc.):
- use the `live_app` fixture (asserts `CCCI_APP_DOMAIN` is set, yields the domain)
- use the `meta` fixture — the recipe's FULL validated `RecipeMeta` (attribute access)
- use the `op_state` fixture for op context (versions, `snapshot_id`, artifact paths — the
orchestrator's run-scoped op record; skips with a clear reason outside an orchestrator run)
- execute in-container checks via `harness.lifecycle.exec_in_app(domain, service, cmd)`
### 5.2 Pre-op seed hooks — `ops.py`
`def pre_<op>(ctx)` callables, imported and called by the orchestrator **before** performing the
op. This is where data gets seeded so the post-op overlay can assert on it:
```python
# tests/immich/ops.py (pattern)
def pre_upgrade(ctx): _psql(ctx.domain, "INSERT ... 'upgrade-survives'")
def pre_backup(ctx): _psql(ctx.domain, "INSERT ... 'original'")
def pre_restore(ctx): _psql(ctx.domain, "DROP TABLE ci_marker") # damage, restore must undo
```
Seed → op → assert is the whole pattern: `pre_backup` writes a marker, the orchestrator backs up,
`pre_restore` destroys it, the orchestrator restores, `test_restore.py` asserts the marker is back.
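The ordering can be made concrete with a toy model in which the "app" is just a dict. This is a sketch only: the hooks here take the dict instead of a `ctx`, and `run_backup_restore` / `assert_restored` are hypothetical stand-ins for the orchestrator and a `test_restore.py` overlay.

```python
def pre_backup(app: dict) -> None:
    app["ci_marker"] = "original"      # seed the marker before the backup


def pre_restore(app: dict) -> None:
    app.pop("ci_marker", None)         # damage the live state; restore must undo it


def run_backup_restore(app: dict) -> dict:
    """Toy orchestrator: seed, back up, damage, restore (in that order)."""
    pre_backup(app)
    snapshot = dict(app)               # the orchestrator performs the backup
    pre_restore(app)
    assert "ci_marker" not in app      # the damage really happened
    app.clear()
    app.update(snapshot)               # the orchestrator performs the restore
    return app


def assert_restored(app: dict) -> None:
    """What the restore overlay asserts: the seeded marker is back."""
    assert app.get("ci_marker") == "original"
```

The deliberate damage step is what gives the assertion meaning: without it, a restore that did nothing would still pass.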
### 5.3 Custom tier — `functional/` and `playwright/` ONLY
All custom-tier tests live under `tests/<recipe>/functional/` or `tests/<recipe>/playwright/`
(discovery: `discovery.custom_tests`; the placement rule, §3). Run in the CUSTOM tier, after
restore, against the post-upgrade (PR-head) app. ALL discovered files run — cc-ci's and (if
HC2-approved) repo-local's, additively.
Enrollment contract (`docs/enroll-recipe.md`): ≥2 NEW functional tests beyond ports of existing
upstream checks; ported tests carry `SOURCE:` comments. Playwright tests get the shared
browser/harness helpers (`harness.browser`); SSO recipes get `harness.sso`
(`setup_keycloak_realm` — idempotent, `oidc_password_grant` — provider-pluggable). The documented
import toolbox for custom tests is `from harness import lifecycle, sso, browser`.
Tests needing deps use the `deps` fixture (entries expose `.domain` plus the full creds dict) and
carry `@pytest.mark.requires_deps` — when dep provisioning failed they skip with reason
`deps-not-ready` and the skip count is reported and FAILS a declared-deps run (F2-11; a green exit
must not mask an unrun SSO test). Fixtures replace direct `os.environ` reads — after the
restructure no recipe test parses env by hand.
### 5.4 Pre-deploy shell hook — `install_steps.sh`
The ONLY shell hook. Runs after `abra app new` + `EXTRA_ENV` application + secret generation,
**before** the single base deploy. For setup that must precede the first deploy: writing extra
config files into the recipe checkout, editing `.env` beyond simple key=val, and — for recipes
with `DEPS` — wiring dep-derived OIDC env into the deploy (deps are always provisioned BEFORE the
deploy; install-time wiring is the only mode, so there is exactly one deploy and no post-deploy
redeploy hook).
Env contract: `CCCI_APP_DOMAIN`, `CCCI_RECIPE`, `CCCI_APP_ENV` (path to the app's `.env`), and —
when `DEPS` is declared — `CCCI_DEPS_FILE` (jq-readable JSON of dep creds/URLs; see
lasuite-drive/-meet/-docs for the pattern). Must locate the recipe checkout ABRA_DIR-aware:
`RECIPE_DIR="${ABRA_DIR:-${HOME}/.abra}/recipes/${CCCI_RECIPE}"` (per-run `ABRA_DIR` since the
concurrency restructure — a hardcoded `~/.abra` writes to the wrong tree).
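The hook itself is a shell script, but the same resolution logic can be sketched in Python for clarity. Both helpers are hypothetical, and treating an empty `ABRA_DIR` like an unset one mirrors the `${ABRA_DIR:-...}` expansion above.

```python
import json
import os
from pathlib import Path
from typing import Mapping


def recipe_dir(env: Mapping[str, str]) -> Path:
    """Locate the recipe checkout, honoring a per-run ABRA_DIR when set."""
    abra_dir = env.get("ABRA_DIR") or os.path.join(env["HOME"], ".abra")
    return Path(abra_dir) / "recipes" / env["CCCI_RECIPE"]


def load_deps(env: Mapping[str, str]) -> dict:
    """Read the dep creds JSON; an absent CCCI_DEPS_FILE means no DEPS declared."""
    path = env.get("CCCI_DEPS_FILE")
    if not path:
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Called as `recipe_dir(os.environ)` inside a run, this resolves to the per-run tree; hardcoding `~/.abra` would skip the first branch and write to the wrong checkout.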
Graceful-generic rule: a recipe needing a hook but not shipping one simply fails the generic
install — a correct reported outcome, not a harness error.
### 5.5 CI-only compose overlay — `compose.ccci.yml`
**First-class:** if `tests/<recipe>/compose.ccci.yml` exists, the harness itself copies it into
the recipe checkout (ABRA_DIR-aware) before the base deploy and automatically uses `--chaos` for
that deploy (the untracked file would otherwise trip abra's clean-tree gate). No
`install_steps.sh` copy boilerplate, no flag to remember (the old `CHAOS_BASE_DEPLOY` ⇄ overlay
coupling is gone). The overlay is cc-ci-owned only.
Policy unchanged: overlays are a minimal, justified fallback (ghost's is a 15m `start_period`
grace — a literal, because abra validates `start_period` before env substitution). Reference the
overlay from `EXTRA_ENV`'s `COMPOSE_FILE` as usual. Users: ghost, discourse.
### 5.6 Environment & fixture contract (what custom code can read)
Pytest fixtures (`tests/conftest.py` — the single fixture file):
| Fixture | Yields |
|---|---|
| `recipe` | the recipe name (`$RECIPE`) |
| `meta` | the FULL validated `RecipeMeta` (single loader) |
| `live_app` | the shared deployment's domain (asserts it exists) |
| `op_state` | the orchestrator's op-context dict (skips cleanly outside a run) |
| `deps` | `{dep_recipe: entry}` — entries expose `.domain` + full SSO creds |
Environment (hooks/shell, and approved repo-local code):
| Var | Set for | Meaning |
|---|---|---|
| `CCCI_APP_DOMAIN` | all tests + hooks | the app's per-run domain |
| `CCCI_BASE_URL` | approved repo-local code | `https://<domain>` |
| `CCCI_RECIPE`, `CCCI_APP_ENV` | `install_steps.sh` | recipe name, app `.env` path |
| `CCCI_OP_STATE_FILE` | overlay tests (via `op_state`) | JSON op context (versions, artifacts) |
| `CCCI_DEPS_FILE` | `install_steps.sh` + harness | JSON dep creds dict |
| `CCCI_DEPS_READY` / `CCCI_DEPS_NOT_READY_REASON` | custom tier (via `requires_deps`) | gate SSO tests, skip-with-reason |
## 6. Run-model context (what the settings plug into)
One deploy chain per run (full detail: `docs/testing.md` §2):
```
[DEPS? provision deps FIRST → $CCCI_DEPS_FILE]
deploy BASE (UPGRADE_BASE_VERSION or recipe_versions[-2]; EXTRA_ENV; install_steps.sh;
compose.ccci.yml auto-copied + auto-chaos)
→ INSTALL tier (READY_PROBE; generic + overlay asserts)
→ pre_upgrade(ctx) → chaos-deploy PR HEAD (UPGRADE_EXTRA_ENV)
→ UPGRADE tier (READY_PROBE; version-label == head_ref)
→ pre_backup(ctx) → backup (BACKUP_CAPABLE; BACKUP_VERIFY)
→ BACKUP tier
→ pre_restore(ctx) → restore
→ RESTORE tier
→ CUSTOM tier (functional/ + playwright/; deps via the `deps` fixture)
→ SCREENSHOT (best-effort, never affects the verdict)
→ teardown (deps LAST)
```
Deploy-count guard (DG4.1): exactly `1 + len(DEPS)` deploys per run (chaos redeploys don't
count); the per-run counter file is keyed by run since the concurrency restructure.
## 7. Local iteration, the manifest, and the dev-only escape hatch
```
RECIPE=<recipe> PR=<n> REF=<sha> SRC=recipe-maintainers/<recipe> \
STAGES=install,upgrade,backup,restore,custom \
cc-ci-run runner/run_recipe_ci.py
```
(`docs/enroll-recipe.md` §5 for the full loop, including dep teardown caveats.)
**Customization manifest.** Every run prints, right after meta load + discovery, one block:
```
===== customization manifest: <recipe> =====
meta (non-default): DEPLOY_TIMEOUT=1500 DEPS=['keycloak'] EXTRA_ENV='<hook>'
hooks: ops.py[pre_backup,pre_upgrade](cc-ci) install_steps.sh(cc-ci) compose.ccci.yml(cc-ci)
overlays: test_backup.py(cc-ci) test_restore.py(repo-local)
custom tests: functional/=5 playwright/=2 (cc-ci)
env overrides: (none)
```
The same dict is embedded in `results.json` under `"customization"`. It is pure presentation —
built from the SAME discovery/meta calls the run uses (so it cannot disagree with what executes,
and it honors the HC2 gate) — and never influences a verdict.
**Dev-only generic skip.** `CCCI_SKIP_GENERIC=1` (all ops) / `CCCI_SKIP_GENERIC_<OP>=1` (one op)
suppress the generic floor — a LOCAL-DEV-ONLY escape hatch for iterating on one tier. There is no
declarative equivalent (the old `SKIP_GENERIC` meta key is deleted). If the env form is active in
a CI (drone) run, the run prints a loud `!!` warning and the manifest records it.
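The env-only skip can be sketched as a pure function over the environment. This is an illustration, not the body of `_skip_generic`: the helper names are hypothetical, and treating only the literal value `1` as "on" is an assumption.

```python
from typing import Mapping


def skip_generic(op: str, env: Mapping[str, str]) -> bool:
    """True when the generic floor is suppressed for `op` (local-dev only)."""
    return (
        env.get("CCCI_SKIP_GENERIC") == "1"
        or env.get(f"CCCI_SKIP_GENERIC_{op.upper()}") == "1"
    )


def active_overrides(env: Mapping[str, str]) -> list:
    """Names of the active skip variables, e.g. for the manifest's env-overrides line."""
    return sorted(
        key for key, value in env.items()
        if key.startswith("CCCI_SKIP_GENERIC") and value == "1"
    )
```

Keeping the decision a function of the environment alone is what makes it impossible to commit the skip into a recipe's settings.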
## 8. Restructure outcomes (the review spec's R1-R9)
How each defect identified in the review spec (commit `76a4b6b` §8) was resolved:
- **R1 — six divergent meta loaders → RESOLVED.** One registry-backed loader
(`harness/meta.py::load`), the only `exec()` of `recipe_meta.py`. The orchestrator loads once
and passes the `RecipeMeta` down; conftest/lifecycle/deps/canonical all read the one object.
- **R2 — dead `SCREENSHOT` knob → RESOLVED (kept + fixed).** The registry replaced the allowlist
that orphaned it; the orchestrator path now delivers the hook to `screenshot.py`
(proven end-to-end by `tests/unit/test_screenshot.py::test_screenshot_reachable_through_real_load_path`).
- **R3 — 4-key pytest `meta` fixture → RESOLVED.** The fixture returns the full validated
`RecipeMeta`.
- **R4 — three config languages → MITIGATED by the manifest** (§7): the surfaces stay (they serve
different actors), but every run resolves them into one visible block + results key.
- **R5 — reference-doc drift → RESOLVED.** §4's key table is generated from the registry
(`scripts/gen-meta-docs.py`); a unit test fails CI on drift; `testing.md`/`enroll-recipe.md`
point here instead of keeping partial lists.
- **R6 — silent typos → RESOLVED.** Unknown ALL-CAPS keys and type mismatches are hard
`MetaError`s; private constants are underscore-prefixed (exempt).
- **R7 — `compose.ccci.yml` ⇄ `CHAOS_BASE_DEPLOY` coupling → RESOLVED.** The overlay is
first-class: harness-copied, auto-chaos. The flag is deleted.
- **R8 — zero-user `SKIP_GENERIC` meta key → RESOLVED (deleted).** Env form remains, documented
dev-only, loudly flagged in CI runs (§7).
- **R9 — `recipe_meta.py` is code, not config → REJECTED by decision.** No data/hooks file split:
registry validation gets the value (typed, validated keys) at lower cost; one file per recipe
remains the single config place. The expressiveness need is real (cryptpad derives env from the
per-run domain).
Also settled in the restructure: install-time deps provisioning is the ONLY mode (the legacy
post-deploy `setup_custom_tests.sh` machinery and its extra redeploy are deleted); the custom-test
placement rule (§3); the uniform ctx hook convention (§4.1); the consolidated fixture surface
(§5.6 — `deps` replaces `deps_apps`+`deps_creds`; dead `deployed`/`deployed_app`/`app_domain`
fixtures deleted).
## 9. File / symbol index
| Concern | Where |
|---|---|
| THE meta loader + key registry + `HookCtx` + `MetaError` | `runner/harness/meta.py` (`load`, `KEYS`, `check_hook_signature`) |
| Generated key table | `scripts/gen-meta-docs.py` → §4 above (sync pinned by `tests/unit/test_meta.py`) |
| Customization manifest | `runner/harness/manifest.py` (`build`, `render`), printed by `runner/run_recipe_ci.py` |
| Overlay/custom/hook discovery + HC2 gate + placement rule | `runner/harness/discovery.py` |
| HC2 allowlist | `tests/repo-local-approved.txt` |
| Generic assertions + `BACKUP_CAPABLE` detect | `runner/harness/generic.py` |
| `compose.ccci.yml` auto-copy + auto-chaos | `runner/harness/lifecycle.py` (`provide_ccci_overlay`, `deploy_app`) |
| `READY_PROBE` consumption | `runner/harness/lifecycle.py` (`wait_ready_probes`) |
| `EXPECTED_NA` reporting | `runner/harness/results.py` |
| `SCREENSHOT` consumer | `runner/harness/screenshot.py` |
| Fixtures (`recipe`/`meta`/`live_app`/`op_state`/`deps`) + F2-11 skip-report | `tests/conftest.py` |
| Skip-generic env logic (dev-only) | `runner/run_recipe_ci.py` (`_skip_generic`) |
| Unit tests pinning all of the above | `tests/unit/test_meta.py`, `test_manifest.py`, `test_discovery*.py` |
| Worked examples | `tests/ghost/` (overlay+compose.ccci.yml), `tests/mumble/` (TCP probe, UPGRADE_EXTRA_ENV, private `_` constants), `tests/lasuite-drive/` (DEPS + install-time OIDC wiring), `tests/immich/` (ops.py seed pattern) |


@ -16,12 +16,13 @@ year from now, this is the one rule that should still hold.
ship as the floor for every recipe. No SSO provider, no external deps, no per-recipe state
scaffolding — just "does this recipe deploy and lifecycle work?"
- **Generic must not depend on custom.** A custom test or a custom-tests setup (e.g. SSO/OIDC dep
provisioning) **can never be a precondition for the generic tier to pass.** Concretely: the
orchestrator runs all generic tiers (install → upgrade → backup → restore) against the recipe
**alone, with no deps deployed**, then runs the `setup_custom_tests` step (deps + post-deps
wiring) only after — and a failure there is **isolated** to the custom tier (tests tagged
`@pytest.mark.requires_deps` skip with reason `"deps-not-ready"`; generic tier reports
normally). See `cc-ci-plan/plan-sso-dep-testing.md` for the SSO-dep specifics.
provisioning) **can never be a precondition for the generic tier to pass.** Concretely: deps are
provisioned BEFORE the single deploy (so `install_steps.sh` can wire OIDC env into that one
deploy), but a dep-provisioning failure is **isolated** to the custom tier — the recipe still
deploys alone, every generic tier (install → upgrade → backup → restore) runs normally, and
tests tagged `@pytest.mark.requires_deps` skip with reason `"deps-not-ready"` (a counted,
reported skip — F2-11). A deps failure can never fail or block a generic tier. See
`cc-ci-plan/plan-sso-dep-testing.md` for the SSO-dep specifics.
- **Custom tests are the thoroughness layer — and they cost more to maintain.** They're more
thorough (authenticated APIs, multi-app flows, version-specific browser selectors, helper
scripts, state-management) and *therefore* take more maintenance: an SSO provider's admin API
@ -113,9 +114,11 @@ repo-local <recipe-repo>/tests/test_<op>.py (upstream-authoritative; gated
Only ONE overlay source wins for a given op (repo-local > cc-ci); the generic floor runs **in
addition** unless explicitly opted out.
**Custom (non-lifecycle) `test_*.py`** — any other `test_*.py` (e.g. `test_sso.py`) is **opt-in and
additive**: it has no generic equivalent and runs only when present, discovered from both locations
(repo-local gated by the HC2 allowlist).
**Custom (non-lifecycle) tests** — e.g. `functional/test_sso.py` — are **opt-in and additive**:
they have no generic equivalent and run only when present, discovered from both locations
(repo-local gated by the HC2 allowlist). Placement rule: custom tests live ONLY under
`functional/` or `playwright/`; a top-level `test_*.py` is a lifecycle overlay and nothing else
(top-level non-lifecycle files are not discovered).
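The placement rule can be read as a small classifier over paths relative to `tests/<recipe>/`. The sketch below is illustrative only: `classify` is a hypothetical helper, and the four-op tuple is assumed from the tiers named earlier in this document (install, upgrade, backup, restore).

```python
from pathlib import PurePosixPath

LIFECYCLE_OPS = ("install", "upgrade", "backup", "restore")  # assumed op names
CUSTOM_DIRS = ("functional", "playwright")


def classify(rel_path: str) -> str:
    """Classify a path relative to tests/<recipe>/ under the placement rule."""
    p = PurePosixPath(rel_path)
    name = p.name
    if not (name.startswith("test_") and name.endswith(".py")):
        return "ignored"
    lifecycle_names = {f"test_{op}.py" for op in LIFECYCLE_OPS}
    if len(p.parts) == 1:
        # Top level: a lifecycle overlay and nothing else.
        return "lifecycle-overlay" if name in lifecycle_names else "ignored"
    if len(p.parts) == 2 and p.parts[0] in CUSTOM_DIRS and name not in lifecycle_names:
        return "custom"
    return "ignored"
```

With this reading, `functional/test_sso.py` is a custom test, `test_upgrade.py` is a lifecycle overlay, and a top-level `test_sso.py` is not discovered at all.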
### Pre-op seed hooks (per-recipe `ops.py`)
@ -127,35 +130,38 @@ etc.). Since the orchestrator owns the op, overlays place their seed in an optio
# tests/<recipe>/ops.py
from harness import lifecycle
def pre_upgrade(domain, meta):
def pre_upgrade(ctx):
# seed a marker before the harness performs the upgrade
lifecycle.exec_in_app(domain, ["sh", "-c", "echo upgrade-survives > /path/marker"])
lifecycle.exec_in_app(ctx.domain, ["sh", "-c", "echo upgrade-survives > /path/marker"])
def pre_backup(domain, meta):
def pre_backup(ctx):
# establish a known "original" state before the backup op captures it
lifecycle.exec_in_app(domain, ["sh", "-c", "echo original > /path/marker"])
lifecycle.exec_in_app(ctx.domain, ["sh", "-c", "echo original > /path/marker"])
def pre_restore(domain, meta):
def pre_restore(ctx):
# diverge from the backed-up state so a successful restore is observable
lifecycle.exec_in_app(domain, ["sh", "-c", "echo mutated > /path/marker"])
lifecycle.exec_in_app(ctx.domain, ["sh", "-c", "echo mutated > /path/marker"])
```
The orchestrator imports `ops.py` in-process (with the recipe dir on `sys.path`, so it can import
sibling helpers like `kc_admin.py`) and calls `pre_<op>(domain, meta)` immediately before performing
the op. Then `test_<op>.py` asserts the post-op state. See `tests/custom-html/` (volume marker),
sibling helpers like `kc_admin.py`) and calls `pre_<op>(ctx)` immediately before performing the
op — `ctx` is the uniform `HookCtx` every recipe hook receives (`.domain`, `.base_url`, `.meta`,
`.deps`, `.op`; see `docs/recipe-customization.md` §4.1). Then `test_<op>.py` asserts the post-op
state. See `tests/custom-html/` (volume marker),
`tests/keycloak/` (admin-API/realm), `tests/matrix-synapse/`, `tests/lasuite-docs/` (psql in the `db`
service) for worked examples.
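As a rough picture of the hook convention, the sketch below models the context object and the dispatch step. The attribute names are the ones listed above; the dataclass layout, the defaults, and the `call_pre_op` helper are assumptions made for illustration, not the harness's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class HookCtx:
    """Hypothetical shape; only the attribute names come from the text above."""

    domain: str
    base_url: str
    meta: object = None
    deps: dict = field(default_factory=dict)
    op: Optional[str] = None


def call_pre_op(module, ctx: HookCtx) -> bool:
    """Call module.pre_<op>(ctx) if the recipe's ops module defines it."""
    hook = getattr(module, f"pre_{ctx.op}", None)
    if not callable(hook):
        return False  # the hook is optional; its absence is not an error
    hook(ctx)
    return True
```

A single context argument means a hook that only needs `ctx.domain` and one that also needs `ctx.deps` share the same signature, so adding a field later does not change every recipe's `ops.py`.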
### Opting out of the generic floor
### Opting out of the generic floor (LOCAL-DEV-ONLY)
The generic runs additively by default. To skip it (e.g. when an overlay's recipe-specific check
fully replaces the generic's mechanism check) set, in increasing specificity:
The generic runs additively by default and there is **no declarative opt-out** — no recipe can
ship without the floor. For local iteration only (e.g. re-running one tier while developing an
overlay), two env escape hatches exist:
- **env `CCCI_SKIP_GENERIC=1`** — skip generic for ALL ops (run-wide).
- **env `CCCI_SKIP_GENERIC_<OP>=1`** — e.g. `CCCI_SKIP_GENERIC_UPGRADE=1` — skip generic for that one op.
- **declarative in `recipe_meta.py`** — `SKIP_GENERIC = ["upgrade"]` (per-op) or `SKIP_GENERIC = ["all"]`.
Opting out is per-recipe and visible in git — not a hidden global. Truthy = `1`/`true`/`yes`/`on`.
Truthy = `1`/`true`/`yes`/`on`. If either is active in a CI (drone) run, the run prints a loud
`!!` warning and the customization manifest records it (`docs/recipe-customization.md` §7).
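The two escape hatches reduce to a small predicate. This is a sketch rather than the `_skip_generic` function in `runner/run_recipe_ci.py`: the function name and the injectable `environ` parameter are hypothetical, and case-insensitive matching of the truthy strings is an assumption.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}


def skip_generic(op: str, environ=None) -> bool:
    """True if the run-wide or the per-op dev-only env switch is truthy."""
    env = os.environ if environ is None else environ

    def truthy(name: str) -> bool:
        # Assumption: values are compared case-insensitively after stripping.
        return env.get(name, "").strip().lower() in TRUTHY

    return truthy("CCCI_SKIP_GENERIC") or truthy(f"CCCI_SKIP_GENERIC_{op.upper()}")
```

For example, `skip_generic("upgrade", {"CCCI_SKIP_GENERIC_UPGRADE": "1"})` is true, while the same environment leaves `skip_generic("backup", ...)` false.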
## Repo-local trust gate (HC2) — default-deny
@ -215,12 +221,14 @@ installs and stays 1.
`tests/custom-html/test_upgrade.py`). Assert the POST-op state — reading app state through
`lifecycle.exec_in_app` (volume/DB) for data checks, not HTTP. Generic + your overlay both run.
3. If the overlay needs to seed PRE-op state (data-continuity markers, the backup→restore
divergence), drop `tests/<recipe>/ops.py` with `pre_upgrade/pre_backup/pre_restore(domain, meta)`.
divergence), drop `tests/<recipe>/ops.py` with `pre_upgrade/pre_backup/pre_restore(ctx)`.
4. If the recipe needs install-time setup, add `tests/<recipe>/install_steps.sh`.
5. Set per-recipe knobs (health path, timeouts, opt-out) in `recipe_meta.py`.
5. Set per-recipe knobs (health path, timeouts) in `recipe_meta.py`.
6. **Never weaken or skip an assertion to make a run pass** — a red tier is information.
Per-recipe config (`tests/<recipe>/recipe_meta.py`, all optional):
Per-recipe config (`tests/<recipe>/recipe_meta.py`, all optional — the COMPLETE key reference is
the generated table in `docs/recipe-customization.md` §4; unknown keys are hard errors, private
constants are underscore-prefixed):
```python
HEALTH_PATH = "/realms/master" # path that returns a healthy status (default "/")
@ -228,8 +236,7 @@ HEALTH_OK = (200,) # acceptable status codes (default 200/301/302)
DEPLOY_TIMEOUT = 600 # seconds for services to converge (default 600)
HTTP_TIMEOUT = 600 # seconds for the app to answer (default 300)
BACKUP_CAPABLE = True # override backup-capability auto-detection (default: scan compose)
EXTRA_ENV = {"KEY": "value"} # or EXTRA_ENV(domain) -> dict; extra .env keys set at deploy
SKIP_GENERIC = ["upgrade"] # per-recipe declarative opt-out from generic ops ("all" = every op)
EXTRA_ENV = {"KEY": "value"} # or EXTRA_ENV(ctx) -> dict; extra .env keys set at deploy
```
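`EXTRA_ENV` accepts either a plain dict or a callable that receives the hook context. A minimal sketch of how such a value might be resolved follows; `resolve_extra_env` is a hypothetical name, and the `sandbox.` prefix in the example is invented for illustration rather than taken from any recipe.

```python
from types import SimpleNamespace


def resolve_extra_env(extra_env, ctx) -> dict:
    """Resolve a dict-or-callable EXTRA_ENV into string key/value pairs."""
    if callable(extra_env):
        extra_env = extra_env(ctx)  # callable form derives values from the run
    return {str(k): str(v) for k, v in (extra_env or {}).items()}


# Stand-in for the real context object; only `.domain` is used here.
ctx = SimpleNamespace(domain="app.example.test")

static = resolve_extra_env({"TIMEOUT": 600}, ctx)
derived = resolve_extra_env(lambda c: {"SANDBOX_DOMAIN": "sandbox." + c.domain}, ctx)
```

The callable form matters for values that depend on the per-run domain, which a static dict cannot express.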
The harness self-tests for discovery / precedence / the HC2 allowlist live in `tests/unit/` (run:


@ -1283,3 +1283,15 @@ the commit), which is the correct SCM integration.
environment; job is session-persistent (survives as long as Builder session runs). T0-refire
verified: CronCreate test fire at 23:17Z → upgrader started, upgrader-cron.log created, status
RUNNING. (2026-06-01)
## conc P3 (2026-06-10, Builder): install_steps.sh hooks resolve $ABRA_DIR — guardrail note
P3 makes recipe working trees per-run ($ABRA_DIR/recipes). tests/{ghost,discourse}/install_steps.sh
hard-coded `${HOME}/.abra/recipes/...` to copy their compose.ccci.yml overlay into the deploy tree;
under per-run trees that path is the WRONG (canonical) tree, so the overlay would silently miss the
deploy and both recipes' upgrade-tier base deploys would break. Fixed with ONE mechanical line per
hook: `RECIPE_DIR="${ABRA_DIR:-${HOME}/.abra}/recipes/${CCCI_RECIPE}"` (identical resolution rule to
the abra CLI and abra.recipe_dir()). No test assertion, gate, or overlay content was touched — the
phase guardrail's "never touch tests/<recipe>/ content" is read as protecting test/gate SEMANTICS;
this is required P3 fallout, equivalent to the harness-side path routing. Flagged here for the
Adversary's gate-integrity review.
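For reference, the shell default expansion in that one-line fix can be restated in Python. This is an illustrative equivalent, not the body of `abra.recipe_dir()`; the function name and the `environ` parameter are hypothetical.

```python
import os


def recipe_dir(recipe: str, environ=None) -> str:
    """Python restatement of ${ABRA_DIR:-${HOME}/.abra}/recipes/<recipe>."""
    env = os.environ if environ is None else environ
    # `${VAR:-default}` falls back when VAR is unset OR empty, hence `or`.
    base = env.get("ABRA_DIR") or os.path.join(env.get("HOME", ""), ".abra")
    return os.path.join(base, "recipes", recipe)
```

With `ABRA_DIR` pointing at a per-run directory the result lands in that run's tree; with it unset or empty the canonical `~/.abra` tree is used, which is the pre-P3 behaviour.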


@ -335,3 +335,15 @@ before the build is called done) — but does **not** force closure.
- **Re-entry trigger:** Builder authors recipe-PR Q4.7b (cache tarball on a volume / wget
retry+backoff / drop `2>/dev/null` / `set +e` w/ fallback), then runs plausible-full green + claims.
- **Linked:** REVIEW-2 `e850281` (root-cause + DENY), `71af595` (§4.3 floor); DECISIONS 2026-05-30.
- discourse upgrade-HC1 @7ae7b0f stamps prev-base tag commit (eb96de94+U) on BOTH old+new harness since ~06-10 (baseline 184 was L4 on 06-05); harness-neutral (rcust exonerated, M2-closed) but abra stamp-resolution mechanism UNATTRIBUTED — worth a standalone dig outside rcust. Evidence: /var/lib/cc-ci-runs/{m2p-discourse,ab-discourse-7ae7b0f-oldmain}, JOURNAL-rcust 2026-06-11.
- bluesky-pds: UPSTREAM IMAGE BREAKAGE (non-rcust, M2-justified exclusion from baseline match).
The app container crash-loops `Error: Cannot find module '/app/index.js'` (MODULE_NOT_FOUND,
Node v24.15.0) under the recipe's pinned tag on EVERY current run — new main @ mirror head
(m2r-bluesky-pds), new main serial re-run (m2rr-bluesky-pds), AND old pre-rcust main @ old
default head b2d86ef (ab-bluesky-pds-oldmain): identical failure on both harnesses and both
refs → upstream re-published/moved the image under the tag; NO harness change can make this
recipe deploy until the recipe re-pins. Baseline ("full lifecycle green", pre-results-era
Phase-2 evidence e45e0ee) is unreproducible on any current run for reasons outside this repo.
Evidence: `grep -r MODULE_NOT_FOUND /var/lib/cc-ci-runs/{m2r,m2rr,ab}-bluesky-pds*/abra/logs/
default/`; REVIEW-rcust.md 2026-06-11 entries. Follow-up (post-phase): file/propose a re-pin PR
against the bluesky-pds recipe mirror.


@ -30,17 +30,13 @@ import subprocess
import time
from . import abra, warm, warmsnap
from . import meta as meta_mod
def is_enrolled(recipe: str) -> bool:
"""True if `tests/<recipe>/recipe_meta.py` sets `WARM_CANONICAL = True`. Missing meta → False."""
path = os.path.join(os.path.dirname(__file__), "..", "..", "tests", recipe, "recipe_meta.py")
if not os.path.exists(path):
return False
ns: dict = {}
with open(path) as fh:
exec(compile(fh.read(), path, "exec"), ns) # noqa: S102 (trusted, in-repo)
return bool(ns.get("WARM_CANONICAL"))
"""True if `tests/<recipe>/recipe_meta.py` sets `WARM_CANONICAL = True`. Missing meta → False.
Reads through the single meta loader (rcust P1 — no per-module exec)."""
return bool(meta_mod.load(recipe).WARM_CANONICAL)
def canonical_domain(recipe: str) -> str:
@ -51,7 +47,7 @@ def canonical_domain(recipe: str) -> str:
def enrolled_recipes() -> list[str]:
"""All recipes enrolled as data-warm canonicals (recipe_meta.WARM_CANONICAL=True), sorted. Used
by the WC6 nightly sweep to know which canonicals to refresh via a green cold run on latest."""
tests_dir = os.path.join(os.path.dirname(__file__), "..", "..", "tests")
tests_dir = meta_mod.TESTS_DIR
out = []
try:
for name in sorted(os.listdir(tests_dir)):


@ -20,7 +20,7 @@ Per Phase-2 DECISIONS:
Run state:
- `$CCCI_DEPS_FILE` — JSON file written by the orchestrator after each dep deploys; each entry is
`{"recipe": "<dep-recipe>", "domain": "<dep-domain>", "version": null}`. Tests access via the
`deps_apps` pytest fixture defined in `tests/conftest.py`.
`deps` pytest fixture defined in `tests/conftest.py`.
"""
from __future__ import annotations
@ -31,19 +31,7 @@ import os
from collections.abc import Iterable
from . import lifecycle, naming
def declared_deps(recipe: str) -> list[str]:
"""Read `DEPS` from `tests/<recipe>/recipe_meta.py` — a list of recipe names this recipe needs
deployed alongside it. Returns [] if none."""
path = os.path.join(os.path.dirname(__file__), "..", "..", "tests", recipe, "recipe_meta.py")
if not os.path.exists(path):
return []
ns: dict = {}
with open(path) as fh:
exec(compile(fh.read(), path, "exec"), ns) # noqa: S102 (trusted, in-repo)
deps = ns.get("DEPS") or []
return [str(d) for d in deps if d]
from . import meta as meta_mod
def dep_domain(parent_recipe: str, pr: str, ref: str | None, dep_recipe: str) -> str:
@ -62,11 +50,11 @@ def write_run_state(deps_state) -> None:
"""Write the deps state file ($CCCI_DEPS_FILE). Two shapes supported (canonical=keyed dict):
1. **Legacy list-of-entries:** `[{"recipe": "<dep>", "domain": "<d>"}, ...]` (Q2.3 original).
Still accepted by `load_run_state` for backwards compat — `deps_apps` fixture flattens.
Still accepted by `load_run_state` for backwards compat — the `deps` fixture flattens.
2. **NEW per-spec dict (operator-2026-05-28 SSO-dep plan §3.2):**
`{"<dep_recipe>": {"recipe": "<dep>", "domain": "<d>", "realm": "...",
"client_id": "...", "client_secret": "...", "admin_user": "...", "admin_password": "..."}}`.
The `setup_custom_tests.sh` per-recipe hook reads this via `jq` to wire OIDC env.
The per-recipe `install_steps.sh` hook reads this via `jq` to wire OIDC env.
No-op if `$CCCI_DEPS_FILE` isn't set."""
path = os.environ.get("CCCI_DEPS_FILE")
@ -81,11 +69,12 @@ def deploy_deps(
pr: str,
ref: str | None,
deps: Iterable[str],
meta_for: dict[str, dict] | None = None,
meta_for: dict | None = None,
) -> list[dict]:
"""Deploy each declared dep, sequentially, at its per-run domain. Returns the list of state
dicts (one per dep). `meta_for` maps dep_recipe -> meta (HEALTH_PATH/HEALTH_OK/timeouts) so the
readiness wait uses per-dep config; missing dep meta falls back to (/, 200/301/302, 600s)."""
dicts (one per dep). `meta_for` maps dep_recipe -> RecipeMeta (HEALTH_PATH/HEALTH_OK/timeouts)
so the readiness wait uses per-dep config; a missing dep meta is loaded via meta.load()
(defaults: /, 200/301/302, 600s)."""
meta_for = meta_for or {}
state: list[dict] = []
for dep in deps:
@ -94,20 +83,21 @@ def deploy_deps(
# NB: each dep_app gets a fresh deploy_count entry only on `_record_deploy` which fires
# inside `lifecycle.deploy_app`. For Phase 2 the deploy-count guard (DG4.1) counts the
# parent + its deps as distinct install events — by design, since each is a separate app.
dm = meta_for.get(dep, {})
dm = meta_for.get(dep) or meta_mod.load(dep)
lifecycle.deploy_app(
dep,
domain,
secrets=True,
deploy_timeout=int(dm.get("DEPLOY_TIMEOUT", 900)),
deploy_timeout=int(dm.DEPLOY_TIMEOUT),
meta=dm,
)
try:
lifecycle.wait_healthy(
domain,
ok_codes=tuple(dm.get("HEALTH_OK", (200, 301, 302))),
path=dm.get("HEALTH_PATH", "/"),
deploy_timeout=int(dm.get("DEPLOY_TIMEOUT", 600)),
http_timeout=int(dm.get("HTTP_TIMEOUT", 600)),
ok_codes=tuple(dm.HEALTH_OK),
path=dm.HEALTH_PATH,
deploy_timeout=int(dm.DEPLOY_TIMEOUT),
http_timeout=int(dm.HTTP_TIMEOUT),
)
except Exception:
# If a dep fails to converge, abort the whole resolve — let the caller teardown
@ -163,7 +153,7 @@ def load_run_state():
def deps_as_dict(state) -> dict[str, dict]:
"""Coerce either shape (legacy list or new dict) into a recipe→entry dict for the deps_apps
"""Coerce either shape (legacy list or new dict) into a recipe→entry dict for the `deps`
fixture + dependent-tests consumption."""
if isinstance(state, dict):
return state
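The hunk above is cut off after the dict branch. As a reading aid, one plausible way to coerce both documented shapes is sketched below; keying the legacy list by each entry's `recipe` field is an assumption, and `coerce_deps` is a hypothetical name rather than the function in this file.

```python
def coerce_deps(state) -> dict:
    """Coerce the legacy list shape or the keyed dict shape into recipe -> entry."""
    if isinstance(state, dict):
        return state  # already the canonical keyed shape
    # Assumption: legacy entries are keyed by their "recipe" field.
    return {entry["recipe"]: entry for entry in (state or []) if entry.get("recipe")}


legacy = [{"recipe": "keycloak", "domain": "kc.example.test", "version": None}]
keyed = coerce_deps(legacy)
```

Here `keyed["keycloak"]["domain"]` gives the dep's domain regardless of which shape was written to the state file.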


@ -11,7 +11,8 @@ hook; the orchestrator decides additive-vs-skip. Sources, in precedence order
> cc-ci tests/<recipe>/test_<op>.py
(the generic tests/_generic/test_<op>.py is the always-present floor, run separately by default)
custom (non-lifecycle) test_*.py — ALL run, additively, from BOTH locations (opt-in).
custom test_*.py (functional/ + playwright/ ONLY, rcust P4 placement rule) — ALL run,
additively, from BOTH locations (opt-in).
install-steps hook — install_steps.sh: repo-local > cc-ci, or none.
@ -100,29 +101,22 @@ def resolve_op(recipe: str, op: str, repo_local_dir: str | None) -> tuple[str, s
def custom_tests(recipe: str, repo_local_dir: str | None) -> list[tuple[str, str]]:
"""All non-lifecycle test_*.py from cc-ci's tests/<recipe>/ and (if approved) the recipe's
repo-local tests/. Discovered locations (Phase 2 §4.1):
- the top-level dir tests/<recipe>/test_*.py (legacy + cross-cutting)
- functional/ tests/<recipe>/functional/test_*.py (parity ports + recipe-specific)
- playwright/ tests/<recipe>/playwright/test_*.py (UI flows P6)
Files named `test_<op>.py` (lifecycle ops) are excluded from this list — the orchestrator runs
those in their lifecycle tier, not the custom one. Repo-local is consulted only for
allowlist-approved recipes (HC2)."""
"""All custom-tier test_*.py from cc-ci's tests/<recipe>/ and (if approved) the recipe's
repo-local tests/. PLACEMENT RULE (rcust P4): custom tests live ONLY under
- functional/ tests/<recipe>/functional/test_*.py (parity ports + recipe-specific)
- playwright/ tests/<recipe>/playwright/test_*.py (UI flows)
A top-level test_*.py is a LIFECYCLE OVERLAY (test_<op>.py) and nothing else — top-level
non-lifecycle files are NOT discovered (zero users at the time of the change; the lifecycle-
name exclusion below stays as a safety net so a misfiled test_<op>.py can never double-run).
Repo-local is consulted only for allowlist-approved recipes (HC2)."""
lifecycle_names = {f"test_{op}.py" for op in LIFECYCLE_OPS}
subdirs = ("functional", "playwright")
found: list[tuple[str, str]] = []
for source, d in (("cc-ci", cc_ci_dir(recipe)), ("repo-local", _gated(recipe, repo_local_dir))):
if not d or not os.path.isdir(d):
continue
# top-level (legacy / cross-cutting tests not under functional/playwright)
for p in sorted(glob.glob(os.path.join(d, "test_*.py"))):
if os.path.basename(p) not in lifecycle_names:
found.append((source, p))
# functional/ and playwright/ subdirs (Phase 2 §4.1)
for sub in subdirs:
for p in sorted(glob.glob(os.path.join(d, sub, "test_*.py"))):
# Phase-2 layout: lifecycle ops never live under functional/playwright, but be
# explicit so a misfiled file doesn't silently get double-run.
if os.path.basename(p) not in lifecycle_names:
found.append((source, p))
return found
@ -144,7 +138,7 @@ def install_steps(recipe: str, repo_local_dir: str | None) -> tuple[str, str] |
def pre_op_hook(recipe: str, op: str, repo_local_dir: str | None) -> tuple[str, str] | None:
"""The pre-op seed hook for `op`: the path to a recipe `ops.py` module that defines a
`pre_<op>(domain, meta)` callable, or None. cc-ci's tests/<recipe>/ops.py wins; the repo-local
`pre_<op>(ctx)` callable, or None. cc-ci's tests/<recipe>/ops.py wins; the repo-local
ops.py is consulted only for allowlist-approved recipes (HC2). The orchestrator imports the
module and calls pre_<op> BEFORE performing the op (HC3 op/assertion split — overlays seed
pre-op state here, then assert post-op in test_<op>.py)."""


@ -19,6 +19,7 @@ import ssl
import time
from . import abra, lifecycle
from . import meta as meta_mod
# A recipe is backup-capable iff a compose file carries a truthy backupbot.backup label.
_BACKUPBOT_RE = re.compile(r"backupbot\.backup\b[^\n]*\btrue\b", re.IGNORECASE)
@ -28,13 +29,14 @@ def _recipe_dir(recipe: str) -> str:
return abra.recipe_dir(recipe) # the per-run tree inside a CI run ($ABRA_DIR)
def backup_capable(recipe: str, meta: dict | None = None) -> bool:
def backup_capable(recipe: str, meta=None) -> bool:
"""Whether the harness should run the backup/restore tiers (else they are a clean N/A skip, DG3).
`recipe_meta.BACKUP_CAPABLE` (bool) overrides; otherwise auto-detect by scanning the recipe's
compose*.yml for a truthy `backupbot.backup` label (the Co-op Cloud backup convention)."""
if meta and "BACKUP_CAPABLE" in meta:
return bool(meta["BACKUP_CAPABLE"])
`recipe_meta.BACKUP_CAPABLE` (bool) overrides when explicitly set (RecipeMeta default is None =
unset); otherwise auto-detect by scanning the recipe's compose*.yml for a truthy
`backupbot.backup` label (the Co-op Cloud backup convention)."""
if meta is not None and meta.BACKUP_CAPABLE is not None:
return bool(meta.BACKUP_CAPABLE)
for path in glob.glob(os.path.join(_recipe_dir(recipe), "compose*.yml")):
try:
with open(path) as fh:
@ -75,7 +77,7 @@ def served_cert(domain: str, port: int = 443) -> tuple[bool, str]:
return (True, f"CN={cn} SAN={sans}")
def assert_serving(domain: str, meta: dict) -> None:
def assert_serving(domain: str, meta) -> None:
"""The single generic "is the app really serving?" assertion (DG1).
The app-vs-Traefik-fallback proof is steps 1+2 (both load-bearing, verified by the Adversary):
@ -90,14 +92,14 @@ def assert_serving(domain: str, meta: dict) -> None:
Steps 1-2 are BOUNDED POLLS (no bare sleep), so a state-mutating op (upgrade/restore) that leaves
the app briefly reconverging settles, while a persistent failure still fails within the timeout."""
deadline = time.time() + meta["DEPLOY_TIMEOUT"]
deadline = time.time() + meta.DEPLOY_TIMEOUT
while time.time() < deadline and not lifecycle.services_converged(domain):
time.sleep(5)
assert lifecycle.services_converged(domain), f"{domain}: services did not converge"
path = meta["HEALTH_PATH"]
ok = tuple(meta["HEALTH_OK"])
deadline = time.time() + meta["HTTP_TIMEOUT"]
path = meta.HEALTH_PATH
ok = tuple(meta.HEALTH_OK)
deadline = time.time() + meta.HTTP_TIMEOUT
served = False
status, body = 0, ""
while time.time() < deadline:
@ -141,7 +143,7 @@ def op_state() -> dict:
return {}
def assert_upgraded(domain: str, meta: dict) -> None:
def assert_upgraded(domain: str, meta) -> None:
"""Generic UPGRADE assertion (post-op): the orchestrator already performed the upgrade once via
`abra app deploy --chaos` of the PR-head checkout. Assert it reconverged + still serves AND that
the deployment is genuinely the PR-head code under test (HC1) — non-vacuously (guarding F1d-2).
@ -212,7 +214,7 @@ def assert_backup_artifact(domain: str) -> str:
return snap_id
def assert_restore_healthy(domain: str, meta: dict) -> None:
def assert_restore_healthy(domain: str, meta) -> None:
"""Generic RESTORE assertion (post-op): the orchestrator already restored. Assert the app is
healthy + serving again (assert_serving polls, so the post-restore reconverge settles)."""
assert_serving(domain, meta)
@ -226,7 +228,7 @@ def perform_upgrade(
recipe: str,
head_ref: str | None,
deploy_timeout: int = 900,
meta: dict | None = None,
meta=None,
) -> dict[str, str | None]:
"""Perform the UPGRADE op once, in place, to the PR-HEAD code under test (HC1): re-checkout the
PR head (the prev-tag base deploy reset the recipe working tree), then `abra app deploy --chaos`
@ -244,7 +246,8 @@ def perform_upgrade(
STRICTER convergence+health wait here: services N/N (wait_healthy) + app HEALTH_PATH healthy +
any recipe READY_PROBE (collabora WOPI discovery 200). This bounds readiness by OUR generous
deadline, not abra's impatient one — and is stronger evidence than abra's monitor."""
meta = meta or {}
if meta is None:
meta = meta_mod.load(recipe)
before = lifecycle.deployed_identity(domain)
if head_ref:
lifecycle.recipe_checkout_ref(recipe, head_ref)
@ -253,9 +256,7 @@ def perform_upgrade(
# (target) version, so the base deploys minimally WITHOUT it and the upgrade adds it to COMPOSE_FILE
# here, after the PR-head checkout (which ships the overlay) and before the chaos redeploy that
# picks up the new .env. Dict or callable(domain)->dict. No-op for recipes without it.
upgrade_env = meta.get("UPGRADE_EXTRA_ENV") or {}
if callable(upgrade_env):
upgrade_env = upgrade_env(domain) or {}
upgrade_env = meta_mod.upgrade_extra_env(meta, meta_mod.hook_ctx(domain, meta, op="upgrade"))
for k, v in upgrade_env.items():
print(f" upgrade-env: {k}={v}", flush=True)
abra.env_set(domain, k, v)
@ -266,14 +267,12 @@ def perform_upgrade(
# Own the convergence verification (abra's monitor was skipped via -c).
lifecycle.wait_healthy(
domain,
ok_codes=tuple(meta.get("HEALTH_OK", (200, 301, 302))),
path=meta.get("HEALTH_PATH", "/"),
deploy_timeout=int(meta.get("DEPLOY_TIMEOUT", deploy_timeout)),
http_timeout=int(meta.get("HTTP_TIMEOUT", 300)),
)
lifecycle.wait_ready_probes(
meta, domain, timeout=int(meta.get("DEPLOY_TIMEOUT", deploy_timeout))
ok_codes=tuple(meta.HEALTH_OK),
path=meta.HEALTH_PATH,
deploy_timeout=int(meta.DEPLOY_TIMEOUT),
http_timeout=int(meta.HTTP_TIMEOUT),
)
lifecycle.wait_ready_probes(meta, domain, timeout=int(meta.DEPLOY_TIMEOUT), op="upgrade")
after = lifecycle.deployed_identity(domain)
# Evidence (HC1): the chaos-version label = the deployed recipe commit; it should match the
# PR-head we checked out — proving the upgrade deployed the code under test, not a published tag.


@ -12,6 +12,7 @@ import glob
import json
import os
import re
import shutil
import socket
import ssl
import subprocess
@ -19,6 +20,7 @@ import time
import urllib.request
from . import abra, lifetime
from . import meta as meta_mod
GATEWAY_IP = "143.244.213.108" # *.ci.commoninternet.net -> gateway (TLS passthrough to cc-ci)
# A run app domain is "<recipe[:4]>-<6hex>.ci.commoninternet.net" (see DECISIONS.md). Used by the
@ -111,37 +113,6 @@ def _residual(domain: str) -> dict:
}
def _recipe_extra_env(recipe: str, domain: str) -> dict[str, str]:
"""Per-recipe extra .env keys, applied at every deploy (install + upgrade's old_app) so a recipe
with multi-domain / config needs is enrolled with NO shared-harness change (D5/M6.5). A recipe
declares `EXTRA_ENV` in tests/<recipe>/recipe_meta.py as either a dict or a callable
`EXTRA_ENV(domain) -> dict` (callable form lets it derive values from the per-run domain, e.g.
cryptpad's SANDBOX_DOMAIN). Returns {} if none."""
path = os.path.join(os.path.dirname(__file__), "..", "..", "tests", recipe, "recipe_meta.py")
if not os.path.exists(path):
return {}
ns: dict = {}
with open(path) as fh:
exec(compile(fh.read(), path, "exec"), ns) # noqa: S102 (trusted, in-repo)
ee = ns.get("EXTRA_ENV")
if callable(ee):
ee = ee(domain)
return {str(k): str(v) for k, v in (ee or {}).items()}
def _recipe_meta_flag(recipe: str, key: str) -> bool:
"""Read a boolean flag from tests/<recipe>/recipe_meta.py (e.g. CHAOS_BASE_DEPLOY). Returns
False if the recipe ships no meta or the flag is absent/falsey. Trusted in-repo exec, same as
_recipe_extra_env."""
path = os.path.join(os.path.dirname(__file__), "..", "..", "tests", recipe, "recipe_meta.py")
if not os.path.exists(path):
return False
ns: dict = {}
with open(path) as fh:
exec(compile(fh.read(), path, "exec"), ns) # noqa: S102 (trusted, in-repo)
return bool(ns.get(key))
def _record_deploy() -> None:
"""Increment the per-run deploy counter (DG4.1: one deploy per run). No-op unless the
orchestrator set CCCI_DEPLOY_COUNT_FILE — so it never affects standalone/manual use."""
@ -155,6 +126,34 @@ def _record_deploy() -> None:
f.write(str(n + 1))
def ccci_overlay_path(recipe: str) -> str:
"""The cc-ci-owned compose overlay for a recipe (rcust P2a: first-class, auto-discovered)."""
return os.path.join(meta_mod.TESTS_DIR, recipe, "compose.ccci.yml")
def has_ccci_overlay(recipe: str) -> bool:
return os.path.isfile(ccci_overlay_path(recipe))
def provide_ccci_overlay(recipe: str) -> None:
"""Copy tests/<recipe>/compose.ccci.yml into THIS run's recipe checkout (ABRA_DIR-aware), so
the recipe's COMPOSE_FILE reference resolves (rcust P2a — the harness owns the copy; recipes
no longer ship install_steps.sh boilerplate for it). No-op for recipes without an overlay."""
src = ccci_overlay_path(recipe)
if not os.path.isfile(src):
return
dest_dir = abra.recipe_dir(recipe)
if not os.path.isdir(dest_dir):
print(f" ccci-overlay: recipe dir {dest_dir} missing — cannot provide overlay", flush=True)
raise RuntimeError(f"recipe checkout missing for {recipe}: {dest_dir}")
shutil.copy(src, os.path.join(dest_dir, "compose.ccci.yml"))
print(
f" ccci-overlay: provided compose.ccci.yml to the {recipe} checkout "
"(first-class overlay; base deploy auto-chaos)",
flush=True,
)
def _run_install_steps(hook: tuple[str, str], recipe: str, domain: str) -> None:
"""Run a recipe's custom install-steps hook (install_steps.sh) during the install tier — after
`abra app new` + env defaults + secret generate, before deploy (Phase 1d DG5). The hook gets the
@ -238,15 +237,23 @@ def deploy_app(
secrets: bool = True,
install_steps_hook: tuple[str, str] | None = None,
deploy_timeout: int = 900,
meta=None,
) -> None:
"""Create + configure + deploy an app. Forces LETS_ENCRYPT_ENV='' so traefik serves the
wildcard cert via the file provider and NEVER attempts ACME (adversary finding A1). Applies any
per-recipe EXTRA_ENV (recipe_meta.py) and the custom install-steps hook (Phase 1d) before deploy.
per-recipe EXTRA_ENV (recipe_meta.py), the custom install-steps hook (Phase 1d), and the
first-class `tests/<recipe>/compose.ccci.yml` overlay (rcust P2a) before deploy.
`meta` is the recipe's loaded RecipeMeta (EXTRA_ENV); the orchestrator loads once and passes
it down. Callers without one in hand (fixtures, warm reconcile) may omit it — it is then
loaded here via the single meta.load() path.
`deploy_timeout` is the subprocess timeout for `abra app deploy`. Caller (orchestrator) passes
`recipe_meta.DEPLOY_TIMEOUT` so heavy recipes (ghost, matrix-synapse, lasuite-meet) can extend
past the 900s default. abra's INTERNAL TIMEOUT (recipe's TIMEOUT env, default 300s) is set via
EXTRA_ENV; this is the Python subprocess wrapper's timeout so abra doesn't get SIGKILLed mid-deploy."""
if meta is None:
meta = meta_mod.load(recipe)
_record_deploy()
# Lock BEFORE the app exists: a concurrent run's janitor must never see this app without a
# held app lock (it would probe it as an orphan and reap an in-flight deploy). Also the
@ -274,16 +281,18 @@ def deploy_app(
flush=True,
)
chaos = True
# A recipe may force a chaos base deploy via recipe_meta CHAOS_BASE_DEPLOY=True when an
# install_steps hook adds an untracked compose overlay to the recipe checkout (e.g. discourse's
# compose.ccci.yml, provided by install_steps for the pinned base). The untracked file makes
# abra's pinned-deploy clean-tree check FATA ('has locally unstaged changes'); chaos skips lint +
# the clean-tree gate and deploys the EXPLICITLY-checked-out pinned version (we already ran
# recipe_checkout(version) above) — NOT latest. Same mechanism as the lightweight-tag branch.
elif _recipe_meta_flag(recipe, "CHAOS_BASE_DEPLOY"):
# A first-class cc-ci compose overlay (tests/<recipe>/compose.ccci.yml, copied into the
# checkout below — rcust P2a) is an UNTRACKED file in the recipe checkout, which makes
# abra's pinned-deploy clean-tree check FATA ('has locally unstaged changes'). Auto-chaos:
# chaos skips lint + the clean-tree gate and deploys the EXPLICITLY-checked-out pinned
# version (we already ran recipe_checkout(version) above) — NOT latest. Same mechanism as
# the lightweight-tag branch. (Replaces the deleted CHAOS_BASE_DEPLOY meta flag — the
# overlay's presence IS the signal, killing the R7 implicit coupling.)
elif has_ccci_overlay(recipe):
print(
f" deploy_app({recipe}@{version}): CHAOS_BASE_DEPLOY set → chaos base deploy of the "
"checked-out pinned version (skips clean-tree/lint; deploys version, not LATEST)",
f" deploy_app({recipe}@{version}): compose.ccci.yml overlay present → chaos base "
"deploy of the checked-out pinned version (skips clean-tree/lint; deploys version, "
"not LATEST)",
flush=True,
)
chaos = True
@ -293,12 +302,18 @@ def deploy_app(
# it ourselves is recipe-agnostic and canonical (the run domain IS the app's domain).
abra.env_set(domain, "DOMAIN", domain)
abra.env_set(domain, "LETS_ENCRYPT_ENV", "")
for k, v in _recipe_extra_env(recipe, domain).items():
for k, v in meta_mod.extra_env(meta, meta_mod.hook_ctx(domain, meta)).items():
abra.env_set(domain, k, v)
if secrets:
abra.secret_generate(domain)
if install_steps_hook:
_run_install_steps(install_steps_hook, recipe, domain)
# First-class cc-ci compose overlay (rcust P2a): if the recipe ships
# tests/<recipe>/compose.ccci.yml, copy it into THIS run's recipe checkout (ABRA_DIR-aware)
# so the COMPOSE_FILE reference in the recipe's EXTRA_ENV resolves. Untracked, so it persists
# across the later PR-head checkout (idempotent when the head ships the same fix). Replaces
# the per-recipe install_steps.sh copy boilerplate + CHAOS_BASE_DEPLOY flag (auto-chaos above).
provide_ccci_overlay(recipe)
# HQ1: warm the local image store before the (real, unchanged) abra deploy.
prepull_images(recipe, domain)
abra.deploy(domain, chaos=chaos, timeout=deploy_timeout)
@ -333,8 +348,27 @@ def services_converged(domain: str) -> bool:
# `want == "0"` rejection wrongly treated those as never-converged, hanging the deploy
# forever. `cur == want` (with `want` present) is the correct convergence test; a service
# still spinning up shows e.g. "0/1" (cur != want) and is correctly not-yet-converged.
if not want or cur != want:
if not want:
return False
if cur != want:
# A TRIGGERED one-shot (restart_policy none, scaled 0→1, runs once, exits 0) reports
# "0/1" FOREVER after its task completes — swarm never restarts it, so a bare
# `cur != want` rejection would block convergence for the rest of the run (lasuite-drive
# minio-createbuckets, rcust M2: install assert burned the full DEPLOY_TIMEOUT after the
# P2b port moved the bucket trigger BEFORE the install assert; pre-restructure the
# trigger ran after it, so converge never saw the 0/1). A replica deficit explained
# entirely by COMPLETE tasks IS converged: the one-shot did its job and will never run
# again. Anything else in the deficit (Running/Starting/Pending = still spinning up;
# Failed/Rejected = genuinely broken) stays not-converged, and a desired>0 service with
# no tasks yet is still scheduling.
tasks = subprocess.run(
["docker", "service", "ps", name, "--format", "{{.CurrentState}}"],
capture_output=True,
text=True,
)
states = [ln.split()[0] for ln in tasks.stdout.split("\n") if ln.strip()]
if not (states and all(s == "Complete" for s in states)):
return False
# N/N alone is NOT convergence during a stop-first rolling update: a chaos redeploy that changes
# a non-app service image (e.g. immich's db pin) registers the update immediately, but swarm may
# not have cycled that service's task yet — the OLD task still shows 1/1, then dies seconds later
@ -510,7 +544,7 @@ def chaos_redeploy(
abra.deploy(domain, chaos=True, timeout=deploy_timeout, no_converge_checks=no_converge_checks)
def wait_ready_probes(meta: dict, domain: str, timeout: int = 600) -> None:
def wait_ready_probes(meta, domain: str, timeout: int = 600, op: str | None = None) -> None:
"""Poll a recipe's optional READY_PROBE endpoints until each returns an accepted status, or raise.
A recipe_meta may define `READY_PROBE(domain) -> [{"host":..., "path":..., "ok":(200,)}, ...]`
@ -527,10 +561,10 @@ def wait_ready_probes(meta: dict, domain: str, timeout: int = 600) -> None:
must be released by the old task + rebound by the new) the voice server can be down while
HTTP-200 still passes — and backup-bot then execs into a not-running app container (409). Requiring
the voice port to be stably listening before proceeding closes that window."""
probe_fn = meta.get("READY_PROBE")
probe_fn = meta.READY_PROBE
if not callable(probe_fn):
return
probes = probe_fn(domain) or []
probes = probe_fn(meta_mod.hook_ctx(domain, meta, op=op)) or []
for probe in probes:
if "tcp_port" in probe:
host = probe.get("tcp_host", "127.0.0.1")
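
The one-shot rule added to `services_converged` in the hunk above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the harness's code: `replicas_converged` and its parameters are hypothetical names, the task states are passed in as plain strings instead of being read from `docker service ps`, and the later rolling-update check (N/N alone is not convergence) is deliberately left out.

```python
def replicas_converged(cur: str, want: str, task_states: list[str]) -> bool:
    """Hypothetical pure version of the replica check.

    cur/want are the two halves of a swarm "cur/want" replica string;
    task_states holds the first word of each task's CurrentState.
    """
    if not want:
        return False  # no desired count reported: not converged
    if cur == want:
        return True  # N/N, including 0/0
    # Replica deficit: accept it only when every task has completed,
    # i.e. a triggered one-shot that ran once and exited 0.
    return bool(task_states) and all(s == "Complete" for s in task_states)


# A finished one-shot shows 0/1 with a single Complete task.
print(replicas_converged("0", "1", ["Complete"]))
# A service that is still starting shows 0/1 with a Running task.
print(replicas_converged("0", "1", ["Running"]))
```

Keeping the decision separate from the `docker` subprocess call would make the rule unit-testable without a swarm; whether the harness factors it this way is not shown in the diff.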

runner/harness/manifest.py Normal file

@ -0,0 +1,153 @@
"""Customization manifest (rcust P5; spec §8 R4 mitigation).
One block at run start answering "what does this recipe customize?" across ALL the surfaces
(recipe_meta keys, hook files, file-presence, run-time env overrides) — printed to the run log and
embedded verbatim in results.json under "customization". PURE PRESENTATION: building or printing
the manifest must never influence any verdict (R7-class invariant).
"""
from __future__ import annotations
import os
import re
from . import discovery, lifecycle
from . import meta as meta_mod
_PRE_OP_RE = re.compile(r"^def (pre_[a-z]+)\(", re.MULTILINE)
# Meta values are repo-public by construction (recipe_meta.py is committed; real secrets are
# class-B generated, never meta), but the manifest lands on the dashboard — mask values whose
# key NAME is secret-shaped so a field literally called SECRET_KEY_BASE never shows a value
# (defense in depth + keeps dashboard secret-scans quiet). `KEY` matches only as a word segment
# (API_KEY yes, KEYCLOAK_URL no).
_SENSITIVE_NAME_RE = re.compile(r"SECRET|PASSWORD|TOKEN|CREDENTIAL|(^|_)KEY(_|$)", re.IGNORECASE)
def _jsonable(v, name=""):
"""Manifest values must be JSON-serializable + deterministic: hooks render as '<hook>',
tuples become lists, secret-named entries (by key name, incl. nested dict keys) as
'<redacted>'."""
if callable(v):
return "<hook>"
if name and _SENSITIVE_NAME_RE.search(name):
return "<redacted>"
if isinstance(v, tuple):
return list(v)
if isinstance(v, dict):
return {k: _jsonable(x, name=str(k)) for k, x in v.items()}
return v
def _pre_ops(path: str) -> list[str]:
"""The pre_<op> hook names an ops.py defines (cheap source scan, same approach as
discovery._module_defines — no import)."""
try:
with open(path) as fh:
return sorted(set(_PRE_OP_RE.findall(fh.read())))
except OSError:
return []
def _custom_counts(recipe: str, repo_local: str | None) -> dict[str, dict[str, int]]:
out: dict[str, dict[str, int]] = {}
for source, path in discovery.custom_tests(recipe, repo_local):
sub = os.path.basename(os.path.dirname(path)) # functional | playwright
out.setdefault(source, {}).setdefault(sub, 0)
out[source][sub] += 1
return out
def build(recipe: str, meta, repo_local: str | None) -> dict:
"""Collect the run's resolved customization into one deterministic, JSON-serializable dict.
Keys: meta_non_default (explicitly-customized recipe_meta keys), hooks (ops.py pre-ops +
install_steps.sh + compose.ccci.yml with their source), overlays (lifecycle overlay files by
op + source), custom_tests (counts per source/subdir), env_overrides (active
CCCI_SKIP_GENERIC* — the dev-only escape hatch, flagged when riding a CI run)."""
hooks: dict = {}
pre_ops: dict[str, list[str]] = {}
for source, d in (
("cc-ci", discovery.cc_ci_dir(recipe)),
("repo-local", discovery._gated(recipe, repo_local)), # noqa: SLF001 — same HC2 gate
):
if not d:
continue
p = os.path.join(d, "ops.py")
if os.path.isfile(p):
ops = _pre_ops(p)
if ops:
pre_ops[source] = ops
if pre_ops:
hooks["ops.py"] = pre_ops
ist = discovery.install_steps(recipe, repo_local)
if ist:
hooks["install_steps.sh"] = ist[0]
if lifecycle.has_ccci_overlay(recipe):
hooks["compose.ccci.yml"] = "cc-ci"
overlays = {}
for op in discovery.LIFECYCLE_OPS:
ov = discovery.resolve_overlay_op(recipe, op, repo_local)
if ov:
overlays[op] = ov[0]
env_overrides = sorted(
k
for k in os.environ
if k.startswith("CCCI_SKIP_GENERIC")
and str(os.environ.get(k) or "").strip().lower() in ("1", "true", "yes", "on")
)
return {
"meta_non_default": {
k: _jsonable(v, name=k) for k, v in sorted(meta_mod.non_default(meta).items())
},
"hooks": hooks,
"overlays": overlays,
"custom_tests": _custom_counts(recipe, repo_local),
"env_overrides": env_overrides,
}
def render(recipe: str, manifest: dict) -> str:
"""The human block printed at run start (same content as the results.json key)."""
lines = [f"===== customization manifest: {recipe} ====="]
nd = manifest["meta_non_default"]
lines.append(
"meta (non-default): "
+ (" ".join(f"{k}={v!r}" for k, v in nd.items()) if nd else "(none — zero-config floor)")
)
hk = manifest["hooks"]
parts = []
for source, ops in hk.get("ops.py", {}).items():
parts.append(f"ops.py[{','.join(ops)}]({source})")
if "install_steps.sh" in hk:
parts.append(f"install_steps.sh({hk['install_steps.sh']})")
if "compose.ccci.yml" in hk:
parts.append(f"compose.ccci.yml({hk['compose.ccci.yml']})")
lines.append("hooks: " + (" ".join(parts) if parts else "(none)"))
ov = manifest["overlays"]
lines.append(
"overlays: "
+ (" ".join(f"test_{op}.py({src})" for op, src in ov.items()) if ov else "(none)")
)
ct = manifest["custom_tests"]
lines.append(
"custom tests: "
+ (
" ".join(
" ".join(f"{sub}/={n}" for sub, n in sorted(counts.items())) + f" ({source})"
for source, counts in sorted(ct.items())
)
if ct
else "(none)"
)
)
eo = manifest["env_overrides"]
if eo:
suffix = " !! dev-only override active in CI" if os.environ.get("DRONE") else ""
lines.append("env overrides: " + " ".join(f"{k}=1" for k in eo) + suffix)
else:
lines.append("env overrides: (none)")
return "\n".join(lines)
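
The masking rule in `_jsonable` is easiest to see on sample data. The sketch below copies the `_SENSITIVE_NAME_RE` pattern from the diff above and restates the function under the hypothetical name `jsonable`; the `sample` dict and its values are invented for illustration and do not come from any real recipe.

```python
import re

# Same pattern as _SENSITIVE_NAME_RE in the diff above.
SENSITIVE = re.compile(r"SECRET|PASSWORD|TOKEN|CREDENTIAL|(^|_)KEY(_|$)", re.IGNORECASE)


def jsonable(v, name=""):
    """Illustrative restatement of the manifest value normalization."""
    if callable(v):
        return "<hook>"  # hooks never render their source
    if name and SENSITIVE.search(name):
        return "<redacted>"  # masked by key NAME, not by value
    if isinstance(v, tuple):
        return list(v)  # JSON has no tuple type
    if isinstance(v, dict):
        return {k: jsonable(x, name=str(k)) for k, x in v.items()}
    return v


# Hypothetical meta values, for illustration only.
sample = {
    "HEALTH_OK": (200, 302),
    "EXTRA_ENV": {"API_KEY": "abc", "KEYCLOAK_URL": "https://sso.example.test"},
    "READY_PROBE": lambda ctx: [],
}
masked = {k: jsonable(v, name=k) for k, v in sample.items()}
print(masked)
```

`API_KEY` is masked because `KEY` appears as a whole underscore-delimited segment, while `KEYCLOAK_URL` passes through because `KEY` is only a prefix of a longer word.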

runner/harness/meta.py Normal file

@ -0,0 +1,320 @@
"""Single recipe-meta loader + declarative key registry (recipe-custom restructure P1; spec
docs/recipe-customization.md §8 R1).
THE one place `tests/<recipe>/recipe_meta.py` is `exec()`d. Every consumer (orchestrator, pytest
`meta` fixture, deploy env shaping, deps, warm-canonical enrollment, screenshot) reads the ONE
loaded `RecipeMeta` object instead of re-exec'ing the file and cherry-picking keys — that drift
(six divergent loaders, spec §4 L1-L6) is what made `SCREENSHOT` an unreachable knob (R2) and let
key typos silently disable coverage (R6).
Validation (locked decision, recipe-custom-restructure-full-plan.md):
- unknown ALL-CAPS top-level name → MetaError (hard error, fails fast at load; the all-recipes
unit test catches it at PR time). Underscore-prefixed names (`_FOO`) are recipe-private and
exempt; lowercase names (helper functions/imports) are ignored.
- type mismatch → MetaError. Callables are accepted ONLY for hook-typed keys.
The KEYS registry is the single source of truth for the key set: it drives validation, the
RecipeMeta dataclass fields, and the generated reference table in docs/recipe-customization.md §4
(scripts/gen-meta-docs.py; a unit test asserts the committed table matches).
"""
from __future__ import annotations
import copy
import dataclasses
import difflib
import inspect
import json
import os
from collections.abc import Callable
ROOT = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
TESTS_DIR = os.path.join(ROOT, "tests")
class MetaError(Exception):
"""A recipe_meta.py failed registry validation (unknown key / type mismatch / callable on a
data key). Hard error by design: a typo'd key must fail the run at load, not silently reduce
coverage (spec §8 R6 — the worst failure mode for a CI harness)."""
@dataclasses.dataclass(frozen=True)
class Key:
"""One registered recipe_meta key: name, type tag, default, one-line doc (rendered into the
generated reference table), optional extra validator, and a deprecation marker (deprecated
keys still load+validate but are scheduled for deletion)."""
name: str
type: str # "int"|"str"|"tuple[int]"|"bool"|"dict_or_hook"|"hook"|"list[str]"|"dict"
default: object
doc: str
validate: Callable[[object], None] | None = None
deprecated: bool = False
# Expected positional-parameter names for a callable value (rcust P3 uniform ctx convention).
# Enforced at load so a legacy-signature hook (e.g. `def READY_PROBE(domain)`) fails with a
# CLEAR MetaError naming the migration — never a silent TypeError mid-run.
hook_params: tuple[str, ...] | None = None
KEYS: tuple[Key, ...] = (
Key(
"HEALTH_PATH",
"str",
"/",
"Path probed for serving/health checks (deploy wait + generic `assert_serving`).",
),
Key("HEALTH_OK", "tuple[int]", (200, 301, 302), "Acceptable HTTP status codes for health."),
Key("DEPLOY_TIMEOUT", "int", 600, "Max seconds to wait for swarm convergence per deploy."),
Key("HTTP_TIMEOUT", "int", 300, "Max seconds to wait for HTTP health after convergence."),
Key(
"BACKUP_CAPABLE",
"bool",
None,
"Override the backup-tier capability auto-detect (compose `backupbot.backup` labels). `False` forces N/A; `True` forces the tier on; unset = auto-detect.",
),
Key(
"EXPECTED_NA",
"dict",
None,
"Declare an N/A rung intentional: `{rung: reason}`. The cap stands either way; only the report wording changes.",
),
Key(
"READY_PROBE",
"hook",
None,
"Callable `(ctx) -> [probe, ...]` returning extra readiness probes, run after install AND after upgrade: HTTP `{host, path, ok}` or TCP `{tcp_host, tcp_port, stable}`.",
hook_params=("ctx",),
),
Key(
"UPGRADE_BASE_VERSION",
"str",
None,
"Exact published tag overriding the upgrade tier's base (default: `recipe_versions[-2]`).",
),
Key(
"BACKUP_VERIFY",
"hook",
None,
"Callable `(ctx) -> bool` post-backup data-capture check; `False` re-runs the backup (truncated-dump race guard), retried up to 3 attempts.",
hook_params=("ctx",),
),
Key(
"UPGRADE_EXTRA_ENV",
"dict_or_hook",
None,
"Extra `.env` keys applied after the PR-head checkout, before the chaos redeploy (env that exists only at head). Dict, or callable `(ctx) -> dict`.",
hook_params=("ctx",),
),
Key(
"EXTRA_ENV",
"dict_or_hook",
{},
"Extra `.env` keys applied at EVERY deploy (base install AND upgrade old-app). Dict, or callable `(ctx) -> dict` deriving values from the per-run domain (`ctx.domain`).",
hook_params=("ctx",),
),
Key(
"DEPS",
"list[str]",
[],
'Dep recipes deployed/provisioned alongside (e.g. `["keycloak"]`); creds land in `$CCCI_DEPS_FILE`.',
),
Key(
"WARM_CANONICAL",
"bool",
False,
"Enroll the recipe in the warm/canonical app system (docs/warm.md): green cold runs on LATEST advance the canonical snapshot.",
),
Key(
"SCREENSHOT",
"hook",
None,
"Callable `(page, ctx)` driving Playwright to a safe, credential-free post-login view for the results-card screenshot (default: landing page).",
hook_params=("page", "ctx"),
),
# (CHAOS_BASE_DEPLOY, OIDC_AT_INSTALL and SKIP_GENERIC were deleted in restructure P2:
# compose.ccci.yml is first-class + auto-chaos; install-time deps wiring is the only mode;
# the generic floor is suppressible only via the dev-only CCCI_SKIP_GENERIC* env form.)
)
_REGISTRY: dict[str, Key] = {k.name: k for k in KEYS}
# The one validated, attribute-access view of a recipe's customization. Generated from KEYS so the
# field set can never drift from the registry (frozen: consumers share one immutable object).
RecipeMeta = dataclasses.make_dataclass(
"RecipeMeta",
[(k.name, object, dataclasses.field(default=None)) for k in KEYS],
frozen=True,
)
RecipeMeta.__doc__ = (
"Validated per-recipe customization (one field per registered key; attribute access). "
"Built ONLY by meta.load()."
)
def meta_path(recipe: str, tests_dir: str | None = None) -> str:
"""Canonical path of a recipe's meta file (pure)."""
return os.path.join(tests_dir or TESTS_DIR, recipe, "recipe_meta.py")
def check_hook_signature(fn, expected: tuple[str, ...], where: str) -> None:
"""Enforce the uniform ctx hook convention (rcust P3): a hook callable's positional parameters
must be exactly `expected` (e.g. ("ctx",) or ("page", "ctx")). A legacy-signature hook (the
pre-restructure `(domain)` / `(domain, meta)` / `(page, domain, meta)` forms) raises a CLEAR
MetaError naming the migration — never a silent TypeError mid-run."""
try:
params = [
p.name
for p in inspect.signature(fn).parameters.values()
if p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
]
except (TypeError, ValueError): # builtins/odd callables — let the call site surface it
return
if tuple(params) != expected:
raise MetaError(
f"{where}: hook signature is ({', '.join(params)}) — the recipe-customization "
f"restructure (P3) changed ALL recipe hook signatures to ({', '.join(expected)}); "
f"read fields off the HookCtx (ctx.domain, ctx.base_url, ctx.meta, ctx.deps, ctx.op). "
f"See docs/recipe-customization.md §5."
)
def _coerce(key: Key, value: object, path: str) -> object:
"""Validate `value` against `key`'s declared type; normalize containers (tuple[int]/list[str]).
Raises MetaError on mismatch — including a callable supplied for a data-typed key."""
t = key.type
if callable(value) and t not in ("hook", "dict_or_hook"):
raise MetaError(
f"{path}: {key.name} is a data key (type {t}) — callables are accepted only for "
f"hook-typed keys"
)
if t == "int":
if isinstance(value, int) and not isinstance(value, bool):
return value
elif t == "str":
if isinstance(value, str):
return value
elif t == "bool":
if isinstance(value, bool):
return value
elif t == "tuple[int]":
if isinstance(value, tuple | list) and all(
isinstance(x, int) and not isinstance(x, bool) for x in value
):
return tuple(value)
elif t == "list[str]":
if isinstance(value, tuple | list) and all(isinstance(x, str) for x in value):
return list(value)
elif t == "dict":
if isinstance(value, dict):
return value
elif (
t == "hook"
and callable(value)
or t == "dict_or_hook"
and (isinstance(value, dict) or callable(value))
):
return value
raise MetaError(f"{path}: {key.name} must be {t}, got {type(value).__name__} ({value!r})")
def load(recipe: str, tests_dir: str | None = None):
"""Load + validate a recipe's customization -> RecipeMeta. THE only exec() of recipe_meta.py.
Missing file -> all registry defaults (the zero-config baseline, spec §2). Unknown
non-underscore ALL-CAPS top-level name or type mismatch -> MetaError (hard error).
`tests_dir` overrides the recipe-meta root (unit tests / fixtures)."""
path = meta_path(recipe, tests_dir)
values = {k.name: copy.copy(k.default) for k in KEYS}
if os.path.exists(path):
ns: dict = {}
with open(path) as fh:
exec(compile(fh.read(), path, "exec"), ns) # noqa: S102 (trusted, in-repo)
for name in sorted(ns):
if name.startswith("_") or not name.isupper():
continue # _FOO = recipe-private (exempt); lowercase = helpers/imports (ignored)
key = _REGISTRY.get(name)
if key is None:
near = difflib.get_close_matches(name, _REGISTRY, n=1)
hint = f" — did you mean {near[0]!r}?" if near else ""
raise MetaError(
f"{path}: unknown recipe_meta key {name!r}{hint}. Registered keys: "
f"{', '.join(sorted(_REGISTRY))}. Recipe-private constants must be "
f"underscore-prefixed (e.g. _{name})."
)
values[name] = _coerce(key, ns[name], path)
if key.hook_params and callable(values[name]):
check_hook_signature(values[name], key.hook_params, f"{path}: {name}")
if key.validate:
key.validate(values[name])
return RecipeMeta(**values)
def as_dict(meta) -> dict:
"""RecipeMeta -> {key: value} (every registered key, defaults included)."""
return dataclasses.asdict(meta)
def non_default(meta) -> dict:
"""The keys a recipe explicitly customized: {key: value} where value differs from the registry
default. Hooks compare by identity-vs-None (a set hook is always non-default). Feeds the run's
customization manifest (P5)."""
out = {}
for k in KEYS:
v = getattr(meta, k.name)
if v != k.default:
out[k.name] = v
return out
@dataclasses.dataclass(frozen=True)
class HookCtx:
"""The single argument every recipe hook receives (rcust P3 uniform ctx convention):
`EXTRA_ENV(ctx)`, `UPGRADE_EXTRA_ENV(ctx)`, `READY_PROBE(ctx)`, `BACKUP_VERIFY(ctx)`,
`SCREENSHOT(page, ctx)`, ops.py `pre_<op>(ctx)`."""
domain: str # the app's per-run domain
base_url: str # https://<domain>
meta: object # the recipe's full RecipeMeta
deps: dict | None # provisioned dep creds ({dep_recipe: entry}) or None if absent/empty
op: str | None # current lifecycle op (install|upgrade|backup|restore) or None
def _run_deps() -> dict | None:
"""The current run's provisioned dep creds from $CCCI_DEPS_FILE (either shape), or None.
Read directly (not via harness.deps) to keep meta.py import-cycle-free."""
path = os.environ.get("CCCI_DEPS_FILE")
if not path or not os.path.exists(path):
return None
try:
with open(path) as f:
data = json.load(f)
except (OSError, ValueError):
return None
if isinstance(data, dict):
return data or None
if isinstance(data, list):
out = {e["recipe"]: e for e in data if isinstance(e, dict) and e.get("recipe")}
return out or None
return None
def hook_ctx(domain: str, meta, *, op: str | None = None) -> HookCtx:
"""Build the HookCtx for a hook call site. Dep creds are picked up from the run's
$CCCI_DEPS_FILE when present (None otherwise)."""
return HookCtx(domain=domain, base_url=f"https://{domain}", meta=meta, deps=_run_deps(), op=op)
def _env_map(value, ctx: HookCtx) -> dict[str, str]:
if callable(value):
value = value(ctx)
return {str(k): str(v) for k, v in (value or {}).items()}
def extra_env(meta, ctx: HookCtx) -> dict[str, str]:
"""Resolve EXTRA_ENV (dict or callable(ctx)->dict) to the concrete per-run env map."""
return _env_map(meta.EXTRA_ENV, ctx)
def upgrade_extra_env(meta, ctx: HookCtx) -> dict[str, str]:
"""Resolve UPGRADE_EXTRA_ENV (dict or callable(ctx)->dict) to the concrete env map."""
return _env_map(meta.UPGRADE_EXTRA_ENV, ctx)
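
Two of the load-time checks in `meta.py` rest on standard-library calls: `inspect.signature` for the hook-signature check and `difflib.get_close_matches` for the "did you mean" hint. The sketch below shows both in isolation. `positional_params`, `suggest`, the two probe functions, and the shortened `REGISTERED` tuple are hypothetical names introduced for illustration; only the library calls match the diff.

```python
import difflib
import inspect
from typing import Optional

# A subset of the registry's key names, for illustration only.
REGISTERED = ("HEALTH_PATH", "HEALTH_OK", "READY_PROBE", "SCREENSHOT")


def positional_params(fn) -> tuple:
    """Names of a callable's positional parameters, in declaration order."""
    return tuple(
        p.name
        for p in inspect.signature(fn).parameters.values()
        if p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
    )


def suggest(name: str) -> Optional[str]:
    """Closest registered key for a typo'd name, or None if nothing is close."""
    near = difflib.get_close_matches(name, REGISTERED, n=1)
    return near[0] if near else None


def legacy_probe(domain):  # pre-restructure signature
    return []


def migrated_probe(ctx):  # uniform ctx convention
    return []


print(positional_params(legacy_probe) == ("ctx",))
print(positional_params(migrated_probe) == ("ctx",))
print(suggest("HEALTH_PATHS"))
```

Comparing the parameter tuple against the expected one at load time is what lets a legacy hook fail with a migration message rather than a `TypeError` at the call site.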


@ -203,6 +203,7 @@ def build_results(
screenshot: str | None = None,
summary_card: str | None = None,
expected_na: dict | None = None,
customization: dict | None = None,
) -> dict:
"""Assemble the full results.json dict (no I/O). `finished_ts` is passed in (the orchestrator
stamps it) so this stays pure and deterministic for unit tests. `expected_na` is the recipe's
@ -236,6 +237,9 @@ def build_results(
},
"screenshot": screenshot,
"summary_card": summary_card,
# rcust P5: the run's resolved customization manifest (pure presentation — consumers must
# never derive a verdict from it).
"customization": customization,
}


@ -8,7 +8,7 @@ Secret-safety (R7, the cardinal screenshot guardrail): the screenshot step must
that displays generated credentials (an install wizard showing the initial admin password, a secrets
page, etc.). The DEFAULT capture is the app's **landing page** (a login form shows fields, not the
password) — safe for every recipe. A recipe that needs a post-login view opts in via a recipe-meta
`SCREENSHOT` hook: a callable `screenshot(page, domain, meta) -> None` that drives Playwright to a
`SCREENSHOT` hook: a callable `SCREENSHOT(page, ctx) -> None` that drives Playwright to a
safe, credential-free view and is responsible for not landing on a secrets page. The harness never
auto-fills a wizard.
@ -18,27 +18,85 @@ missing, app slow, navigation error) is swallowed and returns None so the run/ve
from __future__ import annotations
import contextlib
import os
from . import browser as harness_browser
from . import meta as meta_mod
# Default viewport for the captured screenshot — a desktop-ish frame that crops well into the card.
VIEWPORT = {"width": 1280, "height": 800}
# Hard cap so a wedged app can never hang the run on the screenshot step (R7 / Phase-1 timeouts).
NAV_DEADLINE_S = 45
# ---- post-navigation settle (phase-shot fix, 2026-06-11) ----
# SPAs (immich, n8n, cryptpad, the keycloak admin console, lasuite-*, mumble-web, mattermost) fire
# `domcontentloaded` on their empty HTML shell and only paint after the JS bundle loads — snapping
# immediately produced solid blank frames (byte-stable 4801-4802 B) or loading spinners. After nav,
# wait for network-idle up to SETTLE_TIMEOUT_MS (apps that never go idle — continuous polling —
# simply spend the cap; bounded, never raises), then RENDER_GRACE_MS for the final paint.
SETTLE_TIMEOUT_MS = 10_000
RENDER_GRACE_MS = 500
# A 1280x800 PNG below this is near-certainly a solid frame or a bare loading spinner (phase-shot
# audit: blank frames were 4801-4802 B across three different apps, lone spinners 5.9-8.8 KB; the
# smallest real page was 12950 B). One bounded retry with an extra settle, then keep what we get —
# an honest late frame beats none, and the retry only ever replaces a tiny frame with a later one.
BLANK_SIZE_BYTES = 10_000
BLANK_RETRY_SETTLE_MS = 4_000
# Wait-budget arithmetic (plan-phase-shot §3 P3: step worst case ≤ ~60s): NAV_DEADLINE_S (45s,
# spent only while the app isn't serving yet) + SETTLE_TIMEOUT_MS + RENDER_GRACE_MS +
# BLANK_RETRY_SETTLE_MS + RENDER_GRACE_MS = 60s of bounded waiting; tested in unit tests.
def _settle(page, idle_timeout_ms: int) -> None:
"""Best-effort bounded settle: network-idle up to the cap, then a short render grace.
Never raises (R7) — a timeout just means the page kept polling; we snap what's painted."""
# cosmetic path (R7): a timeout on a never-idle app is expected — the cap IS the wait
with contextlib.suppress(Exception):
page.wait_for_load_state("networkidle", timeout=idle_timeout_ms)
with contextlib.suppress(Exception):
page.wait_for_timeout(RENDER_GRACE_MS)
def _snap_with_blank_retry(page, out_path: str) -> None:
"""Screenshot the page; if the PNG is blank/spinner-sized, retry ONCE after a longer settle.
The retry overwrites the tiny frame with a strictly-later one (same page, more paint time)."""
page.screenshot(path=out_path, full_page=False)
try:
first = os.path.getsize(out_path)
except OSError:
return
if first >= BLANK_SIZE_BYTES:
return
print(
f" screenshot: frame looks blank/loading ({first} B < {BLANK_SIZE_BYTES} B) — "
"one retry after a longer settle",
flush=True,
)
_settle(page, BLANK_RETRY_SETTLE_MS)
page.screenshot(path=out_path, full_page=False)
with contextlib.suppress(OSError):
print(f" screenshot: retry frame {os.path.getsize(out_path)} B", flush=True)
def screenshot_path(run_artifact_dir: str) -> str:
"""Canonical on-disk path for a run's app screenshot (pure)."""
return os.path.join(run_artifact_dir, "screenshot.png")
def _load_screenshot_hook(recipe_meta: dict | None):
def _load_screenshot_hook(recipe_meta):
"""Return the recipe's optional SCREENSHOT hook (a callable) if it declared one, else None.
The hook drives Playwright to a safe post-login view; default is the landing page."""
if not recipe_meta:
The hook drives Playwright to a safe post-login view; default is the landing page.
`recipe_meta` is the loaded RecipeMeta (rcust P1 — the single loader actually delivers
SCREENSHOT now; under the old L1 allowlist the key never arrived, spec §8 R2). A plain dict
is still accepted for direct/manual callers."""
if recipe_meta is None:
return None
hook = recipe_meta.get("SCREENSHOT")
if isinstance(recipe_meta, dict):
hook = recipe_meta.get("SCREENSHOT")
else:
hook = getattr(recipe_meta, "SCREENSHOT", None)
return hook if callable(hook) else None
@ -67,10 +125,11 @@ def capture(domain: str, out_path: str, *, recipe_meta: dict | None = None) -> s
if hook is not None:
# Recipe-specific safe view (post-login etc.). The hook owns navigation +
# the no-secret-page guarantee; it should call page.screenshot itself, but if
# it doesn't, we still snap the resulting page below.
hook(page, domain, recipe_meta)
# it doesn't, we still snap the resulting page below. SCREENSHOT(page, ctx) —
# the uniform ctx convention (rcust P3).
hook(page, meta_mod.hook_ctx(domain, recipe_meta))
if not os.path.exists(out_path):
page.screenshot(path=out_path, full_page=False)
_snap_with_blank_retry(page, out_path)
else:
# Default: landing page. Accept any rendered status (200 or an auth redirect to a
# login form) — both are credential-free and representative of "the app is up".
@ -81,7 +140,9 @@ def capture(domain: str, out_path: str, *, recipe_meta: dict | None = None) -> s
deadline_seconds=NAV_DEADLINE_S,
wait_until="domcontentloaded",
)
page.screenshot(path=out_path, full_page=False)
# SPA paint race fix (phase-shot): settle before snapping, retry a blank frame.
_settle(page, SETTLE_TIMEOUT_MS)
_snap_with_blank_retry(page, out_path)
finally:
browser.close()
if os.path.exists(out_path) and os.path.getsize(out_path) > 0:
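
The wait-budget arithmetic stated in the comments of this hunk can be checked directly. The constants below are copied from the diff; `worst_case_wait_s` and `needs_retry` are hypothetical helper names used only for this sketch, and the sample frame sizes are the ones quoted in the `BLANK_SIZE_BYTES` comment.

```python
# Constants as they appear in the diff above.
NAV_DEADLINE_S = 45
SETTLE_TIMEOUT_MS = 10_000
RENDER_GRACE_MS = 500
BLANK_SIZE_BYTES = 10_000
BLANK_RETRY_SETTLE_MS = 4_000


def worst_case_wait_s() -> float:
    """Upper bound on bounded waiting: navigation, first settle, one retry settle."""
    first_settle_ms = SETTLE_TIMEOUT_MS + RENDER_GRACE_MS
    retry_settle_ms = BLANK_RETRY_SETTLE_MS + RENDER_GRACE_MS
    return NAV_DEADLINE_S + (first_settle_ms + retry_settle_ms) / 1000


def needs_retry(png_size_bytes: int) -> bool:
    """A frame below the threshold is treated as blank or spinner-only."""
    return png_size_bytes < BLANK_SIZE_BYTES


print(worst_case_wait_s())  # 45 + 10.5 + 4.5 = 60.0
print(needs_retry(4801), needs_retry(12950))
```

The threshold sits between the largest observed spinner frame (8.8 KB) and the smallest real page (12950 B), so both sample sizes from the audit fall on the expected side.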


@ -58,6 +58,12 @@ from harness import ( # noqa: E402
from harness import ( # noqa: E402
deps as deps_mod,
)
from harness import ( # noqa: E402
manifest as manifest_mod,
)
from harness import ( # noqa: E402
meta as meta_mod,
)
from harness import ( # noqa: E402
results as results_mod,
)
@ -70,7 +76,7 @@ ALL_STAGES = ("install", "upgrade", "backup", "restore", "custom")
def sso_dep_unverified(declared, deps_ready: bool, requires_deps_skipped: int) -> bool:
"""F2-11 gate predicate (pure, unit-tested). True when a recipe declares DEPS but its
setup_custom_tests failed (deps not ready) AND that caused ≥1 `requires_deps` (SSO/OIDC) test
dep provisioning failed (deps not ready) AND that caused ≥1 `requires_deps` (SSO/OIDC) test
to SKIP. In that case the recipe's characteristic SSO claim was NOT verified, so the run must
NOT report GREEN — even though a skip-only pytest file exits 0 and leaves every tier 'pass'.
Generic-tier failure-isolation is preserved (those results stand); only the green SIGNAL is
@ -247,52 +253,29 @@ def snapshot_recipe_tests(recipe: str) -> str | None:
return dst
def _load_meta(recipe: str) -> dict:
"""Mirror tests/conftest._recipe_meta so the orchestrator's deploy/wait uses the same per-recipe
config the tiers see (timeouts, health path/codes)."""
meta = {
"HEALTH_PATH": "/",
"HEALTH_OK": (200, 301, 302),
"DEPLOY_TIMEOUT": 600,
"HTTP_TIMEOUT": 300,
}
path = os.path.join(ROOT, "tests", recipe, "recipe_meta.py")
if os.path.exists(path):
ns: dict = {}
with open(path) as fh:
exec(compile(fh.read(), path, "exec"), ns) # noqa: S102 (trusted, in-repo)
for k in list(meta) + [
"BACKUP_CAPABLE",
"SKIP_GENERIC",
"EXPECTED_NA",
"OIDC_AT_INSTALL",
"READY_PROBE",
"UPGRADE_BASE_VERSION",
"BACKUP_VERIFY",
"UPGRADE_EXTRA_ENV",
]:
if k in ns:
meta[k] = ns[k]
return meta
def _tier_env(domain: str) -> dict:
return dict(os.environ, CCCI_APP_DOMAIN=domain, CCCI_BASE_URL=f"https://{domain}")
def _skip_generic(op: str, meta: dict) -> bool:
def skip_generic_env_overrides() -> list[str]:
"""Active CCCI_SKIP_GENERIC* env overrides (rcust P2c: the meta key is deleted; the env form
is a documented LOCAL-DEV-ONLY escape hatch). Surfaced loudly when set in a CI (drone) run —
it reduces generic-floor coverage and must never silently ride a CI verdict."""
return sorted(
k for k in os.environ if k.startswith("CCCI_SKIP_GENERIC") and _truthy(os.environ.get(k))
)
def _skip_generic(op: str) -> bool:
"""Whether the generic assertion for `op` is opted out (Phase 1e HC3). Default: run (additive).
Opt-out, any of: env CCCI_SKIP_GENERIC (all ops), env CCCI_SKIP_GENERIC_<OP>, or the recipe's
declarative recipe_meta.SKIP_GENERIC list (op name, or "all"/"*")."""
Opt-out via env only (dev-only escape hatch, P2c): CCCI_SKIP_GENERIC (all ops) or
CCCI_SKIP_GENERIC_<OP>. The recipe_meta SKIP_GENERIC key is deleted (zero users)."""
if _truthy(os.environ.get("CCCI_SKIP_GENERIC")):
return True
if _truthy(os.environ.get(f"CCCI_SKIP_GENERIC_{op.upper()}")):
return True
sg = [str(s).lower() for s in (meta.get("SKIP_GENERIC") or [])]
return "all" in sg or "*" in sg or op in sg
return _truthy(os.environ.get(f"CCCI_SKIP_GENERIC_{op.upper()}"))
def _run_pre_hook(recipe: str, op: str, repo_local: str | None, domain: str, meta: dict) -> None:
def _run_pre_hook(recipe: str, op: str, repo_local: str | None, domain: str, meta) -> None:
"""Run the optional pre-op seed hook (recipe ops.py `pre_<op>`) BEFORE the harness performs the
op (HC3 op/assertion split): overlays seed data-continuity markers / the backup→restore mutation
here, then assert post-op in test_<op>.py. cc-ci's ops.py is trusted; a repo-local ops.py is
@ -309,7 +292,11 @@ def _run_pre_hook(recipe: str, op: str, repo_local: str | None, domain: str, met
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
print(f" pre-op seed ({source}): {os.path.relpath(path, ROOT)}::pre_{op}", flush=True)
getattr(mod, f"pre_{op}")(domain, meta)
fn = getattr(mod, f"pre_{op}")
# Uniform ctx convention (rcust P3): pre_<op>(ctx). A legacy (domain, meta) hook fails
# HERE with a clear migration message, not a TypeError mid-call.
meta_mod.check_hook_signature(fn, ("ctx",), f"{os.path.relpath(path, ROOT)}::pre_{op}")
fn(meta_mod.hook_ctx(domain, meta, op=op))
finally:
if d in sys.path:
sys.path.remove(d)
@ -322,7 +309,7 @@ def _perform_op(
head_ref: str | None,
op_state: dict,
deploy_timeout: int = 900,
meta: dict | None = None,
meta=None,
) -> None:
"""Perform the single mutating op ONCE (the harness owns the op, HC3). install has no op. Records
what the assertions need (pre-upgrade identity, backup snapshot_id) into op_state. None of these
@ -345,9 +332,10 @@ def _perform_op(
# verify fails we re-run the WHOLE backup (fresh restic snapshot) with a re-stabilised DB, up to
# 3 attempts. Recipes without BACKUP_VERIFY are unaffected (single backup, as before).
snap = generic.perform_backup(domain)
verify = meta.get("BACKUP_VERIFY") if meta else None
verify = meta.BACKUP_VERIFY if meta else None
verify_ctx = meta_mod.hook_ctx(domain, meta, op="backup") if meta else None
attempt = 1
while callable(verify) and not verify(domain) and attempt < 3:
while callable(verify) and not verify(verify_ctx) and attempt < 3:
attempt += 1
print(
f" backup-verify FAILED (attempt {attempt - 1}/3) — backup did not capture the "
@ -355,7 +343,7 @@ def _perform_op(
flush=True,
)
snap = generic.perform_backup(domain)
if callable(verify) and not verify(domain):
if callable(verify) and not verify(verify_ctx):
print(
f" !! backup-verify still FAILED after {attempt} attempts — backup is incomplete",
flush=True,
@ -371,7 +359,7 @@ def run_lifecycle_tier(
op: str,
repo_local: str | None,
domain: str,
meta: dict,
meta,
head_ref: str | None,
op_state: dict,
records: list[dict] | None = None,
@ -386,7 +374,7 @@ def run_lifecycle_tier(
a {tier,source,file,rc,junit} record appended, so the run can assemble per-stage/per-test
results.json + the level afterwards. Purely additive — does not change the verdict."""
overlay = discovery.resolve_overlay_op(recipe, op, repo_local)
skip_gen = _skip_generic(op, meta)
skip_gen = _skip_generic(op)
files: list[tuple[str, str]] = []
if not skip_gen:
files.append(discovery.generic_op(op))
@ -411,7 +399,7 @@ def run_lifecycle_tier(
recipe,
head_ref,
op_state,
deploy_timeout=int(meta.get("DEPLOY_TIMEOUT", 900)),
deploy_timeout=int(meta.DEPLOY_TIMEOUT),
meta=meta,
)
with open(os.environ["CCCI_OP_STATE_FILE"], "w") as f:
@ -449,7 +437,7 @@ def run_lifecycle_tier(
def _enrich_deps_with_sso(parent_recipe: str, parent_domain: str, deps_list) -> dict[str, dict]:
"""For each dep, set up a fresh realm/client + test user via the harness's provider-specific
setup function, then return a recipe→entry dict carrying domain + admin + realm/client/user
info — the shape the `setup_custom_tests.sh` hook (and dependent tests) read.
info — the shape the `install_steps.sh` hook (and dependent tests) read.
Provider routing: today only `keycloak` is supported. authentik will need a parallel
`setup_authentik_realm` when an authentik-dep recipe enrolls (DEFERRED.md #9).
@ -463,7 +451,7 @@ def _enrich_deps_with_sso(parent_recipe: str, parent_domain: str, deps_list) ->
if not dep_recipe or not dep_domain:
continue
if dep_recipe != "keycloak":
# Provider not yet supported — record bare entry; setup_custom_tests.sh / tests will
# Provider not yet supported — record bare entry; install_steps.sh / tests will
# raise if they need realm/client info they don't see.
out[dep_recipe] = entry
continue
@ -507,12 +495,10 @@ def _provision_deps(
Splits deps into live-warm (shared provider at a stable domain + a per-run realm) vs cold
(co-deployed per run), provisions each dep's SSO realm/client/user, and persists the enriched
dict the `setup_custom_tests.sh`/`install_steps.sh` hooks + dependent tests read. Raises on any
failure (the caller marks deps-not-ready). Used by BOTH wiring paths:
- post-deploy (legacy): provision AFTER generic tiers, then `setup_custom_tests.sh` does an
in-place OIDC redeploy.
- install-time (`OIDC_AT_INSTALL`, Q3.2a): provision BEFORE the single deploy so the
install-tier `install_steps.sh` hook wires OIDC env into that one deploy — no reconverge.
dict the `install_steps.sh` hooks + dependent tests read. Raises on any failure (the caller
marks deps-not-ready). Install-time wiring is the ONLY mode (rcust P2b): provision BEFORE the
single deploy so the install-tier `install_steps.sh` hook wires OIDC env into that one deploy —
no reconverge, no post-deploy `setup_custom_tests.sh` machinery.
"""
warm_deps, cold_deps = [], []
for d in declared:
@ -523,7 +509,7 @@ def _provision_deps(
if wd:
print(f" dep: {d} warm provider {wd} not up — cold fallback", flush=True)
cold_deps.append(d)
dep_metas = {d: _load_meta(d) for d in cold_deps}
dep_metas = {d: meta_mod.load(d) for d in cold_deps}
deps_list = (
deps_mod.deploy_deps(recipe, os.environ.get("PR", "0"), ref, cold_deps, meta_for=dep_metas)
if cold_deps
@ -541,32 +527,6 @@ def _provision_deps(
return deps_state
def _run_setup_custom_tests_hook(recipe: str, domain: str, deps_file: str) -> None:
"""Run `tests/<recipe>/setup_custom_tests.sh` if present (operator-2026-05-28 SSO-dep plan
§3.2). The hook reads `$CCCI_DEPS_FILE`, sets OIDC env via `abra app config set` + secret
insert, and triggers an in-place `abra app deploy --force --chaos`. Failure here propagates
to mark deps-not-ready (caught in main())."""
path = os.path.join(ROOT, "tests", recipe, "setup_custom_tests.sh")
if not os.path.isfile(path):
# No hook = recipe doesn't need post-deps wiring; deps are deployed + creds available
# via deps_apps fixture as-is.
print(
f" setup_custom_tests: no hook at {os.path.relpath(path, ROOT)} (deps creds ready in $CCCI_DEPS_FILE)",
flush=True,
)
return
print(f" setup_custom_tests hook: {os.path.relpath(path, ROOT)}", flush=True)
rc = subprocess.run(
["bash", path],
check=False,
env=dict(os.environ, CCCI_APP_DOMAIN=domain, CCCI_RECIPE=recipe, CCCI_DEPS_FILE=deps_file),
)
if rc.returncode != 0:
raise RuntimeError(
f"setup_custom_tests.sh exited {rc.returncode} (deps env not wired into parent)"
)
def run_custom(
recipe: str,
repo_local: str | None,
@ -609,7 +569,7 @@ def _wait_undeployed(domain: str, timeout: int = 120) -> None:
def run_quick(
recipe: str, ref: str | None, head_ref: str | None, repo_local: str | None, meta: dict
recipe: str, ref: str | None, head_ref: str | None, repo_local: str | None, meta
) -> int:
"""WC4 `--quick` opt-in fast lane (plan §2). Reattach the data-warm canonical (known-good volume)
→ upgrade IN PLACE to the PR head (chaos) → assert generic UPGRADE (reconverge+moved+serving) +
@ -645,7 +605,7 @@ def run_quick(
op_state: dict = {}
results: dict[str, str] = {}
declared = deps_mod.declared_deps(recipe)
declared = list(meta.DEPS)
deps_state: dict = {}
deps_ready = True
deps_not_ready_reason = ""
@ -657,28 +617,32 @@ def run_quick(
try:
# 1) reattach the canonical (warm boot at the known-good version + retained volume)
try:
canonical.deploy_canonical(recipe, timeout=int(meta.get("DEPLOY_TIMEOUT", 900)))
canonical.deploy_canonical(recipe, timeout=int(meta.DEPLOY_TIMEOUT))
lifecycle.wait_healthy(
domain,
ok_codes=tuple(meta["HEALTH_OK"]),
path=meta["HEALTH_PATH"],
deploy_timeout=meta["DEPLOY_TIMEOUT"],
http_timeout=meta["HTTP_TIMEOUT"],
ok_codes=tuple(meta.HEALTH_OK),
path=meta.HEALTH_PATH,
deploy_timeout=meta.DEPLOY_TIMEOUT,
http_timeout=meta.HTTP_TIMEOUT,
)
warm_ok = True
except Exception as e: # noqa: BLE001
print(f"!! canonical reattach/readiness failed: {_scrub(str(e))}", flush=True)
if warm_ok:
# 2) deps (warm keycloak + per-run realm) — mirrors main()'s warm/cold split
# 2) deps (warm keycloak + per-run realm) — mirrors main()'s warm/cold split. NB
# (rcust P2b): deps are provisioned (realm/creds in $CCCI_DEPS_FILE) but quick mode
# cannot do install-time OIDC env wiring — the canonical app pre-exists its per-run
# realm. No quick-enrolled recipe declares DEPS today; if one ever does, its
# requires_deps tests will exercise creds-only flows or skip (F2-11 keeps the signal).
if declared:
print(f"\n===== setup_custom_tests (quick): deps {declared} =====", flush=True)
print(f"\n===== deps (quick): {declared} =====", flush=True)
try:
warm_deps, cold_deps = [], []
for d in declared:
wd = warm.warm_domain(d)
(warm_deps if (wd and warm.is_warm_up(d, wd)) else cold_deps).append(d)
dep_metas = {d: _load_meta(d) for d in cold_deps}
dep_metas = {d: meta_mod.load(d) for d in cold_deps}
deps_list = (
deps_mod.deploy_deps(
recipe, os.environ.get("PR", "0"), ref, cold_deps, meta_for=dep_metas
@ -693,12 +657,11 @@ def run_quick(
print(f" dep: using live-warm {d} @ {wd} (per-run realm)", flush=True)
deps_state = _enrich_deps_with_sso(recipe, domain, deps_list)
deps_mod.write_run_state(deps_state)
_run_setup_custom_tests_hook(recipe, domain, depsfile)
except Exception as e: # noqa: BLE001
deps_ready = False
deps_not_ready_reason = _scrub(str(e))[:300]
print(
f"!! setup_custom_tests failed (deps-not-ready): {deps_not_ready_reason}",
f"!! dep provisioning failed (deps-not-ready): {deps_not_ready_reason}",
flush=True,
)
@ -813,7 +776,7 @@ def run_quick(
overall = 1
if sso_unverified:
print(
f"!! DEPS={declared} but setup_custom_tests failed and {requires_deps_skipped} "
f"!! DEPS={declared} but dep provisioning failed and {requires_deps_skipped} "
"requires_deps SKIPPED — SSO NOT verified (F2-11)",
file=sys.stderr,
)
@ -848,7 +811,7 @@ def promote_canonical(recipe: str, head_ref: str | None) -> None:
if not latest:
print(f"WC5 promote: no version tags for {recipe} — skip", flush=True)
return
meta = _load_meta(recipe)
meta = meta_mod.load(recipe)
# The cold run's deploy-count was already asserted + the countfile removed; don't perturb it.
os.environ.pop("CCCI_DEPLOY_COUNT_FILE", None)
print(
@ -860,14 +823,15 @@ def promote_canonical(recipe: str, head_ref: str | None) -> None:
domain,
version=latest,
secrets=True,
deploy_timeout=int(meta.get("DEPLOY_TIMEOUT", 900)),
deploy_timeout=int(meta.DEPLOY_TIMEOUT),
meta=meta,
)
lifecycle.wait_healthy(
domain,
ok_codes=tuple(meta["HEALTH_OK"]),
path=meta["HEALTH_PATH"],
deploy_timeout=meta["DEPLOY_TIMEOUT"],
http_timeout=meta["HTTP_TIMEOUT"],
ok_codes=tuple(meta.HEALTH_OK),
path=meta.HEALTH_PATH,
deploy_timeout=meta.DEPLOY_TIMEOUT,
http_timeout=meta.HTTP_TIMEOUT,
)
abra.undeploy(domain)
_wait_undeployed(domain)
@ -896,6 +860,17 @@ def main() -> int:
print(
f"== cc-ci run: recipe={recipe} ref={ref} pr={os.environ.get('PR', '0')} stages={sorted(stages)}"
)
# P2c: the CCCI_SKIP_GENERIC* env escape hatch is LOCAL-DEV-ONLY. If it rides a CI (drone)
# run, shout — generic-floor coverage is reduced and the verdict must not look routine.
for ov in skip_generic_env_overrides():
if os.environ.get("DRONE"):
print(
f"!! {ov}=1 — dev-only generic-floor override ACTIVE IN A CI RUN; generic "
"assertions are suppressed for the affected op(s). This must never gate a merge.",
flush=True,
)
else:
print(f"== {ov}=1 (dev-only generic-floor override active)", flush=True)
# Concurrent-run safety is structural: this run's recipe trees live in its own ABRA_DIR
# (exported here, before ANY abra call), so no recipe-tree lock exists; same-DOMAIN runs
# serialise on the app-domain flock taken in deploy_app (see docs/concurrency.md).
@ -906,7 +881,13 @@ def main() -> int:
# HEAD (the catalogue current) for a non-PR `!testme`. Captured before any version-tag checkout.
head_ref = ref or lifecycle.recipe_head_commit(recipe)
repo_local = snapshot_recipe_tests(recipe)
meta = _load_meta(recipe)
meta = meta_mod.load(recipe)
# Customization manifest (rcust P5, R4): ONE block answering "what does this recipe
# customize?" across all surfaces — printed here and embedded verbatim in results.json under
# "customization". Pure presentation; never influences a verdict.
customization = manifest_mod.build(recipe, meta, repo_local)
print("\n" + manifest_mod.render(recipe, customization) + "\n", flush=True)
# WC4/WC7: opt-in `--quick` fast lane. Requires an existing data-warm canonical; if none, fall
# back cleanly to the full COLD run below so the PR is still tested (DECISIONS Phase-2w).
@ -929,9 +910,7 @@ def main() -> int:
# override must be an exact published version tag (deployed as a pinned base). (Adversary §7.1.)
want_upgrade = "upgrade" in stages
prev = (
(meta.get("UPGRADE_BASE_VERSION") or lifecycle.previous_version(recipe))
if want_upgrade
else None
(meta.UPGRADE_BASE_VERSION or lifecycle.previous_version(recipe)) if want_upgrade else None
)
base = prev or target
backup_cap = generic.backup_capable(recipe, meta)
@ -960,10 +939,8 @@ def main() -> int:
os.environ["CCCI_OP_STATE_FILE"] = statefile
op_state: dict = {}
# Run-scoped dep state (Phase 2 Q2.3, refined per operator-2026-05-28 SSO-dep plan §1):
# deps now deploy AFTER generic tiers (between RESTORE and CUSTOM) so a failed dep deploy
# cannot break the generic-tier signal. The `setup_custom_tests` step deploys each dep + runs
# `tests/<recipe>/setup_custom_tests.sh` to wire OIDC env via in-place redeploy.
# Run-scoped dep state (Phase 2 Q2.3; install-time-only since rcust P2b): deps are provisioned
# BEFORE the single deploy so install_steps.sh wires OIDC env into that one deploy.
# `$CCCI_DEPS_FILE` is written with the full creds dict the hook script needs (jq-readable).
depsfile = _run_state_path("deps") + ".json"
with open(depsfile, "w") as f:
@ -974,15 +951,9 @@ def main() -> int:
with contextlib.suppress(OSError):
os.remove(skipfile)
os.environ["CCCI_DEPS_SKIP_REPORT"] = skipfile
declared = deps_mod.declared_deps(recipe)
# Q3.2a: a recipe that tolerates OIDC env at first boot AND whose deps are live-warm wires OIDC
# at INSTALL time (provision the realm BEFORE the single deploy; install_steps.sh writes the env
# into it) instead of the post-deploy in-place `--chaos` redeploy — which is flaky on the heavy
# 12-service lasuite-drive stack (collabora WOPI race; see JOURNAL Step 0). Opt-in per recipe.
oidc_at_install = bool(meta.get("OIDC_AT_INSTALL")) and bool(declared)
declared = list(meta.DEPS)
if declared:
when = "BEFORE deploy (install-time OIDC)" if oidc_at_install else "AFTER generic tiers"
print(f"\n===== DEPS declared (provision {when}): {declared} =====", flush=True)
print(f"\n===== DEPS declared (provision BEFORE deploy): {declared} =====", flush=True)
deps_state: dict[str, dict] = {} # new shape: recipe→entry dict (sso-dep plan §1)
deps_ready = True
deps_not_ready_reason: str = ""
@ -996,7 +967,7 @@ def main() -> int:
# install_steps.sh can read $CCCI_DEPS_FILE and wire the OIDC env into that one deploy. On
# failure we mark deps-not-ready but STILL deploy the recipe alone (install_steps.sh no-ops
# on an empty deps file) so the generic tiers run; the OIDC custom test then skips → F2-11. ----
if oidc_at_install:
if declared:
print(
f"\n===== install-time OIDC: provisioning deps {declared} BEFORE deploy =====",
flush=True,
@ -1023,18 +994,21 @@ def main() -> int:
version=base,
secrets=True,
install_steps_hook=hook,
deploy_timeout=int(meta.get("DEPLOY_TIMEOUT", 900)),
deploy_timeout=int(meta.DEPLOY_TIMEOUT),
meta=meta,
)
lifecycle.wait_healthy(
domain,
ok_codes=tuple(meta["HEALTH_OK"]),
path=meta["HEALTH_PATH"],
deploy_timeout=meta["DEPLOY_TIMEOUT"],
http_timeout=meta["HTTP_TIMEOUT"],
ok_codes=tuple(meta.HEALTH_OK),
path=meta.HEALTH_PATH,
deploy_timeout=meta.DEPLOY_TIMEOUT,
http_timeout=meta.HTTP_TIMEOUT,
)
# Recipe READY_PROBE (e.g. lasuite-drive collabora WOPI discovery) — readiness beyond
# replica convergence + app HEALTH_PATH; no-op for recipes without one.
lifecycle.wait_ready_probes(meta, domain, timeout=int(meta.get("DEPLOY_TIMEOUT", 900)))
lifecycle.wait_ready_probes(
meta, domain, timeout=int(meta.DEPLOY_TIMEOUT), op="install"
)
deploy_ok = True
except Exception as e: # noqa: BLE001 — a failed deploy is a reported INSTALL failure
print(f"!! deploy/readiness failed: {e}", flush=True)
@ -1131,41 +1105,11 @@ def main() -> int:
if backup_cap
else "skip"
)
# ---- setup_custom_tests step (NEW, operator-2026-05-28 SSO-dep plan §3.2) ----
# Deploy each declared dep + wire OIDC env into the parent app via the per-recipe
# setup_custom_tests.sh hook + in-place redeploy. Failure here marks deps-not-ready
# but does NOT abort the run — @pytest.mark.requires_deps tests skip with reason;
# non-deps custom tests still run normally.
if declared and not oidc_at_install:
# LEGACY post-deploy path: provision deps AFTER generic tiers, then wire OIDC env
# into the parent via the setup_custom_tests.sh hook + an in-place `--chaos` redeploy.
print("\n===== setup_custom_tests: deps + OIDC wiring =====", flush=True)
try:
deps_state = _provision_deps(recipe, domain, ref, declared)
# Run the per-recipe post-deps hook (jq-driven OIDC wiring + in-place redeploy)
_run_setup_custom_tests_hook(recipe, domain, depsfile)
except Exception as e: # noqa: BLE001 — setup failure is ISOLATED to dep-marked tests
deps_ready = False
deps_not_ready_reason = _scrub(str(e))[:300]
print(
f"!! setup_custom_tests failed (deps-not-ready): {deps_not_ready_reason}",
flush=True,
)
elif declared and oidc_at_install and deps_ready:
# INSTALL-TIME path (Q3.2a): deps were provisioned BEFORE the single deploy and the
# install-tier install_steps.sh hook already wired OIDC env into that one deploy —
# so NO re-provision, NO reconverge here. Run only the post-deploy setup hook
# (e.g. lasuite-drive's minio-createbuckets one-shot), which needs the live stack.
print("\n===== post-deploy setup (OIDC already wired at install) =====", flush=True)
try:
_run_setup_custom_tests_hook(recipe, domain, depsfile)
except Exception as e: # noqa: BLE001 — isolated to dep-marked / state-dependent tests
deps_ready = False
deps_not_ready_reason = _scrub(str(e))[:300]
print(
f"!! post-deploy setup failed: {deps_not_ready_reason}",
flush=True,
)
# (rcust P2b: install-time deps wiring is the ONLY mode — deps were provisioned BEFORE
# the single deploy and install_steps.sh wired the OIDC env into it. The legacy
# post-deploy provisioning + setup_custom_tests.sh redeploy machinery is deleted; a
# recipe's post-deploy seeding belongs in ops.py pre_install, e.g. lasuite-drive's
# MinIO bucket one-shot.)
# ---- CUSTOM tier ----
if "custom" in stages:
@ -1240,8 +1184,7 @@ def main() -> int:
# ---- per-op summary (DG6 feed) ----
# SSO-dep plan §1: DG4.1 generalised — one `abra app new` per app in the run (recipe + each
# COLD dep). In-place reconfigure-and-redeploy (the setup_custom_tests step's
# `abra app deploy --force --chaos`) is NOT a fresh `app_new` and does NOT increment the count.
# COLD dep). Chaos redeploys are NOT a fresh `app_new` and do NOT increment the count.
# WC1: a live-warm dep (keycloak) is NOT deployed by the run — it only gets a per-run realm — so
# warm deps contribute 0. So expected = 1 + (number of COLD deps that actually got deployed).
_dep_entries = deps_state.values() if isinstance(deps_state, dict) else (deps_state or [])
@ -1282,12 +1225,12 @@ def main() -> int:
overall = 1
if any(v == "fail" for v in results.values()):
overall = 1
# F2-11: a deps-declaring recipe whose setup_custom_tests failed has NOT verified its SSO/OIDC
# F2-11: a deps-declaring recipe whose dep provisioning failed has NOT verified its SSO/OIDC
# claim — its requires_deps tests SKIPPED (a skip-only file exits 0, so without this the run
# would report GREEN). Fail the run for that recipe; generic-tier results above are untouched.
if sso_dep_unverified(declared, deps_ready, requires_deps_skipped):
print(
f"!! recipe declares DEPS={declared} but setup_custom_tests failed and "
f"!! recipe declares DEPS={declared} but dep provisioning failed and "
f"{requires_deps_skipped} requires_deps (SSO) test(s) were SKIPPED — SSO claim NOT "
f"verified; failing run (F2-11). deps-not-ready: {deps_not_ready_reason}",
file=sys.stderr,
@ -1314,7 +1257,8 @@ def main() -> int:
no_secret_leak=True, # narrowed below by an actual scan of the serialised artifact
screenshot=screenshot_rel, # Phase 3 U1 (R4): relative PNG name iff capture succeeded
finished_ts=time.time(),
expected_na=meta.get("EXPECTED_NA"), # declared intentional-skip map (recipe_meta)
expected_na=meta.EXPECTED_NA, # declared intentional-skip map (recipe_meta)
customization=customization, # rcust P5: the run-start manifest, verbatim
)
# Real (if narrow) leak check: no known infra-secret value may appear in the artifact (R7).
blob = json.dumps(data)
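
The uniform `pre_<op>(ctx)` convention in this file relies on `meta_mod.check_hook_signature` and `meta_mod.hook_ctx`, whose bodies are not part of this diff. A minimal sketch of how such a check could work, assuming it compares parameter names through `inspect.signature`; the names `check_hook_signature_sketch` and `hook_ctx_sketch`, the error wording, and the example domain are hypothetical, not the harness's real implementation:

```python
import inspect
from types import SimpleNamespace


def check_hook_signature_sketch(fn, expected: tuple, label: str) -> None:
    """Raise a readable error when a hook's parameter names differ from `expected`."""
    actual = tuple(inspect.signature(fn).parameters)
    if actual != tuple(expected):
        raise TypeError(
            f"{label}: expected signature ({', '.join(expected)}), "
            f"found ({', '.join(actual)}); migrate the hook to the ctx convention"
        )


def hook_ctx_sketch(domain: str, meta, op: str) -> SimpleNamespace:
    """Bundle the per-op facts a hook needs into one attribute-access object."""
    return SimpleNamespace(domain=domain, meta=meta, op=op)


def pre_backup(ctx):  # new-style hook: single ctx argument
    return f"seeded {ctx.domain} before {ctx.op}"


def legacy_pre_backup(domain, meta):  # old-style hook: rejected before it is called
    return domain


check_hook_signature_sketch(pre_backup, ("ctx",), "ops.py::pre_backup")
result = pre_backup(hook_ctx_sketch("app.example.test", meta=None, op="backup"))

try:
    check_hook_signature_sketch(legacy_pre_backup, ("ctx",), "ops.py::pre_backup")
    legacy_rejected = False
except TypeError:
    legacy_rejected = True
```

Checking the signature before the call means a legacy `(domain, meta)` hook is reported with its file and function name, instead of surfacing as an argument-count `TypeError` from inside the call.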

scripts/gen-meta-docs.py Normal file

@ -0,0 +1,71 @@
#!/usr/bin/env python3
"""Render the harness.meta KEYS registry to the markdown key-reference table in
docs/recipe-customization.md §4 (rcust P1.5; kills the R5 doc-drift class).

Usage:
    python3 scripts/gen-meta-docs.py          # rewrite the table in-place between the markers
    python3 scripts/gen-meta-docs.py --print  # print the rendered table to stdout (used by the
                                              # doc-sync unit test, tests/unit/test_meta.py)

The table lives between `<!-- META-TABLE-START -->` / `<!-- META-TABLE-END -->` markers; a unit
test asserts the committed table equals this rendering, so editing it by hand fails CI.
"""

from __future__ import annotations

import os
import sys

ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.insert(0, os.path.join(ROOT, "runner"))

from harness.meta import KEYS  # noqa: E402

DOC = os.path.join(ROOT, "docs", "recipe-customization.md")
START = "<!-- META-TABLE-START -->"
END = "<!-- META-TABLE-END -->"


def _default_repr(v) -> str:
    if v is None:
        return "`None`"
    return f"`{v!r}`"


def render() -> str:
    lines = [
        START,
        "",
        "_This table is GENERATED from the `runner/harness/meta.py` KEYS registry by"
        " `scripts/gen-meta-docs.py` — do not edit by hand (a unit test pins the sync)._",
        "",
        "| Key | Type | Default | Meaning |",
        "|---|---|---|---|",
    ]
    for k in KEYS:
        doc = k.doc.replace("|", "\\|")
        name = f"`{k.name}`" + (" **(deprecated)**" if k.deprecated else "")
        lines.append(f"| {name} | `{k.type}` | {_default_repr(k.default)} | {doc} |")
    lines += ["", END]
    return "\n".join(lines)


def main() -> int:
    table = render()
    if "--print" in sys.argv:
        print(table)
        return 0
    with open(DOC) as f:
        text = f.read()
    if START not in text or END not in text:
        print(f"{DOC}: missing {START}/{END} markers", file=sys.stderr)
        return 1
    head, _, rest = text.partition(START)
    _, _, tail = rest.partition(END)
    with open(DOC, "w") as f:
        f.write(head + table + tail)
    print(f"{DOC}: key table rewritten from the registry ({len(KEYS)} keys)")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
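
The in-place rewrite in `main()` depends on two `str.partition` calls: the first keeps everything before the start marker, the second keeps everything after the end marker, and the rendered table (which carries its own markers) goes in between. A small standalone sketch of that step; the function name `replace_between_markers` and the sample document text are illustrative and not part of the script:

```python
START = "<!-- META-TABLE-START -->"
END = "<!-- META-TABLE-END -->"


def replace_between_markers(text: str, table: str) -> str:
    """Swap everything from START through END for `table`, which includes both markers."""
    if START not in text or END not in text:
        raise ValueError("missing table markers")
    head, _, rest = text.partition(START)  # head: text before the start marker
    _, _, tail = rest.partition(END)       # tail: text after the end marker
    return head + table + tail


doc = f"intro\n{START}\nstale row\n{END}\noutro\n"
fresh = f"{START}\nnew row\n{END}"
updated = replace_between_markers(doc, fresh)
again = replace_between_markers(updated, fresh)
```

Because the replacement re-emits both markers, running the rewrite a second time on its own output leaves the document unchanged, which is the property the doc-sync unit test described in the docstring can rely on.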


@ -9,14 +9,14 @@ sys.path.insert(0, os.path.dirname(__file__))
import _p4 # noqa: E402
def pre_upgrade(domain, meta):
_p4.create_account(domain)
def pre_upgrade(ctx):
_p4.create_account(ctx.domain)
def pre_backup(domain, meta):
_p4.create_account(domain)
def pre_backup(ctx):
_p4.create_account(ctx.domain)
def pre_restore(domain, meta):
_p4.delete_account(domain)
assert not _p4.account_exists(domain), "marker account delete did not take (pre_restore)"
def pre_restore(ctx):
_p4.delete_account(ctx.domain)
assert not _p4.account_exists(ctx.domain), "marker account delete did not take (pre_restore)"


@ -14,32 +14,7 @@ import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "runner"))
from harness import deps as deps_mod # noqa: E402
from harness import lifecycle, naming
def _short(s: str, n: int = 8) -> str:
return "".join(c for c in s if c.isalnum())[:n] or "local"
def _recipe_meta(recipe: str) -> dict:
"""Optional per-recipe config so enrolling a recipe needs NO shared-harness change (D5).
A recipe may ship tests/<recipe>/recipe_meta.py with any of: HEALTH_PATH (str),
HEALTH_OK (tuple of status codes), DEPLOY_TIMEOUT (int), HTTP_TIMEOUT (int)."""
path = os.path.join(os.path.dirname(__file__), recipe, "recipe_meta.py")
meta = {
"HEALTH_PATH": "/",
"HEALTH_OK": (200, 301, 302),
"DEPLOY_TIMEOUT": 600,
"HTTP_TIMEOUT": 300,
}
if os.path.exists(path):
ns: dict = {}
with open(path) as fh:
exec(compile(fh.read(), path, "exec"), ns) # noqa: S102 (trusted, in-repo)
for k in meta:
if k in ns:
meta[k] = ns[k]
return meta
from harness import meta as meta_mod # noqa: E402
@pytest.fixture(scope="session")
@ -48,18 +23,10 @@ def recipe() -> str:
@pytest.fixture(scope="session")
def app_domain(recipe) -> str:
# Docker swarm config/secret names = <stackname>_<res>_<ver> must be <= 64 chars, and
# stackname is the sanitized domain. ".ci.commoninternet.net" alone is 22 chars, so the
# subdomain label must stay short. Use <recipe[:4]>-<6hex(recipe|pr|ref)> — unique per run,
# collision-safe across recipes (full recipe in the hash), readable context lives in the
# Drone build params + PR comment. (Deviation from plan §4.0 long name; see DECISIONS.md.)
return naming.app_domain(recipe, os.environ.get("PR", "0"), os.environ.get("REF"))
@pytest.fixture(scope="session")
def meta(recipe) -> dict:
return _recipe_meta(recipe)
def meta(recipe):
"""The recipe's FULL validated customization (RecipeMeta, attribute access) via the single
loader (rcust P1 — previously this fixture saw only the 4 base keys, spec §8 R3)."""
return meta_mod.load(recipe)
@pytest.fixture(scope="session")
@ -73,32 +40,55 @@ def live_app() -> str:
return domain
@pytest.fixture(scope="session")
def deps_apps() -> dict[str, str]:
"""Phase 2 Q2.3 dependency-resolver contract (refined operator-2026-05-28 SSO-dep plan §1):
when a recipe declares `DEPS = [...]` in its `recipe_meta.py`, the orchestrator deploys each
dep AFTER the generic tiers (between RESTORE and CUSTOM) and persists their per-run identity
+ SSO creds to `$CCCI_DEPS_FILE`. Tests access the dep's per-run domain via this fixture.
For full SSO creds (realm/client/secret/admin) use the `deps_creds` fixture instead.
@pytest.fixture
def op_state() -> dict:
"""The orchestrator's run-scoped op context (rcust P4): versions, artifact paths — written to
`$CCCI_OP_STATE_FILE` after each lifecycle op (e.g. `{"upgrade": {"before": {...},
"head_ref": ...}, "backup": {"snapshot_id": ...}}`). Overlay tests read op facts from here
instead of hand-parsing env/JSON. Skips with a clear reason outside an orchestrator run."""
import json
Returns `{dep_recipe: domain}` (str→str). Empty when no deps declared OR deps-not-ready."""
path = os.environ.get("CCCI_OP_STATE_FILE")
if not path:
pytest.skip(
"CCCI_OP_STATE_FILE not set — op_state is only available under the orchestrator"
)
if not os.path.exists(path):
pytest.skip(f"op-state file missing ({path}) — orchestrator has not performed an op yet")
try:
with open(path) as f:
return json.load(f)
except ValueError:
pytest.skip(f"op-state file unreadable/not JSON ({path})")
class _DepEntry(dict):
"""One provisioned dep (full creds dict) with attribute sugar: `entry.domain`, `entry.realm`,
`entry.client_secret`, ... — dict-style access works too (rcust P2d)."""
def __getattr__(self, name):
try:
return self[name]
except KeyError as e:
raise AttributeError(name) from e
@pytest.fixture(scope="session")
def deps() -> dict[str, _DepEntry]:
"""The recipe's provisioned deps (rcust P2d — consolidates the old `deps_apps`+`deps_creds`
pair). When a recipe declares `DEPS = [...]` in its `recipe_meta.py`, the orchestrator
provisions each dep BEFORE the single deploy and persists per-run identity + SSO creds to
`$CCCI_DEPS_FILE`. `deps["keycloak"]` carries domain/realm/client_id/client_secret/user/
password/email/admin_user/admin_password/discovery_url/token_url/... (`.domain` etc. work as
attributes). Empty when no deps declared OR deps-not-ready — pair with
`@pytest.mark.requires_deps` so the F2-11 skip-report keeps the green signal honest."""
state = deps_mod.deps_as_dict(deps_mod.load_run_state())
return {r: e["domain"] for r, e in state.items() if e.get("domain")}
@pytest.fixture(scope="session")
def deps_creds() -> dict[str, dict]:
"""Full SSO-creds dict for each declared dep (operator-2026-05-28 SSO-dep plan §1).
`deps_creds["keycloak"]` returns the entry written by setup_custom_tests with keys
domain/realm/client_id/client_secret/user/password/email/admin_user/admin_password/
discovery_url/token_url/.... Use this in `@pytest.mark.requires_deps` tests that need to
authenticate via OIDC."""
return deps_mod.deps_as_dict(deps_mod.load_run_state())
return {r: _DepEntry(e) for r, e in state.items()}
def pytest_collection_modifyitems(config, items):
"""SSO-dep plan §4: tests marked `@pytest.mark.requires_deps` are skipped with reason
`deps-not-ready: <captured-err>` when the orchestrator's setup_custom_tests step failed
`deps-not-ready: <captured-err>` when the orchestrator's dep provisioning failed
(orchestrator sets CCCI_DEPS_READY=0 in env). Non-deps custom tests are unaffected.
This is failure-isolation per plan §1 — generic tiers cannot break the SSO-marked tests'
@ -131,40 +121,5 @@ def pytest_configure(config):
"""Register the `requires_deps` marker so pytest doesn't warn about it."""
config.addinivalue_line(
"markers",
"requires_deps: test requires DEPS-declared services + setup_custom_tests success.",
"requires_deps: test requires DEPS-declared services + dep provisioning success.",
)
def _wait_healthy(domain, meta):
lifecycle.wait_healthy(
domain,
ok_codes=tuple(meta["HEALTH_OK"]),
path=meta["HEALTH_PATH"],
deploy_timeout=meta["DEPLOY_TIMEOUT"],
http_timeout=meta["HTTP_TIMEOUT"],
)
@pytest.fixture
def deployed(recipe, app_domain, meta, request):
"""Function-scoped: deploy the current/$REF version healthy, guaranteed teardown after.
Used by stages that start from current (install/backup)."""
version = os.environ.get("VERSION") or None
lifecycle.janitor()
request.addfinalizer(lambda: lifecycle.teardown_app(app_domain))
lifecycle.deploy_app(recipe, app_domain, version=version)
_wait_healthy(app_domain, meta)
return app_domain
@pytest.fixture(scope="session")
def deployed_app(recipe, app_domain, meta):
"""Install stage: deploy the recipe and wait until healthy; tear down at session end."""
version = os.environ.get("VERSION") or None
lifecycle.janitor() # sweep orphans from crashed runs first
try:
lifecycle.deploy_app(recipe, app_domain, version=version, secrets=True)
_wait_healthy(app_domain, meta)
yield app_domain
finally:
lifecycle.teardown_app(app_domain)
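
The `_DepEntry` class introduced in this file gives each dep entry attribute access on top of normal dict access. A short sketch of the behaviour a test can expect, using a copy of the class under the name `DepEntry` and made-up values (the domain and realm below are placeholders, not real run output):

```python
class DepEntry(dict):
    """Dict with attribute sugar: entry.domain is entry["domain"]."""

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, so dict methods still win.
        try:
            return self[name]
        except KeyError as e:
            raise AttributeError(name) from e


entry = DepEntry({"domain": "kc.example.test", "realm": "ci-run-1"})
same = entry.domain == entry["domain"]
has_secret = hasattr(entry, "client_secret")  # missing key surfaces as AttributeError
```

Since `__getattr__` is a fallback, a key that shares its name with a dict method (for example `keys` or `items`) is only reachable through subscript access, such as `entry["keys"]`.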


@ -15,13 +15,13 @@ def _write(domain, val):
lifecycle.exec_in_app(domain, ["sh", "-c", f"echo {val} > {MARKER}"])
def pre_upgrade(domain, meta):
_write(domain, "upgrade-survives")
def pre_upgrade(ctx):
_write(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_write(domain, "original")
def pre_backup(ctx):
_write(ctx.domain, "original")
def pre_restore(domain, meta):
_write(domain, "mutated") # diverge so a successful restore is observable
def pre_restore(ctx):
_write(ctx.domain, "mutated") # diverge so a successful restore is observable


@ -7,9 +7,9 @@ DEPLOY_TIMEOUT = 600
HTTP_TIMEOUT = 600
def EXTRA_ENV(domain):
def EXTRA_ENV(ctx):
"""cryptpad needs a SANDBOX_DOMAIN distinct from the main DOMAIN (it serves user content from a
separate origin; the web router routes both). Derive a sibling subdomain under the same wildcard
(covered by the wildcard cert, so no cert work)."""
label, _, rest = domain.partition(".")
label, _, rest = ctx.domain.partition(".")
return {"SANDBOX_DOMAIN": f"{label}-sb.{rest}"}
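
To see what the migrated `EXTRA_ENV(ctx)` returns, it can be called with a stand-in context. In this sketch the `SimpleNamespace` stub and the domain `cryp-1a2b3c.ci.example.test` are assumptions for illustration; the real ctx object is built by the harness and may carry more fields.

```python
from types import SimpleNamespace


def EXTRA_ENV(ctx):
    # Split off the first DNS label and append "-sb" to it, keeping the parent zone.
    label, _, rest = ctx.domain.partition(".")
    return {"SANDBOX_DOMAIN": f"{label}-sb.{rest}"}


ctx = SimpleNamespace(domain="cryp-1a2b3c.ci.example.test")  # hypothetical per-run domain
env = EXTRA_ENV(ctx)
```

The sandbox name stays a single label under the same parent zone, which is why the docstring can say the existing wildcard certificate already covers it.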


@ -12,8 +12,8 @@ from harness import lifecycle
MARKER_PATH = "/usr/share/nginx/html/ci-marker.txt"
def pre_restore(domain: str, meta: dict) -> None:
def pre_restore(ctx) -> None:
"""Write 'mutated' to the marker before restore runs. If restore brings back the
snapshot (which has no marker — never seeded by pre_backup), the marker ends up
MISSING or 'mutated' after restore → test_restore_returns_state FAILS → restore=RED."""
lifecycle.exec_in_app(domain, ["sh", "-c", f"echo mutated > {MARKER_PATH}"])
lifecycle.exec_in_app(ctx.domain, ["sh", "-c", f"echo mutated > {MARKER_PATH}"])

@@ -11,5 +11,5 @@ from harness import lifecycle
MARKER_PATH = "/usr/share/nginx/html/ci-marker.txt"
def pre_restore(domain: str, meta: dict) -> None:
lifecycle.exec_in_app(domain, ["sh", "-c", f"echo mutated > {MARKER_PATH}"])
def pre_restore(ctx) -> None:
lifecycle.exec_in_app(ctx.domain, ["sh", "-c", f"echo mutated > {MARKER_PATH}"])

@@ -1,4 +1,4 @@
"""custom-html — pre-op seed hooks (Phase 1e HC3). The orchestrator runs `pre_<op>(domain, meta)`
"""custom-html — pre-op seed hooks (Phase 1e HC3). The orchestrator runs `pre_<op>(ctx)`
BEFORE it performs the op; the matching test_<op>.py asserts the post-op state (assertion-only).
nginx serves the volume at /usr/share/nginx/html, so the marker file survives an upgrade / a
@@ -17,16 +17,16 @@ def _write(domain: str, val: str) -> None:
lifecycle.exec_in_app(domain, ["sh", "-c", f"echo {val} > {MARKER_PATH}"])
def pre_upgrade(domain, meta):
def pre_upgrade(ctx):
# seed a marker before the upgrade so the overlay can prove the data survives it
_write(domain, "upgrade-survives")
_write(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
def pre_backup(ctx):
# establish a known original state before the backup op captures it
_write(domain, "original")
_write(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# diverge from the backed-up state so a successful restore (back to "original") is observable
_write(domain, "mutated")
_write(ctx.domain, "mutated")
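
The diff changes every hook from `pre_<op>(domain, meta)` to `pre_<op>(ctx)` but does not show the context type itself. A minimal sketch of how such a dispatch could look, assuming only that the context exposes `.domain` and `.meta` (the two attributes the hooks read); `HookContext`, `run_pre_hook` and `demo_hooks` are hypothetical names, not the orchestrator's API:

```python
from dataclasses import dataclass, field
from types import SimpleNamespace


@dataclass(frozen=True)
class HookContext:
    """Hypothetical context object; the diff only shows `.domain` and `.meta` being read."""

    domain: str
    meta: dict = field(default_factory=dict)


def run_pre_hook(hooks, op, ctx):
    """Call `pre_<op>(ctx)` on a hooks module or namespace if it defines one.

    Returns True when a hook ran, False when there is no hook for this op.
    """
    hook = getattr(hooks, f"pre_{op}", None)
    if hook is None:
        return False
    hook(ctx)
    return True


written = []  # stands in for the marker file the real hooks write
demo_hooks = SimpleNamespace(
    pre_backup=lambda ctx: written.append((ctx.domain, "original")),
)
```

Passing one object instead of positional arguments means new fields can be added later without touching every hook signature again.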

@@ -1,28 +0,0 @@
#!/usr/bin/env bash
# discourse — INSTALL-TIME hook (Phase 2 Q4.6). Runs during the install tier AFTER `abra app new` +
# EXTRA_ENV + `abra app secret generate` and BEFORE the single `abra app deploy`
# (lifecycle.py::_run_install_steps), with CCCI_RECIPE / CCCI_APP_DOMAIN in env.
#
# Purpose: provide the cc-ci re-pin+grace overlay (compose.ccci.yml) to the recipe checkout so the
# UPGRADE-tier BASE deploy (published 0.7.0+3.3.1, whose compose pins the Docker-Hub-removed
# `bitnami/discourse:3.3.1` and ships a too-tight 5m start_period) is deployable and can survive the
# 15-25min Rails cold boot — so upgrade-to-latest can run. See compose.ccci.yml's header for the full
# rationale. The overlay is referenced by recipe_meta COMPOSE_FILE; it is a cc-ci file (not part of the
# recipe), so copying it here makes it resolvable. It persists across the later `git checkout <head>`
# (untracked) so the head deploy also merges it (idempotent — the PR head already re-pins + ships 20m).
# CHAOS_BASE_DEPLOY=True is set so abra's pinned-deploy clean-tree check doesn't FATA on the overlay.
set -euo pipefail
: "${CCCI_RECIPE:?missing CCCI_RECIPE}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Resolve the recipe tree the way abra does: $ABRA_DIR (the per-run tree inside a CI run) else
# the canonical ~/.abra — the overlay must land in the tree this run actually deploys from.
RECIPE_DIR="${ABRA_DIR:-${HOME}/.abra}/recipes/${CCCI_RECIPE}"
if [ ! -d "$RECIPE_DIR" ]; then
echo " discourse install_steps: recipe dir $RECIPE_DIR missing — cannot provide compose.ccci.yml" >&2
exit 1
fi
cp "$SCRIPT_DIR/compose.ccci.yml" "$RECIPE_DIR/compose.ccci.yml"
echo " discourse install_steps: provided compose.ccci.yml (bitnamilegacy re-pin + 20m start_period grace) to recipe checkout (${CCCI_RECIPE})"

@@ -30,18 +30,18 @@ def _seed(domain, value):
assert got == value, f"seed did not commit (read back {got!r}, expected {value!r})"
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# diverge from the backup so a successful restore is observable
_psql(domain, "DROP TABLE IF EXISTS ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
_psql(ctx.domain, "DROP TABLE IF EXISTS ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"

@@ -29,11 +29,11 @@ HTTP_TIMEOUT = 1200
# (1) it pins the Docker-Hub-removed `bitnami/discourse:3.3.1` (404) → overlay re-pins app+sidekiq to
# `bitnamilegacy/discourse:3.3.1` (namespace-only, identical image), the same re-pin the PR makes;
# (2) its 5m start_period is too tight for the 15-25min Rails boot → overlay widens it to 20m (grace).
# install_steps.sh provides the overlay; CHAOS_BASE_DEPLOY skips the clean-tree gate on the untracked
# overlay; it persists across the head checkout (idempotent — the PR head already re-pins + ships 20m).
# The harness auto-provides the overlay to the checkout and auto-chaoses the base deploy
# (first-class compose.ccci.yml, rcust P2a); it persists across the head checkout (idempotent — the
# PR head already re-pins + ships 20m).
# Upgrade crossover: 0.7.0 (re-pinned base) → PR head; full assertions run on the HEAD. The 0.7.0
# *custom* tests are not separately run (custom tier runs once, on the head — policy §1 allows skip+record).
CHAOS_BASE_DEPLOY = True
UPGRADE_BASE_VERSION = "0.7.0+3.3.1"
EXTRA_ENV = {
"TIMEOUT": "3600", # abra's internal convergence wait; matches DEPLOY_TIMEOUT (slow Rails boot headroom)
@@ -41,7 +41,7 @@ EXTRA_ENV = {
}
def BACKUP_VERIFY(domain):
def BACKUP_VERIFY(ctx):
"""Post-backup integrity check (Q4.6, same race ghost F2-14b hit). The recipe's backupbot db
pre-hook (`/pg_backup.sh backup`) dumps the discourse postgres DB to `/var/lib/postgresql/data/
backup.sql` (gzip), then restic captures that path. On the loaded single CI node the db container
@@ -60,7 +60,7 @@ def BACKUP_VERIFY(domain):
try:
out = lifecycle.exec_in_app(
domain,
ctx.domain,
[
"sh",
"-c",

@@ -1,28 +0,0 @@
#!/usr/bin/env bash
# ghost — INSTALL-TIME hook (Phase 2 F2-14b). Runs during the install tier AFTER `abra app new` +
# EXTRA_ENV + `abra app secret generate` and BEFORE the single `abra app deploy`
# (lifecycle.py::_run_install_steps), with CCCI_RECIPE / CCCI_APP_DOMAIN in env.
#
# Purpose: provide the cc-ci start_period-grace overlay (compose.ccci.yml) to the recipe checkout so
# the UPGRADE-tier BASE deploy (a previous published version whose app healthcheck still ships the
# too-tight 1m start_period) can survive ghost's ~6-9min fresh-DB migration and converge. See
# compose.ccci.yml's header for the full rationale. The overlay is referenced by recipe_meta
# COMPOSE_FILE; copying it here (it is a cc-ci file, not part of the recipe) makes it resolvable.
# It persists across the later `git checkout <head>` (untracked) so the head deploy also merges it
# (idempotent — the PR head already ships 15m). CHAOS_BASE_DEPLOY=True is set so abra's pinned-deploy
# clean-tree check doesn't FATA on the untracked overlay.
set -euo pipefail
: "${CCCI_RECIPE:?missing CCCI_RECIPE}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Resolve the recipe tree the way abra does: $ABRA_DIR (the per-run tree inside a CI run) else
# the canonical ~/.abra — the overlay must land in the tree this run actually deploys from.
RECIPE_DIR="${ABRA_DIR:-${HOME}/.abra}/recipes/${CCCI_RECIPE}"
if [ ! -d "$RECIPE_DIR" ]; then
echo " ghost install_steps: recipe dir $RECIPE_DIR missing — cannot provide compose.ccci.yml" >&2
exit 1
fi
cp "$SCRIPT_DIR/compose.ccci.yml" "$RECIPE_DIR/compose.ccci.yml"
echo " ghost install_steps: provided compose.ccci.yml (app start_period grace) to recipe checkout (${CCCI_RECIPE})"

@@ -36,19 +36,19 @@ def _seed(domain, value):
assert got == value, f"seed did not commit (read back {got!r}, expected {value!r})"
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# diverge from the backup so a successful restore is observable: drop the marker table.
_mysql(domain, "DROP TABLE IF EXISTS ci_marker;")
_mysql(ctx.domain, "DROP TABLE IF EXISTS ci_marker;")
got = _mysql(
domain,
ctx.domain,
"SELECT COUNT(*) FROM information_schema.tables "
"WHERE table_schema='ghost' AND table_name='ci_marker';",
)

@@ -31,23 +31,22 @@ HTTP_TIMEOUT = 900
# (plan-ccci-compose-overlay-policy.md §1), so the harness base-deploys the previous PUBLISHED version
# (1.1.1+6-alpine) — which predates the PR and still ships the too-tight 1m start_period → it would
# deadlock on the same migration kill. compose.ccci.yml re-applies the 15m grace to the BASE so the
# from-version is deployable; install_steps.sh provides it to the checkout; CHAOS_BASE_DEPLOY skips the
# clean-tree gate on that untracked overlay. It persists across the head checkout (idempotent — the PR
# head already ships 15m). This is the policy-blessed "minimal overlay on the from-version so
# from-version is deployable; the harness auto-provides it to the checkout and auto-chaoses the base
# deploy (first-class compose.ccci.yml, rcust P2a). It persists across the head checkout (idempotent —
# the PR head already ships 15m). This is the policy-blessed "minimal overlay on the from-version so
# upgrade-to-latest can run" — grace-only, masks no defect, weakens no test.
# TIMEOUT/DEPLOY_TIMEOUT 2400s: the BASE cold boot's wall-time is mysql fresh-dir init (~6min, during
# which the app crash-loops harmlessly on `ECONNREFUSED 3306` until mysql accepts connections — no
# migration progress lost, it hasn't started) PLUS the ~9-15min schema migration (round-trip-bound,
# slower under host load). 1200s was too tight (full4 killed at the near-final `email_recipients`
# tables while still 0/1); 2400s gives headroom while still bounding a genuine hang (matches discourse).
CHAOS_BASE_DEPLOY = True
EXTRA_ENV = {
"TIMEOUT": "2400",
"COMPOSE_FILE": "compose.yml:compose.ccci.yml",
}
def BACKUP_VERIFY(domain):
def BACKUP_VERIFY(ctx):
"""Post-backup integrity check (F2-14b). The recipe's backupbot db pre-hook dumps the ghost MySQL
DB to `/var/lib/mysql/backup.sql.gz` (then restic captures that path). On the loaded single CI node
the db container intermittently CYCLES mid-dump (observed: full5/6/7 RED, full8 green — pure race;
@@ -62,7 +61,7 @@ def BACKUP_VERIFY(domain):
try:
out = lifecycle.exec_in_app(
domain,
ctx.domain,
[
"sh",
"-c",

@@ -25,17 +25,17 @@ def _seed(domain, value):
assert _psql(domain, "SELECT v FROM ci_marker;") == value
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
_psql(domain, "DROP TABLE ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
def pre_restore(ctx):
_psql(ctx.domain, "DROP TABLE ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"

@@ -14,20 +14,20 @@ def _token(domain):
return kc_admin.admin_token(domain, kc_admin.admin_password(domain))
def pre_upgrade(domain, meta):
def pre_upgrade(ctx):
# create the marker realm (DB data) before the upgrade so the overlay can prove it survives
assert kc_admin.create_marker_realm(domain, _token(domain)) in (201, 409)
assert kc_admin.create_marker_realm(ctx.domain, _token(ctx.domain)) in (201, 409)
def pre_backup(domain, meta):
def pre_backup(ctx):
# establish the marker realm before the backup op captures mariadb
assert kc_admin.create_marker_realm(domain, _token(domain)) in (201, 409)
assert kc_admin.create_marker_realm(ctx.domain, _token(ctx.domain)) in (201, 409)
def pre_restore(domain, meta):
def pre_restore(ctx):
# backup-bot-two cycles the keycloak container during backup → wait for serving, re-auth, then
# delete the realm (diverge from the backup) so a successful restore is observable
generic.assert_serving(domain, meta)
tok = _token(domain)
assert kc_admin.delete_marker_realm(domain, tok) in (204, 200)
assert not kc_admin.marker_realm_exists(domain, tok), "delete did not take"
generic.assert_serving(ctx.domain, ctx.meta)
tok = _token(ctx.domain)
assert kc_admin.delete_marker_realm(ctx.domain, tok) in (204, 200)
assert not kc_admin.marker_realm_exists(ctx.domain, tok), "delete did not take"

@@ -5,7 +5,7 @@ persistence". This is the canonical create-an-object + read-it-back for lasuite-
Flow (uses an OIDC token from the dep keycloak):
1. Obtain a JWT via OIDC password grant against the dep keycloak (the test user is provisioned
by the orchestrator's setup_custom_tests step).
by the orchestrator's dep-provisioning step).
2. POST `/api/v1.0/documents/` with `Authorization: Bearer <jwt>` to create a new doc with a
unique title; capture the returned `id`.
3. GET `/api/v1.0/documents/<id>/` with the same Bearer token; assert the returned title and
@@ -15,7 +15,7 @@ Non-vacuous: a misconfigured OIDC, broken backend, or missing endpoint fails at
broken. The marker-in-the-title + id round-trip proves the doc actually persisted in lasuite-
docs's database after going through the recipe's nginx → backend → postgres path.
Marked @pytest.mark.requires_deps — skips with `deps-not-ready` if setup_custom_tests failed.
Marked @pytest.mark.requires_deps — skips with `deps-not-ready` if dep provisioning failed.
"""
from __future__ import annotations
@@ -32,9 +32,9 @@ from harness import sso
@pytest.mark.requires_deps
def test_create_doc_and_read_back(live_app, deps_creds):
def test_create_doc_and_read_back(live_app, deps):
"""Create a doc via the authenticated API; fetch it back; assert round-trip."""
kc = deps_creds["keycloak"]
kc = deps["keycloak"]
# Obtain a JWT via OIDC password grant
access_token = sso.oidc_password_grant(

@@ -5,13 +5,13 @@ SOURCE: references/recipe-maintainer/recipe-info/lasuite-docs/tests/oidc_login.p
End-to-end flow:
1. GET `/api/v1.0/users/me/` without auth → asserts the response REDIRECTS to the dep
keycloak's realm auth endpoint (the recipe is correctly configured to challenge
unauthenticated callers — wired via setup_custom_tests.sh).
unauthenticated callers — wired via install_steps.sh).
2. Obtain an OIDC token from the dep keycloak via password grant
(the test user provisioned by the orchestrator's realm setup).
3. Call `/api/v1.0/users/me/` with `Authorization: Bearer <jwt>` → asserts 200 and the
returned user's email matches the provisioned test user.
Marked @pytest.mark.requires_deps — skips with `deps-not-ready` if setup_custom_tests failed.
Marked @pytest.mark.requires_deps — skips with `deps-not-ready` if dep provisioning failed.
"""
from __future__ import annotations
@@ -51,9 +51,9 @@ def _get_no_redirect(url: str) -> tuple[int, str]:
@pytest.mark.requires_deps
def test_oidc_login_via_keycloak(live_app, deps_creds):
def test_oidc_login_via_keycloak(live_app, deps):
"""Anonymous → redirect to keycloak; password-grant token → 200 from /api/v1.0/users/me/."""
kc = deps_creds["keycloak"]
kc = deps["keycloak"]
# Step 1: unauthenticated GET → 302 to keycloak realm's auth endpoint
status, redirect = _get_no_redirect(f"https://{live_app}/api/v1.0/users/me/")

@@ -3,10 +3,10 @@
Refactored to the refined SSO-dep model:
- The orchestrator deploys a per-run keycloak dep AFTER generic tiers and provisions a fresh
realm/client/user via `harness.sso.setup_keycloak_realm`. The creds are written to
`$CCCI_DEPS_FILE` (read here via the `deps_creds` fixture).
`$CCCI_DEPS_FILE` (read here via the `deps` fixture).
- This test no longer calls `setup_keycloak_realm` itself — that's the orchestrator's job in
the setup_custom_tests step. We just consume the credentials and exercise the OIDC flow.
- Marked `@pytest.mark.requires_deps` so if setup_custom_tests failed, this test SKIPs with a
the dep-provisioning step. We just consume the credentials and exercise the OIDC flow.
- Marked `@pytest.mark.requires_deps` so if dep provisioning failed, this test SKIPs with a
clear `deps-not-ready` reason rather than red-flagging a non-recipe failure.
"""
@@ -31,13 +31,13 @@ def _b64url_decode(seg: str) -> bytes:
@pytest.mark.requires_deps
def test_oidc_password_grant_against_dep_keycloak(live_app, deps_creds):
def test_oidc_password_grant_against_dep_keycloak(live_app, deps):
"""The dep keycloak issues a JWT for the pre-provisioned test user via OIDC password grant."""
assert "keycloak" in deps_creds, (
f"keycloak creds not in deps_creds; got {list(deps_creds.keys())}. "
"setup_custom_tests should have populated this."
assert "keycloak" in deps, (
f"keycloak creds not in deps; got {list(deps.keys())}. "
"dep provisioning should have populated this."
)
kc = deps_creds["keycloak"]
kc = deps["keycloak"]
# Sanity-check the creds shape — orchestrator-written
assert kc["domain"]
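
The tests read dep credentials that the orchestrator wrote to `$CCCI_DEPS_FILE` as a JSON dict keyed by dep recipe. The `deps` fixture body is not part of this diff, so the following is only a sketch of a tolerant loader, assuming a missing or empty file should read as "no deps"; `load_deps`, `demo_load_deps` and the sample values are made up:

```python
import json
import os
import tempfile


def load_deps(path):
    """Return the recipe -> creds dict, or {} when the file is missing or empty."""
    if not path or not os.path.isfile(path) or os.path.getsize(path) == 0:
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def demo_load_deps():
    """Write a sample deps file (made-up values) and load three variants."""
    sample = {
        "keycloak": {"domain": "kc.example.test", "realm": "demo-1a2b3c", "client_id": "demo"}
    }
    with tempfile.TemporaryDirectory() as tmp:
        full = os.path.join(tmp, "deps.json")
        with open(full, "w", encoding="utf-8") as fh:
            json.dump(sample, fh)
        empty = os.path.join(tmp, "empty.json")
        open(empty, "w", encoding="utf-8").close()
        missing = os.path.join(tmp, "missing.json")
        return load_deps(full), load_deps(empty), load_deps(missing)
```

Returning an empty dict lets a caller keep the explicit `"keycloak" in deps` assertion as the single place where a provisioning failure surfaces.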

@@ -0,0 +1,74 @@
#!/usr/bin/env bash
# lasuite-docs — INSTALL-TIME OIDC wiring hook (rcust P2b; migrated from the deleted
# setup_custom_tests.sh post-deploy path — sibling of lasuite-drive/-meet's hooks).
#
# Runs during the install tier AFTER `abra app new` + EXTRA_ENV + `abra app secret generate`, and
# BEFORE the single `abra app deploy` (lifecycle.py::_run_install_steps). Writing OIDC env + the
# real client secret HERE means the recipe deploys ONCE with OIDC already wired — no post-deploy
# reconverge. The orchestrator provisions the per-run realm/client on the (live-warm) keycloak
# BEFORE this hook and writes $CCCI_DEPS_FILE (the recipe→creds dict). docs' OIDC settings are
# config-only (validated by `manage.py check`, not fetched at boot), so the stack boots healthy
# with the env set. Env names per lasuite-docs's .env.sample (same values the old post-deploy
# hook wrote — byte-identical wiring, only the timing moved).
#
# Env supplied by the harness:
# CCCI_APP_DOMAIN — the per-run lasuite-docs app domain
# CCCI_APP_ENV — path to the app's .env (the one `abra app deploy` reads)
# CCCI_DEPS_FILE — JSON {keycloak: {domain, realm, client_id, client_secret, ...}} (may be empty)
set -euo pipefail
: "${CCCI_APP_DOMAIN:?missing}"
ENV_PATH="${CCCI_APP_ENV:?missing}"
# No deps file / no keycloak entry → install-time provisioning failed or was skipped. NO-OP so the
# recipe still boots; the @requires_deps OIDC custom test then SKIPs and F2-11 flips the run RED.
if [ -z "${CCCI_DEPS_FILE:-}" ] || [ ! -s "${CCCI_DEPS_FILE}" ]; then
echo " install_steps: no deps file — skipping OIDC wiring (recipe boots without OIDC)"
exit 0
fi
KC_DOMAIN=$(jq -r '.keycloak.domain // empty' "$CCCI_DEPS_FILE")
KC_REALM=$(jq -r '.keycloak.realm // empty' "$CCCI_DEPS_FILE")
KC_CLIENT=$(jq -r '.keycloak.client_id // empty' "$CCCI_DEPS_FILE")
KC_SECRET=$(jq -r '.keycloak.client_secret // empty' "$CCCI_DEPS_FILE")
if [ -z "$KC_DOMAIN" ] || [ -z "$KC_SECRET" ]; then
echo " install_steps: deps file has no keycloak domain/secret — skipping OIDC wiring"
exit 0
fi
echo " lasuite-docs install_steps: wiring OIDC at install against keycloak ${KC_DOMAIN}"
# 1) Insert the OIDC client secret at a bumped version (abra already generated oidc_rpcs:v1; swarm
# forbids overwriting a secret at the same version). The app is not deployed yet — a swarm secret
# can be created independently — so the single deploy below picks up v2.
CUR_VER=$(grep -E '^\s*SECRET_OIDC_RPCS_VERSION=' "$ENV_PATH" | tail -1 | cut -d= -f2 | tr -d '"\r' || echo "v1")
NEW_NUM=$((${CUR_VER#v} + 1))
NEW_VER="v${NEW_NUM}"
INSERT_LOG=$(abra app secret insert "$CCCI_APP_DOMAIN" oidc_rpcs "$NEW_VER" "$KC_SECRET" --no-input -C -o 2>&1) ||
INSERT_LOG=$(script -qec "abra app secret insert $CCCI_APP_DOMAIN oidc_rpcs $NEW_VER $KC_SECRET --no-input -C -o" /dev/null 2>&1) ||
{
echo " install_steps: abra app secret insert oidc_rpcs@$NEW_VER failed: $INSERT_LOG"
exit 1
}
sed -i "s|^\s*SECRET_OIDC_RPCS_VERSION=.*|SECRET_OIDC_RPCS_VERSION=$NEW_VER|" "$ENV_PATH"
echo " install_steps: oidc_rpcs secret inserted at $NEW_VER (was $CUR_VER)"
# 2) Write OIDC env vars to the app's .env (names per lasuite-docs's .env.sample). Ensure a
# trailing newline first so appends never concatenate onto the last line.
write_env() {
local key="$1" val="$2"
sed -i "/^\s*#\?\s*${key}=/d" "$ENV_PATH"
[ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >>"$ENV_PATH"
printf '%s=%s\n' "$key" "$val" >>"$ENV_PATH"
}
write_env OIDC_REALM "$KC_REALM"
write_env OIDC_OP_DISCOVERY_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/.well-known/openid-configuration"
write_env OIDC_OP_AUTHORIZATION_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/auth"
write_env OIDC_OP_TOKEN_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/token"
write_env OIDC_OP_USER_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/userinfo"
write_env OIDC_OP_LOGOUT_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/logout"
write_env OIDC_OP_JWKS_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/certs"
write_env OIDC_RP_CLIENT_ID "$KC_CLIENT"
write_env OIDC_RP_SIGN_ALGO "RS256"
write_env OIDC_RP_SCOPES "openid email profile"
echo " lasuite-docs install_steps: OIDC env wired into .env (deploy will pick it up, no reconverge)"
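
Step 1 of the script bumps the secret version because a swarm secret cannot be overwritten at the same version. The arithmetic behind `CUR_VER` and `NEW_VER` can be sketched in Python as follows. This approximates the `grep | tail -1 | cut | tr` pipeline and is not a byte-for-byte port; `next_secret_version` is a hypothetical name:

```python
import re


def next_secret_version(env_text, key="SECRET_OIDC_RPCS_VERSION", default="v1"):
    """Return (current, bumped), using the last live assignment of `key` in the .env text."""
    current = default
    for line in env_text.splitlines():
        match = re.match(rf"\s*{re.escape(key)}=(.*)", line)
        if match:
            # Later assignments win, mirroring `tail -1`; quotes are stripped like `tr -d`.
            current = match.group(1).strip().strip('"')
    digits = current[1:] if current.startswith("v") else current
    return current, f"v{int(digits) + 1}"
```

With no assignment present the default `v1` applies, so the first inserted secret lands at `v2`, matching the comment in the script.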

@@ -24,18 +24,18 @@ def _seed(domain, value):
assert _psql(domain, "SELECT v FROM ci_marker;") == value
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# drop the marker table (diverge from the backup) so a successful restore is observable
_psql(domain, "DROP TABLE ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
_psql(ctx.domain, "DROP TABLE ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"

@@ -15,7 +15,7 @@ HTTP_TIMEOUT = 600
DEPS = ["keycloak"]
def EXTRA_ENV(domain):
def EXTRA_ENV(ctx):
# abra's internal per-deploy convergence timeout (the recipe's TIMEOUT env, default 300s) is too
# short for this 9-service stack on a COLD image cache (~9 large images: impress frontend/backend,
# minio, postgres18, redis, docspec, y-provider). Cold pulls exceed 300s -> "deploy timed out 🟠".

@@ -1,91 +0,0 @@
#!/usr/bin/env bash
# lasuite-docs — post-deps setup hook (operator-2026-05-28 SSO-dep plan §3.2).
#
# Runs AFTER the generic tiers (install/upgrade/backup/restore) and AFTER each declared dep is
# deployed + provisioned with realm/client via the harness. The orchestrator has written
# $CCCI_DEPS_FILE with the keycloak dep's domain + realm + client_secret + admin creds.
#
# This hook:
# 1. Reads the dep's connection info from $CCCI_DEPS_FILE.
# 2. Inserts the OIDC client secret as an abra app secret (recipe-conventional name oidc_rpcs).
# 3. Writes the OIDC env vars to the running app's .env via `abra app config set`.
# 4. Triggers an in-place `abra app deploy --force --chaos` so the new env takes effect.
# THIS IS NOT a fresh `abra app new` — the deploy-count guard (DG4.1, generalised) still
# sees one app_new per app.
#
# Env supplied by the orchestrator:
# CCCI_APP_DOMAIN — the running per-run lasuite-docs app domain
# CCCI_RECIPE — "lasuite-docs"
# CCCI_DEPS_FILE — JSON file (dict shape: {dep_recipe: {domain, realm, client_id, ...}, ...})
set -euo pipefail
: "${CCCI_APP_DOMAIN:?missing}"
: "${CCCI_DEPS_FILE:?missing}"
test -s "$CCCI_DEPS_FILE" || {
echo " setup_custom_tests: deps file empty"
exit 1
}
# Read keycloak dep info via jq
KC_DOMAIN=$(jq -r '.keycloak.domain' "$CCCI_DEPS_FILE")
KC_REALM=$(jq -r '.keycloak.realm' "$CCCI_DEPS_FILE")
KC_CLIENT=$(jq -r '.keycloak.client_id' "$CCCI_DEPS_FILE")
KC_SECRET=$(jq -r '.keycloak.client_secret' "$CCCI_DEPS_FILE")
if [ -z "$KC_DOMAIN" ] || [ "$KC_DOMAIN" = "null" ]; then
echo " setup_custom_tests: no keycloak.domain in deps"
exit 1
fi
if [ -z "$KC_SECRET" ] || [ "$KC_SECRET" = "null" ]; then
echo " setup_custom_tests: no keycloak.client_secret"
exit 1
fi
echo " lasuite-docs setup_custom_tests: wiring OIDC against keycloak dep ${KC_DOMAIN}"
# 1) Insert the OIDC client secret AT A BUMPED VERSION (the recipe-maintainer pattern).
# `abra app new -S` already generated `oidc_rpcs:v1` (random) — Docker Swarm forbids overwriting
# a secret at the same version, so we bump the version (v2), insert our value there, then
# update SECRET_OIDC_RPCS_VERSION in the .env to point at the new one.
ENV_PATH="$HOME/.abra/servers/default/${CCCI_APP_DOMAIN}.env"
CUR_VER=$(grep -E '^\s*SECRET_OIDC_RPCS_VERSION=' "$ENV_PATH" | tail -1 | cut -d= -f2 | tr -d '"\r' || echo "v1")
NEW_NUM=$((${CUR_VER#v} + 1))
NEW_VER="v${NEW_NUM}"
INSERT_LOG=$(abra app secret insert "$CCCI_APP_DOMAIN" oidc_rpcs "$NEW_VER" "$KC_SECRET" --no-input -C -o 2>&1) ||
INSERT_LOG=$(script -qec "abra app secret insert $CCCI_APP_DOMAIN oidc_rpcs $NEW_VER $KC_SECRET --no-input -C -o" /dev/null 2>&1) ||
{
echo " setup_custom_tests: abra app secret insert oidc_rpcs@$NEW_VER failed: $INSERT_LOG"
exit 1
}
# Repoint the env var to the new version
sed -i "s|^\s*SECRET_OIDC_RPCS_VERSION=.*|SECRET_OIDC_RPCS_VERSION=$NEW_VER|" "$ENV_PATH"
echo " setup_custom_tests: oidc_rpcs secret inserted at $NEW_VER (was $CUR_VER)"
# 2) Write OIDC env vars to the app's .env (names per lasuite-docs's .env.sample).
# Ensure the file ends with a newline FIRST so our appends don't concatenate onto the last line
# (we saw `TIMEOUT=900OIDC_REALM=...` malformed by a missing-trailing-newline file).
[ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >>"$ENV_PATH"
write_env() {
local key="$1" val="$2"
# remove any existing key (commented or live) then append the live key=val
sed -i "/^\s*#\?\s*${key}=/d" "$ENV_PATH"
# Re-ensure trailing newline after each delete (sed may leave the file without one)
[ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >>"$ENV_PATH"
printf '%s=%s\n' "$key" "$val" >>"$ENV_PATH"
}
write_env OIDC_REALM "$KC_REALM"
write_env OIDC_OP_DISCOVERY_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/.well-known/openid-configuration"
write_env OIDC_OP_AUTHORIZATION_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/auth"
write_env OIDC_OP_TOKEN_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/token"
write_env OIDC_OP_USER_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/userinfo"
write_env OIDC_OP_LOGOUT_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/logout"
write_env OIDC_OP_JWKS_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/certs"
write_env OIDC_RP_CLIENT_ID "$KC_CLIENT"
write_env OIDC_RP_SIGN_ALGO "RS256"
write_env OIDC_RP_SCOPES "openid email profile"
# 3) Trigger an in-place redeploy so the env update takes effect. --force re-deploys even when
# the recipe hasn't changed; --chaos avoids the chaos prompt; --no-input non-interactive.
abra app deploy "$CCCI_APP_DOMAIN" --force --chaos --no-input 2>&1 | tail -10
echo " lasuite-docs setup_custom_tests: OIDC wired + redeployed"
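
The `write_env` helper guards against the `TIMEOUT=900OIDC_REALM=...` concatenation described in its comments: it deletes any existing line for the key (live or commented), ensures a trailing newline, then appends. A sketch of the same behaviour on an in-memory string, offered as an illustration rather than a port of the `sed` expressions:

```python
import re


def write_env(text, key, val):
    """Drop any live or commented `key=` line, then append `key=val` on its own line."""
    pattern = re.compile(rf"^\s*#?\s*{re.escape(key)}=")
    kept = [line for line in text.splitlines() if not pattern.match(line)]
    body = "\n".join(kept)
    if body:
        # Guarantee a newline before the append so keys never run together.
        body += "\n"
    return f"{body}{key}={val}\n"
```

Calling it repeatedly with the same key is idempotent, since each call removes the previous assignment before appending the new one.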

@@ -3,12 +3,12 @@
Drive (La Suite Drive) is OIDC-required: login is gated by an external OpenID Connect provider.
Mirrors the proven lasuite-docs SSO model:
- The orchestrator deploys a per-run keycloak dep AFTER the generic tiers and provisions a fresh
realm/client/user via `harness.sso.setup_keycloak_realm`; `setup_custom_tests.sh` then wires the
realm/client/user via `harness.sso.setup_keycloak_realm`; `install_steps.sh` then wires the
OIDC env + client secret into the running drive app and redeploys. Creds land in `$CCCI_DEPS_FILE`
(read here via the `deps_creds` fixture).
(read here via the `deps` fixture).
- This test consumes those creds and exercises the real OIDC flow against the dep keycloak: discovery
endpoint advertises the realm, and a password grant yields a valid JWT with the expected claims.
- Marked `@pytest.mark.requires_deps` so if setup_custom_tests failed the test SKIPs with a clear
- Marked `@pytest.mark.requires_deps` so if dep provisioning failed the test SKIPs with a clear
`deps-not-ready` reason — and (per F2-11) the orchestrator then fails the run rather than going
green on a skipped SSO test.
@@ -36,13 +36,13 @@ def _b64url_decode(seg: str) -> bytes:
@pytest.mark.requires_deps
def test_oidc_password_grant_against_dep_keycloak(live_app, deps_creds):
def test_oidc_password_grant_against_dep_keycloak(live_app, deps):
"""The dep keycloak issues a JWT for the pre-provisioned test user via OIDC password grant."""
assert "keycloak" in deps_creds, (
f"keycloak creds not in deps_creds; got {list(deps_creds.keys())}. "
"setup_custom_tests should have populated this."
assert "keycloak" in deps, (
f"keycloak creds not in deps; got {list(deps.keys())}. "
"dep provisioning should have populated this."
)
kc = deps_creds["keycloak"]
kc = deps["keycloak"]
# Creds shape. WC1: realm is per-run namespaced "<parent>-<6hex>"; client_id stays the parent.
assert kc["domain"]
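
The hunk header references a `_b64url_decode(seg: str) -> bytes` helper whose body is not shown here. JWT segments are base64url-encoded with the `=` padding stripped, so a decoder has to restore it first. A minimal sketch of that step, with the caveat that the test file's actual helper may differ; `jwt_claims` and `make_unsigned_token` are hypothetical names used only for this demonstration:

```python
import base64
import json


def b64url_decode(seg: str) -> bytes:
    """Decode an unpadded base64url segment by restoring the stripped '=' padding."""
    return base64.urlsafe_b64decode(seg + "=" * (-len(seg) % 4))


def jwt_claims(token: str) -> dict:
    """Return the payload claims of a compact JWT WITHOUT verifying the signature."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


def make_unsigned_token(claims: dict) -> str:
    """Build a demo token with a dummy signature; for illustration only."""

    def enc(obj):
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return f"{enc({'alg': 'none'})}.{enc(claims)}.sig"
```

Reading claims this way is enough to assert on fields such as the issuer realm in a test, but it is not a substitute for signature verification.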

@@ -6,7 +6,7 @@
# BEFORE the single `abra app deploy` (runner/harness/lifecycle.py::_run_install_steps). By writing
# the OIDC env + the real client secret into the app's `.env` HERE, the recipe deploys ONCE with
# OIDC already wired — eliminating the flaky post-deploy in-place `--force --chaos` 12-service
# reconverge that the old setup_custom_tests.sh did (collabora WOPI-discovery race; see JOURNAL
# post-deploy reconverge (collabora WOPI-discovery race; see JOURNAL
# Step 0). The orchestrator provisions the per-run realm/client on the live-warm keycloak BEFORE
# this hook and writes $CCCI_DEPS_FILE (the recipe→creds dict).
#

@@ -5,6 +5,7 @@ in the `db` service. The backup path exercises the recipe's pg_backup.sh DB-dump
backupbot-labelled)."""
import os
import subprocess
import sys
import time
@@ -12,6 +13,57 @@ sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner")
from harness import lifecycle # noqa: E402
def pre_install(ctx):
"""Post-deploy seed for the custom tier (the former setup_custom_tests.sh, moved here in rcust
P2b — install_steps.sh runs PRE-deploy and cannot touch the live stack). The deploy alone does
NOT create the MinIO bucket: `minio-createbuckets` is a `replicas:0` one-shot (restart_policy:
none) that must be triggered. The MinIO storage test asserts the bucket exists, so trigger it
here and poll. `--detach` is REQUIRED: the job creates the bucket then EXITS 0, so it never
holds a steady 1/1 replica — a blocking scale would wait forever.
BEST-EFFORT, like the setup_custom_tests.sh it replaced: on poll timeout we WARN and continue
(the one-shot often lands just after the window). The custom-tier MinIO storage test is the
real gate for a genuinely missing bucket — failing the install op here was an rcust M2
regression (the original hook fell through on timeout by design)."""
stack = ctx.domain.replace(".", "_")
print(" pre_install: creating MinIO bucket via the minio-createbuckets one-shot", flush=True)
subprocess.run(
["docker", "service", "scale", "--detach", f"{stack}_minio-createbuckets=1"],
capture_output=True,
check=False,
)
check = (
'mc alias set _c http://localhost:9000 "$(cat /run/secrets/minio_ru)" '
'"$(cat /run/secrets/minio_rp)" >/dev/null 2>&1 && '
"mc ls _c/drive-media-storage >/dev/null 2>&1"
)
for i in range(30):
cid = subprocess.run(
["docker", "ps", "-q", "-f", f"name={stack}_minio.1"],
capture_output=True,
text=True,
check=False,
).stdout.split()
if cid and (
subprocess.run(
["docker", "exec", cid[0], "sh", "-c", check], capture_output=True, check=False
).returncode
== 0
):
print(
f" pre_install: bucket drive-media-storage present after {i + 1} poll(s)",
flush=True,
)
return
time.sleep(3)
print(
" !! pre_install: minio-createbuckets one-shot did not create drive-media-storage in 90s "
"— continuing (best-effort, as the pre-restructure hook did); the custom-tier MinIO test "
"gates a genuinely missing bucket",
flush=True,
)
def _wait_collabora_ready(domain, timeout=420):
"""Gate the upgrade op on collabora being FULLY ready (WOPI discovery endpoint → 200), not just
container 1/1 'running'. coolwsd takes ~2min to boot (pre-reads 1300+ l10n files + RSA keygen);
@ -49,21 +101,21 @@ def _seed(domain, value):
assert _psql(domain, "SELECT v FROM ci_marker;") == value
def pre_upgrade(domain, meta):
def pre_upgrade(ctx):
# Gate the chaos redeploy on a fully-ready collabora (else it kills a still-booting coolwsd and
# abra aborts the upgrade deploy — Q3.2a run 1). Then seed the data-integrity marker.
_wait_collabora_ready(domain)
_seed(domain, "upgrade-survives")
_wait_collabora_ready(ctx.domain)
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# drop the marker table (diverge from the backup) so a successful restore is observable
_psql(domain, "DROP TABLE ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
_psql(ctx.domain, "DROP TABLE ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"
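The best-effort poll in `pre_install` above can be reduced to a small helper. This is a minimal sketch rather than the harness's code: `poll_until`, its parameters, and the injectable `sleep` are hypothetical names introduced here. Only the 30 x 3s window and the warn-and-continue behaviour are taken from the diff.

```python
import time
from typing import Callable


def poll_until(check: Callable[[], bool], attempts: int = 30, interval: float = 3.0,
               sleep: Callable[[float], None] = time.sleep) -> int:
    """Return the 1-based attempt on which check() first passed, or 0 on timeout.

    Best-effort: a timeout is reported to the caller, never raised, so the
    caller can warn and continue while a later test acts as the real gate.
    """
    for i in range(attempts):
        if check():
            return i + 1
        sleep(interval)
    return 0


# Usage: a check that passes on the third poll; sleep is stubbed out.
calls = {"n": 0}


def bucket_present() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


landed = poll_until(bucket_present, sleep=lambda s: None)
missed = poll_until(lambda: False, attempts=5, sleep=lambda s: None)
```

Returning 0 instead of raising keeps the timeout decision with the caller, which matches the hook's stated intent that the storage test, not the install op, fails on a genuinely missing bucket.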


@ -18,34 +18,31 @@ DEPLOY_TIMEOUT = 1800
HTTP_TIMEOUT = 900
# Base deploy/lifecycle proven cold-green @2026-05-28 (install: pass; 12 services incl.
# onlyoffice+collabora) once the Docker Hub rate limit was fixed. The keycloak SSO dep is now
# enabled: declaring DEPS triggers the orchestrator's setup_custom_tests step (deploy keycloak +
# provision realm/client/user + run tests/lasuite-drive/setup_custom_tests.sh to wire OIDC env +
# in-place redeploy). functional/test_oidc_with_keycloak.py then exercises the SSO flow.
# onlyoffice+collabora) once the Docker Hub rate limit was fixed. Declaring DEPS makes the
# orchestrator provision keycloak (realm/client/user) BEFORE the single deploy;
# functional/test_oidc_with_keycloak.py then exercises the SSO flow.
DEPS = ["keycloak"]
# Q3.2a (plan-lasuite-drive-oidc-robustness.md Part A): wire OIDC at INSTALL time, not via a
# post-deploy in-place `--chaos` redeploy. The orchestrator provisions the per-run realm on the
# live-warm keycloak BEFORE the single `abra app deploy`, and tests/lasuite-drive/install_steps.sh
# writes the OIDC env + client secret into the .env that one deploy reads. This eliminates the flaky
# 12-service reconverge (collabora WOPI-discovery race; JOURNAL Step 0). Drive boots fine with OIDC
# env set because keycloak is live-warm (discovery reachable at boot). setup_custom_tests.sh now
# only triggers the post-deploy MinIO bucket one-shot.
OIDC_AT_INSTALL = True
# OIDC is wired at INSTALL time (the only deps mode since rcust P2b; Q3.2a pioneered it here):
# the orchestrator provisions the per-run realm on the live-warm keycloak BEFORE the single
# `abra app deploy`, and tests/lasuite-drive/install_steps.sh writes the OIDC env + client secret
# into the .env that one deploy reads. No post-deploy reconverge (the flaky 12-service collabora
# WOPI race is structurally gone). The post-deploy MinIO bucket one-shot lives in ops.py
# pre_install (the former setup_custom_tests.sh, deleted in P2b).
def READY_PROBE(domain):
def READY_PROBE(ctx):
"""Readiness signals beyond replica-convergence + the app HEALTH_PATH (Q3.2/F2-12). collabora's
coolwsd reports its container 1/1 'running' while still doing jail/config init, and its WOPI
discovery endpoint 404s until ready — so the harness waits for `/hosting/discovery` → 200 on the
collabora sibling host after the install deploy AND after the upgrade chaos redeploy. This is what
makes the heavy prev→PR-head crossover reliably green (the new collabora 25.04.9.x finishes init
within swarm's healthcheck retries; abra's own converge monitor was too impatient — F2-12)."""
label, _, rest = domain.partition(".")
return [{"host": f"collabora-{domain}", "path": "/hosting/discovery", "ok": (200,)}]
label, _, rest = ctx.domain.partition(".")
return [{"host": f"collabora-{ctx.domain}", "path": "/hosting/discovery", "ok": (200,)}]
def EXTRA_ENV(domain):
def EXTRA_ENV(ctx):
# Two of lasuite-drive's services route on DOMAIN-DERIVED **nested** subdomains —
# `MINIO_DOMAIN="minio.${DOMAIN}"` and `COLLABORA_DOMAIN="collabora.${DOMAIN}"`. The cc-ci
# wildcard TLS cert is `*.ci.commoninternet.net` (single label only), so a 2-label name like
@ -55,8 +52,8 @@ def EXTRA_ENV(domain):
# no cert/gateway change. See DECISIONS.md "Phase 2 — nested DOMAIN-derived subdomains".
# `AWS_S3_DOMAIN_REPLACE` derives from MINIO_DOMAIN in-compose, so setting MINIO_DOMAIN is enough.
return {
"MINIO_DOMAIN": f"minio-{domain}",
"COLLABORA_DOMAIN": f"collabora-{domain}",
"MINIO_DOMAIN": f"minio-{ctx.domain}",
"COLLABORA_DOMAIN": f"collabora-{ctx.domain}",
# abra's internal per-deploy convergence timeout (recipe TIMEOUT env, default 300s) is too
# short for this 12-service stack on a cold image cache (impress frontend/backend, minio,
# postgres, redis, collabora ~1GB, onlyoffice ~2GB). Bump so abra waits long enough for
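The single-label wildcard constraint described in the `EXTRA_ENV` comment above can be checked mechanically. The sketch below is illustrative only: `wildcard_covers` is a hypothetical helper, not part of the harness, and the per-run domain is a made-up example in the shape the comments use.

```python
def wildcard_covers(pattern: str, host: str) -> bool:
    """True if a certificate name like `*.example.net` covers `host`.

    The `*` stands for exactly one DNS label, so it never spans a dot.
    """
    if not pattern.startswith("*."):
        return pattern.lower() == host.lower()
    suffix = pattern[1:].lower()  # ".example.net"
    host = host.lower()
    if not host.endswith(suffix):
        return False
    label = host[: -len(suffix)]
    return bool(label) and "." not in label


CERT = "*.ci.commoninternet.net"
domain = "lasuite-drive-pr0-abc.ci.commoninternet.net"  # illustrative per-run domain

nested = wildcard_covers(CERT, f"minio.{domain}")   # two labels under the wildcard
sibling = wildcard_covers(CERT, f"minio-{domain}")  # one label under the wildcard
```

Here `nested` is rejected and `sibling` is accepted, which is why the recipe sets `MINIO_DOMAIN` and `COLLABORA_DOMAIN` to hyphen-joined sibling hosts instead of nested subdomains.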


@ -1,39 +0,0 @@
#!/usr/bin/env bash
# lasuite-drive — POST-DEPLOY setup hook (Phase 2 Q3.2a).
#
# As of Q3.2a (plan-lasuite-drive-oidc-robustness.md Part A) OIDC is wired at INSTALL time by
# tests/lasuite-drive/install_steps.sh (before the single `abra app deploy`), so this hook NO LONGER
# does any OIDC env wiring or in-place redeploy — that eliminated the flaky 12-service reconverge
# (collabora WOPI race; see JOURNAL Step 0). What remains here is the ONE post-deploy step that
# genuinely needs the live stack: triggering the MinIO bucket-creation one-shot. The orchestrator
# runs this only on the install-time path AFTER the deploy is healthy (deps already provisioned).
#
# Env supplied by the orchestrator:
# CCCI_APP_DOMAIN — the running per-run lasuite-drive app domain
# CCCI_DEPS_FILE — JSON deps creds dict (unused here now; OIDC handled at install)
set -euo pipefail
: "${CCCI_APP_DOMAIN:?missing}"
# The deploy alone does NOT create the MinIO bucket — `minio-createbuckets` is a `replicas:0`
# one-shot (restart_policy: none) that must be triggered. The MinIO storage test asserts the bucket
# exists, so create it here. `--detach` is REQUIRED: the job creates the bucket then EXITS 0, so it
# never holds a steady 1/1 replica; a blocking `docker service scale ...=1` would wait forever and
# hang the run. With `--detach` the scale just submits the one-run and returns; the poll loop below
# confirms the bucket was actually created.
STACK=$(printf '%s' "$CCCI_APP_DOMAIN" | tr '.' '_')
echo " setup: creating MinIO bucket via the minio-createbuckets one-shot (scale 0->1)"
docker service scale --detach "${STACK}_minio-createbuckets=1" >/dev/null 2>&1 || true
# Wait up to 90s for the one-shot to create the bucket (mc mb drive/drive-media-storage; exit 0).
# Poll by checking the bucket directly from the running minio replica container.
for i in $(seq 1 30); do
MC_CID=$(docker ps -q -f "name=${STACK}_minio.1" | head -1)
if [ -n "$MC_CID" ] && docker exec "$MC_CID" sh -c \
'mc alias set _c http://localhost:9000 "$(cat /run/secrets/minio_ru)" "$(cat /run/secrets/minio_rp)" >/dev/null 2>&1 && mc ls _c/drive-media-storage >/dev/null 2>&1'; then
echo " setup: bucket drive-media-storage present after ${i} poll(s)"
break
fi
sleep 3
done
echo " lasuite-drive setup_custom_tests: post-deploy MinIO bucket step complete (OIDC wired at install)"


@ -36,8 +36,8 @@ def _b64url(seg: str) -> bytes:
return base64.urlsafe_b64decode(seg + "=" * ((4 - len(seg) % 4) % 4))
def _creds(deps_creds: dict) -> dict:
kc = deps_creds["keycloak"]
def _creds(deps: dict) -> dict:
kc = deps["keycloak"]
return {
"provider": "keycloak",
"provider_domain": kc["domain"],
@ -55,10 +55,10 @@ def _creds(deps_creds: dict) -> dict:
@pytest.mark.requires_deps
def test_create_room_get_livekit_token_and_read_back(live_app, deps_creds):
assert "keycloak" in deps_creds, f"keycloak creds missing; got {list(deps_creds.keys())}"
def test_create_room_get_livekit_token_and_read_back(live_app, deps):
assert "keycloak" in deps, f"keycloak creds missing; got {list(deps.keys())}"
base = f"https://{live_app}"
token = sso.oidc_password_grant(_creds(deps_creds))
token = sso.oidc_password_grant(_creds(deps))
assert isinstance(token, str) and token.count(".") == 2, "OIDC access token is not a JWT"
auth = {"Authorization": f"Bearer {token}"}
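The `token.count(".") == 2` check and the `_b64url` helper above rely on the compact JWT layout of three base64url segments with padding stripped. A minimal sketch of reading the claims follows; `jwt_claims` and `_seg` are hypothetical names, the token is built locally for illustration, and nothing here verifies a signature.

```python
import base64
import json


def b64url_decode(seg: str) -> bytes:
    # JWT segments drop the trailing "=" padding; restore it before decoding.
    return base64.urlsafe_b64decode(seg + "=" * ((4 - len(seg) % 4) % 4))


def jwt_claims(token: str) -> dict:
    """Decode (NOT verify) the payload segment of a compact JWT."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


def _seg(obj: dict) -> str:
    # Encode a dict the way a JWT segment is encoded: base64url, no padding.
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# A throwaway token assembled locally; the claim names are examples only.
token = ".".join([_seg({"alg": "none"}), _seg({"azp": "meet", "sub": "u1"}), "sig"])
claims = jwt_claims(token)
```

Decoding without verification is enough for asserting claim shape in a test; anything that trusts the claims would need signature validation against the provider's keys.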


@ -3,12 +3,12 @@
Meet (La Suite Meet) is OIDC-required: login is gated by an external OpenID Connect provider.
Mirrors the proven lasuite-docs SSO model:
- The orchestrator deploys a per-run keycloak dep AFTER the generic tiers and provisions a fresh
realm/client/user via `harness.sso.setup_keycloak_realm`; `setup_custom_tests.sh` then wires the
realm/client/user via `harness.sso.setup_keycloak_realm`; `install_steps.sh` then wires the
OIDC env + client secret into the running drive app and redeploys. Creds land in `$CCCI_DEPS_FILE`
(read here via the `deps_creds` fixture).
(read here via the `deps` fixture).
- This test consumes those creds and exercises the real OIDC flow against the dep keycloak: discovery
endpoint advertises the realm, and a password grant yields a valid JWT with the expected claims.
- Marked `@pytest.mark.requires_deps` so if setup_custom_tests failed the test SKIPs with a clear
- Marked `@pytest.mark.requires_deps` so if dep provisioning failed the test SKIPs with a clear
`deps-not-ready` reason — and (per F2-11) the orchestrator then fails the run rather than going
green on a skipped SSO test.
@ -36,13 +36,13 @@ def _b64url_decode(seg: str) -> bytes:
@pytest.mark.requires_deps
def test_oidc_password_grant_against_dep_keycloak(live_app, deps_creds):
def test_oidc_password_grant_against_dep_keycloak(live_app, deps):
"""The dep keycloak issues a JWT for the pre-provisioned test user via OIDC password grant."""
assert "keycloak" in deps_creds, (
f"keycloak creds not in deps_creds; got {list(deps_creds.keys())}. "
"setup_custom_tests should have populated this."
assert "keycloak" in deps, (
f"keycloak creds not in deps; got {list(deps.keys())}. "
"dep provisioning should have populated this."
)
kc = deps_creds["keycloak"]
kc = deps["keycloak"]
# Creds shape. WC1: realm is per-run namespaced "<parent>-<6hex>"; client_id stays the parent.
assert kc["domain"]


@ -4,7 +4,8 @@
# Runs during the install tier AFTER `abra app new` + EXTRA_ENV + `abra app secret generate`, and
# BEFORE the single `abra app deploy` (lifecycle.py::_run_install_steps). Writing OIDC env + the real
# client secret HERE means the recipe deploys ONCE with OIDC already wired — no post-deploy reconverge
# (OIDC_AT_INSTALL). The orchestrator provisions the per-run realm/client on the live-warm keycloak
# (install-time deps wiring — the only mode since rcust P2b). The orchestrator provisions the
# per-run realm/client on the live-warm keycloak
# BEFORE this hook and writes $CCCI_DEPS_FILE (the recipe→creds dict).
#
# Meet's OIDC is REQUIRED (recipe README). Same La Suite/impress env contract as drive, with meet's


@ -27,18 +27,18 @@ def _seed(domain, value):
assert _psql(domain, "SELECT v FROM ci_marker;") == value
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# drop the marker table (diverge from the backup) so a successful restore is observable
_psql(domain, "DROP TABLE ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
_psql(ctx.domain, "DROP TABLE ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"
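The hook signature change repeated across these `ops.py` files, from `(domain, meta)` to a single `ctx`, could look roughly like the sketch below. Only `ctx.domain` is confirmed by the diff; the `HookContext` class, its `meta` field, and the `stack` property are assumptions made here for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HookContext:
    """Hypothetical stand-in for the harness `ctx`; only `.domain` is confirmed."""

    domain: str
    meta: dict = field(default_factory=dict)  # assumed: what the old `meta` arg carried

    @property
    def stack(self) -> str:
        # Same derivation pre_install uses for the swarm stack name.
        return self.domain.replace(".", "_")


seeded = []


def pre_backup(ctx: HookContext) -> None:
    # New-style hook: one argument, fields read off ctx.
    seeded.append((ctx.domain, "original"))


ctx = HookContext(domain="app-pr0.ci.example.net")
pre_backup(ctx)
```

A single context object lets the harness add fields later without touching every hook signature again, which is presumably the motivation for the migration.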


@ -13,16 +13,15 @@ HEALTH_OK = (200, 301, 302)
DEPLOY_TIMEOUT = 1200
HTTP_TIMEOUT = 600
# SSO-dependent (recipe.toml requires=["keycloak"], [sso] provider=keycloak). Wire OIDC at INSTALL
# time against the live-warm keycloak — same machinery as lasuite-drive (Q3.2a): the orchestrator
# provisions the per-run realm BEFORE the single `abra app deploy`, and tests/lasuite-meet/
# install_steps.sh writes the OIDC env + client secret into that one deploy (no post-deploy
# reconverge). Meet boots fine with OIDC env set because keycloak is live-warm.
# SSO-dependent (recipe.toml requires=["keycloak"], [sso] provider=keycloak). OIDC is wired at
# INSTALL time (the only deps mode since rcust P2b) against the live-warm keycloak: the
# orchestrator provisions the per-run realm BEFORE the single `abra app deploy`, and
# tests/lasuite-meet/install_steps.sh writes the OIDC env + client secret into that one deploy
# (no post-deploy reconverge). Meet boots fine with OIDC env set because keycloak is live-warm.
DEPS = ["keycloak"]
OIDC_AT_INSTALL = True
def EXTRA_ENV(domain):
def EXTRA_ENV(ctx):
# lasuite-meet routes LiveKit's WebSocket signaling on a DOMAIN-derived **nested** subdomain
# `LIVEKIT_DOMAIN="livekit.${DOMAIN}"`. The cc-ci wildcard TLS cert is `*.ci.commoninternet.net`
# (single label only), so a 2-label name like `livekit.lasuite-meet-pr0-abc.ci.commoninternet.net`
@ -31,7 +30,7 @@ def EXTRA_ENV(domain):
# no cert/gateway change. Same fix as lasuite-drive's minio/collabora siblings (DECISIONS.md
# "Phase 2 — nested DOMAIN-derived subdomains").
return {
"LIVEKIT_DOMAIN": f"livekit-{domain}",
"LIVEKIT_DOMAIN": f"livekit-{ctx.domain}",
# abra's internal per-deploy convergence TIMEOUT (default 300s) is too short for this stack on
# a cold image cache; bump it (kept under DEPLOY_TIMEOUT so Python never kills abra mid-wait).
"TIMEOUT": "1000",


@ -21,10 +21,10 @@ DEPLOY_TIMEOUT = 900
HTTP_TIMEOUT = 600
def EXTRA_ENV(domain):
def EXTRA_ENV(ctx):
return {
"MAIL_DOMAIN": domain,
"HOSTNAMES": domain,
"MAIL_DOMAIN": ctx.domain,
"HOSTNAMES": ctx.domain,
"TRAEFIK_STACK_NAME": "traefik_ci_commoninternet_net",
"TLS_FLAVOR": "notls",
"SITENAME": "ccci-mail",


@ -24,18 +24,18 @@ def _seed(domain, value):
assert _psql(domain, "SELECT v FROM ci_marker;") == value
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# drop the marker table (diverge from the backup) so a successful restore is observable
_psql(domain, "DROP TABLE ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
_psql(ctx.domain, "DROP TABLE ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"


@ -29,18 +29,18 @@ def _seed(domain, value):
assert _psql(domain, "SELECT v FROM ci_marker;") == value
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# drop the marker table (diverge from the backup) so a successful restore is observable
_psql(domain, "DROP TABLE ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
_psql(ctx.domain, "DROP TABLE ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"


@ -26,9 +26,9 @@ def test_configured_max_users_surfaces_in_serverconfig(live_app):
assert r["server_sync"], f"ServerSync handshake did not complete — {r.get('error')}"
cfg = r["server_config"]
assert cfg, f"server did not send a ServerConfig message — {r!r}"
assert cfg.get("max_users") == recipe_meta.MAX_USERS, (
assert cfg.get("max_users") == recipe_meta._MAX_USERS, (
f"ServerConfig.max_users={cfg.get('max_users')!r} does not match the configured "
f"USERS={recipe_meta.MAX_USERS} — deploy-time server-limit config did not propagate"
f"USERS={recipe_meta._MAX_USERS} — deploy-time server-limit config did not propagate"
)
# allow_html defaults true in the recipe; assert it is present/boolean to prove the field set
# is the real ServerConfig (not an empty/garbled decode).


@ -20,7 +20,7 @@ import recipe_meta # noqa: E402
def test_configured_welcome_text_surfaces_in_serversync(live_app):
marker = recipe_meta.WELCOME_TEXT_MARKER
marker = recipe_meta._WELCOME_TEXT_MARKER
r = _mumble_proto.retry_handshake(attempts=12, interval=5.0)
assert r["server_sync"], f"ServerSync handshake did not complete — {r.get('error')}"


@ -38,16 +38,18 @@ def _seed(domain, value):
assert got == value, f"seed did not commit (read back {got!r}, expected {value!r})"
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
def pre_restore(ctx):
# diverge from the backup so a successful restore is observable: drop the marker table.
_sqlite(domain, "DROP TABLE IF EXISTS ci_marker;")
got = _sqlite(domain, "SELECT name FROM sqlite_master WHERE type='table' AND name='ci_marker';")
_sqlite(ctx.domain, "DROP TABLE IF EXISTS ci_marker;")
got = _sqlite(
ctx.domain, "SELECT name FROM sqlite_master WHERE type='table' AND name='ci_marker';"
)
assert got == "", f"drop did not take (sqlite_master still lists ci_marker: {got!r})"


@ -31,18 +31,19 @@ HEALTH_OK = (200,)
DEPLOY_TIMEOUT = 900 # two images to pull (mumble-server + mumble-web) on a cold node
HTTP_TIMEOUT = 300
# A unique, stable welcome-text marker the round-trip test asserts surfaces over the protocol.
WELCOME_TEXT_MARKER = "cc-ci-mumble-welcome-7f3a9c"
# A unique, stable welcome-text marker the round-trip test asserts surfaces over the protocol
# (underscore prefix = recipe-private constant, exempt from registry validation — rcust P1).
_WELCOME_TEXT_MARKER = "cc-ci-mumble-welcome-7f3a9c"
# A distinctive max-users value (not the recipe default 100) the server_config test asserts.
MAX_USERS = 42
_MAX_USERS = 42
# BASE deploy (0.2.0): mumble-web only — NO host-ports (0.2.0 predates it). The voice-config env is
# set here and persists across the upgrade so it takes effect on the latest (where the custom config
# round-trip tests assert it).
EXTRA_ENV = {
"COMPOSE_FILE": "compose.yml:compose.mumbleweb.yml",
"WELCOME_TEXT": WELCOME_TEXT_MARKER,
"USERS": str(MAX_USERS),
"WELCOME_TEXT": _WELCOME_TEXT_MARKER,
"USERS": str(_MAX_USERS),
}
# UPGRADE-target deploy (latest 1.0.0+): add the NATIVE compose.host-ports.yml so 64738 is
@ -52,7 +53,7 @@ UPGRADE_EXTRA_ENV = {
}
def READY_PROBE(domain):
def READY_PROBE(ctx):
# The voice server on 64738 is testable on-host ONLY when compose.host-ports.yml is active — i.e.
# the post-upgrade LATEST, not the minimal 0.2.0 base. Read the live COMPOSE_FILE to decide, so the
# SAME probe fn is correct in both phases: the post-install probe (base, no host-ports) returns []
@ -63,7 +64,7 @@ def READY_PROBE(domain):
# backup-bot would then exec into a not-running app container -> 409).
from harness import abra # lazy: recipe_meta is exec'd with `harness` importable at call time
cf = abra.env_get(domain, "COMPOSE_FILE") or ""
cf = abra.env_get(ctx.domain, "COMPOSE_FILE") or ""
if "compose.host-ports.yml" in cf:
return [{"tcp_host": "127.0.0.1", "tcp_port": 64738, "stable": 3}]
return []
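The probe spec returned above carries `"stable": 3`. The diff does not define that key, so the sketch below assumes it means "three consecutive successful TCP connects". `tcp_open`, `stable_tcp`, and the injectable `probe` parameter are hypothetical names, and a real probe would pause between attempts.

```python
import socket


def tcp_open(host, port, timeout=1.0):
    """True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def stable_tcp(host, port, stable=3, attempts=10, probe=tcp_open):
    """True once `stable` consecutive probes succeed within `attempts` tries."""
    streak = 0
    for _ in range(attempts):
        # Any failure resets the streak: one lucky connect is not readiness.
        streak = streak + 1 if probe(host, port) else 0
        if streak >= stable:
            return True
    return False


# Usage with a scripted probe, so no real socket is opened here.
results = iter([True, False, True, True, True])
flaky_then_ready = stable_tcp("127.0.0.1", 64738, probe=lambda h, p: next(results))
never_ready = stable_tcp("127.0.0.1", 64738, attempts=4, probe=lambda h, p: False)
```

Requiring a streak guards against a port that accepts one connection while the process behind it is still restarting.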


@ -15,13 +15,13 @@ def _write(domain, val):
lifecycle.exec_in_app(domain, ["sh", "-c", f"echo {val} > {MARKER}"])
def pre_upgrade(domain, meta):
_write(domain, "upgrade-survives")
def pre_upgrade(ctx):
_write(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_write(domain, "original")
def pre_backup(ctx):
_write(ctx.domain, "original")
def pre_restore(domain, meta):
_write(domain, "mutated") # diverge so a successful restore is observable
def pre_restore(ctx):
_write(ctx.domain, "mutated") # diverge so a successful restore is observable


@ -24,17 +24,17 @@ def _seed(domain, value):
assert _psql(domain, "SELECT v FROM ci_marker;") == value
def pre_upgrade(domain, meta):
_seed(domain, "upgrade-survives")
def pre_upgrade(ctx):
_seed(ctx.domain, "upgrade-survives")
def pre_backup(domain, meta):
_seed(domain, "original")
def pre_backup(ctx):
_seed(ctx.domain, "original")
def pre_restore(domain, meta):
_psql(domain, "DROP TABLE ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
def pre_restore(ctx):
_psql(ctx.domain, "DROP TABLE ci_marker;")
assert _psql(ctx.domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"


@ -13,6 +13,7 @@ import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import canonical, warm # noqa: E402
from harness import meta as harness_meta # noqa: E402
def test_canonical_domain():
@ -33,11 +34,9 @@ def test_is_enrolled_reads_flag(tmp_path, monkeypatch):
tests_dir = tmp_path / "tests" / recipe
tests_dir.mkdir(parents=True)
(tests_dir / "recipe_meta.py").write_text("WARM_CANONICAL = True\n")
# canonical.is_enrolled builds the path from canonical.__file__/../../tests/<recipe>; emulate by
# creating the layout under a fake harness dir and pointing __file__ there.
fake_harness = tmp_path / "runner" / "harness"
fake_harness.mkdir(parents=True)
monkeypatch.setattr(canonical, "__file__", str(fake_harness / "canonical.py"))
# is_enrolled reads through the single meta loader (rcust P1); point its tests/ root at the
# temp layout.
monkeypatch.setattr(harness_meta, "TESTS_DIR", str(tmp_path / "tests"))
assert canonical.is_enrolled(recipe) is True
(tests_dir / "recipe_meta.py").write_text("WARM_CANONICAL = False\n")
assert canonical.is_enrolled(recipe) is False
@ -65,9 +64,7 @@ def test_registry_roundtrip(tmp_path, monkeypatch):
def test_enrolled_recipes_scans_meta(tmp_path, monkeypatch):
# enrolled_recipes() lists recipes whose tests/<r>/recipe_meta.py sets WARM_CANONICAL=True.
fake_harness = tmp_path / "runner" / "harness"
fake_harness.mkdir(parents=True)
monkeypatch.setattr(canonical, "__file__", str(fake_harness / "canonical.py"))
monkeypatch.setattr(harness_meta, "TESTS_DIR", str(tmp_path / "tests"))
for name, body in (
("aaa", "WARM_CANONICAL = True\n"),
("bbb", "DEPS=['x']\n"),


@ -0,0 +1,48 @@
"""Unit tests for the shared conftest fixtures added/reshaped by the rcust restructure (P2d/P4):
`op_state` (run-scoped op context from $CCCI_OP_STATE_FILE) and `deps` (consolidated dep creds
with attribute sugar). Pure — exercised via request.getfixturevalue with env monkeypatched."""
from __future__ import annotations
import json
import pytest
def test_op_state_fixture_reads_file(tmp_path, monkeypatch, request):
f = tmp_path / "op.json"
f.write_text(json.dumps({"backup": {"snapshot_id": "abc123"}, "upgrade": {"head_ref": "h"}}))
monkeypatch.setenv("CCCI_OP_STATE_FILE", str(f))
st = request.getfixturevalue("op_state")
assert st["backup"]["snapshot_id"] == "abc123"
assert st["upgrade"]["head_ref"] == "h"
def test_op_state_fixture_skips_without_env(monkeypatch, request):
monkeypatch.delenv("CCCI_OP_STATE_FILE", raising=False)
with pytest.raises(pytest.skip.Exception, match="orchestrator"):
request.getfixturevalue("op_state")
def test_op_state_fixture_skips_on_missing_file(tmp_path, monkeypatch, request):
monkeypatch.setenv("CCCI_OP_STATE_FILE", str(tmp_path / "nope.json"))
with pytest.raises(pytest.skip.Exception, match="missing"):
request.getfixturevalue("op_state")
def test_deps_fixture_entries_expose_attributes(tmp_path, monkeypatch, request):
"""`deps` (session-scoped) coerces the run deps file into entries with .domain/.realm/...
attribute sugar while keeping dict-style access (rcust P2d). Single test for the session-
cached fixture (one instantiation)."""
f = tmp_path / "deps.json"
f.write_text(
json.dumps(
{"keycloak": {"recipe": "keycloak", "domain": "kc.x", "client_secret": "s3cret"}}
)
)
monkeypatch.setenv("CCCI_DEPS_FILE", str(f))
deps = request.getfixturevalue("deps")
assert deps["keycloak"].domain == "kc.x"
assert deps["keycloak"]["client_secret"] == "s3cret"
with pytest.raises(AttributeError):
_ = deps["keycloak"].not_a_field
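The "attribute sugar while keeping dict-style access" behaviour asserted above can be had from a small `dict` subclass. This is a sketch under assumptions: `DepEntry` and `coerce_deps` are hypothetical names, and the real fixture may be built differently.

```python
class DepEntry(dict):
    """Hypothetical dict subclass: keys readable as attributes, dict access kept."""

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails, so dict methods
        # such as .keys() and .items() keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


def coerce_deps(raw: dict) -> dict:
    """Wrap each per-recipe creds dict in a DepEntry."""
    return {recipe: DepEntry(creds) for recipe, creds in raw.items()}


deps = coerce_deps({"keycloak": {"domain": "kc.x", "client_secret": "s3cret"}})
domain = deps["keycloak"].domain
secret = deps["keycloak"]["client_secret"]
```

One caveat of this approach: a creds key that collides with a dict method name (for example `keys` or `items`) is reachable only through subscript access.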


@ -0,0 +1,96 @@
"""Unit tests for lifecycle.services_converged's completed-one-shot rule (rcust M2 fix-forward).
A TRIGGERED one-shot service (restart_policy none, scaled 0→1, runs once, exits 0) reports "0/1"
forever after its task completes — swarm never restarts it. A bare `cur != want` rejection then
blocks convergence for the REST OF THE RUN (lasuite-drive minio-createbuckets: the P2b port moved
the bucket trigger BEFORE the install assert, so the assert burned the full DEPLOY_TIMEOUT —
pre-restructure the trigger ran after the assert and converge never saw the 0/1).
Pins (the Adversary's non-vacuity criteria):
- deficit explained ENTIRELY by Complete tasks → converged (the one-shot did its job).
- deficit with a Failed task → NOT converged (a broken one-shot must not pass).
- deficit with a Running/Preparing task → NOT converged (still spinning up; no early green).
- deficit with NO tasks yet → NOT converged (still scheduling).
- plain N/N services still converge; plain 0/1-spinning-up still doesn't (regression guards).
"""
from __future__ import annotations
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import lifecycle as lc # noqa: E402
class _R:
def __init__(self, stdout="", stderr="", returncode=0):
self.stdout, self.stderr, self.returncode = stdout, stderr, returncode
def _patch_docker(monkeypatch, replicas_rows, task_states_by_service=None, update_state=""):
"""Fake subprocess.run for the three docker calls services_converged makes."""
task_states_by_service = task_states_by_service or {}
def fake_run(args, **kw):
if args[:3] == ["docker", "stack", "services"]:
return _R(stdout="\n".join(replicas_rows) + "\n")
if args[:3] == ["docker", "service", "ps"]:
name = args[3]
return _R(stdout="\n".join(task_states_by_service.get(name, [])) + "\n")
if args[:3] == ["docker", "service", "inspect"]:
return _R(stdout=update_state + "\n")
raise AssertionError(f"unexpected docker call: {args}")
monkeypatch.setattr(lc.subprocess, "run", fake_run)
def test_completed_oneshot_deficit_is_converged(monkeypatch):
_patch_docker(
monkeypatch,
["stack_app 1/1", "stack_minio-createbuckets 0/1"],
{"stack_minio-createbuckets": ["Complete 28 minutes ago"]},
)
assert lc.services_converged("app.example.com") is True
def test_failed_oneshot_deficit_is_not_converged(monkeypatch):
_patch_docker(
monkeypatch,
["stack_app 1/1", "stack_minio-createbuckets 0/1"],
{"stack_minio-createbuckets": ["Failed 2 minutes ago"]},
)
assert lc.services_converged("app.example.com") is False
def test_mixed_complete_and_failed_tasks_not_converged(monkeypatch):
_patch_docker(
monkeypatch,
["stack_oneshot 0/1"],
{"stack_oneshot": ["Complete 5 minutes ago", "Failed 6 minutes ago"]},
)
assert lc.services_converged("app.example.com") is False
def test_still_spinning_up_not_converged(monkeypatch):
_patch_docker(
monkeypatch,
["stack_app 0/1"],
{"stack_app": ["Preparing 10 seconds ago"]},
)
assert lc.services_converged("app.example.com") is False
def test_deficit_with_no_tasks_yet_not_converged(monkeypatch):
_patch_docker(monkeypatch, ["stack_app 0/1"], {"stack_app": []})
assert lc.services_converged("app.example.com") is False
def test_all_full_replicas_still_converged(monkeypatch):
_patch_docker(monkeypatch, ["stack_app 1/1", "stack_db 1/1"])
assert lc.services_converged("app.example.com") is True
def test_on_demand_zero_zero_oneshot_still_converged(monkeypatch):
_patch_docker(monkeypatch, ["stack_app 1/1", "stack_minio-createbuckets 0/0"])
assert lc.services_converged("app.example.com") is True
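The completed-one-shot rule pinned by these tests can be written as a pure function over the two inputs the fakes supply: `name cur/want` rows and per-service task-state lines. The sketch below is not `lifecycle.services_converged` itself; the function names are hypothetical, it ignores the update-state call, and it simplifies "deficit explained entirely by Complete tasks" to "at least one task, all of them Complete".

```python
def parse_replicas(row):
    """Split a 'name cur/want' row into (name, cur, want)."""
    name, frac = row.split()
    cur, want = frac.split("/")
    return name, int(cur), int(want)


def service_converged(cur, want, task_states):
    if cur >= want:
        return True  # plain N/N, or an untriggered 0/0 one-shot
    # First word of each line is the task state, e.g. "Complete 28 minutes ago".
    states = [line.split()[0] for line in task_states if line.strip()]
    if not states:
        return False  # deficit with no tasks yet: still scheduling
    return all(state == "Complete" for state in states)


def stack_converged(rows, tasks_by_service):
    for row in rows:
        name, cur, want = parse_replicas(row)
        if not service_converged(cur, want, tasks_by_service.get(name, [])):
            return False
    return True


rows = ["stack_app 1/1", "stack_minio-createbuckets 0/1"]
done = stack_converged(rows, {"stack_minio-createbuckets": ["Complete 28 minutes ago"]})
failed = stack_converged(rows, {"stack_minio-createbuckets": ["Failed 2 minutes ago"]})
mixed = stack_converged(
    ["stack_oneshot 0/1"],
    {"stack_oneshot": ["Complete 5 minutes ago", "Failed 6 minutes ago"]},
)
```

Treating any non-Complete task as blocking is the conservative choice: a broken or still-starting one-shot can never produce an early green.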


@ -1,9 +1,9 @@
"""Unit tests for runner/harness/deps.py (Phase 2 §4.2 / Q2.3).
Pure-Python: no real deploys. Tests the declarative parts of the dep resolver — declared_deps
reading from `tests/<recipe>/recipe_meta.py`, the per-dep domain derivation, and write/load of the
run state file. The deploy_deps + teardown_deps integration is exercised by real e2e against cc-ci
(Q2.4 acceptance).
Pure-Python: no real deploys. Tests the declarative parts of the dep resolver — DEPS declaration
(read through the single meta loader since rcust P1), the per-dep domain derivation, and write/load
of the run state file. The deploy_deps + teardown_deps integration is exercised by real e2e against
cc-ci (Q2.4 acceptance).
"""
from __future__ import annotations
@ -13,42 +13,23 @@ import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import deps # noqa: E402
from harness import meta as meta_mod # noqa: E402
def test_declared_deps_returns_empty_for_no_meta(monkeypatch, tmp_path):
"""A recipe with no recipe_meta.py returns []."""
fake_recipe = "ccci-no-meta"
# No file at tests/<fake_recipe>/recipe_meta.py -> declared_deps reads nothing -> []
monkeypatch.chdir(tmp_path)
assert deps.declared_deps(fake_recipe) == []
def test_declared_deps_empty_for_no_meta(monkeypatch, tmp_path):
"""A recipe with no recipe_meta.py declares no deps (rcust P1: DEPS via meta.load)."""
monkeypatch.setattr(meta_mod, "TESTS_DIR", str(tmp_path / "tests"))
assert meta_mod.load("ccci-no-meta").DEPS == []
def test_declared_deps_reads_DEPS_list(tmp_path, monkeypatch):
"""A recipe_meta.py with `DEPS = [...]` returns the list."""
fake_recipe = "ccci-with-deps"
# Build a fake repo layout under tmp_path
recipe_dir = tmp_path / "tests" / fake_recipe
"""A recipe_meta.py with `DEPS = [...]` surfaces the list on the loaded meta (the orchestrator
reads meta.DEPS — the successor of the deleted deps.declared_deps loader)."""
recipe_dir = tmp_path / "tests" / "ccci-with-deps"
recipe_dir.mkdir(parents=True)
(recipe_dir / "recipe_meta.py").write_text('HEALTH_PATH = "/"\nDEPS = ["keycloak", "redis"]\n')
# Patch the deps module's idea of "where the repo is" by monkey-patching __file__ for the
# function indirectly: declared_deps uses `os.path.dirname(__file__), "..", "..", "tests"` —
# which resolves to the real repo's `tests/`. So instead, override that with a symlink/dir
# under tmp_path: deps.__file__ points at the runner module. We can't easily relocate that.
# Instead, mock the path by writing the fake recipe under the REAL tests/ dir.
real_tests = os.path.join(os.path.dirname(deps.__file__), "..", "..", "tests")
target_dir = os.path.join(real_tests, fake_recipe)
os.makedirs(target_dir, exist_ok=True)
target_meta = os.path.join(target_dir, "recipe_meta.py")
try:
with open(target_meta, "w") as f:
f.write('DEPS = ["keycloak", "redis"]\n')
result = deps.declared_deps(fake_recipe)
assert result == ["keycloak", "redis"]
finally:
if os.path.exists(target_meta):
os.remove(target_meta)
if os.path.isdir(target_dir):
os.rmdir(target_dir)
monkeypatch.setattr(meta_mod, "TESTS_DIR", str(tmp_path / "tests"))
assert meta_mod.load("ccci-with-deps").DEPS == ["keycloak", "redis"]
def test_dep_domain_distinct_per_dep():
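The tests above swap `TESTS_DIR` on the meta module instead of patching `__file__`, which only works if a single loader resolves every `recipe_meta.py` from one root. A minimal sketch of such a loader follows; `load_meta`, the `DEFAULTS` table, and the use of `runpy` are assumptions for illustration and not the harness's `meta.load`.

```python
import copy
import os
import runpy
import tempfile
from types import SimpleNamespace

# Assumed defaults; the real loader's field set is not shown in the diff.
DEFAULTS = {"DEPS": [], "WARM_CANONICAL": False}


def load_meta(tests_dir, recipe):
    """Load tests_dir/<recipe>/recipe_meta.py over defaults; missing file means defaults."""
    values = copy.deepcopy(DEFAULTS)  # never hand out the shared default list
    path = os.path.join(tests_dir, recipe, "recipe_meta.py")
    if os.path.isfile(path):
        module_globals = runpy.run_path(path)
        # Overlay only UPPER_CASE names; dunders and imports are skipped.
        values.update({k: v for k, v in module_globals.items() if k.isupper()})
    return SimpleNamespace(**values)


with tempfile.TemporaryDirectory() as root:
    recipe_dir = os.path.join(root, "ccci-with-deps")
    os.makedirs(recipe_dir)
    with open(os.path.join(recipe_dir, "recipe_meta.py"), "w") as f:
        f.write('HEALTH_PATH = "/"\nDEPS = ["keycloak", "redis"]\n')
    with_deps = load_meta(root, "ccci-with-deps")
    no_meta = load_meta(root, "ccci-no-meta")
```

Passing the root explicitly (or reading it from one module-level constant, as the patched tests do) removes the need to write fixture files into the real `tests/` tree.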


@ -71,17 +71,18 @@ def test_repo_local_wins_when_approved(tmp_path):
def test_custom_tests_repo_local_gated(tmp_path, monkeypatch):
# non-lifecycle test_*.py from repo-local only count for approved recipes; lifecycle names excluded
# custom test_*.py from repo-local only count for approved recipes (HC2); placement rule
# (rcust P4): custom tests live under functional/ (or playwright/) — top-level files are
# lifecycle overlays only, so the repo-local custom here sits in functional/.
# Use a synthetic recipe name + monkeypatched cc_ci_dir so this is independent of what
# tests/<real-recipe>/ ships (Phase-2 custom-html now also ships functional/ + playwright/,
# which would legitimately appear in custom_tests for "custom-html" — F2-1).
# tests/<real-recipe>/ ships (F2-1).
fake_recipe = "ccci-hc2-fixture"
monkeypatch.setattr(discovery, "cc_ci_dir", lambda r: str(tmp_path / "cc-ci" / r))
(tmp_path / "cc-ci" / fake_recipe).mkdir(parents=True)
rl = tmp_path / "repo"
rl.mkdir()
(rl / "test_sso.py").write_text("# repo-local custom\n")
(rl / "test_install.py").write_text("# lifecycle name -> excluded from custom\n")
(rl / "functional").mkdir(parents=True)
(rl / "functional" / "test_sso.py").write_text("# repo-local custom\n")
(rl / "functional" / "test_install.py").write_text("# lifecycle name -> excluded from custom\n")
_approve(tmp_path) # not approved -> repo-local custom ignored
assert discovery.custom_tests(fake_recipe, str(rl)) == []
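
The HC2 default-deny gate this test pins can be sketched roughly as below. This is a minimal sketch, not the harness's `discovery` module: it assumes the approval file is a plain-text list with one recipe name per line (the form the manifest fixture writes), and the helper names `approved_recipes` and `repo_local_custom_tests` are hypothetical. Only the `CCCI_REPO_LOCAL_APPROVED_FILE` variable name comes from the tests.

```python
import os
from pathlib import Path

APPROVED_ENV = "CCCI_REPO_LOCAL_APPROVED_FILE"  # env var name taken from the tests


def approved_recipes() -> set[str]:
    """Recipe names listed in the approval file; a missing file approves nothing."""
    path = os.environ.get(APPROVED_ENV)
    if not path or not os.path.isfile(path):
        return set()
    lines = Path(path).read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def repo_local_custom_tests(recipe: str, repo_local_dir: str) -> list[str]:
    """Default-deny: repo-local custom tests count only for approved recipes."""
    if recipe not in approved_recipes():
        return []
    functional = Path(repo_local_dir) / "functional"
    return sorted(str(p) for p in functional.glob("test_*.py"))
```

Reading the approval list on every call keeps the gate stateless, which is what lets a test flip approval simply by rewriting the file or repointing the environment variable.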

View File

@ -1,6 +1,6 @@
"""Unit tests for Phase-2 discovery additions (plan §4.1).
Proves the `custom_tests` discovery recurses into the per-recipe `functional/` + `playwright/`
Proves the `custom_tests` discovery covers exactly the per-recipe `functional/` + `playwright/`
subdirs (top-level files are reserved for lifecycle overlays), while still excluding lifecycle `test_<op>.py` names and
honouring the HC2 repo-local approval gate.
@ -27,16 +27,16 @@ def teardown_function():
os.environ.pop("CCCI_REPO_LOCAL_APPROVED_FILE", None)
def test_custom_tests_recurses_functional_and_playwright(tmp_path, monkeypatch):
"""A Phase-2 cc-ci recipe layout: functional/test_*.py + playwright/test_*.py + top-level
test_*.py — all are discovered as custom tests; the lifecycle names are excluded."""
def test_custom_tests_placement_rule_functional_playwright_only(tmp_path, monkeypatch):
"""Placement rule (rcust P4): custom tests are discovered ONLY under functional/ +
playwright/. A top-level non-lifecycle test_*.py is NOT discovered (top level is reserved
for lifecycle overlays); lifecycle names inside the subdirs stay excluded (defensive)."""
# Point cc-ci's per-recipe dir at a fake recipe in tmp_path
fake_recipe = "ccci-phase2-fixture"
fake_dir = tmp_path / "tests" / fake_recipe
(fake_dir / "functional").mkdir(parents=True)
(fake_dir / "playwright").mkdir()
# legitimate custom tests at multiple levels
(fake_dir / "test_sso_smoke.py").write_text("# top-level cross-cutting\n")
(fake_dir / "test_sso_smoke.py").write_text("# top-level — NOT discovered since P4\n")
(fake_dir / "functional" / "test_health_check.py").write_text("# parity port\n")
(fake_dir / "functional" / "test_content_roundtrip.py").write_text("# recipe-specific\n")
(fake_dir / "playwright" / "test_login_flow.py").write_text("# UI flow\n")
@ -49,11 +49,11 @@ def test_custom_tests_recurses_functional_and_playwright(tmp_path, monkeypatch):
customs = discovery.custom_tests(fake_recipe, None)
names = sorted((src, os.path.basename(p)) for src, p in customs)
# Top-level + functional/ + playwright/ all discovered; lifecycle name excluded
assert ("cc-ci", "test_sso_smoke.py") in names
# functional/ + playwright/ discovered; top-level custom + lifecycle name are NOT
assert ("cc-ci", "test_health_check.py") in names
assert ("cc-ci", "test_content_roundtrip.py") in names
assert ("cc-ci", "test_login_flow.py") in names
assert ("cc-ci", "test_sso_smoke.py") not in names
assert ("cc-ci", "test_install.py") not in names
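
For readers skimming the diff, the placement rule asserted above amounts to something like the following. It is an illustrative sketch only: `custom_tests_in` is a hypothetical helper, and the lifecycle op list (`install`, `upgrade`, `backup`, `restore`) is an assumption inferred from the file names that appear in these tests, not the harness's actual list.

```python
import os

LIFECYCLE_OPS = ("install", "upgrade", "backup", "restore")  # assumed op names
CUSTOM_SUBDIRS = ("functional", "playwright")


def custom_tests_in(recipe_dir: str) -> list[str]:
    """Return custom test paths found under functional/ and playwright/ only."""
    lifecycle = {f"test_{op}.py" for op in LIFECYCLE_OPS}
    found = []
    for sub in CUSTOM_SUBDIRS:
        subdir = os.path.join(recipe_dir, sub)
        if not os.path.isdir(subdir):
            continue
        for name in sorted(os.listdir(subdir)):
            is_test = name.startswith("test_") and name.endswith(".py")
            # Lifecycle names stay excluded even inside the subdirs (defensive).
            if is_test and name not in lifecycle:
                found.append(os.path.join(subdir, name))
    return found
```

Because the top-level directory is never listed, a stray `test_sso_smoke.py` next to the lifecycle overlays is simply invisible to discovery rather than being filtered out later.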

View File

@ -30,7 +30,7 @@ def test_sso_dep_unverified_true_when_declared_notready_and_skipped():
def test_sso_dep_unverified_false_when_deps_ready():
"""deps ready (setup_custom_tests succeeded) → SSO tests actually ran → not a failure."""
"""deps ready (dep provisioning succeeded) → SSO tests actually ran → not a failure."""
assert not run_recipe_ci.sso_dep_unverified(
["keycloak"], deps_ready=True, requires_deps_skipped=0
)

View File

@ -14,6 +14,7 @@ So `-c` + owned-wait is non-vacuous: a genuinely-broken upgrade stays RED.
from __future__ import annotations
import dataclasses
import os
import sys
@ -21,6 +22,7 @@ import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import lifecycle as lc # noqa: E402
from harness import meta as harness_meta # noqa: E402
def _fake_clock(monkeypatch):
@ -31,11 +33,15 @@ def _fake_clock(monkeypatch):
return state
_DRIVE_META = {
"READY_PROBE": lambda d: [
{"host": f"collabora-{d}", "path": "/hosting/discovery", "ok": (200,)}
]
}
# RecipeMeta (rcust P1: wait_ready_probes reads meta.READY_PROBE off the loaded object); defaults
# + the drive-style probe hook (P3 ctx signature: the probe receives a HookCtx).
_DRIVE_META = dataclasses.replace(
harness_meta.load("ccci-no-such-recipe"),
READY_PROBE=lambda ctx: [
{"host": f"collabora-{ctx.domain}", "path": "/hosting/discovery", "ok": (200,)}
],
)
_NO_PROBE_META = harness_meta.load("ccci-no-such-recipe")
def test_wait_ready_probes_raises_when_never_ready(monkeypatch):
@ -57,7 +63,7 @@ def test_wait_ready_probes_returns_when_ready(monkeypatch):
def test_wait_ready_probes_noop_without_probe(monkeypatch):
"""A recipe with no READY_PROBE is a clean no-op (default behavior preserved for all recipes)."""
monkeypatch.setattr(lc, "http_get", lambda *a, **k: 599) # would fail if it were consulted
lc.wait_ready_probes({}, "x.ci.commoninternet.net", timeout=1) # no raise, no call
lc.wait_ready_probes(_NO_PROBE_META, "x.ci.commoninternet.net", timeout=1) # no raise, no call
def test_wait_healthy_raises_when_services_never_converge(monkeypatch):
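
The fixture pattern in this file (load defaults, then derive a variant with `dataclasses.replace`) works because the meta object is a frozen dataclass. A minimal sketch of that pattern follows, with stand-in `Meta` and `HookCtx` classes and a hypothetical `probes_for` helper; the real `RecipeMeta` has more fields and the real wait loop polls over HTTP.

```python
import dataclasses
from typing import Callable, Optional


@dataclasses.dataclass(frozen=True)
class HookCtx:
    """Stand-in for the hook context; only `domain` is modelled here."""
    domain: str


@dataclasses.dataclass(frozen=True)
class Meta:
    """Stand-in for the loaded recipe meta (two fields only)."""
    HTTP_TIMEOUT: int = 300
    READY_PROBE: Optional[Callable[[HookCtx], list]] = None


def probes_for(meta: Meta, domain: str) -> list:
    """Resolve the probe list; a recipe without READY_PROBE yields none (no-op)."""
    if meta.READY_PROBE is None:
        return []
    return meta.READY_PROBE(HookCtx(domain=domain))


defaults = Meta()
# replace() builds a new frozen instance; `defaults` itself is left untouched.
drive_like = dataclasses.replace(
    defaults,
    READY_PROBE=lambda ctx: [
        {"host": f"collabora-{ctx.domain}", "path": "/hosting/discovery", "ok": (200,)}
    ],
)
```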

tests/unit/test_manifest.py Normal file

@ -0,0 +1,177 @@
"""Unit tests for the customization manifest (rcust P5; spec §8 R4 mitigation).
The manifest is PURE PRESENTATION (must never influence a verdict); these tests pin that it is
COMPLETE (every customization surface a synthetic recipe exercises shows up), DETERMINISTIC
(same inputs -> byte-identical JSON), serializable, and HC2-honoring (unapproved repo-local
contributions are invisible). Pure / tmp-file only. Run cold:
cc-ci-run -m pytest tests/unit/test_manifest.py -q
"""
from __future__ import annotations
import json
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import discovery, manifest # noqa: E402
from harness import meta as meta_mod # noqa: E402
RECIPE = "ccci-manifest-fixture"
def _mk_synthetic(tmp_path, monkeypatch, approved=True):
"""A synthetic recipe dir exercising EVERY manifest surface, plus a repo-local tests dir.
cc-ci side: meta (2 data keys + 1 hook key non-default), ops.py (2 pre-ops), install_steps.sh,
compose.ccci.yml, test_backup.py overlay, 2 functional + 1 playwright custom tests.
repo-local side: test_restore.py overlay + 1 functional custom test (visible iff approved, HC2).
"""
ccci_root = tmp_path / "cc-ci-tests"
d = ccci_root / RECIPE
(d / "functional").mkdir(parents=True)
(d / "playwright").mkdir()
(d / "recipe_meta.py").write_text(
"HTTP_TIMEOUT = 600\n"
"DEPS = ['keycloak']\n"
"def EXTRA_ENV(ctx):\n return {}\n"
"_PRIVATE = 'exempt'\n"
)
(d / "ops.py").write_text("def pre_upgrade(ctx):\n pass\n\ndef pre_backup(ctx):\n pass\n")
(d / "install_steps.sh").write_text("#!/usr/bin/env bash\n")
(d / "compose.ccci.yml").write_text("version: '3.8'\n")
(d / "test_backup.py").write_text("# lifecycle overlay\n")
(d / "functional" / "test_a.py").write_text("# custom\n")
(d / "functional" / "test_b.py").write_text("# custom\n")
(d / "playwright" / "test_ui.py").write_text("# custom\n")
rl = tmp_path / "repo-local"
(rl / "functional").mkdir(parents=True)
(rl / "functional" / "test_c.py").write_text("# repo-local custom\n")
(rl / "test_restore.py").write_text("# repo-local lifecycle overlay\n")
monkeypatch.setattr(discovery, "cc_ci_dir", lambda r: str(ccci_root / r))
monkeypatch.setattr(meta_mod, "TESTS_DIR", str(ccci_root)) # compose.ccci.yml discovery
approved_file = tmp_path / "approved.txt"
approved_file.write_text(f"{RECIPE}\n" if approved else "")
monkeypatch.setenv("CCCI_REPO_LOCAL_APPROVED_FILE", str(approved_file))
meta = meta_mod.load(RECIPE, tests_dir=str(ccci_root))
return meta, str(rl)
def test_manifest_complete(tmp_path, monkeypatch):
# Every surface the synthetic recipe customizes appears — nothing silently dropped (R4).
meta, rl = _mk_synthetic(tmp_path, monkeypatch)
m = manifest.build(RECIPE, meta, rl)
assert m["meta_non_default"] == {
"DEPS": ["keycloak"],
"EXTRA_ENV": "<hook>",
"HTTP_TIMEOUT": 600,
}
assert m["hooks"] == {
"ops.py": {"cc-ci": ["pre_backup", "pre_upgrade"]},
"install_steps.sh": "cc-ci",
"compose.ccci.yml": "cc-ci",
}
assert m["overlays"] == {"backup": "cc-ci", "restore": "repo-local"}
assert m["custom_tests"] == {
"cc-ci": {"functional": 2, "playwright": 1},
"repo-local": {"functional": 1},
}
assert m["env_overrides"] == []
def test_manifest_deterministic_and_serializable(tmp_path, monkeypatch):
meta, rl = _mk_synthetic(tmp_path, monkeypatch)
a = manifest.build(RECIPE, meta, rl)
b = manifest.build(RECIPE, meta, rl)
assert json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
assert json.loads(json.dumps(a)) == a # round-trips: no callables/tuples leak through
def test_manifest_zero_config_floor(tmp_path, monkeypatch):
# A recipe with NO customization at all -> every section empty, render says so explicitly.
ccci_root = tmp_path / "cc-ci-tests"
(ccci_root / RECIPE).mkdir(parents=True)
monkeypatch.setattr(discovery, "cc_ci_dir", lambda r: str(ccci_root / r))
monkeypatch.setattr(meta_mod, "TESTS_DIR", str(ccci_root))
monkeypatch.setenv("CCCI_REPO_LOCAL_APPROVED_FILE", str(tmp_path / "missing.txt"))
meta = meta_mod.load(RECIPE, tests_dir=str(ccci_root))
m = manifest.build(RECIPE, meta, None)
assert m == {
"meta_non_default": {},
"hooks": {},
"overlays": {},
"custom_tests": {},
"env_overrides": [],
}
out = manifest.render(RECIPE, m)
assert f"===== customization manifest: {RECIPE} =====" in out
assert "(none — zero-config floor)" in out
def test_manifest_repo_local_hc2_gate(tmp_path, monkeypatch):
# Unapproved recipe -> repo-local overlay + custom tests INVISIBLE (same default-deny as the
# discovery they ride on; the manifest must not advertise code the run will not execute).
meta, rl = _mk_synthetic(tmp_path, monkeypatch, approved=False)
m = manifest.build(RECIPE, meta, rl)
assert m["overlays"] == {"backup": "cc-ci"} # repo-local test_restore.py gone
assert "repo-local" not in m["custom_tests"]
def test_manifest_env_overrides_and_ci_flag(tmp_path, monkeypatch):
meta, rl = _mk_synthetic(tmp_path, monkeypatch)
monkeypatch.setenv("CCCI_SKIP_GENERIC_BACKUP", "1")
monkeypatch.setenv("CCCI_SKIP_GENERIC_UPGRADE", "0") # falsy -> not an active override
m = manifest.build(RECIPE, meta, rl)
assert m["env_overrides"] == ["CCCI_SKIP_GENERIC_BACKUP"]
monkeypatch.delenv("DRONE", raising=False)
assert "!!" not in manifest.render(RECIPE, m) # local dev: no CI warning
monkeypatch.setenv("DRONE", "true") # riding a CI run -> loud flag (P2c)
assert "!! dev-only override active in CI" in manifest.render(RECIPE, m)
def test_manifest_redacts_sensitive_named_values(tmp_path, monkeypatch):
# Meta values are repo-public by construction, but the manifest lands on the dashboard:
# secret-NAMED entries (top-level or nested dict keys, e.g. plausible's
# EXTRA_ENV["SECRET_KEY_BASE"] dummy) render as '<redacted>' — name shown, value masked.
# Non-sensitive names (incl. KEYCLOAK_* — 'KEY' matches only as a word segment) pass through.
ccci_root = tmp_path / "cc-ci-tests"
d = ccci_root / RECIPE
d.mkdir(parents=True)
(d / "recipe_meta.py").write_text(
"EXTRA_ENV = {\n"
" 'SECRET_KEY_BASE': 'dummy-ci-constant',\n"
" 'API_KEY': 'also-dummy',\n"
" 'KEYCLOAK_URL': 'https://kc.example',\n"
"}\n"
)
monkeypatch.setattr(discovery, "cc_ci_dir", lambda r: str(ccci_root / r))
monkeypatch.setattr(meta_mod, "TESTS_DIR", str(ccci_root))
monkeypatch.setenv("CCCI_REPO_LOCAL_APPROVED_FILE", str(tmp_path / "missing.txt"))
meta = meta_mod.load(RECIPE, tests_dir=str(ccci_root))
m = manifest.build(RECIPE, meta, None)
assert m["meta_non_default"]["EXTRA_ENV"] == {
"SECRET_KEY_BASE": "<redacted>",
"API_KEY": "<redacted>",
"KEYCLOAK_URL": "https://kc.example",
}
out = manifest.render(RECIPE, m)
assert "dummy-ci-constant" not in out and "also-dummy" not in out
assert "SECRET_KEY_BASE" in out # the key NAME stays visible
def test_render_lists_every_surface(tmp_path, monkeypatch):
meta, rl = _mk_synthetic(tmp_path, monkeypatch)
out = manifest.render(RECIPE, manifest.build(RECIPE, meta, rl))
lines = out.splitlines()
assert lines[0] == f"===== customization manifest: {RECIPE} ====="
assert "meta (non-default): DEPS=['keycloak'] EXTRA_ENV='<hook>' HTTP_TIMEOUT=600" in lines
assert (
"hooks: ops.py[pre_backup,pre_upgrade](cc-ci) install_steps.sh(cc-ci) compose.ccci.yml(cc-ci)"
in lines
)
assert "overlays: test_backup.py(cc-ci) test_restore.py(repo-local)" in lines
assert "custom tests: functional/=2 playwright/=1 (cc-ci) functional/=1 (repo-local)" in lines
assert "env overrides: (none)" in lines
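
The redaction behaviour pinned by `test_manifest_redacts_sensitive_named_values` (mask by key name, match `KEY` only as a whole underscore-separated segment) could be implemented along these lines. This is a sketch under assumptions: the `SENSITIVE_SEGMENTS` word list and the `redact` and `is_sensitive` names are hypothetical, and the real `manifest` module may differ.

```python
import json

SENSITIVE_SEGMENTS = {"SECRET", "KEY", "TOKEN", "PASSWORD"}  # assumed word list


def is_sensitive(name: str) -> bool:
    """True when any underscore-separated segment of the name is sensitive."""
    return any(seg in SENSITIVE_SEGMENTS for seg in name.upper().split("_"))


def redact(value, name: str = ""):
    """Return a JSON-safe copy: hooks become '<hook>', secret-named values are masked."""
    if isinstance(value, dict):
        return {k: redact(v, str(k)) for k, v in value.items()}
    if callable(value):
        return "<hook>"
    if name and is_sensitive(name):
        return "<redacted>"
    return value


def to_json(manifest: dict) -> str:
    """Deterministic rendering: sorted keys give byte-identical output per input."""
    return json.dumps(redact(manifest), sort_keys=True)
```

Segment matching is what lets `KEYCLOAK_URL` pass through while `API_KEY` is masked; a plain substring test on `KEY` would redact both.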

tests/unit/test_meta.py Normal file

@ -0,0 +1,276 @@
"""Unit tests for the single recipe-meta loader + key registry (rcust P1; spec §8 R1/R6).
Covers: every in-repo recipe_meta.py loads clean through the registry (THE typo gate), validation
hard-errors (unknown key, wrong type, callable on a data key), the zero-config baseline defaults
(spec §2), the underscore exemption for recipe-private constants, and the registry↔generated-doc
sync (P1.5; drift fails CI). Run: cc-ci-run -m pytest tests/unit/test_meta.py -q
"""
from __future__ import annotations
import os
import subprocess
import sys
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import meta as meta_mod # noqa: E402
from harness.meta import KEYS, MetaError, RecipeMeta # noqa: E402
ROOT = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
def _recipes_with_meta() -> list[str]:
tests_dir = os.path.join(ROOT, "tests")
return sorted(
n
for n in os.listdir(tests_dir)
if os.path.isfile(os.path.join(tests_dir, n, "recipe_meta.py"))
)
# ---- the typo gate: every in-repo recipe meta must validate against the registry --------------
@pytest.mark.parametrize("recipe", _recipes_with_meta())
def test_every_recipe_meta_loads_clean(recipe):
"""All tests/*/recipe_meta.py in the repo load + validate through the registry. A typo'd or
unregistered ALL-CAPS key in any recipe meta fails HERE, at PR time — not silently at run
time (the R6 failure mode this restructure kills)."""
meta = meta_mod.load(recipe)
assert isinstance(meta, RecipeMeta)
# sanity: the 4 base keys always materialize with usable types
assert isinstance(meta.HEALTH_PATH, str)
assert isinstance(meta.HEALTH_OK, tuple) and meta.HEALTH_OK
assert isinstance(meta.DEPLOY_TIMEOUT, int) and isinstance(meta.HTTP_TIMEOUT, int)
# ---- zero-config baseline (spec §2) ------------------------------------------------------------
def test_missing_meta_yields_spec_baseline(tmp_path):
meta = meta_mod.load("no-such-recipe", tests_dir=str(tmp_path))
assert meta.HEALTH_PATH == "/"
assert meta.HEALTH_OK == (200, 301, 302)
assert meta.DEPLOY_TIMEOUT == 600
assert meta.HTTP_TIMEOUT == 300
assert meta.BACKUP_CAPABLE is None # None = auto-detect (tri-state, not False)
assert meta.EXPECTED_NA is None
assert meta.READY_PROBE is None
assert meta.UPGRADE_BASE_VERSION is None
assert meta.BACKUP_VERIFY is None
assert meta.UPGRADE_EXTRA_ENV is None
assert meta.EXTRA_ENV == {}
assert meta.DEPS == []
assert meta.WARM_CANONICAL is False
assert meta.SCREENSHOT is None
assert meta_mod.non_default(meta) == {}
def test_registry_field_set_matches_dataclass():
"""The RecipeMeta field set is generated from KEYS — no drift possible, pinned anyway."""
import dataclasses
assert [f.name for f in dataclasses.fields(RecipeMeta)] == [k.name for k in KEYS]
# the 14 final keys, no more (the 3 P2-deleted legacy keys are gone from the registry,
# so any recipe_meta still setting them hard-fails the typo gate)
assert len(KEYS) == 14
assert not [k for k in KEYS if k.deprecated]
for gone in ("CHAOS_BASE_DEPLOY", "OIDC_AT_INSTALL", "SKIP_GENERIC"):
assert gone not in {k.name for k in KEYS}
# ---- validation hard errors (locked decision: fail fast at load) -------------------------------
def _write_meta(tmp_path, body: str, recipe: str = "r") -> str:
d = tmp_path / recipe
d.mkdir(exist_ok=True)
(d / "recipe_meta.py").write_text(body)
return recipe
def test_unknown_key_raises_with_suggestion(tmp_path):
r = _write_meta(tmp_path, "READINESS_PROBE = None\n") # the R6 typo example
with pytest.raises(MetaError) as ei:
meta_mod.load(r, tests_dir=str(tmp_path))
msg = str(ei.value)
assert "READINESS_PROBE" in msg and "READY_PROBE" in msg # names the typo + nearest key
def test_unknown_key_without_near_match_lists_registry(tmp_path):
r = _write_meta(tmp_path, "TOTALLY_BOGUS_KNOB = 1\n")
with pytest.raises(MetaError) as ei:
meta_mod.load(r, tests_dir=str(tmp_path))
assert "HEALTH_PATH" in str(ei.value) # registered keys listed for the reader
def test_wrong_type_raises(tmp_path):
r = _write_meta(tmp_path, 'DEPLOY_TIMEOUT = "900"\n')
with pytest.raises(MetaError, match="DEPLOY_TIMEOUT"):
meta_mod.load(r, tests_dir=str(tmp_path))
def test_bool_not_accepted_as_int(tmp_path):
r = _write_meta(tmp_path, "DEPLOY_TIMEOUT = True\n")
with pytest.raises(MetaError, match="DEPLOY_TIMEOUT"):
meta_mod.load(r, tests_dir=str(tmp_path))
def test_callable_on_data_key_rejected(tmp_path):
r = _write_meta(tmp_path, "def HEALTH_PATH():\n return '/'\n")
with pytest.raises(MetaError, match="hook-typed"):
meta_mod.load(r, tests_dir=str(tmp_path))
def test_non_callable_on_hook_key_rejected(tmp_path):
r = _write_meta(tmp_path, "READY_PROBE = ['not', 'a', 'callable']\n")
with pytest.raises(MetaError, match="READY_PROBE"):
meta_mod.load(r, tests_dir=str(tmp_path))
def test_underscore_names_are_private_and_exempt(tmp_path):
r = _write_meta(
tmp_path,
"_WELCOME_TEXT_MARKER = 'marker-xyz'\n_MAX_USERS = 42\n"
"EXTRA_ENV = {'WELCOME_TEXT': _WELCOME_TEXT_MARKER, 'USERS': str(_MAX_USERS)}\n",
)
meta = meta_mod.load(r, tests_dir=str(tmp_path))
assert meta.EXTRA_ENV == {"WELCOME_TEXT": "marker-xyz", "USERS": "42"}
def test_lowercase_helpers_ignored(tmp_path):
r = _write_meta(
tmp_path,
"def _helper(d):\n return {'K': d}\n\ndef EXTRA_ENV(ctx):\n return _helper(ctx.domain)\n",
)
meta = meta_mod.load(r, tests_dir=str(tmp_path))
ctx = meta_mod.hook_ctx("x.example", meta)
assert meta_mod.extra_env(meta, ctx) == {"K": "x.example"}
# ---- normalization + helpers --------------------------------------------------------------------
def test_health_ok_list_normalized_to_tuple(tmp_path):
r = _write_meta(tmp_path, "HEALTH_OK = [200, 302]\n")
assert meta_mod.load(r, tests_dir=str(tmp_path)).HEALTH_OK == (200, 302)
def test_extra_env_dict_and_callable_forms(tmp_path):
r = _write_meta(tmp_path, "EXTRA_ENV = {'A': 1}\n")
meta = meta_mod.load(r, tests_dir=str(tmp_path))
assert meta_mod.extra_env(meta, meta_mod.hook_ctx("d", meta)) == {"A": "1"} # stringified
r2 = _write_meta(
tmp_path, "UPGRADE_EXTRA_ENV = lambda ctx: {'COMPOSE_FILE': ctx.domain}\n", recipe="r2"
)
meta2 = meta_mod.load(r2, tests_dir=str(tmp_path))
ctx2 = meta_mod.hook_ctx("dom.x", meta2, op="upgrade")
assert meta_mod.upgrade_extra_env(meta2, ctx2) == {"COMPOSE_FILE": "dom.x"}
assert meta_mod.extra_env(meta2, ctx2) == {} # unset EXTRA_ENV resolves to {}
# ---- P3: uniform ctx hook convention -------------------------------------------------------------
def test_hook_ctx_fields(tmp_path):
meta = meta_mod.load("no-such", tests_dir=str(tmp_path))
ctx = meta_mod.hook_ctx("app.ci.example", meta, op="backup")
assert ctx.domain == "app.ci.example"
assert ctx.base_url == "https://app.ci.example"
assert ctx.meta is meta
assert ctx.op == "backup"
assert meta_mod.hook_ctx("d", meta).op is None
def test_hook_ctx_deps_from_run_file(tmp_path, monkeypatch):
import json
meta = meta_mod.load("no-such", tests_dir=str(tmp_path))
monkeypatch.delenv("CCCI_DEPS_FILE", raising=False)
assert meta_mod.hook_ctx("d", meta).deps is None
f = tmp_path / "deps.json"
f.write_text(json.dumps({"keycloak": {"recipe": "keycloak", "domain": "kc.x"}}))
monkeypatch.setenv("CCCI_DEPS_FILE", str(f))
deps = meta_mod.hook_ctx("d", meta).deps
assert deps["keycloak"]["domain"] == "kc.x"
f.write_text("{}") # empty dict -> None (deps declared but not provisioned)
assert meta_mod.hook_ctx("d", meta).deps is None
def test_legacy_hook_signature_raises_clear_meta_error(tmp_path):
"""A pre-restructure hook signature must fail AT LOAD with a migration message — never a
silent TypeError mid-run (P3.4)."""
r = _write_meta(tmp_path, "def READY_PROBE(domain):\n return []\n")
with pytest.raises(MetaError, match="ctx"):
meta_mod.load(r, tests_dir=str(tmp_path))
r2 = _write_meta(tmp_path, "EXTRA_ENV = lambda domain: {}\n", recipe="r2")
with pytest.raises(MetaError, match="restructure"):
meta_mod.load(r2, tests_dir=str(tmp_path))
r3 = _write_meta(
tmp_path, "def SCREENSHOT(page, domain, meta):\n return None\n", recipe="r3"
)
with pytest.raises(MetaError, match="page, ctx"):
meta_mod.load(r3, tests_dir=str(tmp_path))
def test_ctx_hook_signatures_accepted(tmp_path):
r = _write_meta(
tmp_path,
"def READY_PROBE(ctx):\n return []\n"
"def BACKUP_VERIFY(ctx):\n return True\n"
"def SCREENSHOT(page, ctx):\n return None\n"
"def EXTRA_ENV(ctx):\n return {}\n",
)
meta = meta_mod.load(r, tests_dir=str(tmp_path))
assert callable(meta.READY_PROBE) and callable(meta.SCREENSHOT)
def test_check_hook_signature_for_pre_op_hooks():
"""The orchestrator validates ops.py pre_<op> hooks with the same checker (legacy
(domain, meta) form names the migration)."""
def legacy(domain, meta):
pass
def new(ctx):
pass
with pytest.raises(MetaError, match="ctx"):
meta_mod.check_hook_signature(legacy, ("ctx",), "tests/x/ops.py::pre_upgrade")
meta_mod.check_hook_signature(new, ("ctx",), "tests/x/ops.py::pre_upgrade") # no raise
def test_non_default_reports_only_customized_keys(tmp_path):
r = _write_meta(tmp_path, "DEPLOY_TIMEOUT = 1500\nDEPS = ['keycloak']\n")
nd = meta_mod.non_default(meta_mod.load(r, tests_dir=str(tmp_path)))
assert nd == {"DEPLOY_TIMEOUT": 1500, "DEPS": ["keycloak"]}
def test_meta_is_frozen():
import dataclasses
meta = meta_mod.load("custom-html")
with pytest.raises(dataclasses.FrozenInstanceError):
meta.DEPLOY_TIMEOUT = 1
# ---- doc generation sync (P1.5: the committed §4 table == the registry rendering) ---------------
def test_generated_doc_table_in_sync():
"""docs/recipe-customization.md's key reference table is GENERATED from the registry
(scripts/gen-meta-docs.py). If this fails: re-run `python3 scripts/gen-meta-docs.py` and
commit the result — the table must never drift from the registry (R5)."""
gen = os.path.join(ROOT, "scripts", "gen-meta-docs.py")
doc = os.path.join(ROOT, "docs", "recipe-customization.md")
rendered = subprocess.run(
[sys.executable, gen, "--print"], capture_output=True, text=True, check=True
).stdout
with open(doc) as f:
committed = f.read()
assert rendered.strip() in committed, (
"docs/recipe-customization.md key table is out of sync with the harness.meta registry — "
"run `python3 scripts/gen-meta-docs.py` and commit"
)
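
The typo gate and the type checks exercised in this file can be illustrated with a small validator. This is a sketch only: the three data keys and one hook key are an illustrative subset, `validate` is a hypothetical function, and the use of `difflib.get_close_matches` for the "nearest key" hint is an assumption about how such a suggestion might be produced, not a statement about `harness.meta`.

```python
import difflib


class MetaError(Exception):
    """Raised at load time for an invalid recipe meta (name mirrors the tests)."""


# Illustrative subset of a key registry: data keys map to their expected type.
DATA_KEYS = {"HEALTH_PATH": str, "DEPLOY_TIMEOUT": int, "HTTP_TIMEOUT": int}
HOOK_KEYS = {"READY_PROBE"}


def validate(namespace: dict) -> dict:
    """Check a recipe_meta namespace against the registry, failing fast."""
    known = sorted(set(DATA_KEYS) | HOOK_KEYS)
    checked = {}
    for name, value in namespace.items():
        if name.startswith("_") or not name.isupper():
            continue  # private constants and lowercase helpers are exempt
        if name not in known:
            near = difflib.get_close_matches(name, known, n=1)
            hint = f"did you mean {near[0]}?" if near else f"known keys: {', '.join(known)}"
            raise MetaError(f"unknown key {name}; {hint}")
        if name in HOOK_KEYS:
            if not callable(value):
                raise MetaError(f"{name} must be a callable hook")
        elif isinstance(value, bool) or not isinstance(value, DATA_KEYS[name]):
            # bool is a subclass of int, so it is rejected explicitly
            # (this subset has no bool-typed key).
            raise MetaError(f"{name} must be {DATA_KEYS[name].__name__}")
        checked[name] = value
    return checked
```

Raising at load time rather than at use time is the point of the gate: a misspelled key fails the unit run for the PR instead of being silently ignored during a deploy.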

View File

@ -280,6 +280,41 @@ def test_build_results_threads_expected_na(tmp_path):
) # backup_restore declared; functional passed → clean
def test_build_results_threads_customization(tmp_path):
# rcust P5: the run-start customization manifest lands verbatim under "customization";
# omitted -> explicit None (key always present in the schema).
recs = [
{
"tier": "install",
"source": "generic",
"file": "g/test_install.py",
"rc": 0,
"junit": _write(tmp_path, "i.xml", JUNIT_PASS),
},
]
cust = {
"meta_non_default": {"HTTP_TIMEOUT": 600},
"hooks": {"install_steps.sh": "cc-ci"},
"overlays": {},
"custom_tests": {"cc-ci": {"functional": 2}},
"env_overrides": [],
}
kwargs = {
"recipe": "hedgedoc",
"version": "1.2.3",
"pr": "7",
"ref": None,
"records": recs,
"results": _results(),
"backup_capable": True,
"clean_teardown": True,
"no_secret_leak": True,
"finished_ts": 0.0,
}
assert R.build_results(**kwargs, customization=cust)["customization"] == cust
assert R.build_results(**kwargs)["customization"] is None
def test_write_results_roundtrip(tmp_path):
data = {"run_id": "42", "level": 3, "stages": []}
path = R.write_results(data, runs_dir_override=str(tmp_path))

View File

@ -11,6 +11,7 @@ import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import meta as meta_mod # noqa: E402
from harness import screenshot as S # noqa: E402
@ -29,3 +30,103 @@ def test_hook_returned_when_callable():
pass
assert S._load_screenshot_hook({"SCREENSHOT": hook}) is hook
class _FakePage:
"""Minimal Playwright-page stand-in for the settle/blank-retry helpers (no browser needed)."""
def __init__(self, shot_sizes, idle_raises=False):
self._shot_sizes = list(shot_sizes) # bytes written per successive screenshot() call
self._idle_raises = idle_raises
self.idle_waits = [] # (state, timeout) per wait_for_load_state call
self.timeout_waits = [] # ms per wait_for_timeout call
self.shots = 0
def wait_for_load_state(self, state, timeout=None):
self.idle_waits.append((state, timeout))
if self._idle_raises:
raise TimeoutError(f"page kept polling past {timeout}ms")
def wait_for_timeout(self, ms):
self.timeout_waits.append(ms)
def screenshot(self, path, full_page=False):
self.shots += 1
with open(path, "wb") as f:
f.write(b"\x89PNG" + b"\0" * (self._shot_sizes.pop(0) - 4))
def test_settle_swallows_never_idle_pages():
"""R7: an app that never reaches network-idle (continuous polling) must not raise — the
timeout cap IS the wait."""
page = _FakePage([], idle_raises=True)
S._settle(page, 1234) # must not raise
assert page.idle_waits == [("networkidle", 1234)]
assert page.timeout_waits == [S.RENDER_GRACE_MS]
def test_snap_retries_blank_frame(tmp_path):
"""A blank-sized first frame (audit fingerprint: 4801 B) triggers exactly one retry with a
longer settle, overwriting the tiny frame with the later (painted) one."""
out = str(tmp_path / "shot.png")
page = _FakePage([4801, 30256])
S._snap_with_blank_retry(page, out)
assert page.shots == 2
assert page.idle_waits == [("networkidle", S.BLANK_RETRY_SETTLE_MS)]
assert os.path.getsize(out) == 30256
def test_snap_no_retry_for_real_frame(tmp_path):
"""A real-sized first frame is kept as-is — no second screenshot, no extra waiting."""
out = str(tmp_path / "shot.png")
page = _FakePage([35707])
S._snap_with_blank_retry(page, out)
assert page.shots == 1
assert page.idle_waits == []
assert os.path.getsize(out) == 35707
def test_snap_retry_keeps_late_frame_even_if_still_blank(tmp_path):
"""If the retry frame is still tiny we keep it (honest best-effort) — exactly one retry,
never a loop."""
out = str(tmp_path / "shot.png")
page = _FakePage([4801, 4801])
S._snap_with_blank_retry(page, out)
assert page.shots == 2
assert os.path.getsize(out) == 4801
def test_blank_threshold_brackets_observed_sizes():
"""Threshold sits between the audited defect sizes (blank 4801-4802 B, lone spinners up to
8764 B) and the smallest real page (custom-html-tiny, 12950 B)."""
for defect in (4801, 4802, 5895, 6022, 7913, 8764):
assert defect < S.BLANK_SIZE_BYTES
assert S.BLANK_SIZE_BYTES < 12950
def test_wait_budget_within_step_cap():
"""plan-phase-shot §3 P3: the screenshot step's bounded waiting must stay ≤ ~60s worst case."""
total_ms = (
S.NAV_DEADLINE_S * 1000
+ S.SETTLE_TIMEOUT_MS
+ S.RENDER_GRACE_MS
+ S.BLANK_RETRY_SETTLE_MS
+ S.RENDER_GRACE_MS
)
assert total_ms <= 60_000, f"screenshot wait budget {total_ms}ms exceeds the ~60s step cap"
def test_screenshot_reachable_through_real_load_path(tmp_path):
"""R2 proof (rcust P1): a recipe SCREENSHOT hook declared in recipe_meta.py arrives at
screenshot._load_screenshot_hook through the REAL orchestrator load path (meta.load — the
object run_recipe_ci passes to capture()). Under the old six-loader world the orchestrator's
L1 allowlist dropped SCREENSHOT, so the hook was unreachable (spec §8 R2)."""
d = tmp_path / "shotrecipe"
d.mkdir()
(d / "recipe_meta.py").write_text(
"def SCREENSHOT(page, ctx):\n return None\n",
)
meta = meta_mod.load("shotrecipe", tests_dir=str(tmp_path))
hook = S._load_screenshot_hook(meta)
assert callable(hook), "SCREENSHOT hook did not survive the orchestrator load path (R2)"
assert S._load_screenshot_hook(meta_mod.load("no-such", tests_dir=str(tmp_path))) is None
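
The settle and blank-frame retry behaviour that the `_FakePage` tests pin down can be sketched as follows. This is a hedged illustration, not the contents of `harness.screenshot`: the three constant values are placeholders chosen for the example (the threshold merely sits between the 8764 B and 12950 B sizes quoted in the tests), and `settle` and `snap_with_blank_retry` are stand-in names. The page methods used are the Playwright sync-API calls the fake page mimics.

```python
import os

BLANK_SIZE_BYTES = 10_000       # assumed: between the 8764 B spinner and 12950 B real page
BLANK_RETRY_SETTLE_MS = 10_000  # assumed value
RENDER_GRACE_MS = 500           # assumed value


def settle(page, timeout_ms: int) -> None:
    """Wait for network idle, treating the timeout as the cap rather than a failure."""
    try:
        page.wait_for_load_state("networkidle", timeout=timeout_ms)
    except Exception:
        pass  # a page that keeps polling never goes idle; carry on best-effort
    page.wait_for_timeout(RENDER_GRACE_MS)


def snap_with_blank_retry(page, path: str) -> None:
    """Screenshot once; if the file is blank-sized, settle longer and retry once."""
    page.screenshot(path=path, full_page=True)
    if os.path.getsize(path) >= BLANK_SIZE_BYTES:
        return
    settle(page, BLANK_RETRY_SETTLE_MS)
    # Exactly one retry: the later frame is kept even if it is still small.
    page.screenshot(path=path, full_page=True)
```

Using file size as the blank-frame signal is cheap and needs no image decoding, at the cost of relying on an empirically chosen threshold, which is why the tests bracket it with observed sizes.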