JOURNAL — sub-phase rcust (Builder)
2026-06-10 bootstrap
Read phase plan (recipe-custom-restructure-full-plan.md), plan.md §6.1/§7/§9, and the reference
spec docs/recipe-customization.md @ 76a4b6b in full. Created phase state files. Work branch will
be restructure/recipe-custom off main @ 76a4b6b. Starting P1: reading the six current loaders
(run_recipe_ci.py::_load_meta, conftest.py::_recipe_meta, lifecycle.py::_recipe_extra_env,
lifecycle.py::_recipe_meta_flag, deps.py::declared_deps, canonical.py::is_canonical_enrolled)
before writing harness/meta.py.
2026-06-10 P1 — single loader + registry (branch 472a68b)
Wrote runner/harness/meta.py: KEYS registry (14 keys + CHAOS_BASE_DEPLOY/OIDC_AT_INSTALL/
SKIP_GENERIC kept registered as deprecated=True so P1 lands green before P2 deletes them),
RecipeMeta generated from KEYS via dataclasses.make_dataclass (frozen; field set cannot drift from
the registry), load() = the only exec() of recipe_meta.py, MetaError on unknown ALL-CAPS/type
mismatch/callable-on-data-key, difflib suggestion in the unknown-key message. BACKUP_CAPABLE keeps
its tri-state via default None (None = auto-detect — preserves the old "BACKUP_CAPABLE" in meta
semantics in generic.backup_capable).
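For future readers, a minimal sketch of the registry-generates-the-dataclass idea described above. It is not the real harness/meta.py: the three-key registry, the Key tuple, and load_source() are hypothetical stand-ins, and only the names KEYS, RecipeMeta and MetaError come from the notes.

```python
import dataclasses
import difflib
from typing import Any, NamedTuple


class MetaError(Exception):
    """Invalid recipe_meta.py content."""


class Key(NamedTuple):
    typ: type
    default: Any = None
    hook: bool = False  # True: the value must be a callable


# Hypothetical registry with three keys (the real one has 14).
KEYS = {
    "DEPLOY_TIMEOUT": Key(int, 600),
    "BACKUP_CAPABLE": Key(bool, None),  # None = auto-detect (tri-state)
    "READY_PROBE": Key(object, None, hook=True),
}

# The field set is generated from KEYS, so it cannot drift from the registry.
RecipeMeta = dataclasses.make_dataclass(
    "RecipeMeta",
    [(name, Any, dataclasses.field(default=key.default)) for name, key in KEYS.items()],
    frozen=True,
)


def load_source(source: str):
    """Execute recipe source once and validate every ALL-CAPS name in it."""
    namespace: dict = {}
    exec(source, namespace)  # the only exec() of recipe source in this sketch
    values = {}
    for name, value in namespace.items():
        if name.startswith("_") or not name.isupper():
            continue  # ignore imports, helpers and dunders
        if name not in KEYS:
            close = difflib.get_close_matches(name, list(KEYS), n=1)
            hint = f" (did you mean {close[0]}?)" if close else ""
            raise MetaError(f"unknown key {name}{hint}")
        key = KEYS[name]
        if key.hook:
            if not callable(value):
                raise MetaError(f"{name} must be callable")
        elif callable(value):
            raise MetaError(f"{name} is a data key, got a callable")
        elif not isinstance(value, key.typ):
            raise MetaError(f"{name} expects {key.typ.__name__}")
        values[name] = value
    return RecipeMeta(**values)
```

A typo such as DEPLOY_TIMOUT fails the load with a suggestion instead of being silently ignored, which is the point of routing every consumer through one loader.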
Migrations: orchestrator loads once + passes meta down (deploy_app/perform_upgrade/_perform_op/run_lifecycle_tier all take the object); conftest meta fixture returns full RecipeMeta (R3 closed); lifecycle._recipe_extra_env/_recipe_meta_flag and deps.declared_deps deleted; canonical.is_enrolled + enrolled_recipes go through meta.load (tests monkeypatch meta.TESTS_DIR now instead of canonical.file); screenshot._load_screenshot_hook reads the attribute (R2 fixed: unit test proves SCREENSHOT survives the real orchestrator load path). deploy_app keeps an optional meta=None fallback (loads via the single loader) for fixture/manual callers; exec still happens in exactly one function.
Effective-value safety check before committing: dumped non_default() for all 21 recipe dirs through the new loader — every recipe's customized key set matches its recipe_meta.py source (e.g. mumble: DEPLOY_TIMEOUT/EXTRA_ENV/HEALTH_OK/READY_PROBE/UPGRADE_EXTRA_ENV). One intentional delta class: deps.deploy_deps' fallback timeouts for a MISSING dep meta change from literal 900/600 to loading the dep's real meta (orchestrator path always supplied metas, so CI behavior is identical).
Verified on cc-ci (rsynced working tree before committing): cc-ci-run -m pytest tests/unit -q -> 175 passed; nix develop .#lint --command scripts/lint.sh -> lint: PASS. Three pre-existing f212 unit tests were handing plain dicts to wait_ready_probes; updated mechanically to construct RecipeMeta via dataclasses.replace (assertions untouched).
Next: P2a compose.ccci.yml first-class + auto-chaos.
2026-06-10 P2 — legacy keys & paths deleted (branch 8cd72fd)
P2a: lifecycle.provide_ccci_overlay copies tests//compose.ccci.yml into the per-run checkout (after install_steps hook, before prepull/deploy); pinned base deploys auto-chaos on overlay presence (has_ccci_overlay replaces the meta.CHAOS_BASE_DEPLOY elif). ghost/discourse install_steps.sh were copy-only -> deleted whole; their metas keep COMPOSE_FILE in EXTRA_ENV (unchanged wiring, the harness now owns the copy).
P2b: oidc_at_install condition removed — if declared: provisions before the single deploy,
legacy post-deploy block + _run_setup_custom_tests_hook deleted. lasuite-docs install_steps.sh is
the meet/drive hook with docs' exact env names (diffed against the deleted setup_custom_tests.sh:
same keys incl. OIDC_OP_DISCOVERY_ENDPOINT + scopes 'openid email profile'; secret-insert bump
identical; only the abra-redeploy step is gone — the single deploy reads the env instead).
lasuite-drive's MinIO bucket one-shot -> ops.py pre_install (runs at install-tier start, post-
deploy; bucket lives in the minio volume so it survives upgrade/restore; same scale --detach +
30x3s poll as the shell version). run_quick: deps still provision (realm/creds), hook call gone —
no quick-enrolled recipe declares DEPS today; noted inline.
P2c: SKIP_GENERIC out of the registry; _skip_generic(op) env-only; skip_generic_env_overrides()
prints a !! warning when active under DRONE (P5 will embed in the manifest).
P2d: conftest deps fixture = dict of _DepEntry (dict subclass w/ attribute sugar) — the 6 lasuite files only ever used deps_creds, renamed param to deps, zero assertion changes. NOTE for Adversary: some assert MESSAGE strings ('setup_custom_tests should have populated this.' -> 'dep provisioning...') and docstrings updated — message text only, no assert logic/expected values.
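A minimal sketch of the "dict subclass w/ attribute sugar" shape, for reference. The class name DepEntry and the domain/creds fields are illustrative assumptions, not the real _DepEntry.

```python
class DepEntry(dict):
    """Dict whose keys can also be read as attributes (read-only sugar)."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods
        # such as .keys() and .items() keep working unchanged.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# Hypothetical entry; the field names are assumptions for illustration.
entry = DepEntry(domain="keycloak.example.test", creds={"user": "admin"})
```

Because it is still a dict, code that indexes entry["creds"] and code that reads entry.creds can coexist, which is what lets a rename of the fixture parameter land without touching assertions.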
Verified on cc-ci (rsync of working tree): cc-ci-run -m pytest tests/unit -q -> 175 passed; nix develop .#lint --command scripts/lint.sh -> PASS. Doc table regenerated to the 14-key registry (doc-sync unit test pins it).
Next: P3 — HookCtx + ctx-hook signatures everywhere.
2026-06-10 P3 — uniform ctx hook convention (branch fd02d9f)
HookCtx frozen dataclass + hook_ctx() constructor in harness/meta.py; ctx.deps read straight from
$CCCI_DEPS_FILE (json, both shapes) — meta.py stays import-cycle-free (deps.py imports lifecycle
which imports meta). Registry keys carry hook_params; meta.load() enforces the expected positional
names per hook key (READY_PROBE/BACKUP_VERIFY/EXTRA_ENV/UPGRADE_EXTRA_ENV=(ctx,),
SCREENSHOT=(page, ctx)); run_pre_hook applies meta.check_hook_signature(fn, ("ctx",)) to ops.py
hooks before calling. Conversion of 17 ops.py + 8 recipe_meta hooks was scripted (def-line regex +
bare domain -> ctx.domain inside the pre*/hook function bodies only) and diff-reviewed; the
only manual fixes: keycloak pre_restore passed meta -> ctx.meta, and two comment lines in
lasuite-drive/-meet metas that the regex over-replaced were restored. wait_ready_probes gained
op= (install/upgrade call sites pass it) so probes can know the phase.
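One way the positional-name enforcement could look, as a sketch only. check_hook_signature is the name used above, but this body is an assumption built on the standard inspect module, and the local MetaError stands in for the harness one.

```python
import inspect


class MetaError(Exception):
    """Hook does not follow the ctx convention."""


_POSITIONAL = (
    inspect.Parameter.POSITIONAL_ONLY,
    inspect.Parameter.POSITIONAL_OR_KEYWORD,
)


def check_hook_signature(fn, expected):
    """Raise MetaError unless fn's positional parameter names equal expected."""
    params = inspect.signature(fn).parameters.values()
    names = tuple(p.name for p in params if p.kind in _POSITIONAL)
    if names != tuple(expected):
        raise MetaError(
            f"{fn.__name__}{names} does not match expected {tuple(expected)}"
        )


def pre_install(ctx):  # new-style hook: accepted for ("ctx",)
    return ctx


def legacy_pre_install(domain, meta):  # old-style hook: rejected
    return domain
```

Checking names rather than just arity is what catches a half-converted hook that still takes (domain,) as its single argument.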
Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 180 passed; lint PASS.
Next: P4 — discovery placement rule + op_state/deps fixtures + migrate hand-parsers.
2026-06-10 P4 — custom-test ergonomics (branch 29a28e2)
Pre-change sweeps confirmed the plan's zero-users claims: no top-level non-lifecycle test_*.py in any recipe dir; no recipe test file reads os.environ / CCCI_OP_STATE_FILE directly (the only op-state consumers are the generic assertions via harness.generic.op_state — harness-side, fine). So P4 = discovery glob removal + new op_state fixture + pinning tests; no test migrations needed. test_discovery.py's HC2 gate test moved its repo-local custom fixture under functional/ (the rule); test_discovery_phase2.py now asserts top-level custom is NOT discovered. op_state fixture skips (clear reason) when env unset / file missing / unparseable; tested via request.getfixturevalue.
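The three skip conditions of the op_state fixture can be sketched as a plain function that returns either the parsed state or a skip reason. Only the CCCI_OP_STATE_FILE variable comes from the notes; read_op_state and its return shape are assumptions, and a real fixture would call pytest.skip(reason) where this returns a reason.

```python
import json
import os


def read_op_state(environ=os.environ):
    """Return (state, None) on success or (None, reason) when a test should skip."""
    path = environ.get("CCCI_OP_STATE_FILE")
    if not path:
        return None, "CCCI_OP_STATE_FILE is not set"
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh), None
    except FileNotFoundError:
        return None, f"op-state file missing: {path}"
    except json.JSONDecodeError as exc:
        return None, f"op-state file unparseable: {exc}"
```

Keeping the reason string explicit for each branch is what makes the skip "clear" in the pytest report rather than a bare SKIPPED.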
Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 184 passed; lint PASS.
Next: P5 — customization manifest (print block + results.json key).
2026-06-10 P5 — customization manifest (branch 68954be)
(Resumed after a usage-limit pause mid-P5; working tree carried the in-flight manifest.py.) New runner/harness/manifest.py: build() collects {meta_non_default, hooks, overlays, custom_tests, env_overrides} via the SAME discovery/meta functions the run uses (so the manifest can never disagree with what actually executes — incl. the HC2 _gated() repo-local gate), render() prints the block. Orchestrator builds+prints right after meta load / repo-local snapshot, BEFORE the quick-lane branch (both lanes get the block); the dict rides into build_results(customization=...) verbatim. run_quick writes no results.json, so the single build_results call site covers all. Hooks render as "", tuples as lists (JSON-clean); ops.py pre-ops listed by cheap source scan (same approach as discovery._module_defines — no import at manifest time).
Lint flagged: C408 dict() literal, import-block order (manifest after deps), ruff-format on the new test file — all fixed. Verified on cc-ci (rsync of working tree): cc-ci-run -m pytest tests/unit -q -> 191 passed; nix develop .#lint --command scripts/lint.sh -> lint: PASS.
Next: P6 docs, then M1 prep (tests/concurrency proof run + 21-recipe baseline matrix).
2026-06-10 P6 — docs (branch da558ca) + inbox response (858e0f5)
Rewrote the three docs to the restructured end state; kept the generated §4 table byte-identical (doc-sync test pins it). recipe-customization.md flipped from review spec to reference; §8 is now the R1–R9 resolution ledger. Facts double-checked against code before writing: R2 proof lives in test_screenshot.py::test_screenshot_reachable_through_real_load_path (not test_meta.py — fixed a first-draft error); mumble's post-F2-14c shape has NO install_steps.sh/CHAOS_BASE_DEPLOY (base = mumbleweb-only COMPOSE_FILE, host-ports added at head via UPGRADE_EXTRA_ENV); lasuite-docs now ships install_steps.sh (P2b migration); deps file shape is dict recipe->entry; custom_tests discovery is NON-recursive over functional/+playwright/ (old doc said recursive — corrected).
Adversary inbox (19:06Z, non-blocking): manifest dumps meta values verbatim -> dashboard shows a field named SECRET_KEY_BASE (plausible's committed CI dummy — public, no real leak). Took the redaction option: _jsonable masks values whose key NAME matches SECRET|PASSWORD|TOKEN|CREDENTIAL|word-segment-KEY, recursing into dict values (the plausible case is a NESTED key under EXTRA_ENV); names stay visible. KEYCLOAK_URL deliberately not matched (word-segment KEY). Unit test pins redacted+passthrough both.
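A sketch of the name-based masking rule, assuming a standalone redact() helper rather than the real _jsonable. The mask string and function name are hypothetical; the pattern follows the rule as stated above, with KEY matched only as a whole underscore-delimited segment.

```python
import re

# KEY must be a whole word segment, so KEYCLOAK_URL is not matched.
_SENSITIVE = re.compile(r"SECRET|PASSWORD|TOKEN|CREDENTIAL|(?:^|_)KEY(?:_|$)")
MASK = "***"  # hypothetical placeholder value


def redact(value, key_name=""):
    """Mask leaf values whose nearest key name looks sensitive; keep names visible."""
    if isinstance(value, dict):
        return {k: redact(v, str(k)) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [redact(v, key_name) for v in value]  # tuples become JSON-clean lists
    if key_name and _SENSITIVE.search(key_name):
        return MASK
    return value


manifest = redact(
    {
        "DEPLOY_TIMEOUT": 900,
        "EXTRA_ENV": {"SECRET_KEY_BASE": "ci-dummy", "KEYCLOAK_URL": "https://kc.example.test"},
    }
)
```

Recursing before matching is what handles the nested case: EXTRA_ENV itself is not sensitive, but the SECRET_KEY_BASE entry inside it is masked while its name stays readable.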
Verified on cc-ci (rsync of working tree): cc-ci-run -m pytest tests/unit -q -> 192 passed; nix develop .#lint --command scripts/lint.sh -> lint: PASS.
Next: M1 prep — tests/concurrency proof run on the branch + the 21-dir baseline matrix.
2026-06-10 M1 prep + claim
Concurrency proof run on branch head 858e0f5 (rsynced tree on cc-ci): cc-ci-run -m pytest
tests/concurrency -q -> 23 passed in 11.46s (suite untouched by the restructure, as planned).
Baseline matrix: pulled every /var/lib/cc-ci-runs/*/results.json (141 files) and took the most
recent per recipe. 19/21 dirs covered by results.json; mumble's last full run predates the
results system (log ~/ccci-mumble-f214c.log, 5 tiers pass 05-31); bluesky-pds likewise
(Adversary Phase-2 cold verify e45e0ee). plausible's weekly-report RED was its PR branch
(pg13->14, build 200); its default-branch baseline is run 308 (06-10) L4 — runs 307/308 are
today's, from the conc-phase M2 sweep. Bad canaries recorded at their designed-fail tier.
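The "most recent per recipe" reduction, sketched under assumed field names: recipe and finished_at are hypothetical keys, not the confirmed results.json schema, and finished_at is assumed to be an ISO-8601 UTC string so plain string comparison orders it correctly.

```python
def latest_per_recipe(results):
    """Keep only the newest result row for each recipe."""
    latest = {}
    for row in results:
        best = latest.get(row["recipe"])
        if best is None or row["finished_at"] > best["finished_at"]:
            latest[row["recipe"]] = row
    return latest
```

Recipes with no row at all simply do not appear in the output, which is how the two uncovered dirs show up as gaps to be filled from other evidence.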
Claimed M1. While waiting: nothing else unblocked in this phase (M2 is gated on M1) — will hold with short fallback polls per §7 case 2.
2026-06-11 M2 reconciliation — discourse upgrade-HC1 root-cause hunt + bluesky re-characterization
Resumed after a loop stall (~21:18Z–23:50Z): the m2b/ab sweeps had finished but nothing processed them. Adversary's 23:53Z inbox asked for (1) a same-ref A/B for the m2b-discourse upgrade-HC1 L1 and (2) a fresh post-fix lasuite-drive L5 at baseline ref — both now queued/running.
Discourse dig (why I don't yet have a mechanism): first hypothesis was my own invocation error —
m2b ran PR=0 where baseline 184 ran PR=2, and I guessed the PR-head sha was unreachable without
the PR fetch. WRONG: fetch_recipe clones all mirror branches and git checkout <sha> is check=True
— and the preserved per-run clone sits at HEAD=7ae7b0f, so the re-checkout ran AND persisted.
Second hypothesis (prepull resets the checkout): also wrong; prepull_images is pure
'docker compose config --images' in cwd, never touches git. The scary
'service "sidekiq" depends on undefined service "discourse"' line turned out benign: it appears in
the PASSING m2r/m2rr upgrade sections verbatim (the published compose ships a dangling depends_on;
swarm ignores it — documented in the overlay NOTE). What's left: abra stamped the PREV-TAG commit
(eb96de94 = 0.7.0+3.3.1) on the chaos redeploy while the tree was at 7ae7b0f. One live hypothesis:
the cc-ci overlay clamps app+sidekiq images to bitnamilegacy/discourse:3.3.1; at this PR head
(0.9.0+3.5.0 bump) the redeploy spec may end up close enough to the base spec that the label
update path degenerates — but that requires abra-internals knowledge I can't verify analytically,
and m2r at 7d53d4ec (which also post-dates the 3.5.0 bump?) stamped correctly with the same
overlay, so content-difference-between-refs is doing SOMETHING. Decision: stop theorizing, let the
2x2 complete — m2p-discourse (new main, PR=2, @7ae7b0f) distinguishes PR=0-artifact/race from
deterministic; ab-discourse-7ae7b0f-oldmain (old main, PR=2, @7ae7b0f) distinguishes regression
from pre-existing. Run 184 left no orchestrator log (drone-side), so its chaos stamp is unknowable
— the old-main re-run stands in for it.
lifecycle.py diff c2508c7..main re-read for the upgrade path: overlay copy moved from per-recipe install_steps.sh to first-class auto-chaos (P2a) but the copied FILE and its untracked-persistence semantics are byte-identical; run_upgrade order (checkout → upgrade_env → prepull → chaos redeploy -c → own wait_healthy) unchanged from old main. Nothing jumps out as the delta.
bluesky-pds: pulled the swarm service logs from all three failed runs and found an identical
"Cannot find module '/app/index.js'" crash-loop (Node v24.15.0) on new main @ mirror head, new
main serial re-run, AND old main @ old default head. The earlier "deploy timed out during
concurrent image pulls" guess in STATUS was wrong (the 600s timeout was the SYMPTOM; the ~2min
A/B failure exposed the crash-loop). Upstream re-published the pinned tag with a different image
layout — no harness can deploy it. Filed in STATUS as restructure-neutral with grep-able evidence.
2026-06-11 lasuite-drive root cause #2 — completed one-shot poisons convergence (caught live)
Watching the m2p proof run instead of just waiting paid off: the fix-forward's best-effort line printed (so #1 is fixed), but the install assert then sat in pytest for 25+ minutes. Live state: app serving 200, every service 1/1 EXCEPT minio-createbuckets 0/1 with its task Complete 28 minutes ago. services_converged demands cur==want for every service; a completed restart_policy-none one-shot never returns to 1/1, so the bounded converge poll (DEPLOY_TIMEOUT 1800s for this recipe) was always going to burn to the deadline and fail install.
Why nobody ever saw this before P2b: the old setup_custom_tests.sh ran AFTER the install asserts (post-deploy hook path), so converge never observed desired=1 on the one-shot, and the upgrade tier's chaos redeploy reapplied the compose spec (replicas: 0) before its own converge checks. P2b folded the trigger into ops.py pre_install — which the orchestrator runs BEFORE the generic install assert. Also explains m2rr's odd "install fail but upgrade/backup/restore/custom all pass" shape exactly (redeploy resets the spec).
Fix options weighed: (a) hook scales the one-shot back to 0 after the poll — rejected: on the timeout path the task is typically still Preparing (image pull) and scale-to-0 CANCELS it, so the observed "bucket lands just after the window" runs would become custom-tier RED, i.e. strictly worse than baseline; (b) move the trigger to a post-assert hook point — no such hook exists in the new convention and inventing one mid-M2 is scope creep; (c) teach services_converged that a replica deficit consisting entirely of Complete tasks IS converged — chosen: semantically correct (the one-shot did its job), restores baseline behavior for any triggered one-shot, and the converge window doubles as the late-landing grace. Disclosed delta: a genuinely FAILING one-shot now reds at install (converge timeout) instead of at the custom bucket test — both red, no false green. Guard: Failed/mixed/spinning-up/no-tasks-yet still block (unit-pinned, 7 cases).
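Option (c) can be sketched as follows. This is an illustration of the rule, not the merged change: the Service shape, the task_states list and the state strings are assumptions modelled on what docker service ps reports.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    current: int  # running replicas
    desired: int  # wanted replicas
    task_states: list = field(default_factory=list)  # states of the service's tasks


def service_converged(svc: Service) -> bool:
    """cur == want, or a deficit explained entirely by Complete tasks."""
    if svc.current == svc.desired:
        return True
    if svc.current > svc.desired:
        return False
    # Deficit: converged only if there is at least one task and all are Complete.
    # Failed, mixed, still-preparing and no-tasks-yet all keep blocking.
    return bool(svc.task_states) and all(s == "Complete" for s in svc.task_states)


def services_converged(services) -> bool:
    return all(service_converged(s) for s in services)
```

With this rule a finished one-shot sitting at 0/1 no longer holds the poll open until the deadline, while a one-shot that failed or never started still does.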
Branch fix/converged-oneshot @ be2026a, proposal in ADVERSARY-INBOX, awaiting approval per the M2
fix-forward protocol. Unit suite 199 passed + lint PASS from the cc-ci working-tree rsync.
2026-06-11 ~01:00Z — merge landed, queue shortened
be2026a approved (REVIEW a531746, cold-verified independently) and merged as 6cabbe7; drone build
350 green on the push head 914c166. Merged diff verified == branch diff (empty
git diff be2026a..main for the two files). Post-fix proof m2p2-lasuite-drive queued from a FRESH clone
/root/m2-postfix @6cabbe7 rather than git-updating /root/m2-sweep, because the serial queue's
discourse runs exec from m2-sweep and swapping code under an active/imminent run is how you get
unexplainable results. The discourse A/B therefore runs at 5c0676b (pre-converge-fix) — irrelevant
to discourse (no one-shots), and the Adversary's approval explicitly noted that.
Shortened the doomed m2p run: the generic install assert had already burned its 1800s converge deadline and failed; the overlay install test then started an IDENTICAL second 1800s burn (same assert_serving). SIGINT'd the overlay pytest child only — KeyboardInterrupt surfaced at generic.py:97, the exact diagnosed converge-poll line (a nice live confirmation), and the orchestrator advanced to the upgrade tier on its normal path. Teardown semantics untouched. Disclosed in STATUS so the log's KeyboardInterrupt is pre-explained.
Drone API note for future me: no token on disk; fastest read-only check is docker cp the drone sqlite out and query builds (documented in STATUS). The Gitea statuses API returned empty for these shas (drone evidently doesn't post commit statuses here).
2026-06-11 ~00:55Z — discourse A/B closed (harness-neutral), mechanism still unattributed
m2p-discourse (new main, PR=2, @7ae7b0f) and ab-discourse-7ae7b0f-oldmain (old main, PR=2, same
ref) failed the upgrade IDENTICALLY: HC1, chaos-version=eb96de94+U, all other tiers pass, L2.
Same invocation as baseline 184 which was L4 five days ago. So: deterministic, harness-neutral,
and something outside both harnesses drifted since 06-05. Eliminated: branch-tip existence (7ae7b0f
still tips upgrade-0.8.0+3.5.0 + pr/2), upstream tag set (0.7.0+3.3.1 still latest), abra pin
(flake.lock untouched by the restructure). Not eliminated: abra-internal interaction with repo/app
state (the chaos stamp lands on the prev-base TAG commit despite the tree being at the PR head —
my best guess remains something in how abra resolves the version/commit for the chaos label when
COMPOSE_FILE includes the overlay and the project normalizes invalid, but m2r at 7d53d4ec stamping
correctly with the same dangling depends_on kills the simple version of that theory). The
'service "sidekiq" depends on...' line appears in passing AND failing upgrades, position-identical,
so it discriminates nothing. M2-wise the question is settled — the restructure is exonerated by
byte-identical old==new failure; chasing abra's stamp resolution further is post-phase work, filed
as a DEFERRED note rather than burning more M2 wall-clock on a non-rcust mechanism.
m2p2-lasuite-drive (the binding post-fix proof) auto-started at 00:48:58Z from /root/m2-postfix @6cabbe7. Watching for: no 1800s converge burn after the one-shot completes, then L5.
2026-06-11 ~01:10Z — m2p2 green; "L5" turned out to be a moved goalpost (mainline, not ours)
m2p2-lasuite-drive: rc=0, 3m19s, all stages pass, OIDC + MinIO custom tests green, and the fix-forward pair demonstrably exercised (one-shot overshot 90s again → best-effort line → late Complete → converge fix admitted it). But results.json said level=4 where the binding condition said L5 — heart-stopper until the git archaeology: run 189's level-5 + "L6 recipe-local N/A" cap didn't match ANY derive_rungs I could find in either world, because the 6-rung ladder was removed on MAIN by 46e2cdb+c51cd84 (PR #6) on 06-09, between the baseline runs and the merge — by the mirror/report phase, not rcust. The merge didn't touch level.py (checked 01e6d49^1..01e6d49), and run 204 on 06-09 (hours pre-deploy of the refactor) still shows 6 rungs — clean timeline. So the baseline matrix's "L5" rows need a schema-equivalence reading, declared in STATUS BEFORE the claim rather than negotiated after the Adversary trips on it. Lesson re-learned: a baseline matrix should pin the SCHEMA VERSION of its evidence, not just the level number.
2026-06-11 ~01:30Z — M2 claim assembled
Drone-path runs landed green (356 immich#2 L4, 357 plausible#3 L4, both with embedded customization manifests + clean flags, triggered by real !testme comments). Zero-leak verified after everything. Plausible's missing screenshot.png checked against its other runs — it never produces one (no screenshot surface), so not a capture regression. Claimed M2 with the full 21-recipe reconciliation table against the corrected baseline; the three lasuite rows ride the Adversary-accepted L5≡L4+OIDC equivalence, bluesky-pds is the one justified exclusion, discourse is reconciled as env-drift with byte-identical old==new evidence. Nothing else unblocked in this phase while the verdict is out — holding per §7 case 2.
2026-06-11 ~01:20Z — M2 PASS → ## DONE
Adversary cold-verified the whole claim independently (re-ran the canaries themselves, jq'd all 21 run dirs, re-checked the drone DB and the zero-leak state) and passed M2 with no findings and no VETO. M1 + M2 both stand; ## DONE written. Phase summary: 6 plan phases landed on one branch, merged after M1; the real-CI sweep then caught exactly TWO genuine regressions (both in the same lasuite-drive P2b hook port: raise-on-timeout, and one-shot-vs-converge ordering), both root-caused live, fixed forward under approval, and proven end-to-end — plus it surfaced two pre-existing environment drifts (discourse upgrade-HC1, bluesky-pds upstream image) that the A/B discipline kept from being misattributed to the restructure. The sweep-as-safety-net worked as designed.