JOURNAL — sub-phase rcust (Builder)
2026-06-10 bootstrap
Read phase plan (recipe-custom-restructure-full-plan.md), plan.md §6.1/§7/§9, and the reference
spec docs/recipe-customization.md @ 76a4b6b in full. Created phase state files. Work branch will
be restructure/recipe-custom off main @ 76a4b6b. Starting P1: reading the six current loaders
(run_recipe_ci.py::_load_meta, conftest.py::_recipe_meta, lifecycle.py::_recipe_extra_env,
lifecycle.py::_recipe_meta_flag, deps.py::declared_deps, canonical.py::is_canonical_enrolled)
before writing harness/meta.py.
2026-06-10 P1 — single loader + registry (branch 472a68b)
Wrote runner/harness/meta.py: KEYS registry (14 keys + CHAOS_BASE_DEPLOY/OIDC_AT_INSTALL/
SKIP_GENERIC kept registered as deprecated=True so P1 lands green before P2 deletes them),
RecipeMeta generated from KEYS via dataclasses.make_dataclass (frozen; field set cannot drift from
the registry), load() = the only exec() of recipe_meta.py, MetaError on unknown ALL-CAPS/type
mismatch/callable-on-data-key, difflib suggestion in the unknown-key message. BACKUP_CAPABLE keeps
its tri-state via default None (None = auto-detect — preserves the old "BACKUP_CAPABLE" in meta
semantics in generic.backup_capable).
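Sketch of the registry-to-dataclass shape described above, for readers without the branch checked out. This is a minimal illustration, not the contents of harness/meta.py: the three-key KEYS table, its types, and the 600 default are hypothetical stand-ins for the real 14-key registry, and deprecated keys and the callable-on-data-key check are left out.

```python
import dataclasses
import difflib
from pathlib import Path
from typing import Any


class MetaError(Exception):
    """A recipe_meta.py declared something the registry rejects."""


# Hypothetical 3-key registry: name -> (expected type, default).
KEYS: dict[str, tuple[type, Any]] = {
    "DEPLOY_TIMEOUT": (int, 600),
    "EXTRA_ENV": (dict, None),
    "BACKUP_CAPABLE": (bool, None),  # None = auto-detect (tri-state)
}

# The field set is generated from KEYS, so it cannot drift from the registry.
RecipeMeta = dataclasses.make_dataclass(
    "RecipeMeta",
    [(name, Any, dataclasses.field(default=default)) for name, (_, default) in KEYS.items()],
    frozen=True,
)


def load(path: Path) -> Any:
    """Exec one recipe_meta.py and validate its ALL-CAPS names against KEYS."""
    namespace: dict[str, Any] = {}
    exec(compile(path.read_text(), str(path), "exec"), namespace)
    values: dict[str, Any] = {}
    for name, value in namespace.items():
        if not name.isupper():
            continue  # imports, helpers, dunders: not meta keys
        if name not in KEYS:
            close = difflib.get_close_matches(name, list(KEYS), n=1)
            hint = f" (did you mean {close[0]}?)" if close else ""
            raise MetaError(f"{path}: unknown key {name}{hint}")
        expected, _default = KEYS[name]
        if not isinstance(value, expected):
            raise MetaError(
                f"{path}: {name} must be {expected.__name__}, got {type(value).__name__}"
            )
        values[name] = value
    return RecipeMeta(**values)
```

A misspelled key such as DEPLOY_TIMOUT fails at load time with the nearest registered name in the message, instead of being silently ignored.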
Migrations: orchestrator loads once + passes meta down (deploy_app/perform_upgrade/_perform_op/run_lifecycle_tier all take the object); conftest meta fixture returns full RecipeMeta (R3 closed); lifecycle._recipe_extra_env/_recipe_meta_flag and deps.declared_deps deleted; canonical.is_enrolled and enrolled_recipes go through meta.load (tests monkeypatch meta.TESTS_DIR now instead of canonical.__file__); screenshot._load_screenshot_hook reads the attribute (R2 fixed: unit test proves SCREENSHOT survives the real orchestrator load path). deploy_app keeps an optional meta=None fallback (loads via the single loader) for fixture/manual callers; exec still happens in exactly one function.
Effective-value safety check before committing: dumped non_default() for all 21 recipe dirs through the new loader — every recipe's customized key set matches its recipe_meta.py source (e.g. mumble: DEPLOY_TIMEOUT/EXTRA_ENV/HEALTH_OK/READY_PROBE/UPGRADE_EXTRA_ENV). One intentional delta class: deps.deploy_deps' fallback timeouts for a MISSING dep meta change from literal 900/600 to loading the dep's real meta (orchestrator path always supplied metas, so CI behavior is identical).
Verified on cc-ci (rsynced working tree before committing): cc-ci-run -m pytest tests/unit -q -> 175 passed; nix develop .#lint --command scripts/lint.sh -> lint: PASS. Three pre-existing f212 unit tests passed dicts to wait_ready_probes; updated mechanically to construct RecipeMeta via dataclasses.replace (assertions untouched).
Next: P2a compose.ccci.yml first-class + auto-chaos.
2026-06-10 P2 — legacy keys & paths deleted (branch 8cd72fd)
P2a: lifecycle.provide_ccci_overlay copies tests//compose.ccci.yml into the per-run checkout (after install_steps hook, before prepull/deploy); pinned base deploys auto-chaos on overlay presence (has_ccci_overlay replaces the meta.CHAOS_BASE_DEPLOY elif). ghost/discourse install_steps.sh were copy-only -> deleted whole; their metas keep COMPOSE_FILE in EXTRA_ENV (unchanged wiring, the harness now owns the copy).
P2b: oidc_at_install condition removed — if declared: provisions before the single deploy,
legacy post-deploy block + _run_setup_custom_tests_hook deleted. lasuite-docs install_steps.sh is
the meet/drive hook with docs' exact env names (diffed against the deleted setup_custom_tests.sh:
same keys incl. OIDC_OP_DISCOVERY_ENDPOINT + scopes 'openid email profile'; secret-insert bump
identical; only the abra-redeploy step is gone — the single deploy reads the env instead).
lasuite-drive's MinIO bucket one-shot -> ops.py pre_install (runs at install-tier start, post-
deploy; bucket lives in the minio volume so it survives upgrade/restore; same scale --detach +
30x3s poll as the shell version). run_quick: deps still provision (realm/creds), hook call gone —
no quick-enrolled recipe declares DEPS today; noted inline.
P2c: SKIP_GENERIC out of the registry; _skip_generic(op) env-only; skip_generic_env_overrides()
prints a !! warning when active under DRONE (P5 will embed in the manifest).
P2d: conftest deps fixture = dict of _DepEntry (dict subclass w/ attribute sugar) — the 6 lasuite files only ever used deps_creds, renamed param to deps, zero assertion changes. NOTE for Adversary: some assert MESSAGE strings ('setup_custom_tests should have populated this.' -> 'dep provisioning...') and docstrings updated — message text only, no assert logic/expected values.
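For reference, the "dict subclass w/ attribute sugar" idea from P2d can be sketched as below. This is an illustrative shape only: the class name DepEntry and the field names (domain, creds) are hypothetical, and the real _DepEntry in conftest may differ.

```python
class DepEntry(dict):
    """A dict whose keys are also readable as attributes."""

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, so dict
        # methods such as .get() and .items() keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# Hypothetical entry: key-style and attribute-style access read the same data.
entry = DepEntry(domain="keycloak.example.test", creds={"user": "admin"})
assert entry["domain"] == entry.domain
```

Because the object is still a dict, tests written against the old key-style access keep passing while newer ones can use attributes.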
Verified on cc-ci (rsync of working tree): cc-ci-run -m pytest tests/unit -q -> 175 passed; nix develop .#lint --command scripts/lint.sh -> PASS. Doc table regenerated to the 14-key registry (doc-sync unit test pins it).
Next: P3 — HookCtx + ctx-hook signatures everywhere.
2026-06-10 P3 — uniform ctx hook convention (branch fd02d9f)
HookCtx frozen dataclass + hook_ctx() constructor in harness/meta.py; ctx.deps read straight from
$CCCI_DEPS_FILE (json, both shapes) — meta.py stays import-cycle-free (deps.py imports lifecycle
which imports meta). Registry keys carry hook_params; meta.load() enforces the expected positional
names per hook key (READY_PROBE/BACKUP_VERIFY/EXTRA_ENV/UPGRADE_EXTRA_ENV=(ctx,),
SCREENSHOT=(page, ctx)); run_pre_hook applies meta.check_hook_signature(fn, ("ctx",)) to ops.py
hooks before calling. Conversion of 17 ops.py + 8 recipe_meta hooks was scripted (def-line regex +
bare domain -> ctx.domain inside the pre_*/hook function bodies only) and diff-reviewed; the
only manual fixes: keycloak pre_restore passed meta -> ctx.meta, and two comment lines in
lasuite-drive/-meet metas that the regex over-replaced were restored. wait_ready_probes gained
op= (install/upgrade call sites pass it) so probes can know the phase.
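A rough sketch of the positional-name check, assuming it is built on inspect.signature (the journal does not say how meta.check_hook_signature is implemented). The HookCtx fields shown are limited to the ones named in this entry (domain, meta, deps); their types and defaults are guesses.

```python
import inspect
from dataclasses import dataclass, field
from typing import Any, Callable


class MetaError(Exception):
    """A hook does not follow the ctx convention."""


@dataclass(frozen=True)
class HookCtx:
    # Assumed minimal field set; the real HookCtx may carry more.
    domain: str
    meta: Any = None
    deps: dict = field(default_factory=dict)


_POSITIONAL = (
    inspect.Parameter.POSITIONAL_ONLY,
    inspect.Parameter.POSITIONAL_OR_KEYWORD,
)


def check_hook_signature(fn: Callable, expected: tuple) -> None:
    """Raise MetaError unless fn's positional parameter names equal expected."""
    params = inspect.signature(fn).parameters.values()
    names = tuple(p.name for p in params if p.kind in _POSITIONAL)
    if names != tuple(expected):
        label = getattr(fn, "__name__", repr(fn))
        raise MetaError(f"{label}: expected parameters {tuple(expected)}, got {names}")


def ready_probe(ctx):  # new-style hook
    return ctx.domain


check_hook_signature(ready_probe, ("ctx",))  # passes silently
```

Checking names rather than only arity means a hook still written as (domain, meta) is rejected before it is ever called.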
Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 180 passed; lint PASS.
Next: P4 — discovery placement rule + op_state/deps fixtures + migrate hand-parsers.
2026-06-10 P4 — custom-test ergonomics (branch 29a28e2)
Pre-change sweeps confirmed the plan's zero-users claims: no top-level non-lifecycle test_*.py in any recipe dir; no recipe test file reads os.environ / CCCI_OP_STATE_FILE directly (the only op-state consumers are the generic assertions via harness.generic.op_state — harness-side, fine). So P4 = discovery glob removal + new op_state fixture + pinning tests; no test migrations needed. test_discovery.py's HC2 gate test moved its repo-local custom fixture under functional/ (the rule); test_discovery_phase2.py now asserts top-level custom is NOT discovered. op_state fixture skips (clear reason) when env unset / file missing / unparseable; tested via request.getfixturevalue.
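The placement rule can be illustrated as follows. This is a sketch under assumptions: discover_custom_tests is a hypothetical name, not the harness's discovery function, and the playwright/ directory and the non-recursive behaviour are taken from the P6 entry of this journal.

```python
from pathlib import Path

# Custom tests are only picked up from these subdirectories of a recipe dir.
CUSTOM_TEST_DIRS = ("functional", "playwright")


def discover_custom_tests(recipe_dir: Path) -> list[Path]:
    """Return test_*.py files directly inside functional/ and playwright/."""
    found: list[Path] = []
    for sub in CUSTOM_TEST_DIRS:
        directory = recipe_dir / sub
        if directory.is_dir():
            # Path.glob without "**" does not descend into nested directories.
            found.extend(sorted(directory.glob("test_*.py")))
    return found
```

A top-level test_*.py in the recipe dir, or one nested deeper than one level, is deliberately left out of the result.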
Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 184 passed; lint PASS.
Next: P5 — customization manifest (print block + results.json key).
2026-06-10 P5 — customization manifest (branch 68954be)
(Resumed after a usage-limit pause mid-P5; working tree carried the in-flight manifest.py.) New runner/harness/manifest.py: build() collects {meta_non_default, hooks, overlays, custom_tests, env_overrides} via the SAME discovery/meta functions the run uses (so the manifest can never disagree with what actually executes — incl. the HC2 _gated() repo-local gate), render() prints the block. Orchestrator builds+prints right after meta load / repo-local snapshot, BEFORE the quick-lane branch (both lanes get the block); the dict rides into build_results(customization=...) verbatim. run_quick writes no results.json, so the single build_results call site covers all. Hooks render as "", tuples as lists (JSON-clean); ops.py pre-ops listed by cheap source scan (same approach as discovery._module_defines — no import at manifest time).
Lint flagged: C408 dict() literal, import-block order (manifest after deps), ruff-format on the new test file — all fixed. Verified on cc-ci (rsync of working tree): cc-ci-run -m pytest tests/unit -q -> 191 passed; nix develop .#lint --command scripts/lint.sh -> lint: PASS.
Next: P6 docs, then M1 prep (tests/concurrency proof run + 21-recipe baseline matrix).
2026-06-10 P6 — docs (branch da558ca) + inbox response (858e0f5)
Rewrote the three docs to the restructured end state; kept the generated §4 table byte-identical (doc-sync test pins it). recipe-customization.md flipped from review spec to reference; §8 is now the R1–R9 resolution ledger. Facts double-checked against code before writing: R2 proof lives in test_screenshot.py::test_screenshot_reachable_through_real_load_path (not test_meta.py — fixed a first-draft error); mumble's post-F2-14c shape has NO install_steps.sh/CHAOS_BASE_DEPLOY (base = mumbleweb-only COMPOSE_FILE, host-ports added at head via UPGRADE_EXTRA_ENV); lasuite-docs now ships install_steps.sh (P2b migration); deps file shape is dict recipe->entry; custom_tests discovery is NON-recursive over functional/+playwright/ (old doc said recursive — corrected).
Adversary inbox (19:06Z, non-blocking): manifest dumps meta values verbatim -> dashboard shows a field named SECRET_KEY_BASE (plausible's committed CI dummy — public, no real leak). Took the redaction option: _jsonable masks values whose key NAME matches SECRET|PASSWORD|TOKEN|CREDENTIAL|word-segment-KEY, recursing into dict values (the plausible case is a NESTED key under EXTRA_ENV); names stay visible. KEYCLOAK_URL deliberately not matched (word-segment KEY). Unit test pins redacted+passthrough both.
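The name-based masking rule reads roughly like this. It is a sketch: the regex is one way to express "word-segment KEY" (KEY delimited by underscores or the ends of the name), the "***" mask string is a placeholder, and hook rendering is omitted; the real _jsonable may differ on all three.

```python
import re

# Key NAMES that trigger masking. KEY must be its own underscore-delimited
# segment, so API_KEY matches but KEYCLOAK_URL does not.
_SENSITIVE = re.compile(r"SECRET|PASSWORD|TOKEN|CREDENTIAL|(?:^|_)KEY(?:_|$)")
MASK = "***"  # placeholder mask string (assumption)


def jsonable(value, key=None):
    """Return a JSON-clean copy of value with sensitive-named values masked."""
    if key is not None and _SENSITIVE.search(str(key)):
        return MASK  # the name stays visible in the parent dict; only the value is hidden
    if isinstance(value, dict):
        return {k: jsonable(v, k) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [jsonable(v) for v in value]  # tuples become lists
    return value


meta = {"EXTRA_ENV": {"SECRET_KEY_BASE": "ci-dummy", "KEYCLOAK_URL": "https://kc.example.test"}}
print(jsonable(meta))
```

The nested case is the one that matters here: SECRET_KEY_BASE sits under EXTRA_ENV, so the recursion has to carry each dict key down as the name to test.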
Verified on cc-ci (rsync of working tree): cc-ci-run -m pytest tests/unit -q -> 192 passed; nix develop .#lint --command scripts/lint.sh -> lint: PASS.
Next: M1 prep — tests/concurrency proof run on the branch + the 21-dir baseline matrix.
2026-06-10 M1 prep + claim
Concurrency proof run on branch head 858e0f5 (rsynced tree on cc-ci): cc-ci-run -m pytest
tests/concurrency -q -> 23 passed in 11.46s (suite untouched by the restructure, as planned).
Baseline matrix: pulled every /var/lib/cc-ci-runs/*/results.json (141 files) and took the most
recent per recipe. 19/21 dirs covered by results.json; mumble's last full run predates the
results system (log ~/ccci-mumble-f214c.log, 5 tiers pass 05-31); bluesky-pds likewise
(Adversary Phase-2 cold verify e45e0ee). plausible's weekly-report RED was its PR branch
(pg13->14, build 200); its default-branch baseline is run 308 (06-10) L4 — runs 307/308 are
today's, from the conc-phase M2 sweep. Bad canaries recorded at their designed-fail tier.
Claimed M1. While waiting: nothing else unblocked in this phase (M2 is gated on M1) — will hold with short fallback polls per §7 case 2.
2026-06-11 M2 reconciliation — discourse upgrade-HC1 root-cause hunt + bluesky re-characterization
Resumed after a loop stall (~21:18Z–23:50Z): the m2b/ab sweeps had finished but nothing processed them. Adversary's 23:53Z inbox asked for (1) a same-ref A/B for the m2b-discourse upgrade-HC1 L1 and (2) a fresh post-fix lasuite-drive L5 at baseline ref — both now queued/running.
Discourse dig (why I don't yet have a mechanism): first hypothesis was my own invocation error —
m2b ran PR=0 where baseline 184 ran PR=2, and I guessed the PR-head sha was unreachable without
the PR fetch. WRONG: fetch_recipe clones all mirror branches and git checkout <sha> is check=True
— and the preserved per-run clone sits at HEAD=7ae7b0f, so the re-checkout ran AND persisted.
Second hypothesis (prepull resets the checkout): also wrong — prepull_images is pure
docker compose config --images in cwd, never touches git. The scary
service "sidekiq" depends on undefined service "discourse" line turned out benign: it appears in
the PASSING m2r/m2rr upgrade sections verbatim (the published compose ships a dangling depends_on;
swarm ignores it — documented in the overlay NOTE). What's left: abra stamped the PREV-TAG commit
(eb96de94 = 0.7.0+3.3.1) on the chaos redeploy while the tree was at 7ae7b0f. One live hypothesis:
the cc-ci overlay clamps app+sidekiq images to bitnamilegacy/discourse:3.3.1; at this PR head
(0.9.0+3.5.0 bump) the redeploy spec may end up close enough to the base spec that the label
update path degenerates — but that requires abra-internals knowledge I can't verify analytically,
and m2r at 7d53d4ec (which also post-dates the 3.5.0 bump?) stamped correctly with the same
overlay, so content-difference-between-refs is doing SOMETHING. Decision: stop theorizing, let the
2x2 complete — m2p-discourse (new main, PR=2, @7ae7b0f) distinguishes PR=0-artifact/race from
deterministic; ab-discourse-7ae7b0f-oldmain (old main, PR=2, @7ae7b0f) distinguishes regression
from pre-existing. Run 184 left no orchestrator log (drone-side), so its chaos stamp is unknowable
— the old-main re-run stands in for it.
lifecycle.py diff c2508c7..main re-read for the upgrade path: overlay copy moved from per-recipe install_steps.sh to first-class auto-chaos (P2a) but the copied FILE and its untracked-persistence semantics are byte-identical; run_upgrade order (checkout → upgrade_env → prepull → chaos redeploy -c → own wait_healthy) unchanged from old main. Nothing jumps out as the delta.
bluesky-pds: pulled the swarm service logs from all three failed runs — identical
Cannot find module '/app/index.js' crash-loop (Node v24.15.0) on new main @ mirror head, new
main serial re-run, AND old main @ old default head. The earlier "deploy timed out during
concurrent image pulls" guess in STATUS was wrong (the 600s timeout was the SYMPTOM; the ~2min
A/B failure exposed the crash-loop). Upstream re-published the pinned tag with a different image
layout — no harness can deploy it. Filed in STATUS as restructure-neutral with grep-able evidence.
2026-06-11 lasuite-drive root cause #2 — completed one-shot poisons convergence (caught live)
Watching the m2p proof run instead of just waiting paid off: the fix-forward's best-effort line printed (so #1 is fixed), but the install assert then sat in pytest for 25+ minutes. Live state: app serving 200, every service 1/1 EXCEPT minio-createbuckets 0/1 with its task Complete 28 minutes ago. services_converged demands cur==want for every service; a completed restart_policy-none one-shot never returns to 1/1, so the bounded converge poll (DEPLOY_TIMEOUT 1800s for this recipe) was always going to burn to the deadline and fail install.
Why nobody ever saw this before P2b: the old setup_custom_tests.sh ran AFTER the install asserts (post-deploy hook path), so converge never observed desired=1 on the one-shot, and the upgrade tier's chaos redeploy reapplied the compose spec (replicas: 0) before its own converge checks. P2b folded the trigger into ops.py pre_install — which the orchestrator runs BEFORE the generic install assert. Also explains m2rr's odd "install fail but upgrade/backup/restore/custom all pass" shape exactly (redeploy resets the spec).
Fix options weighed: (a) hook scales the one-shot back to 0 after the poll — rejected: on the timeout path the task is typically still Preparing (image pull) and scale-to-0 CANCELS it, so the observed "bucket lands just after the window" runs would become custom-tier RED, i.e. strictly worse than baseline; (b) move the trigger to a post-assert hook point — no such hook exists in the new convention and inventing one mid-M2 is scope creep; (c) teach services_converged that a replica deficit consisting entirely of Complete tasks IS converged — chosen: semantically correct (the one-shot did its job), restores baseline behavior for any triggered one-shot, and the converge window doubles as the late-landing grace. Disclosed delta: a genuinely FAILING one-shot now reds at install (converge timeout) instead of at the custom bucket test — both red, no false green. Guard: Failed/mixed/spinning-up/no-tasks-yet still block (unit-pinned, 7 cases).
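Option (c) can be sketched as a pure predicate. This is an illustration only: the Service record and its task_states field are a hypothetical model of what services_converged reads from swarm (one latest-task state per desired replica slot, using the state names as they appear in this journal), not the code on fix/converged-oneshot.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    # Hypothetical snapshot of one swarm service.
    name: str
    cur: int   # running replicas
    want: int  # desired replicas
    task_states: list = field(default_factory=list)  # latest task state per slot


def service_converged(svc: Service) -> bool:
    """cur == want, or the whole replica deficit is made of Complete tasks."""
    if svc.cur == svc.want:
        return True
    deficit = svc.want - svc.cur
    if deficit < 0:
        return False  # more running than desired: still settling
    not_running = [s for s in svc.task_states if s != "Running"]
    # Failed, mixed, spinning-up and no-tasks-yet all fail one of these checks.
    return len(not_running) == deficit and all(s == "Complete" for s in not_running)


def services_converged(services: list) -> bool:
    return all(service_converged(s) for s in services)
```

With this rule a finished restart_policy-none one-shot at 0/1 no longer holds the converge poll to its deadline, while a Failed or still-Preparing task keeps blocking as before.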
Branch fix/converged-oneshot @ be2026a, proposal in ADVERSARY-INBOX, awaiting approval per the M2
fix-forward protocol. Unit suite 199 passed + lint PASS from the cc-ci working-tree rsync.