# JOURNAL — Phase 1e (generic-harness corrections)

Append-only Builder log: what I did + verifying command/output + next.

## 2026-05-28 — Phase 1e bootstrap + orientation

- Read the phase plan (`plan-phase1e-harness-corrections.md`) + plan.md §6.1/§7/§9. Phase 1d is DONE (STATUS-1d ## DONE, DG1–DG8 Adversary PASS). Studied the harness: `runner/run_recipe_ci.py` (deploy-once orchestrator), `runner/harness/{discovery,generic,lifecycle,abra}.py`, `tests/conftest.py`, `tests/_generic/*`, the overlays (custom-html/keycloak/cryptpad/n8n/matrix-synapse), and `tests/unit/test_discovery.py`.
- Access re-verified: `ssh cc-ci 'hostname && whoami'` → `nixos` / `root`.
- Settled the three open decisions (HC1 deploy-count, HC2 allowlist, HC3 opt-out) in DECISIONS.md.
- Created STATUS-1e / BACKLOG-1e / JOURNAL-1e. Order of work: E0 (HC2) → E1 (HC3) → E2 (HC1) → E3.
- Key design notes:
  - HC3 op/assertion split: orchestrator performs each mutating op once; generic + overlay both run as assertions after. Op results (pre-upgrade identity, snapshot_id) passed via run-scoped `$CCCI_OP_STATE_FILE`. Overlays that seed pre-op state move that into an optional `tests//ops.py` (`pre_(domain, meta)`); overlay `test_.py` become assertion-only.
  - HC1: re-checkout PR head (recorded as recipe HEAD right after fetch) then `abra app deploy --chaos`; moved-assertion accepts the chaos label as proof PR-head deployed; deploy-count counts only `deploy_app` (app new), not the in-place chaos redeploy.

Next: E0 — implement the HC2 allowlist + discovery gate + unit tests.

## 2026-05-28 — E0 / HC2 repo-local trust gate (DONE, CLAIMED)

- Implemented the approval allowlist (`tests/repo-local-approved.txt`, default empty ⇒ default-deny) + centralized gate in `runner/harness/discovery.py`: `approved_recipes()`/`repo_local_approved()`/`_gated()`. Split overlay resolution into `resolve_overlay_op` (repo-local>cc-ci, gated) + `generic_op` (the floor) for HC3; kept back-compat `resolve_op` (override).
`custom_tests`/`install_steps`/new `pre_op_hook` all route repo-local through `_gated`. Allowlist path overridable via `CCCI_REPO_LOCAL_APPROVED_FILE`.
- Rewrote `tests/unit/test_discovery.py` for the gate (approved-vs-not for overlay/custom/hook/pre-op + the generic floor + default-empty-allowlist invariant).
- Verified on cc-ci (tar-piped working tree → /root/cc-ci; cc-ci has no rsync): `cc-ci-run -m pytest tests/unit -q` → **8 passed in 0.06s**. And the cc-ci-authored hook is unaffected (DG5): `discovery.install_steps("custom-html-tiny", None)` → `('cc-ci', '.../tests/custom-html-tiny/install_steps.sh')`
- Committed d38a695, pushed. Gate E0/HC2 CLAIMED for Adversary.

Next: E1 (HC3) — orchestrator op/assertion split + additive generic + opt-out + overlay migration.

## 2026-05-28 — E1 / HC3 additive generic + op/assertion split (implemented + e2e verified)

- **Harness core:** `lifecycle.deployed_identity` now returns `{version,image,chaos}` (chaos label captured, ready for HC1). `generic.py` split: op primitives `perform_upgrade/perform_backup/perform_restore` (orchestrator-only, no asserts) + assertions `assert_upgraded` (serving + MOVED via version/image/chaos), `assert_backup_artifact`, `assert_restore_healthy`, all reading the run-scoped `op_state()` (`$CCCI_OP_STATE_FILE`).
- **Orchestrator** (`run_recipe_ci.py`): new `run_lifecycle_tier` = pre-op seed hook (`ops.py pre_`, imported in-process w/ recipe dir on sys.path) → perform the op ONCE → run generic assertion (unless `_skip_generic`) + overlay assertion, both against the shared post-op deployment. Opt-out: `CCCI_SKIP_GENERIC` / `CCCI_SKIP_GENERIC_` / `recipe_meta.SKIP_GENERIC`. `_scrub` factored so op-failure messages are redacted too. Op primitives never call `deploy_app` ⇒ deploy-count stays 1.
- **Tiers/overlays migrated to assertion-only:** generic `_generic/test_{upgrade,backup,restore}.py`; all 6 recipes' `test_{upgrade,backup,restore}.py`.
Pre-op seeding (data-continuity markers + the backup→restore mutation) moved to per-recipe `ops.py` (`pre_upgrade/pre_backup/pre_restore`). install overlays unchanged (no op). No assertion weakened — every data-survival/return check kept.
- **Verified on cc-ci:**
  - `cc-ci-run -m pytest tests/unit -q` → **8 passed**; `nix develop .#lint` → **lint: PASS** (ruff format + check clean).
  - Full e2e `RECIPE=custom-html STAGES=install,upgrade,backup,restore,custom` → every tier ran BOTH generic AND overlay (additive): install(generic test_serving + overlay test_serving_and_content), upgrade(pre_upgrade seed → generic test_upgrade_reconverges + overlay test_upgrade_preserves_data), backup(pre_backup → generic test_backup_artifact + overlay test_backup_captures_state), restore(pre_restore → generic test_restore_healthy + overlay test_restore_returns_state). **RUN SUMMARY: deploy-count=1, install/upgrade/backup/restore=pass, custom=skip; no leftover custom-html stack (clean teardown).** Log: /root/ccci-1e-customhtml.log on cc-ci.
  - Opt-out run (`CCCI_SKIP_GENERIC=1`) in flight to show generic skipped + overlay still runs.

Next: confirm opt-out result, claim E1/HC3 gate, then E2 (HC1 chaos-to-PR-head).

## 2026-05-28 — E1 opt-out verified; gate CLAIMED

- Opt-out e2e `RECIPE=custom-html STAGES=install,upgrade,backup,restore CCCI_SKIP_GENERIC=1`: every tier logged `generic=skip, overlay=cc-ci`; **0** `_generic/test_*` files ran; only the 4 cc-ci overlays ran; **deploy-count=1**; install/upgrade/backup/restore=pass; clean teardown (no leftover custom-html stack). Log: /root/ccci-1e-optout.log.
- HC3 proven both ways: default = generic+overlay additive on one deployment (op once); opt-out = generic floor skipped, overlay still runs. Gate E1/HC3 CLAIMED for Adversary.

## 2026-05-28 — Adversary F1e-1 (HC3 opt-out race) + HC1 hardening

- **F1e-1 (E1/HC3 FAIL withheld):** under `CCCI_SKIP_GENERIC=1`, `test_backup_captures_state` flaked `'' == 'original'`.
Root cause (valid): `lifecycle.exec_in_app` returned `proc.stdout` WITHOUT checking returncode — when backup-bot cycles the app container, `docker exec` fails and the empty stdout was silently returned as data; the generic pytest spawn (~1s) had been an accidental timing buffer that opt-out removes. **Fix (no assertion weakened):** `exec_in_app` now polls — re-resolves the container + re-execs until returncode==0 or a 90s timeout, then RAISES. A container-cycle race now waits-and-succeeds; a genuine exec failure is loud, never masquerades as empty data. This makes the backup/restore overlays robust to the post-op cycle independent of the generic timing buffer, so opt-out is behavior-neutral.
- **HC1 hardening (my own findings from E2 e2e):**
  - `head_ref` capture was racy (returned None under a concurrent run wiping the shared recipe dir), and a chaos-redeploy of the SAME prev checkout falsely "moved" via the chaos label alone. Fixes: `head_ref = ref or recipe_head_commit(recipe)` (prefer the explicit PR head sha $REF — robust, no git race; production `!testme` always sets REF); store head_ref in op_state.
  - `assert_upgraded` now, when head_ref is known, REQUIRES the deployed `chaos-version` commit to MATCH head_ref — direct proof the PR-head code under test was deployed, and non-vacuous (a stale prev-checkout chaos redeploy stamps prev's commit ≠ head_ref → FAIL). Falls back to the version/image/chaos move check only when head_ref is unknown.
- **Coordination note:** my E2 manual custom-html e2e ran concurrently with the Adversary's E1 cold-verify — both share `/root/.abra/recipes/custom-html` + (at PR=0) the same run domain, so they collided (explains my non-deterministic 1.10→1.11 vs 1.10→1.10 and the None head_ref). Manual ad-hoc runs bypass Drone's capacity=1 queue. Going forward I serialize: don't run a recipe manually while a gate is under Adversary verification; verify when `pgrep run_recipe_ci` is clear.
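
The F1e-1 fix in this entry is a poll-until-success loop around the exec. A minimal sketch of that pattern, not the harness code itself: `exec_with_retry`, `resolve_container`, `interval_s`, the injectable `run`, and the `LookupError` convention are all hypothetical names chosen for illustration. Only the behaviour (re-resolve the container, re-exec until returncode 0, raise after a 90s timeout) comes from the entry.

```python
import subprocess
import time
from typing import Callable, Sequence


def exec_with_retry(
    resolve_container: Callable[[], str],
    command: Sequence[str],
    timeout_s: float = 90.0,
    interval_s: float = 2.0,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Run `command` via `docker exec`, retrying until it exits 0.

    Returns stdout only from a successful exec; raises RuntimeError once the
    deadline passes, so a failed exec can never be read as empty data.
    """
    deadline = time.monotonic() + timeout_s
    last_error = "no attempt made"
    while True:
        try:
            # Re-resolve on every attempt: the container ID changes when the
            # app container is cycled (for example after a backup).
            container = resolve_container()
            proc = run(
                ["docker", "exec", container, *command],
                capture_output=True,
                text=True,
            )
            if proc.returncode == 0:
                return proc.stdout
            last_error = f"rc={proc.returncode} stderr={proc.stderr.strip()!r}"
        except LookupError as exc:
            # Assumed convention: the resolver raises LookupError while no
            # container is running.
            last_error = f"container not found: {exc}"
        if time.monotonic() >= deadline:
            raise RuntimeError(
                f"exec did not succeed within {timeout_s}s ({last_error})"
            )
        time.sleep(interval_s)
```

The key design point is that stdout is returned from exactly one place, guarded by `returncode == 0`, which removes the "empty string mistaken for data" failure mode regardless of timing.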

## 2026-05-28 — E2 head_ref plumbing bug (fixed)

- Debug print at main() head_ref capture showed `head_ref='09bf4d54...'` (correct hash), but perform_upgrade printed `head_ref=None`. Root cause: my earlier perl regex to swap `target → head_ref` in the four `run_lifecycle_tier` call sites only matched the SINGLE-LINE form; the multi-line `upgrade` and `restore` calls (lint-wrapped) still passed `target` (which is the VERSION env, None for !testme runs). So perform_upgrade got head_ref=None for upgrade tier → re-checkout skipped → chaos deploy of whatever leftover checkout (prev tag from deploy_app) → vacuous prev→prev chaos redeploy that "passed" via the chaos-label move fallback.
- Fixed: explicit Edit on the two multi-line calls so they now pass `head_ref` consistently (`recipe`/`"upgrade"|"backup"|"restore"`, `repo_local`, `domain`, `meta`, `head_ref`, `op_state`). grep confirms all 4 tier calls pass head_ref. compile OK.
- Net effect now: head_ref reaches perform_upgrade → recipe_checkout_ref(head_ref) restores PR-head before chaos deploy → after.chaos == head_ref → assert_upgraded match succeeds non-vacuously.

## 2026-05-28 — E2/HC1 CLAIMED (chaos-version==head_ref proven on hedgedoc)

- Verified hedgedoc HC1 e2e (commit 7472561, log /root/ccci-1e-hc1-hed4.log):

  ```
  == cc-ci run: recipe=hedgedoc ref=None pr=0 stages=['install', 'upgrade']
  ===== TIER: upgrade (generic=run, overlay=none) =====
  upgrade→PR-head: head_ref=09bf4d54 chaos-version=09bf4d54 version=3.0.9+1.10.7→3.0.10+1.10.8
  PASSED tests/_generic/test_upgrade.py::test_upgrade_reconverges
  ===== RUN SUMMARY =====
  deploy-count = 1 (expect 1)
  install : pass
  upgrade : pass
  ```

  head_ref (09bf4d54) == chaos-version (09bf4d54) — direct, deterministic, non-vacuous proof the chaos deploy deployed the PR-head code under test. Plus a real version bump 3.0.9→3.0.10. deploy-count=1; clean teardown.
- E3/HC4 docs work shipped in 7472561 (docs/testing.md + docs/enroll-recipe.md fully rewritten for HC1/HC2/HC3: additive generic + opt-out + ops.py + chaos PR-head + repo-local allowlist).
- All three HC items implemented + Builder-verified. Awaiting Adversary cold-verify of HC1 and HC4.

## Background-task pgrep self-match note (lesson learned)

- My `until ! pgrep -f run_recipe_ci.py` polls **matched their own bash command line** (which contains the literal string "run_recipe_ci.py" in the grep patterns), so they never exited and piled up (saw 14 stuck loops). pkill'd them and switched to log-grep polling (`for i; do grep -q "RUN SUMMARY" log && break; sleep 5; done`) which is self-match-free. Won't repeat the pgrep -f anti-pattern.

## 2026-05-28 — E2/HC1 Adversary PASS; E3/HC4 CLAIMED (no-regression rationale)

- Adversary PASS on HC1 (REVIEW-1e): own custom-html cold-verify showed `head_ref=8a026066 == chaos-version=8a026066`, version 1.10.0→1.11.0, deploy-count=1, additive generic+overlay both ran post-op, clean teardown. Plus an adversarial monkey-patch probe that swapped chaos-version against a fake head_ref proved `assert_upgraded` fails loudly — strictly non-vacuous. No new finding. **HC1 ✓ HC2 ✓ HC3 ✓.**
- Claimed E3/HC4 with no-regression rationale: deploy-once + clean teardown exercised in every HC1 and HC3 Adversary run (deploy-count=1, no leftover); no assertion weakened (verified at HC3 PASS); bridge/Drone/`!testme` trigger path unchanged from 1d (DG6 PASS holds); intentional behaviour evolutions documented in DECISIONS. F1e-2 (concurrent recipe-fetch race) is pre-existing in 1d (Adversary's own framing: "not blocking E1"; Drone MAX_TESTS=1 bounds practical impact) — not a 1e regression, tracked for future. Awaiting Adversary cold-verify of HC4 to write ## DONE.

## 2026-05-28 — ## DONE (HC4 PASS, NO VETO; all four HC items cold-verified within 24 h)

- Adversary cold-verified HC4 (REVIEW-1e "Final E1/HC3 verdict ... PASS.
NO VETO") via build **#155** — own `!testme` on `recipe-maintainers/custom-html` PR#2, full production chain bridge→Drone→runner. Highlights:
  - D1 latency: 9 s comment→build trigger; dedup + auth clean; PR comment reflection ✅.
  - HC1 live: `upgrade→PR-head: head_ref=db9a9502 chaos-version=db9a9502 version=1.10.0+1.28.0→1.13.0+1.31.1`. Full-sha match — `$REF` flowed bridge→Drone→runner→re-checkout→chaos correctly.
  - HC3 additive in production: every tier ran BOTH generic + cc-ci overlay; 8 assertions PASSED.
  - HC2 default-deny under load: custom-html not on allowlist → cc-ci+generic only.
  - DG4.1: deploy-count=1; teardown sacred (no leftover stack/volume).
  - D6 secret-leak grep over the full build #155 log: 0/58 matches.
  - F1e-1 fix verified under real load: `test_backup_captures_state PASSED`.
  - F1e-2 confirmed pre-existing, not a 1e regression; bounded by `MAX_TESTS=1`; tracked for future.
- All four HC items Adversary cold-verified PASS within 24 h: HC1 ✓ (7472561 + build #155) · HC2 ✓ (c7ae296) · HC3 ✓ (e75ec1b/6eabfdc) · HC4 ✓ (6397cd5 + #155).
- Wrote `## DONE` to STATUS-1e.md. Builder loop stops; next is Phase 2.
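
For reference, the HC1 "moved" rule recorded in this journal (require chaos-version to match head_ref when head_ref is known, otherwise fall back to a version/image/chaos move check) can be sketched as below. This is an illustration under assumptions, not the harness's `assert_upgraded`: the function name `check_upgraded`, the dict shape (taken from the `{version,image,chaos}` identity mentioned earlier), and the tolerance for abbreviated shas are all hypothetical.

```python
from typing import Optional


def check_upgraded(before: dict, after: dict, head_ref: Optional[str] = None) -> None:
    """Raise AssertionError unless the deployment provably moved.

    `before` and `after` are identity dicts with the keys
    "version", "image" and "chaos" (any of which may be None).
    """
    if head_ref:
        chaos = (after.get("chaos") or "").strip()
        # Assumption: either side may be an abbreviated sha, so compare by prefix.
        matches = bool(chaos) and (
            head_ref.startswith(chaos) or chaos.startswith(head_ref)
        )
        if not matches:
            raise AssertionError(
                f"chaos-version {chaos!r} does not match head_ref {head_ref!r}"
            )
        return
    # head_ref unknown: fall back to "anything in the identity changed".
    moved = any(
        before.get(key) != after.get(key) for key in ("version", "image", "chaos")
    )
    if not moved:
        raise AssertionError("deployment identity did not change after upgrade")


# Usage with the values from the hedgedoc run logged above
# (image names are made up for the example).
check_upgraded(
    {"version": "3.0.9+1.10.7", "image": "img:old", "chaos": None},
    {"version": "3.0.10+1.10.8", "image": "img:new", "chaos": "09bf4d54"},
    head_ref="09bf4d54",
)
```

With head_ref known, a chaos redeploy of a stale checkout fails even though the chaos label changed, which is what makes the check non-vacuous.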