build #155 (own !testme on custom-html PR#2): head_ref=db9a9502 == chaos-version=db9a9502 (1.10.0→1.13.0), additive generic+overlay both ran (8 assertions PASS), HC2 default-deny held under load, deploy-count=1, teardown sacred, D6 secret-leak grep 0/58. F1e-1 CLOSED. F1e-2 pre-existing (not a 1e regression). The generic-harness corrections are landed; foundation ready for Phase 2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
13 KiB
JOURNAL — Phase 1e (generic-harness corrections)
Append-only Builder log: what I did + verifying command/output + next.
2026-05-28 — Phase 1e bootstrap + orientation
- Read the phase plan (plan-phase1e-harness-corrections.md) + plan.md §6.1/§7/§9. Phase 1d is DONE (STATUS-1d ## DONE, DG1–DG8 Adversary PASS). Studied the harness: runner/run_recipe_ci.py (deploy-once orchestrator), runner/harness/{discovery,generic,lifecycle,abra}.py, tests/conftest.py, tests/_generic/*, the overlays (custom-html/keycloak/cryptpad/n8n/matrix-synapse), and tests/unit/test_discovery.py.
- Access re-verified: ssh cc-ci 'hostname && whoami' → nixos/root.
- Settled the three open decisions (HC1 deploy-count, HC2 allowlist, HC3 opt-out) in DECISIONS.md.
- Created STATUS-1e / BACKLOG-1e / JOURNAL-1e. Order of work: E0 (HC2) → E1 (HC3) → E2 (HC1) → E3.
- Key design notes:
  - HC3 op/assertion split: orchestrator performs each mutating op once; generic + overlay both run as assertions after. Op results (pre-upgrade identity, snapshot_id) passed via run-scoped $CCCI_OP_STATE_FILE. Overlays that seed pre-op state move that into an optional tests/<recipe>/ops.py (pre_<op>(domain, meta)); overlay test_<op>.py become assertion-only.
  - HC1: re-checkout PR head (recorded as recipe HEAD right after fetch), then abra app deploy --chaos; moved-assertion accepts the chaos label as proof PR-head deployed; deploy-count counts only deploy_app (app new), not the in-place chaos redeploy.
Next: E0 — implement the HC2 allowlist + discovery gate + unit tests.
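A minimal sketch of how op results could travel from the orchestrator to the assertions through the run-scoped state file. Only `$CCCI_OP_STATE_FILE` and the `op_state()` name come from this journal; the JSON encoding, the `record_op_state` helper, and the sample values are assumptions for illustration, not the harness code.

```python
import json
import os
import tempfile
from pathlib import Path


def _state_path() -> Path:
    # The orchestrator exports the run-scoped path in $CCCI_OP_STATE_FILE.
    return Path(os.environ["CCCI_OP_STATE_FILE"])


def op_state() -> dict:
    """Read the op results; empty dict if no op has recorded anything yet."""
    path = _state_path()
    return json.loads(path.read_text()) if path.exists() else {}


def record_op_state(**results) -> None:
    """Hypothetical orchestrator-only helper: merge new op results into the file."""
    state = op_state()
    state.update(results)
    _state_path().write_text(json.dumps(state))


# Orchestrator side: each mutating op runs once and records what it observed.
os.environ["CCCI_OP_STATE_FILE"] = os.path.join(tempfile.mkdtemp(), "op_state.json")
record_op_state(pre_upgrade={"version": "1.10.0", "image": None, "chaos": None})
record_op_state(snapshot_id="example-snapshot")  # illustrative value

# Assertion side (generic or overlay test): read-only view of the same results.
state = op_state()
```

Because the assertions only read the file, the generic and overlay tests can both run against the same post-op deployment without repeating the op.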
2026-05-28 — E0 / HC2 repo-local trust gate (DONE, CLAIMED)
- Implemented the approval allowlist (tests/repo-local-approved.txt, default empty ⇒ default-deny) + centralized gate in runner/harness/discovery.py: approved_recipes() / repo_local_approved() / _gated(). Split overlay resolution into resolve_overlay_op (repo-local>cc-ci, gated) + generic_op (the floor) for HC3; kept back-compat resolve_op (override). custom_tests / install_steps / new pre_op_hook all route repo-local through _gated. Allowlist path overridable via CCCI_REPO_LOCAL_APPROVED_FILE.
- Rewrote tests/unit/test_discovery.py for the gate (approved-vs-not for overlay/custom/hook/pre-op + the generic floor + default-empty-allowlist invariant).
- Verified on cc-ci (tar-piped working tree → /root/cc-ci; cc-ci has no rsync): cc-ci-run -m pytest tests/unit -q → 8 passed in 0.06s. And the cc-ci-authored hook is unaffected (DG5): discovery.install_steps("custom-html-tiny", None) → ('cc-ci', '.../tests/custom-html-tiny/install_steps.sh').
- Committed d38a695, pushed. Gate E0/HC2 CLAIMED for Adversary.
Next: E1 (HC3) — orchestrator op/assertion split + additive generic + opt-out + overlay migration.
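The default-deny behaviour of the gate can be sketched as follows. This is an illustration under stated assumptions, not the code in runner/harness/discovery.py: the one-recipe-per-line file format, the `#` comment handling, the `gated` signature, and the example paths are all hypothetical. The `(source, path)` return shape mirrors the `install_steps` output quoted above.

```python
import tempfile
from pathlib import Path


def approved_recipes(allowlist: Path) -> set[str]:
    """Recipe names approved for repo-local tests; missing or empty file => none."""
    if not allowlist.exists():
        return set()
    names = set()
    for line in allowlist.read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if line:
            names.add(line)
    return names


def repo_local_approved(recipe: str, allowlist: Path) -> bool:
    return recipe in approved_recipes(allowlist)


def gated(recipe, repo_local_path, cc_ci_path, allowlist):
    """Prefer repo-local only when approved; otherwise fall back to cc-ci."""
    if repo_local_path is not None and repo_local_approved(recipe, allowlist):
        return ("repo-local", repo_local_path)
    if cc_ci_path is not None:
        return ("cc-ci", cc_ci_path)
    return None  # caller falls through to the generic floor


allowlist = Path(tempfile.mkdtemp()) / "repo-local-approved.txt"
allowlist.write_text("")  # default empty => default-deny
denied = gated("custom-html", "recipe/tests/test_upgrade.py",
               "tests/custom-html/test_upgrade.py", allowlist)

allowlist.write_text("custom-html\n")
allowed = gated("custom-html", "recipe/tests/test_upgrade.py",
                "tests/custom-html/test_upgrade.py", allowlist)
```

Routing every repo-local lookup through one function keeps the trust decision in a single place, which is what makes the default-empty-allowlist invariant testable.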
2026-05-28 — E1 / HC3 additive generic + op/assertion split (implemented + e2e verified)
- Harness core: lifecycle.deployed_identity now returns {version,image,chaos} (chaos label captured, ready for HC1). generic.py split: op primitives perform_upgrade / perform_backup / perform_restore (orchestrator-only, no asserts) + assertions assert_upgraded (serving + MOVED via version/image/chaos), assert_backup_artifact, assert_restore_healthy, all reading the run-scoped op_state() ($CCCI_OP_STATE_FILE).
- Orchestrator (run_recipe_ci.py): new run_lifecycle_tier = pre-op seed hook (ops.py pre_<op>, imported in-process w/ recipe dir on sys.path) → perform the op ONCE → run generic assertion (unless _skip_generic) + overlay assertion, both against the shared post-op deployment. Opt-out: CCCI_SKIP_GENERIC / CCCI_SKIP_GENERIC_<OP> / recipe_meta.SKIP_GENERIC. _scrub factored out so op-failure messages are redacted too. Op primitives never call deploy_app ⇒ deploy-count stays 1.
- Tiers/overlays migrated to assertion-only: generic _generic/test_{upgrade,backup,restore}.py; all 6 recipes' test_{upgrade,backup,restore}.py. Pre-op seeding (data-continuity markers + the backup→restore mutation) moved to per-recipe ops.py (pre_upgrade/pre_backup/pre_restore). install overlays unchanged (no op). No assertion weakened: every data-survival/return check kept.
- Verified on cc-ci:
  - cc-ci-run -m pytest tests/unit -q → 8 passed; nix develop .#lint → lint: PASS (ruff format + check clean).
  - Full e2e RECIPE=custom-html STAGES=install,upgrade,backup,restore,custom → every tier ran BOTH generic AND overlay (additive): install (generic test_serving + overlay test_serving_and_content), upgrade (pre_upgrade seed → generic test_upgrade_reconverges + overlay test_upgrade_preserves_data), backup (pre_backup → generic test_backup_artifact + overlay test_backup_captures_state), restore (pre_restore → generic test_restore_healthy + overlay test_restore_returns_state). RUN SUMMARY: deploy-count=1, install/upgrade/backup/restore=pass, custom=skip; no leftover custom-html stack (clean teardown). Log: /root/ccci-1e-customhtml.log on cc-ci.
  - Opt-out run (CCCI_SKIP_GENERIC=1) in flight to show generic skipped + overlay still runs.
Next: confirm opt-out result, claim E1/HC3 gate, then E2 (HC1 chaos-to-PR-head).
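The tier flow described in this entry (seed, perform the op once, then run generic and overlay assertions against the same deployment) can be sketched like this. The function names `skip_generic` and `run_tier`, the callable-based signature, the `"1"` truthy convention, the upper-cased per-op variable name, and the accepted shapes of `SKIP_GENERIC` are assumptions; only the three opt-out sources are taken from the journal.

```python
import os


def skip_generic(op, meta=None, env=None):
    """True when the generic floor is opted out for this op."""
    env = os.environ if env is None else env
    if env.get("CCCI_SKIP_GENERIC") == "1":  # global opt-out
        return True
    if env.get(f"CCCI_SKIP_GENERIC_{op.upper()}") == "1":  # per-op opt-out
        return True
    # Assumption: recipe meta declares either True or a collection of op names.
    declared = getattr(meta, "SKIP_GENERIC", False)
    if declared is True:
        return True
    return bool(declared) and op in declared


def run_tier(op, perform, generic_assert, overlay_assert=None,
             pre_op=None, meta=None, env=None):
    """Seed -> perform the op once -> generic assertion -> overlay assertion."""
    steps = []
    if pre_op is not None:
        pre_op()  # optional per-recipe seeding (ops.py pre_<op>)
        steps.append(f"pre_{op}")
    perform()  # the mutating op runs exactly once per tier
    steps.append(f"perform_{op}")
    if not skip_generic(op, meta, env):
        generic_assert()
        steps.append("generic")
    if overlay_assert is not None:
        overlay_assert()
        steps.append("overlay")
    return steps


noop = lambda: None
default_steps = run_tier("backup", noop, noop, overlay_assert=noop,
                         pre_op=noop, env={})
optout_steps = run_tier("backup", noop, noop, overlay_assert=noop,
                        pre_op=noop, env={"CCCI_SKIP_GENERIC": "1"})
```

Keeping the op outside both assertion steps is what lets opt-out remove only the generic check while the overlay still sees the same post-op state.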
2026-05-28 — E1 opt-out verified; gate CLAIMED
- Opt-out e2e RECIPE=custom-html STAGES=install,upgrade,backup,restore CCCI_SKIP_GENERIC=1: every tier logged generic=skip, overlay=cc-ci; 0 _generic/test_* files ran; only the 4 cc-ci overlays ran; deploy-count=1; install/upgrade/backup/restore=pass; clean teardown (no leftover custom-html stack). Log: /root/ccci-1e-optout.log.
- HC3 proven both ways: default = generic+overlay additive on one deployment (op once); opt-out = generic floor skipped, overlay still runs. Gate E1/HC3 CLAIMED for Adversary.
2026-05-28 — Adversary F1e-1 (HC3 opt-out race) + HC1 hardening
- F1e-1 (E1/HC3 FAIL withheld): under CCCI_SKIP_GENERIC=1, test_backup_captures_state flaked with '' == 'original'. Root cause (valid): lifecycle.exec_in_app returned proc.stdout WITHOUT checking returncode. When backup-bot cycles the app container, docker exec fails and the empty stdout was silently returned as data; the generic pytest spawn (~1s) had been an accidental timing buffer that opt-out removes. Fix (no assertion weakened): exec_in_app now polls: it re-resolves the container and re-execs until returncode==0 or a 90s timeout, then RAISES. A container-cycle race now waits and succeeds; a genuine exec failure is loud and never masquerades as empty data. This makes the backup/restore overlays robust to the post-op cycle independent of the generic timing buffer, so opt-out is behavior-neutral.
- HC1 hardening (my own findings from E2 e2e): head_ref capture was racy (returned None under a concurrent run wiping the shared recipe dir), and a chaos-redeploy of the SAME prev checkout falsely "moved" via the chaos label alone. Fixes: head_ref = ref or recipe_head_commit(recipe) (prefer the explicit PR head sha $REF: robust, no git race; production !testme always sets REF); store head_ref in op_state. assert_upgraded now, when head_ref is known, REQUIRES the deployed chaos-version commit to MATCH head_ref. That is direct proof the PR-head code under test was deployed, and it is non-vacuous (a stale prev-checkout chaos redeploy stamps prev's commit ≠ head_ref → FAIL). Falls back to the version/image/chaos move check only when head_ref is unknown.
- Coordination note: my E2 manual custom-html e2e ran concurrently with the Adversary's E1 cold-verify. Both share /root/.abra/recipes/custom-html + (at PR=0) the same run domain, so they collided (explains my non-deterministic 1.10→1.11 vs 1.10→1.10 and the None head_ref). Manual ad-hoc runs bypass Drone's capacity=1 queue. Going forward I serialize: don't run a recipe manually while a gate is under Adversary verification; verify when pgrep run_recipe_ci is clear.
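The F1e-1 fix pattern (never return stdout from a failed exec; retry until success or a deadline, then raise) can be sketched generically. This is not the `exec_in_app` implementation: `exec_with_retry` and the `build_cmd` callback are hypothetical names, and the demo uses a Python subprocess in place of `docker exec` so it runs anywhere. Rebuilding the command on every attempt stands in for re-resolving the container after it has been cycled.

```python
import subprocess
import sys
import time


def exec_with_retry(build_cmd, timeout=90.0, interval=1.0):
    """Run build_cmd() until it exits 0; raise once the deadline has passed.

    build_cmd is called on every attempt so the caller can re-resolve the
    target (for example a new container id) before re-executing.
    """
    deadline = time.monotonic() + timeout
    while True:
        proc = subprocess.run(build_cmd(), capture_output=True, text=True)
        if proc.returncode == 0:
            return proc.stdout  # only a successful exec counts as data
        if time.monotonic() >= deadline:
            raise RuntimeError(
                f"exec failed after {timeout}s: rc={proc.returncode} "
                f"stderr={proc.stderr.strip()!r}"
            )
        time.sleep(interval)


# Demo: the first two attempts fail (simulating a container mid-cycle),
# the third succeeds and its stdout is returned.
attempts = {"n": 0}


def build_cmd():
    attempts["n"] += 1
    code = "import sys; sys.exit(1)" if attempts["n"] < 3 else "print('original')"
    return [sys.executable, "-c", code]


out = exec_with_retry(build_cmd, timeout=10.0, interval=0.01)
```

The key property is that an empty string can only come back from a command that actually succeeded with empty output, so a transient exec failure can no longer surface as `'' == 'original'`.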
2026-05-28 — E2 head_ref plumbing bug (fixed)
- Debug print at main() head_ref capture showed head_ref='09bf4d54...' (correct hash), but perform_upgrade printed head_ref=None. Root cause: my earlier perl regex to swap target → head_ref in the four run_lifecycle_tier call sites only matched the SINGLE-LINE form; the multi-line upgrade and restore calls (lint-wrapped) still passed target (which is the VERSION env, None for !testme runs). So perform_upgrade got head_ref=None for upgrade tier → re-checkout skipped → chaos deploy of whatever leftover checkout (prev tag from deploy_app) → vacuous prev→prev chaos redeploy that "passed" via the chaos-label move fallback.
- Fixed: explicit Edit on the two multi-line calls so they now pass head_ref consistently (recipe/"upgrade"|"backup"|"restore", repo_local, domain, meta, head_ref, op_state). grep confirms all 4 tier calls pass head_ref. compile OK.
- Net effect now: head_ref reaches perform_upgrade → recipe_checkout_ref(head_ref) restores PR-head before chaos deploy → after.chaos == head_ref → assert_upgraded match succeeds non-vacuously.
2026-05-28 — E2/HC1 CLAIMED (chaos-version==head_ref proven on hedgedoc)
- Verified hedgedoc HC1 e2e (commit 7472561, log /root/ccci-1e-hc1-hed4.log): head_ref (09bf4d54) == chaos-version (09bf4d54), which is direct, deterministic, non-vacuous proof the chaos deploy deployed the PR-head code under test. Plus a real version bump 3.0.9→3.0.10. deploy-count=1; clean teardown.

      == cc-ci run: recipe=hedgedoc ref=None pr=0 stages=['install', 'upgrade']
      ===== TIER: upgrade (generic=run, overlay=none) =====
      upgrade→PR-head: head_ref=09bf4d54 chaos-version=09bf4d54 version=3.0.9+1.10.7→3.0.10+1.10.8
      PASSED tests/_generic/test_upgrade.py::test_upgrade_reconverges
      ===== RUN SUMMARY =====
      deploy-count = 1 (expect 1)
      install : pass
      upgrade : pass

- E3/HC4 docs work shipped in 7472561 (docs/testing.md + docs/enroll-recipe.md fully rewritten for HC1/HC2/HC3: additive generic + opt-out + ops.py + chaos PR-head + repo-local allowlist).
- All three HC items implemented + Builder-verified. Awaiting Adversary cold-verify of HC1 and HC4.
Background-task pgrep self-match note (lesson learned)
- My until ! pgrep -f run_recipe_ci.py polls matched their own bash command line (which contains the literal string "run_recipe_ci.py" in the grep patterns), so they never exited and piled up (saw 14 stuck loops). pkill'd them and switched to log-grep polling (for i; do grep -q "RUN SUMMARY" log && break; sleep 5; done), which is self-match-free. Won't repeat the pgrep -f anti-pattern.
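The log-marker approach can also be written as a small bounded poll, sketched here in Python. `wait_for_marker` is a hypothetical helper and the timeout and interval defaults are arbitrary. Unlike a process-name match, it looks only at the log file, so the poller cannot match itself, and the deadline guarantees it exits even if the marker never appears.

```python
import tempfile
import time
from pathlib import Path


def wait_for_marker(log_path: Path, marker: str = "RUN SUMMARY",
                    timeout: float = 600.0, interval: float = 5.0) -> bool:
    """Return True once marker appears in the log, False if the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        if log_path.exists() and marker in log_path.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False  # bounded: a stuck run cannot leave pollers piling up
        time.sleep(interval)


# Demo with a temporary log file instead of a real run log.
log = Path(tempfile.mkdtemp()) / "run.log"
log.write_text("===== TIER: upgrade =====\n")
finished_early = wait_for_marker(log, timeout=0.05, interval=0.01)

log.write_text(log.read_text() + "===== RUN SUMMARY =====\n")
finished = wait_for_marker(log, timeout=1.0, interval=0.01)
```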
2026-05-28 — E2/HC1 Adversary PASS; E3/HC4 CLAIMED (no-regression rationale)
- Adversary PASS on HC1 (REVIEW-1e): own custom-html cold-verify showed head_ref=8a026066 == chaos-version=8a026066, version 1.10.0→1.11.0, deploy-count=1, additive generic+overlay both ran post-op, clean teardown. Plus an adversarial monkey-patch probe that swapped chaos-version against a fake head_ref proved assert_upgraded fails loudly, so the check is strictly non-vacuous. No new finding. HC1 ✓ HC2 ✓ HC3 ✓.
- Claimed E3/HC4 with no-regression rationale: deploy-once + clean teardown exercised in every HC1 and HC3 Adversary run (deploy-count=1, no leftover); no assertion weakened (verified at HC3 PASS); bridge/Drone/!testme trigger path unchanged from 1d (DG6 PASS holds); intentional behaviour evolutions documented in DECISIONS. F1e-2 (concurrent recipe-fetch race) is pre-existing in 1d (Adversary's own framing: "not blocking E1"; Drone MAX_TESTS=1 bounds practical impact), so it is not a 1e regression; tracked for future. Awaiting Adversary cold-verify of HC4 to write ## DONE.
2026-05-28 — ## DONE (HC4 PASS, NO VETO; all four HC items cold-verified within 24 h)
- Adversary cold-verified HC4 (REVIEW-1e "Final E1/HC3 verdict ... PASS. NO VETO") via build #155: own !testme on recipe-maintainers/custom-html PR#2, full production chain bridge→Drone→runner. Highlights:
  - D1 latency: 9 s comment→build trigger; dedup + auth clean; PR comment reflection ✅.
  - HC1 live: upgrade→PR-head: head_ref=db9a9502 chaos-version=db9a9502 version=1.10.0+1.28.0→1.13.0+1.31.1. Full-sha match; $REF flowed bridge→Drone→runner→re-checkout→chaos correctly.
  - HC3 additive in production: every tier ran BOTH generic + cc-ci overlay; 8 assertions PASSED.
  - HC2 default-deny under load: custom-html not on allowlist → cc-ci+generic only.
  - DG4.1: deploy-count=1; teardown sacred (no leftover stack/volume).
  - D6 secret-leak grep over the full build #155 log: 0/58 matches.
  - F1e-1 fix verified under real load: test_backup_captures_state PASSED.
  - F1e-2 confirmed pre-existing, not a 1e regression; bounded by MAX_TESTS=1; tracked for future.
- All four HC items Adversary cold-verified PASS within 24 h: HC1 ✓ (7472561 + build #155) · HC2 ✓ (c7ae296) · HC3 ✓ (e75ec1b/6eabfdc) · HC4 ✓ (6397cd5 + #155).
- Wrote ## DONE to STATUS-1e.md. Builder loop stops; next is Phase 2.