cc-ci/machine-docs/JOURNAL-1e.md
autonomic-bot 0fe12188f2 DONE(1e): Phase 1e complete — HC1-HC4 all Adversary cold-verified PASS, NO VETO
build #155 (own !testme on custom-html PR#2): head_ref=db9a9502 == chaos-version=db9a9502
(1.10.0→1.13.0), additive generic+overlay both ran (8 assertions PASS), HC2 default-deny held under
load, deploy-count=1, teardown sacred, D6 secret-leak grep 0/58. F1e-1 CLOSED. F1e-2 pre-existing
(not a 1e regression). The generic-harness corrections are landed; foundation ready for Phase 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:26:42 +01:00


JOURNAL — Phase 1e (generic-harness corrections)

Append-only Builder log: what I did + verifying command/output + next.

2026-05-28 — Phase 1e bootstrap + orientation

  • Read the phase plan (plan-phase1e-harness-corrections.md) + plan.md §6.1/§7/§9. Phase 1d is DONE (STATUS-1d ## DONE, DG1-DG8 Adversary PASS). Studied the harness: runner/run_recipe_ci.py (deploy-once orchestrator), runner/harness/{discovery,generic,lifecycle,abra}.py, tests/conftest.py, tests/_generic/*, the overlays (custom-html/keycloak/cryptpad/n8n/matrix-synapse), and tests/unit/test_discovery.py.
  • Access re-verified: ssh cc-ci 'hostname && whoami' → nixos / root.
  • Settled the three open decisions (HC1 deploy-count, HC2 allowlist, HC3 opt-out) in DECISIONS.md.
  • Created STATUS-1e / BACKLOG-1e / JOURNAL-1e. Order of work: E0 (HC2) → E1 (HC3) → E2 (HC1) → E3.
  • Key design notes:
    • HC3 op/assertion split: orchestrator performs each mutating op once; generic + overlay both run as assertions after. Op results (pre-upgrade identity, snapshot_id) passed via run-scoped $CCCI_OP_STATE_FILE. Overlays that seed pre-op state move that into an optional tests/<recipe>/ops.py (pre_<op>(domain, meta)); overlay test_<op>.py become assertion-only.
    • HC1: re-checkout PR head (recorded as recipe HEAD right after fetch) then abra app deploy --chaos; moved-assertion accepts the chaos label as proof PR-head deployed; deploy-count counts only deploy_app (app new), not the in-place chaos redeploy.
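
The optional pre-op hook convention from the HC3 note can be sketched as below. This is a minimal illustration, not the harness code: `load_pre_op_hook` and the importlib-based loading are assumptions made for the example. Only the `ops.py` file name and the `pre_<op>(domain, meta)` shape come from the notes above.

```python
import importlib.util
from pathlib import Path
from typing import Callable, Optional


def load_pre_op_hook(recipe_dir: Path, op: str) -> Optional[Callable[[str, dict], object]]:
    """Return pre_<op> from <recipe_dir>/ops.py, or None when no hook is defined.

    Hypothetical helper: the real orchestrator may locate and import ops.py differently.
    """
    ops_file = recipe_dir / "ops.py"
    if not ops_file.is_file():
        return None  # ops.py is optional; most recipes need no seeding
    spec = importlib.util.spec_from_file_location(f"ops_{recipe_dir.name}", ops_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    hook = getattr(module, f"pre_{op}", None)
    return hook if callable(hook) else None
```

A recipe that only seeds state before upgrade would define `pre_upgrade` alone; asking for `pre_backup` then returns `None` and the tier goes straight to the op.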

Next: E0 — implement the HC2 allowlist + discovery gate + unit tests.

2026-05-28 — E0 / HC2 repo-local trust gate (DONE, CLAIMED)

  • Implemented the approval allowlist (tests/repo-local-approved.txt, default empty ⇒ default-deny)
    • centralized gate in runner/harness/discovery.py: approved_recipes()/repo_local_approved()/_gated(). Split overlay resolution into resolve_overlay_op (repo-local>cc-ci, gated) + generic_op (the floor) for HC3; kept back-compat resolve_op (override). custom_tests/install_steps/new pre_op_hook all route repo-local through _gated. Allowlist path overridable via CCCI_REPO_LOCAL_APPROVED_FILE.
  • Rewrote tests/unit/test_discovery.py for the gate (approved-vs-not for overlay/custom/hook/pre-op + the generic floor + default-empty-allowlist invariant).
  • Verified on cc-ci (tar-piped working tree → /root/cc-ci; cc-ci has no rsync): cc-ci-run -m pytest tests/unit -q → 8 passed in 0.06s. The cc-ci-authored hook is unaffected (DG5): discovery.install_steps("custom-html-tiny", None) → ('cc-ci', '.../tests/custom-html-tiny/install_steps.sh')
  • Committed d38a695, pushed. Gate E0/HC2 CLAIMED for Adversary.
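
The default-deny behaviour of the gate can be sketched roughly as follows. The function names `approved_recipes` and `repo_local_approved` and the env override mirror the entry above, but the bodies are assumptions: the one-name-per-line file format, the `#` comment handling, and the `gated` signature are illustrative only.

```python
import os
from pathlib import Path

DEFAULT_ALLOWLIST = "tests/repo-local-approved.txt"


def approved_recipes() -> set:
    """Read approved recipe names; a missing or empty file approves nothing."""
    path = Path(os.environ.get("CCCI_REPO_LOCAL_APPROVED_FILE", DEFAULT_ALLOWLIST))
    if not path.is_file():
        return set()
    names = set()
    for line in path.read_text().splitlines():
        entry = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if entry:
            names.add(entry)
    return names


def repo_local_approved(recipe: str) -> bool:
    return recipe in approved_recipes()


def gated(recipe: str, repo_local_path, fallback):
    """Return the repo-local candidate only for approved recipes, else the fallback."""
    if repo_local_path is not None and repo_local_approved(recipe):
        return ("repo-local", repo_local_path)
    return fallback
```

With an empty allowlist every call falls through to the fallback, which is the default-deny invariant the unit tests check.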

Next: E1 (HC3) — orchestrator op/assertion split + additive generic + opt-out + overlay migration.

2026-05-28 — E1 / HC3 additive generic + op/assertion split (implemented + e2e verified)

  • Harness core: lifecycle.deployed_identity now returns {version,image,chaos} (chaos label captured, ready for HC1). generic.py split: op primitives perform_upgrade/perform_backup/perform_restore (orchestrator-only, no asserts) + assertions assert_upgraded (serving + MOVED via version/image/chaos), assert_backup_artifact, assert_restore_healthy, all reading the run-scoped op_state() ($CCCI_OP_STATE_FILE).
  • Orchestrator (run_recipe_ci.py): new run_lifecycle_tier = pre-op seed hook (ops.py pre_<op>, imported in-process w/ recipe dir on sys.path) → perform the op ONCE → run generic assertion (unless _skip_generic) + overlay assertion, both against the shared post-op deployment. Opt-out: CCCI_SKIP_GENERIC / CCCI_SKIP_GENERIC_<OP> / recipe_meta.SKIP_GENERIC. _scrub factored so op-failure messages are redacted too. Op primitives never call deploy_app ⇒ deploy-count stays 1.
  • Tiers/overlays migrated to assertion-only: generic _generic/test_{upgrade,backup,restore}.py; all 6 recipes' test_{upgrade,backup,restore}.py. Pre-op seeding (data-continuity markers + the backup→restore mutation) moved to per-recipe ops.py (pre_upgrade/pre_backup/pre_restore). install overlays unchanged (no op). No assertion weakened — every data-survival/return check kept.
  • Verified on cc-ci:
    • cc-ci-run -m pytest tests/unit -q → 8 passed; nix develop .#lint → lint: PASS (ruff format + check clean).
    • Full e2e RECIPE=custom-html STAGES=install,upgrade,backup,restore,custom → every tier ran BOTH generic AND overlay (additive): install(generic test_serving + overlay test_serving_and_content), upgrade(pre_upgrade seed → generic test_upgrade_reconverges + overlay test_upgrade_preserves_data), backup(pre_backup → generic test_backup_artifact + overlay test_backup_captures_state), restore(pre_restore → generic test_restore_healthy + overlay test_restore_returns_state). RUN SUMMARY: deploy-count=1, install/upgrade/backup/restore=pass, custom=skip; no leftover custom-html stack (clean teardown). Log: /root/ccci-1e-customhtml.log on cc-ci.
    • Opt-out run (CCCI_SKIP_GENERIC=1) in flight to show generic skipped + overlay still runs.
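
The tier ordering described above (seed, perform the op once, then run both assertions against the shared result) can be sketched as below. This is a simplified model, not `run_lifecycle_tier` itself: `run_tier`, `write_op_state`, the JSON file format, and the callable-based assertions are assumptions; the real harness runs its assertions as pytest files.

```python
import json
import os


def write_op_state(op: str, result: dict) -> None:
    """Merge one op's result into the run-scoped state file."""
    path = os.environ["CCCI_OP_STATE_FILE"]
    state = {}
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path) as fh:
            state = json.load(fh)
    state[op] = result
    with open(path, "w") as fh:
        json.dump(state, fh)


def op_state() -> dict:
    """Read back everything the ops have recorded so far."""
    with open(os.environ["CCCI_OP_STATE_FILE"]) as fh:
        return json.load(fh)


def run_tier(op, perform_op, generic_assert=None, overlay_assert=None,
             pre_op=None, skip_generic=False) -> list:
    """Run one lifecycle tier and return the ordered list of steps that ran."""
    ran = []
    if pre_op is not None:
        pre_op()  # optional seeding, e.g. a data-continuity marker
        ran.append(f"pre_{op}")
    write_op_state(op, perform_op())  # the only mutating step in the tier
    ran.append(f"perform_{op}")
    if generic_assert is not None and not skip_generic:
        generic_assert(op_state())
        ran.append("generic")
    if overlay_assert is not None:
        overlay_assert(op_state())
        ran.append("overlay")
    return ran
```

Because neither assertion calls the op, adding the generic floor on top of an overlay costs no extra mutation and no extra deploy.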

Next: confirm opt-out result, claim E1/HC3 gate, then E2 (HC1 chaos-to-PR-head).

2026-05-28 — E1 opt-out verified; gate CLAIMED

  • Opt-out e2e RECIPE=custom-html STAGES=install,upgrade,backup,restore CCCI_SKIP_GENERIC=1: every tier logged generic=skip, overlay=cc-ci; 0 _generic/test_* files ran; only the 4 cc-ci overlays ran; deploy-count=1; install/upgrade/backup/restore=pass; clean teardown (no leftover custom-html stack). Log: /root/ccci-1e-optout.log.
  • HC3 proven both ways: default = generic+overlay additive on one deployment (op once); opt-out = generic floor skipped, overlay still runs. Gate E1/HC3 CLAIMED for Adversary.
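
One way the three opt-out switches could combine is sketched here. The switch names come from the E1 entry; everything else is an assumption, in particular the accepted truthy strings, treating `SKIP_GENERIC` as a boolean key in a metadata dict, and the "any switch wins" precedence.

```python
import os

TRUTHY = {"1", "true", "yes"}  # assumption: the harness may accept other spellings


def skip_generic(op: str, recipe_meta: dict, env=os.environ) -> bool:
    """Return True when the generic floor should be skipped for this op."""
    if env.get("CCCI_SKIP_GENERIC", "").lower() in TRUTHY:
        return True  # global opt-out for every tier
    if env.get(f"CCCI_SKIP_GENERIC_{op.upper()}", "").lower() in TRUTHY:
        return True  # per-op opt-out, e.g. CCCI_SKIP_GENERIC_BACKUP
    return bool(recipe_meta.get("SKIP_GENERIC", False))  # recipe-level opt-out
```

The overlay is never consulted here: opting out only removes the generic assertion, so the overlay still runs.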

2026-05-28 — Adversary F1e-1 (HC3 opt-out race) + HC1 hardening

  • F1e-1 (E1/HC3 FAIL withheld): under CCCI_SKIP_GENERIC=1, test_backup_captures_state flaked '' == 'original'. Root cause (valid): lifecycle.exec_in_app returned proc.stdout WITHOUT checking returncode — when backup-bot cycles the app container, docker exec fails and the empty stdout was silently returned as data; the generic pytest spawn (~1s) had been an accidental timing buffer that opt-out removes. Fix (no assertion weakened): exec_in_app now polls — re-resolves the container + re-execs until returncode==0 or a 90s timeout, then RAISES. A container-cycle race now waits-and-succeeds; a genuine exec failure is loud, never masquerades as empty data. This makes the backup/restore overlays robust to the post-op cycle independent of the generic timing buffer, so opt-out is behavior-neutral.
  • HC1 hardening (my own findings from E2 e2e):
    • head_ref capture was racy (returned None under a concurrent run wiping the shared recipe dir), and a chaos-redeploy of the SAME prev checkout falsely "moved" via the chaos label alone. Fixes: head_ref = ref or recipe_head_commit(recipe) (prefer the explicit PR head sha $REF — robust, no git race; production !testme always sets REF); store head_ref in op_state.
    • assert_upgraded now, when head_ref is known, REQUIRES the deployed chaos-version commit to MATCH head_ref — direct proof the PR-head code under test was deployed, and non-vacuous (a stale prev-checkout chaos redeploy stamps prev's commit ≠ head_ref → FAIL). Falls back to the version/image/chaos move check only when head_ref is unknown.
  • Coordination note: my E2 manual custom-html e2e ran concurrently with the Adversary's E1 cold-verify — both share /root/.abra/recipes/custom-html + (at PR=0) the same run domain, so they collided (explains my non-deterministic 1.10→1.11 vs 1.10→1.10 and the None head_ref). Manual ad-hoc runs bypass Drone's capacity=1 queue. Going forward I serialize: don't run a recipe manually while a gate is under Adversary verification; verify when pgrep run_recipe_ci is clear.
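
The F1e-1 fix pattern (re-resolve, re-exec, raise on timeout) can be sketched as follows. This is an illustrative stand-in for `lifecycle.exec_in_app`, not its actual code: `exec_with_retry`, the `resolve_container` callback, the 2 s interval, and the injectable `runner` are assumptions; the 90 s default and the returncode check come from the entry above.

```python
import subprocess
import time
from typing import Callable, List


def exec_with_retry(
    resolve_container: Callable[[], str],
    command: List[str],
    timeout_s: float = 90.0,
    interval_s: float = 2.0,
    runner=subprocess.run,
) -> str:
    """Run a command in the app container, retrying while the container is cycling."""
    deadline = time.monotonic() + timeout_s
    while True:
        container = resolve_container()  # re-resolve: the container may have been replaced
        proc = runner(["docker", "exec", container, *command],
                      capture_output=True, text=True)
        if proc.returncode == 0:
            return proc.stdout  # only a successful exec counts as data
        if time.monotonic() >= deadline:
            raise RuntimeError(
                f"exec failed after {timeout_s}s "
                f"(rc={proc.returncode}: {proc.stderr.strip()})"
            )
        time.sleep(interval_s)
```

The key property is that an empty string can only be returned when the command itself succeeded and printed nothing, so a failed exec can no longer be compared against 'original'.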

2026-05-28 — E2 head_ref plumbing bug (fixed)

  • Debug print at main() head_ref capture showed head_ref='09bf4d54...' (correct hash), but perform_upgrade printed head_ref=None. Root cause: my earlier perl regex to swap target → head_ref in the four run_lifecycle_tier call sites only matched the SINGLE-LINE form; the multi-line upgrade and restore calls (lint-wrapped) still passed target (which is the VERSION env, None for !testme runs). So perform_upgrade got head_ref=None for upgrade tier → re-checkout skipped → chaos deploy of whatever leftover checkout (prev tag from deploy_app) → vacuous prev→prev chaos redeploy that "passed" via the chaos-label move fallback.
  • Fixed: explicit Edit on the two multi-line calls so they now pass head_ref consistently (recipe/"upgrade"|"backup"|"restore", repo_local, domain, meta, head_ref, op_state). grep confirms all 4 tier calls pass head_ref. compile OK.
  • Net effect now: head_ref reaches perform_upgrade → recipe_checkout_ref(head_ref) restores PR-head before chaos deploy → after.chaos == head_ref → assert_upgraded match succeeds non-vacuously.
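
The strict check that this plumbing feeds can be sketched as below. It is a simplified model of `assert_upgraded`, with the serving check omitted. The identity dict keys follow the `{version,image,chaos}` shape noted in the E1 entry; the prefix comparison (to tolerate abbreviated shas) is an assumption, not confirmed harness behaviour.

```python
from typing import Optional


def assert_upgraded(before: dict, after: dict, head_ref: Optional[str]) -> None:
    """Fail unless the deployment provably moved to the code under test."""
    if head_ref:
        chaos = after.get("chaos") or ""
        # Assumption: either side may be an abbreviated sha, so compare by prefix.
        matches = bool(chaos) and (head_ref.startswith(chaos) or chaos.startswith(head_ref))
        assert matches, f"chaos-version {chaos!r} does not match head_ref {head_ref!r}"
        return
    # Fallback when head_ref is unknown: any identity field changing counts as a move.
    moved = any(before.get(k) != after.get(k) for k in ("version", "image", "chaos"))
    assert moved, f"deployment did not move: {before!r} -> {after!r}"
```

With `head_ref=None` a stale redeploy still passes through the fallback, because the chaos label alone changes. That is the vacuous pass the plumbing bug allowed.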

2026-05-28 — E2/HC1 CLAIMED (chaos-version==head_ref proven on hedgedoc)

  • Verified hedgedoc HC1 e2e (commit 7472561, log /root/ccci-1e-hc1-hed4.log):
    == cc-ci run: recipe=hedgedoc ref=None pr=0 stages=['install', 'upgrade']
    ===== TIER: upgrade (generic=run, overlay=none) =====
      upgrade→PR-head: head_ref=09bf4d54 chaos-version=09bf4d54 version=3.0.9+1.10.7→3.0.10+1.10.8
    PASSED tests/_generic/test_upgrade.py::test_upgrade_reconverges
    ===== RUN SUMMARY =====
    deploy-count = 1 (expect 1)
      install : pass
      upgrade : pass
    
    head_ref (09bf4d54) == chaos-version (09bf4d54) — direct, deterministic, non-vacuous proof the chaos deploy deployed the PR-head code under test. Plus a real version bump 3.0.9→3.0.10. deploy-count=1; clean teardown.
  • E3/HC4 docs work shipped in 7472561 (docs/testing.md + docs/enroll-recipe.md fully rewritten for HC1/HC2/HC3: additive generic + opt-out + ops.py + chaos PR-head + repo-local allowlist).
  • All three HC items implemented + Builder-verified. Awaiting Adversary cold-verify of HC1 and HC4.

Background-task pgrep self-match note (lesson learned)

  • My until ! pgrep -f run_recipe_ci.py polls matched their own bash command line (which contains the literal string "run_recipe_ci.py" in the grep patterns), so they never exited and piled up (saw 14 stuck loops). pkill'd them and switched to log-grep polling (for i; do grep -q "RUN SUMMARY" log && break; sleep 5; done) which is self-match-free. Won't repeat the pgrep -f anti-pattern.
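
The log-based wait can also be written as a small helper, sketched here in Python. `wait_for_marker` and its defaults are hypothetical; only the "RUN SUMMARY" marker and the 5 s poll interval come from the note above. It inspects a file rather than the process table, so it cannot match its own command line.

```python
import time
from pathlib import Path


def wait_for_marker(log_path: str, marker: str = "RUN SUMMARY",
                    timeout_s: float = 1800.0, interval_s: float = 5.0) -> bool:
    """Poll a log file until the marker appears; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    path = Path(log_path)
    while True:
        if path.is_file() and marker in path.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False  # bounded wait: the loop cannot pile up forever
        time.sleep(interval_s)
```

Unlike the original `until` loops, the timeout guarantees the poll ends even if the run dies before writing its summary.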

2026-05-28 — E2/HC1 Adversary PASS; E3/HC4 CLAIMED (no-regression rationale)

  • Adversary PASS on HC1 (REVIEW-1e): own custom-html cold-verify showed head_ref=8a026066 == chaos-version=8a026066, version 1.10.0→1.11.0, deploy-count=1, additive generic+overlay both ran post-op, clean teardown. Plus an adversarial monkey-patch probe that swapped chaos-version against a fake head_ref proved assert_upgraded fails loudly — strictly non-vacuous. No new finding. HC1 ✓ HC2 ✓ HC3 ✓.
  • Claimed E3/HC4 with no-regression rationale: deploy-once + clean teardown exercised in every HC1 and HC3 Adversary run (deploy-count=1, no leftover); no assertion weakened (verified at HC3 PASS); bridge/Drone/!testme trigger path unchanged from 1d (DG6 PASS holds); intentional behaviour evolutions documented in DECISIONS. F1e-2 (concurrent recipe-fetch race) is pre-existing in 1d (Adversary's own framing: "not blocking E1"; Drone MAX_TESTS=1 bounds practical impact) — not a 1e regression, tracked for future. Awaiting Adversary cold-verify of HC4 to write ## DONE.

2026-05-28 — ## DONE (HC4 PASS, NO VETO; all four HC items cold-verified within 24 h)

  • Adversary cold-verified HC4 (REVIEW-1e "Final E1/HC3 verdict ... PASS. NO VETO") via build #155 — own !testme on recipe-maintainers/custom-html PR#2, full production chain bridge→Drone→runner. Highlights:
    • D1 latency: 9 s comment→build trigger; dedup + auth clean; PR comment reflection.
    • HC1 live: upgrade→PR-head: head_ref=db9a9502 chaos-version=db9a9502 version=1.10.0+1.28.0→1.13.0+1.31.1. Full-sha match — $REF flowed bridge→Drone→runner→re-checkout→chaos correctly.
    • HC3 additive in production: every tier ran BOTH generic + cc-ci overlay; 8 assertions PASSED.
    • HC2 default-deny under load: custom-html not on allowlist → cc-ci+generic only.
    • DG4.1: deploy-count=1; teardown sacred (no leftover stack/volume).
    • D6 secret-leak grep over the full build #155 log: 0/58 matches.
    • F1e-1 fix verified under real load: test_backup_captures_state PASSED.
    • F1e-2 confirmed pre-existing, not a 1e regression; bounded by MAX_TESTS=1; tracked for future.
  • All four HC items Adversary cold-verified PASS within 24 h: HC1 ✓ (7472561 + build #155) · HC2 ✓ (c7ae296) · HC3 ✓ (e75ec1b/6eabfdc) · HC4 ✓ (6397cd5 + #155).
  • Wrote ## DONE to STATUS-1e.md. Builder loop stops; next is Phase 2.