Files
cc-ci/machine-docs/JOURNAL-1e.md
autonomic-bot 0fe12188f2 DONE(1e): Phase 1e complete — HC1-HC4 all Adversary cold-verified PASS, NO VETO
build #155 (own !testme on custom-html PR#2): head_ref=db9a9502 == chaos-version=db9a9502
(1.10.0→1.13.0), additive generic+overlay both ran (8 assertions PASS), HC2 default-deny held under
load, deploy-count=1, teardown sacred, D6 secret-leak grep 0/58. F1e-1 CLOSED. F1e-2 pre-existing
(not a 1e regression). The generic-harness corrections are landed; foundation ready for Phase 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:26:42 +01:00

174 lines
13 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# JOURNAL — Phase 1e (generic-harness corrections)
Append-only Builder log: what I did + verifying command/output + next.
## 2026-05-28 — Phase 1e bootstrap + orientation
- Read the phase plan (`plan-phase1e-harness-corrections.md`) + plan.md §6.1/§7/§9. Phase 1d is DONE
(STATUS-1d ## DONE, DG1DG8 Adversary PASS). Studied the harness: `runner/run_recipe_ci.py`
(deploy-once orchestrator), `runner/harness/{discovery,generic,lifecycle,abra}.py`, `tests/conftest.py`,
`tests/_generic/*`, the overlays (custom-html/keycloak/cryptpad/n8n/matrix-synapse), and
`tests/unit/test_discovery.py`.
- Access re-verified: `ssh cc-ci 'hostname && whoami'``nixos` / `root`.
- Settled the three open decisions (HC1 deploy-count, HC2 allowlist, HC3 opt-out) in DECISIONS.md.
- Created STATUS-1e / BACKLOG-1e / JOURNAL-1e. Order of work: E0 (HC2) → E1 (HC3) → E2 (HC1) → E3.
- Key design notes:
- HC3 op/assertion split: orchestrator performs each mutating op once; generic + overlay both run as
assertions after. Op results (pre-upgrade identity, snapshot_id) passed via run-scoped
`$CCCI_OP_STATE_FILE`. Overlays that seed pre-op state move that into an optional
`tests/<recipe>/ops.py` (`pre_<op>(domain, meta)`); overlay `test_<op>.py` become assertion-only.
- HC1: re-checkout PR head (recorded as recipe HEAD right after fetch) then `abra app deploy --chaos`;
moved-assertion accepts the chaos label as proof PR-head deployed; deploy-count counts only
`deploy_app` (app new), not the in-place chaos redeploy.
Next: E0 — implement the HC2 allowlist + discovery gate + unit tests.
## 2026-05-28 — E0 / HC2 repo-local trust gate (DONE, CLAIMED)
- Implemented the approval allowlist (`tests/repo-local-approved.txt`, default empty ⇒ default-deny)
+ centralized gate in `runner/harness/discovery.py`: `approved_recipes()`/`repo_local_approved()`/
`_gated()`. Split overlay resolution into `resolve_overlay_op` (repo-local>cc-ci, gated) + `generic_op`
(the floor) for HC3; kept back-compat `resolve_op` (override). `custom_tests`/`install_steps`/new
`pre_op_hook` all route repo-local through `_gated`. Allowlist path overridable via
`CCCI_REPO_LOCAL_APPROVED_FILE`.
- Rewrote `tests/unit/test_discovery.py` for the gate (approved-vs-not for overlay/custom/hook/pre-op +
the generic floor + default-empty-allowlist invariant).
- Verified on cc-ci (tar-piped working tree → /root/cc-ci; cc-ci has no rsync):
`cc-ci-run -m pytest tests/unit -q`**8 passed in 0.06s**
And the cc-ci-authored hook is unaffected (DG5):
discovery.install_steps("custom-html-tiny", None) → ('cc-ci', '.../tests/custom-html-tiny/install_steps.sh')
- Committed d38a695, pushed. Gate E0/HC2 CLAIMED for Adversary.
Next: E1 (HC3) — orchestrator op/assertion split + additive generic + opt-out + overlay migration.
## 2026-05-28 — E1 / HC3 additive generic + op/assertion split (implemented + e2e verified)
- **Harness core:** `lifecycle.deployed_identity` now returns `{version,image,chaos}` (chaos label
captured, ready for HC1). `generic.py` split: op primitives `perform_upgrade/perform_backup/
perform_restore` (orchestrator-only, no asserts) + assertions `assert_upgraded` (serving + MOVED via
version/image/chaos), `assert_backup_artifact`, `assert_restore_healthy`, all reading the run-scoped
`op_state()` (`$CCCI_OP_STATE_FILE`).
- **Orchestrator** (`run_recipe_ci.py`): new `run_lifecycle_tier` = pre-op seed hook (`ops.py
pre_<op>`, imported in-process w/ recipe dir on sys.path) → perform the op ONCE → run generic
assertion (unless `_skip_generic`) + overlay assertion, both against the shared post-op deployment.
Opt-out: `CCCI_SKIP_GENERIC` / `CCCI_SKIP_GENERIC_<OP>` / `recipe_meta.SKIP_GENERIC`. `_scrub`
factored so op-failure messages are redacted too. Op primitives never call `deploy_app` ⇒
deploy-count stays 1.
- **Tiers/overlays migrated to assertion-only:** generic `_generic/test_{upgrade,backup,restore}.py`;
all 6 recipes' `test_{upgrade,backup,restore}.py`. Pre-op seeding (data-continuity markers + the
backup→restore mutation) moved to per-recipe `ops.py` (`pre_upgrade/pre_backup/pre_restore`).
install overlays unchanged (no op). No assertion weakened — every data-survival/return check kept.
- **Verified on cc-ci:**
- `cc-ci-run -m pytest tests/unit -q` → **8 passed**; `nix develop .#lint` → **lint: PASS** (ruff
format + check clean).
- Full e2e `RECIPE=custom-html STAGES=install,upgrade,backup,restore,custom` → every tier ran BOTH
generic AND overlay (additive): install(generic test_serving + overlay test_serving_and_content),
upgrade(pre_upgrade seed → generic test_upgrade_reconverges + overlay test_upgrade_preserves_data),
backup(pre_backup → generic test_backup_artifact + overlay test_backup_captures_state),
restore(pre_restore → generic test_restore_healthy + overlay test_restore_returns_state).
**RUN SUMMARY: deploy-count=1, install/upgrade/backup/restore=pass, custom=skip; no leftover
custom-html stack (clean teardown).** Log: /root/ccci-1e-customhtml.log on cc-ci.
- Opt-out run (`CCCI_SKIP_GENERIC=1`) in flight to show generic skipped + overlay still runs.
Next: confirm opt-out result, claim E1/HC3 gate, then E2 (HC1 chaos-to-PR-head).
## 2026-05-28 — E1 opt-out verified; gate CLAIMED
- Opt-out e2e `RECIPE=custom-html STAGES=install,upgrade,backup,restore CCCI_SKIP_GENERIC=1`:
every tier logged `generic=skip, overlay=cc-ci`; **0** `_generic/test_*` files ran; only the 4
cc-ci overlays ran; **deploy-count=1**; install/upgrade/backup/restore=pass; clean teardown (no
leftover custom-html stack). Log: /root/ccci-1e-optout.log.
- HC3 proven both ways: default = generic+overlay additive on one deployment (op once); opt-out =
generic floor skipped, overlay still runs. Gate E1/HC3 CLAIMED for Adversary.
## 2026-05-28 — Adversary F1e-1 (HC3 opt-out race) + HC1 hardening
- **F1e-1 (E1/HC3 FAIL withheld):** under `CCCI_SKIP_GENERIC=1`, `test_backup_captures_state` flaked
`'' == 'original'`. Root cause (valid): `lifecycle.exec_in_app` returned `proc.stdout` WITHOUT
checking returncode — when backup-bot cycles the app container, `docker exec` fails and the empty
stdout was silently returned as data; the generic pytest spawn (~1s) had been an accidental timing
buffer that opt-out removes. **Fix (no assertion weakened):** `exec_in_app` now polls — re-resolves
the container + re-execs until returncode==0 or a 90s timeout, then RAISES. A container-cycle race
now waits-and-succeeds; a genuine exec failure is loud, never masquerades as empty data. This makes
the backup/restore overlays robust to the post-op cycle independent of the generic timing buffer, so
opt-out is behavior-neutral.
- **HC1 hardening (my own findings from E2 e2e):**
- `head_ref` capture was racy (returned None under a concurrent run wiping the shared recipe dir),
and a chaos-redeploy of the SAME prev checkout falsely "moved" via the chaos label alone. Fixes:
`head_ref = ref or recipe_head_commit(recipe)` (prefer the explicit PR head sha $REF — robust, no
git race; production `!testme` always sets REF); store head_ref in op_state.
- `assert_upgraded` now, when head_ref is known, REQUIRES the deployed `chaos-version` commit to
MATCH head_ref — direct proof the PR-head code under test was deployed, and non-vacuous (a stale
prev-checkout chaos redeploy stamps prev's commit ≠ head_ref → FAIL). Falls back to the
version/image/chaos move check only when head_ref is unknown.
- **Coordination note:** my E2 manual custom-html e2e ran concurrently with the Adversary's E1
cold-verify — both share `/root/.abra/recipes/custom-html` + (at PR=0) the same run domain, so they
collided (explains my non-deterministic 1.10→1.11 vs 1.10→1.10 and the None head_ref). Manual ad-hoc
runs bypass Drone's capacity=1 queue. Going forward I serialize: don't run a recipe manually while a
gate is under Adversary verification; verify when `pgrep run_recipe_ci` is clear.
## 2026-05-28 — E2 head_ref plumbing bug (fixed)
- Debug print at main() head_ref capture showed `head_ref='09bf4d54...'` (correct hash), but
perform_upgrade printed `head_ref=None`. Root cause: my earlier perl regex to swap `target →
head_ref` in the four `run_lifecycle_tier` call sites only matched the SINGLE-LINE form; the
multi-line `upgrade` and `restore` calls (lint-wrapped) still passed `target` (which is the VERSION
env, None for !testme runs). So perform_upgrade got head_ref=None for upgrade tier → re-checkout
skipped → chaos deploy of whatever leftover checkout (prev tag from deploy_app) → vacuous prev→prev
chaos redeploy that "passed" via the chaos-label move fallback.
- Fixed: explicit Edit on the two multi-line calls so they now pass `head_ref` consistently
(`recipe`/`"upgrade"|"backup"|"restore"`, `repo_local`, `domain`, `meta`, `head_ref`, `op_state`).
grep confirms all 4 tier calls pass head_ref. compile OK.
- Net effect now: head_ref reaches perform_upgrade → recipe_checkout_ref(head_ref) restores PR-head
before chaos deploy → after.chaos == head_ref → assert_upgraded match succeeds non-vacuously.
## 2026-05-28 — E2/HC1 CLAIMED (chaos-version==head_ref proven on hedgedoc)
- Verified hedgedoc HC1 e2e (commit 7472561, log /root/ccci-1e-hc1-hed4.log):
```
== cc-ci run: recipe=hedgedoc ref=None pr=0 stages=['install', 'upgrade']
===== TIER: upgrade (generic=run, overlay=none) =====
upgrade→PR-head: head_ref=09bf4d54 chaos-version=09bf4d54 version=3.0.9+1.10.7→3.0.10+1.10.8
PASSED tests/_generic/test_upgrade.py::test_upgrade_reconverges
===== RUN SUMMARY =====
deploy-count = 1 (expect 1)
install : pass
upgrade : pass
```
head_ref (09bf4d54) == chaos-version (09bf4d54) — direct, deterministic, non-vacuous proof the
chaos deploy deployed the PR-head code under test. Plus a real version bump 3.0.9→3.0.10.
deploy-count=1; clean teardown.
- E3/HC4 docs work shipped in 7472561 (docs/testing.md + docs/enroll-recipe.md fully rewritten for
HC1/HC2/HC3: additive generic + opt-out + ops.py + chaos PR-head + repo-local allowlist).
- All three HC items implemented + Builder-verified. Awaiting Adversary cold-verify of HC1 and HC4.
## Background-task pgrep self-match note (lesson learned)
- My `until ! pgrep -f run_recipe_ci.py` polls **matched their own bash command line** (which
contains the literal string "run_recipe_ci.py" in the grep patterns), so they never exited and
piled up (saw 14 stuck loops). pkill'd them and switched to log-grep polling
(`for i; do grep -q "RUN SUMMARY" log && break; sleep 5; done`) which is self-match-free. Won't
repeat the pgrep -f anti-pattern.
## 2026-05-28 — E2/HC1 Adversary PASS; E3/HC4 CLAIMED (no-regression rationale)
- Adversary PASS on HC1 (REVIEW-1e): own custom-html cold-verify showed
`head_ref=8a026066 == chaos-version=8a026066`, version 1.10.0→1.11.0, deploy-count=1, additive
generic+overlay both ran post-op, clean teardown. Plus an adversarial monkey-patch probe that
swapped chaos-version against a fake head_ref proved `assert_upgraded` fails loudly — strictly
non-vacuous. No new finding. **HC1 ✓ HC2 ✓ HC3 ✓.**
- Claimed E3/HC4 with no-regression rationale: deploy-once + clean teardown exercised in every HC1
and HC3 Adversary run (deploy-count=1, no leftover); no assertion weakened (verified at HC3 PASS);
bridge/Drone/`!testme` trigger path unchanged from 1d (DG6 PASS holds); intentional behaviour
evolutions documented in DECISIONS. F1e-2 (concurrent recipe-fetch race) is pre-existing in 1d
(Adversary's own framing: "not blocking E1"; Drone MAX_TESTS=1 bounds practical impact) — not a 1e
regression, tracked for future. Awaiting Adversary cold-verify of HC4 to write ## DONE.
## 2026-05-28 — ## DONE (HC4 PASS, NO VETO; all four HC items cold-verified within 24 h)
- Adversary cold-verified HC4 (REVIEW-1e "Final E1/HC3 verdict ... PASS. NO VETO") via build **#155**
— own `!testme` on `recipe-maintainers/custom-html` PR#2, full production chain
bridge→Drone→runner. Highlights:
- D1 latency: 9 s comment→build trigger; dedup + auth clean; PR comment reflection ✅.
- HC1 live: `upgrade→PR-head: head_ref=db9a9502 chaos-version=db9a9502 version=1.10.0+1.28.0
→1.13.0+1.31.1`. Full-sha match — `$REF` flowed bridge→Drone→runner→re-checkout→chaos correctly.
- HC3 additive in production: every tier ran BOTH generic + cc-ci overlay; 8 assertions PASSED.
- HC2 default-deny under load: custom-html not on allowlist → cc-ci+generic only.
- DG4.1: deploy-count=1; teardown sacred (no leftover stack/volume).
- D6 secret-leak grep over the full build #155 log: 0/58 matches.
- F1e-1 fix verified under real load: `test_backup_captures_state PASSED`.
- F1e-2 confirmed pre-existing, not a 1e regression; bounded by `MAX_TESTS=1`; tracked for future.
- All four HC items Adversary cold-verified PASS within 24 h:
HC1 ✓ (7472561 + build #155) · HC2 ✓ (c7ae296) · HC3 ✓ (e75ec1b/6eabfdc) · HC4 ✓ (6397cd5 + #155).
- Wrote `## DONE` to STATUS-1e.md. Builder loop stops; next is Phase 2.