build #155 (own !testme on custom-html PR#2): head_ref=db9a9502 == chaos-version=db9a9502 (1.10.0→1.13.0), additive generic+overlay both ran (8 assertions PASS), HC2 default-deny held under load, deploy-count=1, teardown sacred, D6 secret-leak grep 0/58. F1e-1 CLOSED. F1e-2 pre-existing (not a 1e regression). The generic-harness corrections are landed; foundation ready for Phase 2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# JOURNAL — Phase 1e (generic-harness corrections)

Append-only Builder log: what I did + verifying command/output + next.

## 2026-05-28 — Phase 1e bootstrap + orientation
- Read the phase plan (`plan-phase1e-harness-corrections.md`) + plan.md §6.1/§7/§9. Phase 1d is DONE
  (STATUS-1d `## DONE`, DG1–DG8 Adversary PASS). Studied the harness: `runner/run_recipe_ci.py`
  (deploy-once orchestrator), `runner/harness/{discovery,generic,lifecycle,abra}.py`,
  `tests/conftest.py`, `tests/_generic/*`, the overlays
  (custom-html/keycloak/cryptpad/n8n/matrix-synapse), and `tests/unit/test_discovery.py`.
- Access re-verified: `ssh cc-ci 'hostname && whoami'` → `nixos` / `root`.
- Settled the three open decisions (HC1 deploy-count, HC2 allowlist, HC3 opt-out) in DECISIONS.md.
- Created STATUS-1e / BACKLOG-1e / JOURNAL-1e. Order of work: E0 (HC2) → E1 (HC3) → E2 (HC1) → E3.
- Key design notes:
  - HC3 op/assertion split: the orchestrator performs each mutating op once; generic + overlay both
    run as assertions afterwards. Op results (pre-upgrade identity, snapshot_id) are passed via the
    run-scoped `$CCCI_OP_STATE_FILE`. Overlays that seed pre-op state move that seeding into an
    optional `tests/<recipe>/ops.py` (`pre_<op>(domain, meta)`); overlay `test_<op>.py` files become
    assertion-only.
  - HC1: re-checkout the PR head (recorded as the recipe HEAD right after fetch), then
    `abra app deploy --chaos`; the moved-assertion accepts the chaos label as proof that the PR head
    was deployed; deploy-count counts only `deploy_app` (app new), not the in-place chaos redeploy.

Next: E0 (implement the HC2 allowlist + discovery gate + unit tests).

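The run-scoped hand-off in the HC3 design note can be pictured with a minimal sketch. It assumes the file behind `$CCCI_OP_STATE_FILE` holds a JSON object keyed by op name; the journal does not state the on-disk format, and `save_op_result` / `load_op_state` are hypothetical names rather than the harness API.

```python
import json
import os
import tempfile

STATE_ENV = "CCCI_OP_STATE_FILE"  # env var named in the journal


def load_op_state() -> dict:
    """Return the recorded op results, or {} when nothing has been written yet."""
    path = os.environ.get(STATE_ENV)
    if not path or not os.path.exists(path) or os.path.getsize(path) == 0:
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_op_result(op: str, result: dict) -> None:
    """Merge one op's result into the state file (JSON layout is an assumption)."""
    state = load_op_state()
    state[op] = result
    with open(os.environ[STATE_ENV], "w", encoding="utf-8") as fh:
        json.dump(state, fh)


# Demo with illustrative values: the orchestrator records results once,
# and assertion code (generic or overlay) reads them back afterwards.
fd, state_path = tempfile.mkstemp(suffix=".json")
os.close(fd)
os.environ[STATE_ENV] = state_path
save_op_result("upgrade", {"pre_identity": {"version": "1.10.0"}})
save_op_result("backup", {"snapshot_id": "example-snapshot"})
recorded = load_op_state()
os.remove(state_path)
```

Keeping the state in a file rather than in memory lets assertions that run in a separate pytest process see what the orchestrator did, which is presumably why the design uses an env-var path.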
## 2026-05-28 — E0 / HC2 repo-local trust gate (DONE, CLAIMED)
- Implemented the approval allowlist (`tests/repo-local-approved.txt`, default empty ⇒ default-deny)
  and a centralized gate in `runner/harness/discovery.py`: `approved_recipes()`/
  `repo_local_approved()`/`_gated()`. Split overlay resolution into `resolve_overlay_op`
  (repo-local > cc-ci, gated) and `generic_op` (the floor) for HC3; kept back-compat `resolve_op`
  (override). `custom_tests`, `install_steps`, and the new `pre_op_hook` all route repo-local
  through `_gated`. Allowlist path overridable via `CCCI_REPO_LOCAL_APPROVED_FILE`.
- Rewrote `tests/unit/test_discovery.py` for the gate (approved-vs-not for
  overlay/custom/hook/pre-op, the generic floor, and the default-empty-allowlist invariant).
- Verified on cc-ci (tar-piped working tree → /root/cc-ci; cc-ci has no rsync):
  `cc-ci-run -m pytest tests/unit -q` → **8 passed in 0.06s**.
  The cc-ci-authored hook is unaffected (DG5):
  `discovery.install_steps("custom-html-tiny", None)` →
  `('cc-ci', '.../tests/custom-html-tiny/install_steps.sh')`
- Committed d38a695, pushed. Gate E0/HC2 CLAIMED for Adversary.

Next: E1 (HC3): orchestrator op/assertion split + additive generic + opt-out + overlay migration.

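The default-deny behaviour of the HC2 gate in this entry can be sketched as follows. This is an illustration under assumptions, not the code in `runner/harness/discovery.py`: the allowlist is taken to be one recipe name per line with `#` comments, and the function signatures, `resolve_overlay`, and the demo paths are all hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def approved_recipes(allowlist: Path) -> frozenset:
    """Recipes approved for repo-local tests; a missing or empty file approves none."""
    if not allowlist.is_file():
        return frozenset()
    lines = allowlist.read_text(encoding="utf-8").splitlines()
    entries = (line.split("#", 1)[0].strip() for line in lines)
    return frozenset(e for e in entries if e)


def gated(recipe: str, repo_local: Optional[str], allowlist: Path) -> Optional[str]:
    """Pass a repo-local path through only when the recipe is on the allowlist."""
    if repo_local is not None and recipe in approved_recipes(allowlist):
        return repo_local
    return None


def resolve_overlay(
    recipe: str, repo_local: Optional[str], cc_ci: Optional[str], allowlist: Path
) -> Optional[Tuple[str, str]]:
    """Precedence repo-local > cc-ci, with the repo-local candidate gated first."""
    trusted = gated(recipe, repo_local, allowlist)
    if trusted is not None:
        return ("repo-local", trusted)
    if cc_ci is not None:
        return ("cc-ci", cc_ci)
    return None  # the caller would fall back to the generic floor


# Demo with made-up paths: the same lookup before and after approval.
fd, name = tempfile.mkstemp()
os.close(fd)
allowlist_file = Path(name)  # empty file => default-deny
denied = resolve_overlay("custom-html", "repo/test_upgrade.py", "cc-ci/test_upgrade.py", allowlist_file)
allowlist_file.write_text("# approved recipes\ncustom-html\n", encoding="utf-8")
allowed = resolve_overlay("custom-html", "repo/test_upgrade.py", "cc-ci/test_upgrade.py", allowlist_file)
allowlist_file.unlink()
```

Routing every repo-local lookup through one `gated` helper means a new lookup kind cannot forget the check, which matches the entry's note that `custom_tests`, `install_steps`, and `pre_op_hook` all share the gate.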
## 2026-05-28 — E1 / HC3 additive generic + op/assertion split (implemented + e2e verified)
- **Harness core:** `lifecycle.deployed_identity` now returns `{version,image,chaos}` (chaos label
  captured, ready for HC1). `generic.py` is split into op primitives
  `perform_upgrade`/`perform_backup`/`perform_restore` (orchestrator-only, no asserts) and
  assertions `assert_upgraded` (serving + MOVED via version/image/chaos), `assert_backup_artifact`,
  and `assert_restore_healthy`, all reading the run-scoped `op_state()` (`$CCCI_OP_STATE_FILE`).
- **Orchestrator** (`run_recipe_ci.py`): new `run_lifecycle_tier` = pre-op seed hook (`ops.py`
  `pre_<op>`, imported in-process w/ the recipe dir on sys.path) → perform the op ONCE → run the
  generic assertion (unless `_skip_generic`) + the overlay assertion, both against the shared
  post-op deployment. Opt-out: `CCCI_SKIP_GENERIC` / `CCCI_SKIP_GENERIC_<OP>` /
  `recipe_meta.SKIP_GENERIC`. `_scrub` factored out so op-failure messages are redacted too. Op
  primitives never call `deploy_app` ⇒ deploy-count stays 1.
- **Tiers/overlays migrated to assertion-only:** generic `_generic/test_{upgrade,backup,restore}.py`
  and all 6 recipes' `test_{upgrade,backup,restore}.py`. Pre-op seeding (data-continuity markers +
  the backup→restore mutation) moved to per-recipe `ops.py`
  (`pre_upgrade`/`pre_backup`/`pre_restore`). Install overlays unchanged (no op). No assertion
  weakened: every data-survival/return check kept.
- **Verified on cc-ci:**
  - `cc-ci-run -m pytest tests/unit -q` → **8 passed**; `nix develop .#lint` → **lint: PASS** (ruff
    format + check clean).
  - Full e2e `RECIPE=custom-html STAGES=install,upgrade,backup,restore,custom` → every tier ran BOTH
    generic AND overlay (additive):
    - install: generic `test_serving` + overlay `test_serving_and_content`
    - upgrade: `pre_upgrade` seed → generic `test_upgrade_reconverges` + overlay
      `test_upgrade_preserves_data`
    - backup: `pre_backup` → generic `test_backup_artifact` + overlay `test_backup_captures_state`
    - restore: `pre_restore` → generic `test_restore_healthy` + overlay `test_restore_returns_state`

    **RUN SUMMARY: deploy-count=1, install/upgrade/backup/restore=pass, custom=skip; no leftover
    custom-html stack (clean teardown).** Log: /root/ccci-1e-customhtml.log on cc-ci.
  - Opt-out run (`CCCI_SKIP_GENERIC=1`) in flight to show generic skipped + overlay still runs.

Next: confirm opt-out result, claim E1/HC3 gate, then E2 (HC1 chaos-to-PR-head).

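The tier ordering described in this entry (seed hook, op performed once, then generic and overlay assertions on the same result) can be sketched as below. `run_tier` and its parameters are stand-ins for illustration only; the real `run_lifecycle_tier` signature is not shown in the journal.

```python
from typing import Callable, List, Optional


def run_tier(
    op: str,
    perform_op: Callable[[], dict],
    generic_assert: Callable[[dict], None],
    overlay_assert: Optional[Callable[[dict], None]] = None,
    pre_op_hook: Optional[Callable[[], None]] = None,
    skip_generic: bool = False,
) -> List[str]:
    """Seed, perform the op exactly once, then run assertions on the shared result."""
    trace: List[str] = []
    if pre_op_hook is not None:
        pre_op_hook()
        trace.append(f"pre_{op}")
    state = perform_op()  # the only mutating step; it makes no assertions itself
    trace.append(f"perform_{op}")
    if not skip_generic:
        generic_assert(state)
        trace.append("generic")
    if overlay_assert is not None:
        overlay_assert(state)
        trace.append("overlay")
    return trace


# Demo with stand-in callables; op_calls counts how often the op really ran.
op_calls: List[str] = []


def fake_backup() -> dict:
    op_calls.append("backup")
    return {"snapshot_id": "example-snapshot"}  # illustrative value


def check_snapshot(state: dict) -> None:
    assert state["snapshot_id"], "backup produced no snapshot id"


default_trace = run_tier("backup", fake_backup, check_snapshot, check_snapshot, lambda: None)
optout_trace = run_tier("backup", fake_backup, check_snapshot, check_snapshot, skip_generic=True)
```

Because both assertions receive the result of a single `perform_op()` call, adding the generic floor on top of an overlay costs no extra deploy or op, which is the property the deploy-count=1 check relies on.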
## 2026-05-28 — E1 opt-out verified; gate CLAIMED
- Opt-out e2e `RECIPE=custom-html STAGES=install,upgrade,backup,restore CCCI_SKIP_GENERIC=1`:
  every tier logged `generic=skip, overlay=cc-ci`; **0** `_generic/test_*` files ran; only the 4
  cc-ci overlays ran; **deploy-count=1**; install/upgrade/backup/restore=pass; clean teardown (no
  leftover custom-html stack). Log: /root/ccci-1e-optout.log.
- HC3 proven both ways: default = generic + overlay additive on one deployment (op once); opt-out =
  generic floor skipped, overlay still runs. Gate E1/HC3 CLAIMED for Adversary.

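How the three opt-out switches might combine is sketched below. The variable names come from the journal; the accepted truthy spellings, the simple OR between the switches, and treating `recipe_meta.SKIP_GENERIC` as a plain boolean are assumptions, and `tier_banner` is a hypothetical helper that only mirrors the logged `generic=..., overlay=...` fragment.

```python
from typing import Mapping

_TRUTHY = {"1", "true", "yes", "on"}  # accepted spellings are an assumption


def skip_generic(op: str, env: Mapping[str, str], recipe_skip: bool = False) -> bool:
    """True when any of the three opt-out switches asks to skip the generic floor."""
    global_flag = env.get("CCCI_SKIP_GENERIC", "").strip().lower() in _TRUTHY
    per_op_flag = env.get(f"CCCI_SKIP_GENERIC_{op.upper()}", "").strip().lower() in _TRUTHY
    return global_flag or per_op_flag or bool(recipe_skip)


def tier_banner(op: str, env: Mapping[str, str], overlay_source: str = "cc-ci") -> str:
    """Mirror the journal's 'generic=..., overlay=...' tier log fragment."""
    generic = "skip" if skip_generic(op, env) else "run"
    return f"generic={generic}, overlay={overlay_source}"
```

For example, `tier_banner("backup", {"CCCI_SKIP_GENERIC": "1"})` yields the fragment logged in the opt-out run, while a per-op variable such as `CCCI_SKIP_GENERIC_UPGRADE` affects only its own tier.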
## 2026-05-28 — Adversary F1e-1 (HC3 opt-out race) + HC1 hardening
- **F1e-1 (E1/HC3 FAIL withheld):** under `CCCI_SKIP_GENERIC=1`, `test_backup_captures_state` flaked
  with `'' == 'original'`. Root cause (valid): `lifecycle.exec_in_app` returned `proc.stdout`
  WITHOUT checking the returncode. When backup-bot cycles the app container, `docker exec` fails
  and the empty stdout was silently returned as data; the generic pytest spawn (~1s) had been an
  accidental timing buffer, which opt-out removes. **Fix (no assertion weakened):** `exec_in_app`
  now polls: it re-resolves the container and re-execs until returncode==0, and RAISES if that has
  not happened within a 90s timeout. A container-cycle race now waits and succeeds; a genuine exec
  failure is loud and never masquerades as empty data. This makes the backup/restore overlays
  robust to the post-op cycle independent of the generic timing buffer, so opt-out is
  behavior-neutral.
- **HC1 hardening (my own findings from E2 e2e):**
  - `head_ref` capture was racy (it returned None when a concurrent run wiped the shared recipe
    dir), and a chaos redeploy of the SAME prev checkout falsely "moved" via the chaos label alone.
    Fix: `head_ref = ref or recipe_head_commit(recipe)`, which prefers the explicit PR-head sha
    `$REF` (robust, no git race; production `!testme` always sets REF); head_ref is stored in
    op_state.
  - `assert_upgraded` now, when head_ref is known, REQUIRES the deployed `chaos-version` commit to
    MATCH head_ref. That is direct proof that the PR-head code under test was deployed, and it is
    non-vacuous (a stale prev-checkout chaos redeploy stamps prev's commit ≠ head_ref → FAIL). It
    falls back to the version/image/chaos move check only when head_ref is unknown.
- **Coordination note:** my E2 manual custom-html e2e ran concurrently with the Adversary's E1
  cold-verify. Both share `/root/.abra/recipes/custom-html` and (at PR=0) the same run domain, so
  they collided, which explains my non-deterministic 1.10→1.11 vs 1.10→1.10 results and the None
  head_ref. Manual ad-hoc runs bypass Drone's capacity=1 queue. Going forward I serialize: I don't
  run a recipe manually while a gate is under Adversary verification, and I verify only when
  `pgrep run_recipe_ci` is clear.

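The F1e-1 fix pattern (poll, re-resolve the container, and raise instead of returning empty output) can be sketched like this. It is not the actual `lifecycle.exec_in_app`: `exec_with_retry`, the injectable `run`/`sleep` parameters, the 1s interval, and the `/marker` path are assumptions made so the sketch can be exercised without Docker.

```python
import subprocess
import time
from types import SimpleNamespace
from typing import Callable, Optional, Sequence


def exec_with_retry(
    resolve_container: Callable[[], Optional[str]],
    cmd: Sequence[str],
    timeout: float = 90.0,
    interval: float = 1.0,
    run=subprocess.run,
    sleep=time.sleep,
) -> str:
    """Return stdout of `docker exec`, retrying until it exits 0 or the timeout passes."""
    deadline = time.monotonic() + timeout
    last_error = "container not found"
    while True:
        container = resolve_container()  # re-resolve: the ID changes when the app cycles
        if container:
            proc = run(["docker", "exec", container, *cmd], capture_output=True, text=True)
            if proc.returncode == 0:
                return proc.stdout
            last_error = f"exit {proc.returncode}: {proc.stderr.strip()}"
        if time.monotonic() >= deadline:
            # Fail loudly; never hand back empty stdout as if it were data.
            raise RuntimeError(f"exec {list(cmd)} failed after {timeout}s ({last_error})")
        sleep(interval)


# Demo with a fake runner: two failures while the container cycles, then success.
calls = []


def fake_run(argv, capture_output, text):
    calls.append(argv)
    ok = len(calls) >= 3
    return SimpleNamespace(
        returncode=0 if ok else 1,
        stdout="original\n" if ok else "",
        stderr="" if ok else "no such container",
    )


recovered = exec_with_retry(
    lambda: "app-container", ["cat", "/marker"], run=fake_run, sleep=lambda s: None
)

try:
    exec_with_retry(lambda: None, ["cat", "/marker"], timeout=0.0, sleep=lambda s: None)
    raised = False
except RuntimeError:
    raised = True
```

The key difference from the buggy behaviour is that a non-zero exit never reaches the caller as a string, so an assertion like `'' == 'original'` can no longer be produced by a failed exec.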
## 2026-05-28 — E2 head_ref plumbing bug (fixed)
- A debug print at the `main()` head_ref capture showed `head_ref='09bf4d54...'` (correct hash), but
  `perform_upgrade` printed `head_ref=None`. Root cause: my earlier perl regex to swap `target` →
  `head_ref` in the four `run_lifecycle_tier` call sites only matched the SINGLE-LINE form; the
  multi-line `upgrade` and `restore` calls (lint-wrapped) still passed `target` (which is the
  VERSION env, None for `!testme` runs). So `perform_upgrade` got head_ref=None for the upgrade
  tier → re-checkout skipped → chaos deploy of whatever checkout was left over (the prev tag from
  `deploy_app`) → a vacuous prev→prev chaos redeploy that "passed" via the chaos-label move
  fallback.
- Fixed: explicit Edit on the two multi-line calls so they now pass `head_ref` consistently
  (`recipe`/`"upgrade"|"backup"|"restore"`, `repo_local`, `domain`, `meta`, `head_ref`, `op_state`).
  grep confirms all 4 tier calls pass head_ref. Compile OK.
- Net effect now: head_ref reaches `perform_upgrade` → `recipe_checkout_ref(head_ref)` restores the
  PR head before the chaos deploy → `after.chaos == head_ref` → the `assert_upgraded` match
  succeeds non-vacuously.

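The single-line-only failure mode behind this bug is easy to reproduce. The sketch below uses Python's `re` rather than the perl one-liner from the entry, and the two call sites in `SOURCE` are a made-up sample shaped after the argument list quoted above, not the real file contents.

```python
import re

# Made-up sample: one call on a single line, one lint-wrapped across lines.
SOURCE = '''\
run_lifecycle_tier(recipe, "backup", repo_local, domain, meta, target, op_state)
run_lifecycle_tier(
    recipe,
    "upgrade",
    repo_local,
    domain,
    meta,
    target,
    op_state,
)
'''

# '.' never matches a newline by default, so this only rewrites single-line calls.
single_line = re.compile(r"(run_lifecycle_tier\(.*?)\btarget\b")
# A negated class does match newlines, so this also reaches wrapped calls.
# Limitation: it stops at the first ')' and would miss a nested call before `target`.
any_layout = re.compile(r"(run_lifecycle_tier\([^)]*?)\btarget\b")

partly_fixed = single_line.sub(r"\1head_ref", SOURCE)
fully_fixed = any_layout.sub(r"\1head_ref", SOURCE)


def stale_calls(text: str) -> int:
    """Count tier calls that still pass `target`, whatever their layout."""
    return len(any_layout.findall(text))
```

Here `stale_calls(partly_fixed)` is still 1 after the single-line substitution, which is the kind of post-rewrite check (the entry's grep over all 4 tier calls) that catches the wrapped call sites.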
## 2026-05-28 — E2/HC1 CLAIMED (chaos-version==head_ref proven on hedgedoc)
- Verified hedgedoc HC1 e2e (commit 7472561, log /root/ccci-1e-hc1-hed4.log):
  ```
  == cc-ci run: recipe=hedgedoc ref=None pr=0 stages=['install', 'upgrade']
  ===== TIER: upgrade (generic=run, overlay=none) =====
  upgrade→PR-head: head_ref=09bf4d54 chaos-version=09bf4d54 version=3.0.9+1.10.7→3.0.10+1.10.8
  PASSED tests/_generic/test_upgrade.py::test_upgrade_reconverges
  ===== RUN SUMMARY =====
  deploy-count = 1 (expect 1)
  install : pass
  upgrade : pass
  ```
  head_ref (09bf4d54) == chaos-version (09bf4d54): direct, deterministic, non-vacuous proof that the
  chaos deploy deployed the PR-head code under test, plus a real version bump 3.0.9→3.0.10.
  deploy-count=1; clean teardown.
- E3/HC4 docs work shipped in 7472561 (docs/testing.md + docs/enroll-recipe.md fully rewritten for
  HC1/HC2/HC3: additive generic + opt-out + ops.py + chaos PR-head + repo-local allowlist).
- All three HC items implemented + Builder-verified. Awaiting Adversary cold-verify of HC1 and HC4.

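The strict check proven in this entry can be sketched as follows. It is a stand-in, not the harness `assert_upgraded`: it omits the serving check, assumes head_ref and the chaos-version are compared in the same abbreviated form, and the `passes` helper plus the `deadbeef` commit are invented for the demo. The other values are taken from the hedgedoc log above; `image` is left as None because the log does not show it.

```python
from typing import Optional


def assert_upgraded(before: dict, after: dict, head_ref: Optional[str] = None) -> None:
    """Raise AssertionError unless the deployment provably moved to the code under test."""
    if head_ref:
        # Strict path: the deployed chaos-version must equal the PR-head commit.
        if after.get("chaos") != head_ref:
            raise AssertionError(
                f"chaos-version {after.get('chaos')!r} != head_ref {head_ref!r}: PR head not deployed"
            )
        return
    # Fallback only when head_ref is unknown: some identity field must have changed.
    if all(before.get(k) == after.get(k) for k in ("version", "image", "chaos")):
        raise AssertionError("deployment identity did not move")


def passes(before: dict, after: dict, head_ref: Optional[str] = None) -> bool:
    """Demo helper: True when assert_upgraded accepts the before/after pair."""
    try:
        assert_upgraded(before, after, head_ref)
        return True
    except AssertionError:
        return False


prev = {"version": "3.0.9+1.10.7", "image": None, "chaos": None}
good = {"version": "3.0.10+1.10.8", "image": None, "chaos": "09bf4d54"}
stale = {"version": "3.0.9+1.10.7", "image": None, "chaos": "deadbeef"}  # prev redeployed
```

With head_ref known, `passes(prev, stale, "09bf4d54")` is False, whereas the fallback `passes(prev, stale)` is True because the chaos label alone changed. That contrast is why the strict path is non-vacuous and the fallback is kept only for runs without a head_ref.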
## Background-task pgrep self-match note (lesson learned)
- My `until ! pgrep -f run_recipe_ci.py` polls **matched their own bash command line**: `pgrep -f`
  matches against full command lines, and the polling shell's own command line contains the
  literal string "run_recipe_ci.py". So they never exited and piled up (saw 14 stuck loops).
  pkill'd them and switched to log-grep polling
  (`for i; do grep -q "RUN SUMMARY" log && break; sleep 5; done`), which is self-match-free. Won't
  repeat the `pgrep -f` anti-pattern.

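The log-polling alternative from this note, written as a bounded Python helper. This is a sketch rather than what was actually run (the journal used a shell loop); `wait_for_marker`, the 600s default timeout, and the injectable `sleep` are assumptions.

```python
import os
import tempfile
import time


def wait_for_marker(
    log_path: str,
    marker: str = "RUN SUMMARY",
    timeout: float = 600.0,
    interval: float = 5.0,
    sleep=time.sleep,
) -> bool:
    """Poll a log file until `marker` appears; return False if the timeout passes first."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(log_path, encoding="utf-8", errors="replace") as fh:
                if any(marker in line for line in fh):
                    return True
        except FileNotFoundError:
            pass  # the run may not have created its log yet
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Demo on a temporary file standing in for a run log.
fd, demo_log = tempfile.mkstemp(suffix=".log")
with os.fdopen(fd, "w", encoding="utf-8") as fh:
    fh.write("===== TIER: upgrade =====\n===== RUN SUMMARY =====\n")
finished = wait_for_marker(demo_log, sleep=lambda s: None)
os.remove(demo_log)
still_waiting = wait_for_marker(demo_log, timeout=0.0, sleep=lambda s: None)
```

Polling for an artifact the run writes cannot match the poller itself, and the timeout means a crashed run ends the wait instead of leaving a stuck loop behind.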
## 2026-05-28 — E2/HC1 Adversary PASS; E3/HC4 CLAIMED (no-regression rationale)
- Adversary PASS on HC1 (REVIEW-1e): own custom-html cold-verify showed
  `head_ref=8a026066 == chaos-version=8a026066`, version 1.10.0→1.11.0, deploy-count=1, additive
  generic + overlay both ran post-op, clean teardown. In addition, an adversarial monkey-patch
  probe that swapped chaos-version against a fake head_ref proved `assert_upgraded` fails loudly,
  i.e. the check is strictly non-vacuous. No new finding. **HC1 ✓ HC2 ✓ HC3 ✓.**
- Claimed E3/HC4 with a no-regression rationale: deploy-once + clean teardown were exercised in
  every HC1 and HC3 Adversary run (deploy-count=1, no leftover); no assertion weakened (verified at
  HC3 PASS); the bridge/Drone/`!testme` trigger path is unchanged from 1d (DG6 PASS holds);
  intentional behaviour evolutions are documented in DECISIONS. F1e-2 (concurrent recipe-fetch
  race) is pre-existing in 1d (Adversary's own framing: "not blocking E1"; Drone MAX_TESTS=1
  bounds practical impact), so it is not a 1e regression and is tracked for future. Awaiting
  Adversary cold-verify of HC4 to write `## DONE`.

## 2026-05-28 — ## DONE (HC4 PASS, NO VETO; all four HC items cold-verified within 24 h)
- Adversary cold-verified HC4 (REVIEW-1e "Final E1/HC3 verdict ... PASS. NO VETO") via build
  **#155**: own `!testme` on `recipe-maintainers/custom-html` PR#2, full production chain
  bridge→Drone→runner. Highlights:
  - D1 latency: 9 s comment→build trigger; dedup + auth clean; PR comment reflection ✅.
  - HC1 live:
    `upgrade→PR-head: head_ref=db9a9502 chaos-version=db9a9502 version=1.10.0+1.28.0→1.13.0+1.31.1`.
    Full-sha match: `$REF` flowed bridge→Drone→runner→re-checkout→chaos correctly.
  - HC3 additive in production: every tier ran BOTH generic + cc-ci overlay; 8 assertions PASSED.
  - HC2 default-deny under load: custom-html not on allowlist → cc-ci + generic only.
  - DG4.1: deploy-count=1; teardown sacred (no leftover stack/volume).
  - D6 secret-leak grep over the full build #155 log: 0/58 matches.
  - F1e-1 fix verified under real load: `test_backup_captures_state PASSED`.
  - F1e-2 confirmed pre-existing, not a 1e regression; bounded by `MAX_TESTS=1`; tracked for future.
- All four HC items Adversary cold-verified PASS within 24 h:
  HC1 ✓ (7472561 + build #155) · HC2 ✓ (c7ae296) · HC3 ✓ (e75ec1b/6eabfdc) · HC4 ✓ (6397cd5 + #155).
- Wrote `## DONE` to STATUS-1e.md. Builder loop stops; next is Phase 2.