cc-ci/machine-docs/JOURNAL-1e.md
autonomic-bot 6eabfdc0fb fix(1e): F1e-1 exec_in_app race + HC1 head_ref/move hardening
F1e-1 (Adversary): exec_in_app silently returned '' on a failed docker exec, flipping a healthy
recipe RED under opt-out (post-backup container cycle, no readiness buffer). Now polls (re-resolve
container + re-exec) until rc==0 or 90s, then RAISES — never masks an exec failure as empty data.
No assertion weakened. Verified: opt-out install,backup,restore on custom-html now PASS.

HC1: head_ref = ref or recipe_head_commit (prefer explicit PR head sha $REF — robust, no git race;
production !testme always sets REF). assert_upgraded, when head_ref known, REQUIRES the deployed
chaos-version commit to MATCH head_ref (direct + non-vacuous proof the PR-head code was deployed; a
stale prev-checkout chaos redeploy fails). Falls back to version/image/chaos move check otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:41:42 +01:00

# JOURNAL — Phase 1e (generic-harness corrections)
Append-only Builder log: what I did + verifying command/output + next.
## 2026-05-28 — Phase 1e bootstrap + orientation
- Read the phase plan (`plan-phase1e-harness-corrections.md`) + plan.md §6.1/§7/§9. Phase 1d is DONE
(STATUS-1d ## DONE, DG1–DG8 Adversary PASS). Studied the harness: `runner/run_recipe_ci.py`
(deploy-once orchestrator), `runner/harness/{discovery,generic,lifecycle,abra}.py`, `tests/conftest.py`,
`tests/_generic/*`, the overlays (custom-html/keycloak/cryptpad/n8n/matrix-synapse), and
`tests/unit/test_discovery.py`.
- Access re-verified: `ssh cc-ci 'hostname && whoami'` → `nixos` / `root`.
- Settled the three open decisions (HC1 deploy-count, HC2 allowlist, HC3 opt-out) in DECISIONS.md.
- Created STATUS-1e / BACKLOG-1e / JOURNAL-1e. Order of work: E0 (HC2) → E1 (HC3) → E2 (HC1) → E3.
- Key design notes:
  - HC3 op/assertion split: orchestrator performs each mutating op once; generic + overlay both run as
    assertions after. Op results (pre-upgrade identity, snapshot_id) passed via run-scoped
    `$CCCI_OP_STATE_FILE`. Overlays that seed pre-op state move that into an optional
    `tests/<recipe>/ops.py` (`pre_<op>(domain, meta)`); overlay `test_<op>.py` become assertion-only.
  - HC1: re-checkout PR head (recorded as recipe HEAD right after fetch) then `abra app deploy --chaos`;
    moved-assertion accepts the chaos label as proof PR-head deployed; deploy-count counts only
    `deploy_app` (app new), not the in-place chaos redeploy.
Next: E0 — implement the HC2 allowlist + discovery gate + unit tests.
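
The run-scoped state file in the HC3 design note above can be pictured roughly as follows. This is a minimal sketch, not the harness code: the journal names only `op_state()` and `$CCCI_OP_STATE_FILE`, so the JSON format, the `record_op_state` helper, and the write-then-rename step are all assumptions made for illustration.

```python
import json
import os

STATE_ENV = "CCCI_OP_STATE_FILE"  # env var named in the design note


def op_state() -> dict:
    """Return the recorded op results, or {} if nothing has been written yet."""
    path = os.environ.get(STATE_ENV)
    if not path or not os.path.isfile(path) or os.path.getsize(path) == 0:
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def record_op_state(**updates) -> dict:
    """Merge `updates` into the state file (hypothetical helper, assumed JSON)."""
    path = os.environ[STATE_ENV]
    state = op_state()
    state.update(updates)
    tmp = f"{path}.tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)  # atomic swap: a reader never sees a partial file
    return state
```

Under these assumptions the orchestrator would call something like `record_op_state(pre_upgrade=identity)` before performing the op, and the assertion-only tests would read it back with `op_state()["pre_upgrade"]`.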
## 2026-05-28 — E0 / HC2 repo-local trust gate (DONE, CLAIMED)
- Implemented the approval allowlist (`tests/repo-local-approved.txt`, default empty ⇒ default-deny)
+ centralized gate in `runner/harness/discovery.py`: `approved_recipes()`/`repo_local_approved()`/
`_gated()`. Split overlay resolution into `resolve_overlay_op` (repo-local>cc-ci, gated) + `generic_op`
(the floor) for HC3; kept back-compat `resolve_op` (override). `custom_tests`/`install_steps`/new
`pre_op_hook` all route repo-local through `_gated`. Allowlist path overridable via
`CCCI_REPO_LOCAL_APPROVED_FILE`.
- Rewrote `tests/unit/test_discovery.py` for the gate (approved-vs-not for overlay/custom/hook/pre-op +
the generic floor + default-empty-allowlist invariant).
- Verified on cc-ci (tar-piped working tree → /root/cc-ci; cc-ci has no rsync):
  `cc-ci-run -m pytest tests/unit -q` → **8 passed in 0.06s**.
  And the cc-ci-authored hook is unaffected (DG5):
  `discovery.install_steps("custom-html-tiny", None)` → `('cc-ci', '.../tests/custom-html-tiny/install_steps.sh')`
- Committed d38a695, pushed. Gate E0/HC2 CLAIMED for Adversary.
Next: E1 (HC3) — orchestrator op/assertion split + additive generic + opt-out + overlay migration.
## 2026-05-28 — E1 / HC3 additive generic + op/assertion split (implemented + e2e verified)
- **Harness core:** `lifecycle.deployed_identity` now returns `{version,image,chaos}` (chaos label
captured, ready for HC1). `generic.py` split: op primitives `perform_upgrade/perform_backup/
perform_restore` (orchestrator-only, no asserts) + assertions `assert_upgraded` (serving + MOVED via
version/image/chaos), `assert_backup_artifact`, `assert_restore_healthy`, all reading the run-scoped
`op_state()` (`$CCCI_OP_STATE_FILE`).
- **Orchestrator** (`run_recipe_ci.py`): new `run_lifecycle_tier` = pre-op seed hook (`ops.py
pre_<op>`, imported in-process w/ recipe dir on sys.path) → perform the op ONCE → run generic
assertion (unless `_skip_generic`) + overlay assertion, both against the shared post-op deployment.
Opt-out: `CCCI_SKIP_GENERIC` / `CCCI_SKIP_GENERIC_<OP>` / `recipe_meta.SKIP_GENERIC`. `_scrub`
factored so op-failure messages are redacted too. Op primitives never call `deploy_app` ⇒
deploy-count stays 1.
- **Tiers/overlays migrated to assertion-only:** generic `_generic/test_{upgrade,backup,restore}.py`;
all 6 recipes' `test_{upgrade,backup,restore}.py`. Pre-op seeding (data-continuity markers + the
backup→restore mutation) moved to per-recipe `ops.py` (`pre_upgrade/pre_backup/pre_restore`).
install overlays unchanged (no op). No assertion weakened — every data-survival/return check kept.
- **Verified on cc-ci:**
  - `cc-ci-run -m pytest tests/unit -q` → **8 passed**; `nix develop .#lint` → **lint: PASS** (ruff
    format + check clean).
  - Full e2e `RECIPE=custom-html STAGES=install,upgrade,backup,restore,custom` → every tier ran BOTH
    generic AND overlay (additive): install(generic test_serving + overlay test_serving_and_content),
    upgrade(pre_upgrade seed → generic test_upgrade_reconverges + overlay test_upgrade_preserves_data),
    backup(pre_backup → generic test_backup_artifact + overlay test_backup_captures_state),
    restore(pre_restore → generic test_restore_healthy + overlay test_restore_returns_state).
    **RUN SUMMARY: deploy-count=1, install/upgrade/backup/restore=pass, custom=skip; no leftover
    custom-html stack (clean teardown).** Log: /root/ccci-1e-customhtml.log on cc-ci.
  - Opt-out run (`CCCI_SKIP_GENERIC=1`) in flight to show generic skipped + overlay still runs.
Next: confirm opt-out result, claim E1/HC3 gate, then E2 (HC1 chaos-to-PR-head).
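
The tier ordering described in this entry (seed hook, op once, then generic and overlay assertions against the same deployment) can be sketched as below. Only the name `run_lifecycle_tier` comes from the journal; the callable-injection signature and the returned step list are assumptions chosen to keep the sketch self-contained, and the real orchestrator presumably spawns pytest rather than calling functions.

```python
from __future__ import annotations

from typing import Callable, Optional

Step = Callable[[], None]


def run_lifecycle_tier(
    op: str,
    perform_op: Step,
    generic_assert: Step,
    overlay_assert: Optional[Step] = None,
    pre_op_hook: Optional[Step] = None,
    skip_generic: bool = False,
) -> list[str]:
    """Run one tier and return the ordered names of the steps that executed."""
    steps: list[str] = []
    if pre_op_hook is not None:  # optional seeding from ops.py pre_<op>
        pre_op_hook()
        steps.append(f"pre_{op}")
    perform_op()  # the mutating op runs exactly once per tier
    steps.append(f"perform_{op}")
    if not skip_generic:  # generic floor, unless opted out
        generic_assert()
        steps.append(f"generic_{op}")
    if overlay_assert is not None:  # overlay is additive, never a replacement
        overlay_assert()
        steps.append(f"overlay_{op}")
    return steps
```

Because both assertions run after the single `perform_op()` call, neither can trigger a second op or a second deploy, which is what keeps deploy-count at 1.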
## 2026-05-28 — E1 opt-out verified; gate CLAIMED
- Opt-out e2e `RECIPE=custom-html STAGES=install,upgrade,backup,restore CCCI_SKIP_GENERIC=1`:
every tier logged `generic=skip, overlay=cc-ci`; **0** `_generic/test_*` files ran; only the 4
cc-ci overlays ran; **deploy-count=1**; install/upgrade/backup/restore=pass; clean teardown (no
leftover custom-html stack). Log: /root/ccci-1e-optout.log.
- HC3 proven both ways: default = generic+overlay additive on one deployment (op once); opt-out =
generic floor skipped, overlay still runs. Gate E1/HC3 CLAIMED for Adversary.
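
For reference, the three opt-out switches verified above might resolve along these lines. This is a sketch under stated assumptions: the journal lists the switch names but not how values are parsed, so the accepted truthy strings, the treatment of `recipe_meta.SKIP_GENERIC` as a plain boolean, and the helper name `skip_generic_for` are all hypothetical.

```python
import os
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}  # assumed set of accepted values


def skip_generic_for(
    op: str,
    recipe_meta_skip: bool = False,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """True if the generic assertion for `op` should be skipped."""
    env = os.environ if env is None else env
    if recipe_meta_skip:  # recipe_meta.SKIP_GENERIC, assumed to be a bool
        return True
    for key in ("CCCI_SKIP_GENERIC", f"CCCI_SKIP_GENERIC_{op.upper()}"):
        if env.get(key, "").strip().lower() in _TRUTHY:
            return True
    return False
```

With this shape, `CCCI_SKIP_GENERIC=1` silences the generic floor for every tier, while `CCCI_SKIP_GENERIC_BACKUP=1` affects only the backup tier.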
## 2026-05-28 — Adversary F1e-1 (HC3 opt-out race) + HC1 hardening
- **F1e-1 (E1/HC3 FAIL withheld):** under `CCCI_SKIP_GENERIC=1`, `test_backup_captures_state` flaked
`'' == 'original'`. Root cause (valid): `lifecycle.exec_in_app` returned `proc.stdout` WITHOUT
checking returncode — when backup-bot cycles the app container, `docker exec` fails and the empty
stdout was silently returned as data; the generic pytest spawn (~1s) had been an accidental timing
buffer that opt-out removes. **Fix (no assertion weakened):** `exec_in_app` now polls — re-resolves
the container + re-execs until returncode==0 or a 90s timeout, then RAISES. A container-cycle race
now waits-and-succeeds; a genuine exec failure is loud, never masquerades as empty data. This makes
the backup/restore overlays robust to the post-op cycle independent of the generic timing buffer, so
opt-out is behavior-neutral.
- **HC1 hardening (my own findings from E2 e2e):**
  - `head_ref` capture was racy (returned None under a concurrent run wiping the shared recipe dir),
    and a chaos-redeploy of the SAME prev checkout falsely "moved" via the chaos label alone. Fixes:
    `head_ref = ref or recipe_head_commit(recipe)` (prefer the explicit PR head sha $REF — robust, no
    git race; production `!testme` always sets REF); store head_ref in op_state.
  - `assert_upgraded` now, when head_ref is known, REQUIRES the deployed `chaos-version` commit to
    MATCH head_ref — direct proof the PR-head code under test was deployed, and non-vacuous (a stale
    prev-checkout chaos redeploy stamps prev's commit ≠ head_ref → FAIL). Falls back to the
    version/image/chaos move check only when head_ref is unknown.
- **Coordination note:** my E2 manual custom-html e2e ran concurrently with the Adversary's E1
cold-verify — both share `/root/.abra/recipes/custom-html` + (at PR=0) the same run domain, so they
collided (explains my non-deterministic 1.10→1.11 vs 1.10→1.10 and the None head_ref). Manual ad-hoc
runs bypass Drone's capacity=1 queue. Going forward I serialize: don't run a recipe manually while a
gate is under Adversary verification; verify when `pgrep run_recipe_ci` is clear.
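
The two fixes in this entry can be sketched as follows. This is a hedged illustration rather than the `lifecycle.py` / `generic.py` code: `exec_with_retry`, `ExecFailed`, the injectable `run` parameter, and `chaos_matches_head` are hypothetical names, the 2-second poll interval is an assumption (the journal gives only the 90s timeout), and accepting an abbreviated sha by prefix match is an assumption about how the `chaos-version` label is stamped.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


class ExecFailed(RuntimeError):
    """The command never exited 0 before the deadline."""


def _docker_exec(container: str, cmd: Sequence[str]) -> subprocess.CompletedProcess:
    return subprocess.run(
        ["docker", "exec", container, *cmd], capture_output=True, text=True
    )


def exec_with_retry(
    resolve_container: Callable[[], Optional[str]],
    cmd: Sequence[str],
    timeout: float = 90.0,
    interval: float = 2.0,  # assumed poll interval
    run: Callable[[str, Sequence[str]], subprocess.CompletedProcess] = _docker_exec,
) -> str:
    """Poll until the exec exits 0, then return stdout; raise on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        # Re-resolve on every attempt: a cycled container gets a new ID.
        container = resolve_container()
        if container:
            proc = run(container, cmd)
            if proc.returncode == 0:
                return proc.stdout
            last_error = f"rc={proc.returncode}: {(proc.stderr or '').strip()}"
        else:
            last_error = "no running container found"
        if time.monotonic() >= deadline:
            # Fail loudly: an exec failure must never look like empty data.
            raise ExecFailed(f"{list(cmd)!r} failed for {timeout}s ({last_error})")
        time.sleep(interval)


def chaos_matches_head(chaos_version: Optional[str], head_ref: Optional[str]) -> bool:
    """True only if the deployed chaos commit identifies the PR head commit."""
    if not chaos_version or not head_ref:
        return False
    deployed, head = chaos_version.strip().lower(), head_ref.strip().lower()
    if min(len(deployed), len(head)) < 7:
        return False  # too short to identify a commit
    return head.startswith(deployed) or deployed.startswith(head)
```

A stale redeploy of the previous checkout stamps a different commit, so `chaos_matches_head` returns False and the assertion built on it fails. Checking only that a chaos label is present would have passed that case.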