autonomic-bot 36a6c9872a orchestrator: reboot-resilience + session auto-resume + full session plan/tooling
Reboot survival for the Pi orchestrator host:
- systemd unit cc-ci-plan/systemd/cc-ci-loops.service (installed + enabled): on boot
  records the reboot, starts loops+watchdog (RESUME_PHASE=1), and resumes the
  orchestrator session.
- reboot-log.sh: boot_id-gated reboot record -> REBOOTS.md (manual restarts don't count).
- launch-orchestrator.sh: injects an AGENTS.md startup nudge so an auto-resumed
  orchestrator announces itself (PushNotification) + reports reboots.
- AGENTS.md: on-startup notify routine documented.

Plans/tooling accumulated this session:
- plan-phase1d (generic suite), 1e (harness corrections), phase4 (final review),
  sso-dep-testing, orchestrator-migration (parked), test-e2e-testme-acceptance.
- launch.sh: 1d/1e/2/2b/3/4 phase sequence, machine-docs-aware state resolution,
  limit-stall re-nudge, INBOX side-channel detection.
- plan.md §6.1/§7: artifact-layer isolation, INBOX, 5-min long-run polling, DEFERRED.
- prompts: isolation discipline + INBOX + pacing.
- .gitignore: harden (.sops/, cc-ci-secrets/, .claude/, *.tmp.*).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 20:28:10 +01:00

# cc-ci Phase 1e — Generic-harness corrections (Autonomous Build Plan)
**Status:** QUEUED — runs **after Phase 1d** and **before Phase 2** (`plan-phase2-recipe-tests.md`).
It corrects the **shared generic-test harness** from 1d, so it must land before Phase 2 authors
overlays on top of it.
**Transition:** **manual** (operator kicks it off).
**Builds on:** the Phase-1d generic suite (`runner/run_recipe_ci.py`, `runner/harness/*`,
`tests/_generic/*`, `tests/conftest.py`) — see `plan-phase1d-generic-test-suite.md`.
**Owner agents:** same Builder + Adversary loops (`plan.md` §6/§7); Adversary cold-verifies.
**This file's path:** `/srv/cc-ci/cc-ci-plan/plan-phase1e-harness-corrections.md`
**Phase order:** 1c → 1b → 1d → **1e** → 2 → 2b → 3.

---
## 0. Why this phase
An operator review of the 1d generic suite (2026-05-28) found three corrections to the **shared
harness** — the foundation every recipe overlay (Phase 2) builds on. Fixing them now, once, is far
cheaper than after overlays exist. All three are small in code but change behavior, so each needs a
fresh Adversary cold-verification and must not weaken any existing test.

---
## 1. Definition of Done (Phase 1e exit condition)
Terminates when every item holds **and the Adversary has independently cold-verified** (logged in
`machine-docs/REVIEW-1e.md`):
- [ ] **HC1 — Upgrade tier upgrades to the code under test (PR head), not a published tag.** The
upgrade tier deploys the **previous published version** (last release before the PR) and then
**upgrades to the PR head via `abra app deploy --chaos`** (chaos = the current checkout). The
PR's actual changes are exercised by the upgrade path. (§2.1)
- [ ] **HC2 — Repo-local (PR-authored) code is not executed unless the recipe is approved.** By
default the harness runs **only cc-ci-authored** overlays/install-steps (`tests/<recipe>/…`) +
the generic; PR-authored repo-local `test_*.py` and `install_steps.sh` are **not run**.
Repo-local code is honored **only for recipes on an explicit cc-ci-maintained approval
allowlist** (default-deny). (§2.2)
- [ ] **HC3 — Generic runs by default (additive); skipping it is explicit.** When a recipe ships an
overlay for an op, the **generic still runs** alongside it by default; the generic is skipped
**only** when an explicit env/flag opts out. The baseline floor is never lost silently. (§2.3)
- [ ] **HC4 — No regression, cold-verified.** The Adversary re-runs the relevant D1–D10 / DG1–DG8
acceptance from a cold start: nothing weakened, deploy-once (DG4.1) still holds, teardown still
sacred, and the three new behaviors are demonstrated (HC1: a PR-head upgrade proven to deploy
PR-head; HC2: a repo-local test is *ignored* for a non-approved recipe and *run* for an approved
one; HC3: generic runs with an overlay present, and is skipped only with the opt-out set).
When HC1–HC4 hold and are confirmed, write `## DONE` to `machine-docs/STATUS-1e.md`.
---
## 2. The three corrections
### 2.1 HC1 — Upgrade to the PR head (not a published tag)
Current 1d behavior: deploy previous published version, then `abra app upgrade` to the **newest
published tag** — and because deploying the prev tag re-checks-out the recipe, the **PR-head code is
never deployed**, so a recipe PR's changes aren't exercised by upgrade.
Corrected:
1. Deploy the **previous published version** (the last release before the code under test) as the
"before" state.
2. **Restore the PR-head checkout** (re-checkout the PR ref / re-use the post-fetch snapshot — the
prev-tag deploy will have reset `~/.abra/recipes/<recipe>`).
3. **Upgrade to it via `abra app deploy --chaos`** (chaos = current checkout = PR head) in place on
the shared deployment.
4. Assert reconverge + still serving (as today).
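The four steps above can be sketched as a small orchestration function. This is a minimal sketch, not the harness's actual code: `run_upgrade_tier`, its injected callables, and the ref strings are hypothetical names, and the check that the checkout is back at the PR head before the chaos redeploy is an assumed safeguard rather than something this plan prescribes.

```python
from typing import Callable, List


def run_upgrade_tier(
    deploy_previous: Callable[[], None],
    restore_pr_head: Callable[[], None],
    current_ref: Callable[[], str],
    pr_head: str,
    chaos_redeploy: Callable[[], None],
    assert_serving: Callable[[], None],
) -> None:
    """Run the corrected upgrade tier in the order the plan lists (hypothetical helper)."""
    deploy_previous()  # 1. "before" state: last release before the code under test
    restore_pr_head()  # 2. the prev-tag deploy reset the recipe checkout; put PR head back
    found = current_ref()
    if found != pr_head:
        # Assumed safeguard: refuse to "upgrade" to anything other than the PR head.
        raise RuntimeError(f"checkout is at {found!r}, expected PR head {pr_head!r}")
    chaos_redeploy()   # 3. abra app deploy --chaos (chaos = current checkout = PR head)
    assert_serving()   # 4. reconverge + still serving


# Usage with recording stubs standing in for the real abra/git calls.
calls: List[str] = []
checkout = {"ref": "pr-head"}


def fake_deploy_previous() -> None:
    calls.append("deploy-previous")
    checkout["ref"] = "previous-release"  # mimics the prev-tag deploy resetting the checkout


def fake_restore_pr_head() -> None:
    calls.append("restore-pr-head")
    checkout["ref"] = "pr-head"


run_upgrade_tier(
    deploy_previous=fake_deploy_previous,
    restore_pr_head=fake_restore_pr_head,
    current_ref=lambda: checkout["ref"],
    pr_head="pr-head",
    chaos_redeploy=lambda: calls.append("deploy-chaos"),
    assert_serving=lambda: calls.append("assert-serving"),
)
print(calls)
```

Injecting the steps as callables keeps the ordering testable without a live deployment. If step 2 is skipped, the ref check fails before anything is redeployed.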
- **Adapt the "deployment moved" assertion** (`generic.do_upgrade`): prev→PR-head may *not* bump the
coop-cloud version label (a PR can change a recipe without a version bump), so also accept an
image/config change, or assert the running config now matches the PR head — keep it non-vacuous
without false-failing a legit unbumped PR.
- **Non-PR `!testme`** (no PR head): the "current checkout" is the catalogue's current version, so
the upgrade tier tests prev→current, which is still valid.
- Preserve **deploy-once** spirit: this is still one app deployment mutated in place (prev → chaos
redeploy of PR head is the upgrade op, not a fresh second app). Reconcile with the DG4.1
deploy-count guard — define whether a chaos redeploy counts as a "deploy" and adjust the guard so
the legitimate upgrade isn't flagged (e.g. count `abra app new` installs, not in-place redeploys).
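One possible shape for the adapted "deployment moved" assertion and the reconciled deploy-count guard is sketched below. Everything named here is hypothetical: `DeploymentSnapshot` and its fields, `deployment_moved`, and `DeployCountGuard` are illustrative only, the sample values are made up, and the guard follows just one of the options floated above (count `abra app new` installs, not in-place redeploys). The final rule is still to be settled in DECISIONS.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class DeploymentSnapshot:
    """Hypothetical view of a running deployment; the field names are assumptions."""
    version_label: str        # coop-cloud version label
    images: Tuple[str, ...]   # image references of the running services
    config_digest: str        # digest of the rendered config


def deployment_moved(before: DeploymentSnapshot, after: DeploymentSnapshot) -> bool:
    """Non-vacuous 'moved' check that tolerates a PR without a version bump."""
    return (
        before.version_label != after.version_label
        or before.images != after.images
        or before.config_digest != after.config_digest
    )


class DeployCountGuard:
    """Fails on a second fresh install; in-place deploy commands are only tallied."""

    def __init__(self, max_installs: int = 1) -> None:
        self.max_installs = max_installs
        self.installs = 0
        self.deploy_commands = 0

    def record(self, argv: Sequence[str]) -> None:
        head = list(argv[:3])
        if head == ["abra", "app", "new"]:
            self.installs += 1
            if self.installs > self.max_installs:
                raise AssertionError(
                    f"deploy-once violated: {self.installs} installs recorded"
                )
        elif head == ["abra", "app", "deploy"]:
            self.deploy_commands += 1  # includes the chaos redeploy; never fails the guard


# Usage: an unbumped PR that only changes config still counts as "moved".
before = DeploymentSnapshot("label-a", ("image-a",), "digest-1")
after = DeploymentSnapshot("label-a", ("image-a",), "digest-2")
print(deployment_moved(before, after))

guard = DeployCountGuard()
guard.record(["abra", "app", "new", "<recipe>"])
guard.record(["abra", "app", "deploy", "<domain>"])
guard.record(["abra", "app", "deploy", "--chaos", "<domain>"])
print(guard.installs, guard.deploy_commands)
```

A PR that changes nothing the deployment can observe would still fail `deployment_moved`. The alternative named above, asserting that the running config matches the PR head, would cover that case.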
### 2.2 HC2 — Repo-local trust gate (default-deny; cc-ci overlays only)
`install_steps.sh` and repo-local `test_*.py` are PR-author-controlled code that runs on the CI host
with `/run/secrets/*` present — an untrusted-code risk. Operator decision (2026-05-28):
- **Default:** the harness runs **only cc-ci-authored** overlays + install-steps
(`tests/<recipe>/…`) and the generic. Repo-local (`<recipe-repo>/tests/`) `test_*.py` and
`install_steps.sh` are **discovered-but-not-executed**.
- **Approved recipes only:** repo-local code is honored **only** when the recipe is on an explicit,
**cc-ci-maintained approval allowlist** (default-empty ⇒ default-deny). Adding a recipe to the
allowlist is a deliberate cc-ci-maintainer act after reviewing that recipe's tests.
- Update `discovery.resolve_op` / `custom_tests` / `install_steps` so the **repo-local source is
only consulted for allowlisted recipes**; otherwise precedence is **cc-ci > generic** only.
- **Open (settle in DECISIONS):** the allowlist's form + location (a checked-in file like
`tests/repo-local-approved.txt`, or a field in a cc-ci config), and the approval workflow. Keep it
simple + auditable + in git.
- (Future hardening, → IDEAS, not this phase: sandbox/network-restrict even cc-ci overlays.)
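The default-deny gate could look roughly like the sketch below. It assumes the checked-in text-file option with one recipe name per line and `#` comments. That format, the helper names `parse_allowlist` and `filter_sources`, the source labels, and the sample paths are all hypothetical, since the allowlist's form is still an open decision. The sketch preserves whatever candidate order discovery already produced and makes no claim about where repo-local sits in the precedence for approved recipes.

```python
from typing import FrozenSet, Iterable, List, Tuple

Candidate = Tuple[str, str]  # (source label, path); labels are illustrative


def parse_allowlist(text: str) -> FrozenSet[str]:
    """Parse an assumed format: one recipe per line, '#' starts a comment."""
    names = set()
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            names.add(name)
    return frozenset(names)


def filter_sources(
    candidates: Iterable[Candidate], recipe: str, approved: FrozenSet[str]
) -> List[Candidate]:
    """Drop repo-local candidates unless the recipe is explicitly approved."""
    if recipe in approved:
        return list(candidates)
    return [(source, path) for source, path in candidates if source != "repo-local"]


# Usage: an empty allowlist means default-deny for every recipe.
approved = parse_allowlist("# reviewed by a cc-ci maintainer\nexample-recipe\n")
candidates = [
    ("cc-ci", "tests/<recipe>/test_upgrade.py"),
    ("repo-local", "<recipe-repo>/tests/test_upgrade.py"),
    ("generic", "tests/_generic/test_upgrade.py"),
]
print(filter_sources(candidates, "other-recipe", approved))
print(filter_sources(candidates, "example-recipe", approved))
```

Filtering on the source label rather than skipping discovery keeps repo-local files discovered but not executed, so a run can still report what it ignored.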
### 2.3 HC3 — Generic by default (additive), explicit opt-out
Supersedes 1d's pure-override default. New rule: when a recipe ships an overlay for an op, **both the
generic and the overlay run** for that op by default; the generic is skipped **only** when an
explicit opt-out is set.
- **Opt-out mechanism (propose; settle in DECISIONS):** an env flag `CCCI_SKIP_GENERIC` (all ops) and
per-op `CCCI_SKIP_GENERIC_<OP>` (e.g. `..._UPGRADE`), settable via the recipe's `recipe_meta.py`
(a `SKIP_GENERIC` list) so it's declarative per recipe, not a hidden global.
- **Op-vs-assertion split (required by additive + deploy-once):** a mutating op (upgrade/backup/
restore) must run **once**, then **both** the generic assertions and the overlay assertions
evaluate the post-op state — never upgrade/backup twice. So refactor the tiers: the **orchestrator
performs the op once** (the harness owns the op), then runs generic assertions (unless opted out) +
overlay assertions against the shared post-op deployment. For `install` (no op) both assertion sets
just run. This keeps deploy-once and one-op-per-tier intact.
- Net effect: the generic "is it actually serving / did the upgrade move / snapshot produced" floor
is **always** exercised unless a recipe explicitly declares it skips generics — overlays add, they
don't silently subtract.
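The opt-out lookup and the op-vs-assertion split can be sketched together. This is a sketch under stated assumptions: `generic_skipped` and `run_tier` are hypothetical names, the truthy values for the env flags are assumed, the `SKIP_GENERIC` list is assumed to hold lowercase op names, and running the generic assertions before the overlay ones is an arbitrary choice. The flag names themselves are the ones proposed above and remain to be settled in DECISIONS.

```python
from typing import Callable, Iterable, List, Mapping, Optional

TRUTHY = {"1", "true", "yes"}  # assumed spelling of "set"


def generic_skipped(
    op: str, env: Mapping[str, str], skip_generic: Iterable[str] = ()
) -> bool:
    """True only when an explicit opt-out covers this op."""

    def flag(name: str) -> bool:
        return env.get(name, "").strip().lower() in TRUTHY

    return (
        flag("CCCI_SKIP_GENERIC")                       # all ops
        or flag(f"CCCI_SKIP_GENERIC_{op.upper()}")      # per-op env flag
        or op in set(skip_generic)                      # recipe_meta.py SKIP_GENERIC list
    )


def run_tier(
    op: str,
    perform_op: Optional[Callable[[], None]],
    generic_asserts: Iterable[Callable[[], None]],
    overlay_asserts: Iterable[Callable[[], None]],
    env: Mapping[str, str],
    skip_generic: Iterable[str] = (),
) -> List[str]:
    """Perform the mutating op once, then evaluate both assertion sets on the result."""
    if perform_op is not None:  # `install` has no op; both sets just run
        perform_op()
    ran: List[str] = []
    if not generic_skipped(op, env, skip_generic):
        for check in generic_asserts:
            check()
        ran.append("generic")
    for check in overlay_asserts:
        check()
    ran.append("overlay")
    return ran


# Usage: each tier run performs the op exactly once, with or without the opt-out.
op_runs: List[str] = []
default = run_tier(
    "upgrade", lambda: op_runs.append("upgrade"), [lambda: None], [lambda: None], env={}
)
opted_out = run_tier(
    "upgrade",
    lambda: op_runs.append("upgrade"),
    [lambda: None],
    [lambda: None],
    env={"CCCI_SKIP_GENERIC_UPGRADE": "1"},
)
print(default, opted_out, len(op_runs))
```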
---
## 3. Method / milestones (bounded)
- **E0 — HC2 trust gate.** Gate repo-local behind the approval allowlist (default-deny); cc-ci+generic
only otherwise. *Accept:* repo-local ignored for a non-approved recipe, run for an approved one.
- **E1 — HC3 additive + op/assertion split.** Generic runs alongside overlays by default; op runs
once; opt-out env skips the generic assertions. *Accept:* overlay + generic both run on one
deployment; opt-out skips generic; deploy-count still 1.
- **E2 — HC1 upgrade-to-PR-head.** prev-release → PR-head via `deploy --chaos`; moved-assertion
adapted; deploy-count guard reconciled. *Accept:* upgrade demonstrably deploys PR-head.
- **E3 — HC4 cold re-verification + docs.** Adversary cold-verifies no regression + the three new
behaviors; update `docs/` + `machine-docs/DECISIONS.md`; flip `STATUS-1e.md` to `## DONE`.
---
## 4. Guardrails
- **Never weaken a test** — these are correctness/security fixes; the cardinal rule still wins.
- **Default-secure** — repo-local PR code is off unless the recipe is explicitly approved; the
allowlist lives in git and is auditable.
- **Floor-by-default** — the generic baseline always runs unless a recipe explicitly opts out.
- **Deploy-once preserved** — one app deployment, one teardown; ops run once; reconcile the DG4.1
guard with the chaos-upgrade redeploy.
- **Bounded** — three fixes + verification, then stop; bigger hardening (sandboxing) → IDEAS.
## 5. Open decisions (log in machine-docs/DECISIONS.md)
- HC2: approval-allowlist form/location + the approval workflow.
- HC3: opt-out flag name/granularity + declaring it via `recipe_meta.py`.
- HC1: how the DG4.1 deploy-count guard treats an in-place chaos upgrade (don't flag the legit op).