
autonomic-bot 36a6c9872a orchestrator: reboot-resilience + session auto-resume + full session plan/tooling

Reboot survival for the Pi orchestrator host:
- systemd unit cc-ci-plan/systemd/cc-ci-loops.service (installed + enabled): on boot
  records the reboot, starts loops+watchdog (RESUME_PHASE=1), and resumes the
  orchestrator session.
- reboot-log.sh: boot_id-gated reboot record -> REBOOTS.md (manual restarts don't count).
- launch-orchestrator.sh: injects an AGENTS.md startup nudge so an auto-resumed
  orchestrator announces itself (PushNotification) + reports reboots.
- AGENTS.md: on-startup notify routine documented.

Plans/tooling accumulated this session:
- plan-phase1d (generic suite), 1e (harness corrections), phase4 (final review),
  sso-dep-testing, orchestrator-migration (parked), test-e2e-testme-acceptance.
- launch.sh: 1d/1e/2/2b/3/4 phase sequence, machine-docs-aware state resolution,
  limit-stall re-nudge, INBOX side-channel detection.
- plan.md §6.1/§7: artifact-layer isolation, INBOX, 5-min long-run polling, DEFERRED.
- prompts: isolation discipline + INBOX + pacing.
- .gitignore: harden (.sops/, cc-ci-secrets/, .claude/, *.tmp.*).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-05-28 20:28:10 +01:00


cc-ci Phase 1e — Generic-harness corrections (Autonomous Build Plan)

Status: QUEUED. Runs after Phase 1d and before Phase 2 (plan-phase2-recipe-tests.md). It corrects the shared generic-test harness from 1d, so it must land before Phase 2 authors overlays on top of it.
Transition: manual (operator kicks it off).
Builds on: the Phase-1d generic suite (runner/run_recipe_ci.py, runner/harness/*, tests/_generic/*, tests/conftest.py); see plan-phase1d-generic-test-suite.md.
Owner agents: same Builder + Adversary loops (plan.md §6/§7); Adversary cold-verifies.
This file's path: /srv/cc-ci/cc-ci-plan/plan-phase1e-harness-corrections.md
Phase order: 1c → 1b → 1d → 1e → 2 → 2b → 3.

0. Why this phase

An operator review of the 1d generic suite (2026-05-28) identified three corrections needed in the shared harness, the foundation every recipe overlay (Phase 2) builds on. Fixing them now, once, is far cheaper than fixing them after overlays exist. All three are small in code but change behavior, so each needs a fresh Adversary cold-verification and must not weaken any existing test.

1. Definition of Done (Phase 1e exit condition)

Phase 1e terminates when every item below holds and the Adversary has independently cold-verified each one (logged in machine-docs/REVIEW-1e.md):

HC1 — Upgrade tier upgrades to the code under test (PR head), not a published tag. The upgrade tier deploys the previous published version (last release before the PR) and then upgrades to the PR head via abra app deploy --chaos (chaos = the current checkout). The PR's actual changes are exercised by the upgrade path. (§2.1)
HC2 — Repo-local (PR-authored) code is not executed unless the recipe is approved. By default the harness runs only cc-ci-authored overlays/install-steps (tests/<recipe>/…) + the generic; PR-authored repo-local test_*.py and install_steps.sh are not run. Repo-local code is honored only for recipes on an explicit cc-ci-maintained approval allowlist (default-deny). (§2.2)
HC3 — Generic runs by default (additive); skipping it is explicit. When a recipe ships an overlay for an op, the generic still runs alongside it by default; the generic is skipped only when an explicit env/flag opts out. The baseline floor is never lost silently. (§2.3)
HC4 — No regression, cold-verified. The Adversary re-runs the relevant D1–D10 / DG1–DG8 acceptance from a cold start: nothing weakened, deploy-once (DG4.1) still holds, teardown still sacred, and the three new behaviors are demonstrated (HC1: a PR-head upgrade proven to deploy PR-head; HC2: a repo-local test is ignored for a non-approved recipe and run for an approved one; HC3: generic runs with an overlay present, and is skipped only with the opt-out set).

When HC1–HC4 hold and are confirmed, write ## DONE to machine-docs/STATUS-1e.md.

2. The three corrections

2.1 HC1 — Upgrade to the PR head (not a published tag)

Current 1d behavior: deploy the previous published version, then abra app upgrade to the newest published tag. Because deploying the previous tag re-checks-out the recipe, the PR-head code is never deployed, so a recipe PR's changes are never exercised by the upgrade tier.

Corrected:

Deploy the previous published version (the last release before the code under test) as the "before" state.
Restore the PR-head checkout (re-checkout the PR ref / re-use the post-fetch snapshot — the prev-tag deploy will have reset ~/.abra/recipes/<recipe>).
Upgrade to it via abra app deploy --chaos (chaos = current checkout = PR head) in place on the shared deployment.
Assert reconverge + still serving (as today).

Adapt the "deployment moved" assertion (generic.do_upgrade): prev→PR-head may not bump the coop-cloud version label (a PR can change a recipe without a version bump), so also accept an image/config change, or assert the running config now matches the PR head — keep it non-vacuous without false-failing a legit unbumped PR.
Non-PR !testme (no PR head): "current checkout" = the catalogue current, so upgrade tests prev→current — still valid.
Preserve deploy-once spirit: this is still one app deployment mutated in place (prev → chaos redeploy of PR head is the upgrade op, not a fresh second app). Reconcile with the DG4.1 deploy-count guard — define whether a chaos redeploy counts as a "deploy" and adjust the guard so the legitimate upgrade isn't flagged (e.g. count abra app new installs, not in-place redeploys).
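
The adapted "deployment moved" check can be sketched as a pure function over before/after snapshots. This is a minimal illustration, not the actual generic.do_upgrade code: DeploymentSnapshot, its fields, and deployment_moved are hypothetical names, and how the harness would collect the version label, image set, or config digest is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentSnapshot:
    """Hypothetical summary of a running deployment (field names are illustrative)."""
    version_label: str    # the coop-cloud version label seen on the deployment
    images: frozenset     # image references of the running services
    config_digest: str    # hash of the rendered config actually running


def deployment_moved(before, after, expected_config_digest=None):
    """Return the reasons the deployment counts as 'moved' by the upgrade.

    An empty list means nothing observable changed, so the caller should fail
    the assertion instead of passing vacuously.
    """
    reasons = []
    if before.version_label != after.version_label:
        reasons.append("version-label")
    if before.images != after.images:
        reasons.append("image")
    if before.config_digest != after.config_digest:
        reasons.append("config")
    # Alternative acceptance from the plan: the running config now matches
    # what the PR head renders to (the digest is supplied by the caller).
    if expected_config_digest is not None and after.config_digest == expected_config_digest:
        reasons.append("matches-pr-head")
    return reasons
```

Returning the list of reasons rather than a bare boolean lets the harness log why an unbumped PR was still accepted, which keeps the check auditable during the Adversary's cold-verification.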

2.2 HC2 — Repo-local trust gate (default-deny; cc-ci overlays only)

install_steps.sh and repo-local test_*.py are PR-author-controlled code that runs on the CI host with /run/secrets/* present — an untrusted-code risk. Operator decision (2026-05-28):

Default: the harness runs only cc-ci-authored overlays + install-steps (tests/<recipe>/…) and the generic. Repo-local (<recipe-repo>/tests/) test_*.py and install_steps.sh are discovered-but-not-executed.
Approved recipes only: repo-local code is honored only when the recipe is on an explicit, cc-ci-maintained approval allowlist (default-empty ⇒ default-deny). Adding a recipe to the allowlist is a deliberate cc-ci-maintainer act after reviewing that recipe's tests.
Update discovery.resolve_op / custom_tests / install_steps so the repo-local source is only consulted for allowlisted recipes; otherwise precedence is cc-ci > generic only.
Open (settle in DECISIONS): the allowlist's form + location (a checked-in file like tests/repo-local-approved.txt, or a field in a cc-ci config), and the approval workflow. Keep it simple + auditable + in git.
(Future hardening, → IDEAS, not this phase: sandbox/network-restrict even cc-ci overlays.)
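
A default-deny gate along these lines could back the discovery change. It is a sketch under stated assumptions: the allowlist form is still an open decision, so the one-name-per-line file format, load_approved, and allowed_sources are hypothetical, and the position of repo-local in the precedence order for approved recipes is a guess (the plan only fixes cc-ci > generic for everything else).

```python
from pathlib import Path


def load_approved(allowlist_path):
    """Read one recipe name per line; blank lines and '#' comments are ignored.

    A missing or empty file yields an empty set, which means default-deny.
    """
    path = Path(allowlist_path)
    if not path.exists():
        return frozenset()
    names = set()
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            names.add(line)
    return frozenset(names)


def allowed_sources(recipe, approved):
    """Return the sources the harness may execute, highest precedence first."""
    sources = ["cc-ci", "generic"]
    if recipe in approved:
        # Assumption: repo-local slots between cc-ci and generic.
        sources.insert(1, "repo-local")
    return sources
```

With this shape, discovery never consults the repo-local directory for a recipe that is not in the approved set, so an unreviewed PR cannot get its own test_*.py or install_steps.sh executed just by adding the file.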

2.3 HC3 — Generic by default (additive), explicit opt-out

Supersedes 1d's pure-override default. New rule: when a recipe ships an overlay for an op, both the generic and the overlay run for that op by default; the generic is skipped only when an explicit opt-out is set.

Opt-out mechanism (propose; settle in DECISIONS): an env flag CCCI_SKIP_GENERIC (all ops) and per-op CCCI_SKIP_GENERIC_<OP> (e.g. ..._UPGRADE), settable via the recipe's recipe_meta.py (a SKIP_GENERIC list) so it's declarative per recipe, not a hidden global.
Op-vs-assertion split (required by additive + deploy-once): a mutating op (upgrade/backup/restore) must run once, then both the generic assertions and the overlay assertions evaluate the post-op state; the op is never run twice. So refactor the tiers: the orchestrator performs the op once (the harness owns the op), then runs generic assertions (unless opted out) + overlay assertions against the shared post-op deployment. For install (no op) both assertion sets just run. This keeps deploy-once and one-op-per-tier intact.
Net effect: the generic "is it actually serving / did the upgrade move / snapshot produced" floor is always exercised unless a recipe explicitly declares it skips generics — overlays add, they don't silently subtract.
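
The additive rule and the op-vs-assertion split can be sketched together. This is illustrative only: generic_skipped and run_tier are hypothetical names, treating the value "1" as "set" is an assumption, and skip_generic stands in for the proposed SKIP_GENERIC list from recipe_meta.py, whose final form is still to be settled in DECISIONS.

```python
import os


def generic_skipped(op, skip_generic=(), environ=None):
    """True only if the generic assertions for `op` are explicitly opted out."""
    env = os.environ if environ is None else environ
    declared = {name.lower() for name in skip_generic}
    if op.lower() in declared:
        return True
    if env.get("CCCI_SKIP_GENERIC") == "1":
        return True
    return env.get(f"CCCI_SKIP_GENERIC_{op.upper()}") == "1"


def run_tier(op, perform_op, generic_asserts, overlay_asserts,
             skip_generic=(), environ=None):
    """Perform the mutating op once, then evaluate both assertion sets.

    `perform_op` is None for install, where there is no op to perform.
    Returns the names of the checks that ran, in order.
    """
    state = perform_op() if perform_op is not None else None
    ran = []
    if not generic_skipped(op, skip_generic, environ):
        for check in generic_asserts:
            check(state)
            ran.append(f"generic:{check.__name__}")
    for check in overlay_asserts:
        check(state)
        ran.append(f"overlay:{check.__name__}")
    return ran
```

Because the op is invoked in exactly one place, adding an overlay can only add assertions against the shared post-op state; it cannot trigger a second upgrade or backup.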

3. Method / milestones (bounded)

E0 — HC2 trust gate. Gate repo-local behind the approval allowlist (default-deny); cc-ci+generic only otherwise. Accept: repo-local ignored for a non-approved recipe, run for an approved one.
E1 — HC3 additive + op/assertion split. Generic runs alongside overlays by default; op runs once; opt-out env skips the generic assertions. Accept: overlay + generic both run on one deployment; opt-out skips generic; deploy-count still 1.
E2 — HC1 upgrade-to-PR-head. prev-release → PR-head via deploy --chaos; moved-assertion adapted; deploy-count guard reconciled. Accept: upgrade demonstrably deploys PR-head.
E3 — HC4 cold re-verification + docs. Adversary cold-verifies no regression + the three new behaviors; update docs/ + machine-docs/DECISIONS.md; flip STATUS-1e.md to ## DONE.

4. Guardrails

Never weaken a test — these are correctness/security fixes; the cardinal rule still wins.
Default-secure — repo-local PR code is off unless the recipe is explicitly approved; the allowlist lives in git and is auditable.
Floor-by-default — the generic baseline always runs unless a recipe explicitly opts out.
Deploy-once preserved — one app deployment, one teardown; ops run once; reconcile the DG4.1 guard with the chaos-upgrade redeploy.
Bounded — three fixes + verification, then stop; bigger hardening (sandboxing) → IDEAS.

5. Open decisions (log in machine-docs/DECISIONS.md)

HC2: approval-allowlist form/location + the approval workflow.
HC3: opt-out flag name/granularity + declaring it via recipe_meta.py.
HC1: how the DG4.1 deploy-count guard treats an in-place chaos upgrade (don't flag the legit op).
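
One possible resolution of the HC1 decision is a guard that counts only fresh installs, as the plan's own example suggests. The sketch below is not the existing DG4.1 guard: DeployCountGuard and the event-kind constants are hypothetical, and how the harness would classify a deploy as fresh or in-place is assumed to happen elsewhere.

```python
from collections import Counter

# Hypothetical event kinds the harness might record (names are illustrative).
FRESH_INSTALL = "app-new"
IN_PLACE_REDEPLOY = "chaos-redeploy"


class DeployCountGuard:
    """Fail when a run creates more than the allowed number of fresh apps."""

    def __init__(self, max_installs=1):
        self.max_installs = max_installs
        self.events = Counter()

    def record(self, kind):
        """Record one deploy event of the given kind."""
        self.events[kind] += 1

    def check(self):
        """Raise if fresh installs exceed the limit; redeploys are not counted."""
        installs = self.events[FRESH_INSTALL]
        if installs > self.max_installs:
            raise AssertionError(
                f"deploy-once violated: {installs} fresh installs "
                f"(limit {self.max_installs})"
            )
```

Recording in-place redeploys without counting them against the limit keeps them visible in the run log while leaving the legitimate prev → PR-head upgrade unflagged.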
