Files
cc-ci-orchestrator/cc-ci-plan/plan-phase5-verify-upgrade-flow.md
autonomic-bot bf71420106 Add cc-ci-upgrader agent: observable one-shot weekly upgrade-run agent
The weekly upgrade run now executes inside a dedicated, remote-control agent
(cc-ci-upgrader) — viewable/steerable at claude.ai/code like the Builder — rather
than buried in headless cron output.

- launch-upgrader.sh: spins up the cc-ci-upgrader tmux session under
  --remote-control with a kickoff that runs /upgrade-all (DEFAULT mode) to
  completion. On finish the agent STOPS and stays idle (does NOT self-terminate)
  so the run + summary stay reviewable in the web UI. `start` = use-or-create:
  leaves an in-flight (busy) run alone, else clears a finished/idle/wedged
  session and runs fresh; `fresh` always restarts. UPGRADER_ARGS passes flags
  (e.g. --dry-run); never --with-tests.
- launch.sh: orchestrator_alive() now also skips the cc-ci-upgrader
  remote-control name, so the upgrader job isn't mistaken for the orchestrator.
- upgrade-all skill: documents it runs as the cc-ci-upgrader agent; the weekly
  cron invokes `launch-upgrader.sh start` (not /upgrade-all inline).
- Phase 5: V8a verifies the agent lifecycle (launch → run to completion → stay
  idle/viewable → next start clears it); V9 stops the verification session.
- cron memory: weekly task = launch-upgrader.sh start at 0 3 * * 6 UTC.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 21:12:47 +01:00

92 lines
6.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# cc-ci Phase 5 — verify the `/recipe-upgrade` + `testme-on-pr.sh` end-to-end flow
**Status:** QUEUED — the **FINAL** phase, after Phase 4. Transition: auto (last in the launcher
sequence); the watchdog STOPS after it (build complete). **Owner:** Builder + Adversary loops
(Adversary cold-verifies); the orchestrator may also drive it.
**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase5-verify-upgrade-flow.md`
**Phase order:** … 3 → 4 → **5**.
---
## 0. Why this is LAST
The `/recipe-upgrade` and `/upgrade-all` skills live in the **orchestration repo**
(`/srv/cc-ci/.claude/skills/`) and drive cc-ci by posting **`!testme` on a recipe PR**
(`testme-on-pr.sh`) and reading the verdict back **from the PR**. That flow only works once cc-ci is
**fully built** — specifically it depends on **Phase 3 (results UX)**: the run's result must be posted
back to the PR (a Gitea commit status and/or result comment on the PR head) for `testme-on-pr.sh` to
read it. So this verification can only run after the CI server is complete. It proves the upgrade
tooling actually works end-to-end **before** the weekly `/upgrade-all` cron is activated
([[cc-ci-upgrade-all-cron]] — activate only after the build is done).
**Use a sandbox recipe for the live runs** (e.g. `custom-html-tiny` or a dedicated throwaway recipe),
and **close every PR opened during verification afterward** — do not pollute real recipe mirrors, and
**never merge** anything.
## 1. Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-5.md`)
- [ ] **V1 — `!testme` on a recipe PR is the real gate.** Posting `!testme` (as the bot, a
collaborator) on a recipe mirror PR triggers bridge → Drone → harness, runs the suite, and
**posts the run + its result back to the PR** (commit status on the PR head and/or a result
comment, per Phase 3). A `!testme` from a **non-collaborator is rejected**. A bad trigger
(`!testmexyz`) does NOT fire. (Re-uses the existing !testme D-gate evidence.)
- [ ] **V2 — `testme-on-pr.sh` reads the verdict correctly.** `POST=1` posts exactly **one**
`!testme`; `POST=0` polls WITHOUT re-triggering; the helper returns `VERDICT=GREEN` on a passing
PR, `VERDICT=RED` on a failing PR, `VERDICT=PENDING` while in flight, and a `BUILD=<url>` that
points at the Drone build. (Drive a known-green and a known-red PR head.)
- [ ] **V3 — `/recipe-upgrade <sandbox-recipe>` (DEFAULT mode) end-to-end GREEN.** On a recipe with a
real available upgrade: it reconciles the mirror, opens a recipe PR with the bump, `!testme`
runs and **results appear in the PR**, GREEN → `RESULT: SUCCESS`. **Nothing merged.**
- [ ] **V4 — 3-iteration regression loop.** Seed a real upgrade regression (e.g. a deliberately bad
image tag) → `!testme` RED → fix on the recipe branch → re-`!testme` → GREEN, all within **≤3
`!testme` runs**, each recorded in the PR. With a persistent regression, after 3 runs it leaves
the PR open and reports `RESULT: FAILED — upgrade not green after 3 !testme runs` (does not
weaken anything to force green).
- [ ] **V5 — stale-test DEFAULT = comment, no test edit.** With a genuinely stale test for the new
version, default mode (no `--with-tests`) leaves an **explanatory comment** on the recipe PR
(upgrade looks correct; which test is stale + why; "re-run `--with-tests`"), modifies **no
test**, and reports `RESULT: SUCCESS-PENDING-TESTS`.
- [ ] **V6 — `--with-tests` opens + verifies a cc-ci test PR.** The same stale-test case under
`--with-tests` opens a cc-ci test-update PR (dedicated branch, separate clone — never `main`,
never the loops' clones), **verifies the recipe upgrade WITH the test change applied** via the
branch-checkout harness run (`verify-pr.sh`, since `!testme` uses prod tests), pairs the two PRs
with cross-notes, and reports `RESULT: SUCCESS+TESTPR`. Nothing merged.
- [ ] **V7 — mirror reconciliation.** After a run, the `recipe-maintainers/<recipe>` mirror `main` is
**identical to upstream main**; an open PR whose change is already in upstream main is
**auto-closed** (merged-upstream); a superseded open PR is **closed and replaced** by the new
one. (Seed each case and confirm.)
- [ ] **V8 — `/upgrade-all` DEFAULT run.** `--dry-run` lists candidates correctly; a live run over a
small set (12 sandbox recipes) opens recipe PRs **using default mode (never `--with-tests`)**,
surfaces "tests look stale" PRs in its own summary section, runs **sequentially with teardown**
between recipes, and the summary leads with the PR list. Confirms the weekly cron never
auto-edits tests.
- [ ] **V8a — the `cc-ci-upgrader` agent.** `cc-ci-plan/launch-upgrader.sh start` spins up a
remote-control `cc-ci-upgrader` session (viewable at claude.ai/code) that runs `/upgrade-all`
(DEFAULT) **to completion, then STOPS and stays idle** (does NOT self-terminate) with the
summary visible. A second `start` while a run is **in flight (busy)** leaves it alone; a `start`
against a **finished/idle (or wedged)** session **kills it and runs fresh**. (Use
`UPGRADER_ARGS=--dry-run` for a cheap check, then a real small-set run.) This is exactly what the
weekly cron invokes — verify the cron-equivalent path end-to-end.
- [ ] **V9 — cleanup.** Every PR opened during verification (recipe + any cc-ci test PR) is **closed**
and any sandbox deploy is **torn down**; the verification `cc-ci-upgrader` session is stopped
(`launch-upgrader.sh stop`); the box is left clean.
## 2. Method / notes
- **Sandbox first.** Prefer `custom-html-tiny` / a throwaway recipe for V3V8 so real recipe mirrors
aren't spammed. For the seeded-regression (V4) and stale-test (V5/V6) cases, construct the seed on a
sandbox recipe / a scratch branch, not a real upstream recipe.
- **Real path throughout** — real `!testme` → bridge → Drone → harness → results-in-PR; real `abra`.
- The skills are at `/srv/cc-ci/.claude/skills/{recipe-upgrade,upgrade-all,ci-test-review}/`; read
their SKILL.md as the spec for expected behavior.
- If Phase 3's PR-result surface differs from what `testme-on-pr.sh` polls (commit status vs comment),
**reconcile them here** — either adjust the helper to read the real surface, or file a CI finding —
and record the resolution in `DECISIONS.md`. (This is the most likely real gap to find.)
## 3. Guardrails
- **NEVER merge** any PR; **never weaken a test** to make a run pass.
- **Close verification PRs + tear down deploys** when done (V9) — leave no junk.
- **Single-writer:** cc-ci test PRs on dedicated branches + separate clones; never push `main` or
touch the loops' working clones; shared Swarm is stateful — serialize + tear down.
- **Bounded:** this phase VERIFIES the flow; it does not redesign the skills. A real defect → fix the
skill/helper (small) or file it; a bigger idea → `IDEAS.md`.