Phase 5 (final): verify the /recipe-upgrade + testme-on-pr.sh end-to-end flow

Appended as the LAST phase in the launcher sequence (… 3 4 5). It can only run once cc-ci is fully built: the !testme-on-recipe-PR flow depends on Phase 3 (results UX) surfacing the run result back on the PR for testme-on-pr.sh to read.

DoD (Adversary cold-verifies):
- V1: !testme on a recipe PR is the real gate + results land in the PR
- V2: testme-on-pr.sh reads GREEN/RED/PENDING + BUILD url; POST=0 polls without re-triggering
- V3: /recipe-upgrade default end-to-end green on a sandbox recipe, nothing merged
- V4: the ≤3 !testme regression loop
- V5: stale test DEFAULT = comment-only, no test edit
- V6: --with-tests opens+verifies a cc-ci test PR, paired
- V7: mirror reconcile closes merged/superseded PRs and main==upstream
- V8: /upgrade-all default dry-run + small live run never edits tests
- V9: all verification PRs closed + deploys torn down

Use a sandbox recipe; never merge; never weaken tests. Watchdog reloaded (seq …3 4 5).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 20:38:39 +01:00
parent 4a1da1dd60
commit 4f74676c72
2 changed files with 84 additions and 1 deletions
--- a/cc-ci-plan/launch.sh
+++ b/cc-ci-plan/launch.sh
@@ -62,7 +62,7 @@ WATCH_ORCHESTRATOR="${WATCH_ORCHESTRATOR:-1}"
 # Ordered phase sequence: each entry "id|planfile|statusbasename". The watchdog runs them in order,
 # auto-transitions on the phase's "## DONE" (in BUILDER_DIR/<statusbasename>), and STOPS after the
 # last one (manual gate). Override PHASES_SPEC (semicolon-separated) to change the sequence.
-PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md;2pc|plan-phase2pc-image-cache.md|STATUS-2pc.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}"
+PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md;2pc|plan-phase2pc-image-cache.md|STATUS-2pc.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md;5|plan-phase5-verify-upgrade-flow.md|STATUS-5.md}"
 IFS=';' read -r -a PHASES <<< "$PHASES_SPEC"
 PHASE_IDX_FILE="${PHASE_IDX_FILE:-$LOG_DIR/.phase-idx}"
 # --------------------------------------------------------------------------
--- a/cc-ci-plan/plan-phase5-verify-upgrade-flow.md
+++ b/cc-ci-plan/plan-phase5-verify-upgrade-flow.md
@@ -0,0 +1,83 @@
+# cc-ci Phase 5 — verify the `/recipe-upgrade` + `testme-on-pr.sh` end-to-end flow
+
+**Status:** QUEUED — the **FINAL** phase, after Phase 4. Transition: auto (last in the launcher
+sequence); the watchdog STOPS after it (build complete). **Owner:** Builder + Adversary loops
+(Adversary cold-verifies); the orchestrator may also drive it.
+**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase5-verify-upgrade-flow.md`
+**Phase order:** … 3 → 4 → **5**.
+
+---
+
+## 0. Why this is LAST
+
+The `/recipe-upgrade` and `/upgrade-all` skills live in the **orchestration repo**
+(`/srv/cc-ci/.claude/skills/`) and drive cc-ci by posting **`!testme` on a recipe PR**
+(`testme-on-pr.sh`) and reading the verdict back **from the PR**. That flow only works once cc-ci is
+**fully built** — specifically it depends on **Phase 3 (results UX)**: the run's result must be posted
+back to the PR (a Gitea commit status and/or result comment on the PR head) for `testme-on-pr.sh` to
+read it. So this verification can only run after the CI server is complete. It proves the upgrade
+tooling actually works end-to-end **before** the weekly `/upgrade-all` cron is activated
+([[cc-ci-upgrade-all-cron]] — activate only after the build is done).
+
+**Use a sandbox recipe for the live runs** (e.g. `custom-html-tiny` or a dedicated throwaway recipe),
+and **close every PR opened during verification afterward** — do not pollute real recipe mirrors, and
+**never merge** anything.
+
+## 1. Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-5.md`)
+
+- [ ] **V1 — `!testme` on a recipe PR is the real gate.** Posting `!testme` (as the bot, a
+      collaborator) on a recipe mirror PR triggers bridge → Drone → harness, runs the suite, and
+      **posts the run + its result back to the PR** (commit status on the PR head and/or a result
+      comment, per Phase 3). A `!testme` from a **non-collaborator is rejected**. A bad trigger
+      (`!testmexyz`) does NOT fire. (Re-uses the existing !testme D-gate evidence.)
+- [ ] **V2 — `testme-on-pr.sh` reads the verdict correctly.** `POST=1` posts exactly **one**
+      `!testme`; `POST=0` polls WITHOUT re-triggering; the helper returns `VERDICT=GREEN` on a passing
+      PR, `VERDICT=RED` on a failing PR, `VERDICT=PENDING` while in flight, and a `BUILD=<url>` that
+      points at the Drone build. (Drive a known-green and a known-red PR head.)
+- [ ] **V3 — `/recipe-upgrade <sandbox-recipe>` (DEFAULT mode) end-to-end GREEN.** On a recipe with a
+      real available upgrade: it reconciles the mirror, opens a recipe PR with the bump, `!testme`
+      runs and **results appear in the PR**, GREEN → `RESULT: SUCCESS`. **Nothing merged.**
+- [ ] **V4 — 3-iteration regression loop.** Seed a real upgrade regression (e.g. a deliberately bad
+      image tag) → `!testme` RED → fix on the recipe branch → re-`!testme` → GREEN, all within **≤3
+      `!testme` runs**, each recorded in the PR. With a persistent regression, after 3 runs it leaves
+      the PR open and reports `RESULT: FAILED — upgrade not green after 3 !testme runs` (does not
+      weaken anything to force green).
+- [ ] **V5 — stale-test DEFAULT = comment, no test edit.** With a genuinely stale test for the new
+      version, default mode (no `--with-tests`) leaves an **explanatory comment** on the recipe PR
+      (upgrade looks correct; which test is stale + why; "re-run `--with-tests`"), modifies **no
+      test**, and reports `RESULT: SUCCESS-PENDING-TESTS`.
+- [ ] **V6 — `--with-tests` opens + verifies a cc-ci test PR.** The same stale-test case under
+      `--with-tests` opens a cc-ci test-update PR (dedicated branch, separate clone — never `main`,
+      never the loops' clones), **verifies the recipe upgrade WITH the test change applied** via the
+      branch-checkout harness run (`verify-pr.sh`, since `!testme` uses prod tests), pairs the two PRs
+      with cross-notes, and reports `RESULT: SUCCESS+TESTPR`. Nothing merged.
+- [ ] **V7 — mirror reconciliation.** After a run, the `recipe-maintainers/<recipe>` mirror `main` is
+      **identical to upstream main**; an open PR whose change is already in upstream main is
+      **auto-closed** (merged-upstream); a superseded open PR is **closed and replaced** by the new
+      one. (Seed each case and confirm.)
+- [ ] **V8 — `/upgrade-all` DEFAULT run.** `--dry-run` lists candidates correctly; a live run over a
+      small set (1–2 sandbox recipes) opens recipe PRs **using default mode (never `--with-tests`)**,
+      surfaces "tests look stale" PRs in its own summary section, runs **sequentially with teardown**
+      between recipes, and the summary leads with the PR list. Confirms the weekly cron never
+      auto-edits tests.
+- [ ] **V9 — cleanup.** Every PR opened during verification (recipe + any cc-ci test PR) is **closed**
+      and any sandbox deploy is **torn down**; the box is left clean.
+
+## 2. Method / notes
+- **Sandbox first.** Prefer `custom-html-tiny` / a throwaway recipe for V3–V8 so real recipe mirrors
+  aren't spammed. For the seeded-regression (V4) and stale-test (V5/V6) cases, construct the seed on a
+  sandbox recipe / a scratch branch, not a real upstream recipe.
+- **Real path throughout** — real `!testme` → bridge → Drone → harness → results-in-PR; real `abra`.
+- The skills are at `/srv/cc-ci/.claude/skills/{recipe-upgrade,upgrade-all,ci-test-review}/`; read
+  their SKILL.md as the spec for expected behavior.
+- If Phase 3's PR-result surface differs from what `testme-on-pr.sh` polls (commit status vs comment),
+  **reconcile them here** — either adjust the helper to read the real surface, or file a CI finding —
+  and record the resolution in `DECISIONS.md`. (This is the most likely real gap to find.)
+
+## 3. Guardrails
+- **NEVER merge** any PR; **never weaken a test** to make a run pass.
+- **Close verification PRs + tear down deploys** when done (V9) — leave no junk.
+- **Single-writer:** cc-ci test PRs on dedicated branches + separate clones; never push `main` or
+  touch the loops' working clones; shared Swarm is stateful — serialize + tear down.
+- **Bounded:** this phase VERIFIES the flow; it does not redesign the skills. A real defect → fix the
+  skill/helper (small) or file it; a bigger idea → `IDEAS.md`.
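
For readers tracing how the new `5|…` entry in `PHASES_SPEC` is consumed, here is a minimal sketch. It assumes only the `id|planfile|statusbasename` entry format and the `IFS=';' read` line visible in the `launch.sh` hunk. The shortened two-entry spec and the `describe_phase` helper are illustrative and not part of `launch.sh`; how the watchdog actually iterates the array is not shown in this commit.

```shell
#!/usr/bin/env bash
# Illustrative, shortened spec (assumption): launch.sh carries the full sequence.
PHASES_SPEC="4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md;5|plan-phase5-verify-upgrade-flow.md|STATUS-5.md"

# Same split as in launch.sh: one array element per semicolon-separated entry.
IFS=';' read -r -a PHASES <<< "$PHASES_SPEC"

# describe_phase is a hypothetical helper: it splits one entry on '|'
# into its id, plan file, and status basename, then prints them.
describe_phase() {
  local id plan status
  IFS='|' read -r id plan status <<< "$1"
  printf 'phase=%s plan=%s status=%s\n' "$id" "$plan" "$status"
}

for entry in "${PHASES[@]}"; do
  describe_phase "$entry"
done

# The last entry is the phase after which the watchdog is documented to stop.
last_idx=$(( ${#PHASES[@]} - 1 ))
IFS='|' read -r LAST_PHASE_ID _ <<< "${PHASES[$last_idx]}"
echo "last phase: $LAST_PHASE_ID"
```

With this commit's default spec, the same split would yield `5` as the last phase id, which is why appending the entry is enough to make Phase 5 the final manual gate.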