recipe-upgrade: !testme-on-PR verification + make test PRs opt-in (--with-tests)

Per operator: - Verify via `!testme` posted ON the recipe PR (the real CI path) so results are viewable in the PR; iterate up to 3 !testme runs (fix a real regression + re-test). New helper testme-on-pr.sh posts !testme and polls the PR head commit status for the verdict (POST=0 to keep polling without re-triggering). - Test updates are now OPT-IN via `--with-tests`. DEFAULT: recipe PR only using existing tests; if a test fails and is genuinely stale, leave an explanatory COMMENT on the PR (upgrade looks correct; re-run --with-tests to update tests) and do NOT touch any test. --with-tests keeps the verified cc-ci test-update PR path (verified via the branch-checkout harness run, since !testme uses prod tests). - upgrade-all (weekly cron) calls the DEFAULT — never auto-edits tests unattended; surfaces "tests look stale" PRs in the summary for the operator to opt in per-recipe. - New RESULT: SUCCESS-PENDING-TESTS for the recipe-green-but-test-stale default case. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 20:18:59 +01:00
parent c7da03fa6c
commit 4a1da1dd60
3 changed files with 134 additions and 38 deletions
--- a/.claude/skills/recipe-upgrade/SKILL.md
+++ b/.claude/skills/recipe-upgrade/SKILL.md
@ -1,6 +1,6 @@
 ---
 name: recipe-upgrade
-description: Upgrade ONE Co-op Cloud recipe end-to-end and verify it on the cc-ci CI server. Researches available upstream upgrades, plans them (breaking changes, migrations, config), implements the bump (image tags + recipe version label + config), then VERIFIES the change green on cc-ci (full suite, cold, against the PR head) and opens a recipe PR. If the upgrade is correct but a cc-ci TEST is now stale, it also updates the test, verifies that, and opens a second PR to the cc-ci repo. NEVER merges — leaves verified, ready-to-merge PRs. The per-recipe worker behind /upgrade-all. Invoke as /recipe-upgrade <recipe>.
+description: Upgrade ONE Co-op Cloud recipe end-to-end and verify it on the cc-ci CI server. Researches available upstream upgrades, plans them (breaking changes, migrations, config), implements the bump (image tags + recipe version label + config), opens a recipe PR, and verifies it by posting `!testme` on the PR (real CI; results visible in the PR; iterates up to 3×). DEFAULT: recipe PR only, using existing tests — if a test fails because it is genuinely stale, it leaves an explanatory COMMENT on the PR for the operator (does NOT touch tests). With `--with-tests`: also opens + verifies a PR to update the stale cc-ci test. NEVER merges. The per-recipe worker behind /upgrade-all. Invoke as /recipe-upgrade <recipe> [--with-tests].
 ---

 # recipe-upgrade
@ -18,6 +18,15 @@ harness, and the bot token to push mirror branches). The Gitea PR API call uses
 cc-ci harness decides pass/fail). **Create + verify, NEVER merge** (operator merges). **Never weaken a
 test** to make a PR green.

+## Arguments
+- `<recipe>` — the recipe to upgrade (required).
+- `--with-tests` (optional) — **opt in to updating cc-ci tests.** Without it (the DEFAULT, and what the
+  weekly `/upgrade-all` cron uses), this skill **only ever opens a recipe PR using the existing tests**;
+  if a test fails and is diagnosed as genuinely stale, it **leaves a comment** on the recipe PR
+  explaining why and stops — it does **not** modify any test. With `--with-tests`, a stale-test failure
+  additionally gets a **verified cc-ci test-update PR** (step 5b). A real upgrade regression is fixed +
+  re-tested in **both** modes.
+
 ## Procedure

 ### 1. Plan the upgrade (research — follow recipe-upgrade-plan methodology)
@ -73,32 +82,51 @@ real current upstream main). Capture the `PR_URL`. Optionally export `RECIPE_PR_
 table + the planned operator-action notes). Re-running with the same target version updates the
 existing same-branch PR rather than duplicating it.

-### 4. VERIFY the upgrade on the CI server (the gate)
-Run the cc-ci full suite **cold** against the PR head — the dogfood gate from `ci-test-review`:
+### 4. VERIFY by running `!testme` ON THE PR (results visible in the PR; iterate ≤3×)
+Trigger the **real** CI on the recipe PR by posting `!testme` — the bridge runs the harness on cc-ci
+and posts the run + its results back to the PR, so the operator sees them in the PR. Use the PR index
+from step 3's `PR_URL` (`.../pulls/<N>`):
 ```
-RECIPE=<recipe> REF=upgrade-<new-version> /srv/cc-ci/.claude/skills/ci-test-review/verify-pr.sh
+/srv/cc-ci/.claude/skills/recipe-upgrade/testme-on-pr.sh <recipe> <pr-index>     # POST=1: posts !testme + polls
+# if it prints VERDICT=PENDING, poll again WITHOUT re-triggering:
+POST=0 /srv/cc-ci/.claude/skills/recipe-upgrade/testme-on-pr.sh <recipe> <pr-index>
 ```
-(`SRC` defaults to `recipe-maintainers/<recipe>`.) **GREEN** ⇔ harness exits 0 → the recipe PR is
-verified, ready for operator merge. **General bar = one cold green**; use `REPEAT=3` only for a recipe
-already known to be FLAKY (e.g. lasuite-drive). Record the run summary in the report.
+- **GREEN** → the recipe PR is verified on cc-ci, results in the PR, ready for operator merge. Done.
+- **RED** → go to step 5. You get **up to 3 `!testme` runs total** on this PR (initial + 2 retries):
+  use them to fix and re-test a real upgrade regression (step 5a). Each retry is a fresh `!testme`
+  after you push a fix, so every attempt's result is recorded in the PR.

-### 5. If verification is RED → diagnose (recipe vs cc-ci TEST), per ci-test-review
-Read the harness log and classify the failure:
+### 5. If `!testme` is RED → diagnose from the run (recipe regression vs stale cc-ci TEST)
+Read the failing run (linked from the PR / the harness log on cc-ci) and classify:

- **The UPGRADE itself is broken** (real regression: bad tag, missing migration, config gap) → fix it
-  on the recipe branch (bounded — ≤3 iterations), re-push, re-verify. The recipe PR is only "working"
-  once cc-ci is green. If still red after the budget, leave the PR open and report
-  `FAILED — upgrade not green`.
- **The upgrade is correct but a cc-ci TEST is now stale/wrong** for the new version (asserts old
-  behavior, an overlay needs updating, a readiness gate changed) → this is a **CI-server fix**. Make
-  it the `ci-test-review` way:
+**5a. The UPGRADE itself is broken** (real regression: bad tag, missing migration, config gap) — fix
+it on the recipe branch, push, and re-run `!testme` (within the 3-run budget). The recipe PR is only
+"working" once `!testme` is GREEN. If still red after 3 runs, leave the PR open (its red `!testme`
+results stand as the evidence) and report `FAILED — upgrade not green after 3 !testme runs`.
+
+**5b. The upgrade is correct but a cc-ci TEST is genuinely stale/wrong** for the new version (asserts
+old behavior, an overlay needs updating, a readiness gate changed). `!testme` can't go green without a
+test change, and a test change is **gated by `--with-tests`**:
+
+- **DEFAULT (no `--with-tests`) — comment, don't touch tests.** Post a **comment on the recipe PR**
+  that: states the upgrade itself looks correct; identifies the specific test that failed and **why it
+  appears stale** (what the new version changed vs what the test asserts); and tells the operator they
+  can **re-run `/recipe-upgrade <recipe> --with-tests`** to also open + verify a cc-ci test-update PR.
+  Do **NOT** modify any test. Report `SUCCESS-PENDING-TESTS` (recipe PR open; `!testme` red on a
+  stale test; operator to decide).
+- **`--with-tests` — open + verify a cc-ci test PR.** Make it the `ci-test-review` way:
  1. Branch `recipe-maintainers/cc-ci` in a **separate clone** (single-writer: never push `main`,
-     never touch the build loops' `/cc-ci` `/cc-ci-adv` clones), update the test/overlay.
-  2. **Verify** the recipe upgrade **with the updated test applied**: check the cc-ci branch out on
-     cc-ci, rebuild if needed, re-run `verify-pr.sh` for the recipe + a small regression sample.
-  3. Open the **cc-ci test PR** via the Gitea API (`/srv/cc-ci/.testenv` creds). Capture its URL.
-  4. The recipe PR + the cc-ci test PR are a **pair** — note in each that it depends on the other.
- **FLAKY** (passes on a re-run) → note it; don't author a fix for a flake.
+     never touch the build loops' `/cc-ci` `/cc-ci-adv` clones); update the test/overlay.
+  2. **Verify the recipe upgrade WITH the updated test applied.** `!testme` on the recipe PR uses the
+     *deployed/main* cc-ci tests, so it can't see the unmerged test change — verify this combo by
+     running the harness with the cc-ci branch checked out on cc-ci:
+     `RECIPE=<recipe> REF=upgrade-<new-version> /srv/cc-ci/.claude/skills/ci-test-review/verify-pr.sh`
+     (after checking out the cc-ci test branch on cc-ci + rebuilding if needed), plus a small
+     regression sample. Green ⇔ the upgrade passes under the corrected test.
+  3. Open the **cc-ci test PR** via the Gitea API. Note in BOTH PRs that they are a dependent pair
+     (the recipe PR's `!testme` goes green only once the test PR is merged). Report `SUCCESS+TESTPR`.
+
+**FLAKY** (passes on a re-`!testme`) → note it; don't author a fix for a flake.

 Never weaken a test to turn a red upgrade green.

@ -106,21 +134,26 @@ Never weaken a test to turn a red upgrade green.
 Write the report to `/srv/cc-ci/.cc-ci-logs/upgrades/<recipe>-upgrade-<YYYY-MM-DD>.md` and print, as
 the **last line**, one of these exact prefixes (so `/upgrade-all` can collect it):

- `RESULT: SUCCESS — <recipe> <old> → <new>, cc-ci GREEN, recipe PR: <url>`
- `RESULT: SUCCESS+TESTPR — <recipe> <old> → <new>, cc-ci GREEN; recipe PR: <url>; cc-ci test PR: <url>`
- `RESULT: FAILED — <recipe> at <step>: <one-line reason>` (e.g. upgrade not green after 3 tries)
+- `RESULT: SUCCESS — <recipe> <old> → <new>, !testme GREEN, recipe PR: <url>`
+- `RESULT: SUCCESS-PENDING-TESTS — <recipe> <old> → <new>, recipe PR: <url>; !testme RED on a stale test (commented; re-run --with-tests to update tests)`
+- `RESULT: SUCCESS+TESTPR — <recipe> <old> → <new>; recipe PR: <url>; cc-ci test PR: <url> (verified with the test change; pair)`
+- `RESULT: FAILED — <recipe> at <step>: <one-line reason>` (e.g. upgrade not green after 3 !testme runs)
 - `RESULT: SKIPPED — <recipe>: <up-to-date | dirty-worktree | reason>`

 Always state explicitly that **nothing was merged** — the PR(s) await operator review.

 ## Guardrails
- **cc-ci is the gate** — deterministic harness decides pass/fail; AI plans, implements, diagnoses.
+- **cc-ci is the gate, via `!testme` on the PR** — the harness decides pass/fail and the results live
+  in the PR (operator-visible); AI plans, implements, diagnoses. Up to 3 `!testme` runs per PR.
 - **Create + verify, NEVER merge.** Recipe PR (and any cc-ci test PR) are operator-merged after review.
+- **Tests are opt-in.** Default = recipe PR only, existing tests; a stale-test failure gets an
+  explanatory PR **comment**, never a test edit. Only `--with-tests` opens a cc-ci test PR. The weekly
+  `/upgrade-all` cron always runs the **default** — it never auto-updates tests.
 - **Mirror reflects reality.** Each run force-syncs the `recipe-maintainers/<recipe>` mirror `main` to
  true upstream main, closes open PRs already merged upstream, and replaces any superseded open PR with
  the new one — so an open mirror PR always means "genuinely still open against current upstream main".
- **Prefer a recipe-only PR.** Only open a cc-ci test PR when the upgrade is correct but a test is
-  genuinely stale — not to paper over a real upgrade regression.
+- **Prefer a recipe-only PR.** Only open a cc-ci test PR (under `--with-tests`) when the upgrade is
+  correct but a test is genuinely stale — never to paper over a real upgrade regression.
 - **Never weaken a test**; **bounded** changes (the upgrade + minimal test update, not rewrites).
 - **Real abra path** throughout (`abra recipe upgrade` / `abra app deploy` via the harness).
 - **Single-writer / coordination:** dedicated branches + separate clones; never push `main` or disturb
--- a/.claude/skills/recipe-upgrade/testme-on-pr.sh
+++ b/.claude/skills/recipe-upgrade/testme-on-pr.sh
@ -0,0 +1,55 @@
+#!/usr/bin/env bash
+# recipe-upgrade :: trigger cc-ci on a recipe PR via `!testme` and read the verdict FROM the PR.
+# -------------------------------------------------------------------------
+# Runs on the ORCHESTRATOR (just hits the Gitea API with /srv/cc-ci/.testenv creds). Posting
+# `!testme` as a PR comment is the REAL CI path — the bridge runs the harness on cc-ci and the
+# run + its results are posted back to the PR, so the operator can see them in the PR itself.
+#
+#   testme-on-pr.sh <recipe> <pr-index>
+# env: POST=1     post a fresh `!testme` first (default). POST=0 = poll an in-flight run only,
+#                 WITHOUT re-triggering (use on follow-up calls so you don't launch duplicate runs).
+#      MAX_WAIT=480  INTERVAL=30   poll window per call, in seconds. 480s stays under the 10-min
+#                 tool cap — if it prints VERDICT=PENDING, just run again with POST=0 to keep polling.
+#
+# Prints: VERDICT=GREEN|RED|PENDING  and  BUILD=<drone-build-url>
+set -o errexit -o nounset -o pipefail
+
+RECIPE="${1:?usage: testme-on-pr.sh <recipe> <pr-index>}"
+PRIDX="${2:?usage: testme-on-pr.sh <recipe> <pr-index>}"
+TESTENV="${TESTENV:-/srv/cc-ci/.testenv}"
+set -a; . "$TESTENV"; set +a
+: "${GITEA_USERNAME:?}"; : "${GITEA_PASSWORD:?}"; : "${GITEA_URL:?}"
+NS="${GITEA_NAMESPACE:-recipe-maintainers}"
+API="https://${GITEA_URL}/api/v1"; AUTH=(-u "${GITEA_USERNAME}:${GITEA_PASSWORD}")
+POST="${POST:-1}"; MAX_WAIT="${MAX_WAIT:-480}"; INTERVAL="${INTERVAL:-30}"
+
+SHA="$(curl -s "${AUTH[@]}" "${API}/repos/${NS}/${RECIPE}/pulls/${PRIDX}" \
+       | python3 -c "import json,sys;print(json.load(sys.stdin)['head']['sha'])" 2>/dev/null || true)"
+[ -n "$SHA" ] || { echo "ERROR: could not resolve ${NS}/${RECIPE} PR #${PRIDX} head sha"; exit 1; }
+
+if [ "$POST" = "1" ]; then
+  curl -s -o /dev/null "${AUTH[@]}" -H "Content-Type: application/json" -X POST \
+    "${API}/repos/${NS}/${RECIPE}/issues/${PRIDX}/comments" -d '{"body":"!testme"}'
+  echo "→ posted !testme on ${NS}/${RECIPE}#${PRIDX} (head ${SHA:0:8}) — results will appear in the PR"
+  sleep 5
+fi
+
+# Poll the Drone-reported combined commit status on the PR head (terminal = success/failure/error).
+deadline=$(( $(date +%s) + MAX_WAIT ))
+url=""
+while [ "$(date +%s)" -lt "$deadline" ]; do
+  read -r state url < <(curl -s "${AUTH[@]}" "${API}/repos/${NS}/${RECIPE}/commits/${SHA}/status" 2>/dev/null \
+      | python3 -c "import json,sys
+try: d=json.load(sys.stdin)
+except Exception: d={}
+sts=d.get('statuses') or []
+print(d.get('state','') or 'none', (sts[0].get('target_url') if sts else '') or '')" 2>/dev/null || echo "none ")
+  case "$state" in
+    success)        echo "VERDICT=GREEN";   echo "BUILD=${url}"; exit 0;;
+    failure|error)  echo "VERDICT=RED";     echo "BUILD=${url}"; exit 2;;
+  esac
+  sleep "$INTERVAL"
+done
+echo "VERDICT=PENDING"; echo "BUILD=${url:-?}"
+echo "(run still in flight — re-run with POST=0 to keep polling without re-triggering)"
+exit 3
--- a/.claude/skills/upgrade-all/SKILL.md
+++ b/.claude/skills/upgrade-all/SKILL.md
@ -51,12 +51,18 @@ A table — Recipe | Status (will upgrade / skipped:reason) | Available upgrade(
 For each recipe in `RECIPES_TO_UPGRADE`, spawn an Agent (`subagent_type: "general-purpose"`,
 description `"Upgrade <recipe> on cc-ci"`) with a prompt like:

-> Run the `/recipe-upgrade <recipe>` skill end-to-end
-> (`/srv/cc-ci/.claude/skills/recipe-upgrade/SKILL.md`): plan, implement, **verify green on cc-ci**,
-> open a recipe PR, and — only if the upgrade is correct but a cc-ci test went stale — open a verified
-> cc-ci test PR too. Drive cc-ci over `ssh cc-ci`. Do NOT prompt. Do NOT push upstream. Do NOT merge.
-> Print exactly one `RESULT:` line as your final line (the SUCCESS / SUCCESS+TESTPR / FAILED / SKIPPED
-> forms from the recipe-upgrade skill).
+> Run the `/recipe-upgrade <recipe>` skill end-to-end — **DEFAULT mode, do NOT pass `--with-tests`**
+> (`/srv/cc-ci/.claude/skills/recipe-upgrade/SKILL.md`): plan, implement, open a recipe PR, and verify
+> it by posting `!testme` on the PR (results visible in the PR; iterate ≤3×). Use the **existing
+> tests** — if a test fails because it is genuinely stale, **leave an explanatory comment on the PR**
+> for the operator and do NOT modify any test. Drive cc-ci over `ssh cc-ci`. Do NOT prompt. Do NOT push
+> upstream. Do NOT merge. Print exactly one `RESULT:` line as your final line (SUCCESS /
+> SUCCESS-PENDING-TESTS / FAILED / SKIPPED — `SUCCESS+TESTPR` will not occur in default mode).
+
+**Why default (no `--with-tests`):** the weekly cron must never auto-edit cc-ci tests unattended — a
+test change deserves a human decision. So the cron opens recipe PRs only; where a recipe's existing
+test looks stale, the operator sees the explanation in the PR comment and can re-run that one recipe
+with `/recipe-upgrade <recipe> --with-tests` to also get a verified test-update PR.

 - **Sequential (default):** one Agent at a time; wait, collect its `RESULT:`, then the next. If the
  Agent tool call itself errors, record `FAILED — agent tool error` and continue. One failure never
@ -65,8 +71,8 @@ description `"Upgrade <recipe> on cc-ci"`) with a prompt like:
  the results). Failures are isolated per recipe.

 ## 4. Collect results
-Parse each final `RESULT:` line into SUCCESS / SUCCESS+TESTPR / FAILED / SKIPPED. A subagent that
-emitted no `RESULT:` line → `FAILED — no result emitted`.
+Parse each final `RESULT:` line into SUCCESS / SUCCESS-PENDING-TESTS / FAILED / SKIPPED (default mode
+won't emit `SUCCESS+TESTPR`). A subagent that emitted no `RESULT:` line → `FAILED — no result emitted`.

 ## 5. Write + print the summary
 Write `/srv/cc-ci/.cc-ci-logs/upgrades/upgrade-all-<YYYY-MM-DD>.md` and print it, **leading with the
@ -74,9 +80,11 @@ PR list** (the actionable output):
 ```markdown
 # cc-ci Weekly Upgrade Run — <YYYY-MM-DD>
 ## Summary
- Considered: N · Upgraded (PR opened): N · With cc-ci test PR: N · Failed: N · Skipped: N
+- Considered: N · PR green (!testme): N · PR open but tests look stale (commented): N · Failed: N · Skipped: N
 ## PRs to review (NOT merged)
- <recipe> <old> → <new> — recipe PR: <url>   [+ cc-ci test PR: <url>]
+- <recipe> <old> → <new> — recipe PR: <url> — !testme GREEN
+## PRs where a test looks stale (operator: re-run `--with-tests` to update tests)
+- <recipe> <old> → <new> — recipe PR: <url> — !testme RED on <test>; see PR comment
 ## Failed (investigate)
 - <recipe> — at <step>: <reason>   (log: .cc-ci-logs/upgrades/<recipe>-upgrade-<date>.md)
 ## Skipped