recipe-upgrade: !testme-on-PR verification + make test PRs opt-in (--with-tests)

Per operator:
- Verify via `!testme` posted ON the recipe PR (the real CI path) so results are
  viewable in the PR; iterate up to 3 !testme runs (fix a real regression + re-test).
  New helper testme-on-pr.sh posts !testme and polls the PR head commit status
  for the verdict (POST=0 to keep polling without re-triggering).
- Test updates are now OPT-IN via `--with-tests`. DEFAULT: recipe PR only using
  existing tests; if a test fails and is genuinely stale, leave an explanatory
  COMMENT on the PR (upgrade looks correct; re-run --with-tests to update tests)
  and do NOT touch any test. --with-tests keeps the verified cc-ci test-update PR
  path (verified via the branch-checkout harness run, since !testme uses prod tests).
- upgrade-all (weekly cron) calls the DEFAULT — never auto-edits tests unattended;
  surfaces "tests look stale" PRs in the summary for the operator to opt in per-recipe.
- New RESULT: SUCCESS-PENDING-TESTS for the recipe-green-but-test-stale default case.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 20:18:59 +01:00
parent c7da03fa6c
commit 4a1da1dd60
3 changed files with 134 additions and 38 deletions


@@ -1,6 +1,6 @@
--- ---
name: recipe-upgrade name: recipe-upgrade
description: Upgrade ONE Co-op Cloud recipe end-to-end and verify it on the cc-ci CI server. Researches available upstream upgrades, plans them (breaking changes, migrations, config), implements the bump (image tags + recipe version label + config), then VERIFIES the change green on cc-ci (full suite, cold, against the PR head) and opens a recipe PR. If the upgrade is correct but a cc-ci TEST is now stale, it also updates the test, verifies that, and opens a second PR to the cc-ci repo. NEVER merges — leaves verified, ready-to-merge PRs. The per-recipe worker behind /upgrade-all. Invoke as /recipe-upgrade <recipe>. description: Upgrade ONE Co-op Cloud recipe end-to-end and verify it on the cc-ci CI server. Researches available upstream upgrades, plans them (breaking changes, migrations, config), implements the bump (image tags + recipe version label + config), opens a recipe PR, and verifies it by posting `!testme` on the PR (real CI; results visible in the PR; iterates up to 3×). DEFAULT: recipe PR only, using existing tests — if a test fails because it is genuinely stale, it leaves an explanatory COMMENT on the PR for the operator (does NOT touch tests). With `--with-tests`: also opens + verifies a PR to update the stale cc-ci test. NEVER merges. The per-recipe worker behind /upgrade-all. Invoke as /recipe-upgrade <recipe> [--with-tests].
--- ---
# recipe-upgrade # recipe-upgrade
@@ -18,6 +18,15 @@ harness, and the bot token to push mirror branches). The Gitea PR API call uses
cc-ci harness decides pass/fail). **Create + verify, NEVER merge** (operator merges). **Never weaken a cc-ci harness decides pass/fail). **Create + verify, NEVER merge** (operator merges). **Never weaken a
test** to make a PR green. test** to make a PR green.
## Arguments
- `<recipe>` — the recipe to upgrade (required).
- `--with-tests` (optional) — **opt in to updating cc-ci tests.** Without it (the DEFAULT, and what the
weekly `/upgrade-all` cron uses), this skill **only ever opens a recipe PR using the existing tests**;
if a test fails and is diagnosed as genuinely stale, it **leaves a comment** on the recipe PR
explaining why and stops — it does **not** modify any test. With `--with-tests`, a stale-test failure
additionally gets a **verified cc-ci test-update PR** (step 5b). A real upgrade regression is fixed +
re-tested in **both** modes.
## Procedure ## Procedure
### 1. Plan the upgrade (research — follow recipe-upgrade-plan methodology) ### 1. Plan the upgrade (research — follow recipe-upgrade-plan methodology)
@@ -73,32 +82,51 @@ real current upstream main). Capture the `PR_URL`. Optionally export `RECIPE_PR_
table + the planned operator-action notes). Re-running with the same target version updates the table + the planned operator-action notes). Re-running with the same target version updates the
existing same-branch PR rather than duplicating it. existing same-branch PR rather than duplicating it.
### 4. VERIFY the upgrade on the CI server (the gate) ### 4. VERIFY by running `!testme` ON THE PR (results visible in the PR; iterate ≤3×)
Run the cc-ci full suite **cold** against the PR head — the dogfood gate from `ci-test-review`: Trigger the **real** CI on the recipe PR by posting `!testme` — the bridge runs the harness on cc-ci
and posts the run + its results back to the PR, so the operator sees them in the PR. Use the PR index
from step 3's `PR_URL` (`.../pulls/<N>`):
``` ```
RECIPE=<recipe> REF=upgrade-<new-version> /srv/cc-ci/.claude/skills/ci-test-review/verify-pr.sh /srv/cc-ci/.claude/skills/recipe-upgrade/testme-on-pr.sh <recipe> <pr-index> # POST=1: posts !testme + polls
# if it prints VERDICT=PENDING, poll again WITHOUT re-triggering:
POST=0 /srv/cc-ci/.claude/skills/recipe-upgrade/testme-on-pr.sh <recipe> <pr-index>
``` ```
(`SRC` defaults to `recipe-maintainers/<recipe>`.) **GREEN** ⇔ harness exits 0 → the recipe PR is - **GREEN** → the recipe PR is verified on cc-ci, results in the PR, ready for operator merge. Done.
verified, ready for operator merge. **General bar = one cold green**; use `REPEAT=3` only for a recipe - **RED** → go to step 5. You get **up to 3 `!testme` runs total** on this PR (initial + 2 retries):
already known to be FLAKY (e.g. lasuite-drive). Record the run summary in the report. use them to fix and re-test a real upgrade regression (step 5a). Each retry is a fresh `!testme`
after you push a fix, so every attempt's result is recorded in the PR.
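The trigger-then-poll pattern in step 4 can be sketched as a small wrapper. This is only an illustration, assuming the exit codes of `testme-on-pr.sh` (0 = GREEN, 2 = RED, 3 = PENDING); the names `TESTME_CMD`, `MAX_POLLS`, `pr_index_from_url`, and `run_testme_once` are hypothetical and not part of the skill.

```shell
#!/usr/bin/env bash
# Hypothetical knob: path to the helper (overridable, e.g. for dry runs).
TESTME_CMD="${TESTME_CMD:-/srv/cc-ci/.claude/skills/recipe-upgrade/testme-on-pr.sh}"

# The PR index is the last path segment of PR_URL (".../pulls/<N>").
pr_index_from_url() {
  local url="${1%/}"          # tolerate a trailing slash
  printf '%s\n' "${url##*/}"  # strip everything up to the last "/"
}

# Post !testme exactly once, then keep polling with POST=0 while the helper
# reports PENDING (exit 3). MAX_POLLS is a hypothetical cap on follow-up polls.
# Returns the helper's last exit code: 0 = GREEN, 2 = RED, 3 = still PENDING.
run_testme_once() {
  local recipe="$1" pr_index="$2" max_polls="${MAX_POLLS:-6}" rc=0 polls=0
  POST=1 "$TESTME_CMD" "$recipe" "$pr_index" || rc=$?
  while [ "$rc" -eq 3 ] && [ "$polls" -lt "$max_polls" ]; do
    polls=$((polls + 1))
    rc=0
    POST=0 "$TESTME_CMD" "$recipe" "$pr_index" || rc=$?
  done
  return "$rc"
}
```

A caller would run something like `run_testme_once <recipe> "$(pr_index_from_url "$PR_URL")"` and branch on the return code. Each call of the wrapper counts as one `!testme` run against the 3-run budget, because only the first invocation posts the comment.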
### 5. If verification is RED → diagnose (recipe vs cc-ci TEST), per ci-test-review ### 5. If `!testme` is RED → diagnose from the run (recipe regression vs stale cc-ci TEST)
Read the harness log and classify the failure: Read the failing run (linked from the PR / the harness log on cc-ci) and classify:
- **The UPGRADE itself is broken** (real regression: bad tag, missing migration, config gap) → fix it **5a. The UPGRADE itself is broken** (real regression: bad tag, missing migration, config gap) → fix
on the recipe branch (bounded — ≤3 iterations), re-push, re-verify. The recipe PR is only "working" it on the recipe branch, push, and re-run `!testme` (within the 3-run budget). The recipe PR is only
once cc-ci is green. If still red after the budget, leave the PR open and report "working" once `!testme` is GREEN. If still red after 3 runs, leave the PR open (its red `!testme`
`FAILED — upgrade not green`. results stand as the evidence) and report `FAILED — upgrade not green after 3 !testme runs`.
- **The upgrade is correct but a cc-ci TEST is now stale/wrong** for the new version (asserts old
behavior, an overlay needs updating, a readiness gate changed) → this is a **CI-server fix**. Make **5b. The upgrade is correct but a cc-ci TEST is genuinely stale/wrong** for the new version (asserts
it the `ci-test-review` way: old behavior, an overlay needs updating, a readiness gate changed). `!testme` can't go green without a
test change, and a test change is **gated by `--with-tests`**:
- **DEFAULT (no `--with-tests`) — comment, don't touch tests.** Post a **comment on the recipe PR**
that: states the upgrade itself looks correct; identifies the specific test that failed and **why it
appears stale** (what the new version changed vs what the test asserts); and tells the operator they
can **re-run `/recipe-upgrade <recipe> --with-tests`** to also open + verify a cc-ci test-update PR.
Do **NOT** modify any test. Report `SUCCESS-PENDING-TESTS` (recipe PR open; `!testme` red on a
stale test; operator to decide).
- **`--with-tests` — open + verify a cc-ci test PR.** Make it the `ci-test-review` way:
1. Branch `recipe-maintainers/cc-ci` in a **separate clone** (single-writer: never push `main`, 1. Branch `recipe-maintainers/cc-ci` in a **separate clone** (single-writer: never push `main`,
never touch the build loops' `/cc-ci` `/cc-ci-adv` clones), update the test/overlay. never touch the build loops' `/cc-ci` `/cc-ci-adv` clones); update the test/overlay.
2. **Verify** the recipe upgrade **with the updated test applied**: check the cc-ci branch out on 2. **Verify the recipe upgrade WITH the updated test applied.** `!testme` on the recipe PR uses the
cc-ci, rebuild if needed, re-run `verify-pr.sh` for the recipe + a small regression sample. *deployed/main* cc-ci tests, so it can't see the unmerged test change — verify this combo by
3. Open the **cc-ci test PR** via the Gitea API (`/srv/cc-ci/.testenv` creds). Capture its URL. running the harness with the cc-ci branch checked out on cc-ci:
4. The recipe PR + the cc-ci test PR are a **pair** — note in each that it depends on the other. `RECIPE=<recipe> REF=upgrade-<new-version> /srv/cc-ci/.claude/skills/ci-test-review/verify-pr.sh`
- **FLAKY** (passes on a re-run) → note it; don't author a fix for a flake. (after checking out the cc-ci test branch on cc-ci + rebuilding if needed), plus a small
regression sample. Green ⇔ the upgrade passes under the corrected test.
3. Open the **cc-ci test PR** via the Gitea API. Note in BOTH PRs that they are a dependent pair
(the recipe PR's `!testme` goes green only once the test PR is merged). Report `SUCCESS+TESTPR`.
**FLAKY** (passes on a re-`!testme`) → note it; don't author a fix for a flake.
Never weaken a test to turn a red upgrade green. Never weaken a test to turn a red upgrade green.
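For the default-mode comment in step 5b, a minimal sketch of posting it through the same Gitea endpoint that `testme-on-pr.sh` uses for `!testme`. The function names `comment_payload` and `post_pr_comment` are hypothetical. The sketch assumes the `GITEA_URL`, `GITEA_USERNAME`, and `GITEA_PASSWORD` variables from `/srv/cc-ci/.testenv` are already loaded. The wording of the comment is up to the skill.

```shell
#!/usr/bin/env bash
# Build the JSON payload for a PR comment. python3 does the escaping, so
# quotes and newlines in the explanation cannot break the request body.
comment_payload() {
  python3 -c 'import json,sys; print(json.dumps({"body": sys.argv[1]}))' "$1"
}

# Post a comment on the recipe PR (hypothetical helper; same endpoint the
# testme-on-pr.sh script posts !testme to). -f makes curl fail on HTTP errors.
post_pr_comment() {
  local recipe="$1" pr_index="$2" body="$3"
  local ns="${GITEA_NAMESPACE:-recipe-maintainers}"
  curl -sf -o /dev/null -u "${GITEA_USERNAME}:${GITEA_PASSWORD}" \
    -H "Content-Type: application/json" -X POST \
    "https://${GITEA_URL}/api/v1/repos/${ns}/${recipe}/issues/${pr_index}/comments" \
    -d "$(comment_payload "$body")"
}
```

The body passed in should cover the three points listed above: the upgrade looks correct, which test failed and why it appears stale, and the `--with-tests` re-run hint.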
@@ -106,21 +134,26 @@ Never weaken a test to turn a red upgrade green.
Write the report to `/srv/cc-ci/.cc-ci-logs/upgrades/<recipe>-upgrade-<YYYY-MM-DD>.md` and print, as Write the report to `/srv/cc-ci/.cc-ci-logs/upgrades/<recipe>-upgrade-<YYYY-MM-DD>.md` and print, as
the **last line**, one of these exact prefixes (so `/upgrade-all` can collect it): the **last line**, one of these exact prefixes (so `/upgrade-all` can collect it):
- `RESULT: SUCCESS — <recipe> <old>→<new>, cc-ci GREEN, recipe PR: <url>` - `RESULT: SUCCESS — <recipe> <old>→<new>, !testme GREEN, recipe PR: <url>`
- `RESULT: SUCCESS+TESTPR — <recipe> <old>→<new>, cc-ci GREEN; recipe PR: <url>; cc-ci test PR: <url>` - `RESULT: SUCCESS-PENDING-TESTS — <recipe> <old>→<new>, recipe PR: <url>; !testme RED on a stale test (commented; re-run --with-tests to update tests)`
- `RESULT: FAILED — <recipe> at <step>: <one-line reason>` (e.g. upgrade not green after 3 tries) - `RESULT: SUCCESS+TESTPR — <recipe> <old>→<new>; recipe PR: <url>; cc-ci test PR: <url> (verified with the test change; pair)`
- `RESULT: FAILED — <recipe> at <step>: <one-line reason>` (e.g. upgrade not green after 3 !testme runs)
- `RESULT: SKIPPED — <recipe>: <up-to-date | dirty-worktree | reason>` - `RESULT: SKIPPED — <recipe>: <up-to-date | dirty-worktree | reason>`
Always state explicitly that **nothing was merged** — the PR(s) await operator review. Always state explicitly that **nothing was merged** — the PR(s) await operator review.
## Guardrails ## Guardrails
- **cc-ci is the gate** — deterministic harness decides pass/fail; AI plans, implements, diagnoses. - **cc-ci is the gate, via `!testme` on the PR** — the harness decides pass/fail and the results live
in the PR (operator-visible); AI plans, implements, diagnoses. Up to 3 `!testme` runs per PR.
- **Create + verify, NEVER merge.** Recipe PR (and any cc-ci test PR) are operator-merged after review. - **Create + verify, NEVER merge.** Recipe PR (and any cc-ci test PR) are operator-merged after review.
- **Tests are opt-in.** Default = recipe PR only, existing tests; a stale-test failure gets an
explanatory PR **comment**, never a test edit. Only `--with-tests` opens a cc-ci test PR. The weekly
`/upgrade-all` cron always runs the **default** — it never auto-updates tests.
- **Mirror reflects reality.** Each run force-syncs the `recipe-maintainers/<recipe>` mirror `main` to - **Mirror reflects reality.** Each run force-syncs the `recipe-maintainers/<recipe>` mirror `main` to
true upstream main, closes open PRs already merged upstream, and replaces any superseded open PR with true upstream main, closes open PRs already merged upstream, and replaces any superseded open PR with
the new one — so an open mirror PR always means "genuinely still open against current upstream main". the new one — so an open mirror PR always means "genuinely still open against current upstream main".
- **Prefer a recipe-only PR.** Only open a cc-ci test PR when the upgrade is correct but a test is - **Prefer a recipe-only PR.** Only open a cc-ci test PR (under `--with-tests`) when the upgrade is
genuinely stale — not to paper over a real upgrade regression. correct but a test is genuinely stale — never to paper over a real upgrade regression.
- **Never weaken a test**; **bounded** changes (the upgrade + minimal test update, not rewrites). - **Never weaken a test**; **bounded** changes (the upgrade + minimal test update, not rewrites).
- **Real abra path** throughout (`abra recipe upgrade` / `abra app deploy` via the harness). - **Real abra path** throughout (`abra recipe upgrade` / `abra app deploy` via the harness).
- **Single-writer / coordination:** dedicated branches + separate clones; never push `main` or disturb - **Single-writer / coordination:** dedicated branches + separate clones; never push `main` or disturb


@@ -0,0 +1,55 @@
#!/usr/bin/env bash
# recipe-upgrade :: trigger cc-ci on a recipe PR via `!testme` and read the verdict FROM the PR.
# -------------------------------------------------------------------------
# Runs on the ORCHESTRATOR (just hits the Gitea API with /srv/cc-ci/.testenv creds). Posting
# `!testme` as a PR comment is the REAL CI path — the bridge runs the harness on cc-ci and the
# run + its results are posted back to the PR, so the operator can see them in the PR itself.
#
# testme-on-pr.sh <recipe> <pr-index>
# env: POST=1 post a fresh `!testme` first (default). POST=0 = poll an in-flight run only,
# WITHOUT re-triggering (use on follow-up calls so you don't launch duplicate runs).
# MAX_WAIT=480 INTERVAL=30 poll window per call, in seconds. 480s stays under the 10-min
# tool cap — if it prints VERDICT=PENDING, just run again with POST=0 to keep polling.
#
# Prints: VERDICT=GREEN|RED|PENDING and BUILD=<drone-build-url>
set -o errexit -o nounset -o pipefail
RECIPE="${1:?usage: testme-on-pr.sh <recipe> <pr-index>}"
PRIDX="${2:?usage: testme-on-pr.sh <recipe> <pr-index>}"
TESTENV="${TESTENV:-/srv/cc-ci/.testenv}"
set -a; . "$TESTENV"; set +a
: "${GITEA_USERNAME:?}"; : "${GITEA_PASSWORD:?}"; : "${GITEA_URL:?}"
NS="${GITEA_NAMESPACE:-recipe-maintainers}"
API="https://${GITEA_URL}/api/v1"; AUTH=(-u "${GITEA_USERNAME}:${GITEA_PASSWORD}")
POST="${POST:-1}"; MAX_WAIT="${MAX_WAIT:-480}"; INTERVAL="${INTERVAL:-30}"
SHA="$(curl -s "${AUTH[@]}" "${API}/repos/${NS}/${RECIPE}/pulls/${PRIDX}" \
  | python3 -c "import json,sys;print(json.load(sys.stdin)['head']['sha'])" 2>/dev/null || true)"
[ -n "$SHA" ] || { echo "ERROR: could not resolve ${NS}/${RECIPE} PR #${PRIDX} head sha"; exit 1; }
if [ "$POST" = "1" ]; then
  curl -s -o /dev/null "${AUTH[@]}" -H "Content-Type: application/json" -X POST \
    "${API}/repos/${NS}/${RECIPE}/issues/${PRIDX}/comments" -d '{"body":"!testme"}'
  echo "→ posted !testme on ${NS}/${RECIPE}#${PRIDX} (head ${SHA:0:8}) — results will appear in the PR"
  sleep 5
fi
# Poll the Drone-reported combined commit status on the PR head (terminal = success/failure/error).
deadline=$(( $(date +%s) + MAX_WAIT ))
url=""
while [ "$(date +%s)" -lt "$deadline" ]; do
  read -r state url < <(curl -s "${AUTH[@]}" "${API}/repos/${NS}/${RECIPE}/commits/${SHA}/status" 2>/dev/null \
    | python3 -c "import json,sys
try: d=json.load(sys.stdin)
except Exception: d={}
sts=d.get('statuses') or []
print(d.get('state','') or 'none', (sts[0].get('target_url') if sts else '') or '')" 2>/dev/null || echo "none ")
  case "$state" in
    success) echo "VERDICT=GREEN"; echo "BUILD=${url}"; exit 0;;
    failure|error) echo "VERDICT=RED"; echo "BUILD=${url}"; exit 2;;
  esac
  sleep "$INTERVAL"
done
echo "VERDICT=PENDING"; echo "BUILD=${url:-?}"
echo "(run still in flight — re-run with POST=0 to keep polling without re-triggering)"
exit 3


@@ -51,12 +51,18 @@ A table — Recipe | Status (will upgrade / skipped:reason) | Available upgrade(
For each recipe in `RECIPES_TO_UPGRADE`, spawn an Agent (`subagent_type: "general-purpose"`, For each recipe in `RECIPES_TO_UPGRADE`, spawn an Agent (`subagent_type: "general-purpose"`,
description `"Upgrade <recipe> on cc-ci"`) with a prompt like: description `"Upgrade <recipe> on cc-ci"`) with a prompt like:
> Run the `/recipe-upgrade <recipe>` skill end-to-end > Run the `/recipe-upgrade <recipe>` skill end-to-end — **DEFAULT mode, do NOT pass `--with-tests`**
> (`/srv/cc-ci/.claude/skills/recipe-upgrade/SKILL.md`): plan, implement, **verify green on cc-ci**, > (`/srv/cc-ci/.claude/skills/recipe-upgrade/SKILL.md`): plan, implement, open a recipe PR, and verify
> open a recipe PR, and — only if the upgrade is correct but a cc-ci test went stale — open a verified > it by posting `!testme` on the PR (results visible in the PR; iterate ≤3×). Use the **existing
> cc-ci test PR too. Drive cc-ci over `ssh cc-ci`. Do NOT prompt. Do NOT push upstream. Do NOT merge. > tests** — if a test fails because it is genuinely stale, **leave an explanatory comment on the PR**
> Print exactly one `RESULT:` line as your final line (the SUCCESS / SUCCESS+TESTPR / FAILED / SKIPPED > for the operator and do NOT modify any test. Drive cc-ci over `ssh cc-ci`. Do NOT prompt. Do NOT push
> forms from the recipe-upgrade skill). > upstream. Do NOT merge. Print exactly one `RESULT:` line as your final line (SUCCESS /
> SUCCESS-PENDING-TESTS / FAILED / SKIPPED — `SUCCESS+TESTPR` will not occur in default mode).
**Why default (no `--with-tests`):** the weekly cron must never auto-edit cc-ci tests unattended — a
test change deserves a human decision. So the cron opens recipe PRs only; where a recipe's existing
test looks stale, the operator sees the explanation in the PR comment and can re-run that one recipe
with `/recipe-upgrade <recipe> --with-tests` to also get a verified test-update PR.
- **Sequential (default):** one Agent at a time; wait, collect its `RESULT:`, then the next. If the - **Sequential (default):** one Agent at a time; wait, collect its `RESULT:`, then the next. If the
Agent tool call itself errors, record `FAILED — agent tool error` and continue. One failure never Agent tool call itself errors, record `FAILED — agent tool error` and continue. One failure never
@@ -65,8 +71,8 @@ description `"Upgrade <recipe> on cc-ci"`) with a prompt like:
the results). Failures are isolated per recipe. the results). Failures are isolated per recipe.
## 4. Collect results ## 4. Collect results
Parse each final `RESULT:` line into SUCCESS / SUCCESS+TESTPR / FAILED / SKIPPED. A subagent that Parse each final `RESULT:` line into SUCCESS / SUCCESS-PENDING-TESTS / FAILED / SKIPPED (default mode
emitted no `RESULT:` line → `FAILED — no result emitted`. won't emit `SUCCESS+TESTPR`). A subagent that emitted no `RESULT:` line → `FAILED — no result emitted`.
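The parsing rule above can be sketched as a prefix match on each subagent's final line. This is an illustrative sketch only; `classify_result` is a hypothetical name, and the longer prefixes are tested first so that `SUCCESS-PENDING-TESTS` and `SUCCESS+TESTPR` are not swallowed by the plain `SUCCESS` pattern.

```shell
#!/usr/bin/env bash
# Map a subagent's final output line to a status bucket (hypothetical helper).
# A line without a recognised RESULT: prefix is bucketed as FAILED, which the
# summary records as "no result emitted".
classify_result() {
  local last_line="$1"
  case "$last_line" in
    "RESULT: SUCCESS-PENDING-TESTS"*) echo "SUCCESS-PENDING-TESTS" ;;
    "RESULT: SUCCESS+TESTPR"*)        echo "SUCCESS+TESTPR" ;;
    "RESULT: SUCCESS"*)               echo "SUCCESS" ;;
    "RESULT: FAILED"*)                echo "FAILED" ;;
    "RESULT: SKIPPED"*)               echo "SKIPPED" ;;
    *)                                echo "FAILED" ;;
  esac
}
```

In default mode the `SUCCESS+TESTPR` branch should never fire; it is kept so the same function also works for manual `--with-tests` runs.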
## 5. Write + print the summary ## 5. Write + print the summary
Write `/srv/cc-ci/.cc-ci-logs/upgrades/upgrade-all-<YYYY-MM-DD>.md` and print it, **leading with the Write `/srv/cc-ci/.cc-ci-logs/upgrades/upgrade-all-<YYYY-MM-DD>.md` and print it, **leading with the
@@ -74,9 +80,11 @@ PR list** (the actionable output):
```markdown ```markdown
# cc-ci Weekly Upgrade Run — <YYYY-MM-DD> # cc-ci Weekly Upgrade Run — <YYYY-MM-DD>
## Summary ## Summary
- Considered: N · Upgraded (PR opened): N · With cc-ci test PR: N · Failed: N · Skipped: N - Considered: N · PR green (!testme): N · PR open but tests look stale (commented): N · Failed: N · Skipped: N
## PRs to review (NOT merged) ## PRs to review (NOT merged)
- <recipe> <old>→<new> — recipe PR: <url> [+ cc-ci test PR: <url>] - <recipe> <old>→<new> — recipe PR: <url> — !testme GREEN
## PRs where a test looks stale (operator: re-run `--with-tests` to update tests)
- <recipe> <old>→<new> — recipe PR: <url> — !testme RED on <test>; see PR comment
## Failed (investigate) ## Failed (investigate)
- <recipe> — at <step>: <reason> (log: .cc-ci-logs/upgrades/<recipe>-upgrade-<date>.md) - <recipe> — at <step>: <reason> (log: .cc-ci-logs/upgrades/<recipe>-upgrade-<date>.md)
## Skipped ## Skipped