The weekly upgrade run now executes inside a dedicated, remote-control agent (cc-ci-upgrader) — viewable/steerable at claude.ai/code like the Builder — rather than buried in headless cron output. - launch-upgrader.sh: spins up the cc-ci-upgrader tmux session under --remote-control with a kickoff that runs /upgrade-all (DEFAULT mode) to completion. On finish the agent STOPS and stays idle (does NOT self-terminate) so the run + summary stay reviewable in the web UI. `start` = use-or-create: leaves an in-flight (busy) run alone, else clears a finished/idle/wedged session and runs fresh; `fresh` always restarts. UPGRADER_ARGS passes flags (e.g. --dry-run); never --with-tests. - launch.sh: orchestrator_alive() now also skips the cc-ci-upgrader remote-control name, so the upgrader job isn't mistaken for the orchestrator. - upgrade-all skill: documents it runs as the cc-ci-upgrader agent; the weekly cron invokes `launch-upgrader.sh start` (not /upgrade-all inline). - Phase 5: V8a verifies the agent lifecycle (launch → run to completion → stay idle/viewable → next start clears it); V9 stops the verification session. - cron memory: weekly task = launch-upgrader.sh start at 0 3 * * 6 UTC. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6.9 KiB
cc-ci Phase 5 — verify the /recipe-upgrade + testme-on-pr.sh end-to-end flow
Status: QUEUED — the FINAL phase, after Phase 4. Transition: auto (last in the launcher
sequence); the watchdog STOPS after it (build complete). Owner: Builder + Adversary loops
(Adversary cold-verifies); the orchestrator may also drive it.
This file: /srv/cc-ci/cc-ci-plan/plan-phase5-verify-upgrade-flow.md
Phase order: … 3 → 4 → 5.
0. Why this is LAST
The /recipe-upgrade and /upgrade-all skills live in the orchestration repo
(/srv/cc-ci/.claude/skills/) and drive cc-ci by posting !testme on a recipe PR
(testme-on-pr.sh) and reading the verdict back from the PR. That flow only works once cc-ci is
fully built — specifically it depends on Phase 3 (results UX): the run's result must be posted
back to the PR (a Gitea commit status and/or result comment on the PR head) for testme-on-pr.sh to
read it. So this verification can only run after the CI server is complete. It proves the upgrade
tooling actually works end-to-end before the weekly /upgrade-all cron is activated
(cc-ci-upgrade-all-cron — activate only after the build is done).
Use a sandbox recipe for the live runs (e.g. custom-html-tiny or a dedicated throwaway recipe),
and close every PR opened during verification afterward — do not pollute real recipe mirrors, and
never merge anything.
1. Definition of Done (Adversary cold-verifies → machine-docs/REVIEW-5.md)
- V1 —
!testmeon a recipe PR is the real gate. Posting!testme(as the bot, a collaborator) on a recipe mirror PR triggers bridge → Drone → harness, runs the suite, and posts the run + its result back to the PR (commit status on the PR head and/or a result comment, per Phase 3). A!testmefrom a non-collaborator is rejected. A bad trigger (!testmexyz) does NOT fire. (Re-uses the existing !testme D-gate evidence.) - V2 —
testme-on-pr.shreads the verdict correctly.POST=1posts exactly one!testme;POST=0polls WITHOUT re-triggering; the helper returnsVERDICT=GREENon a passing PR,VERDICT=REDon a failing PR,VERDICT=PENDINGwhile in flight, and aBUILD=<url>that points at the Drone build. (Drive a known-green and a known-red PR head.) - V3 —
/recipe-upgrade <sandbox-recipe>(DEFAULT mode) end-to-end GREEN. On a recipe with a real available upgrade: it reconciles the mirror, opens a recipe PR with the bump,!testmeruns and results appear in the PR, GREEN →RESULT: SUCCESS. Nothing merged. - V4 — 3-iteration regression loop. Seed a real upgrade regression (e.g. a deliberately bad
image tag) →
!testmeRED → fix on the recipe branch → re-!testme→ GREEN, all within ≤3!testmeruns, each recorded in the PR. With a persistent regression, after 3 runs it leaves the PR open and reportsRESULT: FAILED — upgrade not green after 3 !testme runs(does not weaken anything to force green). - V5 — stale-test DEFAULT = comment, no test edit. With a genuinely stale test for the new
version, default mode (no
--with-tests) leaves an explanatory comment on the recipe PR (upgrade looks correct; which test is stale + why; "re-run--with-tests"), modifies no test, and reportsRESULT: SUCCESS-PENDING-TESTS. - V6 —
--with-testsopens + verifies a cc-ci test PR. The same stale-test case under--with-testsopens a cc-ci test-update PR (dedicated branch, separate clone — nevermain, never the loops' clones), verifies the recipe upgrade WITH the test change applied via the branch-checkout harness run (verify-pr.sh, since!testmeuses prod tests), pairs the two PRs with cross-notes, and reportsRESULT: SUCCESS+TESTPR. Nothing merged. - V7 — mirror reconciliation. After a run, the
recipe-maintainers/<recipe>mirrormainis identical to upstream main; an open PR whose change is already in upstream main is auto-closed (merged-upstream); a superseded open PR is closed and replaced by the new one. (Seed each case and confirm.) - V8 —
/upgrade-allDEFAULT run.--dry-runlists candidates correctly; a live run over a small set (1–2 sandbox recipes) opens recipe PRs using default mode (never--with-tests), surfaces "tests look stale" PRs in its own summary section, runs sequentially with teardown between recipes, and the summary leads with the PR list. Confirms the weekly cron never auto-edits tests. - V8a — the
cc-ci-upgraderagent.cc-ci-plan/launch-upgrader.sh startspins up a remote-controlcc-ci-upgradersession (viewable at claude.ai/code) that runs/upgrade-all(DEFAULT) to completion, then STOPS and stays idle (does NOT self-terminate) with the summary visible. A secondstartwhile a run is in flight (busy) leaves it alone; astartagainst a finished/idle (or wedged) session kills it and runs fresh. (UseUPGRADER_ARGS=--dry-runfor a cheap check, then a real small-set run.) This is exactly what the weekly cron invokes — verify the cron-equivalent path end-to-end. - V9 — cleanup. Every PR opened during verification (recipe + any cc-ci test PR) is closed
and any sandbox deploy is torn down; the verification
cc-ci-upgradersession is stopped (launch-upgrader.sh stop); the box is left clean.
2. Method / notes
- Sandbox first. Prefer
custom-html-tiny/ a throwaway recipe for V3–V8 so real recipe mirrors aren't spammed. For the seeded-regression (V4) and stale-test (V5/V6) cases, construct the seed on a sandbox recipe / a scratch branch, not a real upstream recipe. - Real path throughout — real
!testme→ bridge → Drone → harness → results-in-PR; realabra. - The skills are at
/srv/cc-ci/.claude/skills/{recipe-upgrade,upgrade-all,ci-test-review}/; read their SKILL.md as the spec for expected behavior. - If Phase 3's PR-result surface differs from what
testme-on-pr.shpolls (commit status vs comment), reconcile them here — either adjust the helper to read the real surface, or file a CI finding — and record the resolution inDECISIONS.md. (This is the most likely real gap to find.)
3. Guardrails
- NEVER merge any PR; never weaken a test to make a run pass.
- Close verification PRs + tear down deploys when done (V9) — leave no junk.
- Single-writer: cc-ci test PRs on dedicated branches + separate clones; never push
mainor touch the loops' working clones; shared Swarm is stateful — serialize + tear down. - Bounded: this phase VERIFIES the flow; it does not redesign the skills. A real defect → fix the
skill/helper (small) or file it; a bigger idea →
IDEAS.md.