Operator: when the final phase completes, install the weekly cron anchored to actual completion — first run ~1h after the build finishes, weekly from then on (supersedes the fixed "Sat 03:00 UTC" placeholder). - plan-phase5 §4: orchestrator computes T0=now+1h, installs a weekly job at T0's DOW+HH:MM running launch-upgrader.sh start; cron env needs claude on PATH + tmux + claude.ai login (mirror cc-ci-loops.service). VERIFY the first kickoff: cheap --dry-run pre-check, then confirm the real T0 fire launched the cc-ci-upgrader agent (status RUNNING, ran /upgrade-all, summary produced); record schedule + verified kickoff in DECISIONS.md. - upgrade-all skill Cron section + cron memory updated to the completion-anchored schedule + first-kickoff verification. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
117 lines
8.7 KiB
Markdown
117 lines
8.7 KiB
Markdown
# cc-ci Phase 5 — verify the `/recipe-upgrade` + `testme-on-pr.sh` end-to-end flow
|
||
|
||
**Status:** QUEUED — the **FINAL** phase, after Phase 4. Transition: auto (last in the launcher
|
||
sequence); the watchdog STOPS after it (build complete). **Owner:** Builder + Adversary loops
|
||
(Adversary cold-verifies); the orchestrator may also drive it.
|
||
**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase5-verify-upgrade-flow.md`
|
||
**Phase order:** … 3 → 4 → **5**.
|
||
|
||
---
|
||
|
||
## 0. Why this is LAST
|
||
|
||
The `/recipe-upgrade` and `/upgrade-all` skills live in the **orchestration repo**
|
||
(`/srv/cc-ci/.claude/skills/`) and drive cc-ci by posting **`!testme` on a recipe PR**
|
||
(`testme-on-pr.sh`) and reading the verdict back **from the PR**. That flow only works once cc-ci is
|
||
**fully built** — specifically it depends on **Phase 3 (results UX)**: the run's result must be posted
|
||
back to the PR (a Gitea commit status and/or result comment on the PR head) for `testme-on-pr.sh` to
|
||
read it. So this verification can only run after the CI server is complete. It proves the upgrade
|
||
tooling actually works end-to-end **before** the weekly `/upgrade-all` cron is activated
|
||
([[cc-ci-upgrade-all-cron]] — activate only after the build is done).
|
||
|
||
**Use a sandbox recipe for the live runs** (e.g. `custom-html-tiny` or a dedicated throwaway recipe),
|
||
and **close every PR opened during verification afterward** — do not pollute real recipe mirrors, and
|
||
**never merge** anything.
|
||
|
||
## 1. Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-5.md`)
|
||
|
||
- [ ] **V1 — `!testme` on a recipe PR is the real gate.** Posting `!testme` (as the bot, a
|
||
collaborator) on a recipe mirror PR triggers bridge → Drone → harness, runs the suite, and
|
||
**posts the run + its result back to the PR** (commit status on the PR head and/or a result
|
||
comment, per Phase 3). A `!testme` from a **non-collaborator is rejected**. A bad trigger
|
||
(`!testmexyz`) does NOT fire. (Re-uses the existing !testme D-gate evidence.)
|
||
- [ ] **V2 — `testme-on-pr.sh` reads the verdict correctly.** `POST=1` posts exactly **one**
|
||
`!testme`; `POST=0` polls WITHOUT re-triggering; the helper returns `VERDICT=GREEN` on a passing
|
||
PR, `VERDICT=RED` on a failing PR, `VERDICT=PENDING` while in flight, and a `BUILD=<url>` that
|
||
points at the Drone build. (Drive a known-green and a known-red PR head.)
|
||
- [ ] **V3 — `/recipe-upgrade <sandbox-recipe>` (DEFAULT mode) end-to-end GREEN.** On a recipe with a
|
||
real available upgrade: it reconciles the mirror, opens a recipe PR with the bump, `!testme`
|
||
runs and **results appear in the PR**, GREEN → `RESULT: SUCCESS`. **Nothing merged.**
|
||
- [ ] **V4 — 3-iteration regression loop.** Seed a real upgrade regression (e.g. a deliberately bad
|
||
image tag) → `!testme` RED → fix on the recipe branch → re-`!testme` → GREEN, all within **≤3
|
||
`!testme` runs**, each recorded in the PR. With a persistent regression, after 3 runs it leaves
|
||
the PR open and reports `RESULT: FAILED — upgrade not green after 3 !testme runs` (does not
|
||
weaken anything to force green).
|
||
- [ ] **V5 — stale-test DEFAULT = comment, no test edit.** With a genuinely stale test for the new
|
||
version, default mode (no `--with-tests`) leaves an **explanatory comment** on the recipe PR
|
||
(upgrade looks correct; which test is stale + why; "re-run `--with-tests`"), modifies **no
|
||
test**, and reports `RESULT: SUCCESS-PENDING-TESTS`.
|
||
- [ ] **V6 — `--with-tests` opens + verifies a cc-ci test PR.** The same stale-test case under
|
||
`--with-tests` opens a cc-ci test-update PR (dedicated branch, separate clone — never `main`,
|
||
never the loops' clones), **verifies the recipe upgrade WITH the test change applied** via the
|
||
branch-checkout harness run (`verify-pr.sh`, since `!testme` uses prod tests), pairs the two PRs
|
||
with cross-notes, and reports `RESULT: SUCCESS+TESTPR`. Nothing merged.
|
||
- [ ] **V7 — mirror reconciliation.** After a run, the `recipe-maintainers/<recipe>` mirror `main` is
|
||
**identical to upstream main**; an open PR whose change is already in upstream main is
|
||
**auto-closed** (merged-upstream); a superseded open PR is **closed and replaced** by the new
|
||
one. (Seed each case and confirm.)
|
||
- [ ] **V8 — `/upgrade-all` DEFAULT run.** `--dry-run` lists candidates correctly; a live run over a
|
||
small set (1–2 sandbox recipes) opens recipe PRs **using default mode (never `--with-tests`)**,
|
||
surfaces "tests look stale" PRs in its own summary section, runs **sequentially with teardown**
|
||
between recipes, and the summary leads with the PR list. Confirms the weekly cron never
|
||
auto-edits tests.
|
||
- [ ] **V8a — the `cc-ci-upgrader` agent.** `cc-ci-plan/launch-upgrader.sh start` spins up a
|
||
remote-control `cc-ci-upgrader` session (viewable at claude.ai/code) that runs `/upgrade-all`
|
||
(DEFAULT) **to completion, then STOPS and stays idle** (does NOT self-terminate) with the
|
||
summary visible. A second `start` while a run is **in flight (busy)** leaves it alone; a `start`
|
||
against a **finished/idle (or wedged)** session **kills it and runs fresh**. (Use
|
||
`UPGRADER_ARGS=--dry-run` for a cheap check, then a real small-set run.) This is exactly what the
|
||
weekly cron invokes — verify the cron-equivalent path end-to-end.
|
||
- [ ] **V9 — cleanup.** Every PR opened during verification (recipe + any cc-ci test PR) is **closed**
|
||
and any sandbox deploy is **torn down**; the verification `cc-ci-upgrader` session is stopped
|
||
(`launch-upgrader.sh stop`); the box is left clean.
|
||
|
||
## 2. Method / notes
|
||
- **Sandbox first.** Prefer `custom-html-tiny` / a throwaway recipe for V3–V8 so real recipe mirrors
|
||
aren't spammed. For the seeded-regression (V4) and stale-test (V5/V6) cases, construct the seed on a
|
||
sandbox recipe / a scratch branch, not a real upstream recipe.
|
||
- **Real path throughout** — real `!testme` → bridge → Drone → harness → results-in-PR; real `abra`.
|
||
- The skills are at `/srv/cc-ci/.claude/skills/{recipe-upgrade,upgrade-all,ci-test-review}/`; read
|
||
their SKILL.md as the spec for expected behavior.
|
||
- If Phase 3's PR-result surface differs from what `testme-on-pr.sh` polls (commit status vs comment),
|
||
**reconcile them here** — either adjust the helper to read the real surface, or file a CI finding —
|
||
and record the resolution in `DECISIONS.md`. (This is the most likely real gap to find.)
|
||
|
||
## 3. Guardrails
|
||
- **NEVER merge** any PR; **never weaken a test** to make a run pass.
|
||
- **Close verification PRs + tear down deploys** when done (V9) — leave no junk.
|
||
- **Single-writer:** cc-ci test PRs on dedicated branches + separate clones; never push `main` or
|
||
touch the loops' working clones; shared Swarm is stateful — serialize + tear down.
|
||
- **Bounded:** this phase VERIFIES the flow; it does not redesign the skills. A real defect → fix the
|
||
skill/helper (small) or file it; a bigger idea → `IDEAS.md`.
|
||
|
||
## 4. On phase completion — install the weekly cron (+1h) and verify it fires
|
||
|
||
Once V1–V9 pass (the upgrade flow + the `cc-ci-upgrader` agent are proven) AND this is the final phase
|
||
(the whole cc-ci build is complete), the **orchestrator** installs the recurring weekly upgrade cron.
|
||
This SUPERSEDES the earlier fixed "Sat 03:00 UTC" placeholder — the schedule is now anchored to actual
|
||
completion ([[cc-ci-upgrade-all-cron]]).
|
||
|
||
1. **Schedule = completion + 1 hour, weekly thereafter.** Compute `T0 = now + 1h`; install a **weekly**
|
||
job at T0's day-of-week + `HH:MM` (cron `MM HH * * DOW`) so the FIRST run fires ~1h after the build
|
||
finishes, then every 7 days at that same slot.
|
||
2. **What it runs:** `/srv/cc-ci/cc-ci-plan/launch-upgrader.sh start` (→ spins up the `cc-ci-upgrader`
|
||
agent → `/upgrade-all` DEFAULT → stops idle/viewable). The cron environment must have the `claude`
|
||
CLI on `PATH`, `tmux` access, and the box logged into the claude.ai account (remote-control) — a
|
||
bare cron env is minimal, so set `PATH`/`HOME` explicitly (mirror how `cc-ci-loops.service`
|
||
launches). A user crontab / systemd timer / Claude scheduled task are all acceptable mechanisms;
|
||
pick one that gives that environment.
|
||
3. **VERIFY the first kickoff (the point of this step).**
|
||
- Cheap pre-check first: `UPGRADER_ARGS=--dry-run cc-ci-plan/launch-upgrader.sh fresh` (or schedule
|
||
a one-off test fire a couple of minutes out) to confirm the launcher path works from the chosen
|
||
cron mechanism's environment.
|
||
- Then confirm the **real** first fire at T0 (~1h later) actually kicked it off:
|
||
`launch-upgrader.sh status` shows `cc-ci-upgrader RUNNING`, it ran `/upgrade-all`, and produced a
|
||
summary. If it did NOT fire/launch (PATH, login, mechanism), fix and re-verify.
|
||
- Record the installed schedule + the verified-kickoff result in `DECISIONS.md`.
|