plan: Phase 3r — /ci-test-review Claude skill (on-demand AI review + recipe-vs-CI PR diagnosis)

Deterministic CI stays the primary, AI-free path. Adds a separate on-demand skill (ships in the cc-ci repo .claude/skills/ci-test-review/) that runs the full suite across all recipes and, per failure, AI-diagnoses + classifies: recipe PR (+ proposed change) vs CI-server PR vs stale-test; or 'all passed, recipes+tests up to date' (incl. a latest-version freshness check). Proposes, never auto-merges (operator-merge rule). Slotted 3 -> 3r -> 4. AI only diagnoses; execution stays deterministic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 16:39:07 +01:00
parent 61ab3ecb3a
commit 5f84f8c028
2 changed files with 84 additions and 1 deletions
--- a/cc-ci-plan/launch.sh
+++ b/cc-ci-plan/launch.sh
@ -56,7 +56,7 @@ WATCH_ORCHESTRATOR="${WATCH_ORCHESTRATOR:-1}"
 # Ordered phase sequence: each entry "id|planfile|statusbasename". The watchdog runs them in order,
 # auto-transitions on the phase's "## DONE" (in BUILDER_DIR/<statusbasename>), and STOPS after the
 # last one (manual gate). Override PHASES_SPEC (semicolon-separated) to change the sequence.
-PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md;2pc|plan-phase2pc-image-cache.md|STATUS-2pc.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}"
+PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md;2pc|plan-phase2pc-image-cache.md|STATUS-2pc.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;3r|plan-phase3r-ci-test-review-skill.md|STATUS-3r.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}"
 IFS=';' read -r -a PHASES <<< "$PHASES_SPEC"
 PHASE_IDX_FILE="${PHASE_IDX_FILE:-$LOG_DIR/.phase-idx}"
 # --------------------------------------------------------------------------
--- a/cc-ci-plan/plan-phase3r-ci-test-review-skill.md
+++ b/cc-ci-plan/plan-phase3r-ci-test-review-skill.md
@ -0,0 +1,83 @@
+# cc-ci Phase 3r — `/ci-test-review` Claude skill (on-demand AI review + diagnosis)
+
+**Status:** QUEUED — after Phase 3 (results UX), before Phase 4 (final review). A **Claude skill that
+ships in the cc-ci product repo** (`.claude/skills/ci-test-review/`), built by the loops.
+**Transition:** auto (in the launcher sequence). **Owner:** Builder + Adversary loops.
+**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase3r-ci-test-review-skill.md`
+**Phase order:** … 3 → **3r** → 4.
+
+---
+
+## 0. Why — two distinct modes
+
+cc-ci's **primary** use is **deterministic testing with NO AI**: `!testme` / the nightly sweep run the
+harness, recipes pass/fail, done. That stays the bread-and-butter and is unaffected by this phase.
+
+This phase adds a **separate, on-demand, AI-driven REVIEW layer** — the `/ci-test-review` skill — that
+a maintainer (or a fresh Claude) invokes to: run the full suite across **all** recipes, and **for any
+failure, diagnose the root cause and classify it** — does it need a **PR to the recipe** (and what
+that PR is), a **PR to the CI server** (harness/test/infra), or is the **test out of date**? — or, if
+everything's green and current, report **"all passed, recipes + tests up to date."** It codifies the
+recipe-PR-vs-CI-PR judgment the Adversary already exercises (e.g. lasuite-drive→recipe PR, immich
+pg_dump→recipe PR) into a runnable tool.
+
+**Crisp boundary:** the test *execution* is deterministic (the existing harness); **AI is used ONLY
+for diagnosis/classification/PR-proposal**, never to decide pass/fail.
+
+## 1. Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-3r.md`)
+
+- [ ] **R1 — Skill exists + invocable.** `.claude/skills/ci-test-review/SKILL.md` (+ helper script)
+      in the cc-ci repo, invocable as `/ci-test-review`. Reuses the existing harness + recipe
+      enrollment list — does NOT reinvent the runner.
+- [ ] **R2 — Deterministic execution.** Runs the real harness (`run_recipe_ci.py`) across **all
+      enrolled recipes**, full suite per recipe, collecting per-recipe / per-tier pass/fail/skip.
+      No AI in execution. (Decide cold vs `--quick`: default **cold** for an authoritative review;
+      `--quick` as a fast-mode option.)
+- [ ] **R3 — "Up to date" freshness.** Tests each recipe at its **latest published version** (not a
+      stale pin) and flags any recipe whose tested version lags upstream, and any test that's stale
+      vs the recipe — so "all passed" means "passes against current upstream," not against an old pin.
+- [ ] **R4 — All-green output.** If every recipe passes AND is current, emit a clear **"ALL PASSED —
+      recipes up to date, tests up to date"** summary (with versions tested).
+- [ ] **R5 — Failure diagnosis + classification.** For each failure, AI determines root cause and
+      classifies it as exactly one of:
+      - **RECIPE bug** → needs a **recipe PR**; output the concrete proposed change (e.g. "add
+        collabora WOPI healthcheck + start_period", "add pg_dump backup hook").
+      - **CI-SERVER bug** → needs a **cc-ci PR** (harness/test/infra); output the proposed change.
+      - **TEST out-of-date** → the recipe legitimately changed; the cc-ci test needs updating; output
+        the update.
+      - **FLAKY / infra** → distinguish from a real failure (e.g. re-run) so a flake isn't
+        misclassified as a recipe/CI bug.
+- [ ] **R6 — Structured report; PROPOSE not auto-merge.** Output a structured report (per recipe
+      status; per failure: root cause + classification + proposed PR target + change sketch), or the
+      all-green summary. It **proposes** PRs — it does **NOT** auto-create or merge them (the operator
+      merges recipe PRs only after cc-ci verifies them green, per the recipe-PR rule). It MAY offer to
+      hand a proposed recipe PR to the `recipe-create-pr` skill as an explicit, opt-in follow-up.
+- [ ] **R7 — Documented as distinct from deterministic CI.** `docs/` makes clear the default
+      `!testme`/nightly path is deterministic (no AI); `/ci-test-review` is the on-demand AI review
+      layer. A maintainer who wants zero AI never invokes it.
+- [ ] **R8 — Adversary-verified behavior.** Proven: reports ALL-GREEN correctly when green; on a
+      **seeded recipe bug** classifies → recipe-PR with a sane proposed change; on a **seeded harness
+      bug** classifies → CI-PR; the freshness check flags a deliberately stale-pinned recipe; a flake
+      isn't misreported as a real bug.
+
+## 2. Method / notes
+- **Lean on the harness + the Adversary's existing diagnostic methodology.** The classification
+  (recipe vs CI vs test-stale) is exactly what the loops already do per finding — the skill
+  productizes that reasoning, with the harness providing the deterministic evidence.
+- **Real abra path** throughout (the harness it drives already uses real abra).
+- The skill's value is turning "N recipes, here's the raw pass/fail" into "here's what's actually
+  wrong and the specific PR that fixes it (recipe vs CI), or all-clear."
+
+## 3. Guardrails
+- **Deterministic execution stays AI-free** — AI only diagnoses/classifies/proposes; it never
+  decides pass/fail.
+- **Propose, don't auto-merge** — recipe PRs are operator-merged only after cc-ci verifies them
+  (the standing recipe-PR rule).
+- **Don't weaken any test** to make a review "pass."
+- **Bounded** — one review skill; not a general agent. It runs the suite, diagnoses, reports.
+
+## 4. Open decisions (log in machine-docs/DECISIONS.md)
+- Cold vs `--quick` default for the review run (lean cold = authoritative; offer `--quick` fast mode).
+- Report format (markdown summary + per-recipe table + per-failure diagnosis block).
+- How "latest published version" + "test staleness" are determined per recipe.
+- Whether/how the skill optionally invokes `recipe-create-pr` for a proposed recipe PR (opt-in).