plan: Phase 3r — /ci-test-review Claude skill (on-demand AI review + recipe-vs-CI PR diagnosis)

Deterministic CI stays the primary, AI-free path. Adds a separate on-demand skill (ships in the
cc-ci repo .claude/skills/ci-test-review/) that runs the full suite across all recipes and, per
failure, AI-diagnoses + classifies: recipe PR (+ proposed change) vs CI-server PR vs stale-test;
or 'all passed, recipes+tests up to date' (incl. a latest-version freshness check). Proposes, never
auto-merges (operator-merge rule). Slotted 3 -> 3r -> 4. AI only diagnoses; execution stays
deterministic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-29 16:39:07 +01:00
parent 61ab3ecb3a
commit 5f84f8c028
2 changed files with 84 additions and 1 deletions

View File

@ -56,7 +56,7 @@ WATCH_ORCHESTRATOR="${WATCH_ORCHESTRATOR:-1}"
# Ordered phase sequence: each entry "id|planfile|statusbasename". The watchdog runs them in order,
# auto-transitions on the phase's "## DONE" (in BUILDER_DIR/<statusbasename>), and STOPS after the
# last one (manual gate). Override PHASES_SPEC (semicolon-separated) to change the sequence.
PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md;2pc|plan-phase2pc-image-cache.md|STATUS-2pc.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}"
PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md;2pc|plan-phase2pc-image-cache.md|STATUS-2pc.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;3r|plan-phase3r-ci-test-review-skill.md|STATUS-3r.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}"
IFS=';' read -r -a PHASES <<< "$PHASES_SPEC"
PHASE_IDX_FILE="${PHASE_IDX_FILE:-$LOG_DIR/.phase-idx}"
# --------------------------------------------------------------------------

View File

@ -0,0 +1,83 @@
# cc-ci Phase 3r — `/ci-test-review` Claude skill (on-demand AI review + diagnosis)
**Status:** QUEUED — after Phase 3 (results UX), before Phase 4 (final review). A **Claude skill that
ships in the cc-ci product repo** (`.claude/skills/ci-test-review/`), built by the loops.
**Transition:** auto (in the launcher sequence). **Owner:** Builder + Adversary loops.
**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase3r-ci-test-review-skill.md`
**Phase order:** … 3 → **3r** → 4.
---
## 0. Why — two distinct modes
cc-ci's **primary** use is **deterministic testing with NO AI**: `!testme` / the nightly sweep run the
harness, recipes pass/fail, done. That stays the bread-and-butter and is unaffected by this phase.
This phase adds a **separate, on-demand, AI-driven REVIEW layer** — the `/ci-test-review` skill — that
a maintainer (or a fresh Claude) invokes to: run the full suite across **all** recipes, and **for any
failure, diagnose the root cause and classify it** — does it need a **PR to the recipe** (and what
that PR is), a **PR to the CI server** (harness/test/infra), or is the **test out of date**? — or, if
everything's green and current, report **"all passed, recipes + tests up to date."** It codifies the
recipe-PR-vs-CI-PR judgment the Adversary already exercises (e.g. lasuite-drive→recipe PR, immich
pg_dump→recipe PR) into a runnable tool.
**Crisp boundary:** the test *execution* is deterministic (the existing harness); **AI is used ONLY
for diagnosis/classification/PR-proposal**, never to decide pass/fail.
## 1. Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-3r.md`)
- [ ] **R1 — Skill exists + invocable.** `.claude/skills/ci-test-review/SKILL.md` (+ helper script)
in the cc-ci repo, invocable as `/ci-test-review`. Reuses the existing harness + recipe
enrollment list — does NOT reinvent the runner.
- [ ] **R2 — Deterministic execution.** Runs the real harness (`run_recipe_ci.py`) across **all
enrolled recipes**, full suite per recipe, collecting per-recipe / per-tier pass/fail/skip.
No AI in execution. (Decide cold vs `--quick`: default **cold** for an authoritative review;
`--quick` as a fast-mode option.)
- [ ] **R3 — "Up to date" freshness.** Tests each recipe at its **latest published version** (not a
stale pin) and flags any recipe whose tested version lags upstream, and any test that's stale
vs the recipe — so "all passed" means "passes against current upstream," not against an old pin.
- [ ] **R4 — All-green output.** If every recipe passes AND is current, emit a clear **"ALL PASSED —
recipes up to date, tests up to date"** summary (with versions tested).
- [ ] **R5 — Failure diagnosis + classification.** For each failure, AI determines root cause and
classifies it as exactly one of:
- **RECIPE bug** → needs a **recipe PR**; output the concrete proposed change (e.g. "add
collabora WOPI healthcheck + start_period", "add pg_dump backup hook").
- **CI-SERVER bug** → needs a **cc-ci PR** (harness/test/infra); output the proposed change.
- **TEST out-of-date** → the recipe legitimately changed; the cc-ci test needs updating; output
the update.
- **FLAKY / infra** → distinguish from a real failure (e.g. re-run) so a flake isn't
misclassified as a recipe/CI bug.
- [ ] **R6 — Structured report; PROPOSE not auto-merge.** Output a structured report (per recipe
status; per failure: root cause + classification + proposed PR target + change sketch), or the
all-green summary. It **proposes** PRs — it does **NOT** auto-create or merge them (the operator
merges recipe PRs only after cc-ci verifies them green, per the recipe-PR rule). It MAY offer to
hand a proposed recipe PR to the `recipe-create-pr` skill as an explicit, opt-in follow-up.
- [ ] **R7 — Documented as distinct from deterministic CI.** `docs/` makes clear the default
`!testme`/nightly path is deterministic (no AI); `/ci-test-review` is the on-demand AI review
layer. A maintainer who wants zero AI never invokes it.
- [ ] **R8 — Adversary-verified behavior.** Proven: reports ALL-GREEN correctly when green; on a
**seeded recipe bug** classifies → recipe-PR with a sane proposed change; on a **seeded harness
bug** classifies → CI-PR; the freshness check flags a deliberately stale-pinned recipe; a flake
isn't misreported as a real bug.
## 2. Method / notes
- **Lean on the harness + the Adversary's existing diagnostic methodology.** The classification
(recipe vs CI vs test-stale) is exactly what the loops already do per finding — the skill
productizes that reasoning, with the harness providing the deterministic evidence.
- **Real abra path** throughout (the harness it drives already uses real abra).
- The skill's value is turning "N recipes, here's the raw pass/fail" into "here's what's actually
wrong and the specific PR that fixes it (recipe vs CI), or all-clear."
## 3. Guardrails
- **Deterministic execution stays AI-free** — AI only diagnoses/classifies/proposes; it never
decides pass/fail.
- **Propose, don't auto-merge** — recipe PRs are operator-merged only after cc-ci verifies them
(the standing recipe-PR rule).
- **Don't weaken any test** to make a review "pass."
- **Bounded** — one review skill; not a general agent. It runs the suite, diagnoses, reports.
## 4. Open decisions (log in machine-docs/DECISIONS.md)
- Cold vs `--quick` default for the review run (lean cold = authoritative; offer `--quick` fast mode).
- Report format (markdown summary + per-recipe table + per-failure diagnosis block).
- How "latest published version" + "test staleness" are determined per recipe.
- Whether/how the skill optionally invokes `recipe-create-pr` for a proposed recipe PR (opt-in).