plan: Phase 3r — /ci-test-review Claude skill (on-demand AI review + recipe-vs-CI PR diagnosis)
Deterministic CI stays the primary, AI-free path. Adds a separate on-demand skill (ships in the cc-ci repo .claude/skills/ci-test-review/) that runs the full suite across all recipes and, per failure, AI-diagnoses + classifies: recipe PR (+ proposed change) vs CI-server PR vs stale-test; or 'all passed, recipes+tests up to date' (incl. a latest-version freshness check). Proposes, never auto-merges (operator-merge rule). Slotted 3 -> 3r -> 4. AI only diagnoses; execution stays deterministic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@ -56,7 +56,7 @@ WATCH_ORCHESTRATOR="${WATCH_ORCHESTRATOR:-1}"
|
||||
# Ordered phase sequence: each entry "id|planfile|statusbasename". The watchdog runs them in order,
|
||||
# auto-transitions on the phase's "## DONE" (in BUILDER_DIR/<statusbasename>), and STOPS after the
|
||||
# last one (manual gate). Override PHASES_SPEC (semicolon-separated) to change the sequence.
|
||||
PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md;2pc|plan-phase2pc-image-cache.md|STATUS-2pc.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}"
|
||||
PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md;2pc|plan-phase2pc-image-cache.md|STATUS-2pc.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;3r|plan-phase3r-ci-test-review-skill.md|STATUS-3r.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}"
|
||||
IFS=';' read -r -a PHASES <<< "$PHASES_SPEC"
|
||||
PHASE_IDX_FILE="${PHASE_IDX_FILE:-$LOG_DIR/.phase-idx}"
|
||||
# --------------------------------------------------------------------------
|
||||
|
||||
83
cc-ci-plan/plan-phase3r-ci-test-review-skill.md
Normal file
83
cc-ci-plan/plan-phase3r-ci-test-review-skill.md
Normal file
@ -0,0 +1,83 @@
|
||||
# cc-ci Phase 3r — `/ci-test-review` Claude skill (on-demand AI review + diagnosis)
|
||||
|
||||
**Status:** QUEUED — after Phase 3 (results UX), before Phase 4 (final review). A **Claude skill that
|
||||
ships in the cc-ci product repo** (`.claude/skills/ci-test-review/`), built by the loops.
|
||||
**Transition:** auto (in the launcher sequence). **Owner:** Builder + Adversary loops.
|
||||
**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase3r-ci-test-review-skill.md`
|
||||
**Phase order:** … 3 → **3r** → 4.
|
||||
|
||||
---
|
||||
|
||||
## 0. Why — two distinct modes
|
||||
|
||||
cc-ci's **primary** use is **deterministic testing with NO AI**: `!testme` / the nightly sweep run the
|
||||
harness, recipes pass/fail, done. That stays the bread-and-butter and is unaffected by this phase.
|
||||
|
||||
This phase adds a **separate, on-demand, AI-driven REVIEW layer** — the `/ci-test-review` skill — that
|
||||
a maintainer (or a fresh Claude) invokes to: run the full suite across **all** recipes, and **for any
|
||||
failure, diagnose the root cause and classify it** — does it need a **PR to the recipe** (and what
|
||||
that PR is), a **PR to the CI server** (harness/test/infra), or is the **test out of date**? — or, if
|
||||
everything's green and current, report **"all passed, recipes + tests up to date."** It codifies the
|
||||
recipe-PR-vs-CI-PR judgment the Adversary already exercises (e.g. lasuite-drive→recipe PR, immich
|
||||
pg_dump→recipe PR) into a runnable tool.
|
||||
|
||||
**Crisp boundary:** the test *execution* is deterministic (the existing harness); **AI is used ONLY
|
||||
for diagnosis/classification/PR-proposal**, never to decide pass/fail.
|
||||
|
||||
## 1. Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-3r.md`)
|
||||
|
||||
- [ ] **R1 — Skill exists + invocable.** `.claude/skills/ci-test-review/SKILL.md` (+ helper script)
|
||||
in the cc-ci repo, invocable as `/ci-test-review`. Reuses the existing harness + recipe
|
||||
enrollment list — does NOT reinvent the runner.
|
||||
- [ ] **R2 — Deterministic execution.** Runs the real harness (`run_recipe_ci.py`) across **all
|
||||
enrolled recipes**, full suite per recipe, collecting per-recipe / per-tier pass/fail/skip.
|
||||
No AI in execution. (Decide cold vs `--quick`: default **cold** for an authoritative review;
|
||||
`--quick` as a fast-mode option.)
|
||||
- [ ] **R3 — "Up to date" freshness.** Tests each recipe at its **latest published version** (not a
|
||||
stale pin) and flags any recipe whose tested version lags upstream, and any test that's stale
|
||||
vs the recipe — so "all passed" means "passes against current upstream," not against an old pin.
|
||||
- [ ] **R4 — All-green output.** If every recipe passes AND is current, emit a clear **"ALL PASSED —
|
||||
recipes up to date, tests up to date"** summary (with versions tested).
|
||||
- [ ] **R5 — Failure diagnosis + classification.** For each failure, AI determines root cause and
|
||||
classifies it as exactly one of:
|
||||
- **RECIPE bug** → needs a **recipe PR**; output the concrete proposed change (e.g. "add
|
||||
collabora WOPI healthcheck + start_period", "add pg_dump backup hook").
|
||||
- **CI-SERVER bug** → needs a **cc-ci PR** (harness/test/infra); output the proposed change.
|
||||
- **TEST out-of-date** → the recipe legitimately changed; the cc-ci test needs updating; output
|
||||
the update.
|
||||
- **FLAKY / infra** → distinguish from a real failure (e.g. re-run) so a flake isn't
|
||||
misclassified as a recipe/CI bug.
|
||||
- [ ] **R6 — Structured report; PROPOSE not auto-merge.** Output a structured report (per recipe
|
||||
status; per failure: root cause + classification + proposed PR target + change sketch), or the
|
||||
all-green summary. It **proposes** PRs — it does **NOT** auto-create or merge them (the operator
|
||||
merges recipe PRs only after cc-ci verifies them green, per the recipe-PR rule). It MAY offer to
|
||||
hand a proposed recipe PR to the `recipe-create-pr` skill as an explicit, opt-in follow-up.
|
||||
- [ ] **R7 — Documented as distinct from deterministic CI.** `docs/` makes clear the default
|
||||
`!testme`/nightly path is deterministic (no AI); `/ci-test-review` is the on-demand AI review
|
||||
layer. A maintainer who wants zero AI never invokes it.
|
||||
- [ ] **R8 — Adversary-verified behavior.** Proven: reports ALL-GREEN correctly when green; on a
|
||||
**seeded recipe bug** classifies → recipe-PR with a sane proposed change; on a **seeded harness
|
||||
bug** classifies → CI-PR; the freshness check flags a deliberately stale-pinned recipe; a flake
|
||||
isn't misreported as a real bug.
|
||||
|
||||
## 2. Method / notes
|
||||
- **Lean on the harness + the Adversary's existing diagnostic methodology.** The classification
|
||||
(recipe vs CI vs test-stale) is exactly what the loops already do per finding — the skill
|
||||
productizes that reasoning, with the harness providing the deterministic evidence.
|
||||
- **Real abra path** throughout (the harness it drives already uses real abra).
|
||||
- The skill's value is turning "N recipes, here's the raw pass/fail" into "here's what's actually
|
||||
wrong and the specific PR that fixes it (recipe vs CI), or all-clear."
|
||||
|
||||
## 3. Guardrails
|
||||
- **Deterministic execution stays AI-free** — AI only diagnoses/classifies/proposes; it never
|
||||
decides pass/fail.
|
||||
- **Propose, don't auto-merge** — recipe PRs are operator-merged only after cc-ci verifies them
|
||||
(the standing recipe-PR rule).
|
||||
- **Don't weaken any test** to make a review "pass."
|
||||
- **Bounded** — one review skill; not a general agent. It runs the suite, diagnoses, reports.
|
||||
|
||||
## 4. Open decisions (log in machine-docs/DECISIONS.md)
|
||||
- Cold vs `--quick` default for the review run (lean cold = authoritative; offer `--quick` fast mode).
|
||||
- Report format (markdown summary + per-recipe table + per-failure diagnosis block).
|
||||
- How "latest published version" + "test staleness" are determined per recipe.
|
||||
- Whether/how the skill optionally invokes `recipe-create-pr` for a proposed recipe PR (opt-in).
|
||||
Reference in New Issue
Block a user