diff --git a/cc-ci-plan/launch.sh b/cc-ci-plan/launch.sh index a066923..d3840a5 100755 --- a/cc-ci-plan/launch.sh +++ b/cc-ci-plan/launch.sh @@ -56,7 +56,7 @@ WATCH_ORCHESTRATOR="${WATCH_ORCHESTRATOR:-1}" # Ordered phase sequence: each entry "id|planfile|statusbasename". The watchdog runs them in order, # auto-transitions on the phase's "## DONE" (in BUILDER_DIR/), and STOPS after the # last one (manual gate). Override PHASES_SPEC (semicolon-separated) to change the sequence. -PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md;2pc|plan-phase2pc-image-cache.md|STATUS-2pc.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}" +PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md;2pc|plan-phase2pc-image-cache.md|STATUS-2pc.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;3r|plan-phase3r-ci-test-review-skill.md|STATUS-3r.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}" IFS=';' read -r -a PHASES <<< "$PHASES_SPEC" PHASE_IDX_FILE="${PHASE_IDX_FILE:-$LOG_DIR/.phase-idx}" # -------------------------------------------------------------------------- diff --git a/cc-ci-plan/plan-phase3r-ci-test-review-skill.md b/cc-ci-plan/plan-phase3r-ci-test-review-skill.md new file mode 100644 index 0000000..5399b30 --- /dev/null +++ b/cc-ci-plan/plan-phase3r-ci-test-review-skill.md @@ -0,0 +1,83 @@ +# cc-ci Phase 3r — `/ci-test-review` Claude skill (on-demand AI review + diagnosis) + +**Status:** QUEUED — after Phase 3 (results UX), before Phase 4 (final review). A **Claude skill that +ships in the cc-ci product repo** (`.claude/skills/ci-test-review/`), built by the loops. +**Transition:** auto (in the launcher sequence). **Owner:** Builder + Adversary loops. +**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase3r-ci-test-review-skill.md` +**Phase order:** … 3 → **3r** → 4. + +--- + +## 0. Why — two distinct modes + +cc-ci's **primary** use is **deterministic testing with NO AI**: `!testme` / the nightly sweep run the +harness, recipes pass/fail, done. That stays the bread-and-butter and is unaffected by this phase. + +This phase adds a **separate, on-demand, AI-driven REVIEW layer** — the `/ci-test-review` skill — that +a maintainer (or a fresh Claude) invokes to: run the full suite across **all** recipes, and **for any +failure, diagnose the root cause and classify it** — does it need a **PR to the recipe** (and what +that PR is), a **PR to the CI server** (harness/test/infra), or is the **test out of date**? — or, if +everything's green and current, report **"all passed, recipes + tests up to date."** It codifies the +recipe-PR-vs-CI-PR judgment the Adversary already exercises (e.g. lasuite-drive→recipe PR, immich +pg_dump→recipe PR) into a runnable tool. + +**Crisp boundary:** the test *execution* is deterministic (the existing harness); **AI is used ONLY +for diagnosis/classification/PR-proposal**, never to decide pass/fail. + +## 1. Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-3r.md`) + +- [ ] **R1 — Skill exists + invocable.** `.claude/skills/ci-test-review/SKILL.md` (+ helper script) + in the cc-ci repo, invocable as `/ci-test-review`. Reuses the existing harness + recipe + enrollment list — does NOT reinvent the runner. +- [ ] **R2 — Deterministic execution.** Runs the real harness (`run_recipe_ci.py`) across **all + enrolled recipes**, full suite per recipe, collecting per-recipe / per-tier pass/fail/skip. + No AI in execution. (Decide cold vs `--quick`: default **cold** for an authoritative review; + `--quick` as a fast-mode option.) +- [ ] **R3 — "Up to date" freshness.** Tests each recipe at its **latest published version** (not a + stale pin) and flags any recipe whose tested version lags upstream, and any test that's stale + vs the recipe — so "all passed" means "passes against current upstream," not against an old pin. +- [ ] **R4 — All-green output.** If every recipe passes AND is current, emit a clear **"ALL PASSED — + recipes up to date, tests up to date"** summary (with versions tested). +- [ ] **R5 — Failure diagnosis + classification.** For each failure, AI determines root cause and + classifies it as exactly one of: + - **RECIPE bug** → needs a **recipe PR**; output the concrete proposed change (e.g. "add + collabora WOPI healthcheck + start_period", "add pg_dump backup hook"). + - **CI-SERVER bug** → needs a **cc-ci PR** (harness/test/infra); output the proposed change. + - **TEST out-of-date** → the recipe legitimately changed; the cc-ci test needs updating; output + the update. + - **FLAKY / infra** → distinguish from a real failure (e.g. re-run) so a flake isn't + misclassified as a recipe/CI bug. +- [ ] **R6 — Structured report; PROPOSE not auto-merge.** Output a structured report (per recipe + status; per failure: root cause + classification + proposed PR target + change sketch), or the + all-green summary. It **proposes** PRs — it does **NOT** auto-create or merge them (the operator + merges recipe PRs only after cc-ci verifies them green, per the recipe-PR rule). It MAY offer to + hand a proposed recipe PR to the `recipe-create-pr` skill as an explicit, opt-in follow-up. +- [ ] **R7 — Documented as distinct from deterministic CI.** `docs/` makes clear the default + `!testme`/nightly path is deterministic (no AI); `/ci-test-review` is the on-demand AI review + layer. A maintainer who wants zero AI never invokes it. +- [ ] **R8 — Adversary-verified behavior.** Proven: reports ALL-GREEN correctly when green; on a + **seeded recipe bug** classifies → recipe-PR with a sane proposed change; on a **seeded harness + bug** classifies → CI-PR; the freshness check flags a deliberately stale-pinned recipe; a flake + isn't misreported as a real bug. + +## 2. Method / notes +- **Lean on the harness + the Adversary's existing diagnostic methodology.** The classification + (recipe vs CI vs test-stale) is exactly what the loops already do per finding — the skill + productizes that reasoning, with the harness providing the deterministic evidence. +- **Real abra path** throughout (the harness it drives already uses real abra). +- The skill's value is turning "N recipes, here's the raw pass/fail" into "here's what's actually + wrong and the specific PR that fixes it (recipe vs CI), or all-clear." + +## 3. Guardrails +- **Deterministic execution stays AI-free** — AI only diagnoses/classifies/proposes; it never + decides pass/fail. +- **Propose, don't auto-merge** — recipe PRs are operator-merged only after cc-ci verifies them + (the standing recipe-PR rule). +- **Don't weaken any test** to make a review "pass." +- **Bounded** — one review skill; not a general agent. It runs the suite, diagnoses, reports. + +## 4. Open decisions (log in machine-docs/DECISIONS.md) +- Cold vs `--quick` default for the review run (lean cold = authoritative; offer `--quick` fast mode). +- Report format (markdown summary + per-recipe table + per-failure diagnosis block). +- How "latest published version" + "test staleness" are determined per recipe. +- Whether/how the skill optionally invokes `recipe-create-pr` for a proposed recipe PR (opt-in).