orchestrator: add /ci-test-review skill (in THIS repo) + drop Phase 3r from loops queue

The on-demand AI review layer is now an orchestration-repo skill built directly by the orchestrator, NOT a loops phase in the cc-ci product repo: - .claude/skills/ci-test-review/{SKILL.md,run-all-recipes.sh}: runs the real cc-ci harness across all enrolled recipes (deterministic, AI-free execution), then AI diagnoses each failure and classifies it as needing a recipe PR / a CI-server PR / a stale-test update — or reports "ALL PASSED, recipes + tests up to date". Proposes PRs; never decides pass/fail; never auto-merges. - .gitignore: track .claude/skills/ (shareable) while still ignoring local claude session state (locks, history) under .claude/. - launch.sh: remove Phase 3r from PHASES_SPEC; loops sequence back to 1c 1b 1d 1e 2w 2pc 2 2b 3 4. Deleted plan-phase3r (superseded by the skill). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 16:57:26 +01:00
parent 5f84f8c028
commit 2530845e50
3 changed files with 167 additions and 83 deletions
--- a/.claude/skills/ci-test-review/SKILL.md
+++ b/.claude/skills/ci-test-review/SKILL.md
@ -0,0 +1,80 @@
+---
+name: ci-test-review
+description: On-demand AI review of the cc-ci server — runs the deterministic recipe test suite across ALL enrolled recipes, and for any failure diagnoses the root cause and classifies it as needing a RECIPE PR (with the proposed change), a CI-SERVER PR, or a stale-test update; or reports "all passed, recipes + tests up to date". Use when the operator wants a full health check + actionable diagnosis of the recipe CI, beyond the deterministic pass/fail of normal !testme runs. Invoke as /ci-test-review.
+---
+
+# ci-test-review
+
+An **on-demand, AI-driven review** of the cc-ci recipe CI server. cc-ci's primary use is
+**deterministic** testing (normal `!testme` / nightly — no AI). This skill adds a review layer: run
+the full suite across **all** recipes, then **AI-diagnose** any failures and say exactly what fixes
+them — a recipe PR, a CI-server PR, or a test update — or confirm everything is green and current.
+
+**Hard boundary:** the test *execution* is deterministic (the cc-ci harness). **AI is used ONLY for
+diagnosis / classification / proposing the PR — never to decide pass/fail.**
+
+## Access (this skill runs from the orchestration repo, drives cc-ci over the tailnet)
+- cc-ci is reachable via `ssh cc-ci` (root, through the userspace-tailscaled SOCKS proxy on
+  127.0.0.1:1055). If `ssh cc-ci` fails, restart the proxy (`systemctl restart cc-ci-tailscaled`) per
+  the orchestrator's access notes, then retry.
+- The cc-ci repo + harness live at `/root/cc-ci` on the cc-ci VM; the runner is
+  `cc-ci-run runner/run_recipe_ci.py` with `RECIPE=<name>` (+ optional `STAGES=...`).
+
+## Procedure
+
+### 1. Run the deterministic sweep (no AI)
+Run the helper, which enumerates enrolled recipes and runs the **full suite per recipe** on cc-ci,
+emitting one structured result line per recipe (per-tier pass/fail/skip + the tested vs latest
+published version):
+
+```
+bash "$(dirname "$0")/run-all-recipes.sh"            # or: .claude/skills/ci-test-review/run-all-recipes.sh
+```
+
+It writes a results table to stdout and a machine-readable summary to
+`/srv/cc-ci/.cc-ci-logs/ci-test-review-<runid>.json`. Do NOT hand-judge pass/fail — trust the
+harness's per-tier results.
+
+### 2. If everything is green AND current → report all-clear
+If every recipe's tiers are `pass`/`skip(N/A)` (no `fail`) **and** every recipe was tested at its
+**latest published version** with no stale test flagged, output:
+
+> **ALL PASSED — recipes up to date, tests up to date.** <N recipes, versions listed>
+
+Stop here. Nothing to fix.
+
+### 3. For each failure → diagnose root cause + CLASSIFY (this is the AI part)
+For every `fail` (after confirming it's not a flake — see below), read the run log, the recipe's
+compose/code (on cc-ci `~/.abra/recipes/<recipe>`), and the harness, then determine the **root cause**
+and classify it as **exactly one** of:
+
+- **RECIPE bug → recipe PR.** The recipe itself is broken/fragile (bad healthcheck, missing backup
+  hook, startup race, etc.). Output the **concrete proposed change** (e.g. "add a collabora WOPI
+  healthcheck + `start_period`", "add a `pg_dump` backup hook"). This is a PR to the recipe
+  (recipe-maintainer); it is "working" only once cc-ci verifies it green, then the operator merges.
+- **CI-SERVER bug → cc-ci PR.** The harness/test/infra is wrong (a test asserts the wrong thing, a
+  readiness gate is off, an infra config bug). Output the proposed change to the cc-ci repo.
+- **TEST out-of-date → cc-ci test update.** The recipe legitimately changed upstream and the cc-ci
+  test/overlay needs updating to match. Output the update.
+- **FLAKY / infra-transient.** Distinguish from a real failure: **re-run the recipe once or twice**;
+  if it then passes, classify as flaky (note it; suggest a robustness fix if there's a pattern) — do
+  NOT report a flake as a recipe/CI bug.
+
+### 4. Freshness (part of "up to date")
+For each recipe, compare the **tested version** to the **latest published** recipe version (the
+sweep records both). Flag any recipe tested at a stale pin, and any cc-ci test that lags a recipe
+change — "all passed" must mean "passes against current upstream", not against an old pin.
+
+### 5. Output a structured report
+- A per-recipe **status table** (recipe | tiers | version tested / latest | verdict).
+- For each failure: **root cause → classification (recipe PR / CI PR / test-update / flaky) →
+  proposed change** (concrete enough to act on).
+- Or the all-clear summary from step 2.
+
+## Guardrails
+- **Deterministic execution stays AI-free** — only diagnose; never override the harness's pass/fail.
+- **Propose, don't auto-merge.** This skill outputs recommendations + proposed PRs. It does NOT
+  create or merge PRs. (A recipe PR is operator-merged only after cc-ci verifies it green.) If asked,
+  it may hand a specific proposed recipe PR to the `recipe-create-pr` skill as an explicit follow-up.
+- **Don't weaken any test** to make the review "pass."
+- **Flake ≠ bug** — re-run before classifying a failure as a real recipe/CI defect.
--- a/.claude/skills/ci-test-review/run-all-recipes.sh
+++ b/.claude/skills/ci-test-review/run-all-recipes.sh
@ -0,0 +1,87 @@
+#!/usr/bin/env bash
+# ci-test-review :: deterministic sweep (NO AI)
+# -------------------------------------------------------------------------
+# Runs the real cc-ci harness (runner/run_recipe_ci.py) across all enrolled
+# recipes on the cc-ci VM, full suite per recipe, and emits:
+#   - a per-recipe status table to stdout
+#   - a machine-readable JSON summary to $OUT  (default /srv/cc-ci/.cc-ci-logs/)
+# Full per-recipe logs are kept ON cc-ci under /root/cc-ci-review-logs/<runid>/
+# so the AI diagnosis step can read them via `ssh cc-ci`.
+#
+# This script makes NO pass/fail judgement of its own — it records the
+# harness's exit code (0=green, 1=fail) and its RUN SUMMARY block verbatim.
+# AI diagnosis/classification happens AFTER, per SKILL.md.
+#
+# Env knobs (all optional):
+#   RECIPES="ghost n8n"   limit to these recipes (default: all enrolled)
+#   STAGES="install"      pass through to harness (default: full suite)
+#   QUICK=1               add --quick fast lane (default: cold/authoritative)
+#   SSH=cc-ci             ssh host alias (default: cc-ci)
+#   OUT=/path/dir         local dir for the JSON summary (default .cc-ci-logs)
+set -o errexit -o nounset -o pipefail
+
+SSH="${SSH:-cc-ci}"
+OUT="${OUT:-/srv/cc-ci/.cc-ci-logs}"
+RUNID="$(date -u +%Y%m%dT%H%M%SZ)"
+REMOTE_LOGDIR="/root/cc-ci-review-logs/${RUNID}"
+SUMMARY="${OUT}/ci-test-review-${RUNID}.json"
+mkdir -p "$OUT"
+
+quick_flag=""; [ "${QUICK:-0}" = "1" ] && quick_flag="--quick"
+stages_env=""; [ -n "${STAGES:-}" ] && stages_env="STAGES='${STAGES}'"
+
+echo "ci-test-review sweep ${RUNID}  (host=${SSH} quick=${QUICK:-0} stages=${STAGES:-all})" >&2
+
+# 1. Enumerate enrolled recipes (tests/<recipe>/ dirs, minus harness scaffolding).
+if [ -n "${RECIPES:-}" ]; then
+  recipes="$RECIPES"
+else
+  recipes="$(ssh "$SSH" 'cd /root/cc-ci/tests 2>/dev/null && ls -d */ 2>/dev/null' \
+    | sed 's#/##' | grep -vE '^(_generic|unit|__pycache__)$' || true)"
+fi
+[ -n "$recipes" ] || { echo "no enrolled recipes found" >&2; exit 1; }
+echo "recipes: $(echo $recipes | tr '\n' ' ')" >&2
+
+ssh "$SSH" "mkdir -p '$REMOTE_LOGDIR'"
+
+# 2. Run the harness per recipe (deterministic). Capture exit code + RUN SUMMARY + version.
+printf '%-18s %-7s %s\n' "RECIPE" "VERDICT" "TIERS (from harness RUN SUMMARY)"
+printf '%-18s %-7s %s\n' "------" "-------" "--------------------------------"
+
+json_items=""
+for r in $recipes; do
+  log="${REMOTE_LOGDIR}/${r}.log"
+  # Real harness invocation (real abra inside). Tee full log on cc-ci; capture rc.
+  rc=0
+  ssh "$SSH" "cd /root/cc-ci && ${stages_env} RECIPE='${r}' cc-ci-run runner/run_recipe_ci.py ${quick_flag} >'${log}' 2>&1" || rc=$?
+  # Pull back the RUN SUMMARY block + the abra-checked-out recipe version, if recorded.
+  summary_block="$(ssh "$SSH" "awk '/===== RUN SUMMARY =====/{f=1} f{print}' '${log}' 2>/dev/null | sed 's/^/    /'" )"
+  tiers="$(ssh "$SSH" "awk '/===== RUN SUMMARY =====/{f=1} f&&/^  (install|upgrade|backup|restore|custom)/{gsub(/^  /,\"\");print}' '${log}' 2>/dev/null" | tr '\n' ';')"
+  version="$(ssh "$SSH" "grep -oE 'checked out [^ ]+ at [^ ]+|recipe version[: ]+[^ ]+|VERSION=[^ ]+' '${log}' 2>/dev/null | head -1" || true)"
+  verdict=GREEN; [ "$rc" = "0" ] || verdict=FAIL
+  printf '%-18s %-7s %s\n' "$r" "$verdict" "${tiers:-<no summary — see log>}"
+  # accumulate JSON (escape quotes/newlines minimally)
+  esc_tiers="$(printf '%s' "$tiers" | sed 's/"/\\"/g')"
+  esc_ver="$(printf '%s' "$version" | sed 's/"/\\"/g')"
+  json_items="${json_items}${json_items:+,}
+  {\"recipe\":\"${r}\",\"rc\":${rc},\"verdict\":\"${verdict}\",\"tiers\":\"${esc_tiers}\",\"version\":\"${esc_ver}\",\"log\":\"${SSH}:${log}\"}"
+done
+
+# 3. JSON summary for the AI diagnosis step.
+cat > "$SUMMARY" <<JSON
+{
+  "runid": "${RUNID}",
+  "host": "${SSH}",
+  "quick": ${QUICK:-0},
+  "stages": "${STAGES:-all}",
+  "remote_logdir": "${SSH}:${REMOTE_LOGDIR}",
+  "recipes": [${json_items}
+  ]
+}
+JSON
+
+echo >&2
+echo "JSON summary : ${SUMMARY}" >&2
+echo "full logs    : ${SSH}:${REMOTE_LOGDIR}/<recipe>.log  (read via ssh for diagnosis)" >&2
+# Exit non-zero if any recipe failed, so a caller can branch; AI still diagnoses either way.
+grep -q '"verdict":"FAIL"' "$SUMMARY" && exit 2 || exit 0
--- a/cc-ci-plan/plan-phase3r-ci-test-review-skill.md
+++ b/cc-ci-plan/plan-phase3r-ci-test-review-skill.md
@ -1,83 +0,0 @@
-# cc-ci Phase 3r — `/ci-test-review` Claude skill (on-demand AI review + diagnosis)
-
-**Status:** QUEUED — after Phase 3 (results UX), before Phase 4 (final review). A **Claude skill that
-ships in the cc-ci product repo** (`.claude/skills/ci-test-review/`), built by the loops.
-**Transition:** auto (in the launcher sequence). **Owner:** Builder + Adversary loops.
-**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase3r-ci-test-review-skill.md`
-**Phase order:** … 3 → **3r** → 4.
-
---
-
-## 0. Why — two distinct modes
-
-cc-ci's **primary** use is **deterministic testing with NO AI**: `!testme` / the nightly sweep run the
-harness, recipes pass/fail, done. That stays the bread-and-butter and is unaffected by this phase.
-
-This phase adds a **separate, on-demand, AI-driven REVIEW layer** — the `/ci-test-review` skill — that
-a maintainer (or a fresh Claude) invokes to: run the full suite across **all** recipes, and **for any
-failure, diagnose the root cause and classify it** — does it need a **PR to the recipe** (and what
-that PR is), a **PR to the CI server** (harness/test/infra), or is the **test out of date**? — or, if
-everything's green and current, report **"all passed, recipes + tests up to date."** It codifies the
-recipe-PR-vs-CI-PR judgment the Adversary already exercises (e.g. lasuite-drive→recipe PR, immich
-pg_dump→recipe PR) into a runnable tool.
-
-**Crisp boundary:** the test *execution* is deterministic (the existing harness); **AI is used ONLY
-for diagnosis/classification/PR-proposal**, never to decide pass/fail.
-
-## 1. Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-3r.md`)
-
- [ ] **R1 — Skill exists + invocable.** `.claude/skills/ci-test-review/SKILL.md` (+ helper script)
-      in the cc-ci repo, invocable as `/ci-test-review`. Reuses the existing harness + recipe
-      enrollment list — does NOT reinvent the runner.
- [ ] **R2 — Deterministic execution.** Runs the real harness (`run_recipe_ci.py`) across **all
-      enrolled recipes**, full suite per recipe, collecting per-recipe / per-tier pass/fail/skip.
-      No AI in execution. (Decide cold vs `--quick`: default **cold** for an authoritative review;
-      `--quick` as a fast-mode option.)
- [ ] **R3 — "Up to date" freshness.** Tests each recipe at its **latest published version** (not a
-      stale pin) and flags any recipe whose tested version lags upstream, and any test that's stale
-      vs the recipe — so "all passed" means "passes against current upstream," not against an old pin.
- [ ] **R4 — All-green output.** If every recipe passes AND is current, emit a clear **"ALL PASSED —
-      recipes up to date, tests up to date"** summary (with versions tested).
- [ ] **R5 — Failure diagnosis + classification.** For each failure, AI determines root cause and
-      classifies it as exactly one of:
-      - **RECIPE bug** → needs a **recipe PR**; output the concrete proposed change (e.g. "add
-        collabora WOPI healthcheck + start_period", "add pg_dump backup hook").
-      - **CI-SERVER bug** → needs a **cc-ci PR** (harness/test/infra); output the proposed change.
-      - **TEST out-of-date** → the recipe legitimately changed; the cc-ci test needs updating; output
-        the update.
-      - **FLAKY / infra** → distinguish from a real failure (e.g. re-run) so a flake isn't
-        misclassified as a recipe/CI bug.
- [ ] **R6 — Structured report; PROPOSE not auto-merge.** Output a structured report (per recipe
-      status; per failure: root cause + classification + proposed PR target + change sketch), or the
-      all-green summary. It **proposes** PRs — it does **NOT** auto-create or merge them (the operator
-      merges recipe PRs only after cc-ci verifies them green, per the recipe-PR rule). It MAY offer to
-      hand a proposed recipe PR to the `recipe-create-pr` skill as an explicit, opt-in follow-up.
- [ ] **R7 — Documented as distinct from deterministic CI.** `docs/` makes clear the default
-      `!testme`/nightly path is deterministic (no AI); `/ci-test-review` is the on-demand AI review
-      layer. A maintainer who wants zero AI never invokes it.
- [ ] **R8 — Adversary-verified behavior.** Proven: reports ALL-GREEN correctly when green; on a
-      **seeded recipe bug** classifies → recipe-PR with a sane proposed change; on a **seeded harness
-      bug** classifies → CI-PR; the freshness check flags a deliberately stale-pinned recipe; a flake
-      isn't misreported as a real bug.
-
-## 2. Method / notes
- **Lean on the harness + the Adversary's existing diagnostic methodology.** The classification
-  (recipe vs CI vs test-stale) is exactly what the loops already do per finding — the skill
-  productizes that reasoning, with the harness providing the deterministic evidence.
- **Real abra path** throughout (the harness it drives already uses real abra).
- The skill's value is turning "N recipes, here's the raw pass/fail" into "here's what's actually
-  wrong and the specific PR that fixes it (recipe vs CI), or all-clear."
-
-## 3. Guardrails
- **Deterministic execution stays AI-free** — AI only diagnoses/classifies/proposes; it never
-  decides pass/fail.
- **Propose, don't auto-merge** — recipe PRs are operator-merged only after cc-ci verifies them
-  (the standing recipe-PR rule).
- **Don't weaken any test** to make a review "pass."
- **Bounded** — one review skill; not a general agent. It runs the suite, diagnoses, reports.
-
-## 4. Open decisions (log in machine-docs/DECISIONS.md)
- Cold vs `--quick` default for the review run (lean cold = authoritative; offer `--quick` fast mode).
- Report format (markdown summary + per-recipe table + per-failure diagnosis block).
- How "latest published version" + "test staleness" are determined per recipe.
- Whether/how the skill optionally invokes `recipe-create-pr` for a proposed recipe PR (opt-in).