orchestrator: add /ci-test-review skill (in THIS repo) + drop Phase 3r from loops queue

The on-demand AI review layer is now an orchestration-repo skill built directly
by the orchestrator, NOT a loops phase in the cc-ci product repo:

- .claude/skills/ci-test-review/{SKILL.md,run-all-recipes.sh}: runs the real
  cc-ci harness across all enrolled recipes (deterministic, AI-free execution),
  then AI diagnoses each failure and classifies it as needing a recipe PR / a
  CI-server PR / a stale-test update — or reports "ALL PASSED, recipes + tests
  up to date". Proposes PRs; never decides pass/fail; never auto-merges.
- .gitignore: track .claude/skills/ (shareable) while still ignoring local
  claude session state (locks, history) under .claude/.
- launch.sh: remove Phase 3r from PHASES_SPEC; loops sequence back to
  1c 1b 1d 1e 2w 2pc 2 2b 3 4. Deleted plan-phase3r (superseded by the skill).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-05-29 16:57:26 +01:00
parent 5f84f8c028
commit 2530845e50
3 changed files with 167 additions and 83 deletions

View File

@ -0,0 +1,80 @@
---
name: ci-test-review
description: On-demand AI review of the cc-ci server — runs the deterministic recipe test suite across ALL enrolled recipes, and for any failure diagnoses the root cause and classifies it as needing a RECIPE PR (with the proposed change), a CI-SERVER PR, or a stale-test update; or reports "all passed, recipes + tests up to date". Use when the operator wants a full health check + actionable diagnosis of the recipe CI, beyond the deterministic pass/fail of normal !testme runs. Invoke as /ci-test-review.
---
# ci-test-review
An **on-demand, AI-driven review** of the cc-ci recipe CI server. cc-ci's primary use is
**deterministic** testing (normal `!testme` / nightly — no AI). This skill adds a review layer: run
the full suite across **all** recipes, then **AI-diagnose** any failures and say exactly what fixes
them — a recipe PR, a CI-server PR, or a test update — or confirm everything is green and current.
**Hard boundary:** the test *execution* is deterministic (the cc-ci harness). **AI is used ONLY for
diagnosis / classification / proposing the PR — never to decide pass/fail.**
## Access (this skill runs from the orchestration repo, drives cc-ci over the tailnet)
- cc-ci is reachable via `ssh cc-ci` (root, through the userspace-tailscaled SOCKS proxy on
127.0.0.1:1055). If `ssh cc-ci` fails, restart the proxy (`systemctl restart cc-ci-tailscaled`) per
the orchestrator's access notes, then retry.
- The cc-ci repo + harness live at `/root/cc-ci` on the cc-ci VM; the runner is
`cc-ci-run runner/run_recipe_ci.py` with `RECIPE=<name>` (+ optional `STAGES=...`).
## Procedure
### 1. Run the deterministic sweep (no AI)
Run the helper, which enumerates enrolled recipes and runs the **full suite per recipe** on cc-ci,
emitting one structured result line per recipe (per-tier pass/fail/skip + the tested vs latest
published version):
```
bash "$(dirname "$0")/run-all-recipes.sh" # or: .claude/skills/ci-test-review/run-all-recipes.sh
```
It writes a results table to stdout and a machine-readable summary to
`/srv/cc-ci/.cc-ci-logs/ci-test-review-<runid>.json`. Do NOT hand-judge pass/fail — trust the
harness's per-tier results.
### 2. If everything is green AND current → report all-clear
If every recipe's tiers are `pass`/`skip(N/A)` (no `fail`) **and** every recipe was tested at its
**latest published version** with no stale test flagged, output:
> **ALL PASSED — recipes up to date, tests up to date.** <N recipes, versions listed>
Stop here. Nothing to fix.
### 3. For each failure → diagnose root cause + CLASSIFY (this is the AI part)
For every `fail` (after confirming it's not a flake — see below), read the run log, the recipe's
compose/code (on cc-ci `~/.abra/recipes/<recipe>`), and the harness, then determine the **root cause**
and classify it as **exactly one** of:
- **RECIPE bug → recipe PR.** The recipe itself is broken/fragile (bad healthcheck, missing backup
hook, startup race, etc.). Output the **concrete proposed change** (e.g. "add a collabora WOPI
healthcheck + `start_period`", "add a `pg_dump` backup hook"). This is a PR to the recipe
(recipe-maintainer); it is "working" only once cc-ci verifies it green, then the operator merges.
- **CI-SERVER bug → cc-ci PR.** The harness/test/infra is wrong (a test asserts the wrong thing, a
readiness gate is off, an infra config bug). Output the proposed change to the cc-ci repo.
- **TEST out-of-date → cc-ci test update.** The recipe legitimately changed upstream and the cc-ci
test/overlay needs updating to match. Output the update.
- **FLAKY / infra-transient.** Distinguish from a real failure: **re-run the recipe once or twice**;
if it then passes, classify as flaky (note it; suggest a robustness fix if there's a pattern) — do
NOT report a flake as a recipe/CI bug.
### 4. Freshness (part of "up to date")
For each recipe, compare the **tested version** to the **latest published** recipe version (the
sweep records both). Flag any recipe tested at a stale pin, and any cc-ci test that lags a recipe
change — "all passed" must mean "passes against current upstream", not against an old pin.
### 5. Output a structured report
- A per-recipe **status table** (recipe | tiers | version tested / latest | verdict).
- For each failure: **root cause → classification (recipe PR / CI PR / test-update / flaky) →
proposed change** (concrete enough to act on).
- Or the all-clear summary from step 2.
## Guardrails
- **Deterministic execution stays AI-free** — only diagnose; never override the harness's pass/fail.
- **Propose, don't auto-merge.** This skill outputs recommendations + proposed PRs. It does NOT
create or merge PRs. (A recipe PR is operator-merged only after cc-ci verifies it green.) If asked,
it may hand a specific proposed recipe PR to the `recipe-create-pr` skill as an explicit follow-up.
- **Don't weaken any test** to make the review "pass."
- **Flake ≠ bug** — re-run before classifying a failure as a real recipe/CI defect.

View File

@ -0,0 +1,87 @@
#!/usr/bin/env bash
# ci-test-review :: deterministic sweep (NO AI)
# -------------------------------------------------------------------------
# Runs the real cc-ci harness (runner/run_recipe_ci.py) across all enrolled
# recipes on the cc-ci VM, full suite per recipe, and emits:
# - a per-recipe status table to stdout
# - a machine-readable JSON summary to $OUT (default /srv/cc-ci/.cc-ci-logs/)
# Full per-recipe logs are kept ON cc-ci under /root/cc-ci-review-logs/<runid>/
# so the AI diagnosis step can read them via `ssh cc-ci`.
#
# This script makes NO pass/fail judgement of its own — it records the
# harness's exit code (0=green, 1=fail) and its RUN SUMMARY block verbatim.
# AI diagnosis/classification happens AFTER, per SKILL.md.
#
# Env knobs (all optional):
# RECIPES="ghost n8n" limit to these recipes (default: all enrolled)
# STAGES="install" pass through to harness (default: full suite)
# QUICK=1 add --quick fast lane (default: cold/authoritative)
# SSH=cc-ci ssh host alias (default: cc-ci)
# OUT=/path/dir local dir for the JSON summary (default .cc-ci-logs)
set -o errexit -o nounset -o pipefail
SSH="${SSH:-cc-ci}"
OUT="${OUT:-/srv/cc-ci/.cc-ci-logs}"
RUNID="$(date -u +%Y%m%dT%H%M%SZ)"
REMOTE_LOGDIR="/root/cc-ci-review-logs/${RUNID}"
SUMMARY="${OUT}/ci-test-review-${RUNID}.json"
mkdir -p "$OUT"
quick_flag=""; [ "${QUICK:-0}" = "1" ] && quick_flag="--quick"
stages_env=""; [ -n "${STAGES:-}" ] && stages_env="STAGES='${STAGES}'"
echo "ci-test-review sweep ${RUNID} (host=${SSH} quick=${QUICK:-0} stages=${STAGES:-all})" >&2
# 1. Enumerate enrolled recipes (tests/<recipe>/ dirs, minus harness scaffolding).
if [ -n "${RECIPES:-}" ]; then
recipes="$RECIPES"
else
recipes="$(ssh "$SSH" 'cd /root/cc-ci/tests 2>/dev/null && ls -d */ 2>/dev/null' \
| sed 's#/##' | grep -vE '^(_generic|unit|__pycache__)$' || true)"
fi
[ -n "$recipes" ] || { echo "no enrolled recipes found" >&2; exit 1; }
echo "recipes: $(echo $recipes | tr '\n' ' ')" >&2
ssh "$SSH" "mkdir -p '$REMOTE_LOGDIR'"
# 2. Run the harness per recipe (deterministic). Capture exit code + RUN SUMMARY + version.
printf '%-18s %-7s %s\n' "RECIPE" "VERDICT" "TIERS (from harness RUN SUMMARY)"
printf '%-18s %-7s %s\n' "------" "-------" "--------------------------------"
json_items=""
for r in $recipes; do
log="${REMOTE_LOGDIR}/${r}.log"
# Real harness invocation (real abra inside). Tee full log on cc-ci; capture rc.
rc=0
ssh "$SSH" "cd /root/cc-ci && ${stages_env} RECIPE='${r}' cc-ci-run runner/run_recipe_ci.py ${quick_flag} >'${log}' 2>&1" || rc=$?
# Pull back the RUN SUMMARY block + the abra-checked-out recipe version, if recorded.
summary_block="$(ssh "$SSH" "awk '/===== RUN SUMMARY =====/{f=1} f{print}' '${log}' 2>/dev/null | sed 's/^/ /'" )"
tiers="$(ssh "$SSH" "awk '/===== RUN SUMMARY =====/{f=1} f&&/^ (install|upgrade|backup|restore|custom)/{gsub(/^ /,\"\");print}' '${log}' 2>/dev/null" | tr '\n' ';')"
version="$(ssh "$SSH" "grep -oE 'checked out [^ ]+ at [^ ]+|recipe version[: ]+[^ ]+|VERSION=[^ ]+' '${log}' 2>/dev/null | head -1" || true)"
verdict=GREEN; [ "$rc" = "0" ] || verdict=FAIL
printf '%-18s %-7s %s\n' "$r" "$verdict" "${tiers:-<no summary — see log>}"
# accumulate JSON (escape quotes/newlines minimally)
esc_tiers="$(printf '%s' "$tiers" | sed 's/"/\\"/g')"
esc_ver="$(printf '%s' "$version" | sed 's/"/\\"/g')"
json_items="${json_items}${json_items:+,}
{\"recipe\":\"${r}\",\"rc\":${rc},\"verdict\":\"${verdict}\",\"tiers\":\"${esc_tiers}\",\"version\":\"${esc_ver}\",\"log\":\"${SSH}:${log}\"}"
done
# 3. JSON summary for the AI diagnosis step.
cat > "$SUMMARY" <<JSON
{
"runid": "${RUNID}",
"host": "${SSH}",
"quick": ${QUICK:-0},
"stages": "${STAGES:-all}",
"remote_logdir": "${SSH}:${REMOTE_LOGDIR}",
"recipes": [${json_items}
]
}
JSON
echo >&2
echo "JSON summary : ${SUMMARY}" >&2
echo "full logs : ${SSH}:${REMOTE_LOGDIR}/<recipe>.log (read via ssh for diagnosis)" >&2
# Exit non-zero if any recipe failed, so a caller can branch; AI still diagnoses either way.
grep -q '"verdict":"FAIL"' "$SUMMARY" && exit 2 || exit 0

View File

@ -1,83 +0,0 @@
# cc-ci Phase 3r — `/ci-test-review` Claude skill (on-demand AI review + diagnosis)
**Status:** QUEUED — after Phase 3 (results UX), before Phase 4 (final review). A **Claude skill that
ships in the cc-ci product repo** (`.claude/skills/ci-test-review/`), built by the loops.
**Transition:** auto (in the launcher sequence). **Owner:** Builder + Adversary loops.
**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase3r-ci-test-review-skill.md`
**Phase order:** … 3 → **3r** → 4.
---
## 0. Why — two distinct modes
cc-ci's **primary** use is **deterministic testing with NO AI**: `!testme` / the nightly sweep run the
harness, recipes pass/fail, done. That stays the bread-and-butter and is unaffected by this phase.
This phase adds a **separate, on-demand, AI-driven REVIEW layer** — the `/ci-test-review` skill — that
a maintainer (or a fresh Claude) invokes to: run the full suite across **all** recipes, and **for any
failure, diagnose the root cause and classify it** — does it need a **PR to the recipe** (and what
that PR is), a **PR to the CI server** (harness/test/infra), or is the **test out of date**? — or, if
everything's green and current, report **"all passed, recipes + tests up to date."** It codifies the
recipe-PR-vs-CI-PR judgment the Adversary already exercises (e.g. lasuite-drive→recipe PR, immich
pg_dump→recipe PR) into a runnable tool.
**Crisp boundary:** the test *execution* is deterministic (the existing harness); **AI is used ONLY
for diagnosis/classification/PR-proposal**, never to decide pass/fail.
## 1. Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-3r.md`)
- [ ] **R1 — Skill exists + invocable.** `.claude/skills/ci-test-review/SKILL.md` (+ helper script)
in the cc-ci repo, invocable as `/ci-test-review`. Reuses the existing harness + recipe
enrollment list — does NOT reinvent the runner.
- [ ] **R2 — Deterministic execution.** Runs the real harness (`run_recipe_ci.py`) across **all
enrolled recipes**, full suite per recipe, collecting per-recipe / per-tier pass/fail/skip.
No AI in execution. (Decide cold vs `--quick`: default **cold** for an authoritative review;
`--quick` as a fast-mode option.)
- [ ] **R3 — "Up to date" freshness.** Tests each recipe at its **latest published version** (not a
stale pin) and flags any recipe whose tested version lags upstream, and any test that's stale
vs the recipe — so "all passed" means "passes against current upstream," not against an old pin.
- [ ] **R4 — All-green output.** If every recipe passes AND is current, emit a clear **"ALL PASSED —
recipes up to date, tests up to date"** summary (with versions tested).
- [ ] **R5 — Failure diagnosis + classification.** For each failure, AI determines root cause and
classifies it as exactly one of:
- **RECIPE bug** → needs a **recipe PR**; output the concrete proposed change (e.g. "add
collabora WOPI healthcheck + start_period", "add pg_dump backup hook").
- **CI-SERVER bug** → needs a **cc-ci PR** (harness/test/infra); output the proposed change.
- **TEST out-of-date** → the recipe legitimately changed; the cc-ci test needs updating; output
the update.
- **FLAKY / infra** → distinguish from a real failure (e.g. re-run) so a flake isn't
misclassified as a recipe/CI bug.
- [ ] **R6 — Structured report; PROPOSE not auto-merge.** Output a structured report (per recipe
status; per failure: root cause + classification + proposed PR target + change sketch), or the
all-green summary. It **proposes** PRs — it does **NOT** auto-create or merge them (the operator
merges recipe PRs only after cc-ci verifies them green, per the recipe-PR rule). It MAY offer to
hand a proposed recipe PR to the `recipe-create-pr` skill as an explicit, opt-in follow-up.
- [ ] **R7 — Documented as distinct from deterministic CI.** `docs/` makes clear the default
`!testme`/nightly path is deterministic (no AI); `/ci-test-review` is the on-demand AI review
layer. A maintainer who wants zero AI never invokes it.
- [ ] **R8 — Adversary-verified behavior.** Proven: reports ALL-GREEN correctly when green; on a
**seeded recipe bug** classifies → recipe-PR with a sane proposed change; on a **seeded harness
bug** classifies → CI-PR; the freshness check flags a deliberately stale-pinned recipe; a flake
isn't misreported as a real bug.
## 2. Method / notes
- **Lean on the harness + the Adversary's existing diagnostic methodology.** The classification
(recipe vs CI vs test-stale) is exactly what the loops already do per finding — the skill
productizes that reasoning, with the harness providing the deterministic evidence.
- **Real abra path** throughout (the harness it drives already uses real abra).
- The skill's value is turning "N recipes, here's the raw pass/fail" into "here's what's actually
wrong and the specific PR that fixes it (recipe vs CI), or all-clear."
## 3. Guardrails
- **Deterministic execution stays AI-free** — AI only diagnoses/classifies/proposes; it never
decides pass/fail.
- **Propose, don't auto-merge** — recipe PRs are operator-merged only after cc-ci verifies them
(the standing recipe-PR rule).
- **Don't weaken any test** to make a review "pass."
- **Bounded** — one review skill; not a general agent. It runs the suite, diagnoses, reports.
## 4. Open decisions (log in machine-docs/DECISIONS.md)
- Cold vs `--quick` default for the review run (lean cold = authoritative; offer `--quick` fast mode).
- Report format (markdown summary + per-recipe table + per-failure diagnosis block).
- How "latest published version" + "test staleness" are determined per recipe.
- Whether/how the skill optionally invokes `recipe-create-pr` for a proposed recipe PR (opt-in).