orchestrator: add /ci-test-review skill (in THIS repo) + drop Phase 3r from loops queue
The on-demand AI review layer is now an orchestration-repo skill built directly
by the orchestrator, NOT a loops phase in the cc-ci product repo:
- .claude/skills/ci-test-review/{SKILL.md,run-all-recipes.sh}: runs the real
cc-ci harness across all enrolled recipes (deterministic, AI-free execution),
then AI diagnoses each failure and classifies it as needing a recipe PR / a
CI-server PR / a stale-test update — or reports "ALL PASSED, recipes + tests
up to date". Proposes PRs; never decides pass/fail; never auto-merges.
- .gitignore: track .claude/skills/ (shareable) while still ignoring local
claude session state (locks, history) under .claude/.
- launch.sh: remove Phase 3r from PHASES_SPEC; loops sequence back to
1c 1b 1d 1e 2w 2pc 2 2b 3 4. Deleted plan-phase3r (superseded by the skill).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
80
.claude/skills/ci-test-review/SKILL.md
Normal file
80
.claude/skills/ci-test-review/SKILL.md
Normal file
@ -0,0 +1,80 @@
|
||||
---
|
||||
name: ci-test-review
|
||||
description: On-demand AI review of the cc-ci server — runs the deterministic recipe test suite across ALL enrolled recipes, and for any failure diagnoses the root cause and classifies it as needing a RECIPE PR (with the proposed change), a CI-SERVER PR, or a stale-test update; or reports "all passed, recipes + tests up to date". Use when the operator wants a full health check + actionable diagnosis of the recipe CI, beyond the deterministic pass/fail of normal !testme runs. Invoke as /ci-test-review.
|
||||
---
|
||||
|
||||
# ci-test-review
|
||||
|
||||
An **on-demand, AI-driven review** of the cc-ci recipe CI server. cc-ci's primary use is
|
||||
**deterministic** testing (normal `!testme` / nightly — no AI). This skill adds a review layer: run
|
||||
the full suite across **all** recipes, then **AI-diagnose** any failures and say exactly what fixes
|
||||
them — a recipe PR, a CI-server PR, or a test update — or confirm everything is green and current.
|
||||
|
||||
**Hard boundary:** the test *execution* is deterministic (the cc-ci harness). **AI is used ONLY for
|
||||
diagnosis / classification / proposing the PR — never to decide pass/fail.**
|
||||
|
||||
## Access (this skill runs from the orchestration repo, drives cc-ci over the tailnet)
|
||||
- cc-ci is reachable via `ssh cc-ci` (root, through the userspace-tailscaled SOCKS proxy on
|
||||
127.0.0.1:1055). If `ssh cc-ci` fails, restart the proxy (`systemctl restart cc-ci-tailscaled`) per
|
||||
the orchestrator's access notes, then retry.
|
||||
- The cc-ci repo + harness live at `/root/cc-ci` on the cc-ci VM; the runner is
|
||||
`cc-ci-run runner/run_recipe_ci.py` with `RECIPE=<name>` (+ optional `STAGES=...`).
|
||||
|
||||
## Procedure
|
||||
|
||||
### 1. Run the deterministic sweep (no AI)
|
||||
Run the helper, which enumerates enrolled recipes and runs the **full suite per recipe** on cc-ci,
|
||||
emitting one structured result line per recipe (per-tier pass/fail/skip + the tested vs latest
|
||||
published version):
|
||||
|
||||
```
|
||||
bash "$(dirname "$0")/run-all-recipes.sh" # or: .claude/skills/ci-test-review/run-all-recipes.sh
|
||||
```
|
||||
|
||||
It writes a results table to stdout and a machine-readable summary to
|
||||
`/srv/cc-ci/.cc-ci-logs/ci-test-review-<runid>.json`. Do NOT hand-judge pass/fail — trust the
|
||||
harness's per-tier results.
|
||||
|
||||
### 2. If everything is green AND current → report all-clear
|
||||
If every recipe's tiers are `pass`/`skip(N/A)` (no `fail`) **and** every recipe was tested at its
|
||||
**latest published version** with no stale test flagged, output:
|
||||
|
||||
> **ALL PASSED — recipes up to date, tests up to date.** <N recipes, versions listed>
|
||||
|
||||
Stop here. Nothing to fix.
|
||||
|
||||
### 3. For each failure → diagnose root cause + CLASSIFY (this is the AI part)
|
||||
For every `fail` (after confirming it's not a flake — see below), read the run log, the recipe's
|
||||
compose/code (on cc-ci `~/.abra/recipes/<recipe>`), and the harness, then determine the **root cause**
|
||||
and classify it as **exactly one** of:
|
||||
|
||||
- **RECIPE bug → recipe PR.** The recipe itself is broken/fragile (bad healthcheck, missing backup
|
||||
hook, startup race, etc.). Output the **concrete proposed change** (e.g. "add a collabora WOPI
|
||||
healthcheck + `start_period`", "add a `pg_dump` backup hook"). This is a PR to the recipe
|
||||
(recipe-maintainer); it is "working" only once cc-ci verifies it green, then the operator merges.
|
||||
- **CI-SERVER bug → cc-ci PR.** The harness/test/infra is wrong (a test asserts the wrong thing, a
|
||||
readiness gate is off, an infra config bug). Output the proposed change to the cc-ci repo.
|
||||
- **TEST out-of-date → cc-ci test update.** The recipe legitimately changed upstream and the cc-ci
|
||||
test/overlay needs updating to match. Output the update.
|
||||
- **FLAKY / infra-transient.** Distinguish from a real failure: **re-run the recipe once or twice**;
|
||||
if it then passes, classify as flaky (note it; suggest a robustness fix if there's a pattern) — do
|
||||
NOT report a flake as a recipe/CI bug.
|
||||
|
||||
### 4. Freshness (part of "up to date")
|
||||
For each recipe, compare the **tested version** to the **latest published** recipe version (the
|
||||
sweep records both). Flag any recipe tested at a stale pin, and any cc-ci test that lags a recipe
|
||||
change — "all passed" must mean "passes against current upstream", not against an old pin.
|
||||
|
||||
### 5. Output a structured report
|
||||
- A per-recipe **status table** (recipe | tiers | version tested / latest | verdict).
|
||||
- For each failure: **root cause → classification (recipe PR / CI PR / test-update / flaky) →
|
||||
proposed change** (concrete enough to act on).
|
||||
- Or the all-clear summary from step 2.
|
||||
|
||||
## Guardrails
|
||||
- **Deterministic execution stays AI-free** — only diagnose; never override the harness's pass/fail.
|
||||
- **Propose, don't auto-merge.** This skill outputs recommendations + proposed PRs. It does NOT
|
||||
create or merge PRs. (A recipe PR is operator-merged only after cc-ci verifies it green.) If asked,
|
||||
it may hand a specific proposed recipe PR to the `recipe-create-pr` skill as an explicit follow-up.
|
||||
- **Don't weaken any test** to make the review "pass."
|
||||
- **Flake ≠ bug** — re-run before classifying a failure as a real recipe/CI defect.
|
||||
87
.claude/skills/ci-test-review/run-all-recipes.sh
Executable file
87
.claude/skills/ci-test-review/run-all-recipes.sh
Executable file
@ -0,0 +1,87 @@
|
||||
#!/usr/bin/env bash
|
||||
# ci-test-review :: deterministic sweep (NO AI)
|
||||
# -------------------------------------------------------------------------
|
||||
# Runs the real cc-ci harness (runner/run_recipe_ci.py) across all enrolled
|
||||
# recipes on the cc-ci VM, full suite per recipe, and emits:
|
||||
# - a per-recipe status table to stdout
|
||||
# - a machine-readable JSON summary to $OUT (default /srv/cc-ci/.cc-ci-logs/)
|
||||
# Full per-recipe logs are kept ON cc-ci under /root/cc-ci-review-logs/<runid>/
|
||||
# so the AI diagnosis step can read them via `ssh cc-ci`.
|
||||
#
|
||||
# This script makes NO pass/fail judgement of its own — it records the
|
||||
# harness's exit code (0=green, 1=fail) and its RUN SUMMARY block verbatim.
|
||||
# AI diagnosis/classification happens AFTER, per SKILL.md.
|
||||
#
|
||||
# Env knobs (all optional):
|
||||
# RECIPES="ghost n8n" limit to these recipes (default: all enrolled)
|
||||
# STAGES="install" pass through to harness (default: full suite)
|
||||
# QUICK=1 add --quick fast lane (default: cold/authoritative)
|
||||
# SSH=cc-ci ssh host alias (default: cc-ci)
|
||||
# OUT=/path/dir local dir for the JSON summary (default .cc-ci-logs)
|
||||
set -o errexit -o nounset -o pipefail
|
||||
|
||||
SSH="${SSH:-cc-ci}"
|
||||
OUT="${OUT:-/srv/cc-ci/.cc-ci-logs}"
|
||||
RUNID="$(date -u +%Y%m%dT%H%M%SZ)"
|
||||
REMOTE_LOGDIR="/root/cc-ci-review-logs/${RUNID}"
|
||||
SUMMARY="${OUT}/ci-test-review-${RUNID}.json"
|
||||
mkdir -p "$OUT"
|
||||
|
||||
quick_flag=""; [ "${QUICK:-0}" = "1" ] && quick_flag="--quick"
|
||||
stages_env=""; [ -n "${STAGES:-}" ] && stages_env="STAGES='${STAGES}'"
|
||||
|
||||
echo "ci-test-review sweep ${RUNID} (host=${SSH} quick=${QUICK:-0} stages=${STAGES:-all})" >&2
|
||||
|
||||
# 1. Enumerate enrolled recipes (tests/<recipe>/ dirs, minus harness scaffolding).
|
||||
if [ -n "${RECIPES:-}" ]; then
|
||||
recipes="$RECIPES"
|
||||
else
|
||||
recipes="$(ssh "$SSH" 'cd /root/cc-ci/tests 2>/dev/null && ls -d */ 2>/dev/null' \
|
||||
| sed 's#/##' | grep -vE '^(_generic|unit|__pycache__)$' || true)"
|
||||
fi
|
||||
[ -n "$recipes" ] || { echo "no enrolled recipes found" >&2; exit 1; }
|
||||
echo "recipes: $(echo $recipes | tr '\n' ' ')" >&2
|
||||
|
||||
ssh "$SSH" "mkdir -p '$REMOTE_LOGDIR'"
|
||||
|
||||
# 2. Run the harness per recipe (deterministic). Capture exit code + RUN SUMMARY + version.
|
||||
printf '%-18s %-7s %s\n' "RECIPE" "VERDICT" "TIERS (from harness RUN SUMMARY)"
|
||||
printf '%-18s %-7s %s\n' "------" "-------" "--------------------------------"
|
||||
|
||||
json_items=""
|
||||
for r in $recipes; do
|
||||
log="${REMOTE_LOGDIR}/${r}.log"
|
||||
# Real harness invocation (real abra inside). Tee full log on cc-ci; capture rc.
|
||||
rc=0
|
||||
ssh "$SSH" "cd /root/cc-ci && ${stages_env} RECIPE='${r}' cc-ci-run runner/run_recipe_ci.py ${quick_flag} >'${log}' 2>&1" || rc=$?
|
||||
# Pull back the RUN SUMMARY block + the abra-checked-out recipe version, if recorded.
|
||||
summary_block="$(ssh "$SSH" "awk '/===== RUN SUMMARY =====/{f=1} f{print}' '${log}' 2>/dev/null | sed 's/^/ /'" )"
|
||||
tiers="$(ssh "$SSH" "awk '/===== RUN SUMMARY =====/{f=1} f&&/^ (install|upgrade|backup|restore|custom)/{gsub(/^ /,\"\");print}' '${log}' 2>/dev/null" | tr '\n' ';')"
|
||||
version="$(ssh "$SSH" "grep -oE 'checked out [^ ]+ at [^ ]+|recipe version[: ]+[^ ]+|VERSION=[^ ]+' '${log}' 2>/dev/null | head -1" || true)"
|
||||
verdict=GREEN; [ "$rc" = "0" ] || verdict=FAIL
|
||||
printf '%-18s %-7s %s\n' "$r" "$verdict" "${tiers:-<no summary — see log>}"
|
||||
# accumulate JSON (escape quotes/newlines minimally)
|
||||
esc_tiers="$(printf '%s' "$tiers" | sed 's/"/\\"/g')"
|
||||
esc_ver="$(printf '%s' "$version" | sed 's/"/\\"/g')"
|
||||
json_items="${json_items}${json_items:+,}
|
||||
{\"recipe\":\"${r}\",\"rc\":${rc},\"verdict\":\"${verdict}\",\"tiers\":\"${esc_tiers}\",\"version\":\"${esc_ver}\",\"log\":\"${SSH}:${log}\"}"
|
||||
done
|
||||
|
||||
# 3. JSON summary for the AI diagnosis step.
|
||||
cat > "$SUMMARY" <<JSON
|
||||
{
|
||||
"runid": "${RUNID}",
|
||||
"host": "${SSH}",
|
||||
"quick": ${QUICK:-0},
|
||||
"stages": "${STAGES:-all}",
|
||||
"remote_logdir": "${SSH}:${REMOTE_LOGDIR}",
|
||||
"recipes": [${json_items}
|
||||
]
|
||||
}
|
||||
JSON
|
||||
|
||||
echo >&2
|
||||
echo "JSON summary : ${SUMMARY}" >&2
|
||||
echo "full logs : ${SSH}:${REMOTE_LOGDIR}/<recipe>.log (read via ssh for diagnosis)" >&2
|
||||
# Exit non-zero if any recipe failed, so a caller can branch; AI still diagnoses either way.
|
||||
grep -q '"verdict":"FAIL"' "$SUMMARY" && exit 2 || exit 0
|
||||
@ -1,83 +0,0 @@
|
||||
# cc-ci Phase 3r — `/ci-test-review` Claude skill (on-demand AI review + diagnosis)
|
||||
|
||||
**Status:** QUEUED — after Phase 3 (results UX), before Phase 4 (final review). A **Claude skill that
|
||||
ships in the cc-ci product repo** (`.claude/skills/ci-test-review/`), built by the loops.
|
||||
**Transition:** auto (in the launcher sequence). **Owner:** Builder + Adversary loops.
|
||||
**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase3r-ci-test-review-skill.md`
|
||||
**Phase order:** … 3 → **3r** → 4.
|
||||
|
||||
---
|
||||
|
||||
## 0. Why — two distinct modes
|
||||
|
||||
cc-ci's **primary** use is **deterministic testing with NO AI**: `!testme` / the nightly sweep run the
|
||||
harness, recipes pass/fail, done. That stays the bread-and-butter and is unaffected by this phase.
|
||||
|
||||
This phase adds a **separate, on-demand, AI-driven REVIEW layer** — the `/ci-test-review` skill — that
|
||||
a maintainer (or a fresh Claude) invokes to: run the full suite across **all** recipes, and **for any
|
||||
failure, diagnose the root cause and classify it** — does it need a **PR to the recipe** (and what
|
||||
that PR is), a **PR to the CI server** (harness/test/infra), or is the **test out of date**? — or, if
|
||||
everything's green and current, report **"all passed, recipes + tests up to date."** It codifies the
|
||||
recipe-PR-vs-CI-PR judgment the Adversary already exercises (e.g. lasuite-drive→recipe PR, immich
|
||||
pg_dump→recipe PR) into a runnable tool.
|
||||
|
||||
**Crisp boundary:** the test *execution* is deterministic (the existing harness); **AI is used ONLY
|
||||
for diagnosis/classification/PR-proposal**, never to decide pass/fail.
|
||||
|
||||
## 1. Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-3r.md`)
|
||||
|
||||
- [ ] **R1 — Skill exists + invocable.** `.claude/skills/ci-test-review/SKILL.md` (+ helper script)
|
||||
in the cc-ci repo, invocable as `/ci-test-review`. Reuses the existing harness + recipe
|
||||
enrollment list — does NOT reinvent the runner.
|
||||
- [ ] **R2 — Deterministic execution.** Runs the real harness (`run_recipe_ci.py`) across **all
|
||||
enrolled recipes**, full suite per recipe, collecting per-recipe / per-tier pass/fail/skip.
|
||||
No AI in execution. (Decide cold vs `--quick`: default **cold** for an authoritative review;
|
||||
`--quick` as a fast-mode option.)
|
||||
- [ ] **R3 — "Up to date" freshness.** Tests each recipe at its **latest published version** (not a
|
||||
stale pin) and flags any recipe whose tested version lags upstream, and any test that's stale
|
||||
vs the recipe — so "all passed" means "passes against current upstream," not against an old pin.
|
||||
- [ ] **R4 — All-green output.** If every recipe passes AND is current, emit a clear **"ALL PASSED —
|
||||
recipes up to date, tests up to date"** summary (with versions tested).
|
||||
- [ ] **R5 — Failure diagnosis + classification.** For each failure, AI determines root cause and
|
||||
classifies it as exactly one of:
|
||||
- **RECIPE bug** → needs a **recipe PR**; output the concrete proposed change (e.g. "add
|
||||
collabora WOPI healthcheck + start_period", "add pg_dump backup hook").
|
||||
- **CI-SERVER bug** → needs a **cc-ci PR** (harness/test/infra); output the proposed change.
|
||||
- **TEST out-of-date** → the recipe legitimately changed; the cc-ci test needs updating; output
|
||||
the update.
|
||||
- **FLAKY / infra** → distinguish from a real failure (e.g. re-run) so a flake isn't
|
||||
misclassified as a recipe/CI bug.
|
||||
- [ ] **R6 — Structured report; PROPOSE not auto-merge.** Output a structured report (per recipe
|
||||
status; per failure: root cause + classification + proposed PR target + change sketch), or the
|
||||
all-green summary. It **proposes** PRs — it does **NOT** auto-create or merge them (the operator
|
||||
merges recipe PRs only after cc-ci verifies them green, per the recipe-PR rule). It MAY offer to
|
||||
hand a proposed recipe PR to the `recipe-create-pr` skill as an explicit, opt-in follow-up.
|
||||
- [ ] **R7 — Documented as distinct from deterministic CI.** `docs/` makes clear the default
|
||||
`!testme`/nightly path is deterministic (no AI); `/ci-test-review` is the on-demand AI review
|
||||
layer. A maintainer who wants zero AI never invokes it.
|
||||
- [ ] **R8 — Adversary-verified behavior.** Proven: reports ALL-GREEN correctly when green; on a
|
||||
**seeded recipe bug** classifies → recipe-PR with a sane proposed change; on a **seeded harness
|
||||
bug** classifies → CI-PR; the freshness check flags a deliberately stale-pinned recipe; a flake
|
||||
isn't misreported as a real bug.
|
||||
|
||||
## 2. Method / notes
|
||||
- **Lean on the harness + the Adversary's existing diagnostic methodology.** The classification
|
||||
(recipe vs CI vs test-stale) is exactly what the loops already do per finding — the skill
|
||||
productizes that reasoning, with the harness providing the deterministic evidence.
|
||||
- **Real abra path** throughout (the harness it drives already uses real abra).
|
||||
- The skill's value is turning "N recipes, here's the raw pass/fail" into "here's what's actually
|
||||
wrong and the specific PR that fixes it (recipe vs CI), or all-clear."
|
||||
|
||||
## 3. Guardrails
|
||||
- **Deterministic execution stays AI-free** — AI only diagnoses/classifies/proposes; it never
|
||||
decides pass/fail.
|
||||
- **Propose, don't auto-merge** — recipe PRs are operator-merged only after cc-ci verifies them
|
||||
(the standing recipe-PR rule).
|
||||
- **Don't weaken any test** to make a review "pass."
|
||||
- **Bounded** — one review skill; not a general agent. It runs the suite, diagnoses, reports.
|
||||
|
||||
## 4. Open decisions (log in machine-docs/DECISIONS.md)
|
||||
- Cold vs `--quick` default for the review run (lean cold = authoritative; offer `--quick` fast mode).
|
||||
- Report format (markdown summary + per-recipe table + per-failure diagnosis block).
|
||||
- How "latest published version" + "test staleness" are determined per recipe.
|
||||
- Whether/how the skill optionally invokes `recipe-create-pr` for a proposed recipe PR (opt-in).
|
||||
Reference in New Issue
Block a user