From 2530845e501e71da99c432316b2cc86827854c96 Mon Sep 17 00:00:00 2001 From: autonomic-bot Date: Fri, 29 May 2026 16:57:26 +0100 Subject: [PATCH] orchestrator: add /ci-test-review skill (in THIS repo) + drop Phase 3r from loops queue MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The on-demand AI review layer is now an orchestration-repo skill built directly by the orchestrator, NOT a loops phase in the cc-ci product repo: - .claude/skills/ci-test-review/{SKILL.md,run-all-recipes.sh}: runs the real cc-ci harness across all enrolled recipes (deterministic, AI-free execution), then AI diagnoses each failure and classifies it as needing a recipe PR / a CI-server PR / a stale-test update — or reports "ALL PASSED, recipes + tests up to date". Proposes PRs; never decides pass/fail; never auto-merges. - .gitignore: track .claude/skills/ (shareable) while still ignoring local claude session state (locks, history) under .claude/. - launch.sh: remove Phase 3r from PHASES_SPEC; loops sequence back to 1c 1b 1d 1e 2w 2pc 2 2b 3 4. Deleted plan-phase3r (superseded by the skill). Co-Authored-By: Claude Opus 4.8 --- .claude/skills/ci-test-review/SKILL.md | 80 +++++++++++++++++ .../skills/ci-test-review/run-all-recipes.sh | 87 +++++++++++++++++++ .../plan-phase3r-ci-test-review-skill.md | 83 ------------------ 3 files changed, 167 insertions(+), 83 deletions(-) create mode 100644 .claude/skills/ci-test-review/SKILL.md create mode 100755 .claude/skills/ci-test-review/run-all-recipes.sh delete mode 100644 cc-ci-plan/plan-phase3r-ci-test-review-skill.md diff --git a/.claude/skills/ci-test-review/SKILL.md b/.claude/skills/ci-test-review/SKILL.md new file mode 100644 index 0000000..52241f2 --- /dev/null +++ b/.claude/skills/ci-test-review/SKILL.md @@ -0,0 +1,80 @@ +--- +name: ci-test-review +description: On-demand AI review of the cc-ci server — runs the deterministic recipe test suite across ALL enrolled recipes, and for any failure diagnoses the root cause and classifies it as needing a RECIPE PR (with the proposed change), a CI-SERVER PR, or a stale-test update; or reports "all passed, recipes + tests up to date". Use when the operator wants a full health check + actionable diagnosis of the recipe CI, beyond the deterministic pass/fail of normal !testme runs. Invoke as /ci-test-review. +--- + +# ci-test-review + +An **on-demand, AI-driven review** of the cc-ci recipe CI server. cc-ci's primary use is +**deterministic** testing (normal `!testme` / nightly — no AI). This skill adds a review layer: run +the full suite across **all** recipes, then **AI-diagnose** any failures and say exactly what fixes +them — a recipe PR, a CI-server PR, or a test update — or confirm everything is green and current. + +**Hard boundary:** the test *execution* is deterministic (the cc-ci harness). **AI is used ONLY for +diagnosis / classification / proposing the PR — never to decide pass/fail.** + +## Access (this skill runs from the orchestration repo, drives cc-ci over the tailnet) +- cc-ci is reachable via `ssh cc-ci` (root, through the userspace-tailscaled SOCKS proxy on + 127.0.0.1:1055). If `ssh cc-ci` fails, restart the proxy (`systemctl restart cc-ci-tailscaled`) per + the orchestrator's access notes, then retry. +- The cc-ci repo + harness live at `/root/cc-ci` on the cc-ci VM; the runner is + `cc-ci-run runner/run_recipe_ci.py` with `RECIPE=` (+ optional `STAGES=...`). + +## Procedure + +### 1. Run the deterministic sweep (no AI) +Run the helper, which enumerates enrolled recipes and runs the **full suite per recipe** on cc-ci, +emitting one structured result line per recipe (per-tier pass/fail/skip + the tested vs latest +published version): + +``` +bash "$(dirname "$0")/run-all-recipes.sh" # or: .claude/skills/ci-test-review/run-all-recipes.sh +``` + +It writes a results table to stdout and a machine-readable summary to +`/srv/cc-ci/.cc-ci-logs/ci-test-review-.json`. Do NOT hand-judge pass/fail — trust the +harness's per-tier results. + +### 2. If everything is green AND current → report all-clear +If every recipe's tiers are `pass`/`skip(N/A)` (no `fail`) **and** every recipe was tested at its +**latest published version** with no stale test flagged, output: + +> **ALL PASSED — recipes up to date, tests up to date.** + +Stop here. Nothing to fix. + +### 3. For each failure → diagnose root cause + CLASSIFY (this is the AI part) +For every `fail` (after confirming it's not a flake — see below), read the run log, the recipe's +compose/code (on cc-ci `~/.abra/recipes/`), and the harness, then determine the **root cause** +and classify it as **exactly one** of: + +- **RECIPE bug → recipe PR.** The recipe itself is broken/fragile (bad healthcheck, missing backup + hook, startup race, etc.). Output the **concrete proposed change** (e.g. "add a collabora WOPI + healthcheck + `start_period`", "add a `pg_dump` backup hook"). This is a PR to the recipe + (recipe-maintainer); it is "working" only once cc-ci verifies it green, then the operator merges. +- **CI-SERVER bug → cc-ci PR.** The harness/test/infra is wrong (a test asserts the wrong thing, a + readiness gate is off, an infra config bug). Output the proposed change to the cc-ci repo. +- **TEST out-of-date → cc-ci test update.** The recipe legitimately changed upstream and the cc-ci + test/overlay needs updating to match. Output the update. +- **FLAKY / infra-transient.** Distinguish from a real failure: **re-run the recipe once or twice**; + if it then passes, classify as flaky (note it; suggest a robustness fix if there's a pattern) — do + NOT report a flake as a recipe/CI bug. + +### 4. Freshness (part of "up to date") +For each recipe, compare the **tested version** to the **latest published** recipe version (the +sweep records both). Flag any recipe tested at a stale pin, and any cc-ci test that lags a recipe +change — "all passed" must mean "passes against current upstream", not against an old pin. + +### 5. Output a structured report +- A per-recipe **status table** (recipe | tiers | version tested / latest | verdict). +- For each failure: **root cause → classification (recipe PR / CI PR / test-update / flaky) → + proposed change** (concrete enough to act on). +- Or the all-clear summary from step 2. + +## Guardrails +- **Deterministic execution stays AI-free** — only diagnose; never override the harness's pass/fail. +- **Propose, don't auto-merge.** This skill outputs recommendations + proposed PRs. It does NOT + create or merge PRs. (A recipe PR is operator-merged only after cc-ci verifies it green.) If asked, + it may hand a specific proposed recipe PR to the `recipe-create-pr` skill as an explicit follow-up. +- **Don't weaken any test** to make the review "pass." +- **Flake ≠ bug** — re-run before classifying a failure as a real recipe/CI defect. diff --git a/.claude/skills/ci-test-review/run-all-recipes.sh b/.claude/skills/ci-test-review/run-all-recipes.sh new file mode 100755 index 0000000..bdc34ec --- /dev/null +++ b/.claude/skills/ci-test-review/run-all-recipes.sh @@ -0,0 +1,87 @@ +#!/usr/bin/env bash +# ci-test-review :: deterministic sweep (NO AI) +# ------------------------------------------------------------------------- +# Runs the real cc-ci harness (runner/run_recipe_ci.py) across all enrolled +# recipes on the cc-ci VM, full suite per recipe, and emits: +# - a per-recipe status table to stdout +# - a machine-readable JSON summary to $OUT (default /srv/cc-ci/.cc-ci-logs/) +# Full per-recipe logs are kept ON cc-ci under /root/cc-ci-review-logs// +# so the AI diagnosis step can read them via `ssh cc-ci`. +# +# This script makes NO pass/fail judgement of its own — it records the +# harness's exit code (0=green, 1=fail) and its RUN SUMMARY block verbatim. +# AI diagnosis/classification happens AFTER, per SKILL.md. +# +# Env knobs (all optional): +# RECIPES="ghost n8n" limit to these recipes (default: all enrolled) +# STAGES="install" pass through to harness (default: full suite) +# QUICK=1 add --quick fast lane (default: cold/authoritative) +# SSH=cc-ci ssh host alias (default: cc-ci) +# OUT=/path/dir local dir for the JSON summary (default .cc-ci-logs) +set -o errexit -o nounset -o pipefail + +SSH="${SSH:-cc-ci}" +OUT="${OUT:-/srv/cc-ci/.cc-ci-logs}" +RUNID="$(date -u +%Y%m%dT%H%M%SZ)" +REMOTE_LOGDIR="/root/cc-ci-review-logs/${RUNID}" +SUMMARY="${OUT}/ci-test-review-${RUNID}.json" +mkdir -p "$OUT" + +quick_flag=""; [ "${QUICK:-0}" = "1" ] && quick_flag="--quick" +stages_env=""; [ -n "${STAGES:-}" ] && stages_env="STAGES='${STAGES}'" + +echo "ci-test-review sweep ${RUNID} (host=${SSH} quick=${QUICK:-0} stages=${STAGES:-all})" >&2 + +# 1. Enumerate enrolled recipes (tests// dirs, minus harness scaffolding). +if [ -n "${RECIPES:-}" ]; then + recipes="$RECIPES" +else + recipes="$(ssh "$SSH" 'cd /root/cc-ci/tests 2>/dev/null && ls -d */ 2>/dev/null' \ + | sed 's#/##' | grep -vE '^(_generic|unit|__pycache__)$' || true)" +fi +[ -n "$recipes" ] || { echo "no enrolled recipes found" >&2; exit 1; } +echo "recipes: $(echo $recipes | tr '\n' ' ')" >&2 + +ssh "$SSH" "mkdir -p '$REMOTE_LOGDIR'" + +# 2. Run the harness per recipe (deterministic). Capture exit code + RUN SUMMARY + version. +printf '%-18s %-7s %s\n' "RECIPE" "VERDICT" "TIERS (from harness RUN SUMMARY)" +printf '%-18s %-7s %s\n' "------" "-------" "--------------------------------" + +json_items="" +for r in $recipes; do + log="${REMOTE_LOGDIR}/${r}.log" + # Real harness invocation (real abra inside). Tee full log on cc-ci; capture rc. + rc=0 + ssh "$SSH" "cd /root/cc-ci && ${stages_env} RECIPE='${r}' cc-ci-run runner/run_recipe_ci.py ${quick_flag} >'${log}' 2>&1" || rc=$? + # Pull back the RUN SUMMARY block + the abra-checked-out recipe version, if recorded. + summary_block="$(ssh "$SSH" "awk '/===== RUN SUMMARY =====/{f=1} f{print}' '${log}' 2>/dev/null | sed 's/^/ /'" )" + tiers="$(ssh "$SSH" "awk '/===== RUN SUMMARY =====/{f=1} f&&/^ (install|upgrade|backup|restore|custom)/{gsub(/^ /,\"\");print}' '${log}' 2>/dev/null" | tr '\n' ';')" + version="$(ssh "$SSH" "grep -oE 'checked out [^ ]+ at [^ ]+|recipe version[: ]+[^ ]+|VERSION=[^ ]+' '${log}' 2>/dev/null | head -1" || true)" + verdict=GREEN; [ "$rc" = "0" ] || verdict=FAIL + printf '%-18s %-7s %s\n' "$r" "$verdict" "${tiers:-}" + # accumulate JSON (escape quotes/newlines minimally) + esc_tiers="$(printf '%s' "$tiers" | sed 's/"/\\"/g')" + esc_ver="$(printf '%s' "$version" | sed 's/"/\\"/g')" + json_items="${json_items}${json_items:+,} + {\"recipe\":\"${r}\",\"rc\":${rc},\"verdict\":\"${verdict}\",\"tiers\":\"${esc_tiers}\",\"version\":\"${esc_ver}\",\"log\":\"${SSH}:${log}\"}" +done + +# 3. JSON summary for the AI diagnosis step. +cat > "$SUMMARY" <&2 +echo "JSON summary : ${SUMMARY}" >&2 +echo "full logs : ${SSH}:${REMOTE_LOGDIR}/.log (read via ssh for diagnosis)" >&2 +# Exit non-zero if any recipe failed, so a caller can branch; AI still diagnoses either way. +grep -q '"verdict":"FAIL"' "$SUMMARY" && exit 2 || exit 0 diff --git a/cc-ci-plan/plan-phase3r-ci-test-review-skill.md b/cc-ci-plan/plan-phase3r-ci-test-review-skill.md deleted file mode 100644 index 5399b30..0000000 --- a/cc-ci-plan/plan-phase3r-ci-test-review-skill.md +++ /dev/null @@ -1,83 +0,0 @@ -# cc-ci Phase 3r — `/ci-test-review` Claude skill (on-demand AI review + diagnosis) - -**Status:** QUEUED — after Phase 3 (results UX), before Phase 4 (final review). A **Claude skill that -ships in the cc-ci product repo** (`.claude/skills/ci-test-review/`), built by the loops. -**Transition:** auto (in the launcher sequence). **Owner:** Builder + Adversary loops. -**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase3r-ci-test-review-skill.md` -**Phase order:** … 3 → **3r** → 4. - ---- - -## 0. Why — two distinct modes - -cc-ci's **primary** use is **deterministic testing with NO AI**: `!testme` / the nightly sweep run the -harness, recipes pass/fail, done. That stays the bread-and-butter and is unaffected by this phase. - -This phase adds a **separate, on-demand, AI-driven REVIEW layer** — the `/ci-test-review` skill — that -a maintainer (or a fresh Claude) invokes to: run the full suite across **all** recipes, and **for any -failure, diagnose the root cause and classify it** — does it need a **PR to the recipe** (and what -that PR is), a **PR to the CI server** (harness/test/infra), or is the **test out of date**? — or, if -everything's green and current, report **"all passed, recipes + tests up to date."** It codifies the -recipe-PR-vs-CI-PR judgment the Adversary already exercises (e.g. lasuite-drive→recipe PR, immich -pg_dump→recipe PR) into a runnable tool. - -**Crisp boundary:** the test *execution* is deterministic (the existing harness); **AI is used ONLY -for diagnosis/classification/PR-proposal**, never to decide pass/fail. - -## 1. Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-3r.md`) - -- [ ] **R1 — Skill exists + invocable.** `.claude/skills/ci-test-review/SKILL.md` (+ helper script) - in the cc-ci repo, invocable as `/ci-test-review`. Reuses the existing harness + recipe - enrollment list — does NOT reinvent the runner. -- [ ] **R2 — Deterministic execution.** Runs the real harness (`run_recipe_ci.py`) across **all - enrolled recipes**, full suite per recipe, collecting per-recipe / per-tier pass/fail/skip. - No AI in execution. (Decide cold vs `--quick`: default **cold** for an authoritative review; - `--quick` as a fast-mode option.) -- [ ] **R3 — "Up to date" freshness.** Tests each recipe at its **latest published version** (not a - stale pin) and flags any recipe whose tested version lags upstream, and any test that's stale - vs the recipe — so "all passed" means "passes against current upstream," not against an old pin. -- [ ] **R4 — All-green output.** If every recipe passes AND is current, emit a clear **"ALL PASSED — - recipes up to date, tests up to date"** summary (with versions tested). -- [ ] **R5 — Failure diagnosis + classification.** For each failure, AI determines root cause and - classifies it as exactly one of: - - **RECIPE bug** → needs a **recipe PR**; output the concrete proposed change (e.g. "add - collabora WOPI healthcheck + start_period", "add pg_dump backup hook"). - - **CI-SERVER bug** → needs a **cc-ci PR** (harness/test/infra); output the proposed change. - - **TEST out-of-date** → the recipe legitimately changed; the cc-ci test needs updating; output - the update. - - **FLAKY / infra** → distinguish from a real failure (e.g. re-run) so a flake isn't - misclassified as a recipe/CI bug. -- [ ] **R6 — Structured report; PROPOSE not auto-merge.** Output a structured report (per recipe - status; per failure: root cause + classification + proposed PR target + change sketch), or the - all-green summary. It **proposes** PRs — it does **NOT** auto-create or merge them (the operator - merges recipe PRs only after cc-ci verifies them green, per the recipe-PR rule). It MAY offer to - hand a proposed recipe PR to the `recipe-create-pr` skill as an explicit, opt-in follow-up. -- [ ] **R7 — Documented as distinct from deterministic CI.** `docs/` makes clear the default - `!testme`/nightly path is deterministic (no AI); `/ci-test-review` is the on-demand AI review - layer. A maintainer who wants zero AI never invokes it. -- [ ] **R8 — Adversary-verified behavior.** Proven: reports ALL-GREEN correctly when green; on a - **seeded recipe bug** classifies → recipe-PR with a sane proposed change; on a **seeded harness - bug** classifies → CI-PR; the freshness check flags a deliberately stale-pinned recipe; a flake - isn't misreported as a real bug. - -## 2. Method / notes -- **Lean on the harness + the Adversary's existing diagnostic methodology.** The classification - (recipe vs CI vs test-stale) is exactly what the loops already do per finding — the skill - productizes that reasoning, with the harness providing the deterministic evidence. -- **Real abra path** throughout (the harness it drives already uses real abra). -- The skill's value is turning "N recipes, here's the raw pass/fail" into "here's what's actually - wrong and the specific PR that fixes it (recipe vs CI), or all-clear." - -## 3. Guardrails -- **Deterministic execution stays AI-free** — AI only diagnoses/classifies/proposes; it never - decides pass/fail. -- **Propose, don't auto-merge** — recipe PRs are operator-merged only after cc-ci verifies them - (the standing recipe-PR rule). -- **Don't weaken any test** to make a review "pass." -- **Bounded** — one review skill; not a general agent. It runs the suite, diagnoses, reports. - -## 4. Open decisions (log in machine-docs/DECISIONS.md) -- Cold vs `--quick` default for the review run (lean cold = authoritative; offer `--quick` fast mode). -- Report format (markdown summary + per-recipe table + per-failure diagnosis block). -- How "latest published version" + "test staleness" are determined per recipe. -- Whether/how the skill optionally invokes `recipe-create-pr` for a proposed recipe PR (opt-in).