ci-test-review: close the loop — author + open + cc-ci-verify fix PRs (never merge)

Per operator: the skill should not just propose, it should CREATE the fix PR
(recipe repo or cc-ci repo) and VERIFY it green on its own CI server — but not
merge. It drives cc-ci like the loops do.

- SKILL.md: diagnose+classify (recipe vs CI-server) -> author the fix + open a
  PR (recipe-create-pr for recipe PRs; Gitea API for cc-ci PRs, dedicated branch
  in a separate clone, single-writer safe) -> VERIFY on cc-ci (full suite cold
  against the PR head = the !testme dogfood path) -> report a verified,
  ready-to-merge PR. Never merges; never weakens a test; flake != bug. General
  bar = one cold green; repeated-green (REPEAT=3) only for a known-flaky recipe.
  Adds coordination/single-writer guardrails (shared Swarm is stateful; tear
  down deploys; never push main or touch the loops' clones).
- verify-pr.sh: deterministic recipe-PR gate — RECIPE + REF -> cold full suite
  on cc-ci, green iff every repeat exits 0. CI-server-PR verification stays
  bespoke (branch checkout + rebuild + regression sample) per SKILL.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-05-29 17:04:35 +01:00
parent 2530845e50
commit cbe1406bce
2 changed files with 126 additions and 30 deletions

View File

@ -1,17 +1,25 @@
---
name: ci-test-review
description: On-demand AI review of the cc-ci server — runs the deterministic recipe test suite across ALL enrolled recipes, and for any failure diagnoses the root cause and classifies it as needing a RECIPE PR (with the proposed change), a CI-SERVER PR, or a stale-test update; or reports "all passed, recipes + tests up to date". Use when the operator wants a full health check + actionable diagnosis of the recipe CI, beyond the deterministic pass/fail of normal !testme runs. Invoke as /ci-test-review.
description: On-demand AI review + auto-fix of the cc-ci server. Runs the deterministic recipe test suite across ALL enrolled recipes; for any failure it diagnoses the root cause, classifies it as a RECIPE bug or a CI-SERVER bug, AUTHORS the fix, OPENS a PR (recipe repo or cc-ci repo), and VERIFIES that PR green on the CI server (deploy + full suite against the PR head, dogfood) — but NEVER merges (the operator merges). If everything is green and current it reports "all passed, recipes + tests up to date". Use for a full health check that ends in verified, ready-to-merge fix PRs. Invoke as /ci-test-review.
---
# ci-test-review
An **on-demand, AI-driven review** of the cc-ci recipe CI server. cc-ci's primary use is
**deterministic** testing (normal `!testme` / nightly — no AI). This skill adds a review layer: run
the full suite across **all** recipes, then **AI-diagnose** any failures and say exactly what fixes
them — a recipe PR, a CI-server PR, or a test update — or confirm everything is green and current.
An **on-demand, AI-driven review + auto-fix** layer over the cc-ci recipe CI server. cc-ci's primary
use is **deterministic** testing (normal `!testme` / nightly — no AI). This skill runs the full suite
across **all** recipes, then for any failure: **diagnoses** the root cause, **classifies** it (recipe
vs CI-server), **authors the fix, opens a PR, and verifies that PR green on the CI server** — the same
deploy-and-test dogfood the build loops use. It leaves a **verified, ready-to-merge PR**; it does
**NOT merge**. If everything's green and current it reports "all passed".
**Hard boundary:** the test *execution* is deterministic (the cc-ci harness). **AI is used ONLY for
diagnosis / classification / proposing the PR — never to decide pass/fail.**
**This skill drives its own CI server** (cc-ci) the way the loops do: it can push branches, deploy
recipes, and run the harness there to verify a fix end-to-end.
**Hard boundaries:**
- The test *execution and PR verification* are **deterministic** (the cc-ci harness decides
pass/fail). **AI never decides pass/fail** — it only diagnoses, authors the fix, and drives the flow.
- **Create + verify, never merge.** The operator reviews the cc-ci-green PR and merges.
- **Never weaken a test** to make a PR go green.
## Access (this skill runs from the orchestration repo, drives cc-ci over the tailnet)
- cc-ci is reachable via `ssh cc-ci` (root, through the userspace-tailscaled SOCKS proxy on
@ -43,38 +51,78 @@ If every recipe's tiers are `pass`/`skip(N/A)` (no `fail`) **and** every recipe
Stop here. Nothing to fix.
### 3. For each failure → diagnose root cause + CLASSIFY (this is the AI part)
### 3. For each failure → diagnose root cause + CLASSIFY (AI)
For every `fail` (after confirming it's not a flake — see below), read the run log, the recipe's
compose/code (on cc-ci `~/.abra/recipes/<recipe>`), and the harness, then determine the **root cause**
and classify it as **exactly one** of:
- **RECIPE bug → recipe PR.** The recipe itself is broken/fragile (bad healthcheck, missing backup
hook, startup race, etc.). Output the **concrete proposed change** (e.g. "add a collabora WOPI
healthcheck + `start_period`", "add a `pg_dump` backup hook"). This is a PR to the recipe
(recipe-maintainer); it is "working" only once cc-ci verifies it green, then the operator merges.
- **CI-SERVER bug → cc-ci PR.** The harness/test/infra is wrong (a test asserts the wrong thing, a
readiness gate is off, an infra config bug). Output the proposed change to the cc-ci repo.
- **TEST out-of-date → cc-ci test update.** The recipe legitimately changed upstream and the cc-ci
test/overlay needs updating to match. Output the update.
- **RECIPE bug → recipe-side fix.** The recipe itself is broken/fragile (bad healthcheck, missing
backup hook, startup race, etc.). The fix is a change to the recipe (e.g. "add a collabora WOPI
healthcheck + `start_period`", "add a `pg_dump` backup hook").
- **CI-SERVER bug → cc-ci-side fix.** The harness/test/infra is wrong (a test asserts the wrong
thing, a readiness gate is off, an infra config bug). The fix is a change to the cc-ci repo.
- **TEST out-of-date → cc-ci-side fix.** The recipe legitimately changed upstream and the cc-ci
test/overlay needs updating to match.
- **FLAKY / infra-transient.** Distinguish from a real failure: **re-run the recipe once or twice**;
if it then passes, classify as flaky (note it; suggest a robustness fix if there's a pattern) — do
NOT report a flake as a recipe/CI bug.
NOT author a recipe/CI "fix" for a flake.
### 4. Freshness (part of "up to date")
For each recipe, compare the **tested version** to the **latest published** recipe version (the
sweep records both). Flag any recipe tested at a stale pin, and any cc-ci test that lags a recipe
change — "all passed" must mean "passes against current upstream", not against an old pin.
Freshness is part of this: compare each recipe's **tested version** to the **latest published**
version (the sweep records both). A recipe tested at a stale pin, or a cc-ci test lagging an upstream
recipe change, is itself a finding — "all passed" must mean "passes against current upstream".
### 5. Output a structured report
### 4. Author the fix + OPEN a PR (AI authors; never merge)
For each real (non-flaky) finding, write the actual fix and open a PR. **Never merge.**
- **Recipe-side fix → recipe PR.** Branch the recipe, apply the fix, and open the PR via the
**`recipe-create-pr` skill** (`/srv/recipe-maintainer/.opencode/skills/recipe-create-pr/SKILL.md`) —
it handles the mirror to `git.autonomic.zone/recipe-maintainers/<recipe>` (upstream
`git.coopcloud.tech`). Keep the change **bounded** to the diagnosed root cause; don't rewrite the
recipe.
- **CI-server-side fix → cc-ci PR.** Branch the cc-ci product repo
(`recipe-maintainers/cc-ci`), apply the fix, and open the PR via the Gitea API (use the
`GITEA_*` creds from `/srv/cc-ci/.testenv`). **Single-writer discipline:** work on a dedicated
branch in a SEPARATE clone — **never push `main`, never touch the build loops' working clones**
(`/cc-ci`, `/cc-ci-adv`) or their in-flight state.
### 5. VERIFY each PR on the CI server (deterministic; still never merge)
A PR is only "working" once **cc-ci verifies it green** (operator rule) — dogfood the CI that found
the bug. Verification is deterministic (the harness), not an AI judgement.
- **Recipe PR:** run the **full suite COLD against the PR head** on cc-ci — the same path `!testme`
uses — via the helper:
```
RECIPE=<recipe> REF=<pr-head-branch-or-sha> .claude/skills/ci-test-review/verify-pr.sh
```
Green ⇔ the harness exits 0. **General bar = one cold green.** Use `REPEAT=3` (repeated-green)
**only for a recipe already known to be FLAKY** (e.g. lasuite-drive) as a flakiness proof — this is
NOT the general standard.
- **CI-server PR:** check the branch out in a clone on cc-ci, rebuild if the change requires it
(`nixos-rebuild` / flake), then re-run the **previously-failing recipe(s)** through the harness
**plus a small regression sample** of unrelated recipes. Green ⇔ the fix passes AND nothing
regressed.
- **If verification is RED:** iterate the fix on the same branch (**bounded — ≤3 attempts**) and
re-verify. If still red, **leave the PR open** and report it as *"opened, NOT yet green — needs
work"* with the failing evidence. Never abandon silently; never weaken a test to force green.
### 6. Output a structured report
- A per-recipe **status table** (recipe | tiers | version tested / latest | verdict).
- For each failure: **root cause → classification (recipe PR / CI PR / test-update / flaky) →
proposed change** (concrete enough to act on).
- For each finding: **root cause → classification (recipe / CI-server) → PR link → verification
verdict** (`GREEN — cold-verified on cc-ci, ready for operator merge` / `RED — opened, needs work`).
- Or the all-clear summary from step 2.
- Always end with the explicit reminder that **nothing was merged** — the PRs await operator review.
## Guardrails
- **Deterministic execution stays AI-free** — only diagnose; never override the harness's pass/fail.
- **Propose, don't auto-merge.** This skill outputs recommendations + proposed PRs. It does NOT
create or merge PRs. (A recipe PR is operator-merged only after cc-ci verifies it green.) If asked,
it may hand a specific proposed recipe PR to the `recipe-create-pr` skill as an explicit follow-up.
- **Don't weaken any test** to make the review "pass."
- **Flake ≠ bug** — re-run before classifying a failure as a real recipe/CI defect.
- **Deterministic execution + verification stay AI-free** — the harness decides pass/fail; AI only
diagnoses, authors the fix, and drives the flow.
- **Create + verify, NEVER merge.** Leave a cc-ci-green, ready-to-merge PR; the operator merges
(recipe PRs especially are operator-merged only after cc-ci verifies them green).
- **Don't weaken any test** to make a PR go green — the fix must make the recipe/CI genuinely correct.
- **Flake ≠ bug** — re-run before authoring any fix.
- **Real abra path** throughout (no docker-level bypass).
- **Bounded fixes** — each PR targets one diagnosed root cause; not a rewrite.
- **Coordination / single-writer:** while the build loops are actively developing cc-ci, the server
**and** the `recipe-maintainers/cc-ci` repo are in use. Always use a dedicated branch + separate
clone; never push `main` or disturb the loops' clones. Recipe deploys are **stateful on the shared
Swarm**, so run verification when you have effectively-exclusive use of the host (or serialize), and
**tear down whatever you deploy**.

View File

@ -0,0 +1,48 @@
#!/usr/bin/env bash
# ci-test-review :: deterministic PR verification on the CI server (NO AI)
# -------------------------------------------------------------------------
# Verifies a RECIPE PR the same way `!testme` does: deploy + run the FULL suite
# COLD against the PR head on cc-ci. The harness decides pass/fail — this script
# only runs it and reports the exit code + RUN SUMMARY. AI never judges here.
#
# RECIPE=ghost REF=my-fix-branch verify-pr.sh # 1 cold green = working
# RECIPE=lasuite-drive REF=sha REPEAT=3 verify-pr.sh # repeated-green: ONLY
# # for a known-flaky recipe
#
# REF is the recipe PR head (branch name or sha) on the recipe mirror that abra
# fetches from. Green iff EVERY repeat exits 0.
#
# (CI-SERVER PRs are verified differently — check the branch out in a clone on
# cc-ci, rebuild, re-run the failing recipe(s) + a regression sample — see
# SKILL.md step 5; that path is bespoke and not scripted here.)
set -o errexit -o nounset -o pipefail
SSH="${SSH:-cc-ci}"
REPEAT="${REPEAT:-1}"
: "${RECIPE:?set RECIPE (e.g. ghost)}"
: "${REF:?set REF (the PR head branch or sha)}"
RUNID="$(date -u +%Y%m%dT%H%M%SZ)"
REMOTE_LOG="/root/cc-ci-review-logs/verify-${RECIPE}-${RUNID}"
ssh "$SSH" "mkdir -p /root/cc-ci-review-logs"
echo "verify-pr: RECIPE=$RECIPE REF=$REF cold full-suite x${REPEAT} on ${SSH}" >&2
green=1
for i in $(seq 1 "$REPEAT"); do
log="${REMOTE_LOG}.${i}.log"
rc=0
# Real harness, cold (no --quick), against the PR head — same path as !testme.
ssh "$SSH" "cd /root/cc-ci && RECIPE='${RECIPE}' REF='${REF}' cc-ci-run runner/run_recipe_ci.py >'${log}' 2>&1" || rc=$?
echo "--- pass ${i}/${REPEAT}: exit ${rc} (log ${SSH}:${log}) ---" >&2
ssh "$SSH" "awk '/===== RUN SUMMARY =====/{f=1} f{print}' '${log}'" >&2 || true
[ "$rc" = "0" ] || green=0
done
if [ "$green" = "1" ]; then
echo "VERDICT: GREEN — ${RECIPE} PR (REF=${REF}) passed cold full-suite x${REPEAT}. Ready for operator merge (NOT merged)." >&2
exit 0
else
echo "VERDICT: RED — ${RECIPE} PR (REF=${REF}) did not pass. Iterate the fix (<=3 attempts) or report needs-work. Do NOT weaken tests; do NOT merge." >&2
exit 2
fi