ci-test-review: close the loop — author + open + cc-ci-verify fix PRs (never merge)

Per operator: the skill should not just propose, it should CREATE the fix PR (recipe repo or cc-ci repo) and VERIFY it green on its own CI server — but not merge. It drives cc-ci like the loops do. - SKILL.md: diagnose+classify (recipe vs CI-server) -> author the fix + open a PR (recipe-create-pr for recipe PRs; Gitea API for cc-ci PRs, dedicated branch in a separate clone, single-writer safe) -> VERIFY on cc-ci (full suite cold against the PR head = the !testme dogfood path) -> report a verified, ready-to-merge PR. Never merges; never weakens a test; flake != bug. General bar = one cold green; repeated-green (REPEAT=3) only for a known-flaky recipe. Adds coordination/single-writer guardrails (shared Swarm is stateful; tear down deploys; never push main or touch the loops' clones). - verify-pr.sh: deterministic recipe-PR gate — RECIPE + REF -> cold full suite on cc-ci, green iff every repeat exits 0. CI-server-PR verification stays bespoke (branch checkout + rebuild + regression sample) per SKILL.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 17:04:35 +01:00
parent 2530845e50
commit cbe1406bce
2 changed files with 126 additions and 30 deletions
--- a/.claude/skills/ci-test-review/SKILL.md
+++ b/.claude/skills/ci-test-review/SKILL.md
@ -1,17 +1,25 @@
 ---
 name: ci-test-review
-description: On-demand AI review of the cc-ci server — runs the deterministic recipe test suite across ALL enrolled recipes, and for any failure diagnoses the root cause and classifies it as needing a RECIPE PR (with the proposed change), a CI-SERVER PR, or a stale-test update; or reports "all passed, recipes + tests up to date". Use when the operator wants a full health check + actionable diagnosis of the recipe CI, beyond the deterministic pass/fail of normal !testme runs. Invoke as /ci-test-review.
+description: On-demand AI review + auto-fix of the cc-ci server. Runs the deterministic recipe test suite across ALL enrolled recipes; for any failure it diagnoses the root cause, classifies it as a RECIPE bug or a CI-SERVER bug, AUTHORS the fix, OPENS a PR (recipe repo or cc-ci repo), and VERIFIES that PR green on the CI server (deploy + full suite against the PR head, dogfood) — but NEVER merges (the operator merges). If everything is green and current it reports "all passed, recipes + tests up to date". Use for a full health check that ends in verified, ready-to-merge fix PRs. Invoke as /ci-test-review.
 ---

 # ci-test-review

-An **on-demand, AI-driven review** of the cc-ci recipe CI server. cc-ci's primary use is
-**deterministic** testing (normal `!testme` / nightly — no AI). This skill adds a review layer: run
-the full suite across **all** recipes, then **AI-diagnose** any failures and say exactly what fixes
-them — a recipe PR, a CI-server PR, or a test update — or confirm everything is green and current.
+An **on-demand, AI-driven review + auto-fix** layer over the cc-ci recipe CI server. cc-ci's primary
+use is **deterministic** testing (normal `!testme` / nightly — no AI). This skill runs the full suite
+across **all** recipes, then for any failure: **diagnoses** the root cause, **classifies** it (recipe
+vs CI-server), **authors the fix, opens a PR, and verifies that PR green on the CI server** — the same
+deploy-and-test dogfood the build loops use. It leaves a **verified, ready-to-merge PR**; it does
+**NOT merge**. If everything's green and current it reports "all passed".

-**Hard boundary:** the test *execution* is deterministic (the cc-ci harness). **AI is used ONLY for
-diagnosis / classification / proposing the PR — never to decide pass/fail.**
+**This skill drives its own CI server** (cc-ci) the way the loops do: it can push branches, deploy
+recipes, and run the harness there to verify a fix end-to-end.
+
+**Hard boundaries:**
+- The test *execution and PR verification* are **deterministic** (the cc-ci harness decides
+  pass/fail). **AI never decides pass/fail** — it only diagnoses, authors the fix, and drives the flow.
+- **Create + verify, never merge.** The operator reviews the cc-ci-green PR and merges.
+- **Never weaken a test** to make a PR go green.

 ## Access (this skill runs from the orchestration repo, drives cc-ci over the tailnet)
 - cc-ci is reachable via `ssh cc-ci` (root, through the userspace-tailscaled SOCKS proxy on
@ -43,38 +51,78 @@ If every recipe's tiers are `pass`/`skip(N/A)` (no `fail`) **and** every recipe

 Stop here. Nothing to fix.

-### 3. For each failure → diagnose root cause + CLASSIFY (this is the AI part)
+### 3. For each failure → diagnose root cause + CLASSIFY (AI)
 For every `fail` (after confirming it's not a flake — see below), read the run log, the recipe's
 compose/code (on cc-ci `~/.abra/recipes/<recipe>`), and the harness, then determine the **root cause**
 and classify it as **exactly one** of:

- **RECIPE bug → recipe PR.** The recipe itself is broken/fragile (bad healthcheck, missing backup
-  hook, startup race, etc.). Output the **concrete proposed change** (e.g. "add a collabora WOPI
-  healthcheck + `start_period`", "add a `pg_dump` backup hook"). This is a PR to the recipe
-  (recipe-maintainer); it is "working" only once cc-ci verifies it green, then the operator merges.
- **CI-SERVER bug → cc-ci PR.** The harness/test/infra is wrong (a test asserts the wrong thing, a
-  readiness gate is off, an infra config bug). Output the proposed change to the cc-ci repo.
- **TEST out-of-date → cc-ci test update.** The recipe legitimately changed upstream and the cc-ci
-  test/overlay needs updating to match. Output the update.
+- **RECIPE bug → recipe-side fix.** The recipe itself is broken/fragile (bad healthcheck, missing
+  backup hook, startup race, etc.). The fix is a change to the recipe (e.g. "add a collabora WOPI
+  healthcheck + `start_period`", "add a `pg_dump` backup hook").
+- **CI-SERVER bug → cc-ci-side fix.** The harness/test/infra is wrong (a test asserts the wrong
+  thing, a readiness gate is off, an infra config bug). The fix is a change to the cc-ci repo.
+- **TEST out-of-date → cc-ci-side fix.** The recipe legitimately changed upstream and the cc-ci
+  test/overlay needs updating to match.
 - **FLAKY / infra-transient.** Distinguish from a real failure: **re-run the recipe once or twice**;
  if it then passes, classify as flaky (note it; suggest a robustness fix if there's a pattern) — do
-  NOT report a flake as a recipe/CI bug.
+  NOT author a recipe/CI "fix" for a flake.

-### 4. Freshness (part of "up to date")
-For each recipe, compare the **tested version** to the **latest published** recipe version (the
-sweep records both). Flag any recipe tested at a stale pin, and any cc-ci test that lags a recipe
-change — "all passed" must mean "passes against current upstream", not against an old pin.
+Freshness is part of this: compare each recipe's **tested version** to the **latest published**
+version (the sweep records both). A recipe tested at a stale pin, or a cc-ci test lagging an upstream
+recipe change, is itself a finding — "all passed" must mean "passes against current upstream".

-### 5. Output a structured report
+### 4. Author the fix + OPEN a PR (AI authors; never merge)
+For each real (non-flaky) finding, write the actual fix and open a PR. **Never merge.**
+
+- **Recipe-side fix → recipe PR.** Branch the recipe, apply the fix, and open the PR via the
+  **`recipe-create-pr` skill** (`/srv/recipe-maintainer/.opencode/skills/recipe-create-pr/SKILL.md`) —
+  it handles the mirror to `git.autonomic.zone/recipe-maintainers/<recipe>` (upstream
+  `git.coopcloud.tech`). Keep the change **bounded** to the diagnosed root cause; don't rewrite the
+  recipe.
+- **CI-server-side fix → cc-ci PR.** Branch the cc-ci product repo
+  (`recipe-maintainers/cc-ci`), apply the fix, and open the PR via the Gitea API (use the
+  `GITEA_*` creds from `/srv/cc-ci/.testenv`). **Single-writer discipline:** work on a dedicated
+  branch in a SEPARATE clone — **never push `main`, never touch the build loops' working clones**
+  (`/cc-ci`, `/cc-ci-adv`) or their in-flight state.
+
+### 5. VERIFY each PR on the CI server (deterministic; still never merge)
+A PR is only "working" once **cc-ci verifies it green** (operator rule) — dogfood the CI that found
+the bug. Verification is deterministic (the harness), not an AI judgement.
+
+- **Recipe PR:** run the **full suite COLD against the PR head** on cc-ci — the same path `!testme`
+  uses — via the helper:
+  ```
+  RECIPE=<recipe> REF=<pr-head-branch-or-sha> .claude/skills/ci-test-review/verify-pr.sh
+  ```
+  Green ⇔ the harness exits 0. **General bar = one cold green.** Use `REPEAT=3` (repeated-green)
+  **only for a recipe already known to be FLAKY** (e.g. lasuite-drive) as a flakiness proof — this is
+  NOT the general standard.
+- **CI-server PR:** check the branch out in a clone on cc-ci, rebuild if the change requires it
+  (`nixos-rebuild` / flake), then re-run the **previously-failing recipe(s)** through the harness
+  **plus a small regression sample** of unrelated recipes. Green ⇔ the fix passes AND nothing
+  regressed.
+- **If verification is RED:** iterate the fix on the same branch (**bounded — ≤3 attempts**) and
+  re-verify. If still red, **leave the PR open** and report it as *"opened, NOT yet green — needs
+  work"* with the failing evidence. Never abandon silently; never weaken a test to force green.
+
+### 6. Output a structured report
 - A per-recipe **status table** (recipe | tiers | version tested / latest | verdict).
- For each failure: **root cause → classification (recipe PR / CI PR / test-update / flaky) →
-  proposed change** (concrete enough to act on).
+- For each finding: **root cause → classification (recipe / CI-server) → PR link → verification
+  verdict** (`GREEN — cold-verified on cc-ci, ready for operator merge` / `RED — opened, needs work`).
 - Or the all-clear summary from step 2.
+- Always end with the explicit reminder that **nothing was merged** — the PRs await operator review.

 ## Guardrails
- **Deterministic execution stays AI-free** — only diagnose; never override the harness's pass/fail.
- **Propose, don't auto-merge.** This skill outputs recommendations + proposed PRs. It does NOT
-  create or merge PRs. (A recipe PR is operator-merged only after cc-ci verifies it green.) If asked,
-  it may hand a specific proposed recipe PR to the `recipe-create-pr` skill as an explicit follow-up.
- **Don't weaken any test** to make the review "pass."
- **Flake ≠ bug** — re-run before classifying a failure as a real recipe/CI defect.
+- **Deterministic execution + verification stay AI-free** — the harness decides pass/fail; AI only
+  diagnoses, authors the fix, and drives the flow.
+- **Create + verify, NEVER merge.** Leave a cc-ci-green, ready-to-merge PR; the operator merges
+  (recipe PRs especially are operator-merged only after cc-ci verifies them green).
+- **Don't weaken any test** to make a PR go green — the fix must make the recipe/CI genuinely correct.
+- **Flake ≠ bug** — re-run before authoring any fix.
+- **Real abra path** throughout (no docker-level bypass).
+- **Bounded fixes** — each PR targets one diagnosed root cause; not a rewrite.
+- **Coordination / single-writer:** while the build loops are actively developing cc-ci, the server
+  **and** the `recipe-maintainers/cc-ci` repo are in use. Always use a dedicated branch + separate
+  clone; never push `main` or disturb the loops' clones. Recipe deploys are **stateful on the shared
+  Swarm**, so run verification when you have effectively-exclusive use of the host (or serialize), and
+  **tear down whatever you deploy**.
--- a/.claude/skills/ci-test-review/verify-pr.sh
+++ b/.claude/skills/ci-test-review/verify-pr.sh
@ -0,0 +1,48 @@
+#!/usr/bin/env bash
+# ci-test-review :: deterministic PR verification on the CI server (NO AI)
+# -------------------------------------------------------------------------
+# Verifies a RECIPE PR the same way `!testme` does: deploy + run the FULL suite
+# COLD against the PR head on cc-ci. The harness decides pass/fail — this script
+# only runs it and reports the exit code + RUN SUMMARY. AI never judges here.
+#
+#   RECIPE=ghost REF=my-fix-branch  verify-pr.sh        # 1 cold green = working
+#   RECIPE=lasuite-drive REF=sha REPEAT=3 verify-pr.sh  # repeated-green: ONLY
+#                                                       # for a known-flaky recipe
+#
+# REF is the recipe PR head (branch name or sha) on the recipe mirror that abra
+# fetches from. Green iff EVERY repeat exits 0.
+#
+# (CI-SERVER PRs are verified differently — check the branch out in a clone on
+#  cc-ci, rebuild, re-run the failing recipe(s) + a regression sample — see
+#  SKILL.md step 5; that path is bespoke and not scripted here.)
+set -o errexit -o nounset -o pipefail
+
+SSH="${SSH:-cc-ci}"
+REPEAT="${REPEAT:-1}"
+: "${RECIPE:?set RECIPE (e.g. ghost)}"
+: "${REF:?set REF (the PR head branch or sha)}"
+
+RUNID="$(date -u +%Y%m%dT%H%M%SZ)"
+REMOTE_LOG="/root/cc-ci-review-logs/verify-${RECIPE}-${RUNID}"
+ssh "$SSH" "mkdir -p /root/cc-ci-review-logs"
+
+echo "verify-pr: RECIPE=$RECIPE REF=$REF cold full-suite x${REPEAT} on ${SSH}" >&2
+
+green=1
+for i in $(seq 1 "$REPEAT"); do
+  log="${REMOTE_LOG}.${i}.log"
+  rc=0
+  # Real harness, cold (no --quick), against the PR head — same path as !testme.
+  ssh "$SSH" "cd /root/cc-ci && RECIPE='${RECIPE}' REF='${REF}' cc-ci-run runner/run_recipe_ci.py >'${log}' 2>&1" || rc=$?
+  echo "--- pass ${i}/${REPEAT}: exit ${rc} (log ${SSH}:${log}) ---" >&2
+  ssh "$SSH" "awk '/===== RUN SUMMARY =====/{f=1} f{print}' '${log}'" >&2 || true
+  [ "$rc" = "0" ] || green=0
+done
+
+if [ "$green" = "1" ]; then
+  echo "VERDICT: GREEN — ${RECIPE} PR (REF=${REF}) passed cold full-suite x${REPEAT}. Ready for operator merge (NOT merged)." >&2
+  exit 0
+else
+  echo "VERDICT: RED — ${RECIPE} PR (REF=${REF}) did not pass. Iterate the fix (<=3 attempts) or report needs-work. Do NOT weaken tests; do NOT merge." >&2
+  exit 2
+fi