diff --git a/.claude/skills/ci-test-review/SKILL.md b/.claude/skills/ci-test-review/SKILL.md index 52241f2..0e7c54a 100644 --- a/.claude/skills/ci-test-review/SKILL.md +++ b/.claude/skills/ci-test-review/SKILL.md @@ -1,17 +1,25 @@ --- name: ci-test-review -description: On-demand AI review of the cc-ci server — runs the deterministic recipe test suite across ALL enrolled recipes, and for any failure diagnoses the root cause and classifies it as needing a RECIPE PR (with the proposed change), a CI-SERVER PR, or a stale-test update; or reports "all passed, recipes + tests up to date". Use when the operator wants a full health check + actionable diagnosis of the recipe CI, beyond the deterministic pass/fail of normal !testme runs. Invoke as /ci-test-review. +description: On-demand AI review + auto-fix of the cc-ci server. Runs the deterministic recipe test suite across ALL enrolled recipes; for any failure it diagnoses the root cause, classifies it as a RECIPE bug or a CI-SERVER bug, AUTHORS the fix, OPENS a PR (recipe repo or cc-ci repo), and VERIFIES that PR green on the CI server (deploy + full suite against the PR head, dogfood) — but NEVER merges (the operator merges). If everything is green and current it reports "all passed, recipes + tests up to date". Use for a full health check that ends in verified, ready-to-merge fix PRs. Invoke as /ci-test-review. --- # ci-test-review -An **on-demand, AI-driven review** of the cc-ci recipe CI server. cc-ci's primary use is -**deterministic** testing (normal `!testme` / nightly — no AI). This skill adds a review layer: run -the full suite across **all** recipes, then **AI-diagnose** any failures and say exactly what fixes -them — a recipe PR, a CI-server PR, or a test update — or confirm everything is green and current. +An **on-demand, AI-driven review + auto-fix** layer over the cc-ci recipe CI server. cc-ci's primary +use is **deterministic** testing (normal `!testme` / nightly — no AI). This skill runs the full suite +across **all** recipes, then for any failure: **diagnoses** the root cause, **classifies** it (recipe +vs CI-server), **authors the fix, opens a PR, and verifies that PR green on the CI server** — the same +deploy-and-test dogfood the build loops use. It leaves a **verified, ready-to-merge PR**; it does +**NOT merge**. If everything's green and current it reports "all passed". -**Hard boundary:** the test *execution* is deterministic (the cc-ci harness). **AI is used ONLY for -diagnosis / classification / proposing the PR — never to decide pass/fail.** +**This skill drives its own CI server** (cc-ci) the way the loops do: it can push branches, deploy +recipes, and run the harness there to verify a fix end-to-end. + +**Hard boundaries:** +- The test *execution and PR verification* are **deterministic** (the cc-ci harness decides + pass/fail). **AI never decides pass/fail** — it only diagnoses, authors the fix, and drives the flow. +- **Create + verify, never merge.** The operator reviews the cc-ci-green PR and merges. +- **Never weaken a test** to make a PR go green. ## Access (this skill runs from the orchestration repo, drives cc-ci over the tailnet) - cc-ci is reachable via `ssh cc-ci` (root, through the userspace-tailscaled SOCKS proxy on @@ -43,38 +51,78 @@ If every recipe's tiers are `pass`/`skip(N/A)` (no `fail`) **and** every recipe Stop here. Nothing to fix. -### 3. For each failure → diagnose root cause + CLASSIFY (this is the AI part) +### 3. For each failure → diagnose root cause + CLASSIFY (AI) For every `fail` (after confirming it's not a flake — see below), read the run log, the recipe's compose/code (on cc-ci `~/.abra/recipes/`), and the harness, then determine the **root cause** and classify it as **exactly one** of: -- **RECIPE bug → recipe PR.** The recipe itself is broken/fragile (bad healthcheck, missing backup - hook, startup race, etc.). Output the **concrete proposed change** (e.g. "add a collabora WOPI - healthcheck + `start_period`", "add a `pg_dump` backup hook"). This is a PR to the recipe - (recipe-maintainer); it is "working" only once cc-ci verifies it green, then the operator merges. -- **CI-SERVER bug → cc-ci PR.** The harness/test/infra is wrong (a test asserts the wrong thing, a - readiness gate is off, an infra config bug). Output the proposed change to the cc-ci repo. -- **TEST out-of-date → cc-ci test update.** The recipe legitimately changed upstream and the cc-ci - test/overlay needs updating to match. Output the update. +- **RECIPE bug → recipe-side fix.** The recipe itself is broken/fragile (bad healthcheck, missing + backup hook, startup race, etc.). The fix is a change to the recipe (e.g. "add a collabora WOPI + healthcheck + `start_period`", "add a `pg_dump` backup hook"). +- **CI-SERVER bug → cc-ci-side fix.** The harness/test/infra is wrong (a test asserts the wrong + thing, a readiness gate is off, an infra config bug). The fix is a change to the cc-ci repo. +- **TEST out-of-date → cc-ci-side fix.** The recipe legitimately changed upstream and the cc-ci + test/overlay needs updating to match. - **FLAKY / infra-transient.** Distinguish from a real failure: **re-run the recipe once or twice**; if it then passes, classify as flaky (note it; suggest a robustness fix if there's a pattern) — do - NOT report a flake as a recipe/CI bug. + NOT author a recipe/CI "fix" for a flake. -### 4. Freshness (part of "up to date") -For each recipe, compare the **tested version** to the **latest published** recipe version (the -sweep records both). Flag any recipe tested at a stale pin, and any cc-ci test that lags a recipe -change — "all passed" must mean "passes against current upstream", not against an old pin. +Freshness is part of this: compare each recipe's **tested version** to the **latest published** +version (the sweep records both). A recipe tested at a stale pin, or a cc-ci test lagging an upstream +recipe change, is itself a finding — "all passed" must mean "passes against current upstream". -### 5. Output a structured report +### 4. Author the fix + OPEN a PR (AI authors; never merge) +For each real (non-flaky) finding, write the actual fix and open a PR. **Never merge.** + +- **Recipe-side fix → recipe PR.** Branch the recipe, apply the fix, and open the PR via the + **`recipe-create-pr` skill** (`/srv/recipe-maintainer/.opencode/skills/recipe-create-pr/SKILL.md`) — + it handles the mirror to `git.autonomic.zone/recipe-maintainers/` (upstream + `git.coopcloud.tech`). Keep the change **bounded** to the diagnosed root cause; don't rewrite the + recipe. +- **CI-server-side fix → cc-ci PR.** Branch the cc-ci product repo + (`recipe-maintainers/cc-ci`), apply the fix, and open the PR via the Gitea API (use the + `GITEA_*` creds from `/srv/cc-ci/.testenv`). **Single-writer discipline:** work on a dedicated + branch in a SEPARATE clone — **never push `main`, never touch the build loops' working clones** + (`/cc-ci`, `/cc-ci-adv`) or their in-flight state. + +### 5. VERIFY each PR on the CI server (deterministic; still never merge) +A PR is only "working" once **cc-ci verifies it green** (operator rule) — dogfood the CI that found +the bug. Verification is deterministic (the harness), not an AI judgement. + +- **Recipe PR:** run the **full suite COLD against the PR head** on cc-ci — the same path `!testme` + uses — via the helper: + ``` + RECIPE= REF= .claude/skills/ci-test-review/verify-pr.sh + ``` + Green ⇔ the harness exits 0. **General bar = one cold green.** Use `REPEAT=3` (repeated-green) + **only for a recipe already known to be FLAKY** (e.g. lasuite-drive) as a flakiness proof — this is + NOT the general standard. +- **CI-server PR:** check the branch out in a clone on cc-ci, rebuild if the change requires it + (`nixos-rebuild` / flake), then re-run the **previously-failing recipe(s)** through the harness + **plus a small regression sample** of unrelated recipes. Green ⇔ the fix passes AND nothing + regressed. +- **If verification is RED:** iterate the fix on the same branch (**bounded — ≤3 attempts**) and + re-verify. If still red, **leave the PR open** and report it as *"opened, NOT yet green — needs + work"* with the failing evidence. Never abandon silently; never weaken a test to force green. + +### 6. Output a structured report - A per-recipe **status table** (recipe | tiers | version tested / latest | verdict). -- For each failure: **root cause → classification (recipe PR / CI PR / test-update / flaky) → - proposed change** (concrete enough to act on). +- For each finding: **root cause → classification (recipe / CI-server) → PR link → verification + verdict** (`GREEN — cold-verified on cc-ci, ready for operator merge` / `RED — opened, needs work`). - Or the all-clear summary from step 2. +- Always end with the explicit reminder that **nothing was merged** — the PRs await operator review. ## Guardrails -- **Deterministic execution stays AI-free** — only diagnose; never override the harness's pass/fail. -- **Propose, don't auto-merge.** This skill outputs recommendations + proposed PRs. It does NOT - create or merge PRs. (A recipe PR is operator-merged only after cc-ci verifies it green.) If asked, - it may hand a specific proposed recipe PR to the `recipe-create-pr` skill as an explicit follow-up. -- **Don't weaken any test** to make the review "pass." -- **Flake ≠ bug** — re-run before classifying a failure as a real recipe/CI defect. +- **Deterministic execution + verification stay AI-free** — the harness decides pass/fail; AI only + diagnoses, authors the fix, and drives the flow. +- **Create + verify, NEVER merge.** Leave a cc-ci-green, ready-to-merge PR; the operator merges + (recipe PRs especially are operator-merged only after cc-ci verifies them green). +- **Don't weaken any test** to make a PR go green — the fix must make the recipe/CI genuinely correct. +- **Flake ≠ bug** — re-run before authoring any fix. +- **Real abra path** throughout (no docker-level bypass). +- **Bounded fixes** — each PR targets one diagnosed root cause; not a rewrite. +- **Coordination / single-writer:** while the build loops are actively developing cc-ci, the server + **and** the `recipe-maintainers/cc-ci` repo are in use. Always use a dedicated branch + separate + clone; never push `main` or disturb the loops' clones. Recipe deploys are **stateful on the shared + Swarm**, so run verification when you have effectively-exclusive use of the host (or serialize), and + **tear down whatever you deploy**. diff --git a/.claude/skills/ci-test-review/verify-pr.sh b/.claude/skills/ci-test-review/verify-pr.sh new file mode 100755 index 0000000..ac9937d --- /dev/null +++ b/.claude/skills/ci-test-review/verify-pr.sh @@ -0,0 +1,48 @@ +#!/usr/bin/env bash +# ci-test-review :: deterministic PR verification on the CI server (NO AI) +# ------------------------------------------------------------------------- +# Verifies a RECIPE PR the same way `!testme` does: deploy + run the FULL suite +# COLD against the PR head on cc-ci. The harness decides pass/fail — this script +# only runs it and reports the exit code + RUN SUMMARY. AI never judges here. +# +# RECIPE=ghost REF=my-fix-branch verify-pr.sh # 1 cold green = working +# RECIPE=lasuite-drive REF=sha REPEAT=3 verify-pr.sh # repeated-green: ONLY +# # for a known-flaky recipe +# +# REF is the recipe PR head (branch name or sha) on the recipe mirror that abra +# fetches from. Green iff EVERY repeat exits 0. +# +# (CI-SERVER PRs are verified differently — check the branch out in a clone on +# cc-ci, rebuild, re-run the failing recipe(s) + a regression sample — see +# SKILL.md step 5; that path is bespoke and not scripted here.) +set -o errexit -o nounset -o pipefail + +SSH="${SSH:-cc-ci}" +REPEAT="${REPEAT:-1}" +: "${RECIPE:?set RECIPE (e.g. ghost)}" +: "${REF:?set REF (the PR head branch or sha)}" + +RUNID="$(date -u +%Y%m%dT%H%M%SZ)" +REMOTE_LOG="/root/cc-ci-review-logs/verify-${RECIPE}-${RUNID}" +ssh "$SSH" "mkdir -p /root/cc-ci-review-logs" + +echo "verify-pr: RECIPE=$RECIPE REF=$REF cold full-suite x${REPEAT} on ${SSH}" >&2 + +green=1 +for i in $(seq 1 "$REPEAT"); do + log="${REMOTE_LOG}.${i}.log" + rc=0 + # Real harness, cold (no --quick), against the PR head — same path as !testme. + ssh "$SSH" "cd /root/cc-ci && RECIPE='${RECIPE}' REF='${REF}' cc-ci-run runner/run_recipe_ci.py >'${log}' 2>&1" || rc=$? + echo "--- pass ${i}/${REPEAT}: exit ${rc} (log ${SSH}:${log}) ---" >&2 + ssh "$SSH" "awk '/===== RUN SUMMARY =====/{f=1} f{print}' '${log}'" >&2 || true + [ "$rc" = "0" ] || green=0 +done + +if [ "$green" = "1" ]; then + echo "VERDICT: GREEN — ${RECIPE} PR (REF=${REF}) passed cold full-suite x${REPEAT}. Ready for operator merge (NOT merged)." >&2 + exit 0 +else + echo "VERDICT: RED — ${RECIPE} PR (REF=${REF}) did not pass. Iterate the fix (<=3 attempts) or report needs-work. Do NOT weaken tests; do NOT merge." >&2 + exit 2 +fi