diff --git a/.claude/skills/upgrade-all/SKILL.md b/.claude/skills/upgrade-all/SKILL.md index 08a62b1..723461d 100644 --- a/.claude/skills/upgrade-all/SKILL.md +++ b/.claude/skills/upgrade-all/SKILL.md @@ -1,6 +1,6 @@ --- name: upgrade-all -description: Weekly autonomous upgrade run for the cc-ci CI server. Surveys every enrolled recipe for available upstream upgrades, then runs /recipe-upgrade on each upgradeable one via a subagent — plan, implement, verify green on cc-ci, open a recipe PR (and, only if a cc-ci test went stale, a verified cc-ci test PR). Collects results into one summary listing every PR to review. Sequential by default (shared Swarm); --parallel to fan out; --dry-run to preview. NEVER merges. Built to run once weekly on a cron. Invoke as /upgrade-all. +description: Weekly autonomous upgrade run for the cc-ci CI server. Surveys every enrolled recipe for available upstream upgrades, then runs /recipe-upgrade on each upgradeable one via a subagent — plan, implement, verify green on cc-ci, open a recipe PR (and, only if a cc-ci test went stale, a verified cc-ci test PR). Collects results into one summary listing every PR to review. Concurrency-bounded by default — runs up to DRONE_RUNNER_CAPACITY (the drone runner's slots, currently 2) recipe subagents at a time; --sequential for one-at-a-time, --capacity N to override, --parallel to fan out all, --dry-run to preview. NEVER merges. Built to run once weekly on a cron. Invoke as /upgrade-all. --- # upgrade-all @@ -23,7 +23,17 @@ session, but the agent is the intended path so the weekly run isn't buried in he ## Arguments (optional `$ARGUMENTS`) - A space-separated list of recipe names → only those (else all enrolled recipes). - `--dry-run` → survey + print what WOULD upgrade; spawn nothing. -- `--parallel` → fan out all per-recipe subagents at once (faster, more host load — see safety below). +- **Default = concurrency-bounded:** run up to **`DRONE_RUNNER_CAPACITY`** recipe subagents at a + time (the drone runner's slots — currently `2`). The 2026-06-10 concurrency restructure + (`docs/concurrency.md`) makes concurrent recipe runs SAFE (per-run recipe trees + app-domain + locks + isolation), and the capacity knob is the operator's resource-tuned ceiling — so matching + the subagent pool to it uses all available concurrency without oversubscribing. +- `--capacity N` → override the pool size (else the live `DRONE_RUNNER_CAPACITY`). +- `--sequential` → force one subagent at a time (the old cautious default; use if the host is also + busy with the build loops). +- `--parallel` → unbounded: fan out ALL subagents at once, ignoring capacity (heaviest host load; + drone still queues `!testme` beyond capacity, but every subagent's step-2b chaos deploy runs + concurrently — only when the box is otherwise idle). ## 0. Sweep orphans from previous runs (FIRST, before anything else) A prior run's teardown can crash, an agent can be killed mid-deploy, or a manual debug probe can be @@ -108,9 +118,16 @@ ssh cc-ci "GITEA_USERNAME='$GITEA_USERNAME' GITEA_PASSWORD='$GITEA_PASSWORD' GIT (The per-recipe `/recipe-upgrade` also reconciles, so this is belt-and-suspenders for skipped recipes — count closed-merged PRs in the summary.) -## 2. Print the plan -A table — Recipe | Status (will upgrade / skipped:reason) | Available upgrade(s) — plus the mode -(`Sequential` default / `Parallel`). If `--dry-run`, **stop here**. +## 2. Determine concurrency, then print the plan +Read the live capacity (unless `--capacity N` / `--sequential` / `--parallel` was passed): +``` +CAP=$(ssh cc-ci 'systemctl show drone-runner-exec -p Environment 2>/dev/null' | grep -oE 'DRONE_RUNNER_CAPACITY=[0-9]+' | cut -d= -f2) +CAP=${CAP:-2} # fallback to the documented default if the query fails +``` +`--sequential` → `CAP=1`; `--capacity N` → `CAP=N`; `--parallel` → `CAP=∞` (all at once). +Then print a table — Recipe | Status (will upgrade / skipped:reason) | Available upgrade(s) — plus +the mode (`Concurrency-bounded (N=)` default / `Sequential` / `Parallel`). If `--dry-run`, +**stop here**. ## 3. Upgrade each recipe via a subagent For each recipe in `RECIPES_TO_UPGRADE`, spawn an Agent (`subagent_type: "general-purpose"`, @@ -130,11 +147,19 @@ test change deserves a human decision. So the cron opens recipe PRs only; where test looks stale, the operator sees the explanation in the PR comment and can re-run that one recipe with `/recipe-upgrade --with-tests` to also get a verified test-update PR. -- **Sequential (default):** one Agent at a time; wait, collect its `RESULT:`, then the next. If the - Agent tool call itself errors, record `FAILED — agent tool error` and continue. One failure never - aborts the run. -- **Parallel (`--parallel`):** emit all Agent calls in one message (no `run_in_background` — you need - the results). Failures are isolated per recipe. +- **Concurrency-bounded (default), `CAP` at a time:** process `RECIPES_TO_UPGRADE` in waves of + `CAP`. For each wave, emit `CAP` Agent calls **in one message** (they run concurrently — no + `run_in_background`, you need their `RESULT:` lines); wait for all `CAP` to return; collect their + results; then start the next wave. This keeps the drone runner's `CAP` slots busy without + oversubscribing. `CAP=1` degenerates to one-at-a-time (sequential). An Agent tool-call error → + record `FAILED — agent tool error` and continue; one failure never aborts the run or its wave. +- **Parallel (`--parallel`):** `CAP=∞` — emit ALL Agent calls in one message. Heaviest load; + failures still isolated per recipe. + +Note (wave barrier): a wave waits for its slowest recipe before the next starts, so a slot can idle +if one recipe (e.g. discourse's multi-image upgrade) runs much longer than its wave-mates. That's +acceptable for the weekly run; do NOT try to maintain a rolling pool with `run_in_background` (you'd +lose the `RESULT:` lines). Order `RECIPES_TO_UPGRADE` heaviest-first so long upgrades start early. ## 4. Collect results Parse each final `RESULT:` line into SUCCESS / SUCCESS-PENDING-TESTS / FAILED / SKIPPED (default mode @@ -175,10 +200,17 @@ still around). It runs `/recipe-report`, reviews this run + the live recipe/PR s https://report.ci.commoninternet.net. Fire-and-forget — it runs independently; you can then go idle. ## Safety / coordination (this matters — shared host with the build loops) -- **Sequential is the default for a reason.** Recipe deploys are **stateful on the shared Swarm** and - parallel deploys can OOM/collide. Between sequential recipes, the per-recipe `recipe-upgrade` tears - down what it deployed; verify a recipe is undeployed before the next starts - (`ssh cc-ci 'script -qec "abra app ls -n" /dev/null'` — pseudo-TTY wrapped, per the box above). +- **Concurrency is bounded to the drone capacity, and that's deliberate.** The 2026-06-10 + concurrency restructure (`docs/concurrency.md`) made concurrent recipe runs *correct* — per-run + recipe trees, app-domain locks, isolation — so two recipes no longer collide/corrupt. The + remaining limit is host RESOURCES (memory), which is exactly what `DRONE_RUNNER_CAPACITY` (=2) is + tuned to. Running `CAP` subagents matches the runner's slots; do NOT exceed it on the shared box + (that's what `--parallel` is for, and only when the box is idle). Each `recipe-upgrade` still + tears down its own step-2b deploy; the end-of-run reap (§4b) clears any that leaked. +- **If the host is ALSO running the build loops, prefer `--sequential`** (CAP=1) — the loops and a + capacity-2 upgrade together can oversubscribe the 7.6 GB box (a Swarm task can wedge in `New` + state under load). The orchestrator should serialize: don't run a capacity>1 upgrade concurrently + with active phase loops. - **Single-writer:** every PR (recipe or cc-ci test) is on a dedicated branch; **never push `main`**, never touch the build loops' `/cc-ci` `/cc-ci-adv` working clones or their in-flight state. - **Contention with active loop development:** while the loops are still building cc-ci, this run