upgrade-all: default to concurrency-bounded (DRONE_RUNNER_CAPACITY) subagents

Now that the 2026-06-10 concurrency restructure makes concurrent recipe runs safe (per-run trees, app-domain locks, isolation), default /upgrade-all to run up to DRONE_RUNNER_CAPACITY (the drone runner's slots, currently 2) recipe subagents at a time instead of strictly sequential — using all available concurrency without oversubscribing. Query the live capacity from 'systemctl show drone-runner-exec' (fallback 2); process recipes in waves of CAP (emit CAP Agent calls per message, await, next wave). Flags: --capacity N, --sequential (CAP=1, old default — use when the build loops share the box), --parallel (unbounded). Applies to the NEXT run; the in-flight run is unaffected.
2026-06-12 01:39:44 +00:00
parent 1eb720e95a
commit a45517b432
1 changed files with 46 additions and 14 deletions
--- a/.claude/skills/upgrade-all/SKILL.md
+++ b/.claude/skills/upgrade-all/SKILL.md
@ -1,6 +1,6 @@
 ---
 name: upgrade-all
-description: Weekly autonomous upgrade run for the cc-ci CI server. Surveys every enrolled recipe for available upstream upgrades, then runs /recipe-upgrade on each upgradeable one via a subagent — plan, implement, verify green on cc-ci, open a recipe PR (and, only if a cc-ci test went stale, a verified cc-ci test PR). Collects results into one summary listing every PR to review. Sequential by default (shared Swarm); --parallel to fan out; --dry-run to preview. NEVER merges. Built to run once weekly on a cron. Invoke as /upgrade-all.
+description: Weekly autonomous upgrade run for the cc-ci CI server. Surveys every enrolled recipe for available upstream upgrades, then runs /recipe-upgrade on each upgradeable one via a subagent — plan, implement, verify green on cc-ci, open a recipe PR (and, only if a cc-ci test went stale, a verified cc-ci test PR). Collects results into one summary listing every PR to review. Concurrency-bounded by default — runs up to DRONE_RUNNER_CAPACITY (the drone runner's slots, currently 2) recipe subagents at a time; --sequential for one-at-a-time, --capacity N to override, --parallel to fan out all, --dry-run to preview. NEVER merges. Built to run once weekly on a cron. Invoke as /upgrade-all.
 ---

 # upgrade-all
@ -23,7 +23,17 @@ session, but the agent is the intended path so the weekly run isn't buried in he
 ## Arguments (optional `$ARGUMENTS`)
 - A space-separated list of recipe names → only those (else all enrolled recipes).
 - `--dry-run` → survey + print what WOULD upgrade; spawn nothing.
- `--parallel` → fan out all per-recipe subagents at once (faster, more host load — see safety below).
+- **Default = concurrency-bounded:** run up to **`DRONE_RUNNER_CAPACITY`** recipe subagents at a
+  time (the drone runner's slots — currently `2`). The 2026-06-10 concurrency restructure
+  (`docs/concurrency.md`) makes concurrent recipe runs SAFE (per-run recipe trees + app-domain
+  locks + isolation), and the capacity knob is the operator's resource-tuned ceiling — so matching
+  the subagent pool to it uses all available concurrency without oversubscribing.
+- `--capacity N` → override the pool size (else the live `DRONE_RUNNER_CAPACITY`).
+- `--sequential` → force one subagent at a time (the old cautious default; use if the host is also
+  busy with the build loops).
+- `--parallel` → unbounded: fan out ALL subagents at once, ignoring capacity (heaviest host load;
+  drone still queues `!testme` beyond capacity, but every subagent's step-2b chaos deploy runs
+  concurrently — only when the box is otherwise idle).

 ## 0. Sweep orphans from previous runs (FIRST, before anything else)
 A prior run's teardown can crash, an agent can be killed mid-deploy, or a manual debug probe can be
@ -108,9 +118,16 @@ ssh cc-ci "GITEA_USERNAME='$GITEA_USERNAME' GITEA_PASSWORD='$GITEA_PASSWORD' GIT
 (The per-recipe `/recipe-upgrade` also reconciles, so this is belt-and-suspenders for skipped recipes —
 count closed-merged PRs in the summary.)

-## 2. Print the plan
-A table — Recipe | Status (will upgrade / skipped:reason) | Available upgrade(s) — plus the mode
-(`Sequential` default / `Parallel`). If `--dry-run`, **stop here**.
+## 2. Determine concurrency, then print the plan
+Read the live capacity (unless `--capacity N` / `--sequential` / `--parallel` was passed):
+```
+CAP=$(ssh cc-ci 'systemctl show drone-runner-exec -p Environment 2>/dev/null' | grep -oE 'DRONE_RUNNER_CAPACITY=[0-9]+' | cut -d= -f2)
+CAP=${CAP:-2}   # fallback to the documented default if the query fails
+```
+`--sequential` → `CAP=1`; `--capacity N` → `CAP=N`; `--parallel` → `CAP=∞` (all at once).
+Then print a table — Recipe | Status (will upgrade / skipped:reason) | Available upgrade(s) — plus
+the mode (`Concurrency-bounded (N=<CAP>)` default / `Sequential` / `Parallel`). If `--dry-run`,
+**stop here**.

 ## 3. Upgrade each recipe via a subagent
 For each recipe in `RECIPES_TO_UPGRADE`, spawn an Agent (`subagent_type: "general-purpose"`,
@ -130,11 +147,19 @@ test change deserves a human decision. So the cron opens recipe PRs only; where
 test looks stale, the operator sees the explanation in the PR comment and can re-run that one recipe
 with `/recipe-upgrade <recipe> --with-tests` to also get a verified test-update PR.

- **Sequential (default):** one Agent at a time; wait, collect its `RESULT:`, then the next. If the
-  Agent tool call itself errors, record `FAILED — agent tool error` and continue. One failure never
-  aborts the run.
- **Parallel (`--parallel`):** emit all Agent calls in one message (no `run_in_background` — you need
-  the results). Failures are isolated per recipe.
+- **Concurrency-bounded (default), `CAP` at a time:** process `RECIPES_TO_UPGRADE` in waves of
+  `CAP`. For each wave, emit `CAP` Agent calls **in one message** (they run concurrently — no
+  `run_in_background`, you need their `RESULT:` lines); wait for all `CAP` to return; collect their
+  results; then start the next wave. This keeps the drone runner's `CAP` slots busy without
+  oversubscribing. `CAP=1` degenerates to one-at-a-time (sequential). An Agent tool-call error →
+  record `FAILED — agent tool error` and continue; one failure never aborts the run or its wave.
+- **Parallel (`--parallel`):** `CAP=∞` — emit ALL Agent calls in one message. Heaviest load;
+  failures still isolated per recipe.
+
+Note (wave barrier): a wave waits for its slowest recipe before the next starts, so a slot can idle
+if one recipe (e.g. discourse's multi-image upgrade) runs much longer than its wave-mates. That's
+acceptable for the weekly run; do NOT try to maintain a rolling pool with `run_in_background` (you'd
+lose the `RESULT:` lines). Order `RECIPES_TO_UPGRADE` heaviest-first so long upgrades start early.

 ## 4. Collect results
 Parse each final `RESULT:` line into SUCCESS / SUCCESS-PENDING-TESTS / FAILED / SKIPPED (default mode
@ -175,10 +200,17 @@ still around). It runs `/recipe-report`, reviews this run + the live recipe/PR s
 https://report.ci.commoninternet.net. Fire-and-forget — it runs independently; you can then go idle.

 ## Safety / coordination (this matters — shared host with the build loops)
- **Sequential is the default for a reason.** Recipe deploys are **stateful on the shared Swarm** and
-  parallel deploys can OOM/collide. Between sequential recipes, the per-recipe `recipe-upgrade` tears
-  down what it deployed; verify a recipe is undeployed before the next starts
-  (`ssh cc-ci 'script -qec "abra app ls -n" /dev/null'` — pseudo-TTY wrapped, per the box above).
+- **Concurrency is bounded to the drone capacity, and that's deliberate.** The 2026-06-10
+  concurrency restructure (`docs/concurrency.md`) made concurrent recipe runs *correct* — per-run
+  recipe trees, app-domain locks, isolation — so two recipes no longer collide/corrupt. The
+  remaining limit is host RESOURCES (memory), which is exactly what `DRONE_RUNNER_CAPACITY` (=2) is
+  tuned to. Running `CAP` subagents matches the runner's slots; do NOT exceed it on the shared box
+  (that's what `--parallel` is for, and only when the box is idle). Each `recipe-upgrade` still
+  tears down its own step-2b deploy; the end-of-run reap (§4b) clears any that leaked.
+- **If the host is ALSO running the build loops, prefer `--sequential`** (CAP=1) — the loops and a
+  capacity-2 upgrade together can oversubscribe the 7.6 GB box (a Swarm task can wedge in `New`
+  state under load). The orchestrator should serialize: don't run a capacity>1 upgrade concurrently
+  with active phase loops.
 - **Single-writer:** every PR (recipe or cc-ci test) is on a dedicated branch; **never push `main`**,
  never touch the build loops' `/cc-ci` `/cc-ci-adv` working clones or their in-flight state.
 - **Contention with active loop development:** while the loops are still building cc-ci, this run