upgrade-all: default to concurrency-bounded (DRONE_RUNNER_CAPACITY) subagents

Now that the 2026-06-10 concurrency restructure makes concurrent recipe runs
safe (per-run trees, app-domain locks, isolation), default /upgrade-all to run
up to DRONE_RUNNER_CAPACITY (the drone runner's slots, currently 2) recipe
subagents at a time instead of strictly sequential — using all available
concurrency without oversubscribing. Query the live capacity from
'systemctl show drone-runner-exec' (fallback 2); process recipes in waves of
CAP (emit CAP Agent calls per message, await, next wave). Flags: --capacity N,
--sequential (CAP=1, old default — use when the build loops share the box),
--parallel (unbounded). Applies to the NEXT run; the in-flight run is unaffected.
This commit is contained in:
autonomic-bot
2026-06-12 01:39:44 +00:00
parent 1eb720e95a
commit a45517b432

View File

@ -1,6 +1,6 @@
---
name: upgrade-all
description: Weekly autonomous upgrade run for the cc-ci CI server. Surveys every enrolled recipe for available upstream upgrades, then runs /recipe-upgrade on each upgradeable one via a subagent — plan, implement, verify green on cc-ci, open a recipe PR (and, only if a cc-ci test went stale, a verified cc-ci test PR). Collects results into one summary listing every PR to review. Sequential by default (shared Swarm); --parallel to fan out; --dry-run to preview. NEVER merges. Built to run once weekly on a cron. Invoke as /upgrade-all.
description: Weekly autonomous upgrade run for the cc-ci CI server. Surveys every enrolled recipe for available upstream upgrades, then runs /recipe-upgrade on each upgradeable one via a subagent — plan, implement, verify green on cc-ci, open a recipe PR (and, only if a cc-ci test went stale, a verified cc-ci test PR). Collects results into one summary listing every PR to review. Concurrency-bounded by default — runs up to DRONE_RUNNER_CAPACITY (the drone runner's slots, currently 2) recipe subagents at a time; --sequential for one-at-a-time, --capacity N to override, --parallel to fan out all, --dry-run to preview. NEVER merges. Built to run once weekly on a cron. Invoke as /upgrade-all.
---
# upgrade-all
@ -23,7 +23,17 @@ session, but the agent is the intended path so the weekly run isn't buried in he
## Arguments (optional `$ARGUMENTS`)
- A space-separated list of recipe names → only those (else all enrolled recipes).
- `--dry-run` → survey + print what WOULD upgrade; spawn nothing.
- `--parallel` → fan out all per-recipe subagents at once (faster, more host load — see safety below).
- **Default = concurrency-bounded:** run up to **`DRONE_RUNNER_CAPACITY`** recipe subagents at a
time (the drone runner's slots — currently `2`). The 2026-06-10 concurrency restructure
(`docs/concurrency.md`) makes concurrent recipe runs SAFE (per-run recipe trees + app-domain
locks + isolation), and the capacity knob is the operator's resource-tuned ceiling — so matching
the subagent pool to it uses all available concurrency without oversubscribing.
- `--capacity N` → override the pool size (else the live `DRONE_RUNNER_CAPACITY`).
- `--sequential` → force one subagent at a time (the old cautious default; use if the host is also
busy with the build loops).
- `--parallel` → unbounded: fan out ALL subagents at once, ignoring capacity (heaviest host load;
drone still queues `!testme` beyond capacity, but every subagent's step-2b chaos deploy runs
concurrently — only when the box is otherwise idle).
## 0. Sweep orphans from previous runs (FIRST, before anything else)
A prior run's teardown can crash, an agent can be killed mid-deploy, or a manual debug probe can be
@ -108,9 +118,16 @@ ssh cc-ci "GITEA_USERNAME='$GITEA_USERNAME' GITEA_PASSWORD='$GITEA_PASSWORD' GIT
(The per-recipe `/recipe-upgrade` also reconciles, so this is belt-and-suspenders for skipped recipes —
count closed-merged PRs in the summary.)
## 2. Print the plan
A table — Recipe | Status (will upgrade / skipped:reason) | Available upgrade(s) — plus the mode
(`Sequential` default / `Parallel`). If `--dry-run`, **stop here**.
## 2. Determine concurrency, then print the plan
Read the live capacity (unless `--capacity N` / `--sequential` / `--parallel` was passed):
```
CAP=$(ssh cc-ci 'systemctl show drone-runner-exec -p Environment 2>/dev/null' | grep -oE 'DRONE_RUNNER_CAPACITY=[0-9]+' | cut -d= -f2)
CAP=${CAP:-2} # fallback to the documented default if the query fails
```
`--sequential``CAP=1`; `--capacity N``CAP=N`; `--parallel``CAP=∞` (all at once).
Then print a table — Recipe | Status (will upgrade / skipped:reason) | Available upgrade(s) — plus
the mode (`Concurrency-bounded (N=<CAP>)` default / `Sequential` / `Parallel`). If `--dry-run`,
**stop here**.
## 3. Upgrade each recipe via a subagent
For each recipe in `RECIPES_TO_UPGRADE`, spawn an Agent (`subagent_type: "general-purpose"`,
@ -130,11 +147,19 @@ test change deserves a human decision. So the cron opens recipe PRs only; where
test looks stale, the operator sees the explanation in the PR comment and can re-run that one recipe
with `/recipe-upgrade <recipe> --with-tests` to also get a verified test-update PR.
- **Sequential (default):** one Agent at a time; wait, collect its `RESULT:`, then the next. If the
Agent tool call itself errors, record `FAILED — agent tool error` and continue. One failure never
aborts the run.
- **Parallel (`--parallel`):** emit all Agent calls in one message (no `run_in_background` — you need
the results). Failures are isolated per recipe.
- **Concurrency-bounded (default), `CAP` at a time:** process `RECIPES_TO_UPGRADE` in waves of
`CAP`. For each wave, emit `CAP` Agent calls **in one message** (they run concurrently — no
`run_in_background`, you need their `RESULT:` lines); wait for all `CAP` to return; collect their
results; then start the next wave. This keeps the drone runner's `CAP` slots busy without
oversubscribing. `CAP=1` degenerates to one-at-a-time (sequential). An Agent tool-call error →
record `FAILED — agent tool error` and continue; one failure never aborts the run or its wave.
- **Parallel (`--parallel`):** `CAP=∞` — emit ALL Agent calls in one message. Heaviest load;
failures still isolated per recipe.
Note (wave barrier): a wave waits for its slowest recipe before the next starts, so a slot can idle
if one recipe (e.g. discourse's multi-image upgrade) runs much longer than its wave-mates. That's
acceptable for the weekly run; do NOT try to maintain a rolling pool with `run_in_background` (you'd
lose the `RESULT:` lines). Order `RECIPES_TO_UPGRADE` heaviest-first so long upgrades start early.
## 4. Collect results
Parse each final `RESULT:` line into SUCCESS / SUCCESS-PENDING-TESTS / FAILED / SKIPPED (default mode
@ -175,10 +200,17 @@ still around). It runs `/recipe-report`, reviews this run + the live recipe/PR s
https://report.ci.commoninternet.net. Fire-and-forget — it runs independently; you can then go idle.
## Safety / coordination (this matters — shared host with the build loops)
- **Sequential is the default for a reason.** Recipe deploys are **stateful on the shared Swarm** and
parallel deploys can OOM/collide. Between sequential recipes, the per-recipe `recipe-upgrade` tears
down what it deployed; verify a recipe is undeployed before the next starts
(`ssh cc-ci 'script -qec "abra app ls -n" /dev/null'` — pseudo-TTY wrapped, per the box above).
- **Concurrency is bounded to the drone capacity, and that's deliberate.** The 2026-06-10
concurrency restructure (`docs/concurrency.md`) made concurrent recipe runs *correct* — per-run
recipe trees, app-domain locks, isolation — so two recipes no longer collide/corrupt. The
remaining limit is host RESOURCES (memory), which is exactly what `DRONE_RUNNER_CAPACITY` (=2) is
tuned to. Running `CAP` subagents matches the runner's slots; do NOT exceed it on the shared box
(that's what `--parallel` is for, and only when the box is idle). Each `recipe-upgrade` still
tears down its own step-2b deploy; the end-of-run reap (§4b) clears any that leaked.
- **If the host is ALSO running the build loops, prefer `--sequential`** (CAP=1) — the loops and a
capacity-2 upgrade together can oversubscribe the 7.6 GB box (a Swarm task can wedge in `New`
state under load). The orchestrator should serialize: don't run a capacity>1 upgrade concurrently
with active phase loops.
- **Single-writer:** every PR (recipe or cc-ci test) is on a dedicated branch; **never push `main`**,
never touch the build loops' `/cc-ci` `/cc-ci-adv` working clones or their in-flight state.
- **Contention with active loop development:** while the loops are still building cc-ci, this run