ideas: fail-fast on crash-looping deploy + don't let one wedged run starve the CI queue

After a live incident: plausible build 220 (ClickHouse exit-1 crash-loop) held the
single serial runner for its full 1200s DEPLOY_TIMEOUT, starving immich PR-2's
queued builds for ~12min until manually torn down. Logs the two fixes (fail-fast
on crash-loop; relieving head-of-line blocking on the serial runner) + the interim
mitigations (step-2b dev loop for debugging; SIGINT to free a wedged run).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
autonomic-bot
2026-06-09 16:29:30 +00:00
parent a2c1cb550a
commit 330378d30d

@@ -164,3 +164,28 @@ item into the project `BACKLOG.md` as `[idea]` if/when it becomes relevant.
fix.
- *When to revisit:* before running the upgrader fully unattended/untrusted at scale, or alongside the
"package cc-ci as a recipe" spike above (both hinge on a separate disposable test Swarm). *Added:* 2026-06-09.
- [ ] **Fail fast on a crash-looping deploy + don't let one wedged run starve the CI queue.**
*(operator-flagged 2026-06-09, after a live incident)*
Observed live: plausible build 220 — a recipe whose ClickHouse service exits 1 every ~6s — held the
**single serial CI runner for its full `DEPLOY_TIMEOUT` (1200s / 20 min)** while the deploy never
converged. Because the runner is serial, that **starved every other recipe's CI behind it**: immich
PR-2 builds 221/222 sat "pending" for ~12 min without ever starting, and only ran once the wedged
plausible run was manually torn down: SIGINT to the harness ran its `finally` block, which tore down
the stack and freed the runner. Two compounding problems:
- **No fail-fast on a crash-loop.** The deploy/health wait polls until `DEPLOY_TIMEOUT`; a service
that is plainly crash-looping (a task repeatedly `Failed "task: non-zero exit (1)"` / restarting
every few seconds with no healthy replica) could be detected and the run **failed early** (e.g.
after N restarts within M seconds) instead of burning the whole 1200s. Faster RED for the broken
recipe AND frees the runner sooner. This is the higher-value, lower-risk half — do it first.
- **Head-of-line blocking on a single serial runner.** One wedged/slow run blocks ALL other recipes'
CI. Options: bump runner concurrency (CAREFULLY — the single-node Swarm is why it's serial today,
to avoid parallel-deploy OOM/collision, per /upgrade-all §safety); a priority/queue policy; or at
minimum surface "queued behind build N" so a pending run isn't mistaken for a stuck run or a failure of its own.
- *Interim mitigations (in use):* (a) debug a known-crash-looping recipe via the `/recipe-upgrade`
**step-2b direct dev deploy** (`dev-<recipe>` + `docker service logs`) instead of repeated
`!testme` — diagnoses with full log visibility WITHOUT consuming a 20-min runner slot or starving
other recipes; (b) a wedged run can be manually freed by SIGINT-ing its `run_recipe_ci.py` (the
harness `finally` tears down its stack).
- *When to revisit:* if CI-queue starvation recurs (several recipes in flight, or a legitimately
long deploy wedging others). *Added:* 2026-06-09.
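A minimal sketch of the "N restarts within M seconds" fail-fast idea, not the actual `run_recipe_ci.py` code. Everything named here is hypothetical: `CrashLoopDetector`, `wait_for_deploy`, the default thresholds, and the `poll` callable, which stands in for however the harness inspects task state (for example by reading `docker service ps` output). It assumes `poll` returns whether the service is healthy plus how many new task failures appeared since the previous poll.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Tuple


class CrashLoopError(RuntimeError):
    """Raised when a service is judged to be crash-looping."""


@dataclass
class CrashLoopDetector:
    # Hypothetical thresholds: N failures within M seconds.
    max_failures: int = 5
    window_seconds: float = 60.0
    _failures: Deque[float] = field(default_factory=deque, init=False, repr=False)

    def record_failure(self, timestamp: float) -> None:
        """Record one task failure and drop failures older than the window."""
        self._failures.append(timestamp)
        while timestamp - self._failures[0] > self.window_seconds:
            self._failures.popleft()

    def is_crash_looping(self) -> bool:
        return len(self._failures) >= self.max_failures


def wait_for_deploy(
    poll: Callable[[], Tuple[bool, int]],
    deploy_timeout: float,
    detector: CrashLoopDetector,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    interval: float = 2.0,
) -> None:
    """Wait until healthy; fail early on a crash-loop, else at the timeout.

    `poll` is an assumed hook returning (healthy, new_failures_since_last_poll).
    """
    deadline = clock() + deploy_timeout
    while clock() < deadline:
        healthy, new_failures = poll()
        if healthy:
            return
        now = clock()
        for _ in range(new_failures):
            detector.record_failure(now)
        if detector.is_crash_looping():
            raise CrashLoopError(
                f"{detector.max_failures} task failures within "
                f"{detector.window_seconds:.0f}s; failing before the deploy timeout"
            )
        sleep(interval)
    raise TimeoutError(f"deploy did not converge within {deploy_timeout:.0f}s")
```

With these illustrative defaults, a service failing on every poll at a 6s interval trips the detector on its fifth failure rather than holding the runner for the full 1200s. A service that fails only occasionally never accumulates five failures inside the window and still falls through to the ordinary timeout. The sliding window is the design choice that separates a slow-but-legitimate deploy from a tight crash-loop; the right N and M would need tuning against real recipes.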