From 330378d30d13e7f7f58b3e7ddc3301fe484cd142 Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Tue, 9 Jun 2026 16:29:30 +0000
Subject: [PATCH] ideas: fail-fast on crash-looping deploy + don't let one wedged run starve the CI queue

After a live incident: plausible build 220 (ClickHouse exit-1 crash-loop)
held the single serial runner for its full 1200s DEPLOY_TIMEOUT, starving
immich PR-2's queued builds for ~12min until manually torn down. Logs the
two fixes (fail-fast on crash-loop; head-of-line blocking on the serial
runner) + the interim mitigations (step-2b dev loop for debugging; SIGINT
to free a wedged run).

Co-Authored-By: Claude Opus 4.8
---
 cc-ci-plan/IDEAS.md | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/cc-ci-plan/IDEAS.md b/cc-ci-plan/IDEAS.md
index 87d5023..7b78228 100644
--- a/cc-ci-plan/IDEAS.md
+++ b/cc-ci-plan/IDEAS.md
@@ -164,3 +164,28 @@ item into the project `BACKLOG.md` as `[idea]` if/when it becomes relevant.
     fix.
   - *When to revisit:* before running the upgrader fully unattended/untrusted at scale, or alongside
     the "package cc-ci as a recipe" spike above (both hinge on a separate disposable test Swarm). *Added:* 2026-06-09.
+
+- [ ] **Fail fast on a crash-looping deploy + don't let one wedged run starve the CI queue.**
+  *(operator-flagged 2026-06-09, after a live incident)*
+  Observed live: plausible build 220 — a recipe whose ClickHouse service exits 1 every ~6s — held the
+  **single serial CI runner for its full `DEPLOY_TIMEOUT` (1200s / 20 min)** while the deploy never
+  converged. Because the runner is serial, that **starved every other recipe's CI behind it**: immich
+  PR-2 builds 221/222 sat "pending" for ~12 min without ever starting, and only ran once the wedged
+  plausible run was manually torn down (SIGINT the harness — its `finally` tore down the stack — which
+  freed the runner). Two compounding problems:
+  - **No fail-fast on a crash-loop.** The deploy/health wait polls until `DEPLOY_TIMEOUT`; a service
+    that is plainly crash-looping (a task repeatedly `Failed "task: non-zero exit (1)"` / restarting
+    every few seconds with no healthy replica) could be detected and the run **failed early** (e.g.
+    after N restarts within M seconds) instead of burning the whole 1200s. Faster RED for the broken
+    recipe AND frees the runner sooner. This is the higher-value, lower-risk half — do it first.
+  - **Head-of-line blocking on a single serial runner.** One wedged/slow run blocks ALL other recipes'
+    CI. Options: bump runner concurrency (CAREFULLY — the single-node Swarm is why it's serial today,
+    to avoid parallel-deploy OOM/collision, per /upgrade-all §safety); a priority/queue policy; or at
+    minimum surface "queued behind build N" so a pending run isn't mistaken for a stuck/own failure.
+  - *Interim mitigations (in use):* (a) debug a known-crash-looping recipe via the `/recipe-upgrade`
+    **step-2b direct dev deploy** (`dev-` + `docker service logs`) instead of repeated
+    `!testme` — diagnoses with full log visibility WITHOUT consuming a 20-min runner slot or starving
+    other recipes; (b) a wedged run can be manually freed by SIGINT-ing its `run_recipe_ci.py` (the
+    harness `finally` tears down its stack).
+  - *When to revisit:* if CI-queue starvation recurs (several recipes in flight, or a legitimately
+    long deploy wedging others). *Added:* 2026-06-09.
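
Reviewer sketch (not part of the patch): the "after N restarts within M seconds" rule from the fail-fast bullet could look roughly like the sliding-window check below. This is a minimal illustration, not the harness's actual code: `CrashLoopDetector`, `seconds_until_verdict`, and the thresholds are hypothetical names and values, and how `run_recipe_ci.py` would obtain task-failure timestamps from Swarm is left open.

```python
from collections import deque


class CrashLoopDetector:
    """Hypothetical helper: flags a crash-loop when `max_failures` task
    failures fall inside a sliding window of `window_s` seconds."""

    def __init__(self, max_failures=3, window_s=60.0):
        self.max_failures = max_failures  # "N" in the idea text (assumed value)
        self.window_s = window_s          # "M" in the idea text (assumed value)
        self._failures = deque()          # timestamps of recent failures

    def record_failure(self, ts):
        """Record one failed task at time `ts` (seconds); True if crash-looping."""
        self._failures.append(ts)
        # Drop failures that have aged out of the sliding window.
        while self._failures and ts - self._failures[0] > self.window_s:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures

    def reset(self):
        """Call when a replica turns healthy, so earlier failures stop counting."""
        self._failures.clear()


def seconds_until_verdict(failure_times, detector, deploy_timeout=1200.0):
    """Time at which the wait gives up: early on a crash-loop, else at the timeout."""
    for ts in failure_times:
        if ts >= deploy_timeout:
            break
        if detector.record_failure(ts):
            return ts
    return deploy_timeout


# A service exiting every ~6s, as in the incident described in the patch.
failures = [6.0 * i for i in range(1, 200)]
verdict_at = seconds_until_verdict(
    failures, CrashLoopDetector(max_failures=5, window_s=60.0)
)
print(verdict_at)  # 30.0: the fifth failure lands 30s in, far short of 1200s
```

With these placeholder thresholds, a service failing every 6s is declared RED after 30s instead of holding the runner for the full timeout, while isolated failures spread further apart than the window never trip the check.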