ideas: fail-fast on crash-looping deploy + don't let one wedged run starve the CI queue

After a live incident: plausible build 220 (ClickHouse exit-1 crash-loop) held the single serial runner for its full 1200s DEPLOY_TIMEOUT, starving immich PR-2's queued builds for ~12min until manually torn down. Logs the two fixes (fail-fast on crash-loop; head-of-line blocking on the serial runner) + the interim mitigations (step-2b dev loop for debugging; SIGINT to free a wedged run).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 16:29:30 +00:00
parent a2c1cb550a
commit 330378d30d
1 changed files with 25 additions and 0 deletions
--- a/cc-ci-plan/IDEAS.md
+++ b/cc-ci-plan/IDEAS.md
@@ -164,3 +164,28 @@ item into the project `BACKLOG.md` as `[idea]` if/when it becomes relevant.
    fix.
  - *When to revisit:* before running the upgrader fully unattended/untrusted at scale, or alongside the
    "package cc-ci as a recipe" spike above (both hinge on a separate disposable test Swarm). *Added:* 2026-06-09.
+
+- [ ] **Fail fast on a crash-looping deploy + don't let one wedged run starve the CI queue.**
+  *(operator-flagged 2026-06-09, after a live incident)*
+  Observed live: plausible build 220 — a recipe whose ClickHouse service exits 1 every ~6s — held the
+  **single serial CI runner for its full `DEPLOY_TIMEOUT` (1200s / 20 min)** while the deploy never
+  converged. Because the runner is serial, that **starved every other recipe's CI behind it**: immich
+  PR-2 builds 221/222 sat "pending" for ~12 min without ever starting, and only ran once the wedged
+  plausible run was manually torn down (a SIGINT to the harness ran its `finally` block, which tore
+  down the stack and freed the runner). Two compounding problems:
+  - **No fail-fast on a crash-loop.** The deploy/health wait polls until `DEPLOY_TIMEOUT`; a service
+    that is plainly crash-looping (a task repeatedly `Failed "task: non-zero exit (1)"` / restarting
+    every few seconds with no healthy replica) could be detected and the run **failed early** (e.g.
+    after N restarts within M seconds) instead of burning the whole 1200s. Faster RED for the broken
+    recipe AND frees the runner sooner. This is the higher-value, lower-risk half — do it first.
+  - **Head-of-line blocking on a single serial runner.** One wedged/slow run blocks ALL other recipes'
+    CI. Options: bump runner concurrency (CAREFULLY — the single-node Swarm is why it's serial today,
+    to avoid parallel-deploy OOM/collision, per /upgrade-all §safety); a priority/queue policy; or at
+    minimum surface "queued behind build N" so a pending run isn't mistaken for a stuck run or a failure of its own.
+  - *Interim mitigations (in use):* (a) debug a known-crash-looping recipe via the `/recipe-upgrade`
+    **step-2b direct dev deploy** (`dev-<recipe>` + `docker service logs`) instead of repeated
+    `!testme` — diagnoses with full log visibility WITHOUT consuming a 20-min runner slot or starving
+    other recipes; (b) a wedged run can be manually freed by SIGINT-ing its `run_recipe_ci.py` (the
+    harness `finally` tears down its stack).
+  - *When to revisit:* if CI-queue starvation recurs (several recipes in flight, or a legitimately
+    long deploy wedging others). *Added:* 2026-06-09.
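
The "N restarts within M seconds" rule from the fail-fast bullet could be sketched as a sliding-window counter. This is an illustration under stated assumptions, not the harness's actual code: `CrashLoopDetector`, `first_trip_time`, and the thresholds (5 failures in 60s) are hypothetical, and how failure timestamps are collected from the Swarm is left out.

```python
from collections import deque


class CrashLoopDetector:
    """Hypothetical helper: trips once `max_restarts` failures fall within `window_seconds`."""

    def __init__(self, max_restarts: int = 5, window_seconds: float = 60.0):
        # Both thresholds are illustrative assumptions, not values from the harness.
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps of recent task failures

    def record_failure(self, timestamp: float) -> bool:
        """Record one failed task; return True if the service now looks crash-looping."""
        self._failures.append(timestamp)
        # Drop failures that have aged out of the sliding window.
        while timestamp - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) >= self.max_restarts


def first_trip_time(failure_times, detector):
    """Return the timestamp at which the detector trips, or None if it never does."""
    for t in failure_times:
        if detector.record_failure(t):
            return t
    return None


# Simulated incident: a task exits non-zero every 6s for the whole 1200s timeout.
incident = [6.0 * i for i in range(1, 201)]
tripped_at = first_trip_time(incident, CrashLoopDetector(5, 60.0))
print(tripped_at)  # 30.0

# Occasional, widely spaced failures should not trip the detector.
sparse = [0.0, 100.0, 200.0, 300.0, 400.0]
print(first_trip_time(sparse, CrashLoopDetector(5, 60.0)))  # None
```

With these assumed thresholds the simulated crash-loop would be flagged at 30s rather than at the 1200s timeout. A real check would probably also require that no replica is healthy before failing the run, as the bullet notes.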