diff --git a/cc-ci-plan/IDEAS.md b/cc-ci-plan/IDEAS.md
index 87d5023..7b78228 100644
--- a/cc-ci-plan/IDEAS.md
+++ b/cc-ci-plan/IDEAS.md
@@ -164,3 +164,28 @@ item into the project `BACKLOG.md` as `[idea]` if/when it becomes relevant.
     fix.
   - *When to revisit:* before running the upgrader fully unattended/untrusted at scale, or alongside the
     "package cc-ci as a recipe" spike above (both hinge on a separate disposable test Swarm). *Added:* 2026-06-09.
+
+- [ ] **Fail fast on a crash-looping deploy + don't let one wedged run starve the CI queue.**
+  *(operator-flagged 2026-06-09, after a live incident)*
+  Observed live: plausible build 220 (a recipe whose ClickHouse service exits 1 every ~6s) held the
+  **single serial CI runner for its full `DEPLOY_TIMEOUT` (1200s / 20 min)** while the deploy never
+  converged. Because the runner is serial, that **starved every other recipe's CI behind it**: immich
+  PR-2 builds 221/222 sat "pending" for ~12 min without ever starting, and only ran once the wedged
+  plausible run was manually torn down (SIGINT to the harness, whose `finally` tore down the stack and
+  freed the runner). Two compounding problems:
+  - **No fail-fast on a crash-loop.** The deploy/health wait polls until `DEPLOY_TIMEOUT`; a service
+    that is plainly crash-looping (a task repeatedly `Failed "task: non-zero exit (1)"` / restarting
+    every few seconds with no healthy replica) could be detected and the run **failed early** (e.g.
+    after N restarts within M seconds) instead of burning the whole 1200s. Faster RED for the broken
+    recipe AND frees the runner sooner. This is the higher-value, lower-risk half: do it first.
+  - **Head-of-line blocking on a single serial runner.** One wedged/slow run blocks ALL other recipes'
+    CI. Options: bump runner concurrency (CAREFULLY: the single-node Swarm is why it's serial today,
+    to avoid parallel-deploy OOM/collision, per /upgrade-all §safety); a priority/queue policy; or at
+    minimum surface "queued behind build N" so a pending run isn't mistaken for a stuck run or its own failure.
+  - *Interim mitigations (in use):* (a) debug a known-crash-looping recipe via the `/recipe-upgrade`
+    **step-2b direct dev deploy** (`dev-` + `docker service logs`) instead of repeated
+    `!testme`, which diagnoses with full log visibility WITHOUT consuming a 20-min runner slot or starving
+    other recipes; (b) a wedged run can be manually freed by SIGINT-ing its `run_recipe_ci.py` (the
+    harness `finally` tears down its stack).
+  - *When to revisit:* if CI-queue starvation recurs (several recipes in flight, or a legitimately
+    long deploy wedging others). *Added:* 2026-06-09.
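
The fail-fast rule proposed in the first sub-item (fail the run after N restarts within M seconds) could be sketched as a sliding-window counter. This is a minimal illustration, not the harness's actual code: `CrashLoopDetector`, `first_crash_loop_time`, and the default thresholds (5 failures in 60s) are hypothetical names and values, and the sketch assumes the deploy wait loop can already obtain a timestamp for each failed task (for example by polling task states).

```python
from collections import deque
from typing import Deque, Iterable, Optional


class CrashLoopDetector:
    """Flags a crash-loop once `max_failures` task failures fall inside `window_s` seconds.

    Hypothetical helper: names and thresholds are illustrative assumptions.
    """

    def __init__(self, max_failures: int = 5, window_s: float = 60.0) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self._failures: Deque[float] = deque()

    def record_failure(self, ts: float) -> bool:
        """Record one failed task at time `ts` (seconds); return True if crash-looping."""
        self._failures.append(ts)
        # Drop failures that have aged out of the sliding window.
        while self._failures and ts - self._failures[0] > self.window_s:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures

    def record_healthy(self) -> None:
        """A healthy replica was observed: forget earlier failures."""
        self._failures.clear()


def first_crash_loop_time(
    failure_times: Iterable[float],
    max_failures: int = 5,
    window_s: float = 60.0,
) -> Optional[float]:
    """Return the timestamp at which the run would be failed early, or None."""
    detector = CrashLoopDetector(max_failures, window_s)
    for ts in failure_times:
        if detector.record_failure(ts):
            return ts
    return None
```

With a service failing every ~6s, as in the incident described, these assumed defaults would trip on the fifth failure (about 24s in) rather than waiting out the full 1200s `DEPLOY_TIMEOUT`. Widely spaced failures never accumulate inside the window, so a slow but converging deploy would not be cut short by this check alone.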