diff --git a/cc-ci-plan/plan.md b/cc-ci-plan/plan.md
index 10ca841..faff1c0 100644
--- a/cc-ci-plan/plan.md
+++ b/cc-ci-plan/plan.md
@@ -388,7 +388,18 @@ Bridge posts/updates a Gitea PR comment with the run URL and (on completion) pas
   cc-ci**. Make the `abra app new/deploy traefik` steps reproducible (scripted/Nix-invoked) for D8.
 - Each CI run gets an isolated app domain `-pr-.ci.commoninternet.net` (§4.0) so concurrent runs
   don't collide. Teardown removes app, secrets, and volumes.
-- Consider a concurrency cap (1–2 deploys at a time) to avoid resource thrash; document it.
+- **Concurrency cap + queue — use Drone natively (SETTLED).** Don't let the server fill with
+  simultaneously-deployed apps. Expose a configurable **`MAX_TESTS`** mapped to the exec runner's
+  **`DRONE_RUNNER_CAPACITY`** (Nix-set on the runner; default low — **1–2** given a single 28 GiB
+  node and heavy recipes like matrix/immich). Drone runs at most `MAX_TESTS` builds at once and
+  **automatically queues** excess builds (its native pending-build queue), starting them as slots
+  free. **Per-build timeout** (repo/runner timeout) guarantees a hung test is killed and frees its
+  slot — so "continue once a current test finishes *or times out*" is built in. No custom queue
+  needed. Optionally also set `concurrency: { limit: }` in `.drone.yml` as a per-pipeline cap.
+- **One app at a time per run, torn down at run end.** A build deploys its recipe, runs the three
+  stages, then **undeploys** — the server should not accumulate live test apps. Guaranteed teardown +
+  the run-start janitor (§4.3) enforce this even when a build is timed-out/killed (in-process
+  cleanup can't run, so the janitor reaps it).
 
 ### 4.3 The test harness & recipe test contract
 
@@ -415,6 +426,12 @@ in from day one** (carried over from prior work, re-verify on the installed abra
 The teardown guarantee is sacred: a failed test must never leak a deployed app or volume into the
 next run. Implement teardown as a pytest fixture finalizer / `try/finally` in the orchestrator and
 add a janitor pass at run start that nukes any orphaned `*-pr*` apps older than N hours.
+**Crucially, the janitor is the backstop for timed-out/killed builds:** when Drone hits the
+per-build timeout (or a build is cancelled) it may SIGKILL the runner process, so the `try/finally`
+teardown can't run — those orphaned apps/volumes are reaped by the next build's run-start janitor
+(and the janitor should run regardless of how the previous build ended). Net effect with the
+`MAX_TESTS`/`DRONE_RUNNER_CAPACITY` cap (§4.2): at most `MAX_TESTS` apps are ever live at once, and
+each is torn down (or janitor-reaped) so the single node never accumulates deployments.
 
 ### 4.4 Secrets (D6)
 
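
The teardown-plus-janitor behaviour described in the §4.3 hunk above can be sketched roughly as follows. This is a minimal illustration, not the plan's actual harness: `DeployedApp`, `find_orphans`, `run_build`, and the injected `list_apps` / `deploy` / `run_stages` / `teardown` callables are all hypothetical names, and the real listing and undeploy steps (presumably abra invocations) are deliberately left out because the diff does not specify them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from fnmatch import fnmatchcase
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class DeployedApp:
    """Hypothetical record of a live app; the real data source is an assumption."""
    name: str
    deployed_at: datetime


def find_orphans(apps: Iterable[DeployedApp], now: datetime,
                 max_age: timedelta, pattern: str = "*-pr*") -> List[DeployedApp]:
    """Select apps whose name matches the PR pattern and that are older than max_age."""
    return [a for a in apps
            if fnmatchcase(a.name, pattern) and now - a.deployed_at > max_age]


def run_build(app_name: str,
              list_apps: Callable[[], Iterable[DeployedApp]],
              deploy: Callable[[str], None],
              run_stages: Callable[[str], bool],
              teardown: Callable[[str], None],
              max_age: timedelta,
              now: Optional[datetime] = None) -> bool:
    """Janitor pass first, then deploy and test, with teardown in a finally block."""
    now = now or datetime.now(timezone.utc)

    # Run-start janitor: runs unconditionally, however the previous build ended.
    for orphan in find_orphans(list_apps(), now, max_age):
        teardown(orphan.name)

    try:
        # deploy() sits inside the try so a half-finished deploy is also cleaned up;
        # this assumes teardown() tolerates an app that never fully came up.
        deploy(app_name)
        return run_stages(app_name)
    finally:
        # Not reached if the process is SIGKILLed; the next build's janitor covers that.
        teardown(app_name)
```

Injecting the side-effecting steps as callables keeps the selection logic testable without a live server. One possible reading of the "older than N hours" threshold is that it stops the janitor from reaping apps that belong to builds still running in other slots when `MAX_TESTS` is greater than 1; under that reading, N would need to exceed the per-build timeout.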