From 667c7cd5a0fdc2ef090cf889c10db210f3ab3621 Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Wed, 27 May 2026 02:45:26 +0100
Subject: [PATCH] =?UTF-8?q?plan=20=C2=A74.2/=C2=A74.3:=20MAX=5FTESTS=20via?=
 =?UTF-8?q?=20DRONE=5FRUNNER=5FCAPACITY=20+=20native=20queue/timeout;=20te?=
 =?UTF-8?q?ardown=20after=20each=20run?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Don't overload the single node: cap concurrent test builds at a
configurable MAX_TESTS (= DRONE_RUNNER_CAPACITY); Drone natively queues
excess builds and times out hung ones, freeing slots — no custom queue.
Each run deploys one app then undeploys; the run-start janitor is the
backstop for timed-out/killed builds. At most MAX_TESTS apps live at once.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 cc-ci-plan/plan.md | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/cc-ci-plan/plan.md b/cc-ci-plan/plan.md
index 10ca841..faff1c0 100644
--- a/cc-ci-plan/plan.md
+++ b/cc-ci-plan/plan.md
@@ -388,7 +388,18 @@ Bridge posts/updates a Gitea PR comment with the run URL and (on completion) pas
   cc-ci**. Make the `abra app new/deploy traefik` steps reproducible (scripted/Nix-invoked) for D8.
 - Each CI run gets an isolated app domain `-pr-.ci.commoninternet.net` (§4.0) so concurrent runs
   don't collide. Teardown removes app, secrets, and volumes.
-- Consider a concurrency cap (1–2 deploys at a time) to avoid resource thrash; document it.
+- **Concurrency cap + queue — use Drone natively (SETTLED).** Don't let the server fill with
+  simultaneously-deployed apps. Expose a configurable **`MAX_TESTS`** mapped to the exec runner's
+  **`DRONE_RUNNER_CAPACITY`** (Nix-set on the runner; default low — **1–2** given a single 28 GiB
+  node and heavy recipes like matrix/immich). Drone runs at most `MAX_TESTS` builds at once and
+  **automatically queues** excess builds (its native pending-build queue), starting them as slots
+  free. **Per-build timeout** (repo/runner timeout) guarantees a hung test is killed and frees its
+  slot — so "continue once a current test finishes *or times out*" is built in. No custom queue
+  needed. Optionally also set `concurrency: { limit: }` in `.drone.yml` as a per-pipeline cap.
+- **One app at a time per run, torn down at run end.** A build deploys its recipe, runs the three
+  stages, then **undeploys** — the server should not accumulate live test apps. Guaranteed teardown
+  + the run-start janitor (§4.3) enforce this even when a build is timed-out/killed (in-process
+  cleanup can't run, so the janitor reaps it).

 ### 4.3 The test harness & recipe test contract

@@ -415,6 +426,12 @@ in from day one** (carried over from prior work, re-verify on the installed abra
 The teardown guarantee is sacred: a failed test must never leak a deployed app or volume into the
 next run. Implement teardown as a pytest fixture finalizer / `try/finally` in the orchestrator and
 add a janitor pass at run start that nukes any orphaned `*-pr*` apps older than N hours.
+**Crucially, the janitor is the backstop for timed-out/killed builds:** when Drone hits the
+per-build timeout (or a build is cancelled) it may SIGKILL the runner process, so the `try/finally`
+teardown can't run — those orphaned apps/volumes are reaped by the next build's run-start janitor
+(and the janitor should run regardless of how the previous build ended). Net effect with the
+`MAX_TESTS`/`DRONE_RUNNER_CAPACITY` cap (§4.2): at most `MAX_TESTS` apps are ever live at once, and
+each is torn down (or janitor-reaped) so the single node never accumulates deployments.

 ### 4.4 Secrets (D6)
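
Review note on the §4.3 janitor rule ("nukes any orphaned `*-pr*` apps older than N hours"): the selection step could look roughly like the sketch below. It only covers choosing what to reap, not the undeploy itself. Everything named here is hypothetical and not part of the patch: the `DeployedApp` record, the `select_orphans` helper, the example domains, and the 2-hour value standing in for N. How the app list and deploy timestamps are actually obtained from abra is left open.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class DeployedApp:
    # Hypothetical record; the plan does not say where this metadata comes from.
    domain: str
    deployed_at: datetime


def select_orphans(apps, now, max_age, pattern="*-pr*"):
    """Return apps whose domain matches the CI pattern and that are older than max_age."""
    return [
        app
        for app in apps
        if fnmatchcase(app.domain, pattern) and now - app.deployed_at > max_age
    ]


# Illustrative data only: the domains and ages are made up for this example.
now = datetime(2026, 5, 27, 3, 0, tzinfo=timezone.utc)
apps = [
    DeployedApp("matrix-pr-12.ci.commoninternet.net", now - timedelta(hours=5)),
    DeployedApp("immich-pr-7.ci.commoninternet.net", now - timedelta(minutes=10)),
    DeployedApp("traefik.ci.commoninternet.net", now - timedelta(days=30)),
]

# N = 2 hours is an assumed value; the plan leaves N unspecified.
orphans = select_orphans(apps, now, max_age=timedelta(hours=2))
print([app.domain for app in orphans])  # ['matrix-pr-12.ci.commoninternet.net']
```

The age threshold is what keeps the janitor from reaping a build that is still running: if `MAX_TESTS` is set above 1, N presumably needs to exceed the per-build timeout, otherwise one build's run-start janitor could remove another build's live app. The pattern match keeps long-lived infrastructure such as the traefik app out of scope regardless of age.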