plan §4.2/§4.3: MAX_TESTS via DRONE_RUNNER_CAPACITY + native queue/timeout; teardown after each run

Don't overload the single node: cap concurrent test builds at a configurable MAX_TESTS
(= DRONE_RUNNER_CAPACITY); Drone natively queues excess builds and times out hung ones,
freeing slots — no custom queue. Each run deploys one app then undeploys; the run-start
janitor is the backstop for timed-out/killed builds. At most MAX_TESTS apps live at once.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 02:45:26 +01:00
parent 8c4efe3c88
commit 667c7cd5a0


@@ -388,7 +388,18 @@ Bridge posts/updates a Gitea PR comment with the run URL and (on completion) pas
cc-ci**. Make the `abra app new/deploy traefik` steps reproducible (scripted/Nix-invoked) for D8.
- Each CI run gets an isolated app domain `<recipe>-pr<n>-<short-sha>.ci.commoninternet.net`
(§4.0) so concurrent runs don't collide. Teardown removes app, secrets, and volumes.
- Consider a concurrency cap (12 deploys at a time) to avoid resource thrash; document it.
- **Concurrency cap + queue — use Drone natively (SETTLED).** Don't let the server fill with
simultaneously-deployed apps. Expose a configurable **`MAX_TESTS`** mapped to the exec runner's
**`DRONE_RUNNER_CAPACITY`** (Nix-set on the runner; default low — **12** given a single 28 GiB
node and heavy recipes like matrix/immich). Drone runs at most `MAX_TESTS` builds at once and
**automatically queues** excess builds (its native pending-build queue), starting them as slots
free. A **per-build timeout** (repo/runner timeout) guarantees that a hung test is killed and its
slot freed, so "continue once a current test finishes *or times out*" is built in. No custom queue
is needed. Optionally also set `concurrency: { limit: <N> }` in `.drone.yml` as a per-pipeline cap.
- **One app at a time per run, torn down at run end.** A build deploys its recipe, runs the three
stages, then **undeploys**; the server should not accumulate live test apps. Guaranteed teardown
plus the run-start janitor (§4.3) enforce this even when a build is timed out or killed: in-process
cleanup can't run in that case, so the janitor reaps the leftover app.
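
As a minimal sketch of the per-run domain scheme described above, the helper below builds
`<recipe>-pr<n>-<short-sha>.ci.commoninternet.net` from a recipe name, PR number, and commit SHA.
The function name `app_domain` and the 7-character short-SHA length are assumptions made for
illustration; the plan does not specify either.

```python
# Hypothetical helper: name and sha_len default are illustrative assumptions.
CI_BASE_DOMAIN = "ci.commoninternet.net"  # base domain taken from the plan text


def app_domain(recipe: str, pr_number: int, commit_sha: str, sha_len: int = 7) -> str:
    """Return the isolated app domain for one CI run."""
    short_sha = commit_sha[:sha_len]  # truncate the full SHA to a short prefix
    return f"{recipe}-pr{pr_number}-{short_sha}.{CI_BASE_DOMAIN}"
```

Because the short SHA is part of the name, two runs for the same PR at different commits get
different domains, which is what lets concurrent runs coexist up to the `MAX_TESTS` cap.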
### 4.3 The test harness & recipe test contract
@@ -415,6 +426,12 @@ in from day one** (carried over from prior work, re-verify on the installed abra
The teardown guarantee is sacred: a failed test must never leak a deployed app or volume into the
next run. Implement teardown as a pytest fixture finalizer / `try/finally` in the orchestrator and
add a janitor pass at run start that nukes any orphaned `*-pr*` apps older than N hours.
**Crucially, the janitor is the backstop for timed-out/killed builds:** when Drone hits the
per-build timeout (or a build is cancelled) it may SIGKILL the runner process, so the `try/finally`
teardown can't run — those orphaned apps/volumes are reaped by the next build's run-start janitor
(and the janitor should run regardless of how the previous build ended). Net effect with the
`MAX_TESTS`/`DRONE_RUNNER_CAPACITY` cap (§4.2): at most `MAX_TESTS` apps are ever live at once, and
each is torn down (or janitor-reaped) so the single node never accumulates deployments.
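
The two layers above (in-process teardown, then the run-start janitor) might be sketched roughly as
follows. This is an illustration under stated assumptions, not the harness itself: `run_build`,
`find_orphans`, and the `deploy`/`run_stages`/`undeploy` callables are hypothetical names, and the
actual app removal (via abra) is deliberately left to the caller.

```python
from datetime import datetime, timedelta, timezone
from fnmatch import fnmatch


def run_build(deploy, run_stages, undeploy):
    """Run one build; undeploy always runs unless the process itself is killed."""
    try:
        deploy()
        return run_stages()
    finally:
        # Skipped only if the process is SIGKILLed; the janitor covers that case.
        undeploy()


def find_orphans(apps, max_age_hours, now=None):
    """Return names of `*-pr*` apps deployed more than max_age_hours ago.

    `apps` maps app name -> timezone-aware deploy time (hypothetical input shape).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)
    return [
        name
        for name, deployed_at in apps.items()
        if fnmatch(name, "*-pr*") and deployed_at < cutoff
    ]
```

A run would call `find_orphans` first and remove whatever it returns, then wrap its own deploy in
`run_build`, so cleanup does not depend on how the previous build ended.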
### 4.4 Secrets (D6)