plan §4.2/§4.3: MAX_TESTS via DRONE_RUNNER_CAPACITY + native queue/timeout; teardown after each run
Don't overload the single node: cap concurrent test builds at a configurable MAX_TESTS (= DRONE_RUNNER_CAPACITY). Drone natively queues excess builds and times out hung ones, freeing slots, so no custom queue is needed. Each run deploys one app, then undeploys it; the run-start janitor is the backstop for timed-out/killed builds. At most MAX_TESTS apps are live at once.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -388,7 +388,18 @@ Bridge posts/updates a Gitea PR comment with the run URL and (on completion) pas
cc-ci**. Make the `abra app new/deploy traefik` steps reproducible (scripted/Nix-invoked) for D8.
- Each CI run gets an isolated app domain `<recipe>-pr<n>-<short-sha>.ci.commoninternet.net`
(§4.0) so concurrent runs don't collide. Teardown removes app, secrets, and volumes.
- Consider a concurrency cap (1–2 deploys at a time) to avoid resource thrash; document it.
- **Concurrency cap + queue: use Drone natively (SETTLED).** Don't let the server fill with
simultaneously deployed apps. Expose a configurable **`MAX_TESTS`** mapped to the exec runner's
**`DRONE_RUNNER_CAPACITY`** (Nix-set on the runner; default low, **1–2**, given a single 28 GiB
node and heavy recipes like matrix/immich). Drone runs at most `MAX_TESTS` builds at once and
**automatically queues** excess builds (its native pending-build queue), starting them as slots
free up. A **per-build timeout** (repo/runner timeout) guarantees that a hung test is killed and
frees its slot, so "continue once a current test finishes *or times out*" is built in. No custom
queue is needed. Optionally also set `concurrency: { limit: <N> }` in `.drone.yml` as a per-pipeline cap.
- **One app at a time per run, torn down at run end.** A build deploys its recipe, runs the three
stages, then **undeploys**, so the server does not accumulate live test apps. Guaranteed teardown
plus the run-start janitor (§4.3) enforce this even when a build is timed out or killed (in-process
cleanup can't run in that case, so the janitor reaps the leftovers).
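The `MAX_TESTS` to `DRONE_RUNNER_CAPACITY` mapping in the concurrency-cap item above could be sketched roughly as follows. This is an illustration only, not the project's Nix wiring: the helper names `resolve_max_tests` and `runner_env`, the default of 1, and reading `MAX_TESTS` from the environment are all assumptions; only the two variable names come from the plan.

```python
import os

# Assumption: the plan only says "default low, 1-2"; 1 is picked here for illustration.
DEFAULT_MAX_TESTS = 1


def resolve_max_tests(environ=None):
    """Return the configured cap as a positive int (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("MAX_TESTS", str(DEFAULT_MAX_TESTS))
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"MAX_TESTS must be an integer, got {raw!r}") from None
    if value < 1:
        # A cap of 0 would leave every build pending forever.
        raise ValueError(f"MAX_TESTS must be >= 1, got {value}")
    return value


def runner_env(environ=None):
    """Build the environment fragment that would be handed to the exec runner."""
    return {"DRONE_RUNNER_CAPACITY": str(resolve_max_tests(environ))}


if __name__ == "__main__":
    print(runner_env({"MAX_TESTS": "2"}))
```

Validating the value up front means a typo in `MAX_TESTS` fails at configuration time rather than silently leaving the runner at an unintended capacity.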
### 4.3 The test harness & recipe test contract
@@ -415,6 +426,12 @@ in from day one** (carried over from prior work, re-verify on the installed abra
The teardown guarantee is sacred: a failed test must never leak a deployed app or volume into the
next run. Implement teardown as a pytest fixture finalizer / `try/finally` in the orchestrator and
add a janitor pass at run start that removes any orphaned `*-pr*` apps older than N hours.
**Crucially, the janitor is the backstop for timed-out/killed builds:** when Drone hits the
per-build timeout (or a build is cancelled) it may SIGKILL the runner process, so the `try/finally`
teardown can't run. Those orphaned apps/volumes are reaped by the next build's run-start janitor
(and the janitor should run regardless of how the previous build ended). Net effect with the
`MAX_TESTS`/`DRONE_RUNNER_CAPACITY` cap (§4.2): at most `MAX_TESTS` apps are ever live at once, and
each is torn down (or janitor-reaped) so the single node never accumulates deployments.
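The janitor-then-teardown ordering described above could look roughly like the sketch below. It is written under stated assumptions rather than as the harness's actual code: `find_orphans` and `run_build` are hypothetical names, the callables `list_apps`, `deploy`, `run_stages`, and `undeploy` are injected placeholders for whatever wraps abra, `undeploy` is assumed to be idempotent, and the regex assumes lower-case recipe names and a hexadecimal short SHA in the §4.0 domain scheme. The age threshold N is left as a required parameter because the plan does not fix it.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed shape of <recipe>-pr<n>-<short-sha>.ci.commoninternet.net
CI_APP_RE = re.compile(r"^[a-z0-9-]+-pr\d+-[0-9a-f]+\.ci\.commoninternet\.net$")


def find_orphans(apps, now, max_age_hours):
    """Return CI app domains deployed before the age cutoff.

    `apps` is an iterable of (domain, deployed_at) pairs; how that list is
    obtained is outside this sketch.
    """
    cutoff = now - timedelta(hours=max_age_hours)
    return [
        domain
        for domain, deployed_at in apps
        if CI_APP_RE.match(domain) and deployed_at < cutoff
    ]


def run_build(domain, list_apps, deploy, run_stages, undeploy, max_age_hours, now=None):
    """Run the janitor first, then deploy and test with in-process teardown."""
    now = now or datetime.now(timezone.utc)
    # Run-start janitor: executes regardless of how the previous build ended.
    for orphan in find_orphans(list_apps(), now, max_age_hours):
        undeploy(orphan)
    try:
        # deploy() sits inside the try so a half-finished deploy is also torn down.
        deploy(domain)
        return run_stages(domain)
    finally:
        # Not reached if the process is SIGKILLed; a later janitor pass
        # covers that case.
        undeploy(domain)
```

One design consequence worth noting: with `MAX_TESTS` above 1, the age threshold is what keeps the janitor from reaping apps that belong to builds still running in other slots, so it should comfortably exceed the per-build timeout. The trade-off is that an orphan left by a killed build stays deployed until it crosses that threshold.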
### 4.4 Secrets (D6)