resource safety: MAX_TESTS=capacity=1 + per-build 60m timeout (orchestrator design change)
All checks were successful
continuous-integration/drone/push Build is passing
Bound live test apps on the single 28GiB node. DRONE_RUNNER_CAPACITY=1 (MAX_TESTS) caps concurrent builds; Drone auto-queues the rest natively. deploy-drone reconcile sets the cc-ci repo build timeout to 60m (best-effort PATCH, non-fatal) so a hung build is killed and frees its slot. Janitor remains the backstop for SIGKILL'd builds.

Verified on host: DRONE_RUNNER_CAPACITY=1; repo timeout=60 via Drone API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
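
The best-effort timeout reconcile described in the commit message could be sketched in Python roughly as follows. This is an illustration, not the actual `deploy-drone` code: the helper names `build_timeout_request` and `reconcile_timeout`, the base URL, and the token value are all hypothetical, and the sketch does not reproduce the local `--resolve` behaviour. Only the endpoint path and the `{"timeout": 60}` body come from the change itself.

```python
import json
import urllib.request


def build_timeout_request(base_url, repo, token, minutes):
    """Build the PATCH request that sets a repo's build timeout (in minutes)."""
    body = json.dumps({"timeout": minutes}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/repos/{repo}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",  # assumed: Drone admin token
            "Content-Type": "application/json",
        },
    )


def reconcile_timeout(request, send=urllib.request.urlopen):
    """Send the request; report failure instead of raising (non-fatal)."""
    try:
        with send(request, timeout=10) as response:
            return 200 <= response.status < 300
    except OSError as exc:  # URLError and HTTPError are OSError subclasses
        print(f"timeout reconcile skipped: {exc}")
        return False
```

Returning `False` rather than raising keeps the reconcile non-fatal: a deploy can continue when the Drone API is unreachable, and the next reconcile can retry.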
DECISIONS.md: 21 lines changed
@@ -69,6 +69,27 @@ Architecture decisions and dead-ends. One line of rationale each. (§0, §8)
matter: polling makes it irrelevant; the operator was whitelisting `ci.commoninternet.net` in
Gitea's `ALLOWED_HOST_LIST`, but D1 no longer depends on that.)
- **Resource safety: bound live test apps — SETTLED (orchestrator design change 2026-05-27,
plan §4.2/§4.3).** Do NOT keep multiple test apps deployed at once. Three layers, all configurable:
- **MAX_TESTS = `DRONE_RUNNER_CAPACITY` = 1** (`modules/drone-runner.nix`, `maxTests` let-binding).
Drone runs at most MAX_TESTS builds at once and **auto-queues the rest in its native pending
queue** — no custom queue. Kept at 1 (single 28GiB node, heavy recipes). At capacity=1 there is
never a concurrent in-flight run, so the bound "at most 1 test app live" holds exactly.
- **Per-build TIMEOUT = 60 min** (`modules/drone.nix`, `buildTimeoutMinutes`; reconciled
best-effort via `PATCH /api/repos/recipe-maintainers/cc-ci {"timeout":60}` using the bridge's
Drone admin token, local `--resolve`, non-fatal). A build over the limit is cancelled by Drone →
the exec runner kills it → the MAX_TESTS slot frees → the queue advances. Satisfies "continue
once a test finishes OR times out".
- **Teardown + janitor backstop.** Each build deploys → runs the 3 stages → undeploys
(guaranteed `try/finally` in `conftest`/orchestrator). A SIGKILL'd/timed-out build can't run its
own teardown, so the **run-start janitor** (`lifecycle.janitor`, called before every deploy in
both fixtures + `run_recipe_ci`) reaps orphaned run apps as the backstop. At capacity=1 the CI
path will set `CCCI_JANITOR_MAX_AGE=0` (reap any orphan immediately — safe with no concurrent
runs) in the recipe-CI Drone pipeline; with capacity>1 the janitor MUST stay age-based (default
2h) to avoid reaping a live concurrent run. Net: at most MAX_TESTS apps ever live.
- Optional `concurrency: {limit: 1}` in the recipe-CI `.drone.yml` is a redundant belt — primary
mechanism is `DRONE_RUNNER_CAPACITY`. (Wired when the recipe-CI pipeline lands — see backlog.)
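
The janitor age rule above can be illustrated with a small Python sketch. It is not the real `lifecycle.janitor`: `RunApp`, `janitor_max_age`, and `select_orphans` are hypothetical names, and expressing the threshold in seconds is an assumption (the entry gives only `0` and a 2h default for `CCCI_JANITOR_MAX_AGE`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunApp:
    name: str
    deployed_at: float  # Unix timestamp in seconds (assumed unit)


def janitor_max_age(capacity, default_max_age=2 * 60 * 60):
    """Return the orphan age threshold in seconds for a given runner capacity."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    # With a single slot no other run can be live, so any leftover app is an orphan.
    # With more slots the threshold must stay age-based to spare concurrent runs.
    return 0 if capacity == 1 else default_max_age


def select_orphans(apps, now, max_age):
    """Pick the apps whose age has reached max_age; these are reaped before deploy."""
    return [app for app in apps if now - app.deployed_at >= max_age]
```

Under these assumptions, `janitor_max_age(1)` yields a threshold of zero, so `select_orphans` returns every leftover app, while any capacity above 1 falls back to the 2h default and leaves recently deployed apps alone.
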

## Open (defaults from §8, to confirm as reality lands)

- **Deploy mechanism — SETTLED (M0):** `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run *on