Deferred ideas / future enhancements (orchestrator-tracked)
Post-DONE or "revisit later" ideas that are intentionally out of scope for the current build
(§2 Definition of Done). Not active work — parked here so they aren't lost. The loops may pull an
item into the project BACKLOG.md as [idea] if/when it becomes relevant.
- ALT infra-app model: deploy traefik / warm-keycloak / drone the normal Co-op Cloud way, updated outside Nix via `abra` by maintainers. (operator-flagged 2026-05-29, alternative to the current Phase-2w design; return to later) The current design Nix-reconciles the infra/warm apps: NixOS systemd oneshots run `abra` to deploy traefik/keycloak/drone, and a nightly `nixos-rebuild` auto-updates them to latest with a pre-deploy major/manual-migration gate (WC1.2) + post-deploy health-gated rollback (WC1.1). The alternative is to treat the CI server like a normal Co-op Cloud host: traefik, the warm keycloak, and drone are plain abra deployments managed by maintainers, deployed once and upgraded by a human via `abra app upgrade` (using abra's own release-notes / major-bump caution), with no Nix reconciler and no nightly auto-update for them. Nix would provide only the host substrate (OS, docker/swarm, the harness, secrets), not the infra-app lifecycle.
  - Pros: simpler (no reconciler/rollback/nightly machinery); idiomatic Co-op Cloud ops (maintainers manage these exactly like any other coop-cloud app); updates are normal human abra actions; lower cognitive load (no "infra is special / Nix-driven" layer).
  - Cons: the infra apps are no longer reproducible-from-git: a VM recreate (the D8 throwaway rebuild) would NOT re-establish traefik/keycloak/drone; they'd need manual redeploy after a rebuild (D8 then covers only the host + harness, not the infra apps). Loses the automated nightly-latest + health-gated auto-rollback; infra updates + rollback become manual/operator discipline. Drifts from the project's "declarative, rebuildable from scratch" ethos.
  - Note: it's essentially one-or-the-other for the update path: a hybrid where Nix bootstraps them but maintainers also `abra app upgrade` them creates the dual-ownership conflict (the Nix reconciler would fight/redeploy over the maintainer's version on the next activation).
  - When to revisit: if the reconciler + rollback + nightly machinery proves high-maintenance or brittle, or if maintainers strongly prefer normal coop-cloud workflow over the Nix layer; weigh that against how much we value full reproducibility (D8) + hands-off auto-updates. Added: 2026-05-29.
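As an illustration of what a pre-deploy major-version gate (the WC1.2 idea above) might decide, here is a minimal sketch. It is not the actual gate: `parse_major` and `gate_update` are hypothetical names, and it assumes version strings of the form `MAJOR.MINOR.PATCH` with an optional `+upstream` suffix, comparing only the leading component.

```python
def parse_major(version: str) -> int:
    """Return the leading integer of a version such as '1.4.2+25.0.1'.

    Assumption: the part before '+' is MAJOR.MINOR.PATCH, optionally
    prefixed with 'v'. Anything else raises ValueError.
    """
    core = version.split("+", 1)[0].lstrip("v")
    return int(core.split(".", 1)[0])


def gate_update(current: str, candidate: str) -> str:
    """Decide whether an unattended update may proceed.

    Returns 'deploy' when the major version is unchanged and 'hold' when
    it differs, so a human can check for manual migration steps first.
    """
    if parse_major(candidate) != parse_major(current):
        return "hold"
    return "deploy"
```

For example, `gate_update("1.4.2+25.0.1", "1.5.0+25.0.4")` would let the nightly update through, while `gate_update("1.4.2", "2.0.0")` would hold it for an operator.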
- Docker Hub pull-through registry cache (`registry:2` proxy): deferred; single-host makes it marginal. Considered as a Phase-2pc perf win, then dropped (operator, 2026-05-29): on a single host, Docker's own local image store already caches pulled images (re-deploys reuse local layers), so the prune-policy fix (Phase 2pc PC1) recovers ~all the benefit. A separate pull-through cache's distinctive wins don't apply here: multi-node fan-out (one node), surviving prune/VM-rebuild on separate storage (ours would be co-located, lost on recreate), cache-miss auth (daemon already PAT-authenticated). Revisit ONLY if: (a) cc-ci goes multi-node, OR (b) Phase-2b measurement shows cold-cache / fresh-deploy pull time (D8 throwaway-rebuild, fresh-canonical seeding) is a real bottleneck AND the cache lives on recreate-surviving storage (Incus volume / host-b1 path, not the VM's ephemeral disk). Otherwise it's complexity without payoff. Added: 2026-05-29.
- Optional `--extra` flag for heavy / operational tests (opt-in heavy suite). Some recipe tests are "more than needed" for the default CI signal: state-management / long-running-instance / load / helper-script operational tests that don't fit the ephemeral per-run-deploy model cheaply but are useful occasionally. Today they're deferred to `cc-ci/machine-docs/DEFERRED.md` (e.g. matrix-synapse `compress_state.sh`, `test_complexity_limit.sh`, `test_purge.sh`) and don't run. Idea: add an opt-in `--extra` flag (e.g. `!testme --extra` on a PR comment, or a `STAGES=extra` / `EXTRA_TESTS=1` Drone build parameter) that the orchestrator passes through; recipes declare an `extra/` test dir or mark tests with `@pytest.mark.extra`; on opt-in the orchestrator runs them alongside the default tiers (still one deploy, still teardown). Default off so default CI stays fast; the operator can ask for the heavy suite when reviewing a PR that touches an extra-covered area (e.g. matrix-synapse's abra helpers). When implemented, each matching DEFERRED entry can be CLOSED by porting its test into the recipe's `extra/` and noting the commit in DEFERRED.md. Why deferred for now: default coverage is sufficient; this is a later breadth/depth knob, not a critical-path feature. Added: 2026-05-28.
- Optional webhook self-registration (admin-access environments). We deliberately made polling the primary trigger and require the CI server/bot to run on read-level access only, so the server does not auto-register Gitea webhooks (that needs repo-admin), and webhook setup is a documented manual admin task (§4.1, `docs/enroll-recipe.md`). Later, for environments where the CI server does hold admin on the recipe repos (or an org-level admin token is available), consider adding an opt-in, off-by-default feature (e.g. `WEBHOOK_AUTOREGISTER=1`) that auto-registers and idempotently reconciles the `issue_comment` webhook (URL, events, HMAC secret) on enrolled repos, matching our declarative-reconcile pattern (§9) and giving low-latency push triggering with zero manual setup. Must stay off by default and fall back to manual-doc + polling when admin isn't available, so the least-privilege (read-only) default is preserved. Why deferred: polling already satisfies D1 and the read-only posture is the goal; this is a convenience optimization for a different deployment profile. Added: 2026-05-27.
- Docker Hub `registry:2` pull-through cache (deferred from Phase 2pc). A local registry in proxy/pull-through mode, daemon `registry-mirrors`-wired, so all `docker.io` pulls are cached locally across recipes/runs/prunes. Deferred (operator, 2026-05-29): on the current single, PAT-authenticated, non-pruning host, Docker's own local image store already IS the cache (redeploys reuse local layers, proven in Phase 2pc), so a separate registry adds a service + mirror config + cache GC for marginal gain; its distinctive wins (multi-node fan-out, surviving prune/VM-rebuild on separate storage, cache-miss auth) don't apply here. Revisit ONLY if (a) cc-ci goes multi-node, OR (b) Phase-2b measurement shows cold-deploy pull time is a real bottleneck (e.g. D8 throwaway-rebuild / fresh-canonical seeding) AND the cache lives on recreate-surviving storage (an Incus volume / a path on host b1, not the VM's ephemeral disk). Otherwise it's complexity without payoff. See DECISIONS.md "Phase 2pc". Added: 2026-05-29.
- Phase-2b empirical performance work (moved out of the 2b phase). The original Phase 2b was a full empirical perf program: per-phase timing instrumentation in `results.json`, a cold/warm baseline across representative recipes, a Pareto attribution, and a menu of software optimizations. Deferred (operator, 2026-05-30): the real deploy-speed bottleneck turned out to be hardware, not software: the cc-ci VM was 2 vCPU on a 4-core host and disk-I/O-bound (load ~8, io pressure ~65%) while running warm-keycloak (JVM) + all infra; RAM was never the constraint. Fixed directly: bumped to 4 vCPU and made cc-nix-test the only running VM on b1. The software micro-opts below are judged unlikely to move the needle enough to justify the work; revisit ONLY if measurement later shows a specific software bottleneck. (Phase 2b is narrowed to just confirming the test sequence already minimizes deploys; see plan-phase2b.) Parked ideas:
  - Per-phase timing instrumentation + cold/warm baseline + attribution: do this first if perf is ever revisited; numbers should drive any change.
  - Image pulls: local registry pull-through cache (see the item above) and/or pre-pull/warm the enrolled recipes' image set so the first run doesn't pay the cold pull.
  - Readiness/convergence: replace fixed sleeps with tight health-endpoint polling; per-recipe readiness probes; parallelize independent readiness checks within a run.
  - Warm shared SSO provider (already partly live as warm-keycloak): saves per-run SSO deploy time but is a steady JVM CPU tax that slows non-SSO recipes; only worth it with proven per-run isolation; consider start-when-needed / stop-when-idle rather than always-on.
  - Runner/build caching: persistent nix store + warm flake eval; cache pip/uv wheels + Playwright browsers in a persistent volume.
  - Concurrency sizing: tune `MAX_TESTS`/runner capacity + per-recipe weights so light recipes run concurrently while heavy ones serialize, without overcommitting the node.
  - Resources: further vCPU/RAM/disk-I/O sizing (the 4-vCPU bump is done; storage I/O on b1 is the harder co-bottleneck; a faster storage pool if it ever matters).
  - abra/secret overhead: avoid regenerating/re-inserting secrets redundantly across stages.

  Why deferred: hardware was the real lever and is fixed; these are speculative software gains best validated by measurement, not assumed. Added: 2026-05-30.
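The readiness/convergence idea in the parked list (tight health-endpoint polling instead of fixed sleeps) could look roughly like the sketch below. `wait_until_ready`, its parameters, and the injectable `clock`/`sleep` hooks are hypothetical names for illustration and are not part of the existing harness; the probe itself (for example an HTTP health check) is left to the caller.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `probe` until it returns True and return the seconds waited.

    Raises TimeoutError if the probe has not succeeded once `timeout_s`
    has elapsed. `clock` and `sleep` are injectable so the loop can be
    tested without real waiting.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"not ready after {timeout_s}s")
        sleep(interval_s)
```

Compared with a fixed sleep, the run proceeds as soon as the probe passes, and the returned wait time is a number that per-phase timing instrumentation could record.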
- Package cc-ci itself as a Co-op Cloud recipe (deploy it with `abra` like any other app). (operator idea 2026-06-02; not implementing now) Today cc-ci is a NixOS flake that declares the whole server. Idea: split it so the app layer (the `!testme` bridge, dashboard, and runner) ships as a Co-op Cloud recipe, on top of a host/Swarm provided by NixOS (or any existing Co-op Cloud host). It would then be deployed/upgraded/backed-up via `abra` like any recipe, and could test itself (deploy cc-ci-the-recipe, run its own lifecycle = ultimate self-regression; pairs with the regression-canary work).
  - Pros: dogfooding (cc-ci managed by the same tooling it tests); portability/accessibility (any Co-op Cloud operator could `abra app deploy cc-ci` instead of needing the flake + a dedicated host); cleaner host-vs-CI-app separation.
  - Cons / crux: the runner must orchestrate Swarm deploys of recipes-under-test, which a normal (unprivileged) recipe can't do; it needs either a Docker-socket mount (privileged recipe; test stacks share the host → blast-radius/isolation risk), DinD (heavy/fragile), or a control-plane recipe driving a separate disposable test Swarm over SSH (cleanest, biggest change). Also loses the "entire server reproducible from one flake" property (host + CI-app become two artifacts).
  - When to revisit: after the server regression canaries land (want self-tests before re-architecting the thing that runs them). Treat as a design spike: the runner's Swarm-access model (socket / DinD / separate Swarm) is the decision that drives everything else.
- Move the upgrader's direct deploy-and-inspect onto SEPARATE infra (isolate it from the cc-ci server, so it can't tamper with the tests). (operator-flagged 2026-06-09) The upgrader now has a direct deploy-and-inspect dev loop (`/recipe-upgrade` step 2b): it deploys the WIP recipe with `abra app deploy --chaos` and reads `docker service logs` to debug an upgrade before the `!testme` CI gate. Today that runs on the cc-ci server's own local swarm, i.e. the autonomous upgrader (and its subagents) hold `ssh cc-ci` + root + docker + abra on the same host that owns the harness, the recipe tests, and the CI verdict. That's a trust-isolation hole: an agent with write access to the box that gates it could (by accident, prompt-injection, or reward-hacking) edit the tests/harness to make a failing upgrade "pass" ("hacking the tests"). We can't be fully sure the gate is honest while the thing being gated can modify the gate.
  - Idea: give the upgrader a dedicated, throwaway test server (a real `cctest`-style host, like recipe-maintainer's `cctest.autonomic.zone`) for the step-2b deploy/inspect loop, reached with scoped, test-only credentials (its own SSH key/abra context; blast radius limited to that host, per recipe-maintainer's `test-ssh/` model). The upgrader gets all the live-log feedback there and has no write access to the cc-ci server. The cc-ci `!testme` CI stays the independent, tamper-proof gate, run by infra the upgrader cannot touch, so a GREEN verdict means the tests it can't edit passed.
  - Pros: the CI gate becomes trustworthy under an autonomous/untrusted upgrader (least-privilege: the agent can deploy + inspect but not alter the harness/tests/verdict); cleaner separation of "develop the upgrade" (test server) from "verify the upgrade" (cc-ci CI); matches recipe-maintainer's limited-blast-radius `test-ssh/` posture.
  - Cons / crux: new infra to provision + keep current (a second Co-op Cloud host or VM, `abra server add`, scoped SSH creds, its own teardown/orphan-sweep); cost + maintenance; and the harness deploy path may assume `--server default`/local, so the per-run CI deploys vs the upgrader's test-server deploys need a clear split. Until then, the local-swarm dev loop + the orphan-sweep are the interim posture (functional, but NOT isolation-safe against a malicious/compromised upgrader).
  - Interim safeguard (in force now, until this lands): the upgrader must NEVER modify the cc-ci tests or harness (`tests/**`, `runner/**`) unless explicitly invoked with `--with-tests`; this is encoded as an absolute guardrail in `/recipe-upgrade` (and its step-2b direct-deploy loop). It is a written rule, not an enforced boundary, which is exactly why the separate-infra isolation above is the real fix.
  - When to revisit: before running the upgrader fully unattended/untrusted at scale, or alongside the "package cc-ci as a recipe" spike above (both hinge on a separate disposable test Swarm). Added: 2026-06-09.
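The interim safeguard above is only a written rule. A minimal sketch of how it could also be checked mechanically follows, assuming something outside the upgrader's control can list the repo-relative paths a change touches. `PROTECTED_DIRS`, `guardrail_violations`, and the `with_tests` parameter are hypothetical names; only the `tests/**` and `runner/**` patterns and the `--with-tests` opt-in come from the text.

```python
import posixpath
from typing import Iterable, List

# Top-level directories the upgrader must not touch (from the guardrail text).
PROTECTED_DIRS = ("tests", "runner")


def guardrail_violations(
    changed_paths: Iterable[str], with_tests: bool = False
) -> List[str]:
    """Return the changed paths that fall under a protected directory.

    Paths are assumed to be repo-relative with '/' separators. They are
    normalized first, so 'foo/../tests/x.py' is still caught. When
    `with_tests` is True (the explicit opt-in), nothing is flagged.
    """
    if with_tests:
        return []
    violations = []
    for raw in changed_paths:
        top = posixpath.normpath(raw).split("/", 1)[0]
        if top in PROTECTED_DIRS:
            violations.append(raw)
    return violations
```

A check like this only adds assurance if it runs on infra the upgrader cannot modify; run on the same host, it has the same trust problem as the tests it protects.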
- Fail fast on a crash-looping deploy + don't let one wedged run starve the CI queue. (operator-flagged 2026-06-09, after a live incident) Observed live: plausible build 220, a recipe whose ClickHouse service exits 1 every ~6s, held the single serial CI runner for its full `DEPLOY_TIMEOUT` (1200s / 20 min) while the deploy never converged. Because the runner is serial, that starved every other recipe's CI behind it: immich PR-2 builds 221/222 sat "pending" for ~12 min without ever starting, and only ran once the wedged plausible run was manually torn down (SIGINT to the harness, whose `finally` tore down the stack, which freed the runner). Two compounding problems:
  - No fail-fast on a crash-loop. The deploy/health wait polls until `DEPLOY_TIMEOUT`; a service that is plainly crash-looping (a task repeatedly `Failed "task: non-zero exit (1)"` / restarting every few seconds with no healthy replica) could be detected and the run failed early (e.g. after N restarts within M seconds) instead of burning the whole 1200s. That gives a faster RED for the broken recipe AND frees the runner sooner. This is the higher-value, lower-risk half; do it first.
  - Head-of-line blocking on a single serial runner. One wedged/slow run blocks ALL other recipes' CI. Options: bump runner concurrency (CAREFULLY: the single-node Swarm is why it's serial today, to avoid parallel-deploy OOM/collision, per /upgrade-all §safety); a priority/queue policy; or at minimum surface "queued behind build N" so a pending run isn't mistaken for a stuck run or a failure of its own.
  - Interim mitigations (in use): (a) debug a known-crash-looping recipe via the `/recipe-upgrade` step-2b direct dev deploy (`dev-<recipe>` + `docker service logs`) instead of repeated `!testme`; this diagnoses with full log visibility WITHOUT consuming a 20-min runner slot or starving other recipes; (b) a wedged run can be manually freed by SIGINT-ing its `run_recipe_ci.py` (the harness `finally` tears down its stack).
  - When to revisit: if CI-queue starvation recurs (several recipes in flight, or a legitimately long deploy wedging others). Added: 2026-06-09.
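The "N restarts within M seconds" rule from the fail-fast bullet can be sketched as a sliding-window counter. This is an illustration under assumptions, not the harness's code: `CrashLoopDetector` and its parameters are hypothetical, and how failures are observed (for example by parsing task states while polling) is left out. The caller is assumed to stop feeding it, or to discard it, once a replica becomes healthy.

```python
from collections import deque
from typing import Deque


class CrashLoopDetector:
    """Flag a service as crash-looping after too many failures in a window."""

    def __init__(self, max_failures: int, window_s: float) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self._failures: Deque[float] = deque()

    def record_failure(self, timestamp: float) -> bool:
        """Record one task failure at `timestamp` (seconds).

        Returns True once at least `max_failures` failures lie within the
        last `window_s` seconds, which is the signal to fail the run early.
        """
        self._failures.append(timestamp)
        # Drop failures that have aged out of the window.
        while self._failures and timestamp - self._failures[0] > self.window_s:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures
```

With illustrative thresholds of 5 failures in 60 seconds, a service failing every ~6s would trip the detector on its fifth failure, well before a 1200s `DEPLOY_TIMEOUT`; a service that fails only occasionally would never accumulate enough failures inside the window.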