cc-ci-orchestrator/cc-ci-plan/IDEAS.md

# Deferred ideas / future enhancements (orchestrator-tracked)

Post-DONE or "revisit later" ideas that are intentionally **out of scope** for the current build
(§2 Definition of Done). Not active work — parked here so they aren't lost. The loops may pull an
item into the project `BACKLOG.md` as `[idea]` if/when it becomes relevant.

- [ ] **ALT infra-app model: deploy traefik / warm-keycloak / drone the normal Co-op Cloud way,
  updated outside Nix via abra by maintainers.** *(operator-flagged 2026-05-29, alternative to the
  current Phase-2w design — return to later)*
  The **current** design Nix-*reconciles* the infra/warm apps: NixOS systemd oneshots run `abra` to
  deploy traefik/keycloak/drone, and a nightly `nixos-rebuild` auto-updates them to latest with a
  pre-deploy major/manual-migration gate (WC1.2) + post-deploy health-gated rollback (WC1.1).
  **The alternative:** treat the CI server like a normal Co-op Cloud host — traefik, the warm
  keycloak, and drone are **plain abra deployments managed by maintainers**, deployed once and
  **upgraded by a human via `abra app upgrade`** (using abra's own release-notes / major-bump
  caution), with **no Nix reconciler and no nightly auto-update** for them. Nix would provide only
  the host substrate (OS, docker/swarm, the harness, secrets), not the infra-app lifecycle.
  - *Pros:* simpler — no reconciler/rollback/nightly machinery; idiomatic Co-op Cloud ops (maintainers
    manage these exactly like any other coop-cloud app); updates are normal human abra actions; lower
    cognitive load (no "infra is special / Nix-driven" layer).
  - *Cons:* the infra apps are **no longer reproducible-from-git** — a VM recreate (the D8 throwaway
    rebuild) would NOT re-establish traefik/keycloak/drone; they'd need manual redeploy after a
    rebuild (D8 then covers only the host + harness, not the infra apps). Loses the automated
    nightly-latest + health-gated auto-rollback; infra updates + rollback become manual/operator
    discipline. Drifts from the project's "declarative, rebuildable from scratch" ethos.
  - *Note:* it's essentially one-or-the-other for the **update path** — a hybrid where Nix bootstraps
    them but maintainers also `abra app upgrade` them creates the dual-ownership conflict (the Nix
    reconciler would fight/redeploy over the maintainer's version on the next activation).
  - *When to revisit:* if the reconciler + rollback + nightly machinery proves high-maintenance or
    brittle, or if maintainers strongly prefer normal coop-cloud workflow over the Nix layer — weigh
    that against how much we value full reproducibility (D8) + hands-off auto-updates. *Added:* 2026-05-29.

- [ ] **Docker Hub pull-through registry cache (`registry:2` proxy) — deferred; single-host makes it marginal.**
  Considered as a Phase-2pc perf win, then **dropped (operator, 2026-05-29):** on a **single host**,
  Docker's own local image store already caches pulled images (re-deploys reuse local layers), so the
  prune-policy fix (Phase 2pc PC1) recovers nearly all of the benefit. None of a separate pull-through
  cache's distinctive wins applies here: multi-node fan-out (we have one node), surviving
  prune/VM-rebuild on *separate* storage (ours would be co-located and lost on recreate), and
  cache-miss auth (the daemon is already PAT-authenticated). **Revisit ONLY if:** (a) cc-ci goes
  **multi-node**, OR (b) Phase-2b measurement
  shows **cold-cache / fresh-deploy pull time** (D8 throwaway-rebuild, fresh-canonical seeding) is a
  real bottleneck **AND** the cache lives on **recreate-surviving storage** (Incus volume / host-b1
  path, not the VM's ephemeral disk). Otherwise it's complexity without payoff. *Added:* 2026-05-29.

- [ ] **Optional `--extra` flag for heavy / operational tests (opt-in heavy suite).**
  Some recipe tests are "more than needed" for the default CI signal — state-management /
  long-running-instance / load / helper-script operational tests that don't fit the ephemeral
  per-run-deploy model cheaply but are useful occasionally. Today they're deferred to
  `cc-ci/machine-docs/DEFERRED.md` (e.g. matrix-synapse `compress_state.sh`,
  `test_complexity_limit.sh`, `test_purge.sh`) and don't run.
  *Idea:* add an **opt-in `--extra` flag** (e.g. `!testme --extra` on a PR comment, or
  a `STAGES=extra` / `EXTRA_TESTS=1` Drone build parameter) that the orchestrator passes through;
  recipes declare an `extra/` test dir or mark tests with `@pytest.mark.extra`; on opt-in the
  orchestrator runs them **alongside** the default tiers (still one deploy, still teardown). Default
  off so default CI stays fast; the operator can ask for the heavy suite when reviewing a PR that
  touches an extra-covered area (e.g. matrix-synapse's abra helpers). When implemented, each
  matching DEFERRED entry can be CLOSED by porting its test into the recipe's `extra/` and noting
  the commit in DEFERRED.md. *Why deferred for now:* default coverage is sufficient; this is a
  later breadth/depth knob, not a critical-path feature. *Added:* 2026-05-28.
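
A minimal sketch of the opt-in pass-through described above, assuming the `!testme --extra` comment form and an `extra/` test dir as proposed. The helper names `parse_testme` and `select_test_dirs`, and the tier names in the usage note, are hypothetical and not part of the current harness.

```python
from typing import List, Optional


def parse_testme(comment: str) -> Optional[dict]:
    """Return trigger options for a `!testme` comment, or None if it is not a trigger."""
    tokens = comment.split()
    if not tokens or tokens[0] != "!testme":
        return None
    # Opt-in only: the heavy suite stays off unless `--extra` is present.
    return {"extra": "--extra" in tokens[1:]}


def select_test_dirs(default_dirs: List[str], extra: bool) -> List[str]:
    """Default tiers always run; the recipe's `extra/` dir is appended only on opt-in."""
    return list(default_dirs) + (["extra"] if extra else [])
```

For example, `select_test_dirs(["tier1", "tier2"], parse_testme("!testme --extra")["extra"])` would run the extra suite alongside the default tiers in the same deploy, while a plain `!testme` leaves the default set unchanged.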

- [ ] **Optional webhook self-registration (admin-access environments).**
  We deliberately made **polling the primary trigger** and require the CI server/bot to run on
  **read-level** access only — so the server does **not** auto-register Gitea webhooks (that needs
  repo-admin), and webhook setup is a documented manual admin task (§4.1, `docs/enroll-recipe.md`).
  *Later*, for environments where the CI server **does** hold admin on the recipe repos (or an
  org-level admin token is available), consider adding an **opt-in, off-by-default** feature
  (e.g. `WEBHOOK_AUTOREGISTER=1`) that auto-registers and **idempotently reconciles** the
  `issue_comment` webhook (URL, events, HMAC secret) on enrolled repos — matching our
  declarative-reconcile pattern (§9) — giving low-latency push triggering with zero manual setup.
  Must stay off by default and fall back to manual-doc + polling when admin isn't available, so the
  least-privilege (read-only) default is preserved. *Why deferred:* polling already satisfies D1 and
  the read-only posture is the goal; this is a convenience optimization for a different deployment
  profile. *Added:* 2026-05-27.
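
The idempotent-reconcile step could be sketched as a pure planning function that compares the hooks already on a repo with the desired `issue_comment` hook and returns one action. The dict shape below is a simplified, hypothetical one, not Gitea's API response schema, and `plan_webhook` is an illustrative name. The sketch does not compare the HMAC secret, since whether it can be read back is not assumed here.

```python
from typing import List, Optional, Tuple


def plan_webhook(existing: List[dict], desired: dict) -> Tuple[str, Optional[dict]]:
    """Decide the single action that makes a repo's hooks match `desired`.

    Returns ("create", None), ("update", hook) or ("noop", hook).
    """
    for hook in existing:
        if hook.get("url") != desired["url"]:
            continue  # someone else's hook; never touch it
        in_sync = (
            sorted(hook.get("events", [])) == sorted(desired["events"])
            and hook.get("active", False)
        )
        return ("noop", hook) if in_sync else ("update", hook)
    return ("create", None)
```

Running the plan twice against an unchanged repo yields `noop` the second time, which is the property that makes it safe to call on every activation.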

- [ ] **Docker Hub `registry:2` pull-through cache (deferred from Phase 2pc).** A local registry in
  proxy/pull-through mode, daemon `registry-mirrors`-wired, so all `docker.io` pulls are cached
  locally across recipes/runs/prunes. **Deferred (operator, 2026-05-29):** on the current
  **single, PAT-authenticated, non-pruning** host, Docker's own local image store already IS the
  cache (redeploys reuse local layers — proven in Phase 2pc), so a separate registry adds a service +
  mirror config + cache GC for marginal gain; its distinctive wins (multi-node fan-out, surviving
  prune/VM-rebuild on *separate* storage, cache-miss auth) don't apply here. **Revisit ONLY if** (a)
  cc-ci goes **multi-node**, OR (b) Phase-2b measurement shows **cold-deploy pull time is a real
  bottleneck** (e.g. D8 throwaway-rebuild / fresh-canonical seeding) **AND** the cache lives on
  **recreate-surviving storage** (an Incus volume / a path on host b1, not the VM's ephemeral disk).
  Otherwise it's complexity without payoff. See DECISIONS.md "Phase 2pc". *Added:* 2026-05-29.

- [ ] **Phase-2b empirical performance work (moved out of the 2b phase).** The original Phase 2b was a full
  empirical perf program: per-phase timing instrumentation in `results.json`, a cold/warm baseline
  across representative recipes, a Pareto attribution, and a menu of software optimizations. **Deferred
  (operator, 2026-05-30):** the real deploy-speed bottleneck turned out to be **hardware**, not
  software: the cc-ci VM had **2 vCPU on a 4-core host** and was **disk-I/O-bound** (load ~8, I/O pressure
  ~65%) while running warm-keycloak (JVM) + all infra; RAM was never the constraint. Fixed **directly**:
  bumped to **4 vCPU** and made cc-nix-test the **only running VM** on b1. The software micro-opts below
  are judged unlikely to move the needle enough to justify the work; revisit ONLY if measurement later
  shows a specific software bottleneck. (Phase 2b is narrowed to just confirming the test sequence
  already minimizes deploys — see plan-phase2b.) Parked ideas:
  - **Per-phase timing instrumentation** + cold/warm **baseline** + **attribution** — do this first if
    perf is ever revisited; numbers should drive any change.
  - **Image pulls:** local registry pull-through cache (see the Docker Hub pull-through cache item above) and/or pre-pull/warm the
    enrolled recipes' image set so the first run doesn't pay the cold pull.
  - **Readiness/convergence:** replace fixed sleeps with tight health-endpoint polling; per-recipe
    readiness probes; parallelize independent readiness checks within a run.
  - **Warm shared SSO provider** (already partly live as warm-keycloak): saves per-run SSO deploy time
    but is a steady JVM CPU tax that slows non-SSO recipes — only worth it with proven per-run
    isolation; consider start-when-needed / stop-when-idle rather than always-on.
  - **Runner/build caching:** persistent nix store + warm flake eval; cache pip/uv wheels + Playwright
    browsers in a persistent volume.
  - **Concurrency sizing:** tune `MAX_TESTS`/runner capacity + per-recipe weights so light recipes run
    concurrently while heavy ones serialize, without overcommitting the node.
  - **Resources:** further vCPU/RAM/disk-I/O sizing (the 4-vCPU bump is done; storage I/O on b1 is the
    harder co-bottleneck — a faster storage pool if it ever matters).
  - **abra/secret overhead:** avoid regenerating/re-inserting secrets redundantly across stages.
  *Why deferred:* hardware was the real lever and is fixed; these are speculative software gains best
  validated by measurement, not assumed. *Added:* 2026-05-30.
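
If the readiness/convergence idea is ever picked up, replacing a fixed sleep with a bounded poll might look like the sketch below. `wait_until_ready` and its parameters are hypothetical; `probe` stands in for whatever per-recipe health check the harness would supply. The clock and sleep functions are injectable only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True  # ready: return immediately instead of sleeping a fixed time
        if clock() >= deadline:
            return False
        sleep(interval)
```

The run proceeds as soon as the probe succeeds, so a service that is healthy after two seconds costs two seconds rather than the full fixed sleep.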

- [ ] **Package cc-ci itself as a Co-op Cloud recipe (deploy it with `abra` like any other app).**
  *(operator idea 2026-06-02 — not implementing now)*
  Today cc-ci is a NixOS flake that declares the whole server. Idea: split it so the **app layer**
  (the `!testme` bridge, dashboard, and runner) ships as a Co-op Cloud **recipe**, on top of a host/
  Swarm provided by NixOS (or any existing Co-op Cloud host). It would then be deployed/upgraded/
  backed-up via `abra` like any recipe — and could **test itself** (deploy `cc-ci`-the-recipe, run its
  own lifecycle = ultimate self-regression; pairs with the regression-canary work).
  - *Pros:* dogfooding (cc-ci managed by the same tooling it tests); portability/accessibility (any
    Co-op Cloud operator could `abra app deploy cc-ci` instead of needing the flake + a dedicated host);
    cleaner host-vs-CI-app separation.
  - *Cons / crux:* the runner must orchestrate Swarm deploys of recipes-under-test, which a normal
    (unprivileged) recipe can't do — needs either a Docker-socket mount (privileged recipe; test stacks
    share the host → blast-radius/isolation risk), DinD (heavy/fragile), or **a control-plane recipe
    driving a separate disposable test Swarm over SSH** (cleanest, biggest change). Also loses the
    "entire server reproducible from one flake" property (host + CI-app become two artifacts).
  - *When to revisit:* after the server regression canaries land (want self-tests before re-architecting
    the thing that runs them). Treat as a design spike — the runner's Swarm-access model (socket / DinD /
    separate Swarm) is the decision that drives everything else.

- [ ] **Move the upgrader's direct deploy-and-inspect onto SEPARATE infra (isolate it from the cc-ci
  server, so it can't tamper with the tests).** *(operator-flagged 2026-06-09)*
  The upgrader now has a direct deploy-and-inspect dev loop (`/recipe-upgrade` step 2b): it deploys the
  WIP recipe with `abra app deploy --chaos` and reads `docker service logs` to debug an upgrade before
  the `!testme` CI gate. **Today that runs on the cc-ci server's own local swarm** — i.e. the autonomous
  upgrader (and its subagents) hold `ssh cc-ci` + root + docker + abra on the **same host that owns the
  harness, the recipe tests, and the CI verdict**. That's a trust-isolation hole: an agent with write
  access to the box that gates it could — by accident, prompt-injection, or reward-hacking — **edit the
  tests/harness to make a failing upgrade "pass"** ("hacking the tests"). We can't be fully sure the gate
  is honest while the thing being gated can modify the gate.
  - *Idea:* give the upgrader a **dedicated, throwaway test server** (a real `cctest`-style host, like
    recipe-maintainer's `cctest.autonomic.zone`) for the step-2b deploy/inspect loop, reached with
    **scoped, test-only credentials** (its own SSH key/abra context — blast radius limited to that host,
    per recipe-maintainer's `test-ssh/` model). The upgrader gets all the live-log feedback there and has
    **no write access to the cc-ci server**. The cc-ci `!testme` CI stays the independent, tamper-proof
    gate — run by infra the upgrader cannot touch — so a GREEN verdict means the tests it can't edit
    passed.
  - *Pros:* the CI gate becomes trustworthy under an autonomous/untrusted upgrader (least-privilege:
    the agent can deploy + inspect but not alter the harness/tests/verdict); cleaner separation of
    "develop the upgrade" (test server) from "verify the upgrade" (cc-ci CI); matches recipe-maintainer's
    limited-blast-radius `test-ssh/` posture.
  - *Cons / crux:* new infra to provision + keep current (a second Co-op Cloud host or VM,
    `abra server add`, scoped SSH creds, its own teardown/orphan-sweep); cost + maintenance; and the harness
    deploy path may assume `--server default`/local, so the per-run CI deploys vs the upgrader's
    test-server deploys need a clear split. Until then, the local-swarm dev loop + the orphan-sweep are
    the interim posture (functional, but NOT isolation-safe against a malicious/compromised upgrader).
  - *Interim safeguard (in force now, until this lands):* the upgrader **must NEVER modify the cc-ci
    tests or harness (`tests/**`, `runner/**`) unless explicitly invoked with `--with-tests`** — encoded
    as an absolute guardrail in `/recipe-upgrade` (and its step-2b direct-deploy loop). It is a written
    rule, not an enforced boundary — which is exactly why the separate-infra isolation above is the real
    fix.
  - *When to revisit:* before running the upgrader fully unattended/untrusted at scale, or alongside the
    "package cc-ci as a recipe" spike above (both hinge on a separate disposable test Swarm). *Added:* 2026-06-09.

- [ ] **Fail fast on a crash-looping deploy + don't let one wedged run starve the CI queue.**
  *(operator-flagged 2026-06-09, after a live incident)*
  Observed live: build 220 of the `plausible` recipe, whose ClickHouse service exits 1 every ~6s, held the
  **single serial CI runner for its full `DEPLOY_TIMEOUT` (1200s / 20 min)** while the deploy never
  converged. Because the runner is serial, that **starved every other recipe's CI behind it**: immich
  PR-2 builds 221/222 sat "pending" for ~12 min without ever starting, and only ran once the wedged
  plausible run was manually torn down (SIGINT to the harness, whose `finally` block tore down the
  stack and freed the runner). Two compounding problems:
  - **No fail-fast on a crash-loop.** The deploy/health wait polls until `DEPLOY_TIMEOUT`; a service
    that is plainly crash-looping (a task repeatedly `Failed "task: non-zero exit (1)"` / restarting
    every few seconds with no healthy replica) could be detected and the run **failed early** (e.g.
    after N restarts within M seconds) instead of burning the whole 1200s. Faster RED for the broken
    recipe AND frees the runner sooner. This is the higher-value, lower-risk half — do it first.
  - **Head-of-line blocking on a single serial runner.** One wedged/slow run blocks ALL other recipes'
    CI. Options: bump runner concurrency (CAREFULLY: the single-node Swarm is why it's serial today,
    to avoid parallel-deploy OOM/collision, per /upgrade-all §safety); a priority/queue policy; or at
    minimum surface "queued behind build N" so a pending run isn't mistaken for a stuck run or a failure of its own.
  - *Interim mitigations (in use):* (a) debug a known-crash-looping recipe via the `/recipe-upgrade`
    **step-2b direct dev deploy** (`dev-<recipe>` + `docker service logs`) instead of repeated
    `!testme` — diagnoses with full log visibility WITHOUT consuming a 20-min runner slot or starving
    other recipes; (b) a wedged run can be manually freed by SIGINT-ing its `run_recipe_ci.py` (the
    harness `finally` tears down its stack).
  - *When to revisit:* if CI-queue starvation recurs (several recipes in flight, or a legitimately
    long deploy wedging others). *Added:* 2026-06-09.
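
The "N restarts within M seconds" rule could be sketched as a small sliding-window counter that the deploy/health wait consults on each observed task failure. `CrashLoopDetector`, its parameters, and the thresholds in the usage note are illustrative assumptions; how failures are observed (for example from Swarm task state) is left out.

```python
from collections import deque


class CrashLoopDetector:
    """Trip when `max_failures` task failures fall within `window_s` seconds."""

    def __init__(self, max_failures: int, window_s: float):
        self.max_failures = max_failures
        self.window_s = window_s
        self._failures = deque()

    def record_failure(self, t: float) -> bool:
        """Record a failure at time `t` (seconds); return True if the run should fail fast."""
        self._failures.append(t)
        # Drop failures that have aged out of the window.
        while self._failures and t - self._failures[0] > self.window_s:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures

    def record_healthy(self) -> None:
        """A healthy replica was seen: forget earlier failures."""
        self._failures.clear()
```

With hypothetical thresholds of 5 failures in 60 seconds, a service exiting every ~6s would trip the detector about 24 seconds after its first failure, instead of holding the runner for the full `DEPLOY_TIMEOUT`.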

- [ ] **Canonical *history* for a green-verified older upgrade base (design B).** *(operator-flagged
  2026-06-17; deferred from the `samever` fix — design A shipped instead)*
  Context: the dynamic upgrade-base resolver (phase `prevb`) uses the last-green warm-canonical as the
  base. When that canonical version **equals the PR head version** (a non-version-bump PR, or a re-test
  after the canonical advanced), phase `samever` (design A) steps back to the **newest published version
  strictly older than head** (from `recipe_tags`). That older *published* version isn't guaranteed to
  have passed green on cc-ci.
  **The improvement (B):** keep a short **history** in the warm-canonical registry — the last N green
  `{version, commit}` records, not just the single current one (`canonical.read_registry`/`write_registry`
  are single-record today). When stepping back, prefer the most-recent **prior canonical** (green-verified)
  over a raw published tag; fall back to design A's published-tag only when no prior canonical exists.
  - *Why deferred:* history only accrues going forward (existing recipes have none until they've gone
    green ≥2× on cold-on-latest), so design A (published-tag) is the always-present floor and must exist
    anyway. B is a quality refinement on top. Requires a registry schema change (single record → bounded
    history) + `promote_canonical`/`write_registry` appending instead of overwriting.
  - *Re-entry trigger:* when the published-tag fallback proves insufficient — e.g. a recipe's
    newest-older-published version is itself undeployable, so we'd rather upgrade from a known-green prior
    canonical — or operator request. *Added:* 2026-06-17.
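
Design B's bounded history and step-back preference could be sketched as below. The function names, the record shape beyond `{version, commit}`, and the default history length are hypothetical; the real `canonical.read_registry`/`write_registry` are not modeled. The sketch also assumes that any prior canonical whose version differs from head is older than head, which holds only while canonicals advance monotonically.

```python
from typing import List, Optional


def append_green(history: List[dict], version: str, commit: str, max_len: int = 5) -> List[dict]:
    """Return a new bounded history with the newest green record last."""
    kept = [r for r in history if r["version"] != version]  # re-greening replaces the old record
    return (kept + [{"version": version, "commit": commit}])[-max_len:]


def pick_upgrade_base(
    history: List[dict], head_version: str, older_published: Optional[str]
) -> Optional[dict]:
    """Prefer the most recent green canonical that differs from head (design B);
    otherwise fall back to the newest older published tag (design A)."""
    for rec in reversed(history):
        if rec["version"] != head_version:
            return {"source": "canonical", **rec}
    if older_published is not None:
        return {"source": "published", "version": older_published}
    return None
```

A recipe with no accrued history falls straight through to the published-tag branch, which is why design A remains the always-present floor.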