cc-ci-orchestrator/cc-ci-plan/IDEAS.md
autonomic-bot e85e16318c Phase 2b narrowed to "confirm minimal deploys"; perf ideas moved to IDEAS
Operator (2026-05-30): the real deploy-speed bottleneck was hardware (cc-ci VM
was 2 vCPU on a 4-core host + disk-I/O-bound; RAM fine), now fixed directly
(bumped to 4 vCPU, made cc-nix-test the only running VM on b1). The 2b software
micro-optimizations are judged unlikely to help, so:

- IDEAS.md: parked the whole empirical-perf program (instrumentation, baseline,
  attribution) + the optimization menu (image cache/prepull, readiness tuning,
  warm-SSO start/stop, runner caching, concurrency sizing, resources, secret
  overhead) under "Phase-2b empirical performance work", revisit only if
  measurement later proves a specific software bottleneck.
- plan-phase2b: reduced to ONE goal — confirm (and fix if needed) that the
  per-recipe test sequence already uses the minimum deploys (1 base shared by
  install+functional+backup/restore, +1 for the upgrade tier, +1 per dep),
  enforced by the existing DG4.1 deploy-count check, WITHOUT weakening any test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:07:49 +01:00


# Deferred ideas / future enhancements (orchestrator-tracked)
Post-DONE or "revisit later" ideas that are intentionally **out of scope** for the current build
(§2 Definition of Done). Not active work — parked here so they aren't lost. The loops may pull an
item into the project `BACKLOG.md` as `[idea]` if/when it becomes relevant.
- [ ] **ALT infra-app model: deploy traefik / warm-keycloak / drone the normal Co-op Cloud way,
updated outside Nix via abra by maintainers.** *(operator-flagged 2026-05-29, alternative to the
current Phase-2w design — return to later)*
The **current** design Nix-*reconciles* the infra/warm apps: NixOS systemd oneshots run `abra` to
deploy traefik/keycloak/drone, and a nightly `nixos-rebuild` auto-updates them to latest with a
pre-deploy major/manual-migration gate (WC1.2) + post-deploy health-gated rollback (WC1.1).
**The alternative:** treat the CI server like a normal Co-op Cloud host — traefik, the warm
keycloak, and drone are **plain abra deployments managed by maintainers**, deployed once and
**upgraded by a human via `abra app upgrade`** (using abra's own release-notes / major-bump
caution), with **no Nix reconciler and no nightly auto-update** for them. Nix would provide only
the host substrate (OS, docker/swarm, the harness, secrets), not the infra-app lifecycle.
- *Pros:* simpler — no reconciler/rollback/nightly machinery; idiomatic Co-op Cloud ops (maintainers
manage these exactly like any other coop-cloud app); updates are normal human abra actions; lower
cognitive load (no "infra is special / Nix-driven" layer).
- *Cons:* the infra apps are **no longer reproducible-from-git** — a VM recreate (the D8 throwaway
rebuild) would NOT re-establish traefik/keycloak/drone; they'd need manual redeploy after a
rebuild (D8 then covers only the host + harness, not the infra apps). Loses the automated
nightly-latest + health-gated auto-rollback; infra updates + rollback become manual/operator
discipline. Drifts from the project's "declarative, rebuildable from scratch" ethos.
- *Note:* the **update path** is essentially one-or-the-other. A hybrid where Nix bootstraps
them but maintainers also `abra app upgrade` them creates a dual-ownership conflict: on the next
activation the Nix reconciler would redeploy over the maintainer's version.
- *When to revisit:* if the reconciler + rollback + nightly machinery proves high-maintenance or
brittle, or if maintainers strongly prefer normal coop-cloud workflow over the Nix layer — weigh
that against how much we value full reproducibility (D8) + hands-off auto-updates. *Added:* 2026-05-29.
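To make the trade-off concrete, the control flow that the current design automates (and that the alternative would hand to a human) can be sketched roughly as below. This is only an illustration of a pre-deploy major gate followed by a health-gated rollback; it is not the actual WC1.1/WC1.2 implementation. The `deploy` and `healthy` callables, the `UpdateResult` type, and the assumption that versions start with a plain `MAJOR.` number are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


def major_of(version: str) -> int:
    """Return the leading MAJOR number of a 'MAJOR.MINOR.PATCH...' string."""
    return int(version.lstrip("v").split(".", 1)[0])


@dataclass
class UpdateResult:
    action: str   # "noop", "held", "updated" or "rolled_back"
    running: str  # version left running afterwards


def gated_update(current: str, latest: str,
                 deploy: Callable[[str], None],
                 healthy: Callable[[], bool]) -> UpdateResult:
    if latest == current:
        return UpdateResult("noop", current)
    # Pre-deploy gate: a major bump may need a manual migration, so hold it.
    if major_of(latest) > major_of(current):
        return UpdateResult("held", current)
    deploy(latest)
    # Post-deploy gate: keep the new version only if the app reports healthy.
    if healthy():
        return UpdateResult("updated", latest)
    deploy(current)  # roll back to the previously running version
    return UpdateResult("rolled_back", current)
```

Under the alternative model every branch of this function becomes an operator decision made while running `abra app upgrade`, which is where both the simplicity and the loss of automation come from.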
- [ ] **Docker Hub pull-through registry cache (`registry:2` proxy) — deferred; single-host makes it marginal.**
Considered as a Phase-2pc perf win, then **dropped (operator, 2026-05-29):** on a **single host**,
Docker's own local image store already caches pulled images (re-deploys reuse local layers), so the
prune-policy fix (Phase 2pc PC1) recovers nearly all of the benefit. None of a separate pull-through
cache's distinctive wins apply here: multi-node fan-out (there is only one node), surviving
prune/VM-rebuild on *separate* storage (ours would be co-located and lost on recreate), and cache-miss
auth (the daemon is already PAT-authenticated). **Revisit ONLY if:** (a) cc-ci goes **multi-node**, OR (b) Phase-2b measurement
shows **cold-cache / fresh-deploy pull time** (D8 throwaway-rebuild, fresh-canonical seeding) is a
real bottleneck **AND** the cache lives on **recreate-surviving storage** (Incus volume / host-b1
path, not the VM's ephemeral disk). Otherwise it's complexity without payoff. *Added:* 2026-05-29.
- [ ] **Optional `--extra` flag for heavy / operational tests (opt-in heavy suite).**
Some recipe tests are "more than needed" for the default CI signal — state-management /
long-running-instance / load / helper-script operational tests that don't fit the ephemeral
per-run-deploy model cheaply but are useful occasionally. Today they're deferred to
`cc-ci/machine-docs/DEFERRED.md` (e.g. matrix-synapse `compress_state.sh`,
`test_complexity_limit.sh`, `test_purge.sh`) and don't run.
*Idea:* add an **opt-in `--extra` flag** (e.g. `!testme --extra` on a PR comment, or
a `STAGES=extra` / `EXTRA_TESTS=1` Drone build parameter) that the orchestrator passes through;
recipes declare an `extra/` test dir or mark tests with `@pytest.mark.extra`; on opt-in the
orchestrator runs them **alongside** the default tiers (still one deploy, still teardown). Default
off so default CI stays fast; the operator can ask for the heavy suite when reviewing a PR that
touches an extra-covered area (e.g. matrix-synapse's abra helpers). When implemented, each
matching DEFERRED entry can be CLOSED by porting its test into the recipe's `extra/` and noting
the commit in DEFERRED.md. *Why deferred for now:* default coverage is sufficient; this is a
later breadth/depth knob, not a critical-path feature. *Added:* 2026-05-28.
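One way the opt-in could be detected is sketched below, assuming the trigger comment and the build parameters reach the orchestrator as a string and an environment mapping. The function name `extra_requested` is hypothetical, and treating `STAGES` as a comma-separated list is an assumption; nothing here reflects existing orchestrator code.

```python
import os
from typing import Mapping, Optional


def extra_requested(comment: str,
                    env: Optional[Mapping[str, str]] = None) -> bool:
    """True when the heavy suite was opted into via build parameter or comment."""
    env = os.environ if env is None else env
    # Build-parameter opt-in: EXTRA_TESTS=1 or "extra" listed in STAGES.
    if env.get("EXTRA_TESTS") == "1":
        return True
    if "extra" in env.get("STAGES", "").split(","):
        return True
    # Comment opt-in: "!testme --extra" (flag anywhere after the command).
    tokens = comment.split()
    return bool(tokens) and tokens[0] == "!testme" and "--extra" in tokens[1:]
```

If recipes mark heavy tests with `@pytest.mark.extra`, the default run could deselect them with pytest's `-m "not extra"` expression and drop that filter when this function returns `True`, so the default path stays fast without a second code path in the tests themselves.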
- [ ] **Optional webhook self-registration (admin-access environments).**
We deliberately made **polling the primary trigger** and require the CI server/bot to run on
**read-level** access only — so the server does **not** auto-register Gitea webhooks (that needs
repo-admin), and webhook setup is a documented manual admin task (§4.1, `docs/enroll-recipe.md`).
*Later*, for environments where the CI server **does** hold admin on the recipe repos (or an
org-level admin token is available), consider adding an **opt-in, off-by-default** feature
(e.g. `WEBHOOK_AUTOREGISTER=1`) that auto-registers and **idempotently reconciles** the
`issue_comment` webhook (URL, events, HMAC secret) on enrolled repos — matching our
declarative-reconcile pattern (§9) — giving low-latency push triggering with zero manual setup.
Must stay off by default and fall back to manual-doc + polling when admin isn't available, so the
least-privilege (read-only) default is preserved. *Why deferred:* polling already satisfies D1 and
the read-only posture is the goal; this is a convenience optimization for a different deployment
profile. *Added:* 2026-05-27.
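The "idempotently reconciles" part can be illustrated with a small planning function: given the hooks a repo already has, decide whether to create, update, or leave the webhook alone. This is a sketch under assumptions. The hook dictionaries are assumed to carry `id`, `active`, `events`, and `config.url` fields, the function name `plan_webhook` is hypothetical, and the HMAC secret is left out of the comparison on the assumption that the API does not echo it back.

```python
from typing import Any, Dict, List, Optional, Tuple

Hook = Dict[str, Any]


def plan_webhook(existing: List[Hook], url: str,
                 events: List[str]) -> Tuple[str, Optional[int]]:
    """Return ("create" | "update" | "noop", hook_id) for the desired webhook."""
    for hook in existing:
        # A hook is "ours" if it points at the CI server's URL.
        if hook.get("config", {}).get("url") != url:
            continue
        in_sync = (hook.get("active", False)
                   and sorted(hook.get("events", [])) == sorted(events))
        return ("noop" if in_sync else "update", hook["id"])
    return ("create", None)
```

Running the plan twice against an unchanged repo yields `noop` the second time, which is the property that makes the feature safe to run on every enrollment pass. A real reconciler would also have to decide whether to rewrite the secret on each pass, since it cannot be compared.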
- **Docker Hub `registry:2` pull-through cache (deferred from Phase 2pc).** A local registry in
proxy/pull-through mode, daemon `registry-mirrors`-wired, so all `docker.io` pulls are cached
locally across recipes/runs/prunes. **Deferred (operator, 2026-05-29):** on the current
**single, PAT-authenticated, non-pruning** host, Docker's own local image store already IS the
cache (redeploys reuse local layers — proven in Phase 2pc), so a separate registry adds a service +
mirror config + cache GC for marginal gain; its distinctive wins (multi-node fan-out, surviving
prune/VM-rebuild on *separate* storage, cache-miss auth) don't apply here. **Revisit ONLY if** (a)
cc-ci goes **multi-node**, OR (b) Phase-2b measurement shows **cold-deploy pull time is a real
bottleneck** (e.g. D8 throwaway-rebuild / fresh-canonical seeding) **AND** the cache lives on
**recreate-surviving storage** (an Incus volume / a path on host b1, not the VM's ephemeral disk).
Otherwise it's complexity without payoff. See DECISIONS.md "Phase 2pc". *Added:* 2026-05-29.
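For reference if this is ever revisited, the wiring amounts to two small pieces of configuration: the registry runs with `REGISTRY_PROXY_REMOTEURL` set, and the Docker daemon's `daemon.json` lists it under `registry-mirrors`. The sketch below merges the mirror into an existing `daemon.json` dictionary without duplicating it. The mirror address used in the usage note and the helper name are assumptions for illustration, not a planned layout.

```python
from typing import Any, Dict

# Environment for the registry container itself: setting this variable puts
# registry:2 into pull-through (proxy) mode for Docker Hub.
REGISTRY_ENV = {"REGISTRY_PROXY_REMOTEURL": "https://registry-1.docker.io"}


def with_registry_mirror(daemon_cfg: Dict[str, Any],
                         mirror_url: str) -> Dict[str, Any]:
    """Return a copy of a daemon.json dict with mirror_url listed exactly once."""
    cfg = dict(daemon_cfg)
    mirrors = list(cfg.get("registry-mirrors", []))
    if mirror_url not in mirrors:
        mirrors.append(mirror_url)
    cfg["registry-mirrors"] = mirrors
    return cfg
```

For example, `with_registry_mirror({"log-driver": "json-file"}, "http://127.0.0.1:5000")` keeps the existing key and adds the mirror list; applying it again changes nothing, which fits the declarative-reconcile style used elsewhere. The cache's storage location, the point the revisit condition hinges on, is deliberately not modelled here.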
- **Phase-2b empirical performance work (moved out of the 2b phase).** The original Phase 2b was a full
empirical perf program: per-phase timing instrumentation in `results.json`, a cold/warm baseline
across representative recipes, a Pareto attribution, and a menu of software optimizations. **Deferred
(operator, 2026-05-30):** the real deploy-speed bottleneck turned out to be **hardware**, not
software: the cc-ci VM had **2 vCPU on a 4-core host** and was **disk-I/O-bound** (load ~8, I/O pressure
~65%) while running warm-keycloak (JVM) plus all infra; RAM was never the constraint. Fixed **directly**:
bumped to **4 vCPU** and made cc-nix-test the **only running VM** on b1. The software micro-optimizations
below are judged unlikely to save enough deploy time to justify the work; revisit ONLY if measurement
later shows a specific software bottleneck. (Phase 2b is narrowed to just confirming that the test
sequence already uses the minimum number of deploys; see plan-phase2b.) Parked ideas:
- **Per-phase timing instrumentation** + cold/warm **baseline** + **attribution** — do this first if
perf is ever revisited; numbers should drive any change.
- **Image pulls:** local registry pull-through cache (see the item above) and/or pre-pull/warm the
enrolled recipes' image set so the first run doesn't pay the cold pull.
- **Readiness/convergence:** replace fixed sleeps with tight health-endpoint polling; per-recipe
readiness probes; parallelize independent readiness checks within a run.
- **Warm shared SSO provider** (already partly live as warm-keycloak): saves per-run SSO deploy time
but is a steady JVM CPU tax that slows non-SSO recipes — only worth it with proven per-run
isolation; consider start-when-needed / stop-when-idle rather than always-on.
- **Runner/build caching:** persistent nix store + warm flake eval; cache pip/uv wheels + Playwright
browsers in a persistent volume.
- **Concurrency sizing:** tune `MAX_TESTS`/runner capacity + per-recipe weights so light recipes run
concurrently while heavy ones serialize, without overcommitting the node.
- **Resources:** further vCPU/RAM/disk-I/O sizing. The 4-vCPU bump is done; storage I/O on b1 is the
harder remaining bottleneck and would need a faster storage pool if it ever matters.
- **abra/secret overhead:** avoid regenerating/re-inserting secrets redundantly across stages.
*Why deferred:* hardware was the real lever and is fixed; these are speculative software gains best
validated by measurement, not assumed. *Added:* 2026-05-30.
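Of the parked ideas, the readiness item is the easiest to picture in code: instead of a fixed sleep, poll a probe at a short interval until it succeeds or a deadline passes, and report how long the wait took so the timing can feed any later baseline. The following is a minimal sketch; `wait_until_ready` and its parameters are hypothetical names, and the probe is left abstract (an HTTP health check, a container state query, or anything else that returns a boolean).

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     timeout: float = 120.0,
                     interval: float = 0.5) -> float:
    """Poll probe() until it returns True; return the seconds waited.

    Raises TimeoutError if the deadline passes before the probe succeeds.
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        if probe():
            return time.monotonic() - start
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout:.0f}s")
        time.sleep(interval)
```

Compared with a fixed sleep, the run proceeds as soon as the service is up and fails loudly when it never comes up. Whether that saves meaningful time on this host is exactly what the instrumentation item would need to show first.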