cc-ci-orchestrator/cc-ci-plan/IDEAS.md
autonomic-bot b40dcb504c plan: queue samever (older-base fallback when last-green==head, opus); IDEAS: defer canonical-history (B)
Operator 2026-06-17. Closes the prevb resolver gap: when the last-green
warm-canonical base version equals the PR head version, step back to the
newest published version strictly older than head (design A) instead of a
same-version no-op or a skip. Design B (canonical history for a green-verified
older base) saved to IDEAS. Auto-runs after regall (watchdog advances + switches
to opus).
2026-06-17 03:50:49 +00:00


# Deferred ideas / future enhancements (orchestrator-tracked)
Post-DONE or "revisit later" ideas that are intentionally **out of scope** for the current build
(§2 Definition of Done). Not active work — parked here so they aren't lost. The loops may pull an
item into the project `BACKLOG.md` as `[idea]` if/when it becomes relevant.
- [ ] **ALT infra-app model: deploy traefik / warm-keycloak / drone the normal Co-op Cloud way,
updated outside Nix via abra by maintainers.** *(operator-flagged 2026-05-29, alternative to the
current Phase-2w design — return to later)*
The **current** design Nix-*reconciles* the infra/warm apps: NixOS systemd oneshots run `abra` to
deploy traefik/keycloak/drone, and a nightly `nixos-rebuild` auto-updates them to latest with a
pre-deploy major/manual-migration gate (WC1.2) + post-deploy health-gated rollback (WC1.1).
**The alternative:** treat the CI server like a normal Co-op Cloud host — traefik, the warm
keycloak, and drone are **plain abra deployments managed by maintainers**, deployed once and
**upgraded by a human via `abra app upgrade`** (using abra's own release-notes / major-bump
caution), with **no Nix reconciler and no nightly auto-update** for them. Nix would provide only
the host substrate (OS, docker/swarm, the harness, secrets), not the infra-app lifecycle.
- *Pros:* simpler — no reconciler/rollback/nightly machinery; idiomatic Co-op Cloud ops (maintainers
manage these exactly like any other coop-cloud app); updates are normal human abra actions; lower
cognitive load (no "infra is special / Nix-driven" layer).
- *Cons:* the infra apps are **no longer reproducible-from-git** — a VM recreate (the D8 throwaway
rebuild) would NOT re-establish traefik/keycloak/drone; they'd need manual redeploy after a
rebuild (D8 then covers only the host + harness, not the infra apps). Loses the automated
nightly-latest + health-gated auto-rollback; infra updates + rollback become manual/operator
discipline. Drifts from the project's "declarative, rebuildable from scratch" ethos.
- *Note:* it's essentially one-or-the-other for the **update path**: a hybrid where Nix bootstraps
them but maintainers also `abra app upgrade` them creates a dual-ownership conflict (the Nix
reconciler would redeploy over the maintainer's version on the next activation).
- *When to revisit:* if the reconciler + rollback + nightly machinery proves high-maintenance or
brittle, or if maintainers strongly prefer normal coop-cloud workflow over the Nix layer — weigh
that against how much we value full reproducibility (D8) + hands-off auto-updates. *Added:* 2026-05-29.
- [ ] **Docker Hub pull-through registry cache (`registry:2` proxy) — deferred; single-host makes it marginal.**
Considered as a Phase-2pc perf win, then **dropped (operator, 2026-05-29):** on a **single host**,
Docker's own local image store already caches pulled images (re-deploys reuse local layers), so the
prune-policy fix (Phase 2pc PC1) recovers ~all the benefit. A separate pull-through cache's
distinctive wins don't apply here — multi-node fan-out (one node), surviving prune/VM-rebuild on
*separate* storage (ours would be co-located, lost on recreate), cache-miss auth (daemon already
PAT-authenticated). **Revisit ONLY if:** (a) cc-ci goes **multi-node**, OR (b) Phase-2b measurement
shows **cold-cache / fresh-deploy pull time** (D8 throwaway-rebuild, fresh-canonical seeding) is a
real bottleneck **AND** the cache lives on **recreate-surviving storage** (Incus volume / host-b1
path, not the VM's ephemeral disk). Otherwise it's complexity without payoff. *Added:* 2026-05-29.
- [ ] **Optional `--extra` flag for heavy / operational tests (opt-in heavy suite).**
Some recipe tests are "more than needed" for the default CI signal — state-management /
long-running-instance / load / helper-script operational tests that don't fit the ephemeral
per-run-deploy model cheaply but are useful occasionally. Today they're deferred to
`cc-ci/machine-docs/DEFERRED.md` (e.g. matrix-synapse `compress_state.sh`,
`test_complexity_limit.sh`, `test_purge.sh`) and don't run.
*Idea:* add an **opt-in `--extra` flag** (e.g. `!testme --extra` on a PR comment, or
a `STAGES=extra` / `EXTRA_TESTS=1` Drone build parameter) that the orchestrator passes through;
recipes declare an `extra/` test dir or mark tests with `@pytest.mark.extra`; on opt-in the
orchestrator runs them **alongside** the default tiers (still one deploy, still teardown). Default
off so default CI stays fast; the operator can ask for the heavy suite when reviewing a PR that
touches an extra-covered area (e.g. matrix-synapse's abra helpers). When implemented, each
matching DEFERRED entry can be CLOSED by porting its test into the recipe's `extra/` and noting
the commit in DEFERRED.md. *Why deferred for now:* default coverage is sufficient; this is a
later breadth/depth knob, not a critical-path feature. *Added:* 2026-05-28.
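A minimal sketch of the opt-in parsing described above, assuming the trigger is a `!testme` PR comment and that `EXTRA_TESTS=1` is the build-parameter form. The names `TriggerRequest`, `parse_testme`, `select_test_dirs`, and the default `tests` directory are hypothetical; only `--extra`, `EXTRA_TESTS`, and `extra/` come from the idea itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TriggerRequest:
    extra: bool  # True only when the heavy suite was explicitly requested


def parse_testme(comment: str, env: Optional[Dict[str, str]] = None) -> Optional[TriggerRequest]:
    """Return a TriggerRequest for a `!testme` comment, or None if it is not a trigger."""
    env = env or {}
    tokens = comment.split()
    if not tokens or tokens[0] != "!testme":
        return None
    # Opt-in via the comment flag or via the (assumed) build parameter.
    extra = "--extra" in tokens[1:] or env.get("EXTRA_TESTS") == "1"
    return TriggerRequest(extra=extra)


def select_test_dirs(request: TriggerRequest) -> List[str]:
    """Default tiers always run; `extra/` is appended only on opt-in."""
    dirs = ["tests"]  # hypothetical name for the default tier directory
    if request.extra:
        dirs.append("extra")
    return dirs
```

Because `extra/` is appended to the default set rather than replacing it, the heavy suite would still share the single deploy and teardown of the normal run.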
- [ ] **Optional webhook self-registration (admin-access environments).**
We deliberately made **polling the primary trigger** and require the CI server/bot to run on
**read-level** access only — so the server does **not** auto-register Gitea webhooks (that needs
repo-admin), and webhook setup is a documented manual admin task (§4.1, `docs/enroll-recipe.md`).
*Later*, for environments where the CI server **does** hold admin on the recipe repos (or an
org-level admin token is available), consider adding an **opt-in, off-by-default** feature
(e.g. `WEBHOOK_AUTOREGISTER=1`) that auto-registers and **idempotently reconciles** the
`issue_comment` webhook (URL, events, HMAC secret) on enrolled repos — matching our
declarative-reconcile pattern (§9) — giving low-latency push triggering with zero manual setup.
Must stay off by default and fall back to manual-doc + polling when admin isn't available, so the
least-privilege (read-only) default is preserved. *Why deferred:* polling already satisfies D1 and
the read-only posture is the goal; this is a convenience optimization for a different deployment
profile. *Added:* 2026-05-27.
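The idempotent-reconcile step could look roughly like the sketch below. It only decides *what* to do (create, update, or nothing) from a list of existing hooks; the HTTP calls are left out. The hook dictionary shape (`id`, `config.url`, `events`, `active`) is an assumption loosely modelled on a Gitea hook listing, `DesiredHook` and `plan_webhook` are hypothetical names, and the HMAC secret is not compared here on the assumption that it cannot be read back from the server.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DesiredHook:
    url: str
    events: Tuple[str, ...] = ("issue_comment",)


def plan_webhook(desired: DesiredHook, existing: List[Dict]) -> Tuple[str, Optional[int]]:
    """Decide how to converge a repo onto the desired hook.

    Returns (action, hook_id) where action is "noop", "update" or "create".
    """
    for hook in existing:
        if hook.get("config", {}).get("url") != desired.url:
            continue  # not ours; leave other hooks untouched
        in_sync = (
            sorted(hook.get("events", [])) == sorted(desired.events)
            and bool(hook.get("active", False))
        )
        return ("noop" if in_sync else "update", hook["id"])
    return ("create", None)
```

Running the planner twice against an already-correct repo yields `noop` both times, which is the property that makes it safe to call on every reconcile pass.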
- [ ] **Docker Hub `registry:2` pull-through cache (deferred from Phase 2pc).** A local registry in
proxy/pull-through mode, daemon `registry-mirrors`-wired, so all `docker.io` pulls are cached
locally across recipes/runs/prunes. **Deferred (operator, 2026-05-29):** on the current
**single, PAT-authenticated, non-pruning** host, Docker's own local image store already IS the
cache (redeploys reuse local layers — proven in Phase 2pc), so a separate registry adds a service +
mirror config + cache GC for marginal gain; its distinctive wins (multi-node fan-out, surviving
prune/VM-rebuild on *separate* storage, cache-miss auth) don't apply here. **Revisit ONLY if** (a)
cc-ci goes **multi-node**, OR (b) Phase-2b measurement shows **cold-deploy pull time is a real
bottleneck** (e.g. D8 throwaway-rebuild / fresh-canonical seeding) **AND** the cache lives on
**recreate-surviving storage** (an Incus volume / a path on host b1, not the VM's ephemeral disk).
Otherwise it's complexity without payoff. See DECISIONS.md "Phase 2pc". *Added:* 2026-05-29.
- [ ] **Phase-2b empirical performance work (moved out of the 2b phase).** The original Phase 2b was a full
empirical perf program: per-phase timing instrumentation in `results.json`, a cold/warm baseline
across representative recipes, a Pareto attribution, and a menu of software optimizations. **Deferred
(operator, 2026-05-30):** the real deploy-speed bottleneck turned out to be **hardware**, not
software — the cc-ci VM was **2 vCPU on a 4-core host** and **disk-I/O-bound** (load ~8, io pressure
~65%) while running warm-keycloak (JVM) + all infra; RAM was never the constraint. Fixed **directly**:
bumped to **4 vCPU** and made cc-nix-test the **only running VM** on b1. The software micro-opts below
are judged unlikely to move the needle enough to justify the work; revisit ONLY if measurement later
shows a specific software bottleneck. (Phase 2b is narrowed to just confirming the test sequence
already minimizes deploys — see plan-phase2b.) Parked ideas:
- **Per-phase timing instrumentation** + cold/warm **baseline** + **attribution** — do this first if
perf is ever revisited; numbers should drive any change.
- **Image pulls:** local registry pull-through cache (see the item above) and/or pre-pull/warm the
enrolled recipes' image set so the first run doesn't pay the cold pull.
- **Readiness/convergence:** replace fixed sleeps with tight health-endpoint polling; per-recipe
readiness probes; parallelize independent readiness checks within a run.
- **Warm shared SSO provider** (already partly live as warm-keycloak): saves per-run SSO deploy time
but is a steady JVM CPU tax that slows non-SSO recipes — only worth it with proven per-run
isolation; consider start-when-needed / stop-when-idle rather than always-on.
- **Runner/build caching:** persistent nix store + warm flake eval; cache pip/uv wheels + Playwright
browsers in a persistent volume.
- **Concurrency sizing:** tune `MAX_TESTS`/runner capacity + per-recipe weights so light recipes run
concurrently while heavy ones serialize, without overcommitting the node.
- **Resources:** further vCPU/RAM/disk-I/O sizing (the 4-vCPU bump is done; storage I/O on b1 is the
harder co-bottleneck — a faster storage pool if it ever matters).
- **abra/secret overhead:** avoid regenerating/re-inserting secrets redundantly across stages.
*Why deferred:* hardware was the real lever and is fixed; these are speculative software gains best
validated by measurement, not assumed. *Added:* 2026-05-30.
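If perf is revisited, the per-phase timing instrumentation could start as small as the sketch below: a context manager that records wall-clock seconds and success for each named phase. `PhaseTimer` and the `timings` key are hypothetical; the actual `results.json` schema is not specified here, so the JSON shape is an assumption.

```python
import json
import time
from contextlib import contextmanager


class PhaseTimer:
    """Collects one record per phase: name, elapsed seconds, and whether it raised."""

    def __init__(self):
        self.phases = []

    @contextmanager
    def phase(self, name):
        start = time.monotonic()  # monotonic clock: immune to wall-clock jumps
        ok = False
        try:
            yield
            ok = True
        finally:
            # Recorded even when the phase raises, so failed phases are still timed.
            self.phases.append(
                {"phase": name, "seconds": round(time.monotonic() - start, 3), "ok": ok}
            )

    def to_json(self):
        return json.dumps({"timings": self.phases}, indent=2)
```

Usage would be `with timer.phase("deploy"): ...` around each stage, with `timer.to_json()` merged into the run's results at the end.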
- [ ] **Package cc-ci itself as a Co-op Cloud recipe (deploy it with `abra` like any other app).**
*(operator idea 2026-06-02 — not implementing now)*
Today cc-ci is a NixOS flake that declares the whole server. Idea: split it so the **app layer**
(the `!testme` bridge, dashboard, and runner) ships as a Co-op Cloud **recipe**, on top of a
host/Swarm provided by NixOS (or any existing Co-op Cloud host). It would then be deployed,
upgraded, and backed up via `abra` like any recipe, and could **test itself**: deploy
`cc-ci`-the-recipe and run its own lifecycle as the ultimate self-regression (pairs with the
regression-canary work).
- *Pros:* dogfooding (cc-ci managed by the same tooling it tests); portability/accessibility (any
Co-op Cloud operator could `abra app deploy cc-ci` instead of needing the flake + a dedicated host);
cleaner host-vs-CI-app separation.
- *Cons / crux:* the runner must orchestrate Swarm deploys of recipes-under-test, which a normal
(unprivileged) recipe can't do — needs either a Docker-socket mount (privileged recipe; test stacks
share the host → blast-radius/isolation risk), DinD (heavy/fragile), or **a control-plane recipe
driving a separate disposable test Swarm over SSH** (cleanest, biggest change). Also loses the
"entire server reproducible from one flake" property (host + CI-app become two artifacts).
- *When to revisit:* after the server regression canaries land (want self-tests before re-architecting
the thing that runs them). Treat as a design spike — the runner's Swarm-access model (socket / DinD /
separate Swarm) is the decision that drives everything else.
- [ ] **Move the upgrader's direct deploy-and-inspect onto SEPARATE infra (isolate it from the cc-ci
server, so it can't tamper with the tests).** *(operator-flagged 2026-06-09)*
The upgrader now has a direct deploy-and-inspect dev loop (`/recipe-upgrade` step 2b): it deploys the
WIP recipe with `abra app deploy --chaos` and reads `docker service logs` to debug an upgrade before
the `!testme` CI gate. **Today that runs on the cc-ci server's own local swarm** — i.e. the autonomous
upgrader (and its subagents) hold `ssh cc-ci` + root + docker + abra on the **same host that owns the
harness, the recipe tests, and the CI verdict**. That's a trust-isolation hole: an agent with write
access to the box that gates it could — by accident, prompt-injection, or reward-hacking — **edit the
tests/harness to make a failing upgrade "pass"** ("hacking the tests"). We can't be fully sure the gate
is honest while the thing being gated can modify the gate.
- *Idea:* give the upgrader a **dedicated, throwaway test server** (a real `cctest`-style host, like
recipe-maintainer's `cctest.autonomic.zone`) for the step-2b deploy/inspect loop, reached with
**scoped, test-only credentials** (its own SSH key/abra context — blast radius limited to that host,
per recipe-maintainer's `test-ssh/` model). The upgrader gets all the live-log feedback there and has
**no write access to the cc-ci server**. The cc-ci `!testme` CI stays the independent, tamper-proof
gate — run by infra the upgrader cannot touch — so a GREEN verdict means the tests it can't edit
passed.
- *Pros:* the CI gate becomes trustworthy under an autonomous/untrusted upgrader (least-privilege:
the agent can deploy + inspect but not alter the harness/tests/verdict); cleaner separation of
"develop the upgrade" (test server) from "verify the upgrade" (cc-ci CI); matches recipe-maintainer's
limited-blast-radius `test-ssh/` posture.
- *Cons / crux:* new infra to provision + keep current (a second Co-op Cloud host or VM, abra
`server add`, scoped SSH creds, its own teardown/orphan-sweep); cost + maintenance; and the harness
deploy path may assume `--server default`/local, so the per-run CI deploys vs the upgrader's
test-server deploys need a clear split. Until then, the local-swarm dev loop + the orphan-sweep are
the interim posture (functional, but NOT isolation-safe against a malicious/compromised upgrader).
- *Interim safeguard (in force now, until this lands):* the upgrader **must NEVER modify the cc-ci
tests or harness (`tests/**`, `runner/**`) unless explicitly invoked with `--with-tests`** — encoded
as an absolute guardrail in `/recipe-upgrade` (and its step-2b direct-deploy loop). It is a written
rule, not an enforced boundary — which is exactly why the separate-infra isolation above is the real
fix.
- *When to revisit:* before running the upgrader fully unattended/untrusted at scale, or alongside the
"package cc-ci as a recipe" spike above (both hinge on a separate disposable test Swarm). *Added:* 2026-06-09.
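For illustration only, the written guardrail could be expressed as a path check like the one below. `protected_changes` is a hypothetical helper, and the check is only as trustworthy as the host it runs on: executed on a box the upgrader can write to, it is no stronger than the written rule.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Top-level directories the upgrader must not touch without --with-tests.
PROTECTED_ROOTS = ("tests", "runner")


def protected_changes(changed_paths: Iterable[str], with_tests: bool = False) -> List[str]:
    """Return the changed paths that violate the guardrail (empty list means OK)."""
    if with_tests:
        return []
    hits = []
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        # Only a protected directory at the repo root counts, not a nested namesake.
        if parts and parts[0] in PROTECTED_ROOTS:
            hits.append(raw)
    return hits
```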
- [ ] **Fail fast on a crash-looping deploy + don't let one wedged run starve the CI queue.**
*(operator-flagged 2026-06-09, after a live incident)*
Observed live: build 220 of the plausible recipe, whose ClickHouse service exits 1 every ~6s, held the
**single serial CI runner for its full `DEPLOY_TIMEOUT` (1200s / 20 min)** while the deploy never
converged. Because the runner is serial, that **starved every other recipe's CI behind it**: immich
PR-2 builds 221/222 sat "pending" for ~12 min without ever starting, and only ran once the wedged
plausible run was manually torn down (by sending SIGINT to the harness, whose `finally` block tore
down the stack and freed the runner). Two compounding problems:
- **No fail-fast on a crash-loop.** The deploy/health wait polls until `DEPLOY_TIMEOUT`; a service
that is plainly crash-looping (a task repeatedly `Failed "task: non-zero exit (1)"` / restarting
every few seconds with no healthy replica) could be detected and the run **failed early** (e.g.
after N restarts within M seconds) instead of burning the whole 1200s. Faster RED for the broken
recipe AND frees the runner sooner. This is the higher-value, lower-risk half — do it first.
- **Head-of-line blocking on a single serial runner.** One wedged/slow run blocks ALL other recipes'
CI. Options: bump runner concurrency (CAREFULLY: the single-node Swarm is why it's serial today,
to avoid parallel-deploy OOM/collision, per /upgrade-all §safety); a priority/queue policy; or at
minimum surface "queued behind build N" so a pending run isn't mistaken for a stuck run or for a
failure of its own.
- *Interim mitigations (in use):* (a) debug a known-crash-looping recipe via the `/recipe-upgrade`
**step-2b direct dev deploy** (`dev-<recipe>` + `docker service logs`) instead of repeated
`!testme` — diagnoses with full log visibility WITHOUT consuming a 20-min runner slot or starving
other recipes; (b) a wedged run can be manually freed by SIGINT-ing its `run_recipe_ci.py` (the
harness `finally` tears down its stack).
- *When to revisit:* if CI-queue starvation recurs (several recipes in flight, or a legitimately
long deploy wedging others). *Added:* 2026-06-09.
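The "N restarts within M seconds" fail-fast rule can be sketched as a sliding-window counter. `CrashLoopDetector` and its thresholds are hypothetical, and how failures are observed (for example by polling task states) is left to the caller.

```python
from collections import deque


class CrashLoopDetector:
    """Flags a service as crash-looping after N task failures inside an M-second window."""

    def __init__(self, max_failures=5, window_seconds=60.0):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, timestamp):
        """Record one failed task; return True once the failure threshold is reached."""
        self._failures.append(timestamp)
        # Drop failures that have fallen out of the sliding window.
        while self._failures and timestamp - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures

    def record_healthy(self):
        """A healthy replica resets the count: slow starters are not crash loops."""
        self._failures.clear()
```

With the assumed thresholds (5 failures in 60s), a service exiting every ~6s would trip the detector on its fifth failure, well under a minute in, rather than after the full 1200s `DEPLOY_TIMEOUT`.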
- [ ] **Canonical *history* for a green-verified older upgrade base (design B).** *(operator-flagged
2026-06-17; deferred from the `samever` fix — design A shipped instead)*
Context: the dynamic upgrade-base resolver (phase `prevb`) uses the last-green warm-canonical as the
base. When that canonical version **equals the PR head version** (a non-version-bump PR, or a re-test
after the canonical advanced), phase `samever` (design A) steps back to the **newest published version
strictly older than head** (from `recipe_tags`). That older *published* version isn't guaranteed to
have passed green on cc-ci.
**The improvement (B):** keep a short **history** in the warm-canonical registry — the last N green
`{version, commit}` records, not just the single current one (`canonical.read_registry`/`write_registry`
are single-record today). When stepping back, prefer the most-recent **prior canonical** (green-verified)
over a raw published tag; fall back to design A's published-tag only when no prior canonical exists.
- *Why deferred:* history only accrues going forward (existing recipes have none until they've gone
green ≥2× on cold-on-latest), so design A (published-tag) is the always-present floor and must exist
anyway. B is a quality refinement on top. Requires a registry schema change (single record → bounded
history) + `promote_canonical`/`write_registry` appending instead of overwriting.
- *Re-entry trigger:* when the published-tag fallback proves insufficient — e.g. a recipe's
newest-older-published version is itself undeployable, so we'd rather upgrade from a known-green prior
canonical — or operator request. *Added:* 2026-06-17.
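A rough sketch of design B layered on design A, assuming the registry becomes a bounded list of `{version, commit}` records. `HISTORY_LIMIT`, `version_key`, `append_green`, and `resolve_base` are hypothetical names, not the existing `canonical` API, and the version ordering is deliberately naive (dotted integers only); real recipe version strings may need a richer parser.

```python
from typing import Dict, List, Optional, Tuple

HISTORY_LIMIT = 5  # hypothetical N: how many green records to keep


def version_key(version: str) -> Tuple[int, ...]:
    """Naive ordering key for dotted-integer versions such as "1.2.0"."""
    return tuple(int(part) for part in version.split("."))


def append_green(history: List[Dict[str, str]], version: str, commit: str) -> List[Dict[str, str]]:
    """Append a green record instead of overwriting; keep only the last N."""
    kept = [r for r in history if r["version"] != version]  # re-green replaces the old record
    kept.append({"version": version, "commit": commit})
    return kept[-HISTORY_LIMIT:]


def resolve_base(
    head: str, history: List[Dict[str, str]], published: List[str]
) -> Optional[Tuple[str, str]]:
    """Pick an upgrade base strictly older than head; returns (version, source)."""
    head_key = version_key(head)
    # Design B: prefer the most recent green-verified prior canonical.
    for record in reversed(history):
        if version_key(record["version"]) < head_key:
            return record["version"], "prior-canonical"
    # Design A floor: newest published tag strictly older than head.
    older = [v for v in published if version_key(v) < head_key]
    if older:
        return max(older, key=version_key), "published-tag"
    return None
```

Recipes with no accrued history fall straight through to the published-tag branch, which is why design A has to exist regardless.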