cc-ci-orchestrator/cc-ci-plan/plan-phase2w-warm-canonical-quick.md
autonomic-bot 00e90bb597 plan(2w): WC1.2 — pre-deploy auto-upgrade safety gate (major/manual-migration -> alert, hold)
Operator (2026-05-29): a passing health check does NOT prove a required manual migration ran, so
auto-update needs a PRE-deploy gate in addition to the post-deploy health rollback. Reconciler
auto-applies only non-major (patch/minor) upgrades with no manual-migration release notes; a MAJOR
recipe-version bump (or release notes flagging a manual migration) is held on the current version
with a PushNotification carrying the release notes (operator upgrades manually). Leans on abra's
own major-bump caution + recipe releaseNotes/. Updated WC1.2/WC6/principles/decisions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 00:02:28 +01:00

# cc-ci Phase 2w — Warm canonical deployments + `--quick` CI mode (Autonomous Build Plan)
**Status:** ACTIVE — **interjected into Phase 2** by operator decision (2026-05-28). Phase 2
(`plan-phase2-recipe-tests.md`) is **PAUSED at its current progress** (its STATUS-2/BACKLOG-2 state is
preserved); the loops do this phase now, then **Phase 2 resumes automatically** where it left off.
**Transition:** auto — on `## DONE` in `machine-docs/STATUS-2w.md` the watchdog returns to Phase 2.
**Builds on:** the Phase-1d/1e harness (generic suite, deploy-once, override overlays, HC1 upgrade
to PR-head, the sso-dep pattern `plan-sso-dep-testing.md`) and the now-wired Docker Hub auth.
**Owner agents:** Builder + Adversary loops (`plan.md` §6/§7); Adversary cold-verifies.
**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
**Phase order now:** 1c → 1b → 1d → 1e → 2(paused) → **2w** → 2(resume) → 2b → 3 → 4.
---
## 0. Why this phase
Cold-start CI (fresh `abra app new` → deploy → DB-init/first-boot → … → teardown) is slow, and it
re-pays that cost on every run and for every SSO dependency. This phase adds a **warm-data** layer:
keep each app's **data volume** around between runs (Co-op Cloud's `undeploy` frees RAM but keeps
volumes), so a fast `--quick` run can reattach it, upgrade to the PR code, and assert — without the
cold-provisioning cost. A persistent **keycloak** serves SSO-dependent recipes without a fresh
co-deploy each run. A **last-known-good snapshot per app** means a bad PR tested under `--quick` can
never destroy the working state+data — we roll back.
**Terminology (use these terms throughout code/docs/decisions):**
- **live-warm** — actually deployed and running (e.g. keycloak): instant to use, costs RAM.
- **data-warm** — *undeployed* (RAM freed) but its **data volume is retained**, so a later
`abra app deploy` reattaches it and boots warm (skips fresh DB-init/first-boot), costs only disk.
- **cold** — no retained data: fresh `abra app new` + new volume + full lifecycle + teardown that
deletes the volume. The authoritative default.
**Design principles settled with the operator (do not relitigate):**
- **Keep keycloak live-warm; keep everything else data-warm.** Only keycloak (shared dep) + the one
app under test run at a time. RAM stops being the limiter; **disk is the budget** (monitor; bump
only if needed — test fixtures are small).
- **Default `!testme` = full cold** (authoritative; its upgrade tier already exercises PR-upgrade per
1e). **`--quick` is an opt-in flag**, a lower-confidence fast lane.
- **The canonical known-good advances ONLY via cold runs** (esp. the nightly sweep). `--quick` NEVER
promotes the canonical — it consumes it read-mostly and rolls back on failure.
- **Snapshots: raw volume copy taken while UNDEPLOYED** (fast + consistent because nothing is
writing). **One last-known-good per app.**
- Warm volumes + snapshots are **cache, not source** — not in the git/D8 closure; re-seeded by cold
runs, not restored on a VM rebuild.
- **Warm/infra apps (traefik + keycloak) auto-update to LATEST, nightly, with health-gated
rollback** (operator, 2026-05-28). Both are **unpinned** — their reconcilers `abra recipe fetch`
the latest published recipe + chaos-deploy it. A **nightly `nixos-rebuild switch`** runs the
reconcilers so the warm/infra apps roll to latest each night. Because the recipe is fetched at
*activation* (runtime), the **nix closure stays byte-identical** (only the deployed versions float)
— D8 is preserved; the version pin is gone, so the closure is *more* stable, not less.
- **Auto-update is gated TWICE — a pre-deploy safety gate AND a post-deploy health gate:**
  - **Pre-deploy (don't even try unsafe upgrades):** only **auto-apply upgrades that don't require
    manual intervention**, i.e. non-major (patch/minor) recipe-version bumps with no manual
    migration flagged in their release notes. If current→latest is a **MAJOR bump** or the target's
    **release notes flag a manual migration**, **DO NOT auto-upgrade**: stay on the current version
    and **PUSH ALERT** with the release notes (operator upgrades manually). A passing health check
    does NOT prove a required migration was done, so this gate is independent of health. Lean on
    abra's own major-bump caution + the recipe `releaseNotes/`.
  - **Post-deploy (for upgrades we DO apply):** record running version as last-good → deploy latest
    → health-check → **healthy: commit (last-good := latest); unhealthy: roll back to last-good +
    PUSH ALERT.** For **stateful apps (keycloak / any app with a DB/volume): snapshot the data volume
    BEFORE the upgrade and restore it on rollback** (a forward DB migration can make a version-only
    rollback fail), reusing the WC3 known-good-snapshot mechanism. traefik (stateless) = version
    rollback only. (Rollback is NOT nix-generation rollback: the swarm app isn't in the generation,
    so the rollback is built into the reconciler.)
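
The two gates above can be sketched as a single reconcile pass. This is a minimal illustration, not the reconciler itself: `Hooks`, `reconcile`, and every callable on `Hooks` are hypothetical names, and the `snapshot`/`restore` hooks are assumed to handle whatever undeploy/redeploy the snapshot rule requires.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hooks:
    """Injected side effects; all names here are hypothetical."""
    needs_manual_upgrade: Callable[[str, str], bool]  # pre-deploy gate
    release_notes: Callable[[str], str]
    snapshot: Callable[[], None]   # stateful apps only
    restore: Callable[[], None]    # stateful apps only
    deploy: Callable[[str], None]
    healthy: Callable[[], bool]
    alert: Callable[[str], None]


def reconcile(current: str, latest: str, stateful: bool, h: Hooks) -> str:
    """Return the version that is last-good after this reconcile pass."""
    if latest == current:
        return current
    # Gate 1 (pre-deploy): hold on current and alert; never attempt the upgrade.
    if h.needs_manual_upgrade(current, latest):
        h.alert(f"held on {current}; {latest} needs a manual upgrade:\n"
                f"{h.release_notes(latest)}")
        return current
    # Gate 2 (post-deploy): snapshot first, because a forward DB migration
    # can make a version-only rollback fail.
    if stateful:
        h.snapshot()
    h.deploy(latest)
    if h.healthy():
        return latest  # commit: last-good := latest
    if stateful:
        h.restore()
    h.deploy(current)
    h.alert(f"rolled back {latest} -> {current} after a failed health check")
    return current
```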
---
## 1. Definition of Done (Phase 2w exit condition)
Terminates when every item holds **and the Adversary has independently cold-verified** (logged in
`machine-docs/REVIEW-2w.md`):
- [ ] **WC1 — Live-warm keycloak (SSO dep), unpinned + self-healing.** A persistent (live-warm)
keycloak runs at a stable domain, **unpinned** (reconciler `abra recipe fetch` latest +
chaos-deploy, matching traefik — drop the `kcVersion` pin; keep the *secret-generate-only-if-
missing* guard + the health-wait). SSO-dependent recipes (per `plan-sso-dep-testing.md`) point
their `setup_custom_tests` at it, create a **per-run namespaced realm+client**, then **delete
that realm** after the run, instead of co-deploying a fresh keycloak. Proven: a dependent
recipe's SSO custom tests pass against the warm keycloak; concurrent dependents don't collide
(distinct realms); leftover realms are reaped.
- [ ] **WC1.2 — Pre-deploy auto-upgrade safety gate (manual-migration → alert, don't auto-apply).**
Before the reconciler auto-applies "latest", it checks the current→latest delta: **auto-apply
only non-major (patch/minor) bumps with no manual-migration release notes.** A **MAJOR
recipe-version bump**, or a target whose **`releaseNotes/` flag a manual migration**, is NOT
auto-applied — the reconciler **stays on the current version and pushes an alert with the
release notes** for the operator to upgrade manually. (Health-pass ≠ migration-done, so this is
independent of WC1.1.) Detection leans on abra's major-bump handling + the recipe release notes.
**Adversary proof:** simulate a major/manual-migration latest → confirm the warm app stays on
current + an alert with the notes fired (no silent auto-upgrade).
- [ ] **WC1.1 — Health-gated deploy-with-rollback in the warm/infra reconcilers (traefik + keycloak).**
Each reconciler: record the running version as **last-good** → fetch+deploy latest →
health-check → **healthy: commit last-good := latest; unhealthy: roll back to last-good +
`PushNotification` alert.** For **stateful apps (keycloak): snapshot the data volume BEFORE the
upgrade; on rollback restore that snapshot + redeploy the prior version** (forward DB migrations
make version-only rollback unsafe) — reuse the WC3 snapshot helper. traefik (stateless) = version
rollback only. **Adversary proof:** force a broken "latest" (simulate) → confirm the warm app
self-reverts to the prior healthy version (keycloak with its pre-upgrade data intact) and an
alert fired; a healthy update commits the new version as last-good.
- [ ] **WC2 — Data-warm canonical model.** A canonical per warmed recipe at a **stable domain**
(distinct from cold per-run `<recipe>-<6hex>` domains), kept **data-warm** (undeployed-when-idle,
volume retained). A small declarative registry/reconciler tracks which recipes are
canonical and **at which commit** their known-good is. Re-warmable from scratch (cache).
- [ ] **WC3 — Known-good snapshots.** For each canonical app, a **raw copy of its data volume(s)
taken while undeployed**, stored under a stable path (e.g. `/var/lib/ci-warm/<recipe>/`),
tagged with the commit it passed on. **One last-known-good retained per app** (prior is
replaced atomically on update). Restore is proven to bring the app back healthy with its data.
- [ ] **WC4 — `--quick` mode.** `runner/run_recipe_ci.py` gains a `--quick` path (flag/env): reattach
the canonical warm volume (`abra app deploy` of the canonical) → **upgrade to PR head** (chaos
redeploy) → run generic UPGRADE + serving + custom assertions (generic-first invariant holds) →
**on PASS:** `abra app undeploy` (keep volume), do NOT alter the known-good; **on FAIL:**
restore the last-known-good snapshot, then undeploy. `--quick` **never promotes** the canonical.
- [ ] **WC5 — Canonical advancement via cold only.** A **green full-cold run on latest** re-snapshots
+ re-tags the canonical known-good (promote-on-green instead of deleting at teardown). A cold
run is the ONLY thing that advances a canonical. Seeding: the first green cold run on latest
makes an app canonical.
- [ ] **WC6 — Nightly: rebuild (warm/infra → latest) THEN full-cold sweep.** A scheduled nightly job,
in order: (1) **`nixos-rebuild switch`** → the warm/infra reconcilers roll traefik + keycloak to
latest, subject to the **WC1.2 pre-deploy gate** (major/manual-migration → hold on current +
alert) and the **WC1.1 health-gated rollback**; (2) the **full cold** suite across enrolled recipes
— refreshing every canonical's known-good (WC5) AND serving as a daily authoritative regression
run, now against the freshly-updated infra. Don't run while a test run is in flight. Mechanism
settled in DECISIONS (systemd timer on cc-ci / Drone cron / bridge), declarative + reproducible.
Bounded by MAX_TESTS (serial is fine — nightly). If the rebuild's health-gate rolled an infra
app back, the alert fires and the sweep still runs against the (healthy) prior version.
- [ ] **WC7 — Trigger + authority + labeling.** Default `!testme` = full cold (unchanged). `--quick`
is opt-in (`!testme --quick`, or a build param) and **never gates merge**. Run results carry
the **mode** (cold vs quick) so a `--quick` pass is distinctly labeled lower-confidence (feeds
Phase 3). Quick requires an existing canonical; if none exists, it cleanly falls back to a cold
run (or reports "no canonical: run cold first"); which of the two is an open decision (§5).
- [ ] **WC8 — Resource safety + isolation.** Warm-base runs serialize per app (MAX_TESTS honored);
warm keycloak shared safely via per-run realms; **disk monitored** (warm volumes + one snapshot
each) with a documented budget + prune of stale/orphaned warm data; cold-run teardown stays
sacred (deletes its own per-run volumes); warm data is excluded from the D8 reproducibility
closure (documented as cache).
- [ ] **WC9 — Documented + cold-verified, incl. the rollback proof.** `docs/` explains warm/quick;
the Adversary cold-verifies, **including deliberately failing a PR under `--quick` and
confirming the canonical's last-known-good is restored intact (data preserved)**, and that a
`--quick` pass did not move the known-good. No softened tests.
When WC1 through WC9 hold and are confirmed, write `## DONE` to `machine-docs/STATUS-2w.md` → the watchdog
auto-returns to **Phase 2** (resume recipe authoring).
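
One way the WC3 requirement "prior is replaced atomically on update" could be met is a symlink swap: copy the undeployed volume into a fresh directory, then atomically repoint a `known-good` symlink at it. The sketch below is an assumption-laden illustration. The `known-good` link name, `snap-` prefix, `meta.json` tag file, and `promote_snapshot` function are all hypothetical, and `shutil.copytree` does not preserve file ownership, so a real helper would likely need tar, `cp -a`, or a reflink copy (the storage format is still an open decision).

```python
import json
import os
import shutil
import tempfile
from pathlib import Path


def promote_snapshot(volume_dir: Path, warm_dir: Path, commit: str) -> Path:
    """Copy an undeployed app's volume and atomically make it the known-good."""
    warm_dir.mkdir(parents=True, exist_ok=True)
    link = warm_dir / "known-good"
    old = link.resolve() if link.is_symlink() else None

    # Build the new snapshot in its own directory; the current known-good
    # stays intact until the swap below succeeds.
    new = Path(tempfile.mkdtemp(prefix=f"snap-{commit}-", dir=warm_dir))
    shutil.copytree(volume_dir, new / "data", symlinks=True)
    (new / "meta.json").write_text(json.dumps({"commit": commit}))

    # Atomic swap: create a temporary symlink, then rename it over the old one.
    tmp = warm_dir / "known-good.tmp"
    if tmp.is_symlink():
        tmp.unlink()
    os.symlink(new.name, tmp)
    os.replace(tmp, link)

    # Keep exactly one last-known-good per app.
    if old is not None and old.exists():
        shutil.rmtree(old)
    return new
```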
---
## 2. The `--quick` flow (reference)
```
PRECOND: a canonical for <recipe> exists (seeded by a prior green cold run); else fall back/report.
1. abra app deploy <canonical-domain> # reattach warm volume -> fast warm boot at known-good commit
2. wait_healthy
3. (deps) point at the warm keycloak; create a per-run realm+client (namespaced)
4. UPGRADE to PR head (abra app deploy --chaos to the PR checkout) # the op, once
5. assert: generic upgrade (reconverge + moved + serving) + recipe overlay + custom (requires_deps)
6a. PASS -> abra app undeploy <canonical-domain> # keep volume; known-good UNCHANGED
6b. FAIL -> restore last-known-good snapshot to the volume; abra app undeploy # roll back, data safe
7. (deps) delete the per-run realm from the warm keycloak
```
Cold run (default, unchanged) seeds/advances the canonical: on a green cold run on latest, snapshot
the (undeployed) volume → replace the last-known-good + tag the commit, and keep the volume as the
new canonical instead of deleting it.
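
The `--quick` flow above could be driven by an orchestrator shaped roughly like the following. This is a sketch under stated assumptions: `QuickSteps`, `run_quick`, and the returned labels are hypothetical, each callable stands in for the numbered step noted beside it, and treating any raised exception as a quick failure is a design choice of this sketch rather than something the plan specifies.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QuickSteps:
    """Injected operations; every name is a hypothetical stand-in."""
    has_canonical: Callable[[], bool]              # PRECOND
    deploy_canonical: Callable[[], None]           # step 1
    wait_healthy: Callable[[], None]               # step 2
    create_realm: Optional[Callable[[], str]]      # step 3 (None: no SSO dep)
    upgrade_to_pr_head: Callable[[], None]         # step 4
    run_assertions: Callable[[], bool]             # step 5
    restore_known_good: Callable[[], None]         # step 6b
    undeploy: Callable[[], None]                   # steps 6a and 6b
    delete_realm: Optional[Callable[[str], None]]  # step 7


def run_quick(s: QuickSteps) -> str:
    """Run the quick lane; never promotes the canonical."""
    if not s.has_canonical():
        return "no-canonical"
    passed = False
    realm = None
    try:
        s.deploy_canonical()
        s.wait_healthy()
        if s.create_realm:
            realm = s.create_realm()
        s.upgrade_to_pr_head()
        passed = s.run_assertions()
    except Exception:
        passed = False  # any step error counts as a quick failure
    finally:
        if not passed:
            s.restore_known_good()  # roll back, data safe
        s.undeploy()                # volume is kept either way
        if realm is not None and s.delete_realm:
            s.delete_realm(realm)
    return "pass" if passed else "fail"
```

Putting the restore, undeploy, and realm cleanup in `finally` means a crash in any earlier step still leaves the canonical rolled back and the warm keycloak free of the per-run realm.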
## 3. Milestones (bounded)
- **W0 — Warm keycloak, unpinned + self-healing (WC1, WC1.1).** Highest ROI; unblocks faster SSO
recipe tests for the resumed Phase 2. Includes the health-gated deploy-with-rollback (snapshot
keycloak before upgrade, restore on health-fail + alert); apply the same to traefik (version-only).
- **W1 — Canonical registry + snapshot/restore (WC2, WC3).** Stable-domain warm apps; raw-while-
stopped snapshot + restore; prove restore round-trips data. (Shares the snapshot helper with WC1.1.)
- **W2 — `--quick` mode (WC4, WC7).** Orchestrator path + labeling + fallback.
- **W3 — Nightly rebuild→sweep + cold-advances-canonical (WC5, WC6).** Nightly `nixos-rebuild`
(warm/infra → latest, health-gated) then full-cold sweep; promote-on-green-cold; scheduled job.
- **W4 — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9).** Then
`## DONE`.
## 4. Guardrails
- **`--quick` never advances the canonical; only cold does.** Anchors the baseline to verified states.
- **Never lose the known-good** — snapshot before mutate (or rely on the standing known-good); restore
on any quick failure. The rollback proof (WC9) is mandatory.
- **Default stays cold; quick is opt-in + clearly lower-confidence.** Don't let a quick pass read as
full confidence.
- **Snapshot only while undeployed** (consistency). **One last-known-good per app** (disk).
- **Cold teardown stays sacred** (deletes per-run volumes); warm volumes are a managed cache, never
confused with per-run state; warm data excluded from D8.
- **Warm/infra auto-update is health-gated** — a failed "latest" self-reverts to the last-good
version (+ data, for stateful apps) and alerts; never leave the proxy/SSO dep broken silently.
- **Never weaken a test** (cardinal rule). Generic-first invariant holds in `--quick` too.
- **Bounded** — build the mechanism + prove on keycloak + a couple of recipes; do NOT re-warm all
recipes here (the nightly sweep populates canonicals over time).
## 5. Open decisions (log in machine-docs/DECISIONS.md)
- Canonical **stable-domain scheme** (distinct from cold per-run domains) + how the registry/reconciler
is declared.
- **Snapshot storage + format** (raw tar vs reflink/CoW copy) under `/var/lib/ci-warm/`; atomic replace.
- **Nightly sweep mechanism** (systemd timer / Drone cron / bridge) + ordering + disk-prune policy.
- `--quick` **trigger surface** (`!testme --quick` comment vs Drone build param) + the "no canonical
yet" fallback (run cold vs report-and-skip).
- **Disk budget**: measure warm volume + snapshot sizes across recipes; decide if a 30→larger bump is
needed or the warm set stays bounded.
- **Stateful pre-upgrade snapshot consistency (keycloak).** keycloak is *live-warm* (running) at
nightly-upgrade time, but the snapshot rule is "raw copy while UNDEPLOYED." Cleanest: the nightly
keycloak update = **undeploy → raw snapshot → deploy latest → health-check → on fail restore
snapshot + redeploy prior** (the brief nightly downtime makes the snapshot consistent and honors
the WC3 invariant). Confirm this vs an app-consistent backup alternative.
- **last-good version state** for warm/infra apps (where the reconciler records the prior healthy
version to roll back to) — a small state file alongside the snapshot, re-derivable from the running
swarm version label.
- **Manual-migration / major-bump detection (WC1.2).** How to decide "auto-apply vs alert-and-hold":
primary signal = **major recipe-version bump** (coop-cloud `<recipe-semver>+<upstream>`; major
recipe-semver = breaking, matches abra's own major-upgrade caution); secondary = scan the target's
`releaseNotes/<version>.md` for manual-migration markers. Decide the exact rule + whether to parse
`abra app upgrade` output vs compute the delta directly.
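
The "compute the delta directly" option for WC1.2 might look like the sketch below. It is illustrative only: `recipe_major`, `must_hold`, and the `MANUAL_MARKERS` pattern are hypothetical, the marker words are placeholders for whatever rule is decided, and which side of the `+` carries the recipe semver is exposed as the `recipe_first` parameter because it should be confirmed against real recipe tags. Unparseable versions are held rather than applied, so the gate fails safe.

```python
import re
from pathlib import Path

# Hypothetical markers; the exact rule is still an open decision.
MANUAL_MARKERS = re.compile(r"manual|migrat|breaking", re.IGNORECASE)


def recipe_major(version: str, recipe_first: bool = True) -> int:
    """Major number of the recipe-semver half of an 'a.b.c+x.y.z' version."""
    left, _, right = version.partition("+")
    part = left if recipe_first or not right else right
    match = re.match(r"v?(\d+)\.", part)
    if match is None:
        raise ValueError(f"unparseable recipe version: {version!r}")
    return int(match.group(1))


def must_hold(current: str, latest: str, notes_dir: Path,
              recipe_first: bool = True) -> bool:
    """True if the upgrade must be held for the operator (alert, no deploy)."""
    try:
        if recipe_major(latest, recipe_first) != recipe_major(current, recipe_first):
            return True  # primary signal: major recipe-version bump
    except ValueError:
        return True  # fail safe: unknown delta is never auto-applied
    # Secondary signal: manual-migration markers in the target's release notes.
    notes = notes_dir / f"{latest}.md"
    return notes.is_file() and bool(MANUAL_MARKERS.search(notes.read_text()))
```

For example, `must_hold("1.2.0+25.0.1", "2.0.0+26.0.0", Path("releaseNotes"))` returns `True` on the major bump alone, without reading any notes file.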