# STATUS — Phase 2w (warm canonical deployments + `--quick` CI mode)

**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`

**Loop state for THIS phase:** STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared). Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state. Phase 2 is **PAUSED** (STATUS-2/BACKLOG-2 intact) and resumes after 2w `## DONE`.

## Phase

Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe canonicals at stable domains, known-good snapshots, an opt-in `--quick` fast lane that reattaches the canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversary cold-verified.

## Definition of Done (Phase 2w) — WC1–WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w

- [ ] **WC1** — Live-warm keycloak (SSO dep) at a stable domain, **UNPINNED** (fetch latest + chaos deploy, like traefik; keep secret-generate-only-if-missing + health-wait); dependents create+delete per-run namespaced realms; concurrent dependents don't collide; leftover realms reaped.
- [ ] **WC1.1** — Health-gated deploy-with-rollback in warm/infra reconcilers (traefik+keycloak): record last-good → deploy latest → health-check → healthy commits last-good:=latest; unhealthy rolls back + alerts. Stateful (keycloak): snapshot data volume before upgrade, restore on rollback (reuse WC3 helper). traefik = version rollback only.
- [ ] **WC1.2** — Pre-deploy safety gate: auto-apply only non-major/no-manual-migration bumps; a MAJOR bump or manual-migration release notes → stay on current + alert with notes (no silent auto-upgrade).
- [ ] **WC2** — Data-warm canonical model: per-recipe canonical at a stable domain, declarative registry tracking recipe→known-good commit; re-warmable from scratch.
- [ ] **WC3** — Known-good snapshots: raw volume copy taken while undeployed under stable path; one last-known-good per app, atomic replace; restore proven to round-trip data.
- [ ] **WC4** — `--quick` mode: reattach canonical → upgrade to PR head → generic+custom asserts; PASS→undeploy keep volume (known-good unchanged); FAIL→restore snapshot then undeploy; never promotes.
- [ ] **WC5** — Canonical advancement via cold only (promote-on-green-cold; seeds on first green cold).
- [ ] **WC6** — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
- [ ] **WC7** — Trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge; results carry mode; clean no-canonical fallback.
- [ ] **WC8** — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
- [ ] **WC9** — Docs + cold verify incl. the rollback proof (deliberately fail a PR under `--quick`, confirm last-known-good restored intact; a `--quick` pass did not move the known-good).

## Milestones (plan §3)

- **W0** — Warm keycloak (WC1). ← IN FLIGHT
- **W1** — Canonical registry + snapshot/restore (WC2, WC3).
- **W2** — `--quick` mode (WC4, WC7).
- **W3** — Cold-advances-canonical + nightly sweep (WC5, WC6).
- **W4** — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.

## In flight

**W0 — live-warm keycloak (WC1).** Done so far (commits up to 88c1114):

- W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit pass).
- W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).
- **WC1 core mechanism PROVEN** deploy-free on the live warm keycloak: realm create → password-grant JWT → discovery issuer → delete(idempotent) → reap(keeps live hex / deletes orphan). All PASS.
- W0.3 declarative reconciler `nix/modules/warm-keycloak.nix` up; `nixos-rebuild switch` → warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned + skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)

**Re-sequenced after the 2026-05-28/29 design update (unpin + WC1.1 rollback + WC1.2 safety gate):** WC1.1's keycloak rollback needs the **WC3 snapshot/restore helper**, so build that FIRST, then rewrite the reconciler ONCE into the unpinned + safety-gated + health-gated-with-rollback form. Next:

1. **WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`): raw copy of an app's data volume(s) while undeployed, under `/var/lib/ci-warm//`, atomic replace, one last-good; restore round-trips data. + unit tests + live round-trip proof.
2. Rewrite reconciler: unpin keycloak (fetch latest + chaos); WC1.2 safety gate (major / manual-migration → hold + alert); WC1.1 record last-good → (keycloak: undeploy→snapshot→deploy latest) → health-gate → commit-or-rollback+restore+alert.
3. Settle the **alert mechanism** (bash reconciler can't call agent PushNotification — sentinel file the Builder loop relays, see DECISIONS).
4. Resolve the lasuite-docs in-place-redeploy race (BUILD finding below) OR pick a more-robust dependent, then the headline WC1 e2e (dependent SSO green vs warm keycloak) + concurrency proof.

**Build finding (mine, to fix):** lasuite-docs `setup_custom_tests` in-place `abra app deploy --force --chaos` (OIDC wiring) fails: nginx `web` fatally exits `[emerg] host not found in upstream ...backend:8000` during the rolling restart → abra converge times out. Independent of warm/cold keycloak. Blocks the WC1 dependent-green proof until fixed/worked-around.

## Gate

(none claimed yet)

## Blocked

(none)

## Notes

- **Disk budget (WC8 watch):** cc-ci `/` was 91% (2.4G free) at phase start; freed orphaned Phase-2 cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free).
  9.7GB reclaimable in Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
- Stable-domain scheme (proposed, see DECISIONS): `warm-.ci.commoninternet.net`, distinct from cold `-<6hex>`.
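
Sketch of the WC1.1 control flow (record last-good → deploy latest → health-check → commit-or-rollback), for reference while the reconciler rewrite is pending. This is an illustrative outline only, not the reconciler itself: `reconcile`, `ReconcileResult`, and every callable passed in are hypothetical names, and the real deploy, health-check, and snapshot steps (abra, the WC3 helper) are assumed to be injected by the caller. It also assumes the rollback restore runs while undeployed, mirroring the snapshot step. The WC1.2 safety gate is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReconcileResult:
    committed: bool        # True if last-good advanced to the candidate
    running: str           # version left running after the reconcile
    alert: Optional[str]   # alert text to relay, or None


def reconcile(
    last_good: str,
    candidate: str,
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    undeploy: Optional[Callable[[], None]] = None,
    snapshot: Optional[Callable[[], None]] = None,
    restore: Optional[Callable[[], None]] = None,
) -> ReconcileResult:
    """Health-gated deploy with rollback (all hooks are hypothetical)."""
    # Stateful apps (keycloak) supply all three data hooks; stateless
    # ones (traefik) supply none and get version rollback only.
    stateful = all(h is not None for h in (undeploy, snapshot, restore))

    if candidate == last_good:
        return ReconcileResult(False, last_good, None)  # nothing to do

    if stateful:
        undeploy()   # snapshot is taken while undeployed
        snapshot()
    deploy(candidate)

    if healthy():
        # Healthy: commit last-good := candidate.
        return ReconcileResult(True, candidate, None)

    # Unhealthy: put data back (stateful only), redeploy last-good, alert.
    if stateful:
        undeploy()
        restore()
    deploy(last_good)
    return ReconcileResult(
        False, last_good, f"rolled back {candidate} -> {last_good}: unhealthy"
    )
```

Passing the steps in as callables keeps the sequencing testable without a live swarm: a unit test can record the call order and assert that a failed health check always ends with last-good redeployed.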