# STATUS — Phase 2w (warm canonical deployments + `--quick` CI mode)

**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
**Loop state for THIS phase:** STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared).
Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state. Phase 2 is
**PAUSED** (STATUS-2/BACKLOG-2 intact) and resumes after 2w `## DONE`.

## Phase

Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe canonicals at stable domains, known-good snapshots, an opt-in `--quick` fast lane that reattaches the canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversary cold-verified.

## Definition of Done (Phase 2w) — WC1–WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w

- [ ] **WC1** — Live-warm keycloak (SSO dep) at a stable domain, **UNPINNED** (fetch latest + chaos deploy, like traefik; keep secret-generate-only-if-missing + health-wait); dependents create+delete per-run namespaced realms; concurrent dependents don't collide; leftover realms reaped.
- [ ] **WC1.1** — Health-gated deploy-with-rollback in warm/infra reconcilers (traefik+keycloak): record last-good → deploy latest → health-check → healthy commits last-good:=latest; unhealthy rolls back + alerts. Stateful (keycloak): snapshot data volume before upgrade, restore on rollback (reuse WC3 helper). traefik = version rollback only.
- [ ] **WC1.2** — Pre-deploy safety gate: auto-apply only non-major/no-manual-migration bumps; a MAJOR bump or manual-migration release notes → stay on current + alert with notes (no silent auto-upgrade).
- [ ] **WC2** — Data-warm canonical model: per-recipe canonical at a stable domain, declarative registry tracking recipe→known-good commit; re-warmable from scratch.
- [ ] **WC3** — Known-good snapshots: raw volume copy taken while undeployed under stable path; one last-known-good per app, atomic replace; restore proven to round-trip data.
- [ ] **WC4** — `--quick` mode: reattach canonical → upgrade to PR head → generic+custom asserts; PASS→undeploy keep volume (known-good unchanged); FAIL→restore snapshot then undeploy; never promotes.
- [ ] **WC5** — Canonical advancement via cold only (promote-on-green-cold; seeds on first green cold).
- [ ] **WC6** — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
- [ ] **WC7** — Trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge; results carry mode; clean no-canonical fallback.
- [ ] **WC8** — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
- [ ] **WC9** — Docs + cold verify incl. the rollback proof (deliberately fail a PR under `--quick`, confirm last-known-good restored intact; a `--quick` pass did not move the known-good).

## Milestones (plan §3)

- **W0** — Warm keycloak (WC1). ← IN FLIGHT
- **W1** — Canonical registry + snapshot/restore (WC2, WC3).
- **W2** — `--quick` mode (WC4, WC7).
- **W3** — Cold-advances-canonical + nightly sweep (WC5, WC6).
- **W4** — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.

## In flight

**W0 — live-warm keycloak (WC1).** Done so far (commits up to 88c1114):

- W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit pass).
- W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).
- **WC1 core mechanism PROVEN** deploy-free on the live warm keycloak: realm create → password-grant JWT → discovery issuer → delete(idempotent) → reap(keeps live hex / deletes orphan). All PASS.
- W0.3 declarative reconciler `nix/modules/warm-keycloak.nix` up; `nixos-rebuild switch` → warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned + skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)
- **W0.5 WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`) DONE (4cc1e15). +5 unit tests (48 unit pass). **LIVE round-trip PROVEN on warm keycloak**: marker realm → undeploy → snapshot (mariadb+providers) → deploy → delete marker (mutate DB) → undeploy → restore → deploy → marker realm BACK; keycloak healthy. Snapshots under `/var/lib/ci-warm//`, atomic, one last-good.
- **W0.6 reconciler rewrite** DONE (a044abb). `runner/warm_reconcile.py` (python, packaged into the nix store, replaces the bash reconcile): UNPIN keycloak (deploy latest version TAG; recipe fetched at runtime → D8 closure byte-identical); WC1.2 pre-deploy safety gate (major recipe/app bump OR releaseNotes manual-migration → hold + alert, no churn); WC1.1 health-gated upgrade-with-rollback scaffold (record last-good → keycloak undeploy→snapshot→deploy latest → health-gate → commit-or-restore+redeploy-prior+alert). Alerts = `/var/lib/ci-warm/alerts/*.json`. +8 unit tests (56 unit pass). PROVEN live: `nixos-rebuild switch` → warm-keycloak.service runs the python reconciler → noop-healthy (system 0-failed, 200); **WC1.2 holds proven** (MAJOR → held-major, keycloak untouched; minor+manual-migration notes → held-manual-migration, alert carries notes).
- **W0.9 WC1.1 live proofs** DONE (32f0071). PROVEN on warm keycloak (annotated fake tags + CCCI_SKIP_FETCH): (a) healthy upgrade 10.7.1→10.7.9 — snapshot+deploy+health-pass, last_good committed, marker preserved; (b) **marquee rollback** — broken latest 10.7.10 → deploy fails → rollback to 10.7.9, HEALTHY, marker realm INTACT (data preserved), last_good NOT advanced, rollback alert written (attempted=10.7.10,last_good=10.7.9,recovered=True); recovered to canonical 10.7.1+26.6.2.
Fixed 4 issues live (deploy-fail→rollback, warmsnap last_good subdir, wait_undeployed swarm-settle, abra-stdout capture). 57 unit pass. **Reconciler-side WC1/WC1.1/WC1.2 proven.**

**Adversary reproduce (W0.9):** on cc-ci, with the keycloak recipe clone, create annotated fake tags (peel `^{}`, set git identity) `10.7.9+26.6.2`(=good commit) and `10.7.10+26.6.2`(broken KC_HOSTNAME), then `CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak` twice; observe `upgraded:` then `rolled-back:`, marker realm survives, `/var/lib/ci-warm/keycloak/last_good` unchanged at the prior version, a `*rollback*.json` alert under `/var/lib/ci-warm/alerts/`.

**Next (remaining for WC1 gate):**

1. **W0.7** — fix the lasuite-docs in-place chaos-redeploy nginx-upstream race (`host not found in upstream ...backend:8000`) OR pick a more-robust SSO dependent for the headline proof.
2. **W0.8** — headline WC1 e2e: dependent SSO custom test green vs warm keycloak + concurrent distinct realms (no collision) + reaping. → claim WC1/WC1.1/WC1.2.
3. **Builder-loop alert relay** (deferred wiring) — on each wake, scan `/var/lib/ci-warm/alerts/*.json`, PushNotification + record + archive to `alerts/seen/`; wire when nightly WC6 lands (first real alert).

**Build finding (RESOLVED):** the W0.4 lasuite-docs `setup_custom_tests` redeploy failure (nginx web `host not found in upstream ...backend:8000`) was **transient resource contention** from the since-killed stale Phase-2 run (disk was also tight). On the clean system it converges fine — the headline e2e is green (below). No recipe/harness change needed.
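The WC1.2 pre-deploy decision described under W0.6 (hold on a MAJOR recipe/app bump or on manual-migration release notes, otherwise let the upgrade proceed) can be sketched as a pure function. This is a minimal illustration, not the code in `runner/warm_reconcile.py`: the name `gate_decision`, the `proceed` label, and the case-insensitive match on the phrase "manual migration" are assumptions. The `held-major` / `held-manual-migration` labels and the `recipe+app` tag shape are taken from the results reported above.

```python
import re

# Tags look like "10.7.1+26.6.2": recipe version, "+", app version.
_TAG = re.compile(r"^(\d+)\.(\d+)\.(\d+)\+(\d+)\.(\d+)\.(\d+)$")


def _majors(tag: str) -> tuple[int, int]:
    """Return (recipe_major, app_major) for an 'A.B.C+X.Y.Z' tag."""
    m = _TAG.match(tag)
    if m is None:
        raise ValueError(f"unrecognised version tag: {tag!r}")
    return int(m.group(1)), int(m.group(4))


def gate_decision(current: str, latest: str, release_notes: str = "") -> str:
    """Hypothetical helper: decide whether an auto-upgrade may proceed."""
    cur_recipe, cur_app = _majors(current)
    new_recipe, new_app = _majors(latest)
    # A major bump on either side of the "+" holds the upgrade.
    if new_recipe > cur_recipe or new_app > cur_app:
        return "held-major"
    # Assumption: notes are flagged by the literal phrase "manual migration".
    if "manual migration" in release_notes.lower():
        return "held-manual-migration"
    return "proceed"
```

With the tags used in the live proofs, `gate_decision("10.7.1+26.6.2", "11.0.0+27.0.0")` yields `held-major`, while a minor bump whose notes mention a manual migration yields `held-manual-migration`. Keeping the decision free of side effects is one way to make it unit-testable without touching the live keycloak.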
## Gate

### Gate: WC1 + WC1.1 + WC1.2 — CLAIMED, awaiting Adversary (@2026-05-29, HEAD = see `git log -1`)

**WHAT.** The live-warm keycloak layer (W0): a persistent **unpinned** keycloak at the stable domain `warm-keycloak.ci.commoninternet.net`, declaratively reconciled, that SSO-dependent runs use via a **per-run namespaced realm** (created + deleted) instead of co-deploying; concurrent dependents get distinct realms; orphan realms are reaped (WC1). The reconciler health-gates auto-upgrades with snapshot-backed rollback (WC1.1) behind a pre-deploy safety gate for major/manual-migration bumps (WC1.2).

**WHERE (code).** `runner/warm_reconcile.py` (reconcile logic), `runner/harness/warm.py` (stable domain, per-run realm naming, reaping), `runner/harness/sso.py` (realm lifecycle), `runner/harness/warmsnap.py` (snapshot/restore), `runner/run_recipe_ci.py` (warm/cold dep split), `nix/modules/warm-keycloak.nix` (systemd reconcile unit). Warm state on cc-ci under `/var/lib/ci-warm/`.

**HOW + EXPECTED (cold, from your own clone on cc-ci — tar-sync runner+tests to your /root/):**

1. **Declarative + unpinned + healthy:** `grep -n kcVersion nix/modules/warm-keycloak.nix` → *no match* (pin removed; the unit runs `runner/warm_reconcile.py keycloak`). `ssh cc-ci 'systemctl is-active warm-keycloak.service'` → `active`; `systemctl is-system-running` → `running`. Health: `curl -sk --resolve warm-keycloak.ci.commoninternet.net:443:127.0.0.1 https://warm-keycloak.ci.commoninternet.net/realms/master -o /dev/null -w '%{http_code}'` → `200`. D8: a `nixos-rebuild build` closure hash is unaffected by which keycloak version is live (recipe fetched at runtime).
2. **Units:** `cc-ci-run -m pytest tests/unit -q` → **57 passed** (incl. test_warm_realm, test_warmsnap, test_warm_reconcile).
3. **WC1 headline e2e:** `RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py` → `install: pass`, `custom: pass`, **`deploy-count = 1 (expect 1)`** (keycloak NOT co-deployed), log shows `dep: using live-warm keycloak @ warm-keycloak...` and `dep: deleted per-run realm lasuite-docs- on warm keycloak`. The 3 custom SSO tests pass (test_health_check, test_oidc_login_via_keycloak, test_oidc_password_grant_against_dep_keycloak). After the run, warm keycloak realms = `['master']` only (no leftover); no `lasu*` docker stack.
4. **WC1 concurrency + reaping (deploy-free):** `realm_for("lasuite-docs","lasu-aaa111...")` = `lasuite-docs-aaa111` and `...bbb222` → distinct (two concurrent same-recipe runs never collide); create realms aaa111/bbb222/ccc333 on the warm kc, each `oidc_password_grant` returns a JWT; `sso.reap_orphaned_realms(D, live_hexes={"aaa111"})` deletes exactly bbb222+ccc333 and KEEPS aaa111. (Builder ran this live: PASS.)
5. **WC1.1 health-gated rollback (live):** with `CCCI_SKIP_FETCH=1` stage two **annotated** fake tags on `~/.abra/recipes/keycloak` — `10.7.9+26.6.2` at the good commit (`git tag -a -m x 10.7.9+26.6.2 10.7.1+26.6.2^{}`) and `10.7.10+26.6.2` at a commit whose compose.yml has a broken `KC_HOSTNAME=:::bad-host:::`. Create a marker realm, set last_good, then run `CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak` twice → first `RECONCILE RESULT: upgraded:...->10.7.9` (snapshot taken, last_good=10.7.9, marker preserved); second `rolled-back:10.7.10->10.7.9` — keycloak HEALTHY on 10.7.9, **marker realm INTACT** (data preserved), `/var/lib/ci-warm/keycloak/last_good` still `10.7.9` (NOT advanced), a `*-rollback.json` alert under `/var/lib/ci-warm/alerts/` with `attempted=10.7.10 last_good=10.7.9 recovered=true`. (Builder ran this live: ALL PASS; keycloak restored to canonical 10.7.1+26.6.2.)
6. **WC1.2 pre-deploy safety gate (live):** stage an annotated fake tag with a MAJOR bump (`11.0.0+27.0.0`) → `CCCI_SKIP_FETCH=1 ... warm_reconcile.py keycloak` → `RECONCILE RESULT: held-major:...`, a `*-held-major.json` alert written, **keycloak untouched** (TYPE unchanged, 200, no snapshot/deploy churn). Stage a minor tag (`10.7.2+26.6.3`) with `releaseNotes/10.7.2+26.6.3.md` containing "manual migration" → `held-manual-migration`, alert carries the notes. (Builder ran both live: held + untouched.)

**SCOPE (honest).** WC1 and WC1.2 are complete. **WC1.1 is proven for keycloak** — the *stateful* case (snapshot-backed data-integrity rollback), which is the hard part and the Adversary's marquee proof. **traefik's WC1.1** (stateless = version-rollback-only) is **NOT yet migrated** onto the shared health-gated reconciler — it still uses the existing `proxy.nix` chaos-deploy reconciler. That migration is **W0.10** (tracked in BACKLOG-2w), to land before the Phase-2w DONE. If the Adversary wants WC1.1 fully closed (both reconcilers) before PASS, treat this gate as WC1 + WC1.2 + WC1.1(keycloak).

**Alert delivery note (not blocking):** the reconciler WRITES alert sentinels to `/var/lib/ci-warm/alerts/*.json` (proven above). The operator-facing relay (Builder loop scans → PushNotification → archive to `alerts/seen/`) is loop behavior, run each wake when an alert exists; none currently. "Alert fired" for WC1.1/WC1.2 = sentinel written, which is independently checkable.

**Builder will NOT advance past this gate** (to W1/WC2 canonical registry) until REVIEW-2w shows PASS.

## (prior) Gate

(none before this)

## Blocked

(none)

## Notes

- **Disk budget (WC8 watch):** cc-ci `/` was 91% (2.4G free) at phase start; freed orphaned Phase-2 cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
- Stable-domain scheme (proposed, see DECISIONS): `warm-.ci.commoninternet.net`, distinct from cold `-<6hex>`.
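
The `-<6hex>` suffix is also what the orphan-realm reaping in gate step 4 keys on. The selection rule can be illustrated with a small pure function. This is a sketch only: `select_realms_to_reap` is a hypothetical name, and the assumptions that every per-run realm ends in `-` plus six lowercase hex characters, and that any realm without that suffix (such as `master`) is never reaped, are inferred from the examples in this file rather than taken from `runner/harness/sso.py`.

```python
import re

# Assumed naming: "<recipe>-<6hex>", e.g. "lasuite-docs-aaa111".
_PER_RUN = re.compile(r"^.+-([0-9a-f]{6})$")


def select_realms_to_reap(realms, live_hexes):
    """Return per-run realms whose hex suffix is not owned by a live run."""
    orphans = []
    for name in realms:
        m = _PER_RUN.match(name)
        if m is None:
            continue  # not a per-run realm (e.g. "master"): never reap
        if m.group(1) not in live_hexes:
            orphans.append(name)
    return orphans


realms = ["master", "lasuite-docs-aaa111", "lasuite-docs-bbb222", "lasuite-docs-ccc333"]
# Only aaa111 belongs to a live run, so the other two per-run realms are orphans.
print(select_realms_to_reap(realms, live_hexes={"aaa111"}))
# prints ['lasuite-docs-bbb222', 'lasuite-docs-ccc333']
```

Separating the selection from the delete calls mirrors the `realms_to_reap` / `reap` split listed under W0.1 and lets the rule be checked deploy-free.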