# STATUS — Phase 2w (warm canonical deployments + `--quick` CI mode)
**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
**Loop state for THIS phase:** STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared).
Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state.
Phase 2 is **PAUSED** (STATUS-2/BACKLOG-2 intact) and resumes after 2w `## DONE`.
## Phase
Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe
canonicals at stable domains, known-good snapshots, an opt-in `--quick` fast lane that reattaches the
canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a
nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversary cold-verified.
## Definition of Done (Phase 2w) — WC1–WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w
- [x] **WC1** — Live-warm UNPINNED keycloak; per-run namespaced realms (create+delete); concurrent
distinct realms; orphan realms reaped. **Adversary PASS @2026-05-29** (REVIEW-2w, gate 985686f).
- [x] **WC1.1** — Health-gated deploy-with-rollback. **keycloak (stateful) — Adversary PASS
@2026-05-29** (marquee). **traefik (stateless, version-rollback-only)** — reconciler MIGRATED
(W0.10a): proxy.nix now drives `warm_reconcile.py traefik` (shared health-gated path, no
snapshot; cert/file-provider setup preserved); no-op converge proven live (traefik 200,
keycloak-through-traefik 200, 0 failed). **Adversary PASS @2026-05-29** (REVIEW-2w e3b08a9):
destructive rollback proven (lint-breaking tag → rollback to 5.1.1, NO TLS outage). **WC1.1
FULLY CLOSED (keycloak stateful + traefik stateless).**
- [x] **WC1.2** — Pre-deploy safety gate (major / manual-migration → hold + alert with notes, no
churn, short-circuits before WC1.1). **Adversary PASS @2026-05-29**.
- [x] **WC2** — Data-warm canonical model: per-recipe canonical at stable domain `warm-<recipe>`,
declarative registry (canonical.json + recipe_meta.WARM_CANONICAL) tracking recipe→known-good
version/commit; data-warm (undeployed-when-idle, volume retained); re-warmable via seed_canonical.
Proven on custom-html (W1.2). **Adversary PASS @2026-05-29** (REVIEW-2w 0246296, gate 4ce80f8).
- [x] **WC3** — Known-good snapshots: raw per-volume tar taken while undeployed under
`/var/lib/ci-warm/<recipe>/snapshot/`; one last-good per app, atomic subdir swap; restore
round-trips data (W0.5 + W1.2 + Adversary's own mutate→restore). **Adversary PASS @2026-05-29**.
- [x] **WC4** — `--quick` mode (`run_quick` in run_recipe_ci.py): reattach canonical → upgrade to PR
head (chaos) → generic UPGRADE+serving+overlay+custom; PASS→undeploy-keep-volume (known-good
UNCHANGED, never promote); FAIL→restore last-known-good snapshot then undeploy. Proven live on
custom-html (PASS + FAIL). **Adversary PASS @2026-05-29** (REVIEW-2w 31f0e42, gate 3ff2bf6).
- [x] **WC5** — Canonical advancement via cold only (promote-on-green-cold). `should_promote_canonical`
(enrolled+green+cold+latest) + `promote_canonical` (re-seed canonical at green-verified latest →
snapshot+registry; never lose known-good). Proven live: green cold custom-html run advanced the
canonical 1.10.0+1.28.0 → 1.11.0+1.29.0 (snapshot refreshed, idle, per-run app torn down).
`--quick` never promotes (W2). **Adversary PASS @2026-05-29** (REVIEW-2w 5bbc47c, gate 125453d).
- [ ] **WC6** — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
- [x] **WC7** — Trigger/authority/labeling: default `!testme`=cold (unchanged); `--quick` opt-in via
bridge `parse_trigger` (`!testme --quick` → CCCI_QUICK=1 Drone param, deployed+live-verified);
never gates merge; runs carry mode=quick (lower-confidence label); clean no-canonical fallback
to cold. **Adversary PASS @2026-05-29** (REVIEW-2w 31f0e42, gate 3ff2bf6).
- [ ] **WC8** — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via
per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
- [ ] **WC9** — Docs + cold verify incl. the rollback proof (deliberately fail a PR under `--quick`,
confirm last-known-good restored intact; a `--quick` pass did not move the known-good).
## Milestones (plan §3)
- **W0** — Warm keycloak (WC1/WC1.1-keycloak/WC1.2). ✅ Adversary PASS @2026-05-29.
- **W1** — Canonical registry + snapshot/restore (WC2, WC3). ✅ Adversary PASS @2026-05-29.
- **W2** — `--quick` mode (WC4, WC7). ✅ Adversary PASS @2026-05-29.
- **W3** — Cold-advances-canonical (WC5 ✅ PASS) + nightly sweep (WC6 ← building).
- **W4** — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.
## In flight
**W0 — live-warm keycloak (WC1).** Done so far (commits up to 88c1114):
- W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit pass).
- W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).
- **WC1 core mechanism PROVEN** deploy-free on the live warm keycloak: realm create → password-grant
JWT → discovery issuer → delete(idempotent) → reap(keeps live hex / deletes orphan). All PASS.
- W0.3 declarative reconciler `nix/modules/warm-keycloak.nix` up; `nixos-rebuild switch` →
warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned +
skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)
- **W0.5 WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`) DONE (4cc1e15). +5 unit tests
(48 unit pass). **LIVE round-trip PROVEN on warm keycloak**: marker realm → undeploy → snapshot
(mariadb+providers) → deploy → delete marker (mutate DB) → undeploy → restore → deploy → marker
realm BACK; keycloak healthy. Snapshots under `/var/lib/ci-warm/<recipe>/`, atomic, one last-good.
- **W0.6 reconciler rewrite** DONE (a044abb). `runner/warm_reconcile.py` (python, packaged into the
nix store, replaces the bash reconcile): UNPIN keycloak (deploy latest version TAG; recipe fetched
at runtime → D8 closure byte-identical); WC1.2 pre-deploy safety gate (major recipe/app bump OR
releaseNotes manual-migration → hold + alert, no churn); WC1.1 health-gated upgrade-with-rollback
scaffold (record last-good → keycloak undeploy→snapshot→deploy latest → health-gate →
commit-or-restore+redeploy-prior+alert). Alerts = `/var/lib/ci-warm/alerts/*.json`. +8 unit tests
(56 unit pass). PROVEN live: `nixos-rebuild switch` → warm-keycloak.service runs the python
reconciler → noop-healthy (system 0-failed, 200); **WC1.2 holds proven** (MAJOR → held-major,
keycloak untouched; minor+manual-migration notes → held-manual-migration, alert carries notes).
- **W0.9 WC1.1 live proofs** DONE (32f0071). PROVEN on warm keycloak (annotated fake tags +
CCCI_SKIP_FETCH): (a) healthy upgrade 10.7.1→10.7.9 — snapshot+deploy+health-pass, last_good
committed, marker preserved; (b) **marquee rollback** — broken latest 10.7.10 → deploy fails →
rollback to 10.7.9, HEALTHY, marker realm INTACT (data preserved), last_good NOT advanced, rollback
alert written (attempted=10.7.10,last_good=10.7.9,recovered=True); recovered to canonical
10.7.1+26.6.2. Fixed 4 issues live (deploy-fail→rollback, warmsnap last_good subdir, wait_undeployed
swarm-settle, abra-stdout capture). 57 unit pass. **Reconciler-side WC1/WC1.1/WC1.2 proven.**
**Adversary reproduce (W0.9):** on cc-ci, with the keycloak recipe clone, create annotated fake
tags (peel `^{}`, set git identity) `10.7.9+26.6.2`(=good commit) and `10.7.10+26.6.2`(broken
KC_HOSTNAME), then `CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak` twice; observe
`upgraded:` then `rolled-back:`, marker realm survives, `/var/lib/ci-warm/keycloak/last_good`
unchanged at the prior version, a `*rollback*.json` alert under `/var/lib/ci-warm/alerts/`.
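
The WC1.1 flow described above (record last-good, snapshot, deploy latest, health-gate, then commit or restore and redeploy the prior version with an alert) can be sketched as follows. This is a minimal illustration, not the code in `runner/warm_reconcile.py`: the `WarmApp` class, its injected hooks, and the exact result strings are assumptions, and the undeploy step before the snapshot is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class WarmApp:
    """Hypothetical stand-in for a warm app; every side effect is an injected hook."""
    deploy: Callable[[str], bool]   # returns False when the deploy itself fails
    healthy: Callable[[], bool]     # health gate, e.g. an HTTP 200 check
    snapshot: Callable[[], None]    # taken before the app is touched
    restore: Callable[[], None]     # puts the snapshotted data back
    last_good: str
    stateful: bool = True           # stateless apps skip snapshot/restore
    alerts: List[dict] = field(default_factory=list)


def reconcile(app: WarmApp, latest: str) -> str:
    """Upgrade to `latest`; roll back to last-good if deploy or health fails."""
    if latest == app.last_good and app.healthy():
        return "noop-healthy"
    prior = app.last_good
    if app.stateful:
        app.snapshot()
    if app.deploy(latest) and app.healthy():
        app.last_good = latest      # commit only after the health gate passes
        return f"upgraded:{prior}->{latest}"
    # Failure path: data first (if stateful), then the prior version, then alert.
    if app.stateful:
        app.restore()
    app.deploy(prior)
    app.alerts.append(
        {"attempted": latest, "last_good": prior, "recovered": app.healthy()}
    )
    return f"rolled-back:{latest}->{prior}"
```

Note that `last_good` is only written on the success path, so a failed upgrade can never advance it.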
**W0 COMPLETE — Adversary PASS @2026-05-29.**
**W0 ✅ + W1 ✅ + W2 ✅ Adversary PASS. Now in W3 (cold-advances-canonical WC5 + nightly sweep WC6).**
**W3 plan:**
- **WC5 — promote-on-green-cold.** A GREEN full-cold run on the LATEST (not a `--quick` run) of an
enrolled (WARM_CANONICAL) recipe re-snapshots + re-tags the canonical known-good instead of
deleting the volume at teardown: at the end of a green cold run, undeploy → `canonical.seed_canonical`
(snapshot while undeployed + write registry version=the green commit/version) → keep the volume as
the new canonical. The FIRST green cold run on latest SEEDS the canonical. ONLY cold advances it
(`--quick` never promotes — proven W2). Wire into run_recipe_ci.py cold teardown, gated on:
recipe is WARM_CANONICAL + run was green + deployed LATEST (not a pinned/prev base). Add unit
tests + a live proof (green cold custom-html run → canonical re-seeded at the new known-good).
- **WC6 — nightly full-cold sweep.** Declarative scheduler (systemd timer on cc-ci): nightly does
`nixos-rebuild switch` FIRST (rolls warm/infra to latest, health-gated per WC1.1) THEN a full-cold
sweep across enrolled recipes (serial, MAX_TESTS-bounded), refreshing each canonical's known-good
(WC5) + serving as the daily authoritative regression. MUST NOT run while a test is in flight.
- **Quiet-window opportunity (now): W0.10a traefik WC1.1** — Adversary idle post-W2 PASS, so this is
the window to migrate traefik onto the health-gated reconciler (tracked-before-DONE; below).
**Tracked before Phase-2w DONE:**
- **W0.10a — traefik WC1.1** (Adversary requires a cold proof): migrate `proxy.nix` onto the shared
health-gated reconciler (stateless = version-rollback-only; preserve cert-secret/WILDCARDS_ENABLED/
COMPOSE_FILE setup). CAREFUL — traefik serves all TLS; deploy/test only in a quiet window.
- **W0.10b — Builder-loop alert relay**: each wake, scan `/var/lib/ci-warm/alerts/*.json` →
PushNotification → archive to `alerts/seen/`.
**Build finding (RESOLVED):** the W0.4 lasuite-docs `setup_custom_tests` redeploy failure (nginx web
`host not found in upstream ...backend:8000`) was **transient resource contention** from the
since-killed stale Phase-2 run (disk was also tight). On the clean system it converges fine — the
headline e2e is green (below). No recipe/harness change needed.
## Gate
### Gate: WC5 — ✅ Adversary PASS @2026-05-29 (REVIEW-2w 5bbc47c, gate 125453d)
Anti-poison gate predicate + live advancement 1.10.0→1.11.0 (cold-only) cold-verified. Builder may
proceed to WC6. (claim detail retained below.)
### (claimed, now PASS) Gate: WC5 — CLAIMED detail
**WHAT.** Promote-on-green-cold: a GREEN full-cold run on LATEST (no PR head) of an enrolled
(WARM_CANONICAL) recipe advances/seeds the canonical known-good; `--quick` never promotes; only cold
advances. **WHERE:** `runner/run_recipe_ci.py` (`should_promote_canonical` gate + `promote_canonical`
+ the post-green-cold hook in main()), `runner/harness/canonical.py` (seed_canonical).
**HOW + EXPECTED (cold):**
1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **70 passed** (incl. test_promote: the gate fires
only for enrolled+green+cold+latest; not on red / quick / PR-head / unenrolled).
2. **Live advancement (custom-html canonical):** set its registry version to an OLDER value
(`canonical.write_registry("custom-html", version="1.10.0+1.28.0", …)`), then a full COLD run
`RECIPE=custom-html cc-ci-run runner/run_recipe_ci.py` (no REF = latest) → install/upgrade/backup/
restore/custom all pass, deploy-count=1, then `WC5 promote-on-green-cold: (re)seed canonical
custom-html @ 1.11.0+1.29.0` → afterwards `canonical.json` version **ADVANCED to 1.11.0+1.29.0**
(commit=head 8a02606…), snapshot refreshed (`warmsnap.read_meta` version=1.11.0+1.29.0), canonical
idle + volume retained, NO `cust-*` per-run service left (cold teardown sacred). Builder ran this
live: **advanced 1.10.0→1.11.0**. (A PR `!testme` REF=PR-head does NOT promote; `--quick` never
promotes — both gate-checked.)
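
The promotion gate above (enrolled + green + cold + latest) reduces to a single predicate. The sketch below is an assumption about its shape, not the real `should_promote_canonical` signature: the `RunResult` fields are hypothetical, and "latest" is modelled as "no REF was given", following the "(no REF = latest)" note in step 2.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one CI run."""
    enrolled: bool        # recipe sets WARM_CANONICAL in its recipe_meta
    green: bool           # every stage passed
    mode: str             # "cold" or "quick"
    ref: Optional[str]    # None means latest; otherwise a PR head or pinned ref


def should_promote_canonical(run: RunResult) -> bool:
    """Only a green, full-cold run on latest of an enrolled recipe may promote."""
    return run.enrolled and run.green and run.mode == "cold" and run.ref is None
```

Keeping the gate a pure function of the run summary makes each negative case (red, quick, PR head, unenrolled) a one-line unit test.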
---
### Gate: W0.10a traefik WC1.1 — ✅ Adversary PASS @2026-05-29 (REVIEW-2w e3b08a9, gate e678d2e)
Migration + no-op converge + destructive rollback (lint-breaking tag → rollback to last-good, NO TLS
outage — broken deploy rejected at lint before touching the running proxy) all cold-verified.
**WC1.1 now FULLY closed (keycloak + traefik).** (claim detail retained below.)
### (claimed, now PASS) Gate: W0.10a traefik WC1.1 — CLAIMED detail
**WHAT.** traefik migrated onto the shared health-gated reconciler (WC1.1, stateless =
version-rollback-only, NO snapshot): record last-good → deploy latest tag → health-gate (routed host
ci.commoninternet.net = 200) → healthy commit / unhealthy roll back to last-good + alert. Closes the
W0.10a tracked-open item from the W0 gate. traefik's wildcard-cert/file-provider config preserved.
**WHERE.** `runner/warm_reconcile.py` (SPECS["traefik"] stateful=False + `_traefik_setup` + health_domain;
reconcile() per-app setup hook; the stateless path skips snapshot/restore — version rollback only),
`nix/modules/proxy.nix` (deploy-proxy.service now execs `python3 …/warm_reconcile.py traefik`).
**HOW + EXPECTED (cold):**
1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **65 passed** (incl. test_warm_reconcile traefik
spec: stateful=False, callable setup, health_domain=ci.commoninternet.net; keycloak unchanged).
2. **No-op converge (delivered, proven live):** `systemctl is-active deploy-proxy.service` → active;
`journalctl -u deploy-proxy.service` → `[traefik] already on latest 5.1.1+v3.6.15 and healthy —
no-op`; traefik serving (ci.commoninternet.net=200) + keycloak-through-traefik=200 + system
`running` (0 failed). The migration was zero-disruption (traefik was already at the latest tag; I
pre-seeded TYPE+last_good to 5.1.1+v3.6.15 so the reconcile is a clean no-op).
3. **Destructive rollback (the Adversary's required cold proof):** stage a fake newer traefik tag with
a broken config → `CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py traefik` → broken deploy
fails health → reconciler rolls back to last-good 5.1.1+v3.6.15 (version-only, no snapshot — traefik
is stateless) → traefik healthy again + a `*-rollback.json` alert. NOTE: a destructive traefik test
briefly drops TLS for ALL routes during the broken-deploy window until rollback — run it knowing
that + with manual recovery ready (`abra app deploy traefik.ci.commoninternet.net 5.1.1+v3.6.15
-o -n -f`). The rollback logic is the SAME proven keycloak pattern, stateless variant (no snapshot).
Per operator guidance, I delivered the code + the safe no-op converge this iteration and left the
destructive rollback as the Adversary's cold proof (a live destructive traefik test risks all TLS).
---
### Gate: WC4 + WC7 — ✅ Adversary PASS @2026-05-29 (REVIEW-2w 31f0e42, gate 3ff2bf6)
Cold-verified from the Adversary's own clone: 64 units; WC7 adversarial trigger battery (all negatives
rejected, live bridge); WC4 never-promote (snapshot byte-identical, registry unchanged); WC4
FAIL→rollback restored EXACT known-good (marker back, 200, broken image gone, exit 1); no-canonical
fallback to a cold per-run domain. Builder may proceed to W3. (claim detail retained below.)
### (claimed, now PASS) Gate: WC4 + WC7 — CLAIMED detail
**WHAT.** The `--quick` opt-in fast lane (W2): reattach the data-warm canonical → upgrade in place to
the PR head → assert (generic upgrade reconverge+moved+serving + overlay + custom); PASS →
undeploy-keep-volume with the **known-good UNCHANGED (never promote)**; FAIL → restore the
last-known-good snapshot + undeploy (roll back, data safe). Opt-in via `!testme --quick`, mode-tagged
lower-confidence, never gates merge; clean no-canonical fallback to COLD.
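
The control flow of the fast lane described above can be sketched as below. This is an illustrative outline only, not the real `run_quick` in `runner/run_recipe_ci.py`: the parameters, the injected callables, and the log strings are all assumptions. The point it shows is that neither branch writes the known-good.

```python
from typing import Callable, List


def run_quick(
    has_canonical: bool,
    upgrade_and_assert: Callable[[], bool],  # upgrade to PR head, run assertions
    run_cold: Callable[[], int],             # full cold flow, returns an exit code
    log: List[str],
) -> int:
    """Return 0 on PASS, 1 on FAIL; never promotes the known-good."""
    if not has_canonical:
        log.append("falling back to COLD")
        return run_cold()
    log.append("reattach canonical")
    if upgrade_and_assert():
        # PASS: keep the volume, leave registry and snapshot untouched.
        log.append("undeploy (volume retained)")
        return 0
    # FAIL: put the last-known-good data back, then go idle.
    log.append("restore last-known-good snapshot")
    log.append("undeploy (volume retained)")
    return 1
```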
**WHERE (code).** `runner/run_recipe_ci.py` (`run_quick`, dispatched from `main()` on CCCI_QUICK=1 /
MODE=quick; `_wait_undeployed`; no-canonical fallback), `runner/harness/canonical.py`
(deploy_canonical resets TYPE; undeploy_keep_volume), `runner/harness/warmsnap.py` (restore),
`bridge/bridge.py` (`parse_trigger` + CCCI_QUICK param), `.drone.yml` (quick echo). 64 unit pass.
**HOW + EXPECTED (cold, from your own clone on cc-ci):**
1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **64 passed** (incl. test_bridge_trigger:
`!testme`→cold, `!testme --quick`→quick, `!testmexyz`→reject).
2. **WC7 trigger (live in the running bridge):** `cid=$(docker ps -q -f name=ccci-bridge);
docker exec $cid python3 -c 'import sys;sys.path.insert(0,"/app");import bridge;
print(bridge.parse_trigger("!testme --quick"), bridge.parse_trigger("!testmexyz"))'` →
`(True, True) (False, False)`. `trigger_build` adds `CCCI_QUICK=1` (auto-exposed to run_recipe_ci);
a `!testme --quick` PR comment is labelled lower-confidence; plain `!testme` stays full cold.
3. **WC4 `--quick` flow (custom-html canonical, currently idle at 1.11.0+1.29.0):**
- **PASS run:** `RECIPE=custom-html CCCI_QUICK=1 REF=87a62a5 cc-ci-run runner/run_recipe_ci.py`
(REF=87a62a5 is the 1.10.0+1.28.0 commit — a different healthy head) → exit 0; SUMMARY shows
`mode=quick`, `upgrade: pass`, `custom: pass`, "canonical undeployed, volume retained, known-good
UNCHANGED"; afterwards `canonical.json` version STILL 1.11.0+1.29.0 (NOT promoted), canonical
idle, content volume retained, known-good marker intact.
- **FAIL run (rollback):** stage a broken custom-html commit (`image: nginx:99.99.99-doesnotexist`),
`RECIPE=custom-html CCCI_QUICK=1 CCCI_SKIP_FETCH=1 REF=<broken sha> cc-ci-run
runner/run_recipe_ci.py` → exit 1; SUMMARY shows "rolling back … restored known-good data;
canonical idle (NOT promoted)"; afterwards known-good version UNCHANGED, canonical idle, data
(marker) intact. Builder ran both live: **ALL PASS** (canonical left clean idle@1.11.0+1.29.0).
- **no-canonical fallback:** MODE=quick for a recipe with no canonical → logs "falling back to COLD"
and runs the full cold flow (so the PR is still tested; default `!testme` unaffected).
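
A trigger parser with the behaviour shown in step 2 could look like the sketch below. It is not the code in `bridge/bridge.py`: the regex and the `(triggered, quick)` meaning of the returned pair are assumptions inferred from the `(True, True) (False, False)` output above, and the `(True, False)` result for a plain `!testme` is inferred from "`!testme`→cold" rather than shown in the source.

```python
import re
from typing import Tuple

# Whole-comment match: "!testme" optionally followed by "--quick" and nothing else.
_TRIGGER = re.compile(r"^\s*!testme(?:\s+(--quick))?\s*$")


def parse_trigger(comment: str) -> Tuple[bool, bool]:
    """Return (triggered, quick) for a PR comment."""
    match = _TRIGGER.match(comment)
    if match is None:
        return (False, False)      # e.g. "!testmexyz" is rejected outright
    return (True, match.group(1) is not None)
```

Anchoring the pattern at both ends is what rejects near-misses such as `!testmexyz`; a prefix match would accept them.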
**Builder will NOT advance into W3 (cold-advances-canonical / nightly) past this gate** until
REVIEW-2w shows PASS — but will do the tracked W0.10a (traefik) in a quiet window meanwhile.
---
### Gate: WC2 + WC3 — ✅ Adversary PASS @2026-05-29 (REVIEW-2w 0246296, gate 4ce80f8)
Cold-verified from the Adversary's own clone (its own data-warm round-trip + restore round-trip).
Builder may proceed to W2 (`--quick`). custom-html canonical left clean (idle, volume retained,
known-good content, snapshot intact, v1.11.0+1.29.0). (claim detail retained below.)
### (claimed, now PASS) Gate: WC2 + WC3 — CLAIMED detail
**WHAT.** The data-warm canonical model (W1): a declarative per-recipe canonical at the stable domain
`warm-<recipe>.ci.commoninternet.net`, kept **data-warm** (undeployed-when-idle, data volume
retained), tracked by a registry; **known-good snapshots** (raw per-volume tar while undeployed, one
last-good per app, restore round-trips data).
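
The "raw per-volume tar, one last-good, swapped in as a subdirectory" idea can be sketched as below. This is a hedged illustration, not `runner/harness/warmsnap.py`: the function names, the `volume.tar` file name, and the staging/old directory names are hypothetical, and the two-rename swap is only near-atomic (a crash between the renames leaves the old copy on disk, not a half-written one).

```python
import os
import shutil
import tarfile
from pathlib import Path


def snapshot_volume(volume_dir: Path, snap_root: Path, name: str = "last_good") -> Path:
    """Tar `volume_dir` into a staging dir, then swap it in as the single last-good."""
    snap_root.mkdir(parents=True, exist_ok=True)
    staging = snap_root / f".{name}.new"
    old = snap_root / f".{name}.old"
    for leftover in (staging, old):          # clean up after an interrupted run
        if leftover.exists():
            shutil.rmtree(leftover)
    staging.mkdir()
    with tarfile.open(staging / "volume.tar", "w") as tar:
        tar.add(volume_dir, arcname=".")     # paths stored relative to the volume
    final = snap_root / name
    if final.exists():
        os.rename(final, old)
    os.rename(staging, final)                # the complete snapshot appears at once
    shutil.rmtree(old, ignore_errors=True)
    return final


def restore_volume(snap: Path, volume_dir: Path) -> None:
    """Replace the volume contents with the snapshot (the app must be undeployed)."""
    shutil.rmtree(volume_dir)
    volume_dir.mkdir()
    with tarfile.open(snap / "volume.tar") as tar:
        tar.extractall(volume_dir)
```

Building the tar in a staging directory means a failed or interrupted snapshot never overwrites the previous last-good.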
**WHERE (code).** `runner/harness/canonical.py` (registry + data-warm lifecycle), `runner/harness/
warmsnap.py` (snapshot/restore), enrollment `tests/custom-html/recipe_meta.py: WARM_CANONICAL=True`.
State on cc-ci under `/var/lib/ci-warm/<recipe>/` (`canonical.json`, `snapshot/`, retained volume).
**HOW + EXPECTED (cold, from your own clone on cc-ci):**
1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **61 passed** (incl. test_canonical, test_warmsnap).
2. **WC2/WC3 data-warm round-trip** (custom-html canonical exists idle now): reproduce with a driver
that uses `runner/harness/canonical.py` — deploy `warm-custom-html.ci.commoninternet.net` @
`1.11.0+1.29.0`, write a marker file into `/usr/share/nginx/html/`, undeploy, `seed_canonical`
(writes `/var/lib/ci-warm/custom-html/canonical.json` + a `snapshot/` while undeployed); confirm
**app UNDEPLOYED but the `content` volume RETAINED** (`docker volume ls | grep warm-custom-html`);
then `deploy_canonical('custom-html')` → the marker **survives** (data-warm reattach). Builder ran
this live: **ALL PASS** (marker `WC2-DATA-MARKER-7f3a9c` survived; registry version=1.11.0+1.29.0;
snapshot present). Current live state: `cat /var/lib/ci-warm/custom-html/canonical.json` →
status=idle, version=1.11.0+1.29.0; `docker volume ls` shows
`warm-custom-html_ci_commoninternet_net_content` retained with NO custom-html service running.
3. **WC3 restore round-trip** already cold-verified in the W0.9/W0.5 keycloak proof (snapshot →
mutate DB → restore → data back); same `warmsnap` helper.
4. **D8/WC8:** `/var/lib/ci-warm/` is cache, NOT in the nix closure (no module references it as a
source); re-seeded by cold runs, not restored on rebuild.
**Builder will NOT advance into W2 (`--quick`, which consumes the canonical) past this gate** until
REVIEW-2w shows PASS — but will do non-disruptive W0.10 follow-ups (alert relay) meanwhile.
---
### Gate: WC1 + WC1.2 + WC1.1(keycloak) — ✅ Adversary PASS @2026-05-29 (REVIEW-2w 31ac86d, gate 985686f)
All 6 checks cold-verified from the Adversary's own clone. Builder may proceed to W1. **Tracked open
(must close before Phase-2w DONE, not a blocker now): traefik WC1.1 (W0.10)** — stateless
version-rollback not yet on the shared health-gated reconciler; Adversary will require a cold proof.
(claim detail retained below for the record)
**WHAT.** The live-warm keycloak layer (W0): a persistent **unpinned** keycloak at the stable domain
`warm-keycloak.ci.commoninternet.net`, declaratively reconciled, that SSO-dependent runs use via a
**per-run namespaced realm** (created + deleted) instead of co-deploying; concurrent dependents get
distinct realms; orphan realms are reaped (WC1). The reconciler health-gates auto-upgrades with
snapshot-backed rollback (WC1.1) behind a pre-deploy safety gate for major/manual-migration bumps
(WC1.2).
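
The pre-deploy safety gate (WC1.2) can be pictured as a check that runs before any snapshot or deploy. The sketch below is an assumption about its logic, not the reconciler's actual code: it treats tags as `<recipe version>+<app version>`, holds on a major bump of either part, and holds when the release notes mention "manual migration". The function names and the `None` return for "proceed" are hypothetical.

```python
from typing import Optional, Tuple


def _majors(tag: str) -> Tuple[int, int]:
    """Split 'recipe+app' (e.g. '10.7.1+26.6.2') into the two major numbers."""
    recipe, _, app = tag.partition("+")
    # Some app versions carry a leading 'v' (e.g. '5.1.1+v3.6.15').
    return int(recipe.split(".")[0]), int(app.lstrip("v").split(".")[0])


def safety_gate(current: str, candidate: str, release_notes: str = "") -> Optional[str]:
    """Return a hold reason, or None when the upgrade may proceed to WC1.1."""
    cur_recipe, cur_app = _majors(current)
    new_recipe, new_app = _majors(candidate)
    if new_recipe > cur_recipe or new_app > cur_app:
        return "held-major"
    if "manual migration" in release_notes.lower():
        return "held-manual-migration"
    return None
```

Because the gate returns before anything is touched, a held upgrade causes no snapshot or deploy churn.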
**WHERE (code).** `runner/warm_reconcile.py` (reconcile logic), `runner/harness/warm.py` (stable
domain, per-run realm naming, reaping), `runner/harness/sso.py` (realm lifecycle), `runner/harness/
warmsnap.py` (snapshot/restore), `runner/run_recipe_ci.py` (warm/cold dep split), `nix/modules/
warm-keycloak.nix` (systemd reconcile unit). Warm state on cc-ci under `/var/lib/ci-warm/`.
**HOW + EXPECTED (cold, from your own clone on cc-ci — tar-sync runner+tests to your /root/<clone>):**
1. **Declarative + unpinned + healthy:** `grep -n kcVersion nix/modules/warm-keycloak.nix` → *no
match* (pin removed; the unit runs `runner/warm_reconcile.py keycloak`). `ssh cc-ci 'systemctl
is-active warm-keycloak.service'` → `active`; `systemctl is-system-running` → `running`. Health:
`curl -sk --resolve warm-keycloak.ci.commoninternet.net:443:127.0.0.1
https://warm-keycloak.ci.commoninternet.net/realms/master -o /dev/null -w '%{http_code}'` → `200`.
D8: a `nixos-rebuild build` closure hash is unaffected by which keycloak version is live (recipe
fetched at runtime).
2. **Units:** `cc-ci-run -m pytest tests/unit -q` → **57 passed** (incl. test_warm_realm,
test_warmsnap, test_warm_reconcile).
3. **WC1 headline e2e:** `RECIPE=lasuite-docs STAGES=install,custom cc-ci-run
runner/run_recipe_ci.py` → `install: pass`, `custom: pass`, **`deploy-count = 1 (expect 1)`**
(keycloak NOT co-deployed), log shows `dep: using live-warm keycloak @ warm-keycloak...` and
`dep: deleted per-run realm lasuite-docs-<hex> on warm keycloak`. The 3 custom SSO tests pass
(test_health_check, test_oidc_login_via_keycloak, test_oidc_password_grant_against_dep_keycloak).
After the run, warm keycloak realms = `['master']` only (no leftover); no `lasu*` docker stack.
4. **WC1 concurrency + reaping (deploy-free):** `realm_for("lasuite-docs","lasu-aaa111...")` =
`lasuite-docs-aaa111` and `...bbb222` → distinct (two concurrent same-recipe runs never collide);
create realms aaa111/bbb222/ccc333 on the warm kc, each `oidc_password_grant` returns a JWT;
`sso.reap_orphaned_realms(D, live_hexes={"aaa111"})` deletes exactly bbb222+ccc333 and KEEPS
aaa111. (Builder ran this live: PASS.)
5. **WC1.1 health-gated rollback (live):** with `CCCI_SKIP_FETCH=1` stage two **annotated** fake tags
on `~/.abra/recipes/keycloak` — `10.7.9+26.6.2` at the good commit (`git tag -a -m x 10.7.9+26.6.2
10.7.1+26.6.2^{}`) and `10.7.10+26.6.2` at a commit whose compose.yml has a broken
`KC_HOSTNAME=:::bad-host:::`. Create a marker realm, set last_good, then run `CCCI_SKIP_FETCH=1
cc-ci-run runner/warm_reconcile.py keycloak` twice → first `RECONCILE RESULT: upgraded:...->10.7.9`
(snapshot taken, last_good=10.7.9, marker preserved); second `rolled-back:10.7.10->10.7.9` —
keycloak HEALTHY on 10.7.9, **marker realm INTACT** (data preserved), `/var/lib/ci-warm/keycloak/
last_good` still `10.7.9` (NOT advanced), a `*-rollback.json` alert under `/var/lib/ci-warm/alerts/`
with `attempted=10.7.10 last_good=10.7.9 recovered=true`. (Builder ran this live: ALL PASS; keycloak
restored to canonical 10.7.1+26.6.2.)
6. **WC1.2 pre-deploy safety gate (live):** stage an annotated fake tag with a MAJOR bump
(`11.0.0+27.0.0`) → `CCCI_SKIP_FETCH=1 ... warm_reconcile.py keycloak` → `RECONCILE RESULT:
held-major:...`, a `*-held-major.json` alert written, **keycloak untouched** (TYPE unchanged,
200, no snapshot/deploy churn). Stage a minor tag (`10.7.2+26.6.3`) with `releaseNotes/
10.7.2+26.6.3.md` containing "manual migration" → `held-manual-migration`, alert carries the notes.
(Builder ran both live: held + untouched.)
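
The per-run realm naming and orphan reaping checked in step 4 can be sketched as pure functions. These are illustrative stand-ins, not the helpers in `runner/harness/warm.py` or `runner/harness/sso.py`: the signatures are assumptions, the run domain used in the usage note is a made-up example, and the real `reap_orphaned_realms` also performs the deletions against keycloak, which this sketch leaves out.

```python
import re
from typing import Iterable, List, Set

# Cold per-run domains start with '<recipe[:4]>-<6hex>'.
_RUN = re.compile(r"^[a-z0-9]{1,4}-([0-9a-f]{6})(?:\.|$)")
_HEX6 = re.compile(r"[0-9a-f]{6}")


def realm_for(recipe: str, run_domain: str) -> str:
    """Derive a per-run realm name from the recipe and the run's 6-hex id."""
    match = _RUN.match(run_domain)
    if match is None:
        raise ValueError(f"not a per-run domain: {run_domain!r}")
    return f"{recipe}-{match.group(1)}"


def realms_to_reap(realms: Iterable[str], live_hexes: Set[str]) -> List[str]:
    """Per-run realms whose hex id belongs to no live run; 'master' is never reaped."""
    orphans = []
    for realm in realms:
        suffix = realm.rsplit("-", 1)[-1]
        if realm != "master" and _HEX6.fullmatch(suffix) and suffix not in live_hexes:
            orphans.append(realm)
    return orphans
```

For example, `realm_for("lasuite-docs", "lasu-aaa111.ci.example.test")` gives `lasuite-docs-aaa111`, so two concurrent runs of the same recipe get distinct realms as long as their hex ids differ.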
**SCOPE (honest).** WC1 and WC1.2 are complete. **WC1.1 is proven for keycloak** — the *stateful*
case (snapshot-backed data-integrity rollback), which is the hard part and the Adversary's marquee
proof. **traefik's WC1.1** (stateless = version-rollback-only) is **NOT yet migrated** onto the shared
health-gated reconciler — it still uses the existing `proxy.nix` chaos-deploy reconciler. That
migration is **W0.10** (tracked in BACKLOG-2w), to land before the Phase-2w DONE. If the Adversary
wants WC1.1 fully closed (both reconcilers) before PASS, treat this gate as WC1 + WC1.2 + WC1.1(keycloak).
**Alert delivery note (not blocking):** the reconciler WRITES alert sentinels to
`/var/lib/ci-warm/alerts/*.json` (proven above). The operator-facing relay (Builder loop scans →
PushNotification → archive to `alerts/seen/`) is loop behavior, run each wake when an alert exists;
none currently. "Alert fired" for WC1.1/WC1.2 = sentinel written, which is independently checkable.
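
The relay loop described above (scan sentinels, notify, archive to `alerts/seen/`) is small enough to sketch. This is a hedged outline of that loop behaviour, not existing code: `relay_alerts` and the injected `notify` callable are hypothetical, and the notification transport is left to the caller.

```python
import json
import shutil
from pathlib import Path
from typing import Callable, List


def relay_alerts(alerts_dir: Path, notify: Callable[[dict], None]) -> List[str]:
    """Notify once per alert sentinel, then archive it under seen/."""
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = []
    # glob() is not recursive, so already-archived files in seen/ are skipped.
    for path in sorted(alerts_dir.glob("*.json")):
        notify(json.loads(path.read_text()))
        shutil.move(str(path), str(seen / path.name))
        relayed.append(path.name)
    return relayed
```

Archiving only after `notify` returns means an alert whose notification raises stays in place and is retried on the next wake.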
**Builder will NOT advance past this gate** (to W1/WC2 canonical registry) until REVIEW-2w shows PASS.
## (prior) Gate
(none before this)
## Blocked
(none)
## Notes
- **Disk budget (WC8 watch):** cc-ci `/` was 91% (2.4G free) at phase start; freed orphaned Phase-2
cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in
Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
- Stable-domain scheme (proposed, see DECISIONS): `warm-<recipe>.ci.commoninternet.net`, distinct
from cold `<recipe[:4]>-<6hex>`.
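
The two naming schemes in the last note can be written out as below. This is a sketch under assumptions: the helper names are hypothetical, and placing the cold per-run name under the same `ci.commoninternet.net` base domain is an inference from the warm scheme, not something the note states.

```python
import secrets

BASE_DOMAIN = "ci.commoninternet.net"


def warm_domain(recipe: str) -> str:
    """Stable canonical domain: the same on every run."""
    return f"warm-{recipe}.{BASE_DOMAIN}"


def cold_domain(recipe: str) -> str:
    """Throwaway per-run domain: first 4 chars of the recipe plus 6 random hex chars."""
    return f"{recipe[:4]}-{secrets.token_hex(3)}.{BASE_DOMAIN}"
```

The `warm-` prefix is longer than the 4-character cold prefix, so a warm domain can never collide with a cold per-run one.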