STATUS — Phase 2w (warm canonical deployments + --quick CI mode)
Phase plan (SSOT): /srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md
Loop state for THIS phase: STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared).
Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state.
Phase 2 is PAUSED (STATUS-2/BACKLOG-2 intact) and resumes after 2w ## DONE.
Phase
Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe
canonicals at stable domains, known-good snapshots, an opt-in --quick fast lane that reattaches the
canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a
nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversary cold-verified.
Definition of Done (Phase 2w) — WC1–WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w
- WC1 — Live-warm UNPINNED keycloak; per-run namespaced realms (create+delete); concurrent distinct realms; orphan realms reaped. Adversary PASS @2026-05-29 (REVIEW-2w, gate 985686f).
- [~] WC1.1 — Health-gated deploy-with-rollback. keycloak (stateful) — Adversary PASS @2026-05-29 (marquee: broken latest → snapshot→restore→prior, data intact, last_good held, alert). traefik (stateless, version-rollback-only) — NOT yet migrated = W0.10, MUST close before Phase-2w DONE (Adversary will require a cold proof).
- WC1.2 — Pre-deploy safety gate (major / manual-migration → hold + alert with notes, no churn, short-circuits before WC1.1). Adversary PASS @2026-05-29.
- WC2 — Data-warm canonical model: per-recipe canonical at stable domain warm-<recipe>, declarative registry (canonical.json + recipe_meta.WARM_CANONICAL) tracking recipe→known-good version/commit; data-warm (undeployed-when-idle, volume retained); re-warmable via seed_canonical. Proven on custom-html (W1.2). CLAIMED — see Gate below.
- WC3 — Known-good snapshots: raw per-volume tar taken while undeployed under /var/lib/ci-warm/<recipe>/snapshot/; one last-good per app, atomic subdir swap; restore round-trips data (W0.5 mutate→restore proof + W1.2 data-warm reattach). CLAIMED — see Gate.
- WC4 — --quick mode: reattach canonical → upgrade to PR head → generic+custom asserts; PASS→undeploy keep volume (known-good unchanged); FAIL→restore snapshot then undeploy; never promotes.
- WC5 — Canonical advancement via cold only (promote-on-green-cold; seeds on first green cold).
- WC6 — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
- WC7 — Trigger/authority/labeling: default !testme = cold; --quick opt-in, never gates merge; results carry mode; clean no-canonical fallback.
- WC8 — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
- WC9 — Docs + cold verify incl. the rollback proof (deliberately fail a PR under --quick, confirm last-known-good restored intact; a --quick pass did not move the known-good).
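The WC4 lane can be read as a small control flow. The sketch below is illustrative only: `QuickOps`, `quick_run`, and the hook names are hypothetical stand-ins for harness functions that this file does not show, and treating a failed upgrade like a failed assert is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QuickOps:
    """Hypothetical hooks; the real harness functions are not shown in this file."""
    reattach_canonical: Callable[[], None]
    upgrade_to_pr_head: Callable[[], None]
    run_asserts: Callable[[], bool]          # generic + custom asserts
    restore_snapshot: Callable[[], None]
    undeploy_keep_volume: Callable[[], None]


def quick_run(ops: QuickOps) -> str:
    """Return "pass" or "fail"; neither outcome promotes the known-good."""
    ops.reattach_canonical()
    try:
        ops.upgrade_to_pr_head()
        passed = ops.run_asserts()
    except Exception:
        # Assumption: an upgrade that raises counts as a failed run.
        passed = False
    if not passed:
        # FAIL: put the last-known-good data back before going idle.
        ops.restore_snapshot()
    # Both outcomes end undeployed with the volume retained.
    ops.undeploy_keep_volume()
    return "pass" if passed else "fail"
```

The order on failure (restore, then undeploy) follows the WC4 wording; a real implementation may need to undeploy first if snapshots can only be restored while the app is down.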
Milestones (plan §3)
- W0 — Warm keycloak (WC1/WC1.1-keycloak/WC1.2). ✅ Adversary PASS @2026-05-29.
- W1 — Canonical registry + snapshot/restore (WC2, WC3). ← IN FLIGHT
- W2 — --quick mode (WC4, WC7).
- W3 — Cold-advances-canonical + nightly sweep (WC5, WC6).
- W4 — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.
In flight
W0 — live-warm keycloak (WC1). Done so far (commits up to 88c1114):
- W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit pass).
- W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).
- WC1 core mechanism PROVEN deploy-free on the live warm keycloak: realm create → password-grant JWT → discovery issuer → delete (idempotent) → reap (keeps live hex / deletes orphan). All PASS.
- W0.3 declarative reconciler nix/modules/warm-keycloak.nix up; nixos-rebuild switch → warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned + skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)
- W0.5 WC3 snapshot/restore helper (runner/harness/warmsnap.py) DONE (4cc1e15). +5 unit tests (48 unit pass). LIVE round-trip PROVEN on warm keycloak: marker realm → undeploy → snapshot (mariadb+providers) → deploy → delete marker (mutate DB) → undeploy → restore → deploy → marker realm BACK; keycloak healthy. Snapshots under /var/lib/ci-warm/<recipe>/, atomic, one last-good.
- W0.6 reconciler rewrite DONE (a044abb). runner/warm_reconcile.py (python, packaged into the nix store, replaces the bash reconcile): UNPIN keycloak (deploy latest version TAG; recipe fetched at runtime → D8 closure byte-identical); WC1.2 pre-deploy safety gate (major recipe/app bump OR releaseNotes manual-migration → hold + alert, no churn); WC1.1 health-gated upgrade-with-rollback scaffold (record last-good → keycloak undeploy→snapshot→deploy latest → health-gate → commit-or-restore+redeploy-prior+alert). Alerts = /var/lib/ci-warm/alerts/*.json. +8 unit tests (56 unit pass). PROVEN live: nixos-rebuild switch → warm-keycloak.service runs the python reconciler → noop-healthy (system 0-failed, 200); WC1.2 holds proven (MAJOR → held-major, keycloak untouched; minor+manual-migration notes → held-manual-migration, alert carries notes).
- W0.9 WC1.1 live proofs DONE (32f0071). PROVEN on warm keycloak (annotated fake tags + CCCI_SKIP_FETCH): (a) healthy upgrade 10.7.1→10.7.9 — snapshot+deploy+health-pass, last_good committed, marker preserved; (b) marquee rollback — broken latest 10.7.10 → deploy fails → rollback to 10.7.9, HEALTHY, marker realm INTACT (data preserved), last_good NOT advanced, rollback alert written (attempted=10.7.10, last_good=10.7.9, recovered=True); recovered to canonical 10.7.1+26.6.2. Fixed 4 issues live (deploy-fail→rollback, warmsnap last_good subdir, wait_undeployed swarm-settle, abra-stdout capture). 57 unit pass. Reconciler-side WC1/WC1.1/WC1.2 proven.
  Adversary reproduce (W0.9): on cc-ci, with the keycloak recipe clone, create annotated fake tags (peel ^{}, set git identity) 10.7.9+26.6.2 (=good commit) and 10.7.10+26.6.2 (broken KC_HOSTNAME), then run CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak twice; observe upgraded: then rolled-back:, marker realm survives, /var/lib/ci-warm/keycloak/last_good unchanged at the prior version, a *rollback*.json alert under /var/lib/ci-warm/alerts/.
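The WC1.1 sequence listed under W0.6 (undeploy → snapshot → deploy latest → health-gate → commit or restore + redeploy prior + alert) can be sketched as below. This is not the code in runner/warm_reconcile.py: `reconcile_upgrade` and its callable parameters are hypothetical, and only the result strings and the alert fields (`attempted`, `last_good`, `recovered`) are taken from the proofs above.

```python
from typing import Callable, Dict


def reconcile_upgrade(
    last_good: str,
    latest: str,
    deploy: Callable[[str], bool],      # True if the deploy itself succeeded
    healthy: Callable[[], bool],        # health gate, e.g. /realms/master == 200
    undeploy: Callable[[], None],
    snapshot: Callable[[], None],
    restore: Callable[[], None],
    alert: Callable[[Dict], None],
) -> Dict[str, str]:
    """Upgrade to `latest`; on any failure roll data and version back."""
    undeploy()
    snapshot()                          # taken while undeployed
    if deploy(latest) and healthy():
        # Commit: the new version becomes the last-good.
        return {"result": f"upgraded:{last_good}->{latest}", "last_good": latest}
    # Rollback: restore data, redeploy the prior version, raise an alert.
    undeploy()
    restore()
    recovered = deploy(last_good) and healthy()
    alert({"attempted": latest, "last_good": last_good, "recovered": recovered})
    return {"result": f"rolled-back:{latest}->{last_good}", "last_good": last_good}
```

Passing the side effects in as callables keeps the decision logic testable without a live keycloak, which is one plausible way to cover this path in unit tests.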
W0 COMPLETE — Adversary PASS @2026-05-29. Now in W1 (canonical registry, WC2/WC3).
W1 progress: W1.1 canonical registry module DONE (b6ef83a) — runner/harness/canonical.py
(enrollment via recipe_meta.WARM_CANONICAL, registry canonical.json, deploy/undeploy-keep-volume/
seed lifecycle) + 4 unit tests (61 unit pass). Next: W1.2 — enroll custom-html
(tests/custom-html/recipe_meta.py: WARM_CANONICAL=True) + LIVE data-warm proof: seed a
warm-custom-html canonical with content → undeploy-keep-volume (verify volume retained, app down) →
deploy_canonical (reattach) → assert the written content survives; re-warmable from scratch. Then
close WC2/WC3.
W1 plan (WC2 data-warm canonical model + WC3 closure):
- WC2: a declarative canonical registry — which recipes are canonical + at which known-good commit/version — with each canonical app at a stable domain warm-<recipe>, kept data-warm (undeployed-when-idle, data volume retained). Re-warmable from scratch (cache). Reconciler/registry declared in-repo.
- WC3: snapshots (warmsnap, W0.5 — done) tied to canonicals: one last-good per canonical under /var/lib/ci-warm/<recipe>/, restore proven (done). Close WC3 with the canonical model.
- Distinguish from W0's live-warm keycloak: canonicals are DATA-warm (undeployed when idle), keycloak is LIVE-warm (always up). Both use the warm-<recipe> stable scheme.
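A registry entry of this kind could look like the following sketch. The `status` and `version` keys and the warm-<recipe>.ci.commoninternet.net scheme appear elsewhere in this file; the `recipe` and `domain` keys, the function names, and the write-then-rename approach are assumptions, not a description of runner/harness/canonical.py.

```python
import json
import os
import tempfile
from pathlib import Path


def stable_domain(recipe: str) -> str:
    """Stable warm domain for a recipe (scheme from the Notes section)."""
    return f"warm-{recipe}.ci.commoninternet.net"


def write_registry(root: Path, recipe: str, version: str, status: str = "idle") -> dict:
    """Write <root>/<recipe>/canonical.json via a temp file plus rename."""
    recipe_dir = root / recipe
    recipe_dir.mkdir(parents=True, exist_ok=True)
    entry = {
        "recipe": recipe,                 # hypothetical key
        "domain": stable_domain(recipe),  # hypothetical key
        "version": version,
        "status": status,
    }
    fd, tmp_name = tempfile.mkstemp(dir=recipe_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(entry, fh, indent=2)
    # os.replace is atomic on the same filesystem, so readers never see a partial file.
    os.replace(tmp_name, recipe_dir / "canonical.json")
    return entry


def read_registry(root: Path, recipe: str) -> dict:
    return json.loads((root / recipe / "canonical.json").read_text())
```

In the real layout `root` would be /var/lib/ci-warm; the sketch takes it as a parameter so it can be exercised against a temporary directory.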
Tracked before Phase-2w DONE (not blocking W1):
- W0.10a — traefik WC1.1 (Adversary requires a cold proof): migrate proxy.nix onto the shared health-gated reconciler (stateless = version-rollback-only; preserve cert-secret/WILDCARDS_ENABLED/COMPOSE_FILE setup). CAREFUL — traefik serves all TLS; deploy/test only in a quiet window.
- W0.10b — Builder-loop alert relay: each wake, scan /var/lib/ci-warm/alerts/*.json → PushNotification → archive to alerts/seen/.
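The W0.10b relay (scan, notify, archive) is small enough to sketch. `relay_alerts` and the `notify` callback are hypothetical; the actual PushNotification mechanism is not described in this file, so it is passed in as a plain callable.

```python
import json
import shutil
from pathlib import Path
from typing import Callable, List


def relay_alerts(alerts_dir: Path, notify: Callable[[dict], None]) -> List[str]:
    """Notify for each pending alert sentinel, then archive it under seen/."""
    seen_dir = alerts_dir / "seen"
    seen_dir.mkdir(parents=True, exist_ok=True)
    relayed = []
    # glob is non-recursive, so files already in seen/ are not picked up again.
    for path in sorted(alerts_dir.glob("*.json")):
        notify(json.loads(path.read_text()))
        shutil.move(str(path), str(seen_dir / path.name))
        relayed.append(path.name)
    return relayed
```

Archiving only after `notify` returns means an alert whose notification raises stays pending and is retried on the next wake.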
Build finding (RESOLVED): the W0.4 lasuite-docs setup_custom_tests redeploy failure (nginx web
host not found in upstream ...backend:8000) was transient resource contention from the
since-killed stale Phase-2 run (disk was also tight). On the clean system it converges fine — the
headline e2e is green (below). No recipe/harness change needed.
Gate
Gate: WC2 + WC3 — CLAIMED, awaiting Adversary (@2026-05-29, HEAD = see git log -1)
WHAT. The data-warm canonical model (W1): a declarative per-recipe canonical at the stable domain
warm-<recipe>.ci.commoninternet.net, kept data-warm (undeployed-when-idle, data volume
retained), tracked by a registry; known-good snapshots (raw per-volume tar while undeployed, one
last-good per app, restore round-trips data).
WHERE (code). runner/harness/canonical.py (registry + data-warm lifecycle), runner/harness/warmsnap.py (snapshot/restore), enrollment tests/custom-html/recipe_meta.py: WARM_CANONICAL=True.
State on cc-ci under /var/lib/ci-warm/<recipe>/ (canonical.json, snapshot/, retained volume).
HOW + EXPECTED (cold, from your own clone on cc-ci):
- Units: cc-ci-run -m pytest tests/unit -q → 61 passed (incl. test_canonical, test_warmsnap).
- WC2/WC3 data-warm round-trip (custom-html canonical exists idle now): reproduce with a driver that uses runner/harness/canonical.py — deploy warm-custom-html.ci.commoninternet.net @ 1.11.0+1.29.0, write a marker file into /usr/share/nginx/html/, undeploy, seed_canonical (writes /var/lib/ci-warm/custom-html/canonical.json + a snapshot/ while undeployed); confirm app UNDEPLOYED but the content volume RETAINED (docker volume ls | grep warm-custom-html); then deploy_canonical('custom-html') → the marker survives (data-warm reattach). Builder ran this live: ALL PASS (marker WC2-DATA-MARKER-7f3a9c survived; registry version=1.11.0+1.29.0; snapshot present). Current live state: cat /var/lib/ci-warm/custom-html/canonical.json → status=idle, version=1.11.0+1.29.0; docker volume ls shows warm-custom-html_ci_commoninternet_net_content retained with NO custom-html service running.
- WC3 restore round-trip already cold-verified in the W0.9/W0.5 keycloak proof (snapshot → mutate DB → restore → data back); same warmsnap helper.
- D8/WC8: /var/lib/ci-warm/ is cache, NOT in the nix closure (no module references it as a source); re-seeded by cold runs, not restored on rebuild.
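For readers reproducing the snapshot → mutate → restore round-trip, the idea can be sketched with the standard library alone. This is not runner/harness/warmsnap.py: the function names, the `snapshot.new` / `snapshot.old` staging directories, and the treatment of each volume as a plain directory are assumptions. The two-rename swap shown is close to atomic, not strictly atomic.

```python
import shutil
import tarfile
from pathlib import Path
from typing import Dict


def take_snapshot(volumes: Dict[str, Path], recipe_dir: Path) -> Path:
    """Tar each volume into a staging dir, then swap it in as the one last-good."""
    final = recipe_dir / "snapshot"
    staging = recipe_dir / "snapshot.new"
    previous = recipe_dir / "snapshot.old"
    shutil.rmtree(staging, ignore_errors=True)
    staging.mkdir(parents=True)
    for name, src in volumes.items():
        with tarfile.open(staging / f"{name}.tar", "w") as tar:
            # Store paths relative to the volume root; parents sort before children.
            for path in sorted(src.rglob("*")):
                tar.add(path, arcname=path.relative_to(src).as_posix(), recursive=False)
    # Swap: the old snapshot is only discarded once the new one is complete.
    shutil.rmtree(previous, ignore_errors=True)
    if final.exists():
        final.rename(previous)
    staging.rename(final)
    shutil.rmtree(previous, ignore_errors=True)
    return final


def restore_snapshot(name: str, dest: Path, recipe_dir: Path) -> None:
    """Replace the contents of `dest` with the snapshot of volume `name`."""
    shutil.rmtree(dest, ignore_errors=True)
    dest.mkdir(parents=True)
    with tarfile.open(recipe_dir / "snapshot" / f"{name}.tar") as tar:
        tar.extractall(dest)
```

Both functions assume the app is undeployed while they run, matching the "taken while undeployed" condition in WC3.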
Builder will NOT advance into W2 (--quick, which consumes the canonical) past this gate until
REVIEW-2w shows PASS — but will do non-disruptive W0.10 follow-ups (alert relay) meanwhile.
Gate: WC1 + WC1.2 + WC1.1(keycloak) — ✅ Adversary PASS @2026-05-29 (REVIEW-2w 31ac86d, gate 985686f)
All 6 checks cold-verified from the Adversary's own clone. Builder may proceed to W1. Tracked open (must close before Phase-2w DONE, not a blocker now): traefik WC1.1 (W0.10) — stateless version-rollback not yet on the shared health-gated reconciler; Adversary will require a cold proof.
(claim detail retained below for the record)
WHAT. The live-warm keycloak layer (W0): a persistent unpinned keycloak at the stable domain
warm-keycloak.ci.commoninternet.net, declaratively reconciled, that SSO-dependent runs use via a
per-run namespaced realm (created + deleted) instead of co-deploying; concurrent dependents get
distinct realms; orphan realms are reaped (WC1). The reconciler health-gates auto-upgrades with
snapshot-backed rollback (WC1.1) behind a pre-deploy safety gate for major/manual-migration bumps
(WC1.2).
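Per-run realm naming and orphan reaping can be illustrated as follows. The names `realm_for` and `realms_to_reap` appear in this file, but their signatures and bodies here are assumptions: the sketch supposes the cold app name has the shape <recipe[:4]>-<6hex> and that a realm is reapable when its 6-hex suffix is not among the live runs.

```python
import re
from typing import Iterable, List, Set

_APP_HEX = re.compile(r"^[^-.]+-([0-9a-f]{6})")
_REALM_HEX = re.compile(r"^.+-([0-9a-f]{6})$")


def realm_for(recipe: str, app_name: str) -> str:
    """Derive the per-run realm name from the recipe and the cold app name."""
    match = _APP_HEX.match(app_name)
    if match is None:
        raise ValueError(f"no 6-hex run id in {app_name!r}")
    return f"{recipe}-{match.group(1)}"


def realms_to_reap(realms: Iterable[str], live_hexes: Set[str]) -> List[str]:
    """Per-run realms whose run id is no longer live; 'master' never matches."""
    doomed = []
    for realm in realms:
        match = _REALM_HEX.match(realm)
        if match and match.group(1) not in live_hexes:
            doomed.append(realm)
    return doomed
```

Because the realm name embeds the run's hex id, two concurrent runs of the same recipe get distinct realms without any coordination.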
WHERE (code). runner/warm_reconcile.py (reconcile logic), runner/harness/warm.py (stable
domain, per-run realm naming, reaping), runner/harness/sso.py (realm lifecycle), runner/harness/warmsnap.py (snapshot/restore), runner/run_recipe_ci.py (warm/cold dep split), nix/modules/warm-keycloak.nix (systemd reconcile unit). Warm state on cc-ci under /var/lib/ci-warm/.
HOW + EXPECTED (cold, from your own clone on cc-ci — tar-sync runner+tests to your /root/):
- Declarative + unpinned + healthy: grep -n kcVersion nix/modules/warm-keycloak.nix → no match (pin removed; the unit runs runner/warm_reconcile.py keycloak). ssh cc-ci 'systemctl is-active warm-keycloak.service' → active; systemctl is-system-running → running. Health: curl -sk --resolve warm-keycloak.ci.commoninternet.net:443:127.0.0.1 https://warm-keycloak.ci.commoninternet.net/realms/master -o /dev/null -w '%{http_code}' → 200. D8: a nixos-rebuild build closure hash is unaffected by which keycloak version is live (recipe fetched at runtime).
- Units: cc-ci-run -m pytest tests/unit -q → 57 passed (incl. test_warm_realm, test_warmsnap, test_warm_reconcile).
- WC1 headline e2e: RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py → install: pass, custom: pass, deploy-count = 1 (expect 1) (keycloak NOT co-deployed), log shows dep: using live-warm keycloak @ warm-keycloak... and dep: deleted per-run realm lasuite-docs-<hex> on warm keycloak. The 3 custom SSO tests pass (test_health_check, test_oidc_login_via_keycloak, test_oidc_password_grant_against_dep_keycloak). After the run, warm keycloak realms = ['master'] only (no leftover); no lasu* docker stack.
- WC1 concurrency + reaping (deploy-free): realm_for("lasuite-docs","lasu-aaa111...") = lasuite-docs-aaa111 and ...bbb222 → distinct (two concurrent same-recipe runs never collide); create realms aaa111/bbb222/ccc333 on the warm kc, each oidc_password_grant returns a JWT; sso.reap_orphaned_realms(D, live_hexes={"aaa111"}) deletes exactly bbb222+ccc333 and KEEPS aaa111. (Builder ran this live: PASS.)
- WC1.1 health-gated rollback (live): with CCCI_SKIP_FETCH=1 stage two annotated fake tags on ~/.abra/recipes/keycloak — 10.7.9+26.6.2 at the good commit (git tag -a -m x 10.7.9+26.6.2 10.7.1+26.6.2^{}) and 10.7.10+26.6.2 at a commit whose compose.yml has a broken KC_HOSTNAME=:::bad-host:::. Create a marker realm, set last_good, then run CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak twice → first RECONCILE RESULT: upgraded:...->10.7.9 (snapshot taken, last_good=10.7.9, marker preserved); second rolled-back:10.7.10->10.7.9 — keycloak HEALTHY on 10.7.9, marker realm INTACT (data preserved), /var/lib/ci-warm/keycloak/last_good still 10.7.9 (NOT advanced), a *-rollback.json alert under /var/lib/ci-warm/alerts/ with attempted=10.7.10 last_good=10.7.9 recovered=true. (Builder ran this live: ALL PASS; keycloak restored to canonical 10.7.1+26.6.2.)
- WC1.2 pre-deploy safety gate (live): stage an annotated fake tag with a MAJOR bump (11.0.0+27.0.0) → CCCI_SKIP_FETCH=1 ... warm_reconcile.py keycloak → RECONCILE RESULT: held-major:..., a *-held-major.json alert written, keycloak untouched (TYPE unchanged, 200, no snapshot/deploy churn). Stage a minor tag (10.7.2+26.6.3) with releaseNotes/10.7.2+26.6.3.md containing "manual migration" → held-manual-migration, alert carries the notes. (Builder ran both live: held + untouched.)
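The WC1.2 decision itself can be sketched as a pure function. Reading a tag such as 10.7.1+26.6.2 as <recipe version>+<app version> is inferred from the "major recipe/app bump" wording; `parse_tag`, `safety_gate`, and the `"proceed"` label are hypothetical, while `held-major` and `held-manual-migration` are the result names quoted above.

```python
from typing import Tuple

Version = Tuple[int, ...]


def parse_tag(tag: str) -> Tuple[Version, Version]:
    """Split '<recipe>+<app>' into two integer version tuples."""
    recipe_part, _, app_part = tag.partition("+")
    recipe = tuple(int(p) for p in recipe_part.split("."))
    app = tuple(int(p) for p in app_part.split("."))
    return recipe, app


def safety_gate(current: str, candidate: str, release_notes: str = "") -> str:
    """Decide whether an upgrade may proceed to the health-gated deploy."""
    cur_recipe, cur_app = parse_tag(current)
    new_recipe, new_app = parse_tag(candidate)
    # A change in either leading component counts as a major bump.
    if new_recipe[0] != cur_recipe[0] or new_app[0] != cur_app[0]:
        return "held-major"
    if "manual migration" in release_notes.lower():
        return "held-manual-migration"
    return "proceed"  # hypothetical label for "hand over to WC1.1"
```

With the tags used in the live proof, `safety_gate("10.7.1+26.6.2", "11.0.0+27.0.0")` yields `held-major`, and a minor candidate whose notes mention "manual migration" yields `held-manual-migration`.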
SCOPE (honest). WC1 and WC1.2 are complete. WC1.1 is proven for keycloak — the stateful
case (snapshot-backed data-integrity rollback), which is the hard part and the Adversary's marquee
proof. traefik's WC1.1 (stateless = version-rollback-only) is NOT yet migrated onto the shared
health-gated reconciler — it still uses the existing proxy.nix chaos-deploy reconciler. That
migration is W0.10 (tracked in BACKLOG-2w), to land before the Phase-2w DONE. If the Adversary
wants WC1.1 fully closed (both reconcilers) before PASS, treat this gate as WC1 + WC1.2 + WC1.1(keycloak).
Alert delivery note (not blocking): the reconciler WRITES alert sentinels to
/var/lib/ci-warm/alerts/*.json (proven above). The operator-facing relay (Builder loop scans →
PushNotification → archive to alerts/seen/) is loop behavior, run each wake when an alert exists;
none currently. "Alert fired" for WC1.1/WC1.2 = sentinel written, which is independently checkable.
Builder will NOT advance past this gate (to W1/WC2 canonical registry) until REVIEW-2w shows PASS.
(prior) Gate
(none before this)
Blocked
(none)
Notes
- Disk budget (WC8 watch): cc-ci / was 91% (2.4G free) at phase start; freed orphaned Phase-2 cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
- Stable-domain scheme (proposed, see DECISIONS): warm-<recipe>.ci.commoninternet.net, distinct from cold <recipe[:4]>-<6hex>.