cc-ci/machine-docs/STATUS-2w.md
autonomic-bot 465e1059b0 claim(2w): WC6 nightly full-cold sweep — timer+service roll warm/infra (health-gated) then serial cold sweep promoting canonicals (WC5); proven live
canonical.enrolled_recipes; runner/nightly_sweep.py (roll keycloak+traefik →
serial full-cold over enrolled on latest → green promotes; skip if test active;
operate against CCCI_REPO checkout for tests/); nix/modules/nightly-sweep.nix
(timer 03:00 Persistent + oneshot service) wired in. 2 bugs fixed via live
service run (repo-relative enrolled scan; util-linux for backup PTY). Live
SERVICE sweep: enrolled=['custom-html'] → all tiers green → canonical advanced
1.10.0→1.11.0; red-run correctly does NOT promote. 71 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:33:08 +01:00

STATUS — Phase 2w (warm canonical deployments + --quick CI mode)

Phase plan (SSOT): /srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md
Loop state for THIS phase: STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared). Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state. Phase 2 is PAUSED (STATUS-2/BACKLOG-2 intact) and resumes after 2w ## DONE.

Phase

Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe canonicals at stable domains, known-good snapshots, an opt-in --quick fast lane that reattaches the canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a nightly full-cold sweep. Definition of Done = WC1-WC9 (plan §1), each Adversary cold-verified.

Definition of Done (Phase 2w) — WC1-WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w

  • WC1 — Live-warm UNPINNED keycloak; per-run namespaced realms (create+delete); concurrent distinct realms; orphan realms reaped. Adversary PASS @2026-05-29 (REVIEW-2w, gate 985686f).
  • WC1.1 — Health-gated deploy-with-rollback. keycloak (stateful): Adversary PASS @2026-05-29 (marquee). traefik (stateless, version-rollback-only): reconciler MIGRATED (W0.10a); proxy.nix now drives warm_reconcile.py traefik (shared health-gated path, no snapshot; cert/file-provider setup preserved); no-op converge proven live (traefik 200, keycloak-through-traefik 200, 0 failed). Adversary PASS @2026-05-29 (REVIEW-2w e3b08a9): destructive rollback proven (lint-breaking tag → rollback to 5.1.1, NO TLS outage). WC1.1 FULLY CLOSED (keycloak stateful + traefik stateless).
  • WC1.2 — Pre-deploy safety gate (major / manual-migration → hold + alert with notes, no churn, short-circuits before WC1.1). Adversary PASS @2026-05-29.
  • WC2 — Data-warm canonical model: per-recipe canonical at stable domain warm-<recipe>, declarative registry (canonical.json + recipe_meta.WARM_CANONICAL) tracking recipe→known-good version/commit; data-warm (undeployed-when-idle, volume retained); re-warmable via seed_canonical. Proven on custom-html (W1.2). Adversary PASS @2026-05-29 (REVIEW-2w 0246296, gate 4ce80f8).
  • WC3 — Known-good snapshots: raw per-volume tar taken while undeployed under /var/lib/ci-warm/<recipe>/snapshot/; one last-good per app, atomic subdir swap; restore round-trips data (W0.5 + W1.2 + Adversary's own mutate→restore). Adversary PASS @2026-05-29.
  • WC4 — --quick mode (run_quick in run_recipe_ci.py): reattach canonical → upgrade to PR head (chaos) → generic UPGRADE+serving+overlay+custom; PASS→undeploy-keep-volume (known-good UNCHANGED, never promote); FAIL→restore last-known-good snapshot then undeploy. Proven live on custom-html (PASS + FAIL). Adversary PASS @2026-05-29 (REVIEW-2w 31f0e42, gate 3ff2bf6).
  • WC5 — Canonical advancement via cold only (promote-on-green-cold). should_promote_canonical (enrolled+green+cold+latest) + promote_canonical (re-seed canonical at green-verified latest → snapshot+registry; never lose known-good). Proven live: green cold custom-html run advanced the canonical 1.10.0+1.28.0 → 1.11.0+1.29.0 (snapshot refreshed, idle, per-run app torn down). --quick never promotes (W2). Adversary PASS @2026-05-29 (REVIEW-2w 5bbc47c, gate 125453d).
  • WC6 — Nightly full-cold sweep. nix/modules/nightly-sweep.nix (systemd TIMER OnCalendar 03:00 Persistent + oneshot service) → runner/nightly_sweep.py: roll warm/infra (keycloak+traefik health-gated, WC1.1) → SERIAL full-cold run over enrolled (canonical.enrolled_recipes) recipes on latest → each green run promotes its canonical (WC5); skips if a test is in flight. Proven via the live service: enrolled=['custom-html'] → all tiers green → canonical advanced 1.10.0→1.11.0. CLAIMED — see Gate.
  • WC7 — Trigger/authority/labeling: default !testme=cold (unchanged); --quick opt-in via bridge parse_trigger (!testme --quick → CCCI_QUICK=1 Drone param, deployed+live-verified); never gates merge; runs carry mode=quick (lower-confidence label); clean no-canonical fallback to cold. Adversary PASS @2026-05-29 (REVIEW-2w 31f0e42, gate 3ff2bf6).
  • WC8 — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
  • WC9 — Docs + cold verify incl. the rollback proof (deliberately fail a PR under --quick, confirm last-known-good restored intact; a --quick pass did not move the known-good).

Milestones (plan §3)

  • W0 — Warm keycloak (WC1/WC1.1-keycloak/WC1.2). Adversary PASS @2026-05-29.
  • W1 — Canonical registry + snapshot/restore (WC2, WC3). Adversary PASS @2026-05-29.
  • W2 — --quick mode (WC4, WC7). Adversary PASS @2026-05-29.
  • W3 — Cold-advances-canonical (WC5 PASS) + nightly sweep (WC6 ← building).
  • W4 — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.

In flight

W0 — live-warm keycloak (WC1). Done so far (commits up to 88c1114):

  • W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit pass).

  • W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).

  • WC1 core mechanism PROVEN deploy-free on the live warm keycloak: realm create → password-grant JWT → discovery issuer → delete(idempotent) → reap(keeps live hex / deletes orphan). All PASS.

  • W0.3 declarative reconciler nix/modules/warm-keycloak.nix up; nixos-rebuild switch → warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned + skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)

  • W0.5 WC3 snapshot/restore helper (runner/harness/warmsnap.py) DONE (4cc1e15). +5 unit tests (48 unit pass). LIVE round-trip PROVEN on warm keycloak: marker realm → undeploy → snapshot (mariadb+providers) → deploy → delete marker (mutate DB) → undeploy → restore → deploy → marker realm BACK; keycloak healthy. Snapshots under /var/lib/ci-warm/<recipe>/, atomic, one last-good.

  • W0.6 reconciler rewrite DONE (a044abb). runner/warm_reconcile.py (python, packaged into the nix store, replaces the bash reconcile): UNPIN keycloak (deploy latest version TAG; recipe fetched at runtime → D8 closure byte-identical); WC1.2 pre-deploy safety gate (major recipe/app bump OR releaseNotes manual-migration → hold + alert, no churn); WC1.1 health-gated upgrade-with-rollback scaffold (record last-good → keycloak undeploy→snapshot→deploy latest → health-gate → commit-or-restore+redeploy-prior+alert). Alerts = /var/lib/ci-warm/alerts/*.json. +8 unit tests (56 unit pass). PROVEN live: nixos-rebuild switch → warm-keycloak.service runs the python reconciler → noop-healthy (system 0-failed, 200); WC1.2 holds proven (MAJOR → held-major, keycloak untouched; minor+manual-migration notes → held-manual-migration, alert carries notes).

  • W0.9 WC1.1 live proofs DONE (32f0071). PROVEN on warm keycloak (annotated fake tags + CCCI_SKIP_FETCH): (a) healthy upgrade 10.7.1→10.7.9 — snapshot+deploy+health-pass, last_good committed, marker preserved; (b) marquee rollback — broken latest 10.7.10 → deploy fails → rollback to 10.7.9, HEALTHY, marker realm INTACT (data preserved), last_good NOT advanced, rollback alert written (attempted=10.7.10,last_good=10.7.9,recovered=True); recovered to canonical 10.7.1+26.6.2. Fixed 4 issues live (deploy-fail→rollback, warmsnap last_good subdir, wait_undeployed swarm-settle, abra-stdout capture). 57 unit pass. Reconciler-side WC1/WC1.1/WC1.2 proven.

    Adversary reproduce (W0.9): on cc-ci, with the keycloak recipe clone, create annotated fake tags (peel ^{}, set git identity) 10.7.9+26.6.2(=good commit) and 10.7.10+26.6.2(broken KC_HOSTNAME), then CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak twice; observe upgraded: then rolled-back:, marker realm survives, /var/lib/ci-warm/keycloak/last_good unchanged at the prior version, a *rollback*.json alert under /var/lib/ci-warm/alerts/.
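
The health-gated upgrade-with-rollback flow from W0.6/W0.9 can be pictured roughly as below. This is a minimal sketch, not the code in runner/warm_reconcile.py: the `Hooks` container, the `reconcile` signature and the way state is passed in are all assumptions made for illustration. Only the result strings (`noop-healthy`, `upgraded:…`, `rolled-back:…`) and the alert fields (`attempted`, `last_good`, `recovered`) are taken from the proofs above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hooks:
    """Injected side effects (hypothetical stand-ins for the abra/warmsnap calls)."""
    deploy: Callable[[str], bool]    # deploy a version tag; True if the deploy succeeded
    healthy: Callable[[], bool]      # health gate, e.g. HTTP 200 on the routed host
    snapshot: Callable[[], None]     # snapshot volumes (taken while undeployed)
    restore: Callable[[], None]      # restore the last-good snapshot
    alert: Callable[[dict], None]    # write an alert sentinel


def reconcile(current: str, latest: str, last_good: str,
              hooks: Hooks, stateful: bool = True) -> tuple[str, str]:
    """Return (result, new_last_good). last_good only advances on a healthy upgrade."""
    if current == latest and hooks.healthy():
        return ("noop-healthy", last_good)
    if stateful:
        hooks.snapshot()             # stateless apps (traefik) skip snapshot/restore
    if hooks.deploy(latest) and hooks.healthy():
        return (f"upgraded:{current}->{latest}", latest)
    # Deploy failed or the health gate failed: go back to the last-good version.
    if stateful:
        hooks.restore()
    recovered = hooks.deploy(last_good) and hooks.healthy()
    hooks.alert({"attempted": latest, "last_good": last_good, "recovered": recovered})
    return (f"rolled-back:{latest}->{last_good}", last_good)
```

Passing the side effects in as callables keeps the decision logic testable without a live keycloak, which is one plausible way to get the unit coverage mentioned above.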

W0 COMPLETE — Adversary PASS @2026-05-29. Now in W1 (canonical registry, WC2/WC3).

W0 + W1 + W2 Adversary PASS. Now in W3 (cold-advances-canonical WC5 + nightly sweep WC6).

W3 plan:

  • WC5 — promote-on-green-cold. A GREEN full-cold run on the LATEST (not a --quick run) of an enrolled (WARM_CANONICAL) recipe re-snapshots + re-tags the canonical known-good instead of deleting the volume at teardown: at the end of a green cold run, undeploy → canonical.seed_canonical (snapshot while undeployed + write registry version=the green commit/version) → keep the volume as the new canonical. The FIRST green cold run on latest SEEDS the canonical. ONLY cold advances it (--quick never promotes — proven W2). Wire into run_recipe_ci.py cold teardown, gated on: recipe is WARM_CANONICAL + run was green + deployed LATEST (not a pinned/prev base). Add unit tests + a live proof (green cold custom-html run → canonical re-seeded at the new known-good).
  • WC6 — nightly full-cold sweep. Declarative scheduler (systemd timer on cc-ci): nightly does nixos-rebuild switch FIRST (rolls warm/infra to latest, health-gated per WC1.1) THEN a full-cold sweep across enrolled recipes (serial, MAX_TESTS-bounded), refreshing each canonical's known-good (WC5) + serving as the daily authoritative regression. MUST NOT run while a test is in flight.
  • Quiet-window opportunity (now): W0.10a traefik WC1.1 — Adversary idle post-W2 PASS, so this is the window to migrate traefik onto the health-gated reconciler (tracked-before-DONE; below).

Tracked before Phase-2w DONE:

  • W0.10a — traefik WC1.1 (Adversary requires a cold proof): migrate proxy.nix onto the shared health-gated reconciler (stateless = version-rollback-only; preserve cert-secret/WILDCARDS_ENABLED/COMPOSE_FILE setup). CAREFUL — traefik serves all TLS; deploy/test only in a quiet window.
  • W0.10b — Builder-loop alert relay: each wake, scan /var/lib/ci-warm/alerts/*.json → PushNotification → archive to alerts/seen/.

Build finding (RESOLVED): the W0.4 lasuite-docs setup_custom_tests redeploy failure (nginx web host not found in upstream ...backend:8000) was transient resource contention from the since-killed stale Phase-2 run (disk was also tight). On the clean system it converges fine — the headline e2e is green (below). No recipe/harness change needed.

Gate

Gate: WC6 — CLAIMED, awaiting Adversary (@2026-05-29)

WHAT. Nightly full-cold sweep: a scheduled job rolls warm/infra to latest (health-gated, WC1.1) then runs the full COLD suite serially across enrolled canonical recipes on latest — refreshing each canonical's known-good (WC5) + a daily authoritative regression. Declarative, MAX_TESTS-bounded (serial), skips if a test is in flight. WHERE: nix/modules/nightly-sweep.nix (timer+service), runner/nightly_sweep.py, runner/harness/canonical.py (enrolled_recipes). Wired into hosts/cc-ci/configuration.nix.

HOW + EXPECTED (cold):

  1. Units: cc-ci-run -m pytest tests/unit -q → 71 passed (incl. test_canonical enrolled_recipes).
  2. Timer present: systemctl is-active nightly-sweep.timer → active; systemctl list-timers nightly-sweep.timer → next ~03:00 (Persistent).
  3. Live sweep (via the systemd SERVICE, store copy): set the custom-html canonical to an OLDER version, then systemctl start nightly-sweep.service → journal shows: roll keycloak rc=0 + traefik rc=0 (health-gated, noop at latest); enrolled canonicals = ['custom-html']; full-cold custom-html install/upgrade/backup/restore/custom all pass; WC5 promote: canonical custom-html advanced to known-good 1.11.0+1.29.0; custom-html: PASS; afterwards canonical.json version ADVANCED to 1.11.0+1.29.0, canonical idle, traefik+keycloak 200, system running. Builder ran this live: PASS. (A red recipe in the sweep is reported FAIL + does NOT promote — known-good safe; verified when a missing-util-linux backup flake red'd a run and the canonical stayed put, then fixed.)
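
The sweep ordering described in this gate (skip if a test is in flight, roll keycloak and traefik, then run enrolled recipes serially and promote only green ones) could look roughly like the sketch below. It is illustrative only: the function signature and the injected callables are hypothetical, and continuing the sweep after a failed infra roll is an assumption, not something this document states about runner/nightly_sweep.py.

```python
from typing import Callable, Iterable


def nightly_sweep(
    enrolled: Iterable[str],
    test_in_flight: Callable[[], bool],
    roll_infra: Callable[[str], int],   # returns a process-style rc, 0 = ok
    run_cold: Callable[[str], bool],    # True if every tier of the cold run was green
    promote: Callable[[str], None],     # WC5: advance the canonical known-good
) -> dict[str, str]:
    """Run one sweep and return a per-step result map."""
    if test_in_flight():
        # The sweep must not run while a test is active.
        return {"sweep": "skipped"}
    results: dict[str, str] = {}
    for app in ("keycloak", "traefik"):
        results[f"roll:{app}"] = "ok" if roll_infra(app) == 0 else "failed"
    for recipe in enrolled:             # strictly serial, one recipe at a time
        green = run_cold(recipe)
        if green:
            promote(recipe)             # a red run never promotes
        results[recipe] = "PASS" if green else "FAIL"
    return results
```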

Gate: WC5 — Adversary PASS @2026-05-29 (REVIEW-2w 5bbc47c, gate 125453d)

Anti-poison gate predicate + live advancement 1.10.0→1.11.0 (cold-only) cold-verified. Builder may proceed to WC6. (claim detail retained below.)

(claimed, now PASS) Gate: WC5 — CLAIMED detail

WHAT. Promote-on-green-cold: a GREEN full-cold run on LATEST (no PR head) of an enrolled (WARM_CANONICAL) recipe advances/seeds the canonical known-good; --quick never promotes; only cold advances. WHERE: runner/run_recipe_ci.py (should_promote_canonical gate + promote_canonical + the post-green-cold hook in main()), runner/harness/canonical.py (seed_canonical).

HOW + EXPECTED (cold):

  1. Units: cc-ci-run -m pytest tests/unit -q → 70 passed (incl. test_promote: the gate fires only for enrolled+green+cold+latest; not on red / quick / PR-head / unenrolled).
  2. Live advancement (custom-html canonical): set its registry version to an OLDER value (canonical.write_registry("custom-html", version="1.10.0+1.28.0", …)), then a full COLD run RECIPE=custom-html cc-ci-run runner/run_recipe_ci.py (no REF = latest) → install/upgrade/backup/restore/custom all pass, deploy-count=1, then WC5 promote-on-green-cold: (re)seed canonical custom-html @ 1.11.0+1.29.0 → afterwards canonical.json version ADVANCED to 1.11.0+1.29.0 (commit=head 8a02606…), snapshot refreshed (warmsnap.read_meta version=1.11.0+1.29.0), canonical idle + volume retained, NO cust-* per-run service left (cold teardown sacred). Builder ran this live: advanced 1.10.0→1.11.0. (A PR !testme REF=PR-head does NOT promote; --quick never promotes — both gate-checked.)
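
The promotion gate is a conjunction of four conditions (enrolled + green + cold + latest). A minimal sketch of such a predicate follows; the `RunOutcome` record and the argument shapes are hypothetical, and the real should_promote_canonical in runner/run_recipe_ci.py may take different inputs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunOutcome:
    """Hypothetical summary of one finished run."""
    recipe: str
    green: bool            # every stage passed
    mode: str              # "cold" or "quick"
    ref: Optional[str]     # None means "latest" (no PR head / pinned base)


def should_promote_canonical(run: RunOutcome, enrolled: set[str]) -> bool:
    """Promote only for an enrolled recipe, on a green, cold run of latest."""
    return (
        run.recipe in enrolled
        and run.green
        and run.mode == "cold"
        and run.ref is None
    )
```

Any single failing condition (red, quick, PR head, unenrolled) blocks promotion, which mirrors the negative cases listed for test_promote.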

Gate: W0.10a traefik WC1.1 — Adversary PASS @2026-05-29 (REVIEW-2w e3b08a9, gate e678d2e)

Migration + no-op converge + destructive rollback (lint-breaking tag → rollback to last-good, NO TLS outage — broken deploy rejected at lint before touching the running proxy) all cold-verified. WC1.1 now FULLY closed (keycloak + traefik). (claim detail retained below.)

(claimed, now PASS) Gate: W0.10a traefik WC1.1 — CLAIMED detail

WHAT. traefik migrated onto the shared health-gated reconciler (WC1.1, stateless = version-rollback-only, NO snapshot): record last-good → deploy latest tag → health-gate (routed host ci.commoninternet.net = 200) → healthy commit / unhealthy roll back to last-good + alert. Closes the W0.10a tracked-open item from the W0 gate. traefik's wildcard-cert/file-provider config preserved.

WHERE. runner/warm_reconcile.py (SPECS["traefik"] stateful=False + _traefik_setup + health_domain; reconcile() per-app setup hook; the stateless path skips snapshot/restore — version rollback only), nix/modules/proxy.nix (deploy-proxy.service now execs python3 …/warm_reconcile.py traefik).

HOW + EXPECTED (cold):

  1. Units: cc-ci-run -m pytest tests/unit -q → 65 passed (incl. test_warm_reconcile traefik spec: stateful=False, callable setup, health_domain=ci.commoninternet.net; keycloak unchanged).
  2. No-op converge (delivered, proven live): systemctl is-active deploy-proxy.service → active; journalctl -u deploy-proxy.service → [traefik] already on latest 5.1.1+v3.6.15 and healthy — no-op; traefik serving (ci.commoninternet.net=200) + keycloak-through-traefik=200 + system running (0 failed). The migration was zero-disruption (traefik was already at the latest tag; I pre-seeded TYPE+last_good to 5.1.1+v3.6.15 so the reconcile is a clean no-op).
  3. Destructive rollback (the Adversary's required cold proof): stage a fake newer traefik tag with a broken config → CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py traefik → broken deploy fails health → reconciler rolls back to last-good 5.1.1+v3.6.15 (version-only, no snapshot — traefik is stateless) → traefik healthy again + a *-rollback.json alert. NOTE: a destructive traefik test briefly drops TLS for ALL routes during the broken-deploy window until rollback — run it knowing that + with manual recovery ready (abra app deploy traefik.ci.commoninternet.net 5.1.1+v3.6.15 -o -n -f). The rollback logic is the SAME proven keycloak pattern, stateless variant (no snapshot).

Per operator guidance, I delivered the code + the safe no-op converge this iteration and left the destructive rollback as the Adversary's cold proof (a live destructive traefik test risks all TLS).


Gate: WC4 + WC7 — Adversary PASS @2026-05-29 (REVIEW-2w 31f0e42, gate 3ff2bf6)

Cold-verified from the Adversary's own clone: 64 units; WC7 adversarial trigger battery (all negatives rejected, live bridge); WC4 never-promote (snapshot byte-identical, registry unchanged); WC4 FAIL→rollback restored EXACT known-good (marker back, 200, broken image gone, exit 1); no-canonical fallback to a cold per-run domain. Builder may proceed to W3. (claim detail retained below.)

(claimed, now PASS) Gate: WC4 + WC7 — CLAIMED detail

WHAT. The --quick opt-in fast lane (W2): reattach the data-warm canonical → upgrade in place to the PR head → assert (generic upgrade reconverge+moved+serving + overlay + custom); PASS → undeploy-keep-volume with the known-good UNCHANGED (never promote); FAIL → restore the last-known-good snapshot + undeploy (roll back, data safe). Opt-in via !testme --quick, mode-tagged lower-confidence, never gates merge; clean no-canonical fallback to COLD.
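
The branches of the --quick lane described above can be written down as a small decision table. The sketch below only names the steps; `quick_plan` and the step labels are invented for illustration and do not correspond to functions in run_recipe_ci.py. The FAIL ordering follows the wording above and the real code may sequence undeploy and restore differently.

```python
def quick_plan(has_canonical: bool, checks_pass: bool) -> list[str]:
    """Return the ordered steps a --quick run would take (labels are hypothetical)."""
    if not has_canonical:
        # No canonical to reattach: the PR is still tested, via the full cold flow.
        return ["fallback-to-cold"]
    steps = ["reattach-canonical", "upgrade-to-pr-head", "run-checks"]
    if checks_pass:
        # PASS: keep the volume, leave the known-good untouched.
        steps.append("undeploy-keep-volume")
    else:
        # FAIL: roll the data back to the last-known-good snapshot.
        steps += ["restore-last-known-good", "undeploy"]
    return steps
```

Note that no branch contains a promotion step: under this model only a cold run can advance the known-good.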

WHERE (code). runner/run_recipe_ci.py (run_quick, dispatched from main() on CCCI_QUICK=1 / MODE=quick; _wait_undeployed; no-canonical fallback), runner/harness/canonical.py (deploy_canonical resets TYPE; undeploy_keep_volume), runner/harness/warmsnap.py (restore), bridge/bridge.py (parse_trigger + CCCI_QUICK param), .drone.yml (quick echo). 64 unit pass.

HOW + EXPECTED (cold, from your own clone on cc-ci):

  1. Units: cc-ci-run -m pytest tests/unit -q → 64 passed (incl. test_bridge_trigger: !testme→cold, !testme --quick→quick, !testmexyz→reject).
  2. WC7 trigger (live in the running bridge): cid=$(docker ps -q -f name=ccci-bridge); docker exec $cid python3 -c 'import sys;sys.path.insert(0,"/app");import bridge; print(bridge.parse_trigger("!testme --quick"), bridge.parse_trigger("!testmexyz"))' → (True, True) (False, False). trigger_build adds CCCI_QUICK=1 (auto-exposed to run_recipe_ci); a !testme --quick PR comment is labelled lower-confidence; plain !testme stays full cold.
  3. WC4 --quick flow (custom-html canonical, currently idle at 1.11.0+1.29.0):
    • PASS run: RECIPE=custom-html CCCI_QUICK=1 REF=87a62a5 cc-ci-run runner/run_recipe_ci.py (REF=87a62a5 is the 1.10.0+1.28.0 commit — a different healthy head) → exit 0; SUMMARY shows mode=quick, upgrade: pass, custom: pass, "canonical undeployed, volume retained, known-good UNCHANGED"; afterwards canonical.json version STILL 1.11.0+1.29.0 (NOT promoted), canonical idle, content volume retained, known-good marker intact.
    • FAIL run (rollback): stage a broken custom-html commit (image: nginx:99.99.99-doesnotexist), RECIPE=custom-html CCCI_QUICK=1 CCCI_SKIP_FETCH=1 REF=<broken sha> cc-ci-run runner/run_recipe_ci.py → exit 1; SUMMARY shows "rolling back … restored known-good data; canonical idle (NOT promoted)"; afterwards known-good version UNCHANGED, canonical idle, data (marker) intact. Builder ran both live: ALL PASS (canonical left clean idle@1.11.0+1.29.0).
    • no-canonical fallback: MODE=quick for a recipe with no canonical → logs "falling back to COLD" and runs the full cold flow (so the PR is still tested; default !testme unaffected).
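
For the WC7 trigger check, a parser with the same observable behaviour as the outputs quoted in step 2 might look like this. It is a sketch under assumptions: the regular expression, and the rule that the whole comment must be the trigger, are guesses; only the `(triggered, quick)` tuple shape and the three example inputs come from this document.

```python
import re

# Assumption: the comment must consist solely of "!testme" plus an optional "--quick".
_TRIGGER = re.compile(r"^!testme(?:\s+(--quick))?$")


def parse_trigger(comment: str) -> tuple[bool, bool]:
    """Return (triggered, quick) for a PR comment."""
    match = _TRIGGER.match(comment.strip())
    if match is None:
        return (False, False)           # e.g. "!testmexyz" is rejected outright
    return (True, match.group(1) is not None)
```

Anchoring the pattern at both ends is what rejects near-misses such as `!testmexyz` instead of treating them as a plain `!testme`.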

Builder will NOT advance into W3 (cold-advances-canonical / nightly) past this gate until REVIEW-2w shows PASS — but will do the tracked W0.10a (traefik) in a quiet window meanwhile.


Gate: WC2 + WC3 — Adversary PASS @2026-05-29 (REVIEW-2w 0246296, gate 4ce80f8)

Cold-verified from the Adversary's own clone (its own data-warm round-trip + restore round-trip). Builder may proceed to W2 (--quick). custom-html canonical left clean (idle, volume retained, known-good content, snapshot intact, v1.11.0+1.29.0). (claim detail retained below.)

(claimed, now PASS) Gate: WC2 + WC3 — CLAIMED detail

WHAT. The data-warm canonical model (W1): a declarative per-recipe canonical at the stable domain warm-<recipe>.ci.commoninternet.net, kept data-warm (undeployed-when-idle, data volume retained), tracked by a registry; known-good snapshots (raw per-volume tar while undeployed, one last-good per app, restore round-trips data).

WHERE (code). runner/harness/canonical.py (registry + data-warm lifecycle), runner/harness/warmsnap.py (snapshot/restore), enrollment tests/custom-html/recipe_meta.py: WARM_CANONICAL=True. State on cc-ci under /var/lib/ci-warm/<recipe>/ (canonical.json, snapshot/, retained volume).

HOW + EXPECTED (cold, from your own clone on cc-ci):

  1. Units: cc-ci-run -m pytest tests/unit -q → 61 passed (incl. test_canonical, test_warmsnap).
  2. WC2/WC3 data-warm round-trip (custom-html canonical exists idle now): reproduce with a driver that uses runner/harness/canonical.py — deploy warm-custom-html.ci.commoninternet.net @ 1.11.0+1.29.0, write a marker file into /usr/share/nginx/html/, undeploy, seed_canonical (writes /var/lib/ci-warm/custom-html/canonical.json + a snapshot/ while undeployed); confirm app UNDEPLOYED but the content volume RETAINED (docker volume ls | grep warm-custom-html); then deploy_canonical('custom-html') → the marker survives (data-warm reattach). Builder ran this live: ALL PASS (marker WC2-DATA-MARKER-7f3a9c survived; registry version=1.11.0+1.29.0; snapshot present). Current live state: cat /var/lib/ci-warm/custom-html/canonical.json → status=idle, version=1.11.0+1.29.0; docker volume ls shows warm-custom-html_ci_commoninternet_net_content retained with NO custom-html service running.
  3. WC3 restore round-trip already cold-verified in the W0.9/W0.5 keycloak proof (snapshot → mutate DB → restore → data back); same warmsnap helper.
  4. D8/WC8: /var/lib/ci-warm/ is cache, NOT in the nix closure (no module references it as a source); re-seeded by cold runs, not restored on rebuild.
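
The "one last-good per app, atomic subdir swap" idea behind the WC3 snapshots can be sketched as follows. This is not runner/harness/warmsnap.py: the file names (`volume.tar`, `meta.json`), the function names and the rename-based swap are assumptions, and a pair of renames is only an approximation of atomicity (there is a brief window between them).

```python
import json
import os
import shutil
import tarfile
import tempfile
from pathlib import Path


def take_snapshot(volume_dir: Path, recipe_root: Path, version: str) -> Path:
    """Tar volume_dir into a staging dir, then swap it in as recipe_root/snapshot."""
    recipe_root.mkdir(parents=True, exist_ok=True)
    staging = Path(tempfile.mkdtemp(prefix="snapshot.new.", dir=recipe_root))
    with tarfile.open(staging / "volume.tar", "w") as tar:
        tar.add(volume_dir, arcname=".")
    (staging / "meta.json").write_text(json.dumps({"version": version}))

    final, old = recipe_root / "snapshot", recipe_root / "snapshot.old"
    shutil.rmtree(old, ignore_errors=True)
    if final.exists():
        os.rename(final, old)        # keep the previous snapshot until the new one is in place
    os.rename(staging, final)        # the new snapshot becomes the single last-good
    shutil.rmtree(old, ignore_errors=True)
    return final


def read_meta(recipe_root: Path) -> dict:
    return json.loads((recipe_root / "snapshot" / "meta.json").read_text())


def restore_snapshot(recipe_root: Path, volume_dir: Path) -> None:
    """Replace volume_dir's contents with the last-good snapshot."""
    shutil.rmtree(volume_dir, ignore_errors=True)
    volume_dir.mkdir(parents=True)
    with tarfile.open(recipe_root / "snapshot" / "volume.tar") as tar:
        tar.extractall(volume_dir)
```

Building the new snapshot in a staging directory first means a crash mid-tar leaves the previous last-good untouched.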

Builder will NOT advance into W2 (--quick, which consumes the canonical) past this gate until REVIEW-2w shows PASS — but will do non-disruptive W0.10 follow-ups (alert relay) meanwhile.


Gate: WC1 + WC1.2 + WC1.1(keycloak) — Adversary PASS @2026-05-29 (REVIEW-2w 31ac86d, gate 985686f)

All 6 checks cold-verified from the Adversary's own clone. Builder may proceed to W1. Tracked open (must close before Phase-2w DONE, not a blocker now): traefik WC1.1 (W0.10) — stateless version-rollback not yet on the shared health-gated reconciler; Adversary will require a cold proof.

(claim detail retained below for the record)

WHAT. The live-warm keycloak layer (W0): a persistent unpinned keycloak at the stable domain warm-keycloak.ci.commoninternet.net, declaratively reconciled, that SSO-dependent runs use via a per-run namespaced realm (created + deleted) instead of co-deploying; concurrent dependents get distinct realms; orphan realms are reaped (WC1). The reconciler health-gates auto-upgrades with snapshot-backed rollback (WC1.1) behind a pre-deploy safety gate for major/manual-migration bumps (WC1.2).
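
To make the per-run realm naming and orphan reaping concrete, here is a small sketch. The patterns are assumptions inferred from the names in this document (a cold app name of the form `<recipe[:4]>-<6hex>` and a realm of the form `<recipe>-<6hex>`); the helpers are illustrative and are not the code in runner/harness/warm.py or sso.py.

```python
import re

# Assumption: a per-run app name starts with up to 4 chars of the recipe, then "-<6hex>".
_APP_NAME = re.compile(r"^[a-z0-9]{1,4}-([0-9a-f]{6})")
# Assumption: a per-run realm ends in "-<6hex>"; "master" never matches.
_RUN_REALM = re.compile(r".+-([0-9a-f]{6})")


def realm_for(recipe: str, app_name: str) -> str:
    """Derive the per-run realm name from the recipe and the run's app name."""
    match = _APP_NAME.match(app_name)
    if match is None:
        raise ValueError(f"not a per-run app name: {app_name!r}")
    return f"{recipe}-{match.group(1)}"


def realms_to_reap(realms: list[str], live_hexes: set[str]) -> list[str]:
    """Per-run realms whose run hex is no longer live; other realms are kept."""
    orphans = []
    for realm in realms:
        match = _RUN_REALM.fullmatch(realm)
        if match and match.group(1) not in live_hexes:
            orphans.append(realm)
    return orphans
```

Because the hex suffix is unique per run, two concurrent runs of the same recipe get distinct realms. A caveat of this sketch: any manually created realm that happens to end in six hex characters would also be treated as reapable.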

WHERE (code). runner/warm_reconcile.py (reconcile logic), runner/harness/warm.py (stable domain, per-run realm naming, reaping), runner/harness/sso.py (realm lifecycle), runner/harness/warmsnap.py (snapshot/restore), runner/run_recipe_ci.py (warm/cold dep split), nix/modules/warm-keycloak.nix (systemd reconcile unit). Warm state on cc-ci under /var/lib/ci-warm/.

HOW + EXPECTED (cold, from your own clone on cc-ci — tar-sync runner+tests to your /root/):

  1. Declarative + unpinned + healthy: grep -n kcVersion nix/modules/warm-keycloak.nix → no match (pin removed; the unit runs runner/warm_reconcile.py keycloak). ssh cc-ci 'systemctl is-active warm-keycloak.service' → active; systemctl is-system-running → running. Health: curl -sk --resolve warm-keycloak.ci.commoninternet.net:443:127.0.0.1 https://warm-keycloak.ci.commoninternet.net/realms/master -o /dev/null -w '%{http_code}' → 200. D8: a nixos-rebuild build closure hash is unaffected by which keycloak version is live (recipe fetched at runtime).
  2. Units: cc-ci-run -m pytest tests/unit -q → 57 passed (incl. test_warm_realm, test_warmsnap, test_warm_reconcile).
  3. WC1 headline e2e: RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py → install: pass, custom: pass, deploy-count = 1 (expect 1) (keycloak NOT co-deployed), log shows dep: using live-warm keycloak @ warm-keycloak... and dep: deleted per-run realm lasuite-docs-<hex> on warm keycloak. The 3 custom SSO tests pass (test_health_check, test_oidc_login_via_keycloak, test_oidc_password_grant_against_dep_keycloak). After the run, warm keycloak realms = ['master'] only (no leftover); no lasu* docker stack.
  4. WC1 concurrency + reaping (deploy-free): realm_for("lasuite-docs","lasu-aaa111...") = lasuite-docs-aaa111 and ...bbb222 → distinct (two concurrent same-recipe runs never collide); create realms aaa111/bbb222/ccc333 on the warm kc, each oidc_password_grant returns a JWT; sso.reap_orphaned_realms(D, live_hexes={"aaa111"}) deletes exactly bbb222+ccc333 and KEEPS aaa111. (Builder ran this live: PASS.)
  5. WC1.1 health-gated rollback (live): with CCCI_SKIP_FETCH=1 stage two annotated fake tags on ~/.abra/recipes/keycloak: 10.7.9+26.6.2 at the good commit (git tag -a -m x 10.7.9+26.6.2 10.7.1+26.6.2^{}) and 10.7.10+26.6.2 at a commit whose compose.yml has a broken KC_HOSTNAME=:::bad-host:::. Create a marker realm, set last_good, then run CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak twice → first RECONCILE RESULT: upgraded:...->10.7.9 (snapshot taken, last_good=10.7.9, marker preserved); second rolled-back:10.7.10->10.7.9 — keycloak HEALTHY on 10.7.9, marker realm INTACT (data preserved), /var/lib/ci-warm/keycloak/last_good still 10.7.9 (NOT advanced), a *-rollback.json alert under /var/lib/ci-warm/alerts/ with attempted=10.7.10 last_good=10.7.9 recovered=true. (Builder ran this live: ALL PASS; keycloak restored to canonical 10.7.1+26.6.2.)
  6. WC1.2 pre-deploy safety gate (live): stage an annotated fake tag with a MAJOR bump (11.0.0+27.0.0) → CCCI_SKIP_FETCH=1 ... warm_reconcile.py keycloak → RECONCILE RESULT: held-major:..., a *-held-major.json alert written, keycloak untouched (TYPE unchanged, 200, no snapshot/deploy churn). Stage a minor tag (10.7.2+26.6.3) with releaseNotes/10.7.2+26.6.3.md containing "manual migration" → held-manual-migration, alert carries the notes. (Builder ran both live: held + untouched.)
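
The WC1.2 decision in step 6 reduces to a comparison on version tags of the form `<recipe version>+<app version>`. The sketch below is one way to express it; the function names, the `proceed` label and the exact text match on release notes are assumptions, while `held-major` and `held-manual-migration` are the result names quoted above.

```python
def _majors(tag: str) -> tuple[int, int]:
    """Split '10.7.1+26.6.2' (or '5.1.1+v3.6.15') into (recipe major, app major)."""
    recipe_version, app_version = tag.split("+", 1)
    return (
        int(recipe_version.split(".")[0]),
        int(app_version.lstrip("v").split(".")[0]),
    )


def safety_gate(current: str, candidate: str, release_notes: str = "") -> str:
    """Decide before any deploy whether the candidate tag may be rolled out."""
    cur_recipe, cur_app = _majors(current)
    new_recipe, new_app = _majors(candidate)
    if new_recipe > cur_recipe or new_app > cur_app:
        return "held-major"                 # major recipe OR app bump: hold + alert
    if "manual migration" in release_notes.lower():
        return "held-manual-migration"      # notes travel with the alert
    return "proceed"                        # hypothetical label for "continue to WC1.1"
```

Running the gate before the snapshot/deploy path is what gives the "no churn" property: a held version never causes an undeploy.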

SCOPE (honest). WC1 and WC1.2 are complete. WC1.1 is proven for keycloak — the stateful case (snapshot-backed data-integrity rollback), which is the hard part and the Adversary's marquee proof. traefik's WC1.1 (stateless = version-rollback-only) is NOT yet migrated onto the shared health-gated reconciler — it still uses the existing proxy.nix chaos-deploy reconciler. That migration is W0.10 (tracked in BACKLOG-2w), to land before the Phase-2w DONE. If the Adversary wants WC1.1 fully closed (both reconcilers) before PASS, treat this gate as WC1 + WC1.2 + WC1.1(keycloak).

Alert delivery note (not blocking): the reconciler WRITES alert sentinels to /var/lib/ci-warm/alerts/*.json (proven above). The operator-facing relay (Builder loop scans → PushNotification → archive to alerts/seen/) is loop behavior, run each wake when an alert exists; none currently. "Alert fired" for WC1.1/WC1.2 = sentinel written, which is independently checkable.
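
The relay loop in this note (scan alerts/*.json, notify, archive to alerts/seen/) is simple enough to sketch. The function name and the `notify` callable are hypothetical; the actual PushNotification mechanism is not shown in this document, so it is injected here.

```python
import json
import shutil
from pathlib import Path
from typing import Callable


def relay_alerts(alerts_dir: Path, notify: Callable[[dict], None]) -> int:
    """Notify for each pending alert sentinel, then archive it under seen/."""
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = 0
    # glob("*.json") is not recursive, so already-archived files are not re-sent.
    for path in sorted(alerts_dir.glob("*.json")):
        notify(json.loads(path.read_text()))
        shutil.move(str(path), str(seen / path.name))
        relayed += 1
    return relayed
```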

Builder will NOT advance past this gate (to W1/WC2 canonical registry) until REVIEW-2w shows PASS.

(prior) Gate

(none before this)

Blocked

(none)

Notes

  • Disk budget (WC8 watch): cc-ci / was 91% (2.4G free) at phase start; freed orphaned Phase-2 cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
  • Stable-domain scheme (proposed, see DECISIONS): warm-<recipe>.ci.commoninternet.net, distinct from cold <recipe[:4]>-<6hex>.
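
As a quick illustration of the two naming schemes in the last note, the sketch below builds a stable warm domain and a per-run cold app name. The helper names are invented, and generating the 6-hex suffix with `secrets.token_hex` is an assumption about how per-run names could be made unique.

```python
import secrets

BASE_DOMAIN = "ci.commoninternet.net"


def warm_domain(recipe: str) -> str:
    """Stable domain for a recipe's canonical: warm-<recipe>.<base>."""
    return f"warm-{recipe}.{BASE_DOMAIN}"


def cold_app_name(recipe: str) -> str:
    """Per-run name for a cold app: <recipe[:4]>-<6hex>."""
    return f"{recipe[:4]}-{secrets.token_hex(3)}"   # token_hex(3) yields 6 hex chars
```

The `warm-` prefix cannot collide with a cold name, since a cold name has at most four characters before its hyphen.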