cc-ci/machine-docs/STATUS-2w.md

STATUS — Phase 2w (warm canonical deployments + --quick CI mode)

Phase plan (SSOT): /srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md Loop state for THIS phase: STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared). Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state. Phase 2 is PAUSED (STATUS-2/BACKLOG-2 intact) and resumes after 2w ## DONE.

Phase

Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe canonicals at stable domains, known-good snapshots, an opt-in --quick fast lane that reattaches the canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversary cold-verified.

Definition of Done (Phase 2w) — WC1–WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w

  • WC1 — Live-warm keycloak (SSO dep) at a stable domain, UNPINNED (fetch latest + chaos deploy, like traefik; keep secret-generate-only-if-missing + health-wait); dependents create+delete per-run namespaced realms; concurrent dependents don't collide; leftover realms reaped.
  • WC1.1 — Health-gated deploy-with-rollback in warm/infra reconcilers (traefik+keycloak): record last-good → deploy latest → health-check → healthy commits last-good:=latest; unhealthy rolls back + alerts. Stateful (keycloak): snapshot data volume before upgrade, restore on rollback (reuse WC3 helper). traefik = version rollback only.
  • WC1.2 — Pre-deploy safety gate: auto-apply only non-major/no-manual-migration bumps; a MAJOR bump or manual-migration release notes → stay on current + alert with notes (no silent auto-upgrade).
  • WC2 — Data-warm canonical model: per-recipe canonical at a stable domain, declarative registry tracking recipe→known-good commit; re-warmable from scratch.
  • WC3 — Known-good snapshots: raw volume copy taken while undeployed under stable path; one last-known-good per app, atomic replace; restore proven to round-trip data.
  • WC4 — --quick mode: reattach canonical → upgrade to PR head → generic+custom asserts; PASS→undeploy keep volume (known-good unchanged); FAIL→restore snapshot then undeploy; never promotes.
  • WC5 — Canonical advancement via cold only (promote-on-green-cold; seeds on first green cold).
  • WC6 — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
  • WC7 — Trigger/authority/labeling: default !testme=cold; --quick opt-in, never gates merge; results carry mode; clean no-canonical fallback.
  • WC8 — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
  • WC9 — Docs + cold verify incl. the rollback proof (deliberately fail a PR under --quick, confirm last-known-good restored intact; a --quick pass did not move the known-good).
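The WC1.1 sequence (record last-good, deploy latest, health-check, then commit or roll back) can be sketched as below. This is a minimal illustration only: the function name `reconcile` and every callback parameter are hypothetical, and the real logic in runner/warm_reconcile.py (including its undeploy step before the snapshot) may be structured differently.

```python
from typing import Callable, Optional


def reconcile(
    last_good: str,
    latest: str,
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    snapshot: Optional[Callable[[], None]] = None,
    restore: Optional[Callable[[], None]] = None,
    alert: Optional[Callable[[dict], None]] = None,
) -> tuple:
    """Return (result, new_last_good). All parameter names are hypothetical."""
    if latest == last_good:
        return ("noop", last_good)
    if snapshot is not None:
        snapshot()  # stateful apps only (keycloak); a stateless app passes None
    try:
        deploy(latest)
        ok = healthy()
    except Exception:
        ok = False  # a failed deploy is treated like a failed health check
    if ok:
        return ("upgraded", latest)  # commit: last_good := latest
    if restore is not None:
        restore()  # put the pre-upgrade data back before redeploying
    deploy(last_good)
    recovered = healthy()
    if alert is not None:
        alert({"attempted": latest, "last_good": last_good, "recovered": recovered})
    return ("rolled-back", last_good)
```

Passing `snapshot=None` and `restore=None` gives the version-rollback-only behaviour described for traefik; supplying both gives the stateful keycloak variant.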

Milestones (plan §3)

  • W0 — Warm keycloak (WC1). ← IN FLIGHT
  • W1 — Canonical registry + snapshot/restore (WC2, WC3).
  • W2 — --quick mode (WC4, WC7).
  • W3 — Cold-advances-canonical + nightly sweep (WC5, WC6).
  • W4 — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.

In flight

W0 — live-warm keycloak (WC1). Done so far (commits up to 88c1114):

  • W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit pass).

  • W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).

  • WC1 core mechanism PROVEN deploy-free on the live warm keycloak: realm create → password-grant JWT → discovery issuer → delete(idempotent) → reap(keeps live hex / deletes orphan). All PASS.

  • W0.3 declarative reconciler nix/modules/warm-keycloak.nix up; nixos-rebuild switch → warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned + skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)

  • W0.5 WC3 snapshot/restore helper (runner/harness/warmsnap.py) DONE (4cc1e15). +5 unit tests (48 unit pass). LIVE round-trip PROVEN on warm keycloak: marker realm → undeploy → snapshot (mariadb+providers) → deploy → delete marker (mutate DB) → undeploy → restore → deploy → marker realm BACK; keycloak healthy. Snapshots under /var/lib/ci-warm/<recipe>/, atomic, one last-good.

  • W0.6 reconciler rewrite DONE (a044abb). runner/warm_reconcile.py (python, packaged into the nix store, replaces the bash reconcile): UNPIN keycloak (deploy latest version TAG; recipe fetched at runtime → D8 closure byte-identical); WC1.2 pre-deploy safety gate (major recipe/app bump OR releaseNotes manual-migration → hold + alert, no churn); WC1.1 health-gated upgrade-with-rollback scaffold (record last-good → keycloak undeploy→snapshot→deploy latest → health-gate → commit-or-restore+redeploy-prior+alert). Alerts = /var/lib/ci-warm/alerts/*.json. +8 unit tests (56 unit pass). PROVEN live: nixos-rebuild switch → warm-keycloak.service runs the python reconciler → noop-healthy (system 0-failed, 200); WC1.2 holds proven (MAJOR → held-major, keycloak untouched; minor+manual-migration notes → held-manual-migration, alert carries notes).

  • W0.9 WC1.1 live proofs DONE (32f0071). PROVEN on warm keycloak (annotated fake tags + CCCI_SKIP_FETCH): (a) healthy upgrade 10.7.1→10.7.9 — snapshot+deploy+health-pass, last_good committed, marker preserved; (b) marquee rollback — broken latest 10.7.10 → deploy fails → rollback to 10.7.9, HEALTHY, marker realm INTACT (data preserved), last_good NOT advanced, rollback alert written (attempted=10.7.10,last_good=10.7.9,recovered=True); recovered to canonical 10.7.1+26.6.2. Fixed 4 issues live (deploy-fail→rollback, warmsnap last_good subdir, wait_undeployed swarm-settle, abra-stdout capture). 57 unit pass. Reconciler-side WC1/WC1.1/WC1.2 proven.

    Adversary reproduce (W0.9): on cc-ci, with the keycloak recipe clone, create annotated fake tags (peel ^{}, set git identity) 10.7.9+26.6.2(=good commit) and 10.7.10+26.6.2(broken KC_HOSTNAME), then CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak twice; observe upgraded: then rolled-back:, marker realm survives, /var/lib/ci-warm/keycloak/last_good unchanged at the prior version, a *rollback*.json alert under /var/lib/ci-warm/alerts/.
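The per-run realm naming and orphan reaping from W0.1/W0.2 can be illustrated with the sketch below. The names `realm_for` and `realms_to_reap` appear in this document, but the signatures, the regular expressions, and the `example.test` domain used here are assumptions; the code in runner/harness/warm.py and runner/harness/sso.py may differ.

```python
import re

# Assumption: per-run realms end in "-<6 hex chars>"; anything else (e.g. "master") is kept.
_HEX_SUFFIX = re.compile(r"-([0-9a-f]{6})$")
# Assumption: a cold app domain starts with "<recipe[:4]>-<6hex>".
_APP_PREFIX = re.compile(r"[a-z0-9]+-([0-9a-f]{6})")


def realm_for(recipe: str, app_domain: str) -> str:
    """Derive the per-run realm name from the recipe and the run's app domain."""
    match = _APP_PREFIX.match(app_domain)
    if match is None:
        raise ValueError(f"no 6-hex run id in {app_domain!r}")
    return f"{recipe}-{match.group(1)}"


def realms_to_reap(realms: list, live_hexes: set) -> list:
    """Return per-run realms whose hex id does not belong to a live run."""
    orphans = []
    for realm in realms:
        match = _HEX_SUFFIX.search(realm)
        if match and match.group(1) not in live_hexes:
            orphans.append(realm)
    return orphans
```

Because the realm name embeds the run's hex id, two concurrent runs of the same recipe get distinct realms, and a realm without a hex suffix is never selected for reaping.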

Next (remaining for WC1 gate):

  1. W0.7 — fix the lasuite-docs in-place chaos-redeploy nginx-upstream race (host not found in upstream ...backend:8000) OR pick a more-robust SSO dependent for the headline proof.
  2. W0.8 — headline WC1 e2e: dependent SSO custom test green vs warm keycloak + concurrent distinct realms (no collision) + reaping. → claim WC1/WC1.1/WC1.2.
  3. Builder-loop alert relay (deferred wiring) — on each wake, scan /var/lib/ci-warm/alerts/*.json, PushNotification + record + archive to alerts/seen/; wire when nightly WC6 lands (first real alert).
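The deferred alert relay (scan the sentinel directory, notify, archive to alerts/seen/) might look roughly like this. It is a sketch under assumptions: `relay_alerts` and the `notify` callback are hypothetical names, and the actual PushNotification wiring is not shown in this document.

```python
import json
from pathlib import Path


def relay_alerts(alerts_dir: Path, notify) -> list:
    """Relay each pending *.json alert once, then archive it under seen/."""
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = []
    # glob("*.json") is non-recursive, so files already in seen/ are not rescanned.
    for path in sorted(alerts_dir.glob("*.json")):
        payload = json.loads(path.read_text())
        notify(payload)  # hypothetical hook standing in for the operator notification
        path.replace(seen / path.name)  # archive so the next wake skips it
        relayed.append(payload)
    return relayed
```

On cc-ci the directory would be `Path("/var/lib/ci-warm/alerts")`; a second call with no new sentinels returns an empty list.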

Build finding (RESOLVED): the W0.4 lasuite-docs setup_custom_tests redeploy failure (nginx web host not found in upstream ...backend:8000) was transient resource contention from the since-killed stale Phase-2 run (disk was also tight). On the clean system it converges fine — the headline e2e is green (below). No recipe/harness change needed.

Gate

Gate: WC1 + WC1.1 + WC1.2 — CLAIMED, awaiting Adversary (@2026-05-29, HEAD = see git log -1)

WHAT. The live-warm keycloak layer (W0): a persistent unpinned keycloak at the stable domain warm-keycloak.ci.commoninternet.net, declaratively reconciled, that SSO-dependent runs use via a per-run namespaced realm (created + deleted) instead of co-deploying; concurrent dependents get distinct realms; orphan realms are reaped (WC1). The reconciler health-gates auto-upgrades with snapshot-backed rollback (WC1.1) behind a pre-deploy safety gate for major/manual-migration bumps (WC1.2).

WHERE (code). runner/warm_reconcile.py (reconcile logic), runner/harness/warm.py (stable domain, per-run realm naming, reaping), runner/harness/sso.py (realm lifecycle), runner/harness/warmsnap.py (snapshot/restore), runner/run_recipe_ci.py (warm/cold dep split), nix/modules/warm-keycloak.nix (systemd reconcile unit). Warm state on cc-ci under /var/lib/ci-warm/.

HOW + EXPECTED (cold, from your own clone on cc-ci — tar-sync runner+tests to your /root/):

  1. Declarative + unpinned + healthy: grep -n kcVersion nix/modules/warm-keycloak.nix → no match (pin removed; the unit runs runner/warm_reconcile.py keycloak). ssh cc-ci 'systemctl is-active warm-keycloak.service' → active; systemctl is-system-running → running. Health: curl -sk --resolve warm-keycloak.ci.commoninternet.net:443:127.0.0.1 https://warm-keycloak.ci.commoninternet.net/realms/master -o /dev/null -w '%{http_code}' → 200. D8: a nixos-rebuild build closure hash is unaffected by which keycloak version is live (recipe fetched at runtime).
  2. Units: cc-ci-run -m pytest tests/unit -q → 57 passed (incl. test_warm_realm, test_warmsnap, test_warm_reconcile).
  3. WC1 headline e2e: RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py → install: pass, custom: pass, deploy-count = 1 (expect 1) (keycloak NOT co-deployed), log shows dep: using live-warm keycloak @ warm-keycloak... and dep: deleted per-run realm lasuite-docs-<hex> on warm keycloak. The 3 custom SSO tests pass (test_health_check, test_oidc_login_via_keycloak, test_oidc_password_grant_against_dep_keycloak). After the run, warm keycloak realms = ['master'] only (no leftover); no lasu* docker stack.
  4. WC1 concurrency + reaping (deploy-free): realm_for("lasuite-docs","lasu-aaa111...") = lasuite-docs-aaa111 and ...bbb222 → distinct (two concurrent same-recipe runs never collide); create realms aaa111/bbb222/ccc333 on the warm kc, each oidc_password_grant returns a JWT; sso.reap_orphaned_realms(D, live_hexes={"aaa111"}) deletes exactly bbb222+ccc333 and KEEPS aaa111. (Builder ran this live: PASS.)
  5. WC1.1 health-gated rollback (live): with CCCI_SKIP_FETCH=1 stage two annotated fake tags on ~/.abra/recipes/keycloak: 10.7.9+26.6.2 at the good commit (git tag -a -m x 10.7.9+26.6.2 10.7.1+26.6.2^{}) and 10.7.10+26.6.2 at a commit whose compose.yml has a broken KC_HOSTNAME=:::bad-host:::. Create a marker realm, set last_good, then run CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak twice → first RECONCILE RESULT: upgraded:...->10.7.9 (snapshot taken, last_good=10.7.9, marker preserved); second rolled-back:10.7.10->10.7.9, with keycloak HEALTHY on 10.7.9, marker realm INTACT (data preserved), /var/lib/ci-warm/keycloak/last_good still 10.7.9 (NOT advanced), a *-rollback.json alert under /var/lib/ci-warm/alerts/ with attempted=10.7.10 last_good=10.7.9 recovered=true. (Builder ran this live: ALL PASS; keycloak restored to canonical 10.7.1+26.6.2.)
  6. WC1.2 pre-deploy safety gate (live): stage an annotated fake tag with a MAJOR bump (11.0.0+27.0.0) → CCCI_SKIP_FETCH=1 ... warm_reconcile.py keycloak → RECONCILE RESULT: held-major:..., a *-held-major.json alert written, keycloak untouched (TYPE unchanged, 200, no snapshot/deploy churn). Stage a minor tag (10.7.2+26.6.3) with releaseNotes/10.7.2+26.6.3.md containing "manual migration" → held-manual-migration, alert carries the notes. (Builder ran both live: held + untouched.)
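The WC1.2 decision in step 6 can be illustrated with a small classifier. This is a sketch, not the reconciler's code: `majors` and `safety_gate` are hypothetical names, the `"apply"` result string is invented for illustration, and it assumes tags always have the `<recipe version>+<app version>` shape with numeric major components.

```python
def majors(tag: str) -> tuple:
    """Return (recipe_major, app_major) for a tag such as '10.7.1+26.6.2'."""
    recipe, _, app = tag.partition("+")
    return (int(recipe.split(".")[0]), int(app.split(".")[0]))


def safety_gate(current: str, candidate: str, release_notes: str = "") -> str:
    """Classify a candidate bump before any snapshot or deploy happens."""
    cur, new = majors(current), majors(candidate)
    if new[0] > cur[0] or new[1] > cur[1]:
        return "held-major"  # major recipe or app bump: stay on current + alert
    if "manual migration" in release_notes.lower():
        return "held-manual-migration"  # the alert should carry the notes
    return "apply"  # hypothetical label for "safe to auto-apply"
```

Running the gate before the snapshot step is what keeps a held bump free of snapshot/deploy churn.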

SCOPE (honest). WC1 and WC1.2 are complete. WC1.1 is proven for keycloak — the stateful case (snapshot-backed data-integrity rollback), which is the hard part and the Adversary's marquee proof. traefik's WC1.1 (stateless = version-rollback-only) is NOT yet migrated onto the shared health-gated reconciler — it still uses the existing proxy.nix chaos-deploy reconciler. That migration is W0.10 (tracked in BACKLOG-2w), to land before the Phase-2w DONE. If the Adversary wants WC1.1 fully closed (both reconcilers) before PASS, treat this gate as WC1 + WC1.2 + WC1.1(keycloak).

Alert delivery note (not blocking): the reconciler WRITES alert sentinels to /var/lib/ci-warm/alerts/*.json (proven above). The operator-facing relay (Builder loop scans → PushNotification → archive to alerts/seen/) is loop behavior, run each wake when an alert exists; none currently. "Alert fired" for WC1.1/WC1.2 = sentinel written, which is independently checkable.

Builder will NOT advance past this gate (to W1/WC2 canonical registry) until REVIEW-2w shows PASS.

(prior) Gate

(none before this)

Blocked

(none)

Notes

  • Disk budget (WC8 watch): cc-ci / was 91% (2.4G free) at phase start; freed orphaned Phase-2 cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
  • Stable-domain scheme (proposed, see DECISIONS): warm-<recipe>.ci.commoninternet.net, distinct from cold <recipe[:4]>-<6hex>.
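For illustration, the proposed naming scheme could be expressed as two helpers. Both function names are hypothetical, and only the label part of the cold name is produced because the full cold domain is not spelled out in this document.

```python
import secrets

BASE_DOMAIN = "ci.commoninternet.net"  # taken from the warm keycloak domain above


def warm_domain(recipe: str) -> str:
    """Stable domain for a warm deployment: warm-<recipe>.<base>."""
    return f"warm-{recipe}.{BASE_DOMAIN}"


def cold_label(recipe: str, run_hex: str = "") -> str:
    """Per-run cold label: <recipe[:4]>-<6hex>; a random hex is drawn if none is given."""
    run_hex = run_hex or secrets.token_hex(3)  # 3 random bytes -> 6 hex characters
    return f"{recipe[:4]}-{run_hex}"
```

The `warm-` prefix can never be produced by `cold_label` for recipes whose first four characters are not `warm`, which keeps the two namespaces distinct.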