STATUS — Phase 2w (warm canonical deployments + --quick CI mode)

Phase plan (SSOT): /srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md. Loop state for THIS phase: STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared). Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state. Phase 2 is PAUSED (STATUS-2/BACKLOG-2 intact) and resumes after 2w ## DONE.

Phase

Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe canonicals at stable domains, known-good snapshots, an opt-in --quick fast lane that reattaches the canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversary cold-verified.

Definition of Done (Phase 2w) — WC1–WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w

  • WC1 — Live-warm keycloak (SSO dep) at a stable domain, UNPINNED (fetch latest + chaos deploy, like traefik; keep secret-generate-only-if-missing + health-wait); dependents create+delete per-run namespaced realms; concurrent dependents don't collide; leftover realms reaped.
  • WC1.1 — Health-gated deploy-with-rollback in warm/infra reconcilers (traefik+keycloak): record last-good → deploy latest → health-check → healthy commits last-good:=latest; unhealthy rolls back + alerts. Stateful (keycloak): snapshot data volume before upgrade, restore on rollback (reuse WC3 helper). traefik = version rollback only.
  • WC1.2 — Pre-deploy safety gate: auto-apply only non-major/no-manual-migration bumps; a MAJOR bump or manual-migration release notes → stay on current + alert with notes (no silent auto-upgrade).
  • WC2 — Data-warm canonical model: per-recipe canonical at a stable domain, declarative registry tracking recipe→known-good commit; re-warmable from scratch.
  • WC3 — Known-good snapshots: raw volume copy taken while undeployed under stable path; one last-known-good per app, atomic replace; restore proven to round-trip data.
  • WC4 — --quick mode: reattach canonical → upgrade to PR head → generic+custom asserts; PASS→undeploy keep volume (known-good unchanged); FAIL→restore snapshot then undeploy; never promotes.
  • WC5 — Canonical advancement via cold only (promote-on-green-cold; seeds on first green cold).
  • WC6 — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
  • WC7 — Trigger/authority/labeling: default !testme=cold; --quick opt-in, never gates merge; results carry mode; clean no-canonical fallback.
  • WC8 — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
  • WC9 — Docs + cold verify incl. the rollback proof (deliberately fail a PR under --quick, confirm last-known-good restored intact; a --quick pass did not move the known-good).
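The WC4 --quick flow listed above can be sketched as a small state machine. This is a minimal illustration only, not the harness code: `run_quick`, `QuickResult`, and the four callables are hypothetical names, and the restore-then-undeploy ordering simply follows the WC4 wording. There is deliberately no promote step, since --quick never moves the known-good.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class QuickResult:
    passed: bool
    mode: str = "quick"  # results carry their mode (WC7)
    steps: List[str] = field(default_factory=list)


def run_quick(
    reattach: Callable[[], None],
    upgrade_and_assert: Callable[[], bool],
    restore_snapshot: Callable[[], None],
    undeploy: Callable[[], None],
) -> QuickResult:
    """Hypothetical --quick lane: reattach, upgrade, assert, then clean up."""
    steps: List[str] = []
    reattach()
    steps.append("reattach")
    try:
        # Upgrade to PR head and run generic + custom asserts.
        passed = bool(upgrade_and_assert())
    except Exception:
        # Any error during upgrade or asserts counts as a failed run.
        passed = False
    steps.append("upgrade+assert")
    if not passed:
        # FAIL: put the last-known-good data back before leaving.
        restore_snapshot()
        steps.append("restore")
    # PASS and FAIL both undeploy; on PASS the volume is kept as-is.
    undeploy()
    steps.append("undeploy")
    return QuickResult(passed=passed, steps=steps)
```

Passing the side-effecting steps in as callables keeps the control flow testable without a live deployment, which is the only design point this sketch is meant to show.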

Milestones (plan §3)

  • W0 — Warm keycloak (WC1). ← IN FLIGHT
  • W1 — Canonical registry + snapshot/restore (WC2, WC3).
  • W2--quick mode (WC4, WC7).
  • W3 — Cold-advances-canonical + nightly sweep (WC5, WC6).
  • W4 — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.

In flight

W0 — live-warm keycloak (WC1). Done so far (commits up to 88c1114):

  • W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit pass).

  • W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).

  • WC1 core mechanism PROVEN deploy-free on the live warm keycloak: realm create → password-grant JWT → discovery issuer → delete(idempotent) → reap(keeps live hex / deletes orphan). All PASS.

  • W0.3 declarative reconciler nix/modules/warm-keycloak.nix up; nixos-rebuild switch → warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned + skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)

  • W0.5 WC3 snapshot/restore helper (runner/harness/warmsnap.py) DONE (4cc1e15). +5 unit tests (48 unit pass). LIVE round-trip PROVEN on warm keycloak: marker realm → undeploy → snapshot (mariadb+providers) → deploy → delete marker (mutate DB) → undeploy → restore → deploy → marker realm BACK; keycloak healthy. Snapshots under /var/lib/ci-warm/<recipe>/, atomic, one last-good.
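The "atomic, one last-good" snapshot property from W0.5 can be illustrated with a symlink swap. This is a sketch under assumptions and not the contents of `warmsnap.py`: the function names, the `snap-*` directory prefix, and the `last-good` link name are hypothetical, and it copies a plain directory rather than real Docker volumes. It assumes the app is undeployed while either function runs.

```python
import os
import shutil
import tempfile
from pathlib import Path


def snapshot(volume_dir: Path, snap_root: Path) -> Path:
    """Copy volume_dir into snap_root and atomically repoint 'last-good'."""
    snap_root.mkdir(parents=True, exist_ok=True)
    new_dir = Path(tempfile.mkdtemp(prefix="snap-", dir=snap_root))
    shutil.copytree(volume_dir, new_dir / "data", symlinks=True)

    link = snap_root / "last-good"
    previous = link.resolve() if link.is_symlink() else None

    # Build the new symlink beside the old one, then rename it into place.
    # Renaming over an existing symlink is atomic on POSIX filesystems, so
    # readers see either the old snapshot or the new one, never a partial copy.
    tmp_link = snap_root / ".last-good.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(new_dir.name, tmp_link)
    os.replace(tmp_link, link)

    # Keep exactly one last-good: drop the snapshot that was just superseded.
    if previous is not None and previous != new_dir.resolve():
        shutil.rmtree(previous, ignore_errors=True)
    return link


def restore(snap_root: Path, volume_dir: Path) -> None:
    """Replace volume_dir with the contents of the last-good snapshot."""
    src = snap_root / "last-good" / "data"
    if not src.is_dir():
        raise FileNotFoundError(f"no last-good snapshot under {snap_root}")
    if volume_dir.exists():
        shutil.rmtree(volume_dir)
    shutil.copytree(src, volume_dir, symlinks=True)
```

A round trip mirrors the live proof above in miniature: write a marker, snapshot, mutate the marker, restore, and check that the marker is back.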

Next (W0.6 reconciler rewrite — split):

  1. W0.6a — Python reconcile entrypoint runner/warm_reconcile.py, packaged into the nix store (systemd unit invokes the store copy of runner/ — D8-clean, reuses warmsnap/sso/abra; replaces the bash reconciler). UNPIN keycloak (fetch latest + chaos deploy; drop kcVersion); keep secret-guard + health-wait.
  2. W0.6b — WC1.2 pre-deploy safety gate: major recipe-semver bump OR releaseNotes manual-migration marker → hold-on-current + alert-with-notes (no deploy churn).
  3. W0.6c — WC1.1 health-gated rollback: record last-good → (keycloak: undeploy→snapshot→deploy latest) → health-gate → commit-or-(restore+redeploy-prior+alert). Same for traefik (version rollback only). Alert = sentinel file in /var/lib/ci-warm/alerts/ relayed by the Builder loop.
  4. W0.7 — resolve the lasuite-docs in-place-redeploy race (finding below) OR pick a more-robust dependent; then W0.8 headline WC1 e2e (dependent SSO green vs warm keycloak) + concurrency.
  5. W0.9 — WC1.1/WC1.2 Adversary-facing proofs (broken latest → self-revert + data intact + alert; healthy → commit last-good; major/manual-migration → hold + alert).
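The W0.6b safety gate and the W0.6c health-gated rollback planned above could fit together roughly as follows. This is a sketch under stated assumptions, not the planned `warm_reconcile.py`: `major`, `should_hold`, `reconcile`, and the `MIGRATION_MARKER` text are hypothetical, versions are assumed to start with a numeric major component, and the `snapshot`/`restore` callables are assumed to wrap their own undeploy step. Stateless services such as traefik would pass neither callable.

```python
from typing import Callable, Optional

MIGRATION_MARKER = "manual migration"  # hypothetical release-notes marker


def major(version: str) -> int:
    """Leading numeric component, e.g. '1.4.2' -> 1 (assumed version shape)."""
    return int(version.lstrip("v").split(".")[0])


def should_hold(current: str, latest: str, release_notes: str) -> Optional[str]:
    """Return a reason to stay on `current`, or None if auto-apply is allowed."""
    if major(latest) > major(current):
        return f"major bump {current} -> {latest}"
    if MIGRATION_MARKER in release_notes.lower():
        return "release notes mention a manual migration"
    return None


def reconcile(
    current: str,
    latest: str,
    release_notes: str,
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    alert: Callable[[str], None],
    *,
    snapshot: Optional[Callable[[], None]] = None,
    restore: Optional[Callable[[], None]] = None,
) -> str:
    """Run one reconcile pass and return the version that is now last-good."""
    reason = should_hold(current, latest, release_notes)
    if reason is not None:
        # Gate: no deploy churn, only an alert that carries the notes.
        alert(f"holding {current}: {reason}\n{release_notes}")
        return current
    if latest == current:
        return current
    if snapshot is not None:
        snapshot()  # stateful services: capture data before upgrading
    deploy(latest)
    if healthy():
        return latest  # commit: last-good := latest
    if restore is not None:
        restore()  # stateful services: put the pre-upgrade data back
    deploy(current)
    alert(f"rolled back {latest} -> {current}: health check failed")
    return current
```

In this shape the alert callable could write a sentinel file under /var/lib/ci-warm/alerts/ as described in W0.6c, though the file naming is left open here.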

Build finding (mine, to fix): lasuite-docs setup_custom_tests in-place abra app deploy --force --chaos (OIDC wiring) fails: nginx web fatally exits [emerg] host not found in upstream ...backend:8000 during the rolling restart → abra converge times out. Independent of warm/cold keycloak. Blocks the WC1 dependent-green proof until fixed/worked-around.

Gate

(none claimed yet)

Blocked

(none)

Notes

  • Disk budget (WC8 watch): cc-ci / was 91% (2.4G free) at phase start; freed orphaned Phase-2 cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
  • Stable-domain scheme (proposed, see DECISIONS): warm-<recipe>.ci.commoninternet.net, distinct from cold <recipe[:4]>-<6hex>.
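To make the two naming schemes in the note above concrete, here is a small sketch. The helper names `warm_domain` and `cold_label` are hypothetical, and only the warm form is given a full domain because the note states a base domain for the warm scheme only. Generating the suffix with `secrets.token_hex` is an assumption about how the 6 hex characters might be produced.

```python
import secrets
from typing import Optional

WARM_BASE = "ci.commoninternet.net"  # base domain taken from the note above


def warm_domain(recipe: str) -> str:
    """Stable per-recipe domain for the warm canonical."""
    return f"warm-{recipe}.{WARM_BASE}"


def cold_label(recipe: str, run_hex: Optional[str] = None) -> str:
    """Per-run cold label: first four characters of the recipe plus 6 hex."""
    if run_hex is None:
        run_hex = secrets.token_hex(3)  # 3 random bytes -> 6 hex characters
    return f"{recipe[:4]}-{run_hex}"


print(warm_domain("keycloak"))           # warm-keycloak.ci.commoninternet.net
print(cold_label("keycloak", "07d81e"))  # keyc-07d81e
```

The warm name depends only on the recipe, so it is the same on every run, while the cold label changes with each run's hex suffix.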