diff --git a/machine-docs/BACKLOG-2w.md b/machine-docs/BACKLOG-2w.md
index 3d7895a..748bf2a 100644
--- a/machine-docs/BACKLOG-2w.md
+++ b/machine-docs/BACKLOG-2w.md
@@ -5,23 +5,34 @@ Single-writer rule (plan §6.1): Builder edits `## Build backlog` only; Adversar
 
 ## Build backlog
 
-### W0 — Live-warm keycloak (WC1)
-- [ ] W0.1 — sso.py: realm lifecycle primitives (`delete_keycloak_realm`, `list_realms`,
-  `reap_stale_realms`) + unit tests.
-- [ ] W0.2 — Orchestrator/deps: live-warm keycloak dep mode — stable warm domain + per-run
-  namespaced realm; delete realm on teardown (don't undeploy); cold-codeploy fallback if no warm
-  keycloak. Per-run realm name unique per (parent, pr, ref) for concurrency isolation.
-- [ ] W0.3 — Declarative Nix reconciler `nix/modules/warm-keycloak.nix` (systemd oneshot converges
-  warm keycloak deployed+healthy at stable domain); wired into the host config.
-- [ ] W0.4 — e2e proof: a dependent recipe (lasuite-docs) SSO custom test passes against warm
-  keycloak; concurrent dependents use distinct realms (no collision); leftover realms reaped.
-  → claim WC1 gate.
+### W0 — Live-warm keycloak (WC1, WC1.1, WC1.2)
+- [x] W0.1 — sso.py realm lifecycle (`list_realms`/`delete_keycloak_realm`/`realms_to_reap`/
+  `reap_orphaned_realms`) + 8 unit tests. DONE (74bf8c1).
+- [x] W0.2 — Orchestrator live-warm dep mode (warm.py + run_recipe_ci warm/cold split, per-run
+  namespaced realm, realm-delete teardown, cold fallback, deploy-count). DONE (1b8d26b).
+  Core mechanism proven deploy-free on the live warm keycloak.
+- [x] W0.3a — Declarative reconciler `nix/modules/warm-keycloak.nix` up + verified via rebuild.
+  DONE (88c1114) but INTERIM (pinned + skip-if-healthy) — superseded by W0.6 below.
+- [ ] **W0.5 — WC3 snapshot/restore helper FIRST** (prereq for WC1.1): `runner/harness/warmsnap.py`
+  — raw copy of an app's data volume(s) while undeployed, under `/var/lib/ci-warm//`,
+  atomic replace, one last-good, restore round-trips data. + unit tests + live round-trip proof.
+- [ ] **W0.6 — Rewrite reconciler: unpin + WC1.2 safety gate + WC1.1 health-gated rollback.**
+  UNPIN keycloak (fetch latest + chaos; drop kcVersion); keep secret-guard + health-wait. WC1.2
+  gate: hold-on-current + alert on major/manual-migration bump (no deploy churn). WC1.1: record
+  last-good → keycloak undeploy→snapshot→deploy latest → health-gate → commit-or-(restore+
+  redeploy-prior+alert). Apply the same health-gated+safety-gate pattern to traefik (version
+  rollback only, stateless). Settle the alert mechanism (see DECISIONS).
+- [ ] **W0.7 — Fix lasuite-docs in-place-redeploy race** (nginx web `host not found in upstream
+  backend` during chaos redeploy) OR pick a more-robust SSO dependent for the headline proof.
+- [ ] W0.8 — Headline WC1 e2e: dependent SSO custom test green vs warm keycloak; concurrent
+  dependents distinct realms (no collision); leftover realms reaped. → claim WC1.
+- [ ] W0.9 — WC1.1/WC1.2 Adversary proofs: simulate broken latest → self-revert + data intact +
+  alert; healthy update commits last-good; major/manual-migration → hold + alert-with-notes.
+  → claim WC1.1/WC1.2.
 
-### W1 — Canonical registry + snapshot/restore (WC2, WC3)
+### W1 — Canonical registry (WC2)
 - [ ] W1.1 — Canonical registry/reconciler (declarative; tracks recipe→known-good commit; stable
-  domain `warm-`).
-- [ ] W1.2 — Snapshot/restore: raw volume copy while undeployed under `/var/lib/ci-warm//`;
-  one last-known-good, atomic replace; prove restore round-trips data.
+  domain `warm-`). (Snapshot/restore done in W0.5; WC3 closes with W1's canonicals.)
 
 ### W2 — `--quick` mode (WC4, WC7)
 - [ ] W2.1 — `run_recipe_ci.py --quick` path (reattach → upgrade-to-PR-head → assert → PASS undeploy /
diff --git a/machine-docs/DECISIONS.md b/machine-docs/DECISIONS.md
index 3ec1318..1d57d61 100644
--- a/machine-docs/DECISIONS.md
+++ b/machine-docs/DECISIONS.md
@@ -585,3 +585,33 @@ from D8. The keycloak's realm data is ephemeral per-run, so nothing persistent t
 from-scratch host before the reconciler has run, or the warm app is down), the keycloak dep path
 falls back to the existing cold co-deploy so dependent runs still work. The warm path is preferred
 when available.
+
+## Phase 2w — design update: unpinned warm/infra + health-gated rollback (2026-05-28/29)
+
+**Warm/infra apps (traefik + keycloak) auto-update to LATEST nightly, health-gated (operator).**
+Supersedes the W0.3 pinned `kcVersion`. Keycloak is now unpinned like traefik: reconciler `abra
+recipe fetch` latest + chaos deploy; keep secret-generate-only-if-missing + health-wait. D8 holds
+because the recipe is fetched at *activation* (runtime), so the nix store closure is byte-identical
+regardless of which keycloak version is live.
+
+**Snapshot helper (WC3) — format + path.** `runner/harness/warmsnap.py`. A snapshot is a **raw tar
+of each docker volume belonging to the app's stack**, taken **while the app is undeployed** (nothing
+writing → consistent). Stored under `/var/lib/ci-warm//` as `.snapshot.tar` + a
+`.meta.json` (commit/version/timestamp/volume list). **One last-good per app**, replaced
+**atomically** (write to `.tmp` then `rename`). Restore: for each volume, clear `_data` and untar
+back. Docker volumes are stack-scoped (`_`); the helper enumerates them via
+`docker volume ls` filtered to the stack. Reused by WC1.1 (pre-upgrade snapshot of keycloak) and WC5
+(promote-on-green-cold). Warm snapshots are **cache, excluded from the D8 closure** (WC8).
+
+**Alert mechanism — sentinel files relayed by the Builder loop.** The warm/infra reconciler is an
+autonomous bash systemd unit on cc-ci; it cannot call the agent's `PushNotification` tool. So a
+reconciler that rolls back (WC1.1) or holds a major/manual-migration upgrade (WC1.2) writes a JSON
+**alert sentinel** to `/var/lib/ci-warm/alerts/--.json` (fields: app, reason
+[rollback|held-major|held-manual-migration], from_version, to_version, release_notes, ts). The
+Builder loop, each wake, scans that dir; for each new alert it (a) issues `PushNotification` to the
+operator, (b) records it in STATUS-2w/JOURNAL-2w, (c) archives it to `alerts/seen/`. This bridges the
+autonomous reconciler to operator visibility (latency = next Builder wake; acceptable for an alert).
+
+**Re-sequence:** WC1.1's keycloak rollback needs the WC3 snapshot helper, so build that FIRST, then
+rewrite the reconciler ONCE into the unpinned + WC1.2-safety-gated + WC1.1-health-gated-rollback form
+(avoids reworking the reconciler twice). The W0.3 reconciler is INTERIM until then.
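
Reviewer note: the alert-sentinel relay described in the DECISIONS hunk could look roughly like the sketch below. It is only an illustration under stated assumptions, not code from this PR: the function name `relay_alerts`, the `notify` callback (standing in for both the `PushNotification` call and the STATUS-2w/JOURNAL-2w entry), and the choice to skip unparseable or incomplete files are all hypothetical. It relies only on what the text states: JSON sentinels sit directly in the alerts directory, carry the listed fields, and are moved to `seen/` once handled.

```python
import json
import shutil
from pathlib import Path
from typing import Callable

# Fields listed in the DECISIONS text for an alert sentinel.
REQUIRED_FIELDS = {"app", "reason", "from_version", "to_version", "release_notes", "ts"}


def relay_alerts(alerts_dir: Path, notify: Callable[[dict], None]) -> list[dict]:
    """Relay each pending alert sentinel once, then archive it under seen/.

    `notify` is a hypothetical hook: the real Builder loop would issue the
    operator notification and record the alert there.
    """
    seen_dir = alerts_dir / "seen"
    seen_dir.mkdir(parents=True, exist_ok=True)
    relayed = []
    # Only top-level *.json files are pending; files in seen/ are already handled.
    for sentinel in sorted(alerts_dir.glob("*.json")):
        try:
            alert = json.loads(sentinel.read_text())
        except json.JSONDecodeError:
            # Assumption: a partially written file is left for the next wake.
            continue
        if not isinstance(alert, dict) or not REQUIRED_FIELDS.issubset(alert):
            continue
        notify(alert)
        # Archive only after notifying, so a crash re-sends instead of dropping.
        shutil.move(str(sentinel), str(seen_dir / sentinel.name))
        relayed.append(alert)
    return relayed
```

Archiving after `notify` gives at-least-once delivery: a crash between the two steps produces a duplicate notification on the next wake rather than a lost alert, which seems the safer failure mode for a rollback or held-upgrade signal.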