diff --git a/machine-docs/JOURNAL-2w.md b/machine-docs/JOURNAL-2w.md
index 03e07c5..d64663c 100644
--- a/machine-docs/JOURNAL-2w.md
+++ b/machine-docs/JOURNAL-2w.md
@@ -52,3 +52,68 @@ Building W0 in increments (each verified): (1) sso realm lifecycle prims + units
 keycloak manually at the stable domain and prove realm create→delete via admin API; (3) wire the
 orchestrator live-warm mode; (4) declarative Nix reconciler; (5) e2e + concurrency + reaping proof.
+
+## 2026-05-29 — W0 core mechanism PROVEN; declarative reconciler up; design update absorbed
+
+**Stale Phase-2 run killed.** Found an orphaned `run_recipe_ci.py` (RECIPE=lasuite-drive, the Q3.2
+`ccci-q32-drive-sso2.log` run) still alive from before the phase switch (PPID 1, nohup). It had
+deployed lasu-0a6fb2 + tried a cold keyc-07d81e dep — both of which I'd already torn down, so it was
+failing. Killed its process tree + janitored. Only infra + warm-keycloak remain.
+
+**W0.1 realm lifecycle (sso.py)** — list_realms / delete_keycloak_realm (idempotent, refuses master)
+/ realms_to_reap (pure predicate) / reap_orphaned_realms. +8 unit tests. The per-run realm is the
+isolation unit on a shared keycloak; orphans reaped by hex-not-in-live-stacks (concurrency-safe).
+
+**W0.2 orchestrator live-warm mode** — warm.py (stable-domain scheme, is_warm_up probe,
+live_app_hexes, realm_for=-<6hex>, reap_orphan_realms). run_recipe_ci splits declared deps
+into warm (shared provider + per-run realm, no deploy, realm deleted at teardown) vs cold
+(co-deploy), warm only if provider up else cold fallback; deploy-count excludes warm deps; reaps
+orphans at run start. Dependent tests now assert the namespaced realm pattern (stronger than ==parent).
+
+**WC1 CORE MECHANISM PROVEN** (deploy-free, live warm keycloak): realm create → password-grant JWT
+→ discovery issuer → delete(idempotent) → reap(keeps live hex, deletes orphan): ALL PASS.
+
+**W0.3 declarative reconciler** (nix/modules/warm-keycloak.nix) — systemd oneshot, converges warm
+keycloak. Two bugs found+fixed against the real system:
+1. `abra app deploy` non-chaos FATALs "already deployed" → need `-f` (tested: redeploys at ENV
+   VERSION, exit 0).
+2. **Newline bite** (the backupbot.nix bite): keycloak's .env.sample ends with a newline-less
+   `#COMPOSE_FILE=` comment, so bash `set_env`'s printf glued `DOMAIN=` onto that comment →
+   DOMAIN unset → `KC_HOSTNAME=https://` (empty host) → keycloak crash-loop ("Expected authority at
+   index 8: https://"). Fixed set_env to ensure a trailing newline before append (same as backupbot).
+Also made converge **skip the redeploy when already 200** (no JVM-restart blip on every rebuild;
+only (re)deploys when down/crash-looping). Verified: nixos-rebuild switch → warm-keycloak.service
+active "no-op converge", system running (0 failed), /realms/master=200.
+
+**W0.4 e2e (lasuite-docs vs warm keycloak)** — the WARM MECHANISM worked: deploy-count=1 (keycloak
+NOT co-deployed), per-run realm `lasuite-docs-9c1995` created + **deleted on the warm keycloak** at
+teardown, install pass. BUT `setup_custom_tests.sh exited 1` → 3 requires_deps SSO tests SKIPPED →
+F2-11 correctly FAILED the run (not green). Root cause = a **lasuite-docs recipe race**, NOT warm
+keycloak: the in-place `abra app deploy --force --chaos` (OIDC wiring) rolls all services; nginx
+`web` fatally exits on `[emerg] host not found in upstream ...backend:8000` while backend is
+mid-restart, and abra's converge check times out → "deploy failed 🛑". This is independent of
+warm/cold keycloak (Q2.4 cold-keycloak lasuite-docs passed before; warm should REDUCE contention).
+Filed as a finding to investigate (flaky/timing/resource vs deterministic regression); the headline
+WC1 "dependent SSO tests green against warm keycloak" needs this resolved or a more-robust dependent.
+
+**DESIGN UPDATE absorbed (orchestrator + Adversary REVIEW-2w, 2026-05-28→29).** Warm/infra apps
+(traefik + keycloak) now AUTO-UPDATE to LATEST nightly with HEALTH-GATED ROLLBACK:
+- **WC1 revised:** UNPIN keycloak (match traefik: `abra recipe fetch` latest + chaos deploy; DROP
+  kcVersion). Keep secret-generate-only-if-missing + health-wait. D8 preserved (recipe fetched at
+  runtime → nix closure byte-identical).
+- **WC1.1 NEW:** health-gated deploy-with-rollback IN the reconcilers. record last-good → deploy
+  latest → health-check → healthy: commit last-good:=latest; unhealthy: rollback + PushNotification.
+  Stateful (keycloak): undeploy → raw snapshot data volume → deploy latest → on fail restore snapshot
+  + redeploy prior version (forward DB migrations make version-only rollback unsafe). traefik
+  (stateless) = version rollback only. Reuse WC3 snapshot helper.
+- **WC1.2 NEW:** pre-deploy safety gate — auto-apply only non-major/no-manual-migration bumps; a
+  MAJOR bump or manual-migration release notes → stay on current + alert (don't auto-apply).
+- **WC6 reordered:** nightly = nixos-rebuild switch FIRST (warm/infra→latest, health-gated) THEN
+  full-cold sweep; never while a test is in flight.
+
+**Re-sequencing consequence:** WC1.1 depends on the **WC3 snapshot/restore helper**, so I build that
+FIRST (foundational), then rewrite the reconciler ONCE into the full unpinned + health-gated +
+safety-gated + rollback form (avoids reworking the reconciler twice). Current reconciler (pinned,
+skip-if-healthy) is INTERIM — keeps keycloak live-warm/healthy meanwhile; will be replaced. Also need
+to settle the **alert mechanism**: a bash systemd reconciler can't call the agent's PushNotification
+tool directly — decision needed (alert sentinel file the Builder loop reads + relays, or a webhook).
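
Reviewer sketch of the WC1.1 sequence from the journal entry above (record last-good, deploy latest, health-check, then commit or roll back). This is a minimal Python illustration and not the reconciler itself, which the journal describes as a bash systemd oneshot. The `Hooks` fields and the `converge` function are hypothetical names introduced here; the caller supplies them, and for a stateful app the `snapshot` and `restore` hooks are assumed to handle the undeploy step themselves.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hooks:
    """Hypothetical callables supplied by the caller (not names from the repo)."""
    deploy: Callable[[str], None]                  # deploy the given version
    is_healthy: Callable[[], bool]                 # health probe after a deploy
    alert: Callable[[str], None]                   # e.g. write a sentinel file
    snapshot: Optional[Callable[[], None]] = None  # stateful apps only
    restore: Optional[Callable[[], None]] = None   # stateful apps only


def converge(last_good: str, latest: str, hooks: Hooks) -> str:
    """Try to move to `latest`; return the version that is last-good afterwards."""
    if latest == last_good:
        return last_good  # already on latest, nothing to do

    stateful = hooks.snapshot is not None and hooks.restore is not None
    if stateful:
        # Snapshot before upgrading: forward DB migrations make a
        # version-only rollback unsafe for stateful apps.
        hooks.snapshot()

    try:
        hooks.deploy(latest)
        healthy = hooks.is_healthy()
    except Exception:
        healthy = False  # a failed deploy is treated like a failed health check

    if healthy:
        return latest  # commit: last-good := latest

    # Unhealthy: restore data (stateful only), redeploy the prior version, alert.
    if stateful:
        hooks.restore()
    hooks.deploy(last_good)
    hooks.alert(f"rollback: {latest} unhealthy, back on {last_good}")
    return last_good
```

A stateless app such as traefik would pass no `snapshot`/`restore` hooks and so get version rollback only, while keycloak would pass both.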
diff --git a/machine-docs/STATUS-2w.md b/machine-docs/STATUS-2w.md
index 7805144..7f176d3 100644
--- a/machine-docs/STATUS-2w.md
+++ b/machine-docs/STATUS-2w.md
@@ -11,9 +11,16 @@ canonicals at stable domains, known-good snapshots, an opt-in `--quick` fast lan
 canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a
 nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversary cold-verified.
 
-## Definition of Done (Phase 2w) — WC1–WC9, each Adversary cold-verified in REVIEW-2w
-- [ ] **WC1** — Live-warm keycloak (SSO dep) at a stable domain; dependents create+delete per-run
-  namespaced realms; concurrent dependents don't collide; leftover realms reaped.
+## Definition of Done (Phase 2w) — WC1–WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w
+- [ ] **WC1** — Live-warm keycloak (SSO dep) at a stable domain, **UNPINNED** (fetch latest + chaos
+  deploy, like traefik; keep secret-generate-only-if-missing + health-wait); dependents
+  create+delete per-run namespaced realms; concurrent dependents don't collide; leftover realms reaped.
+- [ ] **WC1.1** — Health-gated deploy-with-rollback in warm/infra reconcilers (traefik+keycloak):
+  record last-good → deploy latest → health-check → healthy commits last-good:=latest; unhealthy
+  rolls back + alerts. Stateful (keycloak): snapshot data volume before upgrade, restore on
+  rollback (reuse WC3 helper). traefik = version rollback only.
+- [ ] **WC1.2** — Pre-deploy safety gate: auto-apply only non-major/no-manual-migration bumps; a
+  MAJOR bump or manual-migration release notes → stay on current + alert with notes (no silent auto-upgrade).
 - [ ] **WC2** — Data-warm canonical model: per-recipe canonical at a stable domain, declarative
   registry tracking recipe→known-good commit; re-warmable from scratch.
 - [ ] **WC3** — Known-good snapshots: raw volume copy taken while undeployed under stable path; one
@@ -37,15 +44,33 @@ nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversa
 - **W4** — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.
 
 ## In flight
-**W0 — live-warm keycloak (WC1).** Building incrementally:
-1. sso.py realm lifecycle: add `delete_keycloak_realm` + `list_realms` + `reap_stale_realms` (realm
-   is the per-run isolation unit on a shared keycloak).
-2. Orchestrator dep path: live-warm mode for the keycloak dep — use the stable warm domain + a
-   per-run **namespaced** realm (not realm=parent_recipe), delete the realm on teardown instead of
-   undeploying keycloak. Fall back to cold co-deploy if no warm keycloak present.
-3. Declarative Nix reconciler (`nix/modules/warm-keycloak.nix`) — systemd oneshot converges the
-   warm keycloak to deployed+healthy at the stable domain.
-4. e2e proof + concurrency (distinct realms) + reaping → claim WC1.
+**W0 — live-warm keycloak (WC1).** Done so far (commits up to 88c1114):
+- W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit pass).
+- W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).
+- **WC1 core mechanism PROVEN** deploy-free on the live warm keycloak: realm create → password-grant
+  JWT → discovery issuer → delete(idempotent) → reap(keeps live hex / deletes orphan). All PASS.
+- W0.3 declarative reconciler `nix/modules/warm-keycloak.nix` up; `nixos-rebuild switch` →
+  warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned +
+  skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)
+
+**Re-sequenced after the 2026-05-28/29 design update (unpin + WC1.1 rollback + WC1.2 safety gate):**
+WC1.1's keycloak rollback needs the **WC3 snapshot/restore helper**, so build that FIRST, then
+rewrite the reconciler ONCE into the unpinned + safety-gated + health-gated-with-rollback form. Next:
+1. **WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`): raw copy of an app's data
+   volume(s) while undeployed, under `/var/lib/ci-warm//`, atomic replace, one last-good;
+   restore round-trips data. + unit tests + live round-trip proof.
+2. Rewrite reconciler: unpin keycloak (fetch latest + chaos); WC1.2 safety gate (major / manual-
+   migration → hold + alert); WC1.1 record last-good → (keycloak: undeploy→snapshot→deploy latest) →
+   health-gate → commit-or-rollback+restore+alert.
+3. Settle the **alert mechanism** (bash reconciler can't call agent PushNotification — sentinel file
+   the Builder loop relays, see DECISIONS).
+4. Resolve the lasuite-docs in-place-redeploy race (BUILD finding below) OR pick a more-robust
+   dependent, then the headline WC1 e2e (dependent SSO green vs warm keycloak) + concurrency proof.
+
+**Build finding (mine, to fix):** lasuite-docs `setup_custom_tests` in-place `abra app deploy
+--force --chaos` (OIDC wiring) fails: nginx `web` fatally exits `[emerg] host not found in upstream
+...backend:8000` during the rolling restart → abra converge times out. Independent of warm/cold
+keycloak. Blocks the WC1 dependent-green proof until fixed/worked-around.
 
 ## Gate
 (none claimed yet)
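
Reviewer sketch of the W0.1 reaping rule listed above: a per-run realm counts as orphaned when its hex suffix matches no live stack, and `master` is never touched. This is a minimal illustration of such a pure predicate, assuming per-run realm names end in `-<6hex>` as in `lasuite-docs-9c1995`. The function name `orphaned_realms`, the regex, and the example realm names in the usage note are illustrative assumptions, not the contents of `sso.py`.

```python
import re
from typing import Iterable, List, Set

# Assumed naming scheme for per-run realms: "<prefix>-<6 lowercase hex chars>".
_RUN_REALM = re.compile(r"^.+-([0-9a-f]{6})$")


def orphaned_realms(realms: Iterable[str], live_hexes: Set[str]) -> List[str]:
    """Return per-run realms whose hex suffix matches no live stack.

    Pure function: no Keycloak calls, so it can be unit-tested directly.
    """
    orphans = []
    for name in realms:
        if name == "master":
            continue  # never reap the admin realm
        match = _RUN_REALM.match(name)
        if match is None:
            continue  # not a per-run realm; leave it alone
        if match.group(1) not in live_hexes:
            orphans.append(name)
    return orphans
```

With realms `["master", "lasuite-docs-9c1995", "demo-abc123"]` and live hexes `{"9c1995"}`, only `demo-abc123` is selected. Keeping the decision separate from the delete call is what lets a concurrent run's realm survive: its hex is in the live set at the time the predicate is evaluated.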