status(2w): W0 core mechanism proven + reconciler up; absorb design update (unpin+WC1.1+WC1.2); re-sequence to WC3 snapshot helper first

2026-05-29 00:04:10 +01:00
parent b127078516
commit 740d7bac4c
2 changed files with 102 additions and 12 deletions
--- a/machine-docs/JOURNAL-2w.md
+++ b/machine-docs/JOURNAL-2w.md
@@ -52,3 +52,68 @@ Building W0 in increments (each verified): (1) sso realm lifecycle prims + units
 keycloak manually at the stable domain and prove realm create→delete via admin API; (3) wire the
 orchestrator live-warm mode; (4) declarative Nix reconciler; (5) e2e + concurrency + reaping proof.
 </content>
+
+## 2026-05-29 — W0 core mechanism PROVEN; declarative reconciler up; design update absorbed
+
+**Stale Phase-2 run killed.** Found an orphaned `run_recipe_ci.py` (RECIPE=lasuite-drive, the Q3.2
+`ccci-q32-drive-sso2.log` run) still alive from before the phase switch (PPID 1, nohup). It had
+deployed lasu-0a6fb2 + tried a cold keyc-07d81e dep — both of which I'd already torn down, so it was
+failing. Killed its process tree + janitored. Only infra + warm-keycloak remain.
+
+**W0.1 realm lifecycle (sso.py)** — list_realms / delete_keycloak_realm (idempotent, refuses master)
+/ realms_to_reap (pure predicate) / reap_orphaned_realms. +8 unit tests. The per-run realm is the
+isolation unit on a shared keycloak; orphans reaped by hex-not-in-live-stacks (concurrency-safe).
+
+**W0.2 orchestrator live-warm mode** — warm.py (stable-domain scheme, is_warm_up probe,
+live_app_hexes, realm_for=<parent>-<6hex>, reap_orphan_realms). run_recipe_ci splits declared deps
+into warm (shared provider + per-run realm, no deploy, realm deleted at teardown) vs cold
+(co-deploy), warm only if provider up else cold fallback; deploy-count excludes warm deps; reaps
+orphans at run start. Dependent tests now assert the namespaced realm pattern (stronger than ==parent).
+
+**WC1 CORE MECHANISM PROVEN** (deploy-free, live warm keycloak): realm create → password-grant JWT
+→ discovery issuer → delete (idempotent) → reap (keeps live hex, deletes orphan): ALL PASS.
+
+**W0.3 declarative reconciler** (nix/modules/warm-keycloak.nix) — systemd oneshot, converges warm
+keycloak. Two bugs found+fixed against the real system:
+1. `abra app deploy` non-chaos FATALs "already deployed" → need `-f` (tested: redeploys at ENV
+   VERSION, exit 0).
+2. **Newline bite** (the backupbot.nix bite): keycloak's .env.sample ends with a newline-less
+   `#COMPOSE_FILE=` comment, so bash `set_env`'s printf glued `DOMAIN=` onto that comment →
+   DOMAIN unset → `KC_HOSTNAME=https://` (empty host) → keycloak crash-loop ("Expected authority at
+   index 8: https://"). Fixed set_env to ensure a trailing newline before append (same as backupbot).
+Also made converge **skip the redeploy when already 200** (no JVM-restart blip on every rebuild;
+only (re)deploys when down/crash-looping). Verified: nixos-rebuild switch → warm-keycloak.service
+active "no-op converge", system running (0 failed), /realms/master=200.
+
+**W0.4 e2e (lasuite-docs vs warm keycloak)** — the WARM MECHANISM worked: deploy-count=1 (keycloak
+NOT co-deployed), per-run realm `lasuite-docs-9c1995` created + **deleted on the warm keycloak** at
+teardown, install pass. BUT `setup_custom_tests.sh exited 1` → 3 requires_deps SSO tests SKIPPED →
+F2-11 correctly FAILED the run (not green). Root cause = a **lasuite-docs recipe race**, NOT warm
+keycloak: the in-place `abra app deploy --force --chaos` (OIDC wiring) rolls all services; nginx
+`web` fatally exits on `[emerg] host not found in upstream ...backend:8000` while backend is
+mid-restart, and abra's converge check times out → "deploy failed 🛑". This is independent of
+warm/cold keycloak (Q2.4 cold-keycloak lasuite-docs passed before; warm should REDUCE contention).
+Filed as a finding to investigate (flaky/timing/resource vs deterministic regression); the headline
+WC1 "dependent SSO tests green against warm keycloak" needs this resolved or a more-robust dependent.
+
+**DESIGN UPDATE absorbed (orchestrator + Adversary REVIEW-2w, 2026-05-28→29).** Warm/infra apps
+(traefik + keycloak) now AUTO-UPDATE to LATEST nightly with HEALTH-GATED ROLLBACK:
+- **WC1 revised:** UNPIN keycloak (match traefik: `abra recipe fetch` latest + chaos deploy; DROP
+  kcVersion). Keep secret-generate-only-if-missing + health-wait. D8 preserved (recipe fetched at
+  runtime → nix closure byte-identical).
+- **WC1.1 NEW:** health-gated deploy-with-rollback IN the reconcilers. Record last-good → deploy
+  latest → health-check → healthy: commit last-good:=latest; unhealthy: rollback + PushNotification.
+  Stateful (keycloak): undeploy → raw snapshot data volume → deploy latest → on fail restore snapshot
+  + redeploy prior version (forward DB migrations make version-only rollback unsafe). traefik
+  (stateless) = version rollback only. Reuse WC3 snapshot helper.
+- **WC1.2 NEW:** pre-deploy safety gate — auto-apply only non-major/no-manual-migration bumps; a
+  MAJOR bump or manual-migration release notes → stay on current + alert (don't auto-apply).
+- **WC6 reordered:** nightly = nixos-rebuild switch FIRST (warm/infra→latest, health-gated) THEN
+  full-cold sweep; never while a test is in flight.
+
+**Re-sequencing consequence:** WC1.1 depends on the **WC3 snapshot/restore helper**, so I build that
+FIRST (foundational), then rewrite the reconciler ONCE into the full unpinned + health-gated +
+safety-gated + rollback form (avoids reworking the reconciler twice). Current reconciler (pinned,
+skip-if-healthy) is INTERIM — keeps keycloak live-warm/healthy meanwhile; will be replaced. Also need
+to settle the **alert mechanism**: a bash systemd reconciler can't call the agent's PushNotification
+tool directly — decision needed (alert sentinel file the Builder loop reads + relays, or a webhook).
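
Note on the open alert-mechanism question: the sentinel-file option named in the last paragraph could look roughly like the sketch below. Everything in it is an assumption for illustration and not part of this commit: the one-JSON-file-per-alert layout, the field names, the functions `write_alert` and `drain_alerts`, and the `relay` callback standing in for whatever the Builder loop uses to reach PushNotification. The writer is shown in Python only to keep the example in one language; a bash reconciler would do the equivalent temp-file-then-`mv`.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Callable


def write_alert(alert_dir: Path, source: str, message: str) -> Path:
    """Atomically drop one alert file (hypothetical layout: one JSON file per alert)."""
    alert_dir.mkdir(parents=True, exist_ok=True)
    payload = {"source": source, "message": message, "ts": time.time()}
    # Write to a temp file in the same directory, then rename it into place.
    # The rename is atomic on POSIX, so a reader never sees a partial alert.
    fd, tmp = tempfile.mkstemp(dir=alert_dir, prefix=".tmp-", suffix=".json")
    with os.fdopen(fd, "w") as fh:
        json.dump(payload, fh)
    final = alert_dir / f"{time.time_ns()}-{source}.json"
    os.replace(tmp, final)
    return final


def drain_alerts(alert_dir: Path, relay: Callable[[dict], None]) -> int:
    """Relay every pending alert; delete each file only after relay returns."""
    if not alert_dir.is_dir():
        return 0
    sent = 0
    for path in sorted(alert_dir.glob("*.json")):
        if path.name.startswith(".tmp-"):
            continue  # still being written by the reconciler
        relay(json.loads(path.read_text()))
        path.unlink()
        sent += 1
    return sent
```

Because a file is removed only after `relay` returns, a crash or exception in the relay leaves the alert on disk for the next pass. That gives at-least-once delivery, so the relay side would need to tolerate an occasional duplicate.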