status(2w): W0 core mechanism proven + reconciler up; absorb design update (unpin+WC1.1+WC1.2); re-sequence to WC3 snapshot helper first
@@ -11,9 +11,16 @@ canonicals at stable domains, known-good snapshots, an opt-in `--quick` fast lan
canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a
nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversary cold-verified.

## Definition of Done (Phase 2w) — WC1–WC9, each Adversary cold-verified in REVIEW-2w
- [ ] **WC1** — Live-warm keycloak (SSO dep) at a stable domain; dependents create+delete per-run
namespaced realms; concurrent dependents don't collide; leftover realms reaped.
## Definition of Done (Phase 2w) — WC1–WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w
- [ ] **WC1** — Live-warm keycloak (SSO dep) at a stable domain, **UNPINNED** (fetch latest + chaos
deploy, like traefik; keep secret-generate-only-if-missing + health-wait); dependents
create+delete per-run namespaced realms; concurrent dependents don't collide; leftover realms reaped.
- [ ] **WC1.1** — Health-gated deploy-with-rollback in warm/infra reconcilers (traefik+keycloak):
record last-good → deploy latest → health-check → healthy commits last-good:=latest; unhealthy
rolls back + alerts. Stateful (keycloak): snapshot data volume before upgrade, restore on
rollback (reuse WC3 helper). traefik = version rollback only.
- [ ] **WC1.2** — Pre-deploy safety gate: auto-apply only non-major/no-manual-migration bumps; a
MAJOR bump or manual-migration release notes → stay on current + alert with notes (no silent auto-upgrade).
- [ ] **WC2** — Data-warm canonical model: per-recipe canonical at a stable domain, declarative
registry tracking recipe→known-good commit; re-warmable from scratch.
- [ ] **WC3** — Known-good snapshots: raw volume copy taken while undeployed under stable path; one
@@ -37,15 +44,33 @@ nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversa
- **W4** — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.

## In flight
**W0 — live-warm keycloak (WC1).** Building incrementally:
1. sso.py realm lifecycle: add `delete_keycloak_realm` + `list_realms` + `reap_stale_realms` (realm
is the per-run isolation unit on a shared keycloak).
2. Orchestrator dep path: live-warm mode for the keycloak dep — use the stable warm domain + a
per-run **namespaced** realm (not realm=parent_recipe), delete the realm on teardown instead of
undeploying keycloak. Fall back to cold co-deploy if no warm keycloak present.
3. Declarative Nix reconciler (`nix/modules/warm-keycloak.nix`) — systemd oneshot converges the
warm keycloak to deployed+healthy at the stable domain.
4. e2e proof + concurrency (distinct realms) + reaping → claim WC1.
**W0 — live-warm keycloak (WC1).** Done so far (commits up to 88c1114):
- W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit tests pass in total).
- W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).
- **WC1 core mechanism PROVEN** deploy-free on the live warm keycloak: realm create → password-grant
JWT → discovery issuer → delete (idempotent) → reap (keeps live hex / deletes orphan). All PASS.
- W0.3 declarative reconciler `nix/modules/warm-keycloak.nix` up; `nixos-rebuild switch` →
warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned +
skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)
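
The reap step listed above (keep realms that belong to live runs, delete orphans) can be pictured as a pure selection function. This is a minimal sketch, not the code in sso.py: the `ci-<recipe>-<8 hex>` naming scheme, the `realm_name` helper, and the signature of `realms_to_reap` shown here are assumptions made for illustration only.

```python
import re

# Hypothetical naming scheme: "<prefix>-<recipe>-<8 hex chars>".
# The real per-run realm names in sso.py may look different.
REALM_RE = re.compile(
    r"^(?P<prefix>[a-z0-9]+)-(?P<recipe>[a-z0-9-]+)-(?P<run>[0-9a-f]{8})$"
)


def realm_name(recipe: str, run_hex: str, prefix: str = "ci") -> str:
    """Build the namespaced realm name for one CI run."""
    return f"{prefix}-{recipe}-{run_hex}"


def realms_to_reap(realms, live_run_hexes, prefix="ci"):
    """Select per-run realms whose run id is no longer live.

    Realms that do not match the naming scheme (for example `master`)
    are never selected, so the reaper cannot touch them.
    """
    live = set(live_run_hexes)
    doomed = []
    for name in realms:
        m = REALM_RE.match(name)
        if m is None or m.group("prefix") != prefix:
            continue  # not one of ours
        if m.group("run") not in live:
            doomed.append(name)  # orphan: its run is gone
    return doomed
```

Keeping the selection separate from the Keycloak delete call means the keep-live/delete-orphan rule can be unit tested without a running keycloak.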

**Re-sequenced after the 2026-05-28/29 design update (unpin + WC1.1 rollback + WC1.2 safety gate):**
WC1.1's keycloak rollback needs the **WC3 snapshot/restore helper**, so build that FIRST, then
rewrite the reconciler ONCE into the unpinned + safety-gated + health-gated-with-rollback form. Next:
1. **WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`): raw copy of an app's data
volume(s) while undeployed, under `/var/lib/ci-warm/<recipe>/`, atomic replace, one last-good;
restore round-trips data. + unit tests + live round-trip proof.
2. Rewrite reconciler: unpin keycloak (fetch latest + chaos); WC1.2 safety gate (major /
manual-migration → hold + alert); WC1.1 record last-good → (keycloak: undeploy→snapshot→deploy latest) →
health-gate → commit-or-rollback+restore+alert.
3. Settle the **alert mechanism**: the bash reconciler can't call the agent's PushNotification, so
use a sentinel file that the Builder loop relays (see DECISIONS).
4. Resolve the lasuite-docs in-place-redeploy race (BUILD finding below) OR pick a more-robust
dependent, then the headline WC1 e2e (dependent SSO green vs warm keycloak) + concurrency proof.
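
The control flow planned in step 2 (safety gate, then health-gated deploy with rollback) might look roughly like the sketch below. It is written in Python only to make the ordering concrete; the actual reconciler is a bash/Nix systemd oneshot. Every name here (`safe_to_auto_apply`, `StateOps`, `Ops`, `reconcile`) is hypothetical, the "manual migration" substring check is a stand-in for whatever release-note inspection is eventually chosen, and the rollback ordering (undeploy, restore, redeploy last-good) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


def major(version: str) -> int:
    """Leading integer of a dotted version such as '26.0.1' (assumed shape)."""
    return int(version.lstrip("v").split(".")[0])


def safe_to_auto_apply(current: str, latest: str, notes: str) -> bool:
    """WC1.2-style gate: refuse a major bump or notes mentioning a manual migration."""
    if major(latest) != major(current):
        return False
    return "manual migration" not in notes.lower()


@dataclass
class StateOps:
    """Hooks for a stateful app (keycloak); a stateless app (traefik) passes none."""
    undeploy: Callable[[], None]
    snapshot: Callable[[], None]
    restore: Callable[[], None]


@dataclass
class Ops:
    deploy: Callable[[str], None]
    healthy: Callable[[], bool]
    alert: Callable[[str], None]
    state: Optional[StateOps] = None


def reconcile(last_good: str, latest: str, notes: str, ops: Ops) -> str:
    """Run one reconcile pass and return the version that is last-good afterwards."""
    if latest == last_good:
        return last_good
    if not safe_to_auto_apply(last_good, latest, notes):
        ops.alert(f"holding {last_good}; {latest} needs manual review: {notes}")
        return last_good
    if ops.state:
        ops.state.undeploy()
        ops.state.snapshot()  # raw volume copy while undeployed
    ops.deploy(latest)
    if ops.healthy():
        return latest  # commit: last-good := latest
    if ops.state:
        ops.state.undeploy()
        ops.state.restore()  # put the pre-upgrade data back
    ops.deploy(last_good)
    ops.alert(f"rolled back {latest} -> {last_good}: health check failed")
    return last_good
```

Injecting the deploy, health, snapshot, and alert hooks as callables keeps the ordering testable with fakes, which is one way to cover both the commit and the rollback branch before the live round-trip proof.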

**Build finding (mine, to fix):** lasuite-docs `setup_custom_tests` in-place `abra app deploy
--force --chaos` (OIDC wiring) fails: nginx `web` fatally exits `[emerg] host not found in upstream
...backend:8000` during the rolling restart → abra converge times out. Independent of warm/cold
keycloak. Blocks the WC1 dependent-green proof until fixed/worked-around.

## Gate
(none claimed yet)