From 985686f60e266e390c155b7d4268072fca342098 Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Fri, 29 May 2026 01:40:32 +0100
Subject: [PATCH] =?UTF-8?q?claim(2w):=20Gate=20WC1+WC1.1+WC1.2=20CLAIMED?=
 =?UTF-8?q?=20=E2=80=94=20warm=20keycloak=20headline=20e2e=20GREEN=20+=20c?=
 =?UTF-8?q?oncurrency/reaping=20+=20rollback/holds=20proven?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

W0.7 (lasuite-docs race was transient) + W0.8 headline e2e: lasuite-docs custom pass (3 SSO tests
incl. oidc_login + password_grant) vs WARM keycloak, deploy-count=1 (keycloak NOT co-deployed),
per-run realm lasuite-docs-4c0858 created+deleted; warm kc left with only master realm.
Concurrency+reaping proven (distinct realms for concurrent same-recipe runs; reap
keeps-live/deletes-orphans). Gate claim in STATUS-2w carries full WHAT/HOW/EXPECTED/WHERE for cold
verify.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 machine-docs/BACKLOG-2w.md | 36 ++++++++++---------
 machine-docs/JOURNAL-2w.md | 25 +++++++++++++
 machine-docs/STATUS-2w.md  | 72 +++++++++++++++++++++++++++++++++++---
 3 files changed, 112 insertions(+), 21 deletions(-)

diff --git a/machine-docs/BACKLOG-2w.md b/machine-docs/BACKLOG-2w.md
index 748bf2a..288681f 100644
--- a/machine-docs/BACKLOG-2w.md
+++ b/machine-docs/BACKLOG-2w.md
@@ -13,22 +13,26 @@ Single-writer rule (plan §6.1): Builder edits `## Build backlog` only; Adversar
 Core mechanism proven deploy-free on the live warm keycloak.
 - [x] W0.3a — Declarative reconciler `nix/modules/warm-keycloak.nix` up + verified via rebuild.
   DONE (88c1114) but INTERIM (pinned + skip-if-healthy) — superseded by W0.6 below.
-- [ ] **W0.5 — WC3 snapshot/restore helper FIRST** (prereq for WC1.1): `runner/harness/warmsnap.py`
-  — raw copy of an app's data volume(s) while undeployed, under `/var/lib/ci-warm//`,
-  atomic replace, one last-good, restore round-trips data. + unit tests + live round-trip proof.
-- [ ] **W0.6 — Rewrite reconciler: unpin + WC1.2 safety gate + WC1.1 health-gated rollback.**
-  UNPIN keycloak (fetch latest + chaos; drop kcVersion); keep secret-guard + health-wait. WC1.2
-  gate: hold-on-current + alert on major/manual-migration bump (no deploy churn). WC1.1: record
-  last-good → keycloak undeploy→snapshot→deploy latest → health-gate → commit-or-(restore+
-  redeploy-prior+alert). Apply the same health-gated+safety-gate pattern to traefik (version
-  rollback only, stateless). Settle the alert mechanism (see DECISIONS).
-- [ ] **W0.7 — Fix lasuite-docs in-place-redeploy race** (nginx web `host not found in upstream
-  backend` during chaos redeploy) OR pick a more-robust SSO dependent for the headline proof.
-- [ ] W0.8 — Headline WC1 e2e: dependent SSO custom test green vs warm keycloak; concurrent
-  dependents distinct realms (no collision); leftover realms reaped. → claim WC1.
-- [ ] W0.9 — WC1.1/WC1.2 Adversary proofs: simulate broken latest → self-revert + data intact +
-  alert; healthy update commits last-good; major/manual-migration → hold + alert-with-notes.
-  → claim WC1.1/WC1.2.
+- [x] **W0.5 — WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`) DONE (4cc1e15) — live
+  round-trip proven; later moved snapshot into `/snapshot/` subdir so last_good survives.
+- [x] **W0.6 — Rewrite reconciler: unpin + WC1.2 safety gate + WC1.1 scaffold** DONE (a044abb).
+  `runner/warm_reconcile.py` python entrypoint in the nix store; unpinned (deploy latest tag);
+  WC1.2 holds proven live; WC1.1 health-gate no-op path live. (traefik migration → later.)
+- [x] **W0.7 — lasuite-docs redeploy race** RESOLVED — it was transient resource contention from the
+  killed stale Phase-2 run; converges fine on the clean system. No recipe/harness change needed.
+- [x] W0.8 — Headline WC1 e2e GREEN (b34mcluc4): lasuite-docs custom pass (3 SSO tests incl. oidc
+  login + password grant) vs warm keycloak, deploy-count=1, per-run realm created+deleted;
+  concurrency (distinct realms) + reaping proven.
+- [x] W0.9 — WC1.1 live proofs PASS (32f0071): marquee rollback (broken latest → self-revert + data
+  intact + alert, last_good not advanced) + healthy upgrade commits last_good. WC1.2 holds (W0.6).
+- [x] **WC8 fix (found en route):** docker autoPrune `--volumes` removed (was failing daily + would
+  delete warm volumes) (e73e439).
+- [ ] **W0.10 (follow-up, post-gate):** wire the Builder-loop alert relay
+  (`/var/lib/ci-warm/alerts/*.json` → PushNotification → `alerts/seen/`); apply the WC1.1/WC1.2
+  health-gated+safety-gate pattern to the traefik reconciler (proxy.nix, stateless = version
+  rollback only). → folds into WC1.1/WC8 final verification.
+
+→ **Gate WC1 + WC1.1 + WC1.2 CLAIMED** in STATUS-2w (awaiting Adversary).
 
 ### W1 — Canonical registry (WC2)
 - [ ] W1.1 — Canonical registry/reconciler (declarative; tracks recipe→known-good commit; stable
diff --git a/machine-docs/JOURNAL-2w.md b/machine-docs/JOURNAL-2w.md
index 1fdea62..6a3ba9b 100644
--- a/machine-docs/JOURNAL-2w.md
+++ b/machine-docs/JOURNAL-2w.md
@@ -188,3 +188,28 @@ moment it started working. Fixed: dropped `--volumes` (prune images/containers/n
 rebuild → docker-prune.service runs clean, system `running` (0 failed), keycloak 200. Note for WC8:
 the warm-volume/snapshot prune policy + nix-generation GC should be folded into the maintenance
 story.
+
+## 2026-05-29 — W0.7/W0.8 headline WC1 e2e GREEN; concurrency+reaping proven → claiming WC1/WC1.1/WC1.2
+
+The W0.4 lasuite-docs failure was TRANSIENT (resource contention from the since-killed stale Phase-2
+run; disk was tight). Re-ran on the clean system (disk 36% after the prune fix):
+`RECIPE=lasuite-docs STAGES=install,custom` → **install: pass, custom: pass** — all 3 SSO tests green
+vs the WARM keycloak: test_health_check (200), **test_oidc_login_via_keycloak** (full app OIDC flow),
+**test_oidc_password_grant_against_dep_keycloak** (per-run realm JWT). **deploy-count=1** (keycloak
+NOT co-deployed — warm path); per-run realm `lasuite-docs-4c0858` created + DELETED at teardown; no
+lasu stack left; warm keycloak realm list back to just `master`. So W0.7 needs no recipe fix — the
+in-place chaos-redeploy converges fine with adequate resources.
+
+Concurrency+reaping (deploy-free, live warm keycloak): realm_for gives DISTINCT realms for two
+concurrent same-recipe runs (`lasuite-docs-aaa111` vs `-bbb222`) + a different recipe
+(`cryptpad-ccc333`); all 3 created, each grants its own JWT independently (no collision);
+reap_orphaned_realms with live_hexes={aaa111} deleted exactly the two orphans and KEPT the live one.
+
+All WC1 sub-claims now proven: (warm dep, no co-deploy, per-run realm create+delete) + (concurrent
+distinct realms) + (orphan reaping); plus WC1.1 (W0.9 marquee rollback) + WC1.2 (W0.6 holds). Warm
+keycloak healthy on 10.7.1+26.6.2, last_good=10.7.1+26.6.2, no alerts, system running (0 failed).
+Claiming the WC1/WC1.1/WC1.2 gate.
+
+Note: the reconciler WRITES alert sentinels to /var/lib/ci-warm/alerts/ (proven for rollback +
+holds). The Builder-loop RELAY (sentinel → PushNotification + archive to seen/) runs each wake when an
+alert is present; none currently. This delivery layer is loop behavior, not reconciler logic.
diff --git a/machine-docs/STATUS-2w.md b/machine-docs/STATUS-2w.md
index ff4ef90..6098a8f 100644
--- a/machine-docs/STATUS-2w.md
+++ b/machine-docs/STATUS-2w.md
@@ -90,13 +90,75 @@ nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversa
 3. **Builder-loop alert relay** (deferred wiring) — on each wake, scan `/var/lib/ci-warm/alerts/*.json`,
    PushNotification + record + archive to `alerts/seen/`; wire when nightly WC6 lands (first real alert).
 
-**Build finding (mine, to fix):** lasuite-docs `setup_custom_tests` in-place `abra app deploy
---force --chaos` (OIDC wiring) fails: nginx `web` fatally exits `[emerg] host not found in upstream
-...backend:8000` during the rolling restart → abra converge times out. Independent of warm/cold
-keycloak. Blocks the WC1 dependent-green proof until fixed/worked-around.
+**Build finding (RESOLVED):** the W0.4 lasuite-docs `setup_custom_tests` redeploy failure (nginx web
+`host not found in upstream ...backend:8000`) was **transient resource contention** from the
+since-killed stale Phase-2 run (disk was also tight). On the clean system it converges fine — the
+headline e2e is green (below). No recipe/harness change needed.
 
 ## Gate
-(none claimed yet)
+
+### Gate: WC1 + WC1.1 + WC1.2 — CLAIMED, awaiting Adversary (@2026-05-29, HEAD = see `git log -1`)
+
+**WHAT.** The live-warm keycloak layer (W0): a persistent **unpinned** keycloak at the stable domain
+`warm-keycloak.ci.commoninternet.net`, declaratively reconciled, that SSO-dependent runs use via a
+**per-run namespaced realm** (created + deleted) instead of co-deploying; concurrent dependents get
+distinct realms; orphan realms are reaped (WC1). The reconciler health-gates auto-upgrades with
+snapshot-backed rollback (WC1.1) behind a pre-deploy safety gate for major/manual-migration bumps
+(WC1.2).
+
+**WHERE (code).** `runner/warm_reconcile.py` (reconcile logic), `runner/harness/warm.py` (stable
+domain, per-run realm naming, reaping), `runner/harness/sso.py` (realm lifecycle), `runner/harness/
+warmsnap.py` (snapshot/restore), `runner/run_recipe_ci.py` (warm/cold dep split), `nix/modules/
+warm-keycloak.nix` (systemd reconcile unit). Warm state on cc-ci under `/var/lib/ci-warm/`.
+
+**HOW + EXPECTED (cold, from your own clone on cc-ci — tar-sync runner+tests to your /root/):**
+
+1. **Declarative + unpinned + healthy:** `grep -n kcVersion nix/modules/warm-keycloak.nix` → *no
+   match* (pin removed; the unit runs `runner/warm_reconcile.py keycloak`). `ssh cc-ci 'systemctl
+   is-active warm-keycloak.service'` → `active`; `systemctl is-system-running` → `running`. Health:
+   `curl -sk --resolve warm-keycloak.ci.commoninternet.net:443:127.0.0.1
+   https://warm-keycloak.ci.commoninternet.net/realms/master -o /dev/null -w '%{http_code}'` → `200`.
+   D8: a `nixos-rebuild build` closure hash is unaffected by which keycloak version is live (recipe
+   fetched at runtime).
+2. **Units:** `cc-ci-run -m pytest tests/unit -q` → **57 passed** (incl. test_warm_realm,
+   test_warmsnap, test_warm_reconcile).
+3. **WC1 headline e2e:** `RECIPE=lasuite-docs STAGES=install,custom cc-ci-run
+   runner/run_recipe_ci.py` → `install: pass`, `custom: pass`, **`deploy-count = 1 (expect 1)`**
+   (keycloak NOT co-deployed), log shows `dep: using live-warm keycloak @ warm-keycloak...` and
+   `dep: deleted per-run realm lasuite-docs- on warm keycloak`. The 3 custom SSO tests pass
+   (test_health_check, test_oidc_login_via_keycloak, test_oidc_password_grant_against_dep_keycloak).
+   After the run, warm keycloak realms = `['master']` only (no leftover); no `lasu*` docker stack.
+4. **WC1 concurrency + reaping (deploy-free):** `realm_for("lasuite-docs","lasu-aaa111...")` =
+   `lasuite-docs-aaa111` and `...bbb222` → distinct (two concurrent same-recipe runs never collide);
+   create realms aaa111/bbb222/ccc333 on the warm kc, each `oidc_password_grant` returns a JWT;
+   `sso.reap_orphaned_realms(D, live_hexes={"aaa111"})` deletes exactly bbb222+ccc333 and KEEPS
+   aaa111. (Builder ran this live: PASS.)
+5. **WC1.1 health-gated rollback (live):** with `CCCI_SKIP_FETCH=1` stage two **annotated** fake tags
+   on `~/.abra/recipes/keycloak` — `10.7.9+26.6.2` at the good commit (`git tag -a -m x 10.7.9+26.6.2
+   10.7.1+26.6.2^{}`) and `10.7.10+26.6.2` at a commit whose compose.yml has a broken
+   `KC_HOSTNAME=:::bad-host:::`. Create a marker realm, set last_good, then run `CCCI_SKIP_FETCH=1
+   cc-ci-run runner/warm_reconcile.py keycloak` twice → first `RECONCILE RESULT: upgraded:...->10.7.9`
+   (snapshot taken, last_good=10.7.9, marker preserved); second `rolled-back:10.7.10->10.7.9` —
+   keycloak HEALTHY on 10.7.9, **marker realm INTACT** (data preserved), `/var/lib/ci-warm/keycloak/
+   last_good` still `10.7.9` (NOT advanced), a `*-rollback.json` alert under `/var/lib/ci-warm/alerts/`
+   with `attempted=10.7.10 last_good=10.7.9 recovered=true`. (Builder ran this live: ALL PASS; keycloak
+   restored to canonical 10.7.1+26.6.2.)
+6. **WC1.2 pre-deploy safety gate (live):** stage an annotated fake tag with a MAJOR bump
+   (`11.0.0+27.0.0`) → `CCCI_SKIP_FETCH=1 ... warm_reconcile.py keycloak` → `RECONCILE RESULT:
+   held-major:...`, a `*-held-major.json` alert written, **keycloak untouched** (TYPE unchanged,
+   200, no snapshot/deploy churn). Stage a minor tag (`10.7.2+26.6.3`) with `releaseNotes/
+   10.7.2+26.6.3.md` containing "manual migration" → `held-manual-migration`, alert carries the notes.
+   (Builder ran both live: held + untouched.)
+
+**Alert delivery note (not blocking):** the reconciler WRITES alert sentinels to
+`/var/lib/ci-warm/alerts/*.json` (proven above). The operator-facing relay (Builder loop scans →
+PushNotification → archive to `alerts/seen/`) is loop behavior, run each wake when an alert exists;
+none currently. "Alert fired" for WC1.1/WC1.2 = sentinel written, which is independently checkable.
+
+**Builder will NOT advance past this gate** (to W1/WC2 canonical registry) until REVIEW-2w shows PASS.
+
+## (prior) Gate
+(none before this)
 
 ## Blocked
 (none)
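
Reviewer note (not part of the patch): the WC1 concurrency + reaping claim in step 4 can be read as two small rules, sketched below. This is a hypothetical illustration, not the code in `runner/harness/warm.py` or `runner/harness/sso.py`. The names `sketch_realm_name` and `sketch_orphans`, the `<recipe>-<run_hex>` naming rule, and the decision to skip `master` and any realm without a hyphen are all assumptions inferred from the examples quoted in the claim.

```python
from typing import Iterable, List, Set

# Assumption: the built-in realm must never be reaped.
PROTECTED = {"master"}


def sketch_realm_name(recipe: str, run_hex: str) -> str:
    """Hypothetical naming rule: '<recipe>-<run_hex>', unique per run."""
    return f"{recipe}-{run_hex}"


def sketch_orphans(realms: Iterable[str], live_hexes: Set[str]) -> List[str]:
    """Return the realms whose trailing run hex is not among the live runs."""
    orphans = []
    for name in realms:
        # Skip protected realms and anything that does not look per-run.
        if name in PROTECTED or "-" not in name:
            continue
        run_hex = name.rsplit("-", 1)[-1]
        if run_hex not in live_hexes:
            orphans.append(name)
    return orphans


if __name__ == "__main__":
    a = sketch_realm_name("lasuite-docs", "aaa111")
    b = sketch_realm_name("lasuite-docs", "bbb222")
    c = sketch_realm_name("cryptpad", "ccc333")
    # Only aaa111 is live, so the other two per-run realms are reap candidates.
    print(sketch_orphans(["master", a, b, c], live_hexes={"aaa111"}))
```

Keeping the orphan selection a pure function over realm names would let the keep-live/delete-orphans rule be unit-tested without touching the warm keycloak; the actual deletion call is deliberately left out of this sketch.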