claim(2w): Gate WC1+WC1.1+WC1.2 CLAIMED — warm keycloak headline e2e GREEN + concurrency/reaping + rollback/holds proven
W0.7 (lasuite-docs race was transient) + W0.8 headline e2e: lasuite-docs custom pass (3 SSO tests incl. oidc_login + password_grant) vs WARM keycloak, deploy-count=1 (keycloak NOT co-deployed), per-run realm lasuite-docs-4c0858 created+deleted; warm kc left with only master realm. Concurrency+reaping proven (distinct realms for concurrent same-recipe runs; reap keeps-live/deletes-orphans). Gate claim in STATUS-2w carries full WHAT/HOW/EXPECTED/WHERE for cold verify.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -13,22 +13,26 @@ Single-writer rule (plan §6.1): Builder edits `## Build backlog` only; Adversar
 Core mechanism proven deploy-free on the live warm keycloak.
 - [x] W0.3a — Declarative reconciler `nix/modules/warm-keycloak.nix` up + verified via rebuild.
 DONE (88c1114) but INTERIM (pinned + skip-if-healthy) — superseded by W0.6 below.
-- [ ] **W0.5 — WC3 snapshot/restore helper FIRST** (prereq for WC1.1): `runner/harness/warmsnap.py`
-— raw copy of an app's data volume(s) while undeployed, under `/var/lib/ci-warm/<recipe>/`,
-atomic replace, one last-good, restore round-trips data. + unit tests + live round-trip proof.
-- [ ] **W0.6 — Rewrite reconciler: unpin + WC1.2 safety gate + WC1.1 health-gated rollback.**
-UNPIN keycloak (fetch latest + chaos; drop kcVersion); keep secret-guard + health-wait. WC1.2
-gate: hold-on-current + alert on major/manual-migration bump (no deploy churn). WC1.1: record
-last-good → keycloak undeploy→snapshot→deploy latest → health-gate → commit-or-(restore+
-redeploy-prior+alert). Apply the same health-gated+safety-gate pattern to traefik (version
-rollback only, stateless). Settle the alert mechanism (see DECISIONS).
-- [ ] **W0.7 — Fix lasuite-docs in-place-redeploy race** (nginx web `host not found in upstream
-backend` during chaos redeploy) OR pick a more-robust SSO dependent for the headline proof.
-- [ ] W0.8 — Headline WC1 e2e: dependent SSO custom test green vs warm keycloak; concurrent
-dependents distinct realms (no collision); leftover realms reaped. → claim WC1.
-- [ ] W0.9 — WC1.1/WC1.2 Adversary proofs: simulate broken latest → self-revert + data intact +
-alert; healthy update commits last-good; major/manual-migration → hold + alert-with-notes.
-→ claim WC1.1/WC1.2.
+- [x] **W0.5 — WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`) DONE (4cc1e15) — live
+round-trip proven; later moved snapshot into `<recipe>/snapshot/` subdir so last_good survives.
+- [x] **W0.6 — Rewrite reconciler: unpin + WC1.2 safety gate + WC1.1 scaffold** DONE (a044abb).
+`runner/warm_reconcile.py` python entrypoint in the nix store; unpinned (deploy latest tag);
+WC1.2 holds proven live; WC1.1 health-gate no-op path live. (traefik migration → later.)
+- [x] **W0.7 — lasuite-docs redeploy race** RESOLVED — it was transient resource contention from the
+killed stale Phase-2 run; converges fine on the clean system. No recipe/harness change needed.
+- [x] W0.8 — Headline WC1 e2e GREEN (b34mcluc4): lasuite-docs custom pass (3 SSO tests incl. oidc
+login + password grant) vs warm keycloak, deploy-count=1, per-run realm created+deleted;
+concurrency (distinct realms) + reaping proven.
+- [x] W0.9 — WC1.1 live proofs PASS (32f0071): marquee rollback (broken latest → self-revert + data
+intact + alert, last_good not advanced) + healthy upgrade commits last_good. WC1.2 holds (W0.6).
+- [x] **WC8 fix (found en route):** docker autoPrune `--volumes` removed (was failing daily + would
+delete warm volumes) (e73e439).
+- [ ] **W0.10 (follow-up, post-gate):** wire the Builder-loop alert relay
+(`/var/lib/ci-warm/alerts/*.json` → PushNotification → `alerts/seen/`); apply the WC1.1/WC1.2
+health-gated+safety-gate pattern to the traefik reconciler (proxy.nix, stateless = version
+rollback only). → folds into WC1.1/WC8 final verification.
+
+→ **Gate WC1 + WC1.1 + WC1.2 CLAIMED** in STATUS-2w (awaiting Adversary).

 ### W1 — Canonical registry (WC2)
 - [ ] W1.1 — Canonical registry/reconciler (declarative; tracks recipe→known-good commit; stable
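As a rough illustration of the W0.10 relay named in the backlog (scan `/var/lib/ci-warm/alerts/*.json`, notify, archive to `alerts/seen/`), here is a minimal Python sketch. It is not the Builder loop's actual code: `relay_alerts` and the `notify` callback are hypothetical names, and `notify` only stands in for whatever PushNotification delivery ends up being wired.

```python
import json
import shutil
from pathlib import Path
from typing import Callable


def relay_alerts(alerts_dir: Path, notify: Callable[[dict], None]) -> list[str]:
    """Deliver each pending alert sentinel once, then archive it under seen/.

    Hypothetical helper: the directory layout follows the backlog item,
    everything else is an assumption made for illustration.
    """
    seen_dir = alerts_dir / "seen"
    seen_dir.mkdir(parents=True, exist_ok=True)
    delivered = []
    # glob("*.json") is non-recursive, so files already in seen/ are not re-sent.
    for sentinel in sorted(alerts_dir.glob("*.json")):
        payload = json.loads(sentinel.read_text())
        notify(payload)  # stand-in for the PushNotification step
        shutil.move(str(sentinel), str(seen_dir / sentinel.name))
        delivered.append(sentinel.name)
    return delivered
```

Archiving only after `notify` returns means a failed delivery leaves the sentinel in place for the next wake, which seems to match the "run each wake when an alert is present" behaviour described in the commit.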
@@ -188,3 +188,28 @@ moment it started working. Fixed: dropped `--volumes` (prune images/containers/n
 rebuild → docker-prune.service runs clean, system `running` (0 failed), keycloak 200. Note for WC8:
 the warm-volume/snapshot prune policy + nix-generation GC should be folded into the maintenance
 story.
+
+## 2026-05-29 — W0.7/W0.8 headline WC1 e2e GREEN; concurrency+reaping proven → claiming WC1/WC1.1/WC1.2
+
+The W0.4 lasuite-docs failure was TRANSIENT (resource contention from the since-killed stale Phase-2
+run; disk was tight). Re-ran on the clean system (disk 36% after the prune fix):
+`RECIPE=lasuite-docs STAGES=install,custom` → **install: pass, custom: pass** — all 3 SSO tests green
+vs the WARM keycloak: test_health_check (200), **test_oidc_login_via_keycloak** (full app OIDC flow),
+**test_oidc_password_grant_against_dep_keycloak** (per-run realm JWT). **deploy-count=1** (keycloak
+NOT co-deployed — warm path); per-run realm `lasuite-docs-4c0858` created + DELETED at teardown; no
+lasu stack left; warm keycloak realm list back to just `master`. So W0.7 needs no recipe fix — the
+in-place chaos-redeploy converges fine with adequate resources.
+
+Concurrency+reaping (deploy-free, live warm keycloak): realm_for gives DISTINCT realms for two
+concurrent same-recipe runs (`lasuite-docs-aaa111` vs `-bbb222`) + a different recipe
+(`cryptpad-ccc333`); all 3 created, each grants its own JWT independently (no collision);
+reap_orphaned_realms with live_hexes={aaa111} deleted exactly the two orphans and KEPT the live one.
+
+All WC1 sub-claims now proven: (warm dep, no co-deploy, per-run realm create+delete) + (concurrent
+distinct realms) + (orphan reaping); plus WC1.1 (W0.9 marquee rollback) + WC1.2 (W0.6 holds). Warm
+keycloak healthy on 10.7.1+26.6.2, last_good=10.7.1+26.6.2, no alerts, system running (0 failed).
+Claiming the WC1/WC1.1/WC1.2 gate.
+
+Note: the reconciler WRITES alert sentinels to /var/lib/ci-warm/alerts/ (proven for rollback +
+holds). The Builder-loop RELAY (sentinel → PushNotification + archive to seen/) runs each wake when an
+alert is present; none currently. This delivery layer is loop behavior, not reconciler logic.
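The concurrency and reaping behaviour reported in this journal entry can be pictured with a small Python sketch. This is not the code in `runner/harness/warm.py` or `runner/harness/sso.py`: `realm_name` and `orphaned_realms` are hypothetical stand-ins for `realm_for` and `reap_orphaned_realms`, they take the per-run hex directly rather than an app name, and the six-character hex suffix is an assumption read off the example realm names.

```python
import re
from typing import Iterable

# Assumption: per-run realms look like "<recipe>-<6 hex chars>", as in
# lasuite-docs-4c0858; "master" never matches and is therefore never reaped.
_REALM_RE = re.compile(r"^(?P<recipe>.+)-(?P<hex>[0-9a-f]{6})$")


def realm_name(recipe: str, run_hex: str) -> str:
    """Name the per-run realm for one run of a recipe (hypothetical helper)."""
    return f"{recipe}-{run_hex}"


def orphaned_realms(realms: Iterable[str], live_hexes: set[str]) -> list[str]:
    """Return per-run realms whose hex suffix belongs to no live run."""
    orphans = []
    for realm in realms:
        match = _REALM_RE.match(realm)
        if match and match.group("hex") not in live_hexes:
            orphans.append(realm)
    return orphans


realms = ["master", "lasuite-docs-aaa111", "lasuite-docs-bbb222", "cryptpad-ccc333"]
to_delete = orphaned_realms(realms, live_hexes={"aaa111"})
```

Keeping the selection step pure (names in, names out) leaves the actual deletion against the warm keycloak as a separate call, so the keep-live/delete-orphans rule can be unit-tested without a running server.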
@@ -90,13 +90,75 @@ nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversa
 3. **Builder-loop alert relay** (deferred wiring) — on each wake, scan `/var/lib/ci-warm/alerts/*.json`,
 PushNotification + record + archive to `alerts/seen/`; wire when nightly WC6 lands (first real alert).

-**Build finding (mine, to fix):** lasuite-docs `setup_custom_tests` in-place `abra app deploy
---force --chaos` (OIDC wiring) fails: nginx `web` fatally exits `[emerg] host not found in upstream
-...backend:8000` during the rolling restart → abra converge times out. Independent of warm/cold
-keycloak. Blocks the WC1 dependent-green proof until fixed/worked-around.
+**Build finding (RESOLVED):** the W0.4 lasuite-docs `setup_custom_tests` redeploy failure (nginx web
+`host not found in upstream ...backend:8000`) was **transient resource contention** from the
+since-killed stale Phase-2 run (disk was also tight). On the clean system it converges fine — the
+headline e2e is green (below). No recipe/harness change needed.

 ## Gate
-(none claimed yet)
+
+### Gate: WC1 + WC1.1 + WC1.2 — CLAIMED, awaiting Adversary (@2026-05-29, HEAD = see `git log -1`)
+
+**WHAT.** The live-warm keycloak layer (W0): a persistent **unpinned** keycloak at the stable domain
+`warm-keycloak.ci.commoninternet.net`, declaratively reconciled, that SSO-dependent runs use via a
+**per-run namespaced realm** (created + deleted) instead of co-deploying; concurrent dependents get
+distinct realms; orphan realms are reaped (WC1). The reconciler health-gates auto-upgrades with
+snapshot-backed rollback (WC1.1) behind a pre-deploy safety gate for major/manual-migration bumps
+(WC1.2).
+
+**WHERE (code).** `runner/warm_reconcile.py` (reconcile logic), `runner/harness/warm.py` (stable
+domain, per-run realm naming, reaping), `runner/harness/sso.py` (realm lifecycle), `runner/harness/
+warmsnap.py` (snapshot/restore), `runner/run_recipe_ci.py` (warm/cold dep split), `nix/modules/
+warm-keycloak.nix` (systemd reconcile unit). Warm state on cc-ci under `/var/lib/ci-warm/`.
+
+**HOW + EXPECTED (cold, from your own clone on cc-ci — tar-sync runner+tests to your /root/<clone>):**
+
+1. **Declarative + unpinned + healthy:** `grep -n kcVersion nix/modules/warm-keycloak.nix` → *no
+match* (pin removed; the unit runs `runner/warm_reconcile.py keycloak`). `ssh cc-ci 'systemctl
+is-active warm-keycloak.service'` → `active`; `systemctl is-system-running` → `running`. Health:
+`curl -sk --resolve warm-keycloak.ci.commoninternet.net:443:127.0.0.1
+https://warm-keycloak.ci.commoninternet.net/realms/master -o /dev/null -w '%{http_code}'` → `200`.
+D8: a `nixos-rebuild build` closure hash is unaffected by which keycloak version is live (recipe
+fetched at runtime).
+2. **Units:** `cc-ci-run -m pytest tests/unit -q` → **57 passed** (incl. test_warm_realm,
+test_warmsnap, test_warm_reconcile).
+3. **WC1 headline e2e:** `RECIPE=lasuite-docs STAGES=install,custom cc-ci-run
+runner/run_recipe_ci.py` → `install: pass`, `custom: pass`, **`deploy-count = 1 (expect 1)`**
+(keycloak NOT co-deployed), log shows `dep: using live-warm keycloak @ warm-keycloak...` and
+`dep: deleted per-run realm lasuite-docs-<hex> on warm keycloak`. The 3 custom SSO tests pass
+(test_health_check, test_oidc_login_via_keycloak, test_oidc_password_grant_against_dep_keycloak).
+After the run, warm keycloak realms = `['master']` only (no leftover); no `lasu*` docker stack.
+4. **WC1 concurrency + reaping (deploy-free):** `realm_for("lasuite-docs","lasu-aaa111...")` =
+`lasuite-docs-aaa111` and `...bbb222` → distinct (two concurrent same-recipe runs never collide);
+create realms aaa111/bbb222/ccc333 on the warm kc, each `oidc_password_grant` returns a JWT;
+`sso.reap_orphaned_realms(D, live_hexes={"aaa111"})` deletes exactly bbb222+ccc333 and KEEPS
+aaa111. (Builder ran this live: PASS.)
+5. **WC1.1 health-gated rollback (live):** with `CCCI_SKIP_FETCH=1` stage two **annotated** fake tags
+on `~/.abra/recipes/keycloak` — `10.7.9+26.6.2` at the good commit (`git tag -a -m x 10.7.9+26.6.2
+10.7.1+26.6.2^{}`) and `10.7.10+26.6.2` at a commit whose compose.yml has a broken
+`KC_HOSTNAME=:::bad-host:::`. Create a marker realm, set last_good, then run `CCCI_SKIP_FETCH=1
+cc-ci-run runner/warm_reconcile.py keycloak` twice → first `RECONCILE RESULT: upgraded:...->10.7.9`
+(snapshot taken, last_good=10.7.9, marker preserved); second `rolled-back:10.7.10->10.7.9` —
+keycloak HEALTHY on 10.7.9, **marker realm INTACT** (data preserved), `/var/lib/ci-warm/keycloak/
+last_good` still `10.7.9` (NOT advanced), a `*-rollback.json` alert under `/var/lib/ci-warm/alerts/`
+with `attempted=10.7.10 last_good=10.7.9 recovered=true`. (Builder ran this live: ALL PASS; keycloak
+restored to canonical 10.7.1+26.6.2.)
+6. **WC1.2 pre-deploy safety gate (live):** stage an annotated fake tag with a MAJOR bump
+(`11.0.0+27.0.0`) → `CCCI_SKIP_FETCH=1 ... warm_reconcile.py keycloak` → `RECONCILE RESULT:
+held-major:...`, a `*-held-major.json` alert written, **keycloak untouched** (TYPE unchanged,
+200, no snapshot/deploy churn). Stage a minor tag (`10.7.2+26.6.3`) with `releaseNotes/
+10.7.2+26.6.3.md` containing "manual migration" → `held-manual-migration`, alert carries the notes.
+(Builder ran both live: held + untouched.)
+
+**Alert delivery note (not blocking):** the reconciler WRITES alert sentinels to
+`/var/lib/ci-warm/alerts/*.json` (proven above). The operator-facing relay (Builder loop scans →
+PushNotification → archive to `alerts/seen/`) is loop behavior, run each wake when an alert exists;
+none currently. "Alert fired" for WC1.1/WC1.2 = sentinel written, which is independently checkable.
+
+**Builder will NOT advance past this gate** (to W1/WC2 canonical registry) until REVIEW-2w shows PASS.
+
+## (prior) Gate
+(none before this)

 ## Blocked
 (none)
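To make the WC1.2 pre-deploy safety gate from step 6 concrete, here is a minimal Python sketch of the hold decision. It is an illustration under stated assumptions, not the logic in `runner/warm_reconcile.py`: `gate_decision` and the `"proceed"` result are hypothetical names, tags are assumed to have the `<recipe-version>+<app-version>` shape seen in `10.7.1+26.6.2`, and a change in either leading major number is assumed to count as a major bump.

```python
def _majors(tag: str) -> tuple[int, int]:
    """Split e.g. '10.7.1+26.6.2' into its (recipe major, app major) pair."""
    recipe_version, app_version = tag.split("+", 1)
    return int(recipe_version.split(".")[0]), int(app_version.split(".")[0])


def gate_decision(current: str, candidate: str, release_notes: str = "") -> str:
    """Decide whether an upgrade may proceed or must hold on the current tag.

    Returns 'held-major', 'held-manual-migration', or 'proceed' (the last
    label is hypothetical; the first two mirror the results quoted in step 6).
    """
    # Assumption: any change in either major number is treated as a major bump.
    if _majors(candidate) != _majors(current):
        return "held-major"
    # Release notes mentioning a manual migration also hold the upgrade.
    if "manual migration" in release_notes.lower():
        return "held-manual-migration"
    return "proceed"
```

Because the decision is taken before any undeploy or snapshot, a held result costs nothing on the running keycloak, which is the "no snapshot/deploy churn" property the gate claim asks the Adversary to check.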