W1.2: enrolled custom-html (recipe_meta.WARM_CANONICAL); live proof ALL PASS (seed canonical → idle-with-volume-retained → re-warm → marker survived). WC2 (registry+data-warm model) + WC3 (snapshot+restore) proven. 61 unit pass. custom-html now the first real data-warm canonical (idle). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
240 lines
18 KiB
Markdown
# STATUS — Phase 2w (warm canonical deployments + `--quick` CI mode)

**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
**Loop state for THIS phase:** STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared).
Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state.
Phase 2 is **PAUSED** (STATUS-2/BACKLOG-2 intact) and resumes after 2w `## DONE`.

## Phase
Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe
canonicals at stable domains, known-good snapshots, an opt-in `--quick` fast lane that reattaches the
canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a
nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversary cold-verified.

## Definition of Done (Phase 2w) — WC1–WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w
- [x] **WC1** — Live-warm UNPINNED keycloak; per-run namespaced realms (create+delete); concurrent
distinct realms; orphan realms reaped. **Adversary PASS @2026-05-29** (REVIEW-2w, gate 985686f).
- [~] **WC1.1** — Health-gated deploy-with-rollback. **keycloak (stateful) — Adversary PASS
@2026-05-29** (marquee: broken latest → snapshot→restore→prior, data intact, last_good held,
alert). **traefik (stateless, version-rollback-only) — NOT yet migrated = W0.10**, MUST close
before Phase-2w DONE (Adversary will require a cold proof).
- [x] **WC1.2** — Pre-deploy safety gate (major / manual-migration → hold + alert with notes, no
churn, short-circuits before WC1.1). **Adversary PASS @2026-05-29**.
- [x] **WC2** — Data-warm canonical model: per-recipe canonical at stable domain `warm-<recipe>`,
declarative registry (canonical.json + recipe_meta.WARM_CANONICAL) tracking recipe→known-good
version/commit; data-warm (undeployed-when-idle, volume retained); re-warmable via seed_canonical.
Proven on custom-html (W1.2). **CLAIMED — see Gate below.**
- [x] **WC3** — Known-good snapshots: raw per-volume tar taken while undeployed under
`/var/lib/ci-warm/<recipe>/snapshot/`; one last-good per app, atomic subdir swap; restore
round-trips data (W0.5 mutate→restore proof + W1.2 data-warm reattach). **CLAIMED — see Gate.**
- [ ] **WC4** — `--quick` mode: reattach canonical → upgrade to PR head → generic+custom asserts;
PASS→undeploy keep volume (known-good unchanged); FAIL→restore snapshot then undeploy; never promotes.
- [ ] **WC5** — Canonical advancement via cold only (promote-on-green-cold; seeds on first green cold).
- [ ] **WC6** — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
- [ ] **WC7** — Trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge;
results carry mode; clean no-canonical fallback.
- [ ] **WC8** — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via
per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
- [ ] **WC9** — Docs + cold verify incl. the rollback proof (deliberately fail a PR under `--quick`,
confirm last-known-good restored intact; a `--quick` pass did not move the known-good).

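
The WC3 snapshot shape (one raw tar per volume, taken while the app is undeployed, a single last-good per app, swapped in as a whole subdirectory) can be sketched as below. This is a minimal illustration under stated assumptions, not the code in `runner/harness/warmsnap.py`: the function names, the `snapshot.new` / `snapshot.old` staging names, and the `volumes` mapping (volume name to data directory) are hypothetical.

```python
import os
import shutil
import tarfile
from pathlib import Path


def take_snapshot(volumes: dict[str, Path], base: Path) -> Path:
    """Tar every volume into a staging dir, then swap it in as base/snapshot."""
    live = base / "snapshot"
    staging = base / "snapshot.new"   # hypothetical staging name
    old = base / "snapshot.old"       # hypothetical name for the outgoing snapshot
    for stale in (staging, old):
        shutil.rmtree(stale, ignore_errors=True)
    staging.mkdir(parents=True)
    for name, data_dir in volumes.items():
        with tarfile.open(staging / f"{name}.tar", "w") as tar:
            tar.add(data_dir, arcname=".")
    # A half-written snapshot never appears under the live name: the staging
    # directory is complete before either rename happens.
    if live.exists():
        os.rename(live, old)
    os.rename(staging, live)
    shutil.rmtree(old, ignore_errors=True)
    return live


def restore_snapshot(volumes: dict[str, Path], base: Path) -> None:
    """Replace each volume's contents with what the last-good tar holds."""
    live = base / "snapshot"
    for name, data_dir in volumes.items():
        shutil.rmtree(data_dir, ignore_errors=True)
        data_dir.mkdir(parents=True)
        with tarfile.open(live / f"{name}.tar") as tar:
            tar.extractall(data_dir)
```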
## Milestones (plan §3)
- **W0** — Warm keycloak (WC1/WC1.1-keycloak/WC1.2). ✅ Adversary PASS @2026-05-29.
- **W1** — Canonical registry + snapshot/restore (WC2, WC3). ← IN FLIGHT
- **W2** — `--quick` mode (WC4, WC7).
- **W3** — Cold-advances-canonical + nightly sweep (WC5, WC6).
- **W4** — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.

## In flight
**W0 — live-warm keycloak (WC1).** Done so far (commits up to 88c1114):
- W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit pass).
- W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).
- **WC1 core mechanism PROVEN** deploy-free on the live warm keycloak: realm create → password-grant
JWT → discovery issuer → delete(idempotent) → reap(keeps live hex / deletes orphan). All PASS.
- W0.3 declarative reconciler `nix/modules/warm-keycloak.nix` up; `nixos-rebuild switch` →
warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned +
skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)

- **W0.5 WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`) DONE (4cc1e15). +5 unit tests
(48 unit pass). **LIVE round-trip PROVEN on warm keycloak**: marker realm → undeploy → snapshot
(mariadb+providers) → deploy → delete marker (mutate DB) → undeploy → restore → deploy → marker
realm BACK; keycloak healthy. Snapshots under `/var/lib/ci-warm/<recipe>/`, atomic, one last-good.

- **W0.6 reconciler rewrite** DONE (a044abb). `runner/warm_reconcile.py` (python, packaged into the
nix store, replaces the bash reconcile): UNPIN keycloak (deploy latest version TAG; recipe fetched
at runtime → D8 closure byte-identical); WC1.2 pre-deploy safety gate (major recipe/app bump OR
releaseNotes manual-migration → hold + alert, no churn); WC1.1 health-gated upgrade-with-rollback
scaffold (record last-good → keycloak undeploy→snapshot→deploy latest → health-gate →
commit-or-restore+redeploy-prior+alert). Alerts = `/var/lib/ci-warm/alerts/*.json`. +8 unit tests
(56 unit pass). PROVEN live: `nixos-rebuild switch` → warm-keycloak.service runs the python
reconciler → noop-healthy (system 0-failed, 200); **WC1.2 holds proven** (MAJOR → held-major,
keycloak untouched; minor+manual-migration notes → held-manual-migration, alert carries notes).

- **W0.9 WC1.1 live proofs** DONE (32f0071). PROVEN on warm keycloak (annotated fake tags +
CCCI_SKIP_FETCH): (a) healthy upgrade 10.7.1→10.7.9 — snapshot+deploy+health-pass, last_good
committed, marker preserved; (b) **marquee rollback** — broken latest 10.7.10 → deploy fails →
rollback to 10.7.9, HEALTHY, marker realm INTACT (data preserved), last_good NOT advanced, rollback
alert written (attempted=10.7.10,last_good=10.7.9,recovered=True); recovered to canonical
10.7.1+26.6.2. Fixed 4 issues live (deploy-fail→rollback, warmsnap last_good subdir, wait_undeployed
swarm-settle, abra-stdout capture). 57 unit pass. **Reconciler-side WC1/WC1.1/WC1.2 proven.**

**Adversary reproduce (W0.9):** on cc-ci, with the keycloak recipe clone, create annotated fake
tags (peel `^{}`, set git identity) `10.7.9+26.6.2`(=good commit) and `10.7.10+26.6.2`(broken
KC_HOSTNAME), then `CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak` twice; observe
`upgraded:` then `rolled-back:`, marker realm survives, `/var/lib/ci-warm/keycloak/last_good`
unchanged at the prior version, a `*rollback*.json` alert under `/var/lib/ci-warm/alerts/`.

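
The `upgraded:` and `rolled-back:` outcomes in the reproduce steps follow from the WC1.1 control flow: snapshot, deploy the candidate, health-gate, then either commit last_good or restore, redeploy the prior version and write an alert. A minimal sketch of that flow, assuming a hypothetical in-memory `FakeApp` in place of abra and the real keycloak; the `reconcile` signature is illustrative and is not taken from `runner/warm_reconcile.py`.

```python
from dataclasses import dataclass, field


@dataclass
class FakeApp:
    """Hypothetical stand-in for a warm app (the real flow drives abra)."""
    version: str
    broken: frozenset = frozenset()          # versions whose deploy fails
    data: dict = field(default_factory=dict)
    snap: dict = field(default_factory=dict)

    def take_snapshot(self) -> None:
        self.snap = dict(self.data)

    def restore_snapshot(self) -> None:
        self.data = dict(self.snap)

    def deploy(self, version: str) -> None:
        self.version = version
        if version in self.broken:
            self.data.clear()                # simulate a bad deploy damaging state
            raise RuntimeError(f"deploy of {version} failed")

    def healthy(self) -> bool:
        return self.version not in self.broken


def reconcile(app: FakeApp, latest: str, last_good: str, alerts: list) -> tuple[str, str]:
    """Return (result string, new last_good)."""
    if latest == last_good:
        return "noop-healthy", last_good
    app.take_snapshot()
    try:
        app.deploy(latest)
        if not app.healthy():
            raise RuntimeError("health gate failed")
    except RuntimeError:
        # Roll back: data first, then the prior version; last_good is NOT advanced.
        app.restore_snapshot()
        app.deploy(last_good)
        alerts.append({"attempted": latest, "last_good": last_good,
                       "recovered": app.healthy()})
        return f"rolled-back:{latest}->{last_good}", last_good
    return f"upgraded:{last_good}->{latest}", latest
```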
**W0 COMPLETE — Adversary PASS @2026-05-29.** Now in **W1 (canonical registry, WC2/WC3)**.

**W1 progress:** W1.1 canonical registry module DONE (b6ef83a) — `runner/harness/canonical.py`
(enrollment via recipe_meta.WARM_CANONICAL, registry canonical.json, deploy/undeploy-keep-volume/
seed lifecycle) + 4 unit tests (61 unit pass). **Next: W1.2** — enroll custom-html
(`tests/custom-html/recipe_meta.py: WARM_CANONICAL=True`) + LIVE data-warm proof: seed a
warm-custom-html canonical with content → undeploy-keep-volume (verify volume retained, app down) →
deploy_canonical (reattach) → assert the written content survives; re-warmable from scratch. Then
close WC2/WC3.

**W1 plan (WC2 data-warm canonical model + WC3 closure):**
- WC2: a declarative **canonical registry** — which recipes are canonical + at which known-good
commit/version — with each canonical app at a **stable domain `warm-<recipe>`**, kept **data-warm**
(undeployed-when-idle, data volume retained). Re-warmable from scratch (cache). Reconciler/registry
declared in-repo.
- WC3: snapshots (warmsnap, W0.5 — done) tied to canonicals: one last-good per canonical under
`/var/lib/ci-warm/<recipe>/`, restore proven (done). Close WC3 with the canonical model.
- Distinguish from W0's live-warm keycloak: canonicals are DATA-warm (undeployed when idle), keycloak
is LIVE-warm (always up). Both use the `warm-<recipe>` stable scheme.

**Tracked before Phase-2w DONE (not blocking W1):**
- **W0.10a — traefik WC1.1** (Adversary requires a cold proof): migrate `proxy.nix` onto the shared
health-gated reconciler (stateless = version-rollback-only; preserve cert-secret/WILDCARDS_ENABLED/
COMPOSE_FILE setup). CAREFUL — traefik serves all TLS; deploy/test only in a quiet window.
- **W0.10b — Builder-loop alert relay**: each wake, scan `/var/lib/ci-warm/alerts/*.json` →
PushNotification → archive to `alerts/seen/`.

**Build finding (RESOLVED):** the W0.4 lasuite-docs `setup_custom_tests` redeploy failure (nginx web
`host not found in upstream ...backend:8000`) was **transient resource contention** from the
since-killed stale Phase-2 run (disk was also tight). On the clean system it converges fine — the
headline e2e is green (below). No recipe/harness change needed.

## Gate

### Gate: WC2 + WC3 — CLAIMED, awaiting Adversary (@2026-05-29, HEAD = see `git log -1`)

**WHAT.** The data-warm canonical model (W1): a declarative per-recipe canonical at the stable domain
`warm-<recipe>.ci.commoninternet.net`, kept **data-warm** (undeployed-when-idle, data volume
retained), tracked by a registry; **known-good snapshots** (raw per-volume tar while undeployed, one
last-good per app, restore round-trips data).

**WHERE (code).** `runner/harness/canonical.py` (registry + data-warm lifecycle),
`runner/harness/warmsnap.py` (snapshot/restore), enrollment
`tests/custom-html/recipe_meta.py: WARM_CANONICAL=True`.
State on cc-ci under `/var/lib/ci-warm/<recipe>/` (`canonical.json`, `snapshot/`, retained volume).

**HOW + EXPECTED (cold, from your own clone on cc-ci):**
1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **61 passed** (incl. test_canonical, test_warmsnap).
2. **WC2/WC3 data-warm round-trip** (custom-html canonical exists idle now): reproduce with a driver
that uses `runner/harness/canonical.py` — deploy `warm-custom-html.ci.commoninternet.net` @
`1.11.0+1.29.0`, write a marker file into `/usr/share/nginx/html/`, undeploy, `seed_canonical`
(writes `/var/lib/ci-warm/custom-html/canonical.json` + a `snapshot/` while undeployed); confirm
**app UNDEPLOYED but the `content` volume RETAINED** (`docker volume ls | grep warm-custom-html`);
then `deploy_canonical('custom-html')` → the marker **survives** (data-warm reattach). Builder ran
this live: **ALL PASS** (marker `WC2-DATA-MARKER-7f3a9c` survived; registry version=1.11.0+1.29.0;
snapshot present). Current live state: `cat /var/lib/ci-warm/custom-html/canonical.json` →
status=idle, version=1.11.0+1.29.0; `docker volume ls` shows
`warm-custom-html_ci_commoninternet_net_content` retained with NO custom-html service running.
3. **WC3 restore round-trip** already cold-verified in the W0.9/W0.5 keycloak proof (snapshot →
mutate DB → restore → data back); same `warmsnap` helper.
4. **D8/WC8:** `/var/lib/ci-warm/` is cache, NOT in the nix closure (no module references it as a
source); re-seeded by cold runs, not restored on rebuild.

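
For step 2, the "current live state" check can be scripted. The sketch below is an assumption-laden illustration: the JSON key names `status` and `version` are inferred from the `status=idle, version=...` summary, and the volume-name rule (dots in the stable domain become underscores, then `_<volume>`) is inferred from the single example `warm-custom-html_ci_commoninternet_net_content`. Neither is confirmed against `runner/harness/canonical.py`, and both helper names are hypothetical.

```python
import json
from pathlib import Path

DOMAIN_SUFFIX = "ci.commoninternet.net"


def expected_volume(recipe: str, volume: str) -> str:
    """Docker volume name a retained canonical volume is expected to carry."""
    domain = f"warm-{recipe}.{DOMAIN_SUFFIX}"
    # Inferred rule: dots in the domain are replaced by underscores.
    return f"{domain.replace('.', '_')}_{volume}"


def check_idle_canonical(registry: Path, version: str) -> list[str]:
    """Return a list of mismatches; an empty list means the record looks right."""
    record = json.loads(registry.read_text())
    problems = []
    if record.get("status") != "idle":          # assumed key name
        problems.append(f"status is {record.get('status')!r}, expected 'idle'")
    if record.get("version") != version:        # assumed key name
        problems.append(f"version is {record.get('version')!r}, expected {version!r}")
    return problems
```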
**Builder will NOT advance into W2 (`--quick`, which consumes the canonical) past this gate** until
REVIEW-2w shows PASS — but will do non-disruptive W0.10 follow-ups (alert relay) meanwhile.

---

### Gate: WC1 + WC1.2 + WC1.1(keycloak) — ✅ Adversary PASS @2026-05-29 (REVIEW-2w 31ac86d, gate 985686f)
All 6 checks cold-verified from the Adversary's own clone. Builder may proceed to W1. **Tracked open
(must close before Phase-2w DONE, not a blocker now): traefik WC1.1 (W0.10)** — stateless
version-rollback not yet on the shared health-gated reconciler; Adversary will require a cold proof.

(claim detail retained below for the record)

**WHAT.** The live-warm keycloak layer (W0): a persistent **unpinned** keycloak at the stable domain
`warm-keycloak.ci.commoninternet.net`, declaratively reconciled, that SSO-dependent runs use via a
**per-run namespaced realm** (created + deleted) instead of co-deploying; concurrent dependents get
distinct realms; orphan realms are reaped (WC1). The reconciler health-gates auto-upgrades with
snapshot-backed rollback (WC1.1) behind a pre-deploy safety gate for major/manual-migration bumps
(WC1.2).

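
The WC1.2 pre-deploy safety gate reduces to a pure decision on two version tags plus the release notes. A minimal sketch, assuming tags have the `<recipe-version>+<app-version>` shape seen in this file (for example `10.7.1+26.6.2`); the function names and the `"proceed"` label are hypothetical, while `held-major` and `held-manual-migration` are the result names used in this gate.

```python
def majors(tag: str) -> tuple[int, int]:
    """Return (recipe major, app major) for a tag like '10.7.1+26.6.2'."""
    recipe, _, app = tag.partition("+")
    return int(recipe.split(".")[0]), int((app or "0").split(".")[0])


def safety_gate(current: str, candidate: str, release_notes: str = "") -> str:
    """Decide whether an auto-upgrade may go ahead or must be held."""
    cur_recipe, cur_app = majors(current)
    new_recipe, new_app = majors(candidate)
    if new_recipe > cur_recipe or new_app > cur_app:
        return "held-major"
    if "manual migration" in release_notes.lower():
        return "held-manual-migration"
    return "proceed"   # hypothetical label: hand over to the health-gated upgrade
```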
**WHERE (code).** `runner/warm_reconcile.py` (reconcile logic), `runner/harness/warm.py` (stable
domain, per-run realm naming, reaping), `runner/harness/sso.py` (realm lifecycle),
`runner/harness/warmsnap.py` (snapshot/restore), `runner/run_recipe_ci.py` (warm/cold dep split),
`nix/modules/warm-keycloak.nix` (systemd reconcile unit). Warm state on cc-ci under `/var/lib/ci-warm/`.

**HOW + EXPECTED (cold, from your own clone on cc-ci — tar-sync runner+tests to your /root/<clone>):**

1. **Declarative + unpinned + healthy:** `grep -n kcVersion nix/modules/warm-keycloak.nix` → *no
match* (pin removed; the unit runs `runner/warm_reconcile.py keycloak`).
`ssh cc-ci 'systemctl is-active warm-keycloak.service'` → `active`; `systemctl is-system-running` →
`running`. Health:
`curl -sk --resolve warm-keycloak.ci.commoninternet.net:443:127.0.0.1
https://warm-keycloak.ci.commoninternet.net/realms/master -o /dev/null -w '%{http_code}'` → `200`.
D8: a `nixos-rebuild build` closure hash is unaffected by which keycloak version is live (recipe
fetched at runtime).
2. **Units:** `cc-ci-run -m pytest tests/unit -q` → **57 passed** (incl. test_warm_realm,
test_warmsnap, test_warm_reconcile).
3. **WC1 headline e2e:** `RECIPE=lasuite-docs STAGES=install,custom cc-ci-run
runner/run_recipe_ci.py` → `install: pass`, `custom: pass`, **`deploy-count = 1 (expect 1)`**
(keycloak NOT co-deployed), log shows `dep: using live-warm keycloak @ warm-keycloak...` and
`dep: deleted per-run realm lasuite-docs-<hex> on warm keycloak`. The 3 custom SSO tests pass
(test_health_check, test_oidc_login_via_keycloak, test_oidc_password_grant_against_dep_keycloak).
After the run, warm keycloak realms = `['master']` only (no leftover); no `lasu*` docker stack.
4. **WC1 concurrency + reaping (deploy-free):** `realm_for("lasuite-docs","lasu-aaa111...")` =
`lasuite-docs-aaa111` and `...bbb222` → distinct (two concurrent same-recipe runs never collide);
create realms aaa111/bbb222/ccc333 on the warm kc, each `oidc_password_grant` returns a JWT;
`sso.reap_orphaned_realms(D, live_hexes={"aaa111"})` deletes exactly bbb222+ccc333 and KEEPS
aaa111. (Builder ran this live: PASS.)
5. **WC1.1 health-gated rollback (live):** with `CCCI_SKIP_FETCH=1` stage two **annotated** fake tags
on `~/.abra/recipes/keycloak` — `10.7.9+26.6.2` at the good commit (`git tag -a -m x 10.7.9+26.6.2
10.7.1+26.6.2^{}`) and `10.7.10+26.6.2` at a commit whose compose.yml has a broken
`KC_HOSTNAME=:::bad-host:::`. Create a marker realm, set last_good, then run `CCCI_SKIP_FETCH=1
cc-ci-run runner/warm_reconcile.py keycloak` twice → first `RECONCILE RESULT: upgraded:...->10.7.9`
(snapshot taken, last_good=10.7.9, marker preserved); second `rolled-back:10.7.10->10.7.9` —
keycloak HEALTHY on 10.7.9, **marker realm INTACT** (data preserved),
`/var/lib/ci-warm/keycloak/last_good` still `10.7.9` (NOT advanced), a `*-rollback.json` alert under
`/var/lib/ci-warm/alerts/` with `attempted=10.7.10 last_good=10.7.9 recovered=true`. (Builder ran
this live: ALL PASS; keycloak restored to canonical 10.7.1+26.6.2.)
6. **WC1.2 pre-deploy safety gate (live):** stage an annotated fake tag with a MAJOR bump
(`11.0.0+27.0.0`) → `CCCI_SKIP_FETCH=1 ... warm_reconcile.py keycloak` → `RECONCILE RESULT:
held-major:...`, a `*-held-major.json` alert written, **keycloak untouched** (TYPE unchanged,
200, no snapshot/deploy churn). Stage a minor tag (`10.7.2+26.6.3`) with
`releaseNotes/10.7.2+26.6.3.md` containing "manual migration" → `held-manual-migration`, alert
carries the notes. (Builder ran both live: held + untouched.)

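
Check 4 (per-run realm naming and orphan reaping) is deploy-free and can be sketched as pure functions. The names `realm_for` and `realms_to_reap` appear in this file, but the signatures and bodies below are assumptions for illustration, not the code in `runner/harness/warm.py` or `runner/harness/sso.py`. In particular the sketch assumes the run hex is the part after the last `-` in the first label of the app name.

```python
import re

# A per-run realm is "<recipe>-<6 hex chars>"; "master" never matches.
PER_RUN_REALM = re.compile(r"^(?P<recipe>.+)-(?P<hex>[0-9a-f]{6})$")


def realm_for(recipe: str, app_name: str) -> str:
    """Derive the per-run realm name from the recipe and the cold app name."""
    first_label = app_name.split(".")[0]          # e.g. "lasu-aaa111"
    run_hex = first_label.rsplit("-", 1)[-1]
    return f"{recipe}-{run_hex}"


def realms_to_reap(realms: list[str], live_hexes: set[str]) -> list[str]:
    """Per-run realms whose run hex no longer belongs to a live run."""
    doomed = []
    for realm in realms:
        match = PER_RUN_REALM.match(realm)
        if match and match["hex"] not in live_hexes:
            doomed.append(realm)
    return doomed
```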
**SCOPE (honest).** WC1 and WC1.2 are complete. **WC1.1 is proven for keycloak** — the *stateful*
case (snapshot-backed data-integrity rollback), which is the hard part and the Adversary's marquee
proof. **traefik's WC1.1** (stateless = version-rollback-only) is **NOT yet migrated** onto the shared
health-gated reconciler — it still uses the existing `proxy.nix` chaos-deploy reconciler. That
migration is **W0.10** (tracked in BACKLOG-2w), to land before the Phase-2w DONE. If the Adversary
wants WC1.1 fully closed (both reconcilers) before PASS, treat this gate as WC1 + WC1.2 + WC1.1(keycloak).

**Alert delivery note (not blocking):** the reconciler WRITES alert sentinels to
`/var/lib/ci-warm/alerts/*.json` (proven above). The operator-facing relay (Builder loop scans →
PushNotification → archive to `alerts/seen/`) is loop behavior, run each wake when an alert exists;
none currently. "Alert fired" for WC1.1/WC1.2 = sentinel written, which is independently checkable.

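
The relay described in the alert delivery note (scan sentinels, notify, archive to `alerts/seen/`) could look roughly like this. It is a sketch only: `relay_alerts` and the `notify` callback are hypothetical names, and the real Builder loop sends a PushNotification rather than calling a Python function.

```python
import json
import shutil
from pathlib import Path
from typing import Callable


def relay_alerts(alerts_dir: Path, notify: Callable[[dict], None]) -> list[str]:
    """Notify once per alert sentinel, then archive it under alerts_dir/seen/."""
    seen = alerts_dir / "seen"
    seen.mkdir(exist_ok=True)
    relayed = []
    # glob is non-recursive, so files already archived in seen/ are skipped.
    for path in sorted(alerts_dir.glob("*.json")):
        notify(json.loads(path.read_text()))
        shutil.move(str(path), str(seen / path.name))
        relayed.append(path.name)
    return relayed
```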
**Builder will NOT advance past this gate** (to W1/WC2 canonical registry) until REVIEW-2w shows PASS.

## (prior) Gate
(none before this)

## Blocked
(none)

## Notes
- **Disk budget (WC8 watch):** cc-ci `/` was 91% (2.4G free) at phase start; freed orphaned Phase-2
cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in
Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
- Stable-domain scheme (proposed, see DECISIONS): `warm-<recipe>.ci.commoninternet.net`, distinct
from cold `<recipe[:4]>-<6hex>`.
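
The two naming schemes in the last note can be told apart mechanically, which is useful when deciding whether a leftover stack is a warm canonical or an orphaned cold app. A minimal sketch under stated assumptions: the helper names are hypothetical, the cold run hex is assumed to be 6 random hex characters, and names that fit neither scheme (such as `lasu-dbg`) are reported as `unknown`.

```python
import re
import secrets
from typing import Optional

COLD_NAME = re.compile(r"^[a-z0-9-]{1,4}-[0-9a-f]{6}$")


def warm_name(recipe: str) -> str:
    """Stable app name for a warm canonical."""
    return f"warm-{recipe}"


def cold_name(recipe: str, run_hex: Optional[str] = None) -> str:
    """Per-run app name for a cold deployment: <recipe[:4]>-<6hex>."""
    run_hex = run_hex or secrets.token_hex(3)   # 3 bytes -> 6 hex characters
    return f"{recipe[:4]}-{run_hex}"


def classify(name: str) -> str:
    """Classify an app name or domain as 'warm', 'cold' or 'unknown'."""
    label = name.split(".")[0]
    # Warm is checked first; a recipe whose name starts with "warm" would be
    # ambiguous under this sketch.
    if label.startswith("warm-"):
        return "warm"
    if COLD_NAME.match(label):
        return "cold"
    return "unknown"
```

For example, `classify("warm-custom-html.ci.commoninternet.net")` yields `"warm"`, and `classify("lasu-0a6fb2")` yields `"cold"`.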