Files
cc-ci/machine-docs/STATUS-2w.md
autonomic-bot 4ce80f8751 claim(2w): W1 gate WC2+WC3 CLAIMED — data-warm canonical proven (custom-html round-trip: undeploy-keep-volume → reattach → data survives)
W1.2: enrolled custom-html (recipe_meta.WARM_CANONICAL); live proof ALL PASS
(seed canonical → idle-with-volume-retained → re-warm → marker survived).
WC2 (registry+data-warm model) + WC3 (snapshot+restore) proven. 61 unit pass.
custom-html now the first real data-warm canonical (idle).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 02:23:22 +01:00

240 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# STATUS — Phase 2w (warm canonical deployments + `--quick` CI mode)
**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
**Loop state for THIS phase:** STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared).
Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state.
Phase 2 is **PAUSED** (STATUS-2/BACKLOG-2 intact) and resumes after 2w `## DONE`.
## Phase
Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe
canonicals at stable domains, known-good snapshots, an opt-in `--quick` fast lane that reattaches the
canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a
nightly full-cold sweep. Definition of Done = WC1WC9 (plan §1), each Adversary cold-verified.
## Definition of Done (Phase 2w) — WC1WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w
- [x] **WC1** — Live-warm UNPINNED keycloak; per-run namespaced realms (create+delete); concurrent
distinct realms; orphan realms reaped. **Adversary PASS @2026-05-29** (REVIEW-2w, gate 985686f).
- [~] **WC1.1** — Health-gated deploy-with-rollback. **keycloak (stateful) — Adversary PASS
@2026-05-29** (marquee: broken latest → snapshot→restore→prior, data intact, last_good held,
alert). **traefik (stateless, version-rollback-only) — NOT yet migrated = W0.10**, MUST close
before Phase-2w DONE (Adversary will require a cold proof).
- [x] **WC1.2** — Pre-deploy safety gate (major / manual-migration → hold + alert with notes, no
churn, short-circuits before WC1.1). **Adversary PASS @2026-05-29**.
- [x] **WC2** — Data-warm canonical model: per-recipe canonical at stable domain `warm-<recipe>`,
declarative registry (canonical.json + recipe_meta.WARM_CANONICAL) tracking recipe→known-good
version/commit; data-warm (undeployed-when-idle, volume retained); re-warmable via seed_canonical.
Proven on custom-html (W1.2). **CLAIMED — see Gate below.**
- [x] **WC3** — Known-good snapshots: raw per-volume tar taken while undeployed under
`/var/lib/ci-warm/<recipe>/snapshot/`; one last-good per app, atomic subdir swap; restore
round-trips data (W0.5 mutate→restore proof + W1.2 data-warm reattach). **CLAIMED — see Gate.**
- [ ] **WC4**`--quick` mode: reattach canonical → upgrade to PR head → generic+custom asserts;
PASS→undeploy keep volume (known-good unchanged); FAIL→restore snapshot then undeploy; never promotes.
- [ ] **WC5** — Canonical advancement via cold only (promote-on-green-cold; seeds on first green cold).
- [ ] **WC6** — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
- [ ] **WC7** — Trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge;
results carry mode; clean no-canonical fallback.
- [ ] **WC8** — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via
per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
- [ ] **WC9** — Docs + cold verify incl. the rollback proof (deliberately fail a PR under `--quick`,
confirm last-known-good restored intact; a `--quick` pass did not move the known-good).
## Milestones (plan §3)
- **W0** — Warm keycloak (WC1/WC1.1-keycloak/WC1.2). ✅ Adversary PASS @2026-05-29.
- **W1** — Canonical registry + snapshot/restore (WC2, WC3). ← IN FLIGHT
- **W1** — Canonical registry + snapshot/restore (WC2, WC3).
- **W2** — `--quick` mode (WC4, WC7).
- **W3** — Cold-advances-canonical + nightly sweep (WC5, WC6).
- **W4** — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.
## In flight
**W0 — live-warm keycloak (WC1).** Done so far (commits up to 88c1114):
- W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit pass).
- W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).
- **WC1 core mechanism PROVEN** deploy-free on the live warm keycloak: realm create → password-grant
JWT → discovery issuer → delete(idempotent) → reap(keeps live hex / deletes orphan). All PASS.
- W0.3 declarative reconciler `nix/modules/warm-keycloak.nix` up; `nixos-rebuild switch`
warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned +
skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)
- **W0.5 WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`) DONE (4cc1e15). +5 unit tests
(48 unit pass). **LIVE round-trip PROVEN on warm keycloak**: marker realm → undeploy → snapshot
(mariadb+providers) → deploy → delete marker (mutate DB) → undeploy → restore → deploy → marker
realm BACK; keycloak healthy. Snapshots under `/var/lib/ci-warm/<recipe>/`, atomic, one last-good.
- **W0.6 reconciler rewrite** DONE (a044abb). `runner/warm_reconcile.py` (python, packaged into the
nix store, replaces the bash reconcile): UNPIN keycloak (deploy latest version TAG; recipe fetched
at runtime → D8 closure byte-identical); WC1.2 pre-deploy safety gate (major recipe/app bump OR
releaseNotes manual-migration → hold + alert, no churn); WC1.1 health-gated upgrade-with-rollback
scaffold (record last-good → keycloak undeploy→snapshot→deploy latest → health-gate →
commit-or-restore+redeploy-prior+alert). Alerts = `/var/lib/ci-warm/alerts/*.json`. +8 unit tests
(56 unit pass). PROVEN live: `nixos-rebuild switch` → warm-keycloak.service runs the python
reconciler → noop-healthy (system 0-failed, 200); **WC1.2 holds proven** (MAJOR → held-major,
keycloak untouched; minor+manual-migration notes → held-manual-migration, alert carries notes).
- **W0.9 WC1.1 live proofs** DONE (32f0071). PROVEN on warm keycloak (annotated fake tags +
CCCI_SKIP_FETCH): (a) healthy upgrade 10.7.1→10.7.9 — snapshot+deploy+health-pass, last_good
committed, marker preserved; (b) **marquee rollback** — broken latest 10.7.10 → deploy fails →
rollback to 10.7.9, HEALTHY, marker realm INTACT (data preserved), last_good NOT advanced, rollback
alert written (attempted=10.7.10,last_good=10.7.9,recovered=True); recovered to canonical
10.7.1+26.6.2. Fixed 4 issues live (deploy-fail→rollback, warmsnap last_good subdir, wait_undeployed
swarm-settle, abra-stdout capture). 57 unit pass. **Reconciler-side WC1/WC1.1/WC1.2 proven.**
**Adversary reproduce (W0.9):** on cc-ci, with the keycloak recipe clone, create annotated fake
tags (peel `^{}`, set git identity) `10.7.9+26.6.2`(=good commit) and `10.7.10+26.6.2`(broken
KC_HOSTNAME), then `CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak` twice; observe
`upgraded:` then `rolled-back:`, marker realm survives, `/var/lib/ci-warm/keycloak/last_good`
unchanged at the prior version, a `*rollback*.json` alert under `/var/lib/ci-warm/alerts/`.
**W0 COMPLETE — Adversary PASS @2026-05-29.** Now in **W1 (canonical registry, WC2/WC3)**.
**W1 progress:** W1.1 canonical registry module DONE (b6ef83a) — `runner/harness/canonical.py`
(enrollment via recipe_meta.WARM_CANONICAL, registry canonical.json, deploy/undeploy-keep-volume/
seed lifecycle) + 4 unit tests (61 unit pass). **Next: W1.2** — enroll custom-html
(`tests/custom-html/recipe_meta.py: WARM_CANONICAL=True`) + LIVE data-warm proof: seed a
warm-custom-html canonical with content → undeploy-keep-volume (verify volume retained, app down) →
deploy_canonical (reattach) → assert the written content survives; re-warmable from scratch. Then
close WC2/WC3.
**W1 plan (WC2 data-warm canonical model + WC3 closure):**
- WC2: a declarative **canonical registry** — which recipes are canonical + at which known-good
commit/version — with each canonical app at a **stable domain `warm-<recipe>`**, kept **data-warm**
(undeployed-when-idle, data volume retained). Re-warmable from scratch (cache). Reconciler/registry
declared in-repo.
- WC3: snapshots (warmsnap, W0.5 — done) tied to canonicals: one last-good per canonical under
`/var/lib/ci-warm/<recipe>/`, restore proven (done). Close WC3 with the canonical model.
- Distinguish from W0's live-warm keycloak: canonicals are DATA-warm (undeployed when idle), keycloak
is LIVE-warm (always up). Both use the `warm-<recipe>` stable scheme.
**Tracked before Phase-2w DONE (not blocking W1):**
- **W0.10a — traefik WC1.1** (Adversary requires a cold proof): migrate `proxy.nix` onto the shared
health-gated reconciler (stateless = version-rollback-only; preserve cert-secret/WILDCARDS_ENABLED/
COMPOSE_FILE setup). CAREFUL — traefik serves all TLS; deploy/test only in a quiet window.
- **W0.10b — Builder-loop alert relay**: each wake, scan `/var/lib/ci-warm/alerts/*.json`
PushNotification → archive to `alerts/seen/`.
**Build finding (RESOLVED):** the W0.4 lasuite-docs `setup_custom_tests` redeploy failure (nginx web
`host not found in upstream ...backend:8000`) was **transient resource contention** from the
since-killed stale Phase-2 run (disk was also tight). On the clean system it converges fine — the
headline e2e is green (below). No recipe/harness change needed.
## Gate
### Gate: WC2 + WC3 — CLAIMED, awaiting Adversary (@2026-05-29, HEAD = see `git log -1`)
**WHAT.** The data-warm canonical model (W1): a declarative per-recipe canonical at the stable domain
`warm-<recipe>.ci.commoninternet.net`, kept **data-warm** (undeployed-when-idle, data volume
retained), tracked by a registry; **known-good snapshots** (raw per-volume tar while undeployed, one
last-good per app, restore round-trips data).
**WHERE (code).** `runner/harness/canonical.py` (registry + data-warm lifecycle), `runner/harness/
warmsnap.py` (snapshot/restore), enrollment `tests/custom-html/recipe_meta.py: WARM_CANONICAL=True`.
State on cc-ci under `/var/lib/ci-warm/<recipe>/` (`canonical.json`, `snapshot/`, retained volume).
**HOW + EXPECTED (cold, from your own clone on cc-ci):**
1. **Units:** `cc-ci-run -m pytest tests/unit -q`**61 passed** (incl. test_canonical, test_warmsnap).
2. **WC2/WC3 data-warm round-trip** (custom-html canonical exists idle now): reproduce with a driver
that uses `runner/harness/canonical.py` — deploy `warm-custom-html.ci.commoninternet.net` @
`1.11.0+1.29.0`, write a marker file into `/usr/share/nginx/html/`, undeploy, `seed_canonical`
(writes `/var/lib/ci-warm/custom-html/canonical.json` + a `snapshot/` while undeployed); confirm
**app UNDEPLOYED but the `content` volume RETAINED** (`docker volume ls | grep warm-custom-html`);
then `deploy_canonical('custom-html')` → the marker **survives** (data-warm reattach). Builder ran
this live: **ALL PASS** (marker `WC2-DATA-MARKER-7f3a9c` survived; registry version=1.11.0+1.29.0;
snapshot present). Current live state: `cat /var/lib/ci-warm/custom-html/canonical.json`
status=idle, version=1.11.0+1.29.0; `docker volume ls` shows
`warm-custom-html_ci_commoninternet_net_content` retained with NO custom-html service running.
3. **WC3 restore round-trip** already cold-verified in the W0.9/W0.5 keycloak proof (snapshot →
mutate DB → restore → data back); same `warmsnap` helper.
4. **D8/WC8:** `/var/lib/ci-warm/` is cache, NOT in the nix closure (no module references it as a
source); re-seeded by cold runs, not restored on rebuild.
**Builder will NOT advance into W2 (`--quick`, which consumes the canonical) past this gate** until
REVIEW-2w shows PASS — but will do non-disruptive W0.10 follow-ups (alert relay) meanwhile.
---
### Gate: WC1 + WC1.2 + WC1.1(keycloak) — ✅ Adversary PASS @2026-05-29 (REVIEW-2w 31ac86d, gate 985686f)
All 6 checks cold-verified from the Adversary's own clone. Builder may proceed to W1. **Tracked open
(must close before Phase-2w DONE, not a blocker now): traefik WC1.1 (W0.10)** — stateless
version-rollback not yet on the shared health-gated reconciler; Adversary will require a cold proof.
(claim detail retained below for the record)
**WHAT.** The live-warm keycloak layer (W0): a persistent **unpinned** keycloak at the stable domain
`warm-keycloak.ci.commoninternet.net`, declaratively reconciled, that SSO-dependent runs use via a
**per-run namespaced realm** (created + deleted) instead of co-deploying; concurrent dependents get
distinct realms; orphan realms are reaped (WC1). The reconciler health-gates auto-upgrades with
snapshot-backed rollback (WC1.1) behind a pre-deploy safety gate for major/manual-migration bumps
(WC1.2).
**WHERE (code).** `runner/warm_reconcile.py` (reconcile logic), `runner/harness/warm.py` (stable
domain, per-run realm naming, reaping), `runner/harness/sso.py` (realm lifecycle), `runner/harness/
warmsnap.py` (snapshot/restore), `runner/run_recipe_ci.py` (warm/cold dep split), `nix/modules/
warm-keycloak.nix` (systemd reconcile unit). Warm state on cc-ci under `/var/lib/ci-warm/`.
**HOW + EXPECTED (cold, from your own clone on cc-ci — tar-sync runner+tests to your /root/<clone>):**
1. **Declarative + unpinned + healthy:** `grep -n kcVersion nix/modules/warm-keycloak.nix` → *no
match* (pin removed; the unit runs `runner/warm_reconcile.py keycloak`). `ssh cc-ci 'systemctl
is-active warm-keycloak.service'``active`; `systemctl is-system-running``running`. Health:
`curl -sk --resolve warm-keycloak.ci.commoninternet.net:443:127.0.0.1
https://warm-keycloak.ci.commoninternet.net/realms/master -o /dev/null -w '%{http_code}'``200`.
D8: a `nixos-rebuild build` closure hash is unaffected by which keycloak version is live (recipe
fetched at runtime).
2. **Units:** `cc-ci-run -m pytest tests/unit -q`**57 passed** (incl. test_warm_realm,
test_warmsnap, test_warm_reconcile).
3. **WC1 headline e2e:** `RECIPE=lasuite-docs STAGES=install,custom cc-ci-run
runner/run_recipe_ci.py` → `install: pass`, `custom: pass`, **`deploy-count = 1 (expect 1)`**
(keycloak NOT co-deployed), log shows `dep: using live-warm keycloak @ warm-keycloak...` and
`dep: deleted per-run realm lasuite-docs-<hex> on warm keycloak`. The 3 custom SSO tests pass
(test_health_check, test_oidc_login_via_keycloak, test_oidc_password_grant_against_dep_keycloak).
After the run, warm keycloak realms = `['master']` only (no leftover); no `lasu*` docker stack.
4. **WC1 concurrency + reaping (deploy-free):** `realm_for("lasuite-docs","lasu-aaa111...")` =
`lasuite-docs-aaa111` and `...bbb222` → distinct (two concurrent same-recipe runs never collide);
create realms aaa111/bbb222/ccc333 on the warm kc, each `oidc_password_grant` returns a JWT;
`sso.reap_orphaned_realms(D, live_hexes={"aaa111"})` deletes exactly bbb222+ccc333 and KEEPS
aaa111. (Builder ran this live: PASS.)
5. **WC1.1 health-gated rollback (live):** with `CCCI_SKIP_FETCH=1` stage two **annotated** fake tags
on `~/.abra/recipes/keycloak` — `10.7.9+26.6.2` at the good commit (`git tag -a -m x 10.7.9+26.6.2
10.7.1+26.6.2^{}`) and `10.7.10+26.6.2` at a commit whose compose.yml has a broken
`KC_HOSTNAME=:::bad-host:::`. Create a marker realm, set last_good, then run `CCCI_SKIP_FETCH=1
cc-ci-run runner/warm_reconcile.py keycloak` twice → first `RECONCILE RESULT: upgraded:...->10.7.9`
(snapshot taken, last_good=10.7.9, marker preserved); second `rolled-back:10.7.10->10.7.9` —
keycloak HEALTHY on 10.7.9, **marker realm INTACT** (data preserved), `/var/lib/ci-warm/keycloak/
last_good` still `10.7.9` (NOT advanced), a `*-rollback.json` alert under `/var/lib/ci-warm/alerts/`
with `attempted=10.7.10 last_good=10.7.9 recovered=true`. (Builder ran this live: ALL PASS; keycloak
restored to canonical 10.7.1+26.6.2.)
6. **WC1.2 pre-deploy safety gate (live):** stage an annotated fake tag with a MAJOR bump
(`11.0.0+27.0.0`) → `CCCI_SKIP_FETCH=1 ... warm_reconcile.py keycloak` → `RECONCILE RESULT:
held-major:...`, a `*-held-major.json` alert written, **keycloak untouched** (TYPE unchanged,
200, no snapshot/deploy churn). Stage a minor tag (`10.7.2+26.6.3`) with `releaseNotes/
10.7.2+26.6.3.md` containing "manual migration" → `held-manual-migration`, alert carries the notes.
(Builder ran both live: held + untouched.)
**SCOPE (honest).** WC1 and WC1.2 are complete. **WC1.1 is proven for keycloak** — the *stateful*
case (snapshot-backed data-integrity rollback), which is the hard part and the Adversary's marquee
proof. **traefik's WC1.1** (stateless = version-rollback-only) is **NOT yet migrated** onto the shared
health-gated reconciler — it still uses the existing `proxy.nix` chaos-deploy reconciler. That
migration is **W0.10** (tracked in BACKLOG-2w), to land before the Phase-2w DONE. If the Adversary
wants WC1.1 fully closed (both reconcilers) before PASS, treat this gate as WC1 + WC1.2 + WC1.1(keycloak).
**Alert delivery note (not blocking):** the reconciler WRITES alert sentinels to
`/var/lib/ci-warm/alerts/*.json` (proven above). The operator-facing relay (Builder loop scans →
PushNotification → archive to `alerts/seen/`) is loop behavior, run each wake when an alert exists;
none currently. "Alert fired" for WC1.1/WC1.2 = sentinel written, which is independently checkable.
**Builder will NOT advance past this gate** (to W1/WC2 canonical registry) until REVIEW-2w shows PASS.
## (prior) Gate
(none before this)
## Blocked
(none)
## Notes
- **Disk budget (WC8 watch):** cc-ci `/` was 91% (2.4G free) at phase start; freed orphaned Phase-2
cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in
Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
- Stable-domain scheme (proposed, see DECISIONS): `warm-<recipe>.ci.commoninternet.net`, distinct
from cold `<recipe[:4]>-<6hex>`.
</content>