# STATUS — Phase 2w (warm canonical deployments + `--quick` CI mode)
**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
**Loop state for THIS phase:** STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared).
Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state.
Phase 2 is **PAUSED** (STATUS-2/BACKLOG-2 intact) and resumes after 2w `## DONE`.
## Phase
Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe
canonicals at stable domains, known-good snapshots, an opt-in `--quick` fast lane that reattaches the
canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a
nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversary cold-verified.
## Definition of Done (Phase 2w) — WC1–WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w
- [ ] **WC1** — Live-warm keycloak (SSO dep) at a stable domain, **UNPINNED** (fetch latest + chaos
deploy, like traefik; keep secret-generate-only-if-missing + health-wait); dependents
create+delete per-run namespaced realms; concurrent dependents don't collide; leftover realms reaped.
- [ ] **WC1.1** — Health-gated deploy-with-rollback in warm/infra reconcilers (traefik+keycloak):
record last-good → deploy latest → health-check → healthy commits last-good:=latest; unhealthy
rolls back + alerts. Stateful (keycloak): snapshot data volume before upgrade, restore on
rollback (reuse WC3 helper). traefik = version rollback only.
- [ ] **WC1.2** — Pre-deploy safety gate: auto-apply only non-major/no-manual-migration bumps; a
MAJOR bump or manual-migration release notes → stay on current + alert with notes (no silent auto-upgrade).
- [ ] **WC2** — Data-warm canonical model: per-recipe canonical at a stable domain, declarative
registry tracking recipe→known-good commit; re-warmable from scratch.
- [ ] **WC3** — Known-good snapshots: raw volume copy taken while undeployed under stable path; one
last-known-good per app, atomic replace; restore proven to round-trip data.
- [ ] **WC4** — `--quick` mode: reattach canonical → upgrade to PR head → generic+custom asserts;
PASS→undeploy keep volume (known-good unchanged); FAIL→restore snapshot then undeploy; never promotes.
- [ ] **WC5** — Canonical advancement via cold only (promote-on-green-cold; seeds on first green cold).
- [ ] **WC6** — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
- [ ] **WC7** — Trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge;
results carry mode; clean no-canonical fallback.
- [ ] **WC8** — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via
per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
- [ ] **WC9** — Docs + cold verify incl. the rollback proof (deliberately fail a PR under `--quick`,
confirm last-known-good restored intact; a `--quick` pass did not move the known-good).
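
Illustrative sketch of the WC1.2 pre-deploy safety gate (not the code in `runner/warm_reconcile.py`). It assumes a single dotted version string per tag, whereas the real gate checks both recipe and app versions. `MIGRATION_HINTS`, `major_of`, `gate` and the `noop`/`apply` outcomes are hypothetical names; `held-major` and `held-manual-migration` are the outcomes recorded in this file.

```python
import re

# Hypothetical phrases treated as a manual-migration signal; the real
# reconciler's matching rules are not specified in this document.
MIGRATION_HINTS = ("manual migration", "manual-migration", "migrate manually")


def major_of(version: str) -> int:
    """Return the leading integer of a version tag such as 'v2.3.1' or '26.0.1'."""
    match = re.match(r"v?(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return int(match.group(1))


def gate(current: str, latest: str, release_notes: str = "") -> str:
    """Classify a candidate bump: noop, held-major, held-manual-migration or apply."""
    if current == latest:
        return "noop"
    if major_of(latest) > major_of(current):
        return "held-major"  # stay on current + alert
    notes = release_notes.lower()
    if any(hint in notes for hint in MIGRATION_HINTS):
        return "held-manual-migration"  # stay on current + alert with notes
    return "apply"  # non-major, no manual migration: safe to auto-apply
```

Keeping the gate a pure function of (current, latest, notes) makes the hold decisions unit-testable without touching a deployment.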
## Milestones (plan §3)
- **W0** — Warm keycloak (WC1). ← IN FLIGHT
- **W1** — Canonical registry + snapshot/restore (WC2, WC3).
- **W2** — `--quick` mode (WC4, WC7).
- **W3** — Cold-advances-canonical + nightly sweep (WC5, WC6).
- **W4** — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.
## In flight
**W0 — live-warm keycloak (WC1).** Done so far (commits up to 88c1114):
- W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit pass).
- W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).
- **WC1 core mechanism PROVEN** deploy-free on the live warm keycloak: realm create → password-grant
JWT → discovery issuer → delete(idempotent) → reap(keeps live hex / deletes orphan). All PASS.
- W0.3 declarative reconciler `nix/modules/warm-keycloak.nix` up; `nixos-rebuild switch` →
warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned +
skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)
- **W0.5 WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`) DONE (4cc1e15). +5 unit tests
(48 unit pass). **LIVE round-trip PROVEN on warm keycloak**: marker realm → undeploy → snapshot
(mariadb+providers) → deploy → delete marker (mutate DB) → undeploy → restore → deploy → marker
realm BACK; keycloak healthy. Snapshots under `/var/lib/ci-warm/<recipe>/`, atomic, one last-good.
- **W0.6 reconciler rewrite** DONE (a044abb). `runner/warm_reconcile.py` (python, packaged into the
nix store, replaces the bash reconcile): UNPIN keycloak (deploy latest version TAG; recipe fetched
at runtime → D8 closure byte-identical); WC1.2 pre-deploy safety gate (major recipe/app bump OR
releaseNotes manual-migration → hold + alert, no churn); WC1.1 health-gated upgrade-with-rollback
scaffold (record last-good → keycloak undeploy→snapshot→deploy latest → health-gate →
commit-or-restore+redeploy-prior+alert). Alerts = `/var/lib/ci-warm/alerts/*.json`. +8 unit tests
(56 unit pass). PROVEN live: `nixos-rebuild switch` → warm-keycloak.service runs the python
reconciler → noop-healthy (system 0-failed, 200); **WC1.2 holds proven** (MAJOR → held-major,
keycloak untouched; minor+manual-migration notes → held-manual-migration, alert carries notes).
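
Illustrative sketch of the WC1.1 control flow described in W0.6 (record last-good → undeploy → snapshot → deploy latest → health-gate → commit or restore + redeploy prior + alert). All callables are injected and hypothetical; the real reconciler's function names and signatures are not shown in this file. The second `undeploy()` before `restore()` is an assumption, taken from the W0.5 round-trip, which restores while undeployed.

```python
from typing import Callable


def upgrade_with_rollback(
    last_good: str,
    latest: str,
    *,
    undeploy: Callable[[], None],
    snapshot: Callable[[], None],
    restore: Callable[[], None],
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    alert: Callable[[str], None],
) -> str:
    """Try `latest`; return the version to record as last-good afterwards."""
    undeploy()   # stateful app: stop before copying the data volume
    snapshot()
    deploy(latest)
    if healthy():
        return latest  # commit: last-good := latest
    # Unhealthy: roll back data and version, then alert.
    undeploy()
    restore()
    deploy(last_good)
    alert(f"rollback: {latest} unhealthy, restored {last_good}")
    return last_good
```

For a stateless service such as traefik, `snapshot` and `restore` could be no-ops, which leaves a version-only rollback.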
**Next:**
1. **W0.9 — WC1.1 live proofs** (deploy cycles): (a) healthy upgrade — stage a fake newer tag
(re-tag of current → same healthy image) → reconcile snapshots + deploys + commits last-good;
(b) **rollback (marquee)** — stage a fake newer tag with a BROKEN compose (bad KC_HOSTNAME →
crash-loop) → reconcile snapshots → deploys broken → health-gate fails → restores snapshot +
redeploys prior → healthy + data intact (marker realm) + alert written + last_good NOT advanced.
2. **W0.7** — fix the lasuite-docs in-place-redeploy nginx-upstream race OR pick a more-robust SSO
dependent for the headline proof.
3. **W0.8** — headline WC1 e2e: dependent SSO custom test green vs warm keycloak + concurrent
distinct realms (no collision) + reaping. → claim WC1/WC1.1/WC1.2.
4. **Builder-loop alert relay** — on each wake, scan `/var/lib/ci-warm/alerts/*.json`, PushNotification
+ record + archive to `alerts/seen/` (wire when first real alert can occur, i.e. with nightly WC6).
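
Illustrative sketch of the alert relay in item 4: scan the alerts directory, hand each alert to a notifier, then archive it under `seen/`. `relay_alerts` and the `notify` callable are hypothetical, and the alert JSON schema is not specified in this file, so the payload is passed through unchanged.

```python
import json
from pathlib import Path
from typing import Callable


def relay_alerts(alerts_dir: Path, notify: Callable[[dict], None]) -> int:
    """Notify for each pending alert, archive it to alerts_dir/seen/, return the count."""
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    count = 0
    # glob("*.json") is non-recursive, so files already in seen/ are skipped.
    for path in sorted(alerts_dir.glob("*.json")):
        notify(json.loads(path.read_text()))
        path.rename(seen / path.name)  # archive only after notify succeeded
        count += 1
    return count
```

On a wake this might be called as `relay_alerts(Path("/var/lib/ci-warm/alerts"), push)`, where `push` stands in for whatever notification mechanism the builder loop uses.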
**Build finding (mine, to fix):** lasuite-docs `setup_custom_tests` in-place `abra app deploy
--force --chaos` (OIDC wiring) fails: nginx `web` fatally exits `[emerg] host not found in upstream
...backend:8000` during the rolling restart → abra converge times out. Independent of warm/cold
keycloak. Blocks the WC1 dependent-green proof until fixed/worked-around.
## Gate
(none claimed yet)
## Blocked
(none)
## Notes
- **Disk budget (WC8 watch):** cc-ci `/` was 91% (2.4G free) at phase start; freed orphaned Phase-2
cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in
Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
- Stable-domain scheme (proposed, see DECISIONS): `warm-<recipe>.ci.commoninternet.net`, distinct
from cold `<recipe[:4]>-<6hex>`.
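
Illustrative sketch of the proposed naming scheme, assuming the 6-hex suffix is random per run. `warm_domain` and `cold_name` are hypothetical helper names, and whether the cold name is further suffixed with a base domain is not stated here.

```python
import secrets

BASE_DOMAIN = "ci.commoninternet.net"


def warm_domain(recipe: str) -> str:
    """Stable domain for a recipe's warm canonical."""
    return f"warm-{recipe}.{BASE_DOMAIN}"


def cold_name(recipe: str) -> str:
    """Per-run cold name: first four chars of the recipe plus six hex digits."""
    return f"{recipe[:4]}-{secrets.token_hex(3)}"  # 3 bytes -> 6 hex chars
```

Because warm names start with the fixed `warm-` prefix plus the full recipe name, they cannot collide with the truncated, hex-suffixed cold names for the recipes listed in this file.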