Files
cc-ci/machine-docs/STATUS-2w.md

88 lines
6.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# STATUS — Phase 2w (warm canonical deployments + `--quick` CI mode)
**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
**Loop state for THIS phase:** STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared).
Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state.
Phase 2 is **PAUSED** (STATUS-2/BACKLOG-2 intact) and resumes after 2w `## DONE`.
## Phase
Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe
canonicals at stable domains, known-good snapshots, an opt-in `--quick` fast lane that reattaches the
canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a
nightly full-cold sweep. Definition of Done = WC1WC9 (plan §1), each Adversary cold-verified.
## Definition of Done (Phase 2w) — WC1WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w
- [ ] **WC1** — Live-warm keycloak (SSO dep) at a stable domain, **UNPINNED** (fetch latest + chaos
deploy, like traefik; keep secret-generate-only-if-missing + health-wait); dependents
create+delete per-run namespaced realms; concurrent dependents don't collide; leftover realms reaped.
- [ ] **WC1.1** — Health-gated deploy-with-rollback in warm/infra reconcilers (traefik+keycloak):
record last-good → deploy latest → health-check → healthy commits last-good:=latest; unhealthy
rolls back + alerts. Stateful (keycloak): snapshot data volume before upgrade, restore on
rollback (reuse WC3 helper). traefik = version rollback only.
- [ ] **WC1.2** — Pre-deploy safety gate: auto-apply only non-major/no-manual-migration bumps; a
MAJOR bump or manual-migration release notes → stay on current + alert with notes (no silent auto-upgrade).
- [ ] **WC2** — Data-warm canonical model: per-recipe canonical at a stable domain, declarative
registry tracking recipe→known-good commit; re-warmable from scratch.
- [ ] **WC3** — Known-good snapshots: raw volume copy taken while undeployed under stable path; one
last-known-good per app, atomic replace; restore proven to round-trip data.
- [ ] **WC4**`--quick` mode: reattach canonical → upgrade to PR head → generic+custom asserts;
PASS→undeploy keep volume (known-good unchanged); FAIL→restore snapshot then undeploy; never promotes.
- [ ] **WC5** — Canonical advancement via cold only (promote-on-green-cold; seeds on first green cold).
- [ ] **WC6** — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
- [ ] **WC7** — Trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge;
results carry mode; clean no-canonical fallback.
- [ ] **WC8** — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via
per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
- [ ] **WC9** — Docs + cold verify incl. the rollback proof (deliberately fail a PR under `--quick`,
confirm last-known-good restored intact; a `--quick` pass did not move the known-good).
## Milestones (plan §3)
- **W0** — Warm keycloak (WC1). ← IN FLIGHT
- **W1** — Canonical registry + snapshot/restore (WC2, WC3).
- **W2** — `--quick` mode (WC4, WC7).
- **W3** — Cold-advances-canonical + nightly sweep (WC5, WC6).
- **W4** — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.
## In flight
**W0 — live-warm keycloak (WC1).** Done so far (commits up to 88c1114):
- W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit pass).
- W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).
- **WC1 core mechanism PROVEN** deploy-free on the live warm keycloak: realm create → password-grant
JWT → discovery issuer → delete(idempotent) → reap(keeps live hex / deletes orphan). All PASS.
- W0.3 declarative reconciler `nix/modules/warm-keycloak.nix` up; `nixos-rebuild switch`
warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned +
skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)
**Re-sequenced after the 2026-05-28/29 design update (unpin + WC1.1 rollback + WC1.2 safety gate):**
WC1.1's keycloak rollback needs the **WC3 snapshot/restore helper**, so build that FIRST, then
rewrite the reconciler ONCE into the unpinned + safety-gated + health-gated-with-rollback form. Next:
1. **WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`): raw copy of an app's data
volume(s) while undeployed, under `/var/lib/ci-warm/<recipe>/`, atomic replace, one last-good;
restore round-trips data. + unit tests + live round-trip proof.
2. Rewrite reconciler: unpin keycloak (fetch latest + chaos); WC1.2 safety gate (major / manual-
migration → hold + alert); WC1.1 record last-good → (keycloak: undeploy→snapshot→deploy latest) →
health-gate → commit-or-rollback+restore+alert.
3. Settle the **alert mechanism** (bash reconciler can't call agent PushNotification — sentinel file
the Builder loop relays, see DECISIONS).
4. Resolve the lasuite-docs in-place-redeploy race (BUILD finding below) OR pick a more-robust
dependent, then the headline WC1 e2e (dependent SSO green vs warm keycloak) + concurrency proof.
**Build finding (mine, to fix):** lasuite-docs `setup_custom_tests` in-place `abra app deploy
--force --chaos` (OIDC wiring) fails: nginx `web` fatally exits `[emerg] host not found in upstream
...backend:8000` during the rolling restart → abra converge times out. Independent of warm/cold
keycloak. Blocks the WC1 dependent-green proof until fixed/worked-around.
## Gate
(none claimed yet)
## Blocked
(none)
## Notes
- **Disk budget (WC8 watch):** cc-ci `/` was 91% (2.4G free) at phase start; freed orphaned Phase-2
cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in
Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
- Stable-domain scheme (proposed, see DECISIONS): `warm-<recipe>.ci.commoninternet.net`, distinct
from cold `<recipe[:4]>-<6hex>`.
</content>