# STATUS — Phase 2w (warm canonical deployments + `--quick` CI mode)
**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
**Loop state for THIS phase:** STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared).
Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state.
Phase 2 is **PAUSED** (STATUS-2/BACKLOG-2 intact) and resumes after 2w `## DONE`.
## Phase
Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe
canonicals at stable domains, known-good snapshots, an opt-in `--quick` fast lane that reattaches the
canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a
nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversary cold-verified.
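The `--quick` fast lane described above can be sketched as a small control flow. This is only an illustration under stated assumptions: `QuickSteps`, `run_quick` and the hook names are hypothetical and do not come from the harness; the PASS/FAIL handling follows the WC4 wording in the Definition of Done.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QuickSteps:
    # Hypothetical hooks; the real harness functions are not named in this file.
    reattach: Callable[[], None]          # redeploy the canonical on its kept volume
    upgrade_to_head: Callable[[], None]   # upgrade the running app to the PR head
    run_asserts: Callable[[], bool]       # generic + custom asserts, True when green
    restore_snapshot: Callable[[], None]  # put the last-known-good data back
    undeploy: Callable[[], None]          # stop the app, keeping its volume


def run_quick(steps: QuickSteps) -> str:
    """Run the fast lane once and return "pass" or "fail".

    There is deliberately no promote step: only a green cold run may advance
    the known-good snapshot (WC5).
    """
    try:
        steps.reattach()
        steps.upgrade_to_head()
        passed = steps.run_asserts()
    except Exception:
        passed = False  # assumption: a failed reattach/upgrade counts as FAIL
    if not passed:
        steps.restore_snapshot()  # FAIL: restore the snapshot, then undeploy
    steps.undeploy()              # PASS and FAIL both end undeployed
    return "pass" if passed else "fail"
```

Passing the steps in as callables keeps the flow testable without a live deployment; the real orchestrator may structure this differently.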
## Definition of Done (Phase 2w) — WC1–WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w
- [ ] **WC1** — Live-warm keycloak (SSO dep) at a stable domain, **UNPINNED** (fetch latest + chaos
deploy, like traefik; keep secret-generate-only-if-missing + health-wait); dependents
create+delete per-run namespaced realms; concurrent dependents don't collide; leftover realms reaped.
- [ ] **WC1.1** — Health-gated deploy-with-rollback in warm/infra reconcilers (traefik+keycloak):
record last-good → deploy latest → health-check → healthy commits last-good:=latest; unhealthy
rolls back + alerts. Stateful (keycloak): snapshot data volume before upgrade, restore on
rollback (reuse WC3 helper). traefik = version rollback only.
- [ ] **WC1.2** — Pre-deploy safety gate: auto-apply only non-major/no-manual-migration bumps; a
MAJOR bump or manual-migration release notes → stay on current + alert with notes (no silent auto-upgrade).
- [ ] **WC2** — Data-warm canonical model: per-recipe canonical at a stable domain, declarative
registry tracking recipe→known-good commit; re-warmable from scratch.
- [ ] **WC3** — Known-good snapshots: raw volume copy taken while undeployed under stable path; one
last-known-good per app, atomic replace; restore proven to round-trip data.
- [ ] **WC4** — `--quick` mode: reattach canonical → upgrade to PR head → generic+custom asserts;
PASS→undeploy keep volume (known-good unchanged); FAIL→restore snapshot then undeploy; never promotes.
- [ ] **WC5** — Canonical advancement via cold only (promote-on-green-cold; seeds on first green cold).
- [ ] **WC6** — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
- [ ] **WC7** — Trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge;
results carry mode; clean no-canonical fallback.
- [ ] **WC8** — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via
per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
- [ ] **WC9** — Docs + cold verify incl. the rollback proof (deliberately fail a PR under `--quick`,
confirm last-known-good restored intact; a `--quick` pass did not move the known-good).
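As an illustration of the WC1.2 pre-deploy safety gate, here is a minimal sketch. It assumes version tags of the form `<recipe-semver>+<app-version>` (as in `10.7.1+26.6.2`) and uses a simple keyword match on the release notes; the function names, the `"apply"` result and the keyword heuristic are assumptions, not the logic in `runner/warm_reconcile.py`. The `held-major` and `held-manual-migration` labels are the ones reported in this file.

```python
import re

# Assumption: manual-migration notes are detected by a keyword match.
MANUAL_MIGRATION = re.compile(r"manual[- ]migration", re.IGNORECASE)


def parse_majors(tag: str) -> tuple[int, int]:
    """Return (recipe_major, app_major) for a tag such as '10.7.1+26.6.2'."""
    recipe, sep, app = tag.partition("+")
    if not sep:
        raise ValueError(f"expected '<recipe>+<app>' tag, got {tag!r}")
    return int(recipe.split(".")[0]), int(app.split(".")[0])


def safety_gate(current: str, latest: str, release_notes: str = "") -> str:
    """Decide whether `latest` may be auto-applied over `current`."""
    cur_recipe, cur_app = parse_majors(current)
    new_recipe, new_app = parse_majors(latest)
    if new_recipe > cur_recipe or new_app > cur_app:
        return "held-major"             # stay on current + alert
    if MANUAL_MIGRATION.search(release_notes):
        return "held-manual-migration"  # stay on current + alert with notes
    return "apply"                      # non-major bump, no manual migration
```

For example, `safety_gate("10.7.1+26.6.2", "10.7.9+26.6.2")` returns `"apply"`, while a bump to `11.0.0+26.6.2` returns `"held-major"`.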
## Milestones (plan §3)
- **W0** — Warm keycloak (WC1). ← IN FLIGHT
- **W1** — Canonical registry + snapshot/restore (WC2, WC3).
- **W2** — `--quick` mode (WC4, WC7).
- **W3** — Cold-advances-canonical + nightly sweep (WC5, WC6).
- **W4** — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.
## In flight
**W0 — live-warm keycloak (WC1).** Done so far (commits up to 88c1114):
- W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit pass).
- W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).
- **WC1 core mechanism PROVEN** deploy-free on the live warm keycloak: realm create → password-grant
JWT → discovery issuer → delete(idempotent) → reap(keeps live hex / deletes orphan). All PASS.
- W0.3 declarative reconciler `nix/modules/warm-keycloak.nix` up; `nixos-rebuild switch` →
warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned +
skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)
- **W0.5 WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`) DONE (4cc1e15). +5 unit tests
(48 unit pass). **LIVE round-trip PROVEN on warm keycloak**: marker realm → undeploy → snapshot
(mariadb+providers) → deploy → delete marker (mutate DB) → undeploy → restore → deploy → marker
realm BACK; keycloak healthy. Snapshots under `/var/lib/ci-warm/<recipe>/`, atomic, one last-good.
- **W0.6 reconciler rewrite** DONE (a044abb). `runner/warm_reconcile.py` (python, packaged into the
nix store, replaces the bash reconcile): UNPIN keycloak (deploy latest version TAG; recipe fetched
at runtime → D8 closure byte-identical); WC1.2 pre-deploy safety gate (major recipe/app bump OR
releaseNotes manual-migration → hold + alert, no churn); WC1.1 health-gated upgrade-with-rollback
scaffold (record last-good → keycloak undeploy→snapshot→deploy latest → health-gate →
commit-or-restore+redeploy-prior+alert). Alerts = `/var/lib/ci-warm/alerts/*.json`. +8 unit tests
(56 unit pass). PROVEN live: `nixos-rebuild switch` → warm-keycloak.service runs the python
reconciler → noop-healthy (system 0-failed, 200); **WC1.2 holds proven** (MAJOR → held-major,
keycloak untouched; minor+manual-migration notes → held-manual-migration, alert carries notes).
- **W0.9 WC1.1 live proofs** DONE (32f0071). PROVEN on warm keycloak (annotated fake tags +
CCCI_SKIP_FETCH): (a) healthy upgrade 10.7.1→10.7.9 — snapshot+deploy+health-pass, last_good
committed, marker preserved; (b) **marquee rollback** — broken latest 10.7.10 → deploy fails →
rollback to 10.7.9, HEALTHY, marker realm INTACT (data preserved), last_good NOT advanced, rollback
alert written (attempted=10.7.10,last_good=10.7.9,recovered=True); recovered to canonical
10.7.1+26.6.2. Fixed 4 issues live (deploy-fail→rollback, warmsnap last_good subdir, wait_undeployed
swarm-settle, abra-stdout capture). 57 unit pass. **Reconciler-side WC1/WC1.1/WC1.2 proven.**
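The WC1.1 sequence listed under W0.6/W0.9 (record last-good → undeploy → snapshot → deploy latest → health-gate → commit, or restore + redeploy prior + alert) can be sketched as follows. This is a hedged illustration, not the code in `runner/warm_reconcile.py`: the `app` object and its methods are hypothetical, and only the `upgraded:` / `rolled-back:` prefixes and the alert fields `attempted`, `last_good`, `recovered` are taken from this file.

```python
def upgrade_with_rollback(app, state: dict, latest: str) -> str:
    """Health-gated upgrade of a stateful app; returns an outcome string.

    `app` is a hypothetical object exposing undeploy(), snapshot(), restore(),
    deploy(version), healthy() and alert(payload). `state["last_good"]` holds
    the last version that passed the health gate.
    """
    prior = state["last_good"]  # record last-good before touching anything
    app.undeploy()
    app.snapshot()              # raw volume copy is taken while undeployed
    try:
        app.deploy(latest)
        healthy = app.healthy()
    except Exception:
        healthy = False         # a failed deploy is handled like a failed health check
    if healthy:
        state["last_good"] = latest  # commit: last-good := latest
        return f"upgraded:{latest}"
    # Unhealthy: put the data back, return to the prior version, and alert.
    app.undeploy()
    app.restore()
    app.deploy(prior)
    app.alert({"attempted": latest, "last_good": prior, "recovered": app.healthy()})
    return f"rolled-back:{prior}"
```

For a stateless app such as traefik, the `snapshot()` and `restore()` calls would be no-ops, leaving a version-only rollback.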
**Adversary reproduce (W0.9):** on cc-ci, with the keycloak recipe clone, create annotated fake
tags (peel `^{}`, set git identity) `10.7.9+26.6.2`(=good commit) and `10.7.10+26.6.2`(broken
KC_HOSTNAME), then `CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak` twice; observe
`upgraded:` then `rolled-back:`, marker realm survives, `/var/lib/ci-warm/keycloak/last_good`
unchanged at the prior version, a `*rollback*.json` alert under `/var/lib/ci-warm/alerts/`.
**Next (remaining for WC1 gate):**
1. **W0.7** — fix the lasuite-docs in-place chaos-redeploy nginx-upstream race (`host not found in
upstream ...backend:8000`) OR pick a more-robust SSO dependent for the headline proof.
2. **W0.8** — headline WC1 e2e: dependent SSO custom test green vs warm keycloak + concurrent
distinct realms (no collision) + reaping. → claim WC1/WC1.1/WC1.2.
3. **Builder-loop alert relay** (deferred wiring) — on each wake, scan `/var/lib/ci-warm/alerts/*.json`,
PushNotification + record + archive to `alerts/seen/`; wire when nightly WC6 lands (first real alert).
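A minimal sketch of the deferred alert relay from item 3, assuming alerts are plain JSON files directly under the alerts directory and that archiving means moving them into a `seen/` subdirectory. `relay_alerts` and the `notify` callback are hypothetical names; the actual push mechanism is not specified here.

```python
import json
import shutil
from pathlib import Path
from typing import Callable


def relay_alerts(alerts_dir: Path, notify: Callable[[dict], None]) -> list[str]:
    """Notify on every pending alert, then archive it under alerts_dir/seen/.

    Returns the relayed file names so the caller can record them.
    """
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = []
    # glob() is not recursive, so files already in seen/ are never rescanned.
    for path in sorted(alerts_dir.glob("*.json")):
        try:
            payload = json.loads(path.read_text())
        except json.JSONDecodeError:
            payload = {"unparseable": path.name}  # still surface a broken alert
        notify(payload)
        shutil.move(str(path), str(seen / path.name))
        relayed.append(path.name)
    return relayed
```

On each wake the builder loop could call `relay_alerts(Path("/var/lib/ci-warm/alerts"), notify)` with whatever push function it already has.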
**Build finding (mine, to fix):** lasuite-docs `setup_custom_tests` in-place `abra app deploy
--force --chaos` (OIDC wiring) fails: nginx `web` fatally exits `[emerg] host not found in upstream
...backend:8000` during the rolling restart → abra converge times out. Independent of warm/cold
keycloak. Blocks the WC1 dependent-green proof until fixed/worked-around.
## Gate
(none claimed yet)
## Blocked
(none)
## Notes
- **Disk budget (WC8 watch):** cc-ci `/` was 91% (2.4G free) at phase start; freed orphaned Phase-2
cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in
Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
- Stable-domain scheme (proposed, see DECISIONS): `warm-<recipe>.ci.commoninternet.net`, distinct
from cold `<recipe[:4]>-<6hex>`.
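The proposed naming scheme can be illustrated with a short sketch. The helper names are hypothetical, and generating the 6 hex characters randomly is an assumption; only the two patterns themselves come from the note above.

```python
import secrets

BASE_DOMAIN = "ci.commoninternet.net"


def warm_domain(recipe: str) -> str:
    """Stable domain for a recipe's warm canonical: warm-<recipe>.<base>."""
    return f"warm-{recipe}.{BASE_DOMAIN}"


def cold_name(recipe: str) -> str:
    """Per-run cold name: first four characters of the recipe plus 6 hex digits."""
    return f"{recipe[:4]}-{secrets.token_hex(3)}"  # token_hex(3) yields 6 hex chars
```

For example, `warm_domain("keycloak")` gives `warm-keycloak.ci.commoninternet.net`, while `cold_name("lasuite-docs")` gives a name shaped like the `lasu-0a6fb2` app mentioned in the disk-budget note.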