# STATUS — Phase 2w (warm canonical deployments + `--quick` CI mode)
**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
**Loop state for THIS phase:** STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared).
Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state.
Phase 2 is **PAUSED** (STATUS-2/BACKLOG-2 intact) and resumes after 2w `## DONE`.

## Phase
Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe
canonicals at stable domains, known-good snapshots, an opt-in `--quick` fast lane that reattaches the
canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a
nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversary cold-verified.
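The split between the `--quick` fast lane and cold runs (detailed in WC4 and WC5 below) can be pictured as a small outcome table: what happens to the canonical after a run, by mode and result. The sketch below is illustrative only. The function name, the action strings, and the ordering of cold teardown versus promotion are assumptions, not taken from the harness.

```python
from typing import List


def post_run_actions(mode: str, passed: bool) -> List[str]:
    """Return follow-up steps for a finished run (hypothetical action names)."""
    if mode == "quick":
        # A quick run never promotes: a pass leaves the known-good untouched,
        # a failure restores the snapshot before undeploying.
        return ["undeploy-keep-volume"] if passed else ["restore-snapshot", "undeploy"]
    if mode == "cold":
        # Only a green cold run may advance (or seed) the canonical.
        return ["teardown", "promote-canonical"] if passed else ["teardown"]
    raise ValueError(f"unknown mode: {mode!r}")
```

The point of keeping this as a pure function is that the "quick never promotes" rule becomes a one-line unit test rather than a property that has to be inferred from deploy logs.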
## Definition of Done (Phase 2w) — WC1–WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w
- [ ] **WC1** — Live-warm keycloak (SSO dep) at a stable domain, **UNPINNED** (fetch latest + chaos
deploy, like traefik; keep secret-generate-only-if-missing + health-wait); dependents
create+delete per-run namespaced realms; concurrent dependents don't collide; leftover realms reaped.
- [ ] **WC1.1** — Health-gated deploy-with-rollback in warm/infra reconcilers (traefik+keycloak):
record last-good → deploy latest → health-check → healthy commits last-good:=latest; unhealthy
rolls back + alerts. Stateful (keycloak): snapshot data volume before upgrade, restore on
rollback (reuse WC3 helper). traefik = version rollback only.
- [ ] **WC1.2** — Pre-deploy safety gate: auto-apply only non-major/no-manual-migration bumps; a
MAJOR bump or manual-migration release notes → stay on current + alert with notes (no silent auto-upgrade).
- [ ] **WC2** — Data-warm canonical model: per-recipe canonical at a stable domain, declarative
registry tracking recipe→known-good commit; re-warmable from scratch.
- [ ] **WC3** — Known-good snapshots: raw volume copy taken while undeployed under stable path; one
last-known-good per app, atomic replace; restore proven to round-trip data.
- [ ] **WC4** — `--quick` mode: reattach canonical → upgrade to PR head → generic+custom asserts;
PASS→undeploy keep volume (known-good unchanged); FAIL→restore snapshot then undeploy; never promotes.
- [ ] **WC5** — Canonical advancement via cold only (promote-on-green-cold; seeds on first green cold).
- [ ] **WC6** — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
- [ ] **WC7** — Trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge;
results carry mode; clean no-canonical fallback.
- [ ] **WC8** — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via
per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
- [ ] **WC9** — Docs + cold verify incl. the rollback proof (deliberately fail a PR under `--quick`,
confirm last-known-good restored intact; a `--quick` pass did not move the known-good).
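The WC1.2 gate and the WC1.1 commit-or-rollback flow listed above could be sketched roughly as follows. This is a minimal illustration, not the contents of `runner/warm_reconcile.py`: the function names, the injected `deploy`/`healthy`/`alert`/`snapshot`/`restore` callables, the `recipe+app` version shape, and the release-note keyword heuristic are all assumptions. The undeploy step that precedes a keycloak snapshot is left out for brevity.

```python
import re
from typing import Callable, Optional, Tuple

# Assumed heuristic for spotting "manual migration" release notes.
MANUAL_HINTS = re.compile(r"manual[ -]migration|migrate manually", re.IGNORECASE)


def majors(version: str) -> Tuple[int, ...]:
    """Major number of each '+'-separated part: '1.2.0+26.0.1' -> (1, 26)."""
    parts = []
    for part in version.split("+"):
        match = re.match(r"v?(\d+)", part.strip())
        if match is None:
            raise ValueError(f"unparseable version: {version!r}")
        parts.append(int(match.group(1)))
    return tuple(parts)


def gate(current: str, latest: str, release_notes: str = "") -> str:
    """WC1.2: decide whether `latest` may be applied automatically."""
    if latest == current:
        return "noop"
    if any(new > old for old, new in zip(majors(current), majors(latest))):
        return "held-major"
    if MANUAL_HINTS.search(release_notes):
        return "held-manual-migration"
    return "apply"


def upgrade(
    last_good: str,
    latest: str,
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    alert: Callable[[str], None],
    snapshot: Optional[Callable[[], None]] = None,
    restore: Optional[Callable[[], None]] = None,
) -> str:
    """WC1.1: deploy `latest` and return the version that is now last-good."""
    if snapshot is not None:      # stateful apps only, e.g. keycloak
        snapshot()
    deploy(latest)
    if healthy():
        return latest             # healthy: last-good := latest
    if restore is not None:       # unhealthy: restore the data first ...
        restore()
    deploy(last_good)             # ... then redeploy the prior version
    alert(f"rollback: {latest} failed health check, back on {last_good}")
    return last_good
```

A caller would run `upgrade(...)` only when `gate(...)` returns `"apply"`. For traefik, `snapshot` and `restore` stay `None`, which gives the version-only rollback described in WC1.1.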
## Milestones (plan §3)
- **W0** — Warm keycloak (WC1). ← IN FLIGHT
- **W1** — Canonical registry + snapshot/restore (WC2, WC3).
- **W2** — `--quick` mode (WC4, WC7).
- **W3** — Cold-advances-canonical + nightly sweep (WC5, WC6).
- **W4** — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.
## In flight
**W0 — live-warm keycloak (WC1).** Done so far (commits up to 88c1114):
- W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit pass).
- W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).
- **WC1 core mechanism PROVEN** deploy-free on the live warm keycloak: realm create → password-grant
JWT → discovery issuer → delete(idempotent) → reap(keeps live hex / deletes orphan). All PASS.
- W0.3 declarative reconciler `nix/modules/warm-keycloak.nix` up; `nixos-rebuild switch` →
warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned +
skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)
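The reaping rule proven above (keep realms whose hex belongs to a live run, delete orphans) can be illustrated with a pure selection function. This is a sketch under assumptions: the `ci-<6hex>` realm naming and the `orphaned_realms` name are hypothetical, and the real `realms_to_reap` helper may take different inputs.

```python
import re
from typing import Iterable, List, Set

# Hypothetical naming scheme: per-run realms are called "ci-<6 hex chars>".
RUN_REALM = re.compile(r"^ci-([0-9a-f]{6})$")


def orphaned_realms(realms: Iterable[str], live_hexes: Set[str]) -> List[str]:
    """Return per-run realms whose run id is no longer live.

    Realms that do not match the per-run pattern (for example "master")
    are never returned, so shared realms cannot be reaped by accident.
    """
    doomed = []
    for name in realms:
        match = RUN_REALM.match(name)
        if match and match.group(1) not in live_hexes:
            doomed.append(name)
    return doomed
```

Separating selection from deletion keeps the decision unit-testable without a running keycloak; the delete call then only has to be idempotent.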
- **W0.5 WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`) DONE (4cc1e15). +5 unit tests
(48 unit pass). **LIVE round-trip PROVEN on warm keycloak**: marker realm → undeploy → snapshot
(mariadb+providers) → deploy → delete marker (mutate DB) → undeploy → restore → deploy → marker
realm BACK; keycloak healthy. Snapshots under `/var/lib/ci-warm/<recipe>/`, atomic, one last-good.
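One common way to get "atomic, one last-good" semantics for a directory snapshot is to copy into a fresh directory and then swap a symlink with a single rename. The sketch below shows that technique only; it is not the code in `runner/harness/warmsnap.py`. The `replace_snapshot` name, the `snap-*` and `last-good` layout, and the use of `shutil.copytree` are assumptions, and a real raw volume copy would also need to preserve file ownership.

```python
import os
import shutil
import tempfile
from pathlib import Path


def replace_snapshot(volume: Path, snap_root: Path) -> Path:
    """Copy `volume` into `snap_root` and atomically repoint `last-good` at it."""
    snap_root.mkdir(parents=True, exist_ok=True)
    # 1. Copy into a fresh, uniquely named directory next to the old snapshot.
    new_dir = Path(tempfile.mkdtemp(prefix="snap-", dir=snap_root))
    shutil.copytree(volume, new_dir, symlinks=True, dirs_exist_ok=True)
    # 2. Create the new symlink under a temporary name, then rename it over
    #    the old one. A rename within one filesystem is atomic on POSIX, so
    #    readers see either the old snapshot or the new one, never a mix.
    link = snap_root / "last-good"
    tmp_link = snap_root / "last-good.tmp"
    if tmp_link.is_symlink():
        tmp_link.unlink()
    tmp_link.symlink_to(new_dir.name)
    os.replace(tmp_link, link)
    # 3. Keep exactly one snapshot: prune every other snap-* directory.
    for old in list(snap_root.glob("snap-*")):
        if old.name != new_dir.name:
            shutil.rmtree(old)
    return new_dir
```

If the copy in step 1 fails partway, `last-good` still points at the previous snapshot, which is the property a rollback depends on.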
- **W0.6 reconciler rewrite** DONE (a044abb). `runner/warm_reconcile.py` (python, packaged into the
nix store, replaces the bash reconcile): UNPIN keycloak (deploy latest version TAG; recipe fetched
at runtime → D8 closure byte-identical); WC1.2 pre-deploy safety gate (major recipe/app bump OR
releaseNotes manual-migration → hold + alert, no churn); WC1.1 health-gated upgrade-with-rollback
scaffold (record last-good → keycloak undeploy→snapshot→deploy latest → health-gate →
commit-or-restore+redeploy-prior+alert). Alerts = `/var/lib/ci-warm/alerts/*.json`. +8 unit tests
(56 unit pass). PROVEN live: `nixos-rebuild switch` → warm-keycloak.service runs the python
reconciler → noop-healthy (system 0-failed, 200); **WC1.2 holds proven** (MAJOR → held-major,
keycloak untouched; minor+manual-migration notes → held-manual-migration, alert carries notes).

**Next:**
1. **W0.9 — WC1.1 live proofs** (deploy cycles): (a) healthy upgrade — stage a fake newer tag
(re-tag of current → same healthy image) → reconcile snapshots + deploys + commits last-good;
(b) **rollback (marquee)** — stage a fake newer tag with a BROKEN compose (bad KC_HOSTNAME →
crash-loop) → reconcile snapshots → deploys broken → health-gate fails → restores snapshot +
redeploys prior → healthy + data intact (marker realm) + alert written + last_good NOT advanced.
2. **W0.7** — fix the lasuite-docs in-place-redeploy nginx-upstream race OR pick a more-robust SSO
dependent for the headline proof.
3. **W0.8** — headline WC1 e2e: dependent SSO custom test green vs warm keycloak + concurrent
distinct realms (no collision) + reaping. → claim WC1/WC1.1/WC1.2.
4. **Builder-loop alert relay** — on each wake, scan `/var/lib/ci-warm/alerts/*.json`, PushNotification
and record + archive to `alerts/seen/` (wire when first real alert can occur, i.e. with nightly WC6).

**Build finding (mine, to fix):** lasuite-docs `setup_custom_tests` in-place `abra app deploy
--force --chaos` (OIDC wiring) fails: nginx `web` fatally exits `[emerg] host not found in upstream
...backend:8000` during the rolling restart → abra converge times out. Independent of warm/cold
keycloak. Blocks the WC1 dependent-green proof until fixed/worked-around.

## Gate
(none claimed yet)

## Blocked
(none)

## Notes
- **Disk budget (WC8 watch):** cc-ci `/` was 91% (2.4G free) at phase start; freed orphaned Phase-2
cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in
Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
- Stable-domain scheme (proposed, see DECISIONS): `warm-<recipe>.ci.commoninternet.net`, distinct
from cold `<recipe[:4]>-<6hex>`.