diff --git a/machine-docs/BACKLOG-2w.md b/machine-docs/BACKLOG-2w.md
new file mode 100644
index 0000000..3d7895a
--- /dev/null
+++ b/machine-docs/BACKLOG-2w.md
@@ -0,0 +1,41 @@
+# BACKLOG — Phase 2w (warm canonical + `--quick`)
+
+Single-writer rule (plan §6.1): Builder edits `## Build backlog` only; Adversary edits
+`## Adversary findings` only.
+
+## Build backlog
+
+### W0 — Live-warm keycloak (WC1)
+- [ ] W0.1 — sso.py: realm lifecycle primitives (`delete_keycloak_realm`, `list_realms`,
+  `reap_stale_realms`) + unit tests.
+- [ ] W0.2 — Orchestrator/deps: live-warm keycloak dep mode — stable warm domain + per-run
+  namespaced realm; delete realm on teardown (don't undeploy); cold-codeploy fallback if no warm
+  keycloak. Per-run realm name unique per (parent, pr, ref) for concurrency isolation.
+- [ ] W0.3 — Declarative Nix reconciler `nix/modules/warm-keycloak.nix` (systemd oneshot converges
+  warm keycloak deployed+healthy at stable domain); wired into the host config.
+- [ ] W0.4 — e2e proof: a dependent recipe (lasuite-docs) SSO custom test passes against warm
+  keycloak; concurrent dependents use distinct realms (no collision); leftover realms reaped.
+  → claim WC1 gate.
+
+### W1 — Canonical registry + snapshot/restore (WC2, WC3)
+- [ ] W1.1 — Canonical registry/reconciler (declarative; tracks recipe→known-good commit; stable
+  domain `warm-`).
+- [ ] W1.2 — Snapshot/restore: raw volume copy while undeployed under `/var/lib/ci-warm//`;
+  one last-known-good, atomic replace; prove restore round-trips data.
+
+### W2 — `--quick` mode (WC4, WC7)
+- [ ] W2.1 — `run_recipe_ci.py --quick` path (reattach → upgrade-to-PR-head → assert → PASS undeploy /
+  FAIL restore+undeploy; never promote).
+- [ ] W2.2 — Trigger surface + labeling + no-canonical fallback (WC7).
+
+### W3 — Cold-advances-canonical + nightly sweep (WC5, WC6)
+- [ ] W3.1 — Promote-on-green-cold (snapshot+tag canonical at teardown on green cold; seed on first green).
+- [ ] W3.2 — Nightly full-cold sweep (declarative scheduler, MAX_TESTS-bounded).
+
+### W4 — Hardening + docs + cold verify (WC8, WC9)
+- [ ] W4.1 — Resource/isolation hardening: disk monitor+prune, per-app serialize, warm excluded from D8.
+- [ ] W4.2 — Docs (warm/quick) + the WC9 rollback proof.
+
+## Adversary findings
+(none yet)
+
diff --git a/machine-docs/DECISIONS.md b/machine-docs/DECISIONS.md
index 315aa89..3ec1318 100644
--- a/machine-docs/DECISIONS.md
+++ b/machine-docs/DECISIONS.md
@@ -557,3 +557,31 @@ pulls. Tested on cc-ci with the authenticated config.json in place:
 So: declarative root `config.json` is sufficient end-to-end here; `--with-registry-auth` is not
 required (abra/SDK attaches it). **Caveat (Phase 2b):** 200/6h may still be tight for a full ~18-recipe
 sweep; the permanent structural fix is a registry pull-through cache authenticated with this same PAT.
+
+---
+
+## Phase 2w — warm canonical + `--quick` (2026-05-28)
+
+**Stable-domain scheme for warm apps: `warm-.ci.commoninternet.net`.** Distinct from cold
+per-run `-<6hex>` (naming.app_domain) so a warm app is never confused with a disposable
+cold run. Live-warm keycloak = `warm-keycloak.ci.commoninternet.net`; data-warm canonicals (W1) =
+`warm-...`. Risk to watch: longer stack name vs swarm's 64-char config/secret limit —
+verified per-recipe on first deploy; shorten the scheme if any recipe's secret name overflows.
+
+**Realm is the per-run isolation unit on the shared live-warm keycloak (WC1).** Instead of
+co-deploying a fresh keycloak per dependent run, dependents use the one live-warm keycloak and create
+a **per-run namespaced realm+client+user**, deleted at run teardown. Realm name =
+`-<6hex>` where 6hex is the parent's per-run domain label suffix — unique per
+(parent, pr, ref) so concurrent dependents never collide, and traceable for debugging. (Was
+`realm=parent_recipe`, which would collide across concurrent same-recipe runs.)
+
+**Warm keycloak is declarative INFRA, not warm DATA.** The live-warm keycloak service is brought up
+by a Nix systemd-oneshot reconciler (converges to deployed+healthy at the stable domain), exactly
+like the traefik recipe deploy — so it IS in the D8 reproducibility closure (re-warmable from
+scratch) and self-heals on activation/boot. Only warm *volumes/snapshots* (W1+) are cache excluded
+from D8. The keycloak's realm data is ephemeral per-run, so nothing persistent to exclude.
+
+**Live-warm is an optimization layer with a cold fallback.** If no warm keycloak is present (e.g. a
+from-scratch host before the reconciler has run, or the warm app is down), the keycloak dep path
+falls back to the existing cold co-deploy so dependent runs still work. The warm path is preferred
+when available.
diff --git a/machine-docs/JOURNAL-2w.md b/machine-docs/JOURNAL-2w.md
new file mode 100644
index 0000000..03e07c5
--- /dev/null
+++ b/machine-docs/JOURNAL-2w.md
@@ -0,0 +1,54 @@
+# JOURNAL — Phase 2w (warm canonical + `--quick`) — Builder
+
+Append-only reasoning log (WHY). Facts/verification go in STATUS-2w; verdicts in REVIEW-2w.
+
+## 2026-05-28 — Phase 2w bootstrap + cleanup + W0 design
+
+**Orientation.** Operator interjected Phase 2w into Phase 2 (Phase 2 paused, state preserved).
+Read the 2w plan + plan.md §6.1/§7/§9. Adversary already online (REVIEW-2w `@2026-05-28 start`),
+idle awaiting a WC gate claim. Seeded STATUS-2w/BACKLOG-2w/JOURNAL-2w.
+
+**In-flight Phase 2 work committed.** Working tree had an uncommitted edit to
+`tests/lasuite-drive/setup_custom_tests.sh` (Q3.2 MinIO bucket creation via the createbuckets
+one-shot) — the continuation of commit 6557197. Committed it (66e065d) with an honest message: not
+yet live-verified (needs a lasuite-drive deploy once warm keycloak exists). This preserves Phase 2
+progress at the pause point; it resumes after 2w DONE.
+
+**Cleanup (orchestrator-requested).** cc-ci `/` was at 91% (only 2.4G free) — a real WC8 concern
+before adding warm volumes/snapshots. Tore down the leftover COLD per-run apps from paused Phase 2
+via `lifecycle.teardown_app(..., verify=True)`: `lasu-0a6fb2` (12-service lasuite-drive, heaviest),
+`keyc-07d81e` (cold keycloak), `lasu-dbg` (debug lasuite). All TEARDOWN OK, no residual. Disk →
+86% (3.8G free). Only infra stacks remain (backups, bridge, dashboard, drone, traefik). Did NOT
+`docker image prune` — 9.7GB reclaimable but the image cache is the warm pull-cache; with authed
+Docker Hub pulls now wired, a re-pull is billed to the account (cheaper) but still slow, so keep the
+cache. Disk is the Phase-2w budget (WC8) — monitor.
+
+**W0 design (WC1 — live-warm keycloak).** The existing SSO harness is already most of the way there:
+- `sso.setup_keycloak_realm(provider_domain, realm, client_id, ...)` creates a realm+client+user
+  **idempotently via the admin API**, and `_kc_admin_password` reads the admin password from inside
+  the running container (`docker exec ... cat /run/secrets/admin_password`). So it works against ANY
+  running keycloak — cold or warm — with no external password handling.
+- The orchestrator dep flow (`run_recipe_ci.py`): `declared_deps` → `deploy_deps` (fresh co-deploy
+  per run) → `_enrich_deps_with_sso` (creates realm, realm name currently = `parent_recipe`) →
+  `setup_custom_tests.sh` hook → teardown_deps (undeploy).
+
+What WC1 changes:
+1. The **realm becomes the per-run isolation unit** on a shared live-warm keycloak. Realm name must
+   be unique per (parent, pr, ref) so concurrent dependents don't collide — change from
+   `realm=parent_recipe` to `realm=-<6hex>` (derive the hex from the parent's per-run domain
+   label so it's stable within a run and distinct across concurrent runs).
+2. The keycloak dep is **not co-deployed**: point at the stable warm domain; on teardown **delete the
+   realm** (not undeploy keycloak). Fall back to cold co-deploy if no warm keycloak is present (so a
+   from-scratch / no-warm environment still works — the warm keycloak is an optimization layer).
+3. The warm keycloak itself is **declarative infra** (Nix reconciler, like traefik) — NOT warm
+   *data* (so it IS in the D8 closure as a reconciler; its realm data is ephemeral per-run anyway).
+   Re-warmable from scratch.
+
+Stable-domain scheme decision: `warm-.ci.commoninternet.net` (here `warm-keycloak...`),
+clearly distinct from cold `-<6hex>`. Risk: longer stack name → swarm 64-char
+config/secret limit; will verify on first deploy and shorten if it overflows.
+
+Building W0 in increments (each verified): (1) sso realm lifecycle prims + units; (2) deploy warm
+keycloak manually at the stable domain and prove realm create→delete via admin API; (3) wire the
+orchestrator live-warm mode; (4) declarative Nix reconciler; (5) e2e + concurrency + reaping proof.
+
diff --git a/machine-docs/STATUS-2w.md b/machine-docs/STATUS-2w.md
new file mode 100644
index 0000000..7805144
--- /dev/null
+++ b/machine-docs/STATUS-2w.md
@@ -0,0 +1,62 @@
+# STATUS — Phase 2w (warm canonical deployments + `--quick` CI mode)
+
+**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
+**Loop state for THIS phase:** STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared).
+Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state.
+Phase 2 is **PAUSED** (STATUS-2/BACKLOG-2 intact) and resumes after 2w `## DONE`.
+
+## Phase
+Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe
+canonicals at stable domains, known-good snapshots, an opt-in `--quick` fast lane that reattaches the
+canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a
+nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversary cold-verified.
+
+## Definition of Done (Phase 2w) — WC1–WC9, each Adversary cold-verified in REVIEW-2w
+- [ ] **WC1** — Live-warm keycloak (SSO dep) at a stable domain; dependents create+delete per-run
+  namespaced realms; concurrent dependents don't collide; leftover realms reaped.
+- [ ] **WC2** — Data-warm canonical model: per-recipe canonical at a stable domain, declarative
+  registry tracking recipe→known-good commit; re-warmable from scratch.
+- [ ] **WC3** — Known-good snapshots: raw volume copy taken while undeployed under stable path; one
+  last-known-good per app, atomic replace; restore proven to round-trip data.
+- [ ] **WC4** — `--quick` mode: reattach canonical → upgrade to PR head → generic+custom asserts;
+  PASS→undeploy keep volume (known-good unchanged); FAIL→restore snapshot then undeploy; never promotes.
+- [ ] **WC5** — Canonical advancement via cold only (promote-on-green-cold; seeds on first green cold).
+- [ ] **WC6** — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
+- [ ] **WC7** — Trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge;
+  results carry mode; clean no-canonical fallback.
+- [ ] **WC8** — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via
+  per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
+- [ ] **WC9** — Docs + cold verify incl. the rollback proof (deliberately fail a PR under `--quick`,
+  confirm last-known-good restored intact; a `--quick` pass did not move the known-good).
+
+## Milestones (plan §3)
+- **W0** — Warm keycloak (WC1). ← IN FLIGHT
+- **W1** — Canonical registry + snapshot/restore (WC2, WC3).
+- **W2** — `--quick` mode (WC4, WC7).
+- **W3** — Cold-advances-canonical + nightly sweep (WC5, WC6).
+- **W4** — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.
+
+## In flight
+**W0 — live-warm keycloak (WC1).** Building incrementally:
+1. sso.py realm lifecycle: add `delete_keycloak_realm` + `list_realms` + `reap_stale_realms` (realm
+   is the per-run isolation unit on a shared keycloak).
+2. Orchestrator dep path: live-warm mode for the keycloak dep — use the stable warm domain + a
+   per-run **namespaced** realm (not realm=parent_recipe), delete the realm on teardown instead of
+   undeploying keycloak. Fall back to cold co-deploy if no warm keycloak present.
+3. Declarative Nix reconciler (`nix/modules/warm-keycloak.nix`) — systemd oneshot converges the
+   warm keycloak to deployed+healthy at the stable domain.
+4. e2e proof + concurrency (distinct realms) + reaping → claim WC1.
+
+## Gate
+(none claimed yet)
+
+## Blocked
+(none)
+
+## Notes
+- **Disk budget (WC8 watch):** cc-ci `/` was 91% (2.4G free) at phase start; freed orphaned Phase-2
+  cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in
+  Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
+- Stable-domain scheme (proposed, see DECISIONS): `warm-.ci.commoninternet.net`, distinct
+  from cold `-<6hex>`.
+
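
Reviewer sketch (not part of the diff): the per-run realm naming and the stale-realm selection planned for W0 could look roughly like the Python below. This is a minimal illustration under stated assumptions, not the project's `sso.py`. The helper names `per_run_realm_name` and `select_stale_realms` are hypothetical, the realm prefix is a placeholder because the notes elide it, and "stale" is assumed to mean "a per-run realm that no live run owns". It does not talk to Keycloak; the real `list_realms` / `delete_keycloak_realm` primitives would supply the input and perform the deletion.

```python
import re

# Assumption: per-run realms look like "<prefix>-<6 lowercase hex chars>".
# The prefix is elided in the notes, so callers pass it in explicitly.
_PER_RUN_REALM = re.compile(r"^[a-z0-9][a-z0-9-]*-[0-9a-f]{6}$")
_SIX_HEX = re.compile(r"[0-9a-f]{6}")


def per_run_realm_name(prefix: str, per_run_domain_label: str) -> str:
    """Build a realm name from the 6-hex suffix of the parent's per-run domain label."""
    suffix = per_run_domain_label.rsplit("-", 1)[-1]
    if not _SIX_HEX.fullmatch(suffix):
        raise ValueError(f"no 6-hex suffix in label: {per_run_domain_label!r}")
    return f"{prefix}-{suffix}"


def select_stale_realms(all_realms, active_realms):
    """Return per-run realms that no live run owns.

    Realms that do not match the per-run pattern (for example Keycloak's
    built-in "master" realm) are never selected.
    """
    active = set(active_realms)
    return sorted(
        name
        for name in all_realms
        if name not in active and _PER_RUN_REALM.match(name)
    )
```

Keeping the selection a pure function of (all realms, active realms) makes the reaping rule unit-testable without a running keycloak, which matches the "primitives + unit tests" shape of W0.1.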