chore(2w): bootstrap Phase 2w loop state + cleanup orphaned cold apps

- Seed STATUS-2w / BACKLOG-2w / JOURNAL-2w (WC1-WC9 DoD, W0-W4 milestones).
- Tore down leftover Phase-2 cold apps (lasu-0a6fb2/keyc-07d81e/lasu-dbg);
  disk 91%->86%.
- DECISIONS: warm-domain scheme, per-run realm isolation, warm keycloak as
  declarative infra, cold fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 23:14:41 +01:00
parent 66e065dff5
commit 5dd76d7c8c
4 changed files with 185 additions and 0 deletions


@ -0,0 +1,41 @@
# BACKLOG — Phase 2w (warm canonical + `--quick`)
Single-writer rule (plan §6.1): Builder edits `## Build backlog` only; Adversary edits
`## Adversary findings` only.
## Build backlog
### W0 — Live-warm keycloak (WC1)
- [ ] W0.1 — sso.py: realm lifecycle primitives (`delete_keycloak_realm`, `list_realms`,
`reap_stale_realms`) + unit tests.
- [ ] W0.2 — Orchestrator/deps: live-warm keycloak dep mode — stable warm domain + per-run
namespaced realm; delete realm on teardown (don't undeploy); cold-codeploy fallback if no warm
keycloak. Per-run realm name unique per (parent, pr, ref) for concurrency isolation.
- [ ] W0.3 — Declarative Nix reconciler `nix/modules/warm-keycloak.nix` (systemd oneshot converges
warm keycloak deployed+healthy at stable domain); wired into the host config.
- [ ] W0.4 — e2e proof: a dependent recipe (lasuite-docs) SSO custom test passes against warm
keycloak; concurrent dependents use distinct realms (no collision); leftover realms reaped.
→ claim WC1 gate.
### W1 — Canonical registry + snapshot/restore (WC2, WC3)
- [ ] W1.1 — Canonical registry/reconciler (declarative; tracks recipe→known-good commit; stable
domain `warm-<recipe>`).
- [ ] W1.2 — Snapshot/restore: raw volume copy while undeployed under `/var/lib/ci-warm/<recipe>/`;
one last-known-good, atomic replace; prove restore round-trips data.
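A minimal sketch of the "one last-known-good, atomic replace" idea in W1.2, assuming the snapshot is exposed through a `last-known-good` symlink that is swapped with a rename. The symlink layout, the `snap-` directory prefix, and the function name `replace_snapshot` are illustrative assumptions, not the planned implementation.

```python
import os
import shutil
import tempfile
from pathlib import Path


def replace_snapshot(volume_dir: Path, snapshot_root: Path) -> Path:
    """Copy volume_dir under snapshot_root and atomically repoint 'last-known-good'."""
    snapshot_root.mkdir(parents=True, exist_ok=True)
    # Copy into a fresh directory first, so a crash mid-copy never touches the live link.
    new_dir = Path(tempfile.mkdtemp(prefix="snap-", dir=snapshot_root))
    shutil.copytree(volume_dir, new_dir / "data", symlinks=True)

    link = snapshot_root / "last-known-good"
    old_name = os.readlink(link) if link.is_symlink() else None

    # Build the new symlink under a temporary name, then rename it over the old one.
    tmp_link = snapshot_root / "last-known-good.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(new_dir.name, tmp_link)
    os.replace(tmp_link, link)  # a rename, which is atomic on POSIX filesystems

    # Keep exactly one snapshot: drop the previous copy once the link has moved.
    if old_name is not None:
        shutil.rmtree(snapshot_root / old_name, ignore_errors=True)
    return new_dir
```

Readers of `last-known-good` see either the old copy or the new one, never a half-written directory, which is what a restore round-trip proof would rely on.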
### W2 — `--quick` mode (WC4, WC7)
- [ ] W2.1 — `run_recipe_ci.py --quick` path (reattach → upgrade-to-PR-head → assert → PASS undeploy /
FAIL restore+undeploy; never promote).
- [ ] W2.2 — Trigger surface + labeling + no-canonical fallback (WC7).
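The W2.1 control flow can be sketched as a small driver with the real steps injected as callables. All parameter names here are hypothetical, and the FAIL ordering (restore, then undeploy) simply follows the wording of W2.1; this is a sketch under those assumptions, not the `run_recipe_ci.py` code.

```python
from typing import Callable


def run_quick(
    reattach: Callable[[], None],
    upgrade_to_pr_head: Callable[[], None],
    run_asserts: Callable[[], bool],
    restore_snapshot: Callable[[], None],
    undeploy: Callable[[], None],
) -> bool:
    """Return True on PASS. There is deliberately no promote step on any path."""
    reattach()
    try:
        upgrade_to_pr_head()
        passed = run_asserts()
    except Exception:
        passed = False  # a crashed upgrade or assert counts as FAIL
    try:
        if not passed:
            restore_snapshot()  # FAIL: put the last-known-good data back
    finally:
        undeploy()  # PASS and FAIL both end undeployed
    return passed
```

Because promotion is absent from the function entirely, a `--quick` pass cannot move the known-good by construction.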
### W3 — Cold-advances-canonical + nightly sweep (WC5, WC6)
- [ ] W3.1 — Promote-on-green-cold (snapshot+tag canonical at teardown on green cold; seed on first green).
- [ ] W3.2 — Nightly full-cold sweep (declarative scheduler, MAX_TESTS-bounded).
### W4 — Hardening + docs + cold verify (WC8, WC9)
- [ ] W4.1 — Resource/isolation hardening: disk monitor+prune, per-app serialize, warm excluded from D8.
- [ ] W4.2 — Docs (warm/quick) + the WC9 rollback proof.
## Adversary findings
(none yet)
</content>


@ -557,3 +557,31 @@ pulls. Tested on cc-ci with the authenticated config.json in place:
So: declarative root `config.json` is sufficient end-to-end here; `--with-registry-auth` is not
required (abra/SDK attaches it). **Caveat (Phase 2b):** 200/6h may still be tight for a full ~18-recipe
sweep; the permanent structural fix is a registry pull-through cache authenticated with this same PAT.
---
## Phase 2w — warm canonical + `--quick` (2026-05-28)
**Stable-domain scheme for warm apps: `warm-<recipe>.ci.commoninternet.net`.** Distinct from cold
per-run `<recipe[:4]>-<6hex>` (naming.app_domain) so a warm app is never confused with a disposable
cold run. Live-warm keycloak = `warm-keycloak.ci.commoninternet.net`; data-warm canonicals (W1) =
`warm-<recipe>...`. Risk to watch: longer stack name vs swarm's 64-char config/secret limit;
to be verified per recipe on first deploy, shortening the scheme if any recipe's secret name overflows.
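A rough pre-deploy check for the 64-char risk could look like the following. It assumes the stack name is the domain with dots replaced by underscores and that each config/secret is named `<stack>_<name>`; both conventions, and the helper names, are assumptions to confirm against the deploy tooling.

```python
SWARM_NAME_LIMIT = 64  # the config/secret name limit cited above


def stack_name(domain: str) -> str:
    # Assumption: stack name = domain with dots replaced by underscores.
    return domain.replace(".", "_")


def overflowing_names(domain: str, object_names: list[str]) -> list[str]:
    """Return the full "<stack>_<name>" strings that exceed the limit."""
    prefix = stack_name(domain)
    full_names = [f"{prefix}_{name}" for name in object_names]
    return [n for n in full_names if len(n) > SWARM_NAME_LIMIT]
```

With `warm-keycloak.ci.commoninternet.net` the stack prefix is 35 characters, which leaves 28 for the secret name (including any version suffix) under these assumptions.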
**Realm is the per-run isolation unit on the shared live-warm keycloak (WC1).** Instead of
co-deploying a fresh keycloak per dependent run, dependents use the one live-warm keycloak and create
a **per-run namespaced realm+client+user**, deleted at run teardown. Realm name =
`<parent_recipe>-<6hex>` where 6hex is the parent's per-run domain label suffix — unique per
(parent, pr, ref) so concurrent dependents never collide, and traceable for debugging. (Was
`realm=parent_recipe`, which would collide across concurrent same-recipe runs.)
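As an illustration of the naming rule, the realm name can be derived from the parent's per-run domain. The function name `per_run_realm` and the regex are hypothetical; the sketch assumes the first DNS label of the parent's domain ends in `-<6hex>`.

```python
import re


def per_run_realm(parent_recipe: str, parent_domain: str) -> str:
    """Build "<parent_recipe>-<6hex>" from the parent's per-run domain."""
    label = parent_domain.split(".", 1)[0]  # e.g. "lasu-0a6fb2"
    match = re.fullmatch(r".+-([0-9a-f]{6})", label)
    if match is None:
        # Not a per-run label (for example a hand-named debug app): refuse to guess.
        raise ValueError(f"not a per-run domain label: {label!r}")
    return f"{parent_recipe}-{match.group(1)}"
```

Two concurrent runs of the same recipe have different domain labels, so they get different realm names, and the shared suffix ties a realm back to its run when debugging.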
**Warm keycloak is declarative INFRA, not warm DATA.** The live-warm keycloak service is brought up
by a Nix systemd-oneshot reconciler (converges to deployed+healthy at the stable domain), exactly
like the traefik recipe deploy — so it IS in the D8 reproducibility closure (re-warmable from
scratch) and self-heals on activation/boot. Only warm *volumes/snapshots* (W1+) are cache, excluded
from D8. The keycloak's realm data is ephemeral per-run, so there is nothing persistent to exclude.
**Live-warm is an optimization layer with a cold fallback.** If no warm keycloak is present (e.g. a
from-scratch host before the reconciler has run, or the warm app is down), the keycloak dep path
falls back to the existing cold co-deploy so dependent runs still work. The warm path is preferred
when available.
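The preference order can be sketched as a small resolver with the health probe injected. The function name, the returned dict shape, and the `is_healthy` callable are hypothetical; only the stable domain and the warm-then-cold order come from the decision above.

```python
from typing import Callable

WARM_KEYCLOAK_DOMAIN = "warm-keycloak.ci.commoninternet.net"


def resolve_keycloak_dep(is_healthy: Callable[[str], bool]) -> dict:
    """Prefer the warm keycloak; otherwise signal a cold co-deploy."""
    try:
        warm_ok = is_healthy(WARM_KEYCLOAK_DOMAIN)
    except Exception:
        warm_ok = False  # an unreachable warm app is treated the same as an absent one
    if warm_ok:
        # Warm path: reuse the shared instance and delete only the per-run realm afterwards.
        return {"mode": "warm", "domain": WARM_KEYCLOAK_DOMAIN, "teardown": "delete-realm"}
    # Cold path: the caller co-deploys a fresh keycloak and undeploys it at teardown.
    return {"mode": "cold", "domain": None, "teardown": "undeploy"}
```

Carrying the teardown action alongside the mode keeps the two paths from being mixed up, so a cold co-deploy is never left running and the warm instance is never undeployed.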


@ -0,0 +1,54 @@
# JOURNAL — Phase 2w (warm canonical + `--quick`) — Builder
Append-only reasoning log (WHY). Facts/verification go in STATUS-2w; verdicts in REVIEW-2w.
## 2026-05-28 — Phase 2w bootstrap + cleanup + W0 design
**Orientation.** Operator interjected Phase 2w into Phase 2 (Phase 2 paused, state preserved).
Read the 2w plan + plan.md §6.1/§7/§9. Adversary already online (REVIEW-2w `@2026-05-28 start`),
idle awaiting a WC gate claim. Seeded STATUS-2w/BACKLOG-2w/JOURNAL-2w.
**In-flight Phase 2 work committed.** Working tree had an uncommitted edit to
`tests/lasuite-drive/setup_custom_tests.sh` (Q3.2 MinIO bucket creation via the createbuckets
one-shot) — the continuation of commit 6557197. Committed it (66e065d) with an honest message: not
yet live-verified (needs a lasuite-drive deploy once warm keycloak exists). This preserves Phase 2
progress at the pause point; it resumes after 2w DONE.
**Cleanup (orchestrator-requested).** cc-ci `/` was at 91% (only 2.4G free) — a real WC8 concern
before adding warm volumes/snapshots. Tore down the leftover COLD per-run apps from paused Phase 2
via `lifecycle.teardown_app(..., verify=True)`: `lasu-0a6fb2` (12-service lasuite-drive, heaviest),
`keyc-07d81e` (cold keycloak), `lasu-dbg` (debug lasuite). All TEARDOWN OK, no residual. Disk →
86% (3.8G free). Only infra stacks remain (backups, bridge, dashboard, drone, traefik). Did NOT
`docker image prune` — 9.7GB reclaimable but the image cache is the warm pull-cache; with authed
Docker Hub pulls now wired, a re-pull is billed to the account (cheaper) but still slow, so keep the
cache. Disk is the Phase-2w budget (WC8) — monitor.
**W0 design (WC1 — live-warm keycloak).** The existing SSO harness is already most of the way there:
- `sso.setup_keycloak_realm(provider_domain, realm, client_id, ...)` creates a realm+client+user
**idempotently via the admin API**, and `_kc_admin_password` reads the admin password from inside
the running container (`docker exec ... cat /run/secrets/admin_password`). So it works against ANY
running keycloak — cold or warm — with no external password handling.
- The orchestrator dep flow (`run_recipe_ci.py`): `declared_deps` → `deploy_deps` (fresh co-deploy
per run) → `_enrich_deps_with_sso` (creates realm, realm name currently = `parent_recipe`) →
`setup_custom_tests.sh` hook → `teardown_deps` (undeploy).
What WC1 changes:
1. The **realm becomes the per-run isolation unit** on a shared live-warm keycloak. Realm name must
be unique per (parent, pr, ref) so concurrent dependents don't collide — change from
`realm=parent_recipe` to `realm=<parent>-<6hex>` (derive the hex from the parent's per-run domain
label so it's stable within a run and distinct across concurrent runs).
2. The keycloak dep is **not co-deployed**: point at the stable warm domain; on teardown **delete the
realm** (not undeploy keycloak). Fall back to cold co-deploy if no warm keycloak is present (so a
from-scratch / no-warm environment still works — the warm keycloak is an optimization layer).
3. The warm keycloak itself is **declarative infra** (Nix reconciler, like traefik) — NOT warm
*data* (so it IS in the D8 closure as a reconciler; its realm data is ephemeral per-run anyway).
Re-warmable from scratch.
Stable-domain scheme decision: `warm-<recipe>.ci.commoninternet.net` (here `warm-keycloak...`),
clearly distinct from cold `<recipe[:4]>-<6hex>`. Risk: longer stack name → swarm 64-char
config/secret limit; will verify on first deploy and shorten if it overflows.
Building W0 in increments (each verified): (1) sso realm lifecycle prims + units; (2) deploy warm
keycloak manually at the stable domain and prove realm create→delete via admin API; (3) wire the
orchestrator live-warm mode; (4) declarative Nix reconciler; (5) e2e + concurrency + reaping proof.
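For the reaping step, one possible selection rule is sketched here. It assumes a realm is stale when its `-<6hex>` suffix matches no currently running per-run app; that staleness definition and the name `select_stale_realms` are assumptions, not the planned `reap_stale_realms` behaviour.

```python
import re

_PER_RUN_REALM = re.compile(r".+-([0-9a-f]{6})")


def select_stale_realms(realms, active_suffixes):
    """Pick per-run realms whose 6hex suffix has no live run behind it."""
    stale = []
    for name in realms:
        match = _PER_RUN_REALM.fullmatch(name)
        if match is None:
            continue  # e.g. "master": never a reap candidate
        if match.group(1) not in active_suffixes:
            stale.append(name)
    return stale
```

Keeping the selection pure makes it unit-testable without a running keycloak; the names it returns would then be passed to the realm-delete primitive.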
</content>

machine-docs/STATUS-2w.md Normal file

@ -0,0 +1,62 @@
# STATUS — Phase 2w (warm canonical deployments + `--quick` CI mode)
**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
**Loop state for THIS phase:** STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared).
Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state.
Phase 2 is **PAUSED** (STATUS-2/BACKLOG-2 intact) and resumes after 2w `## DONE`.
## Phase
Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe
canonicals at stable domains, known-good snapshots, an opt-in `--quick` fast lane that reattaches the
canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a
nightly full-cold sweep. Definition of Done = WC1-WC9 (plan §1), each Adversary cold-verified.
## Definition of Done (Phase 2w) — WC1-WC9, each Adversary cold-verified in REVIEW-2w
- [ ] **WC1** — Live-warm keycloak (SSO dep) at a stable domain; dependents create+delete per-run
namespaced realms; concurrent dependents don't collide; leftover realms reaped.
- [ ] **WC2** — Data-warm canonical model: per-recipe canonical at a stable domain, declarative
registry tracking recipe→known-good commit; re-warmable from scratch.
- [ ] **WC3** — Known-good snapshots: raw volume copy taken while undeployed under stable path; one
last-known-good per app, atomic replace; restore proven to round-trip data.
- [ ] **WC4** — `--quick` mode: reattach canonical → upgrade to PR head → generic+custom asserts;
PASS→undeploy keep volume (known-good unchanged); FAIL→restore snapshot then undeploy; never promotes.
- [ ] **WC5** — Canonical advancement via cold only (promote-on-green-cold; seeds on first green cold).
- [ ] **WC6** — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
- [ ] **WC7** — Trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge;
results carry mode; clean no-canonical fallback.
- [ ] **WC8** — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via
per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
- [ ] **WC9** — Docs + cold verify incl. the rollback proof (deliberately fail a PR under `--quick`,
confirm last-known-good restored intact; a `--quick` pass did not move the known-good).
## Milestones (plan §3)
- **W0** — Warm keycloak (WC1). ← IN FLIGHT
- **W1** — Canonical registry + snapshot/restore (WC2, WC3).
- **W2** — `--quick` mode (WC4, WC7).
- **W3** — Cold-advances-canonical + nightly sweep (WC5, WC6).
- **W4** — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.
## In flight
**W0 — live-warm keycloak (WC1).** Building incrementally:
1. sso.py realm lifecycle: add `delete_keycloak_realm` + `list_realms` + `reap_stale_realms` (realm
is the per-run isolation unit on a shared keycloak).
2. Orchestrator dep path: live-warm mode for the keycloak dep — use the stable warm domain + a
per-run **namespaced** realm (not realm=parent_recipe), delete the realm on teardown instead of
undeploying keycloak. Fall back to cold co-deploy if no warm keycloak present.
3. Declarative Nix reconciler (`nix/modules/warm-keycloak.nix`) — systemd oneshot converges the
warm keycloak to deployed+healthy at the stable domain.
4. e2e proof + concurrency (distinct realms) + reaping → claim WC1.
## Gate
(none claimed yet)
## Blocked
(none)
## Notes
- **Disk budget (WC8 watch):** cc-ci `/` was 91% (2.4G free) at phase start; freed orphaned Phase-2
cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in
Docker images kept as warm pull-cache (pulls are authenticated now, so a re-pull is cheaper but still slow).
- Stable-domain scheme (proposed, see DECISIONS): `warm-<recipe>.ci.commoninternet.net`, distinct
from cold `<recipe[:4]>-<6hex>`.
</content>