chore(2w): bootstrap Phase 2w loop state + cleanup orphaned cold apps
- Seed STATUS-2w / BACKLOG-2w / JOURNAL-2w (WC1-WC9 DoD, W0-W4 milestones).
- Tore down leftover Phase-2 cold apps (lasu-0a6fb2/keyc-07d81e/lasu-dbg); disk 91%->86%.
- DECISIONS: warm-domain scheme, per-run realm isolation, warm keycloak as declarative infra, cold fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
41
machine-docs/BACKLOG-2w.md
Normal file
@@ -0,0 +1,41 @@
# BACKLOG — Phase 2w (warm canonical + `--quick`)

Single-writer rule (plan §6.1): Builder edits `## Build backlog` only; Adversary edits
`## Adversary findings` only.

## Build backlog

### W0 — Live-warm keycloak (WC1)
- [ ] W0.1 — sso.py: realm lifecycle primitives (`delete_keycloak_realm`, `list_realms`,
      `reap_stale_realms`) + unit tests.
- [ ] W0.2 — Orchestrator/deps: live-warm keycloak dep mode — stable warm domain + per-run
      namespaced realm; delete realm on teardown (don't undeploy); cold-codeploy fallback if no warm
      keycloak. Per-run realm name unique per (parent, pr, ref) for concurrency isolation.
- [ ] W0.3 — Declarative Nix reconciler `nix/modules/warm-keycloak.nix` (systemd oneshot converges
      warm keycloak deployed+healthy at stable domain); wired into the host config.
- [ ] W0.4 — e2e proof: a dependent recipe (lasuite-docs) SSO custom test passes against warm
      keycloak; concurrent dependents use distinct realms (no collision); leftover realms reaped.
      → claim WC1 gate.

### W1 — Canonical registry + snapshot/restore (WC2, WC3)
- [ ] W1.1 — Canonical registry/reconciler (declarative; tracks recipe→known-good commit; stable
      domain `warm-<recipe>`).
- [ ] W1.2 — Snapshot/restore: raw volume copy while undeployed under `/var/lib/ci-warm/<recipe>/`;
      one last-known-good, atomic replace; prove restore round-trips data.

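The "one last-known-good, atomic replace" requirement in W1.2 could be met with a symlink swap. This is only a minimal sketch under stated assumptions: the `known-good` symlink name, the `snap-*` directory layout, and the `replace_snapshot` helper are hypothetical, and `shutil.copytree` stands in for whatever raw-copy tool the real implementation uses (it does not preserve ownership the way a raw volume copy would need to).

```python
import os
import shutil
import tempfile
from pathlib import Path


def replace_snapshot(volume_dir: Path, recipe_root: Path) -> Path:
    """Copy volume_dir into a fresh snapshot dir, then atomically repoint
    the (hypothetical) `known-good` symlink at it and drop the old snapshot."""
    recipe_root.mkdir(parents=True, exist_ok=True)

    # 1. Copy into a new, uniquely named directory; readers never see it half-written.
    new_snap = Path(tempfile.mkdtemp(prefix="snap-", dir=recipe_root))
    shutil.copytree(volume_dir, new_snap / "data", symlinks=True)

    # 2. Remember the previous snapshot (if any) so it can be removed afterwards.
    current = recipe_root / "known-good"
    old_name = os.readlink(current) if current.is_symlink() else None

    # 3. Build the new symlink under a temporary name, then rename it over the
    #    old one. A rename within one directory is atomic on POSIX filesystems.
    tmp_link = recipe_root / "known-good.tmp"
    if tmp_link.is_symlink():
        tmp_link.unlink()
    os.symlink(new_snap.name, tmp_link)
    os.replace(tmp_link, current)

    # 4. Keep exactly one last-known-good.
    if old_name is not None:
        shutil.rmtree(recipe_root / old_name, ignore_errors=True)
    return new_snap
```

With this layout a restore would copy `known-good/data` back over the volume while the app is undeployed; a crash before step 3 leaves the previous known-good untouched.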
### W2 — `--quick` mode (WC4, WC7)
- [ ] W2.1 — `run_recipe_ci.py --quick` path (reattach → upgrade-to-PR-head → assert → PASS undeploy /
      FAIL restore+undeploy; never promote).
- [ ] W2.2 — Trigger surface + labeling + no-canonical fallback (WC7).

### W3 — Cold-advances-canonical + nightly sweep (WC5, WC6)
- [ ] W3.1 — Promote-on-green-cold (snapshot+tag canonical at teardown on green cold; seed on first green).
- [ ] W3.2 — Nightly full-cold sweep (declarative scheduler, MAX_TESTS-bounded).

### W4 — Hardening + docs + cold verify (WC8, WC9)
- [ ] W4.1 — Resource/isolation hardening: disk monitor+prune, per-app serialize, warm excluded from D8.
- [ ] W4.2 — Docs (warm/quick) + the WC9 rollback proof.

## Adversary findings
(none yet)
</content>
@@ -557,3 +557,31 @@ pulls. Tested on cc-ci with the authenticated config.json in place:
So: declarative root `config.json` is sufficient end-to-end here; `--with-registry-auth` is not
required (abra/SDK attaches it). **Caveat (Phase 2b):** 200/6h may still be tight for a full ~18-recipe
sweep; the permanent structural fix is a registry pull-through cache authenticated with this same PAT.

---

## Phase 2w — warm canonical + `--quick` (2026-05-28)

**Stable-domain scheme for warm apps: `warm-<recipe>.ci.commoninternet.net`.** Distinct from cold
per-run `<recipe[:4]>-<6hex>` (naming.app_domain) so a warm app is never confused with a disposable
cold run. Live-warm keycloak = `warm-keycloak.ci.commoninternet.net`; data-warm canonicals (W1) =
`warm-<recipe>...`. Risk to watch: longer stack name vs swarm's 64-char config/secret limit —
verified per-recipe on first deploy; shorten the scheme if any recipe's secret name overflows.

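The overflow risk can be estimated before a deploy. The following is a rough sketch, not the project's code: it assumes the stack name is the domain with dots replaced by underscores and that full secret names follow `<stack>_<secret>_<version>`. Both conventions, and every helper name below, are assumptions to confirm against the actual tooling.

```python
SWARM_NAME_LIMIT = 64  # config/secret name limit cited in the decision above


def warm_domain(recipe: str, base: str = "ci.commoninternet.net") -> str:
    """Stable warm domain per the scheme above: warm-<recipe>.<base>."""
    return f"warm-{recipe}.{base}"


def stack_name(domain: str) -> str:
    """ASSUMPTION: the stack name is the domain with dots replaced by underscores."""
    return domain.replace(".", "_")


def secret_full_name(domain: str, secret: str, version: str) -> str:
    """ASSUMPTION: full secret names look like <stack>_<secret>_<version>."""
    return f"{stack_name(domain)}_{secret}_{version}"


def overflowing_secrets(domain: str, secrets: dict) -> list:
    """Return the full names longer than the limit; `secrets` maps name -> version."""
    names = (secret_full_name(domain, name, ver) for name, ver in secrets.items())
    return [n for n in names if len(n) > SWARM_NAME_LIMIT]
```

Running `overflowing_secrets(warm_domain(recipe), secrets)` per recipe would flag the cases where the scheme needs shortening, without waiting for a failed first deploy.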
**Realm is the per-run isolation unit on the shared live-warm keycloak (WC1).** Instead of
co-deploying a fresh keycloak per dependent run, dependents use the one live-warm keycloak and create
a **per-run namespaced realm+client+user**, deleted at run teardown. Realm name =
`<parent_recipe>-<6hex>` where 6hex is the parent's per-run domain label suffix — unique per
(parent, pr, ref) so concurrent dependents never collide, and traceable for debugging. (Was
`realm=parent_recipe`, which would collide across concurrent same-recipe runs.)

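As an illustration of the naming rule above, the realm name can be derived from the parent's per-run domain. This is a minimal sketch; `realm_for_run` is a hypothetical helper, and it assumes the per-run label is always the first DNS label and ends in `-<6hex>`.

```python
import re

# Matches the trailing "-<6hex>" of a cold per-run label such as "lasu-0a6fb2".
_RUN_SUFFIX = re.compile(r"-([0-9a-f]{6})$")


def realm_for_run(parent_recipe: str, parent_domain: str) -> str:
    """Build `<parent_recipe>-<6hex>` from the parent's per-run domain.

    ASSUMPTION: the per-run label is the first DNS label of parent_domain.
    """
    label = parent_domain.split(".", 1)[0]
    match = _RUN_SUFFIX.search(label)
    if match is None:
        raise ValueError(f"no 6-hex run suffix in domain label {label!r}")
    return f"{parent_recipe}-{match.group(1)}"
```

For example, a `lasuite-drive` run at `lasu-0a6fb2.ci.commoninternet.net` would get the realm `lasuite-drive-0a6fb2`, while a hand-named app such as `lasu-dbg` is rejected instead of silently sharing a realm.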
**Warm keycloak is declarative INFRA, not warm DATA.** The live-warm keycloak service is brought up
by a Nix systemd-oneshot reconciler (converges to deployed+healthy at the stable domain), exactly
like the traefik recipe deploy — so it IS in the D8 reproducibility closure (re-warmable from
scratch) and self-heals on activation/boot. Only warm *volumes/snapshots* (W1+) are cache, excluded
from D8. The keycloak's realm data is ephemeral per-run, so there is nothing persistent to exclude.

**Live-warm is an optimization layer with a cold fallback.** If no warm keycloak is present (e.g. a
from-scratch host before the reconciler has run, or the warm app is down), the keycloak dep path
falls back to the existing cold co-deploy so dependent runs still work. The warm path is preferred
when available.

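The warm-preferred, cold-fallback choice described above might look like the sketch below. It is illustrative only: `choose_keycloak_dep`, the returned dict shape, and the injected `is_healthy` probe are hypothetical and not taken from `run_recipe_ci.py`.

```python
from typing import Callable

WARM_KEYCLOAK = "warm-keycloak.ci.commoninternet.net"


def choose_keycloak_dep(
    is_healthy: Callable[[str], bool],
    warm_domain: str = WARM_KEYCLOAK,
) -> dict:
    """Prefer the live-warm keycloak; fall back to a cold co-deploy.

    `is_healthy` is an injected probe (hypothetical) that returns True when a
    keycloak is deployed and answering at the given domain.
    """
    try:
        warm_ok = is_healthy(warm_domain)
    except Exception:
        # A failing probe is treated the same as "no warm keycloak present".
        warm_ok = False

    if warm_ok:
        # Shared instance: isolate via a per-run realm, delete it at teardown.
        return {"mode": "warm", "domain": warm_domain, "teardown": "delete-realm"}
    # No warm instance: co-deploy a fresh keycloak and undeploy it afterwards.
    return {"mode": "cold", "domain": None, "teardown": "undeploy"}
```

Treating a probe error as "not present" keeps dependent runs working on a from-scratch host, at the cost of the slower cold path.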
54
machine-docs/JOURNAL-2w.md
Normal file
@@ -0,0 +1,54 @@
# JOURNAL — Phase 2w (warm canonical + `--quick`) — Builder

Append-only reasoning log (WHY). Facts/verification go in STATUS-2w; verdicts in REVIEW-2w.

## 2026-05-28 — Phase 2w bootstrap + cleanup + W0 design

**Orientation.** Operator interjected Phase 2w into Phase 2 (Phase 2 paused, state preserved).
Read the 2w plan + plan.md §6.1/§7/§9. Adversary already online (REVIEW-2w `@2026-05-28 start`),
idle awaiting a WC gate claim. Seeded STATUS-2w/BACKLOG-2w/JOURNAL-2w.

**In-flight Phase 2 work committed.** Working tree had an uncommitted edit to
`tests/lasuite-drive/setup_custom_tests.sh` (Q3.2 MinIO bucket creation via the createbuckets
one-shot) — the continuation of commit 6557197. Committed it (66e065d) with an honest message: not
yet live-verified (needs a lasuite-drive deploy once warm keycloak exists). This preserves Phase 2
progress at the pause point; it resumes after 2w DONE.

**Cleanup (orchestrator-requested).** cc-ci `/` was at 91% (only 2.4G free) — a real WC8 concern
before adding warm volumes/snapshots. Tore down the leftover COLD per-run apps from paused Phase 2
via `lifecycle.teardown_app(..., verify=True)`: `lasu-0a6fb2` (12-service lasuite-drive, heaviest),
`keyc-07d81e` (cold keycloak), `lasu-dbg` (debug lasuite). All TEARDOWN OK, no residual. Disk →
86% (3.8G free). Only infra stacks remain (backups, bridge, dashboard, drone, traefik). Did NOT
`docker image prune` — 9.7GB reclaimable but the image cache is the warm pull-cache; with authed
Docker Hub pulls now wired, a re-pull is billed to the account (cheaper) but still slow, so keep the
cache. Disk is the Phase-2w budget (WC8) — monitor.

**W0 design (WC1 — live-warm keycloak).** The existing SSO harness is already most of the way there:
- `sso.setup_keycloak_realm(provider_domain, realm, client_id, ...)` creates a realm+client+user
  **idempotently via the admin API**, and `_kc_admin_password` reads the admin password from inside
  the running container (`docker exec ... cat /run/secrets/admin_password`). So it works against ANY
  running keycloak — cold or warm — with no external password handling.
- The orchestrator dep flow (`run_recipe_ci.py`): `declared_deps` → `deploy_deps` (fresh co-deploy
  per run) → `_enrich_deps_with_sso` (creates realm, realm name currently = `parent_recipe`) →
  `setup_custom_tests.sh` hook → teardown_deps (undeploy).

What WC1 changes:
1. The **realm becomes the per-run isolation unit** on a shared live-warm keycloak. Realm name must
   be unique per (parent, pr, ref) so concurrent dependents don't collide — change from
   `realm=parent_recipe` to `realm=<parent>-<6hex>` (derive the hex from the parent's per-run domain
   label so it's stable within a run and distinct across concurrent runs).
2. The keycloak dep is **not co-deployed**: point at the stable warm domain; on teardown **delete the
   realm** (not undeploy keycloak). Fall back to cold co-deploy if no warm keycloak is present (so a
   from-scratch / no-warm environment still works — the warm keycloak is an optimization layer).
3. The warm keycloak itself is **declarative infra** (Nix reconciler, like traefik) — NOT warm
   *data* (so it IS in the D8 closure as a reconciler; its realm data is ephemeral per-run anyway).
   Re-warmable from scratch.

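Because teardown now deletes a realm rather than undeploying keycloak, realms left behind by crashed runs need reaping. The sketch below shows one possible shape for that logic. The signatures are assumptions (the source only names `list_realms`, `delete_keycloak_realm`, and `reap_stale_realms`), and the staleness rule used here, "per-run realm not owned by a live run", is a guess rather than the planned criterion. In a real implementation the two injected callables would wrap Keycloak's admin REST endpoints `GET /admin/realms` and `DELETE /admin/realms/{realm}`.

```python
import re
from typing import Callable, Iterable, List

# Per-run realms end in "-<6hex>"; anything else (e.g. "master") is never touched.
PER_RUN_REALM = re.compile(r"-[0-9a-f]{6}$")


def stale_realms(all_realms: Iterable[str], live_realms: Iterable[str]) -> List[str]:
    """Per-run realms that no live run owns (ASSUMED staleness rule)."""
    live = set(live_realms)
    return sorted(
        realm
        for realm in all_realms
        if realm != "master" and PER_RUN_REALM.search(realm) and realm not in live
    )


def reap_stale_realms(
    list_realms: Callable[[], Iterable[str]],
    delete_realm: Callable[[str], None],
    live_realms: Iterable[str],
) -> List[str]:
    """Delete every stale per-run realm and return the names that were reaped."""
    reaped = []
    for realm in stale_realms(list_realms(), live_realms):
        delete_realm(realm)
        reaped.append(realm)
    return reaped
```

Keeping the selection in a pure function makes the "never delete `master` or a live run's realm" property unit-testable without a running keycloak.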
Stable-domain scheme decision: `warm-<recipe>.ci.commoninternet.net` (here `warm-keycloak...`),
clearly distinct from cold `<recipe[:4]>-<6hex>`. Risk: longer stack name → swarm 64-char
config/secret limit; will verify on first deploy and shorten if it overflows.

Building W0 in increments (each verified): (1) sso realm lifecycle prims + units; (2) deploy warm
keycloak manually at the stable domain and prove realm create→delete via admin API; (3) wire the
orchestrator live-warm mode; (4) declarative Nix reconciler; (5) e2e + concurrency + reaping proof.
</content>
62
machine-docs/STATUS-2w.md
Normal file
@@ -0,0 +1,62 @@
# STATUS — Phase 2w (warm canonical deployments + `--quick` CI mode)

**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
**Loop state for THIS phase:** STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared).
Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state.
Phase 2 is **PAUSED** (STATUS-2/BACKLOG-2 intact) and resumes after 2w `## DONE`.

## Phase
Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe
canonicals at stable domains, known-good snapshots, an opt-in `--quick` fast lane that reattaches the
canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a
nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversary cold-verified.

## Definition of Done (Phase 2w) — WC1–WC9, each Adversary cold-verified in REVIEW-2w
- [ ] **WC1** — Live-warm keycloak (SSO dep) at a stable domain; dependents create+delete per-run
      namespaced realms; concurrent dependents don't collide; leftover realms reaped.
- [ ] **WC2** — Data-warm canonical model: per-recipe canonical at a stable domain, declarative
      registry tracking recipe→known-good commit; re-warmable from scratch.
- [ ] **WC3** — Known-good snapshots: raw volume copy taken while undeployed under stable path; one
      last-known-good per app, atomic replace; restore proven to round-trip data.
- [ ] **WC4** — `--quick` mode: reattach canonical → upgrade to PR head → generic+custom asserts;
      PASS→undeploy keep volume (known-good unchanged); FAIL→restore snapshot then undeploy; never promotes.
- [ ] **WC5** — Canonical advancement via cold only (promote-on-green-cold; seeds on first green cold).
- [ ] **WC6** — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
- [ ] **WC7** — Trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge;
      results carry mode; clean no-canonical fallback.
- [ ] **WC8** — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via
      per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
- [ ] **WC9** — Docs + cold verify incl. the rollback proof (deliberately fail a PR under `--quick`,
      confirm last-known-good restored intact; a `--quick` pass did not move the known-good).

## Milestones (plan §3)
- **W0** — Warm keycloak (WC1). ← IN FLIGHT
- **W1** — Canonical registry + snapshot/restore (WC2, WC3).
- **W2** — `--quick` mode (WC4, WC7).
- **W3** — Cold-advances-canonical + nightly sweep (WC5, WC6).
- **W4** — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). → DONE.

## In flight
**W0 — live-warm keycloak (WC1).** Building incrementally:
1. sso.py realm lifecycle: add `delete_keycloak_realm` + `list_realms` + `reap_stale_realms` (realm
   is the per-run isolation unit on a shared keycloak).
2. Orchestrator dep path: live-warm mode for the keycloak dep — use the stable warm domain + a
   per-run **namespaced** realm (not realm=parent_recipe), delete the realm on teardown instead of
   undeploying keycloak. Fall back to cold co-deploy if no warm keycloak present.
3. Declarative Nix reconciler (`nix/modules/warm-keycloak.nix`) — systemd oneshot converges the
   warm keycloak to deployed+healthy at the stable domain.
4. e2e proof + concurrency (distinct realms) + reaping → claim WC1.

## Gate
(none claimed yet)

## Blocked
(none)

## Notes
- **Disk budget (WC8 watch):** cc-ci `/` was 91% (2.4G free) at phase start; freed orphaned Phase-2
  cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in
  Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
- Stable-domain scheme (proposed, see DECISIONS): `warm-<recipe>.ci.commoninternet.net`, distinct
  from cold `<recipe[:4]>-<6hex>`.
</content>