# JOURNAL — Phase 2w (warm canonical + `--quick`) — Builder

Append-only reasoning log (WHY). Facts/verification go in STATUS-2w; verdicts in REVIEW-2w.

## 2026-05-28 — Phase 2w bootstrap + cleanup + W0 design

**Orientation.** Operator interjected Phase 2w into Phase 2 (Phase 2 paused, state preserved).
Read the 2w plan + plan.md §6.1/§7/§9. Adversary already online (REVIEW-2w `@2026-05-28 start`),
idle awaiting a WC gate claim. Seeded STATUS-2w/BACKLOG-2w/JOURNAL-2w.

**In-flight Phase 2 work committed.** Working tree had an uncommitted edit to
`tests/lasuite-drive/setup_custom_tests.sh` (Q3.2 MinIO bucket creation via the createbuckets
one-shot), the continuation of commit 6557197. Committed it (66e065d) with an honest message: not
yet live-verified (needs a lasuite-drive deploy once warm keycloak exists). This preserves Phase 2
progress at the pause point; it resumes after 2w DONE.

**Cleanup (orchestrator-requested).** cc-ci `/` was at 91% (only 2.4G free), a real WC8 concern
before adding warm volumes/snapshots. Tore down the leftover COLD per-run apps from paused Phase 2
via `lifecycle.teardown_app(..., verify=True)`: `lasu-0a6fb2` (12-service lasuite-drive, heaviest),
`keyc-07d81e` (cold keycloak), `lasu-dbg` (debug lasuite). All TEARDOWN OK, no residual. Disk →
86% (3.8G free). Only infra stacks remain (backups, bridge, dashboard, drone, traefik). Did NOT
run `docker image prune`: 9.7GB is reclaimable, but the image cache is the warm pull-cache. With
authed Docker Hub pulls now wired, a re-pull is billed to the account (cheaper) but still slow, so
keep the cache. Disk is the Phase-2w budget (WC8): monitor.

**W0 design (WC1: live-warm keycloak).** The existing SSO harness is already most of the way there:
- `sso.setup_keycloak_realm(provider_domain, realm, client_id, ...)` creates a realm+client+user
  **idempotently via the admin API**, and `_kc_admin_password` reads the admin password from inside
  the running container (`docker exec ... cat /run/secrets/admin_password`). So it works against ANY
  running keycloak, cold or warm, with no external password handling.
- The orchestrator dep flow (`run_recipe_ci.py`): `declared_deps` → `deploy_deps` (fresh co-deploy
  per run) → `_enrich_deps_with_sso` (creates realm, realm name currently = `parent_recipe`) →
  `setup_custom_tests.sh` hook → `teardown_deps` (undeploy).

What WC1 changes:
1. The **realm becomes the per-run isolation unit** on a shared live-warm keycloak. Realm name must
   be unique per (parent, pr, ref) so concurrent dependents don't collide: change from
   `realm=parent_recipe` to `realm=<parent>-<6hex>` (derive the hex from the parent's per-run domain
   label so it's stable within a run and distinct across concurrent runs).
2. The keycloak dep is **not co-deployed**: point at the stable warm domain; on teardown **delete the
   realm** (not undeploy keycloak). Fall back to cold co-deploy if no warm keycloak is present (so a
   from-scratch / no-warm environment still works; the warm keycloak is an optimization layer).
3. The warm keycloak itself is **declarative infra** (Nix reconciler, like traefik), NOT warm
   *data* (so it IS in the D8 closure as a reconciler; its realm data is ephemeral per-run anyway).
   Re-warmable from scratch.

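
Illustrative sketch of the realm naming in item 1, not the code that ends up in the harness.
Assumptions: the parent's per-run domain label ends in `-<6hex>` as in the cold scheme, the
per-run domain sits under `ci.commoninternet.net`, and the `realm_for` signature shown here is
hypothetical.

```python
import re

# Assumption: a per-run domain label looks like "<recipe[:4]>-<6hex>", e.g. "lasu-0a6fb2".
_HEX6_SUFFIX = re.compile(r"-([0-9a-f]{6})$")


def realm_for(parent: str, run_domain: str) -> str:
    """Return "<parent>-<6hex>", reusing the hex from the parent's per-run domain label."""
    label = run_domain.split(".", 1)[0]
    match = _HEX6_SUFFIX.search(label)
    if match is None:
        raise ValueError(f"no 6-hex suffix in domain label: {label!r}")
    return f"{parent}-{match.group(1)}"


# Hypothetical per-run domain; prints "lasuite-drive-0a6fb2".
print(realm_for("lasuite-drive", "lasu-0a6fb2.ci.commoninternet.net"))
```

Reusing the hex (rather than generating a new one) keeps the realm name stable within a run and
lets a reaper match realms to live stacks by hex alone.
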
Stable-domain scheme decision: `warm-<recipe>.ci.commoninternet.net` (here `warm-keycloak...`),
clearly distinct from cold `<recipe[:4]>-<6hex>`. Risk: longer stack name → swarm 64-char
config/secret limit; will verify on first deploy and shorten if it overflows.

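
Rough pre-check for the 64-char risk, as a sketch only. Assumptions: the stack name is the domain
with dots replaced by underscores, and swarm config/secret names take the form `<stack>_<name>`;
the secret name below is hypothetical. The first real deploy remains the actual verification.

```python
SWARM_NAME_LIMIT = 64  # limit named in the risk note above


def warm_domain(recipe: str, base: str = "ci.commoninternet.net") -> str:
    """Stable warm domain: warm-<recipe>.<base>."""
    return f"warm-{recipe}.{base}"


def stack_name(domain: str) -> str:
    """Assumption: stack name = domain with '.' replaced by '_'."""
    return domain.replace(".", "_")


def overflows(domain: str, object_name: str) -> bool:
    """True if "<stack>_<object_name>" would exceed the swarm name limit."""
    return len(f"{stack_name(domain)}_{object_name}") > SWARM_NAME_LIMIT


domain = warm_domain("keycloak")
# "admin_password_v1" is a made-up secret name used only to exercise the check.
print(domain, overflows(domain, "admin_password_v1"))
```
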
Building W0 in increments (each verified): (1) sso realm lifecycle prims + units; (2) deploy warm
keycloak manually at the stable domain and prove realm create→delete via admin API; (3) wire the
orchestrator live-warm mode; (4) declarative Nix reconciler; (5) e2e + concurrency + reaping proof.
## 2026-05-29 — W0 core mechanism PROVEN; declarative reconciler up; design update absorbed

**Stale Phase-2 run killed.** Found an orphaned `run_recipe_ci.py` (RECIPE=lasuite-drive, the Q3.2
`ccci-q32-drive-sso2.log` run) still alive from before the phase switch (PPID 1, nohup). It had
deployed `lasu-0a6fb2` + tried a cold `keyc-07d81e` dep, both of which I'd already torn down, so it
was failing. Killed its process tree + janitored. Only infra + warm-keycloak remain.

**W0.1 realm lifecycle (`sso.py`):** `list_realms` / `delete_keycloak_realm` (idempotent, refuses
`master`) / `realms_to_reap` (pure predicate) / `reap_orphaned_realms`. +8 unit tests. The per-run
realm is the isolation unit on a shared keycloak; orphans are reaped when their hex is not in the
live stacks (concurrency-safe).

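
Sketch of what a pure `realms_to_reap` predicate could look like under the hex-not-in-live-stacks
rule. The signature, the realm-name regex, and the choice to ignore realms that don't match the
`<parent>-<6hex>` pattern are assumptions, not the actual `sso.py` code.

```python
import re
from typing import Iterable, List, Set

# Assumption: per-run realms are named "<parent>-<6hex>".
_PER_RUN_REALM = re.compile(r"^.+-([0-9a-f]{6})$")


def realms_to_reap(realms: Iterable[str], live_hexes: Set[str]) -> List[str]:
    """Return per-run realms whose hex has no live stack. Pure: no I/O, no deletion."""
    orphans = []
    for realm in realms:
        if realm == "master":
            continue  # never touch the admin realm
        match = _PER_RUN_REALM.match(realm)
        if match is None:
            continue  # not a per-run realm, leave it alone
        if match.group(1) not in live_hexes:
            orphans.append(realm)
    return orphans


existing = ["master", "lasuite-docs-9c1995", "lasuite-drive-0a6fb2", "manual-realm"]
print(realms_to_reap(existing, live_hexes={"9c1995"}))  # ['lasuite-drive-0a6fb2']
```

Keeping the decision pure means the concurrency rule can be unit-tested without a running
keycloak; only the caller that deletes the returned realms needs the admin API.
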
**W0.2 orchestrator live-warm mode:** `warm.py` (stable-domain scheme, `is_warm_up` probe,
`live_app_hexes`, `realm_for` = `<parent>-<6hex>`, `reap_orphan_realms`). `run_recipe_ci` splits
declared deps into warm (shared provider + per-run realm, no deploy, realm deleted at teardown) vs
cold (co-deploy): warm only if the provider is up, else cold fallback. Deploy-count excludes warm
deps; orphans are reaped at run start. Dependent tests now assert the namespaced realm pattern
(stronger than ==parent).

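
Minimal sketch of the warm/cold split described above. `split_deps` and `deploy_count` are
hypothetical names, the probe is passed in as a callable, and counting deploys as "parent + cold
deps" is an assumption about how the orchestrator tallies them.

```python
from typing import Callable, Iterable, List, Tuple


def split_deps(
    declared: Iterable[str], is_warm_up: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Partition declared deps: warm if a live warm provider answers, else cold fallback."""
    warm: List[str] = []
    cold: List[str] = []
    for dep in declared:
        (warm if is_warm_up(dep) else cold).append(dep)
    return warm, cold


def deploy_count(cold: List[str]) -> int:
    """Assumption: the parent deploy plus one deploy per cold dep; warm deps add none."""
    return 1 + len(cold)


# Fake probe for the example: only keycloak has a warm provider up.
warm, cold = split_deps(["keycloak"], is_warm_up=lambda dep: dep == "keycloak")
print(warm, cold, deploy_count(cold))
```
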
**WC1 CORE MECHANISM PROVEN** (deploy-free, live warm keycloak): realm create → password-grant JWT
→ discovery issuer → delete (idempotent) → reap (keeps live hex, deletes orphan): ALL PASS.

**W0.3 declarative reconciler** (`nix/modules/warm-keycloak.nix`): systemd oneshot, converges warm
keycloak. Two bugs found+fixed against the real system:
1. `abra app deploy` non-chaos FATALs "already deployed" → need `-f` (tested: redeploys at ENV
   VERSION, exit 0).
2. **Newline bite** (the backupbot.nix bite): keycloak's .env.sample ends with a newline-less
   `#COMPOSE_FILE=` comment, so bash `set_env`'s printf glued `DOMAIN=` onto that comment →
   DOMAIN unset → `KC_HOSTNAME=https://` (empty host) → keycloak crash-loop ("Expected authority at
   index 8: https://"). Fixed `set_env` to ensure a trailing newline before append (same as backupbot).

Also made converge **skip the redeploy when already 200** (no JVM-restart blip on every rebuild;
only (re)deploys when down/crash-looping). Verified: nixos-rebuild switch → warm-keycloak.service
active "no-op converge", system running (0 failed), /realms/master=200.

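
The newline bite, reproduced as a small sketch. The real `set_env` is bash inside the Nix module;
this Python version only illustrates the fix (ensure a trailing newline before appending) and is
append-only, which is a simplification.

```python
from pathlib import Path


def set_env(env_file: Path, key: str, value: str) -> None:
    """Append KEY=value, first terminating a newline-less last line so nothing gets glued."""
    text = env_file.read_text() if env_file.exists() else ""
    if text and not text.endswith("\n"):
        text += "\n"  # without this, "#COMPOSE_FILE=" + "DOMAIN=..." end up on one line
    text += f"{key}={value}\n"
    env_file.write_text(text)
```

With a file ending in a bare `#COMPOSE_FILE=`, the appended `DOMAIN=` line lands on its own line
instead of being swallowed by the comment.
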
**W0.4 e2e (lasuite-docs vs warm keycloak):** the WARM MECHANISM worked: deploy-count=1 (keycloak
NOT co-deployed), per-run realm `lasuite-docs-9c1995` created + **deleted on the warm keycloak** at
teardown, install pass. BUT `setup_custom_tests.sh` exited 1 → 3 requires_deps SSO tests SKIPPED →
F2-11 correctly FAILED the run (not green). Root cause = a **lasuite-docs recipe race**, NOT warm
keycloak: the in-place `abra app deploy --force --chaos` (OIDC wiring) rolls all services; nginx
`web` fatally exits on `[emerg] host not found in upstream ...backend:8000` while backend is
mid-restart, and abra's converge check times out → "deploy failed 🛑". This is independent of
warm/cold keycloak (Q2.4 cold-keycloak lasuite-docs passed before; warm should REDUCE contention).
Filed as a finding to investigate (flaky/timing/resource vs deterministic regression); the headline
WC1 "dependent SSO tests green against warm keycloak" needs this resolved or a more-robust dependent.

**DESIGN UPDATE absorbed (orchestrator + Adversary REVIEW-2w, 2026-05-28→29).** Warm/infra apps
(traefik + keycloak) now AUTO-UPDATE to LATEST nightly with HEALTH-GATED ROLLBACK:
- **WC1 revised:** UNPIN keycloak (match traefik: `abra recipe fetch` latest + chaos deploy; DROP
  kcVersion). Keep secret-generate-only-if-missing + health-wait. D8 preserved (recipe fetched at
  runtime → nix closure byte-identical).
- **WC1.1 NEW:** health-gated deploy-with-rollback IN the reconcilers. Record last-good → deploy
  latest → health-check → healthy: commit last-good:=latest; unhealthy: rollback + PushNotification.
  Stateful (keycloak): undeploy → raw snapshot data volume → deploy latest → on fail restore
  snapshot + redeploy prior version (forward DB migrations make version-only rollback unsafe).
  traefik (stateless) = version rollback only. Reuse WC3 snapshot helper.
- **WC1.2 NEW:** pre-deploy safety gate: auto-apply only non-major/no-manual-migration bumps; a
  MAJOR bump or manual-migration release notes → stay on current + alert (don't auto-apply).
- **WC6 reordered:** nightly = nixos-rebuild switch FIRST (warm/infra→latest, health-gated) THEN
  full-cold sweep; never while a test is in flight.

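
Control-flow sketch of WC1.1 + WC1.2 together, with the side effects injected as callables so the
logic is testable. Everything here is hypothetical: the function names, the plain
`MAJOR.MINOR.PATCH` version strings, and the idea that the stateful snapshot/restore lives inside
the `deploy` and `rollback` callables. The real reconcilers are bash/Nix.

```python
from typing import Callable


def is_major_bump(current: str, candidate: str) -> bool:
    """True when the leading version component increases (assumes MAJOR.MINOR.PATCH)."""
    return int(candidate.split(".")[0]) > int(current.split(".")[0])


def converge(
    last_good: str,
    latest: str,
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    rollback: Callable[[str], None],
    alert: Callable[[str], None],
    needs_manual_migration: bool = False,
) -> str:
    """Return the version recorded as last-good after this converge attempt."""
    # WC1.2: pre-deploy safety gate, stay on current and alert.
    if is_major_bump(last_good, latest) or needs_manual_migration:
        alert(f"holding {last_good}: {latest} needs a manual decision")
        return last_good
    # WC1.1: deploy latest, then commit or roll back depending on health.
    deploy(latest)
    if healthy():
        return latest
    rollback(last_good)
    alert(f"{latest} unhealthy, rolled back to {last_good}")
    return last_good
```

For keycloak the `rollback` callable would restore the data-volume snapshot before redeploying the
prior version; for traefik it would only redeploy the prior version.
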
**Re-sequencing consequence:** WC1.1 depends on the **WC3 snapshot/restore helper**, so I build that
FIRST (foundational), then rewrite the reconciler ONCE into the full unpinned + health-gated +
safety-gated + rollback form (avoids reworking the reconciler twice). Current reconciler (pinned,
skip-if-healthy) is INTERIM: it keeps keycloak live-warm/healthy meanwhile and will be replaced.
Also need to settle the **alert mechanism**: a bash systemd reconciler can't call the agent's
PushNotification tool directly, so a decision is needed (an alert sentinel file the Builder loop
reads + relays, or a webhook).
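
For the sentinel-file option only (undecided at this point), a sketch of the handoff. The directory
layout, JSON shape, and function names are all made up for illustration; in practice the writer
side would be the bash reconciler, and only the reader side would live in the Builder loop.

```python
import json
import time
from pathlib import Path
from typing import Callable


def write_alert(alert_dir: Path, source: str, message: str) -> Path:
    """Drop one alert as a JSON file; write to a temp name, then rename atomically."""
    alert_dir.mkdir(parents=True, exist_ok=True)
    path = alert_dir / f"{time.time_ns()}-{source}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"source": source, "message": message}))
    tmp.replace(path)  # the reader only globs *.json, so it never sees a partial file
    return path


def drain_alerts(alert_dir: Path, relay: Callable[[dict], None]) -> int:
    """Relay every pending alert, deleting each file after it is relayed. Returns the count."""
    count = 0
    for path in sorted(alert_dir.glob("*.json")):
        relay(json.loads(path.read_text()))
        path.unlink()
        count += 1
    return count
```

The file-based route keeps the reconciler free of network dependencies; the webhook route would
trade that for lower latency.
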