JOURNAL — Phase 2w (warm canonical + --quick) — Builder
Append-only reasoning log (WHY). Facts/verification go in STATUS-2w; verdicts in REVIEW-2w.
2026-05-28 — Phase 2w bootstrap + cleanup + W0 design
Orientation. Operator interjected Phase 2w into Phase 2 (Phase 2 paused, state preserved).
Read the 2w plan + plan.md §6.1/§7/§9. Adversary already online (REVIEW-2w @2026-05-28 start),
idle awaiting a WC gate claim. Seeded STATUS-2w/BACKLOG-2w/JOURNAL-2w.
In-flight Phase 2 work committed. Working tree had an uncommitted edit to
tests/lasuite-drive/setup_custom_tests.sh (Q3.2 MinIO bucket creation via the createbuckets
one-shot) — the continuation of commit 6557197. Committed it (66e065d) with an honest message: not
yet live-verified (needs a lasuite-drive deploy once warm keycloak exists). This preserves Phase 2
progress at the pause point; it resumes after 2w DONE.
Cleanup (orchestrator-requested). cc-ci / was at 91% (only 2.4G free) — a real WC8 concern
before adding warm volumes/snapshots. Tore down the leftover COLD per-run apps from paused Phase 2
via lifecycle.teardown_app(..., verify=True): lasu-0a6fb2 (12-service lasuite-drive, heaviest),
keyc-07d81e (cold keycloak), lasu-dbg (debug lasuite). All TEARDOWN OK, no residual. Disk →
86% (3.8G free). Only infra stacks remain (backups, bridge, dashboard, drone, traefik). Did NOT
docker image prune — 9.7GB reclaimable but the image cache is the warm pull-cache; with authed
Docker Hub pulls now wired, a re-pull is billed to the account (cheaper) but still slow, so keep the
cache. Disk is the Phase-2w budget (WC8) — monitor.
W0 design (WC1 — live-warm keycloak). The existing SSO harness is already most of the way there:
- sso.setup_keycloak_realm(provider_domain, realm, client_id, ...) creates a realm+client+user idempotently via the admin API, and _kc_admin_password reads the admin password from inside the running container (docker exec ... cat /run/secrets/admin_password). So it works against ANY running keycloak, cold or warm, with no external password handling.
- The orchestrator dep flow (run_recipe_ci.py): declared_deps → deploy_deps (fresh co-deploy per run) → _enrich_deps_with_sso (creates realm, realm name currently = parent_recipe) → setup_custom_tests.sh hook → teardown_deps (undeploy).
What WC1 changes:
- The realm becomes the per-run isolation unit on a shared live-warm keycloak. Realm name must
be unique per (parent, pr, ref) so concurrent dependents don't collide: change from
realm=parent_recipe to realm=<parent>-<6hex> (derive the hex from the parent's per-run domain
label so it's stable within a run and distinct across concurrent runs).
- The keycloak dep is not co-deployed: point at the stable warm domain; on teardown delete the realm (not undeploy keycloak). Fall back to cold co-deploy if no warm keycloak is present (so a from-scratch / no-warm environment still works; the warm keycloak is an optimization layer).
- The warm keycloak itself is declarative infra (Nix reconciler, like traefik) — NOT warm data (so it IS in the D8 closure as a reconciler; its realm data is ephemeral per-run anyway). Re-warmable from scratch.
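A minimal sketch of the per-run realm naming described in the list above. The function names and signatures are hypothetical (the real helper's signature is not recorded here), and the example domain assumes cold per-run domains share the ci.commoninternet.net base.

```python
import re

# Cold per-run domain labels look like <recipe[:4]>-<6hex>, e.g. "lasu-0a6fb2".
_LABEL_RE = re.compile(r"^[a-z0-9]+-([0-9a-f]{6})$")


def run_hex(per_run_domain: str) -> str:
    """Extract the 6-hex run id from the first label of a per-run domain."""
    label = per_run_domain.split(".", 1)[0]
    match = _LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"not a per-run domain label: {label!r}")
    return match.group(1)


def realm_for(parent_recipe: str, per_run_domain: str) -> str:
    """Per-run realm name <parent>-<6hex> (hypothetical signature)."""
    return f"{parent_recipe}-{run_hex(per_run_domain)}"
```

Because the hex comes from the parent's own domain label, every step of one run computes the same realm name without shared state, while two concurrent runs of the same parent get different names.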
Stable-domain scheme decision: warm-<recipe>.ci.commoninternet.net (here warm-keycloak...),
clearly distinct from cold <recipe[:4]>-<6hex>. Risk: longer stack name → swarm 64-char
config/secret limit; will verify on first deploy and shorten if it overflows.
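The length risk can be checked before the first deploy. The sketch below assumes two things not stated in this log: that the stack name is the domain with dots replaced by underscores, and that config/secret names are built as <stack>_<name>. Both need confirming against the real deploy.

```python
SWARM_NAME_LIMIT = 64  # the swarm config/secret name limit mentioned above
BASE_DOMAIN = "ci.commoninternet.net"


def warm_domain(recipe: str) -> str:
    """Stable warm domain: warm-<recipe>.<base>."""
    return f"warm-{recipe}.{BASE_DOMAIN}"


def stack_name(domain: str) -> str:
    # Assumption: stack name is the domain with dots turned into underscores.
    return domain.replace(".", "_")


def fits(domain: str, object_name: str) -> bool:
    """True if <stack>_<object_name> stays within the swarm name limit."""
    # Assumption: config/secret names are prefixed with the stack name.
    return len(f"{stack_name(domain)}_{object_name}") <= SWARM_NAME_LIMIT
```

If fits() returns False for any config or secret the recipe defines, that is the signal to shorten the warm- prefix scheme.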
Building W0 in increments (each verified): (1) sso realm lifecycle prims + units; (2) deploy warm keycloak manually at the stable domain and prove realm create→delete via admin API; (3) wire the orchestrator live-warm mode; (4) declarative Nix reconciler; (5) e2e + concurrency + reaping proof.
2026-05-29 — W0 core mechanism PROVEN; declarative reconciler up; design update absorbed
Stale Phase-2 run killed. Found an orphaned run_recipe_ci.py (RECIPE=lasuite-drive, the Q3.2
ccci-q32-drive-sso2.log run) still alive from before the phase switch (PPID 1, nohup). It had
deployed lasu-0a6fb2 + tried a cold keyc-07d81e dep — both of which I'd already torn down, so it was
failing. Killed its process tree + janitored. Only infra + warm-keycloak remain.
W0.1 realm lifecycle (sso.py) — list_realms / delete_keycloak_realm (idempotent, refuses master) / realms_to_reap (pure predicate) / reap_orphaned_realms. +8 unit tests. The per-run realm is the isolation unit on a shared keycloak; orphans reaped by hex-not-in-live-stacks (concurrency-safe).
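The reaping rule (hex-not-in-live-stacks) as a pure predicate might look roughly like this. It is a sketch, not the sso.py code: the regex and the protected set are assumptions, and any hand-made realm that happens to end in -<6hex> would also match.

```python
import re

_RUN_REALM_RE = re.compile(r"^.+-([0-9a-f]{6})$")
PROTECTED = frozenset({"master"})


def realms_to_reap(realms, live_hexes):
    """Return per-run realms whose run hex has no live stack. Pure: no I/O."""
    live = set(live_hexes)
    doomed = []
    for realm in realms:
        if realm in PROTECTED:
            continue  # never reap the admin realm
        match = _RUN_REALM_RE.match(realm)
        if match is None:
            continue  # not a per-run realm; leave it alone
        if match.group(1) not in live:
            doomed.append(realm)
    return doomed
```

Keeping the decision pure means it can be unit-tested without a keycloak, and a realm belonging to a concurrently running dependent is kept because its hex is still in the live set.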
W0.2 orchestrator live-warm mode: warm.py (stable-domain scheme, is_warm_up probe, live_app_hexes, realm_for=<parent>-<6hex>, reap_orphan_realms). run_recipe_ci splits declared deps into warm (shared provider + per-run realm, no deploy, realm deleted at teardown) vs cold (co-deploy); a dep goes warm only if its provider is up, else it falls back to cold. Deploy-count excludes warm deps; orphans are reaped at run start. Dependent tests now assert the namespaced realm pattern (stronger than ==parent).
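The warm/cold split with cold fallback can be sketched as below. The names split_deps and deploy_count and the warm_capable set are hypothetical; the probe is injected so the logic is testable without a live provider.

```python
from typing import Callable, Iterable


def split_deps(
    declared: Iterable[str],
    warm_capable: set[str],
    is_warm_up: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Partition declared deps into (warm, cold); cold is the fallback."""
    warm, cold = [], []
    for dep in declared:
        if dep in warm_capable and is_warm_up(dep):
            warm.append(dep)   # shared provider, per-run realm, no deploy
        else:
            cold.append(dep)   # fresh co-deploy for this run
    return warm, cold


def deploy_count(parent_deploys: int, cold: list[str]) -> int:
    # Warm deps are never deployed by the run, so only cold deps add to the count.
    return parent_deploys + len(cold)
```

With keycloak warm and up, a single-parent run counts one deploy; with the warm provider down, the same run falls back to two.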
WC1 CORE MECHANISM PROVEN (deploy-free, live warm keycloak): realm create → password-grant JWT → discovery issuer → delete(idempotent) → reap(keeps live hex, deletes orphan): ALL PASS.
W0.3 declarative reconciler (nix/modules/warm-keycloak.nix) — systemd oneshot, converges warm keycloak. Two bugs found+fixed against the real system:
- abra app deploy non-chaos FATALs "already deployed" → need -f (tested: redeploys at ENV VERSION, exit 0).
- Newline bite (the backupbot.nix bite): keycloak's .env.sample ends with a newline-less #COMPOSE_FILE= comment, so bash set_env's printf glued DOMAIN= onto that comment → DOMAIN unset → KC_HOSTNAME=https:// (empty host) → keycloak crash-loop ("Expected authority at index 8: https://"). Fixed set_env to ensure a trailing newline before append (same as backupbot).
Also made converge skip the redeploy when already 200 (no JVM-restart blip on every rebuild; only (re)deploys when down/crash-looping). Verified: nixos-rebuild switch → warm-keycloak.service active "no-op converge", system running (0 failed), /realms/master=200.
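The newline guard is easy to get wrong, so here is the same idea sketched in Python. The real set_env is a bash function in the Nix module; this is only an illustration of the fix, not that code.

```python
from pathlib import Path


def set_env(env_file: Path, key: str, value: str) -> None:
    """Append KEY=value, first terminating an unterminated last line."""
    text = env_file.read_text() if env_file.exists() else ""
    if text and not text.endswith("\n"):
        text += "\n"  # without this, KEY= is glued onto the previous line
    env_file.write_text(f"{text}{key}={value}\n")
```

With a file ending in a newline-less #COMPOSE_FILE= comment, the guard puts DOMAIN= on its own line instead of inside the comment.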
W0.4 e2e (lasuite-docs vs warm keycloak) — the WARM MECHANISM worked: deploy-count=1 (keycloak
NOT co-deployed), per-run realm lasuite-docs-9c1995 created + deleted on the warm keycloak at
teardown, install pass. BUT setup_custom_tests.sh exited 1 → 3 requires_deps SSO tests SKIPPED →
F2-11 correctly FAILED the run (not green). Root cause = a lasuite-docs recipe race, NOT warm
keycloak: the in-place abra app deploy --force --chaos (OIDC wiring) rolls all services; nginx
web fatally exits on [emerg] host not found in upstream ...backend:8000 while backend is
mid-restart, and abra's converge check times out → "deploy failed 🛑". This is independent of
warm/cold keycloak (Q2.4 cold-keycloak lasuite-docs passed before; warm should REDUCE contention).
Filed as a finding to investigate (flaky/timing/resource vs deterministic regression); the headline
WC1 "dependent SSO tests green against warm keycloak" needs this resolved or a more-robust dependent.
DESIGN UPDATE absorbed (orchestrator + Adversary REVIEW-2w, 2026-05-28→29). Warm/infra apps (traefik + keycloak) now AUTO-UPDATE to LATEST nightly with HEALTH-GATED ROLLBACK:
- WC1 revised: UNPIN keycloak (match traefik: abra recipe fetch latest + chaos deploy; DROP kcVersion). Keep secret-generate-only-if-missing + health-wait. D8 preserved (recipe fetched at runtime → nix closure byte-identical).
- WC1.1 NEW: health-gated deploy-with-rollback IN the reconcilers. Record last-good → deploy latest → health-check → healthy: commit last-good:=latest; unhealthy: rollback + PushNotification. Stateful (keycloak): undeploy → raw snapshot data volume → deploy latest → on fail restore snapshot + redeploy prior version (forward DB migrations make version-only rollback unsafe). traefik (stateless) = version rollback only. Reuse WC3 snapshot helper.
- WC1.2 NEW: pre-deploy safety gate — auto-apply only non-major/no-manual-migration bumps; a MAJOR bump or manual-migration release notes → stay on current + alert (don't auto-apply).
- WC6 reordered: nightly = nixos-rebuild switch FIRST (warm/infra→latest, health-gated) THEN full-cold sweep; never while a test is in flight.
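The WC1.1 control flow for a stateful app, sketched with injected callables so it runs without abra or docker. All names here are hypothetical, the undeploy step is folded into snapshot/restore for brevity, and treating a raised deploy the same as an unhealthy one is a choice of this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Outcome:
    running: str     # version left running
    committed: bool  # True if last_good advanced
    alert: bool      # True if a rollback alert should be raised


def reconcile(
    last_good: str,
    latest: str,
    deploy: Callable[[str], None],   # may raise on lint/converge failure
    healthy: Callable[[], bool],
    snapshot: Callable[[], None],
    restore: Callable[[], None],
) -> Outcome:
    if latest == last_good:
        return Outcome(last_good, committed=False, alert=False)
    snapshot()                       # capture data before the upgrade
    try:
        deploy(latest)
        ok = healthy()
    except Exception:
        ok = False                   # a failed deploy counts as unhealthy
    if ok:
        return Outcome(latest, committed=True, alert=False)
    restore()                        # restore data first, then the prior version
    deploy(last_good)
    return Outcome(last_good, committed=False, alert=True)
```

Restoring the snapshot before redeploying the prior version is what makes the rollback safe when the newer version has already run forward DB migrations.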
Re-sequencing consequence: WC1.1 depends on the WC3 snapshot/restore helper, so I build that FIRST (foundational), then rewrite the reconciler ONCE into the full unpinned + health-gated + safety-gated + rollback form (avoids reworking the reconciler twice). Current reconciler (pinned, skip-if-healthy) is INTERIM — keeps keycloak live-warm/healthy meanwhile; will be replaced. Also need to settle the alert mechanism: a bash systemd reconciler can't call the agent's PushNotification tool directly — decision needed (alert sentinel file the Builder loop reads + relays, or a webhook).
2026-05-29 — W0.5 WC3 snapshot helper proven; disk reclaim (WC8 hygiene)
W0.5 warmsnap.py landed + LIVE round-trip proven on warm keycloak (see STATUS-2w). Then settled the
W0.6 reconciler approach (python entrypoint in nix store; deploy-by-tag; recipe-semver = the
component before the "+") in DECISIONS.
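Taking the recipe semver to be the part of the tag before the "+", the WC1.2 safety gate could be sketched as follows. The function names are hypothetical, and how a manual-migration release is detected is left out: it arrives here as a plain boolean.

```python
def recipe_semver(tag: str) -> tuple[int, int, int]:
    """Parse the recipe part of a tag like '10.7.1+26.6.2' into (10, 7, 1)."""
    recipe_part = tag.split("+", 1)[0]
    major, minor, patch = (int(p) for p in recipe_part.split("."))
    return major, minor, patch


def may_auto_apply(current: str, latest: str, manual_migration: bool = False) -> bool:
    """True only for a newer, same-major release with no manual migration."""
    cur, new = recipe_semver(current), recipe_semver(latest)
    if new <= cur:
        return False   # nothing newer to apply
    if new[0] != cur[0]:
        return False   # MAJOR bump: stay on current and alert
    return not manual_migration
```

Comparing integer tuples avoids the string-ordering trap where "10.7.10" would sort before "10.7.9".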
Disk reclaim. After 3 nixos-rebuild switches + 3 keycloak deploy cycles (WC3 proof) + a 159M
keycloak snapshot, / hit 96% (1.2G free) — a WC8 red flag before continuing. Reclaimed safely
(reversibility is via the git-declared config, not old generations): rm -rf /root/cc-ci.prev;
nix-collect-garbage -d (2553 paths, 3.38G); docker image prune -f dangling-only (3.32G, KEEPS the
tagged pull-cache); pruned old abra deploy logs (keep last 5). Result: 62% (10G free). This
GC+dangling-prune is the disk-management mechanism WC8 must formalize (run it in the nightly/W4, and
keep one last-good snapshot per app bounded). NOTE for WC8: the WC3 keycloak snapshot is 159M; a
warm-set of ~6 canonicals × (volume + 1 snapshot) is the disk budget to size.
State at checkpoint: warm keycloak healthy (200), only infra+warm stacks, system running (0
failed), disk 62%. W0.1-W0.5 done+proven+pushed (HEAD 67240dc). Next unit: W0.6 reconciler rewrite
(unpin + WC1.2 safety gate + WC1.1 health-gated rollback), then W0.7/W0.8 (lasuite-docs race +
headline WC1 e2e).
2026-05-29 — W0.9 WC1.1 live proofs PASS (healthy upgrade + marquee rollback)
Built runner/warm_reconcile.py's health-gated rollback and proved it live against the warm keycloak
using annotated fake tags + CCCI_SKIP_FETCH=1. The proof iterations surfaced 4 real issues, each
fixed against the real system (verify-don't-assume):
- deploy-failure must roll back too: a broken "latest" can fail abra's lint/converge (deploy_version raises) rather than deploy-then-be-unhealthy; wrapped the upgrade deploy so BOTH raise and unhealthy paths trigger the snapshot-restore rollback (else the unit just crashes).
- warmsnap clobbered last_good: snapshot's atomic swap renamed the whole <recipe>/ dir, wiping the sibling last_good file. Fixed: snapshot lives in <recipe>/snapshot/; only that subdir is swapped; last_good (sibling) survives.
- swarm settle race: abra undeploy returns before swarm finishes removing tasks, so an immediate snapshot/restore/redeploy of the same stack raced a half-removed stack. Added wait_undeployed() after every undeploy.
- abra writes FATA to stdout: deploy_version only surfaced stderr (empty); now includes stdout.
This is how I diagnosed the two test-artifact failures: the broken deploy failed abra lint R009
(bad env not a string — a valid "broken latest"), and the first rollback attempts failed abra
lint R014 "only annotated tags used for recipe version" because my fake tags were lightweight
(production tags are annotated) — a TEST artifact, not a reconciler bug. Fixed the test to create
annotated tags (peel ^{} to avoid nested-tag; set git identity).
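The last_good clobber fix comes down to swapping only the snapshot subdirectory. A rough sketch, assuming the new snapshot is staged in a sibling directory on the same filesystem; the function name and the snapshot.old staging name are hypothetical, and this is not the warmsnap.py code.

```python
import os
import shutil
from pathlib import Path


def swap_in_snapshot(recipe_dir: Path, staged: Path) -> None:
    """Replace <recipe>/snapshot/ with `staged`, leaving siblings untouched."""
    live = recipe_dir / "snapshot"
    old = recipe_dir / "snapshot.old"
    if old.exists():
        shutil.rmtree(old)        # leftover from an interrupted earlier swap
    if live.exists():
        os.rename(live, old)      # step the previous snapshot aside
    os.rename(staged, live)       # staged must be on the same filesystem
    if old.exists():
        shutil.rmtree(old)
```

Only snapshot/ is ever renamed, so a last_good file sitting next to it in <recipe>/ is never touched. The swap is two renames rather than one, so there is a brief window with no snapshot/ directory.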
Final PROOF (ALL PASS):
- (a) healthy upgrade 10.7.1→10.7.9: snapshot taken (subdir), deploy, health-pass, last_good committed=10.7.9, marker realm preserved through the undeploy/snapshot/redeploy.
- (b) marquee rollback: broken latest 10.7.10 → deploy fails → rollback to 10.7.9 → HEALTHY; marker realm INTACT (data preserved through broken-upgrade + snapshot-restore); last_good NOT advanced; rollback alert sentinel written (attempted=10.7.10, last_good=10.7.9, recovered=True). keycloak recovered to canonical 10.7.1+26.6.2 healthy, no fake tags left.
This satisfies the WC1.1 Adversary mandate (broken latest → self-revert + data intact + alert; healthy update commits last-good). WC1.2 holds were proven in W0.6. The reconciler-side WC1/WC1.1/WC1.2 are proven; the alert RELAY (Builder loop scans /var/lib/ci-warm/alerts/ → PushNotification + archive to seen/) is still to wire (flagged for when nightly WC6 lands / a real alert can occur).
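The still-unwired relay could be as small as the sketch below. It assumes sentinels are JSON files directly under the alerts directory (the on-disk format is not recorded here) and takes the notifier as an injected callable standing in for PushNotification.

```python
import json
import shutil
from pathlib import Path
from typing import Callable


def relay_alerts(alerts_dir: Path, notify: Callable[[dict], None]) -> int:
    """Relay each pending alert sentinel once, then archive it to seen/."""
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = 0
    # glob is non-recursive, so files already archived in seen/ are skipped.
    for sentinel in sorted(alerts_dir.glob("*.json")):
        notify(json.loads(sentinel.read_text()))
        shutil.move(str(sentinel), str(seen / sentinel.name))
        relayed += 1
    return relayed
```

The Builder loop would call this with Path("/var/lib/ci-warm/alerts"). Because the move happens only after notify returns, a sentinel whose notification fails stays in place and is retried on the next scan.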
Remaining for the WC1 gate: W0.7 (lasuite-docs in-place chaos-redeploy nginx race) + W0.8 (headline dependent-SSO-green e2e vs warm keycloak + concurrent distinct realms + reaping).