cc-ci/machine-docs/JOURNAL-2w.md

# JOURNAL — Phase 2w (warm canonical + `--quick`) — Builder

Append-only reasoning log (WHY). Facts/verification go in STATUS-2w; verdicts in REVIEW-2w.

## 2026-05-28 — Phase 2w bootstrap + cleanup + W0 design

**Orientation.** Operator interjected Phase 2w into Phase 2 (Phase 2 paused, state preserved).
Read the 2w plan + plan.md §6.1/§7/§9. Adversary already online (REVIEW-2w `@2026-05-28 start`),
idle awaiting a WC gate claim. Seeded STATUS-2w/BACKLOG-2w/JOURNAL-2w.

**In-flight Phase 2 work committed.** Working tree had an uncommitted edit to
`tests/lasuite-drive/setup_custom_tests.sh` (Q3.2 MinIO bucket creation via the createbuckets
one-shot) — the continuation of commit 6557197. Committed it (66e065d) with an honest message: not
yet live-verified (needs a lasuite-drive deploy once warm keycloak exists). This preserves Phase 2
progress at the pause point; it resumes after 2w DONE.

**Cleanup (orchestrator-requested).** cc-ci `/` was at 91% (only 2.4G free) — a real WC8 concern
before adding warm volumes/snapshots. Tore down the leftover COLD per-run apps from paused Phase 2
via `lifecycle.teardown_app(..., verify=True)`: `lasu-0a6fb2` (12-service lasuite-drive, heaviest),
`keyc-07d81e` (cold keycloak), `lasu-dbg` (debug lasuite). All TEARDOWN OK, no residual. Disk →
86% (3.8G free). Only infra stacks remain (backups, bridge, dashboard, drone, traefik). Did NOT
`docker image prune`: 9.7GB is reclaimable, but the image cache is the warm pull-cache. With authed
Docker Hub pulls now wired, a re-pull is billed to the account (cheaper) but is still slow, so keep
the cache. Disk is the Phase-2w budget (WC8), so monitor it.

**W0 design (WC1 — live-warm keycloak).** The existing SSO harness is already most of the way there:
- `sso.setup_keycloak_realm(provider_domain, realm, client_id, ...)` creates a realm+client+user
  **idempotently via the admin API**, and `_kc_admin_password` reads the admin password from inside
  the running container (`docker exec ... cat /run/secrets/admin_password`). So it works against ANY
  running keycloak — cold or warm — with no external password handling.
- The orchestrator dep flow (`run_recipe_ci.py`): `declared_deps` → `deploy_deps` (fresh co-deploy
  per run) → `_enrich_deps_with_sso` (creates realm, realm name currently = `parent_recipe`) →
  `setup_custom_tests.sh` hook → teardown_deps (undeploy).

What WC1 changes:
1. The **realm becomes the per-run isolation unit** on a shared live-warm keycloak. Realm name must
   be unique per (parent, pr, ref) so concurrent dependents don't collide — change from
   `realm=parent_recipe` to `realm=<parent>-<6hex>` (derive the hex from the parent's per-run domain
   label so it's stable within a run and distinct across concurrent runs).
2. The keycloak dep is **not co-deployed**: point at the stable warm domain; on teardown **delete the
   realm** (not undeploy keycloak). Fall back to cold co-deploy if no warm keycloak is present (so a
   from-scratch / no-warm environment still works — the warm keycloak is an optimization layer).
3. The warm keycloak itself is **declarative infra** (Nix reconciler, like traefik) — NOT warm
   *data* (so it IS in the D8 closure as a reconciler; its realm data is ephemeral per-run anyway).
   Re-warmable from scratch.
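
Sketch of the realm naming in item 1 (illustrative only, not the code in the repo). It assumes the
parent's per-run domain label has the cold shape `<recipe[:4]>-<6hex>`; the helper name `run_hex` is
hypothetical.

```python
import re

# Assumption: a cold per-run domain label looks like "<recipe[:4]>-<6hex>", e.g. "lasu-9c1995".
_LABEL_RE = re.compile(r"^[a-z0-9]+-([0-9a-f]{6})$")


def run_hex(run_domain: str) -> str:
    """Extract the 6-hex run id from the first label of the parent's per-run domain."""
    label = run_domain.split(".", 1)[0]
    match = _LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"not a per-run domain label: {label!r}")
    return match.group(1)


def realm_for(parent_recipe: str, run_domain: str) -> str:
    """Realm name that is stable within a run and distinct across concurrent runs."""
    return f"{parent_recipe}-{run_hex(run_domain)}"
```

Deriving the hex from the domain (rather than generating a fresh one) means setup and teardown
compute the same realm name without passing extra state around.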

Stable-domain scheme decision: `warm-<recipe>.ci.commoninternet.net` (here `warm-keycloak...`),
clearly distinct from cold `<recipe[:4]>-<6hex>`. Risk: longer stack name → swarm 64-char
config/secret limit; will verify on first deploy and shorten if it overflows.
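
A rough pre-check for the 64-char risk, as a sketch. ASSUMPTIONS (not verified against abra): the
stack name is the domain with dots replaced by underscores, and a config/secret is named
`<stack>_<object>_<version>`. All function names here are hypothetical.

```python
SWARM_NAME_LIMIT = 64  # the swarm config/secret name limit noted above


def stack_name(domain: str) -> str:
    # Assumption: stack name = domain with "." replaced by "_".
    return domain.replace(".", "_")


def full_object_name(domain: str, object_name: str, version: str = "v1") -> str:
    # Assumption: objects are named "<stack>_<object>_<version>".
    return f"{stack_name(domain)}_{object_name}_{version}"


def fits(domain: str, object_name: str, version: str = "v1") -> bool:
    """True if the full config/secret name stays within the swarm limit."""
    return len(full_object_name(domain, object_name, version)) <= SWARM_NAME_LIMIT
```

If the naming assumption holds, this can be run over a recipe's secret and config names before the
first deploy; the real check is still the deploy itself.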

Building W0 in increments (each verified): (1) sso realm lifecycle prims + units; (2) deploy warm
keycloak manually at the stable domain and prove realm create→delete via admin API; (3) wire the
orchestrator live-warm mode; (4) declarative Nix reconciler; (5) e2e + concurrency + reaping proof.

## 2026-05-29 — W0 core mechanism PROVEN; declarative reconciler up; design update absorbed

**Stale Phase-2 run killed.** Found an orphaned `run_recipe_ci.py` (RECIPE=lasuite-drive, the Q3.2
`ccci-q32-drive-sso2.log` run) still alive from before the phase switch (PPID 1, nohup). It had
deployed lasu-0a6fb2 + tried a cold keyc-07d81e dep — both of which I'd already torn down, so it was
failing. Killed its process tree + janitored. Only infra + warm-keycloak remain.

**W0.1 realm lifecycle (sso.py)** — list_realms / delete_keycloak_realm (idempotent, refuses master)
/ realms_to_reap (pure predicate) / reap_orphaned_realms. +8 unit tests. The per-run realm is the
isolation unit on a shared keycloak; orphans reaped by hex-not-in-live-stacks (concurrency-safe).
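
The reap predicate, sketched (an illustration of the rule, not the sso.py implementation). It
assumes per-run realms end in `-<6hex>` and that anything not matching that shape is left alone.

```python
import re
from typing import Iterable, List, Set

_PER_RUN_REALM_RE = re.compile(r"^.+-([0-9a-f]{6})$")


def realms_to_reap(realms: Iterable[str], live_hexes: Set[str]) -> List[str]:
    """Pure predicate: which realms are orphans given the hexes of live stacks."""
    doomed = []
    for realm in realms:
        if realm == "master":
            continue  # never touch the admin realm
        match = _PER_RUN_REALM_RE.match(realm)
        if match is None:
            continue  # not a per-run realm, so not ours to reap
        if match.group(1) not in live_hexes:
            doomed.append(realm)
    return doomed
```

Keeping this free of I/O is what makes it unit-testable; the caller supplies the realm list and the
live hexes and performs the deletes.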

**W0.2 orchestrator live-warm mode** — warm.py (stable-domain scheme, is_warm_up probe,
live_app_hexes, realm_for=<parent>-<6hex>, reap_orphan_realms). run_recipe_ci splits declared deps
into warm (shared provider + per-run realm, no deploy, realm deleted at teardown) vs cold
(co-deploy), warm only if provider up else cold fallback; deploy-count excludes warm deps; reaps
orphans at run start. Dependent tests now assert the namespaced realm pattern (stronger than ==parent).
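
The warm/cold split, sketched under assumptions: `is_warm_up` is passed in as a probe callable, and
`split_deps` / `deploy_count` are hypothetical names, not the run_recipe_ci functions.

```python
from typing import Callable, Iterable, List, Tuple


def warm_domain(recipe: str, base: str = "ci.commoninternet.net") -> str:
    """Stable domain for a warm provider: warm-<recipe>.<base>."""
    return f"warm-{recipe}.{base}"


def split_deps(
    declared: Iterable[str], is_warm_up: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Warm only if the shared provider answers; otherwise fall back to cold co-deploy."""
    warm, cold = [], []
    for recipe in declared:
        (warm if is_warm_up(warm_domain(recipe)) else cold).append(recipe)
    return warm, cold


def deploy_count(cold: List[str]) -> int:
    """The parent plus cold deps; warm deps are not deployed, so they are not counted."""
    return 1 + len(cold)
```

With no warm provider up, every dep lands in `cold`, which is the from-scratch behaviour.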

**WC1 CORE MECHANISM PROVEN** (deploy-free, live warm keycloak): realm create → password-grant JWT
→ discovery issuer → delete(idempotent) → reap(keeps live hex, deletes orphan): ALL PASS.

**W0.3 declarative reconciler** (nix/modules/warm-keycloak.nix) — systemd oneshot, converges warm
keycloak. Two bugs found+fixed against the real system:
1. `abra app deploy` non-chaos FATALs "already deployed" → need `-f` (tested: redeploys at ENV
   VERSION, exit 0).
2. **Newline bite** (the backupbot.nix bite): keycloak's .env.sample ends with a newline-less
   `#COMPOSE_FILE=` comment, so bash `set_env`'s printf glued `DOMAIN=` onto that comment →
   DOMAIN unset → `KC_HOSTNAME=https://` (empty host) → keycloak crash-loop ("Expected authority at
   index 8: https://"). Fixed set_env to ensure a trailing newline before append (same as backupbot).
Also made converge **skip the redeploy when already 200** (no JVM-restart blip on every rebuild;
only (re)deploys when down/crash-looping). Verified: nixos-rebuild switch → warm-keycloak.service
active "no-op converge", system running (0 failed), /realms/master=200.
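
The newline fix, rendered in Python for illustration (the real `set_env` is bash inside the Nix
module; this sketch only appends and does not attempt key replacement).

```python
from pathlib import Path


def set_env(env_path: Path, key: str, value: str) -> None:
    """Append KEY=VALUE on its own line, even if the file lacks a trailing newline."""
    text = env_path.read_text() if env_path.exists() else ""
    if text and not text.endswith("\n"):
        # Without this, "DOMAIN=..." is glued onto the last line (e.g. a comment).
        text += "\n"
    text += f"{key}={value}\n"
    env_path.write_text(text)
```

The guard matters exactly when the sample file ends in a comment with no newline, which is the case
that left DOMAIN unset.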

**W0.4 e2e (lasuite-docs vs warm keycloak)** — the WARM MECHANISM worked: deploy-count=1 (keycloak
NOT co-deployed), per-run realm `lasuite-docs-9c1995` created + **deleted on the warm keycloak** at
teardown, install pass. BUT `setup_custom_tests.sh exited 1` → 3 requires_deps SSO tests SKIPPED →
F2-11 correctly FAILED the run (not green). Root cause = a **lasuite-docs recipe race**, NOT warm
keycloak: the in-place `abra app deploy --force --chaos` (OIDC wiring) rolls all services; nginx
`web` fatally exits on `[emerg] host not found in upstream ...backend:8000` while backend is
mid-restart, and abra's converge check times out → "deploy failed 🛑". This is independent of
warm/cold keycloak (Q2.4 cold-keycloak lasuite-docs passed before; warm should REDUCE contention).
Filed as a finding to investigate (flaky/timing/resource vs deterministic regression); the headline
WC1 "dependent SSO tests green against warm keycloak" needs this resolved or a more-robust dependent.

**DESIGN UPDATE absorbed (orchestrator + Adversary REVIEW-2w, 2026-05-28→29).** Warm/infra apps
(traefik + keycloak) now AUTO-UPDATE to LATEST nightly with HEALTH-GATED ROLLBACK:
- **WC1 revised:** UNPIN keycloak (match traefik: `abra recipe fetch` latest + chaos deploy; DROP
  kcVersion). Keep secret-generate-only-if-missing + health-wait. D8 preserved (recipe fetched at
  runtime → nix closure byte-identical).
- **WC1.1 NEW:** health-gated deploy-with-rollback IN the reconcilers. record last-good → deploy
  latest → health-check → healthy: commit last-good:=latest; unhealthy: rollback + PushNotification.
  Stateful (keycloak): undeploy → raw snapshot data volume → deploy latest → on fail restore snapshot
  + redeploy prior version (forward DB migrations make version-only rollback unsafe). traefik
  (stateless) = version rollback only. Reuse WC3 snapshot helper.
- **WC1.2 NEW:** pre-deploy safety gate — auto-apply only non-major/no-manual-migration bumps; a
  MAJOR bump or manual-migration release notes → stay on current + alert (don't auto-apply).
- **WC6 reordered:** nightly = nixos-rebuild switch FIRST (warm/infra→latest, health-gated) THEN
  full-cold sweep; never while a test is in flight.
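
The WC1.1 stateful flow as a sketch, with every side effect injected as a callable so the control
flow can be read (and tested) on its own. Names and signatures are hypothetical; a deploy that
raises is treated the same as an unhealthy one.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    last_good: str
    rolled_back: bool
    recovered: Optional[bool] = None


def reconcile_stateful(
    last_good: str,
    latest: str,
    *,
    undeploy: Callable[[], None],
    snapshot: Callable[[], None],
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    restore: Callable[[], None],
    alert: Callable[..., None],
) -> Outcome:
    if latest == last_good:
        return Outcome(last_good, rolled_back=False)  # nothing to do
    undeploy()
    snapshot()  # raw data-volume snapshot taken while undeployed
    try:
        deploy(latest)
        ok = healthy()
    except Exception:
        ok = False  # a failed deploy must roll back too
    if ok:
        return Outcome(latest, rolled_back=False)  # commit last-good := latest
    # Version-only rollback is unsafe after forward DB migrations: restore data first.
    undeploy()
    restore()
    deploy(last_good)
    recovered = healthy()
    alert(attempted=latest, last_good=last_good, recovered=recovered)
    return Outcome(last_good, rolled_back=True, recovered=recovered)
```

A stateless app (traefik) would drop the `snapshot`/`restore` steps and keep only the version
rollback.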

**Re-sequencing consequence:** WC1.1 depends on the **WC3 snapshot/restore helper**, so I build that
FIRST (foundational), then rewrite the reconciler ONCE into the full unpinned + health-gated +
safety-gated + rollback form (avoids reworking the reconciler twice). Current reconciler (pinned,
skip-if-healthy) is INTERIM — keeps keycloak live-warm/healthy meanwhile; will be replaced. Also need
to settle the **alert mechanism**: a bash systemd reconciler can't call the agent's PushNotification
tool directly — decision needed (alert sentinel file the Builder loop reads + relays, or a webhook).

## 2026-05-29 — W0.5 WC3 snapshot helper proven; disk reclaim (WC8 hygiene)

W0.5 warmsnap.py landed + LIVE round-trip proven on warm keycloak (see STATUS-2w). Then settled the
W0.6 reconciler approach (python entrypoint in nix store; deploy-by-tag; recipe-semver = pre-`+`
component) in DECISIONS.
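
How the pre-`+` recipe-semver can feed the WC1.2 gate, as a sketch. How manual-migration release
notes are detected is not shown here; `manual_migration` is a hypothetical boolean input, and the
function names are illustrative.

```python
from typing import Tuple


def recipe_semver(tag: str) -> Tuple[int, int, int]:
    """Recipe version = the component before '+', e.g. '10.7.1+26.6.2' -> (10, 7, 1)."""
    recipe_part = tag.split("+", 1)[0]
    major, minor, patch = (int(p) for p in recipe_part.split("."))
    return major, minor, patch


def auto_apply_ok(current: str, candidate: str, manual_migration: bool = False) -> bool:
    """Auto-apply only forward, non-major bumps with no manual migration flagged."""
    cur, cand = recipe_semver(current), recipe_semver(candidate)
    if cand <= cur:
        return False  # not an upgrade
    if cand[0] != cur[0]:
        return False  # MAJOR bump: stay on current and alert
    return not manual_migration
```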

**Disk reclaim.** After 3 nixos-rebuild switches + 3 keycloak deploy cycles (WC3 proof) + a 159M
keycloak snapshot, `/` hit 96% (1.2G free) — a WC8 red flag before continuing. Reclaimed safely
(reversibility is via the git-declared config, not old generations): `rm -rf /root/cc-ci.prev`;
`nix-collect-garbage -d` (2553 paths, 3.38G); `docker image prune -f` dangling-only (3.32G, KEEPS the
tagged pull-cache); pruned old abra deploy logs (keep last 5). Result: **62% (10G free)**. This
GC+dangling-prune is the disk-management mechanism WC8 must formalize (run it in the nightly/W4, and
keep one last-good snapshot per app bounded). NOTE for WC8: the WC3 keycloak snapshot is 159M; a
warm-set of ~6 canonicals × (volume + 1 snapshot) is the disk budget to size.

**State at checkpoint:** warm keycloak healthy (200), only infra+warm stacks, system running (0
failed), disk 62%. W0.1-W0.5 done+proven+pushed (HEAD 67240dc). Next unit: W0.6 reconciler rewrite
(unpin + WC1.2 safety gate + WC1.1 health-gated rollback), then W0.7/W0.8 (lasuite-docs race +
headline WC1 e2e).

## 2026-05-29 — W0.9 WC1.1 live proofs PASS (healthy upgrade + marquee rollback)

Built `runner/warm_reconcile.py`'s health-gated rollback and proved it live against the warm keycloak
using annotated fake tags + `CCCI_SKIP_FETCH=1`. The proof iterations surfaced 4 real issues, each
fixed against the real system (verify-don't-assume):

1. **deploy-failure must roll back too** — a broken "latest" can fail abra's *lint/converge*
   (deploy_version raises) rather than deploy-then-be-unhealthy; wrapped the upgrade deploy so BOTH
   raise and unhealthy paths trigger the snapshot-restore rollback (else the unit just crashes).
2. **warmsnap clobbered last_good** — snapshot's atomic swap renamed the whole `<recipe>/` dir,
   wiping the sibling `last_good` file. Fixed: snapshot lives in `<recipe>/snapshot/`; only that
   subdir is swapped; `last_good` (sibling) survives.
3. **swarm settle race** — abra undeploy returns before swarm finishes removing tasks, so an
   immediate snapshot/restore/redeploy of the same stack raced a half-removed stack. Added
   `wait_undeployed()` after every undeploy.
4. **abra writes FATA to stdout** — deploy_version only surfaced stderr (empty); now includes stdout.
   This is how I diagnosed the two test-artifact failures: the broken deploy failed abra **lint R009**
   (bad env not a string — a valid "broken latest"), and the first rollback attempts failed abra
   **lint R014 "only annotated tags used for recipe version"** because my fake tags were *lightweight*
   (production tags are annotated) — a TEST artifact, not a reconciler bug. Fixed the test to create
   annotated tags (peel `^{}` to avoid nested-tag; set git identity).
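
The fix for issue 2, sketched: swap only the `snapshot/` subdir so the sibling `last_good` file is
never renamed away. The `snapshot.old` staging name and the function name are assumptions, and the
two-rename swap is near-atomic rather than strictly atomic.

```python
import os
import shutil
from pathlib import Path


def swap_in_snapshot(recipe_dir: Path, staged: Path) -> None:
    """Replace <recipe_dir>/snapshot with `staged`; siblings such as last_good are untouched.

    `staged` must live on the same filesystem (e.g. <recipe_dir>/snapshot.new) so that
    os.rename is a plain rename and not a copy.
    """
    live = recipe_dir / "snapshot"
    old = recipe_dir / "snapshot.old"
    if old.exists():
        shutil.rmtree(old)  # leftover from an interrupted earlier swap
    if live.exists():
        os.rename(live, old)
    os.rename(staged, live)
    if old.exists():
        shutil.rmtree(old)
```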

**Final PROOF (ALL PASS):**
- (a) healthy upgrade 10.7.1→10.7.9: snapshot taken (subdir), deploy, health-pass, last_good
  committed=10.7.9, marker realm preserved through the undeploy/snapshot/redeploy.
- (b) marquee rollback: broken latest 10.7.10 → deploy fails → rollback to 10.7.9 → HEALTHY; marker
  realm INTACT (data preserved through broken-upgrade + snapshot-restore); last_good NOT advanced;
  rollback alert sentinel written (attempted=10.7.10, last_good=10.7.9, recovered=True). keycloak
  recovered to canonical 10.7.1+26.6.2 healthy, no fake tags left.

This satisfies the WC1.1 Adversary mandate (broken latest → self-revert + data intact + alert;
healthy update commits last-good). WC1.2 holds were proven in W0.6. **The reconciler-side WC1/WC1.1/
WC1.2 are proven; the alert RELAY (Builder loop scans /var/lib/ci-warm/alerts/ → PushNotification +
archive to seen/) is still to wire (flagged for when nightly WC6 lands / a real alert can occur).**

Remaining for the WC1 gate: W0.7 (lasuite-docs in-place chaos-redeploy nginx race) + W0.8 (headline
dependent-SSO-green e2e vs warm keycloak + concurrent distinct realms + reaping).

## 2026-05-29 — Fixed daily-failing docker-prune (WC8 landmine)

While checking state I found the system `degraded`: `docker-prune.service` had been FAILING every day
(May 27/28/29) with `The "until" filter is not supported with "--volumes"`. Root cause: the swarm.nix
autoPrune flags `[--all --volumes --filter until=24h]`; docker rejects `--volumes` + `--filter until`, so the
daily prune never ran (a cause of disk creeping to 96%). Worse: `--volumes` prunes any volume with no
running container → it would DELETE Phase-2w DATA-WARM canonical volumes (undeployed by design) the
moment it started working. Fixed: dropped `--volumes` (prune images/containers/networks/build-cache
≤24h only). Warm volumes survive and are pruned deliberately by the warm reconcilers (WC8). Verified:
rebuild → docker-prune.service runs clean, system `running` (0 failed), keycloak 200. Note for WC8:
the warm-volume/snapshot prune policy + nix-generation GC should be folded into the maintenance
story.

## 2026-05-29 — W0.7/W0.8 headline WC1 e2e GREEN; concurrency+reaping proven → claiming WC1/WC1.1/WC1.2

The W0.4 lasuite-docs failure was TRANSIENT (resource contention from the since-killed stale Phase-2
run; disk was tight). Re-ran on the clean system (disk 36% after the prune fix):
`RECIPE=lasuite-docs STAGES=install,custom` → **install: pass, custom: pass** — all 3 SSO tests green
vs the WARM keycloak: test_health_check (200), **test_oidc_login_via_keycloak** (full app OIDC flow),
**test_oidc_password_grant_against_dep_keycloak** (per-run realm JWT). **deploy-count=1** (keycloak
NOT co-deployed — warm path); per-run realm `lasuite-docs-4c0858` created + DELETED at teardown; no
lasu stack left; warm keycloak realm list back to just `master`. So W0.7 needs no recipe fix — the
in-place chaos-redeploy converges fine with adequate resources.

Concurrency+reaping (deploy-free, live warm keycloak): realm_for gives DISTINCT realms for two
concurrent same-recipe runs (`lasuite-docs-aaa111` vs `-bbb222`) + a different recipe
(`cryptpad-ccc333`); all 3 created, each grants its own JWT independently (no collision);
reap_orphaned_realms with live_hexes={aaa111} deleted exactly the two orphans and KEPT the live one.

All WC1 sub-claims now proven: (warm dep, no co-deploy, per-run realm create+delete) + (concurrent
distinct realms) + (orphan reaping); plus WC1.1 (W0.9 marquee rollback) + WC1.2 (W0.6 holds). Warm
keycloak healthy on 10.7.1+26.6.2, last_good=10.7.1+26.6.2, no alerts, system running (0 failed).
Claiming the WC1/WC1.1/WC1.2 gate.

Note: the reconciler WRITES alert sentinels to /var/lib/ci-warm/alerts/ (proven for rollback +
holds). The Builder-loop RELAY (sentinel → PushNotification + archive to seen/) runs each wake when an
alert is present; none currently. This delivery layer is loop behavior, not reconciler logic.
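
A sketch of the relay step. ASSUMPTIONS: each sentinel is a JSON file directly under the alerts
dir, and `notify` stands in for whatever delivers the PushNotification; neither the file format nor
these names are confirmed by the reconciler code.

```python
import json
import shutil
from pathlib import Path
from typing import Callable, List


def relay_alerts(alerts_dir: Path, notify: Callable[[dict], None]) -> List[str]:
    """Deliver each pending sentinel, then archive it under seen/."""
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = []
    for sentinel in sorted(p for p in alerts_dir.iterdir() if p.is_file()):
        payload = json.loads(sentinel.read_text())
        notify(payload)  # notify first: if this raises, the sentinel stays for a retry
        shutil.move(str(sentinel), str(seen / sentinel.name))
        relayed.append(sentinel.name)
    return relayed
```

Archiving only after a successful notify means a crashed wake re-delivers rather than drops an
alert.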

## 2026-05-29 — Gate WC1+WC1.2+WC1.1(keycloak) ADVERSARY PASS; advancing to W1

The Adversary cold-verified all 6 checks from its OWN clone (`cc-ci:/root/cc-ci-adv-verify`):
check1 unpinned/healthy/wired, check2 57 units, check3 headline lasuite-docs SSO e2e (install+custom
pass, deploy-count=1, per-run realm created+deleted, warm kc left `['master']`, cold teardown sacred),
check4 concurrency+reaping, check5 WC1.1 marquee rollback (data intact, last_good held, alert), check6
WC1.2 holds. **Gate verdict: PASS @2026-05-29** (REVIEW-2w 31ac86d) for exactly the claimed scope.
The Adversary independently hit + correctly attributed the same test-script cleanup footgun to the
test, not the reconciler. ONE tracked-open before DONE (no finding): traefik WC1.1 (W0.10) — its
stateless version-rollback isn't yet on the shared reconciler.

**Advancing to W1 (WC2 canonical registry + WC3 closure).** Design intent: a small declarative
registry of canonical recipes → known-good commit, each at `warm-<recipe>` kept DATA-warm (undeployed
when idle, volume retained), re-warmable. warmsnap (W0.5) already provides one-last-good snapshot +
restore. Need to decide: registry format/location (in-repo declarative) + the data-warm lifecycle
(deploy→use→undeploy-keep-volume) + how a canonical is seeded/advanced (WC5 cold-only, later). W1
builds the registry + data-warm reconcile; WC5/WC6 (promote-on-green-cold + nightly) come in W3.

traefik W0.10 + alert-relay deferred to a quiet window before DONE (traefik is critical TLS infra).

## 2026-05-29 — W1.2 data-warm canonical PROVEN (WC2+WC3); claiming W1 gate

Enrolled custom-html (`recipe_meta.WARM_CANONICAL=True`) and ran the live data-warm proof
(/tmp/wc2_proof.py): deploy warm-custom-html @ 1.11.0+1.29.0 → write marker into the content volume →
undeploy → seed_canonical (registry + snapshot while undeployed) → confirm app UNDEPLOYED but volume
RETAINED → deploy_canonical reattach → **marker SURVIVED**. ALL PASS. custom-html is now the first
real data-warm canonical, left idle (undeployed, volume retained, registry status=idle). Disk 49%
(custom-html canonical 32K; keycloak snapshot 318M = the one-per-app DB snapshot, WC8 budget).

WC2 (registry + data-warm model) + WC3 (snapshot tied to canonical; restore proven in W0.5) are
proven. Claimed the WC2+WC3 gate for Adversary cold-verify. One canonical (custom-html) demonstrates
the model; the nightly sweep (WC6/W3) populates more over time — not re-warming all here (plan §4
bounded). Did NOT enroll a 2nd recipe yet (custom-html suffices for W2 --quick + the model proof).

Parked at the W1 gate. While awaiting: will do non-disruptive W0.10b (alert-relay) — NOT the traefik
W0.10a migration (it disrupts TLS the Adversary needs to verify the data-warm round-trip through).

## 2026-05-29 — W1 gate WC2+WC3 ADVERSARY PASS; advancing to W2 (--quick)

Adversary cold-verified WC2+WC3 from its own clone (REVIEW-2w 0246296): 61 units; its OWN data-warm
round-trip (deploy→write ADV marker→undeploy-keep-volume→redeploy→marker survived, Builder's known-good
also reattached); its OWN WC3 restore round-trip (mutate→restore→exact known-good content back,
mutation gone). Its 2 crashes were its own driver-script bugs, not product defects. Canonical left
clean. **WC2 + WC3 PASS @2026-05-29.** Same coordination lag as the W0 claim (its watchdog pinged on a
pre-claim read; resolved via ADVERSARY-INBOX). traefik WC1.1 (W0.10a) remains the sole tracked-open
before DONE.

**Advancing to W2 (--quick, WC4+WC7).** Design: a `--quick` opt-in path in run_recipe_ci.py that
consumes the canonical (reattach → upgrade-to-PR-head → assert → PASS keep-volume / FAIL
restore-snapshot, NEVER promote), tagging results mode=quick, with a clean no-canonical fallback to
cold. Will study the existing upgrade-tier chaos-to-PR-head (HC1) mechanism, then add the quick flow +
units + a live proof on the custom-html canonical (the deliberately-fail-restores-known-good case is
also the WC9 rollback-proof preview).
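
The intended quick flow as a sketch, with side effects injected; function names and the result
shape are hypothetical. Note there is no code path that advances the known-good.

```python
from typing import Callable


def choose_mode(quick_requested: bool, has_canonical: bool) -> str:
    """--quick is opt-in and needs a canonical; anything else is a clean cold run."""
    return "quick" if (quick_requested and has_canonical) else "cold"


def run_quick(
    known_good: str,
    pr_head: str,
    *,
    deploy: Callable[[str], None],
    assert_ok: Callable[[], bool],
    undeploy: Callable[[], None],
    restore_snapshot: Callable[[], None],
) -> dict:
    deploy(known_good)  # reattach the retained canonical volume
    try:
        deploy(pr_head)  # upgrade to the PR head
        passed = assert_ok()
    except Exception:
        passed = False  # a failed upgrade deploy counts as a FAIL run
    undeploy()  # canonical goes back to idle either way
    if not passed:
        restore_snapshot()  # FAIL: put the last-known-good data back
    # NEVER promote: known_good is returned unchanged in both branches.
    return {"mode": "quick", "passed": passed, "known_good": known_good}
```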

## 2026-05-29 — W2 (--quick, WC4+WC7) built + proven live; claiming gate

WC4 run_quick in run_recipe_ci.py (dispatch on CCCI_QUICK=1/MODE=quick when a canonical exists, else
clean cold fallback). Live PASS+FAIL proof on the custom-html canonical (ALL PASS): PASS run
(upgrade→different-healthy-head) leaves known-good UNCHANGED + idle + volume/data intact; FAIL run
(broken-image head) rolls back — undeploy→restore last-known-good→idle, known-good UNCHANGED, data
intact. 3 bugs found+fixed by the live proof, including: a missing `import time` that crashed the
rollback; and a stale .env TYPE from a prior --quick upgrade pointing at a removed PR commit, which
FATAL'd abra (deploy_canonical + rollback now reset TYPE to the known-good).

WC7 trigger surface: bridge `parse_trigger` accepts `!testme` (cold) / `!testme --quick` (opt-in),
rejects `!testmexyz` etc.; threads CCCI_QUICK=1 through trigger_build (auto-exposed Drone param);
quick PR comment labelled lower-confidence; default !testme unchanged; never gates merge.
Deployed via nixos-rebuild (content-tagged bridge image rolled) + LIVE-verified in the running
container (parse_trigger correct, healthz 200). 64 units pass.
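
A sketch of the trigger grammar described above. ASSUMPTIONS: the trigger must be the whole
comment, and the return shape (a dict, or `None` for non-triggers) is hypothetical; the bridge's
real `parse_trigger` may differ on both.

```python
import re
from typing import Optional

# "!testme" alone, or "!testme --quick"; "!testmexyz" must not match.
_TRIGGER_RE = re.compile(r"^!testme(?:\s+(--quick))?\s*$")


def parse_trigger(comment: str) -> Optional[dict]:
    """Return {'quick': bool} for a valid trigger comment, else None."""
    match = _TRIGGER_RE.match(comment.strip())
    if match is None:
        return None
    return {"quick": match.group(1) is not None}


def build_params(trigger: dict) -> dict:
    """Build parameters for the CI run; the default (cold) run sets nothing extra."""
    return {"CCCI_QUICK": "1"} if trigger["quick"] else {}
```

Anchoring the pattern at both ends is what rejects `!testmexyz` and unknown flags instead of
silently treating them as a cold run.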

Handoff-signalling note (orchestrator): the watchdog now pings off COMMIT PREFIXES on origin/main
(`claim(...)` pings Adversary; `review(...)` pings Builder) rather than off prose; prose-based
signalling is what caused the earlier premature "no formal gate" dances. I already use `claim(2w):`
for gate claims + push promptly; keep doing so. Claiming WC4+WC7 now with that prefix.

System clean post-rebuild: keycloak 200, custom-html canonical idle@1.11.0+1.29.0, 0 failed units,
disk 50%. Parked at the W2 gate; next quiet-window work = W0.10a traefik WC1.1 migration.