cc-ci/machine-docs/JOURNAL-2w.md
autonomic-bot 3ff2bf6c48 claim(2w): Gate WC4+WC7 CLAIMED — --quick fast lane proven live (PASS keeps known-good, FAIL restores) + bridge !testme --quick deployed
WC4 run_quick: reattach canonical → upgrade-to-PR-head → assert → PASS
undeploy-keep-volume (known-good UNCHANGED, never promote) / FAIL restore
last-known-good snapshot + undeploy. Live PASS+FAIL proof on custom-html: ALL
PASS (canonical left clean idle@1.11.0+1.29.0). WC7: bridge parse_trigger
(!testme / !testme --quick / reject !testmexyz) → CCCI_QUICK param, deployed +
live-verified; default !testme stays cold; never gates merge; mode-labeled;
no-canonical fallback to cold. 64 unit pass. Full HOW/EXPECTED/WHERE in STATUS-2w.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 03:17:29 +01:00


# JOURNAL — Phase 2w (warm canonical + `--quick`) — Builder
Append-only reasoning log (WHY). Facts/verification go in STATUS-2w; verdicts in REVIEW-2w.
## 2026-05-28 — Phase 2w bootstrap + cleanup + W0 design
**Orientation.** Operator interjected Phase 2w into Phase 2 (Phase 2 paused, state preserved).
Read the 2w plan + plan.md §6.1/§7/§9. Adversary already online (REVIEW-2w `@2026-05-28 start`),
idle awaiting a WC gate claim. Seeded STATUS-2w/BACKLOG-2w/JOURNAL-2w.
**In-flight Phase 2 work committed.** Working tree had an uncommitted edit to
`tests/lasuite-drive/setup_custom_tests.sh` (Q3.2 MinIO bucket creation via the createbuckets
one-shot) — the continuation of commit 6557197. Committed it (66e065d) with an honest message: not
yet live-verified (needs a lasuite-drive deploy once warm keycloak exists). This preserves Phase 2
progress at the pause point; it resumes after 2w DONE.
**Cleanup (orchestrator-requested).** cc-ci `/` was at 91% (only 2.4G free) — a real WC8 concern
before adding warm volumes/snapshots. Tore down the leftover COLD per-run apps from paused Phase 2
via `lifecycle.teardown_app(..., verify=True)`: `lasu-0a6fb2` (12-service lasuite-drive, heaviest),
`keyc-07d81e` (cold keycloak), `lasu-dbg` (debug lasuite). All TEARDOWN OK, no residual. Disk →
86% (3.8G free). Only infra stacks remain (backups, bridge, dashboard, drone, traefik). Did NOT
`docker image prune` — 9.7GB reclaimable but the image cache is the warm pull-cache; with authed
Docker Hub pulls now wired, a re-pull is billed to the account (cheaper) but still slow, so keep the
cache. Disk is the Phase-2w budget (WC8) — monitor.
**W0 design (WC1 — live-warm keycloak).** The existing SSO harness is already most of the way there:
- `sso.setup_keycloak_realm(provider_domain, realm, client_id, ...)` creates a realm+client+user
**idempotently via the admin API**, and `_kc_admin_password` reads the admin password from inside
the running container (`docker exec ... cat /run/secrets/admin_password`). So it works against ANY
running keycloak — cold or warm — with no external password handling.
- The orchestrator dep flow (`run_recipe_ci.py`): `declared_deps` → `deploy_deps` (fresh co-deploy
per run) → `_enrich_deps_with_sso` (creates realm, realm name currently = `parent_recipe`) →
`setup_custom_tests.sh` hook → teardown_deps (undeploy).
What WC1 changes:
1. The **realm becomes the per-run isolation unit** on a shared live-warm keycloak. Realm name must
be unique per (parent, pr, ref) so concurrent dependents don't collide — change from
`realm=parent_recipe` to `realm=<parent>-<6hex>` (derive the hex from the parent's per-run domain
label so it's stable within a run and distinct across concurrent runs).
2. The keycloak dep is **not co-deployed**: point at the stable warm domain; on teardown **delete the
realm** (not undeploy keycloak). Fall back to cold co-deploy if no warm keycloak is present (so a
from-scratch / no-warm environment still works — the warm keycloak is an optimization layer).
3. The warm keycloak itself is **declarative infra** (Nix reconciler, like traefik) — NOT warm
*data* (so it IS in the D8 closure as a reconciler; its realm data is ephemeral per-run anyway).
Re-warmable from scratch.
Stable-domain scheme decision: `warm-<recipe>.ci.commoninternet.net` (here `warm-keycloak...`),
clearly distinct from cold `<recipe[:4]>-<6hex>`. Risk: longer stack name → swarm 64-char
config/secret limit; will verify on first deploy and shorten if it overflows.
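
The two naming schemes above can be sketched as follows. This is a minimal illustration, not the code in `warm.py`: the function signatures, the `run_hex` helper, and the regex are assumptions; only the `warm-<recipe>.ci.commoninternet.net` and `<parent>-<6hex>` shapes come from this journal.

```python
import re

# Base domain as it appears in this journal.
WARM_BASE = "ci.commoninternet.net"


def warm_domain(recipe: str) -> str:
    """Stable domain for a warm app, distinct from cold '<recipe[:4]>-<6hex>' labels."""
    return f"warm-{recipe}.{WARM_BASE}"


def run_hex(parent_domain: str) -> str:
    """Hypothetical helper: pull the 6-hex run id out of a cold per-run domain label."""
    label = parent_domain.split(".", 1)[0]
    m = re.fullmatch(r".+-([0-9a-f]{6})", label)
    if m is None:
        raise ValueError(f"not a per-run label: {label!r}")
    return m.group(1)


def realm_for(parent_recipe: str, parent_domain: str) -> str:
    """Per-run realm name: stable within a run, distinct across concurrent runs."""
    return f"{parent_recipe}-{run_hex(parent_domain)}"
```

Deriving the hex from the parent's domain label means no extra state is needed to find the realm again at teardown.
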
Building W0 in increments (each verified): (1) sso realm lifecycle prims + units; (2) deploy warm
keycloak manually at the stable domain and prove realm create→delete via admin API; (3) wire the
orchestrator live-warm mode; (4) declarative Nix reconciler; (5) e2e + concurrency + reaping proof.
## 2026-05-29 — W0 core mechanism PROVEN; declarative reconciler up; design update absorbed
**Stale Phase-2 run killed.** Found an orphaned `run_recipe_ci.py` (RECIPE=lasuite-drive, the Q3.2
`ccci-q32-drive-sso2.log` run) still alive from before the phase switch (PPID 1, nohup). It had
deployed lasu-0a6fb2 + tried a cold keyc-07d81e dep — both of which I'd already torn down, so it was
failing. Killed its process tree + janitored. Only infra + warm-keycloak remain.
**W0.1 realm lifecycle (sso.py)** — list_realms / delete_keycloak_realm (idempotent, refuses master)
/ realms_to_reap (pure predicate) / reap_orphaned_realms. +8 unit tests. The per-run realm is the
isolation unit on a shared keycloak; orphans reaped by hex-not-in-live-stacks (concurrency-safe).
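
A minimal sketch of the pure reap predicate described above (hex-not-in-live-stacks, never `master`). The signature and the regex are assumptions for illustration; the real `realms_to_reap` in `sso.py` may differ.

```python
import re

# Per-run realms are assumed to look like '<parent>-<6hex>'.
_RUN_REALM = re.compile(r"^.+-(?P<hex>[0-9a-f]{6})$")


def realms_to_reap(realms, live_hexes):
    """Return the per-run realms whose hex matches no live stack.

    Pure function: no API calls, so it can be unit-tested directly.
    'master' and any realm that is not in per-run form are never returned.
    """
    doomed = []
    for name in realms:
        if name == "master":
            continue
        m = _RUN_REALM.match(name)
        if m is None:
            continue  # not a per-run realm; leave it alone
        if m.group("hex") not in live_hexes:
            doomed.append(name)
    return doomed
```

Because the decision depends only on the set of live hexes, a run that is still in flight keeps its realm even when another run does the reaping.
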
**W0.2 orchestrator live-warm mode** — warm.py (stable-domain scheme, is_warm_up probe,
live_app_hexes, realm_for=<parent>-<6hex>, reap_orphan_realms). run_recipe_ci splits declared deps
into warm (shared provider + per-run realm, no deploy, realm deleted at teardown) vs cold
(co-deploy), warm only if provider up else cold fallback; deploy-count excludes warm deps; reaps
orphans at run start. Dependent tests now assert the namespaced realm pattern (stronger than ==parent).
**WC1 CORE MECHANISM PROVEN** (deploy-free, live warm keycloak): realm create → password-grant JWT
→ discovery issuer → delete(idempotent) → reap(keeps live hex, deletes orphan): ALL PASS.
**W0.3 declarative reconciler** (nix/modules/warm-keycloak.nix) — systemd oneshot, converges warm
keycloak. Two bugs found+fixed against the real system:
1. `abra app deploy` non-chaos FATALs "already deployed" → need `-f` (tested: redeploys at ENV
VERSION, exit 0).
2. **Newline bite** (the backupbot.nix bite): keycloak's .env.sample ends with a newline-less
`#COMPOSE_FILE=` comment, so bash `set_env`'s printf glued `DOMAIN=` onto that comment →
DOMAIN unset → `KC_HOSTNAME=https://` (empty host) → keycloak crash-loop ("Expected authority at
index 8: https://"). Fixed set_env to ensure a trailing newline before append (same as backupbot).
Also made converge **skip the redeploy when already 200** (no JVM-restart blip on every rebuild;
only (re)deploys when down/crash-looping). Verified: nixos-rebuild switch → warm-keycloak.service
active "no-op converge", system running (0 failed), /realms/master=200.
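
The newline fix can be illustrated in Python (the real `set_env` is bash inside the Nix module; this is only a sketch of the same idea, with an assumed signature): make sure the file ends with a newline before appending, so the new key never gets glued onto a newline-less last line.

```python
from pathlib import Path


def set_env(env_path, key, value):
    """Append KEY=VALUE to an env file, guarding against a missing trailing newline."""
    path = Path(env_path)
    text = path.read_text() if path.exists() else ""
    # Without this guard, 'DOMAIN=...' would be appended to the end of a
    # newline-less '#COMPOSE_FILE=' comment and silently become part of it.
    if text and not text.endswith("\n"):
        text += "\n"
    text += f"{key}={value}\n"
    path.write_text(text)
```
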
**W0.4 e2e (lasuite-docs vs warm keycloak)** — the WARM MECHANISM worked: deploy-count=1 (keycloak
NOT co-deployed), per-run realm `lasuite-docs-9c1995` created + **deleted on the warm keycloak** at
teardown, install pass. BUT `setup_custom_tests.sh exited 1` → 3 requires_deps SSO tests SKIPPED →
F2-11 correctly FAILED the run (not green). Root cause = a **lasuite-docs recipe race**, NOT warm
keycloak: the in-place `abra app deploy --force --chaos` (OIDC wiring) rolls all services; nginx
`web` fatally exits on `[emerg] host not found in upstream ...backend:8000` while backend is
mid-restart, and abra's converge check times out → "deploy failed 🛑". This is independent of
warm/cold keycloak (Q2.4 cold-keycloak lasuite-docs passed before; warm should REDUCE contention).
Filed as a finding to investigate (flaky/timing/resource vs deterministic regression); the headline
WC1 "dependent SSO tests green against warm keycloak" needs this resolved or a more-robust dependent.
**DESIGN UPDATE absorbed (orchestrator + Adversary REVIEW-2w, 2026-05-28→29).** Warm/infra apps
(traefik + keycloak) now AUTO-UPDATE to LATEST nightly with HEALTH-GATED ROLLBACK:
- **WC1 revised:** UNPIN keycloak (match traefik: `abra recipe fetch` latest + chaos deploy; DROP
kcVersion). Keep secret-generate-only-if-missing + health-wait. D8 preserved (recipe fetched at
runtime → nix closure byte-identical).
- **WC1.1 NEW:** health-gated deploy-with-rollback IN the reconcilers. record last-good → deploy
latest → health-check → healthy: commit last-good:=latest; unhealthy: rollback + PushNotification.
Stateful (keycloak): undeploy → raw snapshot data volume → deploy latest → on fail restore snapshot
+ redeploy prior version (forward DB migrations make version-only rollback unsafe). traefik
(stateless) = version rollback only. Reuse WC3 snapshot helper.
- **WC1.2 NEW:** pre-deploy safety gate — auto-apply only non-major/no-manual-migration bumps; a
MAJOR bump or manual-migration release notes → stay on current + alert (don't auto-apply).
- **WC6 reordered:** nightly = nixos-rebuild switch FIRST (warm/infra→latest, health-gated) THEN
full-cold sweep; never while a test is in flight.
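
The WC1.1 sequence (record last-good → deploy latest → health-check → commit or roll back) can be sketched with injected callables. Everything here is hypothetical scaffolding: the function name, the `Outcome` type, and the callback shapes are assumptions, and `snapshot`/`restore` are assumed to handle their own undeploy. The sketch also treats a deploy that raises the same as an unhealthy deploy, which is a design choice of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    version: str        # version left running
    rolled_back: bool   # True when the upgrade was reverted


def upgrade_with_rollback(
    last_good: str,
    latest: str,
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    alert: Callable[..., None],
    snapshot: Optional[Callable[[], None]] = None,  # stateful apps only
    restore: Optional[Callable[[], None]] = None,   # stateful apps only
) -> Outcome:
    if latest == last_good:
        return Outcome(last_good, False)  # nothing to do
    if snapshot is not None:
        snapshot()  # raw data snapshot before any forward migration runs
    try:
        deploy(latest)
        ok = healthy()
    except Exception:
        ok = False  # a failed deploy counts as unhealthy
    if ok:
        return Outcome(latest, False)  # caller commits last_good := latest
    if restore is not None:
        restore()  # put the pre-upgrade data back
    deploy(last_good)
    alert(attempted=latest, last_good=last_good, recovered=healthy())
    return Outcome(last_good, True)
```

A stateless app such as traefik would pass neither `snapshot` nor `restore`, which reduces this to version-only rollback.
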
**Re-sequencing consequence:** WC1.1 depends on the **WC3 snapshot/restore helper**, so I build that
FIRST (foundational), then rewrite the reconciler ONCE into the full unpinned + health-gated +
safety-gated + rollback form (avoids reworking the reconciler twice). Current reconciler (pinned,
skip-if-healthy) is INTERIM — keeps keycloak live-warm/healthy meanwhile; will be replaced. Also need
to settle the **alert mechanism**: a bash systemd reconciler can't call the agent's PushNotification
tool directly — decision needed (alert sentinel file the Builder loop reads + relays, or a webhook).
## 2026-05-29 — W0.5 WC3 snapshot helper proven; disk reclaim (WC8 hygiene)
W0.5 warmsnap.py landed + LIVE round-trip proven on warm keycloak (see STATUS-2w). Then settled the
W0.6 reconciler approach (python entrypoint in nix store; deploy-by-tag; recipe-semver = pre-`+`
component) in DECISIONS.
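
A small sketch of the recipe-semver rule and the WC1.2 gate it feeds. The function names and the `manual_migration` flag are hypothetical; how the real reconciler detects manual-migration release notes is not shown here.

```python
def recipe_semver(tag: str) -> tuple:
    """Recipe version is the component before '+', e.g. '10.7.1+26.6.2' -> (10, 7, 1)."""
    core = tag.split("+", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return (major, minor, patch)


def auto_apply_allowed(current: str, candidate: str, manual_migration: bool = False) -> bool:
    """Allow only newer, non-major bumps with no manual migration flagged."""
    if manual_migration:
        return False
    cur, new = recipe_semver(current), recipe_semver(candidate)
    return new > cur and new[0] == cur[0]
```

When this returns `False` the reconciler would stay on the current version and raise an alert instead of deploying.
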
**Disk reclaim.** After 3 nixos-rebuild switches + 3 keycloak deploy cycles (WC3 proof) + a 159M
keycloak snapshot, `/` hit 96% (1.2G free) — a WC8 red flag before continuing. Reclaimed safely
(reversibility is via the git-declared config, not old generations): `rm -rf /root/cc-ci.prev`;
`nix-collect-garbage -d` (2553 paths, 3.38G); `docker image prune -f` dangling-only (3.32G, KEEPS the
tagged pull-cache); pruned old abra deploy logs (keep last 5). Result: **62% (10G free)**. This
GC+dangling-prune is the disk-management mechanism WC8 must formalize (run it in the nightly/W4, and
keep one last-good snapshot per app bounded). NOTE for WC8: the WC3 keycloak snapshot is 159M; a
warm-set of ~6 canonicals × (volume + 1 snapshot) is the disk budget to size.
**State at checkpoint:** warm keycloak healthy (200), only infra+warm stacks, system running (0
failed), disk 62%. W0.1-W0.5 done+proven+pushed (HEAD 67240dc). Next unit: W0.6 reconciler rewrite
(unpin + WC1.2 safety gate + WC1.1 health-gated rollback), then W0.7/W0.8 (lasuite-docs race +
headline WC1 e2e).
## 2026-05-29 — W0.9 WC1.1 live proofs PASS (healthy upgrade + marquee rollback)
Built `runner/warm_reconcile.py`'s health-gated rollback and proved it live against the warm keycloak
using annotated fake tags + `CCCI_SKIP_FETCH=1`. The proof iterations surfaced 4 real issues, each
fixed against the real system (verify-don't-assume):
1. **deploy-failure must roll back too** — a broken "latest" can fail abra's *lint/converge*
(deploy_version raises) rather than deploy-then-be-unhealthy; wrapped the upgrade deploy so BOTH
raise and unhealthy paths trigger the snapshot-restore rollback (else the unit just crashes).
2. **warmsnap clobbered last_good** — snapshot's atomic swap renamed the whole `<recipe>/` dir,
wiping the sibling `last_good` file. Fixed: snapshot lives in `<recipe>/snapshot/`; only that
subdir is swapped; `last_good` (sibling) survives.
3. **swarm settle race** — abra undeploy returns before swarm finishes removing tasks, so an
immediate snapshot/restore/redeploy of the same stack raced a half-removed stack. Added
`wait_undeployed()` after every undeploy.
4. **abra writes FATA to stdout** — deploy_version only surfaced stderr (empty); now includes stdout.
This is how I diagnosed the two test-artifact failures: the broken deploy failed abra **lint R009**
(bad env not a string — a valid "broken latest"), and the first rollback attempts failed abra
**lint R014 "only annotated tags used for recipe version"** because my fake tags were *lightweight*
(production tags are annotated) — a TEST artifact, not a reconciler bug. Fixed the test to create
annotated tags (peel `^{}` to avoid nested-tag; set git identity).
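
Issue 2 above (the swap wiping `last_good`) comes down to which directory gets renamed. A minimal sketch of the fixed layout, assuming the new snapshot is staged inside `<recipe>/` on the same filesystem; the function name and the `snapshot.old` staging name are hypothetical, not taken from `warmsnap.py`.

```python
import os
import shutil
from pathlib import Path


def swap_in_snapshot(recipe_dir: Path, staged: Path) -> None:
    """Replace <recipe>/snapshot/ with a staged snapshot, leaving siblings alone.

    Only the 'snapshot' subdir is renamed, so '<recipe>/last_good' survives.
    """
    live = recipe_dir / "snapshot"
    old = recipe_dir / "snapshot.old"
    if old.exists():
        shutil.rmtree(old)      # leftover from an interrupted earlier swap
    if live.exists():
        os.rename(live, old)    # move the previous snapshot aside
    os.rename(staged, live)     # promote the new snapshot
    shutil.rmtree(old, ignore_errors=True)
```
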
**Final PROOF (ALL PASS):**
- (a) healthy upgrade 10.7.1→10.7.9: snapshot taken (subdir), deploy, health-pass, last_good
committed=10.7.9, marker realm preserved through the undeploy/snapshot/redeploy.
- (b) marquee rollback: broken latest 10.7.10 → deploy fails → rollback to 10.7.9 → HEALTHY; marker
realm INTACT (data preserved through broken-upgrade + snapshot-restore); last_good NOT advanced;
rollback alert sentinel written (attempted=10.7.10, last_good=10.7.9, recovered=True). keycloak
recovered to canonical 10.7.1+26.6.2 healthy, no fake tags left.
This satisfies the WC1.1 Adversary mandate (broken latest → self-revert + data intact + alert;
healthy update commits last-good). WC1.2 holds were proven in W0.6. **The reconciler-side WC1/WC1.1/
WC1.2 are proven; the alert RELAY (Builder loop scans /var/lib/ci-warm/alerts/ → PushNotification +
archive to seen/) is still to wire (flagged for when nightly WC6 lands / a real alert can occur).**
Remaining for the WC1 gate: W0.7 (lasuite-docs in-place chaos-redeploy nginx race) + W0.8 (headline
dependent-SSO-green e2e vs warm keycloak + concurrent distinct realms + reaping).
## 2026-05-29 — Fixed daily-failing docker-prune (WC8 landmine)
While checking state I found the system `degraded`: `docker-prune.service` had been FAILING every day
(May 27/28/29) with `The "until" filter is not supported with "--volumes"`. Root: swarm.nix autoPrune
flags `[--all --volumes --filter until=24h]` — docker rejects `--volumes` + `--filter until`, so the
daily prune never ran (a cause of disk creeping to 96%). Worse: `--volumes` prunes any volume with no
running container → it would DELETE Phase-2w DATA-WARM canonical volumes (undeployed by design) the
moment it started working. Fixed: dropped `--volumes` (prune images/containers/networks/build-cache
≤24h only). Warm volumes survive and are pruned deliberately by the warm reconcilers (WC8). Verified:
rebuild → docker-prune.service runs clean, system `running` (0 failed), keycloak 200. Note for WC8:
the warm-volume/snapshot prune policy + nix-generation GC should be folded into the maintenance
story.
## 2026-05-29 — W0.7/W0.8 headline WC1 e2e GREEN; concurrency+reaping proven → claiming WC1/WC1.1/WC1.2
The W0.4 lasuite-docs failure was TRANSIENT (resource contention from the since-killed stale Phase-2
run; disk was tight). Re-ran on the clean system (disk 36% after the prune fix):
`RECIPE=lasuite-docs STAGES=install,custom` → **install: pass, custom: pass**; all 3 SSO tests green
vs the WARM keycloak: test_health_check (200), **test_oidc_login_via_keycloak** (full app OIDC flow),
**test_oidc_password_grant_against_dep_keycloak** (per-run realm JWT). **deploy-count=1** (keycloak
NOT co-deployed — warm path); per-run realm `lasuite-docs-4c0858` created + DELETED at teardown; no
lasu stack left; warm keycloak realm list back to just `master`. So W0.7 needs no recipe fix — the
in-place chaos-redeploy converges fine with adequate resources.
Concurrency+reaping (deploy-free, live warm keycloak): realm_for gives DISTINCT realms for two
concurrent same-recipe runs (`lasuite-docs-aaa111` vs `-bbb222`) + a different recipe
(`cryptpad-ccc333`); all 3 created, each grants its own JWT independently (no collision);
reap_orphaned_realms with live_hexes={aaa111} deleted exactly the two orphans and KEPT the live one.
All WC1 sub-claims now proven: (warm dep, no co-deploy, per-run realm create+delete) + (concurrent
distinct realms) + (orphan reaping); plus WC1.1 (W0.9 marquee rollback) + WC1.2 (W0.6 holds). Warm
keycloak healthy on 10.7.1+26.6.2, last_good=10.7.1+26.6.2, no alerts, system running (0 failed).
Claiming the WC1/WC1.1/WC1.2 gate.
Note: the reconciler WRITES alert sentinels to /var/lib/ci-warm/alerts/ (proven for rollback +
holds). The Builder-loop RELAY (sentinel → PushNotification + archive to seen/) runs each wake when an
alert is present; none currently. This delivery layer is loop behavior, not reconciler logic.
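
The relay step (sentinel → notify → archive to `seen/`) could look roughly like this. The sentinel file format is not specified in this journal, so the sketch treats each file as opaque text; `relay_alerts` and the `notify` callback are hypothetical stand-ins for the loop's PushNotification call.

```python
import shutil
from pathlib import Path


def relay_alerts(alerts_dir, notify):
    """Send each pending sentinel through notify(name, body), then archive it.

    Returns the names relayed. Archived files live in <alerts_dir>/seen/ and are
    skipped on later passes, so each alert is delivered once.
    """
    alerts_dir = Path(alerts_dir)
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = []
    for path in sorted(alerts_dir.iterdir()):
        if not path.is_file():
            continue  # skips the seen/ directory itself
        notify(path.name, path.read_text())
        shutil.move(str(path), str(seen / path.name))
        relayed.append(path.name)
    return relayed
```

Archiving only after `notify` returns means a crash mid-relay re-sends an alert rather than dropping it.
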
## 2026-05-29 — Gate WC1+WC1.2+WC1.1(keycloak) ADVERSARY PASS; advancing to W1
The Adversary cold-verified all 6 checks from its OWN clone (`cc-ci:/root/cc-ci-adv-verify`):
check1 unpinned/healthy/wired, check2 57 units, check3 headline lasuite-docs SSO e2e (install+custom
pass, deploy-count=1, per-run realm created+deleted, warm kc left `['master']`, cold teardown sacred),
check4 concurrency+reaping, check5 WC1.1 marquee rollback (data intact, last_good held, alert), check6
WC1.2 holds. **Gate verdict: PASS @2026-05-29** (REVIEW-2w 31ac86d) for exactly the claimed scope.
The Adversary independently hit + correctly attributed the same test-script cleanup footgun to the
test, not the reconciler. ONE tracked-open before DONE (no finding): traefik WC1.1 (W0.10) — its
stateless version-rollback isn't yet on the shared reconciler.
**Advancing to W1 (WC2 canonical registry + WC3 closure).** Design intent: a small declarative
registry of canonical recipes → known-good commit, each at `warm-<recipe>` kept DATA-warm (undeployed
when idle, volume retained), re-warmable. warmsnap (W0.5) already provides one-last-good snapshot +
restore. Need to decide: registry format/location (in-repo declarative) + the data-warm lifecycle
(deploy→use→undeploy-keep-volume) + how a canonical is seeded/advanced (WC5 cold-only, later). W1
builds the registry + data-warm reconcile; WC5/WC6 (promote-on-green-cold + nightly) come in W3.
traefik W0.10 + alert-relay deferred to a quiet window before DONE (traefik is critical TLS infra).
## 2026-05-29 — W1.2 data-warm canonical PROVEN (WC2+WC3); claiming W1 gate
Enrolled custom-html (`recipe_meta.WARM_CANONICAL=True`) and ran the live data-warm proof
(/tmp/wc2_proof.py): deploy warm-custom-html @ 1.11.0+1.29.0 → write marker into the content volume →
undeploy → seed_canonical (registry + snapshot while undeployed) → confirm app UNDEPLOYED but volume
RETAINED → deploy_canonical reattach → **marker SURVIVED**. ALL PASS. custom-html is now the first
real data-warm canonical, left idle (undeployed, volume retained, registry status=idle). Disk 49%
(custom-html canonical 32K; keycloak snapshot 318M = the one-per-app DB snapshot, WC8 budget).
WC2 (registry + data-warm model) + WC3 (snapshot tied to canonical; restore proven in W0.5) are
proven. Claimed the WC2+WC3 gate for Adversary cold-verify. One canonical (custom-html) demonstrates
the model; the nightly sweep (WC6/W3) populates more over time — not re-warming all here (plan §4
bounded). Did NOT enroll a 2nd recipe yet (custom-html suffices for W2 --quick + the model proof).
Parked at the W1 gate. While awaiting: will do non-disruptive W0.10b (alert-relay) — NOT the traefik
W0.10a migration (it disrupts TLS the Adversary needs to verify the data-warm round-trip through).
## 2026-05-29 — W1 gate WC2+WC3 ADVERSARY PASS; advancing to W2 (--quick)
Adversary cold-verified WC2+WC3 from its own clone (REVIEW-2w 0246296): 61 units; its OWN data-warm
round-trip (deploy→write ADV marker→undeploy-keep-volume→redeploy→marker survived, Builder's known-good
also reattached); its OWN WC3 restore round-trip (mutate→restore→exact known-good content back,
mutation gone). Its 2 crashes were its own driver-script bugs, not product defects. Canonical left
clean. **WC2 + WC3 PASS @2026-05-29.** Same coordination lag as the W0 claim (its watchdog pinged on a
pre-claim read; resolved via ADVERSARY-INBOX). traefik WC1.1 (W0.10a) remains the sole tracked-open
before DONE.
**Advancing to W2 (--quick, WC4+WC7).** Design: a `--quick` opt-in path in run_recipe_ci.py that
consumes the canonical (reattach → upgrade-to-PR-head → assert → PASS keep-volume / FAIL
restore-snapshot, NEVER promote), tagging results mode=quick, with a clean no-canonical fallback to
cold. Will study the existing upgrade-tier chaos-to-PR-head (HC1) mechanism, then add the quick flow +
units + a live proof on the custom-html canonical (the deliberately-fail-restores-known-good case is
also the WC9 rollback-proof preview).
## 2026-05-29 — W2 (--quick, WC4+WC7) built + proven live; claiming gate
WC4 run_quick in run_recipe_ci.py (dispatch on CCCI_QUICK=1/MODE=quick when a canonical exists, else
clean cold fallback). Live PASS+FAIL proof on the custom-html canonical (ALL PASS): PASS run
(upgrade→different-healthy-head) leaves known-good UNCHANGED + idle + volume/data intact; FAIL run
(broken-image head) rolls back — undeploy→restore last-known-good→idle, known-good UNCHANGED, data
intact. 3 bugs found+fixed by the live proof (missing `import time` crashed the rollback; stale .env
TYPE from a prior --quick upgrade pointing at a removed PR commit FATAL'd abra — deploy_canonical +
rollback now reset TYPE to the known-good).
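
The `--quick` control flow (reattach → upgrade-to-PR-head → assert → PASS keep volume / FAIL restore, never promote, cold fallback when no canonical exists) can be sketched with injected callables. The signature, the callback names, and the returned dict are assumptions for illustration; the real `run_quick` in `run_recipe_ci.py` is not reproduced here.

```python
def run_quick(known_good, pr_head, *, reattach, upgrade, assert_healthy,
              undeploy, restore, run_cold):
    """Quick lane: test a PR head on the warm canonical without ever promoting it."""
    if known_good is None:
        # No canonical enrolled: fall back to the normal cold run.
        return {"mode": "cold", "passed": bool(run_cold(pr_head))}
    reattach(known_good)             # redeploy the canonical onto its retained volume
    try:
        upgrade(pr_head)             # in-place upgrade to the PR head
        passed = bool(assert_healthy())
    except Exception:
        passed = False               # a failed upgrade is a FAIL, not a crash
    undeploy()                       # always return to idle; the volume is kept
    if not passed:
        restore(known_good)          # put the last-known-good snapshot back
    # known_good is returned unchanged in both branches: quick never promotes.
    return {"mode": "quick", "passed": passed, "known_good": known_good}
```
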
WC7 trigger surface: bridge `parse_trigger` accepts `!testme` (cold) / `!testme --quick` (opt-in),
rejects `!testmexyz` etc.; threads CCCI_QUICK=1 through trigger_build (auto-exposed Drone param);
quick PR comment labelled lower-confidence; default !testme unchanged; never gates merge.
Deployed via nixos-rebuild (content-tagged bridge image rolled) + LIVE-verified in the running
container (parse_trigger correct, healthz 200). 64 unit pass.
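
A minimal sketch of the trigger grammar above. The return shape (a dict of build params, or `None` for a non-trigger) and the whole-comment match are assumptions; only the accepted and rejected strings and the `CCCI_QUICK` param name come from this journal.

```python
import re

# '!testme' alone, or '!testme --quick'; anything glued on (e.g. '!testmexyz') is rejected.
_TRIGGER = re.compile(r"^!testme(?:\s+(--quick))?\s*$")


def parse_trigger(comment: str):
    """Return build params for a trigger comment, or None if it is not a trigger."""
    m = _TRIGGER.match(comment.strip())
    if m is None:
        return None
    # Default stays cold; quick is strictly opt-in.
    return {"CCCI_QUICK": "1"} if m.group(1) else {}
```

Returning an empty dict for plain `!testme` keeps the default path byte-for-byte unchanged, which is what "default !testme stays cold" requires.
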
Handoff-signalling note (orchestrator): the watchdog now pings off COMMIT PREFIXES on origin/main
(`claim(...)` pings Adversary; `review(...)` pings Builder) instead of prose; the prose-based pings
caused the earlier premature "no formal gate" dances. I already use `claim(2w):` for gate claims +
push promptly; keep doing so. Claiming WC4+WC7 now with that prefix.
System clean post-rebuild: keycloak 200, custom-html canonical idle@1.11.0+1.29.0, 0 failed units,
disk 50%. Parked at the W2 gate; next quiet-window work = W0.10a traefik WC1.1 migration.