cc-ci/machine-docs/JOURNAL-2w.md
autonomic-bot aebb28d774 done(2w): Phase 2w COMPLETE — WC1-WC9 (incl WC1.1/WC1.2) all Adversary-verified, NO VETO
## DONE written to STATUS-2w. Adversary authorized (REVIEW-2w 2822d60: all gates
cold-verified, no veto, no open findings). Final state healthy: keycloak+traefik
200, custom-html canonical idle@1.11.0+1.29.0, nightly-sweep timer active, system
running 0 failed, disk 50%. Watchdog auto-returns to Phase 2 (resume recipe
authoring; STATUS-2/BACKLOG-2 intact).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:48:02 +01:00

JOURNAL — Phase 2w (warm canonical + --quick) — Builder

Append-only reasoning log (WHY). Facts/verification go in STATUS-2w; verdicts in REVIEW-2w.

2026-05-28 — Phase 2w bootstrap + cleanup + W0 design

Orientation. Operator interjected Phase 2w into Phase 2 (Phase 2 paused, state preserved). Read the 2w plan + plan.md §6.1/§7/§9. Adversary already online (REVIEW-2w @2026-05-28 start), idle awaiting a WC gate claim. Seeded STATUS-2w/BACKLOG-2w/JOURNAL-2w.

In-flight Phase 2 work committed. Working tree had an uncommitted edit to tests/lasuite-drive/setup_custom_tests.sh (Q3.2 MinIO bucket creation via the createbuckets one-shot) — the continuation of commit 6557197. Committed it (66e065d) with an honest message: not yet live-verified (needs a lasuite-drive deploy once warm keycloak exists). This preserves Phase 2 progress at the pause point; it resumes after 2w DONE.

Cleanup (orchestrator-requested). cc-ci / was at 91% (only 2.4G free) — a real WC8 concern before adding warm volumes/snapshots. Tore down the leftover COLD per-run apps from paused Phase 2 via lifecycle.teardown_app(..., verify=True): lasu-0a6fb2 (12-service lasuite-drive, heaviest), keyc-07d81e (cold keycloak), lasu-dbg (debug lasuite). All TEARDOWN OK, no residual. Disk → 86% (3.8G free). Only infra stacks remain (backups, bridge, dashboard, drone, traefik). Did NOT docker image prune — 9.7GB reclaimable but the image cache is the warm pull-cache; with authed Docker Hub pulls now wired, a re-pull is billed to the account (cheaper) but still slow, so keep the cache. Disk is the Phase-2w budget (WC8) — monitor.

W0 design (WC1 — live-warm keycloak). The existing SSO harness is already most of the way there:

  • sso.setup_keycloak_realm(provider_domain, realm, client_id, ...) creates a realm+client+user idempotently via the admin API, and _kc_admin_password reads the admin password from inside the running container (docker exec ... cat /run/secrets/admin_password). So it works against ANY running keycloak — cold or warm — with no external password handling.
  • The orchestrator dep flow (run_recipe_ci.py): declared_deps → deploy_deps (fresh co-deploy per run) → _enrich_deps_with_sso (creates realm, realm name currently = parent_recipe) → setup_custom_tests.sh hook → teardown_deps (undeploy).

What WC1 changes:

  1. The realm becomes the per-run isolation unit on a shared live-warm keycloak. Realm name must be unique per (parent, pr, ref) so concurrent dependents don't collide — change from realm=parent_recipe to realm=<parent>-<6hex> (derive the hex from the parent's per-run domain label so it's stable within a run and distinct across concurrent runs).
  2. The keycloak dep is not co-deployed: point at the stable warm domain; on teardown delete the realm (not undeploy keycloak). Fall back to cold co-deploy if no warm keycloak is present (so a from-scratch / no-warm environment still works — the warm keycloak is an optimization layer).
  3. The warm keycloak itself is declarative infra (Nix reconciler, like traefik) — NOT warm data (so it IS in the D8 closure as a reconciler; its realm data is ephemeral per-run anyway). Re-warmable from scratch.
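The per-run realm naming in item 1 can be sketched as below. This is a minimal illustration, not the code in warm.py: the function signatures, the `run_hex` helper and the label regex are assumptions, built only from the `<recipe[:4]>-<6hex>` cold-domain shape and the `<parent>-<6hex>` realm shape described here.

```python
import re

# Assumed shape of a cold per-run label: "<recipe[:4]>-<6hex>", e.g. "lasu-9c1995".
_LABEL_RE = re.compile(r"^[a-z0-9]{1,4}-([0-9a-f]{6})$")


def run_hex(per_run_domain: str) -> str:
    """Extract the 6-hex run id from the first DNS label of a per-run domain."""
    label = per_run_domain.split(".", 1)[0]
    match = _LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"not a per-run domain label: {label!r}")
    return match.group(1)


def realm_for(parent_recipe: str, per_run_domain: str) -> str:
    """Realm name that is stable within a run and distinct across runs."""
    return f"{parent_recipe}-{run_hex(per_run_domain)}"
```

Because the hex comes from the parent's own per-run domain, two concurrent runs of the same recipe get different realms without any extra coordination.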

Stable-domain scheme decision: warm-<recipe>.ci.commoninternet.net (here warm-keycloak...), clearly distinct from cold <recipe[:4]>-<6hex>. Risk: longer stack name → swarm 64-char config/secret limit; will verify on first deploy and shorten if it overflows.

Building W0 in increments (each verified): (1) sso realm lifecycle prims + units; (2) deploy warm keycloak manually at the stable domain and prove realm create→delete via admin API; (3) wire the orchestrator live-warm mode; (4) declarative Nix reconciler; (5) e2e + concurrency + reaping proof.

2026-05-29 — W0 core mechanism PROVEN; declarative reconciler up; design update absorbed

Stale Phase-2 run killed. Found an orphaned run_recipe_ci.py (RECIPE=lasuite-drive, the Q3.2 ccci-q32-drive-sso2.log run) still alive from before the phase switch (PPID 1, nohup). It had deployed lasu-0a6fb2 + tried a cold keyc-07d81e dep — both of which I'd already torn down, so it was failing. Killed its process tree + janitored. Only infra + warm-keycloak remain.

W0.1 realm lifecycle (sso.py) — list_realms / delete_keycloak_realm (idempotent, refuses master) / realms_to_reap (pure predicate) / reap_orphaned_realms. +8 unit tests. The per-run realm is the isolation unit on a shared keycloak; orphans reaped by hex-not-in-live-stacks (concurrency-safe).
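A pure reaping predicate of this kind might look like the sketch below. The regex and the exact signature are assumptions; the real realms_to_reap in sso.py may differ. The rule it encodes is the one stated above: never touch master, and reap a per-run realm only when its hex suffix is not among the live stacks.

```python
import re

# Assumed per-run realm shape: "<parent>-<6hex>".
_RUN_REALM_RE = re.compile(r"^.+-([0-9a-f]{6})$")


def realms_to_reap(realms, live_hexes):
    """Return the realms whose run hex no longer belongs to a live stack."""
    doomed = []
    for realm in realms:
        if realm == "master":
            continue  # the admin realm is never a candidate
        match = _RUN_REALM_RE.match(realm)
        if match and match.group(1) not in live_hexes:
            doomed.append(realm)
    return doomed
```

Keeping the predicate free of I/O is what makes it unit-testable without a running keycloak; realms that do not match the per-run pattern are left alone.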

W0.2 orchestrator live-warm mode: warm.py (stable-domain scheme, is_warm_up probe, live_app_hexes, realm_for=<parent>-<6hex>, reap_orphan_realms). run_recipe_ci splits declared deps into warm (shared provider + per-run realm, no deploy, realm deleted at teardown) vs cold (co-deploy); a dep goes warm only if its provider is up, otherwise it falls back to cold. Deploy-count excludes warm deps; orphans are reaped at run start. Dependent tests now assert the namespaced realm pattern (stronger than ==parent).
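The warm/cold split with cold fallback can be sketched as follows. `split_deps` and `deploy_count` are hypothetical names, and the probe is passed in as a callable so the sketch stays independent of how is_warm_up actually checks the provider.

```python
def split_deps(declared_deps, is_warm_up):
    """Partition deps: warm if the shared provider answers, else cold co-deploy."""
    warm, cold = [], []
    for dep in declared_deps:
        (warm if is_warm_up(dep) else cold).append(dep)
    return warm, cold


def deploy_count(parent_deploys, cold_deps):
    """Warm deps are never deployed by the run, so only cold deps add to the count."""
    return parent_deploys + len(cold_deps)
```

With a warm keycloak up, a recipe that declares keycloak as its only dep ends with a deploy count of 1 (the parent alone); with no warm provider the same run falls back to 2.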

WC1 CORE MECHANISM PROVEN (deploy-free, live warm keycloak): realm create → password-grant JWT → discovery issuer → delete(idempotent) → reap(keeps live hex, deletes orphan): ALL PASS.

W0.3 declarative reconciler (nix/modules/warm-keycloak.nix) — systemd oneshot, converges warm keycloak. Two bugs found+fixed against the real system:

  1. abra app deploy non-chaos FATALs "already deployed" → need -f (tested: redeploys at ENV VERSION, exit 0).
  2. Newline bite (the backupbot.nix bite): keycloak's .env.sample ends with a newline-less #COMPOSE_FILE= comment, so bash set_env's printf glued DOMAIN= onto that comment → DOMAIN unset → KC_HOSTNAME=https:// (empty host) → keycloak crash-loop ("Expected authority at index 8: https://"). Fixed set_env to ensure a trailing newline before append (same as backupbot). Also made converge skip the redeploy when already 200 (no JVM-restart blip on every rebuild; only (re)deploys when down/crash-looping). Verified: nixos-rebuild switch → warm-keycloak.service active "no-op converge", system running (0 failed), /realms/master=200.
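The newline bite in item 2 is easy to reproduce in isolation. The sketch below is in Python rather than the reconciler's bash, and `append_env` is a hypothetical helper; the domain value is a placeholder.

```python
def append_env(text: str, key: str, value: str) -> str:
    """Append KEY=VALUE, first ensuring the existing text ends with a newline."""
    if text and not text.endswith("\n"):
        text += "\n"
    return f"{text}{key}={value}\n"


# An env sample whose last line is a comment with no trailing newline.
sample = "TYPE=keycloak\n#COMPOSE_FILE="

# Naive append: the new assignment is glued onto the comment and is lost.
naive = sample + "DOMAIN=warm-keycloak.example\n"

# Newline-safe append: the assignment lands on its own line.
fixed = append_env(sample, "DOMAIN", "warm-keycloak.example")
```

In `naive` the last line is `#COMPOSE_FILE=DOMAIN=warm-keycloak.example`, which any env parser treats as a comment, so DOMAIN stays unset.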

W0.4 e2e (lasuite-docs vs warm keycloak) — the WARM MECHANISM worked: deploy-count=1 (keycloak NOT co-deployed), per-run realm lasuite-docs-9c1995 created + deleted on the warm keycloak at teardown, install pass. BUT setup_custom_tests.sh exited 1 → 3 requires_deps SSO tests SKIPPED → F2-11 correctly FAILED the run (not green). Root cause = a lasuite-docs recipe race, NOT warm keycloak: the in-place abra app deploy --force --chaos (OIDC wiring) rolls all services; nginx web fatally exits on [emerg] host not found in upstream ...backend:8000 while backend is mid-restart, and abra's converge check times out → "deploy failed 🛑". This is independent of warm/cold keycloak (Q2.4 cold-keycloak lasuite-docs passed before; warm should REDUCE contention). Filed as a finding to investigate (flaky/timing/resource vs deterministic regression); the headline WC1 "dependent SSO tests green against warm keycloak" needs this resolved or a more-robust dependent.

DESIGN UPDATE absorbed (orchestrator + Adversary REVIEW-2w, 2026-05-28→29). Warm/infra apps (traefik + keycloak) now AUTO-UPDATE to LATEST nightly with HEALTH-GATED ROLLBACK:

  • WC1 revised: UNPIN keycloak (match traefik: abra recipe fetch latest + chaos deploy; DROP kcVersion). Keep secret-generate-only-if-missing + health-wait. D8 preserved (recipe fetched at runtime → nix closure byte-identical).
  • WC1.1 NEW: health-gated deploy-with-rollback IN the reconcilers. record last-good → deploy latest → health-check → healthy: commit last-good:=latest; unhealthy: rollback + PushNotification. Stateful (keycloak): undeploy → raw snapshot data volume → deploy latest → on fail restore snapshot + redeploy prior version (forward DB migrations make version-only rollback unsafe). traefik (stateless) = version rollback only. Reuse WC3 snapshot helper.
  • WC1.2 NEW: pre-deploy safety gate — auto-apply only non-major/no-manual-migration bumps; a MAJOR bump or manual-migration release notes → stay on current + alert (don't auto-apply).
  • WC6 reordered: nightly = nixos-rebuild switch FIRST (warm/infra→latest, health-gated) THEN full-cold sweep; never while a test is in flight.
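The WC1.2 gate can be illustrated with a small predicate. This is a sketch under assumptions: versions are taken to look like `10.7.1+26.6.2`, with the recipe semver being the part before the `+`, and the manual-migration signal is passed in as a boolean because how release notes are inspected is not specified here. Both function names are hypothetical.

```python
def recipe_major(version: str) -> int:
    """Major number of the recipe semver, e.g. "10.7.1+26.6.2" -> 10."""
    recipe_semver = version.split("+", 1)[0]
    return int(recipe_semver.split(".")[0])


def may_auto_apply(current: str, latest: str, needs_manual_migration: bool = False) -> bool:
    """Auto-apply only same-major bumps that need no manual migration."""
    if needs_manual_migration:
        return False
    return recipe_major(latest) == recipe_major(current)
```

When the predicate returns False the reconciler would stay on the current version and raise an alert rather than deploy.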

Re-sequencing consequence: WC1.1 depends on the WC3 snapshot/restore helper, so I build that FIRST (foundational), then rewrite the reconciler ONCE into the full unpinned + health-gated + safety-gated + rollback form (avoids reworking the reconciler twice). Current reconciler (pinned, skip-if-healthy) is INTERIM — keeps keycloak live-warm/healthy meanwhile; will be replaced. Also need to settle the alert mechanism: a bash systemd reconciler can't call the agent's PushNotification tool directly — decision needed (alert sentinel file the Builder loop reads + relays, or a webhook).

2026-05-29 — W0.5 WC3 snapshot helper proven; disk reclaim (WC8 hygiene)

W0.5 warmsnap.py landed + LIVE round-trip proven on warm keycloak (see STATUS-2w). Then settled the W0.6 reconciler approach (python entrypoint in nix store; deploy-by-tag; recipe-semver = pre-+ component) in DECISIONS.

Disk reclaim. After 3 nixos-rebuild switches + 3 keycloak deploy cycles (WC3 proof) + a 159M keycloak snapshot, / hit 96% (1.2G free) — a WC8 red flag before continuing. Reclaimed safely (reversibility is via the git-declared config, not old generations): rm -rf /root/cc-ci.prev; nix-collect-garbage -d (2553 paths, 3.38G); docker image prune -f dangling-only (3.32G, KEEPS the tagged pull-cache); pruned old abra deploy logs (keep last 5). Result: 62% (10G free). This GC+dangling-prune is the disk-management mechanism WC8 must formalize (run it in the nightly/W4, and keep one last-good snapshot per app bounded). NOTE for WC8: the WC3 keycloak snapshot is 159M; a warm-set of ~6 canonicals × (volume + 1 snapshot) is the disk budget to size.

State at checkpoint: warm keycloak healthy (200), only infra+warm stacks, system running (0 failed), disk 62%. W0.1-W0.5 done+proven+pushed (HEAD 67240dc). Next unit: W0.6 reconciler rewrite (unpin + WC1.2 safety gate + WC1.1 health-gated rollback), then W0.7/W0.8 (lasuite-docs race + headline WC1 e2e).

2026-05-29 — W0.9 WC1.1 live proofs PASS (healthy upgrade + marquee rollback)

Built runner/warm_reconcile.py's health-gated rollback and proved it live against the warm keycloak using annotated fake tags + CCCI_SKIP_FETCH=1. The proof iterations surfaced 4 real issues, each fixed against the real system (verify-don't-assume):

  1. deploy-failure must roll back too — a broken "latest" can fail abra's lint/converge (deploy_version raises) rather than deploy-then-be-unhealthy; wrapped the upgrade deploy so BOTH raise and unhealthy paths trigger the snapshot-restore rollback (else the unit just crashes).
  2. warmsnap clobbered last_good — snapshot's atomic swap renamed the whole <recipe>/ dir, wiping the sibling last_good file. Fixed: snapshot lives in <recipe>/snapshot/; only that subdir is swapped; last_good (sibling) survives.
  3. swarm settle race — abra undeploy returns before swarm finishes removing tasks, so an immediate snapshot/restore/redeploy of the same stack raced a half-removed stack. Added wait_undeployed() after every undeploy.
  4. abra writes FATA to stdout — deploy_version only surfaced stderr (empty); now includes stdout. This is how I diagnosed the two test-artifact failures: the broken deploy failed abra lint R009 (bad env not a string — a valid "broken latest"), and the first rollback attempts failed abra lint R014 "only annotated tags used for recipe version" because my fake tags were lightweight (production tags are annotated) — a TEST artifact, not a reconciler bug. Fixed the test to create annotated tags (peel ^{} to avoid nested-tag; set git identity).
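Issue 1 above (a deploy that raises must roll back just like a deploy that comes up unhealthy) is the core of the control flow. A minimal sketch, with every side effect injected as a callable; the function name and parameters are assumptions, not the interface of warm_reconcile.py:

```python
def health_gated_upgrade(current, latest, deploy, is_healthy, snapshot, restore, alert):
    """Try `latest`; on any failure restore and redeploy `current`.

    Returns the version to record as last-good.
    """
    if latest == current and is_healthy():
        return current  # no-op converge: nothing to do
    snapshot()  # taken before the risky deploy
    try:
        deploy(latest)
        ok = is_healthy()
    except Exception:
        ok = False  # a raising deploy is treated exactly like an unhealthy one
    if ok:
        return latest  # commit last-good := latest
    restore()  # put the data back as it was
    deploy(current)  # and bring the prior version back up
    alert(attempted=latest, last_good=current, recovered=is_healthy())
    return current  # last-good is NOT advanced
```

Funnelling both failure modes into one `ok` flag keeps a single rollback path, so it cannot drift between the "raised" and "unhealthy" cases.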

Final PROOF (ALL PASS):

  • (a) healthy upgrade 10.7.1→10.7.9: snapshot taken (subdir), deploy, health-pass, last_good committed=10.7.9, marker realm preserved through the undeploy/snapshot/redeploy.
  • (b) marquee rollback: broken latest 10.7.10 → deploy fails → rollback to 10.7.9 → HEALTHY; marker realm INTACT (data preserved through broken-upgrade + snapshot-restore); last_good NOT advanced; rollback alert sentinel written (attempted=10.7.10, last_good=10.7.9, recovered=True). keycloak recovered to canonical 10.7.1+26.6.2 healthy, no fake tags left.

This satisfies the WC1.1 Adversary mandate (broken latest → self-revert + data intact + alert; healthy update commits last-good). WC1.2 holds were proven in W0.6. The reconciler-side WC1/WC1.1/WC1.2 are proven; the alert RELAY (Builder loop scans /var/lib/ci-warm/alerts/ → PushNotification + archive to seen/) is still to wire (flagged for when nightly WC6 lands / a real alert can occur).

Remaining for the WC1 gate: W0.7 (lasuite-docs in-place chaos-redeploy nginx race) + W0.8 (headline dependent-SSO-green e2e vs warm keycloak + concurrent distinct realms + reaping).

2026-05-29 — Fixed daily-failing docker-prune (WC8 landmine)

While checking state I found the system degraded: docker-prune.service had been FAILING every day (May 27/28/29) with The "until" filter is not supported with "--volumes". Root: swarm.nix autoPrune flags [--all --volumes --filter until=24h] — docker rejects --volumes + --filter until, so the daily prune never ran (a cause of disk creeping to 96%). Worse: --volumes prunes any volume with no running container → it would DELETE Phase-2w DATA-WARM canonical volumes (undeployed by design) the moment it started working. Fixed: dropped --volumes (prune images/containers/networks/build-cache ≤24h only). Warm volumes survive and are pruned deliberately by the warm reconcilers (WC8). Verified: rebuild → docker-prune.service runs clean, system running (0 failed), keycloak 200. Note for WC8: the warm-volume/snapshot prune policy + nix-generation GC should be folded into the maintenance story.

2026-05-29 — W0.7/W0.8 headline WC1 e2e GREEN; concurrency+reaping proven → claiming WC1/WC1.1/WC1.2

The W0.4 lasuite-docs failure was TRANSIENT (resource contention from the since-killed stale Phase-2 run; disk was tight). Re-ran on the clean system (disk 36% after the prune fix): RECIPE=lasuite-docs STAGES=install,custom → install: pass, custom: pass. All 3 SSO tests are green vs the WARM keycloak: test_health_check (200), test_oidc_login_via_keycloak (full app OIDC flow), test_oidc_password_grant_against_dep_keycloak (per-run realm JWT). deploy-count=1 (keycloak NOT co-deployed; warm path); per-run realm lasuite-docs-4c0858 created + DELETED at teardown; no lasu stack left; warm keycloak realm list back to just master. So W0.7 needs no recipe fix: the in-place chaos-redeploy converges fine with adequate resources.

Concurrency+reaping (deploy-free, live warm keycloak): realm_for gives DISTINCT realms for two concurrent same-recipe runs (lasuite-docs-aaa111 vs -bbb222) + a different recipe (cryptpad-ccc333); all 3 created, each grants its own JWT independently (no collision); reap_orphaned_realms with live_hexes={aaa111} deleted exactly the two orphans and KEPT the live one.

All WC1 sub-claims now proven: (warm dep, no co-deploy, per-run realm create+delete) + (concurrent distinct realms) + (orphan reaping); plus WC1.1 (W0.9 marquee rollback) + WC1.2 (W0.6 holds). Warm keycloak healthy on 10.7.1+26.6.2, last_good=10.7.1+26.6.2, no alerts, system running (0 failed). Claiming the WC1/WC1.1/WC1.2 gate.

Note: the reconciler WRITES alert sentinels to /var/lib/ci-warm/alerts/ (proven for rollback + holds). The Builder-loop RELAY (sentinel → PushNotification + archive to seen/) runs each wake when an alert is present; none currently. This delivery layer is loop behavior, not reconciler logic.
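The relay step described here (sentinel → notification → archive to seen/) could be sketched as below. The function name is hypothetical and `notify` stands in for whatever delivers the PushNotification; the sentinel file format is not assumed beyond being readable text.

```python
from pathlib import Path


def relay_alerts(alerts_dir, notify):
    """Send each pending sentinel through `notify`, then archive it under seen/."""
    alerts_dir = Path(alerts_dir)
    if not alerts_dir.is_dir():
        return []
    seen = alerts_dir / "seen"
    relayed = []
    # Materialise the listing first so creating seen/ does not disturb iteration.
    for sentinel in sorted(p for p in alerts_dir.iterdir() if p.is_file()):
        notify(sentinel.read_text())
        seen.mkdir(exist_ok=True)
        sentinel.rename(seen / sentinel.name)
        relayed.append(sentinel.name)
    return relayed
```

Archiving only after `notify` returns means a crash mid-relay leaves the sentinel in place to be retried on the next wake.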

2026-05-29 — Gate WC1+WC1.2+WC1.1(keycloak) ADVERSARY PASS; advancing to W1

The Adversary cold-verified all 6 checks from its OWN clone (cc-ci:/root/cc-ci-adv-verify): check1 unpinned/healthy/wired, check2 57 units, check3 headline lasuite-docs SSO e2e (install+custom pass, deploy-count=1, per-run realm created+deleted, warm kc left ['master'], cold teardown sacred), check4 concurrency+reaping, check5 WC1.1 marquee rollback (data intact, last_good held, alert), check6 WC1.2 holds. Gate verdict: PASS @2026-05-29 (REVIEW-2w 31ac86d) for exactly the claimed scope. The Adversary independently hit + correctly attributed the same test-script cleanup footgun to the test, not the reconciler. ONE tracked-open before DONE (no finding): traefik WC1.1 (W0.10) — its stateless version-rollback isn't yet on the shared reconciler.

Advancing to W1 (WC2 canonical registry + WC3 closure). Design intent: a small declarative registry of canonical recipes → known-good commit, each at warm-<recipe> kept DATA-warm (undeployed when idle, volume retained), re-warmable. warmsnap (W0.5) already provides one-last-good snapshot + restore. Need to decide: registry format/location (in-repo declarative) + the data-warm lifecycle (deploy→use→undeploy-keep-volume) + how a canonical is seeded/advanced (WC5 cold-only, later). W1 builds the registry + data-warm reconcile; WC5/WC6 (promote-on-green-cold + nightly) come in W3.

traefik W0.10 + alert-relay deferred to a quiet window before DONE (traefik is critical TLS infra).

2026-05-29 — W1.2 data-warm canonical PROVEN (WC2+WC3); claiming W1 gate

Enrolled custom-html (recipe_meta.WARM_CANONICAL=True) and ran the live data-warm proof (/tmp/wc2_proof.py): deploy warm-custom-html @ 1.11.0+1.29.0 → write marker into the content volume → undeploy → seed_canonical (registry + snapshot while undeployed) → confirm app UNDEPLOYED but volume RETAINED → deploy_canonical reattach → marker SURVIVED. ALL PASS. custom-html is now the first real data-warm canonical, left idle (undeployed, volume retained, registry status=idle). Disk 49% (custom-html canonical 32K; keycloak snapshot 318M = the one-per-app DB snapshot, WC8 budget).

WC2 (registry + data-warm model) + WC3 (snapshot tied to canonical; restore proven in W0.5) are proven. Claimed the WC2+WC3 gate for Adversary cold-verify. One canonical (custom-html) demonstrates the model; the nightly sweep (WC6/W3) populates more over time — not re-warming all here (plan §4 bounded). Did NOT enroll a 2nd recipe yet (custom-html suffices for W2 --quick + the model proof).

Parked at the W1 gate. While awaiting: will do non-disruptive W0.10b (alert-relay) — NOT the traefik W0.10a migration (it disrupts TLS the Adversary needs to verify the data-warm round-trip through).

2026-05-29 — W1 gate WC2+WC3 ADVERSARY PASS; advancing to W2 (--quick)

Adversary cold-verified WC2+WC3 from its own clone (REVIEW-2w 0246296): 61 units; its OWN data-warm round-trip (deploy→write ADV marker→undeploy-keep-volume→redeploy→marker survived, Builder's known-good also reattached); its OWN WC3 restore round-trip (mutate→restore→exact known-good content back, mutation gone). Its 2 crashes were its own driver-script bugs, not product defects. Canonical left clean. WC2 + WC3 PASS @2026-05-29. Same coordination lag as the W0 claim (its watchdog pinged on a pre-claim read; resolved via ADVERSARY-INBOX). traefik WC1.1 (W0.10a) remains the sole tracked-open before DONE.

Advancing to W2 (--quick, WC4+WC7). Design: a --quick opt-in path in run_recipe_ci.py that consumes the canonical (reattach → upgrade-to-PR-head → assert → PASS keep-volume / FAIL restore-snapshot, NEVER promote), tagging results mode=quick, with a clean no-canonical fallback to cold. Will study the existing upgrade-tier chaos-to-PR-head (HC1) mechanism, then add the quick flow + units + a live proof on the custom-html canonical (the deliberately-fail-restores-known-good case is also the WC9 rollback-proof preview).

2026-05-29 — W2 (--quick, WC4+WC7) built + proven live; claiming gate

WC4 run_quick in run_recipe_ci.py (dispatch on CCCI_QUICK=1/MODE=quick when a canonical exists, else clean cold fallback). Live PASS+FAIL proof on the custom-html canonical (ALL PASS): PASS run (upgrade→different-healthy-head) leaves known-good UNCHANGED + idle + volume/data intact; FAIL run (broken-image head) rolls back — undeploy→restore last-known-good→idle, known-good UNCHANGED, data intact. 3 bugs found+fixed by the live proof (missing import time crashed the rollback; stale .env TYPE from a prior --quick upgrade pointing at a removed PR commit FATAL'd abra — deploy_canonical + rollback now reset TYPE to the known-good).
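The PASS/FAIL handling of the quick lane reduces to a few lines once the side effects are injected. This is a sketch with assumed parameter names, not the run_quick in run_recipe_ci.py; in particular it omits the .env TYPE reset mentioned above.

```python
def run_quick(known_good, pr_head, deploy, run_tests, undeploy, restore_snapshot):
    """Upgrade the canonical to the PR head, assert, and never promote.

    PASS: undeploy and keep the volume. FAIL: undeploy and restore the snapshot.
    """
    try:
        deploy(pr_head)  # reattaches the retained canonical volume
        passed = bool(run_tests())
    except Exception:
        passed = False  # a broken head that fails to deploy counts as FAIL
    undeploy()  # the canonical always returns to idle
    if not passed:
        restore_snapshot()  # back to the exact known-good data
    # known_good is echoed back unchanged: the quick lane has no promote step.
    return {"mode": "quick", "passed": passed, "known_good": known_good}
```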

WC7 trigger surface: bridge parse_trigger accepts !testme (cold) / !testme --quick (opt-in), rejects !testmexyz etc.; threads CCCI_QUICK=1 through trigger_build (auto-exposed Drone param); quick PR comment labelled lower-confidence; default !testme unchanged; never gates merge. Deployed via nixos-rebuild (content-tagged bridge image rolled) + LIVE-verified in the running container (parse_trigger correct, healthz 200). 64 unit pass.
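A strict trigger parser consistent with the accept/reject cases listed here might look like the sketch below. The return shape and the whole-comment matching rule are assumptions; the bridge's real parse_trigger may tolerate more or less surrounding text.

```python
def parse_trigger(comment: str):
    """Return {"quick": bool} for a recognised trigger comment, else None."""
    tokens = comment.strip().split()
    if tokens == ["!testme"]:
        return {"quick": False}  # default: full cold run
    if tokens == ["!testme", "--quick"]:
        return {"quick": True}  # opt-in fast lane
    return None  # "!testmexyz", unknown flags, and everything else
```

Comparing whole token lists, rather than using a prefix check, is what rejects near-misses such as `!testmexyz`.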

Handoff-signalling note (orchestrator): the watchdog now pings off COMMIT PREFIXES on origin/main (claim(...) pings Adversary; review(...) pings Builder), not prose — which caused the earlier premature "no formal gate" dances. I already use claim(2w): for gate claims + push promptly; keep doing so. Claiming WC4+WC7 now with that prefix.

System clean post-rebuild: keycloak 200, custom-html canonical idle@1.11.0+1.29.0, 0 failed units, disk 50%. Parked at the W2 gate; next quiet-window work = W0.10a traefik WC1.1 migration.

2026-05-29 — W2 gate WC4+WC7 ADVERSARY PASS; advancing to W3 (+ traefik quiet window)

Adversary cold-verified WC4+WC7 (REVIEW-2w 31f0e42): 64 units; WC7 adversarial trigger battery (all negatives rejected on the live bridge); WC4 never-promote (snapshot byte-identical sha256 9ef62bdf, registry unchanged); WC4 FAIL→rollback restored EXACT known-good (marker back, app 200, broken image gone, exit 1 — "WC9 rollback-proof in miniature"); no-canonical fallback to a cold per-run domain (canonical untouched). No tests softened. WC4+WC7 PASS @2026-05-29.

Three of four milestones now PASS (W0, W1, W2). Advancing to W3 (WC5 promote-on-green-cold + WC6 nightly sweep). ALSO: the Adversary is now idle (post-W2), so this is the QUIET WINDOW for the tracked W0.10a traefik WC1.1 migration (it disrupts TLS, so it must NOT overlap an Adversary verify).

Plan for next: (a) W0.10a traefik health-gated reconciler migration (quiet window, careful: traefik serves all TLS); (b) W3 WC5 promote-on-green-cold (extend cold-run teardown to re-seed the canonical on green-latest, reusing seed_canonical); (c) W3 WC6 nightly sweep (systemd timer: rebuild-then-cold-sweep). traefik first (use the window) or interleave; W0.10b alert-relay is a small loop step.

2026-05-29 — W0.10a traefik WC1.1 migrated (quiet window) — code + no-op converge; rollback = Adversary proof

Used the post-W2 quiet window (Adversary idle) for the tracked traefik WC1.1 migration. Generalized warm_reconcile.py: per-spec setup hook + health_domain; added SPECS["traefik"] (stateful=False → stateless version-rollback-only, NO snapshot; setup=_traefik_setup preserving the wildcard-cert/file-provider config EXACTLY via the proven newline-safe abra.env_set; health on the routed dashboard host). keycloak's path is unchanged (no setup key → default). proxy.nix migrated: deploy-proxy.service now execs warm_reconcile.py traefik (runner/ packaged in the store, D8-clean).

ZERO-DISRUPTION migration: traefik was already at the latest tag (5.1.1+v3.6.15, image v3.6.15, chaos commit 005f023 = the tag commit). I pre-seeded the .env TYPE + last_good to 5.1.1+v3.6.15 (accurate — traefik IS at that version), so the health-gated reconcile is a clean no-op (current==latest==healthy) → NO redeploy, NO TLS blip. Verified via nixos-rebuild switch: deploy-proxy.service → "no-op", traefik 200 + keycloak-through-traefik 200 + 0 failed units. 65 unit pass.

Per the operator's explicit out (a destructive traefik test risks ALL TLS), I delivered the code + safe no-op converge and left the DESTRUCTIVE rollback as the Adversary's required cold proof (staged broken traefik tag → reconcile → rollback to last-good, brief TLS blip + manual recovery ready). The rollback logic is the proven keycloak pattern, stateless variant. Claiming W0.10a so the Adversary runs that cold proof. After this clears, WC1.1 is fully closed (keycloak + traefik).

2026-05-29 — W0.10a traefik WC1.1 ADVERSARY PASS → WC1.1 fully closed; building W3 WC5

Adversary PASS (REVIEW-2w e3b08a9): units 65; no-op converge; and the destructive rollback proven WITHOUT a TLS outage — it staged a LINT-breaking newer traefik tag, so the broken deploy was rejected at abra lint BEFORE the running proxy was touched → rollback to 5.1.1, ci.commoninternet.net=200 + keycloak-through-traefik=200 throughout. Stateless path confirmed (no snapshot, version-only rollback). Honest-scope note from the Adversary: the "deploys-clean-but-unhealthy→rollback" branch is shared+unit-covered but not live-exercised for either app (would need a real outage to induce); judged sufficient. No finding. WC1.1 FULLY closed (keycloak + traefik).

Phase-2w verified: WC1, WC1.1, WC1.2, WC2, WC3, WC4, WC7. Remaining: WC5, WC6, WC8, WC9. Adversary now idle → safe for live cold runs. Building W3 WC5 (promote-on-green-cold) next.

2026-05-29 — W3 WC5 promote-on-green-cold built + proven; claiming. (WC6 next.)

should_promote_canonical(recipe,ref,overall,quick) = is_enrolled & green & cold & on-latest(no ref); promote_canonical(recipe,head_ref) = deploy warm-<recipe> at latest (reattach retained volume if any, else fresh) → healthy → undeploy → seed_canonical (snapshot+registry, atomic; old known-good replaced ONLY on green so it's never lost). Wired into main() after a green cold run; non-fatal on failure. +5 unit tests (70 pass). LIVE: set custom-html canonical to 1.10.0+1.28.0, ran full cold (no REF), all tiers green + deploy-count=1 → promote advanced canonical 1.10.0→1.11.0+1.29.0, snapshot refreshed, idle, per-run cust-* torn down, traefik/kc still 200. WC5 proven; claimed.
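The promotion predicate is a plain conjunction. In the sketch below the enrolled set is passed in explicitly (an assumption made to keep the function pure), and the `"pass"` value for a green overall result is likewise assumed.

```python
def should_promote_canonical(recipe, ref, overall, quick, enrolled):
    """Promote only for an enrolled recipe, on a green, cold, on-latest run."""
    is_enrolled = recipe in enrolled
    green = overall == "pass"
    cold = not quick
    on_latest = not ref  # no explicit REF means the run tested latest
    return is_enrolled and green and cold and on_latest
```

Any single false term blocks promotion, which is how a red run or a --quick run leaves the known-good untouched.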

Mechanism note: cold runs still use FRESH per-run domains (unchanged); promote re-deploys the canonical at latest separately (one extra deploy) so the old known-good is never at risk on a red run (DECISIONS Phase-2w WC5). Next: WC6 nightly sweep (systemd timer: nixos-rebuild switch FIRST then serial cold sweep over enrolled recipes; need canonical.enrolled_recipes() + a nightly-sweep nix module). Building WC6 code while the Adversary verifies WC5.

2026-05-29 — W3 WC6 nightly full-cold sweep built + proven (systemd service); claiming. WC5+WC6 close W3.

canonical.enrolled_recipes() (scan tests/*/recipe_meta.py for WARM_CANONICAL). runner/nightly_sweep.py (roll keycloak+traefik via warm_reconcile health-gated → serial full-cold over enrolled recipes on latest → each green promotes WC5; skip if a run is active; per-recipe red reported not fatal). nix/modules/nightly-sweep.nix = systemd timer (OnCalendar 03:00 Persistent +RandomizedDelay) + oneshot service; wired into configuration.nix. 71 unit pass.
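Enrollment discovery by scanning recipe_meta.py files can be sketched as follows. Parsing with `ast` instead of importing each file is a choice made for this illustration; the real canonical.enrolled_recipes() may simply import the modules.

```python
import ast
from pathlib import Path


def is_warm_canonical(meta_source: str) -> bool:
    """True if the source has a top-level `WARM_CANONICAL = True` assignment."""
    for node in ast.parse(meta_source).body:
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "WARM_CANONICAL":
                    return node.value.value is True
    return False


def enrolled_recipes(tests_dir):
    """Names of the recipe directories under tests_dir that opt in."""
    return sorted(
        meta.parent.name
        for meta in Path(tests_dir).glob("*/recipe_meta.py")
        if is_warm_canonical(meta.read_text())
    )
```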

Two bugs found via the live SERVICE run (not the direct run): (1) the store packages only runner/ (not tests/), so enrolled_recipes scanned a nonexistent store/tests → []; fixed nightly_sweep to operate against $CCCI_REPO=/root/cc-ci (the checkout with tests/), the same place run_recipe_ci runs from. (2) the sweep wrapper's runtimeInputs lacked util-linux → abra's backup/restore PTY (script) failed → backup red; added util-linux (matching cc-ci-run). After both fixes, the live SERVICE sweep: enrolled=['custom-html'] → all 5 tiers green → WC5 promote advanced canonical 1.10.0→1.11.0+1.29.0; timer active (next ~03:00). Also confirmed the red-run path (the util-linux flake) correctly did NOT promote (known-good stayed 1.10.0; never lose known-good). W3 (WC5+WC6) essentially closed. Remaining: WC8 (resource/isolation hardening, mostly already in place) + WC9 (docs + --quick rollback proof, already shown) → then DONE.

2026-05-29 — W4 WC8 + WC9 (final gates) built + claimed; DONE pending their PASS

WC6 ADVERSARY PASS (REVIEW-2w b8b698e). Then built the final two:

  • WC8 resource safety + isolation — most was already in place; consolidated + added the missing piece: canonical.prune_stale() drops /var/lib/ci-warm/<recipe>/ + the warm-<recipe> volumes for DE-ENROLLED canonicals (keeps enrolled + reconciler dirs keycloak/traefik + alerts/), wired into the nightly sweep + a df log. +1 unit (72 pass). Verified live: DRONE_RUNNER_CAPACITY=maxTests (serialize); autoPrune flags drop --volumes (warm vols survive); grep ci-warm nix/ = comment only (excluded from D8); disk 50%, warm ~318M.
  • WC9 docs (docs/warm.md): the full warm/quick model (live/data-warm/cold, warm-<recipe> scheme, health-gated reconcilers + WC1.2 safety gate + alerts, canonicals + warmsnap + enroll, --quick, promote-on-green-cold, nightly sweep, resource safety, operate/debug) + the --quick rollback proof (FAIL restores exact known-good; PASS byte-identical snapshot, proven in W2/WC4).
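The selection rule behind the WC8 prune (drop de-enrolled canonicals, keep enrolled ones plus the reconciler and alert directories) can be written as a pure function. `stale_dirs` and `KEEP_ALWAYS` are hypothetical names; the actual canonical.prune_stale() also removes the matching volumes, which this sketch leaves out.

```python
# Directories under /var/lib/ci-warm/ that are never pruned, per the rule above.
KEEP_ALWAYS = {"keycloak", "traefik", "alerts"}


def stale_dirs(present, enrolled):
    """Directory names that belong to de-enrolled canonicals."""
    return sorted(set(present) - set(enrolled) - KEEP_ALWAYS)
```

For a warm root containing keycloak, traefik, alerts, custom-html and one de-enrolled recipe, only the de-enrolled recipe's directory is selected.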

Claimed WC8+WC9 (the final gates). On their PASS, EVERY WC1-WC9 (incl WC1.1/WC1.2) is Adversary-verified → write ## DONE to STATUS-2w (handshake: <24h PASS for all + no VETO) → watchdog returns to Phase 2.

2026-05-29 — Phase 2w COMPLETE — ## DONE written

WC8+WC9 ADVERSARY PASS (REVIEW-2w 2822d60). The Adversary explicitly authorized DONE: ALL WC1-WC9 (incl WC1.1 keycloak+traefik, WC1.2) cold-verified from its own clone with fresh PASSes dated 2026-05-29, NO VETO, no open findings, W0.10 traefik tracked-open CLOSED.

Wrote ## DONE to STATUS-2w.md with the per-WC evidence table (each WC → REVIEW-2w PASS commit / gate). Final state: keycloak+traefik 200, custom-html canonical idle@1.11.0+1.29.0, nightly-sweep timer active, system running (0 failed), disk 50%. No tests softened anywhere in the phase.

What Phase 2w delivered: a warm-data layer for cc-ci CI, comprising (1) a live-warm shared keycloak + a health-gated traefik that auto-update to latest with snapshot-backed rollback (keycloak) / version rollback (traefik) behind a pre-deploy major/manual-migration safety gate, alerting via sentinels; (2) data-warm per-recipe canonicals at stable warm-<recipe> domains with one known-good snapshot each; (3) an opt-in --quick fast lane (reattach canonical → upgrade to PR head → assert → PASS keep-volume / FAIL restore; never promotes, never gates merge); (4) cold-only canonical advancement (promote-on-green-cold) + a nightly rebuild-then-cold-sweep; (5) resource/disk safety + docs.

Per §6.1, ## DONE makes the watchdog auto-return to Phase 2 (resume recipe authoring from STATUS-2/BACKLOG-2, which were preserved at the pause). Stopping the 2w loop here.