cc-ci/machine-docs/STATUS-2w.md
autonomic-bot aebb28d774 done(2w): Phase 2w COMPLETE — WC1-WC9 (incl WC1.1/WC1.2) all Adversary-verified, NO VETO
## DONE written to STATUS-2w. Adversary authorized (REVIEW-2w 2822d60: all gates
cold-verified, no veto, no open findings). Final state healthy: keycloak+traefik
200, custom-html canonical idle@1.11.0+1.29.0, nightly-sweep timer active, system
running 0 failed, disk 50%. Watchdog auto-returns to Phase 2 (resume recipe
authoring; STATUS-2/BACKLOG-2 intact).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:48:02 +01:00

# STATUS — Phase 2w (warm canonical deployments + `--quick` CI mode)
## DONE
**Phase 2w COMPLETE @2026-05-29.** Every Definition-of-Done item (WC1-WC9, incl. WC1.1 + WC1.2) is
**Adversary cold-verified with a fresh (<24h) PASS in REVIEW-2w, NO `## VETO`, no open `[adversary]`
findings**; the Adversary authorized DONE (REVIEW-2w 2822d60: "ALL Phase-2w gates Adversary
cold-verified NO VETO DONE authorized"). The watchdog now auto-returns to **Phase 2** (resume
recipe authoring; STATUS-2/BACKLOG-2 intact).
Evidence (each WC → its REVIEW-2w PASS / gate commit):
| WC | What | PASS (REVIEW-2w / gate) |
|---|---|---|
| WC1 | live-warm UNPINNED keycloak; per-run namespaced realms; concurrency; reaping | 31ac86d / 985686f |
| WC1.1 | health-gated rollback: keycloak (stateful, snapshot) | 31ac86d / 985686f |
| WC1.1 | health-gated rollback: traefik (stateless, version-only) | e3b08a9 / e678d2e |
| WC1.2 | pre-deploy safety gate (major / manual-migration hold+alert) | 31ac86d / 985686f |
| WC2 | data-warm canonical model + registry | 0246296 / 4ce80f8 |
| WC3 | known-good snapshots (raw-while-undeployed, restore round-trips) | 0246296 / 4ce80f8 |
| WC4 | `--quick` mode (PASS keeps known-good; FAIL restores; never promote) | 31f0e42 / 3ff2bf6 |
| WC5 | promote-on-green-cold (only cold-on-latest advances) | 5bbc47c / 125453d |
| WC6 | nightly full-cold sweep (timer + roll-warm/infra + serial sweep) | b8b698e / 465e105 |
| WC7 | `!testme --quick` trigger / labeling / no-canonical fallback | 31f0e42 / 3ff2bf6 |
| WC8 | resource safety + isolation (serialize, disk prune, D8-excluded) | 2822d60 / 40b03a9 |
| WC9 | docs (`docs/warm.md`) + the `--quick` rollback proof | 2822d60 / 40b03a9 |
Final state: keycloak + traefik 200; custom-html canonical idle@1.11.0+1.29.0; nightly-sweep.timer
active; system running (0 failed); disk 50%. No tests softened in the phase.
---
**Phase plan (SSOT):** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
**Loop state for THIS phase:** STATUS-2w / BACKLOG-2w / REVIEW-2w / JOURNAL-2w (DECISIONS.md shared).
Phase 1/1b/1c/1d/1e and Phase 2 STATUS/BACKLOG/REVIEW files are NOT this phase's state.
Phase 2 is **PAUSED** (STATUS-2/BACKLOG-2 intact) and resumes after 2w `## DONE`.
## Phase
Add a warm-data layer to cc-ci CI: a live-warm shared keycloak for SSO deps, data-warm per-recipe
canonicals at stable domains, known-good snapshots, an opt-in `--quick` fast lane that reattaches the
canonical and upgrades to PR head (rolling back on failure), cold-only canonical advancement, and a
nightly full-cold sweep. Definition of Done = WC1-WC9 (plan §1), each Adversary cold-verified.
## Definition of Done (Phase 2w): WC1-WC9 (+WC1.1/WC1.2), each Adversary cold-verified in REVIEW-2w
- [x] **WC1** Live-warm UNPINNED keycloak; per-run namespaced realms (create+delete); concurrent
distinct realms; orphan realms reaped. **Adversary PASS @2026-05-29** (REVIEW-2w, gate 985686f).
- [x] **WC1.1** Health-gated deploy-with-rollback. **keycloak (stateful) Adversary PASS
@2026-05-29** (marquee). **traefik (stateless, version-rollback-only) reconciler MIGRATED
(W0.10a):** proxy.nix now drives `warm_reconcile.py traefik` (shared health-gated path, no
snapshot; cert/file-provider setup preserved); no-op converge proven live (traefik 200,
keycloak-through-traefik 200, 0 failed). **Adversary PASS @2026-05-29** (REVIEW-2w e3b08a9):
destructive rollback proven (lint-breaking tag → rollback to 5.1.1, NO TLS outage). **WC1.1
FULLY CLOSED (keycloak stateful + traefik stateless).**
- [x] **WC1.2** Pre-deploy safety gate (major / manual-migration hold + alert with notes, no
churn, short-circuits before WC1.1). **Adversary PASS @2026-05-29**.
- [x] **WC2** Data-warm canonical model: per-recipe canonical at stable domain `warm-<recipe>`,
declarative registry (canonical.json + recipe_meta.WARM_CANONICAL) tracking recipe → known-good
version/commit; data-warm (undeployed-when-idle, volume retained); re-warmable via seed_canonical.
Proven on custom-html (W1.2). **Adversary PASS @2026-05-29** (REVIEW-2w 0246296, gate 4ce80f8).
- [x] **WC3** Known-good snapshots: raw per-volume tar taken while undeployed under
`/var/lib/ci-warm/<recipe>/snapshot/`; one last-good per app, atomic subdir swap; restore
round-trips data (W0.5 + W1.2 + Adversary's own mutate→restore). **Adversary PASS @2026-05-29**.
- [x] **WC4** `--quick` mode (`run_quick` in run_recipe_ci.py): reattach canonical → upgrade to PR
head (chaos) → generic UPGRADE+serving+overlay+custom; PASS → undeploy-keep-volume (known-good
UNCHANGED, never promote); FAIL → restore last-known-good snapshot then undeploy. Proven live on
custom-html (PASS + FAIL). **Adversary PASS @2026-05-29** (REVIEW-2w 31f0e42, gate 3ff2bf6).
- [x] **WC5** Canonical advancement via cold only (promote-on-green-cold). `should_promote_canonical`
(enrolled+green+cold+latest) + `promote_canonical` (re-seed canonical at green-verified latest
snapshot+registry; never lose known-good). Proven live: green cold custom-html run advanced the
canonical 1.10.0+1.28.0 → 1.11.0+1.29.0 (snapshot refreshed, idle, per-run app torn down).
`--quick` never promotes (W2). **Adversary PASS @2026-05-29** (REVIEW-2w 5bbc47c, gate 125453d).
- [x] **WC6** Nightly full-cold sweep. `nix/modules/nightly-sweep.nix` (systemd TIMER OnCalendar
03:00 Persistent + oneshot service) → `runner/nightly_sweep.py`: roll warm/infra (keycloak+traefik
health-gated, WC1.1) → SERIAL full-cold run over enrolled (`canonical.enrolled_recipes`) recipes
on latest → each green run promotes its canonical (WC5); skips if a test is in flight. Proven via
the live service: enrolled=['custom-html'] → all tiers green → canonical advanced 1.10.0→1.11.0.
**Adversary PASS @2026-05-29** (REVIEW-2w b8b698e, gate 465e105).
- [x] **WC7** Trigger/authority/labeling: default `!testme`=cold (unchanged); `--quick` opt-in via
bridge `parse_trigger` (`!testme --quick` → CCCI_QUICK=1 Drone param, deployed+live-verified);
never gates merge; runs carry mode=quick (lower-confidence label); clean no-canonical fallback
to cold. **Adversary PASS @2026-05-29** (REVIEW-2w 31f0e42, gate 3ff2bf6).
- [x] **WC8** Resource safety + isolation: serialize via `DRONE_RUNNER_CAPACITY=MAX_TESTS` + serial
nightly that skips-if-test-active; warm keycloak shared via per-run realms (WC1); disk
monitored+pruned (autoPrune drops `--volumes` so warm vols survive; `canonical.prune_stale`
drops de-enrolled warm data nightly; nightly logs `df`); cold teardown sacred; warm data
EXCLUDED from D8 (no Nix module references `/var/lib/ci-warm` as a source). **Adversary PASS** (REVIEW-2w 2822d60, gate 40b03a9).
- [x] **WC9** `docs/warm.md` documents the full warm/quick model; the `--quick` rollback proof
(FAIL restores last-known-good intact; PASS doesn't move it) is proven live (W2 FAIL + WC4
Adversary byte-identical-snapshot verify). **Adversary PASS** (REVIEW-2w 2822d60, gate 40b03a9).
## Milestones (plan §3)
- **W0** Warm keycloak (WC1/WC1.1-keycloak/WC1.2). Adversary PASS @2026-05-29.
- **W1** Canonical registry + snapshot/restore (WC2, WC3). Adversary PASS @2026-05-29.
- **W2** `--quick` mode (WC4, WC7). Adversary PASS @2026-05-29.
- **W3** Cold-advances-canonical + nightly sweep (WC5, WC6). Adversary PASS @2026-05-29.
- **W4** Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). DONE.
## In flight
**W0 live-warm keycloak (WC1).** Done so far (commits up to 88c1114):
- W0.1 sso realm lifecycle (list/delete/realms_to_reap/reap) + 8 unit tests (43 unit pass).
- W0.2 orchestrator live-warm dep mode (warm.py + run_recipe_ci split warm/cold; per-run realm).
- **WC1 core mechanism PROVEN** deploy-free on the live warm keycloak: realm create → password-grant
JWT → discovery issuer → delete(idempotent) → reap(keeps live hex / deletes orphan). All PASS.
- W0.3 declarative reconciler `nix/modules/warm-keycloak.nix` up; `nixos-rebuild switch` →
warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned +
skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)
- **W0.5 WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`) DONE (4cc1e15). +5 unit tests
(48 unit pass). **LIVE round-trip PROVEN on warm keycloak**: marker realm → undeploy → snapshot
(mariadb+providers) → deploy → delete marker (mutate DB) → undeploy → restore → deploy → marker
realm BACK; keycloak healthy. Snapshots under `/var/lib/ci-warm/<recipe>/`, atomic, one last-good.
- **W0.6 reconciler rewrite** DONE (a044abb). `runner/warm_reconcile.py` (python, packaged into the
nix store, replaces the bash reconcile): UNPIN keycloak (deploy latest version TAG; recipe fetched
at runtime → D8 closure byte-identical); WC1.2 pre-deploy safety gate (major recipe/app bump OR
releaseNotes manual-migration → hold + alert, no churn); WC1.1 health-gated upgrade-with-rollback
scaffold (record last-good → keycloak undeploy→snapshot→deploy latest → health-gate →
commit-or-restore+redeploy-prior+alert). Alerts = `/var/lib/ci-warm/alerts/*.json`. +8 unit tests
(56 unit pass). PROVEN live: `nixos-rebuild switch` → warm-keycloak.service runs the python
reconciler → noop-healthy (system 0-failed, 200); **WC1.2 holds proven** (MAJOR → held-major,
keycloak untouched; minor+manual-migration notes → held-manual-migration, alert carries notes).
- **W0.9 WC1.1 live proofs** DONE (32f0071). PROVEN on warm keycloak (annotated fake tags +
CCCI_SKIP_FETCH): (a) healthy upgrade 10.7.1→10.7.9: snapshot+deploy+health-pass, last_good
committed, marker preserved; (b) **marquee rollback**: broken latest 10.7.10 → deploy fails →
rollback to 10.7.9, HEALTHY, marker realm INTACT (data preserved), last_good NOT advanced, rollback
alert written (attempted=10.7.10,last_good=10.7.9,recovered=True); recovered to canonical
10.7.1+26.6.2. Fixed 4 issues live (deploy-fail→rollback, warmsnap last_good subdir, wait_undeployed
swarm-settle, abra-stdout capture). 57 unit pass. **Reconciler-side WC1/WC1.1/WC1.2 proven.**
**Adversary reproduce (W0.9):** on cc-ci, with the keycloak recipe clone, create annotated fake
tags (peel `^{}`, set git identity) `10.7.9+26.6.2`(=good commit) and `10.7.10+26.6.2`(broken
KC_HOSTNAME), then `CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak` twice; observe
`upgraded:` then `rolled-back:`, marker realm survives, `/var/lib/ci-warm/keycloak/last_good`
unchanged at the prior version, a `*rollback*.json` alert under `/var/lib/ci-warm/alerts/`.
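
The reconcile flow exercised by this reproduce step can be pictured with a minimal sketch. It is not the code in `runner/warm_reconcile.py`: the `ReconcileOps` hooks and the function signature are hypothetical, the undeploy step before the snapshot is folded into the `snapshot` hook, and only the result strings and alert fields mirror what this file reports.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ReconcileOps:
    """Hypothetical hooks standing in for the real deploy/health/snapshot calls."""
    deploy: Callable[[str], bool]                   # deploy a version tag; True on success
    healthy: Callable[[], bool]                     # health probe, e.g. HTTP 200 on the app
    snapshot: Optional[Callable[[], None]] = None   # stateful apps only
    restore: Optional[Callable[[], None]] = None    # stateful apps only
    alerts: List[dict] = field(default_factory=list)


def reconcile(ops: ReconcileOps, last_good: str, latest: str) -> Tuple[str, str]:
    """Return (result, new_last_good); last_good advances only after a healthy deploy."""
    if latest == last_good and ops.healthy():
        return "noop-healthy", last_good
    if ops.snapshot:                 # stateful: capture data before changing anything
        ops.snapshot()
    if ops.deploy(latest) and ops.healthy():
        return f"upgraded:{last_good}->{latest}", latest
    # Deploy failed or the health gate failed: restore data (stateful), redeploy prior.
    if ops.restore:
        ops.restore()
    ops.deploy(last_good)
    ops.alerts.append({
        "kind": "rollback",
        "attempted": latest,
        "last_good": last_good,
        "recovered": ops.healthy(),
    })
    return f"rolled-back:{latest}->{last_good}", last_good
```

Leaving `snapshot` and `restore` unset gives the stateless, version-rollback-only variant described for traefik.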
**W0 COMPLETE — Adversary PASS @2026-05-29.** Now in **W1 (canonical registry, WC2/WC3)**.
**W0 ✅ + W1 ✅ + W2 ✅ Adversary PASS. Now in W3 (cold-advances-canonical WC5 + nightly sweep WC6).**
**W3 plan:**
- **WC5 promote-on-green-cold.** A GREEN full-cold run on the LATEST (not a `--quick` run) of an
enrolled (WARM_CANONICAL) recipe re-snapshots + re-tags the canonical known-good instead of
deleting the volume at teardown: at the end of a green cold run, undeploy → `canonical.seed_canonical`
(snapshot while undeployed + write registry version=the green commit/version) → keep the volume as
the new canonical. The FIRST green cold run on latest SEEDS the canonical. ONLY cold advances it
(`--quick` never promotes, proven W2). Wire into run_recipe_ci.py cold teardown, gated on:
recipe is WARM_CANONICAL + run was green + deployed LATEST (not a pinned/prev base). Add unit
tests + a live proof (green cold custom-html run → canonical re-seeded at the new known-good).
- **WC6 nightly full-cold sweep.** Declarative scheduler (systemd timer on cc-ci): nightly does
`nixos-rebuild switch` FIRST (rolls warm/infra to latest, health-gated per WC1.1) THEN a full-cold
sweep across enrolled recipes (serial, MAX_TESTS-bounded), refreshing each canonical's known-good
(WC5) + serving as the daily authoritative regression. MUST NOT run while a test is in flight.
- **Quiet-window opportunity (now): W0.10a traefik WC1.1.** Adversary idle post-W2 PASS, so this is
the window to migrate traefik onto the health-gated reconciler (tracked-before-DONE; below).
**Tracked before Phase-2w DONE:**
- **W0.10a traefik WC1.1** (Adversary requires a cold proof): migrate `proxy.nix` onto the shared
health-gated reconciler (stateless = version-rollback-only; preserve cert-secret/WILDCARDS_ENABLED/
COMPOSE_FILE setup). CAREFUL: traefik serves all TLS; deploy/test only in a quiet window.
- **W0.10b Builder-loop alert relay**: each wake, scan `/var/lib/ci-warm/alerts/*.json` →
PushNotification → archive to `alerts/seen/`.
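
As an illustration of the W0.10b relay, the scan, notify, archive loop could look like the sketch below. The function name `relay_alerts` and the `notify` callback are assumptions made for this example; only the `alerts/*.json` and `alerts/seen/` layout comes from this file.

```python
import json
import shutil
from pathlib import Path
from typing import Callable, List


def relay_alerts(alerts_dir: Path, notify: Callable[[dict], None]) -> List[str]:
    """Notify for each pending alert sentinel, then archive it under seen/."""
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = []
    # glob("*.json") is non-recursive, so files already in seen/ are not re-sent.
    for path in sorted(alerts_dir.glob("*.json")):
        notify(json.loads(path.read_text()))
        # Archive only after notify returned, so a failed send is retried next wake.
        shutil.move(str(path), str(seen / path.name))
        relayed.append(path.name)
    return relayed
```

Archiving after the notification means an alert is delivered at least once; a crash between the two steps would resend it on the next wake.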
**Build finding (RESOLVED):** the W0.4 lasuite-docs `setup_custom_tests` redeploy failure (nginx web
`host not found in upstream ...backend:8000`) was **transient resource contention** from the
since-killed stale Phase-2 run (disk was also tight). On the clean system it converges fine; the
headline e2e is green (below). No recipe/harness change needed.
## Gate
### Gate: WC8 + WC9 — CLAIMED, awaiting Adversary (@2026-05-29) [FINAL gates]
**WHAT.** WC8 resource safety/isolation (consolidated + a stale-warm prune) + WC9 docs + the proven
`--quick` rollback. **WHERE:** `runner/harness/canonical.py` (`prune_stale`), `runner/nightly_sweep.py`
(prune + df after sweep), `nix/modules/{drone-runner,swarm}.nix` (capacity, autoPrune), `docs/warm.md`.
**HOW + EXPECTED (cold):**
1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **72 passed** (incl. test_canonical prune_stale:
drops de-enrolled canonical dirs, keeps enrolled + reconciler dirs + alerts/).
2. **WC8 serialize:** `grep DRONE_RUNNER_CAPACITY nix/modules/drone-runner.nix` → `= maxTests`
(MAX_TESTS, default 1); `nightly_sweep.py` `_another_run_active()` skips if a run is in flight;
sweep loop is serial.
3. **WC8 disk/prune:** `grep flags nix/modules/swarm.nix` → `[ "--all" "--filter" "until=24h" ]`
(NO `--volumes`, so warm volumes survive); `canonical.prune_stale()` drops `/var/lib/ci-warm/<r>/`
(+ its `warm-<r>` volumes) for recipes no longer WARM_CANONICAL, run nightly; `df -h /` logged by
the sweep. Live: disk `/` 50% (14G free); warm total ~318M (keycloak DB snapshot dominates).
4. **WC8 cold teardown sacred:** proven across W2/WC5/WC6 (no `<recipe>-<6hex>` leftovers post-run).
5. **WC8 excluded from D8:** `grep -rn ci-warm nix/` → only a COMMENT (no Nix source declares
`/var/lib/ci-warm`); it's runtime cache re-seeded by cold runs.
6. **WC9 docs:** `docs/warm.md` covers live-warm/data-warm/cold, the reconcilers + health-gate +
safety gate + alerts, canonicals + snapshots + enroll, `--quick`, promote-on-green-cold, the
nightly sweep, resource safety, and the `--quick` rollback proof + operate/debug.
7. **WC9 `--quick` rollback proof:** already cold-verified: the W2 FAIL run restored the exact
known-good; WC4 Adversary verify confirmed a PASS run leaves the snapshot byte-identical (does NOT
move the known-good). Re-runnable per docs/warm.md "The --quick rollback proof".
**On WC8+WC9 PASS → ALL of WC1-WC9 (incl WC1.1/WC1.2) verified → Builder writes `## DONE`.**
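
The stale-warm prune from check 3 can be sketched as follows. This is an illustration, not the code in `runner/harness/canonical.py`: the signature and the `PROTECTED` set (standing in for "reconciler dirs + alerts/") are assumptions, and the removal of the matching `warm-<r>` Docker volumes is omitted.

```python
import shutil
from pathlib import Path
from typing import Iterable, List

# Assumed names of directories that are not canonicals and must survive a prune.
PROTECTED = {"alerts", "keycloak", "traefik"}


def prune_stale(warm_root: Path, enrolled: Iterable[str]) -> List[str]:
    """Delete per-recipe warm dirs for recipes that are no longer enrolled."""
    keep = set(enrolled) | PROTECTED
    dropped = []
    for child in sorted(warm_root.iterdir()):
        if child.is_dir() and child.name not in keep:
            shutil.rmtree(child)
            dropped.append(child.name)
    return dropped
```

Returning the dropped names gives the nightly sweep something to log next to its `df` output.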
---
### Gate: WC6 — ✅ Adversary PASS @2026-05-29 (REVIEW-2w b8b698e, gate 465e105)
Declarative timer (Persistent) + orchestration + the live systemd-service run (infra roll
health-gated → serial cold sweep → canonical advanced, infra healthy, no leftovers) cold-verified.
Builder may proceed to W4 (WC8/WC9). (claim detail retained below.)
### (claimed, now PASS) Gate: WC6 — CLAIMED detail
**WHAT.** Nightly full-cold sweep: a scheduled job rolls warm/infra to latest (health-gated, WC1.1)
then runs the full COLD suite serially across enrolled canonical recipes on latest, refreshing each
canonical's known-good (WC5) + a daily authoritative regression. Declarative, MAX_TESTS-bounded
(serial), skips if a test is in flight. **WHERE:** `nix/modules/nightly-sweep.nix` (timer+service),
`runner/nightly_sweep.py`, `runner/harness/canonical.py` (`enrolled_recipes`). Wired into
`hosts/cc-ci/configuration.nix`.
**HOW + EXPECTED (cold):**
1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **71 passed** (incl. test_canonical enrolled_recipes).
2. **Timer present:** `systemctl is-active nightly-sweep.timer` → active; `systemctl list-timers
nightly-sweep.timer` → next ~03:00 (Persistent).
3. **Live sweep (via the systemd SERVICE, store copy):** set the custom-html canonical to an OLDER
version, then `systemctl start nightly-sweep.service` → journal shows: roll keycloak rc=0 + traefik
rc=0 (health-gated, noop at latest); `enrolled canonicals = ['custom-html']`; full-cold custom-html
install/upgrade/backup/restore/custom **all pass**; `WC5 promote: canonical custom-html advanced to
known-good 1.11.0+1.29.0`; `custom-html: PASS`; afterwards `canonical.json` version ADVANCED to
1.11.0+1.29.0, canonical idle, traefik+keycloak 200, system running. Builder ran this live: **PASS**.
(A red recipe in the sweep is reported FAIL + does NOT promote — known-good safe; verified when a
missing-util-linux backup flake red'd a run and the canonical stayed put, then fixed.)
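
The ordering the sweep enforces (skip if a run is in flight, roll infra first, then a serial cold pass that promotes only green recipes) can be sketched as below. The callbacks are hypothetical stand-ins for the real steps in `runner/nightly_sweep.py`, whose internals are not shown in this file.

```python
from typing import Callable, Dict, Iterable, Optional


def nightly_sweep(
    enrolled: Iterable[str],
    run_active: Callable[[], bool],
    roll_infra: Callable[[], None],
    run_cold: Callable[[str], bool],
    promote: Callable[[str], None],
) -> Optional[Dict[str, str]]:
    """Return per-recipe PASS/FAIL, or None when the sweep was skipped."""
    if run_active():            # never compete with a test that is in flight
        return None
    roll_infra()                # warm/infra first, health-gated per WC1.1
    results: Dict[str, str] = {}
    for recipe in enrolled:     # strictly serial: one cold run at a time
        green = run_cold(recipe)
        results[recipe] = "PASS" if green else "FAIL"
        if green:               # a red run leaves the known-good untouched
            promote(recipe)
    return results
```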
---
### Gate: WC5 — ✅ Adversary PASS @2026-05-29 (REVIEW-2w 5bbc47c, gate 125453d)
Anti-poison gate predicate + live advancement 1.10.0→1.11.0 (cold-only) cold-verified. Builder may
proceed to WC6. (claim detail retained below.)
### (claimed, now PASS) Gate: WC5 — CLAIMED detail
**WHAT.** Promote-on-green-cold: a GREEN full-cold run on LATEST (no PR head) of an enrolled
(WARM_CANONICAL) recipe advances/seeds the canonical known-good; `--quick` never promotes; only cold
advances. **WHERE:** `runner/run_recipe_ci.py` (`should_promote_canonical` gate + `promote_canonical`
+ the post-green-cold hook in main()), `runner/harness/canonical.py` (seed_canonical).
**HOW + EXPECTED (cold):**
1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **70 passed** (incl. test_promote: the gate fires
only for enrolled+green+cold+latest; not on red / quick / PR-head / unenrolled).
2. **Live advancement (custom-html canonical):** set its registry version to an OLDER value
(`canonical.write_registry("custom-html", version="1.10.0+1.28.0", …)`), then a full COLD run
`RECIPE=custom-html cc-ci-run runner/run_recipe_ci.py` (no REF = latest) → install/upgrade/backup/
restore/custom all pass, deploy-count=1, then `WC5 promote-on-green-cold: (re)seed canonical
custom-html @ 1.11.0+1.29.0` → afterwards `canonical.json` version **ADVANCED to 1.11.0+1.29.0**
(commit=head 8a02606…), snapshot refreshed (`warmsnap.read_meta` version=1.11.0+1.29.0), canonical
idle + volume retained, NO `cust-*` per-run service left (cold teardown sacred). Builder ran this
live: **advanced 1.10.0→1.11.0**. (A PR `!testme` REF=PR-head does NOT promote; `--quick` never
promotes — both gate-checked.)
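
The gate predicate described in this claim (enrolled + green + cold + latest) can be written as a small sketch. The `RunOutcome` record and its field names are assumptions for illustration; the real `should_promote_canonical` in `runner/run_recipe_ci.py` may take different arguments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunOutcome:
    enrolled: bool          # recipe sets WARM_CANONICAL
    green: bool             # every stage passed
    mode: str               # "cold" or "quick"
    ref: Optional[str]      # None means latest; a PR head otherwise


def should_promote_canonical(run: RunOutcome) -> bool:
    """Only a green, cold run on latest of an enrolled recipe may advance the canonical."""
    return run.enrolled and run.green and run.mode == "cold" and run.ref is None
```

Keeping all four conditions in one predicate makes the negative cases (red, quick, PR head, unenrolled) easy to cover one by one in unit tests.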
---
### Gate: W0.10a traefik WC1.1 — ✅ Adversary PASS @2026-05-29 (REVIEW-2w e3b08a9, gate e678d2e)
Migration + no-op converge + destructive rollback (lint-breaking tag → rollback to last-good, NO TLS
outage — broken deploy rejected at lint before touching the running proxy) all cold-verified.
**WC1.1 now FULLY closed (keycloak + traefik).** (claim detail retained below.)
### (claimed, now PASS) Gate: W0.10a traefik WC1.1 — CLAIMED detail
**WHAT.** traefik migrated onto the shared health-gated reconciler (WC1.1, stateless =
version-rollback-only, NO snapshot): record last-good → deploy latest tag → health-gate (routed host
ci.commoninternet.net = 200) → healthy → commit / unhealthy → roll back to last-good + alert. Closes the
W0.10a tracked-open item from the W0 gate. traefik's wildcard-cert/file-provider config preserved.
**WHERE.** `runner/warm_reconcile.py` (SPECS["traefik"] stateful=False + `_traefik_setup` + health_domain;
reconcile() per-app setup hook; the stateless path skips snapshot/restore — version rollback only),
`nix/modules/proxy.nix` (deploy-proxy.service now execs `python3 …/warm_reconcile.py traefik`).
**HOW + EXPECTED (cold):**
1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **65 passed** (incl. test_warm_reconcile traefik
spec: stateful=False, callable setup, health_domain=ci.commoninternet.net; keycloak unchanged).
2. **No-op converge (delivered, proven live):** `systemctl is-active deploy-proxy.service` → active;
`journalctl -u deploy-proxy.service` → `[traefik] already on latest 5.1.1+v3.6.15 and healthy
no-op`; traefik serving (ci.commoninternet.net=200) + keycloak-through-traefik=200 + system
`running` (0 failed). The migration was zero-disruption (traefik was already at the latest tag; I
pre-seeded TYPE+last_good to 5.1.1+v3.6.15 so the reconcile is a clean no-op).
3. **Destructive rollback (the Adversary's required cold proof):** stage a fake newer traefik tag with
a broken config → `CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py traefik` → broken deploy
fails health → reconciler rolls back to last-good 5.1.1+v3.6.15 (version-only, no snapshot — traefik
is stateless) → traefik healthy again + a `*-rollback.json` alert. NOTE: a destructive traefik test
briefly drops TLS for ALL routes during the broken-deploy window until rollback — run it knowing
that + with manual recovery ready (`abra app deploy traefik.ci.commoninternet.net 5.1.1+v3.6.15
-o -n -f`). The rollback logic is the SAME proven keycloak pattern, stateless variant (no snapshot).
Per operator guidance, I delivered the code + the safe no-op converge this iteration and left the
destructive rollback as the Adversary's cold proof (a live destructive traefik test risks all TLS).
---
### Gate: WC4 + WC7 — ✅ Adversary PASS @2026-05-29 (REVIEW-2w 31f0e42, gate 3ff2bf6)
Cold-verified from the Adversary's own clone: 64 units; WC7 adversarial trigger battery (all negatives
rejected, live bridge); WC4 never-promote (snapshot byte-identical, registry unchanged); WC4
FAIL→rollback restored EXACT known-good (marker back, 200, broken image gone, exit 1); no-canonical
fallback to a cold per-run domain. Builder may proceed to W3. (claim detail retained below.)
### (claimed, now PASS) Gate: WC4 + WC7 — CLAIMED detail
**WHAT.** The `--quick` opt-in fast lane (W2): reattach the data-warm canonical → upgrade in place to
the PR head → assert (generic upgrade reconverge+moved+serving + overlay + custom); PASS →
undeploy-keep-volume with the **known-good UNCHANGED (never promote)**; FAIL → restore the
last-known-good snapshot + undeploy (roll back, data safe). Opt-in via `!testme --quick`, mode-tagged
lower-confidence, never gates merge; clean no-canonical fallback to COLD.
**WHERE (code).** `runner/run_recipe_ci.py` (`run_quick`, dispatched from `main()` on CCCI_QUICK=1 /
MODE=quick; `_wait_undeployed`; no-canonical fallback), `runner/harness/canonical.py`
(deploy_canonical resets TYPE; undeploy_keep_volume), `runner/harness/warmsnap.py` (restore),
`bridge/bridge.py` (`parse_trigger` + CCCI_QUICK param), `.drone.yml` (quick echo). 64 unit pass.
**HOW + EXPECTED (cold, from your own clone on cc-ci):**
1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **64 passed** (incl. test_bridge_trigger:
`!testme`→cold, `!testme --quick`→quick, `!testmexyz`→reject).
2. **WC7 trigger (live in the running bridge):** `cid=$(docker ps -q -f name=ccci-bridge);
docker exec $cid python3 -c 'import sys;sys.path.insert(0,"/app");import bridge;
print(bridge.parse_trigger("!testme --quick"), bridge.parse_trigger("!testmexyz"))'` →
`(True, True) (False, False)`. `trigger_build` adds `CCCI_QUICK=1` (auto-exposed to run_recipe_ci);
a `!testme --quick` PR comment is labelled lower-confidence; plain `!testme` stays full cold.
3. **WC4 `--quick` flow (custom-html canonical, currently idle at 1.11.0+1.29.0):**
- **PASS run:** `RECIPE=custom-html CCCI_QUICK=1 REF=87a62a5 cc-ci-run runner/run_recipe_ci.py`
(REF=87a62a5 is the 1.10.0+1.28.0 commit — a different healthy head) → exit 0; SUMMARY shows
`mode=quick`, `upgrade: pass`, `custom: pass`, "canonical undeployed, volume retained, known-good
UNCHANGED"; afterwards `canonical.json` version STILL 1.11.0+1.29.0 (NOT promoted), canonical
idle, content volume retained, known-good marker intact.
- **FAIL run (rollback):** stage a broken custom-html commit (`image: nginx:99.99.99-doesnotexist`),
`RECIPE=custom-html CCCI_QUICK=1 CCCI_SKIP_FETCH=1 REF=<broken sha> cc-ci-run
runner/run_recipe_ci.py` → exit 1; SUMMARY shows "rolling back … restored known-good data;
canonical idle (NOT promoted)"; afterwards known-good version UNCHANGED, canonical idle, data
(marker) intact. Builder ran both live: **ALL PASS** (canonical left clean idle@1.11.0+1.29.0).
- **no-canonical fallback:** MODE=quick for a recipe with no canonical → logs "falling back to COLD"
and runs the full cold flow (so the PR is still tested; default `!testme` unaffected).
**Builder will NOT advance into W3 (cold-advances-canonical / nightly) past this gate** until
REVIEW-2w shows PASS — but will do the tracked W0.10a (traefik) in a quiet window meanwhile.
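
A minimal sketch of the trigger parsing checked in step 2 follows. It assumes `parse_trigger` returns a `(triggered, quick)` pair, which matches the `(True, True) (False, False)` output quoted above; the tokenizing approach and the rejection of unknown trailing tokens are assumptions, not the code in `bridge/bridge.py`.

```python
from typing import Tuple


def parse_trigger(comment: str) -> Tuple[bool, bool]:
    """Return (triggered, quick) for a PR comment."""
    tokens = comment.strip().split()
    # Exact token match, so "!testmexyz" is not mistaken for "!testme".
    if not tokens or tokens[0] != "!testme":
        return (False, False)
    flags = tokens[1:]
    if not flags:
        return (True, False)        # plain !testme stays full cold
    if flags == ["--quick"]:
        return (True, True)         # opt-in fast lane
    return (False, False)           # assumption: unknown flags are rejected


print(parse_trigger("!testme --quick"), parse_trigger("!testmexyz"))
```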
---
### Gate: WC2 + WC3 — ✅ Adversary PASS @2026-05-29 (REVIEW-2w 0246296, gate 4ce80f8)
Cold-verified from the Adversary's own clone (its own data-warm round-trip + restore round-trip).
Builder may proceed to W2 (`--quick`). custom-html canonical left clean (idle, volume retained,
known-good content, snapshot intact, v1.11.0+1.29.0). (claim detail retained below.)
### (claimed, now PASS) Gate: WC2 + WC3 — CLAIMED detail
**WHAT.** The data-warm canonical model (W1): a declarative per-recipe canonical at the stable domain
`warm-<recipe>.ci.commoninternet.net`, kept **data-warm** (undeployed-when-idle, data volume
retained), tracked by a registry; **known-good snapshots** (raw per-volume tar while undeployed, one
last-good per app, restore round-trips data).
**WHERE (code).** `runner/harness/canonical.py` (registry + data-warm lifecycle), `runner/harness/
warmsnap.py` (snapshot/restore), enrollment `tests/custom-html/recipe_meta.py: WARM_CANONICAL=True`.
State on cc-ci under `/var/lib/ci-warm/<recipe>/` (`canonical.json`, `snapshot/`, retained volume).
**HOW + EXPECTED (cold, from your own clone on cc-ci):**
1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **61 passed** (incl. test_canonical, test_warmsnap).
2. **WC2/WC3 data-warm round-trip** (custom-html canonical exists idle now): reproduce with a driver
that uses `runner/harness/canonical.py` — deploy `warm-custom-html.ci.commoninternet.net` @
`1.11.0+1.29.0`, write a marker file into `/usr/share/nginx/html/`, undeploy, `seed_canonical`
(writes `/var/lib/ci-warm/custom-html/canonical.json` + a `snapshot/` while undeployed); confirm
**app UNDEPLOYED but the `content` volume RETAINED** (`docker volume ls | grep warm-custom-html`);
then `deploy_canonical('custom-html')` → the marker **survives** (data-warm reattach). Builder ran
this live: **ALL PASS** (marker `WC2-DATA-MARKER-7f3a9c` survived; registry version=1.11.0+1.29.0;
snapshot present). Current live state: `cat /var/lib/ci-warm/custom-html/canonical.json` →
status=idle, version=1.11.0+1.29.0; `docker volume ls` shows
`warm-custom-html_ci_commoninternet_net_content` retained with NO custom-html service running.
3. **WC3 restore round-trip** already cold-verified in the W0.9/W0.5 keycloak proof (snapshot →
mutate DB → restore → data back); same `warmsnap` helper.
4. **D8/WC8:** `/var/lib/ci-warm/` is cache, NOT in the nix closure (no module references it as a
source); re-seeded by cold runs, not restored on rebuild.
**Builder will NOT advance into W2 (`--quick`, which consumes the canonical) past this gate** until
REVIEW-2w shows PASS — but will do non-disruptive W0.10 follow-ups (alert relay) meanwhile.
---
### Gate: WC1 + WC1.2 + WC1.1(keycloak) — ✅ Adversary PASS @2026-05-29 (REVIEW-2w 31ac86d, gate 985686f)
All 6 checks cold-verified from the Adversary's own clone. Builder may proceed to W1. **Tracked open
(must close before Phase-2w DONE, not a blocker now): traefik WC1.1 (W0.10)** — stateless
version-rollback not yet on the shared health-gated reconciler; Adversary will require a cold proof.
(claim detail retained below for the record)
**WHAT.** The live-warm keycloak layer (W0): a persistent **unpinned** keycloak at the stable domain
`warm-keycloak.ci.commoninternet.net`, declaratively reconciled, that SSO-dependent runs use via a
**per-run namespaced realm** (created + deleted) instead of co-deploying; concurrent dependents get
distinct realms; orphan realms are reaped (WC1). The reconciler health-gates auto-upgrades with
snapshot-backed rollback (WC1.1) behind a pre-deploy safety gate for major/manual-migration bumps
(WC1.2).
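
The per-run realm naming and orphan reaping can be illustrated with a short sketch. It follows the `lasuite-docs-<hex>` realm names used in this gate, but the regexes, the signatures, and the `example.test` domain in the usage lines are assumptions rather than the code in `runner/harness/warm.py` or `runner/harness/sso.py`.

```python
import re
from typing import Iterable, List, Set

# A per-run app label ends in "-<6hex>", optionally followed by the domain suffix.
_RUN_HEX = re.compile(r"^[^.]+-([0-9a-f]{6})(?:\.|$)")
_REALM_HEX = re.compile(r".+-([0-9a-f]{6})")


def realm_for(recipe: str, app_domain: str) -> str:
    """Derive the per-run realm name from the recipe and the run's app domain."""
    m = _RUN_HEX.match(app_domain)
    if not m:
        raise ValueError(f"no per-run hex in {app_domain!r}")
    return f"{recipe}-{m.group(1)}"


def realms_to_reap(realms: Iterable[str], live_hexes: Set[str]) -> List[str]:
    """Per-run realms whose hex belongs to no live run; 'master' never matches."""
    doomed = []
    for realm in realms:
        m = _REALM_HEX.fullmatch(realm)
        if m and m.group(1) not in live_hexes:
            doomed.append(realm)
    return doomed


realms = ["master"] + [realm_for("lasuite-docs", f"lasu-{h}.example.test")
                       for h in ("aaa111", "bbb222", "ccc333")]
print(realms_to_reap(realms, live_hexes={"aaa111"}))
```

Because the realm name carries the run's hex, two concurrent runs of the same recipe get distinct realms, and reaping needs only the set of live hexes.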
**WHERE (code).** `runner/warm_reconcile.py` (reconcile logic), `runner/harness/warm.py` (stable
domain, per-run realm naming, reaping), `runner/harness/sso.py` (realm lifecycle), `runner/harness/
warmsnap.py` (snapshot/restore), `runner/run_recipe_ci.py` (warm/cold dep split), `nix/modules/
warm-keycloak.nix` (systemd reconcile unit). Warm state on cc-ci under `/var/lib/ci-warm/`.
**HOW + EXPECTED (cold, from your own clone on cc-ci — tar-sync runner+tests to your /root/<clone>):**
1. **Declarative + unpinned + healthy:** `grep -n kcVersion nix/modules/warm-keycloak.nix` → *no
match* (pin removed; the unit runs `runner/warm_reconcile.py keycloak`). `ssh cc-ci 'systemctl
is-active warm-keycloak.service'` → `active`; `systemctl is-system-running` → `running`. Health:
`curl -sk --resolve warm-keycloak.ci.commoninternet.net:443:127.0.0.1
https://warm-keycloak.ci.commoninternet.net/realms/master -o /dev/null -w '%{http_code}'` → `200`.
D8: a `nixos-rebuild build` closure hash is unaffected by which keycloak version is live (recipe
fetched at runtime).
2. **Units:** `cc-ci-run -m pytest tests/unit -q` → **57 passed** (incl. test_warm_realm,
test_warmsnap, test_warm_reconcile).
3. **WC1 headline e2e:** `RECIPE=lasuite-docs STAGES=install,custom cc-ci-run
runner/run_recipe_ci.py` → `install: pass`, `custom: pass`, **`deploy-count = 1 (expect 1)`**
(keycloak NOT co-deployed), log shows `dep: using live-warm keycloak @ warm-keycloak...` and
`dep: deleted per-run realm lasuite-docs-<hex> on warm keycloak`. The 3 custom SSO tests pass
(test_health_check, test_oidc_login_via_keycloak, test_oidc_password_grant_against_dep_keycloak).
After the run, warm keycloak realms = `['master']` only (no leftover); no `lasu*` docker stack.
4. **WC1 concurrency + reaping (deploy-free):** `realm_for("lasuite-docs","lasu-aaa111...")` =
`lasuite-docs-aaa111` and `...bbb222` → distinct (two concurrent same-recipe runs never collide);
create realms aaa111/bbb222/ccc333 on the warm kc, each `oidc_password_grant` returns a JWT;
`sso.reap_orphaned_realms(D, live_hexes={"aaa111"})` deletes exactly bbb222+ccc333 and KEEPS
aaa111. (Builder ran this live: PASS.)
5. **WC1.1 health-gated rollback (live):** with `CCCI_SKIP_FETCH=1` stage two **annotated** fake tags
on `~/.abra/recipes/keycloak` — `10.7.9+26.6.2` at the good commit (`git tag -a -m x 10.7.9+26.6.2
10.7.1+26.6.2^{}`) and `10.7.10+26.6.2` at a commit whose compose.yml has a broken
`KC_HOSTNAME=:::bad-host:::`. Create a marker realm, set last_good, then run `CCCI_SKIP_FETCH=1
cc-ci-run runner/warm_reconcile.py keycloak` twice → first `RECONCILE RESULT: upgraded:...->10.7.9`
(snapshot taken, last_good=10.7.9, marker preserved); second `rolled-back:10.7.10->10.7.9` —
keycloak HEALTHY on 10.7.9, **marker realm INTACT** (data preserved), `/var/lib/ci-warm/keycloak/
last_good` still `10.7.9` (NOT advanced), a `*-rollback.json` alert under `/var/lib/ci-warm/alerts/`
with `attempted=10.7.10 last_good=10.7.9 recovered=true`. (Builder ran this live: ALL PASS; keycloak
restored to canonical 10.7.1+26.6.2.)
6. **WC1.2 pre-deploy safety gate (live):** stage an annotated fake tag with a MAJOR bump
(`11.0.0+27.0.0`) → `CCCI_SKIP_FETCH=1 ... warm_reconcile.py keycloak` → `RECONCILE RESULT:
held-major:...`, a `*-held-major.json` alert written, **keycloak untouched** (TYPE unchanged,
200, no snapshot/deploy churn). Stage a minor tag (`10.7.2+26.6.3`) with `releaseNotes/
10.7.2+26.6.3.md` containing "manual migration" → `held-manual-migration`, alert carries the notes.
(Builder ran both live: held + untouched.)
**SCOPE (honest).** WC1 and WC1.2 are complete. **WC1.1 is proven for keycloak** — the *stateful*
case (snapshot-backed data-integrity rollback), which is the hard part and the Adversary's marquee
proof. **traefik's WC1.1** (stateless = version-rollback-only) is **NOT yet migrated** onto the shared
health-gated reconciler — it still uses the existing `proxy.nix` chaos-deploy reconciler. That
migration is **W0.10** (tracked in BACKLOG-2w), to land before the Phase-2w DONE. If the Adversary
wants WC1.1 fully closed (both reconcilers) before PASS, treat this gate as WC1 + WC1.2 + WC1.1(keycloak).
**Alert delivery note (not blocking):** the reconciler WRITES alert sentinels to
`/var/lib/ci-warm/alerts/*.json` (proven above). The operator-facing relay (Builder loop scans →
PushNotification → archive to `alerts/seen/`) is loop behavior, run each wake when an alert exists;
none currently. "Alert fired" for WC1.1/WC1.2 = sentinel written, which is independently checkable.
**Builder will NOT advance past this gate** (to W1/WC2 canonical registry) until REVIEW-2w shows PASS.
## (prior) Gate
(none before this)
## Blocked
(none)
## Notes
- **Disk budget (WC8 watch):** cc-ci `/` was 91% (2.4G free) at phase start; freed orphaned Phase-2
cold apps (lasu-0a6fb2 12-svc, keyc-07d81e, lasu-dbg) → 86% (3.8G free). 9.7GB reclaimable in
Docker images kept as warm pull-cache (authenticated pulls now, so re-pull is cheaper but slower).
- Stable-domain scheme (proposed, see DECISIONS): `warm-<recipe>.ci.commoninternet.net`, distinct
from cold `<recipe[:4]>-<6hex>`.