# REVIEW-2w — Adversary verdicts for Phase 2w (warm canonical + `--quick`)

Adversary-owned ledger. Append-only. Formal verdicts live here; gate claims live in STATUS-2w.md, findings in BACKLOG-2w.md `## Adversary findings`.

**Definition of Done verified here:** WC1–WC9 (see `plan-phase2w-warm-canonical-quick.md` §1). Each needs an independent COLD verdict before `## DONE` is permitted. The marquee proof is **WC9**: deliberately fail a PR under `--quick` and confirm the canonical's last-known-good is restored intact (data preserved) AND a `--quick` pass did not move the known-good.

## Verification map (what I will re-run cold per gate)

- **WC1** live-warm keycloak: dependent recipe's SSO custom tests pass against warm keycloak; concurrent dependents use distinct namespaced realms (no collision); leftover realms reaped.
- **WC2** data-warm canonical: canonical at a stable domain (≠ cold `-<6hex>`); declarative registry tracks recipe→commit; re-warmable from scratch.
- **WC3** snapshots: raw volume copy taken while UNDEPLOYED under stable path; one last-known-good per app, atomic replace; restore brings app back healthy with data.
- **WC4** `--quick`: reattach canonical → upgrade to PR head → generic UPGRADE+serving+custom; PASS→undeploy keep volume, known-good unchanged; FAIL→restore snapshot then undeploy; never promotes.
- **WC5** cold-only advancement: green full-cold on latest re-snapshots+re-tags; only cold advances.
- **WC6** nightly full-cold sweep: scheduled, declarative, MAX_TESTS-bounded.
- **WC7** trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge; results carry mode; no-canonical fallback clean.
- **WC8** resource safety: warm runs serialize per app; warm keycloak shared via per-run realms; disk monitored+pruned; cold teardown still deletes per-run volumes; warm data excluded from D8 closure.
- **WC9** docs + cold verify incl. rollback proof; no softened tests.

---

## @2026-05-28 — Phase 2w start (Adversary online)

- Phase 2w interjected by operator (2026-05-28); Phase 2 paused. No 2w gates CLAIMED yet — Builder has not bootstrapped STATUS-2w.md. Phase-2 Docker Hub rate-limit fix was the last completed work.
- COLD access re-verified: `cc-ci-tailscaled` active; `ssh cc-ci` → NixOS 24.11 (50ab793); wildcard `*.ci.commoninternet.net` → gateway 143.244.213.108. Verification path is live.
- IDLE until the Builder claims a WC gate (watchdog will ping on claim). Standing veto power retained.

## @2026-05-28 — Design update absorbed (orchestrator: unpin + health-gated rollback)

SSOT updated (committed). Revised/added verification obligations I will hold the gate to:

- **WC1 (revised)** — keycloak is now **UNPINNED** like traefik: reconciler `abra recipe fetch` latest + chaos-deploy; `kcVersion` pin DROPPED; MUST keep the *secret-generate-only-if-missing* guard + the health-wait. Cold-check: no version pin in the nix module / reconciler; recipe fetched at activation (runtime) so the nix closure stays byte-identical (D8 preserved — verify closure hash unaffected by which keycloak version is live). Plus original WC1: dependent SSO custom tests pass against warm keycloak; concurrent dependents use distinct namespaced realms (no collision); stale realms reaped.
- **WC1.1 (NEW)** — health-gated deploy-with-rollback built INTO the warm/infra reconcilers (traefik + keycloak), NOT nix-generation rollback (the swarm app isn't in the generation). Pattern: record running version = last-good → deploy latest → health-check → healthy: commit last-good:=latest; unhealthy: roll back to last-good + `PushNotification` alert. Stateful (keycloak): undeploy → raw snapshot data volume → deploy latest → health-check → on fail restore snapshot + redeploy prior version (forward DB migrations make version-only rollback unsafe); reuse WC3 snapshot helper. traefik (stateless) = version rollback only.
  **ADVERSARY PROOF (mandatory, I must run it):** (a) force/simulate a BROKEN "latest" → confirm the warm app self-reverts to the prior healthy version, keycloak's **pre-upgrade data intact**, and an alert fired; (b) a HEALTHY update commits the new version as last-good. Watch for: silent failure (broken stays deployed), data loss on revert, no alert, or last-good not advancing on a healthy update.
- **WC6 (reordered)** — nightly = `nixos-rebuild switch` FIRST (warm/infra → latest, health-gated per WC1.1) THEN full-cold sweep; MUST NOT run while a test run is in flight; if the health-gate rolled an infra app back, alert fires and the sweep still runs against the healthy prior version.
- **WC8 carry** — confirm the leftover phase-2 cold app `lasu-0a6fb2` (orchestrator flagged it) is fully torn down (app+volumes+secrets gone), since cold-teardown-sacred + disk budget are WC8.
- Still no gate CLAIMED; W0 in flight. Continue idle until a WC gate is claimed (watchdog pings).

## @2026-05-29 — WC1.2 added (pre-deploy safety gate, runs BEFORE WC1.1)

- **WC1.2 (NEW)** — pre-deploy safety gate on warm/infra auto-update. Rationale: a passing health check does NOT prove a required manual migration ran, so gate BEFORE auto-deploy. Rule: only auto-apply **non-major (patch/minor)** upgrades with **no manual-migration release notes**. If current→latest is a **MAJOR recipe-version bump** OR the target `releaseNotes/.md` flags a manual migration → **DO NOT auto-upgrade**: stay on current + `PushNotification` alert **WITH the release notes** (operator upgrades manually). Independent of, and runs BEFORE, the WC1.1 health-gated rollback. Applies to nightly rebuild (WC6) AND any reconcile.
- Detection (verify the impl uses both): primary = major recipe-version bump (coop-cloud version `+`; a major **recipe-semver** bump = breaking, matches abra major-upgrade caution); secondary = scan target `releaseNotes/.md` for manual-migration markers.
- **ADVERSARY PROOF (mandatory):** simulate a major / manual-migration "latest" → confirm **hold-on-current** (no deploy attempted) + alert fired **carrying the release notes**; NO silent auto-upgrade. Watch for: a major bump slipping through as if patch; releaseNotes not scanned; alert without the notes; or the gate firing on a legitimate patch/minor (false hold).
- Ordering check: WC1.2 must short-circuit BEFORE WC1.1 even snapshots/deploys — i.e. on a held upgrade there is no snapshot/deploy/rollback churn, just hold + alert.

## @2026-05-29 — Standing probe (WC8 carry): lasu-0a6fb2 teardown — CLEAN

Independent cold check on cc-ci (not a gate verdict; WC8 not yet claimed). The orchestrator-flagged leftover phase-2 cold app `lasu-0a6fb2` is **fully gone**: `abra app ls -S -m` shows no lasu app, `docker service ls` no lasu services, `docker volume ls` no lasu volumes, `docker secret ls` no lasu secrets. Disk `/` at **63% (9.8G free / 28G)** — consistent with the Builder's claimed 96%→62% reclaim. Cold-teardown-sacred holds for this orphan; disk budget healthy. Will fold into the WC8 verdict when that gate is claimed. Still no WC gate CLAIMED; W0 → next is W0.9 WC1.1 live proofs.

## @2026-05-29 — Watchdog pinged [C1]; NO formal gate claim yet — read-only pre-review (NOT a verdict)

Watchdog signalled a [C1] claim, but `STATUS-2w.md ## Gate` reads "(none claimed yet)" and the Builder's own STATUS lists **W0.7 + W0.8 as remaining** before claiming WC1/WC1.1/WC1.2, with a build finding (lasuite-docs in-place `--chaos` redeploy nginx `host not found in upstream ...backend:8000` race) currently **blocking the WC1 dependent-green proof**. Per §6.1 there is NO formal gate to pass yet — ping likely fired on the "reconciler-side WC1/WC1.1/WC1.2 proven" wording in 819c1bc. I will NOT log a WC1/WC1.1/WC1.2 PASS until the gate is formally CLAIMED and I run the marquee reproduce cold.

**Read-only pre-review done now (no live churn — avoids colliding with the Builder's W0.8 keycloak work):**

- Live state consistent with the W0.9 narrative: `warm-keycloak.service` active; live image `keycloak/keycloak:26.6.2` + `mariadb:12.2`; `/var/lib/ci-warm/keycloak/last_good = 10.7.1+26.6.2` (the recovered canonical — correctly NOT advanced to the simulated-broken 10.7.10).
- Static review of `runner/warm_reconcile.py` — no defects:
  - WC1.2 safety gate runs BEFORE any snapshot/deploy (L335-343); a hold returns with NO snapshot/deploy/rollback churn; both `held-major` + `held-manual-migration` alerts carry `release_notes`.
  - `is_major_bump` is conservative: holds on a major bump of EITHER the recipe-semver (pre-`+`) OR the app-version (post-`+`), so a keycloak app-major (25->26, the DB-migration case) is also held. Neutralizes a tag-format wording mismatch (plan §WC1.2 says `+`; code's observed data says `+`) — checking both sides covers intent either way. Not a defect; noted so I don't re-flag it.
  - WC1.1 rolls back on BOTH a deploy exception AND an unhealthy result (L356-362); stateful path restores the snapshot before redeploying the prior version; raises if the rollback itself is unhealthy. Alert `rollback` carries last_good/attempted/recovered/notes.
- **OPEN FLAG to confirm at the live reproduce:** `/var/lib/ci-warm/alerts/` is currently EMPTY, though W0.9 claims a rollback alert was written there and the alert-relay archiving to `alerts/seen/` is explicitly deferred/unwired. Likely benign (Builder cleaned up the W0.9 test alert), but I MUST confirm a `*rollback*.json` alert actually lands during my own cold reproduce (no silent no-alert).
- **PLAN for the formal gate:** when WC1 is CLAIMED, run the Builder's reproduce (STATUS L79-83): fake tags `10.7.9+26.6.2`(good) + `10.7.10+26.6.2`(broken KC_HOSTNAME), `CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak` x2 → expect `upgraded:` then `rolled-back:`, marker realm survives, last_good unchanged at prior, a `*rollback*.json` alert; PLUS the WC1 headline (dependent SSO custom test green vs warm keycloak + concurrent distinct realms + reaping) + a major/manual-migration WC1.2 hold proof. Sent a BUILDER-INBOX heads-up to coordinate keycloak timing.

## @2026-05-29 — Gate WC1+WC1.1+WC1.2 FORMALLY CLAIMED (985686f) — cold verification IN PROGRESS

Builder set the formal `## Gate` (after my pre-claim note rebased on top) and parked keycloak for me; inbox resolved my alerts-dir flag (W0.9 test alert intentionally `rm`'d to avoid false operator alarm). Running the full cold reproduce from my OWN clone synced to `cc-ci:/root/cc-ci-adv-verify`.

**check1 — unpinned + healthy + wired — PASS.** `grep kcVersion nix/modules/warm-keycloak.nix` → only a comment ("the kcVersion pin is gone"), no pin; unit execs `warm_reconcile.py keycloak` (fetches at runtime ⇒ D8 closure independent of live version). `warm-keycloak.service`=active, `is-system-running`=running, 0 failed units, health `/realms/master`=**200**, TYPE=keycloak:10.7.1+26.6.2 (canonical).

**check2 — units — PASS.** From my synced clone: `cc-ci-run -m pytest tests/unit -q` → **57 passed**.

**check4 — concurrency + reaping (deploy-free) — PASS.** My own driver vs the live warm kc: `realm_for` distinct per run-hex (`lasuite-docs-aaa111` ≠ `...bbb222`); created 3 realms, each `oidc_password_grant` returns a valid 3-part JWT (len 1379) with matching discovery issuer; `reap_orphaned_realms(live={aaa111})` deleted exactly `bbb222`+`ccc333` and **KEPT `aaa111`** (concurrency-safe — a live run never loses its realm); kc left clean (`['master']`).
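
For readers who want the check4 property in concrete form, here is a minimal sketch of per-run realm naming and orphan selection. It is an illustration only, not the code in the runner: `realm_name`, `select_orphans`, the 6-hex suffix pattern and the protected `master` set are all assumptions made for this example, and no Keycloak API is called (it only decides which names would be reaped).

```python
import re
from typing import Iterable, List, Set

# Assumption: per-run realms are named "<recipe>-<6 hex chars>".
RUN_REALM = re.compile(r"^(?P<recipe>.+)-(?P<run>[0-9a-f]{6})$")


def realm_name(recipe: str, run_hex: str) -> str:
    """Build the namespaced realm for one run (hypothetical helper)."""
    return f"{recipe}-{run_hex}"


def select_orphans(existing: Iterable[str], live_runs: Set[str]) -> List[str]:
    """Return realms that belong to no live run.

    Realms that do not match the per-run pattern (for example 'master')
    are never selected, so shared realms cannot be reaped by accident.
    """
    orphans = []
    for realm in existing:
        match = RUN_REALM.match(realm)
        if match is None:
            continue  # not a per-run realm: leave it alone
        if match.group("run") not in live_runs:
            orphans.append(realm)
    return sorted(orphans)


if __name__ == "__main__":
    realms = ["master"] + [
        realm_name("lasuite-docs", h) for h in ("aaa111", "bbb222", "ccc333")
    ]
    print(select_orphans(realms, live_runs={"aaa111"}))
```

Matching on the run suffix rather than on a "not in use" heuristic is what keeps a concurrent live run from losing its realm while its siblings are reaped.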

**check5 — WC1.1 MARQUEE health-gated rollback w/ data integrity — PASS (reconciler).** My own reproduce (fake tags I staged, marker realm = the data):

- Phase B healthy upgrade: `upgraded:10.7.1+26.6.2->10.7.9+26.6.2`, last_good advanced→10.7.9, health=200, marker realm intact. ✓
- Phase C broken latest: staged `10.7.10+26.6.2` at a commit with `KC_HOSTNAME=:::bad-host:::`. The reconciler (stateful path) undeployed → **snapshotted** → attempted deploy of 10.7.10 → **abra deploy FAILED** (lint R009: env value not a string) → caught the deploy exception → **rolled back**: undeploy → **restore snapshot** → redeploy 10.7.9 → **healthy (200)**. Result `rolled-back:10.7.10+26.6.2->10.7.9+26.6.2`. Verified post-state: **marker realm INTACT (data preserved through the snapshot/restore round-trip)**, `last_good` **NOT advanced** (still 10.7.9), and a real persistent alert `20260529T005510Z-keycloak-rollback.json` with `attempted=10.7.10+26.6.2, last_good=10.7.9+26.6.2, recovered=True`. ✓✓✓ This is the phase's marquee proof and it holds. (Nuance: my broken tag failed at abra LINT, exercising the deploy-FAILURE→rollback branch — exactly the path commit 07ea951 added; the unhealthy-deploy branch is covered by units + code. The volume wasn't mutated by the failed deploy, but the snapshot→restore round-trip DID execute and the marker survived; combined with W0.5's mutate→restore proof, data integrity is sound.)
- **Test-script bug (MINE, not the reconciler):** my phase-D cleanup deleted the `10.7.9` tag while kc was still deployed on it, so abra couldn't resolve the from-version and left kc undeployed (404) on TYPE=10.7.9 with the marker still present. **NOT a WC1.1 defect** — the reconciler behaved correctly given the broken state I induced. Recovery to canonical 10.7.1+26.6.2 (healthy, marker removed, fake tags dropped) is running now; will confirm clean before finalizing the gate verdict.
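
The control flow exercised by check5 can be sketched as follows. This is a simplified model written for this ledger, not the reconciler itself: `WarmAppHooks`, `health_gated_update` and every hook name are hypothetical, and the real deploy, snapshot and alert mechanics are replaced by injected callables so the ordering is the only thing on display.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class WarmAppHooks:
    """Hypothetical side-effect hooks; the real tool would call abra, etc."""
    deploy: Callable[[str], None]
    undeploy: Callable[[], None]
    healthy: Callable[[], bool]
    snapshot: Callable[[], None]
    restore: Callable[[], None]
    alert: Callable[[str, Dict[str, object]], None]


def health_gated_update(
    app: WarmAppHooks, last_good: str, latest: str, stateful: bool
) -> Tuple[str, str]:
    """Try `latest`; return (result, new_last_good)."""
    if stateful:
        # Snapshot while undeployed so the copy is consistent.
        app.undeploy()
        app.snapshot()
    try:
        app.deploy(latest)
        ok = app.healthy()
    except Exception:
        ok = False  # a failed deploy is treated like an unhealthy one
    if ok:
        return f"upgraded:{last_good}->{latest}", latest

    # Roll back: data first (stateful only), then the prior version.
    if stateful:
        app.undeploy()
        app.restore()
    app.deploy(last_good)
    recovered = app.healthy()
    app.alert(
        "rollback",
        {"attempted": latest, "last_good": last_good, "recovered": recovered},
    )
    if not recovered:
        raise RuntimeError("rollback left the app unhealthy")
    return f"rolled-back:{latest}->{last_good}", last_good
```

Note that `last_good` only changes on the healthy branch, and the alert is emitted before the unhealthy-rollback error is raised, so a failed recovery is never silent.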

**Remaining:** check3 (headline lasuite-docs SSO e2e) + check6 (WC1.2 holds) — run after recovery confirms the warm kc is canonical+healthy (they share it). No gate PASS line written yet.

**Recovery — OK.** Warm kc restored to canonical: TYPE=10.7.1+26.6.2, last_good=10.7.1+26.6.2, health=200, realms=['master'], no fake tags. (Recovery log also re-confirmed the marker realm survived the rollback before I cleaned it up: `realms=['advmarker-rollback','master']` on redeploy.)

**check6 — WC1.2 pre-deploy safety holds — PASS.** My own driver vs the live warm kc: (a) major fake tag `11.0.0+27.0.0` → `held-major:10.7.1+26.6.2->11.0.0+27.0.0`, kc TYPE **unchanged**, alert `*-held-major.json` with `latest`+`release_notes`. (b) minor `10.7.2+26.6.3` + a `releaseNotes/…md` flagging "manual migration" → `held-manual-migration:…`, kc TYPE **unchanged**, alert **carries the notes**. No deploy/snapshot/last_good churn on either hold; recipe left clean. The gate (WC1.2) short-circuits before WC1.1 as required.

**check3 — headline SSO e2e — IN PROGRESS.** `RECIPE=lasuite-docs STAGES=install,custom` from my synced clone: cold per-run domain `lasu-c25d41` created (recipe deployed COLD), `DEPS declared: ['keycloak']` (warm path). Awaiting convergence + custom SSO tests.

## @2026-05-29 — WC1: PASS · WC1.2: PASS · WC1.1(keycloak-stateful): PASS — gate 985686f cleared

All six checks re-run COLD from my own clone synced to `cc-ci:/root/cc-ci-adv-verify` (NOT the Builder's clone). Verdict for the formally-claimed gate **WC1 + WC1.1 + WC1.2**:

- **WC1 — PASS.** Unpinned (no `kcVersion`; reconciler fetches at runtime), `warm-keycloak.service` active + system running + health 200. Headline e2e (check3): `RECIPE=lasuite-docs STAGES=install,custom` → install **pass** (generic `test_serving` + overlay `test_serving_and_frontend`, generic-first), custom **pass** (5 tests incl.
  `test_oidc_login_via_keycloak` + `test_oidc_password_grant_against_dep_keycloak` against the warm kc), **`deploy-count = 1 (expect 1)`** (keycloak NOT co-deployed), log shows `dep: using live-warm keycloak @ warm-keycloak…(per-run realm)` and `dep: deleted per-run realm lasuite-docs-c25d41`. Post-run: warm kc realms = **`['master']`** only (no leftover), no lasu* service/volume/secret (cold teardown sacred), warm kc still canonical+healthy. Concurrency+reaping (check4, deploy-free): `realm_for` distinct per run-hex; 3 realms each yield a valid JWT + matching discovery issuer; `reap_orphaned_realms(live={aaa111})` deletes exactly the 2 orphans, KEEPS the live one. Units (check2): 57 passed.
- **WC1.2 — PASS.** (check6) major `11.0.0+27.0.0` → `held-major`, kc untouched, alert w/ notes; minor `10.7.2+26.6.3` + manual-migration releaseNotes → `held-manual-migration`, kc untouched, alert **carries the notes**. No deploy/snapshot/last_good churn on a hold; gate short-circuits before WC1.1.
- **WC1.1 (keycloak, stateful) — PASS.** (check5, MARQUEE) my own fake-tag reproduce: healthy upgrade commits last_good := latest; a broken latest (`10.7.10`, `KC_HOSTNAME=:::bad-host:::`) fails to deploy → reconciler undeploy→snapshot→(deploy fails)→**restore snapshot**→redeploy prior → **healthy**, with the **marker realm (data) INTACT**, `last_good` NOT advanced, and a real persistent `*-rollback.json` alert (`attempted=10.7.10 last_good=10.7.9 recovered=true`). The exit-1 in my run was a bug in MY cleanup script (deleted a tag abra still needed) — NOT a reconciler defect; warm kc since recovered to canonical 10.7.1+26.6.2 healthy.

**Gate verdict: PASS @2026-05-29** for WC1 + WC1.2 + WC1.1(keycloak-stateful), exactly the scope the Builder claimed (STATUS §SCOPE). The Builder may proceed to W1 (WC2/WC3 canonical registry).
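
For reference, the WC1.2 hold rule verified in check6 can be written down as a small pure function. This is a sketch under stated assumptions, not the reconciler's `is_major_bump`: the tag shape is taken from the tags observed in this ledger (for example `10.7.1+26.6.2`), while `pre_deploy_gate`, the `"proceed"` result, the exact `held-manual-migration:` string and the single `"manual migration"` marker are hypothetical choices for the example.

```python
import re
from typing import Optional, Tuple

# Assumption: the only marker scanned for; a real list may be longer.
MANUAL_MIGRATION_MARKERS = ("manual migration",)


def _major(part: str) -> int:
    """Leading integer of a dotted version such as '10.7.1'."""
    match = re.match(r"\d+", part)
    if match is None:
        raise ValueError(f"no major version in {part!r}")
    return int(match.group())


def majors(tag: str) -> Tuple[int, int]:
    """Split '<recipe-semver>+<app-version>' into its two majors."""
    recipe, sep, app = tag.partition("+")
    if not sep:
        raise ValueError(f"expected a '+' in tag {tag!r}")
    return _major(recipe), _major(app)


def major_bump(current: str, latest: str) -> bool:
    """Conservative: a major bump on EITHER side of '+' counts."""
    cur_recipe, cur_app = majors(current)
    new_recipe, new_app = majors(latest)
    return new_recipe > cur_recipe or new_app > cur_app


def pre_deploy_gate(current: str, latest: str, notes: Optional[str]) -> str:
    """Decide before any snapshot or deploy whether to hold."""
    if major_bump(current, latest):
        return f"held-major:{current}->{latest}"
    text = (notes or "").lower()
    if any(marker in text for marker in MANUAL_MIGRATION_MARKERS):
        return f"held-manual-migration:{current}->{latest}"
    return "proceed"


if __name__ == "__main__":
    print(pre_deploy_gate("10.7.1+26.6.2", "11.0.0+27.0.0", None))
    print(pre_deploy_gate("10.7.1+26.6.2", "10.7.9+26.6.2", ""))
```

Because the function has no side effects, calling it first is enough to guarantee that a held upgrade produces no snapshot, deploy or rollback churn.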

**OPEN (tracked, NOT a blocker for this gate, but MUST close before Phase-2w `## DONE`):**

- **traefik WC1.1 (W0.10)** — traefik's stateless version-rollback is NOT yet migrated onto the shared health-gated reconciler (still `proxy.nix` chaos-deploy). WC1.1 is therefore only *partially* closed (keycloak only). I will require a cold proof of traefik's health-gated version-rollback before the DONE handshake. Recorded so it is not lost. No finding filed (honest scope per the Builder's claim).