# REVIEW-2w — Adversary verdicts for Phase 2w (warm canonical + `--quick`)
Adversary-owned ledger. Append-only. Formal verdicts live here; gate claims live in STATUS-2w.md,
findings in BACKLOG-2w.md `## Adversary findings`.
**Definition of Done verified here:** WC1–WC9 (see `plan-phase2w-warm-canonical-quick.md` §1).
Each needs an independent COLD verdict before `## DONE` is permitted. The marquee proof is **WC9**:
deliberately fail a PR under `--quick` and confirm the canonical's last-known-good is restored intact
(data preserved) AND a `--quick` pass did not move the known-good.
## Verification map (what I will re-run cold per gate)
- **WC1** live-warm keycloak: dependent recipe's SSO custom tests pass against warm keycloak;
concurrent dependents use distinct namespaced realms (no collision); leftover realms reaped.
- **WC2** data-warm canonical: canonical at a stable domain (≠ cold `<recipe>-<6hex>`); declarative
registry tracks recipe→commit; re-warmable from scratch.
- **WC3** snapshots: raw volume copy taken while UNDEPLOYED under stable path; one last-known-good per
app, atomic replace; restore brings app back healthy with data.
- **WC4** `--quick`: reattach canonical → upgrade to PR head → generic UPGRADE+serving+custom;
PASS→undeploy keep volume, known-good unchanged; FAIL→restore snapshot then undeploy; never promotes.
- **WC5** cold-only advancement: green full-cold on latest re-snapshots+re-tags; only cold advances.
- **WC6** nightly full-cold sweep: scheduled, declarative, MAX_TESTS-bounded.
- **WC7** trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge;
results carry mode; no-canonical fallback clean.
- **WC8** resource safety: warm runs serialize per app; warm keycloak shared via per-run realms; disk
monitored+pruned; cold teardown still deletes per-run volumes; warm data excluded from D8 closure.
- **WC9** docs + cold verify incl. rollback proof; no softened tests.
---
## @2026-05-28 — Phase 2w start (Adversary online)
- Phase 2w interjected by operator (2026-05-28); Phase 2 paused. No 2w gates CLAIMED yet — Builder
has not bootstrapped STATUS-2w.md. Phase-2 Docker Hub rate-limit fix was the last completed work.
- COLD access re-verified: `cc-ci-tailscaled` active; `ssh cc-ci` → NixOS 24.11 (50ab793);
wildcard `*.ci.commoninternet.net` → gateway 143.244.213.108. Verification path is live.
- IDLE until the Builder claims a WC gate (watchdog will ping on claim). Standing veto power retained.
## @2026-05-28 — Design update absorbed (orchestrator: unpin + health-gated rollback)
SSOT updated (committed). Revised/added verification obligations I will hold the gate to:
- **WC1 (revised)** — keycloak is now **UNPINNED** like traefik: reconciler `abra recipe fetch`
latest + chaos-deploy; `kcVersion` pin DROPPED; MUST keep the *secret-generate-only-if-missing*
guard + the health-wait. Cold-check: no version pin in the nix module / reconciler; recipe fetched
at activation (runtime) so the nix closure stays byte-identical (D8 preserved — verify closure hash
unaffected by which keycloak version is live). Plus original WC1: dependent SSO custom tests pass
against warm keycloak; concurrent dependents use distinct namespaced realms (no collision); stale
realms reaped.
- **WC1.1 (NEW)** — health-gated deploy-with-rollback built INTO the warm/infra reconcilers
(traefik + keycloak), NOT nix-generation rollback (the swarm app isn't in the generation). Pattern:
record running version = last-good → deploy latest → health-check → healthy: commit last-good:=latest;
unhealthy: roll back to last-good + `PushNotification` alert. Stateful (keycloak): undeploy → raw
snapshot data volume → deploy latest → health-check → on fail restore snapshot + redeploy prior
version (forward DB migrations make version-only rollback unsafe); reuse WC3 snapshot helper.
traefik (stateless) = version rollback only. **ADVERSARY PROOF (mandatory, I must run it):**
(a) force/simulate a BROKEN "latest" → confirm the warm app self-reverts to the prior healthy
version, keycloak's **pre-upgrade data intact**, and an alert fired; (b) a HEALTHY update commits
the new version as last-good. Watch for: silent failure (broken stays deployed), data loss on
revert, no alert, or last-good not advancing on a healthy update.
- **WC6 (reordered)** — nightly = `nixos-rebuild switch` FIRST (warm/infra → latest, health-gated per
WC1.1) THEN full-cold sweep; MUST NOT run while a test run is in flight; if the health-gate rolled
an infra app back, alert fires and the sweep still runs against the healthy prior version.
- **WC8 carry** — confirm the leftover phase-2 cold app `lasu-0a6fb2` (orchestrator flagged it) is
fully torn down (app+volumes+secrets gone), since cold-teardown-sacred + disk budget are WC8.
- Still no gate CLAIMED; W0 in flight. Continue idle until a WC gate is claimed (watchdog pings).
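
The WC1.1 pattern described above (record last-good, deploy latest, health-check, commit or roll back with an alert) can be sketched as follows. This is a minimal illustration, not the reconciler's code: the `Ops` container, the `reconcile` signature, and the in-memory fakes are all hypothetical, and the sketch assumes `latest` differs from `last_good`.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Ops:
    """Injected operations (hypothetical names, not the real reconciler API)."""
    deploy: Callable[[str], None]        # deploy a given version; may raise
    undeploy: Callable[[], None]
    healthy: Callable[[], bool]          # health check after a deploy
    alert: Callable[[str, dict], None]   # persist an operator alert
    snapshot: Optional[Callable[[], None]] = None  # set only for stateful apps
    restore: Optional[Callable[[], None]] = None   # set only for stateful apps


def reconcile(ops: Ops, last_good: str, latest: str) -> Tuple[str, str]:
    """Try `latest`; return (result, new_last_good)."""
    stateful = ops.snapshot is not None and ops.restore is not None
    if stateful:
        # Stateful apps snapshot the data volume while undeployed.
        ops.undeploy()
        ops.snapshot()
    try:
        ops.deploy(latest)
        ok = ops.healthy()
    except Exception:
        ok = False  # a failed deploy is treated like an unhealthy one
    if ok:
        return f"upgraded:{last_good}->{latest}", latest
    if stateful:
        # Version-only rollback is unsafe after forward DB migrations.
        ops.undeploy()
        ops.restore()
    ops.deploy(last_good)
    recovered = ops.healthy()
    ops.alert("rollback", {"attempted": latest, "last_good": last_good,
                           "recovered": recovered})
    if not recovered:
        raise RuntimeError("rollback to last-good is itself unhealthy")
    return f"rolled-back:{latest}->{last_good}", last_good


# Demo with in-memory fakes: any version starting with "broken" fails to deploy.
calls = []
alerts = []


def fake_deploy(version: str) -> None:
    calls.append(f"deploy:{version}")
    if version.startswith("broken"):
        raise RuntimeError("deploy failed")


ops = Ops(
    deploy=fake_deploy,
    undeploy=lambda: calls.append("undeploy"),
    healthy=lambda: True,
    alert=lambda kind, payload: alerts.append((kind, payload)),
    snapshot=lambda: calls.append("snapshot"),
    restore=lambda: calls.append("restore"),
)
good = reconcile(ops, "1.0", "1.1")
bad = reconcile(ops, "1.1", "broken-1.2")
```

Leaving `snapshot` and `restore` unset gives the stateless (traefik-style) variant: the same flow with a version redeploy only.
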
## @2026-05-29 — WC1.2 added (pre-deploy safety gate, runs BEFORE WC1.1)
- **WC1.2 (NEW)** — pre-deploy safety gate on warm/infra auto-update. Rationale: a passing health
check does NOT prove a required manual migration ran, so gate BEFORE auto-deploy. Rule: only
auto-apply **non-major (patch/minor)** upgrades with **no manual-migration release notes**. If
current→latest is a **MAJOR recipe-version bump** OR the target `releaseNotes/<version>.md` flags a
manual migration → **DO NOT auto-upgrade**: stay on current + `PushNotification` alert **WITH the
release notes** (operator upgrades manually). Independent of, and runs BEFORE, the WC1.1
health-gated rollback. Applies to nightly rebuild (WC6) AND any reconcile.
- Detection (verify the impl uses both): primary = major recipe-version bump (coop-cloud version
`<upstream>+<recipe-semver>`; a major **recipe-semver** bump = breaking, matches abra
major-upgrade caution); secondary = scan target `releaseNotes/<version>.md` for manual-migration
markers.
- **ADVERSARY PROOF (mandatory):** simulate a major / manual-migration "latest" → confirm
**hold-on-current** (no deploy attempted) + alert fired **carrying the release notes**; NO silent
auto-upgrade. Watch for: a major bump slipping through as if patch; releaseNotes not scanned;
alert without the notes; or the gate firing on a legitimate patch/minor (false hold).
- Ordering check: WC1.2 must short-circuit BEFORE WC1.1 even snapshots/deploys — i.e. on a held
upgrade there is no snapshot/deploy/rollback churn, just hold + alert.
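
A rough sketch of the WC1.2 decision, assuming a `<part>+<part>` tag shape and a single hypothetical marker string. The function names mirror the wording of this ledger but the signatures, the marker list, and the choice to compare the major number on both sides of the `+` (a conservative reading that does not depend on which half is the recipe-semver) are assumptions, not the implementation.

```python
import re
from typing import Optional

# Hypothetical marker list; the real scan may use different phrases.
MIGRATION_MARKERS = ("manual migration",)


def _major(part: str) -> int:
    """Leading integer of a version part, tolerating a 'v' prefix."""
    match = re.match(r"v?(\d+)", part)
    if match is None:
        raise ValueError(f"no major version in {part!r}")
    return int(match.group(1))


def is_major_bump(current: str, latest: str) -> bool:
    """True if the major number rises on either side of the '+'."""
    cur_parts = current.split("+", 1)
    new_parts = latest.split("+", 1)
    return any(_major(new) > _major(cur)
               for cur, new in zip(cur_parts, new_parts))


def safety_gate(current: str, latest: str,
                release_notes: Optional[str]) -> str:
    """Decide before any snapshot or deploy: hold or proceed."""
    if is_major_bump(current, latest):
        return "held-major"
    notes = (release_notes or "").lower()
    if any(marker in notes for marker in MIGRATION_MARKERS):
        return "held-manual-migration"
    return "proceed"
```

Because the gate returns before any snapshot or deploy call, a held upgrade produces no churn, only the hold result that an alert can then carry.
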
## @2026-05-29 — Standing probe (WC8 carry): lasu-0a6fb2 teardown — CLEAN
Independent cold check on cc-ci (not a gate verdict; WC8 not yet claimed). The orchestrator-flagged
leftover phase-2 cold app `lasu-0a6fb2` is **fully gone**: `abra app ls -S -m` shows no lasu app,
`docker service ls` no lasu services, `docker volume ls` no lasu volumes, `docker secret ls` no lasu
secrets. Disk `/` at **63% (9.8G free / 28G)** — consistent with the Builder's claimed 96%→62%
reclaim. Cold-teardown-sacred holds for this orphan; disk budget healthy. Will fold into the WC8
verdict when that gate is claimed. Still no WC gate CLAIMED; W0 → next is W0.9 WC1.1 live proofs.
## @2026-05-29 — Watchdog pinged [C1]; NO formal gate claim yet — read-only pre-review (NOT a verdict)
Watchdog signalled a [C1] claim, but `STATUS-2w.md ## Gate` reads "(none claimed yet)" and the
Builder's own STATUS lists **W0.7 + W0.8 as remaining** before claiming WC1/WC1.1/WC1.2, with a build
finding (lasuite-docs in-place `--chaos` redeploy nginx `host not found in upstream ...backend:8000`
race) currently **blocking the WC1 dependent-green proof**. Per §6.1 there is NO formal gate to pass
yet — ping likely fired on the "reconciler-side WC1/WC1.1/WC1.2 proven" wording in 819c1bc. I will
NOT log a WC1/WC1.1/WC1.2 PASS until the gate is formally CLAIMED and I run the marquee reproduce cold.
**Read-only pre-review done now (no live churn — avoids colliding with the Builder's W0.8 keycloak work):**
- Live state consistent with the W0.9 narrative: `warm-keycloak.service` active; live image
`keycloak/keycloak:26.6.2` + `mariadb:12.2`; `/var/lib/ci-warm/keycloak/last_good = 10.7.1+26.6.2`
(the recovered canonical — correctly NOT advanced to the simulated-broken 10.7.10).
- Static review of `runner/warm_reconcile.py` — no defects:
- WC1.2 safety gate runs BEFORE any snapshot/deploy (L335-343); a hold returns with NO
snapshot/deploy/rollback churn; both `held-major` + `held-manual-migration` alerts carry `release_notes`.
- `is_major_bump` is conservative: holds on a major bump of EITHER the recipe-semver (pre-`+`) OR
the app-version (post-`+`), so a keycloak app-major (25->26, the DB-migration case) is also held.
Neutralizes a tag-format wording mismatch (plan §WC1.2 says `<upstream>+<recipe-semver>`; code's
observed data says `<recipe-semver>+<app-version>`) — checking both sides covers intent either way.
Not a defect; noted so I don't re-flag it.
- WC1.1 rolls back on BOTH a deploy exception AND an unhealthy result (L356-362); stateful path
restores the snapshot before redeploying the prior version; raises if the rollback itself is
unhealthy. Alert `rollback` carries last_good/attempted/recovered/notes.
- **OPEN FLAG to confirm at the live reproduce:** `/var/lib/ci-warm/alerts/` is currently EMPTY,
though W0.9 claims a rollback alert was written there and the alert-relay archiving to `alerts/seen/`
is explicitly deferred/unwired. Likely benign (Builder cleaned up the W0.9 test alert), but I MUST
confirm a `*rollback*.json` alert actually lands during my own cold reproduce (no silent no-alert).
- **PLAN for the formal gate:** when WC1 is CLAIMED, run the Builder's reproduce (STATUS L79-83):
fake tags `10.7.9+26.6.2`(good) + `10.7.10+26.6.2`(broken KC_HOSTNAME), `CCCI_SKIP_FETCH=1
cc-ci-run runner/warm_reconcile.py keycloak` x2 → expect `upgraded:` then `rolled-back:`, marker
realm survives, last_good unchanged at prior, a `*rollback*.json` alert; PLUS the WC1 headline
(dependent SSO custom test green vs warm keycloak + concurrent distinct realms + reaping) + a
major/manual-migration WC1.2 hold proof. Sent a BUILDER-INBOX heads-up to coordinate keycloak timing.
## @2026-05-29 — Gate WC1+WC1.1+WC1.2 FORMALLY CLAIMED (985686f) — cold verification IN PROGRESS
Builder set the formal `## Gate` (after my pre-claim note rebased on top) and parked keycloak for me;
inbox resolved my alerts-dir flag (W0.9 test alert intentionally `rm`'d to avoid false operator
alarm). Running the full cold reproduce from my OWN clone synced to `cc-ci:/root/cc-ci-adv-verify`.
**check1 — unpinned + healthy + wired — PASS.** `grep kcVersion nix/modules/warm-keycloak.nix` → only
a comment ("the kcVersion pin is gone"), no pin; unit execs `warm_reconcile.py keycloak` (fetches at
runtime ⇒ D8 closure independent of live version). `warm-keycloak.service`=active, `is-system-running`
=running, 0 failed units, health `/realms/master`=**200**, TYPE=keycloak:10.7.1+26.6.2 (canonical).
**check2 — units — PASS.** From my synced clone: `cc-ci-run -m pytest tests/unit -q` → **57 passed**.
**check4 — concurrency + reaping (deploy-free) — PASS.** My own driver vs the live warm kc:
`realm_for` distinct per run-hex (`lasuite-docs-aaa111` ≠ `...bbb222`); created 3 realms, each
`oidc_password_grant` returns a valid 3-part JWT (len 1379) with matching discovery issuer;
`reap_orphaned_realms(live={aaa111})` deleted exactly `bbb222`+`ccc333` and **KEPT `aaa111`**
(concurrency-safe — a live run never loses its realm); kc left clean (`['master']`).
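
The reaping rule exercised in check4 can be illustrated with a pure function over realm names. This is only a sketch: `orphaned_realms` and the `<recipe>-<6hex>` naming pattern are assumptions made for illustration, and the real `reap_orphaned_realms` talks to keycloak rather than to a list.

```python
import re
from typing import Iterable, List, Set

# Assumed naming: per-run realms end in a 6-character hex run id.
RUN_REALM = re.compile(r"^.+-(?P<run>[0-9a-f]{6})$")


def realm_for(recipe: str, run_hex: str) -> str:
    """Namespaced realm name for one run of one recipe."""
    return f"{recipe}-{run_hex}"


def orphaned_realms(existing: Iterable[str], live_runs: Set[str]) -> List[str]:
    """Per-run realms whose run id is not live; never includes 'master'."""
    orphans = []
    for name in existing:
        match = RUN_REALM.match(name)
        if match and match.group("run") not in live_runs:
            orphans.append(name)
    return sorted(orphans)


realms = ["master"] + [realm_for("lasuite-docs", h)
                       for h in ("aaa111", "bbb222", "ccc333")]
to_delete = orphaned_realms(realms, live_runs={"aaa111"})
```
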
**check5 — WC1.1 MARQUEE health-gated rollback w/ data integrity — PASS (reconciler).** My own
reproduce (fake tags I staged, marker realm = the data):
- Phase B healthy upgrade: `upgraded:10.7.1+26.6.2->10.7.9+26.6.2`, last_good advanced→10.7.9,
health=200, marker realm intact. ✓
- Phase C broken latest: staged `10.7.10+26.6.2` at a commit with `KC_HOSTNAME=:::bad-host:::`. The
reconciler (stateful path) undeployed → **snapshotted** → attempted deploy of 10.7.10 → **abra deploy
FAILED** (lint R009: env value not a string) → caught the deploy exception → **rolled back**:
undeploy → **restore snapshot** → redeploy 10.7.9 → **healthy (200)**. Result
`rolled-back:10.7.10+26.6.2->10.7.9+26.6.2`. Verified post-state: **marker realm INTACT (data
preserved through the snapshot/restore round-trip)**, `last_good` **NOT advanced** (still 10.7.9),
and a real persistent alert `20260529T005510Z-keycloak-rollback.json` with
`attempted=10.7.10+26.6.2, last_good=10.7.9+26.6.2, recovered=True`. ✓✓✓ This is the phase's marquee
proof and it holds. (Nuance: my broken tag failed at abra LINT, exercising the deploy-FAILURE→rollback
branch — exactly the path commit 07ea951 added; the unhealthy-deploy branch is covered by units +
code. The volume wasn't mutated by the failed deploy, but the snapshot→restore round-trip DID
execute and the marker survived; combined with W0.5's mutate→restore proof, data integrity is sound.)
- **Test-script bug (MINE, not the reconciler):** my phase-D cleanup deleted the `10.7.9` tag while kc
was still deployed on it, so abra couldn't resolve the from-version and left kc undeployed (404) on
TYPE=10.7.9 with the marker still present. **NOT a WC1.1 defect** — the reconciler behaved correctly
given the broken state I induced. Recovery to canonical 10.7.1+26.6.2 (healthy, marker removed, fake
tags dropped) is running now; will confirm clean before finalizing the gate verdict.
**Remaining:** check3 (headline lasuite-docs SSO e2e) + check6 (WC1.2 holds) — run after recovery
confirms the warm kc is canonical+healthy (they share it). No gate PASS line written yet.
**Recovery — OK.** Warm kc restored to canonical: TYPE=10.7.1+26.6.2, last_good=10.7.1+26.6.2,
health=200, realms=['master'], no fake tags. (Recovery log also re-confirmed the marker realm survived
the rollback before I cleaned it up: `realms=['advmarker-rollback','master']` on redeploy.)
**check6 — WC1.2 pre-deploy safety holds — PASS.** My own driver vs the live warm kc:
(a) major fake tag `11.0.0+27.0.0` → `held-major:10.7.1+26.6.2->11.0.0+27.0.0`, kc TYPE **unchanged**,
alert `*-held-major.json` with `latest`+`release_notes`. (b) minor `10.7.2+26.6.3` + a
`releaseNotes/…md` flagging "manual migration" → `held-manual-migration:…`, kc TYPE **unchanged**,
alert **carries the notes**. No deploy/snapshot/last_good churn on either hold; recipe left clean.
The gate (WC1.2) short-circuits before WC1.1 as required.
**check3 — headline SSO e2e — IN PROGRESS.** `RECIPE=lasuite-docs STAGES=install,custom` from my
synced clone: cold per-run domain `lasu-c25d41` created (recipe deployed COLD), `DEPS declared:
['keycloak']` (warm path). Awaiting convergence + custom SSO tests.
## @2026-05-29 — WC1: PASS · WC1.2: PASS · WC1.1(keycloak-stateful): PASS — gate 985686f cleared
All six checks re-run COLD from my own clone synced to `cc-ci:/root/cc-ci-adv-verify` (NOT the
Builder's clone). Verdict for the formally-claimed gate **WC1 + WC1.1 + WC1.2**:
- **WC1 — PASS.** Unpinned (no `kcVersion`; reconciler fetches at runtime), `warm-keycloak.service`
active + system running + health 200. Headline e2e (check3): `RECIPE=lasuite-docs
STAGES=install,custom` → install **pass** (generic `test_serving` + overlay
`test_serving_and_frontend`, generic-first), custom **pass** (5 tests incl.
`test_oidc_login_via_keycloak` + `test_oidc_password_grant_against_dep_keycloak` against the warm
kc), **`deploy-count = 1 (expect 1)`** (keycloak NOT co-deployed), log shows `dep: using live-warm
keycloak @ warm-keycloak…(per-run realm)` and `dep: deleted per-run realm lasuite-docs-c25d41`.
Post-run: warm kc realms = **`['master']`** only (no leftover), no lasu* service/volume/secret (cold
teardown sacred), warm kc still canonical+healthy. Concurrency+reaping (check4, deploy-free):
`realm_for` distinct per run-hex; 3 realms each yield a valid JWT + matching discovery issuer;
`reap_orphaned_realms(live={aaa111})` deletes exactly the 2 orphans, KEEPS the live one. Units
(check2): 57 passed.
- **WC1.2 — PASS.** (check6) major `11.0.0+27.0.0` → `held-major`, kc untouched, alert w/ notes;
minor `10.7.2+26.6.3` + manual-migration releaseNotes → `held-manual-migration`, kc untouched,
alert **carries the notes**. No deploy/snapshot/last_good churn on a hold; gate short-circuits
before WC1.1.
- **WC1.1 (keycloak, stateful) — PASS.** (check5, MARQUEE) my own fake-tag reproduce: healthy
upgrade commits last_good := latest; a broken latest (`10.7.10`, `KC_HOSTNAME=:::bad-host:::`)
fails to deploy → reconciler undeploy→snapshot→(deploy fails)→**restore snapshot**→redeploy prior
**healthy**, with the **marker realm (data) INTACT**, `last_good` NOT advanced, and a real
persistent `*-rollback.json` alert (`attempted=10.7.10 last_good=10.7.9 recovered=true`). The
exit-1 in my run was a bug in MY cleanup script (deleted a tag abra still needed) — NOT a
reconciler defect; warm kc since recovered to canonical 10.7.1+26.6.2 healthy.
**Gate verdict: PASS @2026-05-29** for WC1 + WC1.2 + WC1.1(keycloak-stateful), exactly the scope the
Builder claimed (STATUS §SCOPE). The Builder may proceed to W1 (WC2/WC3 canonical registry).
**OPEN (tracked, NOT a blocker for this gate, but MUST close before Phase-2w `## DONE`):**
- **traefik WC1.1 (W0.10)** — traefik's stateless version-rollback is NOT yet migrated onto the shared
health-gated reconciler (still `proxy.nix` chaos-deploy). WC1.1 is therefore only *partially* closed
(keycloak only). I will require a cold proof of traefik's health-gated version-rollback before the
DONE handshake. Recorded so it is not lost. No finding filed (honest scope per the Builder's claim).
## @2026-05-29 — Watchdog pinged [C2 C3]; NO formal WC2/WC3 claim yet (premature)
`## Gate` holds only the WC1 PASS; `grep CLAIMED|awaiting adversary` → none. STATUS "In flight" shows
W1 mid-build: **W1.1 registry module DONE** (`runner/harness/canonical.py`, 61 unit pass) but **W1.2
(the LIVE data-warm proof: seed → undeploy-keep-volume → redeploy-reattach → data survives) is "Next"**
and the Builder explicitly says "Then close WC2/WC3." So WC2/WC3 are NOT yet claimable — ping fired on
"WC2/WC3" wording in commits b6ef83a/563156a, not a §6.1 gate. No verdict written.
Read-only glance (NOT a verdict): canonical.py is a sound registry primitive — `seed_canonical`
honors snapshot-while-undeployed; `has_canonical` requires both a registry record AND retained
volume; deploy/undeploy-keep-volume lifecycle matches WC2. Will cold-verify when WC2/WC3 is formally
CLAIMED (the live data-warm round-trip is the key thing to re-run myself). Idle until then.
## @2026-05-29 — WC2 + WC3 — PASS (gate 4ce80f8 cleared; cold-verified from own clone)
WC2/WC3 formally claimed (4ce80f8; my premature note rebased on top). Builder parked custom-html (first
data-warm canonical, left idle) + traefik for me. All re-run COLD from `cc-ci:/root/cc-ci-adv-verify`.
- **Units — PASS:** `cc-ci-run -m pytest tests/unit -q` → **61 passed** (incl. test_canonical, test_warmsnap).
- **WC2 data-warm canonical model — PASS.** Idle state matches: `canonical.json`
{recipe=custom-html, domain=warm-custom-html.ci.commoninternet.net, version=1.11.0+1.29.0,
commit=wc2proof, **status=idle**}; content volume **retained** (`warm-custom-html_…_content`); **no
service** running (idle = undeployed-keep-volume); stable `warm-<recipe>` domain (≠ cold
`<recipe[:4]>-<6hex>`). My OWN data-warm round-trip: deploy_canonical → wrote my marker
`ADV-OWN-MARKER-a1b2c3` → `undeploy_keep_volume` (**app down + volume retained**, registry→idle) →
deploy_canonical → **my marker SURVIVED**. The Builder's known-good marker also reattached. HTTPS
serving confirmed (`/`=200, `/index.html`=200; an earlier one-off 404 was a curl-vs-deploy-converge
race, 200 once settled — not a defect).
- **WC3 known-good snapshots — PASS.** Snapshot is a **raw per-volume tar taken while undeployed**
(`/var/lib/ci-warm/custom-html/snapshot/volumes/warm-custom-html_…_content.tar` + meta.json), one
last-good per app under the stable path. My OWN restore round-trip: mutate (deleted the known-good
`wc2-marker.txt`) → undeploy → `warmsnap.restore` → deploy → **known-good marker BACK with exact
content `WC2-DATA-MARKER-7f3a9c`** AND my mutation gone → restore round-trips the EXACT known-good.
(Same warmsnap helper already cold-proven on keycloak in check5/W0.5.) `has_canonical` correctly
requires BOTH a registry record AND a retained volume.
- **D8/WC8 (spot):** `/var/lib/ci-warm/` is cache — no nix module references it as a source; full D8
closure-exclusion folds into the WC8 verdict later.
Two crashes during my runs were **bugs in my OWN driver scripts** (a tag I deleted that abra still
needed in check5; `grep -rl` returning rc=1 on no-match which `exec_in_app` raises on) — NOT product
defects. Canonical left clean: idle, volume retained, known-good content, snapshot intact, v1.11.0+1.29.0.
**Gate verdict: WC2 + WC3 — PASS @2026-05-29.** Builder may proceed to W2 (`--quick`).
**Still tracked-open before Phase-2w DONE (unchanged):** traefik WC1.1 (W0.10) cold proof.
## @2026-05-29 — WC4 + WC7 — PASS (gate 3ff2bf6 cleared; cold-verified from own clone)
All re-run COLD from `cc-ci:/root/cc-ci-adv-verify`. Builder parked custom-html canonical for me.
- **Units — PASS:** `cc-ci-run -m pytest tests/unit -q` → **64 passed** (incl. test_bridge_trigger).
- **WC7 trigger — PASS** (against the LIVE deployed bridge `ccci-bridge`, adversarial battery):
`!testme`→(True,False)=cold; `!testme --quick`→(True,True)=quick; and ALL of `!testmexyz`,
`!testme foo`, `!testme  --quick` (double-space), `!testme --quickx`, `please !testme`,
`!testme --quick extra` → (False,False) rejected; surrounding whitespace tolerated. Strict
exact-match, no false-trigger. `trigger_build` wires `CCCI_QUICK=1`; default `!testme` stays cold.
- **WC4 `--quick` PASS / NEVER-PROMOTE — PASS.** `RECIPE=custom-html CCCI_QUICK=1 REF=87a62a5`
(healthy 1.10.0+1.28.0 head): mode=quick, in-place upgrade 1.11.0+1.29.0→1.10.0+1.28.0, **upgrade
pass** (generic test_upgrade_reconverges first, then overlay), **custom pass** (5 tests incl.
playwright), "known-good UNCHANGED", exit 0. Independently verified the never-promote invariant:
registry version STILL 1.11.0+1.29.0 (NOT promoted), **known-good snapshot tar byte-identical**
(sha256 9ef62bdf… == pre-run baseline → snapshot never re-taken), canonical idle, volume retained.
- **WC4 `--quick` FAIL / ROLLBACK — PASS** (the data-safety proof). Staged a broken custom-html
commit (`image: nginx:99.99.99-doesnotexist`), ran `CCCI_QUICK=1 CCCI_SKIP_FETCH=1 REF=<broken>`:
broken upgrade `abra deploy … FATA deploy failed 🛑` → upgrade **fail** + custom **fail** (app down) →
`quick FAIL → rolling back … restored known-good data; canonical idle (NOT promoted)`, **exit 1**
(correctly RED). Independently verified the rollback restored the EXACT known-good: registry version
unchanged (1.11.0+1.29.0), snapshot byte-identical (9ef62bdf…), and on redeploy the known-good
marker `WC2-DATA-MARKER-7f3a9c` is back, app serves **200**, image is **nginx:1.29.0** (broken image
GONE); left idle. (This is also the WC9 `--quick` rollback-proof in miniature on custom-html.)
- **WC7 no-canonical fallback — PASS.** `RECIPE=custom-html-tiny MODE=quick` (no canonical) → logs
`MODE=quick requested but no canonical … falling back to COLD run` → runs COLD at a **cold per-run
domain** `cust-9834f5` (not `warm-`), install **pass**, deploy-count=1, exit 0; post-run no `cust-*`
service/volume (cold teardown sacred) and the **custom-html canonical untouched** (idle@1.11.0+1.29.0).
The PR is still tested; default `!testme` cold path unaffected.
Cleanup: staged broken commit reverted (recipe clone restored to 87a62a5, broken commit dangling);
custom-html canonical left idle@1.11.0+1.29.0 with snapshot intact. Generic-first invariant held in
`--quick`. No tests softened.
**Gate verdict: WC4 + WC7 — PASS @2026-05-29.** Builder may proceed to W3 (WC5/WC6 cold-advances +
nightly). **Still tracked-open before Phase-2w DONE:** traefik WC1.1 (W0.10) cold proof.
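
The WC7 trigger battery above amounts to a strict exact-match parser that tolerates only surrounding whitespace. A minimal sketch, assuming a hypothetical `parse_trigger` that returns the same `(triggered, quick)` pair the battery reports; the bridge's real function name and signature are not shown in this ledger.

```python
from typing import Tuple


def parse_trigger(comment: str) -> Tuple[bool, bool]:
    """Return (triggered, quick) for a PR comment body."""
    body = comment.strip()  # surrounding whitespace is tolerated
    if body == "!testme":
        return (True, False)   # default: cold run
    if body == "!testme --quick":
        return (True, True)    # opt-in quick run
    return (False, False)      # anything else is rejected
```

Comparing against the two exact strings, rather than splitting on whitespace or using a prefix match, is what rejects `!testmexyz`, trailing arguments, and the double-space variant.
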
## @2026-05-29 — traefik WC1.1 (W0.10a) — PASS → WC1.1 now FULLY closed (keycloak + traefik)
Gate e678d2e. The Builder delivered the migration + safe no-op converge and (correctly, to avoid an
all-TLS outage) left the destructive rollback as my cold proof. All cold from my own clone.
- **Units — PASS:** 65 passed (incl. traefik spec: stateful=False, callable setup, health_domain).
- **Migration + no-op converge — PASS:** `deploy-proxy.service` active now execs
`warm_reconcile.py traefik`; journal `RECONCILE RESULT: noop-healthy:5.1.1+v3.6.15`; system running,
0 failed; `ci.commoninternet.net=200` (routing+TLS) + `keycloak-through-traefik=200`; traefik
TYPE+last_good=5.1.1+v3.6.15. Wildcard cert / file-provider config preserved (HTTPS 200 on the
wildcard domain proves the pre-issued cert is served).
- **Destructive rollback — PASS (low-disruption variant):** staged a fake NEWER tag `5.2.0+v3.6.15`
with a lint-breaking env (a YAML mapping entry). Reconcile: auto-upgrade 5.1.1→5.2.0 → `abra deploy
… FATA failed lint checks (R009 environment.0 must be a string)` → `rolling back to 5.1.1+v3.6.15` →
`RECONCILE RESULT: rolled-back:5.2.0+v3.6.15->5.1.1+v3.6.15`, rollback alert
`{attempted:5.2.0, last_good:5.1.1, recovered:True}`. **Stateless path confirmed: NO snapshot, just
version redeploy of last_good.** Crucially, **TLS was NOT dropped**: `ci.commoninternet.net=200`
and `keycloak-through-traefik=200` throughout the window (the broken deploy was rejected at lint
before the running proxy was touched); last_good unchanged; recipe clone restored to HEAD, fake tag
cleaned; system running / 0 failed after.
- *Honest scope:* my broken tag failed at abra LINT (the deploy-FAILURE→rollback branch), exactly as
the keycloak proof did. The "deploys-clean-but-health-fails→rollback" branch is the SAME shared
`wait_healthy`-False code (stateless skips only snapshot/restore), unit-tested, not live-exercised
for either app — deliberately, since for traefik that path REQUIRES a real all-route TLS outage to
induce. I judge the shared+unit-covered code + the live deploy-failure rollback sufficient; flagged
so it's not a hidden gap.
**Gate verdict: traefik WC1.1 (W0.10a) — PASS @2026-05-29.** This **CLOSES the W0.10 tracked-open
item**: WC1.1 is now fully verified for BOTH reconcilers (keycloak stateful + traefik stateless).
**Phase-2w gates verified so far:** WC1, WC1.1 (full), WC1.2, WC2, WC3, WC4, WC7. **Remaining for
DONE:** WC5, WC6, WC8, WC9.
## @2026-05-29 — WC5 promote-on-green-cold — PASS (gate 125453d; cold-verified from own clone)
- **Units — PASS:** 70 passed (incl. test_promote).
- **Gate predicate — PASS (anti-poison logic).** `should_promote_canonical` =
`is_enrolled AND overall==0 AND not quick AND not ref` — promotes ONLY enrolled + GREEN + COLD +
LATEST(no PR head). A PR `!testme` (REF=PR-head) is excluded (`not ref`), `--quick` excluded
(`not quick`, also proven live in WC4 = byte-identical snapshot), red excluded (`overall==0`),
unenrolled excluded. `promote_canonical` replaces the known-good ONLY after green (never lost on
red). So a bad PR can never poison the canonical; only cold-on-latest (manual `RECIPE=` / nightly)
advances it.
- **Live advancement — PASS.** I forced the custom-html registry to an OLDER value
(`version=1.10.0+1.28.0, commit=advold`), then ran a full COLD run `RECIPE=custom-html` (no REF =
latest): install/upgrade/backup/restore/custom **all pass**, deploy-count=1, then `WC5
promote-on-green-cold: (re)seed canonical custom-html @ 1.11.0+1.29.0`. Independently verified after:
registry version **ADVANCED 1.10.0+1.28.0 → 1.11.0+1.29.0** (commit=head 8a02606, new ts), snapshot
meta re-seeded to 1.11.0+1.29.0, `has_canonical=True`, canonical idle + volume retained, and **no
`cust-*` per-run service left** (cold teardown sacred). (The promote reattaches the retained volume
→ re-snapshot is byte-identical content, expected.) The advancement also restored the canonical to
its correct version.
**Gate verdict: WC5 — PASS @2026-05-29.** Builder may proceed to W3's WC6 (nightly sweep).
**Phase-2w gates verified so far:** WC1, WC1.1 (full), WC1.2, WC2, WC3, WC4, WC5, WC7.
**Remaining for DONE:** WC6, WC8, WC9.
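
The promotion predicate quoted above (`is_enrolled AND overall==0 AND not quick AND not ref`) is small enough to restate as a sketch. The parameter names and types here are assumptions; only the boolean expression comes from the review.

```python
from typing import Optional


def should_promote_canonical(is_enrolled: bool, overall: int,
                             quick: bool, ref: Optional[str]) -> bool:
    """Promote only for an enrolled recipe, green, cold, on latest (no PR ref)."""
    return bool(is_enrolled and overall == 0 and not quick and not ref)
```

Each conjunct removes one poisoning route: an unenrolled recipe, a red run, a `--quick` run, or a run pinned to a PR head.
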
## @2026-05-29 — WC6 nightly full-cold sweep — PASS (gate 465e105; cold-verified)
- **Units — PASS:** 71 passed (incl. enrolled_recipes).
- **Declarative timer/service — PASS.** `nightly-sweep.timer` active; `OnCalendar=*-*-* 03:00:00`,
**Persistent=true** (catches up a missed nightly), RandomizedDelaySec=600, next Sat 03:05 UTC;
service = oneshot, 6h ceiling, after deploy-proxy/warm-keycloak/docker, packaged in the nix store
(D8-clean; runtimeInputs incl. util-linux for the backup PTY). Imported in
`nix/hosts/cc-ci/configuration.nix`.
- **Orchestration — PASS (code read from own clone).** `nightly_sweep.py`: in-flight guard
`_another_run_active()` pgreps `run_recipe_ci.py` (excl. self) → skips/defers if a run is active;
`roll_warm_infra()` runs the health-gated keycloak+traefik reconcilers (WC1.1); `sweep()` iterates
`enrolled_recipes()` SERIALLY, each a cold latest run (REF/QUICK/MODE stripped) whose own promote
hook refreshes the canonical (WC5); red recipes reported FAIL but non-fatal and DON'T promote.
- **Live sweep via the actual systemd SERVICE — PASS.** Forced custom-html canonical OLD
(1.10.0+1.28.0), `systemctl start nightly-sweep.service`. Journal: roll keycloak
`noop-healthy:10.7.1+26.6.2` rc=0 + traefik `noop-healthy:5.1.1+v3.6.15` rc=0 (health-gated);
`enrolled canonicals = ['custom-html']`; full-cold install/upgrade/backup/restore/custom **all
pass**; `WC5 promote: canonical custom-html advanced to known-good 1.11.0+1.29.0`; sweep summary
`custom-html: PASS`; service Finished. Independently verified after: registry **ADVANCED
1.10.0+1.28.0 → 1.11.0+1.29.0** (new ts), **no `cust-*` per-run leftover** (cold teardown sacred),
`ci.commoninternet.net=200` + `keycloak-through-traefik=200` (infra healthy post-roll), system
running / 0 failed.
**Gate verdict: WC6 — PASS @2026-05-29.** Builder may proceed to W4 (WC8/WC9).
**Phase-2w gates verified so far:** WC1, WC1.1 (full), WC1.2, WC2, WC3, WC4, WC5, WC6, WC7.
**Remaining for DONE:** WC8, WC9 (incl. the full `--quick` rollback proof + docs).
## @2026-05-29 — WC8 + WC9 (FINAL gates) — PASS (gate 40b03a9; cold-verified)
- **Units — PASS:** 72 passed (incl. test_canonical prune_stale).
- **WC8 serialize — PASS:** `DRONE_RUNNER_CAPACITY = maxTests = "1"` (MAX_TESTS cap); nightly sweep
serial + `_another_run_active()` in-flight skip (verified in WC6); one app at a time.
- **WC8 disk/prune — PASS:** swarm `autoPrune.flags = ["--all" "--filter" "until=24h"]` — **no
`--volumes`** (data-warm volumes + snapshots survive docker prune; the module comments why
`--volumes` would destroy the known-good). `canonical.prune_stale()` is SAFE: drops a
`/var/lib/ci-warm/<r>/` only if it's a dir AND not enrolled AND has a `canonical.json` — so it
spares enrolled canonicals, the keycloak/traefik reconciler dirs (last_good, no canonical.json),
and `alerts/`. Ran it LIVE: `pruned: []` (no-op) and all four dirs (alerts, custom-html, keycloak,
traefik) intact after. Disk `/` = 50% (14G free); warm total **318M** (bounded). Run nightly + df logged.
- **WC8 cold teardown sacred — PASS:** no `<recipe>-<6hex>` per-run leftovers after any of my
W2/WC4/WC5/WC6 runs (independently confirmed each time).
- **WC8 excluded from D8 — PASS:** `grep -rn ci-warm nix/` → only a COMMENT; no Nix source declares
`/var/lib/ci-warm` as a store/source path → runtime cache, re-seeded by cold runs, not on the closure.
- **WC9 docs — PASS:** `docs/warm.md` (116 lines) covers the three states, the health-gated
reconcilers + WC1.2 safety gate + alerts, data-warm canonicals + snapshots + enroll, `--quick`,
promote-on-green-cold, the nightly sweep, resource safety, an explicit "## The `--quick` rollback
proof (WC9)" section, and "## Operate / debug".
- **WC9 `--quick` rollback proof — PASS (already cold-verified in WC4, @REVIEW 31f0e42):** I
deliberately failed a PR under `--quick` (broken image) → the canonical's last-known-good was
restored INTACT (marker `WC2-DATA-MARKER-7f3a9c` back, app healthy on nginx:1.29.0, broken image
gone, registry+snapshot unchanged), exit RED; and a `--quick` PASS left the snapshot byte-identical
(did NOT move the known-good). No tests softened anywhere in the phase.
**Gate verdict: WC8 + WC9 — PASS @2026-05-29.**
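
The `prune_stale` safety rule checked under WC8 (drop a directory only if it is a directory, is not enrolled, and contains a `canonical.json`) can be sketched against a temporary directory. The signature, the `dry_run` flag, and the `old-recipe` entry are hypothetical; the real helper lives in `canonical` and may differ.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List, Set


def prune_stale(root: Path, enrolled: Set[str], dry_run: bool = False) -> List[str]:
    """Remove stale canonical dirs under root; return the names pruned."""
    pruned = []
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            continue  # plain files are never touched
        if entry.name in enrolled:
            continue  # enrolled canonicals are kept
        if not (entry / "canonical.json").is_file():
            continue  # reconciler dirs and alerts/ have no canonical.json
        pruned.append(entry.name)
        if not dry_run:
            shutil.rmtree(entry)
    return pruned


# Demo layout loosely modelled on the four dirs named above, plus one stale entry.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    layout = [("custom-html", "canonical.json"), ("old-recipe", "canonical.json"),
              ("keycloak", "last_good"), ("traefik", "last_good")]
    for name, marker in layout:
        (root / name).mkdir()
        (root / name / marker).write_text("{}")
    (root / "alerts").mkdir()
    pruned = prune_stale(root, enrolled={"custom-html"})
    survivors = sorted(p.name for p in root.iterdir())
```

Requiring `canonical.json` as a positive marker means an unknown directory is spared by default rather than deleted by default.
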
### ✅ ALL Phase-2w gates Adversary cold-verified — NO VETO — DONE authorized
WC1, **WC1.1 (full: keycloak stateful + traefik stateless)**, WC1.2, WC2, WC3, WC4, WC5, WC6, WC7,
WC8, WC9 — every one has a fresh PASS in this REVIEW-2w, each re-run COLD from my own clone
(`cc-ci:/root/cc-ci-adv-verify`). No open `[adversary]` findings; no `## VETO`. The W0.10 traefik
tracked-open item is CLOSED. System healthy (running, 0 failed), infra serving (ci+keycloak 200),
custom-html canonical idle@1.11.0+1.29.0, recipe clones restored, disk 50%. **The Builder is cleared
to write `## DONE` to STATUS-2w.md** per §6.1.