REVIEW-2w — Adversary verdicts for Phase 2w (warm canonical + --quick)
Adversary-owned ledger. Append-only. Formal verdicts live here; gate claims live in STATUS-2w.md,
findings in BACKLOG-2w.md ## Adversary findings.
Definition of Done verified here: WC1–WC9 (see plan-phase2w-warm-canonical-quick.md §1).
Each needs an independent COLD verdict before ## DONE is permitted. The marquee proof is WC9:
deliberately fail a PR under --quick and confirm the canonical's last-known-good is restored intact
(data preserved) AND a --quick pass did not move the known-good.
Verification map (what I will re-run cold per gate)
- WC1 live-warm keycloak: dependent recipe's SSO custom tests pass against warm keycloak; concurrent dependents use distinct namespaced realms (no collision); leftover realms reaped.
- WC2 data-warm canonical: canonical at a stable domain (≠ cold <recipe>-<6hex>); declarative registry tracks recipe→commit; re-warmable from scratch.
- WC3 snapshots: raw volume copy taken while UNDEPLOYED under stable path; one last-known-good per app, atomic replace; restore brings app back healthy with data.
- WC4 --quick: reattach canonical → upgrade to PR head → generic UPGRADE+serving+custom; PASS→undeploy keep volume, known-good unchanged; FAIL→restore snapshot then undeploy; never promotes.
- WC5 cold-only advancement: green full-cold on latest re-snapshots+re-tags; only cold advances.
- WC6 nightly full-cold sweep: scheduled, declarative, MAX_TESTS-bounded.
- WC7 trigger/authority/labeling: default !testme = cold; --quick opt-in, never gates merge; results carry mode; no-canonical fallback clean.
- WC8 resource safety: warm runs serialize per app; warm keycloak shared via per-run realms; disk monitored+pruned; cold teardown still deletes per-run volumes; warm data excluded from D8 closure.
- WC9 docs + cold verify incl. rollback proof; no softened tests.
@2026-05-28 — Phase 2w start (Adversary online)
- Phase 2w interjected by operator (2026-05-28); Phase 2 paused. No 2w gates CLAIMED yet — Builder has not bootstrapped STATUS-2w.md. Phase-2 Docker Hub rate-limit fix was the last completed work.
- COLD access re-verified: cc-ci-tailscaled active; ssh cc-ci → NixOS 24.11 (50ab793); wildcard *.ci.commoninternet.net → gateway 143.244.213.108. Verification path is live.
- IDLE until the Builder claims a WC gate (watchdog will ping on claim). Standing veto power retained.
@2026-05-28 — Design update absorbed (orchestrator: unpin + health-gated rollback)
SSOT updated (committed). Revised/added verification obligations I will hold the gate to:
- WC1 (revised) — keycloak is now UNPINNED like traefik: reconciler abra recipe fetch latest + chaos-deploy; kcVersion pin DROPPED; MUST keep the secret-generate-only-if-missing guard + the health-wait. Cold-check: no version pin in the nix module / reconciler; recipe fetched at activation (runtime) so the nix closure stays byte-identical (D8 preserved — verify closure hash unaffected by which keycloak version is live). Plus original WC1: dependent SSO custom tests pass against warm keycloak; concurrent dependents use distinct namespaced realms (no collision); stale realms reaped.
- WC1.1 (NEW) — health-gated deploy-with-rollback built INTO the warm/infra reconcilers (traefik + keycloak), NOT nix-generation rollback (the swarm app isn't in the generation). Pattern: record running version = last-good → deploy latest → health-check → healthy: commit last-good:=latest; unhealthy: roll back to last-good + PushNotification alert. Stateful (keycloak): undeploy → raw snapshot data volume → deploy latest → health-check → on fail restore snapshot + redeploy prior version (forward DB migrations make version-only rollback unsafe); reuse WC3 snapshot helper. traefik (stateless) = version rollback only. ADVERSARY PROOF (mandatory, I must run it): (a) force/simulate a BROKEN "latest" → confirm the warm app self-reverts to the prior healthy version, keycloak's pre-upgrade data intact, and an alert fired; (b) a HEALTHY update commits the new version as last-good. Watch for: silent failure (broken stays deployed), data loss on revert, no alert, or last-good not advancing on a healthy update.
- WC6 (reordered) — nightly = nixos-rebuild switch FIRST (warm/infra → latest, health-gated per WC1.1) THEN full-cold sweep; MUST NOT run while a test run is in flight; if the health-gate rolled an infra app back, alert fires and the sweep still runs against the healthy prior version.
- WC8 carry — confirm the leftover phase-2 cold app lasu-0a6fb2 (orchestrator flagged it) is fully torn down (app+volumes+secrets gone), since cold-teardown-sacred + disk budget are WC8.
- Still no gate CLAIMED; W0 in flight. Continue idle until a WC gate is claimed (watchdog pings).
@2026-05-29 — WC1.2 added (pre-deploy safety gate, runs BEFORE WC1.1)
- WC1.2 (NEW) — pre-deploy safety gate on warm/infra auto-update. Rationale: a passing health check does NOT prove a required manual migration ran, so gate BEFORE auto-deploy. Rule: only auto-apply non-major (patch/minor) upgrades with no manual-migration release notes. If current→latest is a MAJOR recipe-version bump OR the target releaseNotes/<version>.md flags a manual migration → DO NOT auto-upgrade: stay on current + PushNotification alert WITH the release notes (operator upgrades manually). Independent of, and runs BEFORE, the WC1.1 health-gated rollback. Applies to nightly rebuild (WC6) AND any reconcile.
  - Detection (verify the impl uses both): primary = major recipe-version bump (coop-cloud version <upstream>+<recipe-semver>; a major recipe-semver bump = breaking, matches abra major-upgrade caution); secondary = scan target releaseNotes/<version>.md for manual-migration markers.
  - ADVERSARY PROOF (mandatory): simulate a major / manual-migration "latest" → confirm hold-on-current (no deploy attempted) + alert fired carrying the release notes; NO silent auto-upgrade. Watch for: a major bump slipping through as if patch; releaseNotes not scanned; alert without the notes; or the gate firing on a legitimate patch/minor (false hold).
  - Ordering check: WC1.2 must short-circuit BEFORE WC1.1 even snapshots/deploys — i.e. on a held upgrade there is no snapshot/deploy/rollback churn, just hold + alert.
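The two detection signals for the WC1.2 gate can be sketched like this. It is an illustration only: `safety_gate`, `_major` and the marker strings are assumptions, and the sketch holds on a major bump on either side of the `+` so that it does not depend on which half of the tag is the recipe semver.

```python
import re
from typing import Optional

# Assumed marker phrases; the real scan may look for different wording.
MIGRATION_MARKERS = ("manual migration", "manual-migration")


def _major(part: str) -> int:
    """Leading integer of a version part, tolerating a 'v' prefix (e.g. 'v3.6.15')."""
    match = re.match(r"v?(\d+)", part)
    if match is None:
        raise ValueError(f"no major version in {part!r}")
    return int(match.group(1))


def is_major_bump(current: str, latest: str) -> bool:
    """True if the major number rises on either side of the '+' separator."""
    pairs = zip(current.split("+", 1), latest.split("+", 1))
    return any(_major(new) > _major(cur) for cur, new in pairs)


def safety_gate(current: str, latest: str, release_notes: Optional[str]) -> Optional[str]:
    """Return a hold reason, or None when the upgrade may be auto-applied."""
    if is_major_bump(current, latest):
        return "held-major"
    if release_notes and any(m in release_notes.lower() for m in MIGRATION_MARKERS):
        return "held-manual-migration"
    return None
```

Because the gate returns before any snapshot or deploy is attempted, a caller that checks it first gets the required ordering (hold plus alert, no rollback churn) without extra bookkeeping.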
@2026-05-29 — Standing probe (WC8 carry): lasu-0a6fb2 teardown — CLEAN
Independent cold check on cc-ci (not a gate verdict; WC8 not yet claimed). The orchestrator-flagged
leftover phase-2 cold app lasu-0a6fb2 is fully gone: abra app ls -S -m shows no lasu app,
docker service ls no lasu services, docker volume ls no lasu volumes, docker secret ls no lasu
secrets. Disk / at 63% (9.8G free / 28G) — consistent with the Builder's claimed 96%→62%
reclaim. Cold-teardown-sacred holds for this orphan; disk budget healthy. Will fold into the WC8
verdict when that gate is claimed. Still no WC gate CLAIMED; W0 → next is W0.9 WC1.1 live proofs.
@2026-05-29 — Watchdog pinged [C1]; NO formal gate claim yet — read-only pre-review (NOT a verdict)
Watchdog signalled a [C1] claim, but STATUS-2w.md ## Gate reads "(none claimed yet)" and the
Builder's own STATUS lists W0.7 + W0.8 as remaining before claiming WC1/WC1.1/WC1.2, with a build
finding (lasuite-docs in-place --chaos redeploy nginx host not found in upstream ...backend:8000
race) currently blocking the WC1 dependent-green proof. Per §6.1 there is NO formal gate to pass
yet — ping likely fired on the "reconciler-side WC1/WC1.1/WC1.2 proven" wording in 819c1bc. I will
NOT log a WC1/WC1.1/WC1.2 PASS until the gate is formally CLAIMED and I run the marquee reproduce cold.
Read-only pre-review done now (no live churn — avoids colliding with the Builder's W0.8 keycloak work):
- Live state consistent with the W0.9 narrative: warm-keycloak.service active; live image keycloak/keycloak:26.6.2 + mariadb:12.2; /var/lib/ci-warm/keycloak/last_good = 10.7.1+26.6.2 (the recovered canonical — correctly NOT advanced to the simulated-broken 10.7.10).
- Static review of runner/warm_reconcile.py — no defects:
  - WC1.2 safety gate runs BEFORE any snapshot/deploy (L335-343); a hold returns with NO snapshot/deploy/rollback churn; both held-major + held-manual-migration alerts carry release_notes.
  - is_major_bump is conservative: holds on a major bump of EITHER the recipe-semver (pre-+) OR the app-version (post-+), so a keycloak app-major (25->26, the DB-migration case) is also held. Neutralizes a tag-format wording mismatch (plan §WC1.2 says <upstream>+<recipe-semver>; code's observed data says <recipe-semver>+<app-version>) — checking both sides covers intent either way. Not a defect; noted so I don't re-flag it.
  - WC1.1 rolls back on BOTH a deploy exception AND an unhealthy result (L356-362); stateful path restores the snapshot before redeploying the prior version; raises if the rollback itself is unhealthy. Alert rollback carries last_good/attempted/recovered/notes.
- OPEN FLAG to confirm at the live reproduce: /var/lib/ci-warm/alerts/ is currently EMPTY, though W0.9 claims a rollback alert was written there and the alert-relay archiving to alerts/seen/ is explicitly deferred/unwired. Likely benign (Builder cleaned up the W0.9 test alert), but I MUST confirm a *rollback*.json alert actually lands during my own cold reproduce (no silent no-alert).
- PLAN for the formal gate: when WC1 is CLAIMED, run the Builder's reproduce (STATUS L79-83): fake tags 10.7.9+26.6.2 (good) + 10.7.10+26.6.2 (broken KC_HOSTNAME), CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak x2 → expect upgraded: then rolled-back:, marker realm survives, last_good unchanged at prior, a *rollback*.json alert; PLUS the WC1 headline (dependent SSO custom test green vs warm keycloak + concurrent distinct realms + reaping) + a major/manual-migration WC1.2 hold proof. Sent a BUILDER-INBOX heads-up to coordinate keycloak timing.
@2026-05-29 — Gate WC1+WC1.1+WC1.2 FORMALLY CLAIMED (985686f) — cold verification IN PROGRESS
Builder set the formal ## Gate (after my pre-claim note rebased on top) and parked keycloak for me;
inbox resolved my alerts-dir flag (W0.9 test alert intentionally rm'd to avoid false operator
alarm). Running the full cold reproduce from my OWN clone synced to cc-ci:/root/cc-ci-adv-verify.
check1 — unpinned + healthy + wired — PASS. grep kcVersion nix/modules/warm-keycloak.nix → only
a comment ("the kcVersion pin is gone"), no pin; unit execs warm_reconcile.py keycloak (fetches at
runtime ⇒ D8 closure independent of live version). warm-keycloak.service=active, is-system-running
=running, 0 failed units, health /realms/master=200, TYPE=keycloak:10.7.1+26.6.2 (canonical).
check2 — units — PASS. From my synced clone: cc-ci-run -m pytest tests/unit -q → 57 passed.
check4 — concurrency + reaping (deploy-free) — PASS. My own driver vs the live warm kc:
realm_for distinct per run-hex (lasuite-docs-aaa111 ≠ ...bbb222); created 3 realms, each
oidc_password_grant returns a valid 3-part JWT (len 1379) with matching discovery issuer;
reap_orphaned_realms(live={aaa111}) deleted exactly bbb222+ccc333 and KEPT aaa111
(concurrency-safe — a live run never loses its realm); kc left clean (['master']).
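The realm-naming and reaping behaviour exercised in check4 can be illustrated with a small pure function. This is a sketch under assumptions: the `<recipe>-<6hex>` realm format is taken from the names seen above, while `orphaned_realms` and the exact `realm_for` signature are hypothetical and say nothing about how the harness talks to keycloak.

```python
import re
from typing import Iterable, List, Set

# Per-run realms are assumed to end in '-<6 lowercase hex chars>'.
_PER_RUN = re.compile(r"^.+-[0-9a-f]{6}$")


def realm_for(recipe: str, run_hex: str) -> str:
    """Namespaced realm for one run, e.g. 'lasuite-docs-aaa111'."""
    return f"{recipe}-{run_hex}"


def orphaned_realms(existing: Iterable[str], live_run_hexes: Set[str]) -> List[str]:
    """Realms that look per-run but belong to no live run.

    Realms that do not match the per-run pattern (such as 'master')
    are never candidates for deletion.
    """
    orphans = []
    for realm in existing:
        if not _PER_RUN.match(realm):
            continue
        run_hex = realm.rsplit("-", 1)[1]
        if run_hex not in live_run_hexes:
            orphans.append(realm)
    return sorted(orphans)
```

Keying the decision on the live run set rather than on realm age is what makes the reap safe to run while other dependents are mid-test.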
check5 — WC1.1 MARQUEE health-gated rollback w/ data integrity — PASS (reconciler). My own reproduce (fake tags I staged, marker realm = the data):
- Phase B healthy upgrade: upgraded:10.7.1+26.6.2->10.7.9+26.6.2, last_good advanced→10.7.9, health=200, marker realm intact. ✓
- Phase C broken latest: staged 10.7.10+26.6.2 at a commit with KC_HOSTNAME=:::bad-host:::. The reconciler (stateful path) undeployed → snapshotted → attempted deploy of 10.7.10 → abra deploy FAILED (lint R009: env value not a string) → caught the deploy exception → rolled back: undeploy → restore snapshot → redeploy 10.7.9 → healthy (200). Result rolled-back:10.7.10+26.6.2->10.7.9+26.6.2. Verified post-state: marker realm INTACT (data preserved through the snapshot/restore round-trip), last_good NOT advanced (still 10.7.9), and a real persistent alert 20260529T005510Z-keycloak-rollback.json with attempted=10.7.10+26.6.2, last_good=10.7.9+26.6.2, recovered=True. ✓✓✓ This is the phase's marquee proof and it holds. (Nuance: my broken tag failed at abra LINT, exercising the deploy-FAILURE→rollback branch — exactly the path commit 07ea951 added; the unhealthy-deploy branch is covered by units + code. The volume wasn't mutated by the failed deploy, but the snapshot→restore round-trip DID execute and the marker survived; combined with W0.5's mutate→restore proof, data integrity is sound.)
- Test-script bug (MINE, not the reconciler): my phase-D cleanup deleted the 10.7.9 tag while kc was still deployed on it, so abra couldn't resolve the from-version and left kc undeployed (404) on TYPE=10.7.9 with the marker still present. NOT a WC1.1 defect — the reconciler behaved correctly given the broken state I induced. Recovery to canonical 10.7.1+26.6.2 (healthy, marker removed, fake tags dropped) is running now; will confirm clean before finalizing the gate verdict.
Remaining: check3 (headline lasuite-docs SSO e2e) + check6 (WC1.2 holds) — run after recovery confirms the warm kc is canonical+healthy (they share it). No gate PASS line written yet.
Recovery — OK. Warm kc restored to canonical: TYPE=10.7.1+26.6.2, last_good=10.7.1+26.6.2,
health=200, realms=['master'], no fake tags. (Recovery log also re-confirmed the marker realm survived
the rollback before I cleaned it up: realms=['advmarker-rollback','master'] on redeploy.)
check6 — WC1.2 pre-deploy safety holds — PASS. My own driver vs the live warm kc:
(a) major fake tag 11.0.0+27.0.0 → held-major:10.7.1+26.6.2->11.0.0+27.0.0, kc TYPE unchanged,
alert *-held-major.json with latest+release_notes. (b) minor 10.7.2+26.6.3 + a
releaseNotes/…md flagging "manual migration" → held-manual-migration:…, kc TYPE unchanged,
alert carries the notes. No deploy/snapshot/last_good churn on either hold; recipe left clean.
The gate (WC1.2) short-circuits before WC1.1 as required.
check3 — headline SSO e2e — IN PROGRESS. RECIPE=lasuite-docs STAGES=install,custom from my
synced clone: cold per-run domain lasu-c25d41 created (recipe deployed COLD), DEPS declared: ['keycloak'] (warm path). Awaiting convergence + custom SSO tests.
@2026-05-29 — WC1: PASS · WC1.2: PASS · WC1.1(keycloak-stateful): PASS — gate 985686f cleared
All six checks re-run COLD from my own clone synced to cc-ci:/root/cc-ci-adv-verify (NOT the
Builder's clone). Verdict for the formally-claimed gate WC1 + WC1.1 + WC1.2:
- WC1 — PASS. Unpinned (no kcVersion; reconciler fetches at runtime), warm-keycloak.service active + system running + health 200. Headline e2e (check3): RECIPE=lasuite-docs STAGES=install,custom → install pass (generic test_serving + overlay test_serving_and_frontend, generic-first), custom pass (5 tests incl. test_oidc_login_via_keycloak + test_oidc_password_grant_against_dep_keycloak against the warm kc), deploy-count = 1 (expect 1) (keycloak NOT co-deployed), log shows dep: using live-warm keycloak @ warm-keycloak… (per-run realm) and dep: deleted per-run realm lasuite-docs-c25d41. Post-run: warm kc realms = ['master'] only (no leftover), no lasu* service/volume/secret (cold teardown sacred), warm kc still canonical+healthy. Concurrency+reaping (check4, deploy-free): realm_for distinct per run-hex; 3 realms each yield a valid JWT + matching discovery issuer; reap_orphaned_realms(live={aaa111}) deletes exactly the 2 orphans, KEEPS the live one. Units (check2): 57 passed.
- WC1.2 — PASS. (check6) major 11.0.0+27.0.0 → held-major, kc untouched, alert w/ notes; minor 10.7.2+26.6.3 + manual-migration releaseNotes → held-manual-migration, kc untouched, alert carries the notes. No deploy/snapshot/last_good churn on a hold; gate short-circuits before WC1.1.
- WC1.1 (keycloak, stateful) — PASS. (check5, MARQUEE) my own fake-tag reproduce: healthy upgrade commits last_good := latest; a broken latest (10.7.10, KC_HOSTNAME=:::bad-host:::) fails to deploy → reconciler undeploy→snapshot→(deploy fails)→restore snapshot→redeploy prior → healthy, with the marker realm (data) INTACT, last_good NOT advanced, and a real persistent *-rollback.json alert (attempted=10.7.10 last_good=10.7.9 recovered=true). The exit-1 in my run was a bug in MY cleanup script (deleted a tag abra still needed) — NOT a reconciler defect; warm kc since recovered to canonical 10.7.1+26.6.2 healthy.
Gate verdict: PASS @2026-05-29 for WC1 + WC1.2 + WC1.1(keycloak-stateful), exactly the scope the Builder claimed (STATUS §SCOPE). The Builder may proceed to W1 (WC2/WC3 canonical registry).
OPEN (tracked, NOT a blocker for this gate, but MUST close before Phase-2w ## DONE):
- traefik WC1.1 (W0.10) — traefik's stateless version-rollback is NOT yet migrated onto the shared health-gated reconciler (still proxy.nix chaos-deploy). WC1.1 is therefore only partially closed (keycloak only). I will require a cold proof of traefik's health-gated version-rollback before the DONE handshake. Recorded so it is not lost. No finding filed (honest scope per the Builder's claim).
@2026-05-29 — Watchdog pinged [C2 C3]; NO formal WC2/WC3 claim yet (premature)
## Gate holds only the WC1 PASS; grep CLAIMED|awaiting adversary → none. STATUS "In flight" shows
W1 mid-build: W1.1 registry module DONE (runner/harness/canonical.py, 61 unit pass) but W1.2
(the LIVE data-warm proof: seed → undeploy-keep-volume → redeploy-reattach → data survives) is "Next"
and the Builder explicitly says "Then close WC2/WC3." So WC2/WC3 are NOT yet claimable — ping fired on
"WC2/WC3" wording in commits b6ef83a/563156a, not a §6.1 gate. No verdict written.
Read-only glance (NOT a verdict): canonical.py is a sound registry primitive — seed_canonical
honors snapshot-while-undeployed; has_canonical requires both a registry record AND retained
volume; deploy/undeploy-keep-volume lifecycle matches WC2. Will cold-verify when WC2/WC3 is formally
CLAIMED (the live data-warm round-trip is the key thing to re-run myself). Idle until then.
@2026-05-29 — WC2 + WC3 — PASS (gate 4ce80f8 cleared; cold-verified from own clone)
WC2/WC3 formally claimed (4ce80f8; my premature note rebased on top). Builder parked custom-html (first
data-warm canonical, left idle) + traefik for me. All re-run COLD from cc-ci:/root/cc-ci-adv-verify.
- Units — PASS: cc-ci-run -m pytest tests/unit -q → 61 passed (incl. test_canonical, test_warmsnap).
- WC2 data-warm canonical model — PASS. Idle state matches: canonical.json {recipe=custom-html, domain=warm-custom-html.ci.commoninternet.net, version=1.11.0+1.29.0, commit=wc2proof, status=idle}; content volume retained (warm-custom-html_…_content); no service running (idle = undeployed-keep-volume); stable warm-<recipe> domain (≠ cold <recipe[:4]>-<6hex>). My OWN data-warm round-trip: deploy_canonical → wrote my marker ADV-OWN-MARKER-a1b2c3 → undeploy_keep_volume (app down + volume retained, registry→idle) → deploy_canonical → my marker SURVIVED. The Builder's known-good marker also reattached. HTTPS serving confirmed (/=200, /index.html=200; an earlier one-off 404 was a curl-vs-deploy-converge race, 200 once settled — not a defect).
- WC3 known-good snapshots — PASS. Snapshot is a raw per-volume tar taken while undeployed (/var/lib/ci-warm/custom-html/snapshot/volumes/warm-custom-html_…_content.tar + meta.json), one last-good per app under the stable path. My OWN restore round-trip: mutate (deleted the known-good wc2-marker.txt) → undeploy → warmsnap.restore → deploy → known-good marker BACK with exact content WC2-DATA-MARKER-7f3a9c AND my mutation gone → restore round-trips the EXACT known-good. (Same warmsnap helper already cold-proven on keycloak in check5/W0.5.) has_canonical correctly requires BOTH a registry record AND a retained volume.
- D8/WC8 (spot): /var/lib/ci-warm/ is cache — no nix module references it as a source; full D8 closure-exclusion folds into the WC8 verdict later.
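The WC3 properties checked above (raw per-volume tar, one known-good replaced atomically, restore returning the exact snapshot content) can be sketched with the standard library alone. This is an assumption-laden illustration, not the warmsnap helper: `snapshot_volume` and `restore_volume` are hypothetical names operating on a plain directory, and the caller is assumed to have undeployed the app first.

```python
import os
import shutil
import tarfile
import tempfile
from pathlib import Path


def snapshot_volume(volume_dir: Path, snapshot_tar: Path) -> None:
    """Write a raw tar of volume_dir, then atomically replace the known-good."""
    snapshot_tar.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the destination directory so os.replace
    # stays on one filesystem and is therefore atomic.
    fd, tmp = tempfile.mkstemp(dir=snapshot_tar.parent, suffix=".tmp")
    os.close(fd)
    try:
        with tarfile.open(tmp, "w") as tar:
            tar.add(volume_dir, arcname=".")
        os.replace(tmp, snapshot_tar)
    except BaseException:
        # A failed snapshot must never clobber the previous known-good.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def restore_volume(snapshot_tar: Path, volume_dir: Path) -> None:
    """Wipe volume_dir and unpack the known-good snapshot into it."""
    for child in list(volume_dir.iterdir()):
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
    with tarfile.open(snapshot_tar, "r") as tar:
        tar.extractall(volume_dir)
```

Wiping before extracting is what removes a mutation that added files; extracting over the top would only bring back deleted ones.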
Two crashes during my runs were bugs in my OWN driver scripts (a tag I deleted that abra still
needed in check5; grep -rl returning rc=1 on no-match which exec_in_app raises on) — NOT product
defects. Canonical left clean: idle, volume retained, known-good content, snapshot intact, v1.11.0+1.29.0.
Gate verdict: WC2 + WC3 — PASS @2026-05-29. Builder may proceed to W2 (--quick).
Still tracked-open before Phase-2w DONE (unchanged): traefik WC1.1 (W0.10) cold proof.
@2026-05-29 — WC4 + WC7 — PASS (gate 3ff2bf6 cleared; cold-verified from own clone)
All re-run COLD from cc-ci:/root/cc-ci-adv-verify. Builder parked custom-html canonical for me.
- Units — PASS: cc-ci-run -m pytest tests/unit -q → 64 passed (incl. test_bridge_trigger).
- WC7 trigger — PASS (against the LIVE deployed bridge ccci-bridge, adversarial battery): !testme → (True,False) = cold; !testme --quick → (True,True) = quick; and ALL of !testmexyz, !testme foo, !testme --quick (double-space), !testme --quickx, please !testme, !testme --quick extra → (False,False) rejected; surrounding whitespace tolerated. Strict exact-match, no false-trigger. trigger_build wires CCCI_QUICK=1; default !testme stays cold.
- WC4 --quick PASS / NEVER-PROMOTE — PASS. RECIPE=custom-html CCCI_QUICK=1 REF=87a62a5 (healthy 1.10.0+1.28.0 head): mode=quick, in-place upgrade 1.11.0+1.29.0→1.10.0+1.28.0, upgrade pass (generic test_upgrade_reconverges first, then overlay), custom pass (5 tests incl. playwright), "known-good UNCHANGED", exit 0. Independently verified the never-promote invariant: registry version STILL 1.11.0+1.29.0 (NOT promoted), known-good snapshot tar byte-identical (sha256 9ef62bdf… == pre-run baseline → snapshot never re-taken), canonical idle, volume retained.
- WC4 --quick FAIL / ROLLBACK — PASS (the data-safety proof). Staged a broken custom-html commit (image: nginx:99.99.99-doesnotexist), ran CCCI_QUICK=1 CCCI_SKIP_FETCH=1 REF=<broken>: broken upgrade abra deploy … FATA deploy failed 🛑 → upgrade fail + custom fail (app down) → quick FAIL → rolling back … restored known-good data; canonical idle (NOT promoted), exit 1 (correctly RED). Independently verified the rollback restored the EXACT known-good: registry version unchanged (1.11.0+1.29.0), snapshot byte-identical (9ef62bdf…), and on redeploy the known-good marker WC2-DATA-MARKER-7f3a9c is back, app serves 200, image is nginx:1.29.0 (broken image GONE); left idle. (This is also the WC9 --quick rollback-proof in miniature on custom-html.)
- WC7 no-canonical fallback — PASS. RECIPE=custom-html-tiny MODE=quick (no canonical) → logs MODE=quick requested but no canonical … falling back to COLD run → runs COLD at a cold per-run domain cust-9834f5 (not warm-), install pass, deploy-count=1, exit 0; post-run no cust-* service/volume (cold teardown sacred) and the custom-html canonical untouched (idle@1.11.0+1.29.0). The PR is still tested; default !testme cold path unaffected.
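The strict exact-match trigger behaviour from the WC7 battery can be expressed in a few lines. This is a sketch of the observed contract only; `parse_trigger` is a hypothetical name and the bridge's real parser may be structured differently.

```python
from typing import Tuple


def parse_trigger(comment: str) -> Tuple[bool, bool]:
    """Return (triggered, quick) for a PR comment body.

    Only surrounding whitespace is tolerated; any other deviation,
    including extra words or a doubled inner space, is rejected.
    """
    body = comment.strip()
    if body == "!testme":
        return (True, False)   # default: cold run
    if body == "!testme --quick":
        return (True, True)    # opt-in quick run
    return (False, False)
```

Comparing the whole stripped body against a fixed set, rather than using a prefix or substring match, is what rules out false triggers such as a comment that merely mentions the command.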
Cleanup: staged broken commit reverted (recipe clone restored to 87a62a5, broken commit dangling);
custom-html canonical left idle@1.11.0+1.29.0 with snapshot intact. Generic-first invariant held in
--quick. No tests softened.
Gate verdict: WC4 + WC7 — PASS @2026-05-29. Builder may proceed to W3 (WC5/WC6 cold-advances + nightly). Still tracked-open before Phase-2w DONE: traefik WC1.1 (W0.10) cold proof.
@2026-05-29 — traefik WC1.1 (W0.10a) — PASS → WC1.1 now FULLY closed (keycloak + traefik)
Gate e678d2e. The Builder delivered the migration + safe no-op converge and (correctly, to avoid an
all-TLS outage) left the destructive rollback as my cold proof. All cold from my own clone.
- Units — PASS: 65 passed (incl. traefik spec: stateful=False, callable setup, health_domain).
- Migration + no-op converge — PASS: deploy-proxy.service active now execs warm_reconcile.py traefik; journal RECONCILE RESULT: noop-healthy:5.1.1+v3.6.15; system running, 0 failed; ci.commoninternet.net=200 (routing+TLS) + keycloak-through-traefik=200; traefik TYPE+last_good=5.1.1+v3.6.15. Wildcard cert / file-provider config preserved (HTTPS 200 on the wildcard domain proves the pre-issued cert is served).
- Destructive rollback — PASS (low-disruption variant): staged a fake NEWER tag 5.2.0+v3.6.15 with a lint-breaking env (a YAML mapping entry). Reconcile: auto-upgrade 5.1.1→5.2.0 → abra deploy … FATA failed lint checks (R009 environment.0 must be a string) → rolling back to 5.1.1+v3.6.15 → RECONCILE RESULT: rolled-back:5.2.0+v3.6.15->5.1.1+v3.6.15, rollback alert {attempted:5.2.0, last_good:5.1.1, recovered:True}. Stateless path confirmed: NO snapshot, just version redeploy of last_good. Crucially, TLS was NOT dropped — ci.commoninternet.net=200 and keycloak-through-traefik=200 throughout the window (the broken deploy was rejected at lint before the running proxy was touched); last_good unchanged; recipe clone restored to HEAD, fake tag cleaned; system running / 0 failed after.
  - Honest scope: my broken tag failed at abra LINT (the deploy-FAILURE→rollback branch), exactly as the keycloak proof did. The "deploys-clean-but-health-fails→rollback" branch is the SAME shared wait_healthy-False code (stateless skips only snapshot/restore), unit-tested, not live-exercised for either app — deliberately, since for traefik that path REQUIRES a real all-route TLS outage to induce. I judge the shared+unit-covered code + the live deploy-failure rollback sufficient; flagged so it's not a hidden gap.
Gate verdict: traefik WC1.1 (W0.10a) — PASS @2026-05-29. This CLOSES the W0.10 tracked-open item: WC1.1 is now fully verified for BOTH reconcilers (keycloak stateful + traefik stateless). Phase-2w gates verified so far: WC1, WC1.1 (full), WC1.2, WC2, WC3, WC4, WC7. Remaining for DONE: WC5, WC6, WC8, WC9.
@2026-05-29 — WC5 promote-on-green-cold — PASS (gate 125453d; cold-verified from own clone)
- Units — PASS: 70 passed (incl. test_promote).
- Gate predicate — PASS (anti-poison logic). should_promote_canonical = is_enrolled AND overall==0 AND not quick AND not ref — promotes ONLY enrolled + GREEN + COLD + LATEST (no PR head). A PR !testme (REF=PR-head) is excluded (not ref), --quick excluded (not quick, also proven live in WC4 = byte-identical snapshot), red excluded (overall==0), unenrolled excluded. promote_canonical replaces the known-good ONLY after green (never lost on red). So a bad PR can never poison the canonical; only cold-on-latest (manual RECIPE= / nightly) advances it.
- Live advancement — PASS. I forced the custom-html registry to an OLDER value (version=1.10.0+1.28.0, commit=advold), then ran a full COLD run RECIPE=custom-html (no REF = latest): install/upgrade/backup/restore/custom all pass, deploy-count=1, then WC5 promote-on-green-cold: (re)seed canonical custom-html @ 1.11.0+1.29.0. Independently verified after: registry version ADVANCED 1.10.0+1.28.0 → 1.11.0+1.29.0 (commit=head 8a02606, new ts), snapshot meta re-seeded to 1.11.0+1.29.0, has_canonical=True, canonical idle + volume retained, and no cust-* per-run service left (cold teardown sacred). (The promote reattaches the retained volume → re-snapshot is byte-identical content, expected.) The advancement also restored the canonical to its correct version.
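The gate predicate quoted above translates directly into a boolean function. The conjunction itself comes from the review; the parameter names and types in this sketch are assumptions about how the runner passes its state.

```python
from typing import Optional


def should_promote_canonical(is_enrolled: bool, overall: int,
                             quick: bool, ref: Optional[str]) -> bool:
    """Promote only for an enrolled recipe, on a green, cold run of latest.

    overall is the run's exit status (0 = green); ref is the PR head under
    test, or None/empty when the run targets latest.
    """
    return bool(is_enrolled and overall == 0 and not quick and not ref)
```

Each excluded case from the verdict maps to exactly one failing conjunct, which is why a PR-triggered or `--quick` run cannot advance the known-good even when it is green.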
Gate verdict: WC5 — PASS @2026-05-29. Builder may proceed to W3's WC6 (nightly sweep). Phase-2w gates verified so far: WC1, WC1.1 (full), WC1.2, WC2, WC3, WC4, WC5, WC7. Remaining for DONE: WC6, WC8, WC9.
@2026-05-29 — WC6 nightly full-cold sweep — PASS (gate 465e105; cold-verified)
- Units — PASS: 71 passed (incl. enrolled_recipes).
- Declarative timer/service — PASS. nightly-sweep.timer active; OnCalendar=*-*-* 03:00:00, Persistent=true (catches up a missed nightly), RandomizedDelaySec=600, next Sat 03:05 UTC; service = oneshot, 6h ceiling, after deploy-proxy/warm-keycloak/docker, packaged in the nix store (D8-clean; runtimeInputs incl. util-linux for the backup PTY). Imported in nix/hosts/cc-ci/configuration.nix.
- Orchestration — PASS (code read from own clone). nightly_sweep.py: in-flight guard _another_run_active() pgreps run_recipe_ci.py (excl. self) → skips/defers if a run is active; roll_warm_infra() runs the health-gated keycloak+traefik reconcilers (WC1.1); sweep() iterates enrolled_recipes() SERIALLY, each a cold latest run (REF/QUICK/MODE stripped) whose own promote hook refreshes the canonical (WC5); red recipes reported FAIL but non-fatal and DON'T promote.
- Live sweep via the actual systemd SERVICE — PASS. Forced custom-html canonical OLD (1.10.0+1.28.0), systemctl start nightly-sweep.service. Journal: roll keycloak noop-healthy:10.7.1+26.6.2 rc=0 + traefik noop-healthy:5.1.1+v3.6.15 rc=0 (health-gated); enrolled canonicals = ['custom-html']; full-cold install/upgrade/backup/restore/custom all pass; WC5 promote: canonical custom-html advanced to known-good 1.11.0+1.29.0; sweep summary custom-html: PASS; service Finished. Independently verified after: registry ADVANCED 1.10.0+1.28.0 → 1.11.0+1.29.0 (new ts), no cust-* per-run leftover (cold teardown sacred), ci.commoninternet.net=200 + keycloak-through-traefik=200 (infra healthy post-roll), system running / 0 failed.
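The orchestration shape read from nightly_sweep.py (in-flight guard, serial iteration, mode variables stripped, red recipes non-fatal) can be sketched as below. The function names, the injected callables and the exact set of stripped variable names are assumptions for illustration; the real script shells out and is not reproduced here.

```python
from typing import Callable, Dict, Iterable, Mapping, Optional

# Assumed names of the variables that select a PR head or quick mode.
STRIPPED = ("REF", "CCCI_QUICK", "MODE")


def cold_latest_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Copy env without the variables that would make a run non-cold or non-latest."""
    return {k: v for k, v in env.items() if k not in STRIPPED}


def sweep(recipes: Iterable[str],
          run_one: Callable[[str, Dict[str, str]], int],
          another_run_active: Callable[[], bool],
          env: Mapping[str, str]) -> Optional[Dict[str, str]]:
    """Run each enrolled recipe cold, one at a time; None means the sweep deferred."""
    if another_run_active():
        return None  # never overlap an in-flight test run
    clean = cold_latest_env(env)
    summary: Dict[str, str] = {}
    for recipe in recipes:
        rc = run_one(recipe, clean)
        # A red recipe is recorded but does not abort the remaining recipes.
        summary[recipe] = "PASS" if rc == 0 else "FAIL"
    return summary
```

Promotion is deliberately absent from the sweep itself: in the design described above each cold run's own promote hook decides whether the canonical advances.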
Gate verdict: WC6 — PASS @2026-05-29. Builder may proceed to W4 (WC8/WC9).
Phase-2w gates verified so far: WC1, WC1.1 (full), WC1.2, WC2, WC3, WC4, WC5, WC6, WC7.
Remaining for DONE: WC8, WC9 (incl. the full --quick rollback proof + docs).
@2026-05-29 — WC8 + WC9 (FINAL gates) — PASS (gate 40b03a9; cold-verified)
- Units — PASS: 72 passed (incl. test_canonical prune_stale).
- WC8 serialize — PASS: DRONE_RUNNER_CAPACITY = maxTests = "1" (MAX_TESTS cap); nightly sweep serial + _another_run_active() in-flight skip (verified in WC6); one app at a time.
- WC8 disk/prune — PASS: swarm autoPrune.flags = ["--all" "--filter" "until=24h"] — no --volumes (data-warm volumes + snapshots survive docker prune; the module comments why --volumes would destroy the known-good). canonical.prune_stale() is SAFE: drops a /var/lib/ci-warm/<r>/ only if it's a dir AND not enrolled AND has a canonical.json — so it spares enrolled canonicals, the keycloak/traefik reconciler dirs (last_good, no canonical.json), and alerts/. Ran it LIVE: pruned: [] (no-op) and all four dirs (alerts, custom-html, keycloak, traefik) intact after. Disk / = 50% (14G free); warm total 318M (bounded). Run nightly + df logged.
- WC8 cold teardown sacred — PASS: no <recipe>-<6hex> per-run leftovers after any of my W2/WC4/WC5/WC6 runs (independently confirmed each time).
- WC8 excluded from D8 — PASS: grep -rn ci-warm nix/ → only a COMMENT; no Nix source declares /var/lib/ci-warm as a store/source path → runtime cache, re-seeded by cold runs, not on the closure.
- WC9 docs — PASS: docs/warm.md (116 lines) covers the three states, the health-gated reconcilers + WC1.2 safety gate + alerts, data-warm canonicals + snapshots + enroll, --quick, promote-on-green-cold, the nightly sweep, resource safety, an explicit "## The --quick rollback proof (WC9)" section, and "## Operate / debug".
- WC9 --quick rollback proof — PASS (already cold-verified in WC4, @REVIEW 31f0e42): I deliberately failed a PR under --quick (broken image) → the canonical's last-known-good was restored INTACT (marker WC2-DATA-MARKER-7f3a9c back, app healthy on nginx:1.29.0, broken image gone, registry+snapshot unchanged), exit RED; and a --quick PASS left the snapshot byte-identical (did NOT move the known-good). No tests softened anywhere in the phase.
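The safety condition attributed to canonical.prune_stale() in the WC8 disk/prune item (a directory, not enrolled, and containing a canonical.json) can be illustrated as a selection function. This sketch only computes candidates and deletes nothing; `stale_canonical_dirs` is a hypothetical name, not the module's API.

```python
from pathlib import Path
from typing import List, Set


def stale_canonical_dirs(root: Path, enrolled: Set[str]) -> List[Path]:
    """Directories under root that are safe to prune.

    A candidate must be a directory, must not be an enrolled recipe, and
    must contain canonical.json. The last condition spares reconciler
    state dirs (which hold last_good but no canonical.json) and alerts/.
    """
    return sorted(
        p for p in root.iterdir()
        if p.is_dir()
        and p.name not in enrolled
        and (p / "canonical.json").is_file()
    )
```

Requiring the canonical.json marker makes the prune opt-in per directory, so anything the canonical registry did not create is left alone by construction.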
Gate verdict: WC8 + WC9 — PASS @2026-05-29.
✅ ALL Phase-2w gates Adversary cold-verified — NO VETO — DONE authorized
WC1, WC1.1 (full: keycloak stateful + traefik stateless), WC1.2, WC2, WC3, WC4, WC5, WC6, WC7,
WC8, WC9 — every one has a fresh PASS in this REVIEW-2w, each re-run COLD from my own clone
(cc-ci:/root/cc-ci-adv-verify). No open [adversary] findings; no ## VETO. The W0.10 traefik
tracked-open item is CLOSED. System healthy (running, 0 failed), infra serving (ci+keycloak 200),
custom-html canonical idle@1.11.0+1.29.0, recipe clones restored, disk 50%. The Builder is cleared
to write ## DONE to STATUS-2w.md per §6.1.