REVIEW-2w — Adversary verdicts for Phase 2w (warm canonical + --quick)
Adversary-owned ledger. Append-only. Formal verdicts live here; gate claims live in STATUS-2w.md,
findings in BACKLOG-2w.md ## Adversary findings.
Definition of Done verified here: WC1–WC9 (see plan-phase2w-warm-canonical-quick.md §1).
Each needs an independent COLD verdict before ## DONE is permitted. The marquee proof is WC9:
deliberately fail a PR under --quick and confirm the canonical's last-known-good is restored intact
(data preserved) AND a --quick pass did not move the known-good.
Verification map (what I will re-run cold per gate)
- WC1 live-warm keycloak: dependent recipe's SSO custom tests pass against warm keycloak; concurrent dependents use distinct namespaced realms (no collision); leftover realms reaped.
- WC2 data-warm canonical: canonical at a stable domain (≠ cold `<recipe>-<6hex>`); declarative registry tracks recipe→commit; re-warmable from scratch.
- WC3 snapshots: raw volume copy taken while UNDEPLOYED under stable path; one last-known-good per app, atomic replace; restore brings app back healthy with data.
- WC4 `--quick`: reattach canonical → upgrade to PR head → generic UPGRADE+serving+custom; PASS→undeploy keep volume, known-good unchanged; FAIL→restore snapshot then undeploy; never promotes.
- WC5 cold-only advancement: green full-cold on latest re-snapshots+re-tags; only cold advances.
- WC6 nightly full-cold sweep: scheduled, declarative, MAX_TESTS-bounded.
- WC7 trigger/authority/labeling: default `!testme` = cold; `--quick` opt-in, never gates merge; results carry mode; no-canonical fallback clean.
- WC8 resource safety: warm runs serialize per app; warm keycloak shared via per-run realms; disk monitored+pruned; cold teardown still deletes per-run volumes; warm data excluded from D8 closure.
- WC9 docs + cold verify incl. rollback proof; no softened tests.
@2026-05-28 — Phase 2w start (Adversary online)
- Phase 2w interjected by operator (2026-05-28); Phase 2 paused. No 2w gates CLAIMED yet — Builder has not bootstrapped STATUS-2w.md. Phase-2 Docker Hub rate-limit fix was the last completed work.
- COLD access re-verified: `cc-ci-tailscaled` active; `ssh cc-ci` → NixOS 24.11 (50ab793); wildcard `*.ci.commoninternet.net` → gateway 143.244.213.108. Verification path is live.
- IDLE until the Builder claims a WC gate (watchdog will ping on claim). Standing veto power retained.
@2026-05-28 — Design update absorbed (orchestrator: unpin + health-gated rollback)
SSOT updated (committed). Revised/added verification obligations I will hold the gate to:
- WC1 (revised) — keycloak is now UNPINNED like traefik: reconciler `abra recipe fetch` latest + chaos-deploy; `kcVersion` pin DROPPED; MUST keep the secret-generate-only-if-missing guard + the health-wait. Cold-check: no version pin in the nix module / reconciler; recipe fetched at activation (runtime) so the nix closure stays byte-identical (D8 preserved — verify closure hash unaffected by which keycloak version is live). Plus original WC1: dependent SSO custom tests pass against warm keycloak; concurrent dependents use distinct namespaced realms (no collision); stale realms reaped.
- WC1.1 (NEW) — health-gated deploy-with-rollback built INTO the warm/infra reconcilers (traefik + keycloak), NOT nix-generation rollback (the swarm app isn't in the generation).
  - Pattern: record running version = last-good → deploy latest → health-check → healthy: commit last-good:=latest; unhealthy: roll back to last-good + `PushNotification` alert.
  - Stateful (keycloak): undeploy → raw snapshot data volume → deploy latest → health-check → on fail restore snapshot + redeploy prior version (forward DB migrations make version-only rollback unsafe); reuse WC3 snapshot helper.
  - traefik (stateless) = version rollback only.
  - ADVERSARY PROOF (mandatory, I must run it): (a) force/simulate a BROKEN "latest" → confirm the warm app self-reverts to the prior healthy version, keycloak's pre-upgrade data intact, and an alert fired; (b) a HEALTHY update commits the new version as last-good.
  - Watch for: silent failure (broken stays deployed), data loss on revert, no alert, or last-good not advancing on a healthy update.
- WC6 (reordered) — nightly = `nixos-rebuild switch` FIRST (warm/infra → latest, health-gated per WC1.1) THEN full-cold sweep; MUST NOT run while a test run is in flight; if the health-gate rolled an infra app back, alert fires and the sweep still runs against the healthy prior version.
- WC8 carry — confirm the leftover phase-2 cold app `lasu-0a6fb2` (orchestrator flagged it) is fully torn down (app+volumes+secrets gone), since cold-teardown-sacred + disk budget are WC8.
- Still no gate CLAIMED; W0 in flight. Continue idle until a WC gate is claimed (watchdog pings).
@2026-05-29 — WC1.2 added (pre-deploy safety gate, runs BEFORE WC1.1)
- WC1.2 (NEW) — pre-deploy safety gate on warm/infra auto-update. Rationale: a passing health check does NOT prove a required manual migration ran, so gate BEFORE auto-deploy. Rule: only auto-apply non-major (patch/minor) upgrades with no manual-migration release notes. If current→latest is a MAJOR recipe-version bump OR the target `releaseNotes/<version>.md` flags a manual migration → DO NOT auto-upgrade: stay on current + `PushNotification` alert WITH the release notes (operator upgrades manually). Independent of, and runs BEFORE, the WC1.1 health-gated rollback. Applies to nightly rebuild (WC6) AND any reconcile.
  - Detection (verify the impl uses both): primary = major recipe-version bump (coop-cloud version `<upstream>+<recipe-semver>`; a major recipe-semver bump = breaking, matches abra major-upgrade caution); secondary = scan target `releaseNotes/<version>.md` for manual-migration markers.
  - ADVERSARY PROOF (mandatory): simulate a major / manual-migration "latest" → confirm hold-on-current (no deploy attempted) + alert fired carrying the release notes; NO silent auto-upgrade. Watch for: a major bump slipping through as if patch; releaseNotes not scanned; alert without the notes; or the gate firing on a legitimate patch/minor (false hold).
  - Ordering check: WC1.2 must short-circuit BEFORE WC1.1 even snapshots/deploys — i.e. on a held upgrade there is no snapshot/deploy/rollback churn, just hold + alert.
@2026-05-29 — Standing probe (WC8 carry): lasu-0a6fb2 teardown — CLEAN
Independent cold check on cc-ci (not a gate verdict; WC8 not yet claimed). The orchestrator-flagged
leftover phase-2 cold app lasu-0a6fb2 is fully gone: abra app ls -S -m shows no lasu app,
docker service ls no lasu services, docker volume ls no lasu volumes, docker secret ls no lasu
secrets. Disk / at 63% (9.8G free / 28G) — consistent with the Builder's claimed 96%→62%
reclaim. Cold-teardown-sacred holds for this orphan; disk budget healthy. Will fold into the WC8
verdict when that gate is claimed. Still no WC gate CLAIMED; W0 → next is W0.9 WC1.1 live proofs.
@2026-05-29 — Watchdog pinged [C1]; NO formal gate claim yet — read-only pre-review (NOT a verdict)
Watchdog signalled a [C1] claim, but STATUS-2w.md ## Gate reads "(none claimed yet)" and the
Builder's own STATUS lists W0.7 + W0.8 as remaining before claiming WC1/WC1.1/WC1.2, with a build
finding (lasuite-docs in-place --chaos redeploy nginx host not found in upstream ...backend:8000
race) currently blocking the WC1 dependent-green proof. Per §6.1 there is NO formal gate to pass
yet — ping likely fired on the "reconciler-side WC1/WC1.1/WC1.2 proven" wording in 819c1bc. I will
NOT log a WC1/WC1.1/WC1.2 PASS until the gate is formally CLAIMED and I run the marquee reproduce cold.
Read-only pre-review done now (no live churn — avoids colliding with the Builder's W0.8 keycloak work):
- Live state consistent with the W0.9 narrative: `warm-keycloak.service` active; live image `keycloak/keycloak:26.6.2` + `mariadb:12.2`; `/var/lib/ci-warm/keycloak/last_good` = `10.7.1+26.6.2` (the recovered canonical — correctly NOT advanced to the simulated-broken `10.7.10`).
- Static review of `runner/warm_reconcile.py` — no defects:
  - WC1.2 safety gate runs BEFORE any snapshot/deploy (L335-343); a hold returns with NO snapshot/deploy/rollback churn; both `held-major` + `held-manual-migration` alerts carry `release_notes`.
  - `is_major_bump` is conservative: holds on a major bump of EITHER the recipe-semver (pre-`+`) OR the app-version (post-`+`), so a keycloak app-major (25->26, the DB-migration case) is also held. Neutralizes a tag-format wording mismatch (plan §WC1.2 says `<upstream>+<recipe-semver>`; code's observed data says `<recipe-semver>+<app-version>`) — checking both sides covers intent either way. Not a defect; noted so I don't re-flag it.
  - WC1.1 rolls back on BOTH a deploy exception AND an unhealthy result (L356-362); stateful path restores the snapshot before redeploying the prior version; raises if the rollback itself is unhealthy. Alert `rollback` carries last_good/attempted/recovered/notes.
- OPEN FLAG to confirm at the live reproduce: `/var/lib/ci-warm/alerts/` is currently EMPTY, though W0.9 claims a rollback alert was written there and the alert-relay archiving to `alerts/seen/` is explicitly deferred/unwired. Likely benign (Builder cleaned up the W0.9 test alert), but I MUST confirm a `*rollback*.json` alert actually lands during my own cold reproduce (no silent no-alert).
- PLAN for the formal gate: when WC1 is CLAIMED, run the Builder's reproduce (STATUS L79-83): fake tags `10.7.9+26.6.2` (good) + `10.7.10+26.6.2` (broken KC_HOSTNAME), `CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak` x2 → expect `upgraded:` then `rolled-back:`, marker realm survives, last_good unchanged at prior, a `*rollback*.json` alert; PLUS the WC1 headline (dependent SSO custom test green vs warm keycloak + concurrent distinct realms + reaping) + a major/manual-migration WC1.2 hold proof. Sent a BUILDER-INBOX heads-up to coordinate keycloak timing.
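
To close the OPEN FLAG mechanically rather than by eyeballing the directory, a small check along these lines could be run after the reproduce. It is a sketch under assumptions: `rollback_alerts` is a hypothetical helper, and the JSON key names in `REQUIRED_KEYS` are guessed from the "last_good/attempted/recovered/notes" wording above, so they may not match the real alert payload exactly.

```python
import json
from pathlib import Path
from typing import List

# Assumed key names, taken from the field list in the static review above.
REQUIRED_KEYS = {"last_good", "attempted", "recovered", "notes"}


def rollback_alerts(alert_dir: Path) -> List[dict]:
    """Load every *rollback*.json alert in alert_dir and verify its fields."""
    found = []
    for path in sorted(alert_dir.glob("*rollback*.json")):
        payload = json.loads(path.read_text())
        missing = REQUIRED_KEYS - set(payload)
        if missing:
            raise ValueError(f"{path.name} is missing fields: {sorted(missing)}")
        found.append(payload)
    return found
```

An empty list after a forced-broken reconcile would be the silent no-alert case; a `ValueError` would mean an alert landed but without the fields the static review expects.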