Files
cc-ci/machine-docs/REVIEW-2w.md

124 lines
10 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# REVIEW-2w — Adversary verdicts for Phase 2w (warm canonical + `--quick`)
Adversary-owned ledger. Append-only. Formal verdicts live here; gate claims live in STATUS-2w.md,
findings in BACKLOG-2w.md `## Adversary findings`.
**Definition of Done verified here:** WC1WC9 (see `plan-phase2w-warm-canonical-quick.md` §1).
Each needs an independent COLD verdict before `## DONE` is permitted. The marquee proof is **WC9**:
deliberately fail a PR under `--quick` and confirm the canonical's last-known-good is restored intact
(data preserved) AND a `--quick` pass did not move the known-good.
## Verification map (what I will re-run cold per gate)
- **WC1** live-warm keycloak: dependent recipe's SSO custom tests pass against warm keycloak;
concurrent dependents use distinct namespaced realms (no collision); leftover realms reaped.
- **WC2** data-warm canonical: canonical at a stable domain (≠ cold `<recipe>-<6hex>`); declarative
registry tracks recipe→commit; re-warmable from scratch.
- **WC3** snapshots: raw volume copy taken while UNDEPLOYED under stable path; one last-known-good per
app, atomic replace; restore brings app back healthy with data.
- **WC4** `--quick`: reattach canonical → upgrade to PR head → generic UPGRADE+serving+custom;
PASS→undeploy keep volume, known-good unchanged; FAIL→restore snapshot then undeploy; never promotes.
- **WC5** cold-only advancement: green full-cold on latest re-snapshots+re-tags; only cold advances.
- **WC6** nightly full-cold sweep: scheduled, declarative, MAX_TESTS-bounded.
- **WC7** trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge;
results carry mode; no-canonical fallback clean.
- **WC8** resource safety: warm runs serialize per app; warm keycloak shared via per-run realms; disk
monitored+pruned; cold teardown still deletes per-run volumes; warm data excluded from D8 closure.
- **WC9** docs + cold verify incl. rollback proof; no softened tests.
---
## @2026-05-28 — Phase 2w start (Adversary online)
- Phase 2w interjected by operator (2026-05-28); Phase 2 paused. No 2w gates CLAIMED yet — Builder
has not bootstrapped STATUS-2w.md. Phase-2 Docker Hub rate-limit fix was the last completed work.
- COLD access re-verified: `cc-ci-tailscaled` active; `ssh cc-ci` → NixOS 24.11 (50ab793);
wildcard `*.ci.commoninternet.net` → gateway 143.244.213.108. Verification path is live.
- IDLE until the Builder claims a WC gate (watchdog will ping on claim). Standing veto power retained.
## @2026-05-28 — Design update absorbed (orchestrator: unpin + health-gated rollback)
SSOT updated (committed). Revised/added verification obligations I will hold the gate to:
- **WC1 (revised)** — keycloak is now **UNPINNED** like traefik: reconciler `abra recipe fetch`
latest + chaos-deploy; `kcVersion` pin DROPPED; MUST keep the *secret-generate-only-if-missing*
guard + the health-wait. Cold-check: no version pin in the nix module / reconciler; recipe fetched
at activation (runtime) so the nix closure stays byte-identical (D8 preserved — verify closure hash
unaffected by which keycloak version is live). Plus original WC1: dependent SSO custom tests pass
against warm keycloak; concurrent dependents use distinct namespaced realms (no collision); stale
realms reaped.
- **WC1.1 (NEW)** — health-gated deploy-with-rollback built INTO the warm/infra reconcilers
(traefik + keycloak), NOT nix-generation rollback (the swarm app isn't in the generation). Pattern:
record running version = last-good → deploy latest → health-check → healthy: commit last-good:=latest;
unhealthy: roll back to last-good + `PushNotification` alert. Stateful (keycloak): undeploy → raw
snapshot data volume → deploy latest → health-check → on fail restore snapshot + redeploy prior
version (forward DB migrations make version-only rollback unsafe); reuse WC3 snapshot helper.
traefik (stateless) = version rollback only. **ADVERSARY PROOF (mandatory, I must run it):**
(a) force/simulate a BROKEN "latest" → confirm the warm app self-reverts to the prior healthy
version, keycloak's **pre-upgrade data intact**, and an alert fired; (b) a HEALTHY update commits
the new version as last-good. Watch for: silent failure (broken stays deployed), data loss on
revert, no alert, or last-good not advancing on a healthy update.
- **WC6 (reordered)** — nightly = `nixos-rebuild switch` FIRST (warm/infra → latest, health-gated per
WC1.1) THEN full-cold sweep; MUST NOT run while a test run is in flight; if the health-gate rolled
an infra app back, alert fires and the sweep still runs against the healthy prior version.
- **WC8 carry** — confirm the leftover phase-2 cold app `lasu-0a6fb2` (orchestrator flagged it) is
fully torn down (app+volumes+secrets gone), since cold-teardown-sacred + disk budget are WC8.
- Still no gate CLAIMED; W0 in flight. Continue idle until a WC gate is claimed (watchdog pings).
## @2026-05-29 — WC1.2 added (pre-deploy safety gate, runs BEFORE WC1.1)
- **WC1.2 (NEW)** — pre-deploy safety gate on warm/infra auto-update. Rationale: a passing health
check does NOT prove a required manual migration ran, so gate BEFORE auto-deploy. Rule: only
auto-apply **non-major (patch/minor)** upgrades with **no manual-migration release notes**. If
current→latest is a **MAJOR recipe-version bump** OR the target `releaseNotes/<version>.md` flags a
manual migration → **DO NOT auto-upgrade**: stay on current + `PushNotification` alert **WITH the
release notes** (operator upgrades manually). Independent of, and runs BEFORE, the WC1.1
health-gated rollback. Applies to nightly rebuild (WC6) AND any reconcile.
- Detection (verify the impl uses both): primary = major recipe-version bump (coop-cloud version
`<upstream>+<recipe-semver>`; a major **recipe-semver** bump = breaking, matches abra
major-upgrade caution); secondary = scan target `releaseNotes/<version>.md` for manual-migration
markers.
- **ADVERSARY PROOF (mandatory):** simulate a major / manual-migration "latest" → confirm
**hold-on-current** (no deploy attempted) + alert fired **carrying the release notes**; NO silent
auto-upgrade. Watch for: a major bump slipping through as if patch; releaseNotes not scanned;
alert without the notes; or the gate firing on a legitimate patch/minor (false hold).
- Ordering check: WC1.2 must short-circuit BEFORE WC1.1 even snapshots/deploys — i.e. on a held
upgrade there is no snapshot/deploy/rollback churn, just hold + alert.
## @2026-05-29 — Standing probe (WC8 carry): lasu-0a6fb2 teardown — CLEAN
Independent cold check on cc-ci (not a gate verdict; WC8 not yet claimed). The orchestrator-flagged
leftover phase-2 cold app `lasu-0a6fb2` is **fully gone**: `abra app ls -S -m` shows no lasu app,
`docker service ls` no lasu services, `docker volume ls` no lasu volumes, `docker secret ls` no lasu
secrets. Disk `/` at **63% (9.8G free / 28G)** — consistent with the Builder's claimed 96%→62%
reclaim. Cold-teardown-sacred holds for this orphan; disk budget healthy. Will fold into the WC8
verdict when that gate is claimed. Still no WC gate CLAIMED; W0 → next is W0.9 WC1.1 live proofs.
## @2026-05-29 — Watchdog pinged [C1]; NO formal gate claim yet — read-only pre-review (NOT a verdict)
Watchdog signalled a [C1] claim, but `STATUS-2w.md ## Gate` reads "(none claimed yet)" and the
Builder's own STATUS lists **W0.7 + W0.8 as remaining** before claiming WC1/WC1.1/WC1.2, with a build
finding (lasuite-docs in-place `--chaos` redeploy nginx `host not found in upstream ...backend:8000`
race) currently **blocking the WC1 dependent-green proof**. Per §6.1 there is NO formal gate to pass
yet — ping likely fired on the "reconciler-side WC1/WC1.1/WC1.2 proven" wording in 819c1bc. I will
NOT log a WC1/WC1.1/WC1.2 PASS until the gate is formally CLAIMED and I run the marquee reproduce cold.
**Read-only pre-review done now (no live churn — avoids colliding with the Builder's W0.8 keycloak work):**
- Live state consistent with the W0.9 narrative: `warm-keycloak.service` active; live image
`keycloak/keycloak:26.6.2` + `mariadb:12.2`; `/var/lib/ci-warm/keycloak/last_good = 10.7.1+26.6.2`
(the recovered canonical — correctly NOT advanced to the simulated-broken 10.7.10).
- Static review of `runner/warm_reconcile.py` — no defects:
- WC1.2 safety gate runs BEFORE any snapshot/deploy (L335-343); a hold returns with NO
snapshot/deploy/rollback churn; both `held-major` + `held-manual-migration` alerts carry `release_notes`.
- `is_major_bump` is conservative: holds on a major bump of EITHER the recipe-semver (pre-`+`) OR
the app-version (post-`+`), so a keycloak app-major (25->26, the DB-migration case) is also held.
Neutralizes a tag-format wording mismatch (plan §WC1.2 says `<upstream>+<recipe-semver>`; code's
observed data says `<recipe-semver>+<app-version>`) — checking both sides covers intent either way.
Not a defect; noted so I don't re-flag it.
- WC1.1 rolls back on BOTH a deploy exception AND an unhealthy result (L356-362); stateful path
restores the snapshot before redeploying the prior version; raises if the rollback itself is
unhealthy. Alert `rollback` carries last_good/attempted/recovered/notes.
- **OPEN FLAG to confirm at the live reproduce:** `/var/lib/ci-warm/alerts/` is currently EMPTY,
though W0.9 claims a rollback alert was written there and the alert-relay archiving to `alerts/seen/`
is explicitly deferred/unwired. Likely benign (Builder cleaned up the W0.9 test alert), but I MUST
confirm a `*rollback*.json` alert actually lands during my own cold reproduce (no silent no-alert).
- **PLAN for the formal gate:** when WC1 is CLAIMED, run the Builder's reproduce (STATUS L79-83):
fake tags `10.7.9+26.6.2`(good) + `10.7.10+26.6.2`(broken KC_HOSTNAME), `CCCI_SKIP_FETCH=1
cc-ci-run runner/warm_reconcile.py keycloak` x2 → expect `upgraded:` then `rolled-back:`, marker
realm survives, last_good unchanged at prior, a `*rollback*.json` alert; PLUS the WC1 headline
(dependent SSO custom test green vs warm keycloak + concurrent distinct realms + reaping) + a
major/manual-migration WC1.2 hold proof. Sent a BUILDER-INBOX heads-up to coordinate keycloak timing.