# REVIEW-2w — Adversary verdicts for Phase 2w (warm canonical + `--quick`)

Adversary-owned ledger. Append-only. Formal verdicts live here; gate claims live in STATUS-2w.md,
findings in BACKLOG-2w.md `## Adversary findings`.

**Definition of Done verified here:** WC1–WC9 (see `plan-phase2w-warm-canonical-quick.md` §1).
Each needs an independent COLD verdict before `## DONE` is permitted. The marquee proof is **WC9**:
deliberately fail a PR under `--quick` and confirm the canonical's last-known-good is restored intact
(data preserved) AND a `--quick` pass did not move the known-good.

## Verification map (what I will re-run cold per gate)

- **WC1** live-warm keycloak: dependent recipe's SSO custom tests pass against warm keycloak;
concurrent dependents use distinct namespaced realms (no collision); leftover realms reaped.
- **WC2** data-warm canonical: canonical at a stable domain (≠ cold `<recipe>-<6hex>`); declarative
registry tracks recipe→commit; re-warmable from scratch.
- **WC3** snapshots: raw volume copy taken while UNDEPLOYED under stable path; one last-known-good per
app, atomic replace; restore brings app back healthy with data.
- **WC4** `--quick`: reattach canonical → upgrade to PR head → generic UPGRADE+serving+custom;
PASS→undeploy keep volume, known-good unchanged; FAIL→restore snapshot then undeploy; never promotes.
- **WC5** cold-only advancement: green full-cold on latest re-snapshots+re-tags; only cold advances.
- **WC6** nightly full-cold sweep: scheduled, declarative, MAX_TESTS-bounded.
- **WC7** trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge;
results carry mode; no-canonical fallback clean.
- **WC8** resource safety: warm runs serialize per app; warm keycloak shared via per-run realms; disk
monitored+pruned; cold teardown still deletes per-run volumes; warm data excluded from D8 closure.
- **WC9** docs + cold verify incl. rollback proof; no softened tests.

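The WC4 outcome rule in the map above (PASS keeps the volume, FAIL restores the snapshot, neither path promotes) can be sketched as a small Python function. This is an illustration only, not the runner's code: `run_quick` and its three callables are hypothetical names, and treating a crashed run as a FAIL is an assumption of the sketch.

```python
from typing import Callable, Tuple


def run_quick(
    known_good: str,
    pr_head: str,
    upgrade_and_test: Callable[[str], bool],
    restore_snapshot: Callable[[str], None],
    undeploy: Callable[[], None],
) -> Tuple[str, str]:
    """Run one --quick attempt and return (verdict, known_good).

    known_good is returned unchanged on every path: quick mode never promotes.
    """
    try:
        # Reattach the canonical, upgrade to the PR head, run the test tiers.
        passed = upgrade_and_test(pr_head)
    except Exception:
        # Assumption of this sketch: a crashed run counts as a FAIL.
        passed = False
    if not passed:
        # FAIL: put the last-known-good data back before undeploying.
        restore_snapshot(known_good)
    # Both paths end undeployed with the volume kept.
    undeploy()
    return ("PASS" if passed else "FAIL", known_good)
```

The property a cold check cares about is visible in the return value: whatever the verdict, the second element equals the `known_good` that was passed in.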
---

## @2026-05-28 — Phase 2w start (Adversary online)

- Phase 2w interjected by operator (2026-05-28); Phase 2 paused. No 2w gates CLAIMED yet — Builder
has not bootstrapped STATUS-2w.md. Phase-2 Docker Hub rate-limit fix was the last completed work.
- COLD access re-verified: `cc-ci-tailscaled` active; `ssh cc-ci` → NixOS 24.11 (50ab793);
wildcard `*.ci.commoninternet.net` → gateway 143.244.213.108. Verification path is live.
- IDLE until the Builder claims a WC gate (watchdog will ping on claim). Standing veto power retained.

## @2026-05-28 — Design update absorbed (orchestrator: unpin + health-gated rollback)

SSOT updated (committed). Revised/added verification obligations I will hold the gate to:

- **WC1 (revised)** — keycloak is now **UNPINNED** like traefik: reconciler `abra recipe fetch`
latest + chaos-deploy; `kcVersion` pin DROPPED; MUST keep the *secret-generate-only-if-missing*
guard + the health-wait. Cold-check: no version pin in the nix module / reconciler; recipe fetched
at activation (runtime) so the nix closure stays byte-identical (D8 preserved — verify closure hash
unaffected by which keycloak version is live). Plus original WC1: dependent SSO custom tests pass
against warm keycloak; concurrent dependents use distinct namespaced realms (no collision); stale
realms reaped.
- **WC1.1 (NEW)** — health-gated deploy-with-rollback built INTO the warm/infra reconcilers
(traefik + keycloak), NOT nix-generation rollback (the swarm app isn't in the generation). Pattern:
record running version = last-good → deploy latest → health-check → healthy: commit last-good:=latest;
unhealthy: roll back to last-good + `PushNotification` alert. Stateful (keycloak): undeploy → raw
snapshot data volume → deploy latest → health-check → on fail restore snapshot + redeploy prior
version (forward DB migrations make version-only rollback unsafe); reuse WC3 snapshot helper.
traefik (stateless) = version rollback only. **ADVERSARY PROOF (mandatory, I must run it):**
(a) force/simulate a BROKEN "latest" → confirm the warm app self-reverts to the prior healthy
version, keycloak's **pre-upgrade data intact**, and an alert fired; (b) a HEALTHY update commits
the new version as last-good. Watch for: silent failure (broken stays deployed), data loss on
revert, no alert, or last-good not advancing on a healthy update.
- **WC6 (reordered)** — nightly = `nixos-rebuild switch` FIRST (warm/infra → latest, health-gated per
WC1.1) THEN full-cold sweep; MUST NOT run while a test run is in flight; if the health-gate rolled
an infra app back, alert fires and the sweep still runs against the healthy prior version.
- **WC8 carry** — confirm the leftover phase-2 cold app `lasu-0a6fb2` (orchestrator flagged it) is
fully torn down (app+volumes+secrets gone), since cold-teardown-sacred + disk budget are WC8.
- Still no gate CLAIMED; W0 in flight. Continue idle until a WC gate is claimed (watchdog pings).

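The WC1.1 pattern above can be sketched in Python to make the two paths (stateless version rollback, stateful snapshot-and-restore) concrete. This is a minimal sketch under stated assumptions, not the reconciler's implementation: `health_gated_update` and every callable it takes are hypothetical, a deploy exception is treated the same as an unhealthy result, and undeploying again before the restore is an assumption of the sketch.

```python
from typing import Callable, Optional


def health_gated_update(
    last_good: str,
    latest: str,
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    alert: Callable[[str], None],
    undeploy: Optional[Callable[[], None]] = None,
    snapshot: Optional[Callable[[], None]] = None,
    restore: Optional[Callable[[], None]] = None,
) -> str:
    """Attempt an update and return the version to record as last-good."""
    # Stateful apps (keycloak-like) pass all three volume callables;
    # stateless apps (traefik-like) pass none and get version-only rollback.
    stateful = None not in (undeploy, snapshot, restore)
    if stateful:
        undeploy()   # the raw volume copy is taken while undeployed
        snapshot()
    try:
        deploy(latest)
        ok = healthy()
    except Exception:
        ok = False   # assumption: a failed deploy counts as unhealthy
    if ok:
        return latest  # commit: last-good := latest
    if stateful:
        undeploy()   # assumption: restore also needs the app undeployed
        restore()    # forward DB migrations make version-only rollback unsafe
    deploy(last_good)
    recovered = healthy()
    alert(f"rollback attempted={latest} last_good={last_good} recovered={recovered}")
    if not recovered:
        raise RuntimeError("rollback to last-good is itself unhealthy")
    return last_good
```

Both adversary proofs map onto the return value: a broken "latest" must return the old `last_good` and fire exactly one alert, and a healthy update must return `latest` with no alert.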
## @2026-05-29 — WC1.2 added (pre-deploy safety gate, runs BEFORE WC1.1)

- **WC1.2 (NEW)** — pre-deploy safety gate on warm/infra auto-update. Rationale: a passing health
check does NOT prove a required manual migration ran, so gate BEFORE auto-deploy. Rule: only
auto-apply **non-major (patch/minor)** upgrades with **no manual-migration release notes**. If
current→latest is a **MAJOR recipe-version bump** OR the target `releaseNotes/<version>.md` flags a
manual migration → **DO NOT auto-upgrade**: stay on current + `PushNotification` alert **WITH the
release notes** (operator upgrades manually). Independent of, and runs BEFORE, the WC1.1
health-gated rollback. Applies to nightly rebuild (WC6) AND any reconcile.
- Detection (verify the impl uses both): primary = major recipe-version bump (coop-cloud version
`<upstream>+<recipe-semver>`; a major **recipe-semver** bump = breaking, matches abra
major-upgrade caution); secondary = scan target `releaseNotes/<version>.md` for manual-migration
markers.
- **ADVERSARY PROOF (mandatory):** simulate a major / manual-migration "latest" → confirm
**hold-on-current** (no deploy attempted) + alert fired **carrying the release notes**; NO silent
auto-upgrade. Watch for: a major bump slipping through as if patch; releaseNotes not scanned;
alert without the notes; or the gate firing on a legitimate patch/minor (false hold).
- Ordering check: WC1.2 must short-circuit BEFORE WC1.1 even snapshots/deploys — i.e. on a held
upgrade there is no snapshot/deploy/rollback churn, just hold + alert.

## @2026-05-29 — Standing probe (WC8 carry): lasu-0a6fb2 teardown — CLEAN

Independent cold check on cc-ci (not a gate verdict; WC8 not yet claimed). The orchestrator-flagged
leftover phase-2 cold app `lasu-0a6fb2` is **fully gone**: `abra app ls -S -m` shows no lasu app,
`docker service ls` no lasu services, `docker volume ls` no lasu volumes, `docker secret ls` no lasu
secrets. Disk `/` at **63% (9.8G free / 28G)** — consistent with the Builder's claimed 96%→62%
reclaim. Cold-teardown-sacred holds for this orphan; disk budget healthy. Will fold into the WC8
verdict when that gate is claimed. Still no WC gate CLAIMED; W0 → next is W0.9 WC1.1 live proofs.

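The docker half of the probe above could be scripted along these lines. This is a sketch, not the commands as actually run: `docker_names` and `leftovers` are hypothetical helpers, the `abra app ls` check is left out, and matching by substring mirrors the "no lasu ..." wording rather than any documented naming rule.

```python
import subprocess
from typing import Dict, List

KINDS = ("service", "volume", "secret")


def docker_names(kind: str) -> List[str]:
    """List object names via `docker <kind> ls` (needs docker on PATH)."""
    out = subprocess.run(
        ["docker", kind, "ls", "--format", "{{.Name}}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def leftovers(names_by_kind: Dict[str, List[str]], needle: str) -> Dict[str, List[str]]:
    """Return only the kinds that still hold a name containing needle."""
    hits: Dict[str, List[str]] = {}
    for kind, names in names_by_kind.items():
        matched = [name for name in names if needle in name]
        if matched:
            hits[kind] = matched
    return hits
```

On the host, `leftovers({k: docker_names(k) for k in KINDS}, "lasu")` returning an empty dict would correspond to the CLEAN result recorded here.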
## @2026-05-29 — Watchdog pinged [C1]; NO formal gate claim yet — read-only pre-review (NOT a verdict)

Watchdog signalled a [C1] claim, but `STATUS-2w.md ## Gate` reads "(none claimed yet)" and the
Builder's own STATUS lists **W0.7 + W0.8 as remaining** before claiming WC1/WC1.1/WC1.2, with a build
finding (lasuite-docs in-place `--chaos` redeploy nginx `host not found in upstream ...backend:8000`
race) currently **blocking the WC1 dependent-green proof**. Per §6.1 there is NO formal gate to pass
yet — ping likely fired on the "reconciler-side WC1/WC1.1/WC1.2 proven" wording in 819c1bc. I will
NOT log a WC1/WC1.1/WC1.2 PASS until the gate is formally CLAIMED and I run the marquee reproduce cold.

**Read-only pre-review done now (no live churn — avoids colliding with the Builder's W0.8 keycloak work):**

- Live state consistent with the W0.9 narrative: `warm-keycloak.service` active; live image
`keycloak/keycloak:26.6.2` + `mariadb:12.2`; `/var/lib/ci-warm/keycloak/last_good = 10.7.1+26.6.2`
(the recovered canonical — correctly NOT advanced to the simulated-broken 10.7.10).
- Static review of `runner/warm_reconcile.py` — no defects:
  - WC1.2 safety gate runs BEFORE any snapshot/deploy (L335-343); a hold returns with NO
  snapshot/deploy/rollback churn; both `held-major` + `held-manual-migration` alerts carry `release_notes`.
  - `is_major_bump` is conservative: holds on a major bump of EITHER the recipe-semver (pre-`+`) OR
  the app-version (post-`+`), so a keycloak app-major (25->26, the DB-migration case) is also held.
  Neutralizes a tag-format wording mismatch (plan §WC1.2 says `<upstream>+<recipe-semver>`; code's
  observed data says `<recipe-semver>+<app-version>`) — checking both sides covers intent either way.
  Not a defect; noted so I don't re-flag it.
  - WC1.1 rolls back on BOTH a deploy exception AND an unhealthy result (L356-362); stateful path
  restores the snapshot before redeploying the prior version; raises if the rollback itself is
  unhealthy. Alert `rollback` carries last_good/attempted/recovered/notes.
- **OPEN FLAG to confirm at the live reproduce:** `/var/lib/ci-warm/alerts/` is currently EMPTY,
though W0.9 claims a rollback alert was written there and the alert-relay archiving to `alerts/seen/`
is explicitly deferred/unwired. Likely benign (Builder cleaned up the W0.9 test alert), but I MUST
confirm a `*rollback*.json` alert actually lands during my own cold reproduce (no silent no-alert).
- **PLAN for the formal gate:** when WC1 is CLAIMED, run the Builder's reproduce (STATUS L79-83):
fake tags `10.7.9+26.6.2`(good) + `10.7.10+26.6.2`(broken KC_HOSTNAME), `CCCI_SKIP_FETCH=1
cc-ci-run runner/warm_reconcile.py keycloak` x2 → expect `upgraded:` then `rolled-back:`, marker
realm survives, last_good unchanged at prior, a `*rollback*.json` alert; PLUS the WC1 headline
(dependent SSO custom test green vs warm keycloak + concurrent distinct realms + reaping) + a
major/manual-migration WC1.2 hold proof. Sent a BUILDER-INBOX heads-up to coordinate keycloak timing.
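
The open flag above (a `*rollback*.json` alert must actually land) is the kind of check worth scripting for the cold reproduce. A minimal Python sketch follows, assuming the alert is a JSON object whose keys are spelled exactly `last_good`, `attempted`, `recovered` and `notes`; the ledger names those fields but not their exact spelling, and `rollback_alerts` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import List

# Assumed key spellings, taken from the field list in the static review.
REQUIRED_KEYS = {"last_good", "attempted", "recovered", "notes"}


def rollback_alerts(alert_dir: str) -> List[Path]:
    """Return rollback alert files that parse as JSON and carry the required keys."""
    found = []
    for path in sorted(Path(alert_dir).glob("*rollback*.json")):
        try:
            payload = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # an unreadable or malformed alert is not proof
        if isinstance(payload, dict) and REQUIRED_KEYS <= payload.keys():
            found.append(path)
    return found
```

Running it against `/var/lib/ci-warm/alerts/` before and after the reproduce, and comparing the two results, separates a fresh alert from anything left over from earlier test runs.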