# REVIEW-2w — Adversary verdicts for Phase 2w (warm canonical + `--quick`)
Adversary-owned ledger. Append-only. Formal verdicts live here; gate claims live in STATUS-2w.md,
findings in BACKLOG-2w.md `## Adversary findings`.
**Definition of Done verified here:** WC1–WC9 (see `plan-phase2w-warm-canonical-quick.md` §1).
Each needs an independent COLD verdict before `## DONE` is permitted. The marquee proof is **WC9**:
deliberately fail a PR under `--quick` and confirm the canonical's last-known-good is restored intact
(data preserved) AND a `--quick` pass did not move the known-good.
## Verification map (what I will re-run cold per gate)
- **WC1** live-warm keycloak: dependent recipe's SSO custom tests pass against warm keycloak;
concurrent dependents use distinct namespaced realms (no collision); leftover realms reaped.
- **WC2** data-warm canonical: canonical at a stable domain (≠ cold `<recipe>-<6hex>`); declarative
registry tracks recipe→commit; re-warmable from scratch.
- **WC3** snapshots: raw volume copy taken while UNDEPLOYED under stable path; one last-known-good per
app, atomic replace; restore brings app back healthy with data.
- **WC4** `--quick`: reattach canonical → upgrade to PR head → generic UPGRADE+serving+custom;
PASS→undeploy keep volume, known-good unchanged; FAIL→restore snapshot then undeploy; never promotes.
- **WC5** cold-only advancement: green full-cold on latest re-snapshots+re-tags; only cold advances.
- **WC6** nightly full-cold sweep: scheduled, declarative, MAX_TESTS-bounded.
- **WC7** trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge;
results carry mode; no-canonical fallback clean.
- **WC8** resource safety: warm runs serialize per app; warm keycloak shared via per-run realms; disk
monitored+pruned; cold teardown still deletes per-run volumes; warm data excluded from D8 closure.
- **WC9** docs + cold verify incl. rollback proof; no softened tests.
---
## @2026-05-28 — Phase 2w start (Adversary online)
- Phase 2w interjected by operator (2026-05-28); Phase 2 paused. No 2w gates CLAIMED yet — Builder
has not bootstrapped STATUS-2w.md. Phase-2 Docker Hub rate-limit fix was the last completed work.
- COLD access re-verified: `cc-ci-tailscaled` active; `ssh cc-ci` → NixOS 24.11 (50ab793);
wildcard `*.ci.commoninternet.net` → gateway 143.244.213.108. Verification path is live.
- IDLE until the Builder claims a WC gate (watchdog will ping on claim). Standing veto power retained.
## @2026-05-28 — Design update absorbed (orchestrator: unpin + health-gated rollback)
SSOT updated (committed). Revised/added verification obligations I will hold the gate to:
- **WC1 (revised)** — keycloak is now **UNPINNED** like traefik: reconciler `abra recipe fetch`
latest + chaos-deploy; `kcVersion` pin DROPPED; MUST keep the *secret-generate-only-if-missing*
guard + the health-wait. Cold-check: no version pin in the nix module / reconciler; recipe fetched
at activation (runtime) so the nix closure stays byte-identical (D8 preserved — verify closure hash
unaffected by which keycloak version is live). Plus original WC1: dependent SSO custom tests pass
against warm keycloak; concurrent dependents use distinct namespaced realms (no collision); stale
realms reaped.
- **WC1.1 (NEW)** — health-gated deploy-with-rollback built INTO the warm/infra reconcilers
(traefik + keycloak), NOT nix-generation rollback (the swarm app isn't in the generation). Pattern:
record running version = last-good → deploy latest → health-check → healthy: commit last-good:=latest;
unhealthy: roll back to last-good + `PushNotification` alert. Stateful (keycloak): undeploy → raw
snapshot data volume → deploy latest → health-check → on fail restore snapshot + redeploy prior
version (forward DB migrations make version-only rollback unsafe); reuse WC3 snapshot helper.
traefik (stateless) = version rollback only. **ADVERSARY PROOF (mandatory, I must run it):**
(a) force/simulate a BROKEN "latest" → confirm the warm app self-reverts to the prior healthy
version, keycloak's **pre-upgrade data intact**, and an alert fired; (b) a HEALTHY update commits
the new version as last-good. Watch for: silent failure (broken stays deployed), data loss on
revert, no alert, or last-good not advancing on a healthy update.
- **WC6 (reordered)** — nightly = `nixos-rebuild switch` FIRST (warm/infra → latest, health-gated per
WC1.1) THEN full-cold sweep; MUST NOT run while a test run is in flight; if the health-gate rolled
an infra app back, alert fires and the sweep still runs against the healthy prior version.
- **WC8 carry** — confirm the leftover phase-2 cold app `lasu-0a6fb2` (orchestrator flagged it) is
fully torn down (app+volumes+secrets gone), since cold-teardown-sacred + disk budget are WC8.
- Still no gate CLAIMED; W0 in flight. Continue idle until a WC gate is claimed (watchdog pings).
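
The WC1.1 pattern above (record last-good, deploy latest, health-check, then commit or roll back with an alert) can be sketched roughly as follows. This is a minimal illustration, not the code in `runner/warm_reconcile.py`: the function name `reconcile`, its injected callables, the alert dict keys, and the `unchanged:` result are all assumptions made for the example. Passing `snapshot` and `restore` selects the stateful (keycloak-style) path; omitting them gives the stateless (traefik-style) version-only rollback.

```python
from typing import Callable, Optional, Tuple


def reconcile(
    last_good: str,
    latest: str,
    *,
    deploy: Callable[[str], None],      # hypothetical: deploys a version, may raise
    undeploy: Callable[[], None],       # hypothetical: stops the app, keeps its volume
    is_healthy: Callable[[], bool],     # hypothetical: health check after a deploy
    alert: Callable[[dict], None],      # stand-in for the PushNotification alert
    snapshot: Optional[Callable[[], None]] = None,  # raw volume snapshot (stateful only)
    restore: Optional[Callable[[], None]] = None,   # snapshot restore (stateful only)
) -> Tuple[str, str]:
    """Return (result, new_last_good) for one health-gated update attempt."""
    stateful = snapshot is not None and restore is not None
    if latest == last_good:
        return f"unchanged:{last_good}", last_good

    if stateful:
        # Snapshot only while undeployed, so the raw volume copy is consistent.
        undeploy()
        snapshot()

    try:
        deploy(latest)
        healthy = is_healthy()
    except Exception:
        # A failed deploy is treated the same as an unhealthy one.
        healthy = False

    if healthy:
        return f"upgraded:{last_good}->{latest}", latest

    # Roll back. Stateful apps restore data first, because a forward DB
    # migration makes a version-only rollback unsafe.
    if stateful:
        undeploy()
        restore()
    deploy(last_good)
    recovered = is_healthy()
    alert({"kind": "rollback", "attempted": latest,
           "last_good": last_good, "recovered": recovered})
    if not recovered:
        raise RuntimeError(f"rollback to {last_good} is unhealthy")
    return f"rolled-back:{latest}->{last_good}", last_good
```

The adversary proof then reduces to two observations on this shape: a healthy update must return the new version as last-good, and a broken one must leave last-good unchanged while still emitting the alert.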
## @2026-05-29 — WC1.2 added (pre-deploy safety gate, runs BEFORE WC1.1)
- **WC1.2 (NEW)** — pre-deploy safety gate on warm/infra auto-update. Rationale: a passing health
check does NOT prove a required manual migration ran, so gate BEFORE auto-deploy. Rule: only
auto-apply **non-major (patch/minor)** upgrades with **no manual-migration release notes**. If
current→latest is a **MAJOR recipe-version bump** OR the target `releaseNotes/<version>.md` flags a
manual migration → **DO NOT auto-upgrade**: stay on current + `PushNotification` alert **WITH the
release notes** (operator upgrades manually). Independent of, and runs BEFORE, the WC1.1
health-gated rollback. Applies to nightly rebuild (WC6) AND any reconcile.
- Detection (verify the impl uses both): primary = major recipe-version bump (coop-cloud version
`<upstream>+<recipe-semver>`; a major **recipe-semver** bump = breaking, matches abra
major-upgrade caution); secondary = scan target `releaseNotes/<version>.md` for manual-migration
markers.
- **ADVERSARY PROOF (mandatory):** simulate a major / manual-migration "latest" → confirm
**hold-on-current** (no deploy attempted) + alert fired **carrying the release notes**; NO silent
auto-upgrade. Watch for: a major bump slipping through as if patch; releaseNotes not scanned;
alert without the notes; or the gate firing on a legitimate patch/minor (false hold).
- Ordering check: WC1.2 must short-circuit BEFORE WC1.1 even snapshots/deploys — i.e. on a held
upgrade there is no snapshot/deploy/rollback churn, just hold + alert.
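
A rough sketch of the WC1.2 decision, to make the two detection rules and their ordering concrete. The names `safety_gate`, `_majors`, and `MANUAL_MIGRATION_RE` are hypothetical, and the marker pattern is an assumption: the actual markers the implementation scans for are not stated in this ledger. The sketch compares the leading integer on both sides of the `+` in a version tag, so it does not depend on which side carries the recipe semver. It assumes plain numeric tags (no `v` prefix).

```python
import re
from typing import Optional, Tuple

# Assumed marker pattern; the real manual-migration markers may differ.
MANUAL_MIGRATION_RE = re.compile(r"manual\s+migration", re.IGNORECASE)


def _majors(tag: str) -> Tuple[int, ...]:
    """Leading integer of each '+'-separated part: '10.7.1+26.6.2' -> (10, 26)."""
    return tuple(int(part.split(".")[0]) for part in tag.split("+"))


def is_major_bump(current: str, latest: str) -> bool:
    """True if either side of the tag increases its major component."""
    return any(new > old for old, new in zip(_majors(current), _majors(latest)))


def safety_gate(current: str, latest: str, release_notes: Optional[str]) -> str:
    """Decide before any snapshot or deploy whether an auto-upgrade may proceed."""
    # Primary rule: hold on a major bump.
    if is_major_bump(current, latest):
        return f"held-major:{current}->{latest}"
    # Secondary rule: hold if the target release notes flag a manual migration.
    if release_notes and MANUAL_MIGRATION_RE.search(release_notes):
        return f"held-manual-migration:{current}->{latest}"
    return "proceed"
```

Because the gate returns before anything touches the running app, a hold produces no snapshot, deploy, or rollback churn; the caller only needs to send the alert with the release notes attached.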
## @2026-05-29 — Standing probe (WC8 carry): lasu-0a6fb2 teardown — CLEAN
Independent cold check on cc-ci (not a gate verdict; WC8 not yet claimed). The orchestrator-flagged
leftover phase-2 cold app `lasu-0a6fb2` is **fully gone**: `abra app ls -S -m` shows no lasu app,
`docker service ls` no lasu services, `docker volume ls` no lasu volumes, `docker secret ls` no lasu
secrets. Disk `/` at **63% (9.8G free / 28G)** — consistent with the Builder's claimed 96%→62%
reclaim. Cold-teardown-sacred holds for this orphan; disk budget healthy. Will fold into the WC8
verdict when that gate is claimed. Still no WC gate CLAIMED; W0 → next is W0.9 WC1.1 live proofs.
## @2026-05-29 — Watchdog pinged [C1]; NO formal gate claim yet — read-only pre-review (NOT a verdict)
Watchdog signalled a [C1] claim, but `STATUS-2w.md ## Gate` reads "(none claimed yet)" and the
Builder's own STATUS lists **W0.7 + W0.8 as remaining** before claiming WC1/WC1.1/WC1.2, with a build
finding (lasuite-docs in-place `--chaos` redeploy nginx `host not found in upstream ...backend:8000`
race) currently **blocking the WC1 dependent-green proof**. Per §6.1 there is NO formal gate to pass
yet — ping likely fired on the "reconciler-side WC1/WC1.1/WC1.2 proven" wording in 819c1bc. I will
NOT log a WC1/WC1.1/WC1.2 PASS until the gate is formally CLAIMED and I run the marquee reproduce cold.
**Read-only pre-review done now (no live churn — avoids colliding with the Builder's W0.8 keycloak work):**
- Live state consistent with the W0.9 narrative: `warm-keycloak.service` active; live image
`keycloak/keycloak:26.6.2` + `mariadb:12.2`; `/var/lib/ci-warm/keycloak/last_good = 10.7.1+26.6.2`
(the recovered canonical — correctly NOT advanced to the simulated-broken 10.7.10).
- Static review of `runner/warm_reconcile.py` — no defects:
- WC1.2 safety gate runs BEFORE any snapshot/deploy (L335-343); a hold returns with NO
snapshot/deploy/rollback churn; both `held-major` + `held-manual-migration` alerts carry `release_notes`.
- `is_major_bump` is conservative: holds on a major bump of EITHER the recipe-semver (pre-`+`) OR
the app-version (post-`+`), so a keycloak app-major (25->26, the DB-migration case) is also held.
Neutralizes a tag-format wording mismatch (plan §WC1.2 says `<upstream>+<recipe-semver>`; code's
observed data says `<recipe-semver>+<app-version>`) — checking both sides covers intent either way.
Not a defect; noted so I don't re-flag it.
- WC1.1 rolls back on BOTH a deploy exception AND an unhealthy result (L356-362); stateful path
restores the snapshot before redeploying the prior version; raises if the rollback itself is
unhealthy. Alert `rollback` carries last_good/attempted/recovered/notes.
- **OPEN FLAG to confirm at the live reproduce:** `/var/lib/ci-warm/alerts/` is currently EMPTY,
though W0.9 claims a rollback alert was written there and the alert-relay archiving to `alerts/seen/`
is explicitly deferred/unwired. Likely benign (Builder cleaned up the W0.9 test alert), but I MUST
confirm a `*rollback*.json` alert actually lands during my own cold reproduce (no silent no-alert).
- **PLAN for the formal gate:** when WC1 is CLAIMED, run the Builder's reproduce (STATUS L79-83):
fake tags `10.7.9+26.6.2`(good) + `10.7.10+26.6.2`(broken KC_HOSTNAME), `CCCI_SKIP_FETCH=1
cc-ci-run runner/warm_reconcile.py keycloak` x2 → expect `upgraded:` then `rolled-back:`, marker
realm survives, last_good unchanged at prior, a `*rollback*.json` alert; PLUS the WC1 headline
(dependent SSO custom test green vs warm keycloak + concurrent distinct realms + reaping) + a
major/manual-migration WC1.2 hold proof. Sent a BUILDER-INBOX heads-up to coordinate keycloak timing.
## @2026-05-29 — Gate WC1+WC1.1+WC1.2 FORMALLY CLAIMED (985686f) — cold verification IN PROGRESS
Builder set the formal `## Gate` (after my pre-claim note rebased on top) and parked keycloak for me;
inbox resolved my alerts-dir flag (W0.9 test alert intentionally `rm`'d to avoid false operator
alarm). Running the full cold reproduce from my OWN clone synced to `cc-ci:/root/cc-ci-adv-verify`.
**check1 — unpinned + healthy + wired — PASS.** `grep kcVersion nix/modules/warm-keycloak.nix` → only
a comment ("the kcVersion pin is gone"), no pin; unit execs `warm_reconcile.py keycloak` (fetches at
runtime ⇒ D8 closure independent of live version). `warm-keycloak.service`=active, `is-system-running`
=running, 0 failed units, health `/realms/master`=**200**, TYPE=keycloak:10.7.1+26.6.2 (canonical).
**check2 — units — PASS.** From my synced clone: `cc-ci-run -m pytest tests/unit -q` → **57 passed**.
**check4 — concurrency + reaping (deploy-free) — PASS.** My own driver vs the live warm kc:
`realm_for` distinct per run-hex (`lasuite-docs-aaa111` ≠ `...bbb222`); created 3 realms, each
`oidc_password_grant` returns a valid 3-part JWT (len 1379) with matching discovery issuer;
`reap_orphaned_realms(live={aaa111})` deleted exactly `bbb222`+`ccc333` and **KEPT `aaa111`**
(concurrency-safe — a live run never loses its realm); kc left clean (`['master']`).
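
The reaping rule checked here (delete per-run realms whose run is no longer live, never touch a live run's realm or `master`) can be illustrated with a small pure function. This is a sketch under assumptions, not the harness code: `realm_name` and `orphaned_realms` are hypothetical stand-ins for `realm_for` and `reap_orphaned_realms`, and the `<recipe>-<6hex>` realm naming is inferred from the realm names quoted in this entry.

```python
import re
from typing import Iterable, List, Set

# Assumed naming: per-run realms end in "-<6 lowercase hex chars>".
RUN_REALM_RE = re.compile(r"^(?P<recipe>.+)-(?P<run>[0-9a-f]{6})$")


def realm_name(recipe: str, run_hex: str) -> str:
    """Namespaced realm for one run, so concurrent runs never share a realm."""
    return f"{recipe}-{run_hex}"


def orphaned_realms(existing: Iterable[str], live_runs: Set[str]) -> List[str]:
    """Return the per-run realms that belong to no live run.

    Realms that do not match the per-run pattern (such as 'master')
    are never candidates for deletion.
    """
    orphans = []
    for realm in existing:
        match = RUN_REALM_RE.match(realm)
        if match and match.group("run") not in live_runs:
            orphans.append(realm)
    return sorted(orphans)
```

Selecting by the set of live run-hexes, rather than by realm age, is what keeps the reaper safe to run while other dependents are still mid-test.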
**check5 — WC1.1 MARQUEE health-gated rollback w/ data integrity — PASS (reconciler).** My own
reproduce (fake tags I staged, marker realm = the data):
- Phase B healthy upgrade: `upgraded:10.7.1+26.6.2->10.7.9+26.6.2`, last_good advanced→10.7.9,
health=200, marker realm intact. ✓
- Phase C broken latest: staged `10.7.10+26.6.2` at a commit with `KC_HOSTNAME=:::bad-host:::`. The
reconciler (stateful path) undeployed → **snapshotted** → attempted deploy of 10.7.10 → **abra deploy
FAILED** (lint R009: env value not a string) → caught the deploy exception → **rolled back**:
undeploy → **restore snapshot** → redeploy 10.7.9 → **healthy (200)**. Result
`rolled-back:10.7.10+26.6.2->10.7.9+26.6.2`. Verified post-state: **marker realm INTACT (data
preserved through the snapshot/restore round-trip)**, `last_good` **NOT advanced** (still 10.7.9),
and a real persistent alert `20260529T005510Z-keycloak-rollback.json` with
`attempted=10.7.10+26.6.2, last_good=10.7.9+26.6.2, recovered=True`. ✓✓✓ This is the phase's marquee
proof and it holds. (Nuance: my broken tag failed at abra LINT, exercising the deploy-FAILURE→rollback
branch — exactly the path commit 07ea951 added; the unhealthy-deploy branch is covered by units +
code. The volume wasn't mutated by the failed deploy, but the snapshot→restore round-trip DID
execute and the marker survived; combined with W0.5's mutate→restore proof, data integrity is sound.)
- **Test-script bug (MINE, not the reconciler):** my phase-D cleanup deleted the `10.7.9` tag while kc
was still deployed on it, so abra couldn't resolve the from-version and left kc undeployed (404) on
TYPE=10.7.9 with the marker still present. **NOT a WC1.1 defect** — the reconciler behaved correctly
given the broken state I induced. Recovery to canonical 10.7.1+26.6.2 (healthy, marker removed, fake
tags dropped) is running now; will confirm clean before finalizing the gate verdict.
**Remaining:** check3 (headline lasuite-docs SSO e2e) + check6 (WC1.2 holds) — run after recovery
confirms the warm kc is canonical+healthy (they share it). No gate PASS line written yet.
**Recovery — OK.** Warm kc restored to canonical: TYPE=10.7.1+26.6.2, last_good=10.7.1+26.6.2,
health=200, realms=['master'], no fake tags. (Recovery log also re-confirmed the marker realm survived
the rollback before I cleaned it up: `realms=['advmarker-rollback','master']` on redeploy.)
**check6 — WC1.2 pre-deploy safety holds — PASS.** My own driver vs the live warm kc:
(a) major fake tag `11.0.0+27.0.0` → `held-major:10.7.1+26.6.2->11.0.0+27.0.0`, kc TYPE **unchanged**,
alert `*-held-major.json` with `latest`+`release_notes`. (b) minor `10.7.2+26.6.3` + a
`releaseNotes/…md` flagging "manual migration" → `held-manual-migration:…`, kc TYPE **unchanged**,
alert **carries the notes**. No deploy/snapshot/last_good churn on either hold; recipe left clean.
The gate (WC1.2) short-circuits before WC1.1 as required.
**check3 — headline SSO e2e — IN PROGRESS.** `RECIPE=lasuite-docs STAGES=install,custom` from my
synced clone: cold per-run domain `lasu-c25d41` created (recipe deployed COLD), `DEPS declared:
['keycloak']` (warm path). Awaiting convergence + custom SSO tests.
## @2026-05-29 — WC1: PASS · WC1.2: PASS · WC1.1(keycloak-stateful): PASS — gate 985686f cleared
All six checks re-run COLD from my own clone synced to `cc-ci:/root/cc-ci-adv-verify` (NOT the
Builder's clone). Verdict for the formally-claimed gate **WC1 + WC1.1 + WC1.2**:
- **WC1 — PASS.** Unpinned (no `kcVersion`; reconciler fetches at runtime), `warm-keycloak.service`
active + system running + health 200. Headline e2e (check3): `RECIPE=lasuite-docs
STAGES=install,custom` → install **pass** (generic `test_serving` + overlay
`test_serving_and_frontend`, generic-first), custom **pass** (5 tests incl.
`test_oidc_login_via_keycloak` + `test_oidc_password_grant_against_dep_keycloak` against the warm
kc), **`deploy-count = 1 (expect 1)`** (keycloak NOT co-deployed), log shows `dep: using live-warm
keycloak @ warm-keycloak…(per-run realm)` and `dep: deleted per-run realm lasuite-docs-c25d41`.
Post-run: warm kc realms = **`['master']`** only (no leftover), no lasu* service/volume/secret (cold
teardown sacred), warm kc still canonical+healthy. Concurrency+reaping (check4, deploy-free):
`realm_for` distinct per run-hex; 3 realms each yield a valid JWT + matching discovery issuer;
`reap_orphaned_realms(live={aaa111})` deletes exactly the 2 orphans, KEEPS the live one. Units
(check2): 57 passed.
- **WC1.2 — PASS.** (check6) major `11.0.0+27.0.0` → `held-major`, kc untouched, alert w/ notes;
minor `10.7.2+26.6.3` + manual-migration releaseNotes → `held-manual-migration`, kc untouched,
alert **carries the notes**. No deploy/snapshot/last_good churn on a hold; gate short-circuits
before WC1.1.
- **WC1.1 (keycloak, stateful) — PASS.** (check5, MARQUEE) my own fake-tag reproduce: healthy
upgrade commits last_good := latest; a broken latest (`10.7.10`, `KC_HOSTNAME=:::bad-host:::`)
fails to deploy → reconciler undeploy→snapshot→(deploy fails)→**restore snapshot**→redeploy prior
**healthy**, with the **marker realm (data) INTACT**, `last_good` NOT advanced, and a real
persistent `*-rollback.json` alert (`attempted=10.7.10 last_good=10.7.9 recovered=true`). The
exit-1 in my run was a bug in MY cleanup script (deleted a tag abra still needed) — NOT a
reconciler defect; warm kc since recovered to canonical 10.7.1+26.6.2 healthy.
**Gate verdict: PASS @2026-05-29** for WC1 + WC1.2 + WC1.1(keycloak-stateful), exactly the scope the
Builder claimed (STATUS §SCOPE). The Builder may proceed to W1 (WC2/WC3 canonical registry).
**OPEN (tracked, NOT a blocker for this gate, but MUST close before Phase-2w `## DONE`):**
- **traefik WC1.1 (W0.10)** — traefik's stateless version-rollback is NOT yet migrated onto the shared
health-gated reconciler (still `proxy.nix` chaos-deploy). WC1.1 is therefore only *partially* closed
(keycloak only). I will require a cold proof of traefik's health-gated version-rollback before the
DONE handshake. Recorded so it is not lost. No finding filed (honest scope per the Builder's claim).
## @2026-05-29 — Watchdog pinged [C2 C3]; NO formal WC2/WC3 claim yet (premature)
`## Gate` holds only the WC1 PASS; `grep CLAIMED|awaiting adversary` → none. STATUS "In flight" shows
W1 mid-build: **W1.1 registry module DONE** (`runner/harness/canonical.py`, 61 unit pass) but **W1.2
(the LIVE data-warm proof: seed → undeploy-keep-volume → redeploy-reattach → data survives) is "Next"**
and the Builder explicitly says "Then close WC2/WC3." So WC2/WC3 are NOT yet claimable — ping fired on
"WC2/WC3" wording in commits b6ef83a/563156a, not a §6.1 gate. No verdict written.
Read-only glance (NOT a verdict): canonical.py is a sound registry primitive — `seed_canonical`
honors snapshot-while-undeployed; `has_canonical` requires both a registry record AND retained
volume; deploy/undeploy-keep-volume lifecycle matches WC2. Will cold-verify when WC2/WC3 is formally
CLAIMED (the live data-warm round-trip is the key thing to re-run myself). Idle until then.
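
The `has_canonical` rule noted above (a registry record AND a retained volume, neither alone) might look roughly like this. It is an illustrative sketch only: the function name `has_canonical_sketch`, the single-file JSON registry with a `recipe` field, and the injected `volume_exists` callable are assumptions, not the interface of `runner/harness/canonical.py`.

```python
import json
from pathlib import Path
from typing import Callable


def has_canonical_sketch(
    registry_file: Path,
    recipe: str,
    volume_exists: Callable[[str], bool],  # hypothetical: checks the retained data volume
) -> bool:
    """True only if the registry records this recipe AND its volume is retained."""
    try:
        record = json.loads(registry_file.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        # No readable registry means no canonical, whatever volumes exist.
        return False
    if not isinstance(record, dict) or record.get("recipe") != recipe:
        return False
    return bool(volume_exists(recipe))
```

Requiring both conditions matters for the no-canonical fallback: a stale registry entry without data, or an orphaned volume without a record, should both be treated as "no canonical".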
## @2026-05-29 — WC2 + WC3 — PASS (gate 4ce80f8 cleared; cold-verified from own clone)
WC2/WC3 formally claimed (4ce80f8; my premature note rebased on top). Builder parked custom-html (first
data-warm canonical, left idle) + traefik for me. All re-run COLD from `cc-ci:/root/cc-ci-adv-verify`.
- **Units — PASS:** `cc-ci-run -m pytest tests/unit -q` → **61 passed** (incl. test_canonical, test_warmsnap).
- **WC2 data-warm canonical model — PASS.** Idle state matches: `canonical.json`
{recipe=custom-html, domain=warm-custom-html.ci.commoninternet.net, version=1.11.0+1.29.0,
commit=wc2proof, **status=idle**}; content volume **retained** (`warm-custom-html_…_content`); **no
service** running (idle = undeployed-keep-volume); stable `warm-<recipe>` domain (≠ cold
`<recipe[:4]>-<6hex>`). My OWN data-warm round-trip: deploy_canonical → wrote my marker
`ADV-OWN-MARKER-a1b2c3` → `undeploy_keep_volume` (**app down + volume retained**, registry→idle) →
deploy_canonical → **my marker SURVIVED**. The Builder's known-good marker also reattached. HTTPS
serving confirmed (`/`=200, `/index.html`=200; an earlier one-off 404 was a curl-vs-deploy-converge
race, 200 once settled — not a defect).
- **WC3 known-good snapshots — PASS.** Snapshot is a **raw per-volume tar taken while undeployed**
(`/var/lib/ci-warm/custom-html/snapshot/volumes/warm-custom-html_…_content.tar` + meta.json), one
last-good per app under the stable path. My OWN restore round-trip: mutate (deleted the known-good
`wc2-marker.txt`) → undeploy → `warmsnap.restore` → deploy → **known-good marker BACK with exact
content `WC2-DATA-MARKER-7f3a9c`** AND my mutation gone → restore round-trips the EXACT known-good.
(Same warmsnap helper already cold-proven on keycloak in check5/W0.5.) `has_canonical` correctly
requires BOTH a registry record AND a retained volume.
- **D8/WC8 (spot):** `/var/lib/ci-warm/` is cache — no nix module references it as a source; full D8
closure-exclusion folds into the WC8 verdict later.
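
For reference, the WC3 properties verified above (raw per-volume tar, one last-known-good replaced atomically, restore returning the exact known-good content) can be sketched with the standard library alone. This is not the `warmsnap` helper: `snapshot_volume` and `restore_volume` are hypothetical names, and the sketch assumes the app is already undeployed and that the volume's data directory is directly readable as a plain directory.

```python
import os
import shutil
import tarfile
from pathlib import Path


def snapshot_volume(volume_dir: Path, snapshot_tar: Path) -> None:
    """Write a raw tar of the volume, replacing the previous snapshot atomically."""
    snapshot_tar.parent.mkdir(parents=True, exist_ok=True)
    tmp = snapshot_tar.with_name(snapshot_tar.name + ".tmp")
    with tarfile.open(tmp, "w") as tar:
        for child in sorted(volume_dir.iterdir()):
            tar.add(child, arcname=child.name)
    # os.replace is atomic on the same filesystem: readers see either the old
    # snapshot or the new one, never a half-written file.
    os.replace(tmp, snapshot_tar)


def restore_volume(snapshot_tar: Path, volume_dir: Path) -> None:
    """Make the volume contents match the snapshot exactly."""
    # Clear current contents first so files created after the snapshot vanish.
    for child in volume_dir.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
    # Use the safer 'data' extraction filter where this Python provides it.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(snapshot_tar, "r") as tar:
        tar.extractall(volume_dir, **extra)
```

Clearing the volume before extracting is what makes the restore round-trip exact: a deleted known-good file comes back and a later mutation disappears, which is the same mutate-then-restore check used in the verdict above.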
Two crashes during my runs were **bugs in my OWN driver scripts** (a tag I deleted that abra still
needed in check5; `grep -rl` returning rc=1 on no-match which `exec_in_app` raises on) — NOT product
defects. Canonical left clean: idle, volume retained, known-good content, snapshot intact, v1.11.0+1.29.0.
**Gate verdict: WC2 + WC3 — PASS @2026-05-29.** Builder may proceed to W2 (`--quick`).
**Still tracked-open before Phase-2w DONE (unchanged):** traefik WC1.1 (W0.10) cold proof.