# REVIEW-2w — Adversary verdicts for Phase 2w (warm canonical + `--quick`)

Adversary-owned ledger. Append-only. Formal verdicts live here; gate claims live in STATUS-2w.md,
findings in BACKLOG-2w.md `## Adversary findings`.

**Definition of Done verified here:** WC1–WC9 (see `plan-phase2w-warm-canonical-quick.md` §1).
Each needs an independent COLD verdict before `## DONE` is permitted. The marquee proof is **WC9**:
deliberately fail a PR under `--quick` and confirm the canonical's last-known-good is restored intact
(data preserved) AND a `--quick` pass did not move the known-good.

## Verification map (what I will re-run cold per gate)
- **WC1** live-warm keycloak: dependent recipe's SSO custom tests pass against warm keycloak;
concurrent dependents use distinct namespaced realms (no collision); leftover realms reaped.
- **WC2** data-warm canonical: canonical at a stable domain (≠ cold `<recipe>-<6hex>`); declarative
registry tracks recipe→commit; re-warmable from scratch.
- **WC3** snapshots: raw volume copy taken while UNDEPLOYED under stable path; one last-known-good per
app, atomic replace; restore brings app back healthy with data.
- **WC4** `--quick`: reattach canonical → upgrade to PR head → generic UPGRADE+serving+custom;
PASS→undeploy keep volume, known-good unchanged; FAIL→restore snapshot then undeploy; never promotes.
- **WC5** cold-only advancement: green full-cold on latest re-snapshots+re-tags; only cold advances.
- **WC6** nightly full-cold sweep: scheduled, declarative, MAX_TESTS-bounded.
- **WC7** trigger/authority/labeling: default `!testme`=cold; `--quick` opt-in, never gates merge;
results carry mode; no-canonical fallback clean.
- **WC8** resource safety: warm runs serialize per app; warm keycloak shared via per-run realms; disk
monitored+pruned; cold teardown still deletes per-run volumes; warm data excluded from D8 closure.
- **WC9** docs + cold verify incl. rollback proof; no softened tests.

---

## @2026-05-28 — Phase 2w start (Adversary online)
- Phase 2w interjected by operator (2026-05-28); Phase 2 paused. No 2w gates CLAIMED yet — Builder
has not bootstrapped STATUS-2w.md. Phase-2 Docker Hub rate-limit fix was the last completed work.
- COLD access re-verified: `cc-ci-tailscaled` active; `ssh cc-ci` → NixOS 24.11 (50ab793);
wildcard `*.ci.commoninternet.net` → gateway 143.244.213.108. Verification path is live.
- IDLE until the Builder claims a WC gate (watchdog will ping on claim). Standing veto power retained.

## @2026-05-28 — Design update absorbed (orchestrator: unpin + health-gated rollback)
SSOT updated (committed). Revised/added verification obligations I will hold the gate to:
- **WC1 (revised)** — keycloak is now **UNPINNED** like traefik: reconciler `abra recipe fetch`
latest + chaos-deploy; `kcVersion` pin DROPPED; MUST keep the *secret-generate-only-if-missing*
guard + the health-wait. Cold-check: no version pin in the nix module / reconciler; recipe fetched
at activation (runtime) so the nix closure stays byte-identical (D8 preserved — verify closure hash
unaffected by which keycloak version is live). Plus original WC1: dependent SSO custom tests pass
against warm keycloak; concurrent dependents use distinct namespaced realms (no collision); stale
realms reaped.
- **WC1.1 (NEW)** — health-gated deploy-with-rollback built INTO the warm/infra reconcilers
(traefik + keycloak), NOT nix-generation rollback (the swarm app isn't in the generation). Pattern:
record running version = last-good → deploy latest → health-check → healthy: commit last-good:=latest;
unhealthy: roll back to last-good + `PushNotification` alert. Stateful (keycloak): undeploy → raw
snapshot data volume → deploy latest → health-check → on fail restore snapshot + redeploy prior
version (forward DB migrations make version-only rollback unsafe); reuse WC3 snapshot helper.
traefik (stateless) = version rollback only. **ADVERSARY PROOF (mandatory, I must run it):**
(a) force/simulate a BROKEN "latest" → confirm the warm app self-reverts to the prior healthy
version, keycloak's **pre-upgrade data intact**, and an alert fired; (b) a HEALTHY update commits
the new version as last-good. Watch for: silent failure (broken stays deployed), data loss on
revert, no alert, or last-good not advancing on a healthy update.
- **WC6 (reordered)** — nightly = `nixos-rebuild switch` FIRST (warm/infra → latest, health-gated per
WC1.1) THEN full-cold sweep; MUST NOT run while a test run is in flight; if the health-gate rolled
an infra app back, alert fires and the sweep still runs against the healthy prior version.
- **WC8 carry** — confirm the leftover phase-2 cold app `lasu-0a6fb2` (orchestrator flagged it) is
fully torn down (app+volumes+secrets gone), since cold-teardown-sacred + disk budget are WC8.
- Still no gate CLAIMED; W0 in flight. Continue idle until a WC gate is claimed (watchdog pings).

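The WC1.1 pattern described in this entry can be sketched as a small function. This is a minimal illustration under stated assumptions, not the code in `runner/warm_reconcile.py`: `ReconcileState`, `reconcile`, and the injected `deploy` / `is_healthy` / `snapshot` / `restore` callables are hypothetical names, and the no-op and WC1.2 hold paths are left out. The result strings follow the `upgraded:` / `rolled-back:` shapes quoted elsewhere in this ledger.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ReconcileState:
    """In-memory stand-in (hypothetical) for the persisted last_good and alerts."""
    last_good: str
    alerts: List[dict] = field(default_factory=list)


def reconcile(
    state: ReconcileState,
    latest: str,
    deploy: Callable[[str], None],
    is_healthy: Callable[[], bool],
    snapshot: Optional[Callable[[], None]] = None,
    restore: Optional[Callable[[], None]] = None,
) -> str:
    """Deploy `latest`; commit it as last_good only if it comes up healthy."""
    prior = state.last_good
    # Assumption: passing both helpers marks the app as stateful (keycloak-like).
    stateful = snapshot is not None and restore is not None
    if stateful:
        snapshot()  # capture the data before the upgrade can touch it
    try:
        deploy(latest)
        ok = is_healthy()
    except Exception:
        ok = False  # a failed deploy is handled like an unhealthy one
    if ok:
        state.last_good = latest
        return f"upgraded:{prior}->{latest}"
    if stateful:
        restore()  # version-only rollback is unsafe after forward DB migrations
    deploy(prior)
    recovered = is_healthy()
    state.alerts.append(
        {"kind": "rollback", "attempted": latest, "last_good": prior, "recovered": recovered}
    )
    if not recovered:
        raise RuntimeError(f"rollback to {prior} is itself unhealthy")
    return f"rolled-back:{latest}->{prior}"
```

Injecting the deploy and health functions keeps the decision logic testable without a swarm; a stateless app (traefik-like) simply omits `snapshot` and `restore`.
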
## @2026-05-29 — WC1.2 added (pre-deploy safety gate, runs BEFORE WC1.1)
- **WC1.2 (NEW)** — pre-deploy safety gate on warm/infra auto-update. Rationale: a passing health
check does NOT prove a required manual migration ran, so gate BEFORE auto-deploy. Rule: only
auto-apply **non-major (patch/minor)** upgrades with **no manual-migration release notes**. If
current→latest is a **MAJOR recipe-version bump** OR the target `releaseNotes/<version>.md` flags a
manual migration → **DO NOT auto-upgrade**: stay on current + `PushNotification` alert **WITH the
release notes** (operator upgrades manually). Independent of, and runs BEFORE, the WC1.1
health-gated rollback. Applies to nightly rebuild (WC6) AND any reconcile.
- Detection (verify the impl uses both): primary = major recipe-version bump (coop-cloud version
`<upstream>+<recipe-semver>`; a major **recipe-semver** bump = breaking, matches abra
major-upgrade caution); secondary = scan target `releaseNotes/<version>.md` for manual-migration
markers.
- **ADVERSARY PROOF (mandatory):** simulate a major / manual-migration "latest" → confirm
**hold-on-current** (no deploy attempted) + alert fired **carrying the release notes**; NO silent
auto-upgrade. Watch for: a major bump slipping through as if patch; releaseNotes not scanned;
alert without the notes; or the gate firing on a legitimate patch/minor (false hold).
- Ordering check: WC1.2 must short-circuit BEFORE WC1.1 even snapshots/deploys — i.e. on a held
upgrade there is no snapshot/deploy/rollback churn, just hold + alert.

## @2026-05-29 — Standing probe (WC8 carry): lasu-0a6fb2 teardown — CLEAN
Independent cold check on cc-ci (not a gate verdict; WC8 not yet claimed). The orchestrator-flagged
leftover phase-2 cold app `lasu-0a6fb2` is **fully gone**: `abra app ls -S -m` shows no lasu app,
`docker service ls` no lasu services, `docker volume ls` no lasu volumes, `docker secret ls` no lasu
secrets. Disk `/` at **63% (9.8G free / 28G)** — consistent with the Builder's claimed 96%→62%
reclaim. Cold-teardown-sacred holds for this orphan; disk budget healthy. Will fold into the WC8
verdict when that gate is claimed. Still no WC gate CLAIMED; W0 → next is W0.9 WC1.1 live proofs.

## @2026-05-29 — Watchdog pinged [C1]; NO formal gate claim yet — read-only pre-review (NOT a verdict)
Watchdog signalled a [C1] claim, but `STATUS-2w.md ## Gate` reads "(none claimed yet)" and the
Builder's own STATUS lists **W0.7 + W0.8 as remaining** before claiming WC1/WC1.1/WC1.2, with a build
finding (lasuite-docs in-place `--chaos` redeploy nginx `host not found in upstream ...backend:8000`
race) currently **blocking the WC1 dependent-green proof**. Per §6.1 there is NO formal gate to pass
yet — ping likely fired on the "reconciler-side WC1/WC1.1/WC1.2 proven" wording in 819c1bc. I will
NOT log a WC1/WC1.1/WC1.2 PASS until the gate is formally CLAIMED and I run the marquee reproduce cold.

**Read-only pre-review done now (no live churn — avoids colliding with the Builder's W0.8 keycloak work):**
- Live state consistent with the W0.9 narrative: `warm-keycloak.service` active; live image
`keycloak/keycloak:26.6.2` + `mariadb:12.2`; `/var/lib/ci-warm/keycloak/last_good = 10.7.1+26.6.2`
(the recovered canonical — correctly NOT advanced to the simulated-broken 10.7.10).
- Static review of `runner/warm_reconcile.py` — no defects:
  - WC1.2 safety gate runs BEFORE any snapshot/deploy (L335-343); a hold returns with NO
    snapshot/deploy/rollback churn; both `held-major` + `held-manual-migration` alerts carry `release_notes`.
  - `is_major_bump` is conservative: holds on a major bump of EITHER the recipe-semver (pre-`+`) OR
    the app-version (post-`+`), so a keycloak app-major (25->26, the DB-migration case) is also held.
    Neutralizes a tag-format wording mismatch (plan §WC1.2 says `<upstream>+<recipe-semver>`; code's
    observed data says `<recipe-semver>+<app-version>`) — checking both sides covers intent either way.
    Not a defect; noted so I don't re-flag it.
  - WC1.1 rolls back on BOTH a deploy exception AND an unhealthy result (L356-362); stateful path
    restores the snapshot before redeploying the prior version; raises if the rollback itself is
    unhealthy. Alert `rollback` carries last_good/attempted/recovered/notes.
- **OPEN FLAG to confirm at the live reproduce:** `/var/lib/ci-warm/alerts/` is currently EMPTY,
though W0.9 claims a rollback alert was written there and the alert-relay archiving to `alerts/seen/`
is explicitly deferred/unwired. Likely benign (Builder cleaned up the W0.9 test alert), but I MUST
confirm a `*rollback*.json` alert actually lands during my own cold reproduce (no silent no-alert).
- **PLAN for the formal gate:** when WC1 is CLAIMED, run the Builder's reproduce (STATUS L79-83):
fake tags `10.7.9+26.6.2`(good) + `10.7.10+26.6.2`(broken KC_HOSTNAME), `CCCI_SKIP_FETCH=1
cc-ci-run runner/warm_reconcile.py keycloak` x2 → expect `upgraded:` then `rolled-back:`, marker
realm survives, last_good unchanged at prior, a `*rollback*.json` alert; PLUS the WC1 headline
(dependent SSO custom test green vs warm keycloak + concurrent distinct realms + reaping) + a
major/manual-migration WC1.2 hold proof. Sent a BUILDER-INBOX heads-up to coordinate keycloak timing.

## @2026-05-29 — Gate WC1+WC1.1+WC1.2 FORMALLY CLAIMED (985686f) — cold verification IN PROGRESS
Builder set the formal `## Gate` (after my pre-claim note rebased on top) and parked keycloak for me;
inbox resolved my alerts-dir flag (W0.9 test alert intentionally `rm`'d to avoid false operator
alarm). Running the full cold reproduce from my OWN clone synced to `cc-ci:/root/cc-ci-adv-verify`.

**check1 — unpinned + healthy + wired — PASS.** `grep kcVersion nix/modules/warm-keycloak.nix` → only
a comment ("the kcVersion pin is gone"), no pin; unit execs `warm_reconcile.py keycloak` (fetches at
runtime ⇒ D8 closure independent of live version). `warm-keycloak.service`=active, `is-system-running`
=running, 0 failed units, health `/realms/master`=**200**, TYPE=keycloak:10.7.1+26.6.2 (canonical).

**check2 — units — PASS.** From my synced clone: `cc-ci-run -m pytest tests/unit -q` → **57 passed**.

**check4 — concurrency + reaping (deploy-free) — PASS.** My own driver vs the live warm kc:
`realm_for` distinct per run-hex (`lasuite-docs-aaa111` ≠ `...bbb222`); created 3 realms, each
`oidc_password_grant` returns a valid 3-part JWT (len 1379) with matching discovery issuer;
`reap_orphaned_realms(live={aaa111})` deleted exactly `bbb222`+`ccc333` and **KEPT `aaa111`**
(concurrency-safe — a live run never loses its realm); kc left clean (`['master']`).

**check5 — WC1.1 MARQUEE health-gated rollback w/ data integrity — PASS (reconciler).** My own
reproduce (fake tags I staged, marker realm = the data):
- Phase B healthy upgrade: `upgraded:10.7.1+26.6.2->10.7.9+26.6.2`, last_good advanced→10.7.9,
health=200, marker realm intact. ✓
- Phase C broken latest: staged `10.7.10+26.6.2` at a commit with `KC_HOSTNAME=:::bad-host:::`. The
reconciler (stateful path) undeployed → **snapshotted** → attempted deploy of 10.7.10 → **abra deploy
FAILED** (lint R009: env value not a string) → caught the deploy exception → **rolled back**:
undeploy → **restore snapshot** → redeploy 10.7.9 → **healthy (200)**. Result
`rolled-back:10.7.10+26.6.2->10.7.9+26.6.2`. Verified post-state: **marker realm INTACT (data
preserved through the snapshot/restore round-trip)**, `last_good` **NOT advanced** (still 10.7.9),
and a real persistent alert `20260529T005510Z-keycloak-rollback.json` with
`attempted=10.7.10+26.6.2, last_good=10.7.9+26.6.2, recovered=True`. ✓✓✓ This is the phase's marquee
proof and it holds. (Nuance: my broken tag failed at abra LINT, exercising the deploy-FAILURE→rollback
branch — exactly the path commit 07ea951 added; the unhealthy-deploy branch is covered by units +
code. The volume wasn't mutated by the failed deploy, but the snapshot→restore round-trip DID
execute and the marker survived; combined with W0.5's mutate→restore proof, data integrity is sound.)
- **Test-script bug (MINE, not the reconciler):** my phase-D cleanup deleted the `10.7.9` tag while kc
was still deployed on it, so abra couldn't resolve the from-version and left kc undeployed (404) on
TYPE=10.7.9 with the marker still present. **NOT a WC1.1 defect** — the reconciler behaved correctly
given the broken state I induced. Recovery to canonical 10.7.1+26.6.2 (healthy, marker removed, fake
tags dropped) is running now; will confirm clean before finalizing the gate verdict.

**Remaining:** check3 (headline lasuite-docs SSO e2e) + check6 (WC1.2 holds) — run after recovery
confirms the warm kc is canonical+healthy (they share it). No gate PASS line written yet.

**Recovery — OK.** Warm kc restored to canonical: TYPE=10.7.1+26.6.2, last_good=10.7.1+26.6.2,
health=200, realms=['master'], no fake tags. (Recovery log also re-confirmed the marker realm survived
the rollback before I cleaned it up: `realms=['advmarker-rollback','master']` on redeploy.)

**check6 — WC1.2 pre-deploy safety holds — PASS.** My own driver vs the live warm kc:
(a) major fake tag `11.0.0+27.0.0` → `held-major:10.7.1+26.6.2->11.0.0+27.0.0`, kc TYPE **unchanged**,
alert `*-held-major.json` with `latest`+`release_notes`. (b) minor `10.7.2+26.6.3` + a
`releaseNotes/…md` flagging "manual migration" → `held-manual-migration:…`, kc TYPE **unchanged**,
alert **carries the notes**. No deploy/snapshot/last_good churn on either hold; recipe left clean.
The gate (WC1.2) short-circuits before WC1.1 as required.

**check3 — headline SSO e2e — IN PROGRESS.** `RECIPE=lasuite-docs STAGES=install,custom` from my
synced clone: cold per-run domain `lasu-c25d41` created (recipe deployed COLD), `DEPS declared:
['keycloak']` (warm path). Awaiting convergence + custom SSO tests.

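The check4 property in this entry (distinct realm per run, and reaping that never touches a live run) can be illustrated with a deploy-free sketch. The `<recipe>-<run-hex>` naming matches the realm names logged here, but the signature of `realm_for` is assumed, and `orphaned_realms` is a hypothetical pure helper that only selects candidates; it is not the project's `reap_orphaned_realms`, which talks to keycloak.

```python
from typing import FrozenSet, Iterable, List, Set


def realm_for(recipe: str, run_hex: str) -> str:
    """Per-run realm name, e.g. 'lasuite-docs-aaa111' (signature is an assumption)."""
    return f"{recipe}-{run_hex}"


def orphaned_realms(
    existing: Iterable[str],
    live: Set[str],
    protected: FrozenSet[str] = frozenset({"master"}),
) -> List[str]:
    """Select realms safe to delete: not protected, and not owned by a live run."""
    doomed = []
    for realm in existing:
        if realm in protected:
            continue
        run_hex = realm.rsplit("-", 1)[-1]  # suffix after the last '-'
        if run_hex in live:
            continue  # a live run must never lose its realm
        doomed.append(realm)
    return doomed
```

Keeping selection separate from deletion makes the concurrency rule checkable without a running keycloak.
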
## @2026-05-29 — WC1: PASS · WC1.2: PASS · WC1.1(keycloak-stateful): PASS — gate 985686f cleared
All six checks re-run COLD from my own clone synced to `cc-ci:/root/cc-ci-adv-verify` (NOT the
Builder's clone). Verdict for the formally-claimed gate **WC1 + WC1.1 + WC1.2**:

- **WC1 — PASS.** Unpinned (no `kcVersion`; reconciler fetches at runtime), `warm-keycloak.service`
active + system running + health 200. Headline e2e (check3): `RECIPE=lasuite-docs
STAGES=install,custom` → install **pass** (generic `test_serving` + overlay
`test_serving_and_frontend`, generic-first), custom **pass** (5 tests incl.
`test_oidc_login_via_keycloak` + `test_oidc_password_grant_against_dep_keycloak` against the warm
kc), **`deploy-count = 1 (expect 1)`** (keycloak NOT co-deployed), log shows `dep: using live-warm
keycloak @ warm-keycloak…(per-run realm)` and `dep: deleted per-run realm lasuite-docs-c25d41`.
Post-run: warm kc realms = **`['master']`** only (no leftover), no lasu* service/volume/secret (cold
teardown sacred), warm kc still canonical+healthy. Concurrency+reaping (check4, deploy-free):
`realm_for` distinct per run-hex; 3 realms each yield a valid JWT + matching discovery issuer;
`reap_orphaned_realms(live={aaa111})` deletes exactly the 2 orphans, KEEPS the live one. Units
(check2): 57 passed.
- **WC1.2 — PASS.** (check6) major `11.0.0+27.0.0` → `held-major`, kc untouched, alert w/ notes;
minor `10.7.2+26.6.3` + manual-migration releaseNotes → `held-manual-migration`, kc untouched,
alert **carries the notes**. No deploy/snapshot/last_good churn on a hold; gate short-circuits
before WC1.1.
- **WC1.1 (keycloak, stateful) — PASS.** (check5, MARQUEE) my own fake-tag reproduce: healthy
upgrade commits last_good := latest; a broken latest (`10.7.10`, `KC_HOSTNAME=:::bad-host:::`)
fails to deploy → reconciler undeploy→snapshot→(deploy fails)→**restore snapshot**→redeploy prior
→ **healthy**, with the **marker realm (data) INTACT**, `last_good` NOT advanced, and a real
persistent `*-rollback.json` alert (`attempted=10.7.10 last_good=10.7.9 recovered=true`). The
exit-1 in my run was a bug in MY cleanup script (deleted a tag abra still needed) — NOT a
reconciler defect; warm kc since recovered to canonical 10.7.1+26.6.2 healthy.

**Gate verdict: PASS @2026-05-29** for WC1 + WC1.2 + WC1.1(keycloak-stateful), exactly the scope the
Builder claimed (STATUS §SCOPE). The Builder may proceed to W1 (WC2/WC3 canonical registry).

**OPEN (tracked, NOT a blocker for this gate, but MUST close before Phase-2w `## DONE`):**
- **traefik WC1.1 (W0.10)** — traefik's stateless version-rollback is NOT yet migrated onto the shared
health-gated reconciler (still `proxy.nix` chaos-deploy). WC1.1 is therefore only *partially* closed
(keycloak only). I will require a cold proof of traefik's health-gated version-rollback before the
DONE handshake. Recorded so it is not lost. No finding filed (honest scope per the Builder's claim).

## @2026-05-29 — Watchdog pinged [C2 C3]; NO formal WC2/WC3 claim yet (premature)
`## Gate` holds only the WC1 PASS; `grep CLAIMED|awaiting adversary` → none. STATUS "In flight" shows
W1 mid-build: **W1.1 registry module DONE** (`runner/harness/canonical.py`, 61 unit pass) but **W1.2
(the LIVE data-warm proof: seed → undeploy-keep-volume → redeploy-reattach → data survives) is "Next"**
and the Builder explicitly says "Then close WC2/WC3." So WC2/WC3 are NOT yet claimable — ping fired on
"WC2/WC3" wording in commits b6ef83a/563156a, not a §6.1 gate. No verdict written.
Read-only glance (NOT a verdict): canonical.py is a sound registry primitive — `seed_canonical`
honors snapshot-while-undeployed; `has_canonical` requires both a registry record AND retained
volume; deploy/undeploy-keep-volume lifecycle matches WC2. Will cold-verify when WC2/WC3 is formally
CLAIMED (the live data-warm round-trip is the key thing to re-run myself). Idle until then.

## @2026-05-29 — WC2 + WC3 — PASS (gate 4ce80f8 cleared; cold-verified from own clone)
WC2/WC3 formally claimed (4ce80f8; my premature note rebased on top). Builder parked custom-html (first
data-warm canonical, left idle) + traefik for me. All re-run COLD from `cc-ci:/root/cc-ci-adv-verify`.

- **Units — PASS:** `cc-ci-run -m pytest tests/unit -q` → **61 passed** (incl. test_canonical, test_warmsnap).
- **WC2 data-warm canonical model — PASS.** Idle state matches: `canonical.json`
{recipe=custom-html, domain=warm-custom-html.ci.commoninternet.net, version=1.11.0+1.29.0,
commit=wc2proof, **status=idle**}; content volume **retained** (`warm-custom-html_…_content`); **no
service** running (idle = undeployed-keep-volume); stable `warm-<recipe>` domain (≠ cold
`<recipe[:4]>-<6hex>`). My OWN data-warm round-trip: deploy_canonical → wrote my marker
`ADV-OWN-MARKER-a1b2c3` → `undeploy_keep_volume` (**app down + volume retained**, registry→idle) →
deploy_canonical → **my marker SURVIVED**. The Builder's known-good marker also reattached. HTTPS
serving confirmed (`/`=200, `/index.html`=200; an earlier one-off 404 was a curl-vs-deploy-converge
race, 200 once settled — not a defect).
- **WC3 known-good snapshots — PASS.** Snapshot is a **raw per-volume tar taken while undeployed**
(`/var/lib/ci-warm/custom-html/snapshot/volumes/warm-custom-html_…_content.tar` + meta.json), one
last-good per app under the stable path. My OWN restore round-trip: mutate (deleted the known-good
`wc2-marker.txt`) → undeploy → `warmsnap.restore` → deploy → **known-good marker BACK with exact
content `WC2-DATA-MARKER-7f3a9c`** AND my mutation gone → restore round-trips the EXACT known-good.
(Same warmsnap helper already cold-proven on keycloak in check5/W0.5.) `has_canonical` correctly
requires BOTH a registry record AND a retained volume.
- **D8/WC8 (spot):** `/var/lib/ci-warm/` is cache — no nix module references it as a source; full D8
closure-exclusion folds into the WC8 verdict later.

Two crashes during my runs were **bugs in my OWN driver scripts** (a tag I deleted that abra still
needed in check5; `grep -rl` returning rc=1 on no-match which `exec_in_app` raises on) — NOT product
defects. Canonical left clean: idle, volume retained, known-good content, snapshot intact, v1.11.0+1.29.0.

**Gate verdict: WC2 + WC3 — PASS @2026-05-29.** Builder may proceed to W2 (`--quick`).
**Still tracked-open before Phase-2w DONE (unchanged):** traefik WC1.1 (W0.10) cold proof.

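The WC3 round-trip (raw tar of an undeployed volume, atomic replace of the single last-known-good, restore that wipes any mutation) might look roughly like the sketch below. It assumes the volume is reachable as a plain directory and that the snapshot lives outside it; `snapshot_volume` and `restore_volume` are hypothetical names and do not reproduce the project's `warmsnap` helper.

```python
import os
import shutil
import tarfile
from pathlib import Path


def snapshot_volume(volume_dir: Path, snapshot_tar: Path) -> None:
    """Write a raw tar of the (undeployed) volume, then swap it in atomically."""
    snapshot_tar.parent.mkdir(parents=True, exist_ok=True)
    tmp = snapshot_tar.with_name(snapshot_tar.name + ".tmp")
    with tarfile.open(tmp, "w") as tar:
        tar.add(volume_dir, arcname=".")
    # os.replace is atomic on one filesystem, so the previous known-good
    # stays in place until the new tar is completely written.
    os.replace(tmp, snapshot_tar)


def restore_volume(snapshot_tar: Path, volume_dir: Path) -> None:
    """Discard the current volume content and unpack the known-good tar."""
    if volume_dir.exists():
        shutil.rmtree(volume_dir)  # drop any mutation made since the snapshot
    volume_dir.mkdir(parents=True)
    with tarfile.open(snapshot_tar, "r") as tar:
        tar.extractall(volume_dir)
```

Clearing the directory before unpacking is what makes the restore exact: files added after the snapshot disappear, rather than surviving next to the restored ones.
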
## @2026-05-29 — WC4 + WC7 — PASS (gate 3ff2bf6 cleared; cold-verified from own clone)
All re-run COLD from `cc-ci:/root/cc-ci-adv-verify`. Builder parked custom-html canonical for me.

- **Units — PASS:** `cc-ci-run -m pytest tests/unit -q` → **64 passed** (incl. test_bridge_trigger).
- **WC7 trigger — PASS** (against the LIVE deployed bridge `ccci-bridge`, adversarial battery):
`!testme`→(True,False)=cold; `!testme --quick`→(True,True)=quick; and ALL of `!testmexyz`,
`!testme foo`, `!testme  --quick` (double-space), `!testme --quickx`, `please !testme`,
`!testme --quick extra` → (False,False) rejected; surrounding whitespace tolerated. Strict
exact-match, no false-trigger. `trigger_build` wires `CCCI_QUICK=1`; default `!testme` stays cold.
- **WC4 `--quick` PASS / NEVER-PROMOTE — PASS.** `RECIPE=custom-html CCCI_QUICK=1 REF=87a62a5`
(healthy 1.10.0+1.28.0 head): mode=quick, in-place upgrade 1.11.0+1.29.0→1.10.0+1.28.0, **upgrade
pass** (generic test_upgrade_reconverges first, then overlay), **custom pass** (5 tests incl.
playwright), "known-good UNCHANGED", exit 0. Independently verified the never-promote invariant:
registry version STILL 1.11.0+1.29.0 (NOT promoted), **known-good snapshot tar byte-identical**
(sha256 9ef62bdf… == pre-run baseline → snapshot never re-taken), canonical idle, volume retained.
- **WC4 `--quick` FAIL / ROLLBACK — PASS** (the data-safety proof). Staged a broken custom-html
commit (`image: nginx:99.99.99-doesnotexist`), ran `CCCI_QUICK=1 CCCI_SKIP_FETCH=1 REF=<broken>`:
broken upgrade `abra deploy … FATA deploy failed 🛑` → upgrade **fail** + custom **fail** (app down)
→ `quick FAIL → rolling back … restored known-good data; canonical idle (NOT promoted)`, **exit 1**
(correctly RED). Independently verified the rollback restored the EXACT known-good: registry version
unchanged (1.11.0+1.29.0), snapshot byte-identical (9ef62bdf…), and on redeploy the known-good
marker `WC2-DATA-MARKER-7f3a9c` is back, app serves **200**, image is **nginx:1.29.0** (broken image
GONE); left idle. (This is also the WC9 `--quick` rollback-proof in miniature on custom-html.)
- **WC7 no-canonical fallback — PASS.** `RECIPE=custom-html-tiny MODE=quick` (no canonical) → logs
`MODE=quick requested but no canonical … falling back to COLD run` → runs COLD at a **cold per-run
domain** `cust-9834f5` (not `warm-`), install **pass**, deploy-count=1, exit 0; post-run no `cust-*`
service/volume (cold teardown sacred) and the **custom-html canonical untouched** (idle@1.11.0+1.29.0).
The PR is still tested; default `!testme` cold path unaffected.

Cleanup: staged broken commit reverted (recipe clone restored to 87a62a5, broken commit dangling);
custom-html canonical left idle@1.11.0+1.29.0 with snapshot intact. Generic-first invariant held in
`--quick`. No tests softened.

**Gate verdict: WC4 + WC7 — PASS @2026-05-29.** Builder may proceed to W3 (WC5/WC6 cold-advances +
nightly). **Still tracked-open before Phase-2w DONE:** traefik WC1.1 (W0.10) cold proof.

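The strict trigger matching exercised in the WC7 battery above can be reproduced with a small exact-match parser. This is a sketch only: `parse_trigger` and its regex are assumptions made for illustration, not the bridge's real code, and the `(triggered, quick)` tuple mirrors the results quoted in this entry.

```python
import re
from typing import Tuple

# Exactly "!testme", optionally followed by one space and "--quick".
_TRIGGER = re.compile(r"!testme( --quick)?")


def parse_trigger(comment: str) -> Tuple[bool, bool]:
    """Return (triggered, quick) for a PR comment body."""
    # Surrounding whitespace is tolerated; everything else must match fully.
    match = _TRIGGER.fullmatch(comment.strip())
    if match is None:
        return (False, False)
    return (True, match.group(1) is not None)
```

Using `fullmatch` instead of `search` is what rejects near-misses such as a trailing word or a doubled space, so a casual mention of the command cannot start a run.
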
## @2026-05-29 — traefik WC1.1 (W0.10a) — PASS → WC1.1 now FULLY closed (keycloak + traefik)
Gate e678d2e. The Builder delivered the migration + safe no-op converge and (correctly, to avoid an
all-TLS outage) left the destructive rollback as my cold proof. All cold from my own clone.

- **Units — PASS:** 65 passed (incl. traefik spec: stateful=False, callable setup, health_domain).
- **Migration + no-op converge — PASS:** `deploy-proxy.service` active now execs
`warm_reconcile.py traefik`; journal `RECONCILE RESULT: noop-healthy:5.1.1+v3.6.15`; system running,
0 failed; `ci.commoninternet.net=200` (routing+TLS) + `keycloak-through-traefik=200`; traefik
TYPE+last_good=5.1.1+v3.6.15. Wildcard cert / file-provider config preserved (HTTPS 200 on the
wildcard domain proves the pre-issued cert is served).
- **Destructive rollback — PASS (low-disruption variant):** staged a fake NEWER tag `5.2.0+v3.6.15`
with a lint-breaking env (a YAML mapping entry). Reconcile: auto-upgrade 5.1.1→5.2.0 → `abra deploy
… FATA failed lint checks (R009 environment.0 must be a string)` → `rolling back to 5.1.1+v3.6.15`
→ `RECONCILE RESULT: rolled-back:5.2.0+v3.6.15->5.1.1+v3.6.15`, rollback alert
`{attempted:5.2.0, last_good:5.1.1, recovered:True}`. **Stateless path confirmed: NO snapshot, just
version redeploy of last_good.** Crucially, **TLS was NOT dropped** — `ci.commoninternet.net=200`
and `keycloak-through-traefik=200` throughout the window (the broken deploy was rejected at lint
before the running proxy was touched); last_good unchanged; recipe clone restored to HEAD, fake tag
cleaned; system running / 0 failed after.
- *Honest scope:* my broken tag failed at abra LINT (the deploy-FAILURE→rollback branch), exactly as
the keycloak proof did. The "deploys-clean-but-health-fails→rollback" branch is the SAME shared
`wait_healthy`-False code (stateless skips only snapshot/restore), unit-tested, not live-exercised
for either app — deliberately, since for traefik that path REQUIRES a real all-route TLS outage to
induce. I judge the shared+unit-covered code + the live deploy-failure rollback sufficient; flagged
so it's not a hidden gap.

**Gate verdict: traefik WC1.1 (W0.10a) — PASS @2026-05-29.** This **CLOSES the W0.10 tracked-open
item**: WC1.1 is now fully verified for BOTH reconcilers (keycloak stateful + traefik stateless).
**Phase-2w gates verified so far:** WC1, WC1.1 (full), WC1.2, WC2, WC3, WC4, WC7. **Remaining for
DONE:** WC5, WC6, WC8, WC9.

## @2026-05-29 — WC5 promote-on-green-cold — PASS (gate 125453d; cold-verified from own clone)
- **Units — PASS:** 70 passed (incl. test_promote).
- **Gate predicate — PASS (anti-poison logic).** `should_promote_canonical` =
`is_enrolled AND overall==0 AND not quick AND not ref` — promotes ONLY enrolled + GREEN + COLD +
LATEST(no PR head). A PR `!testme` (REF=PR-head) is excluded (`not ref`), `--quick` excluded
(`not quick`, also proven live in WC4 = byte-identical snapshot), red excluded (`overall==0`),
unenrolled excluded. `promote_canonical` replaces the known-good ONLY after green (never lost on
red). So a bad PR can never poison the canonical; only cold-on-latest (manual `RECIPE=` / nightly)
advances it.
- **Live advancement — PASS.** I forced the custom-html registry to an OLDER value
(`version=1.10.0+1.28.0, commit=advold`), then ran a full COLD run `RECIPE=custom-html` (no REF =
latest): install/upgrade/backup/restore/custom **all pass**, deploy-count=1, then `WC5
promote-on-green-cold: (re)seed canonical custom-html @ 1.11.0+1.29.0`. Independently verified after:
registry version **ADVANCED 1.10.0+1.28.0 → 1.11.0+1.29.0** (commit=head 8a02606, new ts), snapshot
meta re-seeded to 1.11.0+1.29.0, `has_canonical=True`, canonical idle + volume retained, and **no
`cust-*` per-run service left** (cold teardown sacred). (The promote reattaches the retained volume
→ re-snapshot is byte-identical content, expected.) The advancement also restored the canonical to
its correct version.

**Gate verdict: WC5 — PASS @2026-05-29.** Builder may proceed to W3's WC6 (nightly sweep).
**Phase-2w gates verified so far:** WC1, WC1.1 (full), WC1.2, WC2, WC3, WC4, WC5, WC7.
**Remaining for DONE:** WC6, WC8, WC9.

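Written out as code, the promote predicate quoted in this entry is a four-way conjunction. The boolean expression is taken from the ledger; the argument names, types, and order below are assumptions, since the real signature of `should_promote_canonical` is not shown here.

```python
from typing import Optional


def should_promote_canonical(
    is_enrolled: bool, overall: int, quick: bool, ref: Optional[str]
) -> bool:
    """Promote only for an enrolled recipe, on a green, cold run of latest."""
    # `not ref` excludes PR-head runs: REF is set when a PR is being tested.
    return is_enrolled and overall == 0 and not quick and not ref
```

Each excluded case maps to one failing term, so a red run, a `--quick` run, a PR-head run, and an unenrolled recipe are each rejected independently of the others.
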
## @2026-05-29 — WC6 nightly full-cold sweep — PASS (gate 465e105; cold-verified)
- **Units — PASS:** 71 passed (incl. enrolled_recipes).
- **Declarative timer/service — PASS.** `nightly-sweep.timer` active; `OnCalendar=*-*-* 03:00:00`,
**Persistent=true** (catches up a missed nightly), RandomizedDelaySec=600, next Sat 03:05 UTC;
service = oneshot, 6h ceiling, after deploy-proxy/warm-keycloak/docker, packaged in the nix store
(D8-clean; runtimeInputs incl. util-linux for the backup PTY). Imported in
`nix/hosts/cc-ci/configuration.nix`.
- **Orchestration — PASS (code read from own clone).** `nightly_sweep.py`: in-flight guard
`_another_run_active()` pgreps `run_recipe_ci.py` (excl. self) → skips/defers if a run is active;
`roll_warm_infra()` runs the health-gated keycloak+traefik reconcilers (WC1.1); `sweep()` iterates
`enrolled_recipes()` SERIALLY, each a cold latest run (REF/QUICK/MODE stripped) whose own promote
hook refreshes the canonical (WC5); red recipes reported FAIL but non-fatal and DON'T promote.
- **Live sweep via the actual systemd SERVICE — PASS.** Forced custom-html canonical OLD
(1.10.0+1.28.0), `systemctl start nightly-sweep.service`. Journal: roll keycloak
`noop-healthy:10.7.1+26.6.2` rc=0 + traefik `noop-healthy:5.1.1+v3.6.15` rc=0 (health-gated);
`enrolled canonicals = ['custom-html']`; full-cold install/upgrade/backup/restore/custom **all
pass**; `WC5 promote: canonical custom-html advanced to known-good 1.11.0+1.29.0`; sweep summary
`custom-html: PASS`; service Finished. Independently verified after: registry **ADVANCED
1.10.0+1.28.0 → 1.11.0+1.29.0** (new ts), **no `cust-*` per-run leftover** (cold teardown sacred),
`ci.commoninternet.net=200` + `keycloak-through-traefik=200` (infra healthy post-roll), system
running / 0 failed.

**Gate verdict: WC6 — PASS @2026-05-29.** Builder may proceed to W4 (WC8/WC9).
**Phase-2w gates verified so far:** WC1, WC1.1 (full), WC1.2, WC2, WC3, WC4, WC5, WC6, WC7.
**Remaining for DONE:** WC8, WC9 (incl. the full `--quick` rollback proof + docs).

## @2026-05-29 — WC8 + WC9 (FINAL gates) — PASS (gate 40b03a9; cold-verified)
- **Units — PASS:** 72 passed (incl. test_canonical prune_stale).
- **WC8 serialize — PASS:** `DRONE_RUNNER_CAPACITY = maxTests = "1"` (MAX_TESTS cap); nightly sweep
serial + `_another_run_active()` in-flight skip (verified in WC6); one app at a time.
- **WC8 disk/prune — PASS:** swarm `autoPrune.flags = ["--all" "--filter" "until=24h"]` — **no
`--volumes`** (data-warm volumes + snapshots survive docker prune; the module comments why
`--volumes` would destroy the known-good). `canonical.prune_stale()` is SAFE: drops a
`/var/lib/ci-warm/<r>/` only if it's a dir AND not enrolled AND has a `canonical.json` — so it
spares enrolled canonicals, the keycloak/traefik reconciler dirs (last_good, no canonical.json),
and `alerts/`. Ran it LIVE: `pruned: []` (no-op) and all four dirs (alerts, custom-html, keycloak,
traefik) intact after. Disk `/` = 50% (14G free); warm total **318M** (bounded). Run nightly + df logged.
- **WC8 cold teardown sacred — PASS:** no `<recipe>-<6hex>` per-run leftovers after any of my
W2/WC4/WC5/WC6 runs (independently confirmed each time).
- **WC8 excluded from D8 — PASS:** `grep -rn ci-warm nix/` → only a COMMENT; no Nix source declares
`/var/lib/ci-warm` as a store/source path → runtime cache, re-seeded by cold runs, not on the closure.
- **WC9 docs — PASS:** `docs/warm.md` (116 lines) covers the three states, the health-gated
reconcilers + WC1.2 safety gate + alerts, data-warm canonicals + snapshots + enroll, `--quick`,
promote-on-green-cold, the nightly sweep, resource safety, an explicit "## The `--quick` rollback
proof (WC9)" section, and "## Operate / debug".
- **WC9 `--quick` rollback proof — PASS (already cold-verified in WC4, @REVIEW 31f0e42):** I
deliberately failed a PR under `--quick` (broken image) → the canonical's last-known-good was
restored INTACT (marker `WC2-DATA-MARKER-7f3a9c` back, app healthy on nginx:1.29.0, broken image
gone, registry+snapshot unchanged), exit RED; and a `--quick` PASS left the snapshot byte-identical
(did NOT move the known-good). No tests softened anywhere in the phase.

**Gate verdict: WC8 + WC9 — PASS @2026-05-29.**

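The three-condition prune rule checked under WC8 can be sketched as follows. The conditions (is a directory, not enrolled, has a `canonical.json`) come from this entry; the function signature, the explicit `root` argument, and the returned list are assumptions for illustration and may differ from `canonical.prune_stale()`.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def prune_stale(root: Path, enrolled: Iterable[str]) -> List[str]:
    """Remove stale canonical dirs under `root`; return the names removed."""
    keep = set(enrolled)
    pruned = []
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            continue  # only directories are candidates
        if entry.name in keep:
            continue  # enrolled canonicals are never pruned
        if not (entry / "canonical.json").is_file():
            continue  # spares reconciler dirs (last_good only) and alerts/
        shutil.rmtree(entry)
        pruned.append(entry.name)
    return pruned
```

Requiring `canonical.json` is the safety property: a directory that was never a data-warm canonical cannot be deleted by this routine, whatever the enrolled list contains.
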
### ✅ ALL Phase-2w gates Adversary cold-verified — NO VETO — DONE authorized
WC1, **WC1.1 (full: keycloak stateful + traefik stateless)**, WC1.2, WC2, WC3, WC4, WC5, WC6, WC7,
WC8, WC9 — every one has a fresh PASS in this REVIEW-2w, each re-run COLD from my own clone
(`cc-ci:/root/cc-ci-adv-verify`). No open `[adversary]` findings; no `## VETO`. The W0.10 traefik
tracked-open item is CLOSED. System healthy (running, 0 failed), infra serving (ci+keycloak 200),
custom-html canonical idle@1.11.0+1.29.0, recipe clones restored, disk 50%. **The Builder is cleared
to write `## DONE` to STATUS-2w.md** per §6.1.