status+journal(2w): W0.9 WC1.1 live proofs PASS (healthy upgrade + marquee rollback); reconciler-side WC1/WC1.1/WC1.2 proven

2026-05-29 01:21:59 +01:00
parent 32f00717ac
commit 819c1bc0fd
2 changed files with 58 additions and 11 deletions


@@ -137,3 +137,41 @@ warm-set of ~6 canonicals × (volume + 1 snapshot) is the disk budget to size.
failed), disk 62%. W0.1-W0.5 done+proven+pushed (HEAD 67240dc). Next unit: W0.6 reconciler rewrite
(unpin + WC1.2 safety gate + WC1.1 health-gated rollback), then W0.7/W0.8 (lasuite-docs race +
headline WC1 e2e).
## 2026-05-29 — W0.9 WC1.1 live proofs PASS (healthy upgrade + marquee rollback)
Built `runner/warm_reconcile.py`'s health-gated rollback and proved it live against the warm keycloak
using annotated fake tags + `CCCI_SKIP_FETCH=1`. The proof iterations surfaced 4 real issues, each
fixed against the real system (verify-don't-assume):
1. **deploy-failure must roll back too** — a broken "latest" can fail abra's *lint/converge*
(deploy_version raises) rather than deploy-then-be-unhealthy; wrapped the upgrade deploy so BOTH
raise and unhealthy paths trigger the snapshot-restore rollback (else the unit just crashes).
2. **warmsnap clobbered last_good** — snapshot's atomic swap renamed the whole `<recipe>/` dir,
wiping the sibling `last_good` file. Fixed: snapshot lives in `<recipe>/snapshot/`; only that
subdir is swapped; `last_good` (sibling) survives.
3. **swarm settle race** — abra undeploy returns before swarm finishes removing tasks, so an
immediate snapshot/restore/redeploy of the same stack raced a half-removed stack. Added
`wait_undeployed()` after every undeploy.
4. **abra writes FATA to stdout** — deploy_version only surfaced stderr (empty); now includes stdout.
This is how I diagnosed the two test-artifact failures: the broken deploy failed abra **lint R009**
(bad env not a string — a valid "broken latest"), and the first rollback attempts failed abra
**lint R014 "only annotated tags used for recipe version"** because my fake tags were *lightweight*
(production tags are annotated) — a TEST artifact, not a reconciler bug. Fixed the test to create
annotated tags (peel `^{}` to avoid nested-tag; set git identity).
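The layout fix from issue 2 can be sketched as follows. This is a minimal illustration, not the actual `warmsnap` code: the function name `swap_snapshot` and the `snapshot.new` / `snapshot.old` staging names are hypothetical, and the two-rename swap shown here is an assumed approach with a brief window between renames. The point is only that the swap touches `<recipe>/snapshot/` and never the parent directory, so the sibling `last_good` file survives.

```python
import os
import shutil
from pathlib import Path


def swap_snapshot(recipe_dir: Path, staged: Path) -> None:
    """Install `staged` as <recipe>/snapshot/ without touching sibling files.

    `staged` is assumed to live inside recipe_dir (same filesystem), so the
    renames are cheap directory renames rather than copies.
    """
    live = recipe_dir / "snapshot"
    old = recipe_dir / "snapshot.old"   # hypothetical name for the previous copy
    if old.exists():
        shutil.rmtree(old)              # clear leftovers from an interrupted run
    if live.exists():
        os.rename(live, old)            # move the current snapshot aside
    os.rename(staged, live)             # promote the freshly staged snapshot
    shutil.rmtree(old, ignore_errors=True)
```

Because only the `snapshot` subdirectory is renamed, anything else under `<recipe>/` (here `last_good`) is left exactly as it was.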
**Final PROOF (ALL PASS):**
- (a) healthy upgrade 10.7.1→10.7.9: snapshot taken (subdir), deploy, health-pass, last_good
committed=10.7.9, marker realm preserved through the undeploy/snapshot/redeploy.
- (b) marquee rollback: broken latest 10.7.10 → deploy fails → rollback to 10.7.9 → HEALTHY; marker
realm INTACT (data preserved through broken-upgrade + snapshot-restore); last_good NOT advanced;
rollback alert sentinel written (attempted=10.7.10, last_good=10.7.9, recovered=True). keycloak
recovered to canonical 10.7.1+26.6.2 healthy, no fake tags left.
This satisfies the WC1.1 Adversary mandate (broken latest → self-revert + data intact + alert;
healthy update commits last-good). WC1.2 holds were proven in W0.6. **The reconciler-side WC1/WC1.1/
WC1.2 are proven; the alert RELAY (Builder loop scans /var/lib/ci-warm/alerts/ → PushNotification +
archive to seen/) is still to wire (flagged for when nightly WC6 lands / a real alert can occur).**
Remaining for the WC1 gate: W0.7 (lasuite-docs in-place chaos-redeploy nginx race) + W0.8 (headline
dependent-SSO-green e2e vs warm keycloak + concurrent distinct realms + reaping).
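The two proven paths (healthy upgrade, rollback) follow a control flow that can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the code in `runner/warm_reconcile.py`: the `Ops` container, `upgrade_with_rollback`, and every callable in it are hypothetical stand-ins for the real abra/swarm/snapshot operations, and the body of `wait_undeployed` is assumed (only its name comes from the journal). It shows the properties proven above: a deploy that raises and a deploy that comes up unhealthy both take the rollback path, every undeploy is followed by a settle wait, and `last_good` only advances on a healthy upgrade.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Ops:
    """Hypothetical hooks standing in for the real abra/swarm/snapshot calls."""
    undeploy: Callable[[], None]
    stack_gone: Callable[[], bool]          # True once swarm has removed all tasks
    snapshot: Callable[[], None]
    restore: Callable[[], None]
    deploy: Callable[[str], None]           # may raise (e.g. lint/converge failure)
    healthy: Callable[[], bool]
    commit_last_good: Callable[[str], None]
    alert: Callable[[dict], None]


def wait_undeployed(ops: Ops, timeout: float = 120.0, interval: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll until the stack is fully removed, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while not ops.stack_gone():
        if time.monotonic() >= deadline:
            raise TimeoutError("stack still has tasks after undeploy")
        sleep(interval)


def deploy_ok(ops: Ops, version: str) -> bool:
    """Treat a raising deploy and an unhealthy deploy as the same failure."""
    try:
        ops.deploy(version)
        return ops.healthy()
    except Exception:
        return False


def upgrade_with_rollback(ops: Ops, target: str, last_good: str) -> str:
    ops.undeploy()
    wait_undeployed(ops)                    # avoid racing a half-removed stack
    ops.snapshot()
    if deploy_ok(ops, target):
        ops.commit_last_good(target)        # only a healthy upgrade advances last_good
        return "upgraded"
    ops.undeploy()
    wait_undeployed(ops)
    ops.restore()                           # put the pre-upgrade data back
    recovered = deploy_ok(ops, last_good)
    ops.alert({"attempted": target, "last_good": last_good, "recovered": recovered})
    return "rolled-back"
```

Passing the operations in as callables keeps the flow testable without a live swarm; the live proofs above remain the authority on real behaviour.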


@@ -68,18 +68,27 @@ nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversa
reconciler → noop-healthy (system 0-failed, 200); **WC1.2 holds proven** (MAJOR → held-major,
keycloak untouched; minor+manual-migration notes → held-manual-migration, alert carries notes).
**Next:**
1. **W0.9 — WC1.1 live proofs** (deploy cycles): (a) healthy upgrade — stage a fake newer tag
(re-tag of current → same healthy image) → reconcile snapshots + deploys + commits last-good;
(b) **rollback (marquee)** — stage a fake newer tag with a BROKEN compose (bad KC_HOSTNAME →
crash-loop) → reconcile snapshots → deploys broken → health-gate fails → restores snapshot +
redeploys prior → healthy + data intact (marker realm) + alert written + last_good NOT advanced.
2. **W0.7** — fix the lasuite-docs in-place-redeploy nginx-upstream race OR pick a more-robust SSO
dependent for the headline proof.
3. **W0.8** — headline WC1 e2e: dependent SSO custom test green vs warm keycloak + concurrent
- **W0.9 WC1.1 live proofs** DONE (32f0071). PROVEN on warm keycloak (annotated fake tags +
CCCI_SKIP_FETCH): (a) healthy upgrade 10.7.1→10.7.9 — snapshot+deploy+health-pass, last_good
committed, marker preserved; (b) **marquee rollback** — broken latest 10.7.10 → deploy fails →
rollback to 10.7.9, HEALTHY, marker realm INTACT (data preserved), last_good NOT advanced, rollback
alert written (attempted=10.7.10,last_good=10.7.9,recovered=True); recovered to canonical
10.7.1+26.6.2. Fixed 4 issues live (deploy-fail→rollback, warmsnap last_good subdir, wait_undeployed
swarm-settle, abra-stdout capture). 57 unit tests pass. **Reconciler-side WC1/WC1.1/WC1.2 proven.**
**Adversary reproduce (W0.9):** on cc-ci, with the keycloak recipe clone, create annotated fake
tags (peel `^{}`, set git identity) `10.7.9+26.6.2`(=good commit) and `10.7.10+26.6.2`(broken
KC_HOSTNAME), then `CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak` twice; observe
`upgraded:` then `rolled-back:`, marker realm survives, `/var/lib/ci-warm/keycloak/last_good`
unchanged at the prior version, a `*rollback*.json` alert under `/var/lib/ci-warm/alerts/`.
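For the tag-staging step in the reproduce recipe, a small helper along these lines builds the `git tag` invocation. It is a sketch only: the function name, the default identity values, and the tag message are placeholders, not what the real test uses. It relies on standard git behaviour: `-a -m` creates an annotated tag, per-command `-c user.name=... -c user.email=...` supplies the tagger identity, and the `^{}` suffix peels the target to its underlying commit so that re-tagging an existing annotated tag does not produce a nested tag.

```python
from typing import List


def annotated_tag_cmd(repo: str, tag: str, target: str,
                      name: str = "ci-proof",
                      email: str = "ci-proof@example.invalid") -> List[str]:
    """Build the argv for creating an annotated tag at the commit behind `target`."""
    return [
        "git", "-C", repo,
        "-c", f"user.name={name}",          # identity for this command only
        "-c", f"user.email={email}",
        "tag", "-a", "-m", f"test tag {tag}",
        tag,
        f"{target}^{{}}",                   # peel: tag the commit, not the tag object
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`; for the broken tag, `target` would be whichever commit carries the broken compose.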
**Next (remaining for WC1 gate):**
1. **W0.7** — fix the lasuite-docs in-place chaos-redeploy nginx-upstream race (`host not found in
upstream ...backend:8000`) OR pick a more-robust SSO dependent for the headline proof.
2. **W0.8** — headline WC1 e2e: dependent SSO custom test green vs warm keycloak + concurrent
distinct realms (no collision) + reaping. → claim WC1/WC1.1/WC1.2.
4. **Builder-loop alert relay** — on each wake, scan `/var/lib/ci-warm/alerts/*.json`, PushNotification
+ record + archive to `alerts/seen/` (wire when first real alert can occur, i.e. with nightly WC6).
3. **Builder-loop alert relay** (deferred wiring) — on each wake, scan `/var/lib/ci-warm/alerts/*.json`,
PushNotification + record + archive to `alerts/seen/`; wire when nightly WC6 lands (first real alert).
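The deferred relay could take roughly this shape. It is a sketch under assumptions: `relay_alerts` and the `notify` callback are hypothetical (the callback stands in for whatever the Builder loop uses to push a notification), and the alert files are assumed to contain a single JSON object each. Only the directory layout (`alerts/*.json`, archive to `alerts/seen/`) comes from the plan above.

```python
import json
from pathlib import Path
from typing import Callable, List


def relay_alerts(alerts_dir: Path, notify: Callable[[dict], None]) -> List[str]:
    """Notify for each pending alert, then archive it under alerts_dir/seen/."""
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = []
    # glob("*.json") is non-recursive, so files already in seen/ are not rescanned.
    for path in sorted(alerts_dir.glob("*.json")):
        payload = json.loads(path.read_text())
        notify(payload)                     # if this raises, the file stays pending
        path.replace(seen / path.name)      # archive only after a successful notify
        relayed.append(path.name)
    return relayed
```

Archiving after the notification means a failed push leaves the alert in place for the next wake, at the cost of a possible duplicate if the move itself fails.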
**Build finding (mine, to fix):** lasuite-docs `setup_custom_tests` in-place `abra app deploy
--force --chaos` (OIDC wiring) fails: nginx `web` fatally exits `[emerg] host not found in upstream