status+journal(2w): W0.9 WC1.1 live proofs PASS (healthy upgrade + marquee rollback); reconciler-side WC1/WC1.1/WC1.2 proven
@ -137,3 +137,41 @@ warm-set of ~6 canonicals × (volume + 1 snapshot) is the disk budget to size.
failed), disk 62%. W0.1-W0.5 done+proven+pushed (HEAD 67240dc). Next unit: W0.6 reconciler rewrite
(unpin + WC1.2 safety gate + WC1.1 health-gated rollback), then W0.7/W0.8 (lasuite-docs race +
headline WC1 e2e).
## 2026-05-29 — W0.9 WC1.1 live proofs PASS (healthy upgrade + marquee rollback)

Built `runner/warm_reconcile.py`'s health-gated rollback and proved it live against the warm keycloak
using annotated fake tags + `CCCI_SKIP_FETCH=1`. The proof iterations surfaced 4 real issues, each
fixed against the real system (verify-don't-assume):

1. **deploy-failure must roll back too**: a broken "latest" can fail abra's *lint/converge*
   (`deploy_version` raises) rather than deploy and then be unhealthy; wrapped the upgrade deploy so
   BOTH the raise and the unhealthy paths trigger the snapshot-restore rollback (otherwise the unit
   just crashes).
2. **warmsnap clobbered last_good**: snapshot's atomic swap renamed the whole `<recipe>/` dir,
   wiping the sibling `last_good` file. Fixed: snapshot lives in `<recipe>/snapshot/`; only that
   subdir is swapped; `last_good` (sibling) survives.
3. **swarm settle race**: abra undeploy returns before swarm finishes removing tasks, so an
   immediate snapshot/restore/redeploy of the same stack raced a half-removed stack. Added
   `wait_undeployed()` after every undeploy.
4. **abra writes FATA to stdout**: `deploy_version` only surfaced stderr (empty); now includes stdout.
   This is how I diagnosed the two test-artifact failures: the broken deploy failed abra **lint R009**
   (bad env not a string, which is a valid "broken latest"), and the first rollback attempts failed
   abra **lint R014 "only annotated tags used for recipe version"** because my fake tags were
   *lightweight* (production tags are annotated). That was a TEST artifact, not a reconciler bug.
   Fixed the test to create annotated tags (peel `^{}` to avoid a nested tag; set git identity).
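
A minimal sketch of the control flow behind issue 1, to make the "raise OR unhealthy" rule concrete. This is not the code in `runner/warm_reconcile.py`: `reconcile_upgrade`, `Outcome` and the injected callables are hypothetical names, and the undeploy/settle steps are left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Outcome:
    status: str      # "upgraded" or "rolled-back"
    last_good: str   # version recorded as last-good after the run
    recovered: bool  # True if the stack ended the run healthy


def reconcile_upgrade(
    last_good: str,
    target: str,
    deploy: Callable[[str], None],
    is_healthy: Callable[[], bool],
    snapshot: Callable[[], None],
    restore: Callable[[], None],
) -> Outcome:
    """Upgrade to `target`; roll back to `last_good` on raise OR unhealthy."""
    snapshot()                 # take the pre-upgrade snapshot first
    try:
        deploy(target)         # may raise, e.g. on a lint/converge failure
        ok = is_healthy()      # or deploy cleanly but never become healthy
    except Exception:
        ok = False             # a raising deploy counts as a failed health gate
    if ok:
        return Outcome("upgraded", target, True)
    restore()                  # put the pre-upgrade data back
    deploy(last_good)          # redeploy the prior version
    return Outcome("rolled-back", last_good, is_healthy())
```

Both failure modes funnel into the same `ok = False` branch, so `last_good` only advances on the healthy path.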

**Final PROOF (ALL PASS):**
- (a) healthy upgrade 10.7.1→10.7.9: snapshot taken (subdir), deploy, health-pass, last_good
  committed=10.7.9, marker realm preserved through the undeploy/snapshot/redeploy.
- (b) marquee rollback: broken latest 10.7.10 → deploy fails → rollback to 10.7.9 → HEALTHY; marker
  realm INTACT (data preserved through broken-upgrade + snapshot-restore); last_good NOT advanced;
  rollback alert sentinel written (attempted=10.7.10, last_good=10.7.9, recovered=True). keycloak
  recovered to canonical 10.7.1+26.6.2 healthy, no fake tags left.
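
For illustration, one way the rollback alert sentinel from (b) could be written. Only the `attempted`, `last_good` and `recovered` fields and the `rollback` substring in the file name come from this journal; the function name, the `recipe` field, the exact file name and the temp-file-then-rename step are assumptions.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def write_rollback_alert(alerts_dir: Path, recipe: str, attempted: str,
                         last_good: str, recovered: bool) -> Path:
    """Write a rollback alert as JSON and return its final path."""
    alerts_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "recipe": recipe,        # assumption: not confirmed by the journal
        "attempted": attempted,
        "last_good": last_good,
        "recovered": recovered,
    }
    final = alerts_dir / f"{recipe}-rollback-{time.time_ns()}.json"
    # Write to a temp file in the same directory, then rename, so a reader
    # globbing *.json never sees a half-written alert.
    fd, tmp = tempfile.mkstemp(dir=alerts_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(payload, fh)
    os.replace(tmp, final)
    return final
```

Usage would look like `write_rollback_alert(Path("/var/lib/ci-warm/alerts"), "keycloak", "10.7.10", "10.7.9", True)`.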

This satisfies the WC1.1 Adversary mandate (broken latest → self-revert + data intact + alert;
healthy update commits last-good). WC1.2 holds were proven in W0.6. **The reconciler-side
WC1/WC1.1/WC1.2 are proven; the alert RELAY (Builder loop scans `/var/lib/ci-warm/alerts/` →
PushNotification + archive to `seen/`) is still to wire (flagged for when nightly WC6 lands / a
real alert can occur).**

Remaining for the WC1 gate: W0.7 (lasuite-docs in-place chaos-redeploy nginx race) + W0.8 (headline
dependent-SSO-green e2e vs warm keycloak + concurrent distinct realms + reaping).
@ -68,18 +68,27 @@ nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversa
reconciler → noop-healthy (system 0-failed, 200); **WC1.2 holds proven** (MAJOR → held-major,
keycloak untouched; minor+manual-migration notes → held-manual-migration, alert carries notes).

**Next:**
1. **W0.9 - WC1.1 live proofs** (deploy cycles): (a) healthy upgrade: stage a fake newer tag
   (re-tag of current → same healthy image) → reconcile snapshots + deploys + commits last-good;
   (b) **rollback (marquee)**: stage a fake newer tag with a BROKEN compose (bad KC_HOSTNAME →
   crash-loop) → reconcile snapshots → deploys broken → health-gate fails → restores snapshot +
   redeploys prior → healthy + data intact (marker realm) + alert written + last_good NOT advanced.
2. **W0.7** - fix the lasuite-docs in-place-redeploy nginx-upstream race OR pick a more-robust SSO
   dependent for the headline proof.
3. **W0.8** - headline WC1 e2e: dependent SSO custom test green vs warm keycloak + concurrent
- **W0.9 WC1.1 live proofs** DONE (32f0071). PROVEN on warm keycloak (annotated fake tags +
  CCCI_SKIP_FETCH): (a) healthy upgrade 10.7.1→10.7.9: snapshot+deploy+health-pass, last_good
  committed, marker preserved; (b) **marquee rollback**: broken latest 10.7.10 → deploy fails →
  rollback to 10.7.9, HEALTHY, marker realm INTACT (data preserved), last_good NOT advanced, rollback
  alert written (attempted=10.7.10,last_good=10.7.9,recovered=True); recovered to canonical
  10.7.1+26.6.2. Fixed 4 issues live (deploy-fail→rollback, warmsnap last_good subdir, wait_undeployed
  swarm-settle, abra-stdout capture). 57 unit tests pass. **Reconciler-side WC1/WC1.1/WC1.2 proven.**
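
The swarm-settle fix listed above amounts to polling until the stack is really gone before touching it again. A rough sketch of that pattern follows. The real `wait_undeployed()` is not shown in this journal, so the name `wait_undeployed_sketch`, the `stack_gone` predicate, and the timeout and interval defaults are all assumptions.

```python
import time
from typing import Callable


def wait_undeployed_sketch(
    stack_gone: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until `stack_gone()` is True; raise TimeoutError after `timeout` s."""
    deadline = time.monotonic() + timeout
    while True:
        if stack_gone():          # e.g. no tasks left for the stack
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("stack still has tasks after undeploy")
        sleep(interval)           # back off before asking swarm again
```

Injecting `stack_gone` and `sleep` keeps the helper testable without a swarm; in the runner the predicate would wrap whatever check reports remaining tasks.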

**Adversary reproduce (W0.9):** on cc-ci, with the keycloak recipe clone, create annotated fake
tags (peel `^{}`, set git identity) `10.7.9+26.6.2` (= good commit) and `10.7.10+26.6.2` (broken
KC_HOSTNAME), then run `CCCI_SKIP_FETCH=1 cc-ci-run runner/warm_reconcile.py keycloak` twice; observe
`upgraded:` then `rolled-back:`, marker realm survives, `/var/lib/ci-warm/keycloak/last_good`
unchanged at the prior version, a `*rollback*.json` alert under `/var/lib/ci-warm/alerts/`.
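
A sketch of the tag-staging step, showing one way to create annotated tags with a peeled target and a per-invocation git identity. The helper names and the throwaway identity are hypothetical; the git features used (`tag -a -m`, the `^{}` peel suffix, `cat-file -t`, `-c key=value`) are standard.

```python
import subprocess
from pathlib import Path

# Assumption: identity is passed per invocation; any throwaway identity works.
IDENTITY = ["-c", "user.name=ci-proof", "-c", "user.email=ci-proof@example.invalid"]


def git(repo: Path, *args: str) -> str:
    """Run git in `repo` and return stripped stdout; raises on non-zero exit."""
    cmd = ["git", "-C", str(repo), *IDENTITY, *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()


def make_annotated_tag(repo: Path, name: str, target: str) -> None:
    """Create annotated tag `name` on the commit that `target` peels to."""
    # `target^{}` dereferences tag objects, so tagging an existing annotated
    # tag yields a tag on the commit, not a tag pointing at another tag.
    git(repo, "tag", "-a", name, "-m", f"fake tag {name}", f"{target}^{{}}")


def is_annotated(repo: Path, name: str) -> bool:
    """True if `name` resolves to a tag object (annotated), not a bare commit."""
    return git(repo, "cat-file", "-t", name) == "tag"
```

For example, `make_annotated_tag(repo, "10.7.9+26.6.2", "10.7.1+26.6.2")` re-tags the current good commit; `is_annotated` gives a quick pre-check against the lightweight-tag mistake that tripped lint R014.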

**Next (remaining for WC1 gate):**
1. **W0.7** - fix the lasuite-docs in-place chaos-redeploy nginx-upstream race (`host not found in
   upstream ...backend:8000`) OR pick a more-robust SSO dependent for the headline proof.
2. **W0.8** - headline WC1 e2e: dependent SSO custom test green vs warm keycloak + concurrent
   distinct realms (no collision) + reaping. → claim WC1/WC1.1/WC1.2.
4. **Builder-loop alert relay** - on each wake, scan `/var/lib/ci-warm/alerts/*.json`,
   PushNotification + record + archive to `alerts/seen/` (wire when first real alert can occur,
   i.e. with nightly WC6).
3. **Builder-loop alert relay** (deferred wiring) - on each wake, scan
   `/var/lib/ci-warm/alerts/*.json`, PushNotification + record + archive to `alerts/seen/`; wire
   when nightly WC6 lands (first real alert).
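
The relay is not wired yet, so the following is only a sketch of the scan/notify/archive loop described in the item above. `relay_alerts` is a hypothetical name, `notify` stands in for the Builder's PushNotification call, and the returned list stands in for the "record" step.

```python
import json
import shutil
from pathlib import Path
from typing import Callable, List


def relay_alerts(alerts_dir: Path, notify: Callable[[dict], None]) -> List[str]:
    """Notify for each pending alert, archive it to seen/, return relayed names."""
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = []
    # glob("*.json") is non-recursive, so files already in seen/ are skipped.
    for path in sorted(alerts_dir.glob("*.json")):
        alert = json.loads(path.read_text())
        notify(alert)                                  # push before archiving
        shutil.move(str(path), str(seen / path.name))  # archive so it fires once
        relayed.append(path.name)
    return relayed
```

Archiving only after `notify` returns means a failed push leaves the alert in place for the next wake, at the cost of a possible duplicate notification.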

**Build finding (mine, to fix):** lasuite-docs `setup_custom_tests` in-place `abra app deploy
--force --chaos` (OIDC wiring) fails: nginx `web` fatally exits `[emerg] host not found in upstream