fix(2w): W0.9 WC1.1 hardening (proven live: healthy upgrade + marquee rollback)

Bugs found by the live proof, fixed:
- warmsnap: snapshot now swaps a <recipe>/snapshot/ SUBDIR, not the whole
  <recipe>/ dir — so the reconciler's sibling last_good file survives a
  snapshot swap (was being clobbered).
- warm_reconcile: deploy_version now captures abra's stdout in the error
  (abra writes FATA to stdout); wait_undeployed() is called after every
  undeploy so snapshot/restore/redeploy don't race a half-removed swarm stack;
  the upgrade deploy is wrapped so a deploy FAILURE (not just an unhealthy
  result) also triggers rollback. (57 unit tests pass.)
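
For illustration only, a minimal sketch of what a wait_undeployed()-style poll could look like. The real function's signature is not shown in this commit; the `stack_exists` callback, the timeout values and the `UndeployTimeout` name are all assumptions made for the example.

```python
import time
from typing import Callable


class UndeployTimeout(RuntimeError):
    """Hypothetical error: the stack was still present at the deadline."""


def wait_undeployed(
    stack_exists: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Block until stack_exists() returns False, or raise after `timeout` seconds.

    stack_exists is an injected probe (assumption); in a real reconciler it
    would ask the swarm whether any part of the stack is still around.
    """
    deadline = clock() + timeout
    while stack_exists():
        if clock() >= deadline:
            raise UndeployTimeout(f"stack still present after {timeout}s")
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop unit-testable without real waiting, which is one way such a helper could be covered by the unit suite.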

LIVE PROOF on warm keycloak (annotated fake tags via CCCI_SKIP_FETCH):
(a) healthy upgrade 10.7.1->10.7.9: snapshot+deploy+health-pass, last_good
    committed=10.7.9, marker realm preserved.
(b) MARQUEE rollback: broken latest 10.7.10 (lint-fail) -> rollback to 10.7.9,
    HEALTHY, marker realm INTACT (data preserved through broken-upgrade+restore),
    last_good NOT advanced, rollback alert written (attempted=10.7.10,
    last_good=10.7.9, recovered=True). keycloak recovered to canonical
    10.7.1+26.6.2 healthy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 01:21:05 +01:00
parent 07ea951f31
commit 32f00717ac
3 changed files with 59 additions and 23 deletions

@@ -12,10 +12,13 @@ Used by:
 Warm snapshots are **cache, excluded from the D8 reproducibility closure** (WC8) — re-seeded by cold
 runs, not restored on a VM rebuild.
-Layout (atomic dir swap on update; one last-good per app):
+Layout (atomic dir swap of the `snapshot/` subdir; one last-good per app). Sibling per-app state
+(e.g. the reconciler's `last_good`) lives in `<recipe>/` and is NOT clobbered by the swap:
     $CCCI_WARM_ROOT/<recipe>/
-        meta.json # {recipe, domain, commit, version, ts, volumes:[...]}
-        volumes/<volname>.tar # raw tar of the volume root, one per stack volume
+        last_good # (reconciler) the version known healthy — survives snapshot swaps
+        snapshot/
+            meta.json # {recipe, domain, commit, version, ts, volumes:[...]}
+            volumes/<volname>.tar # raw tar of the volume root, one per stack volume
 Implementation note: volumes are tarred from their host mountpoint
 (`docker volume inspect -f '{{.Mountpoint}}'`), so no sidecar image pull is needed. The caller runs
@@ -47,12 +50,18 @@ def app_dir(recipe: str) -> str:
     return os.path.join(warm_root(), recipe)
+def snap_dir(recipe: str) -> str:
+    """The snapshot subdir — atomically swapped on update. Kept SEPARATE from app_dir so sibling
+    per-app state (the reconciler's last_good) survives a snapshot swap."""
+    return os.path.join(app_dir(recipe), "snapshot")
 def meta_path(recipe: str) -> str:
-    return os.path.join(app_dir(recipe), "meta.json")
+    return os.path.join(snap_dir(recipe), "meta.json")
 def volumes_dir(recipe: str) -> str:
-    return os.path.join(app_dir(recipe), "volumes")
+    return os.path.join(snap_dir(recipe), "volumes")
 def has_snapshot(recipe: str) -> bool:
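
As a quick illustration of the resulting layout, the path helpers can be exercised on their own. This is a sketch under assumptions: `warm_root()` is not shown in this diff, so reading `$CCCI_WARM_ROOT` with a made-up fallback is a guess, and `last_good_path` is a hypothetical helper added only to show where the reconciler's file sits.

```python
import os


def warm_root() -> str:
    # Assumption: root comes from $CCCI_WARM_ROOT; the fallback is invented.
    return os.environ.get("CCCI_WARM_ROOT", "/var/lib/ccci/warm")


def app_dir(recipe: str) -> str:
    return os.path.join(warm_root(), recipe)


def snap_dir(recipe: str) -> str:
    # Only this subdir is swapped on update.
    return os.path.join(app_dir(recipe), "snapshot")


def meta_path(recipe: str) -> str:
    return os.path.join(snap_dir(recipe), "meta.json")


def volumes_dir(recipe: str) -> str:
    return os.path.join(snap_dir(recipe), "volumes")


def last_good_path(recipe: str) -> str:
    # Hypothetical helper: sibling of snapshot/, so a swap never touches it.
    return os.path.join(app_dir(recipe), "last_good")


if __name__ == "__main__":
    for fn in (app_dir, snap_dir, meta_path, volumes_dir, last_good_path):
        print(f"{fn.__name__:>15}: {fn('keycloak')}")
```

The point of the split is visible in the paths: everything snapshot-owned is under `snapshot/`, while `last_good` is one level up.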
@@ -112,9 +121,8 @@ def snapshot(recipe: str, domain: str, commit: str | None = None, version: str |
     if not volumes:
         raise SnapshotError(f"no volumes found for {domain} — nothing to snapshot")
-    root = warm_root()
-    os.makedirs(root, exist_ok=True)
-    staging = os.path.join(root, f".{recipe}.staging")
+    os.makedirs(app_dir(recipe), exist_ok=True)
+    staging = os.path.join(app_dir(recipe), ".snapshot.staging")
     shutil.rmtree(staging, ignore_errors=True)
     os.makedirs(os.path.join(staging, "volumes"), exist_ok=True)
@@ -138,9 +146,9 @@ def snapshot(recipe: str, domain: str, commit: str | None = None, version: str |
     with open(os.path.join(staging, "meta.json"), "w") as f:
         json.dump(meta, f)
-    # Atomic-ish swap: move current aside, move staging in, drop the old. One last-good retained.
-    target = app_dir(recipe)
-    old = os.path.join(root, f".{recipe}.old")
+    # Atomic-ish swap of the snapshot subdir only (sibling state like last_good is untouched).
+    target = snap_dir(recipe)
+    old = os.path.join(app_dir(recipe), ".snapshot.old")
     shutil.rmtree(old, ignore_errors=True)
     if os.path.exists(target):
         os.rename(target, old)
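
The hunk is cut off here. A standalone sketch of the swap, assuming the remaining steps follow the earlier comment (move current aside, move staging in, drop the old), shows why a sibling last_good file survives. `swap_snapshot` and `demo` are hypothetical names used only for this example.

```python
import os
import shutil
import tempfile


def swap_snapshot(app: str, staging: str) -> None:
    """Swap <app>/snapshot with a prepared staging dir; siblings are untouched."""
    target = os.path.join(app, "snapshot")
    old = os.path.join(app, ".snapshot.old")
    shutil.rmtree(old, ignore_errors=True)
    if os.path.exists(target):
        os.rename(target, old)            # move current aside
    os.rename(staging, target)            # move staging in
    shutil.rmtree(old, ignore_errors=True)  # drop the old


def demo() -> tuple[str, str]:
    with tempfile.TemporaryDirectory() as root:
        app = os.path.join(root, "keycloak")
        os.makedirs(os.path.join(app, "snapshot"))
        with open(os.path.join(app, "last_good"), "w") as f:
            f.write("10.7.9")             # reconciler state, sibling of snapshot/
        with open(os.path.join(app, "snapshot", "meta.json"), "w") as f:
            f.write("old")
        staging = os.path.join(app, ".snapshot.staging")
        os.makedirs(staging)
        with open(os.path.join(staging, "meta.json"), "w") as f:
            f.write("new")
        swap_snapshot(app, staging)
        with open(os.path.join(app, "last_good")) as f:
            last_good = f.read()
        with open(os.path.join(app, "snapshot", "meta.json")) as f:
            meta = f.read()
        return last_good, meta


if __name__ == "__main__":
    print(demo())
```

Keeping staging and old inside `<recipe>/` also keeps both renames on the same filesystem as the target, which `os.rename` needs in order to succeed without copying.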