fix(2w): W0.9 WC1.1 hardening (proven live: healthy upgrade + marquee rollback)

Bugs found by the live proof, fixed:
- warmsnap: snapshot now swaps a <recipe>/snapshot/ SUBDIR, not the whole <recipe>/ dir, so the reconciler's sibling last_good file survives a snapshot swap (it was being clobbered).
- warm_reconcile: deploy_version captures abra's stdout (it writes FATA to stdout) in the error; add wait_undeployed() after every undeploy so snapshot/restore/redeploy don't race a half-removed swarm stack; the upgrade deploy is wrapped so a deploy FAILURE (not just an unhealthy result) also triggers rollback.
(57 unit tests pass.)

LIVE PROOF on warm keycloak (annotated fake tags via CCCI_SKIP_FETCH):
(a) healthy upgrade 10.7.1->10.7.9: snapshot+deploy+health-pass, last_good committed=10.7.9, marker realm preserved.
(b) MARQUEE rollback: broken latest 10.7.10 (lint-fail) -> rollback to 10.7.9, HEALTHY, marker realm INTACT (data preserved through broken-upgrade+restore), last_good NOT advanced, rollback alert written (attempted=10.7.10, last_good=10.7.9, recovered=True).
keycloak recovered to canonical 10.7.1+26.6.2 healthy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 01:21:05 +01:00
parent 07ea951f31
commit 32f00717ac
3 changed files with 59 additions and 23 deletions
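
The warmsnap fix described in the message (swap only a snapshot/ subdirectory so sibling files such as last_good survive) can be illustrated with a minimal sketch. This is not the code from runner/warmsnap.py; the function name `swap_snapshot`, the `snapshot.old` staging name, and the assumption that the new snapshot is staged on the same filesystem are all hypothetical.

```python
import shutil
from pathlib import Path


def swap_snapshot(recipe_dir: Path, staged: Path) -> None:
    """Replace <recipe_dir>/snapshot with `staged`, leaving sibling files alone.

    Hypothetical helper: `staged` must live on the same filesystem as
    `recipe_dir`, because Path.rename does not cross filesystems.
    """
    target = recipe_dir / "snapshot"
    old = recipe_dir / "snapshot.old"  # hypothetical name for the displaced copy
    if old.exists():
        shutil.rmtree(old)  # clear leftovers from an interrupted earlier swap
    if target.exists():
        target.rename(old)  # move the previous snapshot aside
    staged.rename(target)  # put the new snapshot in place
    shutil.rmtree(old, ignore_errors=True)  # drop the previous snapshot
```

Because only the snapshot/ subdirectory is renamed, anything else stored directly under the recipe directory is never touched by the swap.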
--- a/runner/warm_reconcile.py
+++ b/runner/warm_reconcile.py
@@ -207,11 +207,26 @@ def release_notes(recipe: str, version: str) -> str:
 def deploy_version(recipe: str, domain: str, version: str, timeout: int) -> None:
    """Deploy a specific published version: checkout the tag (so the on-disk tree matches) then a
    pinned non-chaos redeploy with the version positional (so abra records TYPE=<recipe>:<version>).
-    `-f` makes it idempotent against an already-deployed app."""
+    `-f` makes it idempotent against an already-deployed app. abra writes FATA to stdout, so include
+    both streams in the error."""
    abra.recipe_checkout(recipe, version)
    r = _run(["abra", "app", "deploy", domain, version, "-o", "-n", "-f"], timeout=timeout)
    if r.returncode != 0:
-        raise RuntimeError(f"deploy {domain} {version} failed: {r.stderr.strip()[:300]}")
+        msg = (r.stderr.strip() + " " + r.stdout.strip()).strip()[:400]
+        raise RuntimeError(f"deploy {domain} {version} failed: {msg}")
+
+
+def wait_undeployed(domain: str, timeout: int = 120) -> None:
+    """Block until the app's swarm stack is fully removed after an undeploy. abra's undeploy may
+    return before swarm finishes tearing down tasks; snapshot/restore (which require undeployed) and
+    an immediate redeploy of the same stack name otherwise race a half-removed stack."""
+    stack = lifecycle._stack_name(domain)  # noqa: SLF001
+    deadline = time.time() + timeout
+    while time.time() < deadline:
+        if not lifecycle._docker_names("service", stack):  # noqa: SLF001
+            return
+        time.sleep(2)
+    raise RuntimeError(f"{domain} stack not fully undeployed after {timeout}s")


 # --------------------------------------------------------------------------- last-good + alerts
@@ -332,6 +347,7 @@ def reconcile(app: str) -> str:
    print(f"[{app}] auto-upgrade {last_good} → {latest} (health-gated)", flush=True)
    if stateful:
        abra.undeploy(domain)
+        wait_undeployed(domain)
        warmsnap.snapshot(recipe, domain, version=last_good)
        # snapshot requires undeployed; now bring up latest.
    # A broken "latest" can fail in two ways: deploy_version raises (abra converge times out on a
@@ -353,6 +369,7 @@ def reconcile(app: str) -> str:
    print(f"[{app}] latest {latest} UNHEALTHY → rolling back to {last_good}", flush=True)
    if stateful:
        abra.undeploy(domain)
+        wait_undeployed(domain)
        warmsnap.restore(recipe, domain)
    deploy_version(recipe, domain, last_good, dt)
    recovered = wait_healthy(spec)
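
The control flow this commit hardens (a deploy failure and an unhealthy result both lead to a rollback) can be sketched in isolation. This is a simplified illustration under stated assumptions, not the body of reconcile(): `upgrade_or_rollback` and its `deploy` and `healthy` callables are hypothetical stand-ins for deploy_version and wait_healthy, and the snapshot/restore and undeploy steps are omitted.

```python
from typing import Callable, Tuple


def upgrade_or_rollback(
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    last_good: str,
    latest: str,
) -> Tuple[str, bool]:
    """Try `latest`; if the deploy raises OR the health check fails, redeploy `last_good`.

    Returns (version deployed at the end, whether a rollback happened).
    """
    try:
        deploy(latest)
        ok = healthy()
    except RuntimeError as exc:
        # A deploy that raises is treated the same as an unhealthy deploy.
        print(f"deploy of {latest} failed: {exc}", flush=True)
        ok = False
    if ok:
        return latest, False
    deploy(last_good)  # roll back to the last known-good version
    return last_good, True
```

Wrapping the deploy call in the same decision as the health check means the caller has a single rollback path to maintain, whichever way the broken version fails.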