fix(2): F2-12 lasuite-drive upgrade tier — own convergence wait (abra -c) + collabora READY_PROBE

Adversary cold-verify FAILed Q3.2 (F2-12): during the prev→PR-head chaos upgrade, abra's converge
monitor FATAs while the NEW collabora 25.04.9.4.1's healthcheck is still in its start_period
(jail/config init), even though the service does converge given swarm's healthcheck retries. My WOPI
pre-gate fixed the OLD collabora being killed mid-boot, but not the NEW collabora's convergence.
Flaky: 3x green for me, 1x fail cold.

Fix (cc-ci-side, stronger verification — not weaker):
- abra.deploy gains no_converge_checks (`-c`); chaos_redeploy passes it for the upgrade op so abra's
  impatient monitor no longer FATAs (the stack spec is applied regardless).
- perform_upgrade now OWNS the convergence verification after the redeploy: wait_healthy (services
  N/N + app HEALTH_PATH) + new lifecycle.wait_ready_probes (recipe READY_PROBE), bounded by the
  recipe DEPLOY_TIMEOUT (generous) not abra's impatient window. meta threaded _perform_op→perform_upgrade.
- recipe_meta READY_PROBE hook (added to _load_meta whitelist): lasuite-drive probes collabora WOPI
  discovery (/hosting/discovery on collabora-<domain>) → 200. Called after install deploy AND after
  the upgrade redeploy. No-op for recipes without a READY_PROBE.
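
A minimal sketch of what a READY_PROBE hook in a recipe_meta could look like, assuming the probe
shape named in this commit (`host`, `path`, `ok`). The `collabora-` + domain host construction and
the example domain are assumptions for illustration, not the actual lasuite-drive recipe_meta:

```python
# Hypothetical recipe_meta fragment. Only the READY_PROBE name, the dict keys
# ("host", "path", "ok") and the /hosting/discovery path come from the commit;
# the exact host construction is an assumption.
def READY_PROBE(domain: str) -> list[dict]:
    """Return readiness probes not covered by replica convergence or HEALTH_PATH."""
    return [
        {
            "host": f"collabora-{domain}",   # collabora sibling host (assumed naming)
            "path": "/hosting/discovery",    # WOPI discovery endpoint
            "ok": (200,),                    # accepted HTTP status codes
        }
    ]


# Example call with a placeholder domain.
probes = READY_PROBE("drive.example.org")
```

A recipe that defines no READY_PROBE is simply skipped by the waiter, so the hook stays opt-in.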

NOT re-claiming yet — validating the upgrade tier is now reliably green (incl. the slow-collabora
crossover) across multiple runs before re-claiming Q3.2. F2-12 stays open (Adversary-owned).
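
The ownership change in the fix list can be sketched roughly as below. This is not the real
perform_upgrade: the helpers are stubs that only record the call order, the `wait_healthy`
signature and the `DEPLOY_TIMEOUT` lookup with its default are assumptions, and passing the full
timeout to each wait (rather than a shared budget) is also an assumption:

```python
# Hypothetical sketch: stubs stand in for the real helpers so the ordering can be run.
calls: list[str] = []


def chaos_redeploy(domain, deploy_timeout=900, no_converge_checks=False):
    # Stub: the real function shells out to `abra app deploy --chaos` (with -c when set).
    calls.append(f"chaos_redeploy(no_converge_checks={no_converge_checks}, timeout={deploy_timeout})")


def wait_healthy(domain, timeout):  # assumed signature
    # Stub: the real check waits for services N/N plus the app HEALTH_PATH.
    calls.append(f"wait_healthy(timeout={timeout})")


def wait_ready_probes(meta, domain, timeout=600):
    # Stub: the real check polls the recipe's READY_PROBE endpoints.
    calls.append(f"wait_ready_probes(timeout={timeout})")


def perform_upgrade(domain: str, meta: dict) -> None:
    timeout = int(meta.get("DEPLOY_TIMEOUT", 900))  # assumed key lookup and default
    # Apply the new stack spec but skip abra's own convergence monitor...
    chaos_redeploy(domain, deploy_timeout=timeout, no_converge_checks=True)
    # ...then own the verification: replicas and app health first, recipe probes second.
    wait_healthy(domain, timeout=timeout)
    wait_ready_probes(meta, domain, timeout=timeout)


perform_upgrade("drive.example.org", {"DEPLOY_TIMEOUT": 1800})
```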

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 11:55:53 +01:00
parent aab77ea0f3
commit e1147b5fe3
5 changed files with 92 additions and 12 deletions

@@ -316,7 +316,7 @@ def recipe_checkout_ref(recipe: str, ref: str) -> None:
     abra.recipe_checkout(recipe, ref)


-def chaos_redeploy(domain: str, deploy_timeout: int = 900) -> None:
+def chaos_redeploy(domain: str, deploy_timeout: int = 900, no_converge_checks: bool = False) -> None:
     """In-place `abra app deploy --chaos`: redeploy the running app at the CURRENT recipe checkout
     (HC1: the PR-head code under test). This is the upgrade op, not a fresh install — it does NOT go
     through deploy_app, so the deploy-count guard (DG4.1) is not incremented.
@@ -324,8 +324,45 @@ def chaos_redeploy(domain: str, deploy_timeout: int = 900) -> None:
     `deploy_timeout` is the abra subprocess wrapper timeout; pass the recipe's DEPLOY_TIMEOUT so a
     heavy stack's reconverge (e.g. lasuite-drive's slow collabora/onlyoffice boot) isn't SIGKILLed
     by the 900s default while abra is still legitimately waiting (its internal TIMEOUT can be larger
-    via the .env). Mirrors the install deploy_app timeout plumbing."""
-    abra.deploy(domain, chaos=True, timeout=deploy_timeout)
+    via the .env). Mirrors the install deploy_app timeout plumbing.
+
+    `no_converge_checks` (`abra … -c`): skip abra's own convergence monitor — the caller then owns a
+    stricter convergence+health wait (F2-12: abra FATAs on the heavy lasuite-drive prev→PR-head
+    crossover while the new collabora's healthcheck is still in its start_period, even though it
+    converges given swarm's healthcheck retries). The stack spec IS applied either way (docker stack
+    deploy runs before the monitor)."""
+    abra.deploy(domain, chaos=True, timeout=deploy_timeout, no_converge_checks=no_converge_checks)
+
+
+def wait_ready_probes(meta: dict, domain: str, timeout: int = 600) -> None:
+    """Poll a recipe's optional READY_PROBE endpoints until each returns an accepted status, or raise.
+
+    A recipe_meta may define `READY_PROBE(domain) -> [{"host":..., "path":..., "ok":(200,)}, ...]`
+    for readiness signals NOT captured by container-replica convergence or the app's HEALTH_PATH —
+    e.g. lasuite-drive's collabora WOPI discovery (`/hosting/discovery` on the collabora sibling
+    host): swarm reports collabora 1/1 'running' while coolwsd is still doing jail/config init and
+    its discovery endpoint 404s, so replica-convergence alone is not real readiness. Used after the
+    install deploy and after the upgrade chaos redeploy so 'reconverged' means genuinely ready."""
+    probe_fn = meta.get("READY_PROBE")
+    if not callable(probe_fn):
+        return
+    probes = probe_fn(domain) or []
+    for probe in probes:
+        host = probe["host"]
+        path = probe.get("path", "/")
+        ok = tuple(probe.get("ok", (200,)))
+        deadline = time.time() + timeout
+        last = 0
+        while time.time() < deadline:
+            last = http_get(host, path, timeout=15)
+            if last in ok:
+                print(f"  ready-probe OK ({last}): https://{host}{path}", flush=True)
+                break
+            time.sleep(5)
+        else:
+            raise TimeoutError(
+                f"READY_PROBE not ready: https://{host}{path} (last status {last}) within {timeout}s"
+            )


 def backup_app(domain: str) -> str:
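
The deadline-bounded polling used by wait_ready_probes can be exercised in isolation with a
sketch like the one below. It is not the project's code: `poll_until_ok`, `FakeClock`, and the
injected `clock`/`sleep` parameters are hypothetical names added so the loop can run without a
network or real waiting, and it defaults to `time.monotonic` instead of the `time.time` used in
the diff:

```python
import time
from typing import Callable, Iterable


def poll_until_ok(
    fetch_status: Callable[[], int],
    ok: Iterable[int] = (200,),
    timeout: float = 600.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch_status until it returns an accepted status, or raise TimeoutError."""
    accepted = tuple(ok)
    deadline = clock() + timeout
    last = 0
    while clock() < deadline:
        last = fetch_status()
        if last in accepted:
            return last
        sleep(interval)
    raise TimeoutError(f"not ready (last status {last}) within {timeout}s")


class FakeClock:
    """Deterministic clock for the demo: sleeping only advances a counter."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# Simulated endpoint that 404s twice (still initialising) before answering 200.
responses = iter([404, 404, 200])
fake = FakeClock()
status = poll_until_ok(lambda: next(responses), timeout=5.0, interval=1.0,
                       clock=fake, sleep=fake.sleep)
```

Injecting the clock keeps the timeout path testable in milliseconds; the production loop in the
diff gets the same effect from its `while ... else` with a `break` on success.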