fix(2): F2-12 lasuite-drive upgrade tier — own convergence wait (abra -c) + collabora READY_PROBE

Adversary cold-verify FAILed Q3.2 (F2-12): the prev→PR-head chaos upgrade's abra converge monitor
FATAs while the NEW collabora 25.04.9.4.1's healthcheck is still in start_period (jail/config init),
even though it converges given swarm's healthcheck retries. My WOPI pre-gate fixed the OLD collabora
being killed mid-boot but not the NEW collabora's convergence. Flaky (3x green for me, 1x fail cold).

Fix (cc-ci-side, stronger verification — not weaker):
- abra.deploy gains no_converge_checks (`-c`); chaos_redeploy passes it for the upgrade op so abra's
  impatient monitor no longer FATAs (the stack spec is applied regardless).
- perform_upgrade now OWNS the convergence verification after the redeploy: wait_healthy (services
  N/N + app HEALTH_PATH) + new lifecycle.wait_ready_probes (recipe READY_PROBE), bounded by the
  recipe DEPLOY_TIMEOUT (generous) not abra's impatient window. meta threaded _perform_op→perform_upgrade.
- recipe_meta READY_PROBE hook (added to _load_meta whitelist): lasuite-drive probes collabora WOPI
  discovery (/hosting/discovery on collabora-<domain>) → 200. Called after install deploy AND after
  the upgrade redeploy. No-op for recipes without a READY_PROBE.

NOT re-claiming yet — validating the upgrade tier is now reliably green (incl. the slow-collabora
crossover) across multiple runs before re-claiming Q3.2. F2-12 stays open (Adversary-owned).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-29 11:55:53 +01:00
parent aab77ea0f3
commit e1147b5fe3
5 changed files with 92 additions and 12 deletions

View File

@ -194,7 +194,7 @@ def _load_meta(recipe: str) -> dict:
ns: dict = {}
with open(path) as fh:
exec(compile(fh.read(), path, "exec"), ns) # noqa: S102 (trusted, in-repo)
for k in list(meta) + ["BACKUP_CAPABLE", "SKIP_GENERIC", "OIDC_AT_INSTALL"]:
for k in list(meta) + ["BACKUP_CAPABLE", "SKIP_GENERIC", "OIDC_AT_INSTALL", "READY_PROBE"]:
if k in ns:
meta[k] = ns[k]
return meta
@ -240,15 +240,17 @@ def _run_pre_hook(recipe: str, op: str, repo_local: str | None, domain: str, met
def _perform_op(
op: str, domain: str, recipe: str, head_ref: str | None, op_state: dict, deploy_timeout: int = 900
op: str, domain: str, recipe: str, head_ref: str | None, op_state: dict, deploy_timeout: int = 900,
meta: dict | None = None,
) -> None:
"""Perform the single mutating op ONCE (the harness owns the op, HC3). install has no op. Records
what the assertions need (pre-upgrade identity, backup snapshot_id) into op_state. None of these
call deploy_app, so the deploy-count guard (DG4.1) stays 1 — the in-place chaos upgrade is not a
new install (HC1 reconciliation). `deploy_timeout` (recipe DEPLOY_TIMEOUT) is plumbed to the
upgrade chaos redeploy so a heavy reconverge isn't SIGKILLed by the 900s default mid-wait."""
upgrade chaos redeploy so a heavy reconverge isn't SIGKILLed by the 900s default mid-wait; `meta`
lets the upgrade op own a recipe-aware convergence+health wait (F2-12, READY_PROBE)."""
if op == "upgrade":
before = generic.perform_upgrade(domain, recipe, head_ref, deploy_timeout=deploy_timeout)
before = generic.perform_upgrade(domain, recipe, head_ref, deploy_timeout=deploy_timeout, meta=meta)
op_state["upgrade"] = {"before": before, "head_ref": head_ref}
elif op == "backup":
op_state["backup"] = {"snapshot_id": generic.perform_backup(domain)}
@ -290,7 +292,10 @@ def run_lifecycle_tier(
# 1) pre-op seed hook + 2) the op ONCE (harness-owned). A failure here is an op failure → tier fail.
try:
_run_pre_hook(recipe, op, repo_local, domain, meta)
_perform_op(op, domain, recipe, head_ref, op_state, deploy_timeout=int(meta.get("DEPLOY_TIMEOUT", 900)))
_perform_op(
op, domain, recipe, head_ref, op_state,
deploy_timeout=int(meta.get("DEPLOY_TIMEOUT", 900)), meta=meta,
)
with open(os.environ["CCCI_OP_STATE_FILE"], "w") as f:
json.dump(op_state, f)
except Exception as e: # noqa: BLE001 — a failed op is a reported tier failure, not a crash
@ -801,6 +806,9 @@ def main() -> int:
deploy_timeout=meta["DEPLOY_TIMEOUT"],
http_timeout=meta["HTTP_TIMEOUT"],
)
# Recipe READY_PROBE (e.g. lasuite-drive collabora WOPI discovery) — readiness beyond
# replica convergence + app HEALTH_PATH; no-op for recipes without one.
lifecycle.wait_ready_probes(meta, domain, timeout=int(meta.get("DEPLOY_TIMEOUT", 900)))
deploy_ok = True
except Exception as e: # noqa: BLE001 — a failed deploy is a reported INSTALL failure
print(f"!! deploy/readiness failed: {e}", flush=True)