fix(2): lasuite-drive Q3.2a — gate upgrade redeploy on collabora-ready + plumb DEPLOY_TIMEOUT

Q3.2a run 1: Part A (install-time OIDC) GREEN: deploy-count=1, install/backup/restore/custom + OIDC test all PASS.

BUT the upgrade tier FAILED: the in-place `abra app deploy --chaos` redeploy landed on a STILL-BOOTING collabora (coolwsd ~2min boot: 1300+ l10n files + RSA keygen) and SIGTERMed it mid-init ("Shutdown requested while starting up", forced exit 70) → abra aborted the deploy. The install wait_healthy returns on container 1/1 while coolwsd is still loading.

Fixes (plan §C readiness-gating, no test weakened):

- tests/lasuite-drive/ops.py::pre_upgrade: wait for collabora WOPI discovery (/hosting/discovery on collabora-<domain>) → 200 BEFORE the chaos redeploy, so it replaces a ready collabora cleanly.
- runner/harness/lifecycle.chaos_redeploy + generic.perform_upgrade + run_recipe_ci._perform_op: plumb the recipe DEPLOY_TIMEOUT to the upgrade chaos redeploy (was abra.deploy's 900s default, while the .env internal TIMEOUT is 1500s → Python could SIGKILL abra mid-wait on the slow collabora/onlyoffice reconverge). Mirrors the install deploy_app timeout plumbing.

Also (operator naming change 2026-05-29): renamed `--extra-tests` -> `--extra` in DEFERRED.md + BACKLOG-2.md Build-backlog section. 3 refs remain in BACKLOG-2 Adversary-findings section (241/248/292, closed findings), left for the Adversary (single-writer); orchestrator updated IDEAS.md/plan-sso-dep-testing.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 10:37:55 +01:00
parent 0b558529c9
commit 4b38b66fa5
6 changed files with 63 additions and 23 deletions
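
The timeout plumbing described in the commit message can be sketched roughly as below. This is a minimal illustration, not the harness code: `resolve_deploy_timeout`, the `meta` dict shape, the injectable `run` parameter, and the exact `abra` argument order are assumptions made for the example. Only the 900s default and the `--chaos` redeploy come from the message above. The idea is that the Python-side subprocess timeout follows the recipe's DEPLOY_TIMEOUT, so Python does not kill abra while abra is still inside its own (longer) internal wait.

```python
import subprocess

# 900s is the abra.deploy default quoted in the commit message.
ABRA_DEFAULT_TIMEOUT = 900


def resolve_deploy_timeout(meta, default=ABRA_DEFAULT_TIMEOUT):
    """Return the recipe's DEPLOY_TIMEOUT in seconds, or the default if unset.

    `meta` is a hypothetical dict of recipe settings used for this sketch.
    """
    raw = meta.get("DEPLOY_TIMEOUT")
    if raw is None:
        return default
    timeout = int(raw)
    if timeout <= 0:
        raise ValueError(f"DEPLOY_TIMEOUT must be positive, got {timeout}")
    return timeout


def chaos_redeploy(domain, deploy_timeout=None, run=subprocess.run):
    """Run the in-place chaos redeploy, bounded by the caller's timeout.

    `run` is injectable so the plumbing can be checked without invoking abra.
    The argument order of the abra command is an assumption of this sketch.
    """
    timeout = ABRA_DEFAULT_TIMEOUT if deploy_timeout is None else deploy_timeout
    cmd = ["abra", "app", "deploy", domain, "--chaos"]
    # subprocess.run kills the child and raises TimeoutExpired once `timeout`
    # elapses, so this value should exceed abra's own internal wait.
    return run(cmd, check=True, timeout=timeout)
```

A caller in the upgrade path would then pass `deploy_timeout=resolve_deploy_timeout(meta)`, in the same way the message says the install deploy_app path already does.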
--- a/tests/lasuite-drive/ops.py
+++ b/tests/lasuite-drive/ops.py
@@ -6,11 +6,35 @@ backupbot-labelled)."""

 import os
 import sys
+import time

 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
 from harness import lifecycle  # noqa: E402


+def _wait_collabora_ready(domain, timeout=420):
+    """Gate the upgrade op on collabora being FULLY ready (WOPI discovery endpoint → 200), not just
+    container 1/1 'running'. coolwsd takes ~2min to boot (pre-reads 1300+ l10n files + RSA keygen);
+    the install wait_healthy returns on container 1/1 while coolwsd is still loading. An in-place
+    `abra app deploy --chaos` upgrade that lands on a still-booting collabora SIGTERMs it mid-init
+    ("Shutdown requested while starting up", forced exit 70) → abra aborts the deploy (Q3.2a run 1,
+    JOURNAL 2026-05-29). Waiting for discovery=200 first makes the redeploy replace a ready collabora
+    cleanly. collabora routes on the COLLABORA_DOMAIN sibling (collabora-<domain>); /hosting/discovery
+    is the WOPI discovery endpoint celery's configure_wopi calls."""
+    host = f"collabora-{domain}"
+    deadline = time.time() + timeout
+    last = 0
+    while time.time() < deadline:
+        last = lifecycle.http_get(host, "/hosting/discovery", timeout=15)
+        if last == 200:
+            print(f"  pre_upgrade: collabora WOPI discovery ready (200) on {host}", flush=True)
+            return
+        time.sleep(5)
+    raise AssertionError(
+        f"collabora WOPI discovery not ready on {host} (last status {last}) within {timeout}s"
+    )
+
+
 def _psql(domain, sql):
    cmd = f'PGPASSWORD=$(cat /run/secrets/postgres_p) psql -U drive -d drive -tAc "{sql}"'
    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
@@ -26,6 +50,9 @@ def _seed(domain, value):


 def pre_upgrade(domain, meta):
+    # Gate the chaos redeploy on a fully-ready collabora (else it kills a still-booting coolwsd and
+    # abra aborts the upgrade deploy — Q3.2a run 1). Then seed the data-integrity marker.
+    _wait_collabora_ready(domain)
    _seed(domain, "upgrade-survives")