fix(2): lasuite-drive Q3.2a — gate upgrade redeploy on collabora-ready + plumb DEPLOY_TIMEOUT

Q3.2a run 1: Part A (install-time OIDC) GREEN: deploy-count=1, install/backup/restore/custom + OIDC test all PASS.

BUT upgrade tier FAILED: the in-place `abra app deploy --chaos` redeploy landed on a STILL-BOOTING collabora (coolwsd ~2min boot: 1300+ l10n files + RSA keygen) and SIGTERMed it mid-init ("Shutdown requested while starting up", forced exit 70) → abra aborted the deploy. The install wait_healthy returns on container 1/1 while coolwsd is still loading.

Fixes (plan §C readiness-gating, no test weakened):
- tests/lasuite-drive/ops.py::pre_upgrade: wait for collabora WOPI discovery (/hosting/discovery on collabora-<domain>) → 200 BEFORE the chaos redeploy, so it replaces a ready collabora cleanly.
- runner/harness/lifecycle.chaos_redeploy + generic.perform_upgrade + run_recipe_ci._perform_op: plumb the recipe DEPLOY_TIMEOUT to the upgrade chaos redeploy (was abra.deploy's 900s default, while the .env internal TIMEOUT is 1500s → Python could SIGKILL abra mid-wait on the slow collabora/onlyoffice reconverge). Mirrors the install deploy_app timeout plumbing.

Also (operator naming change 2026-05-29): renamed `--extra-tests` -> `--extra` in DEFERRED.md + BACKLOG-2.md Build-backlog section. 3 refs remain in BACKLOG-2 Adversary-findings section (241/248/292, closed findings), left for the Adversary (single-writer); orchestrator updated IDEAS.md/plan-sso-dep-testing.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 10:37:55 +01:00
parent 0b558529c9
commit 4b38b66fa5
6 changed files with 63 additions and 23 deletions
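The readiness gate described in the commit message (poll collabora's `/hosting/discovery` until it answers 200 before redeploying) might look roughly like the sketch below. This is not the code in `tests/lasuite-drive/ops.py`; the function names, the 300s deadline, and the 5s poll interval are assumptions chosen for illustration, and the probe and sleep functions are injectable only so the loop can be exercised without a network.

```python
from __future__ import annotations

import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, request_timeout: float = 10.0) -> int | None:
    """Return the HTTP status of a GET on `url`, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=request_timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, socket timeout, ...


def wait_for_discovery(
    base_url: str,
    deadline_s: float = 300.0,  # hypothetical budget, not taken from the commit
    interval_s: float = 5.0,  # hypothetical poll interval
    probe: Callable[[str], int | None] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until `<base_url>/hosting/discovery` returns 200, else raise TimeoutError."""
    url = base_url.rstrip("/") + "/hosting/discovery"
    stop_at = time.monotonic() + deadline_s
    while True:
        status = probe(url)
        if status == 200:
            return  # the service is answering discovery, so a redeploy can replace it cleanly
        if time.monotonic() >= stop_at:
            raise TimeoutError(f"{url} not ready after {deadline_s}s (last status: {status})")
        sleep(interval_s)
```

Gating on an application-level endpoint rather than on the container count addresses the gap the message names: a container can report 1/1 while the service inside it is still booting.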
--- a/runner/harness/generic.py
+++ b/runner/harness/generic.py
@@ -214,17 +214,22 @@ def assert_restore_healthy(domain: str, meta: dict) -> None:
 # ---- Op primitives (orchestrator-only; perform the op once, never assert) --------------------


-def perform_upgrade(domain: str, recipe: str, head_ref: str | None) -> dict[str, str | None]:
+def perform_upgrade(
+    domain: str, recipe: str, head_ref: str | None, deploy_timeout: int = 900
+) -> dict[str, str | None]:
    """Perform the UPGRADE op once, in place, to the PR-HEAD code under test (HC1): re-checkout the
    PR head (the prev-tag base deploy reset the recipe working tree), then `abra app deploy --chaos`
    to redeploy the running app at that checkout. This is the real upgrade the PR's changes are
    exercised by (vs the old 'upgrade to newest published tag', which never deployed PR-head code).
    Returns the pre-upgrade identity so the orchestrator records it for `assert_upgraded`'s move check
-    — after the chaos deploy the `chaos`(-version) label carries the PR-head commit, proving it."""
+    — after the chaos deploy the `chaos`(-version) label carries the PR-head commit, proving it.
+
+    `deploy_timeout` (recipe DEPLOY_TIMEOUT) is plumbed to the chaos redeploy so a heavy stack's
+    reconverge isn't SIGKILLed by abra.deploy's 900s default mid-wait."""
    before = lifecycle.deployed_identity(domain)
    if head_ref:
        lifecycle.recipe_checkout_ref(recipe, head_ref)
-    lifecycle.chaos_redeploy(domain)
+    lifecycle.chaos_redeploy(domain, deploy_timeout=deploy_timeout)
    after = lifecycle.deployed_identity(domain)
    # Evidence (HC1): the chaos-version label = the deployed recipe commit; it should match the
    # PR-head we checked out — proving the upgrade deployed the code under test, not a published tag.
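The timeout plumbing in the hunk above can be illustrated with a small standalone sketch. It assumes the outer deadline is enforced through `subprocess.run(..., timeout=...)`, which kills the child and raises `subprocess.TimeoutExpired` on expiry; the harness's real `abra.deploy` wrapper is not shown in this diff, so `run_with_deadline` and `resolve_deploy_timeout` are hypothetical names, and reading `DEPLOY_TIMEOUT` from a plain mapping is an assumption.

```python
from __future__ import annotations

import subprocess
from typing import Mapping

DEFAULT_DEPLOY_TIMEOUT = 900  # the default quoted in the commit message


def resolve_deploy_timeout(recipe_env: Mapping[str, str]) -> int:
    """Pick the recipe's DEPLOY_TIMEOUT if set, else fall back to the default."""
    raw = recipe_env.get("DEPLOY_TIMEOUT", "").strip()
    return int(raw) if raw else DEFAULT_DEPLOY_TIMEOUT


def run_with_deadline(cmd: list[str], deploy_timeout: int = DEFAULT_DEPLOY_TIMEOUT) -> int:
    """Run `cmd` under an outer deadline and return its exit code.

    If the deadline is shorter than the command's own internal wait, the child is
    killed mid-wait and TimeoutExpired propagates to the caller.
    """
    proc = subprocess.run(cmd, timeout=deploy_timeout, check=False)
    return proc.returncode
```

With the values from the commit message, an outer deadline of 900s wrapped around a command that may itself wait up to 1500s can expire first; passing the recipe's value through every layer keeps the outer deadline at least as long as the inner one.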