fix(2): lasuite-drive Q3.2a — gate upgrade redeploy on collabora-ready + plumb DEPLOY_TIMEOUT
Q3.2a run 1: Part A (install-time OIDC) is GREEN: deploy-count=1, and install/backup/restore/custom +
OIDC test all PASS. The upgrade tier FAILED, though: the in-place `abra app deploy --chaos` redeploy landed
on a STILL-BOOTING collabora (coolwsd takes ~2min to boot: 1300+ l10n files + RSA keygen) and SIGTERMed it
mid-init ("Shutdown requested while starting up", forced exit 70), so abra aborted the deploy. Root cause: the
install wait_healthy returns as soon as the container is 1/1, while coolwsd is still loading. Fixes (plan §C
readiness-gating, no test weakened):
- tests/lasuite-drive/ops.py::pre_upgrade — wait for collabora WOPI discovery (/hosting/discovery
on collabora-<domain>) → 200 BEFORE the chaos redeploy, so it replaces a ready collabora cleanly.
- runner/harness/lifecycle.chaos_redeploy + generic.perform_upgrade + run_recipe_ci._perform_op —
plumb the recipe DEPLOY_TIMEOUT to the upgrade chaos redeploy (was abra.deploy's 900s default,
while the .env internal TIMEOUT is 1500s → Python could SIGKILL abra mid-wait on the slow
collabora/onlyoffice reconverge). Mirrors the install deploy_app timeout plumbing.
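
A minimal sketch of the readiness gate described in the first bullet, not the actual `pre_upgrade` code (that hunk is not shown here). The names `http_status` and `wait_for_discovery`, the `https://` scheme, and the deadline/interval values are all assumptions for illustration; only the `/hosting/discovery` path, the `collabora-<domain>` host, and the "wait for 200" condition come from this message.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET on `url`, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:  # non-2xx responses still carry a status code
        return e.code
    except (urllib.error.URLError, OSError):  # DNS failure, refused connection, timeout
        return 0


def wait_for_discovery(
    domain: str,
    fetch: Callable[[str], int] = http_status,
    deadline_s: float = 300.0,   # hypothetical budget, not taken from the recipe
    interval_s: float = 5.0,     # hypothetical poll interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Block until collabora's WOPI discovery endpoint answers 200, else raise TimeoutError."""
    url = f"https://collabora-{domain}/hosting/discovery"  # scheme is an assumption
    end = clock() + deadline_s
    while True:
        if fetch(url) == 200:
            return  # coolwsd finished booting; a redeploy now replaces a ready service
        if clock() >= end:
            raise TimeoutError(f"{url} did not return 200 within {deadline_s}s")
        sleep(interval_s)
```

Polling an application-level endpoint rather than the container count is the point: 1/1 only says the container started, while a 200 from discovery says coolwsd is actually serving.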
Also (operator naming change 2026-05-29): renamed `--extra-tests` -> `--extra` in DEFERRED.md +
BACKLOG-2.md Build-backlog section. 3 refs remain in BACKLOG-2 Adversary-findings section
(241/248/292, closed findings) — left for the Adversary (single-writer); orchestrator updated
IDEAS.md/plan-sso-dep-testing.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -214,17 +214,22 @@ def assert_restore_healthy(domain: str, meta: dict) -> None:
 # ---- Op primitives (orchestrator-only; perform the op once, never assert) --------------------
 
 
-def perform_upgrade(domain: str, recipe: str, head_ref: str | None) -> dict[str, str | None]:
+def perform_upgrade(
+    domain: str, recipe: str, head_ref: str | None, deploy_timeout: int = 900
+) -> dict[str, str | None]:
     """Perform the UPGRADE op once, in place, to the PR-HEAD code under test (HC1): re-checkout the
     PR head (the prev-tag base deploy reset the recipe working tree), then `abra app deploy --chaos`
     to redeploy the running app at that checkout. This is the real upgrade the PR's changes are
     exercised by (vs the old 'upgrade to newest published tag', which never deployed PR-head code).
     Returns the pre-upgrade identity so the orchestrator records it for `assert_upgraded`'s move check
-    — after the chaos deploy the `chaos`(-version) label carries the PR-head commit, proving it."""
+    — after the chaos deploy the `chaos`(-version) label carries the PR-head commit, proving it.
+
+    `deploy_timeout` (recipe DEPLOY_TIMEOUT) is plumbed to the chaos redeploy so a heavy stack's
+    reconverge isn't SIGKILLed by abra.deploy's 900s default mid-wait."""
     before = lifecycle.deployed_identity(domain)
     if head_ref:
         lifecycle.recipe_checkout_ref(recipe, head_ref)
-    lifecycle.chaos_redeploy(domain)
+    lifecycle.chaos_redeploy(domain, deploy_timeout=deploy_timeout)
     after = lifecycle.deployed_identity(domain)
     # Evidence (HC1): the chaos-version label = the deployed recipe commit; it should match the
     # PR-head we checked out — proving the upgrade deployed the code under test, not a published tag.
@@ -316,11 +316,16 @@ def recipe_checkout_ref(recipe: str, ref: str) -> None:
     abra.recipe_checkout(recipe, ref)
 
 
-def chaos_redeploy(domain: str) -> None:
+def chaos_redeploy(domain: str, deploy_timeout: int = 900) -> None:
     """In-place `abra app deploy --chaos`: redeploy the running app at the CURRENT recipe checkout
     (HC1: the PR-head code under test). This is the upgrade op, not a fresh install — it does NOT go
-    through deploy_app, so the deploy-count guard (DG4.1) is not incremented."""
-    abra.deploy(domain, chaos=True)
+    through deploy_app, so the deploy-count guard (DG4.1) is not incremented.
+
+    `deploy_timeout` is the abra subprocess wrapper timeout; pass the recipe's DEPLOY_TIMEOUT so a
+    heavy stack's reconverge (e.g. lasuite-drive's slow collabora/onlyoffice boot) isn't SIGKILLed
+    by the 900s default while abra is still legitimately waiting (its internal TIMEOUT can be larger
+    via the .env). Mirrors the install deploy_app timeout plumbing."""
+    abra.deploy(domain, chaos=True, timeout=deploy_timeout)
 
 
 def backup_app(domain: str) -> str:
@@ -239,13 +239,16 @@ def _run_pre_hook(recipe: str, op: str, repo_local: str | None, domain: str, met
         sys.path.remove(d)
 
 
-def _perform_op(op: str, domain: str, recipe: str, head_ref: str | None, op_state: dict) -> None:
+def _perform_op(
+    op: str, domain: str, recipe: str, head_ref: str | None, op_state: dict, deploy_timeout: int = 900
+) -> None:
     """Perform the single mutating op ONCE (the harness owns the op, HC3). install has no op. Records
     what the assertions need (pre-upgrade identity, backup snapshot_id) into op_state. None of these
     call deploy_app, so the deploy-count guard (DG4.1) stays 1 — the in-place chaos upgrade is not a
-    new install (HC1 reconciliation)."""
+    new install (HC1 reconciliation). `deploy_timeout` (recipe DEPLOY_TIMEOUT) is plumbed to the
+    upgrade chaos redeploy so a heavy reconverge isn't SIGKILLed by the 900s default mid-wait."""
     if op == "upgrade":
-        before = generic.perform_upgrade(domain, recipe, head_ref)
+        before = generic.perform_upgrade(domain, recipe, head_ref, deploy_timeout=deploy_timeout)
         op_state["upgrade"] = {"before": before, "head_ref": head_ref}
     elif op == "backup":
         op_state["backup"] = {"snapshot_id": generic.perform_backup(domain)}
@@ -287,7 +290,7 @@ def run_lifecycle_tier(
     # 1) pre-op seed hook + 2) the op ONCE (harness-owned). A failure here is an op failure → tier fail.
     try:
         _run_pre_hook(recipe, op, repo_local, domain, meta)
-        _perform_op(op, domain, recipe, head_ref, op_state)
+        _perform_op(op, domain, recipe, head_ref, op_state, deploy_timeout=int(meta.get("DEPLOY_TIMEOUT", 900)))
         with open(os.environ["CCCI_OP_STATE_FILE"], "w") as f:
             json.dump(op_state, f)
     except Exception as e:  # noqa: BLE001 — a failed op is a reported tier failure, not a crash
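
Why the timeout has to be plumbed through, as a standalone sketch. It assumes the abra wrapper behaves like `subprocess.run(..., timeout=...)`, which kills the child when the wrapper timeout expires; the real wrapper is not shown in this diff. `resolve_deploy_timeout` and `run_with_wrapper_timeout` are hypothetical names, and the child is a stand-in `sleep`, not abra.

```python
import subprocess
import sys

DEFAULT_DEPLOY_TIMEOUT = 900  # same fallback the diff uses


def resolve_deploy_timeout(meta: dict, default: int = DEFAULT_DEPLOY_TIMEOUT) -> int:
    """Read DEPLOY_TIMEOUT from recipe meta (may be a string), falling back to the default."""
    return int(meta.get("DEPLOY_TIMEOUT", default))


def run_with_wrapper_timeout(child_wait_s: float, wrapper_timeout_s: float) -> str:
    """Run a child that waits `child_wait_s` under a wrapper timeout; report what happened."""
    cmd = [sys.executable, "-c", f"import time; time.sleep({child_wait_s})"]
    try:
        # On timeout, subprocess.run kills the child and raises TimeoutExpired.
        subprocess.run(cmd, timeout=wrapper_timeout_s, check=True)
    except subprocess.TimeoutExpired:
        return "killed"
    return "completed"
```

With the numbers from this commit, a child that may legitimately wait up to 1500s under a 900s wrapper ends up in the "killed" branch; passing the recipe's DEPLOY_TIMEOUT as the wrapper timeout lets the wrapper outlast the child's own wait.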