fix(2): F2-12 lasuite-drive upgrade tier — own convergence wait (abra -c) + collabora READY_PROBE
Adversary cold-verify FAILed Q3.2 (F2-12): the prev→PR-head chaos upgrade's abra converge monitor FATAs while the NEW collabora 25.04.9.4.1's healthcheck is still in start_period (jail/config init), even though it converges given swarm's healthcheck retries. My WOPI pre-gate fixed the OLD collabora being killed mid-boot but not the NEW collabora's convergence. Flaky (3x green for me, 1x fail cold). Fix (cc-ci-side, stronger verification — not weaker): - abra.deploy gains no_converge_checks (`-c`); chaos_redeploy passes it for the upgrade op so abra's impatient monitor no longer FATAs (the stack spec is applied regardless). - perform_upgrade now OWNS the convergence verification after the redeploy: wait_healthy (services N/N + app HEALTH_PATH) + new lifecycle.wait_ready_probes (recipe READY_PROBE), bounded by the recipe DEPLOY_TIMEOUT (generous) not abra's impatient window. meta threaded _perform_op→perform_upgrade. - recipe_meta READY_PROBE hook (added to _load_meta whitelist): lasuite-drive probes collabora WOPI discovery (/hosting/discovery on collabora-<domain>) → 200. Called after install deploy AND after the upgrade redeploy. No-op for recipes without a READY_PROBE. NOT re-claiming yet — validating the upgrade tier is now reliably green (incl. the slow-collabora crossover) across multiple runs before re-claiming Q3.2. F2-12 stays open (Adversary-owned). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@ -215,7 +215,7 @@ def assert_restore_healthy(domain: str, meta: dict) -> None:
|
||||
|
||||
|
||||
def perform_upgrade(
|
||||
domain: str, recipe: str, head_ref: str | None, deploy_timeout: int = 900
|
||||
domain: str, recipe: str, head_ref: str | None, deploy_timeout: int = 900, meta: dict | None = None
|
||||
) -> dict[str, str | None]:
|
||||
"""Perform the UPGRADE op once, in place, to the PR-HEAD code under test (HC1): re-checkout the
|
||||
PR head (the prev-tag base deploy reset the recipe working tree), then `abra app deploy --chaos`
|
||||
@ -225,11 +225,28 @@ def perform_upgrade(
|
||||
— after the chaos deploy the `chaos`(-version) label carries the PR-head commit, proving it.
|
||||
|
||||
`deploy_timeout` (recipe DEPLOY_TIMEOUT) is plumbed to the chaos redeploy so a heavy stack's
|
||||
reconverge isn't SIGKILLed by abra.deploy's 900s default mid-wait."""
|
||||
reconverge isn't SIGKILLed by abra.deploy's 900s default mid-wait.
|
||||
|
||||
F2-12: the chaos redeploy runs with `--no-converge-checks` (abra's own convergence monitor FATAs
|
||||
on the heavy lasuite-drive prev→PR-head crossover while the NEW collabora's healthcheck is still
|
||||
in its start_period, even though it converges given swarm's healthcheck retries). We then own a
|
||||
STRICTER convergence+health wait here: services N/N (wait_healthy) + app HEALTH_PATH healthy +
|
||||
any recipe READY_PROBE (collabora WOPI discovery 200). This bounds readiness by OUR generous
|
||||
deadline, not abra's impatient one — and is stronger evidence than abra's monitor."""
|
||||
meta = meta or {}
|
||||
before = lifecycle.deployed_identity(domain)
|
||||
if head_ref:
|
||||
lifecycle.recipe_checkout_ref(recipe, head_ref)
|
||||
lifecycle.chaos_redeploy(domain, deploy_timeout=deploy_timeout)
|
||||
lifecycle.chaos_redeploy(domain, deploy_timeout=deploy_timeout, no_converge_checks=True)
|
||||
# Own the convergence verification (abra's monitor was skipped via -c).
|
||||
lifecycle.wait_healthy(
|
||||
domain,
|
||||
ok_codes=tuple(meta.get("HEALTH_OK", (200, 301, 302))),
|
||||
path=meta.get("HEALTH_PATH", "/"),
|
||||
deploy_timeout=int(meta.get("DEPLOY_TIMEOUT", deploy_timeout)),
|
||||
http_timeout=int(meta.get("HTTP_TIMEOUT", 300)),
|
||||
)
|
||||
lifecycle.wait_ready_probes(meta, domain, timeout=int(meta.get("DEPLOY_TIMEOUT", deploy_timeout)))
|
||||
after = lifecycle.deployed_identity(domain)
|
||||
# Evidence (HC1): the chaos-version label = the deployed recipe commit; it should match the
|
||||
# PR-head we checked out — proving the upgrade deployed the code under test, not a published tag.
|
||||
|
||||
Reference in New Issue
Block a user