fix(2): lasuite-drive Q3.2a — gate upgrade redeploy on collabora-ready + plumb DEPLOY_TIMEOUT

Q3.2a run 1: Part A (install-time OIDC) GREEN — deploy-count=1, install/backup/restore/custom +
OIDC test all PASS. BUT upgrade tier FAILED: the in-place `abra app deploy --chaos` redeploy landed
on a STILL-BOOTING collabora (coolwsd ~2min boot: 1300+ l10n files + RSA keygen) and SIGTERMed it
mid-init ("Shutdown requested while starting up", forced exit 70) → abra aborted the deploy. The
install wait_healthy returns on container 1/1 while coolwsd is still loading. Fixes (plan §C
readiness-gating, no test weakened):

- tests/lasuite-drive/ops.py::pre_upgrade — wait for collabora WOPI discovery (/hosting/discovery
  on collabora-<domain>) → 200 BEFORE the chaos redeploy, so it replaces a ready collabora cleanly.
- runner/harness/lifecycle.chaos_redeploy + generic.perform_upgrade + run_recipe_ci._perform_op —
  plumb the recipe DEPLOY_TIMEOUT to the upgrade chaos redeploy (was abra.deploy's 900s default,
  while the .env internal TIMEOUT is 1500s → Python could SIGKILL abra mid-wait on the slow
  collabora/onlyoffice reconverge). Mirrors the install deploy_app timeout plumbing.
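
The timeout mismatch described in the second bullet can be illustrated with a minimal sketch. It assumes the abra wrapper ultimately runs the CLI through `subprocess.run(..., timeout=...)`, which this commit does not show; `resolve_deploy_timeout` and `run_with_wrapper_timeout` are hypothetical names, not harness functions. `subprocess.run` kills the child when its timeout expires, so a wrapper timeout shorter than the tool's own internal wait cuts the tool off mid-wait.

```python
import subprocess
import sys


def resolve_deploy_timeout(meta: dict, default: int = 900) -> int:
    """Hypothetical helper: read DEPLOY_TIMEOUT from recipe meta, else use the default."""
    raw = meta.get("DEPLOY_TIMEOUT")
    if raw in (None, ""):
        return default
    return int(raw)


def run_with_wrapper_timeout(cmd: list[str], timeout: float) -> str:
    """Run cmd under a wrapper timeout and report whether the wrapper cut it off.

    subprocess.run kills the child and raises TimeoutExpired once the timeout
    elapses, no matter how long the child itself intended to keep waiting.
    """
    try:
        subprocess.run(cmd, timeout=timeout, check=True)
    except subprocess.TimeoutExpired:
        return "killed-by-wrapper"
    return "completed"


# Stand-in for a slow deploy: a child process that needs 0.5s to finish
# (an illustration only, not real abra behaviour).
slow_child = [sys.executable, "-c", "import time; time.sleep(0.5)"]
```

With a recipe that sets `DEPLOY_TIMEOUT=1500`, `resolve_deploy_timeout` yields 1500, so the wrapper outlasts the tool's internal wait instead of cutting it off at the 900s default.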

Also (operator naming change 2026-05-29): renamed `--extra-tests` -> `--extra` in DEFERRED.md +
BACKLOG-2.md Build-backlog section. 3 refs remain in BACKLOG-2 Adversary-findings section
(241/248/292, closed findings) — left for the Adversary (single-writer); orchestrator updated
IDEAS.md/plan-sso-dep-testing.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 10:37:55 +01:00
parent 0b558529c9
commit 4b38b66fa5
6 changed files with 63 additions and 23 deletions

@@ -133,7 +133,7 @@ Phase plan: `/srv/cc-ci/cc-ci-plan/plan-phase2-recipe-tests.md`
 - [ ] **Q4.7** — plausible: enroll; specific (track a test event, query it back).
 - [x] **Q4.8** — uptime-kuma: enrolled. PARITY.md + recipe_meta.py + 3 functional tests
 (health_check, socketio_handshake, spa_branding). Cold green (commit `1aaf3bd`).
-Create-a-monitor in DEFERRED.md (Socket.IO client primitive + --extra-tests; F2-10 closed).
+Create-a-monitor in DEFERRED.md (Socket.IO client primitive + --extra; F2-10 closed).
 - [ ] **Q4.9** — mailu: enroll; specific (create a mailbox, send/receive verification).
 - [ ] **Q4.10** — drone: enroll; specific (create/list builds via API).
 - [ ] **Q4.11** — Q4 gate: each recipe green with parity + specific.

@@ -46,9 +46,9 @@ before the build is called done) — but does **not** force closure.
 tradeoff is real — too-small N loses the test's meaning (state-group bloat is by definition a
 large-state phenomenon), too-large N inflates per-run time. Defensible defer; operator-confirmed
 2026-05-28: heavier than needed for default CI.
-- **Re-entry trigger:** the `--extra-tests` opt-in flag (see linked IDEA) so this runs only when
+- **Re-entry trigger:** the `--extra` opt-in flag (see linked IDEA) so this runs only when
 the operator explicitly asks for the heavy suite; or a dedicated long-running matrix instance.
-- **Linked IDEA:** `cc-ci-plan/IDEAS.md`*Optional `--extra-tests` flag for heavy/operational tests*.
+- **Linked IDEA:** `cc-ci-plan/IDEAS.md`*Optional `--extra` flag for heavy/operational tests*.

 ### 2026-05-28 — matrix-synapse `test_complexity_limit.sh` port
 - [ ] **What:** Port `recipe-info/matrix-synapse/tests/test_complexity_limit.sh` — exercise Synapse's
@@ -56,8 +56,8 @@ before the build is called done) — but does **not** force closure.
 - **Filed by:** Builder, phase 2 (Q4.1 matrix-synapse PARITY pass)
 - **Reason for deferral:** Load-test class; needs many-event setup. Operator-confirmed 2026-05-28:
 more than needed for a default matrix CI test.
-- **Re-entry trigger:** the `--extra-tests` opt-in flag (linked IDEA).
-- **Linked IDEA:** `cc-ci-plan/IDEAS.md`*Optional `--extra-tests` flag for heavy/operational tests*.
+- **Re-entry trigger:** the `--extra` opt-in flag (linked IDEA).
+- **Linked IDEA:** `cc-ci-plan/IDEAS.md`*Optional `--extra` flag for heavy/operational tests*.

 ### 2026-05-28 — matrix-synapse `test_purge.sh` port
 - [ ] **What:** Port `recipe-info/matrix-synapse/tests/test_purge.sh` — exercise the recipe's
@@ -66,9 +66,9 @@ before the build is called done) — but does **not** force closure.
 - **Reason for deferral:** Recipe-helper-script tests, not synapse-behaviour tests (orthogonal to
 default Phase-2 coverage). Operator-confirmed 2026-05-28: more than needed for a default matrix
 CI test.
-- **Re-entry trigger:** the `--extra-tests` opt-in flag (linked IDEA) — so PRs touching the recipe's
+- **Re-entry trigger:** the `--extra` opt-in flag (linked IDEA) — so PRs touching the recipe's
 abra helper scripts can opt in to exercising them.
-- **Linked IDEA:** `cc-ci-plan/IDEAS.md`*Optional `--extra-tests` flag for heavy/operational tests*.
+- **Linked IDEA:** `cc-ci-plan/IDEAS.md`*Optional `--extra` flag for heavy/operational tests*.

 ### 2026-05-28 — matrix-synapse media upload/download roundtrip
 - [ ] **What:** Add `tests/matrix-synapse/functional/test_media_upload_roundtrip.py` exercising
@@ -118,10 +118,10 @@ before the build is called done) — but does **not** force closure.
 (parity health + Socket.IO handshake + SPA branding) cover the same handshake + bundle the
 setup-then-monitor flow would use; adding a full Socket.IO client is a substantial harness
 primitive worth deferring until either (a) another recipe also needs Socket.IO interaction or
-(b) the `--extra-tests` flag lands so this can live in `extra/`.
-- **Re-entry trigger:** the `--extra-tests` opt-in flag (linked IDEA) OR another recipe enrollment
+(b) the `--extra` flag lands so this can live in `extra/`.
+- **Re-entry trigger:** the `--extra` opt-in flag (linked IDEA) OR another recipe enrollment
 that requires Socket.IO client primitives in the harness (whichever comes first).
-- **Linked IDEA:** `cc-ci-plan/IDEAS.md`*Optional `--extra-tests` flag for heavy/operational tests*.
+- **Linked IDEA:** `cc-ci-plan/IDEAS.md`*Optional `--extra` flag for heavy/operational tests*.

 ### 2026-05-28 — ghost create-a-post round-trip (§4.3 prescribed)
 - [ ] **What:** Add `tests/ghost/functional/test_post_roundtrip.py` exercising Ghost's admin setup
@@ -133,11 +133,11 @@ before the build is called done) — but does **not** force closure.
 run-scoped) + JWT token management for the admin API. The current 3 tests
 (parity health + content_api + admin_redirect) cover the same Ghost-server / API / admin-route
 surface; the create-post flow is the natural §4.3 deeper test and is doable, but adds setup
-state to manage. Reasonable to defer to the `--extra-tests` flag rollout OR a Phase-2
+state to manage. Reasonable to defer to the `--extra` flag rollout OR a Phase-2
 follow-up specifically for Q4 deeper tests.
-- **Re-entry trigger:** the `--extra-tests` opt-in flag (linked IDEA) OR a Q4 deeper-test pass
+- **Re-entry trigger:** the `--extra` opt-in flag (linked IDEA) OR a Q4 deeper-test pass
 before Phase-2 DONE if the Adversary calls for it (Phase-4 cleanup pass MUST review).
-- **Linked IDEA:** `cc-ci-plan/IDEAS.md`*Optional `--extra-tests` flag for heavy/operational tests*.
+- **Linked IDEA:** `cc-ci-plan/IDEAS.md`*Optional `--extra` flag for heavy/operational tests*.

 ### 2026-05-28 — Q2.2 authentik enrollment + `setup_authentik_realm` SSO backend
 - [ ] **What:** Enroll authentik in cc-ci tests/ (mirror-and-enroll if not yet mirrored) + add a

@@ -214,17 +214,22 @@ def assert_restore_healthy(domain: str, meta: dict) -> None:
 # ---- Op primitives (orchestrator-only; perform the op once, never assert) --------------------


-def perform_upgrade(domain: str, recipe: str, head_ref: str | None) -> dict[str, str | None]:
+def perform_upgrade(
+    domain: str, recipe: str, head_ref: str | None, deploy_timeout: int = 900
+) -> dict[str, str | None]:
     """Perform the UPGRADE op once, in place, to the PR-HEAD code under test (HC1): re-checkout the
     PR head (the prev-tag base deploy reset the recipe working tree), then `abra app deploy --chaos`
     to redeploy the running app at that checkout. This is the real upgrade the PR's changes are
     exercised by (vs the old 'upgrade to newest published tag', which never deployed PR-head code).
     Returns the pre-upgrade identity so the orchestrator records it for `assert_upgraded`'s move check
-    — after the chaos deploy the `chaos`(-version) label carries the PR-head commit, proving it."""
+    — after the chaos deploy the `chaos`(-version) label carries the PR-head commit, proving it.
+
+    `deploy_timeout` (recipe DEPLOY_TIMEOUT) is plumbed to the chaos redeploy so a heavy stack's
+    reconverge isn't SIGKILLed by abra.deploy's 900s default mid-wait."""
     before = lifecycle.deployed_identity(domain)
     if head_ref:
         lifecycle.recipe_checkout_ref(recipe, head_ref)
-    lifecycle.chaos_redeploy(domain)
+    lifecycle.chaos_redeploy(domain, deploy_timeout=deploy_timeout)
     after = lifecycle.deployed_identity(domain)
     # Evidence (HC1): the chaos-version label = the deployed recipe commit; it should match the
     # PR-head we checked out — proving the upgrade deployed the code under test, not a published tag.

@@ -316,11 +316,16 @@ def recipe_checkout_ref(recipe: str, ref: str) -> None:
     abra.recipe_checkout(recipe, ref)


-def chaos_redeploy(domain: str) -> None:
+def chaos_redeploy(domain: str, deploy_timeout: int = 900) -> None:
     """In-place `abra app deploy --chaos`: redeploy the running app at the CURRENT recipe checkout
     (HC1: the PR-head code under test). This is the upgrade op, not a fresh install — it does NOT go
-    through deploy_app, so the deploy-count guard (DG4.1) is not incremented."""
-    abra.deploy(domain, chaos=True)
+    through deploy_app, so the deploy-count guard (DG4.1) is not incremented.
+
+    `deploy_timeout` is the abra subprocess wrapper timeout; pass the recipe's DEPLOY_TIMEOUT so a
+    heavy stack's reconverge (e.g. lasuite-drive's slow collabora/onlyoffice boot) isn't SIGKILLed
+    by the 900s default while abra is still legitimately waiting (its internal TIMEOUT can be larger
+    via the .env). Mirrors the install deploy_app timeout plumbing."""
+    abra.deploy(domain, chaos=True, timeout=deploy_timeout)


 def backup_app(domain: str) -> str:

@@ -239,13 +239,16 @@ def _run_pre_hook(recipe: str, op: str, repo_local: str | None, domain: str, met
         sys.path.remove(d)


-def _perform_op(op: str, domain: str, recipe: str, head_ref: str | None, op_state: dict) -> None:
+def _perform_op(
+    op: str, domain: str, recipe: str, head_ref: str | None, op_state: dict, deploy_timeout: int = 900
+) -> None:
     """Perform the single mutating op ONCE (the harness owns the op, HC3). install has no op. Records
     what the assertions need (pre-upgrade identity, backup snapshot_id) into op_state. None of these
     call deploy_app, so the deploy-count guard (DG4.1) stays 1 — the in-place chaos upgrade is not a
-    new install (HC1 reconciliation)."""
+    new install (HC1 reconciliation). `deploy_timeout` (recipe DEPLOY_TIMEOUT) is plumbed to the
+    upgrade chaos redeploy so a heavy reconverge isn't SIGKILLed by the 900s default mid-wait."""
     if op == "upgrade":
-        before = generic.perform_upgrade(domain, recipe, head_ref)
+        before = generic.perform_upgrade(domain, recipe, head_ref, deploy_timeout=deploy_timeout)
         op_state["upgrade"] = {"before": before, "head_ref": head_ref}
     elif op == "backup":
         op_state["backup"] = {"snapshot_id": generic.perform_backup(domain)}
@@ -287,7 +290,7 @@ def run_lifecycle_tier(
     # 1) pre-op seed hook + 2) the op ONCE (harness-owned). A failure here is an op failure → tier fail.
     try:
         _run_pre_hook(recipe, op, repo_local, domain, meta)
-        _perform_op(op, domain, recipe, head_ref, op_state)
+        _perform_op(op, domain, recipe, head_ref, op_state, deploy_timeout=int(meta.get("DEPLOY_TIMEOUT", 900)))
         with open(os.environ["CCCI_OP_STATE_FILE"], "w") as f:
             json.dump(op_state, f)
     except Exception as e:  # noqa: BLE001 — a failed op is a reported tier failure, not a crash

@@ -6,11 +6,35 @@ backupbot-labelled)."""

 import os
 import sys
+import time

 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
 from harness import lifecycle  # noqa: E402


+def _wait_collabora_ready(domain, timeout=420):
+    """Gate the upgrade op on collabora being FULLY ready (WOPI discovery endpoint → 200), not just
+    container 1/1 'running'. coolwsd takes ~2min to boot (pre-reads 1300+ l10n files + RSA keygen);
+    the install wait_healthy returns on container 1/1 while coolwsd is still loading. An in-place
+    `abra app deploy --chaos` upgrade that lands on a still-booting collabora SIGTERMs it mid-init
+    ("Shutdown requested while starting up", forced exit 70) → abra aborts the deploy (Q3.2a run 1,
+    JOURNAL 2026-05-29). Waiting for discovery=200 first makes the redeploy replace a ready collabora
+    cleanly. collabora routes on the COLLABORA_DOMAIN sibling (collabora-<domain>); /hosting/discovery
+    is the WOPI discovery endpoint celery's configure_wopi calls."""
+    host = f"collabora-{domain}"
+    deadline = time.time() + timeout
+    last = 0
+    while time.time() < deadline:
+        last = lifecycle.http_get(host, "/hosting/discovery", timeout=15)
+        if last == 200:
+            print(f" pre_upgrade: collabora WOPI discovery ready (200) on {host}", flush=True)
+            return
+        time.sleep(5)
+    raise AssertionError(
+        f"collabora WOPI discovery not ready on {host} (last status {last}) within {timeout}s"
+    )
+
+
 def _psql(domain, sql):
     cmd = f'PGPASSWORD=$(cat /run/secrets/postgres_p) psql -U drive -d drive -tAc "{sql}"'
     return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
@@ -26,6 +50,9 @@ def _seed(domain, value):


 def pre_upgrade(domain, meta):
+    # Gate the chaos redeploy on a fully-ready collabora (else it kills a still-booting coolwsd and
+    # abra aborts the upgrade deploy — Q3.2a run 1). Then seed the data-integrity marker.
+    _wait_collabora_ready(domain)
     _seed(domain, "upgrade-survives")
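
The collabora readiness gate in this diff follows a general poll-until-ready pattern. Below is a minimal standalone sketch of that pattern, with the probe, clock, and sleep injected so it can be exercised without a network. `wait_until_ready`, `FakeClock`, and their parameters are hypothetical and not part of the harness. The sketch defaults to `time.monotonic` rather than the `time.time` used in the commit, so the deadline is unaffected by wall-clock adjustments; that swap is a suggestion, not what the commit does.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], int],
    timeout: float = 420,
    interval: float = 5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll probe() until it returns 200 or the deadline passes.

    Returns the number of probes made. Raises AssertionError carrying the
    last status seen if the deadline is reached first.
    """
    deadline = clock() + timeout
    attempts = 0
    last = 0
    while clock() < deadline:
        last = probe()
        attempts += 1
        if last == 200:
            return attempts
        sleep(interval)
    raise AssertionError(f"not ready (last status {last}) within {timeout}s")


class FakeClock:
    """Deterministic clock for tests: sleeping only advances the stored time."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds
```

In a real gate the probe would wrap an HTTP GET against the discovery endpoint and return its status code; injecting it keeps the loop logic testable on its own.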