Adversary cold-verify FAILed Q3.2 (F2-12): the prev→PR-head chaos upgrade's abra converge
monitor FATAs while the NEW collabora 25.04.9.4.1's healthcheck is still in start_period
(jail/config init), even though it converges given swarm's healthcheck retries. My WOPI
pre-gate fixed the OLD collabora being killed mid-boot but not the NEW collabora's
convergence. Flaky (3x green for me, 1x fail cold).

Fix (cc-ci-side, stronger verification — not weaker):
- abra.deploy gains no_converge_checks (`-c`); chaos_redeploy passes it for the upgrade op
  so abra's impatient monitor no longer FATAs (the stack spec is applied regardless).
- perform_upgrade now OWNS the convergence verification after the redeploy: wait_healthy
  (services N/N + app HEALTH_PATH) + new lifecycle.wait_ready_probes (recipe READY_PROBE),
  bounded by the recipe DEPLOY_TIMEOUT (generous) not abra's impatient window. meta threaded
  _perform_op→perform_upgrade.
- recipe_meta READY_PROBE hook (added to _load_meta whitelist): lasuite-drive probes
  collabora WOPI discovery (/hosting/discovery on collabora-<domain>) → 200. Called after
  install deploy AND after the upgrade redeploy. No-op for recipes without a READY_PROBE.

NOT re-claiming yet — validating the upgrade tier is now reliably green (incl. the
slow-collabora crossover) across multiple runs before re-claiming Q3.2. F2-12 stays open
(Adversary-owned).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
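The readiness wait described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual `lifecycle.wait_ready_probes` implementation, whose signature is not shown here. The `probe_status` helper, the `fetch`/`clock`/`sleep` parameters, the polling interval, and the `https://` scheme are all assumptions made for the example; only the probe dict shape (`host`, `path`, `ok`) comes from the recipe's `READY_PROBE` hook.

```python
import time
import urllib.error
import urllib.request


def probe_status(url, timeout=10.0):
    """Return the HTTP status for url, or None if no response was received.

    Hypothetical helper: the real harness may fetch differently.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a status code (e.g. 404 during init).
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout: treat as "not ready yet".
        return None


def wait_ready_probes(probes, deadline_s, fetch=probe_status, interval_s=5.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll each probe until its status is in probe["ok"], or raise on timeout.

    An empty probe list returns immediately, which mirrors the "no-op for
    recipes without a READY_PROBE" behaviour described above.
    """
    pending = list(probes)
    deadline = clock() + deadline_s
    while pending:
        # Keep only the probes that are still not returning an accepted status.
        pending = [p for p in pending
                   if fetch(f"https://{p['host']}{p['path']}") not in p["ok"]]
        if not pending:
            break
        if clock() >= deadline:
            raise TimeoutError(f"ready probes still failing: {pending}")
        sleep(interval_s)
    return True
```

In this sketch the caller would pass the recipe's `DEPLOY_TIMEOUT` as `deadline_s`, so the wait is bounded by the recipe's own window rather than by abra's converge monitor. Injecting `fetch`, `clock`, and `sleep` keeps the loop testable without network access.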
66 lines
4.3 KiB
Python
# Per-recipe harness config for lasuite-drive (Phase 2 Q3.2 — multi-service + object-storage/S3 +
# WOPI office, OIDC-dependent). Sibling of lasuite-docs (same La Suite / impress lineage).
#
# Stack: app(frontend SPA) + backend(Django/drive) + celery + celery-beat + db(postgres) + redis +
# mailcatcher + minio(S3) + minio-createbuckets(one-shot) + collabora(WOPI office). ~10 services →
# generous timeouts.
#
# Health: the React SPA is served at `/` by the `app` service and returns 200 unauthenticated
# (login is OIDC-gated, exercised by the SSO functional tests, not by the install health check).
HEALTH_PATH = "/"
HEALTH_OK = (200, 301, 302)
# This is the heaviest stack in the Phase-2 set: 12 services incl. BOTH office backends
# (collabora/code ~1GB + onlyoffice/documentserver ~2GB) plus impress front/backend, postgres,
# minio, redis, nginx. Cold image pull + onlyoffice's multi-minute internal boot exceed the
# default abra TIMEOUT (300s) and even 900s, so allow a wide window (abra TIMEOUT below stays
# under DEPLOY_TIMEOUT so the Python subprocess never kills abra mid-wait).
DEPLOY_TIMEOUT = 1800
HTTP_TIMEOUT = 900

# Base deploy/lifecycle proven cold-green @2026-05-28 (install: pass; 12 services incl.
# onlyoffice+collabora) once the Docker Hub rate limit was fixed. The keycloak SSO dep is now
# enabled: declaring DEPS triggers the orchestrator's setup_custom_tests step (deploy keycloak +
# provision realm/client/user + run tests/lasuite-drive/setup_custom_tests.sh to wire OIDC env +
# in-place redeploy). functional/test_oidc_with_keycloak.py then exercises the SSO flow.
DEPS = ["keycloak"]

# Q3.2a (plan-lasuite-drive-oidc-robustness.md Part A): wire OIDC at INSTALL time, not via a
# post-deploy in-place `--chaos` redeploy. The orchestrator provisions the per-run realm on the
# live-warm keycloak BEFORE the single `abra app deploy`, and tests/lasuite-drive/install_steps.sh
# writes the OIDC env + client secret into the .env that one deploy reads. This eliminates the flaky
# 12-service reconverge (collabora WOPI-discovery race; JOURNAL Step 0). Drive boots fine with OIDC
# env set because keycloak is live-warm (discovery reachable at boot). setup_custom_tests.sh now
# only triggers the post-deploy MinIO bucket one-shot.
OIDC_AT_INSTALL = True


def READY_PROBE(domain):
    """Readiness signals beyond replica-convergence + the app HEALTH_PATH (Q3.2/F2-12). collabora's
    coolwsd reports its container 1/1 'running' while still doing jail/config init, and its WOPI
    discovery endpoint 404s until ready — so the harness waits for `/hosting/discovery` → 200 on the
    collabora sibling host after the install deploy AND after the upgrade chaos redeploy. This is what
    makes the heavy prev→PR-head crossover reliably green (the new collabora 25.04.9.x finishes init
    within swarm's healthcheck retries; abra's own converge monitor was too impatient — F2-12)."""
    label, _, rest = domain.partition(".")
    return [{"host": f"collabora-{domain}", "path": "/hosting/discovery", "ok": (200,)}]


def EXTRA_ENV(domain):
    # Two of lasuite-drive's services route on DOMAIN-DERIVED **nested** subdomains —
    # `MINIO_DOMAIN="minio.${DOMAIN}"` and `COLLABORA_DOMAIN="collabora.${DOMAIN}"`. The cc-ci
    # wildcard TLS cert is `*.ci.commoninternet.net` (single label only), so a 2-label name like
    # `minio.lasuite-drive-pr0-abc.ci.commoninternet.net` is NOT covered → TLS failure on those
    # routers. Flatten each to a single-label SIBLING under the wildcard (`minio-<domain>`,
    # `collabora-<domain>`) so the existing wildcard cert covers them and Traefik routes them with
    # no cert/gateway change. See DECISIONS.md "Phase 2 — nested DOMAIN-derived subdomains".
    # `AWS_S3_DOMAIN_REPLACE` derives from MINIO_DOMAIN in-compose, so setting MINIO_DOMAIN is enough.
    return {
        "MINIO_DOMAIN": f"minio-{domain}",
        "COLLABORA_DOMAIN": f"collabora-{domain}",
        # abra's internal per-deploy convergence timeout (recipe TIMEOUT env, default 300s) is too
        # short for this 12-service stack on a cold image cache (impress frontend/backend, minio,
        # postgres, redis, collabora ~1GB, onlyoffice ~2GB). Bump so abra waits long enough for
        # convergence; kept under DEPLOY_TIMEOUT (1800) so Python never kills abra mid-wait.
        "TIMEOUT": "1500",
    }
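The wildcard-coverage reasoning in the `EXTRA_ENV` comments can be checked with a small standalone sketch. The `covered_by_wildcard` helper is hypothetical and is not part of the harness; it only encodes the usual rule that the `*` in a wildcard certificate name stands for exactly one DNS label. The example domain is the one used in the comments above.

```python
def covered_by_wildcard(host, wildcard="*.ci.commoninternet.net"):
    """Return True if host matches wildcard with exactly one label replacing '*'.

    Hypothetical helper for illustration only; it does no real TLS validation.
    """
    suffix = wildcard[1:]  # ".ci.commoninternet.net"
    if not host.endswith(suffix):
        return False
    label = host[: -len(suffix)]
    # The part that replaces '*' must be non-empty and contain no further dots.
    return bool(label) and "." not in label


domain = "lasuite-drive-pr0-abc.ci.commoninternet.net"
nested = f"minio.{domain}"     # recipe default: two labels under the wildcard
flattened = f"minio-{domain}"  # EXTRA_ENV override: single-label sibling

for name in (domain, nested, flattened):
    print(name, covered_by_wildcard(name))
```

Under that rule the nested name is rejected and the flattened sibling is accepted, which is why the override needs no certificate or gateway change.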