diff --git a/machine-docs/DECISIONS.md b/machine-docs/DECISIONS.md
index 4e0b7f9..172133c 100644
--- a/machine-docs/DECISIONS.md
+++ b/machine-docs/DECISIONS.md
@@ -480,3 +480,44 @@ SPA's main menu via a stable accessibility tree (role-based selectors instead of
 Adversary may file F2-N requesting full create-pad coverage; the answer above is the honest technical reason + the maximal subset. Logged here per plan §7.1.
+
+---
+
+## Phase 2 — nested DOMAIN-derived subdomains flattened to single-label wildcard siblings
+
+**Decision (settled):** When an enrolled recipe routes additional services on **nested subdomains
+derived from `DOMAIN`** (e.g. lasuite-drive `MINIO_DOMAIN="minio.${DOMAIN}"` +
+`COLLABORA_DOMAIN="collabora.${DOMAIN}"`; lasuite-meet `LIVEKIT_DOMAIN="livekit.${DOMAIN}"`), the
+recipe's `recipe_meta.EXTRA_ENV(domain)` MUST override those vars to a **single-label sibling under
+the wildcard** — `minio-`, `collabora-`, `livekit-` — NOT the recipe's
+default `.`.
+
+**Why:** cc-ci's TLS cert is the operator's pre-issued wildcard `*.ci.commoninternet.net` (+ bare
+`ci.commoninternet.net`) — §4.0/§1.5, renewed out-of-band, no ACME. A wildcard matches exactly **one**
+label. The per-run app domain is already one label (`lasuite-drive-pr-.ci.commoninternet.net`),
+so a nested `minio.lasuite-drive-pr-.ci.commoninternet.net` is a **2-label** name the wildcard
+does NOT cover → Traefik would serve an invalid cert on that router and the service is unreachable
+over HTTPS. Re-prefixing with a hyphen keeps it one label (`minio-lasuite-drive-pr-` +
+`.ci.commoninternet.net`), covered by the same wildcard, routed by Traefik's swarm provider with **no
+cert work and no gateway change** (the gateway already passes the whole wildcard, §4.0). We must NOT
+mint per-host certs / ACME for these (class-A1 boundary, §9).
+
+**Scope:** purely a per-recipe `EXTRA_ENV` concern (no shared-harness change). Recipes with no
+DOMAIN-derived nested subdomains (most) are unaffected.
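The single-label rule above can be illustrated with a minimal sketch. It assumes simplified wildcard matching (one leftmost `*` label, case-insensitive comparison) and does not reproduce full TLS hostname verification. `wildcard_covers` and `flatten_subdomain` are hypothetical helper names used only for illustration; they are not part of the harness. The example domain is the one used in the `recipe_meta.py` comment.

```python
def wildcard_covers(pattern: str, hostname: str) -> bool:
    """Return True if `pattern` (e.g. "*.ci.commoninternet.net") covers `hostname`.

    Simplified assumption: a leading "*" stands for exactly one DNS label.
    """
    pattern = pattern.lower()
    hostname = hostname.lower()
    if not pattern.startswith("*."):
        # Not a wildcard: only an exact match counts.
        return pattern == hostname
    suffix = pattern[1:]  # ".ci.commoninternet.net"
    if not hostname.endswith(suffix):
        return False
    label = hostname[: -len(suffix)]
    # Exactly one non-empty label may replace the "*".
    return bool(label) and "." not in label


def flatten_subdomain(service: str, domain: str) -> str:
    """Hypothetical helper: turn `<service>.<domain>` into a hyphenated sibling."""
    return f"{service}-{domain}"


WILDCARD = "*.ci.commoninternet.net"
APP_DOMAIN = "lasuite-drive-pr0-abc.ci.commoninternet.net"

nested = f"minio.{APP_DOMAIN}"                    # two labels under the wildcard
flattened = flatten_subdomain("minio", APP_DOMAIN)  # one label under the wildcard

print(wildcard_covers(WILDCARD, APP_DOMAIN))
print(wildcard_covers(WILDCARD, nested))
print(wildcard_covers(WILDCARD, flattened))
```

Under these assumptions the per-run app domain and the flattened `minio-` name are covered, while the nested `minio.` name is not, which is the reason the override lives in `EXTRA_ENV`.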
+
+## Phase 2 — `services_converged` treats a `replicas: 0` one-shot as converged
+
+**Decision (settled):** `runner/harness/lifecycle.py::services_converged` now considers a service
+converged when `cur == want` (desired replica count met), removing the prior
+`or want == "0"` rejection.
+
+**Why:** lasuite-drive's `minio-createbuckets` is declared `deploy: {mode: replicated, replicas: 0,
+restart_policy: {condition: none}}` — an **on-demand one-shot** (scaled up manually only when buckets
+need (re)creating; it `mc mb …` then `exit 0`). `docker stack services` reports it `0/0`. The old
+check rejected any `want == "0"` row, so the stack could **never** report converged → every deploy
+hung until `deploy_timeout`. A service AT its desired count (including 0/0) is converged; a service
+still spinning up shows `0/1` (`cur != want`) and is correctly not-yet-converged, so the HTTP
+readiness wait still gates real liveness. Safe for all currently-green recipes (their services are
+all N/N with N>0; the `0/0` case did not previously occur). Buckets/migrations that the one-shot
+performs are run on-demand in the recipe's `setup_custom_tests.sh` (post-deploy), not relied upon for
+generic-install convergence (the SPA at `/` serves 200 without them).
diff --git a/runner/harness/lifecycle.py b/runner/harness/lifecycle.py
index e79b1e9..e4dd4bb 100644
--- a/runner/harness/lifecycle.py
+++ b/runner/harness/lifecycle.py
@@ -181,7 +181,13 @@ def services_converged(domain: str) -> bool:
         return False
     for r in rows:
         cur, _, want = r.partition("/")
-        if not want or cur != want or want == "0":
+        # A service at its DESIRED replica count is converged — including a `replicas: 0`
+        # on-demand one-shot (e.g. lasuite-drive's `minio-createbuckets`, which is scaled up
+        # manually only when buckets need (re)creating), which reports "0/0". The earlier
+        # `want == "0"` rejection wrongly treated those as never-converged, hanging the deploy
+        # forever. `cur == want` (with `want` present) is the correct convergence test; a service
+        # still spinning up shows e.g. "0/1" (cur != want) and is correctly not-yet-converged.
+        if not want or cur != want:
             return False
     return True
diff --git a/tests/lasuite-drive/PARITY.md b/tests/lasuite-drive/PARITY.md
new file mode 100644
index 0000000..8f0dcd3
--- /dev/null
+++ b/tests/lasuite-drive/PARITY.md
@@ -0,0 +1,38 @@
+# Parity — lasuite-drive
+
+Phase-2 P2 mapping table. The Adversary cold-verifies parity by reading the source
+`recipe-info/lasuite-drive/tests/` and the cc-ci file side-by-side.
+
+**Enrollment status:** Q3.2 in progress. Base deploy + lifecycle (install/upgrade/backup/restore
+data-integrity) + parity health_check landed first (probe-before-assert: validate the ~10-service
+stack converges with the nested-subdomain flattening before layering SSO). The OIDC + WOPI + upload
+functional tests (which require the keycloak dep + post-deploy migrations + buckets) land in the SSO
+iteration once the base is cold-green. This file is updated as each row lands; nothing is a silent
+omission.
+
+| recipe-maintainer file | cc-ci file | what's verified | status |
+|---|---|---|---|
+| `recipe-info/lasuite-drive/tests/health_check.py` | `tests/lasuite-drive/functional/test_health_check.py` | App serves over HTTPS and returns 200/301/302 from `/`. Port preserves the assertion shape, adapted to the ephemeral per-run domain via `live_app`. | **ported** |
+| `recipe-info/lasuite-drive/tests/oidc_login.py` | `tests/lasuite-drive/functional/test_oidc_with_keycloak.py` (planned, SSO iteration) | Original: Drive `/api/v1.0/authenticate/` redirects to Keycloak → password-grant token → `/api/v1.0/users/me/` returns the user. cc-ci port deploys keycloak as a per-run dep (`DEPS=["keycloak"]`), wires OIDC env via `setup_custom_tests.sh`, exercises discovery + password grant + JWT claims (mirrors the proven lasuite-docs `test_oidc_with_keycloak`). | **pending (SSO iteration)** |
+| `recipe-info/lasuite-drive/tests/wopi_configured.py` | `tests/lasuite-drive/functional/test_wopi_configured.py` (planned) | Original: Collabora + OnlyOffice WOPI discovery endpoints return valid WOPI XML. cc-ci port checks the Collabora discovery XML over the flattened `collabora-` route (pure HTTP, no browser/SSO). | **pending** |
+| `recipe-info/lasuite-drive/tests/wopi_on_startup.py` | (see DECISIONS / DEFERRED) | Original: greps celery worker container logs for the entrypoint WOPI trigger. cc-ci port via `docker service logs` on the celery service. | **pending** |
+| `recipe-info/lasuite-drive/tests/celery_beat_wopi.py` | (likely DEFERRED — "thorough mode only") | Original sleeps 15–90s waiting for Celery Beat to fire; recipe-maintainer marks it "thorough mode only". Candidate for the `--extra-tests` opt-in (DEFERRED.md), like the matrix-synapse operational ports. | **likely deferred** |
+
+## Recipe-specific tests (Phase-2 P3, ≥2 beyond parity) — planned for SSO iteration
+
+| cc-ci file (planned) | what's verified | rationale |
+|---|---|---|
+| `functional/test_upload_file.py` | Authenticate via the dep keycloak (password grant) → create a workspace/item via Drive's API → upload a file (presigned PUT to the flattened `minio-` S3 route) → list/download it back, asserting the bytes round-trip. The §4.3-prescribed create-an-object + read-it-back. | Drive's defining behavior is object storage; proves the S3/MinIO path end-to-end (the flattened MINIO_DOMAIN route + bucket created by the one-shot). |
+| `functional/test_wopi_configured.py` | Collabora WOPI discovery XML is served + valid (a distinctive Drive feature: in-browser office editing). | Beyond health: exercises the WOPI/office subsystem, the second characteristic feature. |
+
+## Backup data-integrity (P4) — landed
+
+Exercised by the Phase-1d/1e lifecycle overlays (`tests/lasuite-drive/{test_backup.py,test_restore.py,
+ops.py}`): a `ci_marker` row is seeded in postgres pre-backup, the table dropped pre-restore, and the
+restored DB asserted to match the pre-mutation `original`. Real seed→backup→mutate→restore→assert.
+
+## Non-ports / deferrals
+
+`celery_beat_wopi.py` is recipe-maintainer "thorough mode only" (sleeps up to 90s for a scheduler
+tick) — a candidate for the `--extra-tests` opt-in deferral (DEFERRED.md), consistent with the
+matrix-synapse operational-test deferrals. Confirmed/justified when the SSO iteration lands.
diff --git a/tests/lasuite-drive/functional/test_health_check.py b/tests/lasuite-drive/functional/test_health_check.py
new file mode 100644
index 0000000..0be8b14
--- /dev/null
+++ b/tests/lasuite-drive/functional/test_health_check.py
@@ -0,0 +1,30 @@
+"""lasuite-drive — parity port of recipe-maintainer's health_check.py (Phase 2 P2).
+
+SOURCE: references/recipe-maintainer/recipe-info/lasuite-drive/tests/health_check.py
+
+The original asserted HTTP 200 from `https://lasuite-drive.`. The cc-ci port
+preserves the assertion shape — non-error HTTP from the served root — adapted to the ephemeral
+per-run domain via the `live_app` fixture. Runs in the custom tier against the shared post-install
+live deployment.
+"""
+
+from __future__ import annotations
+
+import os
+import sys
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", "runner"))
+from harness import http as harness_http  # noqa: E402
+
+
+def test_lasuite_drive_returns_200(live_app):
+    """Parity with recipe-info/lasuite-drive/tests/health_check.py: HTTP 200 from `/`."""
+    url = f"https://{live_app}/"
+    # accept 200 (frontend SPA shell) — Drive serves the SPA at root unauthenticated;
+    # the SPA itself bootstraps via /api/v1.0/users/me/ which requires OIDC (separate test).
+    status, _ = harness_http.retry_http_get(
+        url, expect_status=(200, 301, 302), max_wait=60, interval=3
+    )
+    assert status in (200, 301, 302), (
+        f"lasuite-drive at {url} returned HTTP {status} (expected 200/301/302)"
+    )
diff --git a/tests/lasuite-drive/ops.py b/tests/lasuite-drive/ops.py
new file mode 100644
index 0000000..9f7945a
--- /dev/null
+++ b/tests/lasuite-drive/ops.py
@@ -0,0 +1,42 @@
+"""lasuite-drive — pre-op seed hooks (Phase 1e HC3). The orchestrator runs these BEFORE the op; the
+matching test_.py asserts post-op (assertion-only). The marker is a dedicated `ci_marker` row in
+postgres (independent of the app's Django migrations — CREATE TABLE IF NOT EXISTS), written via psql
+in the `db` service. The backup path exercises the recipe's pg_backup.sh DB-dump hook (postgres is
+backupbot-labelled)."""
+
+import os
+import sys
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
+from harness import lifecycle  # noqa: E402
+
+
+def _psql(domain, sql):
+    cmd = f'PGPASSWORD=$(cat /run/secrets/postgres_p) psql -U drive -d drive -tAc "{sql}"'
+    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
+
+
+def _seed(domain, value):
+    _psql(
+        domain,
+        "CREATE TABLE IF NOT EXISTS ci_marker(v text); DELETE FROM ci_marker; "
+        f"INSERT INTO ci_marker VALUES('{value}');",
+    )
+    assert _psql(domain, "SELECT v FROM ci_marker;") == value
+
+
+def pre_upgrade(domain, meta):
+    _seed(domain, "upgrade-survives")
+
+
+def pre_backup(domain, meta):
+    _seed(domain, "original")
+
+
+def pre_restore(domain, meta):
+    # drop the marker table (diverge from the backup) so a successful restore is observable
+    _psql(domain, "DROP TABLE ci_marker;")
+    assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
+        "",
+        "NULL",
+    ), "drop did not take"
diff --git a/tests/lasuite-drive/recipe_meta.py b/tests/lasuite-drive/recipe_meta.py
new file mode 100644
index 0000000..16e9b92
--- /dev/null
+++ b/tests/lasuite-drive/recipe_meta.py
@@ -0,0 +1,37 @@
+# Per-recipe harness config for lasuite-drive (Phase 2 Q3.2 — multi-service + object-storage/S3 +
+# WOPI office, OIDC-dependent). Sibling of lasuite-docs (same La Suite / impress lineage).
+#
+# Stack: app(frontend SPA) + backend(Django/drive) + celery + celery-beat + db(postgres) + redis +
+# mailcatcher + minio(S3) + minio-createbuckets(one-shot) + collabora(WOPI office). ~10 services →
+# generous timeouts.
+#
+# Health: the React SPA is served at `/` by the `app` service and returns 200 unauthenticated
+# (login is OIDC-gated, exercised by the SSO functional tests, not by the install health check).
+HEALTH_PATH = "/"
+HEALTH_OK = (200, 301, 302)
+DEPLOY_TIMEOUT = 1200
+HTTP_TIMEOUT = 900
+
+# NOTE (Phase 2 Q3.2): the keycloak SSO dep + OIDC functional tests land in the SSO iteration once
+# the base deploy/lifecycle is cold-green. Declaring DEPS triggers the orchestrator's
+# setup_custom_tests step (deploy keycloak + wire OIDC), so it stays OFF until the base is proven:
+# DEPS = ["keycloak"]
+
+
+def EXTRA_ENV(domain):
+    # Two of lasuite-drive's services route on DOMAIN-DERIVED **nested** subdomains —
+    # `MINIO_DOMAIN="minio.${DOMAIN}"` and `COLLABORA_DOMAIN="collabora.${DOMAIN}"`. The cc-ci
+    # wildcard TLS cert is `*.ci.commoninternet.net` (single label only), so a 2-label name like
+    # `minio.lasuite-drive-pr0-abc.ci.commoninternet.net` is NOT covered → TLS failure on those
+    # routers. Flatten each to a single-label SIBLING under the wildcard (`minio-`,
+    # `collabora-`) so the existing wildcard cert covers them and Traefik routes them with
+    # no cert/gateway change. See DECISIONS.md "Phase 2 — nested DOMAIN-derived subdomains".
+    # `AWS_S3_DOMAIN_REPLACE` derives from MINIO_DOMAIN in-compose, so setting MINIO_DOMAIN is enough.
+    return {
+        "MINIO_DOMAIN": f"minio-{domain}",
+        "COLLABORA_DOMAIN": f"collabora-{domain}",
+        # abra's internal per-deploy convergence timeout (recipe TIMEOUT env, default 300s) is too
+        # short for this ~10-service stack on a cold image cache (impress frontend/backend, minio,
+        # postgres, redis, collabora ~1GB). Bump so abra waits long enough for convergence.
+        "TIMEOUT": "900",
+    }
diff --git a/tests/lasuite-drive/test_backup.py b/tests/lasuite-drive/test_backup.py
new file mode 100644
index 0000000..932729f
--- /dev/null
+++ b/tests/lasuite-drive/test_backup.py
@@ -0,0 +1,23 @@
+"""lasuite-drive — BACKUP overlay (Phase 1e HC3): assertion-only + additive.
+
+ops.pre_backup wrote "original" into postgres before the backup op (pg_backup.sh dumps the DB); the
+orchestrator performed the backup once (generic tier asserted a snapshot artifact). This overlay
+ADDS: the seeded row is intact at backup time. The backup→restore divergence (dropping the table) is
+in ops.pre_restore."""
+
+import os
+import sys
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
+from harness import lifecycle  # noqa: E402
+
+
+def _psql(domain, sql):
+    cmd = f'PGPASSWORD=$(cat /run/secrets/postgres_p) psql -U drive -d drive -tAc "{sql}"'
+    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
+
+
+def test_backup_captures_state(live_app):
+    assert (
+        _psql(live_app, "SELECT v FROM ci_marker;") == "original"
+    ), "the seeded postgres state was not present at backup time"
diff --git a/tests/lasuite-drive/test_install.py b/tests/lasuite-drive/test_install.py
new file mode 100644
index 0000000..6115bff
--- /dev/null
+++ b/tests/lasuite-drive/test_install.py
@@ -0,0 +1,44 @@
+"""lasuite-drive — INSTALL overlay (Phase 1d, DG4): override + extend-by-composition.
+
+Reuses the generic "really serving" assertion, then ADDS the recipe-specific checks: the
+multi-service stack serves over real HTTPS through the gateway, and a real browser loads the live
+Drive frontend (the SPA shell). Login is OIDC-gated (the SSO flow is exercised by the functional
+tests), so the install assertion is that the frontend SPA is served (unauthenticated landing), not
+an authenticated flow. Assertion-only on the shared deployment."""
+
+import os
+import sys
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
+from harness import browser as harness_browser, generic, lifecycle  # noqa: E402
+
+
+def test_serving_and_frontend(live_app, meta):
+    # extend-by-composition: reuse the generic "really serving" assertion first ...
+    generic.assert_serving(live_app, meta)
+
+    # ... then the recipe-specific assertions.
+    status = lifecycle.http_get(live_app, "/")
+    assert status in (200, 301, 302), f"expected 2xx/3xx from {live_app}, got {status}"
+
+    # A real browser loads the live Drive frontend (the SPA shell) over HTTPS.
+    from playwright.sync_api import sync_playwright
+
+    url = f"https://{live_app}/"
+    with sync_playwright() as p:
+        browser = p.chromium.launch(args=["--no-sandbox"])
+        try:
+            ctx = browser.new_context(ignore_https_errors=True)
+            page = ctx.new_page()
+            # F2-3 hardening centralized in harness.browser
+            resp = harness_browser.goto_with_retry(
+                page, url, accept_statuses=(200, 301, 302), goto_timeout_ms=60_000
+            )
+            assert resp is not None and resp.status in (
+                200,
+                301,
+                302,
+            ), f"page status {resp and resp.status}"
+            assert "