feat(2): Q3.2 lasuite-drive base enrollment + nested-subdomain + replicas:0 harness fixes

- harness: services_converged treats replicas:0 one-shot (minio-createbuckets) as
  converged (cur==want); removes the want==0 rejection that hung deploys. DECISIONS.md.
- recipe_meta.EXTRA_ENV flattens MINIO_DOMAIN/COLLABORA_DOMAIN to single-label wildcard
  siblings (the *.ci.commoninternet.net cert covers one label only). DECISIONS.md.
- lifecycle overlays (install/upgrade/backup/restore) + ops.py postgres ci_marker
  data-integrity (db user/name=drive). Parity health_check functional test. PARITY.md.
- DEPS=[keycloak] + OIDC/WOPI/upload functional tests deferred to the SSO iteration
  (probe-before-assert: prove the ~10-service base deploy converges first).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 19:54:31 +01:00
parent 9aa045de86
commit f59d8e6996
10 changed files with 306 additions and 1 deletions


@ -480,3 +480,44 @@ SPA's main menu via a stable accessibility tree (role-based selectors instead of
Adversary may file F2-N requesting full create-pad coverage; the answer above is the
honest technical reason + the maximal subset. Logged here per plan §7.1.
---
## Phase 2 — nested DOMAIN-derived subdomains flattened to single-label wildcard siblings
**Decision (settled):** When an enrolled recipe routes additional services on **nested subdomains
derived from `DOMAIN`** (e.g. lasuite-drive `MINIO_DOMAIN="minio.${DOMAIN}"` +
`COLLABORA_DOMAIN="collabora.${DOMAIN}"`; lasuite-meet `LIVEKIT_DOMAIN="livekit.${DOMAIN}"`), the
recipe's `recipe_meta.EXTRA_ENV(domain)` MUST override those vars to a **single-label sibling under
the wildcard** — `minio-<domain>`, `collabora-<domain>`, `livekit-<domain>` — NOT the recipe's
default `<svc>.<domain>`.
**Why:** cc-ci's TLS cert is the operator's pre-issued wildcard `*.ci.commoninternet.net` (+ bare
`ci.commoninternet.net`) — §4.0/§1.5, renewed out-of-band, no ACME. A wildcard matches exactly **one**
label. The per-run app domain is already one label (`lasuite-drive-pr<n>-<sha>.ci.commoninternet.net`),
so a nested `minio.lasuite-drive-pr<n>-<sha>.ci.commoninternet.net` is a **2-label** name the wildcard
does NOT cover → Traefik would serve an invalid cert on that router and the service is unreachable
over HTTPS. Re-prefixing with a hyphen keeps it one label (`minio-lasuite-drive-pr<n>-<sha>` +
`.ci.commoninternet.net`), covered by the same wildcard, routed by Traefik's swarm provider with **no
cert work and no gateway change** (the gateway already passes the whole wildcard, §4.0). We must NOT
mint per-host certs / ACME for these (class-A1 boundary, §9).
**Scope:** purely a per-recipe `EXTRA_ENV` concern (no shared-harness change). Recipes with no
DOMAIN-derived nested subdomains (most) are unaffected.
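
The single-label rule above can be checked mechanically. The following is a minimal sketch, not harness code: `wildcard_covers` and `flatten` are hypothetical helper names introduced only for illustration, and the example domain reuses the `lasuite-drive-pr0-abc` placeholder from `recipe_meta`.

```python
def wildcard_covers(host: str, wildcard: str = "*.ci.commoninternet.net") -> bool:
    """Return True if `host` is matched by a single-label wildcard name.

    Hypothetical helper: a wildcard stands for exactly one DNS label, so the
    part of `host` in front of the wildcard's suffix must be non-empty and
    must not contain a dot.
    """
    suffix = wildcard[1:]  # ".ci.commoninternet.net"
    if not host.endswith(suffix):
        return False
    label = host[: -len(suffix)]
    return bool(label) and "." not in label


def flatten(service: str, domain: str) -> str:
    """Hypothetical helper: re-prefix with a hyphen so the name stays one label."""
    return f"{service}-{domain}"


app_domain = "lasuite-drive-pr0-abc.ci.commoninternet.net"
nested = f"minio.{app_domain}"          # the recipe's default shape
flattened = flatten("minio", app_domain)  # the EXTRA_ENV override shape

print(wildcard_covers(app_domain))  # the per-run app domain is one label
print(wildcard_covers(nested))      # two labels in front of the suffix
print(wildcard_covers(flattened))   # one label again
```

Under these assumptions the nested name is rejected and the flattened sibling is accepted, which is the property the `EXTRA_ENV` override relies on. The bare `ci.commoninternet.net` name is not matched by this check; per the text above it is covered separately on the cert.
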
## Phase 2 — `services_converged` treats a `replicas: 0` one-shot as converged
**Decision (settled):** `runner/harness/lifecycle.py::services_converged` now considers a service
converged when `cur == want` (desired replica count met), removing the prior
`or want == "0"` rejection.
**Why:** lasuite-drive's `minio-createbuckets` is declared `deploy: {mode: replicated, replicas: 0,
restart_policy: {condition: none}}` — an **on-demand one-shot** (scaled up manually only when buckets
need (re)creating; it `mc mb …` then `exit 0`). `docker stack services` reports it `0/0`. The old
check rejected any `want == "0"` row, so the stack could **never** report converged → every deploy
hung until `deploy_timeout`. A service AT its desired count (including 0/0) is converged; a service
still spinning up shows `0/1` (`cur != want`) and is correctly not-yet-converged, so the HTTP
readiness wait still gates real liveness. Safe for all currently-green recipes (their services are
all N/N with N>0; the `0/0` case did not previously occur). Buckets/migrations that the one-shot
performs are run on-demand in the recipe's `setup_custom_tests.sh` (post-deploy), not relied upon for
generic-install convergence (the SPA at `/` serves 200 without them).
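
To make the behavioral change concrete, here is a small sketch comparing the old and new predicates on replica strings such as `0/0`, `1/1` and `0/1`. It is an illustration under assumptions: `converged_old` and `converged_new` are hypothetical names, they take the replica column as a plain list of strings, and treating an empty list as not converged is an assumption rather than something this section states.

```python
def converged_old(rows: list[str]) -> bool:
    """Old rule (sketch): any row whose desired count is 0 is rejected."""
    if not rows:  # assumption: no rows means nothing has been reported yet
        return False
    for r in rows:
        cur, _, want = r.partition("/")
        if not want or cur != want or want == "0":
            return False
    return True


def converged_new(rows: list[str]) -> bool:
    """New rule (sketch): a row is converged when current == desired."""
    if not rows:  # same assumption as above
        return False
    for r in rows:
        cur, _, want = r.partition("/")
        if not want or cur != want:
            return False
    return True


stack_with_one_shot = ["1/1", "1/1", "0/0"]  # e.g. a replicas: 0 one-shot at 0/0
stack_still_starting = ["1/1", "0/1"]        # a service still spinning up

print(converged_old(stack_with_one_shot), converged_new(stack_with_one_shot))
print(converged_old(stack_still_starting), converged_new(stack_still_starting))
```

Only the `0/0` row changes outcome between the two rules; a `0/1` row is rejected by both, so a service that is still starting continues to block convergence.
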


@ -181,7 +181,13 @@ def services_converged(domain: str) -> bool:
         return False
     for r in rows:
         cur, _, want = r.partition("/")
-        if not want or cur != want or want == "0":
+        # A service at its DESIRED replica count is converged, including a `replicas: 0`
+        # on-demand one-shot (e.g. lasuite-drive's `minio-createbuckets`, which is scaled up
+        # manually only when buckets need (re)creating), which reports "0/0". The earlier
+        # `want == "0"` rejection wrongly treated those as never-converged, hanging the deploy
+        # forever. `cur == want` (with `want` present) is the correct convergence test; a service
+        # still spinning up shows e.g. "0/1" (cur != want) and is correctly not-yet-converged.
+        if not want or cur != want:
             return False
     return True


@ -0,0 +1,38 @@
# Parity — lasuite-drive
Phase-2 P2 mapping table. The Adversary cold-verifies parity by reading the source
`recipe-info/lasuite-drive/tests/<file>` and the cc-ci file side-by-side.
**Enrollment status:** Q3.2 in progress. Base deploy + lifecycle (install/upgrade/backup/restore
data-integrity) + parity health_check landed first (probe-before-assert: validate the ~10-service
stack converges with the nested-subdomain flattening before layering SSO). The OIDC + WOPI + upload
functional tests (which require the keycloak dep + post-deploy migrations + buckets) land in the SSO
iteration once the base is cold-green. This file is updated as each row lands; nothing is a silent
omission.
| recipe-maintainer file | cc-ci file | what's verified | status |
|---|---|---|---|
| `recipe-info/lasuite-drive/tests/health_check.py` | `tests/lasuite-drive/functional/test_health_check.py` | App serves over HTTPS and returns 200/301/302 from `/`. Port preserves the assertion shape, adapted to the ephemeral per-run domain via `live_app`. | **ported** |
| `recipe-info/lasuite-drive/tests/oidc_login.py` | `tests/lasuite-drive/functional/test_oidc_with_keycloak.py` (planned, SSO iteration) | Original: Drive `/api/v1.0/authenticate/` redirects to Keycloak → password-grant token → `/api/v1.0/users/me/` returns the user. cc-ci port deploys keycloak as a per-run dep (`DEPS=["keycloak"]`), wires OIDC env via `setup_custom_tests.sh`, exercises discovery + password grant + JWT claims (mirrors the proven lasuite-docs `test_oidc_with_keycloak`). | **pending (SSO iteration)** |
| `recipe-info/lasuite-drive/tests/wopi_configured.py` | `tests/lasuite-drive/functional/test_wopi_configured.py` (planned) | Original: Collabora + OnlyOffice WOPI discovery endpoints return valid WOPI XML. cc-ci port checks the Collabora discovery XML over the flattened `collabora-<domain>` route (pure HTTP, no browser/SSO). | **pending** |
| `recipe-info/lasuite-drive/tests/wopi_on_startup.py` | (see DECISIONS / DEFERRED) | Original: greps celery worker container logs for the entrypoint WOPI trigger. cc-ci port via `docker service logs` on the celery service. | **pending** |
| `recipe-info/lasuite-drive/tests/celery_beat_wopi.py` | (likely DEFERRED — "thorough mode only") | Original sleeps up to 90s waiting for Celery Beat to fire; recipe-maintainer marks it "thorough mode only". Candidate for the `--extra-tests` opt-in (DEFERRED.md), like the matrix-synapse operational ports. | **likely deferred** |
## Recipe-specific tests (Phase-2 P3, ≥2 beyond parity) — planned for SSO iteration
| cc-ci file (planned) | what's verified | rationale |
|---|---|---|
| `functional/test_upload_file.py` | Authenticate via the dep keycloak (password grant) → create a workspace/item via Drive's API → upload a file (presigned PUT to the flattened `minio-<domain>` S3 route) → list/download it back, asserting the bytes round-trip. The §4.3-prescribed create-an-object + read-it-back. | Drive's defining behavior is object storage; proves the S3/MinIO path end-to-end (the flattened MINIO_DOMAIN route + bucket created by the one-shot). |
| `functional/test_wopi_configured.py` | Collabora WOPI discovery XML is served + valid (a distinctive Drive feature: in-browser office editing). | Beyond health: exercises the WOPI/office subsystem, the second characteristic feature. |
## Backup data-integrity (P4) — landed
Exercised by the Phase-1d/1e lifecycle overlays (`tests/lasuite-drive/{test_backup.py,test_restore.py,
ops.py}`): a `ci_marker` row is seeded in postgres pre-backup, the table dropped pre-restore, and the
restored DB asserted to match the pre-mutation `original`. Real seed→backup→mutate→restore→assert.
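
The seed→backup→mutate→restore→assert shape can be sketched in isolation. The example below is only an illustration of the pattern: it uses an in-memory SQLite database and `iterdump()` as a stand-in for postgres and the recipe's dump hook, and the helper names `seed` and `read_marker` are hypothetical. It does not reflect how the harness talks to the `db` service.

```python
import sqlite3


def seed(conn: sqlite3.Connection, value: str) -> None:
    """Write a single marker row, replacing any earlier one."""
    conn.execute("CREATE TABLE IF NOT EXISTS ci_marker(v text)")
    conn.execute("DELETE FROM ci_marker")
    conn.execute("INSERT INTO ci_marker VALUES (?)", (value,))
    conn.commit()


def read_marker(conn: sqlite3.Connection):
    """Return the marker value, or None if the table has no row."""
    row = conn.execute("SELECT v FROM ci_marker").fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")

seed(conn, "original")             # seed (pre-backup)
dump = "\n".join(conn.iterdump())  # backup: SQL text dump (stand-in for a DB dump)

conn.execute("DROP TABLE ci_marker")  # mutate (pre-restore): diverge from the backup
conn.commit()
table_gone = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' AND name='ci_marker'"
).fetchone() is None

conn.executescript(dump)           # restore: replay the dump
restored = read_marker(conn)       # assert step reads this back
print(table_gone, restored)
```

Dropping the table between backup and restore is what makes the final assertion meaningful: without the divergence, the marker would read `original` even if the restore did nothing.
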
## Non-ports / deferrals
`celery_beat_wopi.py` is recipe-maintainer "thorough mode only" (sleeps up to 90s for a scheduler
tick) — a candidate for the `--extra-tests` opt-in deferral (DEFERRED.md), consistent with the
matrix-synapse operational-test deferrals. Confirmed/justified when the SSO iteration lands.


@ -0,0 +1,30 @@
"""lasuite-drive — parity port of recipe-maintainer's health_check.py (Phase 2 P2).
SOURCE: references/recipe-maintainer/recipe-info/lasuite-drive/tests/health_check.py
The original asserted HTTP 200 from `https://lasuite-drive.<DOMAIN_SUFFIX>`. The cc-ci port
preserves the assertion shape — non-error HTTP from the served root — adapted to the ephemeral
per-run domain via the `live_app` fixture. Runs in the custom tier against the shared post-install
live deployment.
"""
from __future__ import annotations
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", "runner"))
from harness import http as harness_http # noqa: E402
def test_lasuite_drive_returns_200(live_app):
    """Parity with recipe-info/lasuite-drive/tests/health_check.py: non-error HTTP from `/`."""
    url = f"https://{live_app}/"
    # Accept 200 (frontend SPA shell) or a 301/302 redirect. Drive serves the SPA at root
    # unauthenticated; the SPA itself bootstraps via /api/v1.0/users/me/, which requires OIDC
    # (separate test).
    status, _ = harness_http.retry_http_get(
        url, expect_status=(200, 301, 302), max_wait=60, interval=3
    )
    assert status in (200, 301, 302), (
        f"lasuite-drive at {url} returned HTTP {status} (expected 200/301/302)"
    )


@ -0,0 +1,42 @@
"""lasuite-drive — pre-op seed hooks (Phase 1e HC3). The orchestrator runs these BEFORE the op; the
matching test_<op>.py asserts post-op (assertion-only). The marker is a dedicated `ci_marker` row in
postgres (independent of the app's Django migrations — CREATE TABLE IF NOT EXISTS), written via psql
in the `db` service. The backup path exercises the recipe's pg_backup.sh DB-dump hook (postgres is
backupbot-labelled)."""
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import lifecycle # noqa: E402
def _psql(domain, sql):
    cmd = f'PGPASSWORD=$(cat /run/secrets/postgres_p) psql -U drive -d drive -tAc "{sql}"'
    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()


def _seed(domain, value):
    _psql(
        domain,
        "CREATE TABLE IF NOT EXISTS ci_marker(v text); DELETE FROM ci_marker; "
        f"INSERT INTO ci_marker VALUES('{value}');",
    )
    assert _psql(domain, "SELECT v FROM ci_marker;") == value


def pre_upgrade(domain, meta):
    _seed(domain, "upgrade-survives")


def pre_backup(domain, meta):
    _seed(domain, "original")


def pre_restore(domain, meta):
    # drop the marker table (diverge from the backup) so a successful restore is observable
    _psql(domain, "DROP TABLE ci_marker;")
    assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
        "",
        "NULL",
    ), "drop did not take"


@ -0,0 +1,37 @@
# Per-recipe harness config for lasuite-drive (Phase 2 Q3.2 — multi-service + object-storage/S3 +
# WOPI office, OIDC-dependent). Sibling of lasuite-docs (same La Suite / impress lineage).
#
# Stack: app(frontend SPA) + backend(Django/drive) + celery + celery-beat + db(postgres) + redis +
# mailcatcher + minio(S3) + minio-createbuckets(one-shot) + collabora(WOPI office). ~10 services →
# generous timeouts.
#
# Health: the React SPA is served at `/` by the `app` service and returns 200 unauthenticated
# (login is OIDC-gated, exercised by the SSO functional tests, not by the install health check).
HEALTH_PATH = "/"
HEALTH_OK = (200, 301, 302)
DEPLOY_TIMEOUT = 1200
HTTP_TIMEOUT = 900
# NOTE (Phase 2 Q3.2): the keycloak SSO dep + OIDC functional tests land in the SSO iteration once
# the base deploy/lifecycle is cold-green. Declaring DEPS triggers the orchestrator's
# setup_custom_tests step (deploy keycloak + wire OIDC), so it stays OFF until the base is proven:
# DEPS = ["keycloak"]
def EXTRA_ENV(domain):
    # Two of lasuite-drive's services route on DOMAIN-DERIVED **nested** subdomains:
    # `MINIO_DOMAIN="minio.${DOMAIN}"` and `COLLABORA_DOMAIN="collabora.${DOMAIN}"`. The cc-ci
    # wildcard TLS cert is `*.ci.commoninternet.net` (single label only), so a 2-label name like
    # `minio.lasuite-drive-pr0-abc.ci.commoninternet.net` is NOT covered → TLS failure on those
    # routers. Flatten each to a single-label SIBLING under the wildcard (`minio-<domain>`,
    # `collabora-<domain>`) so the existing wildcard cert covers them and Traefik routes them with
    # no cert/gateway change. See DECISIONS.md "Phase 2 — nested DOMAIN-derived subdomains".
    # `AWS_S3_DOMAIN_REPLACE` derives from MINIO_DOMAIN in-compose, so setting MINIO_DOMAIN is enough.
    return {
        "MINIO_DOMAIN": f"minio-{domain}",
        "COLLABORA_DOMAIN": f"collabora-{domain}",
        # abra's internal per-deploy convergence timeout (recipe TIMEOUT env, default 300s) is too
        # short for this ~10-service stack on a cold image cache (impress frontend/backend, minio,
        # postgres, redis, collabora ~1GB). Bump so abra waits long enough for convergence.
        "TIMEOUT": "900",
    }


@ -0,0 +1,23 @@
"""lasuite-drive — BACKUP overlay (Phase 1e HC3): assertion-only + additive.
ops.pre_backup wrote "original" into postgres before the backup op (pg_backup.sh dumps the DB); the
orchestrator performed the backup once (generic tier asserted a snapshot artifact). This overlay
ADDS: the seeded row is intact at backup time. The backup→restore divergence (dropping the table) is
in ops.pre_restore."""
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import lifecycle # noqa: E402
def _psql(domain, sql):
    cmd = f'PGPASSWORD=$(cat /run/secrets/postgres_p) psql -U drive -d drive -tAc "{sql}"'
    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()


def test_backup_captures_state(live_app):
    assert (
        _psql(live_app, "SELECT v FROM ci_marker;") == "original"
    ), "the seeded postgres state was not present at backup time"


@ -0,0 +1,44 @@
"""lasuite-drive — INSTALL overlay (Phase 1d, DG4): override + extend-by-composition.
Reuses the generic "really serving" assertion, then ADDS the recipe-specific checks: the
multi-service stack serves over real HTTPS through the gateway, and a real browser loads the live
Drive frontend (the SPA shell). Login is OIDC-gated (the SSO flow is exercised by the functional
tests), so the install assertion is that the frontend SPA is served (unauthenticated landing), not
an authenticated flow. Assertion-only on the shared deployment."""
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import browser as harness_browser, generic, lifecycle # noqa: E402
def test_serving_and_frontend(live_app, meta):
    # extend-by-composition: reuse the generic "really serving" assertion first ...
    generic.assert_serving(live_app, meta)
    # ... then the recipe-specific assertions.
    status = lifecycle.http_get(live_app, "/")
    assert status in (200, 301, 302), f"expected 200/301/302 from {live_app}, got {status}"
    # A real browser loads the live Drive frontend (the SPA shell) over HTTPS.
    from playwright.sync_api import sync_playwright

    url = f"https://{live_app}/"
    with sync_playwright() as p:
        browser = p.chromium.launch(args=["--no-sandbox"])
        try:
            ctx = browser.new_context(ignore_https_errors=True)
            page = ctx.new_page()
            # F2-3 hardening centralized in harness.browser
            resp = harness_browser.goto_with_retry(
                page, url, accept_statuses=(200, 301, 302), goto_timeout_ms=60_000
            )
            assert resp is not None and resp.status in (
                200,
                301,
                302,
            ), f"page status {resp and resp.status}"
            assert "<html" in page.content().lower(), "no HTML served by the frontend"
        finally:
            browser.close()


@ -0,0 +1,22 @@
"""lasuite-drive — RESTORE overlay (Phase 1e HC3): data-integrity, assertion-only + additive.
ops.pre_restore dropped the marker table (diverge); the orchestrator restored once (generic tier
asserted healthy/serving; the recipe's restore reloads the dump). This overlay ADDS: the restored DB
matches the pre-mutation "original". Read via psql in the `db` service."""
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import lifecycle # noqa: E402
def _psql(domain, sql):
    cmd = f'PGPASSWORD=$(cat /run/secrets/postgres_p) psql -U drive -d drive -tAc "{sql}"'
    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()


def test_restore_returns_state(live_app):
    assert (
        _psql(live_app, "SELECT v FROM ci_marker;") == "original"
    ), "restore did not return the pre-mutation postgres state"


@ -0,0 +1,22 @@
"""lasuite-drive — UPGRADE overlay (Phase 1e HC3): data-continuity, assertion-only + additive.
ops.pre_upgrade wrote a postgres marker row before the upgrade; the orchestrator performed the
upgrade once (generic tier asserted reconverge/serving/moved). This overlay ADDS: the postgres data
survived. Read via psql in the `db` service."""
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import lifecycle # noqa: E402
def _psql(domain, sql):
    cmd = f'PGPASSWORD=$(cat /run/secrets/postgres_p) psql -U drive -d drive -tAc "{sql}"'
    return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()


def test_upgrade_preserves_data(live_app):
    assert (
        _psql(live_app, "SELECT v FROM ci_marker;") == "upgrade-survives"
    ), "postgres data did not survive the upgrade"