feat(2): Q3.2 lasuite-drive base enrollment + nested-subdomain + replicas:0 harness fixes

- harness: services_converged treats replicas:0 one-shot (minio-createbuckets) as
  converged (cur==want); removes the want==0 rejection that hung deploys. DECISIONS.md.
- recipe_meta.EXTRA_ENV flattens MINIO_DOMAIN/COLLABORA_DOMAIN to single-label wildcard
  siblings (the *.ci.commoninternet.net cert covers one label only). DECISIONS.md.
- lifecycle overlays (install/upgrade/backup/restore) + ops.py postgres ci_marker
  data-integrity (db user/name=drive). Parity health_check functional test. PARITY.md.
- DEPS=[keycloak] + OIDC/WOPI/upload functional tests deferred to the SSO iteration
  (probe-before-assert: prove the ~10-service base deploy converges first).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 19:54:31 +01:00
parent 9aa045de86
commit f59d8e6996
10 changed files with 306 additions and 1 deletions


@@ -480,3 +480,44 @@ SPA's main menu via a stable accessibility tree (role-based selectors instead of
Adversary may file F2-N requesting full create-pad coverage; the answer above is the
honest technical reason + the maximal subset. Logged here per plan §7.1.
---
## Phase 2 — nested DOMAIN-derived subdomains flattened to single-label wildcard siblings
**Decision (settled):** When an enrolled recipe routes additional services on **nested subdomains
derived from `DOMAIN`** (e.g. lasuite-drive `MINIO_DOMAIN="minio.${DOMAIN}"` +
`COLLABORA_DOMAIN="collabora.${DOMAIN}"`; lasuite-meet `LIVEKIT_DOMAIN="livekit.${DOMAIN}"`), the
recipe's `recipe_meta.EXTRA_ENV(domain)` MUST override those vars to a **single-label sibling under
the wildcard** — `minio-<domain>`, `collabora-<domain>`, `livekit-<domain>` — NOT the recipe's
default `<svc>.<domain>`.
**Why:** cc-ci's TLS cert is the operator's pre-issued wildcard `*.ci.commoninternet.net` (+ bare
`ci.commoninternet.net`) — §4.0/§1.5, renewed out-of-band, no ACME. A wildcard matches exactly **one**
label. The per-run app domain is already one label (`lasuite-drive-pr<n>-<sha>.ci.commoninternet.net`),
so a nested `minio.lasuite-drive-pr<n>-<sha>.ci.commoninternet.net` is a **2-label** name the wildcard
does NOT cover → Traefik would serve an invalid cert on that router and the service would be unreachable
over HTTPS. Re-prefixing with a hyphen keeps it one label (`minio-lasuite-drive-pr<n>-<sha>` +
`.ci.commoninternet.net`), covered by the same wildcard, routed by Traefik's swarm provider with **no
cert work and no gateway change** (the gateway already passes the whole wildcard, §4.0). We must NOT
mint per-host certs / ACME for these (class-A1 boundary, §9).
**Scope:** purely a per-recipe `EXTRA_ENV` concern (no shared-harness change). Recipes with no
DOMAIN-derived nested subdomains (most) are unaffected.
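
**Sketch (illustrative, not the shipped `recipe_meta`):** a minimal Python example of the flattening rule and of why it satisfies the one-label wildcard constraint. The helpers `flatten_subdomain` and `wildcard_covers`, the `dict` return type of `EXTRA_ENV`, and the sample PR number and sha are assumptions made for illustration; only the variable names `MINIO_DOMAIN` / `COLLABORA_DOMAIN` and the `<svc>-<domain>` shape come from the decision above.

```python
def flatten_subdomain(service: str, domain: str) -> str:
    """Hypothetical helper: build the single-label sibling '<service>-<domain>'."""
    return f"{service}-{domain}"


def wildcard_covers(pattern: str, host: str) -> bool:
    """Hypothetical helper: True if `pattern` covers `host`.

    A '*.base' wildcard matches exactly one extra label to the left of 'base';
    a pattern without a wildcard must match the host exactly.
    """
    pattern, host = pattern.lower(), host.lower()
    if not pattern.startswith("*."):
        return pattern == host
    base = pattern[2:]
    if not host.endswith("." + base):
        return False
    left = host[: -(len(base) + 1)]  # the part the '*' has to match
    return left != "" and "." not in left


def EXTRA_ENV(domain: str) -> dict[str, str]:
    """Assumed shape: override the recipe's nested defaults with flattened names."""
    return {
        "MINIO_DOMAIN": flatten_subdomain("minio", domain),
        "COLLABORA_DOMAIN": flatten_subdomain("collabora", domain),
    }


if __name__ == "__main__":
    wildcard = "*.ci.commoninternet.net"
    # Hypothetical per-run domain; the PR number and sha are made up.
    domain = "lasuite-drive-pr1-abc1234.ci.commoninternet.net"
    for var, host in EXTRA_ENV(domain).items():
        print(var, host, wildcard_covers(wildcard, host))
    # The recipe's nested default adds a second label and is not covered.
    print("nested", wildcard_covers(wildcard, "minio." + domain))
```

The same check could back a unit test that asserts every value returned by a recipe's `EXTRA_ENV` is covered by the wildcard, so a new nested default is caught before a deploy.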
## Phase 2 — `services_converged` treats a `replicas: 0` one-shot as converged
**Decision (settled):** `runner/harness/lifecycle.py::services_converged` now considers a service
converged when `cur == want` (desired replica count met), removing the prior
`or want == "0"` rejection.
**Why:** lasuite-drive's `minio-createbuckets` is declared `deploy: {mode: replicated, replicas: 0,
restart_policy: {condition: none}}` — an **on-demand one-shot** (scaled up manually only when buckets
need (re)creating; it `mc mb …` then `exit 0`). `docker stack services` reports it `0/0`. The old
check rejected any `want == "0"` row, so the stack could **never** report converged → every deploy
hung until `deploy_timeout`. A service AT its desired count (including 0/0) is converged; a service
still spinning up shows `0/1` (`cur != want`) and is correctly not-yet-converged, so the HTTP
readiness wait still gates real liveness. Safe for all currently-green recipes (their services are
all N/N with N>0; the `0/0` case did not previously occur). The bucket creation/migrations that the
one-shot performs run on demand from the recipe's `setup_custom_tests.sh` (post-deploy); generic-install
convergence does not rely on them (the SPA at `/` serves 200 without them).
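
**Sketch (illustrative, not the code in `lifecycle.py`):** a minimal Python version of the `cur == want` rule. It assumes the harness has already reduced `docker stack services` output to rows of the form `<name> <cur>/<want>`; the row format, the function signature, the service names, and the choice to treat an empty or unparsable listing as not converged are assumptions for illustration.

```python
def services_converged(rows: list[str]) -> bool:
    """Return True when every service row is at its desired replica count.

    Each row is assumed to look like '<name> <cur>/<want>', e.g. 'app 1/1'.
    A '0/0' row (a replicas: 0 one-shot) counts as converged because cur == want.
    """
    if not rows:
        return False  # assumption: no services listed means the stack is not up yet
    for row in rows:
        _name, _, replicas = row.strip().partition(" ")
        parts = replicas.split()
        if not parts:
            return False  # assumption: a row without a replicas field is not converged
        cur, sep, want = parts[0].partition("/")
        if not sep or want == "" or cur != want:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical service names.
    print(services_converged(["drive_app 1/1", "drive_minio-createbuckets 0/0"]))
    print(services_converged(["drive_app 0/1", "drive_minio-createbuckets 0/0"]))
```

Comparing the two fields as strings keeps a service that is still starting (`0/1`) out of the converged state while admitting the `0/0` one-shot, which is the behaviour the decision describes.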