# Plan: server regression canaries (codified E2E self-tests)

**Status:** PROPOSED. Queued as the loops' next phase after `mirror` (mirror-enroll). Single loop pair, one phase at a time.

**Owner:** Builder + Adversary loops.

**Created:** 2026-06-02.

**Author:** Claude Sonnet 4.6 orchestrator.

## Goal

Give the cc-ci server a **standing, codified regression suite in Python (pytest)** so we can keep modifying the server without silently breaking it. Not a prompt an agent re-runs, but a deterministic test artifact in the repo, runnable in CI with no LLM in the loop.

The suite proves the server can do **both** halves of its job, and the second is the one that bites:

1. **Confirm a healthy app is healthy**, end-to-end, with *semantic* assertions at every tier (NOT just an exit code / "pass").
2. **Catch a broken app**: a deliberately-broken canary that the server MUST report **RED**. A server that goes *false-green* (reports PASS while the app is broken) is the scariest regression; we already saw a *fabricated* FULL-PASS during the build. This guard makes false-green a test failure.

**Scope this round (operator-chosen):** E2E canaries (good + bad) only. NOT the fast logic-unit layer (bridge trigger rules / verdict / redaction unit tests); that's a good follow-up phase, deferred.

## Canaries

| Canary | Recipe | Why | Expected verdict |
|---|---|---|---|
| **Simple (good)** | `custom-html-tiny` | Minimal, fast, few deps: quick signal | GREEN |
| **Significant (good)** | `lasuite-docs` | Multi-service: backend + Postgres + Collabora WOPI + Keycloak OIDC; exercises real breadth | GREEN |
| **Known-BAD: custom-assertion** | a seeded fixture (see below) | App comes up healthy but a functional/custom assertion is violated | **RED** |
| **Known-BAD: per-tier ×4** | `custom-html-tiny` broken at one tier each (see below) | install / upgrade / backup / restore each fail in turn | **RED** at the intended tier |

**Known-bad fixture (custom-assertion):** reuse/recreate the phase-5 seeded case: `custom-html` branch `v5-stale-docroot` (serves `.txt` as `application/octet-stream` while the app is externally healthy), which already produced a RED build (#75) with only the content-type custom assertion failing. The regression test asserts the harness returns **RED** for this fixture. (If that branch is gone, recreate the pattern: an app that is up + passes lifecycle tiers but fails one functional assertion.) Pin by SHA.

### Per-tier RED canaries: prove the server catches failure at EVERY tier (fast)

The single fixture above only proves the server catches a *custom-assertion* failure. Add **one RED canary per lifecycle tier** so we prove the server reports RED at each of install / upgrade / backup / restore. False-green is the scariest regression, and it can hide at any tier (e.g. restore silently restoring nothing, the ghost/mattermost class of bug). Use the **simplest recipe, `custom-html-tiny`** (static content, deploys in seconds), so all four run **fast**; each is a fixture broken at exactly one tier, pinned by commit SHA.

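One way the canary matrix described so far might be encoded for test parametrization is sketched below. The `CanarySpec` dataclass, its field names, the canary names, and the `PINNED_SHA` placeholder are illustrative assumptions, not existing cc-ci code; real fixtures would carry their actual commit SHAs.

```python
from dataclasses import dataclass
from typing import List, Optional

# Lifecycle tiers in execution order, as listed in this plan.
TIERS = ("install", "upgrade", "backup", "restore")

# Placeholder only: every real canary must be pinned to an actual commit SHA.
PINNED_SHA = "<pinned-sha>"


@dataclass(frozen=True)
class CanarySpec:
    """Hypothetical description of one canary and the verdict it must produce."""
    name: str
    recipe: str
    ref: str                            # pinned commit SHA of the recipe/fixture
    expected_verdict: str               # "GREEN" or "RED"
    failing_tier: Optional[str] = None  # lifecycle tier expected to go RED, if any


def per_tier_red_canaries(recipe: str = "custom-html-tiny") -> List[CanarySpec]:
    """Build one RED canary per lifecycle tier, all on the same small recipe."""
    return [
        CanarySpec(f"red-{tier}", recipe, PINNED_SHA, "RED", failing_tier=tier)
        for tier in TIERS
    ]


CANARIES = [
    CanarySpec("simple-good", "custom-html-tiny", PINNED_SHA, "GREEN"),
    CanarySpec("significant-good", "lasuite-docs", PINNED_SHA, "GREEN"),
    # Custom-assertion fixture: lifecycle tiers pass, one functional check fails,
    # so no lifecycle tier is named as the failing one.
    CanarySpec("bad-custom-assertion", "custom-html", PINNED_SHA, "RED"),
    *per_tier_red_canaries(),
]
```

A list like `CANARIES` could be handed to `pytest.mark.parametrize`, with `ids=[c.name for c in CANARIES]` so that each canary shows up as its own test case.
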

| RED canary | How it's broken (custom-html-tiny fixture) | Expected harness result |
|---|---|---|
| **install** | image tag that never becomes healthy / a healthcheck that can't pass | **install tier RED** |
| **upgrade** | installs clean; the upgrade target breaks the container so post-upgrade health fails | install PASS, **upgrade tier RED** |
| **backup** | install+upgrade clean; backup misconfigured (backupbot label/target wrong → backup errors or yields no artifact) | **backup tier RED** |
| **restore** | backup succeeds; restore is a no-op (hook does nothing) so the pre-seeded marker is ABSENT after restore | **restore tier RED** (the scariest false-green) |

Each pytest asserts **precisely**: overall verdict RED, the failing tier is the *intended* one, AND the tiers *before* it PASSED (e.g. upgrade-RED requires install to have passed). This proves "catches a failure **at this tier**", not merely "fails somewhere". These four form the **fast subset** of the suite; consider a sub-marker (`@pytest.mark.canary_fast`) so they can optionally run as a quicker pre-check while the slow good canaries (esp. lasuite-docs) stay on the milestone cadence below.

## What "works as expected" means per tier (real assertions, not exit codes)

For each good canary the test drives the real cold full suite and asserts the *observable outcome*:

- **install** → app returns HTTP 200, expected page content/marker present, all services converged.
- **upgrade** → a record/marker seeded **before** the upgrade is still present **after**; the deployed version actually bumped; app still 200. (For lasuite-docs: a created doc survives; OIDC login still works; Collabora WOPI still 200.)
- **backup → restore** → seed data, wipe the app, restore, assert the seeded data is back.
- **secrets persistence** → the same generated app secret is identical across install→upgrade→restore.
- **redaction** → grep the published logs AND the dashboard for the generated secret value(s) → assert **absent**.
- **teardown** → after the run, assert no leftover containers/volumes/secrets for the canary.

## Design

- **Location:** `tests/regression/` in the cc-ci repo. `pytest`, parametrized over the canaries.
- **Driver:** wrap the existing real cold path (`.../ci-test-review/verify-pr.sh` / the runner `runner/run_recipe_ci.py`). Do NOT reimplement the harness; call it and assert on its real outputs (build verdict, summary, dashboard JSON, published logs). The semantic per-tier assertions live in the test, layered on top of the harness's own pass/fail, so a tier that the harness calls "pass" but that didn't actually preserve data still fails the regression test.
- **Markers:** `@pytest.mark.canary` (slow E2E) so they can be selected/excluded; the suite is run on-demand / pre-merge / nightly, not on every fast commit.
- **Determinism:** pin canary recipe versions/SHAs and the bad-fixture SHA. Record run artifacts (build URL, logs) on failure for triage.

## How it runs: CADENCE POLICY (important)

**Do NOT run these canaries on every commit/PR.** They are slow + resource-heavy (full lifecycle on lasuite-docs takes minutes and needs the live server/abra/Swarm). Run them **deliberately at milestones**:

- **polishing passes**, **code reviews**, and **releases** of the cc-ci server, i.e. before we trust a batch of server changes, not on each incremental commit.
- On-demand: `pytest -m canary` against the live cc-ci server (from the orchestrator or the host).
- They are explicitly **opt-in** (the `@pytest.mark.canary` marker keeps them out of any fast/default run). If wired to `!testme` on the cc-ci repo, gate it behind a deliberate trigger (e.g. a `run-canaries` label or a `!testme --canary`), **not** an automatic run on every cc-ci PR.
- Document this cadence in `tests/regression/README.md` so future contributors don't wire it into the per-commit path.

## Definition of Done (Adversary-verified)

1. `tests/regression/` pytest suite exists and is committed (cc-ci repo PR).
2. Run GREEN on both good canaries (`custom-html-tiny`, `lasuite-docs`) with the per-tier semantic assertions actually executing (Adversary confirms the assertions FAIL if you tamper with an outcome, i.e. the assertions have teeth and are not vacuous).
3. The custom-assertion known-bad canary makes the suite assert **RED**, and the Adversary confirms that if the server *wrongly* returned green for it, the regression test would FAIL (false-green caught).
4. **The four per-tier RED canaries** (install/upgrade/backup/restore, on `custom-html-tiny`) each make the suite assert RED **at the intended tier**, with the prior tiers asserted PASS. Adversary confirms each has teeth: if the server wrongly green-lit that tier, the corresponding test would FAIL. They run fast.
5. A short `tests/regression/README.md`: how to run it, what each canary guards, how to add a canary.
6. NOT merged: recipe/test PR opened for operator review (loops never merge).

## Risks / notes

- **Slow + resource-heavy:** full lifecycle on lasuite-docs takes minutes and needs the live server/abra/Swarm. Keep it `-m canary`, not in the fast path. Watch disk (lasuite-docs upgrade was disk-sensitive).
- **Flakiness:** lasuite-docs had transient WOPI-404 / Collabora convergence races (see `plan-lasuite-drive-*`); use the harness's existing readiness probes; if flaky, add bounded retries on *readiness only* (never on the correctness assertion).
- **Don't weaken to pass:** the whole point is teeth. A flaky correctness assertion is a real signal, not something to relax.

## Out of scope (deferred to a follow-up phase)

- Fast logic-unit layer (bridge `!testme`/`!testmexyz`/non-collaborator rules, verdict computation, redaction function, summary formatting) as second-level pytest run on every commit.
- Golden-output snapshots, concurrency-collision canary, perf-budget assertions.
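
Returning to the in-scope per-tier RED canaries (Definition of Done, item 4): the "RED at the intended tier, earlier tiers PASS" check could be sketched roughly as follows. The shape of `result` (a `verdict` string plus a `tiers` mapping with `"PASS"`/`"FAIL"` values) and the helper name `check_red_at_tier` are assumptions for illustration; the real harness output may look different and would need adapting.

```python
from typing import Any, Dict

# Lifecycle tiers in execution order, as listed in this plan.
TIERS = ("install", "upgrade", "backup", "restore")


def check_red_at_tier(result: Dict[str, Any], intended_tier: str) -> None:
    """Raise AssertionError unless the run went RED at exactly `intended_tier`.

    `result` is an ASSUMED shape, for example:
        {"verdict": "RED", "tiers": {"install": "PASS", "upgrade": "FAIL"}}
    """
    # 1. The overall verdict must be RED; anything else is a false-green.
    assert result["verdict"] == "RED", (
        f"false-green: expected RED, got {result['verdict']}"
    )

    tiers = result["tiers"]
    position = TIERS.index(intended_tier)

    # 2. Every tier before the intended one must have passed, so the test
    #    proves "fails at this tier" and not merely "fails somewhere".
    for earlier in TIERS[:position]:
        assert tiers.get(earlier) == "PASS", (
            f"{earlier} should have passed before {intended_tier}, "
            f"got {tiers.get(earlier)}"
        )

    # 3. The intended tier itself must be the one reported as failing.
    assert tiers.get(intended_tier) == "FAIL", (
        f"{intended_tier} should have failed, got {tiers.get(intended_tier)}"
    )
```

In a test, this would be called with the parsed harness output and the tier the fixture is broken at, e.g. `check_red_at_tier(result, "restore")`. A server that wrongly green-lights the tier trips the first or third assertion, which is the "teeth" the Adversary is asked to confirm.
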