
# Plan — server regression canaries (codified E2E self-tests)

**Status:** PROPOSED — queued as the loops' next phase after `mirror` (mirror-enroll). Single loop pair,
one phase at a time.
**Owner:** Builder + Adversary loops. **Created:** 2026-06-02. **Author:** Claude Sonnet 4.6 orchestrator.

## Goal

Give the cc-ci server a **standing, codified regression suite in Python (pytest)** so we can keep
modifying the server without silently breaking it. Not a prompt an agent re-runs — a deterministic
test artifact in the repo, runnable in CI with no LLM in the loop.

The suite proves the server can do **both** halves of its job — and the second is the one that bites:
1. **Confirm a healthy app is healthy**, end-to-end, with *semantic* assertions at every tier (NOT
   just an exit code / "pass").
2. **Catch a broken app** — a deliberately-broken canary that the server MUST report **RED**. A server
   that goes *false-green* (reports PASS while the app is broken) is the scariest regression; we already
   saw a *fabricated* FULL-PASS during the build. This guard makes false-green a test failure.

**Scope this round (operator-chosen):** E2E canaries (good + bad) only. NOT the fast logic-unit layer
(bridge trigger rules / verdict / redaction unit tests) — that's a good follow-up phase, deferred.

## Canaries

| Canary | Recipe | Why | Expected verdict |
|---|---|---|---|
| **Simple (good)** | `custom-html-tiny` | Minimal, fast, few deps — quick signal | GREEN |
| **Significant (good)** | `lasuite-docs` | Multi-service: backend + Postgres + Collabora WOPI + Keycloak OIDC; exercises real breadth | GREEN |
| **Known-BAD: custom-assertion** | a seeded fixture (see below) | App comes up healthy but a functional/custom assertion is violated | **RED** |
| **Known-BAD: per-tier ×4** | `custom-html-tiny` broken at one tier each (see below) | install / upgrade / backup / restore each fail in turn | **RED** at the intended tier |

**Known-bad fixture (custom-assertion):** reuse/recreate the phase-5 seeded case — `custom-html` branch
`v5-stale-docroot` (serves `.txt` as `application/octet-stream` while the app is externally healthy),
which already produced a RED build (#75) with only the content-type custom assertion failing. The
regression test asserts the harness returns **RED** for this fixture. (If that branch is gone, recreate
the pattern: an app that is up + passes lifecycle tiers but fails one functional assertion.) Pin by SHA.
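
"Pin by SHA" can be enforced mechanically by rejecting any canary definition that names a branch or a short hash. A minimal sketch, assuming a hypothetical `Canary` record and `validate_pin` helper (neither is stated to exist in the cc-ci repo) and assuming 40-character SHA-1 commit hashes:

```python
import re
from dataclasses import dataclass

# A full 40-character lowercase hex commit hash (assumes SHA-1 Git repos).
# Branch names such as "v5-stale-docroot" and short hashes do not match.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


@dataclass(frozen=True)
class Canary:
    """Hypothetical record for one canary; field names are illustrative."""
    name: str
    recipe: str
    pinned_sha: str
    expected_verdict: str  # "GREEN" or "RED"


def validate_pin(canary: Canary) -> Canary:
    """Fail fast if a canary points at a moving ref instead of a commit."""
    if not _FULL_SHA.fullmatch(canary.pinned_sha):
        raise ValueError(
            f"{canary.name}: {canary.pinned_sha!r} is not a full commit SHA"
        )
    if canary.expected_verdict not in {"GREEN", "RED"}:
        raise ValueError(
            f"{canary.name}: unknown verdict {canary.expected_verdict!r}"
        )
    return canary
```

Running `validate_pin` over the whole canary list at collection time would turn an accidental branch pin into an immediate error rather than a silently drifting fixture.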

### Per-tier RED canaries — prove the server catches failure at EVERY tier (fast)

The single fixture above only proves the server catches a *custom-assertion* failure. Add **one RED
canary per lifecycle tier** so we prove the server reports RED at each of install / upgrade / backup /
restore — false-green is the scariest regression, and it can hide at any tier (e.g. restore silently
restoring nothing, the ghost/mattermost class of bug). Use the **simplest recipe — `custom-html-tiny`**
(static content, deploys in seconds) so all four run **fast**; each is a fixture broken at exactly one
tier, pinned by commit SHA.

| RED canary | How it's broken (custom-html-tiny fixture) | Expected harness result |
|---|---|---|
| **install** | image tag that never becomes healthy / a healthcheck that can't pass | **install tier RED** |
| **upgrade** | installs clean; the upgrade target breaks the container so post-upgrade health fails | install PASS, **upgrade tier RED** |
| **backup** | install+upgrade clean; backup misconfigured (backupbot label/target wrong → backup errors or yields no artifact) | **backup tier RED** |
| **restore** | backup succeeds; restore is a no-op (hook does nothing) so the pre-seeded marker is ABSENT after restore | **restore tier RED** (the scariest false-green) |

Each pytest asserts **precisely**: overall verdict RED, the failing tier is the *intended* one, AND the
tiers *before* it PASSED (e.g. upgrade-RED requires install to have passed) — so it proves "catches a
failure **at this tier**", not merely "fails somewhere". These four form the **fast subset** of the
suite; consider a sub-marker (`@pytest.mark.canary_fast`) so they can optionally run as a quicker
pre-check while the slow good canaries (esp. lasuite-docs) stay on the milestone cadence below.
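
The "fails at this tier, not merely somewhere" rule above can be sketched as a single helper. This assumes the harness output can be reduced to an overall verdict plus a tier-to-status mapping using the strings `"PASS"` and `"FAIL"`; that shape, the tier order, and the name `assert_red_at_tier` are illustrative assumptions, not the harness's actual format:

```python
# Assumed lifecycle order, taken from the per-tier table above.
TIERS = ("install", "upgrade", "backup", "restore")


def assert_red_at_tier(verdict: str, tier_results: dict[str, str], intended: str) -> None:
    """Assert the run is RED, fails at `intended`, and every earlier tier passed."""
    assert verdict == "RED", f"expected overall RED, got {verdict}"
    for earlier in TIERS[: TIERS.index(intended)]:
        status = tier_results.get(earlier)
        assert status == "PASS", (
            f"{earlier} must PASS before {intended} can be blamed, got {status}"
        )
    status = tier_results.get(intended)
    assert status == "FAIL", f"expected {intended} to FAIL, got {status}"
```

For the upgrade canary, `assert_red_at_tier("RED", {"install": "PASS", "upgrade": "FAIL"}, "upgrade")` passes, while a run that already failed at install, or one that failed only at backup, raises `AssertionError`.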

## What "works as expected" means per tier (real assertions, not exit codes)

For each good canary the test drives the real cold full suite and asserts the *observable outcome*:
- **install** → app returns HTTP 200, expected page content/marker present, all services converged.
- **upgrade** → a record/marker seeded **before** the upgrade is still present **after**; the deployed
  version actually bumped; app still 200. (For lasuite-docs: a created doc survives; OIDC login still
  works; Collabora WOPI still 200.)
- **backup → restore** → seed data, take a backup, wipe the app, restore, assert the seeded data is back.
- **secrets persistence** → the generated app secret is identical across install→upgrade→restore.
- **redaction** → grep the published logs AND the dashboard for the generated secret value(s) → assert
  **absent**.
- **teardown** → after the run, assert no leftover containers/volumes/secrets for the canary.
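
Three of the checks above (marker survival, secret persistence, redaction) reduce to small pure assertions once the observations are collected. A minimal sketch; how the markers, secrets, logs, and dashboard text are fetched from the live server is left out, and all function names here are hypothetical:

```python
def assert_markers_survived(seeded: set[str], found_after: set[str]) -> None:
    """Every marker seeded before upgrade/restore must still be present after."""
    assert seeded, "no markers seeded: the check would be vacuous"
    missing = seeded - found_after
    assert not missing, f"markers lost: {sorted(missing)}"


def assert_secret_stable(secret_by_stage: dict[str, str]) -> None:
    """The generated secret must be identical at every recorded stage."""
    assert secret_by_stage, "no secret observations recorded"
    assert len(set(secret_by_stage.values())) == 1, (
        f"secret changed across stages: {sorted(secret_by_stage)}"
    )


def assert_redacted(published: dict[str, str], secrets: list[str]) -> None:
    """No secret value may appear in any published text (logs, dashboard)."""
    assert secrets and all(secrets), "need non-empty secret values to search for"
    for source, text in published.items():
        for secret in secrets:
            # Report the source only; never echo the secret in the failure message.
            assert secret not in text, f"secret leaked in {source}"
```

The guards against empty inputs matter as much as the main assertions: a redaction check given no secrets, or a survival check with nothing seeded, would otherwise pass without proving anything.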

## Design

- **Location:** `tests/regression/` in the cc-ci repo. `pytest`, parametrized over the canaries.
- **Driver:** wrap the existing real cold path (`.../ci-test-review/verify-pr.sh` / the runner
  `runner/run_recipe_ci.py`) — do NOT reimplement the harness; call it and assert on its real outputs
  (build verdict, summary, dashboard JSON, published logs). The semantic per-tier assertions live in
  the test, layered on top of the harness's own pass/fail so a tier that the harness calls "pass" but
  that didn't actually preserve data still fails the regression test.
- **Markers:** `@pytest.mark.canary` (slow E2E) so they can be selected/excluded; the suite runs on
  demand at milestones (see the cadence policy below), not on every commit.
- **Determinism:** pin canary recipe versions/SHAs and the bad-fixture SHA. Record run artifacts
  (build URL, logs) on failure for triage.
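
The layering described under **Driver** (semantic checks on top of the harness's own pass/fail) could look roughly like this for a good canary. The tier-to-status mapping and the `regression_failures` name are assumptions for illustration; the real harness outputs would need to be adapted into this shape:

```python
from typing import Callable, Mapping


def regression_failures(
    harness_tiers: Mapping[str, str],
    semantic_checks: Mapping[str, Callable[[], bool]],
) -> list[str]:
    """Return one message per tier that fails the regression test.

    A tier fails if the harness did not report PASS, or if the harness
    reported PASS but the test's own semantic check disagrees.
    """
    failures = []
    for tier, status in harness_tiers.items():
        if status != "PASS":
            failures.append(f"{tier}: harness reported {status}")
        elif tier in semantic_checks and not semantic_checks[tier]():
            failures.append(f"{tier}: harness reported PASS but semantic check failed")
    return failures
```

A good-canary test would then end with `assert regression_failures(tiers, checks) == []`, so a restore tier that the harness calls "pass" but that lost the seeded data still fails the suite.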

## How it runs — CADENCE POLICY (important)
**Do NOT run these canaries on every commit/PR.** They are slow + resource-heavy (a full lifecycle on
lasuite-docs takes minutes and needs the live server/abra/Swarm). Run them **deliberately at milestones**:
- **polishing passes**, **code reviews**, and **releases** of the cc-ci server — i.e. before we trust a
  batch of server changes, not on each incremental commit.
- On-demand: `pytest -m canary` against the live cc-ci server (from the orchestrator or the host).
- They are explicitly **opt-in** (the `@pytest.mark.canary` marker keeps them out of any fast/default
  run). If wired to `!testme` on the cc-ci repo, gate it behind a deliberate trigger (e.g. a
  `run-canaries` label or a `!testme --canary`), **not** an automatic run on every cc-ci PR.
- Document this cadence in `tests/regression/README.md` so future contributors don't wire it into the
  per-commit path.
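
The opt-in behaviour can be enforced in `conftest.py` rather than left to convention. A sketch under the assumption that canary tests carry the `canary` or `canary_fast` marker; the helper name `canary_opted_in` is hypothetical, while the hooks and `config`/`item` calls are standard pytest:

```python
# tests/regression/conftest.py (sketch)


def canary_opted_in(markexpr):
    """True when the -m expression names a canary marker (canary or canary_fast)."""
    return "canary" in (markexpr or "")


def pytest_configure(config):
    # Register both markers so `--strict-markers` does not reject them.
    config.addinivalue_line(
        "markers", "canary: slow E2E canary against the live cc-ci server (opt-in)"
    )
    config.addinivalue_line(
        "markers", "canary_fast: fast per-tier RED canary subset (opt-in)"
    )


def pytest_collection_modifyitems(config, items):
    import pytest  # imported lazily so the helper above is usable without pytest

    if canary_opted_in(config.getoption("markexpr", default="")):
        return  # -m mentions a canary marker; pytest's own -m filtering decides
    skip = pytest.mark.skip(reason="canaries are opt-in: run `pytest -m canary`")
    for item in items:
        if item.get_closest_marker("canary") or item.get_closest_marker("canary_fast"):
            item.add_marker(skip)
```

With this in place a bare `pytest` run reports the canaries as skipped with the reason shown, and only an explicit `pytest -m canary` (or `-m canary_fast`) executes them.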

## Definition of Done (Adversary-verified)
1. `tests/regression/` pytest suite exists and is committed (cc-ci repo PR).
2. The suite runs GREEN on both good canaries (`custom-html-tiny`, `lasuite-docs`) with the per-tier
   semantic assertions actually executing (Adversary confirms the assertions FAIL if you tamper with an
   outcome, i.e. the assertions have teeth and are not vacuous).
3. The custom-assertion known-bad canary makes the suite assert **RED** — and the Adversary confirms
   that if the server *wrongly* returned green for it, the regression test would FAIL (false-green caught).
4. **The four per-tier RED canaries** (install/upgrade/backup/restore, on `custom-html-tiny`) each make
   the suite assert RED **at the intended tier**, with the prior tiers asserted PASS. Adversary confirms
   each has teeth: if the server wrongly green-lit that tier, the corresponding test would FAIL. They run
   fast.
5. A short `tests/regression/README.md`: how to run it, what each canary guards, how to add a canary.
6. NOT merged — recipe/test PR opened for operator review (loops never merge).
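
The "teeth" requirement in items 2 to 4 can itself be checked mechanically: feed an assertion a known-good outcome, then a tampered copy, and require that it rejects the tampered one. A minimal sketch, assuming outcomes are plain dicts; `has_teeth` and the outcome keys are hypothetical, and this complements rather than replaces the Adversary's verification:

```python
import copy
from typing import Callable


def has_teeth(
    check: Callable[[dict], None],
    good_outcome: dict,
    tamper: Callable[[dict], None],
) -> bool:
    """True if `check` accepts the good outcome and rejects a tampered copy."""
    check(good_outcome)  # raises AssertionError if the check rejects good data
    tampered = copy.deepcopy(good_outcome)
    tamper(tampered)
    try:
        check(tampered)
    except AssertionError:
        return True  # the check noticed the tampering
    return False  # the check is vacuous for this kind of breakage


def check_restore(outcome: dict) -> None:
    """Example check: the pre-seeded marker must be present after restore."""
    assert outcome["marker_present_after_restore"], "restore lost the seeded marker"
```

Here `has_teeth(check_restore, {"marker_present_after_restore": True}, lambda o: o.update(marker_present_after_restore=False))` returns `True`, whereas a check that never inspects the outcome returns `False`.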

## Risks / notes
- **Slow + resource-heavy:** a full lifecycle on lasuite-docs takes minutes and needs the live server/abra/
  Swarm. Keep it `-m canary`, not in the fast path. Watch disk (lasuite-docs upgrade was disk-sensitive).
- **Flakiness:** lasuite-docs had transient WOPI-404 / Collabora convergence races (see
  `plan-lasuite-drive-*`); use the harness's existing readiness probes; if flaky, add bounded retries on
  *readiness only* (never on the correctness assertion).
- **Don't weaken to pass:** the whole point is teeth. A flaky correctness assertion is a real signal, not
  something to relax.
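
The "retry readiness, never correctness" rule can be kept honest by putting the retry loop in one place only. A sketch of a bounded readiness wait, to be used only if the harness's existing probes prove insufficient; `wait_until_ready` and its defaults are illustrative assumptions:

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` up to `attempts` times; return True as soon as it succeeds."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # no sleep after the final failed attempt
    return False
```

In a test this would read `assert wait_until_ready(service_is_up)` followed by the correctness assertion executed exactly once, with no retry wrapper around it (`service_is_up` being whatever readiness probe the test uses).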

## Out of scope (deferred to a follow-up phase)
- Fast logic-unit layer (bridge `!testme`/`!testmexyz`/non-collaborator rules, verdict computation,
  redaction function, summary formatting) as a second-level pytest layer run on every commit.
- Golden-output snapshots, concurrency-collision canary, perf-budget assertions.