One deliberately-broken custom-html-tiny fixture per lifecycle tier so the suite proves the server reports RED at EVERY tier (not just one) — each asserts RED at the intended tier with prior tiers PASS, so it's 'catches a failure at this tier', not 'fails somewhere'. Fast (simplest recipe); the fast subset of the suite vs the slow good canaries. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Plan: server regression canaries (codified E2E self-tests)

**Status:** PROPOSED. Queued as the loops' next phase after `mirror` (mirror-enroll). Single loop pair,
one phase at a time.
**Owner:** Builder + Adversary loops. **Created:** 2026-06-02. **Author:** Claude Sonnet 4.6 orchestrator.

## Goal

Give the cc-ci server a **standing, codified regression suite in Python (pytest)** so we can keep
modifying the server without silently breaking it. Not a prompt an agent re-runs, but a deterministic
test artifact in the repo, runnable in CI with no LLM in the loop.

The suite proves the server can do **both** halves of its job, and the second is the one that bites:
1. **Confirm a healthy app is healthy**, end-to-end, with *semantic* assertions at every tier (NOT
   just an exit code / "pass").
2. **Catch a broken app**: a deliberately-broken canary that the server MUST report **RED**. A server
   that goes *false-green* (reports PASS while the app is broken) is the scariest regression; we already
   saw a *fabricated* FULL-PASS during the build. This guard makes false-green a test failure.

**Scope this round (operator-chosen):** E2E canaries (good + bad) only. NOT the fast logic-unit layer
(bridge trigger rules / verdict / redaction unit tests); that's a good follow-up phase, deferred.

## Canaries

| Canary | Recipe | Why | Expected verdict |
|---|---|---|---|
| **Simple (good)** | `custom-html-tiny` | Minimal, fast, few deps: quick signal | GREEN |
| **Significant (good)** | `lasuite-docs` | Multi-service: backend + Postgres + Collabora WOPI + Keycloak OIDC; exercises real breadth | GREEN |
| **Known-BAD: custom-assertion** | a seeded fixture (see below) | App comes up healthy but a functional/custom assertion is violated | **RED** |
| **Known-BAD: per-tier ×4** | `custom-html-tiny` broken at one tier each (see below) | install / upgrade / backup / restore each fail in turn | **RED** at the intended tier |

**Known-bad fixture (custom-assertion):** reuse/recreate the phase-5 seeded case: `custom-html` branch
`v5-stale-docroot` (serves `.txt` as `application/octet-stream` while the app is externally healthy),
which already produced a RED build (#75) with only the content-type custom assertion failing. The
regression test asserts the harness returns **RED** for this fixture. (If that branch is gone, recreate
the pattern: an app that is up + passes lifecycle tiers but fails one functional assertion.) Pin by SHA.

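The canary table above could be encoded as data for `pytest` parametrization. The following is a minimal sketch, not the suite's actual layout: the `Canary` dataclass, its field names, the canary ids, and the `PIN-ME` placeholder are all hypothetical, and the real pinned SHAs are deliberately left unfilled.

```python
from dataclasses import dataclass
from typing import Optional

# Lifecycle tiers in the order the plan lists them.
TIERS = ("install", "upgrade", "backup", "restore")


@dataclass(frozen=True)
class Canary:
    """One row of the canary table (hypothetical shape)."""
    name: str
    recipe: str
    expected_verdict: str            # "GREEN" or "RED"
    red_tier: Optional[str] = None   # tier expected to fail; None if not tier-specific
    ref: str = "PIN-ME"              # placeholder: pin by commit SHA before use


CANARIES = [
    Canary("simple-good", "custom-html-tiny", "GREEN"),
    Canary("significant-good", "lasuite-docs", "GREEN"),
    # Fails a custom assertion while every lifecycle tier passes.
    Canary("bad-custom-assertion", "custom-html", "RED"),
] + [
    # One fixture broken at exactly one tier each.
    Canary(f"bad-{tier}", "custom-html-tiny", "RED", red_tier=tier)
    for tier in TIERS
]

# Readable test ids, and the four per-tier fixtures on their own.
CANARY_IDS = [c.name for c in CANARIES]
PER_TIER_RED = [c for c in CANARIES if c.red_tier is not None]
```

A list like this can be passed to `pytest.mark.parametrize("canary", CANARIES, ids=CANARY_IDS)`, so adding a canary means adding one row.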
### Per-tier RED canaries: prove the server catches failure at EVERY tier (fast)

The single fixture above only proves the server catches a *custom-assertion* failure. Add **one RED
canary per lifecycle tier** so we prove the server reports RED at each of install / upgrade / backup /
restore. False-green is the scariest regression, and it can hide at any tier (e.g. restore silently
restoring nothing, the ghost/mattermost class of bug). Use the **simplest recipe, `custom-html-tiny`**
(static content, deploys in seconds) so all four run **fast**; each is a fixture broken at exactly one
tier, pinned by commit SHA.

| RED canary | How it's broken (custom-html-tiny fixture) | Expected harness result |
|---|---|---|
| **install** | image tag that never becomes healthy / a healthcheck that can't pass | **install tier RED** |
| **upgrade** | installs clean; the upgrade target breaks the container so post-upgrade health fails | install PASS, **upgrade tier RED** |
| **backup** | install+upgrade clean; backup misconfigured (backupbot label/target wrong → backup errors or yields no artifact) | install + upgrade PASS, **backup tier RED** |
| **restore** | backup succeeds; restore is a no-op (hook does nothing) so the pre-seeded marker is ABSENT after restore | install + upgrade + backup PASS, **restore tier RED** (the scariest false-green) |

Each pytest test asserts **precisely**: overall verdict RED, the failing tier is the *intended* one, AND the
tiers *before* it PASSED (e.g. upgrade-RED requires install to have passed). That proves "catches a
failure **at this tier**", not merely "fails somewhere". These four form the **fast subset** of the
suite; consider a sub-marker (`@pytest.mark.canary_fast`) so they can optionally run as a quicker
pre-check while the slow good canaries (esp. lasuite-docs) stay on the milestone cadence below.

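The three-part assertion described above (overall RED, intended tier failed, prior tiers passed) might look like the helper below. This is a sketch under assumptions: the harness's real result format is not specified here, so the `tier_results` dict and the `"PASS"` / `"FAIL"` strings are hypothetical stand-ins for whatever the build summary or dashboard JSON provides.

```python
from typing import Dict

# Lifecycle order taken from the plan: install, upgrade, backup, restore.
TIER_ORDER = ("install", "upgrade", "backup", "restore")


def assert_red_at_tier(verdict: str, tier_results: Dict[str, str], intended: str) -> None:
    """Assert the run failed at exactly the intended tier.

    `verdict` and `tier_results` are assumed to be parsed from the harness
    output by the caller; their shape here is hypothetical.
    """
    # 1. Overall verdict must be RED, otherwise the server went false-green.
    assert verdict == "RED", f"false-green: overall verdict was {verdict!r}"

    # 2. Every tier before the intended one must have passed.
    for prior in TIER_ORDER[: TIER_ORDER.index(intended)]:
        got = tier_results.get(prior)
        assert got == "PASS", f"{prior} should PASS before {intended}, got {got!r}"

    # 3. The intended tier itself must be the one that failed.
    got = tier_results.get(intended)
    assert got == "FAIL", f"{intended} tier should FAIL, got {got!r}"
```

Tiers after the intended one are left unchecked on purpose, since a harness may skip them once a tier fails.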
## What "works as expected" means per tier (real assertions, not exit codes)

For each good canary the test drives the real cold full suite and asserts the *observable outcome*:
- **install** → app returns HTTP 200, expected page content/marker present, all services converged.
- **upgrade** → a record/marker seeded **before** the upgrade is still present **after**; the deployed
  version actually bumped; app still 200. (For lasuite-docs: a created doc survives; OIDC login still
  works; Collabora WOPI still 200.)
- **backup → restore** → seed data, wipe the app, restore, assert the seeded data is back.
- **secrets persistence** → the generated app secret is identical across install→upgrade→restore.
- **redaction** → grep the published logs AND the dashboard for the generated secret value(s) → assert
  **absent**.
- **teardown** → after the run, assert no leftover containers/volumes/secrets for the canary.

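As an illustration of one of these semantic checks, the redaction assertion could be sketched as a plain substring scan over the published text. How the logs and dashboard content are fetched is outside this sketch; the function name and the `sources` mapping are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def find_secret_leaks(secrets: Iterable[str], sources: Dict[str, str]) -> List[Tuple[str, int]]:
    """Return (source_name, secret_index) for every secret found in a source.

    Only the index of the secret is reported, so a failing test does not
    print the secret value into yet another log.
    """
    leaks = []
    for index, secret in enumerate(secrets):
        if not secret:
            # An empty string is a substring of everything; it most likely
            # means the test never captured the real generated value.
            raise ValueError(f"secret #{index} is empty; cannot check redaction")
        for source_name, text in sources.items():
            if secret in text:
                leaks.append((source_name, index))
    return leaks
```

A test would then end with `assert find_secret_leaks(generated_secrets, {"logs": logs_text, "dashboard": dashboard_text}) == []`, where the three inputs are gathered from the real run.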
## Design

- **Location:** `tests/regression/` in the cc-ci repo. `pytest`, parametrized over the canaries.
- **Driver:** wrap the existing real cold path (`.../ci-test-review/verify-pr.sh` / the runner
  `runner/run_recipe_ci.py`). Do NOT reimplement the harness; call it and assert on its real outputs
  (build verdict, summary, dashboard JSON, published logs). The semantic per-tier assertions live in
  the test, layered on top of the harness's own pass/fail so a tier that the harness calls "pass" but
  that didn't actually preserve data still fails the regression test.
- **Markers:** `@pytest.mark.canary` (slow E2E) so they can be selected/excluded; the suite is run
  on-demand / pre-merge / nightly, not on every fast commit.
- **Determinism:** pin canary recipe versions/SHAs and the bad-fixture SHA. Record run artifacts
  (build URL, logs) on failure for triage.

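The "call the harness, then layer semantic checks on top" idea from the Driver bullet can be sketched as below. The command line passed to `run_harness` is left entirely to the caller because the runner's real arguments are not documented here; `HarnessRun`, `run_harness`, and `tier_verdict` are hypothetical names.

```python
import subprocess
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class HarnessRun:
    """Raw outcome of one harness invocation, kept for triage on failure."""
    returncode: int
    stdout: str
    stderr: str


def run_harness(cmd: Sequence[str], timeout_s: float = 3600.0) -> HarnessRun:
    """Run the existing harness as a subprocess without reimplementing it."""
    proc = subprocess.run(
        list(cmd), capture_output=True, text=True, timeout=timeout_s, check=False
    )
    return HarnessRun(proc.returncode, proc.stdout, proc.stderr)


def tier_verdict(harness_passed: bool, semantic_checks: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Combine the harness's own pass/fail with the test's semantic checks.

    A tier only counts as passing when the harness says so AND every
    semantic check holds. An empty check set is rejected so the layered
    assertion can never be vacuous.
    """
    if not semantic_checks:
        raise ValueError("no semantic checks supplied; assertion would be vacuous")
    failed = [name for name, ok in semantic_checks.items() if not ok]
    return (harness_passed and not failed, failed)
```

With this shape, a tier the harness reports as passing still fails the regression test when, for example, `{"marker_survived": False}` is supplied, which is the data-loss case the Driver bullet describes.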
## How it runs: CADENCE POLICY (important)

**Do NOT run these canaries on every commit/PR.** They are slow + resource-heavy (full lifecycle on
lasuite-docs is minutes, needs the live server/abra/Swarm). Run them **deliberately at milestones**:
- **polishing passes**, **code reviews**, and **releases** of the cc-ci server, i.e. before we trust a
  batch of server changes, not on each incremental commit.
- On-demand: `pytest -m canary` against the live cc-ci server (from the orchestrator or the host).
- They are explicitly **opt-in**: the `@pytest.mark.canary` marker lets any fast/default run deselect
  them (e.g. with `-m "not canary"`). If wired to `!testme` on the cc-ci repo, gate it behind a
  deliberate trigger (e.g. a `run-canaries` label or a `!testme --canary`), **not** an automatic run on
  every cc-ci PR.
- Document this cadence in `tests/regression/README.md` so future contributors don't wire it into the
  per-commit path.

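If the canaries are ever wired to `!testme`, the deliberate-trigger gate could be a small pure function like the one below. This is one possible reading of the policy, not the bridge's actual logic: the function name is hypothetical, and treating the `run-canaries` label and the `--canary` flag as independent triggers is an assumption.

```python
from typing import Iterable


def should_run_canaries(comment: str, labels: Iterable[str]) -> bool:
    """Decide whether a PR event is a deliberate request to run the canaries.

    Assumption: either the `run-canaries` label or an explicit
    `!testme --canary` comment opts in. A bare `!testme` never does.
    """
    if "run-canaries" in set(labels):
        return True
    tokens = comment.split()
    # Require the exact command token so look-alikes such as `!testmexyz`
    # do not trigger a run.
    return bool(tokens) and tokens[0] == "!testme" and "--canary" in tokens[1:]
```

Keeping the gate as a pure function means it can later be covered by the fast logic-unit layer without touching the live server.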
## Definition of Done (Adversary-verified)

1. `tests/regression/` pytest suite exists and is committed (cc-ci repo PR).
2. The suite runs GREEN on both good canaries (`custom-html-tiny`, `lasuite-docs`) with the per-tier
   semantic assertions actually executing (Adversary confirms the assertions FAIL if you tamper with an
   outcome, i.e. the assertions have teeth, they're not vacuous).
3. The custom-assertion known-bad canary makes the suite assert **RED**, and the Adversary confirms
   that if the server *wrongly* returned green for it, the regression test would FAIL (false-green caught).
4. **The four per-tier RED canaries** (install/upgrade/backup/restore, on `custom-html-tiny`) each make
   the suite assert RED **at the intended tier**, with the prior tiers asserted PASS. Adversary confirms
   each has teeth: if the server wrongly green-lit that tier, the corresponding test would FAIL. They run
   fast.
5. A short `tests/regression/README.md`: how to run it, what each canary guards, how to add a canary.
6. NOT merged: recipe/test PR opened for operator review (loops never merge).

## Risks / notes

- **Slow + resource-heavy:** full lifecycle on lasuite-docs is minutes and needs the live
  server/abra/Swarm. Keep it `-m canary`, not in the fast path. Watch disk (lasuite-docs upgrade was
  disk-sensitive).
- **Flakiness:** lasuite-docs had transient WOPI-404 / Collabora convergence races (see
  `plan-lasuite-drive-*`); use the harness's existing readiness probes; if flaky, add bounded retries on
  *readiness only* (never on the correctness assertion).
- **Don't weaken to pass:** the whole point is teeth. A flaky correctness assertion is a real signal, not
  something to relax.

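The "bounded retries on readiness only" rule from the Flakiness note can be illustrated with a small helper. It is a sketch, assuming the test has some zero-argument readiness probe to call; `wait_until_ready` and its parameters are hypothetical, and the harness's existing probes should be preferred where they exist.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a readiness probe a bounded number of times.

    Returns True as soon as the probe succeeds, False once the attempts
    are exhausted. It never retries forever and never raises on its own.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # no sleep after the final failed attempt
    return False
```

In a test, only the readiness step goes through this helper (`assert wait_until_ready(app_is_up)`); the correctness assertion that follows runs exactly once, so a flaky correctness result still surfaces as a failure.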
## Out of scope (deferred to a follow-up phase)

- Fast logic-unit layer (bridge `!testme`/`!testmexyz`/non-collaborator rules, verdict computation,
  redaction function, summary formatting) as a second-level pytest run on every commit.
- Golden-output snapshots, concurrency-collision canary, perf-budget assertions.