Plan — server regression canaries (codified E2E self-tests)
Status: PROPOSED — queued as the loops' next phase after mirror (mirror-enroll). Single loop pair,
one phase at a time.
Owner: Builder + Adversary loops. Created: 2026-06-02. Author: Claude Sonnet 4.6 orchestrator.
Goal
Give the cc-ci server a standing, codified regression suite in Python (pytest) so we can keep modifying the server without silently breaking it. Not a prompt an agent re-runs — a deterministic test artifact in the repo, runnable in CI with no LLM in the loop.
The suite proves the server can do both halves of its job — and the second is the one that bites:
- Confirm a healthy app is healthy, end-to-end, with semantic assertions at every tier (NOT just an exit code / "pass").
- Catch a broken app — a deliberately-broken canary that the server MUST report RED. A server that goes false-green (reports PASS while the app is broken) is the scariest regression; we already saw a fabricated FULL-PASS during the build. This guard makes false-green a test failure.
Scope this round (operator-chosen): E2E canaries (good + bad) only. NOT the fast logic-unit layer (bridge trigger rules / verdict / redaction unit tests) — that's a good follow-up phase, deferred.
Canaries
| Canary | Recipe | Why | Expected verdict |
|---|---|---|---|
| Simple (good) | `custom-html-tiny` | Minimal, fast, few deps; quick signal | GREEN |
| Significant (good) | `lasuite-docs` | Multi-service: backend + Postgres + Collabora WOPI + Keycloak OIDC; exercises real breadth | GREEN |
| Known-BAD (false-green guard) | a seeded fixture (see below) | App comes up healthy but a semantic tier assertion is violated | RED |
Known-bad fixture: reuse/recreate the phase-5 seeded case — custom-html branch v5-stale-docroot
(serves .txt as application/octet-stream while the app is externally healthy), which already
produced a RED build (#75) with only the content-type custom assertion failing. The regression test
asserts the harness returns RED for this fixture. (If that branch is gone, recreate the pattern:
an app that is up + passes lifecycle tiers but fails one functional assertion.) Pin the fixture by
commit SHA so it's stable.
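A minimal sketch of the false-green guard for this fixture, assuming a harness run can be reduced to a verdict string plus the names of its failed assertions. `HarnessResult`, its fields, and the assertion name `content-type` are hypothetical and not the harness's real output schema.

```python
from dataclasses import dataclass, field


@dataclass
class HarnessResult:
    """Hypothetical reduction of one harness run; not the real output schema."""
    verdict: str  # "GREEN" or "RED"
    failed_assertions: list[str] = field(default_factory=list)


def check_known_bad(result: HarnessResult, expected_failure: str) -> None:
    """Fail unless the known-bad canary was reported RED for the seeded reason."""
    # A GREEN verdict here is the false-green regression the suite exists to catch.
    assert result.verdict == "RED", (
        f"false-green: known-bad canary reported {result.verdict}"
    )
    # RED for an unrelated reason (e.g. install never converged) would hide the
    # seeded defect, so the seeded assertion must be among the failures.
    assert expected_failure in result.failed_assertions, (
        f"RED, but {expected_failure!r} not in {result.failed_assertions}"
    )


# Passes silently: RED, and for the seeded reason.
check_known_bad(HarnessResult("RED", ["content-type"]), "content-type")
```

Checking the reason as well as the verdict keeps the guard from passing when the fixture is RED only because the app failed to come up.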
What "works as expected" means per tier (real assertions, not exit codes)
For each good canary the test drives the real cold full suite and asserts the observable outcome:
- install → app returns HTTP 200, expected page content/marker present, all services converged.
- upgrade → a record/marker seeded before the upgrade is still present after; the deployed version actually bumped; app still 200. (For lasuite-docs: a created doc survives; OIDC login still works; Collabora WOPI still 200.)
- backup → restore → seed data, wipe the app, restore, assert the seeded data is back.
- secrets persistence → the same generated app secret is identical across install→upgrade→restore.
- redaction → grep the published logs AND the dashboard for the generated secret value(s) → assert absent.
- teardown → after the run, assert no leftover containers/volumes/secrets for the canary.
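Three of the tiers above (secrets persistence, redaction, teardown) reduce to small pure checks once the test has collected the raw data. The sketch below assumes the test has already read the secret after each tier, fetched the published logs and dashboard as text, and listed resource names before and after the run; all function names and data shapes are hypothetical.

```python
def find_secret_leaks(secrets: list[str], published: dict[str, str]) -> list[tuple[str, int]]:
    """Return (source name, secret index) for every secret found in published text.

    The index is reported instead of the value so that a failure message
    does not itself leak the secret.
    """
    return [
        (source, index)
        for source, text in published.items()
        for index, secret in enumerate(secrets)
        if secret and secret in text
    ]


def secret_is_stable(snapshots: dict[str, str]) -> bool:
    """True when the secret read after each tier is one identical, non-empty value."""
    values = set(snapshots.values())
    return len(values) == 1 and "" not in values


def leftovers(before: set[str], after: set[str]) -> set[str]:
    """Resource names present after teardown that were absent before the run."""
    return after - before


published = {"logs": "deploy ok\ntoken=****", "dashboard": '{"verdict": "GREEN"}'}
assert find_secret_leaks(["s3cr3t-value"], published) == []
assert secret_is_stable(
    {"install": "s3cr3t-value", "upgrade": "s3cr3t-value", "restore": "s3cr3t-value"}
)
assert leftovers({"traefik"}, {"traefik"}) == set()
```

Keeping these checks free of I/O means they can be exercised with tampered inputs without touching the live server.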
Design
- Location: `tests/regression/` in the cc-ci repo. `pytest`, parametrized over the canaries.
- Driver: wrap the existing real cold path (`.../ci-test-review/verify-pr.sh` / the runner `runner/run_recipe_ci.py`). Do NOT reimplement the harness; call it and assert on its real outputs (build verdict, summary, dashboard JSON, published logs). The semantic per-tier assertions live in the test, layered on top of the harness's own pass/fail so a tier that the harness calls "pass" but that didn't actually preserve data still fails the regression test.
- Markers: `@pytest.mark.canary` (slow E2E) so they can be selected/excluded; the suite is run on-demand / pre-merge / nightly, not on every fast commit.
- Determinism: pin canary recipe versions/SHAs and the bad-fixture SHA. Record run artifacts (build URL, logs) on failure for triage.
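The layout above could look roughly like the following test module. This is a sketch under stated assumptions: the runner's command-line arguments and the "verdict is the last line of stdout" convention are invented for illustration, the SHA strings are placeholders to be pinned, and `Canary`, `CANARIES`, and `run_harness` are hypothetical names. Only the recipe names, the runner path, and the marker come from the plan.

```python
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path

import pytest

RUNNER = Path("runner/run_recipe_ci.py")  # path from the plan, relative to repo root


@dataclass(frozen=True)
class Canary:
    recipe: str
    ref: str       # pinned commit SHA (placeholder values below)
    expected: str  # "GREEN" or "RED"


CANARIES = [
    Canary("custom-html-tiny", "<pinned-sha>", "GREEN"),
    Canary("lasuite-docs", "<pinned-sha>", "GREEN"),
    Canary("custom-html", "<pinned-sha-of-v5-stale-docroot>", "RED"),
]


def run_harness(canary: Canary) -> str:
    """Run the real cold path once and return its verdict.

    ASSUMPTION: the argument list and the last-line-of-stdout verdict are
    invented for this sketch; the real runner interface may differ.
    """
    proc = subprocess.run(
        [sys.executable, str(RUNNER), canary.recipe, canary.ref],
        capture_output=True, text=True, check=False,
    )
    lines = proc.stdout.strip().splitlines()
    return lines[-1] if lines else "UNKNOWN"


@pytest.mark.canary
@pytest.mark.parametrize("canary", CANARIES, ids=lambda c: c.recipe)
def test_canary_verdict(canary: Canary) -> None:
    if not RUNNER.exists():
        pytest.skip("runner not found; run from the cc-ci repo root")
    # The harness verdict is only the first layer; the per-tier semantic
    # assertions would follow here, on the harness's real outputs.
    assert run_harness(canary) == canary.expected
```

Putting the expected verdict in the parameter table lets the good and bad canaries share one test body, so the RED case cannot drift out of the suite unnoticed.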
How it runs — CADENCE POLICY (important)
Do NOT run these canaries on every commit/PR. They are slow + resource-heavy (full lifecycle on lasuite-docs is minutes, needs the live server/abra/Swarm). Run them deliberately at milestones:
- polishing passes, code reviews, and releases of the cc-ci server — i.e. before we trust a batch of server changes, not on each incremental commit.
- On-demand: `pytest -m canary` against the live cc-ci server (from the orchestrator or the host).
- They are explicitly opt-in: the `@pytest.mark.canary` marker lets any fast/default run deselect them (for example with `-m "not canary"`). If wired to `!testme` on the cc-ci repo, gate it behind a deliberate trigger (e.g. a `run-canaries` label or a `!testme --canary`), not an automatic run on every cc-ci PR.
- Document this cadence in `tests/regression/README.md` so future contributors don't wire it into the per-commit path.
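One way to make the opt-in rule hold even when someone runs a bare `pytest` is a small `conftest.py` hook. The sketch below is an assumption about how the suite might enforce it, not an existing file in the repo; `mentions_canary` is a hypothetical helper, while `pytest_configure`, `pytest_collection_modifyitems`, and the `markexpr` option are standard pytest.

```python
import pytest


def mentions_canary(markexpr: str) -> bool:
    """True when the -m expression refers to the canary marker at all."""
    return "canary" in markexpr


def pytest_configure(config):
    # Register the marker so --strict-markers does not reject it.
    config.addinivalue_line(
        "markers", "canary: slow E2E canary; needs the live cc-ci server"
    )


def pytest_collection_modifyitems(config, items):
    # With `-m canary` or `-m "not canary"`, pytest's own selection applies.
    if mentions_canary(config.getoption("markexpr", default="")):
        return
    # Otherwise (a bare `pytest`), skip every canary instead of running it.
    skip = pytest.mark.skip(reason="canaries are opt-in: run `pytest -m canary`")
    for item in items:
        if item.get_closest_marker("canary") is not None:
            item.add_marker(skip)
```

With this in place, a default run reports the canaries as skipped with the reason shown, which also documents the cadence to anyone who trips over it.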
Definition of Done (Adversary-verified)
- `tests/regression/` pytest suite exists and is committed (cc-ci repo PR).
- Runs GREEN on both good canaries (`custom-html-tiny`, `lasuite-docs`) with the per-tier semantic assertions actually executing (Adversary confirms the assertions FAIL if you tamper with an outcome, i.e. the assertions have teeth and are not vacuous).
- The known-bad canary makes the suite assert RED, and the Adversary confirms that if the server wrongly returned green for it, the regression test would FAIL (false-green is caught).
- A short `tests/regression/README.md`: how to run it, what each canary guards, how to add a canary.
- NOT merged: recipe/test PR opened for operator review (loops never merge).
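The "teeth" requirement can itself be checked mechanically: feed a check the genuine outcome and a tampered copy, and require that it accepts the first and rejects the second. The sketch below is illustrative only; the outcome dictionary, its keys, and both check functions are hypothetical simplifications of what the real tests would assert.

```python
from typing import Callable


def has_teeth(check: Callable[[dict], None], real: dict, tampered: dict) -> bool:
    """True when `check` accepts the real outcome and rejects the tampered one."""
    check(real)  # must pass on the genuine outcome; raises if it does not
    try:
        check(tampered)
    except AssertionError:
        return True
    return False


def check_upgrade(outcome: dict) -> None:
    """Hypothetical upgrade-tier check over a simplified outcome dict."""
    assert outcome["http_status"] == 200
    assert outcome["marker_after"] == outcome["marker_before"]
    assert outcome["version_after"] != outcome["version_before"]


def check_vacuous(outcome: dict) -> None:
    """Only trusts the harness's own word, so it cannot catch data loss."""
    assert outcome["harness_says"] == "pass"


real = {"http_status": 200, "marker_before": "m1", "marker_after": "m1",
        "version_before": "1.0", "version_after": "1.1", "harness_says": "pass"}
tampered = {**real, "marker_after": ""}  # data lost, harness still says pass

print(has_teeth(check_upgrade, real, tampered))  # True
print(has_teeth(check_vacuous, real, tampered))  # False
```

The second result is the failure mode the Adversary is asked to rule out: a check that stays green while the seeded marker is gone.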
Risks / notes
- Slow + resource-heavy: full lifecycle on lasuite-docs is minutes and needs the live server/abra/Swarm. Keep it `-m canary`, not in the fast path. Watch disk (lasuite-docs upgrade was disk-sensitive).
- Flakiness: lasuite-docs had transient WOPI-404 / Collabora convergence races (see `plan-lasuite-drive-*`); use the harness's existing readiness probes; if flaky, add bounded retries on readiness only (never on the correctness assertion).
- Don't weaken to pass: the whole point is teeth. A flaky correctness assertion is a real signal, not something to relax.
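The "retries on readiness only" rule can be sketched as a bounded polling helper that wraps the readiness probe and nothing else. The function names, the default attempt count, and the delay are assumptions for illustration; the probe itself would be whichever readiness probe the harness already provides.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 10,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` at most `attempts` times; return whether it ever succeeded."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:  # no pointless sleep after the last try
            sleep(delay_s)
    return False


def check_wopi(probe_ready: Callable[[], bool], fetch_status: Callable[[], int]) -> None:
    # Retries apply to readiness only, and they are bounded.
    assert wait_until_ready(probe_ready), "Collabora never became ready"
    # The correctness assertion runs exactly once: a 404 here is a real signal.
    assert fetch_status() == 200
```

Injecting `sleep` keeps the helper testable without real waiting, and keeping the status check outside the loop means a convergence race cannot be papered over by retrying the assertion itself.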
Out of scope (deferred to a follow-up phase)
- Fast logic-unit layer (bridge `!testme`/`!testmexyz`/non-collaborator rules, verdict computation, redaction function, summary formatting) as a second-level pytest layer run on every commit.
- Golden-output snapshots, concurrency-collision canary, perf-budget assertions.