plan: server regression canaries (codified E2E good+bad self-tests)
E2E pytest canaries proving the server confirms a healthy app is healthy (semantic per-tier assertions, not just exit codes) AND catches a broken one (false-green guard). Good canaries: custom-html-tiny + lasuite-docs; known-bad fixture must report RED. Queued as the loops' next phase after mirror-enroll.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cc-ci-plan/plan-server-regression-canaries.md
# Plan: server regression canaries (codified E2E self-tests)

**Status:** PROPOSED. Queued as the loops' next phase after `mirror` (mirror-enroll). Single loop pair,
one phase at a time.
**Owner:** Builder + Adversary loops. **Created:** 2026-06-02. **Author:** Claude Sonnet 4.6 orchestrator.

## Goal

Give the cc-ci server a **standing, codified regression suite in Python (pytest)** so we can keep
modifying the server without silently breaking it. This is not a prompt an agent re-runs but a
deterministic test artifact in the repo, runnable in CI with no LLM in the loop.

The suite proves the server can do **both** halves of its job, and the second is the one that bites:
1. **Confirm a healthy app is healthy**, end-to-end, with *semantic* assertions at every tier (NOT
   just an exit code / "pass").
2. **Catch a broken app:** a deliberately broken canary that the server MUST report **RED**. A server
   that goes *false-green* (reports PASS while the app is broken) is the scariest regression; we already
   saw a *fabricated* FULL-PASS during the build. This guard makes false-green a test failure.

**Scope this round (operator-chosen):** E2E canaries (good + bad) only. NOT the fast logic-unit layer
(bridge trigger rules / verdict / redaction unit tests); that is a good follow-up phase, deferred.

## Canaries

| Canary | Recipe | Why | Expected verdict |
|---|---|---|---|
| **Simple (good)** | `custom-html-tiny` | Minimal, fast, few deps: quick signal | GREEN |
| **Significant (good)** | `lasuite-docs` | Multi-service: backend + Postgres + Collabora WOPI + Keycloak OIDC; exercises real breadth | GREEN |
| **Known-BAD (false-green guard)** | a seeded fixture (see below) | App comes up healthy but a semantic tier assertion is violated | **RED** |

**Known-bad fixture:** reuse or recreate the phase-5 seeded case: `custom-html` branch `v5-stale-docroot`
(serves `.txt` as `application/octet-stream` while the app is externally healthy), which already
produced a RED build (#75) with only the content-type custom assertion failing. The regression test
asserts the harness returns **RED** for this fixture. (If that branch is gone, recreate the pattern:
an app that is up and passes the lifecycle tiers but fails one functional assertion.) Pin the fixture by
commit SHA so it's stable.
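The canary table and the pinned known-bad fixture could be encoded as plain data that a parametrized test iterates over. The sketch below is one possible shape, not the plan's actual implementation: the `Canary` dataclass, `verdict_problem`, and the `PINNED_SHA_TBD` placeholders are hypothetical, and the real SHAs still have to be filled in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Canary:
    name: str
    recipe: str
    ref: str       # pinned commit SHA; placeholder values below (assumption)
    expected: str  # "GREEN" or "RED"


# Mirrors the canary table above; refs are placeholders, not real SHAs.
CANARIES = [
    Canary("simple-good", "custom-html-tiny", "PINNED_SHA_TBD", "GREEN"),
    Canary("significant-good", "lasuite-docs", "PINNED_SHA_TBD", "GREEN"),
    # Known-bad fixture: custom-html branch v5-stale-docroot, pinned by SHA.
    Canary("known-bad", "custom-html", "PINNED_SHA_TBD", "RED"),
]


def verdict_problem(canary: Canary, actual: str) -> Optional[str]:
    """Return None if the verdict matches, else a message naming the failure."""
    if actual == canary.expected:
        return None
    # A GREEN verdict for a canary that must be RED is the false-green case.
    kind = "false-green" if canary.expected == "RED" else "regression"
    return f"{canary.name}: expected {canary.expected}, got {actual} ({kind})"
```

In a test module, `CANARIES` would typically feed `@pytest.mark.parametrize("canary", CANARIES, ids=lambda c: c.name)`, with the test failing whenever `verdict_problem` returns a message.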

## What "works as expected" means per tier (real assertions, not exit codes)

For each good canary the test drives the real cold full suite and asserts the *observable outcome*:
- **install** → app returns HTTP 200, expected page content/marker present, all services converged.
- **upgrade** → a record/marker seeded **before** the upgrade is still present **after**; the deployed
  version actually bumped; app still returns 200. (For lasuite-docs: a created doc survives; OIDC login
  still works; Collabora WOPI still returns 200.)
- **backup → restore** → seed data, wipe the app, restore, assert the seeded data is back.
- **secrets persistence** → the generated app secret is identical across install→upgrade→restore.
- **redaction** → grep the published logs AND the dashboard for the generated secret value(s) → assert
  **absent**.
- **teardown** → after the run, assert no leftover containers/volumes/secrets for the canary.
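Two of the tiers above, upgrade and redaction, can be expressed as small pure functions over outcomes the test has already collected. This is a minimal sketch under assumptions: `leaked_secrets`, `upgrade_problems`, and their parameters are hypothetical names, and how the pages, versions, and logs are actually fetched from the server is left out.

```python
from typing import Iterable, List


def leaked_secrets(texts: Iterable[str], secrets: Iterable[str]) -> List[str]:
    """Return every secret value that appears verbatim in any published text."""
    corpus = list(texts)
    return [s for s in secrets if s and any(s in t for t in corpus)]


def upgrade_problems(marker: str, page_before: str, page_after: str,
                     version_before: str, version_after: str,
                     status_after: int) -> List[str]:
    """Collect semantic failures for the upgrade tier; empty list means OK."""
    problems = []
    if marker not in page_before:
        # Without a seeded marker the survival check would be vacuous.
        problems.append("marker was never seeded before the upgrade")
    if marker not in page_after:
        problems.append("seeded marker lost across upgrade")
    if version_after == version_before:
        problems.append("deployed version did not change")
    if status_after != 200:
        problems.append(f"app returned HTTP {status_after} after upgrade")
    return problems
```

A test would then assert `leaked_secrets(logs_and_dashboard, generated_secrets) == []` and `upgrade_problems(...) == []`, so a failure message lists exactly which outcome was violated rather than a bare exit code.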

## Design

- **Location:** `tests/regression/` in the cc-ci repo. `pytest`, parametrized over the canaries.
- **Driver:** wrap the existing real cold path (`.../ci-test-review/verify-pr.sh` / the runner
  `runner/run_recipe_ci.py`). Do NOT reimplement the harness; call it and assert on its real outputs
  (build verdict, summary, dashboard JSON, published logs). The semantic per-tier assertions live in
  the test, layered on top of the harness's own pass/fail, so a tier that the harness calls "pass" but
  that didn't actually preserve data still fails the regression test.
- **Markers:** `@pytest.mark.canary` (slow E2E) so they can be selected/excluded; the suite is run
  on-demand / pre-merge / nightly, not on every fast commit.
- **Determinism:** pin canary recipe versions/SHAs and the bad-fixture SHA. Record run artifacts
  (build URL, logs) on failure for triage.
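The driver idea above (call the existing harness, then assert on its outputs) might look roughly like the wrapper below. It is a sketch only: `HarnessResult` and `run_harness` are hypothetical, and the assumption that the harness prints a single JSON object with a `verdict` key on stdout is not confirmed by this plan. The real command line and output format of `verify-pr.sh` / `run_recipe_ci.py` would need to be substituted.

```python
import json
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class HarnessResult:
    returncode: int
    verdict: str
    raw: dict


def run_harness(cmd: List[str], timeout_s: float = 3600.0) -> HarnessResult:
    """Run the harness command and parse its report.

    ASSUMPTION: the command prints one JSON object with a "verdict" key
    on stdout. Non-JSON output raises, which is treated as a test error.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    report = json.loads(proc.stdout)
    return HarnessResult(proc.returncode, str(report["verdict"]).upper(), report)


# Stand-in command used only to exercise the wrapper without the real server.
FAKE_HARNESS = [
    sys.executable, "-c",
    "import json; print(json.dumps({'verdict': 'red', 'build': 75}))",
]
```

Note that the stand-in exits with code 0 while reporting RED, which is exactly why the tests read the verdict and the per-tier outcomes instead of trusting the exit code.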

## How it runs (dogfood)

- On-demand: `pytest -m canary` against the live cc-ci server (from the orchestrator or the host).
- Dogfood: cc-ci is itself enrolled in `!testme`, so a server-change PR on the cc-ci repo can trigger
  the canary run automatically: a server regression shows up as a RED check on the very PR that caused it.
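For `pytest -m canary` to select the suite without unknown-marker warnings, the marker is usually registered once. A minimal sketch of a `conftest.py` hook follows, assuming it lives under `tests/regression/`; the marker description text is illustrative.

```python
def pytest_configure(config):
    """Register the `canary` marker so `-m canary` / `-m "not canary"` work cleanly."""
    config.addinivalue_line(
        "markers",
        "canary: slow E2E server regression canary (select with -m canary)",
    )
```

Fast commits can then run `pytest -m "not canary"`, while on-demand, pre-merge, and nightly runs use `pytest -m canary`.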

## Definition of Done (Adversary-verified)

1. `tests/regression/` pytest suite exists and is committed (cc-ci repo PR).
2. The suite runs GREEN on both good canaries (`custom-html-tiny`, `lasuite-docs`) with the per-tier
   semantic assertions actually executing (the Adversary confirms the assertions FAIL if an outcome is
   tampered with, i.e. the assertions have teeth and are not vacuous).
3. The suite asserts that the server reports **RED** for the known-bad canary, and the Adversary
   confirms that if the server *wrongly* returned green for it, the regression test would FAIL
   (false-green is caught).
4. A short `tests/regression/README.md`: how to run it, what each canary guards, how to add a canary.
5. NOT merged: recipe/test PR opened for operator review (loops never merge).
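The "teeth" requirement in items 2 and 3 can itself be checked mechanically: an assertion has teeth if it passes on the genuine outcome and fails on every tampered one. The helper below is a hypothetical illustration of that idea, not the Adversary's actual procedure; `passes`, `has_teeth`, `check_restore`, and the outcome dictionary keys are assumed names.

```python
from typing import Callable, Iterable, Mapping


def passes(check: Callable[[Mapping], None], outcome: Mapping) -> bool:
    """Run a check that signals failure by raising AssertionError."""
    try:
        check(outcome)
    except AssertionError:
        return False
    return True


def has_teeth(check: Callable[[Mapping], None], good: Mapping,
              tampered: Iterable[Mapping]) -> bool:
    """True if the check accepts the good outcome and rejects every tampered one."""
    return passes(check, good) and not any(passes(check, t) for t in tampered)


def check_restore(outcome: Mapping) -> None:
    # Explicit raise (not `assert`) so the check survives `python -O`.
    if outcome["seeded"] not in outcome["restored_rows"]:
        raise AssertionError("seeded data missing after restore")
```

For example, `has_teeth(check_restore, {"seeded": "row-1", "restored_rows": ["row-1"]}, [{"seeded": "row-1", "restored_rows": []}])` holds, while a vacuous check such as `lambda o: None` is reported as toothless.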

## Risks / notes

- **Slow + resource-heavy:** a full lifecycle on lasuite-docs takes minutes and needs the live
  server/abra/Swarm. Keep it `-m canary`, not in the fast path. Watch disk (lasuite-docs upgrade was
  disk-sensitive).
- **Flakiness:** lasuite-docs had transient WOPI-404 / Collabora convergence races (see
  `plan-lasuite-drive-*`); use the harness's existing readiness probes; if flaky, add bounded retries on
  *readiness only* (never on the correctness assertion).
- **Don't weaken to pass:** the whole point is teeth. A flaky correctness assertion is a real signal, not
  something to relax.
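The "bounded retries on readiness only" rule can be kept honest by isolating the retry loop in a helper that never touches a correctness assertion. A minimal sketch, assuming a caller-supplied `probe` callable; `wait_until_ready` and its defaults are hypothetical, and the harness's existing readiness probes should be preferred where they suffice.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], attempts: int = 10,
                     delay_s: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll a readiness probe a bounded number of times.

    Returns True as soon as the probe succeeds, False once attempts run out.
    Correctness assertions run once, after this returns, and are never retried.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # no sleep after the final failed attempt
    return False
```

A test would call `assert wait_until_ready(probe)` first and then run each correctness assertion exactly once, so a flaky correctness result still surfaces as a failure.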

## Out of scope (deferred to a follow-up phase)

- Fast logic-unit layer (bridge `!testme`/`!testmexyz`/non-collaborator rules, verdict computation,
  redaction function, summary formatting) as a second-level pytest run on every commit.
- Golden-output snapshots, concurrency-collision canary, perf-budget assertions.