cc-ci-orchestrator/cc-ci-plan/plan-server-regression-canaries.md
autonomic-bot 7bdeb74449 plan(regression): add per-tier RED canaries (install/upgrade/backup/restore)
One deliberately-broken custom-html-tiny fixture per lifecycle tier so the
suite proves the server reports RED at EVERY tier (not just one) — each
asserts RED at the intended tier with prior tiers PASS, so it's 'catches a
failure at this tier', not 'fails somewhere'. Fast (simplest recipe); the
fast subset of the suite vs the slow good canaries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 01:28:23 +00:00


# Plan — server regression canaries (codified E2E self-tests)
**Status:** PROPOSED — queued as the loops' next phase after `mirror` (mirror-enroll). Single loop pair,
one phase at a time.
**Owner:** Builder + Adversary loops. **Created:** 2026-06-02. **Author:** Claude Sonnet 4.6 orchestrator.
## Goal
Give the cc-ci server a **standing, codified regression suite in Python (pytest)** so we can keep
modifying the server without silently breaking it. Not a prompt an agent re-runs — a deterministic
test artifact in the repo, runnable in CI with no LLM in the loop.
The suite proves the server can do **both** halves of its job — and the second is the one that bites:
1. **Confirm a healthy app is healthy**, end-to-end, with *semantic* assertions at every tier (NOT
just an exit code / "pass").
2. **Catch a broken app** — a deliberately-broken canary that the server MUST report **RED**. A server
that goes *false-green* (reports PASS while the app is broken) is the scariest regression; we already
saw a *fabricated* FULL-PASS during the build. This guard makes false-green a test failure.
**Scope this round (operator-chosen):** E2E canaries (good + bad) only. NOT the fast logic-unit layer
(bridge trigger rules / verdict / redaction unit tests) — that's a good follow-up phase, deferred.
## Canaries
| Canary | Recipe | Why | Expected verdict |
|---|---|---|---|
| **Simple (good)** | `custom-html-tiny` | Minimal, fast, few deps — quick signal | GREEN |
| **Significant (good)** | `lasuite-docs` | Multi-service: backend + Postgres + Collabora WOPI + Keycloak OIDC, exercising real breadth | GREEN |
| **Known-BAD: custom-assertion** | a seeded fixture (see below) | App comes up healthy but a functional/custom assertion is violated | **RED** |
| **Known-BAD: per-tier ×4** | `custom-html-tiny` broken at one tier each (see below) | install / upgrade / backup / restore each fail in turn | **RED** at the intended tier |

**Known-bad fixture (custom-assertion):** reuse/recreate the phase-5 seeded case, `custom-html` branch
`v5-stale-docroot` (serves `.txt` as `application/octet-stream` while the app is externally healthy),
which already produced a RED build (#75) with only the content-type custom assertion failing. The
regression test asserts the harness returns **RED** for this fixture. (If that branch is gone, recreate
the pattern: an app that is up + passes lifecycle tiers but fails one functional assertion.) Pin by SHA.
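
The check for this fixture could look roughly like the sketch below. It assumes the harness result can be reduced to a dict with a `verdict` string, a `tiers` mapping, and a `custom_assertions` mapping; that shape, the `PASS`/`FAIL` strings, and the assertion names are hypothetical, not the harness's real output format.

```python
LIFECYCLE_TIERS = ("install", "upgrade", "backup", "restore")


def assert_custom_assertion_red(result, expected_failures):
    """Known-bad run: RED overall, lifecycle clean, only the seeded assertions failing."""
    # False-green guard: anything other than RED is a server regression.
    assert result["verdict"] == "RED", f"false-green: verdict was {result['verdict']!r}"

    # The app is externally healthy, so every lifecycle tier should have passed.
    for tier in LIFECYCLE_TIERS:
        assert result["tiers"].get(tier) == "PASS", f"{tier} tier did not pass"

    # Exactly the seeded assertion(s) should fail, no more and no fewer.
    failed = {name for name, status in result["custom_assertions"].items() if status != "PASS"}
    assert failed == set(expected_failures), f"unexpected failing assertions: {sorted(failed)}"


# Hypothetical reduced result for the stale-docroot fixture.
stale_docroot = {
    "verdict": "RED",
    "tiers": {tier: "PASS" for tier in LIFECYCLE_TIERS},
    "custom_assertions": {"content-type": "FAIL", "http-200": "PASS"},
}
assert_custom_assertion_red(stale_docroot, {"content-type"})
```

Comparing the full set of failing assertions, rather than only checking that one failed, keeps the test from passing when the fixture breaks for an unrelated reason.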
### Per-tier RED canaries — prove the server catches failure at EVERY tier (fast)
The single fixture above only proves the server catches a *custom-assertion* failure. Add **one RED
canary per lifecycle tier** so we prove the server reports RED at each of install / upgrade / backup /
restore — false-green is the scariest regression, and it can hide at any tier (e.g. restore silently
restoring nothing, the ghost/mattermost class of bug). Use the **simplest recipe — `custom-html-tiny`**
(static content, deploys in seconds) so all four run **fast**; each is a fixture broken at exactly one
tier, pinned by commit SHA.

| RED canary | How it's broken (custom-html-tiny fixture) | Expected harness result |
|---|---|---|
| **install** | image tag that never becomes healthy / a healthcheck that can't pass | **install tier RED** |
| **upgrade** | installs clean; the upgrade target breaks the container so post-upgrade health fails | install PASS, **upgrade tier RED** |
| **backup** | install+upgrade clean; backup misconfigured (backupbot label/target wrong → backup errors or yields no artifact) | **backup tier RED** |
| **restore** | backup succeeds; restore is a no-op (hook does nothing) so the pre-seeded marker is ABSENT after restore | **restore tier RED** (the scariest false-green) |

Each pytest asserts **precisely**: overall verdict RED, the failing tier is the *intended* one, AND the
tiers *before* it PASSED (e.g. upgrade-RED requires install to have passed) — so it proves "catches a
failure **at this tier**", not merely "fails somewhere". These four form the **fast subset** of the
suite; consider a sub-marker (`@pytest.mark.canary_fast`) so they can optionally run as a quicker
pre-check while the slow good canaries (esp. lasuite-docs) stay on the milestone cadence below.
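
The three-part assertion described above can be sketched as a small helper. This is a minimal illustration, assuming a reduced result dict with a `verdict` and a per-tier `tiers` mapping of `PASS`/`FAIL` strings; those names are hypothetical and would need mapping onto whatever the harness really emits.

```python
TIER_ORDER = ("install", "upgrade", "backup", "restore")


def assert_red_at_tier(result, intended):
    """Assert RED overall, FAIL at `intended`, and PASS for every tier before it."""
    assert result["verdict"] == "RED", f"false-green: verdict was {result['verdict']!r}"

    tiers = result["tiers"]
    idx = TIER_ORDER.index(intended)

    # Prior tiers must have passed, otherwise the run merely "fails somewhere".
    for prior in TIER_ORDER[:idx]:
        assert tiers.get(prior) == "PASS", f"{prior} should PASS before {intended} fails"

    # The intended tier must be the one reported as failing.
    assert tiers.get(intended) == "FAIL", f"{intended} tier should be the one that fails"
    # Tiers after the failing one are deliberately not checked: whether the
    # harness skips or still runs them is not assumed here.


# Hypothetical reduced result for the restore canary.
restore_red = {
    "verdict": "RED",
    "tiers": {"install": "PASS", "upgrade": "PASS", "backup": "PASS", "restore": "FAIL"},
}
assert_red_at_tier(restore_red, "restore")
```

In the suite, this would plausibly be the body of one test parametrized over the four tiers with `pytest.mark.parametrize` and tagged with the `canary_fast` sub-marker.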
## What "works as expected" means per tier (real assertions, not exit codes)
For each good canary the test drives the real cold full suite and asserts the *observable outcome*:
- **install** → app returns HTTP 200, expected page content/marker present, all services converged.
- **upgrade** → a record/marker seeded **before** the upgrade is still present **after**; the deployed
version actually bumped; app still 200. (For lasuite-docs: a created doc survives; OIDC login still
works; Collabora WOPI still 200.)
- **backup → restore** → seed data, take a backup, wipe the app, restore, then assert the seeded data is back.
- **secrets persistence** → the generated app secret is identical across install→upgrade→restore.
- **redaction** → grep the published logs AND the dashboard for the generated secret value(s) → assert
**absent**.
- **teardown** → after the run, assert no leftover containers/volumes/secrets for the canary.
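
Three of these checks (marker survival, secrets persistence, redaction) reduce to small pure functions once the test has collected the observed values. The sketch below assumes the test has already read the marker, the secret at each stage, and the text of each published artifact; how those are fetched is left out, and all names are illustrative.

```python
def assert_marker_survives(read_marker, expected):
    """Call after upgrade or restore: the value seeded beforehand must read back unchanged."""
    actual = read_marker()
    assert actual == expected, f"seeded marker lost: expected {expected!r}, got {actual!r}"


def assert_secret_stable(observed):
    """`observed` maps stage name -> secret value read from the app at that stage."""
    distinct = set(observed.values())
    # Report stage names only; never echo the secret values into test output.
    assert len(distinct) == 1, f"secret changed across stages {sorted(observed)}"


def assert_redacted(artifacts, secrets):
    """`artifacts` maps artifact name -> text; no secret value may appear in any of them."""
    leaks = [
        (name, index)
        for name, text in artifacts.items()
        for index, secret in enumerate(secrets)
        if secret and secret in text  # skip empty strings, which match everything
    ]
    # Identify the leak by artifact name and secret index, not by value.
    assert not leaks, f"secret leaked in (artifact, secret index): {leaks}"
```

Keeping secret values out of the failure messages matters here, since a failing redaction test would otherwise leak the secret into the CI log itself.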
## Design
- **Location:** `tests/regression/` in the cc-ci repo. `pytest`, parametrized over the canaries.
- **Driver:** wrap the existing real cold path (`.../ci-test-review/verify-pr.sh` / the runner
`runner/run_recipe_ci.py`) — do NOT reimplement the harness; call it and assert on its real outputs
(build verdict, summary, dashboard JSON, published logs). The semantic per-tier assertions live in
the test, layered on top of the harness's own pass/fail so a tier that the harness calls "pass" but
that didn't actually preserve data still fails the regression test.
- **Markers:** `@pytest.mark.canary` (slow E2E) so they can be selected/excluded; the suite is run
on-demand / pre-merge / nightly, not on every fast commit.
- **Determinism:** pin canary recipe versions/SHAs and the bad-fixture SHA. Record run artifacts
(build URL, logs) on failure for triage.
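
A thin driver along these lines might look like the following sketch. It assumes the harness is invoked as a command supplied by the caller and prints one JSON object as its last non-empty stdout line; that output convention is an assumption made for illustration, as is the 40-hex (SHA-1) pin format.

```python
import json
import re
import subprocess
import sys

SHA_RE = re.compile(r"[0-9a-f]{40}")  # ASSUMPTION: SHA-1 object names


def require_pinned(ref):
    """Reject branch names and short SHAs so a canary cannot drift."""
    if not SHA_RE.fullmatch(ref):
        raise ValueError(f"canary ref must be a full 40-hex commit SHA, got {ref!r}")
    return ref


def run_harness(cmd, timeout):
    """Run the real harness command; return (exit_code, parsed_result, raw_output)."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    # ASSUMPTION: the last non-empty stdout line is a JSON result object.
    lines = [line for line in proc.stdout.splitlines() if line.strip()]
    result = json.loads(lines[-1]) if lines else None
    # Raw output is returned so the test can attach it to a failure for triage.
    return proc.returncode, result, proc.stdout + proc.stderr


if __name__ == "__main__":
    # Stand-in for the harness: a child Python process that prints a log line and a result.
    fake = [sys.executable, "-c", "print('log line'); print('{\"verdict\": \"RED\"}')"]
    code, result, _raw = run_harness(fake, timeout=30)
    print(code, result)
```

The wrapper returns the exit code alongside the parsed result without interpreting either, so the semantic assertions stay in the tests and are never short-circuited by the harness's own pass/fail.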
## How it runs — CADENCE POLICY (important)
**Do NOT run these canaries on every commit/PR.** They are slow and resource-heavy (a full lifecycle on
lasuite-docs takes minutes and needs the live server/abra/Swarm). Run them **deliberately at milestones**:
- **polishing passes**, **code reviews**, and **releases** of the cc-ci server — i.e. before we trust a
batch of server changes, not on each incremental commit.
- On-demand: `pytest -m canary` against the live cc-ci server (from the orchestrator or the host).
- They are explicitly **opt-in**: the `@pytest.mark.canary` marker lets any fast/default run exclude them
(e.g. `-m "not canary"`; a marker alone does not skip anything). If wired to `!testme` on the cc-ci repo,
gate it behind a deliberate trigger (e.g. a `run-canaries` label or a `!testme --canary`), **not** an
automatic run on every cc-ci PR.
- Document this cadence in `tests/regression/README.md` so future contributors don't wire it into the
per-commit path.
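
One way to make the opt-in behaviour hold even for a bare `pytest` invocation is a `conftest.py` hook that deselects canaries unless they were asked for. The sketch below is one possible arrangement, not the project's actual configuration; the `CC_CI_RUN_CANARIES` environment variable is a hypothetical name.

```python
import os


def pytest_configure(config):
    # Register the markers so `--strict-markers` runs do not reject them.
    config.addinivalue_line("markers", "canary: slow E2E canary against the live cc-ci server (opt-in)")
    config.addinivalue_line("markers", "canary_fast: fast per-tier RED canary subset (opt-in)")


def canaries_requested(markexpr, env):
    """True when the -m expression mentions a canary marker or the env flag is set."""
    # An expression like "not canary" also matches here; pytest's own -m
    # filtering then removes the canaries, so nothing extra is needed.
    return "canary" in (markexpr or "") or env.get("CC_CI_RUN_CANARIES") == "1"


def pytest_collection_modifyitems(config, items):
    if canaries_requested(config.getoption("markexpr", default=""), os.environ):
        return
    selected, deselected = [], []
    for item in items:
        is_canary = item.get_closest_marker("canary") or item.get_closest_marker("canary_fast")
        (deselected if is_canary else selected).append(item)
    if deselected:
        # Report the canaries as deselected rather than silently dropping them.
        config.hook.pytest_deselected(items=deselected)
        items[:] = selected
```

With this in place, `pytest` alone skips the canaries, while `pytest -m canary` or `pytest -m canary_fast` runs them.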
## Definition of Done (Adversary-verified)
1. `tests/regression/` pytest suite exists and is committed (cc-ci repo PR).
2. The suite runs GREEN on both good canaries (`custom-html-tiny`, `lasuite-docs`) with the per-tier semantic
assertions actually executing (Adversary confirms the assertions FAIL if you tamper with an outcome,
i.e. the assertions have teeth and are not vacuous).
3. The custom-assertion known-bad canary makes the suite assert **RED** — and the Adversary confirms
that if the server *wrongly* returned green for it, the regression test would FAIL (false-green caught).
4. **The four per-tier RED canaries** (install/upgrade/backup/restore, on `custom-html-tiny`) each make
the suite assert RED **at the intended tier**, with the prior tiers asserted PASS. Adversary confirms
each has teeth: if the server wrongly green-lit that tier, the corresponding test would FAIL. They run
fast.
5. A short `tests/regression/README.md`: how to run it, what each canary guards, how to add a canary.
6. NOT merged — recipe/test PR opened for operator review (loops never merge).
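
The "teeth" requirement in items 2 to 4 can itself be exercised mechanically: take a result the check accepts, tamper with one outcome at a time, and confirm the check rejects every tampered copy. The sketch below illustrates the idea on a hypothetical reduced result dict; it complements the Adversary's review of the real suite rather than replacing it.

```python
import copy


def uncaught_tamperings(check, accepted_result, tamperings):
    """Return the names of tamperings that `check` failed to reject (empty = all caught)."""
    check(accepted_result)  # sanity: the untampered result must be accepted
    missed = []
    for name, mutate in tamperings.items():
        tampered = copy.deepcopy(accepted_result)  # never mutate the original
        mutate(tampered)
        try:
            check(tampered)
        except AssertionError:
            continue  # rejected, as it should be
        missed.append(name)
    return missed


def check_restore_red(result):
    """Example check under test (hypothetical result shape)."""
    assert result["verdict"] == "RED"
    assert result["tiers"]["restore"] == "FAIL"


known_bad = {"verdict": "RED", "tiers": {"restore": "FAIL"}}
tamperings = {
    "verdict flipped to GREEN": lambda r: r.update(verdict="GREEN"),
    "restore tier green-lit": lambda r: r["tiers"].update(restore="PASS"),
}
assert uncaught_tamperings(check_restore_red, known_bad, tamperings) == []
```

A vacuous check, one that never raises, would return every tampering name from this helper, which is exactly the condition the Definition of Done is meant to rule out.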
## Risks / notes
- **Slow + resource-heavy:** a full lifecycle on lasuite-docs takes minutes and needs the live server/abra/
Swarm. Keep it `-m canary`, not in the fast path. Watch disk (lasuite-docs upgrade was disk-sensitive).
- **Flakiness:** lasuite-docs had transient WOPI-404 / Collabora convergence races (see
`plan-lasuite-drive-*`); use the harness's existing readiness probes; if flaky, add bounded retries on
*readiness only* (never on the correctness assertion).
- **Don't weaken to pass:** the whole point is teeth. A flaky correctness assertion is a real signal, not
something to relax.
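
The "retry readiness, never correctness" rule can be kept honest by structure: the retry loop wraps only the readiness probe, and the correctness check runs exactly once afterwards. A minimal sketch, with illustrative names and default values:

```python
import time


def wait_until_ready(probe, attempts=10, delay=3.0, sleep=time.sleep):
    """Poll `probe()` up to `attempts` times; return True as soon as it reports ready."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay)  # no sleep after the final failed attempt
    return False


def check_once(ready_probe, correctness_check, **retry):
    """Bounded retries on readiness only; the correctness check is never retried."""
    assert wait_until_ready(ready_probe, **retry), "service never became ready"
    correctness_check()  # a failure here is a real signal and propagates immediately
```

For a case like the Collabora convergence race, `ready_probe` would wrap the harness's existing readiness probe, and the WOPI assertion itself would be passed as `correctness_check`.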
## Out of scope (deferred to a follow-up phase)
- Fast logic-unit layer (bridge `!testme`/`!testmexyz`/non-collaborator rules, verdict computation,
redaction function, summary formatting) as a second-level pytest layer run on every commit.
- Golden-output snapshots, concurrency-collision canary, perf-budget assertions.