cc-ci-orchestrator/cc-ci-plan/plan-server-regression-canaries.md
autonomic-bot 7bdeb74449 plan(regression): add per-tier RED canaries (install/upgrade/backup/restore)
One deliberately-broken custom-html-tiny fixture per lifecycle tier so the
suite proves the server reports RED at EVERY tier (not just one) — each
asserts RED at the intended tier with prior tiers PASS, so it's 'catches a
failure at this tier', not 'fails somewhere'. Fast (simplest recipe); the
fast subset of the suite vs the slow good canaries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 01:28:23 +00:00

Plan — server regression canaries (codified E2E self-tests)

Status: PROPOSED. Queued as the loops' next phase after mirror (mirror-enroll). Single loop pair, one phase at a time.
Owner: Builder + Adversary loops.
Created: 2026-06-02.
Author: Claude Sonnet 4.6 orchestrator.

Goal

Give the cc-ci server a standing, codified regression suite in Python (pytest) so we can keep modifying the server without silently breaking it. Not a prompt an agent re-runs — a deterministic test artifact in the repo, runnable in CI with no LLM in the loop.

The suite proves the server can do both halves of its job — and the second is the one that bites:

  1. Confirm a healthy app is healthy, end-to-end, with semantic assertions at every tier (NOT just an exit code / "pass").
  2. Catch a broken app — a deliberately-broken canary that the server MUST report RED. A server that goes false-green (reports PASS while the app is broken) is the scariest regression; we already saw a fabricated FULL-PASS during the build. This guard makes false-green a test failure.

Scope this round (operator-chosen): E2E canaries (good + bad) only. NOT the fast logic-unit layer (bridge trigger rules / verdict / redaction unit tests) — that's a good follow-up phase, deferred.

Canaries

| Canary | Recipe | Why | Expected verdict |
|---|---|---|---|
| Simple (good) | custom-html-tiny | Minimal, fast, few deps; quick signal | GREEN |
| Significant (good) | lasuite-docs | Multi-service: backend + Postgres + Collabora WOPI + keycloak OIDC; exercises real breadth | GREEN |
| Known-BAD: custom-assertion | a seeded fixture (see below) | App comes up healthy but a functional/custom assertion is violated | RED |
| Known-BAD: per-tier ×4 | custom-html-tiny broken at one tier each (see below) | install / upgrade / backup / restore each fail in turn | RED at the intended tier |

Known-bad fixture (custom-assertion): reuse/recreate the phase-5 seeded case — custom-html branch v5-stale-docroot (serves .txt as application/octet-stream while the app is externally healthy), which already produced a RED build (#75) with only the content-type custom assertion failing. The regression test asserts the harness returns RED for this fixture. (If that branch is gone, recreate the pattern: an app that is up + passes lifecycle tiers but fails one functional assertion.) Pin by SHA.
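
The canary table and the pin-by-SHA rule could be captured as data that the tests parametrize over. This is a minimal sketch, not existing repo code: the `Canary` record, the `is_pinned` helper, and the canary names are all hypothetical, and `ref` is left unset because this plan does not state the actual SHAs.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# A full 40-hex-digit commit SHA; branch names and short SHAs do not count as pins.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


@dataclass(frozen=True)
class Canary:
    """Hypothetical record for one row of the canary table."""
    name: str
    recipe: str
    expected_verdict: str       # "GREEN" or "RED"
    ref: Optional[str] = None   # commit SHA to pin; None means not pinned yet


def is_pinned(ref: Optional[str]) -> bool:
    """True only for a full commit SHA, so a moving branch cannot change the fixture."""
    return ref is not None and _FULL_SHA.fullmatch(ref) is not None


# Illustrative entries only; names are made up, refs must be filled in with real SHAs.
CANARIES = [
    Canary("simple-good", "custom-html-tiny", "GREEN"),
    Canary("significant-good", "lasuite-docs", "GREEN"),
    Canary("bad-custom-assertion", "custom-html", "RED"),
]


def unpinned(canaries: List[Canary]) -> List[str]:
    """Names of canaries that still lack a full-SHA pin."""
    return [c.name for c in canaries if not is_pinned(c.ref)]
```

A suite-level check such as `assert unpinned(CANARIES) == []` would then fail fast if a fixture is still referenced by branch name (for example `v5-stale-docroot`) rather than by commit.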

Per-tier RED canaries — prove the server catches failure at EVERY tier (fast)

The single fixture above only proves the server catches a custom-assertion failure. Add one RED canary per lifecycle tier so we prove the server reports RED at each of install / upgrade / backup / restore — false-green is the scariest regression, and it can hide at any tier (e.g. restore silently restoring nothing, the ghost/mattermost class of bug). Use the simplest recipe — custom-html-tiny (static content, deploys in seconds) so all four run fast; each is a fixture broken at exactly one tier, pinned by commit SHA.

| RED canary | How it's broken (custom-html-tiny fixture) | Expected harness result |
|---|---|---|
| install | image tag that never becomes healthy / a healthcheck that can't pass | install tier RED |
| upgrade | installs clean; the upgrade target breaks the container so post-upgrade health fails | install PASS, upgrade tier RED |
| backup | install+upgrade clean; backup misconfigured (backupbot label/target wrong → backup errors or yields no artifact) | backup tier RED |
| restore | backup succeeds; restore is a no-op (hook does nothing) so the pre-seeded marker is ABSENT after restore | restore tier RED (the scariest false-green) |

Each pytest asserts precisely: overall verdict RED, the failing tier is the intended one, AND the tiers before it PASSED (e.g. upgrade-RED requires install to have passed) — so it proves "catches a failure at this tier", not merely "fails somewhere". These four form the fast subset of the suite; consider a sub-marker (@pytest.mark.canary_fast) so they can optionally run as a quicker pre-check while the slow good canaries (esp. lasuite-docs) stay on the milestone cadence below.
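
The three-part assertion described above can be sketched as one helper. It assumes a hypothetical result shape (an overall verdict string plus a dict mapping tier name to "PASS" or "FAIL"); the real harness output may be structured differently and would need adapting.

```python
from typing import Dict

# Lifecycle tiers in execution order.
TIERS = ["install", "upgrade", "backup", "restore"]


def assert_red_at(verdict: str, tier_results: Dict[str, str], intended: str) -> None:
    """Assert RED overall, PASS for every tier before `intended`, FAIL at `intended`.

    `tier_results` is an assumed shape: {"install": "PASS", "upgrade": "FAIL", ...}.
    Tiers after the intended one are not checked (the harness may skip them).
    """
    assert verdict == "RED", f"expected overall RED, got {verdict!r}"
    for tier in TIERS[: TIERS.index(intended)]:
        got = tier_results.get(tier)
        assert got == "PASS", f"prior tier {tier!r} should PASS, got {got!r}"
    got = tier_results.get(intended)
    assert got == "FAIL", f"tier {intended!r} should FAIL, got {got!r}"
```

For example, `assert_red_at("RED", {"install": "PASS", "upgrade": "FAIL"}, "upgrade")` passes, while the same call with a GREEN verdict, or with install already failing, raises `AssertionError`. That second case is what distinguishes "catches a failure at this tier" from "fails somewhere".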

What "works as expected" means per tier (real assertions, not exit codes)

For each good canary the test drives the real cold full suite and asserts the observable outcome:

  • install → app returns HTTP 200, expected page content/marker present, all services converged.
  • upgrade → a record/marker seeded before the upgrade is still present after; the deployed version actually bumped; app still 200. (For lasuite-docs: a created doc survives; OIDC login still works; Collabora WOPI still 200.)
  • backup + restore → seed data, take a backup, wipe the app, restore, assert the seeded data is back.
  • secrets persistence → the generated app secret is identical across install→upgrade→restore.
  • redaction → grep the published logs AND the dashboard for the generated secret value(s) → assert absent.
  • teardown → after the run, assert no leftover containers/volumes/secrets for the canary.
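
Two of the checks above (redaction and secrets persistence) reduce to small pure functions once the test has collected the relevant text. The sketch below assumes the test has already fetched the published logs and dashboard content as strings and read the secret after each tier; the function names and input shapes are illustrative, not part of the harness.

```python
from typing import Dict, Iterable, List


def leaked_sources(published: Dict[str, str], secrets: Iterable[str]) -> List[str]:
    """Return the names of published texts that contain any generated secret value.

    `published` maps a source name (e.g. "logs", "dashboard") to its full text.
    """
    # Drop empty values: an empty string is a substring of everything.
    values = [s for s in secrets if s]
    return sorted(
        name for name, text in published.items()
        if any(value in text for value in values)
    )


def secret_is_stable(observed: Dict[str, str]) -> bool:
    """True when the secret read after each tier is one single non-empty value.

    `observed` maps a tier name to the secret value read after that tier.
    """
    values = set(observed.values())
    return len(values) == 1 and "" not in values
```

In a test these would be used as `assert leaked_sources(published, secrets) == []` and `assert secret_is_stable(observed)`, so a failure message names the leaking source rather than just reporting a mismatch.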

Design

  • Location: tests/regression/ in the cc-ci repo. pytest, parametrized over the canaries.
  • Driver: wrap the existing real cold path (.../ci-test-review/verify-pr.sh / the runner runner/run_recipe_ci.py) — do NOT reimplement the harness; call it and assert on its real outputs (build verdict, summary, dashboard JSON, published logs). The semantic per-tier assertions live in the test, layered on top of the harness's own pass/fail so a tier that the harness calls "pass" but that didn't actually preserve data still fails the regression test.
  • Markers: @pytest.mark.canary (slow E2E) so they can be selected/excluded; the suite is run on-demand / pre-merge / nightly, not on every fast commit.
  • Determinism: pin canary recipe versions/SHAs and the bad-fixture SHA. Record run artifacts (build URL, logs) on failure for triage.
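
A thin driver along these lines would keep the tests calling the real harness instead of reimplementing it. This is a sketch under a stated assumption: that the wrapped command prints a single JSON object describing the run to stdout. The actual output format of verify-pr.sh and run_recipe_ci.py is not specified in this plan, so the parsing step is a placeholder to adapt.

```python
import json
import subprocess
from typing import Any, Dict, List


def run_harness(cmd: List[str], timeout_s: float = 3600.0) -> Dict[str, Any]:
    """Run the real harness command and return its parsed result.

    ASSUMPTION: the command prints one JSON object on stdout. A non-zero exit
    code is not treated as an error here, because a RED build may legitimately
    exit non-zero and the RED canaries need to inspect that result.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    try:
        result = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        # Keep the tail of stderr so a failed run can be triaged from the test log.
        raise RuntimeError(
            f"harness produced no parseable result (exit {proc.returncode}): "
            f"{proc.stderr[-500:]}"
        ) from exc
    if not isinstance(result, dict):
        raise RuntimeError(f"expected a JSON object, got {type(result).__name__}")
    result["exit_code"] = proc.returncode
    return result
```

The semantic per-tier assertions would then run against the returned dict plus whatever the test fetches itself (dashboard JSON, published logs), so a tier the harness reports as passing can still fail the regression test.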

How it runs — CADENCE POLICY (important)

Do NOT run these canaries on every commit/PR. They are slow and resource-heavy (a full lifecycle on lasuite-docs takes minutes and needs the live server/abra/Swarm). Run them deliberately at milestones:

  • Polishing passes, code reviews, and releases of the cc-ci server, i.e. before we trust a batch of server changes, not on each incremental commit.
  • On-demand: pytest -m canary against the live cc-ci server (from the orchestrator or the host).
  • They are explicitly opt-in (the @pytest.mark.canary marker keeps them out of any fast/default run). If wired to !testme on the cc-ci repo, gate it behind a deliberate trigger (e.g. a run-canaries label or a !testme --canary), not an automatic run on every cc-ci PR.
  • Document this cadence in tests/regression/README.md so future contributors don't wire it into the per-commit path.
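
Note that a pytest marker alone does not exclude a test from a plain `pytest` run; something has to deselect or skip it. One possible way to enforce the opt-in policy is a `conftest.py` in tests/regression/ that registers the markers and skips canaries unless the caller passed an explicit `-m` expression. The sketch below is an assumption about how this could be wired, not the agreed design; the helper name `canaries_opted_in` is hypothetical.

```python
from typing import Optional

# Marker names come from this plan; the descriptions are illustrative.
CANARY_MARKERS = {
    "canary": "slow E2E canary against the live cc-ci server (opt-in)",
    "canary_fast": "fast per-tier RED canaries (opt-in pre-check)",
}


def canaries_opted_in(markexpr: Optional[str]) -> bool:
    """Canaries may run only when the caller passed an explicit -m expression."""
    return bool(markexpr and markexpr.strip())


def pytest_configure(config):
    # Register the markers so `--strict-markers` does not reject them.
    for name, help_text in CANARY_MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {help_text}")


def pytest_collection_modifyitems(config, items):
    if canaries_opted_in(config.getoption("markexpr", default="")):
        return  # an explicit -m was given; pytest's own selection decides
    import pytest  # imported lazily; only needed when there is something to skip

    skip = pytest.mark.skip(reason="canaries are opt-in: run with -m canary")
    for item in items:
        if any(item.get_closest_marker(name) for name in CANARY_MARKERS):
            item.add_marker(skip)
```

With this in place a bare `pytest` reports the canaries as skipped, while `pytest -m canary` or `pytest -m canary_fast` runs them deliberately.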

Definition of Done (Adversary-verified)

  1. tests/regression/ pytest suite exists and is committed (cc-ci repo PR).
  2. The suite runs GREEN on both good canaries (custom-html-tiny, lasuite-docs) with the per-tier semantic assertions actually executing. The Adversary confirms the assertions FAIL if you tamper with an outcome, i.e. the assertions have teeth and are not vacuous.
  3. The custom-assertion known-bad canary makes the suite assert RED — and the Adversary confirms that if the server wrongly returned green for it, the regression test would FAIL (false-green caught).
  4. The four per-tier RED canaries (install/upgrade/backup/restore, on custom-html-tiny) each make the suite assert RED at the intended tier, with the prior tiers asserted PASS. Adversary confirms each has teeth: if the server wrongly green-lit that tier, the corresponding test would FAIL. They run fast.
  5. A short tests/regression/README.md: how to run it, what each canary guards, how to add a canary.
  6. NOT merged — recipe/test PR opened for operator review (loops never merge).

Risks / notes

  • Slow + resource-heavy: a full lifecycle on lasuite-docs takes minutes and needs the live server/abra/Swarm. Keep it -m canary, not in the fast path. Watch disk (lasuite-docs upgrade was disk-sensitive).
  • Flakiness: lasuite-docs had transient WOPI-404 / Collabora convergence races (see plan-lasuite-drive-*); use the harness's existing readiness probes; if flaky, add bounded retries on readiness only (never on the correctness assertion).
  • Don't weaken to pass: the whole point is teeth. A flaky correctness assertion is a real signal, not something to relax.
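
If the harness's existing readiness probes turn out not to be enough, the "bounded retries on readiness only" rule could look like the sketch below. The helper name, defaults, and the injectable `sleep` parameter are illustrative choices; the key property is that only the readiness probe is retried, while the correctness assertion runs exactly once afterwards.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay_s` between tries.

    Returns True as soon as the probe succeeds, False once the bound is hit.
    `sleep` is injectable so the helper can be tested without real waiting.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:  # no pointless sleep after the last try
            sleep(delay_s)
    return False
```

A test would call `assert wait_until_ready(probe)` first and then make its correctness assertion a single time, with no retry wrapper around it, so a flaky correctness result still surfaces as a failure.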

Out of scope (deferred to a follow-up phase)

  • Fast logic-unit layer (bridge !testme/!testmexyz/non-collaborator rules, verdict computation, redaction function, summary formatting) as a second-level pytest layer run on every commit.
  • Golden-output snapshots, concurrency-collision canary, perf-budget assertions.