feat(regression): add tests/regression/ E2E canary suite

Three canaries (@pytest.mark.canary) drive the real cold CI lifecycle:
- good-simple: custom-html-tiny @ main (435df8fc), fast signal, expects GREEN
- good-significant: lasuite-docs @ main (290a8ad7), multi-service, expects GREEN
- bad-false-green: custom-html @ v5-stale-docroot (71e7326a), expects RED

Semantic teeth: beyond the exit code, each test asserts that specific named tests ran in the
results.json stages (test_serving, test_serving_and_frontend, test_content_type). If an
assertion is removed, the named test disappears → the regression test fails.

Includes conftest (run_recipe_ci helper + stage_has_{passing,failing}_test), README (cadence
policy, how to run, how to add), and phase state files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-02 01:25:55 +00:00
parent 91a7088f56
commit fd3db37c49
6 changed files with 566 additions and 1 deletions
--- a/tests/regression/README.md
+++ b/tests/regression/README.md
@@ -0,0 +1,136 @@
+# Regression canaries — E2E self-tests for the cc-ci server
+
+A standing pytest suite that drives the **real** cc-ci lifecycle harness against pinned canary
+recipes and verifies both halves of the server's job:
+
+1. **Good canaries** — healthy apps are reported GREEN (install + upgrade + backup/restore pass).
+2. **Bad canary** — broken apps are caught RED; a false-green makes the regression test itself fail.
+
+These tests run the full cold lifecycle on the live cc-ci server. They are **slow** (minutes per
+canary) and **opt-in** — kept out of the per-commit fast path by the `canary` marker.
+
+---
+
+## How to run
+
+Run on the cc-ci server (abra + Docker + Swarm required):
+
+```bash
+ssh cc-ci
+cd /root/cc-ci            # or wherever the repo is checked out
+cc-ci-run python -m pytest tests/regression/ -m canary -v
+```
+
+Or a single canary:
+
+```bash
+cc-ci-run python -m pytest tests/regression/ -m canary -k good-simple -v
+```
+
+From the orchestrator:
+
+```bash
+ssh cc-ci "cd /root/cc-ci && cc-ci-run python -m pytest tests/regression/ -m canary -v"
+```
+
+---
+
+## Canaries
+
+| ID | Recipe | Purpose | Expected verdict |
+|----|--------|---------|-----------------|
+| `good-simple` | `custom-html-tiny` | Minimal static server — fast signal | GREEN |
+| `good-significant` | `lasuite-docs` | Multi-service (backend + Postgres + Collabora + OIDC) | GREEN |
+| `bad-false-green` | `custom-html` @ `v5-stale-docroot` | App is UP but serves wrong Content-Type — catches false-green | RED |
+
+### Why the bad canary exists
+
+The scariest regression is a **false-green**: the server reports PASS while the app is broken.
+We already saw a fabricated full-PASS during the build. The `bad-false-green` canary pins a
+known-broken fixture (`v5-stale-docroot`: nginx serves `.txt` as `application/octet-stream`).
+The harness's `test_content_type_html_and_txt` catches this and returns RED (build #75 was RED
+for exactly this fixture).
+
+The regression test asserts `rc != 0`. If the harness ever wrongly returns green for this fixture,
+that assert fires and the false-green surfaces the next time the canaries are run.
+
+---
+
+## What each canary verifies
+
+### Per-tier semantic assertions (the "teeth")
+
+The tests assert MORE than the harness exit code: they check that **specific named assertions**
+ran and got the expected result. This guards against a different failure mode — a tier that
+nominally "passes" because the assertion was silently removed or made vacuous.
+
+| Stage | Test name | What it proves |
+|-------|-----------|---------------|
+| install | `test_serving` | Generic HTTP readiness check actually ran |
+| install | `test_serving_and_frontend` | Lasuite-docs frontend (SPA shell) actually loaded |
+| custom | `test_content_type` | Content-type assertion actually ran (bad canary only) |
+
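+If a tier assertion is removed: the named test disappears from `results.json` → the semantic
+check fires → the regression suite catches the removal. Names are matched by substring, so
+`test_content_type` also matches `test_content_type_html_and_txt`.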
+
+### Additional structural assertions (good canaries)
+
+- `install` tier: "pass" (not fail, not skip)
+- No tier is "fail" (skips acceptable for recipes without backup/custom tests)
+- `flags.clean_teardown = True` (no leftover containers/volumes/secrets)
+- `flags.no_secret_leak = True` (no secret value in the results artifact)
+
+---
+
+## Cadence policy
+
+**Do NOT run on every commit or PR.** These are slow and resource-heavy. Run them:
+
+- Before a **release** of the cc-ci server (after a batch of server changes).
+- As a **polishing pass** or pre-merge check for significant server refactors.
+- On-demand when you suspect a regression: `pytest -m canary`.
+
+They are NOT wired to the per-commit Drone pipeline. If adding a `!testme`-style trigger for the
+cc-ci repo, gate it behind a deliberate label (e.g. `run-canaries`) — not an automatic run on
+every push.
+
+---
+
+## How to add a canary
+
+1. Identify a recipe that is already deployable and has pinned version tags.
+2. Decide the expected verdict (GREEN or RED) and which tier assertions have teeth.
+3. Add an entry to `CANARIES` in `test_canaries.py`:
+
+```python
+{
+    "id": "good-myrecipe",
+    "recipe": "my-recipe",
+    "src": "recipe-maintainers/my-recipe",
+    "ref": "<pinned-sha>",           # pin to a specific commit for stability
+    "expected_green": True,
+    "stage_pass_checks": [
+        ("install", "test_serving"),  # verify this named test ran and passed
+    ],
+    "stage_fail_checks": [],
+}
+```
+
+4. Run the canary once to confirm it passes:
+   `cc-ci-run python -m pytest tests/regression/ -m canary -k good-myrecipe -v`
+
+5. Update the pin comment with the date and the recipe version it was pinned at.
+
+---
+
+## Pin maintenance
+
+Canary refs are pinned to specific SHAs for stability. When a recipe publishes a new release:
+
+1. Update the `"ref"` SHA in the canary definition (use the new main-branch HEAD).
+2. Update the pin comment with the new date/version.
+3. Re-run the canary to confirm GREEN before committing the pin update.
+
+The bad canary (`v5-stale-docroot`) is a stable fixture branch — update only if the branch is
+deleted. If deleted, recreate the pattern: an app that is up + passes lifecycle tiers but fails
+one functional assertion.
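
The entry format and pinning rules in the README above lend themselves to a quick pre-flight check before a new canary is run. The sketch below is illustrative only: `canary_problems`, `REQUIRED_KEYS`, and `FULL_SHA` are hypothetical names that are not part of this commit, and both the 40-character SHA rule and the "at least one named check" rule are assumptions inferred from the README rather than anything the suite enforces.

```python
import re

# Hypothetical pre-flight check for a CANARIES entry; not part of the suite.
REQUIRED_KEYS = {
    "id", "recipe", "src", "ref", "expected_green",
    "stage_pass_checks", "stage_fail_checks",
}
FULL_SHA = re.compile(r"[0-9a-f]{40}")  # assumption: pins are full lowercase SHAs


def canary_problems(entry: dict) -> list[str]:
    """Return human-readable problems; an empty list means the entry looks sane."""
    problems = []
    missing = REQUIRED_KEYS - set(entry)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    ref = str(entry.get("ref", ""))
    if not FULL_SHA.fullmatch(ref):
        problems.append(f"ref is not a full 40-character SHA: {ref!r}")
    # Assumption: a GREEN canary names a test that must pass, a RED one a test that must fail.
    key = "stage_pass_checks" if entry.get("expected_green") else "stage_fail_checks"
    if not entry.get(key):
        problems.append(f"{key} is empty, so the canary has no semantic teeth")
    return problems


good = {
    "id": "good-simple",
    "recipe": "custom-html-tiny",
    "src": "recipe-maintainers/custom-html-tiny",
    "ref": "435df8fc98ef7598084fcffcd6225470eca80053",
    "expected_green": True,
    "stage_pass_checks": [("install", "test_serving")],
    "stage_fail_checks": [],
}
# The README template still carries its placeholder ref.
template = dict(good, id="good-myrecipe", ref="<pinned-sha>")

print(canary_problems(good))      # []
print(canary_problems(template))  # one problem: the placeholder ref
```

A check like this could run in the fast path, since it never touches the harness; whether that is worth wiring in is a design choice for the maintainers.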
--- a/tests/regression/conftest.py
+++ b/tests/regression/conftest.py
@@ -0,0 +1,102 @@
+"""Shared fixtures and helpers for E2E canary regression tests.
+
+The regression tests call the real cc-ci harness (run_recipe_ci.py) as a subprocess and assert on
+its outputs (exit code, results.json). They run ON the cc-ci server, not the orchestrator — abra,
+Docker, and Swarm must be present.
+
+Invoke: cc-ci-run python -m pytest tests/regression/ -m canary -v
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import subprocess
+import sys
+import time
+
+ROOT = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+
+def pytest_configure(config):
+    config.addinivalue_line(
+        "markers",
+        "canary: slow E2E canary test — drives the full cold CI lifecycle; run on-demand only.",
+    )
+
+
+def run_recipe_ci(
+    recipe: str,
+    src: str,
+    ref: str,
+    pr: str = "0",
+    stages: str = "install,upgrade,backup,restore,custom",
+    runs_dir: str | None = None,
+    run_id_prefix: str = "regression",
+    timeout: int = 3600,
+) -> tuple[int, dict | None, str]:
+    """Invoke run_recipe_ci.py with the given canary params.
+
+    Returns (rc, results_dict_or_None, run_artifact_dir).
+    Stdout/stderr stream live so a human can follow progress.
+    """
+    ts = int(time.time())
+    run_id = f"{run_id_prefix}-{recipe}-{ref[:12]}-{ts}"
+    if runs_dir is None:
+        runs_dir = "/var/lib/cc-ci-runs"
+
+    env = dict(os.environ)
+    env.update(
+        {
+            "RECIPE": recipe,
+            "REF": ref,
+            "SRC": src,
+            "PR": pr,
+            "STAGES": stages,
+            "CCCI_RUN_ID": run_id,
+            "CCCI_RUNS_DIR": runs_dir,
+            "HOME": "/root",
+        }
+    )
+    # Keep PLAYWRIGHT env from the outer cc-ci-run wrapper (already in os.environ if running under it)
+
+    script = os.path.join(ROOT, "runner", "run_recipe_ci.py")
+    result = subprocess.run(
+        [sys.executable, script],
+        env=env,
+        timeout=timeout,
+    )
+    rc = result.returncode
+
+    artifact_dir = os.path.join(runs_dir, run_id)
+    results_path = os.path.join(artifact_dir, "results.json")
+    results_data: dict | None = None
+    if os.path.exists(results_path):
+        with open(results_path) as f:
+            results_data = json.load(f)
+
+    return rc, results_data, artifact_dir
+
+
+def find_stage_tests(results: dict, stage_name: str) -> list[dict]:
+    """Return the per-test list for a named stage from results.json, or []."""
+    for stage in results.get("stages", []):
+        if stage.get("name") == stage_name:
+            return stage.get("tests", [])
+    return []
+
+
+def stage_has_passing_test(results: dict, stage_name: str, test_name_substr: str) -> bool:
+    """True if the named stage contains a passing test whose name includes test_name_substr."""
+    for t in find_stage_tests(results, stage_name):
+        if test_name_substr in t.get("name", "") and t.get("status") == "pass":
+            return True
+    return False
+
+
+def stage_has_failing_test(results: dict, stage_name: str, test_name_substr: str) -> bool:
+    """True if the named stage contains a failing test whose name includes test_name_substr."""
+    for t in find_stage_tests(results, stage_name):
+        if test_name_substr in t.get("name", "") and t.get("status") in ("fail", "error"):
+            return True
+    return False
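
The matching rule used by `stage_has_passing_test` and `stage_has_failing_test` can be tried without the harness. The sketch below restates it as a single hypothetical function, `has_test`, and runs it on hand-written dictionaries. The dictionary shape is inferred from the helpers above, and the sample test names and statuses are assumptions, not real harness output.

```python
PASS = ("pass",)
FAIL = ("fail", "error")  # the failing helper treats "error" like "fail"


def has_test(results, stage_name, substr, statuses):
    """True if the first stage named stage_name has a test matching substr and statuses."""
    for stage in results.get("stages", []):
        if stage.get("name") == stage_name:
            return any(
                substr in t.get("name", "") and t.get("status") in statuses
                for t in stage.get("tests", [])
            )
    return False


def make_results(custom_tests):
    """Build a minimal results.json-like dict (shape assumed from the helpers)."""
    return {
        "stages": [
            {"name": "install", "tests": [{"name": "test_serving", "status": "pass"}]},
            {"name": "custom", "tests": custom_tests},
        ]
    }


guarded = make_results([{"name": "test_content_type_html_and_txt", "status": "fail"}])
errored = make_results([{"name": "test_content_type_html_and_txt", "status": "error"}])
removed = make_results([])  # the assertion was deleted, so the named test is gone

print(has_test(guarded, "custom", "test_content_type", FAIL))   # True (substring match)
print(has_test(errored, "custom", "test_content_type", FAIL))   # True ("error" counts)
print(has_test(removed, "custom", "test_content_type", FAIL))   # False (guard removed)
print(has_test(guarded, "install", "test_serving", PASS))       # True
```

The `removed` case is the one the "teeth" checks exist for: the run may still be RED for some other reason, but the named test is no longer there to prove it.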
--- a/tests/regression/test_canaries.py
+++ b/tests/regression/test_canaries.py
@@ -0,0 +1,181 @@
+"""E2E canary regression tests — the server's standing self-test suite.
+
+Three canaries prove both halves of the server's job:
+  1. GREEN canaries — good apps are reported healthy (install+upgrade+backup/restore pass).
+  2. RED canary    — broken apps are caught; a false-green makes THIS test fail.
+
+Run: cc-ci-run python -m pytest tests/regression/ -m canary -v
+Slow: each canary drives the full cold lifecycle on the live server (minutes per run).
+
+Pin policy: canary refs are pinned to specific SHAs for stability. Update them when the recipe
+publishes a new release and the pin is stale (re-run to confirm GREEN before updating).
+"""
+
+from __future__ import annotations
+
+import pytest
+
+from .conftest import run_recipe_ci, stage_has_failing_test, stage_has_passing_test
+
+# ---------------------------------------------------------------------------
+# Canary definitions
+# ---------------------------------------------------------------------------
+
+# Good canary 1: minimal static-file server — fast signal, few deps.
+_SIMPLE = {
+    "id": "good-simple",
+    "recipe": "custom-html-tiny",
+    "src": "recipe-maintainers/custom-html-tiny",
+    # Pin: main @ 2026-06-02 — update if the recipe publishes a new release and pin goes stale.
+    "ref": "435df8fc98ef7598084fcffcd6225470eca80053",
+    "expected_green": True,
+    # Named tests that MUST appear with "pass" in the result — these are the semantic teeth.
+    # If the generic install assertion is removed or made vacuous, test_serving disappears → this fails.
+    "stage_pass_checks": [
+        ("install", "test_serving"),
+    ],
+    "stage_fail_checks": [],
+}
+
+# Good canary 2: multi-service stack — backend + Postgres + Collabora WOPI + OIDC.
+# Exercises real breadth. Slowest canary (~10-20 min full lifecycle).
+_SIGNIFICANT = {
+    "id": "good-significant",
+    "recipe": "lasuite-docs",
+    "src": "recipe-maintainers/lasuite-docs",
+    # Pin: main @ 2026-06-02
+    "ref": "290a8ad72d06232f0b3f302d976af14bef0f3c53",
+    "expected_green": True,
+    "stage_pass_checks": [
+        ("install", "test_serving_and_frontend"),
+    ],
+    "stage_fail_checks": [],
+}
+
+# Bad canary: app is UP + passes all lifecycle tiers but the custom functional assertion detects a
+# semantic defect (wrong Content-Type for .txt files). The harness MUST report RED.
+# If the harness wrongly returns green for this fixture, assert rc != 0 fails → false-green caught.
+_BAD = {
+    "id": "bad-false-green",
+    "recipe": "custom-html",
+    "src": "recipe-maintainers/custom-html",
+    # Pin: v5-stale-docroot @ 71e7326 — serves .txt as application/octet-stream; build #75 was RED.
+    # Recreate pattern if branch disappears: app up + passes lifecycle, fails one content assertion.
+    "ref": "71e7326a99bbb69035a046fba8fa51859ca66115",
+    "expected_green": False,
+    # The specific test that must have FAILED, proving the content-type assertion has teeth.
+    # If the assertion is removed and the test disappears, stage_has_failing_test() returns False
+    # → the secondary assert in _assert_red() fails → we detect that the guard was removed.
+    "stage_pass_checks": [],
+    "stage_fail_checks": [
+        ("custom", "test_content_type"),
+    ],
+}
+
+CANARIES = [_SIMPLE, _SIGNIFICANT, _BAD]
+
+
+# ---------------------------------------------------------------------------
+# Test
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.canary
+@pytest.mark.parametrize("canary", CANARIES, ids=[c["id"] for c in CANARIES])
+def test_canary(canary, tmp_path):
+    """Drive the full cold CI lifecycle for a canary recipe and verify the outcome.
+
+    For GREEN canaries: proves the harness correctly reports a healthy app as healthy, and that
+    the per-tier semantic assertions actually ran (not vacuous).
+
+    For the RED canary: proves the harness catches a broken app — if the harness wrongly returned
+    green, `assert rc != 0` fails, catching the false-green.
+    """
+    rc, results, artifact_dir = run_recipe_ci(
+        recipe=canary["recipe"],
+        src=canary["src"],
+        ref=canary["ref"],
+        runs_dir=str(tmp_path),
+    )
+
+    _note = f"artifact_dir={artifact_dir}"  # appended to assertion messages shown on failure
+
+    if canary["expected_green"]:
+        _assert_green(rc, results, canary, _note)
+    else:
+        _assert_red(rc, results, canary, _note)
+
+
+def _assert_green(rc: int, results: dict | None, canary: dict, note: str) -> None:
+    """Assert a good-canary run is GREEN with real semantic assertions."""
+
+    # 1. Harness exit code must be 0 (GREEN).
+    assert rc == 0, f"[{canary['id']}] harness returned non-zero rc={rc} — expected GREEN. {note}"
+
+    assert (
+        results is not None
+    ), f"[{canary['id']}] results.json not written — harness may have crashed. {note}"
+
+    # 2. Install tier must have passed.
+    assert results.get("results", {}).get("install") == "pass", (
+        f"[{canary['id']}] install tier did not pass: results={results.get('results')}. {note}"
+    )
+
+    # 3. No tier may have FAILED (skips are acceptable for recipes without backup or custom tests).
+    failed_tiers = [t for t, s in results.get("results", {}).items() if s == "fail"]
+    assert not failed_tiers, f"[{canary['id']}] tiers failed: {failed_tiers}. {note}"
+
+    # 4. Teardown must be clean (no leftover containers/volumes/secrets).
+    assert (
+        results.get("flags", {}).get("clean_teardown") is True
+    ), f"[{canary['id']}] clean_teardown=False — residual state left on server. {note}"
+
+    # 5. No secret values leaked into the results artifact.
+    assert (
+        results.get("flags", {}).get("no_secret_leak") is True
+    ), f"[{canary['id']}] no_secret_leak=False — a secret value appeared in results.json. {note}"
+
+    # 6. Semantic stage assertions — TEETH CHECK.
+    # These verify that specific named tests actually ran and passed in the expected stage.
+    # If a tier assertion is removed or made vacuous, the named test disappears from results.json
+    # and this assert fires — proving the regression suite guards against silent test removal.
+    for stage_name, test_name_substr in canary.get("stage_pass_checks", []):
+        assert stage_has_passing_test(results, stage_name, test_name_substr), (
+            f"[{canary['id']}] expected a passing test containing {test_name_substr!r} in "
+            f"stage={stage_name!r}, but none found. "
+            f"Stage tests: {[t.get('name') for t in _stage_tests(results, stage_name)]}. {note}"
+        )
+
+
+def _assert_red(rc: int, results: dict | None, canary: dict, note: str) -> None:
+    """Assert a bad-canary run is RED (false-green guard).
+
+    The PRIMARY assertion is rc != 0. If the harness wrongly returns 0 (green) for this fixture,
+    this assert fails → the regression suite catches the false-green. This is the core guard.
+    """
+
+    # PRIMARY: harness must return non-zero (RED).
+    # If the harness returns 0 for a broken app, the regression suite fails here — false-green caught.
+    assert rc != 0, (
+        f"[{canary['id']}] harness returned rc=0 (GREEN) for a KNOWN-BAD fixture — "
+        f"FALSE-GREEN detected. The harness failed to catch the broken app. {note}"
+    )
+
+    # SECONDARY: verify the specific failing test is present in results.json.
+    # A missing results.json means the harness crashed, so a non-zero rc proves nothing about the
+    # guard; fail here instead of silently skipping the check.
+    assert (
+        results is not None
+    ), f"[{canary['id']}] results.json not written; harness may have crashed. {note}"
+    # If the content-type assertion is removed or made vacuous, stage_has_failing_test() returns
+    # False → this assert fires → we detect that the guard itself was removed (a meta-failure).
+    for stage_name, test_name_substr in canary.get("stage_fail_checks", []):
+        assert stage_has_failing_test(results, stage_name, test_name_substr), (
+            f"[{canary['id']}] expected a failing test containing {test_name_substr!r} in "
+            f"stage={stage_name!r}, but none found. "
+            f"The guard may have been removed or made vacuous. "
+            f"Stage tests: {[t.get('name') for t in _stage_tests(results, stage_name)]}. {note}"
+        )
+
+
+def _stage_tests(results: dict, stage_name: str) -> list[dict]:
+    for stage in results.get("stages", []):
+        if stage.get("name") == stage_name:
+            return stage.get("tests", [])
+    return []
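
The red-canary reasoning in `_assert_red` can be summarized as a small decision function. This is a sketch for discussion, not the suite's code: `classify_red_run` and its outcome labels are hypothetical, and the separate "no-results" outcome for a run that left no `results.json` is a suggestion rather than something this commit implements.

```python
def _tests_for(results, stage_name):
    """Per-test list of the first stage with this name, or []."""
    for stage in results.get("stages", []):
        if stage.get("name") == stage_name:
            return stage.get("tests", [])
    return []


def classify_red_run(rc, results, fail_checks):
    """Label one bad-canary run (hypothetical labels, see lead-in)."""
    if rc == 0:
        return "false-green"      # harness reported GREEN for a known-bad fixture
    if results is None:
        return "no-results"       # non-zero rc, but possibly just a crash
    for stage_name, substr in fail_checks:
        failed = any(
            substr in t.get("name", "") and t.get("status") in ("fail", "error")
            for t in _tests_for(results, stage_name)
        )
        if not failed:
            return "guard-missing"  # RED, but not because of the named test
    return "red-as-expected"


checks = [("custom", "test_content_type")]
red = {
    "stages": [
        {"name": "custom",
         "tests": [{"name": "test_content_type_html_and_txt", "status": "fail"}]},
    ]
}
empty = {"stages": [{"name": "custom", "tests": []}]}

print(classify_red_run(0, red, checks))     # false-green
print(classify_red_run(1, None, checks))    # no-results
print(classify_red_run(1, empty, checks))   # guard-missing
print(classify_red_run(1, red, checks))     # red-as-expected
```

Only the last outcome shows that the harness caught the defect for the intended reason; the other three are different ways the guard can stop working.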