diff --git a/machine-docs/JOURNAL-2w.md b/machine-docs/JOURNAL-2w.md index f83eb73..1d35d59 100644 --- a/machine-docs/JOURNAL-2w.md +++ b/machine-docs/JOURNAL-2w.md @@ -358,3 +358,22 @@ canonical at latest separately (one extra deploy) so the old known-good is never (DECISIONS Phase-2w WC5). Next: WC6 nightly sweep (systemd timer: nixos-rebuild switch FIRST then serial cold sweep over enrolled recipes; need canonical.enrolled_recipes() + a nightly-sweep nix module). Building WC6 code while the Adversary verifies WC5. + +## 2026-05-29 — W3 WC6 nightly full-cold sweep built + proven (systemd service); claiming. WC5+WC6 close W3. + +canonical.enrolled_recipes() (scan tests/*/recipe_meta.py for WARM_CANONICAL). runner/nightly_sweep.py +(roll keycloak+traefik via warm_reconcile health-gated → serial full-cold over enrolled recipes on +latest → each green promotes WC5; skip if a run is active; per-recipe red reported not fatal). +nix/modules/nightly-sweep.nix = systemd timer (OnCalendar 03:00 Persistent +RandomizedDelay) + oneshot +service; wired into configuration.nix. 71 unit pass. + +Two bugs found via the live SERVICE run (not the direct run): (1) the store packages only runner/ (not +tests/), so enrolled_recipes scanned a nonexistent store/tests → []; fixed nightly_sweep to operate +against $CCCI_REPO=/root/cc-ci (the checkout with tests/) — same place run_recipe_ci runs from. (2) the +sweep wrapper's runtimeInputs lacked util-linux → abra's backup/restore PTY (`script`) failed → backup +red; added util-linux (matching cc-ci-run). After both fixes, the live SERVICE sweep: enrolled= +['custom-html'] → all 5 tiers green → WC5 promote advanced canonical 1.10.0→1.11.0+1.29.0; timer active +(next ~03:00). Also confirmed the red-run path (the util-linux flake) correctly did NOT promote +(known-good stayed 1.10.0 — never lose known-good). W3 (WC5+WC6) essentially closed. 
Remaining: +WC8 (resource/isolation hardening — mostly already in place) + WC9 (docs + --quick rollback proof, +already shown) → then DONE. diff --git a/machine-docs/STATUS-2w.md b/machine-docs/STATUS-2w.md index 87ce29e..68dabab 100644 --- a/machine-docs/STATUS-2w.md +++ b/machine-docs/STATUS-2w.md @@ -39,7 +39,12 @@ nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversa snapshot+registry; never lose known-good). Proven live: green cold custom-html run advanced the canonical 1.10.0+1.28.0 → 1.11.0+1.29.0 (snapshot refreshed, idle, per-run app torn down). `--quick` never promotes (W2). **Adversary PASS @2026-05-29** (REVIEW-2w 5bbc47c, gate 125453d). -- [ ] **WC6** — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded). +- [x] **WC6** — Nightly full-cold sweep. `nix/modules/nightly-sweep.nix` (systemd TIMER OnCalendar + 03:00 Persistent + oneshot service) → `runner/nightly_sweep.py`: roll warm/infra (keycloak+traefik + health-gated, WC1.1) → SERIAL full-cold run over enrolled (`canonical.enrolled_recipes`) recipes + on latest → each green run promotes its canonical (WC5); skips if a test is in flight. Proven via + the live service: enrolled=['custom-html'] → all tiers green → canonical advanced 1.10.0→1.11.0. + **CLAIMED — see Gate.** - [x] **WC7** — Trigger/authority/labeling: default `!testme`=cold (unchanged); `--quick` opt-in via bridge `parse_trigger` (`!testme --quick` → CCCI_QUICK=1 Drone param, deployed+live-verified); never gates merge; runs carry mode=quick (lower-confidence label); clean no-canonical fallback @@ -133,6 +138,30 @@ headline e2e is green (below). No recipe/harness change needed. 
## Gate +### Gate: WC6 — CLAIMED, awaiting Adversary (@2026-05-29) + +**WHAT.** Nightly full-cold sweep: a scheduled job rolls warm/infra to latest (health-gated, WC1.1) +then runs the full COLD suite serially across enrolled canonical recipes on latest — refreshing each +canonical's known-good (WC5) + a daily authoritative regression. Declarative, MAX_TESTS-bounded +(serial), skips if a test is in flight. **WHERE:** `nix/modules/nightly-sweep.nix` (timer+service), +`runner/nightly_sweep.py`, `runner/harness/canonical.py` (`enrolled_recipes`). Wired into +`hosts/cc-ci/configuration.nix`. + +**HOW + EXPECTED (cold):** +1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **71 passed** (incl. test_canonical enrolled_recipes). +2. **Timer present:** `systemctl is-active nightly-sweep.timer` → active; `systemctl list-timers + nightly-sweep.timer` → next ~03:00 (Persistent). +3. **Live sweep (via the systemd SERVICE, store copy):** set the custom-html canonical to an OLDER + version, then `systemctl start nightly-sweep.service` → journal shows: roll keycloak rc=0 + traefik + rc=0 (health-gated, noop at latest); `enrolled canonicals = ['custom-html']`; full-cold custom-html + install/upgrade/backup/restore/custom **all pass**; `WC5 promote: canonical custom-html advanced to + known-good 1.11.0+1.29.0`; `custom-html: PASS`; afterwards `canonical.json` version ADVANCED to + 1.11.0+1.29.0, canonical idle, traefik+keycloak 200, system running. Builder ran this live: **PASS**. + (A red recipe in the sweep is reported FAIL + does NOT promote — known-good safe; verified when a + missing-util-linux backup flake red'd a run and the canonical stayed put, then fixed.) + +--- + ### Gate: WC5 — ✅ Adversary PASS @2026-05-29 (REVIEW-2w 5bbc47c, gate 125453d) Anti-poison gate predicate + live advancement 1.10.0→1.11.0 (cold-only) cold-verified. Builder may proceed to WC6. (claim detail retained below.) 
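Reviewer note on the WC6 gate above: the safety property it claims (a green cold run advances known-good, a red run leaves it untouched) can be stated as a small pure function. This is a minimal sketch only. `Canonical` and `after_cold_run` are hypothetical names for illustration. The real promote logic belongs to WC5 and is not part of this diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Canonical:
    """Hypothetical stand-in for one recipe's canonical registry entry."""
    recipe: str
    known_good: str  # version string, e.g. "1.10.0+1.28.0"


def after_cold_run(current: Canonical, tested_version: str, rc: int) -> Canonical:
    """Return the canonical record after one full-cold run.

    Only a green run (rc == 0) advances known-good to the tested version;
    a red run returns the existing record unchanged.
    """
    if rc == 0:
        return Canonical(current.recipe, tested_version)
    return current
```

Under this sketch, the util-linux flake described in the gate corresponds to a call with a non-zero `rc`, which hands back the same record that went in.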
diff --git a/nix/hosts/cc-ci/configuration.nix b/nix/hosts/cc-ci/configuration.nix index 91e15c1..ad68b42 100644 --- a/nix/hosts/cc-ci/configuration.nix +++ b/nix/hosts/cc-ci/configuration.nix @@ -17,6 +17,7 @@ ../../modules/backupbot.nix ../../modules/harness.nix ../../modules/warm-keycloak.nix + ../../modules/nightly-sweep.nix ]; # --- Tailscale (ACCESS-CRITICAL: do not break, this is the only route in) --- diff --git a/nix/modules/nightly-sweep.nix b/nix/modules/nightly-sweep.nix new file mode 100644 index 0000000..42018d3 --- /dev/null +++ b/nix/modules/nightly-sweep.nix @@ -0,0 +1,46 @@ +# Phase 2w / WC6 — nightly full-cold sweep. A systemd TIMER fires nightly and runs +# `runner/nightly_sweep.py`: roll warm/infra (keycloak+traefik) to latest health-gated (WC1.1) THEN +# a SERIAL full-cold run across enrolled (WARM_CANONICAL) recipes on latest — each green run +# promotes/refreshes that recipe's canonical (WC5), serving as the daily authoritative regression. +# Serial = MAX_TESTS honored (one at a time); skips itself if a test is already in flight. Declarative +# + reproducible (runner/ packaged in the nix store, D8-clean). +{ pkgs, ... }: +let + runnerSrc = ../../runner; + # The sweep drives run_recipe_ci.py (pytest/playwright) — needs the full harness env like cc-ci-run. + pyEnv = pkgs.python3.withPackages (ps: with ps; [ pytest playwright ]); + sweep = pkgs.writeShellApplication { + name = "cc-ci-nightly-sweep"; + # util-linux provides `script` (abra's PTY wrapper for backup/restore TTY ops) — same as cc-ci-run. 
+ runtimeInputs = with pkgs; [ abra docker git curl jq gnused gnugrep gnutar coreutils util-linux procps ]; + text = '' + export HOME=/root + export PLAYWRIGHT_BROWSERS_PATH=${pkgs.playwright-driver.browsers} + export PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 + exec ${pyEnv}/bin/python3 ${runnerSrc}/nightly_sweep.py + ''; + }; +in +{ + systemd.services.nightly-sweep = { + description = "Phase-2w nightly: roll warm/infra (health-gated) + full-cold sweep over canonicals"; + after = [ "deploy-proxy.service" "warm-keycloak.service" "docker.service" ]; + environment.HOME = "/root"; + serviceConfig = { + Type = "oneshot"; + # A full sweep across several recipes (each a cold deploy/test/teardown) is long; bound it. + TimeoutStartSec = "21600"; # 6h ceiling + ExecStart = "${sweep}/bin/cc-ci-nightly-sweep"; + }; + }; + + systemd.timers.nightly-sweep = { + description = "Nightly trigger for the Phase-2w full-cold canonical sweep (WC6)"; + wantedBy = [ "timers.target" ]; + timerConfig = { + OnCalendar = "*-*-* 03:00:00"; + Persistent = true; # catch up a missed nightly after downtime + RandomizedDelaySec = "600"; + }; + }; +} diff --git a/runner/harness/canonical.py b/runner/harness/canonical.py index d8a5df1..3a36565 100644 --- a/runner/harness/canonical.py +++ b/runner/harness/canonical.py @@ -48,6 +48,20 @@ def canonical_domain(recipe: str) -> str: return warm.stable_domain(recipe) +def enrolled_recipes() -> list[str]: + """All recipes enrolled as data-warm canonicals (recipe_meta.WARM_CANONICAL=True), sorted. 
Used + by the WC6 nightly sweep to know which canonicals to refresh via a green cold run on latest.""" + tests_dir = os.path.join(os.path.dirname(__file__), "..", "..", "tests") + out = [] + try: + for name in sorted(os.listdir(tests_dir)): + if os.path.isfile(os.path.join(tests_dir, name, "recipe_meta.py")) and is_enrolled(name): + out.append(name) + except OSError: + pass + return out + + def registry_path(recipe: str) -> str: return os.path.join(warmsnap.app_dir(recipe), "canonical.json") diff --git a/runner/nightly_sweep.py b/runner/nightly_sweep.py new file mode 100644 index 0000000..8a63dbb --- /dev/null +++ b/runner/nightly_sweep.py @@ -0,0 +1,86 @@ +#!/usr/bin/env python3 +"""Nightly full-cold sweep (Phase 2w / WC6). + +Invoked by the `nightly-sweep` systemd timer (nix/modules/nightly-sweep.nix). Order (plan WC6): + 1. Roll warm/infra to latest, HEALTH-GATED (WC1.1): re-run the keycloak + traefik reconcilers + (warm_reconcile.py — fetch latest recipe → deploy → health-gate → commit/rollback+alert). + This is the health-gated "warm/infra → latest" step; a full operator `nixos-rebuild switch` is + the config-deploy path, not the autonomous nightly's job (DECISIONS Phase-2w WC6). + 2. FULL-COLD sweep across enrolled (WARM_CANONICAL) recipes, SERIAL (MAX_TESTS honored — one at a + time), each `RECIPE= run_recipe_ci.py` on LATEST (no REF) → a green run promotes/refreshes + that recipe's canonical (WC5). Serves as the daily authoritative regression. + +MUST NOT run while a test/Drone build is in flight: if a `run_recipe_ci.py` is already active, skip +this nightly (defer to the next) rather than pile on the single node. Bounded + serial. Exit 0 even +if some recipes fail (logs per-recipe results; a red recipe just doesn't advance its canonical). 
+""" + +from __future__ import annotations + +import os +import subprocess +import sys + +# The sweep drives the recipe RUNS (run_recipe_ci) + reads enrollment (tests//recipe_meta.py), +# which live in the cc-ci CHECKOUT (the nix store packages only runner/, not tests/). So operate +# against $CCCI_REPO (default /root/cc-ci) — the same checkout run_recipe_ci already runs from. +REPO = os.environ.get("CCCI_REPO", "/root/cc-ci") +sys.path.insert(0, os.path.join(REPO, "runner")) +from harness import canonical # noqa: E402 + +WARM_APPS = ["keycloak", "traefik"] # the live-warm/infra reconcilers to roll first (health-gated) + + +def _here() -> str: + return os.path.join(REPO, "runner") + + +def _another_run_active() -> bool: + """True if a run_recipe_ci.py is already executing (don't pile onto the single node).""" + r = subprocess.run(["pgrep", "-f", "run_recipe_ci.py"], capture_output=True, text=True) + mine = str(os.getpid()) + pids = [p for p in r.stdout.split() if p and p != mine] + return bool(pids) + + +def roll_warm_infra() -> None: + """Re-run the health-gated reconcilers so keycloak + traefik roll to latest (WC1.1).""" + for app in WARM_APPS: + print(f"\n===== nightly: roll warm/infra {app} (health-gated) =====", flush=True) + rc = subprocess.run( + [sys.executable, os.path.join(_here(), "warm_reconcile.py"), app] + ).returncode + print(f"nightly: reconcile {app} rc={rc}", flush=True) + + +def sweep() -> int: + recipes = canonical.enrolled_recipes() + print(f"\n===== nightly cold sweep: enrolled canonicals = {recipes} =====", flush=True) + results: dict[str, int] = {} + for r in recipes: + print(f"\n===== nightly: full-cold {r} (latest) =====", flush=True) + env = dict(os.environ, RECIPE=r) + env.pop("REF", None) # latest, not a PR head + env.pop("CCCI_QUICK", None) + env.pop("MODE", None) + rc = subprocess.run( + [sys.executable, os.path.join(_here(), "run_recipe_ci.py")], env=env + ).returncode + results[r] = rc + print(f"nightly: {r} rc={rc} 
({'green→canonical refreshed' if rc == 0 else 'red'})", flush=True) + print("\n===== nightly sweep summary =====", flush=True) + for r, rc in results.items(): + print(f" {r}: {'PASS' if rc == 0 else 'FAIL'}", flush=True) + return 0 # the sweep itself succeeds; per-recipe reds are reported, not fatal + + +def main() -> int: + if _another_run_active(): + print("nightly: a run_recipe_ci.py is active — skipping this nightly (defer)", flush=True) + return 0 + roll_warm_infra() + return sweep() + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/tests/unit/test_canonical.py b/tests/unit/test_canonical.py index 60c2dcb..a99783b 100644 --- a/tests/unit/test_canonical.py +++ b/tests/unit/test_canonical.py @@ -59,3 +59,18 @@ def test_registry_roundtrip(tmp_path, monkeypatch): # the file is valid JSON on disk with open(canonical.registry_path("custom-html")) as f: assert json.load(f)["status"] == "warm" + + +def test_enrolled_recipes_scans_meta(tmp_path, monkeypatch): + # enrolled_recipes() lists recipes whose tests//recipe_meta.py sets WARM_CANONICAL=True. + fake_harness = tmp_path / "runner" / "harness" + fake_harness.mkdir(parents=True) + monkeypatch.setattr(canonical, "__file__", str(fake_harness / "canonical.py")) + for name, body in (("aaa", "WARM_CANONICAL = True\n"), + ("bbb", "DEPS=['x']\n"), + ("ccc", "WARM_CANONICAL = True\n")): + d = tmp_path / "tests" / name + d.mkdir(parents=True) + (d / "recipe_meta.py").write_text(body) + (tmp_path / "tests" / "ddd").mkdir(parents=True) # no recipe_meta.py at all + assert canonical.enrolled_recipes() == ["aaa", "ccc"]
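Reviewer note on bug (1) from the journal entry: `enrolled_recipes()` swallows `OSError`, so scanning a `tests` directory that does not exist yields an empty list rather than an error, which is why the store copy silently swept nothing. The sketch below reproduces that behavior in isolation. It is an illustration under assumptions: `scan_enrolled`, `demo`, and the regex check are hypothetical, and the regex only stands in for the real `is_enrolled`, whose implementation is not shown in this diff.

```python
import os
import re
import tempfile

# Assumed marker format; a stand-in for whatever is_enrolled() actually checks.
_MARKER = re.compile(r"^WARM_CANONICAL\s*=\s*True\b", re.MULTILINE)


def scan_enrolled(tests_dir: str) -> list[str]:
    """List sub-directories of tests_dir whose recipe_meta.py carries the marker."""
    out: list[str] = []
    try:
        names = sorted(os.listdir(tests_dir))
    except OSError:
        return out  # missing directory: report nothing enrolled, without raising
    for name in names:
        meta = os.path.join(tests_dir, name, "recipe_meta.py")
        if os.path.isfile(meta):
            with open(meta, encoding="utf-8") as f:
                if _MARKER.search(f.read()):
                    out.append(name)
    return out


def demo() -> tuple[list[str], list[str]]:
    """Scan a populated checkout-like tree, then a path that does not exist."""
    with tempfile.TemporaryDirectory() as root:
        tests = os.path.join(root, "tests")
        for name, body in (("aaa", "WARM_CANONICAL = True\n"), ("bbb", "DEPS=['x']\n")):
            os.makedirs(os.path.join(tests, name))
            with open(os.path.join(tests, name, "recipe_meta.py"), "w", encoding="utf-8") as f:
                f.write(body)
        missing = os.path.join(root, "store", "tests")  # never created
        return scan_enrolled(tests), scan_enrolled(missing)
```

A possible follow-up, not implemented here, would be to log a warning when the directory is missing so that an empty sweep is distinguishable from a misconfigured path.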