claim(2w): WC6 nightly full-cold sweep — timer+service roll warm/infra (health-gated) then serial cold sweep promoting canonicals (WC5); proven live

canonical.enrolled_recipes; runner/nightly_sweep.py (roll keycloak+traefik →
serial full-cold over enrolled on latest → green promotes; skip if test active;
operate against CCCI_REPO checkout for tests/); nix/modules/nightly-sweep.nix
(timer 03:00 Persistent + oneshot service) wired in. 2 bugs fixed via live
service run (repo-relative enrolled scan; util-linux for backup PTY). Live
SERVICE sweep: enrolled=['custom-html'] → all tiers green → canonical advanced
1.10.0→1.11.0; red-run correctly does NOT promote. 71 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:33:08 +01:00
parent 1e40a460ba
commit 465e1059b0
7 changed files with 211 additions and 1 deletions


@ -358,3 +358,22 @@ canonical at latest separately (one extra deploy) so the old known-good is never
(DECISIONS Phase-2w WC5). Next: WC6 nightly sweep (systemd timer: nixos-rebuild switch FIRST then
serial cold sweep over enrolled recipes; need canonical.enrolled_recipes() + a nightly-sweep nix
module). Building WC6 code while the Adversary verifies WC5.
## 2026-05-29 — W3 WC6 nightly full-cold sweep built + proven (systemd service); claiming. WC5+WC6 close W3.
canonical.enrolled_recipes() (scan tests/*/recipe_meta.py for WARM_CANONICAL). runner/nightly_sweep.py
(roll keycloak+traefik via warm_reconcile health-gated → serial full-cold over enrolled recipes on
latest → each green promotes WC5; skip if a run is active; per-recipe red reported not fatal).
nix/modules/nightly-sweep.nix = systemd timer (OnCalendar 03:00 Persistent +RandomizedDelay) + oneshot
service; wired into configuration.nix. 71 unit pass.
Two bugs found via the live SERVICE run (not the direct run): (1) the store packages only runner/ (not
tests/), so enrolled_recipes scanned a nonexistent store/tests → []; fixed nightly_sweep to operate
against $CCCI_REPO=/root/cc-ci (the checkout with tests/) — same place run_recipe_ci runs from. (2) the
sweep wrapper's runtimeInputs lacked util-linux → abra's backup/restore PTY (`script`) failed → backup
red; added util-linux (matching cc-ci-run). After both fixes, the live SERVICE sweep: enrolled=
['custom-html'] → all 5 tiers green → WC5 promote advanced canonical 1.10.0→1.11.0+1.29.0; timer active
(next ~03:00). Also confirmed the red-run path (the util-linux flake) correctly did NOT promote
(known-good stayed 1.10.0 — never lose known-good). W3 (WC5+WC6) essentially closed. Remaining:
WC8 (resource/isolation hardening — mostly already in place) + WC9 (docs + --quick rollback proof,
already shown) → then DONE.


@ -39,7 +39,12 @@ nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversa
snapshot+registry; never lose known-good). Proven live: green cold custom-html run advanced the
canonical 1.10.0+1.28.0 → 1.11.0+1.29.0 (snapshot refreshed, idle, per-run app torn down).
`--quick` never promotes (W2). **Adversary PASS @2026-05-29** (REVIEW-2w 5bbc47c, gate 125453d).
- [ ] **WC6** — Nightly full-cold sweep (scheduled, declarative, MAX_TESTS-bounded).
- [x] **WC6** — Nightly full-cold sweep. `nix/modules/nightly-sweep.nix` (systemd TIMER OnCalendar
03:00 Persistent + oneshot service) → `runner/nightly_sweep.py`: roll warm/infra (keycloak+traefik
health-gated, WC1.1) → SERIAL full-cold run over enrolled (`canonical.enrolled_recipes`) recipes
on latest → each green run promotes its canonical (WC5); skips if a test is in flight. Proven via
the live service: enrolled=['custom-html'] → all tiers green → canonical advanced 1.10.0→1.11.0.
**CLAIMED — see Gate.**
- [x] **WC7** — Trigger/authority/labeling: default `!testme`=cold (unchanged); `--quick` opt-in via
bridge `parse_trigger` (`!testme --quick` → CCCI_QUICK=1 Drone param, deployed+live-verified);
never gates merge; runs carry mode=quick (lower-confidence label); clean no-canonical fallback
@ -133,6 +138,30 @@ headline e2e is green (below). No recipe/harness change needed.
## Gate
### Gate: WC6 — CLAIMED, awaiting Adversary (@2026-05-29)
**WHAT.** Nightly full-cold sweep: a scheduled job rolls warm/infra to latest (health-gated, WC1.1)
then runs the full COLD suite serially across enrolled canonical recipes on latest — refreshing each
canonical's known-good (WC5) + a daily authoritative regression. Declarative, MAX_TESTS-bounded
(serial), skips if a test is in flight. **WHERE:** `nix/modules/nightly-sweep.nix` (timer+service),
`runner/nightly_sweep.py`, `runner/harness/canonical.py` (`enrolled_recipes`). Wired into
`hosts/cc-ci/configuration.nix`.
**HOW + EXPECTED (cold):**
1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **71 passed** (incl. test_canonical enrolled_recipes).
2. **Timer present:** `systemctl is-active nightly-sweep.timer` → active; `systemctl list-timers
nightly-sweep.timer` → next ~03:00 (Persistent).
3. **Live sweep (via the systemd SERVICE, store copy):** set the custom-html canonical to an OLDER
version, then `systemctl start nightly-sweep.service` → journal shows: roll keycloak rc=0 + traefik
rc=0 (health-gated, noop at latest); `enrolled canonicals = ['custom-html']`; full-cold custom-html
install/upgrade/backup/restore/custom **all pass**; `WC5 promote: canonical custom-html advanced to
known-good 1.11.0+1.29.0`; `custom-html: PASS`; afterwards `canonical.json` version ADVANCED to
1.11.0+1.29.0, canonical idle, traefik+keycloak 200, system running. Builder ran this live: **PASS**.
(A red recipe in the sweep is reported FAIL + does NOT promote — known-good safe; verified when a
missing-util-linux backup flake red'd a run and the canonical stayed put, then fixed.)
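
The "canonical.json version ADVANCED" check in step 3 could be scripted along these lines. This is only a sketch: `canonical_advanced` is a hypothetical helper, and the `"version"` key name is an assumption (the unit tests shown in this commit only confirm a `"status"` key).

```python
import json


def canonical_advanced(registry_file: str, version_before: str) -> bool:
    """True if the version recorded in canonical.json differs from `version_before`."""
    with open(registry_file) as f:
        record = json.load(f)
    # ASSUMPTION: the registry stores the known-good version under "version".
    # A missing key raises KeyError instead of silently reporting an advance.
    return record["version"] != version_before
```

Called before and after `systemctl start nightly-sweep.service`, a check like this should return `True` after a green sweep and `False` after a red one, since a red run leaves the known-good untouched.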
---
### Gate: WC5 — ✅ Adversary PASS @2026-05-29 (REVIEW-2w 5bbc47c, gate 125453d)
Anti-poison gate predicate + live advancement 1.10.0→1.11.0 (cold-only) cold-verified. Builder may
proceed to WC6. (claim detail retained below.)


@ -17,6 +17,7 @@
../../modules/backupbot.nix
../../modules/harness.nix
../../modules/warm-keycloak.nix
../../modules/nightly-sweep.nix
];
# --- Tailscale (ACCESS-CRITICAL: do not break, this is the only route in) ---


@ -0,0 +1,46 @@
# Phase 2w / WC6 — nightly full-cold sweep. A systemd TIMER fires nightly and runs
# `runner/nightly_sweep.py`: roll warm/infra (keycloak+traefik) to latest health-gated (WC1.1) THEN
# a SERIAL full-cold run across enrolled (WARM_CANONICAL) recipes on latest — each green run
# promotes/refreshes that recipe's canonical (WC5), serving as the daily authoritative regression.
# Serial = MAX_TESTS honored (one at a time); skips itself if a test is already in flight. Declarative
# + reproducible (runner/ packaged in the nix store, D8-clean).
{ pkgs, ... }:
let
  runnerSrc = ../../runner;
  # The sweep drives run_recipe_ci.py (pytest/playwright) — needs the full harness env like cc-ci-run.
  pyEnv = pkgs.python3.withPackages (ps: with ps; [ pytest playwright ]);
  sweep = pkgs.writeShellApplication {
    name = "cc-ci-nightly-sweep";
    # util-linux provides `script` (abra's PTY wrapper for backup/restore TTY ops) — same as cc-ci-run.
    runtimeInputs = with pkgs; [ abra docker git curl jq gnused gnugrep gnutar coreutils util-linux procps ];
    text = ''
      export HOME=/root
      export PLAYWRIGHT_BROWSERS_PATH=${pkgs.playwright-driver.browsers}
      export PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1
      exec ${pyEnv}/bin/python3 ${runnerSrc}/nightly_sweep.py
    '';
  };
in
{
  systemd.services.nightly-sweep = {
    description = "Phase-2w nightly: roll warm/infra (health-gated) + full-cold sweep over canonicals";
    after = [ "deploy-proxy.service" "warm-keycloak.service" "docker.service" ];
    environment.HOME = "/root";
    serviceConfig = {
      Type = "oneshot";
      # A full sweep across several recipes (each a cold deploy/test/teardown) is long; bound it.
      TimeoutStartSec = "21600"; # 6h ceiling
      ExecStart = "${sweep}/bin/cc-ci-nightly-sweep";
    };
  };
  systemd.timers.nightly-sweep = {
    description = "Nightly trigger for the Phase-2w full-cold canonical sweep (WC6)";
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnCalendar = "*-*-* 03:00:00";
      Persistent = true; # catch up a missed nightly after downtime
      RandomizedDelaySec = "600";
    };
  };
}


@ -48,6 +48,20 @@ def canonical_domain(recipe: str) -> str:
    return warm.stable_domain(recipe)


def enrolled_recipes() -> list[str]:
    """All recipes enrolled as data-warm canonicals (recipe_meta.WARM_CANONICAL=True), sorted. Used
    by the WC6 nightly sweep to know which canonicals to refresh via a green cold run on latest."""
    tests_dir = os.path.join(os.path.dirname(__file__), "..", "..", "tests")
    out = []
    try:
        for name in sorted(os.listdir(tests_dir)):
            if os.path.isfile(os.path.join(tests_dir, name, "recipe_meta.py")) and is_enrolled(name):
                out.append(name)
    except OSError:
        pass
    return out


def registry_path(recipe: str) -> str:
    return os.path.join(warmsnap.app_dir(recipe), "canonical.json")

runner/nightly_sweep.py Normal file

@ -0,0 +1,86 @@
#!/usr/bin/env python3
"""Nightly full-cold sweep (Phase 2w / WC6).

Invoked by the `nightly-sweep` systemd timer (nix/modules/nightly-sweep.nix). Order (plan WC6):
  1. Roll warm/infra to latest, HEALTH-GATED (WC1.1): re-run the keycloak + traefik reconcilers
     (warm_reconcile.py <app> — fetch latest recipe → deploy → health-gate → commit/rollback+alert).
     This is the health-gated "warm/infra → latest" step; a full operator `nixos-rebuild switch` is
     the config-deploy path, not the autonomous nightly's job (DECISIONS Phase-2w WC6).
  2. FULL-COLD sweep across enrolled (WARM_CANONICAL) recipes, SERIAL (MAX_TESTS honored — one at a
     time), each `RECIPE=<r> run_recipe_ci.py` on LATEST (no REF) → a green run promotes/refreshes
     that recipe's canonical (WC5). Serves as the daily authoritative regression.

MUST NOT run while a test/Drone build is in flight: if a `run_recipe_ci.py` is already active, skip
this nightly (defer to the next) rather than pile on the single node. Bounded + serial. Exit 0 even
if some recipes fail (logs per-recipe results; a red recipe just doesn't advance its canonical).
"""
from __future__ import annotations

import os
import subprocess
import sys

# The sweep drives the recipe RUNS (run_recipe_ci) + reads enrollment (tests/<r>/recipe_meta.py),
# which live in the cc-ci CHECKOUT (the nix store packages only runner/, not tests/). So operate
# against $CCCI_REPO (default /root/cc-ci) — the same checkout run_recipe_ci already runs from.
REPO = os.environ.get("CCCI_REPO", "/root/cc-ci")
sys.path.insert(0, os.path.join(REPO, "runner"))

from harness import canonical  # noqa: E402

WARM_APPS = ["keycloak", "traefik"]  # the live-warm/infra reconcilers to roll first (health-gated)


def _here() -> str:
    return os.path.join(REPO, "runner")


def _another_run_active() -> bool:
    """True if a run_recipe_ci.py is already executing (don't pile onto the single node)."""
    r = subprocess.run(["pgrep", "-f", "run_recipe_ci.py"], capture_output=True, text=True)
    mine = str(os.getpid())
    pids = [p for p in r.stdout.split() if p and p != mine]
    return bool(pids)


def roll_warm_infra() -> None:
    """Re-run the health-gated reconcilers so keycloak + traefik roll to latest (WC1.1)."""
    for app in WARM_APPS:
        print(f"\n===== nightly: roll warm/infra {app} (health-gated) =====", flush=True)
        rc = subprocess.run(
            [sys.executable, os.path.join(_here(), "warm_reconcile.py"), app]
        ).returncode
        print(f"nightly: reconcile {app} rc={rc}", flush=True)


def sweep() -> int:
    recipes = canonical.enrolled_recipes()
    print(f"\n===== nightly cold sweep: enrolled canonicals = {recipes} =====", flush=True)
    results: dict[str, int] = {}
    for r in recipes:
        print(f"\n===== nightly: full-cold {r} (latest) =====", flush=True)
        env = dict(os.environ, RECIPE=r)
        env.pop("REF", None)  # latest, not a PR head
        env.pop("CCCI_QUICK", None)
        env.pop("MODE", None)
        rc = subprocess.run(
            [sys.executable, os.path.join(_here(), "run_recipe_ci.py")], env=env
        ).returncode
        results[r] = rc
        print(f"nightly: {r} rc={rc} ({'green→canonical refreshed' if rc == 0 else 'red'})", flush=True)
    print("\n===== nightly sweep summary =====", flush=True)
    for r, rc in results.items():
        print(f"  {r}: {'PASS' if rc == 0 else 'FAIL'}", flush=True)
    return 0  # the sweep itself succeeds; per-recipe reds are reported, not fatal


def main() -> int:
    if _another_run_active():
        print("nightly: a run_recipe_ci.py is active — skipping this nightly (defer)", flush=True)
        return 0
    roll_warm_infra()
    return sweep()


if __name__ == "__main__":
    raise SystemExit(main())
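
The skip-if-active guard above shells out to `pgrep`, which makes it awkward to unit-test directly. One possible way to cover its filtering logic is to factor it into pure functions, as sketched here. `other_pids` and `should_skip` are hypothetical names for illustration and are not part of this commit.

```python
from __future__ import annotations


def other_pids(pgrep_stdout: str, own_pid: int) -> list[str]:
    """PIDs printed by `pgrep -f run_recipe_ci.py`, with our own PID filtered out."""
    mine = str(own_pid)
    # str.split() with no argument already drops empty fields and newlines.
    return [p for p in pgrep_stdout.split() if p != mine]


def should_skip(pgrep_stdout: str, own_pid: int) -> bool:
    """True when some other process matched, i.e. the nightly should defer."""
    return bool(other_pids(pgrep_stdout, own_pid))


print(should_skip("", 4242))            # False: nothing matched
print(should_skip("4242\n", 4242))      # False: only ourselves
print(should_skip("101\n4242\n", 4242))  # True: another run is active
```

With this split, `_another_run_active()` would only need to pass `r.stdout` and `os.getpid()` through, and the three cases above could become plain unit tests.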


@ -59,3 +59,18 @@ def test_registry_roundtrip(tmp_path, monkeypatch):
    # the file is valid JSON on disk
    with open(canonical.registry_path("custom-html")) as f:
        assert json.load(f)["status"] == "warm"


def test_enrolled_recipes_scans_meta(tmp_path, monkeypatch):
    # enrolled_recipes() lists recipes whose tests/<r>/recipe_meta.py sets WARM_CANONICAL=True.
    fake_harness = tmp_path / "runner" / "harness"
    fake_harness.mkdir(parents=True)
    monkeypatch.setattr(canonical, "__file__", str(fake_harness / "canonical.py"))
    for name, body in (("aaa", "WARM_CANONICAL = True\n"),
                       ("bbb", "DEPS=['x']\n"),
                       ("ccc", "WARM_CANONICAL = True\n")):
        d = tmp_path / "tests" / name
        d.mkdir(parents=True)
        (d / "recipe_meta.py").write_text(body)
    (tmp_path / "tests" / "ddd").mkdir(parents=True)  # no recipe_meta.py at all
    assert canonical.enrolled_recipes() == ["aaa", "ccc"]
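
The test above covers the populated case. The missing-directory case is the one that silently produced `[]` in the live service run, so it may be worth covering too. The sketch below is a self-contained stand-in, not the repo's function: `scan_enrolled` mirrors the scan loop of `enrolled_recipes()` but takes the directory and the enrollment predicate as parameters, both of which are assumptions made for illustration.

```python
from __future__ import annotations

import os
import tempfile
from typing import Callable


def scan_enrolled(tests_dir: str, is_enrolled: Callable[[str], bool]) -> list[str]:
    """Sorted recipe names under tests_dir that have recipe_meta.py and are enrolled."""
    out = []
    try:
        for name in sorted(os.listdir(tests_dir)):
            has_meta = os.path.isfile(os.path.join(tests_dir, name, "recipe_meta.py"))
            if has_meta and is_enrolled(name):
                out.append(name)
    except OSError:
        # A missing or unreadable tests/ yields an empty list rather than an error.
        pass
    return out


with tempfile.TemporaryDirectory() as root:
    missing = os.path.join(root, "tests")  # never created
    print(scan_enrolled(missing, lambda name: True))  # []
```

Because the empty result is indistinguishable from "no recipes enrolled", a caller that expects at least one canonical may want to log or assert on it.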