diff --git a/docs/warm.md b/docs/warm.md new file mode 100644 index 0000000..a8044bc --- /dev/null +++ b/docs/warm.md @@ -0,0 +1,116 @@ +# Warm deployments + `--quick` CI mode (Phase 2w) + +cc-ci keeps a small set of apps **warm** so SSO-dependent tests and an opt-in fast lane avoid paying +the full cold-provisioning cost every run. Three states (use these terms): + +- **live-warm** — actually deployed and running (keycloak, traefik): instant to use, costs RAM. +- **data-warm** — *undeployed* (RAM freed) but its **data volume is retained**, so a later + `abra app deploy` reattaches it and boots warm (skips fresh DB-init/first-boot); costs only disk. +- **cold** — no retained data: fresh `abra app new` + new volume + full lifecycle + teardown that + deletes the volume. **The authoritative default** (`!testme` = full cold). + +**Stable-domain scheme:** warm apps live at `warm-.ci.commoninternet.net` — deliberately +distinct from the cold per-run scheme `-<6hex>.ci...` so a warm app is never confused +with a disposable cold run. Warm volumes + snapshots live under `/var/lib/ci-warm//` and are +**cache, not source** — re-seeded by cold runs, **excluded from the D8 reproducibility closure** (no +Nix module declares them as a source). + +## Live-warm keycloak + traefik — auto-update, health-gated, with rollback + +Both are **unpinned** and reconciled by `runner/warm_reconcile.py ` (driven by the systemd +oneshots `warm-keycloak.service` / `deploy-proxy.service`, re-run every activation/boot). On each +reconcile (and nightly, WC6): + +1. **WC1.2 pre-deploy safety gate (first).** Compare current→latest. **Auto-apply only non-major + (patch/minor) bumps with no manual-migration release notes.** A **MAJOR** recipe/app-version bump, + or a target whose `releaseNotes/.md` flags a manual migration, is **NOT auto-applied** — + stay on current + write an alert with the notes for the operator. (A health pass ≠ migration done.) +2. 
**WC1.1 post-deploy health gate.** Record running version = last-good → deploy latest → + health-check → **healthy: commit last-good := latest; unhealthy: roll back to last-good + alert.** + - **keycloak is stateful:** undeploy → **snapshot the data volume** → deploy latest → on failure + **restore the snapshot** + redeploy the prior version (a forward DB migration makes a + version-only rollback unsafe). + - **traefik is stateless:** version rollback only (no snapshot). + +keycloak is the **shared SSO provider**: SSO-dependent recipes point their `setup_custom_tests` at +the one warm keycloak and create a **per-run namespaced realm** `-<6hex>` (created at run +start, deleted at run end). Concurrent dependents get distinct realms; orphaned realms (crashed runs) +are reaped by hex not matching a live app stack. + +**Alerts.** A reconciler that rolls back (WC1.1) or holds an upgrade (WC1.2) writes a sentinel JSON to +`/var/lib/ci-warm/alerts/*.json`. The Builder loop relays new alerts (PushNotification) and archives +them to `alerts/seen/` — bridging the autonomous reconciler to operator visibility. + +## Data-warm canonicals (WC2/WC3) + +A **canonical** is a per-recipe known-good deployment at `warm-`, kept data-warm +(undeployed-when-idle, volume retained), tracked by `runner/harness/canonical.py`: + +- **Enroll a recipe:** set `WARM_CANONICAL = True` in `tests//recipe_meta.py`. That's it. +- **Registry:** `/var/lib/ci-warm//canonical.json` = `{recipe, domain, version, commit, + status, ts}`. +- **Known-good snapshot (WC3):** `runner/harness/warmsnap.py` takes a **raw per-volume tar while the + app is UNDEPLOYED** under `/var/lib/ci-warm//snapshot/` — **one last-good per app**, atomic + replace. `restore()` clears + untars each volume back; proven to round-trip data. + +## `--quick` opt-in fast lane (WC4/WC7) + +`!testme` = full **cold** (default, authoritative). 
`!testme --quick` = opt-in **lower-confidence** +fast lane (the bridge parses it → `CCCI_QUICK=1` Drone param; `run_quick` in `run_recipe_ci.py`): + +1. Reattach the canonical (`deploy_canonical` — warm boot at known-good) → wait healthy. +2. (deps) use the warm keycloak + a per-run realm. +3. **Upgrade in place to the PR head** (chaos) — the op, once. +4. Assert: generic UPGRADE (reconverge + moved + serving) + recipe overlay + custom. +5. **PASS → undeploy-keep-volume; known-good UNCHANGED (never promote).** + **FAIL → restore the last-known-good snapshot + undeploy (roll back, data safe).** + +`--quick` **never gates merge** and **never advances the canonical**. If no canonical exists it falls +back cleanly to a full cold run (the PR is still tested). + +## Cold-only canonical advancement (WC5) + nightly sweep (WC6) + +- **WC5 promote-on-green-cold.** A **GREEN full-cold run on LATEST** (no PR head) of an enrolled + recipe re-seeds the canonical at the green-verified latest (snapshot + registry, atomic). The + old known-good is replaced **only** after green — **never lost on a red run**. The FIRST green cold + run seeds the canonical. A PR `!testme` (carries REF) and `--quick` **never** promote — only + cold-on-latest (the nightly sweep, or a manual `RECIPE=` run) advances it. +- **WC6 nightly sweep.** `nightly-sweep.timer` (03:00, Persistent) → `nightly_sweep.py`: roll + warm/infra to latest (health-gated, WC1.1) → **serial** full-cold run across enrolled recipes on + latest (each green run promotes its canonical) → prune stale warm data → log disk. Serial honors + MAX_TESTS; skips if a test is already in flight. + +## Resource safety + isolation (WC8) + +- **Serialize:** `DRONE_RUNNER_CAPACITY = MAX_TESTS` (default 1); the nightly sweep is serial and + skips if a `run_recipe_ci.py` is active. At most MAX_TESTS apps are ever live at once. +- **Warm keycloak shared safely** via per-run namespaced realms (above); orphan realms reaped. 
+- **Disk** (warm is the budget, not RAM): `virtualisation.docker.autoPrune` prunes + images/containers/networks/build-cache older than 24h but **never `--volumes`** (so data-warm + canonical volumes survive). Each canonical = one data volume + one snapshot (small; the keycloak DB + snapshot ~300M dominates). `canonical.prune_stale()` (run nightly) drops warm data for + **de-enrolled** canonicals. Monitor with `df -h /` (the nightly logs it). +- **Cold teardown stays sacred:** a cold per-run app's volumes/secrets are always deleted at run end + (or janitor-reaped); promote re-seeds the canonical separately (never reuses a per-run volume). +- **Excluded from D8:** `/var/lib/ci-warm/` is runtime cache — no Nix module declares it as a source; + a from-scratch rebuild re-seeds canonicals via cold runs, it does not restore them. + +## The `--quick` rollback proof (WC9) + +Deliberately failing a PR under `--quick` restores the canonical's last-known-good intact, and a +`--quick` pass does not move the known-good — both proven live on the custom-html canonical: +- **PASS keeps known-good:** a `--quick` PASS run left the registry version + the snapshot tar + **byte-identical** (Adversary-verified sha256) and the canonical idle with its volume retained. +- **FAIL restores known-good:** a `--quick` run against a broken PR head (bad image) → `quick FAIL → + restored known-good data; canonical idle`, exit 1; the snapshot was byte-identical, the known-good + marker was back, the app served 200, and the broken image was gone. The known-good version was + never advanced. + +## Operate / debug + +- Inspect a canonical: `cat /var/lib/ci-warm//canonical.json`; `warmsnap` snapshot under + `…/snapshot/`. Enrolled recipes: `canonical.enrolled_recipes()`. +- Run a quick test manually: `RECIPE= CCCI_QUICK=1 cc-ci-run runner/run_recipe_ci.py`. +- Trigger the nightly sweep: `systemctl start nightly-sweep.service` (journal shows the roll + sweep). 
+- Roll/repair warm keycloak or traefik: `cc-ci-run runner/warm_reconcile.py {keycloak|traefik}`. +- Alerts: `ls /var/lib/ci-warm/alerts/` (active) and `…/seen/` (relayed). diff --git a/machine-docs/JOURNAL-2w.md b/machine-docs/JOURNAL-2w.md index 1d35d59..aad0a1a 100644 --- a/machine-docs/JOURNAL-2w.md +++ b/machine-docs/JOURNAL-2w.md @@ -377,3 +377,20 @@ red; added util-linux (matching cc-ci-run). After both fixes, the live SERVICE s (known-good stayed 1.10.0 — never lose known-good). W3 (WC5+WC6) essentially closed. Remaining: WC8 (resource/isolation hardening — mostly already in place) + WC9 (docs + --quick rollback proof, already shown) → then DONE. + +## 2026-05-29 — W4 WC8 + WC9 (final gates) built + claimed; DONE pending their PASS + +WC6 ADVERSARY PASS (REVIEW-2w b8b698e). Then built the final two: +- **WC8 resource safety + isolation** — most was already in place; consolidated + added the missing + piece: `canonical.prune_stale()` drops `/var/lib/ci-warm//` + the `warm-` volumes + for DE-ENROLLED canonicals (keeps enrolled + reconciler dirs keycloak/traefik + alerts/), wired + into the nightly sweep + a `df` log. +1 unit (72 pass). Verified live: DRONE_RUNNER_CAPACITY=maxTests + (serialize); autoPrune flags drop `--volumes` (warm vols survive); `grep ci-warm nix/` = comment + only (excluded from D8); disk 50%, warm ~318M. +- **WC9 docs** — `docs/warm.md`: the full warm/quick model (live/data-warm/cold, warm- scheme, + health-gated reconcilers + WC1.2 safety gate + alerts, canonicals + warmsnap + enroll, --quick, + promote-on-green-cold, nightly sweep, resource safety, operate/debug) + the `--quick` rollback proof + (FAIL restores exact known-good; PASS byte-identical snapshot — proven W2/WC4). + +Claimed WC8+WC9 (the final gates). On their PASS, EVERY WC1–WC9 (incl WC1.1/WC1.2) is Adversary-verified +→ write `## DONE` to STATUS-2w (handshake: <24h PASS for all + no VETO) → watchdog returns to Phase 2. 
diff --git a/machine-docs/STATUS-2w.md b/machine-docs/STATUS-2w.md index 68dabab..ea4d804 100644 --- a/machine-docs/STATUS-2w.md +++ b/machine-docs/STATUS-2w.md @@ -44,15 +44,19 @@ nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversa health-gated, WC1.1) → SERIAL full-cold run over enrolled (`canonical.enrolled_recipes`) recipes on latest → each green run promotes its canonical (WC5); skips if a test is in flight. Proven via the live service: enrolled=['custom-html'] → all tiers green → canonical advanced 1.10.0→1.11.0. - **CLAIMED — see Gate.** + **Adversary PASS @2026-05-29** (REVIEW-2w b8b698e, gate 465e105). - [x] **WC7** — Trigger/authority/labeling: default `!testme`=cold (unchanged); `--quick` opt-in via bridge `parse_trigger` (`!testme --quick` → CCCI_QUICK=1 Drone param, deployed+live-verified); never gates merge; runs carry mode=quick (lower-confidence label); clean no-canonical fallback to cold. **Adversary PASS @2026-05-29** (REVIEW-2w 31f0e42, gate 3ff2bf6). -- [ ] **WC8** — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via - per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure. -- [ ] **WC9** — Docs + cold verify incl. the rollback proof (deliberately fail a PR under `--quick`, - confirm last-known-good restored intact; a `--quick` pass did not move the known-good). +- [x] **WC8** — Resource safety + isolation: serialize via `DRONE_RUNNER_CAPACITY=MAX_TESTS` + serial + nightly that skips-if-test-active; warm keycloak shared via per-run realms (WC1); disk + monitored+pruned (autoPrune drops `--volumes` so warm vols survive; `canonical.prune_stale` + drops de-enrolled warm data nightly; nightly logs `df`); cold teardown sacred; warm data + EXCLUDED from D8 (no Nix module references `/var/lib/ci-warm` as a source). 
**CLAIMED — see Gate.** +- [x] **WC9** — `docs/warm.md` documents the full warm/quick model; the `--quick` rollback proof + (FAIL restores last-known-good intact; PASS doesn't move it) is proven live (W2 FAIL + WC4 + Adversary byte-identical-snapshot verify). **CLAIMED — see Gate.** ## Milestones (plan §3) - **W0** — Warm keycloak (WC1/WC1.1-keycloak/WC1.2). ✅ Adversary PASS @2026-05-29. @@ -138,7 +142,42 @@ headline e2e is green (below). No recipe/harness change needed. ## Gate -### Gate: WC6 — CLAIMED, awaiting Adversary (@2026-05-29) +### Gate: WC8 + WC9 — CLAIMED, awaiting Adversary (@2026-05-29) [FINAL gates] + +**WHAT.** WC8 resource safety/isolation (consolidated + a stale-warm prune) + WC9 docs + the proven +`--quick` rollback. **WHERE:** `runner/harness/canonical.py` (`prune_stale`), `runner/nightly_sweep.py` +(prune + df after sweep), `nix/modules/{drone-runner,swarm}.nix` (capacity, autoPrune), `docs/warm.md`. + +**HOW + EXPECTED (cold):** +1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **72 passed** (incl. test_canonical prune_stale: + drops de-enrolled canonical dirs, keeps enrolled + reconciler dirs + alerts/). +2. **WC8 serialize:** `grep DRONE_RUNNER_CAPACITY nix/modules/drone-runner.nix` → `= maxTests` + (MAX_TESTS, default 1); `nightly_sweep.py` `_another_run_active()` skips if a run is in flight; + sweep loop is serial. +3. **WC8 disk/prune:** `grep flags nix/modules/swarm.nix` → `[ "--all" "--filter" "until=24h" ]` + (NO `--volumes` → warm volumes survive); `canonical.prune_stale()` drops `/var/lib/ci-warm//` + (+ its `warm-` volumes) for recipes no longer WARM_CANONICAL, run nightly; `df -h /` logged by + the sweep. Live: disk `/` 50% (14G free); warm total ~318M (keycloak DB snapshot dominates). +4. **WC8 cold teardown sacred:** proven across W2/WC5/WC6 (no `-<6hex>` leftovers post-run). +5. 
**WC8 excluded from D8:** `grep -rn ci-warm nix/` → only a COMMENT (no Nix source declares + `/var/lib/ci-warm`); it's runtime cache re-seeded by cold runs. +6. **WC9 docs:** `docs/warm.md` covers live-warm/data-warm/cold, the reconcilers + health-gate + + safety gate + alerts, canonicals + snapshots + enroll, `--quick`, promote-on-green-cold, the + nightly sweep, resource safety, and the `--quick` rollback proof + operate/debug. +7. **WC9 `--quick` rollback proof:** already cold-verified — W2 FAIL run restored the exact + known-good; WC4 Adversary verify confirmed a PASS run leaves the snapshot byte-identical (does NOT + move the known-good). Re-runnable per docs/warm.md "The --quick rollback proof". + +**On WC8+WC9 PASS → ALL of WC1–WC9 (incl WC1.1/WC1.2) verified → Builder writes `## DONE`.** + +--- + +### Gate: WC6 — ✅ Adversary PASS @2026-05-29 (REVIEW-2w b8b698e, gate 465e105) +Declarative timer (Persistent) + orchestration + the live systemd-service run (infra roll +health-gated → serial cold sweep → canonical advanced, infra healthy, no leftovers) cold-verified. +Builder may proceed to W4 (WC8/WC9). (claim detail retained below.) + +### (claimed, now PASS) Gate: WC6 — CLAIMED detail **WHAT.** Nightly full-cold sweep: a scheduled job rolls warm/infra to latest (health-gated, WC1.1) then runs the full COLD suite serially across enrolled canonical recipes on latest — refreshing each diff --git a/runner/harness/canonical.py b/runner/harness/canonical.py index 3a36565..64e21b2 100644 --- a/runner/harness/canonical.py +++ b/runner/harness/canonical.py @@ -138,6 +138,36 @@ def undeploy_keep_volume(recipe: str) -> None: _set_status(recipe, "idle") +def prune_stale() -> list[str]: + """WC8 disk hygiene: remove warm data for DE-ENROLLED canonicals — a `/var/lib/ci-warm//` + that carries a `canonical.json` but whose recipe is no longer enrolled (WARM_CANONICAL dropped). + Drops the dir (snapshot + registry) AND the retained `warm-` data volumes. 
Leaves the
+    live-warm reconciler dirs (keycloak/traefik — they have a `last_good`, no `canonical.json`),
+    `alerts/`, and currently-enrolled canonicals untouched. Returns the recipes pruned."""
+    import shutil
+    import subprocess
+
+    root = warmsnap.warm_root()
+    keep = set(enrolled_recipes())
+    pruned: list[str] = []
+    try:
+        entries = sorted(os.listdir(root))
+    except OSError:
+        return pruned
+    for name in entries:
+        d = os.path.join(root, name)
+        if not os.path.isdir(d) or name in keep:
+            continue
+        if not os.path.isfile(os.path.join(d, "canonical.json")):
+            continue  # not a data-warm canonical (e.g. keycloak/traefik reconciler dir, alerts/)
+        # drop the retained warm- volumes, then the snapshot/registry dir
+        for vol in warmsnap.stack_volumes(canonical_domain(name)):
+            subprocess.run(["docker", "volume", "rm", vol], capture_output=True, text=True)
+        shutil.rmtree(d, ignore_errors=True)
+        pruned.append(name)
+    return pruned
+
+
+
 def seed_canonical(recipe: str, version: str, commit: str | None = None) -> dict:
     """Record (already deployed at `version`) as the recipe's canonical: write the registry, then
     (app must be UNDEPLOYED) take the known-good snapshot. Caller deploys + verifies
diff --git a/runner/nightly_sweep.py b/runner/nightly_sweep.py
index 8a63dbb..cf233c5 100644
--- a/runner/nightly_sweep.py
+++ b/runner/nightly_sweep.py
@@ -68,6 +68,12 @@ def sweep() -> int:
         ).returncode
         results[r] = rc
         print(f"nightly: {r} rc={rc} ({'green→canonical refreshed' if rc == 0 else 'red'})", flush=True)
+    # WC8 disk hygiene: drop warm data for de-enrolled canonicals; log the disk budget.
+    pruned = canonical.prune_stale()
+    if pruned:
+        print(f"nightly: pruned stale warm data for de-enrolled canonicals: {pruned}", flush=True)
+    df = subprocess.run(["df", "-h", "/"], capture_output=True, text=True)
+    print(f"nightly: disk / →\n{df.stdout.strip()}", flush=True)
     print("\n===== nightly sweep summary =====", flush=True)
     for r, rc in results.items():
         print(f" {r}: {'PASS' if rc == 0 else 'FAIL'}", flush=True)
diff --git a/tests/unit/test_canonical.py b/tests/unit/test_canonical.py
index a99783b..f25dd77 100644
--- a/tests/unit/test_canonical.py
+++ b/tests/unit/test_canonical.py
@@ -74,3 +74,23 @@ def test_enrolled_recipes_scans_meta(tmp_path, monkeypatch):
         (d / "recipe_meta.py").write_text(body)
     (tmp_path / "tests" / "ddd").mkdir(parents=True)  # no recipe_meta.py at all
     assert canonical.enrolled_recipes() == ["aaa", "ccc"]
+
+
+def test_prune_stale_drops_deenrolled_only(tmp_path, monkeypatch):
+    # prune_stale removes / dirs that have a canonical.json but aren't enrolled; keeps
+    # enrolled canonicals, reconciler dirs (no canonical.json), and alerts/.
+    monkeypatch.setenv("CCCI_WARM_ROOT", str(tmp_path))
+    monkeypatch.setattr(canonical, "enrolled_recipes", lambda: ["keepme"])
+    monkeypatch.setattr(canonical.warmsnap, "stack_volumes", lambda d: [])  # no docker in unit
+    # enrolled canonical (keep), de-enrolled canonical (prune), reconciler dir (keep), alerts (keep)
+    for name in ("keepme", "gone"):
+        (tmp_path / name).mkdir()
+        (tmp_path / name / "canonical.json").write_text('{"recipe":"%s"}' % name)
+    (tmp_path / "keycloak").mkdir(); (tmp_path / "keycloak" / "last_good").write_text("v1")  # reconciler
+    (tmp_path / "alerts").mkdir()
+    pruned = canonical.prune_stale()
+    assert pruned == ["gone"]
+    assert not (tmp_path / "gone").exists()
+    assert (tmp_path / "keepme").exists()
+    assert (tmp_path / "keycloak").exists()  # no canonical.json → not a canonical → kept
+    assert (tmp_path / "alerts").exists()
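
Review note: `test_prune_stale_drops_deenrolled_only` stubs `stack_volumes` to return an empty list, so the `docker volume rm` branch of `prune_stale()` is never exercised by the unit suite. The following is a minimal, self-contained sketch of how that branch might be covered without a docker daemon. It is not the repository's code: `prune_stale_sketch`, `demo`, the injected `stack_volumes` / `remove_volume` callables, and the volume name `warm-<name>_data` are all hypothetical, and the sketch passes the recipe name to the volume lookup where the real function passes `canonical_domain(name)`.

```python
import os
import shutil
import tempfile
from typing import Callable, Iterable


def prune_stale_sketch(
    root: str,
    enrolled: Iterable[str],
    stack_volumes: Callable[[str], list[str]],
    remove_volume: Callable[[str], None],
) -> list[str]:
    """Hypothetical stand-in for canonical.prune_stale(): same selection rule,
    but volume lookup and removal are injected so no docker daemon is needed."""
    keep = set(enrolled)
    pruned: list[str] = []
    try:
        entries = sorted(os.listdir(root))
    except OSError:
        return pruned
    for name in entries:
        d = os.path.join(root, name)
        # Skip plain files and canonicals that are still enrolled.
        if not os.path.isdir(d) or name in keep:
            continue
        # Only dirs carrying a canonical.json count as data-warm canonicals.
        if not os.path.isfile(os.path.join(d, "canonical.json")):
            continue
        # Drop the retained volumes first, then the snapshot/registry dir.
        for vol in stack_volumes(name):
            remove_volume(vol)
        shutil.rmtree(d, ignore_errors=True)
        pruned.append(name)
    return pruned


def demo() -> tuple[list[str], list[str]]:
    """Build a throwaway warm root and record which volumes would be removed."""
    removed: list[str] = []
    with tempfile.TemporaryDirectory() as root:
        for name in ("keepme", "gone"):
            os.makedirs(os.path.join(root, name))
            with open(os.path.join(root, name, "canonical.json"), "w") as fh:
                fh.write("{}")
        os.makedirs(os.path.join(root, "keycloak"))  # reconciler dir, no canonical.json
        os.makedirs(os.path.join(root, "alerts"))
        pruned = prune_stale_sketch(
            root,
            enrolled=["keepme"],
            stack_volumes=lambda name: [f"warm-{name}_data"],  # hypothetical volume name
            remove_volume=removed.append,
        )
    return pruned, removed
```

In the real test the same effect could probably be had by having the `stack_volumes` stub return a non-empty list and monkeypatching `subprocess.run` to record its arguments; the injected-callable form above is used only to keep the sketch runnable on its own.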