claim(2w): WC8 + WC9 (FINAL gates) — resource-safety consolidation + stale-warm prune + docs/warm.md + --quick rollback proof
WC8: canonical.prune_stale (drop de-enrolled warm data + volumes) wired into the nightly sweep + df log; consolidated evidence (DRONE_RUNNER_CAPACITY=MAX_TESTS serialize; autoPrune omits --volumes so warm vols survive; cold teardown sacred; warm excluded from D8 — no nix source ref). +1 unit (72 pass).

WC9: docs/warm.md documents the full warm/quick model; --quick rollback proof already proven live (W2 FAIL restores exact known-good; WC4 PASS byte-identical snapshot).

On PASS, all WC1-WC9 (incl WC1.1/WC1.2) verified → DONE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
116
docs/warm.md
Normal file
@@ -0,0 +1,116 @@
# Warm deployments + `--quick` CI mode (Phase 2w)

cc-ci keeps a small set of apps **warm** so SSO-dependent tests and an opt-in fast lane avoid paying
the full cold-provisioning cost every run. Three states (use these terms):

- **live-warm** — actually deployed and running (keycloak, traefik): instant to use, costs RAM.
- **data-warm** — *undeployed* (RAM freed) but its **data volume is retained**, so a later
  `abra app deploy` reattaches it and boots warm (skips fresh DB-init/first-boot); costs only disk.
- **cold** — no retained data: fresh `abra app new` + new volume + full lifecycle + teardown that
  deletes the volume. **The authoritative default** (`!testme` = full cold).

**Stable-domain scheme:** warm apps live at `warm-<recipe>.ci.commoninternet.net` — deliberately
distinct from the cold per-run scheme `<recipe[:4]>-<6hex>.ci...` so a warm app is never confused
with a disposable cold run. Warm volumes + snapshots live under `/var/lib/ci-warm/<recipe>/` and are
**cache, not source** — re-seeded by cold runs, **excluded from the D8 reproducibility closure** (no
Nix module declares them as a source).

## Live-warm keycloak + traefik — auto-update, health-gated, with rollback

Both are **unpinned** and reconciled by `runner/warm_reconcile.py <app>` (driven by the systemd
oneshots `warm-keycloak.service` / `deploy-proxy.service`, re-run every activation/boot). On each
reconcile (and nightly, WC6):

1. **WC1.2 pre-deploy safety gate (first).** Compare current→latest. **Auto-apply only non-major
   (patch/minor) bumps with no manual-migration release notes.** A **MAJOR** recipe/app-version bump,
   or a target whose `releaseNotes/<version>.md` flags a manual migration, is **NOT auto-applied** —
   stay on current + write an alert with the notes for the operator. (A health pass ≠ migration done.)
2. **WC1.1 post-deploy health gate.** Record running version = last-good → deploy latest →
   health-check → **healthy: commit last-good := latest; unhealthy: roll back to last-good + alert.**
   - **keycloak is stateful:** undeploy → **snapshot the data volume** → deploy latest → on failure
     **restore the snapshot** + redeploy the prior version (a forward DB migration makes a
     version-only rollback unsafe).
   - **traefik is stateless:** version rollback only (no snapshot).

keycloak is the **shared SSO provider**: SSO-dependent recipes point their `setup_custom_tests` at
the one warm keycloak and create a **per-run namespaced realm** `<parent>-<6hex>` (created at run
start, deleted at run end). Concurrent dependents get distinct realms; orphaned realms (crashed runs)
are reaped when their hex suffix matches no live app stack.

**Alerts.** A reconciler that rolls back (WC1.1) or holds an upgrade (WC1.2) writes a sentinel JSON to
`/var/lib/ci-warm/alerts/*.json`. The Builder loop relays new alerts (PushNotification) and archives
them to `alerts/seen/` — bridging the autonomous reconciler to operator visibility.

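The two gates above can be read as a pair of pure decisions. The following is a minimal sketch under stated assumptions, not the actual `runner/warm_reconcile.py`: the names `Decision`, `major`, `pre_deploy_gate`, and `post_deploy_gate` are hypothetical, versions are assumed to start with a numeric major component (for example `1.10.0`), and the caller is assumed to have already checked whether the target's release notes flag a manual migration.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str  # "noop", "hold", or "deploy"
    reason: str


def major(version: str) -> int:
    """Leading numeric component of a version such as '1.10.0' or 'v2.3.1' (assumed format)."""
    return int(version.lstrip("v").split(".")[0])


def pre_deploy_gate(current: str, latest: str, needs_manual_migration: bool) -> Decision:
    """WC1.2-style gate: auto-apply only non-major bumps with no manual-migration notes."""
    if current == latest:
        return Decision("noop", "already on latest")
    if major(latest) != major(current):
        return Decision("hold", f"major bump {current} -> {latest}")
    if needs_manual_migration:
        return Decision("hold", f"release notes for {latest} flag a manual migration")
    return Decision("deploy", f"non-major bump {current} -> {latest}")


def post_deploy_gate(last_good: str, latest: str, healthy: bool) -> tuple[str, bool]:
    """WC1.1-style gate: return (version to record as last-good, whether to alert)."""
    if healthy:
        return latest, False  # commit last-good := latest
    return last_good, True    # roll back to last-good and alert
```

A `hold` result corresponds to staying on the current version and writing an alert; the stateful snapshot/restore that keycloak needs around the deploy is deliberately left out of this sketch.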
## Data-warm canonicals (WC2/WC3)

A **canonical** is a per-recipe known-good deployment at `warm-<recipe>`, kept data-warm
(undeployed-when-idle, volume retained), tracked by `runner/harness/canonical.py`:

- **Enroll a recipe:** set `WARM_CANONICAL = True` in `tests/<recipe>/recipe_meta.py`. That's it.
- **Registry:** `/var/lib/ci-warm/<recipe>/canonical.json` = `{recipe, domain, version, commit,
  status, ts}`.
- **Known-good snapshot (WC3):** `runner/harness/warmsnap.py` takes a **raw per-volume tar while the
  app is UNDEPLOYED** under `/var/lib/ci-warm/<recipe>/snapshot/` — **one last-good per app**, atomic
  replace. `restore()` clears + untars each volume back; proven to round-trip data.

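One way to get the "atomic replace" property for the registry file is to write to a temporary file in the same directory and rename it over the target. The sketch below is illustrative only and is not taken from `canonical.py`: `write_registry` is a hypothetical helper, the field names follow the registry listing above, and storing `ts` as Unix seconds is an assumption.

```python
import json
import os
import tempfile
import time
from typing import Optional


def write_registry(recipe_dir: str, recipe: str, domain: str, version: str,
                   commit: Optional[str] = None, status: str = "idle") -> dict:
    """Write <recipe_dir>/canonical.json so readers see the old or the new file, never a partial one."""
    entry = {
        "recipe": recipe,
        "domain": domain,
        "version": version,
        "commit": commit,
        "status": status,
        "ts": int(time.time()),  # assumption: the real registry may use another timestamp format
    }
    os.makedirs(recipe_dir, exist_ok=True)
    # The temp file must live in the same directory so os.replace stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=recipe_dir, prefix=".canonical.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(entry, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, os.path.join(recipe_dir, "canonical.json"))
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return entry
```

The same write-then-rename pattern would apply to a snapshot tar, which is why a failed or interrupted run can leave the previous known-good in place.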
## `--quick` opt-in fast lane (WC4/WC7)

`!testme` = full **cold** (default, authoritative). `!testme --quick` = opt-in **lower-confidence**
fast lane (the bridge parses it → `CCCI_QUICK=1` Drone param; `run_quick` in `run_recipe_ci.py`):

1. Reattach the canonical (`deploy_canonical` — warm boot at known-good) → wait healthy.
2. (deps) use the warm keycloak + a per-run realm.
3. **Upgrade in place to the PR head** (chaos) — the op, once.
4. Assert: generic UPGRADE (reconverge + moved + serving) + recipe overlay + custom.
5. **PASS → undeploy-keep-volume; known-good UNCHANGED (never promote).**
   **FAIL → restore the last-known-good snapshot + undeploy (roll back, data safe).**

`--quick` **never gates merge** and **never advances the canonical**. If no canonical exists it falls
back cleanly to a full cold run (the PR is still tested).

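The trigger handling and the no-canonical fallback described above can be sketched as two small functions. This is a hedged illustration, not the bridge's real `parse_trigger`: `parse_trigger_sketch` and `select_mode` are hypothetical names, and the exact comment grammar the bridge accepts is an assumption.

```python
from typing import Optional


def parse_trigger_sketch(comment: str) -> Optional[dict]:
    """Map a PR comment to Drone build params, or None if it is not a trigger."""
    tokens = comment.strip().split()
    if not tokens or tokens[0] != "!testme":
        return None
    params = {}
    if "--quick" in tokens[1:]:
        params["CCCI_QUICK"] = "1"  # opt-in fast lane; the default stays full cold
    return params


def select_mode(params: dict, canonical_exists: bool) -> str:
    """Quick only when requested AND a canonical exists; otherwise fall back to cold."""
    if params.get("CCCI_QUICK") == "1" and canonical_exists:
        return "quick"
    return "cold"
```

For example, `select_mode(parse_trigger_sketch("!testme --quick"), canonical_exists=False)` yields `"cold"`, which matches the fallback rule that the PR is still tested when no canonical has been seeded.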
## Cold-only canonical advancement (WC5) + nightly sweep (WC6)

- **WC5 promote-on-green-cold.** A **GREEN full-cold run on LATEST** (no PR head) of an enrolled
  recipe re-seeds the canonical at the green-verified latest (snapshot + registry, atomic). The
  old known-good is replaced **only** after green — **never lost on a red run**. The FIRST green cold
  run seeds the canonical. A PR `!testme` (carries REF) and `--quick` **never** promote — only
  cold-on-latest (the nightly sweep, or a manual `RECIPE=<r>` run) advances it.
- **WC6 nightly sweep.** `nightly-sweep.timer` (03:00, Persistent) → `nightly_sweep.py`: roll
  warm/infra to latest (health-gated, WC1.1) → **serial** full-cold run across enrolled recipes on
  latest (each green run promotes its canonical) → prune stale warm data → log disk. Serial honors
  MAX_TESTS; skips if a test is already in flight.

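The promotion rule in WC5 reduces to a single predicate. The sketch below restates it under assumptions: `should_promote` is a hypothetical name, `rc` is the run's exit code, and `ref` stands for the PR head reference (the `REF` a PR `!testme` carries), empty or `None` for a run on latest.

```python
from typing import Optional


def should_promote(rc: int, enrolled: bool, ref: Optional[str], quick: bool) -> bool:
    """Advance the canonical only after a green, full-cold run on latest of an enrolled recipe."""
    is_cold_on_latest = not quick and not ref
    return rc == 0 and enrolled and is_cold_on_latest
```

Every other combination (a red run, a PR run, a `--quick` run, or a recipe that is not enrolled) leaves the existing known-good untouched.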
## Resource safety + isolation (WC8)

- **Serialize:** `DRONE_RUNNER_CAPACITY = MAX_TESTS` (default 1); the nightly sweep is serial and
  skips if a `run_recipe_ci.py` is active. At most MAX_TESTS apps are ever live at once.
- **Warm keycloak shared safely** via per-run namespaced realms (above); orphan realms reaped.
- **Disk** (the warm budget is disk, not RAM): `virtualisation.docker.autoPrune` prunes
  images/containers/networks/build-cache older than 24h but **never `--volumes`** (so data-warm
  canonical volumes survive). Each canonical = one data volume + one snapshot (small; the keycloak DB
  snapshot ~300M dominates). `canonical.prune_stale()` (run nightly) drops warm data for
  **de-enrolled** canonicals. Monitor with `df -h /` (the nightly logs it).
- **Cold teardown stays sacred:** a cold per-run app's volumes/secrets are always deleted at run end
  (or janitor-reaped); promote re-seeds the canonical separately (never reuses a per-run volume).
- **Excluded from D8:** `/var/lib/ci-warm/` is runtime cache — no Nix module declares it as a source;
  a from-scratch rebuild re-seeds canonicals via cold runs, it does not restore them.

## The `--quick` rollback proof (WC9)

Deliberately failing a PR under `--quick` restores the canonical's last-known-good intact, and a
`--quick` pass does not move the known-good — both proven live on the custom-html canonical:
- **PASS keeps known-good:** a `--quick` PASS run left the registry version + the snapshot tar
  **byte-identical** (Adversary-verified sha256) and the canonical idle with its volume retained.
- **FAIL restores known-good:** a `--quick` run against a broken PR head (bad image) → `quick FAIL →
  restored known-good data; canonical idle`, exit 1; the snapshot was byte-identical, the known-good
  marker was back, the app served 200, and the broken image was gone. The known-good version was
  never advanced.

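The byte-identical check can be reproduced by hashing the snapshot files before and after a `--quick` run and comparing the results. This is a minimal sketch, not the Adversary's actual procedure: `sha256_file` and `snapshot_digests` are hypothetical helpers, and it assumes the snapshot directory holds one regular file per volume tar.

```python
import hashlib
import os


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large tars are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def snapshot_digests(snapshot_dir: str) -> dict:
    """Map each regular file in the snapshot directory to its SHA-256 hex digest."""
    return {
        name: sha256_file(os.path.join(snapshot_dir, name))
        for name in sorted(os.listdir(snapshot_dir))
        if os.path.isfile(os.path.join(snapshot_dir, name))
    }
```

Capturing `snapshot_digests(...)` for `/var/lib/ci-warm/<recipe>/snapshot/` before the run and again afterwards, then asserting the two dictionaries are equal, covers both the PASS case (snapshot untouched) and the FAIL case (snapshot still the one that was restored from).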
## Operate / debug

- Inspect a canonical: `cat /var/lib/ci-warm/<recipe>/canonical.json`; `warmsnap` snapshot under
  `…/snapshot/`. Enrolled recipes: `canonical.enrolled_recipes()`.
- Run a quick test manually: `RECIPE=<r> CCCI_QUICK=1 cc-ci-run runner/run_recipe_ci.py`.
- Trigger the nightly sweep: `systemctl start nightly-sweep.service` (journal shows the roll + sweep).
- Roll/repair warm keycloak or traefik: `cc-ci-run runner/warm_reconcile.py {keycloak|traefik}`.
- Alerts: `ls /var/lib/ci-warm/alerts/` (active) and `…/seen/` (relayed).
@@ -377,3 +377,20 @@ red; added util-linux (matching cc-ci-run). After both fixes, the live SERVICE s
(known-good stayed 1.10.0 — never lose known-good). W3 (WC5+WC6) essentially closed. Remaining:
WC8 (resource/isolation hardening — mostly already in place) + WC9 (docs + --quick rollback proof,
already shown) → then DONE.

## 2026-05-29 — W4 WC8 + WC9 (final gates) built + claimed; DONE pending their PASS

WC6 ADVERSARY PASS (REVIEW-2w b8b698e). Then built the final two:
- **WC8 resource safety + isolation** — most was already in place; consolidated + added the missing
  piece: `canonical.prune_stale()` drops `/var/lib/ci-warm/<recipe>/` + the `warm-<recipe>` volumes
  for DE-ENROLLED canonicals (keeps enrolled + reconciler dirs keycloak/traefik + alerts/), wired
  into the nightly sweep + a `df` log. +1 unit (72 pass). Verified live: DRONE_RUNNER_CAPACITY=maxTests
  (serialize); autoPrune flags omit `--volumes` (warm vols survive); `grep ci-warm nix/` = comment
  only (excluded from D8); disk 50%, warm ~318M.
- **WC9 docs** — `docs/warm.md`: the full warm/quick model (live/data-warm/cold, warm-<recipe> scheme,
  health-gated reconcilers + WC1.2 safety gate + alerts, canonicals + warmsnap + enroll, --quick,
  promote-on-green-cold, nightly sweep, resource safety, operate/debug) + the `--quick` rollback proof
  (FAIL restores exact known-good; PASS byte-identical snapshot — proven W2/WC4).

Claimed WC8+WC9 (the final gates). On their PASS, EVERY WC1–WC9 (incl WC1.1/WC1.2) is Adversary-verified
→ write `## DONE` to STATUS-2w (handshake: <24h PASS for all + no VETO) → watchdog returns to Phase 2.
@@ -44,15 +44,19 @@ nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversa
  health-gated, WC1.1) → SERIAL full-cold run over enrolled (`canonical.enrolled_recipes`) recipes
  on latest → each green run promotes its canonical (WC5); skips if a test is in flight. Proven via
  the live service: enrolled=['custom-html'] → all tiers green → canonical advanced 1.10.0→1.11.0.
  **CLAIMED — see Gate.**
  **Adversary PASS @2026-05-29** (REVIEW-2w b8b698e, gate 465e105).
- [x] **WC7** — Trigger/authority/labeling: default `!testme`=cold (unchanged); `--quick` opt-in via
  bridge `parse_trigger` (`!testme --quick` → CCCI_QUICK=1 Drone param, deployed+live-verified);
  never gates merge; runs carry mode=quick (lower-confidence label); clean no-canonical fallback
  to cold. **Adversary PASS @2026-05-29** (REVIEW-2w 31f0e42, gate 3ff2bf6).
- [ ] **WC8** — Resource safety + isolation: warm runs serialize per app; warm keycloak shared via
  per-run realms; disk monitored+pruned; cold teardown sacred; warm data excluded from D8 closure.
- [ ] **WC9** — Docs + cold verify incl. the rollback proof (deliberately fail a PR under `--quick`,
  confirm last-known-good restored intact; a `--quick` pass did not move the known-good).
- [x] **WC8** — Resource safety + isolation: serialize via `DRONE_RUNNER_CAPACITY=MAX_TESTS` + serial
  nightly that skips-if-test-active; warm keycloak shared via per-run realms (WC1); disk
  monitored+pruned (autoPrune omits `--volumes` so warm vols survive; `canonical.prune_stale`
  drops de-enrolled warm data nightly; nightly logs `df`); cold teardown sacred; warm data
  EXCLUDED from D8 (no Nix module references `/var/lib/ci-warm` as a source). **CLAIMED — see Gate.**
- [x] **WC9** — `docs/warm.md` documents the full warm/quick model; the `--quick` rollback proof
  (FAIL restores last-known-good intact; PASS doesn't move it) is proven live (W2 FAIL + WC4
  Adversary byte-identical-snapshot verify). **CLAIMED — see Gate.**

## Milestones (plan §3)
- **W0** — Warm keycloak (WC1/WC1.1-keycloak/WC1.2). ✅ Adversary PASS @2026-05-29.
@@ -138,7 +142,42 @@ headline e2e is green (below). No recipe/harness change needed.

## Gate

### Gate: WC6 — CLAIMED, awaiting Adversary (@2026-05-29)
### Gate: WC8 + WC9 — CLAIMED, awaiting Adversary (@2026-05-29) [FINAL gates]

**WHAT.** WC8 resource safety/isolation (consolidated + a stale-warm prune) + WC9 docs + the proven
`--quick` rollback. **WHERE:** `runner/harness/canonical.py` (`prune_stale`), `runner/nightly_sweep.py`
(prune + df after sweep), `nix/modules/{drone-runner,swarm}.nix` (capacity, autoPrune), `docs/warm.md`.

**HOW + EXPECTED (cold):**
1. **Units:** `cc-ci-run -m pytest tests/unit -q` → **72 passed** (incl. test_canonical prune_stale:
   drops de-enrolled canonical dirs, keeps enrolled + reconciler dirs + alerts/).
2. **WC8 serialize:** `grep DRONE_RUNNER_CAPACITY nix/modules/drone-runner.nix` → `= maxTests`
   (MAX_TESTS, default 1); `nightly_sweep.py` `_another_run_active()` skips if a run is in flight;
   sweep loop is serial.
3. **WC8 disk/prune:** `grep flags nix/modules/swarm.nix` → `[ "--all" "--filter" "until=24h" ]`
   (NO `--volumes` → warm volumes survive); `canonical.prune_stale()` drops `/var/lib/ci-warm/<r>/`
   (+ its `warm-<r>` volumes) for recipes no longer WARM_CANONICAL, run nightly; `df -h /` logged by
   the sweep. Live: disk `/` 50% (14G free); warm total ~318M (keycloak DB snapshot dominates).
4. **WC8 cold teardown sacred:** proven across W2/WC5/WC6 (no `<recipe>-<6hex>` leftovers post-run).
5. **WC8 excluded from D8:** `grep -rn ci-warm nix/` → only a COMMENT (no Nix source declares
   `/var/lib/ci-warm`); it's runtime cache re-seeded by cold runs.
6. **WC9 docs:** `docs/warm.md` covers live-warm/data-warm/cold, the reconcilers + health-gate +
   safety gate + alerts, canonicals + snapshots + enroll, `--quick`, promote-on-green-cold, the
   nightly sweep, resource safety, and the `--quick` rollback proof + operate/debug.
7. **WC9 `--quick` rollback proof:** already cold-verified — W2 FAIL run restored the exact
   known-good; WC4 Adversary verify confirmed a PASS run leaves the snapshot byte-identical (does NOT
   move the known-good). Re-runnable per docs/warm.md "The --quick rollback proof".

**On WC8+WC9 PASS → ALL of WC1–WC9 (incl WC1.1/WC1.2) verified → Builder writes `## DONE`.**

---

### Gate: WC6 — ✅ Adversary PASS @2026-05-29 (REVIEW-2w b8b698e, gate 465e105)
Declarative timer (Persistent) + orchestration + the live systemd-service run (infra roll
health-gated → serial cold sweep → canonical advanced, infra healthy, no leftovers) cold-verified.
Builder may proceed to W4 (WC8/WC9). (claim detail retained below.)

### (claimed, now PASS) Gate: WC6 — CLAIMED detail

**WHAT.** Nightly full-cold sweep: a scheduled job rolls warm/infra to latest (health-gated, WC1.1)
then runs the full COLD suite serially across enrolled canonical recipes on latest — refreshing each
@@ -138,6 +138,36 @@ def undeploy_keep_volume(recipe: str) -> None:
    _set_status(recipe, "idle")


def prune_stale() -> list[str]:
    """WC8 disk hygiene: remove warm data for DE-ENROLLED canonicals — a `/var/lib/ci-warm/<recipe>/`
    that carries a `canonical.json` but whose recipe is no longer enrolled (WARM_CANONICAL dropped).
    Drops the dir (snapshot + registry) AND the retained `warm-<recipe>` data volumes. Leaves the
    live-warm reconciler dirs (keycloak/traefik — they have a `last_good`, no `canonical.json`),
    `alerts/`, and currently-enrolled canonicals untouched. Returns the recipes pruned."""
    import shutil
    import subprocess

    root = warmsnap.warm_root()
    keep = set(enrolled_recipes())
    pruned: list[str] = []
    try:
        entries = sorted(os.listdir(root))
    except OSError:
        return pruned
    for name in entries:
        d = os.path.join(root, name)
        if not os.path.isdir(d) or name in keep:
            continue
        if not os.path.isfile(os.path.join(d, "canonical.json")):
            continue  # not a data-warm canonical (e.g. keycloak/traefik reconciler dir, alerts/)
        # drop the retained warm-<recipe> volumes, then the snapshot/registry dir
        for vol in warmsnap.stack_volumes(canonical_domain(name)):
            subprocess.run(["docker", "volume", "rm", vol], capture_output=True, text=True)
        shutil.rmtree(d, ignore_errors=True)
        pruned.append(name)
    return pruned


def seed_canonical(recipe: str, version: str, commit: str | None = None) -> dict:
    """Record <warm-domain> (already deployed at `version`) as the recipe's canonical: write the
    registry, then (app must be UNDEPLOYED) take the known-good snapshot. Caller deploys + verifies
@@ -68,6 +68,12 @@ def sweep() -> int:
        ).returncode
        results[r] = rc
        print(f"nightly: {r} rc={rc} ({'green→canonical refreshed' if rc == 0 else 'red'})", flush=True)
    # WC8 disk hygiene: drop warm data for de-enrolled canonicals; log the disk budget.
    pruned = canonical.prune_stale()
    if pruned:
        print(f"nightly: pruned stale warm data for de-enrolled canonicals: {pruned}", flush=True)
    df = subprocess.run(["df", "-h", "/"], capture_output=True, text=True)
    print(f"nightly: disk / →\n{df.stdout.strip()}", flush=True)
    print("\n===== nightly sweep summary =====", flush=True)
    for r, rc in results.items():
        print(f"  {r}: {'PASS' if rc == 0 else 'FAIL'}", flush=True)
@@ -74,3 +74,23 @@ def test_enrolled_recipes_scans_meta(tmp_path, monkeypatch):
        (d / "recipe_meta.py").write_text(body)
    (tmp_path / "tests" / "ddd").mkdir(parents=True)  # no recipe_meta.py at all
    assert canonical.enrolled_recipes() == ["aaa", "ccc"]


def test_prune_stale_drops_deenrolled_only(tmp_path, monkeypatch):
    # prune_stale removes <recipe>/ dirs that have a canonical.json but aren't enrolled; keeps
    # enrolled canonicals, reconciler dirs (no canonical.json), and alerts/.
    monkeypatch.setenv("CCCI_WARM_ROOT", str(tmp_path))
    monkeypatch.setattr(canonical, "enrolled_recipes", lambda: ["keepme"])
    monkeypatch.setattr(canonical.warmsnap, "stack_volumes", lambda d: [])  # no docker in unit
    # enrolled canonical (keep), de-enrolled canonical (prune), reconciler dir (keep), alerts (keep)
    for name in ("keepme", "gone"):
        (tmp_path / name).mkdir()
        (tmp_path / name / "canonical.json").write_text('{"recipe":"%s"}' % name)
    (tmp_path / "keycloak").mkdir(); (tmp_path / "keycloak" / "last_good").write_text("v1")  # reconciler
    (tmp_path / "alerts").mkdir()
    pruned = canonical.prune_stale()
    assert pruned == ["gone"]
    assert not (tmp_path / "gone").exists()
    assert (tmp_path / "keepme").exists()
    assert (tmp_path / "keycloak").exists()  # no canonical.json → not a canonical → kept
    assert (tmp_path / "alerts").exists()