# Warm deployments + `--quick` CI mode (Phase 2w)

cc-ci keeps a small set of apps **warm** so SSO-dependent tests and an opt-in fast lane avoid paying the full cold-provisioning cost every run. Three states (use these terms):

- **live-warm**: actually deployed and running (keycloak, traefik); instant to use, costs RAM.
- **data-warm**: *undeployed* (RAM freed) but its **data volume is retained**, so a later `abra app deploy` reattaches it and boots warm (skips fresh DB-init/first-boot); costs only disk.
- **cold**: no retained data; fresh `abra app new` + new volume + full lifecycle + teardown that deletes the volume. **The authoritative default** (`!testme` = full cold).

**Stable-domain scheme:** warm apps live at `warm-.ci.commoninternet.net`, deliberately distinct from the cold per-run scheme `-<6hex>.ci...` so a warm app is never confused with a disposable cold run. Warm volumes + snapshots live under `/var/lib/ci-warm//` and are **cache, not source**: re-seeded by cold runs, **excluded from the D8 reproducibility closure** (no Nix module declares them as a source).

## Live-warm keycloak + traefik: auto-update, health-gated, with rollback

Both are **unpinned** and reconciled by `runner/warm_reconcile.py ` (driven by the systemd oneshots `warm-keycloak.service` / `deploy-proxy.service`, re-run every activation/boot). On each reconcile (and nightly, WC6):

1. **WC1.2 pre-deploy safety gate (first).** Compare current→latest. **Auto-apply only non-major (patch/minor) bumps with no manual-migration release notes.** A **MAJOR** recipe/app-version bump, or a target whose `releaseNotes/.md` flags a manual migration, is **NOT auto-applied**: stay on current + write an alert with the notes for the operator. (A health pass ≠ migration done.)
2. **WC1.1 post-deploy health gate.** Record running version = last-good → deploy latest → health-check → **healthy: commit last-good := latest; unhealthy: roll back to last-good + alert.**
   - **keycloak is stateful:** undeploy → **snapshot the data volume** → deploy latest → on failure **restore the snapshot** + redeploy the prior version (a forward DB migration makes a version-only rollback unsafe).
   - **traefik is stateless:** version rollback only (no snapshot).

keycloak is the **shared SSO provider**: SSO-dependent recipes point their `setup_custom_tests` at the one warm keycloak and create a **per-run namespaced realm** `-<6hex>` (created at run start, deleted at run end). Concurrent dependents get distinct realms; orphaned realms (crashed runs) are reaped by hex not matching a live app stack.

**Alerts.** A reconciler that rolls back (WC1.1) or holds an upgrade (WC1.2) writes a sentinel JSON to `/var/lib/ci-warm/alerts/*.json`. The Builder loop relays new alerts (PushNotification) and archives them to `alerts/seen/`, bridging the autonomous reconciler to operator visibility.

## Data-warm canonicals (WC2/WC3)

A **canonical** is a per-recipe known-good deployment at `warm-`, kept data-warm (undeployed-when-idle, volume retained), tracked by `runner/harness/canonical.py`:

- **Enroll a recipe:** set `WARM_CANONICAL = True` in `tests//recipe_meta.py`. That's it.
- **Registry:** `/var/lib/ci-warm//canonical.json` = `{recipe, domain, version, commit, status, ts}`.
- **Known-good snapshot (WC3):** `runner/harness/warmsnap.py` takes a **raw per-volume tar while the app is UNDEPLOYED** under `/var/lib/ci-warm//snapshot/` (**one last-good per app**, atomic replace). `restore()` clears + untars each volume back; proven to round-trip data.

## `--quick` opt-in fast lane (WC4/WC7)

`!testme` = full **cold** (default, authoritative). `!testme --quick` = opt-in **lower-confidence** fast lane (the bridge parses it → `CCCI_QUICK=1` Drone param; `run_quick` in `run_recipe_ci.py`):

1. Reattach the canonical (`deploy_canonical`: warm boot at known-good) → wait healthy.
2. (deps) use the warm keycloak + a per-run realm.
3. **Upgrade in place to the PR head** (chaos): the op, once.
4. Assert: generic UPGRADE (reconverge + moved + serving) + recipe overlay + custom.
5. **PASS → undeploy-keep-volume; known-good UNCHANGED (never promote).** **FAIL → restore the last-known-good snapshot + undeploy (roll back, data safe).**

`--quick` **never gates merge** and **never advances the canonical**. If no canonical exists it falls back cleanly to a full cold run (the PR is still tested).

## Cold-only canonical advancement (WC5) + nightly sweep (WC6)

- **WC5 promote-on-green-cold.** A **GREEN full-cold run on LATEST** (no PR head) of an enrolled recipe re-seeds the canonical at the green-verified latest (snapshot + registry, atomic). The old known-good is replaced **only** after green, so it is **never lost on a red run**. The FIRST green cold run seeds the canonical. A PR `!testme` (carries REF) and `--quick` **never** promote; only cold-on-latest (the nightly sweep, or a manual `RECIPE=` run) advances it.
- **WC6 nightly sweep.** `nightly-sweep.timer` (03:00, Persistent) → `nightly_sweep.py`: roll warm/infra to latest (health-gated, WC1.1) → **serial** full-cold run across enrolled recipes on latest (each green run promotes its canonical) → prune stale warm data → log disk. Serial honors MAX_TESTS; skips if a test is already in flight.

## Resource safety + isolation (WC8)

- **Serialize:** `DRONE_RUNNER_CAPACITY = MAX_TESTS` (default 1); the nightly sweep is serial and skips if a `run_recipe_ci.py` is active. At most MAX_TESTS apps are ever live at once.
- **Warm keycloak shared safely** via per-run namespaced realms (above); orphan realms reaped.
- **Disk** (warm is the budget, not RAM): the `ci-docker-prune` unit (`nix/modules/docker-prune.nix`, Phase-2pc) prunes only **dangling** images/containers/build-cache (`until=24h`), and only under genuine disk pressure (`/` ≥ 80%) with nothing in flight; **never `--all`** (keeps cached base/in-use images warm; the local store IS the cache on this single host) and **never `--volumes`** (so data-warm canonical volumes survive). Each canonical = one data volume + one snapshot (small; the keycloak DB snapshot ~300M dominates). `canonical.prune_stale()` (run nightly) drops warm data for **de-enrolled** canonicals. Monitor with `df -h /` (the nightly logs it).
- **Cold teardown stays sacred:** a cold per-run app's volumes/secrets are always deleted at run end (or janitor-reaped); promote re-seeds the canonical separately (never reuses a per-run volume).
- **Excluded from D8:** `/var/lib/ci-warm/` is runtime cache: no Nix module declares it as a source; a from-scratch rebuild re-seeds canonicals via cold runs, it does not restore them.

## The `--quick` rollback proof (WC9)

Deliberately failing a PR under `--quick` restores the canonical's last-known-good intact, and a `--quick` pass does not move the known-good; both proven live on the custom-html canonical:

- **PASS keeps known-good:** a `--quick` PASS run left the registry version + the snapshot tar **byte-identical** (Adversary-verified sha256) and the canonical idle with its volume retained.
- **FAIL restores known-good:** a `--quick` run against a broken PR head (bad image) → `quick FAIL → restored known-good data; canonical idle`, exit 1; the snapshot was byte-identical, the known-good marker was back, the app served 200, and the broken image was gone. The known-good version was never advanced.

## Operate / debug

- Inspect a canonical: `cat /var/lib/ci-warm//canonical.json`; `warmsnap` snapshot under `…/snapshot/`. Enrolled recipes: `canonical.enrolled_recipes()`.
- Run a quick test manually: `RECIPE= CCCI_QUICK=1 cc-ci-run runner/run_recipe_ci.py`.
- Trigger the nightly sweep: `systemctl start nightly-sweep.service` (journal shows the roll + sweep).
- Roll/repair warm keycloak or traefik: `cc-ci-run runner/warm_reconcile.py {keycloak|traefik}`.
- Alerts: `ls /var/lib/ci-warm/alerts/` (active) and `…/seen/` (relayed).
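
When reading a reconcile journal or an alert, it can help to have the WC1.2/WC1.1 decision order from the live-warm section in one place. The sketch below is an illustration only, not the code in `runner/warm_reconcile.py`: the names `reconcile`, `Outcome` and `major`, the injected callables, the `X.Y.Z` version format, and the "nothing to do when versions match" shortcut are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


def major(version: str) -> int:
    """Leading numeric component of an 'X.Y.Z'-style version (assumed format)."""
    return int(version.lstrip("v").split(".")[0])


@dataclass
class Outcome:
    running: str           # version left running after the reconcile
    alert: Optional[str]   # None, or the reason an operator should look


def reconcile(
    current: str,
    latest: str,
    manual_migration: bool,               # release notes flag a manual migration
    deploy: Callable[[str], None],        # hypothetical: deploy the given version
    healthy: Callable[[], bool],          # hypothetical: post-deploy health check
    snapshot: Optional[Callable[[], None]] = None,  # stateful apps: undeploy + snapshot
    restore: Optional[Callable[[], None]] = None,   # stateful apps: restore the snapshot
) -> Outcome:
    if latest == current:
        # Assumed shortcut: no version change, nothing to apply.
        return Outcome(current, None)

    # WC1.2 pre-deploy safety gate: hold major bumps and manual migrations.
    if manual_migration or major(latest) > major(current):
        return Outcome(current, f"held upgrade {current} -> {latest}")

    # WC1.1 post-deploy health gate; `current` is the recorded last-good.
    if snapshot is not None:
        snapshot()                        # keycloak-style: snapshot before deploying
    deploy(latest)
    if healthy():
        return Outcome(latest, None)      # commit last-good := latest

    # Unhealthy: roll back. Stateful apps restore data before redeploying.
    if restore is not None:
        restore()
    deploy(current)
    return Outcome(current, f"rolled back {latest} -> {current}")
```

A stateless app (traefik in this document) would call it without `snapshot`/`restore`, so a failed health check results in a version-only rollback; a stateful app (keycloak) passes both so the data volume is put back before the prior version is redeployed. Any non-`None` `alert` corresponds to the cases where the document says a sentinel JSON is written.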