cc-ci/docs/warm.md
autonomic-bot 40b03a9bf1 claim(2w): WC8 + WC9 (FINAL gates) — resource-safety consolidation + stale-warm prune + docs/warm.md + --quick rollback proof
WC8: canonical.prune_stale (drop de-enrolled warm data + volumes) wired into the
nightly sweep + df log; consolidated evidence (DRONE_RUNNER_CAPACITY=MAX_TESTS
serialize; autoPrune drops --volumes so warm vols survive; cold teardown sacred;
warm excluded from D8 — no nix source ref). +1 unit (72 pass). WC9: docs/warm.md
documents the full warm/quick model; --quick rollback proof already proven live
(W2 FAIL restores exact known-good; WC4 PASS byte-identical snapshot). On PASS,
all WC1-WC9 (incl WC1.1/WC1.2) verified → DONE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:43:34 +01:00


# Warm deployments + `--quick` CI mode (Phase 2w)
cc-ci keeps a small set of apps **warm** so SSO-dependent tests and an opt-in fast lane avoid paying
the full cold-provisioning cost every run. Three states (use these terms):
- **live-warm** — actually deployed and running (keycloak, traefik): instant to use, costs RAM.
- **data-warm** — *undeployed* (RAM freed) but its **data volume is retained**, so a later
`abra app deploy` reattaches it and boots warm (skips fresh DB-init/first-boot); costs only disk.
- **cold** — no retained data: fresh `abra app new` + new volume + full lifecycle + teardown that
deletes the volume. **The authoritative default** (`!testme` = full cold).

**Stable-domain scheme:** warm apps live at `warm-<recipe>.ci.commoninternet.net`, deliberately
distinct from the cold per-run scheme `<recipe[:4]>-<6hex>.ci...` so a warm app is never confused
with a disposable cold run. Warm volumes + snapshots live under `/var/lib/ci-warm/<recipe>/` and are
**cache, not source** — re-seeded by cold runs, **excluded from the D8 reproducibility closure** (no
Nix module declares them as a source).
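As an illustration of why the two naming schemes cannot be confused, here is a minimal sketch of how the first DNS label could be built and classified. The helper names (`warm_label`, `cold_label`, `is_warm`) are hypothetical and not taken from the cc-ci code, and the sketch assumes no recipe name itself begins with `warm`.

```python
import secrets

WARM_PREFIX = "warm-"


def warm_label(recipe: str) -> str:
    """Stable label for a warm app: warm-<recipe>."""
    return f"{WARM_PREFIX}{recipe}"


def cold_label(recipe: str) -> str:
    """Disposable per-run label: <recipe[:4]>-<6hex>."""
    return f"{recipe[:4]}-{secrets.token_hex(3)}"  # token_hex(3) yields 6 hex chars


def is_warm(domain: str) -> bool:
    """Classify a domain by its first label only (assumed convention)."""
    return domain.split(".", 1)[0].startswith(WARM_PREFIX)
```

A janitor that reaps cold runs could use a check like `is_warm` as a guard so that it never touches a stable warm domain.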
## Live-warm keycloak + traefik — auto-update, health-gated, with rollback
Both are **unpinned** and reconciled by `runner/warm_reconcile.py <app>` (driven by the systemd
oneshots `warm-keycloak.service` / `deploy-proxy.service`, re-run every activation/boot). On each
reconcile (and nightly, WC6):
1. **WC1.2 pre-deploy safety gate (first).** Compare current→latest. **Auto-apply only non-major
(patch/minor) bumps with no manual-migration release notes.** A **MAJOR** recipe/app-version bump,
or a target whose `releaseNotes/<version>.md` flags a manual migration, is **NOT auto-applied**:
the reconciler stays on the current version and writes an alert carrying the notes for the operator.
(A passing health check does not prove a migration was done.)
2. **WC1.1 post-deploy health gate.** Record running version = last-good → deploy latest →
health-check → **healthy: commit last-good := latest; unhealthy: roll back to last-good + alert.**
   - **keycloak is stateful:** undeploy → **snapshot the data volume** → deploy latest → on failure
     **restore the snapshot** + redeploy the prior version (a forward DB migration makes a
     version-only rollback unsafe).
   - **traefik is stateless:** version rollback only (no snapshot).
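The two gates above can be sketched as a small decision function plus a deploy-and-verify loop. This is not the code in `runner/warm_reconcile.py`; the function names, the injected callables, and the assumed version shape (`<recipe-version>+<app-version>`, each part dotted and numeric, with an optional leading `v`) are all illustrative assumptions.

```python
from typing import Callable


def majors(version: str) -> tuple[int, ...]:
    """Major number of each '+'-separated part, e.g. '1.2.0+26.0.1' -> (1, 26)."""
    return tuple(int(part.lstrip("v").split(".")[0]) for part in version.split("+"))


def safe_to_auto_apply(current: str, latest: str, manual_migration: bool) -> bool:
    """WC1.2: only patch/minor bumps with no manual-migration notes qualify."""
    if manual_migration:
        return False
    return majors(current) == majors(latest)


def reconcile(
    current: str,
    latest: str,
    manual_migration: bool,
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    rollback: Callable[[str], None],
    alert: Callable[[str], None],
) -> str:
    """Return the version that is running once the reconcile finishes."""
    if current == latest:
        return current
    if not safe_to_auto_apply(current, latest, manual_migration):
        alert(f"held upgrade {current} -> {latest}")  # operator decides
        return current
    last_good = current  # WC1.1: remember what to fall back to
    deploy(latest)
    if healthy():
        return latest  # commit last-good := latest
    rollback(last_good)
    alert(f"rolled back {latest} -> {last_good}")
    return last_good
```

In this sketch the stateful/stateless difference lives entirely inside the `rollback` callable: for keycloak it would restore the data-volume snapshot before redeploying, for traefik it would only redeploy the prior version.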

keycloak is the **shared SSO provider**: SSO-dependent recipes point their `setup_custom_tests` at
the one warm keycloak and create a **per-run namespaced realm** `<parent>-<6hex>` (created at run
start, deleted at run end). Concurrent dependents get distinct realms; orphaned realms (left by
crashed runs) are reaped when their hex suffix matches no live app stack.
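A minimal sketch of the realm naming and orphan detection follows. It assumes the realm's 6-hex suffix is the same one that appears in the dependent app's stack name, and it leaves the actual Keycloak admin calls out; `new_realm` and `orphaned_realms` are hypothetical names.

```python
import re
import secrets

# A per-run realm looks like <parent>-<6hex>.
REALM = re.compile(r"^(?P<parent>.+)-(?P<hex>[0-9a-f]{6})$")


def new_realm(parent: str) -> str:
    """Name for a fresh per-run realm."""
    return f"{parent}-{secrets.token_hex(3)}"


def orphaned_realms(realms: list[str], live_stacks: list[str]) -> list[str]:
    """Per-run realms whose hex suffix appears in no live stack name."""
    orphans = []
    for realm in realms:
        match = REALM.match(realm)
        if match is None:
            continue  # not a per-run realm (e.g. "master"): never reap
        tag = f"-{match['hex']}"
        if not any(tag in stack for stack in live_stacks):
            orphans.append(realm)
    return orphans
```

Skipping realms that do not match the per-run pattern keeps built-in or hand-made realms out of the reaper's reach.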

**Alerts.** A reconciler that rolls back (WC1.1) or holds an upgrade (WC1.2) writes a sentinel JSON to
`/var/lib/ci-warm/alerts/*.json`. The Builder loop relays new alerts (PushNotification) and archives
them to `alerts/seen/` — bridging the autonomous reconciler to operator visibility.
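The sentinel-file handoff between the reconciler and the relaying loop could look roughly like this. The JSON fields, the file-name pattern, and both function names are assumptions made for illustration; only the `alerts/` and `alerts/seen/` layout comes from the text above.

```python
import json
import os
import time
from pathlib import Path
from typing import Callable


def write_alert(alerts_dir: Path, app: str, kind: str, detail: str) -> Path:
    """Writer side: drop one sentinel JSON, atomically."""
    alerts_dir.mkdir(parents=True, exist_ok=True)
    path = alerts_dir / f"{int(time.time())}-{app}-{kind}.json"
    tmp = path.with_suffix(".tmp")  # not matched by the *.json glob below
    tmp.write_text(json.dumps({"app": app, "kind": kind, "detail": detail}))
    os.replace(tmp, path)  # readers never see a half-written file
    return path


def relay_new(alerts_dir: Path, notify: Callable[[dict], None]) -> int:
    """Reader side: notify for each new alert, then archive it to seen/."""
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(alerts_dir.glob("*.json")):
        notify(json.loads(path.read_text()))
        os.replace(path, seen / path.name)
        count += 1
    return count
```

Archiving only after `notify` returns means a crash in between re-sends the alert on the next pass rather than losing it.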
## Data-warm canonicals (WC2/WC3)
A **canonical** is a per-recipe known-good deployment at `warm-<recipe>`, kept data-warm
(undeployed-when-idle, volume retained), tracked by `runner/harness/canonical.py`:
- **Enroll a recipe:** set `WARM_CANONICAL = True` in `tests/<recipe>/recipe_meta.py`. That's it.
- **Registry:** `/var/lib/ci-warm/<recipe>/canonical.json` = `{recipe, domain, version, commit,
status, ts}`.
- **Known-good snapshot (WC3):** `runner/harness/warmsnap.py` takes a **raw per-volume tar while the
app is UNDEPLOYED** under `/var/lib/ci-warm/<recipe>/snapshot/` — **one last-good per app**, atomic
replace. `restore()` clears + untars each volume back; proven to round-trip data.
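To make the "one last-good, atomic replace" and "clear + untar" behaviour concrete, here is a small sketch using only the standard library. It is not `warmsnap.py`: it operates on a plain directory standing in for a volume's data path, and the function signatures are assumptions.

```python
import os
import shutil
import tarfile
import tempfile
from pathlib import Path


def snapshot(volume_dir: Path, snapshot_dir: Path, name: str) -> Path:
    """Tar volume_dir into snapshot_dir/<name>.tar, replacing any older tar atomically."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    final = snapshot_dir / f"{name}.tar"
    fd, tmp = tempfile.mkstemp(dir=snapshot_dir, suffix=".tmp")
    os.close(fd)
    try:
        with tarfile.open(tmp, "w") as tar:
            tar.add(volume_dir, arcname=".")  # paths stored relative to the volume root
        os.replace(tmp, final)  # the old snapshot survives until this succeeds
    finally:
        Path(tmp).unlink(missing_ok=True)  # no-op after a successful replace
    return final


def restore(snapshot_tar: Path, volume_dir: Path) -> None:
    """Clear volume_dir, then unpack the snapshot back into it."""
    for child in volume_dir.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
    with tarfile.open(snapshot_tar, "r") as tar:
        tar.extractall(volume_dir)
```

Writing to a temporary file in the same directory and renaming it over the target is what keeps the previous known-good tar intact if the snapshot is interrupted.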
## `--quick` opt-in fast lane (WC4/WC7)
`!testme` = full **cold** (default, authoritative). `!testme --quick` = opt-in **lower-confidence**
fast lane (the bridge parses it → `CCCI_QUICK=1` Drone param; `run_quick` in `run_recipe_ci.py`):
1. Reattach the canonical (`deploy_canonical` — warm boot at known-good) → wait healthy.
2. (deps) use the warm keycloak + a per-run realm.
3. **Upgrade in place to the PR head** (chaos) — the op, once.
4. Assert: generic UPGRADE (reconverge + moved + serving) + recipe overlay + custom.
5. **PASS → undeploy-keep-volume; known-good UNCHANGED (never promote).**
**FAIL → restore the last-known-good snapshot + undeploy (roll back, data safe).**

`--quick` **never gates merge** and **never advances the canonical**. If no canonical exists it falls
back cleanly to a full cold run (the PR is still tested).
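The control flow of the fast lane can be summarised in a sketch with every side effect injected as a callable. This is not `run_quick` from `run_recipe_ci.py`; the `QuickSteps` container is hypothetical, and undeploying before restoring on failure is an assumption (made so the volume is idle while it is rewritten).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QuickSteps:
    has_canonical: Callable[[], bool]
    run_cold: Callable[[], int]
    deploy_canonical: Callable[[], None]
    upgrade_to_pr_head: Callable[[], None]
    assertions_pass: Callable[[], bool]
    restore_snapshot: Callable[[], None]
    undeploy_keep_volume: Callable[[], None]


def run_quick(s: QuickSteps) -> int:
    """Return a process-style exit code: 0 on PASS, 1 on FAIL."""
    if not s.has_canonical():
        return s.run_cold()  # clean fallback: the PR is still tested
    s.deploy_canonical()  # warm boot at known-good
    try:
        s.upgrade_to_pr_head()  # the single operation under test
        ok = s.assertions_pass()
    except Exception:
        ok = False  # a crashed upgrade counts as FAIL
    s.undeploy_keep_volume()
    if not ok:
        s.restore_snapshot()  # roll data back to last-known-good
    # There is deliberately no promote step on either branch.
    return 0 if ok else 1
```

The absence of any promote call is the point of the sketch: neither outcome can move the known-good.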
## Cold-only canonical advancement (WC5) + nightly sweep (WC6)
- **WC5 promote-on-green-cold.** A **GREEN full-cold run on LATEST** (no PR head) of an enrolled
recipe re-seeds the canonical at the green-verified latest (snapshot + registry, atomic). The
old known-good is replaced **only** after green — **never lost on a red run**. The FIRST green cold
run seeds the canonical. A PR `!testme` (carries REF) and `--quick` **never** promote — only
cold-on-latest (the nightly sweep, or a manual `RECIPE=<r>` run) advances it.
- **WC6 nightly sweep.** `nightly-sweep.timer` (03:00, Persistent) → `nightly_sweep.py`: roll
warm/infra to latest (health-gated, WC1.1) → **serial** full-cold run across enrolled recipes on
latest (each green run promotes its canonical) → prune stale warm data → log disk. Running serially
honors MAX_TESTS, and the sweep is skipped if a test is already in flight.
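The promotion rule and the serial cold pass can be written down as a predicate plus a loop. Both functions below are illustrative assumptions and not taken from `nightly_sweep.py` or `canonical.py`.

```python
from typing import Callable, Iterable, Optional


def should_promote(*, enrolled: bool, green: bool, ref: Optional[str], quick: bool) -> bool:
    """WC5: only a green, full-cold run on latest (no PR ref) of an enrolled recipe promotes."""
    return enrolled and green and not quick and not ref


def nightly_cold_pass(
    recipes: Iterable[str],
    run_cold: Callable[[str], bool],
    promote: Callable[[str], None],
    test_in_flight: Callable[[], bool],
) -> list[str]:
    """Run enrolled recipes one at a time; return those whose canonical advanced."""
    promoted: list[str] = []
    if test_in_flight():
        return promoted  # skip the whole sweep rather than exceed capacity
    for recipe in recipes:  # strictly serial
        green = run_cold(recipe)
        if should_promote(enrolled=True, green=green, ref=None, quick=False):
            promote(recipe)
            promoted.append(recipe)
    return promoted
```

A red run simply falls through the loop body, so the previous known-good stays in place.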
## Resource safety + isolation (WC8)
- **Serialize:** `DRONE_RUNNER_CAPACITY = MAX_TESTS` (default 1); the nightly sweep is serial and
skips if a `run_recipe_ci.py` is active. At most MAX_TESTS apps are ever live at once.
- **Warm keycloak shared safely** via per-run namespaced realms (above); orphan realms reaped.
- **Disk** (for data-warm apps the constrained resource is disk, not RAM):
`virtualisation.docker.autoPrune` prunes images/containers/networks/build-cache older than 24h but
**never passes `--volumes`** (so data-warm
canonical volumes survive). Each canonical = one data volume + one snapshot (small; the keycloak DB
snapshot ~300M dominates). `canonical.prune_stale()` (run nightly) drops warm data for
**de-enrolled** canonicals. Monitor with `df -h /` (the nightly logs it).
- **Cold teardown stays sacred:** a cold per-run app's volumes/secrets are always deleted at run end
(or janitor-reaped); promote re-seeds the canonical separately (never reuses a per-run volume).
- **Excluded from D8:** `/var/lib/ci-warm/` is runtime cache — no Nix module declares it as a source;
a from-scratch rebuild re-seeds canonicals via cold runs, it does not restore them.
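As a rough picture of the stale-warm prune, the sketch below finds registry directories whose recipe is no longer enrolled. It is not `canonical.prune_stale()`: the `dry_run` flag, the `RESERVED` set, and the rule that only directories containing `canonical.json` count as canonicals are assumptions, and removal of the Docker volume itself is left out.

```python
import shutil
from pathlib import Path

RESERVED = {"alerts"}  # shared directories that are not per-recipe data


def stale_canonicals(warm_root: Path, enrolled: set[str]) -> list[str]:
    """Recipes that have a registry entry on disk but are no longer enrolled."""
    return sorted(
        d.name
        for d in warm_root.iterdir()
        if d.is_dir()
        and d.name not in RESERVED
        and d.name not in enrolled
        and (d / "canonical.json").exists()  # only touch real canonicals
    )


def prune_stale(warm_root: Path, enrolled: set[str], dry_run: bool = True) -> list[str]:
    """Delete warm data for de-enrolled canonicals; return what was (or would be) dropped."""
    stale = stale_canonicals(warm_root, enrolled)
    if not dry_run:
        for name in stale:
            shutil.rmtree(warm_root / name)
    return stale
```

Defaulting to a dry run is a design choice of this sketch, so the list can be logged and checked before anything is removed.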
## The `--quick` rollback proof (WC9)
Deliberately failing a PR under `--quick` restores the canonical's last-known-good intact, and a
`--quick` pass does not move the known-good — both proven live on the custom-html canonical:
- **PASS keeps known-good:** a `--quick` PASS run left the registry version + the snapshot tar
**byte-identical** (Adversary-verified sha256) and the canonical idle with its volume retained.
- **FAIL restores known-good:** a `--quick` run against a broken PR head (bad image) → `quick FAIL →
restored known-good data; canonical idle`, exit 1; the snapshot was byte-identical, the known-good
marker was back, the app served 200, and the broken image was gone. The known-good version was
never advanced.
## Operate / debug
- Inspect a canonical: `cat /var/lib/ci-warm/<recipe>/canonical.json`; `warmsnap` snapshot under
`…/snapshot/`. Enrolled recipes: `canonical.enrolled_recipes()`.
- Run a quick test manually: `RECIPE=<r> CCCI_QUICK=1 cc-ci-run runner/run_recipe_ci.py`.
- Trigger the nightly sweep: `systemctl start nightly-sweep.service` (journal shows the roll + sweep).
- Roll/repair warm keycloak or traefik: `cc-ci-run runner/warm_reconcile.py {keycloak|traefik}`.
- Alerts: `ls /var/lib/ci-warm/alerts/` (active) and `…/seen/` (relayed).