cc-ci/docs/warm.md
autonomic-bot 40b03a9bf1 claim(2w): WC8 + WC9 (FINAL gates) — resource-safety consolidation + stale-warm prune + docs/warm.md + --quick rollback proof
WC8: canonical.prune_stale (drop de-enrolled warm data + volumes) wired into the
nightly sweep + df log; consolidated evidence (DRONE_RUNNER_CAPACITY=MAX_TESTS
serialize; autoPrune drops --volumes so warm vols survive; cold teardown sacred;
warm excluded from D8 — no nix source ref). +1 unit (72 pass). WC9: docs/warm.md
documents the full warm/quick model; --quick rollback proof already proven live
(W2 FAIL restores exact known-good; WC4 PASS byte-identical snapshot). On PASS,
all WC1-WC9 (incl WC1.1/WC1.2) verified → DONE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:43:34 +01:00


Warm deployments + --quick CI mode (Phase 2w)

cc-ci keeps a small set of apps warm so SSO-dependent tests and an opt-in fast lane avoid paying the full cold-provisioning cost every run. Three states (use these terms):

  • live-warm — actually deployed and running (keycloak, traefik): instant to use, costs RAM.
  • data-warm — undeployed (RAM freed) but its data volume is retained, so a later abra app deploy reattaches it and boots warm (skips fresh DB-init/first-boot); costs only disk.
  • cold — no retained data: fresh abra app new + new volume + full lifecycle + teardown that deletes the volume. The authoritative default (!testme = full cold).

Stable-domain scheme: warm apps live at warm-<recipe>.ci.commoninternet.net — deliberately distinct from the cold per-run scheme <recipe[:4]>-<6hex>.ci... so a warm app is never confused with a disposable cold run. Warm volumes + snapshots live under /var/lib/ci-warm/<recipe>/ and are cache, not source — re-seeded by cold runs, excluded from the D8 reproducibility closure (no Nix module declares them as a source).
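The two naming schemes can be sketched as follows. This is a minimal illustration, not the harness's code: the function names are hypothetical, and the cold helper returns only the leading label because the rest of the cold domain is elided above.

```python
import secrets

# Suffix taken from the warm scheme stated above.
WARM_SUFFIX = ".ci.commoninternet.net"


def warm_domain(recipe: str) -> str:
    """Stable domain for a warm app: warm-<recipe>.ci.commoninternet.net."""
    return f"warm-{recipe}{WARM_SUFFIX}"


def cold_label(recipe: str) -> str:
    """Leading label of a disposable cold run: <recipe[:4]>-<6hex>.

    Hypothetical helper: the 6 hex characters are drawn at random per run.
    """
    return f"{recipe[:4]}-{secrets.token_hex(3)}"


if __name__ == "__main__":
    print(warm_domain("keycloak"))
    print(cold_label("keycloak"))
```

The warm name is deterministic, so it can be computed from the recipe alone, while the cold label changes on every call.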

Live-warm keycloak + traefik — auto-update, health-gated, with rollback

Both are unpinned and reconciled by runner/warm_reconcile.py <app> (driven by the systemd oneshots warm-keycloak.service / deploy-proxy.service, re-run every activation/boot). On each reconcile (and nightly, WC6):

  1. WC1.2 pre-deploy safety gate (first). Compare current→latest. Auto-apply only non-major (patch/minor) bumps with no manual-migration release notes. A MAJOR recipe/app-version bump, or a target whose releaseNotes/<version>.md flags a manual migration, is NOT auto-applied — stay on current + write an alert with the notes for the operator. (A health pass ≠ migration done.)
  2. WC1.1 post-deploy health gate. Record running version = last-good → deploy latest → health-check → healthy: commit last-good := latest; unhealthy: roll back to last-good + alert.
    • keycloak is stateful: undeploy → snapshot the data volume → deploy latest → on failure restore the snapshot + redeploy the prior version (a forward DB migration makes a version-only rollback unsafe).
    • traefik is stateless: version rollback only (no snapshot).
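The two gates above can be sketched as one small decision function. Everything here is an assumption made for illustration: the names, the `<recipe>+<app>` dotted version format, and the marker text used to detect a manual migration are hypothetical, and the keycloak snapshot/restore step is left out (only the version rollback is shown).

```python
from typing import Callable, Tuple

# Hypothetical marker text; how release notes really flag a migration is not specified here.
MIGRATION_MARKERS = ("manual migration",)


def majors(version: str) -> Tuple[int, ...]:
    """Major number of each '+'-separated part, e.g. '1.2.0+26.1.0' -> (1, 26)."""
    return tuple(int(part.lstrip("v").split(".")[0]) for part in version.split("+"))


def may_auto_apply(current: str, latest: str, release_notes: str = "") -> bool:
    """Pre-deploy gate: allow only non-major bumps without a flagged manual migration."""
    if majors(latest) != majors(current):
        return False
    notes = release_notes.lower()
    return not any(marker in notes for marker in MIGRATION_MARKERS)


def reconcile(
    current: str,
    latest: str,
    release_notes: str,
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    alert: Callable[[str], None],
) -> str:
    """Run both gates and return the version left running."""
    if latest == current:
        return current
    if not may_auto_apply(current, latest, release_notes):
        alert(f"held upgrade {current} -> {latest}: operator review needed")
        return current
    last_good = current  # remember what to fall back to
    deploy(latest)
    if healthy():
        return latest  # commit: latest becomes the new last-good
    deploy(last_good)  # post-deploy gate failed: roll back
    alert(f"rolled back {latest} -> {last_good}: health check failed")
    return last_good
```

Passing `deploy`, `healthy`, and `alert` in as callables keeps the decision logic testable without touching a real deployment.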

keycloak is the shared SSO provider: SSO-dependent recipes point their setup_custom_tests at the one warm keycloak and create a per-run namespaced realm <parent>-<6hex> (created at run start, deleted at run end). Concurrent dependents get distinct realms; realms orphaned by crashed runs are reaped when their hex suffix matches no live app stack.
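A minimal sketch of the realm naming and the orphan check, assuming that the 6-hex run id also appears somewhere in the name of the run's app stack. The function names and the example stack name are hypothetical.

```python
import re
import secrets
from typing import Iterable, List

# Per-run realms look like <parent>-<6hex>; anything else is left alone.
REALM_RE = re.compile(r"^(?P<parent>.+)-(?P<hex>[0-9a-f]{6})$")


def new_realm(parent: str) -> str:
    """Name a fresh per-run realm for one dependent recipe."""
    return f"{parent}-{secrets.token_hex(3)}"


def orphaned_realms(realms: Iterable[str], live_stacks: Iterable[str]) -> List[str]:
    """Return per-run realms whose hex suffix appears in no live stack name."""
    stacks = list(live_stacks)
    orphans = []
    for realm in realms:
        match = REALM_RE.match(realm)
        if match is None:
            continue  # not a per-run realm (for example 'master'): never reap
        if not any(match["hex"] in stack for stack in stacks):
            orphans.append(realm)
    return orphans


if __name__ == "__main__":
    realms = ["master", "nextcloud-a1b2c3", "nextcloud-ffffff"]
    print(orphaned_realms(realms, ["next-a1b2c3_example_stack"]))
```

Skipping names that do not match the per-run pattern is the conservative choice: a reaper that only ever deletes `<parent>-<6hex>` realms cannot remove a long-lived realm by accident.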

Alerts. A reconciler that rolls back (WC1.1) or holds an upgrade (WC1.2) writes a sentinel JSON to /var/lib/ci-warm/alerts/*.json. The Builder loop relays new alerts (PushNotification) and archives them to alerts/seen/ — bridging the autonomous reconciler to operator visibility.
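The sentinel-and-relay handoff could look roughly like this. It is a sketch under stated assumptions: the JSON fields, the file naming, and the `notify` callback are hypothetical, and only the `alerts/` and `alerts/seen/` layout comes from the description above.

```python
import json
import shutil
import tempfile
import time
from pathlib import Path
from typing import Callable


def write_alert(alerts_dir: Path, app: str, kind: str, detail: str) -> Path:
    """Reconciler side: drop one sentinel JSON file (fields are illustrative)."""
    alerts_dir.mkdir(parents=True, exist_ok=True)
    path = alerts_dir / f"{app}-{kind}-{time.time_ns()}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"app": app, "kind": kind, "detail": detail}))
    tmp.replace(path)  # rename so a reader never sees a half-written file
    return path


def relay_new(alerts_dir: Path, notify: Callable[[dict], None]) -> int:
    """Relay side: notify for each active alert, then archive it under seen/."""
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = 0
    for path in sorted(alerts_dir.glob("*.json")):  # non-recursive: skips seen/
        notify(json.loads(path.read_text()))
        shutil.move(str(path), str(seen / path.name))
        relayed += 1
    return relayed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        alerts = Path(tmp) / "alerts"
        write_alert(alerts, "keycloak", "rollback", "unhealthy after deploy")
        print(relay_new(alerts, print), "alert(s) relayed")
```

Because relaying moves the file, running the relay twice does not notify twice.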

Data-warm canonicals (WC2/WC3)

A canonical is a per-recipe known-good deployment at warm-<recipe>, kept data-warm (undeployed-when-idle, volume retained), tracked by runner/harness/canonical.py:

  • Enroll a recipe: set WARM_CANONICAL = True in tests/<recipe>/recipe_meta.py. That's it.
  • Registry: /var/lib/ci-warm/<recipe>/canonical.json = {recipe, domain, version, commit, status, ts}.
  • Known-good snapshot (WC3): runner/harness/warmsnap.py takes a raw per-volume tar while the app is UNDEPLOYED and stores it under /var/lib/ci-warm/<recipe>/snapshot/ (one last-good per app, atomic replace). restore() clears each volume and untars the snapshot back into it; proven to round-trip data.
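An atomic registry update can be sketched with a temp file plus `os.replace`. The key set matches the registry shape listed above; the function name, the example values, and the integer `ts` format are assumptions, and this is not the code in canonical.py.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def write_registry(recipe_dir: Path, recipe: str, domain: str,
                   version: str, commit: str, status: str) -> Path:
    """Write canonical.json so readers see either the old or the new file."""
    recipe_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "recipe": recipe,
        "domain": domain,
        "version": version,
        "commit": commit,
        "status": status,
        "ts": int(time.time()),  # assumed format: Unix seconds
    }
    final = recipe_dir / "canonical.json"
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=recipe_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(record, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, final)  # atomic swap of the registry file
    return final


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        path = write_registry(Path(root) / "custom-html", "custom-html",
                              "warm-custom-html.ci.commoninternet.net",
                              "1.0.0", "abc1234", "known-good")
        print(path.read_text())
```

Writing to a sibling temp file first means a crash mid-write leaves the previous known-good record in place.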

--quick opt-in fast lane (WC4/WC7)

!testme = full cold (default, authoritative). !testme --quick = opt-in lower-confidence fast lane (the bridge parses it → CCCI_QUICK=1 Drone param; run_quick in run_recipe_ci.py):

  1. Reattach the canonical (deploy_canonical — warm boot at known-good) → wait healthy.
  2. (deps) use the warm keycloak + a per-run realm.
  3. Upgrade in place to the PR head (chaos) — the op, once.
  4. Assert: generic UPGRADE (reconverge + moved + serving) + recipe overlay + custom.
  5. PASS → undeploy-keep-volume; known-good UNCHANGED (never promote). FAIL → restore the last-known-good snapshot + undeploy (roll back, data safe).

--quick never gates merge and never advances the canonical. If no canonical exists it falls back cleanly to a full cold run (the PR is still tested).
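The control flow of the fast lane can be outlined as below. This is a sketch, not run_quick itself: the `QuickOps` hooks and the outcome strings are hypothetical, the keycloak realm step is omitted, and the restore-then-undeploy order on failure simply follows the order listed in step 5.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QuickOps:
    """Hypothetical hooks standing in for the real harness operations."""
    has_canonical: Callable[[], bool]
    deploy_canonical: Callable[[], None]
    upgrade_to_pr_head: Callable[[], None]
    run_assertions: Callable[[], bool]
    restore_snapshot: Callable[[], None]
    undeploy_keep_volume: Callable[[], None]
    run_full_cold: Callable[[], bool]


def quick_lane(ops: QuickOps) -> str:
    """Return one of: quick-pass, quick-fail, cold-pass, cold-fail."""
    if not ops.has_canonical():
        # No canonical to reattach: fall back to a full cold run.
        return "cold-pass" if ops.run_full_cold() else "cold-fail"
    ops.deploy_canonical()  # warm boot at known-good
    try:
        ops.upgrade_to_pr_head()  # the one operation under test
        passed = ops.run_assertions()
    except Exception:
        passed = False  # an exception during upgrade or assertions counts as FAIL
    if passed:
        ops.undeploy_keep_volume()  # known-good is left untouched, never promoted
        return "quick-pass"
    ops.restore_snapshot()  # roll data back to last-known-good
    ops.undeploy_keep_volume()
    return "quick-fail"
```

Note that neither branch writes the registry or the snapshot: in this outline, as in the description above, only a cold run can advance the canonical.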

Cold-only canonical advancement (WC5) + nightly sweep (WC6)

  • WC5 promote-on-green-cold. A GREEN full-cold run on LATEST (no PR head) of an enrolled recipe re-seeds the canonical at the green-verified latest (snapshot + registry, atomic). The old known-good is replaced only after green — never lost on a red run. The FIRST green cold run seeds the canonical. A PR !testme (carries REF) and --quick never promote — only cold-on-latest (the nightly sweep, or a manual RECIPE=<r> run) advances it.
  • WC6 nightly sweep. nightly-sweep.timer (03:00, Persistent) → nightly_sweep.py: roll warm/infra to latest (health-gated, WC1.1) → serial full-cold run across enrolled recipes on latest (each green run promotes its canonical) → prune stale warm data → log disk. Serial honors MAX_TESTS; skips if a test is already in flight.
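The promotion rule in WC5 reduces to a single predicate. The sketch below uses hypothetical parameter names; `ref` stands for the PR head carried by a PR-triggered run and is empty for a run on latest.

```python
from typing import Optional


def may_promote(*, enrolled: bool, green: bool, quick: bool, ref: Optional[str]) -> bool:
    """True only for a green full-cold run on latest of an enrolled recipe."""
    return enrolled and green and not quick and not ref


if __name__ == "__main__":
    # Nightly sweep run on latest, green: promotes.
    print(may_promote(enrolled=True, green=True, quick=False, ref=None))
    # PR-triggered cold run (carries a ref), green: does not promote.
    print(may_promote(enrolled=True, green=True, quick=False, ref="pr-head"))
```

Keeping the rule in one place makes it easy to check that no code path other than cold-on-latest can reach the re-seed step.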

Resource safety + isolation (WC8)

  • Serialize: DRONE_RUNNER_CAPACITY = MAX_TESTS (default 1); the nightly sweep is serial and skips if a run_recipe_ci.py is active. At most MAX_TESTS apps are ever live at once.
  • Warm keycloak shared safely via per-run namespaced realms (above); orphan realms reaped.
  • Disk (the budget that warm data consumes is disk, not RAM): virtualisation.docker.autoPrune prunes images/containers/networks/build-cache older than 24h but never passes --volumes, so data-warm canonical volumes survive. Each canonical = one data volume + one snapshot (small; the keycloak DB snapshot, ~300M, dominates). canonical.prune_stale() (run nightly) drops warm data and volumes for de-enrolled canonicals. Monitor with df -h / (the nightly sweep logs it).
  • Cold teardown stays sacred: a cold per-run app's volumes/secrets are always deleted at run end (or janitor-reaped); promote re-seeds the canonical separately (never reuses a per-run volume).
  • Excluded from D8: /var/lib/ci-warm/ is runtime cache — no Nix module declares it as a source; a from-scratch rebuild re-seeds canonicals via cold runs, it does not restore them.
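Stale-warm pruning can be sketched as a directory scan. This is an illustration rather than canonical.prune_stale(): the function name and the dry-run flag are hypothetical, only directories that contain a canonical.json are treated as canonicals, and removing the Docker volumes themselves is not shown.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List


def prune_stale_sketch(warm_root: Path, enrolled: Iterable[str],
                       dry_run: bool = True) -> List[str]:
    """Return (and optionally delete) canonical dirs for de-enrolled recipes."""
    keep = set(enrolled)
    stale = sorted(
        entry.name
        for entry in warm_root.iterdir()
        # Only directories holding a registry file count as canonicals,
        # so siblings such as alerts/ are never candidates.
        if entry.is_dir() and (entry / "canonical.json").is_file() and entry.name not in keep
    )
    if not dry_run:
        for name in stale:
            shutil.rmtree(warm_root / name)
    return stale


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for recipe in ("custom-html", "old-recipe"):
            (root / recipe).mkdir()
            (root / recipe / "canonical.json").write_text("{}")
        (root / "alerts").mkdir()
        print(prune_stale_sketch(root, ["custom-html"]))
        total, used, free = shutil.disk_usage(root)
        print(f"disk free: {free // 2**20} MiB")
```

Defaulting to a dry run lets an operator see what would be dropped before anything is deleted.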

The --quick rollback proof (WC9)

Deliberately failing a PR under --quick restores the canonical's last-known-good intact, and a --quick pass does not move the known-good — both proven live on the custom-html canonical:

  • PASS keeps known-good: a --quick PASS run left the registry version + the snapshot tar byte-identical (Adversary-verified sha256) and the canonical idle with its volume retained.
  • FAIL restores known-good: a --quick run against a broken PR head (bad image) → quick FAIL → restored known-good data; canonical idle, exit 1; the snapshot was byte-identical, the known-good marker was back, the app served 200, and the broken image was gone. The known-good version was never advanced.
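A byte-identity check of the kind described above can be reproduced by hashing the snapshot files before and after a --quick run. The sketch assumes the snapshot directory holds `*.tar` files; that extension and the function names are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large tars need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def snapshot_digests(snapshot_dir: Path) -> Dict[str, str]:
    """Map each snapshot tar's file name to its SHA-256 hex digest."""
    return {p.name: sha256_of(p) for p in sorted(snapshot_dir.glob("*.tar"))}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        snap = Path(tmp)
        (snap / "data.tar").write_bytes(b"known-good")
        before = snapshot_digests(snap)
        # A --quick run would happen here.
        after = snapshot_digests(snap)
        print("byte-identical" if before == after else "snapshot changed")
```

Comparing the two dictionaries also catches a tar that was added or removed, not only one whose bytes changed.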

Operate / debug

  • Inspect a canonical: cat /var/lib/ci-warm/<recipe>/canonical.json; warmsnap snapshot under …/snapshot/. Enrolled recipes: canonical.enrolled_recipes().
  • Run a quick test manually: RECIPE=<r> CCCI_QUICK=1 cc-ci-run runner/run_recipe_ci.py.
  • Trigger the nightly sweep: systemctl start nightly-sweep.service (journal shows the roll + sweep).
  • Roll/repair warm keycloak or traefik: cc-ci-run runner/warm_reconcile.py {keycloak|traefik}.
  • Alerts: ls /var/lib/ci-warm/alerts/ (active) and …/seen/ (relayed).