Warm deployments + --quick CI mode (Phase 2w)
cc-ci keeps a small set of apps warm so SSO-dependent tests and an opt-in fast lane avoid paying the full cold-provisioning cost every run. Three states (use these terms):
- live-warm: actually deployed and running (keycloak, traefik). Instant to use, costs RAM.
- data-warm: undeployed (RAM freed) but its data volume is retained, so a later abra app deploy
  reattaches it and boots warm (skips fresh DB-init/first-boot); costs only disk.
- cold: no retained data. Fresh abra app new + new volume + full lifecycle + teardown that deletes
  the volume. The authoritative default (!testme = full cold).
Stable-domain scheme: warm apps live at warm-<recipe>.ci.commoninternet.net — deliberately
distinct from the cold per-run scheme <recipe[:4]>-<6hex>.ci... so a warm app is never confused
with a disposable cold run. Warm volumes + snapshots live under /var/lib/ci-warm/<recipe>/ and are
cache, not source — re-seeded by cold runs, excluded from the D8 reproducibility closure (no
Nix module declares them as a source).
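The two naming schemes can be sketched as follows. This is a minimal illustration, not the harness code: the function names are hypothetical, and because the cold scheme's domain suffix is elided above, the cold helper returns only the <recipe[:4]>-<6hex> label.

```python
import secrets

# Base domain taken from the warm scheme described above.
WARM_BASE = "ci.commoninternet.net"


def warm_domain(recipe: str) -> str:
    """Stable domain for a warm app: warm-<recipe>.<base>."""
    return f"warm-{recipe}.{WARM_BASE}"


def cold_label(recipe: str) -> str:
    """Disposable per-run label: first 4 chars of the recipe plus 6 random hex chars."""
    return f"{recipe[:4]}-{secrets.token_hex(3)}"
```

For example, warm_domain("keycloak") always yields warm-keycloak.ci.commoninternet.net, while cold_label("keycloak") yields a fresh keyc-<6hex> label on every call.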
Live-warm keycloak + traefik — auto-update, health-gated, with rollback
Both are unpinned and reconciled by runner/warm_reconcile.py <app> (driven by the systemd
oneshots warm-keycloak.service / deploy-proxy.service, re-run every activation/boot). On each
reconcile (and nightly, WC6):
- WC1.2 pre-deploy safety gate (first). Compare current→latest. Auto-apply only non-major
  (patch/minor) bumps with no manual-migration release notes. A MAJOR recipe/app-version bump,
  or a target whose releaseNotes/<version>.md flags a manual migration, is NOT auto-applied:
  stay on current + write an alert with the notes for the operator. (A health pass ≠ migration
  done.)
- WC1.1 post-deploy health gate. Record running version = last-good → deploy latest →
  health-check → healthy: commit last-good := latest; unhealthy: roll back to last-good + alert.
  - keycloak is stateful: undeploy → snapshot the data volume → deploy latest → on failure
    restore the snapshot + redeploy the prior version (a forward DB migration makes a
    version-only rollback unsafe).
  - traefik is stateless: version rollback only (no snapshot).
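The two gates compose roughly as below. This is a sketch under stated assumptions, not the logic of warm_reconcile.py: versions are assumed to be plain dotted strings (the real recipe/app-version format may differ), and deploy, healthy and alert are hypothetical callables injected by the caller. The stateful snapshot/restore step for keycloak is omitted.

```python
from typing import Callable


def major(version: str) -> str:
    """Leading component of a dotted version, e.g. '2' for '2.1.0' (assumed format)."""
    return version.split(".")[0]


def may_auto_apply(current: str, latest: str, notes_flag_migration: bool) -> bool:
    """WC1.2: allow only non-major bumps whose release notes flag no manual migration."""
    if major(current) != major(latest):
        return False
    return not notes_flag_migration


def reconcile(
    current: str,
    latest: str,
    notes_flag_migration: bool,
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    alert: Callable[[str], None],
) -> str:
    """Return the version that is last-good after this reconcile."""
    if current == latest:
        return current
    if not may_auto_apply(current, latest, notes_flag_migration):
        # WC1.2 hold: stay on current and tell the operator why.
        alert(f"held upgrade {current} -> {latest}: needs operator review")
        return current
    last_good = current
    deploy(latest)
    if healthy():
        return latest  # WC1.1: commit last-good := latest
    # WC1.1 rollback: return to the recorded last-good version and alert.
    deploy(last_good)
    alert(f"rolled back {latest} -> {last_good}: health check failed")
    return last_good
```

Running the safety gate first means a held upgrade never touches the running app, so the health gate only ever has to undo bumps that were judged safe to try.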
keycloak is the shared SSO provider: SSO-dependent recipes point their setup_custom_tests at
the one warm keycloak and create a per-run namespaced realm <parent>-<6hex> (created at run
start, deleted at run end). Concurrent dependents get distinct realms; orphaned realms (left by
crashed runs) are reaped when their hex matches no live app stack.
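Realm naming and orphan detection can be sketched as follows. The helper names are hypothetical, and how the set of live hexes is obtained from the running app stacks is left to the caller.

```python
import re
import secrets

# A per-run realm looks like <parent>-<6hex>.
REALM_RE = re.compile(r"^(?P<parent>.+)-(?P<hex>[0-9a-f]{6})$")


def new_realm(parent: str) -> str:
    """Create a per-run realm name with a random 6-hex suffix."""
    return f"{parent}-{secrets.token_hex(3)}"


def orphaned_realms(realms: list[str], live_hexes: set[str]) -> list[str]:
    """Realms whose hex suffix matches no live app stack.

    Names that do not follow the <parent>-<6hex> pattern are never
    reported, so permanent realms are left alone.
    """
    orphans = []
    for realm in realms:
        match = REALM_RE.match(realm)
        if match and match.group("hex") not in live_hexes:
            orphans.append(realm)
    return orphans
```

Matching only on the pattern keeps the reaper from touching anything it did not create.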
Alerts. A reconciler that rolls back (WC1.1) or holds an upgrade (WC1.2) writes a sentinel JSON to
/var/lib/ci-warm/alerts/*.json. The Builder loop relays new alerts (PushNotification) and archives
them to alerts/seen/ — bridging the autonomous reconciler to operator visibility.
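A minimal sketch of the sentinel hand-off, assuming a simple JSON payload. The field names (app, kind, detail, ts) and the file-naming pattern are illustrative guesses, not the reconciler's actual schema, and notify stands in for whatever relays the alert.

```python
import json
import os
import time
from pathlib import Path
from typing import Callable


def write_alert(alerts_dir: Path, app: str, kind: str, detail: str) -> Path:
    """Write one alert sentinel atomically (temp file, then rename)."""
    alerts_dir.mkdir(parents=True, exist_ok=True)
    ts = time.time()
    payload = {"app": app, "kind": kind, "detail": detail, "ts": ts}
    path = alerts_dir / f"{app}-{kind}-{int(ts)}.json"
    tmp = alerts_dir / (path.name + ".tmp")  # not matched by the *.json glob below
    tmp.write_text(json.dumps(payload))
    os.replace(tmp, path)
    return path


def relay_new_alerts(alerts_dir: Path, notify: Callable[[dict], None]) -> list[str]:
    """Relay every active alert, then archive it under seen/."""
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = []
    # glob is non-recursive, so files already in seen/ are not picked up again.
    for path in sorted(alerts_dir.glob("*.json")):
        notify(json.loads(path.read_text()))
        path.replace(seen / path.name)
        relayed.append(path.name)
    return relayed
```

Writing to a temp name first means the relay never reads a half-written sentinel.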
Data-warm canonicals (WC2/WC3)
A canonical is a per-recipe known-good deployment at warm-<recipe>, kept data-warm
(undeployed-when-idle, volume retained), tracked by runner/harness/canonical.py:
- Enroll a recipe: set WARM_CANONICAL = True in tests/<recipe>/recipe_meta.py. That's it.
- Registry: /var/lib/ci-warm/<recipe>/canonical.json = {recipe, domain, version, commit, status,
  ts}.
- Known-good snapshot (WC3): runner/harness/warmsnap.py takes a raw per-volume tar while the app
  is UNDEPLOYED and stores it under /var/lib/ci-warm/<recipe>/snapshot/ (one last-good per app,
  atomic replace). restore() clears + untars each volume back; proven to round-trip data.
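The snapshot/restore round trip can be sketched on plain directories. This is an illustration of the technique (tar, atomic replace, clear, untar), not the code in warmsnap.py: real volumes live in the Docker volume store, and the function signatures here are assumptions.

```python
import os
import shutil
import tarfile
from pathlib import Path


def snapshot(volume_dir: Path, snapshot_dir: Path, name: str) -> Path:
    """Tar a volume's contents and atomically replace the previous snapshot."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    final = snapshot_dir / f"{name}.tar"
    tmp = snapshot_dir / f"{name}.tar.tmp"
    with tarfile.open(tmp, "w") as tar:
        for child in sorted(volume_dir.iterdir()):
            tar.add(child, arcname=child.name)
    # The old last-good stays in place until the new tar is complete.
    os.replace(tmp, final)
    return final


def restore(snapshot_tar: Path, volume_dir: Path) -> None:
    """Clear the volume, then untar the snapshot back into it."""
    for child in volume_dir.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
    with tarfile.open(snapshot_tar, "r") as tar:
        tar.extractall(volume_dir)
```

Clearing before untarring matters: files created after the snapshot (for example by a failed upgrade) would otherwise survive the restore.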
--quick opt-in fast lane (WC4/WC7)
!testme = full cold (default, authoritative). !testme --quick = opt-in lower-confidence
fast lane (the bridge parses it → CCCI_QUICK=1 Drone param; run_quick in run_recipe_ci.py):
- Reattach the canonical (deploy_canonical: warm boot at known-good) → wait healthy.
- (deps) use the warm keycloak + a per-run realm.
- Upgrade in place to the PR head (chaos): the op, once.
- Assert: generic UPGRADE (reconverge + moved + serving) + recipe overlay + custom.
- PASS → undeploy-keep-volume; known-good UNCHANGED (never promote). FAIL → restore the
  last-known-good snapshot + undeploy (roll back, data safe).
--quick never gates merge and never advances the canonical. If no canonical exists it falls
back cleanly to a full cold run (the PR is still tested).
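The control flow of the fast lane can be sketched as below. This is not run_quick from run_recipe_ci.py: the ops object and all of its method names are hypothetical, and RecordingOps only records calls so the ordering is visible. The relative order of restore and undeploy on failure follows the wording above and may differ in the real harness.

```python
class RecordingOps:
    """Hypothetical stand-in for the harness; records which steps ran."""

    def __init__(self, has_canonical: bool = True, assertions_pass: bool = True):
        self.canonical = has_canonical
        self.assertions_pass = assertions_pass
        self.calls: list[str] = []

    def has_canonical(self) -> bool:
        return self.canonical

    def run_cold(self) -> int:
        self.calls.append("run_cold")
        return 0

    def deploy_canonical(self) -> None:
        self.calls.append("deploy_canonical")

    def wait_healthy(self) -> None:
        self.calls.append("wait_healthy")

    def upgrade_to_pr_head(self) -> None:
        self.calls.append("upgrade_to_pr_head")

    def run_assertions(self) -> bool:
        self.calls.append("run_assertions")
        return self.assertions_pass

    def restore_snapshot(self) -> None:
        self.calls.append("restore_snapshot")

    def undeploy_keep_volume(self) -> None:
        self.calls.append("undeploy_keep_volume")


def run_quick(ops) -> int:
    """Return a process-style exit code: 0 on PASS, 1 on FAIL."""
    if not ops.has_canonical():
        return ops.run_cold()  # no canonical: fall back to a full cold run
    ops.deploy_canonical()
    ops.wait_healthy()
    try:
        ops.upgrade_to_pr_head()
        passed = ops.run_assertions()
    except Exception:
        passed = False  # a failed upgrade counts as a FAIL
    if not passed:
        ops.restore_snapshot()  # roll data back to last-known-good
    ops.undeploy_keep_volume()  # the canonical goes idle either way
    return 0 if passed else 1
```

Note that neither branch writes the registry or takes a new snapshot, which is what keeps the known-good unchanged.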
Cold-only canonical advancement (WC5) + nightly sweep (WC6)
- WC5 promote-on-green-cold. A GREEN full-cold run on LATEST (no PR head) of an enrolled
  recipe re-seeds the canonical at the green-verified latest (snapshot + registry, atomic). The
  old known-good is replaced only after green, never lost on a red run. The FIRST green cold
  run seeds the canonical. A PR !testme (carries REF) and --quick never promote; only
  cold-on-latest (the nightly sweep, or a manual RECIPE=<r> run) advances it.
- WC6 nightly sweep. nightly-sweep.timer (03:00, Persistent) → nightly_sweep.py: roll warm/infra
  to latest (health-gated, WC1.1) → serial full-cold run across enrolled recipes on latest (each
  green run promotes its canonical) → prune stale warm data → log disk. Serial honors MAX_TESTS;
  skips if a test is already in flight.
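The promotion rule reduces to a single predicate. The sketch below uses a hypothetical RunResult record; the field names are illustrative and not taken from the harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunResult:
    recipe: str
    enrolled: bool      # WARM_CANONICAL = True for this recipe
    green: bool         # the run passed
    quick: bool         # ran in the --quick lane
    ref: Optional[str]  # PR head ref; None means the run tested latest


def should_promote(run: RunResult) -> bool:
    """WC5: only a green full-cold run on latest of an enrolled recipe promotes."""
    return run.enrolled and run.green and not run.quick and run.ref is None
```

For example, a green nightly run (quick=False, ref=None) promotes, while a green PR run with ref set does not.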
Resource safety + isolation (WC8)
- Serialize: DRONE_RUNNER_CAPACITY = MAX_TESTS (default 1); the nightly sweep is serial and skips
  if a run_recipe_ci.py is active. At most MAX_TESTS apps are ever live at once.
- Warm keycloak shared safely via per-run namespaced realms (above); orphan realms reaped.
- Disk (warm is the budget, not RAM): the ci-docker-prune unit (nix/modules/docker-prune.nix,
  Phase-2pc) prunes only dangling images/containers/build-cache (until=24h), and only under
  genuine disk pressure (/ ≥ 80%) with nothing in flight. It never uses --all (keeps cached
  base/in-use images warm; the local store IS the cache on this single host) and never --volumes
  (so data-warm canonical volumes survive). Each canonical = one data volume + one snapshot
  (small; the keycloak DB snapshot ~300M dominates). canonical.prune_stale() (run nightly) drops
  warm data for de-enrolled canonicals. Monitor with df -h / (the nightly logs it).
- Cold teardown stays sacred: a cold per-run app's volumes/secrets are always deleted at run end
  (or janitor-reaped); promote re-seeds the canonical separately (never reuses a per-run volume).
- Excluded from D8: /var/lib/ci-warm/ is runtime cache; no Nix module declares it as a source. A
  from-scratch rebuild re-seeds canonicals via cold runs, it does not restore them.
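The prune gate can be sketched as follows. This is an assumption-laden illustration, not the contents of nix/modules/docker-prune.nix: the function names are hypothetical, the exact commands the unit runs are not shown in this document, and shutil.disk_usage computes used/total, which can differ slightly from the Use% column of df.

```python
import shutil

PRESSURE_THRESHOLD = 0.80  # prune only at or above 80% usage of /


def disk_fraction_used(path: str = "/") -> float:
    """Fraction of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_prune(fraction_used: float, tests_in_flight: int) -> bool:
    """Gate: genuine disk pressure AND nothing in flight."""
    return fraction_used >= PRESSURE_THRESHOLD and tests_in_flight == 0


def prune_commands(until: str = "24h") -> list[list[str]]:
    """Commands for a dangling-only prune; none pass --all or --volumes."""
    age_filter = f"until={until}"
    return [
        ["docker", "container", "prune", "--force", "--filter", age_filter],
        ["docker", "image", "prune", "--force", "--filter", age_filter],
        ["docker", "builder", "prune", "--force", "--filter", age_filter],
    ]
```

Below the threshold the function returns False and nothing is removed, so the local image cache stays intact on a healthy disk.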
The --quick rollback proof (WC9)
Deliberately failing a PR under --quick restores the canonical's last-known-good intact, and a
--quick pass does not move the known-good — both proven live on the custom-html canonical:
- PASS keeps known-good: a --quick PASS run left the registry version + the snapshot tar
  byte-identical (Adversary-verified sha256) and the canonical idle with its volume retained.
- FAIL restores known-good: a --quick run against a broken PR head (bad image) → "quick FAIL →
  restored known-good data; canonical idle", exit 1; the snapshot was byte-identical, the
  known-good marker was back, the app served 200, and the broken image was gone. The known-good
  version was never advanced.
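A byte-identity check of the kind described above can be done by hashing the snapshot tar before and after a run. This is a generic sketch, not the verification script that was actually used; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex sha256 of a file, read in chunks so large tars stay out of memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_unchanged(digest_before: str, path: Path) -> bool:
    """True if the file at `path` still hashes to the digest recorded earlier."""
    return sha256_file(path) == digest_before
```

Record sha256_file(...) before the --quick run, then call snapshot_unchanged afterwards; any rewrite of the tar, even with equal size, changes the digest.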
Operate / debug
- Inspect a canonical: cat /var/lib/ci-warm/<recipe>/canonical.json; warmsnap snapshot under
  …/snapshot/. Enrolled recipes: canonical.enrolled_recipes().
- Run a quick test manually: RECIPE=<r> CCCI_QUICK=1 cc-ci-run runner/run_recipe_ci.py.
- Trigger the nightly sweep: systemctl start nightly-sweep.service (journal shows the roll +
  sweep).
- Roll/repair warm keycloak or traefik: cc-ci-run runner/warm_reconcile.py {keycloak|traefik}.
- Alerts: ls /var/lib/ci-warm/alerts/ (active) and …/seen/ (relayed).