Operator (2026-05-29): a passing health check does NOT prove a required manual migration ran, so auto-update needs a PRE-deploy gate in addition to the post-deploy health rollback. Reconciler auto-applies only non-major (patch/minor) upgrades with no manual-migration release notes; a MAJOR recipe-version bump (or release notes flagging a manual migration) is held on the current version with a PushNotification carrying the release notes (operator upgrades manually). Leans on abra's own major-bump caution + recipe releaseNotes/. Updated WC1.2/WC6/principles/decisions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cc-ci Phase 2w — Warm canonical deployments + --quick CI mode (Autonomous Build Plan)
Status: ACTIVE — interjected into Phase 2 by operator decision (2026-05-28). Phase 2
(plan-phase2-recipe-tests.md) is PAUSED at its current progress (its STATUS-2/BACKLOG-2 state is
preserved); the loops do this phase now, then Phase 2 resumes automatically where it left off.
Transition: auto — on ## DONE in machine-docs/STATUS-2w.md the watchdog returns to Phase 2.
Builds on: the Phase-1d/1e harness (generic suite, deploy-once, override overlays, HC1 upgrade
to PR-head, the sso-dep pattern plan-sso-dep-testing.md) and the now-wired Docker Hub auth.
Owner agents: Builder + Adversary loops (plan.md §6/§7); Adversary cold-verifies.
This file: /srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md
Phase order now: 1c → 1b → 1d → 1e → 2(paused) → 2w → 2(resume) → 2b → 3 → 4.
0. Why this phase
Cold-start CI (fresh abra app new → deploy → DB-init/first-boot → … → teardown) is slow, and it
re-pays that cost on every run and for every SSO dependency. This phase adds a warm-data layer:
keep each app's data volume around between runs (Co-op Cloud's undeploy frees RAM but keeps
volumes), so a fast --quick run can reattach it, upgrade to the PR code, and assert — without the
cold-provisioning cost. A persistent keycloak serves SSO-dependent recipes without a fresh
co-deploy each run. A last-known-good snapshot per app means a bad PR tested under --quick can
never destroy the working state+data — we roll back.
Terminology (use these terms throughout code/docs/decisions):
- live-warm — actually deployed and running (e.g. keycloak): instant to use, costs RAM.
- data-warm — undeployed (RAM freed) but its data volume is retained, so a later abra app deploy reattaches it and boots warm (skips fresh DB-init/first-boot), costs only disk.
- cold — no retained data: fresh abra app new + new volume + full lifecycle + teardown that deletes the volume. The authoritative default.
Design principles settled with the operator (do not relitigate):
- Keep keycloak live-warm; keep everything else data-warm. Only keycloak (shared dep) + the one app under test run at a time. RAM stops being the limiter; disk is the budget (monitor; bump only if needed — test fixtures are small).
- Default !testme = full cold (authoritative; its upgrade tier already exercises PR-upgrade per 1e). --quick is an opt-in flag, a lower-confidence fast lane.
- The canonical known-good advances ONLY via cold runs (esp. the nightly sweep). --quick NEVER promotes the canonical — it consumes it read-mostly and rolls back on failure.
- Snapshots: raw volume copy taken while UNDEPLOYED (fast + consistent because nothing is writing). One last-known-good per app.
- Warm volumes + snapshots are cache, not source — not in the git/D8 closure; re-seeded by cold runs, not restored on a VM rebuild.
- Warm/infra apps (traefik + keycloak) auto-update to LATEST, nightly, with health-gated rollback (operator, 2026-05-28). Both are unpinned — their reconcilers abra recipe fetch the latest published recipe + chaos-deploy it. A nightly nixos-rebuild switch runs the reconcilers so the warm/infra apps roll to latest each night. Because the recipe is fetched at activation (runtime), the nix closure stays byte-identical (only the deployed versions float) — D8 is preserved; the version pin is gone, so the closure is more stable, not less.
- Auto-update is gated TWICE — a pre-deploy safety gate AND a post-deploy health gate:
  - Pre-deploy (don't even try unsafe upgrades): only auto-apply upgrades that don't require manual intervention — i.e. non-major (patch/minor) recipe-version bumps with no manual-migration in their release notes. If current→latest is a MAJOR bump or the target's release notes flag a manual migration, DO NOT auto-upgrade: stay on the current version and PUSH ALERT with the release notes (operator upgrades manually). A passing health check does NOT prove a required migration was done, so this gate is independent of health. Lean on abra's own major-bump caution + the recipe releaseNotes/.
  - Post-deploy (for upgrades we DO apply): record running version as last-good → deploy latest → health-check → healthy: commit (last-good := latest); unhealthy: roll back to last-good + PUSH ALERT. For stateful apps (keycloak / any app with a DB/volume): snapshot the data volume BEFORE the upgrade and restore it on rollback (a forward DB migration can make a version-only rollback fail) — reusing the WC3 known-good-snapshot mechanism. traefik (stateless) = version rollback only. (Rollback is NOT nix-generation rollback — the swarm app isn't in the generation; it's built into the reconciler.)
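The two gates described above can be sketched as one small function. This is a minimal illustration, not the reconciler itself: `auto_update`, `UpdateOutcome`, and every callable passed in (`needs_manual_upgrade`, `deploy`, `healthy`, `snapshot`, `restore`) are hypothetical names, and the alert is reduced to a returned string rather than a real PushNotification carrying the release notes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class UpdateOutcome:
    action: str            # "held", "committed", or "rolled_back"
    running: str           # version left running afterwards
    alert: Optional[str]   # alert text, if one would be pushed


def auto_update(
    current: str,
    latest: str,
    needs_manual_upgrade: Callable[[str, str], bool],
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    snapshot: Optional[Callable[[], None]] = None,
    restore: Optional[Callable[[], None]] = None,
) -> UpdateOutcome:
    # Gate 1 (pre-deploy): hold on the current version and alert.
    # Nothing is deployed, because a later health pass would not prove
    # that a required manual migration was done.
    if needs_manual_upgrade(current, latest):
        return UpdateOutcome("held", current, f"manual upgrade needed: {current} -> {latest}")

    # Gate 2 (post-deploy): remember last-good, snapshot stateful data, try latest.
    # Assumption: snapshot/restore handle any undeploy they need themselves.
    last_good = current
    if snapshot is not None:
        snapshot()
    deploy(latest)
    if healthy():
        return UpdateOutcome("committed", latest, None)

    # Unhealthy: restore data (stateful apps only), then redeploy last-good.
    if restore is not None:
        restore()
    deploy(last_good)
    return UpdateOutcome("rolled_back", last_good, f"{latest} unhealthy, rolled back to {last_good}")
```

For a stateless app such as traefik the `snapshot` and `restore` arguments would simply be left out, which gives the version-only rollback.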
1. Definition of Done (Phase 2w exit condition)
Terminates when every item holds and the Adversary has independently cold-verified (logged in
machine-docs/REVIEW-2w.md):
- WC1 — Live-warm keycloak (SSO dep), unpinned + self-healing. A persistent (live-warm) keycloak runs at a stable domain, unpinned (reconciler abra recipe fetch latest + chaos-deploy, matching traefik — drop the kcVersion pin; keep the secret-generate-only-if-missing guard + the health-wait). SSO-dependent recipes (per plan-sso-dep-testing.md) point their setup_custom_tests at it, create a per-run namespaced realm+client, then delete that realm after the run, instead of co-deploying a fresh keycloak. Proven: a dependent recipe's SSO custom tests pass against the warm keycloak; concurrent dependents don't collide (distinct realms); leftover realms are reaped.
- WC1.2 — Pre-deploy auto-upgrade safety gate (manual-migration → alert, don't auto-apply). Before the reconciler auto-applies "latest", it checks the current→latest delta: auto-apply only non-major (patch/minor) bumps with no manual-migration release notes. A MAJOR recipe-version bump, or a target whose releaseNotes/ flag a manual migration, is NOT auto-applied — the reconciler stays on the current version and pushes an alert with the release notes for the operator to upgrade manually. (Health-pass ≠ migration-done, so this is independent of WC1.1.) Detection leans on abra's major-bump handling + the recipe release notes. Adversary proof: simulate a major/manual-migration latest → confirm the warm app stays on current + an alert with the notes fired (no silent auto-upgrade).
- WC1.1 — Health-gated deploy-with-rollback in the warm/infra reconcilers (traefik + keycloak). Each reconciler: record the running version as last-good → fetch+deploy latest → health-check → healthy: commit last-good := latest; unhealthy: roll back to last-good + PushNotification alert. For stateful apps (keycloak): snapshot the data volume BEFORE the upgrade; on rollback restore that snapshot + redeploy the prior version (forward DB migrations make version-only rollback unsafe) — reuse the WC3 snapshot helper. traefik (stateless) = version rollback only. Adversary proof: force a broken "latest" (simulate) → confirm the warm app self-reverts to the prior healthy version (keycloak with its pre-upgrade data intact) and an alert fired; a healthy update commits the new version as last-good.
- WC2 — Data-warm canonical model. A canonical per warmed recipe at a stable domain (distinct from cold per-run <recipe>-<6hex> domains), kept data-warm (undeployed-when-idle, volume retained). A small declarative registry/reconciler tracks which recipes are canonical and at which commit their known-good is. Re-warmable from scratch (cache).
- WC3 — Known-good snapshots. For each canonical app, a raw copy of its data volume(s) taken while undeployed, stored under a stable path (e.g. /var/lib/ci-warm/<recipe>/), tagged with the commit it passed on. One last-known-good retained per app (prior is replaced atomically on update). Restore is proven to bring the app back healthy with its data.
- WC4 — --quick mode. runner/run_recipe_ci.py gains a --quick path (flag/env): reattach the canonical warm volume (abra app deploy of the canonical) → upgrade to PR head (chaos redeploy) → run generic UPGRADE + serving + custom assertions (generic-first invariant holds) → on PASS: abra app undeploy (keep volume), do NOT alter the known-good; on FAIL: restore the last-known-good snapshot, then undeploy. --quick never promotes the canonical.
- WC5 — Canonical advancement via cold only. A green full-cold run on latest re-snapshots + re-tags the canonical known-good (promote-on-green instead of deleting at teardown). A cold run is the ONLY thing that advances a canonical. Seeding: the first green cold run on latest makes an app canonical.
- WC6 — Nightly: rebuild (warm/infra → latest) THEN full-cold sweep. A scheduled nightly job, in order: (1) nixos-rebuild switch → the warm/infra reconcilers roll traefik + keycloak to latest, subject to the WC1.2 pre-deploy gate (major/manual-migration → hold on current + alert) and the WC1.1 health-gated rollback; (2) the full cold suite across enrolled recipes — refreshing every canonical's known-good (WC5) AND serving as a daily authoritative regression run, now against the freshly-updated infra. Don't run while a test run is in flight. Mechanism settled in DECISIONS (systemd timer on cc-ci / Drone cron / bridge), declarative + reproducible. Bounded by MAX_TESTS (serial is fine — nightly). If the rebuild's health-gate rolled an infra app back, the alert fires and the sweep still runs against the (healthy) prior version.
- WC7 — Trigger + authority + labeling. Default !testme = full cold (unchanged). --quick is opt-in (!testme --quick, or a build param) and never gates merge. Run results carry the mode (cold vs quick) so a --quick pass is distinctly labeled lower-confidence (feeds Phase 3). Quick requires an existing canonical; if none, it cleanly falls back to a cold run (or reports "no canonical — run cold first").
- WC8 — Resource safety + isolation. Warm-base runs serialize per app (MAX_TESTS honored); warm keycloak shared safely via per-run realms; disk monitored (warm volumes + one snapshot each) with a documented budget + prune of stale/orphaned warm data; cold-run teardown stays sacred (deletes its own per-run volumes); warm data is excluded from the D8 reproducibility closure (documented as cache).
- WC9 — Documented + cold-verified, incl. the rollback proof. docs/ explains warm/quick; the Adversary cold-verifies, including deliberately failing a PR under --quick and confirming the canonical's last-known-good is restored intact (data preserved), and that a --quick pass did not move the known-good. No softened tests.
When WC1–WC9 hold and are confirmed, write ## DONE to machine-docs/STATUS-2w.md → the watchdog
auto-returns to Phase 2 (resume recipe authoring).
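One way to meet the WC3 requirements (one last-known-good per app, tagged with its commit, replaced atomically) is sketched here. It assumes the "raw tar" storage option, which is still an open decision, and it assumes the app is already undeployed when it runs. The function names, the `known-good-<commit>.tar` naming, and the `current` symlink are all hypothetical.

```python
import os
import shutil
import tarfile
from pathlib import Path


def replace_known_good(volume_dir: Path, store_dir: Path, commit: str) -> Path:
    """Archive volume_dir and atomically make it the single known-good snapshot."""
    store_dir.mkdir(parents=True, exist_ok=True)
    archive = store_dir / f"known-good-{commit}.tar"
    partial = store_dir / f".{archive.name}.partial"

    # Write under a temporary name so a crash never leaves a half-written
    # archive under a name that a restore would trust.
    with tarfile.open(partial, "w") as tar:
        tar.add(volume_dir, arcname=".")
    os.replace(partial, archive)

    # Swap the "current" symlink with one rename: readers see old or new, never neither.
    link = store_dir / "current"
    tmp_link = store_dir / ".current.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(archive.name, tmp_link)
    os.replace(tmp_link, link)

    # Only after the swap, drop older archives so exactly one snapshot remains.
    for old in store_dir.glob("known-good-*.tar"):
        if old != archive:
            old.unlink()
    return archive


def restore_known_good(store_dir: Path, volume_dir: Path) -> str:
    """Replace volume_dir with the current snapshot; return the commit it is tagged with."""
    archive = (store_dir / "current").resolve()
    if volume_dir.exists():
        shutil.rmtree(volume_dir)   # discard whatever the failed run left behind
    volume_dir.mkdir(parents=True)
    with tarfile.open(archive) as tar:
        tar.extractall(volume_dir)
    return archive.name[len("known-good-"):-len(".tar")]
```

The prior snapshot is deleted only after the new one is in place, so an interrupted update leaves the old known-good usable.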
2. The --quick flow (reference)
PRECOND: a canonical for <recipe> exists (seeded by a prior green cold run); else fall back/report.
1. abra app deploy <canonical-domain> # reattach warm volume -> fast warm boot at known-good commit
2. wait_healthy
3. (deps) point at the warm keycloak; create a per-run realm+client (namespaced)
4. UPGRADE to PR head (abra app deploy --chaos to the PR checkout) # the op, once
5. assert: generic upgrade (reconverge + moved + serving) + recipe overlay + custom (requires_deps)
6a. PASS -> abra app undeploy <canonical-domain> # keep volume; known-good UNCHANGED
6b. FAIL -> restore last-known-good snapshot to the volume; abra app undeploy # roll back, data safe
7. (deps) delete the per-run realm from the warm keycloak
Cold run (default, unchanged) seeds/advances the canonical: on a green cold run on latest, snapshot the (undeployed) volume → replace the last-known-good + tag the commit, and keep the volume as the new canonical instead of deleting it.
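The numbered --quick steps above map onto an orchestrator function roughly as follows. This is a sketch under stated assumptions: `run_quick` and every method on `ops` are hypothetical stand-ins for the real abra and keycloak calls, `delete_realm` is assumed to be safe to call even if the realm was never created, and any exception in steps 1 to 5 is treated as a failed run.

```python
def run_quick(ops, domain: str, run_id: str) -> str:
    """Return "pass", "fail", or "no-canonical" for one --quick run."""
    # PRECOND: without a canonical the caller falls back to cold or reports.
    if not ops.has_canonical(domain):
        return "no-canonical"

    realm = f"ci-{run_id}"                   # per-run namespaced realm
    passed = False
    try:
        ops.deploy(domain)                   # 1. reattach the warm volume
        ops.wait_healthy(domain)             # 2.
        ops.create_realm(realm)              # 3. (deps)
        ops.upgrade_to_pr_head(domain)       # 4. the op, once
        passed = ops.run_assertions(domain)  # 5. generic first, then custom
    except Exception:
        passed = False                       # errors count as a failed quick run
    finally:
        if not passed:
            ops.restore_known_good(domain)   # 6b. roll the data back
        ops.undeploy(domain)                 # 6a/6b. keep the volume either way
        ops.delete_realm(realm)              # 7. (deps)
    return "pass" if passed else "fail"
```

Nothing on the pass path touches the snapshot, which is how the sketch keeps the known-good unchanged under --quick.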
3. Milestones (bounded)
- W0 — Warm keycloak, unpinned + self-healing (WC1, WC1.1). Highest ROI; unblocks faster SSO recipe tests for the resumed Phase 2. Includes the health-gated deploy-with-rollback (snapshot keycloak before upgrade, restore on health-fail + alert); apply the same to traefik (version-only).
- W1 — Canonical registry + snapshot/restore (WC2, WC3). Stable-domain warm apps; raw-while-stopped snapshot + restore; prove restore round-trips data. (Shares the snapshot helper with WC1.1.)
- W2 — --quick mode (WC4, WC7). Orchestrator path + labeling + fallback.
- W3 — Nightly rebuild→sweep + cold-advances-canonical (WC5, WC6). Nightly nixos-rebuild (warm/infra → latest, health-gated) then full-cold sweep; promote-on-green-cold; scheduled job.
- W4 — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). Then ## DONE.
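For the W4 resource hardening, the disk monitoring and orphan pruning that WC8 asks for could start from something like the sketch below. It assumes one directory per recipe under the warm root (the plan gives /var/lib/ci-warm/<recipe>/ only as an example), and the function names and the idea of passing the enrolled recipes as a set are hypothetical.

```python
from pathlib import Path


def tree_bytes(root: Path) -> int:
    """Sum the sizes of regular files under root (symlinks are not followed)."""
    return sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()
    )


def warm_usage(warm_root: Path) -> dict[str, int]:
    """Per-recipe bytes used, keyed by directory name."""
    return {d.name: tree_bytes(d) for d in sorted(warm_root.iterdir()) if d.is_dir()}


def over_budget(usage: dict[str, int], budget_bytes: int) -> bool:
    """True when the warm set as a whole exceeds the documented budget."""
    return sum(usage.values()) > budget_bytes


def orphans(usage: dict[str, int], enrolled: set[str]) -> list[str]:
    """Warm directories with no matching enrolled recipe: candidates for pruning."""
    return [name for name in usage if name not in enrolled]
```

Measuring per recipe rather than only in total also gives the numbers the disk-budget open decision needs.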
4. Guardrails
- --quick never advances the canonical; only cold does. Anchors the baseline to verified states.
- Never lose the known-good — snapshot before mutate (or rely on the standing known-good); restore on any quick failure. The rollback proof (WC9) is mandatory.
- Default stays cold; quick is opt-in + clearly lower-confidence. Don't let a quick pass read as full confidence.
- Snapshot only while undeployed (consistency). One last-known-good per app (disk).
- Cold teardown stays sacred (deletes per-run volumes); warm volumes are a managed cache, never confused with per-run state; warm data excluded from D8.
- Warm/infra auto-update is health-gated — a failed "latest" self-reverts to the last-good version (+ data, for stateful apps) and alerts; never leave the proxy/SSO dep broken silently.
- Never weaken a test (cardinal rule). Generic-first invariant holds in --quick too.
- Bounded — build the mechanism + prove on keycloak + a couple of recipes; do NOT re-warm all recipes here (the nightly sweep populates canonicals over time).
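Several of these guardrails reduce to rules about what a run result is allowed to do, depending on its mode. A minimal sketch of such a result record follows. `Mode`, `RunResult`, and its fields are hypothetical names; the rules encoded are the ones stated above (quick never gates merge and never promotes, while a green cold run on latest may promote).

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    COLD = "cold"
    QUICK = "quick"


@dataclass(frozen=True)
class RunResult:
    recipe: str
    mode: Mode
    passed: bool
    on_latest: bool = False   # whether the run tested the latest recipe version

    @property
    def gates_merge(self) -> bool:
        # Only the authoritative cold lane may gate a merge.
        return self.mode is Mode.COLD

    @property
    def may_promote_canonical(self) -> bool:
        # Promote-on-green applies to cold runs on latest, never to quick runs.
        return self.mode is Mode.COLD and self.passed and self.on_latest

    @property
    def label(self) -> str:
        verdict = "PASS" if self.passed else "FAIL"
        if self.mode is Mode.QUICK:
            return f"{verdict} (quick, lower confidence)"
        return f"{verdict} (cold, authoritative)"
```

Carrying the mode in the result itself means a quick pass cannot be reported without its lower-confidence label.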
5. Open decisions (log in machine-docs/DECISIONS.md)
- Canonical stable-domain scheme (distinct from cold per-run domains) + how the registry/reconciler is declared.
- Snapshot storage + format (raw tar vs reflink/CoW copy) under /var/lib/ci-warm/; atomic replace.
- Nightly sweep mechanism (systemd timer / Drone cron / bridge) + ordering + disk-prune policy.
- --quick trigger surface (!testme --quick comment vs Drone build param) + the "no canonical yet" fallback (run cold vs report-and-skip).
- Disk budget: measure warm volume + snapshot sizes across recipes; decide if a 30→larger bump is needed or the warm set stays bounded.
- Stateful pre-upgrade snapshot consistency (keycloak). keycloak is live-warm (running) at nightly-upgrade time, but the snapshot rule is "raw copy while UNDEPLOYED." Cleanest: the nightly keycloak update = undeploy → raw snapshot → deploy latest → health-check → on fail restore snapshot + redeploy prior (the brief nightly downtime makes the snapshot consistent and honors the WC3 invariant). Confirm this vs an app-consistent backup alternative.
- last-good version state for warm/infra apps (where the reconciler records the prior healthy version to roll back to) — a small state file alongside the snapshot, re-derivable from the running swarm version label.
- Manual-migration / major-bump detection (WC1.2). How to decide "auto-apply vs alert-and-hold": primary signal = major recipe-version bump (coop-cloud <upstream>+<recipe-semver>; major recipe-semver = breaking, matches abra's own major-upgrade caution); secondary = scan the target's releaseNotes/<version>.md for manual-migration markers. Decide the exact rule + whether to parse abra app upgrade output vs compute the delta directly.
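If the "compute the delta directly" option is chosen for WC1.2, the rule could look like the sketch below. It takes the recipe-semver parts as plain strings and leaves splitting them out of the full version tag to the caller. The marker phrases are hypothetical placeholders, since the real markers are part of this open decision, and the sketch deliberately holds whenever a version cannot be parsed.

```python
import re

# Hypothetical marker phrases; the real list is still to be decided.
MANUAL_MARKERS = ("manual migration", "manual intervention", "breaking change")


def major(semver: str) -> int:
    """Return the major component of a semver string such as "1.4.2" or "v2.0.0"."""
    match = re.match(r"v?(\d+)\.", semver.strip())
    if match is None:
        raise ValueError(f"not a semver: {semver!r}")
    return int(match.group(1))


def hold_for_operator(current_semver: str, latest_semver: str, release_notes: str) -> bool:
    """True means alert-and-hold; False means the upgrade may be auto-applied."""
    try:
        # Primary signal: any change in the recipe major version.
        if major(current_semver) != major(latest_semver):
            return True
    except ValueError:
        return True   # fail closed: an unreadable version is never auto-applied

    # Secondary signal: the target's release notes mention a manual step.
    notes = release_notes.lower()
    return any(marker in notes for marker in MANUAL_MARKERS)
```

Failing closed matches the intent of the gate: a held upgrade costs the operator one manual step, while a wrongly auto-applied one can leave a migration undone behind a passing health check.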