Operator (2026-05-29): a passing health check does NOT prove a required manual migration ran, so auto-update needs a PRE-deploy gate in addition to the post-deploy health rollback. Reconciler auto-applies only non-major (patch/minor) upgrades with no manual-migration release notes; a MAJOR recipe-version bump (or release notes flagging a manual migration) is held on the current version with a PushNotification carrying the release notes (operator upgrades manually). Leans on abra's own major-bump caution + recipe releaseNotes/. Updated WC1.2/WC6/principles/decisions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# cc-ci Phase 2w — Warm canonical deployments + `--quick` CI mode (Autonomous Build Plan)

**Status:** ACTIVE — **interjected into Phase 2** by operator decision (2026-05-28). Phase 2
(`plan-phase2-recipe-tests.md`) is **PAUSED at its current progress** (its STATUS-2/BACKLOG-2 state is
preserved); the loops do this phase now, then **Phase 2 resumes automatically** where it left off.
**Transition:** auto — on `## DONE` in `machine-docs/STATUS-2w.md` the watchdog returns to Phase 2.
**Builds on:** the Phase-1d/1e harness (generic suite, deploy-once, override overlays, HC1 upgrade
to PR-head, the sso-dep pattern `plan-sso-dep-testing.md`) and the now-wired Docker Hub auth.
**Owner agents:** Builder + Adversary loops (`plan.md` §6/§7); Adversary cold-verifies.
**This file:** `/srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md`
**Phase order now:** 1c → 1b → 1d → 1e → 2(paused) → **2w** → 2(resume) → 2b → 3 → 4.

---

## 0. Why this phase

Cold-start CI (fresh `abra app new` → deploy → DB-init/first-boot → … → teardown) is slow, and it
re-pays that cost on every run and for every SSO dependency. This phase adds a **warm-data** layer:
keep each app's **data volume** around between runs (Co-op Cloud's `undeploy` frees RAM but keeps
volumes), so a fast `--quick` run can reattach it, upgrade to the PR code, and assert — without the
cold-provisioning cost. A persistent **keycloak** serves SSO-dependent recipes without a fresh
co-deploy each run. A **last-known-good snapshot per app** means a bad PR tested under `--quick` can
never destroy the working state+data — we roll back.

**Terminology (use these terms throughout code/docs/decisions):**
- **live-warm** — actually deployed and running (e.g. keycloak): instant to use, costs RAM.
- **data-warm** — *undeployed* (RAM freed) but its **data volume is retained**, so a later
  `abra app deploy` reattaches it and boots warm (skips fresh DB-init/first-boot), costs only disk.
- **cold** — no retained data: fresh `abra app new` + new volume + full lifecycle + teardown that
  deletes the volume. The authoritative default.

**Design principles settled with the operator (do not relitigate):**
- **Keep keycloak live-warm; keep everything else data-warm.** Only keycloak (shared dep) + the one
  app under test run at a time. RAM stops being the limiter; **disk is the budget** (monitor; bump
  only if needed — test fixtures are small).
- **Default `!testme` = full cold** (authoritative; its upgrade tier already exercises PR-upgrade per
  1e). **`--quick` is an opt-in flag**, a lower-confidence fast lane.
- **The canonical known-good advances ONLY via cold runs** (esp. the nightly sweep). `--quick` NEVER
  promotes the canonical — it consumes it read-mostly and rolls back on failure.
- **Snapshots: raw volume copy taken while UNDEPLOYED** (fast + consistent because nothing is
  writing). **One last-known-good per app.**
- Warm volumes + snapshots are **cache, not source** — not in the git/D8 closure; re-seeded by cold
  runs, not restored on a VM rebuild.
- **Warm/infra apps (traefik + keycloak) auto-update to LATEST, nightly, with health-gated
  rollback** (operator, 2026-05-28). Both are **unpinned** — their reconcilers `abra recipe fetch`
  the latest published recipe + chaos-deploy it. A **nightly `nixos-rebuild switch`** runs the
  reconcilers so the warm/infra apps roll to latest each night. Because the recipe is fetched at
  *activation* (runtime), the **nix closure stays byte-identical** (only the deployed versions float)
  — D8 is preserved; the version pin is gone, so the closure is *more* stable, not less.
- **Auto-update is gated TWICE — a pre-deploy safety gate AND a post-deploy health gate:**
  - **Pre-deploy (don't even try unsafe upgrades):** only **auto-apply upgrades that don't require
    manual intervention** — i.e. non-major (patch/minor) recipe-version bumps with no manual
    migration flagged in their release notes. If current→latest is a **MAJOR bump** or the target's
    **release notes flag a manual migration**, **DO NOT auto-upgrade**: stay on the current version
    and **PUSH ALERT** with the release notes (operator upgrades manually). A passing health check
    does NOT prove a required migration was done, so this gate is independent of health. Lean on
    abra's own major-bump caution + the recipe `releaseNotes/`.
  - **Post-deploy (for upgrades we DO apply):** record running version as last-good → deploy latest
    → health-check → **healthy: commit (last-good := latest); unhealthy: roll back to last-good +
    PUSH ALERT.** For **stateful apps (keycloak / any app with a DB/volume): snapshot the data volume
    BEFORE the upgrade and restore it on rollback** (a forward DB migration can make a version-only
    rollback fail) — reusing the WC3 known-good-snapshot mechanism. traefik (stateless) = version
    rollback only. (Rollback is NOT nix-generation rollback — the swarm app isn't in the generation;
    the rollback is built into the reconciler.)

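
The two auto-update gates described in the last principle can be read as one small control flow. The following is a minimal Python sketch, not the reconciler's actual code: `gated_update` and every callable passed to it (`needs_manual_upgrade`, `deploy`, `is_healthy`, `alert`, `snapshot`, `restore`) are hypothetical names standing in for the abra calls, the health wait, and the push alert. For stateful apps the `snapshot` callable is assumed to take care of whatever undeploy is needed to copy the volume consistently.

```python
from typing import Callable, Optional, Tuple


def gated_update(
    current: str,
    latest: str,
    *,
    needs_manual_upgrade: Callable[[str, str], bool],
    deploy: Callable[[str], None],
    is_healthy: Callable[[], bool],
    alert: Callable[[str], None],
    snapshot: Optional[Callable[[], None]] = None,  # stateful apps only
    restore: Optional[Callable[[], None]] = None,   # stateful apps only
) -> Tuple[str, str]:
    """Return (outcome, version left running). All callables are hypothetical."""
    # Gate 1 (pre-deploy): never attempt an upgrade that needs manual work.
    if needs_manual_upgrade(current, latest):
        alert(f"held on {current}: upgrade to {latest} needs manual intervention")
        return "held", current

    # Gate 2 (post-deploy): remember last-good, try latest, check health.
    last_good = current
    if snapshot is not None:
        snapshot()  # taken BEFORE the upgrade so a DB migration can be undone
    deploy(latest)
    if is_healthy():
        return "committed", latest  # last-good := latest

    # Unhealthy: restore data first (stateful apps), then the prior version.
    if restore is not None:
        restore()
    deploy(last_good)
    alert(f"rolled back to {last_good}: {latest} failed its health check")
    return "rolled-back", last_good
```
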
---

## 1. Definition of Done (Phase 2w exit condition)

Terminates when every item holds **and the Adversary has independently cold-verified** (logged in
`machine-docs/REVIEW-2w.md`):

- [ ] **WC1 — Live-warm keycloak (SSO dep), unpinned + self-healing.** A persistent (live-warm)
  keycloak runs at a stable domain, **unpinned** (reconciler `abra recipe fetch` latest +
  chaos-deploy, matching traefik — drop the `kcVersion` pin; keep the
  *secret-generate-only-if-missing* guard + the health-wait). SSO-dependent recipes (per
  `plan-sso-dep-testing.md`) point their `setup_custom_tests` at it, create a **per-run namespaced
  realm+client**, then **delete that realm** after the run, instead of co-deploying a fresh
  keycloak. Proven: a dependent recipe's SSO custom tests pass against the warm keycloak;
  concurrent dependents don't collide (distinct realms); leftover realms are reaped.
- [ ] **WC1.2 — Pre-deploy auto-upgrade safety gate (manual-migration → alert, don't auto-apply).**
  Before the reconciler auto-applies "latest", it checks the current→latest delta: **auto-apply
  only non-major (patch/minor) bumps with no manual-migration release notes.** A **MAJOR
  recipe-version bump**, or a target whose **`releaseNotes/` flag a manual migration**, is NOT
  auto-applied — the reconciler **stays on the current version and pushes an alert with the
  release notes** for the operator to upgrade manually. (Health-pass ≠ migration-done, so this is
  independent of WC1.1.) Detection leans on abra's major-bump handling + the recipe release notes.
  **Adversary proof:** simulate a major/manual-migration latest → confirm the warm app stays on
  current + an alert with the notes fired (no silent auto-upgrade).
- [ ] **WC1.1 — Health-gated deploy-with-rollback in the warm/infra reconcilers (traefik + keycloak).**
  Each reconciler: record the running version as **last-good** → fetch+deploy latest →
  health-check → **healthy: commit last-good := latest; unhealthy: roll back to last-good +
  `PushNotification` alert.** For **stateful apps (keycloak): snapshot the data volume BEFORE the
  upgrade; on rollback restore that snapshot + redeploy the prior version** (forward DB migrations
  make version-only rollback unsafe) — reuse the WC3 snapshot helper. traefik (stateless) = version
  rollback only. **Adversary proof:** force a broken "latest" (simulate) → confirm the warm app
  self-reverts to the prior healthy version (keycloak with its pre-upgrade data intact) and an
  alert fired; a healthy update commits the new version as last-good.
- [ ] **WC2 — Data-warm canonical model.** A canonical per warmed recipe at a **stable domain**
  (distinct from cold per-run `<recipe>-<6hex>` domains), kept **data-warm** (undeployed-when-idle,
  volume retained). A small declarative registry/reconciler tracks which recipes are
  canonical and **at which commit** their known-good is. Re-warmable from scratch (cache).
- [ ] **WC3 — Known-good snapshots.** For each canonical app, a **raw copy of its data volume(s)
  taken while undeployed**, stored under a stable path (e.g. `/var/lib/ci-warm/<recipe>/`),
  tagged with the commit it passed on. **One last-known-good retained per app** (prior is
  replaced atomically on update). Restore is proven to bring the app back healthy with its data.
- [ ] **WC4 — `--quick` mode.** `runner/run_recipe_ci.py` gains a `--quick` path (flag/env): reattach
  the canonical warm volume (`abra app deploy` of the canonical) → **upgrade to PR head** (chaos
  redeploy) → run generic UPGRADE + serving + custom assertions (generic-first invariant holds) →
  **on PASS:** `abra app undeploy` (keep volume), do NOT alter the known-good; **on FAIL:**
  restore the last-known-good snapshot, then undeploy. `--quick` **never promotes** the canonical.
- [ ] **WC5 — Canonical advancement via cold only.** A **green full-cold run on latest** re-snapshots +
  re-tags the canonical known-good (promote-on-green instead of deleting at teardown). A cold
  run is the ONLY thing that advances a canonical. Seeding: the first green cold run on latest
  makes an app canonical.
- [ ] **WC6 — Nightly: rebuild (warm/infra → latest) THEN full-cold sweep.** A scheduled nightly job,
  in order: (1) **`nixos-rebuild switch`** → the warm/infra reconcilers roll traefik + keycloak to
  latest, subject to the **WC1.2 pre-deploy gate** (major/manual-migration → hold on current +
  alert) and the **WC1.1 health-gated rollback**; (2) the **full cold** suite across enrolled recipes
  — refreshing every canonical's known-good (WC5) AND serving as a daily authoritative regression
  run, now against the freshly-updated infra. Don't run while a test run is in flight. Mechanism
  settled in DECISIONS (systemd timer on cc-ci / Drone cron / bridge), declarative + reproducible.
  Bounded by MAX_TESTS (serial is fine — nightly). If the rebuild's health-gate rolled an infra
  app back, the alert fires and the sweep still runs against the (healthy) prior version.
- [ ] **WC7 — Trigger + authority + labeling.** Default `!testme` = full cold (unchanged). `--quick`
  is opt-in (`!testme --quick`, or a build param) and **never gates merge**. Run results carry
  the **mode** (cold vs quick) so a `--quick` pass is distinctly labeled lower-confidence (feeds
  Phase 3). Quick requires an existing canonical; if none, it cleanly falls back to a cold run (or
  reports "no canonical — run cold first").
- [ ] **WC8 — Resource safety + isolation.** Warm-base runs serialize per app (MAX_TESTS honored);
  warm keycloak shared safely via per-run realms; **disk monitored** (warm volumes + one snapshot
  each) with a documented budget + prune of stale/orphaned warm data; cold-run teardown stays
  sacred (deletes its own per-run volumes); warm data is excluded from the D8 reproducibility
  closure (documented as cache).
- [ ] **WC9 — Documented + cold-verified, incl. the rollback proof.** `docs/` explains warm/quick;
  the Adversary cold-verifies, **including deliberately failing a PR under `--quick` and
  confirming the canonical's last-known-good is restored intact (data preserved)**, and that a
  `--quick` pass did not move the known-good. No softened tests.

When WC1–WC9 hold and are confirmed, write `## DONE` to `machine-docs/STATUS-2w.md` → the watchdog
auto-returns to **Phase 2** (resume recipe authoring).

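
WC3's "prior is replaced atomically on update" could look like the following Python sketch, under stated assumptions. The snapshot is treated as a plain directory copy; the layout (`snap-*` directories, a `known-good` symlink, a `COMMIT` tag file) and the function name `replace_known_good` are hypothetical, and the snapshot format itself is still an open decision (§5). `shutil.copytree` does not preserve file ownership, so a real raw volume copy would likely need a different copy tool. The point here is only the swap: the new copy is completed first, then a symlink rename switches the known-good in one step, and the single prior snapshot is removed afterwards.

```python
import os
import shutil
import tempfile
from pathlib import Path


def replace_known_good(volume_dir: Path, store_dir: Path, commit: str) -> Path:
    """Snapshot volume_dir into store_dir and atomically make it the known-good.

    Hypothetical layout: store_dir/snap-<commit>-XXXX/{data/,COMMIT} plus a
    store_dir/known-good symlink pointing at the current snapshot directory.
    """
    store_dir.mkdir(parents=True, exist_ok=True)
    link = store_dir / "known-good"
    previous = link.resolve() if link.is_symlink() else None

    # 1. Build the new snapshot completely before touching the old one.
    new_snap = Path(tempfile.mkdtemp(prefix=f"snap-{commit}-", dir=store_dir))
    shutil.copytree(volume_dir, new_snap / "data", symlinks=True)
    (new_snap / "COMMIT").write_text(commit + "\n")

    # 2. Atomic switch: create a temporary symlink, then rename it over the
    #    old one (a rename over an existing symlink is atomic on POSIX).
    tmp_link = store_dir / "known-good.tmp"
    if tmp_link.is_symlink():
        tmp_link.unlink()  # leftover from an interrupted earlier run
    tmp_link.symlink_to(new_snap.name)
    os.replace(tmp_link, link)

    # 3. Keep exactly one last-known-good: drop the prior snapshot.
    if previous is not None:
        shutil.rmtree(previous, ignore_errors=True)
    return new_snap
```
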
---

## 2. The `--quick` flow (reference)

```
PRECOND: a canonical for <recipe> exists (seeded by a prior green cold run); else fall back/report.
1. abra app deploy <canonical-domain> # reattach warm volume -> fast warm boot at known-good commit
2. wait_healthy
3. (deps) point at the warm keycloak; create a per-run realm+client (namespaced)
4. UPGRADE to PR head (abra app deploy --chaos to the PR checkout) # the op, once
5. assert: generic upgrade (reconverge + moved + serving) + recipe overlay + custom (requires_deps)
6a. PASS -> abra app undeploy <canonical-domain> # keep volume; known-good UNCHANGED
6b. FAIL -> restore last-known-good snapshot to the volume; abra app undeploy # roll back, data safe
7. (deps) delete the per-run realm from the warm keycloak
```
Cold run (default, unchanged) seeds/advances the canonical: on a green cold run on latest, snapshot
the (undeployed) volume → replace the last-known-good + tag the commit, and keep the volume as the
new canonical instead of deleting it.

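
The `--quick` steps in this section can be sketched as an orchestrator function. This is an illustrative Python outline, not the contents of `runner/run_recipe_ci.py`: `run_quick` and all of its callables are hypothetical stand-ins for the abra commands, the health wait, the assertion tiers, and the keycloak realm handling. It treats any exception during the run as a failure, so the restore path also covers crashes, and it never reports a promotion.

```python
from typing import Callable, Optional


def run_quick(
    canonical_domain: Optional[str],
    *,
    deploy: Callable[[str], None],
    wait_healthy: Callable[[str], None],
    create_realm: Callable[[], str],
    upgrade_to_pr_head: Callable[[str], None],
    run_assertions: Callable[[str], bool],
    restore_known_good: Callable[[str], None],
    undeploy: Callable[[str], None],
    delete_realm: Callable[[str], None],
) -> dict:
    """Run the quick flow once; every callable here is a hypothetical hook."""
    if canonical_domain is None:
        # PRECOND not met: the caller decides between running cold or skipping.
        return {"mode": "quick", "result": "no-canonical", "promoted": False}

    deploy(canonical_domain)  # step 1: reattach the warm volume
    realm = None
    passed = False
    try:
        wait_healthy(canonical_domain)        # step 2
        realm = create_realm()                # step 3: per-run namespaced realm
        upgrade_to_pr_head(canonical_domain)  # step 4: the one mutating op
        passed = run_assertions(canonical_domain)  # step 5
    except Exception:
        passed = False  # a crash counts as FAIL so the restore still happens
    finally:
        if not passed:
            restore_known_good(canonical_domain)  # step 6b: data rolled back
        undeploy(canonical_domain)                # steps 6a/6b: keep the volume
        if realm is not None:
            delete_realm(realm)                   # step 7
    return {"mode": "quick", "result": "pass" if passed else "fail", "promoted": False}
```

The returned `mode` field is one way to carry the cold-vs-quick label that WC7 asks for; the key names are assumptions.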
## 3. Milestones (bounded)
- **W0 — Warm keycloak, unpinned + self-healing (WC1, WC1.1).** Highest ROI; unblocks faster SSO
  recipe tests for the resumed Phase 2. Includes the health-gated deploy-with-rollback (snapshot
  keycloak before upgrade, restore on health-fail + alert); apply the same to traefik (version-only).
- **W1 — Canonical registry + snapshot/restore (WC2, WC3).** Stable-domain warm apps;
  raw-while-stopped snapshot + restore; prove restore round-trips data. (Shares the snapshot helper
  with WC1.1.)
- **W2 — `--quick` mode (WC4, WC7).** Orchestrator path + labeling + fallback.
- **W3 — Nightly rebuild→sweep + cold-advances-canonical (WC5, WC6).** Nightly `nixos-rebuild`
  (warm/infra → latest, health-gated) then full-cold sweep; promote-on-green-cold; scheduled job.
- **W4 — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9).** Then
  `## DONE`.

## 4. Guardrails
- **`--quick` never advances the canonical; only cold does.** Anchors the baseline to verified states.
- **Never lose the known-good** — snapshot before mutate (or rely on the standing known-good); restore
  on any quick failure. The rollback proof (WC9) is mandatory.
- **Default stays cold; quick is opt-in + clearly lower-confidence.** Don't let a quick pass read as
  full confidence.
- **Snapshot only while undeployed** (consistency). **One last-known-good per app** (disk).
- **Cold teardown stays sacred** (deletes per-run volumes); warm volumes are a managed cache, never
  confused with per-run state; warm data excluded from D8.
- **Warm/infra auto-update is health-gated** — a failed "latest" self-reverts to the last-good
  version (+ data, for stateful apps) and alerts; never leave the proxy/SSO dep broken silently.
- **Never weaken a test** (cardinal rule). Generic-first invariant holds in `--quick` too.
- **Bounded** — build the mechanism + prove on keycloak + a couple of recipes; do NOT re-warm all
  recipes here (the nightly sweep populates canonicals over time).

## 5. Open decisions (log in machine-docs/DECISIONS.md)
- Canonical **stable-domain scheme** (distinct from cold per-run domains) + how the registry/reconciler
  is declared.
- **Snapshot storage + format** (raw tar vs reflink/CoW copy) under `/var/lib/ci-warm/`; atomic replace.
- **Nightly sweep mechanism** (systemd timer / Drone cron / bridge) + ordering + disk-prune policy.
- `--quick` **trigger surface** (`!testme --quick` comment vs Drone build param) + the "no canonical
  yet" fallback (run cold vs report-and-skip).
- **Disk budget**: measure warm volume + snapshot sizes across recipes; decide if a 30→larger bump is
  needed or the warm set stays bounded.
- **Stateful pre-upgrade snapshot consistency (keycloak).** keycloak is *live-warm* (running) at
  nightly-upgrade time, but the snapshot rule is "raw copy while UNDEPLOYED." Cleanest: the nightly
  keycloak update = **undeploy → raw snapshot → deploy latest → health-check → on fail restore
  snapshot + redeploy prior** (the brief nightly downtime makes the snapshot consistent and honors
  the WC3 invariant). Confirm this vs an app-consistent backup alternative.
- **last-good version state** for warm/infra apps (where the reconciler records the prior healthy
  version to roll back to) — a small state file alongside the snapshot, re-derivable from the running
  swarm version label.
- **Manual-migration / major-bump detection (WC1.2).** How to decide "auto-apply vs alert-and-hold":
  primary signal = **major recipe-version bump** (coop-cloud `<upstream>+<recipe-semver>`; major
  recipe-semver = breaking, matches abra's own major-upgrade caution); secondary = scan the target's
  `releaseNotes/<version>.md` for manual-migration markers. Decide the exact rule + whether to parse
  `abra app upgrade` output vs compute the delta directly.
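
For the last open decision, the "compute the delta directly" option might look like the sketch below. It is only an illustration of the two signals named there, under several assumptions: the function names, the result strings, and the marker phrases in `MANUAL_MARKERS` are all hypothetical; the inputs are the recipe-semver parts only (splitting them out of the full version string is left to the caller); and anything that does not parse is held rather than applied.

```python
import re
from typing import Tuple

# Hypothetical marker phrases; the real list is part of the open decision.
MANUAL_MARKERS = ("manual migration", "manual intervention")


def parse_semver(text: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into ints; raise ValueError on anything else."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return int(match.group(1)), int(match.group(2)), int(match.group(3))


def decide(current: str, latest: str, release_notes: str = "") -> str:
    """Return 'noop', 'auto-apply', or 'hold-and-alert' for current -> latest."""
    try:
        cur, new = parse_semver(current), parse_semver(latest)
    except ValueError:
        return "hold-and-alert"  # fail closed: unknown format is never auto-applied
    if new <= cur:
        return "noop"  # nothing newer to apply
    if new[0] > cur[0]:
        return "hold-and-alert"  # primary signal: major recipe-version bump
    if any(marker in release_notes.lower() for marker in MANUAL_MARKERS):
        return "hold-and-alert"  # secondary signal: notes flag a manual migration
    return "auto-apply"
```

For example, `decide("1.2.3", "1.3.0")` yields `"auto-apply"` while `decide("1.2.3", "2.0.0")` yields `"hold-and-alert"`. The sketch only looks at the target's notes; an upgrade that skips several releases would presumably need the notes of every intermediate version checked as well.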