cc-ci-orchestrator/cc-ci-plan/plan-phase2w-warm-canonical-quick.md
autonomic-bot 00e90bb597 plan(2w): WC1.2 — pre-deploy auto-upgrade safety gate (major/manual-migration -> alert, hold)
Operator (2026-05-29): a passing health check does NOT prove a required manual migration ran, so
auto-update needs a PRE-deploy gate in addition to the post-deploy health rollback. Reconciler
auto-applies only non-major (patch/minor) upgrades with no manual-migration release notes; a MAJOR
recipe-version bump (or release notes flagging a manual migration) is held on the current version
with a PushNotification carrying the release notes (operator upgrades manually). Leans on abra's
own major-bump caution + recipe releaseNotes/. Updated WC1.2/WC6/principles/decisions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 00:02:28 +01:00


cc-ci Phase 2w — Warm canonical deployments + --quick CI mode (Autonomous Build Plan)

Status: ACTIVE — interjected into Phase 2 by operator decision (2026-05-28). Phase 2 (plan-phase2-recipe-tests.md) is PAUSED at its current progress (its STATUS-2/BACKLOG-2 state is preserved); the loops do this phase now, then Phase 2 resumes automatically where it left off.
Transition: auto — on ## DONE in machine-docs/STATUS-2w.md the watchdog returns to Phase 2.
Builds on: the Phase-1d/1e harness (generic suite, deploy-once, override overlays, HC1 upgrade to PR-head, the sso-dep pattern plan-sso-dep-testing.md) and the now-wired Docker Hub auth.
Owner agents: Builder + Adversary loops (plan.md §6/§7); Adversary cold-verifies.
This file: /srv/cc-ci/cc-ci-plan/plan-phase2w-warm-canonical-quick.md
Phase order now: 1c → 1b → 1d → 1e → 2(paused) → 2w → 2(resume) → 2b → 3 → 4.


0. Why this phase

Cold-start CI (fresh abra app new → deploy → DB-init/first-boot → … → teardown) is slow, and it re-pays that cost on every run and for every SSO dependency. This phase adds a warm-data layer: keep each app's data volume around between runs (Co-op Cloud's undeploy frees RAM but keeps volumes), so a fast --quick run can reattach it, upgrade to the PR code, and assert — without the cold-provisioning cost. A persistent keycloak serves SSO-dependent recipes without a fresh co-deploy each run. A last-known-good snapshot per app means a bad PR tested under --quick can never destroy the working state+data — we roll back.

Terminology (use these terms throughout code/docs/decisions):

  • live-warm — actually deployed and running (e.g. keycloak): instant to use, costs RAM.
  • data-warm — undeployed (RAM freed) but its data volume is retained, so a later abra app deploy reattaches it and boots warm (skips fresh DB-init/first-boot), costs only disk.
  • cold — no retained data: fresh abra app new + new volume + full lifecycle + teardown that deletes the volume. The authoritative default.

Design principles settled with the operator (do not relitigate):

  • Keep keycloak live-warm; keep everything else data-warm. Only keycloak (shared dep) + the one app under test run at a time. RAM stops being the limiter; disk is the budget (monitor; bump only if needed — test fixtures are small).
  • Default !testme = full cold (authoritative; its upgrade tier already exercises PR-upgrade per 1e). --quick is an opt-in flag, a lower-confidence fast lane.
  • The canonical known-good advances ONLY via cold runs (esp. the nightly sweep). --quick NEVER promotes the canonical — it consumes it read-mostly and rolls back on failure.
  • Snapshots: raw volume copy taken while UNDEPLOYED (fast + consistent because nothing is writing). One last-known-good per app.
  • Warm volumes + snapshots are cache, not source — not in the git/D8 closure; re-seeded by cold runs, not restored on a VM rebuild.
  • Warm/infra apps (traefik + keycloak) auto-update to LATEST, nightly, with health-gated rollback (operator, 2026-05-28). Both are unpinned — their reconcilers abra recipe fetch the latest published recipe + chaos-deploy it. A nightly nixos-rebuild switch runs the reconcilers so the warm/infra apps roll to latest each night. Because the recipe is fetched at activation (runtime), the nix closure stays byte-identical (only the deployed versions float) — D8 is preserved; the version pin is gone, so the closure is more stable, not less.
  • Auto-update is gated TWICE — a pre-deploy safety gate AND a post-deploy health gate:
    • Pre-deploy (don't even try unsafe upgrades): only auto-apply upgrades that don't require manual intervention — i.e. non-major (patch/minor) recipe-version bumps with no manual-migration in their release notes. If current→latest is a MAJOR bump or the target's release notes flag a manual migration, DO NOT auto-upgrade: stay on the current version and PUSH ALERT with the release notes (operator upgrades manually). A passing health check does NOT prove a required migration was done, so this gate is independent of health. Lean on abra's own major-bump caution + the recipe releaseNotes/.
    • Post-deploy (for upgrades we DO apply): record running version as last-good → deploy latest → health-check → healthy: commit (last-good := latest); unhealthy: roll back to last-good + PUSH ALERT. For stateful apps (keycloak / any app with a DB/volume): snapshot the data volume BEFORE the upgrade and restore it on rollback (a forward DB migration can make a version-only rollback fail) — reusing the WC3 known-good-snapshot mechanism. traefik (stateless) = version rollback only. (Rollback is NOT nix-generation rollback — the swarm app isn't in the generation; it's built into the reconciler.)
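
The post-deploy half of the gate can be sketched roughly as follows. This is a minimal illustration, not the reconciler itself: the `Reconciler` hooks (`running_version`, `deploy`, `healthy`, `alert`, `snapshot`, `restore`) are hypothetical names that a real implementation would back with abra calls, the WC3 snapshot helper, and the push-notification channel.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reconciler:
    # Hypothetical hooks; real ones would wrap abra and the alert channel.
    running_version: Callable[[], str]
    deploy: Callable[[str], None]
    healthy: Callable[[], bool]
    alert: Callable[[str], None]
    # Stateful apps only (e.g. keycloak). The snapshot hook is assumed to
    # take care of snapshot consistency itself (e.g. undeploy first).
    snapshot: Optional[Callable[[], None]] = None
    restore: Optional[Callable[[], None]] = None


def health_gated_update(r: Reconciler, latest: str) -> str:
    """Try to move to `latest`; return the version that is last-good afterwards."""
    last_good = r.running_version()      # record running version as last-good
    if latest == last_good:
        return last_good                 # nothing to do
    if r.snapshot is not None:
        r.snapshot()                     # data snapshot BEFORE the upgrade
    r.deploy(latest)
    if r.healthy():
        return latest                    # commit: last-good := latest
    if r.restore is not None:
        r.restore()                      # undo any forward DB migration
    r.deploy(last_good)                  # version rollback
    r.alert(f"rolled back {latest} -> {last_good}: health check failed")
    return last_good
```

A stateless app such as traefik would leave `snapshot` and `restore` unset, which reduces the rollback to a version-only redeploy.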

1. Definition of Done (Phase 2w exit condition)

Terminates when every item holds and the Adversary has independently cold-verified (logged in machine-docs/REVIEW-2w.md):

  • WC1 — Live-warm keycloak (SSO dep), unpinned + self-healing. A persistent (live-warm) keycloak runs at a stable domain, unpinned (reconciler abra recipe fetch latest + chaos-deploy, matching traefik — drop the kcVersion pin; keep the secret-generate-only-if-missing guard + the health-wait). SSO-dependent recipes (per plan-sso-dep-testing.md) point their setup_custom_tests at it, create a per-run namespaced realm+client, then delete that realm after the run, instead of co-deploying a fresh keycloak. Proven: a dependent recipe's SSO custom tests pass against the warm keycloak; concurrent dependents don't collide (distinct realms); leftover realms are reaped.
  • WC1.2 — Pre-deploy auto-upgrade safety gate (manual-migration → alert, don't auto-apply). Before the reconciler auto-applies "latest", it checks the current→latest delta: auto-apply only non-major (patch/minor) bumps with no manual-migration release notes. A MAJOR recipe-version bump, or a target whose releaseNotes/ flag a manual migration, is NOT auto-applied — the reconciler stays on the current version and pushes an alert with the release notes for the operator to upgrade manually. (Health-pass ≠ migration-done, so this is independent of WC1.1.) Detection leans on abra's major-bump handling + the recipe release notes. Adversary proof: simulate a major/manual-migration latest → confirm the warm app stays on current + an alert with the notes fired (no silent auto-upgrade).
  • WC1.1 — Health-gated deploy-with-rollback in the warm/infra reconcilers (traefik + keycloak). Each reconciler: record the running version as last-good → fetch+deploy latest → health-check → healthy: commit last-good := latest; unhealthy: roll back to last-good + PushNotification alert. For stateful apps (keycloak): snapshot the data volume BEFORE the upgrade; on rollback restore that snapshot + redeploy the prior version (forward DB migrations make version-only rollback unsafe) — reuse the WC3 snapshot helper. traefik (stateless) = version rollback only. Adversary proof: force a broken "latest" (simulate) → confirm the warm app self-reverts to the prior healthy version (keycloak with its pre-upgrade data intact) and an alert fired; a healthy update commits the new version as last-good.
  • WC2 — Data-warm canonical model. A canonical per warmed recipe at a stable domain (distinct from cold per-run <recipe>-<6hex> domains), kept data-warm (undeployed-when-idle, volume retained). A small declarative registry/reconciler tracks which recipes are canonical and at which commit their known-good is. Re-warmable from scratch (cache).
  • WC3 — Known-good snapshots. For each canonical app, a raw copy of its data volume(s) taken while undeployed, stored under a stable path (e.g. /var/lib/ci-warm/<recipe>/), tagged with the commit it passed on. One last-known-good retained per app (prior is replaced atomically on update). Restore is proven to bring the app back healthy with its data.
  • WC4 — --quick mode. runner/run_recipe_ci.py gains a --quick path (flag/env): reattach the canonical warm volume (abra app deploy of the canonical) → upgrade to PR head (chaos redeploy) → run generic UPGRADE + serving + custom assertions (generic-first invariant holds) → on PASS: abra app undeploy (keep volume), do NOT alter the known-good; on FAIL: restore the last-known-good snapshot, then undeploy. --quick never promotes the canonical.
  • WC5 — Canonical advancement via cold only. A green full-cold run on latest re-snapshots + re-tags the canonical known-good (promote-on-green instead of deleting at teardown). A cold run is the ONLY thing that advances a canonical. Seeding: the first green cold run on latest makes an app canonical.
  • WC6 — Nightly: rebuild (warm/infra → latest) THEN full-cold sweep. A scheduled nightly job, in order: (1) nixos-rebuild switch → the warm/infra reconcilers roll traefik + keycloak to latest, subject to the WC1.2 pre-deploy gate (major/manual-migration → hold on current + alert) and the WC1.1 health-gated rollback; (2) the full cold suite across enrolled recipes — refreshing every canonical's known-good (WC5) AND serving as a daily authoritative regression run, now against the freshly-updated infra. Don't run while a test run is in flight. Mechanism settled in DECISIONS (systemd timer on cc-ci / Drone cron / bridge), declarative + reproducible. Bounded by MAX_TESTS (serial is fine — nightly). If the rebuild's health-gate rolled an infra app back, the alert fires and the sweep still runs against the (healthy) prior version.
  • WC7 — Trigger + authority + labeling. Default !testme = full cold (unchanged). --quick is opt-in (!testme --quick, or a build param) and never gates merge. Run results carry the mode (cold vs quick) so a --quick pass is distinctly labeled lower-confidence (feeds Phase 3). Quick requires an existing canonical; if none, it cleanly falls back to a cold run (or reports "no canonical — run cold first").
  • WC8 — Resource safety + isolation. Warm-base runs serialize per app (MAX_TESTS honored); warm keycloak shared safely via per-run realms; disk monitored (warm volumes + one snapshot each) with a documented budget + prune of stale/orphaned warm data; cold-run teardown stays sacred (deletes its own per-run volumes); warm data is excluded from the D8 reproducibility closure (documented as cache).
  • WC9 — Documented + cold-verified, incl. the rollback proof. docs/ explains warm/quick; the Adversary cold-verifies, including deliberately failing a PR under --quick and confirming the canonical's last-known-good is restored intact (data preserved), and that a --quick pass did not move the known-good. No softened tests.
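
The WC3 requirement (one last-known-good per app, tagged with its commit, replaced atomically) could look something like the sketch below. It assumes a directory-copy format with a `known-good` symlink that is swapped via `os.replace`; the storage format is still an open decision, and the function names, the `tag.json` file, and the directory layout are illustrative only. `shutil.copytree` does not preserve file ownership, so a real raw volume copy would need something stronger.

```python
import json
import os
import shutil
import tempfile
from pathlib import Path


def save_known_good(volume_dir: Path, warm_root: Path, recipe: str, commit: str) -> Path:
    """Copy an UNDEPLOYED app's volume and make it the single known-good."""
    app_dir = warm_root / recipe
    app_dir.mkdir(parents=True, exist_ok=True)
    # Unique directory so re-snapshotting the same commit never touches
    # the snapshot that is currently live.
    snap = Path(tempfile.mkdtemp(prefix=f"snap-{commit}-", dir=app_dir))
    shutil.copytree(volume_dir, snap / "data", symlinks=True)
    (snap / "tag.json").write_text(json.dumps({"commit": commit}))

    current = app_dir / "known-good"
    previous = current.resolve() if current.is_symlink() else None
    tmp_link = app_dir / "known-good.tmp"
    if tmp_link.is_symlink():
        tmp_link.unlink()
    tmp_link.symlink_to(snap.name)
    os.replace(tmp_link, current)        # atomic swap of the pointer (POSIX)
    if previous is not None and previous != snap.resolve():
        shutil.rmtree(previous, ignore_errors=True)   # keep only one snapshot
    return snap


def restore_known_good(volume_dir: Path, warm_root: Path, recipe: str) -> str:
    """Overwrite the volume with the known-good copy; return its commit tag."""
    snap = (warm_root / recipe / "known-good").resolve(strict=True)
    if volume_dir.exists():
        shutil.rmtree(volume_dir)
    shutil.copytree(snap / "data", volume_dir, symlinks=True)
    return json.loads((snap / "tag.json").read_text())["commit"]
```

Because the new copy is fully written before the symlink is swapped, a crash mid-snapshot leaves the previous known-good in place; the cost is briefly holding two copies on disk.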

When WC1–WC9 hold and are confirmed, write ## DONE to machine-docs/STATUS-2w.md → the watchdog auto-returns to Phase 2 (resume recipe authoring).


2. The --quick flow (reference)

PRECOND: a canonical for <recipe> exists (seeded by a prior green cold run); else fall back/report.
  1. abra app deploy <canonical-domain>      # reattach warm volume -> fast warm boot at known-good commit
  2. wait_healthy
  3. (deps) point at the warm keycloak; create a per-run realm+client (namespaced)
  4. UPGRADE to PR head (abra app deploy --chaos to the PR checkout)   # the op, once
  5. assert: generic upgrade (reconverge + moved + serving) + recipe overlay + custom (requires_deps)
  6a. PASS -> abra app undeploy <canonical-domain>      # keep volume; known-good UNCHANGED
  6b. FAIL -> restore last-known-good snapshot to the volume; abra app undeploy   # roll back, data safe
  7. (deps) delete the per-run realm from the warm keycloak

Cold run (default, unchanged) seeds/advances the canonical: on a green cold run on latest, snapshot the (undeployed) volume → replace the last-known-good + tag the commit, and keep the volume as the new canonical instead of deleting it.
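
As an illustration, the numbered --quick flow above might map onto an orchestrator function like this one. The `Ops` protocol and every method name on it are hypothetical placeholders for whatever runner/run_recipe_ci.py ends up calling; the sketch only shows the ordering, the PASS/FAIL branches, and that the result is labeled with its mode.

```python
from typing import Protocol


class Ops(Protocol):
    """Hypothetical hooks; real ones would wrap abra, snapshots and keycloak."""
    def has_canonical(self, recipe: str) -> bool: ...
    def deploy_canonical(self, recipe: str) -> None: ...
    def wait_healthy(self, recipe: str) -> None: ...
    def create_realm(self, run_id: str) -> None: ...
    def chaos_deploy_pr_head(self, recipe: str) -> None: ...
    def run_assertions(self, recipe: str) -> bool: ...
    def restore_known_good(self, recipe: str) -> None: ...
    def undeploy(self, recipe: str) -> None: ...
    def delete_realm(self, run_id: str) -> None: ...


def run_quick(ops: Ops, recipe: str, run_id: str, needs_sso: bool) -> dict:
    # PRECOND: no canonical means quick cannot run; the caller decides
    # whether to fall back to a cold run or just report.
    if not ops.has_canonical(recipe):
        return {"mode": "quick", "result": "no-canonical"}
    passed = False
    realm_created = False
    try:
        ops.deploy_canonical(recipe)            # 1. reattach warm volume
        ops.wait_healthy(recipe)                # 2.
        if needs_sso:
            ops.create_realm(run_id)            # 3. per-run namespaced realm
            realm_created = True
        ops.chaos_deploy_pr_head(recipe)        # 4. upgrade to PR head, once
        passed = ops.run_assertions(recipe)     # 5. generic + overlay + custom
    finally:
        if not passed:
            ops.restore_known_good(recipe)      # 6b. roll back data (also on errors)
        ops.undeploy(recipe)                    # 6a/6b. keep volume either way
        if realm_created:
            ops.delete_realm(run_id)            # 7. reap the per-run realm
    # Quick never promotes: nothing here touches the known-good on PASS.
    return {"mode": "quick", "result": "pass" if passed else "fail"}
```

Putting the cleanup in `finally` means an exception in any step is treated like a failed run: the known-good is restored and the realm is still reaped before the error propagates.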

3. Milestones (bounded)

  • W0 — Warm keycloak, unpinned + self-healing (WC1, WC1.1). Highest ROI; unblocks faster SSO recipe tests for the resumed Phase 2. Includes the health-gated deploy-with-rollback (snapshot keycloak before upgrade, restore on health-fail + alert); apply the same to traefik (version-only).
  • W1 — Canonical registry + snapshot/restore (WC2, WC3). Stable-domain warm apps; raw-while-stopped snapshot + restore; prove restore round-trips data. (Shares the snapshot helper with WC1.1.)
  • W2 — --quick mode (WC4, WC7). Orchestrator path + labeling + fallback.
  • W3 — Nightly rebuild→sweep + cold-advances-canonical (WC5, WC6). Nightly nixos-rebuild (warm/infra → latest, health-gated) then full-cold sweep; promote-on-green-cold; scheduled job.
  • W4 — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9). Then ## DONE.

4. Guardrails

  • --quick never advances the canonical; only cold does. Anchors the baseline to verified states.
  • Never lose the known-good — snapshot before mutate (or rely on the standing known-good); restore on any quick failure. The rollback proof (WC9) is mandatory.
  • Default stays cold; quick is opt-in + clearly lower-confidence. Don't let a quick pass read as full confidence.
  • Snapshot only while undeployed (consistency). One last-known-good per app (disk).
  • Cold teardown stays sacred (deletes per-run volumes); warm volumes are a managed cache, never confused with per-run state; warm data excluded from D8.
  • Warm/infra auto-update is health-gated — a failed "latest" self-reverts to the last-good version (+ data, for stateful apps) and alerts; never leave the proxy/SSO dep broken silently.
  • Never weaken a test (cardinal rule). Generic-first invariant holds in --quick too.
  • Bounded — build the mechanism + prove on keycloak + a couple of recipes; do NOT re-warm all recipes here (the nightly sweep populates canonicals over time).

5. Open decisions (log in machine-docs/DECISIONS.md)

  • Canonical stable-domain scheme (distinct from cold per-run domains) + how the registry/reconciler is declared.
  • Snapshot storage + format (raw tar vs reflink/CoW copy) under /var/lib/ci-warm/; atomic replace.
  • Nightly sweep mechanism (systemd timer / Drone cron / bridge) + ordering + disk-prune policy.
  • --quick trigger surface (!testme --quick comment vs Drone build param) + the "no canonical yet" fallback (run cold vs report-and-skip).
  • Disk budget: measure warm volume + snapshot sizes across recipes; decide if a 30→larger bump is needed or the warm set stays bounded.
  • Stateful pre-upgrade snapshot consistency (keycloak). keycloak is live-warm (running) at nightly-upgrade time, but the snapshot rule is "raw copy while UNDEPLOYED." Cleanest: the nightly keycloak update = undeploy → raw snapshot → deploy latest → health-check → on fail restore snapshot + redeploy prior (the brief nightly downtime makes the snapshot consistent and honors the WC3 invariant). Confirm this vs an app-consistent backup alternative.
  • last-good version state for warm/infra apps (where the reconciler records the prior healthy version to roll back to) — a small state file alongside the snapshot, re-derivable from the running swarm version label.
  • Manual-migration / major-bump detection (WC1.2). How to decide "auto-apply vs alert-and-hold": primary signal = major recipe-version bump (coop-cloud <upstream>+<recipe-semver>; major recipe-semver = breaking, matches abra's own major-upgrade caution); secondary = scan the target's releaseNotes/<version>.md for manual-migration markers. Decide the exact rule + whether to parse abra app upgrade output vs compute the delta directly.
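
One possible shape for the "compute the delta directly" option in the last decision is sketched below. It is not the settled rule. It assumes the caller has already extracted the recipe-semver part from the full version label and read the target's release notes into a string; the marker phrases in `MIGRATION_MARKERS` are made up for illustration and would need to match how recipes actually word their notes.

```python
import re
from typing import Optional

# Hypothetical marker phrases; the real list is part of the open decision.
MIGRATION_MARKERS = re.compile(
    r"manual migration|manual intervention|breaking change", re.IGNORECASE
)


def major(recipe_semver: str) -> int:
    """Major component of a recipe semver such as '1.4.2' or 'v1.4.2'."""
    return int(recipe_semver.lstrip("v").split(".")[0])


def pre_deploy_gate(current: str, latest: str, release_notes: Optional[str]) -> str:
    """Return 'noop', 'apply' (auto-upgrade) or 'hold' (stay + alert)."""
    if current == latest:
        return "noop"
    if major(latest) != major(current):
        return "hold"                    # primary signal: major recipe bump
    if release_notes and MIGRATION_MARKERS.search(release_notes):
        return "hold"                    # secondary signal: notes flag a migration
    return "apply"                       # patch/minor with clean notes


print(pre_deploy_gate("1.4.2", "1.5.0", None))   # apply
print(pre_deploy_gate("1.4.2", "2.0.0", None))   # hold
```

On 'hold' the reconciler would stay on the current version and send the release notes with the alert. Whether a minor bump on a 0.x recipe should also count as breaking is left open in this sketch.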