plan(2w): WC1.2 — pre-deploy auto-upgrade safety gate (major/manual-migration -> alert, hold)

Operator (2026-05-29): a passing health check does NOT prove a required manual migration ran, so
auto-update needs a PRE-deploy gate in addition to the post-deploy health rollback. Reconciler
auto-applies only non-major (patch/minor) upgrades with no manual-migration release notes; a MAJOR
recipe-version bump (or release notes flagging a manual migration) is held on the current version
with a PushNotification carrying the release notes (operator upgrades manually). Leans on abra's
own major-bump caution + recipe releaseNotes/. Updated WC1.2/WC6/principles/decisions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 00:02:28 +01:00
parent c3a572e4b9
commit 00e90bb597

@@ -47,13 +47,21 @@ never destroy the working state+data — we roll back.
reconcilers so the warm/infra apps roll to latest each night. Because the recipe is fetched at
*activation* (runtime), the **nix closure stays byte-identical** (only the deployed versions float)
— D8 is preserved; the version pin is gone, so the closure is *more* stable, not less.
- **Health-gated deploy-with-rollback is built INTO the reconciler** (NOT nix-generation rollback —
the deployed swarm app isn't in the generation). Pattern: record the running version as last-good
→ deploy latest → health-check → **healthy: commit (last-good := latest); unhealthy: redeploy the
recorded last-good + PUSH ALERT.** For **stateful apps (keycloak, any app with a DB/volume):
snapshot the data volume BEFORE the upgrade and restore it on rollback** (a forward DB migration
can make a version-only rollback fail) — reusing the WC3 known-good-snapshot mechanism. traefik
(stateless) needs only the version rollback.
- **Auto-update is gated TWICE — a pre-deploy safety gate AND a post-deploy health gate:**
- **Pre-deploy (don't even try unsafe upgrades):** only **auto-apply upgrades that don't require
manual intervention** — i.e. non-major (patch/minor) recipe-version bumps with no
manual-migration in their release notes. If current→latest is a **MAJOR bump** or the target's
**release notes flag a manual migration**, **DO NOT auto-upgrade**: stay on the current version
and **PUSH ALERT** with the release notes (operator upgrades manually). A passing health check
does NOT prove a required migration was done, so this gate is independent of health. Lean on
abra's own major-bump caution + the recipe `releaseNotes/`.
- **Post-deploy (for upgrades we DO apply):** record running version as last-good → deploy latest
→ health-check → **healthy: commit (last-good := latest); unhealthy: roll back to last-good +
PUSH ALERT.** For **stateful apps (keycloak / any app with a DB/volume): snapshot the data volume
BEFORE the upgrade and restore it on rollback** (a forward DB migration can make a version-only
rollback fail) — reusing the WC3 known-good-snapshot mechanism. traefik (stateless) = version
rollback only. (Rollback is NOT nix-generation rollback — the swarm app isn't in the generation;
it's built into the reconciler.)
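
A minimal sketch of the two gates above as one reconcile pass, in Python. Everything here is hypothetical illustration rather than the reconciler's actual code: the `reconcile` function, the `Outcome` type, and the injected callables (`needs_manual_upgrade`, `release_notes`, `deploy`, `healthy`, `snapshot`, `restore`) are assumed names, and restoring the volume before redeploying last-good is an assumed ordering the plan does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    action: str            # "noop", "held", "committed", or "rolled_back"
    running: str           # version running after this pass
    alert: Optional[str]   # alert text to push to the operator, if any


def reconcile(
    current: str,
    latest: str,
    needs_manual_upgrade: Callable[[str, str], bool],  # hypothetical pre-deploy check
    release_notes: Callable[[str], str],               # hypothetical notes lookup
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    snapshot: Optional[Callable[[], None]] = None,     # stateful apps only
    restore: Optional[Callable[[], None]] = None,      # stateful apps only
) -> Outcome:
    if current == latest:
        return Outcome("noop", current, None)

    # Pre-deploy gate: never attempt an upgrade that needs manual intervention.
    if needs_manual_upgrade(current, latest):
        alert = (
            f"held on {current}: {latest} needs a manual upgrade\n"
            f"{release_notes(latest)}"
        )
        return Outcome("held", current, alert)

    # Post-deploy gate: record last-good, snapshot stateful data, then deploy.
    last_good = current
    if snapshot is not None:
        snapshot()
    deploy(latest)
    if healthy():
        return Outcome("committed", latest, None)

    # Unhealthy: restore data first (assumed ordering), then redeploy last-good.
    if restore is not None:
        restore()
    deploy(last_good)
    return Outcome(
        "rolled_back", last_good, f"{latest} unhealthy, rolled back to {last_good}"
    )
```

A stateless app such as traefik would pass no `snapshot`/`restore`; a stateful one such as keycloak would pass both. Note that a held upgrade returns before `deploy` is ever called, which is what keeps the pre-deploy gate independent of the health check.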
---
@@ -70,6 +78,15 @@ Terminates when every item holds **and the Adversary has independently cold-veri
that realm** after the run, instead of co-deploying a fresh keycloak. Proven: a dependent
recipe's SSO custom tests pass against the warm keycloak; concurrent dependents don't collide
(distinct realms); leftover realms are reaped.
- [ ] **WC1.2 — Pre-deploy auto-upgrade safety gate (manual-migration → alert, don't auto-apply).**
Before the reconciler auto-applies "latest", it checks the current→latest delta: **auto-apply
only non-major (patch/minor) bumps with no manual-migration release notes.** A **MAJOR
recipe-version bump**, or a target whose **`releaseNotes/` flag a manual migration**, is NOT
auto-applied — the reconciler **stays on the current version and pushes an alert with the
release notes** for the operator to upgrade manually. (Health-pass ≠ migration-done, so this is
independent of WC1.1.) Detection leans on abra's major-bump handling + the recipe release notes.
**Adversary proof:** simulate a major/manual-migration latest → confirm the warm app stays on
current + an alert with the notes fired (no silent auto-upgrade).
- [ ] **WC1.1 — Health-gated deploy-with-rollback in the warm/infra reconcilers (traefik + keycloak).**
Each reconciler: record the running version as **last-good** → fetch+deploy latest →
health-check → **healthy: commit last-good := latest; unhealthy: roll back to last-good +
@@ -98,7 +115,8 @@ Terminates when every item holds **and the Adversary has independently cold-veri
makes an app canonical.
- [ ] **WC6 — Nightly: rebuild (warm/infra → latest) THEN full-cold sweep.** A scheduled nightly job,
in order: (1) **`nixos-rebuild switch`** → the warm/infra reconcilers roll traefik + keycloak to
latest with the WC1.1 health-gated rollback; (2) the **full cold** suite across enrolled recipes
latest, subject to the **WC1.2 pre-deploy gate** (major/manual-migration → hold on current +
alert) and the **WC1.1 health-gated rollback**; (2) the **full cold** suite across enrolled recipes
— refreshing every canonical's known-good (WC5) AND serving as a daily authoritative regression
run, now against the freshly-updated infra. Don't run while a test run is in flight. Mechanism
settled in DECISIONS (systemd timer on cc-ci / Drone cron / bridge), declarative + reproducible.
@@ -185,3 +203,8 @@ new canonical instead of deleting it.
- **last-good version state** for warm/infra apps (where the reconciler records the prior healthy
version to roll back to) — a small state file alongside the snapshot, re-derivable from the running
swarm version label.
- **Manual-migration / major-bump detection (WC1.2).** How to decide "auto-apply vs alert-and-hold":
primary signal = **major recipe-version bump** (coop-cloud `<recipe-semver>+<upstream>`; major
recipe-semver = breaking, matches abra's own major-upgrade caution); secondary = scan the target's
`releaseNotes/<version>.md` for manual-migration markers. Decide the exact rule + whether to parse
`abra app upgrade` output vs compute the delta directly.
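
One possible shape for the "compute the delta directly" option, sketched in Python. This is an illustration under stated assumptions, not the settled rule: `should_hold`, `is_major_bump`, `parse_semver`, and the `MANUAL_MARKERS` pattern are hypothetical names, the marker wording is a guess at what release notes might contain, and the functions take the recipe-semver component already extracted from the version label. Holding on an unparseable version is also an assumed fail-safe choice.

```python
import re
from typing import Tuple

# Hypothetical marker list; the real wording used in releaseNotes/ is undecided.
MANUAL_MARKERS = re.compile(
    r"manual[\s-]+(migration|intervention|steps?)", re.IGNORECASE
)


def parse_semver(recipe_semver: str) -> Tuple[int, ...]:
    """Parse 'X.Y.Z' (optionally prefixed with 'v') into integer parts."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", recipe_semver.strip())
    if match is None:
        raise ValueError(f"not a semver: {recipe_semver!r}")
    return tuple(int(part) for part in match.groups())


def is_major_bump(current: str, latest: str) -> bool:
    """Primary signal: the major component of the recipe semver increased."""
    return parse_semver(latest)[0] > parse_semver(current)[0]


def should_hold(current: str, latest: str, notes: str) -> bool:
    """Return True when the upgrade should be held and alerted, not auto-applied."""
    try:
        major = is_major_bump(current, latest)
    except ValueError:
        # Assumed fail-safe: if the delta cannot be computed, do not auto-apply.
        return True
    # Secondary signal: the target's release notes mention a manual migration.
    return major or bool(MANUAL_MARKERS.search(notes))
```

For example, `should_hold("1.4.2", "1.5.0", "")` is `False` (minor bump, clean notes), while `should_hold("1.4.2", "2.0.0", "")` and `should_hold("1.4.2", "1.4.3", "Requires a manual migration of the DB")` are both `True`.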