plan(2w): WC1.2 — pre-deploy auto-upgrade safety gate (major/manual-migration -> alert, hold)
Operator (2026-05-29): a passing health check does NOT prove a required manual migration ran, so auto-update needs a PRE-deploy gate in addition to the post-deploy health rollback. Reconciler auto-applies only non-major (patch/minor) upgrades with no manual-migration release notes; a MAJOR recipe-version bump (or release notes flagging a manual migration) is held on the current version with a PushNotification carrying the release notes (operator upgrades manually). Leans on abra's own major-bump caution + recipe releaseNotes/. Updated WC1.2/WC6/principles/decisions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
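The two gates described in the commit message can be sketched as a single reconcile step. This is a minimal illustration under stated assumptions, not the reconciler's actual code: the `Reconciler` class, its hook names (`needs_manual_upgrade`, `deploy`, `healthy`, `alert`), and the returned status strings are all hypothetical, and every side effect is injected so the control flow can be read in isolation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reconciler:
    """Hypothetical reconciler for one warm/infra app; all hooks are injected."""
    current: str                                       # version running now
    needs_manual_upgrade: Callable[[str, str], bool]   # pre-deploy gate (assumed hook)
    deploy: Callable[[str], None]                      # deploys the given version
    healthy: Callable[[], bool]                        # post-deploy health check
    alert: Callable[[str], None]                       # push notification to the operator
    last_good: str = ""

    def reconcile(self, latest: str) -> str:
        if latest == self.current:
            return "noop"
        # Pre-deploy gate: do not even attempt an upgrade that needs the operator.
        if self.needs_manual_upgrade(self.current, latest):
            self.alert(f"held on {self.current}: {latest} needs a manual upgrade")
            return "held"
        # Post-deploy gate: record last-good, deploy, then check health.
        self.last_good = self.current
        self.deploy(latest)
        if self.healthy():
            self.current = self.last_good = latest
            return "committed"
        # Unhealthy: redeploy the recorded last-good version and alert.
        self.deploy(self.last_good)
        self.alert(f"rolled back to {self.last_good}: {latest} was unhealthy")
        return "rolled_back"
```

The sketch leaves out the data-volume snapshot and restore that the plan requires for stateful apps; in this shape it would wrap the `deploy` calls on the upgrade and rollback paths.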
@@ -47,13 +47,21 @@ never destroy the working state+data — we roll back.
 reconcilers so the warm/infra apps roll to latest each night. Because the recipe is fetched at
 *activation* (runtime), the **nix closure stays byte-identical** (only the deployed versions float)
 — D8 is preserved; the version pin is gone, so the closure is *more* stable, not less.
-- **Health-gated deploy-with-rollback is built INTO the reconciler** (NOT nix-generation rollback —
-the deployed swarm app isn't in the generation). Pattern: record the running version as last-good
-→ deploy latest → health-check → **healthy: commit (last-good := latest); unhealthy: redeploy the
-recorded last-good + PUSH ALERT.** For **stateful apps (keycloak, any app with a DB/volume):
-snapshot the data volume BEFORE the upgrade and restore it on rollback** (a forward DB migration
-can make a version-only rollback fail) — reusing the WC3 known-good-snapshot mechanism. traefik
-(stateless) needs only the version rollback.
+- **Auto-update is gated TWICE — a pre-deploy safety gate AND a post-deploy health gate:**
+- **Pre-deploy (don't even try unsafe upgrades):** only **auto-apply upgrades that don't require
+manual intervention** — i.e. non-major (patch/minor) recipe-version bumps with no
+manual-migration in their release notes. If current→latest is a **MAJOR bump** or the target's
+**release notes flag a manual migration**, **DO NOT auto-upgrade**: stay on the current version
+and **PUSH ALERT** with the release notes (operator upgrades manually). A passing health check
+does NOT prove a required migration was done, so this gate is independent of health. Lean on
+abra's own major-bump caution + the recipe `releaseNotes/`.
+- **Post-deploy (for upgrades we DO apply):** record running version as last-good → deploy latest
+→ health-check → **healthy: commit (last-good := latest); unhealthy: roll back to last-good +
+PUSH ALERT.** For **stateful apps (keycloak / any app with a DB/volume): snapshot the data volume
+BEFORE the upgrade and restore it on rollback** (a forward DB migration can make a version-only
+rollback fail) — reusing the WC3 known-good-snapshot mechanism. traefik (stateless) = version
+rollback only. (Rollback is NOT nix-generation rollback — the swarm app isn't in the generation;
+it's built into the reconciler.)
 
 ---
 
@@ -70,6 +78,15 @@ Terminates when every item holds **and the Adversary has independently cold-veri
 that realm** after the run, instead of co-deploying a fresh keycloak. Proven: a dependent
 recipe's SSO custom tests pass against the warm keycloak; concurrent dependents don't collide
 (distinct realms); leftover realms are reaped.
+- [ ] **WC1.2 — Pre-deploy auto-upgrade safety gate (manual-migration → alert, don't auto-apply).**
+Before the reconciler auto-applies "latest", it checks the current→latest delta: **auto-apply
+only non-major (patch/minor) bumps with no manual-migration release notes.** A **MAJOR
+recipe-version bump**, or a target whose **`releaseNotes/` flag a manual migration**, is NOT
+auto-applied — the reconciler **stays on the current version and pushes an alert with the
+release notes** for the operator to upgrade manually. (Health-pass ≠ migration-done, so this is
+independent of WC1.1.) Detection leans on abra's major-bump handling + the recipe release notes.
+**Adversary proof:** simulate a major/manual-migration latest → confirm the warm app stays on
+current + an alert with the notes fired (no silent auto-upgrade).
 - [ ] **WC1.1 — Health-gated deploy-with-rollback in the warm/infra reconcilers (traefik + keycloak).**
 Each reconciler: record the running version as **last-good** → fetch+deploy latest →
 health-check → **healthy: commit last-good := latest; unhealthy: roll back to last-good +
@@ -98,7 +115,8 @@ Terminates when every item holds **and the Adversary has independently cold-veri
 makes an app canonical.
 - [ ] **WC6 — Nightly: rebuild (warm/infra → latest) THEN full-cold sweep.** A scheduled nightly job,
 in order: (1) **`nixos-rebuild switch`** → the warm/infra reconcilers roll traefik + keycloak to
-latest with the WC1.1 health-gated rollback; (2) the **full cold** suite across enrolled recipes
+latest, subject to the **WC1.2 pre-deploy gate** (major/manual-migration → hold on current +
+alert) and the **WC1.1 health-gated rollback**; (2) the **full cold** suite across enrolled recipes
 — refreshing every canonical's known-good (WC5) AND serving as a daily authoritative regression
 run, now against the freshly-updated infra. Don't run while a test run is in flight. Mechanism
 settled in DECISIONS (systemd timer on cc-ci / Drone cron / bridge), declarative + reproducible.
@@ -185,3 +203,8 @@ new canonical instead of deleting it.
 - **last-good version state** for warm/infra apps (where the reconciler records the prior healthy
 version to roll back to) — a small state file alongside the snapshot, re-derivable from the running
 swarm version label.
+- **Manual-migration / major-bump detection (WC1.2).** How to decide "auto-apply vs alert-and-hold":
+primary signal = **major recipe-version bump** (coop-cloud `<upstream>+<recipe-semver>`; major
+recipe-semver = breaking, matches abra's own major-upgrade caution); secondary = scan the target's
+`releaseNotes/<version>.md` for manual-migration markers. Decide the exact rule + whether to parse
+`abra app upgrade` output vs compute the delta directly.
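One way the "compute the delta directly" option could look, as a hedged sketch only. The plan leaves the exact rule open, so everything here is an assumption: the function names, the `MANUAL_MARKERS` list (placeholder strings, not an agreed set), and the choice to hold when a version cannot be parsed. The functions take the recipe-semver part as input; splitting it out of the full coop-cloud version string is deliberately left out and should be checked against the Co-op Cloud versioning docs.

```python
import re
from typing import Optional, Sequence

# Placeholder markers (assumption): the real list is still to be decided.
MANUAL_MARKERS = ("manual migration", "manual intervention", "breaking change")

_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)")


def major_of(recipe_semver: str) -> int:
    """Return the major component of a recipe semver such as '1.4.2'."""
    m = _SEMVER.match(recipe_semver.strip())
    if m is None:
        raise ValueError(f"not a semver: {recipe_semver!r}")
    return int(m.group(1))


def is_major_bump(current: str, latest: str) -> bool:
    """Primary signal: the major recipe-semver component increased."""
    return major_of(latest) > major_of(current)


def flags_manual_migration(
    notes: Optional[str], markers: Sequence[str] = MANUAL_MARKERS
) -> bool:
    """Secondary signal: the release notes text contains any marker."""
    if not notes:
        return False
    text = notes.lower()
    return any(marker in text for marker in markers)


def decide(current: str, latest: str, notes: Optional[str]) -> str:
    """Return 'hold' (stay on current and alert) or 'auto-apply'."""
    try:
        major = is_major_bump(current, latest)
    except ValueError:
        return "hold"  # assumption: fail closed on an unparseable version
    if major or flags_manual_migration(notes):
        return "hold"
    return "auto-apply"
```

For example, `decide("1.4.2", "2.0.0", None)` returns `"hold"`, while `decide("1.4.2", "1.5.0", "Bug fixes only")` returns `"auto-apply"`. If current→latest skips several releases, the notes of every intermediate version would presumably need scanning too, which this sketch does not do.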