From 00e90bb5973e189293aac3b0099e5f782c46a8af Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Fri, 29 May 2026 00:02:28 +0100
Subject: [PATCH] =?UTF-8?q?plan(2w):=20WC1.2=20=E2=80=94=20pre-deploy=20au?=
 =?UTF-8?q?to-upgrade=20safety=20gate=20(major/manual-migration=20->=20ale?=
 =?UTF-8?q?rt,=20hold)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Operator (2026-05-29): a passing health check does NOT prove a required manual
migration ran, so auto-update needs a PRE-deploy gate in addition to the
post-deploy health rollback. Reconciler auto-applies only non-major
(patch/minor) upgrades with no manual-migration release notes; a MAJOR
recipe-version bump (or release notes flagging a manual migration) is held on
the current version with a PushNotification carrying the release notes
(operator upgrades manually). Leans on abra's own major-bump caution + recipe
releaseNotes/. Updated WC1.2/WC6/principles/decisions.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../plan-phase2w-warm-canonical-quick.md      | 39 +++++++++++++++----
 1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/cc-ci-plan/plan-phase2w-warm-canonical-quick.md b/cc-ci-plan/plan-phase2w-warm-canonical-quick.md
index 101445a..47d6f1f 100644
--- a/cc-ci-plan/plan-phase2w-warm-canonical-quick.md
+++ b/cc-ci-plan/plan-phase2w-warm-canonical-quick.md
@@ -47,13 +47,21 @@ never destroy the working state+data — we roll back.
   reconcilers so the warm/infra apps roll to latest each night. Because the recipe is fetched at
   *activation* (runtime), the **nix closure stays byte-identical** (only the deployed versions
   float) — D8 is preserved; the version pin is gone, so the closure is *more* stable, not less.
-- **Health-gated deploy-with-rollback is built INTO the reconciler** (NOT nix-generation rollback —
-  the deployed swarm app isn't in the generation). Pattern: record the running version as last-good
-  → deploy latest → health-check → **healthy: commit (last-good := latest); unhealthy: redeploy the
-  recorded last-good + PUSH ALERT.** For **stateful apps (keycloak, any app with a DB/volume):
-  snapshot the data volume BEFORE the upgrade and restore it on rollback** (a forward DB migration
-  can make a version-only rollback fail) — reusing the WC3 known-good-snapshot mechanism. traefik
-  (stateless) needs only the version rollback.
+- **Auto-update is gated TWICE — a pre-deploy safety gate AND a post-deploy health gate:**
+  - **Pre-deploy (don't even try unsafe upgrades):** only **auto-apply upgrades that don't require
+    manual intervention** — i.e. non-major (patch/minor) recipe-version bumps with no
+    manual-migration in their release notes. If current→latest is a **MAJOR bump** or the target's
+    **release notes flag a manual migration**, **DO NOT auto-upgrade**: stay on the current version
+    and **PUSH ALERT** with the release notes (operator upgrades manually). A passing health check
+    does NOT prove a required migration was done, so this gate is independent of health. Lean on
+    abra's own major-bump caution + the recipe `releaseNotes/`.
+  - **Post-deploy (for upgrades we DO apply):** record running version as last-good → deploy latest
+    → health-check → **healthy: commit (last-good := latest); unhealthy: roll back to last-good +
+    PUSH ALERT.** For **stateful apps (keycloak / any app with a DB/volume): snapshot the data volume
+    BEFORE the upgrade and restore it on rollback** (a forward DB migration can make a version-only
+    rollback fail) — reusing the WC3 known-good-snapshot mechanism. traefik (stateless) = version
+    rollback only. (Rollback is NOT nix-generation rollback — the swarm app isn't in the generation;
+    it's built into the reconciler.)
 
 ---
 
@@ -70,6 +78,15 @@ Terminates when every item holds **and the Adversary has independently cold-veri
   that realm** after the run, instead of co-deploying a fresh keycloak. Proven: a dependent recipe's
   SSO custom tests pass against the warm keycloak; concurrent dependents don't collide (distinct
   realms); leftover realms are reaped.
+- [ ] **WC1.2 — Pre-deploy auto-upgrade safety gate (manual-migration → alert, don't auto-apply).**
+  Before the reconciler auto-applies "latest", it checks the current→latest delta: **auto-apply
+  only non-major (patch/minor) bumps with no manual-migration release notes.** A **MAJOR
+  recipe-version bump**, or a target whose **`releaseNotes/` flag a manual migration**, is NOT
+  auto-applied — the reconciler **stays on the current version and pushes an alert with the
+  release notes** for the operator to upgrade manually. (Health-pass ≠ migration-done, so this is
+  independent of WC1.1.) Detection leans on abra's major-bump handling + the recipe release notes.
+  **Adversary proof:** simulate a major/manual-migration latest → confirm the warm app stays on
+  current + an alert with the notes fired (no silent auto-upgrade).
 - [ ] **WC1.1 — Health-gated deploy-with-rollback in the warm/infra reconcilers (traefik +
   keycloak).** Each reconciler: record the running version as **last-good** → fetch+deploy latest →
   health-check → **healthy: commit last-good := latest; unhealthy: roll back to last-good +
@@ -98,7 +115,8 @@ Terminates when every item holds **and the Adversary has independently cold-veri
   makes an app canonical.
 - [ ] **WC6 — Nightly: rebuild (warm/infra → latest) THEN full-cold sweep.** A scheduled nightly job,
   in order: (1) **`nixos-rebuild switch`** → the warm/infra reconcilers roll traefik + keycloak to
-  latest with the WC1.1 health-gated rollback; (2) the **full cold** suite across enrolled recipes
+  latest, subject to the **WC1.2 pre-deploy gate** (major/manual-migration → hold on current +
+  alert) and the **WC1.1 health-gated rollback**; (2) the **full cold** suite across enrolled recipes
   — refreshing every canonical's known-good (WC5) AND serving as a daily authoritative regression
   run, now against the freshly-updated infra. Don't run while a test run is in flight. Mechanism
   settled in DECISIONS (systemd timer on cc-ci / Drone cron / bridge), declarative + reproducible.
@@ -185,3 +203,8 @@ new canonical instead of deleting it.
 - **last-good version state** for warm/infra apps (where the reconciler records the prior healthy
   version to roll back to) — a small state file alongside the snapshot, re-derivable from the
   running swarm version label.
+- **Manual-migration / major-bump detection (WC1.2).** How to decide "auto-apply vs alert-and-hold":
+  primary signal = **major recipe-version bump** (coop-cloud `+`; major
+  recipe-semver = breaking, matches abra's own major-upgrade caution); secondary = scan the target's
+  `releaseNotes/.md` for manual-migration markers. Decide the exact rule + whether to parse
+  `abra app upgrade` output vs compute the delta directly.
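
Reviewer sketch (not part of the patch): the WC1.2 decision rule described above, "auto-apply vs alert-and-hold", could be computed directly from the version delta along these lines. This is a minimal sketch under stated assumptions, not the reconciler's actual implementation. It assumes version strings where the recipe-semver is the part before a `+` (for example `1.2.0+24.0.1`), and the names `pre_deploy_gate`, `recipe_major`, `GateDecision`, and the `MANUAL_MARKERS` phrases are all hypothetical; the plan explicitly leaves the exact rule and marker set to DECISIONS.

```python
import re
from dataclasses import dataclass

# Hypothetical marker phrases; the plan leaves the exact rule to DECISIONS.
MANUAL_MARKERS = ("manual migration", "manual intervention", "manual step")


@dataclass(frozen=True)
class GateDecision:
    auto_apply: bool  # True: safe to hand over to the post-deploy health gate
    reason: str       # Human-readable text to carry in the push alert


def recipe_major(version: str) -> int:
    """Return the major number of the recipe-semver part (text before '+').

    Assumption: versions look like 'X.Y.Z' or 'X.Y.Z+upstream'.
    """
    recipe_part = version.split("+", 1)[0].strip()
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)$", recipe_part)
    if match is None:
        raise ValueError(f"unparseable recipe version: {version!r}")
    return int(match.group(1))


def pre_deploy_gate(current: str, latest: str, release_notes: str = "") -> GateDecision:
    """Decide whether current -> latest may be auto-applied.

    Fails closed: anything that cannot be classified is held for the operator.
    """
    try:
        cur_major = recipe_major(current)
        new_major = recipe_major(latest)
    except ValueError as exc:
        return GateDecision(False, f"hold: {exc}")

    # Primary signal: a change in the recipe major version.
    if new_major != cur_major:
        return GateDecision(False, f"hold: major bump {cur_major} -> {new_major}")

    # Secondary signal: release notes that mention a manual migration.
    lowered = release_notes.lower()
    for marker in MANUAL_MARKERS:
        if marker in lowered:
            return GateDecision(False, f"hold: release notes mention {marker!r}")

    return GateDecision(True, "auto-apply: non-major bump, no manual-migration marker")
```

A held decision would leave the app on its current version and send `reason` plus the release notes in the alert; only an `auto_apply` decision proceeds to the WC1.1 deploy, health-check, and rollback path. Failing closed on unparseable versions is a design choice of this sketch, chosen because a silent auto-upgrade is the outcome the gate exists to prevent.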