plan(2w): warm/infra auto-latest nightly + health-gated rollback (snapshot stateful apps)
Operator decision (2026-05-28): traefik + keycloak stay UNPINNED (fetch latest + chaos deploy); a nightly `nixos-rebuild switch` rolls them to latest, then the full-cold sweep runs. The nix closure stays byte-identical (recipe fetched at runtime, not in the store) so D8 holds.

Health-gated rollback is built INTO the reconciler (not nix-generation rollback, since the swarm app isn't in the generation): record last-good -> deploy latest -> health-check -> commit or roll back + PushNotification.

Stateful apps (keycloak): snapshot the data volume before upgrade (undeploy->snapshot->deploy-latest) and restore it on rollback, reusing the WC3 snapshot helper; traefik = version rollback only.

Added WC1.1 + updated WC1/WC6/milestones/guardrails/decisions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -41,6 +41,19 @@ never destroy the working state+data — we roll back.
writing). **One last-known-good per app.**
- Warm volumes + snapshots are **cache, not source** — not in the git/D8 closure; re-seeded by cold
runs, not restored on a VM rebuild.
- **Warm/infra apps (traefik + keycloak) auto-update to LATEST, nightly, with health-gated
rollback** (operator, 2026-05-28). Both are **unpinned** — their reconcilers `abra recipe fetch`
the latest published recipe + chaos-deploy it. A **nightly `nixos-rebuild switch`** runs the
reconcilers so the warm/infra apps roll to latest each night. Because the recipe is fetched at
*activation* (runtime), the **nix closure stays byte-identical** (only the deployed versions float)
— D8 is preserved; the version pin is gone, so the closure is *more* stable, not less.
- **Health-gated deploy-with-rollback is built INTO the reconciler** (NOT nix-generation rollback —
the deployed swarm app isn't in the generation). Pattern: record the running version as last-good
→ deploy latest → health-check → **healthy: commit (last-good := latest); unhealthy: redeploy the
recorded last-good + PUSH ALERT.** For **stateful apps (keycloak, any app with a DB/volume):
snapshot the data volume BEFORE the upgrade and restore it on rollback** (a forward DB migration
can make a version-only rollback fail) — reusing the WC3 known-good-snapshot mechanism. traefik
(stateless) needs only the version rollback.
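The record, deploy, health-check, commit-or-roll-back pattern described in this hunk could be sketched roughly as follows for a stateless app such as traefik. This is a minimal Python sketch under stated assumptions, not the reconciler's actual code: `Reconciler`, `reconcile`, and the four injected callables are hypothetical names, and the real reconciler would wrap `abra` and the swarm rather than plain functions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reconciler:
    """Hypothetical wrapper around one warm/infra app (names are illustrative)."""
    running_version: Callable[[], str]  # reads the currently deployed version
    deploy: Callable[[str], None]       # deploys "latest" or a recorded version
    healthy: Callable[[], bool]         # health check run after a deploy
    alert: Callable[[str], None]        # push alert, called only on rollback


def reconcile(app: Reconciler) -> str:
    """Deploy latest; keep it only if healthy, otherwise return to last-good."""
    last_good = app.running_version()   # record before touching anything
    app.deploy("latest")
    if app.healthy():
        return app.running_version()    # commit: last-good := latest
    app.deploy(last_good)               # version-only rollback (stateless app)
    app.alert(f"rollback to {last_good}: latest failed its health check")
    return last_good
```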
---
@@ -49,12 +62,23 @@ never destroy the working state+data — we roll back.
Terminates when every item holds **and the Adversary has independently cold-verified** (logged in
`machine-docs/REVIEW-2w.md`):
- [ ] **WC1 — Live-warm keycloak (SSO dep).** A persistent (live-warm) keycloak runs at a stable domain. SSO-dependent
recipes (per `plan-sso-dep-testing.md`) point their `setup_custom_tests` at the warm keycloak
and create a **per-run namespaced realm+client**, then **delete that realm** after the run
(cleanup), instead of co-deploying a fresh keycloak. Proven: a dependent recipe's SSO custom
tests pass against the warm keycloak; concurrent dependents don't collide (distinct realms);
leftover realms are reaped.
- [ ] **WC1 — Live-warm keycloak (SSO dep), unpinned + self-healing.** A persistent (live-warm)
keycloak runs at a stable domain, **unpinned** (reconciler `abra recipe fetch` latest +
chaos-deploy, matching traefik — drop the `kcVersion` pin; keep the *secret-generate-only-if-
missing* guard + the health-wait). SSO-dependent recipes (per `plan-sso-dep-testing.md`) point
their `setup_custom_tests` at it, create a **per-run namespaced realm+client**, then **delete
that realm** after the run, instead of co-deploying a fresh keycloak. Proven: a dependent
recipe's SSO custom tests pass against the warm keycloak; concurrent dependents don't collide
(distinct realms); leftover realms are reaped.
- [ ] **WC1.1 — Health-gated deploy-with-rollback in the warm/infra reconcilers (traefik + keycloak).**
Each reconciler: record the running version as **last-good** → fetch+deploy latest →
health-check → **healthy: commit last-good := latest; unhealthy: roll back to last-good +
`PushNotification` alert.** For **stateful apps (keycloak): snapshot the data volume BEFORE the
upgrade; on rollback restore that snapshot + redeploy the prior version** (forward DB migrations
make version-only rollback unsafe) — reuse the WC3 snapshot helper. traefik (stateless) = version
rollback only. **Adversary proof:** force a broken "latest" (simulate) → confirm the warm app
self-reverts to the prior healthy version (keycloak with its pre-upgrade data intact) and an
alert fired; a healthy update commits the new version as last-good.
- [ ] **WC2 — Data-warm canonical model.** A canonical per warmed recipe at a **stable domain**
(distinct from cold per-run `<recipe>-<6hex>` domains), kept **data-warm** (undeployed-when-idle,
volume retained). A small declarative registry/reconciler tracks which recipes are
@@ -72,10 +96,14 @@ Terminates when every item holds **and the Adversary has independently cold-veri
+ re-tags the canonical known-good (promote-on-green instead of deleting at teardown). A cold
run is the ONLY thing that advances a canonical. Seeding: the first green cold run on latest
makes an app canonical.
- [ ] **WC6 — Nightly full-cold sweep.** A scheduled job runs the **full cold** suite across enrolled
recipes nightly — refreshing every canonical's known-good (WC5) AND serving as a daily
authoritative regression run. Mechanism settled in DECISIONS (systemd timer on cc-ci / Drone
cron / bridge), declarative + reproducible. Bounded by MAX_TESTS (serial is fine — nightly).
- [ ] **WC6 — Nightly: rebuild (warm/infra → latest) THEN full-cold sweep.** A scheduled nightly job,
in order: (1) **`nixos-rebuild switch`** → the warm/infra reconcilers roll traefik + keycloak to
latest with the WC1.1 health-gated rollback; (2) the **full cold** suite across enrolled recipes
— refreshing every canonical's known-good (WC5) AND serving as a daily authoritative regression
run, now against the freshly-updated infra. Don't run while a test run is in flight. Mechanism
settled in DECISIONS (systemd timer on cc-ci / Drone cron / bridge), declarative + reproducible.
Bounded by MAX_TESTS (serial is fine — nightly). If the rebuild's health-gate rolled an infra
app back, the alert fires and the sweep still runs against the (healthy) prior version.
- [ ] **WC7 — Trigger + authority + labeling.** Default `!testme` = full cold (unchanged). `--quick`
is opt-in (`!testme --quick`, or a build param) and **never gates merge**. Run results carry
the **mode** (cold vs quick) so a `--quick` pass is distinctly labeled lower-confidence (feeds
@@ -114,11 +142,14 @@ the (undeployed) volume → replace the last-known-good + tag the commit, and ke
new canonical instead of deleting it.
## 3. Milestones (bounded)
- **W0 — Warm keycloak (WC1).** Highest ROI; unblocks faster SSO recipe tests for the resumed Phase 2.
- **W0 — Warm keycloak, unpinned + self-healing (WC1, WC1.1).** Highest ROI; unblocks faster SSO
recipe tests for the resumed Phase 2. Includes the health-gated deploy-with-rollback (snapshot
keycloak before upgrade, restore on health-fail + alert); apply the same to traefik (version-only).
- **W1 — Canonical registry + snapshot/restore (WC2, WC3).** Stable-domain warm apps; raw-while-
stopped snapshot + restore; prove restore round-trips data.
stopped snapshot + restore; prove restore round-trips data. (Shares the snapshot helper with WC1.1.)
- **W2 — `--quick` mode (WC4, WC7).** Orchestrator path + labeling + fallback.
- **W3 — Cold-advances-canonical + nightly sweep (WC5, WC6).** Promote-on-green-cold; scheduled job.
- **W3 — Nightly rebuild→sweep + cold-advances-canonical (WC5, WC6).** Nightly `nixos-rebuild`
(warm/infra → latest, health-gated) then full-cold sweep; promote-on-green-cold; scheduled job.
- **W4 — Resource/isolation hardening + docs + cold verify incl. rollback proof (WC8, WC9).** Then
`## DONE`.
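The W3 ordering (rebuild first, then the full-cold sweep, with no start while a test run is in flight) could be sketched as below. This is a minimal Python sketch, not the scheduled job itself: `nightly`, `test_run_in_flight`, `rebuild`, and `cold_sweep` are hypothetical names, skipping (rather than waiting) when a run is in flight is an assumption, and `rebuild` is assumed to report whether the health gate rolled an app back.

```python
from typing import Callable


def nightly(
    test_run_in_flight: Callable[[], bool],
    rebuild: Callable[[], bool],
    cold_sweep: Callable[[], None],
) -> str:
    """Run the nightly steps in order and return a short status label."""
    if test_run_in_flight():
        return "skipped"        # assumption: skip instead of waiting
    rolled_back = rebuild()     # step 1: warm/infra to latest, health-gated
    cold_sweep()                # step 2: runs either way, on the healthy version
    return "swept-after-rollback" if rolled_back else "swept"
```

The sweep is deliberately not conditional on the rebuild outcome: a rolled-back infra app is still healthy at its prior version, so the regression run stays meaningful.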
@@ -131,6 +162,8 @@ new canonical instead of deleting it.
- **Snapshot only while undeployed** (consistency). **One last-known-good per app** (disk).
- **Cold teardown stays sacred** (deletes per-run volumes); warm volumes are a managed cache, never
confused with per-run state; warm data excluded from D8.
- **Warm/infra auto-update is health-gated** — a failed "latest" self-reverts to the last-good
version (+ data, for stateful apps) and alerts; never leave the proxy/SSO dep broken silently.
- **Never weaken a test** (cardinal rule). Generic-first invariant holds in `--quick` too.
- **Bounded** — build the mechanism + prove on keycloak + a couple of recipes; do NOT re-warm all
recipes here (the nightly sweep populates canonicals over time).
@@ -144,3 +177,11 @@ new canonical instead of deleting it.
yet" fallback (run cold vs report-and-skip).
- **Disk budget**: measure warm volume + snapshot sizes across recipes; decide if a 30→larger bump is
needed or the warm set stays bounded.
- **Stateful pre-upgrade snapshot consistency (keycloak).** keycloak is *live-warm* (running) at
nightly-upgrade time, but the snapshot rule is "raw copy while UNDEPLOYED." Cleanest: the nightly
keycloak update = **undeploy → raw snapshot → deploy latest → health-check → on fail restore
snapshot + redeploy prior** (the brief nightly downtime makes the snapshot consistent and honors
the WC3 invariant). Confirm this vs an app-consistent backup alternative.
- **last-good version state** for warm/infra apps (where the reconciler records the prior healthy
version to roll back to) — a small state file alongside the snapshot, re-derivable from the running
swarm version label.
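The two open questions above (snapshot while undeployed, and where last-good lives) might fit together as in the sketch below. It is a minimal Python sketch under stated assumptions, not a settled design: `upgrade_stateful`, the JSON layout of the state file, and all injected callables are hypothetical, and undeploying again before the restore is an assumption carried over from the "snapshot only while undeployed" rule.

```python
import json
from pathlib import Path
from typing import Callable


def upgrade_stateful(
    state_file: Path,
    running_version: Callable[[], str],
    undeploy: Callable[[], None],
    snapshot: Callable[[], None],
    restore: Callable[[], None],
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
    alert: Callable[[str], None],
) -> str:
    """Upgrade a stateful app; on a failed health check restore data and version."""
    last_good = running_version()
    # Persist last-good first, so an interrupted upgrade still knows its target.
    state_file.write_text(json.dumps({"last_good": last_good}))
    undeploy()                  # stop the app: raw copy only while undeployed
    snapshot()
    deploy("latest")
    if healthy():
        new_version = running_version()
        state_file.write_text(json.dumps({"last_good": new_version}))
        return new_version      # commit: last-good := latest
    undeploy()                  # assumption: stop again before touching the volume
    restore()                   # bring back the pre-upgrade data
    deploy(last_good)
    alert(f"rolled back to {last_good} with pre-upgrade data restored")
    return last_good
```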