backlog+decisions(2w): re-sequence W0 (WC3 helper first); unpin/snapshot/alert decisions

2026-05-29 00:05:13 +01:00
parent 740d7bac4c
commit ceacd0e6de
2 changed files with 56 additions and 15 deletions

@@ -585,3 +585,33 @@ from D8. The keycloak's realm data is ephemeral per-run, so nothing persistent t
from-scratch host before the reconciler has run, or the warm app is down), the keycloak dep path
falls back to the existing cold co-deploy so dependent runs still work. The warm path is preferred
when available.
## Phase 2w — design update: unpinned warm/infra + health-gated rollback (2026-05-28/29)
**Warm/infra apps (traefik + keycloak) auto-update to LATEST nightly, health-gated (operator).**
Supersedes the W0.3 pinned `kcVersion`. Keycloak is now unpinned like traefik: the reconciler runs
`abra recipe fetch` to pull the latest recipe, then chaos-deploys it, keeping the existing
secret-generate-only-if-missing and health-wait steps. D8 holds
because the recipe is fetched at *activation* (runtime), so the nix store closure is byte-identical
regardless of which keycloak version is live.

**Snapshot helper (WC3): format + path.** `runner/harness/warmsnap.py`. A snapshot is a **raw tar
of each docker volume belonging to the app's stack**, taken **while the app is undeployed** (nothing
writing → consistent). Stored under `/var/lib/ci-warm/<recipe>/` as `<recipe>.snapshot.tar` + a
`<recipe>.meta.json` (commit/version/timestamp/volume list). **One last-good per app**, replaced
**atomically** (write to `.tmp` then `rename`). Restore: for each volume, clear `_data` and untar
back. Docker volumes are stack-scoped (`<stack>_<vol>`); the helper enumerates them via
`docker volume ls` filtered to the stack. Reused by WC1.1 (pre-upgrade snapshot of keycloak) and WC5
(promote-on-green-cold). Warm snapshots are **cache, excluded from the D8 closure** (WC8).
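
The snapshot and restore steps described above could be sketched roughly as follows. This is not the actual `warmsnap.py`, which is not shown here. The function names, the `volumes_root` and `snap_root` parameters, and the choice of a single archive with one `<vol>/_data` entry per volume are all assumptions made for illustration. In practice `volumes_root` would be wherever Docker keeps local volumes, and the functions would run as root.

```python
import json
import os
import shutil
import subprocess
import tarfile
import time
from pathlib import Path


def list_stack_volumes(stack):
    """Names of docker volumes that belong to a stack (prefix `<stack>_`)."""
    out = subprocess.run(
        ["docker", "volume", "ls", "--filter", f"name={stack}_", "--format", "{{.Name}}"],
        check=True, capture_output=True, text=True,
    ).stdout
    # The name filter matches substrings, so keep only true prefix matches.
    return sorted(v for v in out.split() if v.startswith(f"{stack}_"))


def write_snapshot(recipe, volumes, volumes_root, snap_root, commit="", version=""):
    """Tar each volume's `_data` dir, then replace the last-good snapshot.

    Call only while the app is undeployed, so nothing is writing to the volumes.
    """
    dest = Path(snap_root) / recipe
    dest.mkdir(parents=True, exist_ok=True)
    snap = dest / f"{recipe}.snapshot.tar"
    meta = dest / f"{recipe}.meta.json"
    snap_tmp = dest / (snap.name + ".tmp")
    meta_tmp = dest / (meta.name + ".tmp")

    with tarfile.open(snap_tmp, "w") as tar:  # mode "w" = uncompressed (raw) tar
        for vol in volumes:
            tar.add(str(Path(volumes_root) / vol / "_data"), arcname=f"{vol}/_data")

    meta_tmp.write_text(json.dumps({
        "commit": commit,
        "version": version,
        "timestamp": int(time.time()),
        "volumes": list(volumes),
    }, indent=2))

    # Each rename is atomic on one filesystem; the pair is not atomic as a unit.
    os.replace(snap_tmp, snap)
    os.replace(meta_tmp, meta)
    return snap


def restore_snapshot(recipe, volumes_root, snap_root):
    """Clear each recorded volume's `_data`, then untar the snapshot back."""
    dest = Path(snap_root) / recipe
    meta = json.loads((dest / f"{recipe}.meta.json").read_text())
    for vol in meta["volumes"]:
        data = Path(volumes_root) / vol / "_data"
        shutil.rmtree(data, ignore_errors=True)
        data.mkdir(parents=True, exist_ok=True)
    # The archive is our own, so extract it unfiltered where the option exists.
    extra = {"filter": "fully_trusted"} if hasattr(tarfile, "fully_trusted_filter") else {}
    with tarfile.open(dest / f"{recipe}.snapshot.tar") as tar:
        tar.extractall(str(volumes_root), **extra)
    return meta
```

Writing to `.tmp` and renaming means a crash mid-snapshot leaves the previous last-good files in place rather than a truncated archive.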

**Alert mechanism: sentinel files relayed by the Builder loop.** The warm/infra reconciler is an
autonomous bash systemd unit on cc-ci; it cannot call the agent's `PushNotification` tool. So a
reconciler that rolls back (WC1.1) or holds a major/manual-migration upgrade (WC1.2) writes a JSON
**alert sentinel** to `/var/lib/ci-warm/alerts/<ts>-<app>-<reason>.json` (fields: app, reason
[rollback|held-major|held-manual-migration], from_version, to_version, release_notes, ts). The
Builder loop, each wake, scans that dir; for each new alert it (a) issues `PushNotification` to the
operator, (b) records it in STATUS-2w/JOURNAL-2w, (c) archives it to `alerts/seen/`. This bridges the
autonomous reconciler to operator visibility (latency = next Builder wake; acceptable for an alert).
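
A minimal sketch of the sentinel hand-off described above, under stated assumptions. The real reconciler is a bash unit, so `write_alert` here only illustrates the file shape. `relay_alerts`, `notify`, and `record` are hypothetical names: `notify` stands in for the agent's `PushNotification` tool and `record` for the STATUS-2w/JOURNAL-2w entry. Writing via a `.tmp` file and renaming is an added assumption, so the scanner never reads a half-written sentinel.

```python
import json
import os
from pathlib import Path

REASONS = {"rollback", "held-major", "held-manual-migration"}


def write_alert(alerts_dir, app, reason, from_version, to_version, release_notes, ts):
    """Write one alert sentinel as `<ts>-<app>-<reason>.json`."""
    if reason not in REASONS:
        raise ValueError(f"unknown alert reason: {reason}")
    alerts_dir = Path(alerts_dir)
    alerts_dir.mkdir(parents=True, exist_ok=True)
    path = alerts_dir / f"{ts}-{app}-{reason}.json"
    tmp = alerts_dir / (path.name + ".tmp")
    tmp.write_text(json.dumps({
        "app": app,
        "reason": reason,
        "from_version": from_version,
        "to_version": to_version,
        "release_notes": release_notes,
        "ts": ts,
    }))
    os.replace(tmp, path)  # the sentinel becomes visible only once complete
    return path


def relay_alerts(alerts_dir, notify, record):
    """Relay every new sentinel once, then archive it under `seen/`."""
    alerts_dir = Path(alerts_dir)
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = []
    # glob is non-recursive, so files already in seen/ are not picked up again.
    for path in sorted(alerts_dir.glob("*.json")):
        alert = json.loads(path.read_text())
        notify(alert)   # stand-in for PushNotification to the operator
        record(alert)   # stand-in for the STATUS-2w/JOURNAL-2w entry
        os.replace(path, seen / path.name)
        relayed.append(alert)
    return relayed
```

Because archiving happens after `notify` and `record`, a crash between those steps would relay the same alert again on the next wake. For an operator alert, a duplicate seems preferable to a lost one.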

**Re-sequence:** WC1.1's keycloak rollback needs the WC3 snapshot helper, so build that FIRST, then
rewrite the reconciler ONCE into the unpinned + WC1.2-safety-gated + WC1.1-health-gated-rollback form
(avoids reworking the reconciler twice). The W0.3 reconciler is INTERIM until then.
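
To make the dependency on the snapshot helper concrete, here is one possible ordering of the rewritten reconciler's steps, written in Python for illustration only (the reconciler itself is bash). Every callable is a hypothetical injected stand-in, and the exact rollback sequence (undeploy, restore, redeploy the previous version) is an assumption, not something the notes above specify.

```python
def reconcile(app, current, latest, *, hold_reason, undeploy, snapshot,
              deploy, healthy, restore, alert):
    """One reconcile pass for an unpinned warm/infra app.

    hold_reason(current, latest) returns None, "held-major", or
    "held-manual-migration" (the WC1.2 safety gate).
    """
    if latest == current:
        return "noop"

    # WC1.2: hold risky upgrades and tell the operator instead of deploying.
    reason = hold_reason(current, latest)
    if reason is not None:
        alert(app, reason, current, latest)
        return reason

    # WC3: snapshot while undeployed, so nothing is writing to the volumes.
    undeploy(app)
    snapshot(app)

    deploy(app, latest)
    if healthy(app):
        return "updated"

    # WC1.1: health gate failed, so restore last-good data and the old version.
    undeploy(app)
    restore(app)
    deploy(app, current)
    alert(app, "rollback", current, latest)
    return "rollback"
```

The rollback branch cannot exist without `snapshot` and `restore`, which is why the helper has to land before the reconciler is rewritten.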