diff --git a/machine-docs/DECISIONS.md b/machine-docs/DECISIONS.md
index 1d57d61..ee3f9b1 100644
--- a/machine-docs/DECISIONS.md
+++ b/machine-docs/DECISIONS.md
@@ -615,3 +615,30 @@ autonomous reconciler to operator visibility (latency = next Builder wake; accep
 **Re-sequence:** WC1.1's keycloak rollback needs the WC3 snapshot helper, so build that FIRST, then
 rewrite the reconciler ONCE into the unpinned + WC1.2-safety-gated + WC1.1-health-gated-rollback form
 (avoids reworking the reconciler twice). The W0.3 reconciler is INTERIM until then.
+
+## Phase 2w — W0.6 reconciler: version model + deploy-by-tag (2026-05-29)
+
+**Reconcile entrypoint in Python, packaged in the nix store.** `runner/warm_reconcile.py`, invoked by
+the systemd unit as `${pyEnv}/bin/python3 ${../../runner}/warm_reconcile.py ` (the runner/ dir is
+copied into the store → D8-clean, no dependence on the /root/cc-ci checkout). Reuses
+warmsnap/sso/abra/lifecycle so there is ONE snapshot impl (also used by the runner for WC5). Replaces
+the bash reconcile in warm-keycloak.nix.
+
+**"latest" = newest published version TAG, deployed pinned (not chaos-of-main).** WC1.2's "major
+recipe-version bump" detection needs comparable versions, which chaos (deploy main HEAD) doesn't give.
+So the reconciler resolves latest = `git tag | sort -V | tail -1` (valid coop-cloud version tags),
+records current = the app .env `VERSION`, and deploys the chosen tag pinned (`abra app deploy
+ -o -n -f`, after `git checkout `). "Auto-update to latest" is satisfied by converging
+to the newest tag; "chaos" in the operator note is read as "auto-deploy latest", and tag-pinning is
+the correct mechanism for a version-gated auto-update.
+
+**coop-cloud version format is `+` (observed), not the plan's
+`+`.** Evidence: keycloak `10.7.1+26.6.2` → image `keycloak:26.6.2`; n8n
+`3.2.0+2.20.6` → image `n8nio/n8n:2.20.6` (the post-`+` part is the app image tag). So the **recipe
+semver is the part BEFORE `+`**.
WC1.2's "major recipe bump = breaking" keys off the major (first)
+component of the pre-`+` recipe semver (e.g. 3.x→4.0 = held). Secondary signal: scan the target's
+`releaseNotes/.md` for manual-migration markers.
+
+**Scope order for W0.6:** keycloak first (the W0 focus, stateful → snapshot path); apply the same
+health-gated + safety-gate pattern to traefik (stateless, version-rollback-only) afterward by
+migrating proxy.nix onto the shared reconcile entrypoint.
diff --git a/machine-docs/STATUS-2w.md b/machine-docs/STATUS-2w.md
index 7f176d3..2e1f065 100644
--- a/machine-docs/STATUS-2w.md
+++ b/machine-docs/STATUS-2w.md
@@ -53,19 +53,25 @@ nightly full-cold sweep. Definition of Done = WC1–WC9 (plan §1), each Adversa
 warm-keycloak.service active, system running (0 failed), /realms/master=200. (INTERIM: pinned +
 skip-if-healthy; to be replaced by the unpinned + health-gated WC1.1 form.)
 
-**Re-sequenced after the 2026-05-28/29 design update (unpin + WC1.1 rollback + WC1.2 safety gate):**
-WC1.1's keycloak rollback needs the **WC3 snapshot/restore helper**, so build that FIRST, then
-rewrite the reconciler ONCE into the unpinned + safety-gated + health-gated-with-rollback form. Next:
-1. **WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`): raw copy of an app's data
- volume(s) while undeployed, under `/var/lib/ci-warm//`, atomic replace, one last-good;
- restore round-trips data. + unit tests + live round-trip proof.
-2. Rewrite reconciler: unpin keycloak (fetch latest + chaos); WC1.2 safety gate (major / manual-
- migration → hold + alert); WC1.1 record last-good → (keycloak: undeploy→snapshot→deploy latest) →
- health-gate → commit-or-rollback+restore+alert.
-3. Settle the **alert mechanism** (bash reconciler can't call agent PushNotification — sentinel file
- the Builder loop relays, see DECISIONS).
-4.
Resolve the lasuite-docs in-place-redeploy race (BUILD finding below) OR pick a more-robust
- dependent, then the headline WC1 e2e (dependent SSO green vs warm keycloak) + concurrency proof.
+- **W0.5 WC3 snapshot/restore helper** (`runner/harness/warmsnap.py`) DONE (4cc1e15). +5 unit tests
+ (48 unit pass). **LIVE round-trip PROVEN on warm keycloak**: marker realm → undeploy → snapshot
+ (mariadb+providers) → deploy → delete marker (mutate DB) → undeploy → restore → deploy → marker
+ realm BACK; keycloak healthy. Snapshots under `/var/lib/ci-warm//`, atomic, one last-good.
+
+**Next (W0.6 reconciler rewrite — split):**
+1. **W0.6a** — Python reconcile entrypoint `runner/warm_reconcile.py`, packaged into the nix store
+ (systemd unit invokes the store copy of runner/ — D8-clean, reuses warmsnap/sso/abra; replaces the
+ bash reconciler). UNPIN keycloak (fetch latest + chaos deploy; drop kcVersion); keep secret-guard +
+ health-wait.
+2. **W0.6b** — WC1.2 pre-deploy safety gate: major recipe-semver bump OR releaseNotes manual-migration
+ marker → hold-on-current + alert-with-notes (no deploy churn).
+3. **W0.6c** — WC1.1 health-gated rollback: record last-good → (keycloak: undeploy→snapshot→deploy
+ latest) → health-gate → commit-or-(restore+redeploy-prior+alert). Same for traefik (version
+ rollback only). Alert = sentinel file in `/var/lib/ci-warm/alerts/` relayed by the Builder loop.
+4. **W0.7** — resolve the lasuite-docs in-place-redeploy race (finding below) OR pick a more-robust
+ dependent; then **W0.8** headline WC1 e2e (dependent SSO green vs warm keycloak) + concurrency.
+5. **W0.9** — WC1.1/WC1.2 Adversary-facing proofs (broken latest → self-revert + data intact + alert;
+ healthy → commit last-good; major/manual-migration → hold + alert).
 
 **Build finding (mine, to fix):** lasuite-docs `setup_custom_tests` in-place `abra app deploy
 --force --chaos` (OIDC wiring) fails: nginx `web` fatally exits `[emerg] host not found in upstream