decisions(2): record operator principle — real-abra-only deploys, abra convergence by default, READY_PROBE (strict + negative-tested) only when abra doesn't fit; F2-12 applied

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 12:57:41 +01:00
parent a13d2ae48b
commit 7dab4f5cb6
1 changed files with 32 additions and 0 deletions
--- a/machine-docs/DECISIONS.md
+++ b/machine-docs/DECISIONS.md
@ -750,3 +750,35 @@ config + cache GC for marginal gain. **Revisit ONLY if** (a) cc-ci goes multi-no
 measurement shows cold-deploy pull time is a real bottleneck AND the cache can live on
 recreate-surviving storage (Incus volume / host b1 path, not the VM's ephemeral disk). No registry
 code was written (caught during orientation) — nothing to revert.
+
+---
+
+## 2026-05-29 — Real-abra-only deploys; abra convergence by default; READY_PROBE only when abra doesn't fit (operator principle; plan.md §9)
+
+**Decision (operator, 2026-05-29).** CI deploys/upgrades MUST use **real abra commands** — never
+`docker service update`/`docker service scale` to surgically patch a stack into health (that would
+test a non-abra path and can mask a broken deploy). **Prefer abra's own convergence checks by
+default.** Only skip abra's convergence monitor (`abra app deploy -c/--no-converge-checks`) and
+substitute a **harness READY_PROBE** when abra genuinely does not fit — e.g. its convergence window
+is too short for a heavy app and it `FATA deploy failed`s on a deploy that DOES converge given time.
+
+**When you do skip abra convergence, the rules are:**
+- The deploy stays **real abra** (`abra app deploy [-C] -c`); only abra's *waiting* is replaced, not
+  the deploy mechanism. `docker stack deploy` still applies the real spec.
+- The harness replacement MUST be a genuinely **STRICT** readiness test: **all swarm services N/N**
+  (`lifecycle.wait_healthy` → `services_converged`) **+ a real app-level check** (the app HEALTH_PATH
+  AND any recipe `READY_PROBE` — a live HTTP assertion on a real endpoint), bounded by a generous but
+  finite deadline (recipe `DEPLOY_TIMEOUT`).
+- It MUST **RAISE on actual non-readiness** — never a no-op that lets a failed deploy pass. **Prove it
+  has teeth with a negative test.**
+
+**Applied:** F2-12 lasuite-drive upgrade tier. abra's converge monitor FATA'd while the upgraded
+collabora `25.04.9.4.1` healthcheck was still in `start_period` (jail/config init), though it
+converges via swarm's healthcheck retries. Fix (`e1147b5`): upgrade chaos redeploy uses `abra … -c`;
+`generic.perform_upgrade` then owns `lifecycle.wait_healthy` (services N/N + app HEALTH_PATH) +
+`lifecycle.wait_ready_probes` (recipe `READY_PROBE` → collabora WOPI `/hosting/discovery` 200),
+bounded by `DEPLOY_TIMEOUT`. Teeth proven by `tests/unit/test_f212_upgrade_convergence.py` (`6506c4a`,
+5 P7-negative tests: the wait RAISES `TimeoutError` on stuck/never-serving convergence). The lone
+`docker service scale …minio-createbuckets` is NOT a bypass — it triggers the recipe's own
+`replicas:0` one-shot (Adversary-confirmed). The Adversary still owns confirming "not a weakening" at
+the Q3.2 cold-verify.