From 7dab4f5cb62af5bd8cd17231aad2b15cb8439508 Mon Sep 17 00:00:00 2001 From: autonomic-bot Date: Fri, 29 May 2026 12:57:41 +0100 Subject: [PATCH] =?UTF-8?q?decisions(2):=20record=20operator=20principle?= =?UTF-8?q?=20=E2=80=94=20real-abra-only=20deploys,=20abra=20convergence?= =?UTF-8?q?=20by=20default,=20READY=5FPROBE=20(strict=20+=20negative-teste?= =?UTF-8?q?d)=20only=20when=20abra=20doesn't=20fit;=20F2-12=20applied?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Opus 4.8 (1M context) --- machine-docs/DECISIONS.md | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/machine-docs/DECISIONS.md b/machine-docs/DECISIONS.md index 0df5580..4e1e375 100644 --- a/machine-docs/DECISIONS.md +++ b/machine-docs/DECISIONS.md 
@@ -750,3 +750,35 @@ config + cache GC for marginal gain. **Revisit ONLY if** (a) cc-ci goes multi-no measurement shows cold-deploy pull time is a real bottleneck AND the cache can live on recreate-surviving storage (Incus volume / host b1 path, not the VM's ephemeral disk). No registry code was written (caught during orientation) — nothing to revert. + +--- + +## 2026-05-29 — Real-abra-only deploys; abra convergence by default; READY_PROBE only when abra doesn't fit (operator principle; plan.md §9) + +**Decision (operator, 2026-05-29).** CI deploys/upgrades MUST use **real abra commands** — never +`docker service update`/`docker service scale` to surgically patch a stack into health (that would +test a non-abra path and can mask a broken deploy). **Prefer abra's own convergence checks by +default.** Only skip abra's convergence monitor (`abra app deploy -c/--no-converge-checks`) and +substitute a **harness READY_PROBE** when abra genuinely does not fit — e.g. its convergence window +is too short for a heavy app and it `FATA deploy failed`s on a deploy that DOES converge given time. + +**When you do skip abra convergence, the rules are:** +- The deploy stays **real abra** (`abra app deploy [-C] -c`); only abra's *waiting* is replaced, not + the deploy mechanism. `docker stack deploy` still applies the real spec. +- The harness replacement MUST be a genuinely **STRICT** readiness test: **all swarm services N/N** + (`lifecycle.wait_healthy` → `services_converged`) **+ a real app-level check** (the app HEALTH_PATH + AND any recipe `READY_PROBE` — a live HTTP assertion on a real endpoint), bounded by a generous but + finite deadline (recipe `DEPLOY_TIMEOUT`). +- It MUST **RAISE on actual non-readiness** — never a no-op that lets a failed deploy pass. **Prove it + has teeth with a negative test.** + +**Applied:** F2-12 lasuite-drive upgrade tier. 
abra's converge monitor FATA'd while the upgraded +collabora `25.04.9.4.1` healthcheck was still in `start_period` (jail/config init), though it +converges via swarm's healthcheck retries. Fix (`e1147b5`): upgrade chaos redeploy uses `abra … -c`; +`generic.perform_upgrade` then owns `lifecycle.wait_healthy` (services N/N + app HEALTH_PATH) + +`lifecycle.wait_ready_probes` (recipe `READY_PROBE` → collabora WOPI `/hosting/discovery` 200), +bounded by `DEPLOY_TIMEOUT`. Teeth proven by `tests/unit/test_f212_upgrade_convergence.py` (`6506c4a`, +5 P7-negative tests: the wait RAISES `TimeoutError` on stuck/never-serving convergence). The lone +`docker service scale …minio-createbuckets` is NOT a bypass — it triggers the recipe's own +`replicas:0` one-shot (Adversary-confirmed). The Adversary still owns confirming "not a weakening" at +the Q3.2 cold-verify.
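Review bot: The **Applied** paragraph justifies the one remaining `docker service scale …minio-createbuckets` call only in passing, while the Decision paragraph names `docker service scale` as a forbidden way to patch a stack into health, and the "rules" list says nothing about when a scale call is acceptable. A later reader can cite this entry as precedent for other scale calls. Add an explicit bullet to the rules list stating the criterion the text already implies (a scale that only triggers the recipe's own `replicas:0` one-shot is allowed; a scale or update that changes a long-running service to reach health is not), and say where the "Adversary-confirmed" finding is recorded so it can be checked. Two smaller documentation points in the same hunk: `-c` is expanded as `--no-converge-checks` but `-C` in `abra app deploy [-C] -c` is never spelled out, and the rule "bounded by a generous but finite deadline (recipe `DEPLOY_TIMEOUT`)" does not say what bound applies when a recipe sets no `DEPLOY_TIMEOUT`, which is the case where a wait could become unbounded. Since the negative tests are described as raising `TimeoutError` on stuck or never-serving convergence, it would also help to state in the rules whether a READY_PROBE endpoint that answers with a non-200 status must be covered by a negative test too.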