decisions(2): record operator principle — real-abra-only deploys, abra convergence by default, READY_PROBE (strict + negative-tested) only when abra doesn't fit; F2-12 applied
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ -750,3 +750,35 @@ config + cache GC for marginal gain. **Revisit ONLY if** (a) cc-ci goes multi-no
|
||||
measurement shows cold-deploy pull time is a real bottleneck AND the cache can live on
|
||||
recreate-surviving storage (Incus volume / host b1 path, not the VM's ephemeral disk). No registry
|
||||
code was written (caught during orientation) — nothing to revert.
|
||||
|
||||
---
|
||||
|
||||
## 2026-05-29 — Real-abra-only deploys; abra convergence by default; READY_PROBE only when abra doesn't fit (operator principle; plan.md §9)
|
||||
|
||||
**Decision (operator, 2026-05-29).** CI deploys/upgrades MUST use **real abra commands** — never
|
||||
`docker service update`/`docker service scale` to surgically patch a stack into health (that would
|
||||
test a non-abra path and can mask a broken deploy). **Prefer abra's own convergence checks by
|
||||
default.** Only skip abra's convergence monitor (`abra app deploy -c/--no-converge-checks`) and
|
||||
substitute a **harness READY_PROBE** when abra genuinely does not fit — e.g. its convergence window
|
||||
is too short for a heavy app and it `FATA deploy failed`s on a deploy that DOES converge given time.
|
||||
|
||||
**When you do skip abra convergence, the rules are:**
|
||||
- The deploy stays **real abra** (`abra app deploy [-C] -c`); only abra's *waiting* is replaced, not
|
||||
the deploy mechanism. `docker stack deploy` still applies the real spec.
|
||||
- The harness replacement MUST be a genuinely **STRICT** readiness test: **all swarm services N/N**
|
||||
(`lifecycle.wait_healthy` → `services_converged`) **+ a real app-level check** (the app HEALTH_PATH
|
||||
AND any recipe `READY_PROBE` — a live HTTP assertion on a real endpoint), bounded by a generous but
|
||||
finite deadline (recipe `DEPLOY_TIMEOUT`).
|
||||
- It MUST **RAISE on actual non-readiness** — never a no-op that lets a failed deploy pass. **Prove it
|
||||
has teeth with a negative test.**
|
||||
|
||||
**Applied:** F2-12 lasuite-drive upgrade tier. abra's converge monitor FATA'd while the upgraded
|
||||
collabora `25.04.9.4.1` healthcheck was still in `start_period` (jail/config init), though it
|
||||
converges via swarm's healthcheck retries. Fix (`e1147b5`): upgrade chaos redeploy uses `abra … -c`;
|
||||
`generic.perform_upgrade` then owns `lifecycle.wait_healthy` (services N/N + app HEALTH_PATH) +
|
||||
`lifecycle.wait_ready_probes` (recipe `READY_PROBE` → collabora WOPI `/hosting/discovery` 200),
|
||||
bounded by `DEPLOY_TIMEOUT`. Teeth proven by `tests/unit/test_f212_upgrade_convergence.py` (`6506c4a`,
|
||||
5 P7-negative tests: the wait RAISES `TimeoutError` on stuck/never-serving convergence). The lone
|
||||
`docker service scale …minio-createbuckets` is NOT a bypass — it triggers the recipe's own
|
||||
`replicas:0` one-shot (Adversary-confirmed). The Adversary still owns confirming "not a weakening" at
|
||||
the Q3.2 cold-verify.
|
||||
|
||||
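Review bot: In the STRICT bullet, "AND any recipe `READY_PROBE`" leaves the probe optional, so a recipe that deploys with `-c` but declares no `READY_PROBE` is gated only by services N/N plus HEALTH_PATH, even though the heading names READY_PROBE as the substitute for abra's convergence monitor. Suggest stating either that `-c` requires a declared `READY_PROBE`, or that N/N + HEALTH_PATH alone is an accepted substitute, and saying what deadline applies when a recipe sets no `DEPLOY_TIMEOUT`. The Applied paragraph also treats `docker service scale …minio-createbuckets` as "NOT a bypass" while the Decision rules out `docker service scale` for patching a stack into health; a one-line criterion in the rules list (for example, that triggering a recipe-defined `replicas:0` one-shot is allowed) would make that exception checkable instead of decided case by case.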