---
name: swarm-updatestatus-convergence-gotchas
description: "N/N replicas is not convergence during stop-first rolling updates, and a 'paused' UpdateStatus persists forever — both bit cc-ci harness waits (builds 238/241)"
metadata:
  node_type: memory
  type: project
  originSessionId: 85355980-5e4f-4f90-b1ca-d0e4fe82f04b
---

Two docker-swarm facts that broke cc-ci convergence waits on 2026-06-09:

1. **N/N ≠ converged.** A service update (e.g. a chaos redeploy changing a db image) is *registered*
   immediately but may not have *started*: the OLD task still shows 1/1, then dies seconds later
   (stop-first). Build 238: backupbot exec'd a pre-hook into the just-killed db container → 409 →
   empty snapshot → RED. Convergence must also check the update state:
   `docker service inspect --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}' <service>`.
2. **`paused` persists forever.** Swarm's default `update-failure-action: pause` flips UpdateStatus
   to `paused` on ONE task flicker, and the state never clears (until the next update) even when the
   service recovers to N/N healthy. Build 241 hung for 22 min treating `paused` as in-flight. Only
   `updating` and `rollback_started` are active states worth waiting on.

Both are encoded in cc-ci `runner/harness/lifecycle.py::services_converged` (commits 68ef0f8 + e6d55b5).
Remember this when writing any NEW wait/health logic against swarm. Related: [[shared-recipe-checkout-race]]
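
For quick reference, here is a minimal sketch of how the two rules might combine into one check. It is not the actual `services_converged` code: the names `is_converged`, `read_service` and `ACTIVE_UPDATE_STATES` are hypothetical, and it assumes a `docker` CLI on PATH talking to a swarm manager.

```python
import subprocess

# States in which Swarm is still actively replacing tasks. `paused` is left out
# on purpose: it sticks until the next update, so waiting on it never returns.
ACTIVE_UPDATE_STATES = frozenset({"updating", "rollback_started"})


def is_converged(replicas: str, update_state: str) -> bool:
    """Hypothetical helper: True when replicas read N/N and no update is in flight.

    `replicas` is the REPLICAS column of `docker service ls`, e.g. "1/1".
    `update_state` is UpdateStatus.State, or "" if the service was never updated.
    """
    # Drop trailing annotations such as "(max 1 per node)".
    counts = replicas.split()[0] if replicas.strip() else ""
    running, sep, desired = counts.partition("/")
    if not sep or not running.isdigit() or not desired.isdigit():
        return False  # unparseable output is treated as "not converged"
    if int(running) != int(desired):
        return False
    # Rule 1: N/N alone is not enough, a registered update may not have started.
    # Rule 2: only the active states block; `paused` and `completed` do not.
    return update_state.strip() not in ACTIVE_UPDATE_STATES


def read_service(service: str) -> tuple[str, str]:
    """Ask the docker CLI for (replicas, update_state) of one service."""
    listing = subprocess.run(
        ["docker", "service", "ls", "--format", "{{.Name}} {{.Replicas}}"],
        check=True, capture_output=True, text=True,
    ).stdout
    replicas = ""
    for line in listing.splitlines():
        name, _, value = line.partition(" ")
        if name == service:  # exact match on the service name
            replicas = value
    state = subprocess.run(
        ["docker", "service", "inspect", "--format",
         "{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}", service],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return replicas, state
```

A wait loop would call `is_converged(*read_service(name))` until it returns True or a deadline passes. Keeping the decision in a pure function lets both gotchas be unit-tested without a running swarm.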