cc-ci-orchestrator/swarm-updatestatus-convergence-gotchas.md at 28ef7e44ab19feef26e81fd388363d64feec0027

recipe-maintainers/cc-ci-orchestrator

autonomic-bot 08706c665e memory: swarm UpdateStatus convergence gotchas (builds 238/241)

2026-06-09 23:14:18 +00:00

---
name: swarm-updatestatus-convergence-gotchas
description: N/N replicas is not convergence during stop-first rolling updates, and a 'paused' UpdateStatus persists forever; both bit cc-ci harness waits (builds 238/241)
metadata:
  node_type: memory
  type: project
  originSessionId: 85355980-5e4f-4f90-b1ca-d0e4fe82f04b
---

Two docker-swarm facts that broke cc-ci convergence waits on 2026-06-09:

- **N/N ≠ converged.** A service update (e.g. a chaos redeploy that changes a db image) is registered immediately but may not have started yet: the OLD task still shows 1/1, then dies seconds later (stop-first). Build 238: backupbot exec'd a pre-hook into the just-killed db container → 409 → empty snapshot → RED. Convergence must also check `docker service inspect --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}'`.
- **`paused` persists forever.** Swarm's default `update-failure-action: pause` flips UpdateStatus to `paused` on ONE task flicker, and the state is not cleared until the next update, even when the service recovers to N/N healthy. Build 241 hung for 22 min because the wait treated `paused` as in-flight. Only `updating` and `rollback_started` are active states worth waiting on.
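The two rules above can be sketched as a small predicate. This is an illustrative sketch only, not the actual `services_converged` implementation: the names `ACTIVE_UPDATE_STATES`, `read_update_state`, and `is_converged` are hypothetical, and the replica counts are assumed to be obtained elsewhere by the caller.

```python
import subprocess

# States in which swarm is still actively replacing tasks, per the note above.
# Everything else ("", "paused", "completed", ...) is treated as not in flight.
ACTIVE_UPDATE_STATES = frozenset({"updating", "rollback_started"})

# Go template from the note: prints the state, or nothing when the service
# has no UpdateStatus (it has never been updated).
STATE_FORMAT = "{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}"


def read_update_state(service: str) -> str:
    """Return UpdateStatus.State for a service, or "" if it has none."""
    result = subprocess.run(
        ["docker", "service", "inspect", "--format", STATE_FORMAT, service],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def is_converged(running: int, desired: int, update_state: str) -> bool:
    """N/N alone is not enough: an update must not be in flight either."""
    if update_state in ACTIVE_UPDATE_STATES:
        # The old task may still report N/N just before stop-first kills it.
        return False
    # "paused" is deliberately NOT treated as in flight, since it lingers
    # until the next update even after the service is healthy again.
    return running == desired
```

With this shape, a service that is 1/1 but `updating` is reported as not converged (the build 238 case), while a service that is 1/1 with a stale `paused` state is reported as converged (the build 241 case).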

Both are encoded in cc-ci `runner/harness/lifecycle.py::services_converged` (commits 68ef0f8 + e6d55b5). Keep them in mind when writing any NEW wait/health logic against swarm. Related: shared-recipe-checkout-race
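For new wait logic, one defensive pattern is to bound every poll loop with a deadline, so that a misclassified state fails fast instead of hanging a build. The sketch below is a generic helper under that assumption; `wait_for` and its parameters are hypothetical names, not part of the cc-ci harness.

```python
import time
from typing import Callable


def wait_for(
    predicate: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll predicate until it returns True or timeout_s elapses.

    Returns True on success and False on timeout. clock and sleep are
    injectable so the loop can be tested without real delays.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would pass a zero-argument convergence check as `predicate` and treat a `False` return as a hard failure with diagnostics, rather than continuing to wait.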
