From 08706c665eed451125d1e6c60444fac72b01be36 Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Tue, 9 Jun 2026 23:14:18 +0000
Subject: [PATCH] memory: swarm UpdateStatus convergence gotchas (builds 238/241)

---
 memory/MEMORY.md                              |  1 +
 .../swarm-updatestatus-convergence-gotchas.md | 23 +++++++++++++++++++
 2 files changed, 24 insertions(+)
 create mode 100644 memory/swarm-updatestatus-convergence-gotchas.md

diff --git a/memory/MEMORY.md b/memory/MEMORY.md
index 1c4cff3..afc92be 100644
--- a/memory/MEMORY.md
+++ b/memory/MEMORY.md
@@ -9,3 +9,4 @@
 - [immich pgvecto.rs DROP DATABASE panic](immich-pgvectors-drop-database-panic.md) — DROP DATABASE crashes immich's postgres image; use pg_dump --clean --if-exists + search_path rewrite
 - [Drone sqlite log extraction](drone-sqlite-log-extraction.md) — copy /data/database.sqlite from drone container, query builds→stages→steps→logs for full step output
 - [plausible upgrade-base trap](plausible-upgrade-base-trap.md) — CI REDs from published 3.0.0 base (no x86_64 arch → 404 → silent exit 1), not the PR; needs UPGRADE_BASE_VERSION=3.0.1+v2.0.0 in cc-ci tests
+- [Swarm UpdateStatus convergence gotchas](swarm-updatestatus-convergence-gotchas.md) — N/N is not converged mid stop-first update; paused flag persists forever; only updating/rollback_started are active
diff --git a/memory/swarm-updatestatus-convergence-gotchas.md b/memory/swarm-updatestatus-convergence-gotchas.md
new file mode 100644
index 0000000..9f00bc7
--- /dev/null
+++ b/memory/swarm-updatestatus-convergence-gotchas.md
@@ -0,0 +1,23 @@
+---
+name: swarm-updatestatus-convergence-gotchas
+description: "N/N replicas is not convergence during stop-first rolling updates, and a 'paused' UpdateStatus persists forever — both bit cc-ci harness waits (builds 238/241)"
+metadata:
+  node_type: memory
+  type: project
+  originSessionId: 85355980-5e4f-4f90-b1ca-d0e4fe82f04b
+---
+
+Two docker-swarm facts that broke cc-ci convergence waits on 2026-06-09:
+
+1. **N/N ≠ converged.** A service update (e.g. chaos redeploy changing a db image) is *registered*
+   immediately but may not have *started* — the OLD task still shows 1/1, then dies seconds later
+   (stop-first). Build 238: backupbot exec'd a pre-hook into the just-killed db container → 409 →
+   empty snapshot → RED. Convergence must also check `docker service inspect --format
+   '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}'`.
+2. **`paused` persists forever.** Swarm's default `update-failure-action: pause` flips UpdateStatus
+   to `paused` on ONE task flicker, and the flag never clears (until the next update) even when the
+   service recovers to N/N healthy. Build 241 hung 22min treating it as in-flight. Only `updating`
+   and `rollback_started` are active states worth waiting on.
+
+Both encoded in cc-ci `runner/harness/lifecycle.py::services_converged` (commits 68ef0f8 + e6d55b5).
+Remember when writing any NEW wait/health logic against swarm. Related: [[shared-recipe-checkout-race]]
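
A minimal sketch of the combined check the two gotchas above call for, in Python since the note points at `lifecycle.py`. This is not the actual cc-ci `services_converged` code: `update_state`, `is_converged`, and `ACTIVE_UPDATE_STATES` are hypothetical names, and the running/desired replica counts are assumed to be obtained elsewhere (for example by parsing the replicas column of `docker service ls`).

```python
import subprocess

# States in which swarm is still actively replacing tasks (per the note above).
ACTIVE_UPDATE_STATES = frozenset({"updating", "rollback_started"})

# Go template quoted in the note; prints nothing if the service has no UpdateStatus.
STATE_FORMAT = "{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}"


def update_state(service: str) -> str:
    """Return the service's UpdateStatus.State, or "" if it has none."""
    result = subprocess.run(
        ["docker", "service", "inspect", "--format", STATE_FORMAT, service],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def is_converged(running: int, desired: int, state: str) -> bool:
    """Hypothetical predicate: N/N replicas and no update or rollback in flight."""
    if state in ACTIVE_UPDATE_STATES:
        # N/N here may still be the OLD task that stop-first is about to kill.
        return False
    # Any other state ("paused", "completed", "") is treated as not in flight,
    # so a stale "paused" flag cannot stall the wait.
    return running == desired
```

Under these assumptions, `is_converged(1, 1, "updating")` is false even though replicas read 1/1 (the build 238 case), while `is_converged(1, 1, "paused")` is true, so a recovered service with a stale flag does not hang the wait (the build 241 case). A caller would pass `update_state(name)` as the third argument.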