fix(harness): a paused swarm update is settled — only active states block convergence
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing

68ef0f8 made services_converged() require UpdateStatus settled, treating
'paused' as in flight. But swarm's default update-failure-action pauses the
update on a single task flicker and the flag persists FOREVER (until the next
update): immich CI 241 had the app service 'paused' from a restart during
restore while the service was back at 1/1 and healthy — every subsequent wait
hung to its deadline and the run had to be killed.

Only 'updating' and 'rollback_started' now block convergence: those are the
states swarm is actively driving (the 238 stop-first race lives in 'updating').
'paused'/'rollback_paused' make no progress without intervention, so waiting on
them is pointless — N/N replicas is already required, and the HTTP-health and
tier assertions still gate whether the app actually works.

lint: PASS, unit tests: 138 passed.
This commit is contained in:
2026-06-09 23:07:36 +00:00
parent 68ef0f84fb
commit e6d55b53c7

View File

@ -384,7 +384,13 @@ def services_converged(domain: str) -> bool:
if proc.returncode != 0:
return False # a service vanished mid-check — not settled
for state in proc.stdout.split("\n"):
if state.strip() not in ("", "completed", "rollback_completed"):
# Only ACTIVE states block convergence. 'paused'/'rollback_paused' are terminal-without-
# intervention: swarm's default update-failure-action pauses the update on one task flicker
# and the flag then persists FOREVER (immich CI 241: app service 'paused' from a restart
# during restore, service back at 1/1 and healthy — the wait hung to its deadline). With
# N/N already required above, a paused update is settled for our purposes; the HTTP-health
# and tier assertions still gate whether the app actually works.
if state.strip() in ("updating", "rollback_started"):
return False
return True