fix(harness): services_converged — a replica deficit explained entirely by Complete tasks is converged (triggered one-shot, rcust M2 lasuite-drive root cause)
All checks were successful
continuous-integration/drone/push Build is passing
@@ -348,8 +348,27 @@ def services_converged(domain: str) -> bool:
         # `want == "0"` rejection wrongly treated those as never-converged, hanging the deploy
         # forever. `cur == want` (with `want` present) is the correct convergence test; a service
         # still spinning up shows e.g. "0/1" (cur != want) and is correctly not-yet-converged.
-        if not want or cur != want:
+        if not want:
             return False
+        if cur != want:
+            # A TRIGGERED one-shot (restart_policy none, scaled 0→1, runs once, exits 0) reports
+            # "0/1" FOREVER after its task completes — swarm never restarts it, so a bare
+            # `cur != want` rejection would block convergence for the rest of the run (lasuite-drive
+            # minio-createbuckets, rcust M2: install assert burned the full DEPLOY_TIMEOUT after the
+            # P2b port moved the bucket trigger BEFORE the install assert; pre-restructure the
+            # trigger ran after it, so converge never saw the 0/1). A replica deficit explained
+            # entirely by COMPLETE tasks IS converged: the one-shot did its job and will never run
+            # again. Anything else in the deficit (Running/Starting/Pending = still spinning up;
+            # Failed/Rejected = genuinely broken) stays not-converged, and a desired>0 service with
+            # no tasks yet is still scheduling.
+            tasks = subprocess.run(
+                ["docker", "service", "ps", name, "--format", "{{.CurrentState}}"],
+                capture_output=True,
+                text=True,
+            )
+            states = [ln.split()[0] for ln in tasks.stdout.split("\n") if ln.strip()]
+            if not (states and all(s == "Complete" for s in states)):
+                return False
     # N/N alone is NOT convergence during a stop-first rolling update: a chaos redeploy that changes
     # a non-app service image (e.g. immich's db pin) registers the update immediately, but swarm may
     # not have cycled that service's task yet — the OLD task still shows 1/1, then dies seconds later
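
Review note: the deficit rule added in this hunk can be pulled out as a pure function, which makes it checkable without a running swarm. The sketch below is an illustration only. It assumes the text passed in is the stdout of `docker service ps <name> --format "{{.CurrentState}}"` (one line per task, state as the first word), and the helper name `deficit_is_converged` is hypothetical, not something this commit defines.

```python
def deficit_is_converged(ps_output: str) -> bool:
    """Return True when every listed task's state is Complete.

    Hypothetical helper, not part of the commit. `ps_output` is assumed to be
    the stdout of `docker service ps <name> --format "{{.CurrentState}}"`,
    one line per task, e.g. "Complete 3 minutes ago".
    """
    # The first word of each non-blank line is the task state.
    states = [ln.split()[0] for ln in ps_output.splitlines() if ln.strip()]
    # No tasks at all means the service is still scheduling: not converged.
    return bool(states) and all(s == "Complete" for s in states)


if __name__ == "__main__":
    print(deficit_is_converged("Complete 3 minutes ago\n"))   # finished one-shot
    print(deficit_is_converged("Running 10 seconds ago\n"))   # still spinning up
    print(deficit_is_converged("Complete 1 hour ago\nFailed 2 minutes ago\n"))
    print(deficit_is_converged(""))                           # no tasks yet
```

Only the first call prints `True`. A mixed history (one Complete task plus a Failed one) and an empty task list both stay not-converged, matching the comment in the hunk.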