Root-caused (empirically, dockerd logs) the discourse/ghost deploy wedges: the shared proxy overlay (/24=254 VIPs) exhausts as concurrent stack rm leaks endpoints over many days -> tasks stuck in Swarm 'New'. Add a per-run safety net to Step 0 (network prune + docker restart when VIP-allocation failures are logged). Plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix, maintenance window) and for debugging/fixing the ghost PR afterward.
Runbook — fix proxy overlay VIP exhaustion (durable) + empirical verification
Owner: ORCHESTRATOR (host/swarm infra, not a recipe/test change). Execute after the current
weekly upgrade run finishes (the box must be quiescent — recreating proxy disrupts traefik
routing for every live service). Do NOT run mid-upgrade.
Root cause (empirically verified 2026-06-12, from dockerd logs)
- The shared proxy overlay network (ID was ab54…) is 10.0.1.0/24 = 254 VIPs. EVERY recipe deploy joins it (traefik routing).
- Under concurrent stack rm, Swarm's endpoint GC races ("Unable to complete atomic operation, key modified" / "network proxy remove failed") and leaks endpoints → leaks IPs (45 such errors over the day). dockerd had 11 days uptime accumulating leaks.
- The pool exhausted → 13× "could not find an available IP while allocating VIP" (first 22:53, straddling both wedges) → new services' tasks stuck in Swarm New state (never scheduled).
- The 02:50 docker restart rebuilt the allocator and reclaimed everything → healthy.
- This presents as a recipe FAILURE (discourse, ghost both "failed") but is purely infra.
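The error counts above can be re-derived from the daemon log. A minimal sketch, assuming the log is piped in on stdin and that the message substrings quoted in this runbook match the real log lines; tally_proxy_errors is a hypothetical helper name, not something that exists in the repo:

```sh
# Tally the two failure signatures from dockerd log text on stdin.
# The substrings are the ones quoted in this runbook (assumed to match verbatim).
tally_proxy_errors() (
  log=$(cat)
  # grep -c prints the count but exits 1 on zero matches, hence "|| true".
  vip=$(printf '%s\n' "$log" | grep -c -F 'could not find an available IP while allocating VIP' || true)
  race=$(printf '%s\n' "$log" | grep -c -F -e 'key modified' -e 'network proxy remove failed' || true)
  echo "vip_alloc_failures=$vip gc_race_errors=$race"
)
```

On the host this might be fed with something like `journalctl -u docker --since today --no-pager | tally_proxy_errors`; the unit name docker and the time window are assumptions.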
The fix (durable): enlarge the proxy subnet
nix/modules/swarm.nix:~43 creates it with no --subnet (defaults to a /24):
docker network create --driver overlay --attachable proxy
Change to a /16 (≈65,534 VIPs, ~258× headroom — the leak can't reach it before a routine
reboot/nixos-rebuild resets the allocator). Pick a block clear of ingress (10.0.0.0/24) and the
current proxy (10.0.1.0/24); the default-addr-pool is 10.0.0.0/8, so use e.g. 10.10.0.0/16:
docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
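The headroom figures and the "clear of ingress and the current proxy" check can be sanity-checked with plain shell arithmetic. This is a sketch only: the helper names (ip_to_int, usable_addrs, cidrs_overlap) are hypothetical, and the counts are upper bounds, since Docker may reserve a few addresses per network (for example the gateway).

```sh
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() (
  IFS=.
  set -- $1            # split on "." into $1..$4
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Usable host addresses for a prefix length (network + broadcast excluded).
usable_addrs() {
  echo $(( (1 << (32 - $1)) - 2 ))
}

# Succeeds (exit 0) if two CIDR blocks overlap.
cidrs_overlap() (
  ip1=${1%/*}; len1=${1#*/}
  ip2=${2%/*}; len2=${2#*/}
  # Compare both network addresses under the shorter (less specific) prefix.
  if [ "$len1" -lt "$len2" ]; then len=$len1; else len=$len2; fi
  mask=$(( (4294967295 << (32 - len)) & 4294967295 ))
  [ $(( $(ip_to_int "$ip1") & mask )) -eq $(( $(ip_to_int "$ip2") & mask )) ]
)
```

For example, `usable_addrs 24` gives 254 and `usable_addrs 16` gives 65534 (about 258 times as many), while `cidrs_overlap 10.10.0.0/16 10.0.1.0/24` fails, meaning the proposed block does not collide with the current proxy subnet.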
Procedure
1. Pre-req: weekly upgrade run done; docker stack ls shows only infra + warm-*.
2. EMPIRICAL BEFORE: measure the leak. Baseline the proxy endpoint/IP count, then deploy + concurrently rm N (~10) throwaway published-port stacks; re-count. Show endpoints/IPs do NOT return to baseline (leak), and grep dockerd for fresh "key modified" / "network proxy remove" errors. Record the per-cycle leak rate → projects the /24 exhaustion time.
3. Edit nix/modules/swarm.nix: add --subnet 10.10.0.0/16 to the proxy create (commit to the cc-ci repo; this is infra/nix, orchestrator-authored, push to git.autonomic.zone).
4. Recreate proxy on the host (DISRUPTIVE): the network can't be resized in place. Either nixos-rebuild after temporarily removing proxy, or manually: detach services / docker stack rm the live recipe stacks (none mid-upgrade), docker network rm proxy, recreate with the /16, then redeploy/reconcile traefik + the ccci-* control plane + warm-* so they rejoin. Verify traefik routing, drone, dashboard, bridge, reports all healthy.
5. nixos-rebuild switch so the /16 persists across reboots (sync /root/cc-ci first, per the host-deploy mechanism).
6. EMPIRICAL AFTER: prove it. Re-run step 2's reproduction: confirm (a) proxy now reports a /16 with vast headroom, (b) docker network prune -f reclaims the leaked per-stack overlays, (c) the leak no longer approaches exhaustion. Confirm a fresh recipe !testme deploys clean (no New-state hang).
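The "record the per-cycle leak rate → projects the exhaustion time" arithmetic from the EMPIRICAL BEFORE step could look like the sketch below. It only does the projection; how the endpoint/IP counts are obtained is left to the operator. project_exhaustion is a hypothetical helper, and the numbers in the usage note are made up for illustration, not measurements.

```sh
# Project how many more deploy/rm cycles fit before the pool is exhausted.
# Args: baseline count, count after the test, number of cycles run, pool capacity.
project_exhaustion() (
  baseline=$1; after=$2; cycles=$3; capacity=$4
  leaked=$(( after - baseline ))
  if [ "$leaked" -le 0 ]; then
    echo "no leak observed"
  else
    remaining=$(( capacity - after ))
    # Integer division: remaining addresses divided by leak-per-cycle.
    echo "leaked=$leaked over $cycles cycles; ~$(( remaining * cycles / leaked )) more cycles until exhaustion"
  fi
)
```

With illustrative inputs, `project_exhaustion 20 30 10 254` reports 10 leaked addresses over 10 cycles and roughly 224 cycles of headroom left; rerunning it with capacity 65534 after the change shows how far the ceiling moves.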
Acceptance
proxy is a /16 (pinned in swarm.nix, survives rebuild); reproduction shows the leak is bounded
far below the new ceiling; the upgrade Step-0 guard (prune + VIP-failure docker-restart, already
added to the skill 2026-06-12) remains as the per-run safety net. Then delete the
proxy-vip-exhaustion-runbook memory.
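For orientation, the decision made by the Step-0 guard might be sketched as follows. This is not the skill's actual implementation, which is not shown here; step0_decide is a hypothetical name, and the matched substring is the message quoted in the root-cause section.

```sh
# Decide the Step-0 action from dockerd log text on stdin:
# always prune; additionally restart docker if VIP allocation has failed.
step0_decide() {
  if grep -q -F 'could not find an available IP while allocating VIP'; then
    echo "prune+restart"
  else
    echo "prune"
  fi
}

# Possible wiring on the host (assumed unit name and log window; left
# commented out so the sketch has no side effects):
#   action=$(journalctl -u docker --since "1 day ago" --no-pager | step0_decide)
#   docker network prune -f
#   [ "$action" = "prune+restart" ] && systemctl restart docker
```

Keeping the decision separate from the side effects makes the guard easy to dry-run against a saved log before it is allowed to restart the daemon.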
Guardrails
- Maintenance window only (recreating proxy = brief routing outage for ALL services). Never during
a live upgrade or phase run. No secrets in commits. Author
autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>; push after commit.