Root-caused (empirically, from dockerd logs) the discourse/ghost deploy wedges: the shared proxy overlay (/24 = 254 VIPs) runs out of addresses as concurrent stack rm runs leak endpoints over many days, leaving tasks stuck in Swarm 'New'. Add a per-run safety net to Step 0 (network prune + docker restart when VIP-allocation failures are logged). Add plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix, in a maintenance window) and for debugging/fixing the ghost PR afterward.
| name | description | metadata |
|---|---|---|
| proxy-vip-exhaustion-runbook | TODO after the weekly upgrade: enlarge the proxy overlay subnet to /16 (it exhausts at /24 = 254 VIPs); runbook + empirical verify | |
Root cause found 2026-06-12 (empirically, from dockerd logs): recipe test deploys hung at 0/1 in
Swarm `New` state (which looked like discourse/ghost "failing") because the shared `proxy` overlay
network (10.0.1.0/24 = 254 VIPs, joined by every recipe deploy) had exhausted its IP pool.
Endpoints leaked by concurrent stack rm runs (a Swarm endpoint-GC race, logged as `key modified` / `network proxy remove failed`, 45×) accumulated over 11 days of dockerd uptime until VIP allocation failed with `could not find an available IP while allocating VIP` (13×). A docker restart rebuilds the allocator and reclaims the leaked addresses (proven).
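
The two counts above (45× and 13×) can be reproduced from a saved copy of the dockerd log. The sketch below is a minimal example, not the script that was actually used: the function names are hypothetical, and the match strings are taken from this note, so the exact wording in your dockerd version may differ.

```shell
# Count lines on stdin that contain a fixed string.
# grep -c prints 0 but exits 1 when nothing matches, hence "|| true".
count_matches() {
    grep -cF -- "$1" || true
}

# Hypothetical helper: summarise both log signatures from a saved log file.
# The match strings are assumptions based on the messages quoted in this note.
summarise_vip_exhaustion() {
    log_file=$1
    leaks=$(count_matches 'remove failed' < "$log_file")
    failures=$(count_matches 'could not find an available IP while allocating VIP' < "$log_file")
    echo "endpoint-removal failures: $leaks"
    echo "VIP-allocation failures: $failures"
}
```

One way to use it, assuming dockerd logs to the systemd journal under a unit named `docker`: save the log with `journalctl -u docker --no-pager > dockerd.log`, then run `summarise_vip_exhaustion dockerd.log`.
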
Per-run safety net (DONE 2026-06-12): upgrade-all Step 0 now runs `docker network prune -f` plus a
guard that restarts docker if recent VIP-allocation failures are in the journal.
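
The safety net described above could look roughly like the following. This is a sketch under stated assumptions, not the actual Step 0 code: the function names, the default look-back window of one hour, and the systemd unit name `docker` are all hypothetical, and the match string is the error quoted in this note.

```shell
# Error text quoted in this note; the exact wording may vary by dockerd version.
VIP_ERROR='could not find an available IP while allocating VIP'

# Exit 0 when the log text on stdin contains at least one VIP-allocation failure.
has_vip_failures() {
    grep -qF -- "$VIP_ERROR"
}

# Step 0 sketch: prune unused networks, then restart docker only if
# VIP-allocation failures were logged within the look-back window.
# The window default and the unit name "docker" are assumptions.
step0_safety_net() {
    since=${1:-'1 hour ago'}
    docker network prune -f
    if journalctl -u docker --since "$since" --no-pager | has_vip_failures; then
        echo "VIP-allocation failures since $since; restarting docker" >&2
        systemctl restart docker
    fi
}
```

Restarting only when the error is present keeps the common case cheap: a healthy run pays for the prune and one journal scan, and the disruptive restart happens only when the allocator is known to be failing.
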
TODO (durable fix, ORCHESTRATOR, in a maintenance window AFTER the current upgrade and when the box
is quiescent, because recreating `proxy` disrupts traefik routing): enlarge `proxy` to a /16. Edit
nix/modules/swarm.nix:~43, changing `docker network create --driver overlay --attachable proxy` to add
`--subnet 10.10.0.0/16`, recreate the `proxy` network on the host, run nixos-rebuild, and empirically
verify (reproduce the leak before and after). Full runbook: cc-ci-plan/plan-proxy-vip-exhaustion-fix.md.
Then debug the ghost PR (ghost-pr-debug). Delete this memory once `proxy` is /16 and verified.
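
For a sense of how much headroom the /16 buys, the pool sizes can be worked out from the prefix length alone. The helper below is an illustrative sketch (the function name is hypothetical); it counts usable host addresses, which is an upper bound on the VIP pool, since Docker may reserve a few addresses such as the gateway.

```shell
# Usable host addresses in an IPv4 subnet with the given prefix length:
# total addresses minus the network and broadcast addresses.
# Intended for prefix lengths 1 to 30.
usable_hosts() {
    echo $(( (1 << (32 - $1)) - 2 ))
}
```

`usable_hosts 24` gives 254, matching the current pool, and `usable_hosts 16` gives 65534, so the same leak rate that filled the /24 over 11 days would take far longer to exhaust the /16.
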