cc-ci-orchestrator/memory/proxy-vip-exhaustion-runbook.md
autonomic-bot ca02a0dd6f upgrade-all: proxy VIP-exhaustion guard in Step 0; runbooks for proxy /16 enlarge + ghost PR debug
Root-caused (empirically, dockerd logs) the discourse/ghost deploy wedges:
the shared proxy overlay (/24=254 VIPs) exhausts as concurrent stack rm leaks
endpoints over many days -> tasks stuck in Swarm 'New'. Add a per-run safety
net to Step 0 (network prune + docker restart when VIP-allocation failures are
logged). Plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix,
maintenance window) and for debugging/fixing the ghost PR afterward.
2026-06-12 03:30:00 +00:00


| Field | Value |
| --- | --- |
| name | proxy-vip-exhaustion-runbook |
| description | TODO after the weekly upgrade: enlarge the proxy overlay subnet to /16 (it exhausts at /24=254 VIPs); runbook + empirical verify |
| metadata.node_type | memory |
| metadata.type | project |
| metadata.originSessionId | 85355980-5e4f-4f90-b1ca-d0e4fe82f04b |

Root cause found 2026-06-12 (empirically, from dockerd logs): recipe test deploys hung at 0/1 in the Swarm `New` state, which looked like discourse/ghost "failing". The actual cause was that the shared `proxy` overlay network (`10.0.1.0/24` = 254 VIPs, joined by every recipe deploy) had exhausted its IP pool. Concurrent `stack rm` runs leaked endpoints through a Swarm endpoint-GC race (`key modified` / `network proxy remove failed`, logged 45×). The leaks accumulated over 11 days of dockerd uptime until VIP allocation failed with `could not find an available IP while allocating VIP` (logged 13×). Restarting docker rebuilds the allocator and reclaims the leaked addresses (proven).
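To re-check the counts quoted above, a minimal sketch that tallies the two log signatures from text on stdin. The function name `count_vip_signatures` is hypothetical, and the sketch assumes the quoted substrings appear verbatim in the dockerd log lines, which this note does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally the dockerd log signatures named in this note.
# Reads log text on stdin. ASSUMPTION: the substrings below appear verbatim
# in the log lines; adjust the patterns to match the real messages.
count_vip_signatures() {
  awk '
    # VIP allocation failure (the exhaustion symptom)
    /could not find an available IP while allocating VIP/ { vip++ }
    # Endpoint-GC race errors (the leak source); a line matching both
    # patterns is counted once
    /key modified/ || /network proxy remove failed/ { leak++ }
    END { printf "vip_alloc_failures=%d endpoint_gc_errors=%d\n", vip, leak }
  '
}
```

One possible invocation, assuming dockerd logs to the journal under a unit named `docker`: `journalctl -u docker --no-pager | count_vip_signatures`.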

Per-run safety net (DONE 2026-06-12): upgrade-all Step 0 now runs `docker network prune -f`, followed by a guard that restarts docker if recent VIP-allocation failures appear in the journal.
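The guard described above could look roughly like the sketch below. This is not the actual Step 0 code: the function names, the `VIP_GUARD_WINDOW` variable, the one-hour default window, and the `docker` systemd unit name are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if stdin contains a VIP-allocation failure.
vip_failure_logged() {
  grep -q 'could not find an available IP while allocating VIP'
}

# Hypothetical sketch of the Step 0 safety net. ASSUMPTIONS: dockerd logs to
# the journal under the unit "docker", and "recent" means the last hour
# unless VIP_GUARD_WINDOW says otherwise.
step0_vip_guard() {
  # Drop unused networks first.
  docker network prune -f

  # Restart docker only when the exhaustion signature was logged recently;
  # per this note, the restart rebuilds the allocator.
  if journalctl -u docker --since "${VIP_GUARD_WINDOW:-1 hour ago}" --no-pager \
      | vip_failure_logged; then
    systemctl restart docker
  fi
}
```

Keeping the log check in its own function means the detection pattern can be exercised against sample log text without touching docker.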

TODO (durable fix, ORCHESTRATOR): enlarge `proxy` to a /16. Do this in a maintenance window AFTER the current upgrade, when the box is quiescent, because recreating `proxy` disrupts traefik routing. Steps: edit `nix/modules/swarm.nix:~43` (`docker network create --driver overlay --attachable proxy` → add `--subnet 10.10.0.0/16`), recreate the `proxy` network on the host, run `nixos-rebuild`, and empirically verify (reproduce the leak before/after). Full runbook: `cc-ci-plan/plan-proxy-vip-exhaustion-fix.md`. Then debug the ghost PR (`ghost-pr-debug`). Delete this memory once `proxy` is /16 + verified.
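As a rough illustration of the host-side part of that TODO, the sketch below computes the address headroom gained and wraps the recreate and verify commands. The function names are hypothetical, the authoritative steps live in the runbook named above, and `docker network rm` is expected to fail while services are still attached to `proxy`, so this assumes they have been detached first.

```shell
#!/usr/bin/env bash
# Usable host addresses in an IPv4 subnet of the given prefix length
# (network and broadcast addresses excluded). Docker reserves a few more
# (for example the gateway), so the real VIP headroom is slightly lower.
subnet_capacity() {
  local prefix="$1"
  echo $(( (1 << (32 - prefix)) - 2 ))
}

# Hypothetical helper: print the subnet(s) currently configured on `proxy`.
proxy_subnet() {
  docker network inspect proxy --format '{{range .IPAM.Config}}{{.Subnet}} {{end}}'
}

# Hypothetical helper: recreate `proxy` with the /16 named in the TODO.
# ASSUMPTION: nothing is attached to the network when this runs.
recreate_proxy_16() {
  docker network rm proxy &&
    docker network create --driver overlay --attachable \
      --subnet 10.10.0.0/16 proxy
}
```

`subnet_capacity 24` prints 254 and `subnet_capacity 16` prints 65534. After a recreate, `proxy_subnet` should report `10.10.0.0/16`.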