Root-caused (empirically, dockerd logs) the discourse/ghost deploy wedges: the shared proxy overlay (/24=254 VIPs) exhausts as concurrent stack rm leaks endpoints over many days -> tasks stuck in Swarm 'New'. Add a per-run safety net to Step 0 (network prune + docker restart when VIP-allocation failures are logged). Plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix, maintenance window) and for debugging/fixing the ghost PR afterward.
26 lines
1.6 KiB
Markdown
---
name: proxy-vip-exhaustion-runbook
description: TODO after the weekly upgrade, enlarge the proxy overlay subnet to /16 (it exhausts at /24=254 VIPs); runbook + empirical verify
metadata:
  node_type: memory
  type: project
  originSessionId: 85355980-5e4f-4f90-b1ca-d0e4fe82f04b
---

**Root cause found 2026-06-12 (empirically, from dockerd logs):** recipe test deploys hung at 0/1 in
Swarm `New` state (which looked like discourse/ghost "failing") because the shared **`proxy` overlay
network** (`10.0.1.0/24` = 254 VIPs, joined by every recipe deploy) **exhausted its IP pool**.
Concurrent stack `rm` runs leaked endpoints (Swarm endpoint-GC race: `key modified` /
`network proxy remove failed`, 45×), and the leaks accumulated over 11 days of dockerd uptime →
`could not find an available IP while allocating VIP` (13×). Restarting the docker daemon rebuilds
the allocator and reclaims the leaked addresses (proven).
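The pool arithmetic behind the numbers above can be checked with Python's standard `ipaddress` module. This is only a sketch: the helper name `usable_hosts` is made up for illustration, and the count is an upper bound, since the gateway and task addresses draw from the same pool as the VIPs.

```python
import ipaddress


def usable_hosts(cidr: str) -> int:
    """Host addresses in an IPv4 subnet, excluding network and broadcast."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 2


current = ipaddress.ip_network("10.0.1.0/24")    # subnet named in this note
proposed = ipaddress.ip_network("10.10.0.0/16")  # subnet proposed in the TODO

print(usable_hosts("10.0.1.0/24"))    # 254
print(usable_hosts("10.10.0.0/16"))   # 65534
print(current.overlaps(proposed))     # False
```

The last line confirms that the proposed range does not collide with the existing one.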

**Per-run safety net (DONE 2026-06-12):** upgrade-all Step 0 now runs `docker network prune -f`,
followed by a guard that restarts the docker daemon if the journal contains recent VIP-allocation
failures.
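A rough sketch of what such a guard could look like, not the actual Step 0 code. The systemd unit name `docker`, the one-hour look-back window, and the function names are all assumptions; only the prune command and the log signature come from this note.

```python
import subprocess

# Log signature quoted in this note.
VIP_FAILURE = "could not find an available IP while allocating VIP"


def count_vip_failures(journal_text: str) -> int:
    """Count journal lines that contain the VIP-allocation failure message."""
    return sum(1 for line in journal_text.splitlines() if VIP_FAILURE in line)


def step0_safety_net(since: str = "1 hour ago", unit: str = "docker") -> bool:
    """Prune unused networks, then restart the daemon if recent VIP
    failures were logged. Returns True if a restart was triggered.

    `since` and `unit` are assumed defaults, not values from the real script.
    """
    subprocess.run(["docker", "network", "prune", "-f"], check=True)
    journal = subprocess.run(
        ["journalctl", "-u", unit, "--since", since, "--no-pager", "-o", "cat"],
        check=True, capture_output=True, text=True,
    ).stdout
    if count_vip_failures(journal) == 0:
        return False
    subprocess.run(["systemctl", "restart", unit], check=True)
    return True
```

Keeping the log check in a pure function makes it testable without a running daemon; `step0_safety_net()` needs root and a live docker host.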

**TODO (durable fix, ORCHESTRATOR, in a maintenance window AFTER the current upgrade and when the box
is quiescent, because recreating proxy disrupts traefik routing):** enlarge `proxy` to a /16. Edit
`nix/modules/swarm.nix:~43` (`docker network create --driver overlay --attachable proxy` → add
`--subnet 10.10.0.0/16`), recreate the proxy network on the host, run `nixos-rebuild`, and empirically
verify (reproduce the leak before/after). Full runbook: `cc-ci-plan/plan-proxy-vip-exhaustion-fix.md`.
Then debug the ghost PR ([[ghost-pr-debug]]). Delete this memory once proxy is /16 and verified.
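One way the "proxy is /16" part of the verification might be automated is to parse the JSON that `docker network inspect proxy` prints and check the prefix length. This is a sketch under assumptions: the function names are hypothetical, and it does not replace the leak reproduction described in the runbook.

```python
import ipaddress
import json


def overlay_subnets(inspect_json: str) -> list[str]:
    """Subnets listed in `docker network inspect <name>` output (a JSON array)."""
    networks = json.loads(inspect_json)
    return [
        cfg["Subnet"]
        for net in networks
        for cfg in (net.get("IPAM", {}).get("Config") or [])
    ]


def is_enlarged(inspect_json: str, max_prefixlen: int = 16) -> bool:
    """True if the network has at least one subnet and all are /16 or larger."""
    subnets = overlay_subnets(inspect_json)
    return bool(subnets) and all(
        ipaddress.ip_network(s).prefixlen <= max_prefixlen for s in subnets
    )


# Trimmed, hypothetical inspect output for illustration only.
after = '[{"Name": "proxy", "IPAM": {"Config": [{"Subnet": "10.10.0.0/16"}]}}]'
print(is_enlarged(after))  # True
```

On the host, the input would come from something like `subprocess.run(["docker", "network", "inspect", "proxy"], capture_output=True, text=True).stdout`.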