cc-ci-orchestrator/memory/proxy-vip-exhaustion-runbook.md
autonomic-bot ca02a0dd6f upgrade-all: proxy VIP-exhaustion guard in Step 0; runbooks for proxy /16 enlarge + ghost PR debug
Root-caused (empirically, dockerd logs) the discourse/ghost deploy wedges:
the shared proxy overlay (/24=254 VIPs) exhausts as concurrent stack rm leaks
endpoints over many days -> tasks stuck in Swarm 'New'. Add a per-run safety
net to Step 0 (network prune + docker restart when VIP-allocation failures are
logged). Plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix,
maintenance window) and for debugging/fixing the ghost PR afterward.
2026-06-12 03:30:00 +00:00

---
name: proxy-vip-exhaustion-runbook
description: TODO after the weekly upgrade — enlarge the proxy overlay subnet to /16 (it exhausts at /24=254 VIPs); runbook + empirical verify
metadata:
node_type: memory
type: project
originSessionId: 85355980-5e4f-4f90-b1ca-d0e4fe82f04b
---
**Root cause found 2026-06-12 (empirically, from dockerd logs):** recipe test deploys hung at 0/1 with
tasks stuck in Swarm `New` state (which looked like discourse/ghost "failing") because the shared
**`proxy` overlay network** (`10.0.1.0/24` = 254 VIPs, joined by every recipe deploy) **exhausted its
IP pool**. Concurrent stack `rm` leaked endpoints (Swarm endpoint-GC race: `key modified` / `network
proxy remove failed`, logged 45×), and the leaks accumulated over 11 days of dockerd uptime until VIP
allocation failed with `could not find an available IP while allocating VIP` (13×). Restarting the
Docker daemon rebuilds the allocator and reclaims the leaked VIPs (proven).
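
The size of the pool is simple subnet arithmetic. A minimal sketch using Python's standard `ipaddress` module, comparing the current `10.0.1.0/24` with the proposed `10.10.0.0/16`; the helper name `usable_addresses` is hypothetical, and Docker itself consumes some addresses (for example the gateway), so the real VIP ceiling is somewhat lower than these figures:

```python
import ipaddress


def usable_addresses(cidr: str) -> int:
    """Usable IPv4 addresses in a subnet (network and broadcast excluded)."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 2


current = usable_addresses("10.0.1.0/24")    # current proxy subnet from this note
proposed = usable_addresses("10.10.0.0/16")  # subnet proposed in the TODO
print(current, proposed, proposed // current)  # 254 65534 258
```

At the leak rate that filled a /24 in 11 days, a /16 gives roughly 258 times the headroom, which is why the enlargement is treated as the durable fix rather than the restart.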

**Per-run safety net (DONE 2026-06-12):** upgrade-all Step 0 now runs `docker network prune -f` plus a
guard that restarts docker if recent VIP-allocation failures appear in the journal.
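
The decision logic of such a guard can be sketched as a pure function over journal lines. This is an illustration only, not the actual Step 0 code: the function names, the `threshold` parameter, and the idea of feeding it a recent window of dockerd journal output are all assumptions. The only grounded detail is the error string, which is quoted from the dockerd logs above.

```python
from typing import Iterable

# Error string quoted from the dockerd logs in this note.
VIP_FAILURE = "could not find an available IP while allocating VIP"


def count_vip_failures(journal_lines: Iterable[str]) -> int:
    """Count journal lines that report a VIP-allocation failure."""
    return sum(1 for line in journal_lines if VIP_FAILURE in line)


def should_restart_docker(journal_lines: Iterable[str], threshold: int = 1) -> bool:
    """Return True when at least `threshold` failures appear (threshold is hypothetical)."""
    return count_vip_failures(journal_lines) >= threshold
```

Keeping the check separate from the restart itself makes it easy to dry-run against a saved journal before letting it touch the daemon.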

**TODO (durable fix, ORCHESTRATOR):** enlarge `proxy` to a /16. Do this in a maintenance window AFTER
the current upgrade, when the box is quiescent, because recreating `proxy` disrupts traefik routing.
Edit `nix/modules/swarm.nix:~43` (`docker network create --driver overlay --attachable proxy` → add
`--subnet 10.10.0.0/16`), recreate the proxy network on the host, run `nixos-rebuild`, and empirically
verify by reproducing the leak before and after. Full runbook:
`cc-ci-plan/plan-proxy-vip-exhaustion-fix.md`.
Then debug the ghost PR ([[ghost-pr-debug]]). Delete this memory once proxy is /16 + verified.
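
Before deleting this memory, the "proxy is /16" half of that condition can be checked mechanically. A minimal sketch, assuming the JSON printed by `docker network inspect proxy` is passed in as a string; the function names are hypothetical and the sample is hand-written in the shape of that output, not captured from the host. It confirms the configured subnet only and does not replace the empirical leak reproduction.

```python
import ipaddress
import json


def network_subnets(inspect_json: str) -> list[str]:
    """Extract IPAM subnets from `docker network inspect <name>` JSON output."""
    subnets = []
    for net in json.loads(inspect_json):
        ipam = net.get("IPAM") or {}
        for cfg in ipam.get("Config") or []:
            if "Subnet" in cfg:
                subnets.append(cfg["Subnet"])
    return subnets


def is_enlarged(inspect_json: str, max_prefixlen: int = 16) -> bool:
    """True if the network has subnets and each is at least as large as /max_prefixlen."""
    subnets = network_subnets(inspect_json)
    return bool(subnets) and all(
        ipaddress.ip_network(s).prefixlen <= max_prefixlen for s in subnets
    )


# Hand-written sample, not real host output.
sample = '[{"Name": "proxy", "IPAM": {"Config": [{"Subnet": "10.10.0.0/16", "Gateway": "10.10.0.1"}]}}]'
print(is_enlarged(sample))  # True
```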