cc-ci-orchestrator/cc-ci-plan/plan-proxy-vip-exhaustion-fix.md
autonomic-bot ca02a0dd6f upgrade-all: proxy VIP-exhaustion guard in Step 0; runbooks for proxy /16 enlarge + ghost PR debug
Root-caused (empirically, dockerd logs) the discourse/ghost deploy wedges:
the shared proxy overlay (/24=254 VIPs) exhausts as concurrent stack rm leaks
endpoints over many days -> tasks stuck in Swarm 'New'. Add a per-run safety
net to Step 0 (network prune + docker restart when VIP-allocation failures are
logged). Plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix,
maintenance window) and for debugging/fixing the ghost PR afterward.
2026-06-12 03:30:00 +00:00

Runbook — fix proxy overlay VIP exhaustion (durable) + empirical verification

Owner: ORCHESTRATOR (host/swarm infra, not a recipe/test change). Execute after the current weekly upgrade run finishes (the box must be quiescent — recreating proxy disrupts traefik routing for every live service). Do NOT run mid-upgrade.

Root cause (empirically verified 2026-06-12, from dockerd logs)

  • The shared proxy overlay network (ID was ab54…) is 10.0.1.0/24 = 254 VIPs. EVERY recipe deploy joins it (traefik routing).
  • Under concurrent stack rm, Swarm's endpoint GC races (logged as "Unable to complete atomic operation, key modified" and "network proxy remove failed") and leaks endpoints, each of which keeps its IP allocated. dockerd logged 45 such errors that day and had been accumulating leaks over 11 days of uptime.
  • The pool then ran out: dockerd logged "could not find an available IP while allocating VIP" 13 times (first at 22:53, spanning both wedges), and new services' tasks stayed in Swarm's New state (never scheduled).
  • The 02:50 docker restart rebuilt the allocator and reclaimed everything → healthy.
  • This presents as a recipe FAILURE (discourse, ghost both "failed") but is purely infra.
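The two log signatures above can be counted with a couple of small helpers. This is a minimal sketch, not the tooling used for the original diagnosis: the function names are made up for illustration, and the usage comment assumes dockerd logs to the systemd journal under the unit name docker, which this runbook does not state.

```shell
# Count VIP-allocation failures in dockerd log text read from stdin.
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
count_vip_failures() {
  grep -c 'could not find an available IP while allocating VIP' || true
}

# Count log lines carrying either of the endpoint-GC race signatures.
count_gc_races() {
  grep -c -E 'key modified|network proxy remove failed' || true
}

# Example use on the host (assumption: journal unit is named "docker"):
#   journalctl -u docker --since '24 hours ago' --no-pager > /tmp/dockerd.log
#   count_vip_failures < /tmp/dockerd.log
#   count_gc_races     < /tmp/dockerd.log
```

A non-zero VIP-failure count alongside tasks stuck in New points at the network rather than the recipe.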

The fix (durable): enlarge the proxy subnet

nix/modules/swarm.nix (around line 43) creates the network with no --subnet, so it defaults to a /24:

docker network create --driver overlay --attachable proxy

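Change it to a /16 (65,534 usable addresses, about 258× the current capacity; at that size the leak should not reach the ceiling before a routine reboot or nixos-rebuild resets the allocator). Pick a block clear of ingress (10.0.0.0/24) and the current proxy (10.0.1.0/24). The default-addr-pool is 10.0.0.0/8, from which auto-created networks are carved, so choose a block well away from the ones already allocated, e.g. 10.10.0.0/16: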

docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
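The capacity figures quoted for the /24 and the /16 can be checked with shell arithmetic. The helper name usable_hosts is illustrative only. It counts usable host addresses; Docker typically also reserves a gateway address per network, so the number of assignable VIPs is slightly lower than these figures.

```shell
# Usable host addresses in an IPv4 subnet of the given prefix length:
# total addresses minus the network and broadcast addresses.
usable_hosts() {
  echo $(( (1 << (32 - $1)) - 2 ))
}

usable_hosts 24                                        # prints 254
usable_hosts 16                                        # prints 65534
echo $(( $(usable_hosts 16) / $(usable_hosts 24) ))    # prints 258
```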

Procedure

  1. Pre-req: weekly upgrade run done; docker stack ls shows only infra + warm-*.
  2. EMPIRICAL BEFORE: measure the leak. Take a baseline count of proxy endpoints/IPs, deploy N (~10) throwaway published-port stacks, docker stack rm them concurrently, then re-count. Show that the endpoint/IP counts do NOT return to baseline (the leak), and grep the dockerd logs for fresh "key modified" / "network proxy remove" errors. Record the per-cycle leak rate, which projects the time to /24 exhaustion.
  3. Edit nix/modules/swarm.nix — add --subnet 10.10.0.0/16 to the proxy create (commit to the cc-ci repo; this is infra/nix, orchestrator-authored, push to git.autonomic.zone).
  4. Recreate proxy on the host (DISRUPTIVE): the network can't be resized in place. Either run nixos-rebuild after temporarily removing proxy, or do it manually: detach services or docker stack rm the live recipe stacks (none may be mid-upgrade), docker network rm proxy, recreate it with the /16, then redeploy/reconcile traefik, the ccci-* control plane, and warm-* so they rejoin. Verify that traefik routing, drone, dashboard, bridge, and reports are all healthy.
  5. nixos-rebuild switch so the /16 persists across reboots (sync /root/cc-ci first, per the host-deploy mechanism).
  6. EMPIRICAL AFTER: prove it. Re-run step 2's reproduction and confirm that (a) proxy now reports a /16 with vast headroom, (b) docker network prune -f reclaims the leaked per-stack overlays, and (c) the leaked count stays far below the new ceiling. Confirm that a fresh recipe !testme deploys cleanly (no New-state hang).
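The projection in step 2 (per-cycle leak rate to exhaustion time) is plain integer arithmetic. The sketch below assumes the endpoint/IP counts were already measured by hand, for instance from docker network inspect --verbose proxy on a manager. The function names are hypothetical, and nothing here calls docker.

```shell
# Leaked addresses per deploy/rm cycle (integer division, rounds down).
leak_per_cycle() {   # args: baseline_count after_count cycles
  echo $(( ($2 - $1) / $3 ))
}

# Whole cycles until the pool is full, given current usage and leak rate.
cycles_to_exhaustion() {   # args: pool_size used leak_per_cycle
  if [ "$3" -le 0 ]; then
    echo "no leak measured"
    return 0
  fi
  echo $(( ($1 - $2) / $3 ))
}

# Usage with placeholder variables (fill in the measured values):
#   rate=$(leak_per_cycle "$baseline" "$after" "$cycles")
#   cycles_to_exhaustion 254   "$after" "$rate"   # current /24
#   cycles_to_exhaustion 65534 "$after" "$rate"   # proposed /16
```

Running the same projection against both pool sizes gives the before/after comparison that steps 2 and 6 ask for.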

Acceptance

proxy is a /16 (pinned in swarm.nix, survives rebuild); reproduction shows the leak is bounded far below the new ceiling; the upgrade Step-0 guard (prune + VIP-failure docker-restart, already added to the skill 2026-06-12) remains as the per-run safety net. Then delete the proxy-vip-exhaustion-runbook memory.
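The first acceptance criterion can be checked from the CLI. A minimal sketch, assuming it runs on a Swarm manager where the proxy network is visible; the helper names network_subnet and has_prefix are illustrative, not part of the cc-ci repo.

```shell
# Print the subnet(s) configured on a Docker network (default: proxy).
network_subnet() {
  docker network inspect -f '{{range .IPAM.Config}}{{.Subnet}}{{end}}' "${1:-proxy}"
}

# Succeed only if the given CIDR has the expected prefix length.
has_prefix() {   # args: cidr prefix_length
  [ "${1##*/}" = "$2" ]
}

# Example use on the host (not run here):
#   has_prefix "$(network_subnet proxy)" 16 && echo "proxy is a /16"
```

Running the same check again after a nixos-rebuild switch covers the "survives rebuild" part of the criterion.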

Guardrails

  • Maintenance window only (recreating proxy = brief routing outage for ALL services). Never during a live upgrade or phase run.
  • No secrets in commits.
  • Author autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>; push after commit.
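The quiescence requirement (only infra and warm-* stacks present) can be turned into a pre-flight check. This is a sketch under stated assumptions: the helper name unexpected_stacks is invented, and the allowlist patterns in the usage comment are guesses based on the names mentioned in step 4; the real infra stack names are not listed in this runbook, so substitute your own.

```shell
# Read stack names from stdin (one per line) and print those that match
# none of the shell glob patterns given as arguments.
unexpected_stacks() {   # args: one or more glob patterns
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    allowed=no
    for pat in "$@"; do
      case "$name" in
        $pat) allowed=yes ;;   # unquoted on purpose: treat $pat as a glob
      esac
    done
    [ "$allowed" = yes ] || echo "$name"
  done
}

# Example use on the host (patterns are assumptions, adjust to the real names):
#   docker stack ls --format '{{.Name}}' | unexpected_stacks 'traefik' 'ccci-*' 'warm-*'
```

Any output means a non-infra stack is still deployed and the window is not yet safe.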