Files
cc-ci-orchestrator/cc-ci-plan/plan-proxy-vip-exhaustion-fix.md
autonomic-bot ca02a0dd6f upgrade-all: proxy VIP-exhaustion guard in Step 0; runbooks for proxy /16 enlarge + ghost PR debug
Root-caused (empirically, dockerd logs) the discourse/ghost deploy wedges:
the shared proxy overlay (/24=254 VIPs) exhausts as concurrent stack rm leaks
endpoints over many days -> tasks stuck in Swarm 'New'. Add a per-run safety
net to Step 0 (network prune + docker restart when VIP-allocation failures are
logged). Plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix,
maintenance window) and for debugging/fixing the ghost PR afterward.
2026-06-12 03:30:00 +00:00

60 lines
3.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Runbook — fix `proxy` overlay VIP exhaustion (durable) + empirical verification
**Owner: ORCHESTRATOR** (host/swarm infra, not a recipe/test change). Execute **after the current
weekly upgrade run finishes** (the box must be quiescent — recreating `proxy` disrupts traefik
routing for every live service). Do NOT run mid-upgrade.
## Root cause (empirically verified 2026-06-12, from dockerd logs)
- The shared **`proxy` overlay network** (ID was `ab54…`) is **`10.0.1.0/24` = 254 VIPs**. EVERY
recipe deploy joins it (traefik routing).
- Under concurrent stack `rm`, Swarm's endpoint GC races (`Unable to complete atomic operation,
key modified` / `network proxy remove failed`) and **leaks endpoints → leaks IPs** (45 such
errors over the day). `dockerd` had **11 days** uptime accumulating leaks.
- The pool exhausted → **13× `could not find an available IP while allocating VIP`** (first 22:53,
straddling both wedges) → new services' tasks stuck in Swarm **`New`** state (never scheduled).
- The 02:50 docker restart rebuilt the allocator and reclaimed everything → healthy.
- This presents as a recipe FAILURE (discourse, ghost both "failed") but is purely infra.
## The fix (durable): enlarge the `proxy` subnet
`nix/modules/swarm.nix:~43` creates it with no `--subnet` (defaults to a /24):
```
docker network create --driver overlay --attachable proxy
```
Change to a **/16** (≈65,534 VIPs, ~258× headroom — the leak can't reach it before a routine
reboot/`nixos-rebuild` resets the allocator). Pick a block clear of `ingress` (10.0.0.0/24) and the
current proxy (10.0.1.0/24); the default-addr-pool is 10.0.0.0/8, so use e.g. **`10.10.0.0/16`**:
```
docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
## Procedure
1. **Pre-req:** weekly upgrade run done; `docker stack ls` shows only infra + `warm-*`.
2. **EMPIRICAL BEFORE — measure the leak.** Baseline `proxy` endpoint/IP count, then deploy +
*concurrently* `rm` N (~10) throwaway published-port stacks; re-count. Show endpoints/IPs do NOT
return to baseline (leak), and grep dockerd for fresh `key modified`/`network proxy remove`
errors. Record the per-cycle leak rate → projects the /24 exhaustion time.
3. **Edit `nix/modules/swarm.nix`** — add `--subnet 10.10.0.0/16` to the proxy create (commit to
the cc-ci repo; this is infra/nix, orchestrator-authored, push to git.autonomic.zone).
4. **Recreate `proxy` on the host (DISRUPTIVE):** the network can't be resized in place. Either
`nixos-rebuild` after temporarily removing proxy, or manually: detach services / `docker stack
rm` the live recipe stacks (none mid-upgrade), `docker network rm proxy`, recreate with the /16,
then redeploy/reconcile traefik + the `ccci-*` control plane + `warm-*` so they rejoin. Verify
traefik routing, drone, dashboard, bridge, reports all healthy.
5. **`nixos-rebuild switch`** so the /16 persists across reboots (sync `/root/cc-ci` first, per the
host-deploy mechanism).
6. **EMPIRICAL AFTER — prove it.** Re-run step 2's reproduction: confirm (a) `proxy` now reports a
/16 with vast headroom, (b) `docker network prune -f` reclaims the leaked per-stack overlays,
(c) the leak no longer approaches exhaustion. Confirm a fresh recipe `!testme` deploys clean (no
`New`-state hang).
## Acceptance
`proxy` is a /16 (pinned in swarm.nix, survives rebuild); reproduction shows the leak is bounded
far below the new ceiling; the upgrade Step-0 guard (prune + VIP-failure docker-restart, already
added to the skill 2026-06-12) remains as the per-run safety net. Then delete the
[[proxy-vip-exhaustion-runbook]] memory.
## Guardrails
- Maintenance window only (recreating proxy = brief routing outage for ALL services). Never during
a live upgrade or phase run. No secrets in commits. Author `autonomic-bot
<autonomic-bot@noreply.git.autonomic.zone>`; push after commit.