Root-caused (empirically, dockerd logs) the discourse/ghost deploy wedges: the shared proxy overlay (/24=254 VIPs) exhausts as concurrent stack rm leaks endpoints over many days -> tasks stuck in Swarm 'New'. Add a per-run safety net to Step 0 (network prune + docker restart when VIP-allocation failures are logged). Plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix, maintenance window) and for debugging/fixing the ghost PR afterward.
# Runbook: fix `proxy` overlay VIP exhaustion (durable) + empirical verification

**Owner: ORCHESTRATOR** (host/swarm infra, not a recipe/test change). Execute **after the current
weekly upgrade run finishes**: the box must be quiescent, because recreating `proxy` disrupts
traefik routing for every live service. Do NOT run mid-upgrade.
## Root cause (empirically verified 2026-06-12, from dockerd logs)
- The shared **`proxy` overlay network** (ID was `ab54…`) is **`10.0.1.0/24` = 254 VIPs**. EVERY
  recipe deploy joins it (traefik routing).
- Under concurrent stack `rm`, Swarm's endpoint GC races
  (`Unable to complete atomic operation, key modified` / `network proxy remove failed`) and
  **leaks endpoints → leaks IPs** (45 such errors over the day). `dockerd` had **11 days** of
  uptime in which to accumulate leaks.
- The pool was exhausted → **13× `could not find an available IP while allocating VIP`** (first at
  22:53, straddling both wedges) → new services' tasks stuck in Swarm **`New`** state (never
  scheduled).
- The 02:50 docker restart rebuilt the allocator and reclaimed everything → healthy.
- This presents as a recipe FAILURE (discourse and ghost both "failed") but is purely infra.
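The three log signatures quoted above can be tallied from a saved log capture. This is a minimal sketch, not the method used for the original investigation: `count_signatures` is a hypothetical helper, and the `journalctl` invocation in the trailing comment assumes dockerd logs to journald under a unit named `docker`.

```sh
#!/bin/sh
# Hypothetical helper: tally the log signatures named in the root-cause notes.
# Reads dockerd log text on stdin; prints one "label count" pair per line.
count_signatures() {
    log=$(cat)
    for pair in \
        "gc_race:Unable to complete atomic operation, key modified" \
        "remove_failed:network proxy remove failed" \
        "vip_exhausted:could not find an available IP while allocating VIP"
    do
        label=${pair%%:*}      # text before the first colon
        pattern=${pair#*:}     # text after the first colon
        # grep -c prints the number of matching lines; -F matches the pattern literally.
        # "|| true" keeps a zero count from being treated as a failure.
        n=$(printf '%s\n' "$log" | grep -cF -- "$pattern" || true)
        printf '%s %s\n' "$label" "$n"
    done
}

# Assumed invocation (unit name and time window are guesses for this host):
#   journalctl -u docker --since "24 hours ago" --no-pager | count_signatures
```

Reading from stdin rather than calling `journalctl` directly keeps the helper usable on a copied log file from another machine.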
## The fix (durable): enlarge the `proxy` subnet
`nix/modules/swarm.nix:~43` creates it with no `--subnet` (defaults to a /24):
```
docker network create --driver overlay --attachable proxy
```
Change to a **/16** (≈65,534 VIPs, ~258× headroom; the leak can't reach it before a routine
reboot/`nixos-rebuild` resets the allocator). Pick a block clear of `ingress` (10.0.0.0/24) and the
current proxy (10.0.1.0/24); the default-addr-pool is 10.0.0.0/8, so use e.g. **`10.10.0.0/16`**:
```
docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
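The 254 and ≈65,534 figures, and the ~258× ratio between them, follow from the prefix lengths. A small sketch of that arithmetic is below; `usable_hosts` is a hypothetical helper, and it counts plain usable IPv4 host addresses without subtracting any addresses the overlay itself may reserve.

```sh
#!/bin/sh
# usable_hosts PREFIX: usable IPv4 addresses in a subnet of the given prefix
# length, i.e. 2^(32 - PREFIX) minus the network and broadcast addresses.
usable_hosts() {
    echo $(( (1 << (32 - $1)) - 2 ))
}

old=$(usable_hosts 24)   # current proxy subnet
new=$(usable_hosts 16)   # proposed proxy subnet
# Integer division, so the ratio is rounded down.
echo "/24: $old  /16: $new  ratio: $(( new / old ))x"
# prints "/24: 254  /16: 65534  ratio: 258x"
```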
## Procedure
1. **Pre-req:** weekly upgrade run done; `docker stack ls` shows only infra + `warm-*`.
2. **EMPIRICAL BEFORE: measure the leak.** Baseline the `proxy` endpoint/IP count, then deploy
   N (~10) throwaway published-port stacks and `rm` them *concurrently*; re-count. Show that
   endpoints/IPs do NOT return to baseline (leak), and grep dockerd for fresh
   `key modified`/`network proxy remove` errors. Record the per-cycle leak rate, which projects
   the /24 exhaustion time.
3. **Edit `nix/modules/swarm.nix`:** add `--subnet 10.10.0.0/16` to the proxy create (commit to
   the cc-ci repo; this is infra/nix, orchestrator-authored, push to git.autonomic.zone).
4. **Recreate `proxy` on the host (DISRUPTIVE):** the network can't be resized in place. Either
   `nixos-rebuild` after temporarily removing proxy, or manually: detach services /
   `docker stack rm` the live recipe stacks (none mid-upgrade), `docker network rm proxy`,
   recreate with the /16, then redeploy/reconcile traefik + the `ccci-*` control plane + `warm-*`
   so they rejoin. Verify traefik routing, drone, dashboard, bridge, reports all healthy.
5. **`nixos-rebuild switch`** so the /16 persists across reboots (sync `/root/cc-ci` first, per the
   host-deploy mechanism).
6. **EMPIRICAL AFTER: prove it.** Re-run step 2's reproduction: confirm (a) `proxy` now reports a
   /16 with vast headroom, (b) `docker network prune -f` reclaims the leaked per-stack overlays,
   (c) the leak no longer approaches exhaustion. Confirm a fresh recipe `!testme` deploys clean (no
   `New`-state hang).
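Step 2's projection from a per-cycle leak rate to an exhaustion time can be sketched as below. `cycles_to_exhaustion` is a hypothetical helper; how the before/after counts are collected is left open here, and the numbers in the usage line are made up for illustration, not measurements.

```sh
#!/bin/sh
# Hypothetical helper: estimate how many more deploy/rm cycles fit before the
# pool runs out, given counts taken before and after a reproduction run.
#   $1 = allocated IPs at baseline   $2 = allocated IPs after the run
#   $3 = cycles in the run           $4 = pool capacity (254 for a /24)
cycles_to_exhaustion() {
    leaked=$(( $2 - $1 ))
    if [ "$leaked" -le 0 ]; then
        echo "no-leak"
        return 0
    fi
    # remaining / (leaked / cycles), kept in integer maths by multiplying
    # first so a fractional per-cycle rate is not rounded down to zero.
    echo $(( ($4 - $2) * $3 / leaked ))
}

# Illustrative numbers only: 40 IPs before, 45 after 10 cycles, /24 pool.
cycles_to_exhaustion 40 45 10 254   # prints 418
```

Running the same helper with the /16 capacity (65534) after step 6 gives a like-for-like comparison of headroom.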
## Acceptance
`proxy` is a /16 (pinned in swarm.nix, survives rebuild); reproduction shows the leak is bounded
far below the new ceiling; the upgrade Step-0 guard (prune + VIP-failure docker-restart, already
added to the skill 2026-06-12) remains as the per-run safety net. Then delete the
[[proxy-vip-exhaustion-runbook]] memory.
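The first acceptance criterion can be checked mechanically. The sketch below is one possible check, not part of the existing Step-0 guard: `has_prefix` is a hypothetical helper, and the commented usage assumes it runs on a swarm manager where the `proxy` network is visible.

```sh
#!/bin/sh
# Hypothetical check: succeed only if the CIDR in $1 has the expected prefix
# length given in $2 (defaults to 16).
has_prefix() {
    # ${1##*/} strips everything up to the last "/", leaving the prefix length.
    [ "${1##*/}" = "${2:-16}" ]
}

# Assumed usage; the --format template reads the subnet from the network's IPAM config:
#   subnet=$(docker network inspect proxy --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}')
#   has_prefix "$subnet" && echo "proxy is a /16: $subnet"
```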
## Guardrails
- Maintenance window only (recreating proxy = brief routing outage for ALL services). Never during
  a live upgrade or phase run.
- No secrets in commits.
- Author `autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>`; push after commit.