# Phase `pvfix`: durable Swarm `proxy` overlay VIP exhaustion fix

**Mission:** eliminate the recurring Docker Swarm `proxy` overlay VIP exhaustion class by making the shared `proxy` network large enough for the cc-ci workload, while preserving the already-added per-run safety net.

This is an infra phase: coordinate carefully, because recreating `proxy` briefly disrupts routing for Traefik, Drone, dashboard, bridge, reports, and any live recipe deploys.

State files live under `machine-docs/`: `STATUS-pvfix.md`, `BACKLOG-pvfix.md`, `REVIEW-pvfix.md`, `JOURNAL-pvfix.md`.

## Context

The 2026-06-12 weekly upgrade exposed a real infra failure mode:

- The shared `proxy` overlay was using Docker's default `/24` allocation (`10.0.1.0/24`, 254 VIPs).
- Every recipe deploy joins `proxy` for Traefik routing.
- Concurrent stack removal can race Swarm endpoint GC (`key modified`, `network proxy remove failed`) and leak endpoint/VIP allocations.
- After 11 days of dockerd uptime the allocator exhausted the `/24`, producing `could not find an available IP while allocating VIP` and leaving tasks stuck in Swarm `New` state.
- A Docker restart rebuilt allocator state and cleared the symptom, proving the issue was infra, not the affected recipes.

Existing runbook/background: `/srv/cc-ci/cc-ci-plan/plan-proxy-vip-exhaustion-fix.md`.

## Required Fix

1. Confirm the current host state is quiet enough for a disruptive network maintenance window. No live `/upgrade-all`, no active recipe `!testme` runs, no phase CI sweep in progress.
2. Update `nix/modules/swarm.nix` in the cc-ci repo so the `proxy` overlay is created with an explicit `/16`, for example:

   ```bash
   docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
   ```

   Use a subnet clear of `ingress` and existing Docker allocations. If `10.10.0.0/16` is unsuitable on the live host, choose a different documented `/16` and explain why.
3.
   Keep the upgrade Step-0 safety net in place: prune leaked overlays and restart Docker when VIP-allocation failure signatures are detected. The durable `/16` fix is headroom; the guard is still useful as a future self-healing belt-and-braces mechanism.
4. Recreate the live `proxy` network safely. The network cannot be resized in place. Plan the exact live-host steps before executing them. The expected sequence is:
   - capture current `proxy` inspect output and joined services
   - stop or drain live recipe stacks as needed
   - remove/recreate `proxy` with the `/16`
   - redeploy/reconcile Traefik and the cc-ci control-plane services so they rejoin
   - run `nixos-rebuild switch` using the canonical live cc-ci deploy checkout
5. Commit and push the cc-ci repo change. Do not commit secrets. Do not merge recipe PRs.

## Gates

**M1: Plan and patch ready.** Builder produces the minimal `swarm.nix` patch, records the exact maintenance procedure, and proves from live inspection that the chosen `/16` is safe. Adversary cold-reviews the patch and live procedure before any disruptive action.

**M2: Live durable fix applied.** The live host has `proxy` recreated as `/16`, the NixOS configuration has been switched, and Traefik/Drone/dashboard/bridge/reports are reachable. Adversary verifies from the host that `docker network inspect proxy` reports the intended subnet and that the control-plane services are healthy.

## Guardrails

- Maintenance window only. Do not recreate `proxy` while recipe CI, `/upgrade-all`, or `cfold` sweep runs are active.
- No force-pushes. No secret values in logs, plans, commits, or comments.
- Prefer the smallest host change: one explicit `--subnet` plus the minimum live reconciliation needed to restore routing.
- If the host topology differs from the runbook, stop and record the actual state before changing anything.
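
M1 asks for proof from live inspection that the chosen `/16` is clear of `ingress` and existing Docker allocations. One way to make that check repeatable is sketched below. All function names (`ip_to_int`, `cidr_range`, `cidr_overlaps`, `find_conflicts`, `usable_hosts`) are hypothetical helpers, not part of cc-ci, and the sketch handles IPv4 only. `usable_hosts` gives an upper bound on addresses per prefix (`/24` gives 2^8 - 2 = 254, `/16` gives 2^16 - 2 = 65534); Docker may reserve a few of these itself, so treat the figure as headroom, not an exact VIP count.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for sanity-checking a candidate overlay subnet (IPv4 only).

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1            # split the address on dots into $1..$4
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Print "first last" integer bounds of a CIDR block such as 10.10.0.0/16.
cidr_range() {
  local base prefix size first
  base=$(ip_to_int "${1%/*}")
  prefix=${1#*/}
  size=$(( 1 << (32 - prefix) ))
  first=$(( base / size * size ))   # align down to the block boundary
  echo "$first $(( first + size - 1 ))"
}

# Succeed (exit 0) when the two CIDR blocks share at least one address.
cidr_overlaps() {
  set -- $(cidr_range "$1") $(cidr_range "$2")   # firstA lastA firstB lastB
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

# Read one subnet per line on stdin and report any that collide with the
# candidate. Exit status is 0 only when nothing collides.
find_conflicts() {
  local candidate=$1 status=0 subnet
  while read -r subnet; do
    [ -n "$subnet" ] || continue
    case $subnet in *:*) continue ;; esac        # skip IPv6 entries
    if cidr_overlaps "$candidate" "$subnet"; then
      echo "conflict: $subnet"
      status=1
    fi
  done
  return $status
}

# Upper bound on host addresses for a prefix length (network and broadcast excluded).
usable_hosts() {
  echo $(( (1 << (32 - $1)) - 2 ))
}

# Possible use on the host (assumes the Docker CLI is available):
#   docker network inspect $(docker network ls -q) \
#     --format '{{range .IPAM.Config}}{{println .Subnet}}{{end}}' \
#     | find_conflicts 10.10.0.0/16
```

This only compares against subnets Docker itself reports. Host routes, VPN ranges, or other non-Docker allocations would need to be fed into `find_conflicts` separately before declaring the candidate safe.
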

## Definition of Done

`proxy` is explicitly configured and live as a `/16`, the change is committed and pushed to cc-ci, core routes are healthy after the maintenance action, and Adversary has signed off on M1 and M2 in `machine-docs/REVIEW-pvfix.md`. Builder writes `## DONE` only after both gates have fresh Adversary PASSes.
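
The M2 subnet check (`docker network inspect proxy` reports the intended subnet) can be scripted so Builder and Adversary run the same comparison. The following is a minimal sketch. `verify_subnet` is a hypothetical helper name, and `10.10.0.0/16` is only the example subnet from Required Fix step 2. It does not cover the control-plane health checks, which still need to be verified separately.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare the live subnet of a Docker network with the
# expected CIDR. Usage: verify_subnet <network> <expected-cidr>
# Returns 0 on match, 1 on mismatch, 2 if the network cannot be inspected.
verify_subnet() {
  local actual
  actual=$(docker network inspect "$1" \
    --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}') || return 2
  if [ "$actual" = "$2" ]; then
    echo "ok: $1 uses $actual"
  else
    echo "mismatch: $1 uses ${actual:-<none>}, expected $2" >&2
    return 1
  fi
}
```

On the host this would be called as `verify_subnet proxy 10.10.0.0/16`, substituting whichever `/16` was actually documented in M1.
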