Phase `pvfix` — durable Swarm `proxy` overlay VIP exhaustion fix

Mission: eliminate the recurring Docker Swarm proxy overlay VIP exhaustion class by making the shared proxy network large enough for the cc-ci workload, while preserving the already-added per-run safety net. This is an infra phase: coordinate carefully, because recreating proxy briefly disrupts routing for Traefik, Drone, dashboard, bridge, reports, and any live recipe deploys.

State files live under machine-docs/: STATUS-pvfix.md, BACKLOG-pvfix.md, REVIEW-pvfix.md, JOURNAL-pvfix.md.

Context

The 2026-06-12 weekly upgrade exposed a real infra failure mode:

- The shared proxy overlay was using Docker's default /24 allocation (10.0.1.0/24, 254 VIPs).
- Every recipe deploy joins proxy for Traefik routing.
- Concurrent stack removal can race Swarm endpoint GC (`key modified`, `network proxy remove failed`) and leak endpoint/VIP allocations.
- After 11 days of dockerd uptime the allocator exhausted the /24, producing `could not find an available IP while allocating VIP` and leaving tasks stuck in the Swarm `New` state.
- A Docker restart rebuilt allocator state and cleared the symptom, proving the issue was infra, not the affected recipes.
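
To make the headroom argument concrete, the arithmetic below compares the address capacity of the old /24 with a /16. It is a minimal sketch: `usable_hosts` is a hypothetical helper, and it only subtracts the network and broadcast addresses. Docker reserves at least a gateway address as well, so the real VIP capacity is slightly lower in both cases.

```shell
# Usable host addresses in an IPv4 subnet of the given prefix length
# (total addresses minus network and broadcast addresses).
usable_hosts() {
    echo $(( (1 << (32 - $1)) - 2 ))
}

usable_hosts 24   # 254, the figure quoted for the old proxy overlay
usable_hosts 16   # 65534
```

At the same leak rate that exhausted the /24 in 11 days, a /16 would take roughly 250 times longer to fill, which is why the per-run guard stays in place rather than being treated as redundant.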

Existing runbook/background: /srv/cc-ci/cc-ci-plan/plan-proxy-vip-exhaustion-fix.md.

Required Fix

Confirm the current host state is quiet enough for a disruptive network maintenance window. That means no live /upgrade-all, no active recipe !testme runs, and no phase CI sweep in progress.
Update nix/modules/swarm.nix in the cc-ci repo so the proxy overlay is created with an explicit /16, for example:
```
docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
Use a subnet clear of ingress and existing Docker allocations. If 10.10.0.0/16 is unsuitable on the live host, choose a different documented /16 and explain why.
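
One way to check that a candidate subnet is clear is to compare it against every subnet Docker already uses. The sketch below is an illustration, not part of the runbook: `ip_to_int` and `cidr_overlaps` are hypothetical helpers, and the subnets in the loop are made-up sample inputs. On the live host, the real list could come from something like `docker network ls -q | xargs docker network inspect --format '{{.Name}} {{range .IPAM.Config}}{{.Subnet}} {{end}}'`.

```shell
# Convert a dotted-quad IPv4 address to an integer.
# The function body is a subshell, so the IFS change does not leak.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed if two CIDR blocks overlap. Two blocks overlap exactly when
# they agree on the bits covered by the shorter of the two prefixes.
cidr_overlaps() (
    a=$(ip_to_int "${1%/*}"); a_len=${1#*/}
    b=$(ip_to_int "${2%/*}"); b_len=${2#*/}
    if [ "$a_len" -lt "$b_len" ]; then len=$a_len; else len=$b_len; fi
    mask=$(( (4294967295 << (32 - len)) & 4294967295 ))
    [ $(( a & mask )) -eq $(( b & mask )) ]
)

candidate="10.10.0.0/16"
# Sample inputs only; replace with the subnets reported by the live host.
for existing in 10.0.0.0/24 10.0.1.0/24 172.17.0.0/16; do
    if cidr_overlaps "$candidate" "$existing"; then
        echo "CONFLICT: $candidate overlaps $existing"
    else
        echo "ok: $candidate is clear of $existing"
    fi
done
```

This covers only Docker-managed networks. Host routes and any other address space in use on the machine need a separate look before the /16 is declared safe.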
Keep the upgrade Step-0 safety net in place: prune leaked overlays and restart Docker when VIP-allocation failure signatures are detected. The durable /16 fix provides headroom; the guard remains useful as a self-healing fallback if allocations leak again.
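
The signature detection can be pictured as a simple log filter. This is a sketch under assumptions, not the actual Step-0 implementation: `has_vip_exhaustion` is a hypothetical name, and the only signature it matches is the error string quoted in the Context section.

```shell
# Read log text on stdin; succeed if it contains the VIP-exhaustion error.
has_vip_exhaustion() {
    grep -qF 'could not find an available IP while allocating VIP'
}

# Example use on the host (assumes Docker logs to the journal under the
# "docker" unit; the time window is arbitrary):
#   if journalctl -u docker --since "1 hour ago" | has_vip_exhaustion; then
#       echo "VIP exhaustion signature found"
#   fi
```

Matching a fixed string with `grep -F` avoids any regex surprises if the message ever contains special characters.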
Recreate the live proxy network safely. The network cannot be resized in place. Plan the exact live-host steps before executing them. The expected sequence is:
- capture current proxy inspect output and joined services
- stop or drain live recipe stacks as needed
- remove/recreate proxy with the /16
- redeploy/reconcile Traefik and the cc-ci control-plane services so they rejoin
- run nixos-rebuild switch using the canonical live cc-ci deploy checkout
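
The expected sequence above can be rehearsed as a dry run before anything disruptive happens. The sketch below prints the commands instead of executing them unless `DRY_RUN=0` is set. The `run` and `recreate_proxy` names are hypothetical, and the site-specific steps (draining recipe stacks, redeploying Traefik and the control-plane services, and any flags that `nixos-rebuild` needs for the live checkout) are deliberately left as comments rather than guessed.

```shell
DRY_RUN="${DRY_RUN:-1}"            # default to printing only
SUBNET="${SUBNET:-10.10.0.0/16}"   # example subnet from this plan

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recreate_proxy() {
    # 1. Capture the current state before changing anything.
    run docker network inspect --verbose proxy
    run docker service ls
    # 2. Stop or drain live recipe stacks as needed (site-specific, omitted).
    # 3. Remove and recreate the overlay with the explicit subnet.
    run docker network rm proxy
    run docker network create --driver overlay --attachable --subnet "$SUBNET" proxy
    # 4. Redeploy Traefik and control-plane services (site-specific, omitted).
    # 5. Apply the NixOS configuration (checkout-specific flags omitted).
    run nixos-rebuild switch
}
```

Calling `recreate_proxy` with the default settings prints the plan for review. Note that `docker network rm` is expected to refuse while services are still attached, so step 2 has to complete before step 3 can succeed.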
Commit and push the cc-ci repo change. Do not commit secrets. Do not merge recipe PRs.

Gates

M1 — Plan and patch ready. Builder produces the minimal swarm.nix patch, records the exact maintenance procedure, and proves from live inspection that the chosen /16 is safe. Adversary cold-reviews the patch and live procedure before any disruptive action.

M2 — Live durable fix applied. The live host has proxy recreated as /16, the NixOS configuration has been switched, and Traefik/Drone/dashboard/bridge/reports are reachable. Adversary verifies from the host that docker network inspect proxy reports the intended subnet and that the control-plane services are healthy.
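
For the M2 subnet check, a small comparison helper keeps the verification unambiguous. This is a sketch: `subnet_ok` and `EXPECTED_SUBNET` are hypothetical names, and the expected value assumes the example /16 from this plan was the one chosen.

```shell
EXPECTED_SUBNET="${EXPECTED_SUBNET:-10.10.0.0/16}"

# Succeed only if the reported subnet is exactly the intended one.
subnet_ok() {
    [ "$1" = "$EXPECTED_SUBNET" ]
}

# On the live host (requires Docker, so it is not run here):
#   actual="$(docker network inspect proxy --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}')"
#   if subnet_ok "$actual"; then
#       echo "proxy subnet OK: $actual"
#   else
#       echo "proxy subnet mismatch: got '$actual', want '$EXPECTED_SUBNET'"
#   fi
```

Health of the control-plane services still needs its own check; this only confirms the subnet half of the gate.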

Guardrails

Maintenance window only. Do not recreate proxy while recipe CI, /upgrade-all, or cfold sweep runs are active.
No force-pushes. No secret values in logs, plans, commits, or comments.
Prefer the smallest host change: one explicit --subnet plus the minimum live reconciliation needed to restore routing.
If the host topology differs from the runbook, stop and record the actual state before changing anything.

Definition of Done

proxy is explicitly configured and live as a /16, the change is committed and pushed to cc-ci, core routes are healthy after the maintenance action, and Adversary has signed off on M1 and M2 in machine-docs/REVIEW-pvfix.md. Builder writes ## DONE only after both gates have fresh Adversary PASSes.
