Phase pvfix — durable Swarm proxy overlay VIP exhaustion fix
Mission: eliminate the recurring Docker Swarm proxy overlay VIP exhaustion class by
making the shared proxy network large enough for the cc-ci workload, while preserving the
already-added per-run safety net. This is an infra phase: coordinate carefully, because
recreating proxy briefly disrupts routing for Traefik, Drone, dashboard, bridge, reports,
and any live recipe deploys.
State files live under machine-docs/: STATUS-pvfix.md, BACKLOG-pvfix.md,
REVIEW-pvfix.md, JOURNAL-pvfix.md.
Context
The 2026-06-12 weekly upgrade exposed a real infra failure mode:
- The shared `proxy` overlay was using Docker's default `/24` allocation (`10.0.1.0/24`, 254 VIPs).
- Every recipe deploy joins `proxy` for Traefik routing.
- Concurrent stack removal can race Swarm endpoint GC (`key modified`, `network proxy remove failed`) and leak endpoint/VIP allocations.
- After 11 days of dockerd uptime the allocator exhausted the `/24`, producing `could not find an available IP while allocating VIP` and leaving tasks stuck in Swarm `New` state.
- A docker restart rebuilt allocator state and cleared the symptom, proving the issue was infra, not the affected recipes.
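As a rough illustration of the headroom argument and of how the failure can be recognised, here is a minimal shell sketch. The function names `usable_addrs` and `has_vip_exhaustion` are hypothetical and not part of cc-ci. The count is plain host arithmetic (network and broadcast addresses excluded) and ignores any addresses Docker reserves for itself. The matched string is the error message quoted above.

```sh
#!/bin/sh
# Hypothetical helpers for illustration only; not taken from the cc-ci repo.

# Print the number of host addresses in an IPv4 prefix of the given length,
# excluding the network and broadcast addresses.
usable_addrs() {
  echo $(( (1 << (32 - $1)) - 2 ))
}

# Read log text on stdin; succeed if it contains the VIP-allocation
# failure message quoted in the context above.
has_vip_exhaustion() {
  grep -qF 'could not find an available IP while allocating VIP'
}
```

For example, `usable_addrs 24` prints 254 and `usable_addrs 16` prints 65534. Daemon logs could be piped into `has_vip_exhaustion`, assuming they are available as text on stdin.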
Existing runbook/background: /srv/cc-ci/cc-ci-plan/plan-proxy-vip-exhaustion-fix.md.
Required Fix
- Confirm the current host state is quiet enough for a disruptive network maintenance window. No live `/upgrade-all`, no active recipe `!testme` runs, no phase CI sweep in progress.
- Update `nix/modules/swarm.nix` in the cc-ci repo so the `proxy` overlay is created with an explicit `/16`, for example:
  `docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy`
  Use a subnet clear of `ingress` and existing Docker allocations. If `10.10.0.0/16` is unsuitable on the live host, choose a different documented `/16` and explain why.
- Keep the upgrade Step-0 safety net in place: prune leaked overlays and restart Docker when VIP-allocation failure signatures are detected. The durable `/16` fix is headroom; the guard is still useful as a future self-healing belt-and-braces mechanism.
- Recreate the live `proxy` network safely. The network cannot be resized in place. Plan the exact live-host steps before executing them. The expected sequence is:
  - capture current `proxy` inspect output and joined services
  - stop or drain live recipe stacks as needed
  - remove/recreate `proxy` with the `/16`
  - redeploy/reconcile Traefik and the cc-ci control-plane services so they rejoin
  - run `nixos-rebuild switch` using the canonical live cc-ci deploy checkout
- Commit and push the cc-ci repo change. Do not commit secrets. Do not merge recipe PRs.
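The expected recreate sequence above could be rehearsed with a dry-run wrapper along the following lines. This is a sketch under stated assumptions, not the procedure Builder must record. `run`, `recreate_proxy`, `DRY_RUN`, and `PROXY_SUBNET` are hypothetical names. The steps for draining recipe stacks and reconciling the control plane are left as comments because the plan does not specify them. The `nixos-rebuild switch` invocation is shown bare, without the deploy checkout path.

```sh
#!/bin/sh
# Hypothetical dry-run rehearsal; names are illustrative, not from cc-ci.

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

recreate_proxy() {
  subnet="${PROXY_SUBNET:-10.10.0.0/16}"

  # 1. Capture current state, including attached services.
  run docker network inspect --verbose proxy

  # 2. Stop or drain live recipe stacks as needed (site-specific, not shown).

  # 3. Remove and recreate the overlay with the explicit /16.
  #    Removal is expected to fail while endpoints are still attached.
  run docker network rm proxy
  run docker network create --driver overlay --attachable --subnet "$subnet" proxy

  # 4. Redeploy/reconcile Traefik and control-plane services (not shown).

  # 5. Switch the NixOS configuration from the live deploy checkout.
  run nixos-rebuild switch
}
```

Calling `recreate_proxy` with the default `DRY_RUN=1` only prints the commands, which makes the sequence easy to paste into the recorded plan for Adversary review before any disruptive action.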
Gates
M1 — Plan and patch ready. Builder produces the minimal swarm.nix patch, records the
exact maintenance procedure, and proves from live inspection that the chosen /16 is safe.
Adversary cold-reviews the patch and live procedure before any disruptive action.
M2 — Live durable fix applied. The live host has proxy recreated as /16, the NixOS
configuration has been switched, and Traefik/Drone/dashboard/bridge/reports are reachable.
Adversary verifies from the host that docker network inspect proxy reports the intended
subnet and that the control-plane services are healthy.
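For the M1 safety proof and the M2 verification, one possible approach is to list the subnets Docker already uses and check the candidate `/16` against each of them. The sketch below is illustrative only: `ip_to_int`, `cidr_range`, `cidr_overlaps`, `list_docker_subnets`, and `proxy_subnet` are hypothetical helper names. The overlap check covers IPv4 CIDR blocks only and says nothing about host routes outside Docker.

```sh
#!/bin/sh
# Hypothetical helpers for illustration only; not taken from the cc-ci repo.

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Print "first last" integer bounds of a CIDR block such as 10.10.0.0/16.
cidr_range() (
  ip=${1%/*}
  prefix=${1#*/}
  size=$(( 1 << (32 - $prefix) ))
  base=$(( $(ip_to_int "$ip") & ~($size - 1) ))
  echo "$base $(( $base + $size - 1 ))"
)

# Succeed if the two CIDR blocks share at least one address.
cidr_overlaps() (
  set -- $(cidr_range "$1") $(cidr_range "$2")
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
)

# List name and subnet(s) of every Docker network (needs a live daemon).
list_docker_subnets() {
  docker network ls -q | xargs docker network inspect \
    --format '{{.Name}} {{range .IPAM.Config}}{{.Subnet}} {{end}}'
}

# Print the subnet currently configured on proxy (needs a live daemon).
proxy_subnet() {
  docker network inspect proxy --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'
}
```

With these helpers, `cidr_overlaps 10.10.0.0/16 10.0.1.0/24` fails (no shared addresses), while `cidr_overlaps 10.10.0.0/16 10.10.5.0/24` succeeds. After the maintenance action, the output of `proxy_subnet` can be compared with the intended `/16`.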
Guardrails
- Maintenance window only. Do not recreate `proxy` while recipe CI, `/upgrade-all`, or `cfold` sweep runs are active.
- No force-pushes. No secret values in logs, plans, commits, or comments.
- Prefer the smallest host change: one explicit `--subnet` plus the minimum live reconciliation needed to restore routing.
- If the host topology differs from the runbook, stop and record the actual state before changing anything.
Definition of Done
proxy is explicitly configured and live as a /16, the change is committed and pushed to
cc-ci, core routes are healthy after the maintenance action, and Adversary has signed off on
M1 and M2 in machine-docs/REVIEW-pvfix.md. Builder writes ## DONE only after both gates
have fresh Adversary PASSes.