cc-ci-orchestrator/cc-ci-plan/plan-phase-pvfix-swarm-proxy.md
2026-06-12 15:56:03 +00:00

Phase pvfix — durable Swarm proxy overlay VIP exhaustion fix

Mission: eliminate the recurring class of VIP-exhaustion failures on the shared Docker Swarm proxy overlay by giving that network an address range large enough for the cc-ci workload, while preserving the per-run safety net that is already in place. This is an infra phase: coordinate carefully, because recreating proxy briefly disrupts routing for Traefik, Drone, dashboard, bridge, reports, and any live recipe deploys.

State files live under machine-docs/: STATUS-pvfix.md, BACKLOG-pvfix.md, REVIEW-pvfix.md, JOURNAL-pvfix.md.

Context

The 2026-06-12 weekly upgrade exposed a real infra failure mode:

  • The shared proxy overlay was using Docker's default /24 allocation (10.0.1.0/24, at most 254 usable addresses for service VIPs and task endpoints combined).
  • Every recipe deploy joins proxy for Traefik routing.
  • Concurrent stack removal can race Swarm endpoint GC (key modified, network proxy remove failed) and leak endpoint/VIP allocations.
  • After 11 days of dockerd uptime the allocator exhausted the /24, producing "could not find an available IP while allocating VIP" errors and leaving tasks stuck in the Swarm New state.
  • Restarting Docker rebuilt the allocator state and cleared the symptom, which shows the fault was in the infrastructure, not in the affected recipes.
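
As a rough check on the numbers above, the sketch below computes the usable address count for a given prefix length and shows one way to spot the failure signature in log text. The function names are hypothetical, and the arithmetic treats only the network and broadcast addresses as reserved; Docker also takes addresses for itself (the overlay gateway, for example), so real headroom is slightly lower.

```shell
# Usable host addresses in an IPv4 subnet of the given prefix length
# (total addresses minus the network and broadcast addresses).
usable_addrs() {
    prefix=$1
    echo $(( (1 << (32 - prefix)) - 2 ))
}

# Failure signature quoted in the Context section of this plan.
VIP_SIGNATURE='could not find an available IP while allocating VIP'

# Succeeds (exit 0) when the log text on stdin contains the signature.
has_vip_exhaustion() {
    grep -qF "$VIP_SIGNATURE"
}

usable_addrs 24   # prints 254
usable_addrs 16   # prints 65534
```

On the live host the log text could be piped in from journalctl -u docker, assuming the daemon runs as a systemd unit with that name.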

Existing runbook/background: /srv/cc-ci/cc-ci-plan/plan-proxy-vip-exhaustion-fix.md.

Required Fix

  1. Confirm the current host state is quiet enough for a disruptive network maintenance window: no live /upgrade-all, no active recipe !testme runs, and no phase CI sweep in progress.

  2. Update nix/modules/swarm.nix in the cc-ci repo so the proxy overlay is created with an explicit /16, for example:

    docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
    

    Use a subnet clear of ingress and existing Docker allocations. If 10.10.0.0/16 is unsuitable on the live host, choose a different documented /16 and explain why.

  3. Keep the upgrade Step-0 safety net in place: prune leaked overlays and restart Docker when VIP-allocation failure signatures are detected. The /16 adds headroom but does not stop endpoint leaks, so the guard remains useful as a self-healing backstop if leaks accumulate again.

  4. Recreate the live proxy network safely. The network cannot be resized in place. Plan the exact live-host steps before executing them. The expected sequence is:

    • capture current proxy inspect output and joined services
    • stop or drain live recipe stacks as needed
    • remove/recreate proxy with the /16
    • redeploy/reconcile Traefik and the cc-ci control-plane services so they rejoin
    • run nixos-rebuild switch from the canonical live cc-ci deploy checkout

  5. Commit and push the cc-ci repo change. Do not commit secrets. Do not merge recipe PRs.
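
The remove/recreate portion of step 4 could be rehearsed with a dry-run wrapper like the one below. This is a sketch under assumptions, not the agreed procedure: run, recreate_proxy, DRY_RUN, and PROXY_SUBNET are hypothetical names, and the stack drain, service redeploy, and nixos-rebuild steps are left as comments because their exact commands depend on the live host.

```shell
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
PROXY_SUBNET=${PROXY_SUBNET:-10.10.0.0/16}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recreate_proxy() {
    # 1. Capture the current network state, including attached services.
    run docker network inspect --verbose proxy
    # 2. (Site-specific) stop or drain recipe stacks still attached to proxy.
    #    docker network rm fails while endpoints remain attached.
    # 3. Remove and recreate the overlay with the explicit subnet.
    run docker network rm proxy
    run docker network create --driver overlay --attachable \
        --subnet "$PROXY_SUBNET" proxy
    # 4. (Site-specific) redeploy Traefik and the control-plane services,
    #    then run nixos-rebuild switch from the canonical deploy checkout.
}
```

Calling recreate_proxy with the default settings only prints the three docker commands, which makes the output suitable for pasting into the maintenance procedure for Adversary review; setting DRY_RUN=0 would execute them.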

Gates

M1 — Plan and patch ready. Builder produces the minimal swarm.nix patch, records the exact maintenance procedure, and proves from live inspection that the chosen /16 is safe. Adversary cold-reviews the patch and live procedure before any disruptive action.
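
One way to back the "chosen /16 is safe" claim in M1 is to compare the candidate against every subnet Docker already knows about. The sketch below is IPv4-only and its function names are hypothetical. It covers Docker's own view only, so host routes and any non-Docker networks would still need a separate look.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    echo "$1" | {
        IFS=. read -r a b c d
        echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
    }
}

# Succeeds (exit 0) when two CIDR blocks share at least one address.
cidr_overlaps() {
    net_a=$(ip_to_int "${1%/*}"); len_a=${1#*/}
    net_b=$(ip_to_int "${2%/*}"); len_b=${2#*/}
    # Two blocks overlap exactly when they agree on the shorter prefix.
    if [ "$len_a" -lt "$len_b" ]; then short=$len_a; else short=$len_b; fi
    shift_bits=$(( 32 - short ))
    [ $(( net_a >> shift_bits )) -eq $(( net_b >> shift_bits )) ]
}

# Read existing subnets (one per line) on stdin and print any that collide
# with the candidate; prints nothing when the candidate is clear.
find_conflicts() {
    candidate=$1
    while read -r existing; do
        [ -n "$existing" ] || continue
        if cidr_overlaps "$candidate" "$existing"; then
            echo "$existing"
        fi
    done
}

# On the live host the input could come from Docker itself, for example:
#   docker network ls -q | xargs docker network inspect \
#       --format '{{range .IPAM.Config}}{{println .Subnet}}{{end}}' \
#       | find_conflicts 10.10.0.0/16
```

Empty output from find_conflicts means none of the listed subnets intersects the candidate. Note that the old proxy network's 10.0.1.0/24 does not overlap 10.10.0.0/16.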

M2 — Live durable fix applied. The live host has proxy recreated as /16, the NixOS configuration has been switched, and Traefik/Drone/dashboard/bridge/reports are reachable. Adversary verifies from the host that docker network inspect proxy reports the intended subnet and that the control-plane services are healthy.
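
The subnet half of the M2 check can be scripted as below. This is a minimal sketch: EXPECTED_SUBNET, proxy_subnets, and verify_proxy_subnet are hypothetical names, and the expected value assumes the example /16 from the Required Fix section was the one chosen. Control-plane health checks are not covered here because their endpoints are site-specific.

```shell
EXPECTED_SUBNET=${EXPECTED_SUBNET:-10.10.0.0/16}

# Print the subnet(s) Docker reports for the proxy network, one per line.
proxy_subnets() {
    docker network inspect proxy \
        --format '{{range .IPAM.Config}}{{println .Subnet}}{{end}}'
}

# Succeed only if proxy reports exactly the expected subnet.
verify_proxy_subnet() {
    actual=$(proxy_subnets) || return 1
    if [ "$actual" = "$EXPECTED_SUBNET" ]; then
        echo "OK: proxy subnet is $actual"
    else
        echo "FAIL: proxy subnet is '$actual', expected $EXPECTED_SUBNET" >&2
        return 1
    fi
}
```

Running verify_proxy_subnet on the host exits non-zero when the network is missing or still on the old range, so it can gate the Adversary sign-off step.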

Guardrails

  • Maintenance window only. Do not recreate proxy while recipe CI, /upgrade-all, or cfold sweep runs are active.
  • No force-pushes. No secret values in logs, plans, commits, or comments.
  • Prefer the smallest host change: one explicit --subnet plus the minimum live reconciliation needed to restore routing.
  • If the host topology differs from the runbook, stop and record the actual state before changing anything.

Definition of Done

The proxy network is explicitly configured and live as a /16, the change is committed and pushed to the cc-ci repo, core routes are healthy after the maintenance action, and Adversary has signed off on M1 and M2 in machine-docs/REVIEW-pvfix.md. Builder writes ## DONE only after both gates have fresh Adversary PASSes.