cc-ci-orchestrator/cc-ci-plan/plan-phase-pvfix-swarm-proxy.md
2026-06-12 15:56:03 +00:00


# Phase `pvfix` — durable Swarm `proxy` overlay VIP exhaustion fix
**Mission:** eliminate the recurring Docker Swarm `proxy` overlay VIP exhaustion class by
making the shared `proxy` network large enough for the cc-ci workload, while preserving the
already-added per-run safety net. This is an infra phase: coordinate carefully, because
recreating `proxy` briefly disrupts routing for Traefik, Drone, dashboard, bridge, reports,
and any live recipe deploys.
State files live under `machine-docs/`: `STATUS-pvfix.md`, `BACKLOG-pvfix.md`,
`REVIEW-pvfix.md`, `JOURNAL-pvfix.md`.
## Context
The 2026-06-12 weekly upgrade exposed a real infra failure mode:
- The shared `proxy` overlay was using Docker's default `/24` allocation (`10.0.1.0/24`,
254 usable addresses shared by service VIPs and task endpoints).
- Every recipe deploy joins `proxy` for Traefik routing.
- Concurrent stack removal can race Swarm endpoint GC (`key modified`, `network proxy
remove failed`) and leak endpoint/VIP allocations.
- After 11 days of dockerd uptime the allocator exhausted the `/24`, producing
`could not find an available IP while allocating VIP` and leaving tasks stuck in Swarm
`New` state.
- A Docker restart rebuilt allocator state and cleared the symptom, which showed that the
fault was in the infrastructure, not in the affected recipes.
Existing runbook/background: `/srv/cc-ci/cc-ci-plan/plan-proxy-vip-exhaustion-fix.md`.
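
To put the `/24` versus `/16` headroom in numbers, the arithmetic can be sketched in shell. The helper name `usable_addresses` is illustrative and not taken from the cc-ci repo. The figure is the generic IPv4 count with the network and broadcast addresses excluded; Docker also reserves at least a gateway address on each network, so the real budget is slightly lower.

```bash
#!/usr/bin/env bash
# Illustrative helper (hypothetical name): usable host addresses in an IPv4
# subnet of the given prefix length, i.e. 2^(32 - prefix) minus the network
# and broadcast addresses.
usable_addresses() {
  local prefix="$1"
  echo $(( (1 << (32 - prefix)) - 2 ))
}

echo "/24 usable addresses: $(usable_addresses 24)"
echo "/16 usable addresses: $(usable_addresses 16)"
```

Under these assumptions the `/16` gives roughly 256 times the address space of the default `/24`, which is why leaked allocations take far longer to matter.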
## Required Fix
1. Confirm the current host state is quiet enough for a disruptive network maintenance
window. No live `/upgrade-all`, no active recipe `!testme` runs, no phase CI sweep in
progress.
2. Update `nix/modules/swarm.nix` in the cc-ci repo so the `proxy` overlay is created with
an explicit `/16`, for example:
```bash
docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
Use a subnet clear of `ingress` and existing Docker allocations. If `10.10.0.0/16` is
unsuitable on the live host, choose a different documented `/16` and explain why.
3. Keep the upgrade Step-0 safety net in place: prune leaked overlays and restart Docker
when VIP-allocation failure signatures are detected. The durable `/16` fix provides
headroom; the guard remains useful as a self-healing backstop if allocations leak again.
4. Recreate the live `proxy` network safely. The network cannot be resized in place.
Plan the exact live-host steps before executing them. The expected sequence is:
- capture current `proxy` inspect output and joined services
- stop or drain live recipe stacks as needed
- remove/recreate `proxy` with the `/16`
- redeploy/reconcile Traefik and the cc-ci control-plane services so they rejoin
- run `nixos-rebuild switch` using the canonical live cc-ci deploy checkout
5. Commit and push the cc-ci repo change. Do not commit secrets. Do not merge recipe PRs.
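
One way to support the "clear of `ingress` and existing Docker allocations" requirement in step 2 is to compare the candidate subnet against every subnet Docker currently reports. The sketch below is an assumption about how that check could look, not the procedure from the runbook: the function names are hypothetical, only IPv4 is handled, and host routes outside Docker are not inspected.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for checking a candidate subnet against existing ones.

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Print the first and last address of a CIDR block as two integers.
cidr_range() {
  local ip="${1%/*}" prefix="${1#*/}" addr mask
  addr=$(ip_to_int "$ip")
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  echo "$(( addr & mask )) $(( (addr & mask) | (~mask & 0xFFFFFFFF) ))"
}

# Succeed (exit 0) when the two CIDR blocks share at least one address.
cidrs_overlap() {
  local a_start a_end b_start b_end
  read -r a_start a_end <<< "$(cidr_range "$1")"
  read -r b_start b_end <<< "$(cidr_range "$2")"
  (( a_start <= b_end && b_start <= a_end ))
}

# Read existing subnets from stdin, one per line, and fail if any of them
# overlaps the candidate given as the first argument.
check_candidate() {
  local candidate="$1" subnet status=0
  while read -r subnet || [ -n "$subnet" ]; do
    [ -z "$subnet" ] && continue
    case "$subnet" in *:*) continue ;; esac   # skip IPv6 entries
    if cidrs_overlap "$candidate" "$subnet"; then
      echo "CONFLICT: $candidate overlaps $subnet"
      status=1
    fi
  done
  return "$status"
}

# Example use on the live host (not run here):
#   docker network inspect $(docker network ls -q) \
#     --format '{{range .IPAM.Config}}{{println .Subnet}}{{end}}' \
#     | check_candidate 10.10.0.0/16
```

Saving the `docker network inspect` output alongside the check result would also cover the "capture current `proxy` inspect output" item in step 4.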
## Gates
**M1 — Plan and patch ready.** Builder produces the minimal `swarm.nix` patch, records the
exact maintenance procedure, and proves from live inspection that the chosen `/16` is safe.
Adversary cold-reviews the patch and live procedure before any disruptive action.
**M2 — Live durable fix applied.** The live host has `proxy` recreated as `/16`, the NixOS
configuration has been switched, and Traefik/Drone/dashboard/bridge/reports are reachable.
Adversary verifies from the host that `docker network inspect proxy` reports the intended
subnet and that the control-plane services are healthy.
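
The M2 subnet check could be scripted roughly as follows. This is a minimal sketch under assumptions: `EXPECTED_SUBNET` defaults to the example `10.10.0.0/16` from the Required Fix and must be changed if a different `/16` was chosen, the function names are hypothetical, and control-plane health still has to be verified separately.

```bash
#!/usr/bin/env bash
# Assumed default; override if a different documented /16 was chosen.
EXPECTED_SUBNET="${EXPECTED_SUBNET:-10.10.0.0/16}"

# Ask Docker for the subnet currently configured on the proxy network.
current_proxy_subnet() {
  docker network inspect proxy --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'
}

# Compare the live subnet with the expected one. An explicit first argument
# replaces the live lookup, which keeps the comparison testable offline.
verify_proxy_subnet() {
  local actual="${1:-$(current_proxy_subnet)}"
  if [ "$actual" = "$EXPECTED_SUBNET" ]; then
    echo "PASS: proxy subnet is $actual"
  else
    echo "FAIL: proxy subnet is '$actual', expected $EXPECTED_SUBNET" >&2
    return 1
  fi
}

# Example use on the live host (not run here):
#   verify_proxy_subnet
```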
## Guardrails
- Maintenance window only. Do not recreate `proxy` while recipe CI, `/upgrade-all`, or
`cfold` sweep runs are active.
- No force-pushes. No secret values in logs, plans, commits, or comments.
- Prefer the smallest host change: one explicit `--subnet` plus the minimum live
reconciliation needed to restore routing.
- If the host topology differs from the runbook, stop and record the actual state before
changing anything.
## Definition of Done
`proxy` is explicitly configured and live as a `/16`, the change is committed and pushed to
cc-ci, core routes are healthy after the maintenance action, and Adversary has signed off on
M1 and M2 in `machine-docs/REVIEW-pvfix.md`. Builder writes `## DONE` only after both gates
have fresh Adversary PASSes.