# Phase `pvfix` — durable Swarm `proxy` overlay VIP exhaustion fix

**Mission:** eliminate the recurring Docker Swarm `proxy` overlay VIP exhaustion class by
making the shared `proxy` network large enough for the cc-ci workload, while preserving the
already-added per-run safety net. This is an infra phase: coordinate carefully, because
recreating `proxy` briefly disrupts routing for Traefik, Drone, dashboard, bridge, reports,
and any live recipe deploys.

State files live under `machine-docs/`: `STATUS-pvfix.md`, `BACKLOG-pvfix.md`,
`REVIEW-pvfix.md`, `JOURNAL-pvfix.md`.

## Context

The 2026-06-12 weekly upgrade exposed a real infra failure mode:

- The shared `proxy` overlay was using Docker's default `/24` allocation (`10.0.1.0/24`,
  254 VIPs).
- Every recipe deploy joins `proxy` for Traefik routing.
- Concurrent stack removal can race Swarm endpoint GC (`key modified`,
  `network proxy remove failed`) and leak endpoint/VIP allocations.
- After 11 days of dockerd uptime the allocator exhausted the `/24`, producing
  `could not find an available IP while allocating VIP` and leaving tasks stuck in Swarm
  `New` state.
- A Docker restart rebuilt allocator state and cleared the symptom, proving the issue was
  infra, not the affected recipes.

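
The headroom argument behind the first bullet can be checked with a little arithmetic. The sketch below assumes plain IPv4 subnet math with the network and broadcast addresses excluded; `usable_addresses` is a hypothetical helper name, and any addresses Docker reserves for itself inside the subnet are not subtracted.

```bash
# usable_addresses PREFIX_LEN
# Prints the number of assignable IPv4 addresses in a subnet with the given
# prefix length, excluding the network and broadcast addresses.
# (Hypothetical helper for illustration; not part of the cc-ci repo.)
usable_addresses() {
  local prefix="$1"
  echo $(( (1 << (32 - prefix)) - 2 ))
}

usable_addresses 24   # prints 254   (the default allocation described above)
usable_addresses 16   # prints 65534 (an explicit /16)
```
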
Existing runbook/background: `/srv/cc-ci/cc-ci-plan/plan-proxy-vip-exhaustion-fix.md`.

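
The log signatures quoted in the Context list lend themselves to a mechanical check, which is the kind of detection a Step-0 guard needs. A minimal sketch, assuming the daemon log is available as text on stdin; the function names are hypothetical and this is not the guard's actual implementation.

```bash
# vip_exhausted: reads log text on stdin; exits 0 if the VIP allocation
# failure signature from the Context section appears anywhere in it.
# (Hypothetical name, for illustration only.)
vip_exhausted() {
  grep -qF 'could not find an available IP while allocating VIP'
}

# endpoint_gc_raced: reads log text on stdin; exits 0 if either of the
# stack-removal race signatures from the Context section appears.
# (Hypothetical name, for illustration only.)
endpoint_gc_raced() {
  grep -qF -e 'key modified' -e 'network proxy remove failed'
}
```

One way to feed it would be `journalctl -u docker --since "1 hour ago" | vip_exhausted`, assuming the daemon runs as a systemd unit named `docker` on the live host.
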
## Required Fix

1. Confirm the current host state is quiet enough for a disruptive network maintenance
   window. No live `/upgrade-all`, no active recipe `!testme` runs, no phase CI sweep in
   progress.
2. Update `nix/modules/swarm.nix` in the cc-ci repo so the `proxy` overlay is created with
   an explicit `/16`, for example:

   ```bash
   # Create the shared overlay with an explicit /16 rather than Docker's default /24.
   docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
   ```

   Use a subnet clear of `ingress` and existing Docker allocations. If `10.10.0.0/16` is
   unsuitable on the live host, choose a different documented `/16` and explain why.
3. Keep the upgrade Step-0 safety net in place: prune leaked overlays and restart Docker
   when VIP-allocation failure signatures are detected. The durable `/16` fix is headroom;
   the guard is still useful as a future self-healing belt-and-braces mechanism.
4. Recreate the live `proxy` network safely. The network cannot be resized in place.
   Plan the exact live-host steps before executing them. The expected sequence is:
   - capture current `proxy` inspect output and joined services
   - stop or drain live recipe stacks as needed
   - remove/recreate `proxy` with the `/16`
   - redeploy/reconcile Traefik and the cc-ci control-plane services so they rejoin
   - run `nixos-rebuild switch` using the canonical live cc-ci deploy checkout
5. Commit and push the cc-ci repo change. Do not commit secrets. Do not merge recipe PRs.

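
Step 2 asks for a subnet clear of `ingress` and existing Docker allocations. A minimal sketch of that check, assuming IPv4 CIDR strings only; `ip_to_int`, `cidr_overlaps`, and `subnet_is_free` are hypothetical helper names, and host-level routes outside Docker are not considered.

```bash
# ip_to_int A.B.C.D: print the address as a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  # 10# forces base 10 so an octet like "08" is not parsed as octal.
  echo $(( (10#$a << 24) | (10#$b << 16) | (10#$c << 8) | 10#$d ))
}

# cidr_overlaps CIDR1 CIDR2: exit 0 if the two IPv4 blocks share any address.
# Two CIDR blocks overlap exactly when they agree on the shorter prefix.
cidr_overlaps() {
  local ip1="${1%/*}" len1="${1#*/}" ip2="${2%/*}" len2="${2#*/}"
  local len=$(( len1 < len2 ? len1 : len2 ))
  local mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  local n1 n2
  n1="$(ip_to_int "$ip1")"
  n2="$(ip_to_int "$ip2")"
  (( (n1 & mask) == (n2 & mask) ))
}

# subnet_is_free CANDIDATE: reads existing subnets on stdin, one per line;
# exits 0 only if CANDIDATE overlaps none of them.
subnet_is_free() {
  local candidate="$1" existing
  while read -r existing; do
    [ -z "$existing" ] && continue
    case "$existing" in *:*) continue ;; esac   # skip IPv6 entries
    if cidr_overlaps "$candidate" "$existing"; then
      echo "conflict: $candidate overlaps $existing" >&2
      return 1
    fi
  done
  return 0
}
```

On the live host the existing subnets could be listed with `docker network ls -q | xargs docker network inspect --format '{{range .IPAM.Config}}{{println .Subnet}}{{end}}'` and piped into `subnet_is_free 10.10.0.0/16`. The output of that command is the evidence M1 asks for, so it is worth recording alongside the result.
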
## Gates

**M1: Plan and patch ready.** Builder produces the minimal `swarm.nix` patch, records the
exact maintenance procedure, and proves from live inspection that the chosen `/16` is safe.
Adversary cold-reviews the patch and live procedure before any disruptive action.

**M2: Live durable fix applied.** The live host has `proxy` recreated as `/16`, the NixOS
configuration has been switched, and Traefik/Drone/dashboard/bridge/reports are reachable.
Adversary verifies from the host that `docker network inspect proxy` reports the intended
subnet and that the control-plane services are healthy.

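
The subnet half of the M2 check can be scripted. A sketch under the assumption that `proxy` carries a single IPAM entry; `proxy_subnets` and `verify_proxy_subnet` are hypothetical names, and the health of the control-plane services is not covered here.

```bash
# proxy_subnets: print each subnet configured on the `proxy` network,
# one per line, using docker's Go-template output.
proxy_subnets() {
  docker network inspect proxy \
    --format '{{range .IPAM.Config}}{{println .Subnet}}{{end}}'
}

# verify_proxy_subnet EXPECTED_CIDR
# Exits 0 and prints PASS only if `proxy` reports exactly EXPECTED_CIDR.
# Exits 2 if the network cannot be inspected, 1 on any mismatch.
verify_proxy_subnet() {
  local expected="$1" actual
  actual="$(proxy_subnets)" || { echo "FAIL: cannot inspect proxy" >&2; return 2; }
  if [ "$actual" = "$expected" ]; then
    echo "PASS: proxy subnet is $actual"
  else
    echo "FAIL: proxy subnet is '${actual}', expected '${expected}'" >&2
    return 1
  fi
}
```

Run as `verify_proxy_subnet 10.10.0.0/16` on the host, substituting whichever `/16` was actually chosen and documented.
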
## Guardrails

- Maintenance window only. Do not recreate `proxy` while recipe CI, `/upgrade-all`, or
  `cfold` sweep runs are active.
- No force-pushes. No secret values in logs, plans, commits, or comments.
- Prefer the smallest host change: one explicit `--subnet` plus the minimum live
  reconciliation needed to restore routing.
- If the host topology differs from the runbook, stop and record the actual state before
  changing anything.

## Definition of Done

`proxy` is explicitly configured and live as a `/16`, the change is committed and pushed to
cc-ci, core routes are healthy after the maintenance action, and Adversary has signed off on
M1 and M2 in `machine-docs/REVIEW-pvfix.md`. Builder writes `## DONE` only after both gates
have fresh Adversary PASSes.