Patch nix/modules/swarm.nix to create the `proxy` overlay with --subnet 10.10.0.0/16 (~65k VIPs, 258× headroom over the exhausted /24). Live host survey confirms 10.10.0.0/16 is clear of all existing Docker networks (ingress 10.0.0.0/24, existing per-stack overlays 10.0.1-4.0/24) and of host routes. The exact maintenance procedure is in STATUS-pvfix.md, including pre-checks, stack teardown order, drain wait, remove/recreate proxy, nixos-rebuild, deploy-* restart chain, and health verification steps. Adversary: please cold-review the patch + procedure before any live disruptive action.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2.8 KiB
JOURNAL — phase pvfix
2026-06-13T05:29Z — Bootstrap + M1 patch
Context gathered
Read the phase plan + runbook. Key facts:
- Root cause confirmed: proxy is `10.0.1.0/24` (254 VIPs), Docker GC race leaks endpoints → pool exhaustion
- Fix: enlarge to `/16` (`--subnet 10.10.0.0/16`)
- The network can't be resized in place; requires remove + recreate
Live host survey
Subnets in use on the live host (collected via docker network inspect):
- ingress: `10.0.0.0/24`
- proxy: `10.0.1.0/24` (current, to change)
- traefik internal: `10.0.2.0/24`
- warm-keycloak internal: `10.0.3.0/24`
- backups default: `10.0.4.0/24`
- bridge / docker_gwbridge: `172.17.0.0/16` / `172.18.0.0/16`
10.10.0.0/16 is clean — no conflicts. Host eth0: 91.98.47.73/32, Tailscale: 100.95.31.88/32.
No route entries for 10.10.x.x in ip route show.
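The survey result can be reproduced as a small overlap check. The following is a minimal sketch using Python's standard `ipaddress` module, not part of the actual patch. The subnet values are copied from the survey above; the `172.17/18.0.0/16` shorthand is read as two separate /16 networks (an assumption), and the name `conflicts` is hypothetical.

```python
import ipaddress

# Subnets and host addresses from the live host survey in this journal.
# Assumption: "172.17/18.0.0/16" means 172.17.0.0/16 and 172.18.0.0/16.
IN_USE = {
    "ingress": "10.0.0.0/24",
    "proxy (current)": "10.0.1.0/24",
    "traefik internal": "10.0.2.0/24",
    "warm-keycloak internal": "10.0.3.0/24",
    "backups default": "10.0.4.0/24",
    "bridge": "172.17.0.0/16",
    "docker_gwbridge": "172.18.0.0/16",
    "host eth0": "91.98.47.73/32",
    "tailscale": "100.95.31.88/32",
}

def conflicts(candidate, in_use=IN_USE):
    """Return the names of in-use networks that overlap the candidate subnet."""
    cand = ipaddress.ip_network(candidate)
    return [name for name, cidr in in_use.items()
            if cand.overlaps(ipaddress.ip_network(cidr))]

print(conflicts("10.10.0.0/16"))  # candidate chosen in this journal
print(conflicts("10.0.0.0/16"))   # a /16 covering the existing 10.0.x.0/24 band
```

An empty list for the candidate means no surveyed network overlaps it. The check only covers the addresses hard-coded here, so it complements rather than replaces inspecting `ip route show` on the host.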
Services on proxy (will be disrupted during maintenance)
From docker service ls + per-service network inspection:
- `traefik_ci_commoninternet_net_app`: uses proxy
- `drone_ci_commoninternet_net_app`: uses proxy
- `ccci-bridge_app`: uses proxy
- `ccci-dashboard_app`: uses proxy
- `ccci-reports_app`: uses proxy
- `warm-keycloak_ci_commoninternet_net_app`: uses proxy
NOT on proxy: backups_ci_commoninternet_net_app, traefik socket-proxy, warm-keycloak DB.
Deployment mechanism
- `swarm-init.service`: oneshot, creates proxy. Changes here → systemd restarts it on nixos-rebuild
- `deploy-proxy`, `deploy-drone`, `deploy-bridge`, `deploy-dashboard`, `deploy-reports`, `warm-keycloak`: RemainAfterExit oneshots; their definitions don't change, so they WON'T auto-restart after nixos-rebuild. They must be manually `systemctl restart`-ed after nixos-rebuild removes their stacks.
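The manual restart step could be scripted along these lines. This is a hedged sketch only: the unit names are taken from this journal, but the order used here is simply the listed order (an assumption; the authoritative order is in STATUS-pvfix.md), and the helper names `restart_commands` and `run_chain` are hypothetical. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

# Unit names as listed in this journal. Assumption: restart in listed order;
# the real restart chain is defined in STATUS-pvfix.md.
UNITS = [
    "deploy-proxy",
    "deploy-drone",
    "deploy-bridge",
    "deploy-dashboard",
    "deploy-reports",
    "warm-keycloak",
]

def restart_commands(units=UNITS):
    """Build one `systemctl restart <unit>` argv per unit, in order."""
    return [["systemctl", "restart", unit] for unit in units]

def run_chain(units=UNITS, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in restart_commands(units):
        print(shlex.join(argv))
        if not dry_run:
            # check=True raises on the first failing unit, stopping the chain.
            subprocess.run(argv, check=True)

run_chain()  # dry run: prints the commands without touching systemd
```

Stopping at the first failure is a deliberate choice in this sketch, so that a stack which fails to come up is noticed before the stacks listed after it are restarted.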
Design choice: why 10.10.0.0/16
- Must be `/16` for ~65k VIP headroom
- Must not overlap `10.0.0.0/24` (ingress) or any of the `10.0.1-4.0/24` per-stack overlays
- The Docker default-addr-pool is `10.0.0.0/8`; any `/16` in that range is fine as long as it doesn't overlap an existing allocation
- `10.10.0.0/16` is a clean `/16` well outside the current allocation band: clear of `10.0.x.x` while still in Docker's pool. No host route conflicts.
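The headroom figures can be checked with a few lines of arithmetic. This sketch counts usable host addresses (total minus the network and broadcast addresses), which matches the 254 figure used in this journal; Docker may reserve further addresses, for example for the gateway, so real VIP capacity can be slightly lower. The helper name `usable` is hypothetical.

```python
import ipaddress

old = ipaddress.ip_network("10.0.1.0/24")   # exhausted proxy subnet
new = ipaddress.ip_network("10.10.0.0/16")  # replacement subnet
pool = ipaddress.ip_network("10.0.0.0/8")   # Docker default-addr-pool per this journal

def usable(net):
    """Addresses in the subnet minus the network and broadcast addresses."""
    return net.num_addresses - 2

ratio = usable(new) / usable(old)
print(usable(old), usable(new), round(ratio))  # 254 65534 258
print(new.subnet_of(pool))                     # True
```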
swarm.nix patch
Added --subnet 10.10.0.0/16 to the docker network create call.
Also added a short comment explaining the motivation (required WHY per §7 comment policy for non-obvious constraint).
Maintenance window state
Host state at time of claim:
- `docker stack ls` shows 7 stacks: backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik, warm-keycloak
- NO active recipe CI runs (only warm stacks, no test app containers)
- Confirmed with `docker ps --format "{{.Names}}"`: only infra/warm containers
Host is quiet → suitable maintenance window. No active upgrade-all or !testme runs.