cc-ci/machine-docs/JOURNAL-pvfix.md
autonomic-bot e6349a9dfe
claim(pvfix-M1): proxy /16 patch + maintenance plan ready
Patch nix/modules/swarm.nix to create the `proxy` overlay with
--subnet 10.10.0.0/16 (~65k VIPs, 258× headroom over the exhausted /24).

Live host survey confirms 10.10.0.0/16 is clear of all existing
Docker networks (ingress 10.0.0.0/24, existing per-stack overlays
10.0.1-4.0/24, host routes). Exact maintenance procedure in
STATUS-pvfix.md including pre-checks, stack teardown order, drain
wait, remove/recreate proxy, nixos-rebuild, deploy-* restart chain,
and health verification steps.

Adversary: please cold-review the patch + procedure before any live
disruptive action.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 05:31:21 +00:00


JOURNAL — phase pvfix

2026-06-13T05:29Z — Bootstrap + M1 patch

Context gathered

Read the phase plan + runbook. Key facts:

  • Root cause confirmed: proxy is 10.0.1.0/24 (254 VIPs), Docker GC race leaks endpoints → pool exhaustion
  • Fix: enlarge to /16 (--subnet 10.10.0.0/16)
  • The network can't be resized in place; requires remove + recreate
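The capacity figures quoted here can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it counts usable host addresses per subnet, and Docker may reserve a few addresses of its own (for example a gateway), so the real VIP pool is likely slightly smaller than these numbers.

```python
import ipaddress


def usable_hosts(cidr: str) -> int:
    """Addresses left after excluding the network and broadcast addresses."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 2


old = usable_hosts("10.0.1.0/24")   # current proxy subnet
new = usable_hosts("10.10.0.0/16")  # proposed proxy subnet

# Headroom factor of the /16 over the /24, rounded to the nearest integer.
print(old, new, round(new / old))  # prints: 254 65534 258
```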

Live host survey

Subnets in use on the live host (collected via docker network inspect):

  • ingress: 10.0.0.0/24
  • proxy: 10.0.1.0/24 (current — to change)
  • traefik internal: 10.0.2.0/24
  • warm-keycloak internal: 10.0.3.0/24
  • backups default: 10.0.4.0/24
  • bridge / docker_gwbridge: 172.17.0.0/16 and 172.18.0.0/16

10.10.0.0/16 is clean — no conflicts. Host eth0: 91.98.47.73/32, Tailscale: 100.95.31.88/32. No route entries for 10.10.x.x in ip route show.
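The overlap claim can be double-checked offline from the surveyed values. The sketch below hard-codes the subnets and host addresses recorded in this section rather than querying Docker; the split of 172.17.0.0/16 and 172.18.0.0/16 between bridge and docker_gwbridge is an assumption, and it does not affect the result.

```python
import ipaddress

CANDIDATE = ipaddress.ip_network("10.10.0.0/16")

# Values copied from the survey above (not read from a live host).
IN_USE = {
    "ingress": "10.0.0.0/24",
    "proxy (current)": "10.0.1.0/24",
    "traefik internal": "10.0.2.0/24",
    "warm-keycloak internal": "10.0.3.0/24",
    "backups default": "10.0.4.0/24",
    "bridge": "172.17.0.0/16",           # assumed pairing
    "docker_gwbridge": "172.18.0.0/16",  # assumed pairing
    "eth0": "91.98.47.73/32",
    "tailscale": "100.95.31.88/32",
}


def conflicts(candidate, in_use):
    """Return the names of every in-use range that overlaps the candidate."""
    return [
        name
        for name, cidr in in_use.items()
        if candidate.overlaps(ipaddress.ip_network(cidr))
    ]


print(conflicts(CANDIDATE, IN_USE))  # prints: []
```

For contrast, running the same check against 10.0.0.0/16 reports all five 10.0.x.0/24 networks as conflicts.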

Services on proxy (will be disrupted during maintenance)

From docker service ls + per-service network inspection:

  • traefik_ci_commoninternet_net_app — uses proxy
  • drone_ci_commoninternet_net_app — uses proxy
  • ccci-bridge_app — uses proxy
  • ccci-dashboard_app — uses proxy
  • ccci-reports_app — uses proxy
  • warm-keycloak_ci_commoninternet_net_app — uses proxy

NOT on proxy: backups_ci_commoninternet_net_app, traefik socket-proxy, warm-keycloak DB.
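The per-service network inspection can be automated by filtering the JSON that docker service inspect emits. The following is a rough sketch, not the procedure actually used for this survey: it assumes each service lists its networks under Spec.TaskTemplate.Networks with a Target field holding the network ID, which should be verified against real output. The network IDs in the sample are hypothetical.

```python
import json


def services_on_network(inspect_json: str, network_id: str) -> list[str]:
    """Names of services whose task template targets the given network ID."""
    attached = []
    for svc in json.loads(inspect_json):
        spec = svc.get("Spec", {})
        # Assumed layout: Spec.TaskTemplate.Networks[].Target == network ID.
        networks = spec.get("TaskTemplate", {}).get("Networks") or []
        if any(net.get("Target") == network_id for net in networks):
            attached.append(spec.get("Name", "<unnamed>"))
    return sorted(attached)


# Hypothetical sample; "proxy-net-id" and "backups-net-id" are placeholders.
SAMPLE = json.dumps([
    {"Spec": {"Name": "drone_ci_commoninternet_net_app",
              "TaskTemplate": {"Networks": [{"Target": "proxy-net-id"}]}}},
    {"Spec": {"Name": "backups_ci_commoninternet_net_app",
              "TaskTemplate": {"Networks": [{"Target": "backups-net-id"}]}}},
])

print(services_on_network(SAMPLE, "proxy-net-id"))
```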

Deployment mechanism

  • swarm-init.service — oneshot, creates proxy. Changes here → systemd restarts it on nixos-rebuild
  • deploy-proxy, deploy-drone, deploy-bridge, deploy-dashboard, deploy-reports, warm-keycloak: RemainAfterExit oneshots. Their definitions don't change, so they WON'T auto-restart after nixos-rebuild. They must be restarted manually with systemctl restart after nixos-rebuild removes their stacks.
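The manual restart chain could be scripted along these lines. This is a sketch only: the unit names come from the list above, but the ordering shown is an assumption (the authoritative order is the one in STATUS-pvfix.md), and the helper defaults to a dry run that prints commands instead of executing them.

```python
import subprocess

# Unit names from the journal; this order is assumed, not confirmed.
RESTART_CHAIN = [
    "deploy-proxy",
    "deploy-drone",
    "deploy-bridge",
    "deploy-dashboard",
    "deploy-reports",
    "warm-keycloak",
]


def restart_commands(units):
    """Build one `systemctl restart <unit>` argv per unit, in order."""
    return [["systemctl", "restart", unit] for unit in units]


def run_chain(units, dry_run=True):
    """Print the commands, or run them and stop at the first failure."""
    for cmd in restart_commands(units):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


run_chain(RESTART_CHAIN)  # dry run: prints the six commands
```

Calling run_chain(RESTART_CHAIN, dry_run=False) would execute the restarts for real, so it belongs inside the maintenance window only.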

Design choice: why 10.10.0.0/16

  • Must be /16 for ~65k VIP headroom
  • Must not overlap 10.0.0.0/24 (ingress) or any of the 10.0.1-4.0/24 per-stack overlays
  • Docker Swarm's default-addr-pool is 10.0.0.0/8, so any /16 in that range is fine as long as it doesn't overlap an existing allocation
  • 10.10.0.0/16 is a clean /16 well outside the current allocation band: clear of 10.0.x.x while still in Docker's pool. No host route conflicts.
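The selection rules above can be written down as a small validator, which makes it easy to test alternative subnets against the same constraints. This is an illustrative sketch: the function name and messages are made up for the example, and the existing allocations are copied from the host survey rather than read live.

```python
import ipaddress

POOL = ipaddress.ip_network("10.0.0.0/8")
EXISTING = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24",
                 "10.0.3.0/24", "10.0.4.0/24")
]


def violations(cidr: str) -> list[str]:
    """List every design rule the proposed subnet breaks (empty means OK)."""
    net = ipaddress.ip_network(cidr)
    problems = []
    if net.prefixlen != 16:
        problems.append("not a /16")
    if not net.subnet_of(POOL):
        problems.append("outside 10.0.0.0/8")
    clashes = [str(existing) for existing in EXISTING if net.overlaps(existing)]
    if clashes:
        problems.append("overlaps " + ", ".join(clashes))
    return problems


print(violations("10.10.0.0/16"))   # prints: []
print(violations("172.20.0.0/16"))  # prints: ['outside 10.0.0.0/8']
```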

swarm.nix patch

Added --subnet 10.10.0.0/16 to the docker network create call. Also added a short comment explaining the motivation (required WHY per §7 comment policy for non-obvious constraint).

Maintenance window state

Host state at time of claim:

  • docker stack ls shows 7 stacks: backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik, warm-keycloak
  • NO active recipe CI runs (only warm stacks, no test app containers)
  • Confirmed with docker ps --format "{{.Names}}" — only infra/warm containers

Host is quiet → suitable maintenance window. No active upgrade-all or !testme runs.
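The "only infra/warm containers" check can be made mechanical by comparing the container names against the known stack prefixes. A minimal sketch, assuming Swarm container names begin with the stack name followed by an underscore; the sample names below are hypothetical and only illustrate the input shape from docker ps --format "{{.Names}}".

```python
# Stack names from the docker stack ls output recorded above.
INFRA_STACKS = (
    "backups", "ccci-bridge", "ccci-dashboard", "ccci-reports",
    "drone", "traefik", "warm-keycloak",
)


def unexpected_containers(names_output: str) -> list[str]:
    """Container names that do not belong to any known infra/warm stack."""
    prefixes = tuple(stack + "_" for stack in INFRA_STACKS)
    names = [line.strip() for line in names_output.splitlines() if line.strip()]
    return [name for name in names if not name.startswith(prefixes)]


# Hypothetical sample output; task suffixes are placeholders.
SAMPLE = """\
drone_ci_commoninternet_net_app.1.aaaa
ccci-bridge_app.1.bbbb
some-recipe-test_app.1.cccc
"""

print(unexpected_containers(SAMPLE))  # prints: ['some-recipe-test_app.1.cccc']
```

An empty result corresponds to the quiet-host condition described here; anything else would be worth inspecting before starting the maintenance window.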