cc-ci/machine-docs/JOURNAL-pvfix.md
autonomic-bot e6349a9dfe
claim(pvfix-M1): proxy /16 patch + maintenance plan ready
Patch nix/modules/swarm.nix to create the `proxy` overlay with
--subnet 10.10.0.0/16 (~65k VIPs, 258× headroom over the exhausted /24).

Live host survey confirms 10.10.0.0/16 is clear of all existing
Docker networks (ingress 10.0.0.0/24, existing per-stack overlays
10.0.1-4.0/24, host routes). Exact maintenance procedure in
STATUS-pvfix.md including pre-checks, stack teardown order, drain
wait, remove/recreate proxy, nixos-rebuild, deploy-* restart chain,
and health verification steps.

Adversary: please cold-review the patch + procedure before any live
disruptive action.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 05:31:21 +00:00


# JOURNAL — phase pvfix
## 2026-06-13T05:29Z — Bootstrap + M1 patch
### Context gathered
Read the phase plan + runbook. Key facts:
- Root cause confirmed: proxy is `10.0.1.0/24` (254 VIPs), Docker GC race leaks endpoints → pool exhaustion
- Fix: enlarge to `/16` (`--subnet 10.10.0.0/16`)
- The network can't be resized in place; requires remove + recreate
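The headroom figures can be reproduced with a little shell arithmetic. This is a minimal sketch: `usable_hosts` is a hypothetical helper, and it only subtracts the network and broadcast addresses, ignoring any addresses Docker reserves for itself inside an overlay.

```shell
#!/usr/bin/env bash
# Usable IPv4 addresses for a given prefix length:
# 2^(32 - prefix) minus the network and broadcast addresses.
usable_hosts() {
  local prefix=$1
  echo $(( (1 << (32 - prefix)) - 2 ))
}

old=$(usable_hosts 24)   # current proxy subnet size
new=$(usable_hosts 16)   # proposed proxy subnet size

echo "/24 -> ${old} addresses"
echo "/16 -> ${new} addresses"
echo "headroom -> $(( new / old ))x"   # integer division
```

This prints 254, 65534 and 258x, consistent with the "254 VIPs" and "258× headroom" figures quoted for this change.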
### Live host survey
Subnets in use on the live host (collected via `docker network inspect`):
- `ingress`: `10.0.0.0/24`
- `proxy`: `10.0.1.0/24` (current — to change)
- `traefik internal`: `10.0.2.0/24`
- `warm-keycloak internal`: `10.0.3.0/24`
- `backups default`: `10.0.4.0/24`
- `bridge`/`docker_gwbridge`: `172.17.0.0/16` / `172.18.0.0/16`
`10.10.0.0/16` is clean — no conflicts. Host eth0: `91.98.47.73/32`, Tailscale: `100.95.31.88/32`.
No route entries for `10.10.x.x` in `ip route show`.
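The "no conflicts" conclusion can be double-checked mechanically. The sketch below is pure bash and does not talk to Docker; the subnet list is copied by hand from the survey above (reading the bridge entry as `172.17.0.0/16` and `172.18.0.0/16`), and the helper names `ip_to_int`, `cidr_range` and `cidrs_overlap` are hypothetical.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  local -a o
  read -r -a o <<< "$1"
  echo $(( (o[0] << 24) | (o[1] << 16) | (o[2] << 8) | o[3] ))
}

# Print "start end" (inclusive integers) for a CIDR such as 10.10.0.0/16.
cidr_range() {
  local ip=${1%/*} prefix=${1#*/}
  local base mask
  base=$(ip_to_int "$ip")
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  echo "$(( base & mask )) $(( (base & mask) | (~mask & 0xFFFFFFFF) ))"
}

# Exit 0 when the two CIDRs share at least one address.
cidrs_overlap() {
  local a_start a_end b_start b_end
  read -r a_start a_end <<< "$(cidr_range "$1")"
  read -r b_start b_end <<< "$(cidr_range "$2")"
  (( a_start <= b_end && b_start <= a_end ))
}

candidate=10.10.0.0/16
# Subnets and host addresses transcribed from the survey.
in_use=(10.0.0.0/24 10.0.1.0/24 10.0.2.0/24 10.0.3.0/24 10.0.4.0/24
        172.17.0.0/16 172.18.0.0/16 91.98.47.73/32 100.95.31.88/32)

conflicts=0
for net in "${in_use[@]}"; do
  if cidrs_overlap "$candidate" "$net"; then
    echo "CONFLICT: $candidate overlaps $net"
    conflicts=$((conflicts + 1))
  fi
done
echo "conflicts: $conflicts"
```

With the surveyed values this reports `conflicts: 0`. It only covers the addresses listed here, so it complements rather than replaces the `ip route show` check.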
### Services on proxy (will be disrupted during maintenance)
From `docker service ls` + per-service network inspection:
- `traefik_ci_commoninternet_net_app` — uses proxy
- `drone_ci_commoninternet_net_app` — uses proxy
- `ccci-bridge_app` — uses proxy
- `ccci-dashboard_app` — uses proxy
- `ccci-reports_app` — uses proxy
- `warm-keycloak_ci_commoninternet_net_app` — uses proxy
NOT on proxy: `backups_ci_commoninternet_net_app`, traefik socket-proxy, warm-keycloak DB.
### Deployment mechanism
- `swarm-init.service` — oneshot, creates proxy. Changes here → systemd restarts it on nixos-rebuild
- `deploy-proxy`, `deploy-drone`, `deploy-bridge`, `deploy-dashboard`, `deploy-reports`, `warm-keycloak`: RemainAfterExit oneshots. Their definitions don't change, so they WON'T auto-restart after nixos-rebuild. They must be manually `systemctl restart`-ed after nixos-rebuild removes their stacks.
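The manual restart chain could look roughly like the sketch below. The unit names come from the list above; the order (proxy first) and the `DRY_RUN` switch are assumptions made for illustration, and the authoritative sequence is the one in STATUS-pvfix.md.

```shell
#!/usr/bin/env bash
# Units from the journal; ordering is an assumption, not the confirmed procedure.
units=(deploy-proxy deploy-drone deploy-bridge deploy-dashboard deploy-reports warm-keycloak)

# With DRY_RUN=1 (the default) the commands are only printed.
# Set DRY_RUN=0 to actually restart the units, stopping at the first failure.
restart_chain() {
  local unit
  for unit in "${units[@]}"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "systemctl restart ${unit}"
    else
      systemctl restart "${unit}" || return 1
    fi
  done
}

restart_chain
```

Defaulting to a dry run keeps the snippet safe to paste during review; the live run would be `DRY_RUN=0 restart_chain` as root on the host.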
### Design choice: why 10.10.0.0/16
- Must be `/16` for ~65k VIP headroom
- Must not overlap `10.0.0.0/24` (ingress) or any of the `10.0.1-4.0/24` per-stack overlays
- The Docker default-addr-pool is `10.0.0.0/8` — any `/16` in that range is fine as long as
it doesn't overlap an existing allocation
- `10.10.0.0/16` is a clean `/16` well outside the current allocation band: clear of `10.0.x.x` while still in Docker's pool. No host route conflicts.
### swarm.nix patch
Added `--subnet 10.10.0.0/16` to the `docker network create` call.
Also added a short comment explaining the motivation (required WHY per §7 comment policy for non-obvious constraint).
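For reviewers, the effect of the patch can be sketched as the command below plus a read-back check. Only the overlay driver and `--subnet 10.10.0.0/16` are confirmed by this journal; the real call in swarm.nix may carry additional flags, and `network_subnet` is a hypothetical helper.

```shell
#!/usr/bin/env bash
expected_subnet=10.10.0.0/16

# Assumption: the real swarm.nix call may pass more flags than shown here.
create_cmd=(docker network create --driver overlay --subnet "$expected_subnet" proxy)
echo "would run: ${create_cmd[*]}"

# Print the subnet(s) of an existing network; prints nothing if it is absent.
network_subnet() {
  docker network inspect --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}' "$1" 2>/dev/null
}

# Read-back check, only attempted where a docker CLI is available.
if command -v docker >/dev/null 2>&1; then
  actual=$(network_subnet proxy)
  if [ "$actual" = "$expected_subnet" ]; then
    echo "proxy subnet OK: $actual"
  else
    echo "proxy subnet is '${actual:-<none>}', expected $expected_subnet"
  fi
else
  echo "docker CLI not found; skipping read-back check"
fi
```

Before the maintenance window the read-back is expected to report the old `10.0.1.0/24`; after recreation it should match the new subnet.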
### Maintenance window state
Host state at time of claim:
- `docker stack ls` shows 7 stacks: backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik, warm-keycloak
- NO active recipe CI runs (only warm stacks, no test app containers)
- Confirmed with `docker ps --format "{{.Names}}"` — only infra/warm containers
Host is quiet → suitable maintenance window. No active upgrade-all or !testme runs.
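The quiet-host check can be turned into a repeatable pre-check along these lines. This is a sketch under assumptions: the expected names are the short stack names as written in this journal (the real `docker stack ls` names may be longer), and `unexpected_stacks` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Stack names as listed in the journal; adjust if the live names differ.
expected_stacks="backups ccci-bridge ccci-dashboard ccci-reports drone traefik warm-keycloak"

# Read stack names on stdin and print any that are not in the expected set.
unexpected_stacks() {
  local name
  while read -r name; do
    [ -z "$name" ] && continue
    case " $expected_stacks " in
      *" $name "*) ;;          # known infra/warm stack
      *) echo "$name" ;;       # anything else suggests an active CI run
    esac
  done
}

# Live usage on a swarm manager (not executed here):
#   docker stack ls --format '{{.Name}}' | unexpected_stacks

# Demonstration with sample input containing one stray test stack.
printf '%s\n' backups drone traefik some-test-app | unexpected_stacks
```

Empty output means only the known infra and warm stacks are present; any printed name is a reason to postpone the maintenance.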