Live maintenance executed 2026-06-13T05:33–05:46Z:
- Removed 6 stacks from proxy (traefik, drone, bridge, dashboard, reports, warm-keycloak)
- Waited for proxy to drain, removed old 10.0.1.0/24 network
- nixos-rebuild switch with git+file:///?submodules=1 → swarm-init restarted
- proxy recreated: Subnet 10.10.0.0/16, gateway 10.10.0.1
- All 9 swarm services running 1/1
- ci.commoninternet.net → HTTP/2 200; drone → 303

Adversary: verify from host that proxy subnet is /16 and routes healthy.
Full evidence in STATUS-pvfix.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
JOURNAL — phase pvfix
2026-06-13T05:29Z — Bootstrap + M1 patch
Context gathered
Read the phase plan + runbook. Key facts:
- Root cause confirmed: proxy is 10.0.1.0/24 (254 VIPs), Docker GC race leaks endpoints → pool exhaustion
- Fix: enlarge to /16 (--subnet 10.10.0.0/16)
- The network can't be resized in place; requires remove + recreate
Live host survey
Subnets in use on the live host (collected via docker network inspect):
- ingress: 10.0.0.0/24
- proxy: 10.0.1.0/24 (current, to change)
- traefik internal: 10.0.2.0/24
- warm-keycloak internal: 10.0.3.0/24
- backups default: 10.0.4.0/24
- bridge/docker_gwbridge: 172.17.0.0/16 / 172.18.0.0/16
10.10.0.0/16 is clean — no conflicts. Host eth0: 91.98.47.73/32, Tailscale: 100.95.31.88/32.
No route entries for 10.10.x.x in ip route show.
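The "no conflicts" conclusion can be re-checked offline from the surveyed values. The following is a minimal sketch using Python's standard `ipaddress` module; the `conflicts` helper and the dictionary layout are illustrative names, not part of the repo, and the split of 172.17.0.0/16 to bridge and 172.18.0.0/16 to docker_gwbridge is an assumption based on the shorthand in the survey.

```python
import ipaddress

# Subnets copied from the host survey. Which 172.x range belongs to which
# bridge is an assumption; it does not affect the overlap result.
IN_USE = {
    "ingress": "10.0.0.0/24",
    "proxy (old)": "10.0.1.0/24",
    "traefik internal": "10.0.2.0/24",
    "warm-keycloak internal": "10.0.3.0/24",
    "backups default": "10.0.4.0/24",
    "bridge": "172.17.0.0/16",
    "docker_gwbridge": "172.18.0.0/16",
}
HOST_ADDRS = ["91.98.47.73/32", "100.95.31.88/32"]  # eth0, Tailscale


def conflicts(candidate, in_use, host_addrs):
    """Return the names/addresses whose network overlaps `candidate`."""
    cand = ipaddress.ip_network(candidate)
    hits = [name for name, cidr in in_use.items()
            if cand.overlaps(ipaddress.ip_network(cidr))]
    hits += [addr for addr in host_addrs
             if cand.overlaps(ipaddress.ip_network(addr))]
    return hits


print(conflicts("10.10.0.0/16", IN_USE, HOST_ADDRS))  # []
print(conflicts("10.0.0.0/16", IN_USE, HOST_ADDRS))   # the five 10.0.x.0/24 entries
```

This only covers the subnets listed in the survey; it does not replace checking `ip route show` on the host.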
Services on proxy (will be disrupted during maintenance)
From docker service ls + per-service network inspection:
- traefik_ci_commoninternet_net_app: uses proxy
- drone_ci_commoninternet_net_app: uses proxy
- ccci-bridge_app: uses proxy
- ccci-dashboard_app: uses proxy
- ccci-reports_app: uses proxy
- warm-keycloak_ci_commoninternet_net_app: uses proxy
NOT on proxy: backups_ci_commoninternet_net_app, traefik socket-proxy, warm-keycloak DB.
Deployment mechanism
- swarm-init.service: oneshot, creates proxy. Changes here → systemd restarts it on nixos-rebuild
- deploy-proxy, deploy-drone, deploy-bridge, deploy-dashboard, deploy-reports, warm-keycloak: RemainAfterExit oneshots; their definitions don't change so they WON'T auto-restart after nixos-rebuild. Must be manually systemctl restart-ed after nixos-rebuild removes their stacks.
Design choice: why 10.10.0.0/16
- Must be /16 for ~65k VIP headroom
- Must not overlap 10.0.0.0/24 (ingress) or any of the 10.0.1-4.0/24 per-stack overlays
- The Docker default-addr-pool is 10.0.0.0/8; any /16 in that range is fine as long as it doesn't overlap an existing allocation
- 10.10.0.0/16 is the first clean /16 outside the current allocation band: clear of 10.0.x.x while still in Docker's pool. No host route conflicts.
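As a quick sanity check on the headroom figures, the arithmetic can be reproduced with the standard `ipaddress` module. This is a sketch, not something taken from the repo: `usable_addresses` is an illustrative helper that counts host addresses (total minus network and broadcast), which matches the "254" quoted for the old /24. Docker reserves a gateway address as well, so the real VIP budget is slightly lower.

```python
import ipaddress


def usable_addresses(cidr):
    """Host addresses in `cidr`: total minus network and broadcast."""
    return ipaddress.ip_network(cidr).num_addresses - 2


old = usable_addresses("10.0.1.0/24")
new = usable_addresses("10.10.0.0/16")
print(old, new, new // old)  # 254 65534 258

# The new subnet still sits inside Docker's 10.0.0.0/8 default-addr-pool.
pool = ipaddress.ip_network("10.0.0.0/8")
print(ipaddress.ip_network("10.10.0.0/16").subnet_of(pool))  # True
```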
swarm.nix patch
Added --subnet 10.10.0.0/16 to the docker network create call.
Also added a short comment explaining the motivation (required WHY per §7 comment policy for non-obvious constraint).
Maintenance window state
Host state at time of claim:
- docker stack ls shows 7 stacks: backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik, warm-keycloak
- NO active recipe CI runs (only warm stacks, no test app containers)
- Confirmed with docker ps --format "{{.Names}}": only infra/warm containers
Host is quiet → suitable maintenance window. No active upgrade-all or !testme runs.
2026-06-13T05:33–05:46Z — Live maintenance execution
Adversary M1 PASS received
Adversary confirmed patch correct and procedure safe. Non-blocking recommendation: add explicit systemctl restart swarm-init after nixos-rebuild. Adopted.
Pre-flight confirmed
- No active recipe test containers (docker ps: empty)
- All stacks infra-only (7 stacks: backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik, warm-keycloak)
Stack removal
docker stack rm traefik_ci_commoninternet_net drone_ci_commoninternet_net ccci-bridge ccci-dashboard ccci-reports warm-keycloak_ci_commoninternet_net
Output showed all services/configs/networks being removed. proxy drained in ~12s (4 polling attempts).
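The journal does not record how the drain was polled, so the following is only a generic sketch of a bounded poll loop. `wait_until_drained`, its parameters, and the fake probe are all hypothetical; on a real host the probe would be replaced by something that counts endpoints still attached to proxy (for example by parsing `docker network inspect proxy`).

```python
import time


def wait_until_drained(probe, attempts=10, interval=3.0, sleep=time.sleep):
    """Call `probe()` until it returns 0 or the attempts run out.

    Returns the number of attempts used; raises TimeoutError otherwise.
    """
    for attempt in range(1, attempts + 1):
        if probe() == 0:
            return attempt
        if attempt < attempts:
            sleep(interval)
    raise TimeoutError(f"still not drained after {attempts} attempts")


# Fake probe standing in for a real check: attached endpoints drop 6 -> 3 -> 1 -> 0.
readings = iter([6, 3, 1, 0])
used = wait_until_drained(lambda: next(readings), sleep=lambda seconds: None)
print(used)  # 4
```

Bounding the number of attempts keeps a stuck endpoint from blocking the maintenance window silently; the caller sees a `TimeoutError` instead.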
Proxy removal
docker network rm proxy
→ proxy
proxy removed
builder-clone sync issue
/root/cc-ci didn't exist — needed /root/builder-clone instead. The builder-clone was at e1c4198 (old).
git pull --rebase failed with untracked files: tests/concurrency/test_run_state.py.
Moved to /root/test_run_state.py.bak. Second pull succeeded, fast-forwarded to b6e12ef.
Then git merge --ff-only origin/main also failed (many stale untracked files from previous phases).
Moved all conflicting files to /root/stash-pvfix/. Successfully merged to caef217 (latest main).
Confirmed grep subnet /root/builder-clone/nix/modules/swarm.nix → --subnet 10.10.0.0/16.
nixos-rebuild
First attempt: nixos-rebuild switch --flake /root/builder-clone#cc-ci → FAILED
- Error: path '/nix/store/.../secrets/secrets.yaml' does not exist
- Root cause: flake default doesn't include git submodule content
Second attempt: path: scheme with ?submodules=1 → FAILED
- Error: path URL has unsupported parameter 'submodules'
Third attempt: git+file:///root/builder-clone?submodules=1#cc-ci → SUCCESS (exit 0)
- Output: building the system configuration... (used nix cache, fast)
swarm-init restart
Checked: the new unit script /nix/store/apv1zvz658ddq0i8z0ivmc8f9sydxv7h-unit-script-swarm-init-start/bin/swarm-init-start
contained --subnet 10.10.0.0/16. The service was still showing "active" from its old run (Jun 12).
Ran: systemctl restart swarm-init
→ Active: active (exited) since 2026-06-13 05:38:17 UTC
→ docker network inspect proxy → Subnet: 10.10.0.0/16 ✓
Deploy-proxy health gate deadlock
systemctl restart deploy-proxy started successfully. Traefik deployed.
But health gate (ci.commoninternet.net → 200) failed because dashboard not yet deployed.
Reconciler logged: [traefik] on latest 5.1.1+v3.6.15 but UNHEALTHY → redeploy
Analysis: deploy-proxy's health_timeout=300s (5 min) leaves enough time for the dashboard to be
deployed concurrently. However, the After= ordering in systemd means the other deploy services
DON'T start until deploy-proxy is "active", and deploy-proxy stays "activating" until its health
gate passes, which in turn needs the dashboard. Relying on the ordering chain alone would have
left both sides waiting on each other.
Fix: started deploy-drone, deploy-bridge, deploy-dashboard, deploy-reports concurrently:
systemctl start deploy-drone deploy-bridge deploy-dashboard deploy-reports
Within ~20 seconds, ci.commoninternet.net returned 200. Deploy-proxy health gate passed.
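The circular wait described in the analysis can be made explicit with a small "waits on" graph. This is an illustrative model only: the edges are a reading of the journal (the health gate needs the dashboard, the dashboard's After= ordering needs deploy-proxy), and `find_cycle` is a hypothetical helper, not a systemd feature.

```python
def find_cycle(waits_on):
    """Return one cycle in a 'waits on' graph as a list of nodes, or None."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Back edge: slice out the loop and close it.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dep in waits_on.get(node, ()):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in waits_on:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# With the ordering chain: the health gate waits on the dashboard,
# and the dashboard's After= ordering waits on deploy-proxy.
ordered = {
    "deploy-proxy": ["deploy-dashboard"],
    "deploy-dashboard": ["deploy-proxy"],
}
print(find_cycle(ordered))  # ['deploy-proxy', 'deploy-dashboard', 'deploy-proxy']

# Starting deploy-dashboard by hand removes its wait on deploy-proxy.
manual = {"deploy-proxy": ["deploy-dashboard"], "deploy-dashboard": []}
print(find_cycle(manual))  # None
```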
Final health state (2026-06-13T05:45Z)
docker stack ls → 7 stacks all present
docker service ls → all 9 services 1/1
docker network inspect proxy → Subnet: 10.10.0.0/16
ci.commoninternet.net → HTTP/2 200
drone.ci.commoninternet.net → HTTP/2 303
systemctl is-active deploy-proxy deploy-drone deploy-bridge deploy-dashboard deploy-reports warm-keycloak
→ active active active active active active
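For the adversary's host-side check, the subnet can be read out of the JSON that `docker network inspect proxy` prints. The following is a minimal sketch: `proxy_ipam` is an illustrative helper, and `SAMPLE` is a trimmed, hand-written stand-in for the real output (which carries many more fields), populated with the subnet and gateway recorded in this journal.

```python
import json


def proxy_ipam(inspect_json):
    """Extract (subnet, gateway) from `docker network inspect` JSON output."""
    networks = json.loads(inspect_json)      # inspect prints a JSON array
    config = networks[0]["IPAM"]["Config"][0]
    return config["Subnet"], config.get("Gateway")


# Trimmed stand-in for real output; not captured from the host.
SAMPLE = """
[{"Name": "proxy",
  "IPAM": {"Config": [{"Subnet": "10.10.0.0/16", "Gateway": "10.10.0.1"}]}}]
"""

subnet, gateway = proxy_ipam(SAMPLE)
print(subnet, gateway)  # 10.10.0.0/16 10.10.0.1
assert subnet.endswith("/16"), "proxy subnet is not a /16"
```

On the host, the input would come from the command's stdout rather than from `SAMPLE`.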