cc-ci/machine-docs/JOURNAL-pvfix.md
autonomic-bot 71319d7096
claim(pvfix-M2): proxy recreated as /16 — all routes healthy
Live maintenance executed 2026-06-13T05:33–05:46Z:
- Removed 6 stacks from proxy (traefik, drone, bridge, dashboard, reports, warm-keycloak)
- Waited for proxy to drain, removed old 10.0.1.0/24 network
- nixos-rebuild switch with git+file:///?submodules=1 → swarm-init restarted
- proxy recreated: Subnet 10.10.0.0/16, gateway 10.10.0.1
- All 9 swarm services running 1/1
- ci.commoninternet.net → HTTP/2 200; drone → 303

Adversary: verify from host that proxy subnet is /16 and routes healthy.
Full evidence in STATUS-pvfix.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 05:47:04 +00:00


JOURNAL — phase pvfix

2026-06-13T05:29Z — Bootstrap + M1 patch

Context gathered

Read the phase plan + runbook. Key facts:

  • Root cause confirmed: proxy is 10.0.1.0/24 (254 VIPs), Docker GC race leaks endpoints → pool exhaustion
  • Fix: enlarge to /16 (--subnet 10.10.0.0/16)
  • The network can't be resized in place; requires remove + recreate
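The size difference between the two prefixes can be checked with plain shell arithmetic. This is a minimal sketch; `cidr_size` is a hypothetical helper name, not something from the repo. The journal's figure of 254 for the /24 corresponds to 256 addresses minus the network and broadcast addresses.

```shell
# Number of IPv4 addresses in a CIDR block with the given prefix length.
cidr_size() {
  echo $(( 1 << (32 - $1) ))
}

echo "/24: $(cidr_size 24) addresses"   # /24: 256 addresses
echo "/16: $(cidr_size 16) addresses"   # /16: 65536 addresses
```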

Live host survey

Subnets in use on the live host (collected via docker network inspect):

  • ingress: 10.0.0.0/24
  • proxy: 10.0.1.0/24 (current — to change)
  • traefik internal: 10.0.2.0/24
  • warm-keycloak internal: 10.0.3.0/24
  • backups default: 10.0.4.0/24
  • bridge/docker_gwbridge: 172.17.0.0/16 and 172.18.0.0/16

10.10.0.0/16 is clean — no conflicts. Host eth0: 91.98.47.73/32, Tailscale: 100.95.31.88/32. No route entries for 10.10.x.x in ip route show.
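The no-conflict claim can be double-checked mechanically. The sketch below compares the candidate block against the subnets listed in the survey using integer range arithmetic. The helper names (`ip_to_int`, `cidr_range`, `cidr_overlap`) are hypothetical, the subnet list is copied by hand from the survey rather than read from Docker, and the arithmetic assumes a shell with 64-bit integers.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Print "first last" integer bounds of a CIDR block such as 10.10.0.0/16.
cidr_range() (
  base=$(ip_to_int "${1%/*}")
  size=$(( 1 << (32 - ${1#*/}) ))
  first=$(( base / size * size ))
  echo "$first $(( first + size - 1 ))"
)

# Exit 0 when the two CIDR blocks share at least one address, 1 otherwise.
cidr_overlap() (
  set -- $(cidr_range "$1") $(cidr_range "$2")
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
)

# Subnets from the host survey; the loop prints nothing when there is no conflict.
for used in 10.0.0.0/24 10.0.1.0/24 10.0.2.0/24 10.0.3.0/24 10.0.4.0/24 \
            172.17.0.0/16 172.18.0.0/16; do
  if cidr_overlap 10.10.0.0/16 "$used"; then
    echo "conflict with $used"
  fi
done
```

For comparison, `cidr_overlap 10.0.0.0/16 10.0.3.0/24` succeeds, which is why a /16 starting at 10.0.0.0 would not have been usable.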

Services on proxy (will be disrupted during maintenance)

From docker service ls + per-service network inspection:

  • traefik_ci_commoninternet_net_app — uses proxy
  • drone_ci_commoninternet_net_app — uses proxy
  • ccci-bridge_app — uses proxy
  • ccci-dashboard_app — uses proxy
  • ccci-reports_app — uses proxy
  • warm-keycloak_ci_commoninternet_net_app — uses proxy

NOT on proxy: backups_ci_commoninternet_net_app, traefik socket-proxy, warm-keycloak DB.

Deployment mechanism

  • swarm-init.service — oneshot, creates proxy. Changes here → systemd restarts it on nixos-rebuild
  • deploy-proxy, deploy-drone, deploy-bridge, deploy-dashboard, deploy-reports, warm-keycloak — RemainAfterExit oneshots; their definitions don't change, so they WON'T auto-restart on nixos-rebuild. They must be restarted manually with systemctl restart once their stacks have been removed and nixos-rebuild has completed.

Design choice: why 10.10.0.0/16

  • Must be /16 for ~65k VIP headroom
  • Must not overlap 10.0.0.0/24 (ingress) or any of the existing overlays in 10.0.1.0/24 through 10.0.4.0/24
  • The Docker default-addr-pool is 10.0.0.0/8 — any /16 in that range is fine as long as it doesn't overlap an existing allocation
  • 10.10.0.0/16 is a clean /16 outside the current allocation band — clear of 10.0.x.x while still in Docker's pool. No host route conflicts.

swarm.nix patch

Added --subnet 10.10.0.0/16 to the docker network create call. Also added a short comment explaining the motivation (required WHY per §7 comment policy for non-obvious constraint).
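For illustration only, a create step of this kind might look like the sketch below. The real swarm-init script is not reproduced in this journal, so the function name `ensure_network`, the `--driver overlay` and `--attachable` flags, and the exists-check are all assumptions. Only `--subnet 10.10.0.0/16` comes from the patch itself.

```shell
# Hypothetical sketch of an idempotent network-create step.
# $1 = network name, $2 = subnet in CIDR notation.
ensure_network() {
  if docker network inspect "$1" >/dev/null 2>&1; then
    # An existing network keeps its old subnet; it is never resized in place.
    echo "$1 already exists, leaving it untouched"
  else
    docker network create --driver overlay --attachable --subnet "$2" "$1"
  fi
}
```

If the real script has a guard like this, it would explain why the old proxy network must be removed before the new subnet can take effect: called as `ensure_network proxy 10.10.0.0/16`, the function would otherwise leave the /24 in place.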

Maintenance window state

Host state at time of claim:

  • docker stack ls shows 7 stacks: backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik, warm-keycloak
  • NO active recipe CI runs (only warm stacks, no test app containers)
  • Confirmed with docker ps --format "{{.Names}}" — only infra/warm containers

Host is quiet → suitable maintenance window. No active upgrade-all or !testme runs.


2026-06-13T05:33–05:46Z — Live maintenance execution

Adversary M1 PASS received

Adversary confirmed patch correct and procedure safe. Non-blocking recommendation: add explicit systemctl restart swarm-init after nixos-rebuild. Adopted.

Pre-flight confirmed

  • No active recipe test containers (docker ps — empty)
  • All stacks infra-only (7 stacks: backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik, warm-keycloak)

Stack removal

docker stack rm traefik_ci_commoninternet_net drone_ci_commoninternet_net ccci-bridge ccci-dashboard ccci-reports warm-keycloak_ci_commoninternet_net

Output showed all services/configs/networks being removed. proxy drained in ~12s (4 polling attempts).
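The journal does not record how the drain was polled, so the following is only a sketch of one way to do it. `wait_until` and `proxy_drained` are hypothetical names, and the attempt count and interval are assumptions chosen to roughly match "4 attempts in ~12s".

```shell
# Run a command until it succeeds: at most $1 tries, sleeping $2 seconds after each failure.
wait_until() (
  attempts=$1 delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "succeeded on attempt $i"
      exit 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  echo "gave up after $attempts attempts" >&2
  exit 1
)

# Assumed drain test: true once no container on this node is attached to proxy.
proxy_drained() {
  [ -z "$(docker network inspect proxy --format '{{range .Containers}}{{.Name}} {{end}}')" ]
}
```

A call such as `wait_until 10 3 proxy_drained && docker network rm proxy` would then remove the network only after the endpoints are gone.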

Proxy removal

docker network rm proxy
→ proxy
proxy removed

builder-clone sync issue

/root/cc-ci didn't exist — needed /root/builder-clone instead. The builder-clone was at e1c4198 (old). git pull --rebase failed with untracked files: tests/concurrency/test_run_state.py. Moved to /root/test_run_state.py.bak. Second pull succeeded, fast-forwarded to b6e12ef.

Then git merge --ff-only origin/main also failed (many stale untracked files from previous phases). Moved all conflicting files to /root/stash-pvfix/. Successfully merged to caef217 (latest main). Confirmed: grep subnet /root/builder-clone/nix/modules/swarm.nix → --subnet 10.10.0.0/16.

nixos-rebuild

First attempt: nixos-rebuild switch --flake /root/builder-clone#cc-ci → FAILED

  • Error: path '/nix/store/.../secrets/secrets.yaml' does not exist
  • Root cause: flake default doesn't include git submodule content

Second attempt: path: scheme with ?submodules=1 → FAILED

  • Error: path URL has unsupported parameter 'submodules'

Third attempt: git+file:///root/builder-clone?submodules=1#cc-ci → SUCCESS (exit 0)

  • Output: building the system configuration... (used nix cache, fast)
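The working flake reference from the third attempt can be assembled as below. `flake_ref` is a hypothetical helper; the path and output name are the ones used above.

```shell
# Build a git+file flake reference that includes submodule content.
# $1 = absolute path to the checkout, $2 = flake output name.
flake_ref() {
  echo "git+file://$1?submodules=1#$2"
}

flake_ref /root/builder-clone cc-ci
# prints: git+file:///root/builder-clone?submodules=1#cc-ci
```

When passing the result on, as in `nixos-rebuild switch --flake "$(flake_ref /root/builder-clone cc-ci)"`, keeping it quoted avoids the shell treating `?` as a glob character.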

swarm-init restart

Checked: the new unit script /nix/store/apv1zvz658ddq0i8z0ivmc8f9sydxv7h-unit-script-swarm-init-start/bin/swarm-init-start contained --subnet 10.10.0.0/16. The service was still showing "active" from its old run (Jun 12).

Ran: systemctl restart swarm-init → Active: active (exited) since 2026-06-13 05:38:17 UTC → docker network inspect proxy → Subnet: 10.10.0.0/16 ✓

Deploy-proxy health gate deadlock

systemctl restart deploy-proxy started successfully and Traefik deployed, but the health gate (ci.commoninternet.net → 200) failed because the dashboard was not yet deployed. Reconciler logged: [traefik] on latest 5.1.1+v3.6.15 but UNHEALTHY → redeploy

Analysis: deploy-proxy's health_timeout=300s (5 min) would leave enough time for the dashboard to come up if it were deployed concurrently. The systemd After= ordering prevents that: the dependent deploy services DON'T start until deploy-proxy is "active", and deploy-proxy stays "activating" until its health gate passes, which in turn needs the dashboard. Relying on the ordering chain alone would have left the gate waiting on services that could not start.

Fix: started deploy-drone, deploy-bridge, deploy-dashboard, deploy-reports concurrently:

systemctl start deploy-drone deploy-bridge deploy-dashboard deploy-reports

Within ~20 seconds, ci.commoninternet.net returned 200. Deploy-proxy health gate passed.

Final health state (2026-06-13T05:45Z)

docker stack ls → 7 stacks all present
docker service ls → all 9 services 1/1
docker network inspect proxy → Subnet: 10.10.0.0/16
ci.commoninternet.net → HTTP/2 200
drone.ci.commoninternet.net → HTTP/2 303
systemctl is-active deploy-proxy deploy-drone deploy-bridge deploy-dashboard deploy-reports warm-keycloak
→ active active active active active active
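The same checks could be scripted for a repeat verification from the host. This is a sketch under assumptions: the helper names are hypothetical, and it relies only on the standard `docker ... --format` Go-template fields and on curl's `%{http_code}` write-out variable.

```shell
# Print the subnet(s) configured on a Docker network, one per line.
network_subnet() {
  docker network inspect "$1" --format '{{range .IPAM.Config}}{{println .Subnet}}{{end}}'
}

# List swarm services whose running and desired replica counts differ.
unhealthy_services() {
  docker service ls --format '{{.Name}} {{.Replicas}}' |
    awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Print only the HTTP status code returned for a URL.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}
```

On a healthy host, `network_subnet proxy` should print 10.10.0.0/16 and `unhealthy_services` should print nothing.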