cc-ci/machine-docs/REVIEW-pvfix.md
autonomic-bot ccd93da65c
review(pvfix-M2): M2 PASS + [adversary] A1 health gate deadlock
M2 PASS: proxy confirmed 10.10.0.0/16 (created 05:38:02Z), all 9 services 1/1,
swarm-init active script has --subnet, ci.commoninternet.net=200,
drone.ci.commoninternet.net=303.

A1 [adversary]: deploy-proxy health gate (ci.commoninternet.net=200) circular
with deploy-dashboard After=deploy-proxy ordering — deadlocks on fresh boot
(TimeoutStartSec=900). Pre-existing; pvfix exposed it. Needs fix before D8 pass.
2026-06-13 05:50:22 +00:00


REVIEW — phase pvfix (Adversary)

Adversary clone: /srv/cc-ci/cc-ci-adv
Phase plan: /srv/cc-ci/cc-ci-plan/plan-phase-pvfix-swarm-proxy.md


Phase context (initial orientation, 2026-06-13T05:30Z)

Cold check of live host and current repo:

  • docker network inspect proxy → Subnet: 10.0.1.0/24 (default /24 — the exhaustion vector)
  • docker network ls | grep proxy → ab54qfa7gsk5 proxy overlay swarm
  • nix/modules/swarm.nix: swarm-init creates proxy without --subnet, inheriting Docker's default /24. No explicit subnet configured.
  • Builder has not started pvfix work yet (no STATUS-pvfix.md in repo).

The fix is needed. Watching for Builder M1 claim (patch + procedure + live inspection proof).
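
For scale, the address arithmetic behind the exhaustion vector can be sketched as below. This is a rough upper bound only: Docker reserves some addresses for itself, and how many addresses each service actually consumes on the overlay was not measured in this review. usable_hosts is a hypothetical helper, not something in the repo.

```shell
# usable_hosts PREFIX_LEN
# Hypothetical helper: number of host addresses in an IPv4 prefix,
# excluding the network and broadcast addresses.
usable_hosts() {
  echo $(( (1 << (32 - $1)) - 2 ))
}

usable_hosts 24   # prints 254   (current proxy network)
usable_hosts 16   # prints 65534 (proposed 10.10.0.0/16)
```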

Break-it probe: live host subnet collision check (2026-06-13T05:31Z)

Existing subnets on host:

  • ingress: 10.0.0.0/24
  • proxy (current): 10.0.1.0/24
  • docker0: 172.17.0.0/16
  • docker_gwbridge: 172.18.0.0/16
  • Host IP: 91.98.47.73 (public), 100.95.31.88 (tailscale), gateway 172.31.1.1

10.10.0.0/16 (proposed): does NOT collide with any existing subnet. Safe.
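
The collision check above was done by inspection. A minimal sketch of the same check in POSIX shell follows, assuming IPv4 CIDR strings without leading zeros in the octets. The helper names (ip_to_int, cidr_range, cidrs_overlap) are hypothetical; the subnets are the ones listed above.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1            # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Print "first last" integer bounds of a CIDR block such as 10.10.0.0/16.
cidr_range() {
  addr=${1%/*}
  bits=${1#*/}
  base=$(ip_to_int "$addr")
  size=$(( 1 << (32 - $bits) ))
  first=$(( $base / $size * $size ))   # align to the network boundary
  echo "$first $(( $first + $size - 1 ))"
}

# Return 0 (true) if the two CIDR blocks share at least one address.
cidrs_overlap() {
  set -- $(cidr_range "$1") $(cidr_range "$2")
  # $1..$2 bound the first block, $3..$4 bound the second.
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

candidate=10.10.0.0/16
for net in 10.0.0.0/24 10.0.1.0/24 172.17.0.0/16 172.18.0.0/16; do
  if cidrs_overlap "$candidate" "$net"; then
    echo "COLLISION: $candidate overlaps $net"
  else
    echo "ok: $candidate is clear of $net"
  fi
done
```

Two ranges overlap exactly when each one starts no later than the other ends, which is all cidrs_overlap tests. The public, tailscale and gateway addresses are single hosts outside 10.10.0.0/16 and are not part of the loop.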

Services currently on proxy (will be disrupted during recreation):

  • traefik → 10.0.1.9
  • ccci-reports → 10.0.1.7
  • drone → 10.0.1.12
  • ccci-bridge → 10.0.1.248
  • ccci-dashboard → 10.0.1.249
  • warm-keycloak → 10.0.1.251

Stacks currently running (all will briefly lose routing): backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik, warm-keycloak

Maintenance window status: CLEAR — no active recipe test stacks (*-pr*), no cfold sweep, no /upgrade-all visible. A quiet window is available now.

Key risk to probe when M2 is claimed: confirm that after proxy recreation, all 6 services above rejoin with healthy VIP allocations and Traefik routes are reachable end-to-end.


M1: PASS @2026-06-13T05:33Z

Claim: nix/modules/swarm.nix patched with --subnet 10.10.0.0/16; maintenance procedure documented; chosen /16 proven safe from live host inspection.
Commit: e6349a9 (claim(pvfix-M1): proxy /16 patch + maintenance plan ready)

Cold-run evidence

1. Patch in repo:

grep -n 'subnet' nix/modules/swarm.nix
→ 47: docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy

Correct. The if ! docker network inspect proxy guard ensures idempotent create. Comment accurately names the failure mode and runbook. ✓
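
The shape of that guard, as a sketch reconstructed from the grep output and the guard condition quoted above (the function wrapper ensure_proxy_network is hypothetical, and the real unit script may differ in surrounding detail):

```shell
# Create the proxy overlay only if it does not already exist.
ensure_proxy_network() {
  if ! docker network inspect proxy >/dev/null 2>&1; then
    docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
  fi
}
```

The same guard that makes the create idempotent also means an existing /24 proxy network is left untouched, which is why the patch alone is not enough and the maintenance procedure has to remove the network before swarm-init runs again.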

2. Subnet safety — live host inspection:

docker network inspect $(docker network ls -q) --format "{{.Name}}: {{range .IPAM.Config}}{{.Subnet}}{{end}}"
→
backups_ci_commoninternet_net_default: 10.0.4.0/24
bridge:                               172.17.0.0/16
docker_gwbridge:                      172.18.0.0/16
host:                                 (none)
ingress:                              10.0.0.0/24
none:                                 (none)
proxy:                                10.0.1.0/24
traefik_ci_commoninternet_net_internal: 10.0.2.0/24
warm-keycloak_ci_commoninternet_net_internal: 10.0.3.0/24

Builder's table matches exactly. 10.10.0.0/16 is clear of all existing networks. ✓

3. Maintenance procedure review:

  • Service names confirmed correct against live host: deploy-proxy, deploy-drone, deploy-bridge, deploy-dashboard, deploy-reports, warm-keycloak — all exist as active oneshot services. ✓
  • backups stack correctly excluded: backups_ci_commoninternet_net_default (10.0.4.0/24) is NOT on proxy (confirmed via proxy Containers inspection). ✓
  • Step sequencing is safe: stack rm → drain wait → network rm → nixos-rebuild (triggers swarm-init with new --subnet) → restart deploy services. ✓
  • nixos-rebuild will restart swarm-init: swarm-init.service unit script changed (added --subnet flag); nixos-rebuild switch calls daemon-reload + restart for changed units. ✓
  • Note (non-blocking recommendation): Builder may want to add an explicit systemctl restart swarm-init after nixos-rebuild as belt-and-braces insurance (in case daemon-reload timing is unusual). Not required for correctness but eliminates any ambiguity.
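
The step sequence reviewed above can be sketched as a dry-run script. This is an illustration, not the Builder's runbook: the run wrapper, the recreate_proxy function, the retry limit, a bare nixos-rebuild switch invocation and the unit restart order are all assumptions. Stack names are passed as arguments rather than hardcoded, since they should be read from the host.

```shell
# run CMD...: print the command in dry-run mode (the default), else execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# recreate_proxy STACK...: STACK names are the stacks attached to proxy.
recreate_proxy() {
  for stack in "$@"; do
    run docker stack rm "$stack"
  done

  # Drain wait: retry the removal until it succeeds, on the assumption that
  # it fails while endpoints are still attached to the network.
  tries=0
  until run docker network rm proxy; do
    tries=$((tries + 1))
    if [ "$tries" -ge 30 ]; then
      echo "proxy network still in use after $tries attempts" >&2
      return 1
    fi
    sleep 2
  done

  run nixos-rebuild switch            # swarm-init recreates proxy with --subnet
  run systemctl restart swarm-init    # optional belt-and-braces restart
  for unit in deploy-proxy deploy-drone deploy-bridge \
              deploy-dashboard deploy-reports warm-keycloak; do
    run systemctl restart "$unit"
  done
}
```

With DRY_RUN unset the function only prints the commands it would run, so the sequence can be reviewed before setting DRY_RUN=0 on the host.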

M1 PASS — safe to execute the maintenance procedure. Waiting for Builder M2 claim.

M2: PASS @2026-06-13T05:49Z

Claim: proxy recreated as 10.10.0.0/16; nixos-rebuild applied; all services healthy; routes up.
Commits: e6349a9 (patch), 71319d7 (M2 claim)

Cold-run evidence (all 4 acceptance checks + pre-verification probe)

1. Proxy subnet:

ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}} created={{.Created}}"'
→ 10.10.0.0/16 created=2026-06-13 05:38:02.125154677 +0000 UTC

Network recreated at 05:38:02 UTC. ✓

2. All 9 services at 1/1:

backups_ci_commoninternet_net_app  1/1
ccci-bridge_app                    1/1
ccci-dashboard_app                 1/1
ccci-reports_app                   1/1
drone_ci_commoninternet_net_app    1/1
traefik_ci_commoninternet_net_app  1/1
traefik_ci_commoninternet_net_socket-proxy 1/1
warm-keycloak_ci_commoninternet_net_app 1/1
warm-keycloak_ci_commoninternet_net_db  1/1

All 1/1. ✓
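
The replica check was read off the listing by eye. A sketch of an automated version, assuming input lines of the form "NAME RUNNING/DESIRED" (check_replicas is a hypothetical helper):

```shell
# Reads "NAME REPLICAS" lines on stdin (e.g. "ccci-bridge_app 1/1").
# Prints every service whose running count differs from its desired count
# and returns non-zero if there is at least one.
check_replicas() {
  awk '{
         split($2, r, "/")
         if (r[1] != r[2]) { print $1 " " $2; bad = 1 }
       }
       END { exit bad + 0 }'
}

# Usage on a swarm manager:
#   docker service ls --format '{{.Name}} {{.Replicas}}' | check_replicas
```

An empty output with exit status 0 corresponds to the "All 1/1" result above.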

3. swarm-init activation time:

systemctl status swarm-init --no-pager | grep Active
→ Active: active (exited) since Sat 2026-06-13 05:38:17 UTC; 9min ago

Activated at 05:38:17 UTC, 15 s after the proxy network was created (05:38:02 UTC), which is consistent with the rebuilt unit having created it. nixos-rebuild applied the new unit. ✓

4. Core routes:

curl -sI https://ci.commoninternet.net/      → HTTP/2 200
curl -sI https://drone.ci.commoninternet.net/ → HTTP/2 303

✓ Both healthy.
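
The route checks can be made repeatable with a small wrapper around curl. A sketch, assuming the expected status codes are the ones observed above (check_route is a hypothetical helper):

```shell
# check_route URL EXPECTED_CODE
# Sends a HEAD request without following redirects and compares the status code.
check_route() {
  code=$(curl -s -I -o /dev/null -w '%{http_code}' "$1")
  if [ "$code" = "$2" ]; then
    echo "ok $1 $code"
  else
    echo "FAIL $1: got $code, want $2"
    return 1
  fi
}

# Usage:
#   check_route https://ci.commoninternet.net/ 200
#   check_route https://drone.ci.commoninternet.net/ 303
```

Redirects are deliberately not followed, so the 303 from drone is compared as-is rather than being replaced by the status of the login page.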

5. Active swarm-init script has --subnet:

/nix/store/…/swarm-init-start: docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy

nixos-rebuild confirmed applied. ✓

M2 PASS — proxy VIP exhaustion fix is live and durable. See [adversary] finding A1 below (health gate circular dependency, pre-existing, not introduced by pvfix).


Pre-verification probe (2026-06-13T05:45Z — before M2 claimed)

Builder has executed the maintenance; M2 has not been formally claimed yet. Independent host check run while waiting:

  • docker network inspect proxy --format "..." → Subnet: 10.10.0.0/16
  • Container VIPs on proxy: all within 10.10.0.0/16: traefik=10.10.0.2, proxy-endpoint=10.10.0.3, drone=10.10.0.5, warm-keycloak=10.10.0.7, ccci-bridge=10.10.0.9, ccci-dashboard=10.10.0.11, ccci-reports=10.10.0.13 ✓
  • docker service ls → all 9 services at 1/1 REPLICAS ✓
  • systemctl cat swarm-init → active script has --subnet 10.10.0.0/16 (nixos-rebuild applied) ✓
  • https://ci.commoninternet.net → HTTP/2 200
  • https://drone.ci.commoninternet.net → HTTP/2 303 (login redirect = healthy) ✓
  • https://bridge.ci.commoninternet.net → HTTP/2 404 (root path = expected, Traefik routes it) ✓
  • https://report.ci.commoninternet.net → HTTP/2 200