diff --git a/machine-docs/REVIEW-pvfix.md b/machine-docs/REVIEW-pvfix.md
index 8234f00..9f9cb85 100644
--- a/machine-docs/REVIEW-pvfix.md
+++ b/machine-docs/REVIEW-pvfix.md
@@ -16,6 +16,80 @@ Cold check of live host and current repo:
 
 The fix is needed. Watching for Builder M1 claim (patch + procedure + live inspection proof).
 
+### Break-it probe: live host subnet collision check (2026-06-13T05:31Z)
+
+Existing subnets on host:
+- `ingress`: `10.0.0.0/24`
+- `proxy` (current): `10.0.1.0/24`
+- `docker0`: `172.17.0.0/16`
+- `docker_gwbridge`: `172.18.0.0/16`
+- Host IP: `91.98.47.73` (public), `100.95.31.88` (tailscale), gateway `172.31.1.1`
+
+**10.10.0.0/16 (proposed):** does NOT collide with any existing subnet. Safe.
+
+Services currently on proxy (will be disrupted during recreation):
+- `traefik` → 10.0.1.9
+- `ccci-reports` → 10.0.1.7
+- `drone` → 10.0.1.12
+- `ccci-bridge` → 10.0.1.248
+- `ccci-dashboard` → 10.0.1.249
+- `warm-keycloak` → 10.0.1.251
+
+Stacks currently running (all will briefly lose routing):
+`backups`, `ccci-bridge`, `ccci-dashboard`, `ccci-reports`, `drone`, `traefik`, `warm-keycloak`
+
+**Maintenance window status:** CLEAR — no active recipe test stacks (`*-pr*`), no cfold sweep,
+no /upgrade-all visible. A quiet window is available now.
+
+**Key risk to probe when M2 is claimed:** confirm that after proxy recreation, all 6 services
+above rejoin with healthy VIP allocations and Traefik routes are reachable end-to-end.
+
 ---
-
+## M1: PASS @2026-06-13T05:33Z
+
+**Claim:** `nix/modules/swarm.nix` patched with `--subnet 10.10.0.0/16`; maintenance procedure
+documented; chosen /16 proven safe from live host inspection.
+**Commit:** `e6349a9` (`claim(pvfix-M1): proxy /16 patch + maintenance plan ready`)
+
+### Cold-run evidence
+
+**1. Patch in repo:**
+```
+grep -n 'subnet' nix/modules/swarm.nix
+→ 47: docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
+```
+Correct. The `if ! docker network inspect proxy` guard ensures idempotent create. Comment
+accurately names the failure mode and runbook. ✓
+
+**2. Subnet safety — live host inspection:**
+```
+docker network inspect $(docker network ls -q) --format "{{.Name}}: {{range .IPAM.Config}}{{.Subnet}}{{end}}"
+→
+backups_ci_commoninternet_net_default: 10.0.4.0/24
+bridge: 172.17.0.0/16
+docker_gwbridge: 172.18.0.0/16
+host: (none)
+ingress: 10.0.0.0/24
+none: (none)
+proxy: 10.0.1.0/24
+traefik_ci_commoninternet_net_internal: 10.0.2.0/24
+warm-keycloak_ci_commoninternet_net_internal: 10.0.3.0/24
+```
+Builder's table matches exactly. `10.10.0.0/16` is clear of all existing networks. ✓
+
+**3. Maintenance procedure review:**
+- **Service names confirmed correct** against live host:
+  `deploy-proxy`, `deploy-drone`, `deploy-bridge`, `deploy-dashboard`, `deploy-reports`,
+  `warm-keycloak` — all exist as active oneshot services. ✓
+- **backups stack correctly excluded** — `backups_ci_commoninternet_net_default` (10.0.4.0/24)
+  is NOT on `proxy` (confirmed via proxy Containers inspection). ✓
+- **Step sequencing is safe:** stack rm → drain wait → network rm → nixos-rebuild (triggers
+  swarm-init with new --subnet) → restart deploy services. ✓
+- **nixos-rebuild will restart swarm-init:** `swarm-init.service` unit script changed (added
+  --subnet flag); nixos-rebuild switch calls daemon-reload + restart for changed units. ✓
+- **Note (non-blocking recommendation):** Builder may want to add an explicit
+  `systemctl restart swarm-init` after nixos-rebuild as belt-and-braces insurance (in case
+  daemon-reload timing is unusual). Not required for correctness but eliminates any ambiguity.
+
+**M1 PASS — safe to execute the maintenance procedure.** Waiting for Builder M2 claim.
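
Reviewer aid (not part of the patch): the subnet safety check in step 2 can be reproduced offline with a few lines of Python. This is a minimal sketch, assuming the subnets and host addresses are copied by hand from the inspection output quoted above. The names `EXISTING_SUBNETS`, `HOST_ADDRESSES`, and `find_collisions` are hypothetical and do not exist in the repo.

```python
import ipaddress

# Copied by hand from the `docker network inspect` output in the review.
# `host` and `none` are omitted because they report no subnet.
EXISTING_SUBNETS = {
    "backups_ci_commoninternet_net_default": "10.0.4.0/24",
    "bridge": "172.17.0.0/16",
    "docker_gwbridge": "172.18.0.0/16",
    "ingress": "10.0.0.0/24",
    "proxy": "10.0.1.0/24",
    "traefik_ci_commoninternet_net_internal": "10.0.2.0/24",
    "warm-keycloak_ci_commoninternet_net_internal": "10.0.3.0/24",
}

# Public IP, tailscale IP, and gateway as listed in the break-it probe.
HOST_ADDRESSES = ["91.98.47.73", "100.95.31.88", "172.31.1.1"]


def find_collisions(proposed, existing, host_addresses=()):
    """Return network names and host addresses that overlap `proposed`."""
    candidate = ipaddress.ip_network(proposed)
    # A network collides if its CIDR range shares any address with the candidate.
    hits = [
        name
        for name, cidr in existing.items()
        if candidate.overlaps(ipaddress.ip_network(cidr))
    ]
    # A host address collides if it falls inside the candidate range.
    hits += [
        addr
        for addr in host_addresses
        if ipaddress.ip_address(addr) in candidate
    ]
    return hits


if __name__ == "__main__":
    # The proposed /16 from the M1 claim.
    print(find_collisions("10.10.0.0/16", EXISTING_SUBNETS, HOST_ADDRESSES))
    # A deliberately bad choice, to show what a collision report looks like.
    print(find_collisions("10.0.0.0/16", EXISTING_SUBNETS, HOST_ADDRESSES))
```

An empty list for `10.10.0.0/16` corresponds to the "clear of all existing networks" verdict. The second call uses a range that covers the current `10.0.x.0/24` overlays, so it reports them by name.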