cc-ci/machine-docs/REVIEW-pvfix.md
autonomic-bot ccd93da65c
review(pvfix-M2): M2 PASS + [adversary] A1 health gate deadlock
M2 PASS: proxy confirmed 10.10.0.0/16 (created 05:38:02Z), all 9 services 1/1,
swarm-init active script has --subnet, ci.commoninternet.net=200,
drone.ci.commoninternet.net=303.

A1 [adversary]: deploy-proxy health gate (ci.commoninternet.net=200) circular
with deploy-dashboard After=deploy-proxy ordering — deadlocks on fresh boot
(TimeoutStartSec=900). Pre-existing; pvfix exposed it. Needs fix before D8 pass.
2026-06-13 05:50:22 +00:00


# REVIEW — phase pvfix (Adversary)
Adversary clone: `/srv/cc-ci/cc-ci-adv`
Phase plan: `/srv/cc-ci/cc-ci-plan/plan-phase-pvfix-swarm-proxy.md`
---
## Phase context (initial orientation, 2026-06-13T05:30Z)
Cold check of live host and current repo:
- `docker network inspect proxy` → Subnet: `10.0.1.0/24` (default /24 — the exhaustion vector)
- `docker network ls | grep proxy` → `ab54qfa7gsk5 proxy overlay swarm`
- `nix/modules/swarm.nix` → `swarm-init` creates proxy without `--subnet`, inheriting Docker's
default `/24`. No explicit subnet configured.
- Builder has not started pvfix work yet (no STATUS-pvfix.md in repo).
The fix is needed. Watching for Builder M1 claim (patch + procedure + live inspection proof).
### Break-it probe: live host subnet collision check (2026-06-13T05:31Z)
Existing subnets on host:
- `ingress`: `10.0.0.0/24`
- `proxy` (current): `10.0.1.0/24`
- `docker0`: `172.17.0.0/16`
- `docker_gwbridge`: `172.18.0.0/16`
- Host IP: `91.98.47.73` (public), `100.95.31.88` (tailscale), gateway `172.31.1.1`
**10.10.0.0/16 (proposed):** does NOT collide with any existing subnet. Safe.
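
The collision check above can be reproduced mechanically. The following is a minimal sketch in plain bash arithmetic, not the tooling actually used for this review; the function names `ip_to_int`, `cidr_range` and `cidrs_overlap` are hypothetical, and the subnet list is copied from the host inspection above.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<<"$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Print "first last" (as integers) for a CIDR block such as 10.10.0.0/16.
cidr_range() {
  local base=${1%/*} bits=${1#*/}
  local ip mask first
  ip=$(ip_to_int "$base")
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  first=$(( ip & mask ))
  echo "$first $(( first | (~mask & 0xFFFFFFFF) ))"
}

# Succeed (exit 0) when the two CIDR blocks share at least one address.
cidrs_overlap() {
  local a_first a_last b_first b_last
  read -r a_first a_last <<<"$(cidr_range "$1")"
  read -r b_first b_last <<<"$(cidr_range "$2")"
  (( a_first <= b_last && b_first <= a_last ))
}

candidate=10.10.0.0/16
for existing in 10.0.0.0/24 10.0.1.0/24 172.17.0.0/16 172.18.0.0/16; do
  if cidrs_overlap "$candidate" "$existing"; then
    echo "COLLISION: $candidate overlaps $existing"
  else
    echo "ok: $candidate is clear of $existing"
  fi
done
```

Two blocks overlap exactly when each one starts at or before the other ends, so a single interval comparison suffices and no per-address enumeration is needed.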
Services currently on proxy (will be disrupted during recreation):
- `traefik` → 10.0.1.9
- `ccci-reports` → 10.0.1.7
- `drone` → 10.0.1.12
- `ccci-bridge` → 10.0.1.248
- `ccci-dashboard` → 10.0.1.249
- `warm-keycloak` → 10.0.1.251
Stacks currently running (all will briefly lose routing):
`backups`, `ccci-bridge`, `ccci-dashboard`, `ccci-reports`, `drone`, `traefik`, `warm-keycloak`
**Maintenance window status:** CLEAR — no active recipe test stacks (`*-pr*`), no cfold sweep,
no /upgrade-all visible. A quiet window is available now.
**Key risk to probe when M2 is claimed:** confirm that after proxy recreation, all 6 services
above rejoin with healthy VIP allocations and Traefik routes are reachable end-to-end.
---
## M1: PASS @2026-06-13T05:33Z
**Claim:** `nix/modules/swarm.nix` patched with `--subnet 10.10.0.0/16`; maintenance procedure
documented; chosen /16 proven safe from live host inspection.
**Commit:** `e6349a9` (`claim(pvfix-M1): proxy /16 patch + maintenance plan ready`)
### Cold-run evidence
**1. Patch in repo:**
```
grep -n 'subnet' nix/modules/swarm.nix
→ 47: docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
Correct. The `if ! docker network inspect proxy` guard ensures idempotent create. Comment
accurately names the failure mode and runbook. ✓
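
For illustration, the guard pattern referred to above looks roughly like this. It is a sketch of the idiom, not a copy of the `swarm-init` script; the wrapper function `ensure_overlay_network` is a hypothetical name, while the `docker network create` flags are the ones shown in the grep output.

```shell
# Create the overlay network only when it does not already exist.
# Arguments: network name, subnet in CIDR notation.
ensure_overlay_network() {
  local name=$1 subnet=$2
  if ! docker network inspect "$name" >/dev/null 2>&1; then
    docker network create --driver overlay --attachable --subnet "$subnet" "$name"
  fi
}

# Example call (not executed here):
#   ensure_overlay_network proxy 10.10.0.0/16
```

One consequence of this idiom is that an existing network is left untouched whatever its subnet, so an old `/24` proxy has to be removed before the guard will create the `/16`.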
**2. Subnet safety — live host inspection:**
```
docker network inspect $(docker network ls -q) --format "{{.Name}}: {{range .IPAM.Config}}{{.Subnet}}{{end}}"
backups_ci_commoninternet_net_default: 10.0.4.0/24
bridge: 172.17.0.0/16
docker_gwbridge: 172.18.0.0/16
host: (none)
ingress: 10.0.0.0/24
none: (none)
proxy: 10.0.1.0/24
traefik_ci_commoninternet_net_internal: 10.0.2.0/24
warm-keycloak_ci_commoninternet_net_internal: 10.0.3.0/24
```
Builder's table matches exactly. `10.10.0.0/16` is clear of all existing networks. ✓
**3. Maintenance procedure review:**
- **Service names confirmed correct** against live host:
`deploy-proxy`, `deploy-drone`, `deploy-bridge`, `deploy-dashboard`, `deploy-reports`,
`warm-keycloak` — all exist as active oneshot services. ✓
- **backups stack correctly excluded** — `backups_ci_commoninternet_net_default` (10.0.4.0/24)
is NOT on `proxy` (confirmed via proxy Containers inspection). ✓
- **Step sequencing is safe:** stack rm → drain wait → network rm → nixos-rebuild (triggers
swarm-init with new --subnet) → restart deploy services. ✓
- **nixos-rebuild will restart swarm-init:** `swarm-init.service` unit script changed (added
--subnet flag); nixos-rebuild switch calls daemon-reload + restart for changed units. ✓
- **Note (non-blocking recommendation):** Builder may want to add an explicit
`systemctl restart swarm-init` after nixos-rebuild as belt-and-braces insurance (in case
daemon-reload timing is unusual). Not required for correctness but eliminates any ambiguity.
**M1 PASS — safe to execute the maintenance procedure.** Waiting for Builder M2 claim.
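
The step ordering reviewed above can be written out as a plan printer. This is only a sketch of the sequence as summarised in this review, not the Builder's documented procedure: `plan_network_recreate` is a hypothetical helper, the stack names in the example call are placeholders, and any extra `nixos-rebuild` arguments the host needs are not shown. It prints commands and never executes them.

```shell
# Print (never execute) the command sequence for recreating an overlay network.
# Usage: plan_network_recreate NET "STACK ..." "UNIT ..."
plan_network_recreate() {
  local net=$1 stacks=$2 units=$3 s u
  for s in $stacks; do
    echo "docker stack rm $s"
  done
  echo "# wait until no containers remain attached to $net"
  echo "docker network rm $net"
  echo "nixos-rebuild switch"
  for u in $units; do
    echo "systemctl restart $u"
  done
}

# Placeholder stack names; unit names are the ones confirmed on the live host.
plan_network_recreate proxy \
  "example-stack-a example-stack-b" \
  "deploy-proxy deploy-drone deploy-bridge deploy-dashboard deploy-reports warm-keycloak"
```

Printing the plan first makes the ordering reviewable: the network removal must come after the drain wait, and the rebuild must come after the removal so that the creation guard actually fires.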
## M2: PASS @2026-06-13T05:49Z
**Claim:** proxy recreated as 10.10.0.0/16; nixos-rebuild applied; all services healthy; routes up.
**Commits:** `e6349a9` (patch), `71319d7` (M2 claim)
### Cold-run evidence (all 4 acceptance checks + pre-verification probe)
**1. Proxy subnet:**
```
ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}} created={{.Created}}"'
→ 10.10.0.0/16 created=2026-06-13 05:38:02.125154677 +0000 UTC
```
Network recreated at 05:38:02 UTC. ✓
**2. All 9 services at 1/1:**
```
backups_ci_commoninternet_net_app 1/1
ccci-bridge_app 1/1
ccci-dashboard_app 1/1
ccci-reports_app 1/1
drone_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_socket-proxy 1/1
warm-keycloak_ci_commoninternet_net_app 1/1
warm-keycloak_ci_commoninternet_net_db 1/1
```
All 1/1. ✓
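
A replica listing like the one above can be checked without reading it by eye. The sketch below assumes input lines of the form `NAME REPLICAS`, as produced by `docker service ls --format "{{.Name}} {{.Replicas}}"`; the function name `degraded_services` is hypothetical.

```shell
# Read "NAME REPLICAS" lines on stdin and print every service whose running
# count differs from its desired count. Exit status is 1 if any are found.
degraded_services() {
  awk '
    {
      split($2, r, "/")            # "1/1" -> r[1]=running, r[2]=desired
      if (r[1] != r[2]) { print $1; bad = 1 }
    }
    END { exit bad }
  '
}

# On the host (not executed here):
#   docker service ls --format "{{.Name}} {{.Replicas}}" | degraded_services
```

Empty output with exit status 0 corresponds to the "All 1/1" result recorded here.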
**3. swarm-init activation time:**
```
systemctl status swarm-init --no-pager | grep Active
→ Active: active (exited) since Sat 2026-06-13 05:38:17 UTC; 9min ago
```
Activated 05:38:17 UTC, 15 s after the proxy creation timestamp (05:38:02 UTC), which is consistent with this start of swarm-init having created the network. nixos-rebuild applied new unit. ✓
**4. Core routes:**
```
curl -sI https://ci.commoninternet.net/ → HTTP/2 200
curl -sI https://drone.ci.commoninternet.net/ → HTTP/2 303
```
✓ Both healthy.
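
The route check can be scripted so that each URL is compared against its expected status. This is a sketch under stated assumptions: `check_route` is a hypothetical helper, and it uses curl's `-w '%{http_code}'` write-out instead of parsing the header line shown above.

```shell
# Succeed when URL answers a HEAD request with the expected HTTP status code.
# Usage: check_route URL EXPECTED_CODE
check_route() {
  local url=$1 expected=$2 actual
  actual=$(curl -s -I -o /dev/null -w '%{http_code}' "$url")
  if [ "$actual" = "$expected" ]; then
    echo "ok   $url -> $actual"
  else
    echo "FAIL $url -> $actual (expected $expected)"
    return 1
  fi
}

# Example calls (not executed here):
#   check_route https://ci.commoninternet.net/ 200
#   check_route https://drone.ci.commoninternet.net/ 303
```

Expecting 303 rather than 200 for the drone host matches the login redirect observed in this review.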
**5. Active swarm-init script has --subnet:**
```
/nix/store/…/swarm-init-start: docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
nixos-rebuild confirmed applied. ✓
**M2 PASS — proxy VIP exhaustion fix is live and durable.**
See [adversary] finding A1 below (health gate circular dependency, pre-existing, not introduced by pvfix).
---
## Pre-verification probe (2026-06-13T05:45Z — before M2 claimed)
Builder has executed the maintenance; M2 has not been formally claimed yet.
Independent host check run while waiting:
- `docker network inspect proxy --format "..."` → **Subnet: 10.10.0.0/16**
- Container VIPs on proxy: all inside `10.10.0.0/16`:
traefik=10.10.0.2, proxy-endpoint=10.10.0.3, drone=10.10.0.5,
warm-keycloak=10.10.0.7, ccci-bridge=10.10.0.9, ccci-dashboard=10.10.0.11,
ccci-reports=10.10.0.13 ✓
- `docker service ls` → all 9 services at 1/1 REPLICAS ✓
- `systemctl cat swarm-init` → active script has `--subnet 10.10.0.0/16` (nixos-rebuild applied) ✓
- `https://ci.commoninternet.net` → **HTTP/2 200**
- `https://drone.ci.commoninternet.net` → **HTTP/2 303** (login redirect = healthy) ✓
- `https://bridge.ci.commoninternet.net` → **HTTP/2 404** (root path = expected, Traefik routes it) ✓
- `https://report.ci.commoninternet.net` → **HTTP/2 200**