M2 PASS: proxy confirmed 10.10.0.0/16 (created 05:38:02Z), all 9 services 1/1, swarm-init active script has --subnet, ci.commoninternet.net=200, drone.ci.commoninternet.net=303. A1 [adversary]: deploy-proxy health gate (ci.commoninternet.net=200) circular with deploy-dashboard After=deploy-proxy ordering — deadlocks on fresh boot (TimeoutStartSec=900). Pre-existing; pvfix exposed it. Needs fix before D8 pass.
# REVIEW — phase pvfix (Adversary)

Adversary clone: `/srv/cc-ci/cc-ci-adv`
Phase plan: `/srv/cc-ci/cc-ci-plan/plan-phase-pvfix-swarm-proxy.md`

---

## Phase context (initial orientation, 2026-06-13T05:30Z)

Cold check of live host and current repo:
- `docker network inspect proxy` → Subnet: `10.0.1.0/24` (default /24 — the exhaustion vector)
- `docker network ls | grep proxy` → `ab54qfa7gsk5 proxy overlay swarm`
- `nix/modules/swarm.nix` → `swarm-init` creates proxy without `--subnet`, inheriting Docker's
  default `/24`. No explicit subnet configured.
- Builder has not started pvfix work yet (no STATUS-pvfix.md in repo).

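
To make the exhaustion vector concrete, the short Python sketch below compares the address capacity of the default `/24` with the proposed `/16`. It uses only the standard-library `ipaddress` module. `usable_hosts` is a hypothetical helper name, and the figure is a simple upper bound: it ignores any addresses Docker reserves internally on an overlay network.

```python
import ipaddress

def usable_hosts(cidr: str) -> int:
    """Upper bound on assignable addresses: total minus network and broadcast."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 2

# Current proxy subnet versus the proposed replacement.
for cidr in ("10.0.1.0/24", "10.10.0.0/16"):
    print(f"{cidr}: {usable_hosts(cidr)} usable addresses")
```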
The fix is needed. Watching for Builder M1 claim (patch + procedure + live inspection proof).

### Break-it probe: live host subnet collision check (2026-06-13T05:31Z)

Existing subnets on host:
- `ingress`: `10.0.0.0/24`
- `proxy` (current): `10.0.1.0/24`
- `docker0`: `172.17.0.0/16`
- `docker_gwbridge`: `172.18.0.0/16`
- Host IP: `91.98.47.73` (public), `100.95.31.88` (tailscale), gateway `172.31.1.1`

**10.10.0.0/16 (proposed):** does NOT collide with any existing subnet. Safe.

Services currently on proxy (will be disrupted during recreation):
- `traefik` → 10.0.1.9
- `ccci-reports` → 10.0.1.7
- `drone` → 10.0.1.12
- `ccci-bridge` → 10.0.1.248
- `ccci-dashboard` → 10.0.1.249
- `warm-keycloak` → 10.0.1.251

Stacks currently running (all will briefly lose routing):
`backups`, `ccci-bridge`, `ccci-dashboard`, `ccci-reports`, `drone`, `traefik`, `warm-keycloak`

**Maintenance window status:** CLEAR — no active recipe test stacks (`*-pr*`), no cfold sweep,
no /upgrade-all visible. A quiet window is available now.

**Key risk to probe when M2 is claimed:** confirm that after proxy recreation, all 6 services
above rejoin with healthy VIP allocations and Traefik routes are reachable end-to-end.

---

## M1: PASS @2026-06-13T05:33Z

**Claim:** `nix/modules/swarm.nix` patched with `--subnet 10.10.0.0/16`; maintenance procedure
documented; chosen /16 proven safe from live host inspection.
**Commit:** `e6349a9` (`claim(pvfix-M1): proxy /16 patch + maintenance plan ready`)

### Cold-run evidence

**1. Patch in repo:**
```
grep -n 'subnet' nix/modules/swarm.nix
→ 47: docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
Correct. The `if ! docker network inspect proxy` guard ensures idempotent create. Comment
accurately names the failure mode and runbook. ✓

**2. Subnet safety — live host inspection:**
```
docker network inspect $(docker network ls -q) --format "{{.Name}}: {{range .IPAM.Config}}{{.Subnet}}{{end}}"
→
backups_ci_commoninternet_net_default: 10.0.4.0/24
bridge: 172.17.0.0/16
docker_gwbridge: 172.18.0.0/16
host: (none)
ingress: 10.0.0.0/24
none: (none)
proxy: 10.0.1.0/24
traefik_ci_commoninternet_net_internal: 10.0.2.0/24
warm-keycloak_ci_commoninternet_net_internal: 10.0.3.0/24
```
Builder's table matches exactly. `10.10.0.0/16` is clear of all existing networks. ✓

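
The "clear of all existing networks" check can be reproduced mechanically. The sketch below parses the `name: subnet` lines from the inspection output and reports any network that overlaps a candidate CIDR. `find_collisions` and `INSPECT_OUTPUT` are hypothetical names introduced for illustration; the data is copied from the transcript above.

```python
import ipaddress

# Transcript lines copied from the inspection output ("name: subnet").
INSPECT_OUTPUT = """\
backups_ci_commoninternet_net_default: 10.0.4.0/24
bridge: 172.17.0.0/16
docker_gwbridge: 172.18.0.0/16
host: (none)
ingress: 10.0.0.0/24
none: (none)
proxy: 10.0.1.0/24
traefik_ci_commoninternet_net_internal: 10.0.2.0/24
warm-keycloak_ci_commoninternet_net_internal: 10.0.3.0/24
"""

def find_collisions(candidate: str, inspect_output: str) -> list[str]:
    """Return names of networks whose subnet overlaps the candidate CIDR."""
    wanted = ipaddress.ip_network(candidate)
    collisions = []
    for line in inspect_output.splitlines():
        name, _, subnet = line.partition(": ")
        try:
            existing = ipaddress.ip_network(subnet.strip())
        except ValueError:
            continue  # entries such as "(none)" carry no subnet
        if wanted.overlaps(existing):
            collisions.append(name)
    return collisions

# An empty list means the candidate is free of overlaps.
print(find_collisions("10.10.0.0/16", INSPECT_OUTPUT))
```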
**3. Maintenance procedure review:**
- **Service names confirmed correct** against live host:
  `deploy-proxy`, `deploy-drone`, `deploy-bridge`, `deploy-dashboard`, `deploy-reports`,
  `warm-keycloak` — all exist as active oneshot services. ✓
- **backups stack correctly excluded** — `backups_ci_commoninternet_net_default` (10.0.4.0/24)
  is NOT on `proxy` (confirmed via proxy Containers inspection). ✓
- **Step sequencing is safe:** stack rm → drain wait → network rm → nixos-rebuild (triggers
  swarm-init with new --subnet) → restart deploy services. ✓
- **nixos-rebuild will restart swarm-init:** `swarm-init.service` unit script changed (added
  --subnet flag); nixos-rebuild switch calls daemon-reload + restart for changed units. ✓
- **Note (non-blocking recommendation):** Builder may want to add an explicit
  `systemctl restart swarm-init` after nixos-rebuild as belt-and-braces insurance (in case
  daemon-reload timing is unusual). Not required for correctness but eliminates any ambiguity.

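
The step sequencing reviewed above can be sketched as an ordered command plan. The Python below only builds and prints the commands; it executes nothing. The stack names passed in, the drain-wait polling command, and the restart order are assumptions for illustration, not the Builder's documented procedure.

```python
def maintenance_plan(stacks: list[str], units: list[str], network: str = "proxy") -> list[str]:
    """Ordered commands: stack rm -> drain wait -> network rm -> rebuild -> restarts."""
    plan = [f"docker stack rm {stack}" for stack in stacks]
    # Drain wait (assumed form): poll until this prints 0 attached containers.
    plan.append(f"docker network inspect {network} --format '{{{{len .Containers}}}}'")
    plan.append(f"docker network rm {network}")
    # The rebuild restarts swarm-init, which recreates the network with --subnet.
    plan.append("nixos-rebuild switch")
    plan.extend(f"systemctl restart {unit}" for unit in units)
    return plan

# Placeholder stack names; the unit names are the ones confirmed in the review.
STACKS = ["example-stack-a", "example-stack-b"]
UNITS = ["deploy-proxy", "deploy-drone", "deploy-bridge",
         "deploy-dashboard", "deploy-reports", "warm-keycloak"]

for step in maintenance_plan(STACKS, UNITS):
    print(step)
```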
**M1 PASS — safe to execute the maintenance procedure.** Waiting for Builder M2 claim.

## M2: PASS @2026-06-13T05:49Z

**Claim:** proxy recreated as 10.10.0.0/16; nixos-rebuild applied; all services healthy; routes up.
**Commits:** `e6349a9` (patch), `71319d7` (M2 claim)

### Cold-run evidence (all 4 acceptance checks + pre-verification probe)

**1. Proxy subnet:**
```
ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}} created={{.Created}}"'
→ 10.10.0.0/16 created=2026-06-13 05:38:02.125154677 +0000 UTC
```
Network recreated at 05:38:02 UTC. ✓

**2. All 9 services at 1/1:**
```
backups_ci_commoninternet_net_app 1/1
ccci-bridge_app 1/1
ccci-dashboard_app 1/1
ccci-reports_app 1/1
drone_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_socket-proxy 1/1
warm-keycloak_ci_commoninternet_net_app 1/1
warm-keycloak_ci_commoninternet_net_db 1/1
```
All 1/1. ✓

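
The replica check is easy to script rather than eyeball. The sketch below takes `name replicas` lines in the shape shown above and returns any service whose running count differs from its desired count. `unconverged` is a hypothetical helper; the input format is assumed to match the listing above, one service per line.

```python
def unconverged(service_lines: str) -> list[str]:
    """Return service names whose running replica count differs from the desired count."""
    bad = []
    for line in service_lines.splitlines():
        if not line.strip():
            continue
        # Only the first two fields matter: service name and "running/desired".
        name, replicas = line.split()[:2]
        running, desired = replicas.split("/")
        if int(running) != int(desired):
            bad.append(name)
    return bad

SERVICES = """\
backups_ci_commoninternet_net_app 1/1
ccci-bridge_app 1/1
ccci-dashboard_app 1/1
ccci-reports_app 1/1
drone_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_socket-proxy 1/1
warm-keycloak_ci_commoninternet_net_app 1/1
warm-keycloak_ci_commoninternet_net_db 1/1
"""

# An empty list means every listed service has converged.
print(unconverged(SERVICES))
```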
**3. swarm-init activation time:**
```
systemctl status swarm-init --no-pager | grep Active
→ Active: active (exited) since Sat 2026-06-13 05:38:17 UTC; 9min ago
```
Activated 05:38:17 UTC, about 15 s after the proxy creation timestamp (05:38:02 UTC), which is
consistent with this swarm-init run having created the network. nixos-rebuild applied new unit. ✓

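
The gap between the two timestamps can be computed rather than estimated. The sketch below parses the `Created` value reported by Docker and subtracts it from the systemd activation time. `parse_docker_created` is a hypothetical helper; it truncates nanoseconds to microseconds and assumes the stamp is in UTC, as the `+0000 UTC` suffix indicates.

```python
from datetime import datetime, timezone

def parse_docker_created(stamp: str) -> datetime:
    """Parse '2026-06-13 05:38:02.125154677 +0000 UTC', truncating to microseconds."""
    date_part, time_part = stamp.split()[:2]
    whole, _, frac = time_part.partition(".")
    micros = (frac + "000000")[:6]  # pad or truncate the fraction to 6 digits
    parsed = datetime.strptime(f"{date_part} {whole}.{micros}", "%Y-%m-%d %H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)  # assumption: stamp is UTC

created = parse_docker_created("2026-06-13 05:38:02.125154677 +0000 UTC")
activated = datetime(2026, 6, 13, 5, 38, 17, tzinfo=timezone.utc)
gap = (activated - created).total_seconds()
print(f"swarm-init became active {gap:.1f}s after proxy was created")
```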
**4. Core routes:**
```
curl -sI https://ci.commoninternet.net/ → HTTP/2 200
curl -sI https://drone.ci.commoninternet.net/ → HTTP/2 303
```
✓ Both healthy.

**5. Active swarm-init script has --subnet:**
```
/nix/store/…/swarm-init-start: docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
nixos-rebuild confirmed applied. ✓

**M2 PASS — proxy VIP exhaustion fix is live and durable.**
See [adversary] finding A1 below (health gate circular dependency, pre-existing, not introduced by pvfix).

---

## Pre-verification probe (2026-06-13T05:45Z — before M2 claimed)

Builder has executed the maintenance; M2 has not been formally claimed yet.
Independent host check run while waiting:

- `docker network inspect proxy --format "..."` → **Subnet: 10.10.0.0/16** ✓
- Container VIPs on proxy: all within `10.10.0.0/16`:
  traefik=10.10.0.2, proxy-endpoint=10.10.0.3, drone=10.10.0.5,
  warm-keycloak=10.10.0.7, ccci-bridge=10.10.0.9, ccci-dashboard=10.10.0.11,
  ccci-reports=10.10.0.13 ✓
- `docker service ls` → all 9 services at 1/1 REPLICAS ✓
- `systemctl cat swarm-init` → active script has `--subnet 10.10.0.0/16` (nixos-rebuild applied) ✓
- `https://ci.commoninternet.net` → **HTTP/2 200** ✓
- `https://drone.ci.commoninternet.net` → **HTTP/2 303** (login redirect = healthy) ✓
- `https://bridge.ci.commoninternet.net` → **HTTP/2 404** (root path = expected, Traefik routes it) ✓
- `https://report.ci.commoninternet.net` → **HTTP/2 200** ✓

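
Because "healthy" means a different status code per host (a 303 or a 404 can be the correct answer), the route checks above are easier to repeat with the expectations written down. The sketch below compares a `curl -sI` status line against a per-host expected code. `EXPECTED`, `status_code`, and `route_ok` are hypothetical names; the codes are the ones observed in this probe, and no network request is made.

```python
# Expected status per host, taken from the probe results above.
EXPECTED = {
    "ci.commoninternet.net": 200,
    "drone.ci.commoninternet.net": 303,   # login redirect
    "bridge.ci.commoninternet.net": 404,  # 404 at the root path is expected per the probe notes
    "report.ci.commoninternet.net": 200,
}

def status_code(status_line: str) -> int:
    """Extract the numeric code from a status line such as 'HTTP/2 200'."""
    return int(status_line.split()[1])

def route_ok(host: str, status_line: str) -> bool:
    """True when the observed status matches the expectation for this host."""
    return status_code(status_line) == EXPECTED[host]

observed = {
    "ci.commoninternet.net": "HTTP/2 200",
    "drone.ci.commoninternet.net": "HTTP/2 303",
    "bridge.ci.commoninternet.net": "HTTP/2 404",
    "report.ci.commoninternet.net": "HTTP/2 200",
}
for host, line in observed.items():
    print(host, "ok" if route_ok(host, line) else "UNEXPECTED")
```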