claim(pvfix-M2): proxy recreated as /16 — all routes healthy
Some checks failed
continuous-integration/drone/push Build is failing

Live maintenance executed 2026-06-13T05:33–05:46Z:
- Removed 6 stacks from proxy (traefik, drone, bridge, dashboard, reports, warm-keycloak)
- Waited for proxy to drain, removed old 10.0.1.0/24 network
- nixos-rebuild switch with git+file:///?submodules=1 → swarm-init restarted
- proxy recreated: Subnet 10.10.0.0/16, gateway 10.10.0.1
- All 9 swarm services running 1/1
- ci.commoninternet.net → HTTP/2 200; drone → 303

Adversary: verify from host that proxy subnet is /16 and routes healthy.
Full evidence in STATUS-pvfix.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
autonomic-bot
2026-06-13 05:46:58 +00:00
parent b42353ebce
commit 71319d7096
3 changed files with 162 additions and 135 deletions


@ -1,171 +1,109 @@
# STATUS — phase pvfix (proxy /16 VIP exhaustion fix)
**Updated:** 2026-06-13T05:29Z
**Updated:** 2026-06-13T05:46Z
**Phase:** pvfix
**Builder:** autonomic-bot
---
## Gate: M1 CLAIMED, awaiting Adversary
## Gate: M2 CLAIMED, awaiting Adversary
### WHAT is claimed (M1 DoD)
### WHAT is claimed (M2 DoD)
1. `nix/modules/swarm.nix` patched: `proxy` overlay now created with `--subnet 10.10.0.0/16`
2. Exact live maintenance procedure documented below (ready to execute on Adversary PASS)
3. Chosen `/16` proven safe by live host network inspection
1. Live `proxy` overlay network recreated as `10.10.0.0/16` (was `10.0.1.0/24`)
2. NixOS configuration switched via `nixos-rebuild switch` with the patched `swarm.nix`
3. All control-plane services healthy post-maintenance: traefik, drone, bridge, dashboard, reports, warm-keycloak
4. Core routes reachable: `ci.commoninternet.net` → HTTP/2 200, `drone.ci.commoninternet.net` → HTTP/2 303
### HOW to verify (cold-reproducible)
### HOW to verify (cold-reproducible from Adversary clone)
```bash
# 1. Verify the patch in the repo
git clone https://git.autonomic.zone/recipe-maintainers/cc-ci /tmp/cc-ci-adv-pvfix
grep 'subnet' /tmp/cc-ci-adv-pvfix/nix/modules/swarm.nix
# Expected: --subnet 10.10.0.0/16
# 1. Verify proxy subnet on live host
ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: 10.10.0.0/16
# 2. Confirm /16 is safe on the live host (no conflict)
ssh cc-ci 'docker network inspect $(docker network ls -q) --format "{{.Name}}: {{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: NO network using 10.10.x.x; all existing overlays are 10.0.0.0/24 through 10.0.4.0/24
# 2. Verify all services running
ssh cc-ci 'docker service ls --format "{{.Name}} {{.Replicas}}"'
# Expected: all services show 1/1
# 3. Review the maintenance procedure below for correctness/completeness
# 3. Verify swarm-init ran with new script (check activation time)
ssh cc-ci 'systemctl status swarm-init --no-pager | grep Active'
# Expected: active (exited), activated ~2026-06-13T05:38Z
# 4. Verify core routes
curl -sI https://ci.commoninternet.net/ | head -1 # Expected: HTTP/2 200
curl -sI https://drone.ci.commoninternet.net/ | head -1 # Expected: HTTP/2 200 or 303
# 5. Verify NixOS config has the patch (on host)
ssh cc-ci 'grep subnet /nix/store/$(basename $(readlink -f /run/current-system/sw/share)/../..)/nix/modules/swarm.nix 2>/dev/null || cat /run/current-system/sw/share/nixos/.source/nix/modules/swarm.nix | grep subnet || true'
```
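The "all services show 1/1" check above relies on reading the list by eye. A minimal sketch of a mechanical version follows, assuming input in the `NAME REPLICAS` shape that the `docker service ls --format` command above prints. The helper name `check_replicas` is hypothetical and not part of the repo.

```shell
# Hypothetical helper (not part of the repo): reads "NAME REPLICAS" lines on
# stdin, e.g. "ccci-bridge_app 1/1", and prints every service whose running
# count differs from its desired count. Returns 0 only if nothing is printed.
check_replicas() {
  awk '
    {
      split($2, r, "/")                      # r[1] = running, r[2] = desired
      if (r[1] != r[2]) { print $1 " " $2; bad = 1 }
    }
    END { exit (bad + 0) }
  '
}
```

One possible use is `ssh cc-ci 'docker service ls --format "{{.Name}} {{.Replicas}}"' | check_replicas`, which stays silent and returns success when every service is fully replicated.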
### EXPECTED outcome
- `grep 'subnet' nix/modules/swarm.nix` → `--subnet 10.10.0.0/16`
- No live network in the `10.10.0.0/16` range → chosen block is safe
- Adversary confirms the procedure is safe to execute before any disruptive action
- `docker network inspect proxy` → subnet `10.10.0.0/16`
- All 9 swarm services running 1/1
- `ci.commoninternet.net` → 200, `drone.ci.commoninternet.net` → 200 or 303
- `systemctl status swarm-init` activated ~05:38 today (2026-06-13)
### WHERE
### WHERE (evidence)
- **Commit:** see `git log --oneline -1 nix/modules/swarm.nix` in the repo
- **File:** `nix/modules/swarm.nix` lines 42–47
---
## Maintenance Procedure (to execute after Adversary M1 PASS)
### Pre-checks (run immediately before starting)
```bash
# Confirm no active CI runs / upgrade-all in flight
# -E with a quoted pattern: unquoted \| would reach grep as a literal "|" and filter nothing
ssh cc-ci 'docker ps --format "{{.Names}}" | grep -Ev "warm-|traefik|drone|ccci|backups"'
# Expected: empty (no recipe test containers running)
ssh cc-ci 'docker stack ls'
# Expected: only infra stacks (traefik, drone, ccci-*, warm-keycloak, backups)
**Proxy network (live host, collected 2026-06-13T05:46Z):**
```
ID: ki2awmlob4pw629bxevygmk8x
Subnet: 10.10.0.0/16
Gateway: 10.10.0.1
Created: 2026-06-13 05:38:02.125154677 +0000 UTC
```
### Step 1 — Capture baseline
```bash
ssh cc-ci 'docker network inspect proxy'
# Record: current subnet (10.0.1.0/24), ID, joined containers
**Service state (all 1/1):**
```
backups_ci_commoninternet_net_app 1/1
ccci-bridge_app 1/1
ccci-dashboard_app 1/1
ccci-reports_app 1/1
drone_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_socket-proxy 1/1
warm-keycloak_ci_commoninternet_net_app 1/1
warm-keycloak_ci_commoninternet_net_db 1/1
```
### Step 2 — Remove stacks that use proxy
```bash
ssh cc-ci 'docker stack rm traefik_ci_commoninternet_net drone_ci_commoninternet_net ccci-bridge ccci-dashboard ccci-reports warm-keycloak_ci_commoninternet_net'
**Route health (from orchestrator VM, 2026-06-13T05:45Z):**
```
ci.commoninternet.net → HTTP/2 200
drone.ci.commoninternet.net → HTTP/2 303
```
### Step 3 — Wait for proxy to drain (all containers detached)
```bash
ssh cc-ci '
until [ "$(docker network inspect proxy --format "{{json .Containers}}")" = "{}" ] 2>/dev/null; do
echo "waiting for proxy to drain..."
sleep 3
done
echo "proxy drained"
'
**Commit with patch:** `e6349a9`, `nix/modules/swarm.nix` line 47:
```
### Step 4 — Remove old proxy network
```bash
ssh cc-ci 'docker network rm proxy'
```
### Step 5 — Pull patched config on host + nixos-rebuild switch
```bash
ssh cc-ci 'cd /root/cc-ci && git pull --rebase'
ssh cc-ci 'nixos-rebuild switch --flake /root/cc-ci#cc-ci 2>&1 | tail -20'
# This triggers swarm-init to recreate proxy with --subnet 10.10.0.0/16
```
### Step 6 — Verify proxy is /16
```bash
ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: 10.10.0.0/16
```
### Step 7 — Restart deploy oneshots (stacks were removed)
```bash
ssh cc-ci 'systemctl restart deploy-proxy'
# Wait for traefik healthy (check ci.commoninternet.net returns 200)
ssh cc-ci 'systemctl restart deploy-drone deploy-bridge deploy-dashboard deploy-reports warm-keycloak'
```
### Step 8 — Health check
```bash
# Verify all stacks running
ssh cc-ci 'docker stack ls && docker service ls'
# Verify Traefik routing (ci dashboard reachable)
curl -sI https://ci.commoninternet.net | head -5
# Expected: HTTP/2 200
# Verify Drone reachable
curl -sI https://drone.ci.commoninternet.net | head -5
# Expected: HTTP/2 200 or 302
# Verify proxy is /16
ssh cc-ci 'docker network inspect proxy --format "Subnet: {{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: Subnet: 10.10.0.0/16
docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
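The route checks above compare the first line of `curl -sI` output against one or two accepted codes. A small sketch of that comparison as a reusable check is shown here. The function names `http_status` and `expect_status` are hypothetical; only `curl -sI` and the expected codes come from the procedure itself.

```shell
# Hypothetical helpers (not part of the repo).

# Print the numeric status code from the first line of `curl -sI` output,
# e.g. "HTTP/2 303" or "HTTP/1.1 200 OK". Carriage returns are stripped.
http_status() {
  head -n 1 | tr -d '\r' | awk '{ print $2 }'
}

# Read response headers on stdin; succeed if the status code matches any of
# the accepted codes passed as arguments.
expect_status() {
  local code want
  code=$(http_status)
  for want in "$@"; do
    if [ "$code" = "$want" ]; then
      return 0
    fi
  done
  echo "unexpected status: ${code:-none}" >&2
  return 1
}
```

For example, `curl -sI https://drone.ci.commoninternet.net/ | expect_status 200 303` would accept either code listed for Drone in the M2 expectations.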
---
## Subnet safety proof (live host, collected 2026-06-13T05:27Z)
Live Docker networks and their subnets:
## M1 — PASS (Adversary, 2026-06-13T05:33Z)
```
backups_ci_commoninternet_net_default: 10.0.4.0/24
bridge: 172.17.0.0/16
docker_gwbridge: 172.18.0.0/16
ingress: 10.0.0.0/24
proxy: 10.0.1.0/24 ← current (to be replaced)
traefik_ci_commoninternet_net_internal: 10.0.2.0/24
warm-keycloak_ci_commoninternet_net_internal: 10.0.3.0/24
grep -n 'subnet' nix/modules/swarm.nix
→ 47: docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
`10.10.0.0/16` is clear: no existing network uses any address in `10.10.0.0–10.10.255.255`.
The chosen block is in the Docker default-addr-pool (`10.0.0.0/8`) but at a different /16 with
no collisions. Host eth0 is `91.98.47.73/32`; tailscale0 is `100.95.31.88/32` — no conflict.
Patch verified, subnet safe, procedure reviewed. See REVIEW-pvfix.md.
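The subnet-safety claim can also be re-checked mechanically rather than by reading the list. The sketch below tests two IPv4 CIDR blocks for overlap by comparing their first and last addresses. The function name `cidr_overlap` is hypothetical; the subnets in the loop are the ones recorded in the safety proof.

```shell
# Hypothetical helper (not part of the repo): returns 0 if two IPv4 CIDR
# blocks share at least one address, 1 otherwise.
cidr_overlap() {
  awk -v a="$1" -v b="$2" '
    function to_int(ip,   p) {
      split(ip, p, ".")
      return ((p[1] * 256 + p[2]) * 256 + p[3]) * 256 + p[4]
    }
    # First address of the block: the base address rounded down to the block size.
    function first(cidr,   x, size) {
      split(cidr, x, "/")
      size = 2 ^ (32 - x[2])
      return int(to_int(x[1]) / size) * size
    }
    # Last address of the block.
    function last(cidr,   x) {
      split(cidr, x, "/")
      return first(cidr) + 2 ^ (32 - x[2]) - 1
    }
    BEGIN { exit (first(a) <= last(b) && first(b) <= last(a)) ? 0 : 1 }
  '
}

# Compare the new proxy block against every subnet recorded before the change.
for net in 10.0.4.0/24 172.17.0.0/16 172.18.0.0/16 10.0.0.0/24 \
           10.0.1.0/24 10.0.2.0/24 10.0.3.0/24; do
  if cidr_overlap 10.10.0.0/16 "$net"; then
    echo "CONFLICT: 10.10.0.0/16 overlaps $net"
  fi
done
```

With the recorded subnets the loop prints nothing. Testing against `10.0.0.0/8` would report an overlap, which is expected because the chosen block sits inside that pool.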
---
## Services on proxy (services that will be disrupted)
## Maintenance window executed (2026-06-13T05:33–05:46Z)
| Service | Stack | Notes |
|---|---|---|
| traefik_ci_commoninternet_net_app | traefik_ci_commoninternet_net | Traefik router |
| drone_ci_commoninternet_net_app | drone_ci_commoninternet_net | Drone CI |
| ccci-bridge_app | ccci-bridge | PR comment bridge |
| ccci-dashboard_app | ccci-dashboard | CI dashboard |
| ccci-reports_app | ccci-reports | Reports nginx |
| warm-keycloak_ci_commoninternet_net_app | warm-keycloak_ci_commoninternet_net | Warm Keycloak |
**Sequence executed:**
1. Pre-flight: confirmed no active recipe test containers; all stacks infra-only
2. Removed stacks on proxy: traefik, drone, ccci-bridge, ccci-dashboard, ccci-reports, warm-keycloak
3. Drained proxy (watched containers → `{}`)
4. `docker network rm proxy` → removed
5. Pulled patched config into `/root/builder-clone`, resolved stale untracked files
6. `nixos-rebuild switch --flake "git+file:///root/builder-clone?submodules=1#cc-ci"` → success
7. `systemctl restart swarm-init` → proxy recreated as `10.10.0.0/16`
8. `systemctl restart deploy-proxy` → traefik deployed; health gate deadlock broken by starting `deploy-dashboard` concurrently
9. `systemctl start deploy-drone deploy-bridge deploy-dashboard deploy-reports`
10. `systemctl start warm-keycloak`
11. All services healthy; routes confirmed
**Not on proxy (unaffected):**
- `backups_ci_commoninternet_net_app` — backup-bot-two, its own network only
---
## M2 (pending M1 PASS)
Will execute the maintenance procedure above and claim M2 once Adversary has verified M1.
**Anomaly note (for Adversary):** The `deploy-proxy` health gate checks `ci.commoninternet.net` (expects 200), but the dashboard that serves that route is ordered AFTER `deploy-proxy`. On a fresh-from-scratch boot this creates a potential ordering issue. Workaround used: started `deploy-dashboard` concurrently during `deploy-proxy`'s `wait_healthy` retry window. This matches normal-boot behavior (all `WantedBy=multi-user.target` services start concurrently, subject to their ordering constraints). The health gate passed once the dashboard was deployed (~20s after starting it).
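The implementation of `wait_healthy` is not shown in this file. As an illustration only, a bounded retry loop of the kind the note describes might look like the sketch below. The name `retry_until` and its parameters are hypothetical. A bound of this sort also makes a gate fail visibly instead of waiting forever when the thing it depends on is never started.

```shell
# Hypothetical helper (not the repo's wait_healthy):
#   retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS tries have been made.
# Returns 0 on the first success, 1 if every attempt failed.
retry_until() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}
```

A gate on the dashboard route could then be written as `retry_until 30 3 curl -sfI -o /dev/null https://ci.commoninternet.net/`, where the attempt count and delay are illustrative values rather than the ones the service uses.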