claim(pvfix-M1): proxy /16 patch + maintenance plan ready
Some checks failed
continuous-integration/drone/push Build is failing
Patch nix/modules/swarm.nix to create the `proxy` overlay with --subnet 10.10.0.0/16 (~65k VIPs, 258× headroom over the exhausted /24).

Live host survey confirms 10.10.0.0/16 is clear of all existing Docker networks (ingress 10.0.0.0/24, existing per-stack overlays 10.0.1-4.0/24, host routes).

Exact maintenance procedure in STATUS-pvfix.md including pre-checks, stack teardown order, drain wait, remove/recreate proxy, nixos-rebuild, deploy-* restart chain, and health verification steps.

Adversary: please cold-review the patch + procedure before any live disruptive action.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -1,4 +1,16 @@
-# BACKLOG — phase pvfix (Adversary section)
+# BACKLOG — phase pvfix

+## Build backlog
+
+- [x] Seed pvfix state files
+- [x] Read plan-phase-pvfix-swarm-proxy.md + runbook
+- [x] Inspect live host subnets + services on proxy
+- [x] Patch nix/modules/swarm.nix (add --subnet 10.10.0.0/16)
+- [x] Write exact maintenance procedure in STATUS-pvfix.md
+- [x] **CLAIM M1** — awaiting Adversary review
+- [ ] Execute live maintenance (after M1 PASS)
+- [ ] Verify health post-maintenance
+- [ ] **CLAIM M2** — awaiting Adversary verification
+
 ## Adversary findings

65
machine-docs/JOURNAL-pvfix.md
Normal file
@@ -0,0 +1,65 @@
# JOURNAL — phase pvfix

## 2026-06-13T05:29Z — Bootstrap + M1 patch

### Context gathered

Read the phase plan + runbook. Key facts:
- Root cause confirmed: proxy is `10.0.1.0/24` (254 VIPs), Docker GC race leaks endpoints → pool exhaustion
- Fix: enlarge to `/16` (`--subnet 10.10.0.0/16`)
- The network can't be resized in place; requires remove + recreate
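
The headroom figures follow from plain prefix arithmetic. A minimal sketch, assuming the VIP pool is simply the usable host range of the subnet (it does not model any addresses Docker may reserve for itself); `usable_addrs` is a hypothetical helper, not part of the repo:

```bash
#!/usr/bin/env bash
# Usable IPv4 addresses for a given prefix length:
# 2^(32 - prefix), minus the network and broadcast addresses.
# Assumption: this is raw subnet arithmetic, not Docker's own accounting.
usable_addrs() {
  local prefix=$1
  echo $(( (1 << (32 - prefix)) - 2 ))
}

old=$(usable_addrs 24)
new=$(usable_addrs 16)
echo "/24 usable: $old"
echo "/16 usable: $new"
# Integer division, so the ratio is rounded down.
echo "headroom: $(( new / old ))x"
```

This prints 254 and 65534 usable addresses, and the integer ratio matches the 258× headroom quoted in the commit message.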

### Live host survey

Subnets in use on the live host (collected via `docker network inspect`):
- `ingress`: `10.0.0.0/24`
- `proxy`: `10.0.1.0/24` (current — to change)
- `traefik internal`: `10.0.2.0/24`
- `warm-keycloak internal`: `10.0.3.0/24`
- `backups default`: `10.0.4.0/24`
- `bridge`: `172.17.0.0/16`; `docker_gwbridge`: `172.18.0.0/16`

`10.10.0.0/16` is clean — no conflicts. Host eth0: `91.98.47.73/32`, Tailscale: `100.95.31.88/32`.
No route entries for `10.10.x.x` in `ip route show`.
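
The no-conflict claim can be re-checked offline from the surveyed subnets. A sketch, assuming the survey list is complete; `ip_to_int` and `cidr_overlap` are illustrative helpers, not existing tooling in the repo:

```bash
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if the two CIDR blocks share any address, 1 otherwise.
cidr_overlap() {
  local ip1=${1%/*} len1=${1#*/} ip2=${2%/*} len2=${2#*/}
  # Two blocks overlap exactly when they agree on the shorter prefix.
  local len=$(( len1 < len2 ? len1 : len2 ))
  local mask=$(( len == 0 ? 0 : (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  (( ($(ip_to_int "$ip1") & mask) == ($(ip_to_int "$ip2") & mask) ))
}

candidate=10.10.0.0/16
# Subnets copied from the live host survey.
for net in 10.0.0.0/24 10.0.1.0/24 10.0.2.0/24 10.0.3.0/24 10.0.4.0/24 \
           172.17.0.0/16 172.18.0.0/16; do
  if cidr_overlap "$candidate" "$net"; then
    echo "CONFLICT: $candidate overlaps $net"
  else
    echo "ok: $candidate is clear of $net"
  fi
done
```

The same function would flag a hypothetical `10.10.5.0/24`, since that block sits inside the candidate range.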

### Services on proxy (will be disrupted during maintenance)

From `docker service ls` + per-service network inspection:
- `traefik_ci_commoninternet_net_app` — uses proxy
- `drone_ci_commoninternet_net_app` — uses proxy
- `ccci-bridge_app` — uses proxy
- `ccci-dashboard_app` — uses proxy
- `ccci-reports_app` — uses proxy
- `warm-keycloak_ci_commoninternet_net_app` — uses proxy

NOT on proxy: `backups_ci_commoninternet_net_app`, traefik socket-proxy, warm-keycloak DB.

### Deployment mechanism

- `swarm-init.service` — oneshot, creates proxy. Changes here → systemd restarts it on nixos-rebuild
- `deploy-proxy`, `deploy-drone`, `deploy-bridge`, `deploy-dashboard`, `deploy-reports`, `warm-keycloak` —
  RemainAfterExit oneshots; their definitions don't change so they WON'T auto-restart after nixos-rebuild.
  Must be manually `systemctl restart`-ed after nixos-rebuild, because their stacks are removed by hand during maintenance.

### Design choice: why 10.10.0.0/16

- Must be `/16` for ~65k VIP headroom
- Must not overlap `10.0.0.0/24` (ingress) or any of the `10.0.1-4.0/24` per-stack overlays
- The Docker default-addr-pool is `10.0.0.0/8` — any `/16` in that range is fine as long as
  it doesn't overlap an existing allocation
- `10.10.0.0/16` is a clean `/16` well outside the current allocation band: clear of `10.0.x.x`
  while still in Docker's pool. No host route conflicts.

### swarm.nix patch

Added `--subnet 10.10.0.0/16` to the `docker network create` call.
Also added a short comment explaining the motivation (required WHY per §7 comment policy for non-obvious constraint).

### Maintenance window state

Host state at time of claim:
- `docker stack ls` shows 7 stacks: backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik, warm-keycloak
- NO active recipe CI runs (only warm stacks, no test app containers)
- Confirmed with `docker ps --format "{{.Names}}"` — only infra/warm containers

Host is quiet → suitable maintenance window. No active upgrade-all or !testme runs.

171
machine-docs/STATUS-pvfix.md
Normal file
@@ -0,0 +1,171 @@
# STATUS — phase pvfix (proxy /16 VIP exhaustion fix)

**Updated:** 2026-06-13T05:29Z
**Phase:** pvfix
**Builder:** autonomic-bot

---

## Gate: M1 CLAIMED, awaiting Adversary

### WHAT is claimed (M1 DoD)

1. `nix/modules/swarm.nix` patched: `proxy` overlay now created with `--subnet 10.10.0.0/16`
2. Exact live maintenance procedure documented below (ready to execute on Adversary PASS)
3. Chosen `/16` proven safe by live host network inspection

### HOW to verify (cold-reproducible)

```bash
# 1. Verify the patch in the repo
git clone https://git.autonomic.zone/recipe-maintainers/cc-ci /tmp/cc-ci-adv-pvfix
grep 'subnet' /tmp/cc-ci-adv-pvfix/nix/modules/swarm.nix
# Expected: --subnet 10.10.0.0/16

# 2. Confirm /16 is safe on the live host (no conflict)
ssh cc-ci 'docker network inspect $(docker network ls -q) --format "{{.Name}}: {{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: NO network using 10.10.x.x — all existing overlays are 10.0.0-4.0/24

# 3. Review the maintenance procedure below for correctness/completeness
```

### EXPECTED outcome

- `grep 'subnet' nix/modules/swarm.nix` → `--subnet 10.10.0.0/16`
- No live network in the `10.10.0.0/16` range → chosen block is safe
- Adversary confirms the procedure is safe to execute before any disruptive action

### WHERE

- **Commit:** see `git log --oneline -1 nix/modules/swarm.nix` in the repo
- **File:** `nix/modules/swarm.nix` lines 42–47

---

## Maintenance Procedure (to execute after Adversary M1 PASS)

### Pre-checks (run immediately before starting)

```bash
# Confirm no active CI runs / upgrade-all in flight
# -E enables | alternation; the pattern is quoted so the remote shell passes it through intact
ssh cc-ci 'docker ps --format "{{.Names}}" | grep -Ev "warm-|traefik|drone|ccci|backups"'
# Expected: empty (no recipe test containers running)

ssh cc-ci 'docker stack ls'
# Expected: only infra stacks (traefik, drone, ccci-*, warm-keycloak, backups)
```
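
The name filter in the pre-check can be exercised without touching the host. A sketch, assuming the same name fragments as the pre-check; `non_infra_containers` and the sample names are hypothetical:

```bash
#!/usr/bin/env bash
# Print the names (one per line on stdin) that do NOT look like infra/warm
# containers. grep exits 1 when nothing is left, which is the quiet case,
# so that status is swallowed.
non_infra_containers() {
  grep -Ev 'warm-|traefik|drone|ccci|backups' || true
}

# Hypothetical sample of `docker ps --format "{{.Names}}"` output.
sample='traefik_ci_commoninternet_net_app.1.abc
drone_ci_commoninternet_net_app.1.def
ccci-bridge_app.1.ghi
myrecipe_test_app.1.jkl'

leftover=$(printf '%s\n' "$sample" | non_infra_containers)
if [ -n "$leftover" ]; then
  echo "NOT quiet, found: $leftover"
else
  echo "host is quiet"
fi
```

With the sample above only the made-up `myrecipe_test_app` container survives the filter, which is the situation in which the maintenance should not start.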

### Step 1 — Capture baseline

```bash
ssh cc-ci 'docker network inspect proxy'
# Record: current subnet (10.0.1.0/24), ID, joined containers
```

### Step 2 — Remove stacks that use proxy

```bash
ssh cc-ci 'docker stack rm traefik_ci_commoninternet_net drone_ci_commoninternet_net ccci-bridge ccci-dashboard ccci-reports warm-keycloak_ci_commoninternet_net'
```

### Step 3 — Wait for proxy to drain (all containers detached)

```bash
# Polls every 3 s with no timeout; interrupt manually if the network never drains
ssh cc-ci '
until [ "$(docker network inspect proxy --format "{{json .Containers}}" 2>/dev/null)" = "{}" ]; do
  echo "waiting for proxy to drain..."
  sleep 3
done
echo "proxy drained"
'
```
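
Since leaked endpoints are the failure mode being fixed, a drain wait with an upper bound may be safer than an open-ended loop. A sketch of one way to bound it; `wait_until` and `proxy_is_drained` are hypothetical helpers, and the attempt budget is an assumption, not a tested value:

```bash
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempt budget runs out.
# usage: wait_until <max_attempts> <sleep_seconds> <command...>
wait_until() {
  local max=$1 delay=$2 attempt
  shift 2
  for (( attempt = 1; attempt <= max; attempt++ )); do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if (( attempt < max )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Hypothetical predicate mirroring the drain check in Step 3.
proxy_is_drained() {
  [ "$(docker network inspect proxy --format '{{json .Containers}}' 2>/dev/null)" = "{}" ]
}

# Example (not run here): give up after 100 tries, 3 s apart (about 5 minutes).
# wait_until 100 3 proxy_is_drained || echo "proxy did not drain; inspect leaked endpoints"
```

A non-zero return leaves the decision to the operator instead of blocking the maintenance window indefinitely.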

### Step 4 — Remove old proxy network

```bash
ssh cc-ci 'docker network rm proxy'
```

### Step 5 — Pull patched config on host + nixos-rebuild switch

```bash
ssh cc-ci 'cd /root/cc-ci && git pull --rebase'
ssh cc-ci 'nixos-rebuild switch --flake /root/cc-ci#cc-ci 2>&1 | tail -20'
# This triggers swarm-init to recreate proxy with --subnet 10.10.0.0/16
```

### Step 6 — Verify proxy is /16

```bash
ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: 10.10.0.0/16
```

### Step 7 — Restart deploy oneshots (stacks were removed)

```bash
ssh cc-ci 'systemctl restart deploy-proxy'
# Wait for traefik healthy (check ci.commoninternet.net returns 200)
ssh cc-ci 'systemctl restart deploy-drone deploy-bridge deploy-dashboard deploy-reports warm-keycloak'
```

### Step 8 — Health check

```bash
# Verify all stacks running
ssh cc-ci 'docker stack ls && docker service ls'

# Verify Traefik routing (ci dashboard reachable)
curl -sI https://ci.commoninternet.net | head -5
# Expected: HTTP/2 200

# Verify Drone reachable
curl -sI https://drone.ci.commoninternet.net | head -5
# Expected: HTTP/2 200 or 302

# Verify proxy is /16
ssh cc-ci 'docker network inspect proxy --format "Subnet: {{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: Subnet: 10.10.0.0/16
```
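
The curl checks rely on reading the status line by eye. A sketch of turning them into pass/fail checks, using canned headers in place of a live request; `http_status` and `status_is` are hypothetical helpers:

```bash
#!/usr/bin/env bash
# Extract the numeric status code from the first line of `curl -sI` output.
# Carriage returns are stripped because HTTP header lines end in CRLF.
http_status() {
  tr -d '\r' | head -n 1 | awk '{print $2}'
}

# Return 0 if the status read from stdin is one of the accepted codes.
# usage: curl -sI <url> | status_is 200 302
status_is() {
  local got want
  got=$(http_status)
  for want in "$@"; do
    if [ "$got" = "$want" ]; then
      return 0
    fi
  done
  echo "unexpected status: ${got:-none}" >&2
  return 1
}

# Canned response headers stand in for a live curl call.
printf 'HTTP/2 200 \r\ncontent-type: text/html\r\n' | status_is 200 && echo "dashboard ok"
printf 'HTTP/1.1 302 Found\r\n' | status_is 200 302 && echo "drone ok"
```

Against the live host the same helper would be fed from `curl -sI https://ci.commoninternet.net`, with 200 accepted for the dashboard and 200 or 302 for Drone, as in the expected values above.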

---

## Subnet safety proof (live host, collected 2026-06-13T05:27Z)

Live Docker networks and their subnets:

```
backups_ci_commoninternet_net_default: 10.0.4.0/24
bridge: 172.17.0.0/16
docker_gwbridge: 172.18.0.0/16
ingress: 10.0.0.0/24
proxy: 10.0.1.0/24 ← current (to be replaced)
traefik_ci_commoninternet_net_internal: 10.0.2.0/24
warm-keycloak_ci_commoninternet_net_internal: 10.0.3.0/24
```

`10.10.0.0/16` is clear: no existing network uses any address in `10.10.0.0–10.10.255.255`.
The chosen block is in the Docker default-addr-pool (`10.0.0.0/8`) but at a different /16 with
no collisions. Host eth0 is `91.98.47.73/32`; tailscale0 is `100.95.31.88/32` — no conflict.

---

## Services on proxy (services that will be disrupted)

| Service | Stack | Notes |
|---|---|---|
| traefik_ci_commoninternet_net_app | traefik_ci_commoninternet_net | Traefik router |
| drone_ci_commoninternet_net_app | drone_ci_commoninternet_net | Drone CI |
| ccci-bridge_app | ccci-bridge | PR comment bridge |
| ccci-dashboard_app | ccci-dashboard | CI dashboard |
| ccci-reports_app | ccci-reports | Reports nginx |
| warm-keycloak_ci_commoninternet_net_app | warm-keycloak_ci_commoninternet_net | Warm Keycloak |

**Not on proxy (unaffected):**
- `backups_ci_commoninternet_net_app` — backup-bot-two, its own network only

---

## M2 (pending M1 PASS)

Will execute the maintenance procedure above and claim M2 once Adversary has verified M1.
nix/modules/swarm.nix
@@ -40,7 +40,11 @@
   docker swarm init --advertise-addr 127.0.0.1
 fi
 if ! docker network inspect proxy >/dev/null 2>&1; then
-  docker network create --driver overlay --attachable proxy
+  # Explicit /16 (~65 534 VIPs) prevents the /24-exhaustion class seen 2026-06-12:
+  # leaked endpoints from concurrent stack GC race exhausted the default 254-VIP pool.
+  # 10.10.0.0/16 is clear of ingress (10.0.0.0/24) and existing per-stack overlays
+  # (10.0.1–4.0/24). Runbook: cc-ci-plan/plan-proxy-vip-exhaustion-fix.md
+  docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
 fi
 '';
 };