From e6349a9dfe19208e3e0b8f3a1d9679cd1e63442f Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Sat, 13 Jun 2026 05:30:58 +0000
Subject: [PATCH] claim(pvfix-M1): proxy /16 patch + maintenance plan ready
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Patch nix/modules/swarm.nix to create the `proxy` overlay with
--subnet 10.10.0.0/16 (~65k VIPs, 258× headroom over the exhausted /24).

Live host survey confirms 10.10.0.0/16 is clear of all existing Docker
networks (ingress 10.0.0.0/24, existing per-stack overlays 10.0.1-4.0/24,
host routes).

Exact maintenance procedure in STATUS-pvfix.md including pre-checks,
stack teardown order, drain wait, remove/recreate proxy, nixos-rebuild,
deploy-* restart chain, and health verification steps.

Adversary: please cold-review the patch + procedure before any live
disruptive action.

Co-Authored-By: Claude Sonnet 4.6
---
 machine-docs/BACKLOG-pvfix.md |  14 ++-
 machine-docs/JOURNAL-pvfix.md |  65 +++++++++++++
 machine-docs/STATUS-pvfix.md  | 171 ++++++++++++++++++++++++++++++++++
 nix/modules/swarm.nix         |   6 +-
 4 files changed, 254 insertions(+), 2 deletions(-)
 create mode 100644 machine-docs/JOURNAL-pvfix.md
 create mode 100644 machine-docs/STATUS-pvfix.md

diff --git a/machine-docs/BACKLOG-pvfix.md b/machine-docs/BACKLOG-pvfix.md
index 0adb95a..e55f5d2 100644
--- a/machine-docs/BACKLOG-pvfix.md
+++ b/machine-docs/BACKLOG-pvfix.md
@@ -1,4 +1,16 @@
-# BACKLOG — phase pvfix (Adversary section)
+# BACKLOG — phase pvfix
+
+## Build backlog
+
+- [x] Seed pvfix state files
+- [x] Read plan-phase-pvfix-swarm-proxy.md + runbook
+- [x] Inspect live host subnets + services on proxy
+- [x] Patch nix/modules/swarm.nix (add --subnet 10.10.0.0/16)
+- [x] Write exact maintenance procedure in STATUS-pvfix.md
+- [x] **CLAIM M1** — awaiting Adversary review
+- [ ] Execute live maintenance (after M1 PASS)
+- [ ] Verify health post-maintenance
+- [ ] **CLAIM M2** — awaiting Adversary verification
 
 ## Adversary findings
 
diff --git a/machine-docs/JOURNAL-pvfix.md b/machine-docs/JOURNAL-pvfix.md
new file mode 100644
index 0000000..85987f4
--- /dev/null
+++ b/machine-docs/JOURNAL-pvfix.md
@@ -0,0 +1,65 @@
+# JOURNAL — phase pvfix
+
+## 2026-06-13T05:29Z — Bootstrap + M1 patch
+
+### Context gathered
+
+Read the phase plan + runbook. Key facts:
+- Root cause confirmed: proxy is `10.0.1.0/24` (254 VIPs), Docker GC race leaks endpoints → pool exhaustion
+- Fix: enlarge to `/16` (`--subnet 10.10.0.0/16`)
+- The network can't be resized in place; requires remove + recreate
+
+### Live host survey
+
+Subnets in use on the live host (collected via `docker network inspect`):
+- `ingress`: `10.0.0.0/24`
+- `proxy`: `10.0.1.0/24` (current — to change)
+- `traefik internal`: `10.0.2.0/24`
+- `warm-keycloak internal`: `10.0.3.0/24`
+- `backups default`: `10.0.4.0/24`
+- `bridge`/`docker_gwbridge`: `172.17/18.0.0/16`
+
+`10.10.0.0/16` is clean — no conflicts. Host eth0: `91.98.47.73/32`, Tailscale: `100.95.31.88/32`.
+No route entries for `10.10.x.x` in `ip route show`.
+
+### Services on proxy (will be disrupted during maintenance)
+
+From `docker service ls` + per-service network inspection:
+- `traefik_ci_commoninternet_net_app` — uses proxy
+- `drone_ci_commoninternet_net_app` — uses proxy
+- `ccci-bridge_app` — uses proxy
+- `ccci-dashboard_app` — uses proxy
+- `ccci-reports_app` — uses proxy
+- `warm-keycloak_ci_commoninternet_net_app` — uses proxy
+
+NOT on proxy: `backups_ci_commoninternet_net_app`, traefik socket-proxy, warm-keycloak DB.
+
+### Deployment mechanism
+
+- `swarm-init.service` — oneshot, creates proxy. Changes here → systemd restarts it on nixos-rebuild
+- `deploy-proxy`, `deploy-drone`, `deploy-bridge`, `deploy-dashboard`, `deploy-reports`, `warm-keycloak` —
+  RemainAfterExit oneshots; their definitions don't change so they WON'T auto-restart after nixos-rebuild.
+  Must be manually `systemctl restart`-ed after nixos-rebuild, because the maintenance procedure removes their stacks.
+
+### Design choice: why 10.10.0.0/16
+
+- Must be `/16` for ~65k VIP headroom
+- Must not overlap `10.0.0.0/24` (ingress) or any of the `10.0.1-4.0/24` per-stack overlays
+- The Docker default-addr-pool is `10.0.0.0/8` — any `/16` in that range is fine as long as
+  it doesn't overlap an existing allocation
+- `10.10.0.0/16` is a clean `/16` outside the current allocation band — clear of `10.0.x.x`
+  while still in Docker's pool. No host route conflicts.
+
+### swarm.nix patch
+
+Added `--subnet 10.10.0.0/16` to the `docker network create` call.
+Also added a short comment explaining the motivation (required WHY per §7 comment policy for non-obvious constraint).
+
+### Maintenance window state
+
+Host state at time of claim:
+- `docker stack ls` shows 7 stacks: backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik, warm-keycloak
+- NO active recipe CI runs (only warm stacks, no test app containers)
+- Confirmed with `docker ps --format "{{.Names}}"` — only infra/warm containers
+
+Host is quiet → suitable maintenance window. No active upgrade-all or !testme runs.
diff --git a/machine-docs/STATUS-pvfix.md b/machine-docs/STATUS-pvfix.md
new file mode 100644
index 0000000..ca69a06
--- /dev/null
+++ b/machine-docs/STATUS-pvfix.md
@@ -0,0 +1,171 @@
+# STATUS — phase pvfix (proxy /16 VIP exhaustion fix)
+
+**Updated:** 2026-06-13T05:29Z
+**Phase:** pvfix
+**Builder:** autonomic-bot
+
+---
+
+## Gate: M1 CLAIMED, awaiting Adversary
+
+### WHAT is claimed (M1 DoD)
+
+1. `nix/modules/swarm.nix` patched: `proxy` overlay now created with `--subnet 10.10.0.0/16`
+2. Exact live maintenance procedure documented below (ready to execute on Adversary PASS)
+3. Chosen `/16` proven safe by live host network inspection
+
+### HOW to verify (cold-reproducible)
+
+```bash
+# 1. Verify the patch in the repo
+git clone https://git.autonomic.zone/recipe-maintainers/cc-ci /tmp/cc-ci-adv-pvfix
+grep 'subnet' /tmp/cc-ci-adv-pvfix/nix/modules/swarm.nix
+# Expected: --subnet 10.10.0.0/16
+
+# 2. Confirm /16 is safe on the live host (no conflict)
+ssh cc-ci 'docker network inspect $(docker network ls -q) --format "{{.Name}}: {{range .IPAM.Config}}{{.Subnet}}{{end}}"'
+# Expected: NO network using 10.10.x.x — all existing overlays are 10.0.0-4.0/24
+
+# 3. Review the maintenance procedure below for correctness/completeness
+```
+
+### EXPECTED outcome
+
+- `grep 'subnet' nix/modules/swarm.nix` → `--subnet 10.10.0.0/16`
+- No live network in the `10.10.0.0/16` range → chosen block is safe
+- Adversary confirms the procedure is safe to execute before any disruptive action
+
+### WHERE
+
+- **Commit:** see `git log --oneline -1 nix/modules/swarm.nix` in the repo
+- **File:** `nix/modules/swarm.nix` lines 42–47
+
+---
+
+## Maintenance Procedure (to execute after Adversary M1 PASS)
+
+### Pre-checks (run immediately before starting)
+
+```bash
+# Confirm no active CI runs / upgrade-all in flight
+ssh cc-ci 'docker ps --format "{{.Names}}" | grep -Ev "warm-|traefik|drone|ccci|backups"'
+# Expected: empty (no recipe test containers running)
+
+ssh cc-ci 'docker stack ls'
+# Expected: only infra stacks (traefik, drone, ccci-*, warm-keycloak, backups)
+```
+
+### Step 1 — Capture baseline
+
+```bash
+ssh cc-ci 'docker network inspect proxy'
+# Record: current subnet (10.0.1.0/24), ID, joined containers
+```
+
+### Step 2 — Remove stacks that use proxy
+
+```bash
+ssh cc-ci 'docker stack rm traefik_ci_commoninternet_net drone_ci_commoninternet_net ccci-bridge ccci-dashboard ccci-reports warm-keycloak_ci_commoninternet_net'
+```
+
+### Step 3 — Wait for proxy to drain (all containers detached)
+
+```bash
+ssh cc-ci '
+until [ "$(docker network inspect proxy --format "{{json .Containers}}")" = "{}" ] 2>/dev/null; do
+  echo "waiting for proxy to drain..."
+  sleep 3
+done
+echo "proxy drained"
+'
+```
+
+### Step 4 — Remove old proxy network
+
+```bash
+ssh cc-ci 'docker network rm proxy'
+```
+
+### Step 5 — Pull patched config on host + nixos-rebuild switch
+
+```bash
+ssh cc-ci 'cd /root/cc-ci && git pull --rebase'
+ssh cc-ci 'nixos-rebuild switch --flake /root/cc-ci#cc-ci 2>&1 | tail -20'
+# This triggers swarm-init to recreate proxy with --subnet 10.10.0.0/16
+```
+
+### Step 6 — Verify proxy is /16
+
+```bash
+ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"'
+# Expected: 10.10.0.0/16
+```
+
+### Step 7 — Restart deploy oneshots (stacks were removed)
+
+```bash
+ssh cc-ci 'systemctl restart deploy-proxy'
+# Wait for traefik healthy (check ci.commoninternet.net returns 200)
+ssh cc-ci 'systemctl restart deploy-drone deploy-bridge deploy-dashboard deploy-reports warm-keycloak'
+```
+
+### Step 8 — Health check
+
+```bash
+# Verify all stacks running
+ssh cc-ci 'docker stack ls && docker service ls'
+
+# Verify Traefik routing (ci dashboard reachable)
+curl -sI https://ci.commoninternet.net | head -5
+# Expected: HTTP/2 200
+
+# Verify Drone reachable
+curl -sI https://drone.ci.commoninternet.net | head -5
+# Expected: HTTP/2 200 or 302
+
+# Verify proxy is /16
+ssh cc-ci 'docker network inspect proxy --format "Subnet: {{range .IPAM.Config}}{{.Subnet}}{{end}}"'
+# Expected: Subnet: 10.10.0.0/16
+```
+
+---
+
+## Subnet safety proof (live host, collected 2026-06-13T05:27Z)
+
+Live Docker networks and their subnets:
+
+```
+backups_ci_commoninternet_net_default: 10.0.4.0/24
+bridge: 172.17.0.0/16
+docker_gwbridge: 172.18.0.0/16
+ingress: 10.0.0.0/24
+proxy: 10.0.1.0/24 ← current (to be replaced)
+traefik_ci_commoninternet_net_internal: 10.0.2.0/24
+warm-keycloak_ci_commoninternet_net_internal: 10.0.3.0/24
+```
+
+`10.10.0.0/16` is clear: no existing network uses any address in `10.10.0.0–10.10.255.255`.
+The chosen block is in the Docker default-addr-pool (`10.0.0.0/8`) but at a different /16 with
+no collisions. Host eth0 is `91.98.47.73/32`; tailscale0 is `100.95.31.88/32` — no conflict.
+
+---
+
+## Services on proxy (services that will be disrupted)
+
+| Service | Stack | Notes |
+|---|---|---|
+| traefik_ci_commoninternet_net_app | traefik_ci_commoninternet_net | Traefik router |
+| drone_ci_commoninternet_net_app | drone_ci_commoninternet_net | Drone CI |
+| ccci-bridge_app | ccci-bridge | PR comment bridge |
+| ccci-dashboard_app | ccci-dashboard | CI dashboard |
+| ccci-reports_app | ccci-reports | Reports nginx |
+| warm-keycloak_ci_commoninternet_net_app | warm-keycloak_ci_commoninternet_net | Warm Keycloak |
+
+**Not on proxy (unaffected):**
+- `backups_ci_commoninternet_net_app` — backup-bot-two, its own network only
+
+---
+
+## M2 (pending M1 PASS)
+
+Will execute the maintenance procedure above and claim M2 once Adversary has verified M1.
diff --git a/nix/modules/swarm.nix b/nix/modules/swarm.nix
index 0053a7e..e91de8e 100644
--- a/nix/modules/swarm.nix
+++ b/nix/modules/swarm.nix
@@ -40,7 +40,11 @@
         docker swarm init --advertise-addr 127.0.0.1
       fi
       if ! docker network inspect proxy >/dev/null 2>&1; then
-        docker network create --driver overlay --attachable proxy
+        # Explicit /16 (~65 534 VIPs) prevents the /24-exhaustion class seen 2026-06-12:
+        # leaked endpoints from concurrent stack GC race exhausted the default 254-VIP pool.
+        # 10.10.0.0/16 is clear of ingress (10.0.0.0/24) and existing per-stack overlays
+        # (10.0.1–4.0/24). Runbook: cc-ci-plan/plan-proxy-vip-exhaustion-fix.md
+        docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
       fi
     '';
   };
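
Review aid (not part of the patch): the subnet safety claim and the headroom figures can be re-checked offline without touching the host. The sketch below is a minimal illustration, assuming the in-use subnets are copied by hand from the survey in STATUS-pvfix.md and that each CIDR is written with its network address. The helper names `ip_to_int`, `cidr_range`, `overlaps` and `usable_hosts` are hypothetical and do not exist in the repo. The address counts are raw host counts (block size minus network and broadcast); Docker may reserve a few more, such as the gateway.

```bash
#!/usr/bin/env bash
# Sketch only: offline overlap check for a candidate overlay subnet.
# All function names are hypothetical helpers, not repo code.
set -u

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<<"$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Print the first and last address of a CIDR block as integers.
# Assumes the part before the slash is the block's network address.
cidr_range() {
  local base=${1%/*} prefix=${1#*/} start
  start=$(ip_to_int "$base")
  echo "$start $(( start + (1 << (32 - prefix)) - 1 ))"
}

# Exit 0 when the two CIDR blocks share at least one address.
overlaps() {
  local a_lo a_hi b_lo b_hi
  read -r a_lo a_hi <<<"$(cidr_range "$1")"
  read -r b_lo b_hi <<<"$(cidr_range "$2")"
  (( a_lo <= b_hi && b_lo <= a_hi ))
}

# Host addresses in a block of the given prefix length,
# excluding the network and broadcast addresses.
usable_hosts() {
  echo $(( (1 << (32 - $1)) - 2 ))
}

candidate=10.10.0.0/16
# Subnets copied from the live host survey in STATUS-pvfix.md.
in_use=(10.0.0.0/24 10.0.1.0/24 10.0.2.0/24 10.0.3.0/24 10.0.4.0/24
        172.17.0.0/16 172.18.0.0/16)

status=0
for net in "${in_use[@]}"; do
  if overlaps "$candidate" "$net"; then
    echo "CONFLICT: $candidate overlaps $net"
    status=1
  fi
done
if [ "$status" -eq 0 ]; then
  echo "$candidate is clear of all surveyed subnets"
fi
echo "usable addresses: /24 = $(usable_hosts 24), /16 = $(usable_hosts 16)"
```

With the surveyed subnets above, the script reports no conflict and prints 254 and 65534 as the address counts, which agree with the figures quoted in the swarm.nix comment. It checks only the recorded survey, so it complements the live `docker network inspect` check in the procedure and does not replace it.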