cc-ci/machine-docs/STATUS-pvfix.md
autonomic-bot e6349a9dfe
Some checks failed
continuous-integration/drone/push Build is failing
claim(pvfix-M1): proxy /16 patch + maintenance plan ready
Patch nix/modules/swarm.nix to create the `proxy` overlay with
--subnet 10.10.0.0/16 (~65k VIPs, 256× headroom over the exhausted /24).

Live host survey confirms 10.10.0.0/16 is clear of all existing
Docker networks (ingress 10.0.0.0/24, existing per-stack overlays
10.0.1-4.0/24, host routes). Exact maintenance procedure in
STATUS-pvfix.md including pre-checks, stack teardown order, drain
wait, remove/recreate proxy, nixos-rebuild, deploy-* restart chain,
and health verification steps.

Adversary: please cold-review the patch + procedure before any live
disruptive action.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 05:31:21 +00:00

# STATUS — phase pvfix (proxy /16 VIP exhaustion fix)
**Updated:** 2026-06-13T05:29Z
**Phase:** pvfix
**Builder:** autonomic-bot
---
## Gate: M1 CLAIMED, awaiting Adversary
### WHAT is claimed (M1 DoD)
1. `nix/modules/swarm.nix` patched: `proxy` overlay now created with `--subnet 10.10.0.0/16`
2. Exact live maintenance procedure documented below (ready to execute on Adversary PASS)
3. Chosen `/16` proven safe by live host network inspection
### HOW to verify (cold-reproducible)
```bash
# 1. Verify the patch in the repo
git clone https://git.autonomic.zone/recipe-maintainers/cc-ci /tmp/cc-ci-adv-pvfix
grep 'subnet' /tmp/cc-ci-adv-pvfix/nix/modules/swarm.nix
# Expected: --subnet 10.10.0.0/16
# 2. Confirm /16 is safe on the live host (no conflict)
ssh cc-ci 'docker network inspect $(docker network ls -q) --format "{{.Name}}: {{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: NO network using 10.10.x.x — all existing overlays are 10.0.0-4.0/24
# 3. Review the maintenance procedure below for correctness/completeness
```
### EXPECTED outcome
- `grep 'subnet' nix/modules/swarm.nix` → `--subnet 10.10.0.0/16`
- No live network in the `10.10.0.0/16` range → chosen block is safe
- Adversary confirms the procedure is safe to execute before any disruptive action
### WHERE
- **Commit:** see `git log --oneline -1 nix/modules/swarm.nix` in the repo
- **File:** `nix/modules/swarm.nix` lines 42-47
---
## Maintenance Procedure (to execute after Adversary M1 PASS)
### Pre-checks (run immediately before starting)
```bash
# Confirm no active CI runs / upgrade-all in flight
ssh cc-ci 'docker ps --format "{{.Names}}" | grep -Ev "warm-|traefik|drone|ccci|backups"'
# Expected: empty (no recipe test containers running)
ssh cc-ci 'docker stack ls'
# Expected: only infra stacks (traefik, drone, ccci-*, warm-keycloak, backups)
```
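The container pre-check can also be run as a pass/fail gate instead of an eyeball check. A minimal sketch, assuming the same infra name fragments as the `grep` filter above; `unexpected_names` and `assert_host_quiet` are hypothetical helper names, not part of the repo:

```bash
#!/usr/bin/env bash
# Sketch only: the pattern mirrors the infra fragments listed in the pre-check.
INFRA_PATTERN='warm-|traefik|drone|ccci|backups'

# Print every name read from stdin that does not look like an infra container.
unexpected_names() {
  # grep exits 1 when nothing is printed; that is the good case here.
  grep -Ev "$INFRA_PATTERN" || true
}

# Succeed only if stdin contains infra names and nothing else.
assert_host_quiet() {
  local extra
  extra="$(unexpected_names)"
  if [ -n "$extra" ]; then
    echo "non-infra containers still running:" >&2
    echo "$extra" >&2
    return 1
  fi
  echo "host is quiet"
}
```

It would be fed the live container list from the operator's machine, for example `ssh cc-ci 'docker ps --format "{{.Names}}"' | assert_host_quiet`, and the procedure would stop on a non-zero exit.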
### Step 1 — Capture baseline
```bash
ssh cc-ci 'docker network inspect proxy'
# Record: current subnet (10.0.1.0/24), ID, joined containers
```
### Step 2 — Remove stacks that use proxy
```bash
ssh cc-ci 'docker stack rm traefik_ci_commoninternet_net drone_ci_commoninternet_net ccci-bridge ccci-dashboard ccci-reports warm-keycloak_ci_commoninternet_net'
```
### Step 3 — Wait for proxy to drain (all containers detached)
```bash
ssh cc-ci '
until [ "$(docker network inspect proxy --format "{{json .Containers}}")" = "{}" ] 2>/dev/null; do
echo "waiting for proxy to drain..."
sleep 3
done
echo "proxy drained"
'
```
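The drain loop above has no upper bound, so a task that never detaches would stall the maintenance window indefinitely. A bounded variant might look like the sketch below; the function name `wait_for_drain` and the 120 s / 3 s defaults are assumptions, not values from the original procedure:

```bash
#!/usr/bin/env bash
# Sketch only: wait until no containers are attached to a network on this node,
# but give up after a deadline instead of looping forever.
wait_for_drain() {
  local network="$1" timeout="${2:-120}" interval="${3:-3}"
  local deadline=$(( SECONDS + timeout ))
  local attached
  while true; do
    # An empty JSON object means no endpoints remain attached locally.
    # If the inspect itself fails (e.g. network already gone), report it as exit 2.
    attached="$(docker network inspect "$network" --format '{{json .Containers}}')" || return 2
    if [ "$attached" = "{}" ]; then
      echo "$network drained"
      return 0
    fi
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout}s waiting for $network to drain" >&2
      return 1
    fi
    echo "waiting for $network to drain..."
    sleep "$interval"
  done
}
```

On the host this would be called as `wait_for_drain proxy 120 3`, and Step 4 would only proceed on exit status 0.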
### Step 4 — Remove old proxy network
```bash
ssh cc-ci 'docker network rm proxy'
```
### Step 5 — Pull patched config on host + nixos-rebuild switch
```bash
ssh cc-ci 'cd /root/cc-ci && git pull --rebase'
ssh cc-ci 'nixos-rebuild switch --flake /root/cc-ci#cc-ci 2>&1 | tail -20'
# This triggers swarm-init to recreate proxy with --subnet 10.10.0.0/16
```
### Step 6 — Verify proxy is /16
```bash
ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: 10.10.0.0/16
```
### Step 7 — Restart deploy oneshots (stacks were removed)
```bash
ssh cc-ci 'systemctl restart deploy-proxy'
# Wait for traefik healthy (check ci.commoninternet.net returns 200)
ssh cc-ci 'systemctl restart deploy-drone deploy-bridge deploy-dashboard deploy-reports warm-keycloak'
```
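The "wait for traefik healthy" comment in Step 7 is a manual step as written. One way to script it is to poll the URL until it returns an accepted status code. This is a sketch under assumptions: `wait_for_http` is a hypothetical helper, and the attempt count, interval, and 5 s per-request timeout are illustrative defaults:

```bash
#!/usr/bin/env bash
# Sketch only: poll a URL until it answers with one of the accepted HTTP codes.
# Arguments: url, space-separated accepted codes, max attempts, seconds between attempts.
wait_for_http() {
  local url="$1" accepted="${2:-200}" attempts="${3:-40}" interval="${4:-3}"
  local code i
  for (( i = 1; i <= attempts; i++ )); do
    # -o /dev/null discards the body; -w prints only the status code.
    # curl prints 000 and exits non-zero when it cannot connect, hence "|| true".
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")" || true
    case " $accepted " in
      *" $code "*)
        echo "$url is up (HTTP $code)"
        return 0
        ;;
    esac
    echo "attempt $i/$attempts: $url returned ${code:-no response}"
    sleep "$interval"
  done
  echo "$url did not become healthy after $attempts attempts" >&2
  return 1
}
```

Between the two `systemctl restart` commands this could be invoked as `wait_for_http https://ci.commoninternet.net 200`, so the remaining deploy units are only restarted once Traefik is routing.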
### Step 8 — Health check
```bash
# Verify all stacks running
ssh cc-ci 'docker stack ls && docker service ls'
# Verify Traefik routing (ci dashboard reachable)
curl -sI https://ci.commoninternet.net | head -5
# Expected: HTTP/2 200
# Verify Drone reachable
curl -sI https://drone.ci.commoninternet.net | head -5
# Expected: HTTP/2 200 or 302
# Verify proxy is /16
ssh cc-ci 'docker network inspect proxy --format "Subnet: {{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: Subnet: 10.10.0.0/16
```
---
## Subnet safety proof (live host, collected 2026-06-13T05:27Z)
Live Docker networks and their subnets:
```
backups_ci_commoninternet_net_default: 10.0.4.0/24
bridge: 172.17.0.0/16
docker_gwbridge: 172.18.0.0/16
ingress: 10.0.0.0/24
proxy: 10.0.1.0/24 ← current (to be replaced)
traefik_ci_commoninternet_net_internal: 10.0.2.0/24
warm-keycloak_ci_commoninternet_net_internal: 10.0.3.0/24
```
`10.10.0.0/16` is clear: no existing network uses any address in `10.10.0.0-10.10.255.255`.
The chosen block is in the Docker default-addr-pool (`10.0.0.0/8`) but at a different /16 with
no collisions. Host eth0 is `91.98.47.73/32`; tailscale0 is `100.95.31.88/32` — no conflict.
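
The overlap claim can be checked mechanically rather than by reading the list. The sketch below converts each CIDR block to an integer range and compares ranges. All function names are hypothetical, and it assumes IPv4 addresses without leading zeros and input lines in the `name: subnet` form shown above:

```bash
#!/usr/bin/env bash
# Sketch only: pure-bash IPv4 CIDR overlap check.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Print the first and last address of a CIDR block as integers.
cidr_range() {
  local ip="${1%/*}" prefix="${1#*/}"
  local size=$(( 1 << (32 - prefix) ))
  local base=$(( $(ip_to_int "$ip") & ~(size - 1) & 0xFFFFFFFF ))
  echo "$base $(( base + size - 1 ))"
}

# Exit 0 if the two CIDR blocks share at least one address.
cidr_overlaps() {
  local a_lo a_hi b_lo b_hi
  read -r a_lo a_hi <<< "$(cidr_range "$1")"
  read -r b_lo b_hi <<< "$(cidr_range "$2")"
  (( a_lo <= b_hi && b_lo <= a_hi ))
}

# Read "name: subnet" lines on stdin; print each network that collides with the candidate.
find_conflicts() {
  local candidate="$1" name subnet _
  while read -r name subnet _; do
    [ -n "$subnet" ] || continue   # skip networks without a subnet (e.g. host, none)
    if cidr_overlaps "$candidate" "$subnet"; then
      echo "${name%:} $subnet"
    fi
  done
}

# Usage (prints nothing when the candidate block is clear):
#   ssh cc-ci 'docker network inspect $(docker network ls -q) --format "{{.Name}}: {{range .IPAM.Config}}{{.Subnet}}{{end}}"' \
#     | find_conflicts 10.10.0.0/16
#
# Address-count ratio of a /16 to a /24:
#   echo $(( (1 << (32 - 16)) / (1 << (32 - 24)) ))   # prints 256
```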
---
## Services on proxy (services that will be disrupted)
| Service | Stack | Notes |
|---|---|---|
| traefik_ci_commoninternet_net_app | traefik_ci_commoninternet_net | Traefik router |
| drone_ci_commoninternet_net_app | drone_ci_commoninternet_net | Drone CI |
| ccci-bridge_app | ccci-bridge | PR comment bridge |
| ccci-dashboard_app | ccci-dashboard | CI dashboard |
| ccci-reports_app | ccci-reports | Reports nginx |
| warm-keycloak_ci_commoninternet_net_app | warm-keycloak_ci_commoninternet_net | Warm Keycloak |
**Not on proxy (unaffected):**
- `backups_ci_commoninternet_net_app` — backup-bot-two, its own network only
---
## M2 (pending M1 PASS)
Will execute the maintenance procedure above and claim M2 once Adversary has verified M1.