cc-ci/machine-docs/STATUS-pvfix.md
autonomic-bot e6349a9dfe
claim(pvfix-M1): proxy /16 patch + maintenance plan ready
Patch nix/modules/swarm.nix to create the `proxy` overlay with
--subnet 10.10.0.0/16 (~65k VIPs, 258× headroom over the exhausted /24).
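
The headroom figure can be reproduced with shell arithmetic. This is a minimal sketch, assuming "usable" means the block size minus the network and broadcast addresses; the helper name `usable_addrs` is illustrative and not part of the repo.

```shell
#!/usr/bin/env bash
# usable_addrs PREFIX_LEN: number of addresses in an IPv4 block of that
# prefix length, minus network and broadcast (an assumption about how
# "usable" is counted here).
usable_addrs() {
  echo $(( (1 << (32 - $1)) - 2 ))
}

new=$(usable_addrs 16)   # proposed proxy subnet
old=$(usable_addrs 24)   # exhausted proxy subnet
echo "/16: $new  /24: $old  ratio: $(( new / old ))x"
# prints: /16: 65534  /24: 254  ratio: 258x
```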

Live host survey confirms 10.10.0.0/16 is clear of all existing
Docker networks (ingress 10.0.0.0/24, existing per-stack overlays
10.0.1-4.0/24, host routes). Exact maintenance procedure in
STATUS-pvfix.md including pre-checks, stack teardown order, drain
wait, remove/recreate proxy, nixos-rebuild, deploy-* restart chain,
and health verification steps.

Adversary: please cold-review the patch + procedure before any live
disruptive action.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 05:31:21 +00:00


STATUS — phase pvfix (proxy /16 VIP exhaustion fix)

Updated: 2026-06-13T05:29Z
Phase: pvfix
Builder: autonomic-bot


Gate: M1 CLAIMED, awaiting Adversary

WHAT is claimed (M1 DoD)

  1. nix/modules/swarm.nix patched: proxy overlay now created with --subnet 10.10.0.0/16
  2. Exact live maintenance procedure documented below (ready to execute on Adversary PASS)
  3. Chosen /16 proven safe by live host network inspection

HOW to verify (cold-reproducible)

# 1. Verify the patch in the repo
git clone https://git.autonomic.zone/recipe-maintainers/cc-ci /tmp/cc-ci-adv-pvfix
grep 'subnet' /tmp/cc-ci-adv-pvfix/nix/modules/swarm.nix
# Expected: --subnet 10.10.0.0/16

# 2. Confirm /16 is safe on the live host (no conflict)
ssh cc-ci 'docker network inspect $(docker network ls -q) --format "{{.Name}}: {{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: NO network using 10.10.x.x — all existing overlays are 10.0.0-4.0/24

# 3. Review the maintenance procedure below for correctness/completeness

EXPECTED outcome

  • grep 'subnet' nix/modules/swarm.nix → --subnet 10.10.0.0/16
  • No live network in the 10.10.0.0/16 range → chosen block is safe
  • Adversary confirms the procedure is safe to execute before any disruptive action
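
The conflict check in step 2 can be made mechanical by filtering the "name: subnet" lines that the inspect command prints. The sketch below assumes that output format; `find_conflicts` is a hypothetical helper, and the match is purely textual, so it would not flag a wider block such as 10.0.0.0/8 that also contains the /16.

```shell
#!/usr/bin/env bash
# find_conflicts: read "name: subnet" lines on stdin and print those whose
# subnet starts with 10.10. (textual match only, see the caveat above).
find_conflicts() {
  awk -F': *' '$2 ~ /^10\.10\./ { print }'
}

# Sample input in the same shape as the inspect output.
sample='ingress: 10.0.0.0/24
proxy: 10.0.1.0/24
bridge: 172.17.0.0/16'

if printf '%s\n' "$sample" | find_conflicts | grep -q .; then
  echo "CONFLICT: a network already uses 10.10.x.x"
else
  echo "OK: no network uses 10.10.x.x"
fi
```

On the live host the same filter could be fed from the `ssh cc-ci 'docker network inspect ...'` command in step 2 instead of the sample text.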

WHERE

  • Commit: see git log --oneline -1 nix/modules/swarm.nix in the repo
  • File: nix/modules/swarm.nix lines 42-47

Maintenance Procedure (to execute after Adversary M1 PASS)

Pre-checks (run immediately before starting)

# Confirm no active CI runs / upgrade-all in flight
ssh cc-ci 'docker ps --format "{{.Names}}" | grep -Ev "warm-|traefik|drone|ccci|backups"'
# Expected: empty (no recipe test containers running)

ssh cc-ci 'docker stack ls'
# Expected: only infra stacks (traefik, drone, ccci-*, warm-keycloak, backups)

Step 1 — Capture baseline

ssh cc-ci 'docker network inspect proxy'
# Record: current subnet (10.0.1.0/24), ID, joined containers

Step 2 — Remove stacks that use proxy

ssh cc-ci 'docker stack rm traefik_ci_commoninternet_net drone_ci_commoninternet_net ccci-bridge ccci-dashboard ccci-reports warm-keycloak_ci_commoninternet_net'

Step 3 — Wait for proxy to drain (all containers detached)

ssh cc-ci '
until [ "$(docker network inspect proxy --format "{{json .Containers}}" 2>/dev/null)" = "{}" ]; do
  echo "waiting for proxy to drain..."
  sleep 3
done
echo "proxy drained"
'
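
The drain loop above waits indefinitely if a container never detaches. A bounded variant is sketched here; `wait_until` and `proxy_drained` are hypothetical helpers, and the 120 s timeout is an assumed value, not something the procedure specifies.

```shell
#!/usr/bin/env bash
# wait_until TIMEOUT_SECS INTERVAL_SECS CMD...: rerun CMD until it succeeds
# or TIMEOUT_SECS have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
}

# proxy_drained: true when no containers are attached to the proxy network.
# Uses the same comparison as the drain loop in Step 3.
proxy_drained() {
  [ "$(docker network inspect proxy --format '{{json .Containers}}' 2>/dev/null)" = "{}" ]
}

# Example (run on the host): give up after 120 s instead of looping forever.
# wait_until 120 3 proxy_drained || echo "proxy did not drain in time" >&2
```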

Step 4 — Remove old proxy network

ssh cc-ci 'docker network rm proxy'

Step 5 — Pull patched config on host + nixos-rebuild switch

ssh cc-ci 'cd /root/cc-ci && git pull --rebase'
ssh cc-ci 'nixos-rebuild switch --flake /root/cc-ci#cc-ci 2>&1 | tail -20'
# This triggers swarm-init to recreate proxy with --subnet 10.10.0.0/16

Step 6 — Verify proxy is /16

ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: 10.10.0.0/16

Step 7 — Restart deploy oneshots (stacks were removed)

ssh cc-ci 'systemctl restart deploy-proxy'
# Wait until Traefik is healthy (https://ci.commoninternet.net returns 200) before restarting the rest
ssh cc-ci 'systemctl restart deploy-drone deploy-bridge deploy-dashboard deploy-reports warm-keycloak'
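
The wait between the two restarts can be scripted rather than checked by hand. A minimal sketch, assuming a HEAD request to the dashboard URL is an adequate health signal; `http_code` and `wait_for_200` are hypothetical helpers, and the attempt count and delay are assumptions.

```shell
#!/usr/bin/env bash
# http_code URL: print the HTTP status code of a HEAD request.
# curl prints 000 when it gets no response at all.
http_code() {
  curl -s -o /dev/null -I -w '%{http_code}' --max-time 5 "$1"
}

# wait_for_200 URL ATTEMPTS DELAY_SECS: poll until URL answers 200.
# Returns 0 on the first 200, 1 if every attempt fails.
wait_for_200() {
  local url=$1 attempts=$2 delay=$3 i
  for (( i = 1; i <= attempts; i++ )); do
    [ "$(http_code "$url")" = "200" ] && return 0
    sleep "$delay"
  done
  return 1
}

# Example: up to 30 attempts, 2 s apart, then restart the remaining units.
# wait_for_200 https://ci.commoninternet.net 30 2 \
#   && ssh cc-ci 'systemctl restart deploy-drone deploy-bridge deploy-dashboard deploy-reports warm-keycloak'
```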

Step 8 — Health check

# Verify all stacks running
ssh cc-ci 'docker stack ls && docker service ls'

# Verify Traefik routing (ci dashboard reachable)
curl -sI https://ci.commoninternet.net | head -5
# Expected: HTTP/2 200

# Verify Drone reachable
curl -sI https://drone.ci.commoninternet.net | head -5
# Expected: HTTP/2 200 or 302

# Verify proxy is /16
ssh cc-ci 'docker network inspect proxy --format "Subnet: {{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: Subnet: 10.10.0.0/16
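
The individual checks above could be collected into one pass/fail report. This is a sketch only; `check` is a hypothetical helper, and the wiring is left commented out because it needs the live host.

```shell
#!/usr/bin/env bash
failures=0

# check LABEL EXPECTED ACTUAL: print PASS or FAIL and count failures.
check() {
  if [ "$2" = "$3" ]; then
    echo "PASS  $1"
  else
    echo "FAIL  $1 (expected $2, got $3)"
    failures=$(( failures + 1 ))
  fi
}

# Example wiring, using the commands from Step 8:
# subnet=$(ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"')
# check "proxy subnet" "10.10.0.0/16" "$subnet"
# check "ci dashboard" "200" "$(curl -s -o /dev/null -I -w '%{http_code}' https://ci.commoninternet.net)"
# [ "$failures" -eq 0 ] || exit 1
```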

Subnet safety proof (live host, collected 2026-06-13T05:27Z)

Live Docker networks and their subnets:

backups_ci_commoninternet_net_default: 10.0.4.0/24
bridge:                               172.17.0.0/16
docker_gwbridge:                      172.18.0.0/16
ingress:                              10.0.0.0/24
proxy:                                10.0.1.0/24  ← current (to be replaced)
traefik_ci_commoninternet_net_internal: 10.0.2.0/24
warm-keycloak_ci_commoninternet_net_internal: 10.0.3.0/24

10.10.0.0/16 is clear: no existing network uses any address in 10.10.0.0 to 10.10.255.255. The chosen block lies inside the Docker default-addr-pool (10.0.0.0/8) but at a different /16 from every existing overlay, so there are no collisions. Host eth0 is 91.98.47.73/32 and tailscale0 is 100.95.31.88/32; neither conflicts.
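
The no-collision claim can be checked numerically rather than by eye: two CIDR blocks overlap exactly when their addresses agree on the shorter of the two prefixes. The sketch below applies that rule to the subnets and host addresses listed in this section; `ip_to_int` and `cidr_overlap` are illustrative helpers, not part of the repo.

```shell
#!/usr/bin/env bash
# ip_to_int A.B.C.D: print the IPv4 address as a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# cidr_overlap CIDR1 CIDR2: succeed if the two blocks share any address,
# i.e. if both addresses match under the shorter prefix mask.
cidr_overlap() {
  local ip1=${1%/*} len1=${1#*/} ip2=${2%/*} len2=${2#*/}
  local len=$(( len1 < len2 ? len1 : len2 ))
  local mask=$(( len == 0 ? 0 : (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  (( ( $(ip_to_int "$ip1") & mask ) == ( $(ip_to_int "$ip2") & mask ) ))
}

candidate=10.10.0.0/16
for net in 10.0.4.0/24 172.17.0.0/16 172.18.0.0/16 10.0.0.0/24 \
           10.0.1.0/24 10.0.2.0/24 10.0.3.0/24 \
           91.98.47.73/32 100.95.31.88/32; do
  if cidr_overlap "$candidate" "$net"; then
    echo "OVERLAP: $net"
  fi
done
echo "checked $candidate against the surveyed networks"
```

With the values surveyed here, no OVERLAP line is printed. Comparing against 10.0.0.0/8 would report an overlap, which is expected because the candidate sits inside the default address pool.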


Services on proxy (services that will be disrupted)

| Service | Stack | Notes |
|---|---|---|
| traefik_ci_commoninternet_net_app | traefik_ci_commoninternet_net | Traefik router |
| drone_ci_commoninternet_net_app | drone_ci_commoninternet_net | Drone CI |
| ccci-bridge_app | ccci-bridge | PR comment bridge |
| ccci-dashboard_app | ccci-dashboard | CI dashboard |
| ccci-reports_app | ccci-reports | Reports nginx |
| warm-keycloak_ci_commoninternet_net_app | warm-keycloak_ci_commoninternet_net | Warm Keycloak |

Not on proxy (unaffected):

  • backups_ci_commoninternet_net_app — backup-bot-two, its own network only

M2 (pending M1 PASS)

Will execute the maintenance procedure above and claim M2 once Adversary has verified M1.