cc-ci/machine-docs/STATUS-pvfix.md
autonomic-bot 71319d7096
claim(pvfix-M2): proxy recreated as /16 — all routes healthy
Live maintenance executed 2026-06-13T05:33–05:46Z:
- Removed 6 stacks from proxy (traefik, drone, bridge, dashboard, reports, warm-keycloak)
- Waited for proxy to drain, removed old 10.0.1.0/24 network
- nixos-rebuild switch with git+file:///?submodules=1 → swarm-init restarted
- proxy recreated: Subnet 10.10.0.0/16, gateway 10.10.0.1
- All 9 swarm services running 1/1
- ci.commoninternet.net → HTTP/2 200; drone → 303

Adversary: verify from host that proxy subnet is /16 and routes healthy.
Full evidence in STATUS-pvfix.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 05:47:04 +00:00


STATUS — phase pvfix (proxy /16 VIP exhaustion fix)

Updated: 2026-06-13T05:46Z
Phase: pvfix
Builder: autonomic-bot


Gate: M2 CLAIMED, awaiting Adversary

WHAT is claimed (M2 DoD)

  1. Live proxy overlay network recreated as 10.10.0.0/16 (was 10.0.1.0/24)
  2. NixOS configuration switched via nixos-rebuild switch with the patched swarm.nix
  3. All control-plane services healthy post-maintenance: traefik, drone, bridge, dashboard, reports, warm-keycloak
  4. Core routes reachable: ci.commoninternet.net → HTTP/2 200, drone.ci.commoninternet.net → HTTP/2 303
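The point of item 1 is address capacity. As a rough sketch, the arithmetic below compares the old and new subnets. The helper name `usable_hosts` is hypothetical, and the figures are plain IPv4 host counts. Docker also reserves a few addresses per overlay network (the gateway, for example), so the real number available for service VIPs and tasks is slightly lower.

```shell
# usable_hosts PREFIX: number of assignable IPv4 addresses in a subnet with
# the given prefix length (total addresses minus network and broadcast).
# Hypothetical helper for illustration; meaningful for prefixes up to /30.
usable_hosts() {
  prefix=$1
  echo $(( (1 << (32 - prefix)) - 2 ))
}

usable_hosts 24   # old proxy subnet 10.0.1.0/24  -> 254
usable_hosts 16   # new proxy subnet 10.10.0.0/16 -> 65534
```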

HOW to verify (cold-reproducible from Adversary clone)

# 1. Verify proxy subnet on live host
ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: 10.10.0.0/16

# 2. Verify all services running
ssh cc-ci 'docker service ls --format "{{.Name}} {{.Replicas}}"'
# Expected: all services show 1/1

# 3. Verify swarm-init ran with new script (check activation time)
ssh cc-ci 'systemctl status swarm-init --no-pager | grep Active'
# Expected: active (exited), activated ~2026-06-13T05:38Z

# 4. Verify core routes
curl -sI https://ci.commoninternet.net/ | head -1   # Expected: HTTP/2 200
curl -sI https://drone.ci.commoninternet.net/ | head -1  # Expected: HTTP/2 200 or 303

# 5. Verify NixOS config has the patch (on host)
ssh cc-ci 'grep subnet /nix/store/$(basename $(readlink -f /run/current-system/sw/share)/../..)/nix/modules/swarm.nix 2>/dev/null || cat /run/current-system/sw/share/nixos/.source/nix/modules/swarm.nix | grep subnet || true'

EXPECTED outcome

  • docker network inspect proxy subnet → 10.10.0.0/16
  • All 9 swarm services running 1/1
  • ci.commoninternet.net → 200, drone.ci.commoninternet.net → 200 or 303
  • systemctl status swarm-init activated ~05:38 today (2026-06-13)
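The "all services 1/1" expectation can be checked mechanically instead of by eye. The following is a minimal sketch, assuming the `NAME REPLICAS` output format from verification step 2. The function name `check_replicas` is hypothetical and not part of the repository.

```shell
# check_replicas: read "NAME REPLICAS" lines on stdin (for example "drone_app 1/1").
# Prints every service whose running count differs from its desired count and
# returns non-zero if at least one such service exists.
check_replicas() {
  awk '{
    split($2, r, "/")
    if (r[1] != r[2]) { print "NOT READY: " $1 " " $2; bad = 1 }
  } END { exit bad + 0 }'
}

# Possible usage from the Adversary clone (requires ssh access to cc-ci):
# ssh cc-ci 'docker service ls --format "{{.Name}} {{.Replicas}}"' | check_replicas
```

A non-zero exit status means at least one service is not at its desired replica count; the offending lines are printed for the review record.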

WHERE (evidence)

Proxy network (live host, collected 2026-06-13T05:46Z):

ID: ki2awmlob4pw629bxevygmk8x
Subnet: 10.10.0.0/16
Gateway: 10.10.0.1
Created: 2026-06-13 05:38:02.125154677 +0000 UTC

Service state (all 1/1):

backups_ci_commoninternet_net_app            1/1
ccci-bridge_app                              1/1
ccci-dashboard_app                           1/1
ccci-reports_app                             1/1
drone_ci_commoninternet_net_app              1/1
traefik_ci_commoninternet_net_app            1/1
traefik_ci_commoninternet_net_socket-proxy   1/1
warm-keycloak_ci_commoninternet_net_app      1/1
warm-keycloak_ci_commoninternet_net_db       1/1

Route health (from orchestrator VM, 2026-06-13T05:45Z):

ci.commoninternet.net → HTTP/2 200
drone.ci.commoninternet.net → HTTP/2 303

Commit with patch: e6349a9, nix/modules/swarm.nix line 47:

docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy

M1 — PASS (Adversary, 2026-06-13T05:33Z)

grep -n 'subnet' nix/modules/swarm.nix
→ 47: docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy

Patch verified, subnet safe, procedure reviewed. See REVIEW-pvfix.md.


Maintenance window executed (2026-06-13T05:33–05:46Z)

Sequence executed:

  1. Pre-flight: confirmed no active recipe test containers; all stacks infra-only
  2. Removed stacks on proxy: traefik, drone, ccci-bridge, ccci-dashboard, ccci-reports, warm-keycloak
  3. Drained proxy (watched containers → {})
  4. docker network rm proxy → removed
  5. Pulled patched config into /root/builder-clone, resolved stale untracked files
  6. nixos-rebuild switch --flake "git+file:///root/builder-clone?submodules=1#cc-ci" → success
  7. systemctl restart swarm-init → proxy recreated as 10.10.0.0/16
  8. systemctl restart deploy-proxy → traefik deployed; health gate deadlock broken by starting deploy-dashboard concurrently
  9. systemctl start deploy-drone deploy-bridge deploy-dashboard deploy-reports
  10. systemctl start warm-keycloak
  11. All services healthy; routes confirmed
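Steps 3 and 4 depend on the network having no attached containers before it is removed. The sketch below shows one way to script that wait. The function `wait_drained` and its defaults are hypothetical, and the probe command is passed in as a string so the loop can be exercised without Docker. This is not the procedure that was actually run, which watched the container list manually.

```shell
# wait_drained PROBE_CMD [TRIES] [DELAY]
# Runs PROBE_CMD repeatedly until it prints exactly "{}" (an empty container
# map), sleeping DELAY seconds between attempts. Returns 0 once drained and
# 1 if TRIES attempts pass without seeing "{}".
wait_drained() {
  probe=$1
  tries=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if [ "$(eval "$probe")" = "{}" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Possible usage on the host (assumes the network is named proxy):
# wait_drained "docker network inspect proxy --format '{{json .Containers}}'" 60 2 \
#   && docker network rm proxy
```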

Anomaly note (for Adversary): The deploy-proxy health gate checks ci.commoninternet.net and expects HTTP 200, but the dashboard that serves that route is ordered after deploy-proxy. On a fresh-from-scratch boot this is a potential ordering issue: the gate cannot pass until the dashboard is deployed. Workaround used: started deploy-dashboard concurrently during deploy-proxy's wait_healthy retry window. This matches normal-boot behavior, where all WantedBy=multi-user.target services are started together, subject to their ordering constraints. The health gate passed about 20s after deploy-dashboard was started.
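To make the retry window concrete, here is a hedged sketch of a gate of this kind. It is not the actual wait_healthy from the deploy units, whose source is not shown here. The function bodies, argument order, and defaults below are assumptions for illustration only.

```shell
# status_code: read an HTTP status line on stdin ("HTTP/2 200",
# "HTTP/1.1 303 See Other") and print only the numeric code.
status_code() {
  awk 'NR == 1 { print $2 }' | tr -d '\r'
}

# wait_healthy FETCH_CMD WANT [TRIES] [DELAY]
# Hypothetical stand-in for the gate: retries FETCH_CMD until the status code
# in its first output line equals WANT. Returns 1 when TRIES is exhausted.
wait_healthy() {
  fetch=$1
  want=$2
  tries=${3:-30}
  delay=${4:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    code=$(eval "$fetch" 2>/dev/null | status_code)
    if [ "$code" = "$want" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Possible usage:
# wait_healthy "curl -sI https://ci.commoninternet.net/" 200
```

With a gate shaped like this, the route only has to start answering 200 before the retries run out, which is why starting deploy-dashboard inside the window was enough to unblock deploy-proxy.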