cc-ci/machine-docs/STATUS-pvfix.md
autonomic-bot 71319d7096
claim(pvfix-M2): proxy recreated as /16 — all routes healthy
Live maintenance executed 2026-06-13T05:33–05:46Z:
- Removed 6 stacks from proxy (traefik, drone, bridge, dashboard, reports, warm-keycloak)
- Waited for proxy to drain, removed old 10.0.1.0/24 network
- nixos-rebuild switch with git+file:///?submodules=1 → swarm-init restarted
- proxy recreated: Subnet 10.10.0.0/16, gateway 10.10.0.1
- All 9 swarm services running 1/1
- ci.commoninternet.net → HTTP/2 200; drone → 303

Adversary: verify from host that proxy subnet is /16 and routes healthy.
Full evidence in STATUS-pvfix.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 05:47:04 +00:00

# STATUS — phase pvfix (proxy /16 VIP exhaustion fix)
**Updated:** 2026-06-13T05:46Z
**Phase:** pvfix
**Builder:** autonomic-bot
---
## Gate: M2 CLAIMED, awaiting Adversary
### WHAT is claimed (M2 DoD)
1. Live `proxy` overlay network recreated as `10.10.0.0/16` (was `10.0.1.0/24`)
2. NixOS configuration switched via `nixos-rebuild switch` with the patched `swarm.nix`
3. All control-plane services healthy post-maintenance: traefik, drone, bridge, dashboard, reports, warm-keycloak
4. Core routes reachable: `ci.commoninternet.net` → HTTP/2 200, `drone.ci.commoninternet.net` → HTTP/2 303
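The capacity gap between the old and new subnets can be made concrete. In Swarm's default VIP endpoint mode, each service attached to an overlay network takes a virtual IP on it in addition to one address per task, so a small subnet can fill up quickly. A minimal sketch of the arithmetic, where `cidr_size` is a hypothetical helper and not part of the cc-ci repo:

```bash
#!/usr/bin/env bash
# cidr_size: print the total number of IPv4 addresses in a CIDR block.
# Hypothetical helper for illustration; not part of the cc-ci repo.
cidr_size() {
  local prefix="${1##*/}"          # text after the slash, e.g. "16"
  echo $(( 1 << (32 - prefix) ))   # 2^(32 - prefix) addresses
}

cidr_size 10.0.1.0/24    # old proxy subnet, prints 256
cidr_size 10.10.0.0/16   # new proxy subnet, prints 65536
```

The count includes the network, gateway, and broadcast addresses, so the usable pool is slightly smaller in both cases.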
### HOW to verify (cold-reproducible from Adversary clone)
```bash
# 1. Verify proxy subnet on live host
ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: 10.10.0.0/16
# 2. Verify all services running
ssh cc-ci 'docker service ls --format "{{.Name}} {{.Replicas}}"'
# Expected: all services show 1/1
# 3. Verify swarm-init ran with new script (check activation time)
ssh cc-ci 'systemctl status swarm-init --no-pager | grep Active'
# Expected: active (exited), activated ~2026-06-13T05:38Z
# 4. Verify core routes
curl -sI https://ci.commoninternet.net/ | head -1 # Expected: HTTP/2 200
curl -sI https://drone.ci.commoninternet.net/ | head -1 # Expected: HTTP/2 200 or 303
# 5. Verify NixOS config has the patch (on host)
ssh cc-ci 'grep subnet /nix/store/$(basename $(readlink -f /run/current-system/sw/share)/../..)/nix/modules/swarm.nix 2>/dev/null || cat /run/current-system/sw/share/nixos/.source/nix/modules/swarm.nix | grep subnet || true'
```
### EXPECTED outcome
- `docker network inspect proxy` subnet → `10.10.0.0/16`
- All 9 swarm services running 1/1
- `ci.commoninternet.net` → 200, `drone.ci.commoninternet.net` → 200 or 303
- `systemctl status swarm-init` activated ~05:38 today (2026-06-13)
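The "all services 1/1" expectation can be checked mechanically instead of by eye. A minimal sketch, assuming the `NAME RUNNING/DESIRED` line format produced by the `docker service ls --format` command in step 2; `check_replicas` is a hypothetical helper, not part of the cc-ci repo:

```bash
#!/usr/bin/env bash
# check_replicas: read "NAME RUNNING/DESIRED" lines on stdin and report any
# service whose running count differs from its desired count.
# Exits 0 when every service is fully replicated, 1 otherwise.
# Hypothetical helper for illustration; not part of the cc-ci repo.
check_replicas() {
  awk '
    NF >= 2 {
      split($2, r, "/")                      # r[1] = running, r[2] = desired
      if (r[1] != r[2]) { print "DEGRADED " $1 " " $2; bad = 1 }
    }
    END { exit bad + 0 }
  '
}

# Sample input standing in for live output (the 0/1 line is made up):
printf '%s\n' 'ccci-bridge_app 1/1' 'ccci-dashboard_app 0/1' \
  | check_replicas || echo "not all services healthy"
```

Against the live host, the same function could be fed from `ssh cc-ci 'docker service ls --format "{{.Name}} {{.Replicas}}"' | check_replicas`.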
### WHERE (evidence)
**Proxy network (live host, collected 2026-06-13T05:46Z):**
```
ID: ki2awmlob4pw629bxevygmk8x
Subnet: 10.10.0.0/16
Gateway: 10.10.0.1
Created: 2026-06-13 05:38:02.125154677 +0000 UTC
```
**Service state (all 1/1):**
```
backups_ci_commoninternet_net_app 1/1
ccci-bridge_app 1/1
ccci-dashboard_app 1/1
ccci-reports_app 1/1
drone_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_socket-proxy 1/1
warm-keycloak_ci_commoninternet_net_app 1/1
warm-keycloak_ci_commoninternet_net_db 1/1
```
**Route health (from orchestrator VM, 2026-06-13T05:45Z):**
```
ci.commoninternet.net → HTTP/2 200
drone.ci.commoninternet.net → HTTP/2 303
```
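Because the drone route is accepted with either 200 or 303, a route check has to compare the status code against a set rather than a single value. A minimal sketch, assuming the status line comes from `curl -sI ... | head -1` as in the verification steps; `status_allowed` is a hypothetical helper, not part of the cc-ci repo:

```bash
#!/usr/bin/env bash
# status_allowed: succeed if an HTTP status line carries one of the allowed
# status codes. Usage: status_allowed "<status line>" CODE [CODE...]
# Hypothetical helper for illustration; not part of the cc-ci repo.
status_allowed() {
  local line="$1"; shift
  local code want
  # Strip the trailing carriage return curl leaves on header lines,
  # then take the second whitespace-separated field (the status code).
  code="$(printf '%s' "$line" | tr -d '\r' | awk '{print $2}')"
  for want in "$@"; do
    if [ "$code" = "$want" ]; then
      return 0
    fi
  done
  return 1
}

status_allowed "HTTP/2 303" 200 303 && echo "drone route ok"
```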
**Commit with patch:** `e6349a9`, `nix/modules/swarm.nix` line 47:
```
docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
---
## M1 — PASS (Adversary, 2026-06-13T05:33Z)
```
grep -n 'subnet' nix/modules/swarm.nix
→ 47: docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
Patch verified, subnet safe, procedure reviewed. See REVIEW-pvfix.md.
---
## Maintenance window executed (2026-06-13T05:33–05:46Z)
**Sequence executed:**
1. Pre-flight: confirmed no active recipe test containers; all stacks infra-only
2. Removed stacks on proxy: traefik, drone, ccci-bridge, ccci-dashboard, ccci-reports, warm-keycloak
3. Drained proxy (watched containers → `{}`)
4. `docker network rm proxy` → removed
5. Pulled patched config into `/root/builder-clone`, resolved stale untracked files
6. `nixos-rebuild switch --flake "git+file:///root/builder-clone?submodules=1#cc-ci"` → success
7. `systemctl restart swarm-init` → proxy recreated as `10.10.0.0/16`
8. `systemctl restart deploy-proxy` → traefik deployed; health gate deadlock broken by starting `deploy-dashboard` concurrently
9. `systemctl start deploy-drone deploy-bridge deploy-dashboard deploy-reports`
10. `systemctl start warm-keycloak`
11. All services healthy; routes confirmed
**Anomaly note (for Adversary):** The `deploy-proxy` health gate checks `ci.commoninternet.net` (expects 200), but the dashboard, which serves that route, is ordered AFTER `deploy-proxy`. On a fresh-from-scratch boot this creates a potential ordering issue. Workaround used: started `deploy-dashboard` concurrently during `deploy-proxy`'s `wait_healthy` retry window. This matches normal-boot behavior (all `WantedBy=multi-user.target` services start concurrently, subject to their ordering constraints). The health gate passed once the dashboard was deployed (~20s after starting it).
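The workaround relies on the health gate retrying: the check keeps failing until something serves the route, then passes on the next attempt. A minimal sketch of such a retry window, assuming a fixed delay between attempts; `retry_until_ok` is a hypothetical helper, and the repo's actual `wait_healthy` may behave differently:

```bash
#!/usr/bin/env bash
# retry_until_ok: run a command up to MAX times, sleeping DELAY seconds
# between attempts, and succeed as soon as the command does.
# Usage: retry_until_ok MAX DELAY command [args...]
# Hypothetical helper; not the repo's wait_healthy implementation.
retry_until_ok() {
  local max="$1" delay="$2"; shift 2
  local attempt
  for (( attempt = 1; attempt <= max; attempt++ )); do
    if "$@"; then
      return 0
    fi
    # No need to sleep after the final failed attempt.
    if [ "$attempt" -lt "$max" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (attempt count and delay are assumptions, not the real settings):
# retry_until_ok 30 2 curl -fsS -o /dev/null https://ci.commoninternet.net/
```

With a gate of this shape, a dependency started during the retry window is enough to unblock it, which is consistent with the roughly 20 s delay observed above.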