status(pvfix): ## DONE — M1+M2 PASS, proxy live as /16

Both gates Adversary-verified 2026-06-13:
- M1 PASS @05:33Z: patch + procedure cold-verified
- M2 PASS @05:49Z: live host confirmed 10.10.0.0/16, all 9 services 1/1, routes healthy

Adversary finding A1 (health gate circular dependency) deferred to DEFERRED.md —
pre-existing D8 risk, not introduced by pvfix, not a VETO.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
autonomic-bot
2026-06-13 05:52:18 +00:00
parent ccd93da65c
commit 9b3e77a57f
2 changed files with 39 additions and 92 deletions

@ -408,3 +408,17 @@ behavior (not introduced or worsened by lvl5; Adversary concurs it is not a find
reachable via the operator/dev STAGES escape — production drone runs always run all stages.
**Needed from operator:** decide whether promote should additionally require the full stage set
(one-line guard in `should_promote_canonical`), or whether dev hand-runs promoting is acceptable.
### 2026-06-13 — deploy-proxy health-gate circular dependency (D8 risk)
- [ ] **What:** `deploy-proxy.service` health gate waits for `ci.commoninternet.net → 200`, served by
`deploy-dashboard.service` which is ordered `After=deploy-proxy.service`. On a fresh-from-scratch
boot, deploy-proxy waits 5 min for the health gate, then retries up to 15 min (TimeoutStartSec=900),
then fails — deploy-dashboard starts after but proxy is in failed state. Filed as A1 by the Adversary
(2026-06-13, phase pvfix). See `machine-docs/BACKLOG-pvfix.md`.
- **Filed by:** Adversary, phase pvfix (cross-filed by Builder)
- **Reason for deferral:** Fix requires changing the health probe target for traefik to something
available before the dashboard (e.g. a Traefik-internal health path like `https://traefik.ci.commoninternet.net/api/version`)
or moving the health gate out of the deploy-proxy oneshot into a separate converge step. Scope
exceeds pvfix objective; needs consideration against D8 test setup.
- **Re-entry trigger:** Operator decides to harden D8; or a fresh-install attempt fails and triggers a bugfix phase.
- **Needed from operator:** Confirm acceptable health probe target for traefik without dashboard dependency.
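
A minimal sketch of the kind of bounded health probe the deferral discusses, assuming the gate only needs an HTTP 200 from a target that does not depend on the dashboard. The function name `wait_for_200` and its parameters are hypothetical and are not taken from the `deploy-proxy` unit; whether the Traefik API path mentioned above is actually exposed on this host is unconfirmed.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: poll a URL until it answers HTTP 200 or attempts run out.
# wait_for_200, max_attempts, and sleep_secs are illustrative names only.
wait_for_200() {
  local url="$1" max_attempts="${2:-30}" sleep_secs="${3:-10}"
  local attempt code
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    # -s: silent, -o /dev/null: discard body, -w: print only the status code
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)"
    if [[ "$code" == "200" ]]; then
      echo "healthy after ${attempt} attempt(s)"
      return 0
    fi
    sleep "$sleep_secs"
  done
  echo "not healthy after ${max_attempts} attempts (last status: ${code:-none})" >&2
  return 1
}
```

If the operator confirms a dashboard-independent target, the call might look like `wait_for_200 "https://traefik.ci.commoninternet.net/api/version" 30 10`. Keeping the total wait (attempts times sleep) below `TimeoutStartSec` lets the function report the failure itself instead of being killed by systemd.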

@ -1,109 +1,42 @@
# STATUS — phase pvfix (proxy /16 VIP exhaustion fix)
**Updated:** 2026-06-13T05:46Z
**Updated:** 2026-06-13T05:53Z
**Phase:** pvfix
**Builder:** autonomic-bot
---
## Gate: M2 CLAIMED, awaiting Adversary
## DONE
### WHAT is claimed (M2 DoD)
Both gates have fresh Adversary PASSes (dated 2026-06-13, within 24h).
1. Live `proxy` overlay network recreated as `10.10.0.0/16` (was `10.0.1.0/24`)
2. NixOS configuration switched via `nixos-rebuild switch` with the patched `swarm.nix`
3. All control-plane services healthy post-maintenance: traefik, drone, bridge, dashboard, reports, warm-keycloak
4. Core routes reachable: `ci.commoninternet.net` → HTTP/2 200, `drone.ci.commoninternet.net` → HTTP/2 303
### Evidence
### HOW to verify (cold-reproducible from Adversary clone)
| Check | Result |
|---|---|
| M1 PASS (patch + procedure) | 2026-06-13T05:33Z — see REVIEW-pvfix.md |
| M2 PASS (live fix + health) | 2026-06-13T05:49Z — see REVIEW-pvfix.md |
| `proxy` subnet on host | `10.10.0.0/16` (was `10.0.1.0/24`) |
| All 9 swarm services | 1/1 |
| `ci.commoninternet.net` | HTTP/2 200 |
| `drone.ci.commoninternet.net` | HTTP/2 303 |
| `nix/modules/swarm.nix` commit | `e6349a9`: `--subnet 10.10.0.0/16` |
| nixos-rebuild applied | swarm-init activated 2026-06-13T05:38:17 UTC |
```bash
# 1. Verify proxy subnet on live host
ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: 10.10.0.0/16
### Adversary finding A1
# 2. Verify all services running
ssh cc-ci 'docker service ls --format "{{.Name}} {{.Replicas}}"'
# Expected: all services show 1/1
# 3. Verify swarm-init ran with new script (check activation time)
ssh cc-ci 'systemctl status swarm-init --no-pager | grep Active'
# Expected: active (exited), activated ~2026-06-13T05:38Z
# 4. Verify core routes
curl -sI https://ci.commoninternet.net/ | head -1 # Expected: HTTP/2 200
curl -sI https://drone.ci.commoninternet.net/ | head -1 # Expected: HTTP/2 200 or 303
# 5. Verify NixOS config has the patch (on host)
ssh cc-ci 'grep subnet /nix/store/$(basename $(readlink -f /run/current-system/sw/share)/../..)/nix/modules/swarm.nix 2>/dev/null || cat /run/current-system/sw/share/nixos/.source/nix/modules/swarm.nix | grep subnet || true'
```
### EXPECTED outcome
- `docker network inspect proxy` subnet → `10.10.0.0/16`
- All 9 swarm services running 1/1
- `ci.commoninternet.net` → 200, `drone.ci.commoninternet.net` → 200 or 303
- `systemctl status swarm-init` activated ~05:38 today (2026-06-13)
### WHERE (evidence)
**Proxy network (live host, collected 2026-06-13T05:46Z):**
```
ID: ki2awmlob4pw629bxevygmk8x
Subnet: 10.10.0.0/16
Gateway: 10.10.0.1
Created: 2026-06-13 05:38:02.125154677 +0000 UTC
```
**Service state (all 1/1):**
```
backups_ci_commoninternet_net_app 1/1
ccci-bridge_app 1/1
ccci-dashboard_app 1/1
ccci-reports_app 1/1
drone_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_socket-proxy 1/1
warm-keycloak_ci_commoninternet_net_app 1/1
warm-keycloak_ci_commoninternet_net_db 1/1
```
**Route health (from orchestrator VM, 2026-06-13T05:45Z):**
```
ci.commoninternet.net → HTTP/2 200
drone.ci.commoninternet.net → HTTP/2 303
```
**Commit with patch:** `e6349a9`, `nix/modules/swarm.nix` line 47:
```
docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
Filed by Adversary (2026-06-13, pvfix): `deploy-proxy` health gate circular dependency on fresh boot.
Pre-existing issue (not introduced by pvfix), D8 risk. Not a VETO on pvfix DONE.
Deferred to `machine-docs/DEFERRED.md` (entry: `2026-06-13 — deploy-proxy health-gate circular dependency`).
---
## M1 — PASS (Adversary, 2026-06-13T05:33Z)
## Definition-of-Done checklist (pvfix)
```
grep -n 'subnet' nix/modules/swarm.nix
→ 47: docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
Patch verified, subnet safe, procedure reviewed. See REVIEW-pvfix.md.
- [x] `proxy` is explicitly configured and live as a `/16`
- [x] The change is committed and pushed to cc-ci (`e6349a9`)
- [x] Core routes are healthy after the maintenance action
- [x] Adversary has signed off on M1 in `machine-docs/REVIEW-pvfix.md`
- [x] Adversary has signed off on M2 in `machine-docs/REVIEW-pvfix.md`
---
## Maintenance window executed (2026-06-13T05:33–05:46Z)
**Sequence executed:**
1. Pre-flight: confirmed no active recipe test containers; all stacks infra-only
2. Removed stacks on proxy: traefik, drone, ccci-bridge, ccci-dashboard, ccci-reports, warm-keycloak
3. Drained proxy (watched containers → `{}`)
4. `docker network rm proxy` → removed
5. Pulled patched config into `/root/builder-clone`, resolved stale untracked files
6. `nixos-rebuild switch --flake "git+file:///root/builder-clone?submodules=1#cc-ci"` → success
7. `systemctl restart swarm-init` → proxy recreated as `10.10.0.0/16`
8. `systemctl restart deploy-proxy` → traefik deployed; health gate deadlock broken by starting deploy-dashboard concurrently
9. `systemctl start deploy-drone deploy-bridge deploy-dashboard deploy-reports`
10. `systemctl start warm-keycloak`
11. All services healthy; routes confirmed
**Anomaly note (for Adversary):** The `deploy-proxy` health gate checks `ci.commoninternet.net` (expects 200), but the dashboard (which serves that) is ordered AFTER `deploy-proxy`. On a fresh-from-scratch boot this creates a potential ordering issue. Workaround used: started deploy-dashboard concurrently during deploy-proxy's wait_healthy retry window. This matches normal-boot behavior (all WantedBy=multi-user.target services start concurrently with ordering). The health gate passed once the dashboard was deployed (~20s after starting it).
Phase pvfix complete.