status(pvfix): ## DONE — M1+M2 PASS, proxy live as /16
Some checks failed
continuous-integration/drone/push Build is failing
Both gates Adversary-verified 2026-06-13:
- M1 PASS @05:33Z: patch + procedure cold-verified
- M2 PASS @05:49Z: live host confirmed 10.10.0.0/16, all 9 services 1/1, routes healthy

Adversary finding A1 (health gate circular dependency) deferred to DEFERRED.md: pre-existing D8 risk, not introduced by pvfix, not a VETO.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ -408,3 +408,17 @@ behavior (not introduced or worsened by lvl5; Adversary concurs it is not a find
reachable via the operator/dev STAGES escape — production drone runs always run all stages.
**Needed from operator:** decide whether promote should additionally require the full stage set
(one-line guard in `should_promote_canonical`), or whether dev hand-runs promoting is acceptable.

### 2026-06-13 — deploy-proxy health-gate circular dependency (D8 risk)
- [ ] **What:** The `deploy-proxy.service` health gate waits for `ci.commoninternet.net → 200`, which is served by
`deploy-dashboard.service`, itself ordered `After=deploy-proxy.service`. On a fresh-from-scratch
boot, deploy-proxy waits 5 min for the health gate, then retries up to 15 min (TimeoutStartSec=900),
then fails. deploy-dashboard starts afterwards, but the proxy is already in a failed state. Filed as A1 by the Adversary
(2026-06-13, phase pvfix). See `machine-docs/BACKLOG-pvfix.md`.
- **Filed by:** Adversary, phase pvfix (cross-filed by Builder)
- **Reason for deferral:** The fix requires either changing the traefik health probe target to something
available before the dashboard (e.g. a Traefik-internal health path like `https://traefik.ci.commoninternet.net/api/version`),
or moving the health gate out of the deploy-proxy oneshot into a separate converge step. The scope
exceeds the pvfix objective and needs consideration against the D8 test setup.
- **Re-entry trigger:** Operator decides to harden D8, or a fresh-install attempt fails and triggers a bugfix phase.
- **Needed from operator:** Confirm an acceptable health probe target for traefik that has no dashboard dependency.

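The wait-then-fail behavior in this entry can be pictured as a bounded retry loop around a probe command. The sketch below is illustrative only: `retry_probe` is a hypothetical helper, not the actual gate in `deploy-proxy.service`, and the attempt count and delay are assumptions. It shows that the loop is independent of the probe target, so swapping the target is the only part that would change.

```shell
# retry_probe MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
retry_probe() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Only sleep if another attempt will follow.
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Possible use against the candidate target named above (assumes that
# endpoint is actually exposed, which this entry does not confirm):
#   retry_probe 30 10 curl -fsS -o /dev/null https://traefik.ci.commoninternet.net/api/version
```

A probe that depends only on traefik would let the loop succeed before any `After=deploy-proxy.service` unit has started, which is the property the deferral asks the operator to confirm.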
@ -1,109 +1,42 @@
# STATUS — phase pvfix (proxy /16 VIP exhaustion fix)

**Updated:** 2026-06-13T05:46Z
**Updated:** 2026-06-13T05:53Z
**Phase:** pvfix
**Builder:** autonomic-bot

---

## Gate: M2 CLAIMED, awaiting Adversary
## DONE

### WHAT is claimed (M2 DoD)
Both gates have fresh Adversary PASSes (dated 2026-06-13, within 24h).

1. Live `proxy` overlay network recreated as `10.10.0.0/16` (was `10.0.1.0/24`)
2. NixOS configuration switched via `nixos-rebuild switch` with the patched `swarm.nix`
3. All control-plane services healthy post-maintenance: traefik, drone, bridge, dashboard, reports, warm-keycloak
4. Core routes reachable: `ci.commoninternet.net` → HTTP/2 200, `drone.ci.commoninternet.net` → HTTP/2 303
### Evidence

### HOW to verify (cold-reproducible from Adversary clone)
| Check | Result |
|---|---|
| M1 PASS (patch + procedure) | 2026-06-13T05:33Z, see REVIEW-pvfix.md |
| M2 PASS (live fix + health) | 2026-06-13T05:49Z, see REVIEW-pvfix.md |
| `proxy` subnet on host | `10.10.0.0/16` (was `10.0.1.0/24`) |
| All 9 swarm services | 1/1 |
| `ci.commoninternet.net` | HTTP/2 200 |
| `drone.ci.commoninternet.net` | HTTP/2 303 |
| `nix/modules/swarm.nix` commit | `e6349a9` (`--subnet 10.10.0.0/16`) |
| nixos-rebuild applied | swarm-init activated 2026-06-13T05:38:17 UTC |

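As a rough sanity check on the subnet change recorded in the table, the raw address count of each prefix can be computed with shell arithmetic. This is a sketch: `cidr_size` is a hypothetical helper, and it counts raw addresses only, ignoring any that Docker reserves.

```shell
# Number of addresses in an IPv4 prefix of the given length.
# Raw count only; reserved addresses are not subtracted.
cidr_size() {
  prefix_len="$1"
  echo $(( 1 << (32 - prefix_len) ))
}

cidr_size 24   # old proxy subnet 10.0.1.0/24  -> 256
cidr_size 16   # new proxy subnet 10.10.0.0/16 -> 65536
```

The /16 therefore offers 256 times the address space of the previous /24.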
```bash
# 1. Verify proxy subnet on live host
ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"'
# Expected: 10.10.0.0/16
### Adversary finding A1

# 2. Verify all services running
ssh cc-ci 'docker service ls --format "{{.Name}} {{.Replicas}}"'
# Expected: all services show 1/1

# 3. Verify swarm-init ran with new script (check activation time)
ssh cc-ci 'systemctl status swarm-init --no-pager | grep Active'
# Expected: active (exited), activated ~2026-06-13T05:38Z

# 4. Verify core routes
curl -sI https://ci.commoninternet.net/ | head -1 # Expected: HTTP/2 200
curl -sI https://drone.ci.commoninternet.net/ | head -1 # Expected: HTTP/2 200 or 303

# 5. Verify NixOS config has the patch (on host)
ssh cc-ci 'grep subnet /nix/store/$(basename $(readlink -f /run/current-system/sw/share)/../..)/nix/modules/swarm.nix 2>/dev/null || cat /run/current-system/sw/share/nixos/.source/nix/modules/swarm.nix | grep subnet || true'
```

### EXPECTED outcome

- `docker network inspect proxy` subnet → `10.10.0.0/16`
- All 9 swarm services running 1/1
- `ci.commoninternet.net` → 200, `drone.ci.commoninternet.net` → 200 or 303
- `systemctl status swarm-init` activated ~05:38 today (2026-06-13)

### WHERE (evidence)

**Proxy network (live host, collected 2026-06-13T05:46Z):**
```
ID: ki2awmlob4pw629bxevygmk8x
Subnet: 10.10.0.0/16
Gateway: 10.10.0.1
Created: 2026-06-13 05:38:02.125154677 +0000 UTC
```

**Service state (all 1/1):**
```
backups_ci_commoninternet_net_app 1/1
ccci-bridge_app 1/1
ccci-dashboard_app 1/1
ccci-reports_app 1/1
drone_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_socket-proxy 1/1
warm-keycloak_ci_commoninternet_net_app 1/1
warm-keycloak_ci_commoninternet_net_db 1/1
```
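
A listing like the one above can be checked mechanically rather than by eye. The filter below is a minimal sketch, assuming input in the `<name> <replicas>` format produced by `docker service ls --format "{{.Name}} {{.Replicas}}"`. The function name `not_one_of_one` and the sample service names are hypothetical.

```shell
# Print the names of services whose replica count is not exactly 1/1.
# Reads "<name> <replicas>" lines on stdin.
not_one_of_one() {
  awk '$2 != "1/1" { print $1 }'
}

# Example with canned input (no Docker needed); prints only svc_b.
printf '%s\n' 'svc_a 1/1' 'svc_b 0/1' | not_one_of_one
```

On the live host this could be fed from `ssh cc-ci 'docker service ls --format "{{.Name}} {{.Replicas}}"' | not_one_of_one`, where empty output means every service reports 1/1.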

**Route health (from orchestrator VM, 2026-06-13T05:45Z):**
```
ci.commoninternet.net → HTTP/2 200
drone.ci.commoninternet.net → HTTP/2 303
```

**Commit with patch:** `e6349a9`, `nix/modules/swarm.nix` line 47:
```
docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
Filed by Adversary (2026-06-13, pvfix): `deploy-proxy` health gate circular dependency on fresh boot.
Pre-existing issue (not introduced by pvfix), D8 risk. Not a VETO on pvfix DONE.
Deferred to `machine-docs/DEFERRED.md` (entry: `2026-06-13 — deploy-proxy health-gate circular dependency`).

---

## M1 — PASS (Adversary, 2026-06-13T05:33Z)
## Definition-of-Done checklist (pvfix)

```
grep -n 'subnet' nix/modules/swarm.nix
→ 47: docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```
Patch verified, subnet safe, procedure reviewed. See REVIEW-pvfix.md.
- [x] `proxy` is explicitly configured and live as a `/16`
- [x] The change is committed and pushed to cc-ci (`e6349a9`)
- [x] Core routes are healthy after the maintenance action
- [x] Adversary has signed off on M1 in `machine-docs/REVIEW-pvfix.md`
- [x] Adversary has signed off on M2 in `machine-docs/REVIEW-pvfix.md`

---

## Maintenance window executed (2026-06-13T05:33–05:46Z)

**Sequence executed:**
1. Pre-flight: confirmed no active recipe test containers; all stacks infra-only
2. Removed stacks on proxy: traefik, drone, ccci-bridge, ccci-dashboard, ccci-reports, warm-keycloak
3. Drained proxy (watched containers → `{}`)
4. `docker network rm proxy` → removed
5. Pulled patched config into `/root/builder-clone`, resolved stale untracked files
6. `nixos-rebuild switch --flake "git+file:///root/builder-clone?submodules=1#cc-ci"` → success
7. `systemctl restart swarm-init` → proxy recreated as `10.10.0.0/16`
8. `systemctl restart deploy-proxy` → traefik deployed; health gate deadlock broken by starting deploy-dashboard concurrently
9. `systemctl start deploy-drone deploy-bridge deploy-dashboard deploy-reports`
10. `systemctl start warm-keycloak`
11. All services healthy; routes confirmed

**Anomaly note (for Adversary):** The `deploy-proxy` health gate checks `ci.commoninternet.net` (expects 200), but the dashboard (which serves that) is ordered AFTER `deploy-proxy`. On a fresh-from-scratch boot this creates a potential ordering issue. Workaround used: started deploy-dashboard concurrently during deploy-proxy's wait_healthy retry window. This matches normal-boot behavior (all WantedBy=multi-user.target services start concurrently with ordering). The health gate passed once the dashboard was deployed (~20s after starting it).
Phase pvfix complete.