diff --git a/machine-docs/DEFERRED.md b/machine-docs/DEFERRED.md
index 5f04530..6d9700c 100644
--- a/machine-docs/DEFERRED.md
+++ b/machine-docs/DEFERRED.md
@@ -408,3 +408,17 @@ behavior (not introduced or worsened by lvl5; Adversary concurs it is not a find
   reachable via the operator/dev STAGES escape — production drone runs always run all stages.
   **Needed from operator:** decide whether promote should additionally require the full stage set
   (one-line guard in `should_promote_canonical`), or whether dev hand-runs promoting is acceptable.
+
+### 2026-06-13 — deploy-proxy health-gate circular dependency (D8 risk)
+- [ ] **What:** `deploy-proxy.service` health gate waits for `ci.commoninternet.net → 200`, served by
+  `deploy-dashboard.service` which is ordered `After=deploy-proxy.service`. On a fresh-from-scratch
+  boot, deploy-proxy waits 5 min for the health gate, then retries up to 15 min (TimeoutStartSec=900),
+  then fails — deploy-dashboard starts after but proxy is in failed state. Filed as A1 by the Adversary
+  (2026-06-13, phase pvfix). See `machine-docs/BACKLOG-pvfix.md`.
+- **Filed by:** Adversary, phase pvfix (cross-filed by Builder)
+- **Reason for deferral:** Fix requires changing the health probe target for traefik to something
+  available before the dashboard (e.g. a Traefik-internal health path like `https://traefik.ci.commoninternet.net/api/version`)
+  or moving the health gate out of the deploy-proxy oneshot into a separate converge step. Scope
+  exceeds pvfix objective; needs consideration against D8 test setup.
+- **Re-entry trigger:** Operator decides to harden D8; or a fresh-install attempt fails and triggers a bugfix phase.
+- **Needed from operator:** Confirm acceptable health probe target for traefik without dashboard dependency.
diff --git a/machine-docs/STATUS-pvfix.md b/machine-docs/STATUS-pvfix.md
index 1f58a6c..9a2405a 100644
--- a/machine-docs/STATUS-pvfix.md
+++ b/machine-docs/STATUS-pvfix.md
@@ -1,109 +1,42 @@
 # STATUS — phase pvfix (proxy /16 VIP exhaustion fix)
 
-**Updated:** 2026-06-13T05:46Z
+**Updated:** 2026-06-13T05:53Z
 **Phase:** pvfix
 **Builder:** autonomic-bot
 
 ---
 
-## Gate: M2 CLAIMED, awaiting Adversary
+## DONE
 
-### WHAT is claimed (M2 DoD)
+Both gates have fresh Adversary PASSes (dated 2026-06-13, within 24h).
 
-1. Live `proxy` overlay network recreated as `10.10.0.0/16` (was `10.0.1.0/24`)
-2. NixOS configuration switched via `nixos-rebuild switch` with the patched `swarm.nix`
-3. All control-plane services healthy post-maintenance: traefik, drone, bridge, dashboard, reports, warm-keycloak
-4. Core routes reachable: `ci.commoninternet.net` → HTTP/2 200, `drone.ci.commoninternet.net` → HTTP/2 303
+### Evidence
 
-### HOW to verify (cold-reproducible from Adversary clone)
+| Check | Result |
+|---|---|
+| M1 PASS (patch + procedure) | 2026-06-13T05:33Z — see REVIEW-pvfix.md |
+| M2 PASS (live fix + health) | 2026-06-13T05:49Z — see REVIEW-pvfix.md |
+| `proxy` subnet on host | `10.10.0.0/16` (was `10.0.1.0/24`) |
+| All 9 swarm services | 1/1 |
+| `ci.commoninternet.net` | HTTP/2 200 |
+| `drone.ci.commoninternet.net` | HTTP/2 303 |
+| `nix/modules/swarm.nix` commit | `e6349a9` — `--subnet 10.10.0.0/16` |
+| nixos-rebuild applied | swarm-init activated 2026-06-13T05:38:17 UTC |
 
-```bash
-# 1. Verify proxy subnet on live host
-ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"'
-# Expected: 10.10.0.0/16
+### Adversary finding A1
 
-# 2. Verify all services running
-ssh cc-ci 'docker service ls --format "{{.Name}} {{.Replicas}}"'
-# Expected: all services show 1/1
-
-# 3. Verify swarm-init ran with new script (check activation time)
-ssh cc-ci 'systemctl status swarm-init --no-pager | grep Active'
-# Expected: active (exited), activated ~2026-06-13T05:38Z
-
-# 4. Verify core routes
-curl -sI https://ci.commoninternet.net/ | head -1 # Expected: HTTP/2 200
-curl -sI https://drone.ci.commoninternet.net/ | head -1 # Expected: HTTP/2 200 or 303
-
-# 5. Verify NixOS config has the patch (on host)
-ssh cc-ci 'grep subnet /nix/store/$(basename $(readlink -f /run/current-system/sw/share)/../..)/nix/modules/swarm.nix 2>/dev/null || cat /run/current-system/sw/share/nixos/.source/nix/modules/swarm.nix | grep subnet || true'
-```
-
-### EXPECTED outcome
-
-- `docker network inspect proxy` subnet → `10.10.0.0/16`
-- All 9 swarm services running 1/1
-- `ci.commoninternet.net` → 200, `drone.ci.commoninternet.net` → 200 or 303
-- `systemctl status swarm-init` activated ~05:38 today (2026-06-13)
-
-### WHERE (evidence)
-
-**Proxy network (live host, collected 2026-06-13T05:46Z):**
-```
-ID: ki2awmlob4pw629bxevygmk8x
-Subnet: 10.10.0.0/16
-Gateway: 10.10.0.1
-Created: 2026-06-13 05:38:02.125154677 +0000 UTC
-```
-
-**Service state (all 1/1):**
-```
-backups_ci_commoninternet_net_app 1/1
-ccci-bridge_app 1/1
-ccci-dashboard_app 1/1
-ccci-reports_app 1/1
-drone_ci_commoninternet_net_app 1/1
-traefik_ci_commoninternet_net_app 1/1
-traefik_ci_commoninternet_net_socket-proxy 1/1
-warm-keycloak_ci_commoninternet_net_app 1/1
-warm-keycloak_ci_commoninternet_net_db 1/1
-```
-
-**Route health (from orchestrator VM, 2026-06-13T05:45Z):**
-```
-ci.commoninternet.net → HTTP/2 200
-drone.ci.commoninternet.net → HTTP/2 303
-```
-
-**Commit with patch:** `e6349a9` — `nix/modules/swarm.nix` line 47:
-```
-docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
-```
+Filed by Adversary (2026-06-13, pvfix): `deploy-proxy` health gate circular dependency on fresh boot.
+Pre-existing issue (not introduced by pvfix), D8 risk. Not a VETO on pvfix DONE.
+Deferred to `machine-docs/DEFERRED.md` (entry: `2026-06-13 — deploy-proxy health-gate circular dependency`).
 
 ---
 
-## M1 — PASS (Adversary, 2026-06-13T05:33Z)
+## Definition-of-Done checklist (pvfix)
 
-```
-grep -n 'subnet' nix/modules/swarm.nix
-→ 47: docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
-```
-Patch verified, subnet safe, procedure reviewed. See REVIEW-pvfix.md.
+- [x] `proxy` is explicitly configured and live as a `/16`
+- [x] The change is committed and pushed to cc-ci (`e6349a9`)
+- [x] Core routes are healthy after the maintenance action
+- [x] Adversary has signed off on M1 in `machine-docs/REVIEW-pvfix.md`
+- [x] Adversary has signed off on M2 in `machine-docs/REVIEW-pvfix.md`
 
----
-
-## Maintenance window executed (2026-06-13T05:33–05:46Z)
-
-**Sequence executed:**
-1. Pre-flight: confirmed no active recipe test containers; all stacks infra-only
-2. Removed stacks on proxy: traefik, drone, ccci-bridge, ccci-dashboard, ccci-reports, warm-keycloak
-3. Drained proxy (watched containers → `{}`)
-4. `docker network rm proxy` → removed
-5. Pulled patched config into `/root/builder-clone`, resolved stale untracked files
-6. `nixos-rebuild switch --flake "git+file:///root/builder-clone?submodules=1#cc-ci"` → success
-7. `systemctl restart swarm-init` → proxy recreated as `10.10.0.0/16`
-8. `systemctl restart deploy-proxy` → traefik deployed; health gate deadlock broke by starting deploy-dashboard concurrently
-9. `systemctl start deploy-drone deploy-bridge deploy-dashboard deploy-reports`
-10. `systemctl start warm-keycloak`
-11. All services healthy; routes confirmed
-
-**Anomaly note (for Adversary):** The `deploy-proxy` health gate checks `ci.commoninternet.net` (expects 200), but the dashboard (which serves that) is ordered AFTER `deploy-proxy`. On a fresh-from-scratch boot this creates a potential ordering issue. Workaround used: started deploy-dashboard concurrently during deploy-proxy's wait_healthy retry window. This matches normal-boot behavior (all WantedBy=multi-user.target services start concurrently with ordering). The health gate passed once the dashboard was deployed (~20s after starting it).
+Phase pvfix complete.
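
Reviewer note on the subnet change recorded above (`10.0.1.0/24` to `10.10.0.0/16`): the address arithmetic behind the VIP-exhaustion fix can be sketched as below. This is a minimal illustration only. The helper name `usable_hosts` is hypothetical and not part of cc-ci, and it counts plain IPv4 host addresses; Docker may reserve further addresses (for example the gateway) on top of this.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the cc-ci repo): number of usable IPv4 host
# addresses for a given prefix length, i.e. 2^(32 - prefix) minus the
# network and broadcast addresses.
usable_hosts() {
  local prefix="$1"
  echo $(( (1 << (32 - prefix)) - 2 ))
}

echo "/24 usable hosts: $(usable_hosts 24)"
echo "/16 usable hosts: $(usable_hosts 16)"
```

Under these assumptions the old `/24` leaves 254 addresses for every VIP and task on the overlay, while the `/16` leaves 65534, which is why the wider subnet removes the exhaustion risk.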