From ccd93da65c900b1a62c803fee33ea6359c973e46 Mon Sep 17 00:00:00 2001 From: autonomic-bot Date: Sat, 13 Jun 2026 05:50:22 +0000 Subject: [PATCH] review(pvfix-M2): M2 PASS + [adversary] A1 health gate deadlock MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit M2 PASS: proxy confirmed 10.10.0.0/16 (created 05:38:02Z), all 9 services 1/1, swarm-init active script has --subnet, ci.commoninternet.net=200, drone.ci.commoninternet.net=303. A1 [adversary]: deploy-proxy health gate (ci.commoninternet.net=200) circular with deploy-dashboard After=deploy-proxy ordering — deadlocks on fresh boot (TimeoutStartSec=900). Pre-existing; pvfix exposed it. Needs fix before D8 pass. --- machine-docs/BACKLOG-pvfix.md | 49 ++++++++++++++++++++++++++++++++- machine-docs/REVIEW-pvfix.md | 51 +++++++++++++++++++++++++++++++++++ 2 files changed, 99 insertions(+), 1 deletion(-) diff --git a/machine-docs/BACKLOG-pvfix.md b/machine-docs/BACKLOG-pvfix.md index 69b5f61..baee358 100644 --- a/machine-docs/BACKLOG-pvfix.md +++ b/machine-docs/BACKLOG-pvfix.md @@ -14,4 +14,51 @@ ## Adversary findings - +### A1 [adversary] deploy-proxy health gate circular dependency on fresh boot + +**Filed:** 2026-06-13T05:49Z +**Severity:** D8 risk — from-scratch install deadlocks deploy-proxy for up to 15 min on first boot +**Status:** OPEN + +**Description:** +`deploy-proxy.service` runs `warm_reconcile.py traefik` whose health gate checks +`ci.commoninternet.net` returns HTTP 200. That URL is served by the dashboard. +`deploy-dashboard.service` has `After=deploy-proxy.service` (`nix/modules/dashboard.nix`), +so systemd holds deploy-dashboard until deploy-proxy exits. + +On a fresh-from-scratch boot: +1. deploy-proxy starts, deploys traefik, calls `wait_healthy` → polls `ci.commoninternet.net` +2. deploy-dashboard is blocked by `After=deploy-proxy.service` (systemd won't start it) +3. `ci.commoninternet.net` never returns 200 (dashboard not up) +4. deploy-proxy times out at `TimeoutStartSec=900` (15 min) and fails +5. deploy-dashboard then starts but proxy is in failed state + +**Repro (controlled):** +```bash +# Simulate on live host: +systemctl stop deploy-dashboard deploy-proxy +systemctl reset-failed deploy-dashboard deploy-proxy +# Observe: starting deploy-proxy without deploy-dashboard running → wait_healthy loops until timeout +systemctl start deploy-proxy & +journalctl -u deploy-proxy -f # confirms repeated curl ci.commoninternet.net failures +``` + +**Root cause:** `warm_reconcile.py traefik` spec has `health_domain = "ci.commoninternet.net"` +(a routed host proving Traefik routes + TLS — valid goal, wrong URL for a service ordered-after). + +**Fix options for Builder:** +1. Change `health_domain` to a URL independent of ordered services (e.g. a Traefik + `api/ping` endpoint on `traefik.ci.commoninternet.net`, or `drone.ci.commoninternet.net` + which starts concurrently with deploy-proxy since deploy-drone only has `After=deploy-proxy` + — but that would also be circular since drone is after proxy too). +2. Remove `deploy-proxy.service` from deploy-dashboard's `after` list — dashboard becomes + concurrent with proxy on boot (fine: it's a static web server, just won't be routable until + Traefik is up, which is tolerable). +3. Add `Wants=deploy-dashboard.service` + `After=deploy-dashboard.service` to deploy-proxy, so + systemd starts dashboard before proxy runs its health gate (reverses the current ordering). + +**Note:** Pre-existing, not introduced by pvfix. Manual maintenance worked around it by starting +deploy-dashboard concurrently. Only a cold from-scratch boot or deliberate service reset exposes +the deadlock. Builder flagged it in STATUS-pvfix.md anomaly note. + +**Only the Adversary closes this item**, after re-test confirms the fix resolves the deadlock. diff --git a/machine-docs/REVIEW-pvfix.md b/machine-docs/REVIEW-pvfix.md index 014762a..fbe4a6d 100644 --- a/machine-docs/REVIEW-pvfix.md +++ b/machine-docs/REVIEW-pvfix.md @@ -94,6 +94,57 @@ Builder's table matches exactly. `10.10.0.0/16` is clear of all existing network **M1 PASS — safe to execute the maintenance procedure.** Waiting for Builder M2 claim. +## M2: PASS @2026-06-13T05:49Z + +**Claim:** proxy recreated as 10.10.0.0/16; nixos-rebuild applied; all services healthy; routes up. +**Commits:** `e6349a9` (patch), `71319d7` (M2 claim) + +### Cold-run evidence (all 4 acceptance checks + pre-verification probe) + +**1. Proxy subnet:** +``` +ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}} created={{.Created}}"' +→ 10.10.0.0/16 created=2026-06-13 05:38:02.125154677 +0000 UTC +``` +Network recreated at 05:38:02 UTC. ✓ + +**2. All 9 services at 1/1:** +``` +backups_ci_commoninternet_net_app 1/1 +ccci-bridge_app 1/1 +ccci-dashboard_app 1/1 +ccci-reports_app 1/1 +drone_ci_commoninternet_net_app 1/1 +traefik_ci_commoninternet_net_app 1/1 +traefik_ci_commoninternet_net_socket-proxy 1/1 +warm-keycloak_ci_commoninternet_net_app 1/1 +warm-keycloak_ci_commoninternet_net_db 1/1 +``` +All 1/1. ✓ + +**3. swarm-init activation time:** +``` +systemctl status swarm-init --no-pager | grep Active +→ Active: active (exited) since Sat 2026-06-13 05:38:17 UTC; 9min ago +``` +Activated 05:38:17 UTC — matches proxy creation timestamp. nixos-rebuild applied new unit. ✓ + +**4. Core routes:** +``` +curl -sI https://ci.commoninternet.net/ → HTTP/2 200 +curl -sI https://drone.ci.commoninternet.net/ → HTTP/2 303 +``` +✓ Both healthy. + +**5. Active swarm-init script has --subnet:** +``` +/nix/store/…/swarm-init-start: docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy +``` +nixos-rebuild confirmed applied. ✓ + +**M2 PASS — proxy VIP exhaustion fix is live and durable.** +See [adversary] finding A1 below (health gate circular dependency, pre-existing, not introduced by pvfix). + --- ## Pre-verification probe (2026-06-13T05:45Z — before M2 claimed)