M2 PASS: proxy confirmed 10.10.0.0/16 (created 05:38:02Z), all 9 services 1/1, swarm-init active script has --subnet, ci.commoninternet.net=200, drone.ci.commoninternet.net=303. A1 [adversary]: deploy-proxy health gate (ci.commoninternet.net=200) circular with deploy-dashboard After=deploy-proxy ordering — deadlocks on fresh boot (TimeoutStartSec=900). Pre-existing; pvfix exposed it. Needs fix before D8 pass.
3.1 KiB
BACKLOG — phase pvfix
Build backlog
- Seed pvfix state files
- Read plan-phase-pvfix-swarm-proxy.md + runbook
- Inspect live host subnets + services on proxy
- Patch nix/modules/swarm.nix (add --subnet 10.10.0.0/16)
- Write exact maintenance procedure in STATUS-pvfix.md
- CLAIM M1 — awaiting Adversary review
- Execute live maintenance (after M1 PASS)
- Verify health post-maintenance
- CLAIM M2 — awaiting Adversary verification
Adversary findings
A1 [adversary] deploy-proxy health gate circular dependency on fresh boot
Filed: 2026-06-13T05:49Z
Severity: D8 risk — from-scratch install deadlocks deploy-proxy for up to 15 min on first boot
Status: OPEN
Description:
deploy-proxy.service runs warm_reconcile.py traefik whose health gate checks
ci.commoninternet.net returns HTTP 200. That URL is served by the dashboard.
deploy-dashboard.service has After=deploy-proxy.service (nix/modules/dashboard.nix),
so systemd holds deploy-dashboard until deploy-proxy exits.
On a fresh-from-scratch boot:
- deploy-proxy starts, deploys traefik, calls
wait_healthy→ pollsci.commoninternet.net - deploy-dashboard is blocked by
After=deploy-proxy.service(systemd won't start it) ci.commoninternet.netnever returns 200 (dashboard not up)- deploy-proxy times out at
TimeoutStartSec=900(15 min) and fails - deploy-dashboard then starts but proxy is in failed state
Repro (controlled):
# Simulate on live host:
systemctl stop deploy-dashboard deploy-proxy
systemctl reset-failed deploy-dashboard deploy-proxy
# Observe: starting deploy-proxy without deploy-dashboard running → wait_healthy loops until timeout
systemctl start deploy-proxy &
journalctl -u deploy-proxy -f # confirms repeated curl ci.commoninternet.net failures
Root cause: warm_reconcile.py traefik spec has health_domain = "ci.commoninternet.net"
(a routed host proving Traefik routes + TLS — valid goal, wrong URL for a service ordered-after).
Fix options for Builder:
- Change
health_domainto a URL independent of ordered services (e.g. a Traefikapi/pingendpoint ontraefik.ci.commoninternet.net, ordrone.ci.commoninternet.netwhich starts concurrently with deploy-proxy since deploy-drone only hasAfter=deploy-proxy— but that would also be circular since drone is after proxy too). - Remove
deploy-proxy.servicefrom deploy-dashboard'safterlist — dashboard becomes concurrent with proxy on boot (fine: it's a static web server, just won't be routable until Traefik is up, which is tolerable). - Add
Wants=deploy-dashboard.service+After=deploy-dashboard.serviceto deploy-proxy, so systemd starts dashboard before proxy runs its health gate (reverses the current ordering).
Note: Pre-existing, not introduced by pvfix. Manual maintenance worked around it by starting deploy-dashboard concurrently. Only a cold from-scratch boot or deliberate service reset exposes the deadlock. Builder flagged it in STATUS-pvfix.md anomaly note.
Only the Adversary closes this item, after re-test confirms the fix resolves the deadlock.