cc-ci/machine-docs/BACKLOG-pvfix.md
autonomic-bot ccd93da65c
review(pvfix-M2): M2 PASS + [adversary] A1 health gate deadlock
M2 PASS: proxy confirmed 10.10.0.0/16 (created 05:38:02Z), all 9 services 1/1,
swarm-init active script has --subnet, ci.commoninternet.net=200,
drone.ci.commoninternet.net=303.

A1 [adversary]: deploy-proxy health gate (ci.commoninternet.net=200) circular
with deploy-dashboard After=deploy-proxy ordering — deadlocks on fresh boot
(TimeoutStartSec=900). Pre-existing; pvfix exposed it. Needs fix before D8 pass.
2026-06-13 05:50:22 +00:00


BACKLOG — phase pvfix

Build backlog

  • Seed pvfix state files
  • Read plan-phase-pvfix-swarm-proxy.md + runbook
  • Inspect live host subnets + services on proxy
  • Patch nix/modules/swarm.nix (add --subnet 10.10.0.0/16)
  • Write exact maintenance procedure in STATUS-pvfix.md
  • CLAIM M1 — awaiting Adversary review
  • Execute live maintenance (after M1 PASS)
  • Verify health post-maintenance
  • CLAIM M2 — awaiting Adversary verification

Adversary findings

A1 [adversary] deploy-proxy health gate circular dependency on fresh boot

Filed: 2026-06-13T05:49Z
Severity: D8 risk — from-scratch install deadlocks deploy-proxy for up to 15 min on first boot
Status: OPEN

Description: deploy-proxy.service runs warm_reconcile.py traefik, whose health gate checks that ci.commoninternet.net returns HTTP 200. That URL is served by the dashboard. deploy-dashboard.service has After=deploy-proxy.service (nix/modules/dashboard.nix), so systemd holds deploy-dashboard until deploy-proxy exits.

On a fresh-from-scratch boot:

  1. deploy-proxy starts, deploys traefik, calls wait_healthy → polls ci.commoninternet.net
  2. deploy-dashboard is blocked by After=deploy-proxy.service (systemd won't start it)
  3. ci.commoninternet.net never returns 200 (dashboard not up)
  4. deploy-proxy times out at TimeoutStartSec=900 (15 min) and fails
  5. deploy-dashboard then starts but proxy is in failed state
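
The sequence above can be reproduced with a toy model. This is only an illustrative sketch: `simulate_boot` and its parameters are hypothetical, the 5-second poll interval is an assumption, and no real systemd or HTTP behaviour is involved. It encodes just two rules from the description: the health URL answers 200 only once the dashboard is up, and After= holds the dashboard until deploy-proxy has left the activating state.

```python
def simulate_boot(dashboard_after_proxy: bool,
                  timeout_s: int = 900,
                  poll_s: int = 5) -> dict:
    """Toy model of first boot; returns final unit states and the time
    deploy-proxy spent in its health gate (hypothetical helper)."""
    elapsed = 0
    # Without the After= edge the dashboard starts concurrently at t=0.
    dashboard_up = not dashboard_after_proxy

    # Health gate: poll until the dashboard-served URL would return 200,
    # or until TimeoutStartSec is reached.
    while elapsed < timeout_s:
        if dashboard_up:
            proxy_state = "active"
            break
        elapsed += poll_s
    else:
        proxy_state = "failed"  # start job timed out

    # After= is ordering only: once deploy-proxy's start job has finished,
    # successfully or not, the dashboard is released.
    dashboard_up = True
    return {"proxy": proxy_state, "dashboard": "active", "waited_s": elapsed}


if __name__ == "__main__":
    print(simulate_boot(dashboard_after_proxy=True))
    print(simulate_boot(dashboard_after_proxy=False))
```

With the ordering edge in place the model ends with the proxy failed after the full timeout; without it the gate passes immediately.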

Repro (controlled):

# Simulate on live host:
systemctl stop deploy-dashboard deploy-proxy
systemctl reset-failed deploy-dashboard deploy-proxy
# Start deploy-proxy alone; with deploy-dashboard down, wait_healthy loops until timeout
systemctl start --no-block deploy-proxy
journalctl -u deploy-proxy -f  # confirms repeated curl ci.commoninternet.net failures

Root cause: the warm_reconcile.py traefik spec has health_domain = "ci.commoninternet.net". The goal is valid (a routed host proves that Traefik routes and terminates TLS), but the URL is wrong because it is served by a unit ordered after deploy-proxy.
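
As a rough illustration of why the choice of URL matters, here is a minimal polling gate with the URL as a parameter. It is not the actual wait_healthy from warm_reconcile.py, whose signature is not shown here; `http_status`, `wait_until_healthy`, and every default value are hypothetical. The `probe`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a network.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if no response was received.
    Note: urlopen follows redirects, so a 3xx is not reported as such."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:  # non-2xx response from the server
        return exc.code
    except (urllib.error.URLError, OSError):  # DNS, TLS, refused, timeout
        return 0


def wait_until_healthy(url: str,
                       expected: int = 200,
                       deadline_s: float = 120.0,
                       interval_s: float = 5.0,
                       probe: Callable[[str], int] = http_status,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll url until probe returns `expected` or deadline_s has elapsed."""
    start = clock()
    while True:
        if probe(url) == expected:
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

The gate can only succeed if whatever serves `url` is able to start while the gate is running. A URL answered by Traefik itself would satisfy that; one answered by deploy-dashboard does not while the After= edge exists.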

Fix options for Builder:

  1. Change health_domain to a URL that does not depend on any unit ordered after deploy-proxy, e.g. a Traefik api/ping endpoint on traefik.ci.commoninternet.net. drone.ci.commoninternet.net is not a candidate: deploy-drone also has After=deploy-proxy, so it would be just as circular.
  2. Remove deploy-proxy.service from deploy-dashboard's After= list, so the dashboard starts concurrently with the proxy on boot. This is fine: it is a static web server and simply won't be routable until Traefik is up, which is tolerable.
  3. Add Wants=deploy-dashboard.service + After=deploy-dashboard.service to deploy-proxy, so systemd starts the dashboard before proxy runs its health gate. This reverses the current ordering, so After=deploy-proxy.service must be dropped from deploy-dashboard in the same change; otherwise the two After= lines form an ordering cycle.
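
Whichever option the Builder picks, the resulting After= graph should stay acyclic. Option 3 in particular adds an edge in the opposite direction to the existing one, and the two cannot coexist. The check below is a small sketch over a hand-written graph: `find_ordering_cycle` is a hypothetical helper, and the edges are transcribed from this finding and not read from the live units.

```python
from typing import Dict, List, Optional, Set


def find_ordering_cycle(after: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Depth-first search for a cycle in an After= graph.
    `after[u]` is the set of units that u is ordered after."""
    WHITE, GREY, BLACK = 0, 1, 2
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(unit: str) -> Optional[List[str]]:
        color[unit] = GREY
        path.append(unit)
        for dep in sorted(after.get(unit, ())):
            state = color.get(dep, WHITE)
            if state == GREY:  # back edge: dep is already on the path
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[unit] = BLACK
        return None

    for unit in sorted(after):
        if color.get(unit, WHITE) == WHITE:
            found = visit(unit)
            if found:
                return found
    return None


# Ordering as described in this finding.
current = {"deploy-dashboard": {"deploy-proxy"}, "deploy-drone": {"deploy-proxy"}}
# Option 3 applied without removing the dashboard's After= line.
option3_naive = {**current, "deploy-proxy": {"deploy-dashboard"}}
# Option 3 with the dashboard's After= line removed.
option3_clean = {"deploy-proxy": {"deploy-dashboard"}, "deploy-drone": {"deploy-proxy"}}

if __name__ == "__main__":
    for name, graph in [("current", current), ("option3_naive", option3_naive),
                        ("option3_clean", option3_clean)]:
        print(name, find_ordering_cycle(graph))
```

Only the naive variant of option 3 reports a cycle (deploy-dashboard, deploy-proxy, deploy-dashboard). The current graph is acyclic, which is why the problem shows up as a runtime deadlock in the health gate and not as a unit-ordering error.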

Note: Pre-existing, not introduced by pvfix. Manual maintenance worked around it by starting deploy-dashboard concurrently. Only a cold from-scratch boot or a deliberate service reset exposes the deadlock. Builder flagged it in the anomaly note in STATUS-pvfix.md.

Only the Adversary closes this item, after re-test confirms the fix resolves the deadlock.