Some checks failed
continuous-integration/drone/push Build is failing
M2 PASS: proxy confirmed 10.10.0.0/16 (created 05:38:02Z), all 9 services 1/1, swarm-init active script has --subnet, ci.commoninternet.net=200, drone.ci.commoninternet.net=303. A1 [adversary]: deploy-proxy health gate (ci.commoninternet.net=200) circular with deploy-dashboard After=deploy-proxy ordering — deadlocks on fresh boot (TimeoutStartSec=900). Pre-existing; pvfix exposed it. Needs fix before D8 pass.
65 lines
3.1 KiB
Markdown
65 lines
3.1 KiB
Markdown
# BACKLOG — phase pvfix
|
|
|
|
## Build backlog
|
|
|
|
- [x] Seed pvfix state files
|
|
- [x] Read plan-phase-pvfix-swarm-proxy.md + runbook
|
|
- [x] Inspect live host subnets + services on proxy
|
|
- [x] Patch nix/modules/swarm.nix (add --subnet 10.10.0.0/16)
|
|
- [x] Write exact maintenance procedure in STATUS-pvfix.md
|
|
- [x] **CLAIM M1** — awaiting Adversary review
|
|
- [x] Execute live maintenance (after M1 PASS)
|
|
- [x] Verify health post-maintenance
|
|
- [x] **CLAIM M2** — awaiting Adversary verification
|
|
|
|
## Adversary findings
|
|
|
|
### A1 [adversary] deploy-proxy health gate circular dependency on fresh boot
|
|
|
|
**Filed:** 2026-06-13T05:49Z
|
|
**Severity:** D8 risk — from-scratch install deadlocks deploy-proxy for up to 15 min on first boot
|
|
**Status:** OPEN
|
|
|
|
**Description:**
|
|
`deploy-proxy.service` runs `warm_reconcile.py traefik` whose health gate checks
|
|
`ci.commoninternet.net` returns HTTP 200. That URL is served by the dashboard.
|
|
`deploy-dashboard.service` has `After=deploy-proxy.service` (`nix/modules/dashboard.nix`),
|
|
so systemd holds deploy-dashboard until deploy-proxy exits.
|
|
|
|
On a fresh-from-scratch boot:
|
|
1. deploy-proxy starts, deploys traefik, calls `wait_healthy` → polls `ci.commoninternet.net`
|
|
2. deploy-dashboard is blocked by `After=deploy-proxy.service` (systemd won't start it)
|
|
3. `ci.commoninternet.net` never returns 200 (dashboard not up)
|
|
4. deploy-proxy times out at `TimeoutStartSec=900` (15 min) and fails
|
|
5. deploy-dashboard then starts but proxy is in failed state
|
|
|
|
**Repro (controlled):**
|
|
```bash
|
|
# Simulate on live host:
|
|
systemctl stop deploy-dashboard deploy-proxy
|
|
systemctl reset-failed deploy-dashboard deploy-proxy
|
|
# Observe: starting deploy-proxy without deploy-dashboard running → wait_healthy loops until timeout
|
|
systemctl start deploy-proxy &
|
|
journalctl -u deploy-proxy -f # confirms repeated curl ci.commoninternet.net failures
|
|
```
|
|
|
|
**Root cause:** `warm_reconcile.py traefik` spec has `health_domain = "ci.commoninternet.net"`
|
|
(a routed host proving Traefik routes + TLS — valid goal, wrong URL for a service ordered-after).
|
|
|
|
**Fix options for Builder:**
|
|
1. Change `health_domain` to a URL independent of ordered services (e.g. a Traefik
|
|
`api/ping` endpoint on `traefik.ci.commoninternet.net`, or `drone.ci.commoninternet.net`
|
|
which starts concurrently with deploy-proxy since deploy-drone only has `After=deploy-proxy`
|
|
— but that would also be circular since drone is after proxy too).
|
|
2. Remove `deploy-proxy.service` from deploy-dashboard's `after` list — dashboard becomes
|
|
concurrent with proxy on boot (fine: it's a static web server, just won't be routable until
|
|
Traefik is up, which is tolerable).
|
|
3. Add `Wants=deploy-dashboard.service` + `After=deploy-dashboard.service` to deploy-proxy, so
|
|
systemd starts dashboard before proxy runs its health gate (reverses the current ordering).
|
|
|
|
**Note:** Pre-existing, not introduced by pvfix. Manual maintenance worked around it by starting
|
|
deploy-dashboard concurrently. Only a cold from-scratch boot or deliberate service reset exposes
|
|
the deadlock. Builder flagged it in STATUS-pvfix.md anomaly note.
|
|
|
|
**Only the Adversary closes this item**, after re-test confirms the fix resolves the deadlock.
|