# REVIEW — phase pxgate **Phase:** pxgate — break deploy-proxy ↔ dashboard health-gate circular dependency (D8 fix) **Adversary:** autonomic-bot (Sonnet 4.6) **Started:** 2026-06-13T12:41Z --- ## Adversary orientation (cold start — 2026-06-13T12:41Z) Independent cold read of the root cause and fix spec. NOT a gate claim — recording what I found so the M1 verdict below is COLD and reproducible. ### Root cause — INDEPENDENTLY CONFIRMED Reading `nix/modules/proxy.nix` + `runner/warm_reconcile.py` + `nix/modules/dashboard.nix`: 1. `deploy-proxy.service` runs `warm_reconcile.py traefik`. 2. The traefik SPEC in `warm_reconcile.py:117-128` sets: ```python "health_domain": "ci.commoninternet.net", "health_path": "/", ``` So `health_code()` probes `https://ci.commoninternet.net/` — the dashboard. 3. `deploy-dashboard.service` (dashboard.nix:89) has: ``` After=deploy-bridge.service deploy-proxy.service ... ``` systemd will not start deploy-dashboard until deploy-proxy exits. 4. **Deadlock:** proxy waits for dashboard; dashboard waits for proxy. ### Root cause — PROVEN LIVE (not merely theoretical) The alert file `/var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json` confirms the deadlock hit TODAY at boot time: ``` deploy-proxy started: 05:38:21 UTC → probed ci.commoninternet.net (60s timeout): unhealthy → redeployed traefik → probed ci.commoninternet.net (300s timeout): still unhealthy → wrote alert "unhealthy-on-latest", exited 05:44:28 UTC (status=0, RemainAfterExit=true) deploy-dashboard started: 05:44:46 UTC (AFTER proxy exited) → deployed dashboard successfully → ci.commoninternet.net now returns 200 ``` traefik startDate = 2026-06-13T05:38:02Z (was already up before proxy reconciler started at 05:38:21) — so traefik itself was healthy; the probe was blocked on the dashboard. ### Verified fix endpoint `curl -sk --resolve traefik.ci.commoninternet.net:443:127.0.0.1 https://traefik.ci.commoninternet.net/api/version` → `{"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"}` (200) This endpoint is up the moment traefik is serving, has no backend dependency, requires no auth. `/ping` → 404 (not configured in the current recipe — avoid). ### Required change (my independent read of the fix) In `runner/warm_reconcile.py` SPECS["traefik"]: - Remove `"health_domain": "ci.commoninternet.net"` — so `health_code()` falls back to `spec["domain"]` = `"traefik.ci.commoninternet.net"` - Change `"health_path": "/"` → `"health_path": "/api/version"` `health_code()` will then probe `https://traefik.ci.commoninternet.net/api/version` directly (via `--resolve traefik.ci.commoninternet.net:443:127.0.0.1`), which returns 200 as soon as traefik is up — no dashboard dependency. ### Pre-M1 break-it probes (before Builder's fix, 2026-06-13T12:50Z) **P5 — Secret leak in alert files:** PASS. `/var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json` contains only `{"app": "traefik", "reason": "unhealthy-on-latest", "ts": "...", "version": "5.1.1+v3.6.15"}`. No credentials, no secrets. **P3 — After=deploy-proxy consumers ordering:** PASS (no regression in current ordering): - deploy-drone: After=deploy-proxy.service - deploy-bridge: After=deploy-drone.service deploy-proxy.service - deploy-dashboard: After=deploy-bridge.service deploy-proxy.service - deploy-backupbot: After=deploy-dashboard.service deploy-proxy.service - deploy-reports: After=deploy-dashboard.service deploy-proxy.service - nightly-sweep: After=deploy-proxy.service warm-keycloak.service - warm-keycloak: After=deploy-proxy.service These all correctly depend on deploy-proxy; after the fix, proxy completes without deadlock and the rest of the chain proceeds normally. **Endpoint stability:** `/api/version` returns 200 reliably (3/3 probes). No backend dependency. **P1-negative (traefik-down):** PENDING at M1 gate — requires a controlled stop of traefik (risky on live system); will execute at M1 verification using a short pause or by examining the reconciler code path (deploy_version raises → upgrade_ok=False → rollback). --- ## M1 — Fix + controlled reproduction ### PASS @2026-06-13T13:00Z — Adversary cold-verified **Commit:** `0e9fd38` (`claim(pxgate-M1): change traefik health probe to /api/version`) #### Check 1 — Code change correct ✅ `runner/warm_reconcile.py` SPECS["traefik"] (lines 120–129): ```python "traefik": { "recipe": "traefik", "domain": "traefik.ci.commoninternet.net", "health_path": "/api/version", # ← changed from "/" "health_ok": (200,), "stateful": False, "deploy_timeout": 600, "health_timeout": 300, "setup": _traefik_setup, }, ``` `health_domain` key is **absent** → `health_code()` falls back to `spec["domain"]` = `"traefik.ci.commoninternet.net"`. Probe is now `https://traefik.ci.commoninternet.net/api/version` with `--resolve traefik.ci.commoninternet.net:443:127.0.0.1` — traefik's own API, no backend dep. #### Check 2 — Controlled reproduction ✅ Scaled `ccci-dashboard_app` to 0 replicas (dashboard absent): - **New probe** (`/api/version` on traefik domain): HTTP **200** ← cycle broken - **Old probe** (`ci.commoninternet.net/`): HTTP **404** ← confirms old gate was deadlocked Dashboard restored to 1/1 and returns 200 after scale-up. #### Check 3 — Consumer ordering unchanged ✅ All `After=deploy-proxy.service` consumers unchanged: ``` deploy-drone: After=deploy-proxy.service swarm-init.service docker.service network-online.target deploy-bridge: After=deploy-drone.service deploy-proxy.service ... deploy-dashboard: After=deploy-bridge.service deploy-proxy.service ... deploy-backupbot: After=deploy-dashboard.service deploy-proxy.service ... deploy-reports: After=deploy-dashboard.service deploy-proxy.service ... nightly-sweep: After=deploy-proxy.service warm-keycloak.service docker.service warm-keycloak: After=deploy-proxy.service ... ``` `deploy-proxy` itself: `After=swarm-init.service docker.service network-online.target` — no dashboard dependency in its own ordering (correct). Fix does not change any service ordering. #### Check 4 — Alert dir empty ✅ `/var/lib/ci-warm/alerts/` is empty — Builder cleared the stale 05:44Z alert (valid false-alarm from the old gate hitting the deadlock this morning). #### Check 5 — proxy.nix comment ✅ Comment updated: "health-gate (traefik.ci.commoninternet.net/api/version returns 200 — traefik's own API, no backend dep)". No functional change to the nix module (same systemd unit). #### Check 6 — Gate has teeth ✅ (with one documentation note) **Functional PASS:** `health_code()` line 276 returns `int(r.stdout.strip() or "0")` → on curl connection failure, stdout = "000" (curl's HTTP-code sentinel) → `int("000") = 0` → 0 ∉ `health_ok=(200,)` → `wait_healthy()` returns False → rollback triggered. Gate genuinely fails on a broken traefik. **Documentation discrepancy (non-blocking):** The STATUS claim says "EXPECTED: error sentinel 999 returned when curl fails." The actual code returns 0 (not 999) on curl failure. `grep` for "999" returns no matches. This is a documentation error in the M1 claim only — the functional behavior is correct (0 ≠ 200 → gate fails → rollback). No code defect; no blocking finding. #### Check 7 — DEFERRED + DECISIONS updated ✅ `machine-docs/DEFERRED.md`: 2026-06-13 circular-dependency entry marked `[x] CLOSED @2026-06-13` with fix pointer. `machine-docs/DECISIONS.md`: "deploy-proxy health gate — SETTLED (2026-06-13, phase pxgate)" entry added with rationale. --- **M1 VERDICT: PASS** — cycle broken, new probe is dashboard-independent, rollback gate has teeth, ordering unchanged, DEFERRED closed, docs updated. One non-blocking STATUS discrepancy (999 vs 0 sentinel) noted; no code defect. --- ## M2 — Proven on a real from-scratch boot ### PENDING — awaiting orchestrator nixos-rebuild (as of 2026-06-13T13:08Z) M1 is PASS. The fix is in the repo (`0e9fd38`). The live cc-ci host still has the OLD probe: - Active reconcile script: `/nix/store/km6173hm5a77wxggd7zba3mfakrz0c94-cc-ci-reconcile-proxy` - Calls: `/nix/store/ls5d6s7q2892z0n0qv7sfk03zimwx3nd-runner/warm_reconcile.py` - That file has: `"health_domain": "ci.commoninternet.net"`, `"health_path": "/"` — OLD probe still live **Orchestrator action required:** ```bash ssh cc-ci cd /root/builder-clone git pull # to get commit 0e9fd38 nixos-rebuild switch --flake "git+file:///root/builder-clone?submodules=1#cc-ci" ``` After nixos-rebuild, I will verify (per STATUS-pxgate.md M2 checks): 1. `deploy-proxy.service` shows `active (exited)` (not unhealthy alert) 2. New nix store path with `/api/version` in use 3. All services 1/1 unaffected 4. Cold-boot simulation: stop dashboard + restart proxy → proxy completes healthy without dashboard