Active nix store (km6173hm5a...) calls ls5d6s7q...-runner/warm_reconcile.py which
still has health_domain=ci.commoninternet.net (OLD probe). Fix 0e9fd38 in git but not
deployed. Waiting for: cd /root/builder-clone && git pull && nixos-rebuild switch.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8.8 KiB
REVIEW — phase pxgate
Phase: pxgate — break deploy-proxy ↔ dashboard health-gate circular dependency (D8 fix) Adversary: autonomic-bot (Sonnet 4.6) Started: 2026-06-13T12:41Z
Adversary orientation (cold start — 2026-06-13T12:41Z)
Independent cold read of the root cause and fix spec. NOT a gate claim — recording what I found so the M1 verdict below is COLD and reproducible.
Root cause — INDEPENDENTLY CONFIRMED
Reading nix/modules/proxy.nix + runner/warm_reconcile.py + nix/modules/dashboard.nix:
deploy-proxy.servicerunswarm_reconcile.py traefik.- The traefik SPEC in
warm_reconcile.py:117-128sets:So"health_domain": "ci.commoninternet.net", "health_path": "/",health_code()probeshttps://ci.commoninternet.net/— the dashboard. deploy-dashboard.service(dashboard.nix:89) has:systemd will not start deploy-dashboard until deploy-proxy exits.After=deploy-bridge.service deploy-proxy.service ...- Deadlock: proxy waits for dashboard; dashboard waits for proxy.
Root cause — PROVEN LIVE (not merely theoretical)
The alert file /var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json
confirms the deadlock hit TODAY at boot time:
deploy-proxy started: 05:38:21 UTC
→ probed ci.commoninternet.net (60s timeout): unhealthy
→ redeployed traefik
→ probed ci.commoninternet.net (300s timeout): still unhealthy
→ wrote alert "unhealthy-on-latest", exited 05:44:28 UTC (status=0, RemainAfterExit=true)
deploy-dashboard started: 05:44:46 UTC (AFTER proxy exited)
→ deployed dashboard successfully
→ ci.commoninternet.net now returns 200
traefik startDate = 2026-06-13T05:38:02Z (was already up before proxy reconciler started at 05:38:21) — so traefik itself was healthy; the probe was blocked on the dashboard.
Verified fix endpoint
curl -sk --resolve traefik.ci.commoninternet.net:443:127.0.0.1 https://traefik.ci.commoninternet.net/api/version
→ {"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"} (200)
This endpoint is up the moment traefik is serving, has no backend dependency, requires no auth.
/ping → 404 (not configured in the current recipe — avoid).
Required change (my independent read of the fix)
In runner/warm_reconcile.py SPECS["traefik"]:
- Remove
"health_domain": "ci.commoninternet.net"— sohealth_code()falls back tospec["domain"]="traefik.ci.commoninternet.net" - Change
"health_path": "/"→"health_path": "/api/version"
health_code() will then probe https://traefik.ci.commoninternet.net/api/version directly
(via --resolve traefik.ci.commoninternet.net:443:127.0.0.1), which returns 200 as soon as
traefik is up — no dashboard dependency.
Pre-M1 break-it probes (before Builder's fix, 2026-06-13T12:50Z)
P5 — Secret leak in alert files: PASS. /var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json
contains only {"app": "traefik", "reason": "unhealthy-on-latest", "ts": "...", "version": "5.1.1+v3.6.15"}.
No credentials, no secrets.
P3 — After=deploy-proxy consumers ordering: PASS (no regression in current ordering):
- deploy-drone: After=deploy-proxy.service
- deploy-bridge: After=deploy-drone.service deploy-proxy.service
- deploy-dashboard: After=deploy-bridge.service deploy-proxy.service
- deploy-backupbot: After=deploy-dashboard.service deploy-proxy.service
- deploy-reports: After=deploy-dashboard.service deploy-proxy.service
- nightly-sweep: After=deploy-proxy.service warm-keycloak.service
- warm-keycloak: After=deploy-proxy.service These all correctly depend on deploy-proxy; after the fix, proxy completes without deadlock and the rest of the chain proceeds normally.
Endpoint stability: /api/version returns 200 reliably (3/3 probes). No backend dependency.
P1-negative (traefik-down): PENDING at M1 gate — requires a controlled stop of traefik (risky on live system); will execute at M1 verification using a short pause or by examining the reconciler code path (deploy_version raises → upgrade_ok=False → rollback).
M1 — Fix + controlled reproduction
PASS @2026-06-13T13:00Z — Adversary cold-verified
Commit: 0e9fd38 (claim(pxgate-M1): change traefik health probe to /api/version)
Check 1 — Code change correct ✅
runner/warm_reconcile.py SPECS["traefik"] (lines 120–129):
"traefik": {
"recipe": "traefik",
"domain": "traefik.ci.commoninternet.net",
"health_path": "/api/version", # ← changed from "/"
"health_ok": (200,),
"stateful": False,
"deploy_timeout": 600,
"health_timeout": 300,
"setup": _traefik_setup,
},
health_domain key is absent → health_code() falls back to spec["domain"] =
"traefik.ci.commoninternet.net". Probe is now https://traefik.ci.commoninternet.net/api/version
with --resolve traefik.ci.commoninternet.net:443:127.0.0.1 — traefik's own API, no backend dep.
Check 2 — Controlled reproduction ✅
Scaled ccci-dashboard_app to 0 replicas (dashboard absent):
- New probe (
/api/versionon traefik domain): HTTP 200 ← cycle broken - Old probe (
ci.commoninternet.net/): HTTP 404 ← confirms old gate was deadlocked
Dashboard restored to 1/1 and returns 200 after scale-up.
Check 3 — Consumer ordering unchanged ✅
All After=deploy-proxy.service consumers unchanged:
deploy-drone: After=deploy-proxy.service swarm-init.service docker.service network-online.target
deploy-bridge: After=deploy-drone.service deploy-proxy.service ...
deploy-dashboard: After=deploy-bridge.service deploy-proxy.service ...
deploy-backupbot: After=deploy-dashboard.service deploy-proxy.service ...
deploy-reports: After=deploy-dashboard.service deploy-proxy.service ...
nightly-sweep: After=deploy-proxy.service warm-keycloak.service docker.service
warm-keycloak: After=deploy-proxy.service ...
deploy-proxy itself: After=swarm-init.service docker.service network-online.target — no dashboard
dependency in its own ordering (correct). Fix does not change any service ordering.
Check 4 — Alert dir empty ✅
/var/lib/ci-warm/alerts/ is empty — Builder cleared the stale 05:44Z alert (valid false-alarm from
the old gate hitting the deadlock this morning).
Check 5 — proxy.nix comment ✅
Comment updated: "health-gate (traefik.ci.commoninternet.net/api/version returns 200 — traefik's own API, no backend dep)". No functional change to the nix module (same systemd unit).
Check 6 — Gate has teeth ✅ (with one documentation note)
Functional PASS: health_code() line 276 returns int(r.stdout.strip() or "0") → on curl
connection failure, stdout = "000" (curl's HTTP-code sentinel) → int("000") = 0 → 0 ∉ health_ok=(200,)
→ wait_healthy() returns False → rollback triggered. Gate genuinely fails on a broken traefik.
Documentation discrepancy (non-blocking): The STATUS claim says "EXPECTED: error sentinel 999 returned
when curl fails." The actual code returns 0 (not 999) on curl failure. grep for "999" returns no matches.
This is a documentation error in the M1 claim only — the functional behavior is correct (0 ≠ 200 → gate
fails → rollback). No code defect; no blocking finding.
Check 7 — DEFERRED + DECISIONS updated ✅
machine-docs/DEFERRED.md: 2026-06-13 circular-dependency entry marked [x] CLOSED @2026-06-13 with fix pointer.
machine-docs/DECISIONS.md: "deploy-proxy health gate — SETTLED (2026-06-13, phase pxgate)" entry added with rationale.
M1 VERDICT: PASS — cycle broken, new probe is dashboard-independent, rollback gate has teeth, ordering unchanged, DEFERRED closed, docs updated. One non-blocking STATUS discrepancy (999 vs 0 sentinel) noted; no code defect.
M2 — Proven on a real from-scratch boot
PENDING — awaiting orchestrator nixos-rebuild (as of 2026-06-13T13:08Z)
M1 is PASS. The fix is in the repo (0e9fd38). The live cc-ci host still has the OLD probe:
- Active reconcile script:
/nix/store/km6173hm5a77wxggd7zba3mfakrz0c94-cc-ci-reconcile-proxy - Calls:
/nix/store/ls5d6s7q2892z0n0qv7sfk03zimwx3nd-runner/warm_reconcile.py - That file has:
"health_domain": "ci.commoninternet.net","health_path": "/"— OLD probe still live
Orchestrator action required:
ssh cc-ci
cd /root/builder-clone
git pull # to get commit 0e9fd38
nixos-rebuild switch --flake "git+file:///root/builder-clone?submodules=1#cc-ci"
After nixos-rebuild, I will verify (per STATUS-pxgate.md M2 checks):
deploy-proxy.serviceshowsactive (exited)(not unhealthy alert)- New nix store path with
/api/versionin use - All services 1/1 unaffected
- Cold-boot simulation: stop dashboard + restart proxy → proxy completes healthy without dashboard