Files
cc-ci/machine-docs/REVIEW-pxgate.md
autonomic-bot a9e67af61e
Some checks failed
continuous-integration/drone/push Build is failing
chore(pxgate): init Adversary phase files — root cause cold-verified, M1/M2 PENDING
Independent cold read confirms the circular dependency (proxy health-gate polls
ci.commoninternet.net served by dashboard which is After=deploy-proxy). Root cause
is PROVEN LIVE by today's alert: 20260613T054428Z-traefik-unhealthy-on-latest.json.

Fix endpoint independently verified: /api/version on traefik.ci.commoninternet.net
returns 200 as soon as traefik is up, no dashboard dependency.

REVIEW-pxgate.md: orientation, M1/M2 acceptance criteria.
BACKLOG-pxgate.md: break-it probes P1–P5 to run at M1 gate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 12:42:30 +00:00

3.9 KiB

REVIEW — phase pxgate

Phase: pxgate — break deploy-proxy ↔ dashboard health-gate circular dependency (D8 fix) Adversary: autonomic-bot (Sonnet 4.6) Started: 2026-06-13T12:41Z


Adversary orientation (cold start — 2026-06-13T12:41Z)

Independent cold read of the root cause and fix spec. NOT a gate claim — recording what I found so the M1 verdict below is COLD and reproducible.

Root cause — INDEPENDENTLY CONFIRMED

Reading nix/modules/proxy.nix + runner/warm_reconcile.py + nix/modules/dashboard.nix:

  1. deploy-proxy.service runs warm_reconcile.py traefik.
  2. The traefik SPEC in warm_reconcile.py:117-128 sets:
    "health_domain": "ci.commoninternet.net",
    "health_path": "/",
    
    So health_code() probes https://ci.commoninternet.net/ — the dashboard.
  3. deploy-dashboard.service (dashboard.nix:89) has:
    After=deploy-bridge.service deploy-proxy.service ...
    
    systemd will not start deploy-dashboard until deploy-proxy exits.
  4. Deadlock: proxy waits for dashboard; dashboard waits for proxy.

Root cause — PROVEN LIVE (not merely theoretical)

The alert file /var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json confirms the deadlock hit TODAY at boot time:

deploy-proxy started: 05:38:21 UTC
  → probed ci.commoninternet.net (60s timeout): unhealthy
  → redeployed traefik
  → probed ci.commoninternet.net (300s timeout): still unhealthy
  → wrote alert "unhealthy-on-latest", exited 05:44:28 UTC (status=0, RemainAfterExit=true)
deploy-dashboard started: 05:44:46 UTC (AFTER proxy exited)
  → deployed dashboard successfully
  → ci.commoninternet.net now returns 200

traefik startDate = 2026-06-13T05:38:02Z (was already up before proxy reconciler started at 05:38:21) — so traefik itself was healthy; the probe was blocked on the dashboard.

Verified fix endpoint

curl -sk --resolve traefik.ci.commoninternet.net:443:127.0.0.1 https://traefik.ci.commoninternet.net/api/version{"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"} (200)

This endpoint is up the moment traefik is serving, has no backend dependency, requires no auth.

/ping → 404 (not configured in the current recipe — avoid).

Required change (my independent read of the fix)

In runner/warm_reconcile.py SPECS["traefik"]:

  • Remove "health_domain": "ci.commoninternet.net" — so health_code() falls back to spec["domain"] = "traefik.ci.commoninternet.net"
  • Change "health_path": "/""health_path": "/api/version"

health_code() will then probe https://traefik.ci.commoninternet.net/api/version directly (via --resolve traefik.ci.commoninternet.net:443:127.0.0.1), which returns 200 as soon as traefik is up — no dashboard dependency.


M1 — Fix + controlled reproduction

PENDING — awaiting Builder implementation

Acceptance criteria I will independently verify:

  1. Code change correct: SPECS["traefik"] removes health_domain override and uses /api/version
  2. New gate is meaningful: a STOPPED/broken traefik must cause the probe to fail (return non-200)
  3. Controlled reproduction: with dashboard held back, old gate hangs/fails; new gate passes on traefik alone
  4. No After=deploy-proxy consumer regressed: drone, warm-keycloak, bridge, dashboard, backupbot, reports ordering still correct
  5. PR merged or ready to the cc-ci repo

M2 — Proven on a real from-scratch boot

PENDING — awaiting Builder implementation + orchestrator cold-boot

Acceptance criteria I will independently verify:

  1. deploy-proxy reaches active without the dashboard being pre-deployed
  2. Rollback path still works: a deliberately broken traefik fails the gate and rolls back
  3. Running server unaffected: all services still up after the fix deploys
  4. A1 / DEFERRED entry closed with pointers