Files
cc-ci-orchestrator/cc-ci-plan/plan-phase-pxgate-proxy-healthgate.md
autonomic-bot 1aee81b4f3 plan: queue pxgate — fix deploy-proxy/dashboard health-gate circular dependency (D8)
Re-target the traefik health gate off ci.commoninternet.net (the dashboard,
which is After=deploy-proxy) onto a traefik-self endpoint, breaking the
fresh-boot deadlock while keeping health-gated rollback. M1 controlled repro by
the loops; M2 from-scratch cold-boot proof owned by the orchestrator.
2026-06-13 12:38:40 +00:00

5.2 KiB

Phase pxgate — break the deploy-proxy ↔ dashboard health-gate circular dependency (D8 fix)

Mission: fix the boot-ordering deadlock the Adversary filed as A1 (BACKLOG-pvfix / DEFERRED 2026-06-13) so a from-scratch (D8) boot brings traefik up cleanly instead of deploy-proxy hanging 15 min and failing. The traefik proxy must still be health-gated with rollback — just gated on a signal that does NOT depend on a service ordered after it.

State files under machine-docs/: STATUS-pxgate.md, BACKLOG-pxgate.md, REVIEW-pxgate.md, JOURNAL-pxgate.md. DECISIONS.md shared.

Root cause (verified)

  1. nix/modules/proxy.nixdeploy-proxy.service runs warm_reconcile.py traefik; its wait_healthy polls https://ci.commoninternet.net → 200. TimeoutStartSec = 900.
  2. https://ci.commoninternet.net is served by the dashboard.
  3. nix/modules/dashboard.nixdeploy-dashboard.service has After=deploy-proxy.service, so systemd won't start the dashboard until deploy-proxy exits. → On a fresh boot: proxy waits for the dashboard, dashboard waits for proxy → proxy times out at 900s and fails, then the dashboard starts onto a failed proxy. (On the running server it's invisible — the dashboard is already up — which is why it's a latent D8 risk.)

Required fix

Break the cycle while keeping a MEANINGFUL traefik health signal + the rollback semantics (deploy latest → health-gate → commit last-good if healthy / roll back to last-good if not).

Recommended approach (Builder confirms in DECISIONS.md): change the deploy-proxy health gate to probe a traefik-self-served endpoint that is up the moment traefik is, with no backend/dashboard dependency — e.g. traefik's own ping entrypoint (/ping → 200) or https://traefik.ci.commoninternet.net/api/version. That confirms "traefik (re)deployed and is serving" — enough to drive the rollback decision — without waiting on ci.commoninternet.net.

  • If you still want an end-to-end "traefik routes to a real backend" assertion, move THAT to a separate converge/health step ordered AFTER the dashboard (or rely on the dashboard's own health gate); it must NOT be inside the deploy-proxy oneshot.
  • Do NOT simply delete the health gate (loses the rollback safety net). Do NOT just bump the timeout (the deadlock still fails on a cold boot, just slower). Do NOT reintroduce any dependency on a service that is After=deploy-proxy.
  • Make sure traefik actually EXPOSES whatever endpoint you choose (the ping entrypoint / api may need enabling in the traefik recipe/compose) — verify it returns 200 with only traefik up.

Touch points: nix/modules/proxy.nix, the traefik health check in runner/warm_reconcile.py (and/or runner/harness/* health helpers), possibly the traefik recipe overlay to expose the health endpoint. Check the other After=deploy-proxy consumers (drone, warm-keycloak, reports, bridge, backupbot, nightly-sweep) still order correctly after the change.

Gates

M1 — Fix + controlled reproduction (Builder/Adversary, on a test box / the cc-ci host without a real wipe). Implement the fix. Reproduce the cycle in a CONTROLLED way (per the Adversary's A1 repro): with the dashboard held back / absent, show the OLD gate hangs→fails and the NEW gate goes healthy on traefik alone. Unit/integration-test the new health check. Adversary cold-verifies: cycle is genuinely broken, health signal is still meaningful (a broken traefik must still fail the gate and roll back), no After=deploy-proxy consumer regressed. PR to cc-ci repo; do NOT merge to the live host yet.

M2 — Proven on a real from-scratch boot. The durable proof is a from-scratch / cold boot where deploy-proxy reaches active without the dashboard and the whole control plane converges. Coordinate with the orchestrator: the orchestrator owns the live nixos-rebuild and any throwaway-VM / from-scratch rebuild (host infra is orchestrator/operator territory, not the loops). Acceptance: on a clean boot deploy-proxy is active (not failed), the rollback path still works on a deliberately-broken traefik, and the running server is unaffected. Fresh Adversary PASS → ## DONE.

Guardrails

  • Health verification must stay honest — a traefik that isn't actually serving must still fail the gate and roll back to last-good. Breaking the cycle must not become "always pass".
  • Live host changes (nixos-rebuild, from-scratch boot) are orchestrator/operator-owned; the loops produce the code + controlled proof and a PR, the orchestrator deploys + runs the cold-boot test. The loops must NOT nixos-rebuild the live host themselves.
  • No gate weakening elsewhere; minimal change; no secrets in commits. Commit author autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>; push every commit. CI host has no python3 on default PATH — use the harness venv.

Definition of Done

The deploy-proxy↔dashboard cycle is broken: a from-scratch boot brings traefik up health-gated (on a dashboard-independent signal) with rollback intact, no consumer mis-ordered, the running server unaffected, the A1 / DEFERRED entry closed with pointers, and M1+M2 fresh Adversary PASSes.