claim(pxgate-M1): change traefik health probe to /api/version (A1 cycle fix)
Some checks failed
continuous-integration/drone/push Build is failing
Break the deploy-proxy ↔ dashboard health-gate circular dependency (Adversary A1, pvfix):

- runner/warm_reconcile.py: remove health_domain override (was ci.commoninternet.net, the dashboard). Change health_path from / to /api/version. The probe now uses traefik.ci.commoninternet.net/api/version (traefik's own API, no backend/dashboard dep).
- nix/modules/proxy.nix: update comment to reflect new health probe.
- machine-docs/DECISIONS.md: pxgate fix logged (supersedes pvfix manual workaround).
- machine-docs/DEFERRED.md: 2026-06-13 circular-dependency entry closed.
- Consumed BUILDER-INBOX.md (Adversary orientation msg).

Controlled reproduction (dashboard swarm scaled to 0):
  OLD probe (ci.commoninternet.net): HTTP 404 ← gate would loop → timeout
  NEW probe (traefik.../api/version): HTTP 200 ← passes immediately

Stale false-alarm alert 20260613T054428Z-traefik-unhealthy-on-latest.json cleared on host.
No After=deploy-proxy consumers changed (ordering preserved).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
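The change in probe target can be illustrated with a small sketch. It is not the code in runner/warm_reconcile.py: `probe_url` is a hypothetical helper, and the `https://` scheme and the `/` default for a missing `health_path` are assumptions. Only the `health_domain`-falls-back-to-`domain` rule and the spec values come from this commit.

```python
def probe_url(spec: dict) -> str:
    """Hypothetical helper: build the URL a health probe would hit for a spec."""
    # Per the commit, health_domain is optional and defaults to the app domain.
    domain = spec.get("health_domain", spec["domain"])
    # Assumption: probes go over HTTPS, and a missing health_path means "/".
    path = spec.get("health_path", "/")
    return f"https://{domain}{path}"


# Spec fields as they were before this commit (probe routed via the dashboard).
old_spec = {
    "domain": "traefik.ci.commoninternet.net",
    "health_domain": "ci.commoninternet.net",
    "health_path": "/",
}

# Spec fields after this commit (probe hits traefik's own API).
new_spec = {
    "domain": "traefik.ci.commoninternet.net",
    "health_path": "/api/version",
}

print(probe_url(old_spec))
print(probe_url(new_spec))
```

With the override removed, the probe no longer depends on any service that is ordered After=deploy-proxy.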
@@ -112,13 +112,15 @@ SPECS: dict[str, dict] = {
         "health_timeout": 900,
     },
     # traefik = the reverse proxy: STATELESS (version-rollback-only, NO snapshot). Health is probed
-    # on a ROUTED host (the dashboard) since traefik's own domain has no route. `setup` preserves the
-    # wildcard cert / file-provider config.
+    # on traefik's OWN /api/version endpoint (no backend/dashboard dependency) — a broken traefik
+    # will not serve it, so rollback still triggers. Probing ci.commoninternet.net (dashboard) caused
+    # a cold-boot deadlock: deploy-dashboard is After=deploy-proxy, so the dashboard was never up
+    # when deploy-proxy's wait_healthy ran (A1 fix, phase pxgate). `setup` preserves the wildcard
+    # cert / file-provider config.
     "traefik": {
         "recipe": "traefik",
         "domain": "traefik.ci.commoninternet.net",
-        "health_domain": "ci.commoninternet.net",
-        "health_path": "/",
+        "health_path": "/api/version",
         "health_ok": (200,),
         "stateful": False,
         "deploy_timeout": 600,
@@ -251,9 +253,8 @@ def is_deployed(domain: str) -> bool:


 def health_code(spec: dict) -> int:
-    # health is probed on `health_domain` (defaults to the app domain). For traefik the app domain
-    # (traefik.ci…) has no route of its own — health is a ROUTED host (e.g. the dashboard
-    # ci.commoninternet.net), so a 200 proves traefik is up + routing + TLS-terminating.
+    # health is probed on `health_domain` (defaults to the app domain). For traefik, health is
+    # traefik.ci.commoninternet.net/api/version — traefik's own endpoint, no backend needed.
     domain = spec.get("health_domain", spec["domain"])
     r = _run(
         [
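The gate behaviour described in the commit message (old probe loops until timeout, new probe passes immediately) can be sketched as a polling loop around a status-code probe. This is a stand-in under stated assumptions, not the repository's wait_healthy: the name `wait_healthy_sketch`, the 5-second interval, the injectable `probe`/`sleep`/`clock` parameters, and the 900-second fallback when `health_timeout` is absent are all hypothetical.

```python
import time
from typing import Callable


def wait_healthy_sketch(
    spec: dict,
    probe: Callable[[dict], int],
    *,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe(spec)` until it returns a code in spec["health_ok"] or time runs out."""
    # Assumption: the gate's budget comes from health_timeout (seconds).
    deadline = clock() + spec.get("health_timeout", 900)
    while True:
        if probe(spec) in spec["health_ok"]:
            return True  # healthy: the gate opens
        if clock() >= deadline:
            return False  # never became healthy: caller can roll back
        sleep(interval)


# Simulated run with a fake clock, so nothing actually sleeps.
now = [0.0]


def fake_sleep(seconds: float) -> None:
    now[0] += seconds


spec = {"domain": "traefik.ci.commoninternet.net", "health_ok": (200,), "health_timeout": 10}

# A probe that depends on a backend that is not up yet keeps answering 404.
old_result = wait_healthy_sketch(spec, lambda s: 404, sleep=fake_sleep, clock=lambda: now[0])
# A probe answered by traefik itself returns 200 on the first attempt.
new_result = wait_healthy_sketch(spec, lambda s: 200, sleep=fake_sleep, clock=lambda: now[0])
```

In a loop of this shape a probe that can never succeed consumes the whole timeout, which is why the probe target must not depend on a unit that starts only after the gate passes.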