claim(pxgate-M1): change traefik health probe to /api/version (A1 cycle fix)
Some checks failed
continuous-integration/drone/push Build is failing
Break the deploy-proxy ↔ dashboard health-gate circular dependency (Adversary A1, pvfix):

- runner/warm_reconcile.py: remove health_domain override (was ci.commoninternet.net, the dashboard). Change health_path from / to /api/version. The probe now uses traefik.ci.commoninternet.net/api/version (traefik's own API, no backend/dashboard dep).
- nix/modules/proxy.nix: update comment to reflect new health probe.
- machine-docs/DECISIONS.md: pxgate fix logged (supersedes pvfix manual workaround).
- machine-docs/DEFERRED.md: 2026-06-13 circular-dependency entry closed.
- Consumed BUILDER-INBOX.md (Adversary orientation msg).

Controlled reproduction (dashboard swarm scaled to 0):
  OLD probe (ci.commoninternet.net): HTTP 404 ← gate would loop → timeout
  NEW probe (traefik.../api/version): HTTP 200 ← passes immediately

Stale false-alarm alert 20260613T054428Z-traefik-unhealthy-on-latest.json cleared on host.
No After=deploy-proxy consumers changed (ordering preserved).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
71
machine-docs/JOURNAL-pxgate.md
Normal file
@ -0,0 +1,71 @@
# JOURNAL — phase pxgate (Builder)
## 2026-06-13 — Phase start
**Orientation:**
- Phase plan read: `/srv/cc-ci/cc-ci-plan/plan-phase-pxgate-proxy-healthgate.md`
- A1 finding from BACKLOG-pvfix.md: confirmed. Root cause exactly as stated.
- Pre-check: `https://traefik.ci.commoninternet.net/api/version` → HTTP/2 200 (Traefik serves it directly, no dashboard dep)
- `https://traefik.ci.commoninternet.net/ping` → 404 (ping entrypoint not enabled)
- So `/api/version` is the correct endpoint to use

**Code examination:**
- `runner/warm_reconcile.py` lines 117-127: traefik spec uses `health_domain: "ci.commoninternet.net"`, `health_path: "/"`
- Comment at lines 254-256 explains "traefik's own domain has no route of its own". This is outdated: `traefik.ci.commoninternet.net/api/version` does have a route and returns 200
- `nix/modules/proxy.nix`: deploy-proxy service; no health-related config here, it just invokes warm_reconcile.py
- `nix/modules/dashboard.nix`: `after = [ "deploy-bridge.service" "deploy-proxy.service" ... ]`, which confirms the ordering

**Other consumers of `After=deploy-proxy.service`:** backupbot, nightly-sweep, dashboard, reports, drone, bridge, warm-keycloak. None of these need to change ordering; the fix only changes what the health gate INSIDE deploy-proxy waits for.

**Fix approach (committed to DECISIONS.md):** change health probe to `traefik.ci.commoninternet.net/api/version`. This is traefik's built-in API (no backend needed). The health signal remains meaningful: a broken traefik will NOT serve /api/version, so rollback still triggers correctly.

**Fix applied:**
- `runner/warm_reconcile.py` traefik spec: removed `health_domain: "ci.commoninternet.net"`, changed `health_path` from `"/"` to `"/api/version"` (domain now defaults to `traefik.ci.commoninternet.net`)
- Updated the stale comment in the traefik spec; it now explains the old reasoning (dashboard/routing proof) and why it was replaced
- Updated the stale comment in the `health_code` function
- Updated the `nix/modules/proxy.nix` comment to reflect the new health probe

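The effect of dropping the `health_domain` override can be sketched as follows. This is a minimal illustration, not the actual contents of `runner/warm_reconcile.py`: the `app` key, the `BASE_DOMAIN` constant, and the `probe_url` helper are hypothetical names, and only `health_domain`, `health_path`, and the two hostnames come from the journal.

```python
# Hypothetical sketch: only health_domain / health_path and the hostnames are
# taken from the journal; the dict layout and helper are assumptions.
BASE_DOMAIN = "ci.commoninternet.net"

OLD_TRAEFIK_SPEC = {
    "app": "traefik",
    "health_domain": "ci.commoninternet.net",  # the dashboard's domain
    "health_path": "/",
}

NEW_TRAEFIK_SPEC = {
    "app": "traefik",
    # No health_domain override: the probe falls back to traefik.<BASE_DOMAIN>.
    "health_path": "/api/version",
}


def probe_url(spec: dict) -> str:
    """Build the health probe URL, defaulting the domain to <app>.<BASE_DOMAIN>."""
    domain = spec.get("health_domain", f"{spec['app']}.{BASE_DOMAIN}")
    return f"https://{domain}{spec.get('health_path', '/')}"


if __name__ == "__main__":
    print(probe_url(OLD_TRAEFIK_SPEC))
    print(probe_url(NEW_TRAEFIK_SPEC))
```

Under these assumptions the old spec resolves to the dashboard URL and the new one to traefik's own API, which is the whole change in behaviour.
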
**Controlled reproduction (2026-06-13):**
```
# Scaled dashboard swarm service to 0 replicas (simulates dashboard absent on cold boot):
docker service scale ccci-dashboard_app=0

# OLD probe (ci.commoninternet.net) with dashboard scaled to 0:
curl -sk -o /dev/null -w "%{http_code}" --max-time 5 --resolve "ci.commoninternet.net:443:127.0.0.1" "https://ci.commoninternet.net/"
# → HTTP 404 ← FAILS (would loop in wait_healthy until 900s timeout)

# NEW probe (traefik.ci.commoninternet.net/api/version) with dashboard scaled to 0:
curl -sk -o /dev/null -w "%{http_code}" --max-time 10 --resolve "traefik.ci.commoninternet.net:443:127.0.0.1" "https://traefik.ci.commoninternet.net/api/version"
# → HTTP 200 ← PASSES immediately (traefik's own API, no dashboard dependency)

# New probe body:
# → {"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"}

# Dashboard restored:
docker service scale ccci-dashboard_app=1   # → 1/1 ✓
systemctl start deploy-dashboard
curl -sk https://ci.commoninternet.net/     # → 200 ✓
```

**Rollback-still-works reasoning:** if Traefik is broken (not serving), the probe of `https://traefik.ci.commoninternet.net/api/version` fails (connection refused, TLS error, 5xx) or times out. `wait_healthy` polls this and triggers rollback on failure. As a check of Traefik itself the new probe is not weaker: it hits the same Traefik process. The old probe was stronger only in that it also tested a routed backend, but that made it unworkable on cold boot.

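The polling behaviour described above can be sketched like this. It is an illustration only: the real `wait_healthy` signature is not shown in this journal, so the parameters, the 5-second interval, and the injectable `sleep`/`clock` arguments are assumptions. The 900-second timeout and `health_ok=(200,)` are taken from the journal.

```python
import time
from typing import Callable, Tuple


def wait_healthy(
    probe: Callable[[], int],
    health_ok: Tuple[int, ...] = (200,),
    timeout_s: float = 900.0,   # timeout quoted in the journal
    interval_s: float = 5.0,    # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it returns an accepted code or the timeout elapses.

    Returns True when healthy; False means the caller should roll back.
    """
    deadline = clock() + timeout_s
    while True:
        if probe() in health_ok:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # A probe stuck on 404 (the old cold-boot case) never becomes healthy.
    print(wait_healthy(lambda: 404, timeout_s=0.2, interval_s=0.05))
```

With the old probe the dashboard's 404 kept this loop spinning until the deadline; with the new probe a healthy Traefik returns 200 on the first poll.
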
**DEFERRED.md update:** 2026-06-13 entry closed with this fix commit.

**Alert clearance:**
```
# /var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json
# Content: {"app": "traefik", "reason": "unhealthy-on-latest", "ts": "20260613T054428Z", "version": "5.1.1+v3.6.15"}
# This was a false alarm from the old health gate (traefik was healthy; probe checked ci.commoninternet.net
# which wasn't up yet due to the circular dependency). No credentials in the file.
ssh cc-ci 'rm /var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json'
# → alert cleared; ls /var/lib/ci-warm/alerts/ → empty ✓
```

**P1-neg (gate has teeth), manual verification:**
The new gate probes `https://traefik.ci.commoninternet.net/api/version`. If traefik is broken:
- Connection refused: curl returns code 000 (not in health_ok=(200,)) → unhealthy
- TLS error: curl exits non-zero, health_code returns 999 (error sentinel) → unhealthy
- Traefik running but broken: may return 5xx → not in health_ok=(200,) → unhealthy

Confirmed in code: health_code() at line 253 returns 999 on curl failure. P1-neg holds by construction.

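The three failure cases can be illustrated with a small sketch of a curl-based probe. This is not the code at line 253: the function signatures, `parse_curl_result`, and `is_healthy` are hypothetical, and the curl flags are copied from the reproduction commands in this journal. Only the 999 sentinel and `health_ok=(200,)` are stated in the source. In this sketch any non-zero curl exit maps to 999, so connection refused and TLS errors both land on the sentinel; either way the result is outside `health_ok`.

```python
import subprocess

ERROR_SENTINEL = 999   # value the journal says health_code returns on curl failure
HEALTH_OK = (200,)


def parse_curl_result(returncode: int, stdout: str) -> int:
    """Map a finished curl run to an HTTP code, or the sentinel on failure."""
    if returncode != 0:
        return ERROR_SENTINEL
    try:
        return int(stdout.strip())
    except ValueError:
        return ERROR_SENTINEL


def health_code(domain: str, path: str, timeout_s: int = 10) -> int:
    """Probe https://<domain><path> via the local proxy, as in the journal's curl."""
    cmd = [
        "curl", "-sk", "-o", "/dev/null", "-w", "%{http_code}",
        "--max-time", str(timeout_s),
        "--resolve", f"{domain}:443:127.0.0.1",
        f"https://{domain}{path}",
    ]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s + 5)
    except (OSError, subprocess.TimeoutExpired):
        return ERROR_SENTINEL  # curl missing, or it hung past its own timeout
    return parse_curl_result(proc.returncode, proc.stdout)


def is_healthy(code: int) -> bool:
    """Only codes listed in HEALTH_OK count as healthy."""
    return code in HEALTH_OK
```

Keeping the parsing separate from the subprocess call makes the "gate has teeth" claim checkable without a live Traefik: every failure path yields a code that `is_healthy` rejects.
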
**Next:** commit + claim M1.