JOURNAL — phase pxgate (Builder)
2026-06-13 — Phase start
Orientation:
- Phase plan read: /srv/cc-ci/cc-ci-plan/plan-phase-pxgate-proxy-healthgate.md
- A1 finding from BACKLOG-pvfix.md: confirmed. Root cause exactly as stated.
- Pre-check:
  - https://traefik.ci.commoninternet.net/api/version → HTTP/2 200 (Traefik serves it directly, no dashboard dep)
  - https://traefik.ci.commoninternet.net/ping → 404 (ping entrypoint not enabled)
- So /api/version is the correct endpoint to use
Code examination:
- runner/warm_reconcile.py lines 117-127: traefik spec uses health_domain: "ci.commoninternet.net", health_path: "/"
- Comment at lines 254-256 explains "traefik's own domain has no route of its own". This is outdated: traefik.ci.commoninternet.net/api/version does have a route and returns 200
- nix/modules/proxy.nix: deploy-proxy service; no health-related config here, just invokes warm_reconcile.py
- nix/modules/dashboard.nix: after = [ "deploy-bridge.service" "deploy-proxy.service" ... ] (confirms the ordering)
Other consumers of After=deploy-proxy.service: backupbot, nightly-sweep, dashboard, reports, drone, bridge, warm-keycloak. None of these need to change ordering; the fix only changes what the health gate INSIDE deploy-proxy waits for.
Fix approach (committed to DECISIONS.md): change health probe to traefik.ci.commoninternet.net/api/version. This is traefik's built-in API (no backend needed). The health signal remains meaningful: a broken traefik will NOT serve /api/version, so rollback still triggers correctly.
Fix applied:
- runner/warm_reconcile.py traefik spec: removed health_domain: "ci.commoninternet.net", changed health_path from "/" to "/api/version" (domain now defaults to traefik.ci.commoninternet.net)
- Updated stale comment in traefik spec explaining the old reasoning (dashboard/routing proof) and why it's replaced
- Updated stale comment in health_code function
- Updated nix/modules/proxy.nix comment to reflect the new health probe
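Illustration of the spec change above, as a minimal sketch. Only the health_domain and health_path keys and the domain names come from this journal. The probe_url helper, the BASE_DOMAIN constant, and the rule that a missing health_domain falls back to the app name plus base domain are assumptions for illustration and may not match how warm_reconcile.py actually computes the default.

```python
# Hypothetical sketch; not the actual warm_reconcile.py code.
BASE_DOMAIN = "ci.commoninternet.net"  # assumed base domain


def probe_url(app: str, spec: dict) -> str:
    """Build the HTTPS URL a health gate would probe for an app."""
    # Assumption: when health_domain is absent, default to <app>.<base>.
    domain = spec.get("health_domain", f"{app}.{BASE_DOMAIN}")
    path = spec.get("health_path", "/")
    return f"https://{domain}{path}"


# Spec before the fix: probes the dashboard's domain through traefik.
old_spec = {"health_domain": "ci.commoninternet.net", "health_path": "/"}
# Spec after the fix: health_domain removed, path points at traefik's API.
new_spec = {"health_path": "/api/version"}

print(probe_url("traefik", old_spec))
print(probe_url("traefik", new_spec))
```

Under these assumptions the old spec resolves to the dashboard-backed URL and the new spec resolves to traefik's own API endpoint, which is the whole effect of the fix.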
Controlled reproduction (2026-06-13):
# Scaled dashboard swarm service to 0 replicas (simulates dashboard absent on cold boot):
docker service scale ccci-dashboard_app=0
# OLD probe (ci.commoninternet.net) with dashboard scaled to 0:
curl -sk -o /dev/null -w "%{http_code}" --max-time 5 --resolve "ci.commoninternet.net:443:127.0.0.1" "https://ci.commoninternet.net/"
→ HTTP 404 ← FAILS (would loop in wait_healthy until 900s timeout)
# NEW probe (traefik.ci.commoninternet.net/api/version) with dashboard scaled to 0:
curl -sk -o /dev/null -w "%{http_code}" --max-time 10 --resolve "traefik.ci.commoninternet.net:443:127.0.0.1" "https://traefik.ci.commoninternet.net/api/version"
→ HTTP 200 ← PASSES immediately (traefik's own API, no dashboard dependency)
# New probe body:
→ {"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"}
# Dashboard restored:
docker service scale ccci-dashboard_app=1 → 1/1 ✓
systemctl start deploy-dashboard
curl -sk https://ci.commoninternet.net/ → 200 ✓
Rollback-still-works reasoning: if Traefik is broken (not serving), the probe of https://traefik.ci.commoninternet.net/api/version will fail to connect (connection refused, TLS error), time out, or return a non-200 status such as 5xx. wait_healthy polls this and triggers rollback on failure. As a test of Traefik itself the new probe is not weaker: it probes the same Traefik process. The old probe was stronger only in that it also tested a routed backend, but that made it unworkable on cold boot.
DEFERRED.md update: 2026-06-13 entry closed with this fix commit.
Alert clearance:
# /var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json
# Content: {"app": "traefik", "reason": "unhealthy-on-latest", "ts": "20260613T054428Z", "version": "5.1.1+v3.6.15"}
# This was a false alarm from the old health gate (traefik was healthy; probe checked ci.commoninternet.net
# which wasn't up yet due to the circular dependency). No credentials in the file.
ssh cc-ci 'rm /var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json'
→ alert cleared; ls /var/lib/ci-warm/alerts/ → empty ✓
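A sanity check of this kind could be run on an alert file before deleting it; a minimal sketch follows. The summarize_alert helper and the EXPECTED_KEYS set are hypothetical. The key set and the ts-app-reason.json filename pattern are inferred from the single alert quoted above, not from any documented alert schema.

```python
import json
from datetime import datetime, timezone

# Assumption: keys observed in the one alert quoted in this journal.
EXPECTED_KEYS = {"app", "reason", "ts", "version"}


def summarize_alert(filename: str, raw: str) -> dict:
    """Parse an alert file and flag anything that looks out of place."""
    alert = json.loads(raw)
    # Any key outside the observed set deserves a manual look before removal.
    unexpected = sorted(set(alert) - EXPECTED_KEYS)
    # The ts field uses a compact UTC timestamp, e.g. 20260613T054428Z.
    when = datetime.strptime(alert["ts"], "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    # Assumed naming pattern: <ts>-<app>-<reason>.json
    expected_name = f'{alert["ts"]}-{alert["app"]}-{alert["reason"]}.json'
    return {
        "when": when.isoformat(),
        "unexpected_keys": unexpected,
        "name_matches": filename == expected_name,
    }


raw = '{"app": "traefik", "reason": "unhealthy-on-latest", "ts": "20260613T054428Z", "version": "5.1.1+v3.6.15"}'
summary = summarize_alert("20260613T054428Z-traefik-unhealthy-on-latest.json", raw)
print(summary)
```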
P1-neg (gate has teeth) — manual verification:
The new gate probes https://traefik.ci.commoninternet.net/api/version. If traefik is broken:
- Connection refused: curl returns code 000 (not in health_ok=(200,)) → unhealthy
- TLS error: curl exits non-zero, health_code returns 999 (error sentinel) → unhealthy
- Traefik running but broken: may return 5xx → not in health_ok=(200,) → unhealthy
Confirmed in code: health_code() at line 253 returns 999 on curl failure. P1-neg holds by construction.
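The "holds by construction" argument can be sketched as an allow-list check inside a polling loop. This is an illustration only: is_healthy and wait_healthy_sketch are hypothetical names, the injectable probe/sleep/clock parameters exist just to make the sketch testable, and the real wait_healthy and health_code in warm_reconcile.py may be structured differently. The 999 sentinel and health_ok=(200,) are taken from the notes above.

```python
import time
from typing import Callable, Iterable

ERROR_SENTINEL = 999  # per the journal: health_code returns 999 on curl failure


def is_healthy(code: int, health_ok: Iterable[int] = (200,)) -> bool:
    """Healthy only if the code is explicitly allowed; everything else fails."""
    return code in tuple(health_ok)


def wait_healthy_sketch(
    probe: Callable[[], int],
    timeout_s: float,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it reports a healthy code or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_healthy(probe()):
            return True
        if clock() >= deadline:
            return False  # the caller would trigger rollback here
        sleep(interval_s)


# Codes a broken traefik might produce: 0 (curl's "000"), the 999 sentinel, 5xx.
for bad in (0, ERROR_SENTINEL, 502, 404):
    print(bad, is_healthy(bad))
print(200, is_healthy(200))
```

Because the check is an allow-list rather than a deny-list, any failure mode not anticipated here still counts as unhealthy.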
Next: commit + claim M1. → M1 PASS received @13:00Z. Awaiting orchestrator nixos-rebuild for M2.
2026-06-13T13:24Z — Builder poll (M2 monitoring)
Builder loop re-launched by orchestrator. Checked current state:
- deploy-proxy: active (exited) since 05:44:28 UTC (OLD probe still live)
- Active reconcile script: /nix/store/ls5d6s7q2892z0n0qv7sfk03zimwx3nd-runner/warm_reconcile.py (old; has health_domain: "ci.commoninternet.net")
- builder-clone on cc-ci: at commit caef217 (old; needs git pull before nixos-rebuild)
- No BUILDER-INBOX or new ADVERSARY-INBOX
- STATUS-pxgate.md M2 section has full orchestrator instructions (pull + nixos-rebuild switch)
Monitoring loop active. Will poll every ≤10 min for nixos-rebuild completion.