cc-ci/machine-docs/JOURNAL-pxgate.md
autonomic-bot 162f731e91
status(pxgate): ## DONE — M1+M2 PASS, cycle broken, cold-boot sim confirms no deadlock
M2 verified: nixos-rebuild @13:43Z deployed /api/version probe; deploy-proxy
active(exited) in 279ms (nixos-rebuild) and 17ms (cold-boot sim) — no alert, no
deadlock. All 9 services 1/1. Running server unaffected. Adversary PASS @13:44Z.
BUILDER-INBOX consumed.
2026-06-13 13:47:42 +00:00


JOURNAL — phase pxgate (Builder)

2026-06-13 — Phase start

Orientation:

  • Phase plan read: /srv/cc-ci/cc-ci-plan/plan-phase-pxgate-proxy-healthgate.md
  • A1 finding from BACKLOG-pvfix.md: confirmed. Root cause exactly as stated.
  • Pre-check: https://traefik.ci.commoninternet.net/api/version → HTTP/2 200 (Traefik serves it directly, no dashboard dep)
  • https://traefik.ci.commoninternet.net/ping → 404 (ping entrypoint not enabled)
  • So /api/version is the correct endpoint to use

Code examination:

  • runner/warm_reconcile.py lines 117-127: traefik spec uses health_domain: "ci.commoninternet.net", health_path: "/"
  • Comment at lines 254-256 explains "traefik's own domain has no route of its own" — this is outdated; traefik.ci.commoninternet.net/api/version does have a route and returns 200
  • nix/modules/proxy.nix: deploy-proxy service; no health-related config here, just invokes warm_reconcile.py
  • nix/modules/dashboard.nix: after = [ "deploy-bridge.service" "deploy-proxy.service" ... ] — confirms the ordering

Other consumers of After=deploy-proxy.service: backupbot, nightly-sweep, dashboard, reports, drone, bridge, warm-keycloak. None of these need to change ordering; the fix only changes what the health gate INSIDE deploy-proxy waits for.

Fix approach (committed to DECISIONS.md): change health probe to traefik.ci.commoninternet.net/api/version. This is traefik's built-in API (no backend needed). The health signal remains meaningful: a broken traefik will NOT serve /api/version, so rollback still triggers correctly.
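The circular wait behind this fix can be illustrated with a small dependency-graph sketch. The graphs below are a simplified model of the findings in this entry, not a dump of the real systemd units: the node names `dashboard-route` and `traefik-api` and the helper `find_cycle` are hypothetical.

```python
from typing import Dict, List, Optional


def find_cycle(waits_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in waits_on}
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        stack.append(node)
        for dep in waits_on.get(node, []):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path: that is a cycle.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(waits_on):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Old gate: deploy-proxy waits for a route that only exists once the
# dashboard is deployed, and the dashboard is ordered after deploy-proxy.
OLD = {
    "deploy-proxy": ["dashboard-route"],
    "dashboard-route": ["deploy-dashboard"],
    "deploy-dashboard": ["deploy-proxy"],
}

# New gate: deploy-proxy waits only for Traefik's own API.
NEW = {
    "deploy-proxy": ["traefik-api"],
    "deploy-dashboard": ["deploy-proxy"],
}

print(find_cycle(OLD))
print(find_cycle(NEW))
```

In this model the old gate yields a cycle through all three nodes, while the new gate yields none, even though the After= ordering of the consumers is unchanged.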

Fix applied:

  • runner/warm_reconcile.py traefik spec: removed health_domain: "ci.commoninternet.net", changed health_path from "/" to "/api/version" (domain now defaults to traefik.ci.commoninternet.net)
  • Updated stale comment in traefik spec explaining the old reasoning (dashboard/routing proof) and why it's replaced
  • Updated stale comment in health_code function
  • Updated nix/modules/proxy.nix comment to reflect the new health probe
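A minimal sketch of how the spec change alters the probed URL. The keys `health_domain` and `health_path` come from this entry; the function `health_url` and the exact defaulting rule (`<app>.ci.commoninternet.net`) are assumptions for illustration and may not match the real code in warm_reconcile.py.

```python
from typing import Mapping

BASE_DOMAIN = "ci.commoninternet.net"  # domain as it appears in this journal


def health_url(app: str, spec: Mapping[str, str]) -> str:
    """Build the probe URL; the domain falls back to <app>.<BASE_DOMAIN>.

    Hypothetical helper: the fallback rule is inferred from the note that
    the domain "now defaults to traefik.ci.commoninternet.net".
    """
    domain = spec.get("health_domain", f"{app}.{BASE_DOMAIN}")
    path = spec.get("health_path", "/")
    return f"https://{domain}{path}"


# Only the health-related keys are shown; the real spec has other fields.
OLD_SPEC = {"health_domain": "ci.commoninternet.net", "health_path": "/"}
NEW_SPEC = {"health_path": "/api/version"}

print(health_url("traefik", OLD_SPEC))
print(health_url("traefik", NEW_SPEC))
```

Dropping the explicit `health_domain` rather than overwriting it keeps the traefik spec consistent with any other specs that rely on the default.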

Controlled reproduction (2026-06-13):

# Scaled dashboard swarm service to 0 replicas (simulates dashboard absent on cold boot):
docker service scale ccci-dashboard_app=0

# OLD probe (ci.commoninternet.net) with dashboard scaled to 0:
curl -sk -o /dev/null -w "%{http_code}" --max-time 5 --resolve "ci.commoninternet.net:443:127.0.0.1" "https://ci.commoninternet.net/"
→ HTTP 404  ← FAILS (would loop in wait_healthy until 900s timeout)

# NEW probe (traefik.ci.commoninternet.net/api/version) with dashboard scaled to 0:
curl -sk -o /dev/null -w "%{http_code}" --max-time 10 --resolve "traefik.ci.commoninternet.net:443:127.0.0.1" "https://traefik.ci.commoninternet.net/api/version"
→ HTTP 200  ← PASSES immediately (traefik's own API, no dashboard dependency)

# New probe body:
→ {"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"}

# Dashboard restored:
docker service scale ccci-dashboard_app=1  → 1/1 ✓
systemctl start deploy-dashboard
curl -sk https://ci.commoninternet.net/  → 200 ✓

Rollback-still-works reasoning: if Traefik is broken (not serving), https://traefik.ci.commoninternet.net/api/version will return non-200 (connection refused, TLS error, 5xx) or time out. wait_healthy polls this and triggers rollback on failure. The new probe is not weaker — it probes the same Traefik process. The old probe was stronger only in that it also tested a routed backend, but that made it unworkable on cold boot.
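The polling behaviour described above can be sketched roughly as follows. This is not the actual `wait_healthy` from warm_reconcile.py: the signature, the injected `probe`, `clock` and `sleep` callables, and the 5 s interval are assumptions. Only the 900 s timeout and `health_ok=(200,)` are taken from this entry.

```python
import time
from typing import Callable, Tuple


def wait_healthy(
    probe: Callable[[], int],
    health_ok: Tuple[int, ...] = (200,),
    timeout_s: float = 900.0,
    interval_s: float = 5.0,  # assumed; the real interval is not stated here
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it returns an accepted code or the timeout elapses.

    Returns True when healthy; False means the caller should roll back.
    """
    deadline = clock() + timeout_s
    while True:
        if probe() in health_ok:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With the old probe and the dashboard absent, every poll sees 404, so a loop of this shape runs to the full timeout and returns False. With the new probe, the first poll sees 200 and it returns at once.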

DEFERRED.md update: 2026-06-13 entry closed with this fix commit.

Alert clearance:

# /var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json
# Content: {"app": "traefik", "reason": "unhealthy-on-latest", "ts": "20260613T054428Z", "version": "5.1.1+v3.6.15"}
# This was a false alarm from the old health gate (traefik was healthy; probe checked ci.commoninternet.net
# which wasn't up yet due to the circular dependency). No credentials in the file.
ssh cc-ci 'rm /var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json'
→ alert cleared; ls /var/lib/ci-warm/alerts/ → empty ✓
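For reference, the alert record above can be decoded with a short sketch like this one. The JSON content is copied from this entry; the helper `parse_alert` is hypothetical, and the filename pattern `<ts>-<app>-<reason>.json` is inferred from this single example.

```python
import json
from datetime import datetime, timezone

ALERT_JSON = (
    '{"app": "traefik", "reason": "unhealthy-on-latest", '
    '"ts": "20260613T054428Z", "version": "5.1.1+v3.6.15"}'
)


def parse_alert(text: str):
    """Return (app, reason, raised_at_utc, expected_filename) for one alert."""
    data = json.loads(text)
    # The timestamp is compact ISO 8601 in UTC, e.g. 20260613T054428Z.
    raised = datetime.strptime(data["ts"], "%Y%m%dT%H%M%SZ").replace(
        tzinfo=timezone.utc
    )
    # Assumed naming scheme, inferred from the one file seen in this journal.
    filename = f'{data["ts"]}-{data["app"]}-{data["reason"]}.json'
    return data["app"], data["reason"], raised, filename


app, reason, raised, filename = parse_alert(ALERT_JSON)
print(app, reason, raised.isoformat(), filename)
```

The decoded timestamp (05:44:28 UTC) matches the time at which deploy-proxy last went active (exited) under the old probe.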

P1-neg (gate has teeth) — manual verification: The new gate probes https://traefik.ci.commoninternet.net/api/version. If traefik is broken:

  • Connection refused: curl returns code 000 (not in health_ok=(200,)) → unhealthy
  • TLS error: curl exits non-zero, health_code returns 999 (error sentinel) → unhealthy
  • Traefik running but broken: may return 5xx → not in health_ok=(200,) → unhealthy

Confirmed in code: health_code() at line 253 returns 999 on curl failure. P1-neg holds by construction.
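The failure modes listed above can be checked against the membership test with a small sketch. The function names and the mapping from curl's exit status to a code are assumptions, not the real `health_code()`. The sentinel is a parameter, so the argument does not depend on its exact value.

```python
from typing import Tuple

HEALTH_OK: Tuple[int, ...] = (200,)  # from this journal: health_ok=(200,)


def health_code_from_curl(returncode: int, stdout: str, sentinel: int = 999) -> int:
    """Map a finished curl run to an integer code (hypothetical helper).

    Any non-zero exit or unparsable output collapses to the sentinel.
    """
    if returncode != 0:
        return sentinel
    try:
        return int(stdout.strip())
    except ValueError:
        return sentinel


def is_healthy(code: int, health_ok: Tuple[int, ...] = HEALTH_OK) -> bool:
    return code in health_ok


# (curl exit status, text printed by -w "%{http_code}")
cases = {
    "healthy": (0, "200"),
    "connection refused": (7, "000"),
    "TLS error": (35, "000"),
    "traefik up but erroring": (0, "502"),  # 502 is an illustrative 5xx
}
for name, (rc, out) in cases.items():
    print(name, is_healthy(health_code_from_curl(rc, out)))
```

Because the gate only accepts codes in `health_ok`, any sentinel value outside that tuple (999 or 0) gives the same verdict.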

Next: commit + claim M1. → M1 PASS received @13:00Z. Awaiting orchestrator nixos-rebuild for M2.

2026-06-13T13:24Z — Builder poll (M2 monitoring)

Builder loop re-launched by orchestrator. Checked current state:

  • deploy-proxy: active (exited) since 05:44:28 UTC (OLD probe still live)
  • Active reconcile script: /nix/store/ls5d6s7q2892z0n0qv7sfk03zimwx3nd-runner/warm_reconcile.py (old — has health_domain: "ci.commoninternet.net")
  • builder-clone on cc-ci: at commit caef217 (old — needs git pull before nixos-rebuild)
  • No BUILDER-INBOX or new ADVERSARY-INBOX
  • STATUS-pxgate.md M2 section has full orchestrator instructions (pull + nixos-rebuild switch)

Monitoring loop active. Will poll every ≤10 min for nixos-rebuild completion.

2026-06-13T13:35Z — Adversary verdict received + builder-clone fix

Adversary pushed review(pxgate): idle break-it probes PASS @13:31Z. All idle probes PASS:

  • P_stability: /api/version 200 (6/6 probes from orchestrator + cc-ci)
  • P_services: all 9 Docker services 1/1
  • P_alerts: alerts/ empty
  • P_leak: no secrets in /api/version response
  • P_ping_still_404: /ping still 404 (correct)
  • Re-confirmed builder sentinel discrepancy (999 vs 0): non-blocking, code correct

Key finding from Adversary: builder-clone on cc-ci was on branch restructure/concurrency at caef217 — 288 commits behind main. Fixed:

ssh cc-ci 'cd /root/builder-clone && git checkout main && git pull'
→ Switched to branch 'main'; fast-forwarded 288 commits to d23baf8

STATUS-pxgate.md updated to include git checkout main safeguard in nixos-rebuild instructions. Builder-clone is now pre-staged on main at d23baf8 — orchestrator only needs to run nixos-rebuild.

2026-06-13T13:44Z — M2 PASS: nixos-rebuild complete, all checks green

Orchestrator BUILDER-INBOX: nixos-rebuild completed on live cc-ci host. Fix committed.

  • Used /root/cc-ci-deploy (not builder-clone) + operator-held secrets.yaml
  • nixos-rebuild switch --flake .#cc-ci succeeded

Builder M2 verification (all checks run independently):

# Check 1: deploy-proxy active
systemctl status deploy-proxy → Active: active (exited) since 13:44:01 UTC ✓ (279ms CPU)

# Check 2: new runner with /api/version
cat /nix/store/8qjh8apxcbs85asgizkymjskicf4zmsl-cc-ci-reconcile-proxy/bin/cc-ci-reconcile-proxy
→ exec python3 /nix/store/5hic3aba65i88m1ib67b7g6dwzrzd1z2-runner/warm_reconcile.py traefik
grep '"traefik"' .../warm_reconcile.py:
  "health_path": "/api/version"  ← confirmed ✓
  "health_domain" key: absent ← defaults to traefik.ci.commoninternet.net ✓

# Check 3: all services 1/1
docker service ls → 9 services all 1/1 ✓

# Check 4: cold-boot simulation
systemctl stop deploy-dashboard
systemctl stop deploy-proxy && systemctl reset-failed deploy-proxy
systemctl start deploy-proxy
→ Active: active (exited) since 13:46:05 UTC (17ms!) — NO DASHBOARD NEEDED ✓
systemctl start deploy-dashboard → active (exited) ✓

# Check 5: running server unaffected
curl https://ci.commoninternet.net/ → 200 ✓
curl https://traefik.ci.commoninternet.net/api/version → 200 ✓
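Check 3 was done by eye. A scripted version might look like the sketch below, which assumes the listing was produced with `docker service ls --format '{{.Name}} {{.Replicas}}'`. The helper `under_replicated` and the service name `example_app` are hypothetical, and the sample text is illustrative rather than captured output.

```python
from typing import List


def under_replicated(listing: str) -> List[str]:
    """Return service names whose running/desired replica counts are not N/N (N > 0)."""
    bad = []
    for line in listing.splitlines():
        if not line.strip():
            continue
        # Each line looks like "<name> <running>/<desired>"; extra fields are ignored.
        name, replicas = line.split()[:2]
        running, desired = replicas.split("/")
        if int(running) != int(desired) or int(desired) == 0:
            bad.append(name)
    return bad


# Illustrative input: the dashboard scaled to 0, as in the controlled reproduction.
sample = "ccci-dashboard_app 0/0\nexample_app 1/1\n"
print(under_replicated(sample))
```

An empty result corresponds to the "all 1/1" state recorded in Check 3.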

Adversary PASS received (independently verified same checks). "Builder may write ## DONE." STATUS-pxgate.md updated with M2 PASS + ## DONE. BUILDER-INBOX consumed.