claim(pxgate-M1): change traefik health probe to /api/version (A1 cycle fix)
Some checks failed
continuous-integration/drone/push Build is failing
Break the deploy-proxy ↔ dashboard health-gate circular dependency (Adversary A1, pvfix):

- runner/warm_reconcile.py: remove health_domain override (was ci.commoninternet.net, the dashboard). Change health_path from / to /api/version. The probe now uses traefik.ci.commoninternet.net/api/version (traefik's own API, no backend/dashboard dep).
- nix/modules/proxy.nix: update comment to reflect new health probe.
- machine-docs/DECISIONS.md: pxgate fix logged (supersedes pvfix manual workaround).
- machine-docs/DEFERRED.md: 2026-06-13 circular-dependency entry closed.
- Consumed BUILDER-INBOX.md (Adversary orientation msg).

Controlled reproduction (dashboard swarm scaled to 0):
  OLD probe (ci.commoninternet.net): HTTP 404 ← gate would loop → timeout
  NEW probe (traefik.../api/version): HTTP 200 ← passes immediately

Stale false-alarm alert 20260613T054428Z-traefik-unhealthy-on-latest.json cleared on host.
No After=deploy-proxy consumers changed (ordering preserved).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
83
machine-docs/STATUS-pxgate.md
Normal file
@@ -0,0 +1,83 @@
# STATUS — phase pxgate (Builder)
**Phase plan:** `/srv/cc-ci/cc-ci-plan/plan-phase-pxgate-proxy-healthgate.md`
**Phase start:** 2026-06-13

---
## Gate: M1 — CLAIMED, awaiting Adversary
### WHAT is claimed
The deploy-proxy ↔ dashboard health-gate circular dependency (Adversary A1, pvfix) has been broken: the proxy health gate no longer depends on the dashboard.

**Changed files:**
- `runner/warm_reconcile.py`: in `SPECS["traefik"]`, removed `"health_domain": "ci.commoninternet.net"` and changed `"health_path"` from `"/"` to `"/api/version"`. The health probe now uses `traefik.ci.commoninternet.net/api/version` (traefik's own API endpoint, no backend/dashboard dependency).
- `nix/modules/proxy.nix`: updated comment to reflect the new health probe.
- `machine-docs/DECISIONS.md`: pxgate decision logged (supersedes pvfix workaround).
- `machine-docs/DEFERRED.md`: 2026-06-13 circular-dependency entry closed.

**No ordering changes:** all `After=deploy-proxy` consumers (drone, warm-keycloak, bridge, dashboard, backupbot, reports, nightly-sweep) are unchanged.
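The effect of the spec change on the probed URL can be illustrated with a minimal sketch. Only the `"traefik"` key, the `"health_path"` values, and the removed `"health_domain"` override come from the claim; the `probe_url` helper, the `BASE_DOMAIN` constant, and the way the default domain is derived are hypothetical and may not match `runner/warm_reconcile.py`.

```python
# Hypothetical sketch: names other than the spec keys are assumptions.
BASE_DOMAIN = "ci.commoninternet.net"  # assumed base domain for per-service hosts

# Spec before the change: the probe was redirected to the dashboard's domain.
SPEC_OLD = {"health_domain": "ci.commoninternet.net", "health_path": "/"}
# Spec after the change: no override, probe targets traefik's own API.
SPEC_NEW = {"health_path": "/api/version"}


def probe_url(name: str, spec: dict) -> str:
    """Build the health URL, falling back to <name>.<base> without an override."""
    domain = spec.get("health_domain", f"{name}.{BASE_DOMAIN}")
    path = spec.get("health_path", "/")
    return f"https://{domain}{path}"


if __name__ == "__main__":
    print(probe_url("traefik", SPEC_OLD))
    print(probe_url("traefik", SPEC_NEW))
```

Under these assumptions the old spec resolves to the dashboard's root URL, while the new spec resolves to `https://traefik.ci.commoninternet.net/api/version`, which traefik answers itself.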
### HOW to verify (cold-clone commands)
```bash
# 1. Code change correct:
grep -A5 '"traefik"' runner/warm_reconcile.py
# EXPECTED: no "health_domain" key, "health_path": "/api/version"

# 2. New probe works with only traefik up (controlled repro):
ssh cc-ci 'docker service scale ccci-dashboard_app=0'
ssh cc-ci 'curl -sk -o /dev/null -w "%{http_code}" --max-time 10 --resolve "traefik.ci.commoninternet.net:443:127.0.0.1" "https://traefik.ci.commoninternet.net/api/version"'
# EXPECTED: 200
# Restore: ssh cc-ci 'docker service scale ccci-dashboard_app=1'

# 3. Old probe fails with dashboard stopped:
ssh cc-ci 'docker service scale ccci-dashboard_app=0'
ssh cc-ci 'curl -sk -o /dev/null -w "%{http_code}" --max-time 5 --resolve "ci.commoninternet.net:443:127.0.0.1" "https://ci.commoninternet.net/"'
# EXPECTED: 404 (confirms the old gate would fail/loop → rollback after timeout)
# Restore: ssh cc-ci 'docker service scale ccci-dashboard_app=1'

# 4. No After=deploy-proxy consumers regressed:
for svc in deploy-drone deploy-bridge deploy-dashboard backupbot-backup.timer nightly-sweep.timer warm-keycloak; do
  ssh cc-ci "systemctl cat $svc 2>/dev/null | grep -E 'After|Wants|Requires' | grep -v '^#'"
done
# EXPECTED: each still has After=deploy-proxy.service (ordering preserved)

# 5. Alert cleared:
ssh cc-ci 'ls /var/lib/ci-warm/alerts/'
# EXPECTED: empty (stale false-alarm alert from old gate removed)

# 6. Rollback semantics (P1-neg: gate has teeth):
# health_code() returns 999 on curl failure; 200 from /api/version is only returned when traefik
# is actually serving. Verify in code: health_code() → 999 on error path.
grep -n "health_code\|999" runner/warm_reconcile.py
# EXPECTED: error sentinel 999 returned when curl fails
```
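Step 6 relies on `health_code()` mapping every curl failure to the sentinel 999. The following is a rough sketch of that behaviour, not the actual function from `runner/warm_reconcile.py`: the signature, the injectable `run` parameter, and the exact error handling are assumptions made for illustration. The curl flags mirror those used in the verification commands above.

```python
import subprocess
from typing import Callable, Optional

CURL_FAILED = 999  # error sentinel named in the claim


def health_code(url: str, resolve: Optional[str] = None, timeout: int = 10,
                run: Callable = subprocess.run) -> int:
    """Return the HTTP status for url, or 999 if curl cannot produce one."""
    cmd = ["curl", "-sk", "-o", "/dev/null", "-w", "%{http_code}",
           "--max-time", str(timeout)]
    if resolve:
        # e.g. "traefik.ci.commoninternet.net:443:127.0.0.1"
        cmd += ["--resolve", resolve]
    cmd.append(url)
    try:
        proc = run(cmd, capture_output=True, text=True, timeout=timeout + 5)
    except (OSError, subprocess.TimeoutExpired):
        return CURL_FAILED  # curl missing, or the process hung
    if proc.returncode != 0:
        return CURL_FAILED  # connection refused, TLS error, timeout, ...
    try:
        return int(proc.stdout.strip())
    except ValueError:
        return CURL_FAILED  # unexpected output
```

With this shape, a 200 can only come from an HTTP response that traefik actually served, which is the property the gate depends on.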
### EXPECTED outcomes
| Check | Expected |
|---|---|
| `health_path` in traefik spec | `/api/version` |
| `health_domain` in traefik spec | absent (defaults to `traefik.ci.commoninternet.net`) |
| New probe (dashboard=0) | HTTP 200 |
| Old probe (dashboard=0) | HTTP 404 |
| `After=deploy-proxy` consumers | Unchanged (still ordered after proxy) |
| Alert dir | Empty |
| `health_code` error sentinel | 999 |
### WHERE (commit sha)
Commit hash: see `git log --oneline -1` after this claim commit lands.

---
## Gate: M2 — OPEN (awaiting M1 PASS + orchestrator cold-boot)
M2 requires a from-scratch (cold) boot in which:
1. `deploy-proxy.service` reaches `active` without the dashboard pre-deployed
2. The rollback path still works on a deliberately broken traefik
3. The running server is unaffected

M2 is orchestrator-owned (they run the nixos-rebuild on the live host). The loops produce the code and the M1 proof; the orchestrator deploys and runs the cold-boot test.
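The two behaviours M2 checks (gate passes with only traefik up, gate rolls back when traefik is broken) can be sketched as a polling loop with a deadline. This is an illustrative model only: `wait_healthy`, `gate`, the default deadline and interval, and the `rollback` callback are hypothetical, and the real gate's timing and rollback mechanism are not shown in this claim.

```python
import time
from typing import Callable


def wait_healthy(probe: Callable[[], int], deadline_s: float,
                 interval_s: float = 2.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll probe() until it returns 200 or the deadline has passed."""
    end = clock() + deadline_s
    while True:
        if probe() == 200:
            return True
        if clock() >= end:
            return False
        sleep(interval_s)


def gate(probe: Callable[[], int], rollback: Callable[[], None],
         deadline_s: float = 60.0, **kw) -> str:
    """Report 'active' on a healthy probe; otherwise roll back."""
    if wait_healthy(probe, deadline_s, **kw):
        return "active"
    rollback()  # any non-200 (404, 999, ...) until the deadline ends here
    return "rolled-back"
```

In this model a probe that keeps returning 404 or 999 always ends in a rollback, while a probe served by traefik alone passes on the first poll.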