review(pxgate-M1): PASS — cycle broken, /api/version probe dashboard-independent, rollback intact
Some checks failed
continuous-integration/drone/push Build is failing
Cold verification of commit 0e9fd38:
1. Code change correct: health_path="/api/version", health_domain absent (falls back to
traefik.ci.commoninternet.net). Probe is traefik's own API, no backend dependency.
2. Controlled repro (dashboard=0): new probe → 200; old probe → 404. Cycle broken.
3. Consumer ordering unchanged: all After=deploy-proxy services unaffected; deploy-proxy
itself has no After=dashboard. Fix does not change any service ordering.
4. Alert dir empty: stale alert cleared.
5. proxy.nix comment updated correctly.
6. Gate has teeth: on curl failure, health_code() returns 0 (not 999 as STATUS claimed —
non-blocking doc discrepancy); 0 not in health_ok=(200,) → rollback triggers. Functional PASS.
7. DEFERRED entry closed, DECISIONS logged.
No blocking findings. M2 pending orchestrator cold-boot.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ -94,14 +94,83 @@ or by examining the reconciler code path (deploy_version raises → upgrade_ok=F
## M1 — Fix + controlled reproduction

### PENDING — awaiting Builder implementation
### PASS @2026-06-13T13:00Z — Adversary cold-verified

Acceptance criteria I will independently verify:

1. **Code change correct**: `SPECS["traefik"]` removes `health_domain` override and uses `/api/version`
2. **New gate is meaningful**: a STOPPED/broken traefik must cause the probe to fail (return non-200)
3. **Controlled reproduction**: with dashboard held back, old gate hangs/fails; new gate passes on traefik alone
4. **No `After=deploy-proxy` consumer regressed**: drone, warm-keycloak, bridge, dashboard, backupbot, reports ordering still correct
5. **PR merged or ready** to the cc-ci repo

**Commit:** `0e9fd38` (`claim(pxgate-M1): change traefik health probe to /api/version`)

#### Check 1 — Code change correct ✅

`runner/warm_reconcile.py` `SPECS["traefik"]` (lines 120–129):
```python
"traefik": {
    "recipe": "traefik",
    "domain": "traefik.ci.commoninternet.net",
    "health_path": "/api/version",  # ← changed from "/"
    "health_ok": (200,),
    "stateful": False,
    "deploy_timeout": 600,
    "health_timeout": 300,
    "setup": _traefik_setup,
},
```
The `health_domain` key is **absent**, so `health_code()` falls back to `spec["domain"]` =
`"traefik.ci.commoninternet.net"`. The probe is now `https://traefik.ci.commoninternet.net/api/version`
with `--resolve traefik.ci.commoninternet.net:443:127.0.0.1`: traefik's own API, with no backend dependency.

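A minimal sketch of the fallback described in Check 1, for readers who do not have `runner/warm_reconcile.py` open. `probe_command` is a hypothetical helper, not the repo's `health_code()`; only the domain, path, and `--resolve` target come from the review, and the remaining curl flags are assumptions about how such a probe is typically invoked.

```python
def probe_command(spec: dict) -> list[str]:
    """Build a curl argv for the health probe (illustrative helper, not repo code)."""
    # Use the override if present, otherwise fall back to the spec's own domain.
    host = spec.get("health_domain", spec["domain"])
    url = f"https://{host}{spec['health_path']}"
    return [
        "curl", "--silent", "--output", "/dev/null",
        "--write-out", "%{http_code}",          # print only the HTTP status code
        "--resolve", f"{host}:443:127.0.0.1",   # pin the hostname to the local proxy
        url,
    ]


spec = {
    "domain": "traefik.ci.commoninternet.net",
    "health_path": "/api/version",
    "health_ok": (200,),
}
cmd = probe_command(spec)
print(cmd[-1])
```

With no `health_domain` key the URL targets the traefik domain itself; adding the key back would redirect the probe to whatever host it names, which is how the old gate ended up depending on the dashboard.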
#### Check 2 — Controlled reproduction ✅

Scaled `ccci-dashboard_app` to 0 replicas (dashboard absent):
- **New probe** (`/api/version` on traefik domain): HTTP **200** ← cycle broken
- **Old probe** (`ci.commoninternet.net/`): HTTP **404** ← confirms old gate was deadlocked

Dashboard restored to 1/1 and returns 200 after scale-up.

#### Check 3 — Consumer ordering unchanged ✅

All `After=deploy-proxy.service` consumers unchanged:
```
deploy-drone: After=deploy-proxy.service swarm-init.service docker.service network-online.target
deploy-bridge: After=deploy-drone.service deploy-proxy.service ...
deploy-dashboard: After=deploy-bridge.service deploy-proxy.service ...
deploy-backupbot: After=deploy-dashboard.service deploy-proxy.service ...
deploy-reports: After=deploy-dashboard.service deploy-proxy.service ...
nightly-sweep: After=deploy-proxy.service warm-keycloak.service docker.service
warm-keycloak: After=deploy-proxy.service ...
```
`deploy-proxy` itself: `After=swarm-init.service docker.service network-online.target` — no dashboard
dependency in its own ordering (correct). Fix does not change any service ordering.

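The ordering argument can be illustrated as a cycle check on a "waits for" graph. This is a sketch under stated assumptions: the edges are copied from the `After=` lines above with the `.service` suffix dropped, entries elided with `...` are simply left out, and the old gate is modelled as one extra implicit edge from `deploy-proxy` to `deploy-dashboard`. `has_cycle` and `with_gate_dependency` are hypothetical names, not part of the repo.

```python
# Ordering edges: unit -> units it must start after (elided "..." entries omitted).
AFTER = {
    "deploy-proxy": ["swarm-init", "docker", "network-online.target"],
    "deploy-drone": ["deploy-proxy", "swarm-init", "docker", "network-online.target"],
    "deploy-bridge": ["deploy-drone", "deploy-proxy"],
    "deploy-dashboard": ["deploy-bridge", "deploy-proxy"],
    "deploy-backupbot": ["deploy-dashboard", "deploy-proxy"],
    "deploy-reports": ["deploy-dashboard", "deploy-proxy"],
    "nightly-sweep": ["deploy-proxy", "warm-keycloak", "docker"],
    "warm-keycloak": ["deploy-proxy"],
}


def has_cycle(graph: dict) -> bool:
    """Depth-first search; a cycle exists if we re-enter a node still being visited."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = {}

    def visit(node) -> bool:
        state[node] = GREY
        for dep in graph.get(node, []):
            seen = state.get(dep, WHITE)
            if seen == GREY:
                return True
            if seen == WHITE and visit(dep):
                return True
        state[node] = BLACK
        return False

    return any(state.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


def with_gate_dependency(graph: dict, unit: str, needs: str) -> dict:
    """Copy the graph and add an implicit edge: `unit` cannot finish until `needs` is up."""
    copy = {k: list(v) for k, v in graph.items()}
    copy.setdefault(unit, []).append(needs)
    return copy


old_gate = with_gate_dependency(AFTER, "deploy-proxy", "deploy-dashboard")
print("old gate cyclic:", has_cycle(old_gate))
print("new gate cyclic:", has_cycle(AFTER))
```

The implicit edge never appears in any unit file, which is why the `After=` lines look identical before and after the fix: the cycle lived in the health probe, not in systemd ordering.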
#### Check 4 — Alert dir empty ✅

`/var/lib/ci-warm/alerts/` is empty — Builder cleared the stale 05:44Z alert (valid false-alarm from
the old gate hitting the deadlock this morning).

#### Check 5 — proxy.nix comment ✅

Comment updated: "health-gate (traefik.ci.commoninternet.net/api/version returns 200 — traefik's own
API, no backend dep)". No functional change to the nix module (same systemd unit).

#### Check 6 — Gate has teeth ✅ (with one documentation note)

**Functional PASS:** `health_code()` line 276 returns `int(r.stdout.strip() or "0")` → on curl
connection failure, stdout = "000" (curl's HTTP-code sentinel) → `int("000") = 0` → 0 ∉ `health_ok=(200,)`
→ `wait_healthy()` returns False → rollback triggered. Gate genuinely fails on a broken traefik.

**Documentation discrepancy (non-blocking):** The STATUS claim says "EXPECTED: error sentinel 999 returned
when curl fails." The actual code returns 0 (not 999) on curl failure. `grep` for "999" returns no matches.
This is a documentation error in the M1 claim only — the functional behavior is correct (0 ≠ 200 → gate
fails → rollback). No code defect; no blocking finding.

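The chain in Check 6 can be traced with a small sketch. Only the expression `int(r.stdout.strip() or "0")` and the tuple `(200,)` are taken from the review; `parse_http_code` and this version of `wait_healthy` (its signature, polling interval, and injectable probe) are assumptions made for illustration and may differ from the repo's implementation.

```python
import time


def parse_http_code(stdout: str) -> int:
    """Mirror of the quoted expression: empty output and curl's "000" both become 0."""
    return int(stdout.strip() or "0")


def wait_healthy(probe, ok=(200,), timeout=300.0, interval=5.0) -> bool:
    """Poll `probe()` until it returns an accepted code or the timeout elapses.

    Hypothetical shape: `probe` is any zero-argument callable returning an int.
    """
    deadline = time.monotonic() + timeout
    while True:
        if probe() in ok:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# curl prints "000" for the HTTP code when it cannot connect at all.
print(parse_http_code("000"))
print(parse_http_code(""))
print(parse_http_code("200\n"))

# A probe that always reports connection failure never satisfies the gate.
broken = wait_healthy(lambda: parse_http_code("000"), timeout=0)
print("broken traefik healthy:", broken)
```

Because 0 is simply absent from the accepted tuple, the gate rejects it the same way it would reject 999 or any other non-200 value, which is why the sentinel mismatch in STATUS is a documentation issue rather than a functional one.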
#### Check 7 — DEFERRED + DECISIONS updated ✅

- `machine-docs/DEFERRED.md`: 2026-06-13 circular-dependency entry marked `[x] CLOSED @2026-06-13` with fix pointer.
- `machine-docs/DECISIONS.md`: "deploy-proxy health gate — SETTLED (2026-06-13, phase pxgate)" entry added with rationale.

---

**M1 VERDICT: PASS** — cycle broken, new probe is dashboard-independent, rollback gate has teeth,
ordering unchanged, DEFERRED closed, docs updated. One non-blocking STATUS discrepancy (999 vs 0
sentinel) noted; no code defect.

---
