diff --git a/machine-docs/REVIEW-pxgate.md b/machine-docs/REVIEW-pxgate.md
index c15661e..e30ae17 100644
--- a/machine-docs/REVIEW-pxgate.md
+++ b/machine-docs/REVIEW-pxgate.md
@@ -94,14 +94,83 @@ or by examining the reconciler code path (deploy_version raises → upgrade_ok=F
 
 ## M1 — Fix + controlled reproduction
 
-### PENDING — awaiting Builder implementation
+### PASS @2026-06-13T13:00Z — Adversary cold-verified
 
-Acceptance criteria I will independently verify:
-1. **Code change correct**: SPECS["traefik"] removes `health_domain` override and uses `/api/version`
-2. **New gate is meaningful**: a STOPPED/broken traefik must cause the probe to fail (return non-200)
-3. **Controlled reproduction**: with dashboard held back, old gate hangs/fails; new gate passes on traefik alone
-4. **No `After=deploy-proxy` consumer regressed**: drone, warm-keycloak, bridge, dashboard, backupbot, reports ordering still correct
-5. **PR merged or ready** to the cc-ci repo
+**Commit:** `0e9fd38` (`claim(pxgate-M1): change traefik health probe to /api/version`)
+
+#### Check 1 — Code change correct ✅
+
+`runner/warm_reconcile.py` SPECS["traefik"] (lines 120–129):
+```python
+"traefik": {
+    "recipe": "traefik",
+    "domain": "traefik.ci.commoninternet.net",
+    "health_path": "/api/version",  # ← changed from "/"
+    "health_ok": (200,),
+    "stateful": False,
+    "deploy_timeout": 600,
+    "health_timeout": 300,
+    "setup": _traefik_setup,
+},
+```
+`health_domain` key is **absent** → `health_code()` falls back to `spec["domain"]` =
+`"traefik.ci.commoninternet.net"`. Probe is now `https://traefik.ci.commoninternet.net/api/version`
+with `--resolve traefik.ci.commoninternet.net:443:127.0.0.1` — traefik's own API, no backend dep.
+
+#### Check 2 — Controlled reproduction ✅
+
+Scaled `ccci-dashboard_app` to 0 replicas (dashboard absent):
+- **New probe** (`/api/version` on traefik domain): HTTP **200** ← cycle broken
+- **Old probe** (`ci.commoninternet.net/`): HTTP **404** ← confirms old gate was deadlocked
+
+Dashboard restored to 1/1 and returns 200 after scale-up.
+
+#### Check 3 — Consumer ordering unchanged ✅
+
+All `After=deploy-proxy.service` consumers unchanged:
+```
+deploy-drone: After=deploy-proxy.service swarm-init.service docker.service network-online.target
+deploy-bridge: After=deploy-drone.service deploy-proxy.service ...
+deploy-dashboard: After=deploy-bridge.service deploy-proxy.service ...
+deploy-backupbot: After=deploy-dashboard.service deploy-proxy.service ...
+deploy-reports: After=deploy-dashboard.service deploy-proxy.service ...
+nightly-sweep: After=deploy-proxy.service warm-keycloak.service docker.service
+warm-keycloak: After=deploy-proxy.service ...
+```
+`deploy-proxy` itself: `After=swarm-init.service docker.service network-online.target` — no dashboard
+dependency in its own ordering (correct). Fix does not change any service ordering.
+
+#### Check 4 — Alert dir empty ✅
+
+`/var/lib/ci-warm/alerts/` is empty — Builder cleared the stale 05:44Z alert (valid false-alarm from
+the old gate hitting the deadlock this morning).
+
+#### Check 5 — proxy.nix comment ✅
+
+Comment updated: "health-gate (traefik.ci.commoninternet.net/api/version returns 200 — traefik's own
+API, no backend dep)". No functional change to the nix module (same systemd unit).
+
+#### Check 6 — Gate has teeth ✅ (with one documentation note)
+
+**Functional PASS:** `health_code()` line 276 returns `int(r.stdout.strip() or "0")` → on curl
+connection failure, stdout = "000" (curl's HTTP-code sentinel) → `int("000") = 0` → 0 ∉ `health_ok=(200,)`
+→ `wait_healthy()` returns False → rollback triggered. Gate genuinely fails on a broken traefik.
+
+**Documentation discrepancy (non-blocking):** The STATUS claim says "EXPECTED: error sentinel 999 returned
+when curl fails." The actual code returns 0 (not 999) on curl failure. `grep` for "999" returns no matches.
+This is a documentation error in the M1 claim only — the functional behavior is correct (0 ≠ 200 → gate
+fails → rollback). No code defect; no blocking finding.
+
+#### Check 7 — DEFERRED + DECISIONS updated ✅
+
+`machine-docs/DEFERRED.md`: 2026-06-13 circular-dependency entry marked `[x] CLOSED @2026-06-13` with fix pointer.
+`machine-docs/DECISIONS.md`: "deploy-proxy health gate — SETTLED (2026-06-13, phase pxgate)" entry added with rationale.
+
+---
+
+**M1 VERDICT: PASS** — cycle broken, new probe is dashboard-independent, rollback gate has teeth,
+ordering unchanged, DEFERRED closed, docs updated. One non-blocking STATUS discrepancy (999 vs 0
+sentinel) noted; no code defect.
 
 ---
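
Reviewer note on Check 6: the sentinel reasoning can be replayed in isolation. The sketch below is not the code in `runner/warm_reconcile.py`; it reuses only the expression `int(r.stdout.strip() or "0")` quoted in the review, and the helper names `parse_http_code` and `is_healthy` are hypothetical. It assumes the probe captures curl's `%{http_code}` write-out, which prints `000` when no HTTP response was received.

```python
def parse_http_code(curl_stdout: str) -> int:
    """Turn curl's %{http_code} output into an int (hypothetical helper).

    Empty output falls back to "0", mirroring the `or "0"` guard quoted
    in Check 6; "000" also becomes 0 because int() drops leading zeros.
    """
    return int(curl_stdout.strip() or "0")


def is_healthy(code: int, health_ok: tuple = (200,)) -> bool:
    """Return True only when the code is one of the accepted status codes."""
    return code in health_ok


if __name__ == "__main__":
    # "000" and "" model a refused connection; "404" models the old probe
    # with the dashboard scaled to zero; "200" models the new probe.
    for raw in ("200", "000", "", "404\n"):
        code = parse_http_code(raw)
        print(f"{raw!r}: code={code} healthy={is_healthy(code)}")
```

Under these assumptions a connection failure yields 0, which is not in `(200,)`, so the gate fails. A sentinel of 999 would fail the same membership test, which is why the 999 vs 0 discrepancy is documentation-only.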