Files
cc-ci/machine-docs/REVIEW-pxgate.md
autonomic-bot 67e13f3a1f
Some checks failed
continuous-integration/drone/push Build is failing
chore(pxgate): M2 blocked on orchestrator nixos-rebuild — old probe still live
Active nix store (km6173hm5a...) calls ls5d6s7q...-runner/warm_reconcile.py which
still has health_domain=ci.commoninternet.net (OLD probe). Fix 0e9fd38 in git but not
deployed. Waiting for: cd /root/builder-clone && git pull && nixos-rebuild switch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 13:03:36 +00:00

199 lines
8.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# REVIEW — phase pxgate
**Phase:** pxgate — break deploy-proxy ↔ dashboard health-gate circular dependency (D8 fix)
**Adversary:** autonomic-bot (Sonnet 4.6)
**Started:** 2026-06-13T12:41Z
---
## Adversary orientation (cold start — 2026-06-13T12:41Z)
Independent cold read of the root cause and fix spec. NOT a gate claim — recording what I found so
the M1 verdict below is COLD and reproducible.
### Root cause — INDEPENDENTLY CONFIRMED
Reading `nix/modules/proxy.nix` + `runner/warm_reconcile.py` + `nix/modules/dashboard.nix`:
1. `deploy-proxy.service` runs `warm_reconcile.py traefik`.
2. The traefik SPEC in `warm_reconcile.py:117-128` sets:
```python
"health_domain": "ci.commoninternet.net",
"health_path": "/",
```
So `health_code()` probes `https://ci.commoninternet.net/` — the dashboard.
3. `deploy-dashboard.service` (dashboard.nix:89) has:
```
After=deploy-bridge.service deploy-proxy.service ...
```
systemd will not start deploy-dashboard until deploy-proxy exits.
4. **Deadlock:** proxy waits for dashboard; dashboard waits for proxy.
### Root cause — PROVEN LIVE (not merely theoretical)
The alert file `/var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json`
confirms the deadlock hit TODAY at boot time:
```
deploy-proxy started: 05:38:21 UTC
→ probed ci.commoninternet.net (60s timeout): unhealthy
→ redeployed traefik
→ probed ci.commoninternet.net (300s timeout): still unhealthy
→ wrote alert "unhealthy-on-latest", exited 05:44:28 UTC (status=0, RemainAfterExit=true)
deploy-dashboard started: 05:44:46 UTC (AFTER proxy exited)
→ deployed dashboard successfully
→ ci.commoninternet.net now returns 200
```
traefik startDate = 2026-06-13T05:38:02Z (was already up before proxy reconciler started at
05:38:21) — so traefik itself was healthy; the probe was blocked on the dashboard.
### Verified fix endpoint
`curl -sk --resolve traefik.ci.commoninternet.net:443:127.0.0.1 https://traefik.ci.commoninternet.net/api/version`
→ `{"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"}` (200)
This endpoint is up the moment traefik is serving, has no backend dependency, requires no auth.
`/ping` → 404 (not configured in the current recipe — avoid).
### Required change (my independent read of the fix)
In `runner/warm_reconcile.py` SPECS["traefik"]:
- Remove `"health_domain": "ci.commoninternet.net"` — so `health_code()` falls back to `spec["domain"]` = `"traefik.ci.commoninternet.net"`
- Change `"health_path": "/"` → `"health_path": "/api/version"`
`health_code()` will then probe `https://traefik.ci.commoninternet.net/api/version` directly
(via `--resolve traefik.ci.commoninternet.net:443:127.0.0.1`), which returns 200 as soon as
traefik is up — no dashboard dependency.
### Pre-M1 break-it probes (before Builder's fix, 2026-06-13T12:50Z)
**P5 — Secret leak in alert files:** PASS. `/var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json`
contains only `{"app": "traefik", "reason": "unhealthy-on-latest", "ts": "...", "version": "5.1.1+v3.6.15"}`.
No credentials, no secrets.
**P3 — After=deploy-proxy consumers ordering:** PASS (no regression in current ordering):
- deploy-drone: After=deploy-proxy.service
- deploy-bridge: After=deploy-drone.service deploy-proxy.service
- deploy-dashboard: After=deploy-bridge.service deploy-proxy.service
- deploy-backupbot: After=deploy-dashboard.service deploy-proxy.service
- deploy-reports: After=deploy-dashboard.service deploy-proxy.service
- nightly-sweep: After=deploy-proxy.service warm-keycloak.service
- warm-keycloak: After=deploy-proxy.service
These all correctly depend on deploy-proxy; after the fix, proxy completes without
deadlock and the rest of the chain proceeds normally.
**Endpoint stability:** `/api/version` returns 200 reliably (3/3 probes). No backend dependency.
**P1-negative (traefik-down):** PENDING at M1 gate — requires a controlled stop of
traefik (risky on live system); will execute at M1 verification using a short pause
or by examining the reconciler code path (deploy_version raises → upgrade_ok=False → rollback).
---
## M1 — Fix + controlled reproduction
### PASS @2026-06-13T13:00Z — Adversary cold-verified
**Commit:** `0e9fd38` (`claim(pxgate-M1): change traefik health probe to /api/version`)
#### Check 1 — Code change correct ✅
`runner/warm_reconcile.py` SPECS["traefik"] (lines 120129):
```python
"traefik": {
"recipe": "traefik",
"domain": "traefik.ci.commoninternet.net",
"health_path": "/api/version", # ← changed from "/"
"health_ok": (200,),
"stateful": False,
"deploy_timeout": 600,
"health_timeout": 300,
"setup": _traefik_setup,
},
```
`health_domain` key is **absent** → `health_code()` falls back to `spec["domain"]` =
`"traefik.ci.commoninternet.net"`. Probe is now `https://traefik.ci.commoninternet.net/api/version`
with `--resolve traefik.ci.commoninternet.net:443:127.0.0.1` — traefik's own API, no backend dep.
#### Check 2 — Controlled reproduction ✅
Scaled `ccci-dashboard_app` to 0 replicas (dashboard absent):
- **New probe** (`/api/version` on traefik domain): HTTP **200** ← cycle broken
- **Old probe** (`ci.commoninternet.net/`): HTTP **404** ← confirms old gate was deadlocked
Dashboard restored to 1/1 and returns 200 after scale-up.
#### Check 3 — Consumer ordering unchanged ✅
All `After=deploy-proxy.service` consumers unchanged:
```
deploy-drone: After=deploy-proxy.service swarm-init.service docker.service network-online.target
deploy-bridge: After=deploy-drone.service deploy-proxy.service ...
deploy-dashboard: After=deploy-bridge.service deploy-proxy.service ...
deploy-backupbot: After=deploy-dashboard.service deploy-proxy.service ...
deploy-reports: After=deploy-dashboard.service deploy-proxy.service ...
nightly-sweep: After=deploy-proxy.service warm-keycloak.service docker.service
warm-keycloak: After=deploy-proxy.service ...
```
`deploy-proxy` itself: `After=swarm-init.service docker.service network-online.target` — no dashboard
dependency in its own ordering (correct). Fix does not change any service ordering.
#### Check 4 — Alert dir empty ✅
`/var/lib/ci-warm/alerts/` is empty — Builder cleared the stale 05:44Z alert (valid false-alarm from
the old gate hitting the deadlock this morning).
#### Check 5 — proxy.nix comment ✅
Comment updated: "health-gate (traefik.ci.commoninternet.net/api/version returns 200 — traefik's own
API, no backend dep)". No functional change to the nix module (same systemd unit).
#### Check 6 — Gate has teeth ✅ (with one documentation note)
**Functional PASS:** `health_code()` line 276 returns `int(r.stdout.strip() or "0")` → on curl
connection failure, stdout = "000" (curl's HTTP-code sentinel) → `int("000") = 0` → 0 ∉ `health_ok=(200,)`
→ `wait_healthy()` returns False → rollback triggered. Gate genuinely fails on a broken traefik.
**Documentation discrepancy (non-blocking):** The STATUS claim says "EXPECTED: error sentinel 999 returned
when curl fails." The actual code returns 0 (not 999) on curl failure. `grep` for "999" returns no matches.
This is a documentation error in the M1 claim only — the functional behavior is correct (0 ≠ 200 → gate
fails → rollback). No code defect; no blocking finding.
#### Check 7 — DEFERRED + DECISIONS updated ✅
`machine-docs/DEFERRED.md`: 2026-06-13 circular-dependency entry marked `[x] CLOSED @2026-06-13` with fix pointer.
`machine-docs/DECISIONS.md`: "deploy-proxy health gate — SETTLED (2026-06-13, phase pxgate)" entry added with rationale.
---
**M1 VERDICT: PASS** — cycle broken, new probe is dashboard-independent, rollback gate has teeth,
ordering unchanged, DEFERRED closed, docs updated. One non-blocking STATUS discrepancy (999 vs 0
sentinel) noted; no code defect.
---
## M2 — Proven on a real from-scratch boot
### PENDING — awaiting orchestrator nixos-rebuild (as of 2026-06-13T13:08Z)
M1 is PASS. The fix is in the repo (`0e9fd38`). The live cc-ci host still has the OLD probe:
- Active reconcile script: `/nix/store/km6173hm5a77wxggd7zba3mfakrz0c94-cc-ci-reconcile-proxy`
- Calls: `/nix/store/ls5d6s7q2892z0n0qv7sfk03zimwx3nd-runner/warm_reconcile.py`
- That file has: `"health_domain": "ci.commoninternet.net"`, `"health_path": "/"` — OLD probe still live
**Orchestrator action required:**
```bash
ssh cc-ci
cd /root/builder-clone
git pull # to get commit 0e9fd38
nixos-rebuild switch --flake "git+file:///root/builder-clone?submodules=1#cc-ci"
```
After nixos-rebuild, I will verify (per STATUS-pxgate.md M2 checks):
1. `deploy-proxy.service` shows `active (exited)` (not unhealthy alert)
2. New nix store path with `/api/version` in use
3. All services 1/1 unaffected
4. Cold-boot simulation: stop dashboard + restart proxy → proxy completes healthy without dashboard