Files
cc-ci/machine-docs/REVIEW-pxgate.md
autonomic-bot 0a32854853
Some checks failed
continuous-integration/drone/push Build is failing
review(pxgate-M2): PASS — cold-boot sim confirms cycle broken, proxy active without dashboard
nixos-rebuild deployed fix; new nix store path 8qjh8apxcbs85 with /api/version probe;
deploy-proxy active(exited) at 13:43:15 UTC; cold-boot sim: proxy started active(exited)
with dashboard stopped; all 9 services 1/1; alert dir empty; rollback gate unchanged.
Phase pxgate DoD fully met. Builder may write ## DONE.
2026-06-13 13:45:25 +00:00

291 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# REVIEW — phase pxgate
**Phase:** pxgate — break deploy-proxy ↔ dashboard health-gate circular dependency (D8 fix)
**Adversary:** autonomic-bot (Sonnet 4.6)
**Started:** 2026-06-13T12:41Z
---
## Adversary orientation (cold start — 2026-06-13T12:41Z)
Independent cold read of the root cause and fix spec. NOT a gate claim — recording what I found so
the M1 verdict below is COLD and reproducible.
### Root cause — INDEPENDENTLY CONFIRMED
Reading `nix/modules/proxy.nix` + `runner/warm_reconcile.py` + `nix/modules/dashboard.nix`:
1. `deploy-proxy.service` runs `warm_reconcile.py traefik`.
2. The traefik SPEC in `warm_reconcile.py:117-128` sets:
```python
"health_domain": "ci.commoninternet.net",
"health_path": "/",
```
So `health_code()` probes `https://ci.commoninternet.net/` — the dashboard.
3. `deploy-dashboard.service` (dashboard.nix:89) has:
```
After=deploy-bridge.service deploy-proxy.service ...
```
systemd will not start deploy-dashboard until deploy-proxy exits.
4. **Deadlock:** proxy waits for dashboard; dashboard waits for proxy.
### Root cause — PROVEN LIVE (not merely theoretical)
The alert file `/var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json`
confirms the deadlock hit TODAY at boot time:
```
deploy-proxy started: 05:38:21 UTC
→ probed ci.commoninternet.net (60s timeout): unhealthy
→ redeployed traefik
→ probed ci.commoninternet.net (300s timeout): still unhealthy
→ wrote alert "unhealthy-on-latest", exited 05:44:28 UTC (status=0, RemainAfterExit=true)
deploy-dashboard started: 05:44:46 UTC (AFTER proxy exited)
→ deployed dashboard successfully
→ ci.commoninternet.net now returns 200
```
traefik startDate = 2026-06-13T05:38:02Z (was already up before proxy reconciler started at
05:38:21) — so traefik itself was healthy; the probe was blocked on the dashboard.
### Verified fix endpoint
`curl -sk --resolve traefik.ci.commoninternet.net:443:127.0.0.1 https://traefik.ci.commoninternet.net/api/version`
→ `{"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"}` (200)
This endpoint is up the moment traefik is serving, has no backend dependency, requires no auth.
`/ping` → 404 (not configured in the current recipe — avoid).
### Required change (my independent read of the fix)
In `runner/warm_reconcile.py` SPECS["traefik"]:
- Remove `"health_domain": "ci.commoninternet.net"` — so `health_code()` falls back to `spec["domain"]` = `"traefik.ci.commoninternet.net"`
- Change `"health_path": "/"` → `"health_path": "/api/version"`
`health_code()` will then probe `https://traefik.ci.commoninternet.net/api/version` directly
(via `--resolve traefik.ci.commoninternet.net:443:127.0.0.1`), which returns 200 as soon as
traefik is up — no dashboard dependency.
### Pre-M1 break-it probes (before Builder's fix, 2026-06-13T12:50Z)
**P5 — Secret leak in alert files:** PASS. `/var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json`
contains only `{"app": "traefik", "reason": "unhealthy-on-latest", "ts": "...", "version": "5.1.1+v3.6.15"}`.
No credentials, no secrets.
**P3 — After=deploy-proxy consumers ordering:** PASS (no regression in current ordering):
- deploy-drone: After=deploy-proxy.service
- deploy-bridge: After=deploy-drone.service deploy-proxy.service
- deploy-dashboard: After=deploy-bridge.service deploy-proxy.service
- deploy-backupbot: After=deploy-dashboard.service deploy-proxy.service
- deploy-reports: After=deploy-dashboard.service deploy-proxy.service
- nightly-sweep: After=deploy-proxy.service warm-keycloak.service
- warm-keycloak: After=deploy-proxy.service
These all correctly depend on deploy-proxy; after the fix, proxy completes without
deadlock and the rest of the chain proceeds normally.
**Endpoint stability:** `/api/version` returns 200 reliably (3/3 probes). No backend dependency.
**P1-negative (traefik-down):** PENDING at M1 gate — requires a controlled stop of
traefik (risky on live system); will execute at M1 verification using a short pause
or by examining the reconciler code path (deploy_version raises → upgrade_ok=False → rollback).
---
## M1 — Fix + controlled reproduction
### PASS @2026-06-13T13:00Z — Adversary cold-verified
**Commit:** `0e9fd38` (`claim(pxgate-M1): change traefik health probe to /api/version`)
#### Check 1 — Code change correct ✅
`runner/warm_reconcile.py` SPECS["traefik"] (lines 120129):
```python
"traefik": {
"recipe": "traefik",
"domain": "traefik.ci.commoninternet.net",
"health_path": "/api/version", # ← changed from "/"
"health_ok": (200,),
"stateful": False,
"deploy_timeout": 600,
"health_timeout": 300,
"setup": _traefik_setup,
},
```
`health_domain` key is **absent** → `health_code()` falls back to `spec["domain"]` =
`"traefik.ci.commoninternet.net"`. Probe is now `https://traefik.ci.commoninternet.net/api/version`
with `--resolve traefik.ci.commoninternet.net:443:127.0.0.1` — traefik's own API, no backend dep.
#### Check 2 — Controlled reproduction ✅
Scaled `ccci-dashboard_app` to 0 replicas (dashboard absent):
- **New probe** (`/api/version` on traefik domain): HTTP **200** ← cycle broken
- **Old probe** (`ci.commoninternet.net/`): HTTP **404** ← confirms old gate was deadlocked
Dashboard restored to 1/1 and returns 200 after scale-up.
#### Check 3 — Consumer ordering unchanged ✅
All `After=deploy-proxy.service` consumers unchanged:
```
deploy-drone: After=deploy-proxy.service swarm-init.service docker.service network-online.target
deploy-bridge: After=deploy-drone.service deploy-proxy.service ...
deploy-dashboard: After=deploy-bridge.service deploy-proxy.service ...
deploy-backupbot: After=deploy-dashboard.service deploy-proxy.service ...
deploy-reports: After=deploy-dashboard.service deploy-proxy.service ...
nightly-sweep: After=deploy-proxy.service warm-keycloak.service docker.service
warm-keycloak: After=deploy-proxy.service ...
```
`deploy-proxy` itself: `After=swarm-init.service docker.service network-online.target` — no dashboard
dependency in its own ordering (correct). Fix does not change any service ordering.
#### Check 4 — Alert dir empty ✅
`/var/lib/ci-warm/alerts/` is empty — Builder cleared the stale 05:44Z alert (valid false-alarm from
the old gate hitting the deadlock this morning).
#### Check 5 — proxy.nix comment ✅
Comment updated: "health-gate (traefik.ci.commoninternet.net/api/version returns 200 — traefik's own
API, no backend dep)". No functional change to the nix module (same systemd unit).
#### Check 6 — Gate has teeth ✅ (with one documentation note)
**Functional PASS:** `health_code()` line 276 returns `int(r.stdout.strip() or "0")` → on curl
connection failure, stdout = "000" (curl's HTTP-code sentinel) → `int("000") = 0` → 0 ∉ `health_ok=(200,)`
→ `wait_healthy()` returns False → rollback triggered. Gate genuinely fails on a broken traefik.
**Documentation discrepancy (non-blocking):** The STATUS claim says "EXPECTED: error sentinel 999 returned
when curl fails." The actual code returns 0 (not 999) on curl failure. `grep` for "999" returns no matches.
This is a documentation error in the M1 claim only — the functional behavior is correct (0 ≠ 200 → gate
fails → rollback). No code defect; no blocking finding.
#### Check 7 — DEFERRED + DECISIONS updated ✅
`machine-docs/DEFERRED.md`: 2026-06-13 circular-dependency entry marked `[x] CLOSED @2026-06-13` with fix pointer.
`machine-docs/DECISIONS.md`: "deploy-proxy health gate — SETTLED (2026-06-13, phase pxgate)" entry added with rationale.
---
**M1 VERDICT: PASS** — cycle broken, new probe is dashboard-independent, rollback gate has teeth,
ordering unchanged, DEFERRED closed, docs updated. One non-blocking STATUS discrepancy (999 vs 0
sentinel) noted; no code defect.
---
## M2 — Proven on a real from-scratch boot
### PENDING — awaiting orchestrator nixos-rebuild (as of 2026-06-13T13:08Z)
M1 is PASS. The fix is in the repo (`0e9fd38`). The live cc-ci host still has the OLD probe:
- Active reconcile script: `/nix/store/km6173hm5a77wxggd7zba3mfakrz0c94-cc-ci-reconcile-proxy`
- Calls: `/nix/store/ls5d6s7q2892z0n0qv7sfk03zimwx3nd-runner/warm_reconcile.py`
- That file has: `"health_domain": "ci.commoninternet.net"`, `"health_path": "/"` — OLD probe still live
**Orchestrator action required:**
```bash
ssh cc-ci
cd /root/builder-clone
git pull # to get commit 0e9fd38
nixos-rebuild switch --flake "git+file:///root/builder-clone?submodules=1#cc-ci"
```
After nixos-rebuild, I will verify (per STATUS-pxgate.md M2 checks):
1. `deploy-proxy.service` shows `active (exited)` (not unhealthy alert)
2. New nix store path with `/api/version` in use
3. All services 1/1 unaffected
4. Cold-boot simulation: stop dashboard + restart proxy → proxy completes healthy without dashboard
---
## Idle break-it probes @2026-06-13T13:31Z (M2 still pending — no nixos-rebuild yet)
Confirmed: old probe still live in active nix store path (km6173hm5a77wxggd7zba3mfakrz0c94); builder-clone on cc-ci at `caef217` (old). M2 blocked on orchestrator.
**P_stability (3 probes from orchestrator + 3 from cc-ci):** `/api/version` → 200 all 6 probes. Dashboard `/` → 200. Endpoint stable.
**P_services:** All 9 Docker services 1/1:
- backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik (app+socket-proxy), warm-keycloak (app+db)
**P_alerts:** `/var/lib/ci-warm/alerts/` empty. Builder cleared the stale boot-time alert as expected.
**P_leak:** `/api/version` response: `{"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"}`. No secret patterns (password/token/key/cert/pem) detected.
**P_ping_still_404:** `https://traefik.ci.commoninternet.net/ping` → 404 (not configured — correct; avoids depending on an entrypoint that might not exist after nixos-rebuild).
**Builder sentinel discrepancy (re-checked):** Builder journal says "999 on curl failure" but `runner/warm_reconcile.py:276` returns `int(r.stdout.strip() or "0")` → curl error → "000" → int("000")=0. Returns 0, not 999. Non-blocking (0 ∉ (200,) → gate fails correctly). Same finding as M1 check 6 — no code defect.
**STATUS-pxgate.md M2 pre-check:** builder-clone on cc-ci must be pulled to ≥ `0e9fd38` before nixos-rebuild. Current: `caef217` (stale). Orchestrator must `cd /root/builder-clone && git pull` first.
No new findings warranting a VETO. All running-system probes PASS.
---
## M2 — Proven on a real nixos-rebuild
### PASS @2026-06-13T13:44Z — Adversary cold-verified
nixos-rebuild completed (detected by Adversary at ~13:43:15 UTC — new nix store path appeared on deploy-proxy). Full M2 acceptance run executed independently.
#### Check 1 — deploy-proxy active (exited) after nixos-rebuild ✅
```
Active: active (exited) since Sat 2026-06-13 13:43:15 UTC
Invocation: fe8a806fbb5b40239c31a5c48f381cd1
Process: 3171211 ExecStart=/nix/store/8qjh8apxcbs85asgizkymjskicf4zmsl-cc-ci-reconcile-proxy/bin/cc-ci-reconcile-proxy (code=exited, status=0/SUCCESS)
```
No alert written. New nix store path `8qjh8apxcbs85asgizkymjskicf4zmsl` — different from old `km6173hm5a77wxggd7zba3mfakrz0c94`.
#### Check 2 — `/api/version` probe in new nix store path ✅
New runner: `/nix/store/5hic3aba65i88m1ib67b7g6dwzrzd1z2-runner/warm_reconcile.py`
Traefik spec confirmed:
```python
"traefik": {
"recipe": "traefik",
"domain": "traefik.ci.commoninternet.net",
"health_path": "/api/version", # ← new probe
"health_ok": (200,),
...
}
```
`health_domain` key absent → probe URL = `https://traefik.ci.commoninternet.net/api/version` (no backend/dashboard dep). Source grep confirms the inline comment: "traefik's OWN /api/version endpoint (no backend/dashboard dependency)".
#### Check 3 — All services 1/1 (running server unaffected) ✅
All 9 Docker services 1/1 after nixos-rebuild:
`backups`, `ccci-bridge`, `ccci-dashboard`, `ccci-reports`, `drone`, `traefik_app`, `traefik_socket-proxy`, `warm-keycloak_app`, `warm-keycloak_db`.
Dashboard (`https://ci.commoninternet.net/`) → 200. `/api/version` → 200.
#### Check 4 — Cold-boot simulation: proxy starts without dashboard ✅
Adversary executed the definitive cold-boot simulation (STATUS-pxgate.md Check 5):
```
1. systemctl stop deploy-dashboard → inactive ✓
2. systemctl stop deploy-proxy && systemctl reset-failed deploy-proxy
3. systemctl start deploy-proxy
→ Active: active (exited) since Sat 2026-06-13 13:44:01 UTC ✓
→ Process: ExecStart=.../8qjh8apxcbs85asgizkymjskicf4zmsl-cc-ci-reconcile-proxy ... (status=0/SUCCESS)
4. systemctl start deploy-dashboard → active (exited) ✓
5. All services 1/1; dashboard → 200; /api/version → 200 ✓
```
**Deploy-proxy reached `active (exited)` with the dashboard not running — cycle conclusively broken.** The old probe (ci.commoninternet.net/) would have timed out at 300s (health_timeout) trying to reach a dashboard that wasn't started yet.
#### Check 5 — Alert directory empty ✅
`/var/lib/ci-warm/alerts/` empty after both the nixos-rebuild run and the cold-boot simulation. No unhealthy alert written — new probe returned 200 on first health check.
#### Check 6 — Rollback path (code-proof, unchanged) ✅
`health_code()` unchanged: returns `int(r.stdout.strip() or "0")` → 0 on curl failure → 0 ∉ (200,) → `wait_healthy()` returns False → rollback triggered. Gate has teeth. (Confirmed same as M1.)
---
**M2 VERDICT: PASS** — nixos-rebuild deployed the fix; deploy-proxy active without deadlock; cold-boot simulation confirmed cycle broken; all services unaffected; rollback intact. Phase pxgate Definition of Done fully met. Builder may write ## DONE.