# REVIEW — phase pxgate

**Phase:** pxgate — break deploy-proxy ↔ dashboard health-gate circular dependency (D8 fix)
**Adversary:** autonomic-bot (Sonnet 4.6)
**Started:** 2026-06-13T12:41Z

---

## Adversary orientation (cold start — 2026-06-13T12:41Z)

Independent cold read of the root cause and fix spec. NOT a gate claim; this records what I found so
the M1 verdict below is COLD and reproducible.
### Root cause — INDEPENDENTLY CONFIRMED

Reading `nix/modules/proxy.nix` + `runner/warm_reconcile.py` + `nix/modules/dashboard.nix`:

1. `deploy-proxy.service` runs `warm_reconcile.py traefik`.
2. The traefik SPEC in `warm_reconcile.py:117-128` sets:
   ```python
   "health_domain": "ci.commoninternet.net",
   "health_path": "/",
   ```
   So `health_code()` probes `https://ci.commoninternet.net/`, which is the dashboard.
3. `deploy-dashboard.service` (dashboard.nix:89) has:
   ```
   After=deploy-bridge.service deploy-proxy.service ...
   ```
   systemd will not start deploy-dashboard until deploy-proxy exits.
4. **Deadlock:** proxy's health gate waits for the dashboard; the dashboard waits for proxy to exit. The cycle only breaks when the probe times out.
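
The cycle described in the steps above can be made explicit by treating both kinds of waiting (the health probe and the `After=` ordering) as edges in one graph. This is an illustrative sketch only: `WAITS_FOR`, `FIXED`, and `find_cycle` are hypothetical names, not code from the repo, and the graph is reduced to the two units involved.

```python
from typing import Dict, List, Optional

# Hypothetical model: unit -> units it is waiting on.
WAITS_FOR: Dict[str, List[str]] = {
    "deploy-proxy": ["deploy-dashboard"],   # health gate probes the dashboard URL
    "deploy-dashboard": ["deploy-proxy"],   # After=deploy-proxy.service
}

# Same model once the gate probes traefik itself (assumed post-fix state).
FIXED: Dict[str, List[str]] = {
    "deploy-proxy": [],
    "deploy-dashboard": ["deploy-proxy"],
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one waits-for cycle as a list of nodes, or None if acyclic."""
    path: List[str] = []   # nodes on the current depth-first path
    done = set()           # nodes fully explored without finding a cycle

    def visit(node: str) -> Optional[List[str]]:
        if node in done:
            return None
        if node in path:
            # Back edge: slice out the loop and close it.
            return path[path.index(node):] + [node]
        path.append(node)
        for dep in graph.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


print(find_cycle(WAITS_FOR))  # ['deploy-proxy', 'deploy-dashboard', 'deploy-proxy']
print(find_cycle(FIXED))      # None
```

The second call shows why only the probe target needs to change: the `After=` edge stays, and the graph is acyclic once deploy-proxy no longer waits on the dashboard.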
### Root cause — PROVEN LIVE (not merely theoretical)

The alert file `/var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json`
confirms the deadlock hit TODAY at boot time:

```
deploy-proxy started: 05:38:21 UTC
  → probed ci.commoninternet.net (60s timeout): unhealthy
  → redeployed traefik
  → probed ci.commoninternet.net (300s timeout): still unhealthy
  → wrote alert "unhealthy-on-latest", exited 05:44:28 UTC (status=0, RemainAfterExit=true)
deploy-dashboard started: 05:44:46 UTC  (AFTER proxy exited)
  → deployed dashboard successfully
  → ci.commoninternet.net now returns 200
```

traefik startDate = 2026-06-13T05:38:02Z (already up before the proxy reconciler started at
05:38:21), so traefik itself was healthy; the probe was blocked on the dashboard.
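
As a quick arithmetic check on the timeline, the reconciler's wall time can be compared with the two probe timeouts. This is a sketch under stated assumptions: all timestamps are on the same day, both timeouts ran to completion, and the traefik `startDate` is truncated to whole seconds. `seconds_between` is a hypothetical helper.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def seconds_between(start: str, end: str) -> int:
    """Whole seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


proxy_run = seconds_between("05:38:21", "05:44:28")       # reconciler wall time
probe_budget = 60 + 300                                   # both probe timeouts
dashboard_lag = seconds_between("05:44:28", "05:44:46")   # proxy exit to dashboard start
traefik_lead = seconds_between("05:38:02", "05:38:21")    # traefik up before reconciler

print(proxy_run, proxy_run - probe_budget, dashboard_lag, traefik_lead)
# 367 7 18 19
```

Of the 367 s run, 360 s is accounted for by the probe timeouts, leaving about 7 s for the redeploy and bookkeeping. That matches a reconciler that spent nearly its entire run waiting on a URL that could not come up until it exited.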
### Verified fix endpoint

`curl -sk --resolve traefik.ci.commoninternet.net:443:127.0.0.1 https://traefik.ci.commoninternet.net/api/version`
→ `{"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"}` (200)

This endpoint is up the moment traefik is serving, has no backend dependency, and requires no auth.

`/ping` → 404 (not configured in the current recipe; avoid).
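
For repeat probing, the same curl invocation can be wrapped so that it reports only the status code. This is a minimal sketch, not the reconciler's actual probe: `build_probe_cmd` and `probe_status` are hypothetical names, and the `--max-time`, `-o`, and `-w` flags are additions to the command shown above.

```python
import subprocess
from typing import List


def build_probe_cmd(domain: str, path: str, ip: str = "127.0.0.1") -> List[str]:
    """curl argv that pins `domain` to `ip` and prints only the HTTP status."""
    return [
        "curl", "-sk",
        "--max-time", "10",                  # assumed bound; not from the review
        "--resolve", f"{domain}:443:{ip}",   # bypass DNS, hit the local traefik
        "-o", "/dev/null",                   # discard the response body
        "-w", "%{http_code}",                # write just the status code
        f"https://{domain}{path}",
    ]


def probe_status(domain: str, path: str) -> int:
    """Run the probe; curl prints 000 when it cannot connect, which maps to 0."""
    result = subprocess.run(build_probe_cmd(domain, path),
                            capture_output=True, text=True, check=False)
    out = result.stdout.strip()
    return int(out) if out.isdigit() else 0


print(" ".join(build_probe_cmd("traefik.ci.commoninternet.net", "/api/version")))
```

Calling `probe_status("traefik.ci.commoninternet.net", "/api/version")` on the host would be expected to return 200 while traefik is serving, and `probe_status(..., "/ping")` to return 404 under the current recipe.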
### Required change (my independent read of the fix)

In `runner/warm_reconcile.py` SPECS["traefik"]:
- Remove `"health_domain": "ci.commoninternet.net"`, so `health_code()` falls back to `spec["domain"]` = `"traefik.ci.commoninternet.net"`
- Change `"health_path": "/"` → `"health_path": "/api/version"`

`health_code()` will then probe `https://traefik.ci.commoninternet.net/api/version` directly
(via `--resolve traefik.ci.commoninternet.net:443:127.0.0.1`), which returns 200 as soon as
traefik is up, with no dashboard dependency.
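
The effect of the two edits on the probed URL can be sketched as follows. The real `health_code()` is not reproduced in this review, so `health_target` is a hypothetical stand-in that implements only the fallback described above, and the spec dicts carry only the keys relevant here.

```python
from typing import Dict, Tuple


def health_target(spec: Dict[str, str]) -> Tuple[str, str]:
    """(host, url) the gate would probe; falls back to spec['domain']."""
    host = spec.get("health_domain", spec["domain"])
    path = spec.get("health_path", "/")
    return host, f"https://{host}{path}"


# Reduced SPECS["traefik"] before and after the proposed change.
before = {
    "domain": "traefik.ci.commoninternet.net",
    "health_domain": "ci.commoninternet.net",
    "health_path": "/",
}
after = {
    "domain": "traefik.ci.commoninternet.net",
    "health_path": "/api/version",
}

print(health_target(before)[1])  # https://ci.commoninternet.net/
print(health_target(after)[1])   # https://traefik.ci.commoninternet.net/api/version
```

Removing the override rather than pointing it at the traefik host keeps the spec to a single source of truth for the domain, which is the reading taken in this review.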
### Pre-M1 break-it probes (before Builder's fix, 2026-06-13T12:50Z)

**P5 — Secret leak in alert files:** PASS. `/var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json`
contains only `{"app": "traefik", "reason": "unhealthy-on-latest", "ts": "...", "version": "5.1.1+v3.6.15"}`.
No credentials, no secrets.

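
A P5-style check can be automated along these lines. This is a heuristic sketch, not the procedure actually used for the probe: `ALLOWED_KEYS` is taken from the four keys observed in the alert above, while `SUSPICIOUS` and `audit_alert` are hypothetical. The sample keeps the `ts` value elided exactly as quoted.

```python
import json
import re
from typing import List

# Keys observed in the one alert file inspected for P5.
ALLOWED_KEYS = {"app", "reason", "ts", "version"}
# Heuristic only: substrings that commonly indicate credentials.
SUSPICIOUS = re.compile(r"(pass(word)?|secret|token|api[_-]?key|authorization)", re.I)


def audit_alert(raw: str) -> List[str]:
    """Return findings for one alert JSON; an empty list means it looks clean."""
    alert = json.loads(raw)
    findings = [f"unexpected key: {k}" for k in sorted(set(alert) - ALLOWED_KEYS)]
    findings += [
        f"suspicious content in {k!r}"
        for k, v in alert.items()
        if SUSPICIOUS.search(k) or SUSPICIOUS.search(str(v))
    ]
    return findings


sample = ('{"app": "traefik", "reason": "unhealthy-on-latest", '
          '"ts": "...", "version": "5.1.1+v3.6.15"}')
leaky = '{"app": "traefik", "db_password": "hunter2"}'  # made-up counterexample

print(audit_alert(sample))  # []
print(audit_alert(leaky))
```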
**P3 — After=deploy-proxy consumers ordering:** PASS (no regression in current ordering):
- deploy-drone: After=deploy-proxy.service
- deploy-bridge: After=deploy-drone.service deploy-proxy.service
- deploy-dashboard: After=deploy-bridge.service deploy-proxy.service
- deploy-backupbot: After=deploy-dashboard.service deploy-proxy.service
- deploy-reports: After=deploy-dashboard.service deploy-proxy.service
- nightly-sweep: After=deploy-proxy.service warm-keycloak.service
- warm-keycloak: After=deploy-proxy.service

These all correctly depend on deploy-proxy; after the fix, proxy completes without
deadlock and the rest of the chain proceeds normally.

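
The P3 listing can be re-checked mechanically after the fix lands. The sketch below hard-codes the `After=` values from the list above (with the `.service` suffix dropped) rather than reading them from systemd; `AFTER` and `missing_proxy_ordering` are hypothetical names. `graphlib.TopologicalSorter` raises `CycleError` if the ordering itself ever becomes circular.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List

# unit -> units it is ordered after, as listed for P3.
AFTER: Dict[str, List[str]] = {
    "deploy-drone": ["deploy-proxy"],
    "deploy-bridge": ["deploy-drone", "deploy-proxy"],
    "deploy-dashboard": ["deploy-bridge", "deploy-proxy"],
    "deploy-backupbot": ["deploy-dashboard", "deploy-proxy"],
    "deploy-reports": ["deploy-dashboard", "deploy-proxy"],
    "nightly-sweep": ["deploy-proxy", "warm-keycloak"],
    "warm-keycloak": ["deploy-proxy"],
}


def missing_proxy_ordering(after: Dict[str, List[str]]) -> List[str]:
    """Units that do not order themselves after deploy-proxy."""
    return sorted(u for u, deps in after.items() if "deploy-proxy" not in deps)


# One valid start order; predecessors always come before their dependents.
order = list(TopologicalSorter(AFTER).static_order())

print(missing_proxy_ordering(AFTER))  # []
print(order[0])                       # deploy-proxy
```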
**Endpoint stability:** `/api/version` returns 200 reliably (3/3 probes). No backend dependency.

**P1-negative (traefik-down):** PENDING at M1 gate. It requires a controlled stop of
traefik (risky on a live system); will execute at M1 verification using a short pause
or by examining the reconciler code path (deploy_version raises → upgrade_ok=False → rollback).

---
## M1 — Fix + controlled reproduction

### PENDING — awaiting Builder implementation

Acceptance criteria I will independently verify:
1. **Code change correct**: SPECS["traefik"] removes `health_domain` override and uses `/api/version`
2. **New gate is meaningful**: a STOPPED/broken traefik must cause the probe to fail (return non-200)
3. **Controlled reproduction**: with dashboard held back, old gate hangs/fails; new gate passes on traefik alone
4. **No `After=deploy-proxy` consumer regressed**: drone, warm-keycloak, bridge, dashboard, backupbot, reports ordering still correct
5. **PR merged or ready** to the cc-ci repo
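
Criterion 2 can be rehearsed without stopping the live traefik by probing a port that has no listener. This is a sketch under stated assumptions, not the reconciler's code path: `http_status`, `gate`, and `closed_local_port` are hypothetical, plain HTTP on loopback stands in for the real HTTPS probe, and the "keep"/"rollback" labels only mirror the rollback behaviour described in this review.

```python
import socket
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 2.0) -> int:
    """HTTP status for `url`, or 0 when nothing answers (refused, timeout, DNS)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code   # a server answered, but not with a success status
    except (urllib.error.URLError, OSError):
        return 0          # no server answered at all


def gate(status: int) -> str:
    """Only an exact 200 passes; anything else (0, 404, 5xx) fails the gate."""
    return "keep" if status == 200 else "rollback"


def closed_local_port() -> int:
    """Pick a loopback port with no listener, standing in for a stopped traefik."""
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


dead = http_status(f"http://127.0.0.1:{closed_local_port()}/api/version")
print(dead, gate(dead), gate(404), gate(200))
```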

---

## M2 — Proven on a real from-scratch boot

### PENDING — awaiting Builder implementation + orchestrator cold-boot

Acceptance criteria I will independently verify:
1. **deploy-proxy reaches `active`** without the dashboard being pre-deployed
2. **Rollback path still works**: a deliberately broken traefik fails the gate and rolls back
3. **Running server unaffected**: all services still up after the fix deploys
4. **A1 / DEFERRED entry closed** with pointers