# REVIEW — phase pxgate

**Phase:** pxgate — break deploy-proxy ↔ dashboard health-gate circular dependency (D8 fix)
**Adversary:** autonomic-bot (Sonnet 4.6)
**Started:** 2026-06-13T12:41Z

---

## Adversary orientation (cold start — 2026-06-13T12:41Z)

Independent cold read of the root cause and fix spec. NOT a gate claim — recording what I found so
the M1 verdict below is COLD and reproducible.

### Root cause — INDEPENDENTLY CONFIRMED

Reading `nix/modules/proxy.nix` + `runner/warm_reconcile.py` + `nix/modules/dashboard.nix`:

1. `deploy-proxy.service` runs `warm_reconcile.py traefik`.
2. The traefik SPEC in `warm_reconcile.py:117-128` sets:
   ```python
   "health_domain": "ci.commoninternet.net",
   "health_path": "/",
   ```
   So `health_code()` probes `https://ci.commoninternet.net/` — the dashboard.
3. `deploy-dashboard.service` (dashboard.nix:89) has:
   ```
   After=deploy-bridge.service deploy-proxy.service ...
   ```
   systemd will not start deploy-dashboard until deploy-proxy exits.
4. **Deadlock:** proxy waits for dashboard; dashboard waits for proxy.

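The deadlock can be modeled as a cycle in a graph that combines systemd ordering with the reconciler's implicit health-probe dependency. The following is a minimal sketch, not code from the repo: `find_cycle` and the two edge dictionaries are hypothetical names, and only the two edges described above are modeled.

```python
from typing import Dict, List, Optional


def find_cycle(waits_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in waits_on}
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        stack.append(node)
        for dep in waits_on.get(node, []):
            state = color.get(dep, WHITE)
            if state == GREY:
                # Back edge: dep is already on the current path.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(waits_on):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# "X waits on Y" edges. The After= edge comes from dashboard.nix; the probe
# edge is the implicit one created by health_domain pointing at the dashboard.
old_gate = {
    "deploy-dashboard": ["deploy-proxy"],  # After=deploy-proxy.service
    "deploy-proxy": ["deploy-dashboard"],  # probe hits ci.commoninternet.net/
}
new_gate = {
    "deploy-dashboard": ["deploy-proxy"],  # ordering unchanged
    "deploy-proxy": [],                    # probe hits traefik's own API
}

print(find_cycle(old_gate))
print(find_cycle(new_gate))
```

The first call reports the cycle `deploy-dashboard → deploy-proxy → deploy-dashboard`; the second returns `None`. systemd's own cycle detection cannot see this loop, because the probe edge lives in the reconciler's SPEC rather than in any unit file.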
### Root cause — PROVEN LIVE (not merely theoretical)

The alert file `/var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json`
confirms the deadlock hit TODAY at boot time:

```
deploy-proxy started: 05:38:21 UTC
  → probed ci.commoninternet.net (60s timeout): unhealthy
  → redeployed traefik
  → probed ci.commoninternet.net (300s timeout): still unhealthy
  → wrote alert "unhealthy-on-latest", exited 05:44:28 UTC (status=0, RemainAfterExit=true)
deploy-dashboard started: 05:44:46 UTC (AFTER proxy exited)
  → deployed dashboard successfully
  → ci.commoninternet.net now returns 200
```

traefik startDate = 2026-06-13T05:38:02Z (was already up before proxy reconciler started at
05:38:21) — so traefik itself was healthy; the probe was blocked on the dashboard.

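As a sanity check on the timeline, the proxy unit's wall-clock run time can be compared against the two probe timeouts. This is plain arithmetic on the timestamps quoted above; attributing the remainder to the redeploy step and process overhead is my assumption, not something the journal states.

```python
from datetime import datetime

FMT = "%H:%M:%S"
started = datetime.strptime("05:38:21", FMT)   # deploy-proxy started
exited = datetime.strptime("05:44:28", FMT)    # deploy-proxy exited

elapsed = int((exited - started).total_seconds())
probe_budget = 60 + 300                        # first probe + post-redeploy probe
remainder = elapsed - probe_budget             # assumed: redeploy plus overhead

dashboard_started = datetime.strptime("05:44:46", FMT)
gap = int((dashboard_started - exited).total_seconds())

print(elapsed, probe_budget, remainder, gap)   # 367 360 7 18
```

A run time of 367 s against a 360 s probe budget is consistent with both probes running to their full timeouts, and the dashboard unit starting 18 s after the proxy exited matches the `After=` ordering.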
### Verified fix endpoint

`curl -sk --resolve traefik.ci.commoninternet.net:443:127.0.0.1 https://traefik.ci.commoninternet.net/api/version`
→ `{"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"}` (200)

This endpoint is up the moment traefik is serving, has no backend dependency, requires no auth.

`/ping` → 404 (not configured in the current recipe — avoid).

### Required change (my independent read of the fix)

In `runner/warm_reconcile.py` SPECS["traefik"]:
- Remove `"health_domain": "ci.commoninternet.net"` — so `health_code()` falls back to `spec["domain"]` = `"traefik.ci.commoninternet.net"`
- Change `"health_path": "/"` → `"health_path": "/api/version"`

`health_code()` will then probe `https://traefik.ci.commoninternet.net/api/version` directly
(via `--resolve traefik.ci.commoninternet.net:443:127.0.0.1`), which returns 200 as soon as
traefik is up — no dashboard dependency.

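The fallback described above can be sketched as follows. This is an illustration under assumptions, not the repo's `health_code()`: the helper names `probe_target` and `curl_argv` are hypothetical, and the curl flags other than `-sk` and `--resolve` (which appear in the verified command) are my guess at how the status code is captured.

```python
from typing import Dict, List, Tuple


def probe_target(spec: Dict[str, object]) -> Tuple[str, str]:
    """Return (host, url) for the health probe, falling back to spec["domain"]."""
    host = str(spec.get("health_domain") or spec["domain"])
    path = str(spec.get("health_path", "/"))
    return host, f"https://{host}{path}"


def curl_argv(spec: Dict[str, object]) -> List[str]:
    """Build a curl command that pins the host to the local traefik listener."""
    host, url = probe_target(spec)
    return [
        "curl", "-sk",
        "--resolve", f"{host}:443:127.0.0.1",
        "-o", "/dev/null",       # assumed: discard the response body
        "-w", "%{http_code}",    # assumed: print only the status code
        url,
    ]


old_spec = {"domain": "traefik.ci.commoninternet.net",
            "health_domain": "ci.commoninternet.net", "health_path": "/"}
new_spec = {"domain": "traefik.ci.commoninternet.net",
            "health_path": "/api/version"}

print(probe_target(old_spec)[1])  # https://ci.commoninternet.net/
print(probe_target(new_spec)[1])  # https://traefik.ci.commoninternet.net/api/version
```

With `health_domain` removed, both the URL host and the `--resolve` pin switch to the traefik domain, so the probe no longer touches any router backed by the dashboard.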
### Pre-M1 break-it probes (before Builder's fix, 2026-06-13T12:50Z)

**P5 — Secret leak in alert files:** PASS. `/var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json`
contains only `{"app": "traefik", "reason": "unhealthy-on-latest", "ts": "...", "version": "5.1.1+v3.6.15"}`.
No credentials, no secrets.

**P3 — After=deploy-proxy consumers ordering:** PASS (no regression in current ordering):
- deploy-drone: After=deploy-proxy.service
- deploy-bridge: After=deploy-drone.service deploy-proxy.service
- deploy-dashboard: After=deploy-bridge.service deploy-proxy.service
- deploy-backupbot: After=deploy-dashboard.service deploy-proxy.service
- deploy-reports: After=deploy-dashboard.service deploy-proxy.service
- nightly-sweep: After=deploy-proxy.service warm-keycloak.service
- warm-keycloak: After=deploy-proxy.service

These all correctly depend on deploy-proxy; after the fix, proxy completes without
deadlock and the rest of the chain proceeds normally.

**Endpoint stability:** `/api/version` returns 200 reliably (3/3 probes). No backend dependency.

**P1-negative (traefik-down):** PENDING at M1 gate — requires a controlled stop of
traefik (risky on live system); will execute at M1 verification using a short pause
or by examining the reconciler code path (deploy_version raises → upgrade_ok=False → rollback).

---

## M1 — Fix + controlled reproduction

### PASS @2026-06-13T13:00Z — Adversary cold-verified

**Commit:** `0e9fd38` (`claim(pxgate-M1): change traefik health probe to /api/version`)

#### Check 1 — Code change correct ✅

`runner/warm_reconcile.py` SPECS["traefik"] (lines 120–129):
```python
"traefik": {
    "recipe": "traefik",
    "domain": "traefik.ci.commoninternet.net",
    "health_path": "/api/version",   # ← changed from "/"
    "health_ok": (200,),
    "stateful": False,
    "deploy_timeout": 600,
    "health_timeout": 300,
    "setup": _traefik_setup,
},
```
`health_domain` key is **absent** → `health_code()` falls back to `spec["domain"]` =
`"traefik.ci.commoninternet.net"`. Probe is now `https://traefik.ci.commoninternet.net/api/version`
with `--resolve traefik.ci.commoninternet.net:443:127.0.0.1` — traefik's own API, no backend dep.

#### Check 2 — Controlled reproduction ✅

Scaled `ccci-dashboard_app` to 0 replicas (dashboard absent):
- **New probe** (`/api/version` on traefik domain): HTTP **200** ← cycle broken
- **Old probe** (`ci.commoninternet.net/`): HTTP **404** ← confirms old gate was deadlocked

Dashboard restored to 1/1 and returns 200 after scale-up.

#### Check 3 — Consumer ordering unchanged ✅

All `After=deploy-proxy.service` consumers unchanged:
```
deploy-drone: After=deploy-proxy.service swarm-init.service docker.service network-online.target
deploy-bridge: After=deploy-drone.service deploy-proxy.service ...
deploy-dashboard: After=deploy-bridge.service deploy-proxy.service ...
deploy-backupbot: After=deploy-dashboard.service deploy-proxy.service ...
deploy-reports: After=deploy-dashboard.service deploy-proxy.service ...
nightly-sweep: After=deploy-proxy.service warm-keycloak.service docker.service
warm-keycloak: After=deploy-proxy.service ...
```
`deploy-proxy` itself: `After=swarm-init.service docker.service network-online.target` — no dashboard
dependency in its own ordering (correct). Fix does not change any service ordering.

#### Check 4 — Alert dir empty ✅

`/var/lib/ci-warm/alerts/` is empty — Builder cleared the stale 05:44Z alert (valid false-alarm from
the old gate hitting the deadlock this morning).

#### Check 5 — proxy.nix comment ✅

Comment updated: "health-gate (traefik.ci.commoninternet.net/api/version returns 200 — traefik's own
API, no backend dep)". No functional change to the nix module (same systemd unit).

#### Check 6 — Gate has teeth ✅ (with one documentation note)

**Functional PASS:** `health_code()` line 276 returns `int(r.stdout.strip() or "0")` → on curl
connection failure, stdout = "000" (curl's HTTP-code sentinel) → `int("000") = 0` → 0 ∉ `health_ok=(200,)`
→ `wait_healthy()` returns False → rollback triggered. Gate genuinely fails on a broken traefik.

**Documentation discrepancy (non-blocking):** The STATUS claim says "EXPECTED: error sentinel 999 returned
when curl fails." The actual code returns 0 (not 999) on curl failure. `grep` for "999" returns no matches.
This is a documentation error in the M1 claim only — the functional behavior is correct (0 ≠ 200 → gate
fails → rollback). No code defect; no blocking finding.

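The sentinel behavior can be reproduced in isolation. The expression `int(stdout.strip() or "0")` is quoted from line 276; the wrapper names `parse_code` and `is_healthy` are hypothetical, introduced only for this sketch.

```python
from typing import Tuple


def parse_code(stdout: str) -> int:
    """Mirror the quoted expression: empty output maps to 0, "000" parses to 0."""
    return int(stdout.strip() or "0")


def is_healthy(stdout: str, health_ok: Tuple[int, ...] = (200,)) -> bool:
    """A probe passes only if the parsed status code is in health_ok."""
    return parse_code(stdout) in health_ok


for raw in ("200", "000", "", "404\n"):
    print(repr(raw), parse_code(raw), is_healthy(raw))
```

Only the `"200"` input passes. Both curl failure shapes (`"000"` and empty stdout) parse to 0, so no input path produces 999.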
#### Check 7 — DEFERRED + DECISIONS updated ✅

`machine-docs/DEFERRED.md`: 2026-06-13 circular-dependency entry marked `[x] CLOSED @2026-06-13` with fix pointer.
`machine-docs/DECISIONS.md`: "deploy-proxy health gate — SETTLED (2026-06-13, phase pxgate)" entry added with rationale.

---

**M1 VERDICT: PASS** — cycle broken, new probe is dashboard-independent, rollback gate has teeth,
ordering unchanged, DEFERRED closed, docs updated. One non-blocking STATUS discrepancy (999 vs 0
sentinel) noted; no code defect.

---

## M2 — Proven on a real from-scratch boot

### PENDING — awaiting orchestrator nixos-rebuild (as of 2026-06-13T13:08Z)

M1 is PASS. The fix is in the repo (`0e9fd38`). The live cc-ci host still has the OLD probe:
- Active reconcile script: `/nix/store/km6173hm5a77wxggd7zba3mfakrz0c94-cc-ci-reconcile-proxy`
- Calls: `/nix/store/ls5d6s7q2892z0n0qv7sfk03zimwx3nd-runner/warm_reconcile.py`
- That file has: `"health_domain": "ci.commoninternet.net"`, `"health_path": "/"` — OLD probe still live

**Orchestrator action required:**
```bash
ssh cc-ci
cd /root/builder-clone
git pull  # to get commit 0e9fd38
nixos-rebuild switch --flake "git+file:///root/builder-clone?submodules=1#cc-ci"
```

After nixos-rebuild, I will verify (per STATUS-pxgate.md M2 checks):
1. `deploy-proxy.service` shows `active (exited)` (not unhealthy alert)
2. New nix store path with `/api/version` in use
3. All services 1/1 unaffected
4. Cold-boot simulation: stop dashboard + restart proxy → proxy completes healthy without dashboard

---

## Idle break-it probes @2026-06-13T13:31Z (M2 still pending — no nixos-rebuild yet)

Confirmed: old probe still live in active nix store path (km6173hm5a77wxggd7zba3mfakrz0c94); builder-clone on cc-ci at `caef217` (old). M2 blocked on orchestrator.

**P_stability (3 probes from orchestrator + 3 from cc-ci):** `/api/version` → 200 all 6 probes. Dashboard `/` → 200. Endpoint stable.

**P_services:** All 9 Docker services 1/1:
- backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik (app+socket-proxy), warm-keycloak (app+db)

**P_alerts:** `/var/lib/ci-warm/alerts/` empty. Builder cleared the stale boot-time alert as expected.

**P_leak:** `/api/version` response: `{"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"}`. No secret patterns (password/token/key/cert/pem) detected.

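A leak check of this kind can be sketched as a case-insensitive scan over the keys and string values of the JSON body. The pattern list is the one named in P_leak; the function name `find_secret_hits` and the recursive walk are assumptions about how such a check might be written, not the tooling actually used for this probe.

```python
import json
import re
from typing import List

SECRET_RE = re.compile(r"password|token|key|cert|pem", re.IGNORECASE)


def find_secret_hits(value: object, path: str = "$") -> List[str]:
    """Return JSON paths whose key or string value matches a secret pattern."""
    hits: List[str] = []
    if isinstance(value, dict):
        for k, v in value.items():
            child = f"{path}.{k}"
            if SECRET_RE.search(k):
                hits.append(child)
            hits.extend(find_secret_hits(v, child))
    elif isinstance(value, list):
        for i, v in enumerate(value):
            hits.extend(find_secret_hits(v, f"{path}[{i}]"))
    elif isinstance(value, str) and SECRET_RE.search(value):
        hits.append(path)
    return hits


body = ('{"Version":"3.6.15","Codename":"ramequin",'
        '"startDate":"2026-06-13T05:38:02.987423426Z"}')
print(find_secret_hits(json.loads(body)))                # []
print(find_secret_hits({"auth": {"api_token": "abc"}}))  # ['$.auth.api_token']
```

Substring matching is deliberately broad, so it can flag harmless words that happen to contain `key` or `pem`; for a three-field response that trade-off costs nothing.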
**P_ping_still_404:** `https://traefik.ci.commoninternet.net/ping` → 404 (not configured — correct; avoids depending on an entrypoint that might not exist after nixos-rebuild).

**Builder sentinel discrepancy (re-checked):** Builder journal says "999 on curl failure" but `runner/warm_reconcile.py:276` returns `int(r.stdout.strip() or "0")` → curl error → "000" → int("000")=0. Returns 0, not 999. Non-blocking (0 ∉ (200,) → gate fails correctly). Same finding as M1 check 6 — no code defect.

**STATUS-pxgate.md M2 pre-check:** builder-clone on cc-ci must be pulled to ≥ `0e9fd38` before nixos-rebuild. Current: `caef217` (stale). Orchestrator must `cd /root/builder-clone && git pull` first.

No new findings warranting a VETO. All running-system probes PASS.

---

## M2 — Proven on a real nixos-rebuild

### PASS @2026-06-13T13:44Z — Adversary cold-verified

nixos-rebuild completed (detected by Adversary at ~13:43:15 UTC — new nix store path appeared on deploy-proxy). Full M2 acceptance run executed independently.

#### Check 1 — deploy-proxy active (exited) after nixos-rebuild ✅

```
Active: active (exited) since Sat 2026-06-13 13:43:15 UTC
Invocation: fe8a806fbb5b40239c31a5c48f381cd1
Process: 3171211 ExecStart=/nix/store/8qjh8apxcbs85asgizkymjskicf4zmsl-cc-ci-reconcile-proxy/bin/cc-ci-reconcile-proxy (code=exited, status=0/SUCCESS)
```

No alert written. New nix store path `8qjh8apxcbs85asgizkymjskicf4zmsl` — different from old `km6173hm5a77wxggd7zba3mfakrz0c94`.

#### Check 2 — `/api/version` probe in new nix store path ✅

New runner: `/nix/store/5hic3aba65i88m1ib67b7g6dwzrzd1z2-runner/warm_reconcile.py`

Traefik spec confirmed:
```python
"traefik": {
    "recipe": "traefik",
    "domain": "traefik.ci.commoninternet.net",
    "health_path": "/api/version",   # ← new probe
    "health_ok": (200,),
    ...
}
```
`health_domain` key absent → probe URL = `https://traefik.ci.commoninternet.net/api/version` (no backend/dashboard dep). Source grep confirms the inline comment: "traefik's OWN /api/version endpoint (no backend/dashboard dependency)".

#### Check 3 — All services 1/1 (running server unaffected) ✅

All 9 Docker services 1/1 after nixos-rebuild:
`backups`, `ccci-bridge`, `ccci-dashboard`, `ccci-reports`, `drone`, `traefik_app`, `traefik_socket-proxy`, `warm-keycloak_app`, `warm-keycloak_db`.

Dashboard (`https://ci.commoninternet.net/`) → 200. `/api/version` → 200.

#### Check 4 — Cold-boot simulation: proxy starts without dashboard ✅

Adversary executed the definitive cold-boot simulation (STATUS-pxgate.md Check 5):

```
1. systemctl stop deploy-dashboard → inactive ✓
2. systemctl stop deploy-proxy && systemctl reset-failed deploy-proxy
3. systemctl start deploy-proxy
   → Active: active (exited) since Sat 2026-06-13 13:44:01 UTC ✓
   → Process: ExecStart=.../8qjh8apxcbs85asgizkymjskicf4zmsl-cc-ci-reconcile-proxy ... (status=0/SUCCESS)
4. systemctl start deploy-dashboard → active (exited) ✓
5. All services 1/1; dashboard → 200; /api/version → 200 ✓
```

**Deploy-proxy reached `active (exited)` with the dashboard not running — cycle conclusively broken.** The old probe (ci.commoninternet.net/) would have timed out at 300s (health_timeout) trying to reach a dashboard that wasn't started yet.

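The difference between the two probes can be illustrated with a generic poll-until-deadline loop driven by a fake clock. This is a hypothetical stand-in, not the repo's `wait_healthy()`: the signature, the 5 s poll interval, and the `FakeClock` helper are all assumptions made for the sketch. Only the 300 s timeout, the `(200,)` success set, and the 404/200 probe results come from this review.

```python
import time
from typing import Callable, Tuple


def wait_healthy(probe: Callable[[], int],
                 health_ok: Tuple[int, ...] = (200,),
                 timeout: float = 300.0,
                 interval: float = 5.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll probe() until it returns an accepted code or the deadline passes."""
    deadline = clock() + timeout
    while True:
        if probe() in health_ok:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


class FakeClock:
    """Deterministic clock so the sketch runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


old = FakeClock()  # old probe: dashboard absent, router answers 404
print(wait_healthy(lambda: 404, clock=old.time, sleep=old.sleep), old.now)

new = FakeClock()  # new probe: traefik's own API answers 200 at once
print(wait_healthy(lambda: 200, clock=new.time, sleep=new.sleep), new.now)
```

Under these assumptions the old probe fails only after the full 300 simulated seconds, while the new probe succeeds on the first attempt with no time elapsed.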
#### Check 5 — Alert directory empty ✅

`/var/lib/ci-warm/alerts/` empty after both the nixos-rebuild run and the cold-boot simulation. No unhealthy alert written — new probe returned 200 on first health check.

#### Check 6 — Rollback path (code-proof, unchanged) ✅

`health_code()` unchanged: returns `int(r.stdout.strip() or "0")` → 0 on curl failure → 0 ∉ (200,) → `wait_healthy()` returns False → rollback triggered. Gate has teeth. (Confirmed same as M1.)

---

**M2 VERDICT: PASS** — nixos-rebuild deployed the fix; deploy-proxy active without deadlock; cold-boot simulation confirmed cycle broken; all services unaffected; rollback intact. Phase pxgate Definition of Done fully met. Builder may write ## DONE.