chore(pxgate): init Adversary phase files — root cause cold-verified, M1/M2 PENDING
Some checks failed
continuous-integration/drone/push Build is failing
Independent cold read confirms the circular dependency: the proxy health gate polls ci.commoninternet.net, which is served by the dashboard, which is itself ordered After=deploy-proxy.

Root cause is PROVEN LIVE by today's alert: 20260613T054428Z-traefik-unhealthy-on-latest.json.

Fix endpoint independently verified: /api/version on traefik.ci.commoninternet.net returns 200 as soon as traefik is up, with no dashboard dependency.

REVIEW-pxgate.md: orientation, M1/M2 acceptance criteria.
BACKLOG-pxgate.md: break-it probes P1–P5 to run at M1 gate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
22
machine-docs/BACKLOG-pxgate.md
Normal file
@@ -0,0 +1,22 @@
# BACKLOG — phase pxgate

## Build backlog
(Builder-owned — Adversary reads only)

## Adversary findings

No findings yet. Recording break-it probes to run once the fix lands.

### Break-it probes to execute at M1 gate

- [ ] **P1-neg (traefik-down gate fails):** Stop traefik service; verify `health_code` returns non-200
  and the reconciler would roll back. (Prove the new gate has teeth — not always-pass.)
- [ ] **P2-controlled-repro:** Simulate dashboard-absent scenario: with dashboard held back (or stopped),
  run the NEW reconciler → verify it completes healthy (no deadlock). Run the OLD reconciler with
  dashboard held back → verify it hangs/fails (confirm the fix actually breaks the cycle).
- [ ] **P3-ordering:** Confirm `After=deploy-proxy` consumers (drone, warm-keycloak, bridge, dashboard,
  backupbot, reports-nightly) still order correctly. Check `systemctl cat <service>` for each.
- [ ] **P4-alert-cleared:** Verify the 20260613T054428Z unhealthy-on-latest alert is addressed (either
  the Builder explicitly handles it, or the fix makes the next reconcile cycle healthy).
- [ ] **P5-secret-leak:** grep `/var/lib/ci-warm/alerts/` for any secret values (keys, passwords).
  The alert file must contain only version strings, no credentials.

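One possible way to script the P5 check is sketched below. It walks each alert JSON file and reports keys whose names look credential-like. The key-name pattern and both function names are hypothetical, since the alert schema is not shown here, and a key-name scan only complements the value grep that P5 asks for.

```python
import json
import re
from pathlib import Path

# Hypothetical key-name patterns; the real alert schema is not shown in this document.
SUSPICIOUS_KEY = re.compile(
    r"(secret|password|passwd|token|api[_-]?key|private[_-]?key)", re.IGNORECASE
)


def suspicious_keys(node, path=""):
    """Return the paths of all keys whose names look credential-like."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}.{key}" if path else str(key)
            if SUSPICIOUS_KEY.search(str(key)):
                hits.append(here)
            hits.extend(suspicious_keys(value, here))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            hits.extend(suspicious_keys(item, f"{path}[{i}]"))
    return hits


def scan_alert_dir(alert_dir):
    """Map each alert file name to the suspicious key paths found in it."""
    findings = {}
    for alert in sorted(Path(alert_dir).glob("*.json")):
        hits = suspicious_keys(json.loads(alert.read_text()))
        if hits:
            findings[alert.name] = hits
    return findings
```

Calling `scan_alert_dir("/var/lib/ci-warm/alerts/")` on the host would return an empty dict when no key name matches the pattern.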
93
machine-docs/REVIEW-pxgate.md
Normal file
@@ -0,0 +1,93 @@
# REVIEW — phase pxgate

**Phase:** pxgate — break deploy-proxy ↔ dashboard health-gate circular dependency (D8 fix)
**Adversary:** autonomic-bot (Sonnet 4.6)
**Started:** 2026-06-13T12:41Z

---

## Adversary orientation (cold start — 2026-06-13T12:41Z)

Independent cold read of the root cause and fix spec. NOT a gate claim — recording what I found so
the M1 verdict below is COLD and reproducible.

### Root cause — INDEPENDENTLY CONFIRMED

Reading `nix/modules/proxy.nix` + `runner/warm_reconcile.py` + `nix/modules/dashboard.nix`:

1. `deploy-proxy.service` runs `warm_reconcile.py traefik`.
2. The traefik SPEC in `warm_reconcile.py:117-128` sets:
   ```python
   "health_domain": "ci.commoninternet.net",
   "health_path": "/",
   ```
   So `health_code()` probes `https://ci.commoninternet.net/` — the dashboard.
3. `deploy-dashboard.service` (dashboard.nix:89) has:
   ```
   After=deploy-bridge.service deploy-proxy.service ...
   ```
   systemd will not start deploy-dashboard until deploy-proxy exits.
4. **Deadlock:** proxy waits for dashboard; dashboard waits for proxy.

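The four steps above can be modelled as a small wait-for graph. The sketch below is illustrative only: the node names (`dashboard-http-200`, `traefik-api-version`) and the `find_cycle` helper are hypothetical labels for the facts listed above, not identifiers from the repository.

```python
# Edges read "X cannot proceed until Y". Hypothetical model of the steps above.
WAITS_ON_OLD = {
    "deploy-proxy": ["dashboard-http-200"],      # health gate polls ci.commoninternet.net
    "dashboard-http-200": ["deploy-dashboard"],  # that page is served by the dashboard
    "deploy-dashboard": ["deploy-proxy"],        # After=deploy-proxy.service
}

# Same graph with the gate pointed at traefik's own endpoint instead.
WAITS_ON_NEW = {
    "deploy-proxy": ["traefik-api-version"],
    "deploy-dashboard": ["deploy-proxy"],
}


def find_cycle(graph, start):
    """Follow wait-for edges depth-first from start; return a cycle as a node list, else None."""
    path = []

    def visit(node):
        if node in path:
            # Revisiting a node on the current path closes a cycle.
            return path[path.index(node):] + [node]
        path.append(node)
        for nxt in graph.get(node, []):
            found = visit(nxt)
            if found:
                return found
        path.pop()
        return None

    return visit(start)


print(find_cycle(WAITS_ON_OLD, "deploy-proxy"))
# ['deploy-proxy', 'dashboard-http-200', 'deploy-dashboard', 'deploy-proxy']
print(find_cycle(WAITS_ON_NEW, "deploy-proxy"))  # None
```

In the live system the cycle is bounded by the probe timeouts rather than hanging forever, so it surfaces as an unhealthy verdict instead of a stuck unit.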
### Root cause — PROVEN LIVE (not merely theoretical)

The alert file `/var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json`
confirms the deadlock hit TODAY at boot time:

```
deploy-proxy started: 05:38:21 UTC
  → probed ci.commoninternet.net (60s timeout): unhealthy
  → redeployed traefik
  → probed ci.commoninternet.net (300s timeout): still unhealthy
  → wrote alert "unhealthy-on-latest", exited 05:44:28 UTC (status=0, RemainAfterExit=true)
deploy-dashboard started: 05:44:46 UTC (AFTER proxy exited)
  → deployed dashboard successfully
  → ci.commoninternet.net now returns 200
```

traefik startDate = 2026-06-13T05:38:02Z (was already up before proxy reconciler started at
05:38:21) — so traefik itself was healthy; the probe was blocked on the dashboard.

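As a sanity check on the timeline, the durations can be worked out directly from the timestamps. This is a small arithmetic sketch; it assumes both probes ran to their full timeout, which the timeline suggests but does not state outright.

```python
from datetime import datetime

FMT = "%H:%M:%S"
proxy_start = datetime.strptime("05:38:21", FMT)
proxy_exit = datetime.strptime("05:44:28", FMT)
dashboard_start = datetime.strptime("05:44:46", FMT)

# The two probe timeouts quoted in the timeline.
probe_budget_s = 60 + 300

proxy_runtime_s = int((proxy_exit - proxy_start).total_seconds())
gap_s = int((dashboard_start - proxy_exit).total_seconds())

print(proxy_runtime_s)                   # 367
print(proxy_runtime_s - probe_budget_s)  # 7
print(gap_s)                             # 18
```

Under that assumption, 360 of the 367 seconds went to waiting on the dashboard, leaving about 7 seconds for the redeploy and bookkeeping. The dashboard unit started 18 seconds after the proxy unit exited.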
### Verified fix endpoint

`curl -sk --resolve traefik.ci.commoninternet.net:443:127.0.0.1 https://traefik.ci.commoninternet.net/api/version`
→ `{"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"}` (200)

This endpoint is up the moment traefik is serving, has no backend dependency, requires no auth.

`/ping` → 404 (not configured in the current recipe — avoid).

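For repeatable checks, the same probe could be wrapped in a few lines of Python. This is a sketch, not the reconciler's code: `build_probe_cmd`, `probe_status`, and `parse_version` are hypothetical helpers, and the 10-second `--max-time` is an arbitrary choice.

```python
import json
import subprocess


def build_probe_cmd(domain, path, ip="127.0.0.1", port=443):
    """Build a curl argv that prints only the HTTP status code."""
    return [
        "curl", "-sk",
        "-o", "/dev/null",          # discard the body
        "-w", "%{http_code}",       # print the status code instead
        "--max-time", "10",         # arbitrary cap, not taken from the reconciler
        "--resolve", f"{domain}:{port}:{ip}",
        f"https://{domain}{path}",
    ]


def probe_status(domain, path):
    """Run the probe and return the status code as a string, e.g. '200'."""
    result = subprocess.run(
        build_probe_cmd(domain, path), capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


def parse_version(body):
    """Extract the Version field from an /api/version response body."""
    return json.loads(body)["Version"]
```

With the response body recorded above, `parse_version` returns `"3.6.15"`. `probe_status("traefik.ci.commoninternet.net", "/api/version")` has to run on the host itself, since it resolves the name to 127.0.0.1.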
### Required change (my independent read of the fix)

In `runner/warm_reconcile.py` SPECS["traefik"]:
- Remove `"health_domain": "ci.commoninternet.net"` — so `health_code()` falls back to `spec["domain"]` = `"traefik.ci.commoninternet.net"`
- Change `"health_path": "/"` → `"health_path": "/api/version"`

`health_code()` will then probe `https://traefik.ci.commoninternet.net/api/version` directly
(via `--resolve traefik.ci.commoninternet.net:443:127.0.0.1`), which returns 200 as soon as
traefik is up — no dashboard dependency.

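The effect of the two edits on the probe target can be illustrated with a reduced model of the fallback. This is a sketch under assumptions: `SPEC_OLD`, `SPEC_NEW`, and `probe_target` are hypothetical, the dicts show only the keys named in this review, and the `"/"` default for a missing `health_path` is a guess rather than something read from `health_code()`.

```python
# Hypothetical reductions of SPECS["traefik"] before and after the change.
SPEC_OLD = {
    "domain": "traefik.ci.commoninternet.net",
    "health_domain": "ci.commoninternet.net",
    "health_path": "/",
}
SPEC_NEW = {
    "domain": "traefik.ci.commoninternet.net",
    "health_path": "/api/version",
}


def probe_target(spec):
    """Return (url, resolve_arg) using the health_domain -> domain fallback."""
    host = spec.get("health_domain", spec["domain"])
    path = spec.get("health_path", "/")  # default is an assumption
    return f"https://{host}{path}", f"{host}:443:127.0.0.1"


print(probe_target(SPEC_OLD)[0])  # https://ci.commoninternet.net/
print(probe_target(SPEC_NEW)[0])  # https://traefik.ci.commoninternet.net/api/version
```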
---

## M1 — Fix + controlled reproduction

### PENDING — awaiting Builder implementation

Acceptance criteria I will independently verify:
1. **Code change correct**: SPECS["traefik"] removes `health_domain` override and uses `/api/version`
2. **New gate is meaningful**: a STOPPED/broken traefik must cause the probe to fail (return non-200)
3. **Controlled reproduction**: with dashboard held back, old gate hangs/fails; new gate passes on traefik alone
4. **No `After=deploy-proxy` consumer regressed**: drone, warm-keycloak, bridge, dashboard, backupbot, reports-nightly ordering still correct
5. **PR merged or ready** to the cc-ci repo

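Criterion 4 could be checked mechanically by parsing unit text for `After=` entries. The sketch below assumes the text comes from `systemctl cat <service>`; `after_targets` and `orders_after_proxy` are hypothetical helpers, and the sample unit omits the entries that are elided in the review above.

```python
def after_targets(unit_text):
    """Collect every unit named on After= lines, across the unit file and any drop-ins."""
    targets = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        if line.startswith("After="):
            # A single After= line may list several space-separated units.
            targets.extend(line[len("After="):].split())
    return targets


def orders_after_proxy(unit_text):
    """True if the unit is ordered after deploy-proxy.service."""
    return "deploy-proxy.service" in after_targets(unit_text)


# Shortened sample; the real unit lists further entries not shown in this review.
SAMPLE = """\
[Unit]
Description=sample consumer
After=deploy-bridge.service deploy-proxy.service
"""

print(orders_after_proxy(SAMPLE))  # True
```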
---

## M2 — Proven on a real from-scratch boot

### PENDING — awaiting Builder implementation + orchestrator cold-boot

Acceptance criteria I will independently verify:
1. **deploy-proxy reaches `active`** without the dashboard being pre-deployed
2. **Rollback path still works**: a deliberately broken traefik fails the gate and rolls back
3. **Running server unaffected**: all services still up after the fix deploys
4. **A1 / DEFERRED entry closed** with pointers