chore(pxgate): init Adversary phase files — root cause cold-verified, M1/M2 PENDING

Independent cold read confirms the circular dependency: the proxy health gate polls
ci.commoninternet.net, which is served by the dashboard, and the dashboard is ordered
After=deploy-proxy. The root cause is PROVEN LIVE by today's alert:
20260613T054428Z-traefik-unhealthy-on-latest.json.

Fix endpoint independently verified: /api/version on traefik.ci.commoninternet.net
returns 200 as soon as traefik is up, no dashboard dependency.

REVIEW-pxgate.md: orientation, M1/M2 acceptance criteria.
BACKLOG-pxgate.md: break-it probes P1–P5 to run at M1 gate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
autonomic-bot
2026-06-13 12:42:30 +00:00
parent 1c671ed045
commit a9e67af61e
2 changed files with 115 additions and 0 deletions

@@ -0,0 +1,22 @@
# BACKLOG — phase pxgate
## Build backlog
(Builder-owned — Adversary reads only)
## Adversary findings
No findings yet. Recording break-it probes to run once the fix lands.
### Break-it probes to execute at M1 gate
- [ ] **P1-neg (traefik-down gate fails):** Stop traefik service; verify `health_code` returns non-200
and the reconciler would roll back. (Prove the new gate has teeth — not always-pass.)
- [ ] **P2-controlled-repro:** Simulate dashboard-absent scenario: with dashboard held back (or stopped),
run the NEW reconciler → verify it completes healthy (no deadlock). Run the OLD reconciler with
dashboard held back → verify it hangs/fails (confirm the fix actually breaks the cycle).
- [ ] **P3-ordering:** Confirm `After=deploy-proxy` consumers (drone, warm-keycloak, bridge, dashboard,
backupbot, reports-nightly) still order correctly. Check `systemctl cat <service>` for each.
- [ ] **P4-alert-cleared:** Verify the 20260613T054428Z unhealthy-on-latest alert is addressed (either
the Builder explicitly handles it, or the fix makes the next reconcile cycle healthy).
- [ ] **P5-secret-leak:** grep `/var/lib/ci-warm/alerts/` for any secret values (keys, passwords).
The alert file must contain only version strings, no credentials.
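
A minimal sketch of how probe P5 could be automated, assuming the alert files are JSON text. The regex patterns and the helper names `find_suspicious` and `scan_alert_dir` are hypothetical first-pass choices, not taken from the repo; tune them to the secrets this deployment actually uses.

```python
import re
from pathlib import Path

# Hypothetical patterns for credential-like content (assumption, not from the repo).
SUSPICIOUS = [
    re.compile(r"(?i)(password|passwd|secret|token|api[_-]?key)"),
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]

def find_suspicious(text: str) -> list[str]:
    """Return the patterns that match anywhere in the given text."""
    return [p.pattern for p in SUSPICIOUS if p.search(text)]

def scan_alert_dir(alert_dir: str = "/var/lib/ci-warm/alerts") -> dict[str, list[str]]:
    """Map each alert file name to the suspicious patterns found in it."""
    hits = {}
    for path in sorted(Path(alert_dir).glob("*.json")):
        found = find_suspicious(path.read_text(errors="replace"))
        if found:
            hits[path.name] = found
    return hits
```

An empty result from `scan_alert_dir()` would support the "version strings only" claim, though a pattern scan cannot prove the absence of secrets; a manual read of each alert file is still worthwhile.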

@@ -0,0 +1,93 @@
# REVIEW — phase pxgate
**Phase:** pxgate — break deploy-proxy ↔ dashboard health-gate circular dependency (D8 fix)
**Adversary:** autonomic-bot (Sonnet 4.6)
**Started:** 2026-06-13T12:41Z
---
## Adversary orientation (cold start — 2026-06-13T12:41Z)
Independent cold read of the root cause and fix spec. NOT a gate claim — recording what I found so
the M1 verdict below is COLD and reproducible.
### Root cause — INDEPENDENTLY CONFIRMED
Reading `nix/modules/proxy.nix` + `runner/warm_reconcile.py` + `nix/modules/dashboard.nix`:
1. `deploy-proxy.service` runs `warm_reconcile.py traefik`.
2. The traefik SPEC in `warm_reconcile.py:117-128` sets:
```python
"health_domain": "ci.commoninternet.net",
"health_path": "/",
```
So `health_code()` probes `https://ci.commoninternet.net/` — the dashboard.
3. `deploy-dashboard.service` (dashboard.nix:89) has:
```
After=deploy-bridge.service deploy-proxy.service ...
```
systemd will not start deploy-dashboard until deploy-proxy exits.
4. **Deadlock:** proxy waits for dashboard; dashboard waits for proxy.
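
The four steps above can be modelled as a small "waits on" graph. This is an illustrative sketch only: the node labels `dashboard-serving` and `traefik-serving` and the helper `find_cycle` are hypothetical names, not anything in `warm_reconcile.py`.

```python
# Edges read "X cannot complete until Y". Labels are illustrative assumptions.
OLD_GATE = {
    "deploy-proxy": ["dashboard-serving"],      # health gate probes the dashboard URL
    "dashboard-serving": ["deploy-dashboard"],  # URL answers 200 only once dashboard is deployed
    "deploy-dashboard": ["deploy-proxy"],       # After=deploy-proxy.service
}

# Same graph with the gate pointed at traefik's own endpoint instead.
NEW_GATE = {
    "deploy-proxy": ["traefik-serving"],
    "deploy-dashboard": ["deploy-proxy"],
}

def find_cycle(graph, start):
    """Return the first dependency cycle reachable from start, or None."""
    path = []

    def visit(node):
        if node in path:
            # Close the loop: slice from the first occurrence and repeat the node.
            return path[path.index(node):] + [node]
        path.append(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        path.pop()
        return None

    return visit(start)
```

Under this model `find_cycle(OLD_GATE, "deploy-proxy")` reports the proxy, dashboard, proxy loop, while `find_cycle(NEW_GATE, "deploy-proxy")` returns `None`.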
### Root cause — PROVEN LIVE (not merely theoretical)
The alert file `/var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json`
confirms the deadlock hit TODAY at boot time:
```
deploy-proxy started: 05:38:21 UTC
→ probed ci.commoninternet.net (60s timeout): unhealthy
→ redeployed traefik
→ probed ci.commoninternet.net (300s timeout): still unhealthy
→ wrote alert "unhealthy-on-latest", exited 05:44:28 UTC (status=0, RemainAfterExit=true)
deploy-dashboard started: 05:44:46 UTC (AFTER proxy exited)
→ deployed dashboard successfully
→ ci.commoninternet.net now returns 200
```
traefik's startDate is 2026-06-13T05:38:02Z, so traefik was already up before the proxy
reconciler started at 05:38:21. Traefik itself was healthy; the probe was blocked on the dashboard.
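
As a cross-check on the timeline, the arithmetic below uses only the timestamps and timeouts quoted above. Reading the remainder as redeploy and bookkeeping overhead is an assumption; it holds only if both probes ran to their full timeout.

```python
from datetime import datetime

FMT = "%H:%M:%S"
proxy_start = datetime.strptime("05:38:21", FMT)
proxy_exit = datetime.strptime("05:44:28", FMT)
dashboard_start = datetime.strptime("05:44:46", FMT)

# Wall-clock time deploy-proxy spent before giving up.
elapsed = (proxy_exit - proxy_start).total_seconds()

# The two probe windows from the timeline: 60s, then 300s after the redeploy.
probe_budget = 60 + 300

# Whatever is left over (assumed: redeploy plus bookkeeping).
overhead = elapsed - probe_budget

# How long after proxy exit the dashboard unit started.
gap = (dashboard_start - proxy_exit).total_seconds()

print(elapsed, probe_budget, overhead, gap)  # 367.0 360 7.0 18.0
```

The 367 s runtime is almost entirely the 360 s of probe timeouts, which is consistent with the gate waiting on something that could not come up until the unit exited.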
### Verified fix endpoint
`curl -sk --resolve traefik.ci.commoninternet.net:443:127.0.0.1 https://traefik.ci.commoninternet.net/api/version`
→ `{"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"}` (200)
This endpoint is up the moment traefik is serving, has no backend dependency, and requires no auth.
`/ping` → 404 (not configured in the current recipe — avoid).
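
For scripted re-checks, the curl invocation above could be wrapped as follows. This is a sketch, not the repo's `health_code()`: `build_probe_cmd` and `probe_status` are hypothetical helpers, and the 10-second `--max-time` is an assumed value.

```python
import subprocess

def build_probe_cmd(host: str, path: str, ip: str = "127.0.0.1") -> list[str]:
    """Build a curl argv that prints only the HTTP status code."""
    return [
        "curl", "-sk",
        "--resolve", f"{host}:443:{ip}",  # pin the hostname to the local traefik
        "-o", "/dev/null",                # discard the response body
        "-w", "%{http_code}",             # print just the status code
        "--max-time", "10",               # assumed per-request timeout
        f"https://{host}{path}",
    ]

def probe_status(host: str, path: str) -> int:
    """Run the probe; return the HTTP status, or 0 if no response was received."""
    result = subprocess.run(build_probe_cmd(host, path), capture_output=True, text=True)
    out = result.stdout.strip()
    return int(out) if out.isdigit() else 0
```

With traefik up, `probe_status("traefik.ci.commoninternet.net", "/api/version")` should match the 200 recorded above; with traefik stopped, curl prints `000`, which this sketch maps to `0`.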
### Required change (my independent read of the fix)
In `runner/warm_reconcile.py` SPECS["traefik"]:
- Remove `"health_domain": "ci.commoninternet.net"` — so `health_code()` falls back to `spec["domain"]` = `"traefik.ci.commoninternet.net"`
- Change `"health_path": "/"` → `"health_path": "/api/version"`
`health_code()` will then probe `https://traefik.ci.commoninternet.net/api/version` directly
(via `--resolve traefik.ci.commoninternet.net:443:127.0.0.1`), which returns 200 as soon as
traefik is up — no dashboard dependency.
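
The effect of the two edits can be sketched with the fallback behaviour described above. The specs here carry only the keys relevant to the gate (the real `SPECS["traefik"]` has more), and `probe_url` is a hypothetical stand-in for the URL resolution inside `health_code()`.

```python
def probe_url(spec: dict) -> str:
    """Resolve the probe URL: health_domain if set, else the service's own domain."""
    host = spec.get("health_domain", spec["domain"])
    return f"https://{host}{spec.get('health_path', '/')}"

# Before: the gate is redirected at the dashboard's domain.
old_spec = {
    "domain": "traefik.ci.commoninternet.net",
    "health_domain": "ci.commoninternet.net",
    "health_path": "/",
}

# After: no health_domain override, and a traefik-native path.
new_spec = {
    "domain": "traefik.ci.commoninternet.net",
    "health_path": "/api/version",
}

print(probe_url(old_spec))  # https://ci.commoninternet.net/
print(probe_url(new_spec))  # https://traefik.ci.commoninternet.net/api/version
```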
---
## M1 — Fix + controlled reproduction
### PENDING — awaiting Builder implementation
Acceptance criteria I will independently verify:
1. **Code change correct**: SPECS["traefik"] removes `health_domain` override and uses `/api/version`
2. **New gate is meaningful**: a STOPPED/broken traefik must cause the probe to fail (return non-200)
3. **Controlled reproduction**: with dashboard held back, old gate hangs/fails; new gate passes on traefik alone
4. **No `After=deploy-proxy` consumer regressed**: drone, warm-keycloak, bridge, dashboard, backupbot, reports-nightly ordering still correct
5. **PR ready or merged**: the fix is in a PR against the cc-ci repo that is merged or ready to merge
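
Criterion 4 could be checked mechanically by parsing `systemctl cat <service>` output for `After=` lines. The sketch below is a naive parse under stated assumptions: `after_units` is a hypothetical helper, and it does not handle line continuations.

```python
def after_units(unit_text: str) -> set[str]:
    """Collect every unit named in After= lines of a unit file's text."""
    units = set()
    for line in unit_text.splitlines():
        line = line.strip()
        if line.startswith("After="):
            # After= takes a space-separated list of unit names.
            units.update(line[len("After="):].split())
    return units

sample = """\
[Unit]
Description=example consumer
After=deploy-bridge.service deploy-proxy.service
"""

print("deploy-proxy.service" in after_units(sample))  # True
```

Feeding each consumer's `systemctl cat` output through `after_units` before and after the fix, and comparing the sets, would show whether any ordering edge was dropped.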
---
## M2 — Proven on a real from-scratch boot
### PENDING — awaiting Builder implementation + orchestrator cold-boot
Acceptance criteria I will independently verify:
1. **deploy-proxy reaches `active`** without the dashboard being pre-deployed
2. **Rollback path still works**: a deliberately broken traefik fails the gate and rolls back
3. **Running server unaffected**: all services still up after the fix deploys
4. **A1 / DEFERRED entry closed** with pointers