From a9e67af61eeac4d6f6ef0d29e7706f9a4f916ce6 Mon Sep 17 00:00:00 2001 From: autonomic-bot Date: Sat, 13 Jun 2026 12:42:30 +0000 Subject: [PATCH] =?UTF-8?q?chore(pxgate):=20init=20Adversary=20phase=20fil?= =?UTF-8?q?es=20=E2=80=94=20root=20cause=20cold-verified,=20M1/M2=20PENDIN?= =?UTF-8?q?G?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Independent cold read confirms the circular dependency (proxy health-gate polls ci.commoninternet.net served by dashboard which is After=deploy-proxy). Root cause is PROVEN LIVE by today's alert: 20260613T054428Z-traefik-unhealthy-on-latest.json. Fix endpoint independently verified: /api/version on traefik.ci.commoninternet.net returns 200 as soon as traefik is up, no dashboard dependency. REVIEW-pxgate.md: orientation, M1/M2 acceptance criteria. BACKLOG-pxgate.md: break-it probes P1–P5 to run at M1 gate. Co-Authored-By: Claude Sonnet 4.6 --- machine-docs/BACKLOG-pxgate.md | 22 ++++++++ machine-docs/REVIEW-pxgate.md | 93 ++++++++++++++++++++++++++++++++++ 2 files changed, 115 insertions(+) create mode 100644 machine-docs/BACKLOG-pxgate.md create mode 100644 machine-docs/REVIEW-pxgate.md diff --git a/machine-docs/BACKLOG-pxgate.md b/machine-docs/BACKLOG-pxgate.md new file mode 100644 index 0000000..a4830a8 --- /dev/null +++ b/machine-docs/BACKLOG-pxgate.md @@ -0,0 +1,22 @@ +# BACKLOG — phase pxgate + +## Build backlog +(Builder-owned — Adversary reads only) + +## Adversary findings + +No findings yet. Recording break-it probes to run once the fix lands. + +### Break-it probes to execute at M1 gate + +- [ ] **P1-neg (traefik-down gate fails):** Stop traefik service; verify `health_code` returns non-200 + and the reconciler would roll back. (Prove the new gate has teeth — not always-pass.) +- [ ] **P2-controlled-repro:** Simulate dashboard-absent scenario: with dashboard held back (or stopped), + run the NEW reconciler → verify it completes healthy (no deadlock). Run the OLD reconciler with + dashboard held back → verify it hangs/fails (confirm the fix actually breaks the cycle). +- [ ] **P3-ordering:** Confirm `After=deploy-proxy` consumers (drone, warm-keycloak, bridge, dashboard, + backupbot, reports-nightly) still order correctly. Check `systemctl cat ` for each. +- [ ] **P4-alert-cleared:** Verify the 20260613T054428Z unhealthy-on-latest alert is addressed (either + the Builder explicitly handles it, or the fix makes the next reconcile cycle healthy). +- [ ] **P5-secret-leak:** grep `/var/lib/ci-warm/alerts/` for any secret values (keys, passwords). + The alert file must contain only version strings, no credentials. diff --git a/machine-docs/REVIEW-pxgate.md b/machine-docs/REVIEW-pxgate.md new file mode 100644 index 0000000..09d646c --- /dev/null +++ b/machine-docs/REVIEW-pxgate.md @@ -0,0 +1,93 @@ +# REVIEW — phase pxgate + +**Phase:** pxgate — break deploy-proxy ↔ dashboard health-gate circular dependency (D8 fix) +**Adversary:** autonomic-bot (Sonnet 4.6) +**Started:** 2026-06-13T12:41Z + +--- + +## Adversary orientation (cold start — 2026-06-13T12:41Z) + +Independent cold read of the root cause and fix spec. NOT a gate claim — recording what I found so +the M1 verdict below is COLD and reproducible. + +### Root cause — INDEPENDENTLY CONFIRMED + +Reading `nix/modules/proxy.nix` + `runner/warm_reconcile.py` + `nix/modules/dashboard.nix`: + +1. `deploy-proxy.service` runs `warm_reconcile.py traefik`. +2. The traefik SPEC in `warm_reconcile.py:117-128` sets: + ```python + "health_domain": "ci.commoninternet.net", + "health_path": "/", + ``` + So `health_code()` probes `https://ci.commoninternet.net/` — the dashboard. +3. `deploy-dashboard.service` (dashboard.nix:89) has: + ``` + After=deploy-bridge.service deploy-proxy.service ... + ``` + systemd will not start deploy-dashboard until deploy-proxy exits. +4. **Deadlock:** proxy waits for dashboard; dashboard waits for proxy. + +### Root cause — PROVEN LIVE (not merely theoretical) + +The alert file `/var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json` +confirms the deadlock hit TODAY at boot time: + +``` +deploy-proxy started: 05:38:21 UTC + → probed ci.commoninternet.net (60s timeout): unhealthy + → redeployed traefik + → probed ci.commoninternet.net (300s timeout): still unhealthy + → wrote alert "unhealthy-on-latest", exited 05:44:28 UTC (status=0, RemainAfterExit=true) +deploy-dashboard started: 05:44:46 UTC (AFTER proxy exited) + → deployed dashboard successfully + → ci.commoninternet.net now returns 200 +``` + +traefik startDate = 2026-06-13T05:38:02Z (was already up before proxy reconciler started at +05:38:21) — so traefik itself was healthy; the probe was blocked on the dashboard. + +### Verified fix endpoint + +`curl -sk --resolve traefik.ci.commoninternet.net:443:127.0.0.1 https://traefik.ci.commoninternet.net/api/version` +→ `{"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"}` (200) + +This endpoint is up the moment traefik is serving, has no backend dependency, requires no auth. + +`/ping` → 404 (not configured in the current recipe — avoid). + +### Required change (my independent read of the fix) + +In `runner/warm_reconcile.py` SPECS["traefik"]: +- Remove `"health_domain": "ci.commoninternet.net"` — so `health_code()` falls back to `spec["domain"]` = `"traefik.ci.commoninternet.net"` +- Change `"health_path": "/"` → `"health_path": "/api/version"` + +`health_code()` will then probe `https://traefik.ci.commoninternet.net/api/version` directly +(via `--resolve traefik.ci.commoninternet.net:443:127.0.0.1`), which returns 200 as soon as +traefik is up — no dashboard dependency. + +--- + +## M1 — Fix + controlled reproduction + +### PENDING — awaiting Builder implementation + +Acceptance criteria I will independently verify: +1. **Code change correct**: SPECS["traefik"] removes `health_domain` override and uses `/api/version` +2. **New gate is meaningful**: a STOPPED/broken traefik must cause the probe to fail (return non-200) +3. **Controlled reproduction**: with dashboard held back, old gate hangs/fails; new gate passes on traefik alone +4. **No `After=deploy-proxy` consumer regressed**: drone, warm-keycloak, bridge, dashboard, backupbot, reports ordering still correct +5. **PR merged or ready** to the cc-ci repo + +--- + +## M2 — Proven on a real from-scratch boot + +### PENDING — awaiting Builder implementation + orchestrator cold-boot + +Acceptance criteria I will independently verify: +1. **deploy-proxy reaches `active`** without the dashboard being pre-deployed +2. **Rollback path still works**: a deliberately broken traefik fails the gate and rolls back +3. **Running server unaffected**: all services still up after the fix deploys +4. **A1 / DEFERRED entry closed** with pointers