plan: queue pxgate — fix deploy-proxy/dashboard health-gate circular dependency (D8)
Re-target the traefik health gate off ci.commoninternet.net (the dashboard, which is After=deploy-proxy) onto a traefik-self endpoint, breaking the fresh-boot deadlock while keeping health-gated rollback. M1 controlled repro by the loops; M2 from-scratch cold-boot proof owned by the orchestrator.
@@ -645,3 +645,12 @@ session cc-ci-orchestrator-stale can be killed; recipe-mirrors org still private
DROPPED. Re-added it to agents.toml — appended AFTER ghost (the system is already past cf55/on
pvfix, so inserting before pvfix would shift the live phase index). agents.py re-reads config every
tick, so no watchdog bounce needed. cf48 runs as the last phase, opus 4.8, claude backend.

## 2026-06-13 ~12:40 — queued pxgate (deploy-proxy health-gate circular-dep D8 fix)
- Operator: fix the A1 circular dependency (deploy-proxy health-gates on ci.commoninternet.net =
  dashboard, but dashboard is After=deploy-proxy → fresh-boot deadlock → proxy fails at 900s).
- Plan plan-phase-pxgate-proxy-healthgate.md: re-target the traefik health gate to a
  dashboard-independent traefik-self endpoint (/ping or api/version), keep rollback semantics;
  M1 = fix + controlled repro (loops), M2 = from-scratch cold-boot proof (orchestrator owns the
  live nixos-rebuild). Appended pxgate to agents.toml (idx 14); cleared SEQUENCE-COMPLETE +
  `phase set 14` + started loops → resumes the build for this one phase.
@@ -146,4 +146,5 @@ phases = [
  { id = "pvcheck", plan = "plan-phase-pvcheck-post-proxy-verification.md", status = "STATUS-pvcheck.md" },
  { id = "ghost", plan = "plan-phase-ghost-reeval.md", status = "STATUS-ghost.md" },
  { id = "cf48", plan = "plan-phase-cf48-opus-cfold-review.md", status = "STATUS-cf48.md", models = { builder = "claude-opus-4-8", adversary = "claude-opus-4-8" } },
  { id = "pxgate", plan = "plan-phase-pxgate-proxy-healthgate.md", status = "STATUS-pxgate.md" },
]
76
cc-ci-plan/plan-phase-pxgate-proxy-healthgate.md
Normal file
@@ -0,0 +1,76 @@
# Phase `pxgate` — break the deploy-proxy ↔ dashboard health-gate circular dependency (D8 fix)

**Mission:** fix the boot-ordering deadlock the Adversary filed as A1 (BACKLOG-pvfix /
DEFERRED 2026-06-13) so a **from-scratch (D8) boot brings traefik up cleanly** instead of
deploy-proxy hanging 15 min and failing. The traefik proxy must still be health-gated with
rollback — just gated on a signal that does NOT depend on a service ordered after it.

State files under `machine-docs/`: `STATUS-pxgate.md`, `BACKLOG-pxgate.md`,
`REVIEW-pxgate.md`, `JOURNAL-pxgate.md`. DECISIONS.md shared.

## Root cause (verified)
1. `nix/modules/proxy.nix` — `deploy-proxy.service` runs `warm_reconcile.py traefik`; its
   `wait_healthy` polls **`https://ci.commoninternet.net` → 200**. `TimeoutStartSec = 900`.
2. `https://ci.commoninternet.net` is served by the **dashboard**.
3. `nix/modules/dashboard.nix` — `deploy-dashboard.service` has **`After=deploy-proxy.service`**,
   so systemd won't start the dashboard until deploy-proxy exits.
   → On a fresh boot: proxy waits for the dashboard, dashboard waits for proxy → proxy times out
   at 900s and **fails**, then the dashboard starts onto a failed proxy. (On the *running*
   server it's invisible — the dashboard is already up — which is why it's a latent D8 risk.)
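The deadlock above can be pictured as a cycle in a "waits for" graph that mixes systemd ordering edges with the health-gate edge. The following is a minimal sketch under that assumption; `find_cycle`, the edge maps, and the `traefik-self` node name are illustrative and are not code from this repo.

```python
def find_cycle(waits_for):
    """Return one cycle in a 'waits for' graph as a list of nodes, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for dep in waits_for.get(node, ()):
            state = colour.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path: that is the cycle.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for start in list(waits_for):
        if colour.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None


# Old gate: deploy-proxy finishes only once the dashboard URL answers 200,
# while the dashboard unit is ordered After=deploy-proxy.service.
old_gate = {
    "deploy-proxy": ["deploy-dashboard"],   # health gate polls ci.commoninternet.net
    "deploy-dashboard": ["deploy-proxy"],   # After=deploy-proxy.service
}

# Proposed gate: deploy-proxy waits only on an endpoint traefik serves itself.
new_gate = {
    "deploy-proxy": ["traefik-self"],       # hypothetical node for /ping or api/version
    "deploy-dashboard": ["deploy-proxy"],
}

print(find_cycle(old_gate))
print(find_cycle(new_gate))
```

With the old edges the search reports the proxy/dashboard loop; with the gate re-targeted there is no cycle left to find.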

## Required fix
Break the cycle while keeping a MEANINGFUL traefik health signal + the rollback semantics
(deploy latest → health-gate → commit last-good if healthy / roll back to last-good if not).
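The rollback semantics named above (deploy latest, health-gate, then commit or roll back) can be sketched as one small function. This is an illustration only: `reconcile` and its four callables are hypothetical stand-ins, not the actual structure of `warm_reconcile.py`.

```python
def reconcile(deploy_latest, is_healthy, commit_last_good, roll_back):
    """Deploy, gate on health, then either commit or roll back.

    All four arguments are hypothetical callables supplied by the caller.
    Returns "committed" or "rolled-back" so the outcome can be logged.
    """
    deploy_latest()
    if is_healthy():
        commit_last_good()      # the new version becomes last-good
        return "committed"
    roll_back()                 # redeploy the previous last-good version
    return "rolled-back"


def run_with(healthy):
    """Drive reconcile() with recording stubs and return (outcome, steps)."""
    steps = []
    outcome = reconcile(
        deploy_latest=lambda: steps.append("deploy"),
        is_healthy=lambda: healthy,
        commit_last_good=lambda: steps.append("commit"),
        roll_back=lambda: steps.append("rollback"),
    )
    return outcome, steps


print(run_with(True))
print(run_with(False))
```

The point of the shape is that only `is_healthy` changes in this phase; the commit and rollback branches stay exactly as they were.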

**Recommended approach (Builder confirms in DECISIONS.md):** change the deploy-proxy health
gate to probe a **traefik-self-served endpoint that is up the moment traefik is, with no
backend/dashboard dependency** — e.g. traefik's own `ping` entrypoint (`/ping` → 200) or
`https://traefik.ci.commoninternet.net/api/version`. That confirms "traefik (re)deployed and
is serving" — enough to drive the rollback decision — without waiting on `ci.commoninternet.net`.
- If you still want an end-to-end "traefik routes to a real backend" assertion, move THAT to a
  **separate converge/health step ordered AFTER the dashboard** (or rely on the dashboard's own
  health gate); it must NOT be inside the deploy-proxy oneshot.
- Do NOT simply delete the health gate (loses the rollback safety net). Do NOT just bump the
  timeout (the deadlock still fails on a cold boot, just slower). Do NOT reintroduce any
  dependency on a service that is `After=deploy-proxy`.
- Make sure traefik actually EXPOSES whatever endpoint you choose (the `ping` entrypoint /
  api may need enabling in the traefik recipe/compose) — verify it returns 200 with only
  traefik up.
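A dashboard-independent gate of the kind recommended above could look roughly like the sketch below. It is not the existing `wait_healthy` in `runner/warm_reconcile.py`: `probe_once`, `wait_until_healthy`, the deadline and interval values, and the example URL are all assumptions, and the real address depends on how the `ping` entrypoint ends up exposed.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe_once(url, timeout=5.0):
    """Return True only if `url` answers HTTP 200; any error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, TLS failure, timeout, non-2xx status, bad response.
        return False


def wait_until_healthy(probe, deadline_s=120.0, interval_s=2.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True or `deadline_s` seconds have passed.

    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    give_up_at = clock() + deadline_s
    while True:
        if probe():
            return True
        if clock() >= give_up_at:
            return False   # still unhealthy: the caller should roll back
        sleep(interval_s)


# Example wiring (URL is an assumption, adjust to the endpoint actually exposed):
# healthy = wait_until_healthy(lambda: probe_once("http://127.0.0.1:8080/ping"))
```

Because a failed or absent traefik never returns 200 from its own endpoint, the gate still fails honestly; it just no longer needs the dashboard to pass.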

Touch points: `nix/modules/proxy.nix`, the traefik health check in `runner/warm_reconcile.py`
(and/or `runner/harness/*` health helpers), possibly the traefik recipe overlay to expose the
health endpoint. Check the other `After=deploy-proxy` consumers (drone, warm-keycloak, reports,
bridge, backupbot, nightly-sweep) still order correctly after the change.
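One way to check the consumers listed above is to read each unit's `After=` property (as printed by `systemctl show -p After <unit>`) and confirm `deploy-proxy.service` is still in it. The sketch below only parses captured text; the helper name and the sample unit name `drone.service` are assumptions, since the real unit names are defined in the nix modules.

```python
def units_ordered_after(show_outputs, dependency):
    """Map each unit to True if its After= list contains `dependency`.

    `show_outputs` maps a unit name to the text printed by
    `systemctl show -p After <unit>`, e.g. "After=a.service b.target".
    """
    result = {}
    for unit, text in show_outputs.items():
        after = []
        for line in text.splitlines():
            if line.startswith("After="):
                after.extend(line[len("After="):].split())
        result[unit] = dependency in after
    return result


# Sample captured output; unit names other than the two from the plan are hypothetical.
captured = {
    "deploy-dashboard.service": "After=network-online.target deploy-proxy.service",
    "drone.service": "After=basic.target",
}

print(units_ordered_after(captured, "deploy-proxy.service"))
```

Any unit reported as `False` has lost its ordering on deploy-proxy and needs a look before the PR goes up.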

## Gates

**M1 — Fix + controlled reproduction (Builder/Adversary, on a test box / the cc-ci host
without a real wipe).** Implement the fix. Reproduce the cycle in a CONTROLLED way (per the
Adversary's A1 repro): with the dashboard held back / absent, show the OLD gate hangs→fails and
the NEW gate goes healthy on traefik alone. Unit/integration-test the new health check.
Adversary cold-verifies: cycle is genuinely broken, health signal is still meaningful (a
broken traefik must still fail the gate and roll back), no `After=deploy-proxy` consumer
regressed. PR to cc-ci repo; do NOT merge to the live host yet.

**M2 — Proven on a real from-scratch boot.** The durable proof is a **from-scratch / cold
boot** where `deploy-proxy` reaches `active` without the dashboard and the whole control plane
converges. Coordinate with the orchestrator: the orchestrator owns the live `nixos-rebuild` and
any throwaway-VM / from-scratch rebuild (host infra is orchestrator/operator territory, not the
loops). Acceptance: on a clean boot deploy-proxy is `active` (not failed), the rollback path
still works on a deliberately-broken traefik, and the running server is unaffected. Fresh
Adversary PASS → `## DONE`.

## Guardrails
- **Health verification must stay honest** — a traefik that isn't actually serving must still
  fail the gate and roll back to last-good. Breaking the cycle must not become "always pass".
- Live host changes (`nixos-rebuild`, from-scratch boot) are **orchestrator/operator-owned**;
  the loops produce the code + controlled proof and a PR, the orchestrator deploys + runs the
  cold-boot test. The loops must NOT nixos-rebuild the live host themselves.
- No gate weakening elsewhere; minimal change; no secrets in commits. Commit author
  `autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>`; push every commit. CI host has no
  python3 on default PATH — use the harness venv.

## Definition of Done
The deploy-proxy↔dashboard cycle is broken: a from-scratch boot brings traefik up health-gated
(on a dashboard-independent signal) with rollback intact, no consumer mis-ordered, the running
server unaffected, the A1 / DEFERRED entry closed with pointers, and M1+M2 fresh Adversary PASSes.