cc-ci/machine-docs/BACKLOG-pxgate.md
autonomic-bot 0e9fd388d2
claim(pxgate-M1): change traefik health probe to /api/version (A1 cycle fix)
Break the deploy-proxy ↔ dashboard health-gate circular dependency (Adversary A1, pvfix):

- runner/warm_reconcile.py: remove health_domain override (was ci.commoninternet.net,
  the dashboard). Change health_path from / to /api/version. The probe now uses
  traefik.ci.commoninternet.net/api/version — traefik's own API, no backend/dashboard dep.
- nix/modules/proxy.nix: update comment to reflect new health probe.
- machine-docs/DECISIONS.md: pxgate fix logged (supersedes pvfix manual workaround).
- machine-docs/DEFERRED.md: 2026-06-13 circular-dependency entry closed.
- Consumed BUILDER-INBOX.md (Adversary orientation msg).

Controlled reproduction (dashboard swarm scaled to 0):
  OLD probe (ci.commoninternet.net): HTTP 404  ← gate would loop → timeout
  NEW probe (traefik.../api/version): HTTP 200  ← passes immediately
Stale false-alarm alert 20260613T054428Z-traefik-unhealthy-on-latest.json cleared on host.
No After=deploy-proxy consumers changed (ordering preserved).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 12:46:34 +00:00


BACKLOG — phase pxgate

Build backlog

(Builder-owned — Adversary reads only)

  • Create phase state files (STATUS/JOURNAL/BACKLOG-pxgate.md)
  • Change health_path from / to /api/version; drop health_domain override in runner/warm_reconcile.py
  • Update stale comments in warm_reconcile.py + proxy.nix
  • Update DECISIONS.md + DEFERRED.md
  • Run controlled reproduction (dashboard swarm scaled to 0: old probe returns 404, new probe returns 200)
  • Claim M1
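
The health_path change in the list above can be illustrated with a minimal sketch of a polling health gate. This is not the code in runner/warm_reconcile.py: `http_status`, `wait_healthy`, the attempt count and the delay are hypothetical names and values chosen for illustration, and the URL is assumed from the host and path given in the commit message.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Assumed from the commit message; not read from warm_reconcile.py.
HEALTH_URL = "https://traefik.ci.commoninternet.net/api/version"


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request cannot complete."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses land here, e.g. the 404 seen on the old probe.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or socket timeout.
        return 0


def wait_healthy(
    url: str,
    fetch: Callable[[str], int] = http_status,
    attempts: int = 30,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until it returns 200; give up after `attempts` tries."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no sleep after the final failed attempt
    return False
```

Under these assumptions, a probe that answers 404 on every attempt (the old dashboard-backed probe with the dashboard scaled to 0) exhausts all attempts and fails the gate, while a probe that answers 200 passes on the first attempt. A stopped traefik yields status 0, so the gate also fails in the P1-neg case. The `fetch` and `sleep` parameters are injectable so the gate can be exercised without network access.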

Adversary findings

No findings yet. Recording break-it probes to run once the fix lands.

Break-it probes to execute at M1 gate

  • P1-neg (traefik-down gate fails): Stop the traefik service; verify that health_code is non-200 and that the reconciler would roll back. (Proves the new gate has teeth and is not always-pass.)
  • P2-controlled-repro: Simulate dashboard-absent scenario: with dashboard held back (or stopped), run the NEW reconciler → verify it completes healthy (no deadlock). Run the OLD reconciler with dashboard held back → verify it hangs/fails (confirm the fix actually breaks the cycle).
  • P3-ordering: Confirm After=deploy-proxy consumers (drone, warm-keycloak, bridge, dashboard, backupbot, reports-nightly) still order correctly. Check systemctl cat <service> for each.
  • P4-alert-cleared: Verify the 20260613T054428Z unhealthy-on-latest alert is addressed (either the Builder explicitly handles it, or the fix makes the next reconcile cycle healthy).
  • P5-secret-leak: grep /var/lib/ci-warm/alerts/ for any secret values (keys, passwords). The alert file must contain only version strings, no credentials.
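
Probes P3 and P5 above could be scripted along the following lines. This is a sketch under stated assumptions: the secret patterns are hypothetical and would need tuning to the credentials actually in use, alert files are assumed to end in `.json` (as the one named alert does), and the unit name `deploy-proxy.service` used in the usage note is an assumption about how the After=deploy-proxy dependency is spelled.

```python
import re
from pathlib import Path

# Hypothetical patterns for credential-like content.
SECRET_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r'(?i)\b(password|passwd|secret|token|api[_-]?key)\b"?\s*[:=]'),
]


def find_secret_lines(text: str) -> list[int]:
    """Return 1-based line numbers in text that match any secret pattern."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if any(p.search(line) for p in SECRET_PATTERNS)
    ]


def scan_alert_dir(alert_dir: str = "/var/lib/ci-warm/alerts") -> dict[str, list[int]]:
    """Map each alert file containing suspicious lines to those line numbers."""
    hits: dict[str, list[int]] = {}
    for path in sorted(Path(alert_dir).glob("*.json")):
        lines = find_secret_lines(path.read_text(errors="replace"))
        if lines:
            hits[path.name] = lines
    return hits


def orders_after(unit_text: str, dependency: str) -> bool:
    """True if unit_text (e.g. `systemctl cat` output) has an After= line naming dependency."""
    for line in unit_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.strip() == "After" and dependency in value.split():
            return True
    return False
```

For P5, an empty result from `scan_alert_dir()` is the passing outcome; a pattern match is only a prompt for manual review, not proof of a leak. For P3, the output of `systemctl cat <service>` for each consumer would be passed to `orders_after(text, "deploy-proxy.service")`.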