Files
cc-ci/machine-docs/REVIEW-pxgate.md
autonomic-bot 0a32854853
Some checks failed
continuous-integration/drone/push Build is failing
review(pxgate-M2): PASS — cold-boot sim confirms cycle broken, proxy active without dashboard
nixos-rebuild deployed fix; new nix store path 8qjh8apxcbs85 with /api/version probe;
deploy-proxy active(exited) at 13:43:15 UTC; cold-boot sim: proxy started active(exited)
with dashboard stopped; all 9 services 1/1; alert dir empty; rollback gate unchanged.
Phase pxgate DoD fully met. Builder may write ## DONE.
2026-06-13 13:45:25 +00:00

14 KiB
Raw Blame History

REVIEW — phase pxgate

Phase: pxgate — break deploy-proxy ↔ dashboard health-gate circular dependency (D8 fix) Adversary: autonomic-bot (Sonnet 4.6) Started: 2026-06-13T12:41Z


Adversary orientation (cold start — 2026-06-13T12:41Z)

Independent cold read of the root cause and fix spec. NOT a gate claim — recording what I found so the M1 verdict below is COLD and reproducible.

Root cause — INDEPENDENTLY CONFIRMED

Reading nix/modules/proxy.nix + runner/warm_reconcile.py + nix/modules/dashboard.nix:

  1. deploy-proxy.service runs warm_reconcile.py traefik.
  2. The traefik SPEC in warm_reconcile.py:117-128 sets:
    "health_domain": "ci.commoninternet.net",
    "health_path": "/",
    
    So health_code() probes https://ci.commoninternet.net/ — the dashboard.
  3. deploy-dashboard.service (dashboard.nix:89) has:
    After=deploy-bridge.service deploy-proxy.service ...
    
    systemd will not start deploy-dashboard until deploy-proxy exits.
  4. Deadlock: proxy waits for dashboard; dashboard waits for proxy.

Root cause — PROVEN LIVE (not merely theoretical)

The alert file /var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json confirms the deadlock hit TODAY at boot time:

deploy-proxy started: 05:38:21 UTC
  → probed ci.commoninternet.net (60s timeout): unhealthy
  → redeployed traefik
  → probed ci.commoninternet.net (300s timeout): still unhealthy
  → wrote alert "unhealthy-on-latest", exited 05:44:28 UTC (status=0, RemainAfterExit=true)
deploy-dashboard started: 05:44:46 UTC (AFTER proxy exited)
  → deployed dashboard successfully
  → ci.commoninternet.net now returns 200

traefik startDate = 2026-06-13T05:38:02Z (was already up before proxy reconciler started at 05:38:21) — so traefik itself was healthy; the probe was blocked on the dashboard.

Verified fix endpoint

curl -sk --resolve traefik.ci.commoninternet.net:443:127.0.0.1 https://traefik.ci.commoninternet.net/api/version{"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"} (200)

This endpoint is up the moment traefik is serving, has no backend dependency, requires no auth.

/ping → 404 (not configured in the current recipe — avoid).

Required change (my independent read of the fix)

In runner/warm_reconcile.py SPECS["traefik"]:

  • Remove "health_domain": "ci.commoninternet.net" — so health_code() falls back to spec["domain"] = "traefik.ci.commoninternet.net"
  • Change "health_path": "/""health_path": "/api/version"

health_code() will then probe https://traefik.ci.commoninternet.net/api/version directly (via --resolve traefik.ci.commoninternet.net:443:127.0.0.1), which returns 200 as soon as traefik is up — no dashboard dependency.

Pre-M1 break-it probes (before Builder's fix, 2026-06-13T12:50Z)

P5 — Secret leak in alert files: PASS. /var/lib/ci-warm/alerts/20260613T054428Z-traefik-unhealthy-on-latest.json contains only {"app": "traefik", "reason": "unhealthy-on-latest", "ts": "...", "version": "5.1.1+v3.6.15"}. No credentials, no secrets.

P3 — After=deploy-proxy consumers ordering: PASS (no regression in current ordering):

  • deploy-drone: After=deploy-proxy.service
  • deploy-bridge: After=deploy-drone.service deploy-proxy.service
  • deploy-dashboard: After=deploy-bridge.service deploy-proxy.service
  • deploy-backupbot: After=deploy-dashboard.service deploy-proxy.service
  • deploy-reports: After=deploy-dashboard.service deploy-proxy.service
  • nightly-sweep: After=deploy-proxy.service warm-keycloak.service
  • warm-keycloak: After=deploy-proxy.service These all correctly depend on deploy-proxy; after the fix, proxy completes without deadlock and the rest of the chain proceeds normally.

Endpoint stability: /api/version returns 200 reliably (3/3 probes). No backend dependency.

P1-negative (traefik-down): PENDING at M1 gate — requires a controlled stop of traefik (risky on live system); will execute at M1 verification using a short pause or by examining the reconciler code path (deploy_version raises → upgrade_ok=False → rollback).


M1 — Fix + controlled reproduction

PASS @2026-06-13T13:00Z — Adversary cold-verified

Commit: 0e9fd38 (claim(pxgate-M1): change traefik health probe to /api/version)

Check 1 — Code change correct

runner/warm_reconcile.py SPECS["traefik"] (lines 120129):

"traefik": {
    "recipe": "traefik",
    "domain": "traefik.ci.commoninternet.net",
    "health_path": "/api/version",   # ← changed from "/"
    "health_ok": (200,),
    "stateful": False,
    "deploy_timeout": 600,
    "health_timeout": 300,
    "setup": _traefik_setup,
},

health_domain key is absenthealth_code() falls back to spec["domain"] = "traefik.ci.commoninternet.net". Probe is now https://traefik.ci.commoninternet.net/api/version with --resolve traefik.ci.commoninternet.net:443:127.0.0.1 — traefik's own API, no backend dep.

Check 2 — Controlled reproduction

Scaled ccci-dashboard_app to 0 replicas (dashboard absent):

  • New probe (/api/version on traefik domain): HTTP 200 ← cycle broken
  • Old probe (ci.commoninternet.net/): HTTP 404 ← confirms old gate was deadlocked

Dashboard restored to 1/1 and returns 200 after scale-up.

Check 3 — Consumer ordering unchanged

All After=deploy-proxy.service consumers unchanged:

deploy-drone:      After=deploy-proxy.service swarm-init.service docker.service network-online.target
deploy-bridge:     After=deploy-drone.service deploy-proxy.service ...
deploy-dashboard:  After=deploy-bridge.service deploy-proxy.service ...
deploy-backupbot:  After=deploy-dashboard.service deploy-proxy.service ...
deploy-reports:    After=deploy-dashboard.service deploy-proxy.service ...
nightly-sweep:     After=deploy-proxy.service warm-keycloak.service docker.service
warm-keycloak:     After=deploy-proxy.service ...

deploy-proxy itself: After=swarm-init.service docker.service network-online.target — no dashboard dependency in its own ordering (correct). Fix does not change any service ordering.

Check 4 — Alert dir empty

/var/lib/ci-warm/alerts/ is empty — Builder cleared the stale 05:44Z alert (valid false-alarm from the old gate hitting the deadlock this morning).

Check 5 — proxy.nix comment

Comment updated: "health-gate (traefik.ci.commoninternet.net/api/version returns 200 — traefik's own API, no backend dep)". No functional change to the nix module (same systemd unit).

Check 6 — Gate has teeth (with one documentation note)

Functional PASS: health_code() line 276 returns int(r.stdout.strip() or "0") → on curl connection failure, stdout = "000" (curl's HTTP-code sentinel) → int("000") = 0 → 0 ∉ health_ok=(200,)wait_healthy() returns False → rollback triggered. Gate genuinely fails on a broken traefik.

Documentation discrepancy (non-blocking): The STATUS claim says "EXPECTED: error sentinel 999 returned when curl fails." The actual code returns 0 (not 999) on curl failure. grep for "999" returns no matches. This is a documentation error in the M1 claim only — the functional behavior is correct (0 ≠ 200 → gate fails → rollback). No code defect; no blocking finding.

Check 7 — DEFERRED + DECISIONS updated

machine-docs/DEFERRED.md: 2026-06-13 circular-dependency entry marked [x] CLOSED @2026-06-13 with fix pointer. machine-docs/DECISIONS.md: "deploy-proxy health gate — SETTLED (2026-06-13, phase pxgate)" entry added with rationale.


M1 VERDICT: PASS — cycle broken, new probe is dashboard-independent, rollback gate has teeth, ordering unchanged, DEFERRED closed, docs updated. One non-blocking STATUS discrepancy (999 vs 0 sentinel) noted; no code defect.


M2 — Proven on a real from-scratch boot

PENDING — awaiting orchestrator nixos-rebuild (as of 2026-06-13T13:08Z)

M1 is PASS. The fix is in the repo (0e9fd38). The live cc-ci host still has the OLD probe:

  • Active reconcile script: /nix/store/km6173hm5a77wxggd7zba3mfakrz0c94-cc-ci-reconcile-proxy
  • Calls: /nix/store/ls5d6s7q2892z0n0qv7sfk03zimwx3nd-runner/warm_reconcile.py
  • That file has: "health_domain": "ci.commoninternet.net", "health_path": "/" — OLD probe still live

Orchestrator action required:

ssh cc-ci
cd /root/builder-clone
git pull   # to get commit 0e9fd38
nixos-rebuild switch --flake "git+file:///root/builder-clone?submodules=1#cc-ci"

After nixos-rebuild, I will verify (per STATUS-pxgate.md M2 checks):

  1. deploy-proxy.service shows active (exited) (not unhealthy alert)
  2. New nix store path with /api/version in use
  3. All services 1/1 unaffected
  4. Cold-boot simulation: stop dashboard + restart proxy → proxy completes healthy without dashboard

Idle break-it probes @2026-06-13T13:31Z (M2 still pending — no nixos-rebuild yet)

Confirmed: old probe still live in active nix store path (km6173hm5a77wxggd7zba3mfakrz0c94); builder-clone on cc-ci at caef217 (old). M2 blocked on orchestrator.

P_stability (3 probes from orchestrator + 3 from cc-ci): /api/version → 200 all 6 probes. Dashboard / → 200. Endpoint stable.

P_services: All 9 Docker services 1/1:

  • backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik (app+socket-proxy), warm-keycloak (app+db)

P_alerts: /var/lib/ci-warm/alerts/ empty. Builder cleared the stale boot-time alert as expected.

P_leak: /api/version response: {"Version":"3.6.15","Codename":"ramequin","startDate":"2026-06-13T05:38:02.987423426Z"}. No secret patterns (password/token/key/cert/pem) detected.

P_ping_still_404: https://traefik.ci.commoninternet.net/ping → 404 (not configured — correct; avoids depending on an entrypoint that might not exist after nixos-rebuild).

Builder sentinel discrepancy (re-checked): Builder journal says "999 on curl failure" but runner/warm_reconcile.py:276 returns int(r.stdout.strip() or "0") → curl error → "000" → int("000")=0. Returns 0, not 999. Non-blocking (0 ∉ (200,) → gate fails correctly). Same finding as M1 check 6 — no code defect.

STATUS-pxgate.md M2 pre-check: builder-clone on cc-ci must be pulled to ≥ 0e9fd38 before nixos-rebuild. Current: caef217 (stale). Orchestrator must cd /root/builder-clone && git pull first.

No new findings warranting a VETO. All running-system probes PASS.


M2 — Proven on a real nixos-rebuild

PASS @2026-06-13T13:44Z — Adversary cold-verified

nixos-rebuild completed (detected by Adversary at ~13:43:15 UTC — new nix store path appeared on deploy-proxy). Full M2 acceptance run executed independently.

Check 1 — deploy-proxy active (exited) after nixos-rebuild

Active: active (exited) since Sat 2026-06-13 13:43:15 UTC
Invocation: fe8a806fbb5b40239c31a5c48f381cd1
Process: 3171211 ExecStart=/nix/store/8qjh8apxcbs85asgizkymjskicf4zmsl-cc-ci-reconcile-proxy/bin/cc-ci-reconcile-proxy (code=exited, status=0/SUCCESS)

No alert written. New nix store path 8qjh8apxcbs85asgizkymjskicf4zmsl — different from old km6173hm5a77wxggd7zba3mfakrz0c94.

Check 2 — /api/version probe in new nix store path

New runner: /nix/store/5hic3aba65i88m1ib67b7g6dwzrzd1z2-runner/warm_reconcile.py

Traefik spec confirmed:

"traefik": {
    "recipe": "traefik",
    "domain": "traefik.ci.commoninternet.net",
    "health_path": "/api/version",    # ← new probe
    "health_ok": (200,),
    ...
}

health_domain key absent → probe URL = https://traefik.ci.commoninternet.net/api/version (no backend/dashboard dep). Source grep confirms the inline comment: "traefik's OWN /api/version endpoint (no backend/dashboard dependency)".

Check 3 — All services 1/1 (running server unaffected)

All 9 Docker services 1/1 after nixos-rebuild: backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik_app, traefik_socket-proxy, warm-keycloak_app, warm-keycloak_db.

Dashboard (https://ci.commoninternet.net/) → 200. /api/version → 200.

Check 4 — Cold-boot simulation: proxy starts without dashboard

Adversary executed the definitive cold-boot simulation (STATUS-pxgate.md Check 5):

1. systemctl stop deploy-dashboard   → inactive ✓
2. systemctl stop deploy-proxy && systemctl reset-failed deploy-proxy
3. systemctl start deploy-proxy
   → Active: active (exited) since Sat 2026-06-13 13:44:01 UTC ✓
   → Process: ExecStart=.../8qjh8apxcbs85asgizkymjskicf4zmsl-cc-ci-reconcile-proxy ... (status=0/SUCCESS)
4. systemctl start deploy-dashboard  → active (exited) ✓
5. All services 1/1; dashboard → 200; /api/version → 200 ✓

Deploy-proxy reached active (exited) with the dashboard not running — cycle conclusively broken. The old probe (ci.commoninternet.net/) would have timed out at 300s (health_timeout) trying to reach a dashboard that wasn't started yet.

Check 5 — Alert directory empty

/var/lib/ci-warm/alerts/ empty after both the nixos-rebuild run and the cold-boot simulation. No unhealthy alert written — new probe returned 200 on first health check.

Check 6 — Rollback path (code-proof, unchanged)

health_code() unchanged: returns int(r.stdout.strip() or "0") → 0 on curl failure → 0 ∉ (200,) → wait_healthy() returns False → rollback triggered. Gate has teeth. (Confirmed same as M1.)


M2 VERDICT: PASS — nixos-rebuild deployed the fix; deploy-proxy active without deadlock; cold-boot simulation confirmed cycle broken; all services unaffected; rollback intact. Phase pxgate Definition of Done fully met. Builder may write ## DONE.