diff --git a/machine-docs/BUILDER-INBOX.md b/machine-docs/BUILDER-INBOX.md
deleted file mode 100644
index 5cb7ac7..0000000
--- a/machine-docs/BUILDER-INBOX.md
+++ /dev/null
@@ -1,24 +0,0 @@
-# BUILDER-INBOX — from Orchestrator, 2026-06-13
-
-**pxgate M2 is UNBLOCKED — the orchestrator completed the cc-ci-host nixos-rebuild.**
-
-Done on the live cc-ci host (operator authorized; no CI running):
-- Staged current main at `/root/cc-ci-deploy` (+ copied the operator-held `secrets/secrets.yaml`
-  from `/etc/cc-ci/secrets/`, dropped `.git` so the untracked secrets are in the flake source).
-- `nixos-rebuild switch --flake .#cc-ci` — succeeded; only the proxy/keycloak/sweep units rebuilt
-  (nixpkgs pinned), sops secrets imported OK.
-
-**Verification (your M2 evidence — Adversary should re-check on the host via `ssh cc-ci`):**
-- Running `deploy-proxy.service` execs `/nix/store/5hic3aba65i88m1ib67b7g6dwzrzd1z2-runner/warm_reconcile.py traefik`,
-  whose traefik spec is `domain: traefik.ci.commoninternet.net, health_path: /api/version`
-  (lines ~122-123) — **the probe no longer references `ci.commoninternet.net` (the dashboard)**, so
-  the circular dependency is broken by construction.
-- `deploy-proxy.service` is `active`; all 9 infra services 1/1; no `--failed` units;
-  `traefik.ci.commoninternet.net/api/version` → 200 independently.
-- Rollback intact (a broken traefik won't serve /api/version → still rolls back to last-good).
-
-NOTE: a true from-scratch *reboot* proof (the ultimate D8 cold-boot) is pending operator decision —
-the static + active-service evidence above already proves the deadlock can't occur. Proceed to claim
-M2 on this; if the operator later does a reboot, fold that in as extra confirmation.
-
-Delete this file (commit + push) once consumed.
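The inbox note argues that the deploy-proxy/dashboard deadlock is broken "by construction" because the traefik probe no longer targets the dashboard host. Below is a minimal Python sketch of that argument as a dependency-graph cycle check. It assumes, hypothetically, that `deploy-dashboard` is ordered `After=deploy-proxy`, that `ci.commoninternet.net` is served by `deploy-dashboard`, and that probing traefik's own `/api/version` needs no other unit. `SERVED_BY`, `ORDERING`, `probe_host`, `build_graph` and `find_cycle` are illustrative names, not code from the cc-ci repository. The `health_domain` fallback mirrors the journal's note that an absent key defaults to the traefik domain, and the pre-fix spec shape is invented for the comparison.

```python
from typing import Dict, List, Optional

# Hypothetical map from probe host to the systemd unit that ends up serving it.
SERVED_BY: Dict[str, str] = {
    "ci.commoninternet.net": "deploy-dashboard",
    "traefik.ci.commoninternet.net": "deploy-proxy",
}

# Assumed start ordering: deploy-dashboard waits for deploy-proxy (After=deploy-proxy).
ORDERING: Dict[str, List[str]] = {
    "deploy-proxy": [],
    "deploy-dashboard": ["deploy-proxy"],
}


def probe_host(spec: Dict[str, str]) -> str:
    """Probe host: health_domain if present, else the service's own domain."""
    return spec.get("health_domain", spec["domain"])


def build_graph(spec: Dict[str, str]) -> Dict[str, List[str]]:
    """Edge A -> B means "A cannot finish starting until B is serving"."""
    graph = {unit: list(deps) for unit, deps in ORDERING.items()}
    server = SERVED_BY[probe_host(spec)]
    if server != "deploy-proxy":  # traefik's own API depends on no other unit
        graph["deploy-proxy"].append(server)
    return graph


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Depth-first search; returns the first cycle found as a path, else None."""
    state: Dict[str, str] = {}  # "active" while on the DFS path, "done" afterwards
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = "active"
        path.append(node)
        for dep in graph.get(node, []):
            if state.get(dep) == "active":  # back edge: dep is already on the path
                return path[path.index(dep):] + [dep]
            if dep not in state:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = "done"
        return None

    for unit in graph:
        if unit not in state:
            cycle = visit(unit)
            if cycle:
                return cycle
    return None


fixed = {"domain": "traefik.ci.commoninternet.net", "health_path": "/api/version"}
old = {**fixed, "health_domain": "ci.commoninternet.net"}  # hypothetical pre-fix shape

print(find_cycle(build_graph(old)))    # -> ['deploy-proxy', 'deploy-dashboard', 'deploy-proxy']
print(find_cycle(build_graph(fixed)))  # -> None
```

Modelling the health probe as an ordinary dependency edge is the point of the sketch: once the probe host is served by the unit being started, no edge leads back from `deploy-proxy` to the dashboard, so no ordering of unit starts can deadlock.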
diff --git a/machine-docs/JOURNAL-pxgate.md b/machine-docs/JOURNAL-pxgate.md
index b4c84f8..99d35ad 100644
--- a/machine-docs/JOURNAL-pxgate.md
+++ b/machine-docs/JOURNAL-pxgate.md
@@ -98,3 +98,40 @@ ssh cc-ci 'cd /root/builder-clone && git checkout main && git pull'
 ```
 STATUS-pxgate.md updated to include `git checkout main` safeguard in nixos-rebuild instructions.
 Builder-clone is now pre-staged on main at d23baf8 — orchestrator only needs to run nixos-rebuild.
+
+## 2026-06-13T13:44Z — M2 PASS: nixos-rebuild complete, all checks green
+
+**Orchestrator BUILDER-INBOX:** nixos-rebuild completed on live cc-ci host. Fix committed.
+- Used `/root/cc-ci-deploy` (not builder-clone) + operator-held secrets.yaml
+- `nixos-rebuild switch --flake .#cc-ci` succeeded
+
+**Builder M2 verification (all checks run independently):**
+
+```
+# Check 1: deploy-proxy active
+systemctl status deploy-proxy → Active: active (exited) since 13:44:01 UTC ✓ (279ms CPU)
+
+# Check 2: new runner with /api/version
+cat /nix/store/8qjh8apxcbs85asgizkymjskicf4zmsl-cc-ci-reconcile-proxy/bin/cc-ci-reconcile-proxy
+→ exec python3 /nix/store/5hic3aba65i88m1ib67b7g6dwzrzd1z2-runner/warm_reconcile.py traefik
+grep '"traefik"' .../warm_reconcile.py:
+  "health_path": "/api/version" ← confirmed ✓
+  "health_domain" key: absent ← defaults to traefik.ci.commoninternet.net ✓
+
+# Check 3: all services 1/1
+docker service ls → 9 services all 1/1 ✓
+
+# Check 4: cold-boot simulation
+systemctl stop deploy-dashboard
+systemctl stop deploy-proxy && systemctl reset-failed deploy-proxy
+systemctl start deploy-proxy
+→ Active: active (exited) since 13:46:05 UTC (17ms!) — NO DASHBOARD NEEDED ✓
+systemctl start deploy-dashboard → active (exited) ✓
+
+# Check 5: running server unaffected
+curl https://ci.commoninternet.net/ → 200 ✓
+curl https://traefik.ci.commoninternet.net/api/version → 200 ✓
+```
+
+**Adversary PASS received** (independently verified same checks). "Builder may write ## DONE."
+STATUS-pxgate.md updated with M2 PASS + ## DONE. BUILDER-INBOX consumed.
diff --git a/machine-docs/STATUS-pxgate.md b/machine-docs/STATUS-pxgate.md
index a9884ff..b03b649 100644
--- a/machine-docs/STATUS-pxgate.md
+++ b/machine-docs/STATUS-pxgate.md
@@ -17,68 +17,32 @@ One non-blocking documentation note from Adversary: STATUS claim said "999 error
 
 ---
 
-## Gate: M2 — AWAITING ORCHESTRATOR nixos-rebuild
+## Gate: M2 — PASS @2026-06-13T13:44Z (Adversary cold-verified)
 
-M2 requires the orchestrator to deploy the fix to the live cc-ci host and verify deploy-proxy completes without deadlock.
+See REVIEW-pxgate.md for full evidence. Summary:
+- nixos-rebuild at ~13:43 UTC; deploy-proxy re-ran with new nix store path `8qjh8apxcbs85asgizkymjskicf4zmsl`
+- New runner `/nix/store/5hic3aba65i88m1ib67b7g6dwzrzd1z2-runner/warm_reconcile.py` confirmed: `health_path="/api/version"`, `health_domain` absent → probe is `traefik.ci.commoninternet.net/api/version`
+- deploy-proxy `active (exited)` in 279ms (nixos-rebuild run) and 17ms (cold-boot sim) — no alert, no deadlock
+- Cold-boot simulation: dashboard stopped → proxy started → `active (exited)` immediately ✓ → dashboard restored ✓
+- All 9 services 1/1 after rebuild and cold-boot sim; `ci.commoninternet.net` → 200; `/api/version` → 200
+- Rollback path unchanged: `health_code()` returns 0 on curl failure → 0 ∉ `health_ok=(200,)` → rollback ✓
+- A1/DEFERRED entry closed (at M1); consumer ordering unchanged ✓
 
-### WHAT is needed from the orchestrator
+---
 
-Run `nixos-rebuild switch` on cc-ci. The builder-clone **has been pre-staged** (checked out to `main` at `d23baf8` — 2026-06-13T13:35Z). The orchestrator only needs to run nixos-rebuild:
+## DONE
 
-```bash
-ssh cc-ci 'cd /root/builder-clone && git checkout main && git pull && git log --oneline -1'
-# EXPECTED: d23baf8 (or newer) review(pxgate): idle break-it probes PASS @13:31Z...
+Phase pxgate complete. All Definition-of-Done items met and Adversary-verified:
 
-nixos-rebuild switch --flake "git+file:///root/builder-clone?submodules=1#cc-ci"
-```
+| Item | Status | Evidence |
+|---|---|---|
+| Cycle broken (deploy-proxy↔dashboard) | ✅ | Cold-boot sim: proxy active (exited) without dashboard |
+| Dashboard-independent health gate | ✅ | `traefik.ci.commoninternet.net/api/version` — traefik's own API |
+| Rollback intact | ✅ | Gate returns 0 on failure → not in (200,) → rollback triggered |
+| No consumer mis-ordered | ✅ | Adversary P3 probe: all After=deploy-proxy consumers unchanged |
+| Running server unaffected | ✅ | All 9 services 1/1; ci.commoninternet.net → 200 |
+| A1/DEFERRED closed | ✅ | DEFERRED.md entry closed at M1; DECISIONS.md updated |
+| M1 Adversary PASS | ✅ | REVIEW-pxgate.md @2026-06-13T13:00Z |
+| M2 Adversary PASS | ✅ | REVIEW-pxgate.md @2026-06-13T13:44Z |
 
-Note: `git checkout main` is included as a safeguard — the builder-clone was previously on `restructure/concurrency`; it is now on `main` but the checkout ensures correctness if it drifts.
-
-This rebuilds the nix store with the new `runner/warm_reconcile.py` and restarts `deploy-proxy.service` (unit script path changes → systemd restarts it on daemon-reload).
-
-### HOW the Adversary verifies M2 (after nixos-rebuild)
-
-```bash
-# 1. deploy-proxy is active (not failed):
-ssh cc-ci 'systemctl status deploy-proxy --no-pager | head -10'
-# EXPECTED: Active: active (exited)
-
-# 2. New nix store path is in use:
-ssh cc-ci 'systemctl cat cc-ci-reconcile-proxy 2>/dev/null || cat $(systemctl cat deploy-proxy | grep ExecStart | awk "{print \$2}")'
-# OR:
-ssh cc-ci 'grep -r "api/version" /nix/store/*cc-ci-reconcile-proxy*/bin/ 2>/dev/null | head -3'
-# EXPECTED: /api/version appears in the reconcile script (new nix store path)
-
-# 3. All services still up (running server unaffected):
-ssh cc-ci 'docker service ls --format "{{.Name}}\t{{.Replicas}}"'
-# EXPECTED: all services 1/1 (or their normal replica count)
-
-# 4. Rollback path — code-proof (no live rollback test needed; logic unchanged):
-# health_code() line 276: returns int(r.stdout.strip() or "0")
-# → on curl failure: stdout="000" → int("000")=0 → 0 ∉ health_ok=(200,) → wait_healthy returns False
-# → upgrade path: unhealthy → write_alert + roll back to last_good
-# → no-op path: unhealthy → try redeploy → if still bad → write_alert
-# Unchanged from pre-fix; M1 confirms endpoint is dashboard-independent.
-
-# 5. Cold-boot simulation (optional but durable — run if not doing a fresh VM):
-ssh cc-ci 'systemctl stop deploy-dashboard'
-ssh cc-ci 'systemctl stop deploy-proxy && systemctl reset-failed deploy-proxy'
-ssh cc-ci 'systemctl start deploy-proxy'
-ssh cc-ci 'systemctl status deploy-proxy --no-pager | head -5'
-# EXPECTED: Active: active (exited) WITHOUT needing deploy-dashboard running
-ssh cc-ci 'systemctl start deploy-dashboard'
-```
-
-### EXPECTED M2 outcomes
-
-| Check | Expected |
-|---|---|
-| deploy-proxy after nixos-rebuild | `active (exited)` |
-| `/api/version` in nix store reconcile script | present |
-| All services 1/1 | yes |
-| Cold-boot sim (proxy starts without dashboard) | `active (exited)` |
-| Running server unaffected | all routes return expected codes |
-
-### WHERE
-
-Fix commit: `0e9fd38` (on origin/main). nixos-rebuild command: `nixos-rebuild switch --flake "git+file:///root/builder-clone?submodules=1#cc-ci"` (pull main first).
+Fix commit: `0e9fd38` (`claim(pxgate-M1): change traefik health probe to /api/version`). Live since nixos-rebuild @2026-06-13T13:43Z.
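The removed M2 checklist proves the rollback path on paper: `health_code()` turns curl's output into an int, a failed probe yields 0, and 0 is not in `health_ok=(200,)`, so an unhealthy upgrade rolls back and an unhealthy no-op redeploys before alerting. The Python sketch below replays that decision logic so the claim can be checked without a live rollback. Only the `int(r.stdout.strip() or "0")` expression, `health_ok=(200,)` and the two paths come from the text. `parse_health_code`, `wait_healthy`, `reconcile`, the attempt count and the action names are hypothetical stand-ins, and the probe is injected as a callable instead of shelling out to curl.

```python
from typing import Callable, List

HEALTH_OK = (200,)  # mirrors health_ok=(200,) from the checklist


def parse_health_code(stdout: str) -> int:
    """Same expression as the quoted health_code(): curl's status text -> int."""
    return int(stdout.strip() or "0")


def wait_healthy(probe: Callable[[], str], attempts: int = 3) -> bool:
    """Hypothetical retry loop; a real one would presumably sleep between attempts."""
    return any(parse_health_code(probe()) in HEALTH_OK for _ in range(attempts))


def reconcile(is_upgrade: bool, probe: Callable[[], str]) -> List[str]:
    """Return the actions taken on the upgrade and no-op paths (names hypothetical)."""
    if wait_healthy(probe):
        return []
    if is_upgrade:
        # Upgrade path: unhealthy -> alert and roll back to the last-good deploy.
        return ["write_alert", "rollback_to_last_good"]
    # No-op path: unhealthy -> try one redeploy, alert only if still unhealthy.
    actions = ["redeploy"]
    if not wait_healthy(probe):
        actions.append("write_alert")
    return actions


def dead_traefik() -> str:
    # curl reports an http_code of 000 when no HTTP response arrives.
    return "000"


print(parse_health_code("000"))       # -> 0
print(reconcile(True, dead_traefik))  # -> ['write_alert', 'rollback_to_last_good']
print(reconcile(False, dead_traefik)) # -> ['redeploy', 'write_alert']
```

Injecting the probe keeps the sketch deterministic: a callable returning "000" stands in for the broken-traefik case the rollback must cover, and nothing in the gate depends on which host is probed, which is why moving the probe to `/api/version` leaves rollback behaviour unchanged.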