cc-ci-orchestrator/cc-ci-plan/plan-phase-pvcheck-post-proxy-verification.md
2026-06-12 15:56:03 +00:00

Phase pvcheck — post-proxy verification and regression proof

Mission: prove that the durable proxy overlay fix is safe in production. That means the network has the intended headroom, routing works, real recipe CI still deploys through Traefik, and the IPAM/VIP exhaustion signature no longer threatens the weekly upgrade path.

State files live under machine-docs/: STATUS-pvcheck.md, BACKLOG-pvcheck.md, REVIEW-pvcheck.md, JOURNAL-pvcheck.md.

Preconditions

  • Phase pvfix is ## DONE.
  • docker network inspect proxy shows the intended /16 subnet.
  • Core control-plane services are back after the proxy recreation.
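
The subnet precondition can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the JSON printed by docker network inspect proxy has been captured into a string. The helper name proxy_subnet_prefixes and the sample payload (including its address range) are hypothetical; the only part of the inspect output relied on is the IPAM.Config[].Subnet field.

```python
import ipaddress
import json


def proxy_subnet_prefixes(inspect_json: str) -> list[int]:
    """Return the prefix length of every IPAM subnet in `docker network inspect` output."""
    networks = json.loads(inspect_json)  # the CLI prints a JSON array, one object per network
    prefixes = []
    for net in networks:
        # IPAM or Config may be missing or null; treat both as "no subnets".
        for cfg in (net.get("IPAM") or {}).get("Config") or []:
            subnet = cfg.get("Subnet")
            if subnet:
                prefixes.append(ipaddress.ip_network(subnet).prefixlen)
    return prefixes


# Hypothetical sample payload; the address range is made up for illustration.
sample = '[{"Name": "proxy", "IPAM": {"Config": [{"Subnet": "10.100.0.0/16"}]}}]'
print(proxy_subnet_prefixes(sample) == [16])
```

A result other than a single 16 would suggest the precondition is not met and the phase should not start.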

Verification Scope

  1. Host/network facts. Capture and record:
    • docker network inspect proxy subnet and endpoint count
    • docker stack ls
    • Traefik, Drone, bridge, dashboard, and report service health
    • recent dockerd journal lines for VIP/IPAM errors
  2. Routing checks. Verify externally visible routes still work:
    • Drone UI/API route
    • dashboard route
    • bridge/poller health if exposed locally
    • report site route
  3. Real deploy proof. Trigger one !testme (or equivalent harness run) on a low-risk enrolled recipe; the run must join proxy, complete all expected tiers, and tear down cleanly. Prefer a small stable recipe unless cfold needs a broader sweep at the same time. Do not duplicate an active cfold sweep.
  4. Allocator-headroom proof. Run a bounded reproduction derived from plan-proxy-vip-exhaustion-fix.md:
    • deploy/remove a small batch of throwaway published-port stacks, preferably in the same concurrent pattern that previously leaked endpoints
    • confirm leaked endpoint count, if any, is tiny relative to /16 headroom
    • confirm no fresh "could not find an available IP while allocating VIP" errors
    • prune throwaway networks/stacks and verify no residue
  5. Upgrade safety check. Confirm the /upgrade-all Step-0 guard still exists and would detect/recover the known VIP exhaustion signature if it ever recurs.
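
The allocator-headroom proof in step 4 comes down to two measurements: how many fresh log lines carry the VIP exhaustion signature, and how large any leaked endpoint count is relative to the /16. The sketch below shows one way to compute both from already captured data. The function names, the placeholder log lines, the leak count of 12, and the address range are all hypothetical; only the error text is taken from this plan.

```python
import ipaddress

# Error text quoted from this plan; everything else here is illustrative.
VIP_SIGNATURE = "could not find an available IP while allocating VIP"


def count_vip_errors(journal_lines):
    """Count log lines that contain the VIP exhaustion signature."""
    return sum(1 for line in journal_lines if VIP_SIGNATURE in line)


def leak_fraction(leaked_endpoints, subnet):
    """Return leaked endpoints as a fraction of the subnet's total address count."""
    total = ipaddress.ip_network(subnet).num_addresses
    return leaked_endpoints / total


# Placeholder inputs, not real dockerd output.
lines = [
    "placeholder line without the signature",
    "placeholder prefix: " + VIP_SIGNATURE,
]
print(count_vip_errors(lines))
print(f"{leak_fraction(12, '10.100.0.0/16'):.4%}")
```

A /16 holds 65536 addresses, so a leak of a few dozen endpoints stays far below one percent of the pool. What counts as "tiny" is a judgment the phase should record explicitly in STATUS-pvcheck.md.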

Gates

M1 — Control plane and routing verified. All cc-ci control-plane routes/services are healthy after the proxy recreation, with before/after evidence in STATUS-pvcheck.md. Adversary verifies independently from live commands, not just Builder notes.

M2 — Real CI and allocator proof verified. At least one real recipe deploy/test passes through proxy and tears down cleanly; bounded allocator reproduction does not threaten the new /16; no VIP exhaustion signature remains in fresh logs. Adversary verifies all claims and checks for leaks.

Guardrails

  • Do not run a large recipe sweep here if cfold already owns that proof. This phase is the proxy-specific post-change proof.
  • Keep concurrency bounded. The point is to prove headroom, not stress the host into a new unrelated failure.
  • Clean up every throwaway stack/network. Zero residue is part of the acceptance criteria.
  • If any core route is down, stop new test traffic and fix routing first.
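
The zero-residue guardrail can be expressed as a set difference between names captured before and after the throwaway run. This is a minimal sketch, assuming the stack or network names have already been collected into lists (for example from docker stack ls); the function name residue and the sample names are hypothetical.

```python
def residue(before, after):
    """Return names present after the run that were not present before it."""
    return sorted(set(after) - set(before))


# Hypothetical stack names for illustration only.
before = ["traefik", "drone"]
after = ["traefik", "drone", "throwaway-1"]
print(residue(before, after))  # prints ['throwaway-1']
```

An empty list is the passing case; any name returned here would need to be removed before the gate can pass.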

Definition of Done

Control-plane routes are healthy, one real proxy-joining recipe CI run succeeds and cleans up, bounded allocator reproduction is documented, fresh logs show no VIP exhaustion, and Adversary has signed off on M1 and M2 in machine-docs/REVIEW-pvcheck.md. Builder writes ## DONE only after both gates have fresh Adversary PASSes.