cc-ci-orchestrator/cc-ci-plan/plan-phase-pvcheck-post-proxy-verification.md
2026-06-12 15:56:03 +00:00


# Phase `pvcheck` — post-proxy verification and regression proof
**Mission:** prove that the durable `proxy` overlay fix is actually safe in production:
the network has the intended headroom, routing works, real recipe CI still deploys through
Traefik, and the IPAM/VIP exhaustion signature no longer threatens the weekly upgrade path.
State files live under `machine-docs/`: `STATUS-pvcheck.md`, `BACKLOG-pvcheck.md`,
`REVIEW-pvcheck.md`, `JOURNAL-pvcheck.md`.
## Preconditions
- Phase `pvfix` is `## DONE`.
- `docker network inspect proxy` shows the intended `/16` subnet.
- Core control-plane services are back after the proxy recreation.
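
The subnet precondition can be checked mechanically. The following is a minimal sketch, assuming the JSON printed by `docker network inspect proxy` is passed in as a string; the helper names and the sample subnet address are hypothetical and not part of this plan.

```python
import ipaddress
import json


def proxy_prefix_lengths(inspect_json: str) -> list[int]:
    """Return the prefix length of every IPAM subnet in `docker network inspect` output."""
    networks = json.loads(inspect_json)  # the CLI prints a JSON array, one object per network
    prefixes = []
    for network in networks:
        ipam = network.get("IPAM") or {}
        for cfg in ipam.get("Config") or []:
            subnet = cfg.get("Subnet")
            if subnet:
                prefixes.append(ipaddress.ip_network(subnet, strict=False).prefixlen)
    return prefixes


def has_expected_prefix(inspect_json: str, expected: int = 16) -> bool:
    """True only if at least one subnet exists and every subnet has the expected prefix."""
    prefixes = proxy_prefix_lengths(inspect_json)
    return bool(prefixes) and all(p == expected for p in prefixes)


# Hypothetical sample: the real subnet address is not stated in this plan.
SAMPLE = (
    '[{"Name": "proxy", "IPAM": {"Config": '
    '[{"Subnet": "10.20.0.0/16", "Gateway": "10.20.0.1"}]}}]'
)
print(has_expected_prefix(SAMPLE))
```

An empty or missing IPAM config is treated as a failed precondition rather than a pass, so a half-recreated network does not slip through.
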
## Verification Scope
1. **Host/network facts.** Capture and record:
   - `docker network inspect proxy` subnet and endpoint count
   - `docker stack ls`
   - Traefik, Drone, bridge, dashboard, and report service health
   - recent dockerd journal lines for VIP/IPAM errors
2. **Routing checks.** Verify externally visible routes still work:
   - Drone UI/API route
   - dashboard route
   - bridge/poller health if exposed locally
   - report site route
3. **Real deploy proof.** Trigger one low-risk `!testme` run on an enrolled recipe, or an
equivalent harness run, that joins `proxy`, completes all expected tiers, and tears down
cleanly. Prefer a small, stable recipe unless `cfold` needs a broader sweep at the same time.
Do not duplicate an active `cfold` sweep.
4. **Allocator-headroom proof.** Run a bounded reproduction derived from
`plan-proxy-vip-exhaustion-fix.md`:
   - deploy/remove a small batch of throwaway published-port stacks, preferably in the same
     concurrent pattern that previously leaked endpoints
   - confirm leaked endpoint count, if any, is tiny relative to `/16` headroom
   - confirm no fresh `could not find an available IP while allocating VIP` errors
   - prune throwaway networks/stacks and verify no residue
5. **Upgrade safety check.** Confirm the `/upgrade-all` Step-0 guard still exists and would
detect and recover from the known VIP exhaustion signature if it ever recurs.
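
The log checks in steps 1 and 4 amount to searching recent dockerd journal lines for the exhaustion signature. A minimal sketch follows, assuming the lines have already been captured (for example from `journalctl -u docker`, where the unit name is an assumption about this host); the function name and the sample lines are hypothetical.

```python
# Signature string quoted from the allocator-headroom step of this plan.
VIP_SIGNATURE = "could not find an available IP while allocating VIP"


def find_vip_errors(journal_lines):
    """Return the journal lines that contain the VIP exhaustion signature."""
    needle = VIP_SIGNATURE.lower()
    return [line for line in journal_lines if needle in line.lower()]


# Hypothetical journal excerpt, for illustration only.
SAMPLE_LINES = [
    "dockerd[812]: level=info msg=\"initialized VXLAN UDP port to 4789\"",
    "dockerd[812]: level=error msg=\"could not find an available IP while allocating VIP\"",
    "dockerd[812]: level=info msg=\"worker ready\"",
]

hits = find_vip_errors(SAMPLE_LINES)
print(f"{len(hits)} line(s) matched the VIP exhaustion signature")
```

Only lines logged after the proxy recreation count as fresh evidence, so the capture window should start at or after that point.
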
## Gates
**M1 — Control plane and routing verified.** All cc-ci control-plane routes/services are
healthy after the proxy recreation, with before/after evidence in `STATUS-pvcheck.md`.
Adversary verifies independently from live commands, not just Builder notes.
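
For M1, route evidence can be gathered with a small probe. This is a sketch under stated assumptions: the route names and URLs are hypothetical placeholders, and treating any 2xx or 3xx status as healthy is an assumption that may need adjusting for routes that require authentication.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET, or 0 if the request could not complete."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, and similar


def failing_routes(statuses: dict[str, int]) -> list[str]:
    """Return the sorted names of routes whose status is outside 200-399."""
    return sorted(name for name, code in statuses.items() if not 200 <= code < 400)


# Hypothetical usage; the URLs are placeholders, not real cc-ci endpoints:
#   routes = {"drone": "https://drone.example.invalid/healthz"}
#   statuses = {name: probe(url) for name, url in routes.items()}
#   print(failing_routes(statuses))
```

Recording the raw status codes rather than a pass/fail flag keeps the before/after evidence comparable when Adversary reruns the probes.
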
**M2 — Real CI and allocator proof verified.** At least one real recipe deploy/test passes
through `proxy` and tears down cleanly; the bounded allocator reproduction does not threaten the
new `/16` headroom; no VIP exhaustion signature appears in fresh logs. Adversary verifies all claims
and checks for leaks.
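
The headroom claim in M2 reduces to simple arithmetic on the subnet size. The sketch below assumes the endpoint and leak counts have already been measured; the 1% leak threshold, the subtraction of two reserved addresses, and the sample subnet are illustrative assumptions, not values defined by this plan.

```python
import ipaddress


def headroom_report(subnet: str, endpoints_in_use: int, leaked: int,
                    max_leak_fraction: float = 0.01) -> dict:
    """Compare measured endpoint usage against the address capacity of a subnet."""
    # Approximation: exclude the network and broadcast addresses. The overlay
    # allocator may reserve a few more (e.g. a gateway), so this is an upper bound.
    capacity = ipaddress.ip_network(subnet).num_addresses - 2
    free = capacity - endpoints_in_use
    leak_fraction = leaked / capacity
    return {
        "capacity": capacity,
        "free": free,
        "leak_fraction": leak_fraction,
        "ok": free > 0 and leak_fraction <= max_leak_fraction,
    }


# Hypothetical measurements: 40 endpoints in use, 3 leaked after the reproduction.
report = headroom_report("10.20.0.0/16", endpoints_in_use=40, leaked=3)
print(report)
```

Making the threshold an explicit parameter forces the reviewer to state what "does not threaten the new `/16`" means numerically instead of leaving it to judgment.
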
## Guardrails
- Do not run a large recipe sweep here if `cfold` already owns that proof. This phase is the
proxy-specific post-change proof.
- Keep concurrency bounded. The point is to prove headroom, not stress the host into a new
unrelated failure.
- Clean up every throwaway stack/network. Zero residue is part of the acceptance criteria.
- If any core route is down, stop new test traffic and fix routing first.
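
The zero-residue guardrail can be checked by diffing stack and network names captured before and after the reproduction. A minimal sketch, assuming the names come from `docker stack ls --format '{{.Name}}'` and `docker network ls --format '{{.Name}}'` output saved as text; the helper names and the sample stack names are hypothetical.

```python
def parse_names(cli_output: str) -> set[str]:
    """Turn one-name-per-line CLI output into a set, ignoring blank lines."""
    return {line.strip() for line in cli_output.splitlines() if line.strip()}


def residue(before: set[str], after: set[str]) -> set[str]:
    """Return names present after the test that were not present before it."""
    return after - before


# Hypothetical snapshots, for illustration only.
BEFORE = parse_names("traefik\ndrone\ndashboard\n")
AFTER = parse_names("traefik\ndrone\ndashboard\ntmp-pv-3\n")

leftover = residue(BEFORE, AFTER)
print(sorted(leftover))
```

The diff is one-directional on purpose: a name that disappeared during the test is a different failure (a core service went away) and should be handled by the routing checks, not by cleanup.
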
## Definition of Done
Control-plane routes are healthy, one real proxy-joining recipe CI run succeeds and cleans
up, bounded allocator reproduction is documented, fresh logs show no VIP exhaustion, and
Adversary has signed off on M1 and M2 in `machine-docs/REVIEW-pvcheck.md`. Builder writes
`## DONE` only after both gates have fresh Adversary PASSes.