# Phase `pvcheck`: post-proxy verification and regression proof

**Mission:** prove that the durable `proxy` overlay fix is actually safe in production:
the network has the intended headroom, routing works, real recipe CI still deploys through
Traefik, and the IPAM/VIP exhaustion signature no longer threatens the weekly upgrade path.

State files live under `machine-docs/`: `STATUS-pvcheck.md`, `BACKLOG-pvcheck.md`,
`REVIEW-pvcheck.md`, `JOURNAL-pvcheck.md`.

## Preconditions

- Phase `pvfix` is `## DONE`.
- `docker network inspect proxy` shows the intended `/16` subnet.
- Core control-plane services are back after the proxy recreation.
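The subnet precondition can be checked mechanically rather than by eye. The following is a minimal Python sketch, assuming the JSON array printed by `docker network inspect proxy` has been captured into a string. The helper name `subnet_prefixes` and the sample address range are hypothetical and not part of this plan.

```python
import ipaddress
import json


def subnet_prefixes(inspect_json: str) -> list[int]:
    """Return the prefix length of every IPAM subnet in `docker network inspect` output."""
    networks = json.loads(inspect_json)  # the CLI prints a JSON array, one object per network
    prefixes = []
    for net in networks:
        # IPAM or Config may be missing or null, so fall back to empty containers.
        configs = (net.get("IPAM") or {}).get("Config") or []
        for cfg in configs:
            subnet = cfg.get("Subnet")
            if subnet:
                prefixes.append(ipaddress.ip_network(subnet).prefixlen)
    return prefixes


# Hypothetical sample; the real subnet comes from the live host.
SAMPLE = '[{"Name": "proxy", "IPAM": {"Config": [{"Subnet": "10.200.0.0/16"}]}}]'
print(subnet_prefixes(SAMPLE) == [16])
```

A result other than `[16]` would mean the precondition is not met and the phase should not start.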
## Verification Scope

1. **Host/network facts.** Capture and record:
   - `docker network inspect proxy` subnet and endpoint count
   - `docker stack ls`
   - Traefik, Drone, bridge, dashboard, and report service health
   - recent dockerd journal lines for VIP/IPAM errors
2. **Routing checks.** Verify externally visible routes still work:
   - Drone UI/API route
   - dashboard route
   - bridge/poller health if exposed locally
   - report site route
3. **Real deploy proof.** Trigger one low-risk enrolled-recipe `!testme` run, or an
   equivalent harness run, that joins `proxy`, completes all expected tiers, and tears down
   cleanly. Prefer a small stable recipe unless `cfold` needs a broader sweep at the same
   time. Do not duplicate an active `cfold` sweep.
4. **Allocator-headroom proof.** Run a bounded reproduction derived from
   `plan-proxy-vip-exhaustion-fix.md`:
   - deploy/remove a small batch of throwaway published-port stacks, preferably in the same
     concurrent pattern that previously leaked endpoints
   - confirm the leaked endpoint count, if any, is negligible relative to the `/16`
     headroom (65,536 addresses)
   - confirm no fresh `could not find an available IP while allocating VIP` errors
   - prune throwaway networks/stacks and verify no residue
5. **Upgrade safety check.** Confirm the `/upgrade-all` Step-0 guard still exists and would
   detect and recover from the known VIP exhaustion signature if it ever recurs.
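Items 1 and 4 above reduce to two small computations: counting fresh occurrences of the exhaustion signature in captured log lines, and expressing leaked endpoints as a fraction of the `/16` address space. The following is a minimal Python sketch, assuming the dockerd journal lines have already been captured as strings (for example from `journalctl -u docker`). The function names and sample inputs are hypothetical, and no pass/fail threshold is implied because this plan does not define one.

```python
import ipaddress

# Error text quoted in the allocator-headroom item of this plan.
VIP_SIGNATURE = "could not find an available IP while allocating VIP"


def count_vip_errors(log_lines):
    """Count captured log lines containing the VIP exhaustion signature."""
    return sum(1 for line in log_lines if VIP_SIGNATURE in line)


def leaked_fraction(subnet, leaked_endpoints):
    """Return leaked endpoints as a fraction of all addresses in the subnet."""
    total = ipaddress.ip_network(subnet).num_addresses
    return leaked_endpoints / total


# Hypothetical inputs for illustration only.
sample_log = [
    'level=info msg="network created"',
    'level=error msg="could not find an available IP while allocating VIP"',
]
print(count_vip_errors(sample_log))
print(f"{leaked_fraction('10.200.0.0/16', 12):.4%}")
```

Recording both numbers in `STATUS-pvcheck.md` gives the Adversary something concrete to re-derive from live commands.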
## Gates

**M1: Control plane and routing verified.** All cc-ci control-plane routes/services are
healthy after the proxy recreation, with before/after evidence in `STATUS-pvcheck.md`.
Adversary verifies independently from live commands, not just Builder notes.

**M2: Real CI and allocator proof verified.** At least one real recipe deploy/test passes
through `proxy` and tears down cleanly; bounded allocator reproduction does not threaten the
new `/16`; no VIP exhaustion signature remains in fresh logs. Adversary verifies all claims
and checks for leaks.

## Guardrails

- Do not run a large recipe sweep here if `cfold` already owns that proof. This phase is the
  proxy-specific post-change proof.
- Keep concurrency bounded. The point is to prove headroom, not stress the host into a new
  unrelated failure.
- Clean up every throwaway stack/network. Zero residue is part of the acceptance criteria.
- If any core route is down, stop new test traffic and fix routing first.
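The zero-residue guardrail amounts to a before/after set comparison. The following is a minimal Python sketch, assuming stack and network names were captured one per line before and after the run (for instance from `docker stack ls --format '{{.Name}}'` and `docker network ls --format '{{.Name}}'`). The helper name `residue` and the sample names are hypothetical.

```python
def residue(before: str, after: str) -> set[str]:
    """Return names present after the run that were absent before it."""
    before_names = {line.strip() for line in before.splitlines() if line.strip()}
    after_names = {line.strip() for line in after.splitlines() if line.strip()}
    return after_names - before_names


# Hypothetical captures; real ones come from the live host.
stacks_before = "traefik\ndrone\n"
stacks_after = "traefik\ndrone\nthrowaway-3\n"
leftover = residue(stacks_before, stacks_after)
print(sorted(leftover))
```

An empty result for both stacks and networks is what the guardrail requires; anything else should be pruned and re-checked.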
## Definition of Done

Control-plane routes are healthy, one real proxy-joining recipe CI run succeeds and cleans
up, bounded allocator reproduction is documented, fresh logs show no VIP exhaustion, and
Adversary has signed off on M1 and M2 in `machine-docs/REVIEW-pvcheck.md`. Builder writes
`## DONE` only after both gates have fresh Adversary PASSes.