cc-ci/machine-docs/REVIEW-pvcheck.md
autonomic-bot 17cf4d249f
review(pvcheck-M1): M1 PASS — control plane and routing verified cold
Cold verify 2026-06-13T06:10Z: proxy 10.10.0.0/16 with 7 endpoints confirmed,
all 9 services 1/1, ci=200/drone=303/report=200, zero VIP exhaustion since
05:38Z, swarm.nix e6349a9 confirmed, Step-0 guard text updated in 84e13a7.
[A2] closed — stale description fix confirmed in orchestrator.
2026-06-13 06:01:26 +00:00

# REVIEW — phase pvcheck (post-proxy verification)
Adversary-owned. Append-only verdicts. All commands run cold from /srv/cc-ci-orch/cc-ci-adv (own clone).
---
## Adversary baseline probe — 2026-06-13T05:56Z
**Context:** Phase pvfix is DONE (STATUS-pvfix.md ## DONE). pvcheck preconditions verified cold.
### Precondition checks
| Check | Result |
|---|---|
| pvfix DONE | ✅ STATUS-pvfix.md shows `## DONE`, both M1+M2 PASS |
| `proxy` subnet | ✅ `10.10.0.0/16` (docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}") |
| `proxy` IPAM driver | ✅ default, gateway 10.10.0.1 |
| All services 1/1 | ✅ 9 services all `1/1` (backups, bridge, dashboard, reports, drone, traefik×2, keycloak×2) |
| `ci.commoninternet.net` | ✅ HTTP/2 200 |
| `drone.ci.commoninternet.net` | ✅ HTTP/2 303 |
| `report.ci.commoninternet.net` | ✅ HTTP/2 200 |
| VIP exhaustion after 05:38Z | ✅ NONE: `journalctl -u docker --since "2026-06-13 05:38:00" \| grep "available IP while allocating VIP"` → empty |
| Transient errors at 05:35Z | "could not find network allocator STATE" for OLD net IDs (mlxau8…, 85p3aq…) — these are expected during proxy recreation (swarm allocator losing state for the deleted /24 network) |
| No new VIP exhaustion | ✅ post-fix journal clean |
**Command evidence:**
```
$ docker network inspect proxy --format "{{json .IPAM}}"
{"Driver":"default","Options":null,"Config":[{"Subnet":"10.10.0.0/16","Gateway":"10.10.0.1"}]}
$ docker service ls --format "{{.Name}}\t{{.Replicas}}"
backups_ci_commoninternet_net_app 1/1
ccci-bridge_app 1/1
ccci-dashboard_app 1/1
ccci-reports_app 1/1
drone_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_socket-proxy 1/1
warm-keycloak_ci_commoninternet_net_app 1/1
warm-keycloak_ci_commoninternet_net_db 1/1
```
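The "all services 1/1" check can also be done mechanically instead of by eye. The following is a minimal sketch, not part of the Adversary's recorded commands: the function name `unconverged_services` is hypothetical, and it assumes the `NAME<TAB>CURRENT/DESIRED` output produced by the `--format` string shown above.

```shell
# unconverged_services: reads "NAME CURRENT/DESIRED" lines on stdin
# (whitespace-separated, as printed by the --format string above).
# Prints the name of every service whose current replica count differs
# from its desired count, and returns non-zero if any such service exists.
unconverged_services() {
  awk '
    NF >= 2 {
      split($2, r, "/")                 # r[1] = current, r[2] = desired
      if (r[1] != r[2]) { print $1; bad = 1 }
    }
    END { exit (bad ? 1 : 0) }
  '
}
```

Usage, under the same assumptions: `docker service ls --format "{{.Name}}\t{{.Replicas}}" | unconverged_services && echo "all converged"`. An empty output with exit status 0 corresponds to the "9 services all `1/1`" row.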
### Upgrade-all Step-0 guard — independent check
**Guard location:** `/srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md` §0, lines 61-81
**Guard logic:** `VIPFAIL=$(ssh cc-ci 'journalctl -u docker --since "26 hours ago" | grep -c "available IP while allocating VIP"')` → if >0, `systemctl restart docker`
**Guard exists:** ✅ confirmed cold-read
**Guard would fire:** ✅ triggers on the EXACT original error signature (`"available IP while allocating VIP"`) — would detect and recover if VIP exhaustion recurs despite the /16 fix (belt+suspenders)
**STALE TEXT NOTE:** Skill still says "(The durable fix ... is tracked in plan-proxy-vip-exhaustion-fix.md; this guard is the per-run safety net until that lands.)" — but the durable fix HAS now landed. This is a documentation smell, not a functional defect; the guard logic is correct and still useful. Filing as advisory finding [A2].
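For readers without access to SKILL.md, the guard's decision step can be sketched as below. This is an illustration under stated assumptions, not the skill's actual text: the function names `count_vip_failures` and `guard_decision` are hypothetical, and only the error signature and the "restart if the count is above zero" rule come from the guard logic quoted above.

```shell
# Error signature quoted in this review; the rest of the sketch is illustrative.
VIP_SIGNATURE='available IP while allocating VIP'

# count_vip_failures: counts journal lines on stdin containing the signature.
# grep -c prints 0 but exits with status 1 when nothing matches, so "|| true"
# keeps the function usable under "set -e" while still printing the count.
count_vip_failures() {
  grep -c -- "$VIP_SIGNATURE" || true
}

# guard_decision: prints "restart" if at least one matching line is present
# on stdin, "ok" otherwise. It only decides; it does not restart anything.
guard_decision() {
  vipfail=$(count_vip_failures)
  if [ "$vipfail" -gt 0 ]; then
    echo restart
  else
    echo ok
  fi
}
```

A possible invocation mirroring the quoted guard would be `ssh cc-ci 'journalctl -u docker --since "26 hours ago"' | guard_decision`, with the caller running `systemctl restart docker` only when the output is `restart`. Keeping the decision separate from the restart makes the guard easy to dry-run against a saved journal excerpt.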
---
## Adversary independent allocator-headroom probe — 2026-06-13T06:02Z
**Method:** deploy 5 throwaway nginx stacks concurrently joining `proxy`, then remove all 5 concurrently (same concurrent-rm pattern that caused endpoint GC races under the old /24).
| Check | Result |
|---|---|
| BASELINE proxy containers | 9 |
| AFTER DEPLOY (5 stacks added) | 14 |
| AFTER concurrent stack rm | 9 (back to baseline) |
| Leaked endpoints | **0** |
| VIP exhaustion errors during test | **0** |
| Swarm GC race errors (key modified / network proxy remove failed) | **0** |
| Network prune output | empty (nothing to reclaim) |
| AFTER prune residue | **0** |
| All pvcheck-throwaway stacks removed | ✅ confirmed |
**Verdict:** The /16 subnet has sufficient headroom that 5 concurrent deploy/rm cycles produce zero endpoint leaks and zero VIP errors. No residue after prune.
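The deploy/remove pattern used by the probe can be sketched roughly as follows. This is a reconstruction under assumptions, not the exact script that was run: the function name `probe_cycle`, the stack names `pvcheck-throwaway-N`, and the compose file argument are hypothetical, and the compose file is assumed to attach its nginx service to the external `proxy` network. The sketch also omits any settle time between deploy and removal.

```shell
# DOCKER can be overridden (for example DOCKER=echo) to dry-run the sketch.
DOCKER=${DOCKER:-docker}

# probe_cycle N COMPOSE_FILE
# Deploys N throwaway stacks concurrently, waits for the deploy commands to
# return, then removes all N stacks concurrently and waits again.
probe_cycle() {
  n=$1
  compose=$2

  i=1
  while [ "$i" -le "$n" ]; do
    $DOCKER stack deploy -c "$compose" "pvcheck-throwaway-$i" &
    i=$((i + 1))
  done
  wait    # every background deploy command has returned

  i=1
  while [ "$i" -le "$n" ]; do
    $DOCKER stack rm "pvcheck-throwaway-$i" &
    i=$((i + 1))
  done
  wait    # every background rm command has returned
}
```

With `DOCKER=echo`, `probe_cycle 5 throwaway.yml` only prints the commands it would run. In a real run, the endpoint counts in the table would be sampled before the call, between the two phases, and after it, for example with `docker network inspect proxy --format "{{len .Containers}}"`.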
**Note:** 5 stacks is a conservative test — the original exhaustion required ~45 GC races over 11 days uptime. The /16 has 65534 VIPs vs the old /24's 254 — the leak rate would need to be ~258× faster to hit the same ceiling. This probe confirms the allocator is healthy and the /16 provides the claimed headroom.
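The headroom figures in the note can be reproduced with shell arithmetic. This is a small worked check; the helper name `usable_hosts` is hypothetical, and it counts usable IPv4 host addresses (network and broadcast excluded), of which the gateway occupies one.

```shell
# usable_hosts PREFIX: usable host addresses in an IPv4 subnet with the given
# prefix length, i.e. 2^(32 - PREFIX) minus the network and broadcast addresses.
usable_hosts() {
  echo $(( (1 << (32 - $1)) - 2 ))
}

new=$(usable_hosts 16)    # 65534
old=$(usable_hosts 24)    # 254

# Integer ratio between the two address pools.
echo "ratio: $(( new / old ))"    # prints "ratio: 258"
```

At a constant leak rate, the /16 would therefore take roughly 258 times as long as the /24 to reach the same ceiling.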
---
## M1 — PASS @2026-06-13T06:10Z
**Cold verify run — Adversary's own commands, no cached state.**
| Check | Command | Result |
|---|---|---|
| proxy subnet | `docker network inspect proxy --format "Subnet: {{range .IPAM.Config}}{{.Subnet}}{{end}}, Endpoints: {{len .Containers}}"` | **`10.10.0.0/16`, Endpoints: 7** ✅ |
| 9 services 1/1 | `docker service ls --format "{{.Name}}\t{{.Replicas}}"` | all 1/1 ✅ |
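| ci.commoninternet.net | `curl -sk -o /dev/null -w "%{http_code}" https://ci.commoninternet.net` | **200** ✅ |
| drone.ci.commoninternet.net | same, against `https://drone.ci.commoninternet.net` | **303** ✅ |
| report.ci.commoninternet.net | same, against `https://report.ci.commoninternet.net` | **200** ✅ |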
| VIP exhaustion since 05:38Z | `journalctl -u docker --since "2026-06-13 05:38:00" \| grep -c "available IP while allocating VIP"` | **0** ✅ |
| swarm.nix /16 declared | `grep "10.10" nix/modules/swarm.nix` | `--subnet 10.10.0.0/16` ✅ |
| swarm.nix commit | `git show e6349a9 --stat` | confirmed ✅ |
| Step-0 guard text | `grep -A8 "VIPFAIL" upgrade-all/SKILL.md` | guard exists, checks exact signature ✅ |
| [A2] fix | `git -C /srv/cc-ci-orch log --oneline \| grep 84e13a7` | `fix(pvcheck/A2): update upgrade-all SKILL.md guard description` ✅ |
| [A2] text updated | SKILL.md line ~81 | "belt-and-suspenders even after the /16 fix" ✅ |
**All M1 criteria verified independently from cold start.** Builder's before/after evidence is consistent with what Adversary observed directly. No discrepancies.
[A2] CLOSED — fix confirmed in orchestrator commit 84e13a7.
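The three routing checks in the M1 table follow one pattern, which could be wrapped as below. This is a sketch only: the function names `http_status` and `expect_status` are hypothetical, and the `https://` scheme is an assumption based on the `-k` flag and the HTTP/2 responses recorded in the baseline probe.

```shell
# http_status HOST: prints the HTTP status code returned by https://HOST/.
# -s silences progress output, -k skips certificate verification (as in the
# review's own command), -o /dev/null discards the body. Redirects are not
# followed (no -L), so a 303 is reported as 303.
http_status() {
  curl -sk -o /dev/null -w '%{http_code}' "https://$1/"
}

# expect_status HOST CODE: succeeds only if HOST answers with exactly CODE.
expect_status() {
  got=$(http_status "$1")
  if [ "$got" = "$2" ]; then
    echo "OK   $1 $got"
  else
    echo "FAIL $1 got=$got want=$2"
    return 1
  fi
}
```

Under these assumptions, the table rows correspond to `expect_status ci.commoninternet.net 200`, `expect_status drone.ci.commoninternet.net 303`, and `expect_status report.ci.commoninternet.net 200`.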
## M2 — PENDING (awaiting Builder claim)
A real recipe CI run after the proxy fix (05:38Z) is still needed. The dashboard shows run #585 (ghost, ~04:56Z), which predates the fix, so a new !testme run post-fix is required for M2.
Adversary independent allocator-headroom probe already completed (2026-06-13T06:02Z — see above): 5 concurrent stacks, 0 leaks, 0 VIP errors. Awaiting Builder's full headroom proof + real recipe run claim.
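The "run must postdate the fix" criterion can be checked mechanically once a run timestamp is known. The sketch below is illustrative: the function name `is_after` is hypothetical, and it assumes both timestamps are UTC and written in the same fixed-width ISO 8601 form, so that string ordering matches chronological ordering.

```shell
# is_after TS REF: succeeds if TS is strictly later than REF.
# Both arguments must be UTC timestamps in the same fixed-width ISO 8601
# form (e.g. 2026-06-13T05:38:00Z). Concatenating "" forces awk to compare
# the values as strings rather than attempt a numeric comparison.
is_after() {
  awk -v ts="$1" -v ref="$2" 'BEGIN { exit (((ts "") > (ref "")) ? 0 : 1) }'
}
```

For example, `is_after 2026-06-13T04:56:00Z 2026-06-13T05:38:00Z` fails, matching the observation that run #585 (approximately 04:56Z) does not count toward M2, while any run stamped after 05:38Z on the same day would pass.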
---
## Adversary findings
### [A2] upgrade-all SKILL.md stale description — guard text still says "until that lands" (2026-06-13T05:56Z)
**Severity:** Documentation / low
**Location:** `/srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md` line 81
**Current text:** "this guard is the per-run safety net until that lands"
**Issue:** the durable fix (proxy /16) has landed — this text now misleads about the guard's purpose (it IS still useful as belt+suspenders, but no longer "until the fix lands")
**Suggested fix:** update to "this guard remains as belt-and-suspenders even after the /16 subnet fix"
**NOT a VETO** — guard logic is correct; this is documentation only.
Status: closed @2026-06-13T06:10Z (fix confirmed in orchestrator commit 84e13a7; see M1)
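The re-read that closes this finding can be expressed as a small check. This is a sketch under assumptions: the function name `guard_text_is_current` is hypothetical, and it looks only for the stale phrase quoted above and for the "belt-and-suspenders" wording from the suggested fix.

```shell
# guard_text_is_current FILE
# Succeeds when FILE no longer contains the stale phrase "until that lands"
# and does contain the replacement "belt-and-suspenders" wording.
# -q suppresses output, -F treats the pattern as a fixed string.
guard_text_is_current() {
  if grep -qF -- 'until that lands' "$1"; then
    return 1
  fi
  grep -qF -- 'belt-and-suspenders' "$1"
}
```

Run against the file named in the Location field, for example `guard_text_is_current /srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md && echo "A2 text updated"`.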