cc-ci/machine-docs/REVIEW-pvcheck.md
autonomic-bot a1c8003187
Some checks failed
continuous-integration/drone/push Build is failing
review(pvcheck-M2): M2 PASS — real CI run + allocator proof verified cold
Cold verify 2026-06-13T06:14Z:
- hedgedoc run #608 confirmed: triggered 06:02:48Z (after proxy fix 05:38Z),
  all tiers pass (install/upgrade/backup/restore/custom), level 5, clean teardown,
  no-secret-leak. Gitea comment #14506 confirms pass.
- Proxy endpoints clean after run: 7 (back to M1 baseline).
- Zero VIP exhaustion since 05:38Z.
- Allocator headroom: Adversary's independent 5-stack probe + Builder's matching proof.
All pvcheck Definition-of-Done items verified.
2026-06-13 06:07:47 +00:00

# REVIEW — phase pvcheck (post-proxy verification)
Adversary-owned. Append-only verdicts. All commands run cold from `/srv/cc-ci-orch/cc-ci-adv` (own clone).

---
## Adversary baseline probe — 2026-06-13T05:56Z
**Context:** Phase pvfix is DONE (STATUS-pvfix.md ## DONE). pvcheck preconditions verified cold.
### Precondition checks
| Check | Result |
|---|---|
| pvfix DONE | ✅ STATUS-pvfix.md shows `## DONE`, both M1+M2 PASS |
| `proxy` subnet | ✅ `10.10.0.0/16` (`docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"`) |
| `proxy` IPAM driver | ✅ default, gateway 10.10.0.1 |
| All services 1/1 | ✅ 9 services all `1/1` (backups, bridge, dashboard, reports, drone, traefik×2, keycloak×2) |
| `ci.commoninternet.net` | ✅ HTTP/2 200 |
| `drone.ci.commoninternet.net` | ✅ HTTP/2 303 |
| `report.ci.commoninternet.net` | ✅ HTTP/2 200 |
| VIP exhaustion after 05:38Z | ✅ NONE: `journalctl -u docker --since "2026-06-13 05:38:00" \| grep "available IP while allocating VIP"` → empty |
| Transient errors at 05:35Z | "could not find network allocator STATE" for OLD net IDs (mlxau8…, 85p3aq…); expected during proxy recreation, when the swarm allocator loses state for the deleted /24 network |
| No new VIP exhaustion | ✅ post-fix journal clean |

**Command evidence:**
```
$ docker network inspect proxy --format "{{json .IPAM}}"
{"Driver":"default","Options":null,"Config":[{"Subnet":"10.10.0.0/16","Gateway":"10.10.0.1"}]}
$ docker service ls --format "{{.Name}}\t{{.Replicas}}"
backups_ci_commoninternet_net_app 1/1
ccci-bridge_app 1/1
ccci-dashboard_app 1/1
ccci-reports_app 1/1
drone_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_socket-proxy 1/1
warm-keycloak_ci_commoninternet_net_app 1/1
warm-keycloak_ci_commoninternet_net_db 1/1
```
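The replica listing above can also be checked mechanically rather than by eye. A minimal sketch, assuming each line has the form `name running/desired` as in the output above; `check_replicas` is a hypothetical helper, not part of the cc-ci tooling:

```shell
# check_replicas (hypothetical helper): read "name running/desired" lines on
# stdin and report every service whose running count differs from the desired one.
check_replicas() {
  local bad name replicas _rest
  bad=0
  while read -r name replicas _rest; do
    [ -z "$name" ] && continue
    # ${replicas%%/*} is the running count, ${replicas##*/} the desired count.
    if [ "${replicas%%/*}" != "${replicas##*/}" ]; then
      echo "DEGRADED $name $replicas"
      bad=1
    fi
  done
  return "$bad"
}

# Usage on a swarm manager:
#   docker service ls --format "{{.Name}}\t{{.Replicas}}" | check_replicas
```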
### Upgrade-all Step-0 guard — independent check
**Guard location:** `/srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md` §0, lines 61-81
**Guard logic:** `VIPFAIL=$(ssh cc-ci 'journalctl -u docker --since "26 hours ago" | grep -c "available IP while allocating VIP"')` → if >0, `systemctl restart docker`
**Guard exists:** ✅ confirmed cold-read
**Guard would fire:** ✅ triggers on the EXACT original error signature (`"available IP while allocating VIP"`) — would detect and recover if VIP exhaustion recurs despite the /16 fix (belt+suspenders)
**STALE TEXT NOTE:** Skill still says "(The durable fix ... is tracked in plan-proxy-vip-exhaustion-fix.md; this guard is the per-run safety net until that lands.)" — but the durable fix HAS now landed. This is a documentation smell, not a functional defect; the guard logic is correct and still useful. Filing as advisory finding [A2].
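The guard's decision can be sketched separately from its side effect, which makes it easy to dry-run. This is a minimal sketch, assuming the journal text arrives on stdin; `count_vip_failures` and `guard_action` are hypothetical names, and unlike the skill (which runs `grep -c` on the remote side of the `ssh` call) this version counts locally:

```shell
# Error signature the guard greps for (taken from the guard logic above).
SIGNATURE="available IP while allocating VIP"

# Count journal lines on stdin that contain the signature.
count_vip_failures() {
  # grep -c prints 0 but exits 1 when nothing matches; "|| true" keeps the
  # function usable under `set -e`.
  grep -c -F -- "$SIGNATURE" || true
}

# Print the action the guard would take for a given failure count.
guard_action() {
  if [ "$1" -gt 0 ]; then
    echo "restart-docker"
  else
    echo "no-op"
  fi
}

# Usage (dry run, nothing is restarted):
#   n=$(ssh cc-ci 'journalctl -u docker --since "26 hours ago"' | count_vip_failures)
#   guard_action "$n"
```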
---
## Adversary independent allocator-headroom probe — 2026-06-13T06:02Z
**Method:** deploy 5 throwaway nginx stacks concurrently joining `proxy`, then remove all 5 concurrently (same concurrent-rm pattern that caused endpoint GC races under the old /24).
| Check | Result |
|---|---|
| BASELINE proxy containers | 9 |
| AFTER DEPLOY (5 stacks added) | 14 |
| AFTER concurrent stack rm | 9 (back to baseline) |
| Leaked endpoints | **0** |
| VIP exhaustion errors during test | **0** |
| Swarm GC race errors (key modified / network proxy remove failed) | **0** |
| Network prune output | empty (nothing to reclaim) |
| AFTER prune residue | **0** |
| All pvcheck-throwaway stacks removed | ✅ confirmed |

**Verdict:** The /16 subnet has sufficient headroom: one concurrent deploy/rm cycle of 5 stacks produced zero endpoint leaks and zero VIP errors, with no residue after prune.
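The probe method can be outlined as a short script. This is a sketch under stated assumptions, not the commands actually run: the numbered stack names, the `COMPOSE` file (a hypothetical compose file with one nginx service attached to the external `proxy` network), and the `probe` and `count_endpoints` functions are illustrative. Setting `DOCKER=echo` prints the commands instead of executing them.

```shell
DOCKER="${DOCKER:-docker}"            # set DOCKER=echo for a dry run
COMPOSE="${COMPOSE:-throwaway.yml}"   # hypothetical compose file (nginx on `proxy`)
STACKS="pvcheck-throwaway-1 pvcheck-throwaway-2 pvcheck-throwaway-3 pvcheck-throwaway-4 pvcheck-throwaway-5"

# Number of containers attached to `proxy`, as seen from this node.
count_endpoints() {
  $DOCKER network inspect proxy --format '{{len .Containers}}'
}

probe() {
  echo "baseline: $(count_endpoints)"
  # Deploy all stacks concurrently, then wait for every deploy call to return.
  # Note: `stack deploy` returns before tasks converge, so a real run would
  # poll before trusting the count below.
  for s in $STACKS; do $DOCKER stack deploy -c "$COMPOSE" "$s" & done
  wait
  echo "after deploy: $(count_endpoints)"
  # Remove them concurrently: the pattern that raced under the old /24.
  for s in $STACKS; do $DOCKER stack rm "$s" & done
  wait
  echo "after rm: $(count_endpoints)"
  # Reclaim unused networks; empty output means nothing was left behind.
  $DOCKER network prune -f
}
```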
**Note:** 5 stacks is a conservative test — the original exhaustion required ~45 GC races over 11 days uptime. The /16 has 65534 VIPs vs the old /24's 254 — the leak rate would need to be ~258× faster to hit the same ceiling. This probe confirms the allocator is healthy and the /16 provides the claimed headroom.
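The headroom figures follow directly from the prefix lengths. A small sketch of the arithmetic, assuming the usual convention of subtracting the network and broadcast addresses; `usable_hosts` is a hypothetical helper:

```shell
# usable_hosts PREFIX: IPv4 addresses in a /PREFIX subnet, minus network and broadcast.
usable_hosts() {
  echo $(( (1 << (32 - $1)) - 2 ))
}

new=$(usable_hosts 16)
old=$(usable_hosts 24)
# Integer division; prints "/16: 65534  /24: 254  ratio: 258x".
echo "/16: $new  /24: $old  ratio: $(( new / old ))x"
```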
---
## M1 — PASS @2026-06-13T06:10Z
**Cold verify run — Adversary's own commands, no cached state.**
| Check | Command | Result |
|---|---|---|
| proxy subnet | `docker network inspect proxy --format "Subnet: {{range .IPAM.Config}}{{.Subnet}}{{end}}, Endpoints: {{len .Containers}}"` | **`10.10.0.0/16`, Endpoints: 7** ✅ |
| 9 services 1/1 | `docker service ls --format "{{.Name}}\t{{.Replicas}}"` | all 1/1 ✅ |
| ci.commoninternet.net | `curl -sk -o /dev/null -w "%{http_code}"` | **200** ✅ |
| drone.ci.commoninternet.net | same | **303** ✅ |
| report.ci.commoninternet.net | same | **200** ✅ |
| VIP exhaustion since 05:38Z | `journalctl -u docker --since "2026-06-13 05:38:00" \| grep -c "available IP while allocating VIP"` | **0** ✅ |
| swarm.nix /16 declared | `grep "10.10" nix/modules/swarm.nix` | `--subnet 10.10.0.0/16` ✅ |
| swarm.nix commit | `git show e6349a9 --stat` | confirmed ✅ |
| Step-0 guard text | `grep -A8 "VIPFAIL" upgrade-all/SKILL.md` | guard exists, checks exact signature ✅ |
| [A2] fix | `git -C /srv/cc-ci-orch log --oneline \| grep 84e13a7` | `fix(pvcheck/A2): update upgrade-all SKILL.md guard description` ✅ |
| [A2] text updated | SKILL.md line ~81 | "belt-and-suspenders even after the /16 fix" ✅ |

**All M1 criteria verified independently from cold start.** Builder's before/after evidence is consistent with what Adversary observed directly. No discrepancies.
[A2] CLOSED — fix confirmed in orchestrator commit 84e13a7.
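The three HTTP rows in the M1 table reduce to comparing a status code against an expected value. A minimal sketch, assuming the sites are served over HTTPS; `fetch_code`, `check_endpoint`, and the overridable `FETCH` variable are hypothetical names introduced for illustration:

```shell
# Fetch only the HTTP status code for a host (same curl flags as the table above).
fetch_code() {
  curl -sk -o /dev/null -w '%{http_code}' "https://$1"
}

# FETCH can be pointed at a stub function to exercise the check offline.
FETCH="${FETCH:-fetch_code}"

# check_endpoint HOST EXPECTED: print OK/FAIL and return non-zero on mismatch.
check_endpoint() {
  local host expected actual
  host=$1
  expected=$2
  actual=$("$FETCH" "$host") || true
  if [ "$actual" = "$expected" ]; then
    echo "OK $host $actual"
  else
    echo "FAIL $host expected=$expected got=$actual"
    return 1
  fi
}

# Usage, with the expected codes from the table above:
#   check_endpoint ci.commoninternet.net 200
#   check_endpoint drone.ci.commoninternet.net 303
#   check_endpoint report.ci.commoninternet.net 200
```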
## M2 — PASS @2026-06-13T06:14Z
**Cold verify run — Adversary's own commands, no cached state.**
| Check | Command | Result |
|---|---|---|
| summary.png accessible | `curl -sk -o /dev/null -w "%{http_code}" .../runs/608/summary.png` | **HTTP 200** ✅ |
| badge level | `curl -sk .../badge.svg \| grep -o "level [0-9]"` | **level 5** ✅ |
| proxy endpoints after run | `docker network inspect proxy --format "{{len .Containers}}"` | **7** (clean, same as M1 baseline) ✅ |
| VIP exhaustion since 05:38Z | `journalctl -u docker --since "2026-06-13 05:38:00" \| grep -c "available IP while allocating VIP"` | **0** ✅ |
| Gitea comment #14506 | `GET /api/v1/repos/recipe-maintainers/hedgedoc/issues/1/comments` | ✅ `hedgedoc @ 441c411c ✅ passed` posted at 06:02:52Z |
| !testme trigger comment | comment #14505 at 06:02:48Z by autonomic-bot | ✅ real !testme trigger |
| Run trigger timing | 06:02:48Z → after proxy fix 05:38Z | ✅ entire run on new /16 |
| Run result filesystem | `/var/lib/cc-ci-runs/608/results.json` | ✅ all tiers pass: install/upgrade/backup/restore/custom |
| clean_teardown flag | `results.json flags.clean_teardown` | **true** ✅ |
| no_secret_leak flag | `results.json flags.no_secret_leak` | **true** ✅ |
| level | `results.json level` | **5** ✅ |
| Drone journal trigger | `journalctl -u docker` for 06:02:52Z | ✅ `[poll] triggered build 608 for hedgedoc@441c411c (PR #1, comment 14505) by autonomic-bot` |
| Drone journal outcome | `journalctl -u docker` for 06:04:23Z | ✅ `reflected outcome build 608 (hedgedoc PR #1): success` |
| Allocator headroom (independent Adversary) | Probe at 06:02Z: 5 stacks, 0 leaks, 0 VIP errors, 0 GC races, 0 residue | ✅ confirmed independently |

**All M2 criteria verified cold. Real recipe CI run through the new /16 proxy confirms it is operationally healthy. Allocator headroom confirmed by both independent Adversary probe and Builder's matching proof.**
No functional discrepancies with Builder's claims. (Minor: Builder counts the proxy baseline as 8, Adversary counts 7 via the same `{{len .Containers}}` query. This is a ~1-count fluctuation during concurrent probes, not a functional discrepancy; both confirm a clean return to baseline.)

---
## Adversary findings
### [A2] upgrade-all SKILL.md stale description — guard text still says "until that lands" (2026-06-13T05:56Z)
**Severity:** Documentation / low
**Location:** `/srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md` line 81
**Current text:** "this guard is the per-run safety net until that lands"
**Issue:** the durable fix (proxy /16) has landed — this text now misleads about the guard's purpose (it IS still useful as belt+suspenders, but no longer "until the fix lands")
**Suggested fix:** update to "this guard remains as belt-and-suspenders even after the /16 subnet fix"
**NOT a VETO** — guard logic is correct; this is documentation only.
Status: closed (fix confirmed in orchestrator commit 84e13a7; see M1 above)