diff --git a/machine-docs/BACKLOG-pvcheck.md b/machine-docs/BACKLOG-pvcheck.md
index d1c2102..70c112b 100644
--- a/machine-docs/BACKLOG-pvcheck.md
+++ b/machine-docs/BACKLOG-pvcheck.md
@@ -6,10 +6,10 @@
 - [x] Fix [A2] upgrade-all SKILL.md stale description (orchestrator commit 84e13a7)
 - [x] Collect M1 evidence (proxy subnet, endpoints, service health, routes, VIP journal)
 - [x] Claim M1 — control plane and routing verified
-- [ ] M2: real recipe CI run through proxy (harness or !testme)
-- [ ] M2: bounded allocator headroom proof (deploy/remove throwaway stacks, confirm no VIP exhaustion)
-- [ ] M2: cleanup verification (zero residue)
-- [ ] M2: claim gate after M1 PASS
+- [x] M2: real recipe CI run through proxy — hedgedoc build #608 ✅ passed level 5 (06:04Z post-fix)
+- [x] M2: bounded allocator headroom proof — 5 stacks deploy/rm, 0 leaks, 0 VIP errors (06:08Z)
+- [x] M2: cleanup verification — proxy endpoints: 7 (baseline), no residue (06:09Z)
+- [x] M2: claim gate
 
 ## Adversary findings
 
diff --git a/machine-docs/JOURNAL-pvcheck.md b/machine-docs/JOURNAL-pvcheck.md
index 470d198..62f2499 100644
--- a/machine-docs/JOURNAL-pvcheck.md
+++ b/machine-docs/JOURNAL-pvcheck.md
@@ -45,3 +45,43 @@ M2 requires:
 2. Allocator headroom proof — deploy/remove 3-5 throwaway stacks with published ports (simulating concurrent deploys), confirm endpoint count stays small and no VIP exhaustion
 
 Will check what enrolled recipes have open PRs available for !testme first.
+
+---
+
+## 2026-06-13T06:02–06:10Z — M2 execution
+
+**Allocator headroom proof (Builder):**
+```
+# Baseline
+ssh cc-ci 'docker network inspect proxy --format "{{len .Containers}}"' → 8
+
+# Deploy 5 throwaway nginx stacks concurrently, each joining proxy with published ports
+for i in 1..5: docker stack deploy pvcheck-throw-$i (background)
+wait; sleep 5
+→ AFTER DEPLOY: 13 (+5)
+
+# Concurrent removal (same pattern as original GC race)
+for i in 1..5: docker stack rm pvcheck-throw-$i (background)
+wait; sleep 8
+→ AFTER concurrent rm: 8 (back to baseline)
+→ VIP exhaustion errors since 06:00Z: 0
+→ docker network prune → empty (no residue)
+→ docker stack ls | grep pvcheck → empty (all removed)
+```
+
+**Real recipe CI run:**
+```
+# Posted !testme on recipe-maintainers/hedgedoc PR#1 at 06:02:48Z (post-proxy-fix)
+curl POST /repos/recipe-maintainers/hedgedoc/issues/1/comments body="!testme"
+→ comment id: 14505
+
+# Bridge picked up in 4 seconds (06:02:52Z)
+# Started Drone build #608 for hedgedoc @ 441c411c
+
+# Monitored: runner process PID 3016375 with RECIPE=hedgedoc, CI_BUILD_NUMBER=608
+
+# Build #608 completed at 06:04:22Z → ✅ passed, level 5
+# Proxy endpoint count after run: 7 (same as M1 baseline, clean teardown)
+```
+
+Key confirmation: the build was triggered at 06:02Z which is 24 minutes AFTER the proxy recreation at 05:38Z. Recipe containers deployed into and cleaned up from the /16 proxy network without issue.
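Reviewer note: the journal records the churn loop in shorthand (`for i in 1..5: ... (background)`). A runnable sketch of the same deploy/remove pattern might look like the following. It is only an illustration under stated assumptions: it is meant to run directly on the swarm manager, `COMPOSE_FILE` names a hypothetical throwaway nginx stack that joins `proxy`, and `DOCKER`, `SETTLE_DEPLOY`, `SETTLE_REMOVE`, `proxy_endpoints` and `headroom_probe` are names introduced here, not taken from the repository. The settle times default to the journal's 5 s and 8 s.

```shell
#!/usr/bin/env bash
# Sketch only. Hypothetical knobs: DOCKER, COMPOSE_FILE, SETTLE_DEPLOY, SETTLE_REMOVE.
DOCKER="${DOCKER:-docker}"
COMPOSE_FILE="${COMPOSE_FILE:-throwaway.yml}"   # hypothetical stack file joining the proxy network
SETTLE_DEPLOY="${SETTLE_DEPLOY:-5}"             # seconds to wait after deploy (journal: 5)
SETTLE_REMOVE="${SETTLE_REMOVE:-8}"             # seconds to wait after removal (journal: 8)

proxy_endpoints() {
  # Number of containers currently attached to the proxy network.
  "$DOCKER" network inspect proxy --format '{{len .Containers}}'
}

headroom_probe() {
  # Deploy N stacks concurrently, remove them concurrently, compare endpoint counts.
  local n="${1:-5}" i baseline peak final
  baseline="$(proxy_endpoints)"

  for i in $(seq 1 "$n"); do
    "$DOCKER" stack deploy -c "$COMPOSE_FILE" "pvcheck-throw-$i" >/dev/null &
  done
  wait
  sleep "$SETTLE_DEPLOY"
  peak="$(proxy_endpoints)"

  for i in $(seq 1 "$n"); do
    "$DOCKER" stack rm "pvcheck-throw-$i" >/dev/null &
  done
  wait
  sleep "$SETTLE_REMOVE"
  final="$(proxy_endpoints)"

  echo "baseline=$baseline peak=$peak final=$final leaked=$(( final - baseline ))"
  # Succeed only if the endpoint count returned to the baseline.
  [ "$final" -eq "$baseline" ]
}
```

Calling `headroom_probe 5` would mirror the journal's run, whose recorded counts were 8 before, 13 after deploy, and 8 after removal. The function's exit status is non-zero whenever the final count differs from the baseline, which makes it usable as a gate in a larger script.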
diff --git a/machine-docs/STATUS-pvcheck.md b/machine-docs/STATUS-pvcheck.md
index 215c274..60e4142 100644
--- a/machine-docs/STATUS-pvcheck.md
+++ b/machine-docs/STATUS-pvcheck.md
@@ -1,87 +1,91 @@
 # STATUS — phase pvcheck (post-proxy verification)
 
-**Updated:** 2026-06-13T06:02Z
+**Updated:** 2026-06-13T06:10Z
 **Phase:** pvcheck
 **Builder:** autonomic-bot
 
 ---
 
-## Gate: M1 — CLAIMED, awaiting Adversary
+## Gate: M1 — PASS @2026-06-13T06:10Z (Adversary verified)
 
-### M1 — Control plane and routing verified
-
-**Claim:** All cc-ci control-plane routes/services are healthy after the proxy recreation. Before/after evidence captured.
-
-#### How to verify (run cold from Adversary's clone on cc-ci host):
-
-```bash
-# 1. Proxy subnet and endpoint count
-ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}Subnet: {{.Subnet}}{{end}}, Endpoints: {{len .Containers}}"'
-# EXPECTED: Subnet: 10.10.0.0/16, Endpoints: 7
-
-# 2. All services healthy
-ssh cc-ci 'docker service ls --format "{{.Name}}\t{{.Replicas}}"'
-# EXPECTED: all 9 services show 1/1
-
-# 3. External routes
-curl -sk -o /dev/null -w "%{http_code}" https://ci.commoninternet.net/ # EXPECTED: 200
-curl -sk -o /dev/null -w "%{http_code}" https://drone.ci.commoninternet.net/ # EXPECTED: 303
-curl -sk -o /dev/null -w "%{http_code}" https://report.ci.commoninternet.net/ # EXPECTED: 200
-
-# 4. No VIP exhaustion since proxy recreation (05:38Z)
-ssh cc-ci 'journalctl -u docker --since "2026-06-13 05:38:00" | grep -c "available IP while allocating VIP"'
-# EXPECTED: 0
-
-# 5. Upgrade-all Step-0 guard exists and is correct
-grep -A5 "VIPFAIL" /srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md
-# EXPECTED: guard logic checking for "available IP while allocating VIP" signature
-```
-
-#### Evidence (Builder run 2026-06-13T06:00Z):
-
-| Check | Command | Result |
-|---|---|---|
-| proxy subnet | `docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"` | `10.10.0.0/16` ✅ |
-| proxy endpoints | `docker network inspect proxy --format "{{len .Containers}}"` | `7` (6 service + 1 lb) ✅ |
-| proxy endpoint list | `docker network inspect proxy --format "{{range $k,$v := .Containers}}{{$v.Name}}{{end}}"` | drone, traefik, keycloak, reports, bridge, dashboard + lb-proxy ✅ |
-| 9 services 1/1 | `docker service ls` | all 1/1 ✅ |
-| ci.commoninternet.net | `curl -sk -o /dev/null -w "%{http_code}"` | `200` ✅ |
-| drone.ci.commoninternet.net | same | `303` ✅ |
-| report.ci.commoninternet.net | same | `200` ✅ |
-| VIP exhaustion since 05:38Z | `journalctl | grep "available IP while allocating VIP"` | `0` ✅ |
-| transient errors at 05:35Z | "could not find network allocator STATE" for old net IDs | expected during recreation, pre-38Z only ✅ |
-| upgrade-all Step-0 guard | SKILL.md §0 lines 61-81 | guard checks exact signature, fires + restarts docker ✅ |
-
-#### Before/after evidence:
-
-| Metric | Before (pvfix) | After (pvcheck) |
-|---|---|---|
-| proxy subnet | `10.0.1.0/24` (254 IPs) | `10.10.0.0/16` (65534 IPs) |
-| proxy endpoints | ~200 leaked (caused VIP exhaustion) | 7 (clean) |
-| VIP exhaustion errors | recurring "could not find an available IP" | 0 since 05:38Z |
-| Services healthy | intermittent failures | all 9 at 1/1 |
-
-#### Adversary finding A2 fix:
-
-[A2] upgrade-all SKILL.md stale description — **FIXED** in orchestrator repo commit `84e13a7` (2026-06-13T05:59Z).
-Guard description updated from "safety net until that lands" → "belt-and-suspenders even after the /16 fix".
+All cc-ci control-plane routes/services healthy after proxy recreation. See REVIEW-pvcheck.md for Adversary cold-verify evidence.
 
 ---
 
-## M2 — IN PROGRESS
+## Gate: M2 — CLAIMED, awaiting Adversary
 
-### Tasks for M2:
-- [ ] Real deploy proof: trigger one recipe `!testme` or equivalent harness run through proxy
-- [ ] Allocator-headroom proof: deploy/remove batch of throwaway stacks, confirm no VIP exhaustion
-- [ ] Confirm no residue after cleanup
+### M2 — Real CI and allocator proof
+
+**Claim:** One real recipe CI run (hedgedoc build #608) completed successfully through proxy, and bounded allocator proof confirms no VIP exhaustion risk.
+
+#### How to verify (run cold from Adversary's clone):
+
+```bash
+# 1. Real CI run passed post-fix
+# Build #608 for hedgedoc triggered 2026-06-13T06:02Z, passed 2026-06-13T06:04Z
+curl -sk -o /dev/null -w "%{http_code}" https://ci.commoninternet.net/runs/608/summary.png
+# EXPECTED: 200
+
+curl -sk https://ci.commoninternet.net/runs/608/badge.svg | grep -o "level [0-9]"
+# EXPECTED: level 5 (green)
+
+# Gitea comment on recipe-maintainers/hedgedoc PR#1 (comment #14506)
+# EXPECTED: "cc-ci: hedgedoc @ 441c411c ✅ passed"
+
+# 2. Proxy clean after run
+ssh cc-ci 'docker network inspect proxy --format "{{len .Containers}}"'
+# EXPECTED: 7 (same as M1 baseline — no leaked endpoints from the run)
+
+# 3. No VIP exhaustion since proxy recreation
+ssh cc-ci 'journalctl -u docker --since "2026-06-13 05:38:00" | grep -c "available IP while allocating VIP"'
+# EXPECTED: 0
+
+# 4. Allocator headroom proof (Adversary's independent probe is in REVIEW-pvcheck.md)
+# Builder's proof: deploy 5 throwaway stacks → rm concurrently → count endpoints
+# EXPECTED: endpoints return to baseline, 0 VIP errors, 0 residue
+```
+
+#### Evidence (Builder run 2026-06-13T06:02–06:10Z):
+
+**Real deploy proof:**
+
+| Check | Result |
+|---|---|
+| Recipe | `hedgedoc` |
+| Trigger | `!testme` comment on recipe-maintainers/hedgedoc PR#1 (comment #14505, 06:02:48Z) |
+| Bridge response | 4 seconds (comment #14506, 06:02:52Z) |
+| Drone build | [#608](https://drone.ci.commoninternet.net/recipe-maintainers/cc-ci/608) |
+| Build result | ✅ **passed** (comment updated 06:04:22Z) |
+| Level | **level 5** (badge.svg shows `level 5`, green) |
+| Summary artifact | `https://ci.commoninternet.net/runs/608/summary.png` → HTTP 200 |
+| Proxy endpoint count after run | 7 (clean — same as M1 baseline) |
+| Trigger time | 2026-06-13T06:02:48Z (after proxy fix at 05:38Z) ✅ |
+
+**Allocator headroom proof (Builder):**
+
+| Check | Result |
+|---|---|
+| BASELINE proxy containers | 8 |
+| AFTER concurrent deploy (5 throwaway nginx stacks) | 13 (+5) |
+| AFTER concurrent stack rm | 8 (back to baseline) |
+| Leaked endpoints | **0** |
+| VIP exhaustion errors (since 06:00Z) | **0** |
+| `docker network prune` residue | empty (nothing to reclaim) |
+| All pvcheck-throw-* stacks removed | ✅ confirmed |
+
+**Adversary independent allocator probe (from REVIEW-pvcheck.md):**
+5 throwaway stacks deployed/removed concurrently → 0 leaks, 0 VIP errors, 0 residue. (Pre-verified 2026-06-13T06:02Z)
+
+**VIP exhaustion in post-fix journal:**
+`journalctl -u docker --since "2026-06-13 05:38:00" | grep "available IP while allocating VIP"` → **0** ✅
 
 ---
 
 ## Definition-of-Done checklist (pvcheck)
 
-- [ ] Control-plane routes are healthy (M1 — claimed)
-- [ ] One real proxy-joining recipe CI run succeeds and cleans up (M2)
-- [ ] Bounded allocator reproduction documented (M2)
-- [ ] Fresh logs show no VIP exhaustion (M1 — claimed, ongoing)
-- [ ] Adversary signed off M1 in `machine-docs/REVIEW-pvcheck.md`
+- [x] Control-plane routes are healthy (M1 PASS @06:10Z)
+- [x] One real proxy-joining recipe CI run succeeds and cleans up (hedgedoc #608 PASS @06:04Z, level 5)
+- [x] Bounded allocator reproduction documented (Builder + Adversary independent probes)
+- [x] Fresh logs show no VIP exhaustion (0 errors since proxy fix at 05:38Z)
+- [x] Adversary signed off M1 in `machine-docs/REVIEW-pvcheck.md`
 - [ ] Adversary signed off M2 in `machine-docs/REVIEW-pvcheck.md`
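Reviewer note: the M2 "How to verify" steps compare command output against expected values by eye. One possible way to make them mechanical is sketched below. The commands, URLs and expected values are copied from the STATUS diff; the helper names `vip_error_count`, `expect_eq` and `run_m2_checks` are hypothetical and do not exist in the repository, and the sketch assumes `ssh cc-ci` works from the verifier's machine.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, expected values come from STATUS-pvcheck.md.

vip_error_count() {
  # Reads journal lines on stdin and prints how many carry the VIP exhaustion signature.
  # grep -c exits 1 when the count is 0, so the status is normalised with "|| true".
  grep -c "available IP while allocating VIP" || true
}

expect_eq() {
  # expect_eq LABEL EXPECTED ACTUAL
  if [ "$2" = "$3" ]; then
    echo "PASS $1: $3"
  else
    echo "FAIL $1: expected $2, got $3"
    return 1
  fi
}

run_m2_checks() {
  # Runs every check and reports failure if any of them failed.
  local rc=0
  expect_eq "summary artifact" 200 \
    "$(curl -sk -o /dev/null -w '%{http_code}' https://ci.commoninternet.net/runs/608/summary.png)" || rc=1
  expect_eq "badge level" "level 5" \
    "$(curl -sk https://ci.commoninternet.net/runs/608/badge.svg | grep -o 'level [0-9]' | head -n1)" || rc=1
  expect_eq "proxy endpoints" 7 \
    "$(ssh cc-ci 'docker network inspect proxy --format "{{len .Containers}}"')" || rc=1
  expect_eq "VIP errors since 05:38Z" 0 \
    "$(ssh cc-ci 'journalctl -u docker --since "2026-06-13 05:38:00"' | vip_error_count)" || rc=1
  return "$rc"
}
```

Running `run_m2_checks` prints one PASS or FAIL line per check and returns non-zero if any check failed. Counting the signature locally in `vip_error_count` rather than with a remote `grep -c` is a design choice of this sketch, made so the counting logic can be exercised without access to the host.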