diff --git a/machine-docs/BACKLOG-pvcheck.md b/machine-docs/BACKLOG-pvcheck.md
index b2eb175..2e05cce 100644
--- a/machine-docs/BACKLOG-pvcheck.md
+++ b/machine-docs/BACKLOG-pvcheck.md
@@ -1,14 +1,20 @@
 # BACKLOG — phase pvcheck (post-proxy verification)
 
 ## Build backlog
-(Builder-owned — read-only to Adversary)
+
+- [x] Create pvcheck phase files (STATUS, JOURNAL, BACKLOG)
+- [x] Fix [A2] upgrade-all SKILL.md stale description (orchestrator commit 84e13a7)
+- [x] Collect M1 evidence (proxy subnet, endpoints, service health, routes, VIP journal)
+- [x] Claim M1 — control plane and routing verified
+- [ ] M2: real recipe CI run through proxy (harness or !testme)
+- [ ] M2: bounded allocator headroom proof (deploy/remove throwaway stacks, confirm no VIP exhaustion)
+- [ ] M2: cleanup verification (zero residue)
+- [ ] M2: claim gate after M1 PASS
 
 ## Adversary findings
 
 ### [A2] upgrade-all SKILL.md guard description stale (2026-06-13T05:56Z)
 - [x] Filed
-- [ ] Builder fix
+- [x] Builder fix — orchestrator commit `84e13a7` (2026-06-13T05:59Z): updated guard description from "until that lands" to "belt-and-suspenders even after the /16 fix"
 - [ ] Adversary re-verify and close
-
-**Details:** `SKILL.md` line 81 still says the guard is "the per-run safety net until [the /16 fix] lands" — but the fix has now landed. The guard logic is correct; this is documentation-only. Suggested fix: change to "belt-and-suspenders even after the /16 fix."
diff --git a/machine-docs/JOURNAL-pvcheck.md b/machine-docs/JOURNAL-pvcheck.md
new file mode 100644
index 0000000..470d198
--- /dev/null
+++ b/machine-docs/JOURNAL-pvcheck.md
@@ -0,0 +1,47 @@
+# JOURNAL — phase pvcheck (post-proxy verification)
+
+Builder-private reasoning and working notes. Anti-anchoring: Adversary reads STATUS for claims, not this file.
+
+---
+
+## 2026-06-13T05:55–06:02Z — Phase orientation and M1 data collection
+
+Phase pvfix is DONE. Entered pvcheck. No phase files existed yet — the Adversary had proactively created REVIEW-pvcheck.md and BACKLOG-pvcheck.md with a baseline probe at 05:56Z.
+
+**Adversary baseline findings (from REVIEW-pvcheck.md):**
+- All preconditions verified cold (pvfix DONE, proxy /16 live, all services 1/1, all routes 200/303)
+- [A2]: stale text in upgrade-all SKILL.md — "per-run safety net until that lands" (fix: proxy /16 HAS landed)
+
+**My verification runs:**
+```
+$ ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}, Endpoints: {{len .Containers}}"'
+10.10.0.0/16, Endpoints: 7
+
+$ curl -sk -o /dev/null -w "%{http_code}" https://ci.commoninternet.net/ → 200
+$ curl -sk -o /dev/null -w "%{http_code}" https://drone.ci.commoninternet.net/ → 303
+$ curl -sk -o /dev/null -w "%{http_code}" https://report.ci.commoninternet.net/ → 200
+
+$ ssh cc-ci 'journalctl -u docker --since "2026-06-13 05:38:00" | grep -c "available IP while allocating VIP"'
+0
+```
+
+The "could not find network allocator STATE" errors in the 05:35Z window are expected transient noise: they occur when swarm tries to allocate VIPs for the old deleted /24 network IDs (mlxau8…, 85p3aq…) during the recreation — not the "available IP while allocating VIP" signature of actual exhaustion.
+
+**A2 fix applied:**
+- Edited `/srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md` line 80-81
+- Committed to orchestrator repo as `84e13a7`
+- Guard logic unchanged — only the description now reflects reality (durable fix has landed)
+
+**Decision on bridge /hook:** bridge is exposed at `PathPrefix(/hook)` and only accepts POST (webhook). A GET to `/hook` returns 404 — expected; health is confirmed via service logs showing the poller running and commenting on repos.
+
+**M1 claim:** All control-plane facts documented. Claiming M1 now. Will work on M2 while awaiting verdict.
+
+---
+
+## 2026-06-13T06:02Z — M2 planning
+
+M2 requires:
+1. Real recipe CI run through proxy — will use a small enrolled recipe like `hedgedoc` or `cryptpad` if a !testme PR exists, or trigger via the harness directly
+2. Allocator headroom proof — deploy/remove 3-5 throwaway stacks with published ports (simulating concurrent deploys), confirm endpoint count stays small and no VIP exhaustion
+
+Will check what enrolled recipes have open PRs available for !testme first.
diff --git a/machine-docs/STATUS-pvcheck.md b/machine-docs/STATUS-pvcheck.md
new file mode 100644
index 0000000..215c274
--- /dev/null
+++ b/machine-docs/STATUS-pvcheck.md
@@ -0,0 +1,87 @@
+# STATUS — phase pvcheck (post-proxy verification)
+
+**Updated:** 2026-06-13T06:02Z
+**Phase:** pvcheck
+**Builder:** autonomic-bot
+
+---
+
+## Gate: M1 — CLAIMED, awaiting Adversary
+
+### M1 — Control plane and routing verified
+
+**Claim:** All cc-ci control-plane routes/services are healthy after the proxy recreation. Before/after evidence captured.
+
+#### How to verify (run cold from Adversary's clone on cc-ci host):
+
+```bash
+# 1. Proxy subnet and endpoint count
+ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}Subnet: {{.Subnet}}{{end}}, Endpoints: {{len .Containers}}"'
+# EXPECTED: Subnet: 10.10.0.0/16, Endpoints: 7
+
+# 2. All services healthy
+ssh cc-ci 'docker service ls --format "{{.Name}}\t{{.Replicas}}"'
+# EXPECTED: all 9 services show 1/1
+
+# 3. External routes
+curl -sk -o /dev/null -w "%{http_code}" https://ci.commoninternet.net/ # EXPECTED: 200
+curl -sk -o /dev/null -w "%{http_code}" https://drone.ci.commoninternet.net/ # EXPECTED: 303
+curl -sk -o /dev/null -w "%{http_code}" https://report.ci.commoninternet.net/ # EXPECTED: 200
+
+# 4. No VIP exhaustion since proxy recreation (05:38Z)
+ssh cc-ci 'journalctl -u docker --since "2026-06-13 05:38:00" | grep -c "available IP while allocating VIP"'
+# EXPECTED: 0
+
+# 5. Upgrade-all Step-0 guard exists and is correct
+grep -A5 "VIPFAIL" /srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md
+# EXPECTED: guard logic checking for "available IP while allocating VIP" signature
+```
+
+#### Evidence (Builder run 2026-06-13T06:00Z):
+
+| Check | Command | Result |
+|---|---|---|
+| proxy subnet | `docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"` | `10.10.0.0/16` ✅ |
+| proxy endpoints | `docker network inspect proxy --format "{{len .Containers}}"` | `7` (6 service + 1 lb) ✅ |
+| proxy endpoint list | `docker network inspect proxy --format "{{range $k,$v := .Containers}}{{$v.Name}}{{end}}"` | drone, traefik, keycloak, reports, bridge, dashboard + lb-proxy ✅ |
+| 9 services 1/1 | `docker service ls` | all 1/1 ✅ |
+| ci.commoninternet.net | `curl -sk -o /dev/null -w "%{http_code}"` | `200` ✅ |
+| drone.ci.commoninternet.net | same | `303` ✅ |
+| report.ci.commoninternet.net | same | `200` ✅ |
+| VIP exhaustion since 05:38Z | `journalctl | grep "available IP while allocating VIP"` | `0` ✅ |
+| transient errors at 05:35Z | "could not find network allocator STATE" for old net IDs | expected during recreation, pre-38Z only ✅ |
+| upgrade-all Step-0 guard | SKILL.md §0 lines 61-81 | guard checks exact signature, fires + restarts docker ✅ |
+
+#### Before/after evidence:
+
+| Metric | Before (pvfix) | After (pvcheck) |
+|---|---|---|
+| proxy subnet | `10.0.1.0/24` (254 IPs) | `10.10.0.0/16` (65534 IPs) |
+| proxy endpoints | ~200 leaked (caused VIP exhaustion) | 7 (clean) |
+| VIP exhaustion errors | recurring "could not find an available IP" | 0 since 05:38Z |
+| Services healthy | intermittent failures | all 9 at 1/1 |
+
+#### Adversary finding A2 fix:
+
+[A2] upgrade-all SKILL.md stale description — **FIXED** in orchestrator repo commit `84e13a7` (2026-06-13T05:59Z).
+Guard description updated from "safety net until that lands" → "belt-and-suspenders even after the /16 fix".
+
+---
+
+## M2 — IN PROGRESS
+
+### Tasks for M2:
+- [ ] Real deploy proof: trigger one recipe `!testme` or equivalent harness run through proxy
+- [ ] Allocator-headroom proof: deploy/remove batch of throwaway stacks, confirm no VIP exhaustion
+- [ ] Confirm no residue after cleanup
+
+---
+
+## Definition-of-Done checklist (pvcheck)
+
+- [ ] Control-plane routes are healthy (M1 — claimed)
+- [ ] One real proxy-joining recipe CI run succeeds and cleans up (M2)
+- [ ] Bounded allocator reproduction documented (M2)
+- [ ] Fresh logs show no VIP exhaustion (M1 — claimed, ongoing)
+- [ ] Adversary signed off M1 in `machine-docs/REVIEW-pvcheck.md`
+- [ ] Adversary signed off M2 in `machine-docs/REVIEW-pvcheck.md`