claim(pvcheck-M1): control plane and routing verified post-proxy-recreation

proxy subnet: 10.10.0.0/16, 7 endpoints (6 services + lb)
All 9 swarm services: 1/1
Routes: ci (200), drone (303), report (200)
VIP exhaustion since 05:38Z: 0 errors
Upgrade-all Step-0 guard confirmed in SKILL.md §0
[A2] SKILL.md stale description fixed (orchestrator commit 84e13a7)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
autonomic-bot
2026-06-13 05:59:58 +00:00
parent 99482cb387
commit 3df0ee154d
3 changed files with 144 additions and 4 deletions


@@ -1,14 +1,20 @@
# BACKLOG — phase pvcheck (post-proxy verification)
## Build backlog
(Builder-owned — read-only to Adversary)
- [x] Create pvcheck phase files (STATUS, JOURNAL, BACKLOG)
- [x] Fix [A2] upgrade-all SKILL.md stale description (orchestrator commit 84e13a7)
- [x] Collect M1 evidence (proxy subnet, endpoints, service health, routes, VIP journal)
- [x] Claim M1 — control plane and routing verified
- [ ] M2: real recipe CI run through proxy (harness or !testme)
- [ ] M2: bounded allocator headroom proof (deploy/remove throwaway stacks, confirm no VIP exhaustion)
- [ ] M2: cleanup verification (zero residue)
- [ ] M2: claim gate after M1 PASS
## Adversary findings
### [A2] upgrade-all SKILL.md guard description stale (2026-06-13T05:56Z)
- [x] Filed
- [ ] Builder fix
- [x] Builder fix — orchestrator commit `84e13a7` (2026-06-13T05:59Z): updated guard description from "until that lands" to "belt-and-suspenders even after the /16 fix"
- [ ] Adversary re-verify and close
**Details:** `SKILL.md` line 81 still says the guard is "the per-run safety net until [the /16 fix] lands" — but the fix has now landed. The guard logic is correct; this is documentation-only. Suggested fix: change to "belt-and-suspenders even after the /16 fix."


@@ -0,0 +1,47 @@
# JOURNAL — phase pvcheck (post-proxy verification)
Builder-private reasoning and working notes. Anti-anchoring: Adversary reads STATUS for claims, not this file.
---
## 2026-06-13T05:55–06:02Z — Phase orientation and M1 data collection
Phase pvfix is DONE. Entered pvcheck. No phase files existed yet — the Adversary had proactively created REVIEW-pvcheck.md and BACKLOG-pvcheck.md with a baseline probe at 05:56Z.
**Adversary baseline findings (from REVIEW-pvcheck.md):**
- All preconditions verified cold (pvfix DONE, proxy /16 live, all services 1/1, all routes 200/303)
- [A2]: stale text in upgrade-all SKILL.md — "per-run safety net until that lands" (fix: proxy /16 HAS landed)
**My verification runs:**
```
$ ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}, Endpoints: {{len .Containers}}"'
10.10.0.0/16, Endpoints: 7
$ curl -sk -o /dev/null -w "%{http_code}" https://ci.commoninternet.net/ → 200
$ curl -sk -o /dev/null -w "%{http_code}" https://drone.ci.commoninternet.net/ → 303
$ curl -sk -o /dev/null -w "%{http_code}" https://report.ci.commoninternet.net/ → 200
$ ssh cc-ci 'journalctl -u docker --since "2026-06-13 05:38:00" | grep -c "available IP while allocating VIP"'
0
```
The "could not find network allocator STATE" errors in the 05:35Z window are expected transient noise: they occur when swarm tries to allocate VIPs for the old deleted /24 network IDs (mlxau8…, 85p3aq…) during the recreation — not the "available IP while allocating VIP" signature of actual exhaustion.
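A minimal sketch of how the two signatures could be counted separately, assuming the journal text is piped in on stdin. The function name `classify_vip_log` is hypothetical, and the transient pattern is matched only on its "could not find network allocator" prefix because the exact tail of that message is not reproduced here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count docker journal lines by signature, reading from stdin.
# Only the two quoted strings come from these notes; the rest is illustrative.
classify_vip_log() {
  local exhaustion=0 transient=0 line
  while IFS= read -r line; do
    case "$line" in
      # Signature of real VIP exhaustion.
      *"available IP while allocating VIP"*) exhaustion=$((exhaustion + 1)) ;;
      # Transient noise seen while the old network IDs were being torn down.
      *"could not find network allocator"*) transient=$((transient + 1)) ;;
    esac
  done
  printf 'exhaustion=%d transient=%d\n' "$exhaustion" "$transient"
}
```

It could be fed with something like `ssh cc-ci 'journalctl -u docker --since "2026-06-13 05:38:00"' | classify_vip_log`; a non-zero `exhaustion` count would be the result that matters.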
**A2 fix applied:**
- Edited `/srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md` lines 80-81
- Committed to orchestrator repo as `84e13a7`
- Guard logic unchanged — only the description now reflects reality (durable fix has landed)
**Decision on bridge /hook:** bridge is exposed at `PathPrefix(/hook)` and only accepts POST (webhook). A GET to `/hook` returns 404 — expected; health is confirmed via service logs showing the poller running and commenting on repos.
**M1 claim:** All control-plane facts documented. Claiming M1 now. Will work on M2 while awaiting verdict.
---
## 2026-06-13T06:02Z — M2 planning
M2 requires:
1. Real recipe CI run through proxy — will use a small enrolled recipe like `hedgedoc` or `cryptpad` if a !testme PR exists, or trigger via the harness directly
2. Allocator headroom proof — deploy/remove 3-5 throwaway stacks with published ports (simulating concurrent deploys), confirm endpoint count stays small and no VIP exhaustion
Will check what enrolled recipes have open PRs available for !testme first.
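A rough sketch of the deploy/remove loop for the headroom proof, assuming a throwaway compose file exists. The stack name prefix `pvcheck-throwaway`, the function name `churn_stacks`, and the `DOCKER` override (useful for a dry run with `DOCKER=echo`) are all hypothetical; the real run may look different.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: deploy N throwaway stacks, then remove them all.
# DOCKER can be overridden (e.g. DOCKER=echo) to print commands instead of running them.
DOCKER=${DOCKER:-docker}

churn_stacks() {
  local count=$1 compose=$2 i=1
  # Deploy phase: each stack is expected to join the proxy network via its compose file.
  while [ "$i" -le "$count" ]; do
    "$DOCKER" stack deploy -c "$compose" "pvcheck-throwaway-$i" || return 1
    i=$((i + 1))
  done
  # Removal phase: tear down every stack that was deployed above.
  i=1
  while [ "$i" -le "$count" ]; do
    "$DOCKER" stack rm "pvcheck-throwaway-$i" || return 1
    i=$((i + 1))
  done
}
```

After a run such as `churn_stacks 3 throwaway.yml`, the endpoint count and the VIP-exhaustion grep from M1 would be repeated to confirm nothing leaked.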


@@ -0,0 +1,87 @@
# STATUS — phase pvcheck (post-proxy verification)
**Updated:** 2026-06-13T06:02Z
**Phase:** pvcheck
**Builder:** autonomic-bot
---
## Gate: M1 — CLAIMED, awaiting Adversary
### M1 — Control plane and routing verified
**Claim:** All cc-ci control-plane routes/services are healthy after the proxy recreation. Before/after evidence captured.
#### How to verify (run cold from Adversary's clone on cc-ci host):
```bash
# 1. Proxy subnet and endpoint count
ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}Subnet: {{.Subnet}}{{end}}, Endpoints: {{len .Containers}}"'
# EXPECTED: Subnet: 10.10.0.0/16, Endpoints: 7
# 2. All services healthy
ssh cc-ci 'docker service ls --format "{{.Name}}\t{{.Replicas}}"'
# EXPECTED: all 9 services show 1/1
# 3. External routes
curl -sk -o /dev/null -w "%{http_code}" https://ci.commoninternet.net/ # EXPECTED: 200
curl -sk -o /dev/null -w "%{http_code}" https://drone.ci.commoninternet.net/ # EXPECTED: 303
curl -sk -o /dev/null -w "%{http_code}" https://report.ci.commoninternet.net/ # EXPECTED: 200
# 4. No VIP exhaustion since proxy recreation (05:38Z)
ssh cc-ci 'journalctl -u docker --since "2026-06-13 05:38:00" | grep -c "available IP while allocating VIP"'
# EXPECTED: 0
# 5. Upgrade-all Step-0 guard exists and is correct
grep -A5 "VIPFAIL" /srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md
# EXPECTED: guard logic checking for "available IP while allocating VIP" signature
```
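The expected values above could also be compared mechanically. A minimal sketch, assuming bash and curl on the verifying machine; `check_expect` and `check_route` are hypothetical helper names and are not part of the repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare an observed value with the expected one.
# Prints PASS/FAIL and returns non-zero on mismatch.
check_expect() {
  local name=$1 expected=$2 actual=$3
  if [ "$actual" = "$expected" ]; then
    printf 'PASS %s: %s\n' "$name" "$actual"
  else
    printf 'FAIL %s: expected %s, got %s\n' "$name" "$expected" "$actual"
    return 1
  fi
}

# Hypothetical helper: fetch a URL's status code with the same curl flags
# as step 3 and compare it with the expected code.
check_route() {
  local url=$1 expected=$2 code
  code=$(curl -sk -o /dev/null -w '%{http_code}' "$url") || code="curl-error"
  check_expect "$url" "$expected" "$code"
}
```

For example, `check_route https://drone.ci.commoninternet.net/ 303` would report PASS only if the redirect is still served.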
#### Evidence (Builder run 2026-06-13T06:00Z):
| Check | Command | Result |
|---|---|---|
| proxy subnet | `docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"` | `10.10.0.0/16` ✅ |
| proxy endpoints | `docker network inspect proxy --format "{{len .Containers}}"` | `7` (6 services + 1 lb) ✅ |
| proxy endpoint list | `docker network inspect proxy --format "{{range $k,$v := .Containers}}{{$v.Name}} {{end}}"` | drone, traefik, keycloak, reports, bridge, dashboard + lb-proxy ✅ |
| 9 services 1/1 | `docker service ls` | all 1/1 ✅ |
| ci.commoninternet.net | `curl -sk -o /dev/null -w "%{http_code}"` | `200` ✅ |
| drone.ci.commoninternet.net | same | `303` ✅ |
| report.ci.commoninternet.net | same | `200` ✅ |
| VIP exhaustion since 05:38Z | `journalctl -u docker --since "2026-06-13 05:38:00" \| grep -c "available IP while allocating VIP"` | `0` ✅ |
| transient errors at 05:35Z | "could not find network allocator STATE" for old net IDs | expected during recreation, before 05:38Z only ✅ |
| upgrade-all Step-0 guard | SKILL.md §0 lines 61-81 | guard checks exact signature, fires + restarts docker ✅ |
#### Before/after evidence:
| Metric | Before (pvfix) | After (pvcheck) |
|---|---|---|
| proxy subnet | `10.0.1.0/24` (254 IPs) | `10.10.0.0/16` (65534 IPs) |
| proxy endpoints | ~200 leaked (caused VIP exhaustion) | 7 (clean) |
| VIP exhaustion errors | recurring "could not find an available IP" | 0 since 05:38Z |
| Services healthy | intermittent failures | all 9 at 1/1 |
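The IP counts in the table follow from the prefix length: a /N IPv4 subnet has 2^(32-N) addresses, minus the network and broadcast addresses. A small sketch of that arithmetic; `usable_hosts` is a hypothetical helper name, and the result is a raw count that does not subtract addresses Docker itself may reserve (such as the gateway).

```shell
#!/usr/bin/env bash
# Hypothetical helper: usable host addresses in an IPv4 subnet with the given prefix length.
# Computes 2^(32 - prefix) - 2 (network and broadcast addresses excluded).
usable_hosts() {
  local prefix=$1
  echo $(( (1 << (32 - prefix)) - 2 ))
}

usable_hosts 24   # 254
usable_hosts 16   # 65534
```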
#### Adversary finding A2 fix:
[A2] upgrade-all SKILL.md stale description — **FIXED** in orchestrator repo commit `84e13a7` (2026-06-13T05:59Z).
Guard description updated from "safety net until that lands" → "belt-and-suspenders even after the /16 fix".
---
## M2 — IN PROGRESS
### Tasks for M2:
- [ ] Real deploy proof: trigger one recipe `!testme` or equivalent harness run through proxy
- [ ] Allocator-headroom proof: deploy/remove batch of throwaway stacks, confirm no VIP exhaustion
- [ ] Confirm no residue after cleanup
---
## Definition-of-Done checklist (pvcheck)
- [ ] Control-plane routes are healthy (M1 — claimed)
- [ ] One real proxy-joining recipe CI run succeeds and cleans up (M2)
- [ ] Bounded allocator reproduction documented (M2)
- [ ] Fresh logs show no VIP exhaustion (M1 — claimed, ongoing)
- [ ] Adversary signed off M1 in `machine-docs/REVIEW-pvcheck.md`
- [ ] Adversary signed off M2 in `machine-docs/REVIEW-pvcheck.md`