claim(pvcheck-M1): control plane and routing verified post-proxy-recreation
Some checks failed
continuous-integration/drone/push Build is failing
proxy subnet: 10.10.0.0/16, 7 endpoints (6 services + lb)
All 9 swarm services: 1/1
Routes: ci (200), drone (303), report (200)
VIP exhaustion since 05:38Z: 0 errors
Upgrade-all Step-0 guard confirmed in SKILL.md §0
[A2] SKILL.md stale description fixed (orchestrator commit 84e13a7)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -1,14 +1,20 @@
# BACKLOG - phase pvcheck (post-proxy verification)

## Build backlog
(Builder-owned; read-only to Adversary)

- [x] Create pvcheck phase files (STATUS, JOURNAL, BACKLOG)
- [x] Fix [A2] upgrade-all SKILL.md stale description (orchestrator commit 84e13a7)
- [x] Collect M1 evidence (proxy subnet, endpoints, service health, routes, VIP journal)
- [x] Claim M1: control plane and routing verified
- [ ] M2: real recipe CI run through proxy (harness or !testme)
- [ ] M2: bounded allocator headroom proof (deploy/remove throwaway stacks, confirm no VIP exhaustion)
- [ ] M2: cleanup verification (zero residue)
- [ ] M2: claim gate after M1 PASS

## Adversary findings

### [A2] upgrade-all SKILL.md guard description stale (2026-06-13T05:56Z)

- [x] Filed
- [ ] Builder fix
- [x] Builder fix: orchestrator commit `84e13a7` (2026-06-13T05:59Z) updated the guard description from "until that lands" to "belt-and-suspenders even after the /16 fix"
- [ ] Adversary re-verify and close

**Details:** `SKILL.md` line 81 still says the guard is "the per-run safety net until [the /16 fix] lands", but the fix has now landed. The guard logic is correct; this is documentation-only. Suggested fix: change to "belt-and-suspenders even after the /16 fix."

47
machine-docs/JOURNAL-pvcheck.md
Normal file
@@ -0,0 +1,47 @@
# JOURNAL - phase pvcheck (post-proxy verification)

Builder-private reasoning and working notes. Anti-anchoring: Adversary reads STATUS for claims, not this file.

---

## 2026-06-13T05:55–06:02Z - Phase orientation and M1 data collection

Phase pvfix is DONE. Entered pvcheck. No phase files existed yet; the Adversary had proactively created REVIEW-pvcheck.md and BACKLOG-pvcheck.md with a baseline probe at 05:56Z.

**Adversary baseline findings (from REVIEW-pvcheck.md):**
- All preconditions verified cold (pvfix DONE, proxy /16 live, all services 1/1, all routes 200/303)
- [A2]: stale text in upgrade-all SKILL.md, "per-run safety net until that lands" (the proxy /16 fix HAS landed)

**My verification runs:**
```
$ ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}, Endpoints: {{len .Containers}}"'
10.10.0.0/16, Endpoints: 7

$ curl -sk -o /dev/null -w "%{http_code}" https://ci.commoninternet.net/ → 200
$ curl -sk -o /dev/null -w "%{http_code}" https://drone.ci.commoninternet.net/ → 303
$ curl -sk -o /dev/null -w "%{http_code}" https://report.ci.commoninternet.net/ → 200

$ ssh cc-ci 'journalctl -u docker --since "2026-06-13 05:38:00" | grep -c "available IP while allocating VIP"'
0
```

The "could not find network allocator STATE" errors in the 05:35Z window are expected transient noise: they occur when swarm tries to allocate VIPs for the old deleted /24 network IDs (mlxau8…, 85p3aq…) during the recreation. They do not carry the "available IP while allocating VIP" signature of actual exhaustion.
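
The distinction between the two signatures can be made mechanical by counting them separately. The following is a minimal sketch, assuming the docker journal text is piped in on stdin; the function name `count_vip_signatures` is hypothetical, and the two match strings are the ones quoted in the paragraph above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): count the two log signatures
# separately. Reads log text on stdin, prints "exhaustion=<n> transient=<n>".
count_vip_signatures() {
  local log exhaustion transient
  log="$(cat)"
  # grep -c prints the match count but exits 1 when the count is 0;
  # "|| true" keeps that from aborting callers that run under `set -e`.
  exhaustion="$(printf '%s\n' "$log" | grep -c 'available IP while allocating VIP' || true)"
  transient="$(printf '%s\n' "$log" | grep -c 'could not find network allocator STATE' || true)"
  printf 'exhaustion=%s transient=%s\n' "$exhaustion" "$transient"
}
```

One possible invocation, reusing the journal command from the verification runs: `ssh cc-ci 'journalctl -u docker --since "2026-06-13 05:38:00"' | count_vip_signatures`. Only a non-zero `exhaustion` count would contradict the M1 claim.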

**A2 fix applied:**
- Edited `/srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md` lines 80-81
- Committed to orchestrator repo as `84e13a7`
- Guard logic unchanged; only the description now reflects reality (durable fix has landed)

**Decision on bridge /hook:** bridge is exposed at `PathPrefix(/hook)` and only accepts POST (webhook). A GET to `/hook` returns 404, which is expected; health is confirmed via service logs showing the poller running and commenting on repos.

**M1 claim:** All control-plane facts documented. Claiming M1 now. Will work on M2 while awaiting verdict.

---

## 2026-06-13T06:02Z - M2 planning

M2 requires:
1. Real recipe CI run through proxy: will use a small enrolled recipe like `hedgedoc` or `cryptpad` if a !testme PR exists, or trigger via the harness directly
2. Allocator headroom proof: deploy/remove 3-5 throwaway stacks with published ports (simulating concurrent deploys), confirm endpoint count stays small and no VIP exhaustion

Will check what enrolled recipes have open PRs available for !testme first.
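
The allocator headroom proof in item 2 could take roughly the following shape. This is a sketch only, not the Builder's actual procedure: the stack name prefix `pvcheck-tmp`, the compose file name `throwaway-stack.yml`, and the `RUN` dry-run switch are all hypothetical. By default the function only prints the docker commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of a bounded deploy/remove loop for throwaway stacks.
# Assumptions (not from the journal): stack names "pvcheck-tmp-<i>", a compose
# file named by $COMPOSE that joins the proxy network, and RUN as a dry-run
# switch. RUN defaults to "echo" (print only); export RUN="" to execute.
headroom_probe() {
  local count="${1:-3}"
  local compose="${COMPOSE:-throwaway-stack.yml}"
  local run="${RUN-echo}"
  local i

  # Deploy the throwaway stacks one after another.
  for i in $(seq 1 "$count"); do
    $run docker stack deploy -c "$compose" "pvcheck-tmp-$i"
  done

  # Endpoint count on the proxy network while the stacks are up.
  $run docker network inspect proxy --format '{{len .Containers}}'

  # Remove every stack that was deployed.
  for i in $(seq 1 "$count"); do
    $run docker stack rm "pvcheck-tmp-$i"
  done
}
```

Running `headroom_probe 3` with `RUN` unset prints the planned commands for review. Because `docker stack rm` can return before all resources are gone, a real run would probably need a wait before the residue check and before re-running the journal query for the exhaustion signature.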
87
machine-docs/STATUS-pvcheck.md
Normal file
@@ -0,0 +1,87 @@
# STATUS - phase pvcheck (post-proxy verification)

**Updated:** 2026-06-13T06:02Z
**Phase:** pvcheck
**Builder:** autonomic-bot

---

## Gate: M1 - CLAIMED, awaiting Adversary

### M1 - Control plane and routing verified

**Claim:** All cc-ci control-plane routes/services are healthy after the proxy recreation. Before/after evidence captured.

#### How to verify (run cold from Adversary's clone on cc-ci host):

```bash
# 1. Proxy subnet and endpoint count
ssh cc-ci 'docker network inspect proxy --format "{{range .IPAM.Config}}Subnet: {{.Subnet}}{{end}}, Endpoints: {{len .Containers}}"'
# EXPECTED: Subnet: 10.10.0.0/16, Endpoints: 7

# 2. All services healthy
ssh cc-ci 'docker service ls --format "{{.Name}}\t{{.Replicas}}"'
# EXPECTED: all 9 services show 1/1

# 3. External routes
curl -sk -o /dev/null -w "%{http_code}" https://ci.commoninternet.net/ # EXPECTED: 200
curl -sk -o /dev/null -w "%{http_code}" https://drone.ci.commoninternet.net/ # EXPECTED: 303
curl -sk -o /dev/null -w "%{http_code}" https://report.ci.commoninternet.net/ # EXPECTED: 200

# 4. No VIP exhaustion since proxy recreation (05:38Z)
# NOTE: --since without a zone suffix is read in the host's local time zone;
# this assumes cc-ci runs in UTC.
ssh cc-ci 'journalctl -u docker --since "2026-06-13 05:38:00" | grep -c "available IP while allocating VIP"'
# EXPECTED: 0

# 5. Upgrade-all Step-0 guard exists and is correct
grep -A5 "VIPFAIL" /srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md
# EXPECTED: guard logic checking for "available IP while allocating VIP" signature
```
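
The route checks in step 3 can also be made self-judging, so the Adversary does not have to compare status codes by eye. A minimal sketch, assuming two hypothetical helpers, `expect_code` and `http_code`, that are not part of the repo; the curl flags are the same ones used in step 3.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare an observed value with the expected one.
# Prints PASS or FAIL and returns non-zero on mismatch.
expect_code() {
  local label="$1" expected="$2" actual="$3"
  if [ "$actual" = "$expected" ]; then
    printf 'PASS %s: %s\n' "$label" "$actual"
  else
    printf 'FAIL %s: expected %s, got %s\n' "$label" "$expected" "$actual"
    return 1
  fi
}

# Hypothetical helper: fetch only the HTTP status code for a URL,
# using the same curl flags as step 3.
http_code() {
  curl -sk -o /dev/null -w '%{http_code}' "$1"
}

# Example use (left commented out so sourcing this file makes no requests):
#   expect_code ci     200 "$(http_code https://ci.commoninternet.net/)"
#   expect_code drone  303 "$(http_code https://drone.ci.commoninternet.net/)"
#   expect_code report 200 "$(http_code https://report.ci.commoninternet.net/)"
```

Keeping the comparison separate from the fetch means the same `expect_code` helper could be reused for the endpoint count or the journal grep count.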

#### Evidence (Builder run 2026-06-13T06:00Z):

| Check | Command | Result |
|---|---|---|
| proxy subnet | `docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"` | `10.10.0.0/16` ✅ |
| proxy endpoints | `docker network inspect proxy --format "{{len .Containers}}"` | `7` (6 service + 1 lb) ✅ |
| proxy endpoint list | `docker network inspect proxy --format "{{range $k,$v := .Containers}}{{$v.Name}}{{end}}"` | drone, traefik, keycloak, reports, bridge, dashboard + lb-proxy ✅ |
| 9 services 1/1 | `docker service ls` | all 1/1 ✅ |
| ci.commoninternet.net | `curl -sk -o /dev/null -w "%{http_code}"` | `200` ✅ |
| drone.ci.commoninternet.net | same | `303` ✅ |
| report.ci.commoninternet.net | same | `200` ✅ |
| VIP exhaustion since 05:38Z | `journalctl -u docker --since "2026-06-13 05:38:00" \| grep -c "available IP while allocating VIP"` | `0` ✅ |
| transient errors at 05:35Z | "could not find network allocator STATE" for old net IDs | expected during recreation, pre-05:38Z only ✅ |
| upgrade-all Step-0 guard | SKILL.md §0 lines 61-81 | guard checks exact signature, fires + restarts docker ✅ |

#### Before/after evidence:

| Metric | Before (pvfix) | After (pvcheck) |
|---|---|---|
| proxy subnet | `10.0.1.0/24` (254 IPs) | `10.10.0.0/16` (65534 IPs) |
| proxy endpoints | ~200 leaked (caused VIP exhaustion) | 7 (clean) |
| VIP exhaustion errors | recurring "could not find an available IP" | 0 since 05:38Z |
| Services healthy | intermittent failures | all 9 at 1/1 |
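
The IP counts in the proxy subnet row follow from the prefix length: an IPv4 subnet with prefix /p spans 2^(32-p) addresses, of which the network and broadcast addresses are conventionally not assignable. A small sketch of that arithmetic, using a hypothetical `usable_ips` function:

```shell
#!/usr/bin/env bash
# Hypothetical helper: usable host addresses for an IPv4 prefix length,
# computed as 2^(32 - prefix) minus the network and broadcast addresses.
usable_ips() {
  local prefix="$1"
  echo $(( (1 << (32 - prefix)) - 2 ))
}

usable_ips 24   # prints 254
usable_ips 16   # prints 65534
```

This matches the 254 and 65534 figures in the table. It does not account for any further addresses the overlay network itself may reserve.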

#### Adversary finding A2 fix:

[A2] upgrade-all SKILL.md stale description: **FIXED** in orchestrator repo commit `84e13a7` (2026-06-13T05:59Z).
Guard description updated from "safety net until that lands" → "belt-and-suspenders even after the /16 fix".

---

## M2 - IN PROGRESS

### Tasks for M2:
- [ ] Real deploy proof: trigger one recipe `!testme` or equivalent harness run through proxy
- [ ] Allocator-headroom proof: deploy/remove batch of throwaway stacks, confirm no VIP exhaustion
- [ ] Confirm no residue after cleanup

---

## Definition-of-Done checklist (pvcheck)

- [ ] Control-plane routes are healthy (M1: claimed)
- [ ] One real proxy-joining recipe CI run succeeds and cleans up (M2)
- [ ] Bounded allocator reproduction documented (M2)
- [ ] Fresh logs show no VIP exhaustion (M1: claimed, ongoing)
- [ ] Adversary signed off M1 in `machine-docs/REVIEW-pvcheck.md`
- [ ] Adversary signed off M2 in `machine-docs/REVIEW-pvcheck.md`