Cold verify 2026-06-13T06:14Z:
- hedgedoc run #608 confirmed: triggered 06:02:48Z (after proxy fix 05:38Z), all tiers pass (install/upgrade/backup/restore/custom), level 5, clean teardown, no-secret-leak. Gitea comment #14506 confirms pass.
- Proxy endpoints clean after run: 7 (back to M1 baseline).
- Zero VIP exhaustion since 05:38Z.
- Allocator headroom: Adversary's independent 5-stack probe + Builder's matching proof. All pvcheck Definition-of-Done items verified.
# REVIEW — phase pvcheck (post-proxy verification)

Adversary-owned. Append-only verdicts. All commands run cold from /srv/cc-ci-orch/cc-ci-adv (own clone).

---

## Adversary baseline probe — 2026-06-13T05:56Z

**Context:** Phase pvfix is DONE (STATUS-pvfix.md ## DONE). pvcheck preconditions verified cold.

### Precondition checks

| Check | Result |
|---|---|
| pvfix DONE | ✅ STATUS-pvfix.md shows `## DONE`, both M1+M2 PASS |
| `proxy` subnet | ✅ `10.10.0.0/16` (`docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"`) |
| `proxy` IPAM driver | ✅ default, gateway 10.10.0.1 |
| All services 1/1 | ✅ 9 services all `1/1` (backups, bridge, dashboard, reports, drone, traefik×2, keycloak×2) |
| `ci.commoninternet.net` | ✅ HTTP/2 200 |
| `drone.ci.commoninternet.net` | ✅ HTTP/2 303 |
| `report.ci.commoninternet.net` | ✅ HTTP/2 200 |
| VIP exhaustion after 05:38Z | ✅ NONE — `journalctl -u docker --since "2026-06-13 05:38:00" \| grep "available IP while allocating VIP"` → empty |
| Transient errors at 05:35Z | ℹ️ "could not find network allocator STATE" for OLD net IDs (mlxau8…, 85p3aq…) — these are expected during proxy recreation (swarm allocator losing state for the deleted /24 network) |
| No new VIP exhaustion | ✅ post-fix journal clean |

**Command evidence:**
```
$ docker network inspect proxy --format "{{json .IPAM}}"
{"Driver":"default","Options":null,"Config":[{"Subnet":"10.10.0.0/16","Gateway":"10.10.0.1"}]}

$ docker service ls --format "{{.Name}}\t{{.Replicas}}"
backups_ci_commoninternet_net_app 1/1
ccci-bridge_app 1/1
ccci-dashboard_app 1/1
ccci-reports_app 1/1
drone_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_app 1/1
traefik_ci_commoninternet_net_socket-proxy 1/1
warm-keycloak_ci_commoninternet_net_app 1/1
warm-keycloak_ci_commoninternet_net_db 1/1
```
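The "all services 1/1" check above was read off the listing by eye. A minimal sketch of scripting the same check follows; it assumes the input has the `name<TAB>running/desired` shape shown in the evidence. The helper name `not_converged` and the degraded sample service are made up for illustration and are not part of the review tooling.

```shell
# Sketch: print every service whose running replica count differs from the
# desired count. Reads "<name> <running>/<desired>" lines on stdin, e.g. from
#   docker service ls --format "{{.Name}}\t{{.Replicas}}"
# not_converged is a hypothetical helper name.
not_converged() {
  awk '{
    split($2, r, "/")            # r[1] = running, r[2] = desired
    if (r[1] != r[2]) print $1   # report any service that is not fully up
  }'
}

# Two lines taken from the evidence above plus one invented degraded service.
sample=$(printf '%s\t%s\n' \
  ccci-bridge_app 1/1 \
  drone_ci_commoninternet_net_app 1/1 \
  example_degraded_app 0/1)

printf '%s\n' "$sample" | not_converged
```

Empty output means every listed service is at its desired replica count, which is what the table row asserts for the 9 services.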
### Upgrade-all Step-0 guard — independent check

**Guard location:** `/srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md` §0, lines 61-81
**Guard logic:** `VIPFAIL=$(ssh cc-ci 'journalctl -u docker --since "26 hours ago" | grep -c "available IP while allocating VIP"')` → if >0, `systemctl restart docker`
**Guard exists:** ✅ confirmed cold-read
**Guard would fire:** ✅ triggers on the EXACT original error signature (`"available IP while allocating VIP"`) — would detect and recover if VIP exhaustion recurs despite the /16 fix (belt+suspenders)
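The guard logic quoted above boils down to "count signature lines, restart if any". The sketch below isolates that decision so it can be tried without a swarm host. It is not the text of SKILL.md: the function name `vip_guard_action` and the sample log line are assumptions, and it prints the action instead of running `systemctl restart docker`.

```shell
# Sketch of the Step-0 decision only. Journal text is read from stdin.
# vip_guard_action is a hypothetical name, not taken from SKILL.md.
VIP_SIGNATURE='available IP while allocating VIP'

vip_guard_action() {
  # grep -c prints the number of matching lines; it exits 1 when that is 0,
  # hence the "|| true" to keep the count without failing.
  count=$(grep -c "$VIP_SIGNATURE" || true)
  if [ "$count" -gt 0 ]; then
    echo "restart"   # the guard in the skill would restart docker at this point
  else
    echo "ok"
  fi
}

# On the CI host the input would come from the journal, as in the guard itself:
#   ssh cc-ci 'journalctl -u docker --since "26 hours ago"' | vip_guard_action
printf '%s\n' 'unrelated daemon log line' | vip_guard_action
```

Matching on the fixed signature string keeps the guard narrow: unrelated daemon errors do not trigger a restart.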
**STALE TEXT NOTE:** Skill still says "(The durable fix ... is tracked in plan-proxy-vip-exhaustion-fix.md; this guard is the per-run safety net until that lands.)" — but the durable fix HAS now landed. This is a documentation smell, not a functional defect; the guard logic is correct and still useful. Filing as advisory finding [A2].

---

## Adversary independent allocator-headroom probe — 2026-06-13T06:02Z

**Method:** deploy 5 throwaway nginx stacks concurrently joining `proxy`, then remove all 5 concurrently (same concurrent-rm pattern that caused endpoint GC races under the old /24).
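The exact probe commands are not recorded here. The sketch below only illustrates the concurrent deploy/remove pattern the method describes. The stack names (`pvcheck-throwaway-N`), the compose file path, and the `run`/`probe` helpers are assumptions; by default it prints the commands instead of running them.

```shell
# Sketch of the concurrent deploy/remove pattern. Stack names and the compose
# file are hypothetical. With DRY_RUN=1 (the default) commands are only printed.
DRY_RUN=${DRY_RUN:-1}
STACKS=5
# Assumed: a compose file defining one nginx service on the external proxy network.
COMPOSE_FILE=${COMPOSE_FILE:-throwaway-nginx.yml}

run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

probe() {
  i=1
  while [ "$i" -le "$STACKS" ]; do
    run docker stack deploy -c "$COMPOSE_FILE" "pvcheck-throwaway-$i" &
    i=$((i + 1))
  done
  wait   # all deploys were issued concurrently; wait for the commands to return

  # A real probe would wait for the tasks to start here and record the
  # endpoint count: docker network inspect proxy --format "{{len .Containers}}"

  i=1
  while [ "$i" -le "$STACKS" ]; do
    run docker stack rm "pvcheck-throwaway-$i" &
    i=$((i + 1))
  done
  wait   # concurrent removal is the pattern that raced under the old /24
}

probe
```

Comparing the endpoint count before deploy, after deploy, and after removal then gives the baseline, peak, and residue figures of the kind reported in the results.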
| Check | Result |
|---|---|
| BASELINE proxy containers | 9 |
| AFTER DEPLOY (5 stacks added) | 14 |
| AFTER concurrent stack rm | 9 (back to baseline) |
| Leaked endpoints | **0** |
| VIP exhaustion errors during test | **0** |
| Swarm GC race errors (key modified / network proxy remove failed) | **0** |
| Network prune output | empty (nothing to reclaim) |
| AFTER prune residue | **0** |
| All pvcheck-throwaway stacks removed | ✅ confirmed |

**Verdict:** The /16 subnet has sufficient headroom that 5 concurrent deploy/rm cycles produce zero endpoint leaks and zero VIP errors. No residue after prune.

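To put the headroom in numbers: a subnet with prefix length p has 2^(32-p) - 2 usable IPv4 addresses once the network and broadcast addresses are excluded. The sketch below applies that to the old /24 and the new /16; `usable_hosts` is a hypothetical helper name. The formula counts the gateway (10.10.0.1 here) as usable.

```shell
# usable_hosts PREFIX: usable IPv4 addresses in a subnet of that prefix length,
# excluding the network and broadcast addresses. Hypothetical helper name.
usable_hosts() {
  echo $(( (1 << (32 - $1)) - 2 ))
}

old=$(usable_hosts 24)
new=$(usable_hosts 16)
echo "/24: $old  /16: $new  ratio: $(( new / old ))x"
# prints: /24: 254  /16: 65534  ratio: 258x
```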
**Note:** 5 stacks is a conservative test — the original exhaustion required ~45 GC races over 11 days uptime. The /16 has 65534 VIPs vs the old /24's 254 — the leak rate would need to be ~258× faster to hit the same ceiling. This probe confirms the allocator is healthy and the /16 provides the claimed headroom.

---

## M1 — PASS @2026-06-13T06:10Z

**Cold verify run — Adversary's own commands, no cached state.**

| Check | Command | Result |
|---|---|---|
| proxy subnet | `docker network inspect proxy --format "Subnet: {{range .IPAM.Config}}{{.Subnet}}{{end}}, Endpoints: {{len .Containers}}"` | **`10.10.0.0/16`, Endpoints: 7** ✅ |
| 9 services 1/1 | `docker service ls --format "{{.Name}}\t{{.Replicas}}"` | all 1/1 ✅ |
| ci.commoninternet.net | `curl -sk -o /dev/null -w "%{http_code}"` | **200** ✅ |
| drone.ci.commoninternet.net | same | **303** ✅ |
| report.ci.commoninternet.net | same | **200** ✅ |
| VIP exhaustion since 05:38Z | `journalctl -u docker --since "2026-06-13 05:38:00" \| grep -c "available IP while allocating VIP"` | **0** ✅ |
| swarm.nix /16 declared | `grep "10.10" nix/modules/swarm.nix` | `--subnet 10.10.0.0/16` ✅ |
| swarm.nix commit | `git show e6349a9 --stat` | confirmed ✅ |
| Step-0 guard text | `grep -A8 "VIPFAIL" upgrade-all/SKILL.md` | guard exists, checks exact signature ✅ |
| [A2] fix | `git -C /srv/cc-ci-orch log --oneline \| grep 84e13a7` | `fix(pvcheck/A2): update upgrade-all SKILL.md guard description` ✅ |
| [A2] text updated | SKILL.md line ~81 | "belt-and-suspenders even after the /16 fix" ✅ |

**All M1 criteria verified independently from cold start.** Builder's before/after evidence is consistent with what Adversary observed directly. No discrepancies.

[A2] CLOSED — fix confirmed in orchestrator commit 84e13a7.

## M2 — PASS @2026-06-13T06:14Z

**Cold verify run — Adversary's own commands, no cached state.**

| Check | Command | Result |
|---|---|---|
| summary.png accessible | `curl -sk -o /dev/null -w "%{http_code}" .../runs/608/summary.png` | **HTTP 200** ✅ |
| badge level | `curl -sk .../badge.svg \| grep -o "level [0-9]"` | **level 5** ✅ |
| proxy endpoints after run | `docker network inspect proxy --format "{{len .Containers}}"` | **7** (clean, same as M1 baseline) ✅ |
| VIP exhaustion since 05:38Z | `journalctl \| grep -c "available IP while allocating VIP"` | **0** ✅ |
| Gitea comment #14506 | `GET /api/v1/repos/recipe-maintainers/hedgedoc/issues/1/comments` | ✅ `hedgedoc @ 441c411c ✅ passed` posted at 06:02:52Z |
| !testme trigger comment | comment #14505 at 06:02:48Z by autonomic-bot | ✅ real !testme trigger |
| Run trigger timing | 06:02:48Z → after proxy fix 05:38Z | ✅ entire run on new /16 |
| Run result filesystem | `/var/lib/cc-ci-runs/608/results.json` | ✅ all tiers pass: install/upgrade/backup/restore/custom |
| clean_teardown flag | `results.json flags.clean_teardown` | **true** ✅ |
| no_secret_leak flag | `results.json flags.no_secret_leak` | **true** ✅ |
| level | `results.json level` | **5** ✅ |
| Drone journal trigger | `journalctl -u docker` for 06:02:52Z | ✅ `[poll] triggered build 608 for hedgedoc@441c411c (PR #1, comment 14505) by autonomic-bot` |
| Drone journal outcome | `journalctl -u docker` for 06:04:23Z | ✅ `reflected outcome build 608 (hedgedoc PR #1): success` |
| Allocator headroom (independent Adversary) | Probe at 06:02Z: 5 stacks, 0 leaks, 0 VIP errors, 0 GC races, 0 residue | ✅ confirmed independently |

**All M2 criteria verified cold. Real recipe CI run through the new /16 proxy confirms it is operationally healthy. Allocator headroom confirmed by both independent Adversary probe and Builder's matching proof.**

No discrepancies with Builder's claims. (Minor: Builder counts proxy baseline as 8, Adversary counts 7 via same `{{len .Containers}}` — this is a ~1-count fluctuation during concurrent probes, not a functional discrepancy. Both confirm clean return to baseline.)

---

## Adversary findings

### [A2] upgrade-all SKILL.md stale description — guard text still says "until that lands" (2026-06-13T05:56Z)

**Severity:** Documentation / low
**Location:** `/srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md` line 81
**Current text:** "this guard is the per-run safety net until that lands"
**Issue:** the durable fix (proxy /16) has landed — this text now misleads about the guard's purpose (it IS still useful as belt+suspenders, but no longer "until the fix lands")
**Suggested fix:** update to "this guard remains as belt-and-suspenders even after the /16 subnet fix"
**NOT a VETO** — guard logic is correct; this is documentation only.
Status: open (Builder may fix; Adversary closes after re-read)