autonomic-bot a1c8003187

review(pvcheck-M2): M2 PASS — real CI run + allocator proof verified cold

Cold verify 2026-06-13T06:14Z:
- hedgedoc run #608 confirmed: triggered 06:02:48Z (after proxy fix 05:38Z),
  all tiers pass (install/upgrade/backup/restore/custom), level 5, clean teardown,
  no-secret-leak. Gitea comment #14506 confirms pass.
- Proxy endpoints clean after run: 7 (back to M1 baseline).
- Zero VIP exhaustion since 05:38Z.
- Allocator headroom: Adversary's independent 5-stack probe + Builder's matching proof.
All pvcheck Definition-of-Done items verified.

2026-06-13 06:07:47 +00:00

REVIEW — phase pvcheck (post-proxy verification)

Adversary-owned. Append-only verdicts. All commands run cold from /srv/cc-ci-orch/cc-ci-adv (own clone).

Adversary baseline probe — 2026-06-13T05:56Z

Context: Phase pvfix is DONE (STATUS-pvfix.md ## DONE). pvcheck preconditions verified cold.

Precondition checks

| Check | Result |
| --- | --- |
| pvfix DONE | ✅ STATUS-pvfix.md shows `## DONE`, both M1+M2 PASS |
| `proxy` subnet | ✅ `10.10.0.0/16` (`docker network inspect proxy --format "{{range .IPAM.Config}}{{.Subnet}}{{end}}"`) |
| `proxy` IPAM driver | ✅ default, gateway 10.10.0.1 |
| All services 1/1 | ✅ 9 services all `1/1` (backups, bridge, dashboard, reports, drone, traefik×2, keycloak×2) |
| `ci.commoninternet.net` | ✅ HTTP/2 200 |
| `drone.ci.commoninternet.net` | ✅ HTTP/2 303 |
| `report.ci.commoninternet.net` | ✅ HTTP/2 200 |
| VIP exhaustion after 05:38Z | ✅ NONE in `journalctl -u docker --since "2026-06-13 05:38:00"` |
| Transient errors at 05:35Z | ℹ️ "could not find network allocator STATE" for OLD net IDs (mlxau8…, 85p3aq…). These are expected during proxy recreation (swarm allocator losing state for the deleted /24 network) |
| No new VIP exhaustion | ✅ post-fix journal clean |

Command evidence:

$ docker network inspect proxy --format "{{json .IPAM}}"
{"Driver":"default","Options":null,"Config":[{"Subnet":"10.10.0.0/16","Gateway":"10.10.0.1"}]}

$ docker service ls --format "{{.Name}}\t{{.Replicas}}"
backups_ci_commoninternet_net_app	1/1
ccci-bridge_app	1/1
ccci-dashboard_app	1/1
ccci-reports_app	1/1
drone_ci_commoninternet_net_app	1/1
traefik_ci_commoninternet_net_app	1/1
traefik_ci_commoninternet_net_socket-proxy	1/1
warm-keycloak_ci_commoninternet_net_app	1/1
warm-keycloak_ci_commoninternet_net_db	1/1
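
The replica check above can be automated by filtering the same `docker service ls` output for anything that is not at `1/1`. A minimal sketch, assuming the two-column name/replicas format shown in the command evidence; the helper name `not_converged` is hypothetical and not part of the review tooling:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "name<whitespace>replicas" lines on stdin and
# prints only the services whose replica column is not exactly 1/1.
not_converged() {
  awk 'NF >= 2 && $2 != "1/1" { print $1 " " $2 }'
}

# Example against a live swarm (assumes docker is reachable from this shell):
#   docker service ls --format "{{.Name}}\t{{.Replicas}}" | not_converged
# Empty output means every listed service is at 1/1.
```

Reading from stdin keeps the filter usable on saved output as well as on a live node.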

Upgrade-all Step-0 guard — independent check

Guard location: /srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md §0, lines 61-81
Guard logic: VIPFAIL=$(ssh cc-ci 'journalctl -u docker --since "26 hours ago" | grep -c "available IP while allocating VIP"') → if >0, systemctl restart docker
Guard exists: ✅ confirmed cold-read
Guard would fire: ✅ triggers on the EXACT original error signature ("available IP while allocating VIP") — would detect and recover if VIP exhaustion recurs despite the /16 fix (belt+suspenders)
STALE TEXT NOTE: Skill still says "(The durable fix ... is tracked in plan-proxy-vip-exhaustion-fix.md; this guard is the per-run safety net until that lands.)" — but the durable fix HAS now landed. This is a documentation smell, not a functional defect; the guard logic is correct and still useful. Filing as advisory finding [A2].
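
The guard logic described above amounts to counting journal lines that carry the exhaustion signature and restarting Docker when the count is nonzero. A minimal sketch of that idea, not the SKILL.md text itself: the function names are hypothetical, the `cc-ci` ssh alias and 26-hour window come from the guard description, and this version greps locally rather than on the remote host:

```shell
#!/usr/bin/env bash
SIGNATURE="available IP while allocating VIP"

# Count lines on stdin that carry the exhaustion signature.
# grep -c prints 0 but exits 1 when nothing matches, hence the "|| true".
count_vip_failures() {
  grep -c "$SIGNATURE" || true
}

# Guard body (assumption: an ssh alias named "cc-ci" resolves to the CI host).
vip_guard() {
  local vipfail
  vipfail=$(ssh cc-ci 'journalctl -u docker --since "26 hours ago"' | count_vip_failures)
  if [ "$vipfail" -gt 0 ]; then
    echo "VIP exhaustion seen ${vipfail} time(s); restarting docker" >&2
    ssh cc-ci 'systemctl restart docker'
  fi
}
```

Note that a failed ssh connection yields an empty journal and therefore a count of 0, so a real guard would likely want to check the ssh exit status separately.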

Adversary independent allocator-headroom probe — 2026-06-13T06:02Z

Method: deploy 5 throwaway nginx stacks concurrently joining proxy, then remove all 5 concurrently (same concurrent-rm pattern that caused endpoint GC races under the old /24).

| Check | Result |
| --- | --- |
| BASELINE proxy containers | 9 |
| AFTER DEPLOY (5 stacks added) | 14 |
| AFTER concurrent stack rm | 9 (back to baseline) |
| Leaked endpoints | 0 |
| VIP exhaustion errors during test | 0 |
| Swarm GC race errors (key modified / network proxy remove failed) | 0 |
| Network prune output | empty (nothing to reclaim) |
| AFTER prune residue | 0 |
| All pvcheck-throwaway stacks removed | ✅ confirmed |

Verdict: The /16 subnet has sufficient headroom that 5 concurrent deploy/rm cycles produce zero endpoint leaks and zero VIP errors. No residue after prune.
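
The probe method above (concurrent deploy, concurrent remove, compare against baseline) can be sketched as follows. This is an illustration, not the commands the Adversary actually ran: the compose file path, the settle delay, and the numbered stack naming are assumptions, and `{{len .Containers}}` only counts containers visible on the node where it runs.

```shell
#!/usr/bin/env bash
# Assumptions (not from the review): compose file path, settle delay, and the
# "-N" suffix on stack names are illustrative placeholders.
STACK_PREFIX="${STACK_PREFIX:-pvcheck-throwaway}"
COMPOSE_FILE="${COMPOSE_FILE:-./throwaway-nginx.yml}"
SETTLE_SECONDS="${SETTLE_SECONDS:-30}"

# Print N stack names, one per line.
stack_names() {
  local i
  for i in $(seq 1 "$1"); do
    echo "${STACK_PREFIX}-${i}"
  done
}

# Number of containers attached to the proxy network on this node.
proxy_containers() {
  docker network inspect proxy --format '{{len .Containers}}'
}

# Deploy N stacks in parallel, remove them in parallel, report the delta.
run_probe() {
  local n="${1:-5}" baseline after s
  baseline=$(proxy_containers)
  for s in $(stack_names "$n"); do
    docker stack deploy -c "$COMPOSE_FILE" "$s" &
  done
  wait
  sleep "$SETTLE_SECONDS"            # give tasks time to start
  echo "after deploy: $(proxy_containers)"
  for s in $(stack_names "$n"); do
    docker stack rm "$s" &
  done
  wait
  sleep "$SETTLE_SECONDS"            # stack rm returns before teardown finishes
  after=$(proxy_containers)
  echo "baseline=${baseline} after_rm=${after} leaked=$(( after - baseline ))"
}
```

Calling `run_probe 5` would mirror the 5-stack test; a `leaked` value of 0 corresponds to the "back to baseline" row in the table.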

Note: 5 stacks is a conservative test. The original exhaustion required ~45 GC races over 11 days of uptime. The /16 has 65534 VIPs vs the old /24's 254, so the leak rate would need to be ~258× faster to hit the same ceiling. This probe confirms the allocator is healthy and the /16 provides the claimed headroom.
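
The address counts and the ~258× ratio in the note above follow from the prefix lengths. A small worked check, assuming the usual convention of subtracting the network and broadcast addresses; the function name `usable_hosts` is hypothetical:

```shell
#!/usr/bin/env bash
# Usable host addresses in an IPv4 subnet with the given prefix length
# (total addresses minus the network and broadcast addresses).
usable_hosts() {
  echo $(( (1 << (32 - $1)) - 2 ))
}

new=$(usable_hosts 16)            # 65534
old=$(usable_hosts 24)            # 254
echo "ratio: $(( new / old ))x"   # integer division gives 258
```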

M1 — PASS @2026-06-13T06:10Z

Cold verify run — Adversary's own commands, no cached state.

| Check | Command | Result |
| --- | --- | --- |
| proxy subnet | `docker network inspect proxy --format "Subnet: {{range .IPAM.Config}}{{.Subnet}}{{end}}, Endpoints: {{len .Containers}}"` | `10.10.0.0/16`, Endpoints: 7 ✅ |
| 9 services 1/1 | `docker service ls --format "{{.Name}}\t{{.Replicas}}"` | all 1/1 ✅ |
| ci.commoninternet.net | `curl -sk -o /dev/null -w "%{http_code}"` | 200 ✅ |
| drone.ci.commoninternet.net | same | 303 ✅ |
| report.ci.commoninternet.net | same | 200 ✅ |
| VIP exhaustion since 05:38Z | `journalctl -u docker --since "2026-06-13 05:38:00" \| grep -c "available IP while allocating VIP"` | 0 ✅ |
| swarm.nix /16 declared | `grep "10.10" nix/modules/swarm.nix` | `--subnet 10.10.0.0/16` ✅ |
| swarm.nix commit | `git show e6349a9 --stat` | confirmed ✅ |
| Step-0 guard text | `grep -A8 "VIPFAIL" upgrade-all/SKILL.md` | guard exists, checks exact signature ✅ |
| [A2] fix | `git -C /srv/cc-ci-orch log --oneline \| grep 84e13a7` | `fix(pvcheck/A2): update upgrade-all SKILL.md guard description` ✅ |
| [A2] text updated | SKILL.md line ~81 | "belt-and-suspenders even after the /16 fix" ✅ |

All M1 criteria verified independently from cold start. Builder's before/after evidence is consistent with what Adversary observed directly. No discrepancies.

[A2] CLOSED — fix confirmed in orchestrator commit 84e13a7.

M2 — PASS @2026-06-13T06:14Z

Cold verify run — Adversary's own commands, no cached state.

| Check | Command | Result |
| --- | --- | --- |
| summary.png accessible | `curl -sk -o /dev/null -w "%{http_code}" .../runs/608/summary.png` | HTTP 200 ✅ |
| badge level | `curl -sk .../badge.svg \| grep -o "level [0-9]"` | level 5 ✅ |
| proxy endpoints after run | `docker network inspect proxy --format "{{len .Containers}}"` | 7 (clean, same as M1 baseline) ✅ |
| VIP exhaustion since 05:38Z | `journalctl \| grep -c "available IP while allocating VIP"` | 0 ✅ |
| Gitea comment #14506 | `GET /api/v1/repos/recipe-maintainers/hedgedoc/issues/1/comments` | ✅ `hedgedoc @ 441c411c ✅ passed` posted at 06:02:52Z |
| !testme trigger comment | comment #14505 at 06:02:48Z by autonomic-bot | ✅ real !testme trigger |
| Run trigger timing | 06:02:48Z → after proxy fix 05:38Z | ✅ entire run on new /16 |
| Run result filesystem | `/var/lib/cc-ci-runs/608/results.json` | ✅ all tiers pass: install/upgrade/backup/restore/custom |
| clean_teardown flag | `results.json flags.clean_teardown` | true ✅ |
| no_secret_leak flag | `results.json flags.no_secret_leak` | true ✅ |
| level | `results.json level` | 5 ✅ |
| Drone journal trigger | `journalctl -u docker` for 06:02:52Z | ✅ `[poll] triggered build 608 for hedgedoc@441c411c (PR #1, comment 14505) by autonomic-bot` |
| Drone journal outcome | `journalctl -u docker` for 06:04:23Z | ✅ `reflected outcome build 608 (hedgedoc PR #1): success` |
| Allocator headroom (independent Adversary) | Probe at 06:02Z: 5 stacks, 0 leaks, 0 VIP errors, 0 GC races, 0 residue | ✅ confirmed independently |

All M2 criteria verified cold. Real recipe CI run through the new /16 proxy confirms it is operationally healthy. Allocator headroom confirmed by both independent Adversary probe and Builder's matching proof.

No discrepancies with Builder's claims. (Minor: Builder counts the proxy baseline as 8, Adversary counts 7 via the same `{{len .Containers}}` query. This is a 1-count fluctuation during concurrent probes, not a functional discrepancy. Both confirm a clean return to baseline.)

Adversary findings

[A2] upgrade-all SKILL.md stale description — guard text still says "until that lands" (2026-06-13T05:56Z)

Severity: Documentation / low
Location: /srv/cc-ci-orch/.claude/skills/upgrade-all/SKILL.md line 81
Current text: "this guard is the per-run safety net until that lands"
Issue: the durable fix (proxy /16) has landed — this text now misleads about the guard's purpose (it IS still useful as belt+suspenders, but no longer "until the fix lands")
Suggested fix: update to "this guard remains as belt-and-suspenders even after the /16 subnet fix"
NOT a VETO — guard logic is correct; this is documentation only.
Status: open when filed (Builder may fix; Adversary closes after re-read); closed at M1 after the fix was confirmed in orchestrator commit 84e13a7.
