upgrade-all: proxy VIP-exhaustion guard in Step 0; runbooks for proxy /16 enlarge + ghost PR debug
Root-caused (empirically, dockerd logs) the discourse/ghost deploy wedges: the shared proxy overlay (/24=254 VIPs) exhausts as concurrent stack rm leaks endpoints over many days -> tasks stuck in Swarm 'New'. Add a per-run safety net to Step 0 (network prune + docker restart when VIP-allocation failures are logged). Plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix, maintenance window) and for debugging/fixing the ghost PR afterward.
@ -58,6 +58,28 @@ the dedicated reaper too so the start/end cleanup is explicit and symmetric (`TH
ssh cc-ci 'THRESHOLD=0 bash -s' < /srv/cc-ci/.claude/skills/upgrade-all/reap-dev-deploys.sh
```

Then **reclaim leaked overlay IPs and guard against `proxy` VIP exhaustion.** The shared `proxy`
overlay (a `/24` = 254 VIPs that EVERY recipe deploy joins) leaks endpoints under concurrent stack
`rm` (a Swarm endpoint-GC race); over many days of churn the pool exhausts and new test deploys hang
in Swarm `New` state with `could not find an available IP while allocating VIP`, which looks exactly
like a recipe failure but is infra (root-caused 2026-06-12; see cc-ci-plan `plan-proxy-vip-exhaustion-fix.md`).
Reclaim before the run:
```
# 1. reclaim leaked per-stack overlay networks (cheap, always safe)
ssh cc-ci 'docker network prune -f'
# 2. proxy VIP-exhaustion guard: if the allocator recently failed to assign a VIP, the leak has hit
#    the ceiling, so rebuild the allocator with a docker restart (the box is QUIESCENT at run start,
#    so this is a ~30s infra blip that auto-recovers). Only fires when actually needed.
VIPFAIL=$(ssh cc-ci 'journalctl -u docker --since "26 hours ago" --no-pager 2>/dev/null | grep -c "available IP while allocating VIP"')
if [ "${VIPFAIL:-0}" -gt 0 ]; then
  echo "!! proxy VIP exhaustion detected ($VIPFAIL recent failures): restarting docker to reclaim leaked endpoints"
  ssh cc-ci 'sudo systemctl restart docker'; sleep 25
  ssh cc-ci 'docker node ls && docker service ls --format "{{.Replicas}}" | grep -c "/"'  # sanity: node Ready, infra back
fi
```
(The durable fix, enlarging the `proxy` subnet to a /16 so it never exhausts, is tracked in
`plan-proxy-vip-exhaustion-fix.md`; this guard is the per-run safety net until that lands.)

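The guard waits a fixed 25 seconds after the restart. As a minimal sketch of a polling alternative, the helper below retries a check until it passes or a timeout elapses. `wait_until` and `node_ready` are hypothetical names that are not part of the skill, and the commented usage assumes a single-node swarm reachable as `cc-ci`.

```shell
# Hypothetical helper (not in the original skill): retry a command once per
# second until it succeeds or the timeout (in seconds) elapses.
wait_until() {
  local timeout=$1; shift
  local elapsed=0
  until "$@"; do
    elapsed=$((elapsed + 1))
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1   # gave up: the check never passed within the timeout
    fi
    sleep 1
  done
  return 0
}

# Assumed usage on the runner: pass once a node reports Ready.
#   node_ready() { ssh cc-ci 'docker node ls --format "{{.Status}}"' | grep -qx Ready; }
#   wait_until 60 node_ready || echo "!! docker did not come back within 60s"
```

Polling returns as soon as the daemon is back and fails loudly if it is not, whereas a fixed sleep does neither.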
## 1. Build the candidate list
Enrolled recipes = the cc-ci `tests/<recipe>/` dirs (same set `ci-test-review` sweeps):
```
@ -528,3 +528,18 @@ session cc-ci-orchestrator-stale can be killed; recipe-mirrors org still private
it with the diagnosis → "one clean discourse retry then move on regardless; comment+skip
if it re-wedges". Agent recovered, now checking build state before retry. Rest of queue
(ghost/immich/keycloak/lasuite-*/mailu/matrix-synapse) still ahead. cfold still paused.

## 2026-06-12 ~03:30 - ROOT CAUSE: proxy overlay VIP exhaustion (not "tired box")
- Empirically verified from dockerd logs: the shared `proxy` overlay (10.0.1.0/24 = 254 VIPs,
  joined by every recipe deploy) exhausted its IP pool. Endpoint-GC race on concurrent stack rm
  (`key modified`/`network proxy remove failed`, 45×) leaked IPs over 11 days of dockerd uptime →
  13× `could not find an available IP while allocating VIP` from 22:53 → tasks stuck in Swarm `New`
  → discourse + ghost deploys wedged (looked like recipe failures; were infra). 02:50 docker
  restart rebuilt the allocator → cleared.
- FIXES: (a) upgrade-all Step 0 now prunes leaked overlays + restarts docker if VIP-failures are in
  the journal (per-run safety net, committed). (b) DURABLE: enlarge proxy to /16 in swarm.nix;
  runbook plan-proxy-vip-exhaustion-fix.md + memory [[proxy-vip-exhaustion-runbook]], orchestrator
  to execute in a maintenance window AFTER the current upgrade (recreating proxy disrupts routing).
  (c) ghost PR debug: plan-ghostpr-debug-fix.md + memory [[ghost-pr-debug]].
- NOT switching the upgrade to sequential (operator: concurrency is fine; the leak is the issue).
  Duplicate ghost subagent from the interrupt churn: told the upgrader to TaskStop one.

41
cc-ci-plan/plan-ghostpr-debug-fix.md
Normal file
@ -0,0 +1,41 @@
# Plan: debug & fix the ghost recipe upgrade PR

**Context:** during the 2026-06-12 weekly upgrade, ghost (ghost 6.42.0→6.44.1 + mysql bump) was the
recipe whose `!testme` kept wedging. Its test deploys (`ghos-bdd2f3` etc.) hung at 0/1 in Swarm
`New` state, which we now know was the **`proxy` VIP exhaustion** (see
[[proxy-vip-exhaustion-runbook]] / `plan-proxy-vip-exhaustion-fix.md`), NOT necessarily a ghost
defect. It also got run by a DUPLICATE subagent during the interrupt churn, so the PR/branch state
may be messy. This plan figures out what actually went wrong and leaves the ghost PR clean + green.

**Execute AFTER** the proxy VIP fix (so the infra confound is gone) and the current upgrade settles.
Owner: orchestrator, or a focused `/recipe-upgrade ghost` re-run.

## Steps
1. **Inventory the ghost PR state.** On recipe-maintainers/ghost: list open PRs. Is there ONE
   upgrade PR or a DUPLICATE (two branches/PRs from the two ghost subagents)? Capture each PR's
   branch, diff (image tag + version-label bumps), and its `!testme` comment history / build
   results. Read the upgrader transcript for both ghost subagents to see what each did.
2. **Separate infra failure from real failure.** The deploy wedges were proxy-VIP exhaustion
   (infra). Determine whether ghost ALSO has a genuine upgrade problem: does ghost 6.44.1 + the
   mysql bump deploy + pass its tests on a HEALTHY swarm? Re-run `!testme` on the ghost PR now that
   the box is healthy (post docker-restart / post proxy fix) and watch the real result.
3. **Dedup.** If two ghost PRs/branches exist, keep the correct one (right version bump, clean
   diff), close the duplicate with a note, and ensure no leftover `dev-ghost`/`ghos-*` stacks remain
   (reap).
4. **Fix forward to green.** If `!testme` is RED for a REAL reason (e.g. ghost 6.44.1 needs a config
   change, or the mysql major bump needs a migration step / a genuinely-stale test): apply the
   minimal recipe fix per `/recipe-upgrade` rules (recipe PR changes only); if a cc-ci TEST is
   genuinely stale, leave an explanatory PR COMMENT (do NOT edit tests in default mode). Iterate
   `!testme` ≤3× to green.
5. **Leave it operator-ready.** One clean ghost PR, `!testme` GREEN (or a clear comment explaining a
   legitimately-deferred issue), no duplicate, no leaked deploys. NEVER merge; the operator merges.

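The leftover-stack check in step 3 can be sketched as a name filter. `leftover_ghost_stacks` is a hypothetical helper, not something the plan defines; the `dev-ghost` and `ghos-` patterns come from the plan text, and treating both as name prefixes is an assumption.

```shell
# Hypothetical helper (not part of the plan): read stack names on stdin and
# keep only those that look like ghost test deploys.
leftover_ghost_stacks() {
  # `|| true` so that "no matches" is an empty result, not an error status.
  grep -E '^(dev-ghost|ghos-)' || true
}

# Assumed usage on the swarm manager:
#   docker stack ls --format '{{.Name}}' | leftover_ghost_stacks
# An empty result means there is nothing left to reap.
```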
## Acceptance
The ghost upgrade is represented by exactly one PR with a clear, green (or clearly-explained)
`!testme`, the duplicate-subagent mess cleaned, and a one-line note on whether ghost's original
failure was purely the proxy-VIP infra issue or a real upgrade problem (and how it was fixed).

## Guardrails
Recipe mirror = PR only, never merge / never push main. Reap any `dev-ghost`/`ghos-*` test stacks on
exit. No secrets in logs/commits. Don't run while the proxy recreate (maintenance window) is in
progress.
59
cc-ci-plan/plan-proxy-vip-exhaustion-fix.md
Normal file
@ -0,0 +1,59 @@
# Runbook: fix `proxy` overlay VIP exhaustion (durable) + empirical verification

**Owner: ORCHESTRATOR** (host/swarm infra, not a recipe/test change). Execute **after the current
weekly upgrade run finishes** (the box must be quiescent: recreating `proxy` disrupts traefik
routing for every live service). Do NOT run mid-upgrade.

## Root cause (empirically verified 2026-06-12, from dockerd logs)
- The shared **`proxy` overlay network** (ID was `ab54…`) is **`10.0.1.0/24` = 254 VIPs**. EVERY
  recipe deploy joins it (traefik routing).
- Under concurrent stack `rm`, Swarm's endpoint GC races
  (`Unable to complete atomic operation, key modified` / `network proxy remove failed`) and
  **leaks endpoints → leaks IPs** (45 such errors over the day). `dockerd` had **11 days** uptime
  accumulating leaks.
- The pool exhausted → **13× `could not find an available IP while allocating VIP`** (first 22:53,
  straddling both wedges) → new services' tasks stuck in Swarm **`New`** state (never scheduled).
- The 02:50 docker restart rebuilt the allocator and reclaimed everything → healthy.
- This presents as a recipe FAILURE (discourse, ghost both "failed") but is purely infra.

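To re-check these counts on another day, the three log signatures above can be tallied in one pass. This is a minimal sketch: `tally_vip_signatures` is a hypothetical helper, and only the signature strings are taken from the runbook.

```shell
# Hypothetical helper: tally the dockerd log signatures named in the root
# cause from log text on stdin. Unset awk counters print as 0 under %d.
tally_vip_signatures() {
  awk '
    /key modified/                                        { gc++ }
    /network proxy remove failed/                         { rm++ }
    /could not find an available IP while allocating VIP/ { vip++ }
    END { printf "gc_race=%d remove_failed=%d vip_alloc_failed=%d\n", gc, rm, vip }
  '
}

# Assumed usage on the host:
#   journalctl -u docker --since "26 hours ago" --no-pager | tally_vip_signatures
```

A non-zero `vip_alloc_failed` is the same condition the Step 0 guard greps for; the other two counters indicate how fast the leak is accumulating.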
## The fix (durable): enlarge the `proxy` subnet
`nix/modules/swarm.nix:~43` creates it with no `--subnet` (defaults to a /24):
```
docker network create --driver overlay --attachable proxy
```
Change to a **/16** (≈65,534 VIPs, ~258× headroom; the leak can't reach it before a routine
reboot/`nixos-rebuild` resets the allocator). Pick a block clear of `ingress` (10.0.0.0/24) and the
current proxy (10.0.1.0/24); the default-addr-pool is 10.0.0.0/8, so use e.g. **`10.10.0.0/16`**:
```
docker network create --driver overlay --attachable --subnet 10.10.0.0/16 proxy
```

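The headroom figures can be checked with a few lines of shell arithmetic. This is a sketch, not part of the runbook: `usable_addrs` is a hypothetical helper, and it counts every host address as VIP capacity the way the runbook does, although Swarm may reserve a few addresses itself (for example the gateway).

```shell
# Usable addresses in an IPv4 subnet of the given prefix length:
# 2^(32 - prefix), minus the network and broadcast addresses.
usable_addrs() {
  echo $(( (1 << (32 - $1)) - 2 ))
}

small=$(usable_addrs 24)   # current proxy subnet
large=$(usable_addrs 16)   # proposed proxy subnet
echo "/24: $small  /16: $large  headroom: $(( large / small ))x"
# prints: /24: 254  /16: 65534  headroom: 258x
```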
## Procedure
1. **Pre-req:** weekly upgrade run done; `docker stack ls` shows only infra + `warm-*`.
2. **EMPIRICAL BEFORE: measure the leak.** Baseline `proxy` endpoint/IP count, then deploy +
   *concurrently* `rm` N (~10) throwaway published-port stacks; re-count. Show endpoints/IPs do NOT
   return to baseline (leak), and grep dockerd for fresh `key modified`/`network proxy remove`
   errors. Record the per-cycle leak rate → projects the /24 exhaustion time.
3. **Edit `nix/modules/swarm.nix`:** add `--subnet 10.10.0.0/16` to the proxy create (commit to
   the cc-ci repo; this is infra/nix, orchestrator-authored, push to git.autonomic.zone).
4. **Recreate `proxy` on the host (DISRUPTIVE):** the network can't be resized in place. Either
   `nixos-rebuild` after temporarily removing proxy, or manually: detach services /
   `docker stack rm` the live recipe stacks (none mid-upgrade), `docker network rm proxy`, recreate
   with the /16, then redeploy/reconcile traefik + the `ccci-*` control plane + `warm-*` so they
   rejoin. Verify traefik routing, drone, dashboard, bridge, reports all healthy.
5. **`nixos-rebuild switch`** so the /16 persists across reboots (sync `/root/cc-ci` first, per the
   host-deploy mechanism).
6. **EMPIRICAL AFTER: prove it.** Re-run step 2's reproduction: confirm (a) `proxy` now reports a
   /16 with vast headroom, (b) `docker network prune -f` reclaims the leaked per-stack overlays,
   (c) the leak no longer approaches exhaustion. Confirm a fresh recipe `!testme` deploys clean (no
   `New`-state hang).

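The projection in step 2 is simple division once the leak rate has been measured. The sketch below assumes a hypothetical helper `cycles_until_exhaustion`; the numbers in the example calls are illustrative placeholders, not measurements from the host.

```shell
# Hypothetical projection helper. All three inputs are values you measure:
#   $1 pool size (254 for the /24), $2 addresses in use at baseline,
#   $3 addresses leaked per deploy+rm cycle (must be positive).
cycles_until_exhaustion() {
  local pool=$1 used=$2 leak=$3
  if [ "$leak" -le 0 ]; then
    echo "leak per cycle must be positive" >&2
    return 1
  fi
  echo $(( (pool - used) / leak ))
}

# Illustrative numbers only (not measured values):
cycles_until_exhaustion 254 54 4      # prints 50
cycles_until_exhaustion 65534 54 4    # prints 16370
```

Running it with the same baseline and leak rate for both pool sizes gives a before/after comparison for step 6(c).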
## Acceptance
`proxy` is a /16 (pinned in swarm.nix, survives rebuild); reproduction shows the leak is bounded
far below the new ceiling; the upgrade Step-0 guard (prune + VIP-failure docker-restart, already
added to the skill 2026-06-12) remains as the per-run safety net. Then delete the
[[proxy-vip-exhaustion-runbook]] memory.

## Guardrails
- Maintenance window only (recreating proxy = brief routing outage for ALL services). Never during
  a live upgrade or phase run. No secrets in commits. Author
  `autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>`; push after commit.
@ -12,3 +12,5 @@
- [Swarm UpdateStatus convergence gotchas](swarm-updatestatus-convergence-gotchas.md): N/N is not converged mid stop-first update; paused flag persists forever; only updating/rollback_started are active
- [Weekly upgrade queued after phases](weekly-upgrade-queued-after-phases.md): 06-12 cron skipped; auto-runs /upgrade-all when phase queue (drone) finishes; don't systemctl start the timer
- [cfold paused pending upgrade](cfold-paused-pending-upgrade.md): cfold phase loops+watchdog STOPPED until /upgrade-all (cc-ci-upgrader) finishes; resume = restart watchdog (phase-idx 9)
- [proxy VIP exhaustion runbook](proxy-vip-exhaustion-runbook.md): TODO after upgrade: enlarge proxy overlay to /16 (exhausts at /24=254 VIPs); root cause of discourse/ghost deploy wedges
- [ghost PR debug](ghost-pr-debug.md): TODO after proxy fix: debug+fix the ghost upgrade PR (wedged on proxy VIP exhaustion; possible duplicate PR)
20
memory/ghost-pr-debug.md
Normal file
@ -0,0 +1,20 @@
---
name: ghost-pr-debug
description: TODO after proxy fix - debug & fix the ghost recipe upgrade PR (its !testme kept wedging; possible duplicate PR from interrupt churn)
metadata:
  node_type: memory
  type: project
  originSessionId: 85355980-5e4f-4f90-b1ca-d0e4fe82f04b
---

During the 2026-06-12 weekly upgrade, **ghost** (6.42.0→6.44.1 + mysql bump) was the recipe whose
`!testme` kept wedging: its deploys hung at 0/1 in Swarm `New`, which was the **proxy VIP
exhaustion** infra issue ([[proxy-vip-exhaustion-runbook]]), not necessarily a ghost defect. It also
got run by a DUPLICATE subagent during the interrupt churn, so the ghost PR/branch state may be messy.

**TODO (after the proxy fix removes the infra confound):** inventory the ghost PR(s) on
recipe-maintainers/ghost (one or duplicate?), separate infra-failure from a real upgrade problem by
re-running `!testme` on a HEALTHY swarm, dedup any duplicate PR, fix-forward to green (recipe PR only;
comment on genuinely-stale tests, never edit them in default mode), and leave exactly one clean,
operator-ready ghost PR. NEVER merge. Plan: `cc-ci-plan/plan-ghostpr-debug-fix.md`. Delete this memory
once the ghost PR is clean + green (or clearly explained).
25
memory/proxy-vip-exhaustion-runbook.md
Normal file
@ -0,0 +1,25 @@
---
name: proxy-vip-exhaustion-runbook
description: TODO after the weekly upgrade - enlarge the proxy overlay subnet to /16 (it exhausts at /24=254 VIPs); runbook + empirical verify
metadata:
  node_type: memory
  type: project
  originSessionId: 85355980-5e4f-4f90-b1ca-d0e4fe82f04b
---

**Root cause found 2026-06-12 (empirically, from dockerd logs):** recipe test deploys hung at 0/1 in
Swarm `New` state (looked like discourse/ghost "failing") because the shared **`proxy` overlay
network** (`10.0.1.0/24` = 254 VIPs, joined by every recipe deploy) **exhausted its IP pool**.
Leaked endpoints from concurrent stack `rm` (Swarm endpoint-GC race: `key modified` /
`network proxy remove failed`, 45×) accumulated over 11 days of dockerd uptime →
`could not find an available IP while allocating VIP` (13×). Restarting the docker daemon
(`systemctl restart docker`) rebuilds the allocator and reclaims the leaked IPs (proven).

**Per-run safety net (DONE 2026-06-12):** upgrade-all Step 0 now runs `docker network prune -f` + a
guard that restarts docker if recent VIP-allocation failures are in the journal.

**TODO (durable fix, ORCHESTRATOR, in a maintenance window AFTER the current upgrade + when the box
is quiescent, since recreating proxy disrupts traefik routing):** enlarge `proxy` to a /16. Edit
`nix/modules/swarm.nix:~43` (`docker network create --driver overlay --attachable proxy` → add
`--subnet 10.10.0.0/16`), recreate the proxy network on the host, `nixos-rebuild`, and empirically
verify (reproduce the leak before/after). Full runbook: `cc-ci-plan/plan-proxy-vip-exhaustion-fix.md`.
Then debug the ghost PR ([[ghost-pr-debug]]). Delete this memory once proxy is /16 + verified.