upgrade-all: proxy VIP-exhaustion guard in Step 0; runbooks for proxy /16 enlarge + ghost PR debug
Root-caused (empirically, dockerd logs) the discourse/ghost deploy wedges: the shared proxy overlay (/24=254 VIPs) exhausts as concurrent stack rm leaks endpoints over many days -> tasks stuck in Swarm 'New'. Add a per-run safety net to Step 0 (network prune + docker restart when VIP-allocation failures are logged). Plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix, maintenance window) and for debugging/fixing the ghost PR afterward.
This commit is contained in:
@ -12,3 +12,5 @@
|
||||
- [Swarm UpdateStatus convergence gotchas](swarm-updatestatus-convergence-gotchas.md) — N/N is not converged mid stop-first update; paused flag persists forever; only updating/rollback_started are active
|
||||
- [Weekly upgrade queued after phases](weekly-upgrade-queued-after-phases.md) — 06-12 cron skipped; auto-runs /upgrade-all when phase queue (drone) finishes; don'\''t systemctl start the timer
|
||||
- [cfold paused pending upgrade](cfold-paused-pending-upgrade.md) — cfold phase loops+watchdog STOPPED until /upgrade-all (cc-ci-upgrader) finishes; resume = restart watchdog (phase-idx 9)
|
||||
- [proxy VIP exhaustion runbook](proxy-vip-exhaustion-runbook.md) — TODO after upgrade: enlarge proxy overlay to /16 (exhausts at /24=254 VIPs); root cause of discourse/ghost deploy wedges
|
||||
- [ghost PR debug](ghost-pr-debug.md) — TODO after proxy fix: debug+fix the ghost upgrade PR (wedged on proxy VIP exhaustion; possible duplicate PR)
|
||||
|
||||
20
memory/ghost-pr-debug.md
Normal file
20
memory/ghost-pr-debug.md
Normal file
@ -0,0 +1,20 @@
|
||||
---
|
||||
name: ghost-pr-debug
|
||||
description: TODO after proxy fix — debug & fix the ghost recipe upgrade PR (its !testme kept wedging; possible duplicate PR from interrupt churn)
|
||||
metadata:
|
||||
node_type: memory
|
||||
type: project
|
||||
originSessionId: 85355980-5e4f-4f90-b1ca-d0e4fe82f04b
|
||||
---
|
||||
|
||||
During the 2026-06-12 weekly upgrade, **ghost** (6.42.0→6.44.1 + mysql bump) was the recipe whose
|
||||
`!testme` kept wedging — its deploys hung at 0/1 in Swarm `New`, which was the **proxy VIP
|
||||
exhaustion** infra issue ([[proxy-vip-exhaustion-runbook]]), not necessarily a ghost defect. It also
|
||||
got run by a DUPLICATE subagent during the interrupt churn, so the ghost PR/branch state may be messy.
|
||||
|
||||
**TODO (after the proxy fix removes the infra confound):** inventory the ghost PR(s) on
|
||||
recipe-maintainers/ghost (one or duplicate?), separate infra-failure from a real upgrade problem by
|
||||
re-running `!testme` on a HEALTHY swarm, dedup any duplicate PR, fix-forward to green (recipe PR only;
|
||||
comment on genuinely-stale tests, never edit them in default mode), and leave exactly one clean,
|
||||
operator-ready ghost PR. NEVER merge. Plan: `cc-ci-plan/plan-ghostpr-debug-fix.md`. Delete this memory
|
||||
once the ghost PR is clean + green (or clearly explained).
|
||||
25
memory/proxy-vip-exhaustion-runbook.md
Normal file
25
memory/proxy-vip-exhaustion-runbook.md
Normal file
@ -0,0 +1,25 @@
|
||||
---
|
||||
name: proxy-vip-exhaustion-runbook
|
||||
description: TODO after the weekly upgrade — enlarge the proxy overlay subnet to /16 (it exhausts at /24=254 VIPs); runbook + empirical verify
|
||||
metadata:
|
||||
node_type: memory
|
||||
type: project
|
||||
originSessionId: 85355980-5e4f-4f90-b1ca-d0e4fe82f04b
|
||||
---
|
||||
|
||||
**Root cause found 2026-06-12 (empirically, from dockerd logs):** recipe test deploys hung at 0/1 in
|
||||
Swarm `New` state (looked like discourse/ghost "failing") because the shared **`proxy` overlay
|
||||
network** (`10.0.1.0/24` = 254 VIPs, joined by every recipe deploy) **exhausted its IP pool**.
|
||||
Leaked endpoints from concurrent stack `rm` (Swarm endpoint-GC race: `key modified` / `network proxy
|
||||
remove failed`, 45×) accumulated over 11 days of dockerd uptime → `could not find an available IP
|
||||
while allocating VIP` (13×). A `docker restart` rebuilds the allocator and reclaims it (proven).
|
||||
|
||||
**Per-run safety net (DONE 2026-06-12):** upgrade-all Step 0 now runs `docker network prune -f` + a
|
||||
guard that restarts docker if recent VIP-allocation failures are in the journal.
|
||||
|
||||
**TODO (durable fix, ORCHESTRATOR, in a maintenance window AFTER the current upgrade + when the box
|
||||
is quiescent — recreating proxy disrupts traefik routing):** enlarge `proxy` to a /16. Edit
|
||||
`nix/modules/swarm.nix:~43` (`docker network create --driver overlay --attachable proxy` → add
|
||||
`--subnet 10.10.0.0/16`), recreate the proxy network on the host, `nixos-rebuild`, and empirically
|
||||
verify (reproduce the leak before/after). Full runbook: `cc-ci-plan/plan-proxy-vip-exhaustion-fix.md`.
|
||||
Then debug the ghost PR ([[ghost-pr-debug]]). Delete this memory once proxy is /16 + verified.
|
||||
Reference in New Issue
Block a user