upgrade-all: proxy VIP-exhaustion guard in Step 0; runbooks for proxy /16 enlarge + ghost PR debug

Root-caused (empirically, dockerd logs) the discourse/ghost deploy wedges: the shared proxy overlay (/24=254 VIPs) exhausts as concurrent stack rm leaks endpoints over many days -> tasks stuck in Swarm 'New'. Add a per-run safety net to Step 0 (network prune + docker restart when VIP-allocation failures are logged). Plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix, maintenance window) and for debugging/fixing the ghost PR afterward.
2026-06-12 03:30:00 +00:00
parent 7ce898e0e4
commit ca02a0dd6f
7 changed files with 184 additions and 0 deletions
--- a/memory/MEMORY.md
+++ b/memory/MEMORY.md
@ -12,3 +12,5 @@
 - [Swarm UpdateStatus convergence gotchas](swarm-updatestatus-convergence-gotchas.md) — N/N is not converged mid stop-first update; paused flag persists forever; only updating/rollback_started are active
 - [Weekly upgrade queued after phases](weekly-upgrade-queued-after-phases.md) — 06-12 cron skipped; auto-runs /upgrade-all when phase queue (drone) finishes; don'\''t systemctl start the timer
 - [cfold paused pending upgrade](cfold-paused-pending-upgrade.md) — cfold phase loops+watchdog STOPPED until /upgrade-all (cc-ci-upgrader) finishes; resume = restart watchdog (phase-idx 9)
+- [proxy VIP exhaustion runbook](proxy-vip-exhaustion-runbook.md) — TODO after upgrade: enlarge proxy overlay to /16 (exhausts at /24=254 VIPs); root cause of discourse/ghost deploy wedges
+- [ghost PR debug](ghost-pr-debug.md) — TODO after proxy fix: debug+fix the ghost upgrade PR (wedged on proxy VIP exhaustion; possible duplicate PR)