Root-caused (empirically, dockerd logs) the discourse/ghost deploy wedges: the shared proxy overlay (/24=254 VIPs) exhausts as concurrent stack rm leaks endpoints over many days -> tasks stuck in Swarm 'New'. Add a per-run safety net to Step 0 (network prune + docker restart when VIP-allocation failures are logged). Plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix, maintenance window) and for debugging/fixing the ghost PR afterward.
42 lines
2.8 KiB
Markdown
42 lines
2.8 KiB
Markdown
# Plan — debug & fix the ghost recipe upgrade PR
|
||
|
||
**Context:** during the 2026-06-12 weekly upgrade, ghost (ghost 6.42.0→6.44.1 + mysql bump) was the
|
||
recipe whose `!testme` kept wedging. Its test deploys (`ghos-bdd2f3` etc.) hung at 0/1 in Swarm
|
||
`New` state — which we now know was the **`proxy` VIP exhaustion** (see
|
||
[[proxy-vip-exhaustion-runbook]] / `plan-proxy-vip-exhaustion-fix.md`), NOT necessarily a ghost
|
||
defect. It also got run by a DUPLICATE subagent during the interrupt churn, so the PR/branch state
|
||
may be messy. This plan figures out what actually went wrong and leaves the ghost PR clean + green.
|
||
|
||
**Execute AFTER** the proxy VIP fix (so the infra confound is gone) and the current upgrade settles.
|
||
Owner: orchestrator, or a focused `/recipe-upgrade ghost` re-run.
|
||
|
||
## Steps
|
||
1. **Inventory the ghost PR state.** On recipe-maintainers/ghost: list open PRs — is there ONE
|
||
upgrade PR or a DUPLICATE (two branches/PRs from the two ghost subagents)? Capture each PR's
|
||
branch, diff (image tag + version-label bumps), and its `!testme` comment history / build
|
||
results. Read the upgrader transcript for both ghost subagents to see what each did.
|
||
2. **Separate infra failure from real failure.** The deploy wedges were proxy-VIP exhaustion
|
||
(infra). Determine whether ghost ALSO has a genuine upgrade problem: does ghost 6.44.1 + the
|
||
mysql bump deploy + pass its tests on a HEALTHY swarm? Re-run `!testme` on the ghost PR now that
|
||
the box is healthy (post docker-restart / post proxy fix) and watch the real result.
|
||
3. **Dedup.** If two ghost PRs/branches exist, keep the correct one (right version bump, clean
|
||
diff), close the duplicate with a note, and ensure no leftover `dev-ghost`/`ghos-*` stacks remain
|
||
(reap).
|
||
4. **Fix forward to green.** If `!testme` is RED for a REAL reason (e.g. ghost 6.44.1 needs a config
|
||
change, or the mysql major bump needs a migration step / a genuinely-stale test): apply the
|
||
minimal recipe fix per `/recipe-upgrade` rules — recipe PR changes only; if a cc-ci TEST is
|
||
genuinely stale, leave an explanatory PR COMMENT (do NOT edit tests in default mode). Iterate
|
||
`!testme` ≤3× to green.
|
||
5. **Leave it operator-ready.** One clean ghost PR, `!testme` GREEN (or a clear comment explaining a
|
||
legitimately-deferred issue), no duplicate, no leaked deploys. NEVER merge — operator merges.
|
||
|
||
## Acceptance
|
||
The ghost upgrade is represented by exactly one PR with a clear, green (or clearly-explained)
|
||
`!testme`, the duplicate-subagent mess cleaned, and a one-line note on whether ghost's original
|
||
failure was purely the proxy-VIP infra issue or a real upgrade problem (and how it was fixed).
|
||
|
||
## Guardrails
|
||
Recipe mirror = PR only, never merge / never push main. Reap any `dev-ghost`/`ghos-*` test stacks on
|
||
exit. No secrets in logs/commits. Don't run while the proxy recreate (maintenance window) is in
|
||
progress.
|