Files
cc-ci-orchestrator/cc-ci-plan/plan-ghostpr-debug-fix.md
autonomic-bot ca02a0dd6f upgrade-all: proxy VIP-exhaustion guard in Step 0; runbooks for proxy /16 enlarge + ghost PR debug
Root-caused (empirically, dockerd logs) the discourse/ghost deploy wedges:
the shared proxy overlay (/24=254 VIPs) exhausts as concurrent stack rm leaks
endpoints over many days -> tasks stuck in Swarm 'New'. Add a per-run safety
net to Step 0 (network prune + docker restart when VIP-allocation failures are
logged). Plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix,
maintenance window) and for debugging/fixing the ghost PR afterward.
2026-06-12 03:30:00 +00:00

42 lines
2.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Plan — debug & fix the ghost recipe upgrade PR
**Context:** during the 2026-06-12 weekly upgrade, ghost (ghost 6.42.0→6.44.1 + mysql bump) was the
recipe whose `!testme` kept wedging. Its test deploys (`ghos-bdd2f3` etc.) hung at 0/1 in Swarm
`New` state — which we now know was the **`proxy` VIP exhaustion** (see
[[proxy-vip-exhaustion-runbook]] / `plan-proxy-vip-exhaustion-fix.md`), NOT necessarily a ghost
defect. It also got run by a DUPLICATE subagent during the interrupt churn, so the PR/branch state
may be messy. This plan figures out what actually went wrong and leaves the ghost PR clean + green.
**Execute AFTER** the proxy VIP fix (so the infra confound is gone) and the current upgrade settles.
Owner: orchestrator, or a focused `/recipe-upgrade ghost` re-run.
## Steps
1. **Inventory the ghost PR state.** On recipe-maintainers/ghost: list open PRs — is there ONE
upgrade PR or a DUPLICATE (two branches/PRs from the two ghost subagents)? Capture each PR's
branch, diff (image tag + version-label bumps), and its `!testme` comment history / build
results. Read the upgrader transcript for both ghost subagents to see what each did.
2. **Separate infra failure from real failure.** The deploy wedges were proxy-VIP exhaustion
(infra). Determine whether ghost ALSO has a genuine upgrade problem: does ghost 6.44.1 + the
mysql bump deploy + pass its tests on a HEALTHY swarm? Re-run `!testme` on the ghost PR now that
the box is healthy (post docker-restart / post proxy fix) and watch the real result.
3. **Dedup.** If two ghost PRs/branches exist, keep the correct one (right version bump, clean
diff), close the duplicate with a note, and ensure no leftover `dev-ghost`/`ghos-*` stacks remain
(reap).
4. **Fix forward to green.** If `!testme` is RED for a REAL reason (e.g. ghost 6.44.1 needs a config
change, or the mysql major bump needs a migration step / a genuinely-stale test): apply the
minimal recipe fix per `/recipe-upgrade` rules — recipe PR changes only; if a cc-ci TEST is
genuinely stale, leave an explanatory PR COMMENT (do NOT edit tests in default mode). Iterate
`!testme` ≤3× to green.
5. **Leave it operator-ready.** One clean ghost PR, `!testme` GREEN (or a clear comment explaining a
legitimately-deferred issue), no duplicate, no leaked deploys. NEVER merge — operator merges.
## Acceptance
The ghost upgrade is represented by exactly one PR with a clear, green (or clearly-explained)
`!testme`, the duplicate-subagent mess cleaned, and a one-line note on whether ghost's original
failure was purely the proxy-VIP infra issue or a real upgrade problem (and how it was fixed).
## Guardrails
Recipe mirror = PR only, never merge / never push main. Reap any `dev-ghost`/`ghos-*` test stacks on
exit. No secrets in logs/commits. Don't run while the proxy recreate (maintenance window) is in
progress.