Files
cc-ci-orchestrator/cc-ci-plan/plan-ghostpr-debug-fix.md
autonomic-bot ca02a0dd6f upgrade-all: proxy VIP-exhaustion guard in Step 0; runbooks for proxy /16 enlarge + ghost PR debug
Root-caused (empirically, dockerd logs) the discourse/ghost deploy wedges:
the shared proxy overlay (/24=254 VIPs) exhausts as concurrent stack rm leaks
endpoints over many days -> tasks stuck in Swarm 'New'. Add a per-run safety
net to Step 0 (network prune + docker restart when VIP-allocation failures are
logged). Plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix,
maintenance window) and for debugging/fixing the ghost PR afterward.
2026-06-12 03:30:00 +00:00

2.8 KiB
Raw Blame History

Plan — debug & fix the ghost recipe upgrade PR

Context: during the 2026-06-12 weekly upgrade, ghost (ghost 6.42.0→6.44.1 + mysql bump) was the recipe whose !testme kept wedging. Its test deploys (ghos-bdd2f3 etc.) hung at 0/1 in Swarm New state — which we now know was the proxy VIP exhaustion (see proxy-vip-exhaustion-runbook / plan-proxy-vip-exhaustion-fix.md), NOT necessarily a ghost defect. It also got run by a DUPLICATE subagent during the interrupt churn, so the PR/branch state may be messy. This plan figures out what actually went wrong and leaves the ghost PR clean + green.

Execute AFTER the proxy VIP fix (so the infra confound is gone) and the current upgrade settles. Owner: orchestrator, or a focused /recipe-upgrade ghost re-run.

Steps

  1. Inventory the ghost PR state. On recipe-maintainers/ghost: list open PRs — is there ONE upgrade PR or a DUPLICATE (two branches/PRs from the two ghost subagents)? Capture each PR's branch, diff (image tag + version-label bumps), and its !testme comment history / build results. Read the upgrader transcript for both ghost subagents to see what each did.
  2. Separate infra failure from real failure. The deploy wedges were proxy-VIP exhaustion (infra). Determine whether ghost ALSO has a genuine upgrade problem: does ghost 6.44.1 + the mysql bump deploy + pass its tests on a HEALTHY swarm? Re-run !testme on the ghost PR now that the box is healthy (post docker-restart / post proxy fix) and watch the real result.
  3. Dedup. If two ghost PRs/branches exist, keep the correct one (right version bump, clean diff), close the duplicate with a note, and ensure no leftover dev-ghost/ghos-* stacks remain (reap).
  4. Fix forward to green. If !testme is RED for a REAL reason (e.g. ghost 6.44.1 needs a config change, or the mysql major bump needs a migration step / a genuinely-stale test): apply the minimal recipe fix per /recipe-upgrade rules — recipe PR changes only; if a cc-ci TEST is genuinely stale, leave an explanatory PR COMMENT (do NOT edit tests in default mode). Iterate !testme ≤3× to green.
  5. Leave it operator-ready. One clean ghost PR, !testme GREEN (or a clear comment explaining a legitimately-deferred issue), no duplicate, no leaked deploys. NEVER merge — operator merges.

Acceptance

The ghost upgrade is represented by exactly one PR with a clear, green (or clearly-explained) !testme, the duplicate-subagent mess cleaned, and a one-line note on whether ghost's original failure was purely the proxy-VIP infra issue or a real upgrade problem (and how it was fixed).

Guardrails

Recipe mirror = PR only, never merge / never push main. Reap any dev-ghost/ghos-* test stacks on exit. No secrets in logs/commits. Don't run while the proxy recreate (maintenance window) is in progress.