cc-ci-orchestrator/cc-ci-plan/plan-phase-ghost-reeval.md
2026-06-12 15:56:03 +00:00


# Phase `ghost` — re-evaluate ghost after proxy fix and leave one clean PR
**Mission:** re-evaluate the `ghost` upgrade failure after the Swarm proxy/IPAM infra
confound has been removed, then leave exactly one operator-ready ghost PR: green if the
recipe is sound, or clearly explained with the minimum required recipe fix/comment if a real
Ghost/MySQL upgrade issue remains.
State files live under `machine-docs/`: `STATUS-ghost.md`, `BACKLOG-ghost.md`,
`REVIEW-ghost.md`, `JOURNAL-ghost.md`.
## Context
The 2026-06-12 `/upgrade-all` run recorded `ghost` as the only failed recipe, but the
evidence was mixed:
- One failure was definitely infra: shared `proxy` overlay VIP exhaustion left tasks stuck
in Swarm `New` state.
- A later failure may be recipe-specific: MySQL 8.0 to 8.4 data-dir upgrade timing under
Swarm's default update monitor, producing `UpdateStatus=paused` under load.
- A previous run on 2026-06-05 passed the Ghost/MySQL path under lighter load.
- Churn from duplicate ghost subagents may have left branch, PR, and comment state
inconsistent.
Existing focused plan/background: `/srv/cc-ci/cc-ci-plan/plan-ghostpr-debug-fix.md`.
## Required Work
1. **Inventory PR state.** On `recipe-maintainers/ghost`, list all open PRs and branches
related to the upgrade. Identify the correct PR (expected to be ghost PR `#4`). Close or
clearly mark a duplicate only if it is truly superseded. Never merge recipe PRs.
2. **Separate infra from recipe behavior.** After `pvfix` and `pvcheck`, trigger a fresh
`!testme` on the correct ghost PR and watch the run. Do not count pre-proxy failures (those
recorded before the proxy fix) as current recipe evidence.
3. **If green:** record that the prior failure was infra/timing-confounded, ensure no stale
stacks/volumes remain, and leave the PR ready for operator review.
4. **If red for a real recipe reason:** make the smallest recipe PR change needed. The
suspected fix is a longer Swarm update monitor window or start grace period around the
MySQL 8.0 to 8.4 data-dir migration, e.g. `deploy.update_config.monitor: 300s` plus the
minimal related service health-check timing. Validate the hypothesis with logs; do not
cargo-cult timing knobs.
5. **If the test is genuinely stale:** the default recipe-upgrade policy applies, so leave
an explanatory PR comment for the operator. Do not edit cc-ci tests in this phase unless
the operator explicitly asks for a test-update phase.
6. **Deduplicate and clean up.** Ensure exactly one relevant open ghost upgrade PR remains,
comments explain the final state, and no `ghos-*`/`dev-ghost` stacks or volumes leak.
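
For orientation, the timing change suspected in step 4 might look roughly like the Compose
fragment below. This is a sketch under stated assumptions, not the recipe's actual
configuration: the service name `db`, the image tag, the health-check command, and every
duration other than the `300s` monitor value named above are hypothetical. Whether any of
these knobs addresses the failure still has to be confirmed from logs, as step 4 requires.

```yaml
# Hypothetical excerpt of a recipe compose file. Service name, image tag,
# health-check command, and durations are illustrative assumptions.
services:
  db:
    image: mysql:8.4
    healthcheck:
      # Assumed probe; the real recipe may define its own check.
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 30s
      timeout: 10s
      retries: 10
      # Probe failures inside this window do not count toward `retries`,
      # which gives a slow data-dir upgrade time to finish.
      start_period: 5m
    deploy:
      update_config:
        # Duration Swarm watches each updated task for failure (default 5s).
        # 300s is the value suggested in this plan.
        monitor: 300s
```

Under these assumptions, `start_period` covers a slow first start of the container, while
`monitor` governs how long Swarm watches the updated task before deciding the update's
fate. Only the setting that the logs implicate should end up in the PR.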
## Gates
**M1 — State inventory and clean retry.** Builder documents PR/branch/comment/build state,
identifies the correct PR, and runs one clean post-proxy `!testme`. Adversary verifies that
pre-proxy infra failures were not misclassified as current recipe failures.
**M2 — Operator-ready outcome.** The ghost PR is green, or it has the minimal justified
recipe fix/comment and a clear current blocker. Duplicate PR/branch mess is resolved and
no ghost resources leak. Adversary verifies live PR state, build evidence, and cleanup.
## Guardrails
- Recipe PRs are never merged by agents.
- Do not weaken tests to get green.
- Do not re-run ghost during proxy maintenance or while `cfold` owns a broad CI sweep.
- Keep iterations bounded: at most three fresh post-proxy `!testme` attempts unless the
operator authorizes more.
- Preserve useful failure evidence in PR comments and `machine-docs/STATUS-ghost.md`.
## Definition of Done
Exactly one ghost upgrade PR is operator-ready, with a fresh post-proxy verdict and clear
classification of the 2026-06-12 failure. Any real recipe fix is minimal and verified;
otherwise the PR is green or has a precise operator-facing explanation. Adversary has
signed off on M1 and M2 in `machine-docs/REVIEW-ghost.md`; Builder writes `## DONE` only
after both gates have fresh Adversary PASSes.