cc-ci-orchestrator/cc-ci-plan/plan-phase-ghost-reeval.md
2026-06-12 15:56:03 +00:00

Phase ghost — re-evaluate ghost after proxy fix and leave one clean PR

Mission: re-evaluate the ghost upgrade failure after the Swarm proxy/IPAM infra confound has been removed, then leave exactly one operator-ready ghost PR: green if the recipe is sound, or clearly explained with the minimum required recipe fix/comment if a real Ghost/MySQL upgrade issue remains.

State files live under machine-docs/: STATUS-ghost.md, BACKLOG-ghost.md, REVIEW-ghost.md, JOURNAL-ghost.md.

Context

The 2026-06-12 /upgrade-all run recorded ghost as the only failed recipe, but the evidence was mixed:

  • One failure was definitely infra: shared proxy overlay VIP exhaustion left tasks stuck in Swarm New state.
  • A later failure may be recipe-specific: MySQL 8.0 to 8.4 data-dir upgrade timing under Swarm's default update monitor, producing UpdateStatus=paused under load.
  • A previous run on 2026-06-05 passed the Ghost/MySQL path under lighter load.
  • Duplicate ghost subagent churn may have left branch/PR/comment state messy.

Existing focused plan/background: /srv/cc-ci/cc-ci-plan/plan-ghostpr-debug-fix.md.

Required Work

  1. Inventory PR state. On recipe-maintainers/ghost, list all open PRs and branches related to the upgrade. Identify the correct PR, expected to be ghost PR #4, and close or clearly mark any duplicate only if it is truly superseded. Never merge recipe PRs.
  2. Separate infra from recipe behavior. After pvfix and pvcheck, trigger a fresh !testme on the correct ghost PR and watch the run. Do not count failures from before the proxy fix as current recipe evidence.
  3. If green: record that the prior failure was infra/timing-confounded, ensure no stale stacks/volumes remain, and leave the PR ready for operator review.
  4. If red for a real recipe reason: make the smallest recipe PR change needed. The suspected fix is a longer Swarm update monitor/start grace around the MySQL 8.0 to 8.4 data-dir migration, e.g. update_config.monitor: 300s and related minimal service health timing. Validate the hypothesis with logs; do not cargo-cult timing knobs.
  5. If the test is genuinely stale: the default recipe-upgrade policy applies, so leave an explanatory PR comment for the operator. Do not edit cc-ci tests in this phase unless the operator explicitly asks for a test-update phase.
  6. Deduplicate and clean up. Ensure exactly one relevant open ghost upgrade PR remains, comments explain the final state, and no ghos-*/dev-ghost stacks or volumes leak.
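
For step 4, the suspected timing fix might look roughly like the Compose fragment below. This is a sketch under assumptions, not the recipe's actual file: the service name `db`, the image tag, the healthcheck command, and every value other than `monitor: 300s` are illustrative placeholders and should be derived from the logs of the failing run.

```yaml
# Hypothetical fragment of the ghost recipe's compose file.
# Only "monitor: 300s" comes from this plan; everything else is an assumption.
services:
  db:                        # assumed name of the MySQL service
    image: mysql:8.4         # assumed tag for the upgrade target
    deploy:
      update_config:
        # Watch each updated task for 300s before declaring the update
        # successful (Swarm's default monitor window is much shorter).
        monitor: 300s
    healthcheck:
      # Assumed liveness probe; the recipe may already define its own.
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 30s
      timeout: 10s
      retries: 10
      # Grace period during which failing probes are not counted, intended
      # to cover the 8.0 to 8.4 data-dir migration on first start.
      start_period: 300s
```

Whether the update actually paused can be checked after the run with `docker service inspect --format '{{json .UpdateStatus}}' <service>`. Change only the knobs that the logs show to be involved.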

Gates

M1 — State inventory and clean retry. Builder documents PR/branch/comment/build state, identifies the correct PR, and runs one clean post-proxy !testme. Adversary verifies that pre-proxy infra failures were not misclassified as current recipe failures.

M2 — Operator-ready outcome. The ghost PR is green, or it has the minimal justified recipe fix/comment and a clear current blocker. Duplicate PR/branch mess is resolved and no ghost resources leak. Adversary verifies live PR state, build evidence, and cleanup.

Guardrails

  • Recipe PRs are never merged by agents.
  • Do not weaken tests to get green.
  • Do not re-run ghost during proxy maintenance or while cfold owns a broad CI sweep.
  • Keep iterations bounded: at most three fresh post-proxy !testme attempts unless the operator authorizes more.
  • Preserve useful failure evidence in PR comments and machine-docs/STATUS-ghost.md.

Definition of Done

Exactly one ghost upgrade PR is operator-ready, with a fresh post-proxy verdict and clear classification of the 2026-06-12 failure. Any real recipe fix is minimal and verified; otherwise the PR is green or has a precise operator-facing explanation. Adversary has signed off on M1 and M2 in machine-docs/REVIEW-ghost.md; Builder writes ## DONE only after both gates have fresh Adversary PASSes.