cc-ci/tests/ghost/compose.ccci.yml
autonomic-bot 3ca45c7308 fix(2): ghost F2-14b — add db start_period grace to base overlay
Run #2 base deploy: fresh mysql:8.0 init on the loaded cc-ci host (load ~8) took >6min
(InnoDB ~90s + system-tables + root-pw apply, starved by the app crash-loop churn), exceeding
the recipe's 1m db start_period (+6min retry grace) → swarm killed mysql mid-init (exit 137
unhealthy) → corrupt InnoDB redo logs → permanent deadlock (same signature as run #1's stale
vol). Widen db healthcheck start_period to 15m (matches app) so the slow first-boot finishes
before the healthcheck can fail it. Grace-only, masks no defect; bites base+head (published
recipe ships db start_period 1m everywhere) so overlay covers both. Torn down corrupt vol.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 17:58:30 +01:00

39 lines
2.7 KiB
YAML

---
# cc-ci overlay (Phase 2 F2-14b): minimal, single-purpose. It widens the `app` and `db` healthcheck
# start_period so the UPGRADE-tier BASE deploy (a previous published ghost version) can converge.
#
# WHY THIS OVERLAY EXISTS (plan-ccci-compose-overlay-policy.md §1 "minimal justified fallback"):
# upgrade-to-latest must always run (policy §1) → the harness base-deploys the previous published
# version (e.g. 1.1.1+6-alpine), then `deploy --chaos` to the recipe-PR head. Ghost's fresh-DB first
# boot runs a full schema migration that is ~6-9 min on cc-ci (round-trip-bound, NOT CPU-bound). The
# published base versions ship `start_period: 1m` (+10×30s ≈ 6 min grace) on the app healthcheck —
# too tight: swarm kills the still-migrating task, leaving a held `migrations_lock` → every later
# task deadlocks (MigrationsAreLockedError) → the base never converges → upgrade-to-latest can't run.
#
# The recipe-PR (recipe-maintainers/ghost#1) fixes this for the HEAD by bumping start_period to a
# literal 15m IN THE RECIPE. But the BASE is a *published* version that predates the PR, so it still
# carries 1m. start_period CANNOT be an env var (abra validates the literal compose 'duration' BEFORE
# substitution → FATA; Adversary-reproduced, REVIEW-2 4b862f6), so this cc-ci overlay applies the same
# 15m grace to the base ONLY to make the from-version deployable — exactly the policy-blessed
# "minimal overlay on the from-version so upgrade-to-latest can run". It is grace-only: a passing
# check still marks the task healthy immediately, so NO test/assertion is weakened and fast hosts
# are unaffected. It is idempotent on the head (head already ships 15m). It merges deeply onto the
# base healthcheck (test/interval/timeout/retries preserved; only start_period overridden).
#
# The `db` (mysql:8.0) healthcheck gets the same grace: on the loaded cc-ci host a FRESH mysql data
# dir init (InnoDB + system tables + root-password apply) takes ~6-10 min, far exceeding the recipe's
# 1m db start_period (+10×30s ≈ 6 min) — swarm kills mysql MID-INIT (exit 137 "unhealthy container"),
# leaving a half-written data dir whose InnoDB redo logs are corrupt ("Cannot create redo log files
# because data files are corrupt") → every restart fails → permanent deadlock. Widening the db
# start_period to 15m lets the slow first-boot init finish before the healthcheck can fail it. This
# bites BOTH base and head (the published recipe ships db start_period 1m everywhere), so the overlay
# applies on both (persists untracked across the head checkout) — a recipe-PR candidate too.
# Grace-only; masks no defect; weakens no test.
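#
# Illustrative sketch only (not copied from the recipe): the effective healthcheck for each
# service after the deep merge described above. It assumes the base recipe uses interval 30s and
# retries 10, as inferred from the "+10×30s" figures in this comment; `test` and `timeout` are
# whatever the published recipe ships and are left elided.
#
#   healthcheck:
#     test: ...            # from the base recipe, unchanged
#     interval: 30s        # from the base recipe, unchanged (assumed value)
#     timeout: ...         # from the base recipe, unchanged
#     retries: 10          # from the base recipe, unchanged (assumed value)
#     start_period: 15m    # the only key this overlay overrides (base ships 1m)
#
# Under those assumptions the earliest "unhealthy" kill moves from roughly 1m + 10×30s ≈ 6 min
# to roughly 15m + 10×30s ≈ 20 min after task start.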
services:
  app:
    healthcheck:
      start_period: 15m
  db:
    healthcheck:
      start_period: 15m