fix(2): ghost healthcheck start_period overlay — fixes fresh-migration lock deadlock

Root cause: Ghost's fresh-DB first boot runs a ~6-9min schema migration (round-trip-bound, not CPU);
the recipe healthcheck start_period:1m (~6min grace) kills the still-migrating task, leaving a stale
migrations_lock → every later task deadlocks (MigrationsAreLockedError). Hit on both 2- and 4-vCPU.
Fix (cc-ci deploy overlay, NOT a recipe/test change): compose.ccci-health.yml raises app healthcheck
start_period to 900s, wired via recipe_meta COMPOSE_FILE + install_steps.sh (+ CHAOS_BASE_DEPLOY for
the untracked overlay). No assertion weakened. Budget 1200s = migration + convergence. Only the
install tier needs it (upgrade redeploys on the populated DB → fast boot).
This commit is contained in:
2026-05-30 05:23:47 +01:00
parent 9771b6e16a
commit 13da216f8d
3 changed files with 60 additions and 9 deletions

View File

@ -8,14 +8,21 @@
# mysqldump pre-hook; P4 (ops.py + test_{backup,restore,upgrade}.py) seeds a `ci_marker` row there.
HEALTH_PATH = "/" # Ghost serves a themed site HTML at root (200)
HEALTH_OK = (200,)
DEPLOY_TIMEOUT = 900 # subprocess timeout for `abra app deploy`
DEPLOY_TIMEOUT = 1200 # subprocess timeout for `abra app deploy`
HTTP_TIMEOUT = 900
# Ghost's first-boot does a full schema migration (dozens of tables) against a fresh MySQL `ghost`
# DB. The migration must finish within the recipe healthcheck grace (start_period 1m + 10×30s ≈ 6min)
# — otherwise swarm kills the still-migrating task, which leaves a stale `migrations_lock` row and
# every later task then refuses to boot (`MigrationsAreLockedError` deadlock). On the cc-ci node with
# 4 dedicated vCPU the migration completes well inside that grace and the app converges in a few
# minutes, so 900s is an ample-but-bounded budget (fails fast if the deadlock ever recurs, rather
# than a long blackout). See DECISIONS (ghost MySQL cold-boot).
EXTRA_ENV = {"TIMEOUT": "900"}
# Ghost's fresh-DB first boot runs a full schema migration (dozens of CREATE TABLEs, each a separate
# MySQL round-trip → ~6-9min on cc-ci, round-trip-bound so more vCPU doesn't help). The upstream
# recipe healthcheck (`start_period: 1m` + 10×30s ≈ 6min grace) is too tight: swarm kills the still-
# migrating task, leaving a stale `migrations_lock` → every later task deadlocks
# (`MigrationsAreLockedError`). cc-ci provides a DEPLOY overlay `compose.ccci-health.yml` (raises the
# app healthcheck start_period to 900s; failures ignored during it, a PASS still marks healthy at
# once) via COMPOSE_FILE + install_steps.sh, so the fresh migration finishes + releases the lock.
# This is infra/deploy tuning — NO test/assertion is weakened. CHAOS_BASE_DEPLOY lets the pinned base
# deploy proceed with the untracked overlay present. TIMEOUT 1200s = migration (≤9min) + convergence,
# bounded so a genuine failure still fails (not a long blackout). See DECISIONS (ghost MySQL cold-boot).
CHAOS_BASE_DEPLOY = True
EXTRA_ENV = {
"TIMEOUT": "1200",
"COMPOSE_FILE": "compose.yml:compose.ccci-health.yml",
}