feat(2): ghost F2-14b overlay migration — start_period bump moved to recipe-PR (ghost#1 head ae43ffe, literal 15m on app healthcheck); DELETE cc-ci compose.ccci-health.yml + install_steps.sh + COMPOSE_FILE/CHAOS_BASE_DEPLOY. Anti-drift (plan §9): recipe-as-tested == recipe-as-published. env-var start_period impossible (abra pre-subst duration validation, Adversary-reproduced 4b862f6). Next: run ghost on ae43ffe head.

2026-05-30 17:20:20 +01:00
parent 2f5900a5a9
commit 0f2cc2d704
3 changed files with 17 additions and 55 deletions
--- a/tests/ghost/compose.ccci-health.yml
+++ b/tests/ghost/compose.ccci-health.yml
@@ -1,18 +0,0 @@
-# cc-ci deploy overlay (NOT a recipe change) — raises ONLY the app healthcheck start_period.
-#
-# Ghost's first-boot runs a full schema migration (dozens of CREATE TABLEs, each a separate MySQL
-# round-trip → ~6-9min on cc-ci) against the fresh `ghost` DB. The upstream recipe healthcheck uses
-# `start_period: 1m` (+ 10×30s retries ≈ 6min grace); on cc-ci the migration regularly exceeds that,
-# so swarm marks the still-migrating task unhealthy and KILLS it mid-migration — which leaves a stale
-# `migrations_lock` row, and every later task then refuses to boot (`MigrationsAreLockedError`
-# deadlock). This is round-trip-bound, so more vCPU does not close the gap.
-#
-# Raising the START_PERIOD (failures ignored during it; a PASS still marks healthy immediately) lets
-# the fresh migration finish + release the lock, after which Ghost serves and the (unchanged) check
-# passes. This is DEPLOY/infra tuning, not a test change — no assertion is weakened, and the app's
-# real healthcheck still gates readiness. Applied via recipe_meta COMPOSE_FILE; only the install
-# tier's fresh migration needs it (the upgrade redeploy boots on the already-populated DB → fast).
-services:
-  app:
-    healthcheck:
-      start_period: 900s
--- a/tests/ghost/install_steps.sh
+++ b/tests/ghost/install_steps.sh
@@ -1,26 +0,0 @@
-#!/usr/bin/env bash
-# ghost — INSTALL-TIME hook (Phase 2 Q4.4). Runs during the install tier AFTER `abra app new` +
-# EXTRA_ENV + `abra app secret generate` and BEFORE the single `abra app deploy`
-# (lifecycle.py::_run_install_steps), with CCCI_RECIPE / CCCI_APP_DOMAIN / CCCI_APP_ENV in env.
-#
-# Purpose: provide the cc-ci deploy overlay `compose.ccci-health.yml` (app healthcheck start_period
-# bump) into the recipe checkout so recipe_meta's COMPOSE_FILE (compose.yml:compose.ccci-health.yml)
-# resolves. Without the larger start_period, Ghost's ~6-9min fresh-DB migration is killed mid-flight
-# by the recipe's 1m-start_period healthcheck, leaving a stale migrations_lock → deadlock (see the
-# overlay file header). The overlay is an UNTRACKED file in the recipe repo, so `git checkout -f`
-# (the upgrade tier's re-checkout to PR head) preserves it — COMPOSE_FILE keeps resolving across
-# install AND upgrade deploys. CHAOS_BASE_DEPLOY=True (recipe_meta) lets the pinned base deploy
-# proceed despite this untracked file (abra's clean-tree check would otherwise FATA).
-set -euo pipefail
-
-: "${CCCI_RECIPE:?missing CCCI_RECIPE}"
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-RECIPE_DIR="${HOME}/.abra/recipes/${CCCI_RECIPE}"
-
-if [ ! -d "$RECIPE_DIR" ]; then
-  echo "  ghost install_steps: recipe dir $RECIPE_DIR missing — cannot provide health overlay" >&2
-  exit 1
-fi
-
-cp "$SCRIPT_DIR/compose.ccci-health.yml" "$RECIPE_DIR/compose.ccci-health.yml"
-echo "  ghost install_steps: provided compose.ccci-health.yml (healthcheck start_period bump) to ${CCCI_RECIPE}"
--- a/tests/ghost/recipe_meta.py
+++ b/tests/ghost/recipe_meta.py
@@ -12,17 +12,23 @@ DEPLOY_TIMEOUT = 1200  # subprocess timeout for `abra app deploy`
 HTTP_TIMEOUT = 900

 # Ghost's fresh-DB first boot runs a full schema migration (dozens of CREATE TABLEs, each a separate
-# MySQL round-trip → ~6-9min on cc-ci, round-trip-bound so more vCPU doesn't help). The upstream
-# recipe healthcheck (`start_period: 1m` + 10×30s ≈ 6min grace) is too tight: swarm kills the still-
-# migrating task, leaving a stale `migrations_lock` → every later task deadlocks
-# (`MigrationsAreLockedError`). cc-ci provides a DEPLOY overlay `compose.ccci-health.yml` (raises the
-# app healthcheck start_period to 900s; failures ignored during it, a PASS still marks healthy at
-# once) via COMPOSE_FILE + install_steps.sh, so the fresh migration finishes + releases the lock.
-# This is infra/deploy tuning — NO test/assertion is weakened. CHAOS_BASE_DEPLOY lets the pinned base
-# deploy proceed with the untracked overlay present. TIMEOUT 1200s = migration (≤9min) + convergence,
-# bounded so a genuine failure still fails (not a long blackout). See DECISIONS (ghost MySQL cold-boot).
-CHAOS_BASE_DEPLOY = True
+# MySQL round-trip → ~6-9min on cc-ci, round-trip-bound so more vCPU doesn't help). The published
+# recipe healthcheck used `start_period: 1m` (+10×30s ≈ 6min grace) — too tight on cc-ci: swarm kills
+# the still-migrating task, leaving a stale `migrations_lock` → every later task deadlocks
+# (`MigrationsAreLockedError`).
+#
+# FIXED IN THE RECIPE-PR (recipe-maintainers/ghost#1, branch ci/mysql-backup): the app-service
+# healthcheck `start_period` is bumped to a literal 15m in the recipe itself — the real recipe
+# everyone runs, NOT a cc-ci compose fork. This is the plan §9 / plan-prefer-env-over-compose-overlay.md
+# anti-drift path: start_period CANNOT be expressed as an env var (abra validates the literal compose
+# 'duration' format BEFORE env substitution — `${VAR}` / `"${VAR:-1m}"` → FATA 'Does not match format
+# duration'; reproduced by the Adversary, REVIEW-2 4b862f6), so a literal recipe-PR bump is the only
+# §9-compliant way to widen it. Precedent: discourse + lasuite-drive collabora start_period recipe-PRs.
+# start_period only widens the startup grace window (a healthy check still marks healthy at once → fast
+# hosts unaffected); NO test/assertion is weakened. With the bump in the recipe, the former cc-ci
+# DEPLOY overlay (`compose.ccci-health.yml` + `install_steps.sh` + COMPOSE_FILE + CHAOS_BASE_DEPLOY)
+# is DELETED. TIMEOUT 1200s = migration (≤9min) + convergence, bounded so a genuine failure still
+# fails (not a long blackout). See DECISIONS (ghost MySQL cold-boot / start_period recipe-PR).
 EXTRA_ENV = {
    "TIMEOUT": "1200",
-    "COMPOSE_FILE": "compose.yml:compose.ccci-health.yml",
 }
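
Note on the grace-window arithmetic in the recipe_meta.py comment above: a minimal sketch, not part of this commit, that reproduces the "1m + 10×30s ≈ 6min" figure and the same sum for the literal 15m bump. The helper names (`parse_duration`, `grace_seconds`) are hypothetical, and the regex only approximates the compose duration format; it is not abra's validator. The 30s interval and 10 retries are read from the comment, and "start_period + retries × interval" is a simplification of how swarm counts failures.

```python
import re

# Approximation of the compose 'duration' format (e.g. 30s, 15m, 1h30m).
# Assumption for illustration only; this is NOT abra's actual validation code.
DURATION_RE = re.compile(r"^(?:\d+(?:us|ms|s|m|h))+$")
UNIT_SECONDS = {"us": 1e-6, "ms": 1e-3, "s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> float:
    """Convert a literal duration such as '15m' to seconds.

    An unsubstituted placeholder like '${VAR}' is rejected, which mirrors
    (in spirit) why start_period cannot be supplied through an env var.
    """
    if not DURATION_RE.match(text):
        raise ValueError(f"does not match format 'duration': {text!r}")
    parts = re.findall(r"(\d+)(us|ms|s|m|h)", text)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


def grace_seconds(start_period: str, interval: str, retries: int) -> float:
    """Rough time before a never-passing task is marked unhealthy."""
    return parse_duration(start_period) + retries * parse_duration(interval)


if __name__ == "__main__":
    migration_worst_case = 9 * 60  # upper end of the ~6-9min range, in seconds

    old = grace_seconds("1m", "30s", 10)   # published recipe before the bump
    new = grace_seconds("15m", "30s", 10)  # literal bump in the recipe-PR

    print(f"old grace: {old}s, covers worst case: {old >= migration_worst_case}")
    print(f"new grace: {new}s, covers worst case: {new >= migration_worst_case}")

    try:
        parse_duration("${START_PERIOD:-1m}")
    except ValueError as err:
        print(f"placeholder rejected: {err}")
```

Under these assumptions the old window comes to 360 s, which is shorter than a 9-minute migration, while the 15m literal gives 1200 s, which covers it. A passing check still marks the task healthy immediately, so the wider window only matters on slow hosts.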