fix(2): discourse healthcheck start_period overlay (slow Rails boot) + CHAOS_BASE_DEPLOY + TIMEOUT 2400

Install timed out at 1800s: discourse's 15-25min Rails cold boot overran the recipe healthcheck
start_period of 5m, so swarm killed the booting app and the deploy never converged within the
timeout. Add compose.ccci-health.yml (app healthcheck start_period 1200s) via install_steps.sh +
recipe_meta COMPOSE_FILE + CHAOS_BASE_DEPLOY, and bump DEPLOY_TIMEOUT/TIMEOUT to 2400. Image re-pin
(bitnamilegacy) already proven working. NO test weakened.
2026-05-30 11:48:18 +01:00
parent 0f597f2e3d
commit a432058aca
2 changed files with 44 additions and 0 deletions


@@ -0,0 +1,18 @@
# cc-ci deploy overlay (NOT a recipe change) — raises ONLY the app healthcheck start_period.
#
# Discourse (bitnamilegacy/discourse) is a slow-booting Rails app: its first cold boot does DB
# migrate + asset precompile + bootstrap, which on cc-ci's single node regularly takes 15-25min. The
# upstream recipe healthcheck on the `app` service uses `start_period: 5m` (+ 6×30s retries ≈ 8min
# grace); on cc-ci the boot exceeds that, so swarm marks the still-booting task unhealthy and KILLS
# it mid-boot. The task then restarts and the deploy never converges within the timeout (observed:
# deploy timed out at 1800s with the app task still Running).
#
# Raising the START_PERIOD (failures ignored during it; a PASS still marks healthy immediately) lets
# the cold boot finish, after which discourse serves /srv/status and the (unchanged) check passes.
# This is DEPLOY/infra tuning, not a test change — no assertion is weakened, and the app's real
# healthcheck still gates readiness. Applied via recipe_meta COMPOSE_FILE. The `app` service name is
# verified against the PR-head compose (ci/bitnamilegacy-repin: services.app holds the healthcheck).
services:
  app:
    healthcheck:
      start_period: 1200s
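
The grace arithmetic in the overlay header (`start_period` plus 6×30s retries) can be checked with a small sketch. It assumes the recipe's interval is 30s and its retries are 6, as the header states, and it ignores the per-probe timeout, so the numbers are approximate. `grace_seconds` is a hypothetical helper, not part of cc-ci.

```shell
#!/usr/bin/env bash
set -euo pipefail

# grace_seconds START_PERIOD INTERVAL RETRIES
# Approximate seconds before a never-passing healthcheck marks the task
# unhealthy: the start period, then RETRIES consecutive failed probes.
grace_seconds() {
  local start_period=$1 interval=$2 retries=$3
  echo $(( start_period + interval * retries ))
}

# Upstream recipe: start_period 5m, 6 retries at 30s (values from the overlay header).
recipe_grace=$(grace_seconds 300 30 6)
# With the overlay: start_period 1200s, interval and retries unchanged.
overlay_grace=$(grace_seconds 1200 30 6)

echo "recipe grace:  ${recipe_grace}s ($(( recipe_grace / 60 ))min)"
echo "overlay grace: ${overlay_grace}s ($(( overlay_grace / 60 ))min)"
```

Under these assumptions the recipe allows about 8 minutes and the overlay about 23 minutes before swarm can kill a task that has not yet passed its check.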


@@ -0,0 +1,26 @@
#!/usr/bin/env bash
# discourse — INSTALL-TIME hook (Phase 2 Q4.6). Runs during the install tier AFTER `abra app new` +
# EXTRA_ENV + `abra app secret generate` and BEFORE the single `abra app deploy`
# (lifecycle.py::_run_install_steps), with CCCI_RECIPE / CCCI_APP_DOMAIN / CCCI_APP_ENV in env.
#
# Purpose: provide the cc-ci deploy overlay `compose.ccci-health.yml` (app healthcheck start_period
# bump) into the recipe checkout so recipe_meta's COMPOSE_FILE (compose.yml:compose.ccci-health.yml)
# resolves. Without the larger start_period, discourse's 15-25min Rails cold boot is killed mid-boot
# by the recipe's 5m-start_period healthcheck and the deploy never converges (see the overlay header).
# The overlay is an UNTRACKED file in the recipe repo, so `git checkout -f` (the upgrade tier's
# re-checkout to PR head) preserves it — COMPOSE_FILE keeps resolving across install AND upgrade
# deploys. CHAOS_BASE_DEPLOY=True (recipe_meta) lets the pinned base deploy proceed despite this
# untracked file (abra's clean-tree check would otherwise abort with a FATA error).
set -euo pipefail
: "${CCCI_RECIPE:?missing CCCI_RECIPE}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
RECIPE_DIR="${HOME}/.abra/recipes/${CCCI_RECIPE}"
if [ ! -d "$RECIPE_DIR" ]; then
  echo " discourse install_steps: recipe dir $RECIPE_DIR missing — cannot provide health overlay" >&2
  exit 1
fi
cp "$SCRIPT_DIR/compose.ccci-health.yml" "$RECIPE_DIR/compose.ccci-health.yml"
echo " discourse install_steps: provided compose.ccci-health.yml (healthcheck start_period bump) to ${CCCI_RECIPE}"
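
The hook's header relies on an untracked file surviving `git checkout -f`. The sketch below reproduces that behaviour in a throwaway repository. It is an illustration only, not the cc-ci upgrade tier itself: the repository contents, commit messages and identity are made up, and only the overlay filename is taken from this commit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repo with two commits standing in for "pinned base" and "PR head".
repo="$(mktemp -d)"
git -C "$repo" init -q
git -C "$repo" config user.email "ci@example.invalid"
git -C "$repo" config user.name "ci"

printf 'services: {}\n' > "$repo/compose.yml"
git -C "$repo" add compose.yml
git -C "$repo" commit -q -m "base"
base_rev="$(git -C "$repo" rev-parse HEAD)"

printf 'services: {app: {}}\n' > "$repo/compose.yml"
git -C "$repo" commit -q -am "head"

# Drop in the untracked overlay, then force-checkout the other revision.
printf '# overlay\n' > "$repo/compose.ccci-health.yml"
git -C "$repo" checkout -q -f "$base_rev"

if [ -f "$repo/compose.ccci-health.yml" ]; then
  overlay_survived=yes
else
  overlay_survived=no
fi
# The tree is no longer clean: the overlay shows up as an untracked entry.
tree_status="$(git -C "$repo" status --porcelain)"

echo "overlay survived: ${overlay_survived}"
echo "status: ${tree_status}"
rm -rf "$repo"
```

The non-empty `git status --porcelain` output is the other half of the picture: the same untracked file that survives the re-checkout also makes the working tree dirty, which is the situation the hook's header says CHAOS_BASE_DEPLOY is there to tolerate.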