fix(2): discourse healthcheck start_period overlay (slow Rails boot) + CHAOS_BASE_DEPLOY + TIMEOUT 2400
Install timed out at 1800s: discourse's 15-25min Rails cold boot overran both the deploy timeout and the recipe healthcheck start_period:5m (swarm killed the booting app). Add compose.ccci-health.yml (app healthcheck start_period 1200s) via install_steps.sh + recipe_meta COMPOSE_FILE + CHAOS_BASE_DEPLOY, bump DEPLOY_TIMEOUT/TIMEOUT to 2400. Image re-pin (bitnamilegacy) already proven working. NO test weakened.
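Reviewer note: the fix depends on every entry of recipe_meta's COMPOSE_FILE (compose.yml:compose.ccci-health.yml) existing in the recipe checkout before the deploy. A minimal preflight sketch of that condition follows. The function name `missing_compose_files` and the scratch directory are hypothetical and not part of this commit. The only assumption taken from the message is that COMPOSE_FILE is a colon-separated list of paths relative to the recipe directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not in this commit): print each COMPOSE_FILE entry that
# does not exist under the given directory; return non-zero if any is missing.
missing_compose_files() {
  local recipe_dir=$1 compose_file=$2
  local -a entries
  local entry status=0
  # COMPOSE_FILE is assumed to be a colon-separated list of relative paths.
  IFS=':' read -r -a entries <<< "$compose_file"
  for entry in "${entries[@]}"; do
    if [ ! -f "$recipe_dir/$entry" ]; then
      echo "$entry"
      status=1
    fi
  done
  return "$status"
}

# Demo against a scratch directory standing in for the recipe checkout.
scratch="$(mktemp -d)"
touch "$scratch/compose.yml"
before="$(missing_compose_files "$scratch" "compose.yml:compose.ccci-health.yml" || true)"
touch "$scratch/compose.ccci-health.yml"
after="$(missing_compose_files "$scratch" "compose.yml:compose.ccci-health.yml" || true)"
echo "missing before overlay copy: ${before:-none}"
echo "missing after overlay copy:  ${after:-none}"
rm -rf "$scratch"
```

The "before" case corresponds to a checkout where the install hook has not yet copied the overlay.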
18
tests/discourse/compose.ccci-health.yml
Normal file
@@ -0,0 +1,18 @@
# cc-ci deploy overlay (NOT a recipe change) — raises ONLY the app healthcheck start_period.
#
# Discourse (bitnamilegacy/discourse) is a slow-booting Rails app: its first cold boot does DB
# migrate + asset precompile + bootstrap, which on cc-ci's single node regularly takes 15-25min. The
# upstream recipe healthcheck on the `app` service uses `start_period: 5m` (+ 6×30s retries ≈ 8min
# grace); on cc-ci the boot exceeds that, so swarm marks the still-booting task unhealthy and KILLS
# it mid-boot, it restarts, and the deploy never converges within the timeout (observed: deploy timed
# out at 1800s with the app task still Running).
#
# Raising the START_PERIOD (failures ignored during it; a PASS still marks healthy immediately) lets
# the cold boot finish, after which discourse serves /srv/status and the (unchanged) check passes.
# This is DEPLOY/infra tuning, not a test change — no assertion is weakened, and the app's real
# healthcheck still gates readiness. Applied via recipe_meta COMPOSE_FILE. The `app` service name is
# verified against the PR-head compose (ci/bitnamilegacy-repin: services.app holds the healthcheck).
services:
  app:
    healthcheck:
      start_period: 1200s
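Reviewer note: the "≈ 8min grace" figure in the overlay header can be reproduced with simple arithmetic, and the same formula gives the window after this change. This is a rough sketch, not how swarm computes anything internally. It assumes the interval (30s) and retries (6) quoted in the header stay as upstream, since the overlay overrides only start_period. It also ignores the per-check timeout. The helper name `grace_seconds` is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Approximate seconds until a never-passing check marks the task unhealthy:
# failures inside start_period are ignored, then `retries` consecutive
# failures spaced `interval` apart are needed. (Assumed model, see note above.)
grace_seconds() {
  local start_period=$1 interval=$2 retries=$3
  echo $(( start_period + interval * retries ))
}

upstream_grace="$(grace_seconds 300 30 6)"    # start_period: 5m
overlay_grace="$(grace_seconds 1200 30 6)"    # start_period: 1200s
deploy_timeout=2400                           # DEPLOY_TIMEOUT/TIMEOUT from the commit message

echo "upstream grace: ${upstream_grace}s (~$(( upstream_grace / 60 ))min)"
echo "overlay grace:  ${overlay_grace}s (~$(( overlay_grace / 60 ))min)"
echo "deploy timeout: ${deploy_timeout}s (~$(( deploy_timeout / 60 ))min)"
```

Under these assumptions the overlay window is about 23 minutes, which covers most of the 15-25min boot range quoted in the header but not its upper end. The source does not say whether a boot that slow has been observed with the overlay in place.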
26
tests/discourse/install_steps.sh
Executable file
@@ -0,0 +1,26 @@
#!/usr/bin/env bash
# discourse — INSTALL-TIME hook (Phase 2 Q4.6). Runs during the install tier AFTER `abra app new` +
# EXTRA_ENV + `abra app secret generate` and BEFORE the single `abra app deploy`
# (lifecycle.py::_run_install_steps), with CCCI_RECIPE / CCCI_APP_DOMAIN / CCCI_APP_ENV in env.
#
# Purpose: provide the cc-ci deploy overlay `compose.ccci-health.yml` (app healthcheck start_period
# bump) into the recipe checkout so recipe_meta's COMPOSE_FILE (compose.yml:compose.ccci-health.yml)
# resolves. Without the larger start_period, discourse's 15-25min Rails cold boot is killed mid-boot
# by the recipe's 5m-start_period healthcheck and the deploy never converges (see the overlay header).
# The overlay is an UNTRACKED file in the recipe repo, so `git checkout -f` (the upgrade tier's
# re-checkout to PR head) preserves it — COMPOSE_FILE keeps resolving across install AND upgrade
# deploys. CHAOS_BASE_DEPLOY=True (recipe_meta) lets the pinned base deploy proceed despite this
# untracked file (abra's clean-tree check would otherwise FATA).
set -euo pipefail

: "${CCCI_RECIPE:?missing CCCI_RECIPE}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
RECIPE_DIR="${HOME}/.abra/recipes/${CCCI_RECIPE}"

if [ ! -d "$RECIPE_DIR" ]; then
  echo " discourse install_steps: recipe dir $RECIPE_DIR missing — cannot provide health overlay" >&2
  exit 1
fi

cp "$SCRIPT_DIR/compose.ccci-health.yml" "$RECIPE_DIR/compose.ccci-health.yml"
echo " discourse install_steps: provided compose.ccci-health.yml (healthcheck start_period bump) to ${CCCI_RECIPE}"
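Reviewer note: the script header relies on two git behaviours. An untracked file survives `git checkout -f`, and the same file makes the working tree report as not clean. Both can be checked in a throwaway repository. The sketch below is illustrative only: the temp repo, the commit contents, and the `git_demo` wrapper are hypothetical stand-ins for the recipe checkout, and it does not exercise abra itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repo standing in for ~/.abra/recipes/<recipe> (hypothetical).
demo_repo="$(mktemp -d)"
git_demo() {
  git -C "$demo_repo" -c user.name=demo -c user.email=demo@example.invalid "$@"
}

git_demo init -q
echo "services: {}" > "$demo_repo/compose.yml"
git_demo add compose.yml
git_demo commit -q -m "pinned base"
base_rev="$(git_demo rev-parse HEAD)"

echo "# PR head change" >> "$demo_repo/compose.yml"
git_demo commit -q -am "PR head"
head_rev="$(git_demo rev-parse HEAD)"

# The overlay is copied in but never added, so it stays untracked.
echo "services: {app: {healthcheck: {start_period: 1200s}}}" \
  > "$demo_repo/compose.ccci-health.yml"

# Mimic a base checkout followed by a re-checkout to PR head.
git_demo checkout -q -f "$base_rev"
git_demo checkout -q -f "$head_rev"

if [ -f "$demo_repo/compose.ccci-health.yml" ]; then
  overlay_survived=yes
else
  overlay_survived=no
fi
# Porcelain status lists the untracked overlay, i.e. the tree is not clean.
tree_state="$(git_demo status --porcelain)"

echo "overlay survived forced checkouts: $overlay_survived"
echo "git status --porcelain: $tree_state"
rm -rf "$demo_repo"
```

The non-empty porcelain status is the condition the header describes as tripping abra's clean-tree check, which is why CHAOS_BASE_DEPLOY is set alongside the overlay.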