fix(2): discourse healthcheck start_period overlay (slow Rails boot) + CHAOS_BASE_DEPLOY + TIMEOUT 2400

Install timed out at 1800s: discourse's 15-25min Rails cold boot overran the recipe healthcheck start_period: 5m (swarm killed the still-booting app), so the deploy never converged within the timeout. Add compose.ccci-health.yml (app healthcheck start_period 1200s) via install_steps.sh + recipe_meta COMPOSE_FILE + CHAOS_BASE_DEPLOY, and bump DEPLOY_TIMEOUT/TIMEOUT to 2400. The image re-pin (bitnamilegacy) is already proven working. NO test weakened.
2026-05-30 11:48:18 +01:00
parent 0f597f2e3d
commit a432058aca
2 changed files with 44 additions and 0 deletions
--- a/tests/discourse/compose.ccci-health.yml
+++ b/tests/discourse/compose.ccci-health.yml
@@ -0,0 +1,18 @@
+# cc-ci deploy overlay (NOT a recipe change) — raises ONLY the app healthcheck start_period.
+#
+# Discourse (bitnamilegacy/discourse) is a slow-booting Rails app: its first cold boot does DB
+# migrate + asset precompile + bootstrap, which on cc-ci's single node regularly takes 15-25min. The
+# upstream recipe healthcheck on the `app` service uses `start_period: 5m` (+ 6×30s retries ≈ 8min
+# grace). On cc-ci the boot exceeds that, so swarm marks the still-booting task unhealthy and kills
+# it mid-boot; the task restarts, and the deploy never converges within the timeout
+# (observed: deploy timed out at 1800s with the app task still Running).
+#
+# Raising start_period (failures during it are ignored; a pass still marks the task healthy at once)
+# lets the cold boot finish, after which discourse serves /srv/status and the (unchanged) check passes.
+# This is DEPLOY/infra tuning, not a test change — no assertion is weakened, and the app's real
+# healthcheck still gates readiness. Applied via recipe_meta COMPOSE_FILE. The `app` service name is
+# verified against the PR-head compose (ci/bitnamilegacy-repin: services.app holds the healthcheck).
+services:
+  app:
+    healthcheck:
+      start_period: 1200s
--- a/tests/discourse/install_steps.sh
+++ b/tests/discourse/install_steps.sh
@@ -0,0 +1,26 @@
+#!/usr/bin/env bash
+# discourse — INSTALL-TIME hook (Phase 2 Q4.6). Runs during the install tier AFTER `abra app new` +
+# EXTRA_ENV + `abra app secret generate` and BEFORE the single `abra app deploy`
+# (lifecycle.py::_run_install_steps), with CCCI_RECIPE / CCCI_APP_DOMAIN / CCCI_APP_ENV in env.
+#
+# Purpose: copy the cc-ci deploy overlay `compose.ccci-health.yml` (app healthcheck start_period
+# bump) into the recipe checkout so recipe_meta's COMPOSE_FILE (compose.yml:compose.ccci-health.yml)
+# resolves. Without the larger start_period, discourse's 15-25min Rails cold boot is killed mid-boot
+# by the recipe's 5m-start_period healthcheck and the deploy never converges (see the overlay header).
+# The overlay is an UNTRACKED file in the recipe repo, so `git checkout -f` (the upgrade tier's
+# re-checkout to PR head) preserves it — COMPOSE_FILE keeps resolving across install AND upgrade
+# deploys. CHAOS_BASE_DEPLOY=True (recipe_meta) lets the pinned base deploy proceed despite this
+# untracked file (abra's clean-tree check would otherwise FATA).
+set -euo pipefail
+
+: "${CCCI_RECIPE:?missing CCCI_RECIPE}"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+RECIPE_DIR="${HOME}/.abra/recipes/${CCCI_RECIPE}"
+
+if [ ! -d "$RECIPE_DIR" ]; then
+  echo "  discourse install_steps: recipe dir $RECIPE_DIR missing — cannot provide health overlay" >&2
+  exit 1
+fi
+
+cp "$SCRIPT_DIR/compose.ccci-health.yml" "$RECIPE_DIR/compose.ccci-health.yml"
+echo "  discourse install_steps: provided compose.ccci-health.yml (healthcheck start_period bump) to ${CCCI_RECIPE}"
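
As a rough check of the numbers in the overlay header, the worst-case grace before swarm gives up on a task whose probe never passes can be approximated as start_period plus retries times interval. This is a simplified sketch and not part of the commit: `grace_seconds` is a hypothetical helper, the 6 retries and 30s interval are taken from the header comment, and the per-probe healthcheck timeout is ignored.

```shell
# Hypothetical helper (not in the repo): approximate seconds before a task whose
# probe never passes is marked unhealthy. Failures inside start_period are not
# counted; after it, `retries` consecutive failures spaced `interval` apart are needed.
grace_seconds() {
  local start_period=$1 retries=$2 interval=$3
  echo $(( start_period + retries * interval ))
}

echo "recipe:  $(grace_seconds 300 6 30)s"    # start_period: 5m    -> prints "recipe:  480s"
echo "overlay: $(grace_seconds 1200 6 30)s"   # start_period: 1200s -> prints "overlay: 1380s"
```

Under these assumptions the recipe default gives about 8 minutes of grace, matching the header, while the overlay gives about 23 minutes, which sits inside the 2400s deploy timeout.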