feat(2): discourse Q4.6 policy-compliant shape (plan §9) — env-var start_period, delete cc-ci overlay, upgrade N/A

Migrate discourse off the cc-ci compose overlay per plan §9 / plan-prefer-env-over-compose-overlay.md:
- recipe_meta: drop UPGRADE_BASE_VERSION + COMPOSE_FILE + CHAOS_BASE_DEPLOY; set APP_START_PERIOD=1200s via EXTRA_ENV (the recipe-PR exposes start_period: ${APP_START_PERIOD:-5m}); declare upgrade tier N/A (both published prev bases pin removed bitnami images; Adversary §7.1 granted, REVIEW-2 efe3790).
- delete tests/discourse/compose.ccci-health.yml + install_steps.sh (existed only to copy the overlay).
- DECISIONS.md + STATUS-2 record the §9 guardrail + discourse shape (upgrade N/A, env start_period, pg_backup restore-hook recipe-PR = 5th data-loss recipe cc-ci caught).

recipe-PR head now 8b8df17 (start_period env var added). Not a claim; run STAGES=install,backup,restore,custom next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 15:47:28 +01:00
parent a389bd0832
commit c346b9763b
5 changed files with 71 additions and 99 deletions
--- a/tests/discourse/compose.ccci-health.yml
+++ b/tests/discourse/compose.ccci-health.yml
@@ -1,32 +0,0 @@
-# cc-ci deploy overlay (NOT a recipe change) — raises ONLY the app healthcheck start_period.
-#
-# Discourse (bitnamilegacy/discourse) is a slow-booting Rails app: its first cold boot does DB
-# migrate + asset precompile + bootstrap, which on cc-ci's single node regularly takes 15-25min. The
-# upstream recipe healthcheck on the `app` service uses `start_period: 5m` (+ 6×30s retries ≈ 8min
-# grace); on cc-ci the boot exceeds that, so swarm marks the still-booting task unhealthy and KILLS
-# it mid-boot, it restarts, and the deploy never converges within the timeout (observed: deploy timed
-# out at 1800s with the app task still Running).
-#
-# Raising the START_PERIOD (failures ignored during it; a PASS still marks healthy immediately) lets
-# the cold boot finish, after which discourse serves /srv/status and the (unchanged) check passes.
-# This is DEPLOY/infra tuning, not a test change — no assertion is weakened, and the app's real
-# healthcheck still gates readiness. Applied via recipe_meta COMPOSE_FILE. The `app` service name is
-# verified against the PR-head compose (ci/bitnamilegacy-repin: services.app holds the healthcheck).
-#
-# IMAGE RE-PIN (upgrade-tier honesty, Adversary §7.1): the upgrade tier base-deploys the previous
-# published version 0.7.0+3.3.1 (UPGRADE_BASE_VERSION in recipe_meta — the PR's TRUE predecessor,
-# sharing the head's discourse 3.3.1 image), whose compose.yml pins `bitnami/discourse:3.3.1` on the
-# `app` AND `sidekiq` services — but Docker Hub no longer serves any `bitnami/discourse:*` tag (404).
-# This overlay re-pins BOTH to the servable `bitnamilegacy/discourse:3.3.1` (identical discourse
-# version, namespace-only) so the base deploy pulls, and the chaos head redeploy (PR 0.8.0, already
-# re-pinned to bitnamilegacy in compose.yml) gets the SAME value — making an HONEST 0.7.0→0.8.0
-# crossover testable. NOT a test weakening: the served discourse app image is the same 3.3.1 either
-# side; only the recipe-version label moves (the PR's actual change). Applies uniformly to base+head.
-version: "3.8"  # MUST match compose.yml's version — abra lint R011/R012 FATAs on a mismatch
-services:
-  app:
-    image: bitnamilegacy/discourse:3.3.1
-    healthcheck:
-      start_period: 1200s
-  sidekiq:
-    image: bitnamilegacy/discourse:3.3.1
--- a/tests/discourse/install_steps.sh
+++ b/tests/discourse/install_steps.sh
@@ -1,26 +0,0 @@
-#!/usr/bin/env bash
-# discourse — INSTALL-TIME hook (Phase 2 Q4.6). Runs during the install tier AFTER `abra app new` +
-# EXTRA_ENV + `abra app secret generate` and BEFORE the single `abra app deploy`
-# (lifecycle.py::_run_install_steps), with CCCI_RECIPE / CCCI_APP_DOMAIN / CCCI_APP_ENV in env.
-#
-# Purpose: provide the cc-ci deploy overlay `compose.ccci-health.yml` (app healthcheck start_period
-# bump) into the recipe checkout so recipe_meta's COMPOSE_FILE (compose.yml:compose.ccci-health.yml)
-# resolves. Without the larger start_period, discourse's 15-25min Rails cold boot is killed mid-boot
-# by the recipe's 5m-start_period healthcheck and the deploy never converges (see the overlay header).
-# The overlay is an UNTRACKED file in the recipe repo, so `git checkout -f` (the upgrade tier's
-# re-checkout to PR head) preserves it — COMPOSE_FILE keeps resolving across install AND upgrade
-# deploys. CHAOS_BASE_DEPLOY=True (recipe_meta) lets the pinned base deploy proceed despite this
-# untracked file (abra's clean-tree check would otherwise FATA).
-set -euo pipefail
-
-: "${CCCI_RECIPE:?missing CCCI_RECIPE}"
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-RECIPE_DIR="${HOME}/.abra/recipes/${CCCI_RECIPE}"
-
-if [ ! -d "$RECIPE_DIR" ]; then
-  echo "  discourse install_steps: recipe dir $RECIPE_DIR missing — cannot provide health overlay" >&2
-  exit 1
-fi
-
-cp "$SCRIPT_DIR/compose.ccci-health.yml" "$RECIPE_DIR/compose.ccci-health.yml"
-echo "  discourse install_steps: provided compose.ccci-health.yml (healthcheck start_period bump) to ${CCCI_RECIPE}"
--- a/tests/discourse/recipe_meta.py
+++ b/tests/discourse/recipe_meta.py
@@ -1,33 +1,31 @@
 # Per-recipe harness config for discourse (Phase 2 Q4.6 — forum; postgres + redis + sidekiq).
 #
-# Discourse (bitnami/discourse) is a slow-booting Rails app: the recipe healthcheck polls
-# /srv/status with a 5-minute start_period, and a cold first boot (DB migrate + asset precompile)
-# regularly takes 8-15 min, so the deploy/HTTP timeouts are generous. /srv/status returns 200 only
-# once the app is actually serving (the canonical "is discourse up" signal — NOT "/", which may
-# redirect to setup).
+# Discourse (bitnamilegacy/discourse) is a slow-booting Rails app: the recipe healthcheck polls
+# /srv/status, and a cold first boot (DB migrate + asset precompile) regularly takes 15-25 min on
+# cc-ci's single node, so the deploy/HTTP timeouts are generous. /srv/status returns 200 only once the
+# app is actually serving (the canonical "is discourse up" signal — NOT "/", which may redirect to setup).
 HEALTH_PATH = "/srv/status"
 HEALTH_OK = (200,)
-DEPLOY_TIMEOUT = 2400  # was 1800 — slow Rails cold boot (15-25min) overran it; bumped to match TIMEOUT
+DEPLOY_TIMEOUT = 2400  # slow Rails cold boot (15-25min); matches the EXTRA_ENV TIMEOUT below
 HTTP_TIMEOUT = 1200

-# cc-ci deploy overlay: discourse's 15-25min Rails cold boot exceeds the recipe healthcheck's
-# start_period:5m (+8min grace), so swarm kills the still-booting app and the deploy never converges
-# (observed: 1800s timeout). compose.ccci-health.yml raises the app healthcheck start_period to 1200s
-# (failures ignored during it; a PASS still marks healthy at once) — DEPLOY/infra tuning, NO test
-# weakened. install_steps.sh provides the overlay into the checkout; COMPOSE_FILE wires it; TIMEOUT
-# 2400 lets abra's convergence wait outlast the boot. CHAOS_BASE_DEPLOY lets the pinned base deploy
-# proceed with the untracked overlay present. (Same pattern as tests/ghost/.)
-CHAOS_BASE_DEPLOY = True
+# Slow-cold-boot handling via env, NOT a cc-ci compose overlay (plan.md §9 anti-drift guardrail):
+# discourse's 15-25min Rails cold boot exceeds the recipe healthcheck's default start_period (5m) +
+# grace, so swarm would kill the still-booting app and the deploy never converges. Rather than fork
+# the recipe with a compose.*.yml overlay (which drifts from what ships), the recipe-PR
+# (recipe-maintainers/discourse#1) parameterizes the app healthcheck as
+# `start_period: ${APP_START_PERIOD:-5m}` (default unchanged for real users); cc-ci just sets a larger
+# value here. TIMEOUT (abra's internal convergence wait) is raised to outlast the boot.
 EXTRA_ENV = {
    "TIMEOUT": "2400",
-    "COMPOSE_FILE": "compose.yml:compose.ccci-health.yml",
+    "APP_START_PERIOD": "1200s",
 }

-# Upgrade-tier base version (Adversary §7.1): the harness default base = recipe_versions[-2], which
-# for discourse is 0.6.3+3.1.2 (discourse 3.1.2). But this PR (recipe-maintainers/discourse#1) ADDS a
-# version (0.8.0+3.3.1) ABOVE the newest published tag, so the PR's TRUE predecessor is [-1] =
-# 0.7.0+3.3.1 — which shares the head's discourse 3.3.1 image, making an HONEST 0.7.0→0.8.0 crossover
-# testable via the uniform bitnamilegacy:3.3.1 image overlay (compose.ccci-health.yml). [-2]=3.1.2
-# differs from head 3.3.1, so a uniform overlay there would be a hollow (fake-version) base. Pinning
-# the base to [-1] is the correct predecessor whenever a PR adds a version above the catalogue head.
-UPGRADE_BASE_VERSION = "0.7.0+3.3.1"
+# Upgrade tier — N/A (declared NOT-TESTABLE under cc-ci; Adversary §7.1 sign-off GRANTED, REVIEW-2
+# efe3790). Both published predecessor versions pin Docker-Hub-removed images:
+#   0.7.0+3.3.1 → bitnami/discourse:3.3.1 (404),  0.6.3+3.1.2 → bitnami/discourse:3.1.2 (404).
+# The recipe-PR re-pins the HEAD to bitnamilegacy/discourse:3.3.1 (a legit upstream fix), but per
+# plan.md §9 / plan-prefer-env-over-compose-overlay.md pt2 we declare an old base whose image is gone
+# NOT-TESTABLE rather than authoring an image-repin compose overlay to resurrect it. So no honest
+# prev→head crossover is deployable here → the upgrade tier is omitted (run STAGES without `upgrade`).
+# (P1 coverage is the maximal subset install+backup+restore+custom; P4 restore-hook is the headline.)
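Reviewer note: the env-over-overlay mechanism in the recipe_meta.py hunk can be illustrated with a minimal sketch. It assumes the usual Compose meaning of `${VAR:-default}` (use the default when the variable is unset or empty). The `interpolate` helper is hypothetical, written only for this illustration; it is neither abra's nor Compose's implementation and handles only this one substitution form.

```python
import re

# Hypothetical helper for illustration only: handles just the `${VAR:-default}` form.
_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):-([^}]*)\}")


def interpolate(template, env):
    """Replace each ${VAR:-default} with env[VAR] when set and non-empty, else the default."""
    def _sub(match):
        name, default = match.group(1), match.group(2)
        value = env.get(name, "")
        return value if value else default

    return _PATTERN.sub(_sub, template)


# The healthcheck line as the commit says the recipe-PR exposes it.
LINE = "start_period: ${APP_START_PERIOD:-5m}"

# EXTRA_ENV as set in tests/discourse/recipe_meta.py after this commit.
EXTRA_ENV = {"TIMEOUT": "2400", "APP_START_PERIOD": "1200s"}

default_line = interpolate(LINE, {})        # a deployment that sets nothing
ccci_line = interpolate(LINE, EXTRA_ENV)    # a cc-ci deployment

print(default_line)
print(ccci_line)
```

Under that assumption, a deployment that leaves APP_START_PERIOD unset keeps the 5m default, while cc-ci gets 1200s without any compose.*.yml overlay in the recipe checkout.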