From c346b9763bdfd7617a249971f503b218ae0f7224 Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Sat, 30 May 2026 15:47:28 +0100
Subject: [PATCH] =?UTF-8?q?feat(2):=20discourse=20Q4.6=20policy-compliant?=
 =?UTF-8?q?=20shape=20(plan=20=C2=A79)=20=E2=80=94=20env-var=20start=5Fper?=
 =?UTF-8?q?iod,=20delete=20cc-ci=20overlay,=20upgrade=20N/A?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Migrate discourse off the cc-ci compose overlay per plan §9 /
plan-prefer-env-over-compose-overlay.md:
- recipe_meta: drop UPGRADE_BASE_VERSION + COMPOSE_FILE + CHAOS_BASE_DEPLOY;
  set APP_START_PERIOD=1200s via EXTRA_ENV (the recipe-PR exposes
  start_period: ${APP_START_PERIOD:-5m}); declare upgrade tier N/A (both
  published prev bases pin removed bitnami images; Adversary §7.1 granted,
  REVIEW-2 efe3790).
- delete tests/discourse/compose.ccci-health.yml + install_steps.sh (existed
  only to copy the overlay).
- DECISIONS.md + STATUS-2 record the §9 guardrail + discourse shape (upgrade
  N/A, env start_period, pg_backup restore-hook recipe-PR = 5th data-loss
  recipe cc-ci caught).

recipe-PR head now 8b8df17 (start_period env var added). Not a claim — run
STAGES=install,backup,restore,custom next.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 machine-docs/DECISIONS.md               | 32 ++++++++++++++++++
 machine-docs/STATUS-2.md                | 36 ++++++++++----------
 tests/discourse/compose.ccci-health.yml | 32 ------------------
 tests/discourse/install_steps.sh        | 26 ---------------
 tests/discourse/recipe_meta.py          | 44 ++++++++++++-------------
 5 files changed, 71 insertions(+), 99 deletions(-)
 delete mode 100644 tests/discourse/compose.ccci-health.yml
 delete mode 100755 tests/discourse/install_steps.sh

diff --git a/machine-docs/DECISIONS.md b/machine-docs/DECISIONS.md
index 3772336..74ecf0b 100644
--- a/machine-docs/DECISIONS.md
+++ b/machine-docs/DECISIONS.md
@@ -1001,3 +1001,35 @@ run when ClickHouse fails to boot. NOT weakening anything.
 **Re-entry:** when the ClickHouse boot is stabilised (e.g. a recipe-level readiness/restart margin,
 a ulimit/mmap fix, or an operator node tweak), re-run `RECIPE=plausible STAGES=install,upgrade,backup,
 restore,custom` until a clean ClickHouse boot lands, then claim the full Q4.7 gate. Filed in DEFERRED.md.
+
+## 2026-05-30 — plan §9 anti-overlay guardrail + discourse Q4.6 policy-compliant shape
+
+Orchestrator policy (plan.md §9 + cc-ci-plan/plan-prefer-env-over-compose-overlay.md): AVOID cc-ci
+`compose.*.yml` overlays (a private fork that drifts from what ships). Preferred fixes:
+1. cc-ci-tuned value (e.g. healthcheck start_period) → UPSTREAM recipe-PR exposing it as an env var
+   (current value as default in env.sample); cc-ci sets it via `recipe_meta` EXTRA_ENV. No new compose.
+2. Old upgrade-base needing a custom compose (removed image, or predates an overlay) → DECLARE that
+   base NOT-TESTABLE under CI (record + scope the crossover) rather than authoring a custom compose.
+
+**discourse Q4.6 applies both:**
+- **start_period** → recipe-PR `recipe-maintainers/discourse#1` parameterizes the app healthcheck
+  `start_period: ${APP_START_PERIOD:-5m}` (+ commented `APP_START_PERIOD` in .env.sample, default
+  unchanged for real users); cc-ci sets `APP_START_PERIOD=1200s` via EXTRA_ENV. The cc-ci overlay
+  `tests/discourse/compose.ccci-health.yml` + `install_steps.sh` + `COMPOSE_FILE`/`CHAOS_BASE_DEPLOY`
+  are DELETED.
+- **upgrade tier N/A** (Adversary §7.1 sign-off GRANTED, REVIEW-2 efe3790): both published
+  predecessors pin Docker-Hub-removed images (0.7.0+3.3.1→bitnami/discourse:3.3.1 404, 0.6.3+3.1.2→
+  bitnami/discourse:3.1.2 404). Per §9 pt2 we declare them not-testable rather than resurrect an old
+  base with an image-repin overlay. So discourse runs the maximal subset install,backup,restore,custom.
+  (The earlier "honest 0.7.0→0.8.0 crossover via UPGRADE_BASE_VERSION + uniform bitnamilegacy overlay"
+  is SUPERSEDED by this policy. The generic `UPGRADE_BASE_VERSION` recipe_meta knob added to
+  run_recipe_ci.py stays as a harmless unused generic hook.)
+- **postgres restore-hook** (recipe-PR, policy-neutral): the published recipe pg_dumped on backup but
+  had NO restore hook → a restored backup silently kept the live (un-restored) state. cc-ci's P4
+  overlay caught it (seeded ci_marker gone after restore). The PR adds `pg_backup.sh`
+  (backup=pg_dump|gzip into the postgresql_data volume; restore=terminate conns + DROP DATABASE WITH
+  FORCE + createdb + reimport) + db config-mount + backupbot backup/restore hooks. discourse is the
+  5th data-loss recipe cc-ci caught (immich / mattermost-lts / ghost class).
+
+Follow-ups (F2-14 / sub-plan E1-E6, DONE veto'd until cleared): ghost start_period overlay →
+APP_START_PERIOD env PR (E1); mumble host-ports overlay → justify-as-last-resort or migrate (E4).
diff --git a/machine-docs/STATUS-2.md b/machine-docs/STATUS-2.md
index e19595f..cf0e7e5 100644
--- a/machine-docs/STATUS-2.md
+++ b/machine-docs/STATUS-2.md
@@ -66,24 +66,24 @@ tree must carry:
   the running `drone_…` stack is the platform's OWN CI engine (infra), NOT the recipe-under-test
   (false alarm cleared). Deferral SOUND; maximal subset (declarative fix + scoped gitea+drone suite)
   ready for post-rebuild run.
-- **discourse (Q4.6)** — IN PROGRESS @2026-05-30. Re-pin **PR `recipe-maintainers/discourse#1`**
-  (branch `ci/bitnamilegacy-repin`, head `7b7ddd70bc753608d086884b8de1ad3c327d9ac5`) re-pins both
-  `bitnami/discourse:3.3.1` → `bitnamilegacy/discourse:3.3.1` (legacy=200, bitnami=404) + bumps version
-  0.7.0→0.8.0. install+custom GREEN (pr5, healthcheck-overlay + re-pin both work); P3 authored (§4.3
-  create-topic + site config). **UPGRADE TIER — implementing the HONEST crossover (Adversary §7.1 leans
-  DENY on a skip-with-sign-off; agreed).** Honest 0.7.0+3.3.1 → 0.8.0+3.3.1 is achievable: harness
-  default upgrade base = `recipe_versions[-2]` = 0.6.3+3.1.2 (img 3.1.2 — hollow, ≠ head's 3.3.1), but
-  the PR's TRUE predecessor is [-1] = 0.7.0+3.3.1 (shares head's 3.3.1). Implemented cc-ci-side (commit
-  a750937): (a) `recipe_meta.UPGRADE_BASE_VERSION="0.7.0+3.3.1"` + generic override in `run_recipe_ci.py`
-  (`prev = meta.get("UPGRADE_BASE_VERSION") or previous_version`); (b) `compose.ccci-health.yml` re-pins
-  `services.{app,sidekiq}.image: bitnamilegacy/discourse:3.3.1` (servable base 0.7.0 whose compose pins
-  the 404 bitnami:3.3.1; idempotent on head). → real HC1 crossover (version-label 0.7.0→0.8.0, same
-  servable discourse 3.3.1; namespace-only re-pin = the PR's change). **FULL run install,upgrade,backup,
-  restore,custom IN FLIGHT** on cc-ci `/root/builder-clone`, log `/root/ccci-discourse-maxsub.log`,
-  `RECIPE=discourse PR=1 REF=7b7ddd70... SRC=recipe-maintainers/discourse`. On green → CLAIM Q4.6 (no §7.1
-  deferral). If restore (P4) RED → discourse postgres restore-hook recipe-PR (immich/mattermost/ghost
-  class). **POLL with `ssh -T` (no PTY).** **THEN:** plausible Q4.7b recipe-PR (`entrypoint.clickhouse.sh`
-  wget restart-storm) → plausible-full green → CLAIM Q4.7.
+- **discourse (Q4.6)** — IN PROGRESS @2026-05-30, **policy-compliant shape (plan §9 anti-overlay)**.
+  recipe-PR `recipe-maintainers/discourse#1` (branch `ci/bitnamilegacy-repin`, head
+  `8b8df1730f48e4f8e8d1d7e2c0a7c9b5e4f3a2d1`): (1) re-pins app+sidekiq `bitnami/discourse:3.3.1` →
+  `bitnamilegacy/discourse:3.3.1` (bitnami 404; legit upstream fix); (2) parameterizes the app
+  healthcheck `start_period: ${APP_START_PERIOD:-5m}` + `.env.sample` default (cc-ci sets
+  `APP_START_PERIOD=1200s` via EXTRA_ENV — NO cc-ci compose overlay); (3) adds `pg_backup.sh` +
+  db config-mount + backupbot backup/restore hooks (P4 restore-hook — published recipe had pg_dump
+  backup but no restore → silent data loss; cc-ci caught it: 5th data-loss recipe, immich/mattermost/
+  ghost class). **UPGRADE TIER = N/A** (Adversary §7.1 sign-off GRANTED, REVIEW-2 `efe3790`): both
+  published predecessors pin Docker-Hub-removed images (0.7.0→bitnami:3.3.1 404, 0.6.3→bitnami:3.1.2
+  404); per §9 pt2 declared NOT-TESTABLE rather than image-repin overlay. cc-ci overlay
+  (`compose.ccci-health.yml` + `install_steps.sh` + COMPOSE_FILE/CHAOS_BASE_DEPLOY) **DELETED**;
+  `UPGRADE_BASE_VERSION` removed from recipe_meta (the generic harness knob stays, unused). **Run shape:
+  `STAGES=install,backup,restore,custom`** (no upgrade). **NEXT:** run
+  `RECIPE=discourse PR=1 REF=8b8df1730f48e4f8e8d1d7e2c0a7c9b5e4f3a2d1 SRC=recipe-maintainers/discourse
+  STAGES=install,backup,restore,custom` on `/root/builder-clone` → on all-green CLAIM Q4.6. **POLL with
+  `ssh -T` (no PTY).** **THEN:** ghost E1 (start_period→APP_START_PERIOD env PR) + plausible Q4.7b +
+  mumble E4 → Q5 (these + the overlay migrations gate the DONE veto F2-14).
 - authentik / various --extra-flag tests — DEFERRED (Phase-2 DONE NOT gated on them per operator policy).
 
 DoD P2/P5/P6/P7/P8 broadly satisfied; remaining is P1 coverage of the above + Q5 docs/sample re-verify.
diff --git a/tests/discourse/compose.ccci-health.yml b/tests/discourse/compose.ccci-health.yml
deleted file mode 100644
index 602c65d..0000000
--- a/tests/discourse/compose.ccci-health.yml
+++ /dev/null
@@ -1,32 +0,0 @@
-# cc-ci deploy overlay (NOT a recipe change) — raises ONLY the app healthcheck start_period.
-#
-# Discourse (bitnamilegacy/discourse) is a slow-booting Rails app: its first cold boot does DB
-# migrate + asset precompile + bootstrap, which on cc-ci's single node regularly takes 15-25min. The
-# upstream recipe healthcheck on the `app` service uses `start_period: 5m` (+ 6×30s retries ≈ 8min
-# grace); on cc-ci the boot exceeds that, so swarm marks the still-booting task unhealthy and KILLS
-# it mid-boot, it restarts, and the deploy never converges within the timeout (observed: deploy timed
-# out at 1800s with the app task still Running).
-#
-# Raising the START_PERIOD (failures ignored during it; a PASS still marks healthy immediately) lets
-# the cold boot finish, after which discourse serves /srv/status and the (unchanged) check passes.
-# This is DEPLOY/infra tuning, not a test change — no assertion is weakened, and the app's real
-# healthcheck still gates readiness. Applied via recipe_meta COMPOSE_FILE. The `app` service name is
-# verified against the PR-head compose (ci/bitnamilegacy-repin: services.app holds the healthcheck).
-#
-# IMAGE RE-PIN (upgrade-tier honesty, Adversary §7.1): the upgrade tier base-deploys the previous
-# published version 0.7.0+3.3.1 (UPGRADE_BASE_VERSION in recipe_meta — the PR's TRUE predecessor,
-# sharing the head's discourse 3.3.1 image), whose compose.yml pins `bitnami/discourse:3.3.1` on the
-# `app` AND `sidekiq` services — but Docker Hub no longer serves any `bitnami/discourse:*` tag (404).
-# This overlay re-pins BOTH to the servable `bitnamilegacy/discourse:3.3.1` (identical discourse
-# version, namespace-only) so the base deploy pulls, and the chaos head redeploy (PR 0.8.0, already
-# re-pinned to bitnamilegacy in compose.yml) gets the SAME value — making an HONEST 0.7.0→0.8.0
-# crossover testable. NOT a test weakening: the served discourse app image is the same 3.3.1 either
-# side; only the recipe-version label moves (the PR's actual change). Applies uniformly to base+head.
-version: "3.8" # MUST match compose.yml's version — abra lint R011/R012 FATAs on a mismatch
-services:
-  app:
-    image: bitnamilegacy/discourse:3.3.1
-    healthcheck:
-      start_period: 1200s
-  sidekiq:
-    image: bitnamilegacy/discourse:3.3.1
diff --git a/tests/discourse/install_steps.sh b/tests/discourse/install_steps.sh
deleted file mode 100755
index 01a80ba..0000000
--- a/tests/discourse/install_steps.sh
+++ /dev/null
@@ -1,26 +0,0 @@
-#!/usr/bin/env bash
-# discourse — INSTALL-TIME hook (Phase 2 Q4.6). Runs during the install tier AFTER `abra app new` +
-# EXTRA_ENV + `abra app secret generate` and BEFORE the single `abra app deploy`
-# (lifecycle.py::_run_install_steps), with CCCI_RECIPE / CCCI_APP_DOMAIN / CCCI_APP_ENV in env.
-#
-# Purpose: provide the cc-ci deploy overlay `compose.ccci-health.yml` (app healthcheck start_period
-# bump) into the recipe checkout so recipe_meta's COMPOSE_FILE (compose.yml:compose.ccci-health.yml)
-# resolves. Without the larger start_period, discourse's 15-25min Rails cold boot is killed mid-boot
-# by the recipe's 5m-start_period healthcheck and the deploy never converges (see the overlay header).
-# The overlay is an UNTRACKED file in the recipe repo, so `git checkout -f` (the upgrade tier's
-# re-checkout to PR head) preserves it — COMPOSE_FILE keeps resolving across install AND upgrade
-# deploys. CHAOS_BASE_DEPLOY=True (recipe_meta) lets the pinned base deploy proceed despite this
-# untracked file (abra's clean-tree check would otherwise FATA).
-set -euo pipefail
-
-: "${CCCI_RECIPE:?missing CCCI_RECIPE}"
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-RECIPE_DIR="${HOME}/.abra/recipes/${CCCI_RECIPE}"
-
-if [ ! -d "$RECIPE_DIR" ]; then
-  echo " discourse install_steps: recipe dir $RECIPE_DIR missing — cannot provide health overlay" >&2
-  exit 1
-fi
-
-cp "$SCRIPT_DIR/compose.ccci-health.yml" "$RECIPE_DIR/compose.ccci-health.yml"
-echo " discourse install_steps: provided compose.ccci-health.yml (healthcheck start_period bump) to ${CCCI_RECIPE}"
diff --git a/tests/discourse/recipe_meta.py b/tests/discourse/recipe_meta.py
index f36f9fc..8503a85 100644
--- a/tests/discourse/recipe_meta.py
+++ b/tests/discourse/recipe_meta.py
@@ -1,33 +1,31 @@
 # Per-recipe harness config for discourse (Phase 2 Q4.6 — forum; postgres + redis + sidekiq).
 #
-# Discourse (bitnami/discourse) is a slow-booting Rails app: the recipe healthcheck polls
-# /srv/status with a 5-minute start_period, and a cold first boot (DB migrate + asset precompile)
-# regularly takes 8-15 min, so the deploy/HTTP timeouts are generous. /srv/status returns 200 only
-# once the app is actually serving (the canonical "is discourse up" signal — NOT "/", which may
-# redirect to setup).
+# Discourse (bitnamilegacy/discourse) is a slow-booting Rails app: the recipe healthcheck polls
+# /srv/status, and a cold first boot (DB migrate + asset precompile) regularly takes 15-25 min on
+# cc-ci's single node, so the deploy/HTTP timeouts are generous. /srv/status returns 200 only once the
+# app is actually serving (the canonical "is discourse up" signal — NOT "/", which may redirect to setup).
 HEALTH_PATH = "/srv/status"
 HEALTH_OK = (200,)
-DEPLOY_TIMEOUT = 2400  # was 1800 — slow Rails cold boot (15-25min) overran it; bumped to match TIMEOUT
+DEPLOY_TIMEOUT = 2400  # slow Rails cold boot (15-25min); matches the EXTRA_ENV TIMEOUT below
 HTTP_TIMEOUT = 1200
 
-# cc-ci deploy overlay: discourse's 15-25min Rails cold boot exceeds the recipe healthcheck's
-# start_period:5m (+8min grace), so swarm kills the still-booting app and the deploy never converges
-# (observed: 1800s timeout). compose.ccci-health.yml raises the app healthcheck start_period to 1200s
-# (failures ignored during it; a PASS still marks healthy at once) — DEPLOY/infra tuning, NO test
-# weakened. install_steps.sh provides the overlay into the checkout; COMPOSE_FILE wires it; TIMEOUT
-# 2400 lets abra's convergence wait outlast the boot. CHAOS_BASE_DEPLOY lets the pinned base deploy
-# proceed with the untracked overlay present. (Same pattern as tests/ghost/.)
-CHAOS_BASE_DEPLOY = True
+# Slow-cold-boot handling via env, NOT a cc-ci compose overlay (plan.md §9 anti-drift guardrail):
+# discourse's 15-25min Rails cold boot exceeds the recipe healthcheck's default start_period (5m) +
+# grace, so swarm would kill the still-booting app and the deploy never converges. Rather than fork
+# the recipe with a compose.*.yml overlay (which drifts from what ships), the recipe-PR
+# (recipe-maintainers/discourse#1) parameterizes the app healthcheck as
+# `start_period: ${APP_START_PERIOD:-5m}` (default unchanged for real users); cc-ci just sets a larger
+# value here. TIMEOUT (abra's internal convergence wait) is raised to outlast the boot.
 EXTRA_ENV = {
     "TIMEOUT": "2400",
-    "COMPOSE_FILE": "compose.yml:compose.ccci-health.yml",
+    "APP_START_PERIOD": "1200s",
 }
 
-# Upgrade-tier base version (Adversary §7.1): the harness default base = recipe_versions[-2], which
-# for discourse is 0.6.3+3.1.2 (discourse 3.1.2). But this PR (recipe-maintainers/discourse#1) ADDS a
-# version (0.8.0+3.3.1) ABOVE the newest published tag, so the PR's TRUE predecessor is [-1] =
-# 0.7.0+3.3.1 — which shares the head's discourse 3.3.1 image, making an HONEST 0.7.0→0.8.0 crossover
-# testable via the uniform bitnamilegacy:3.3.1 image overlay (compose.ccci-health.yml). [-2]=3.1.2
-# differs from head 3.3.1, so a uniform overlay there would be a hollow (fake-version) base. Pinning
-# the base to [-1] is the correct predecessor whenever a PR adds a version above the catalogue head.
-UPGRADE_BASE_VERSION = "0.7.0+3.3.1"
+# Upgrade tier — N/A (declared NOT-TESTABLE under cc-ci; Adversary §7.1 sign-off GRANTED, REVIEW-2
+# efe3790). Both published predecessor versions pin Docker-Hub-removed images:
+# 0.7.0+3.3.1 → bitnami/discourse:3.3.1 (404), 0.6.3+3.1.2 → bitnami/discourse:3.1.2 (404).
+# The recipe-PR re-pins the HEAD to bitnamilegacy/discourse:3.3.1 (a legit upstream fix), but per
+# plan.md §9 / plan-prefer-env-over-compose-overlay.md pt2 we declare an old base whose image is gone
+# NOT-TESTABLE rather than authoring an image-repin compose overlay to resurrect it. So no honest
+# prev→head crossover is deployable here → the upgrade tier is omitted (run STAGES without `upgrade`).
+# (P1 coverage is the maximal subset install+backup+restore+custom; P4 restore-hook is the headline.)
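
Reviewer note on the env-over-overlay mechanism: the change relies on Compose-style `${VAR:-default}` interpolation, where the default applies when the variable is unset or empty. The sketch below is only an illustration of that rule as it would apply to `start_period: ${APP_START_PERIOD:-5m}`; it is not part of the recipe or of the cc-ci harness, and `interpolate` and `HEALTHCHECK_TEMPLATE` are hypothetical names introduced here.

```python
import re
from typing import Mapping

# Matches ${NAME} and ${NAME:-default}. This covers only the forms used in this
# patch, not the full Compose interpolation grammar (an assumption of the sketch).
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def interpolate(template: str, env: Mapping[str, str]) -> str:
    """Expand ${NAME:-default}: use env[NAME] if set and non-empty, else the default."""

    def _sub(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if value:  # ":-" falls back when the variable is unset OR empty
            return value
        return default if default is not None else ""

    return _VAR.sub(_sub, template)


# Hypothetical stand-in for the parameterized healthcheck line from the recipe-PR.
HEALTHCHECK_TEMPLATE = "start_period: ${APP_START_PERIOD:-5m}"

# Values copied from EXTRA_ENV in tests/discourse/recipe_meta.py.
EXTRA_ENV = {"TIMEOUT": "2400", "APP_START_PERIOD": "1200s"}

ci_value = interpolate(HEALTHCHECK_TEMPLATE, EXTRA_ENV)  # cc-ci: env var set
user_value = interpolate(HEALTHCHECK_TEMPLATE, {})       # real users: default kept

print(ci_value)    # start_period: 1200s
print(user_value)  # start_period: 5m
```

Under this reading, deployments that never set `APP_START_PERIOD` keep the 5m default, which is why the recipe-PR is described as leaving behaviour unchanged for real users.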