- resolve_upgrade_base: BasePlan(kind=version|ref|skip); last-green (warm canonical) primary, main-tip fallback, declared skip else. UPGRADE_BASE_VERSION retained as optional override.
- deploy_app: base_ref path (chaos-deploy a main-tip/last-green commit) + apply_previous wiring.
- lifecycle: previous/ surface (has_previous, previous_target_version, previous_status decision, provide/remove overlay, compose_file add/remove, recipe_branch_commit, stack_service_names).
- generic.perform_upgrade: strip previous/ overlay + COMPOSE_FILE entry before head redeploy.
- discourse: compose.ccci.yml now environmental-only (order: stop-first); removed bitnamilegacy pins + sidekiq + UPGRADE_BASE_VERSION; test_upgrade.py asserts head image == official 3.5.3 + no sidekiq.
- unit tests: resolve_upgrade_base matrix + previous/ apply/skip/stale + COMPOSE_FILE layering.
73 lines
4.7 KiB
Python
# Per-recipe harness config for discourse (Phase 2 Q4.6: forum; postgres + redis + sidekiq).
#
# Discourse (bitnamilegacy/discourse) is a slow-booting Rails app: the recipe healthcheck polls
# /srv/status, and a cold first boot (DB migrate + asset precompile) regularly takes 15-25 min on
# cc-ci's single node, so the deploy/HTTP timeouts are generous. /srv/status returns 200 only once the
# app is actually serving (the canonical "is discourse up" signal, NOT "/", which may redirect to setup).
HEALTH_PATH = "/srv/status"
HEALTH_OK = (200,)
# Slow Rails cold boot (15-25min) on the 7-GiB single node; bumped 2400→3600 for headroom after
# full4's base deploy timed out at 2400s (RAM/CPU-constrained boot + image re-pull).
DEPLOY_TIMEOUT = 3600
HTTP_TIMEOUT = 1200

# Slow-cold-boot handling: the recipe-PR (recipe-maintainers/discourse#1) bumps the app healthcheck
# `start_period` to a LITERAL 20m for the HEAD. discourse's 15-25min Rails cold boot (DB migrate +
# asset precompile) exceeds the published 5m start_period → swarm would kill the still-booting app.
# start_period CANNOT be an env var (abra validates the literal compose 'duration' BEFORE substitution
# → `FATA ...Does not match format 'duration'`; Adversary-reproduced, REVIEW-2 4b862f6), so a literal
# recipe-PR bump is the only §9-compliant way to widen it. start_period is grace-only (a healthy check
# still marks healthy immediately → fast hosts unaffected). Precedent: lasuite-drive collabora PR.
# TIMEOUT (abra's internal convergence wait) is raised to outlast the boot.
#
# UPGRADE-tier BASE (phase prevb: DYNAMIC, no hardcoded UPGRADE_BASE_VERSION): the base the head
# upgrades from is resolved at run time: last-green (warm canonical) → fallback target-branch (`main`)
# tip → else skip (run_recipe_ci.resolve_upgrade_base). discourse has no warm canonical, so the base is
# the `main` tip = bitnamilegacy/discourse:3.5.0, which deploys clean (bitnamilegacy exists) with NO
# `previous/` repair needed. The PR head (recipe-maintainers/discourse#4) switches app to the official
# `discourse/discourse:3.5.3` and drops the sidekiq service, so the upgrade tier now exercises the REAL
# bitnamilegacy→official image migration the PR claims to support.
#
# compose.ccci.yml is now the ENVIRONMENTAL overlay (all deploys): only app.deploy.update_config.order:
# stop-first (node memory reality on the upgrade crossover; see its header). The version-specific
# bitnamilegacy re-pin + sidekiq block were REMOVED (they leaked onto the head and masked the
# migration: the prevb bug). No assertion weakened: the head runs unmodified and full assertions run on it.
EXTRA_ENV = {
    "TIMEOUT": "3600",  # abra's internal convergence wait; matches DEPLOY_TIMEOUT (slow Rails boot headroom)
    "COMPOSE_FILE": "compose.yml:compose.ccci.yml",
}


def BACKUP_VERIFY(ctx):
    """Post-backup integrity check (Q4.6, same race ghost F2-14b hit). The recipe's backupbot db
    pre-hook (`/pg_backup.sh backup`) dumps the discourse postgres DB to
    `/var/lib/postgresql/data/backup.sql` (gzip), then restic captures that path. On the loaded
    single CI node the db container is cycled by the immediately-preceding UPGRADE tier (chaos
    redeploy), and at backup time the pre-hook's pg_dump can race that cycle: the dump is
    truncated/never written, restic snapshots an empty/absent path, and a later restore reimports
    nothing → the seeded ci_marker is lost (P4 RED; observed full1/full2 WITH upgrade, vs full3
    WITHOUT upgrade green). Proven first-hand: the pre-hook itself succeeds on a stable db (manual
    exec → valid 922KB dump), so the failure is the cycle race, not the script. This probe proves
    the dump completed: backup.sql exists, is a VALID gzip, non-empty. False → the harness re-runs
    the WHOLE backup with a re-stabilised db (run_recipe_ci _perform_op, caps at 3 then proceeds;
    a persistent failure still surfaces RED at restore, so it weakens no assertion; it only
    retries a flaky CAPTURE). READ-ONLY."""
    # recipe_meta.py is exec()'d into a bare namespace (no __file__); runner/ is already on sys.path
    # and `harness` importable, so import directly (ghost F2-14b shipped broken by computing a path here).
    from harness import lifecycle

    try:
        out = lifecycle.exec_in_app(
            ctx.domain,
            [
                "sh",
                "-c",
                "gzip -t /var/lib/postgresql/data/backup.sql && wc -c < /var/lib/postgresql/data/backup.sql",
            ],
            service="db",
            timeout=60,
        ).strip()
    except Exception:  # noqa: BLE001 - exec fails if the db is mid-cycle: treat as not-yet-captured
        return False
    return out.isdigit() and int(out) > 0
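
Note on BACKUP_VERIFY: the probe above runs `gzip -t` plus a byte count inside the db container, and the docstring says the harness re-runs the whole backup up to a cap of 3 when the probe returns False. The sketch below is a local, stdlib-only analogue of both ideas, not the harness code. `dump_looks_complete` and `backup_with_verify` are hypothetical names, and reading "caps at 3" as three total attempts is an assumption.

```python
import gzip
import zlib


def dump_looks_complete(raw: bytes) -> bool:
    """Local analogue of `gzip -t FILE && wc -c < FILE`: non-empty and a valid gzip stream."""
    if not raw:
        return False  # empty or absent dump
    try:
        gzip.decompress(raw)  # raises on bad magic, corrupt data, or a truncated stream
    except (OSError, EOFError, zlib.error):
        return False
    return True


def backup_with_verify(run_backup, verify, max_attempts=3):
    """Hypothetical retry wrapper: re-run the whole backup until verify() passes or the cap is hit.

    Returns (verified, attempts). On (False, max_attempts) the caller proceeds anyway, so a
    persistent failure still surfaces later at restore time.
    """
    for attempt in range(1, max_attempts + 1):
        run_backup()
        if verify():
            return True, attempt
    return False, max_attempts
```

A truncated dump (for example `gzip.compress(data)[:-5]`) fails the check even though the file is non-empty, which is the case the in-container `gzip -t` is there to catch.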
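
Note on the UPGRADE-tier base: the config comment and commit message describe a fallback chain (last-green warm canonical, then `main` tip, then a declared skip, with UPGRADE_BASE_VERSION as an optional override) that yields a BasePlan of kind version, ref, or skip. The sketch below illustrates that chain only. It is not `run_recipe_ci.resolve_upgrade_base`: `BasePlanSketch` and `sketch_resolve_base` are hypothetical names, and both the override taking precedence and last-green mapping to kind `ref` are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BasePlanSketch:
    """Hypothetical stand-in for the BasePlan described in the commit message."""
    kind: str                    # "version", "ref", or "skip"
    value: Optional[str] = None  # version string or commit ref; None when skipping
    reason: str = ""


def sketch_resolve_base(
    override_version: Optional[str],
    last_green_ref: Optional[str],
    main_tip_ref: Optional[str],
) -> BasePlanSketch:
    # Assumption: an explicit UPGRADE_BASE_VERSION override wins over dynamic resolution.
    if override_version:
        return BasePlanSketch("version", override_version, "UPGRADE_BASE_VERSION override")
    # Primary: the last-green (warm canonical) commit, when one exists.
    if last_green_ref:
        return BasePlanSketch("ref", last_green_ref, "last-green (warm canonical)")
    # Fallback: the tip of the target branch (`main`).
    if main_tip_ref:
        return BasePlanSketch("ref", main_tip_ref, "main-tip fallback")
    # Nothing to upgrade from: declare the skip rather than fail.
    return BasePlanSketch("skip", None, "no base available")
```

For discourse as described above (no warm canonical, no override), this chain would land on the `main`-tip branch.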