feat(prevb): dynamic upgrade base (last-green→main→skip) + per-recipe previous/ overlay; migrate discourse off static base + leaky overlay

- resolve_upgrade_base: BasePlan(kind=version|ref|skip); last-green (warm canonical) primary, main-tip fallback, declared skip else. UPGRADE_BASE_VERSION retained as optional override.
- deploy_app: base_ref path (chaos-deploy a main-tip/last-green commit) + apply_previous wiring.
- lifecycle: previous/ surface (has_previous, previous_target_version, previous_status decision, provide/remove overlay, compose_file add/remove, recipe_branch_commit, stack_service_names).
- generic.perform_upgrade: strip previous/ overlay + COMPOSE_FILE entry before head redeploy.
- discourse: compose.ccci.yml now environmental-only (order: stop-first); removed bitnamilegacy pins + sidekiq + UPGRADE_BASE_VERSION; test_upgrade.py asserts head image == official 3.5.3 + no sidekiq.
- unit tests: resolve_upgrade_base matrix + previous/ apply/skip/stale + COMPOSE_FILE layering.
2026-06-17 00:14:53 +00:00
parent 1090abb97a
commit bb2e3c6b2c
8 changed files with 532 additions and 137 deletions
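The base-resolution order named in the commit message (last-green, then main tip, then skip, with UPGRADE_BASE_VERSION as an optional override) can be sketched as below. This is a minimal illustration, not the harness code: the `BasePlan` fields other than `kind`, the function parameters, and the choice to let the override win over last-green are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BasePlan:
    kind: str                    # "version" | "ref" | "skip" (kinds named in the commit message)
    value: Optional[str] = None  # hypothetical field: version tag or commit ref, None for skip
    reason: str = ""             # hypothetical field: why this base was chosen


def resolve_upgrade_base(
    override_version: Optional[str] = None,  # assumed to carry UPGRADE_BASE_VERSION, if set
    last_green_ref: Optional[str] = None,    # assumed: commit of the warm canonical, if any
    main_tip_ref: Optional[str] = None,      # assumed: tip commit of the target branch
) -> BasePlan:
    # Assumption: an explicit override takes precedence over the dynamic lookup.
    if override_version:
        return BasePlan("version", override_version, "UPGRADE_BASE_VERSION override")
    # Primary dynamic source: the last commit known to have deployed green.
    if last_green_ref:
        return BasePlan("ref", last_green_ref, "last-green")
    # Fallback: the current tip of the target branch.
    if main_tip_ref:
        return BasePlan("ref", main_tip_ref, "main tip")
    # Nothing deployable: record a declared skip instead of failing.
    return BasePlan("skip", None, "no deployable base found")
```

For discourse, which has no warm canonical, a call such as `resolve_upgrade_base(main_tip_ref="<main tip commit>")` would yield a `ref` plan in this sketch, matching the "main tip" fallback described in recipe_meta.py.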
--- a/tests/discourse/compose.ccci.yml
+++ b/tests/discourse/compose.ccci.yml
@@ -1,57 +1,29 @@
 ---
 version: "3.8"
-# cc-ci overlay (Phase 2 Q4.6) — minimal, single-purpose: make the UPGRADE-tier BASE deploy (the
-# previous published discourse version) deployable so upgrade-to-latest can run.
+# cc-ci ENVIRONMENTAL overlay (phase prevb) — applies to ALL deploys (base + PR head). Node-reality
+# tweaks ONLY; NO version-specific image pins or service add/drop. (Version-specific repairs, when a
+# previous base can't deploy as-published, live in tests/<recipe>/previous/ — base-only. discourse
+# needs none: the dynamic base = last-green / main tip = bitnamilegacy/discourse:3.5.0 deploys clean.)
 #
-# WHY THIS OVERLAY EXISTS (plan-ccci-compose-overlay-policy.md §1 "minimal justified fallback" +
-# the §1 mandate that upgrade-to-latest must ALWAYS run): the harness base-deploys the from-version
-# (UPGRADE_BASE_VERSION = 0.7.0+3.3.1), then `deploy --chaos` to the recipe-PR head. Two blockers on
-# that published base, both resolved here, NEITHER weakening any test:
-#   1. RE-PIN: every published discourse tag pins `bitnami/discourse:3.3.1` (and 0.6.3 → 3.1.2),
-#      but Docker Hub REMOVED the bitnami/discourse namespace (404). The recipe-PR (recipe-maintainers/
-#      discourse#1) re-pins app+sidekiq to `bitnamilegacy/discourse:3.3.1` (the legit upstream
-#      relocation of the identical image). This overlay applies the SAME namespace-only re-pin to the
-#      BASE 0.7.0 (identical version 3.3.1, identical image content) so the from-version pulls — exactly
-#      the policy-blessed "minimal bitnami→bitnamilegacy re-pin overlay on the 0.7.0 from-version".
-#   2. GRACE: discourse's Rails cold first boot (DB migrate + asset precompile) is 15-25min on cc-ci,
-#      exceeding the published 5m start_period → swarm kills the still-booting app. start_period CANNOT
-#      be an env var (abra validates the literal 'duration' BEFORE substitution → FATA; Adversary-
-#      reproduced, REVIEW-2 4b862f6), so we widen it to a literal 20m on the BASE. The PR head already
-#      ships 20m, so this overlay is idempotent on the head (it persists untracked across the checkout).
-# Both changes are namespace/grace-only: identical image content, a healthy check still marks healthy
-# immediately → NO assertion is weakened and no defect is masked.
+# Fusing version-specific config into this all-deploys overlay was the prevb bug: the old overlay
+# re-pinned app+sidekiq to bitnamilegacy/discourse:3.3.1 and re-added the sidekiq service on EVERY
+# deploy, so the PR head (official discourse/discourse:3.5.3, sidekiq dropped) was silently reverted
+# to the old image and its migration never tested. Removed entirely; only the environmental tweak below
+# remains.
 #
-# NOTE (prepull): the published recipe ships `sidekiq.depends_on: [discourse]` but the main service is
-# named `app` (`discourse` is undefined), so `abra app config --images` returns invalid-compose (rc=15)
-# and the harness prepull is SKIPPED. This overlay does NOT try to override depends_on — compose
-# normalizes short-form depends_on to a map and map-merge is additive, so an override can't REMOVE the
-# bad `discourse` key. Instead the 2.4GB `bitnamilegacy/discourse:3.3.1` image is kept warm in the node
-# image cache, so the inline pull during deploy is a no-op and convergence isn't pull-bound. (swarm
-# ignores depends_on, so the dangling ref has zero runtime effect — a recipe lint nit, not a defect.)
-#
-#   3. UPGRADE ROLLOUT (dstamp 2026-06-11, direct-evidence attribution in JOURNAL-dstamp): the
-#      published app service sets `deploy.update_config: { failure_action: rollback, order:
-#      start-first }`. On the upgrade chaos redeploy (base 0.7.0 → PR head), start-first runs the OLD
-#      and NEW precompile/Rails-heavy discourse tasks CO-RESIDENT (~2x memory); under host memory
-#      pressure the NEW task intermittently OOMs/fails swarm's update monitor → `failure_action:
-#      rollback` reverts the app service to its PREVIOUS spec, INCLUDING the
-#      `coop-cloud.<stack>.chaos-version` label (head → base). Because start-first keeps the OLD task
-#      serving, wait_healthy still passes, and HC1 then reads the reverted BASE commit (eb96de9+U) and
-#      misreports it as 'the re-checkout failed' — the dstamp drift, reproduced solo (runs
-#      dstamp-repro1/4) with `.Spec.chaos-version=7ae7b0f7+U` (head applied) flipping to
-#      `.PreviousSpec=eb96de94+U` after the rollback. FIX: `order: stop-first` so the NEW task boots
-#      with the full host memory (no 2x co-residency) and genuinely becomes healthy → no spurious
-#      rollback. This is a CI deploy-rollout tweak only: the upgrade still really deploys + asserts the
-#      PR-head code under test, and `failure_action: rollback` is LEFT intact, so a genuinely broken
-#      head still rolls back and is caught (lifecycle.assert_upgrade_converged) — NO test is weakened.
-#      Trade-off: brief real downtime during the CI upgrade (covered by DEPLOY_TIMEOUT 3600).
+# WHY (environmental — depends on the cc-ci node, must apply to the head too): the upgrade chaos
+# redeploy (base bitnamilegacy:3.5.0 → PR-head official 3.5.3) runs the OLD and NEW Rails-heavy
+# (DB-migrate + asset-precompile) tasks. The published recipe sets
+# app.deploy.update_config.order: start-first, which keeps both co-resident (~2x memory); under the
+# single CI node's memory pressure the NEW task intermittently OOMs or fails swarm's update monitor →
+# failure_action: rollback reverts the app spec (incl. the coop-cloud.<stack>.chaos-version label)
+# while the old task keeps serving, and the upgrade is misreported (dstamp finding, JOURNAL-dstamp).
+# FIX: order: stop-first so the NEW task boots with the full host memory (no 2x co-residency) and
+# genuinely becomes healthy. failure_action: rollback is LEFT intact → a genuinely broken head still
+# rolls back and is caught (lifecycle.assert_upgrade_converged) — NO test/assertion is weakened.
+# Trade-off: brief real downtime during the CI upgrade (covered by DEPLOY_TIMEOUT 3600).
 services:
  app:
-    image: bitnamilegacy/discourse:3.3.1
-    healthcheck:
-      start_period: 20m
    deploy:
      update_config:
        order: stop-first
-  sidekiq:
-    image: bitnamilegacy/discourse:3.3.1
--- a/tests/discourse/recipe_meta.py
+++ b/tests/discourse/recipe_meta.py
@@ -21,20 +21,18 @@ HTTP_TIMEOUT = 1200
 # still marks healthy immediately → fast hosts unaffected). Precedent: lasuite-drive collabora PR.
 # TIMEOUT (abra's internal convergence wait) is raised to outlast the boot.
 #
-# UPGRADE-tier BASE (compose.ccci.yml + UPGRADE_BASE_VERSION): upgrade-to-latest must ALWAYS run
-# (plan-ccci-compose-overlay-policy.md §1). The from-version is the latest published 0.7.0+3.3.1
-# (UPGRADE_BASE_VERSION below; the PR head is 0.7.0-based, so 0.7.0 is the true predecessor — not the
-# default [-2]=0.6.3). The published 0.7.0 has TWO blockers, both resolved by the policy-blessed
-# minimal base overlay compose.ccci.yml (see its header), neither weakening a test:
-#   (1) it pins the Docker-Hub-removed `bitnami/discourse:3.3.1` (404) → overlay re-pins app+sidekiq to
-#       `bitnamilegacy/discourse:3.3.1` (namespace-only, identical image), the same re-pin the PR makes;
-#   (2) its 5m start_period is too tight for the 15-25min Rails boot → overlay widens it to 20m (grace).
-# The harness auto-provides the overlay to the checkout and auto-chaoses the base deploy
-# (first-class compose.ccci.yml, rcust P2a); it persists across the head checkout (idempotent — the
-# PR head already re-pins + ships 20m).
-# Upgrade crossover: 0.7.0 (re-pinned base) → PR head; full assertions run on the HEAD. The 0.7.0
-# *custom* tests are not separately run (custom tier runs once, on the head — policy §1 allows skip+record).
-UPGRADE_BASE_VERSION = "0.7.0+3.3.1"
+# UPGRADE-tier BASE (phase prevb — DYNAMIC, no hardcoded UPGRADE_BASE_VERSION): the base the head
+# upgrades from is resolved at run time — last-green (warm canonical) → fallback target-branch (`main`)
+# tip → else skip (run_recipe_ci.resolve_upgrade_base). discourse has no warm canonical, so the base is
+# the `main` tip = bitnamilegacy/discourse:3.5.0, which deploys clean (bitnamilegacy exists) with NO
+# `previous/` repair needed. The PR head (recipe-maintainers/discourse#4) switches app to the official
+# `discourse/discourse:3.5.3` and drops the sidekiq service, so the upgrade tier now exercises the REAL
+# bitnamilegacy→official image migration the PR claims to support.
+#
+# compose.ccci.yml is now the ENVIRONMENTAL overlay (all deploys): only app.deploy.update_config.order:
+# stop-first (node memory reality on the upgrade crossover — see its header). The version-specific
+# bitnamilegacy re-pin + sidekiq block were REMOVED (they leaked onto the head and masked the migration
+# — the prevb bug). No assertion weakened: the head runs unmodified and full assertions run on it.
 EXTRA_ENV = {
    "TIMEOUT": "3600",  # abra's internal convergence wait; matches DEPLOY_TIMEOUT (slow Rails boot headroom)
    "COMPOSE_FILE": "compose.yml:compose.ccci.yml",
--- a/tests/discourse/test_upgrade.py
+++ b/tests/discourse/test_upgrade.py
@@ -0,0 +1,40 @@
+"""discourse — UPGRADE overlay (phase prevb): FAITHFULNESS assertion that the PR head genuinely ran.
+
+The whole point of phase prevb: the old all-deploys overlay re-pinned the head back to
+bitnamilegacy/discourse:3.3.1 and re-added the sidekiq service, so the head's official-image
+migration was never tested. With the version-specific config removed from the all-deploys overlay
+and the dynamic base (last-green/main = bitnamilegacy:3.5.0) deployed only as the *base*, the upgrade
+chaos redeploy must land the PR head UNMODIFIED. This overlay asserts exactly that, post-upgrade:
+
+  1. the running `app` service image IS the official discourse/discourse:3.5.3 — NOT bitnamilegacy;
+  2. the `sidekiq` service the PR deletes is GONE from the deployed stack.
+
+If either fails, the head did not really run (the overlay leaked onto it) → RED. Assertion-only,
+additive to the generic upgrade tier (which already proves reconverge/serving/moved + HC1 commit stamp).
+"""
+
+import os
+import sys
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
+from harness import lifecycle  # noqa: E402
+
+
+def test_head_runs_official_image_not_bitnamilegacy(live_app):
+    image = lifecycle.deployed_identity(live_app, service="app").get("image") or ""
+    assert "bitnamilegacy" not in image, (
+        f"app image is {image!r} — the bitnamilegacy base leaked onto the PR head "
+        "(the version-specific overlay was applied to the head, the prevb bug)"
+    )
+    assert image.startswith("discourse/discourse:3.5.3"), (
+        f"app image is {image!r}, expected the PR head's official discourse/discourse:3.5.3 "
+        "— the head's image migration was not exercised"
+    )
+
+
+def test_sidekiq_service_dropped_by_head(live_app):
+    services = lifecycle.stack_service_names(live_app)
+    assert "sidekiq" not in services, (
+        f"sidekiq service still present after the upgrade to the PR head: {services} — the head "
+        "(which deletes sidekiq) did not really deploy"
+    )
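The commit message also mentions COMPOSE_FILE layering: the previous/ overlay entry is added for the base deploy and stripped before the head redeploy. A rough sketch of that add/remove step on a colon-separated COMPOSE_FILE value follows. The helper names and the overlay filename `compose.previous.yml` are hypothetical, chosen only for illustration; the real lifecycle helpers may differ.

```python
def compose_file_add(value: str, entry: str) -> str:
    """Append entry to a colon-separated COMPOSE_FILE value if it is not already listed."""
    parts = [p for p in value.split(":") if p]
    if entry not in parts:
        parts.append(entry)
    return ":".join(parts)


def compose_file_remove(value: str, entry: str) -> str:
    """Drop every occurrence of entry from a colon-separated COMPOSE_FILE value."""
    return ":".join(p for p in value.split(":") if p and p != entry)


# COMPOSE_FILE as set in recipe_meta.py EXTRA_ENV.
base = "compose.yml:compose.ccci.yml"

# Base deploy: layer the (hypothetically named) previous/ overlay on top.
layered = compose_file_add(base, "compose.previous.yml")

# Head redeploy: strip it again so the PR head runs unmodified.
restored = compose_file_remove(layered, "compose.previous.yml")
```

Keeping the add idempotent and the remove exact means the environmental overlay (`compose.ccci.yml`) stays in place on every deploy while the base-only entry never reaches the head.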