cc-ci-orchestrator/cc-ci-plan/plan-phase-prevb-previous-dynamic-base.md
autonomic-bot 65ee741869 plan: queue prevb (dynamic upgrade base + previous/ config, opus) + regall (all-recipe regression, sonnet)
Operator 2026-06-16. Replaces the static UPGRADE_BASE_VERSION + leaky single
compose.ccci.yml overlay model: dynamic base = last-green(warm canonical) ->
main fallback -> skip; optional minimal per-recipe previous/ folder for
base-only version repairs (ignored for head, version-guarded, removable when
stale). Validated on discourse PR #4 (official-image switch the current overlay
masks). regall then sweeps all recipes for regressions on sonnet.
2026-06-16 23:55:28 +00:00

Phase prevb — dynamic upgrade base + per-recipe previous/ config

Mission (operator-specified 2026-06-16): fix how cc-ci handles version-specific needs in the upgrade tier. Today a single per-recipe overlay (compose.ccci.yml) conflates two different things — environmental tweaks (cc-ci node is slow/memory-tight) and version-specific repairs (an old base's image reference rotted) — and applies BOTH to EVERY deploy, including the PR head. That silently overrides the head and masks the real change. Proven live on discourse PR #4 (recipe-maintainers/discourse#4, discourse-official-image → main): the overlay re-pins app.image to bitnamilegacy/discourse:3.3.1 and re-adds the dropped sidekiq service, so !testme deploys the OLD image instead of the PR's new official discourse/discourse:3.5.3 — the migration is never tested.

Replace that model with two changes, then prove them on discourse PR #4:

  1. Dynamic upgrade base (no hardcoded UPGRADE_BASE_VERSION): the base the head upgrades from is resolved at run time as last-green (warm canonical) → fallback target-branch (main) tip → else skip the upgrade tier (recorded reason).
  2. Optional per-recipe previous/ folder holding the minimal config needed to deploy the previous (last-green) version, applied only to the base deploy and ignored for the head.

State files: STATUS-prevb.md, BACKLOG-prevb.md, REVIEW-prevb.md, JOURNAL-prevb.md. DECISIONS.md shared.

1. Root cause (read first)

The overlay tests/<recipe>/compose.ccci.yml, wired in through EXTRA_ENV.COMPOSE_FILE = "compose.yml:compose.ccci.yml", is applied to every deploy in the recipe-under-test flow. For discourse it carries BOTH:

  • environmental (start_period: 20m grace, order: stop-first) — depends on the cc-ci node, must apply to all deploys incl. head; and
  • version-specific repair (app/sidekiq → bitnamilegacy/discourse:3.3.1): depends on the old 0.7.0 base, whose published bitnami/discourse:3.3.1 image now 404s; must apply ONLY to that base.

Fusing the two and applying the result to all deploys is the bug: the version-specific half leaks onto the head (the scalar image: is overridden last-file-wins; the additive service merge re-adds the dropped sidekiq).
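
The leak can be reproduced in miniature. The sketch below is not the harness and not Compose itself: merge_compose is a hypothetical, simplified stand-in for multi-file Compose merging (mappings merge key by key, a scalar from the later file wins), and the two dictionaries are trimmed-down assumptions about what the PR head and the overlay contain.

```python
from copy import deepcopy


def merge_compose(base: dict, overlay: dict) -> dict:
    """Simplified model of layering one compose file over another.

    Mappings are merged key by key; any other value from the later
    file replaces the earlier one (last-file-wins).
    """
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_compose(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# PR head as described above: official image, sidekiq service dropped.
head = {"services": {"app": {"image": "discourse/discourse:3.5.3"}}}

# The all-deploys overlay carrying the version-specific repair.
overlay = {
    "services": {
        "app": {"image": "bitnamilegacy/discourse:3.3.1"},
        "sidekiq": {"image": "bitnamilegacy/discourse:3.3.1"},
    }
}

effective = merge_compose(head, overlay)
print(effective["services"]["app"]["image"])  # bitnamilegacy/discourse:3.3.1
print(sorted(effective["services"]))          # ['app', 'sidekiq']
```

Both symptoms show up at once: the head's image is replaced, and the service the head removed comes back.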

2. Design

Decompose the overlay into two layers — the harness applies them to different deploys:

  • Environmental overlay (all deploys, incl. head). Node-reality tweaks the recipe itself doesn't encode (e.g. rollout order). Keep it MINIMAL and shrink over time (a well-formed recipe head ships its own grace — PR #4 already has start_period: 20m). It must contain no version-specific image pins or service add/drop.
  • tests/<recipe>/previous/ (base deploy ONLY, ignored for head). The minimal bundle needed to bring up the previous (last-green) version when it can't deploy as-published — e.g. an image relocation (bitnami/* → bitnamilegacy/*), or an era-specific service/step/env. Mirror the recipe-under-test layout but scoped to "deploy the previous version" (typically just a compose.previous.yml; add an install_steps.sh/ops.py/env override only if that version genuinely needs it). Keep it as small and simple as possible — add one only where necessary.
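
One way to picture the two layers is as a function from deploy role to a COMPOSE_FILE value. This is a minimal sketch under assumptions: compose_file_value is a hypothetical helper, the environmental overlay is assumed to keep the name compose.ccci.yml, and the previous/compose.previous.yml path is illustrative. Whether previous/ applies at all is decided elsewhere (by the version guard) and passed in as a flag.

```python
def compose_file_value(role: str, has_env_overlay: bool, previous_applies: bool) -> str:
    """Build a colon-separated COMPOSE_FILE value for one deploy.

    role is "base" (the version being upgraded from) or "head" (the PR).
    The environmental overlay goes on every deploy; previous/ goes on the
    base deploy only, and never on the head.
    """
    if role not in ("base", "head"):
        raise ValueError(f"unknown deploy role: {role!r}")
    files = ["compose.yml"]
    if has_env_overlay:
        files.append("compose.ccci.yml")  # assumed file name
    if role == "base" and previous_applies:
        files.append("previous/compose.previous.yml")  # assumed path
    return ":".join(files)


print(compose_file_value("base", True, True))
# compose.yml:compose.ccci.yml:previous/compose.previous.yml
print(compose_file_value("head", True, True))
# compose.yml:compose.ccci.yml
```

Putting the role check in a single place means no combination of flags can add previous/ to the head deploy.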

Dynamic base resolution (replace static UPGRADE_BASE_VERSION):

  1. Primary: last-green (warm canonical). Upgrade from the last version cc-ci recorded green for this recipe (prefer the warm-canonical snapshot where one exists — it's already data-warm, giving a realistic data-survival signal and avoiding a from-scratch old-version deploy).
  2. Fallback: target-branch (main) tip when there is no last-green (e.g. a recipe with no recorded green predecessor yet). This is the real predecessor the PR merges on top of.
  3. Else skip the upgrade tier with a recorded reason (new recipe / no predecessor). Structural skip, declared (EXPECTED_NA), not a silent pass.
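
The three-step resolution order can be sketched as a small pure function. The names here (BaseResolution, resolve_upgrade_base, the field names, the reason strings) are hypothetical and only illustrate the ordering. How last-green and the main tip are actually looked up is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BaseResolution:
    source: str          # "last-green", "main", or "skip"
    ref: Optional[str]   # version or commit to deploy as the base; None on skip
    warm: bool           # True when a warm-canonical snapshot backs the base
    reason: str          # recorded with the run so a skip is never silent


def resolve_upgrade_base(
    last_green: Optional[str],
    main_tip: Optional[str],
    has_warm_canonical: bool = False,
) -> BaseResolution:
    """Resolve the upgrade base: last-green, else main tip, else skip."""
    if last_green:
        return BaseResolution(
            "last-green", last_green, has_warm_canonical,
            "last version recorded green for this recipe",
        )
    if main_tip:
        return BaseResolution(
            "main", main_tip, False,
            "no last-green recorded; using target-branch tip",
        )
    return BaseResolution(
        "skip", None, False,
        "EXPECTED_NA: no predecessor to upgrade from",
    )


# Hypothetical inputs, for illustration only.
print(resolve_upgrade_base("1.0.0", "abc123", has_warm_canonical=True).source)
print(resolve_upgrade_base(None, "abc123").source)
print(resolve_upgrade_base(None, None).reason)
```

Returning the reason together with the decision keeps the skip case declared rather than a silent pass.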

previous/ is for the current previous version, and is removable when stale. To stop a stale folder silently overriding a non-matching base, previous/ declares the version it targets (simplest: a one-line marker, or the coop-cloud.*.version label in its compose.previous.yml). The harness applies it only when the resolved base version matches; on mismatch it skips it and flags it stale ("previous/ targets X, base is Y — remove it"). After a recipe upgrade PR merges (new last-green), the now-stale previous/ should be removed — keep it to roughly one version's worth, never an accumulating pile.
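
The version guard reduces to a comparison plus a stale flag. In this sketch, check_previous is a hypothetical name, the version strings in the demo are made up, and refusing to apply a previous/ folder that declares no version is an assumption on the conservative side, not something the plan specifies. The stale message paraphrases the wording quoted above.

```python
from typing import Optional, Tuple


def check_previous(declared: Optional[str], base_version: str) -> Tuple[bool, Optional[str]]:
    """Decide whether previous/ applies to the resolved base.

    Returns (apply, flag). flag is None when previous/ applies cleanly,
    otherwise a message to surface in the run record.
    """
    if declared is None or not declared.strip():
        # Assumption: an undeclared previous/ is never applied.
        return False, "previous/ declares no target version; not applied"
    if declared.strip() == base_version.strip():
        return True, None
    return False, f"previous/ targets {declared.strip()}, base is {base_version.strip()}: stale, remove it"


# Hypothetical versions, for illustration only.
print(check_previous("0.7.0", "0.7.0"))  # (True, None)
print(check_previous("0.7.0", "0.8.0"))
print(check_previous(None, "0.8.0"))
```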

3. Discourse as the first real case

  • main today is bitnamilegacy/discourse:3.3.1 (deployable — bitnamilegacy exists). So with a dynamic base, the base = last-green (≈ main) deploys cleanly with NO previous/ needed: the rotted-base treadmill evaporates because we no longer resurrect the frozen 0.7.0 tag. (Confirm main's image; if the last-green base genuinely still needs a repair to deploy, add a minimal previous/ — but expect not.)
  • Move discourse's environmental tweaks (rollout order, any grace the head lacks) into the environmental overlay; delete the bitnamilegacy image pins and the sidekiq block from the all-deploys overlay; remove UPGRADE_BASE_VERSION.
  • PR #4 head now deploys UNMODIFIED → the chaos redeploy runs the real discourse/discourse:3.5.3 with no sidekiq, so the upgrade tier finally exercises the actual official-image migration (last-green bitnamilegacy → official head) the PR claims to support.

4. Gates

M1 — implemented + green locally. Harness: dynamic base resolution (last-green → main → skip); previous/ discovery + base-only application + version-guard/stale-flag; environmental overlay separated from version-specific config; UPGRADE_BASE_VERSION removed. Discourse migrated. Unit tests for the new harness surface (base resolution, previous/ match/skip, overlay layering). Discourse upgrade tier green locally with proof the head runs the real head image — assert the deployed app image is discourse/discourse:3.5.3 (NOT bitnamilegacy) and that no sidekiq service exists post-deploy. Adversary cold-verifies from a clean checkout: the overlay no longer touches the head; a deliberately-broken head still fails the upgrade tier (teeth — base resolution didn't paper over it); base falls back to main correctly when last-green is absent; previous/ is ignored for the head; no test weakened.
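
The head-fidelity proof in M1 could take the shape of a check like the one below. It is a sketch only: head_fidelity_problems is a hypothetical helper, and it assumes the harness can already produce a mapping of service name to deployed image after the head deploy. How that mapping is obtained is out of scope here.

```python
from typing import Dict, Iterable, List


def head_fidelity_problems(
    deployed: Dict[str, str],
    expected_app_image: str,
    forbidden_services: Iterable[str] = ("sidekiq",),
) -> List[str]:
    """Return the reasons the deployed head differs from the PR head.

    deployed maps service name to the image it runs after the head deploy.
    An empty list means the head ran unmodified as far as this check can see.
    """
    problems = []
    app_image = deployed.get("app")
    if app_image != expected_app_image:
        problems.append(f"app runs {app_image!r}, expected {expected_app_image!r}")
    for name in forbidden_services:
        if name in deployed:
            problems.append(f"service {name!r} exists but the head dropped it")
    return problems


# Faithful head: official image, no sidekiq.
print(head_fidelity_problems(
    {"app": "discourse/discourse:3.5.3"}, "discourse/discourse:3.5.3"))  # []

# Leaky overlay: old image pinned and sidekiq re-added, so two problems.
print(len(head_fidelity_problems(
    {"app": "bitnamilegacy/discourse:3.3.1",
     "sidekiq": "bitnamilegacy/discourse:3.3.1"},
    "discourse/discourse:3.5.3")))  # 2
```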

M2 — proven in real CI + a representative spot-check. discourse PR #4 !testme GREEN, with evidence the head genuinely ran discourse/discourse:3.5.3 (not the old bitnami image) and the migration was exercised. Spot-check ≥3 other recipes with upgrade tiers (e.g. one warm-canonical recipe, one with a published predecessor, one that previously relied on a .ccci overlay — keycloak/cryptpad/ghost) to confirm dynamic base works generally and nothing obvious broke. (FULL all-recipe regression is the next phase regall — do not attempt it here; just don't ship something obviously broken.) Levels / records reconciled. Fresh Adversary PASS on both milestones → ## DONE.

5. Guardrails (binding)

  • Make the test FAITHFUL, never weaker. The goal is that the head runs the head's real image; never resolve the base or apply previous/ in a way that hides a genuinely broken head. A broken upgrade must still go RED.
  • previous/ minimal + non-accumulating. Only what's strictly needed to deploy the previous base; version-guarded; removable when stale. No previous/ at all if the last-green base deploys clean.
  • Don't regress other recipes. Dynamic base must work for recipes with/without warm canonicals and with/without published predecessors. (The regall phase is the exhaustive proof; here, don't break the spot-check set.)
  • Recipe mirrors are PR-only. We VERIFY discourse PR #4 (run the harness / post !testme); we do NOT merge it (operator's call). A recipe defect found → PR comment, not a test weakened.
  • Commit author autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>; push every commit; abra over a pseudo-TTY. Host changes are coordinated (loops self-rebuilding the host is acceptable if clean — verify host health after; but this phase likely needs none).

6. Definition of Done

Dynamic upgrade-base resolution (last-green → main → skip) and the optional minimal previous/ folder shipped and unit-tested; the environmental vs version-specific layers cleanly separated; discourse migrated off the static base + leaky overlay; discourse PR #4 verified GREEN in real CI with the head genuinely running the official discourse/discourse:3.5.3 image (the migration actually tested), and a representative recipe spot-check still green; nothing merged on the mirror; M1 + M2 fresh Adversary PASSes in REVIEW-prevb.md. (Exhaustive all-recipe regression handed to phase regall.)