Operator 2026-06-16. Replaces the static UPGRADE_BASE_VERSION + leaky single compose.ccci.yml overlay model: dynamic base = last-green(warm canonical) -> main fallback -> skip; optional minimal per-recipe previous/ folder for base-only version repairs (ignored for head, version-guarded, removable when stale). Validated on discourse PR #4 (official-image switch the current overlay masks). regall then sweeps all recipes for regressions on sonnet.
Phase prevb — dynamic upgrade base + per-recipe previous/ config
Mission (operator-specified 2026-06-16): fix how cc-ci handles version-specific needs in the
upgrade tier. Today a single per-recipe overlay (compose.ccci.yml) conflates two different things —
environmental tweaks (cc-ci node is slow/memory-tight) and version-specific repairs (an old base's
image reference rotted) — and applies BOTH to EVERY deploy, including the PR head. That silently
overrides the head and masks the real change. Proven live on discourse PR #4
(recipe-maintainers/discourse#4, discourse-official-image → main): the overlay re-pins app.image
to bitnamilegacy/discourse:3.3.1 and re-adds the dropped sidekiq service, so !testme deploys the
OLD image instead of the PR's new official discourse/discourse:3.5.3 — the migration is never tested.
Replace that model with two changes, then prove them on discourse PR #4:
- Dynamic upgrade base (no hardcoded UPGRADE_BASE_VERSION): the base the head upgrades from is
  resolved at run time as last-green (warm canonical) → fallback target-branch (main) tip → else skip
  the upgrade tier (recorded reason).
- Optional per-recipe previous/ folder holding the minimal config needed to deploy the previous
  (last-green) version, applied only to the base deploy and ignored for the head.
State files: STATUS-prevb.md, BACKLOG-prevb.md, REVIEW-prevb.md, JOURNAL-prevb.md. DECISIONS.md shared.
1. Root cause (read first)
tests/<recipe>/compose.ccci.yml + EXTRA_ENV.COMPOSE_FILE = "compose.yml:compose.ccci.yml" is applied
to every deploy in the recipe-under-test flow. For discourse it carries BOTH:
- environmental (start_period: 20m grace, order: stop-first): depends on the cc-ci node, must apply to
  all deploys incl. head; and
- version-specific repair (app/sidekiq → bitnamilegacy/discourse:3.3.1): depends on the old 0.7.0 base
  whose published bitnami/discourse:3.3.1 404s; must apply ONLY to that base.
Fusing them + applying to all deploys is the bug: the version-specific half leaks onto the head (scalar
image: last-file-wins override; additive service merge re-adds dropped sidekiq).
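The leak can be reproduced in miniature. The sketch below is not the real compose merge (which has more rules, e.g. for lists); it is a simplified stand-in that assumes only the two behaviours named above: mappings merge key by key, and a scalar from the later file replaces the earlier one. `merge_compose` and the two dictionaries are hypothetical illustrations built from the images mentioned in this document.

```python
from copy import deepcopy


def merge_compose(base: dict, overlay: dict) -> dict:
    """Simplified overlay merge: mappings merge by key, later scalars win."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_compose(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# PR #4 head as described: official image, sidekiq service dropped.
head = {"services": {"app": {"image": "discourse/discourse:3.5.3"}}}

# Version-specific half of the all-deploys overlay (illustrative content).
overlay = {
    "services": {
        "app": {"image": "bitnamilegacy/discourse:3.3.1"},
        "sidekiq": {"image": "bitnamilegacy/discourse:3.3.1"},
    }
}

deployed = merge_compose(head, overlay)
print(deployed["services"]["app"]["image"])  # the overlay's pin, not the head's image
print(sorted(deployed["services"]))          # sidekiq is back
```

Under these assumptions the head's own `image:` never reaches the deploy, and the service the PR removed reappears, which is the masking described in this section.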
2. Design
Decompose the overlay into two layers — the harness applies them to different deploys:
- Environmental overlay (all deploys, incl. head). Node-reality tweaks the recipe itself doesn't
  encode (e.g. rollout order). Keep it MINIMAL and shrink over time (a well-formed recipe head ships
  its own grace; PR #4 already has start_period: 20m). It must contain no version-specific image pins
  or service add/drop.
- tests/<recipe>/previous/ (base deploy ONLY, ignored for head). The minimal bundle needed to bring up
  the previous (last-green) version when it can't deploy as-published, e.g. an image relocation
  (bitnami/* → bitnamilegacy/*), or an era-specific service/step/env. Mirror the recipe-under-test
  layout but scoped to "deploy the previous version" (typically just a compose.previous.yml; add an
  install_steps.sh / ops.py / env override only if that version genuinely needs it). Keep it as small
  and simple as possible; add one only where necessary.
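One way to picture the two layers is as a per-deploy COMPOSE_FILE value. This is a rough sketch, not the harness code: the function name, the `role` argument, and the choice to keep `compose.ccci.yml` as the environmental overlay's filename are all assumptions made for illustration. `available` stands in for whatever file discovery the harness does.

```python
from typing import List, Set

BASE_COMPOSE = "compose.yml"
ENV_OVERLAY = "compose.ccci.yml"                    # assumption: name kept for the environmental overlay
PREVIOUS_COMPOSE = "previous/compose.previous.yml"  # assumption: path relative to tests/<recipe>/


def compose_file_value(role: str, available: Set[str], previous_matches: bool = False) -> str:
    """Build a COMPOSE_FILE-style value for one deploy of the upgrade tier."""
    if role not in ("base", "head"):
        raise ValueError(f"unknown deploy role: {role!r}")
    files: List[str] = [BASE_COMPOSE]
    # Environmental overlay: applied to every deploy, head included.
    if ENV_OVERLAY in available:
        files.append(ENV_OVERLAY)
    # previous/: base deploy only, and only when its version guard matched.
    if role == "base" and previous_matches and PREVIOUS_COMPOSE in available:
        files.append(PREVIOUS_COMPOSE)
    return ":".join(files)


found = {ENV_OVERLAY, PREVIOUS_COMPOSE}
print(compose_file_value("base", found, previous_matches=True))
print(compose_file_value("head", found, previous_matches=True))
```

The point of the sketch is the asymmetry: no combination of inputs lets `previous/` reach the head deploy.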
Dynamic base resolution (replace static UPGRADE_BASE_VERSION):
- Primary: last-green (warm canonical). Upgrade from the last version cc-ci recorded green for this recipe (prefer the warm-canonical snapshot where one exists — it's already data-warm, giving a realistic data-survival signal and avoiding a from-scratch old-version deploy).
- Fallback: target-branch (main) tip when there is no last-green (e.g. a recipe with no recorded green
  predecessor yet). This is the real predecessor the PR merges on top of.
- Else skip the upgrade tier with a recorded reason (new recipe / no predecessor). Structural skip,
  declared (EXPECTED_NA), not a silent pass.
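The resolution order above can be sketched as a small pure function. Everything here is hypothetical naming (`BaseResolution`, `resolve_upgrade_base`, the argument names); how the harness actually looks up last-green, the warm-canonical snapshot, or the main tip is not specified in this document and is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BaseResolution:
    source: str          # "last-green", "main", or "skip"
    ref: Optional[str]   # version or commit to deploy as the base; None when skipped
    warm: bool = False   # True when a warm-canonical snapshot backs the last-green base
    reason: str = ""     # recorded whenever the primary path was not taken


def resolve_upgrade_base(
    last_green: Optional[str],
    has_warm_canonical: bool,
    main_tip: Optional[str],
) -> BaseResolution:
    """Resolve the upgrade base: last-green, then main tip, then a declared skip."""
    if last_green:
        return BaseResolution("last-green", last_green, warm=has_warm_canonical)
    if main_tip:
        return BaseResolution("main", main_tip, reason="no last-green recorded")
    # Structural skip: declared, never a silent pass.
    return BaseResolution("skip", None, reason="EXPECTED_NA: new recipe / no predecessor")
```

Returning the reason alongside the result keeps the skip visible in whatever record the run produces, rather than leaving it to be inferred from a missing tier.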
previous/ is for the current previous version, and is removable when stale. To stop a stale folder
silently overriding a non-matching base, previous/ declares the version it targets (simplest: a
one-line marker, or the coop-cloud.*.version label in its compose.previous.yml). The harness applies
it only when the resolved base version matches; on mismatch it skips it and flags it stale
("previous/ targets X, base is Y — remove it"). After a recipe upgrade PR merges (new last-green), the
now-stale previous/ should be removed — keep it to roughly one version's worth, never an accumulating pile.
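A minimal sketch of that version guard, assuming the targeted version has already been read from the marker or label (how it is read is left open here). `check_previous` is a hypothetical name, and the flag wording loosely follows the message quoted above.

```python
from typing import Optional, Tuple


def check_previous(previous_target: Optional[str], base_version: str) -> Tuple[bool, Optional[str]]:
    """Return (apply, stale_flag) for a recipe's previous/ folder.

    previous_target is the version previous/ declares, or None when the
    recipe has no previous/ folder at all.
    """
    if previous_target is None:
        return (False, None)
    if previous_target == base_version:
        return (True, None)
    # Mismatch: do not apply, and surface the stale folder instead of hiding it.
    return (False, f"previous/ targets {previous_target}, base is {base_version}; remove it")


print(check_previous("0.7.0", "0.7.0"))
print(check_previous("0.7.0", "0.8.0"))
```

An exact string comparison is the simplest choice for a sketch; a real guard might need to normalise version strings first.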
3. Discourse as the first real case
- main today is bitnamilegacy/discourse:3.3.1 (deployable: bitnamilegacy exists). So with a dynamic
  base, the base = last-green (≈ main) deploys cleanly with NO previous/ needed: the rotted-base
  treadmill evaporates because we no longer resurrect the frozen 0.7.0 tag. (Confirm main's image; if
  the last-green base genuinely still needs a repair to deploy, add a minimal previous/, but expect
  not.)
- Move discourse's environmental tweaks (rollout order, any grace the head lacks) into the
  environmental overlay; delete the bitnamilegacy image pins and the sidekiq block from the
  all-deploys overlay; remove UPGRADE_BASE_VERSION.
- PR #4 head now deploys UNMODIFIED → the chaos redeploy runs the real discourse/discourse:3.5.3 with
  no sidekiq, so the upgrade tier finally exercises the actual official-image migration (last-green
  bitnamilegacy → official head) the PR claims to support.
4. Gates
M1 — implemented + green locally. Harness: dynamic base resolution (last-green → main → skip);
previous/ discovery + base-only application + version-guard/stale-flag; environmental overlay separated
from version-specific config; UPGRADE_BASE_VERSION removed. Discourse migrated. Unit tests for the new
harness surface (base resolution, previous/ match/skip, overlay layering). Discourse upgrade tier green
locally with proof the head runs the real head image — assert the deployed app image is
discourse/discourse:3.5.3 (NOT bitnamilegacy) and that no sidekiq service exists post-deploy. Adversary
cold-verifies from a clean checkout: the overlay no longer touches the head; a deliberately-broken head
still fails the upgrade tier (teeth — base resolution didn't paper over it); base falls back to main
correctly when last-green is absent; previous/ is ignored for the head; no test weakened.
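The head-image proof in M1 could be expressed as a check over the services observed after the head deploy. This is only a sketch: it assumes the harness can produce a mapping of compose service name to deployed image (how that mapping is collected is not covered here), and `head_faithfulness_problems` is a hypothetical helper name.

```python
from typing import Dict, List

EXPECTED_HEAD_IMAGE = "discourse/discourse:3.5.3"


def head_faithfulness_problems(services: Dict[str, str]) -> List[str]:
    """List reasons the deployed head does not look like the PR's real head."""
    problems: List[str] = []
    app_image = services.get("app")
    if app_image != EXPECTED_HEAD_IMAGE:
        problems.append(f"app image is {app_image!r}, expected {EXPECTED_HEAD_IMAGE!r}")
    if "sidekiq" in services:
        problems.append("sidekiq service exists post-deploy")
    # Any surviving bitnamilegacy image means the overlay still touches the head.
    for name, image in sorted(services.items()):
        if "bitnamilegacy" in image:
            problems.append(f"service {name!r} still runs a bitnamilegacy image")
    return problems
```

An empty list means the assertion holds; a non-empty one gives the failure text to record, so a masked head goes RED with a reason rather than passing quietly.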
M2 — proven in real CI + a representative spot-check. discourse PR #4 !testme GREEN, with
evidence the head genuinely ran discourse/discourse:3.5.3 (not the old bitnami image) and the migration
was exercised. Spot-check ≥3 other recipes with upgrade tiers (e.g. one warm-canonical recipe, one with a
published predecessor, one that previously relied on a .ccci overlay — keycloak/cryptpad/ghost) to
confirm dynamic base works generally and nothing obvious broke. (FULL all-recipe regression is the
next phase regall — do not attempt it here; just don't ship something obviously broken.) Levels /
records reconciled. Fresh Adversary PASS on both milestones → ## DONE.
5. Guardrails (binding)
- Make the test FAITHFUL, never weaker. The goal is that the head runs the head's real image; never
  resolve the base or apply previous/ in a way that hides a genuinely broken head. A broken upgrade
  must still go RED.
- previous/ minimal + non-accumulating. Only what's strictly needed to deploy the previous base;
  version-guarded; removable when stale. No previous/ at all if the last-green base deploys clean.
- Don't regress other recipes. Dynamic base must work for recipes with/without warm canonicals and
  with/without published predecessors. (The regall phase is the exhaustive proof; here, don't break
  the spot-check set.)
- Recipe mirrors are PR-only. We VERIFY discourse PR #4 (run the harness / post !testme); we do NOT
  merge it (operator's call). A recipe defect found → PR comment, not a test weakened.
- Commit author autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>; push every commit; abra over
  a pseudo-TTY. Host changes are coordinated (loops self-rebuilding the host is acceptable if clean,
  verify host health after; but this phase likely needs none).
6. Definition of Done
Dynamic upgrade-base resolution (last-green → main → skip) and the optional minimal previous/ folder
shipped and unit-tested; the environmental vs version-specific layers cleanly separated; discourse
migrated off the static base + leaky overlay; discourse PR #4 verified GREEN in real CI with the head
genuinely running the official discourse/discourse:3.5.3 image (the migration actually tested), and a
representative recipe spot-check still green; nothing merged on the mirror; M1 + M2 fresh Adversary PASSes
in REVIEW-prevb.md. (Exhaustive all-recipe regression handed to phase regall.)