diff --git a/cc-ci-plan/plan-ccci-compose-overlay-policy.md b/cc-ci-plan/plan-ccci-compose-overlay-policy.md
new file mode 100644
index 0000000..0bfb264
--- /dev/null
+++ b/cc-ci-plan/plan-ccci-compose-overlay-policy.md
@@ -0,0 +1,73 @@
+# Policy + cleanup — cc-ci compose overlays (when they're justified) & upgrade-tier from-version
+
+**Status:** POLICY (codifies `plan.md §9`) + a small set of follow-ups. **Owner:** Builder + Adversary.
+**This file:** `/srv/cc-ci/cc-ci-plan/plan-ccci-compose-overlay-policy.md`
+**Supersedes** the earlier `plan-prefer-env-over-compose-overlay.md` (its premise — parameterize
+`start_period` via an env var — is **wrong: abra does not support an env value for `start_period`**).
+
+---
+
+## 0. Policy (operator, 2026-05-30)
+
+A cc-ci-authored compose overlay (`compose.ccci-*.yml` layered via `COMPOSE_FILE`) risks **drift** from
+the recipe users run — so **avoid where possible and justify each use**. But it is a **legitimate,
+uniform fallback pattern**, not forbidden:
+
+- **Prefer an upstream recipe PR** in most cases — a real robustness fix, or exposing a knob the recipe
+  should expose. That's where a fix usually belongs.
+- **A ccci overlay is the right tool when the value can't be supplied any other way** — notably a
+  healthcheck **`start_period`**, which **abra cannot take from an env var**. The ghost/discourse
+  `start_period` bumps therefore **stay as overlays** (an env PR is impossible for that field).
+- **Uniform pattern (acceptable fallback):** an optional `compose.ccci-*.yml` per recipe,
+  provided into the checkout by `install_steps.sh`, wired by `recipe_meta` `COMPOSE_FILE`, kept as an
+  untracked file so it survives the upgrade `git checkout -f` (`CHAOS_BASE_DEPLOY=True`; `assert_upgraded`
+  strips the `+U` marker — see DECISIONS 2026-05-30).
+- **Each overlay must:** be **minimal + single-purpose**, **document WHY** in its header (the exact
+  abra/upstream limitation that forces it), and be **Adversary-confirmed** to not weaken a test or mask
+  a recipe defect. Where the fix also belongs upstream (e.g. a `start_period` too tight for any slow
+  host), **file the upstream PR too** — the overlay is the cc-ci-side fallback, not a reason to skip it.
+
+## 1. Upgrade tier: always test the upgrade to LATEST
+
+Don't drop the upgrade test because the *from* (older) version is awkward.
+- **Always perform the upgrade to the latest version and run the full assertions on the latest.**
+- If the older from-version can't be fully deployed/tested (image tag removed from the registry, or it
+  predates an overlay/feature), you do **NOT** need that older version's **custom tests** to run.
+  Deploy it minimally (a justified overlay is fine) or upgrade from the nearest deployable prior; skip
+  only the from-version's custom assertions, and **record** that.
+- Skipping a from-version's custom tests = honest, recorded. Skipping upgrade-to-latest = not OK.
+
+## 2. Disposition of the current overlays
+
+- [ ] **ghost `compose.ccci-health.yml` (start_period 900s) — KEEP, justified.** abra can't env-param
+  `start_period`; the fresh-DB migration needs the larger grace or swarm kills it → deadlock.
+  Confirm the header documents this; consider an upstream PR raising ghost's `start_period` (it's a
+  real slow-host fragility) — but the overlay stays regardless.
+- [ ] **discourse `compose.ccci-health.yml` — KEEP, justified (both parts).** (a) `start_period 1200s`
+  (same reason as ghost). (b) The `bitnami/discourse:3.3.1 → bitnamilegacy/discourse:3.3.1` re-pin
+  makes the from-version (0.7.0, whose `bitnami/discourse` tag Docker Hub now 404s) **deployable so
+  the upgrade-to-latest test can run** — namespace-only, identical discourse version, applied to
+  base+head. This is the §1 case: keep the upgrade-to-latest test; the 0.7.0 custom tests need not
+  run. Document it; if a deployable prior without the re-pin exists, prefer upgrading from that.
+- [ ] **mumble `compose.host-ports.yml` (cc-ci copy for the old base) — DROP it.** Deploying mumble
+  0.2.0 does NOT need host-ports (that overlay only *publishes* 64738 for on-host tests). Per §1:
+  deploy 0.2.0 without it, **skip 0.2.0's voice/on-host custom tests**, then upgrade to the latest
+  version (which ships `compose.host-ports.yml` natively) and run the voice tests on the latest.
+  Remove the cc-ci copy + its `install_steps`/`COMPOSE_FILE` wiring for the old base; the current
+  version's native overlay is untouched.
+
+## 3. Definition of Done (Adversary cold-verifies)
+- [ ] Every surviving cc-ci overlay is minimal, header-documents its justification (the abra/upstream
+  limitation), and is Adversary-confirmed to not weaken a test or mask a defect.
+- [ ] The mumble old-base cc-ci host-ports copy is removed; mumble still **upgrades to latest** and runs
+  its voice tests **on the latest** (0.2.0's voice tests skipped + recorded).
+- [ ] ghost + discourse still pass full suites; discourse still tests the upgrade to latest.
+- [ ] Any upstream PR opened (e.g. ghost/discourse `start_period`) follows the recipe-PR rule
+  (cc-ci-green via `!testme` before operator merge); the overlay remains as the cc-ci fallback.
+- [ ] No upgrade-to-latest test was dropped to avoid an awkward from-version.
+
+## 4. Guardrails
+- **Correctness first** — never weaken/skip/soften a check to make a deploy or upgrade pass; an
+  overlay tunes deploy/infra only (its header must say how), the real assertions stand.
+- **Justify + document every overlay**; prefer the upstream PR where the fix belongs.
+- **Real abra path** throughout.
diff --git a/cc-ci-plan/plan.md b/cc-ci-plan/plan.md
index 54a0e23..8f729ba 100644
--- a/cc-ci-plan/plan.md
+++ b/cc-ci-plan/plan.md
@@ -830,20 +830,25 @@ Each default stands until the Adversary or reality forces a change; record the c
   a real app-level check — that **RAISES on actual non-readiness**, never a no-op that masks a failed
   deploy. **Prove it has teeth** (a negative test that fails on stuck convergence, e.g. F2-12's
   P7-negative). The Adversary treats a custom probe as a potential test-weakening until cold-verified.
-- **Don't fork the recipe's compose — parameterize upstream, tune via env.** A cc-ci-authored compose
-  file/overlay (an extra `compose.*.yml` layered via `COMPOSE_FILE`) is **avoided wherever possible**:
-  it risks **silent drift** from the recipe actually shipped, so you'd no longer be testing what users
-  get. When a recipe needs a value tuned for cc-ci's environment (e.g. a longer healthcheck
-  `start_period` for the slower single node), the **preferred fix is an upstream recipe PR** that
-  exposes it as an **env var** (e.g. `APP_START_PERIOD`) with the **current value as the default in
-  `env.sample`** — then CI just sets that env in the app `.env`, no new compose. The env knob also
-  helps real operators on slow hosts. **Old-version testability:** if making the **upgrade tier** work
-  from an older base version would need a custom compose (a since-removed image tag, or an overlay the
-  old version predates), **prefer DECLARING that older version not-testable under this CI env** (note
-  it + skip that crossover) over authoring a custom compose for it. A cc-ci compose overlay is a
-  **last resort** only when neither path is possible — Adversary-confirmed non-drifting and paired with
-  the upstream-env PR that will obsolete it. (The existing ghost/discourse `compose.ccci-health.yml`
-  start_period overlays + discourse's image re-pin are exactly this debt — migrate per
-  `plan-prefer-env-over-compose-overlay.md`.)
+- **Custom cc-ci compose overlays — avoid where possible, justify each, prefer upstream.** A
+  cc-ci-authored compose overlay (an extra `compose.*.yml` layered via `COMPOSE_FILE`) risks **drift**
+  from the recipe users actually run, so **avoid it where possible and justify each use**. In most
+  cases the cleaner fix is an **upstream recipe PR** — either a genuine robustness fix, or exposing a
+  knob the recipe should expose. **But a uniform, optional `compose.ccci-*.yml` overlay file per
+  recipe is an acceptable fallback** — especially for a value abra/compose can't take from an env var.
+  **Known limitation (builder, 2026-05-30): abra does NOT support an env value for a healthcheck
+  `start_period`.** So the ghost/discourse `start_period` bumps legitimately **need** the overlay (an
+  env-var PR is not possible for that field) — these overlays **stay**, justified. When you do use an
+  overlay: keep it **minimal + single-purpose**, **document WHY in the file header** (the exact abra/
+  upstream limitation that forces it), have the **Adversary confirm it doesn't weaken a test or mask a
+  recipe defect**, and **file the upstream PR where the fix genuinely belongs** (e.g. if a recipe's
+  `start_period` is too tight for any slow host, propose raising it upstream too).
+- **Upgrade tier: always test the upgrade to the LATEST version.** Don't drop the upgrade test just
+  because the *from* (older) version is awkward. If an older from-version can't be fully deployed/tested
+  (its image tag was pulled from the registry, or it predates an overlay/feature), you do **NOT** need
+  that older version's **custom tests** to run — deploy it minimally (a justified overlay is fine) or
+  pick the nearest deployable prior, then **upgrade to latest and run the full assertions on the
+  latest**. Skipping a from-version's custom tests is an honest, recorded outcome; skipping the
+  upgrade-to-latest is not. (See `plan-ccci-compose-overlay-policy.md` for the per-recipe disposition.)
 - **Honest reporting.** If a stage is skipped or a check failed, say so in `STATUS.md`/`JOURNAL.md` with
   the output. The loop's value depends entirely on the ledgers being true.
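
For orientation, here is a minimal sketch of what an overlay meeting the "minimal + single-purpose, document WHY in the header" rule might look like. It is not taken from the repository. The service name `app`, the compose `version`, and the `COMPOSE_FILE` value in the comment are assumptions for illustration; only the filename `compose.ccci-health.yml` and the 900s ghost `start_period` come from the policy text above.

```yaml
# compose.ccci-health.yml  (illustrative sketch, not the committed file)
#
# WHY this overlay exists:
#   abra cannot take a healthcheck start_period from an env var, so the
#   longer grace period needed on the slower cc-ci node has to be set here.
# SCOPE: deploy/infra tuning only. No test assertion is changed or skipped.
#
# Assumed wiring (hypothetical value), set via COMPOSE_FILE in the app env:
#   COMPOSE_FILE="compose.yml:compose.ccci-health.yml"
version: "3.8"

services:
  app:                      # assumed service name; match the recipe's own
    healthcheck:
      start_period: 900s    # value quoted for ghost in the policy
```

Because compose merges files listed in `COMPOSE_FILE` in order, an overlay like this only needs to restate the one field it overrides. That keeps the drift from the shipped recipe easy to audit.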