From 23b439db8309234fa879b8818639414610240aaf Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Wed, 17 Jun 2026 23:33:18 +0000
Subject: [PATCH] =?UTF-8?q?journal(redfix):=20M1=20discourse=20isolation?=
 =?UTF-8?q?=20=E2=80=94=20canon=20root-cause=20wrong;=20deploys=20fine,=20?=
 =?UTF-8?q?only=20upgrade=20overlay=20(unreleased=20official-image=20migra?=
 =?UTF-8?q?tion)=20fails?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 machine-docs/JOURNAL-redfix.md | 40 ++++++++++++++++++++++++++++++++++
 machine-docs/STATUS-redfix.md  |  4 ++--
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/machine-docs/JOURNAL-redfix.md b/machine-docs/JOURNAL-redfix.md
index 34ce996..62f655f 100644
--- a/machine-docs/JOURNAL-redfix.md
+++ b/machine-docs/JOURNAL-redfix.md
@@ -23,3 +23,43 @@ bluesky routing, discourse compose), then run isolation re-runs. discourse's rec
UPSTREAM compose defect (`sidekiq.depends_on: discourse` while service is `app`) that FATAs before any
deploy — that's deterministic, not a load timeout, so it may not even need a long isolation run to
confirm; verify the compose at the latest tag directly first.

## 2026-06-17T23:40Z — M1: discourse isolation run — CANON ROOT-CAUSE WAS WRONG

Ran discourse ALONE on cc-ci (`recipe_checkout discourse 0.8.1+3.5.0` + `RECIPE=discourse
CCCI_SKIP_FETCH=1 cc-ci-run runner/run_recipe_ci.py`, log `/tmp/redfix-discourse.log`).

RESULT: **install PASS, upgrade FAIL, backup PASS, restore PASS, custom PASS** — the recipe deploys,
serves (200 /srv/status), backs up and restores cleanly. NOT a deploy timeout, NOT a 51-min wedge,
NOT a deploy FATA. The canon DECISIONS root-cause ("`abra app deploy` FATAs: service sidekiq depends
on undefined service discourse → invalid compose project") is **misattributed**: that string appears
ONLY from the non-fatal prepull `docker compose config --images` (rc=15, harness logs "skipping
(deploy will pull as usual)").
The real `abra app deploy` is a swarm `docker stack deploy`, which
ignores `depends_on` entirely → the stack converges (`UpdateStatus=completed`).

The ONLY failure is the cc-ci upgrade OVERLAY `tests/discourse/test_upgrade.py`:
- `test_head_runs_official_image_not_bitnamilegacy` — app image is `bitnamilegacy/discourse:3.5.0`;
  test demands `discourse/discourse:3.5.3` (official).
- `test_sidekiq_service_dropped_by_head` — services `['app','db','redis','sidekiq']`; test demands
  sidekiq dropped.

These `prevb`-phase overlay tests are PR-FAITHFULNESS assertions for a specific migration PR
(bitnamilegacy → official `discourse/discourse:3.5.3`, drop sidekiq). Verified that migration exists
in **NO upstream release tag and NOT in main** — `git show main:compose.yml` and every tag
(`0.1.0…0.8.1+3.5.0`) all use `bitnamilegacy/discourse:3.5.0` + sidekiq. So the overlay asserts a
state that doesn't exist anywhere upstream → deterministic RED whenever the sweep tests the latest
release tag. The head DID deploy (chaos-version label = head f87c612d+U, converged) — the test
expectation is simply wrong for the released recipe.

Note (M2 design): migrating discourse from the deprecated `bitnamilegacy` image to official
`discourse/discourse` is a MAJOR recipe rewrite (different fs layout, entrypoint, no `/opt/bitnami`
sidekiq run.sh) — not a 1-line image swap. So the overlay test's `discourse/discourse:3.5.3`
expectation may not be a realistic near-term recipe change. The bitnamilegacy deprecation is real
(bitnami sunset legacy images), so a migration is the right long-term direction, but the test as
written hard-codes a migration target absent upstream. Classification + fix approach to settle in M1
table / M2.

Classification: **stale/PR-specific cc-ci OVERLAY test mismatched to the canonical-sweep context**
(NOT a flake, NOT a load timeout, NOT a recipe-deploy defect, NOT warm-machinery). Teardown clean (no
discourse stack left).
Evidence: `/tmp/redfix-discourse.log` on cc-ci; junit under
`/var/lib/cc-ci-runs/manual/junit/upgrade__cc-ci__test_upgrade.xml`.
diff --git a/machine-docs/STATUS-redfix.md b/machine-docs/STATUS-redfix.md
index 9f9fb72..5c4a0e0 100644
--- a/machine-docs/STATUS-redfix.md
+++ b/machine-docs/STATUS-redfix.md
@@ -35,8 +35,8 @@ flake source per phase plan §2.1). Runs execute on cc-ci from `/etc/cc-ci`.
 | Recipe | Isolation run | Result | Root cause | Classification |
 |---|---|---|---|---|
-| discourse | pending | — | — | — |
-| mattermost-lts | pending | — | — | — |
+| discourse | DONE @23:40Z (`/tmp/redfix-discourse.log` on cc-ci) | install/backup/restore/custom PASS; **upgrade overlay FAIL**. Deploys+serves fine — NOT a timeout/FATA. | cc-ci overlay `tests/discourse/test_upgrade.py` asserts head runs official `discourse/discourse:3.5.3` + drops sidekiq; latest tag `0.8.1+3.5.0` AND main both still `bitnamilegacy/discourse:3.5.0`+sidekiq (migration exists in no release/main). The `depends_on discourse` string is a non-fatal prepull-only warning, not the deploy. | **stale/PR-specific cc-ci OVERLAY test** mismatched to canonical-sweep context (not flake/timeout/recipe-deploy/warm-machinery) |
+| mattermost-lts | running (isolation) | — | — | — |
 | mumble | pending | — | — | — |
 | bluesky-pds | pending | — | — | — |
 | gitea | pending | — | — | — |
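
Reviewer sketch (not part of the patch): the journal's "migration exists in no release tag and not in main" check could be scripted roughly as below, so it can be re-run when upstream cuts a new tag. This is a minimal sketch under stated assumptions: the function names `compose_images` and `refs_containing_image` are hypothetical, the recipe is assumed to keep its compose file at `compose.yml` (as in the journal's `git show main:compose.yml`), and a line regex stands in for a real YAML parser.

```python
import re
import subprocess

# Matches `image: <ref>` lines; a plain regex is an assumption made here to
# avoid a YAML dependency. Commented-out lines are not matched.
IMAGE_RE = re.compile(r"^[ \t]*image:[ \t]*[\"']?([^\s\"'#]+)", re.MULTILINE)


def compose_images(compose_text: str) -> list[str]:
    """Return every `image:` value found in a compose file's text, in order."""
    return IMAGE_RE.findall(compose_text)


def refs_containing_image(repo_dir: str, image_prefix: str,
                          extra_refs: tuple[str, ...] = ("main",)) -> list[str]:
    """List refs (extra_refs plus all tags) whose compose.yml uses image_prefix.

    Hypothetical helper: repo_dir is a local clone of the recipe repository.
    """
    tags = subprocess.run(
        ["git", "-C", repo_dir, "tag", "--list"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    hits = []
    for ref in [*extra_refs, *tags]:
        shown = subprocess.run(
            ["git", "-C", repo_dir, "show", f"{ref}:compose.yml"],
            capture_output=True, text=True,
        )
        if shown.returncode != 0:
            continue  # this ref has no compose.yml; skip rather than fail
        if any(img.startswith(image_prefix) for img in compose_images(shown.stdout)):
            hits.append(ref)
    return hits
```

Calling `refs_containing_image(<local recipe clone>, "discourse/discourse")` and getting an empty list would match the journal's finding; a non-empty list would mean the overlay's expected image has landed in some ref and the classification should be revisited.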