# JOURNAL — phase `redfix` ## 2026-06-17T23:20Z — Bootstrap Read phase plan + plan.md §6.1/§7/§9 + canon DECISIONS exceptions (lines ~1494–1552). Six canon-sweep failures to investigate. Confirmed cc-ci access, no run in flight, sweep timer next fires 2026-06-21 (3-day window), disk 38G free. Isolation mechanism understood: `runner/nightly_sweep.run_on_tag` = `abra.recipe_checkout(r, tag)` + `run_recipe_ci.py RECIPE= CCCI_SKIP_FETCH=1` cold/full. I reproduce each failure by running ONE recipe at a time with no concurrent load. Starting canonical state notable: **mumble canonical IS present** (`1.0.0+v1.6.870-0`, written 20260617T180501Z — during today's nixenv sweep). The canon DECISIONS recorded mumble RED (`test_handshake_completes_with_channel_presence`). A canonical only gets written on a GREEN cold run on latest → mumble flipped green in a recent run. Strong early evidence for the operator's "mumble passed before" → load flake hypothesis. Must confirm with a clean isolation re-run + check whether the canon-sweep red was under concurrent load. Next: start M1 investigation. Plan order (cheap/informative first): triage the existing sweep logs on cc-ci to pin the EXACT assertion/error for each (mumble, mattermost-lts restore, gitea app.ini, bluesky routing, discourse compose), then run isolation re-runs. discourse's recorded cause is an UPSTREAM compose defect (`sidekiq.depends_on: discourse` while service is `app`) that FATAs before any deploy — that's deterministic, not a load timeout, so it may not even need a long isolation run to confirm; verify the compose at the latest tag directly first. ## 2026-06-17T23:40Z — M1: discourse isolation run — CANON ROOT-CAUSE WAS WRONG Ran discourse ALONE on cc-ci (`recipe_checkout discourse 0.8.1+3.5.0` + `RECIPE=discourse CCCI_SKIP_FETCH=1 cc-ci-run runner/run_recipe_ci.py`, log `/tmp/redfix-discourse.log`). RESULT: **install PASS, upgrade FAIL, backup PASS, restore PASS, custom PASS** — the recipe deploys, serves (200 /srv/status), backs up and restores cleanly. NOT a deploy timeout, NOT a 51-min wedge, NOT a deploy FATA. The canon DECISIONS root-cause ("`abra app deploy` FATAs: service sidekiq depends on undefined service discourse → invalid compose project") is **misattributed**: that string appears ONLY from the non-fatal prepull `docker compose config --images` (rc=15, harness logs "skipping (deploy will pull as usual)"). The real `abra app deploy` is a swarm `docker stack deploy`, which ignores `depends_on` entirely → the stack converges (`UpdateStatus=completed`). The ONLY failure is the cc-ci upgrade OVERLAY `tests/discourse/test_upgrade.py`: - `test_head_runs_official_image_not_bitnamilegacy` — app image is `bitnamilegacy/discourse:3.5.0`; test demands `discourse/discourse:3.5.3` (official). - `test_sidekiq_service_dropped_by_head` — services `['app','db','redis','sidekiq']`; test demands sidekiq dropped. These `prevb`-phase overlay tests are PR-FAITHFULNESS assertions for a specific migration PR (bitnamilegacy → official `discourse/discourse:3.5.3`, drop sidekiq). Verified that migration exists in **NO upstream release tag and NOT in main** — `git show main:compose.yml` and every tag (`0.1.0…0.8.1+3.5.0`) all use `bitnamilegacy/discourse:3.5.0` + sidekiq. So the overlay asserts a state that doesn't exist anywhere upstream → deterministic RED whenever the sweep tests the latest release tag. The head DID deploy (chaos-version label = head f87c612d+U, converged) — the test expectation is simply wrong for the released recipe. Note (M2 design): migrating discourse from the deprecated `bitnamilegacy` image to official `discourse/discourse` is a MAJOR recipe rewrite (different fs layout, entrypoint, no `/opt/bitnami` sidekiq run.sh) — not a 1-line image swap. So the overlay test's `discourse/discourse:3.5.3` expectation may not be a realistic near-term recipe change. The bitnamilegacy deprecation is real (bitnami sunset legacy images), so a migration is the right long-term direction, but the test as written hard-codes a migration target absent upstream. Classification + fix approach to settle in M1 table / M2. Classification: **stale/PR-specific cc-ci OVERLAY test mismatched to the canonical-sweep context** (NOT a flake, NOT a load timeout, NOT a recipe-deploy defect, NOT warm-machinery). Teardown clean (no discourse stack left). Evidence: `/tmp/redfix-discourse.log` on cc-ci; junit under `/var/lib/cc-ci-runs/manual/junit/upgrade__cc-ci__test_upgrade.xml`.