JOURNAL — phase redfix
2026-06-17T23:20Z — Bootstrap
Read phase plan + plan.md §6.1/§7/§9 + canon DECISIONS exceptions (lines ~1494–1552). Six canon-sweep failures to investigate. Confirmed cc-ci access, no run in flight, sweep timer next fires 2026-06-21 (3-day window), disk 38G free.
Isolation mechanism understood: runner/nightly_sweep.run_on_tag = abra.recipe_checkout(r, tag), then
run_recipe_ci.py with RECIPE=<r> CCCI_SKIP_FETCH=1, cold/full. I reproduce each failure by running ONE
recipe at a time with no concurrent load.
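The single-recipe invocation described above can be sketched as follows. This is a minimal sketch, not the runner's actual code: `isolation_command` is a hypothetical helper, and it only assembles the command and environment (it does not perform the tag checkout, which the notes attribute to abra.recipe_checkout). The `RECIPE` and `CCCI_SKIP_FETCH` variables and the `cc-ci-run runner/run_recipe_ci.py` launcher are taken from this journal.

```python
import os


def isolation_command(recipe, base_env=None):
    """Return (argv, env) for running exactly one recipe, with fetch skipped.

    Hypothetical helper for illustration; the real sweep code may differ.
    """
    # Copy the environment so the caller's mapping is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["RECIPE"] = recipe           # restrict the run to a single recipe
    env["CCCI_SKIP_FETCH"] = "1"     # keep the checked-out tag, do not re-fetch
    argv = ["cc-ci-run", "runner/run_recipe_ci.py"]
    return argv, env


if __name__ == "__main__":
    argv, env = isolation_command("discourse")
    print(" ".join(argv), "RECIPE=" + env["RECIPE"])
```

The pair can be passed to `subprocess.run(argv, env=env)` once the recipe is checked out at the tag under test.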
Notable starting canonical state: the mumble canonical IS present (1.0.0+v1.6.870-0, written
20260617T180501Z, during today's nixenv sweep). The canon DECISIONS recorded mumble RED
(test_handshake_completes_with_channel_presence). A canonical is only written on a GREEN cold run
on latest, so mumble flipped green in a recent run. Strong early evidence for the operator's "mumble
passed before" → load-flake hypothesis. Must confirm with a clean isolation re-run and check whether the
canon-sweep red occurred under concurrent load.
Next: start M1 investigation. Plan order (cheap/informative first): triage the existing sweep logs on
cc-ci to pin the EXACT assertion/error for each (mumble, mattermost-lts restore, gitea app.ini,
bluesky routing, discourse compose), then run isolation re-runs. discourse's recorded cause is an
UPSTREAM compose defect (sidekiq.depends_on: discourse while service is app) that FATAs before any
deploy — that's deterministic, not a load timeout, so it may not even need a long isolation run to
confirm; verify the compose at the latest tag directly first.
2026-06-17T23:40Z — M1: discourse isolation run — CANON ROOT-CAUSE WAS WRONG
Ran discourse ALONE on cc-ci: recipe_checkout discourse 0.8.1+3.5.0, then RECIPE=discourse CCCI_SKIP_FETCH=1 cc-ci-run runner/run_recipe_ci.py (log: /tmp/redfix-discourse.log).
RESULT: install PASS, upgrade FAIL, backup PASS, restore PASS, custom PASS — the recipe deploys,
serves (200 /srv/status), backs up and restores cleanly. NOT a deploy timeout, NOT a 51-min wedge,
NOT a deploy FATA. The canon DECISIONS root-cause ("abra app deploy FATAs: service sidekiq depends
on undefined service discourse → invalid compose project") is misattributed: that string appears
ONLY from the non-fatal prepull docker compose config --images (rc=15, harness logs "skipping
(deploy will pull as usual)"). The real abra app deploy is a swarm docker stack deploy, which
ignores depends_on entirely → the stack converges (UpdateStatus=completed).
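The compose-level complaint can be reproduced offline with a small check. This is a sketch under stated assumptions: `undefined_dependencies` is a hypothetical helper, and the `example` mapping is a trimmed illustration built from the service names recorded in this journal, not the recipe's real compose file. It mirrors the kind of validation that rejects a `depends_on` entry naming a service that is not defined, which a swarm stack deploy never evaluates.

```python
def undefined_dependencies(compose):
    """Return (service, dependency) pairs where depends_on names no defined service."""
    services = compose.get("services") or {}
    problems = []
    for name, spec in services.items():
        deps = (spec or {}).get("depends_on") or []
        # depends_on may be a list or a mapping; iterating either yields names.
        for dep in deps:
            if dep not in services:
                problems.append((name, dep))
    return problems


# Illustrative compose fragment (assumption): sidekiq depends on "discourse",
# but the main service is defined under the name "app".
example = {
    "services": {
        "app": {"image": "bitnamilegacy/discourse:3.5.0"},
        "db": {},
        "redis": {},
        "sidekiq": {"depends_on": ["discourse"]},
    }
}

if __name__ == "__main__":
    print(undefined_dependencies(example))  # [('sidekiq', 'discourse')]
```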
The ONLY failure is the cc-ci upgrade OVERLAY tests/discourse/test_upgrade.py:
test_head_runs_official_image_not_bitnamilegacy: app image is bitnamilegacy/discourse:3.5.0; test demands discourse/discourse:3.5.3 (official).
test_sidekiq_service_dropped_by_head: services ['app','db','redis','sidekiq']; test demands sidekiq dropped.
These prevb-phase overlay tests are PR-FAITHFULNESS assertions for a specific migration PR
(bitnamilegacy → official discourse/discourse:3.5.3, drop sidekiq). Verified that migration exists
in NO upstream release tag and NOT in main — git show main:compose.yml and every tag
(0.1.0…0.8.1+3.5.0) all use bitnamilegacy/discourse:3.5.0 + sidekiq. So the overlay asserts a
state that doesn't exist anywhere upstream → deterministic RED whenever the sweep tests the latest
release tag. The head DID deploy (chaos-version label = head f87c612d+U, converged) — the test
expectation is simply wrong for the released recipe.
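The per-tag verification above can be scripted roughly as follows. This is a sketch, not the commands actually run: `image_refs`, `compose_at`, and `refs_using` are hypothetical helper names, the regex is a line-based approximation rather than a YAML parser, and the list of refs has to be supplied by the caller. Only `git show <ref>:compose.yml` is taken from the notes.

```python
import re
import subprocess

# Line-based match for "image: <ref>", tolerating optional quotes.
IMAGE_RE = re.compile(r"^[ \t]*image:[ \t]*[\"']?([^\s\"'#]+)", re.MULTILINE)


def image_refs(compose_text):
    """Return every image reference found in a compose file's text."""
    return IMAGE_RE.findall(compose_text)


def compose_at(ref, repo=".", path="compose.yml"):
    """Read the compose file as it exists at a git ref (tag or branch)."""
    result = subprocess.run(
        ["git", "-C", repo, "show", f"{ref}:{path}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def refs_using(prefix, refs, repo="."):
    """Map each ref to the images in its compose file that start with prefix."""
    return {
        ref: [img for img in image_refs(compose_at(ref, repo)) if img.startswith(prefix)]
        for ref in refs
    }
```

Calling `refs_using("bitnamilegacy/", ["main", "0.8.1+3.5.0"], repo=...)` inside a recipe checkout would list which refs still reference the legacy image; an empty list for a ref would indicate the migration had landed there.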
Note (M2 design): migrating discourse from the deprecated bitnamilegacy image to official
discourse/discourse is a MAJOR recipe rewrite (different fs layout, entrypoint, no /opt/bitnami
sidekiq run.sh) — not a 1-line image swap. So the overlay test's discourse/discourse:3.5.3
expectation may not be a realistic near-term recipe change. The bitnamilegacy deprecation is real
(bitnami sunset legacy images), so a migration is the right long-term direction, but the test as
written hard-codes a migration target absent upstream. Classification + fix approach to settle in M1
table / M2.
Classification: stale/PR-specific cc-ci OVERLAY test mismatched to the canonical-sweep context
(NOT a flake, NOT a load timeout, NOT a recipe-deploy defect, NOT warm-machinery). Teardown clean (no
discourse stack left). Evidence: /tmp/redfix-discourse.log on cc-ci; junit under
/var/lib/cc-ci-runs/manual/junit/upgrade__cc-ci__test_upgrade.xml.
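For triage, the failing cases can be pulled out of a junit file with the standard library. A minimal sketch: `failed_tests` is a hypothetical helper, and `SAMPLE` is a hand-written illustration (including the made-up passing case `test_placeholder_passing`), not the contents of the real junit file referenced above.

```python
import xml.etree.ElementTree as ET


def failed_tests(junit_xml):
    """Return names of testcases that contain a <failure> or <error> child."""
    root = ET.fromstring(junit_xml)
    failed = []
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.append(case.get("name"))
    return failed


# Illustrative input only (assumption): shaped like typical junit output.
SAMPLE = """<testsuites><testsuite name="upgrade">
  <testcase name="test_head_runs_official_image_not_bitnamilegacy"><failure message="x"/></testcase>
  <testcase name="test_sidekiq_service_dropped_by_head"><failure message="y"/></testcase>
  <testcase name="test_placeholder_passing"/>
</testsuite></testsuites>"""

if __name__ == "__main__":
    for name in failed_tests(SAMPLE):
        print(name)
```

To read a file on disk instead, pass the text of the file, for example `failed_tests(open(path).read())`.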