# JOURNAL — phase `redfix` ## 2026-06-17T23:20Z — Bootstrap Read phase plan + plan.md §6.1/§7/§9 + canon DECISIONS exceptions (lines ~1494–1552). Six canon-sweep failures to investigate. Confirmed cc-ci access, no run in flight, sweep timer next fires 2026-06-21 (3-day window), disk 38G free. Isolation mechanism understood: `runner/nightly_sweep.run_on_tag` = `abra.recipe_checkout(r, tag)` + `run_recipe_ci.py RECIPE= CCCI_SKIP_FETCH=1` cold/full. I reproduce each failure by running ONE recipe at a time with no concurrent load. Starting canonical state notable: **mumble canonical IS present** (`1.0.0+v1.6.870-0`, written 20260617T180501Z — during today's nixenv sweep). The canon DECISIONS recorded mumble RED (`test_handshake_completes_with_channel_presence`). A canonical only gets written on a GREEN cold run on latest → mumble flipped green in a recent run. Strong early evidence for the operator's "mumble passed before" → load flake hypothesis. Must confirm with a clean isolation re-run + check whether the canon-sweep red was under concurrent load. Next: start M1 investigation. Plan order (cheap/informative first): triage the existing sweep logs on cc-ci to pin the EXACT assertion/error for each (mumble, mattermost-lts restore, gitea app.ini, bluesky routing, discourse compose), then run isolation re-runs. discourse's recorded cause is an UPSTREAM compose defect (`sidekiq.depends_on: discourse` while service is `app`) that FATAs before any deploy — that's deterministic, not a load timeout, so it may not even need a long isolation run to confirm; verify the compose at the latest tag directly first. ## 2026-06-17T23:40Z — M1: discourse isolation run — CANON ROOT-CAUSE WAS WRONG Ran discourse ALONE on cc-ci (`recipe_checkout discourse 0.8.1+3.5.0` + `RECIPE=discourse CCCI_SKIP_FETCH=1 cc-ci-run runner/run_recipe_ci.py`, log `/tmp/redfix-discourse.log`). RESULT: **install PASS, upgrade FAIL, backup PASS, restore PASS, custom PASS** — the recipe deploys, serves (200 /srv/status), backs up and restores cleanly. NOT a deploy timeout, NOT a 51-min wedge, NOT a deploy FATA. The canon DECISIONS root-cause ("`abra app deploy` FATAs: service sidekiq depends on undefined service discourse → invalid compose project") is **misattributed**: that string appears ONLY from the non-fatal prepull `docker compose config --images` (rc=15, harness logs "skipping (deploy will pull as usual)"). The real `abra app deploy` is a swarm `docker stack deploy`, which ignores `depends_on` entirely → the stack converges (`UpdateStatus=completed`). The ONLY failure is the cc-ci upgrade OVERLAY `tests/discourse/test_upgrade.py`: - `test_head_runs_official_image_not_bitnamilegacy` — app image is `bitnamilegacy/discourse:3.5.0`; test demands `discourse/discourse:3.5.3` (official). - `test_sidekiq_service_dropped_by_head` — services `['app','db','redis','sidekiq']`; test demands sidekiq dropped. These `prevb`-phase overlay tests are PR-FAITHFULNESS assertions for a specific migration PR (bitnamilegacy → official `discourse/discourse:3.5.3`, drop sidekiq). Verified that migration exists in **NO upstream release tag and NOT in main** — `git show main:compose.yml` and every tag (`0.1.0…0.8.1+3.5.0`) all use `bitnamilegacy/discourse:3.5.0` + sidekiq. So the overlay asserts a state that doesn't exist anywhere upstream → deterministic RED whenever the sweep tests the latest release tag. The head DID deploy (chaos-version label = head f87c612d+U, converged) — the test expectation is simply wrong for the released recipe. Note (M2 design): migrating discourse from the deprecated `bitnamilegacy` image to official `discourse/discourse` is a MAJOR recipe rewrite (different fs layout, entrypoint, no `/opt/bitnami` sidekiq run.sh) — not a 1-line image swap. So the overlay test's `discourse/discourse:3.5.3` expectation may not be a realistic near-term recipe change. The bitnamilegacy deprecation is real (bitnami sunset legacy images), so a migration is the right long-term direction, but the test as written hard-codes a migration target absent upstream. Classification + fix approach to settle in M1 table / M2. Classification: **stale/PR-specific cc-ci OVERLAY test mismatched to the canonical-sweep context** (NOT a flake, NOT a load timeout, NOT a recipe-deploy defect, NOT warm-machinery). Teardown clean (no discourse stack left). Evidence: `/tmp/redfix-discourse.log` on cc-ci; junit under `/var/lib/cc-ci-runs/manual/junit/upgrade__cc-ci__test_upgrade.xml`. ## 2026-06-18T00:05Z — M1: mattermost-lts isolation run — DETERMINISTIC restore failure (recipe defect) Ran mattermost-lts ALONE (tag 2.1.9+10.11.15, log /tmp/redfix-mattermost-lts.log). RESULT: **install/upgrade/backup/custom PASS, restore FAIL** — identical to the canon failure: `tests/mattermost-lts/test_restore.py::test_restore_returns_state` → `relation "ci_marker" does not exist` after restore. So it is **deterministic in isolation, NOT a loaded-node race** (canon framing was wrong). The marker logic is sound (postgres table seeded pre-backup, dropped pre-restore, asserted post-restore — same pattern immich uses and PASSES). ROOT CAUSE (recipe backup/restore labels). Compared mattermost-lts vs immich (immich passes the IDENTICAL test): - immich `database` svc: `backupbot.backup.pre-hook: /pg_backup.sh backup`, `backupbot.backup.volumes.postgres.path: backup.sql` (backs up ONLY the dump file), and **`backupbot.restore.post-hook: /pg_backup.sh restore`** (replays the dump on restore). → round-trips. - mattermost-lts `postgres` svc: `pre-hook: pg_dump > /var/lib/postgresql/data/postgres-backup.sql`, `backup.path: /var/lib/postgresql/data/` (backs up the WHOLE live/hot PGDATA dir + the dump), `post-hook: rm .../postgres-backup.sql`, and **NO `backupbot.restore.post-hook`**. So on restore, abra restores the files but NOTHING replays the dump, and a hot-copied live PGDATA over a running postgres does not reload → `ci_marker` lost. Restore log confirms `Restoring Snapshot b0495d36 at /` with no post-hook reimport. Classification: **GENUINE RECIPE DEFECT at latest** (postgres backup/restore does not round-trip — missing restore post-hook + backs up hot PGDATA instead of dump-only). NOT a flake, NOT cc-ci test weakening (test is correct & unmodified; immich proves the pattern works). Fix (M2) = recipe PR adopting the immich-style postgres backup/restore (a `/pg_backup.sh`-style dump + restore post-hook). Teardown clean (no matt stack). Evidence: /tmp/redfix-mattermost-lts.log; junit restore__cc-ci__test_restore.xml. Tooling note: my background "waiter" loop `while pgrep -f run_recipe_ci.py` self-matched (its own cmdline contains the string) → never exited, falsely showed a run active. Use `pgrep -f "[r]un_recipe_ci.py"` or match the python invocation. Killed the stuck waiters; node confirmed free. ## 2026-06-18T00:18Z — M1: mumble isolation run — GREEN (flake confirmed) Ran mumble ALONE (tag 1.0.0+v1.6.870-0, log /tmp/redfix-mumble.log). RESULT: **ALL tiers PASS** (install/upgrade/backup/restore/custom), including `custom/test_protocol_handshake.py:: test_handshake_completes_with_channel_presence` PASSED. No orphan stacks. The canon sweep recorded this RED (`test_handshake…` failed under concurrent sweep load); it is GREEN here in isolation, and its canonical was already written green TODAY (1.0.0+v1.6.870-0 @20260617T180501Z) under the lighter nixenv sweep. → **load/timing FLAKE** on the control-channel handshake, NOT a recipe defect. The handshake test already retries (`retry_handshake(attempts=12, interval=5.0)` = 60s). So the flake is the voice server not completing the TLS+ServerSync handshake within ~60s under heavy concurrent node load (deploy contention). M2 fix = harness stabilization (stronger readiness gate before the custom tier / longer-or-smarter retry / serialize), based on the load failure mode. Classification: **FLAKE (load/concurrency)** → harness stabilization. Reproducibility: 1 green isolation run here + canonical green today + documented red under canon load. Will do 1–2 more isolation repeats before the M1 claim to firm "reproducibly green in isolation." ## 2026-06-18T00:45Z — M1: bluesky-pds isolation run — 000 REPRODUCES; root cause = `app` DNS collision on shared proxy Ran bluesky-pds ALONE (tag 0.3.0+v0.4.219, log /tmp/redfix-bluesky-pds.log). Cold lifecycle GREEN (install/backup/restore/custom pass; upgrade EXPECTED_NA per recipe_meta — moving pds:0.4 tag). Then WC5 promote-on-green-cold FAILED exactly as canon: `warm-bluesky-pds.ci.commoninternet.net: not healthy over HTTPS /xrpc/_health (last status 0)`. So **the 000 reproduces deterministically in isolation — NOT a sweep-load/ACME-rate-limit flake** (my first hypothesis, refuted). LIVE DIAGNOSIS (stack left deployed by the failed promote; probed before teardown): - app service 1/1, healthy: `docker exec app wget localhost:3000/xrpc/_health` → `{"version":"0.4.219"}`; app listens on `:::3000`; no restarts. So the PDS itself is fine. - HTTPS to warm domain → 000. caddy logs flood: `tls "failed to get permission for on-demand certificate" domain=warm-bluesky-pds… error=… Get "http://app:3000/tls-check?domain=…": dial tcp 10.10.0.X:3000: connect: connection refused` (X varies: .2 .4 .5 .6 .8 .9 .10 .12). - bluesky uses caddy **on-demand TLS** (Caddyfile: `on_demand_tls { ask http://app:3000/tls-check }`, `tls { on_demand }`, `reverse_proxy app:3000`). caddy must reach app:3000/tls-check to be GRANTED a cert before serving TLS. It can't → no cert → TLS handshake fails → 000. - WHY can't caddy reach app: **service-name `app` collision on the shared `proxy` overlay.** - app is on `warm-bluesky-pds…_internal` ONLY (IP 10.0.3.3). caddy is on `proxy` (10.10.50.223) + `…_internal` (10.0.3.6). - `docker exec caddy getent hosts app` → returns ONLY proxy IPs (8/8 tries: 10.10.0.4/.5/.6/.10/.12), **NEVER the internal 10.0.3.3.** The proxy-net `app` alias shadows bluesky's own internal app. - `docker network inspect proxy` shows EVERY stack aliases its main service `app`: `drone…_app=10.10.0.2`, `traefik…_app=10.10.0.5`, `warm-keycloak…_app=10.10.0.9`, `ccci-reports/bridge/dashboard_app`, … — exactly the IPs caddy hits. None listens a PDS on 3000 → connection refused. So caddy resolves bare `app` to OTHER stacks' app endpoints on the shared proxy, never its own PDS. WHY cold passes / warm fails: cold's health window is long (HTTP_TIMEOUT=600) and on first success caddy CACHES the issued cert; the promote's shorter health window doesn't give caddy a chance to ever resolve correctly (and here it provably never resolves to 10.0.3.3 at all). The collision is the root cause; the promote machinery is CORRECT (it refused to write a canonical for an unhealthy 000 — no canonical.json written, verified). Classification: **genuine ROUTING/recipe defect — caddy↔app cross-stack `app`-alias collision on the shared proxy net**, deterministic, reproducible in isolation. NOT a flake; NOT a promote-machinery bug. Fix approach (M2): recipe PR giving the PDS service a UNIQUE name/alias (e.g. rename `app`→`pds`) so caddy's `reverse_proxy`/`tls-check` resolve only bluesky's own internal service (no shared-proxy `app` collision). (Alternatively a caddy-side internal-only resolution; renaming is cleanest.) Will confirm the exact fix in M2 + verify the warm domain then serves 200. Cleanup: removed orphaned warm-bluesky-pds stack + its volumes/secrets (promote had left it deployed; no canonical written). Node clean.