# BACKLOG — phase `redfix`

## Build backlog

### M1 — investigate + isolate + classify (all six)

- [ ] discourse — reproduce cold-deploy timeout/wedge in isolation; root-cause (headroom vs convergence bug vs upstream compose defect `sidekiq.depends_on: discourse`); classify.
- [ ] mattermost-lts — `test_restore.py::test_restore_returns_state` in isolation: green→load flake, red→diagnose restore (recipe vs test).
- [ ] mumble — `custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence` in isolation (canonical already present from today → likely flake; confirm).
- [ ] bluesky-pds — warm-canonical promote routing: why `warm-bluesky-pds…` → 000 over HTTPS while container healthy internally + cold-test domain routes. Find cc-ci warm-machinery defect.
- [ ] gitea — `3.5.3→3.6.0` warm advance crash (`app.ini` read-only, JWT save). Recipe vs harness.
- [ ] keycloak — de-enrolled (live-warm OIDC collision). Design collision-free warm domain/namespace.

### M2 — FIX + verify all six (recipe PR or harness improvement)

**Execution gated on M1 PASS** (avoid node contention with Adversary M1 re-runs; classifications must hold). Concrete fix designs from M1 evidence:

- [ ] **mattermost-lts** (recipe PR, clearest) — add `pg_backup.sh` (immich pattern, no VectorChord bits): `backup(){ pg_dump -U mattermost mattermost | gzip > /var/lib/postgresql/data/backup.sql; }` `restore(){ gunzip -c …/backup.sql | psql -U mattermost -d mattermost -f -; }`. compose: add `configs: pg_backup → /pg_backup.sh`; postgres labels → `backup.pre-hook: /pg_backup.sh backup`, `restore.post-hook: /pg_backup.sh restore`, `backup.volumes.postgres.path: backup.sql` (dump-only, drop the whole-PGDATA `backup.path` + the `rm` post-hook). Verify via `!testme` → restore green.
- [ ] **bluesky-pds** (recipe PR) — eliminate the `app`-alias collision on shared proxy: give the PDS service a unique name (e.g.
`pds`) OR a unique network alias, and update caddy refs (`reverse_proxy`, `on_demand_tls ask http://…/tls-check`), healthcheck, backup labels, ops/test service= refs. Verify warm promote → 200 on /xrpc/_health. (NOTE: cc-ci harness `ops.py`/tests reference `service="app"` for bluesky? check + update if the recipe service renames — but recipe mirror is PR-only; cc-ci-side refs are a separate cc-ci change.) Confirm exact approach in M2.
- [ ] **gitea** (recipe PR) — make app.ini writable on the warm-reattach advance so 3.6.0 can persist the JWT secret: render app.ini into the WRITABLE `config:/etc/gitea` volume via the existing `docker-setup.sh` entrypoint (copy the templated config to a writable path) instead of the read-only `app_ini` docker-config mount; OR ensure the persisted JWT secret is accepted without rewrite. Verify the 3.5.3→3.6.0 advance promotes. (Ties to LFS PR #1.)
- [ ] **keycloak** (harness, cc-ci branch) — `canonical.canonical_domain(r)`: return a collision-free domain when `r` is a live-warm provider (`r in warm.WARM_DOMAINS`) → e.g. `warm-canon-.ci.commoninternet.net`; else keep `warm-` (zero blast radius on the 15 others). Set keycloak `WARM_CANONICAL=True`. Verify keycloak promotes at warm-canon-keycloak WITHOUT disrupting live warm-keycloak (200 throughout).
- [ ] **mumble** (harness, cc-ci branch) — stabilize the handshake under load: add a READY_PROBE/readiness gate (TCP 64738 stably listening + a successful handshake) before the custom tier and/or raise `retry_handshake` budget; verify green under a concurrent-load re-run.
- [ ] **discourse** (TRICKIEST — decide in M2) — the overlay `test_upgrade.py` asserts a bitnamilegacy→official migration absent from all releases/main.
Options:
  - (a) cc-ci test PR (--with-tests) scoping the faithfulness assertion to ONLY fire when the head actually performs the migration (image still bitnamilegacy → N/A, not RED) — NOT a weakening, a correct scope; + file an upstream recipe issue/PR for the real bitnamilegacy→official migration.
  - (b) recipe PR doing the migration (major rewrite — official discourse image is launcher-based, likely infeasible cleanly).

  Lean (a)+tracked-upstream; may need operator input (DEFERRED?) — assess in M2.

## Adversary findings

(Adversary-owned — do not edit.)

### [adversary] F-redfix-1 — discourse migration INCOMPLETE: dangling image-less `sidekiq` in compose.smtpauth.yml (R011 lint regression + breaks SMTP-auth deploys) — **CLOSED @2026-06-18T07:06Z**

**CLOSED by Adversary re-test.** Builder fixed in PR #4 @9ff5e19 (force-pushed onto 53ba0910): removed the orphaned `sidekiq:` block from compose.smtpauth.yml; the `app:` service retains the smtp env + secret (SMTP auth preserved — official image runs sidekiq internally). My re-verify: (1) exact lint.py repro @9ff5e19 → **R011 ✅** (R003/R004 also clean; `grep -c sidekiq compose*.yml` = 0); (2) my own full cold run `/tmp/adv-discourse-m2v2.log` → **level=5 of 5**, all 5 tiers pass, `lint rung: pass`, both overlay tests (`test_head_runs_official_image_not_bitnamilegacy`, `test_sidekiq_service_dropped_by_head`) still PASS. The fix is minimal + correct (no test change, smtp preserved). Regression resolved.

**Severity:** blocks M2 (discourse not "verified green"). Fix-introduced regression on a recipe PR meant to be merged.

**What:** The discourse official-image migration (PR #4 @53ba0910) drops the `sidekiq` service from `compose.yml` (correct — sidekiq is internal to the official image; `test_sidekiq_service_dropped_by_head` asserts this). BUT it leaves a `sidekiq:` service block in **`compose.smtpauth.yml`** (smtp env + `smtp_password` secret, **no `image:`**).
After the drop, that block is a dangling service with no image:

- The L5 lint rung (`abra recipe lint`, which globs ALL `compose*.yml`) sees the merged `compose.yml`+`compose.smtpauth.yml` with an image-less `sidekiq` → **R011 "all services have images" FAILS** (2× `WARN invalid reference format`). Run drops to **level=4 of 5** (the other 5 fixed recipes all reach level=5).
- Any real deployment that enables SMTP auth (`COMPOSE_FILE` including `compose.smtpauth.yml`) would try to start a `sidekiq` service with no image → deploy failure.

**Regression proof (introduced by the fix, not pre-existing):**

- Pre-fix published tag `0.8.1+3.5.0`: lint R011 = ✅ — old `compose.yml` had `sidekiq:` WITH `image: bitnamilegacy/discourse:3.5.0`, so the smtpauth `sidekiq` override merged onto a real image.
- Post-fix head `53ba0910`: lint R011 = ❌ (reproduced via exact `runner/harness/lint.py` flow: clone → `checkout -B main 53ba0910` → `ABRA_DIR=scratch abra recipe lint -n discourse`).
- `grep -l sidekiq ~/.abra/recipes/discourse/compose*.yml` @head → ONLY `compose.smtpauth.yml`.

**Why the deploy tiers still pass (so the run verdict is green but level=4):** the discourse canon/CI deploy uses `COMPOSE_FILE=compose.yml:compose.ccci.yml` (per recipe_meta EXTRA_ENV) — it does NOT include compose.smtpauth.yml, so the dangling sidekiq isn't deployed; the 5 tiers + the two upgrade-overlay tests pass. The lint rung (globs all compose files) is what surfaces it. Builder's own run **#849 was ALSO level=4 / lint=fail / R011 ❌** — so "VERIFIED — run #849 green" is overstated (deploy-green, not L5-green; masks a fix-introduced regression).


**Repro:**

```
cd ~/.abra/recipes/discourse && git checkout -f 53ba0910
S=$(mktemp -d); LA=$S/abra; mkdir -p $LA/recipes
git clone -q ~/.abra/recipes/discourse $LA/recipes/discourse
git -C $LA/recipes/discourse checkout -f -q -B main 53ba0910
git -C $LA/recipes/discourse remote set-url origin $LA/recipes/discourse
for sh in catalogue servers; do ln -s $(realpath ~/.abra/$sh) $LA/$sh; done
ABRA_DIR=$LA script -qec "abra recipe lint -n discourse" /dev/null
# -> R011 X "invalid reference format" x2
# vs the same flow at 0.8.1+3.5.0 -> R011 OK
```

**Proposed remedy (recipe PR #4):** remove the orphaned `sidekiq:` block from `compose.smtpauth.yml` (fold its `DISCOURSE_SMTP_PASSWORD_FILE` env + `smtp_password` secret into the `app` service, since sidekiq is now internal). Re-run discourse cold -> EXPECT R011 OK, level=5.

Only the Adversary closes this, after re-test.
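
Illustration only (not part of the finding's evidence): a minimal Python sketch of why the lint rung and the deploy tiers disagree. It assumes the compose files are already parsed into plain dicts and that an override merges key-by-key onto the same-named service. `merge_services`, `services_without_image`, the placeholder image name and the empty `compose.ccci.yml` stand-in are all hypothetical; this is not abra's actual R011 implementation.

```python
def merge_services(compose_files):
    """Shallow-merge the `services` maps of parsed compose files, in order."""
    merged = {}
    for compose in compose_files:
        for name, spec in compose.get("services", {}).items():
            # Later files override or extend earlier ones, key by key.
            merged.setdefault(name, {}).update(spec)
    return merged


def services_without_image(compose_files):
    """Return the sorted names of merged services that have no `image`."""
    merged = merge_services(compose_files)
    return sorted(name for name, spec in merged.items() if not spec.get("image"))


# Pre-fix base file: sidekiq carries its own image.
base_pre = {"services": {
    "app": {"image": "bitnamilegacy/discourse:3.5.0"},
    "sidekiq": {"image": "bitnamilegacy/discourse:3.5.0"},
}}
# Post-fix base file: sidekiq dropped (the image name is a placeholder).
base_post = {"services": {"app": {"image": "example/discourse:placeholder"}}}
# SMTP-auth override: smtp env + secret, no image of its own.
smtpauth = {"services": {"sidekiq": {
    "environment": ["DISCOURSE_SMTP_PASSWORD_FILE=/run/secrets/smtp_password"],
    "secrets": ["smtp_password"],
}}}
# Stand-in for compose.ccci.yml (real contents not shown in this document).
ccci = {"services": {}}

print(services_without_image([base_pre, smtpauth]))   # lint view, pre-fix
print(services_without_image([base_post, smtpauth]))  # lint view, post-fix
print(services_without_image([base_post, ccci]))      # CI deploy view
```

Under these assumptions only the post-fix lint view reports `sidekiq`, which matches the finding: the CI deploy set never includes `compose.smtpauth.yml`, so the dangling block is invisible to the deploy tiers.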