# BACKLOG — phase `redfix`
## Build backlog
### M1 — investigate + isolate + classify (all six)
- [ ] discourse — reproduce cold-deploy timeout/wedge in isolation; root-cause (headroom vs
convergence bug vs upstream compose defect `sidekiq.depends_on: discourse`); classify.
- [ ] mattermost-lts — `test_restore.py::test_restore_returns_state` in isolation: green→load flake,
red→diagnose restore (recipe vs test).
- [ ] mumble — `custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence` in
isolation (canonical already present from today → likely flake; confirm).
- [ ] bluesky-pds — warm-canonical promote routing: why `warm-bluesky-pds…` → 000 over HTTPS while
container healthy internally + cold-test domain routes. Find cc-ci warm-machinery defect.
- [ ] gitea — `3.5.3→3.6.0` warm advance crash (`app.ini` read-only, JWT save). Recipe vs harness.
- [ ] keycloak — de-enrolled (live-warm OIDC collision). Design collision-free warm domain/namespace.
### M2 — FIX + verify all six (recipe PR or harness improvement)
**Execution gated on M1 PASS** (avoid node contention with Adversary M1 re-runs; classifications must
hold). Concrete fix designs from M1 evidence:
- [ ] **mattermost-lts** (recipe PR, clearest) — add `pg_backup.sh` (immich pattern, no VectorChord
bits): `backup(){ pg_dump -U mattermost mattermost | gzip > /var/lib/postgresql/data/backup.sql; }`
`restore(){ gunzip -c …/backup.sql | psql -U mattermost -d mattermost -f -; }`. compose: add
`configs: pg_backup → /pg_backup.sh`; postgres labels → `backup.pre-hook: /pg_backup.sh backup`,
`restore.post-hook: /pg_backup.sh restore`, `backup.volumes.postgres.path: backup.sql` (dump-only,
drop the whole-PGDATA `backup.path` + the `rm` post-hook). Verify via `!testme` → restore green.
- [ ] **bluesky-pds** (recipe PR) — eliminate the `app`-alias collision on shared proxy: give the PDS
service a unique name (e.g. `pds`) OR a unique network alias, and update caddy refs
(`reverse_proxy`, `on_demand_tls ask http://…/tls-check`), healthcheck, backup labels, ops/test
service= refs. Verify warm promote → 200 on /xrpc/_health. (NOTE: cc-ci harness `ops.py`/tests
reference `service="app"` for bluesky? check + update if the recipe service renames — but recipe
mirror is PR-only; cc-ci-side refs are a separate cc-ci change.) Confirm exact approach in M2.
- [ ] **gitea** (recipe PR) — make app.ini writable on the warm-reattach advance so 3.6.0 can persist
the JWT secret: render app.ini into the WRITABLE `config:/etc/gitea` volume via the existing
`docker-setup.sh` entrypoint (copy the templated config to a writable path) instead of the
read-only `app_ini` docker-config mount; OR ensure the persisted JWT secret is accepted without
rewrite. Verify the 3.5.3→3.6.0 advance promotes. (Ties to LFS PR #1.)
- [ ] **keycloak** (harness, cc-ci branch) — `canonical.canonical_domain(r)`: return a collision-free
domain when `r` is a live-warm provider (`r in warm.WARM_DOMAINS`) → e.g.
`warm-canon-<r>.ci.commoninternet.net`; else keep `warm-<r>` (zero blast radius on the 15 others).
Set keycloak `WARM_CANONICAL=True`. Verify keycloak promotes at warm-canon-keycloak WITHOUT
disrupting live warm-keycloak (200 throughout).
- [ ] **mumble** (harness, cc-ci branch) — stabilize the handshake under load: add a READY_PROBE/
readiness gate (TCP 64738 stably listening + a successful handshake) before the custom tier
and/or raise `retry_handshake` budget; verify green under a concurrent-load re-run.
- [ ] **discourse** (TRICKIEST — decide in M2) — the overlay `test_upgrade.py` asserts a
bitnamilegacy→official migration absent from all releases/main. Options: (a) cc-ci test PR
(--with-tests) scoping the faithfulness assertion to ONLY fire when the head actually performs
the migration (image still bitnamilegacy → N/A, not RED) — NOT a weakening, a correct scope; +
file an upstream recipe issue/PR for the real bitnamilegacy→official migration. (b) recipe PR
doing the migration (major rewrite — official discourse image is launcher-based, likely
infeasible cleanly). Lean (a)+tracked-upstream; may need operator input (DEFERRED?) — assess in M2.
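
A minimal sketch of the keycloak `canonical_domain` change described above, for discussion only. Assumptions not confirmed by the harness source: `WARM_DOMAINS` is modelled here as a dict keyed by recipe name (only `r in warm.WARM_DOMAINS` is known), the live warm-keycloak domain value is a placeholder, and the `warm-<r>` fallback is assumed to live under the same `ci.commoninternet.net` base.

```python
# Hypothetical stand-in for warm.WARM_DOMAINS (live-warm providers).
# The real structure and values in cc-ci may differ; only membership matters here.
WARM_DOMAINS = {"keycloak": "warm-keycloak.ci.commoninternet.net"}

# Assumed base domain, taken from the example domain in the M2 item.
CI_BASE = "ci.commoninternet.net"


def canonical_domain(recipe: str) -> str:
    """Return the warm-canonical domain for a recipe.

    Live-warm providers get a separate `warm-canon-<r>` name so that a
    canonical promote never reuses the live warm domain. Every other recipe
    keeps `warm-<r>`, so its behaviour is unchanged.
    """
    if recipe in WARM_DOMAINS:
        return f"warm-canon-{recipe}.{CI_BASE}"
    return f"warm-{recipe}.{CI_BASE}"


if __name__ == "__main__":
    for name in ("keycloak", "gitea"):
        print(name, canonical_domain(name))
```

The membership test keeps the change scoped to recipes that are live-warm providers, which is what limits the blast radius to keycloak.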
## Adversary findings
(Adversary-owned — do not edit.)
### [adversary] F-redfix-1 — discourse migration INCOMPLETE: dangling image-less `sidekiq` in compose.smtpauth.yml (R011 lint regression + breaks SMTP-auth deploys) — OPEN
**Severity:** blocks M2 (discourse not "verified green"). Fix-introduced regression on a recipe PR meant to be merged.
**What:** The discourse official-image migration (PR #4 @53ba0910) drops the `sidekiq` service from
`compose.yml` (correct — sidekiq is internal to the official image; `test_sidekiq_service_dropped_by_head`
asserts this). BUT it leaves a `sidekiq:` service block in **`compose.smtpauth.yml`** (smtp env +
`smtp_password` secret, **no `image:`**). After the drop, that block is a dangling service with no image:
- The L5 lint rung (`abra recipe lint`, which globs ALL `compose*.yml`) sees the merged
`compose.yml`+`compose.smtpauth.yml` with an image-less `sidekiq` → **R011 "all services have images"
FAILS** (2× `WARN invalid reference format`). Run drops to **level=4 of 5** (the other 5 fixed recipes
all reach level=5).
- Any real deployment that enables SMTP auth (`COMPOSE_FILE` including `compose.smtpauth.yml`) would try to
start a `sidekiq` service with no image → deploy failure.
**Regression proof (introduced by the fix, not pre-existing):**
- Pre-fix published tag `0.8.1+3.5.0`: lint R011 = ✅ — old `compose.yml` had `sidekiq:` WITH
`image: bitnamilegacy/discourse:3.5.0`, so the smtpauth `sidekiq` override merged onto a real image.
- Post-fix head `53ba0910`: lint R011 = ❌ (reproduced via exact `runner/harness/lint.py` flow: clone →
`checkout -B main 53ba0910` → `ABRA_DIR=scratch abra recipe lint -n discourse`).
- `grep -l sidekiq ~/.abra/recipes/discourse/compose*.yml` @head → ONLY `compose.smtpauth.yml`.
**Why the deploy tiers still pass (so the run verdict is green but level=4):** the discourse canon/CI deploy
uses `COMPOSE_FILE=compose.yml:compose.ccci.yml` (per recipe_meta EXTRA_ENV) — it does NOT include
compose.smtpauth.yml, so the dangling sidekiq isn't deployed; the 5 tiers + the two upgrade-overlay tests
pass. The lint rung (globs all compose files) is what surfaces it. Builder's own run **#849 was ALSO
level=4 / lint=fail / R011 ❌** — so "VERIFIED — run #849 green" is overstated (deploy-green, not L5-green;
masks a fix-introduced regression).
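
The lint-vs-deploy difference can be illustrated with a toy model. This is a sketch only, not the `abra recipe lint` implementation: compose files are reduced to plain `services` dicts, merging is a shallow per-service override, and all file contents below are illustrative placeholders except the `sidekiq` image string quoted from the pre-fix tag.

```python
def merge_services(files):
    """Shallow-merge the `services` mappings of several compose files.

    Later files override keys of earlier ones, per service. This is a
    simplification of real compose merge semantics.
    """
    merged = {}
    for services in files:
        for name, body in services.items():
            merged.setdefault(name, {}).update(body)
    return merged


def services_without_image(files):
    """Return the sorted names of merged services that have no `image`."""
    merged = merge_services(files)
    return sorted(name for name, body in merged.items() if not body.get("image"))


# Placeholder contents (assumptions, not the real recipe files).
PRE_FIX_COMPOSE = {
    "app": {"image": "bitnamilegacy/discourse:3.5.0"},
    "sidekiq": {"image": "bitnamilegacy/discourse:3.5.0"},
}
POST_FIX_COMPOSE = {"app": {"image": "example/official-discourse:placeholder"}}
SMTPAUTH = {"sidekiq": {"secrets": ["smtp_password"]}}  # no `image:` key
CCCI = {}  # hypothetical: real compose.ccci.yml contents are not modelled

if __name__ == "__main__":
    # Lint view at head: all compose files considered together.
    print(services_without_image([POST_FIX_COMPOSE, SMTPAUTH, CCCI]))
    # CI deploy view at head: COMPOSE_FILE=compose.yml:compose.ccci.yml.
    print(services_without_image([POST_FIX_COMPOSE, CCCI]))
    # Lint view before the fix: the override merges onto a real image.
    print(services_without_image([PRE_FIX_COMPOSE, SMTPAUTH, CCCI]))
```

Under these assumptions only the first call reports `sidekiq`, which matches the observed pattern: lint fails at head while the CI deploy, which never loads `compose.smtpauth.yml`, stays green.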
**Repro:**
```
cd ~/.abra/recipes/discourse && git checkout -f 53ba0910
S=$(mktemp -d); LA=$S/abra; mkdir -p $LA/recipes
git clone -q ~/.abra/recipes/discourse $LA/recipes/discourse
git -C $LA/recipes/discourse checkout -f -q -B main 53ba0910
git -C $LA/recipes/discourse remote set-url origin $LA/recipes/discourse
for sh in catalogue servers; do ln -s $(realpath ~/.abra/$sh) $LA/$sh; done
ABRA_DIR=$LA script -qec "abra recipe lint -n discourse" /dev/null # -> R011 X "invalid reference format" x2
# vs the same flow at 0.8.1+3.5.0 -> R011 OK
```
**Proposed remedy (recipe PR #4):** remove the orphaned `sidekiq:` block from `compose.smtpauth.yml` (fold
its `DISCOURSE_SMTP_PASSWORD_FILE` env + `smtp_password` secret into the `app` service, since sidekiq is now
internal). Re-run discourse cold -> EXPECT R011 OK, level=5. Only the Adversary closes this, after re-test.