# BACKLOG — phase `redfix`

## Build backlog

### M1 — investigate + isolate + classify (all six)
- [ ] discourse — reproduce cold-deploy timeout/wedge in isolation; root-cause (headroom vs
      convergence bug vs upstream compose defect `sidekiq.depends_on: discourse`); classify.
- [ ] mattermost-lts — `test_restore.py::test_restore_returns_state` in isolation: green→load flake,
      red→diagnose restore (recipe vs test).
- [ ] mumble — `custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence` in
      isolation (canonical already present from today → likely flake; confirm).
- [ ] bluesky-pds — warm-canonical promote routing: why `warm-bluesky-pds…` → 000 over HTTPS while
      container healthy internally + cold-test domain routes. Find cc-ci warm-machinery defect.
- [ ] gitea — `3.5.3→3.6.0` warm advance crash (`app.ini` read-only, JWT save). Recipe vs harness.
- [ ] keycloak — de-enrolled (live-warm OIDC collision). Design collision-free warm domain/namespace.

### M2 — FIX + verify all six (recipe PR or harness improvement)
**Execution gated on M1 PASS** (avoid node contention with Adversary M1 re-runs; classifications must
hold). Concrete fix designs from M1 evidence:

- [ ] **mattermost-lts** (recipe PR, clearest) — add `pg_backup.sh` (immich pattern, no VectorChord
      bits): `backup(){ pg_dump -U mattermost mattermost | gzip > /var/lib/postgresql/data/backup.sql; }`
      and `restore(){ gunzip -c …/backup.sql | psql -U mattermost -d mattermost -f -; }`. compose: add
      `configs: pg_backup → /pg_backup.sh`; postgres labels → `backup.pre-hook: /pg_backup.sh backup`,
      `restore.post-hook: /pg_backup.sh restore`, `backup.volumes.postgres.path: backup.sql` (dump-only,
      drop the whole-PGDATA `backup.path` + the `rm` post-hook). Verify via `!testme` → restore green.
- [ ] **bluesky-pds** (recipe PR) — eliminate the `app`-alias collision on the shared proxy: give the PDS
      service a unique name (e.g. `pds`) OR a unique network alias, and update caddy refs
      (`reverse_proxy`, `on_demand_tls ask http://…/tls-check`), healthcheck, backup labels, ops/test
      `service=` refs. Verify warm promote → 200 on `/xrpc/_health`. (NOTE: does the cc-ci harness
      `ops.py`/tests reference `service="app"` for bluesky? Check, and update if the recipe service is
      renamed. The recipe mirror is PR-only, so cc-ci-side refs are a separate cc-ci change.) Confirm
      exact approach in M2.
- [ ] **gitea** (recipe PR) — make `app.ini` writable on the warm-reattach advance so 3.6.0 can persist
      the JWT secret: render `app.ini` into the WRITABLE `config:/etc/gitea` volume via the existing
      `docker-setup.sh` entrypoint (copy the templated config to a writable path) instead of the
      read-only `app_ini` docker-config mount; OR ensure the persisted JWT secret is accepted without
      rewrite. Verify the 3.5.3→3.6.0 advance promotes. (Ties to LFS PR #1.)
- [ ] **keycloak** (harness, cc-ci branch) — `canonical.canonical_domain(r)`: return a collision-free
      domain when `r` is a live-warm provider (`r in warm.WARM_DOMAINS`) → e.g.
      `warm-canon-<r>.ci.commoninternet.net`; else keep `warm-<r>` (zero blast radius on the 15 others).
      Set keycloak `WARM_CANONICAL=True`. Verify keycloak promotes at warm-canon-keycloak WITHOUT
      disrupting live warm-keycloak (200 throughout).
- [ ] **mumble** (harness, cc-ci branch) — stabilize the handshake under load: add a
      READY_PROBE/readiness gate (TCP 64738 stably listening + a successful handshake) before the
      custom tier and/or raise `retry_handshake` budget; verify green under a concurrent-load re-run.
- [ ] **discourse** (TRICKIEST — decide in M2) — the overlay `test_upgrade.py` asserts a
      bitnamilegacy→official migration absent from all releases/main. Options: (a) cc-ci test PR
      (`--with-tests`) scoping the faithfulness assertion to fire ONLY when the head actually performs
      the migration (image still bitnamilegacy → N/A, not RED): a correct scope, NOT a weakening; +
      file an upstream recipe issue/PR for the real bitnamilegacy→official migration. (b) recipe PR
      doing the migration (major rewrite — official discourse image is launcher-based, likely
      infeasible cleanly). Lean (a)+tracked-upstream; may need operator input (DEFERRED?) — assess in M2.
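
The keycloak item above amounts to a small branch in the domain helper. A minimal sketch follows, assuming `WARM_DOMAINS` supports `in` membership tests and that every canonical domain shares the `ci.commoninternet.net` suffix. The registry contents and the `CI_SUFFIX` name are illustrative stand-ins, not the harness's actual code.

```python
# Hypothetical stand-in for warm.WARM_DOMAINS; only membership matters here.
WARM_DOMAINS = {"keycloak": "warm-keycloak.ci.commoninternet.net"}

# Assumed shared suffix, taken from the example domain in the backlog item.
CI_SUFFIX = "ci.commoninternet.net"


def canonical_domain(recipe: str, warm_domains=WARM_DOMAINS) -> str:
    """Return the warm-canonical domain for a recipe.

    Live-warm providers get a separate `warm-canon-` domain so that promoting
    the canonical never touches the live `warm-<recipe>` deployment. Every
    other recipe keeps the existing `warm-<recipe>` naming.
    """
    if recipe in warm_domains:
        return f"warm-canon-{recipe}.{CI_SUFFIX}"
    return f"warm-{recipe}.{CI_SUFFIX}"


print(canonical_domain("keycloak"))  # warm-canon-keycloak.ci.commoninternet.net
print(canonical_domain("gitea"))     # warm-gitea.ci.commoninternet.net
```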
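
For the mumble item, the TCP half of the readiness gate could look roughly like the sketch below. It only checks that the port accepts connections several times in a row; the protocol handshake check is not shown. The function name, parameters, and defaults are hypothetical rather than taken from the harness.

```python
import socket
import time


def wait_port_stable(host, port, needed=3, interval=0.5, timeout=60.0, connect_timeout=2.0):
    """Return True once `needed` consecutive TCP connects succeed, False on timeout.

    A single successful connect can happen while the server is still starting
    up, so the streak is reset to zero on any failure.
    """
    deadline = time.monotonic() + timeout
    streak = 0
    while time.monotonic() < deadline:
        try:
            # Open and immediately close a connection; we only care that it is accepted.
            with socket.create_connection((host, port), timeout=connect_timeout):
                streak += 1
        except OSError:
            streak = 0
        if streak >= needed:
            return True
        time.sleep(interval)
    return False


# Example call (not executed here): gate the custom tier on the Mumble port.
# ready = wait_port_stable("127.0.0.1", 64738)
```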
## Adversary findings

(Adversary-owned — do not edit.)

### [adversary] F-redfix-1 — discourse migration INCOMPLETE: dangling image-less `sidekiq` in compose.smtpauth.yml (R011 lint regression + breaks SMTP-auth deploys) — OPEN

**Severity:** blocks M2 (discourse not "verified green"). Fix-introduced regression on a recipe PR meant to be merged.

**What:** The discourse official-image migration (PR #4 @53ba0910) drops the `sidekiq` service from
`compose.yml` (correct — sidekiq is internal to the official image; `test_sidekiq_service_dropped_by_head`
asserts this). BUT it leaves a `sidekiq:` service block in **`compose.smtpauth.yml`** (smtp env +
`smtp_password` secret, **no `image:`**). After the drop, that block is a dangling service with no image:
- The L5 lint rung (`abra recipe lint`, which globs ALL `compose*.yml`) sees the merged
  `compose.yml`+`compose.smtpauth.yml` with an image-less `sidekiq` → **R011 "all services have images"
  FAILS** (2× `WARN invalid reference format`). Run drops to **level=4 of 5** (the other 5 fixed recipes
  all reach level=5).
- Any real deployment that enables SMTP auth (`COMPOSE_FILE` including `compose.smtpauth.yml`) would try to
  start a `sidekiq` service with no image → deploy failure.

**Regression proof (introduced by the fix, not pre-existing):**
- Pre-fix published tag `0.8.1+3.5.0`: lint R011 = ✅ — old `compose.yml` had `sidekiq:` WITH
  `image: bitnamilegacy/discourse:3.5.0`, so the smtpauth `sidekiq` override merged onto a real image.
- Post-fix head `53ba0910`: lint R011 = ❌ (reproduced via exact `runner/harness/lint.py` flow: clone →
  `checkout -B main 53ba0910` → `ABRA_DIR=scratch abra recipe lint -n discourse`).
- `grep -l sidekiq ~/.abra/recipes/discourse/compose*.yml` @head → ONLY `compose.smtpauth.yml`.

**Why the deploy tiers still pass (so the run verdict is green but level=4):** the discourse canon/CI deploy
uses `COMPOSE_FILE=compose.yml:compose.ccci.yml` (per recipe_meta EXTRA_ENV) — it does NOT include
compose.smtpauth.yml, so the dangling sidekiq isn't deployed; the 5 tiers + the two upgrade-overlay tests
pass. The lint rung (globs all compose files) is what surfaces it. Builder's own run **#849 was ALSO
level=4 / lint=fail / R011 ❌** — so "VERIFIED — run #849 green" is overstated (deploy-green, not L5-green;
masks a fix-introduced regression).

**Repro:**
```
# Pin the local recipe checkout to the post-fix head.
cd ~/.abra/recipes/discourse && git checkout -f 53ba0910
# Build a scratch ABRA_DIR holding a clone pinned to the same commit.
S=$(mktemp -d); LA="$S/abra"; mkdir -p "$LA/recipes"
git clone -q ~/.abra/recipes/discourse "$LA/recipes/discourse"
git -C "$LA/recipes/discourse" checkout -f -q -B main 53ba0910
git -C "$LA/recipes/discourse" remote set-url origin "$LA/recipes/discourse"
# Reuse the real catalogue and servers directories via symlinks.
for sh in catalogue servers; do ln -s "$(realpath ~/.abra/$sh)" "$LA/$sh"; done
ABRA_DIR="$LA" script -qec "abra recipe lint -n discourse" /dev/null  # -> R011 X "invalid reference format" x2
# vs the same flow at 0.8.1+3.5.0 -> R011 OK
```
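
The gap between the lint rung and the deploy tiers described in this finding can be modelled with plain dictionaries. The sketch below is a simplified illustration, not abra's lint implementation: it merges service maps shallowly (real compose merging is richer), and the file contents, the image name, and the helper names are hypothetical stand-ins.

```python
def merge_services(docs):
    """Shallow-merge the `services` maps of parsed compose documents, later ones winning."""
    merged = {}
    for doc in docs:
        for name, spec in doc.get("services", {}).items():
            merged.setdefault(name, {}).update(spec)
    return merged


def imageless_services(docs):
    """Return the sorted names of merged services that have no `image` key."""
    return sorted(name for name, spec in merge_services(docs).items() if not spec.get("image"))


# Hypothetical, heavily trimmed stand-ins for the three compose files at head.
compose_yml = {"services": {"app": {"image": "example/discourse:placeholder"}}}
compose_ccci_yml = {"services": {"app": {"environment": ["EXAMPLE=1"]}}}
compose_smtpauth_yml = {"services": {"sidekiq": {"secrets": ["smtp_password"]}}}

# The lint rung globs every compose*.yml, so it sees the orphaned override.
lint_view = [compose_yml, compose_ccci_yml, compose_smtpauth_yml]
# The CI deploy only loads the files named in COMPOSE_FILE.
deploy_view = [compose_yml, compose_ccci_yml]

print(imageless_services(lint_view))    # ['sidekiq']
print(imageless_services(deploy_view))  # []
```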
**Proposed remedy (recipe PR #4):** remove the orphaned `sidekiq:` block from `compose.smtpauth.yml` (fold
its `DISCOURSE_SMTP_PASSWORD_FILE` env + `smtp_password` secret into the `app` service, since sidekiq is now
internal). Re-run discourse cold -> EXPECT R011 OK, level=5. Only the Adversary closes this, after re-test.