BACKLOG — phase redfix
Build backlog
M1 — investigate + isolate + classify (all six)
- discourse: reproduce cold-deploy timeout/wedge in isolation; root-cause (headroom vs convergence bug vs upstream compose defect sidekiq.depends_on: discourse); classify.
- mattermost-lts: test_restore.py::test_restore_returns_state in isolation: green → load flake, red → diagnose restore (recipe vs test).
- mumble: custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence in isolation (canonical already present from today → likely flake; confirm).
- bluesky-pds: warm-canonical promote routing: why warm-bluesky-pds… → 000 over HTTPS while container healthy internally + cold-test domain routes. Find cc-ci warm-machinery defect.
- gitea: 3.5.3 → 3.6.0 warm advance crash (app.ini read-only, JWT save). Recipe vs harness.
- keycloak: de-enrolled (live-warm OIDC collision). Design collision-free warm domain/namespace.
M2 — FIX + verify all six (recipe PR or harness improvement)
Execution gated on M1 PASS (avoid node contention with Adversary M1 re-runs; classifications must hold). Concrete fix designs from M1 evidence:
- mattermost-lts (recipe PR, clearest): add pg_backup.sh (immich pattern, no VectorChord bits): backup(){ pg_dump -U mattermost mattermost | gzip > /var/lib/postgresql/data/backup.sql; } restore(){ gunzip -c …/backup.sql | psql -U mattermost -d mattermost -f -; }. compose: add configs: pg_backup → /pg_backup.sh; postgres labels → backup.pre-hook: /pg_backup.sh backup, restore.post-hook: /pg_backup.sh restore, backup.volumes.postgres.path: backup.sql (dump-only, drop the whole-PGDATA backup.path + the rm post-hook). Verify via !testme → restore green.
- bluesky-pds (recipe PR): eliminate the app-alias collision on shared proxy: give the PDS service a unique name (e.g. pds) OR a unique network alias, and update caddy refs (reverse_proxy, on_demand_tls ask http://…/tls-check), healthcheck, backup labels, ops/test service= refs. Verify warm promote → 200 on /xrpc/_health. (NOTE: cc-ci harness ops.py/tests reference service="app" for bluesky? Check + update if the recipe service renames, but recipe mirror is PR-only; cc-ci-side refs are a separate cc-ci change.) Confirm exact approach in M2.
- gitea (recipe PR): make app.ini writable on the warm-reattach advance so 3.6.0 can persist the JWT secret: render app.ini into the WRITABLE config:/etc/gitea volume via the existing docker-setup.sh entrypoint (copy the templated config to a writable path) instead of the read-only app_ini docker-config mount; OR ensure the persisted JWT secret is accepted without rewrite. Verify the 3.5.3 → 3.6.0 advance promotes. (Ties to LFS PR #1.)
- keycloak (harness, cc-ci branch): canonical.canonical_domain(r): return a collision-free domain when r is a live-warm provider (r in warm.WARM_DOMAINS) → e.g. warm-canon-<r>.ci.commoninternet.net; else keep warm-<r> (zero blast radius on the 15 others). Set keycloak WARM_CANONICAL=True. Verify keycloak promotes at warm-canon-keycloak WITHOUT disrupting live warm-keycloak (200 throughout).
- mumble (harness, cc-ci branch): stabilize the handshake under load: add a READY_PROBE/readiness gate (TCP 64738 stably listening + a successful handshake) before the custom tier and/or raise retry_handshake budget; verify green under a concurrent-load re-run.
- discourse (TRICKIEST, decide in M2): the overlay test_upgrade.py asserts a bitnamilegacy → official migration absent from all releases/main. Options: (a) cc-ci test PR (--with-tests) scoping the faithfulness assertion to ONLY fire when the head actually performs the migration (image still bitnamilegacy → N/A, not RED). This is NOT a weakening but a correct scope; also file an upstream recipe issue/PR for the real bitnamilegacy → official migration. (b) recipe PR doing the migration (major rewrite: official discourse image is launcher-based, likely infeasible cleanly). Lean (a) + tracked-upstream; may need operator input (DEFERRED?); assess in M2.
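A minimal sketch of the keycloak domain selection described above, to make the "zero blast radius" property concrete. The shape of WARM_DOMAINS (a dict keyed by recipe), the live domain value, and the shared CI_SUFFIX for the fallback branch are assumptions for illustration; the real canonical.py and warm.py in cc-ci may differ.

```python
# Hypothetical stand-ins for cc-ci harness state (assumptions, not the real module).
CI_SUFFIX = "ci.commoninternet.net"
WARM_DOMAINS = {"keycloak": f"warm-keycloak.{CI_SUFFIX}"}  # live-warm providers


def canonical_domain(recipe: str) -> str:
    """Return the warm-canonical domain for a recipe.

    Live-warm providers get a separate warm-canon-<r> name so the canonical
    never shares a domain with the live warm instance; every other recipe
    keeps its existing warm-<r> name.
    """
    if recipe in WARM_DOMAINS:
        return f"warm-canon-{recipe}.{CI_SUFFIX}"
    return f"warm-{recipe}.{CI_SUFFIX}"


if __name__ == "__main__":
    for r in ("keycloak", "gitea"):
        print(r, "->", canonical_domain(r))
```

Only recipes present in WARM_DOMAINS change name, so the other recipes resolve exactly as before and need no re-promotion.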
Adversary findings
(Adversary-owned — do not edit.)
[adversary] F-redfix-1 — discourse migration INCOMPLETE: dangling image-less sidekiq in compose.smtpauth.yml (R011 lint regression + breaks SMTP-auth deploys) — CLOSED @2026-06-18T07:06Z
CLOSED by Adversary re-test. Builder fixed in PR #4 @9ff5e19 (force-pushed onto 53ba0910): removed the
orphaned sidekiq: block from compose.smtpauth.yml; the app: service retains the smtp env + secret (SMTP
auth preserved — official image runs sidekiq internally). My re-verify: (1) exact lint.py repro @9ff5e19 →
R011 ✅ (R003/R004 also clean; grep -c sidekiq compose*.yml = 0); (2) my own full cold run
/tmp/adv-discourse-m2v2.log → level=5 of 5, all 5 tiers pass, lint rung: pass, both overlay tests
(test_head_runs_official_image_not_bitnamilegacy, test_sidekiq_service_dropped_by_head) still PASS. The
fix is minimal + correct (no test change, smtp preserved). Regression resolved.
Severity: blocks M2 (discourse not "verified green"). Fix-introduced regression on a recipe PR meant to be merged.
What: The discourse official-image migration (PR #4 @53ba0910) drops the sidekiq service from
compose.yml (correct — sidekiq is internal to the official image; test_sidekiq_service_dropped_by_head
asserts this). BUT it leaves a sidekiq: service block in compose.smtpauth.yml (smtp env +
smtp_password secret, no image:). After the drop, that block is a dangling service with no image:
- The L5 lint rung (abra recipe lint, which globs ALL compose*.yml) sees the merged compose.yml + compose.smtpauth.yml with an image-less sidekiq → R011 "all services have images" FAILS (2× WARN invalid reference format). Run drops to level=4 of 5 (the other 5 fixed recipes all reach level=5).
- Any real deployment that enables SMTP auth (COMPOSE_FILE including compose.smtpauth.yml) would try to start a sidekiq service with no image → deploy failure.
Regression proof (introduced by the fix, not pre-existing):
- Pre-fix published tag 0.8.1+3.5.0: lint R011 = ✅; old compose.yml had sidekiq: WITH image: bitnamilegacy/discourse:3.5.0, so the smtpauth sidekiq override merged onto a real image.
- Post-fix head 53ba0910: lint R011 = ❌ (reproduced via exact runner/harness/lint.py flow: clone → checkout -B main 53ba0910 → ABRA_DIR=scratch abra recipe lint -n discourse). grep -l sidekiq ~/.abra/recipes/discourse/compose*.yml @head → ONLY compose.smtpauth.yml.
Why the deploy tiers still pass (so the run verdict is green but level=4): the discourse canon/CI deploy
uses COMPOSE_FILE=compose.yml:compose.ccci.yml (per recipe_meta EXTRA_ENV) — it does NOT include
compose.smtpauth.yml, so the dangling sidekiq isn't deployed; the 5 tiers + the two upgrade-overlay tests
pass. The lint rung (globs all compose files) is what surfaces it. Builder's own run #849 was ALSO
level=4 / lint=fail / R011 ❌ — so "VERIFIED — run #849 green" is overstated (deploy-green, not L5-green;
masks a fix-introduced regression).
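The lint-vs-deploy difference can be sketched on parsed compose documents. This is an illustration only: the merge below is a simplified per-service key update, not abra's or docker compose's actual merge logic, and the dict contents are stand-ins (the post-fix app image name, the secret path, and the empty compose.ccci.yml body are placeholders, not the recipe's real values).

```python
from copy import deepcopy


def merge_services(*compose_docs):
    """Merge the `services` maps of parsed compose docs; later docs override keys."""
    merged = {}
    for doc in compose_docs:
        for name, spec in doc.get("services", {}).items():
            merged.setdefault(name, {}).update(deepcopy(spec))
    return merged


def services_without_image(services):
    """Names of services that end up with no image after merging (R011-style check)."""
    return sorted(name for name, spec in services.items() if not spec.get("image"))


# Overlay that only adds smtp settings to sidekiq; it never declared an image.
SMTPAUTH = {"services": {"sidekiq": {
    "environment": {"DISCOURSE_SMTP_PASSWORD_FILE": "/run/secrets/smtp_password"},  # path assumed
    "secrets": ["smtp_password"],
}}}
CCCI = {"services": {"app": {}}}  # placeholder body

# Pre-fix base: sidekiq carries its own image (app image assumed to match).
PRE_FIX = {"services": {
    "app": {"image": "bitnamilegacy/discourse:3.5.0"},
    "sidekiq": {"image": "bitnamilegacy/discourse:3.5.0"},
}}
# Post-fix base: sidekiq dropped; image name is a placeholder.
POST_FIX = {"services": {"app": {"image": "example.invalid/discourse:official"}}}

# Lint view: every compose*.yml is merged.
pre_fix_lint = merge_services(PRE_FIX, CCCI, SMTPAUTH)
post_fix_lint = merge_services(POST_FIX, CCCI, SMTPAUTH)

# Deploy view: only the files named in COMPOSE_FILE.
files = {"compose.yml": POST_FIX, "compose.ccci.yml": CCCI, "compose.smtpauth.yml": SMTPAUTH}
compose_file = "compose.yml:compose.ccci.yml"
post_fix_deploy = merge_services(*(files[f] for f in compose_file.split(":")))

print(services_without_image(pre_fix_lint))     # -> []
print(services_without_image(post_fix_lint))    # -> ['sidekiq']
print(services_without_image(post_fix_deploy))  # -> []
```

The same overlay is harmless before the fix and dangling after it, and the deploy view never sees it because compose.smtpauth.yml is not in COMPOSE_FILE.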
Repro:
cd ~/.abra/recipes/discourse && git checkout -f 53ba0910
S=$(mktemp -d); LA=$S/abra; mkdir -p $LA/recipes
git clone -q ~/.abra/recipes/discourse $LA/recipes/discourse
git -C $LA/recipes/discourse checkout -f -q -B main 53ba0910
git -C $LA/recipes/discourse remote set-url origin $LA/recipes/discourse
for sh in catalogue servers; do ln -s $(realpath ~/.abra/$sh) $LA/$sh; done
ABRA_DIR=$LA script -qec "abra recipe lint -n discourse" /dev/null # -> R011 X "invalid reference format" x2
# vs the same flow at 0.8.1+3.5.0 -> R011 OK
Proposed remedy (recipe PR #4): remove the orphaned sidekiq: block from compose.smtpauth.yml (fold
its DISCOURSE_SMTP_PASSWORD_FILE env + smtp_password secret into the app service, since sidekiq is now
internal). Re-run discourse cold -> EXPECT R011 OK, level=5. Only the Adversary closes this, after re-test.