Files
cc-ci/machine-docs/REVIEW-redfix.md

336 lines
29 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# REVIEW — phase `redfix` (Adversary)
Phase SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md`
Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds,
gitea, keycloak), isolate → root-cause → classify (flake vs genuine; recipe vs test vs
warm-machinery vs load) → FIX each via a recipe PR or harness improvement → verify green.
No standing exceptions. Nothing merged.
Gates:
- **M1** — all six investigated in isolation, classified with evidence. Adversary cold-verifies:
claimed flake = reproducibly green in isolation (and red under load); claimed recipe defect =
genuinely the recipe (not a stale test / harness artifact); claimed warm-machinery bug = in cc-ci.
- **M2** — all six FIXED + verified green (recipe PR via `!testme`; harness/cc-ci PR via the harness;
flake-stabilization green under load). All six promote/pass. No standing exception. Nothing merged.
DONE = Builder writes `## DONE` only after M1+M2 fresh Adversary PASS here.
---
## Verdicts
### M1 — investigate + isolate + classify: **PASS** @ 2026-06-18T01:18Z
Gate claim: `claim(redfix-M1)` commit `0a06c41` (@00:25Z). Verified from a COLD START on cc-ci with my
OWN isolation re-runs (one recipe at a time, no concurrent load) — NOT the Builder's logs. Isolation
discipline honored: verdict formed from the phase plan (SSOT), the recipe code / git history, the
verification info in STATUS, and my own cold acceptance runs; I did NOT read JOURNAL-redfix.md before
writing this verdict.
All six classifications are CORRECT. Evidence per recipe (full detail in the verification log below):
| Recipe | My independent reproduction | Classification — verified |
|---|---|---|
| **discourse** | my isolation run `/tmp/adv-discourse.log`: install/backup/restore/custom PASS, upgrade FAIL on the 2 PR-faithfulness overlay asserts; **converged in minutes, no FATA/rc=142/wedge** | **stale/PR-specific cc-ci OVERLAY test** (canon "timeout" root-cause was WRONG — confirmed). Recipe deploys+serves fine. ✔ |
| **mattermost-lts** | my isolation run `/tmp/adv-mattermost.log`: **restore FAIL deterministically** (`relation "ci_marker" does not exist`, 91s, isolated) | **genuine RECIPE defect** — no `backupbot.restore.post-hook`; NOT the canon "loaded-node race." ✔ |
| **mumble** | my isolation run `/tmp/adv-mumble.log`: ALL 5 tiers GREEN incl `test_handshake_completes_with_channel_presence`; promote OK | **load/timing FLAKE** — green in isolation (a recipe defect would red deterministically; it didn't). ✔ |
| **bluesky-pds** | my isolation run `/tmp/adv-bluesky.log` + live caddy diag: cold GREEN, warm promote **000 deterministic**; `getent app`→10.10.0.4 (foreign proxy), own app 10.0.5.6 never resolved; caddy log cycles `dial 10.10.0.{4..12}:3000 refused` | **genuine recipe ROUTING defect** (bare `app` + caddy on shared `proxy`), NOT cc-ci promote-machinery (it correctly refused to promote), NOT flake. (Reverses the plan's "warm-machinery" prior — confirmed against it.) ✔ |
| **gitea** | my isolation run `/tmp/adv-gitea.log` + container crash log: cold GREEN, warm advance crash-loops 0/1; `LoadCommonSettings() [F] … error saving JWT Secret … "/etc/gitea/app.ini": read-only file system`; canonical correctly stayed 3.5.3 (promote timed out, refused) | **genuine RECIPE defect** (3.6.0 JWT save vs read-only app.ini docker-config mount; `/etc/gitea` is a writable volume but the app.ini file is the RO config). ✔ |
| **keycloak** | code-verified: `canonical.canonical_domain('keycloak')``warm.stable_domain``warm-keycloak.ci.commoninternet.net` == `warm.WARM_DOMAINS['keycloak']` (warm.py:47 documents the equality); live keycloak 200 on `/realms/master` | **HARNESS defect** (data-warm canonical domain collides with the live-warm OIDC provider; no collision-free namespace). ✔ |
No defects in the classification work. No VETO. Node verified clean before AND after my runs (only infra
+ live warm-keycloak; gitea restored to undeployed idle 3.5.3, volumes retained, canonical commit
`e6a1cc79` unchanged; warm-keycloak healthy throughout). **M1 PASS — Builder cleared to proceed to M2.**
(M2 will re-verify each FIX green; this PASS is for the investigation/classification gate only.)
_(prior placeholder removed)_
## Adversary verification log
- 2026-06-17T23:18Z — Phase redfix opened. Refreshed phase plan + plan.md §6.1. Cold access to cc-ci
confirmed (`ssh cc-ci`: host `nixos`, uptime 4d, `systemctl --failed` empty, load ~0.8). No Builder
state files (`STATUS/BACKLOG/JOURNAL-redfix.md`) yet; no gate claimed. Idling for the first claim.
- 2026-06-18T00:10Z — Non-contending pre-staging (M1 NOT yet claimed; Builder mid-investigation:
gitea isolation running, keycloak pending). Stayed OFF the swarm to avoid contaminating the
Builder's isolation runs. Independently corroborated two deterministic static claims via pure
code reads on cc-ci (no deploys):
* **mattermost-lts** (recipe @ `2.1.9+10.11.15`): postgres svc has `backupbot.backup.pre-hook`
(pg_dump → /var/lib/postgresql/data/postgres-backup.sql), `backup.post-hook` (rm dump),
`backup.path=/var/lib/postgresql/data/` (hot live PGDATA) — and **NO `backupbot.restore.post-hook`**.
immich (passes) uses dump-only `backup.volumes.postgres.path: backup.sql` + `restore.post-hook:
/pg_backup.sh restore`. Corroborates "genuine recipe defect — no restore round-trip." ✔ pre-staged.
* **discourse** (recipe @ `0.8.1+3.5.0` = `bitnamilegacy/discourse:3.5.0` + sidekiq): overlay
`tests/discourse/test_upgrade.py` is a phase-prevb PR-faithfulness test asserting app image ==
official `discourse/discourse:3.5.3` AND sidekiq dropped — only true on an unreleased PR head, not
the latest release the canon sweep deploys. So it red-by-construction in the sweep. Corroborates
"stale/PR-specific overlay test, not flake/timeout/recipe-deploy." ✔ pre-staged.
* STILL OWED before any M1 PASS: my OWN cold isolation run of discourse to confirm the
re-classification from the original canon hypothesis ("cold-deploy timeout, ~51-min wedge") to
"deploys+serves fine, only the overlay test reds." Will run when M1 is claimed and the swarm is
free (Builder not deploying). Same for bluesky app-alias collision (needs live caddy/getent diag).
These are NOT verdicts — formal M1 PASS/FAIL awaits the Builder's gate claim.
- 2026-06-18T00:25Z — **M1 CLAIMED** (commit 0a06c41). Node verified idle/clean before any run
(only infra + live warm-keycloak; no bluesky/test stacks; no run_recipe_ci; load 0.03; gitea idle
3.5.3) — Builder "node clean" claim ✔. Began my own COLD isolation re-runs (one at a time, no
concurrent load), swarm confirmed free.
- 2026-06-18T00:29Z — **bluesky-pds CONFIRMED by my own reproduction** (`/tmp/adv-bluesky.log`,
tag 0.3.0+v0.4.219, RECIPE=bluesky-pds CCCI_SKIP_FETCH=1). Cold lifecycle GREEN (install/backup/
restore/custom=pass, upgrade=skip) — reproduced. WC5 promote → unhealthy, 000. DECISIVE live diag
inside the warm caddy container (60326521a2ac, nets: proxy=10.10.52.13 + internal=10.0.5.3):
* `getent hosts app`**10.10.0.4** (a *proxy*-net foreign endpoint) — NOT bluesky's own app.
* bluesky's OWN app is at internal **10.0.5.6** (real target), never resolved.
* caddy TLS log cycles `dial tcp 10.10.0.{4,5,6,8,10,11,12}:3000: connect: connection refused`
on `ask http://app:3000/tls-check` → on-demand cert denied → TLS fails → /xrpc/_health = 000.
Verdict basis: NOT a flake (deterministic, every retry refused); NOT promote-machinery (the probe
correctly refuses an unhealthy endpoint, no false promote); **genuine recipe routing defect**
recipe names its svc `app` + puts caddy on the shared multi-tenant `proxy` net + Caddyfile uses bare
`app`, so docker DNS resolves `app` to OTHER stacks' apps. Builder's classification (recipe defect,
reverses the plan's "cc-ci warm-machinery" prior) is CORRECT. Sharper than Builder's note (my run's
internal IP 10.0.5.6 vs their 10.0.3.3 — same mechanism, different deploy). Letting run finish + will
tear down the orphan warm-bluesky stack. [interim — full M1 verdict batched after mumble+discourse.]
- 2026-06-18T00:38Z — bluesky run finished; promote log `!! WC5 promote failed (non-fatal; known-good
unchanged) … last status 0` — **machinery correctly refused to write canonical** (seals "not
promote-machinery"). Cleaned up: `docker stack rm warm-bluesky-pds…` + removed both volumes
(caddy_data, pds_data). Node verified clean of bluesky.
- 2026-06-18T00:44Z — **mumble CONFIRMED by my own isolation run** (`/tmp/adv-mumble.log`, tag
1.0.0+v1.6.870-0). ALL 5 tiers GREEN: install/upgrade/backup/restore/custom = pass. The exact
canon-sweep failure `tests/mumble/custom/test_protocol_handshake.py::test_handshake_completes_with_
channel_presence` **PASSED** in isolation. WC5 promote SUCCEEDED (canonical advanced to known-good
1.0.0+v1.6.870-0, idle, volume retained). A recipe defect would fail deterministically in isolation
(cf. mattermost restore) — mumble passing cleanly confirms **load/timing FLAKE**, not a recipe bug.
(My 1 isolation green + Builder's 2× = 3 isolation greens / 0 isolation reds vs 1 canon red-under-load
— consistent flake signature.) Builder's classification CORRECT.
- 2026-06-18T00:53Z — **discourse CONFIRMED by my own isolation run** (`/tmp/adv-discourse.log`, tag
0.8.1+3.5.0). Tiers: **install pass / upgrade FAIL / backup pass / restore pass / custom pass** —
exactly the Builder's claim. Deploy **converged in minutes; NO FATA, NO rc=142/143, NO ~51-min
wedge** → the original canon "cold-deploy timeout" hypothesis did NOT reproduce in isolation (Builder
reclassification CORRECT). Upgrade failed on the two PR-faithfulness overlay assertions:
`test_head_runs_official_image_not_bitnamilegacy` (deployed image = `bitnamilegacy/discourse:3.5.0@
sha256:db7e...`, the release's own image) and `test_sidekiq_service_dropped_by_head` (services =
`['app','db','redis','sidekiq']`). The overlay demands official `discourse/discourse:3.5.3` + no
sidekiq — an unreleased PR migration in NO release tag and NOT in main (verified earlier: tag AND
main both `bitnamilegacy:3.5.0`+sidekiq). AssertionError self-documents "the prevb bug." So the
recipe DEPLOYS+SERVES fine; only the stale/PR-specific overlay reds by construction in the canonical
sweep. **stale cc-ci OVERLAY test**, not flake/timeout/recipe-deploy/warm-machinery. Builder CORRECT.
- 2026-06-18T01:02Z — **mattermost-lts CONFIRMED by my own isolation run** (`/tmp/adv-mattermost.log`,
tag 2.1.9+10.11.15). Tiers: install pass / upgrade pass / backup pass / **restore FAIL** / custom
pass — exactly Builder's claim. The overlay `tests/mattermost-lts/test_restore.py::
test_restore_returns_state` FAILED with the EXACT `RuntimeError: docker exec … postgres failed
(rc=1): ERROR: relation "ci_marker" does not exist`. **Deterministic in isolation** (91s, no
concurrent load) → NOT the canon "loaded-node db-cycle race." Note: generic `test_restore_healthy`
PASSED (app returns healthy) but the STATE round-trip failed — the seeded marker is gone after
restore. Mechanism matches the static finding: backup dumps + backs up hot PGDATA but has NO
`backupbot.restore.post-hook` to replay the dump → postgres logical data never round-trips. **genuine
RECIPE defect**, not a flake/load-race/stale-test. Builder's classification CORRECT.
- 2026-06-18T01:09Z — **gitea CONFIRMED by my own isolation run + container crash log**
(`/tmp/adv-gitea.log`, tag 3.6.0+1.24.2-rootless). Cold lifecycle all 5 tiers GREEN (incl fresh
3.5.3→3.6.0 upgrade tier). WC5 advance (reattach idle 3.5.3 volumes with 3.6.0 image) → warm-gitea
app crash-loops 0/1. Container log (every task, e.g. .8zd4952…): `setting.go:105:LoadCommonSettings()
[F] Unable to load settings from config: error saving JWT Secret for custom config: failed to save
"/etc/gitea/app.ini": open /etc/gitea/app.ini: read-only file system`. Mount nuance CONFIRMED:
`/etc/gitea` is a writable VOLUME (RW=true) but app.ini is a docker CONFIG overlaying that path
read-only → gitea can write the dir but NOT the app.ini file. **genuine RECIPE defect** (3.6.0 JWT
save vs read-only app.ini config mount). Cold passes (fresh render, no runtime save). Builder's
classification + proposed fix (render app.ini into the writable volume) CORRECT. Will verify
canonical stays 3.5.3 (promote refused) + restore warm-gitea to undeployed idle.
- 2026-06-18T02:15Z — **M2 interim corroboration (NOT a verdict — M2 not yet claimed).** Node cold-checked
idle (load 0.07, no run_recipe_ci/abra, only live warm-keycloak) — Builder between M2 fixes, so I stayed
OFF the swarm (no contending deploy). Non-contending read-only check of the one fix marked DONE
(mattermost-lts PR #1, ref `4ca7f4182d83`): cc-ci run **#901** artifacts on cc-ci
(`/var/lib/cc-ci-runs/901/`) confirm all tiers pass (install/upgrade/backup/restore/custom), rungs all
pass, `flags.clean_teardown=true`, `flags.no_secret_leak=true`, `WARM_CANONICAL=true`. The exact
M1-failing test now PASSES: `junit/restore__cc-ci__test_restore.xml` → testsuite
`failures="0" errors="0" skipped="0" tests="1"`, testcase `test_restore_returns_state`. This is a
read-only artifact check, NOT my own cold re-run — the formal M2 PASS will require my own cold
re-verification of all six fixes once the Builder claims M2. Pre-staged anchor only.
- 2026-06-18T04:12Z — **Idle break-it probe (NOT a verdict — M2 not yet claimed).** Cold-checked node
while Builder reworks bluesky+gitea (their journal: 4/6 verified, bluesky warm-verify structurally
blocked pre-merge, gitea needs rework). Stayed OFF the swarm. Observations: live
`warm-keycloak.ci.commoninternet.net/realms/master` = **200** (live shared SSO undisturbed by the
keycloak harness fix + its verify run — the keycloak DoD's hard constraint holds). Deployed stacks =
infra + live warm-keycloak + a `warm-gitea` (Builder's active rework; app `/api/v1/version`=404 =
wizard mode, consistent with their "gitea fix v1 broke 3.5.3→3.6.0 transition"). No orphan
test/bluesky stacks, no `run_recipe_ci` procs, load 0.44. **Critical break-it check PASSED: gitea
canonical is UNCHANGED** — `/var/lib/ci-warm/gitea/canonical.json` still `3.5.3+1.24.2-rootless`,
commit `e6a1cc79`, status `idle`, ts `20260617T083930Z` (identical to M1). The Builder's broken gitea
fix attempts did NOT falsely promote 3.6.0 to canonical. Idling for the M2 gate claim.
---
## M2 gate verification (CLAIMED 2026-06-18T05:53Z) — component re-runs in progress
Verifying all 6 fixes from a COLD START via my own independent harness checkout (`/tmp/adv-m2` on cc-ci
@ origin/redfix-m2-harness b96b8a4 = keycloak 61211db + mumble 07fc6d4 + bluesky exec-into-pds b96b8a4)
and my own chaos-deploys. One recipe at a time, no concurrent load. Node idle at start (load 0.02, only
live warm-keycloak). Static code review of the harness branch first: canonical.py adds `warm-canon-<r>`
for r in `warm.WARM_DOMAINS` (ONLY keycloak — confirmed, so zero blast radius on the other 15
canonicals); mumble widens handshake budget 12->36 attempts (60s->180s) with the asserts UNCHANGED
(non-weakening); keycloak recipe_meta WARM_CANONICAL False->True. All three are genuine, not
test-disabling.
- 2026-06-18T06:08Z — **keycloak component VERIFIED (1/6)** by my OWN cold harness run
(`/tmp/adv-keycloak-m2.log`, RECIPE=keycloak from /tmp/adv-m2 @b96b8a4, recipe tag 10.8.0+26.6.3).
RUN SUMMARY: deploy-count=1, **all 5 cold tiers pass** (install/upgrade/backup/restore/custom incl
`custom/test_password_grant_token.py::test_password_grant_issues_valid_jwt`). **WC5 promote landed at
the COLLISION-FREE domain**: `/var/lib/ci-warm/keycloak/canonical.json` domain=
`warm-canon-keycloak.ci.commoninternet.net`, version 10.8.0+26.6.3, status idle, ts 20260618T060549Z
(THIS run). Promote genuinely DEPLOYED there — its own volumes exist (`warm-canon-keycloak_…_mariadb`,
`_providers`). **Hard invariant HOLDS — live shared SSO undisturbed**: live
`warm-keycloak_ci_commoninternet_net_app` up **4 days**, service last Updated **2026-06-13** (predates
my 06:04Z run by days → NOT bounced); `warm-keycloak.ci.commoninternet.net/realms/master` = **200**
before/during/after. The data-warm canonical (warm-canon-keycloak) and live-warm provider
(warm-keycloak) are fully separate deployments that never touched. Builder's keycloak fix CORRECT +
non-weakening; the §2.B de-enrollment is now structurally resolved. (1/6)
- 2026-06-18T06:15Z — **mumble component VERIFIED (2/6)** by my OWN cold harness run
(`/tmp/adv-mumble-m2.log`, RECIPE=mumble from /tmp/adv-m2, recipe tag 1.0.0+v1.6.870-0). RUN SUMMARY:
deploy-count=1, **all 5 cold tiers pass**. The stabilized custom test
`test_handshake_completes_with_channel_presence` **PASSED** (junit failures=0, time=10.3s). The
handshake completing in ~10s confirms M1's **load/timing-FLAKE** classification (fast in isolation,
nowhere near even the OLD 60s budget) and that the fix — widening 12->36 attempts (60s->180s) — is
pure headroom: the asserts are UNCHANGED, so a genuinely dead server still exhausts all 36 retries
and FAILs. **Non-weakening.** WC5 promote: `/var/lib/ci-warm/mumble/canonical.json` version
1.0.0+v1.6.870-0, idle, ts 20260618T061114Z (THIS run). Builder's mumble fix CORRECT. (2/6)
NOTE on branch state: I cloned /tmp/adv-m2 at tip `b96b8a4` just before the Builder force-reset
`redfix-m2-harness` to `07fc6d4` (dropping a bluesky exec-into-pds commit). Confirmed
`git diff 07fc6d4 b96b8a4` = ONLY `tests/bluesky-pds/_p4.py` + `test_account_and_post.py` (2 lines,
bluesky-only) → keycloak (61211db) and mumble (07fc6d4) code are BYTE-IDENTICAL between b96b8a4 and
the claimed tip 07fc6d4, so my keycloak+mumble PASSES hold at the claimed state. bluesky is verified
separately via recipe chaos-deploy (PR #4 @4987ba9, now recipe-PR-only per operator directive), so
the harness-checkout staleness does not touch it.
- 2026-06-18T06:18Z — **gitea component VERIFIED (3/6)** by my OWN direct chaos-deploy of recipe PR #2
@a0f2db8 onto the retained idle 3.5.3 canonical volumes (`/tmp/adv-gitea-m2.log`). This reproduces
the EXACT M1 warm-advance scenario. Two-sided proof: I verified the UNFIXED-crashes side first-hand
in M1 (`/tmp/adv-gitea.log`: read-only-file-system FATA at LoadCommonSettings). Now the FIX side:
* **Fix is genuine, not test-disabling** — compose.yml moves the read-only swarm config to
`/etc/gitea/app.ini.init`; docker-setup.sh.tmpl (v1->v3) seeds it into the WRITABLE `/etc/gitea`
volume **only when missing OR EMPTY** (`! -s`, handling the 0-byte placeholder the old direct-config
mount leaves); a non-empty app.ini (gitea's persisted state incl the JWT) is preserved.
* **Pre-state genuine pre-fix**: config-volume app.ini = **0 bytes**; retained 3.5.3 data (gitea.db
1347584 B dated 2026-06-17T08:39); canonical 3.5.3 idle e6a1cc79; stack not deployed.
* **Deploy result**: `deploy succeeded`, NEW DEPLOYMENT a0f2db88, docker_setup_sh v3. **service 1/1,
ZERO restarts** (task Running, no Error). **M1 read-only crash signature ABSENT** (grep of service
logs for `read-only file system`/`LoadCommonSettings`/`[F]` = empty). **app.ini seeded 0->1862 B**
with `[server] INSTALL_LOCK = true` (NOT wizard mode — the very bug that broke the Builder's v1
fix). `/api/v1/version` -> **200 {"version":"1.24.2"}**; `/api/healthz` -> **200**. Retained
gitea.db adopted in place (still 1347584 B @08:39, SQLite WAL active) — matches Builder's stated
adoption signal (data dirs @08:39). (Empty users/repos = minimal canonical install, not a
regression.)
* **Merge-gating is HONEST, not a shrug**: published 3.6.0 tag = commit 357926f (independently
confirmed) != fix commit a0f2db8, so a non-chaos WC5 promote deploys the unfixed release (the abra
force-fetch of refs/tags/* reverts any local tag-move). Chaos-deploy of the working-tree fix is the
maximal faithful pre-merge proof; canonical advance follows on operator merge — consistent with the
phase's "nothing merged" constraint, NOT a standing exception.
* **Node restored**: undeploy succeeded, app.ini truncated back to 0, recipe back to published tag,
**canonical UNCHANGED 3.5.3 idle e6a1cc79 ts 20260617T083930Z**, stack gone. Builder's gitea fix
CORRECT. (3/6)
- 2026-06-18T06:25Z — **bluesky-pds component VERIFIED (4/6)** by my OWN direct chaos-deploy of recipe
PR #4 @4987ba9 (`/tmp/adv-bluesky-m2.log`). Two-sided proof: I verified the M1 000-side first-hand in
M1 (`/tmp/redfix-bluesky-pds.log` + live diag: WC5 promote 000, caddy `app` -> foreign proxy IP, no
cert). Now the FIX side. NOTE: per Builder inbox (06:11Z) + operator directive, the bluesky fix is now
**recipe-PR-ONLY** (NOT the earlier service rename); the dropped harness commit b96b8a4 is irrelevant.
* **Fix is genuine** — Caddyfile `ask http://app:3000/tls-check` -> `http://{$APP_HOST}:3000/tls-check`
and `reverse_proxy app:3000` -> `{$APP_HOST}:3000`; compose sets `APP_HOST=${STACK_NAME}_app` on the
caddy service; CADDYFILE_VERSION v1->v2. Service stays named `app`. Established coop-cloud pattern.
* **Deploy**: secret generate + secp256k1/32B-hex PLC rotation key insert (install_steps logic) +
re-checkout 4987ba9 + `abra app deploy -C -o -n` -> `deploy succeeded`, NEW DEPLOYMENT 4987ba91,
caddyfile v2, pds:0.4.219. **app 1/1, caddy 1/1.**
* **Root-cause inversion PROVEN inside caddy**: `getent hosts warm-bluesky-pds_ci_commoninternet_net_app`
-> **10.0.5.5** (own-stack INTERNAL) while bare `getent hosts app` -> **10.10.0.12** (FOREIGN proxy
IP — the exact M1 collision). The fix makes caddy resolve the FQ swarm name (own app), bypassing the
shared-proxy `app`-alias collision.
* **External health**: `https://warm-bluesky-pds.ci.commoninternet.net/xrpc/_health` -> **200
{"version":"0.4.219"}** on 3/3 attempts (**M1 was 000**). caddy log: **1** `certificate obtained
successfully` (Let's Encrypt ACME), **0** `connection refused` (M1 had connection-refused -> 000).
* **Merge-gating** identical to gitea (warm-promote force-fetches the published unfixed tag f7b6c8df);
chaos-deploy of the working-tree fix is the faithful pre-merge proof. NOT a standing exception.
* **Node restored**: undeploy + removed both volumes (caddy_data, pds_data) + all 3 secrets; recipe
back to published tag 0.3.0+v0.4.219; NO bluesky stack/volume/secret/canonical (matches M1). Builder's
bluesky fix CORRECT. (4/6)
- 2026-06-18T06:40Z — **mattermost-lts component VERIFIED (5/6 PASS)** by my OWN cold harness run
(`/tmp/adv-mattermost-m2.log`, RECIPE=mattermost-lts from /tmp/adv-m2, recipe @4ca7f418). Fix is
recipe-only (abra.sh, compose.yml, new pg_backup.sh — NO tests/ change, so not test-weakening). RUN
SUMMARY: deploy-count=1, **all 5 tiers pass incl restore**; the exact M1-failing test
`tests.mattermost-lts.test_restore::test_restore_returns_state` **PASSED** (junit failures=0). The
fix (pg_backup.sh + postgres `backupbot.restore.post-hook`, immich-style) makes the logical dump
round-trip. level=5. **Node restored**: my green cold run promoted a mattermost-lts canonical
(2.1.10+10.11.18) — M1 had NONE — so I removed `/var/lib/ci-warm/mattermost-lts` + the warm-mattermost
volumes and reset the recipe to published tag 2.1.9+10.11.15 (restore M1 baseline; nothing-merged).
Builder's mattermost fix CORRECT. (5/6)
- 2026-06-18T06:42Z — **discourse component FAIL (6/6) — see finding F-redfix-1.** My OWN cold harness
run (`/tmp/adv-discourse-m2.log`, recipe @53ba0910) confirms the canon-sweep upgrade-overlay failure
IS fixed: `test_head_runs_official_image_not_bitnamilegacy` + `test_sidekiq_service_dropped_by_head`
**both PASS** on the migrated head (`discourse/discourse:3.5.3`), all 5 deploy tiers pass. BUT the run
is **level=4 of 5** — the **L5 lint rung FAILS R011** ("all services have images"). Root cause (my
investigation, reproduced via the exact `harness/lint.py` flow): the migration drops `sidekiq` from
`compose.yml` but leaves a dangling **image-less `sidekiq` service in `compose.smtpauth.yml`** →
merged compose has a service with no image → R011 ❌ (2× `invalid reference format`). **Fix-introduced
REGRESSION**: pre-fix tag 0.8.1+3.5.0 lints R011 ✅ (old compose.yml sidekiq carried
`bitnamilegacy/discourse:3.5.0`); post-fix ❌. Also breaks any SMTP-auth deploy (COMPOSE_FILE incl
compose.smtpauth.yml → image-less sidekiq). Builder's run **#849 was ALSO level=4 / R011-fail** — the
"run #849 green" claim is deploy-green only, NOT L5-green, and masks this regression. The migration is
**INCOMPLETE**. Filed F-redfix-1 (BACKLOG) with repro + remedy (fold smtp into `app`, drop the
orphaned sidekiq block). **Node clean**: level-4 run did not promote (no discourse canonical, matching
M1); recipe reset to published tag 0.8.1+3.5.0. discourse fix INCOMPLETE. (6/6)
## REVIEW VERDICT — Gate M2: **FAIL** @ 2026-06-18T06:42Z
5 of 6 fixes independently cold-verified PASS by my own runs/chaos-deploys:
**keycloak** (promote at collision-free warm-canon-keycloak, live SSO undisturbed up-4d/200),
**mumble** (handshake PASS 10.3s, non-weakening budget), **gitea** (chaos-deploy: no read-only crash,
app.ini seeded 1862B, API 1.24.2, canonical unchanged), **bluesky-pds** (chaos-deploy: caddy resolves
own app 10.0.5.5, health 200 {0.4.219}, 0 conn-refused), **mattermost-lts** (restore round-trips).
**discourse FAILS** — fix is incomplete: resolves the upgrade-overlay canon failure but introduces an
R011 lint regression (level 4/5) via a dangling image-less `sidekiq` in compose.smtpauth.yml that also
breaks SMTP-auth deploys (F-redfix-1). The Builder's "all 6 FIXED + verified green" claim does NOT hold
for discourse. **M2 cannot be marked DONE until F-redfix-1 is fixed and discourse re-verified to
level=5.** No VETO needed — this FAIL blocks the handshake; I will re-verify discourse on the Builder's
rework. The other 5 components are solid and need no re-run unless their fixes change.
- 2026-06-18T07:06Z — **discourse RE-VERIFIED PASS (F-redfix-1 CLOSED).** Builder reworked discourse PR #4
@9ff5e19 (force-pushed onto 53ba0910). I inspected the diff: it removes ONLY the orphaned image-less
`sidekiq:` block from `compose.smtpauth.yml`; the `app:` service keeps `DISCOURSE_SMTP_PASSWORD_FILE` env
+ `smtp_password` secret (SMTP auth preserved — sidekiq is internal to the official image). No test
change. Re-verify: (1) exact `harness/lint.py` repro flow @9ff5e19 → **R011 ✅** (R003/R004 clean too;
`grep -c sidekiq compose*.yml` = 0); (2) my OWN full cold run (`/tmp/adv-discourse-m2v2.log`, RECIPE=
discourse @9ff5e19) → **RUN SUMMARY level=5 of 5**, all 5 tiers pass (install/upgrade/backup/restore/
custom), `lint rung: pass` (lint.txt status=pass, R011 ✅), and the two upgrade-overlay tests STILL pass.
Regression gone. Node clean: no discourse canonical (M1 baseline), recipe reset to published tag
0.8.1+3.5.0. (6/6)
## REVIEW VERDICT — Gate M2: **PASS** @ 2026-06-18T07:06Z (supersedes the 06:42Z FAIL)
All 6 canon-sweep failures FIXED and independently cold-verified by my own runs / chaos-deploys, one
recipe at a time, no concurrent load — each two-sided where applicable (M1 failure reproduced first-hand,
M2 fix proven):
1. **keycloak** (harness) — WC5 promote at the collision-free `warm-canon-keycloak` domain; live shared
`warm-keycloak` SSO UNDISTURBED (app up 4d, service Updated 2026-06-13, /realms/master 200 throughout);
all cold tiers pass. Collision-free routing affects ONLY keycloak (sole WARM_DOMAINS member) — zero
blast radius on the other 15 canonicals.
2. **mumble** (harness) — handshake test PASS in 10.3s (load-flake confirmed: fast in isolation); budget
widening 60s→180s is pure headroom, asserts unchanged (non-weakening). level=5.
3. **gitea** (recipe PR #2 @a0f2db8) — chaos-deploy onto retained idle 3.5.3 volumes (genuine pre-fix
0-byte app.ini): NO read-only crash (M1 signature gone), app.ini seeded 0→1862B (INSTALL_LOCK=true),
`/api/v1/version` 200 {1.24.2}, healthz 200, retained data adopted; canonical UNCHANGED 3.5.3 e6a1cc79
(no false promote). Merge-gating honest (published 3.6.0=357926f ≠ fix).
4. **bluesky-pds** (recipe PR #4 @4987ba9) — chaos-deploy: caddy resolves its OWN app via the FQ swarm
name (10.0.5.5 internal) while bare `app` → 10.10.0.12 foreign (the M1 collision); cert obtained, 0
connection-refused; external `/xrpc/_health` 200 {0.4.219} (M1 was 000).
5. **mattermost-lts** (recipe PR #1 @4ca7f418) — cold run all 5 tiers pass incl restore; the M1-failing
`test_restore_returns_state` PASSES (pg_backup.sh + restore.post-hook round-trips the dump). level=5.
6. **discourse** (recipe PR #4 @9ff5e19) — official-image migration; both upgrade-overlay tests pass AND
the F-redfix-1 regression (image-less sidekiq in compose.smtpauth.yml) is fixed → level=5, lint R011 ✅.
No standing exceptions. gitea/bluesky end-to-end canonical advance is operator-merge-gated (the fix is
proven by chaos-deploy; the published tags don't carry it pre-merge) — consistent with the phase's
"nothing merged" constraint, NOT a shrug. Node left clean: only infra + live warm-keycloak (200); gitea
idle 3.5.3 canonical unchanged; mattermost/discourse/bluesky no canonical (M1 baseline); no test/warm
stacks, no run procs; all 6 recipes at their published tags. No open Adversary findings (F-redfix-1
CLOSED). **No VETO.** The Builder is cleared to write `## DONE` to STATUS-redfix.md.