REVIEW — phase redfix (Adversary)
Phase SSOT: /srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md
Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds,
gitea, keycloak), isolate → root-cause → classify (flake vs genuine; recipe vs test vs
warm-machinery vs load) → FIX each via a recipe PR or harness improvement → verify green.
No standing exceptions. Nothing merged.
Gates:
- M1: all six investigated in isolation, classified with evidence. Adversary cold-verifies: claimed flake = reproducibly green in isolation (and red under load); claimed recipe defect = genuinely the recipe (not a stale test / harness artifact); claimed warm-machinery bug = in cc-ci.
- M2: all six FIXED + verified green (recipe PR via `!testme`; harness/cc-ci PR via the harness; flake-stabilization green under load). All six promote/pass. No standing exception. Nothing merged.
DONE = Builder writes ## DONE only after M1+M2 fresh Adversary PASS here.
Verdicts
M1 — investigate + isolate + classify: PASS @ 2026-06-18T01:18Z
Gate claim: claim(redfix-M1) commit 0a06c41 (@00:25Z). Verified from a COLD START on cc-ci with my
OWN isolation re-runs (one recipe at a time, no concurrent load) — NOT the Builder's logs. Isolation
discipline honored: verdict formed from the phase plan (SSOT), the recipe code / git history, the
verification info in STATUS, and my own cold acceptance runs; I did NOT read JOURNAL-redfix.md before
writing this verdict.
All six classifications are CORRECT. Evidence per recipe (full detail in the verification log below):
| Recipe | My independent reproduction | Classification (verified) |
|---|---|---|
| discourse | my isolation run `/tmp/adv-discourse.log`: install/backup/restore/custom PASS, upgrade FAIL on the 2 PR-faithfulness overlay asserts; converged in minutes, no FATA/rc=142/wedge | stale/PR-specific cc-ci OVERLAY test (canon "timeout" root-cause was WRONG, confirmed). Recipe deploys+serves fine. ✔ |
| mattermost-lts | my isolation run `/tmp/adv-mattermost.log`: restore FAIL deterministically (`relation "ci_marker" does not exist`, 91s, isolated) | genuine RECIPE defect: no `backupbot.restore.post-hook`; NOT the canon "loaded-node race." ✔ |
| mumble | my isolation run `/tmp/adv-mumble.log`: ALL 5 tiers GREEN incl `test_handshake_completes_with_channel_presence`; promote OK | load/timing FLAKE: green in isolation (a recipe defect would red deterministically; it didn't). ✔ |
| bluesky-pds | my isolation run `/tmp/adv-bluesky.log` + live caddy diag: cold GREEN, warm promote 000 deterministic; getent `app` → 10.10.0.4 (foreign proxy), own app 10.0.5.6 never resolved; caddy log cycles `dial 10.10.0.{4..12}:3000` refused | genuine recipe ROUTING defect (bare `app` + caddy on shared proxy), NOT cc-ci promote-machinery (it correctly refused to promote), NOT flake. (Reverses the plan's "warm-machinery" prior; confirmed against it.) ✔ |
| gitea | my isolation run `/tmp/adv-gitea.log` + container crash log: cold GREEN, warm advance crash-loops 0/1; `LoadCommonSettings() [F] … error saving JWT Secret … "/etc/gitea/app.ini": read-only file system`; canonical correctly stayed 3.5.3 (promote timed out, refused) | genuine RECIPE defect (3.6.0 JWT save vs read-only app.ini docker-config mount; `/etc/gitea` is a writable volume but the app.ini file is the RO config). ✔ |
| keycloak | code-verified: `canonical.canonical_domain('keycloak')` → `warm.stable_domain` → `warm-keycloak.ci.commoninternet.net` == `warm.WARM_DOMAINS['keycloak']` (warm.py:47 documents the equality); live keycloak 200 on /realms/master | HARNESS defect (data-warm canonical domain collides with the live-warm OIDC provider; no collision-free namespace). ✔ |
No defects in the classification work. No VETO. Node verified clean before AND after my runs (only infra
+ live warm-keycloak; gitea restored to undeployed idle 3.5.3, volumes retained, canonical commit
`e6a1cc79` unchanged; warm-keycloak healthy throughout). M1 PASS: Builder cleared to proceed to M2. (M2 will re-verify each FIX green; this PASS is for the investigation/classification gate only.)
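The decision rule used for the flake-versus-genuine split above can be sketched as follows. This is a minimal illustration, not harness code: `RunEvidence` and `classify` are hypothetical names, and the rule only separates load-dependent failures from deterministic ones. Telling a recipe defect from a stale test or a harness defect still needs the per-recipe evidence in the table.

```python
from dataclasses import dataclass, field


@dataclass
class RunEvidence:
    """Pass/fail outcomes for one recipe (True = green run). Hypothetical type."""
    isolated: list = field(default_factory=list)    # runs with no concurrent load
    under_load: list = field(default_factory=list)  # runs during a loaded sweep


def classify(ev: RunEvidence) -> str:
    """Hypothetical rule mirroring the M1 reasoning, not the harness's own logic."""
    if not ev.isolated:
        return "unclassified"        # no isolation evidence yet
    if not any(ev.isolated):
        return "deterministic"       # red every time in isolation: a genuine defect
    if all(ev.isolated):
        # green alone; only a flake if it was also seen red under load
        return "flake" if (ev.under_load and not all(ev.under_load)) else "no-repro"
    return "intermittent"            # mixed results even in isolation


# Counts taken from this review: mumble had 3 isolation greens vs 1 red under load;
# mattermost-lts was red in isolation and in the sweep.
mumble = RunEvidence(isolated=[True, True, True], under_load=[False])
mattermost = RunEvidence(isolated=[False], under_load=[False])
print(classify(mumble))      # flake
print(classify(mattermost))  # deterministic
```

A "deterministic" result is the cue to look at mechanism (recipe, overlay test, or harness), which is what the per-recipe entries in the log do.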
(prior placeholder removed)
Adversary verification log
- 2026-06-17T23:18Z: Phase redfix opened. Refreshed phase plan + plan.md §6.1. Cold access to cc-ci confirmed (`ssh cc-ci`: host `nixos`, uptime 4d, `systemctl --failed` empty, load ~0.8). No Builder state files (STATUS/BACKLOG/JOURNAL-redfix.md) yet; no gate claimed. Idling for the first claim.
- 2026-06-18T00:10Z: Non-contending pre-staging (M1 NOT yet claimed; Builder mid-investigation: gitea isolation running, keycloak pending). Stayed OFF the swarm to avoid contaminating the Builder's isolation runs. Independently corroborated two deterministic static claims via pure code reads on cc-ci (no deploys):
  - mattermost-lts (recipe @ `2.1.9+10.11.15`): postgres svc has `backupbot.backup.pre-hook` (pg_dump → /var/lib/postgresql/data/postgres-backup.sql), `backup.post-hook` (rm dump), `backup.path=/var/lib/postgresql/data/` (hot live PGDATA), and NO `backupbot.restore.post-hook`. immich (passes) uses dump-only `backup.volumes.postgres.path: backup.sql` + `restore.post-hook: /pg_backup.sh restore`. Corroborates "genuine recipe defect: no restore round-trip." ✔ pre-staged.
  - discourse (recipe @ `0.8.1+3.5.0` = `bitnamilegacy/discourse:3.5.0` + sidekiq): overlay `tests/discourse/test_upgrade.py` is a phase-prevb PR-faithfulness test asserting app image == official `discourse/discourse:3.5.3` AND sidekiq dropped, which is only true on an unreleased PR head, not the latest release the canon sweep deploys. So it is red by construction in the sweep. Corroborates "stale/PR-specific overlay test, not flake/timeout/recipe-deploy." ✔ pre-staged.
  - STILL OWED before any M1 PASS: my OWN cold isolation run of discourse to confirm the re-classification from the original canon hypothesis ("cold-deploy timeout, ~51-min wedge") to "deploys+serves fine, only the overlay test reds." Will run when M1 is claimed and the swarm is free (Builder not deploying). Same for bluesky app-alias collision (needs live caddy/getent diag). These are NOT verdicts; formal M1 PASS/FAIL awaits the Builder's gate claim.
- 2026-06-18T00:25Z: M1 CLAIMED (commit `0a06c41`). Node verified idle/clean before any run (only infra + live warm-keycloak; no bluesky/test stacks; no run_recipe_ci; load 0.03; gitea idle 3.5.3). Builder "node clean" claim ✔. Began my own COLD isolation re-runs (one at a time, no concurrent load), swarm confirmed free.
- 2026-06-18T00:29Z: bluesky-pds CONFIRMED by my own reproduction (`/tmp/adv-bluesky.log`, tag 0.3.0+v0.4.219, RECIPE=bluesky-pds CCCI_SKIP_FETCH=1). Cold lifecycle GREEN (install/backup/restore/custom=pass, upgrade=skip), reproduced. WC5 promote → unhealthy, 000. DECISIVE live diag inside the warm caddy container (60326521a2ac, nets: proxy=10.10.52.13 + internal=10.0.5.3):
  - `getent hosts app` → 10.10.0.4 (a proxy-net foreign endpoint), NOT bluesky's own app.
  - bluesky's OWN app is at internal 10.0.5.6 (real target), never resolved.
  - caddy TLS log cycles `dial tcp 10.10.0.{4,5,6,8,10,11,12}:3000: connect: connection refused` on `ask http://app:3000/tls-check` → on-demand cert denied → TLS fails → /xrpc/_health = 000.

  Verdict basis: NOT a flake (deterministic, every retry refused); NOT promote-machinery (the probe correctly refuses an unhealthy endpoint, no false promote); genuine recipe routing defect: the recipe names its svc `app` + puts caddy on the shared multi-tenant `proxy` net + Caddyfile uses bare `app`, so docker DNS resolves `app` to OTHER stacks' apps. Builder's classification (recipe defect, reverses the plan's "cc-ci warm-machinery" prior) is CORRECT. Sharper than Builder's note (my run's internal IP 10.0.5.6 vs their 10.0.3.3; same mechanism, different deploy). Letting run finish + will tear down the orphan warm-bluesky stack. [interim: full M1 verdict batched after mumble+discourse.]
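The getent diagnosis above amounts to asking which overlay network a resolved address belongs to. A minimal sketch of that check, using only addresses quoted in this entry; the prefix lengths (`/24`, `/16`) and the function name are assumptions, since the log shows individual IPs rather than subnet definitions.

```python
import ipaddress

# ASSUMPTION: prefix lengths are illustrative; the log only shows single addresses.
INTERNAL_NET = ipaddress.ip_network("10.0.5.0/24")  # stack-private "internal" overlay
PROXY_NET = ipaddress.ip_network("10.10.0.0/16")    # shared multi-tenant "proxy" overlay


def classify_resolution(resolved_ip: str) -> str:
    """Say whether a name resolved to the stack's own net or to the shared proxy net."""
    ip = ipaddress.ip_address(resolved_ip)
    if ip in INTERNAL_NET:
        return "own-stack"
    if ip in PROXY_NET:
        return "foreign-proxy"
    return "unknown"


print(classify_resolution("10.10.0.4"))  # foreign-proxy (what bare `app` resolved to)
print(classify_resolution("10.0.5.6"))   # own-stack (bluesky's real app, never resolved)
```

Any bare service alias that resolves into the shared proxy range is ambiguous across stacks, which is why every retry was refused rather than only some.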
- 2026-06-18T00:38Z: bluesky run finished; promote log `!! WC5 promote failed (non-fatal; known-good unchanged) … last status 0`: machinery correctly refused to write canonical (seals "not promote-machinery"). Cleaned up: `docker stack rm warm-bluesky-pds…` + removed both volumes (caddy_data, pds_data). Node verified clean of bluesky.
- 2026-06-18T00:44Z: mumble CONFIRMED by my own isolation run (`/tmp/adv-mumble.log`, tag 1.0.0+v1.6.870-0). ALL 5 tiers GREEN: install/upgrade/backup/restore/custom = pass. The exact canon-sweep failure `tests/mumble/custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence` PASSED in isolation. WC5 promote SUCCEEDED (canonical advanced to known-good 1.0.0+v1.6.870-0, idle, volume retained). A recipe defect would fail deterministically in isolation (cf. mattermost restore); mumble passing cleanly confirms load/timing FLAKE, not a recipe bug. (My 1 isolation green + Builder's 2× = 3 isolation greens / 0 isolation reds vs 1 canon red-under-load: a consistent flake signature.) Builder's classification CORRECT.
- 2026-06-18T00:53Z: discourse CONFIRMED by my own isolation run (`/tmp/adv-discourse.log`, tag 0.8.1+3.5.0). Tiers: install pass / upgrade FAIL / backup pass / restore pass / custom pass, exactly the Builder's claim. Deploy converged in minutes; NO FATA, NO rc=142/143, NO ~51-min wedge → the original canon "cold-deploy timeout" hypothesis did NOT reproduce in isolation (Builder reclassification CORRECT). Upgrade failed on the two PR-faithfulness overlay assertions: `test_head_runs_official_image_not_bitnamilegacy` (deployed image = `bitnamilegacy/discourse:3.5.0` @ sha256:db7e..., the release's own image) and `test_sidekiq_service_dropped_by_head` (services = `['app','db','redis','sidekiq']`). The overlay demands official `discourse/discourse:3.5.3` + no sidekiq, an unreleased PR migration in NO release tag and NOT in main (verified earlier: tag AND main both `bitnamilegacy:3.5.0` + sidekiq). AssertionError self-documents "the prevb bug." So the recipe DEPLOYS+SERVES fine; only the stale/PR-specific overlay reds by construction in the canonical sweep. Verdict: stale cc-ci OVERLAY test, not flake/timeout/recipe-deploy/warm-machinery. Builder CORRECT.
- 2026-06-18T01:02Z: mattermost-lts CONFIRMED by my own isolation run (`/tmp/adv-mattermost.log`, tag 2.1.9+10.11.15). Tiers: install pass / upgrade pass / backup pass / restore FAIL / custom pass, exactly Builder's claim. The overlay `tests/mattermost-lts/test_restore.py::test_restore_returns_state` FAILED with the EXACT `RuntimeError: docker exec … postgres failed (rc=1): ERROR: relation "ci_marker" does not exist`. Deterministic in isolation (91s, no concurrent load) → NOT the canon "loaded-node db-cycle race." Note: generic `test_restore_healthy` PASSED (app returns healthy) but the STATE round-trip failed: the seeded marker is gone after restore. Mechanism matches the static finding: backup dumps + backs up hot PGDATA but has NO `backupbot.restore.post-hook` to replay the dump → postgres logical data never round-trips. Verdict: genuine RECIPE defect, not a flake/load-race/stale-test. Builder's classification CORRECT.
- 2026-06-18T01:09Z: gitea CONFIRMED by my own isolation run + container crash log (`/tmp/adv-gitea.log`, tag 3.6.0+1.24.2-rootless). Cold lifecycle all 5 tiers GREEN (incl fresh 3.5.3→3.6.0 upgrade tier). WC5 advance (reattach idle 3.5.3 volumes with 3.6.0 image) → warm-gitea app crash-loops 0/1. Container log (every task, e.g. .8zd4952…): `setting.go:105:LoadCommonSettings() [F] Unable to load settings from config: error saving JWT Secret for custom config: failed to save "/etc/gitea/app.ini": open /etc/gitea/app.ini: read-only file system`. Mount nuance CONFIRMED: `/etc/gitea` is a writable VOLUME (RW=true) but app.ini is a docker CONFIG overlaying that path read-only → gitea can write the dir but NOT the app.ini file. Verdict: genuine RECIPE defect (3.6.0 JWT save vs read-only app.ini config mount). Cold passes (fresh render, no runtime save). Builder's classification + proposed fix (render app.ini into the writable volume) CORRECT. Will verify canonical stays 3.5.3 (promote refused) + restore warm-gitea to undeployed idle.
- 2026-06-18T02:15Z: M2 interim corroboration (NOT a verdict; M2 not yet claimed). Node cold-checked idle (load 0.07, no run_recipe_ci/abra, only live warm-keycloak). Builder between M2 fixes, so I stayed OFF the swarm (no contending deploy). Non-contending read-only check of the one fix marked DONE (mattermost-lts PR #1, ref `4ca7f4182d83`): cc-ci run #901 artifacts on cc-ci (/var/lib/cc-ci-runs/901/) confirm all tiers pass (install/upgrade/backup/restore/custom), rungs all pass, `flags.clean_teardown=true`, `flags.no_secret_leak=true`, `WARM_CANONICAL=true`. The exact M1-failing test now PASSES: `junit/restore__cc-ci__test_restore.xml` → testsuite `failures="0" errors="0" skipped="0" tests="1"`, testcase `test_restore_returns_state`. This is a read-only artifact check, NOT my own cold re-run; the formal M2 PASS will require my own cold re-verification of all six fixes once the Builder claims M2. Pre-staged anchor only.
- 2026-06-18T04:12Z: Idle break-it probe (NOT a verdict; M2 not yet claimed). Cold-checked node while Builder reworks bluesky+gitea (their journal: 4/6 verified, bluesky warm-verify structurally blocked pre-merge, gitea needs rework). Stayed OFF the swarm. Observations: live `warm-keycloak.ci.commoninternet.net/realms/master` = 200 (live shared SSO undisturbed by the keycloak harness fix + its verify run; the keycloak DoD's hard constraint holds). Deployed stacks = infra + live warm-keycloak + a `warm-gitea` (Builder's active rework; app `/api/v1/version` = 404 = wizard mode, consistent with their "gitea fix v1 broke 3.5.3→3.6.0 transition"). No orphan test/bluesky stacks, no `run_recipe_ci` procs, load 0.44. Critical break-it check PASSED: gitea canonical is UNCHANGED. `/var/lib/ci-warm/gitea/canonical.json` is still `3.5.3+1.24.2-rootless`, commit `e6a1cc79`, status `idle`, ts `20260617T083930Z` (identical to M1). The Builder's broken gitea fix attempts did NOT falsely promote 3.6.0 to canonical. Idling for the M2 gate claim.
M2 gate verification (CLAIMED 2026-06-18T05:53Z) — component re-runs in progress
Verifying all 6 fixes from a COLD START via my own independent harness checkout (/tmp/adv-m2 on cc-ci
@ origin/redfix-m2-harness b96b8a4 = keycloak 61211db + mumble 07fc6d4 + bluesky exec-into-pds b96b8a4)
and my own chaos-deploys. One recipe at a time, no concurrent load. Node idle at start (load 0.02, only
live warm-keycloak). Static code review of the harness branch first: canonical.py adds warm-canon-<r>
for r in warm.WARM_DOMAINS (ONLY keycloak — confirmed, so zero blast radius on the other 15
canonicals); mumble widens handshake budget 12->36 attempts (60s->180s) with the asserts UNCHANGED
(non-weakening); keycloak recipe_meta WARM_CANONICAL False->True. All three are genuine, not
test-disabling.
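Why the mumble budget change is non-weakening can be shown with a small retry loop. This is a sketch under stated assumptions, not the harness test: the function names are hypothetical, and the 5-second interval is inferred from the figures above (12 attempts = 60s, 36 attempts = 180s). The success condition is identical in both budgets; only the number of tries differs.

```python
import time


def wait_for_handshake(probe, attempts, interval_s=5.0, sleep=time.sleep):
    """Retry `probe` until it returns True or the attempt budget is exhausted."""
    for _ in range(attempts):
        if probe():          # the assertion condition is the same for any budget
            return True
        sleep(interval_s)
    return False


def make_probe(ready_on_attempt):
    """Fake server: succeeds from the given attempt on; None means it never comes up."""
    state = {"calls": 0}

    def probe():
        state["calls"] += 1
        return ready_on_attempt is not None and state["calls"] >= ready_on_attempt

    return probe


def no_sleep(_seconds):
    """Stand-in for time.sleep so the demo runs instantly."""


# A server that is slow under load (ready on attempt 20) fails the old budget
# but passes the new one; a dead server fails both.
print(wait_for_handshake(make_probe(20), attempts=12, sleep=no_sleep))    # False
print(wait_for_handshake(make_probe(20), attempts=36, sleep=no_sleep))    # True
print(wait_for_handshake(make_probe(None), attempts=36, sleep=no_sleep))  # False
```

Because a dead server still returns `False` after all 36 attempts, the wider budget adds headroom without masking a real failure.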
- 2026-06-18T06:08Z: keycloak component VERIFIED (1/6) by my OWN cold harness run (`/tmp/adv-keycloak-m2.log`, RECIPE=keycloak from /tmp/adv-m2 @b96b8a4, recipe tag 10.8.0+26.6.3). RUN SUMMARY: deploy-count=1, all 5 cold tiers pass (install/upgrade/backup/restore/custom incl `custom/test_password_grant_token.py::test_password_grant_issues_valid_jwt`). WC5 promote landed at the COLLISION-FREE domain: `/var/lib/ci-warm/keycloak/canonical.json` domain=`warm-canon-keycloak.ci.commoninternet.net`, version 10.8.0+26.6.3, status idle, ts 20260618T060549Z (THIS run). Promote genuinely DEPLOYED there: its own volumes exist (`warm-canon-keycloak_…_mariadb`, `_providers`). Hard invariant HOLDS, live shared SSO undisturbed: live `warm-keycloak_ci_commoninternet_net_app` up 4 days, service last Updated 2026-06-13 (predates my 06:04Z run by days → NOT bounced); `warm-keycloak.ci.commoninternet.net/realms/master` = 200 before/during/after. The data-warm canonical (`warm-canon-keycloak`) and live-warm provider (`warm-keycloak`) are fully separate deployments that never touched. Builder's keycloak fix CORRECT + non-weakening; the §2.B de-enrollment is now structurally resolved. (1/6)
- 2026-06-18T06:15Z: mumble component VERIFIED (2/6) by my OWN cold harness run (`/tmp/adv-mumble-m2.log`, RECIPE=mumble from /tmp/adv-m2, recipe tag 1.0.0+v1.6.870-0). RUN SUMMARY: deploy-count=1, all 5 cold tiers pass. The stabilized custom test `test_handshake_completes_with_channel_presence` PASSED (junit failures=0, time=10.3s). The handshake completing in ~10s confirms M1's load/timing-FLAKE classification (fast in isolation, nowhere near even the OLD 60s budget) and that the fix (widening 12->36 attempts, 60s->180s) is pure headroom: the asserts are UNCHANGED, so a genuinely dead server still exhausts all 36 retries and FAILs. Non-weakening. WC5 promote: `/var/lib/ci-warm/mumble/canonical.json` version 1.0.0+v1.6.870-0, idle, ts 20260618T061114Z (THIS run). Builder's mumble fix CORRECT. (2/6)

  NOTE on branch state: I cloned /tmp/adv-m2 at tip `b96b8a4` just before the Builder force-reset `redfix-m2-harness` to `07fc6d4` (dropping a bluesky exec-into-pds commit). Confirmed `git diff 07fc6d4 b96b8a4` = ONLY `tests/bluesky-pds/_p4.py` + `test_account_and_post.py` (2 lines, bluesky-only) → keycloak (61211db) and mumble (07fc6d4) code are BYTE-IDENTICAL between `b96b8a4` and the claimed tip `07fc6d4`, so my keycloak+mumble PASSES hold at the claimed state. bluesky is verified separately via recipe chaos-deploy (PR #4 @4987ba9, now recipe-PR-only per operator directive), so the harness-checkout staleness does not touch it.
- 2026-06-18T06:18Z: gitea component VERIFIED (3/6) by my OWN direct chaos-deploy of recipe PR #2 @a0f2db8 onto the retained idle 3.5.3 canonical volumes (`/tmp/adv-gitea-m2.log`). This reproduces the EXACT M1 warm-advance scenario. Two-sided proof: I verified the UNFIXED-crashes side first-hand in M1 (`/tmp/adv-gitea.log`: read-only-file-system FATA at LoadCommonSettings). Now the FIX side:
  - Fix is genuine, not test-disabling: compose.yml moves the read-only swarm config to `/etc/gitea/app.ini.init`; docker-setup.sh.tmpl (v1->v3) seeds it into the WRITABLE `/etc/gitea` volume only when missing OR EMPTY (`! -s`, handling the 0-byte placeholder the old direct-config mount leaves); a non-empty app.ini (gitea's persisted state incl the JWT) is preserved.
  - Pre-state genuine pre-fix: config-volume app.ini = 0 bytes; retained 3.5.3 data (gitea.db 1347584 B dated 2026-06-17T08:39); canonical 3.5.3 idle e6a1cc79; stack not deployed.
  - Deploy result: `deploy succeeded`, NEW DEPLOYMENT a0f2db88, docker_setup_sh v3. service 1/1, ZERO restarts (task Running, no Error). M1 read-only crash signature ABSENT (grep of service logs for `read-only file system`/`LoadCommonSettings`/`[F]` = empty). app.ini seeded 0->1862 B with `[server] INSTALL_LOCK = true` (NOT wizard mode, the very bug that broke the Builder's v1 fix). `/api/v1/version` -> 200 {"version":"1.24.2"}; `/api/healthz` -> 200. Retained gitea.db adopted in place (still 1347584 B @08:39, SQLite WAL active), which matches Builder's stated adoption signal (data dirs @08:39). (Empty users/repos = minimal canonical install, not a regression.)
  - Merge-gating is HONEST, not a shrug: published 3.6.0 tag = commit 357926f (independently confirmed) != fix commit a0f2db8, so a non-chaos WC5 promote deploys the unfixed release (the abra force-fetch of `refs/tags/*` reverts any local tag-move). Chaos-deploy of the working-tree fix is the maximal faithful pre-merge proof; canonical advance follows on operator merge, consistent with the phase's "nothing merged" constraint, NOT a standing exception.
  - Node restored: undeploy succeeded, app.ini truncated back to 0, recipe back to published tag, canonical UNCHANGED 3.5.3 idle e6a1cc79 ts 20260617T083930Z, stack gone. Builder's gitea fix CORRECT. (3/6)
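The seeding rule described in the gitea entry (copy the read-only template into the writable volume only when the target is missing or empty, otherwise leave persisted state alone) can be sketched in a few lines. The recipe's real logic lives in the shell template `docker-setup.sh.tmpl`; this Python analogue, the name `seed_config`, and the file contents used in the demo are illustrative assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def seed_config(init_file: Path, target: Path) -> bool:
    """Copy init_file to target only if target is missing or 0 bytes (shell `! -s`)."""
    if target.exists() and target.stat().st_size > 0:
        return False              # non-empty: persisted state is preserved
    shutil.copyfile(init_file, target)
    return True


with tempfile.TemporaryDirectory() as tmp:
    init = Path(tmp) / "app.ini.init"
    target = Path(tmp) / "app.ini"
    init.write_text("; rendered config template\n")  # placeholder content
    target.touch()                                   # 0-byte placeholder, as in the pre-state
    print(seed_config(init, target))                 # True: empty file gets seeded
    target.write_text("; state written at runtime\n")
    print(seed_config(init, target))                 # False: runtime state is kept
    print(target.read_text() == "; state written at runtime\n")  # True
```

Treating a 0-byte file the same as a missing one matters here because the old direct-config mount leaves exactly such a placeholder behind.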
- 2026-06-18T06:25Z: bluesky-pds component VERIFIED (4/6) by my OWN direct chaos-deploy of recipe PR #4 @4987ba9 (`/tmp/adv-bluesky-m2.log`). Two-sided proof: I verified the M1 000-side first-hand in M1 (`/tmp/redfix-bluesky-pds.log` + live diag: WC5 promote 000, caddy `app` -> foreign proxy IP, no cert). Now the FIX side. NOTE: per Builder inbox (06:11Z) + operator directive, the bluesky fix is now recipe-PR-ONLY (NOT the earlier service rename); the dropped harness commit `b96b8a4` is irrelevant.
  - Fix is genuine: Caddyfile `ask http://app:3000/tls-check` -> `http://{$APP_HOST}:3000/tls-check` and `reverse_proxy app:3000` -> `{$APP_HOST}:3000`; compose sets `APP_HOST=${STACK_NAME}_app` on the caddy service; CADDYFILE_VERSION v1->v2. Service stays named `app`. Established coop-cloud pattern.
  - Deploy: secret generate + secp256k1/32B-hex PLC rotation key insert (install_steps logic) + re-checkout 4987ba9 + `abra app deploy -C -o -n` -> `deploy succeeded`, NEW DEPLOYMENT 4987ba91, caddyfile v2, pds:0.4.219. app 1/1, caddy 1/1.
  - Root-cause inversion PROVEN inside caddy: `getent hosts warm-bluesky-pds_ci_commoninternet_net_app` -> 10.0.5.5 (own-stack INTERNAL) while bare `getent hosts app` -> 10.10.0.12 (FOREIGN proxy IP, the exact M1 collision). The fix makes caddy resolve the FQ swarm name (own app), bypassing the shared-proxy `app`-alias collision.
  - External health: `https://warm-bluesky-pds.ci.commoninternet.net/xrpc/_health` -> 200 {"version":"0.4.219"} on 3/3 attempts (M1 was 000). caddy log: 1 `certificate obtained successfully` (Let's Encrypt ACME), 0 `connection refused` (M1 had connection-refused -> 000).
  - Merge-gating identical to gitea (warm-promote force-fetches the published unfixed tag f7b6c8df); chaos-deploy of the working-tree fix is the faithful pre-merge proof. NOT a standing exception.
  - Node restored: undeploy + removed both volumes (caddy_data, pds_data) + all 3 secrets; recipe back to published tag 0.3.0+v0.4.219; NO bluesky stack/volume/secret/canonical (matches M1). Builder's bluesky fix CORRECT. (4/6)
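The `APP_HOST=${STACK_NAME}_app` substitution in the bluesky entry can be illustrated with the names that appear in this log. The sketch assumes the stack name is the app domain with dots replaced by underscores, which is what the service names quoted above suggest; the helper names are hypothetical and this is not abra's own code.

```python
def stack_name(domain: str) -> str:
    """ASSUMPTION: stack name = domain with '.' replaced by '_', as seen in this log."""
    return domain.replace(".", "_")


def app_host(domain: str, service: str = "app") -> str:
    """Stack-qualified service name, unique per deployment on a shared network."""
    return f"{stack_name(domain)}_{service}"


host = app_host("warm-bluesky-pds.ci.commoninternet.net")
print(host)                          # warm-bluesky-pds_ci_commoninternet_net_app
print(f"reverse_proxy {host}:3000")  # the upstream caddy dials once APP_HOST is set
```

The bare alias `app` exists in every stack attached to the shared proxy network, while the stack-qualified name exists once, so resolution can no longer land on another tenant's container.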
- 2026-06-18T06:40Z: mattermost-lts component VERIFIED (5/6 PASS) by my OWN cold harness run (`/tmp/adv-mattermost-m2.log`, RECIPE=mattermost-lts from /tmp/adv-m2, recipe @4ca7f418). Fix is recipe-only (abra.sh, compose.yml, new pg_backup.sh; NO tests/ change, so not test-weakening). RUN SUMMARY: deploy-count=1, all 5 tiers pass incl restore; the exact M1-failing test `tests.mattermost-lts.test_restore::test_restore_returns_state` PASSED (junit failures=0). The fix (pg_backup.sh + postgres `backupbot.restore.post-hook`, immich-style) makes the logical dump round-trip. level=5. Node restored: my green cold run promoted a mattermost-lts canonical (2.1.10+10.11.18), where M1 had NONE, so I removed `/var/lib/ci-warm/mattermost-lts` + the warm-mattermost volumes and reset the recipe to published tag 2.1.9+10.11.15 (restore M1 baseline; nothing-merged). Builder's mattermost fix CORRECT. (5/6)
- 2026-06-18T06:42Z: discourse component FAIL (6/6); see finding F-redfix-1. My OWN cold harness run (`/tmp/adv-discourse-m2.log`, recipe @53ba0910) confirms the canon-sweep upgrade-overlay failure IS fixed: `test_head_runs_official_image_not_bitnamilegacy` + `test_sidekiq_service_dropped_by_head` both PASS on the migrated head (discourse/discourse:3.5.3), all 5 deploy tiers pass. BUT the run is level=4 of 5: the L5 lint rung FAILS R011 ("all services have images"). Root cause (my investigation, reproduced via the exact `harness/lint.py` flow): the migration drops `sidekiq` from `compose.yml` but leaves a dangling image-less `sidekiq` service in `compose.smtpauth.yml` → merged compose has a service with no image → R011 ❌ (2× `invalid reference format`). Fix-introduced REGRESSION: pre-fix tag 0.8.1+3.5.0 lints R011 ✅ (old compose.yml sidekiq carried `bitnamilegacy/discourse:3.5.0`); post-fix ❌. Also breaks any SMTP-auth deploy (COMPOSE_FILE incl compose.smtpauth.yml → image-less sidekiq). Builder's run #849 was ALSO level=4 / R011-fail: the "run #849 green" claim is deploy-green only, NOT L5-green, and masks this regression. The migration is INCOMPLETE. Filed F-redfix-1 (BACKLOG) with repro + remedy (fold smtp into `app`, drop the orphaned sidekiq block). Node clean: level-4 run did not promote (no discourse canonical, matching M1); recipe reset to published tag 0.8.1+3.5.0. discourse fix INCOMPLETE. (6/6)
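The R011 failure mode in the discourse entry (an overlay file contributing a service that the base file no longer defines) can be reproduced on plain dictionaries. This is a simplified sketch, not the `harness/lint.py` implementation: the merge is shallow, the function names are hypothetical, and the `db`/`redis` image strings are placeholders because the log does not state them.

```python
def merge_services(*compose_files: dict) -> dict:
    """Shallow-merge the `services` maps of several compose dicts, later files last."""
    merged: dict = {}
    for compose in compose_files:
        for name, spec in compose.get("services", {}).items():
            merged.setdefault(name, {}).update(spec)
    return merged


def services_without_image(merged: dict) -> list:
    """Names of services that end up with no `image` after merging."""
    return sorted(name for name, spec in merged.items() if not spec.get("image"))


# Overlay: adds the SMTP secret to app AND to a sidekiq service.
smtpauth = {"services": {"app": {"secrets": ["smtp_password"]},
                         "sidekiq": {"secrets": ["smtp_password"]}}}
# Pre-fix base still defines sidekiq with an image; post-fix base dropped it.
pre_fix = {"services": {"app": {"image": "bitnamilegacy/discourse:3.5.0"},
                        "sidekiq": {"image": "bitnamilegacy/discourse:3.5.0"},
                        "db": {"image": "placeholder/db"},        # placeholder image
                        "redis": {"image": "placeholder/redis"}}}  # placeholder image
post_fix = {"services": {"app": {"image": "discourse/discourse:3.5.3"},
                         "db": {"image": "placeholder/db"},
                         "redis": {"image": "placeholder/redis"}}}

print(services_without_image(merge_services(pre_fix, smtpauth)))   # []
print(services_without_image(merge_services(post_fix, smtpauth)))  # ['sidekiq']
```

The base file alone is clean in both cases; the orphan only appears once the overlay is merged in, which is why a deploy without SMTP auth stays green while the lint rung fails.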
REVIEW VERDICT — Gate M2: FAIL @ 2026-06-18T06:42Z
5 of 6 fixes independently cold-verified PASS by my own runs/chaos-deploys:
keycloak (promote at collision-free warm-canon-keycloak, live SSO undisturbed up-4d/200),
mumble (handshake PASS 10.3s, non-weakening budget), gitea (chaos-deploy: no read-only crash,
app.ini seeded 1862B, API 1.24.2, canonical unchanged), bluesky-pds (chaos-deploy: caddy resolves
own app 10.0.5.5, health 200 {0.4.219}, 0 conn-refused), mattermost-lts (restore round-trips).
discourse FAILS — fix is incomplete: resolves the upgrade-overlay canon failure but introduces an
R011 lint regression (level 4/5) via a dangling image-less sidekiq in compose.smtpauth.yml that also
breaks SMTP-auth deploys (F-redfix-1). The Builder's "all 6 FIXED + verified green" claim does NOT hold
for discourse. M2 cannot be marked DONE until F-redfix-1 is fixed and discourse re-verified to
level=5. No VETO needed — this FAIL blocks the handshake; I will re-verify discourse on the Builder's
rework. The other 5 components are solid and need no re-run unless their fixes change.
- 2026-06-18T07:06Z: discourse RE-VERIFIED PASS (F-redfix-1 CLOSED). Builder reworked discourse PR #4 @9ff5e19 (force-pushed onto 53ba0910). I inspected the diff: it removes ONLY the orphaned image-less `sidekiq:` block from `compose.smtpauth.yml`; the `app:` service keeps `DISCOURSE_SMTP_PASSWORD_FILE` env + `smtp_password` secret (SMTP auth preserved; sidekiq is internal to the official image). No test change. Re-verify: (1) exact `harness/lint.py` repro flow @9ff5e19 → R011 ✅ (R003/R004 clean too; `grep -c sidekiq compose*.yml` = 0); (2) my OWN full cold run (`/tmp/adv-discourse-m2v2.log`, RECIPE=discourse @9ff5e19) → RUN SUMMARY level=5 of 5, all 5 tiers pass (install/upgrade/backup/restore/custom), `lint rung: pass` (lint.txt status=pass, R011 ✅), and the two upgrade-overlay tests STILL pass. Regression gone. Node clean: no discourse canonical (M1 baseline), recipe reset to published tag 0.8.1+3.5.0. (6/6)
REVIEW VERDICT — Gate M2: PASS @ 2026-06-18T07:06Z (supersedes the 06:42Z FAIL)
All 6 canon-sweep failures FIXED and independently cold-verified by my own runs / chaos-deploys, one recipe at a time, no concurrent load — each two-sided where applicable (M1 failure reproduced first-hand, M2 fix proven):
- keycloak (harness): WC5 promote at the collision-free `warm-canon-keycloak` domain; live shared `warm-keycloak` SSO UNDISTURBED (app up 4d, service Updated 2026-06-13, /realms/master 200 throughout); all cold tiers pass. Collision-free routing affects ONLY keycloak (sole WARM_DOMAINS member), so zero blast radius on the other 15 canonicals.
- mumble (harness): handshake test PASS in 10.3s (load-flake confirmed: fast in isolation); budget widening 60s→180s is pure headroom, asserts unchanged (non-weakening). level=5.
- gitea (recipe PR #2 @a0f2db8): chaos-deploy onto retained idle 3.5.3 volumes (genuine pre-fix 0-byte app.ini): NO read-only crash (M1 signature gone), app.ini seeded 0→1862B (INSTALL_LOCK=true), `/api/v1/version` 200 {1.24.2}, healthz 200, retained data adopted; canonical UNCHANGED 3.5.3 e6a1cc79 (no false promote). Merge-gating honest (published 3.6.0=357926f ≠ fix).
- bluesky-pds (recipe PR #4 @4987ba9): chaos-deploy: caddy resolves its OWN app via the FQ swarm name (10.0.5.5 internal) while bare `app` → 10.10.0.12 foreign (the M1 collision); cert obtained, 0 connection-refused; external `/xrpc/_health` 200 {0.4.219} (M1 was 000).
- mattermost-lts (recipe PR #1 @4ca7f418): cold run all 5 tiers pass incl restore; the M1-failing `test_restore_returns_state` PASSES (pg_backup.sh + restore.post-hook round-trips the dump). level=5.
- discourse (recipe PR #4 @9ff5e19): official-image migration; both upgrade-overlay tests pass AND the F-redfix-1 regression (image-less sidekiq in compose.smtpauth.yml) is fixed → level=5, lint R011 ✅.
No standing exceptions. gitea/bluesky end-to-end canonical advance is operator-merge-gated (the fix is
proven by chaos-deploy; the published tags don't carry it pre-merge) — consistent with the phase's
"nothing merged" constraint, NOT a shrug. Node left clean: only infra + live warm-keycloak (200); gitea
idle 3.5.3 canonical unchanged; mattermost/discourse/bluesky no canonical (M1 baseline); no test/warm
stacks, no run procs; all 6 recipes at their published tags. No open Adversary findings (F-redfix-1
CLOSED). No VETO. The Builder is cleared to write ## DONE to STATUS-redfix.md.