cc-ci/machine-docs/STATUS-redfix.md
autonomic-bot 29a28176a9
claim(redfix-M2): discourse F-redfix-1 FIXED + level=5 verified — re-claim 6/6
Dropped orphaned image-less sidekiq from discourse compose.smtpauth.yml (PR #4
@9ff5e19); R011 lint pass (Adversary repro) + own cold run level=5 of 5 all tiers
pass. Other 5 fixes unchanged (Adversary PASS). 6/6 verified green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Bacy8SJhBebNAGoYsi1Fxt
2026-06-18 06:55:28 +00:00


STATUS — phase redfix

Phase SSOT: /srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md

Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds, gitea, keycloak) → isolate → root-cause → classify (flake vs genuine; recipe vs test vs warm-machinery vs load) → FIX each (recipe PR or harness improvement) → verify green. No standing exceptions. Nothing merged.

Phase: M1 — investigate + isolate + classify (IN PROGRESS)

Bootstrapped 2026-06-17T23:20Z. cc-ci healthy, no run in flight, next scheduled sweep 2026-06-21 (3-day clear window). Disk / 38G free (75% used).

Isolation harness (how I reproduce each failure ALONE)

Each canon-sweep per-recipe run is runner/nightly_sweep.run_on_tag(recipe, latest): abra.recipe_checkout(recipe, <latest-tag>) then run_recipe_ci.py with RECIPE=<r> CCCI_SKIP_FETCH=1 and REF/QUICK/MODE/VERSION unset (cold, full, head==tag). Isolation = run ONE recipe at a time with NO concurrent sweep load on the single node (the loaded node is the known flake source per phase plan §2.1). Runs execute on cc-ci from /etc/cc-ci.
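The environment contract described above can be sketched in Python. This is a minimal illustration, not the code in runner/nightly_sweep; the helper name isolation_env is hypothetical, and because this document names the quick-mode variable both as QUICK and as CCCI_QUICK, the sketch clears both.

```python
import os

# Variables that must be absent for a cold, full, head==tag run.
# Both spellings of the quick-mode variable are cleared (assumption: only one is real).
UNSET_FOR_ISOLATION = ("REF", "QUICK", "CCCI_QUICK", "MODE", "VERSION")


def isolation_env(recipe, base_env=None):
    """Return a copy of the environment prepared for a single-recipe isolation run."""
    env = dict(os.environ if base_env is None else base_env)
    for name in UNSET_FOR_ISOLATION:
        env.pop(name, None)          # same effect as `env -u NAME`
    env["RECIPE"] = recipe
    env["CCCI_SKIP_FETCH"] = "1"     # keep the checked-out tag, do not re-fetch
    return env


if __name__ == "__main__":
    env = isolation_env("mumble", {"REF": "pr-123", "MODE": "warm", "PATH": "/usr/bin"})
    print(sorted(env))  # unrelated variables such as PATH are preserved
```

The returned dict can be passed as the env argument of subprocess.run when launching run_recipe_ci.py from /etc/cc-ci, one recipe at a time.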

Starting canonical state (cc-ci /var/lib/ci-warm/<r>/canonical.json, read 2026-06-17T23:19Z)

| Recipe | Canonical now | Note |
|---|---|---|
| discourse | (none) | no canonical dir |
| mattermost-lts | (none) | no canonical dir |
| mumble | 1.0.0+v1.6.870-0 @ 20260617T180501Z | canonical PRESENT, written TODAY (flake signal) |
| bluesky-pds | (none) | no canonical dir |
| gitea | 3.5.3+1.24.2-rootless @ 20260617T083930Z | 3.6.0 advance not promoted |
| keycloak | (none) | de-enrolled (WARM_CANONICAL off) |

M1 investigation tracker

| Recipe | Isolation run | Result | Root cause | Classification |
|---|---|---|---|---|
| discourse | DONE @23:40Z (/tmp/redfix-discourse.log on cc-ci) | install/backup/restore/custom PASS; upgrade overlay FAIL. Deploys+serves fine; NOT a timeout/FATA. | cc-ci overlay tests/discourse/test_upgrade.py asserts head runs official discourse/discourse:3.5.3 + drops sidekiq; latest tag 0.8.1+3.5.0 AND main both still bitnamilegacy/discourse:3.5.0+sidekiq (migration exists in no release/main). The depends_on discourse string is a non-fatal prepull-only warning, not the deploy. | stale/PR-specific cc-ci OVERLAY test mismatched to canonical-sweep context (not flake/timeout/recipe-deploy/warm-machinery) |
| mattermost-lts | DONE @00:05Z (/tmp/redfix-mattermost-lts.log) | install/upgrade/backup/custom PASS; restore FAIL (ci_marker does not exist); deterministic in isolation (not a load race) | recipe postgres svc backup labels: backs up hot live PGDATA + dump but has NO backupbot.restore.post-hook to replay the dump → restore doesn't round-trip postgres. Contrast immich (passes): dump-only backup.volumes.postgres.path: backup.sql + restore.post-hook: /pg_backup.sh restore. | genuine RECIPE defect at latest → recipe PR (adopt immich-style dump+restore-post-hook) |
| mumble | DONE: 2× isolation GREEN (/tmp/redfix-mumble.log + /tmp/redfix-mumble2.log) | ALL tiers PASS incl. handshake on BOTH runs; no orphans; canonical re-promoted green each time | handshake (TLS+ServerSync) not completing within ~60s retry under heavy concurrent sweep load; fine in isolation | load/timing FLAKE → harness stabilization (readiness gate / retry) |
| bluesky-pds | DONE @00:45Z (/tmp/redfix-bluesky-pds.log + live diag) | cold lifecycle GREEN; WC5 promote 000 reproduces (warm /xrpc/_health last status 0). NOT a flake | caddy on-demand TLS (ask http://app:3000/tls-check) can't reach app: caddy resolves bare app to OTHER stacks' app endpoints on shared proxy net (getent app → only 10.10.0.X, never internal 10.0.3.3; proxy has drone/traefik/keycloak/ccci app aliases) → no cert → 000. Promote machinery correct (refused to write canonical). | genuine routing/RECIPE defect (cross-stack app-alias collision on shared proxy) → recipe PR: unique PDS service name/alias. NOT promote-machinery, NOT flake |
| gitea | DONE @00:14Z (/tmp/redfix-gitea2.log + live container logs) | cold lifecycle (incl fresh 3.5.3→3.6.0 upgrade) PASS; warm advance crash-loops | LoadCommonSettings() [F] error saving JWT Secret … failed to save "/etc/gitea/app.ini": read-only file system. gitea 3.6.0/1.24.2 tries to persist a JWT to the read-only app.ini docker-config mount on warm reattach (before DB migration; 3.5.3 data intact). Cold passes (fresh secrets, no rewrite). | genuine RECIPE defect (3.6.0 + read-only app.ini config mount on advance) → recipe PR: render app.ini into the writable config volume. (1st gitea run hit a nixenv "already deployed" leftover confound; fixed by undeploying to idle then re-running) |
| keycloak | DONE @01:05Z (code-verified; no run) | de-enrolled. canonical_domain("keycloak") == WARM_DOMAINS["keycloak"] == warm-keycloak.ci.commoninternet.net EXACTLY (canonical.py:42, warm.py:27,44). Live keycloak 200 /realms/master. | data-warm canonical domain uses same warm-<r> scheme as the live-warm OIDC provider → promote would collide with live shared SSO. No collision-free canonical namespace exists. | HARNESS defect (warm-domain namespace collision) → fix: collision-free canonical_domain for live-warm providers (warm-canon-<r>), then enroll keycloak |

M1 results table (recipe → failure → isolation result → root cause → classification → fix approach)

| Recipe | Canon-sweep failure | Isolation result | Flake or genuine | Root cause | Class | Fix approach (M2) |
|---|---|---|---|---|---|---|
| discourse | "cold-deploy timeout / deploy FATA" | install/backup/restore/custom GREEN; upgrade overlay RED | genuine (deterministic), but the canon root-cause was WRONG (no timeout, no deploy FATA) | cc-ci overlay tests/discourse/test_upgrade.py asserts head = official discourse/discourse:3.5.3 + sidekiq dropped; that migration is in NO release tag and NOT in main (all use bitnamilegacy/discourse:3.5.0+sidekiq) | stale/PR-specific cc-ci OVERLAY test | make the overlay assert migration-faithfulness only when the head IS that migration (not vs a release tag), OR a recipe PR migrating off deprecated bitnamilegacy; settle in M2 (NOT a test-weakening) |
| mattermost-lts | test_restore_returns_state RED | install/upgrade/backup/custom GREEN; restore RED (ci_marker does not exist) | genuine (deterministic in isolation); NOT the canon "loaded-node race" | recipe postgres backup labels back up hot PGDATA + a dump but have no backupbot.restore.post-hook to replay it; restore doesn't round-trip. immich (passes) uses dump-only path + restore.post-hook | genuine RECIPE defect | recipe PR: adopt immich-style postgres dump + backupbot.restore.post-hook replay |
| mumble | test_handshake… RED | ALL tiers GREEN in isolation (×N) incl. handshake | FLAKE (load/timing) | handshake (TLS+ServerSync) doesn't complete within the 60s retry under heavy concurrent sweep load; fine isolated; canonical written green today | load/concurrency FLAKE | harness stabilization: stronger readiness gate before the custom tier / longer-or-smarter handshake retry |
| bluesky-pds | warm promote /xrpc/_health → 000 | cold lifecycle GREEN; warm promote 000 reproduces | genuine (deterministic); NOT a load/rate-limit flake | caddy on-demand TLS calls http://app:3000/tls-check; caddy resolves bare app to OTHER stacks' app endpoints on the shared proxy net (every stack aliases its main svc app), never bluesky's own internal app (10.0.3.3) → connection refused → no cert → 000 | genuine ROUTING/RECIPE defect (cross-stack app-alias collision) | recipe PR: give the PDS service a unique name/alias so caddy resolves only bluesky's app |
| gitea | 3.5.3→3.6.0 warm advance doesn't promote | cold (incl fresh upgrade) GREEN; warm advance crash-loops | genuine (deterministic) | gitea 3.6.0/1.24.2 saves a JWT secret to /etc/gitea/app.ini on warm reattach; app.ini is a read-only docker-config mount → read-only file system FATA at LoadCommonSettings (pre-migration; 3.5.3 data intact). Cold passes (fresh secrets, no rewrite) | genuine RECIPE defect | recipe PR: render app.ini into the writable config:/etc/gitea volume (entrypoint) instead of a read-only docker config |
| keycloak | de-enrolled (not tested) | code-verified (no run) | genuine (structural) | canonical_domain("keycloak") == WARM_DOMAINS["keycloak"] == warm-keycloak.ci.commoninternet.net EXACTLY → a data-warm canonical would collide with the live-warm OIDC provider | HARNESS defect (warm-domain namespace collision) | harness: collision-free canonical_domain for live-warm providers (warm-canon-<r>), then enroll keycloak (WARM_CANONICAL=True) |
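The keycloak row can be made concrete with a small Python sketch of the collision and of the fix approach named in the table. It is an illustration under assumptions: the real canonical.py and warm.py are not reproduced here, the *_before/*_after function names and the collides helper are hypothetical, and only the domain strings come from this document.

```python
BASE = "ci.commoninternet.net"

# Live-warm providers and their fixed domains (only keycloak is named in this doc).
WARM_DOMAINS = {"keycloak": f"warm-keycloak.{BASE}"}


def stable_domain(recipe):
    """The warm-<r> scheme shared by data-warm canonicals."""
    return f"warm-{recipe}.{BASE}"


def canonical_domain_before(recipe):
    """M1 behaviour: every recipe gets warm-<r>, including live-warm providers."""
    return stable_domain(recipe)


def canonical_domain_after(recipe):
    """Fix approach: live-warm providers get a separate warm-canon-<r> namespace."""
    if recipe in WARM_DOMAINS:
        return f"warm-canon-{recipe}.{BASE}"
    return stable_domain(recipe)


def collides(recipe, canonical_domain):
    """True when promoting the canonical would land on the live-warm domain."""
    return canonical_domain(recipe) == WARM_DOMAINS.get(recipe)


if __name__ == "__main__":
    print(collides("keycloak", canonical_domain_before))  # True
    print(collides("keycloak", canonical_domain_after))   # False
    print(canonical_domain_after("gitea"))                # warm-gitea.ci.commoninternet.net
```

Recipes outside WARM_DOMAINS keep their existing warm-<r> domain, so the change is scoped to live-warm providers only.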

HOW the Adversary cold-verifies each classification (run ONE recipe at a time, no concurrent load)

Isolation invocation (per recipe R at latest tag T), from /etc/cc-ci on cc-ci:

    git -C ~/.abra/recipes/R checkout -f --quiet T && env -u REF -u CCCI_QUICK -u MODE -u VERSION RECIPE=R CCCI_SKIP_FETCH=1 cc-ci-run runner/run_recipe_ci.py

Latest tags: discourse 0.8.1+3.5.0, mattermost-lts 2.1.9+10.11.15, mumble 1.0.0+v1.6.870-0, bluesky-pds 0.3.0+v0.4.219, gitea 3.6.0+1.24.2-rootless.

  • discourse — EXPECT install/backup/restore/custom pass, upgrade fail on test_head_runs_official_image_not_bitnamilegacy + test_sidekiq_service_dropped_by_head. Confirm the overlay mismatch statically: git -C ~/.abra/recipes/discourse show 0.8.1+3.5.0:compose.yml | grep -A1 ' app:' and ... show main:compose.yml both = bitnamilegacy/discourse:3.5.0; grep -c 'sidekiq:' = 1 in both. So the test's discourse/discourse:3.5.3/no-sidekiq expectation exists nowhere upstream.
  • mattermost-lts — EXPECT restore fail relation "ci_marker" does not exist. Confirm root cause statically: git -C ~/.abra/recipes/mattermost-lts show 2.1.9+10.11.15:compose.yml | grep backupbot shows pre-hook + backup.path but NO restore.post-hook; immich git -C ~/.abra/recipes/immich show <latest>:compose.yml | grep backupbot shows restore.post-hook: /pg_backup.sh restore.
  • mumble: EXPECT all tiers green (run 2-3× to confirm reproducibly green isolated). Canonical written green: cat /var/lib/ci-warm/mumble/canonical.json.
  • bluesky-pds: EXPECT cold green, then WC5 promote failing with !! WC5 promote failed … warm-bluesky-pds … last status 0. While the warm stack is up, confirm root cause: caddy logs dial tcp 10.10.0.X:3000: connect: connection refused for app:3000/tls-check; docker exec <caddy> getent hosts app returns proxy IPs (10.10.0.X), while the app's real internal IP is 10.0.3.x; docker network inspect proxy | grep _app shows many stacks aliasing app. (Tear down the orphaned warm-bluesky-pds stack + volumes after.)
  • gitea — REQUIRES idle canonical first: if warm-gitea is deployed, docker stack rm warm-gitea_ci_commoninternet_net (retains data+config volumes) so the advance reattaches from idle. EXPECT cold green, warm advance crash-loop with container log LoadCommonSettings() [F] error saving JWT Secret … "/etc/gitea/app.ini": read-only file system. Restore: leave warm-gitea undeployed (idle 3.5.3, volumes retained) — registry stays 3.5.3+1.24.2-rootless.
  • keycloak: no run. Code-verify: canonical.canonical_domain('keycloak') → warm.stable_domain('keycloak') → warm-keycloak.ci.commoninternet.net; warm.WARM_DOMAINS['keycloak'] == same string (runner/harness/canonical.py:42-44, warm.py:27-29,44-48). Live keycloak 200 on /realms/master.
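For the bluesky-pds check in the list above, the getent output can be classified mechanically. The sketch below is a hedged illustration: the helper names are hypothetical, the proxy subnet 10.10.0.0/16 is an assumption inferred from the 10.10.0.X addresses quoted here, and the sample addresses in the demo are made up apart from 10.0.3.3.

```python
import ipaddress

# Assumption: the shared proxy overlay is 10.10.0.0/16 (the doc only shows 10.10.0.X).
PROXY_NET = ipaddress.ip_network("10.10.0.0/16")


def parse_getent(output):
    """Extract the IP column from `getent hosts NAME` output (one 'IP  name' per line)."""
    ips = []
    for line in output.splitlines():
        fields = line.split()
        if fields:
            ips.append(ipaddress.ip_address(fields[0]))
    return ips


def resolves_only_to_proxy(output):
    """True when every resolved address sits on the shared proxy net (the M1 symptom)."""
    ips = parse_getent(output)
    return bool(ips) and all(ip in PROXY_NET for ip in ips)


if __name__ == "__main__":
    foreign = "10.10.0.12      app\n10.10.0.31      app\n"   # illustrative values
    own = "10.0.3.3        app\n"
    print(resolves_only_to_proxy(foreign))  # True: caddy would dial other stacks
    print(resolves_only_to_proxy(own))      # False: the stack's own internal address
```

An empty result is reported as False, so a name that does not resolve at all is not mistaken for the alias collision.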

Node state left clean

All isolation runs torn down; orphaned warm-bluesky-pds stack+volumes removed; warm-gitea restored to idle 3.5.3 (volumes retained, registry unchanged); only live warm-keycloak deployed (healthy). No run_recipe_ci.py processes.

M1 — PASS @ 2026-06-18T01:18Z (REVIEW-redfix.md; all 6 classifications cold-verified CORRECT by Adversary's own isolation re-runs). No VETO. Cleared to M2.

Phase: M2 — FIX + verify all six (IN PROGRESS)

Fix designs locked in BACKLOG-redfix.md. Recipe PRs (mattermost-lts/bluesky/gitea) on git.autonomic.zone mirrors via the recipe mirror+PR flow, verified !testme (NEVER merge). Harness fixes (keycloak/mumble) on a cc-ci branch, verified via the harness. discourse: overlay-scope decision. Node now free for my deploys (Adversary done with M1).

M2 fix tracker (updated 2026-06-18T05:53Z — ALL VERIFIED)

| Recipe | Class | Fix | PR/branch + ref | Status |
|---|---|---|---|---|
| mattermost-lts | recipe defect | pg_backup.sh + backupbot.restore.post-hook (immich pattern) | mirror PR #1 ci/pg-restore @4ca7f418 | VERIFIED: !testme run #901 ALL tiers green incl test_restore_returns_state |
| discourse | stale cc-ci overlay | recipe: bitnamilegacy->official discourse image migration + drop orphaned image-less sidekiq from compose.smtpauth.yml (F-redfix-1) | mirror PR #4 discourse-official-image @9ff5e19 | VERIFIED: own cold run /tmp/redfix-discourse-m2verify.log level=5 of 5 (all tiers + lint R011 PASS); F-redfix-1 regression fixed |
| keycloak | harness defect | collision-free canonical_domain (warm-canon-<r> for WARM_DOMAINS recipes) + enroll | cc-ci branch redfix-m2-harness @61211db | VERIFIED: branch-checkout run promotes at warm-canon-keycloak; live warm-keycloak 200 throughout |
| mumble | load/timing flake | harness: handshake readiness budget 60s->180s | cc-ci branch redfix-m2-harness @07fc6d4 | VERIFIED: branch-checkout run all tiers green incl handshake; budget active+non-regressing |
| gitea | recipe defect | app.ini->staging /etc/gitea/app.ini.init + docker-setup seed-on-EMPTY + DOCKER_SETUP_SH_VERSION v3 | mirror PR #2 ci/app-ini-writable @a0f2db8 | VERIFIED (direct chaos-deploy; promote merge-gated, see below) |
| bluesky-pds | recipe defect (routing) | caddy {$APP_HOST}=${STACK_NAME}_app (operator: NO rename) + CADDYFILE_VERSION v2 | mirror PR #4 ci/warm-routing-alias @4987ba9 | VERIFIED (direct chaos-deploy; promote merge-gated, see below) |

cc-ci-side change verification: run from a checkout of redfix-m2-harness (CCCI_REPO=); never touches /etc/cc-ci main. redfix-m2-harness is now mumble+keycloak ONLY (bluesky needs no cc-ci change with the ${STACK_NAME}_app approach; the rename's exec-ref commit b96b8a4 was dropped).

Gate: M2 — RE-CLAIMED, awaiting Adversary (2026-06-18T06:55Z; orig claim 05:53Z)

Re-claim delta (addresses Adversary M2 FAIL @06:42Z, finding F-redfix-1). The first M2 verdict was FAIL on discourse ONLY (other 5 PASS, do-not-redo). F-redfix-1: the official-image migration dropped sidekiq from compose.yml but left a dangling image-less sidekiq: block in compose.smtpauth.yml → L5 lint R011 fail (run level=4) + broken SMTP-auth deploy. FIXED in PR #4 discourse-official-image @9ff5e19 (force-pushed onto @53ba0910): dropped the orphaned sidekiq: block; the app: override already carries DISCOURSE_SMTP_PASSWORD_FILE + smtp_password secret (sidekiq is internal to the official image), so no SMTP coverage lost. grep sidekiq compose*.yml = 0. VERIFIED two ways: (1) the Adversary's exact lint.py repro flow at 9ff5e19 → R011 pass; (2) my own full cold run /tmp/redfix-discourse-m2verify.log → RUN SUMMARY ... level=5 of 5, all tiers pass (install/upgrade/backup/restore/custom), lint rung: pass. Node clean: no discourse stack, NO discourse canonical (untagged migrated head correctly does not promote: should_promote tagged-gate), recipe reset to published tag 0.8.1+3.5.0. The other 5 fixes are unchanged since their Adversary PASS (keycloak, mumble, gitea, bluesky-pds, mattermost-lts); no re-run needed.

Adversary cold-verify for discourse: clone discourse @9ff5e19, run RECIPE=discourse CCCI_SKIP_FETCH=1 … run_recipe_ci.py → EXPECT level=5 of 5 (lint R011 pass, all tiers pass, both upgrade-overlay tests test_head_runs_official_image_not_bitnamilegacy + test_sidekiq_service_dropped_by_head pass); OR the lint-only repro in F-redfix-1 → R011 pass. grep -c sidekiq ~/.abra/recipes/discourse/compose*.yml @9ff5e19 = 0.
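The shape of the F-redfix-1 defect, an override file defining a service that has no image and no counterpart in the base file, can be detected with a few lines of Python. This is a sketch only: it is not the R011 rule from lint.py (whose exact logic is not shown in this document), the function name is hypothetical, and the compose files are represented as already-parsed dicts with made-up minimal contents.

```python
def imageless_orphans(base, *overrides):
    """List services that appear only in an override and define neither image nor build."""
    base_services = base.get("services") or {}
    orphans = set()
    for override in overrides:
        for name, svc in (override.get("services") or {}).items():
            svc = svc or {}
            if name not in base_services and "image" not in svc and "build" not in svc:
                orphans.add(name)
    return sorted(orphans)


if __name__ == "__main__":
    # Illustrative stand-ins for compose.yml and compose.smtpauth.yml after parsing.
    base = {"services": {"app": {"image": "discourse/discourse:3.5.3"}}}
    before_fix = {"services": {
        "app": {"secrets": ["smtp_password"]},
        "sidekiq": {"secrets": ["smtp_password"]},   # dangling: base no longer has sidekiq
    }}
    after_fix = {"services": {"app": {"secrets": ["smtp_password"]}}}
    print(imageless_orphans(base, before_fix))  # ['sidekiq']
    print(imageless_orphans(base, after_fix))   # []
```

An override of app stays legal because app inherits its image from the base file; only a service with nothing to inherit from is flagged.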


Gate: M2 — original claim (2026-06-18T05:53Z)

WHAT (M2 DoD). All six canon-sweep failures FIXED — each via a recipe PR or a harness improvement — and verified green. No recipe left as a standing exception. Nothing merged (operator merges). Per recipe:

  • mattermost-lts (recipe PR #1) — added pg_backup.sh + postgres backupbot.restore.post-hook so the logical dump round-trips on restore.
  • discourse (recipe PR #4) — migrated the head off deprecated bitnamilegacy to the official discourse/discourse image so the stale PR-faithfulness overlay (test_head_runs_official_image…, test_sidekiq_service_dropped…) passes on the migrated head (NOT a test-weakening).
  • keycloak (harness branch) — canonical_domain returns a collision-free warm-canon-<r> for recipes in warm.WARM_DOMAINS (live-warm OIDC providers); keycloak enrolled (WARM_CANONICAL=True).
  • mumble (harness branch) — handshake readiness budget widened 60s->180s (load-flake stabilization).
  • gitea (recipe PR #2) — app.ini is now seeded into the WRITABLE /etc/gitea volume by docker-setup (if [ ! -s /etc/gitea/app.ini ], seed-on-EMPTY) from the read-only staging config app.ini.init; DOCKER_SETUP_SH_VERSION v1->v3 forces the new docker-setup to re-mount. Gitea 1.24.2 can then persist its JWT secret (the M1 read-only-app.ini crash is gone).
  • bluesky-pds (recipe PR #4) — caddy resolves its OWN app via the fully-qualified swarm name ${STACK_NAME}_app (caddy {$APP_HOST} env, set in the caddy service) instead of bare app, which collided with other stacks' app aliases on the shared proxy net. CADDYFILE_VERSION v1->v2.
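The gitea seed-on-EMPTY rule from the list above is a shell test in the recipe (if [ ! -s /etc/gitea/app.ini ]); the same decision can be expressed in Python to make its semantics explicit. This is an illustrative translation, not the recipe's docker-setup script, and the function name seed_if_empty is hypothetical.

```python
import shutil
from pathlib import Path


def seed_if_empty(staging: Path, target: Path) -> bool:
    """Copy the read-only staging config to the writable target only when the
    target is missing or zero bytes (the meaning of `[ ! -s target ]`).

    Returns True when a seed happened, False when an existing config was kept.
    """
    if target.exists() and target.stat().st_size > 0:
        return False                      # never clobber a config gitea has written to
    shutil.copyfile(staging, target)
    return True


if __name__ == "__main__":
    import tempfile

    workdir = Path(tempfile.mkdtemp())
    staging = workdir / "app.ini.init"
    staging.write_text("[security]\nINSTALL_LOCK = true\n")
    target = workdir / "app.ini"
    target.write_text("")                 # 0-byte file, like the retained pre-fix volume
    print(seed_if_empty(staging, target)) # True: seeded
    print(seed_if_empty(staging, target)) # False: now non-empty, left alone
```

Seeding only on an empty or missing file is what lets a warm reattach keep the JWT secret that Gitea persisted on an earlier start.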

HOW + EXPECTED + WHERE (Adversary cold-verify, one recipe at a time, no concurrent load):

  • mattermost-lts — read-only artifact: /var/lib/cc-ci-runs/901/ on cc-ci — all tiers pass, junit/restore__cc-ci__test_restore.xml testsuite failures=0, test_restore_returns_state pass. OR re-run !testme on PR #1 @4ca7f418. EXPECT restore green.
  • discourse — !testme on PR #4 @53ba0910 (run #849 green) OR run from a checkout of the migrated head: EXPECT install/backup/restore/custom + upgrade overlay all pass (head now official image).
  • keycloak — from a redfix-m2-harness @61211db checkout (CCCI_REPO=), run RECIPE=keycloak CCCI_SKIP_FETCH=1 ... run_recipe_ci.py. EXPECT all cold tiers pass + WC5 promote succeeds at domain warm-canon-keycloak.ci.commoninternet.net (NOT warm-keycloak); live warm-keycloak.ci.commoninternet.net/realms/master stays 200 throughout. Code: canonical.py canonical_domain returns warm-canon- for r in warm.WARM_DOMAINS.
  • mumble: from redfix-m2-harness @07fc6d4 checkout, run RECIPE=mumble CCCI_SKIP_FETCH=1 …. EXPECT all 5 tiers green incl custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence; handshake budget = 36 attempts / 180s (was 60s). (Load-flake is not deterministically reproducible; this verifies the stabilization is applied, sound, non-weakening.)
  • gitea (recipe PR #2 @a0f2db8 on mirror branch ci/app-ini-writable) — DIRECT chaos-deploy proof (the harness WC5 promote is merge-gated, see NOTE). With the idle 3.5.3 canonical present: cd ~/.abra/recipes/gitea && git checkout -f a0f2db8 then chaos-deploy onto the retained canonical volumes (0-byte app.ini = genuine pre-fix 3.5.3 state): abra app deploy warm-gitea.ci.commoninternet.net -C -o -n. EXPECT: service 1/1; the config volume's app.ini seeded 0->~1862 bytes (INSTALL_LOCK = true); /api/v1/version -> 200 {"version":"1.24.2"} and /api/healthz -> 200 (curl inside the app container); retained 3.5.3 data adopted (data dirs dated 2026-06-17T08:39); ZERO read-only file system crashes in docker service logs (M1 crashed here). Evidence: /tmp/redfix-gitea-m2-directproof.log on cc-ci. Teardown: abra app undeploy … -n, truncate the volume app.ini to 0 (restore pre-fix state). canonical.json stays 3.5.3 idle e6a1cc79.
  • bluesky-pds (recipe PR #4 @4987ba9 on mirror branch ci/warm-routing-alias): DIRECT chaos-deploy proof (warm-promote is the only failing path; merge-gated). git checkout -f 4987ba9; generate secrets (abra app secret generate warm-bluesky-pds.ci.commoninternet.net --all -m -C -o -n) + insert a PLC rotation key (tests/bluesky-pds/install_steps.sh logic: 32-byte hex into pds_plc_rotation_key v1); re-checkout 4987ba9 AFTER secret ops (abra secret insert force-fetches+reverts the checkout); abra app deploy warm-bluesky-pds.ci.commoninternet.net -C -o -n (EXPECT caddyfile: v1 -> v2, NEW DEPLOYMENT 4987ba9). EXPECT: app+caddy 1/1; inside caddy getent hosts warm-bluesky-pds_ci_commoninternet_net_app -> a 10.0.x.x INTERNAL ip (own stack) while getent hosts app -> a 10.10.x.x proxy ip (foreign, the M1 collision); caddy log "certificate obtained successfully" with 0 "connection refused"; external curl https://warm-bluesky-pds.ci.commoninternet.net/xrpc/_health -> 200 {"version":"0.4.219"} (M1 was 000). Evidence: /tmp/redfix-bluesky-m2-directproof.log. Teardown: undeploy + remove volumes (caddy_data, pds_data) + secrets (no canonical, matching M1).
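The mumble handshake budget in the list above (36 attempts / 180s, previously 60s) corresponds to a bounded retry loop. The Python sketch below is an assumption-laden illustration rather than the harness code: the 5-second interval is inferred from 180 / 36, the function and parameter names are hypothetical, and the sleep function is injectable so the loop can be exercised without waiting.

```python
import time


def wait_for_handshake(probe, attempts=36, interval=5.0, sleep=time.sleep):
    """Call probe() until it returns True, at most `attempts` times.

    Returns the 1-based attempt number that succeeded; raises TimeoutError
    once the budget is exhausted. Sleeps only between attempts.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(interval)
    raise TimeoutError(f"handshake not complete after {attempts} attempts")


if __name__ == "__main__":
    calls = {"n": 0}

    def slow_server():
        # Simulated loaded node: the handshake only completes on the 14th probe,
        # which a 60s budget at the same interval (12 attempts) would have missed.
        calls["n"] += 1
        return calls["n"] >= 14

    waited = []
    print(wait_for_handshake(slow_server, sleep=waited.append))  # 14
    print(sum(waited))                                           # 65.0 seconds of simulated waiting
```

Widening the budget does not weaken the assertion: the handshake must still complete, it is only given longer to do so under load.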

NOTE — gitea & bluesky end-to-end canonical-promote is OPERATOR-MERGE-GATED (not a shrug). The harness WC5 promote does a recipe_checkout(published-tag)+non-chaos deploy, and BOTH run_recipe_ci.py:373 AND abra force-fetch refs/tags/* from upstream (abra.py:135 documents this), so any local move of the release tag to the fix commit is reverted to the PUBLISHED commit. The published 3.6.0 / 0.3.0 tags do NOT yet carry the fix (PR not merged — operator merges, per phase guardrail), so pre-merge the promote necessarily deploys the unfixed published release. Confirmed empirically: a full gitea harness run's WC5 promote deployed 357926f and crash-looped exactly like M1. The DIRECT chaos-deploy (chaos = deploy the working-tree checkout = the PR fix) is therefore the MAXIMAL + faithful pre-merge proof — it reproduces the EXACT M1 failing scenario (gitea: the retained canonical volumes; bluesky: warm-bluesky-pds on the shared proxy) and shows the fix resolves it. End-to-end canonical advance follows automatically once the operator merges PR #2 / #4 and the release tag carries the fix. This is NOT a standing exception — the defect is fixed + proven; only the registry-advance awaits the operator's merge (the phase's own "nothing merged" constraint).

WHERE (refs). Recipe PRs on git.autonomic.zone/recipe-maintainers/<recipe>: mattermost-lts ci/pg-restore@4ca7f418, discourse discourse-official-image@53ba0910, gitea ci/app-ini-writable @a0f2db8, bluesky-pds ci/warm-routing-alias@4987ba9. cc-ci harness branch redfix-m2-harness@07fc6d4 (keycloak 61211db + mumble 07fc6d4). Reasoning/dead-ends in JOURNAL-redfix.md. Node left clean (only infra + live warm-keycloak 200; gitea idle 3.5.3 volumes retained, canonical e6a1cc79 unchanged; no bluesky/test stacks/volumes/secrets; no run procs).

Gate: M1 — PASS (above).

WHAT (M1 DoD). All six canon-sweep failures investigated in ISOLATION (one recipe at a time, no concurrent sweep load), root-caused with first-hand evidence, and classified (flake vs genuine; recipe vs test vs warm-machinery vs load) — see the M1 results table + HOW the Adversary cold-verifies sections above. Summary: discourse = stale cc-ci overlay test (canon timeout/FATA root-cause was wrong); mattermost-lts = genuine recipe defect (no backupbot.restore.post-hook); mumble = load/timing FLAKE (2× isolation green); bluesky-pds = genuine routing defect (caddy↔app app-alias collision on shared proxy); gitea = genuine recipe defect (read-only app.ini config mount + 3.6.0 JWT save); keycloak = harness warm-domain namespace collision. NO "probably a flake" — every classification has an isolation re-run or code proof.

HOW + EXPECTED + WHERE. Per-recipe cold-verify commands, expected outputs, and evidence paths are in the two sections above ("M1 results table" and "HOW the Adversary cold-verifies each classification"). Evidence logs on cc-ci: /tmp/redfix-{discourse,mattermost-lts,mumble,mumble2,bluesky-pds,gitea2}.log. Reasoning/dead-ends in JOURNAL-redfix.md. Node left clean (see "Node state left clean" above).

Blocked

(none)