cc-ci/machine-docs/STATUS-redfix.md
autonomic-bot 29a28176a9
claim(redfix-M2): discourse F-redfix-1 FIXED + level=5 verified — re-claim 6/6
Dropped orphaned image-less sidekiq from discourse compose.smtpauth.yml (PR #4
@9ff5e19); R011 lint ✅ (Adversary repro) + own cold run level=5 of 5 all tiers
pass. Other 5 fixes unchanged (Adversary PASS). 6/6 verified green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Bacy8SJhBebNAGoYsi1Fxt
2026-06-18 06:55:28 +00:00


# STATUS — phase `redfix`
Phase SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md`
Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds,
gitea, keycloak) → isolate → root-cause → classify (flake vs genuine; recipe vs test vs
warm-machinery vs load) → FIX each (recipe PR or harness improvement) → verify green. No standing
exceptions. Nothing merged.
## Phase: M1 — investigate + isolate + classify (IN PROGRESS)
Bootstrapped 2026-06-17T23:20Z. cc-ci healthy, no run in flight, next scheduled sweep 2026-06-21
(3-day clear window). Disk `/` 38G free (75% used).
### Isolation harness (how I reproduce each failure ALONE)
Each canon-sweep per-recipe run is `runner/nightly_sweep.run_on_tag(recipe, latest)`:
`abra.recipe_checkout(recipe, <latest-tag>)` then `run_recipe_ci.py` with `RECIPE=<r>
CCCI_SKIP_FETCH=1` and REF/QUICK/MODE/VERSION unset (cold, full, head==tag). Isolation = run ONE
recipe at a time with NO concurrent sweep load on the single node (the loaded node is the known
flake source per phase plan §2.1). Runs execute on cc-ci from `/etc/cc-ci`.
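The isolation environment described above can be sketched in Python. This is a minimal illustration, not the harness's own code: `isolation_env` and the `UNSET` tuple are hypothetical names, and the variable names follow the `env -u …` invocation quoted later in this file.

```python
import os

# Variables an isolation run must NOT inherit, so the run is cold, full, head==tag.
# (Names taken from the `env -u ...` invocation in this file; the tuple itself is illustrative.)
UNSET = ("REF", "CCCI_QUICK", "MODE", "VERSION")


def isolation_env(recipe, base_env=None):
    """Return a copy of the environment prepared for a single-recipe isolation run."""
    env = dict(os.environ if base_env is None else base_env)
    for name in UNSET:
        env.pop(name, None)          # drop any inherited override
    env["RECIPE"] = recipe           # the one recipe under test
    env["CCCI_SKIP_FETCH"] = "1"     # keep the checked-out tag, do not re-fetch
    return env
```

The resulting mapping could then be passed as `env=` to `subprocess.run` with `cwd="/etc/cc-ci"`, one recipe at a time.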
### Starting canonical state (cc-ci `/var/lib/ci-warm/<r>/canonical.json`, read 2026-06-17T23:19Z)
| Recipe | Canonical now | Note |
|---|---|---|
| discourse | (none) | no canonical dir |
| mattermost-lts | (none) | no canonical dir |
| mumble | `1.0.0+v1.6.870-0` @ 20260617T180501Z | **canonical PRESENT, written TODAY** — flake signal |
| bluesky-pds | (none) | no canonical dir |
| gitea | `3.5.3+1.24.2-rootless` @ 20260617T083930Z | 3.6.0 advance not promoted |
| keycloak | (none) | de-enrolled (WARM_CANONICAL off) |
### M1 investigation tracker
| Recipe | Isolation run | Result | Root cause | Classification |
|---|---|---|---|---|
| discourse | DONE @23:40Z (`/tmp/redfix-discourse.log` on cc-ci) | install/backup/restore/custom PASS; **upgrade overlay FAIL**. Deploys+serves fine — NOT a timeout/FATA. | cc-ci overlay `tests/discourse/test_upgrade.py` asserts head runs official `discourse/discourse:3.5.3` + drops sidekiq; latest tag `0.8.1+3.5.0` AND main both still `bitnamilegacy/discourse:3.5.0`+sidekiq (migration exists in no release/main). The `depends_on discourse` string is a non-fatal prepull-only warning, not the deploy. | **stale/PR-specific cc-ci OVERLAY test** mismatched to canonical-sweep context (not flake/timeout/recipe-deploy/warm-machinery) |
| mattermost-lts | DONE @00:05Z (`/tmp/redfix-mattermost-lts.log`) | install/upgrade/backup/custom PASS; **restore FAIL** `ci_marker does not exist`; **deterministic in isolation** (not a load race) | recipe `postgres` svc backup labels: backs up hot live PGDATA + dump but has **NO `backupbot.restore.post-hook`** to replay the dump → restore doesn't round-trip postgres. Contrast immich (passes): dump-only `backup.volumes.postgres.path: backup.sql` + `restore.post-hook: /pg_backup.sh restore`. | **genuine RECIPE defect** at latest → recipe PR (adopt immich-style dump+restore-post-hook) |
| mumble | DONE — **2× isolation GREEN** (`/tmp/redfix-mumble.log` + `/tmp/redfix-mumble2.log`) | **ALL tiers PASS** incl. handshake on BOTH runs; no orphans; canonical re-promoted green each time | handshake (TLS+ServerSync) not completing within ~60s retry under heavy concurrent sweep load; fine in isolation | **load/timing FLAKE** → harness stabilization (readiness gate / retry) |
| bluesky-pds | DONE @00:45Z (`/tmp/redfix-bluesky-pds.log` + live diag) | cold lifecycle GREEN; **WC5 promote 000** reproduces (warm /xrpc/_health last status 0). NOT a flake | caddy on-demand TLS (`ask http://app:3000/tls-check`) can't reach app: caddy resolves bare `app` to OTHER stacks' app endpoints on shared `proxy` net (getent app→only 10.10.0.X, never internal 10.0.3.3; proxy has drone/traefik/keycloak/ccci `app` aliases) → no cert → 000. Promote machinery correct (refused to write canonical). | **genuine routing/RECIPE defect** (cross-stack `app`-alias collision on shared proxy) → recipe PR: unique PDS service name/alias. NOT promote-machinery, NOT flake |
| gitea | DONE @00:14Z (`/tmp/redfix-gitea2.log` + live container logs) | cold lifecycle (incl fresh 3.5.3→3.6.0 upgrade) PASS; **warm advance crash-loops** | `LoadCommonSettings() [F] error saving JWT Secret … failed to save "/etc/gitea/app.ini": read-only file system` — gitea 3.6.0/1.24.2 tries to persist a JWT to the read-only app.ini docker-config mount on warm reattach (before DB migration; 3.5.3 data intact). Cold passes (fresh secrets, no rewrite). | **genuine RECIPE defect** (3.6.0 + read-only app.ini config mount on advance) → recipe PR: render app.ini into the writable config volume. (1st gitea run hit a nixenv "already deployed" leftover confound — fixed by undeploying to idle then re-running) |
| keycloak | DONE @01:05Z (code-verified; no run) | de-enrolled. `canonical_domain("keycloak")` == `WARM_DOMAINS["keycloak"]` == `warm-keycloak.ci.commoninternet.net` EXACTLY (canonical.py:42, warm.py:27,44). Live keycloak 200 /realms/master. | data-warm canonical domain uses same `warm-<r>` scheme as the live-warm OIDC provider → promote would collide with live shared SSO. No collision-free canonical namespace exists. | **HARNESS defect** (warm-domain namespace collision) → fix: collision-free `canonical_domain` for live-warm providers (`warm-canon-<r>`), then enroll keycloak |
### M1 results table (recipe → failure → isolation result → root cause → classification → fix approach)
| Recipe | Canon-sweep failure | Isolation result | Flake or genuine | Root cause | Class | Fix approach (M2) |
|---|---|---|---|---|---|---|
| discourse | "cold-deploy timeout / deploy FATA" | install/backup/restore/custom GREEN; **upgrade overlay RED** | **genuine (deterministic)** — but the canon root-cause was WRONG (no timeout, no deploy FATA) | cc-ci overlay `tests/discourse/test_upgrade.py` asserts head = official `discourse/discourse:3.5.3` + sidekiq dropped; that migration is in NO release tag and NOT in main (all use `bitnamilegacy/discourse:3.5.0`+sidekiq) | **stale/PR-specific cc-ci OVERLAY test** | make the overlay assert migration-faithfulness only when the head IS that migration (not vs a release tag), OR a recipe PR migrating off deprecated bitnamilegacy — settle in M2 (NOT a test-weakening) |
| mattermost-lts | `test_restore_returns_state` RED | install/upgrade/backup/custom GREEN; **restore RED** (`ci_marker does not exist`) | **genuine (deterministic in isolation)** — NOT the canon "loaded-node race" | recipe postgres backup labels back up hot PGDATA + a dump but have **no `backupbot.restore.post-hook`** to replay it; restore doesn't round-trip. immich (passes) uses dump-only path + `restore.post-hook` | **genuine RECIPE defect** | recipe PR: adopt immich-style postgres dump + `backupbot.restore.post-hook` replay |
| mumble | `test_handshake…` RED | **ALL tiers GREEN** in isolation (×N) incl. handshake | **FLAKE (load/timing)** | handshake (TLS+ServerSync) doesn't complete within the 60s retry under heavy concurrent sweep load; fine isolated; canonical written green today | **load/concurrency FLAKE** | harness stabilization: stronger readiness gate before the custom tier / longer-or-smarter handshake retry |
| bluesky-pds | warm promote `/xrpc/_health` → 000 | cold lifecycle GREEN; **warm promote 000 reproduces** | **genuine (deterministic)** — NOT a load/rate-limit flake | caddy on-demand TLS calls `http://app:3000/tls-check`; caddy resolves bare `app` to OTHER stacks' app endpoints on the shared `proxy` net (every stack aliases its main svc `app`), never bluesky's own internal app (10.0.3.3) → connection refused → no cert → 000 | **genuine ROUTING/RECIPE defect** (cross-stack `app`-alias collision) | recipe PR: give the PDS service a unique name/alias so caddy resolves only bluesky's app |
| gitea | 3.5.3→3.6.0 warm advance doesn't promote | cold (incl fresh upgrade) GREEN; **warm advance crash-loops** | **genuine (deterministic)** | gitea 3.6.0/1.24.2 saves a JWT secret to `/etc/gitea/app.ini` on warm reattach; app.ini is a **read-only docker-config mount** → `read-only file system` FATA at LoadCommonSettings (pre-migration; 3.5.3 data intact). Cold passes (fresh secrets, no rewrite) | **genuine RECIPE defect** | recipe PR: render app.ini into the writable `config:/etc/gitea` volume (entrypoint) instead of a read-only docker config |
| keycloak | de-enrolled (not tested) | code-verified (no run) | **genuine (structural)** | `canonical_domain("keycloak")` == `WARM_DOMAINS["keycloak"]` == `warm-keycloak.ci.commoninternet.net` EXACTLY → a data-warm canonical would collide with the live-warm OIDC provider | **HARNESS defect** (warm-domain namespace collision) | harness: collision-free `canonical_domain` for live-warm providers (`warm-canon-<r>`), then enroll keycloak (WARM_CANONICAL=True) |
### HOW the Adversary cold-verifies each classification (run ONE recipe at a time, no concurrent load)
Isolation invocation (per recipe `R` at latest tag `T`), from `/etc/cc-ci` on cc-ci:
`git -C ~/.abra/recipes/R checkout -f --quiet T && env -u REF -u CCCI_QUICK -u MODE -u VERSION RECIPE=R CCCI_SKIP_FETCH=1 cc-ci-run runner/run_recipe_ci.py`
Latest tags: discourse `0.8.1+3.5.0`, mattermost-lts `2.1.9+10.11.15`, mumble `1.0.0+v1.6.870-0`, bluesky-pds `0.3.0+v0.4.219`, gitea `3.6.0+1.24.2-rootless`.
- **discourse** — EXPECT install/backup/restore/custom pass, upgrade fail on `test_head_runs_official_image_not_bitnamilegacy` + `test_sidekiq_service_dropped_by_head`. Confirm the overlay mismatch statically: `git -C ~/.abra/recipes/discourse show 0.8.1+3.5.0:compose.yml | grep -A1 ' app:'` and `... show main:compose.yml` both = `bitnamilegacy/discourse:3.5.0`; `grep -c 'sidekiq:'` = 1 in both. So the test's `discourse/discourse:3.5.3`/no-sidekiq expectation exists nowhere upstream.
- **mattermost-lts** — EXPECT restore fail `relation "ci_marker" does not exist`. Confirm root cause statically: `git -C ~/.abra/recipes/mattermost-lts show 2.1.9+10.11.15:compose.yml | grep backupbot` shows pre-hook + `backup.path` but NO `restore.post-hook`; immich `git -C ~/.abra/recipes/immich show <latest>:compose.yml | grep backupbot` shows `restore.post-hook: /pg_backup.sh restore`.
- **mumble** — EXPECT all tiers green (run 2-3× to confirm reproducibly green isolated). Canonical written green: `cat /var/lib/ci-warm/mumble/canonical.json`.
- **bluesky-pds** — EXPECT cold green, WC5 promote `!! WC5 promote failed … warm-bluesky-pds … last status 0`. While the warm stack is up, confirm root cause: caddy logs `dial tcp 10.10.0.X:3000: connect: connection refused` for `app:3000/tls-check`; `docker exec <caddy> getent hosts app` returns proxy IPs (10.10.0.X), the app's real internal IP is 10.0.3.x; `docker network inspect proxy | grep _app` shows many stacks aliasing `app`. (Tear down the orphaned warm-bluesky-pds stack + volumes after.)
- **gitea** — REQUIRES idle canonical first: if warm-gitea is deployed, `docker stack rm warm-gitea_ci_commoninternet_net` (retains data+config volumes) so the advance reattaches from idle. EXPECT cold green, warm advance crash-loop with container log `LoadCommonSettings() [F] error saving JWT Secret … "/etc/gitea/app.ini": read-only file system`. Restore: leave warm-gitea undeployed (idle 3.5.3, volumes retained) — registry stays `3.5.3+1.24.2-rootless`.
- **keycloak** — no run. Code-verify: `canonical.canonical_domain('keycloak')` → `warm.stable_domain('keycloak')` → `warm-keycloak.ci.commoninternet.net`; `warm.WARM_DOMAINS['keycloak']` == same string (runner/harness/canonical.py:42-44, warm.py:27-29,44-48). Live keycloak 200 on `/realms/master`.
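The bluesky-pds `getent hosts app` check above amounts to asking whether any resolved address belongs to the stack's own internal network. A rough sketch follows, assuming the proxy overlay is `10.10.0.0/16` and the stack-internal network is `10.0.3.0/24`. Both prefix lengths are assumptions; only the `10.10.0.X` and `10.0.3.x` address patterns come from the observations above. `classify_getent` and `alias_collides` are hypothetical helper names.

```python
import ipaddress

# Assumed network ranges (prefix lengths are NOT stated in this file).
PROXY_NET = ipaddress.ip_network("10.10.0.0/16")
INTERNAL_NET = ipaddress.ip_network("10.0.3.0/24")


def classify_getent(output):
    """Label each address in `getent hosts` output as 'proxy', 'internal' or 'other'."""
    labels = {}
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        addr = ipaddress.ip_address(fields[0])
        if addr in PROXY_NET:
            labels[fields[0]] = "proxy"
        elif addr in INTERNAL_NET:
            labels[fields[0]] = "internal"
        else:
            labels[fields[0]] = "other"
    return labels


def alias_collides(output):
    """True when the name resolved, but never to the stack's own internal network."""
    labels = classify_getent(output).values()
    return bool(labels) and "internal" not in labels
```

Feeding it the output of `docker exec <caddy> getent hosts app` would flag the M1 condition when every returned address is a proxy-network one.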
### Node state left clean
All isolation runs torn down; orphaned warm-bluesky-pds stack+volumes removed; warm-gitea restored to idle 3.5.3 (volumes retained, registry unchanged); only live warm-keycloak deployed (healthy). No `run_recipe_ci.py` processes.
## M1 — PASS @ 2026-06-18T01:18Z (REVIEW-redfix.md; all 6 classifications cold-verified CORRECT by Adversary's own isolation re-runs). No VETO. Cleared to M2.
## Phase: M2 — FIX + verify all six (IN PROGRESS)
Fix designs locked in BACKLOG-redfix.md. Recipe PRs (mattermost-lts/bluesky/gitea) on git.autonomic.zone
mirrors via the recipe mirror+PR flow, verified `!testme` (NEVER merge). Harness fixes (keycloak/mumble)
on a cc-ci branch, verified via the harness. discourse: overlay-scope decision. Node now free for my
deploys (Adversary done with M1).
### M2 fix tracker (updated 2026-06-18T05:53Z — ALL VERIFIED)
| Recipe | Class | Fix | PR/branch + ref | Status |
|---|---|---|---|---|
| mattermost-lts | recipe defect | pg_backup.sh + `backupbot.restore.post-hook` (immich pattern) | mirror PR #1 `ci/pg-restore` @4ca7f418 | **VERIFIED** — !testme run #901 ALL tiers green incl `test_restore_returns_state` |
| discourse | stale cc-ci overlay | recipe: bitnamilegacy->official discourse image migration + drop orphaned image-less sidekiq from compose.smtpauth.yml (F-redfix-1) | mirror PR #4 `discourse-official-image` @9ff5e19 | **VERIFIED** — own cold run `/tmp/redfix-discourse-m2verify.log` **level=5 of 5** (all tiers + lint R011 PASS); F-redfix-1 regression fixed |
| keycloak | harness defect | collision-free `canonical_domain` (`warm-canon-<r>` for WARM_DOMAINS recipes) + enroll | cc-ci branch `redfix-m2-harness` @61211db | **VERIFIED** — branch-checkout run promotes at warm-canon-keycloak; live warm-keycloak 200 throughout |
| mumble | load/timing flake | harness: handshake readiness budget 60s->180s | cc-ci branch `redfix-m2-harness` @07fc6d4 | **VERIFIED** — branch-checkout run all tiers green incl handshake; budget active+non-regressing |
| gitea | recipe defect | app.ini->staging `/etc/gitea/app.ini.init` + docker-setup seed-on-EMPTY + DOCKER_SETUP_SH_VERSION v3 | mirror PR #2 `ci/app-ini-writable` @a0f2db8 | **VERIFIED** (direct chaos-deploy; promote merge-gated — see below) |
| bluesky-pds | recipe defect (routing) | caddy `{$APP_HOST}=${STACK_NAME}_app` (operator: NO rename) + CADDYFILE_VERSION v2 | mirror PR #4 `ci/warm-routing-alias` @4987ba9 | **VERIFIED** (direct chaos-deploy; promote merge-gated — see below) |
cc-ci-side change verification: run from a checkout of `redfix-m2-harness` (CCCI_REPO=<checkout>);
never touches /etc/cc-ci main. `redfix-m2-harness` is now mumble+keycloak ONLY (bluesky needs no
cc-ci change with the ${STACK_NAME}_app approach; the rename's exec-ref commit b96b8a4 was dropped).
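For orientation, the keycloak harness fix in the tracker can be pictured as below. This is a simplified sketch and not the contents of `canonical.py` or `warm.py`: the `BASE` constant and the function bodies are assumptions, and only the domain strings and the `warm-canon-<r>` scheme come from this file.

```python
# Illustrative stand-ins for the harness modules; not the real source.
BASE = "ci.commoninternet.net"
WARM_DOMAINS = {"keycloak": "warm-keycloak.ci.commoninternet.net"}  # live-warm providers


def stable_domain(recipe):
    """Default warm domain scheme: warm-<recipe>."""
    return f"warm-{recipe}.{BASE}"


def canonical_domain(recipe):
    """Data-warm canonical domain, kept off the live-warm provider's domain."""
    if recipe in WARM_DOMAINS:
        # Live-warm provider: use a separate namespace so promote never collides.
        return f"warm-canon-{recipe}.{BASE}"
    return stable_domain(recipe)
```

Under this sketch, recipes outside `WARM_DOMAINS` keep their existing `warm-<r>` canonical domain, so only live-warm providers change namespace.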
## Gate: M2 — RE-CLAIMED, awaiting Adversary (2026-06-18T06:55Z; orig claim 05:53Z)
**Re-claim delta (addresses Adversary M2 FAIL @06:42Z — finding F-redfix-1).** The first M2 verdict was
FAIL on discourse ONLY (other 5 PASS, do-not-redo). F-redfix-1: the official-image migration dropped
`sidekiq` from compose.yml but left a dangling image-less `sidekiq:` block in `compose.smtpauth.yml`
→ L5 lint R011 fail (run level=4) + broken SMTP-auth deploy. **FIXED** in PR #4 `discourse-official-image`
@**9ff5e19** (force-pushed onto @53ba0910): dropped the orphaned `sidekiq:` block; the `app:` override
already carries `DISCOURSE_SMTP_PASSWORD_FILE` + `smtp_password` secret (sidekiq is internal to the
official image), so no SMTP coverage lost. `grep sidekiq compose*.yml` = 0.
**VERIFIED two ways:** (1) the Adversary's exact lint.py repro flow at 9ff5e19 → **R011 ✅**; (2) my own
full cold run `/tmp/redfix-discourse-m2verify.log` → `RUN SUMMARY ... level=5 of 5`, all tiers pass
(install/upgrade/backup/restore/custom), `lint rung: pass`. Node clean: no discourse stack, NO discourse
canonical (untagged migrated head correctly does not promote — should_promote tagged-gate), recipe reset
to published tag 0.8.1+3.5.0. The other 5 fixes are unchanged since their Adversary PASS (keycloak,
mumble, gitea, bluesky-pds, mattermost-lts) — no re-run needed.
Adversary cold-verify for discourse: clone discourse @9ff5e19, run `RECIPE=discourse CCCI_SKIP_FETCH=1
… run_recipe_ci.py` → EXPECT level=5 of 5 (lint R011 ✅, all tiers pass, both upgrade-overlay tests
`test_head_runs_official_image_not_bitnamilegacy` + `test_sidekiq_service_dropped_by_head` pass); OR the
lint-only repro in F-redfix-1 → R011 ✅. `grep -c sidekiq ~/.abra/recipes/discourse/compose*.yml` @9ff5e19 = 0.
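The `grep -c sidekiq` check above can also be scripted. The following is a minimal sketch (the helper name `count_matches` is hypothetical) that returns a per-file count of lines mentioning the needle, which is what `grep -c` reports for several files.

```python
from pathlib import Path


def count_matches(recipe_dir, needle="sidekiq", pattern="compose*.yml"):
    """Return {filename: number of lines containing `needle`} for matching files."""
    counts = {}
    for path in sorted(Path(recipe_dir).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        counts[path.name] = sum(1 for line in text.splitlines() if needle in line)
    return counts
```

A clean head at the fix commit would be expected to yield zero for every compose file, for example `all(n == 0 for n in count_matches(recipe_dir).values())`.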
---
## Gate: M2 — original claim (2026-06-18T05:53Z)
**WHAT (M2 DoD).** All six canon-sweep failures FIXED — each via a recipe PR or a harness improvement —
and verified green. No recipe left as a standing exception. Nothing merged (operator merges). Per recipe:
- **mattermost-lts** (recipe PR #1) — added `pg_backup.sh` + postgres `backupbot.restore.post-hook` so
the logical dump round-trips on restore.
- **discourse** (recipe PR #4) — migrated the head off deprecated `bitnamilegacy` to the official
`discourse/discourse` image so the stale PR-faithfulness overlay (`test_head_runs_official_image…`,
`test_sidekiq_service_dropped…`) passes on the migrated head (NOT a test-weakening).
- **keycloak** (harness branch) — `canonical_domain` returns a collision-free `warm-canon-<r>` for
recipes in `warm.WARM_DOMAINS` (live-warm OIDC providers); keycloak enrolled (WARM_CANONICAL=True).
- **mumble** (harness branch) — handshake readiness budget widened 60s->180s (load-flake stabilization).
- **gitea** (recipe PR #2) — app.ini is now seeded into the WRITABLE `/etc/gitea` volume by
docker-setup (`if [ ! -s /etc/gitea/app.ini ]`, seed-on-EMPTY) from the read-only staging config
`app.ini.init`; `DOCKER_SETUP_SH_VERSION` v1->v3 forces the new docker-setup to re-mount. Gitea
1.24.2 can then persist its JWT secret (the M1 read-only-app.ini crash is gone).
- **bluesky-pds** (recipe PR #4) — caddy resolves its OWN app via the fully-qualified swarm name
`${STACK_NAME}_app` (caddy `{$APP_HOST}` env, set in the caddy service) instead of bare `app`, which
collided with other stacks' `app` aliases on the shared `proxy` net. CADDYFILE_VERSION v1->v2.
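The gitea seed-on-EMPTY rule (`[ ! -s /etc/gitea/app.ini ]` in the recipe's docker-setup) can be restated in Python for clarity. This is an illustrative equivalent, not the recipe's script, and `seed_if_empty` is a hypothetical name.

```python
import shutil
from pathlib import Path


def seed_if_empty(staging, target):
    """Copy `staging` to `target` only when `target` is missing or zero bytes.

    Mirrors the shell test `[ ! -s target ]`; returns True when a seed happened.
    """
    target = Path(target)
    if target.exists() and target.stat().st_size > 0:
        return False  # already populated: keep whatever gitea persisted there
    shutil.copyfile(staging, target)
    return True
```

The point of seeding only on empty is that a previously written `app.ini`, including any persisted JWT secret, is never overwritten on later deploys.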
**HOW + EXPECTED + WHERE (Adversary cold-verify, one recipe at a time, no concurrent load):**
- **mattermost-lts** — read-only artifact: `/var/lib/cc-ci-runs/901/` on cc-ci — all tiers pass,
`junit/restore__cc-ci__test_restore.xml` testsuite failures=0, `test_restore_returns_state` pass.
OR re-run !testme on PR #1 @4ca7f418. EXPECT restore green.
- **discourse** — !testme on PR #4 @53ba0910 (run #849 green) OR run from a checkout of the migrated
head: EXPECT install/backup/restore/custom + upgrade overlay all pass (head now official image).
- **keycloak** — from a `redfix-m2-harness` @61211db checkout (CCCI_REPO=<checkout>), run
`RECIPE=keycloak CCCI_SKIP_FETCH=1 ... run_recipe_ci.py`. EXPECT all cold tiers pass + WC5 promote
succeeds at domain `warm-canon-keycloak.ci.commoninternet.net` (NOT warm-keycloak); live
`warm-keycloak.ci.commoninternet.net/realms/master` stays 200 throughout. Code: `canonical.py`
canonical_domain returns warm-canon-<r> for r in warm.WARM_DOMAINS.
- **mumble** — from `redfix-m2-harness` @07fc6d4 checkout, run `RECIPE=mumble CCCI_SKIP_FETCH=1 …`.
EXPECT all 5 tiers green incl `custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence`;
handshake budget = 36 attempts / 180s (was 60s). (Load-flake is not
deterministically reproducible; this verifies the stabilization is applied, sound, non-weakening.)
- **gitea** (recipe PR #2 @a0f2db8 on mirror branch `ci/app-ini-writable`) — DIRECT chaos-deploy proof
(the harness WC5 promote is merge-gated, see NOTE). With the idle 3.5.3 canonical present:
`cd ~/.abra/recipes/gitea && git checkout -f a0f2db8` then chaos-deploy onto the retained canonical
volumes (0-byte app.ini = genuine pre-fix 3.5.3 state):
`abra app deploy warm-gitea.ci.commoninternet.net -C -o -n`. EXPECT: service 1/1; the config volume's
`app.ini` seeded 0->~1862 bytes (`INSTALL_LOCK = true`); `/api/v1/version` -> 200 {"version":"1.24.2"}
and `/api/healthz` -> 200 (curl inside the app container); retained 3.5.3 data adopted (data dirs
dated 2026-06-17T08:39); ZERO `read-only file system` crashes in `docker service logs` (M1 crashed
here). Evidence: `/tmp/redfix-gitea-m2-directproof.log` on cc-ci. Teardown: `abra app undeploy … -n`,
truncate the volume app.ini to 0 (restore pre-fix state). canonical.json stays 3.5.3 idle e6a1cc79.
- **bluesky-pds** (recipe PR #4 @4987ba9 on mirror branch `ci/warm-routing-alias`) — DIRECT chaos-deploy
proof (warm-promote is the only failing path; merge-gated). `git checkout -f 4987ba9`; generate
secrets (`abra app secret generate warm-bluesky-pds.ci.commoninternet.net --all -m -C -o -n`) + insert
a PLC rotation key (tests/bluesky-pds/install_steps.sh logic: 32-byte hex into pds_plc_rotation_key
v1); **re-checkout 4987ba9 AFTER secret ops** (abra secret insert force-fetches+reverts the checkout);
`abra app deploy warm-bluesky-pds.ci.commoninternet.net -C -o -n` (EXPECT `caddyfile: v1 -> v2`,
NEW DEPLOYMENT 4987ba9). EXPECT: app+caddy 1/1; inside caddy `getent hosts
warm-bluesky-pds_ci_commoninternet_net_app` -> a 10.0.x.x INTERNAL ip (own stack) while
`getent hosts app` -> a 10.10.x.x proxy ip (foreign, the M1 collision); caddy log "certificate
obtained successfully" with 0 "connection refused"; external
`curl https://warm-bluesky-pds.ci.commoninternet.net/xrpc/_health` -> **200** {"version":"0.4.219"} (M1 was 000). Evidence:
`/tmp/redfix-bluesky-m2-directproof.log`. Teardown: undeploy + remove volumes (caddy_data, pds_data)
+ secrets (no canonical, matching M1).
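The mumble handshake budget above (36 attempts / 180s) can be sketched as a bounded retry loop. This is not the test's actual code: `wait_for_handshake` and `probe` are hypothetical, and the 5 s interval is inferred from 180 s / 36 attempts rather than stated in this file.

```python
import time


def wait_for_handshake(probe, attempts=36, interval=5.0, sleep=time.sleep):
    """Call `probe()` until it returns True; return the attempt number that succeeded.

    Raises TimeoutError once the attempt budget is exhausted.
    `sleep` is injectable so the loop can be exercised without real waiting.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(interval)  # wait before the next attempt, not after the last
    raise TimeoutError(f"handshake not complete after {attempts} attempts")
```

Widening the budget this way only extends how long the test waits for a real handshake; the assertion itself is unchanged, which is why it is stabilization rather than weakening.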
**NOTE — gitea & bluesky end-to-end canonical-promote is OPERATOR-MERGE-GATED (not a shrug).** The
harness WC5 promote does a recipe_checkout(published-tag)+non-chaos deploy, and BOTH run_recipe_ci.py:373
AND abra force-fetch `refs/tags/*` from upstream (abra.py:135 documents this), so any local move of the
release tag to the fix commit is reverted to the PUBLISHED commit. The published 3.6.0 / 0.3.0 tags do
NOT yet carry the fix (PR not merged — operator merges, per phase guardrail), so pre-merge the promote
necessarily deploys the unfixed published release. Confirmed empirically: a full gitea harness run's WC5
promote deployed 357926f and crash-looped exactly like M1. The DIRECT chaos-deploy (chaos = deploy the
working-tree checkout = the PR fix) is therefore the MAXIMAL + faithful pre-merge proof — it reproduces
the EXACT M1 failing scenario (gitea: the retained canonical volumes; bluesky: warm-bluesky-pds on the
shared proxy) and shows the fix resolves it. End-to-end canonical advance follows automatically once the
operator merges PR #2 / #4 and the release tag carries the fix. This is NOT a standing exception — the
defect is fixed + proven; only the registry-advance awaits the operator's merge (the phase's own
"nothing merged" constraint).
**WHERE (refs).** Recipe PRs on `git.autonomic.zone/recipe-maintainers/<recipe>`: mattermost-lts
`ci/pg-restore`@4ca7f418, discourse `discourse-official-image`@53ba0910, gitea `ci/app-ini-writable`
@a0f2db8, bluesky-pds `ci/warm-routing-alias`@4987ba9. cc-ci harness branch
`redfix-m2-harness`@07fc6d4 (keycloak 61211db + mumble 07fc6d4). Reasoning/dead-ends in
JOURNAL-redfix.md. Node left clean (only infra + live warm-keycloak 200; gitea idle 3.5.3 volumes
retained, canonical e6a1cc79 unchanged; no bluesky/test stacks/volumes/secrets; no run procs).
## Gate: M1 — PASS (above).
**WHAT (M1 DoD).** All six canon-sweep failures investigated in ISOLATION (one recipe at a time, no
concurrent sweep load), root-caused with first-hand evidence, and classified (flake vs genuine; recipe
vs test vs warm-machinery vs load) — see the **M1 results table** + **HOW the Adversary cold-verifies**
sections above. Summary: discourse = stale cc-ci overlay test (canon timeout/FATA root-cause was
wrong); mattermost-lts = genuine recipe defect (no `backupbot.restore.post-hook`); mumble = load/timing
FLAKE (2× isolation green); bluesky-pds = genuine routing defect (caddy↔app `app`-alias collision on
shared proxy); gitea = genuine recipe defect (read-only app.ini config mount + 3.6.0 JWT save);
keycloak = harness warm-domain namespace collision. NO "probably a flake" — every classification has
an isolation re-run or code proof.
**HOW + EXPECTED + WHERE.** Per-recipe cold-verify commands, expected outputs, and evidence paths are
in the two sections above ("M1 results table" and "HOW the Adversary cold-verifies each classification").
Evidence logs on cc-ci: `/tmp/redfix-{discourse,mattermost-lts,mumble,mumble2,bluesky-pds,gitea2}.log`.
Reasoning/dead-ends in JOURNAL-redfix.md. Node left clean (see "Node state left clean" above).
## Blocked
(none)