# Phase `redfix` — investigate every canon-sweep failure + open fix PRs

**Mission (operator-specified 2026-06-17):** the `canon` cold-full-lifecycle sweep (every recipe's latest release deployed from scratch + warm-promote) surfaced failures the **upgrade-only weekly run never tests**. Investigate **all** of them, root-cause each, and **FIX every one** — each via **a recipe PR or a harness improvement** (operator 2026-06-17). **None is left as a standing exception** (keycloak's de-enrollment and gitea's "documented limitation" are NOT acceptable end states — they must be fixed). The operator recalls **mattermost-lts and mumble passing before**, so **flake-vs-genuine classification (via isolated re-runs) is the crux** — but note a flake still gets a fix (a harness stabilization improvement), not a shrug.

State files: `STATUS-redfix.md`, `BACKLOG-redfix.md`, `REVIEW-redfix.md`, `JOURNAL-redfix.md`. DECISIONS.md shared.

## 1. The failing set (from canon DECISIONS §canon exceptions)

| Recipe | What happened in the sweep | First hypothesis |
|---|---|---|
| **discourse** | cold-deploy **timeout** (rc=142/143, ~51-min wedge) | slow Rails boot under load → timeout/headroom, maybe the start-first rollout |
| **mattermost-lts** | `test_restore.py::test_restore_returns_state` FAILED at latest | restore-tier — possibly the known backup/restore db-cycle race on the loaded node |
| **mumble** | `custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence` FAILED | readiness/timing of the protocol handshake |
| **bluesky-pds** | cold test **GREEN**, warm-canonical promote FAILED (`warm-bluesky-pds…` → 000, cold domain + other 15 warm domains route fine) | warm-canonical-deploy **routing machinery** (cc-ci side) |
| **gitea** | cold test **GREEN**, `3.5.3→3.6.0` warm **advance** doesn't promote (app.ini read-only) | recipe: read-only `app.ini` blocks in-place warm advance (ties to LFS PR #1) |
| **keycloak** | **de-enrolled** (not tested) — live-warm OIDC dep collision on `warm-keycloak…` | **harness fix:** canonical warm deploys need a domain/namespace that can't collide with a live service, so keycloak can enroll |

## 2. Method — per recipe: isolate → root-cause → classify → fix-PR

1. **Reproduce in ISOLATION first.** Re-run each failing recipe through the harness **alone** (not under concurrent sweep load) on cc-ci. The loaded single node during the sweep is a known flake source (backup/restore db-cycle races, deploy timeouts). This directly answers the operator's "they passed before": if a recipe is **green in isolation**, it's a load/concurrency **flake**, not a recipe defect — and the fix is harness stabilization (or a documented known-flake), NOT a recipe change.
2. **Root-cause** with real evidence — the actual assertion/error, deploy logs (`docker service logs`), the wedge cause. No guessing.
3. **Classify** each precisely:
   - **Recipe defect at latest** → a **recipe PR** on the `git.autonomic.zone` mirror via the `recipe-upgrade`/`recipe-upstream` flow, verified with `!testme` (NEVER merge — operator).
   - **Stale/wrong cc-ci test** → per the standing discipline: leave a PR **comment**, or a cc-ci **test PR** only under the `--with-tests` gate. **Never weaken a test** to turn a red green.
   - **Warm-canonical machinery defect** (bluesky routing) → a **cc-ci branch PR** (single-writer; never push main).
   - **Load/concurrency flake** → a harness **stabilization** fix (timeout/serialization/readiness/retry) as a cc-ci PR, or a documented known-flake if genuinely environmental.
4. **Open the fix PR** for each and **verify it** (recipe PR green via `!testme`; cc-ci PR via the harness). One PR per fix; capture URLs.

## 3. Per-recipe specifics to chase

- **discourse** — characterise the timeout/wedge: is it just headroom (raise `DEPLOY_TIMEOUT` / give it an uncontended deploy) or a real convergence bug? It cold-deploys fine in the weekly with headroom. Fix = cc-ci timeout/serialization tuning and/or a recipe convergence fix — whichever the evidence shows.
- **mattermost-lts** — does `test_restore_returns_state` fail in isolation? If green → the loaded-node restore race (cf. plausible/ghost) → stabilize (e.g. a `BACKUP_VERIFY`-style settle/retry). If red → diagnose the restore (recipe vs test) and fix the right side.
- **mumble** — does the handshake test fail in isolation? Likely a readiness/timing gap (handshake before the channel is up) → readiness probe / retry, or a genuine recipe/test issue.
- **bluesky-pds** — why does `warm-bluesky-pds…` return 000 while the cold-test domain and the other 15 warm domains route fine? Find the warm-deploy routing defect (cc-ci warm machinery) and fix it so bluesky promotes; it should then drop from the exception list.
- **gitea** — **fix it** so it promotes `3.6.0`: either a **recipe PR** making `app.ini` writable for the in-place warm advance, OR a **harness improvement** so the warm advance falls back to a clean re-deploy when an in-place config rewrite isn't possible. Pick per evidence; the bar is gitea promotes `3.6.0` (ties to the LFS `app.ini` change in PR #1). Not "documented as inherent."
- **keycloak** — **fix via a harness improvement** so it can be enrolled safely: canonical warm deploys must use a domain/namespace that can **never** collide with a live shared service (e.g. a dedicated `warm-canon-` / per-canonical namespace), so keycloak's canonical deploys without touching the live OIDC service. (Likely shares a root cause with bluesky's warm-routing — a single warm-domain fix may resolve both.) Don't risk the live keycloak — the fix is isolation, then enroll.

## 4. Gates

**M1 — all investigated, isolated, classified.** Every recipe in §1 re-run in isolation with evidence; a results table `recipe → failure → isolation result (flake|genuine) → root cause → classification → fix approach`. Adversary cold-verifies the classifications: a claimed **flake** is reproducibly green in isolation (and red under load); a claimed **recipe defect** is genuinely the recipe (not a stale test or a harness artifact); a claimed **warm-machinery** bug is in cc-ci, not the recipe. No "it's probably a flake" without an isolation re-run proving it.

**M2 — all six FIXED + verified.** For each, the appropriate fix (a recipe PR or a harness improvement) exists and is **verified green** (recipe PR via `!testme`; harness/cc-ci PR via the harness; flake-stabilization re-run green under load). **All six promote/pass after the fix** — bluesky promotes (warm-routing fixed), gitea promotes `3.6.0`, keycloak enrolled + promotes via the collision-free warm domain, discourse converges in time, mattermost-lts + mumble green (stabilized if flake, fixed if real). **No recipe left as a standing exception.** If one genuinely cannot be fixed at the recipe level, the fix moves to the harness (or a recipe PR + a tracked upstream issue) — never a shrug. **Nothing merged** (operator merges). A short report: per recipe, flake-or-real, the fix, and the verification.

Fresh Adversary PASS on both milestones → `## DONE`.

## 5. Guardrails

- **Isolation before blame.** Never fix a load-flake as a recipe bug, or a recipe bug as a flake — the isolation re-run is mandatory evidence (especially for mattermost-lts/mumble, which the operator saw pass).
- **Recipe mirrors are PR-only** — recipe defects get a PR, `!testme`-verified, **never merged**. **Never weaken a cc-ci test** to make a recipe green; a stale test gets a comment or a `--with-tests` test PR.
- **cc-ci changes** on a dedicated branch (single-writer; never push main, never disturb other clones).
- **Shared swarm:** ≤2–3 concurrent deploys; tear down every deploy on every exit path; mind the live warm services (keycloak especially). Host changes coordinated (loops may rebuild if clean + verify health).
- Honest reporting — a flake is labelled a flake with proof; a recipe left unfixed is called out with the reason. Commit author `autonomic-bot `; push every commit; abra over a pseudo-TTY.

## 6. Definition of Done

All six canon-sweep failures (discourse, mattermost-lts, mumble, bluesky-pds, gitea, keycloak) investigated in isolation, root-caused, classified (flake vs genuine; recipe vs test vs warm-machinery vs load), and **FIXED — each via a recipe PR or a harness improvement — and verified green**: bluesky promotes (warm-routing fixed), gitea promotes `3.6.0`, keycloak enrolled via a collision-free warm domain, discourse converges, mattermost-lts + mumble green (stabilized or fixed). **None left as a standing exception.** Nothing merged (operator merges); a clear flake-vs-real report with the fix + verification per recipe. M1 + M2 fresh Adversary PASSes in REVIEW-redfix.md.
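
As an illustration of the flake-vs-genuine rule behind gate M1, the sketch below shows one way the isolation evidence could be turned into the results table. Everything in it is hypothetical: the `Evidence` fields, the `Verdict` names, and the `unproven` outcome (green both in isolation and under load) are assumptions made for this example, not part of the cc-ci harness. The sample row uses placeholder values, not real sweep results.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Verdict(Enum):
    FLAKE = "flake"        # green in isolation, red under load
    GENUINE = "genuine"    # red even in isolation
    UNPROVEN = "unproven"  # assumption: green both ways, failure not reproduced


@dataclass(frozen=True)
class Evidence:
    """One row of evidence for a failing recipe (hypothetical structure)."""
    recipe: str
    failure: str
    isolated_green: bool   # result of the isolated re-run
    loaded_green: bool     # result of the re-run under sweep-like load
    root_cause: str
    classification: str    # e.g. recipe defect, stale test, warm-machinery, load flake
    fix_approach: str


def classify(ev: Evidence) -> Verdict:
    """Apply the rule: a flake must be green alone AND red under load."""
    if not ev.isolated_green:
        return Verdict.GENUINE
    if not ev.loaded_green:
        return Verdict.FLAKE
    return Verdict.UNPROVEN


COLUMNS = ("recipe", "failure", "isolation result", "root cause",
           "classification", "fix approach")


def results_table(rows: Iterable[Evidence]) -> str:
    """Render the evidence rows as a Markdown table in the M1 column order."""
    lines = ["| " + " | ".join(COLUMNS) + " |", "|" + "---|" * len(COLUMNS)]
    for ev in rows:
        cells = (ev.recipe, ev.failure, classify(ev).value,
                 ev.root_cause, ev.classification, ev.fix_approach)
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder row for demonstration only.
    sample = Evidence("example-app", "restore test failed", True, False,
                      "placeholder root cause", "load flake", "harness retry")
    print(results_table([sample]))
```

Keeping the verdict a pure function of the two re-run outcomes means a row cannot be labelled a flake without both pieces of evidence recorded, which is the property the Adversary check in M1 asks for.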