cc-ci-orchestrator/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md
autonomic-bot ff6c44a627 plan: queue redfix — investigate ALL canon-sweep failures + FIX each (recipe PR or harness improvement, opus)
Operator 2026-06-17: fix all six (discourse timeout, mattermost-lts restore,
mumble handshake, bluesky warm-routing, gitea 3.6.0 advance, keycloak
de-enroll) — none left as a standing exception. Isolation re-run first
(flake vs genuine; operator recalls mattermost-lts/mumble passing). keycloak
+ bluesky likely share a warm-domain-collision harness fix; gitea via recipe
PR or warm-advance fallback. Nothing merged.
2026-06-17 23:17:51 +00:00


# Phase `redfix` — investigate every canon-sweep failure + open fix PRs
**Mission (operator-specified 2026-06-17):** the `canon` cold-full-lifecycle sweep (every recipe's latest
release deployed from scratch + warm-promote) surfaced failures the **upgrade-only weekly run never
tests**. Investigate **all** of them, root-cause each, and **FIX every one** — each via **a recipe PR or a harness
improvement** (operator 2026-06-17). **None is left as a standing exception** (keycloak's de-enrollment and
gitea's "documented limitation" are NOT acceptable end states — they must be fixed). The operator recalls
**mattermost-lts and mumble passing before**, so **flake-vs-genuine classification (via isolated re-runs)
is the crux** — but note a flake still gets a fix (a harness stabilization improvement), not a shrug.
State files: `STATUS-redfix.md`, `BACKLOG-redfix.md`, `REVIEW-redfix.md`, `JOURNAL-redfix.md`. `DECISIONS.md` is shared.
## 1. The failing set (from canon DECISIONS §canon exceptions)
| Recipe | What happened in the sweep | First hypothesis |
|---|---|---|
| **discourse** | cold-deploy **timeout** (rc=142/143, ~51-min wedge) | slow Rails boot under load → timeout/headroom, maybe the start-first rollout |
| **mattermost-lts** | `test_restore.py::test_restore_returns_state` FAILED at latest | restore-tier — possibly the known backup/restore db-cycle race on the loaded node |
| **mumble** | `custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence` FAILED | readiness/timing of the protocol handshake |
| **bluesky-pds** | cold test **GREEN**, warm-canonical promote FAILED (`warm-bluesky-pds…` → 000, cold domain + other 15 warm domains route fine) | warm-canonical-deploy **routing machinery** (cc-ci side) |
| **gitea** | cold test **GREEN**, `3.5.3→3.6.0` warm **advance** doesn't promote (app.ini read-only) | recipe: read-only `app.ini` blocks in-place warm advance (ties to LFS PR #1) |
| **keycloak** | **de-enrolled** (not tested) — live-warm OIDC dep collision on `warm-keycloak…` | **harness fix:** canonical warm deploys need a domain/namespace that can't collide with a live service, so keycloak can enroll |
## 2. Method — per recipe: isolate → root-cause → classify → fix-PR
1. **Reproduce in ISOLATION first.** Re-run each failing recipe through the harness **alone** (not under
concurrent sweep load) on cc-ci. The loaded single node during the sweep is a known flake source
(backup/restore db-cycle races, deploy timeouts). This directly answers the operator's "they passed
before": if a recipe is **green in isolation**, it's a load/concurrency **flake**, not a recipe defect,
and the fix is a harness stabilization improvement, NOT a recipe change (and not a documented exception).
2. **Root-cause** with real evidence — the actual assertion/error, deploy logs (`docker service logs`),
the wedge cause. No guessing.
3. **Classify** each precisely:
- **Recipe defect at latest** → a **recipe PR** on the `git.autonomic.zone` mirror via the
`recipe-upgrade`/`recipe-upstream` flow, verified with `!testme` (NEVER merge — operator).
- **Stale/wrong cc-ci test** → per the standing discipline: leave a PR **comment**, or a cc-ci
**test PR** only under the `--with-tests` gate. **Never weaken a test** to turn a red green.
- **Warm-canonical machinery defect** (bluesky routing) → a **cc-ci branch PR** (single-writer; never
push main).
- **Load/concurrency flake** → a harness **stabilization** fix (timeout/serialization/readiness/retry)
as a cc-ci PR. Per the mission, a flake is never closed as a documented known-flake without a fix.
4. **Open the fix PR** for each and **verify it** (recipe PR green via `!testme`; cc-ci PR via the
harness). One PR per fix; capture URLs.
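
The classification in step 3 can be sketched as a small decision function. This is a minimal illustration, not part of the cc-ci harness: the `Evidence` record, the `fault_site` labels, and `classify` are hypothetical names. Only the four classes and their fix routes come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Classification(Enum):
    RECIPE_DEFECT = "recipe defect at latest"
    STALE_TEST = "stale/wrong cc-ci test"
    WARM_MACHINERY = "warm-canonical machinery defect"
    LOAD_FLAKE = "load/concurrency flake"


# Fix route per class, paraphrased from step 3.
FIX_ROUTE = {
    Classification.RECIPE_DEFECT: "recipe PR on the mirror, verified with !testme, never merged",
    Classification.STALE_TEST: "PR comment, or a cc-ci test PR under the --with-tests gate",
    Classification.WARM_MACHINERY: "cc-ci branch PR",
    Classification.LOAD_FLAKE: "harness stabilization fix as a cc-ci PR",
}

# Hypothetical labels for where root-causing (step 2) located the fault.
_GENUINE = {
    "recipe": Classification.RECIPE_DEFECT,
    "test": Classification.STALE_TEST,
    "warm-machinery": Classification.WARM_MACHINERY,
}


@dataclass(frozen=True)
class Evidence:
    recipe: str
    green_in_isolation: bool  # outcome of the isolated re-run (step 1)
    fault_site: str = ""      # set only when the failure reproduces in isolation


def classify(ev: Evidence) -> Classification:
    """Isolation result first; the fault site only matters for genuine failures."""
    if ev.green_in_isolation:
        return Classification.LOAD_FLAKE
    if ev.fault_site not in _GENUINE:
        raise ValueError(f"{ev.recipe}: red in isolation but no root cause recorded")
    return _GENUINE[ev.fault_site]
```

The `ValueError` branch encodes the "no guessing" rule: a failure that reproduces in isolation cannot be classified until root-causing has named the side at fault.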
## 3. Per-recipe specifics to chase
- **discourse** — characterise the timeout/wedge: is it just headroom (raise `DEPLOY_TIMEOUT` / give it an
uncontended deploy) or a real convergence bug? It cold-deploys fine in the weekly with headroom. Fix =
cc-ci timeout/serialization tuning and/or a recipe convergence fix — whichever the evidence shows.
- **mattermost-lts** — does `test_restore_returns_state` fail in isolation? If green → the loaded-node
restore race (cf. plausible/ghost) → stabilize (e.g. a `BACKUP_VERIFY`-style settle/retry). If red →
diagnose the restore (recipe vs test) and fix the right side.
- **mumble** — does the handshake test fail in isolation? Likely a readiness/timing gap (handshake before
the channel is up) → readiness probe / retry, or a genuine recipe/test issue.
- **bluesky-pds** — why does `warm-bluesky-pds…` return 000 while the cold-test domain and the other 15
warm domains route fine? Find the warm-deploy routing defect (cc-ci warm machinery) and fix it so
bluesky promotes; it should then drop from the exception list.
- **gitea** — **fix it** so it promotes `3.6.0`: either a **recipe PR** making `app.ini` writable for the
in-place warm advance, OR a **harness improvement** so the warm advance falls back to a clean re-deploy
when an in-place config rewrite isn't possible. Pick per evidence; the bar is gitea promotes `3.6.0`
(ties to the LFS `app.ini` change in PR #1). Not "documented as inherent."
- **keycloak** — **fix via a harness improvement** so it can be enrolled safely: canonical warm deploys
must use a domain/namespace that can **never** collide with a live shared service (e.g. a dedicated
`warm-canon-<recipe>` / per-canonical namespace), so that keycloak's canonical instance deploys without touching the
live OIDC service. (Likely shares a root cause with bluesky's warm-routing — a single warm-domain fix may
resolve both.) Don't risk the live keycloak — the fix is isolation, then enroll.
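
The collision-free warm domain proposed for keycloak could look roughly like the sketch below. It assumes the `warm-canon-<recipe>` naming suggested above; the function names and the `base_domain` argument are hypothetical, and the real domain suffix used by cc-ci is not stated in this plan, so callers must supply it.

```python
from collections.abc import Iterable


def canonical_warm_domain(recipe: str, base_domain: str) -> str:
    """Build a warm-canon-<recipe> hostname under the given base domain."""
    if not recipe or not base_domain:
        raise ValueError("recipe and base_domain must be non-empty")
    return f"warm-canon-{recipe}.{base_domain}".lower()


def assert_no_collision(candidate: str, live_domains: Iterable[str]) -> None:
    """Refuse to deploy onto a hostname already served by a live service."""
    live = {d.strip().lower() for d in live_domains}
    if candidate.lower() in live:
        raise RuntimeError(f"{candidate} collides with a live service")
```

Running the collision check before every canonical deploy, rather than trusting the naming scheme alone, keeps the live OIDC service safe even if someone later reuses the prefix.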
## 4. Gates
**M1 — all investigated, isolated, classified.** Every recipe in §1 re-run in isolation with evidence; a
results table `recipe → failure → isolation result (flake|genuine) → root cause → classification → fix
approach`. Adversary cold-verifies the classifications: a claimed **flake** is reproducibly green in
isolation (and red under load); a claimed **recipe defect** is genuinely the recipe (not a stale test or a
harness artifact); a claimed **warm-machinery** bug is in cc-ci, not the recipe. No "it's probably a
flake" without an isolation re-run proving it.
**M2 — all six FIXED + verified.** For each, the appropriate fix (a recipe PR or a harness improvement)
exists and is **verified green** (recipe PR via `!testme`; harness/cc-ci PR via the harness;
flake-stabilization re-run green under load). **All six promote/pass after the fix** — bluesky promotes
(warm-routing fixed), gitea promotes `3.6.0`, keycloak enrolled + promotes via the collision-free warm
domain, discourse converges in time, mattermost-lts + mumble green (stabilized if flake, fixed if real).
**No recipe left as a standing exception.** If one genuinely cannot be fixed at the recipe level, the fix
moves to the harness (or a recipe PR + a tracked upstream issue) — never a shrug. **Nothing merged**
(operator merges). A short report: per recipe, flake-or-real, the fix, and the verification. Fresh
Adversary PASS on both milestones → `## DONE`.
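
The two gates can be read as a checklist over one record per recipe. The sketch below is an assumption about how such a check might be written, not an existing cc-ci tool; `FixRecord` and `gate_blockers` are hypothetical names, and the rules are taken from M1 and M2 above.

```python
from dataclasses import dataclass

FAILING_SET = ("discourse", "mattermost-lts", "mumble", "bluesky-pds", "gitea", "keycloak")


@dataclass(frozen=True)
class FixRecord:
    recipe: str
    claimed_flake: bool
    green_in_isolation: bool
    red_under_load: bool
    fix_pr_url: str        # empty string until a fix PR exists
    verified_green: bool
    merged: bool = False   # must stay False: the operator merges


def gate_blockers(records: list[FixRecord]) -> list[str]:
    """Return every reason the phase cannot be marked DONE (empty list = pass)."""
    blockers = []
    by_recipe = {r.recipe: r for r in records}
    for recipe in FAILING_SET:
        rec = by_recipe.get(recipe)
        if rec is None:
            blockers.append(f"{recipe}: not investigated (M1)")
            continue
        # M1: a flake claim needs green-in-isolation AND red-under-load evidence.
        if rec.claimed_flake and not (rec.green_in_isolation and rec.red_under_load):
            blockers.append(f"{recipe}: flake claim lacks isolation proof (M1)")
        # M1: a genuine-defect claim must reproduce in isolation.
        if not rec.claimed_flake and rec.green_in_isolation:
            blockers.append(f"{recipe}: claimed genuine but green in isolation (M1)")
        # M2: a fix must exist, be verified green, and remain unmerged.
        if not rec.fix_pr_url:
            blockers.append(f"{recipe}: no fix PR (M2)")
        if not rec.verified_green:
            blockers.append(f"{recipe}: fix not verified green (M2)")
        if rec.merged:
            blockers.append(f"{recipe}: fix was merged (M2)")
    return blockers
```

An empty return value corresponds to both milestones being ready for the Adversary; any entry is a concrete reason a recipe would otherwise remain a standing exception.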
## 5. Guardrails
- **Isolation before blame.** Never fix a load-flake as a recipe bug, or a recipe bug as a flake — the
isolation re-run is mandatory evidence (especially for mattermost-lts/mumble, which the operator saw pass).
- **Recipe mirrors are PR-only** — recipe defects get a PR, `!testme`-verified, **never merged**. **Never
weaken a cc-ci test** to make a recipe green; a stale test gets a comment or a `--with-tests` test PR.
- **cc-ci changes** on a dedicated branch (single-writer; never push main, never disturb other clones).
- **Shared swarm:** ≤23 concurrent deploys; tear down every deploy on every exit path; mind the live
warm services (keycloak especially). Host changes coordinated (loops may rebuild if clean + verify health).
- Honest reporting — a flake is labelled a flake with proof; a recipe left unfixed is called out with the
reason. Commit author `autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>`; push every commit; abra
over a pseudo-TTY.
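
The "tear down every deploy on every exit path" guardrail maps naturally onto a context manager. This is a generic Python sketch, not cc-ci code: `deployed` is a hypothetical helper, and `deploy` and `teardown` stand in for whatever callables the harness actually uses.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def deployed(
    app: str,
    deploy: Callable[[str], None],
    teardown: Callable[[str], None],
) -> Iterator[str]:
    """Deploy an app and guarantee teardown, even if deploy or the tests fail."""
    try:
        # deploy() sits inside the try block so that a timeout or partial
        # deploy still triggers teardown of whatever was created.
        deploy(app)
        yield app
    finally:
        teardown(app)
```

Because teardown also runs after a failed deploy, the `teardown` callable should tolerate an app that was only partially created.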
## 6. Definition of Done
All six canon-sweep failures (discourse, mattermost-lts, mumble, bluesky-pds, gitea, keycloak)
investigated in isolation, root-caused, classified (flake vs genuine; recipe vs test vs warm-machinery vs
load), and **FIXED — each via a recipe PR or a harness improvement — and verified green**: bluesky promotes
(warm-routing fixed), gitea promotes `3.6.0`, keycloak enrolled via a collision-free warm domain, discourse
converges, mattermost-lts + mumble green (stabilized or fixed). **None left as a standing exception.**
Nothing merged (operator merges); a clear flake-vs-real report with the fix + verification per recipe.
M1 + M2 fresh Adversary PASSes recorded in `REVIEW-redfix.md`.