plan: queue redfix — investigate ALL canon-sweep failures + FIX each (recipe PR or harness improvement, opus)
Operator 2026-06-17: fix all six (discourse timeout, mattermost-lts restore, mumble handshake, bluesky warm-routing, gitea 3.6.0 advance, keycloak de-enroll) — none left as a standing exception. Isolation re-run first (flake vs genuine; operator recalls mattermost-lts/mumble passing). keycloak + bluesky likely share a warm-domain-collision harness fix; gitea via recipe PR or warm-advance fallback. Nothing merged.
@@ -168,4 +168,6 @@ phases = [
{ id = "settings", plan = "plan-phase-settings-ci-server-config.md", status = "STATUS-settings.md", models = { builder = "claude-opus-4-8", adversary = "claude-opus-4-8" } },
# single-source the harness runtime env so the sweep timer + Drone runner SHARE deps (no duplication) — root-cause fix for DEFECT-3 drift (opus) — see plan-phase-nixenv-*.md (operator 2026-06-17)
{ id = "nixenv", plan = "plan-phase-nixenv-shared-runtime-env.md", status = "STATUS-nixenv.md", models = { builder = "claude-opus-4-8", adversary = "claude-opus-4-8" } },
# investigate ALL canon-sweep failures (discourse/mattermost-lts/mumble/bluesky/gitea/keycloak) + FIX each via recipe PR or harness improvement (opus) — see plan-phase-redfix-*.md (operator 2026-06-17)
{ id = "redfix", plan = "plan-phase-redfix-canon-sweep-failures.md", status = "STATUS-redfix.md", models = { builder = "claude-opus-4-8", adversary = "claude-opus-4-8" } },
]
108
cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md
Normal file
@@ -0,0 +1,108 @@
# Phase `redfix`: investigate every canon-sweep failure + open fix PRs

**Mission (operator-specified 2026-06-17):** the `canon` cold-full-lifecycle sweep (every recipe's latest
release deployed from scratch + warm-promote) surfaced failures the **upgrade-only weekly run never
tests**. Investigate **all** of them, root-cause each, and **FIX every one**, each via **a recipe PR or a harness
improvement** (operator 2026-06-17). **None is left as a standing exception** (keycloak's de-enrollment and
gitea's "documented limitation" are NOT acceptable end states; they must be fixed). The operator recalls
**mattermost-lts and mumble passing before**, so **flake-vs-genuine classification (via isolated re-runs)
is the crux**. Note that a flake still gets a fix (a harness stabilization improvement), not a shrug.

State files: `STATUS-redfix.md`, `BACKLOG-redfix.md`, `REVIEW-redfix.md`, `JOURNAL-redfix.md`. `DECISIONS.md` is shared.

## 1. The failing set (from canon DECISIONS §canon exceptions)

| Recipe | What happened in the sweep | First hypothesis |
|---|---|---|
| **discourse** | cold-deploy **timeout** (rc=142/143, ~51-min wedge) | slow Rails boot under load → timeout/headroom, maybe the start-first rollout |
| **mattermost-lts** | `test_restore.py::test_restore_returns_state` FAILED at latest | restore-tier; possibly the known backup/restore db-cycle race on the loaded node |
| **mumble** | `custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence` FAILED | readiness/timing of the protocol handshake |
| **bluesky-pds** | cold test **GREEN**, warm-canonical promote FAILED (`warm-bluesky-pds…` → 000, cold domain + other 15 warm domains route fine) | warm-canonical-deploy **routing machinery** (cc-ci side) |
| **gitea** | cold test **GREEN**, `3.5.3→3.6.0` warm **advance** doesn't promote (app.ini read-only) | recipe: read-only `app.ini` blocks in-place warm advance (ties to LFS PR #1) |
| **keycloak** | **de-enrolled** (not tested): live-warm OIDC dep collision on `warm-keycloak…` | **harness fix:** canonical warm deploys need a domain/namespace that can't collide with a live service, so keycloak can enroll |

## 2. Method, per recipe: isolate → root-cause → classify → fix-PR

1. **Reproduce in ISOLATION first.** Re-run each failing recipe through the harness **alone** (not under
   concurrent sweep load) on cc-ci. The loaded single node during the sweep is a known flake source
   (backup/restore db-cycle races, deploy timeouts). This directly answers the operator's "they passed
   before": if a recipe is **green in isolation**, it's a load/concurrency **flake**, not a recipe defect,
   and the fix is harness stabilization (or a documented known-flake), NOT a recipe change.
2. **Root-cause** with real evidence: the actual assertion/error, deploy logs (`docker service logs`),
   the wedge cause. No guessing.
3. **Classify** each precisely:
   - **Recipe defect at latest** → a **recipe PR** on the `git.autonomic.zone` mirror via the
     `recipe-upgrade`/`recipe-upstream` flow, verified with `!testme` (NEVER merge; the operator merges).
   - **Stale/wrong cc-ci test** → per the standing discipline: leave a PR **comment**, or a cc-ci
     **test PR** only under the `--with-tests` gate. **Never weaken a test** to turn a red green.
   - **Warm-canonical machinery defect** (bluesky routing) → a **cc-ci branch PR** (single-writer; never
     push main).
   - **Load/concurrency flake** → a harness **stabilization** fix (timeout/serialization/readiness/retry)
     as a cc-ci PR, or a documented known-flake if genuinely environmental.
4. **Open the fix PR** for each and **verify it** (recipe PR green via `!testme`; cc-ci PR via the
   harness). One PR per fix; capture URLs.
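
The classification in step 3 can be read as a small decision function. The following is an illustrative sketch only, not part of the cc-ci harness: the `Classification` enum, the `classify` function, and the `fault_side` labels are hypothetical names introduced here, and the sketch assumes the isolated re-run from step 1 has already been done.

```python
from enum import Enum


class Classification(Enum):
    # Values paraphrase the fix route that step 3 assigns to each class.
    RECIPE_DEFECT = "recipe PR on the mirror, verified with !testme"
    STALE_TEST = "PR comment, or a test PR under --with-tests"
    WARM_MACHINERY = "cc-ci branch PR"
    LOAD_FLAKE = "harness stabilization PR"


def classify(reproduces_in_isolation: bool, fault_side: str = "") -> Classification:
    """Map isolation evidence to one of the four classes in step 3.

    fault_side is only consulted when the failure still reproduces in
    isolation. It records where the root-cause evidence points:
    "recipe", "test", or "warm-machinery" (hypothetical labels).
    """
    if not reproduces_in_isolation:
        # Green alone, red under sweep load: a load/concurrency flake.
        return Classification.LOAD_FLAKE
    by_side = {
        "recipe": Classification.RECIPE_DEFECT,
        "test": Classification.STALE_TEST,
        "warm-machinery": Classification.WARM_MACHINERY,
    }
    if fault_side not in by_side:
        # No guessing: a genuine failure needs a root cause before a class.
        raise ValueError("reproduces in isolation: root-cause it before classifying")
    return by_side[fault_side]
```

Raising on a missing `fault_side` mirrors the "No guessing" rule in step 2: a failure that reproduces in isolation cannot be classified until the evidence says which side is at fault.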

## 3. Per-recipe specifics to chase

- **discourse**: characterise the timeout/wedge. Is it just headroom (raise `DEPLOY_TIMEOUT` / give it an
  uncontended deploy) or a real convergence bug? It cold-deploys fine in the weekly with headroom. Fix =
  cc-ci timeout/serialization tuning and/or a recipe convergence fix, whichever the evidence shows.
- **mattermost-lts**: does `test_restore_returns_state` fail in isolation? If green → the loaded-node
  restore race (cf. plausible/ghost) → stabilize (e.g. a `BACKUP_VERIFY`-style settle/retry). If red →
  diagnose the restore (recipe vs test) and fix the right side.
- **mumble**: does the handshake test fail in isolation? Likely a readiness/timing gap (handshake before
  the channel is up) → readiness probe / retry, or a genuine recipe/test issue.
- **bluesky-pds**: why does `warm-bluesky-pds…` return 000 while the cold-test domain and the other 15
  warm domains route fine? Find the warm-deploy routing defect (cc-ci warm machinery) and fix it so
  bluesky promotes; it should then drop from the exception list.
- **gitea**: **fix it** so it promotes `3.6.0`: either a **recipe PR** making `app.ini` writable for the
  in-place warm advance, OR a **harness improvement** so the warm advance falls back to a clean re-deploy
  when an in-place config rewrite isn't possible. Pick per evidence; the bar is gitea promotes `3.6.0`
  (ties to the LFS `app.ini` change in PR #1). Not "documented as inherent."
- **keycloak**: **fix via a harness improvement** so it can be enrolled safely: canonical warm deploys
  must use a domain/namespace that can **never** collide with a live shared service (e.g. a dedicated
  `warm-canon-<recipe>` / per-canonical namespace), so keycloak's canonical deployment never touches the
  live OIDC service. (Likely shares a root cause with bluesky's warm-routing; a single warm-domain fix may
  resolve both.) Don't risk the live keycloak: the fix is isolation, then enroll.
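
The "readiness probe / retry" and "settle/retry" ideas in the mumble and mattermost-lts bullets could take roughly the following shape. This is a minimal sketch under stated assumptions, not the harness's actual mechanism: `wait_until_ready`, its parameters, and the injectable `sleep` are hypothetical. The probe only gates when the real assertion runs; the assertion itself is left unchanged, so no test is weakened.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `attempts` times, sleeping between tries.

    Returns True as soon as the probe succeeds, False if every attempt
    fails. `sleep` is injectable so the helper can be tested without waiting.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            # Only sleep when another attempt will follow.
            sleep(delay_s)
    return False
```

A caller would run the probe first (for example, "is the channel up?") and then execute the existing handshake or restore assertion once, so a genuine failure still shows up red.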

## 4. Gates

**M1: all investigated, isolated, classified.** Every recipe in §1 re-run in isolation with evidence; a
results table `recipe → failure → isolation result (flake|genuine) → root cause → classification → fix
approach`. Adversary cold-verifies the classifications: a claimed **flake** is reproducibly green in
isolation (and red under load); a claimed **recipe defect** is genuinely the recipe (not a stale test or a
harness artifact); a claimed **warm-machinery** bug is in cc-ci, not the recipe. No "it's probably a
flake" without an isolation re-run proving it.

**M2: all six FIXED + verified.** For each, the appropriate fix (a recipe PR or a harness improvement)
exists and is **verified green** (recipe PR via `!testme`; harness/cc-ci PR via the harness;
flake-stabilization re-run green under load). **All six promote/pass after the fix**: bluesky promotes
(warm-routing fixed), gitea promotes `3.6.0`, keycloak enrolled + promotes via the collision-free warm
domain, discourse converges in time, mattermost-lts + mumble green (stabilized if flake, fixed if real).
**No recipe left as a standing exception.** If one genuinely cannot be fixed at the recipe level, the fix
moves to the harness (or a recipe PR + a tracked upstream issue), never a shrug. **Nothing merged**
(operator merges). A short report: per recipe, flake-or-real, the fix, and the verification. Fresh
Adversary PASS on both milestones → `## DONE`.
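
The M1 results table lends itself to a mechanical completeness check before it goes to the Adversary. The sketch below is one possible shape, not an existing cc-ci tool: the `Result` record and the `m1_gaps` function are hypothetical, and the field names simply follow the column order given in M1.

```python
from dataclasses import dataclass


@dataclass
class Result:
    recipe: str
    failure: str
    isolation: str       # "flake" or "genuine"; "" means no isolated re-run yet
    root_cause: str
    classification: str
    fix_approach: str


def m1_gaps(rows: list[Result], expected: set[str]) -> list[str]:
    """Return human-readable reasons why M1 is not yet satisfied."""
    present = {row.recipe for row in rows}
    gaps = [f"{name}: missing from table" for name in sorted(expected - present)]
    for row in rows:
        if row.isolation not in ("flake", "genuine"):
            gaps.append(f"{row.recipe}: no isolation result")
        if not row.root_cause:
            gaps.append(f"{row.recipe}: no root cause")
    return gaps
```

An empty list means every expected recipe has a row with an isolation result and a root cause; it says nothing about whether the classification is correct, which remains the Adversary's job.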

## 5. Guardrails

- **Isolation before blame.** Never fix a load-flake as a recipe bug, or a recipe bug as a flake; the
  isolation re-run is mandatory evidence (especially for mattermost-lts/mumble, which the operator saw pass).
- **Recipe mirrors are PR-only:** recipe defects get a PR, `!testme`-verified, **never merged**. **Never
  weaken a cc-ci test** to make a recipe green; a stale test gets a comment or a `--with-tests` test PR.
- **cc-ci changes** go on a dedicated branch (single-writer; never push main, never disturb other clones).
- **Shared swarm:** ≤2–3 concurrent deploys; tear down every deploy on every exit path; mind the live
  warm services (keycloak especially). Host changes are coordinated (loops may rebuild if clean + verify health).
- **Honest reporting:** a flake is labelled a flake with proof; a recipe left unfixed is called out with the
  reason. Commit author `autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>`; push every commit; run abra
  over a pseudo-TTY.

## 6. Definition of Done

All six canon-sweep failures (discourse, mattermost-lts, mumble, bluesky-pds, gitea, keycloak)
investigated in isolation, root-caused, classified (flake vs genuine; recipe vs test vs warm-machinery vs
load), and **FIXED, each via a recipe PR or a harness improvement, and verified green**: bluesky promotes
(warm-routing fixed), gitea promotes `3.6.0`, keycloak enrolled via a collision-free warm domain, discourse
converges, mattermost-lts + mumble green (stabilized or fixed). **None left as a standing exception.**
Nothing merged (operator merges); a clear flake-vs-real report with the fix + verification per recipe.
M1 + M2 fresh Adversary PASSes in `REVIEW-redfix.md`.