From ff6c44a627a1c3cffa77a681e84b3c781233d6bc Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Wed, 17 Jun 2026 23:17:51 +0000
Subject: [PATCH] =?UTF-8?q?plan:=20queue=20redfix=20=E2=80=94=20investigat?=
 =?UTF-8?q?e=20ALL=20canon-sweep=20failures=20+=20FIX=20each=20(recipe=20P?=
 =?UTF-8?q?R=20or=20harness=20improvement,=20opus)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Operator 2026-06-17: fix all six (discourse timeout, mattermost-lts restore, mumble handshake, bluesky warm-routing, gitea 3.6.0 advance, keycloak de-enroll) — none left as a standing exception. Isolation re-run first (flake vs genuine; operator recalls mattermost-lts/mumble passing). keycloak + bluesky likely share a warm-domain-collision harness fix; gitea via recipe PR or warm-advance fallback. Nothing merged.
---
 cc-ci-plan/agents.toml                        |   2 +
 .../plan-phase-redfix-canon-sweep-failures.md | 108 ++++++++++++++++++
 2 files changed, 110 insertions(+)
 create mode 100644 cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md

diff --git a/cc-ci-plan/agents.toml b/cc-ci-plan/agents.toml
index 531bd0d..1d861d7 100644
--- a/cc-ci-plan/agents.toml
+++ b/cc-ci-plan/agents.toml
@@ -168,4 +168,6 @@ phases = [
   { id = "settings", plan = "plan-phase-settings-ci-server-config.md", status = "STATUS-settings.md", models = { builder = "claude-opus-4-8", adversary = "claude-opus-4-8" } },
   # single-source the harness runtime env so the sweep timer + Drone runner SHARE deps (no duplication) — root-cause fix for DEFECT-3 drift (opus) — see plan-phase-nixenv-*.md (operator 2026-06-17)
   { id = "nixenv", plan = "plan-phase-nixenv-shared-runtime-env.md", status = "STATUS-nixenv.md", models = { builder = "claude-opus-4-8", adversary = "claude-opus-4-8" } },
+  # investigate ALL canon-sweep failures (discourse/mattermost-lts/mumble/bluesky/gitea/keycloak) + FIX each via recipe PR or harness improvement (opus) — see plan-phase-redfix-*.md (operator 2026-06-17)
+  { id = "redfix", plan = "plan-phase-redfix-canon-sweep-failures.md", status = "STATUS-redfix.md", models = { builder = "claude-opus-4-8", adversary = "claude-opus-4-8" } },
 ]
diff --git a/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md b/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md
new file mode 100644
index 0000000..977c795
--- /dev/null
+++ b/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md
@@ -0,0 +1,108 @@
+# Phase `redfix` — investigate every canon-sweep failure + open fix PRs
+
+**Mission (operator-specified 2026-06-17):** the `canon` cold-full-lifecycle sweep (every recipe's latest
+release deployed from scratch + warm-promote) surfaced failures the **upgrade-only weekly run never
+tests**. Investigate **all** of them, root-cause each, and **FIX every one** — each via **a recipe PR or a harness
+improvement** (operator 2026-06-17). **None is left as a standing exception** (keycloak's de-enrollment and
+gitea's "documented limitation" are NOT acceptable end states — they must be fixed). The operator recalls
+**mattermost-lts and mumble passing before**, so **flake-vs-genuine classification (via isolated re-runs)
+is the crux** — but note a flake still gets a fix (a harness stabilization improvement), not a shrug.
+
+State files: `STATUS-redfix.md`, `BACKLOG-redfix.md`, `REVIEW-redfix.md`, `JOURNAL-redfix.md`. DECISIONS.md shared.
+
+## 1. The failing set (from canon DECISIONS §canon exceptions)
+
+| Recipe | What happened in the sweep | First hypothesis |
+|---|---|---|
+| **discourse** | cold-deploy **timeout** (rc=142/143, ~51-min wedge) | slow Rails boot under load → timeout/headroom, maybe the start-first rollout |
+| **mattermost-lts** | `test_restore.py::test_restore_returns_state` FAILED at latest | restore-tier — possibly the known backup/restore db-cycle race on the loaded node |
+| **mumble** | `custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence` FAILED | readiness/timing of the protocol handshake |
+| **bluesky-pds** | cold test **GREEN**, warm-canonical promote FAILED (`warm-bluesky-pds…` → 000, cold domain + other 15 warm domains route fine) | warm-canonical-deploy **routing machinery** (cc-ci side) |
+| **gitea** | cold test **GREEN**, `3.5.3→3.6.0` warm **advance** doesn't promote (app.ini read-only) | recipe: read-only `app.ini` blocks in-place warm advance (ties to LFS PR #1) |
+| **keycloak** | **de-enrolled** (not tested) — live-warm OIDC dep collision on `warm-keycloak…` | **harness fix:** canonical warm deploys need a domain/namespace that can't collide with a live service, so keycloak can enroll |
+
+## 2. Method — per recipe: isolate → root-cause → classify → fix-PR
+
+1. **Reproduce in ISOLATION first.** Re-run each failing recipe through the harness **alone** (not under
+   concurrent sweep load) on cc-ci. The loaded single node during the sweep is a known flake source
+   (backup/restore db-cycle races, deploy timeouts). This directly answers the operator's "they passed
+   before": if a recipe is **green in isolation**, it's a load/concurrency **flake**, not a recipe defect
+   — and the fix is harness stabilization (or a documented known-flake), NOT a recipe change.
+2. **Root-cause** with real evidence — the actual assertion/error, deploy logs (`docker service logs`),
+   the wedge cause. No guessing.
+3. **Classify** each precisely:
+   - **Recipe defect at latest** → a **recipe PR** on the `git.autonomic.zone` mirror via the
+     `recipe-upgrade`/`recipe-upstream` flow, verified with `!testme` (NEVER merge — operator).
+   - **Stale/wrong cc-ci test** → per the standing discipline: leave a PR **comment**, or a cc-ci
+     **test PR** only under the `--with-tests` gate. **Never weaken a test** to turn a red green.
+   - **Warm-canonical machinery defect** (bluesky routing) → a **cc-ci branch PR** (single-writer; never
+     push main).
+   - **Load/concurrency flake** → a harness **stabilization** fix (timeout/serialization/readiness/retry)
+     as a cc-ci PR, or a documented known-flake if genuinely environmental.
+4. **Open the fix PR** for each and **verify it** (recipe PR green via `!testme`; cc-ci PR via the
+   harness). One PR per fix; capture URLs.
+
+## 3. Per-recipe specifics to chase
+
+- **discourse** — characterise the timeout/wedge: is it just headroom (raise `DEPLOY_TIMEOUT` / give it an
+  uncontended deploy) or a real convergence bug? It cold-deploys fine in the weekly with headroom. Fix =
+  cc-ci timeout/serialization tuning and/or a recipe convergence fix — whichever the evidence shows.
+- **mattermost-lts** — does `test_restore_returns_state` fail in isolation? If green → the loaded-node
+  restore race (cf. plausible/ghost) → stabilize (e.g. a `BACKUP_VERIFY`-style settle/retry). If red →
+  diagnose the restore (recipe vs test) and fix the right side.
+- **mumble** — does the handshake test fail in isolation? Likely a readiness/timing gap (handshake before
+  the channel is up) → readiness probe / retry, or a genuine recipe/test issue.
+- **bluesky-pds** — why does `warm-bluesky-pds…` return 000 while the cold-test domain and the other 15
+  warm domains route fine? Find the warm-deploy routing defect (cc-ci warm machinery) and fix it so
+  bluesky promotes; it should then drop from the exception list.
+- **gitea** — **fix it** so it promotes `3.6.0`: either a **recipe PR** making `app.ini` writable for the
+  in-place warm advance, OR a **harness improvement** so the warm advance falls back to a clean re-deploy
+  when an in-place config rewrite isn't possible. Pick per evidence; the bar is gitea promotes `3.6.0`
+  (ties to the LFS `app.ini` change in PR #1). Not "documented as inherent."
+- **keycloak** — **fix via a harness improvement** so it can be enrolled safely: canonical warm deploys
+  must use a domain/namespace that can **never** collide with a live shared service (e.g. a dedicated
+  `warm-canon-` / per-canonical namespace), so keycloak's canonical deploys without touching the
+  live OIDC service. (Likely shares a root cause with bluesky's warm-routing — a single warm-domain fix may
+  resolve both.) Don't risk the live keycloak — the fix is isolation, then enroll.
+
+## 4. Gates
+
+**M1 — all investigated, isolated, classified.** Every recipe in §1 re-run in isolation with evidence; a
+results table `recipe → failure → isolation result (flake|genuine) → root cause → classification → fix
+approach`. Adversary cold-verifies the classifications: a claimed **flake** is reproducibly green in
+isolation (and red under load); a claimed **recipe defect** is genuinely the recipe (not a stale test or a
+harness artifact); a claimed **warm-machinery** bug is in cc-ci, not the recipe. No "it's probably a
+flake" without an isolation re-run proving it.
+
+**M2 — all six FIXED + verified.** For each, the appropriate fix (a recipe PR or a harness improvement)
+exists and is **verified green** (recipe PR via `!testme`; harness/cc-ci PR via the harness;
+flake-stabilization re-run green under load). **All six promote/pass after the fix** — bluesky promotes
+(warm-routing fixed), gitea promotes `3.6.0`, keycloak enrolled + promotes via the collision-free warm
+domain, discourse converges in time, mattermost-lts + mumble green (stabilized if flake, fixed if real).
+**No recipe left as a standing exception.** If one genuinely cannot be fixed at the recipe level, the fix
+moves to the harness (or a recipe PR + a tracked upstream issue) — never a shrug. **Nothing merged**
+(operator merges). A short report: per recipe, flake-or-real, the fix, and the verification. Fresh
+Adversary PASS on both milestones → `## DONE`.
+
+## 5. Guardrails
+
+- **Isolation before blame.** Never fix a load-flake as a recipe bug, or a recipe bug as a flake — the
+  isolation re-run is mandatory evidence (especially for mattermost-lts/mumble, which the operator saw pass).
+- **Recipe mirrors are PR-only** — recipe defects get a PR, `!testme`-verified, **never merged**. **Never
+  weaken a cc-ci test** to make a recipe green; a stale test gets a comment or a `--with-tests` test PR.
+- **cc-ci changes** on a dedicated branch (single-writer; never push main, never disturb other clones).
+- **Shared swarm:** ≤2–3 concurrent deploys; tear down every deploy on every exit path; mind the live
+  warm services (keycloak especially). Host changes coordinated (loops may rebuild if clean + verify health).
+- Honest reporting — a flake is labelled a flake with proof; a recipe left unfixed is called out with the
+  reason. Commit author `autonomic-bot `; push every commit; abra
+  over a pseudo-TTY.
+
+## 6. Definition of Done
+
+All six canon-sweep failures (discourse, mattermost-lts, mumble, bluesky-pds, gitea, keycloak)
+investigated in isolation, root-caused, classified (flake vs genuine; recipe vs test vs warm-machinery vs
+load), and **FIXED — each via a recipe PR or a harness improvement — and verified green**: bluesky promotes
+(warm-routing fixed), gitea promotes `3.6.0`, keycloak enrolled via a collision-free warm domain, discourse
+converges, mattermost-lts + mumble green (stabilized or fixed). **None left as a standing exception.**
+Nothing merged (operator merges); a clear flake-vs-real report with the fix + verification per recipe.
+M1 + M2 fresh Adversary PASSes in REVIEW-redfix.md.