Operator 2026-06-17: fix all six (discourse timeout, mattermost-lts restore, mumble handshake, bluesky warm-routing, gitea 3.6.0 advance, keycloak de-enroll) — none left as a standing exception. Isolation re-run first (flake vs genuine; operator recalls mattermost-lts/mumble passing). keycloak + bluesky likely share a warm-domain-collision harness fix; gitea via recipe PR or warm-advance fallback. Nothing merged.
Phase redfix — investigate every canon-sweep failure + open fix PRs
Mission (operator-specified 2026-06-17): the canon cold-full-lifecycle sweep (every recipe's latest
release deployed from scratch + warm-promote) surfaced failures the upgrade-only weekly run never
tests. Investigate all of them, root-cause each, and FIX every one — each via a recipe PR or a harness
improvement (operator 2026-06-17). None is left as a standing exception (keycloak's de-enrollment and
gitea's "documented limitation" are NOT acceptable end states — they must be fixed). The operator recalls
mattermost-lts and mumble passing before, so flake-vs-genuine classification (via isolated re-runs)
is the crux — but note a flake still gets a fix (a harness stabilization improvement), not a shrug.
State files: STATUS-redfix.md, BACKLOG-redfix.md, REVIEW-redfix.md, JOURNAL-redfix.md. DECISIONS.md shared.
1. The failing set (from canon DECISIONS §canon exceptions)
| Recipe | What happened in the sweep | First hypothesis |
|---|---|---|
| discourse | cold-deploy timeout (rc=142/143, ~51-min wedge) | slow Rails boot under load → timeout/headroom, maybe the start-first rollout |
| mattermost-lts | test_restore.py::test_restore_returns_state FAILED at latest | restore-tier — possibly the known backup/restore db-cycle race on the loaded node |
| mumble | custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence FAILED | readiness/timing of the protocol handshake |
| bluesky-pds | cold test GREEN, warm-canonical promote FAILED (warm-bluesky-pds… → 000, cold domain + other 15 warm domains route fine) | warm-canonical-deploy routing machinery (cc-ci side) |
| gitea | cold test GREEN, 3.5.3→3.6.0 warm advance doesn't promote (app.ini read-only) | recipe: read-only app.ini blocks in-place warm advance (ties to LFS PR #1) |
| keycloak | de-enrolled (not tested) — live-warm OIDC dep collision on warm-keycloak… | harness fix: canonical warm deploys need a domain/namespace that can't collide with a live service, so keycloak can enroll |
2. Method — per recipe: isolate → root-cause → classify → fix-PR
- Reproduce in ISOLATION first. Re-run each failing recipe through the harness alone (not under concurrent sweep load) on cc-ci. The loaded single node during the sweep is a known flake source (backup/restore db-cycle races, deploy timeouts). This directly answers the operator's "they passed before": if a recipe is green in isolation, it's a load/concurrency flake, not a recipe defect — and the fix is harness stabilization (or a documented known-flake), NOT a recipe change.
- Root-cause with real evidence — the actual assertion/error, deploy logs (`docker service logs`), the wedge cause. No guessing.
- Classify each precisely:
  - Recipe defect at latest → a recipe PR on the `git.autonomic.zone` mirror via the `recipe-upgrade`/`recipe-upstream` flow, verified with `!testme` (NEVER merge — operator).
  - Stale/wrong cc-ci test → per the standing discipline: leave a PR comment, or a cc-ci test PR only under the `--with-tests` gate. Never weaken a test to turn a red green.
  - Warm-canonical machinery defect (bluesky routing) → a cc-ci branch PR (single-writer; never push main).
  - Load/concurrency flake → a harness stabilization fix (timeout/serialization/readiness/retry) as a cc-ci PR, or a documented known-flake if genuinely environmental.
- Open the fix PR for each and verify it (recipe PR green via `!testme`; cc-ci PR via the harness). One PR per fix; capture URLs.
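
The isolate-then-classify step above can be sketched as a small decision function. This is a minimal illustration, not part of the cc-ci harness: the `Evidence` and `Verdict` names and the `min_isolated` threshold of three runs are assumptions introduced here, and the run results would have to come from real harness re-runs.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    LOAD_FLAKE = "load/concurrency flake"
    GENUINE = "genuine failure"
    UNPROVEN = "needs more runs"


@dataclass
class Evidence:
    """Hypothetical record of re-run outcomes for one recipe (True = green)."""
    recipe: str
    isolated_runs: list[bool]  # recipe run alone through the harness
    loaded_runs: list[bool]    # recipe run under concurrent sweep load


def classify(ev: Evidence, min_isolated: int = 3) -> Verdict:
    # Any red run in isolation means load cannot be the whole explanation.
    if any(not green for green in ev.isolated_runs):
        return Verdict.GENUINE
    # Too few isolated runs: "probably a flake" is not evidence yet.
    if len(ev.isolated_runs) < min_isolated:
        return Verdict.UNPROVEN
    # Green alone and red under load is the flake signature.
    if any(not green for green in ev.loaded_runs):
        return Verdict.LOAD_FLAKE
    # Green everywhere: the original failure has not been reproduced.
    return Verdict.UNPROVEN
```

A `GENUINE` verdict still needs root-causing to decide between recipe, test, and warm-machinery; the function only separates load flakes from everything else.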
3. Per-recipe specifics to chase
- discourse — characterise the timeout/wedge: is it just headroom (raise `DEPLOY_TIMEOUT` / give it an uncontended deploy) or a real convergence bug? It cold-deploys fine in the weekly with headroom. Fix = cc-ci timeout/serialization tuning and/or a recipe convergence fix — whichever the evidence shows.
- mattermost-lts — does `test_restore_returns_state` fail in isolation? If green → the loaded-node restore race (cf. plausible/ghost) → stabilize (e.g. a `BACKUP_VERIFY`-style settle/retry). If red → diagnose the restore (recipe vs test) and fix the right side.
- mumble — does the handshake test fail in isolation? Likely a readiness/timing gap (handshake before the channel is up) → readiness probe / retry, or a genuine recipe/test issue.
- bluesky-pds — why does `warm-bluesky-pds…` return 000 while the cold-test domain and the other 15 warm domains route fine? Find the warm-deploy routing defect (cc-ci warm machinery) and fix it so bluesky promotes; it should then drop from the exception list.
- gitea — fix it so it promotes `3.6.0`: either a recipe PR making `app.ini` writable for the in-place warm advance, OR a harness improvement so the warm advance falls back to a clean re-deploy when an in-place config rewrite isn't possible. Pick per evidence; the bar is gitea promotes `3.6.0` (ties to the LFS `app.ini` change in PR #1). Not "documented as inherent."
- keycloak — fix via a harness improvement so it can be enrolled safely: canonical warm deploys must use a domain/namespace that can never collide with a live shared service (e.g. a dedicated `warm-canon-<recipe>` / per-canonical namespace), so keycloak's canonical deploy runs without touching the live OIDC service. (Likely shares a root cause with bluesky's warm-routing — a single warm-domain fix may resolve both.) Don't risk the live keycloak — the fix is isolation, then enroll.
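
One way the collision-free warm domain for keycloak (and possibly bluesky-pds) could look is sketched below. Everything here is an assumption for illustration: the `warm-canon-` prefix simply follows the `warm-canon-<recipe>` example above, `ci.example.test` is a placeholder base domain, and the set of live domains would have to be supplied by whatever inventory cc-ci actually keeps.

```python
import re

# Hypothetical prefix, taken from the plan's warm-canon-<recipe> example.
WARM_CANON_PREFIX = "warm-canon-"
MAX_DNS_LABEL = 63  # a single DNS label may not exceed 63 characters


def warm_canon_domain(recipe: str, base_domain: str) -> str:
    """Build a dedicated canonical warm domain for a recipe."""
    # Reduce the recipe name to characters that are valid in a DNS label.
    label = re.sub(r"[^a-z0-9-]+", "-", recipe.lower()).strip("-")
    if not label:
        raise ValueError(f"recipe name {recipe!r} yields an empty DNS label")
    host = f"{WARM_CANON_PREFIX}{label}"
    if len(host) > MAX_DNS_LABEL:
        raise ValueError(f"{host!r} is longer than {MAX_DNS_LABEL} characters")
    return f"{host}.{base_domain}"


def ensure_no_collision(candidate: str, live_domains: set[str]) -> None:
    """Refuse to deploy onto a domain that a live service already uses."""
    if candidate in live_domains:
        raise RuntimeError(f"{candidate} collides with a live service")
```

For example, `warm_canon_domain("keycloak", "ci.example.test")` yields `warm-canon-keycloak.ci.example.test`. Calling `ensure_no_collision` before every canonical deploy keeps the guard in place even if someone later registers a live service under the reserved prefix.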
4. Gates
M1 — all investigated, isolated, classified. Every recipe in §1 re-run in isolation with evidence; a
results table `recipe → failure → isolation result (flake|genuine) → root cause → classification → fix approach`. Adversary cold-verifies the classifications: a claimed flake is reproducibly green in
isolation (and red under load); a claimed recipe defect is genuinely the recipe (not a stale test or a
harness artifact); a claimed warm-machinery bug is in cc-ci, not the recipe. No "it's probably a
flake" without an isolation re-run proving it.
M2 — all six FIXED + verified. For each, the appropriate fix (a recipe PR or a harness improvement)
exists and is verified green (recipe PR via !testme; harness/cc-ci PR via the harness;
flake-stabilization re-run green under load). All six promote/pass after the fix — bluesky promotes
(warm-routing fixed), gitea promotes 3.6.0, keycloak enrolled + promotes via the collision-free warm
domain, discourse converges in time, mattermost-lts + mumble green (stabilized if flake, fixed if real).
No recipe left as a standing exception. If one genuinely cannot be fixed at the recipe level, the fix
moves to the harness (or a recipe PR + a tracked upstream issue) — never a shrug. Nothing merged
(operator merges). A short report: per recipe, flake-or-real, the fix, and the verification. Fresh
Adversary PASS on both milestones → ## DONE.
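
The M1 results table can be checked mechanically for completeness before it goes to the Adversary. The sketch below is only an illustration: the `ResultRow` fields mirror the columns named in M1, while the `m1_gaps` helper and its message wording are hypothetical and say nothing about whether the recorded evidence is true.

```python
from dataclasses import dataclass, fields

# The six recipes from the failing set in section 1.
FAILING_SET = ("discourse", "mattermost-lts", "mumble",
               "bluesky-pds", "gitea", "keycloak")


@dataclass
class ResultRow:
    recipe: str
    failure: str
    isolation_result: str  # "flake" or "genuine"
    root_cause: str
    classification: str
    fix_approach: str


def m1_gaps(rows: list[ResultRow]) -> list[str]:
    """Return a list of reasons the table is not yet M1-complete."""
    gaps = []
    seen = {row.recipe for row in rows}
    for recipe in FAILING_SET:
        if recipe not in seen:
            gaps.append(f"{recipe}: no row")
    for row in rows:
        for field in fields(row):
            if not getattr(row, field.name).strip():
                gaps.append(f"{row.recipe}: {field.name} is empty")
        if row.isolation_result.strip() and row.isolation_result not in ("flake", "genuine"):
            gaps.append(f"{row.recipe}: isolation_result must be flake or genuine")
    return gaps
```

An empty return value only means every recipe has a fully filled-in row; the Adversary's cold verification of each classification is still required.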
5. Guardrails
- Isolation before blame. Never fix a load-flake as a recipe bug, or a recipe bug as a flake — the isolation re-run is mandatory evidence (especially for mattermost-lts/mumble, which the operator saw pass).
- Recipe mirrors are PR-only — recipe defects get a PR, `!testme`-verified, never merged. Never weaken a cc-ci test to make a recipe green; a stale test gets a comment or a `--with-tests` test PR.
- cc-ci changes on a dedicated branch (single-writer; never push main, never disturb other clones).
- Shared swarm: ≤2–3 concurrent deploys; tear down every deploy on every exit path; mind the live warm services (keycloak especially). Host changes coordinated (loops may rebuild if clean + verify health).
- Honest reporting — a flake is labelled a flake with proof; a recipe left unfixed is called out with the reason. Commit author `autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>`; push every commit; abra over a pseudo-TTY.
6. Definition of Done
All six canon-sweep failures (discourse, mattermost-lts, mumble, bluesky-pds, gitea, keycloak)
investigated in isolation, root-caused, classified (flake vs genuine; recipe vs test vs warm-machinery vs
load), and FIXED — each via a recipe PR or a harness improvement — and verified green: bluesky promotes
(warm-routing fixed), gitea promotes 3.6.0, keycloak enrolled via a collision-free warm domain, discourse
converges, mattermost-lts + mumble green (stabilized or fixed). None left as a standing exception.
Nothing merged (operator merges); a clear flake-vs-real report with the fix + verification per recipe.
M1 + M2 fresh Adversary PASSes in REVIEW-redfix.md.