cc-ci-orchestrator/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md
autonomic-bot ff6c44a627 plan: queue redfix — investigate ALL canon-sweep failures + FIX each (recipe PR or harness improvement, opus)
Operator 2026-06-17: fix all six (discourse timeout, mattermost-lts restore,
mumble handshake, bluesky warm-routing, gitea 3.6.0 advance, keycloak
de-enroll) — none left as a standing exception. Isolation re-run first
(flake vs genuine; operator recalls mattermost-lts/mumble passing). keycloak
+ bluesky likely share a warm-domain-collision harness fix; gitea via recipe
PR or warm-advance fallback. Nothing merged.
2026-06-17 23:17:51 +00:00

Phase redfix — investigate every canon-sweep failure + open fix PRs

Mission (operator-specified 2026-06-17): the canon cold-full-lifecycle sweep (every recipe's latest release deployed from scratch + warm-promote) surfaced failures the upgrade-only weekly run never tests. Investigate all of them, root-cause each, and FIX every one — each via a recipe PR or a harness improvement (operator 2026-06-17). None is left as a standing exception (keycloak's de-enrollment and gitea's "documented limitation" are NOT acceptable end states — they must be fixed). The operator recalls mattermost-lts and mumble passing before, so flake-vs-genuine classification (via isolated re-runs) is the crux — but note a flake still gets a fix (a harness stabilization improvement), not a shrug.

State files: STATUS-redfix.md, BACKLOG-redfix.md, REVIEW-redfix.md, JOURNAL-redfix.md. DECISIONS.md shared.

1. The failing set (from canon DECISIONS §canon exceptions)

| Recipe | What happened in the sweep | First hypothesis |
|---|---|---|
| discourse | cold-deploy timeout (rc=142/143, ~51-min wedge) | slow Rails boot under load → timeout/headroom, maybe the start-first rollout |
| mattermost-lts | test_restore.py::test_restore_returns_state FAILED at latest | restore-tier; possibly the known backup/restore db-cycle race on the loaded node |
| mumble | custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence FAILED | readiness/timing of the protocol handshake |
| bluesky-pds | cold test GREEN, warm-canonical promote FAILED (warm-bluesky-pds… → 000, cold domain + other 15 warm domains route fine) | warm-canonical-deploy routing machinery (cc-ci side) |
| gitea | cold test GREEN, 3.5.3→3.6.0 warm advance doesn't promote (app.ini read-only) | recipe: read-only app.ini blocks in-place warm advance (ties to LFS PR #1) |
| keycloak | de-enrolled (not tested): live-warm OIDC dep collision on warm-keycloak… | harness fix: canonical warm deploys need a domain/namespace that can't collide with a live service, so keycloak can enroll |

2. Method — per recipe: isolate → root-cause → classify → fix-PR

  1. Reproduce in ISOLATION first. Re-run each failing recipe through the harness alone (not under concurrent sweep load) on cc-ci. The loaded single node during the sweep is a known flake source (backup/restore db-cycle races, deploy timeouts). This directly answers the operator's "they passed before": if a recipe is green in isolation, it's a load/concurrency flake, not a recipe defect — and the fix is harness stabilization (or a documented known-flake), NOT a recipe change.
  2. Root-cause with real evidence — the actual assertion/error, deploy logs (docker service logs), the wedge cause. No guessing.
  3. Classify each precisely:
    • Recipe defect at latest → a recipe PR on the git.autonomic.zone mirror via the recipe-upgrade/recipe-upstream flow, verified with !testme (NEVER merge — operator).
    • Stale/wrong cc-ci test → per the standing discipline: leave a PR comment, or a cc-ci test PR only under the --with-tests gate. Never weaken a test to turn a red green.
    • Warm-canonical machinery defect (bluesky routing) → a cc-ci branch PR (single-writer; never push main).
    • Load/concurrency flake → a harness stabilization fix (timeout/serialization/readiness/retry) as a cc-ci PR, or a documented known-flake if genuinely environmental.
  4. Open the fix PR for each and verify it (recipe PR green via !testme; cc-ci PR via the harness). One PR per fix; capture URLs.
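The "harness stabilization fix (timeout/serialization/readiness/retry)" option in step 3 could take the shape of a bounded readiness poll. The following is a minimal sketch only: `wait_until_ready` and its parameters are hypothetical names, not part of the cc-ci harness, and the probe itself (for example a protocol handshake or a post-restore query) is left to the caller.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses.

    Returns True as soon as the probe succeeds and False on timeout.
    A probe that raises is treated as "not ready yet", not as a failure.
    `clock` and `sleep` are injectable so the helper can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # service not accepting connections yet; keep polling
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A test would call this before its first assertion and fail with a clear "not ready within N seconds" message on `False`. That keeps the assertion itself unchanged, so the test is not weakened, only given a defined settle window.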

3. Per-recipe specifics to chase

  • discourse — characterise the timeout/wedge: is it just headroom (raise DEPLOY_TIMEOUT / give it an uncontended deploy) or a real convergence bug? It cold-deploys fine in the weekly with headroom. Fix = cc-ci timeout/serialization tuning and/or a recipe convergence fix — whichever the evidence shows.
  • mattermost-lts — does test_restore_returns_state fail in isolation? If green → the loaded-node restore race (cf. plausible/ghost) → stabilize (e.g. a BACKUP_VERIFY-style settle/retry). If red → diagnose the restore (recipe vs test) and fix the right side.
  • mumble — does the handshake test fail in isolation? Likely a readiness/timing gap (handshake before the channel is up) → readiness probe / retry, or a genuine recipe/test issue.
  • bluesky-pds — why does warm-bluesky-pds… return 000 while the cold-test domain and the other 15 warm domains route fine? Find the warm-deploy routing defect (cc-ci warm machinery) and fix it so bluesky promotes; it should then drop from the exception list.
  • gitea: fix it so it promotes 3.6.0, via either a recipe PR making app.ini writable for the in-place warm advance or a harness improvement so the warm advance falls back to a clean re-deploy when an in-place config rewrite isn't possible. Pick per evidence; the bar is that gitea promotes 3.6.0 (ties to the LFS app.ini change in PR #1), not "documented as inherent."
  • keycloak: fix via a harness improvement so it can be enrolled safely. Canonical warm deploys must use a domain/namespace that can never collide with a live shared service (e.g. a dedicated warm-canon-<recipe> / per-canonical namespace), so that keycloak's canonical deploys never touch the live OIDC service. (Likely shares a root cause with bluesky's warm-routing; a single warm-domain fix may resolve both.) Don't risk the live keycloak: the fix is isolation, then enroll.
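The collision-free warm domain idea for keycloak (and possibly bluesky-pds) can be sketched as a naming function plus a guard. This is an illustration under assumptions: the `warm-canon-` prefix comes from the example in the list above, while `warm_canon_domain`, `assert_no_collision`, and the `ci.example.test` base domain are hypothetical and do not describe how cc-ci currently derives warm domains.

```python
import re
from typing import Iterable

WARM_CANON_PREFIX = "warm-canon-"  # prefix taken from the plan's own example


def warm_canon_domain(recipe: str, base_domain: str) -> str:
    """Build a dedicated canonical warm domain for a recipe."""
    # Reduce the recipe name to a DNS-safe label (lowercase, digits, hyphens).
    label = re.sub(r"[^a-z0-9-]+", "-", recipe.lower()).strip("-")
    if not label:
        raise ValueError(f"recipe name {recipe!r} yields an empty DNS label")
    return f"{WARM_CANON_PREFIX}{label}.{base_domain}"


def assert_no_collision(domain: str, live_domains: Iterable[str]) -> None:
    """Refuse to deploy onto a domain that a live shared service already uses."""
    live = {d.lower() for d in live_domains}
    if domain.lower() in live:
        raise RuntimeError(f"{domain} collides with a live service; not deploying")


if __name__ == "__main__":
    # Placeholder base domain and live set, for illustration only.
    domain = warm_canon_domain("keycloak", "ci.example.test")
    assert_no_collision(domain, live_domains=["sso.ci.example.test"])
    print(domain)
```

Running the guard before every canonical warm deploy makes a collision a hard failure at plan time rather than something discovered after the live OIDC service has been touched.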

4. Gates

M1 — all investigated, isolated, classified. Every recipe in §1 re-run in isolation with evidence; a results table recipe → failure → isolation result (flake|genuine) → root cause → classification → fix approach. Adversary cold-verifies the classifications: a claimed flake is reproducibly green in isolation (and red under load); a claimed recipe defect is genuinely the recipe (not a stale test or a harness artifact); a claimed warm-machinery bug is in cc-ci, not the recipe. No "it's probably a flake" without an isolation re-run proving it.
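One way to make the M1 results table checkable is to hold each row as a record and list what the Adversary would object to. This is a sketch under assumptions: the field names mirror the table columns named above, but `ResultRow`, `m1_problems`, and the `MIN_ISOLATION_RUNS` threshold for "reproducibly green" are hypothetical and not defined by the plan.

```python
from dataclasses import dataclass
from typing import List, Tuple

MIN_ISOLATION_RUNS = 2  # assumed threshold for "reproducibly green"; not from the plan


@dataclass(frozen=True)
class ResultRow:
    recipe: str
    failure: str
    isolation_runs: Tuple[bool, ...]  # one entry per isolated re-run, True = green
    loaded_runs: Tuple[bool, ...]     # one entry per run under sweep load, True = green
    root_cause: str
    classification: str               # "flake" or "genuine"
    fix_approach: str


def m1_problems(row: ResultRow) -> List[str]:
    """Return the reasons this row would fail the M1 gate (empty list = passes)."""
    problems: List[str] = []
    if not row.isolation_runs:
        problems.append("no isolation re-run recorded")
    if not row.root_cause.strip():
        problems.append("no root cause recorded")
    if row.classification == "flake":
        # A flake must be green every time in isolation and red at least once under load.
        if len(row.isolation_runs) < MIN_ISOLATION_RUNS or not all(row.isolation_runs):
            problems.append("claimed flake is not reproducibly green in isolation")
        if all(row.loaded_runs):
            problems.append("claimed flake was never observed red under load")
    elif row.classification == "genuine":
        # A genuine defect should still reproduce without sweep load.
        if row.isolation_runs and all(row.isolation_runs):
            problems.append("claimed genuine failure is green in isolation")
    else:
        problems.append(f"unknown classification: {row.classification!r}")
    return problems
```

A row with an empty problem list is ready for Adversary review; any non-empty list is exactly the "it's probably a flake" case the gate is meant to reject.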

M2 — all six FIXED + verified. For each, the appropriate fix (a recipe PR or a harness improvement) exists and is verified green (recipe PR via !testme; harness/cc-ci PR via the harness; flake-stabilization re-run green under load). All six promote/pass after the fix — bluesky promotes (warm-routing fixed), gitea promotes 3.6.0, keycloak enrolled + promotes via the collision-free warm domain, discourse converges in time, mattermost-lts + mumble green (stabilized if flake, fixed if real). No recipe left as a standing exception. If one genuinely cannot be fixed at the recipe level, the fix moves to the harness (or a recipe PR + a tracked upstream issue) — never a shrug. Nothing merged (operator merges). A short report: per recipe, flake-or-real, the fix, and the verification. Fresh Adversary PASS on both milestones → ## DONE.

5. Guardrails

  • Isolation before blame. Never fix a load-flake as a recipe bug, or a recipe bug as a flake — the isolation re-run is mandatory evidence (especially for mattermost-lts/mumble, which the operator saw pass).
  • Recipe mirrors are PR-only — recipe defects get a PR, !testme-verified, never merged. Never weaken a cc-ci test to make a recipe green; a stale test gets a comment or a --with-tests test PR.
  • cc-ci changes on a dedicated branch (single-writer; never push main, never disturb other clones).
  • Shared swarm: ≤23 concurrent deploys; tear down every deploy on every exit path; mind the live warm services (keycloak especially). Host changes coordinated (loops may rebuild if clean + verify health).
  • Honest reporting — a flake is labelled a flake with proof; a recipe left unfixed is called out with the reason. Commit author autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>; push every commit; abra over a pseudo-TTY.
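The shared-swarm guardrail (a cap on concurrent deploys, teardown on every exit path) maps naturally onto a semaphore plus a context manager. The sketch below is illustrative only: `DeployGate` is a hypothetical name, `limit` is whatever cap the guardrail specifies, and `deploy` / `teardown` stand in for the real harness calls, which are not shown here.

```python
import threading
from contextlib import contextmanager
from typing import Callable, Iterator


class DeployGate:
    """Cap concurrent deploys and guarantee teardown on every exit path."""

    def __init__(self, limit: int) -> None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self._slots = threading.BoundedSemaphore(limit)

    @contextmanager
    def deployed(
        self,
        deploy: Callable[[], None],
        teardown: Callable[[], None],
    ) -> Iterator[None]:
        self._slots.acquire()  # blocks while `limit` deploys are already running
        try:
            try:
                deploy()
                yield
            finally:
                # Runs on success, on a failing test body, and on a deploy
                # that raised partway through (a half-created stack still
                # needs cleaning up).
                teardown()
        finally:
            self._slots.release()
```

Usage would look like `with gate.deployed(deploy_fn, teardown_fn): run_tests()`. Nesting the two `finally` blocks means a teardown that itself raises still releases the slot, so one bad cleanup cannot starve later deploys.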

6. Definition of Done

All six canon-sweep failures (discourse, mattermost-lts, mumble, bluesky-pds, gitea, keycloak) investigated in isolation, root-caused, classified (flake vs genuine; recipe vs test vs warm-machinery vs load), and FIXED — each via a recipe PR or a harness improvement — and verified green: bluesky promotes (warm-routing fixed), gitea promotes 3.6.0, keycloak enrolled via a collision-free warm domain, discourse converges, mattermost-lts + mumble green (stabilized or fixed). None left as a standing exception. Nothing merged (operator merges); a clear flake-vs-real report with the fix + verification per recipe. M1 + M2 fresh Adversary PASSes in REVIEW-redfix.md.