# STATUS — phase `redfix` Phase SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md` Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds, gitea, keycloak) → isolate → root-cause → classify (flake vs genuine; recipe vs test vs warm-machinery vs load) → FIX each (recipe PR or harness improvement) → verify green. No standing exceptions. Nothing merged. ## Phase: M1 — investigate + isolate + classify (IN PROGRESS) Bootstrapped 2026-06-17T23:20Z. cc-ci healthy, no run in flight, next scheduled sweep 2026-06-21 (3-day clear window). Disk `/` 38G free (75% used). ### Isolation harness (how I reproduce each failure ALONE) Each canon-sweep per-recipe run is `runner/nightly_sweep.run_on_tag(recipe, latest)`: `abra.recipe_checkout(recipe, )` then `run_recipe_ci.py` with `RECIPE= CCCI_SKIP_FETCH=1` and REF/QUICK/MODE/VERSION unset (cold, full, head==tag). Isolation = run ONE recipe at a time with NO concurrent sweep load on the single node (the loaded node is the known flake source per phase plan §2.1). Runs execute on cc-ci from `/etc/cc-ci`. ### Starting canonical state (cc-ci `/var/lib/ci-warm//canonical.json`, read 2026-06-17T23:19Z) | Recipe | Canonical now | Note | |---|---|---| | discourse | (none) | no canonical dir | | mattermost-lts | (none) | no canonical dir | | mumble | `1.0.0+v1.6.870-0` @ 20260617T180501Z | **canonical PRESENT, written TODAY** — flake signal | | bluesky-pds | (none) | no canonical dir | | gitea | `3.5.3+1.24.2-rootless` @ 20260617T083930Z | 3.6.0 advance not promoted | | keycloak | (none) | de-enrolled (WARM_CANONICAL off) | ### M1 investigation tracker | Recipe | Isolation run | Result | Root cause | Classification | |---|---|---|---|---| | discourse | DONE @23:40Z (`/tmp/redfix-discourse.log` on cc-ci) | install/backup/restore/custom PASS; **upgrade overlay FAIL**. Deploys+serves fine — NOT a timeout/FATA. | cc-ci overlay `tests/discourse/test_upgrade.py` asserts head runs official `discourse/discourse:3.5.3` + drops sidekiq; latest tag `0.8.1+3.5.0` AND main both still `bitnamilegacy/discourse:3.5.0`+sidekiq (migration exists in no release/main). The `depends_on discourse` string is a non-fatal prepull-only warning, not the deploy. | **stale/PR-specific cc-ci OVERLAY test** mismatched to canonical-sweep context (not flake/timeout/recipe-deploy/warm-machinery) | | mattermost-lts | DONE @00:05Z (`/tmp/redfix-mattermost-lts.log`) | install/upgrade/backup/custom PASS; **restore FAIL** `ci_marker does not exist` — **deterministic in isolation** (not a load race) | recipe `postgres` svc backup labels: backs up hot live PGDATA + dump but has **NO `backupbot.restore.post-hook`** to replay the dump → restore doesn't round-trip postgres. Contrast immich (passes): dump-only `backup.volumes.postgres.path: backup.sql` + `restore.post-hook: /pg_backup.sh restore`. | **genuine RECIPE defect** at latest → recipe PR (adopt immich-style dump+restore-post-hook) | | mumble | DONE @00:18Z (`/tmp/redfix-mumble.log`) | **ALL tiers PASS** incl. handshake; no orphans. Canon red under load; canonical written green TODAY | handshake (TLS+ServerSync) not completing within ~60s retry under heavy concurrent sweep load; fine in isolation | **load/timing FLAKE** → harness stabilization (readiness gate / retry). (1 isolation green; will repeat 1-2× before M1 claim) | | bluesky-pds | DONE @00:45Z (`/tmp/redfix-bluesky-pds.log` + live diag) | cold lifecycle GREEN; **WC5 promote 000** reproduces (warm /xrpc/_health last status 0). NOT a flake | caddy on-demand TLS (`ask http://app:3000/tls-check`) can't reach app: caddy resolves bare `app` to OTHER stacks' app endpoints on shared `proxy` net (getent app→only 10.10.0.X, never internal 10.0.3.3; proxy has drone/traefik/keycloak/ccci `app` aliases) → no cert → 000. Promote machinery correct (refused to write canonical). | **genuine routing/RECIPE defect** (cross-stack `app`-alias collision on shared proxy) → recipe PR: unique PDS service name/alias. NOT promote-machinery, NOT flake | | gitea | DONE @00:14Z (`/tmp/redfix-gitea2.log` + live container logs) | cold lifecycle (incl fresh 3.5.3→3.6.0 upgrade) PASS; **warm advance crash-loops** | `LoadCommonSettings() [F] error saving JWT Secret … failed to save "/etc/gitea/app.ini": read-only file system` — gitea 3.6.0/1.24.2 tries to persist a JWT to the read-only app.ini docker-config mount on warm reattach (before DB migration; 3.5.3 data intact). Cold passes (fresh secrets, no rewrite). | **genuine RECIPE defect** (3.6.0 + read-only app.ini config mount on advance) → recipe PR: render app.ini into the writable config volume. (1st gitea run hit a nixenv "already deployed" leftover confound — fixed by undeploying to idle then re-running) | | keycloak | DONE @01:05Z (code-verified; no run) | de-enrolled. `canonical_domain("keycloak")` == `WARM_DOMAINS["keycloak"]` == `warm-keycloak.ci.commoninternet.net` EXACTLY (canonical.py:42, warm.py:27,44). Live keycloak 200 /realms/master. | data-warm canonical domain uses same `warm-` scheme as the live-warm OIDC provider → promote would collide with live shared SSO. No collision-free canonical namespace exists. | **HARNESS defect** (warm-domain namespace collision) → fix: collision-free `canonical_domain` for live-warm providers (`warm-canon-`), then enroll keycloak | Gate: M1 not yet claimed. ## Blocked (none)