cc-ci/machine-docs/STATUS-redfix.md
2026-06-18 00:14:32 +00:00

# STATUS — phase `redfix`
Phase SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md`
Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds,
gitea, keycloak) → isolate → root-cause → classify (flake vs genuine; recipe vs test vs
warm-machinery vs load) → FIX each (recipe PR or harness improvement) → verify green. No standing
exceptions. Nothing merged.
## Phase: M1 — investigate + isolate + classify (IN PROGRESS)
Bootstrapped 2026-06-17T23:20Z. cc-ci healthy, no run in flight, next scheduled sweep 2026-06-21
(3-day clear window). Disk `/` 38G free (75% used).
### Isolation harness (how I reproduce each failure ALONE)
Each canon-sweep per-recipe run is `runner/nightly_sweep.run_on_tag(recipe, latest)`:
`abra.recipe_checkout(recipe, <latest-tag>)` then `run_recipe_ci.py` with `RECIPE=<r>
CCCI_SKIP_FETCH=1` and REF/QUICK/MODE/VERSION unset (cold, full, head==tag). Isolation = run ONE
recipe at a time with NO concurrent sweep load on the single node (the loaded node is the known
flake source per phase plan §2.1). Runs execute on cc-ci from `/etc/cc-ci`.
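
The environment for such an isolated run can be sketched as below. This is a minimal sketch under assumptions, not the harness's own code: the helper name `isolation_env` is hypothetical, and only the variable names (`RECIPE`, `CCCI_SKIP_FETCH`, `REF`, `QUICK`, `MODE`, `VERSION`) come from the description above.

```python
import os

# Variables that must be unset for a cold, full, head==tag run (per the notes above).
UNSET_FOR_COLD_RUN = ("REF", "QUICK", "MODE", "VERSION")


def isolation_env(recipe, base_env=None):
    """Build the environment for running ONE recipe the way the sweep does.

    Hypothetical helper: sets RECIPE and CCCI_SKIP_FETCH=1 and drops
    REF/QUICK/MODE/VERSION so nothing overrides the checked-out tag.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name in UNSET_FOR_COLD_RUN:
        env.pop(name, None)  # absent is fine; the goal is "unset"
    env["RECIPE"] = recipe
    env["CCCI_SKIP_FETCH"] = "1"
    return env
```

The returned mapping could be passed as `env=` to `subprocess.run` when launching `run_recipe_ci.py` from `/etc/cc-ci`; the exact launch command is an assumption and is left out here.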
### Starting canonical state (cc-ci `/var/lib/ci-warm/<r>/canonical.json`, read 2026-06-17T23:19Z)
| Recipe | Canonical now | Note |
|---|---|---|
| discourse | (none) | no canonical dir |
| mattermost-lts | (none) | no canonical dir |
| mumble | `1.0.0+v1.6.870-0` @ 20260617T180501Z | **canonical PRESENT, written TODAY** — flake signal |
| bluesky-pds | (none) | no canonical dir |
| gitea | `3.5.3+1.24.2-rootless` @ 20260617T083930Z | 3.6.0 advance not promoted |
| keycloak | (none) | de-enrolled (WARM_CANONICAL off) |
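
A presence check like the one behind this table can be sketched as follows. Only the path layout `/var/lib/ci-warm/<r>/canonical.json` is taken from the heading above; the function names are hypothetical, and the JSON is returned as-is because its schema is not described in this file.

```python
import json
from pathlib import Path

WARM_ROOT = Path("/var/lib/ci-warm")  # layout from the heading above


def read_canonical(recipe, root=WARM_ROOT):
    """Return the parsed canonical.json for a recipe, or None if it is absent.

    Hypothetical helper; no assumption is made about the keys inside the file.
    """
    path = Path(root) / recipe / "canonical.json"
    if not path.is_file():
        return None
    return json.loads(path.read_text())


def canonical_presence(recipes, root=WARM_ROOT):
    """Map each recipe name to True/False: does it currently have a canonical?"""
    return {r: read_canonical(r, root) is not None for r in recipes}
```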
### M1 investigation tracker
| Recipe | Isolation run | Result | Root cause | Classification |
|---|---|---|---|---|
| discourse | DONE @23:40Z (`/tmp/redfix-discourse.log` on cc-ci) | install/backup/restore/custom PASS; **upgrade overlay FAIL**. Deploys+serves fine — NOT a timeout/FATA. | cc-ci overlay `tests/discourse/test_upgrade.py` asserts head runs official `discourse/discourse:3.5.3` + drops sidekiq; latest tag `0.8.1+3.5.0` AND main both still `bitnamilegacy/discourse:3.5.0`+sidekiq (migration exists in no release/main). The `depends_on discourse` string is a non-fatal prepull-only warning, not the deploy. | **stale/PR-specific cc-ci OVERLAY test** mismatched to canonical-sweep context (not flake/timeout/recipe-deploy/warm-machinery) |
| mattermost-lts | DONE @00:05Z (`/tmp/redfix-mattermost-lts.log`) | install/upgrade/backup/custom PASS; **restore FAIL** `ci_marker does not exist`; **deterministic in isolation** (not a load race) | recipe `postgres` svc backup labels: backs up hot live PGDATA + dump but has **NO `backupbot.restore.post-hook`** to replay the dump → restore doesn't round-trip postgres. Contrast immich (passes): dump-only `backup.volumes.postgres.path: backup.sql` + `restore.post-hook: /pg_backup.sh restore`. | **genuine RECIPE defect** at latest → recipe PR (adopt immich-style dump+restore-post-hook) |
| mumble | DONE @00:18Z (`/tmp/redfix-mumble.log`) | **ALL tiers PASS** incl. handshake; no orphans. Canon red under load; canonical written green TODAY | handshake (TLS+ServerSync) not completing within ~60s retry under heavy concurrent sweep load; fine in isolation | **load/timing FLAKE** → harness stabilization (readiness gate / retry). (1 isolation green; will repeat 1-2× before M1 claim) |
| bluesky-pds | DONE @00:45Z (`/tmp/redfix-bluesky-pds.log` + live diag) | cold lifecycle GREEN; **WC5 promote 000** reproduces (warm /xrpc/_health last status 0). NOT a flake | caddy on-demand TLS (`ask http://app:3000/tls-check`) can't reach app: caddy resolves bare `app` to OTHER stacks' app endpoints on shared `proxy` net (getent app→only 10.10.0.X, never internal 10.0.3.3; proxy has drone/traefik/keycloak/ccci `app` aliases) → no cert → 000. Promote machinery correct (refused to write canonical). | **genuine routing/RECIPE defect** (cross-stack `app`-alias collision on shared proxy) → recipe PR: unique PDS service name/alias. NOT promote-machinery, NOT flake |
| gitea | DONE @00:14Z (`/tmp/redfix-gitea2.log` + live container logs) | cold lifecycle (incl fresh 3.5.3→3.6.0 upgrade) PASS; **warm advance crash-loops** | `LoadCommonSettings() [F] error saving JWT Secret … failed to save "/etc/gitea/app.ini": read-only file system` — gitea 3.6.0/1.24.2 tries to persist a JWT to the read-only app.ini docker-config mount on warm reattach (before DB migration; 3.5.3 data intact). Cold passes (fresh secrets, no rewrite). | **genuine RECIPE defect** (3.6.0 + read-only app.ini config mount on advance) → recipe PR: render app.ini into the writable config volume. (1st gitea run hit a nixenv "already deployed" leftover confound — fixed by undeploying to idle then re-running) |
| keycloak | DONE @01:05Z (code-verified; no run) | de-enrolled. `canonical_domain("keycloak")` == `WARM_DOMAINS["keycloak"]` == `warm-keycloak.ci.commoninternet.net` EXACTLY (canonical.py:42, warm.py:27,44). Live keycloak 200 /realms/master. | data-warm canonical domain uses same `warm-<r>` scheme as the live-warm OIDC provider → promote would collide with live shared SSO. No collision-free canonical namespace exists. | **HARNESS defect** (warm-domain namespace collision) → fix: collision-free `canonical_domain` for live-warm providers (`warm-canon-<r>`), then enroll keycloak |
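
The keycloak fix proposed in the last row could take roughly this shape. It is a sketch, not the contents of `canonical.py` or `warm.py`: the `WARM_DOMAINS` stand-in holds only the one entry quoted above, and reusing the same base domain for every recipe is an assumption.

```python
BASE_DOMAIN = "ci.commoninternet.net"  # taken from the keycloak domain quoted above

# Stand-in for the live-warm mapping; only the keycloak entry is from the notes.
WARM_DOMAINS = {"keycloak": f"warm-keycloak.{BASE_DOMAIN}"}


def canonical_domain(recipe):
    """Return a data-warm canonical domain that cannot collide with a live-warm one.

    Recipes that are also live-warm providers get the `warm-canon-<r>` prefix;
    everything else keeps the existing `warm-<r>` scheme.
    """
    if recipe in WARM_DOMAINS:
        return f"warm-canon-{recipe}.{BASE_DOMAIN}"
    return f"warm-{recipe}.{BASE_DOMAIN}"
```

With this shape, `canonical_domain("keycloak")` no longer equals `WARM_DOMAINS["keycloak"]`, which is the precondition for re-enrolling keycloak.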
Gate: M1 not yet claimed.
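
For the mumble row in the tracker above, the proposed readiness gate / retry could be as simple as a bounded polling loop. This is a generic sketch: `wait_until` and its parameters are hypothetical, the probe that performs the actual handshake is left to the caller, and the default timeout simply mirrors the ~60s mentioned in the tracker rather than a tuned value.

```python
import time


def wait_until(probe, timeout_s=60.0, interval_s=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns a truthy value or timeout_s elapses.

    Returns True on success and False on timeout. An exception raised by
    probe() is treated as "not ready yet" and retried.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # not ready yet; fall through to the deadline check
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised in tests without real waiting.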
## Blocked
(none)