cc-ci/machine-docs/STATUS-redfix.md
# STATUS — phase `redfix`
Phase SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md`
Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds,
gitea, keycloak) → isolate → root-cause → classify (flake vs genuine; recipe vs test vs
warm-machinery vs load) → FIX each (recipe PR or harness improvement) → verify green. No standing
exceptions. Nothing merged.
## Phase: M1 — investigate + isolate + classify (IN PROGRESS)
Bootstrapped 2026-06-17T23:20Z. cc-ci healthy, no run in flight, next scheduled sweep 2026-06-21
(3-day clear window). Disk `/` 38G free (75% used).
### Isolation harness (how I reproduce each failure ALONE)
Each canon-sweep per-recipe run is `runner/nightly_sweep.run_on_tag(recipe, latest)`:
`abra.recipe_checkout(recipe, <latest-tag>)` then `run_recipe_ci.py` with `RECIPE=<r>
CCCI_SKIP_FETCH=1` and REF/QUICK/MODE/VERSION unset (cold, full, head==tag). Isolation = run ONE
recipe at a time with NO concurrent sweep load on the single node (the loaded node is the known
flake source per phase plan §2.1). Runs execute on cc-ci from `/etc/cc-ci`.
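The environment for one isolation run can be sketched as below. This is a minimal illustration, not the harness's code: the helper name `isolation_env` is hypothetical, and the variable names are taken from the invocation in the cold-verify section (`REF`, `CCCI_QUICK`, `MODE`, `VERSION` unset; `RECIPE` and `CCCI_SKIP_FETCH=1` set).

```python
import os

# Variables the isolation run leaves unset so the run is cold, full, and head==tag.
UNSET_VARS = ("REF", "CCCI_QUICK", "MODE", "VERSION")


def isolation_env(recipe, base_env=None):
    """Return the environment for a single-recipe isolation run.

    Hypothetical helper: copies the base environment, drops the variables
    that would select a ref, quick mode, or a pinned version, and sets the
    two variables the isolation invocation passes explicitly.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name in UNSET_VARS:
        env.pop(name, None)
    env["RECIPE"] = recipe
    env["CCCI_SKIP_FETCH"] = "1"
    return env
```

Passing the result as `env=` to `subprocess.run` would mirror the `env -u ...` form of the shell invocation; running one such process at a time is what keeps concurrent sweep load off the node.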
### Starting canonical state (cc-ci `/var/lib/ci-warm/<r>/canonical.json`, read 2026-06-17T23:19Z)
| Recipe | Canonical now | Note |
|---|---|---|
| discourse | (none) | no canonical dir |
| mattermost-lts | (none) | no canonical dir |
| mumble | `1.0.0+v1.6.870-0` @ 20260617T180501Z | **canonical PRESENT, written TODAY** — flake signal |
| bluesky-pds | (none) | no canonical dir |
| gitea | `3.5.3+1.24.2-rootless` @ 20260617T083930Z | 3.6.0 advance not promoted |
| keycloak | (none) | de-enrolled (WARM_CANONICAL off) |
### M1 investigation tracker
| Recipe | Isolation run | Result | Root cause | Classification |
|---|---|---|---|---|
| discourse | DONE @23:40Z (`/tmp/redfix-discourse.log` on cc-ci) | install/backup/restore/custom PASS; **upgrade overlay FAIL**. Deploys+serves fine — NOT a timeout/FATA. | cc-ci overlay `tests/discourse/test_upgrade.py` asserts head runs official `discourse/discourse:3.5.3` + drops sidekiq; latest tag `0.8.1+3.5.0` AND main both still `bitnamilegacy/discourse:3.5.0`+sidekiq (migration exists in no release/main). The `depends_on discourse` string is a non-fatal prepull-only warning, not the deploy. | **stale/PR-specific cc-ci OVERLAY test** mismatched to canonical-sweep context (not flake/timeout/recipe-deploy/warm-machinery) |
| mattermost-lts | DONE @00:05Z (`/tmp/redfix-mattermost-lts.log`) | install/upgrade/backup/custom PASS; **restore FAIL** `ci_marker does not exist`; **deterministic in isolation** (not a load race) | recipe `postgres` svc backup labels: backs up hot live PGDATA + dump but has **NO `backupbot.restore.post-hook`** to replay the dump → restore doesn't round-trip postgres. Contrast immich (passes): dump-only `backup.volumes.postgres.path: backup.sql` + `restore.post-hook: /pg_backup.sh restore`. | **genuine RECIPE defect** at latest → recipe PR (adopt immich-style dump+restore-post-hook) |
| mumble | DONE — **2× isolation GREEN** (`/tmp/redfix-mumble.log` + `/tmp/redfix-mumble2.log`) | **ALL tiers PASS** incl. handshake on BOTH runs; no orphans; canonical re-promoted green each time | handshake (TLS+ServerSync) not completing within ~60s retry under heavy concurrent sweep load; fine in isolation | **load/timing FLAKE** → harness stabilization (readiness gate / retry) |
| bluesky-pds | DONE @00:45Z (`/tmp/redfix-bluesky-pds.log` + live diag) | cold lifecycle GREEN; **WC5 promote 000** reproduces (warm /xrpc/_health last status 0). NOT a flake | caddy on-demand TLS (`ask http://app:3000/tls-check`) can't reach app: caddy resolves bare `app` to OTHER stacks' app endpoints on shared `proxy` net (getent app→only 10.10.0.X, never internal 10.0.3.3; proxy has drone/traefik/keycloak/ccci `app` aliases) → no cert → 000. Promote machinery correct (refused to write canonical). | **genuine routing/RECIPE defect** (cross-stack `app`-alias collision on shared proxy) → recipe PR: unique PDS service name/alias. NOT promote-machinery, NOT flake |
| gitea | DONE @00:14Z (`/tmp/redfix-gitea2.log` + live container logs) | cold lifecycle (incl fresh 3.5.3→3.6.0 upgrade) PASS; **warm advance crash-loops** | `LoadCommonSettings() [F] error saving JWT Secret … failed to save "/etc/gitea/app.ini": read-only file system` — gitea 3.6.0/1.24.2 tries to persist a JWT to the read-only app.ini docker-config mount on warm reattach (before DB migration; 3.5.3 data intact). Cold passes (fresh secrets, no rewrite). | **genuine RECIPE defect** (3.6.0 + read-only app.ini config mount on advance) → recipe PR: render app.ini into the writable config volume. (1st gitea run hit a nixenv "already deployed" leftover confound — fixed by undeploying to idle then re-running) |
| keycloak | DONE @01:05Z (code-verified; no run) | de-enrolled. `canonical_domain("keycloak")` == `WARM_DOMAINS["keycloak"]` == `warm-keycloak.ci.commoninternet.net` EXACTLY (canonical.py:42, warm.py:27,44). Live keycloak 200 /realms/master. | data-warm canonical domain uses same `warm-<r>` scheme as the live-warm OIDC provider → promote would collide with live shared SSO. No collision-free canonical namespace exists. | **HARNESS defect** (warm-domain namespace collision) → fix: collision-free `canonical_domain` for live-warm providers (`warm-canon-<r>`), then enroll keycloak |
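The keycloak collision in the last row can be illustrated with a small sketch. It is a hypothetical reconstruction, not the code in `canonical.py` / `warm.py`: only the names `canonical_domain`, `stable_domain`, `WARM_DOMAINS`, the `warm-keycloak.ci.commoninternet.net` string, and the `warm-canon-<r>` prefix come from this document. The `_current` / `_fixed` suffixes and the full shape of the fixed domain are assumptions.

```python
# Live-warm providers and their domains (only keycloak is documented here).
WARM_DOMAINS = {"keycloak": "warm-keycloak.ci.commoninternet.net"}


def stable_domain(recipe):
    """The warm-<r> scheme described in the tracker."""
    return f"warm-{recipe}.ci.commoninternet.net"


def canonical_domain_current(recipe):
    """Behaviour as described: the canonical domain reuses the warm-<r> scheme."""
    return stable_domain(recipe)


def canonical_domain_fixed(recipe):
    """Assumed fix: live-warm providers get a separate warm-canon-<r> name."""
    if recipe in WARM_DOMAINS:
        return f"warm-canon-{recipe}.ci.commoninternet.net"
    return stable_domain(recipe)
```

Under these assumptions the current scheme returns the live SSO domain for keycloak, while the fixed variant changes the name only for recipes listed in `WARM_DOMAINS` and leaves every other recipe's canonical domain untouched.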
### M1 results table (recipe → failure → isolation result → root cause → classification → fix approach)
| Recipe | Canon-sweep failure | Isolation result | Flake or genuine | Root cause | Class | Fix approach (M2) |
|---|---|---|---|---|---|---|
| discourse | "cold-deploy timeout / deploy FATA" | install/backup/restore/custom GREEN; **upgrade overlay RED** | **genuine (deterministic)** — but the canon root-cause was WRONG (no timeout, no deploy FATA) | cc-ci overlay `tests/discourse/test_upgrade.py` asserts head = official `discourse/discourse:3.5.3` + sidekiq dropped; that migration is in NO release tag and NOT in main (all use `bitnamilegacy/discourse:3.5.0`+sidekiq) | **stale/PR-specific cc-ci OVERLAY test** | make the overlay assert migration-faithfulness only when the head IS that migration (not vs a release tag), OR a recipe PR migrating off deprecated bitnamilegacy — settle in M2 (NOT a test-weakening) |
| mattermost-lts | `test_restore_returns_state` RED | install/upgrade/backup/custom GREEN; **restore RED** (`ci_marker does not exist`) | **genuine (deterministic in isolation)** — NOT the canon "loaded-node race" | recipe postgres backup labels back up hot PGDATA + a dump but have **no `backupbot.restore.post-hook`** to replay it; restore doesn't round-trip. immich (passes) uses dump-only path + `restore.post-hook` | **genuine RECIPE defect** | recipe PR: adopt immich-style postgres dump + `backupbot.restore.post-hook` replay |
| mumble | `test_handshake…` RED | **ALL tiers GREEN** in isolation (2×) incl. handshake | **FLAKE (load/timing)** | handshake (TLS+ServerSync) doesn't complete within the 60s retry under heavy concurrent sweep load; fine isolated; canonical written green today | **load/concurrency FLAKE** | harness stabilization: stronger readiness gate before the custom tier / longer-or-smarter handshake retry |
| bluesky-pds | warm promote `/xrpc/_health` → 000 | cold lifecycle GREEN; **warm promote 000 reproduces** | **genuine (deterministic)** — NOT a load/rate-limit flake | caddy on-demand TLS calls `http://app:3000/tls-check`; caddy resolves bare `app` to OTHER stacks' app endpoints on the shared `proxy` net (every stack aliases its main svc `app`), never bluesky's own internal app (10.0.3.3) → connection refused → no cert → 000 | **genuine ROUTING/RECIPE defect** (cross-stack `app`-alias collision) | recipe PR: give the PDS service a unique name/alias so caddy resolves only bluesky's app |
| gitea | 3.5.3→3.6.0 warm advance doesn't promote | cold (incl fresh upgrade) GREEN; **warm advance crash-loops** | **genuine (deterministic)** | gitea 3.6.0/1.24.2 saves a JWT secret to `/etc/gitea/app.ini` on warm reattach; app.ini is a **read-only docker-config mount** → `read-only file system` FATA at LoadCommonSettings (pre-migration; 3.5.3 data intact). Cold passes (fresh secrets, no rewrite) | **genuine RECIPE defect** | recipe PR: render app.ini into the writable `config:/etc/gitea` volume (entrypoint) instead of a read-only docker config |
| keycloak | de-enrolled (not tested) | code-verified (no run) | **genuine (structural)** | `canonical_domain("keycloak")` == `WARM_DOMAINS["keycloak"]` == `warm-keycloak.ci.commoninternet.net` EXACTLY → a data-warm canonical would collide with the live-warm OIDC provider | **HARNESS defect** (warm-domain namespace collision) | harness: collision-free `canonical_domain` for live-warm providers (`warm-canon-<r>`), then enroll keycloak (WARM_CANONICAL=True) |
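For the mumble row, one possible shape for a "longer-or-smarter" retry is a deadline-bounded poll with capped exponential backoff. This is a sketch only: `wait_until`, its parameters, and the backoff values are hypothetical, the 60s default simply echoes the retry window named in the table, and the actual handshake probe is left to the caller.

```python
import time


def wait_until(probe, timeout_s=60.0, initial_delay_s=1.0, max_delay_s=8.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns truthy or the deadline passes.

    Returns True if the probe succeeded in time, False otherwise.
    Connection-level errors (OSError and subclasses) count as "not ready yet".
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # service not accepting connections yet
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline; back off up to max_delay_s.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

Used as a readiness gate, the same helper could run before the custom tier with a cheap probe (for example a TCP connect), and again around the handshake with a larger `timeout_s`, so that a loaded node delays the test rather than failing it.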
### HOW the Adversary cold-verifies each classification (run ONE recipe at a time, no concurrent load)
Isolation invocation (per recipe `R` at latest tag `T`), from `/etc/cc-ci` on cc-ci:
`git -C ~/.abra/recipes/R checkout -f --quiet T && env -u REF -u CCCI_QUICK -u MODE -u VERSION RECIPE=R CCCI_SKIP_FETCH=1 cc-ci-run runner/run_recipe_ci.py`
Latest tags: discourse `0.8.1+3.5.0`, mattermost-lts `2.1.9+10.11.15`, mumble `1.0.0+v1.6.870-0`, bluesky-pds `0.3.0+v0.4.219`, gitea `3.6.0+1.24.2-rootless`.
- **discourse** — EXPECT install/backup/restore/custom pass, upgrade fail on `test_head_runs_official_image_not_bitnamilegacy` + `test_sidekiq_service_dropped_by_head`. Confirm the overlay mismatch statically: `git -C ~/.abra/recipes/discourse show 0.8.1+3.5.0:compose.yml | grep -A1 ' app:'` and `... show main:compose.yml` both = `bitnamilegacy/discourse:3.5.0`; `grep -c 'sidekiq:'` = 1 in both. So the test's `discourse/discourse:3.5.3`/no-sidekiq expectation exists nowhere upstream.
- **mattermost-lts** — EXPECT restore fail `relation "ci_marker" does not exist`. Confirm root cause statically: `git -C ~/.abra/recipes/mattermost-lts show 2.1.9+10.11.15:compose.yml | grep backupbot` shows pre-hook + `backup.path` but NO `restore.post-hook`; immich `git -C ~/.abra/recipes/immich show <latest>:compose.yml | grep backupbot` shows `restore.post-hook: /pg_backup.sh restore`.
- **mumble** — EXPECT all tiers green (run 2-3× to confirm reproducibly green isolated). Canonical written green: `cat /var/lib/ci-warm/mumble/canonical.json`.
- **bluesky-pds** — EXPECT cold green, WC5 promote `!! WC5 promote failed … warm-bluesky-pds … last status 0`. While the warm stack is up, confirm root cause: caddy logs `dial tcp 10.10.0.X:3000: connect: connection refused` for `app:3000/tls-check`; `docker exec <caddy> getent hosts app` returns proxy IPs (10.10.0.X), the app's real internal IP is 10.0.3.x; `docker network inspect proxy | grep _app` shows many stacks aliasing `app`. (Tear down the orphaned warm-bluesky-pds stack + volumes after.)
- **gitea** — REQUIRES idle canonical first: if warm-gitea is deployed, `docker stack rm warm-gitea_ci_commoninternet_net` (retains data+config volumes) so the advance reattaches from idle. EXPECT cold green, warm advance crash-loop with container log `LoadCommonSettings() [F] error saving JWT Secret … "/etc/gitea/app.ini": read-only file system`. Restore: leave warm-gitea undeployed (idle 3.5.3, volumes retained) — registry stays `3.5.3+1.24.2-rootless`.
- **keycloak** — no run. Code-verify: `canonical.canonical_domain('keycloak')` → `warm.stable_domain('keycloak')` → `warm-keycloak.ci.commoninternet.net`; `warm.WARM_DOMAINS['keycloak']` == same string (runner/harness/canonical.py:42-44, warm.py:27-29,44-48). Live keycloak 200 on `/realms/master`.
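The static `grep backupbot` check for mattermost-lts above can also be scripted. The sketch below is an illustration under assumptions: the function names are hypothetical, and it only looks for `backupbot.*` label keys in the compose text rather than parsing YAML.

```python
import re


def backupbot_labels(compose_text):
    """Return the set of backupbot.* label keys mentioned in a compose file."""
    return set(re.findall(r"backupbot\.[A-Za-z0-9_.\-]+", compose_text))


def has_restore_post_hook(compose_text):
    """True if the compose text declares a backupbot.restore.post-hook label."""
    return "backupbot.restore.post-hook" in backupbot_labels(compose_text)
```

Fed with the output of `git -C ~/.abra/recipes/<recipe> show <tag>:compose.yml`, this would be expected to return False for mattermost-lts at its latest tag and True for immich, matching the manual grep.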
### Node state left clean
All isolation runs torn down; orphaned warm-bluesky-pds stack+volumes removed; warm-gitea restored to idle 3.5.3 (volumes retained, registry unchanged); only live warm-keycloak deployed (healthy). No `run_recipe_ci.py` processes.
## M1 — PASS @ 2026-06-18T01:18Z (REVIEW-redfix.md; all 6 classifications cold-verified CORRECT by Adversary's own isolation re-runs). No VETO. Cleared to M2.
## Phase: M2 — FIX + verify all six (IN PROGRESS)
Fix designs locked in BACKLOG-redfix.md. Recipe PRs (mattermost-lts/bluesky/gitea) on git.autonomic.zone
mirrors via the recipe mirror+PR flow, verified `!testme` (NEVER merge). Harness fixes (keycloak/mumble)
on a cc-ci branch, verified via the harness. discourse: overlay-scope decision. Node now free for my
deploys (Adversary done with M1).
### M2 fix tracker
| Recipe | Fix type | PR/branch | Status |
|---|---|---|---|
| mattermost-lts | recipe PR (pg_backup.sh + restore.post-hook) | mirror PR #1 `ci/pg-restore` @4ca7f418 | **DONE — !testme run #901 ALL tiers green** (restore__cc-ci failures=0 skipped=0; the M1-failing test_restore_returns_state now PASSES) |
| bluesky-pds | recipe PR (unique `pds` internal alias for caddy) | mirror PR #4 `ci/warm-routing-alias` | PR created; verifying on PROMOTE path (warm-bluesky-pds → expect 200 vs M1 000; !testme cold-only won't reproduce) |
| gitea | recipe PR (app.ini → writable volume) | — | pending |
| keycloak | harness (collision-free canonical_domain) + enroll | — | pending |
| mumble | harness (handshake readiness/retry stabilization) | — | pending |
| discourse | recipe PR (official-image migration) | mirror PR #4 `discourse-official-image` | already !testme-GREEN @53ba0910 (run #849, 16:36Z); re-verify fresh |
## Gate: M1 — PASS (above). M2 not yet claimed.
**WHAT (M1 DoD).** All six canon-sweep failures investigated in ISOLATION (one recipe at a time, no
concurrent sweep load), root-caused with first-hand evidence, and classified (flake vs genuine; recipe
vs test vs warm-machinery vs load) — see the **M1 results table** + **HOW the Adversary cold-verifies**
sections above. Summary: discourse = stale cc-ci overlay test (canon timeout/FATA root-cause was
wrong); mattermost-lts = genuine recipe defect (no `backupbot.restore.post-hook`); mumble = load/timing
FLAKE (2× isolation green); bluesky-pds = genuine routing defect (caddy↔app `app`-alias collision on
shared proxy); gitea = genuine recipe defect (read-only app.ini config mount + 3.6.0 JWT save);
keycloak = harness warm-domain namespace collision. NO "probably a flake" — every classification has
an isolation re-run or code proof.
**HOW + EXPECTED + WHERE.** Per-recipe cold-verify commands, expected outputs, and evidence paths are
in the two sections above ("M1 results table" and "HOW the Adversary cold-verifies each classification").
Evidence logs on cc-ci: `/tmp/redfix-{discourse,mattermost-lts,mumble,mumble2,bluesky-pds,gitea2}.log`.
Reasoning/dead-ends in JOURNAL-redfix.md. Node left clean (see "Node state left clean" above).
## Blocked
(none)