# STATUS — phase `redfix`

Phase SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md`

Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds,
gitea, keycloak) → isolate → root-cause → classify (flake vs genuine; recipe vs test vs
warm-machinery vs load) → FIX each (recipe PR or harness improvement) → verify green. No standing
exceptions. Nothing merged.

## Phase: M1 — investigate + isolate + classify (IN PROGRESS)

Bootstrapped 2026-06-17T23:20Z. cc-ci healthy, no run in flight, next scheduled sweep 2026-06-21
(3-day clear window). Disk `/` 38G free (75% used).

### Isolation harness (how I reproduce each failure ALONE)

Each canon-sweep per-recipe run is `runner/nightly_sweep.run_on_tag(recipe, latest)`:
`abra.recipe_checkout(recipe, <latest-tag>)` then `run_recipe_ci.py` with `RECIPE=<r>
CCCI_SKIP_FETCH=1` and REF/CCCI_QUICK/MODE/VERSION unset (cold, full, head==tag). Isolation = run ONE
recipe at a time with NO concurrent sweep load on the single node (the loaded node is the known
flake source per phase plan §2.1). Runs execute on cc-ci from `/etc/cc-ci`.
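The isolation setup above boils down to an environment contract: four variables removed, two set. A minimal Python sketch of that contract follows, assuming the variable names from the `env -u` invocation given later in this file. The helper name `isolation_env` is hypothetical and is not claimed to exist in the harness.

```python
import os

# Variables the isolation run leaves unset (cold, full, head==tag).
# Names taken from the `env -u ...` command quoted later in this file.
UNSET_FOR_ISOLATION = ("REF", "CCCI_QUICK", "MODE", "VERSION")


def isolation_env(recipe, base_env=None):
    """Return a copy of the environment for running ONE recipe in isolation.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name in UNSET_FOR_ISOLATION:
        env.pop(name, None)  # remove entirely rather than set to ""
    env["RECIPE"] = recipe
    env["CCCI_SKIP_FETCH"] = "1"
    return env
```

Passing the result as `env=` to `subprocess.run(...)` with `cwd="/etc/cc-ci"` would mirror the documented invocation. Removing the variables instead of blanking them matters only if the runner tests for presence rather than truthiness, which is an assumption here.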

### Starting canonical state (cc-ci `/var/lib/ci-warm/<r>/canonical.json`, read 2026-06-17T23:19Z)

| Recipe | Canonical now | Note |
|---|---|---|
| discourse | (none) | no canonical dir |
| mattermost-lts | (none) | no canonical dir |
| mumble | `1.0.0+v1.6.870-0` @ 20260617T180501Z | **canonical PRESENT, written TODAY** — flake signal |
| bluesky-pds | (none) | no canonical dir |
| gitea | `3.5.3+1.24.2-rootless` @ 20260617T083930Z | 3.6.0 advance not promoted |
| keycloak | (none) | de-enrolled (WARM_CANONICAL off) |

### M1 investigation tracker

| Recipe | Isolation run | Result | Root cause | Classification |
|---|---|---|---|---|
| discourse | DONE @23:40Z (`/tmp/redfix-discourse.log` on cc-ci) | install/backup/restore/custom PASS; **upgrade overlay FAIL**. Deploys+serves fine — NOT a timeout/FATA. | cc-ci overlay `tests/discourse/test_upgrade.py` asserts head runs official `discourse/discourse:3.5.3` + drops sidekiq; latest tag `0.8.1+3.5.0` AND main both still `bitnamilegacy/discourse:3.5.0`+sidekiq (migration exists in no release/main). The `depends_on discourse` string is a non-fatal prepull-only warning, not the deploy. | **stale/PR-specific cc-ci OVERLAY test** mismatched to canonical-sweep context (not flake/timeout/recipe-deploy/warm-machinery) |
| mattermost-lts | DONE @00:05Z (`/tmp/redfix-mattermost-lts.log`) | install/upgrade/backup/custom PASS; **restore FAIL** `ci_marker does not exist` — **deterministic in isolation** (not a load race) | recipe `postgres` svc backup labels: backs up hot live PGDATA + dump but has **NO `backupbot.restore.post-hook`** to replay the dump → restore doesn't round-trip postgres. Contrast immich (passes): dump-only `backup.volumes.postgres.path: backup.sql` + `restore.post-hook: /pg_backup.sh restore`. | **genuine RECIPE defect** at latest → recipe PR (adopt immich-style dump+restore-post-hook) |
| mumble | DONE — **2× isolation GREEN** (`/tmp/redfix-mumble.log` + `/tmp/redfix-mumble2.log`) | **ALL tiers PASS** incl. handshake on BOTH runs; no orphans; canonical re-promoted green each time | handshake (TLS+ServerSync) not completing within ~60s retry under heavy concurrent sweep load; fine in isolation | **load/timing FLAKE** → harness stabilization (readiness gate / retry) |
| bluesky-pds | DONE @00:45Z (`/tmp/redfix-bluesky-pds.log` + live diag) | cold lifecycle GREEN; **WC5 promote 000** reproduces (warm /xrpc/_health last status 0). NOT a flake | caddy on-demand TLS (`ask http://app:3000/tls-check`) can't reach app: caddy resolves bare `app` to OTHER stacks' app endpoints on shared `proxy` net (getent app→only 10.10.0.X, never internal 10.0.3.3; proxy has drone/traefik/keycloak/ccci `app` aliases) → no cert → 000. Promote machinery correct (refused to write canonical). | **genuine routing/RECIPE defect** (cross-stack `app`-alias collision on shared proxy) → recipe PR: unique PDS service name/alias. NOT promote-machinery, NOT flake |
| gitea | DONE @00:14Z (`/tmp/redfix-gitea2.log` + live container logs) | cold lifecycle (incl fresh 3.5.3→3.6.0 upgrade) PASS; **warm advance crash-loops** | `LoadCommonSettings() [F] error saving JWT Secret … failed to save "/etc/gitea/app.ini": read-only file system` — gitea 3.6.0/1.24.2 tries to persist a JWT to the read-only app.ini docker-config mount on warm reattach (before DB migration; 3.5.3 data intact). Cold passes (fresh secrets, no rewrite). | **genuine RECIPE defect** (3.6.0 + read-only app.ini config mount on advance) → recipe PR: render app.ini into the writable config volume. (1st gitea run hit a nixenv "already deployed" leftover confound — fixed by undeploying to idle then re-running) |
| keycloak | DONE @01:05Z (code-verified; no run) | de-enrolled. `canonical_domain("keycloak")` == `WARM_DOMAINS["keycloak"]` == `warm-keycloak.ci.commoninternet.net` EXACTLY (canonical.py:42, warm.py:27,44). Live keycloak 200 /realms/master. | data-warm canonical domain uses same `warm-<r>` scheme as the live-warm OIDC provider → promote would collide with live shared SSO. No collision-free canonical namespace exists. | **HARNESS defect** (warm-domain namespace collision) → fix: collision-free `canonical_domain` for live-warm providers (`warm-canon-<r>`), then enroll keycloak |
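For the mumble flake, the tracker proposes a readiness gate or a smarter retry. One possible shape is sketched below in Python under stated assumptions: `wait_until_ready` is a hypothetical helper, `probe` stands in for whatever handshake check the custom tier performs, and the 180 s budget is an illustrative value, not a measured one.

```python
import time


def wait_until_ready(probe, timeout_s=180.0, initial_delay_s=1.0, max_delay_s=15.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or the time budget is spent.

    Returns the number of attempts on success; raises TimeoutError otherwise.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    attempts = 0
    while True:
        attempts += 1
        if probe():
            return attempts
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"not ready after {attempts} attempts")
        sleep(min(delay, remaining))         # never sleep past the deadline
        delay = min(delay * 2, max_delay_s)  # exponential backoff, capped
```

A deadline-based loop with capped backoff probes quickly when the node is idle and keeps trying longer when it is loaded, without changing what the handshake test itself asserts.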

### M1 results table (recipe → failure → isolation result → root cause → classification → fix approach)

| Recipe | Canon-sweep failure | Isolation result | Flake or genuine | Root cause | Class | Fix approach (M2) |
|---|---|---|---|---|---|---|
| discourse | "cold-deploy timeout / deploy FATA" | install/backup/restore/custom GREEN; **upgrade overlay RED** | **genuine (deterministic)** — but the canon root-cause was WRONG (no timeout, no deploy FATA) | cc-ci overlay `tests/discourse/test_upgrade.py` asserts head = official `discourse/discourse:3.5.3` + sidekiq dropped; that migration is in NO release tag and NOT in main (all use `bitnamilegacy/discourse:3.5.0`+sidekiq) | **stale/PR-specific cc-ci OVERLAY test** | make the overlay assert migration-faithfulness only when the head IS that migration (not vs a release tag), OR a recipe PR migrating off deprecated bitnamilegacy — settle in M2 (NOT a test-weakening) |
| mattermost-lts | `test_restore_returns_state` RED | install/upgrade/backup/custom GREEN; **restore RED** (`ci_marker does not exist`) | **genuine (deterministic in isolation)** — NOT the canon "loaded-node race" | recipe postgres backup labels back up hot PGDATA + a dump but have **no `backupbot.restore.post-hook`** to replay it; restore doesn't round-trip. immich (passes) uses dump-only path + `restore.post-hook` | **genuine RECIPE defect** | recipe PR: adopt immich-style postgres dump + `backupbot.restore.post-hook` replay |
| mumble | `test_handshake…` RED | **ALL tiers GREEN** in isolation (2×) incl. handshake | **FLAKE (load/timing)** | handshake (TLS+ServerSync) doesn't complete within the ~60s retry under heavy concurrent sweep load; fine isolated; canonical written green today | **load/concurrency FLAKE** | harness stabilization: stronger readiness gate before the custom tier / longer-or-smarter handshake retry |
| bluesky-pds | warm promote `/xrpc/_health` → 000 | cold lifecycle GREEN; **warm promote 000 reproduces** | **genuine (deterministic)** — NOT a load/rate-limit flake | caddy on-demand TLS calls `http://app:3000/tls-check`; caddy resolves bare `app` to OTHER stacks' app endpoints on the shared `proxy` net (every stack aliases its main svc `app`), never bluesky's own internal app (10.0.3.3) → connection refused → no cert → 000 | **genuine ROUTING/RECIPE defect** (cross-stack `app`-alias collision) | recipe PR: give the PDS service a unique name/alias so caddy resolves only bluesky's app |
| gitea | 3.5.3→3.6.0 warm advance doesn't promote | cold (incl fresh upgrade) GREEN; **warm advance crash-loops** | **genuine (deterministic)** | gitea 3.6.0/1.24.2 saves a JWT secret to `/etc/gitea/app.ini` on warm reattach; app.ini is a **read-only docker-config mount** → `read-only file system` FATA at LoadCommonSettings (pre-migration; 3.5.3 data intact). Cold passes (fresh secrets, no rewrite) | **genuine RECIPE defect** | recipe PR: render app.ini into the writable `config:/etc/gitea` volume (entrypoint) instead of a read-only docker config |
| keycloak | de-enrolled (not tested) | code-verified (no run) | **genuine (structural)** | `canonical_domain("keycloak")` == `WARM_DOMAINS["keycloak"]` == `warm-keycloak.ci.commoninternet.net` EXACTLY → a data-warm canonical would collide with the live-warm OIDC provider | **HARNESS defect** (warm-domain namespace collision) | harness: collision-free `canonical_domain` for live-warm providers (`warm-canon-<r>`), then enroll keycloak (WARM_CANONICAL=True) |
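The keycloak row proposes a collision-free `canonical_domain` for live-warm providers. A rough Python sketch of that idea follows. The names `WARM_DOMAINS`, `stable_domain` and `canonical_domain` come from the table above, but their bodies here are stand-ins written for illustration, not the code in `canonical.py` or `warm.py`.

```python
BASE_DOMAIN = "ci.commoninternet.net"

# Stand-in for warm.WARM_DOMAINS: recipes that run as live-warm shared providers.
WARM_DOMAINS = {"keycloak": f"warm-keycloak.{BASE_DOMAIN}"}


def stable_domain(recipe):
    """Assumed shape of the existing warm-<r> scheme."""
    return f"warm-{recipe}.{BASE_DOMAIN}"


def canonical_domain(recipe):
    """Data-warm canonical domain that never equals a live-warm provider's domain."""
    if recipe in WARM_DOMAINS:
        # Live-warm provider: move the canonical into its own namespace.
        return f"warm-canon-{recipe}.{BASE_DOMAIN}"
    return stable_domain(recipe)
```

Only recipes listed in `WARM_DOMAINS` change domain in this sketch, so existing canonicals for other recipes would keep their current names.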

### HOW the Adversary cold-verifies each classification (run ONE recipe at a time, no concurrent load)

Isolation invocation (per recipe `R` at latest tag `T`), from `/etc/cc-ci` on cc-ci:
`git -C ~/.abra/recipes/R checkout -f --quiet T && env -u REF -u CCCI_QUICK -u MODE -u VERSION RECIPE=R CCCI_SKIP_FETCH=1 cc-ci-run runner/run_recipe_ci.py`
Latest tags: discourse `0.8.1+3.5.0`, mattermost-lts `2.1.9+10.11.15`, mumble `1.0.0+v1.6.870-0`, bluesky-pds `0.3.0+v0.4.219`, gitea `3.6.0+1.24.2-rootless`.

- **discourse** — EXPECT install/backup/restore/custom pass, upgrade fail on `test_head_runs_official_image_not_bitnamilegacy` + `test_sidekiq_service_dropped_by_head`. Confirm the overlay mismatch statically: `git -C ~/.abra/recipes/discourse show 0.8.1+3.5.0:compose.yml | grep -A1 ' app:'` and `... show main:compose.yml` both = `bitnamilegacy/discourse:3.5.0`; `grep -c 'sidekiq:'` = 1 in both. So the test's `discourse/discourse:3.5.3`/no-sidekiq expectation exists nowhere upstream.
- **mattermost-lts** — EXPECT restore fail `relation "ci_marker" does not exist`. Confirm root cause statically: `git -C ~/.abra/recipes/mattermost-lts show 2.1.9+10.11.15:compose.yml | grep backupbot` shows pre-hook + `backup.path` but NO `restore.post-hook`; immich `git -C ~/.abra/recipes/immich show <latest>:compose.yml | grep backupbot` shows `restore.post-hook: /pg_backup.sh restore`.
- **mumble** — EXPECT all tiers green (run 2–3× to confirm reproducibly green isolated). Canonical written green: `cat /var/lib/ci-warm/mumble/canonical.json`.
- **bluesky-pds** — EXPECT cold green, WC5 promote `!! WC5 promote failed … warm-bluesky-pds … last status 0`. While the warm stack is up, confirm root cause: caddy logs `dial tcp 10.10.0.X:3000: connect: connection refused` for `app:3000/tls-check`; `docker exec <caddy> getent hosts app` returns proxy IPs (10.10.0.X), the app's real internal IP is 10.0.3.x; `docker network inspect proxy | grep _app` shows many stacks aliasing `app`. (Tear down the orphaned warm-bluesky-pds stack + volumes after.)
- **gitea** — REQUIRES idle canonical first: if warm-gitea is deployed, `docker stack rm warm-gitea_ci_commoninternet_net` (retains data+config volumes) so the advance reattaches from idle. EXPECT cold green, warm advance crash-loop with container log `LoadCommonSettings() [F] error saving JWT Secret … "/etc/gitea/app.ini": read-only file system`. Restore: leave warm-gitea undeployed (idle 3.5.3, volumes retained) — registry stays `3.5.3+1.24.2-rootless`.
- **keycloak** — no run. Code-verify: `canonical.canonical_domain('keycloak')` → `warm.stable_domain('keycloak')` → `warm-keycloak.ci.commoninternet.net`; `warm.WARM_DOMAINS['keycloak']` == same string (runner/harness/canonical.py:42-44, warm.py:27-29,44-48). Live keycloak 200 on `/realms/master`.
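The static checks in the list above are `git show <tag>:compose.yml | grep ...` one-liners. A small Python sketch of the mattermost-lts check is shown here as one way to script it. The function names are hypothetical, and the matching is plain substring search, equivalent to the `grep backupbot` in the bullet, not a YAML parse.

```python
import subprocess


def compose_at_tag(recipe_dir, tag):
    """Return compose.yml as committed at `tag` (same as `git show <tag>:compose.yml`)."""
    result = subprocess.run(
        ["git", "-C", recipe_dir, "show", f"{tag}:compose.yml"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def backupbot_lines(compose_text):
    """Python equivalent of `grep backupbot`: the matching lines, stripped."""
    return [line.strip() for line in compose_text.splitlines() if "backupbot" in line]


def has_restore_post_hook(compose_text):
    """True if any backupbot label line declares a restore post-hook."""
    return any("restore.post-hook" in line for line in backupbot_lines(compose_text))
```

Per the bullet above, `has_restore_post_hook(compose_at_tag(recipe_dir, "2.1.9+10.11.15"))` would be expected to return `False` for mattermost-lts and `True` for immich at its latest tag, with `recipe_dir` pointing at the respective recipe checkout.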

### Node state left clean

All isolation runs torn down; orphaned warm-bluesky-pds stack+volumes removed; warm-gitea restored to idle 3.5.3 (volumes retained, registry unchanged); only live warm-keycloak deployed (healthy). No `run_recipe_ci.py` processes.

## M1 — PASS @ 2026-06-18T01:18Z (REVIEW-redfix.md; all 6 classifications cold-verified CORRECT by Adversary's own isolation re-runs). No VETO. Cleared to M2.

## Phase: M2 — FIX + verify all six (IN PROGRESS)

Fix designs locked in BACKLOG-redfix.md. Recipe PRs (mattermost-lts/bluesky/gitea) on git.autonomic.zone
mirrors via the recipe mirror+PR flow, verified `!testme` (NEVER merge). Harness fixes (keycloak/mumble)
on a cc-ci branch, verified via the harness. discourse: overlay-scope decision. Node now free for my
deploys (Adversary done with M1).

### M2 fix tracker

| Recipe | Fix type | PR/branch | Status |
|---|---|---|---|
| mattermost-lts | recipe PR (pg_backup.sh + restore.post-hook) | mirror PR #1 `ci/pg-restore` @4ca7f418 | **DONE — !testme run #901 ALL tiers green** (restore__cc-ci failures=0 skipped=0; the M1-failing test_restore_returns_state now PASSES) |
| bluesky-pds | recipe PR (unique `pds` internal alias for caddy) | mirror PR #4 `ci/warm-routing-alias` | PR created; verifying on PROMOTE path (warm-bluesky-pds → expect 200 vs M1 000; !testme cold-only won't reproduce) |
| gitea | recipe PR (app.ini → writable volume) | — | pending |
| keycloak | harness (collision-free canonical_domain) + enroll | — | pending |
| mumble | harness (handshake readiness/retry stabilization) | — | pending |
| discourse | recipe PR (official-image migration) | mirror PR #4 `discourse-official-image` | already !testme-GREEN @53ba0910 (run #849, 16:36Z); re-verify fresh |

## Gate: M1 — PASS (above). M2 not yet claimed.

**WHAT (M1 DoD).** All six canon-sweep failures investigated in ISOLATION (one recipe at a time, no
concurrent sweep load), root-caused with first-hand evidence, and classified (flake vs genuine; recipe
vs test vs warm-machinery vs load) — see the **M1 results table** + **HOW the Adversary cold-verifies**
sections above. Summary: discourse = stale cc-ci overlay test (canon timeout/FATA root-cause was
wrong); mattermost-lts = genuine recipe defect (no `backupbot.restore.post-hook`); mumble = load/timing
FLAKE (2× isolation green); bluesky-pds = genuine routing defect (caddy↔app `app`-alias collision on
shared proxy); gitea = genuine recipe defect (read-only app.ini config mount + 3.6.0 JWT save);
keycloak = harness warm-domain namespace collision. NO "probably a flake" — every classification has
an isolation re-run or code proof.

**HOW + EXPECTED + WHERE.** Per-recipe cold-verify commands, expected outputs, and evidence paths are
in the two sections above ("M1 results table" and "HOW the Adversary cold-verifies each classification").
Evidence logs on cc-ci: `/tmp/redfix-{discourse,mattermost-lts,mumble,mumble2,bluesky-pds,gitea2}.log`.
Reasoning/dead-ends in JOURNAL-redfix.md. Node left clean (see "Node state left clean" above).

## Blocked

(none)