# STATUS — phase `redfix`

Phase SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md`

Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds,
gitea, keycloak) → isolate → root-cause → classify (flake vs genuine; recipe vs test vs
warm-machinery vs load) → FIX each (recipe PR or harness improvement) → verify green. No standing
exceptions. Nothing merged.

## Phase: M1 — investigate + isolate + classify (IN PROGRESS)

Bootstrapped 2026-06-17T23:20Z. cc-ci healthy, no run in flight, next scheduled sweep 2026-06-21
(3-day clear window). Disk `/` 38G free (75% used).

### Isolation harness (how I reproduce each failure ALONE)

Each canon-sweep per-recipe run is `runner/nightly_sweep.run_on_tag(recipe, latest)`:
`abra.recipe_checkout(recipe, <latest-tag>)` then `run_recipe_ci.py` with
`RECIPE=<r> CCCI_SKIP_FETCH=1` and REF/QUICK/MODE/VERSION unset (cold, full, head==tag).
Isolation = run ONE recipe at a time with NO concurrent sweep load on the single node (the loaded
node is the known flake source per phase plan §2.1). Runs execute on cc-ci from `/etc/cc-ci`.

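
The isolation setup described above can be sketched as a small helper that builds the environment for one cold run. This is a sketch under assumptions, not the harness itself: `isolated_env` and `run_isolated` are hypothetical names, and starting `run_recipe_ci.py` with `python3` from `/etc/cc-ci` is inferred from these notes rather than confirmed. The `abra.recipe_checkout` step is not reproduced here.

```python
import os
import subprocess

# Variables the notes say must be unset for a cold, full, head==tag run.
UNSET_FOR_COLD_RUN = ("REF", "QUICK", "MODE", "VERSION")


def isolated_env(recipe, base_env=None):
    """Return a copy of the environment configured for one isolated recipe run."""
    env = dict(os.environ if base_env is None else base_env)
    for name in UNSET_FOR_COLD_RUN:
        env.pop(name, None)  # drop the variable if present, ignore if absent
    env["RECIPE"] = recipe
    env["CCCI_SKIP_FETCH"] = "1"
    return env


def run_isolated(recipe, workdir="/etc/cc-ci"):
    """Run one recipe alone and return the exit code.

    ASSUMPTION: the script is invoked as `python3 run_recipe_ci.py` relative
    to `workdir`; adjust to the real entry point.
    """
    result = subprocess.run(
        ["python3", "run_recipe_ci.py"],
        cwd=workdir,
        env=isolated_env(recipe),
        check=False,  # a failing recipe is an expected outcome, not an exception
    )
    return result.returncode
```

Calling `run_isolated("mumble")` only while no sweep is in flight keeps the run free of the concurrent load that the phase plan names as the flake source.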
### Starting canonical state (cc-ci `/var/lib/ci-warm/<r>/canonical.json`, read 2026-06-17T23:19Z)

| Recipe | Canonical now | Note |
|---|---|---|
| discourse | (none) | no canonical dir |
| mattermost-lts | (none) | no canonical dir |
| mumble | `1.0.0+v1.6.870-0` @ 20260617T180501Z | **canonical PRESENT, written TODAY** — flake signal |
| bluesky-pds | (none) | no canonical dir |
| gitea | `3.5.3+1.24.2-rootless` @ 20260617T083930Z | 3.6.0 advance not promoted |
| keycloak | (none) | de-enrolled (WARM_CANONICAL off) |

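
A snapshot like the table above could be collected with a short script. The following is a minimal sketch: the path layout `/var/lib/ci-warm/<r>/canonical.json` comes from the heading, while `read_canonical` and `canonical_snapshot` are hypothetical names. The schema of `canonical.json` is not stated in these notes, so the sketch returns the parsed JSON untouched and does not assume any field names.

```python
import json
from pathlib import Path

RECIPES = ("discourse", "mattermost-lts", "mumble", "bluesky-pds", "gitea", "keycloak")


def read_canonical(recipe, warm_root="/var/lib/ci-warm"):
    """Return the parsed canonical.json for a recipe, or None if it is absent."""
    path = Path(warm_root) / recipe / "canonical.json"
    if not path.is_file():
        return None  # covers both a missing directory and a missing file
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def canonical_snapshot(recipes=RECIPES, warm_root="/var/lib/ci-warm"):
    """Map each recipe name to its canonical record (None means no canonical)."""
    return {recipe: read_canonical(recipe, warm_root) for recipe in recipes}
```

Recording the snapshot before the isolation runs gives a baseline to compare against once fixes are verified.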
### M1 investigation tracker

| Recipe | Isolation run | Result | Root cause | Classification |
|---|---|---|---|---|
| discourse | DONE @23:40Z (`/tmp/redfix-discourse.log` on cc-ci) | install/backup/restore/custom PASS; **upgrade overlay FAIL**. Deploys+serves fine — NOT a timeout/FATA. | cc-ci overlay `tests/discourse/test_upgrade.py` asserts head runs official `discourse/discourse:3.5.3` + drops sidekiq; latest tag `0.8.1+3.5.0` AND main both still `bitnamilegacy/discourse:3.5.0`+sidekiq (migration exists in no release/main). The `depends_on discourse` string is a non-fatal prepull-only warning, not the deploy. | **stale/PR-specific cc-ci OVERLAY test** mismatched to canonical-sweep context (not flake/timeout/recipe-deploy/warm-machinery) |
| mattermost-lts | DONE @00:05Z (`/tmp/redfix-mattermost-lts.log`) | install/upgrade/backup/custom PASS; **restore FAIL** `ci_marker does not exist` — **deterministic in isolation** (not a load race) | recipe `postgres` svc backup labels: backs up hot live PGDATA + dump but has **NO `backupbot.restore.post-hook`** to replay the dump → restore doesn't round-trip postgres. Contrast immich (passes): dump-only `backup.volumes.postgres.path: backup.sql` + `restore.post-hook: /pg_backup.sh restore`. | **genuine RECIPE defect** at latest → recipe PR (adopt immich-style dump+restore-post-hook) |
| mumble | running (isolation) | — | — | — |
| bluesky-pds | pending | — | — | — |
| gitea | pending | — | — | — |
| keycloak | pending | — | — | — |

Gate: M1 not yet claimed.

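
The tracker and its gate can also be modelled as data, which makes "not yet claimed" mechanically checkable. This is a sketch under assumptions: the two axes (flake vs genuine; recipe vs test vs warm-machinery vs load) come from the mission statement, but the type names, the mapping of the discourse row to `GENUINE`/`TEST`, and the rule that M1 is claimable only once every recipe is classified are illustrative, not taken from the phase plan.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Nature(Enum):
    FLAKE = "flake"
    GENUINE = "genuine"


class Origin(Enum):
    RECIPE = "recipe"
    TEST = "test"
    WARM_MACHINERY = "warm-machinery"
    LOAD = "load"


@dataclass
class TrackerEntry:
    recipe: str
    nature: Optional[Nature] = None  # None until the isolation run is analysed
    origin: Optional[Origin] = None

    @property
    def classified(self):
        return self.nature is not None and self.origin is not None


def m1_gate_open(entries):
    """ASSUMED gate rule: every tracked recipe has a classification."""
    return bool(entries) and all(entry.classified for entry in entries)


tracker = [
    TrackerEntry("discourse", Nature.GENUINE, Origin.TEST),  # stale overlay test
    TrackerEntry("mattermost-lts", Nature.GENUINE, Origin.RECIPE),
    TrackerEntry("mumble"),
    TrackerEntry("bluesky-pds"),
    TrackerEntry("gitea"),
    TrackerEntry("keycloak"),
]
pending = [entry.recipe for entry in tracker if not entry.classified]
```

With two of six recipes classified, `m1_gate_open(tracker)` is false and `pending` lists the four recipes still to be investigated.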
## Blocked

(none)