journal(redfix): M1 mattermost-lts isolation — DETERMINISTIC restore fail; genuine recipe defect (no restore.post-hook vs immich)
Some checks failed
continuous-integration/drone/push Build is failing
@ -63,3 +63,35 @@ Classification: **stale/PR-specific cc-ci OVERLAY test mismatched to the canonic
(NOT a flake, NOT a load timeout, NOT a recipe-deploy defect, NOT warm-machinery). Teardown clean (no
discourse stack left). Evidence: `/tmp/redfix-discourse.log` on cc-ci; junit under
`/var/lib/cc-ci-runs/manual/junit/upgrade__cc-ci__test_upgrade.xml`.

## 2026-06-18T00:05Z — M1: mattermost-lts isolation run — DETERMINISTIC restore failure (recipe defect)
Ran mattermost-lts ALONE (tag 2.1.9+10.11.15, log `/tmp/redfix-mattermost-lts.log`).
RESULT: **install/upgrade/backup/custom PASS, restore FAIL**, identical to the canon failure:
`tests/mattermost-lts/test_restore.py::test_restore_returns_state` →
`relation "ci_marker" does not exist` after restore. So it is **deterministic in isolation, NOT a
loaded-node race** (canon framing was wrong). The marker logic is sound (postgres table seeded
pre-backup, dropped pre-restore, asserted post-restore; immich uses the same pattern and PASSES).

ROOT CAUSE (recipe backup/restore labels). Compared mattermost-lts vs immich (immich passes the
IDENTICAL test):
- immich `database` svc: `backupbot.backup.pre-hook: /pg_backup.sh backup`,
  `backupbot.backup.volumes.postgres.path: backup.sql` (backs up ONLY the dump file), and
  **`backupbot.restore.post-hook: /pg_backup.sh restore`** (replays the dump on restore). → round-trips.
- mattermost-lts `postgres` svc: `pre-hook: pg_dump > /var/lib/postgresql/data/postgres-backup.sql`,
  `backup.path: /var/lib/postgresql/data/` (backs up the WHOLE live/hot PGDATA dir + the dump),
  `post-hook: rm .../postgres-backup.sql`, and **NO `backupbot.restore.post-hook`**. So on restore,
  abra restores the files but NOTHING replays the dump, and a hot-copied live PGDATA written over a
  running postgres is not reloaded by the server → `ci_marker` lost. Restore log confirms
  `Restoring Snapshot b0495d36 at /` with no post-hook reimport.
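
The label comparison above can be turned into a small check, sketched here in Python. This is an illustration only, not part of cc-ci: the label values are transcribed from the comparison, the full `backupbot.backup.*` key names on the mattermost-lts side are an assumption (the journal abbreviates them), and `roundtrip_gaps` is a hypothetical helper name.

```python
# Label sets transcribed from the journal comparison. The full key names on the
# mattermost-lts side are an ASSUMPTION; the elided `rm .../` path is kept as written.
IMMICH_DATABASE = {
    "backupbot.backup.pre-hook": "/pg_backup.sh backup",
    "backupbot.backup.volumes.postgres.path": "backup.sql",
    "backupbot.restore.post-hook": "/pg_backup.sh restore",
}
MATTERMOST_POSTGRES = {
    "backupbot.backup.pre-hook": "pg_dump > /var/lib/postgresql/data/postgres-backup.sql",
    "backupbot.backup.path": "/var/lib/postgresql/data/",
    "backupbot.backup.post-hook": "rm .../postgres-backup.sql",
}

RESTORE_HOOK = "backupbot.restore.post-hook"


def roundtrip_gaps(labels):
    """Return reasons why a dump-based postgres backup may not round-trip."""
    gaps = []
    if RESTORE_HOOK not in labels:
        gaps.append("no restore post-hook: nothing replays the dump")
    for key, value in labels.items():
        # Hypothetical heuristic: a backed-up path ending in "/" is a whole directory.
        if key.endswith(".path") and value.endswith("/"):
            gaps.append(f"{key} backs up a whole directory: {value}")
    return gaps


if __name__ == "__main__":
    for name, labels in [("immich/database", IMMICH_DATABASE),
                         ("mattermost-lts/postgres", MATTERMOST_POSTGRES)]:
        print(name, roundtrip_gaps(labels) or "OK")
```

Under these assumptions the immich label set reports no gaps, while the mattermost-lts set reports both the missing restore post-hook and the whole-directory backup path.
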
Classification: **GENUINE RECIPE DEFECT at latest** (postgres backup/restore does not round-trip:
missing restore post-hook + backs up hot PGDATA instead of dump-only). NOT a flake, NOT cc-ci test
weakening (test is correct & unmodified; immich proves the pattern works). Fix (M2) = recipe PR
adopting the immich-style postgres backup/restore (a `/pg_backup.sh`-style dump + restore post-hook).
Teardown clean (no matt stack). Evidence: `/tmp/redfix-mattermost-lts.log`; junit
`restore__cc-ci__test_restore.xml`.

Tooling note: my background "waiter" loop `while pgrep -f run_recipe_ci.py` self-matched (its own
cmdline contains the string) → never exited, falsely showed a run active. Use
`pgrep -f "[r]un_recipe_ci.py"` or match the python invocation. Killed the stuck waiters; node
confirmed free.
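
Why the bracket trick in the tooling note works can be shown with a rough Python model of `pgrep -f`, which matches a regular expression against each process's full command line. This is a sketch only: `pgrep_f_matches` is a hypothetical helper, the three command lines are made-up examples, and Python's `re` only approximates the POSIX extended regex that pgrep uses.

```python
import re


def pgrep_f_matches(pattern, cmdline):
    """Approximate `pgrep -f`: regex search over the full command line."""
    return re.search(pattern, cmdline) is not None


NAIVE = "run_recipe_ci.py"
SAFE = "[r]un_recipe_ci.py"

# Hypothetical command lines, for illustration only.
runner = "python3 run_recipe_ci.py mattermost-lts"
naive_waiter = "bash -c 'while pgrep -f run_recipe_ci.py; do sleep 5; done'"
safe_waiter = "bash -c 'while pgrep -f \"[r]un_recipe_ci.py\"; do sleep 5; done'"

if __name__ == "__main__":
    # The naive pattern also matches the waiter's own shell, so the loop never ends.
    print("naive matches its own waiter:", pgrep_f_matches(NAIVE, naive_waiter))
    # With the bracket, the waiter's cmdline contains "[r]un..." and never "run...".
    print("safe matches its own waiter: ", pgrep_f_matches(SAFE, safe_waiter))
    # The real runner is still found.
    print("safe matches the runner:     ", pgrep_f_matches(SAFE, runner))
```

The pattern `[r]un_recipe_ci.py` still matches the literal text `run_recipe_ci.py`, but the waiter's own command line now contains `[r]un_recipe_ci.py`, where `r` is followed by `]` rather than `u`, so the waiter no longer matches itself.
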
@ -36,8 +36,8 @@ flake source per phase plan §2.1). Runs execute on cc-ci from `/etc/cc-ci`.
| Recipe | Isolation run | Result | Root cause | Classification |
|---|---|---|---|---|
| discourse | DONE @23:40Z (`/tmp/redfix-discourse.log` on cc-ci) | install/backup/restore/custom PASS; **upgrade overlay FAIL**. Deploys+serves fine — NOT a timeout/FATA. | cc-ci overlay `tests/discourse/test_upgrade.py` asserts head runs official `discourse/discourse:3.5.3` + drops sidekiq; latest tag `0.8.1+3.5.0` AND main both still `bitnamilegacy/discourse:3.5.0`+sidekiq (migration exists in no release/main). The `depends_on discourse` string is a non-fatal prepull-only warning, not the deploy. | **stale/PR-specific cc-ci OVERLAY test** mismatched to canonical-sweep context (not flake/timeout/recipe-deploy/warm-machinery) |
| mattermost-lts | running (isolation) | — | — | — |
| mumble | pending | — | — | — |
| mattermost-lts | DONE @00:05Z (`/tmp/redfix-mattermost-lts.log`) | install/upgrade/backup/custom PASS; **restore FAIL** `ci_marker does not exist` — **deterministic in isolation** (not a load race) | recipe `postgres` svc backup labels: backs up hot live PGDATA + dump but has **NO `backupbot.restore.post-hook`** to replay the dump → restore doesn't round-trip postgres. Contrast immich (passes): dump-only `backup.volumes.postgres.path: backup.sql` + `restore.post-hook: /pg_backup.sh restore`. | **genuine RECIPE defect** at latest → recipe PR (adopt immich-style dump+restore-post-hook) |
| mumble | running (isolation) | — | — | — |
| bluesky-pds | pending | — | — | — |
| gitea | pending | — | — | — |
| keycloak | pending | — | — | — |