journal(redfix): M1 mattermost-lts isolation — DETERMINISTIC restore fail; genuine recipe defect (no restore.post-hook vs immich)
Some checks failed
continuous-integration/drone/push Build is failing

This commit is contained in:
autonomic-bot
2026-06-17 23:41:29 +00:00
parent 23b439db83
commit 8df74d7bc0
2 changed files with 34 additions and 2 deletions

View File

@ -63,3 +63,35 @@ Classification: **stale/PR-specific cc-ci OVERLAY test mismatched to the canonic
(NOT a flake, NOT a load timeout, NOT a recipe-deploy defect, NOT warm-machinery). Teardown clean (no
discourse stack left). Evidence: `/tmp/redfix-discourse.log` on cc-ci; junit under
`/var/lib/cc-ci-runs/manual/junit/upgrade__cc-ci__test_upgrade.xml`.
## 2026-06-18T00:05Z — M1: mattermost-lts isolation run — DETERMINISTIC restore failure (recipe defect)
Ran mattermost-lts ALONE (tag 2.1.9+10.11.15, log /tmp/redfix-mattermost-lts.log).
RESULT: **install/upgrade/backup/custom PASS, restore FAIL** — identical to the canon failure:
`tests/mattermost-lts/test_restore.py::test_restore_returns_state``relation "ci_marker" does not
exist` after restore. So it is **deterministic in isolation, NOT a loaded-node race** (canon framing
was wrong). The marker logic is sound (postgres table seeded pre-backup, dropped pre-restore, asserted
post-restore — same pattern immich uses and PASSES).
ROOT CAUSE (recipe backup/restore labels). Compared mattermost-lts vs immich (immich passes the
IDENTICAL test):
- immich `database` svc: `backupbot.backup.pre-hook: /pg_backup.sh backup`,
`backupbot.backup.volumes.postgres.path: backup.sql` (backs up ONLY the dump file), and
**`backupbot.restore.post-hook: /pg_backup.sh restore`** (replays the dump on restore). → round-trips.
- mattermost-lts `postgres` svc: `pre-hook: pg_dump > /var/lib/postgresql/data/postgres-backup.sql`,
`backup.path: /var/lib/postgresql/data/` (backs up the WHOLE live/hot PGDATA dir + the dump),
`post-hook: rm .../postgres-backup.sql`, and **NO `backupbot.restore.post-hook`**. So on restore,
abra restores the files but NOTHING replays the dump, and a hot-copied live PGDATA over a running
postgres does not reload → `ci_marker` lost. Restore log confirms `Restoring Snapshot b0495d36 at /`
with no post-hook reimport.
Classification: **GENUINE RECIPE DEFECT at latest** (postgres backup/restore does not round-trip —
missing restore post-hook + backs up hot PGDATA instead of dump-only). NOT a flake, NOT cc-ci test
weakening (test is correct & unmodified; immich proves the pattern works). Fix (M2) = recipe PR
adopting the immich-style postgres backup/restore (a `/pg_backup.sh`-style dump + restore post-hook).
Teardown clean (no matt stack). Evidence: /tmp/redfix-mattermost-lts.log; junit
restore__cc-ci__test_restore.xml.
Tooling note: my background "waiter" loop `while pgrep -f run_recipe_ci.py` self-matched (its own
cmdline contains the string) → never exited, falsely showed a run active. Use `pgrep -f
"[r]un_recipe_ci.py"` or match the python invocation. Killed the stuck waiters; node confirmed free.

View File

@ -36,8 +36,8 @@ flake source per phase plan §2.1). Runs execute on cc-ci from `/etc/cc-ci`.
| Recipe | Isolation run | Result | Root cause | Classification |
|---|---|---|---|---|
| discourse | DONE @23:40Z (`/tmp/redfix-discourse.log` on cc-ci) | install/backup/restore/custom PASS; **upgrade overlay FAIL**. Deploys+serves fine — NOT a timeout/FATA. | cc-ci overlay `tests/discourse/test_upgrade.py` asserts head runs official `discourse/discourse:3.5.3` + drops sidekiq; latest tag `0.8.1+3.5.0` AND main both still `bitnamilegacy/discourse:3.5.0`+sidekiq (migration exists in no release/main). The `depends_on discourse` string is a non-fatal prepull-only warning, not the deploy. | **stale/PR-specific cc-ci OVERLAY test** mismatched to canonical-sweep context (not flake/timeout/recipe-deploy/warm-machinery) |
| mattermost-lts | running (isolation) | — | — | — |
| mumble | pending | — | — | — |
| mattermost-lts | DONE @00:05Z (`/tmp/redfix-mattermost-lts.log`) | install/upgrade/backup/custom PASS; **restore FAIL** `ci_marker does not exist`**deterministic in isolation** (not a load race) | recipe `postgres` svc backup labels: backs up hot live PGDATA + dump but has **NO `backupbot.restore.post-hook`** to replay the dump → restore doesn't round-trip postgres. Contrast immich (passes): dump-only `backup.volumes.postgres.path: backup.sql` + `restore.post-hook: /pg_backup.sh restore`. | **genuine RECIPE defect** at latest → recipe PR (adopt immich-style dump+restore-post-hook) |
| mumble | running (isolation) | — | — | — |
| bluesky-pds | pending | — | — | — |
| gitea | pending | — | — | — |
| keycloak | pending | — | — | — |