From 8df74d7bc01b5b73e87056ff2001bd2019872979 Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Wed, 17 Jun 2026 23:41:29 +0000
Subject: [PATCH] =?UTF-8?q?journal(redfix):=20M1=20mattermost-lts=20isolat?=
 =?UTF-8?q?ion=20=E2=80=94=20DETERMINISTIC=20restore=20fail;=20genuine=20r?=
 =?UTF-8?q?ecipe=20defect=20(no=20restore.post-hook=20vs=20immich)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 machine-docs/JOURNAL-redfix.md | 32 ++++++++++++++++++++++++++++++++
 machine-docs/STATUS-redfix.md  |  4 ++--
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/machine-docs/JOURNAL-redfix.md b/machine-docs/JOURNAL-redfix.md
index 62f655f..ed72eb9 100644
--- a/machine-docs/JOURNAL-redfix.md
+++ b/machine-docs/JOURNAL-redfix.md
@@ -63,3 +63,35 @@ Classification: **stale/PR-specific cc-ci OVERLAY test mismatched to the canonic
 (NOT a flake, NOT a load timeout, NOT a recipe-deploy defect, NOT warm-machinery).
 Teardown clean (no discourse stack left). Evidence: `/tmp/redfix-discourse.log` on cc-ci; junit under
 `/var/lib/cc-ci-runs/manual/junit/upgrade__cc-ci__test_upgrade.xml`.
+
+## 2026-06-18T00:05Z — M1: mattermost-lts isolation run — DETERMINISTIC restore failure (recipe defect)
+
+Ran mattermost-lts ALONE (tag 2.1.9+10.11.15, log /tmp/redfix-mattermost-lts.log).
+RESULT: **install/upgrade/backup/custom PASS, restore FAIL** — identical to the canon failure:
+`tests/mattermost-lts/test_restore.py::test_restore_returns_state` → `relation "ci_marker" does not
+exist` after restore. So it is **deterministic in isolation, NOT a loaded-node race** (canon framing
+was wrong). The marker logic is sound (postgres table seeded pre-backup, dropped pre-restore, asserted
+post-restore — same pattern immich uses and PASSES).
+
+ROOT CAUSE (recipe backup/restore labels). Compared mattermost-lts vs immich (immich passes the
+IDENTICAL test):
+- immich `database` svc: `backupbot.backup.pre-hook: /pg_backup.sh backup`,
+ `backupbot.backup.volumes.postgres.path: backup.sql` (backs up ONLY the dump file), and
+ **`backupbot.restore.post-hook: /pg_backup.sh restore`** (replays the dump on restore). → round-trips.
+- mattermost-lts `postgres` svc: `pre-hook: pg_dump > /var/lib/postgresql/data/postgres-backup.sql`,
+ `backup.path: /var/lib/postgresql/data/` (backs up the WHOLE live/hot PGDATA dir + the dump),
+ `post-hook: rm .../postgres-backup.sql`, and **NO `backupbot.restore.post-hook`**. So on restore,
+ abra restores the files but NOTHING replays the dump, and a hot-copied live PGDATA over a running
+ postgres does not reload → `ci_marker` lost. Restore log confirms `Restoring Snapshot b0495d36 at /`
+ with no post-hook reimport.
+
+Classification: **GENUINE RECIPE DEFECT at latest** (postgres backup/restore does not round-trip —
+missing restore post-hook + backs up hot PGDATA instead of dump-only). NOT a flake, NOT cc-ci test
+weakening (test is correct & unmodified; immich proves the pattern works). Fix (M2) = recipe PR
+adopting the immich-style postgres backup/restore (a `/pg_backup.sh`-style dump + restore post-hook).
+Teardown clean (no matt stack). Evidence: /tmp/redfix-mattermost-lts.log; junit
+restore__cc-ci__test_restore.xml.
+
+Tooling note: my background "waiter" loop `while pgrep -f run_recipe_ci.py` self-matched (its own
+cmdline contains the string) → never exited, falsely showed a run active. Use `pgrep -f
+"[r]un_recipe_ci.py"` or match the python invocation. Killed the stuck waiters; node confirmed free.
diff --git a/machine-docs/STATUS-redfix.md b/machine-docs/STATUS-redfix.md
index 5c4a0e0..520d93c 100644
--- a/machine-docs/STATUS-redfix.md
+++ b/machine-docs/STATUS-redfix.md
@@ -36,8 +36,8 @@ flake source per phase plan §2.1). Runs execute on cc-ci from `/etc/cc-ci`.
 | Recipe | Isolation run | Result | Root cause | Classification |
 |---|---|---|---|---|
 | discourse | DONE @23:40Z (`/tmp/redfix-discourse.log` on cc-ci) | install/backup/restore/custom PASS; **upgrade overlay FAIL**. Deploys+serves fine — NOT a timeout/FATA. | cc-ci overlay `tests/discourse/test_upgrade.py` asserts head runs official `discourse/discourse:3.5.3` + drops sidekiq; latest tag `0.8.1+3.5.0` AND main both still `bitnamilegacy/discourse:3.5.0`+sidekiq (migration exists in no release/main). The `depends_on discourse` string is a non-fatal prepull-only warning, not the deploy. | **stale/PR-specific cc-ci OVERLAY test** mismatched to canonical-sweep context (not flake/timeout/recipe-deploy/warm-machinery) |
-| mattermost-lts | running (isolation) | — | — | — |
-| mumble | pending | — | — | — |
+| mattermost-lts | DONE @00:05Z (`/tmp/redfix-mattermost-lts.log`) | install/upgrade/backup/custom PASS; **restore FAIL** `ci_marker does not exist` — **deterministic in isolation** (not a load race) | recipe `postgres` svc backup labels: backs up hot live PGDATA + dump but has **NO `backupbot.restore.post-hook`** to replay the dump → restore doesn't round-trip postgres. Contrast immich (passes): dump-only `backup.volumes.postgres.path: backup.sql` + `restore.post-hook: /pg_backup.sh restore`. | **genuine RECIPE defect** at latest → recipe PR (adopt immich-style dump+restore-post-hook) |
+| mumble | running (isolation) | — | — | — |
 | bluesky-pds | pending | — | — | — |
 | gitea | pending | — | — | — |
 | keycloak | pending | — | — | — |