# JOURNAL — phase `redfix`

## 2026-06-17T23:20Z — Bootstrap

Read phase plan + plan.md §6.1/§7/§9 + canon DECISIONS exceptions (lines ~1494–1552). Six
canon-sweep failures to investigate. Confirmed cc-ci access, no run in flight, sweep timer next
fires 2026-06-21 (3-day window), disk 38G free.

Isolation mechanism understood: `runner/nightly_sweep.run_on_tag` = `abra.recipe_checkout(r, tag)` +
`run_recipe_ci.py RECIPE=<r> CCCI_SKIP_FETCH=1` cold/full. I reproduce each failure by running ONE
recipe at a time with no concurrent load.

Starting canonical state notable: **mumble canonical IS present** (`1.0.0+v1.6.870-0`, written
20260617T180501Z, during today's nixenv sweep). The canon DECISIONS recorded mumble RED
(`test_handshake_completes_with_channel_presence`). A canonical only gets written on a GREEN cold run
on latest → mumble flipped green in a recent run. Strong early evidence for the operator's "mumble
passed before" → load flake hypothesis. Must confirm with a clean isolation re-run + check whether the
canon-sweep red was under concurrent load.

Next: start M1 investigation. Plan order (cheap/informative first): triage the existing sweep logs on
cc-ci to pin the EXACT assertion/error for each (mumble, mattermost-lts restore, gitea app.ini,
bluesky routing, discourse compose), then run isolation re-runs. discourse's recorded cause is an
UPSTREAM compose defect (`sidekiq.depends_on: discourse` while service is `app`) that FATAs before any
deploy. That is deterministic, not a load timeout, so it may not even need a long isolation run to
confirm; verify the compose at the latest tag directly first.

## 2026-06-17T23:40Z — M1: discourse isolation run — CANON ROOT-CAUSE WAS WRONG

Ran discourse ALONE on cc-ci (`recipe_checkout discourse 0.8.1+3.5.0` + `RECIPE=discourse
CCCI_SKIP_FETCH=1 cc-ci-run runner/run_recipe_ci.py`, log `/tmp/redfix-discourse.log`).

RESULT: **install PASS, upgrade FAIL, backup PASS, restore PASS, custom PASS**. The recipe deploys,
serves (200 /srv/status), backs up and restores cleanly. NOT a deploy timeout, NOT a 51-min wedge,
NOT a deploy FATA. The canon DECISIONS root-cause ("`abra app deploy` FATAs: service sidekiq depends
on undefined service discourse → invalid compose project") is **misattributed**: that string appears
ONLY from the non-fatal prepull `docker compose config --images` (rc=15, harness logs "skipping
(deploy will pull as usual)"). The real `abra app deploy` is a swarm `docker stack deploy`, which
ignores `depends_on` entirely → the stack converges (`UpdateStatus=completed`).
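
A minimal sketch of the mismatch behind the prepull error, assuming the compose file has already been parsed into a plain dict. The helper name `undefined_dependencies` and the sample data are illustrative and not part of cc-ci or abra. It lists every `depends_on` target that is not a defined service, which is the condition the prepull `docker compose config` step trips over:

```python
from __future__ import annotations


def undefined_dependencies(compose: dict) -> dict[str, list[str]]:
    """Map each service to the depends_on targets that are not defined services."""
    services = compose.get("services") or {}
    problems: dict[str, list[str]] = {}
    for name, spec in services.items():
        deps = (spec or {}).get("depends_on") or []
        # depends_on may be a list (short form) or a mapping (long form);
        # iterating either one yields the dependency names.
        missing = [dep for dep in deps if dep not in services]
        if missing:
            problems[name] = missing
    return problems


# Illustrative data mirroring the case above: sidekiq depends on "discourse",
# but the service is actually named "app".
sample = {
    "services": {
        "app": {"image": "bitnamilegacy/discourse:3.5.0"},
        "db": {},
        "redis": {},
        "sidekiq": {"depends_on": ["discourse"]},
    }
}
print(undefined_dependencies(sample))  # → {'sidekiq': ['discourse']}
```

A non-empty result here explains a compose-validation error without implying anything about whether the swarm deploy itself fails.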

The ONLY failure is the cc-ci upgrade OVERLAY `tests/discourse/test_upgrade.py`:
- `test_head_runs_official_image_not_bitnamilegacy`: app image is `bitnamilegacy/discourse:3.5.0`;
  test demands `discourse/discourse:3.5.3` (official).
- `test_sidekiq_service_dropped_by_head`: services `['app','db','redis','sidekiq']`; test demands
  sidekiq dropped.

These `prevb`-phase overlay tests are PR-FAITHFULNESS assertions for a specific migration PR
(bitnamilegacy → official `discourse/discourse:3.5.3`, drop sidekiq). Verified that migration exists
in **NO upstream release tag and NOT in main**: `git show main:compose.yml` and every tag
(`0.1.0…0.8.1+3.5.0`) all use `bitnamilegacy/discourse:3.5.0` + sidekiq. So the overlay asserts a
state that doesn't exist anywhere upstream → deterministic RED whenever the sweep tests the latest
release tag. The head DID deploy (chaos-version label = head f87c612d+U, converged); the test
expectation is simply wrong for the released recipe.
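
The per-ref check above could be scripted roughly as follows. This is a sketch, not the commands actually run: the function names are hypothetical, the regex only handles simple single-line `image:` entries, and it assumes `git` is on PATH and the recipe repo is checked out locally.

```python
from __future__ import annotations

import re
import subprocess

# Matches simple `image: value` lines, with optional surrounding quotes.
IMAGE_RE = re.compile(r"^[ \t]*image:[ \t]*[\"']?([^\s\"'#]+)", re.MULTILINE)


def images_in_compose(text: str) -> list[str]:
    """Return every `image:` value found in a compose file's text."""
    return IMAGE_RE.findall(text)


def images_by_ref(repo: str, refs: list[str], path: str = "compose.yml") -> dict:
    """Read `path` at each git ref (tag or branch) and collect its images."""
    found = {}
    for ref in refs:
        result = subprocess.run(
            ["git", "-C", repo, "show", f"{ref}:{path}"],
            capture_output=True, text=True, check=False,
        )
        # A ref that lacks the file exits non-zero; record that as None.
        found[ref] = images_in_compose(result.stdout) if result.returncode == 0 else None
    return found


# Illustrative compose text, not copied from the recipe.
sample = """
services:
  app:
    image: bitnamilegacy/discourse:3.5.0
  sidekiq:
    image: "bitnamilegacy/discourse:3.5.0"
"""
print(images_in_compose(sample))
```

Calling `images_by_ref(".", ["main", "0.8.1+3.5.0"])` from inside the recipe checkout would then show, per ref, whether any `discourse/discourse` image appears.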

Note (M2 design): migrating discourse from the deprecated `bitnamilegacy` image to official
`discourse/discourse` is a MAJOR recipe rewrite (different fs layout, entrypoint, no `/opt/bitnami`
sidekiq run.sh), not a 1-line image swap. So the overlay test's `discourse/discourse:3.5.3`
expectation may not be a realistic near-term recipe change. The bitnamilegacy deprecation is real
(bitnami sunset legacy images), so a migration is the right long-term direction, but the test as
written hard-codes a migration target absent upstream. Classification + fix approach to settle in M1
table / M2.

Classification: **stale/PR-specific cc-ci OVERLAY test mismatched to the canonical-sweep context**
(NOT a flake, NOT a load timeout, NOT a recipe-deploy defect, NOT warm-machinery). Teardown clean (no
discourse stack left). Evidence: `/tmp/redfix-discourse.log` on cc-ci; junit under
`/var/lib/cc-ci-runs/manual/junit/upgrade__cc-ci__test_upgrade.xml`.

## 2026-06-18T00:05Z — M1: mattermost-lts isolation run — DETERMINISTIC restore failure (recipe defect)

Ran mattermost-lts ALONE (tag 2.1.9+10.11.15, log `/tmp/redfix-mattermost-lts.log`).
RESULT: **install/upgrade/backup/custom PASS, restore FAIL**, identical to the canon failure:
`tests/mattermost-lts/test_restore.py::test_restore_returns_state` → `relation "ci_marker" does not
exist` after restore. So it is **deterministic in isolation, NOT a loaded-node race** (canon framing
was wrong). The marker logic is sound (postgres table seeded pre-backup, dropped pre-restore, asserted
post-restore; same pattern immich uses and PASSES).

ROOT CAUSE (recipe backup/restore labels). Compared mattermost-lts vs immich (immich passes the
IDENTICAL test):
- immich `database` svc: `backupbot.backup.pre-hook: /pg_backup.sh backup`,
  `backupbot.backup.volumes.postgres.path: backup.sql` (backs up ONLY the dump file), and
  **`backupbot.restore.post-hook: /pg_backup.sh restore`** (replays the dump on restore). → round-trips.
- mattermost-lts `postgres` svc: `pre-hook: pg_dump > /var/lib/postgresql/data/postgres-backup.sql`,
  `backup.path: /var/lib/postgresql/data/` (backs up the WHOLE live/hot PGDATA dir + the dump),
  `post-hook: rm .../postgres-backup.sql`, and **NO `backupbot.restore.post-hook`**. So on restore,
  abra restores the files but NOTHING replays the dump, and a hot-copied live PGDATA over a running
  postgres does not reload → `ci_marker` lost. Restore log confirms `Restoring Snapshot b0495d36 at /`
  with no post-hook reimport.
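
The comparison above reduces to a small check. This is a sketch under stated assumptions: service labels are given as a plain dict, the function name and the sample label sets are illustrative, and the full `backupbot.backup.pre-hook` key for mattermost-lts is assumed because the notes abbreviate it. It does not reproduce backupbot's own logic.

```python
PRE_HOOK = "backupbot.backup.pre-hook"
RESTORE_POST_HOOK = "backupbot.restore.post-hook"


def dump_round_trips(labels: dict) -> bool:
    """True when a dump is taken at backup time AND replayed at restore time."""
    return bool(labels.get(PRE_HOOK)) and bool(labels.get(RESTORE_POST_HOOK))


# Label sets modelled on the two bullets above (values shortened, keys assumed).
immich_like = {
    "backupbot.backup.pre-hook": "/pg_backup.sh backup",
    "backupbot.backup.volumes.postgres.path": "backup.sql",
    "backupbot.restore.post-hook": "/pg_backup.sh restore",
}
mattermost_like = {
    # Dump is written at backup time, but nothing replays it on restore.
    "backupbot.backup.pre-hook": "pg_dump > /var/lib/postgresql/data/postgres-backup.sql",
}

print(dump_round_trips(immich_like))      # → True
print(dump_round_trips(mattermost_like))  # → False
```

A check like this only flags the missing restore hook; it says nothing about the separate problem of backing up a hot PGDATA directory.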

Classification: **GENUINE RECIPE DEFECT at latest** (postgres backup/restore does not round-trip:
missing restore post-hook + backs up hot PGDATA instead of dump-only). NOT a flake, NOT cc-ci test
weakening (test is correct & unmodified; immich proves the pattern works). Fix (M2) = recipe PR
adopting the immich-style postgres backup/restore (a `/pg_backup.sh`-style dump + restore post-hook).
Teardown clean (no matt stack). Evidence: `/tmp/redfix-mattermost-lts.log`; junit
`restore__cc-ci__test_restore.xml`.

Tooling note: my background "waiter" loop `while pgrep -f run_recipe_ci.py` self-matched (the waiter
shell's own cmdline contains the string) → it never exited and falsely showed a run as active. Use
`pgrep -f "[r]un_recipe_ci.py"` or match the python invocation. Killed the stuck waiters; node
confirmed free.
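
Why the bracket form avoids the self-match can be shown with plain regex searches. This is an approximation: `pgrep -f` uses POSIX extended regexes rather than Python's `re`, though the two agree for patterns this simple, and the command lines below are made-up stand-ins for the real processes.

```python
import re


def matches(pattern: str, cmdline: str) -> bool:
    """Mimic `pgrep -f`: regex search anywhere in the full command line."""
    return re.search(pattern, cmdline) is not None


# Hypothetical command lines for the real run and the two waiter variants.
target = "python runner/run_recipe_ci.py"
naive_waiter = "bash -c while pgrep -f run_recipe_ci.py; do sleep 5; done"
bracket_waiter = 'bash -c while pgrep -f "[r]un_recipe_ci.py"; do sleep 5; done'

print(matches("run_recipe_ci.py", naive_waiter))      # → True: the waiter matches itself
print(matches("[r]un_recipe_ci.py", bracket_waiter))  # → False: the literal "[r]" is not "r" + "un_..."
print(matches("[r]un_recipe_ci.py", target))          # → True: the real run still matches
```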