journal(2): Q4.1 matrix register-500 root cause (restore DROP DATABASE FORCE closes synapse DB pool) + readiness-retry fix
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ -1231,3 +1231,28 @@ rate-limit cools (deploy converges) — preserve the log. **discourse** + **dron
NEXT unblocked unit (when node free): pick a recipe and take it to a claim. Suggest order by ease:
matrix-synapse (overlay-complete → just run+claim) → bluesky-pds P4 overlay → mattermost-lts P4 →
ghost (P4 + §4.3 create-post) → uptime-kuma (P4 + Socket.IO §4.3). Keep heavy deploys sequential.

---
## 2026-05-30T~23:59 — Q4.1 matrix-synapse: post-restore register-500 root cause + fix; CLAIMED
First full run: install/upgrade/backup/restore green, but custom
`test_register_two_users_send_receive_message` FAILED: synapse `HTTP 500 M_UNKNOWN` on the
shared-secret admin register POST (nonce GET 200, so endpoint enabled). A fresh
`STAGES=install,custom` repro PASSED → not a standalone test bug; the differentiator is the FULL
lifecycle's tier order (custom runs right after restore).

**Root cause (PROVEN via synapse log capture `/root/matrix-synapse-debug.log`):** the restore tier
runs pg_backup.sh `restore` = `DROP DATABASE … WITH (FORCE)` + recreate + reimport. The FORCE drop
**terminates synapse's live postgres connections** (`server closed the connection unexpectedly` /
`psycopg2.InterfaceError: connection already closed` at the restore timestamp). For a few seconds
synapse is re-establishing its connection pool; a registration is a DB *write*, so it 500s, while
HTTP health (`/_matrix/client/versions`, a read) is already green. A classic "health-green but not
write-ready after restore" window.

**Fix (NOT a weakening; readiness robustness per plan §4.2/§9):** `_admin_register` now polls:
re-fetch a fresh nonce + re-POST on 5xx/transport-error, ≤90s, then RAISE; a 4xx (real rejection) is
fail-fast. The asserted behaviour is identical (two users register + send/receive a message); only the
bounded post-restore recovery window is tolerated, and it logs each retry so the transient is visible.
Validated: full run 2 (`/root/ccci-matrix-full2.log`) GREEN:
`[register] …: POST transient 500 (attempt 1) → succeeded (attempt 2)`, all 5 tiers pass,
deploy-count=1, clean teardown. Claimed `9a8850a`. (This is a general pattern other DB-write
functional tests may need after the restore tier; noted for the remaining recipes.)
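Sketch of the retry policy described in the Fix paragraph, for reuse in the remaining recipes. This is NOT the actual `_admin_register` code (not shown in this journal); `register_with_retry`, `RegisterFailed`, the `attempt` callback and the 2s interval are all hypothetical names/values chosen for illustration. Assumption: `attempt()` performs one full cycle (fetch a fresh nonce, then POST) and returns `(status_code, body)`, or raises `OSError` on a transport failure.

```python
import time
from typing import Callable, Tuple


class RegisterFailed(RuntimeError):
    """Raised on a non-retryable rejection or when the retry window runs out."""


def register_with_retry(
    attempt: Callable[[], Tuple[int, str]],
    window_s: float = 90.0,      # bounded recovery window (<=90s per the journal)
    interval_s: float = 2.0,     # assumed pause between attempts
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> str:
    """Retry attempt() on 5xx/transport errors only; fail fast on anything else."""
    deadline = clock() + window_s
    n = 0
    while True:
        n += 1
        try:
            status, body = attempt()
        except OSError as exc:
            # Connection refused/reset, timeouts: treated as transient.
            reason = f"transport error: {exc}"
        else:
            if 200 <= status < 300:
                if n > 1:
                    log(f"[register] succeeded (attempt {n})")
                return body
            if status < 500:
                # 4xx (or any other non-5xx) is a real rejection: no retry.
                raise RegisterFailed(f"rejected with HTTP {status}: {body}")
            reason = f"POST transient {status}"
        if clock() >= deadline:
            raise RegisterFailed(f"still failing after {window_s}s: {reason}")
        # Log every retry so the transient stays visible in the run log.
        log(f"[register] {reason} (attempt {n}), retrying")
        sleep(interval_s)
```

The clock, sleep and log hooks are injectable only so the policy can be unit-tested without a live synapse or real waiting; a recipe would pass just `attempt`. The assertion after the call stays unchanged, so the test is no weaker: only the wait before the write succeeds is tolerated.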