journal(2): Q4.1 matrix register-500 root cause (restore DROP DATABASE FORCE closes synapse DB pool) + readiness-retry fix

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-30 01:01:09 +01:00
parent 9a8850affa
commit b73018c9ab

View File

@ -1231,3 +1231,28 @@ rate-limit cools (deploy converges) — preserve the log. **discourse** + **dron
NEXT unblocked unit (when node free): pick a recipe and take it to a claim. Suggest order by ease:
matrix-synapse (overlay-complete → just run+claim) → bluesky-pds P4 overlay → mattermost-lts P4 →
ghost (P4 + §4.3 create-post) → uptime-kuma (P4 + Socket.IO §4.3). Keep heavy deploys sequential.
---
## 2026-05-30T~23:59 — Q4.1 matrix-synapse: post-restore register-500 root cause + fix; CLAIMED
First full run: install/upgrade/backup/restore green but custom `test_register_two_users_send_receive_
message` FAILED — synapse `HTTP 500 M_UNKNOWN` on the shared-secret admin register POST (nonce GET 200,
so endpoint enabled). A fresh `STAGES=install,custom` reproduce PASSED → not deterministic; the
differentiator is the FULL lifecycle's tier order (custom runs right after restore).
**Root cause (PROVEN via synapse log capture `/root/matrix-synapse-debug.log`):** the restore tier
runs pg_backup.sh `restore` = `DROP DATABASE … WITH (FORCE)` + recreate + reimport. The FORCE drop
**terminates synapse's live postgres connections** (`server closed the connection unexpectedly` /
`psycopg2.InterfaceError: connection already closed` at the restore timestamp). For a few seconds
synapse is re-establishing its connection pool; a registration is a DB *write*, so it 500s — while
HTTP health (`/_matrix/client/versions`, a read) is already green. A classic "health-green but not
write-ready after restore" window.
**Fix (NOT a weakening — readiness robustness per plan §4.2/§9):** `_admin_register` now polls —
re-fetch a fresh nonce + re-POST on 5xx/transport-error, ≤90s, then RAISE; a 4xx (real rejection) is
fail-fast. The asserted behaviour is identical (two users register + send/receive a message); only the
bounded post-restore recovery window is tolerated, and it logs each retry so the transient is visible.
Validated: full run 2 (`/root/ccci-matrix-full2.log`) GREEN — `[register] …: POST transient 500
(attempt 1) → succeeded (attempt 2)`, all 5 tiers pass, deploy-count=1, clean teardown. Claimed
`9a8850a`. (This is a general pattern other DB-write functional tests may need after the restore tier;
noted for the remaining recipes.)