fix(2): Q4.2 mumble — TCP voice-server READY_PROBE gates backup past upgrade host-port churn

Diagnostic (RECIPE=mumble STAGES=install,backup,restore,custom, no upgrade) PROVED backup+restore green
on a stable 1.0.0 deploy, incl. ci_marker survival (P4). The full-run backup 409 ('container not
running') was caused by the chaos UPGRADE redeploy: host-mode port 64738 must be released by the old
task and rebound by the new one, and HEALTH_PATH '/' only proves the mumble-web sidecar (not the voice
server), so wait_healthy passed while the app churned → backup-bot exec'd into a not-running container.
Fix: extend lifecycle.wait_ready_probes to support a TCP probe ({tcp_host, tcp_port, stable=N
consecutive connects}); the mumble recipe_meta READY_PROBE returns port 64738 (stable=3), so the
harness waits for the voice server to be up after install AND upgrade before backup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 20:19:07 +01:00
parent 1890cb58f3
commit ec76072489
2 changed files with 45 additions and 1 deletions


@@ -36,3 +36,14 @@ EXTRA_ENV = {
    "WELCOME_TEXT": WELCOME_TEXT_MARKER,
    "USERS": str(MAX_USERS),
}


def READY_PROBE(domain):
    # HEALTH_PATH "/" only proves the mumble-web HTTP sidecar; it does NOT reflect the voice
    # server. After a chaos upgrade redeploy the host-mode 64738 port must be released by the
    # old task and rebound by the new one, leaving a window where the app (voice) container
    # isn't yet serving while mumble-web still returns 200. backup-bot then execs its sqlite
    # pre-hook into a not-running app container → 409. Gate readiness on the voice port being
    # STABLY listening (3 consecutive connects) before the harness proceeds to the backup
    # tier. The port is host-published (compose.host-ports.yml), so we probe it on the cc-ci
    # host where the run executes.
    return [{"tcp_host": "127.0.0.1", "tcp_port": 64738, "stable": 3}]
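
The harness side of this change (lifecycle.wait_ready_probes) is not shown in this hunk. As a rough illustration only, a "stable=N consecutive connects" TCP probe could be sketched as below. The names `wait_tcp_stable` and `run_probes`, and the `timeout`, `interval` and `connect_timeout` parameters, are hypothetical and not taken from the repository; only the probe dict keys (`tcp_host`, `tcp_port`, `stable`) come from the diff above.

```python
import socket
import time


def wait_tcp_stable(host, port, stable=3, timeout=60.0, interval=1.0, connect_timeout=2.0):
    """Return True once `stable` consecutive TCP connects succeed, False on timeout.

    Hypothetical helper: a sketch of the idea, not the actual lifecycle code.
    """
    deadline = time.monotonic() + timeout
    streak = 0
    while time.monotonic() < deadline:
        try:
            # A completed TCP handshake means something is listening on the port.
            with socket.create_connection((host, port), timeout=connect_timeout):
                streak += 1
        except OSError:
            # Refused, unreachable, or timed out: the port is not stable yet,
            # so the consecutive-success count starts over.
            streak = 0
        if streak >= stable:
            return True
        time.sleep(interval)
    return False


def run_probes(probes, **kwargs):
    """Check every probe dict (shape as returned by READY_PROBE); all must pass."""
    return all(
        wait_tcp_stable(p["tcp_host"], p["tcp_port"], p.get("stable", 1), **kwargs)
        for p in probes
    )
```

Resetting the streak on any failed connect is what separates this from a single connect check: a port that flaps while the old task releases it and the new task rebinds it does not count as ready until it has accepted N connects in a row.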