fix(clickhouse): make clickhouse-backup fetch resilient (cache on persistent volume, retry+backoff, never block server start)
Some checks failed
cc-ci/testme cc-ci: failure

The published entrypoint downloads the 22 MB clickhouse-backup binary from GitHub at container boot,
under 'set -ex', with a single silenced, no-retry wget into ephemeral /tmp. Any transient failure of
that download (rate limit or network error) exits the container BEFORE clickhouse-server starts. Swarm
then restarts the container, which downloads again, so the throttling is amplified into a crash loop
and the deploy times out.

clickhouse-backup is the BACKUP tool (used by the backupbot pre/post hooks); clickhouse-server does not
need it to run. This hardening caches the binary on the persistent /var/lib/clickhouse volume (fetched
at most once, reused on restart), retries with backoff, never blocks the server start on a fetch
failure, and un-silences wget so a failure can be diagnosed. Behaviour is unchanged when the first
download succeeds.
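
The retry-then-continue flow described above can be exercised without network access. This is a minimal sketch, not the recipe's code: `retry_with_backoff`, `flaky_fetch`, `BACKOFF_UNIT`, `CALLS` and `FAIL_UNTIL` are hypothetical names introduced for illustration, and the backoff unit is set to 0 so the demo finishes instantly (the commit sleeps attempt * 10 seconds).

```shell
#!/bin/bash
set -e

BACKOFF_UNIT=0   # seconds per attempt; 0 keeps the demo instant (the commit uses 10)

# Hypothetical helper: run "$@" up to $1 times, sleeping
# attempt * BACKOFF_UNIT seconds after each failed attempt.
retry_with_backoff() {
    local max_attempts=$1 attempt=1
    shift
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt ${attempt} failed; backing off $((attempt * BACKOFF_UNIT))s" >&2
        sleep $((attempt * BACKOFF_UNIT))
        attempt=$((attempt + 1))
    done
    return 1
}

# Stand-in for the download: fails until it has been called FAIL_UNTIL times.
CALLS=0
flaky_fetch() {
    CALLS=$((CALLS + 1))
    [ "$CALLS" -ge "$FAIL_UNTIL" ]
}

# Case 1: the fetch recovers on the third attempt.
CALLS=0; FAIL_UNTIL=3
retry_with_backoff 5 flaky_fetch && echo "fetched after ${CALLS} calls"

# Case 2: the fetch never recovers; "||" keeps set -e from aborting,
# so the script still reaches the point where the server would start.
CALLS=0; FAIL_UNTIL=99
retry_with_backoff 5 flaky_fetch || echo "fetch failed after ${CALLS} calls; starting server anyway"
```

The second case is the one that matters for the crash loop: because the failing call sits on the left of `||`, `set -e` does not terminate the script, which is the same mechanism the commit relies on with `install_clickhouse_backup || true`.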
2026-05-31 05:28:16 +00:00
parent da159375d8
commit bd8bd93d2e

entrypoint.clickhouse.sh Normal file → Executable file

@@ -1,6 +1,21 @@
 #!/bin/bash
+# clickhouse entrypoint (cc-ci Q4.7b hardening — recipe-PR for recipe-maintainers/plausible).
+#
+# clickhouse-backup is the BACKUP tool (backupbot pre/post-hooks: `clickhouse-backup create/restore`).
+# It is NOT required for clickhouse-SERVER (`/entrypoint.sh`) to run. The published recipe fetched it
+# with `set -ex` + a single silenced no-retry wget to ephemeral /tmp, so ANY transient failure of the
+# 22 MB GitHub download (rate-limit / network) exited the container BEFORE the server started → swarm
+# restarted it → re-downloaded → amplified the throttle → crash-loop → deploy timeout (cc-ci Q4.7).
+#
+# Hardening (no behaviour change when the download succeeds first try):
+#   - cache the binary on the PERSISTENT clickhouse data volume (/var/lib/clickhouse) so it is fetched
+#     at most once and reused on every container restart (no re-download amplification);
+#   - retry with backoff;
+#   - NEVER let a download failure block the server start (best-effort: the server comes up, backup/
+#     restore degrade until the next successful fetch);
+#   - un-silenced so a failure is diagnosable in `docker service logs`.
-set -ex
+set -e
 CLICKHOUSE_BACKUP_VERSION=2.4.2
@@ -17,13 +32,33 @@ elif [[ $ARCH =~ "x86_64" ]]; then
   ARCH="amd64"
 fi
-wget \
-  --quiet \
-  --continue \
-  --no-clobber \
-  --output-document=/tmp/clickhouse-backup.tar.gz \
-  "https://github.com/AlexAkulov/clickhouse-backup/releases/download/v${CLICKHOUSE_BACKUP_VERSION}/clickhouse-backup-linux-${ARCH}.tar.gz" 2>/dev/null
+CACHE_DIR=/var/lib/clickhouse/.ccci-bin
+CACHED="${CACHE_DIR}/clickhouse-backup"
+BIN=/usr/local/bin/clickhouse-backup
+URL="https://github.com/AlexAkulov/clickhouse-backup/releases/download/v${CLICKHOUSE_BACKUP_VERSION}/clickhouse-backup-linux-${ARCH}.tar.gz"
-tar -xf /tmp/clickhouse-backup.tar.gz --directory=/usr/local/bin --strip-components=3
+install_clickhouse_backup() {
+  mkdir -p "$CACHE_DIR"
+  if [ -x "$CACHED" ]; then
+    cp -f "$CACHED" "$BIN"
+    echo "clickhouse-backup: restored from persistent cache ($CACHED)"
+    return 0
+  fi
+  for attempt in 1 2 3 4 5; do
+    if wget --continue --output-document=/tmp/clickhouse-backup.tar.gz "$URL" \
+      && tar -xf /tmp/clickhouse-backup.tar.gz --directory=/usr/local/bin --strip-components=3; then
+      cp -f "$BIN" "$CACHED" 2>/dev/null || true
+      echo "clickhouse-backup: downloaded + cached (attempt ${attempt})"
+      return 0
+    fi
+    echo "clickhouse-backup: fetch attempt ${attempt} failed; backing off $((attempt * 10))s" >&2
+    sleep $((attempt * 10))
+  done
+  echo "clickhouse-backup: fetch FAILED after retries — starting clickhouse-server WITHOUT the backup tool (backup/restore unavailable until a later restart fetches it)" >&2
+  return 1
+}
-/entrypoint.sh
+# Best-effort: the server MUST start even if the backup-tool fetch fails (it is not a server dependency).
+install_clickhouse_backup || true
+exec /entrypoint.sh
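
A possible follow-up, not part of this commit: `cp -f "$BIN" "$CACHED"` writes the cache file in place, so a container killed mid-copy could leave a truncated file that may still pass the `[ -x "$CACHED" ]` check on the next boot. The sketch below shows one common way to avoid that, copying to a temporary file in the same directory and renaming it into place. `cache_atomically` and the `WORK` paths are hypothetical names for illustration, and the demo runs against a throwaway directory rather than /usr/local/bin or /var/lib/clickhouse.

```shell
#!/bin/bash
set -e

# Hypothetical helper: copy $1 to $2 so that $2 is either missing, old,
# or complete, but never a partially written file.
cache_atomically() {
    local src=$1 dest=$2 dir tmp
    dir=$(dirname "$dest")
    mkdir -p "$dir"
    # The temp file lives next to the destination so the final mv is a
    # rename within one filesystem rather than a second copy.
    tmp=$(mktemp "${dir}/.tmp.XXXXXX")
    if cp -f "$src" "$tmp" && chmod 0755 "$tmp" && mv -f "$tmp" "$dest"; then
        return 0
    fi
    rm -f "$tmp"
    return 1
}

# Demo with a fake binary in a scratch directory.
WORK=$(mktemp -d)
printf '#!/bin/sh\necho fake-clickhouse-backup\n' > "$WORK/clickhouse-backup"
chmod +x "$WORK/clickhouse-backup"

cache_atomically "$WORK/clickhouse-backup" "$WORK/cache/.ccci-bin/clickhouse-backup"
cmp -s "$WORK/clickhouse-backup" "$WORK/cache/.ccci-bin/clickhouse-backup" \
    && echo "cache copy is complete"
```

In the entrypoint this would stand in for the `cp -f "$BIN" "$CACHED" 2>/dev/null || true` line, still followed by `|| true` so that a failed cache write never blocks the server start.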