journal(2): Q4.7 plausible — root cause of clickhouse-backup boot-download crash-loop + decision
@@ -985,3 +985,53 @@ clear "clear pull error BEFORE deploy: manifest unknown" pre-deploy. abra deploy
update/scale). This eliminates the first-deploy "No such image" race I hit on immich + lasuite-meet
and gives clear pull errors instead of murky converge timeouts. Honest scope: removes pull-time not
app-init-time.
## 2026-05-29 — Q4.7 plausible: test content green; deploy blocked by upstream clickhouse-boot-download flakiness
**Test content authored + partially proven.** Wrote the §4.3 functional tests
(`tests/plausible/functional/test_event_tracking.py`: `test_pageview_event_roundtrip` +
`test_custom_event_roundtrip`) and fixed the health probe. Empirically validated the full event
round-trip against a live probe BEFORE writing the tests: (1) register a site row in the metadata
postgres (plausible's `sites_cache` GATES ingestion; events for unregistered domains are silently
dropped, confirmed count=0); (2) POST to `/api/event` with a **browser User-Agent** (plausible drops
bot/library UAs); (3) poll ClickHouse `events_v2` for the row (sites_cache refresh + write-buffer
flush → first landing ~35-50s). A first `STAGES=install,custom` run **PASSED both event tests**
(`2 passed in 73.58s`) and the custom tier, so the §4.3 content is GREEN. Health probe switched
`/` → `/api/health`, which returns 200 with
`{"clickhouse":"ok","postgres":"ok","sites_cache":"ok"}` only when both stores are ready; `/` 500s
under headless DISABLE_AUTH then 302s once ready, so `/` can't distinguish not-ready from ready. The
prior WIP edit had left an UNTERMINATED docstring in `test_health_check.py` (syntax error); fixed. The
install overlay re-checked `/` (→500) and FAILED; replaced it with a stronger assertion on the
`/api/health` JSON subsystems.

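The readiness check and the round-trip steps described above can be sketched as a few small helpers. This is a minimal sketch, not the code in `test_event_tracking.py`: the helper names (`health_ready`, `build_event_request`, `poll_until`), the example User-Agent string, and the default timeout and interval are assumptions for illustration. The `name`/`url`/`domain` body fields are the ones plausible's events API expects.

```python
import json
import time
from typing import Callable

# Subsystems reported by /api/health, as quoted in the entry above.
REQUIRED_SUBSYSTEMS = ("clickhouse", "postgres", "sites_cache")

# Assumed example of a browser-like User-Agent; any realistic browser UA should do.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0"


def health_ready(status: int, body: str) -> bool:
    """True only for HTTP 200 with every required subsystem reporting "ok"."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    if not isinstance(payload, dict):
        return False
    return all(payload.get(name) == "ok" for name in REQUIRED_SUBSYSTEMS)


def build_event_request(domain: str, url: str, name: str = "pageview") -> dict:
    """Headers and JSON body for a POST to /api/event (sending is left to the caller)."""
    return {
        "headers": {"User-Agent": BROWSER_UA, "Content-Type": "application/json"},
        "body": json.dumps({"name": name, "url": url, "domain": domain}),
    }


def poll_until(
    check: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call check() until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

In a test, `check` would wrap a count query against `events_v2` for the registered domain. The assumed 120s default leaves headroom over the ~35-50s first landing observed above.
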
**Blocker (upstream recipe defect): clickhouse-backup boot-download crash-loop.** The full lifecycle run
**timed out at DEPLOY_TIMEOUT=1200s**: `abra app deploy ... timed out after 1200 seconds`. Root cause:
the recipe's `entrypoint.clickhouse.sh` (swarm config `clickhouse_entrypoint`, mapped to
`/custom-entrypoint.sh`) runs, with `set -e` and NO retry, a `wget` of a 22MB `clickhouse-backup` tarball
from `github.com/AlexAkulov/clickhouse-backup` (renamed → 301 to `Altinity/...`) BEFORE exec'ing
clickhouse-server. If that wget (or the subsequent `tar -xf`) fails, the entrypoint exits 1 with EMPTY
logs (clickhouse-server never starts) and swarm crash-loops the task. Each restart re-downloads 22MB →
~120 attempts/20min ≈ 2.6GB hammered at GitHub → **GitHub secondary rate-limiting** → all subsequent
downloads fail → sustained crash-loop → deploy timeout.

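The amplification arithmetic can be checked in a few lines. The attempt count is this entry's own estimate (~120), and the byte size is the 22222742 bytes measured by the manual wget; nothing here is measured independently.

```python
# Figures taken from the journal entry; the attempt count is the entry's estimate.
DEPLOY_TIMEOUT_S = 1200      # DEPLOY_TIMEOUT
ATTEMPTS = 120               # ~120 restarts within the 20-minute window
TARBALL_BYTES = 22_222_742   # size reported by the manual wget

seconds_per_attempt = DEPLOY_TIMEOUT_S / ATTEMPTS
total_bytes = ATTEMPTS * TARBALL_BYTES

print(f"{seconds_per_attempt:.0f}s between attempts")     # 10s between attempts
print(f"{total_bytes / 1e9:.2f} GB requested in total")   # 2.67 GB requested in total
```

With the nominal 22MB figure the total is 2.64 GB, which matches the ≈ 2.6GB quoted above.
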
Evidence: exited containers show `exit=1` with zero logs (the entrypoint fails before clickhouse starts).
The download URL itself is fine: a bridge-network `docker run` with the EXACT entrypoint command (busybox
wget; the image's `wget` is `/bin/busybox`) succeeds 3/3 (22222742 bytes) when NOT hammered. The first
`install,custom` run and a manual probe BOTH converged (clickhouse up, events ingested), i.e. the deploy
works when GitHub answers the first wget. The failure is induced by today's back-to-back heavy testing
churn exhausting the IP's GitHub budget; swarm task containers egress via the same host IP, so they
share the throttle.

**Why it matters for the gate:** normal CI (one PR → one deploy, MAX_TESTS=1) does ONE wget, which usually
succeeds, and the deploy converges (as proven). The catastrophic 20-min spiral needs SUSTAINED GitHub
throttling, which only my repeated-deploy testing produces. So plausible is reasonably reliable in normal
operation but is NOT robust to a transient first-wget failure (any single failure can start the
crash-loop, and every restart re-downloads), and the Adversary cold-verify shares the risk.

**Decision (see DECISIONS.md):** durable fix = recipe PR hardening `entrypoint.clickhouse.sh`:
download the binary to the PERSISTENT `/var/lib/clickhouse` volume with skip-if-present (restarts don't
re-download → no amplification), retry-with-backoff, and `set +e` so a download failure does NOT block
clickhouse-server start (the DB must come up regardless; backup capability degrades gracefully). This
ALSO makes the deploy converge even under an active GitHub throttle (the DB no longer waits on the
download), so it is testable now. Same upstream-robustness pattern as Q3.2b (lasuite-drive) and immich's
pg_dump. cc-ci test content is correct and unchanged by this.

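The control flow of that hardening (skip-if-present, retry-with-backoff, non-fatal failure) can be modelled as below. This is a Python model of the logic only; the real change would live in the shell entrypoint. The function name `ensure_backup_binary`, the injected `download` callable, the `.part` temp suffix, and the attempt and delay values are all assumptions for illustration.

```python
import os
import time
from pathlib import Path
from typing import Callable


def ensure_backup_binary(
    target: Path,
    download: Callable[[Path], None],
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if `target` exists after this call; never raise on download failure."""
    if target.exists():
        return True  # skip-if-present: a restart costs no network traffic

    tmp = target.with_name(target.name + ".part")
    for attempt in range(attempts):
        try:
            download(tmp)            # hypothetical callable that writes the file to `tmp`
            os.replace(tmp, target)  # atomic rename: a partial file never counts as present
            return True
        except Exception:
            tmp.unlink(missing_ok=True)
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s with the defaults
    return False  # degraded: the caller still starts the database, without backup support
```

The caller would start clickhouse-server regardless of the return value. Writing to a temporary name and renaming is one way to keep an interrupted download from satisfying the skip-if-present check on the next restart.
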
Killed the crash-looping runs + removed all plausible stacks/configs/networks/volumes (clean). NOT
claiming Q4.7 until the full lifecycle is green.