journal(2): Q4.7 plausible — root cause of clickhouse-backup boot-download crash-loop + decision

2026-05-29 18:48:56 +01:00
parent b4f39cb51a
commit f9ebb3f610


@ -985,3 +985,53 @@ clear "clear pull error BEFORE deploy: manifest unknown" pre-deploy. abra deploy
update/scale). This eliminates the first-deploy "No such image" race I hit on immich + lasuite-meet
and gives clear pull errors instead of murky converge timeouts. Honest scope: removes pull-time not
app-init-time.
## 2026-05-29 — Q4.7 plausible: test content green; deploy blocked by upstream clickhouse-boot-download flakiness
**Test content authored + partially proven.** Wrote the §4.3 functional tests
(`tests/plausible/functional/test_event_tracking.py`: `test_pageview_event_roundtrip` +
`test_custom_event_roundtrip`) and fixed the health probe. Empirically validated the full event
round-trip against a live probe BEFORE writing: register a site row in the metadata postgres
(plausible's `sites_cache` GATES ingestion — events for unregistered domains are silently dropped,
confirmed count=0), POST to `/api/event` with a **browser User-Agent** (plausible drops bot/library
UAs), poll ClickHouse `events_v2` for the row (sites_cache refresh + write-buffer flush → first landing
~35-50s). A first `STAGES=install,custom` run **PASSED both event tests** (`2 passed in 73.58s`) and the
custom tier, so the §4.3 content is GREEN. Health probe switched `/` → `/api/health` (returns 200 with
`{"clickhouse":"ok","postgres":"ok","sites_cache":"ok"}` only when both stores are ready; `/` returns 500
under headless DISABLE_AUTH and then 302 once ready, so `/` can't distinguish not-ready from ready). The prior WIP
edit had left an UNTERMINATED docstring in `test_health_check.py` (a syntax error); fixed. The install
overlay re-checked `/` (→500) and FAILED; I replaced that check with a stronger assertion on the
`/api/health` JSON subsystems.
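
A minimal sketch of the probe logic described above, not the actual test code: it builds the `/api/event` POST with a browser User-Agent, treats `/api/health` as ready only on 200 with all three subsystems `"ok"`, and polls with a bounded number of attempts. The helper names, the User-Agent string and the example hostnames are assumptions for illustration; only the endpoint paths, the payload gate on a registered domain and the JSON keys come from the notes above.

```python
import json
import time
import urllib.request

# Hypothetical values for illustration; not taken from the test suite.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0"
SUBSYSTEMS = ("clickhouse", "postgres", "sites_cache")


def build_event_request(base_url, domain, name):
    """Build (but do not send) a POST to /api/event for a registered domain."""
    payload = {"name": name, "domain": domain, "url": f"https://{domain}/"}
    return urllib.request.Request(
        f"{base_url}/api/event",
        data=json.dumps(payload).encode(),
        # A browser-like UA, because bot/library UAs are dropped.
        headers={"Content-Type": "application/json", "User-Agent": BROWSER_UA},
        method="POST",
    )


def health_ready(status, body):
    """Ready only on HTTP 200 with every subsystem reporting "ok"."""
    if status != 200:
        return False
    try:
        report = json.loads(body)
    except ValueError:
        return False
    return isinstance(report, dict) and all(
        report.get(key) == "ok" for key in SUBSYSTEMS
    )


def poll_until(check, attempts, interval, sleep=time.sleep):
    """Call check() up to `attempts` times, sleeping `interval` between tries."""
    for i in range(attempts):
        if check():
            return True
        if i < attempts - 1:
            sleep(interval)
    return False
```

With a first landing of roughly 35-50s, something like `poll_until(row_present, attempts=30, interval=3)` would leave headroom; `row_present` stands for a hypothetical ClickHouse query against `events_v2`.
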
**Blocker (upstream recipe defect): clickhouse-backup boot-download crash-loop.** The full lifecycle run
**timed out at DEPLOY_TIMEOUT=1200s** — `abra app deploy ... timed out after 1200 seconds`. Root cause:
the recipe's `entrypoint.clickhouse.sh` (swarm config `clickhouse_entrypoint`, mapped to
`/custom-entrypoint.sh`) runs, with `set -e` and NO retry, a `wget` of a 22MB `clickhouse-backup` tarball
from `github.com/AlexAkulov/clickhouse-backup` (renamed → 301 to `Altinity/...`) BEFORE exec'ing
clickhouse-server. If that wget (or the subsequent `tar -xf`) fails, the entrypoint exits 1 with EMPTY
logs (clickhouse-server never starts) and swarm crash-loops the task. Each restart re-downloads 22MB →
~120 attempts/20min ≈ 2.6GB hammered at GitHub → **GitHub secondary rate-limiting** → all subsequent
downloads fail → sustained crash-loop → deploy timeout.
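
The amplification arithmetic can be checked in a few lines. This is a back-of-envelope sketch: the attempt count is the journal's own estimate, the byte count is the size observed for a successful wget of the tarball, and the function name is hypothetical.

```python
TARBALL_BYTES = 22_222_742   # size of one successful tarball download
DEPLOY_TIMEOUT_S = 1200      # DEPLOY_TIMEOUT for the lifecycle run
ATTEMPTS = 120               # estimated task restarts within the timeout


def download_volume(attempts, size_bytes):
    """Total bytes pulled when every restart re-downloads the tarball."""
    return attempts * size_bytes


total = download_volume(ATTEMPTS, TARBALL_BYTES)
seconds_per_attempt = DEPLOY_TIMEOUT_S / ATTEMPTS

print(f"{total / 1e9:.2f} GB in {DEPLOY_TIMEOUT_S // 60} min, "
      f"one attempt every {seconds_per_attempt:.0f}s")
```

The exact byte count gives a total slightly above the rounded 22MB figure, but it stays in the same ~2.6GB range.
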
Evidence: exited containers = `exit=1`, zero logs (fails before clickhouse). The download URL is fine —
a bridge-network `docker run` with the EXACT entrypoint command (busybox wget; image's `wget` is
`/bin/busybox`) succeeds 3/3 (22222742 bytes) when NOT hammered. The first `install,custom` run and a
manual probe BOTH converged (clickhouse up, events ingested) — i.e. the deploy works when GitHub answers
the first wget. The failure is induced by my back-to-back heavy testing churn today exhausting the IP's
GitHub budget; swarm task containers egress via the same host IP so they share the throttle.
**Why it matters for the gate:** normal CI (one PR → one deploy, MAX_TESTS=1) does ONE wget — usually
succeeds, converges (as proven). The catastrophic 20-min spiral needs SUSTAINED GitHub throttling, which
only my repeated-deploy testing produces. So plausible is reasonably reliable in normal operation but is
NOT robust to a transient first-wget failure (any single failure spirals), and the Adversary cold-verify
shares the risk.
**Decision (see DECISIONS.md):** durable fix = recipe PR hardening `entrypoint.clickhouse.sh` —
download the binary to the PERSISTENT `/var/lib/clickhouse` volume with skip-if-present (restarts don't
re-download → no amplification), retry-with-backoff, and `set +e` so a download failure does NOT block
clickhouse-server start (the DB must come up regardless; backup capability degrades gracefully). This
ALSO makes the deploy converge even under an active GitHub throttle (the DB no longer waits on the
download), so it is testable now. Same upstream-robustness pattern as Q3.2b (lasuite-drive) and immich's
pg_dump. cc-ci test content is correct and unchanged by this.
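
The control flow of the proposed hardening can be modelled as below. This is an illustrative Python sketch only: the real change would live in the shell entrypoint, and the function name, retry count, delays and the temp-file-then-rename step are my assumptions rather than anything in the recipe.

```python
import os
import time


def ensure_backup_binary(dest, download, retries=4, base_delay=2.0, sleep=time.sleep):
    """Return True if the binary is available at `dest`; never raise on failure.

    `download(path)` is a caller-supplied function that fetches the binary to
    `path` and raises OSError on failure (hypothetical interface).
    """
    if os.path.exists(dest):
        return True  # skip-if-present: restarts do not re-download

    tmp = dest + ".part"
    delay = base_delay
    for attempt in range(1, retries + 1):
        try:
            download(tmp)
            os.replace(tmp, dest)  # publish only a complete file
            return True
        except OSError as exc:
            print(f"download attempt {attempt}/{retries} failed: {exc}")
            if attempt < retries:
                sleep(delay)
                delay *= 2  # exponential backoff between attempts
    return False  # degrade gracefully; the caller starts the DB anyway
```

The caller would ignore a `False` result apart from logging it and go on to start clickhouse-server, which is the equivalent of the `set +e` behaviour described above. Writing to a `.part` file first is there so a truncated download is never mistaken for a present binary on the next restart.
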
Killed the crash-looping runs + removed all plausible stacks/configs/networks/volumes (clean). NOT
claiming Q4.7 until the full lifecycle is green.