journal(2): Q4.7 plausible — root cause of clickhouse-backup boot-download crash-loop + decision

2026-05-29 18:48:56 +01:00
parent b4f39cb51a
commit f9ebb3f610
1 changed files with 50 additions and 0 deletions
--- a/machine-docs/JOURNAL-2.md
+++ b/machine-docs/JOURNAL-2.md
@@ -985,3 +985,53 @@ clear "clear pull error BEFORE deploy: manifest unknown" pre-deploy. abra deploy
 update/scale). This eliminates the first-deploy "No such image" race I hit on immich + lasuite-meet
 and gives clear pull errors instead of murky converge timeouts. Honest scope: removes pull-time not
 app-init-time.
+
+## 2026-05-29 — Q4.7 plausible: test content green; deploy blocked by upstream clickhouse-boot-download flakiness
+
+**Test content authored + partially proven.** Wrote the §4.3 functional tests
+(`tests/plausible/functional/test_event_tracking.py`: `test_pageview_event_roundtrip` +
+`test_custom_event_roundtrip`) and fixed the health probe. Empirically validated the full event
+round-trip against a live probe BEFORE writing: register a site row in the metadata postgres
+(plausible's `sites_cache` GATES ingestion — events for unregistered domains are silently dropped,
+confirmed count=0), POST to `/api/event` with a **browser User-Agent** (plausible drops bot/library
+UAs), poll ClickHouse `events_v2` for the row (sites_cache refresh + write-buffer flush → first landing
+~35-50s). A first `STAGES=install,custom` run **PASSED both event tests** (`2 passed in 73.58s`) and the
+custom tier — so the §4.3 content is GREEN. Health probe switched `/` → `/api/health` (returns 200 with
+`{"clickhouse":"ok","postgres":"ok","sites_cache":"ok"}` only when both stores ready; `/` 500s under
+headless `DISABLE_AUTH` then 302s once ready, so `/` can't distinguish not-ready from ready). The prior WIP
+edit had left an UNTERMINATED docstring in `test_health_check.py` (syntax error), now fixed. Install overlay
+re-checked `/` (→500) and FAILED; replaced with a stronger assertion on the `/api/health` JSON subsystems.
+
+**Blocker (upstream recipe defect): clickhouse-backup boot-download crash-loop.** The full lifecycle run
+**timed out at DEPLOY_TIMEOUT=1200s** — `abra app deploy ... timed out after 1200 seconds`. Root cause:
+the recipe's `entrypoint.clickhouse.sh` (swarm config `clickhouse_entrypoint`, mapped to
+`/custom-entrypoint.sh`) runs, with `set -e` and NO retry, a `wget` of a 22MB `clickhouse-backup` tarball
+from `github.com/AlexAkulov/clickhouse-backup` (renamed → 301 to `Altinity/...`) BEFORE exec'ing
+clickhouse-server. If that wget (or the subsequent `tar -xf`) fails, the entrypoint exits 1 with EMPTY
+logs (clickhouse-server never starts) and swarm crash-loops the task. Each restart re-downloads 22MB →
+~120 attempts/20min ≈ 2.6GB hammered at GitHub → **GitHub secondary rate-limiting** → all subsequent
+downloads fail → sustained crash-loop → deploy timeout.
+
+Evidence: exited containers = `exit=1`, zero logs (fails before clickhouse). The download URL is fine —
+a bridge-network `docker run` with the EXACT entrypoint command (busybox wget; image's `wget` is
+`/bin/busybox`) succeeds 3/3 (22222742 bytes) when NOT hammered. The first `install,custom` run and a
+manual probe BOTH converged (clickhouse up, events ingested) — i.e. the deploy works when GitHub answers
+the first wget. The failure is induced by my back-to-back heavy testing churn today exhausting the IP's
+GitHub budget; swarm task containers egress via the same host IP so they share the throttle.
+
+**Why it matters for the gate:** normal CI (one PR → one deploy, MAX_TESTS=1) does ONE wget, which usually
+succeeds and converges (as proven). The full 20-min spiral needs SUSTAINED GitHub throttling, which so far
+only my repeated-deploy testing has produced. So plausible is reasonably reliable in normal operation but
+is NOT robust to a transient first-wget failure: with no retry, a single failure starts the restart loop,
+and every restart re-downloads 22MB, which can itself bring on the throttle. The Adversary cold-verify
+shares the risk.
+
+**Decision (see DECISIONS.md):** durable fix = recipe PR hardening `entrypoint.clickhouse.sh` —
+download the binary to the PERSISTENT `/var/lib/clickhouse` volume with skip-if-present (restarts don't
+re-download → no amplification), retry-with-backoff, and `set +e` so a download failure does NOT block
+clickhouse-server start (the DB must come up regardless; backup capability degrades gracefully). This
+ALSO makes the deploy converge even under an active GitHub throttle (the DB no longer waits on the
+download), so it is testable now. Same upstream-robustness pattern as Q3.2b (lasuite-drive) and immich's
+pg_dump. cc-ci test content is correct and unchanged by this.
+
+Killed the crash-looping runs + removed all plausible stacks/configs/networks/volumes (clean). NOT
+claiming Q4.7 until the full lifecycle is green.
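
Not part of the diff: a minimal shell sketch of the hardening described in the Decision entry (skip-if-present, retry-with-backoff, and a failure path that does not abort startup). The function names `fetch_once` and `fetch_with_backoff` and the `FETCH` override are hypothetical, chosen for illustration; this is not the recipe's actual `entrypoint.clickhouse.sh`, and attempt counts and delays are placeholder values.

```shell
#!/bin/sh
# Illustrative sketch only; names and defaults are assumptions, not the recipe's code.

# FETCH: download command, invoked as: $FETCH <destination> <url>.
# Overridable so the retry logic can be exercised without network access.
FETCH="${FETCH:-wget -q -O}"

fetch_once() {
    # $1 = url, $2 = destination.
    # Download to a temporary name and rename on success, so a partial
    # download is never mistaken for a complete artifact on the next start.
    tmp="$2.part"
    if $FETCH "$tmp" "$1"; then
        mv "$tmp" "$2"
    else
        rm -f "$tmp"
        return 1
    fi
}

fetch_with_backoff() {
    # $1 = url, $2 = destination, $3 = max attempts, $4 = initial delay in seconds.
    url="$1"; dest="$2"; attempts="$3"; delay="$4"

    # Skip-if-present: a non-empty file from an earlier start is reused,
    # so container restarts do not re-download.
    if [ -s "$dest" ]; then
        return 0
    fi

    n=1
    while [ "$n" -le "$attempts" ]; do
        if fetch_once "$url" "$dest"; then
            return 0
        fi
        # Sleep only between attempts, doubling the delay each time.
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
            delay=$((delay * 2))
        fi
        n=$((n + 1))
    done
    return 1
}
```

In an entrypoint, a call such as `fetch_with_backoff "$URL" "$DEST" 5 2 || echo "clickhouse-backup unavailable" >&2` would run before the `exec` of clickhouse-server, with `$URL` and `$DEST` left as placeholders here and `$DEST` assumed to sit on the persistent `/var/lib/clickhouse` volume. The `||` keeps a failed download from stopping the server start even if `set -e` is still in effect.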