diff --git a/machine-docs/JOURNAL-2.md b/machine-docs/JOURNAL-2.md
index ed16bf4..228b700 100644
--- a/machine-docs/JOURNAL-2.md
+++ b/machine-docs/JOURNAL-2.md
@@ -985,3 +985,53 @@ clear "clear pull error BEFORE deploy: manifest unknown" pre-deploy. abra deploy
 update/scale). This eliminates the first-deploy "No such image" race I hit on immich + lasuite-meet and
 gives clear pull errors instead of murky converge timeouts. Honest scope: removes pull-time not
 app-init-time.
+
+## 2026-05-29 — Q4.7 plausible: test content green; deploy blocked by upstream clickhouse-boot-download flakiness
+
+**Test content authored + partially proven.** Wrote the §4.3 functional tests
+(`tests/plausible/functional/test_event_tracking.py`: `test_pageview_event_roundtrip` +
+`test_custom_event_roundtrip`) and fixed the health probe. Empirically validated the full event
+round-trip against a live probe BEFORE writing: register a site row in the metadata postgres
+(plausible's `sites_cache` GATES ingestion — events for unregistered domains are silently dropped,
+confirmed count=0), POST to `/api/event` with a **browser User-Agent** (plausible drops bot/library
+UAs), poll ClickHouse `events_v2` for the row (sites_cache refresh + write-buffer flush → first landing
+~35-50s). A first `STAGES=install,custom` run **PASSED both event tests** (`2 passed in 73.58s`) and the
+custom tier — so the §4.3 content is GREEN. Health probe switched `/` → `/api/health` (returns 200 with
+`{"clickhouse":"ok","postgres":"ok","sites_cache":"ok"}` only when both stores ready; `/` 500s under
+headless DISABLE_AUTH then 302s once ready, so `/` can't distinguish not-ready from ready). The prior WIP
+edit had left an UNTERMINATED docstring in test_health_check.py (syntax error) — fixed. Install overlay
+re-checked `/` (→500) and FAILED; replaced with a stronger assertion on the /api/health JSON subsystems.
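The readiness rule described above (HTTP 200 plus all three subsystems reporting `"ok"`) and the poll-with-deadline pattern used for the `events_v2` row can be sketched roughly as follows. This is an illustrative sketch, not the code in `tests/plausible/`: the names `health_is_ready` and `poll_until` are hypothetical, and only the three JSON keys come from the payload quoted in the entry.

```python
import json
import time
from typing import Callable

# Subsystem keys taken from the /api/health payload quoted in the entry above.
REQUIRED_SUBSYSTEMS = ("clickhouse", "postgres", "sites_cache")


def health_is_ready(status_code: int, body: str) -> bool:
    """Hypothetical helper: ready only on HTTP 200 with every subsystem "ok"."""
    if status_code != 200:
        return False  # e.g. the 500/302 responses that "/" produces
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return False
    if not isinstance(payload, dict):
        return False
    return all(payload.get(name) == "ok" for name in REQUIRED_SUBSYSTEMS)


def poll_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Hypothetical helper: call check() until it returns True or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Given the ~35-50s first landing noted above, a caller would presumably pick a `timeout_s` comfortably above that window. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.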
+
+**Blocker (upstream recipe defect): clickhouse-backup boot-download crash-loop.** The full lifecycle run
+**timed out at DEPLOY_TIMEOUT=1200s** — `abra app deploy ... timed out after 1200 seconds`. Root cause:
+the recipe's `entrypoint.clickhouse.sh` (swarm config `clickhouse_entrypoint`, mapped to
+`/custom-entrypoint.sh`) runs, with `set -e` and NO retry, a `wget` of a 22MB `clickhouse-backup` tarball
+from `github.com/AlexAkulov/clickhouse-backup` (renamed → 301 to `Altinity/...`) BEFORE exec'ing
+clickhouse-server. If that wget (or the subsequent `tar -xf`) fails, the entrypoint exits 1 with EMPTY
+logs (clickhouse-server never starts) and swarm crash-loops the task. Each restart re-downloads 22MB →
+~120 attempts/20min ≈ 2.6GB hammered at GitHub → **GitHub secondary rate-limiting** → all subsequent
+downloads fail → sustained crash-loop → deploy timeout.
+
+Evidence: exited containers = `exit=1`, zero logs (fails before clickhouse). The download URL is fine —
+a bridge-network `docker run` with the EXACT entrypoint command (busybox wget; image's `wget` is
+`/bin/busybox`) succeeds 3/3 (22222742 bytes) when NOT hammered. The first `install,custom` run and a
+manual probe BOTH converged (clickhouse up, events ingested) — i.e. the deploy works when GitHub answers
+the first wget. The failure is induced by my back-to-back heavy testing churn today exhausting the IP's
+GitHub budget; swarm task containers egress via the same host IP so they share the throttle.
+
+**Why it matters for the gate:** normal CI (one PR → one deploy, MAX_TESTS=1) does ONE wget — usually
+succeeds, converges (as proven). The catastrophic 20-min spiral needs SUSTAINED GitHub throttling, which
+only my repeated-deploy testing produces. So plausible is reasonably reliable in normal operation but is
+NOT robust to a transient first-wget failure (any single failure spirals), and the Adversary cold-verify
+shares the risk.
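The amplification figure above can be checked with a few lines of arithmetic. This sketch uses only the numbers quoted in the entry (22MB tarball, ~120 attempts, 1200s timeout); the function names are hypothetical, and it assumes every restart attempts the full download.

```python
TARBALL_MB = 22          # size of the clickhouse-backup tarball, from the entry
DEPLOY_TIMEOUT_S = 1200  # DEPLOY_TIMEOUT, from the entry
ATTEMPTS = 120           # approximate restart count over the timeout, from the entry


def seconds_per_attempt(timeout_s: float, attempts: int) -> float:
    """Average spacing between crash-loop restarts."""
    return timeout_s / attempts


def total_download_gb(attempts: int, size_mb: float) -> float:
    """Upper bound on traffic if every restart attempts the full download (decimal GB)."""
    return attempts * size_mb / 1000


print(seconds_per_attempt(DEPLOY_TIMEOUT_S, ATTEMPTS))  # 10.0
print(total_download_gb(ATTEMPTS, TARBALL_MB))          # 2.64
```

In other words, the loop implies roughly one download attempt every 10 seconds. The 2.64GB figure is an upper bound, since attempts that fail early under throttling transfer less than the full tarball.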
+
+**Decision (see DECISIONS.md):** durable fix = recipe PR hardening `entrypoint.clickhouse.sh` —
+download the binary to the PERSISTENT `/var/lib/clickhouse` volume with skip-if-present (restarts don't
+re-download → no amplification), retry-with-backoff, and `set +e` so a download failure does NOT block
+clickhouse-server start (the DB must come up regardless; backup capability degrades gracefully). This
+ALSO makes the deploy converge even under an active GitHub throttle (the DB no longer waits on the
+download), so it is testable now. Same upstream-robustness pattern as Q3.2b (lasuite-drive) and immich's
+pg_dump. cc-ci test content is correct and unchanged by this.
+
+Killed the crash-looping runs + removed all plausible stacks/configs/networks/volumes (clean). NOT
+claiming Q4.7 until the full lifecycle is green.
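The control flow of the hardening described in the Decision paragraph (skip-if-present, retry-with-backoff, and a failure that never blocks the server start) could look roughly like this. The real fix lives in the shell entrypoint; this Python sketch only illustrates the logic. `ensure_binary`, the injected `fetch` callable, and the default attempt count and delays are all hypothetical choices, not values from the recipe PR.

```python
import os
import time
from typing import Callable


def ensure_binary(
    dest: str,
    fetch: Callable[[str], None],
    attempts: int = 5,
    base_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Hypothetical helper: return True if dest exists afterwards; never raise on fetch failure.

    fetch(path) is assumed to download the file to path and raise OSError on failure.
    """
    if os.path.exists(dest):
        return True  # skip-if-present: a restart does not download again
    tmp = dest + ".part"
    for attempt in range(attempts):
        try:
            fetch(tmp)
            os.replace(tmp, dest)  # publish only a completed download
            return True
        except OSError:
            if os.path.exists(tmp):
                os.remove(tmp)  # drop any partial file
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)  # exponential backoff
    return False  # caller starts the database anyway; backups degrade gracefully
```

With `dest` on the persistent volume, a successful download is reused across task restarts. A `False` return is the equivalent of the `set +e` behaviour: the caller logs the failure and proceeds to start clickhouse-server.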