decisions(2): plausible Q4.7 full upgrade+P4 ENV-BLOCKED by ClickHouse cold-init crash flake (3-failure rule) — §4.3 floor verified, full tiers deferred pending env stabilization, §7.1 sign-off requested
@@ -961,3 +961,29 @@ Fix: strip the `+...` working-tree-state marker before the commit match (`chaos.
HC1 is preserved — the underlying commit must still equal head_ref; a stale prev-checkout chaos
redeploy stamps prev's commit (also `+U` if overlaid) and still won't match. General: every future
cc-ci overlay recipe (untracked overlay + CHAOS_BASE_DEPLOY) would otherwise hit this.

## 2026-05-30 — plausible Q4.7 full lifecycle env-blocked by ClickHouse cold-init crash flake (3-failure rule)
**Decision:** Q4.7 plausible stays at its **§4.3-floor coverage** (event-roundtrips, verified first-hand
by the Adversary in REVIEW-2 `71af595`). The full upgrade + P4 backup/restore tiers are **deferred pending
env stabilization**. This is NOT a test/recipe defect but a genuine environment-level blocker (§7.1
exception); Adversary sign-off is requested.

**Evidence (3 consecutive install failures; the 3-identical-failure rule says stop):**
- `ccci-plausible-q47`: install FAIL. App `/api/health` returned 404 (ClickHouse `events_db` never converged).
- `ccci-plausible-q47b`: install FAIL. `events_db` (ClickHouse) crash-looped with `exit(1)`; swarm restarted it
  every ~10-30s for the whole deploy, while postgres `db` and `app` were 1/1.
- `ccci-plausible-q47c`: install FAIL. Same `events_db` `exit(1)` crash-loop.

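The stop condition applied to the three runs above can be sketched as follows. This is a minimal illustration, not the project's actual rule implementation: the `RunResult` type, the `should_stop` helper, and the `events_db_unhealthy` signature label are all hypothetical, and how the real rule decides that two failures are "identical" is not stated here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RunResult:
    """One CI attempt. All field names are hypothetical, for illustration only."""
    run_id: str          # e.g. "ccci-plausible-q47"
    stage: str           # stage that was running, e.g. "install"
    ok: bool             # True if the stage passed
    signature: str = ""  # assumed normalised failure label


def should_stop(results: Sequence[RunResult], threshold: int = 3) -> bool:
    """Return True when the last `threshold` runs all failed the same way."""
    if threshold < 1 or len(results) < threshold:
        return False
    tail = results[-threshold:]
    if any(r.ok for r in tail):
        return False
    # Identical means: same stage and same failure signature.
    return len({(r.stage, r.signature) for r in tail}) == 1


# Assumption: all three failures are classified under one signature,
# because each traces back to `events_db` never becoming healthy.
runs = [
    RunResult("ccci-plausible-q47", "install", False, "events_db_unhealthy"),
    RunResult("ccci-plausible-q47b", "install", False, "events_db_unhealthy"),
    RunResult("ccci-plausible-q47c", "install", False, "events_db_unhealthy"),
]
print(should_stop(runs))
```

A pass anywhere in the last three runs, or a failure with a different signature, resets the condition, so the rule only fires on a streak.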
**Characterisation:** the ClickHouse cold-init crash is **per-deploy**: a fresh deploy gets EITHER a clean
ClickHouse OR a persistently crashing one (~1-in-2 per the Adversary's own observation; clustered to 3/3
here). It is also **persistent within a run**: swarm restarts don't recover it, which is consistent with a
corrupted or raced first-init that re-crashes on every restart of that volume. ClickHouse logs to FILES
(`/var/log/clickhouse-server/`), not stdout, so `docker logs` is empty, and the crashed container can't be
exec'd, so the error log is inaccessible post-crash. The most likely cause is a ClickHouse first-boot init
race on the single cc-ci node. The §4.3 functional tests + P4 overlays are authored and correct
(`tests/plausible/`); they simply can't run when ClickHouse fails to boot. Nothing is being weakened.

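For scale, the arithmetic behind "~1-in-2, clustered to 3/3" can be worked through as below. This is a rough sketch that assumes deploys are independent with a fixed crash probability. That independence is an assumption, not something observed here, and the clustering may well mean it does not hold. The function names are illustrative.

```python
def prob_all_fail(p_crash: float, n: int) -> float:
    """Probability that n independent deploys all hit the crash."""
    return p_crash ** n


def prob_clean_within(p_crash: float, n: int) -> float:
    """Probability that at least one of n independent deploys boots cleanly."""
    return 1.0 - p_crash ** n


def expected_attempts_until_clean(p_crash: float) -> float:
    """Mean number of deploys until the first clean boot (geometric distribution)."""
    return 1.0 / (1.0 - p_crash)


p = 0.5  # the "~1-in-2" figure, taken at face value
print(prob_all_fail(p, 3))               # 0.125
print(prob_clean_within(p, 5))           # 0.96875
print(expected_attempts_until_clean(p))  # 2.0
```

Under these assumptions a 3/3 streak has a 1-in-8 chance, so it is unlucky but not surprising. If the true crash rate on this node is higher than 1-in-2, or failures are correlated, blind retries become correspondingly less useful.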
**Re-entry:** when the ClickHouse boot is stabilised (e.g. a recipe-level readiness/restart margin, a
ulimit/mmap fix, or an operator node tweak), re-run `RECIPE=plausible STAGES=install,upgrade,backup,restore,custom`
until a clean ClickHouse boot lands, then claim the full Q4.7 gate. Filed in DEFERRED.md.
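The re-entry loop could look roughly like the sketch below. Only the `RECIPE` and `STAGES` values come from this entry. The helper names are hypothetical, the runner command is a placeholder because this entry does not name it, and the attempt cap is an added assumption so that a still-broken environment does not retry forever.

```python
import os
import subprocess
from typing import Callable, Mapping, Optional, Sequence

STAGES = "install,upgrade,backup,restore,custom"


def build_env(recipe: str = "plausible", stages: str = STAGES,
              base: Optional[Mapping[str, str]] = None) -> dict:
    """Copy the base environment (default: os.environ) and set RECIPE/STAGES."""
    env = dict(os.environ if base is None else base)
    env["RECIPE"] = recipe
    env["STAGES"] = stages
    return env


def run_once(command: Sequence[str], env: Mapping[str, str]) -> bool:
    """Run the CI entry point once; True means it exited with status 0."""
    return subprocess.run(list(command), env=dict(env)).returncode == 0


def rerun_until_clean(attempt: Callable[[], bool],
                      max_attempts: int = 5) -> Optional[int]:
    """Call `attempt` until it succeeds; return the 1-based attempt number.

    Returns None if every attempt up to `max_attempts` fails.
    """
    for n in range(1, max_attempts + 1):
        if attempt():
            return n
    return None


# Usage (placeholder command; substitute the real cc-ci entry point):
#   rerun_until_clean(lambda: run_once(["./run-ci.sh"], build_env()))
```

Passing the attempt as a callable keeps the retry logic testable without a real deploy. A `None` result means the cap was hit and the environment is probably not yet stable.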