decisions(2): plausible Q4.7 full upgrade+P4 ENV-BLOCKED by ClickHouse cold-init crash flake (3-failure rule) — §4.3 floor verified, full tiers deferred pending env stabilization, §7.1 sign-off requested

2026-05-30 09:15:29 +01:00
parent d753903c2a
commit 4de75a5b7a
3 changed files with 46 additions and 2 deletions

@@ -961,3 +961,29 @@ Fix: strip the `+...` working-tree-state marker before the commit match (`chaos.
HC1 is preserved — the underlying commit must still equal head_ref; a stale prev-checkout chaos
redeploy stamps prev's commit (also `+U` if overlaid) and still won't match. General: every future
cc-ci overlay recipe (untracked overlay + CHAOS_BASE_DEPLOY) would otherwise hit this.
## 2026-05-30 — plausible Q4.7 full lifecycle env-blocked by ClickHouse cold-init crash flake (3-failure rule)
**Decision:** Q4.7 plausible stays at its **§4.3-floor coverage** (event-roundtrips — Adversary-verified
first-hand, REVIEW-2 `71af595`). The full upgrade + P4 backup/restore tiers are **deferred pending env
stabilization**. This is NOT a test/recipe defect but a genuine environment-level blocker (§7.1 exception);
Adversary sign-off is requested.
**Evidence (3 consecutive install failures, 3-identical-failure rule → stop):**
- `ccci-plausible-q47`: install FAIL — app `/api/health` 404 (ClickHouse `events_db` never converged).
- `ccci-plausible-q47b`: install FAIL — `events_db` (ClickHouse) crash-loop `exit(1)`, swarm restarting
every ~10-30s, persistent for the whole deploy; postgres `db` + `app` were 1/1.
- `ccci-plausible-q47c`: install FAIL — same `events_db` `exit(1)` crash-loop.
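
The stop condition applied to the three runs above can be sketched as follows. This is a minimal
illustration under stated assumptions, not the cc-ci implementation: `should_stop`, the
`(stage, failing service)` signature shape, and the `threshold` parameter are hypothetical names
introduced here.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical shape: (stage, failing service) for a failed run, None for a passing run.
Signature = Optional[Tuple[str, str]]


def should_stop(history: Sequence[Signature], threshold: int = 3) -> bool:
    """Return True when the last `threshold` runs all failed with the same signature."""
    if len(history) < threshold:
        return False
    tail = history[-threshold:]
    first = tail[0]
    # A passing run (None) anywhere in the tail breaks the streak.
    return first is not None and all(sig == first for sig in tail)


# The three installs in this entry, reduced to the assumed signature shape.
runs = [("install", "events_db")] * 3
stop = should_stop(runs)
```

Keying the signature on the failing service rather than the raw error text is a design choice of this
sketch: it treats the `/api/health` 404 and the `exit(1)` crash-loop as the same `events_db` failure.
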
**Characterisation:** the ClickHouse cold-init crash is **per-deploy**: a fresh deploy gets EITHER a clean
ClickHouse OR a persistently crashing one (~1-in-2 per the Adversary's own observation; clustered to 3/3
here). It is also **persistent within a run**: swarm restarts don't recover it, which is consistent with a
corrupted or raced first init that re-crashes on every restart against the same volume. ClickHouse logs to
FILES (`/var/log/clickhouse-server/`), not stdout, so `docker logs` is empty, and the crashed container
can't be exec'd, so the err log is inaccessible post-crash. The most likely cause is a ClickHouse
first-boot init race on the single cc-ci node.
The §4.3 functional tests + P4 overlays are authored and correct (`tests/plausible/`); they simply can't
run when ClickHouse fails to boot. No test or assertion is being weakened.
**Re-entry:** when the ClickHouse boot is stabilised (e.g. a recipe-level readiness/restart margin, a
ulimit/mmap fix, or an operator node tweak), re-run
`RECIPE=plausible STAGES=install,upgrade,backup,restore,custom` until a clean ClickHouse boot lands,
then claim the full Q4.7 gate. Filed in DEFERRED.md.
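
The re-entry loop can be sketched as a bounded retry. This is a hedged illustration only:
`rerun_until_clean`, `run_once`, and `max_attempts` are hypothetical names, and `run_once` stands in for
whatever actually launches the `RECIPE=plausible` run and reports whether ClickHouse booted cleanly.

```python
from typing import Callable


def rerun_until_clean(run_once: Callable[[], bool], max_attempts: int = 5) -> int:
    """Call run_once until it returns True.

    Returns the 1-based attempt number that succeeded, or 0 if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if run_once():
            return attempt
    return 0


# Simulated outcomes (assumed, not observed): two crashing deploys, then a clean boot.
outcomes = iter([False, False, True])
clean_attempt = rerun_until_clean(lambda: next(outcomes))
```

Capping the attempts keeps the loop from spinning indefinitely if the boot is still unstable, in which
case the entry stays deferred.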