decisions(2): plausible Q4.7 full upgrade+P4 ENV-BLOCKED by ClickHouse cold-init crash flake (3-failure rule) — §4.3 floor verified, full tiers deferred pending env stabilization, §7.1 sign-off requested

2026-05-30 09:15:29 +01:00
parent d753903c2a
commit 4de75a5b7a
3 changed files with 46 additions and 2 deletions


@@ -961,3 +961,29 @@ Fix: strip the `+...` working-tree-state marker before the commit match (`chaos.
HC1 is preserved — the underlying commit must still equal head_ref; a stale prev-checkout chaos
redeploy stamps prev's commit (also `+U` if overlaid) and still won't match. General: every future
cc-ci overlay recipe (untracked overlay + CHAOS_BASE_DEPLOY) would otherwise hit this.
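
A minimal sketch of the marker strip described in this entry. It assumes the stamped version has the form `<commit>+<flags>` (e.g. `+U` for an untracked overlay) and that the match is a plain equality against head_ref; `strip_tree_state` and `matches_head` are hypothetical names, not the actual chaos helpers.

```python
def strip_tree_state(stamped: str) -> str:
    """Drop a trailing '+...' working-tree-state marker, if present."""
    # partition() returns the whole string as the first item when '+' is absent,
    # so an unmarked commit passes through unchanged.
    commit, _, _ = stamped.partition("+")
    return commit


def matches_head(stamped: str, head_ref: str) -> bool:
    """HC1-style check: the underlying commit must still equal head_ref."""
    return strip_tree_state(stamped) == head_ref
```

A stale prev-checkout redeploy still fails this check, because stripping the marker leaves prev's commit, which differs from head_ref.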
## 2026-05-30 — plausible Q4.7 full lifecycle env-blocked by ClickHouse cold-init crash flake (3-failure rule)
**Decision:** Q4.7 plausible stays at its **§4.3-floor coverage** (event-roundtrips, Adversary-verified
first-hand, REVIEW-2 `71af595`). The full upgrade + P4 backup/restore tiers are **deferred pending env
stabilization**. This is NOT a test/recipe defect but a genuine environment-level blocker (§7.1 exception);
Adversary sign-off is requested.
**Evidence (3 consecutive install failures, 3-identical-failure rule → stop):**
- `ccci-plausible-q47`: install FAIL — app `/api/health` 404 (ClickHouse `events_db` never converged).
- `ccci-plausible-q47b`: install FAIL — `events_db` (ClickHouse) crash-loop `exit(1)`, swarm restarting
every ~10-30s, persistent for the whole deploy; postgres `db` + `app` were 1/1.
- `ccci-plausible-q47c`: install FAIL — same `events_db` `exit(1)` crash-loop.
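
The stop condition applied to the three runs above can be sketched as follows. This is an illustration only: how the real rule decides that two failures are "identical" is not stated here, so the failure signature string (`"install:events_db"`) and the function name `should_stop` are assumptions.

```python
from typing import Optional, Sequence


def should_stop(history: Sequence[Optional[str]], limit: int = 3) -> bool:
    """True when the last `limit` runs all failed with the same signature.

    Each history entry is a failure signature, or None for a passing run.
    """
    if len(history) < limit:
        return False
    tail = history[-limit:]
    # A passing run (None) anywhere in the tail breaks the streak.
    return tail[0] is not None and all(entry == tail[0] for entry in tail)
```

With q47, q47b and q47c all recorded under the same hypothetical signature, `should_stop(["install:events_db"] * 3)` is true, while a pass in between resets the count.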
**Characterisation:** the ClickHouse cold-init crash is **per-deploy**: a fresh deploy gets EITHER a clean
ClickHouse OR a persistently-crashing one (~1-in-2 per the Adversary's own observation; clustered to 3/3
here). It is also **persistent within a run**: swarm restarts don't recover it, which is consistent with a
corrupted/raced first-init that re-crashes on every restart of that volume. ClickHouse logs to FILES
(`/var/log/clickhouse-server/`), not stdout, so `docker logs` is empty, and a crashed container can't be
exec'd, so the err log is inaccessible post-crash. The most likely cause is a ClickHouse first-boot init
race on the single cc-ci node; this is unconfirmed because the logs could not be read.
The §4.3 functional tests + P4 overlays are authored and correct (`tests/plausible/`); they simply can't
run when ClickHouse fails to boot. Nothing is being weakened.
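
As a rough worked check of the "~1-in-2, clustered to 3/3" observation, the arithmetic can be sketched as below. It assumes each fresh deploy fails independently with the same probability, which the clustering noted above may not satisfy; the function names are illustrative.

```python
def prob_all_fail(p_fail: float, runs: int) -> float:
    """Probability that `runs` independent deploys all hit the crash."""
    return p_fail ** runs


def prob_clean_within(p_fail: float, attempts: int) -> float:
    """Probability that at least one of `attempts` deploys boots cleanly."""
    return 1.0 - p_fail ** attempts


print(prob_all_fail(0.5, 3))      # 0.125
print(prob_clean_within(0.5, 5))  # 0.96875
```

Under that independence assumption, three failures in a row has a 12.5% chance at a 1-in-2 crash rate, so the 3/3 streak is unlucky but not in itself evidence that the rate has changed.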
**Re-entry:** when the ClickHouse boot is stabilised (e.g. a recipe-level readiness/restart margin, a
ulimit/mmap fix, or an operator node tweak), re-run with
`RECIPE=plausible STAGES=install,upgrade,backup,restore,custom` until a clean ClickHouse boot lands, then
claim the full Q4.7 gate. Filed in DEFERRED.md.
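
The re-entry loop could be sketched as a bounded retry. The runner command is not named in this entry, so it is left as a caller-supplied `cmd`; the attempt cap, `run_once` and `rerun_until_clean` are assumptions made for illustration, not the cc-ci harness API.

```python
import os
import subprocess
from typing import Callable, Sequence

STAGES = "install,upgrade,backup,restore,custom"


def run_once(cmd: Sequence[str]) -> bool:
    """Run the (caller-supplied) lifecycle command with RECIPE/STAGES set."""
    env = dict(os.environ, RECIPE="plausible", STAGES=STAGES)
    return subprocess.run(cmd, env=env).returncode == 0


def rerun_until_clean(attempt: Callable[[], bool], max_attempts: int = 5) -> int:
    """Return the 1-based attempt number that passed, or 0 if none did."""
    for n in range(1, max_attempts + 1):
        if attempt():
            return n
    return 0
```

Capping the attempts keeps a regression of the ClickHouse boot from turning the re-run into an unbounded loop; a return value of 0 means the blocker is still present.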


@@ -320,3 +320,18 @@ before the build is called done) — but does **not** force closure.
so /etc/timezone exists). Then the Builder executes the scoped gitea+drone integration (JOURNAL f86a58a).
- **Re-entry trigger:** host /etc/timezone deployed (verify `ssh cc-ci 'cat /etc/timezone'` = UTC).
- **Linked IDEA / BACKLOG:** Q4.10; JOURNAL-2 f86a58a; commit 3bde76f.
### 2026-05-30 — plausible Q4.7 full upgrade+P4 tiers (ClickHouse cold-init crash flake)
- [ ] **What:** Run plausible's FULL lifecycle (`install,upgrade,backup,restore,custom`) to a clean
green and claim the full Q4.7 gate. Suite is authored (`tests/plausible/` ops + test_backup/
restore/upgrade + functional event-roundtrips). §4.3 floor already Adversary-verified (REVIEW-2
`71af595`); only the full upgrade + P4 tiers remain.
- **Filed by:** Builder, phase 2 (Q4.7)
- **Reason for deferral:** ENVIRONMENT-level blocker (§7.1 exception). ClickHouse `events_db` cold-init
crash-loops `exit(1)` on ~1-in-2 fresh deploys and persists within a run (3 consecutive install
failures q47/q47b/q47c → 3-failure rule). Logs-to-file so no stdout diagnostics. NOT a test/recipe
defect — the tests can't run when ClickHouse won't boot. See DECISIONS 2026-05-30.
- **Re-entry trigger:** ClickHouse boot stabilised (recipe readiness/restart margin, ulimit/mmap fix, or
operator node tweak) → re-run until a clean ClickHouse boot, then claim. **Needs Adversary §7.1 sign-off**
that the §4.3-floor coverage + documented env-blocker is acceptable for this gate meanwhile.
- **Linked:** REVIEW-2 `71af595` (§4.3 floor PASS); DECISIONS 2026-05-30 (ClickHouse crash flake).


@@ -53,8 +53,11 @@ tree must carry:
(Q3.2), lasuite-meet (Q3.3), immich (Q3.5), matrix-synapse (Q4.1), mumble (Q4.2), bluesky-pds (Q4.3),
**ghost (Q4.4 ✅)**, mattermost-lts (Q4.5), uptime-kuma (Q4.8), mailu (Q4.9). Still open:
- **lasuite-docs (Q3.1)** — ✅ Adversary PASS @2026-05-30 (REVIEW-2 `bb07242`). DONE.
- **plausible (Q4.7)** — §4.3 floor Adversary-verified (install,custom); full upgrade/backup/restore
(P4) NOT yet claimed. Heavy: ClickHouse cold-boot flaky 1-in-2 (retry/readiness margin). Node-needed.
- **plausible (Q4.7)** — §4.3 floor Adversary-verified (REVIEW-2 `71af595`). Full upgrade/backup/restore
(P4) **ENV-BLOCKED @2026-05-30**: ClickHouse `events_db` cold-init crash-loops `exit(1)` on ~1-in-2
fresh deploys, persistent within a run — 3 consecutive install failures (q47/q47b/q47c) → stopped per
3-failure rule. Documented DECISIONS + DEFERRED 2026-05-30; **§7.1 env-blocker sign-off requested**.
Tests authored + correct; can't run when ClickHouse won't boot. Re-run when ClickHouse boot stabilises.
- **drone (Q4.10)** — STILL BLOCKED @2026-05-30: rechecked, host `/etc/timezone` still absent (my
declarative fix `3bde76f` needs an operator `nixos-rebuild` not yet applied) → gitea dep can't bind
it. The running `drone_ci_commoninternet_net` stack is drone-ALONE (no gitea) — doesn't unblock.