decisions(2): plausible Q4.7 full upgrade+P4 ENV-BLOCKED by ClickHouse cold-init crash flake (3-failure rule) — §4.3 floor verified, full tiers deferred pending env stabilization, §7.1 sign-off requested
@@ -961,3 +961,29 @@ Fix: strip the `+...` working-tree-state marker before the commit match (`chaos.
 HC1 is preserved — the underlying commit must still equal head_ref; a stale prev-checkout chaos
 redeploy stamps prev's commit (also `+U` if overlaid) and still won't match. General: every future
 cc-ci overlay recipe (untracked overlay + CHAOS_BASE_DEPLOY) would otherwise hit this.
+
+## 2026-05-30 — plausible Q4.7 full lifecycle env-blocked by ClickHouse cold-init crash flake (3-failure rule)
+
+**Decision:** Q4.7 plausible stays at its **§4.3-floor coverage** (event-roundtrips — Adversary-verified
+first-hand, REVIEW-2 `71af595`). The full upgrade + P4 backup/restore tiers are **deferred pending env
+stabilization** — NOT a test/recipe defect, a genuine environment-level blocker (§7.1 exception),
+requesting Adversary sign-off.
+
+**Evidence (3 consecutive install failures, 3-identical-failure rule → stop):**
+- `ccci-plausible-q47`: install FAIL — app `/api/health` 404 (ClickHouse `events_db` never converged).
+- `ccci-plausible-q47b`: install FAIL — `events_db` (ClickHouse) crash-loop `exit(1)`, swarm restarting
+every ~10-30s, persistent for the whole deploy; postgres `db` + `app` were 1/1.
+- `ccci-plausible-q47c`: install FAIL — same `events_db` `exit(1)` crash-loop.
+
+**Characterisation:** the ClickHouse cold-init crash is **per-deploy** (a fresh deploy gets EITHER a clean
+ClickHouse OR a persistently-crashing one — ~1-in-2 per the Adversary's own observation; clustered to 3/3
+here), and **persistent within a run** (swarm restarts don't recover it — a corrupted/raced first-init
+that re-crashes on every restart of that volume). ClickHouse logs to FILES (`/var/log/clickhouse-server/`),
+not stdout, so `docker logs` is empty and the crashed container can't be exec'd → the err log is
+inaccessible post-crash. Cause is most likely a ClickHouse first-boot init race on the single cc-ci node.
+The §4.3 functional tests + P4 overlays are authored and correct (`tests/plausible/`) — they simply can't
+run when ClickHouse fails to boot. NOT weakening anything.
+
+**Re-entry:** when the ClickHouse boot is stabilised (e.g. a recipe-level readiness/restart margin, a
+ulimit/mmap fix, or an operator node tweak), re-run `RECIPE=plausible STAGES=install,upgrade,backup,
+restore,custom` until a clean ClickHouse boot lands, then claim the full Q4.7 gate. Filed in DEFERRED.md.
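
As a rough check on the characterisation above: if each fresh deploy hit the crash independently at the quoted ~1-in-2 rate, three failures in a row would not be surprising. The sketch below works that arithmetic. Independence between deploys is an assumption made only for illustration (the entry itself describes the failures as clustered), and the function names are hypothetical, not part of cc-ci.

```python
from fractions import Fraction


def consecutive_failure_prob(p_fail, runs):
    """Probability that `runs` deploys in a row all fail (assumes independent deploys)."""
    return p_fail ** runs


def clean_boot_within(p_fail, attempts):
    """Probability of at least one clean boot within `attempts` fresh deploys."""
    return 1 - p_fail ** attempts


def expected_attempts_until_clean(p_fail):
    """Mean number of fresh deploys until the first clean boot (geometric distribution)."""
    return 1 / (1 - p_fail)


# ~1-in-2 failure rate, as quoted in the entry above.
p = Fraction(1, 2)

print(consecutive_failure_prob(p, 3))    # 1/8
print(clean_boot_within(p, 5))           # 31/32
print(expected_attempts_until_clean(p))  # 2
```

Under these assumptions a 3/3 failure streak has a one-in-eight chance, so the observed streak is compatible with the ~1-in-2 estimate but does not confirm it; a higher true failure rate would produce the same evidence.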
@@ -320,3 +320,18 @@ before the build is called done) — but does **not** force closure.
 so /etc/timezone exists). Then the Builder executes the scoped gitea+drone integration (JOURNAL f86a58a).
 - **Re-entry trigger:** host /etc/timezone deployed (verify `ssh cc-ci 'cat /etc/timezone'` = UTC).
 - **Linked IDEA / BACKLOG:** Q4.10; JOURNAL-2 f86a58a; commit 3bde76f.
+
+### 2026-05-30 — plausible Q4.7 full upgrade+P4 tiers (ClickHouse cold-init crash flake)
+- [ ] **What:** Run plausible's FULL lifecycle (`install,upgrade,backup,restore,custom`) to a clean
+green and claim the full Q4.7 gate. Suite is authored (`tests/plausible/` ops + test_backup/
+restore/upgrade + functional event-roundtrips). §4.3 floor already Adversary-verified (REVIEW-2
+`71af595`); only the full upgrade + P4 tiers remain.
+- **Filed by:** Builder, phase 2 (Q4.7)
+- **Reason for deferral:** ENVIRONMENT-level blocker (§7.1 exception). ClickHouse `events_db` cold-init
+crash-loops `exit(1)` on ~1-in-2 fresh deploys and persists within a run (3 consecutive install
+failures q47/q47b/q47c → 3-failure rule). Logs-to-file so no stdout diagnostics. NOT a test/recipe
+defect — the tests can't run when ClickHouse won't boot. See DECISIONS 2026-05-30.
+- **Re-entry trigger:** ClickHouse boot stabilised (recipe readiness/restart margin, ulimit/mmap fix, or
+operator node tweak) → re-run until a clean ClickHouse boot, then claim. **Needs Adversary §7.1 sign-off**
+that the §4.3-floor coverage + documented env-blocker is acceptable for this gate meanwhile.
+- **Linked:** REVIEW-2 `71af595` (§4.3 floor PASS); DECISIONS 2026-05-30 (ClickHouse crash flake).
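
The deferral above combines two rules: stop after three consecutive install failures, and on re-entry re-run until a clean boot lands. A minimal sketch of such a driver loop follows. It is an illustration only, not the cc-ci harness: `run_until_clean`, the `attempt` callable, and the failure-signature strings are hypothetical, and treating "identical" as "equal signature string" is an assumption.

```python
from typing import Callable, List, Optional, Tuple


def run_until_clean(
    attempt: Callable[[int], Optional[str]],
    max_identical: int = 3,
    max_attempts: int = 10,
) -> Tuple[bool, List[str]]:
    """Re-run `attempt` until it succeeds or the same failure repeats too often.

    `attempt(n)` returns None on a clean run, or a short failure signature
    (a hypothetical string such as "events_db exit(1)") on failure.
    Returns (succeeded, list of failure signatures seen).
    """
    failures: List[str] = []
    streak = 0
    last: Optional[str] = None
    for n in range(1, max_attempts + 1):
        signature = attempt(n)
        if signature is None:
            return True, failures
        failures.append(signature)
        # Count consecutive identical failures; a different failure restarts the count.
        streak = streak + 1 if signature == last else 1
        last = signature
        if streak >= max_identical:
            # Same failure max_identical times in a row: stop and escalate.
            return False, failures
    return False, failures


# Simulated deploys: every attempt hits the same crash-loop, so the loop stops at 3.
ok, seen = run_until_clean(lambda n: "events_db exit(1)")
print(ok, len(seen))  # False 3
```

In a real harness `attempt` would wrap whatever command runs the install stage and derive the signature from its outcome; both are left abstract here because the source does not show them.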
@@ -53,8 +53,11 @@ tree must carry:
 (Q3.2), lasuite-meet (Q3.3), immich (Q3.5), matrix-synapse (Q4.1), mumble (Q4.2), bluesky-pds (Q4.3),
 **ghost (Q4.4 ✅)**, mattermost-lts (Q4.5), uptime-kuma (Q4.8), mailu (Q4.9). Still open:
 - **lasuite-docs (Q3.1)** — ✅ Adversary PASS @2026-05-30 (REVIEW-2 `bb07242`). DONE.
-- **plausible (Q4.7)** — §4.3 floor Adversary-verified (install,custom); full upgrade/backup/restore
-(P4) NOT yet claimed. Heavy: ClickHouse cold-boot flaky 1-in-2 (retry/readiness margin). Node-needed.
+- **plausible (Q4.7)** — §4.3 floor Adversary-verified (REVIEW-2 `71af595`). Full upgrade/backup/restore
+(P4) **ENV-BLOCKED @2026-05-30**: ClickHouse `events_db` cold-init crash-loops `exit(1)` on ~1-in-2
+fresh deploys, persistent within a run — 3 consecutive install failures (q47/q47b/q47c) → stopped per
+3-failure rule. Documented DECISIONS + DEFERRED 2026-05-30; **§7.1 env-blocker sign-off requested**.
+Tests authored + correct; can't run when ClickHouse won't boot. Re-run when ClickHouse boot stabilises.
 - **drone (Q4.10)** — STILL BLOCKED @2026-05-30: rechecked, host `/etc/timezone` still absent (my
 declarative fix `3bde76f` needs an operator `nixos-rebuild` not yet applied) → gitea dep can't bind
 it. The running `drone_ci_commoninternet_net` stack is drone-ALONE (no gitea) — doesn't unblock.
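
The "readiness margin" mentioned for plausible could take the form of a bounded wait on the app's health endpoint before the install stage is judged. The sketch below is one possible shape, under stated assumptions: `wait_until_ready` and `http_health_probe` are hypothetical helpers, the base URL is a placeholder, and only the `/api/health` path comes from the notes above. A longer wait would give a slow but healthy boot more time; it would not by itself recover a deploy stuck in the persistent crash-loop described above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_probe(base_url: str, path: str = "/api/health") -> Callable[[], bool]:
    """Build a probe that returns True only when the endpoint answers HTTP 200."""
    def probe() -> bool:
        try:
            with urllib.request.urlopen(base_url + path, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            # Covers HTTP errors such as 404, refused connections, and timeouts.
            return False
    return probe


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses.

    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A call might look like `wait_until_ready(http_health_probe("https://plausible.example.test"), timeout_s=600)`, where the hostname is a stand-in for wherever the deploy is reachable.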