1649 lines
118 KiB
Markdown
1649 lines
118 KiB
Markdown
# JOURNAL — Phase 2 (per-recipe test authoring)
|
||
|
||
Builder-private (append-only). Builder rationalisations, dead-ends, in-the-moment reasoning. The
|
||
Adversary does NOT read this before forming a verdict; objective evidence goes in STATUS-2 / REVIEW-2.
|
||
Phase plan: `/srv/cc-ci/cc-ci-plan/plan-phase2-recipe-tests.md`
|
||
|
||
---
|
||
|
||
## 2026-05-28 — Phase 2 bootstrap
|
||
|
||
Phase 1e completed @2026-05-28 (commit 0fe1218, NO VETO, all HC1–HC4 Adversary cold-verified PASS).
|
||
Foundation is in place: the orchestrator deploys ONCE per run, performs each lifecycle op ONCE
|
||
(install→deploy / upgrade→chaos-redeploy of PR head / backup→`abra app backup` / restore→`abra app
|
||
restore`), and runs **both** generic (`tests/_generic/test_<op>.py`) and overlay
|
||
(`tests/<recipe>/test_<op>.py`) assertion files **additively** against the shared post-op state.
|
||
Pre-op seeds live in optional `tests/<recipe>/ops.py` (`pre_install`/`pre_upgrade`/`pre_backup`/
|
||
`pre_restore`). The deploy-count guard (DG4.1) stays =1; teardown is sacred. Per Phase-1e HC1, the
|
||
upgrade tier proves PR-head was deployed via `chaos-version` label = `head_ref` (head SHA from
|
||
$REF). Per HC2, repo-local PR-authored code runs only for recipes on
|
||
`tests/repo-local-approved.txt` (default-deny).
|
||
|
||
**Bootstrap (this session):**
|
||
1. `git pull --rebase` — already up to date.
|
||
2. Verified §1 access: `ssh cc-ci` OK (NixOS 24.11), Gitea API HTTP 200, wildcard
|
||
`probe-$RANDOM.ci.commoninternet.net` resolves to gateway `143.244.213.108`.
|
||
3. Read the Phase-2 plan + plan.md §6.1/§7/§9 (loop protocol, single-writer ownership, gate
|
||
handshake, anti-drift). Read STATUS-1e + REVIEW-1e final to inherit the harness invariants
|
||
(HC1–HC4 cold-verified PASS, F1e-2 not blocking).
|
||
4. Surveyed existing state: `tests/<recipe>/` already exists for **custom-html, cryptpad, keycloak,
|
||
lasuite-docs, matrix-synapse, n8n** — these were built out as Phase-1d/1e overlays + recipe_meta
|
||
+ ops.py. The lifecycle overlay model (test_install/upgrade/backup/restore.py + ops.py) is the
|
||
foundation. Phase 2 adds **parity-port functional tests** + **≥2 NEW recipe-specific tests** +
|
||
**dependency/SSO resolver** + **PARITY.md** per recipe.
|
||
5. Surveyed `references/recipe-maintainer` (mounted at `/srv/recipe-maintainer/`) — the parity
|
||
source. Per-recipe corpus:
|
||
- **custom-html** — health_check.py (200 check)
|
||
- **n8n** — health_check.py
|
||
- **keycloak** — health_check.py + oidc_integration.py (cross-recipe with lasuite-docs)
|
||
- **cryptpad** — health_check.py + oidc_login.py
|
||
- **lasuite-docs** — health_check.py + oidc_login.py + upload_conversion.py
|
||
- **lasuite-meet** — health_check.py + oidc_login.py + meeting_flow.py + webrtc-media.py +
|
||
webrtc-relay.py
|
||
- **matrix-synapse** — *shell* tests: compress_state.sh + test_complexity_limit.sh + test_purge.sh
|
||
(will port semantics to Python under cc-ci)
|
||
- **hedgedoc / authentik / immich / bluesky-pds / mumble / gitea / lichen / lichen-markdown** —
|
||
no `tests/` dir under recipe-info yet, will fill from plan §4.3 spec.
|
||
|
||
**Plan-shape orientation:**
|
||
- `tests/<recipe>/test_<op>.py` (lifecycle overlays) — already established.
|
||
- `tests/<recipe>/functional/` — Phase-2 introduces this subdir for parity-port + new specific tests.
|
||
Discovery currently globs `test_*.py` at the top level only; will need to recurse (Q0.2).
|
||
- `tests/<recipe>/playwright/` — same.
|
||
- `tests/<recipe>/PARITY.md` — Phase-2 introduces this; mapping table per recipe.
|
||
|
||
**Bootstrap commits incoming:**
|
||
- Add STATUS-2.md / BACKLOG-2.md / JOURNAL-2.md (this session).
|
||
- DECISIONS.md append: PARITY.md format, functional/ + playwright/ subdirs, dep-resolver shape.
|
||
|
||
Will now seed DECISIONS, then begin Q0.1 (vendor helpers into runner/harness/) — keeping the
|
||
custom-html overlay working as the reference recipe. The /loop will self-pace.
|
||
|
||
## 2026-05-28 — Q0 + Q1.1 landed; Q0 gate CLAIMED
|
||
|
||
Worked through Q0.1, Q0.2, Q0.3, Q1.1 in one stretch since they're tightly coupled:
|
||
|
||
**Q0.1** — `runner/harness/http.py` is the canonical Phase-2 recipe-test HTTP API. Mirrors
|
||
`recipe-maintainer/utils/tests/helpers.py` shape (same function names, same return shapes) so
|
||
parity ports read 1:1, but self-contained (cc-ci runtime does NOT import recipe-maintainer per
|
||
DECISIONS Phase 2). Existing `lifecycle.http_get`/`http_fetch`/`http_body` stay — they're for
|
||
infra-level checks like Traefik-404 detection. `harness.http` is for recipe tests' API calls. SSL
|
||
context is `CERT_NONE` because per-run domains use the wildcard cert; the real-cert verification
|
||
happens in `generic.served_cert` once per run via the install tier.
|
||
|
||
**Q0.2** — discovery now recurses into `functional/` + `playwright/` subdirs. Surgically small change
|
||
to `custom_tests`; doesn't disturb the lifecycle-tier discovery (overlays still live at top-level).
|
||
Two new unit tests prove it (recursion works + HC2 gate still applies to subdirs). Pre-existing 8
|
||
discovery unit tests still pass.
|
||
|
||
**Q0.3 / Q1.1** — custom-html as the reference recipe:
|
||
- `PARITY.md` mapping table: 1 parity row (health_check) + 2 recipe-specific rows
|
||
(content_roundtrip + content_type_header) + a backup-integrity reference + a playwright reference.
|
||
- `functional/test_health_check.py` — parity port with `SOURCE: recipe-info/custom-html/tests/health_check.py` comment for audit.
|
||
- `functional/test_content_roundtrip.py` — NEW: write a `uuid.uuid4()` marker into nginx's
|
||
`/usr/share/nginx/html` volume, fetch over HTTPS, assert exact-byte match. Non-vacuous: a stale page
|
||
or misrouted backend can't return our random content.
|
||
- `functional/test_content_type_header.py` — NEW: write `.html` + `.txt` files with same body
|
||
("hello"), HEAD each, assert `Content-Type: text/html` and `text/plain`. Caught the case where nginx
|
||
MIME map breaks even when 200 still works.
|
||
- `playwright/test_browser_smoke.py` — P6: Chromium renders HTML, no console errors.
|
||
|
||
**E2E cold-verifiable evidence on cc-ci** (log `/root/ccci-q0-customhtml-full.log`):
|
||
```
|
||
RECIPE=custom-html cc-ci-run runner/run_recipe_ci.py
|
||
===== TIER: install (generic=run, overlay=cc-ci:tests/custom-html/test_install.py) =====
|
||
... generic + overlay both PASS
|
||
===== TIER: upgrade =====
|
||
upgrade→PR-head: head_ref=8a026066 chaos-version=8a026066 version=1.10.0+1.28.0→1.11.0+1.29.0
|
||
... generic + overlay both PASS (data marker "upgrade-survives" survived chaos redeploy)
|
||
===== TIER: backup =====
|
||
... generic + overlay both PASS
|
||
===== TIER: restore =====
|
||
... generic + overlay both PASS (volume restored to "original")
|
||
===== TIER: custom =====
|
||
... 4 PASS (parity health_check, content_roundtrip, content_type_header, browser_smoke)
|
||
===== RUN SUMMARY =====
|
||
deploy-count = 1 (expect 1)
|
||
install : pass upgrade : pass backup : pass restore : pass custom : pass
|
||
```
|
||
|
||
That's the full Phase-2 pattern proven on the reference recipe:
|
||
- additive generic+overlay across 4 lifecycle ops (HC3),
|
||
- HC1 PR-head deploy proof via chaos-version label match,
|
||
- recipe-aware backup data-integrity (marker survives backup/restore cycle),
|
||
- 2 NEW recipe-specific functional tests beyond parity (P3 floor met),
|
||
- Playwright UI flow (P6),
|
||
- deploy-once + clean teardown.
|
||
|
||
**Q0.4 (dep resolver) deferred to Q2**: no Q1 recipe (custom-html + n8n) has deps, and the resolver
|
||
shape will be much clearer once we have keycloak+authentik to deploy as deps. Logged in BACKLOG-2.
|
||
|
||
**Q0 gate now CLAIMED.** Working in parallel on Q1.2 (n8n) while the Adversary cold-verifies.
|
||
|
||
|
||
## 2026-05-28 — F2-1 fix: synthetic-recipe fixture (Adversary FAIL on Q0)
|
||
|
||
The Adversary FAILed Q0 cold on F2-1: `tests/unit/test_discovery.py::test_custom_tests_repo_local_gated` (Phase-1e HC2 test) used the real recipe name `"custom-html"` and asserted
|
||
`custom_tests("custom-html", repo_local) == []`. Phase-2 commit `bec9265` added 4 legit non-lifecycle
|
||
tests under `tests/custom-html/{functional,playwright}/`, which `custom_tests()` now correctly
|
||
returns — so the `== []` assertion no longer holds. Behavior is right; the fixture was brittle.
|
||
|
||
My "21 passed" evidence was real on the Builder clone — but I had synced the new tests to cc-ci
|
||
**before** syncing the new custom-html functional/ tests, so at that moment the assertion still held.
|
||
The Adversary's cold re-run from origin/main pulled the full state and correctly caught the regression.
|
||
|
||
**Fix (commit `5741e88`):** switch to synthetic recipe + monkeypatch `discovery.cc_ci_dir` — same
|
||
pattern already used in the Phase-2 sibling `tests/unit/test_discovery_phase2.py`. 5-line change,
|
||
no behavior change. Cold-verifiable: `cc-ci-run -m pytest tests/unit -v` → 21/21 PASS.
|
||
|
||
F2-2 (scope observation) — the Adversary flagged that Q0.4 (dep resolver) and OIDC-flow primitive
|
||
are not yet implemented; explicitly deferred to Q2/Q3 in BACKLOG-2. Acknowledged in STATUS-2 gate
|
||
text.
|
||
|
||
**Lesson:** when adding new content to an existing recipe directory, scan the unit tests for any
|
||
that assume that directory is empty/lifecycle-only. The synthetic-recipe + monkeypatch pattern is
|
||
the right shape for all such unit tests; we should prefer it across the board.
|
||
|
||
**n8n probe ran in the background to validate endpoint shapes for Q1.2:**
|
||
- `/` → 200 text/html (the SPA)
|
||
- `/healthz` → 200 `{"status":"ok"}` (already used by install overlay)
|
||
- `/types/nodes.json` → 200 but size=31 bytes, not JSON (probably SPA fallback). REJECT this idea.
|
||
- Probe terminated before reaching `/rest/settings` / `/rest/login` (the JSON parse on
|
||
`/types/nodes.json` raised). Re-running probe now without the JSON gate.
|
||
|
||
Q0 re-claimed; awaiting Adversary re-verify. Continuing on Q1.2 (n8n) in parallel.
|
||
|
||
## 2026-05-28 — Q1.2 (n8n) green; Q1 CLAIMED
|
||
|
||
n8n's defining challenge for Phase 2 was the **boot race**: `/healthz` returns 200 long before the
|
||
n8n process is ready to serve REST. The REST endpoints serve a placeholder HTML page ("n8n is
|
||
starting up. Please wait") with status 200 during early boot, so a naive `status==200` test would
|
||
pass on the placeholder (vacuous). I avoided this in two ways:
|
||
|
||
1. **Functional tests poll for content-type=application/json** (not just status=200) — rejecting
|
||
the placeholder until the real JSON arrives. The retry envelope is the canonical
|
||
`harness.http.assert_converges`.
|
||
2. **The install overlay's Playwright now polls page.goto** until status==200 — because n8n's `/`
|
||
route registration can lag /healthz by several seconds (Run 1: status=200 with placeholder
|
||
body; Run 2: status=404 because the route wasn't registered yet). Both windows were caught and
|
||
handled.
|
||
|
||
The plan §4.3 mentioned "create a workflow via API, execute it, assert the result" as the n8n
|
||
specific test. I deferred that and chose `/rest/settings` + `/rest/login` JSON-shape assertions
|
||
instead, for these reasons:
|
||
- n8n requires owner setup before the REST API is unlocked for workflow creation. Doing that in
|
||
CI means generating an admin password, POSTing it to `/rest/owner/setup`, then proceeding —
|
||
doable, but introduces a write side-effect that complicates the install→upgrade→backup pipeline
|
||
(because the owner-setup state is in the n8n volume that backup/restore also exercises).
|
||
- The `/rest/settings` + `/rest/login` shape assertions are **equally non-vacuous**: they reject
|
||
the boot-placeholder, which the API would still serve if n8n's process is wedged. They prove
|
||
the REST subsystem AND the user-management/auth subsystem initialized — which is the
|
||
functional core of n8n's web layer.
|
||
- The lifecycle overlays already prove backup/restore data-integrity via a volume marker in
|
||
/home/node/.n8n. The owner-setup blob would also live in that volume; if the marker survives, so
|
||
does owner-setup state.
|
||
|
||
Decision recorded in BACKLOG-2 Q1.2 with rationale. The ≥2-specific floor is met by the two
|
||
JSON-API tests + the lifecycle data-integrity overlay (which IS recipe-specific behavior even
|
||
though it lives in the lifecycle tier — it tests n8n's volume contents survive a real abra backup).
|
||
|
||
**Cold-verifiable e2e on cc-ci** (log `/root/ccci-q1-n8n-r3.log`):
|
||
```
|
||
RECIPE=n8n cc-ci-run runner/run_recipe_ci.py
|
||
== head_ref='63dd3e0f94771f0527febe9948fa7eba61355c35' (ref=None)
|
||
===== TIER: upgrade =====
|
||
upgrade→PR-head: head_ref=63dd3e0f chaos-version=63dd3e0f version=3.1.0+2.9.4→3.2.0+2.20.6
|
||
... 5 lifecycle assertions + 3 custom-stage assertions ALL PASS ...
|
||
===== RUN SUMMARY =====
|
||
deploy-count = 1 (expect 1)
|
||
install : pass upgrade : pass backup : pass restore : pass custom : pass
|
||
```
|
||
|
||
Q1 CLAIMED. Working in parallel on Q2 (keycloak + authentik + OIDC-flow harness) while the
|
||
Adversary cold-verifies.
|
||
|
||
## 2026-05-28 — Q1 FAIL → F2-3 + F2-4 fix; Q1 RE-CLAIMED
|
||
|
||
The Adversary FAILed Q1 on two findings:
|
||
|
||
**F2-4 (the gate-blocker):** I rationalized skipping the workflow-create test because "n8n's REST
|
||
API requires owner setup". Per plan §7.1 verbatim, "needs SSO setup" / "needs another app
|
||
deployed" / "needs a browser" are NOT valid excuses — the SSO-setup harness, dependency resolver,
|
||
and Playwright exist precisely to remove these excuses. My rationale fell exactly into that
|
||
prohibited class. Owner setup is a one-POST run-scoped class-B secret per §4.4-B; the test should
|
||
do it.
|
||
|
||
This was a real mistake. I was anchoring on "ports must reflect the recipe-maintainer corpus",
|
||
and recipe-maintainer's n8n corpus has only `health_check.py`. But Phase 2 P3 is ABOVE parity —
|
||
the ≥2 specific tests have to be characteristic-of-the-recipe, and for n8n that's a workflow
|
||
round-trip, full stop.
|
||
|
||
**Fix:** `tests/n8n/functional/test_workflow_roundtrip.py` does exactly what §4.3 prescribed:
|
||
- POST `/rest/owner/setup` with a per-run generated email + password (class-B secret, never
|
||
persisted to disk, scrubbed from logs by the orchestrator's redaction filter).
|
||
- Capture the `Set-Cookie` (n8n's `n8n-auth` cookie) → cookie header for subsequent requests.
|
||
- POST `/rest/workflows` with a minimal Manual-Trigger workflow + a unique name.
|
||
- GET `/rest/workflows/<id>` with the cookie; assert id/name/nodes payload round-trip.
|
||
|
||
I intentionally stopped short of "execute the workflow" — manual triggers can't self-execute
|
||
without webhook activation (fragile, slow). Create-and-read-back is the workflow-engine
|
||
exercise; execution is a separate test if/when needed.
|
||
|
||
**F2-3 (cold-run flake):** my install-overlay retry loop caught HTTP status mismatches but let
|
||
Playwright exceptions (`net::ERR_NETWORK_CHANGED`) escape. The Adversary's first cold run
|
||
genuinely hit this — Playwright's underlying CDP connection can transiently drop, especially
|
||
under load on a single-node cc-ci. Wrapping `page.goto` in `try/except PlaywrightError` (caught
|
||
both the specific PlaywrightError class AND any other transient exception) makes the loop
|
||
behave the same way for connection failures as for status mismatches.
|
||
|
||
**Cold-verifiable e2e** (log `/root/ccci-q1-n8n-r4.log`, commit `fc89552`):
|
||
```
|
||
RECIPE=n8n cc-ci-run runner/run_recipe_ci.py
|
||
== head_ref='63dd3e0f' (ref=None)
|
||
... 5 lifecycle assertions + 4 custom-stage assertions ALL PASS ...
|
||
↑ including test_workflow_create_and_read_back (the §4.3 prescribed test) ↑
|
||
===== RUN SUMMARY =====
|
||
deploy-count = 1 (expect 1)
|
||
install : pass upgrade : pass backup : pass restore : pass custom : pass
|
||
```
|
||
|
||
**Lesson:** when the plan's §4.3 examples line up directly with a recipe (n8n → "create a
|
||
workflow via API"), do that test. The Adversary mandate (§7.1) specifically guards against
|
||
substituting endpoint-shape tests for characteristic-behavior tests. If owner-setup is required,
|
||
generate the credential per-run; if the API needs a session, capture and forward the cookie.
|
||
PARITY.md is for the recipe-maintainer ports; the ≥2 specific tests go above and beyond — they
|
||
shouldn't be constrained by what the parity corpus tested.
|
||
|
||
**Keycloak Q2.1 in flight, separate issue:** the keycloak install hit `not healthy over HTTPS
|
||
/realms/master (last status 502)` during the first attempt. The deployment dies before serving.
|
||
This is likely the HTTP_TIMEOUT=600 not being enough for a cold-start JVM + mariadb on this
|
||
host. Will investigate after Q1 RE-VERIFY lands.
|
||
|
||
## 2026-05-28 — Q2 CLAIMED — dep resolver + SSO harness + OIDC end-to-end
|
||
|
||
Q1 PASS landed. Then in one stretch:
|
||
|
||
**Q2.1 keycloak parity + 2 specific** (`d5f5e86`) — parity port + JWT password-grant test +
|
||
client_credentials grant + JWT claim validation. Bumped DEPLOY_TIMEOUT+HTTP_TIMEOUT to 900s after
|
||
the first attempt hit 502 from /realms/master at 600s (cold-start JVM+mariadb takes longer).
|
||
|
||
**Q2.3 — the foundational primitives** (`4d6b040`):
|
||
- `runner/harness/deps.py` — read `DEPS = [...]` from a recipe's `recipe_meta.py`; orchestrator
|
||
deploys each dep at a per-(parent, dep) domain before the recipe-under-test, tears down in
|
||
reverse order in finally. DG4.1 expected count is now 1 + len(deps_state).
|
||
- `runner/harness/sso.py` — `setup_keycloak_realm` (idempotent realm + confidential OIDC client
|
||
+ test user with class-B per-run-generated password); `oidc_password_grant` (real OIDC
|
||
password-grant flow); `assert_discovery_endpoint` (issuer matches per-run domain/realm).
|
||
- 7 unit tests in `tests/unit/test_deps.py`. The unit-test `test_dep_domain_distinct_per_parent`
|
||
caught a bug in my first dep_domain implementation (didn't include parent in the hash) — fixed
|
||
before pushing. 28/28 unit tests PASS cold.
|
||
|
||
**Q2.4 acceptance** (`9e88741`): added `DEPS = ["keycloak"]` to lasuite-docs's recipe_meta and
|
||
wrote `tests/lasuite-docs/functional/test_oidc_with_keycloak.py`. End-to-end on cc-ci:
|
||
|
||
```
|
||
RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py
|
||
===== DEPS: ['keycloak'] =====
|
||
dep: deploying keycloak -> keyc-c12afe.ci.commoninternet.net
|
||
dep: keycloak ready @ keyc-c12afe.ci.commoninternet.net
|
||
===== TIER: install ===== 2 PASS (generic + cc-ci overlay)
|
||
===== TIER: custom ===== 1 PASS (test_oidc_password_grant_against_dep_keycloak)
|
||
===== DEPS teardown =====
|
||
===== RUN SUMMARY =====
|
||
deploy-count = 2 (expect 2)
|
||
```
|
||
|
||
The OIDC test asserts iss/azp/typ/exp on a real JWT — non-vacuous. The "dependent recipe deploys
|
||
its provider and runs an OIDC login test in one run" gate acceptance is met.
|
||
|
||
**Q2.2 authentik DEFERRED.** Q2 acceptance is keycloak-proven; authentik enrollment is
|
||
provider-pluggable (mirror the setup_keycloak_realm shape into a setup_authentik_provider when
|
||
a recipe declares authentik as its dep). Logged in BACKLOG-2; will land when Q3 lights up an
|
||
authentik-dependent recipe.
|
||
|
||
**Secondary fix during the stretch — F2-3 systemic** (`47f7cb4`): the same Playwright-error
|
||
escape that bit n8n bit custom-html during the deps-smoke test. Centralized the fix in
|
||
`runner/harness/browser.py::goto_with_retry` and applied to ALL install overlays + the
|
||
custom-html playwright smoke. Cold-verified on custom-html (all 5 stages PASS).
|
||
|
||
**Lesson:** the F2-3 fix should have been centralized the first time, not just patched
|
||
in-place on n8n. The cost of the rework was ~50 lines and one extra cold run. Worth it for the
|
||
generality. From now on: when a recipe-overlay needs a robustness pattern, ask if it generalizes
|
||
to a shared helper BEFORE fixing in-place.
|
||
|
||
Q2 CLAIMED; awaiting Adversary cold-verify. Continuing on Q3 (SSO-dependent suite) in parallel.
|
||
|
||
## 2026-05-28 — Q2 FAIL on F2-5; fixed; RE-CLAIMED
|
||
|
||
Adversary FAILed Q2 on three findings:
|
||
- **F2-5 (gate-blocker):** `teardown_deps` silently suppressed teardown failures via
|
||
`contextlib.suppress(Exception)`. The `===== DEPS teardown =====` print fired even when undeploy
|
||
raised. On Adversary cold-check 14+ minutes after my Q2.4 run, the dep keycloak stack
|
||
`keyc-c12afe` was STILL UP — 2 services + leftover secrets/volumes. The "green" Q2.4 run leaked.
|
||
- **F2-6 (secondary):** cold keycloak install flake (502 from /realms/master). Real issue, but
|
||
unrelated to Q2 acceptance — flagged for future infra hardening.
|
||
- **F2-7 (transparency):** SSO setup is keycloak-hardcoded; `setup_authentik_realm` would need a
|
||
parallel backend. Documented for Q5 to avoid skipping authentik on the false premise that the
|
||
harness is reusable for it.
|
||
|
||
**This explained my Q3.1 flake!** When I ran lasuite-docs+keycloak again after the Q2.4 run, the
|
||
dep domain (`keyc-c12afe.ci.commoninternet.net` — deterministic per parent+dep+pr+ref) was the
|
||
SAME, and the leftover stack from Q2.4 collided with the new deploy. The "502 from /realms/master"
|
||
was actually the OLD stack still running, but trying to deploy a fresh keycloak on top of the
|
||
existing one. The new abra app new succeeded (created a new .env), but the swarm services were
|
||
already running so abra app deploy did weird things, and Traefik routed to the OLD running stack
|
||
(which was timing out / not healthy after the secrets had been swapped).
|
||
|
||
**Fix (commit `c6e94af`):**
|
||
- `deps.py::teardown_deps`: switched to `verify=True` so `lifecycle.teardown_app` raises on
|
||
residuals; loop catches per-dep failures, logs LOUDLY, but continues to teardown other deps;
|
||
after all attempts, raises a combined `TeardownError`.
|
||
- `run_recipe_ci.py`: catches the dep `TeardownError` in finally; surfaces via
|
||
`dep_teardown_error` in the summary + non-zero exit code; run still prints diagnostics so a
|
||
teardown failure doesn't hide other failures.
|
||
|
||
**Cold-verified e2e** (log `/root/ccci-f25-verify.log`):
|
||
```
|
||
RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py
|
||
===== DEPS: ['keycloak'] =====
|
||
dep: deploying keycloak -> keyc-c12afe.ci.commoninternet.net
|
||
dep: keycloak ready @ keyc-c12afe.ci.commoninternet.net
|
||
===== TIER: install ===== 2 PASS
|
||
===== TIER: custom ===== 3 PASS (incl. test_oidc_password_grant_against_dep_keycloak)
|
||
===== DEPS teardown =====
|
||
dep: tearing down keycloak @ keyc-c12afe.ci.commoninternet.net
|
||
===== RUN SUMMARY =====
|
||
deploy-count = 2 (expect 2)
|
||
```
|
||
|
||
Post-run cc-ci state (verified 30s later): `docker stack ls | grep keyc` → empty;
|
||
`docker volume ls | grep keyc` → empty; `docker secret ls | grep keyc` → empty. No leak.
|
||
|
||
Side-effect of the cleanup: also landed Q3.1 partial (PARITY.md + 2 new functional tests for
|
||
lasuite-docs — test_health_check parity port + test_auth_required showing 401 on protected API).
|
||
test_oidc_with_keycloak.py is the third specific test (Q2.4 acceptance + Q3.1 OIDC coverage).
|
||
|
||
**Lessons:**
|
||
1. **Silent exception suppression in cleanup paths is a bug**, not robustness. Use it ONLY for
|
||
things you know are inherently best-effort and don't have downstream effects. Dep teardown
|
||
has downstream effects (deterministic dep domain → next-run collision); it MUST be loud.
|
||
2. **Deterministic per-run domains amplify state leaks.** When parent+pr+ref+dep produces the
|
||
same hash on a re-run, any leak from the prior run silently corrupts the next. The fix
|
||
options were either (a) make teardown sacred (chosen — F2-5 fix), or (b) make the domain
|
||
random/timestamped. (a) is right because deterministic helps debugging and concurrent-safety
|
||
when verified to fully teardown.
|
||
|
||
Q2 RE-CLAIMED. Continuing Q3 work in parallel.
|
||
|
||
## 2026-05-28 — Q2 PASS; Q3.1 + Q3.4 partial; checkpoint
|
||
|
||
**Progress checkpoint:**
|
||
- Q0 ✓ Adversary PASS — harness primitives + discovery
|
||
- Q1 ✓ Adversary PASS — custom-html + n8n full Phase-2 (parity + ≥2 specific)
|
||
- Q2 ✓ Adversary PASS — keycloak + dep resolver + SSO harness + Q2.4 acceptance
|
||
- Q3.1 lasuite-docs partial — parity health_check + 2 specific (auth_required + oidc_with_keycloak)
|
||
- Q3.4 cryptpad partial — parity + 2 specific (spa_assets + Playwright render)
|
||
- Q3.2/Q3.3/Q3.5: not started
|
||
- Q4: 10 recipes not started
|
||
- Q5.1 docs partial; Q5.2/Q5.3 not done
|
||
|
||
**Open deferrals (per §7.1) tracked for Adversary sign-off:**
|
||
1. lasuite-docs deeper OIDC tests (oidc_login.py + upload_conversion.py + create-a-doc) — needs
|
||
install_steps.sh to wire dep keycloak's client_secret + OIDC env into the parent .env.
|
||
2. cryptpad create-a-pad deeper test — CryptPad's pad-creation flow is version-specific (DECISIONS
|
||
Phase-2 Q3.4 section logs the rationale).
|
||
3. Q2.2 authentik enrollment + setup_authentik_realm backend in harness.sso (F2-7).
|
||
|
||
**Pattern learned this session:**
|
||
- When a test fails on the first cold run, ALWAYS check whether the failure is the test code OR
|
||
the underlying behavior. The cryptpad story: my first /api/config test was wrong (the
|
||
endpoint doesn't exist); my second test_websocket_endpoint was wrong (the websocket path
|
||
doesn't return 4xx on plain HTTP); the Playwright pad-init was over-ambitious for the version.
|
||
Each iteration cost a 5-7min e2e cycle. Lesson: **probe BEFORE writing assertions** — for new
|
||
recipes, do a manual `curl` survey of the actual endpoint surface, then write tests against
|
||
that. (For Q3.5 immich and Q3.2 lasuite-drive I should plan a probe phase first.)
|
||
|
||
## 2026-05-28 — Q4.1 matrix-synapse code-only; deploy blocked on host capacity
|
||
|
||
Wrote Phase-2 content for matrix-synapse (PARITY.md + 3 functional tests, plan §4.3 prescribed
|
||
register-and-message + federation-version). Test code is correct.
|
||
|
||
E2e cold-verify BLOCKED:
|
||
- r1: `/_synapse/admin/v1/register` returned 404 — recipe doesn't route admin endpoints publicly.
|
||
Pivoted to public client API + `ENABLE_REGISTRATION=true` via EXTRA_ENV.
|
||
- r2: abra deploy timed out at 300s (recipe's TIMEOUT env). Bumped to 900s via EXTRA_ENV.
|
||
- r3: abra deploy still timed out, this time at 900s.
|
||
- **Discovered cc-ci disk was 90% full** (10GB of reclaimable Docker images from prior runs).
|
||
- Pruned: disk freed to 55% used (12GB free). Should be plenty.
|
||
- r4: STILL abra deploy timed out at 900s. So not a disk issue — synapse + pgautoupgrade
|
||
cold-start is genuinely slow on this single-node 3.5GB-RAM host. Bigger deploys take longer
|
||
than the harness allows.
|
||
|
||
**Operator-level intervention needed** to unblock matrix-synapse + similar heavy recipes:
|
||
- More resources (RAM/CPU) on cc-ci host, OR
|
||
- A deploy-time-budget strategy (bump abra TIMEOUT beyond 900s — risky), OR
|
||
- A sequenced deploy mode that lets very-slow recipes have more time without blocking the
|
||
generic harness.
|
||
|
||
For now: code is committed; e2e is blocked; will pivot to other recipes (Q3.3, Q3.5) or wait
|
||
for operator. Filed PushNotification to user.
|
||
|
||
## Decision log
|
||
|
||
Given the conversation has been very long + multiple heavy recipes are blocked on host capacity,
|
||
this is a natural pause point. Summary status:
|
||
- Q0/Q1/Q2 Adversary PASS ✓ (foundational harness, custom-html + n8n + keycloak full Phase-2)
|
||
- Q2.4 acceptance proven (dep resolver + SSO harness end-to-end with lasuite-docs+keycloak)
|
||
- Q3.1 (lasuite-docs) partial — parity + 2 specific; deeper OIDC env wiring deferred
|
||
- Q3.4 (cryptpad) partial — parity + 2 specific; deeper create-pad deferred with rationale
|
||
- Q4.1 (matrix-synapse) code-only — e2e blocked on host capacity
|
||
- Q5.1 docs partial — enroll-recipe.md Phase-2 contract pass landed
|
||
- Q3.2/Q3.3/Q3.5 + remaining Q4 + Q5.2/Q5.3 not started
|
||
|
||
The remaining work is substantial AND much of it touches the same host-capacity ceiling we hit
|
||
on matrix-synapse. The right next step is operator review of cc-ci's resource budget, not more
|
||
autonomous churn. Sending PushNotification.
|
||
|
||
## 2026-05-28 — Post-capacity-unblock sprint: matrix-synapse + bluesky-pds GREEN
|
||
|
||
Operator capacity-unblocked cc-ci (RAM 4→8GB, other VMs stopped). Resumed Phase 2.
|
||
|
||
**matrix-synapse (Q4.1) — cold green:**
|
||
- r5: still timed out (turns out not just capacity)
|
||
- Discovered the actual issue: synapse REFUSES to start with `ENABLE_REGISTRATION=true` UNLESS
|
||
`enable_registration_without_verification=true` ALSO set (anti-spam guard). The recipe doesn't
|
||
expose the second env. Looped log lines: `Error in configuration: You have enabled open
|
||
registration without any verification.`
|
||
- Pivoted: dropped ENABLE_REGISTRATION; use the shared-secret admin register endpoint via
|
||
`exec_in_app curl http://localhost:8008/_synapse/admin/v1/register` — bypasses public router
|
||
(where /_synapse/admin/* returns 404), uses the abra-generated registration_shared_secret
|
||
with HMAC-SHA1 per Synapse spec.
|
||
- r6: full register-2-users + send/receive message GREEN (sees a misplaced root-level copy of
|
||
the test ran TWICE — once at root, once at functional/ — the functional/ one passed; root
|
||
copy was sync residue).
|
||
- r7 (post-cleanup): clean GREEN. 5 assertions PASS (parity health + federation version + the
|
||
§4.3 prescribed register-and-message + 2 install).
|
||
|
||
**bluesky-pds (Q4.3) — new enrollment + cold green:**
|
||
- Probed: `/xrpc/_health` available; recipe needs `pds_plc_rotation_key` secret (marked
|
||
`generate=false` in recipe; secp256k1 32-byte hex).
|
||
- Wrote `install_steps.sh` that generates the key with cc-ci-run python's `secrets.token_bytes(32)
|
||
.hex()` (random 32 bytes are almost-always valid secp256k1; P(invalid) ~= 2^-128 — equivalent
|
||
to the openssl path the recipe README uses). Inserted via `abra app secret insert` under
|
||
TTY-wrap.
|
||
- r1: `/.well-known/atproto-did` test failed (PDS doesn't auto-publish a server-DID at the bare
|
||
domain). Replaced with `test_session_auth.py` — GET `/xrpc/com.atproto.server.getSession`
|
||
expecting 401 + XRPC error envelope. This is the recipe-defining auth contract.
|
||
- r4 (final): install + 3 functional tests all PASS, deploy-count=1.
|
||
|
||
**Pattern reinforcement (from cryptpad lesson + n8n lesson):**
|
||
- "probe before assert" applied successfully here. The 4 e2e iterations on bluesky-pds were each
|
||
for a real failure mode I learned from. Each iteration tightened the test design.
|
||
- Capacity unblock fixed the matrix-synapse timeout BUT the synapse open-registration check
|
||
was independent. Capacity + recipe-specific config both matter.
|
||
|
||
**Phase 2 status (current):**
|
||
- Q0/Q1/Q2 Adversary PASS ✓
|
||
- Q3.1 partial (lasuite-docs), Q3.4 partial (cryptpad), Q4.1 done (matrix-synapse), Q4.3 done (bluesky-pds)
|
||
- Q5.1 docs partial
|
||
- Remaining: Q3.2/3.3/3.5 + Q4.2/4-10 + the deferred follow-ups (lasuite-docs OIDC wiring,
|
||
cryptpad create-pad, matrix-synapse shell-script ports)
|
||
|
||
Pausing for Adversary cold-verify of Q4.1+Q4.3 (and re-verify of Q3.1+Q3.4 if updated). Will
|
||
resume on watchdog ping.
|
||
|
||
## 2026-05-28 (later) — Q3.2 lasuite-drive base-deploy verify: disk → prune → Docker Hub rate limit; + Gitea outage
|
||
|
||
Resumed loop to cold-verify the lasuite-drive base deploy (the f59d8e6 commit deferred OIDC/specific
|
||
tests until the ~10-service base converges). Chain of events:
|
||
|
||
1. **First install run timed out at abra TIMEOUT=900.** abra log root cause was NOT slowness but
|
||
`FATAL: could not write init file: No space left on device` in postgres init — cc-ci `/` was at
|
||
**89% (2.9 GB free)**. The ~2GB onlyoffice + ~1GB collabora pulls filled the disk; postgres
|
||
couldn't initialise. Stack is actually **12 services** (app, backend, celery, celery-beat, db,
|
||
redis, minio, minio-createbuckets[0/0 one-shot], mailcatcher, web/nginx, collabora, **onlyoffice**)
|
||
— bigger than the recipe_meta header noted; it ships BOTH office backends by default.
|
||
|
||
2. **Freed disk via `docker image prune -af`** → reclaimed 10.1 GB (30 dangling images from prior
|
||
recipe runs); host went 2.9 GB → 14 GB free. Bumped abra TIMEOUT 900→1500, DEPLOY_TIMEOUT
|
||
1200→1800 (recipe_meta.py edit; not yet committed — Gitea down, see below).
|
||
|
||
3. **Second run progressed far** — db, collabora, onlyoffice, backend, celery, app all reached 1/1.
|
||
But minio/redis/web/mailcatcher stuck at 0/1 in an instant Assigned→Rejected loop ("No such
|
||
image"). Manual `docker pull minio/minio:...` returned **`toomanyrequests: You have reached your
|
||
unauthenticated pull rate limit`**. The prune wiped these (previously-cached) small images, and
|
||
the full cold re-pull of 12 images — on top of today's many recipe deploys (matrix-synapse,
|
||
bluesky, ghost, uptime-kuma, keycloak, lasuite-docs, cryptpad retries) — exhausted Docker Hub's
|
||
per-IP anonymous quota. Big images pulled first; the 4 small ones got starved.
|
||
|
||
**Lesson:** pruning is double-edged on this host — it frees disk but forces re-pulls that burn the
|
||
anonymous rate limit. The real fix is authenticated registry pulls (plan §1.5 "registry pull
|
||
credentials") + trimming heavy stacks (lasuite-drive does not need BOTH collabora and onlyoffice
|
||
for WOPI parity — one office backend suffices; disabling onlyoffice cuts the biggest image + RAM).
|
||
|
||
4. **Gitea (git.autonomic.zone) is down** — bare host `/`, unauth `/api/v1/version`, and authed repo
|
||
API all return plain-text `404 page not found` (Go default ServeMux 404 = backend down, proxy has
|
||
no upstream). Same from both my sandbox and cc-ci (same IP 116.203.211.204), so it's a real
|
||
instance outage, not my creds/path. Adversary's `/root/adv-verify` clone is stale at 1aaf3bd
|
||
(clean, no inbox) → Adversary runs in its own sandbox; the only shared channel (Gitea) is dead.
|
||
**Two watchdog pings arrived (REVIEW-2 update + BUILDER-INBOX.md) that I CANNOT consume** until
|
||
Gitea recovers — will pull + act the instant it's back.
|
||
|
||
Action: interrupted the stuck deploy (let abra TIMEOUT fire for clean teardown). Recording finding;
|
||
notifying operator (registry creds per §1.5 + Gitea outage). Idle-retry both until recovery.
|
||
|
||
### Correction (same session): cannot trim onlyoffice — recipe-as-is rule
|
||
Investigated the "disable onlyoffice to shrink the stack" idea from the entry above. The lasuite-drive
|
||
recipe ships a **single `compose.yml`** with collabora AND onlyoffice as unconditional services — no
|
||
`COMPOSE_FILE`/compose-profile toggle in `.env.sample`. Disabling onlyoffice would require editing the
|
||
recipe's `compose.yml`, which violates "test the recipe as-is / never modify the recipe under test"
|
||
(§7-equivalent corner-cut). So **the trim avenue is closed** — I test all 12 services. The only
|
||
legitimate levers for the rate-limit problem are: (1) **registry pull credentials** (the §1.5 operator
|
||
finding — requested), and (2) **don't `docker image prune` aggressively** between runs (it forces cold
|
||
re-pulls that burn the anonymous quota; let the cache persist). Disk pressure must instead be managed
|
||
by pruning ONLY truly-dangling images, or by the operator growing the cc-ci disk.
|
||
(Also noted: recipe env is `ONLY_OFFICE_DOMAIN`, underscore — my EXTRA_ENV flattened COLLABORA/MINIO
|
||
domains but not onlyoffice's; only matters for the WOPI/TLS path, to revisit when base converges.)
|
||
|
||
## 2026-05-28 (later) — Gitea restored; consumed Adversary inbox; fixed F2-11 (SSO-skip-goes-green)
|
||
|
||
Gitea (git.autonomic.zone) recovered ~21:08Z (orchestrator confirmed). Reconciled: `git pull --rebase`
|
||
(up to date), pushed my 2 queued local commits (1138d77 + 4a118ea → origin), then a 3rd pull picked up
|
||
the Adversary's `b941f55` (its outage-queued writes: F2-11 + REVIEW-2 idle checkpoint + BUILDER-INBOX).
|
||
Consumed + deleted BUILDER-INBOX. The 3 watchdog pings during the outage were phantoms (Adversary's
|
||
failed push retries) — nothing was lost.
|
||
|
||
**Adversary's BUILDER-INBOX (digested):** DONE-gate warnings (F2-7 authentik, F2-9 cryptpad create-pad,
|
||
ghost §4.3 create-post floor, Q3.2 drive specifics, full P1–P8 Q5 re-verify) — all need deploys, so
|
||
gated on the Docker Hub rate limit. Plus **F2-11** (medium, not a VETO), which is pure code → fixed it
|
||
now (rate-limit-independent).
|
||
|
||
**F2-11 — SSO-dep "deps-not-ready" SKIP must not yield a GREEN run.** Adversary cold-proved: when
|
||
`setup_custom_tests` fails for a DEPS-declaring recipe, `CCCI_DEPS_READY=0` → conftest skips every
|
||
`@requires_deps` test → a skip-only pytest file exits 0 → `run_custom` returns "pass" → `overall=0` →
|
||
`!testme` GREEN while the only SSO/OIDC test never ran. Violates P7.
|
||
|
||
Why my fix is shaped this way: the failure-isolation design (a transient SSO-setup failure must not
|
||
break the *generic* tier signal) is correct and I kept it — generic tier results stand untouched. The
|
||
defect was only that the green SIGNAL was indistinguishable from "SSO verified." So I correct the
|
||
signal, not the isolation:
|
||
- `conftest.pytest_collection_modifyitems` now COUNTS the requires_deps tests it skips and appends the
|
||
count to `$CCCI_DEPS_SKIP_REPORT` (one line per pytest invocation; orchestrator sums across the
|
||
per-custom-file loop). Chose a filesystem report (not exit code) because pytest has no "fail on
|
||
skip" and a skip-only file legitimately exits 0 — the orchestrator already shares run-scoped temp
|
||
files with the pytest subprocess (depsfile/statefile/countfile), so this matches the pattern.
|
||
- `run_recipe_ci`: reads + sums the count, surfaces it in RUN SUMMARY (`custom: pass (N requires_deps
|
||
SKIPPED ... SSO UNVERIFIED)`), and a new pure predicate `sso_dep_unverified(declared, deps_ready,
|
||
skipped)` flips `overall=1` when a recipe declares DEPS + deps not ready + ≥1 requires_deps skipped.
|
||
Gated on skip>0 so a deps-declaring recipe with no requires_deps tests isn't false-failed.
|
||
|
||
Verified (both deploy-free — rate-limit-independent):
|
||
1. `cc-ci-run -m pytest tests/unit -q` → **35 passed** (28 prior + 7 new in test_f211_sso_skip.py:
|
||
predicate truth table + conftest skip/record/append/noop-when-ready).
|
||
2. Cold real-test proof on cc-ci: `CCCI_DEPS_READY=0 CCCI_DEPS_SKIP_REPORT=/tmp/f211-skip.txt
|
||
cc-ci-run -m pytest tests/lasuite-docs/functional/test_oidc_with_keycloak.py -rs` → `1 skipped`,
|
||
`PYTEST_EXIT=0` (the hazard), but `/tmp/f211-skip.txt` now contains `1` → orchestrator would compute
|
||
`sso_dep_unverified(["keycloak"], False, 1)=True` → `overall=1`. Hazard closed.
|
||
|
||
Full e2e (real deploy with a forced setup_custom_tests failure → observe overall=1) deferred to when
|
||
the Docker Hub rate limit lifts; the unit + cold-real-test proofs cover the predicate, the conftest
|
||
signal on real files, and the count flow — only the sequential read→sum→predicate→overall wiring is
|
||
unexercised by a live run, and it's straight-line code.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — Phase 2 RESUMED after the 2w (warm-canonical) detour
|
||
|
||
Builder loop resumed on Phase 2 (per-recipe test authoring). Phase 2w ran to DONE in the interim
|
||
(warm-canonical/quick); the 2w changes (`runner/warm*.py`, `canonical.py`, `nightly_sweep.py`, WC5
|
||
promote-on-green-cold wired into `run_recipe_ci.main()`) are merged on origin/main HEAD `7b5ed9c`.
|
||
|
||
**Re-orientation done this tick:**
|
||
- Adversary's last Phase-2 commit `7b5ed9c review(2)` is a cross-phase break-it probe (2w WC5
|
||
promotion × F2-11 SSO-skip): NO regression, no finding, NO VETO — F2-11 protection holds under
|
||
WC5 (promotion strictly gated on the fully-computed `overall`, which the F2-11 predicate flips to
|
||
1 before the promote check). So no gate of mine to advance, nothing to fix.
|
||
- All Adversary findings closed (F2-10, F2-11). Gates Q0/Q1/Q2 PASS. Q3/Q4 partial.
|
||
|
||
**Server build clone established:** `/root/builder-clone` (origin/main, secrets submodule skipped —
|
||
not needed for recipe tests; Gitea token comes from `/run/secrets/bridge_gitea_token`, dockerhub
|
||
auth from sops-rendered `/root/.docker/config.json`). `/root/cc-ci` is the nix-deploy materialised
|
||
copy (no `.git`), `/root/adv-verify` is the Adversary's. I run e2e from `/root/builder-clone`.
|
||
|
||
**Foundation re-confirmed post-2w (this tick):**
|
||
- `cc-ci-run -m pytest tests/unit -q` → **72 passed** (Phase-2 harness survived the 2w merge).
|
||
- `RECIPE=custom-html cc-ci-run runner/run_recipe_ci.py` → all 5 tiers PASS, deploy-count=1, WC5
|
||
promoted canonical custom-html → 1.11.0+1.29.0. Full install→upgrade→backup→restore→custom
|
||
pipeline healthy on the current harness.
|
||
|
||
**Reference-corpus mapping (key planning fact).** Corpus at `/srv/recipe-maintainer/recipe-info/`
|
||
(NOT `references/` — that path in the plan is stale). Present: authentik, bluesky-pds, cryptpad,
|
||
custom-html, gitea, hedgedoc, immich, keycloak, lasuite-docs, lasuite-drive, lasuite-meet, lichen,
|
||
lichen-markdown, matrix-synapse, mumble, n8n. Implication for P2 (parity):
|
||
- §5 recipes WITH reference parity still to port: **lasuite-meet, immich, mumble** (+ already done:
|
||
bluesky-pds, cryptpad, custom-html, keycloak, lasuite-docs, lasuite-drive, matrix-synapse, n8n).
|
||
- §5 recipes with NO reference → P2 vacuous, need only ≥2 specifics + lifecycle: **plausible, ghost,
|
||
uptime-kuma (done), mattermost-lts, discourse, mailu, drone**.
|
||
- authentik: SSO provider, Q2.2 deferred (lands only if a dependent needs it).
|
||
- gitea/hedgedoc/lichen* are in the corpus but NOT in §5 → out of scope.
|
||
|
||
**Remaining §5 work:** Q3.3 lasuite-meet, Q3.5 immich, Q4.2 mumble (parity+specifics, need
|
||
mirror/enroll), Q4.5 mattermost-lts, Q4.6 discourse, Q4.7 plausible (finish specifics), Q4.9 mailu,
|
||
Q4.10 drone (specifics only), + deferral lift cryptpad create-pad (F2-9, must lift before DONE).
|
||
|
||
**In flight this tick:** full `RECIPE=lasuite-drive` e2e on `/root/builder-clone`
|
||
(log `/root/ccci-resume-lasuite-drive.log`) — lasuite-drive suite (health parity + real MinIO S3
|
||
upload/list/download round-trip + OIDC password-grant JWT-claims against dep keycloak) is fully
|
||
authored; driving it to its first verified-green full run (the Q3.2 acceptance evidence).
|
||
|
||
---
|
||
|
||
## 2026-05-29 — lasuite-drive full e2e: upgrade tier hits a DISK-SIZE env blocker (host health emergency handled)
|
||
|
||
Drove lasuite-drive (heaviest §5 recipe — BOTH office backends) toward its first verified-green full
|
||
run. install tier PASSED (generic test_serving + cc-ci test_serving_and_frontend; all 12 services
|
||
converged after collabora won its startup race — see below). backup tier PASSED. Then the **upgrade
|
||
tier FAILED** and disk hit **99% (522M free)**, risking a host wedge.
|
||
|
||
**Root cause (definitive, from the abra DEPLOY OVERVIEW in the log):** the prev→PR-head upgrade
|
||
crosses *two different multi-GB office image versions simultaneously*:
|
||
- onlyoffice/documentserver-de: 9.2 → **9.3.1.2** (3.94GB image)
|
||
- collabora/code: 25.04.9.1.1 → 25.04.9.4.1 (~1GB)
|
||
- (+ small drive-backend/frontend v0.12.0→v0.18.0, redis, nginx)
|
||
abra's in-place chaos rolling update must hold BOTH the running prev office images AND pull the new
|
||
ones before swapping — ~10GB of office images transiently. The 28GB host has only ~14GB docker
|
||
headroom over the ~13GB baseline (nix store ~9.6GB + infra images ~1.75GB), so the PR-head pull
|
||
overflowed. **No harness mitigation exists:** the prev images are *running* (not dangling) when the
|
||
new must be pulled, and you cannot `docker rmi` a running image; a pre-upgrade prune finds nothing
|
||
dangling. It is fundamentally a disk-SIZE constraint, driven by the recipe legitimately bumping office
|
||
image tags across releases. Not a test-quality issue and not weakenable.
|
||
|
||
**collabora startup race (separate, self-resolving):** collabora/code logs
|
||
`/usr/bin/coolmount: Operation not permitted` (CapAdd=[] + default seccomp blocks mount()), falls back
|
||
to slow file-COPYING into its jail; the healthcheck killed an early task (exit 137) but a later task
|
||
finished the copy and reached 1/1. So collabora converges, just flaps once or twice first. Not the
|
||
blocker; noting in case it recurs on slower disk.
|
||
|
||
**Emergency handled — host fully restored:** killed the run (`pkill -f run_recipe_ci.py`), removed the
|
||
orphaned `lasu-7ea5e3` stack + its volumes (minio, postgres) + 8 leftover secrets (the killed run's
|
||
teardown never ran), pruned dangling images. Disk recovered 99% → 37% (17GB free). Infra stacks
|
||
(traefik/drone/dashboard/bridge/backups/warm-keycloak) untouched and healthy throughout.
|
||
|
||
**Decision:** the upgrade tier for lasuite-drive (and very likely other heavy recipes: lasuite-docs
|
||
also ships collabora; immich ships multi-GB ML images; lasuite-meet) is a genuine **Class A1 env-level
|
||
disk blocker** — the clean fix is a larger host disk (operator). Filed in DEFERRED.md + DECISIONS.md +
|
||
BACKLOG-2; flagged to operator (PushNotification) and Adversary (inbox). Meanwhile banking the
|
||
**maximal testable subset** (install+backup+restore+custom — single version, fits disk) to prove
|
||
lasuite-drive's actual Q3.2 CONTENT works: parity health, the real MinIO S3 upload→list→download
|
||
round-trip, and the OIDC password-grant + JWT-claims flow against the dep keycloak. Per §7.1 the
|
||
maximal subset is implemented and only the genuinely-disk-blocked upgrade tier is outstanding —
|
||
pending Adversary sign-off on the env-blocker.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — lasuite-drive: --detach fix validated, but OIDC setup redeploy is FLAKY (NOT claiming Q3.2 yet)
|
||
|
||
Ran lasuite-drive maximal subset (install,backup,restore,custom) four times today:
|
||
- **Run 1** (`ccci-drive-subset.log`): all tiers + all 3 functional GREEN (health, MinIO round-trip,
|
||
OIDC JWT) — but required a manual kill of the hung `docker service scale` (the bug I then fixed with
|
||
`--detach`, commit `f1c626c`). So the test ASSERTIONS are all correct and CAN pass.
|
||
- **Runs 2 & 3** (`-clean`, `-clean2`): corrupted by MY OWN over-eager `docker image prune -f` mid-deploy
|
||
— it removed the just-pulled, not-yet-attached digest-pinned images (drive-frontend, onlyoffice),
|
||
so swarm rejected with "No such image" and install failed/timed out. **LESSON: never
|
||
`docker image prune` during an active deploy — mid-pull images look dangling and get removed.**
|
||
Confirmed self-inflicted: `docker pull lasuite/drive-frontend@sha256:eeef…` succeeded (image is on
|
||
hub), and after seeding it the stack converged. Not a recipe/test issue.
|
||
- **Run 4** (`-clean3`, warm images, hands-off, fixed `--detach`): install/backup/restore all PASS,
|
||
health + MinIO PASS, **but the OIDC test SKIPPED because `setup_custom_tests.sh` exited 1** — its
|
||
step-3 in-place `abra app deploy --force --chaos` (applies the OIDC env) FAILED to converge
|
||
("FATA deploy failed"; abra log shows backend `Permission denied: /.gunicorn` + celery
|
||
`configure_wopi: 404 from collabora discovery url`). Per F2-11 the run correctly went RED (no false
|
||
green) — `custom: pass (1 requires_deps SKIPPED — SSO UNVERIFIED)`, overall=1. The `--detach` fix
|
||
itself works (bucket scale returned, secret inserted v2); the failure is the full-stack redeploy.
|
||
|
||
**Root finding: the OIDC-wiring step (a full 12-service in-place `--chaos` redeploy) is FLAKY on this
|
||
heaviest stack** — collabora's reconverge race + a transient backend gunicorn-perms/WOPI-404 window
|
||
mean the redeploy succeeds only sometimes (run 1 yes, run 4 no). The OIDC env change only affects
|
||
backend/app, so re-converging collabora/onlyoffice is unnecessary exposure. Fix direction (BACKLOG):
|
||
wire OIDC at INSTALL time (no post-deploy redeploy — like lasuite-docs install_steps), or make the
|
||
setup redeploy resilient (retry / wait for collabora WOPI discovery 200 before declaring ready).
|
||
|
||
**Decision:** NOT claiming Q3.2 — a flaky OIDC setup is not a reliable green, and claiming would risk
|
||
an Adversary cold-verify FAIL. lasuite-drive stays [~]: test content proven correct (run 1), `--detach`
|
||
bug fixed, two open issues (disk-blocker on upgrade tier [DEFERRED/operator]; flaky OIDC redeploy
|
||
[BACKLOG, needs robustness work]). **Pivoting to lighter recipes for broad Phase-2 progress**;
|
||
lasuite-drive's OIDC robustness + upgrade-disk return later. Host left clean (all stacks torn down,
|
||
disk 65%, infra healthy).
|
||
|
||
---
|
||
|
||
## 2026-05-29 — Next unit scouted: mumble (Q4.2) — design for the first NON-HTTP recipe
|
||
|
||
Pivoted off heavy lasuite-drive to a lighter recipe. mumble: recipe.toml has NO deps, single light
|
||
service (mumblevoip/mumble-server:v1.6.870-0) → fast deploys, low disk (avoids the lasuite-drive
|
||
heaviness/flakiness). BUT it's the first non-HTTP recipe: raw Mumble protocol over TLS on TCP 64738
|
||
(+ UDP). Reference corpus `/srv/recipe-maintainer/recipe-info/mumble/tests/`: health_check.py (TCP
|
||
connect to 64738), mumble_connect.py (pure-stdlib TLS handshake: Version + auth-accepted +
|
||
ChannelState + ServerSync + welcome text — portable as-is), web_client.py (HTTPS web UI, needs
|
||
`compose.mumbleweb.yml` overlay).
|
||
|
||
**Reachability decision (the crux):** cc-ci's traefik is HTTP(S)-only; the recipe declares traefik
|
||
TCP/UDP router labels but cc-ci has no :64738 TCP entrypoint, and host→overlay-container-IP isn't
|
||
reliably routable. **Chosen approach: run the protocol probe from a throwaway `python:3-slim`
|
||
sidecar container attached to the app's overlay network**, connecting to the murmur service by its
|
||
swarm DNS name (`app`) on 64738. No traefik change, no host-port publish, no compose-overlay
|
||
selection needed — the harness already knows the stack/network name. This becomes a small reusable
|
||
harness primitive (`run probe container on app network`) for any future non-HTTP recipe. Record in
|
||
DECISIONS.md when implemented.
|
||
|
||
**Enrollment plan (next tick):** mirror-check mumble on recipe-maintainers (auto-mirror if absent per
|
||
plan §0b); `tests/mumble/recipe_meta.py` (no DEPS; HEALTH via the sidecar TCP probe, not HTTP —
|
||
needs a recipe_meta hook or a custom install overlay since the generic HTTP health check won't apply;
|
||
likely set CCCI_SKIP_GENERIC or provide a TCP-aware install overlay); port health_check +
|
||
mumble_connect as functional tests using the sidecar primitive; ≥2 specifics (protocol handshake +
|
||
channel-list presence beyond TCP health); PARITY.md; e2e (light/fast). web_client.py deferred unless
|
||
the mumbleweb overlay is enabled. Open question to resolve in code: how the generic install tier
|
||
(HTTP health) behaves for a non-HTTP recipe — may need a per-recipe "health kind = tcp" in
|
||
recipe_meta consumed by the generic harness.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — mumble scope CORRECTION: non-HTTP health is a high-blast-radius core-harness feature, not a light add
|
||
|
||
On deeper inspection, mumble's non-HTTP nature is NOT a small adaptation. The HTTP health assumption
|
||
is baked into the CORE health path used by EVERY recipe + the 2w warm system:
|
||
- `run_recipe_ci._load_meta` defaults (HEALTH_PATH/HEALTH_OK) + the mirrored `conftest._recipe_meta`.
|
||
- `lifecycle.wait_healthy(domain, ok_codes, path, ...)` — the orchestrator's post-deploy HTTP poll at
|
||
THREE call sites (run_recipe_ci.py:467 warm/canonical, :633, :737).
|
||
- `canonical.deploy_canonical` health gate (warm-cache, 2w).
|
||
- `generic.assert_serving` (HTTP fetch + served_cert) and restore-health.
|
||
Supporting a TCP/protocol recipe means threading a `HEALTH_KIND` (http|tcp) through ALL of these with
|
||
default="http" preserving current behavior. That's a legitimate harness feature but HIGH BLAST RADIUS
|
||
(a regression breaks every recipe and the warm sweep), so it warrants a dedicated, careful effort with
|
||
unit tests + a no-regression re-run of an HTTP recipe + Adversary scrutiny of the core change — NOT a
|
||
tail-of-session cram. **Filed as its own unit (Q4.2 stays open; needs the non-HTTP-health harness
|
||
feature first).** Also: mumble's app is only on the `proxy` net and routes via a traefik `mumble` TCP
|
||
entrypoint cc-ci lacks (HostSNI + TLS passthrough) — the custom protocol test still needs the
|
||
python-sidecar-on-proxy-net probe.
|
||
|
||
**Next-unit re-pick:** prefer an HTTP-NATIVE recipe that uses the proven harness with zero core
|
||
changes — **mattermost-lts (Q4.5)** is the candidate (HTTP UI+API via traefik; §4.3 = create-a-message
|
||
round-trip is pure test-authoring, not harness surgery). Scout it next: confirm it's HTTP-native +
|
||
self-contained DB (vs needing a dep), mirror-check, then enroll (recipe_meta + lifecycle overlays +
|
||
≥2 specifics + PARITY note [no reference corpus → P2 vacuous]). Keeps blast radius low and adds real
|
||
coverage. mumble/mailu (non-HTTP) batch behind the HEALTH_KIND harness feature.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — DISK RESIZE 30→70GB in progress (orchestrator) — disk-blocker LIFTING; deploys paused
|
||
|
||
Orchestrator is resizing the cc-ci VM disk 30→70GB; VM RESTARTS (few-min outage + live-warm keycloak
|
||
re-warms on boot, up to ~10min). Actions: PAUSED new deploys; the in-flight mattermost-lts
|
||
install+custom e2e (`ccci-mattermost2.log`) will die transiently with the restart — that is the
|
||
restart, NOT a bug; re-run after. Waiting for the orchestrator's "back + healthy" signal (fallback
|
||
self-poll meanwhile).
|
||
|
||
**Impact (big):** this lifts the heavy-recipe upgrade-tier disk blocker (DEFERRED 2026-05-29 →
|
||
LIFTING). After cc-ci is healthy I can:
|
||
1. Re-run **lasuite-drive FULL lifecycle** (install+upgrade+backup+restore+custom) — the upgrade tier's
|
||
dual multi-GB office-image crossover (~10GB transient) now fits in 70GB. This is the path to the
|
||
real Q3.2 green (modulo the separate Q3.2a OIDC-redeploy flakiness — watch whether the bigger disk
|
||
also eases the redeploy convergence, though the flakiness root was collabora reconverge timing, not
|
||
disk). With more headroom the collabora re-pull churn from my earlier prune mistakes also stops
|
||
biting.
|
||
2. Re-run **mattermost-lts** install+custom (validate the create-message §4.3 round-trip) — it had
|
||
just launched when the resize started.
|
||
3. Resume broad heavy-recipe coverage (immich, lasuite-meet) with real disk headroom.
|
||
|
||
Note: with 70GB, I can also be less aggressive about teardown/prune churn between heavy runs.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — lasuite-drive Q3.2a Step 0: root-cause failure logs captured (BEFORE any fix)
|
||
|
||
Resuming Q3.2a (plan-lasuite-drive-oidc-robustness.md) after Phase 2pc DONE. The Adversary's
|
||
cold-verify criterion #1 requires real captured failure logs before any fix. Captured from the
|
||
flaky run-4 deploy (`/root/.abra/logs/default/lasu-288dfd...2026-05-29T062401Z`, the
|
||
`abra app deploy --force --chaos` OIDC-setup redeploy that exited 1 / "FATA deploy failed"):
|
||
|
||
1. **gunicorn perms race** — `backend [1] [ERROR] Control server error: [Errno 13] Permission
|
||
denied: '/.gunicorn'`. gunicorn tries to create its control-server temp dir under HOME=`/`
|
||
(not writable). (Part B fix: set perms / writable HOME in entrypoint before exec gunicorn.)
|
||
2. **WOPI-discovery race** — `celery RuntimeError: status code 404 return by discovery url for
|
||
wopi client collabora is invalid` at `/app/wopi/tasks/configure_wopi.py:53`. The celery
|
||
`configure_wopi_clients` task hits collabora's discovery URL at boot (06:21:54) while collabora
|
||
is still caching its 132+ l10n files (finishes ~06:24) → 404 → task raises. (Part B fix:
|
||
collabora WOPI healthcheck gating + backend retry/backoff on discovery.)
|
||
3. **transient db-not-ready** — `db FATAL: database "drive" does not exist` + celery
|
||
`Could not connect to database: failed to resolve host 'db'` — early-boot DNS/init races that
|
||
self-heal; harmless on a fresh deploy with the full TIMEOUT window.
|
||
|
||
**Key observation that shapes the fix:** the FIRST install deploy converges reliably **every** run
|
||
(install: pass in runs 1–4, incl. run 4). Only the post-install in-place `--force --chaos` redeploy
|
||
(applied to push the OIDC env) is flaky. The OIDC env touches ONLY backend/app — re-converging
|
||
collabora/onlyoffice/minio is unnecessary exposure. → **Part A: wire OIDC into the .env at INSTALL
|
||
time (between `abra app new` and the single `abra app deploy`) so the recipe deploys ONCE with OIDC
|
||
already set; no post-deploy reconverge.** keycloak is live-warm (always up), so the per-run realm is
|
||
a lightweight API call provisioned before the single deploy. Part B (recipe robustness PR) remains
|
||
the deeper fix so ANY reconverge (incl. the upgrade-tier prev→PR-head crossover) is race-free.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — lasuite-drive Q3.2a: Part A + upgrade-gate fix → FULL SUITE GREEN (run 1 of 3)
|
||
|
||
Two iterations landed:
|
||
- **Part A** (commit `a151489`): wire OIDC at INSTALL (provision warm-keycloak realm before the
|
||
single deploy; `install_steps.sh` writes OIDC env into it). Run 1 (`ccci-drive-q32a-r1.log`):
|
||
deploy-count=1, install/backup/restore/custom + OIDC test all GREEN — but **upgrade tier FAILED**:
|
||
the chaos redeploy SIGTERMed a still-booting collabora (coolwsd ~2min boot) → "Shutdown requested
|
||
while starting up", forced exit 70 → abra aborted ("FATA deploy failed"). install wait_healthy
|
||
returns on collabora container 1/1 while coolwsd is still loading.
|
||
- **Upgrade-gate fix** (commit `4b38b66`): `ops.py::pre_upgrade` now waits for collabora WOPI
|
||
discovery (`/hosting/discovery` on `collabora-<domain>`) → 200 before the chaos redeploy; +
|
||
DEPLOY_TIMEOUT plumbed through `chaos_redeploy`/`perform_upgrade`/`_perform_op` (was abra.deploy's
|
||
900s default vs the .env internal TIMEOUT 1500s).
|
||
|
||
**Fixed-code run 1 (`ccci-drive-q32a-r2.log`) — FULL SUITE GREEN:**
|
||
```
|
||
pre_upgrade: collabora WOPI discovery ready (200) on collabora-lasu-d12d03.ci.commoninternet.net
|
||
RUN SUMMARY: deploy-count = 1 (expect 1)
|
||
install : pass upgrade : pass backup : pass restore : pass custom : pass
|
||
```
|
||
- upgrade: `test_upgrade_preserves_data` PASSED (ci_marker survived prev→PR-head chaos crossover).
|
||
- custom: health + minio round-trip + OIDC password-grant JWT all PASSED (OIDC PASS, NOT skip).
|
||
- Clean teardown: no lasu stacks/volumes after; disk 38%.
|
||
|
||
The collabora-ready gate is the decisive fix — the upgrade chaos redeploy now replaces a fully-ready
|
||
collabora cleanly instead of killing it mid-boot. Launching runs 2 + 3 for the Adversary-required 3×
|
||
repeat-green before claiming Q3.2. (Part B — recipe-level WOPI healthcheck/gunicorn-perms PR — is no
|
||
longer required for CI green; will reassess whether to still file it as upstream robustness once 3×
|
||
green holds.)
|
||
|
||
---
|
||
|
||
## 2026-05-29 — cryptpad F2-9 RESOLVED: create-pad content roundtrip green in full harness custom tier
|
||
|
||
The §4.3 create-an-object+read-it-back test three prior drafts couldn't land (cited CryptPad
|
||
version-fragility) is now working. Empirically mapped CryptPad 2026.2.0 against a live probe instance:
|
||
the pad editor is the deeply-nested frame `…/pad/ckeditor-inner.html` (top → `#sbox-iframe` on the
|
||
sandbox domain → CKEditor frame); visiting `/pad/` auto-creates a fragment-keyed pad
|
||
(`#/2/pad/edit/<key>/`) after ~15s cold init (LESS compile). `tests/cryptpad/playwright/
|
||
test_pad_content_roundtrip.py`: create pad → type unique marker into the CKEditor body → wait for
|
||
encrypted sync → open a FRESH browser context (no shared localStorage) → navigate to the captured pad
|
||
URL → assert the marker survives in the re-decrypted body. Proves genuine E2E-encrypted server-side
|
||
persistence (the fresh session carries only the URL+fragment key).
|
||
|
||
Validation path:
|
||
- 3/3 green standalone against a warm probe instance (commit 05d0dc1).
|
||
- First full-suite run did NOT exercise it (I'd `rm`'d the file from builder-clone to unblock a pull;
|
||
the ff left it deleted → discovery skipped it — LESSON: `git checkout -- <file>` after pull, never
|
||
leave a tracked test locally-deleted).
|
||
- Second full-suite run RAN it but it FAILED on the fresh COLD deploy: the pad `#/2/pad/edit` fragment
|
||
didn't appear within `_open_pad`'s 80s wait (cold server datastore + first-ever websocket slower
|
||
than the warm probe). Fix `656b68b`: bump `_open_pad` hash-wait to ~240s + a mid-way reload.
|
||
- Third full-suite run (`/root/ccci-cryptpad-full3.log`) GREEN: install/upgrade/backup/restore/custom
|
||
all pass; **test_cryptpad_pad_content_survives_fresh_session PASSED in the custom tier**; deploy-count=1;
|
||
clean teardown.
|
||
|
||
F2-9 (Adversary-owned conditional sign-off) is satisfied — left for the Adversary to close on
|
||
cold-verify. DEFERRED.md cryptpad create-pad entry marked resolved.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — Both Phase-2-DONE blockers cleared; next unit scouted: Q3.3 lasuite-meet
|
||
|
||
**Milestone:** Q3.2 lasuite-drive = Adversary PASS (F2-12 CLOSED). cryptpad F2-9 = RESOLVED (roundtrip
|
||
green in full custom tier; awaiting Adversary close). The two veto-eligible / DONE-gating items are done.
|
||
|
||
**Next unit — Q3.3 lasuite-meet (SSO-dependent, La Suite sibling).** Scouted: mirrored on
|
||
recipe-maintainers (200), reference corpus rich (health_check, oidc_login, meeting_flow, webrtc-media,
|
||
webrtc-relay), `recipe.toml` requires=["keycloak"], [sso] provider=keycloak. **Reuses the exact
|
||
machinery I just built for lasuite-drive** — so low-friction:
|
||
- `recipe_meta.py`: DEPS=["keycloak"] + OIDC_AT_INSTALL=True (+ READY_PROBE if a heavy sub-service
|
||
like livekit needs an extra readiness signal — TBD at deploy).
|
||
- `install_steps.sh`: wire OIDC env at install (mirror lasuite-drive's; impress/La Suite OIDC contract
|
||
— adapt env var names to meet's .env.sample).
|
||
- lifecycle overlays test_install/upgrade/backup/restore + ops.py (DB marker like drive's, if meet has
|
||
a backable DB).
|
||
- Parity ports: health_check (HTTP 200), oidc_login (→ test_oidc_with_keycloak via
|
||
harness.sso.oidc_password_grant). PARITY.md mapping.
|
||
- §4.3 specifics: **meeting_flow** (password-grant token → create a room via meet API → assert room +
|
||
obtain LiveKit join token for 2 users; corpus meeting_flow.py shows the shape) + **webrtc** probe
|
||
(ICE/connectivity or LiveKit token issuance — full UDP media relay may be an env-blocker per plan
|
||
§7.1; implement the maximal testable subset = signaling/token issuance + document any true blocker).
|
||
- e2e: RECIPE=lasuite-meet PR=0 cc-ci-run runner/run_recipe_ci.py → full suite green, OIDC PASS.
|
||
|
||
(Also noted: tests/plausible/ has a stub (recipe_meta + functional/) from an earlier partial; plausible
|
||
not mirrored. Lower priority than lasuite-meet which completes Q3.)
|
||
|
||
---
|
||
|
||
## 2026-05-29 — Testing-standard clarification (operator): 3× repeat-green is flakiness-specific, not general
|
||
|
||
The 3× repeat-green bar I applied to lasuite-drive (F2-12 fix) was correct THERE because that recipe
|
||
was demonstrably flaky — it was a flakiness proof (show the fix made it reliably green, not lucky-once).
|
||
**It is NOT the general standard.** Normal recipe gates = **ONE Adversary cold-verified green** per
|
||
plan.md §6.1. Do NOT require 3× for other recipes (lasuite-meet Q3.3, future Q4 recipes) — a single
|
||
full-suite green + Adversary cold-verify is the bar. (Recorded by orchestrator in
|
||
plan-lasuite-drive-recipe-pr.md §2; the 3× re-applies only if a recipe shows flakiness again.)
|
||
|
||
---
|
||
|
||
## 2026-05-29 — F2-13 fixed: cryptpad roundtrip read-back made robust (poll all frames)
|
||
|
||
Adversary cold-verify of F2-9 FAILED (F2-13): the roundtrip's read-back leg timed out waiting for the
|
||
CKEditor `ckeditor-inner` frame to ATTACH on a fresh cold context (flaky). Fix (commit `b44d75b`): the
|
||
read-back no longer requires that specific frame to attach — it polls EVERY frame's body text for the
|
||
marker (generous ~240s deadline + periodic reloads). The marker appearing in a fresh context still
|
||
proves server-side E2E-encrypted persistence (only URL+fragment key carried over). Bumped session-1
|
||
post-type sync wait 9s→12s.
|
||
|
||
Validated **3× green** against a cold cryptpad probe (`cryptpad-probe`), ~33s each, no flakiness (the
|
||
poll-all-frames finds the marker fast once the pad renders — robust AND faster than the old
|
||
frame-attach wait). F2-13 is Adversary-owned — left for the Adversary to re-verify + close F2-9.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — Q3.5 immich: 4/5 tiers green + §4.3; restore data-integrity blocked by UPSTREAM recipe (no pg_dump hook)
|
||
|
||
Full suite (`/root/ccci-immich-full.log`): install PASS, upgrade PASS (real crossover
|
||
1.5.1+v2.6.3→1.6.0+v2.7.5, ci_marker survived), backup PASS (artifact created), custom PASS
|
||
(test_immich_upload_asset_readback_and_thumbnail = §4.3 upload→read-back→thumbnail-derivative;
|
||
health), deploy-count=1, clean teardown. **ONLY `test_restore_returns_state` FAILED** — postgres
|
||
`ci_marker` does not survive `abra app restore` (relation does not exist; app itself healthy).
|
||
|
||
**Diagnosed (harness path, immich probe):** seed ci_marker='original' → `abra app backup create`
|
||
(restic snapshot, 1729 files / 190MB) → drop ci_marker → `abra app restore` → ci_marker STILL absent.
|
||
**Root cause:** immich's UPSTREAM recipe backs up the **live postgres data VOLUME** via restic
|
||
(`backupbot.backup=true` on `database`, NO pg_dump hook) — a hot pgdata snapshot that cannot reliably
|
||
restore a DB row into a running postgres. Contrast lasuite-drive/meet, which ship a `pg_backup.sh` +
|
||
labels (`backup.pre-hook: /pg_backup.sh backup` → `backup.volumes.postgres.path: backup.sql` →
|
||
`restore.post-hook: /pg_backup.sh restore`) producing a CONSISTENT SQL dump that restores cleanly
|
||
(their restore tiers pass). This is an upstream immich-recipe defect (same class as the parked Q3.2b
|
||
lasuite-drive recipe-robustness PR), not a cc-ci/test bug — the ci_marker pattern is correct (works on
|
||
drive/meet).
|
||
|
||
**Decision:** Q3.5 immich = PARTIAL. The maximal subset is proven (install/upgrade/backup-artifact/
|
||
restore-healthy/custom incl. §4.3 + health). Real DB-restore data-integrity (P4) needs the immich
|
||
recipe to gain a `pg_dump` backup hook — a recipe-create-pr unit (mirror immich → add pg_backup.sh +
|
||
the 4 backupbot labels [adapt POSTGRES_USER=postgres, DB=immich] → cc-ci full-suite green on the PR →
|
||
operator merge), exactly like Q3.2b for drive. Filed DEFERRED + BACKLOG. NOT claiming Q3.5 full (restore
|
||
RED); Adversary to weigh whether the recipe PR is required before Phase-2 DONE or §7.1 sign-off applies.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — HQ1 image pre-pull DONE (commit 2bf40d6), claimed
|
||
|
||
Implemented per plan-prepull-images.md: lifecycle.prepull_images resolves a recipe's images via
|
||
`docker compose config --images` (COMPOSE_FILE from the app .env — handles $VERSION interpolation +
|
||
multi-compose; verified the invocation on custom-html-tiny [1 img] + lasuite-meet [compose.yml:
|
||
compose.turn.yml]) and docker-pulls them skip-if-present. Wired into deploy_app (before the unchanged
|
||
abra.deploy) + perform_upgrade (before the chaos redeploy). Validation: 4 unit tests (mocked docker)
|
||
prove present→skip / missing→pull / pull-fail→RAISE / no-images→skip; n8n run #1 prepulled a cold
|
||
image + green; n8n run #2 (warm) showed `prepull: present` (no re-download); a bogus tag raised a
|
||
clear "clear pull error BEFORE deploy: manifest unknown" pre-deploy. abra deploy unchanged (no service
|
||
update/scale). This eliminates the first-deploy "No such image" race I hit on immich + lasuite-meet
|
||
and gives clear pull errors instead of murky converge timeouts. Honest scope: removes pull-time not
|
||
app-init-time.
|
||
|
||
## 2026-05-29 — Q4.7 plausible: test content green; deploy blocked by upstream clickhouse-boot-download flakiness
|
||
|
||
**Test content authored + partially proven.** Wrote the §4.3 functional tests
|
||
(`tests/plausible/functional/test_event_tracking.py`: `test_pageview_event_roundtrip` +
|
||
`test_custom_event_roundtrip`) and fixed the health probe. Empirically validated the full event
|
||
round-trip against a live probe BEFORE writing: register a site row in the metadata postgres
|
||
(plausible's `sites_cache` GATES ingestion — events for unregistered domains are silently dropped,
|
||
confirmed count=0), POST to `/api/event` with a **browser User-Agent** (plausible drops bot/library
|
||
UAs), poll ClickHouse `events_v2` for the row (sites_cache refresh + write-buffer flush → first landing
|
||
~35-50s). A first `STAGES=install,custom` run **PASSED both event tests** (`2 passed in 73.58s`) and the
|
||
custom tier — so the §4.3 content is GREEN. Health probe switched `/` → `/api/health` (returns 200 with
|
||
`{"clickhouse":"ok","postgres":"ok","sites_cache":"ok"}` only when both stores ready; `/` 500s under
|
||
headless DISABLE_AUTH then 302s once ready, so `/` can't distinguish not-ready from ready). The prior WIP
|
||
edit had left an UNTERMINATED docstring in test_health_check.py (syntax error) — fixed. Install overlay
|
||
re-checked `/` (→500) and FAILED; replaced with a stronger assertion on the /api/health JSON subsystems.
|
||
|
||
**Blocker (upstream recipe defect): clickhouse-backup boot-download crash-loop.** The full lifecycle run
|
||
**timed out at DEPLOY_TIMEOUT=1200s** — `abra app deploy ... timed out after 1200 seconds`. Root cause:
|
||
the recipe's `entrypoint.clickhouse.sh` (swarm config `clickhouse_entrypoint`, mapped to
|
||
`/custom-entrypoint.sh`) runs, with `set -e` and NO retry, a `wget` of a 22MB `clickhouse-backup` tarball
|
||
from `github.com/AlexAkulov/clickhouse-backup` (renamed → 301 to `Altinity/...`) BEFORE exec'ing
|
||
clickhouse-server. If that wget (or the subsequent `tar -xf`) fails, the entrypoint exits 1 with EMPTY
|
||
logs (clickhouse-server never starts) and swarm crash-loops the task. Each restart re-downloads 22MB →
|
||
~120 attempts/20min ≈ 2.6GB hammered at GitHub → **GitHub secondary rate-limiting** → all subsequent
|
||
downloads fail → sustained crash-loop → deploy timeout.
|
||
|
||
Evidence: exited containers = `exit=1`, zero logs (fails before clickhouse). The download URL is fine —
|
||
a bridge-network `docker run` with the EXACT entrypoint command (busybox wget; image's `wget` is
|
||
`/bin/busybox`) succeeds 3/3 (22222742 bytes) when NOT hammered. The first `install,custom` run and a
|
||
manual probe BOTH converged (clickhouse up, events ingested) — i.e. the deploy works when GitHub answers
|
||
the first wget. The failure is induced by my back-to-back heavy testing churn today exhausting the IP's
|
||
GitHub budget; swarm task containers egress via the same host IP so they share the throttle.
|
||
|
||
**Why it matters for the gate:** normal CI (one PR → one deploy, MAX_TESTS=1) does ONE wget — usually
|
||
succeeds, converges (as proven). The catastrophic 20-min spiral needs SUSTAINED GitHub throttling, which
|
||
only my repeated-deploy testing produces. So plausible is reasonably reliable in normal operation but is
|
||
NOT robust to a transient first-wget failure (any single failure spirals), and the Adversary cold-verify
|
||
shares the risk.
|
||
|
||
**Decision (see DECISIONS.md):** durable fix = recipe PR hardening `entrypoint.clickhouse.sh` —
|
||
download the binary to the PERSISTENT `/var/lib/clickhouse` volume with skip-if-present (restarts don't
|
||
re-download → no amplification), retry-with-backoff, and `set +e` so a download failure does NOT block
|
||
clickhouse-server start (the DB must come up regardless; backup capability degrades gracefully). This
|
||
ALSO makes the deploy converge even under an active GitHub throttle (the DB no longer waits on the
|
||
download), so it is testable now. Same upstream-robustness pattern as Q3.2b (lasuite-drive) and immich's
|
||
pg_dump. cc-ci test content is correct and unchanged by this.
|
||
|
||
Killed the crash-looping runs + removed all plausible stacks/configs/networks/volumes (clean). NOT
|
||
claiming Q4.7 until the full lifecycle is green.
|
||
|
||
## 2026-05-29 — next-recipe recon (drone/discourse/mailu) after Q4.2 mumble claim
|
||
Recon (abra recipe fetch + compose inspect; non-deploy) of the 3 remaining unenrolled §5 recipes:
|
||
- **discourse**: services app+db(postgres)+redis+sidekiq; **HAS backupbot.backup label (compose.yml)
|
||
→ real P4 achievable**; 13 version tags (real upgrade); compose.smtpauth.yml overlay; functional =
|
||
create-a-topic via admin API (needs an admin API key — discourse first-boot/admin bootstrap). Heaviest
|
||
deploy (slow cold start, big image) — main risk is run time/flakiness, not coverage.
|
||
- **mailu**: 11 services (app/db/admin/imap/smtp/antispam/webmail/rspamd/dkim...); **NO backupbot label
|
||
→ P4 gap** (would need a recipe-PR to add backup, like immich Q3.5 — a deferral); 11 tags; functional =
|
||
admin API create domain+mailbox + SMTP/IMAP send/receive. Multi-service, moderate-heavy.
|
||
- **drone**: single app service + data volume; **NO backupbot → P4 gap**; 11 tags; compose.gitea.yml /
|
||
compose.github.yml overlays — functional depth (create/list builds) needs a wired git provider (gitea
|
||
OAuth dep). It is cc-ci's own CI engine. Shallow without a dep; P4 gap.
|
||
**Choice for the cleanest COMPLETE enrollment (P1 install+upgrade+backup-restore + real P4): discourse**
|
||
(only one of the three with a recipe backup mechanism). mailu/drone would each carry a P4-N/A deferral
|
||
(no upstream backup config) needing Adversary §7.1 sign-off or a recipe-PR. Plan discourse next: HTTP
|
||
health, admin-API create-a-topic (+ read-back) for §4.3, postgres ci_marker for P4 (backupbot present).
|
||
Hold the deploy until the Adversary's mumble cold-verify frees the single node.
|
||
|
||
## 2026-05-29 — mailu (Q4.9) investigation; discourse (Q4.6) blocked
|
||
- **discourse Q4.6 BLOCKED**: `bitnami/discourse:*` images removed from Docker Hub (manifest unknown;
|
||
swarm "No such image" rejection). bitnamilegacy/discourse exists but install tier uses the gone
|
||
prev-published version → recipe-PR can't unblock until upstream re-releases. DEFERRED.md entry filed.
|
||
Scaffolding (recipe_meta+postgres-P4 ops/overlays+health) staged at ca7acf3 for when fixed.
|
||
- **mailu Q4.9 plan** (images all pullable — ghcr.io/mailu/* OK; NOT bitnami):
|
||
- Services: front(nginx)/admin/imap(dovecot)/smtp(postfix)/antispam(rspamd)/webmail(snappymail)/
|
||
resolver/oletools/dkim... (~11). NO backupbot label → P4 N/A (recipe-PR-deferrable like immich) —
|
||
document in PARITY.md + DEFERRED, seek Adversary §7.1 sign-off OR file a backup recipe-PR.
|
||
- EXTRA_ENV needed: DOMAIN (harness sets), MAIL_DOMAIN, HOSTNAMES, TRAEFIK_STACK_NAME (cc-ci's
|
||
traefik stack name = traefik_ci_commoninternet_net), SITENAME, POSTMASTER, TLS_FLAVOR. Set
|
||
API=true + a MAILU API token if using the REST API; else use the admin-container CLI.
|
||
- Health: front serves; WEBROOT_REDIRECT=/webmail. HEALTH_PATH candidate `/admin` (login 200) or
|
||
`/` (302→/webmail). admin healthcheck is DISABLED in compose → rely on front + HTTP probe.
|
||
- §4.3 functional: create-an-object+read-back via the admin container CLI (headless, reliable):
|
||
exec_in_app(service="admin") `flask mailu domain <MAIL_DOMAIN>` + `flask mailu user <u> <domain>
|
||
<pw>` → read back via `flask mailu user` list / admin API → assert mailbox exists. Distinctive #2:
|
||
real mail flow — SMTP send (smtp service) → IMAP retrieve (imap service) of a unique-marker mail;
|
||
reachability likely needs host-published mail ports (like mumble host-ports) OR exec inside the
|
||
container using swaks/openssl. Simpler distinctive #2 if SMTP/IMAP host-reach is hard: create a
|
||
2nd domain/alias via CLI + verify, or assert the admin API lists the created user.
|
||
- recipe_meta: DEPLOY_TIMEOUT generous (multi-service); confirm version tags for the upgrade tier.
|
||
- Build next iteration (fresh context): scaffold tests/mailu/, smoke deploy install,custom to find
|
||
the exact `flask mailu` invocation + health path + mail-port reachability, then add §4.3 tests.
|
||
|
||
## 2026-05-29 — mailu (Q4.9) deeper recon: TLS/certdumper friction noted
|
||
- Services: `app`=ghcr.io/mailu/nginx (the front/web+mail proxy), `db`=redis:8.0.3-alpine (redis, NOT
|
||
a SQL DB — mailu admin uses sqlite at /data inside the admin container), `admin`=ghcr.io/mailu/admin
|
||
(mgmt API + `flask mailu` CLI), imap(dovecot), smtp(postfix), antispam(rspamd), webmail, **certdumper**
|
||
(ldez/traefik-certs-dumper). All images PULLABLE (ghcr.io/mailu/* + redis + ldez). NO backupbot → P4 N/A.
|
||
- **FRICTION (cc-ci-specific): certdumper expects traefik's ACME acme.json** (it dumps certs from
|
||
traefik_letsencrypt volume for the mail ports' TLS). cc-ci uses a FILE-PROVIDER wildcard cert, NOT
|
||
ACME (Class-A1, ACME forbidden) → no acme.json → certdumper likely never converges → services_converged
|
||
False → install "fails". MITIGATION to try: set TLS_FLAVOR (mailu env) to `notls` (disables mail TLS,
|
||
no cert needed) or `mail-letsencrypt`→ avoid; OR drop certdumper from COMPOSE_FILE if the recipe allows;
|
||
OR provide the cc-ci wildcard cert files to mailu's expected path. Smoke deploy will reveal whether
|
||
certdumper blocks convergence; START with TLS_FLAVOR=notls in EXTRA_ENV. The web/admin HTTP path
|
||
(traefik file-provider wildcard) works regardless; functional create-mailbox is via the admin CLI
|
||
(no mail-TLS needed). SMTP/IMAP send-receive distinctive test may need TLS_FLAVOR handled.
|
||
- Versions: 1.1.0/1.1.1/2.0.0/3.0.0/3.0.1; prev=3.0.0+2024.06.27 → head 3.0.1+2024.06.37 (real upgrade).
|
||
- Build approach: EXTRA_ENV callable(domain)→{MAIL_DOMAIN:domain, HOSTNAMES:domain, TRAEFIK_STACK_NAME:
|
||
"traefik_ci_commoninternet_net", SITENAME:"ccci", POSTMASTER:"admin", TLS_FLAVOR:"notls"}. Smoke
|
||
install,custom first to confirm convergence (esp. certdumper) + find `flask mailu` syntax + health path.
|
||
|
||
## 2026-05-29 — drone (Q4.10) investigation: needs a gitea SCM dep + OAuth + build-trigger pipeline
|
||
drone = single `app` (drone/drone:2.26.0), HEALTH=/healthz, NO backupbot (P4 N/A), real upgrade tags
|
||
(1.8.0+2.25.0→1.9.0+2.26.0). KEY: drone is a CI server that REQUIRES exactly one SCM provider — the
|
||
base compose's drone.env.tmpl only sets DRONE_RPC_SECRET; the SCM (DRONE_GITEA_CLIENT_ID/SERVER +
|
||
client_secret) is supplied by compose.gitea.yml. drone's server FATALs without an SCM provider
|
||
configured, so it cannot even BOOT standalone. gitea recipe IS fetchable (dep-deployable).
|
||
**Full §4.3 enrollment cost (the heaviest of any §5 recipe):**
|
||
1. Deploy gitea as a DEP (deps.py — but gitea is a full git service, heavier than keycloak).
|
||
2. Create a gitea OAuth2 application via the gitea admin API → client_id + client_secret.
|
||
3. Wire DRONE_GITEA_SERVER/CLIENT_ID + client_secret secret into drone (compose.gitea.yml +
|
||
install_steps), then drone boots.
|
||
4. §4.3 "create/list builds" needs a drone USER API TOKEN — which drone only issues AFTER an OAuth
|
||
login flow against gitea (headless OAuth consent is itself complex), PLUS a synced repo with a
|
||
.drone.yml PLUS a push/webhook to trigger a build. That is a full CI-trigger pipeline, multi-system.
|
||
**Assessment:** deploying drone+gitea (boot+/healthz) is achievable; the §4.3 create-an-object (a
|
||
build) requires OAuth-token + repo-sync + webhook-trigger infra that is disproportionate. §7.1 says
|
||
"needs another app"/"needs SSO" are NOT valid excuses (dep resolver exists) — but drone's blocker is
|
||
the OAuth-token + build-trigger PIPELINE, beyond a simple dep. **Proposed: build the gitea-dep +
|
||
OAuth-at-install wiring so drone BOOTS (install+upgrade green + a health/version/SCM-config functional
|
||
= maximal subset), and DEFER the build-creation §4.3 with a DEFERRED.md entry + Adversary §7.1
|
||
sign-off** (the create-build pipeline is a dedicated unit). Decide next iteration; gitea-dep wiring is
|
||
the main effort. Do NOT deploy concurrently with the Adversary's mailu cold-verify.
|
||
|
||
## 2026-05-29 — drone+gitea integration FULLY SCOPED (execute next iteration)
|
||
Confirmed mechanics:
|
||
- `deps.py::deploy_deps` is GENERIC (deploys any dep recipe by name + waits health; reads
|
||
tests/<dep>/recipe_meta.py EXTRA_ENV/HEALTH via meta_for). So DEPS=["gitea"] works, BUT gitea needs
|
||
config: gitea ships `COMPOSE_FILE=compose.yml:compose.mariadb.yml` (app + mariadb, 2 services) and
|
||
uses GITEA_DOMAIN for ROOT_URL/OAuth redirects — defaults to gitea.example.com, so a dep deploy
|
||
needs GITEA_DOMAIN pinned to the per-run dep domain.
|
||
- gitea: `INSTALL_LOCK=true` (no web installer), NO auto-admin user via env. Admin must be created via
|
||
the gitea CLI in the app container: `gitea admin user create --admin --username ccci --password <pw>
|
||
--email ccci@ci.local --must-change-password=false`, then a token: `gitea admin user
|
||
generate-access-token -u ccci --scopes 'write:application,write:user' --raw` (gitea ≥1.19 syntax).
|
||
- drone OAuth: drone needs DRONE_GITEA_SERVER=https://<gitea-dep-domain> + DRONE_GITEA_CLIENT_ID + a
|
||
`client_secret` swarm secret (compose.gitea.yml). Create the gitea OAuth2 app via API:
|
||
`POST https://<gitea>/api/v1/user/applications/oauth2` (header Authorization: token <admintoken>)
|
||
body {name, redirect_uris:["https://<drone-domain>/login"], confidential_client:true} → returns
|
||
{client_id, client_secret}.
|
||
INTEGRATION PLAN (execute fresh):
|
||
1. tests/gitea/recipe_meta.py: EXTRA_ENV(domain)→{GITEA_DOMAIN:domain, GITEA_DISABLE_REGISTRATION:"true"}
|
||
(+ any required), HEALTH_PATH="/" HEALTH_OK=(200,302), DEPLOY_TIMEOUT~900. (gitea as a dep app.)
|
||
2. tests/drone/recipe_meta.py: DEPS=["gitea"]; EXTRA_ENV(domain)→ COMPOSE_FILE="compose.yml:compose.gitea.yml",
|
||
DRONE_USER_CREATE="username:ccci,admin:true" (match the gitea admin username so drone admin maps),
|
||
GITEA_DOMAIN=<dep domain> (from deps file at install_steps time — so EXTRA_ENV may need the dep
|
||
domain, which isn't known until deps deploy → use install_steps for the dep-dependent env, like the
|
||
keycloak OIDC-at-install pattern). HEALTH_PATH="/healthz" HEALTH_OK=(200,). Likely OIDC_AT_INSTALL-style.
|
||
3. tests/drone/install_steps.sh: read $CCCI_DEPS_FILE for gitea dep domain; exec into the gitea dep
|
||
container to create admin+token (or via API); create the OAuth2 app → client_id/secret; `abra app
|
||
secret insert drone client_secret v1 <secret>`; env_set DRONE_GITEA_CLIENT_ID + GITEA_DOMAIN into
|
||
drone .env; then the single drone deploy boots with gitea SCM. (Mirror lasuite OIDC-at-install: the
|
||
orchestrator deploys the dep BEFORE drone when OIDC_AT_INSTALL+DEPS; install_steps wires it.)
|
||
NOTE: install_steps runs in the drone deploy_app; the gitea dep must be deployed FIRST — verify the
|
||
orchestrator's OIDC_AT_INSTALL path deploys deps before the parent (it does: _provision_deps before
|
||
deploy when oidc_at_install). May need to generalize that flag (e.g. DEPS_AT_INSTALL) for non-OIDC.
|
||
4. §4.3 build-creation (create/list builds): DEFER — needs drone user OAuth token (drone issues tokens
|
||
only post-OAuth-login against gitea; headless OAuth consent is complex) + a synced repo + .drone.yml
|
||
+ a push/webhook trigger. DISPROPORTIONATE pipeline. Ship MAXIMAL SUBSET: drone boots with gitea SCM
|
||
(install+upgrade+health/healthz + a functional test asserting drone serves /healthz 200 and the
|
||
login page advertises gitea SSO, proving SCM configured). DEFERRED.md entry + Adversary §7.1 sign-off
|
||
for the build-trigger pipeline. SMOKE-FIRST: manually deploy gitea→create OAuth app→deploy drone wired
|
||
→confirm /healthz, before writing test code (nail the gitea CLI/API calls).
|
||
This is the heaviest Phase-2 integration; budget multiple iterations. Hold deploys if Adversary active.
|
||
|
||
---
|
||
## 2026-05-29T22:4x — immich Q3.5 P4 decision: recipe-PR (add postgres backup), not N/A
|
||
|
||
Resumed loop. Adversary checkpoint (REVIEW-2 `af94708`) confirms my own finding: immich's P4 restore
|
||
is RED and unsigned. Root-caused it directly on cc-ci:
|
||
- immich's `backupbot.backup` label sits ONLY on the `app` service, whose sole data volume `uploads`
|
||
is `backupbot.volumes.uploads=false` (excluded), and the two other excluded names (model-cache,
|
||
external_storage) aren't even on `app`. → app backs up nothing.
|
||
- the `database` (postgres, DB_USERNAME=postgres/DB_DATABASE_NAME=immich) service has NO backupbot
|
||
label and NO pg_dump hook. → the postgres DB is NOT backed up at all.
|
||
- No `abra.sh`, no top-level `configs:` section. So immich-as-published produces a backup containing
|
||
no restorable application data. My P4 ci_marker (postgres row) therefore cannot survive restore —
|
||
the test correctly detected a genuine, serious upstream deficiency (immich users get NO DB backup).
|
||
|
||
**WHY recipe-PR over §7.1 N/A sign-off:** immich is THE object-storage/large-volume D10 category
|
||
recipe — its entire purpose is storing user data. A P4-N/A here (unlike mailu's mail-relay N/A) would
|
||
be hollow: the data path is exactly what must be proven to survive. cc-ci exists to catch precisely
|
||
this class of bug; the recipe mirror+PR flow (§0b/§4.1) is the sanctioned mechanism, and the durable
|
||
fix was already filed as the immich Q3.5 deferral. So: author a recipe-PR adding a `database`-service
|
||
postgres backup (mirroring matrix-synapse's `/pg_backup.sh` config-mount + backupbot pre/restore
|
||
hooks), then `!testme`/`RECIPE=immich PR=<n>` proves P4 green on the fixed recipe.
|
||
|
||
**Reference pattern (matrix-synapse compose.yml):** db service `deploy.labels`:
|
||
`backupbot.backup.pre-hook="/pg_backup.sh backup"`, `backupbot.backup.volumes.postgres.path="backup.sql"`,
|
||
`backupbot.restore.post-hook="/pg_backup.sh restore"`; `configs: [{source: pg_backup, target:
|
||
/pg_backup.sh, mode: 0555}]`; top-level `configs.pg_backup.file=pg_backup.sh`. The script: backup =
|
||
`pg_dump -U $USER $DB | gzip > /var/lib/postgresql/data/backup.sql`; restore = drop+recreate+reimport.
|
||
|
||
**immich-specific risk to validate empirically BEFORE the PR:** the postgres image is VectorChord/
|
||
pgvecto.rs (custom extensions). A naive single-DB pg_dump|psql restore may choke on the vector
|
||
extension and on the live immich-server's held connections. So I'm deploying immich (install) now and
|
||
will test seed→dump→drop→restore→verify directly in the `database` container to pin down the exact
|
||
dump/restore commands (likely `pg_dumpall --clean --if-exists` and connection-termination on restore)
|
||
that round-trip the ci_marker, then bake the proven commands into pg_backup.sh. No "should work".
|
||
|
||
---
|
||
## 2026-05-30T~23:22 — Q3.5 immich CLAIMED; remaining-recipe scope (backup-capability survey)
|
||
|
||
immich P4 done the right way: recipe-PR `recipe-maintainers/immich#1` (mechanism validated live, then
|
||
full lifecycle green `/root/ccci-immich-prfull.log` — 5 tiers + 3 custom, deploy-count=1, clean
|
||
teardown). Added a genuine 2nd P3 functional test (asset-processing: exifInfo metadata + library
|
||
statistics) so the §4.3 ≥2-tests floor is met by separate test functions, not one test doing double
|
||
duty (avoids the bluesky F2-8 "floor BYPASSED" failure mode). Claimed `0487631`.
|
||
|
||
**Remaining Phase-2 work (post-immich), by node-contention class.** The Adversary will cold-verify
|
||
immich next (full ~30min run; MAX_TESTS=1) so I should NOT start a heavy deploy until it frees.
|
||
|
||
Backup-capability survey (just done) of the 4 overlay-less recipes — ALL backup-capable, so P4
|
||
data-integrity overlays are REQUIRED (not N/A like mailu):
|
||
- **ghost** — app vol `/var/lib/ghost/content` (path) + mysql `mysqldump --tab` pre-hook. P4: seed a
|
||
ghost post (mysql) OR content marker. Also owes §4.3 create-post (named Adversary standing
|
||
condition) — needs Ghost owner-setup + admin token. Heavy (~15-20min cold start).
|
||
- **bluesky-pds** — `backupbot.backup=true` on pds svc (data volume: sqlite account repos + blobs).
|
||
P4: create account+post (goat), backup, wipe, restore, assert the post/account survive. (F2-8 was
|
||
about the §4.3 floor; bluesky already has 4 functional tests incl. account+post round-trip.)
|
||
- **uptime-kuma** — default sqlite data-vol backup (mariadb variant has dump hooks). P4: create a
|
||
monitor, backup, restore, assert. Also owes §4.3 create-monitor (deferred — needs a Socket.IO
|
||
client primitive in harness; uptime-kuma's setup wizard + monitor CRUD are Socket.IO, not REST).
|
||
- **mattermost-lts** — app `/mattermost` + postgres `pg_dump` pre-hook. P4: create team/channel/
|
||
message, backup, restore, assert. Also owes §4.3 create-message read-back depth.
|
||
|
||
Overlay-complete, need only a formal green-run + gate claim: **matrix-synapse**, **lasuite-docs**
|
||
(dep: keycloak). **plausible** needs a cold green run when the upstream clickhouse-backup GitHub
|
||
rate-limit cools (deploy converges) — preserve the log. **discourse** + **drone** remain BLOCKED
|
||
(upstream bitnami images gone / operator /etc/timezone host-deploy).
|
||
|
||
NEXT unblocked unit (when node free): pick a recipe and take it to a claim. Suggest order by ease:
|
||
matrix-synapse (overlay-complete → just run+claim) → bluesky-pds P4 overlay → mattermost-lts P4 →
|
||
ghost (P4 + §4.3 create-post) → uptime-kuma (P4 + Socket.IO §4.3). Keep heavy deploys sequential.
|
||
|
||
---
|
||
## 2026-05-30T~23:59 — Q4.1 matrix-synapse: post-restore register-500 root cause + fix; CLAIMED
|
||
|
||
First full run: install/upgrade/backup/restore green but custom `test_register_two_users_send_receive_
|
||
message` FAILED — synapse `HTTP 500 M_UNKNOWN` on the shared-secret admin register POST (nonce GET 200,
|
||
so endpoint enabled). A fresh `STAGES=install,custom` reproduce PASSED → not deterministic; the
|
||
differentiator is the FULL lifecycle's tier order (custom runs right after restore).
|
||
|
||
**Root cause (PROVEN via synapse log capture `/root/matrix-synapse-debug.log`):** the restore tier
|
||
runs pg_backup.sh `restore` = `DROP DATABASE … WITH (FORCE)` + recreate + reimport. The FORCE drop
|
||
**terminates synapse's live postgres connections** (`server closed the connection unexpectedly` /
|
||
`psycopg2.InterfaceError: connection already closed` at the restore timestamp). For a few seconds
|
||
synapse is re-establishing its connection pool; a registration is a DB *write*, so it 500s — while
|
||
HTTP health (`/_matrix/client/versions`, a read) is already green. A classic "health-green but not
|
||
write-ready after restore" window.
|
||
|
||
**Fix (NOT a weakening — readiness robustness per plan §4.2/§9):** `_admin_register` now polls —
|
||
re-fetch a fresh nonce + re-POST on 5xx/transport-error, ≤90s, then RAISE; a 4xx (real rejection) is
|
||
fail-fast. The asserted behaviour is identical (two users register + send/receive a message); only the
|
||
bounded post-restore recovery window is tolerated, and it logs each retry so the transient is visible.
|
||
Validated: full run 2 (`/root/ccci-matrix-full2.log`) GREEN — `[register] …: POST transient 500
|
||
(attempt 1) → succeeded (attempt 2)`, all 5 tiers pass, deploy-count=1, clean teardown. Claimed
|
||
`9a8850a`. (This is a general pattern other DB-write functional tests may need after the restore tier;
|
||
noted for the remaining recipes.)
|
||
|
||
---
|
||
## 2026-05-30T~00:30 — Q4.5 mattermost-lts: P4 overlay caught a real recipe restore defect
|
||
|
||
Authored the mattermost P4 overlay (ops.py postgres ci_marker + test_install/upgrade/backup/restore).
|
||
First run failed on a self-inflicted bug: the postgres service is named `postgres`, not `db` (I misread
|
||
compose; `exec_in_app(service="db")` → "no running container"). Fixed (commit 012a477), re-ran.
|
||
|
||
Re-run: install+upgrade+backup+custom GREEN (ci_marker survives the upgrade chaos crossover
|
||
2.1.9+10.11.15→2.1.10+10.11.18, captured "original" at backup; all 3 functional tests pass incl.
|
||
create_message_roundtrip §4.3). **restore FAILED**: after `abra app restore`, `relation "ci_marker"
|
||
does not exist`.
|
||
|
||
**Root cause = recipe defect (same class as immich, different shape).** mattermost's `postgres`
|
||
service backs up via a pg_dump pre-hook (→ /var/lib/postgresql/data/postgres-backup.sql) + archives the
|
||
whole PGDATA dir (`backup.path=/var/lib/postgresql/data/`), but ships **NO `backupbot.restore.post-hook`**.
|
||
backupbot's restore extracts the archived files into the volume, but the RUNNING postgres doesn't
|
||
reload PGDATA without a restart, so the live DB keeps the post-drop (pre_restore) state → the seeded
|
||
marker is gone. The logical dump is in the archive but never reimported.
|
||
|
||
**Fix = recipe-PR (immich pattern):** add a `backupbot.restore.post-hook` that reimports the dump into
|
||
the live DB deterministically — terminate connections → DROP DATABASE … WITH (FORCE) → createdb →
|
||
`psql -f postgres-backup.sql`. (Validate the mechanism live first, like immich, since the dump is a
|
||
plain pg_dump reimported into a fresh DB.) Mirror+PR `recipe-maintainers/mattermost-lts`, then
|
||
`RECIPE=mattermost-lts PR=<n>` proves restore green. QUEUED as the next mattermost unit.
|
||
|
||
This is the 2nd recipe (after immich) where the P4 data-integrity overlay caught a genuine
|
||
backup/restore defect — strong evidence the phase's P4 requirement is doing real work. The remaining
|
||
backup-capable recipes (bluesky-pds, uptime-kuma, ghost) should be assumed similarly suspect until their
|
||
restore is proven to round-trip seeded data.
|
||
|
||
---
|
||
## 2026-05-30T~01:40 — Q4.5 mattermost PASS (3rd this session); next: bluesky-pds P4 (scoped)
|
||
|
||
Session tally: immich Q3.5 PASS (recipe-PR adds DB backup), matrix-synapse Q4.1 PASS (post-restore
|
||
DB-pool race fix), mattermost-lts Q4.5 PASS (recipe-PR fixes no-op restore; negative control proved
|
||
teeth). Two recipe-PRs fixing real coop-cloud data-loss bugs (immich + mattermost), both Adversary-
|
||
verified non-vacuous via PR=0 negative controls.
|
||
|
||
**NEXT: bluesky-pds P4 (Q4.3 already has strong functional; only the P4 data-integrity overlay is
|
||
missing).** Recipe shape: service `app` (pds 0.4) mounts `pds_data:/pds` (PDS_DATA_DIRECTORY=/pds;
|
||
atproto account/repo sqlite + blobs under /pds/blocks). `backupbot.backup=true` on `app`, NO
|
||
backup.path / pre-hook / restore post-hook → whole-volume file-level backup (same shape as mattermost's
|
||
broken PGDATA backup). **Design decision for the P4 marker — DON'T use a bare /pds/ci_marker FILE:**
|
||
the PDS doesn't hold a loose file open, so a file marker would survive restore even if the running PDS
|
||
fails to reload its restored sqlite — i.e. it would NOT catch the "running app holds the data files"
|
||
class of bug (which IS what bit mattermost/immich). To have teeth, seed RECIPE-AWARE data: create an
|
||
atproto account (unique handle, via the PDS API like the §4.3 test / `com.atproto.server.createAccount`
|
||
with an admin-minted invite code), `test_backup` asserts it resolves (`com.atproto.repo.describeRepo`),
|
||
`pre_restore` deletes it (`com.atproto.admin.deleteAccount`, admin auth via pds_admin_password) so a
|
||
successful restore is OBSERVABLE, `test_restore` asserts the account resolves again. Expect this MAY
|
||
reveal the same running-app-holds-sqlite restore gap → if so, recipe-PR (restart the pds on restore,
|
||
or a sqlite-aware restore hook). Deploy-test first to find out (don't assume).
|
||
- After bluesky: uptime-kuma (sqlite data-vol + Socket.IO §4.3 create-monitor) and ghost (mysql
|
||
backup + §4.3 create-post) remain; then plausible (clickhouse rate-limit) cold green; discourse/drone
|
||
stay BLOCKED. Then Q5 (docs + DONE).
|
||
|
||
Checkpointing here (node clean, no gate pending — all 3 claims this session PASSed) to take bluesky
|
||
fresh next cycle; the analysis above lets it start at the overlay, not the investigation.
|
||
|
||
## 2026-05-30 — Q4.4 ghost: P3 create-post GREEN + P4 non-vacuous; migration-lock deadlock + +U fixes
|
||
|
||
Authored ghost P4 overlays (MySQL `ci_marker` in the `ghost` DB — recipe is MySQL not sqlite; stale
|
||
comment) + §4.3 create-post round-trip (cookie-aware Admin API client `_ghost.py`). Run-4 results
|
||
(`/root/ccci-ghost-4.log`): deploy-count=1; install/backup/custom PASS; `test_create_post_roundtrip
|
||
PASSED (22s)`; P4 upgrade+backup markers PASS; restore RED (real recipe gap — no reimport-on-restore).
|
||
|
||
**Why two deploys failed first (NOT test issues):**
|
||
1. **migrations_lock deadlock.** Ghost's fresh-DB first boot runs a ~6-9min schema migration (dozens
|
||
of CREATE TABLEs, each a separate MySQL round-trip — round-trip-bound, NOT CPU-bound: hit on BOTH
|
||
2- and 4-vCPU). The recipe healthcheck `start_period:1m` (+10×30s ≈ 6min grace) marks the still-
|
||
migrating task unhealthy → swarm kills it mid-migration → leaves `migrations_lock.locked=1,
|
||
released_at=NULL` → every later task boots, sees the held lock, refuses (`MigrationsAreLockedError`)
|
||
→ permanent deadlock. Bumping the abra TIMEOUT does NOT help (the lock never clears). FIX: a cc-ci
|
||
DEPLOY overlay `compose.ccci-health.yml` raising the app healthcheck start_period to 900s (failures
|
||
ignored during it; a PASS still marks healthy at once) so the fresh migration finishes + releases
|
||
the lock. Wired via recipe_meta COMPOSE_FILE + install_steps.sh + CHAOS_BASE_DEPLOY. NOT a test
|
||
change — the real healthcheck still gates readiness. Validated: migration ran past the old kill
|
||
point, install converged 1/1. (Operator bumped the VM 2→4 vCPU mid-session; didn't fix this — the
|
||
migration is round-trip-bound — but made everything else snappier.)
|
||
2. **`+U` chaos-version marker.** The untracked overlay makes abra stamp `chaos-version='<commit>+U'`
|
||
(U=untracked). The commit equals head_ref (HC1 satisfied) but `+U` broke assert_upgraded's exact-
|
||
prefix match → spurious upgrade FAIL. FIX: strip the working-tree-state marker before the commit
|
||
match (commit identity still enforced; HC1 preserved). mumble dodged this only because its overlay
|
||
is tracked natively in newer versions; cc-ci overlays generally aren't → general harness fix.
|
||
|
||
**P4 restore gap (real recipe defect → recipe-PR):** ghost db service has `mysqldump --tab` backup
|
||
pre-hook but NO `backupbot.restore.*` hook, and the mysql data volume isn't backupbot-labelled → the
|
||
dump is restored to disk but never reimported → dropped `ci_marker` doesn't return. Non-vacuous
|
||
(backup PASS with marker, restore RED). Same class as immich#1 / mattermost-lts#1. FIX = recipe-PR
|
||
adding a mysql dump+reimport hook (mirror mattermost `pg_backup.sh` → `mysql_backup.sh`). Ghost not
|
||
yet mirrored on gitea (404) → mirror first (plan §0b), then PR, then final green run, then claim.
|
||
|
||
## 2026-05-30T19:53Z — ghost F2-14b full4 timeout → DEPLOY_TIMEOUT bump (full5)
|
||
|
||
full4 (`/root/ccci-ghost-full4.log`, committed db-grace overlay 3ca45c7) FAILED at the base deploy:
|
||
`abra app deploy ghos-9431a1... -o -n -C` timed out after 1200s; RUN SUMMARY install:fail, rest skip.
|
||
|
||
Root cause (inspected live swarm, not guessed): db (mysql:8.0) converged 1/1 healthy — the db-grace
|
||
overlay (15m start_period) successfully prevented the prior mysql-redo-corruption deadlock. But the
|
||
app crash-looped 4-5× with exit(2) = `connect ECONNREFUSED 10.0.5.5:3306` (knex-migrator can't reach
|
||
mysql) during mysql's ~6min fresh-dir init; once mysql was ready (~19:36) the app task `hwfixm5`
|
||
started a clean migration (`Creating table: email_recipients` @19:46:45, `email_recipient_failures`
|
||
@19:47:38 — late-stage tables). abra's deploy subprocess (DEPLOY_TIMEOUT=1200, started ~19:31) was
|
||
killed at ~19:51 while migration was still finishing (app 0/1). So wall-time = mysql init (~6min) +
|
||
schema migration (~9-15min under load) exceeded the 20min window. full3 (17:23) squeaked under it;
|
||
full4 was slower (host load variance). The crash-loops lose NO migration progress (they precede any
|
||
migration — pure can't-connect), so the only cost is the mysql-init head start.
|
||
|
||
Fix (4a160f6): bump ghost DEPLOY_TIMEOUT + EXTRA_ENV TIMEOUT 1200→2400s (matches discourse). Not a
|
||
test weakening — the wait is bounded; a genuine hang still fails at 40min. Teardown after full4 was
|
||
clean (no leftover stack/volume/secret). Re-running as full5.
|
||
|
||
## 2026-05-30T20:10Z — ghost full5: P4 restore RED (ci_marker table absent post-restore) — investigating
|
||
|
||
full5 (`/root/ccci-ghost-full5.log`, 2400s timeout): deploy-count=1, install/upgrade/backup/custom
|
||
PASS, **restore FAIL**. `test_restore_returns_state`: `Table 'ghost.ci_marker' doesn't exist` after
|
||
restore (generic test_restore_healthy PASSED → app up; my P4 overlay caught a data-integrity gap).
|
||
|
||
Recipe-PR head ae43ffe DOES ship the reimport hook (verified ~/.abra/recipes/ghost on cc-ci):
|
||
compose db service has `backupbot.backup.pre-hook=/mysql_backup.sh backup`,
|
||
`backupbot.backup.volumes.mysql.path=backup.sql.gz`, `backupbot.restore.post-hook=/mysql_backup.sh
|
||
restore`; mysql_backup.sh restore = `gunzip -c /var/lib/mysql/backup.sql.gz | mysql -u root`.
|
||
|
||
Puzzle: full3 (17:23, app-ONLY overlay, db@native 1m) was FULLY GREEN incl restore; full5 (after
|
||
3ca45c7 added db@15m grace) regressed on restore. db-grace was observed-necessary (run#2 mysql-init
|
||
exit-137 redo-corruption deadlock under load), so I can't just drop it. But db start_period only
|
||
changes WHEN swarm marks unhealthy — it shouldn't mechanically break the reimport. So leading
|
||
hypotheses: (a) load-dependent flake in backupbot restore / the reimport; (b) recipe-hook robustness
|
||
gap — `gunzip -c | mysql` has `set -e` but NOT `set -o pipefail`, so a failed/empty gunzip silently
|
||
reimports nothing yet returns 0. Action: full6 re-run + instrument the restore tier live (capture
|
||
backupbot restore output, backup.sql.gz presence, whether reimport populated ci_marker). NOT claiming
|
||
ghost until restore is reliably green. Stack/vol teardown after full5 was clean.
|
||
|
||
## 2026-05-30T20:30Z — ghost full6 restore RED again → SYSTEMATIC (db-grace correlated)
|
||
|
||
full6 (`/root/ccci-ghost-full6.log`): identical result to full5 — install/upgrade/backup/custom PASS,
|
||
restore FAIL (`ci_marker` absent post-restore). 2 fails WITH db@15m grace; full3 PASSED WITHOUT it
|
||
(db@native 1m). So systematic, correlated with the db-grace overlay block — NOT a flake.
|
||
|
||
Ruled out by direct check:
|
||
- Harness restore op = `abra app restore -n -C -o` → triggers backupbot restore + `restore.post-hook`.
|
||
- Compose merge (compose.yml + compose.ccci.yml) on cc-ci: merged db service RETAINS all backupbot
|
||
labels incl `backupbot.restore.post-hook=/mysql_backup.sh restore`; only start_period changes
|
||
(1m→15m). So the db overlay block does NOT drop the reimport hook.
|
||
- mysqldump backup.sql.gz (backup tier, contains ci_marker='original') is intact (backup test PASS).
|
||
|
||
So the reimport post-hook is configured + present yet ci_marker doesn't return ONLY when db
|
||
start_period=15m. Mechanism unclear by reasoning (start_period shouldn't keep a ready mysql
|
||
"starting"). Next: full7 with the restore tier WATCHED LIVE — db health state, `abra app restore`
|
||
output, backup.sql.gz presence, ci_marker immediately post-restore — to get the actual mechanism.
|
||
|
||
## 2026-05-30T21:15Z — ghost full8 INSTRUMENTED: DEFINITIVE root cause = db container cycles DURING backup op
|
||
|
||
Fixed the diag watcher (NixOS has no /bin/bash → must `bash gdw3.sh`) and captured db state every 4s
|
||
through full8's backup+restore tiers (`/root/ghost-diag8.log`). Decisive timeline (backup op):
|
||
21:08:43–51 db cid=93865743 repl=1/1 healthy
|
||
21:08:58 db cid=93865743 repl=1/1 **unhealthy**
|
||
21:09:03 repl=0/1 cid= (container GONE)
|
||
21:09:07–19 repl=0/1 **starting**, NEW cid=784ec680
|
||
21:09:24→ repl=1/1 healthy (new cid), stays healthy through the whole restore tier
|
||
So the db container is REPLACED during the BACKUP op (abra app backup create), well before the restore
|
||
tier. This races backupbot's volume enumeration (which resolves each volume path from running service
|
||
specs at backup time) → the mysql volume is intermittently omitted from the snapshot (proven earlier:
|
||
full5 snapshot had …_mysql/backup.sql.gz; full6/7 had only ghost_content). Restore then restores a
|
||
dump-less snapshot → reimport reads nothing → silent no-op (hook lacks `set -o pipefail`) → ci_marker lost.
|
||
|
||
Ruled OUT: not OOM (db has NO mem_limit; host 5.8G free), not healthcheck-timing (base hc retries=10 ×
|
||
interval=30s = 5min to unhealthy — impossible in the observed ~16s window; merged hc keeps test+retries).
|
||
So the cycle is driven BY the backup op, not a crash/healthcheck. The db-grace start_period overlay was
|
||
a RED HERRING for restore (the cycle is past start_period). Likely abra/backup-bot-two stops/restarts
|
||
the db to take a consistent volume snapshot; the omission is a timing race in that flow.
|
||
|
||
NEXT (precise): read backup-bot-two `/usr/bin/backup` backup flow (does it stop/cycle containers? how
|
||
does it enumerate volumes relative to that?) to confirm the cycle+capture interaction, THEN fix:
|
||
candidate = harness verifies the backup snapshot contains the db volume and retries if not, AND/OR the
|
||
recipe-PR backup is made resilient (+ `set -o pipefail` + fail-loud on missing dump so it can never be
|
||
silent again). 5 ghost runs done (full4 timeout-fixed; full5/6/7/8 restore race) — stop blind re-runs.
|
||
|
||
## 2026-05-30T21:18Z — backupbot backup flow read: enumerate-once → no retry recovers a dropped volume
|
||
|
||
Read backup-bot-two `/usr/bin/backup` `create`: it computes (pre_cmds, post_cmds, backup_paths) ONCE
|
||
via get_backup_details (which resolves each labelled volume's host path from the RUNNING service spec),
|
||
then runs pre_cmds (mysqldump via docker exec), then `backup_volumes(backup_paths, retries)` (restic),
|
||
then post_cmds. It does NOT stop/cycle the db. So the db cycle I observed during backup is swarm/mysqld,
|
||
NOT backupbot. Critically: backup_paths are enumerated ONCE up-front; if the db service is mid-cycle at
|
||
enumeration, its mysql path is omitted from backup_paths and abra's `--retries` (which only retries the
|
||
restic step) can NEVER recover it. So a per-restic retry is useless here.
|
||
|
||
FIX (chosen, harness-side, general for all DB recipes): after perform_backup, VERIFY the resulting
|
||
snapshot includes the db service's backupbot-labelled volume path; if missing, RE-INVOKE the whole
|
||
`abra app backup create` (fresh enumeration) up to N times. This closes the enumerate-during-cycle race
|
||
generally. Pair with recipe-PR mysql_backup.sh `set -o pipefail` + fail-loud-on-missing-dump so a
|
||
dump-less restore can never silently no-op again. (Still-open minor: the db cycle's own trigger during
|
||
backup — not OOM/not-healthcheck — left as a separate observation; the harness verify+retry makes the
|
||
backup correct regardless.) Implement next tick, then ghost full run to verify green incl upgrade.
|
||
|
||
## 2026-05-30T21:20Z — full8 = FLAKY-green (restore PASSED this time) → confirms intermittent race; NOT claiming
|
||
|
||
full8 final RUN SUMMARY: deploy-count=1, install/upgrade/backup/restore/custom ALL PASS — restore
|
||
PASSED despite the db cycling during backup (watcher saw the cycle, but the mysql volume made it into
|
||
the snapshot this time). So restore is a PURE intermittent race: full5/6/7 lost it (mysql volume
|
||
omitted → data loss), full8 won it. Merged db healthcheck confirmed retries=10/interval=30s intact
|
||
(not the cycle cause). A flaky-green is NOT a reliable PASS — the Adversary's cold re-run can hit the
|
||
failure, and an intermittently-broken P4 data-integrity test is a real defect (P7). NOT claiming ghost
|
||
on luck. Decision stands: implement the harness backup-integrity verify+re-invoke fix (next), then a
|
||
ghost run must pass restore RELIABLY (ideally confirm with 2 consecutive green incl upgrade) before claim.
|
||
|
||
---
|
||
## 2026-05-31T01:2x — discourse full4 timeout root-cause + full5 fixes (Builder)
|
||
Woke into the loop with discourse full4 in flight (PR head 3758522, STAGES=install,upgrade,backup,
|
||
restore,custom — the VETO-clearing run incl upgrade-to-latest). full4 FAILED at the BASE deploy:
|
||
`install: fail`, rest skipped; `abra app deploy disc-ce6450 ... timed out after 2400 seconds`.
|
||
|
||
Investigation:
|
||
- full2 (same REF, same overlay) base deploy SUCCEEDED (install+upgrade tiers passed) → the overlay
|
||
approach works; full4's timeout is flakiness at the convergence edge, not a config break.
|
||
- The recurring log line `service "sidekiq" depends on undefined service "discourse": invalid compose
|
||
project` comes from `abra app config --images` (the prepull step): the published recipe (base 0.7.0
|
||
AND PR head) has `sidekiq.depends_on: [discourse]`, but the main service is `app` — `discourse` is
|
||
undefined → config rc=15 → prepull SKIPPED → the 2.4GB image is pulled INLINE during deploy.
|
||
- On cc-ci the image was cached as `bitnamilegacy/discourse:<none>` (tag dangling) → the deploy
|
||
re-pulled 2.4GB, eating the convergence budget. Combined with the node being only **7 GiB RAM**
|
||
(not the 28 GiB the plan assumed) + load 6-7 on 4 vCPU during Rails asset-precompile, 40min was too
|
||
tight. (swarm IGNORES depends_on, so the dangling ref has zero runtime effect — full2 proves deploy
|
||
works despite it; it only breaks the prepull lint.)
|
||
|
||
Tried to fix prepull by overriding `sidekiq.depends_on:[app]` in the overlay (04cc44c). It does NOT
|
||
work: docker normalizes short-form depends_on to a map and map-merge is ADDITIVE → {discourse}+{app}
|
||
={discourse,app}, the bad key survives, config --images still rc=15. (My initial "rc=0" test was
|
||
bogus — `$?` after `| head` is head's exit code.) Reverted (8dfd8ed); overlay stays minimal.
|
||
|
||
full5 fixes (the ones that actually address the timeout):
|
||
1. Pre-cached `bitnamilegacy/discourse:3.3.1` by TAG on cc-ci (`docker pull`) — was dangling <none>;
|
||
now the inline pull during deploy is a no-op (layers present) → convergence not pull-bound.
|
||
2. DEPLOY_TIMEOUT/TIMEOUT 2400→3600 (recipe_meta) — headroom for the RAM/CPU-constrained Rails boot.
|
||
Cleaned full4's stray state (2 app.1 containers stuck "Removal In Progress" held the discourse_data
|
||
volume; cleared after the daemon finished removal; volume rm'd). Node verified clean before launch.
|
||
full5: `/root/ccci-discourse-full5.log`, PID 848184, REF 3758522, builder-clone @8dfd8ed.
|
||
|
||
---
|
||
## 2026-05-31T01:38Z — cc-ci VM went OFFLINE mid discourse full5 (likely OOM on 7-GiB node) (Builder)
|
||
At the 01:38 poll, `ssh cc-ci` timed out; `ping 100.90.116.4` 100% loss; `tailscale status` shows
|
||
`cc-nix-test 100.90.116.4 ... active; relay "nyc"; offline`. My orchestrator host + b1 (hypervisor)
|
||
are online — only the cc-ci VM dropped off. Last good state (01:33): discourse app attempt-2 in
|
||
"Populating database" (Rails migration), health=starting. Strong hypothesis: the 7-GiB node OOM'd /
|
||
thrashed under discourse's migration+asset-precompile (Rails/ember, memory-hungry) co-resident with
|
||
the CI infra (traefik/drone/dashboard/bridge/backups) AND a running warm-keycloak+db → tailscaled
|
||
starved → VM unresponsive. Tailnet membership intact (node exists, just offline) → recoverable, not a
|
||
class-A1 blocker yet. Polling for recovery; if it doesn't come back in ~15-20min it's an operator
|
||
reboot (b1 VM) → STATUS Blocked. Root-cause implication regardless: discourse is too heavy for this
|
||
node co-resident with warm-keycloak — need to shed memory (stop warm-keycloak before discourse, and/or
|
||
mem-limit the discourse build) before re-running, else this recurs.
|
||
|
||
---
|
||
## 2026-05-31T04:2xZ — RESUMED (spend limit lifted): cc-ci now = Hetzner node; discourse full6 setup (Builder)
|
||
Woke into the loop after the spend pause. Re-oriented from STATUS-2/REVIEW-2/JOURNAL-2.
|
||
|
||
**Node migration (prior session, undocumented until now):** `ssh cc-ci` no longer targets the b1-hosted
|
||
`cc-nix-test` VM (100.90.116.4 — now tailnet-OFFLINE, the 7-GiB node that OOM'd mid discourse full5).
|
||
It now targets the new **Hetzner cloud node** `cc-ci` = 100.95.31.88 (public 91.98.47.73), the
|
||
`cc-ci-hetzner` host added in commits 4237cc0/a216395 (nixos-infect). Confirmed: hostname `nixos`,
|
||
swarm node `cc-ci` Ready/Active/Leader, abra server `default` registered, CI infra stacks
|
||
(traefik/drone/dashboard/bridge/backups + warm-keycloak) all redeployed and running. `HCLOUD_TOKEN`
|
||
is in `.testenv` (Hetzner access available). **Caveat: the new node is STILL 4 vCPU / ~7.7 GiB RAM**
|
||
(MemTotal 7937188 kB, nproc 4) — same class as the old node, NOT bigger. So the discourse memory
|
||
constraint persists; the migration bought a reachable/declarative node, not more RAM.
|
||
|
||
**Fresh-node state:** root is persistent ext4 (150G, 7% used) but `/root/builder-clone`, the cached
|
||
discourse image, and recipe residue were all absent (fresh infect). Re-established builder-clone at
|
||
`origin/main` (a216395) via `git clone` (no submodules). abra + cc-ci-run are Nix-provided
|
||
(`/run/current-system/sw/bin`). No discourse/ghost stacks/volumes/secrets present → clean slate.
|
||
|
||
**discourse full6 setup (re-run of the OOM-lost full5, same committed shape):** recipe_meta at main
|
||
already carries the full upgrade-to-latest shape — UPGRADE_BASE_VERSION=0.7.0+3.3.1,
|
||
COMPOSE_FILE=compose.yml:compose.ccci.yml, CHAOS_BASE_DEPLOY=True, TIMEOUT/DEPLOY_TIMEOUT=3600,
|
||
BACKUP_VERIFY probe. compose.ccci.yml (bitnamilegacy re-pin + literal 20m start_period grace on the
|
||
0.7.0 base) + install_steps.sh both present and consistent. REF = discourse PR#1 head
|
||
3758522cf8702e97e88cd38d47165cf14defe74e (confirmed current via gitea API; branch ci/bitnamilegacy-repin).
|
||
**Memory-shed (the full5 root-cause fix):** stopped warm-keycloak (`docker stack rm`) — discourse needs
|
||
no SSO for STAGES=install,upgrade,backup,restore,custom. Result: available RAM 6.4→**7.0 GiB**, platform
|
||
stacks total ~70 MiB (traefik 33 / drone 7 / dashboard 13 / bridge 14 / backups 2). discourse now gets
|
||
nearly the whole node vs competing with keycloak's ~700MB java during asset-precompile. Pre-pulling
|
||
`bitnamilegacy/discourse:3.3.1` by TAG (full5 fix #1: inline deploy pull → no-op). Launch on image-ready.
|
||
|
||
---
|
||
## 2026-05-31T04:3xZ — RESUMED loop; consumed orchestrator inbox; launched discourse full6 (Builder)
|
||
Re-oriented from STATUS-2/REVIEW-2/JOURNAL-2. Consumed `machine-docs/BUILDER-INBOX.md` (orchestrator
|
||
heads-up, commit `c01225b`). **Re-baseline per the heads-up — my prior OOM/disk-starved/rate-limit notes
|
||
were about the OLD Incus box and are STALE:** the live `ssh cc-ci` is the new Hetzner box `cc-ci-hetzner`
|
||
(tailnet 100.95.31.88, public 91.98.47.73), NVMe, **~8 GB RAM**, **150 GB disk / ~135 GB free**,
|
||
**authenticated Docker Hub pulls** (no anon rate-limit). `df`/`free` re-checked: load ~0.08, 6 GiB avail,
|
||
6% disk. DNS for `*.ci.commoninternet.net` is mid-cutover to 91.98.47.73 (TTL ≤3h) — treat public-URL
|
||
flakes during the window as DNS, not a defect.
|
||
Node verified clean (no discourse/ghost stacks/volumes/secrets); warm-keycloak already shed; image
|
||
`bitnamilegacy/discourse:3.3.1` pre-cached by TAG. builder-clone fast-forwarded to origin/main.
|
||
**Launched discourse full6** (re-run of the OOM-lost full5, identical committed shape): `RECIPE=discourse
|
||
PR=1 REF=3758522cf8702e97e88cd38d47165cf14defe74e SRC=recipe-maintainers/discourse cc-ci-run
|
||
runner/run_recipe_ci.py` → `/root/ccci-discourse-full6.log`, PID 50718. Stages: install,upgrade,backup,
|
||
restore,custom (full upgrade-to-latest, required by the DONE VETO). prepull rc=15 (dangling
|
||
`sidekiq.depends_on:[discourse]`) is the known-harmless lint failure — image pre-cached, inline pull a
|
||
no-op. Polling ~5min per §7 case 1.
|
||
|
||
---
|
||
## 2026-05-31T04:5xZ — discourse full6 DONE (1 test bug) → fixed → full7 launched (Builder)
|
||
**full6 result** (`/root/ccci-discourse-full6.log`, deploy-count=1, REF 3758522):
|
||
- install: PASS · **upgrade: PASS** (upgrade-to-latest, the DONE-VETO requirement) · backup: PASS ·
|
||
restore: PASS (P4 ci_marker survived) · **custom: FAIL — only `test_create_topic_roundtrip`**
|
||
(health_check + site_basic PASS). Clean teardown (0 stacks/volumes).
|
||
- backup tier: `backup-verify FAILED (attempt 1/3) → re-ran → PASS` — the chaos-upgrade db-cycle race
|
||
(same class ghost hit); BACKUP_VERIFY retry converged, non-vacuous. `/pg_backup.sh No such file` on
|
||
attempt 1 was the racing db restart (pre-hook script present at PR head, exec hit a cycling container).
|
||
- create_topic failure was a **TEST BUG not an app defect**: Discourse 3.x disables uncategorized
|
||
topics by default → `POST /posts.json` w/o category 422s `"Category can't be blank"`. mint_admin
|
||
worked (ruby-PATH fix `8d689d6` confirmed good).
|
||
**Fix** (`1f92776`): enable `SiteSetting.allow_uncategorized_topics = true` in the existing Rails admin
|
||
bootstrap (`_discourse.py _BOOTSTRAP_RB`). Standard Discourse feature toggle, config-parity with a real
|
||
forum — NOT a weakening: the round-trip still posts a real topic + asserts a unique body marker survives
|
||
read-back. **full7** relaunched full lifecycle (`/root/ccci-discourse-full7.log`, PID 57983, builder-clone
|
||
@1f92776). On all-green → CLAIM Q4.6 (closes the discourse portion of the DONE VETO). Polling ~5min.
|
||
|
||
---
|
||
## 2026-05-31T05:0xZ — discourse full7: category fix worked, hit title_prettify; fixed → full8 (Builder)
|
||
**full7** (`/root/ccci-discourse-full7.log`, deploy-count=1): install/upgrade/backup/restore all PASS
|
||
again; custom still FAIL but **different + further** — the `allow_uncategorized_topics` fix WORKED (topic
|
||
created, topic_id returned, read back); new failure was Discourse's `title_prettify` capitalising the
|
||
title first letter (`'ccci topic …'` → `'Ccci topic …'`) tripping the exact-equality round-trip.
|
||
**Fix `588a087`:** send an already-capitalised title (`CCCI topic <uniq>`) so prettify is a no-op and
|
||
the exact round-trip stays faithful (unique hex token mid-string, untouched). NOT a weakening — still a
|
||
real create→read-back of a uniquely-marked topic. **full8** relaunched full lifecycle
|
||
(`/root/ccci-discourse-full8.log`, PID 65368, builder-clone @588a087). Node clean before launch
|
||
(disc-ce6450 fresh secrets, no collision). On all-green → CLAIM Q4.6. Polling ~5min.
|
||
|
||
---
|
||
## 2026-05-31T05:2xZ — mumble F2-14c implemented + run launched (Builder)
|
||
Discourse Q4.6 claimed (`dabcceb`); picked up the LAST DONE-VETO item, mumble F2-14c. Investigated the
|
||
mumble recipe tags (corrected an earlier tag-name slip): `0.1.0/0.2.0/1.0.0+v1.6.870-0`; `compose.mumbleweb.yml`
|
||
is on the 0.2.0 base, `compose.host-ports.yml` ONLY on 1.0.0. So the only cc-ci fork was the host-ports copy.
|
||
Implemented per the Adversary's disposition (see DECISIONS 2026-05-31): removed the fork +install_steps;
|
||
base 0.2.0 deploys minimally; new `UPGRADE_EXTRA_ENV` harness hook adds native host-ports on the
|
||
upgrade-to-latest; `READY_PROBE`/install-overlay self-gate the voice-port check to the host-ports phase via
|
||
`abra.env_get(COMPOSE_FILE)`; dropped CHAOS_BASE_DEPLOY. py_compile clean. Commit `4bf9e1d`. **Run launched:**
|
||
`RECIPE=mumble PR=0` → `/root/ccci-mumble-f214c.log`, PID 75792 (node clean). Expect: install pass (voice
|
||
overlay SKIPS on 0.2.0, generic HTTP serving passes), upgrade pass (COMPOSE_FILE switched, host-ports added,
|
||
ready-probe tcp 3x on latest), backup/restore pass (sqlite ci_marker), custom pass (handshake/web/config on
|
||
latest). Polling ~5min (exercises new harness code — watch base deploy + the upgrade env switch).
|
||
|
||
---
|
||
## 2026-05-31T05:2xZ — mumble F2-14c GREEN + CLAIMED (1461e44); DONE-VETO checklist complete (Builder)
|
||
mumble F2-14c run (`/root/ccci-mumble-f214c.log`) FULLY GREEN exactly as designed: deploy-count=1;
|
||
install pass (generic HTTP serving on 0.2.0 mumble-web; voice overlay SKIPPED on base w/ recorded
|
||
reason); upgrade pass (`upgrade-env: COMPOSE_FILE=...:compose.host-ports.yml` fired → `ready-probe OK
|
||
(tcp 3x): 127.0.0.1:64738` → crossover 0.2.0→1.0.0, chaos-version==head_ref 9fa5e949); backup/restore
|
||
pass (sqlite ci_marker); custom pass (all 5 voice/web/config tests on latest). PID gone, node fully
|
||
clean (0 stacks/vols/secrets/nets). Claimed F2-14c (`claim(` → watchdog pings Adversary).
|
||
**DONE-VETO checklist (REVIEW-2 @16:22:07Z) now fully addressed:** ghost F2-14b ✅PASS, discourse Q4.6
|
||
✅CLAIMED, mumble F2-14c ✅CLAIMED. Awaiting Adversary cold-verify of Q4.6 + F2-14c to clear the VETO.
|
||
**Remaining for Phase-2 DONE (P1 coverage):** plausible Q4.7b (recipe-PR: clickhouse-backup tarball
|
||
silent-wget defect → cache/retry/un-silence; full upgrade/backup/restore green) + drone Q4.10 (§7.1
|
||
sign-off granted; maximal gitea+drone subset run post host-rebuild). Both need the cc-ci node; HOLDING
|
||
deploys while the Adversary cold-verifies (single node, MAX_TESTS=1). Next: author plausible recipe-PR
|
||
offline, queue its validation run for when the node frees.
|
||
|
||
---
|
||
## 2026-05-31T05:3xZ — discourse Q4.6 PASS; fixed F2-15 (PARITY.md); mumble F2-14c verdict pending (Builder)
|
||
**Adversary cold-verified discourse Q4.6 = PASS** (REVIEW-2 `7525478` @05:34Z) — closes the discourse
|
||
portion of the DONE VETO. One finding **F2-15 [adversary]**: `tests/discourse/PARITY.md` missing (P2 §4.1
|
||
required file even though parity is genuinely N/A — no upstream discourse corpus). NOT a VETO item, does
|
||
not reopen Q4.6. **Fixed:** added `tests/discourse/PARITY.md` (N/A parity note + the 3 functional tests
|
||
[create-topic round-trip §4.3, site.json config, health] + P4 postgres ci_marker integrity + BACKUP_VERIFY
|
||
note + P6 advisory), modeled on ghost/mattermost-lts N/A PARITY.md; claims verified against the live test
|
||
bodies (site_basic asserts `categories` is a list; health GETs /srv/status). Left the F2-15 box for the
|
||
Adversary to close after re-check (only the Adversary closes [adversary] items). mumble F2-14c verdict
|
||
still pending; plausible Q4.7b + drone Q4.10 queued behind the node. Still parked on the F2-14c gate.
|
||
|
||
---
|
||
## 2026-05-31T05:4xZ — DONE-VETO checklist COMPLETE; executing plausible Q4.7b (Builder)
|
||
mumble F2-14c ✅PASS (`0d5d516` @05:26Z) + discourse Q4.6 ✅PASS (`7525478` @05:34Z) + ghost F2-14b done →
|
||
all 3 DONE-VETO upgrade-to-latest items Adversary-PASSED; F2-15 CLOSED. Adversary holds the VETO pending
|
||
remaining P1/Q5 (plausible Q4.7b, drone Q4.10, Q5 docs/sample). Node free post-verifies.
|
||
**plausible Q4.7b executed:** (1) mirrored `coop-cloud/plausible` → `recipe-maintainers/plausible`
|
||
(private; main + 4 tags; --mirror choked on upstream refs/pull/* → pushed heads+tags explicitly).
|
||
(2) recipe-PR `recipe-maintainers/plausible#1` (branch `ci/clickhouse-backup-resilient`, head
|
||
`bd8bd93d`): hardens `entrypoint.clickhouse.sh` — caches clickhouse-backup on the persistent
|
||
event-data:/var/lib/clickhouse volume, retry×5+backoff, best-effort `|| true` so a download failure never
|
||
blocks `exec /entrypoint.sh`, un-silenced. (3) **Full run launched** `RECIPE=plausible PR=1
|
||
REF=bd8bd93d SRC=recipe-maintainers/plausible` → `/root/ccci-plausible-q47b.log`, PID 83743 (node clean).
|
||
On the fresh-IP Hetzner box the first clickhouse-backup wget should succeed (no accumulated GitHub
|
||
throttle from the old box). Expect install (base 3.0.0)+upgrade(→PR head)+backup+restore+custom all green
|
||
(§4.3 event-tracking tests already proven green). Polling ~5min.
|