1234 lines
85 KiB
Markdown
1234 lines
85 KiB
Markdown
# JOURNAL — Phase 2 (per-recipe test authoring)
|
||
|
||
Builder-private (append-only). Builder rationalisations, dead-ends, in-the-moment reasoning. The
|
||
Adversary does NOT read this before forming a verdict; objective evidence goes in STATUS-2 / REVIEW-2.
|
||
Phase plan: `/srv/cc-ci/cc-ci-plan/plan-phase2-recipe-tests.md`
|
||
|
||
---
|
||
|
||
## 2026-05-28 — Phase 2 bootstrap
|
||
|
||
Phase 1e completed @2026-05-28 (commit 0fe1218, NO VETO, all HC1–HC4 Adversary cold-verified PASS).
|
||
Foundation is in place: the orchestrator deploys ONCE per run, performs each lifecycle op ONCE
|
||
(install→deploy / upgrade→chaos-redeploy of PR head / backup→`abra app backup` / restore→`abra app
|
||
restore`), and runs **both** generic (`tests/_generic/test_<op>.py`) and overlay
|
||
(`tests/<recipe>/test_<op>.py`) assertion files **additively** against the shared post-op state.
|
||
Pre-op seeds live in optional `tests/<recipe>/ops.py` (`pre_install`/`pre_upgrade`/`pre_backup`/
|
||
`pre_restore`). The deploy-count guard (DG4.1) stays =1; teardown is sacred. Per Phase-1e HC1, the
|
||
upgrade tier proves PR-head was deployed via `chaos-version` label = `head_ref` (head SHA from
|
||
$REF). Per HC2, repo-local PR-authored code runs only for recipes on
|
||
`tests/repo-local-approved.txt` (default-deny).
|
||
|
||
**Bootstrap (this session):**
|
||
1. `git pull --rebase` — already up to date.
|
||
2. Verified §1 access: `ssh cc-ci` OK (NixOS 24.11), Gitea API HTTP 200, wildcard
|
||
`probe-$RANDOM.ci.commoninternet.net` resolves to gateway `143.244.213.108`.
|
||
3. Read the Phase-2 plan + plan.md §6.1/§7/§9 (loop protocol, single-writer ownership, gate
|
||
handshake, anti-drift). Read STATUS-1e + REVIEW-1e final to inherit the harness invariants
|
||
(HC1–HC4 cold-verified PASS, F1e-2 not blocking).
|
||
4. Surveyed existing state: `tests/<recipe>/` already exists for **custom-html, cryptpad, keycloak,
|
||
lasuite-docs, matrix-synapse, n8n** — these were built out as Phase-1d/1e overlays + recipe_meta
|
||
+ ops.py. The lifecycle overlay model (test_install/upgrade/backup/restore.py + ops.py) is the
|
||
foundation. Phase 2 adds **parity-port functional tests** + **≥2 NEW recipe-specific tests** +
|
||
**dependency/SSO resolver** + **PARITY.md** per recipe.
|
||
5. Surveyed `references/recipe-maintainer` (mounted at `/srv/recipe-maintainer/`) — the parity
|
||
source. Per-recipe corpus:
|
||
- **custom-html** — health_check.py (200 check)
|
||
- **n8n** — health_check.py
|
||
- **keycloak** — health_check.py + oidc_integration.py (cross-recipe with lasuite-docs)
|
||
- **cryptpad** — health_check.py + oidc_login.py
|
||
- **lasuite-docs** — health_check.py + oidc_login.py + upload_conversion.py
|
||
- **lasuite-meet** — health_check.py + oidc_login.py + meeting_flow.py + webrtc-media.py +
|
||
webrtc-relay.py
|
||
- **matrix-synapse** — *shell* tests: compress_state.sh + test_complexity_limit.sh + test_purge.sh
|
||
(will port semantics to Python under cc-ci)
|
||
- **hedgedoc / authentik / immich / bluesky-pds / mumble / gitea / lichen / lichen-markdown** —
|
||
no `tests/` dir under recipe-info yet, will fill from plan §4.3 spec.
|
||
|
||
**Plan-shape orientation:**
|
||
- `tests/<recipe>/test_<op>.py` (lifecycle overlays) — already established.
|
||
- `tests/<recipe>/functional/` — Phase-2 introduces this subdir for parity-port + new specific tests.
|
||
Discovery currently globs `test_*.py` at the top level only; will need to recurse (Q0.2).
|
||
- `tests/<recipe>/playwright/` — same.
|
||
- `tests/<recipe>/PARITY.md` — Phase-2 introduces this; mapping table per recipe.
|
||
|
||
**Bootstrap commits incoming:**
|
||
- Add STATUS-2.md / BACKLOG-2.md / JOURNAL-2.md (this session).
|
||
- DECISIONS.md append: PARITY.md format, functional/ + playwright/ subdirs, dep-resolver shape.
|
||
|
||
Will now seed DECISIONS, then begin Q0.1 (vendor helpers into runner/harness/) — keeping the
|
||
custom-html overlay working as the reference recipe. The /loop will self-pace.
|
||
|
||
## 2026-05-28 — Q0 + Q1.1 landed; Q0 gate CLAIMED
|
||
|
||
Worked through Q0.1, Q0.2, Q0.3, Q1.1 in one stretch since they're tightly coupled:
|
||
|
||
**Q0.1** — `runner/harness/http.py` is the canonical Phase-2 recipe-test HTTP API. Mirrors
|
||
`recipe-maintainer/utils/tests/helpers.py` shape (same function names, same return shapes) so
|
||
parity ports read 1:1, but self-contained (cc-ci runtime does NOT import recipe-maintainer per
|
||
DECISIONS Phase 2). Existing `lifecycle.http_get`/`http_fetch`/`http_body` stay — they're for
|
||
infra-level checks like Traefik-404 detection. `harness.http` is for recipe tests' API calls. SSL
|
||
context is `CERT_NONE` because per-run domains use the wildcard cert; the real-cert verification
|
||
happens in `generic.served_cert` once per run via the install tier.
|
||
|
||
**Q0.2** — discovery now recurses into `functional/` + `playwright/` subdirs. Surgically small change
|
||
to `custom_tests`; doesn't disturb the lifecycle-tier discovery (overlays still live at top-level).
|
||
Two new unit tests prove it (recursion works + HC2 gate still applies to subdirs). Pre-existing 8
|
||
discovery unit tests still pass.
|
||
|
||
**Q0.3 / Q1.1** — custom-html as the reference recipe:
|
||
- `PARITY.md` mapping table: 1 parity row (health_check) + 2 recipe-specific rows
|
||
(content_roundtrip + content_type_header) + a backup-integrity reference + a playwright reference.
|
||
- `functional/test_health_check.py` — parity port with `SOURCE: recipe-info/custom-html/tests/health_check.py` comment for audit.
|
||
- `functional/test_content_roundtrip.py` — NEW: write a `uuid.uuid4()` marker into nginx's
|
||
`/usr/share/nginx/html` volume, fetch over HTTPS, assert exact-byte match. Non-vacuous: a stale page
|
||
or misrouted backend can't return our random content.
|
||
- `functional/test_content_type_header.py` — NEW: write `.html` + `.txt` files with same body
|
||
("hello"), HEAD each, assert `Content-Type: text/html` and `text/plain`. Caught the case where nginx
|
||
MIME map breaks even when 200 still works.
|
||
- `playwright/test_browser_smoke.py` — P6: Chromium renders HTML, no console errors.
|
||
|
||
**E2E cold-verifiable evidence on cc-ci** (log `/root/ccci-q0-customhtml-full.log`):
|
||
```
|
||
RECIPE=custom-html cc-ci-run runner/run_recipe_ci.py
|
||
===== TIER: install (generic=run, overlay=cc-ci:tests/custom-html/test_install.py) =====
|
||
... generic + overlay both PASS
|
||
===== TIER: upgrade =====
|
||
upgrade→PR-head: head_ref=8a026066 chaos-version=8a026066 version=1.10.0+1.28.0→1.11.0+1.29.0
|
||
... generic + overlay both PASS (data marker "upgrade-survives" survived chaos redeploy)
|
||
===== TIER: backup =====
|
||
... generic + overlay both PASS
|
||
===== TIER: restore =====
|
||
... generic + overlay both PASS (volume restored to "original")
|
||
===== TIER: custom =====
|
||
... 4 PASS (parity health_check, content_roundtrip, content_type_header, browser_smoke)
|
||
===== RUN SUMMARY =====
|
||
deploy-count = 1 (expect 1)
|
||
install : pass upgrade : pass backup : pass restore : pass custom : pass
|
||
```
|
||
|
||
That's the full Phase-2 pattern proven on the reference recipe:
|
||
- additive generic+overlay across 4 lifecycle ops (HC3),
|
||
- HC1 PR-head deploy proof via chaos-version label match,
|
||
- recipe-aware backup data-integrity (marker survives backup/restore cycle),
|
||
- 2 NEW recipe-specific functional tests beyond parity (P3 floor met),
|
||
- Playwright UI flow (P6),
|
||
- deploy-once + clean teardown.
|
||
|
||
**Q0.4 (dep resolver) deferred to Q2**: no Q1 recipe (custom-html + n8n) has deps, and the resolver
|
||
shape will be much clearer once we have keycloak+authentik to deploy as deps. Logged in BACKLOG-2.
|
||
|
||
**Q0 gate now CLAIMED.** Working in parallel on Q1.2 (n8n) while the Adversary cold-verifies.
|
||
|
||
|
||
## 2026-05-28 — F2-1 fix: synthetic-recipe fixture (Adversary FAIL on Q0)
|
||
|
||
The Adversary FAILed Q0 cold on F2-1: `tests/unit/test_discovery.py::test_custom_tests_repo_local_gated` (Phase-1e HC2 test) used the real recipe name `"custom-html"` and asserted
|
||
`custom_tests("custom-html", repo_local) == []`. Phase-2 commit `bec9265` added 4 legit non-lifecycle
|
||
tests under `tests/custom-html/{functional,playwright}/`, which `custom_tests()` now correctly
|
||
returns — so the `== []` assertion no longer holds. Behavior is right; the fixture was brittle.
|
||
|
||
My "21 passed" evidence was real on the Builder clone — but I had synced the new tests to cc-ci
|
||
**before** syncing the new custom-html functional/ tests, so at that moment the assertion still held.
|
||
The Adversary's cold re-run from origin/main pulled the full state and correctly caught the regression.
|
||
|
||
**Fix (commit `5741e88`):** switch to synthetic recipe + monkeypatch `discovery.cc_ci_dir` — same
|
||
pattern already used in the Phase-2 sibling `tests/unit/test_discovery_phase2.py`. 5-line change,
|
||
no behavior change. Cold-verifiable: `cc-ci-run -m pytest tests/unit -v` → 21/21 PASS.
|
||
|
||
F2-2 (scope observation) — the Adversary flagged that Q0.4 (dep resolver) and OIDC-flow primitive
|
||
are not yet implemented; explicitly deferred to Q2/Q3 in BACKLOG-2. Acknowledged in STATUS-2 gate
|
||
text.
|
||
|
||
**Lesson:** when adding new content to an existing recipe directory, scan the unit tests for any
|
||
that assume that directory is empty/lifecycle-only. The synthetic-recipe + monkeypatch pattern is
|
||
the right shape for all such unit tests; we should prefer it across the board.
|
||
|
||
**n8n probe ran in the background to validate endpoint shapes for Q1.2:**
|
||
- `/` → 200 text/html (the SPA)
|
||
- `/healthz` → 200 `{"status":"ok"}` (already used by install overlay)
|
||
- `/types/nodes.json` → 200 but size=31 bytes, not JSON (probably SPA fallback). REJECT this idea.
|
||
- Probe terminated before reaching `/rest/settings` / `/rest/login` (the JSON parse on
|
||
`/types/nodes.json` raised). Re-running probe now without the JSON gate.
|
||
|
||
Q0 re-claimed; awaiting Adversary re-verify. Continuing on Q1.2 (n8n) in parallel.
|
||
|
||
## 2026-05-28 — Q1.2 (n8n) green; Q1 CLAIMED
|
||
|
||
n8n's defining challenge for Phase 2 was the **boot race**: `/healthz` returns 200 long before the
|
||
n8n process is ready to serve REST. The REST endpoints serve a placeholder HTML page ("n8n is
|
||
starting up. Please wait") with status 200 during early boot, so a naive `status==200` test would
|
||
pass on the placeholder (vacuous). I avoided this in two ways:
|
||
|
||
1. **Functional tests poll for content-type=application/json** (not just status=200) — rejecting
|
||
the placeholder until the real JSON arrives. The retry envelope is the canonical
|
||
`harness.http.assert_converges`.
|
||
2. **The install overlay's Playwright now polls page.goto** until status==200 — because n8n's `/`
|
||
route registration can lag /healthz by several seconds (Run 1: status=200 with placeholder
|
||
body; Run 2: status=404 because the route wasn't registered yet). Both windows were caught and
|
||
handled.
|
||
|
||
The plan §4.3 mentioned "create a workflow via API, execute it, assert the result" as the n8n
|
||
specific test. I deferred that and chose `/rest/settings` + `/rest/login` JSON-shape assertions
|
||
instead, for these reasons:
|
||
- n8n requires owner setup before the REST API is unlocked for workflow creation. Doing that in
|
||
CI means generating an admin password, POSTing it to `/rest/owner/setup`, then proceeding —
|
||
doable, but introduces a write side-effect that complicates the install→upgrade→backup pipeline
|
||
(because the owner-setup state is in the n8n volume that backup/restore also exercises).
|
||
- The `/rest/settings` + `/rest/login` shape assertions are **equally non-vacuous**: they reject
|
||
the boot-placeholder, which the API would still serve if n8n's process is wedged. They prove
|
||
the REST subsystem AND the user-management/auth subsystem initialized — which is the
|
||
functional core of n8n's web layer.
|
||
- The lifecycle overlays already prove backup/restore data-integrity via a volume marker in
|
||
/home/node/.n8n. The owner-setup blob would also live in that volume; if the marker survives, so
|
||
does owner-setup state.
|
||
|
||
Decision recorded in BACKLOG-2 Q1.2 with rationale. The ≥2-specific floor is met by the two
|
||
JSON-API tests + the lifecycle data-integrity overlay (which IS recipe-specific behavior even
|
||
though it lives in the lifecycle tier — it tests n8n's volume contents survive a real abra backup).
|
||
|
||
**Cold-verifiable e2e on cc-ci** (log `/root/ccci-q1-n8n-r3.log`):
|
||
```
|
||
RECIPE=n8n cc-ci-run runner/run_recipe_ci.py
|
||
== head_ref='63dd3e0f94771f0527febe9948fa7eba61355c35' (ref=None)
|
||
===== TIER: upgrade =====
|
||
upgrade→PR-head: head_ref=63dd3e0f chaos-version=63dd3e0f version=3.1.0+2.9.4→3.2.0+2.20.6
|
||
... 5 lifecycle assertions + 3 custom-stage assertions ALL PASS ...
|
||
===== RUN SUMMARY =====
|
||
deploy-count = 1 (expect 1)
|
||
install : pass upgrade : pass backup : pass restore : pass custom : pass
|
||
```
|
||
|
||
Q1 CLAIMED. Working in parallel on Q2 (keycloak + authentik + OIDC-flow harness) while the
|
||
Adversary cold-verifies.
|
||
|
||
## 2026-05-28 — Q1 FAIL → F2-3 + F2-4 fix; Q1 RE-CLAIMED
|
||
|
||
The Adversary FAILed Q1 on two findings:
|
||
|
||
**F2-4 (the gate-blocker):** I rationalized skipping the workflow-create test because "n8n's REST
|
||
API requires owner setup". Per plan §7.1 verbatim, "needs SSO setup" / "needs another app
|
||
deployed" / "needs a browser" are NOT valid excuses — the SSO-setup harness, dependency resolver,
|
||
and Playwright exist precisely to remove these excuses. My rationale fell exactly into that
|
||
prohibited class. Owner setup is a one-POST run-scoped class-B secret per §4.4-B; the test should
|
||
do it.
|
||
|
||
This was a real mistake. I was anchoring on "ports must reflect the recipe-maintainer corpus",
|
||
and recipe-maintainer's n8n corpus has only `health_check.py`. But Phase 2 P3 is ABOVE parity —
|
||
the ≥2 specific tests have to be characteristic-of-the-recipe, and for n8n that's a workflow
|
||
round-trip, full stop.
|
||
|
||
**Fix:** `tests/n8n/functional/test_workflow_roundtrip.py` does exactly what §4.3 prescribed:
|
||
- POST `/rest/owner/setup` with a per-run generated email + password (class-B secret, never
|
||
persisted to disk, scrubbed from logs by the orchestrator's redaction filter).
|
||
- Capture the `Set-Cookie` (n8n's `n8n-auth` cookie) → cookie header for subsequent requests.
|
||
- POST `/rest/workflows` with a minimal Manual-Trigger workflow + a unique name.
|
||
- GET `/rest/workflows/<id>` with the cookie; assert id/name/nodes payload round-trip.
|
||
|
||
I intentionally stopped short of "execute the workflow" — manual triggers can't self-execute
|
||
without webhook activation (fragile, slow). Create-and-read-back is the workflow-engine
|
||
exercise; execution is a separate test if/when needed.
|
||
|
||
**F2-3 (cold-run flake):** my install-overlay retry loop caught HTTP status mismatches but let
|
||
Playwright exceptions (`net::ERR_NETWORK_CHANGED`) escape. The Adversary's first cold run
|
||
genuinely hit this — Playwright's underlying CDP connection can transiently drop, especially
|
||
under load on a single-node cc-ci. Wrapping `page.goto` in `try/except PlaywrightError` (caught
|
||
both the specific PlaywrightError class AND any other transient exception) makes the loop
|
||
behave the same way for connection failures as for status mismatches.
|
||
|
||
**Cold-verifiable e2e** (log `/root/ccci-q1-n8n-r4.log`, commit `fc89552`):
|
||
```
|
||
RECIPE=n8n cc-ci-run runner/run_recipe_ci.py
|
||
== head_ref='63dd3e0f' (ref=None)
|
||
... 5 lifecycle assertions + 4 custom-stage assertions ALL PASS ...
|
||
↑ including test_workflow_create_and_read_back (the §4.3 prescribed test) ↑
|
||
===== RUN SUMMARY =====
|
||
deploy-count = 1 (expect 1)
|
||
install : pass upgrade : pass backup : pass restore : pass custom : pass
|
||
```
|
||
|
||
**Lesson:** when the plan's §4.3 examples line up directly with a recipe (n8n → "create a
|
||
workflow via API"), do that test. The Adversary mandate (§7.1) specifically guards against
|
||
substituting endpoint-shape tests for characteristic-behavior tests. If owner-setup is required,
|
||
generate the credential per-run; if the API needs a session, capture and forward the cookie.
|
||
PARITY.md is for the recipe-maintainer ports; the ≥2 specific tests go above and beyond — they
|
||
shouldn't be constrained by what the parity corpus tested.
|
||
|
||
**Keycloak Q2.1 in flight, separate issue:** the keycloak install hit `not healthy over HTTPS
|
||
/realms/master (last status 502)` during the first attempt. The deployment dies before serving.
|
||
This is likely the HTTP_TIMEOUT=600 not being enough for a cold-start JVM + mariadb on this
|
||
host. Will investigate after Q1 RE-VERIFY lands.
|
||
|
||
## 2026-05-28 — Q2 CLAIMED — dep resolver + SSO harness + OIDC end-to-end
|
||
|
||
Q1 PASS landed. Then in one stretch:
|
||
|
||
**Q2.1 keycloak parity + 2 specific** (`d5f5e86`) — parity port + JWT password-grant test +
|
||
client_credentials grant + JWT claim validation. Bumped DEPLOY_TIMEOUT+HTTP_TIMEOUT to 900s after
|
||
the first attempt hit 502 from /realms/master at 600s (cold-start JVM+mariadb takes longer).
|
||
|
||
**Q2.3 — the foundational primitives** (`4d6b040`):
|
||
- `runner/harness/deps.py` — read `DEPS = [...]` from a recipe's `recipe_meta.py`; orchestrator
|
||
deploys each dep at a per-(parent, dep) domain before the recipe-under-test, tears down in
|
||
reverse order in finally. DG4.1 expected count is now 1 + len(deps_state).
|
||
- `runner/harness/sso.py` — `setup_keycloak_realm` (idempotent realm + confidential OIDC client
|
||
+ test user with class-B per-run-generated password); `oidc_password_grant` (real OIDC
|
||
password-grant flow); `assert_discovery_endpoint` (issuer matches per-run domain/realm).
|
||
- 7 unit tests in `tests/unit/test_deps.py`. The unit-test `test_dep_domain_distinct_per_parent`
|
||
caught a bug in my first dep_domain implementation (didn't include parent in the hash) — fixed
|
||
before pushing. 28/28 unit tests PASS cold.
|
||
|
||
**Q2.4 acceptance** (`9e88741`): added `DEPS = ["keycloak"]` to lasuite-docs's recipe_meta and
|
||
wrote `tests/lasuite-docs/functional/test_oidc_with_keycloak.py`. End-to-end on cc-ci:
|
||
|
||
```
|
||
RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py
|
||
===== DEPS: ['keycloak'] =====
|
||
dep: deploying keycloak -> keyc-c12afe.ci.commoninternet.net
|
||
dep: keycloak ready @ keyc-c12afe.ci.commoninternet.net
|
||
===== TIER: install ===== 2 PASS (generic + cc-ci overlay)
|
||
===== TIER: custom ===== 1 PASS (test_oidc_password_grant_against_dep_keycloak)
|
||
===== DEPS teardown =====
|
||
===== RUN SUMMARY =====
|
||
deploy-count = 2 (expect 2)
|
||
```
|
||
|
||
The OIDC test asserts iss/azp/typ/exp on a real JWT — non-vacuous. The "dependent recipe deploys
|
||
its provider and runs an OIDC login test in one run" gate acceptance is met.
|
||
|
||
**Q2.2 authentik DEFERRED.** Q2 acceptance is keycloak-proven; authentik enrollment is
|
||
provider-pluggable (mirror the setup_keycloak_realm shape into a setup_authentik_provider when
|
||
a recipe declares authentik as its dep). Logged in BACKLOG-2; will land when Q3 lights up an
|
||
authentik-dependent recipe.
|
||
|
||
**Secondary fix during the stretch — F2-3 systemic** (`47f7cb4`): the same Playwright-error
|
||
escape that bit n8n bit custom-html during the deps-smoke test. Centralized the fix in
|
||
`runner/harness/browser.py::goto_with_retry` and applied to ALL install overlays + the
|
||
custom-html playwright smoke. Cold-verified on custom-html (all 5 stages PASS).
|
||
|
||
**Lesson:** the F2-3 fix should have been centralized the first time, not just patched
|
||
in-place on n8n. The cost of the rework was ~50 lines and one extra cold run. Worth it for the
|
||
generality. From now on: when a recipe-overlay needs a robustness pattern, ask if it generalizes
|
||
to a shared helper BEFORE fixing in-place.
|
||
|
||
Q2 CLAIMED; awaiting Adversary cold-verify. Continuing on Q3 (SSO-dependent suite) in parallel.
|
||
|
||
## 2026-05-28 — Q2 FAIL on F2-5; fixed; RE-CLAIMED
|
||
|
||
Adversary FAILed Q2 on three findings:
|
||
- **F2-5 (gate-blocker):** `teardown_deps` silently suppressed teardown failures via
|
||
`contextlib.suppress(Exception)`. The `===== DEPS teardown =====` print fired even when undeploy
|
||
raised. On Adversary cold-check 14+ minutes after my Q2.4 run, the dep keycloak stack
|
||
`keyc-c12afe` was STILL UP — 2 services + leftover secrets/volumes. The "green" Q2.4 run leaked.
|
||
- **F2-6 (secondary):** cold keycloak install flake (502 from /realms/master). Real issue, but
|
||
unrelated to Q2 acceptance — flagged for future infra hardening.
|
||
- **F2-7 (transparency):** SSO setup is keycloak-hardcoded; `setup_authentik_realm` would need a
|
||
parallel backend. Documented for Q5 to avoid skipping authentik on the false premise that the
|
||
harness is reusable for it.
|
||
|
||
**This explained my Q3.1 flake!** When I ran lasuite-docs+keycloak again after the Q2.4 run, the
|
||
dep domain (`keyc-c12afe.ci.commoninternet.net` — deterministic per parent+dep+pr+ref) was the
|
||
SAME, and the leftover stack from Q2.4 collided with the new deploy. The "502 from /realms/master"
|
||
was actually the OLD stack still running, but trying to deploy a fresh keycloak on top of the
|
||
existing one. The new abra app new succeeded (created a new .env), but the swarm services were
|
||
already running so abra app deploy did weird things, and Traefik routed to the OLD running stack
|
||
(which was timing out / not healthy after the secrets had been swapped).
|
||
|
||
**Fix (commit `c6e94af`):**
|
||
- `deps.py::teardown_deps`: switched to `verify=True` so `lifecycle.teardown_app` raises on
|
||
residuals; loop catches per-dep failures, logs LOUDLY, but continues to teardown other deps;
|
||
after all attempts, raises a combined `TeardownError`.
|
||
- `run_recipe_ci.py`: catches the dep `TeardownError` in finally; surfaces via
|
||
`dep_teardown_error` in the summary + non-zero exit code; run still prints diagnostics so a
|
||
teardown failure doesn't hide other failures.
|
||
|
||
**Cold-verified e2e** (log `/root/ccci-f25-verify.log`):
|
||
```
|
||
RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py
|
||
===== DEPS: ['keycloak'] =====
|
||
dep: deploying keycloak -> keyc-c12afe.ci.commoninternet.net
|
||
dep: keycloak ready @ keyc-c12afe.ci.commoninternet.net
|
||
===== TIER: install ===== 2 PASS
|
||
===== TIER: custom ===== 3 PASS (incl. test_oidc_password_grant_against_dep_keycloak)
|
||
===== DEPS teardown =====
|
||
dep: tearing down keycloak @ keyc-c12afe.ci.commoninternet.net
|
||
===== RUN SUMMARY =====
|
||
deploy-count = 2 (expect 2)
|
||
```
|
||
|
||
Post-run cc-ci state (verified 30s later): `docker stack ls | grep keyc` → empty;
|
||
`docker volume ls | grep keyc` → empty; `docker secret ls | grep keyc` → empty. No leak.
|
||
|
||
Side-effect of the cleanup: also landed Q3.1 partial (PARITY.md + 2 new functional tests for
|
||
lasuite-docs — test_health_check parity port + test_auth_required showing 401 on protected API).
|
||
test_oidc_with_keycloak.py is the third specific test (Q2.4 acceptance + Q3.1 OIDC coverage).
|
||
|
||
**Lessons:**
|
||
1. **Silent exception suppression in cleanup paths is a bug**, not robustness. Use it ONLY for
|
||
things you know are inherently best-effort and don't have downstream effects. Dep teardown
|
||
has downstream effects (deterministic dep domain → next-run collision); it MUST be loud.
|
||
2. **Deterministic per-run domains amplify state leaks.** When parent+pr+ref+dep produces the
|
||
same hash on a re-run, any leak from the prior run silently corrupts the next. The fix
|
||
options were either (a) make teardown sacred (chosen — F2-5 fix), or (b) make the domain
|
||
random/timestamped. (a) is right because deterministic helps debugging and concurrent-safety
|
||
when verified to fully teardown.
|
||
|
||
Q2 RE-CLAIMED. Continuing Q3 work in parallel.
|
||
|
||
## 2026-05-28 — Q2 PASS; Q3.1 + Q3.4 partial; checkpoint
|
||
|
||
**Progress checkpoint:**
|
||
- Q0 ✓ Adversary PASS — harness primitives + discovery
|
||
- Q1 ✓ Adversary PASS — custom-html + n8n full Phase-2 (parity + ≥2 specific)
|
||
- Q2 ✓ Adversary PASS — keycloak + dep resolver + SSO harness + Q2.4 acceptance
|
||
- Q3.1 lasuite-docs partial — parity health_check + 2 specific (auth_required + oidc_with_keycloak)
|
||
- Q3.4 cryptpad partial — parity + 2 specific (spa_assets + Playwright render)
|
||
- Q3.2/Q3.3/Q3.5: not started
|
||
- Q4: 10 recipes not started
|
||
- Q5.1 docs partial; Q5.2/Q5.3 not done
|
||
|
||
**Open deferrals (per §7.1) tracked for Adversary sign-off:**
|
||
1. lasuite-docs deeper OIDC tests (oidc_login.py + upload_conversion.py + create-a-doc) — needs
|
||
install_steps.sh to wire dep keycloak's client_secret + OIDC env into the parent .env.
|
||
2. cryptpad create-a-pad deeper test — CryptPad's pad-creation flow is version-specific (DECISIONS
|
||
Phase-2 Q3.4 section logs the rationale).
|
||
3. Q2.2 authentik enrollment + setup_authentik_realm backend in harness.sso (F2-7).
|
||
|
||
**Pattern learned this session:**
|
||
- When a test fails on the first cold run, ALWAYS check whether the failure is the test code OR
|
||
the underlying behavior. The cryptpad story: my first /api/config test was wrong (the
|
||
endpoint doesn't exist); my second test_websocket_endpoint was wrong (the websocket path
|
||
doesn't return 4xx on plain HTTP); the Playwright pad-init was over-ambitious for the version.
|
||
Each iteration cost a 5-7min e2e cycle. Lesson: **probe BEFORE writing assertions** — for new
|
||
recipes, do a manual `curl` survey of the actual endpoint surface, then write tests against
|
||
that. (For Q3.5 immich and Q3.2 lasuite-drive I should plan a probe phase first.)
|
||
|
||
## 2026-05-28 — Q4.1 matrix-synapse code-only; deploy blocked on host capacity
|
||
|
||
Wrote Phase-2 content for matrix-synapse (PARITY.md + 3 functional tests, plan §4.3 prescribed
|
||
register-and-message + federation-version). Test code is correct.
|
||
|
||
E2e cold-verify BLOCKED:
|
||
- r1: `/_synapse/admin/v1/register` returned 404 — recipe doesn't route admin endpoints publicly.
|
||
Pivoted to public client API + `ENABLE_REGISTRATION=true` via EXTRA_ENV.
|
||
- r2: abra deploy timed out at 300s (recipe's TIMEOUT env). Bumped to 900s via EXTRA_ENV.
|
||
- r3: abra deploy still timed out, this time at 900s.
|
||
- **Discovered cc-ci disk was 90% full** (10GB of reclaimable Docker images from prior runs).
|
||
- Pruned: disk freed to 55% used (12GB free). Should be plenty.
|
||
- r4: STILL abra deploy timed out at 900s. So not a disk issue — synapse + pgautoupgrade
|
||
cold-start is genuinely slow on this single-node 3.5GB-RAM host. Bigger deploys take longer
|
||
than the harness allows.
|
||
|
||
**Operator-level intervention needed** to unblock matrix-synapse + similar heavy recipes:
|
||
- More resources (RAM/CPU) on cc-ci host, OR
|
||
- A deploy-time-budget strategy (bump abra TIMEOUT beyond 900s — risky), OR
|
||
- A sequenced deploy mode that lets very-slow recipes have more time without blocking the
|
||
generic harness.
|
||
|
||
For now: code is committed; e2e is blocked; will pivot to other recipes (Q3.3, Q3.5) or wait
|
||
for operator. Filed PushNotification to user.
|
||
|
||
## Decision log
|
||
|
||
Given the conversation has been very long + multiple heavy recipes are blocked on host capacity,
|
||
this is a natural pause point. Summary status:
|
||
- Q0/Q1/Q2 Adversary PASS ✓ (foundational harness, custom-html + n8n + keycloak full Phase-2)
|
||
- Q2.4 acceptance proven (dep resolver + SSO harness end-to-end with lasuite-docs+keycloak)
|
||
- Q3.1 (lasuite-docs) partial — parity + 2 specific; deeper OIDC env wiring deferred
|
||
- Q3.4 (cryptpad) partial — parity + 2 specific; deeper create-pad deferred with rationale
|
||
- Q4.1 (matrix-synapse) code-only — e2e blocked on host capacity
|
||
- Q5.1 docs partial — enroll-recipe.md Phase-2 contract pass landed
|
||
- Q3.2/Q3.3/Q3.5 + remaining Q4 + Q5.2/Q5.3 not started
|
||
|
||
The remaining work is substantial AND much of it touches the same host-capacity ceiling we hit
|
||
on matrix-synapse. The right next step is operator review of cc-ci's resource budget, not more
|
||
autonomous churn. Sending PushNotification.
|
||
|
||
## 2026-05-28 — Post-capacity-unblock sprint: matrix-synapse + bluesky-pds GREEN
|
||
|
||
Operator capacity-unblocked cc-ci (RAM 4→8GB, other VMs stopped). Resumed Phase 2.
|
||
|
||
**matrix-synapse (Q4.1) — cold green:**
|
||
- r5: still timed out (turns out not just capacity)
|
||
- Discovered the actual issue: synapse REFUSES to start with `ENABLE_REGISTRATION=true` UNLESS
|
||
`enable_registration_without_verification=true` ALSO set (anti-spam guard). The recipe doesn't
|
||
expose the second env. Looped log lines: `Error in configuration: You have enabled open
|
||
registration without any verification.`
|
||
- Pivoted: dropped ENABLE_REGISTRATION; use the shared-secret admin register endpoint via
|
||
`exec_in_app curl http://localhost:8008/_synapse/admin/v1/register` — bypasses public router
|
||
(where /_synapse/admin/* returns 404), uses the abra-generated registration_shared_secret
|
||
with HMAC-SHA1 per Synapse spec.
|
||
- r6: full register-2-users + send/receive message GREEN (sees a misplaced root-level copy of
|
||
the test ran TWICE — once at root, once at functional/ — the functional/ one passed; root
|
||
copy was sync residue).
|
||
- r7 (post-cleanup): clean GREEN. 5 assertions PASS (parity health + federation version + the
|
||
§4.3 prescribed register-and-message + 2 install).
|
||
|
||
**bluesky-pds (Q4.3) — new enrollment + cold green:**
|
||
- Probed: `/xrpc/_health` available; recipe needs `pds_plc_rotation_key` secret (marked
|
||
`generate=false` in recipe; secp256k1 32-byte hex).
|
||
- Wrote `install_steps.sh` that generates the key with cc-ci-run python's `secrets.token_bytes(32)
|
||
.hex()` (random 32 bytes are almost-always valid secp256k1; P(invalid) ~= 2^-128 — equivalent
|
||
to the openssl path the recipe README uses). Inserted via `abra app secret insert` under
|
||
TTY-wrap.
|
||
- r1: `/.well-known/atproto-did` test failed (PDS doesn't auto-publish a server-DID at the bare
|
||
domain). Replaced with `test_session_auth.py` — GET `/xrpc/com.atproto.server.getSession`
|
||
expecting 401 + XRPC error envelope. This is the recipe-defining auth contract.
|
||
- r4 (final): install + 3 functional tests all PASS, deploy-count=1.
|
||
|
||
**Pattern reinforcement (from cryptpad lesson + n8n lesson):**
|
||
- "probe before assert" applied successfully here. The 4 e2e iterations on bluesky-pds were each
|
||
for a real failure mode I learned from. Each iteration tightened the test design.
|
||
- Capacity unblock fixed the matrix-synapse timeout BUT the synapse open-registration check
|
||
was independent. Capacity + recipe-specific config both matter.
|
||
|
||
**Phase 2 status (current):**
|
||
- Q0/Q1/Q2 Adversary PASS ✓
|
||
- Q3.1 partial (lasuite-docs), Q3.4 partial (cryptpad), Q4.1 done (matrix-synapse), Q4.3 done (bluesky-pds)
|
||
- Q5.1 docs partial
|
||
- Remaining: Q3.2/3.3/3.5 + Q4.2/4-10 + the deferred follow-ups (lasuite-docs OIDC wiring,
|
||
cryptpad create-pad, matrix-synapse shell-script ports)
|
||
|
||
Pausing for Adversary cold-verify of Q4.1+Q4.3 (and re-verify of Q3.1+Q3.4 if updated). Will
|
||
resume on watchdog ping.
|
||
|
||
## 2026-05-28 (later) — Q3.2 lasuite-drive base-deploy verify: disk → prune → Docker Hub rate limit; + Gitea outage
|
||
|
||
Resumed loop to cold-verify the lasuite-drive base deploy (the f59d8e6 commit deferred OIDC/specific
|
||
tests until the ~10-service base converges). Chain of events:
|
||
|
||
1. **First install run timed out at abra TIMEOUT=900.** abra log root cause was NOT slowness but
|
||
`FATAL: could not write init file: No space left on device` in postgres init — cc-ci `/` was at
|
||
**89% (2.9 GB free)**. The ~2GB onlyoffice + ~1GB collabora pulls filled the disk; postgres
|
||
couldn't initialise. Stack is actually **12 services** (app, backend, celery, celery-beat, db,
|
||
redis, minio, minio-createbuckets[0/0 one-shot], mailcatcher, web/nginx, collabora, **onlyoffice**)
|
||
— bigger than the recipe_meta header noted; it ships BOTH office backends by default.
|
||
|
||
2. **Freed disk via `docker image prune -af`** → reclaimed 10.1 GB (30 dangling images from prior
|
||
recipe runs); host went 2.9 GB → 14 GB free. Bumped abra TIMEOUT 900→1500, DEPLOY_TIMEOUT
|
||
1200→1800 (recipe_meta.py edit; not yet committed — Gitea down, see below).
|
||
|
||
3. **Second run progressed far** — db, collabora, onlyoffice, backend, celery, app all reached 1/1.
|
||
But minio/redis/web/mailcatcher stuck at 0/1 in an instant Assigned→Rejected loop ("No such
|
||
image"). Manual `docker pull minio/minio:...` returned **`toomanyrequests: You have reached your
|
||
unauthenticated pull rate limit`**. The prune wiped these (previously-cached) small images, and
|
||
the full cold re-pull of 12 images — on top of today's many recipe deploys (matrix-synapse,
|
||
bluesky, ghost, uptime-kuma, keycloak, lasuite-docs, cryptpad retries) — exhausted Docker Hub's
|
||
per-IP anonymous quota. Big images pulled first; the 4 small ones got starved.
|
||
|
||
**Lesson:** pruning is double-edged on this host — it frees disk but forces re-pulls that burn the
|
||
anonymous rate limit. The real fix is authenticated registry pulls (plan §1.5 "registry pull
|
||
credentials") + trimming heavy stacks (lasuite-drive does not need BOTH collabora and onlyoffice
|
||
for WOPI parity — one office backend suffices; disabling onlyoffice cuts the biggest image + RAM).
|
||
|
||
4. **Gitea (git.autonomic.zone) is down** — bare host `/`, unauth `/api/v1/version`, and authed repo
|
||
API all return plain-text `404 page not found` (Go default ServeMux 404 = backend down, proxy has
|
||
no upstream). Same from both my sandbox and cc-ci (same IP 116.203.211.204), so it's a real
|
||
instance outage, not my creds/path. Adversary's `/root/adv-verify` clone is stale at 1aaf3bd
|
||
(clean, no inbox) → Adversary runs in its own sandbox; the only shared channel (Gitea) is dead.
|
||
**Two watchdog pings arrived (REVIEW-2 update + BUILDER-INBOX.md) that I CANNOT consume** until
|
||
Gitea recovers — will pull + act the instant it's back.
|
||
|
||
Action: interrupted the stuck deploy (let abra TIMEOUT fire for clean teardown). Recording finding;
|
||
notifying operator (registry creds per §1.5 + Gitea outage). Idle-retry both until recovery.
|
||
|
||
### Correction (same session): cannot trim onlyoffice — recipe-as-is rule
|
||
Investigated the "disable onlyoffice to shrink the stack" idea from the entry above. The lasuite-drive
|
||
recipe ships a **single `compose.yml`** with collabora AND onlyoffice as unconditional services — no
|
||
`COMPOSE_FILE`/compose-profile toggle in `.env.sample`. Disabling onlyoffice would require editing the
|
||
recipe's `compose.yml`, which violates "test the recipe as-is / never modify the recipe under test"
|
||
(§7-equivalent corner-cut). So **the trim avenue is closed** — I test all 12 services. The only
|
||
legitimate levers for the rate-limit problem are: (1) **registry pull credentials** (the §1.5 operator
|
||
finding — requested), and (2) **don't `docker image prune` aggressively** between runs (it forces cold
|
||
re-pulls that burn the anonymous quota; let the cache persist). Disk pressure must instead be managed
|
||
by pruning ONLY truly-dangling images, or by the operator growing the cc-ci disk.
|
||
(Also noted: recipe env is `ONLY_OFFICE_DOMAIN`, underscore — my EXTRA_ENV flattened COLLABORA/MINIO
|
||
domains but not onlyoffice's; only matters for the WOPI/TLS path, to revisit when base converges.)
|
||
|
||
## 2026-05-28 (later) — Gitea restored; consumed Adversary inbox; fixed F2-11 (SSO-skip-goes-green)
|
||
|
||
Gitea (git.autonomic.zone) recovered ~21:08Z (orchestrator confirmed). Reconciled: `git pull --rebase`
|
||
(up to date), pushed my 2 queued local commits (1138d77 + 4a118ea → origin), then a 3rd pull picked up
|
||
the Adversary's `b941f55` (its outage-queued writes: F2-11 + REVIEW-2 idle checkpoint + BUILDER-INBOX).
|
||
Consumed + deleted BUILDER-INBOX. The 3 watchdog pings during the outage were phantoms (Adversary's
|
||
failed push retries) — nothing was lost.
|
||
|
||
**Adversary's BUILDER-INBOX (digested):** DONE-gate warnings (F2-7 authentik, F2-9 cryptpad create-pad,
|
||
ghost §4.3 create-post floor, Q3.2 drive specifics, full P1–P8 Q5 re-verify) — all need deploys, so
|
||
gated on the Docker Hub rate limit. Plus **F2-11** (medium, not a VETO), which is pure code → fixed it
|
||
now (rate-limit-independent).
|
||
|
||
**F2-11 — SSO-dep "deps-not-ready" SKIP must not yield a GREEN run.** Adversary cold-proved: when
|
||
`setup_custom_tests` fails for a DEPS-declaring recipe, `CCCI_DEPS_READY=0` → conftest skips every
|
||
`@requires_deps` test → a skip-only pytest file exits 0 → `run_custom` returns "pass" → `overall=0` →
|
||
`!testme` GREEN while the only SSO/OIDC test never ran. Violates P7.
|
||
|
||
Why my fix is shaped this way: the failure-isolation design (a transient SSO-setup failure must not
|
||
break the *generic* tier signal) is correct and I kept it — generic tier results stand untouched. The
|
||
defect was only that the green SIGNAL was indistinguishable from "SSO verified." So I correct the
|
||
signal, not the isolation:
|
||
- `conftest.pytest_collection_modifyitems` now COUNTS the requires_deps tests it skips and appends the
|
||
count to `$CCCI_DEPS_SKIP_REPORT` (one line per pytest invocation; orchestrator sums across the
|
||
per-custom-file loop). Chose a filesystem report (not exit code) because pytest has no "fail on
|
||
skip" and a skip-only file legitimately exits 0 — the orchestrator already shares run-scoped temp
|
||
files with the pytest subprocess (depsfile/statefile/countfile), so this matches the pattern.
|
||
- `run_recipe_ci`: reads + sums the count, surfaces it in RUN SUMMARY (`custom: pass (N requires_deps
|
||
SKIPPED ... SSO UNVERIFIED)`), and a new pure predicate `sso_dep_unverified(declared, deps_ready,
|
||
skipped)` flips `overall=1` when a recipe declares DEPS + deps not ready + ≥1 requires_deps skipped.
|
||
Gated on skip>0 so a deps-declaring recipe with no requires_deps tests isn't false-failed.
|
||
|
||
Verified (both deploy-free — rate-limit-independent):
|
||
1. `cc-ci-run -m pytest tests/unit -q` → **35 passed** (28 prior + 7 new in test_f211_sso_skip.py:
|
||
predicate truth table + conftest skip/record/append/noop-when-ready).
|
||
2. Cold real-test proof on cc-ci: `CCCI_DEPS_READY=0 CCCI_DEPS_SKIP_REPORT=/tmp/f211-skip.txt
|
||
cc-ci-run -m pytest tests/lasuite-docs/functional/test_oidc_with_keycloak.py -rs` → `1 skipped`,
|
||
`PYTEST_EXIT=0` (the hazard), but `/tmp/f211-skip.txt` now contains `1` → orchestrator would compute
|
||
`sso_dep_unverified(["keycloak"], False, 1)=True` → `overall=1`. Hazard closed.
|
||
|
||
Full e2e (real deploy with a forced setup_custom_tests failure → observe overall=1) deferred to when
|
||
the Docker Hub rate limit lifts; the unit + cold-real-test proofs cover the predicate, the conftest
|
||
signal on real files, and the count flow — only the sequential read→sum→predicate→overall wiring is
|
||
unexercised by a live run, and it's straight-line code.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — Phase 2 RESUMED after the 2w (warm-canonical) detour
|
||
|
||
Builder loop resumed on Phase 2 (per-recipe test authoring). Phase 2w ran to DONE in the interim
|
||
(warm-canonical/quick); the 2w changes (`runner/warm*.py`, `canonical.py`, `nightly_sweep.py`, WC5
|
||
promote-on-green-cold wired into `run_recipe_ci.main()`) are merged on origin/main HEAD `7b5ed9c`.
|
||
|
||
**Re-orientation done this tick:**
|
||
- Adversary's last Phase-2 commit `7b5ed9c review(2)` is a cross-phase break-it probe (2w WC5
|
||
promotion × F2-11 SSO-skip): NO regression, no finding, NO VETO — F2-11 protection holds under
|
||
WC5 (promotion strictly gated on the fully-computed `overall`, which the F2-11 predicate flips to
|
||
1 before the promote check). So no gate of mine to advance, nothing to fix.
|
||
- All Adversary findings closed (F2-10, F2-11). Gates Q0/Q1/Q2 PASS. Q3/Q4 partial.
|
||
|
||
**Server build clone established:** `/root/builder-clone` (origin/main, secrets submodule skipped —
|
||
not needed for recipe tests; Gitea token comes from `/run/secrets/bridge_gitea_token`, dockerhub
|
||
auth from sops-rendered `/root/.docker/config.json`). `/root/cc-ci` is the nix-deploy materialised
|
||
copy (no `.git`), `/root/adv-verify` is the Adversary's. I run e2e from `/root/builder-clone`.
|
||
|
||
**Foundation re-confirmed post-2w (this tick):**
|
||
- `cc-ci-run -m pytest tests/unit -q` → **72 passed** (Phase-2 harness survived the 2w merge).
|
||
- `RECIPE=custom-html cc-ci-run runner/run_recipe_ci.py` → all 5 tiers PASS, deploy-count=1, WC5
|
||
promoted canonical custom-html → 1.11.0+1.29.0. Full install→upgrade→backup→restore→custom
|
||
pipeline healthy on the current harness.
|
||
|
||
**Reference-corpus mapping (key planning fact).** Corpus at `/srv/recipe-maintainer/recipe-info/`
|
||
(NOT `references/` — that path in the plan is stale). Present: authentik, bluesky-pds, cryptpad,
|
||
custom-html, gitea, hedgedoc, immich, keycloak, lasuite-docs, lasuite-drive, lasuite-meet, lichen,
|
||
lichen-markdown, matrix-synapse, mumble, n8n. Implication for P2 (parity):
|
||
- §5 recipes WITH reference parity still to port: **lasuite-meet, immich, mumble** (+ already done:
|
||
bluesky-pds, cryptpad, custom-html, keycloak, lasuite-docs, lasuite-drive, matrix-synapse, n8n).
|
||
- §5 recipes with NO reference → P2 vacuous, need only ≥2 specifics + lifecycle: **plausible, ghost,
|
||
uptime-kuma (done), mattermost-lts, discourse, mailu, drone**.
|
||
- authentik: SSO provider, Q2.2 deferred (lands only if a dependent needs it).
|
||
- gitea/hedgedoc/lichen* are in the corpus but NOT in §5 → out of scope.
|
||
|
||
**Remaining §5 work:** Q3.3 lasuite-meet, Q3.5 immich, Q4.2 mumble (parity+specifics, need
|
||
mirror/enroll), Q4.5 mattermost-lts, Q4.6 discourse, Q4.7 plausible (finish specifics), Q4.9 mailu,
|
||
Q4.10 drone (specifics only), + deferral lift cryptpad create-pad (F2-9, must lift before DONE).
|
||
|
||
**In flight this tick:** full `RECIPE=lasuite-drive` e2e on `/root/builder-clone`
|
||
(log `/root/ccci-resume-lasuite-drive.log`) — lasuite-drive suite (health parity + real MinIO S3
|
||
upload/list/download round-trip + OIDC password-grant JWT-claims against dep keycloak) is fully
|
||
authored; driving it to its first verified-green full run (the Q3.2 acceptance evidence).
|
||
|
||
---
|
||
|
||
## 2026-05-29 — lasuite-drive full e2e: upgrade tier hits a DISK-SIZE env blocker (host health emergency handled)
|
||
|
||
Drove lasuite-drive (heaviest §5 recipe — BOTH office backends) toward its first verified-green full
|
||
run. install tier PASSED (generic test_serving + cc-ci test_serving_and_frontend; all 12 services
|
||
converged after collabora won its startup race — see below). backup tier PASSED. Then the **upgrade
|
||
tier FAILED** and disk hit **99% (522M free)**, risking a host wedge.
|
||
|
||
**Root cause (definitive, from the abra DEPLOY OVERVIEW in the log):** the prev→PR-head upgrade
|
||
crosses *two different multi-GB office image versions simultaneously*:
|
||
- onlyoffice/documentserver-de: 9.2 → **9.3.1.2** (3.94GB image)
|
||
- collabora/code: 25.04.9.1.1 → 25.04.9.4.1 (~1GB)
|
||
- (+ small drive-backend/frontend v0.12.0→v0.18.0, redis, nginx)
|
||
abra's in-place chaos rolling update must hold BOTH the running prev office images AND pull the new
|
||
ones before swapping — ~10GB of office images transiently. The 28GB host has only ~14GB docker
|
||
headroom over the ~13GB baseline (nix store ~9.6GB + infra images ~1.75GB), so the PR-head pull
|
||
overflowed. **No harness mitigation exists:** the prev images are *running* (not dangling) when the
|
||
new must be pulled, and you cannot `docker rmi` a running image; a pre-upgrade prune finds nothing
|
||
dangling. It is fundamentally a disk-SIZE constraint, driven by the recipe legitimately bumping office
|
||
image tags across releases. Not a test-quality issue and not weakenable.
|
||
|
||
**collabora startup race (separate, self-resolving):** collabora/code logs
|
||
`/usr/bin/coolmount: Operation not permitted` (CapAdd=[] + default seccomp blocks mount()), falls back
|
||
to slow file-COPYING into its jail; the healthcheck killed an early task (exit 137) but a later task
|
||
finished the copy and reached 1/1. So collabora converges, just flaps once or twice first. Not the
|
||
blocker; noting in case it recurs on slower disk.
|
||
|
||
**Emergency handled — host fully restored:** killed the run (`pkill -f run_recipe_ci.py`), removed the
|
||
orphaned `lasu-7ea5e3` stack + its volumes (minio, postgres) + 8 leftover secrets (the killed run's
|
||
teardown never ran), pruned dangling images. Disk recovered 99% → 37% (17GB free). Infra stacks
|
||
(traefik/drone/dashboard/bridge/backups/warm-keycloak) untouched and healthy throughout.
|
||
|
||
**Decision:** the upgrade tier for lasuite-drive (and very likely other heavy recipes: lasuite-docs
|
||
also ships collabora; immich ships multi-GB ML images; lasuite-meet) is a genuine **Class A1 env-level
|
||
disk blocker** — the clean fix is a larger host disk (operator). Filed in DEFERRED.md + DECISIONS.md +
|
||
BACKLOG-2; flagged to operator (PushNotification) and Adversary (inbox). Meanwhile banking the
|
||
**maximal testable subset** (install+backup+restore+custom — single version, fits disk) to prove
|
||
lasuite-drive's actual Q3.2 CONTENT works: parity health, the real MinIO S3 upload→list→download
|
||
round-trip, and the OIDC password-grant + JWT-claims flow against the dep keycloak. Per §7.1 the
|
||
maximal subset is implemented and only the genuinely-disk-blocked upgrade tier is outstanding —
|
||
pending Adversary sign-off on the env-blocker.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — lasuite-drive: --detach fix validated, but OIDC setup redeploy is FLAKY (NOT claiming Q3.2 yet)
|
||
|
||
Ran lasuite-drive maximal subset (install,backup,restore,custom) four times today:
|
||
- **Run 1** (`ccci-drive-subset.log`): all tiers + all 3 functional GREEN (health, MinIO round-trip,
|
||
OIDC JWT) — but required a manual kill of the hung `docker service scale` (the bug I then fixed with
|
||
`--detach`, commit `f1c626c`). So the test ASSERTIONS are all correct and CAN pass.
|
||
- **Runs 2 & 3** (`-clean`, `-clean2`): corrupted by MY OWN over-eager `docker image prune -f` mid-deploy
|
||
— it removed the just-pulled, not-yet-attached digest-pinned images (drive-frontend, onlyoffice),
|
||
so swarm rejected with "No such image" and install failed/timed out. **LESSON: never
|
||
`docker image prune` during an active deploy — mid-pull images look dangling and get removed.**
|
||
Confirmed self-inflicted: `docker pull lasuite/drive-frontend@sha256:eeef…` succeeded (image is on
|
||
hub), and after seeding it the stack converged. Not a recipe/test issue.
|
||
- **Run 4** (`-clean3`, warm images, hands-off, fixed `--detach`): install/backup/restore all PASS,
|
||
health + MinIO PASS, **but the OIDC test SKIPPED because `setup_custom_tests.sh` exited 1** — its
|
||
step-3 in-place `abra app deploy --force --chaos` (applies the OIDC env) FAILED to converge
|
||
("FATA deploy failed"; abra log shows backend `Permission denied: /.gunicorn` + celery
|
||
`configure_wopi: 404 from collabora discovery url`). Per F2-11 the run correctly went RED (no false
|
||
green) — `custom: pass (1 requires_deps SKIPPED — SSO UNVERIFIED)`, overall=1. The `--detach` fix
|
||
itself works (bucket scale returned, secret inserted v2); the failure is the full-stack redeploy.
|
||
|
||
**Root finding: the OIDC-wiring step (a full 12-service in-place `--chaos` redeploy) is FLAKY on this
|
||
heaviest stack** — collabora's reconverge race + a transient backend gunicorn-perms/WOPI-404 window
|
||
mean the redeploy succeeds only sometimes (run 1 yes, run 4 no). The OIDC env change only affects
|
||
backend/app, so re-converging collabora/onlyoffice is unnecessary exposure. Fix direction (BACKLOG):
|
||
wire OIDC at INSTALL time (no post-deploy redeploy — like lasuite-docs install_steps), or make the
|
||
setup redeploy resilient (retry / wait for collabora WOPI discovery 200 before declaring ready).
|
||
|
||
**Decision:** NOT claiming Q3.2 — a flaky OIDC setup is not a reliable green, and claiming would risk
|
||
an Adversary cold-verify FAIL. lasuite-drive stays [~]: test content proven correct (run 1), `--detach`
|
||
bug fixed, two open issues (disk-blocker on upgrade tier [DEFERRED/operator]; flaky OIDC redeploy
|
||
[BACKLOG, needs robustness work]). **Pivoting to lighter recipes for broad Phase-2 progress**;
|
||
lasuite-drive's OIDC robustness + upgrade-disk return later. Host left clean (all stacks torn down,
|
||
disk 65%, infra healthy).
|
||
|
||
---
|
||
|
||
## 2026-05-29 — Next unit scouted: mumble (Q4.2) — design for the first NON-HTTP recipe
|
||
|
||
Pivoted off heavy lasuite-drive to a lighter recipe. mumble: recipe.toml has NO deps, single light
|
||
service (mumblevoip/mumble-server:v1.6.870-0) → fast deploys, low disk (avoids the lasuite-drive
|
||
heaviness/flakiness). BUT it's the first non-HTTP recipe: raw Mumble protocol over TLS on TCP 64738
|
||
(+ UDP). Reference corpus `/srv/recipe-maintainer/recipe-info/mumble/tests/`: health_check.py (TCP
|
||
connect to 64738), mumble_connect.py (pure-stdlib TLS handshake: Version + auth-accepted +
|
||
ChannelState + ServerSync + welcome text — portable as-is), web_client.py (HTTPS web UI, needs
|
||
`compose.mumbleweb.yml` overlay).
|
||
|
||
**Reachability decision (the crux):** cc-ci's traefik is HTTP(S)-only; the recipe declares traefik
|
||
TCP/UDP router labels but cc-ci has no :64738 TCP entrypoint, and host→overlay-container-IP isn't
|
||
reliably routable. **Chosen approach: run the protocol probe from a throwaway `python:3-slim`
|
||
sidecar container attached to the app's overlay network**, connecting to the murmur service by its
|
||
swarm DNS name (`app`) on 64738. No traefik change, no host-port publish, no compose-overlay
|
||
selection needed — the harness already knows the stack/network name. This becomes a small reusable
|
||
harness primitive (`run probe container on app network`) for any future non-HTTP recipe. Record in
|
||
DECISIONS.md when implemented.
|
||
|
||
**Enrollment plan (next tick):** mirror-check mumble on recipe-maintainers (auto-mirror if absent per
|
||
plan §0b); `tests/mumble/recipe_meta.py` (no DEPS; HEALTH via the sidecar TCP probe, not HTTP —
|
||
needs a recipe_meta hook or a custom install overlay since the generic HTTP health check won't apply;
|
||
likely set CCCI_SKIP_GENERIC or provide a TCP-aware install overlay); port health_check +
|
||
mumble_connect as functional tests using the sidecar primitive; ≥2 specifics (protocol handshake +
|
||
channel-list presence beyond TCP health); PARITY.md; e2e (light/fast). web_client.py deferred unless
|
||
the mumbleweb overlay is enabled. Open question to resolve in code: how the generic install tier
|
||
(HTTP health) behaves for a non-HTTP recipe — may need a per-recipe "health kind = tcp" in
|
||
recipe_meta consumed by the generic harness.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — mumble scope CORRECTION: non-HTTP health is a high-blast-radius core-harness feature, not a light add
|
||
|
||
On deeper inspection, mumble's non-HTTP nature is NOT a small adaptation. The HTTP health assumption
|
||
is baked into the CORE health path used by EVERY recipe + the 2w warm system:
|
||
- `run_recipe_ci._load_meta` defaults (HEALTH_PATH/HEALTH_OK) + the mirrored `conftest._recipe_meta`.
|
||
- `lifecycle.wait_healthy(domain, ok_codes, path, ...)` — the orchestrator's post-deploy HTTP poll at
|
||
THREE call sites (run_recipe_ci.py:467 warm/canonical, :633, :737).
|
||
- `canonical.deploy_canonical` health gate (warm-cache, 2w).
|
||
- `generic.assert_serving` (HTTP fetch + served_cert) and restore-health.
|
||
Supporting a TCP/protocol recipe means threading a `HEALTH_KIND` (http|tcp) through ALL of these with
|
||
default="http" preserving current behavior. That's a legitimate harness feature but HIGH BLAST RADIUS
|
||
(a regression breaks every recipe and the warm sweep), so it warrants a dedicated, careful effort with
|
||
unit tests + a no-regression re-run of an HTTP recipe + Adversary scrutiny of the core change — NOT a
|
||
tail-of-session cram. **Filed as its own unit (Q4.2 stays open; needs the non-HTTP-health harness
|
||
feature first).** Also: mumble's app is only on the `proxy` net and routes via a traefik `mumble` TCP
|
||
entrypoint cc-ci lacks (HostSNI + TLS passthrough) — the custom protocol test still needs the
|
||
python-sidecar-on-proxy-net probe.
|
||
|
||
**Next-unit re-pick:** prefer an HTTP-NATIVE recipe that uses the proven harness with zero core
|
||
changes — **mattermost-lts (Q4.5)** is the candidate (HTTP UI+API via traefik; §4.3 = create-a-message
|
||
round-trip is pure test-authoring, not harness surgery). Scout it next: confirm it's HTTP-native +
|
||
self-contained DB (vs needing a dep), mirror-check, then enroll (recipe_meta + lifecycle overlays +
|
||
≥2 specifics + PARITY note [no reference corpus → P2 vacuous]). Keeps blast radius low and adds real
|
||
coverage. mumble/mailu (non-HTTP) batch behind the HEALTH_KIND harness feature.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — DISK RESIZE 30→70GB in progress (orchestrator) — disk-blocker LIFTING; deploys paused
|
||
|
||
Orchestrator is resizing the cc-ci VM disk 30→70GB; VM RESTARTS (few-min outage + live-warm keycloak
|
||
re-warms on boot, up to ~10min). Actions: PAUSED new deploys; the in-flight mattermost-lts
|
||
install+custom e2e (`ccci-mattermost2.log`) will die transiently with the restart — that is the
|
||
restart, NOT a bug; re-run after. Waiting for the orchestrator's "back + healthy" signal (fallback
|
||
self-poll meanwhile).
|
||
|
||
**Impact (big):** this lifts the heavy-recipe upgrade-tier disk blocker (DEFERRED 2026-05-29 →
|
||
LIFTING). After cc-ci is healthy I can:
|
||
1. Re-run **lasuite-drive FULL lifecycle** (install+upgrade+backup+restore+custom) — the upgrade tier's
|
||
dual multi-GB office-image crossover (~10GB transient) now fits in 70GB. This is the path to the
|
||
real Q3.2 green (modulo the separate Q3.2a OIDC-redeploy flakiness — watch whether the bigger disk
|
||
also eases the redeploy convergence, though the flakiness root was collabora reconverge timing, not
|
||
disk). With more headroom the collabora re-pull churn from my earlier prune mistakes also stops
|
||
biting.
|
||
2. Re-run **mattermost-lts** install+custom (validate the create-message §4.3 round-trip) — it had
|
||
just launched when the resize started.
|
||
3. Resume broad heavy-recipe coverage (immich, lasuite-meet) with real disk headroom.
|
||
|
||
Note: with 70GB, I can also be less aggressive about teardown/prune churn between heavy runs.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — lasuite-drive Q3.2a Step 0: root-cause failure logs captured (BEFORE any fix)
|
||
|
||
Resuming Q3.2a (plan-lasuite-drive-oidc-robustness.md) after Phase 2pc DONE. The Adversary's
|
||
cold-verify criterion #1 requires real captured failure logs before any fix. Captured from the
|
||
flaky run-4 deploy (`/root/.abra/logs/default/lasu-288dfd...2026-05-29T062401Z`, the
|
||
`abra app deploy --force --chaos` OIDC-setup redeploy that exited 1 / "FATA deploy failed"):
|
||
|
||
1. **gunicorn perms race** — `backend [1] [ERROR] Control server error: [Errno 13] Permission
|
||
denied: '/.gunicorn'`. gunicorn tries to create its control-server temp dir under HOME=`/`
|
||
(not writable). (Part B fix: set perms / writable HOME in entrypoint before exec gunicorn.)
|
||
2. **WOPI-discovery race** — `celery RuntimeError: status code 404 return by discovery url for
|
||
wopi client collabora is invalid` at `/app/wopi/tasks/configure_wopi.py:53`. The celery
|
||
`configure_wopi_clients` task hits collabora's discovery URL at boot (06:21:54) while collabora
|
||
is still caching its 132+ l10n files (finishes ~06:24) → 404 → task raises. (Part B fix:
|
||
collabora WOPI healthcheck gating + backend retry/backoff on discovery.)
|
||
3. **transient db-not-ready** — `db FATAL: database "drive" does not exist` + celery
|
||
`Could not connect to database: failed to resolve host 'db'` — early-boot DNS/init races that
|
||
self-heal; harmless on a fresh deploy with the full TIMEOUT window.
|
||
|
||
**Key observation that shapes the fix:** the FIRST install deploy converges reliably **every** run
|
||
(install: pass in runs 1–4, incl. run 4). Only the post-install in-place `--force --chaos` redeploy
|
||
(applied to push the OIDC env) is flaky. The OIDC env touches ONLY backend/app — re-converging
|
||
collabora/onlyoffice/minio is unnecessary exposure. → **Part A: wire OIDC into the .env at INSTALL
|
||
time (between `abra app new` and the single `abra app deploy`) so the recipe deploys ONCE with OIDC
|
||
already set; no post-deploy reconverge.** keycloak is live-warm (always up), so the per-run realm is
|
||
a lightweight API call provisioned before the single deploy. Part B (recipe robustness PR) remains
|
||
the deeper fix so ANY reconverge (incl. the upgrade-tier prev→PR-head crossover) is race-free.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — lasuite-drive Q3.2a: Part A + upgrade-gate fix → FULL SUITE GREEN (run 1 of 3)
|
||
|
||
Two iterations landed:
|
||
- **Part A** (commit `a151489`): wire OIDC at INSTALL (provision warm-keycloak realm before the
|
||
single deploy; `install_steps.sh` writes OIDC env into it). Run 1 (`ccci-drive-q32a-r1.log`):
|
||
deploy-count=1, install/backup/restore/custom + OIDC test all GREEN — but **upgrade tier FAILED**:
|
||
the chaos redeploy SIGTERMed a still-booting collabora (coolwsd ~2min boot) → "Shutdown requested
|
||
while starting up", forced exit 70 → abra aborted ("FATA deploy failed"). install wait_healthy
|
||
returns on collabora container 1/1 while coolwsd is still loading.
|
||
- **Upgrade-gate fix** (commit `4b38b66`): `ops.py::pre_upgrade` now waits for collabora WOPI
|
||
discovery (`/hosting/discovery` on `collabora-<domain>`) → 200 before the chaos redeploy; +
|
||
DEPLOY_TIMEOUT plumbed through `chaos_redeploy`/`perform_upgrade`/`_perform_op` (was abra.deploy's
|
||
900s default vs the .env internal TIMEOUT 1500s).
|
||
|
||
**Fixed-code run 1 (`ccci-drive-q32a-r2.log`) — FULL SUITE GREEN:**
|
||
```
|
||
pre_upgrade: collabora WOPI discovery ready (200) on collabora-lasu-d12d03.ci.commoninternet.net
|
||
RUN SUMMARY: deploy-count = 1 (expect 1)
|
||
install : pass upgrade : pass backup : pass restore : pass custom : pass
|
||
```
|
||
- upgrade: `test_upgrade_preserves_data` PASSED (ci_marker survived prev→PR-head chaos crossover).
|
||
- custom: health + minio round-trip + OIDC password-grant JWT all PASSED (OIDC PASS, NOT skip).
|
||
- Clean teardown: no lasu stacks/volumes after; disk 38%.
|
||
|
||
The collabora-ready gate is the decisive fix — the upgrade chaos redeploy now replaces a fully-ready
|
||
collabora cleanly instead of killing it mid-boot. Launching runs 2 + 3 for the Adversary-required 3×
|
||
repeat-green before claiming Q3.2. (Part B — recipe-level WOPI healthcheck/gunicorn-perms PR — is no
|
||
longer required for CI green; will reassess whether to still file it as upstream robustness once 3×
|
||
green holds.)
|
||
|
||
---
|
||
|
||
## 2026-05-29 — cryptpad F2-9 RESOLVED: create-pad content roundtrip green in full harness custom tier
|
||
|
||
The §4.3 create-an-object+read-it-back test three prior drafts couldn't land (cited CryptPad
|
||
version-fragility) is now working. Empirically mapped CryptPad 2026.2.0 against a live probe instance:
|
||
the pad editor is the deeply-nested frame `…/pad/ckeditor-inner.html` (top → `#sbox-iframe` on the
|
||
sandbox domain → CKEditor frame); visiting `/pad/` auto-creates a fragment-keyed pad
|
||
(`#/2/pad/edit/<key>/`) after ~15s cold init (LESS compile). `tests/cryptpad/playwright/
|
||
test_pad_content_roundtrip.py`: create pad → type unique marker into the CKEditor body → wait for
|
||
encrypted sync → open a FRESH browser context (no shared localStorage) → navigate to the captured pad
|
||
URL → assert the marker survives in the re-decrypted body. Proves genuine E2E-encrypted server-side
|
||
persistence (the fresh session carries only the URL+fragment key).
|
||
|
||
Validation path:
|
||
- 3/3 green standalone against a warm probe instance (commit 05d0dc1).
|
||
- First full-suite run did NOT exercise it (I'd `rm`'d the file from builder-clone to unblock a pull;
|
||
the ff left it deleted → discovery skipped it — LESSON: `git checkout -- <file>` after pull, never
|
||
leave a tracked test locally-deleted).
|
||
- Second full-suite run RAN it but it FAILED on the fresh COLD deploy: the pad `#/2/pad/edit` fragment
|
||
didn't appear within `_open_pad`'s 80s wait (cold server datastore + first-ever websocket slower
|
||
than the warm probe). Fix `656b68b`: bump `_open_pad` hash-wait to ~240s + a mid-way reload.
|
||
- Third full-suite run (`/root/ccci-cryptpad-full3.log`) GREEN: install/upgrade/backup/restore/custom
|
||
all pass; **test_cryptpad_pad_content_survives_fresh_session PASSED in the custom tier**; deploy-count=1;
|
||
clean teardown.
|
||
|
||
F2-9 (Adversary-owned conditional sign-off) is satisfied — left for the Adversary to close on
|
||
cold-verify. DEFERRED.md cryptpad create-pad entry marked resolved.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — Both Phase-2-DONE blockers cleared; next unit scouted: Q3.3 lasuite-meet
|
||
|
||
**Milestone:** Q3.2 lasuite-drive = Adversary PASS (F2-12 CLOSED). cryptpad F2-9 = RESOLVED (roundtrip
|
||
green in full custom tier; awaiting Adversary close). The two veto-eligible / DONE-gating items are done.
|
||
|
||
**Next unit — Q3.3 lasuite-meet (SSO-dependent, La Suite sibling).** Scouted: mirrored on
|
||
recipe-maintainers (200), reference corpus rich (health_check, oidc_login, meeting_flow, webrtc-media,
|
||
webrtc-relay), `recipe.toml` requires=["keycloak"], [sso] provider=keycloak. **Reuses the exact
|
||
machinery I just built for lasuite-drive** — so low-friction:
|
||
- `recipe_meta.py`: DEPS=["keycloak"] + OIDC_AT_INSTALL=True (+ READY_PROBE if a heavy sub-service
|
||
like livekit needs an extra readiness signal — TBD at deploy).
|
||
- `install_steps.sh`: wire OIDC env at install (mirror lasuite-drive's; impress/La Suite OIDC contract
|
||
— adapt env var names to meet's .env.sample).
|
||
- lifecycle overlays test_install/upgrade/backup/restore + ops.py (DB marker like drive's, if meet has
|
||
a backable DB).
|
||
- Parity ports: health_check (HTTP 200), oidc_login (→ test_oidc_with_keycloak via
|
||
harness.sso.oidc_password_grant). PARITY.md mapping.
|
||
- §4.3 specifics: **meeting_flow** (password-grant token → create a room via meet API → assert room +
|
||
obtain LiveKit join token for 2 users; corpus meeting_flow.py shows the shape) + **webrtc** probe
|
||
(ICE/connectivity or LiveKit token issuance — full UDP media relay may be an env-blocker per plan
|
||
§7.1; implement the maximal testable subset = signaling/token issuance + document any true blocker).
|
||
- e2e: RECIPE=lasuite-meet PR=0 cc-ci-run runner/run_recipe_ci.py → full suite green, OIDC PASS.
|
||
|
||
(Also noted: tests/plausible/ has a stub (recipe_meta + functional/) from an earlier partial; plausible
|
||
not mirrored. Lower priority than lasuite-meet which completes Q3.)
|
||
|
||
---
|
||
|
||
## 2026-05-29 — Testing-standard clarification (operator): 3× repeat-green is flakiness-specific, not general
|
||
|
||
The 3× repeat-green bar I applied to lasuite-drive (F2-12 fix) was correct THERE because that recipe
|
||
was demonstrably flaky — it was a flakiness proof (show the fix made it reliably green, not lucky-once).
|
||
**It is NOT the general standard.** Normal recipe gates = **ONE Adversary cold-verified green** per
|
||
plan.md §6.1. Do NOT require 3× for other recipes (lasuite-meet Q3.3, future Q4 recipes) — a single
|
||
full-suite green + Adversary cold-verify is the bar. (Recorded by orchestrator in
|
||
plan-lasuite-drive-recipe-pr.md §2; the 3× re-applies only if a recipe shows flakiness again.)
|
||
|
||
---
|
||
|
||
## 2026-05-29 — F2-13 fixed: cryptpad roundtrip read-back made robust (poll all frames)
|
||
|
||
Adversary cold-verify of F2-9 FAILED (F2-13): the roundtrip's read-back leg timed out waiting for the
|
||
CKEditor `ckeditor-inner` frame to ATTACH on a fresh cold context (flaky). Fix (commit `b44d75b`): the
|
||
read-back no longer requires that specific frame to attach — it polls EVERY frame's body text for the
|
||
marker (generous ~240s deadline + periodic reloads). The marker appearing in a fresh context still
|
||
proves server-side E2E-encrypted persistence (only URL+fragment key carried over). Bumped session-1
|
||
post-type sync wait 9s→12s.
|
||
|
||
Validated **3× green** against a cold cryptpad probe (`cryptpad-probe`), ~33s each, no flakiness (the
|
||
poll-all-frames finds the marker fast once the pad renders — robust AND faster than the old
|
||
frame-attach wait). F2-13 is Adversary-owned — left for the Adversary to re-verify + close F2-9.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — Q3.5 immich: 4/5 tiers green + §4.3; restore data-integrity blocked by UPSTREAM recipe (no pg_dump hook)
|
||
|
||
Full suite (`/root/ccci-immich-full.log`): install PASS, upgrade PASS (real crossover
|
||
1.5.1+v2.6.3→1.6.0+v2.7.5, ci_marker survived), backup PASS (artifact created), custom PASS
|
||
(test_immich_upload_asset_readback_and_thumbnail = §4.3 upload→read-back→thumbnail-derivative;
|
||
health), deploy-count=1, clean teardown. **ONLY `test_restore_returns_state` FAILED** — postgres
|
||
`ci_marker` does not survive `abra app restore` (relation does not exist; app itself healthy).
|
||
|
||
**Diagnosed (harness path, immich probe):** seed ci_marker='original' → `abra app backup create`
|
||
(restic snapshot, 1729 files / 190MB) → drop ci_marker → `abra app restore` → ci_marker STILL absent.
|
||
**Root cause:** immich's UPSTREAM recipe backs up the **live postgres data VOLUME** via restic
|
||
(`backupbot.backup=true` on `database`, NO pg_dump hook) — a hot pgdata snapshot that cannot reliably
|
||
restore a DB row into a running postgres. Contrast lasuite-drive/meet, which ship a `pg_backup.sh` +
|
||
labels (`backup.pre-hook: /pg_backup.sh backup` → `backup.volumes.postgres.path: backup.sql` →
|
||
`restore.post-hook: /pg_backup.sh restore`) producing a CONSISTENT SQL dump that restores cleanly
|
||
(their restore tiers pass). This is an upstream immich-recipe defect (same class as the parked Q3.2b
|
||
lasuite-drive recipe-robustness PR), not a cc-ci/test bug — the ci_marker pattern is correct (works on
|
||
drive/meet).
|
||
|
||
**Decision:** Q3.5 immich = PARTIAL. The maximal subset is proven (install/upgrade/backup-artifact/
|
||
restore-healthy/custom incl. §4.3 + health). Real DB-restore data-integrity (P4) needs the immich
|
||
recipe to gain a `pg_dump` backup hook — a recipe-create-pr unit (mirror immich → add pg_backup.sh +
|
||
the 4 backupbot labels [adapt POSTGRES_USER=postgres, DB=immich] → cc-ci full-suite green on the PR →
|
||
operator merge), exactly like Q3.2b for drive. Filed DEFERRED + BACKLOG. NOT claiming Q3.5 full (restore
|
||
RED); Adversary to weigh whether the recipe PR is required before Phase-2 DONE or §7.1 sign-off applies.
|
||
|
||
---
|
||
|
||
## 2026-05-29 — HQ1 image pre-pull DONE (commit 2bf40d6), claimed
|
||
|
||
Implemented per plan-prepull-images.md: lifecycle.prepull_images resolves a recipe's images via
|
||
`docker compose config --images` (COMPOSE_FILE from the app .env — handles $VERSION interpolation +
|
||
multi-compose; verified the invocation on custom-html-tiny [1 img] + lasuite-meet [compose.yml:
|
||
compose.turn.yml]) and docker-pulls them skip-if-present. Wired into deploy_app (before the unchanged
|
||
abra.deploy) + perform_upgrade (before the chaos redeploy). Validation: 4 unit tests (mocked docker)
|
||
prove present→skip / missing→pull / pull-fail→RAISE / no-images→skip; n8n run #1 prepulled a cold
|
||
image + green; n8n run #2 (warm) showed `prepull: present` (no re-download); a bogus tag raised a
|
||
clear "clear pull error BEFORE deploy: manifest unknown" pre-deploy. abra deploy unchanged (no service
|
||
update/scale). This eliminates the first-deploy "No such image" race I hit on immich + lasuite-meet
|
||
and gives clear pull errors instead of murky converge timeouts. Honest scope: removes pull-time not
|
||
app-init-time.
|
||
|
||
## 2026-05-29 — Q4.7 plausible: test content green; deploy blocked by upstream clickhouse-boot-download flakiness
|
||
|
||
**Test content authored + partially proven.** Wrote the §4.3 functional tests
|
||
(`tests/plausible/functional/test_event_tracking.py`: `test_pageview_event_roundtrip` +
|
||
`test_custom_event_roundtrip`) and fixed the health probe. Empirically validated the full event
|
||
round-trip against a live probe BEFORE writing: register a site row in the metadata postgres
|
||
(plausible's `sites_cache` GATES ingestion — events for unregistered domains are silently dropped,
|
||
confirmed count=0), POST to `/api/event` with a **browser User-Agent** (plausible drops bot/library
|
||
UAs), poll ClickHouse `events_v2` for the row (sites_cache refresh + write-buffer flush → first landing
|
||
~35-50s). A first `STAGES=install,custom` run **PASSED both event tests** (`2 passed in 73.58s`) and the
|
||
custom tier — so the §4.3 content is GREEN. Health probe switched `/` → `/api/health` (returns 200 with
|
||
`{"clickhouse":"ok","postgres":"ok","sites_cache":"ok"}` only when both stores ready; `/` 500s under
|
||
headless DISABLE_AUTH then 302s once ready, so `/` can't distinguish not-ready from ready). The prior WIP
|
||
edit had left an UNTERMINATED docstring in test_health_check.py (syntax error) — fixed. Install overlay
|
||
re-checked `/` (→500) and FAILED; replaced with a stronger assertion on the /api/health JSON subsystems.
|
||
|
||
**Blocker (upstream recipe defect): clickhouse-backup boot-download crash-loop.** The full lifecycle run
|
||
**timed out at DEPLOY_TIMEOUT=1200s** — `abra app deploy ... timed out after 1200 seconds`. Root cause:
|
||
the recipe's `entrypoint.clickhouse.sh` (swarm config `clickhouse_entrypoint`, mapped to
|
||
`/custom-entrypoint.sh`) runs, with `set -e` and NO retry, a `wget` of a 22MB `clickhouse-backup` tarball
|
||
from `github.com/AlexAkulov/clickhouse-backup` (renamed → 301 to `Altinity/...`) BEFORE exec'ing
|
||
clickhouse-server. If that wget (or the subsequent `tar -xf`) fails, the entrypoint exits 1 with EMPTY
|
||
logs (clickhouse-server never starts) and swarm crash-loops the task. Each restart re-downloads 22MB →
|
||
~120 attempts/20min ≈ 2.6GB hammered at GitHub → **GitHub secondary rate-limiting** → all subsequent
|
||
downloads fail → sustained crash-loop → deploy timeout.
|
||
|
||
Evidence: exited containers = `exit=1`, zero logs (fails before clickhouse). The download URL is fine —
|
||
a bridge-network `docker run` with the EXACT entrypoint command (busybox wget; image's `wget` is
|
||
`/bin/busybox`) succeeds 3/3 (22222742 bytes) when NOT hammered. The first `install,custom` run and a
|
||
manual probe BOTH converged (clickhouse up, events ingested) — i.e. the deploy works when GitHub answers
|
||
the first wget. The failure is induced by my back-to-back heavy testing churn today exhausting the IP's
|
||
GitHub budget; swarm task containers egress via the same host IP so they share the throttle.
|
||
|
||
**Why it matters for the gate:** normal CI (one PR → one deploy, MAX_TESTS=1) does ONE wget — usually
|
||
succeeds, converges (as proven). The catastrophic 20-min spiral needs SUSTAINED GitHub throttling, which
|
||
only my repeated-deploy testing produces. So plausible is reasonably reliable in normal operation but is
|
||
NOT robust to a transient first-wget failure (any single failure spirals), and the Adversary cold-verify
|
||
shares the risk.
|
||
|
||
**Decision (see DECISIONS.md):** durable fix = recipe PR hardening `entrypoint.clickhouse.sh` —
|
||
download the binary to the PERSISTENT `/var/lib/clickhouse` volume with skip-if-present (restarts don't
|
||
re-download → no amplification), retry-with-backoff, and `set +e` so a download failure does NOT block
|
||
clickhouse-server start (the DB must come up regardless; backup capability degrades gracefully). This
|
||
ALSO makes the deploy converge even under an active GitHub throttle (the DB no longer waits on the
|
||
download), so it is testable now. Same upstream-robustness pattern as Q3.2b (lasuite-drive) and immich's
|
||
pg_dump. cc-ci test content is correct and unchanged by this.
|
||
|
||
Killed the crash-looping runs + removed all plausible stacks/configs/networks/volumes (clean). NOT
|
||
claiming Q4.7 until the full lifecycle is green.
|
||
|
||
## 2026-05-29 — next-recipe recon (drone/discourse/mailu) after Q4.2 mumble claim
|
||
Recon (abra recipe fetch + compose inspect; non-deploy) of the 3 remaining unenrolled §5 recipes:
|
||
- **discourse**: services app+db(postgres)+redis+sidekiq; **HAS backupbot.backup label (compose.yml)
|
||
→ real P4 achievable**; 13 version tags (real upgrade); compose.smtpauth.yml overlay; functional =
|
||
create-a-topic via admin API (needs an admin API key — discourse first-boot/admin bootstrap). Heaviest
|
||
deploy (slow cold start, big image) — main risk is run time/flakiness, not coverage.
|
||
- **mailu**: 11 services (app/db/admin/imap/smtp/antispam/webmail/rspamd/dkim...); **NO backupbot label
|
||
→ P4 gap** (would need a recipe-PR to add backup, like immich Q3.5 — a deferral); 11 tags; functional =
|
||
admin API create domain+mailbox + SMTP/IMAP send/receive. Multi-service, moderate-heavy.
|
||
- **drone**: single app service + data volume; **NO backupbot → P4 gap**; 11 tags; compose.gitea.yml /
|
||
compose.github.yml overlays — functional depth (create/list builds) needs a wired git provider (gitea
|
||
OAuth dep). It is cc-ci's own CI engine. Shallow without a dep; P4 gap.
|
||
**Choice for the cleanest COMPLETE enrollment (P1 install+upgrade+backup-restore + real P4): discourse**
|
||
(only one of the three with a recipe backup mechanism). mailu/drone would each carry a P4-N/A deferral
|
||
(no upstream backup config) needing Adversary §7.1 sign-off or a recipe-PR. Plan discourse next: HTTP
|
||
health, admin-API create-a-topic (+ read-back) for §4.3, postgres ci_marker for P4 (backupbot present).
|
||
Hold the deploy until the Adversary's mumble cold-verify frees the single node.
|
||
|
||
## 2026-05-29 — mailu (Q4.9) investigation; discourse (Q4.6) blocked
|
||
- **discourse Q4.6 BLOCKED**: `bitnami/discourse:*` images removed from Docker Hub (manifest unknown;
|
||
swarm "No such image" rejection). bitnamilegacy/discourse exists but install tier uses the gone
|
||
prev-published version → recipe-PR can't unblock until upstream re-releases. DEFERRED.md entry filed.
|
||
Scaffolding (recipe_meta+postgres-P4 ops/overlays+health) staged at ca7acf3 for when fixed.
|
||
- **mailu Q4.9 plan** (images all pullable — ghcr.io/mailu/* OK; NOT bitnami):
|
||
- Services: front(nginx)/admin/imap(dovecot)/smtp(postfix)/antispam(rspamd)/webmail(snappymail)/
|
||
resolver/oletools/dkim... (~11). NO backupbot label → P4 N/A (recipe-PR-deferrable like immich) —
|
||
document in PARITY.md + DEFERRED, seek Adversary §7.1 sign-off OR file a backup recipe-PR.
|
||
- EXTRA_ENV needed: DOMAIN (harness sets), MAIL_DOMAIN, HOSTNAMES, TRAEFIK_STACK_NAME (cc-ci's
|
||
traefik stack name = traefik_ci_commoninternet_net), SITENAME, POSTMASTER, TLS_FLAVOR. Set
|
||
API=true + a MAILU API token if using the REST API; else use the admin-container CLI.
|
||
- Health: front serves; WEBROOT_REDIRECT=/webmail. HEALTH_PATH candidate `/admin` (login 200) or
|
||
`/` (302→/webmail). admin healthcheck is DISABLED in compose → rely on front + HTTP probe.
|
||
- §4.3 functional: create-an-object+read-back via the admin container CLI (headless, reliable):
|
||
exec_in_app(service="admin") `flask mailu domain <MAIL_DOMAIN>` + `flask mailu user <u> <domain>
|
||
<pw>` → read back via `flask mailu user` list / admin API → assert mailbox exists. Distinctive #2:
|
||
real mail flow — SMTP send (smtp service) → IMAP retrieve (imap service) of a unique-marker mail;
|
||
reachability likely needs host-published mail ports (like mumble host-ports) OR exec inside the
|
||
container using swaks/openssl. Simpler distinctive #2 if SMTP/IMAP host-reach is hard: create a
|
||
2nd domain/alias via CLI + verify, or assert the admin API lists the created user.
|
||
- recipe_meta: DEPLOY_TIMEOUT generous (multi-service); confirm version tags for the upgrade tier.
|
||
- Build next iteration (fresh context): scaffold tests/mailu/, smoke deploy install,custom to find
|
||
the exact `flask mailu` invocation + health path + mail-port reachability, then add §4.3 tests.
|
||
|
||
## 2026-05-29 — mailu (Q4.9) deeper recon: TLS/certdumper friction noted
|
||
- Services: `app`=ghcr.io/mailu/nginx (the front/web+mail proxy), `db`=redis:8.0.3-alpine (redis, NOT
|
||
a SQL DB — mailu admin uses sqlite at /data inside the admin container), `admin`=ghcr.io/mailu/admin
|
||
(mgmt API + `flask mailu` CLI), imap(dovecot), smtp(postfix), antispam(rspamd), webmail, **certdumper**
|
||
(ldez/traefik-certs-dumper). All images PULLABLE (ghcr.io/mailu/* + redis + ldez). NO backupbot → P4 N/A.
|
||
- **FRICTION (cc-ci-specific): certdumper expects traefik's ACME acme.json** (it dumps certs from
|
||
traefik_letsencrypt volume for the mail ports' TLS). cc-ci uses a FILE-PROVIDER wildcard cert, NOT
|
||
ACME (Class-A1, ACME forbidden) → no acme.json → certdumper likely never converges → services_converged
|
||
False → install "fails". MITIGATION to try: set TLS_FLAVOR (mailu env) to `notls` (disables mail TLS,
|
||
no cert needed) or `mail-letsencrypt`→ avoid; OR drop certdumper from COMPOSE_FILE if the recipe allows;
|
||
OR provide the cc-ci wildcard cert files to mailu's expected path. Smoke deploy will reveal whether
|
||
certdumper blocks convergence; START with TLS_FLAVOR=notls in EXTRA_ENV. The web/admin HTTP path
|
||
(traefik file-provider wildcard) works regardless; functional create-mailbox is via the admin CLI
|
||
(no mail-TLS needed). SMTP/IMAP send-receive distinctive test may need TLS_FLAVOR handled.
|
||
- Versions: 1.1.0/1.1.1/2.0.0/3.0.0/3.0.1; prev=3.0.0+2024.06.27 → head 3.0.1+2024.06.37 (real upgrade).
|
||
- Build approach: EXTRA_ENV callable(domain)→{MAIL_DOMAIN:domain, HOSTNAMES:domain, TRAEFIK_STACK_NAME:
|
||
"traefik_ci_commoninternet_net", SITENAME:"ccci", POSTMASTER:"admin", TLS_FLAVOR:"notls"}. Smoke
|
||
install,custom first to confirm convergence (esp. certdumper) + find `flask mailu` syntax + health path.
|
||
|
||
## 2026-05-29 — drone (Q4.10) investigation: needs a gitea SCM dep + OAuth + build-trigger pipeline
|
||
drone = single `app` (drone/drone:2.26.0), HEALTH=/healthz, NO backupbot (P4 N/A), real upgrade tags
|
||
(1.8.0+2.25.0→1.9.0+2.26.0). KEY: drone is a CI server that REQUIRES exactly one SCM provider — the
|
||
base compose's drone.env.tmpl only sets DRONE_RPC_SECRET; the SCM (DRONE_GITEA_CLIENT_ID/SERVER +
|
||
client_secret) is supplied by compose.gitea.yml. drone's server FATALs without an SCM provider
|
||
configured, so it cannot even BOOT standalone. gitea recipe IS fetchable (dep-deployable).
|
||
**Full §4.3 enrollment cost (the heaviest of any §5 recipe):**
|
||
1. Deploy gitea as a DEP (deps.py — but gitea is a full git service, heavier than keycloak).
|
||
2. Create a gitea OAuth2 application via the gitea admin API → client_id + client_secret.
|
||
3. Wire DRONE_GITEA_SERVER/CLIENT_ID + client_secret secret into drone (compose.gitea.yml +
|
||
install_steps), then drone boots.
|
||
4. §4.3 "create/list builds" needs a drone USER API TOKEN — which drone only issues AFTER an OAuth
|
||
login flow against gitea (headless OAuth consent is itself complex), PLUS a synced repo with a
|
||
.drone.yml PLUS a push/webhook to trigger a build. That is a full CI-trigger pipeline, multi-system.
|
||
**Assessment:** deploying drone+gitea (boot+/healthz) is achievable; the §4.3 create-an-object (a
|
||
build) requires OAuth-token + repo-sync + webhook-trigger infra that is disproportionate. §7.1 says
|
||
"needs another app"/"needs SSO" are NOT valid excuses (dep resolver exists) — but drone's blocker is
|
||
the OAuth-token + build-trigger PIPELINE, beyond a simple dep. **Proposed: build the gitea-dep +
|
||
OAuth-at-install wiring so drone BOOTS (install+upgrade green + a health/version/SCM-config functional
|
||
= maximal subset), and DEFER the build-creation §4.3 with a DEFERRED.md entry + Adversary §7.1
|
||
sign-off** (the create-build pipeline is a dedicated unit). Decide next iteration; gitea-dep wiring is
|
||
the main effort. Do NOT deploy concurrently with the Adversary's mailu cold-verify.
|
||
|
||
## 2026-05-29 — drone+gitea integration FULLY SCOPED (execute next iteration)
|
||
Confirmed mechanics:
|
||
- `deps.py::deploy_deps` is GENERIC (deploys any dep recipe by name + waits health; reads
|
||
tests/<dep>/recipe_meta.py EXTRA_ENV/HEALTH via meta_for). So DEPS=["gitea"] works, BUT gitea needs
|
||
config: gitea ships `COMPOSE_FILE=compose.yml:compose.mariadb.yml` (app + mariadb, 2 services) and
|
||
uses GITEA_DOMAIN for ROOT_URL/OAuth redirects — defaults to gitea.example.com, so a dep deploy
|
||
needs GITEA_DOMAIN pinned to the per-run dep domain.
|
||
- gitea: `INSTALL_LOCK=true` (no web installer), NO auto-admin user via env. Admin must be created via
|
||
the gitea CLI in the app container: `gitea admin user create --admin --username ccci --password <pw>
|
||
--email ccci@ci.local --must-change-password=false`, then a token: `gitea admin user
|
||
generate-access-token -u ccci --scopes 'write:application,write:user' --raw` (gitea ≥1.19 syntax).
|
||
- drone OAuth: drone needs DRONE_GITEA_SERVER=https://<gitea-dep-domain> + DRONE_GITEA_CLIENT_ID + a
|
||
`client_secret` swarm secret (compose.gitea.yml). Create the gitea OAuth2 app via API:
|
||
`POST https://<gitea>/api/v1/user/applications/oauth2` (header Authorization: token <admintoken>)
|
||
body {name, redirect_uris:["https://<drone-domain>/login"], confidential_client:true} → returns
|
||
{client_id, client_secret}.
|
||
INTEGRATION PLAN (execute fresh):
|
||
1. tests/gitea/recipe_meta.py: EXTRA_ENV(domain)→{GITEA_DOMAIN:domain, GITEA_DISABLE_REGISTRATION:"true"}
|
||
(+ any required), HEALTH_PATH="/" HEALTH_OK=(200,302), DEPLOY_TIMEOUT~900. (gitea as a dep app.)
|
||
2. tests/drone/recipe_meta.py: DEPS=["gitea"]; EXTRA_ENV(domain)→ COMPOSE_FILE="compose.yml:compose.gitea.yml",
|
||
DRONE_USER_CREATE="username:ccci,admin:true" (match the gitea admin username so drone admin maps),
|
||
GITEA_DOMAIN=<dep domain> (from deps file at install_steps time — so EXTRA_ENV may need the dep
|
||
domain, which isn't known until deps deploy → use install_steps for the dep-dependent env, like the
|
||
keycloak OIDC-at-install pattern). HEALTH_PATH="/healthz" HEALTH_OK=(200,). Likely OIDC_AT_INSTALL-style.
|
||
3. tests/drone/install_steps.sh: read $CCCI_DEPS_FILE for gitea dep domain; exec into the gitea dep
|
||
container to create admin+token (or via API); create the OAuth2 app → client_id/secret; `abra app
|
||
secret insert drone client_secret v1 <secret>`; env_set DRONE_GITEA_CLIENT_ID + GITEA_DOMAIN into
|
||
drone .env; then the single drone deploy boots with gitea SCM. (Mirror lasuite OIDC-at-install: the
|
||
orchestrator deploys the dep BEFORE drone when OIDC_AT_INSTALL+DEPS; install_steps wires it.)
|
||
NOTE: install_steps runs in the drone deploy_app; the gitea dep must be deployed FIRST — verify the
|
||
orchestrator's OIDC_AT_INSTALL path deploys deps before the parent (it does: _provision_deps before
|
||
deploy when oidc_at_install). May need to generalize that flag (e.g. DEPS_AT_INSTALL) for non-OIDC.
|
||
4. §4.3 build-creation (create/list builds): DEFER — needs drone user OAuth token (drone issues tokens
|
||
only post-OAuth-login against gitea; headless OAuth consent is complex) + a synced repo + .drone.yml
|
||
+ a push/webhook trigger. DISPROPORTIONATE pipeline. Ship MAXIMAL SUBSET: drone boots with gitea SCM
|
||
(install+upgrade+health/healthz + a functional test asserting drone serves /healthz 200 and the
|
||
login page advertises gitea SSO, proving SCM configured). DEFERRED.md entry + Adversary §7.1 sign-off
|
||
for the build-trigger pipeline. SMOKE-FIRST: manually deploy gitea→create OAuth app→deploy drone wired
|
||
→confirm /healthz, before writing test code (nail the gitea CLI/API calls).
|
||
This is the heaviest Phase-2 integration; budget multiple iterations. Hold deploys if Adversary active.
|
||
|
||
---
|
||
## 2026-05-29T22:4x — immich Q3.5 P4 decision: recipe-PR (add postgres backup), not N/A
|
||
|
||
Resumed loop. Adversary checkpoint (REVIEW-2 `af94708`) confirms my own finding: immich's P4 restore
|
||
is RED and unsigned. Root-caused it directly on cc-ci:
|
||
- immich's `backupbot.backup` label sits ONLY on the `app` service, whose sole data volume `uploads`
|
||
is `backupbot.volumes.uploads=false` (excluded), and the two other excluded names (model-cache,
|
||
external_storage) aren't even on `app`. → app backs up nothing.
|
||
- the `database` (postgres, DB_USERNAME=postgres/DB_DATABASE_NAME=immich) service has NO backupbot
|
||
label and NO pg_dump hook. → the postgres DB is NOT backed up at all.
|
||
- No `abra.sh`, no top-level `configs:` section. So immich-as-published produces a backup containing
|
||
no restorable application data. My P4 ci_marker (postgres row) therefore cannot survive restore —
|
||
the test correctly detected a genuine, serious upstream deficiency (immich users get NO DB backup).
|
||
|
||
**WHY recipe-PR over §7.1 N/A sign-off:** immich is THE object-storage/large-volume D10 category
|
||
recipe — its entire purpose is storing user data. A P4-N/A here (unlike mailu's mail-relay N/A) would
|
||
be hollow: the data path is exactly what must be proven to survive. cc-ci exists to catch precisely
|
||
this class of bug; the recipe mirror+PR flow (§0b/§4.1) is the sanctioned mechanism, and the durable
|
||
fix was already filed as the immich Q3.5 deferral. So: author a recipe-PR adding a `database`-service
|
||
postgres backup (mirroring matrix-synapse's `/pg_backup.sh` config-mount + backupbot pre/restore
|
||
hooks), then `!testme`/`RECIPE=immich PR=<n>` proves P4 green on the fixed recipe.
|
||
|
||
**Reference pattern (matrix-synapse compose.yml):** db service `deploy.labels`:
|
||
`backupbot.backup.pre-hook="/pg_backup.sh backup"`, `backupbot.backup.volumes.postgres.path="backup.sql"`,
|
||
`backupbot.restore.post-hook="/pg_backup.sh restore"`; `configs: [{source: pg_backup, target:
|
||
/pg_backup.sh, mode: 0555}]`; top-level `configs.pg_backup.file=pg_backup.sh`. The script: backup =
|
||
`pg_dump -U $USER $DB | gzip > /var/lib/postgresql/data/backup.sql`; restore = drop+recreate+reimport.
|
||
|
||
**immich-specific risk to validate empirically BEFORE the PR:** the postgres image is VectorChord/
|
||
pgvecto.rs (custom extensions). A naive single-DB pg_dump|psql restore may choke on the vector
|
||
extension and on the live immich-server's held connections. So I'm deploying immich (install) now and
|
||
will test seed→dump→drop→restore→verify directly in the `database` container to pin down the exact
|
||
dump/restore commands (likely `pg_dumpall --clean --if-exists` and connection-termination on restore)
|
||
that round-trip the ci_marker, then bake the proven commands into pg_backup.sh. No "should work".
|
||
|
||
---
|
||
## 2026-05-30T~23:22 — Q3.5 immich CLAIMED; remaining-recipe scope (backup-capability survey)
|
||
|
||
immich P4 done the right way: recipe-PR `recipe-maintainers/immich#1` (mechanism validated live, then
|
||
full lifecycle green `/root/ccci-immich-prfull.log` — 5 tiers + 3 custom, deploy-count=1, clean
|
||
teardown). Added a genuine 2nd P3 functional test (asset-processing: exifInfo metadata + library
|
||
statistics) so the §4.3 ≥2-tests floor is met by separate test functions, not one test doing double
|
||
duty (avoids the bluesky F2-8 "floor BYPASSED" failure mode). Claimed `0487631`.
|
||
|
||
**Remaining Phase-2 work (post-immich), by node-contention class.** The Adversary will cold-verify
|
||
immich next (full ~30min run; MAX_TESTS=1) so I should NOT start a heavy deploy until it frees.
|
||
|
||
Backup-capability survey (just done) of the 4 overlay-less recipes — ALL backup-capable, so P4
|
||
data-integrity overlays are REQUIRED (not N/A like mailu):
|
||
- **ghost** — app vol `/var/lib/ghost/content` (path) + mysql `mysqldump --tab` pre-hook. P4: seed a
|
||
ghost post (mysql) OR content marker. Also owes §4.3 create-post (named Adversary standing
|
||
condition) — needs Ghost owner-setup + admin token. Heavy (~15-20min cold start).
|
||
- **bluesky-pds** — `backupbot.backup=true` on pds svc (data volume: sqlite account repos + blobs).
|
||
P4: create account+post (goat), backup, wipe, restore, assert the post/account survive. (F2-8 was
|
||
about the §4.3 floor; bluesky already has 4 functional tests incl. account+post round-trip.)
|
||
- **uptime-kuma** — default sqlite data-vol backup (mariadb variant has dump hooks). P4: create a
|
||
monitor, backup, restore, assert. Also owes §4.3 create-monitor (deferred — needs a Socket.IO
|
||
client primitive in harness; uptime-kuma's setup wizard + monitor CRUD are Socket.IO, not REST).
|
||
- **mattermost-lts** — app `/mattermost` + postgres `pg_dump` pre-hook. P4: create team/channel/
|
||
message, backup, restore, assert. Also owes §4.3 create-message read-back depth.
|
||
|
||
Overlay-complete, need only a formal green-run + gate claim: **matrix-synapse**, **lasuite-docs**
|
||
(dep: keycloak). **plausible** needs a cold green run when the upstream clickhouse-backup GitHub
|
||
rate-limit cools (deploy converges) — preserve the log. **discourse** + **drone** remain BLOCKED
|
||
(upstream bitnami images gone / operator /etc/timezone host-deploy).
|
||
|
||
NEXT unblocked unit (when node free): pick a recipe and take it to a claim. Suggest order by ease:
|
||
matrix-synapse (overlay-complete → just run+claim) → bluesky-pds P4 overlay → mattermost-lts P4 →
|
||
ghost (P4 + §4.3 create-post) → uptime-kuma (P4 + Socket.IO §4.3). Keep heavy deploys sequential.
|