# Sub-plan — lasuite-drive recipe robustness PR (fix the root cause upstream)
**Status:** QUEUED — a **recipe-maintainer PR to the lasuite-drive recipe** (we maintain it). Picks up
after the Q3.2 lasuite-drive test work settles. Complements — and largely **obsoletes** — the
CI-side workarounds the harness currently uses for lasuite-drive's fragility.
**Owner:** Builder + Adversary loops. **This file:** `/srv/cc-ci/cc-ci-plan/plan-lasuite-drive-recipe-pr.md`
**Relationship:** this is the **recipe-side** deliverable. The cc-ci **harness-side** OIDC-at-install
work is `plan-lasuite-drive-oidc-robustness.md` Part A; this sub-plan is its Part B, broken out.

---
## 0. Why (CI surfaced real recipe bugs — fix them at the source)
cc-ci has surfaced genuine fragility in the lasuite-drive recipe that a **real operator would also
hit**, currently papered over by CI-side workarounds:
- **Install-time:** backend comes up before collabora's WOPI discovery is ready → transient
**WOPI-404** + a **gunicorn-perms** startup race. Together these make the 12-service `--chaos`
OIDC redeploy flaky.
- **Upgrade-time (F2-12):** upgrading to the heavier new collabora (25.04.9.4.1) **does not converge
within abra's monitor window** → abra FATAs. The harness currently works around this by skipping
abra's convergence monitor (`-c`) and using its own collabora WOPI-200 `READY_PROBE`.

These are recipe defects. Fixing them upstream helps every lasuite-drive operator **and** lets cc-ci
**go back to abra's native convergence** (per the guardrail "prefer abra convergence; custom probe
only when necessary") — turning the harness `-c`/READY_PROBE from a *necessity* into a *backstop*.
## 1. The fixes (lasuite-drive recipe)
1. **Collabora healthcheck + start_period (the keystone).** Add a real Docker **healthcheck** to the
collabora service — WOPI discovery endpoint returns 200 — with a `start_period` generous enough
for the heavy 25.04 image to boot. Effect: (a) swarm/abra see collabora as *unhealthy until WOPI
is actually up*, so **abra's own convergence monitor waits correctly** (fixes F2-12 at the source
— no `-c` skip needed); (b) the install-time WOPI-404 window closes because dependents can gate on
collabora health.
2. **Backend tolerates / waits for collabora WOPI.** Make backend **retry WOPI discovery with
backoff** (and/or order it behind collabora health) instead of failing on the transient 404.
3. **Fix the gunicorn-perms startup race.** Set the volume permissions in the backend entrypoint
(or an init step) **before** exec'ing gunicorn, so there's no read/write race on a freshly-mounted
volume at startup.
4. **Lazy / retrying OIDC discovery.** Backend resolves the OIDC provider **at first login with
retry**, not eagerly at boot — so the app boots cleanly with OIDC env set even if the provider
isn't reachable yet. (This is also what the harness-side OIDC-at-install pattern relies on, and
what keeps the generic-first invariant safe.)
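
Fixes 2 and 4 share one pattern: retry a discovery call with backoff, and for OIDC defer that call until first use. The sketch below is illustrative only and is not the backend's actual code. The names `retry_with_backoff`, `fetch_discovery` and `LazyDiscovery` are hypothetical, as are the attempt counts and delays; the real change would live wherever the lasuite-drive backend performs WOPI and OIDC discovery.

```python
import time
import urllib.request
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise AssertionError("unreachable")


def fetch_discovery(url: str, timeout: float = 5.0) -> bytes:
    """Fetch a discovery document (hypothetical helper).

    urlopen raises HTTPError on 4xx/5xx, so a transient 404 becomes
    a retryable exception instead of a fatal startup error.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


class LazyDiscovery(Generic[T]):
    """Resolve a provider on first use (with retry), not at boot."""

    def __init__(
        self,
        fetch: Callable[[], T],
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._sleep = sleep
        self._value: Optional[T] = None
        self._loaded = False

    def get(self) -> T:
        if not self._loaded:
            self._value = retry_with_backoff(self._fetch, sleep=self._sleep)
            self._loaded = True
        return self._value  # type: ignore[return-value]
```

A caller might wrap the WOPI lookup as `retry_with_backoff(lambda: fetch_discovery(url))`, where `url` points at collabora's discovery endpoint (the exact host and port depend on the recipe and are not given here). Constructing `LazyDiscovery` performs no network call, which is what lets the app boot with OIDC env set while the provider is still unreachable.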
## 2. Mechanics — branch, PR, and the merge rule
- Make the change on a **lasuite-drive recipe branch** and open a PR via the **`recipe-create-pr`
skill** (`/srv/recipe-maintainer/.opencode/skills/recipe-create-pr/SKILL.md`) — mirror to
`git.autonomic.zone/recipe-maintainers/lasuite-drive` as needed; upstream is
`git.coopcloud.tech`.
- **The PR is "working" ONLY when cc-ci verifies it green** (operator rule): trigger cc-ci
(`!testme` on the lasuite-drive PR) and require the **full suite incl. the UPGRADE tier** to pass
**repeatedly-green** (e.g. 3 consecutive passes, not a one-off), **Adversary cold-verified**.
**Only then does the operator merge.** This dogfoods cc-ci: the CI that found the bugs gates the fix.
- **SCOPE (operator, 2026-05-29):** this repeated-green / 3× bar is **specific to lasuite-drive
because it was demonstrably FLAKY** — it's a *flakiness proof* (show the fix made it reliably
green, not green-by-luck-once). It is **NOT the general testing standard.** Normal recipe gates
remain **one Adversary cold-verified green** (`plan.md §6.1`); do not generalize 3× to other
recipes/gates.
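
The repeated-green bar can be stated precisely as "the most recent N runs all passed". A minimal sketch, assuming run outcomes are available as an oldest-first list of booleans; how cc-ci actually records results is not specified in this plan, and `is_repeatedly_green` is a hypothetical name.

```python
from typing import Sequence


def is_repeatedly_green(results: Sequence[bool], required: int = 3) -> bool:
    """Return True when the latest `required` runs are all passes.

    `results` is ordered oldest-first; True means the full suite passed.
    An earlier failure does not matter once enough consecutive passes
    follow it, but a failure inside the window resets the streak.
    """
    if required < 1:
        raise ValueError("required must be >= 1")
    if len(results) < required:
        return False
    return all(results[-required:])
```

With `required=3` this expresses the lasuite-drive flakiness proof; `required=1` corresponds to the normal single-green gate for other recipes.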
## 3. Definition of done
- [ ] Recipe branch with fixes #1-#4; PR opened (via the `recipe-create-pr` skill).
- [ ] **cc-ci runs the full suite (install + upgrade + backup + restore + custom/OIDC) on the PR,
repeatedly green, Adversary cold-verified.**
- [ ] **Root-cause proof:** with the collabora healthcheck in place, demonstrate the upgrade tier
passes under **abra's NATIVE convergence** (i.e. drop `-c` for lasuite-drive and it still
converges + stays green) — confirming the recipe fix resolved F2-12 at the source. If it still
needs the harness backstop, say so honestly (record why).
- [ ] Operator merges the recipe PR. Then: cc-ci can **revert the lasuite-drive `-c`/READY_PROBE
workaround to abra-native convergence** (per the guardrail), and close the lasuite-drive flaky
items.
## 4. Guardrails
- **Don't weaken any test** to make the PR pass — the fixes must make the recipe genuinely robust,
proven by repeated-green cc-ci runs, not by loosening assertions.
- **Real abra path** throughout (no docker-level bypass).
- **Bounded** — the four targeted robustness fixes; not a recipe rewrite. Bigger recipe improvements
→ upstream issues / IDEAS, not this PR.