# Sub-plan: lasuite-drive recipe robustness PR (fix the root cause upstream)

**Status:** QUEUED. A **recipe-maintainer PR to the lasuite-drive recipe** (we maintain it). Picks up
after the Q3.2 lasuite-drive test work settles. Complements, and largely **obsoletes**, the
CI-side workarounds the harness currently uses for lasuite-drive's fragility.
**Owner:** Builder + Adversary loops. **This file:** `/srv/cc-ci/cc-ci-plan/plan-lasuite-drive-recipe-pr.md`
**Relationship:** this is the **recipe-side** deliverable. The cc-ci **harness-side** OIDC-at-install
work is `plan-lasuite-drive-oidc-robustness.md` Part A; this sub-plan is its Part B, broken out.

---
## 0. Why (CI surfaced real recipe bugs; fix them at the source)

cc-ci has surfaced genuine fragility in the lasuite-drive recipe that a **real operator would also
hit**. It is currently papered over by CI-side workarounds:
- **Install-time:** the backend comes up before collabora's WOPI discovery is ready, causing a
  transient **WOPI-404**; there is also a **gunicorn-perms** startup race. Symptom: the flaky
  12-service `--chaos` OIDC redeploy.
- **Upgrade-time (F2-12):** upgrading to the heavier new collabora (25.04.9.4.1) **does not converge
  within abra's monitor window**, so abra exits with a FATA. The harness currently works around
  this by skipping abra's convergence monitor (`-c`) and using its own collabora WOPI-200
  `READY_PROBE`.

These are recipe defects. Fixing them upstream helps every lasuite-drive operator **and** lets cc-ci
**go back to abra's native convergence** (per the guardrail "prefer abra convergence; custom probe
only when necessary"), turning the harness `-c`/READY_PROBE from a *necessity* into a *backstop*.
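
The harness-side workaround described above amounts to polling collabora until WOPI discovery
answers 200. The following is a minimal sketch of that idea, not the harness's actual
`READY_PROBE`: the function names, the default deadline and interval, and the injected
`clock`/`sleep` parameters are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a GET on url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. the transient 404 while collabora is still booting
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout


def wait_until_ready(
    get_status: Callable[[], int],
    deadline_s: float = 600.0,   # hypothetical default, not a value from the plan
    interval_s: float = 5.0,     # hypothetical default
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status() until it returns 200 or deadline_s has elapsed."""
    start = clock()
    while True:
        if get_status() == 200:
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

A call might look like `wait_until_ready(lambda: http_status("http://collabora:9980/hosting/discovery"))`,
where the host, port, and path are assumed values for a typical Collabora deployment rather than
anything taken from the recipe. Once the recipe carries its own healthcheck, a probe like this
only needs to exist as a backstop.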
## 1. The fixes (lasuite-drive recipe)

1. **Collabora healthcheck + start_period (the keystone).** Add a real Docker **healthcheck** to the
   collabora service (the WOPI discovery endpoint returns 200) with a `start_period` generous
   enough for the heavy 25.04 image to boot. Effect: (a) swarm/abra see collabora as *not healthy
   until WOPI is actually up*, so **abra's own convergence monitor waits correctly** (fixes F2-12
   at the source; no `-c` skip needed); (b) the install-time WOPI-404 window closes because
   dependents can gate on collabora health.
2. **Backend tolerates / waits for collabora WOPI.** Make the backend **retry WOPI discovery with
   backoff** (and/or order it behind collabora health) instead of failing on the transient 404.
   Swarm stack deploys ignore `depends_on`, so any ordering has to be an explicit wait on the
   backend side.
3. **Fix the gunicorn-perms startup race.** Set the volume permissions in the backend entrypoint
   (or an init step) **before** exec'ing gunicorn, so there is no read/write race on a freshly
   mounted volume at startup.
4. **Lazy / retrying OIDC discovery.** The backend resolves the OIDC provider **at first login with
   retry**, not eagerly at boot, so the app boots cleanly with OIDC env set even if the provider
   isn't reachable yet. (This is also what the harness-side OIDC-at-install pattern relies on, and
   what keeps the generic-first invariant safe.)
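
Fixes 2 and 4 share one pattern: resolve a discovery document lazily and retry with backoff
instead of failing on the first transient error. The sketch below illustrates that pattern only.
It is not the lasuite-drive backend's code; `retry_with_backoff`, `LazyDiscovery`, and every
default value are hypothetical names and numbers chosen for the example.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    attempts: int = 5,            # hypothetical default
    base_delay_s: float = 1.0,    # hypothetical default
    max_delay_s: float = 30.0,    # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds, doubling the delay after each failure."""
    delay = base_delay_s
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
    raise ValueError("attempts must be >= 1")


class LazyDiscovery(Generic[T]):
    """Resolve a discovery document on first use instead of at boot."""

    def __init__(self, fetch: Callable[[], T], **retry_kwargs) -> None:
        self._fetch = fetch
        self._retry_kwargs = retry_kwargs
        self._value: Optional[T] = None
        self._resolved = False

    def get(self) -> T:
        if not self._resolved:
            # A failure here is not cached, so the next call tries again.
            self._value = retry_with_backoff(self._fetch, **self._retry_kwargs)
            self._resolved = True
        return self._value
```

Constructing `LazyDiscovery(fetch_oidc_config)` at startup (with `fetch_oidc_config` standing in
for whatever function fetches the provider metadata) performs no network call; the first
`get()`, for example during the first login, triggers the fetch. That is the property that lets
the app boot cleanly while the provider is still unreachable.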
## 2. Mechanics: branch, PR, and the merge rule

- Make the change on a **lasuite-drive recipe branch** and open a PR via the **`recipe-create-pr`
  skill** (`/srv/recipe-maintainer/.opencode/skills/recipe-create-pr/SKILL.md`). Mirror to
  `git.autonomic.zone/recipe-maintainers/lasuite-drive` as needed; upstream is
  `git.coopcloud.tech`.
- **The PR is "working" ONLY when cc-ci verifies it green** (operator rule): trigger cc-ci
  (`!testme` on the lasuite-drive PR) and require the **full suite including the UPGRADE tier** to
  pass **repeatedly** (e.g. 3 consecutive passes, not a one-off), **Adversary cold-verified**.
  **Only then does the operator merge.** This dogfoods cc-ci: the CI that found the bugs gates the fix.
- **SCOPE (operator, 2026-05-29):** this repeated-green / 3× bar is **specific to lasuite-drive
  because it was demonstrably FLAKY**. It is a *flakiness proof*: show that the fix made the
  recipe reliably green, not green once by luck. It is **NOT the general testing standard.** Normal
  recipe gates remain **one Adversary cold-verified green** (`plan.md §6.1`); do not generalize 3×
  to other recipes/gates.
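
The repeated-green bar can be stated precisely: the most recent runs must all be green, with no
failure in between. A minimal sketch of that check follows; the function names and the
list-of-booleans run history are assumptions for illustration, not part of cc-ci.

```python
from typing import Sequence


def consecutive_green(results: Sequence[bool]) -> int:
    """Count passing runs at the end of the history (most recent run last)."""
    streak = 0
    for passed in reversed(results):
        if not passed:
            break
        streak += 1
    return streak


def meets_flakiness_bar(results: Sequence[bool], required: int = 3) -> bool:
    """True when the latest `required` runs all passed with no failure in between."""
    return consecutive_green(results) >= required
```

With a history of `[False, True, True, True]` the bar is met; with `[True, True, False, True]` it
is not, because the failure resets the streak. Under this sketch, the normal single-green gate
would correspond to `required=1`.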
## 3. Definition of done

- [ ] Recipe branch with fixes #1–#4; PR opened (`recipe-create-pr`).
- [ ] **cc-ci runs the full suite (install + upgrade + backup + restore + custom/OIDC) on the PR,
  repeatedly green, Adversary cold-verified.**
- [ ] **Root-cause proof:** with the collabora healthcheck in place, demonstrate that the upgrade
  tier passes under **abra's NATIVE convergence** (i.e. drop `-c` for lasuite-drive and it still
  converges and stays green), confirming the recipe fix resolved F2-12 at the source. If it still
  needs the harness backstop, say so honestly (record why).
- [ ] Operator merges the recipe PR. Then: cc-ci can **revert the lasuite-drive `-c`/READY_PROBE
  workaround to abra-native convergence** (per the guardrail), and close the lasuite-drive flaky
  items.
## 4. Guardrails

- **Don't weaken any test** to make the PR pass. The fixes must make the recipe genuinely robust,
  proven by repeated-green cc-ci runs, not by loosening assertions.
- **Real abra path** throughout (no docker-level bypass).
- **Bounded:** the four targeted robustness fixes, not a recipe rewrite. Bigger recipe improvements
  go to upstream issues / IDEAS, not this PR.