plan: lasuite-drive recipe-robustness PR sub-plan (collabora healthcheck + perms + lazy OIDC)

Operator (2026-05-29): dedicated sub-plan for the upstream recipe PR. Fixes collabora WOPI
healthcheck/start_period (keystone — fixes F2-12 at the source so cc-ci can return to abra-native
convergence + drop the -c/READY_PROBE backstop), backend WOPI retry, gunicorn-perms race, lazy OIDC.
PR is 'working' only when cc-ci runs the full suite incl. upgrade tier green + Adversary cold-verify,
then operator merges. Broken out from plan-lasuite-drive-oidc-robustness.md Part B (now points here).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 12:58:36 +01:00
parent 7f8e6cb13e
commit 7a87dc02b1
2 changed files with 76 additions and 0 deletions

@@ -52,6 +52,10 @@ dep is already running, so configure OIDC **before the single deploy**:
## Part B — lasuite-drive recipe PR (root-cause robustness, we're maintainers)
> **Broken out into its own sub-plan: `plan-lasuite-drive-recipe-pr.md`** (operator, 2026-05-29) —
> it now also folds in the F2-12 upgrade-convergence fix (collabora healthcheck) so cc-ci can return
> to abra-native convergence. The summary below is retained; the sub-plan is authoritative.
Fix the fragility in the recipe itself so it's robust under ANY reconverge — this helps real
operators, not just CI, and is the real payoff of maintaining the recipe. On a branch of the
lasuite-drive recipe (recipe-maintainer):

@@ -0,0 +1,72 @@
# Sub-plan — lasuite-drive recipe robustness PR (fix the root cause upstream)
**Status:** QUEUED — a **recipe-maintainer PR to the lasuite-drive recipe** (we maintain it). Picks up
after the Q3.2 lasuite-drive test work settles. Complements — and largely **obsoletes** — the
CI-side workarounds the harness currently uses for lasuite-drive's fragility.
**Owner:** Builder + Adversary loops. **This file:** `/srv/cc-ci/cc-ci-plan/plan-lasuite-drive-recipe-pr.md`
**Relationship:** this is the **recipe-side** deliverable. The cc-ci **harness-side** OIDC-at-install
work is `plan-lasuite-drive-oidc-robustness.md` Part A; this sub-plan is its Part B, broken out.
---
## 0. Why (CI surfaced real recipe bugs — fix them at the source)
cc-ci has surfaced genuine fragility in the lasuite-drive recipe that a **real operator would also
hit**, currently papered over by CI-side workarounds:
- **Install-time:** backend comes up before collabora's WOPI discovery is ready → transient
**WOPI-404** + a **gunicorn-perms** startup race. This is what makes the 12-service `--chaos` OIDC redeploy flaky.
- **Upgrade-time (F2-12):** upgrading to the heavier new collabora (25.04.9.4.1) **does not converge
within abra's monitor window** → abra FATAs. The harness currently works around this by skipping
abra's convergence monitor (`-c`) and using its own collabora WOPI-200 `READY_PROBE`.
These are recipe defects. Fixing them upstream helps every lasuite-drive operator **and** lets cc-ci
**go back to abra's native convergence** (per the guardrail "prefer abra convergence; custom probe
only when necessary") — turning the harness `-c`/READY_PROBE from a *necessity* into a *backstop*.
## 1. The fixes (lasuite-drive recipe)
1. **Collabora healthcheck + start_period (the keystone).** Add a real Docker **healthcheck** to the
collabora service — WOPI discovery endpoint returns 200 — with a `start_period` generous enough
for the heavy 25.04 image to boot. Effect: (a) swarm/abra see collabora as *unhealthy until WOPI
is actually up*, so **abra's own convergence monitor waits correctly** (fixes F2-12 at the source
— no `-c` skip needed); (b) the install-time WOPI-404 window closes because dependents can gate on
collabora health.
2. **Backend tolerates / waits for collabora WOPI.** Make backend **retry WOPI discovery with
backoff** (and/or order it behind collabora health) instead of failing on the transient 404.
3. **Fix the gunicorn-perms startup race.** Set the volume permissions in the backend entrypoint
(or an init step) **before** exec'ing gunicorn, so there's no read/write race on a freshly-mounted
volume at startup.
4. **Lazy / retrying OIDC discovery.** Backend resolves the OIDC provider **at first login with
retry**, not eagerly at boot — so the app boots cleanly with OIDC env set even if the provider
isn't reachable yet. (This is also what the harness-side OIDC-at-install pattern relies on, and
what keeps the generic-first invariant safe.)
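
Fixes 2 and 4 share one shape: retry a discovery call with backoff, and defer it until first use rather than doing it at boot. The following is a minimal Python sketch of that shape, not the backend's actual code. `retry_with_backoff`, `LazyDiscovery`, and every default (attempt count, delays) are hypothetical names and values chosen for illustration; the real backend's WOPI and OIDC client code may be structured quite differently.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    attempts: int = 5,          # hypothetical default, tune per service
    base_delay: float = 0.5,    # seconds before the second attempt
    max_delay: float = 8.0,     # cap so waits do not grow unbounded
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(min(delay, max_delay))
            delay *= 2
    raise RuntimeError("unreachable")


class LazyDiscovery(Generic[T]):
    """Resolve a discovery document on first use, then cache it.

    Nothing is fetched at construction time, so the app can boot even
    when the provider (collabora WOPI or the OIDC issuer) is not up yet.
    A failed resolution is not cached, so the next caller retries.
    """

    def __init__(self, fetch: Callable[[], T], **retry_kwargs) -> None:
        self._fetch = fetch
        self._retry_kwargs = retry_kwargs
        self._cached: Optional[T] = None
        self._resolved = False

    def get(self) -> T:
        if not self._resolved:
            self._cached = retry_with_backoff(self._fetch, **self._retry_kwargs)
            self._resolved = True
        return self._cached  # type: ignore[return-value]
```

In this sketch `fetch` would be whatever function performs the HTTP request for the discovery document and raises on a non-200 response. Injecting `sleep` keeps the backoff testable without real waiting.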
## 2. Mechanics — branch, PR, and the merge rule
- Make the change on a **lasuite-drive recipe branch** and open a PR via the **`recipe-create-pr`
skill** (`/srv/recipe-maintainer/.opencode/skills/recipe-create-pr/SKILL.md`) — mirror to
`git.autonomic.zone/recipe-maintainers/lasuite-drive` as needed; upstream is
`git.coopcloud.tech`.
- **The PR is "working" ONLY when cc-ci verifies it green** (operator rule): trigger cc-ci
(`!testme` on the lasuite-drive PR) and require the **full suite incl. the UPGRADE tier** to pass
**repeatedly-green** (not a one-off), **Adversary cold-verified**. **Only then does the operator
merge.** This dogfoods cc-ci: the CI that found the bugs gates the fix.
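
The merge rule above can be read as a simple predicate over recent cc-ci runs. The sketch below is illustrative only: the tier keys mirror the suite listed in the definition of done, while `pr_is_working`, the run-record shape, and the `min_consecutive` threshold are hypothetical (the plan says "repeatedly-green" without fixing a number).

```python
from typing import Dict, List

# Tier names follow the suite in this plan; "custom" stands for custom/OIDC.
REQUIRED_TIERS = ("install", "upgrade", "backup", "restore", "custom")


def run_is_green(run: Dict[str, bool]) -> bool:
    """A run is green only if every required tier ran and passed."""
    return all(run.get(tier, False) for tier in REQUIRED_TIERS)


def pr_is_working(
    runs: List[Dict[str, bool]],
    cold_verified: bool,
    min_consecutive: int = 3,  # hypothetical threshold for "repeatedly green"
) -> bool:
    """True when the latest runs are all green and the Adversary cold-verified."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be >= 1")
    recent = runs[-min_consecutive:]
    return (
        cold_verified
        and len(recent) == min_consecutive
        and all(run_is_green(run) for run in recent)
    )
```

A single green run, or green runs without the upgrade tier, would not satisfy this predicate, which matches the "not a one-off" wording.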
## 3. Definition of done
- [ ] Recipe branch with fixes #1-#4; PR opened (recipe-create-pr).
- [ ] **cc-ci runs the full suite (install + upgrade + backup + restore + custom/OIDC) on the PR,
repeatedly green, Adversary cold-verified.**
- [ ] **Root-cause proof:** with the collabora healthcheck in place, demonstrate the upgrade tier
passes under **abra's NATIVE convergence** (i.e. drop `-c` for lasuite-drive and it still
converges + stays green) — confirming the recipe fix resolved F2-12 at the source. If it still
needs the harness backstop, say so honestly (record why).
- [ ] Operator merges the recipe PR. Then: cc-ci can **revert the lasuite-drive `-c`/READY_PROBE
workaround to abra-native convergence** (per the guardrail), and close the lasuite-drive flaky
items.
## 4. Guardrails
- **Don't weaken any test** to make the PR pass — the fixes must make the recipe genuinely robust,
proven by repeated-green cc-ci runs, not by loosening assertions.
- **Real abra path** throughout (no docker-level bypass).
- **Bounded** — the four targeted robustness fixes; not a recipe rewrite. Bigger recipe improvements
→ upstream issues / IDEAS, not this PR.