cc-ci-orchestrator/cc-ci-plan/plan-lasuite-drive-oidc-robustness.md
autonomic-bot 7a87dc02b1 plan: lasuite-drive recipe-robustness PR sub-plan (collabora healthcheck + perms + lazy OIDC)
Operator (2026-05-29): dedicated sub-plan for the upstream recipe PR. Fixes collabora WOPI
healthcheck/start_period (keystone — fixes F2-12 at the source so cc-ci can return to abra-native
convergence + drop the -c/READY_PROBE backstop), backend WOPI retry, gunicorn-perms race, lazy OIDC.
PR is 'working' only when cc-ci runs the full suite incl. upgrade tier green + Adversary cold-verify,
then operator merges. Broken out from plan-lasuite-drive-oidc-robustness.md Part B (now points here).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 12:58:36 +01:00

# Plan — lasuite-drive OIDC-setup flakiness: harness restructure + recipe robustness PR
**Status:** QUEUED — picks up the deferred lasuite-drive `[~]` item (Q3.2) when the loops return to it.
The disk blocker is now lifted (host grew to 70 GB, 2026-05-29), so the upgrade tier can run too.
**Owner:** Builder + Adversary loops. **Two deliverables:** (A) a cc-ci harness change, (B) a
**lasuite-drive recipe PR** (we maintain the recipe via recipe-maintainer).
**This file:** `/srv/cc-ci/cc-ci-plan/plan-lasuite-drive-oidc-robustness.md`
---
## 0. The problem (ground it in real logs FIRST)
lasuite-drive's `setup_custom_tests` wires OIDC by doing a **full 12-service `abra app deploy --chaos`
redeploy** of an already-running heavy stack. That reconverge is **flaky**: it hits a collabora
reconverge race plus a **transient window of backend gunicorn-perms errors and WOPI 404s**. Only
**backend/app** consume the OIDC env, so re-converging collabora/onlyoffice/minio/db is unnecessary
exposure. The Builder correctly did NOT claim Q3.2 (a flaky setup isn't a reliable green) and filed
this plan instead.
**Step 0 (do before fixing):** capture the actual failure from a flaky run — collabora WOPI-discovery
timing, the backend log at the 404, and the exact gunicorn perms error — so we fix the real root
cause, not a guess. Record in JOURNAL.
---
## Part A — cc-ci harness: wire OIDC at INSTALL, eliminate the redeploy
Leverage the now **live-warm keycloak** (WC1): instead of deploy-recipe-then-redeploy-with-OIDC, the
dep is already running, so configure OIDC **before the single deploy**:
1. Create the per-run namespaced realm+client in the **warm keycloak** (lightweight API calls — no
deploy).
2. Set lasuite-drive's OIDC env in its `.env` **before** the first `abra app deploy`.
3. Deploy lasuite-drive **once**, OIDC already wired. **No mid-run `--chaos` reconverge** → the flaky
window is gone.
- **Use REAL abra commands only** (operator decision, 2026-05-29): the install is a normal
`abra app deploy`. Do **NOT** use `docker service update`/`docker service scale` to surgically
patch services — CI must exercise the same abra path a real operator uses. If any redeploy ever
remains necessary, it is `abra app deploy` (+ proper readiness gating, §C), never a docker-level
bypass.
- **Generic-first invariant still holds.** keycloak is live-warm (persistent infra, almost always up), so
there's no per-run dep *deploy* that could fail and break a generic tier. Generic tiers still
assert serving recipe-alone; if realm creation or the provider is unreachable, only the OIDC
*custom* test fails (isolated). This is SAFE **iff** the recipe boots with OIDC-env-set even when
the provider isn't reachable at boot — which Part B guarantees (lazy OIDC). A and B reinforce each
other.
- **Revises `plan-sso-dep-testing.md`** for recipes whose app tolerates OIDC-at-boot: prefer
install-time wiring against the warm dep over the post-deploy `setup_custom_tests` redeploy. Keep
the redeploy path only for recipes that genuinely can't take OIDC env at first boot.
- **Verify:** lasuite-drive full suite green **without** a mid-run reconverge; run it repeatedly
(e.g. 3×) to show the flakiness is gone, not just absent once.
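
Step 2 above (setting the OIDC env in `.env` before the first deploy) could be sketched roughly as follows. This is a minimal illustration, not the harness implementation: the helper names `upsert_env` and `wire_oidc` are hypothetical, and the variable names in the usage note are placeholders, since the real keys are whatever the lasuite-drive recipe's `.env` defines.

```python
from pathlib import Path


def upsert_env(text: str, updates: dict[str, str]) -> str:
    """Return .env text in which every key of `updates` is assigned exactly once.

    The first matching line (including a commented-out '#KEY=...' default) is
    replaced in place, later duplicates of that key are dropped, and keys that
    never appear are appended at the end.
    """
    seen: set[str] = set()
    out: list[str] = []
    for line in text.splitlines():
        # Strip a leading '#' so commented-out defaults are matched too.
        candidate = line.lstrip("#").strip()
        key = candidate.split("=", 1)[0].strip() if "=" in candidate else None
        if key in updates:
            if key not in seen:
                out.append(f"{key}={updates[key]}")
                seen.add(key)
            # A later duplicate of an already-written key is skipped.
        else:
            out.append(line)
    out.extend(f"{k}={v}" for k, v in updates.items() if k not in seen)
    return "\n".join(out) + "\n"


def wire_oidc(env_path: Path, updates: dict[str, str]) -> None:
    """Rewrite the app's .env in place; run this BEFORE the single deploy."""
    env_path.write_text(upsert_env(env_path.read_text(), updates))
```

A call might look like `wire_oidc(env_file, {"OIDC_CLIENT_ID": client_id, "OIDC_ISSUER_URL": issuer})`, with both key names standing in for the recipe's actual variables. The deploy that follows stays a plain `abra app deploy`, as required above.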
## Part B — lasuite-drive recipe PR (root-cause robustness, we're maintainers)
> **Broken out into its own sub-plan: `plan-lasuite-drive-recipe-pr.md`** (operator, 2026-05-29) —
> it now also folds in the F2-12 upgrade-convergence fix (collabora healthcheck) so cc-ci can return
> to abra-native convergence. The summary below is retained; the sub-plan is authoritative.
Fix the fragility in the recipe itself so it's robust under ANY reconverge — this helps real
operators, not just CI, and is the real payoff of maintaining the recipe. On a branch of the
lasuite-drive recipe (recipe-maintainer):
1. **Order backend behind collabora's WOPI readiness.** Add a Docker **healthcheck on collabora**
(WOPI discovery endpoint → 200) and make backend **wait for it / retry WOPI discovery with
backoff** instead of failing during collabora startup. Turns the race into deterministic ordering.
2. **Fix the gunicorn-perms race.** The transient perms error looks like a volume-permission race at
startup (gunicorn can't read/write a mount until perms are set); confirm this against the Step 0
logs. Fix it in the entrypoint by setting perms before exec'ing gunicorn, or move the perms setup
into a dedicated init step.
3. **Make OIDC discovery lazy / retrying.** If backend validates the OIDC provider eagerly at boot
and dies when it's briefly unreachable, that's what forces the deploy-with-dep dance. Resolve OIDC
discovery lazily (at first login, with retry) so the app boots without the provider up. (This is
also the property Part A relies on for the generic-first invariant.)
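
The lazy, retrying discovery described in item 3 (and the retry-with-backoff idea from item 1) might look something like the sketch below. It is an illustration of the pattern only: `LazyOIDCConfig` and `fetch_discovery` are hypothetical names, not the backend's actual code, and the attempt count and delays are arbitrary. The one external fact it relies on is the standard OpenID Connect discovery path, `/.well-known/openid-configuration`.

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def fetch_discovery(issuer: str) -> dict:
    """Fetch the provider's OpenID Connect discovery document."""
    url = issuer.rstrip("/") + "/.well-known/openid-configuration"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


class LazyOIDCConfig:
    """Resolve OIDC discovery on first use instead of at process boot."""

    def __init__(
        self,
        issuer: str,
        fetch: Callable[[str], dict] = fetch_discovery,
        attempts: int = 5,
        base_delay: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._issuer = issuer
        self._fetch = fetch
        self._attempts = attempts
        self._base_delay = base_delay
        self._sleep = sleep
        self._doc: Optional[dict] = None  # nothing is fetched at construction

    def get(self) -> dict:
        """Return the cached document, fetching it with backoff if needed."""
        if self._doc is not None:
            return self._doc
        last_error: Optional[Exception] = None
        for attempt in range(self._attempts):
            try:
                self._doc = self._fetch(self._issuer)
                return self._doc
            except OSError as exc:  # urllib's URLError and timeouts are OSErrors
                last_error = exc
                if attempt < self._attempts - 1:
                    # Exponential backoff: base, 2*base, 4*base, ...
                    self._sleep(self._base_delay * 2 ** attempt)
        raise RuntimeError(
            f"OIDC discovery failed after {self._attempts} attempts"
        ) from last_error
```

Because construction does no network I/O, the app can boot with the OIDC env set while the provider is down; only the first login path that calls `get()` pays for discovery. A failed `get()` leaves nothing cached, so the next login simply tries again.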
### The PR is "working" ONLY when cc-ci verifies it green (operator rule, 2026-05-29)
A lasuite-drive recipe change is **not** considered done until **our CI server has run the full suite
on the PR and it passes** — install + upgrade + backup + restore + custom (incl. the OIDC login
test), repeatedly-green (no flakiness). Concretely: push the recipe branch/PR, trigger cc-ci
(`!testme` on the lasuite-drive PR), and require a clean green (Adversary cold-verified). **Only then
does the operator merge** the recipe PR. This dogfoods cc-ci: the CI that surfaced the bug also gates
its fix.
## C. If any redeploy must remain (fallback only)
If a recipe genuinely needs OIDC env applied after first boot, do it with **`abra app deploy`** (real
abra — never `docker service update`), and **gate readiness properly**: poll collabora WOPI = 200 +
backend health with retries/timeout before declaring ready. This is strictly a fallback; Part A +
Part B should remove the need.
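
The readiness gate described here could be sketched as below, assuming readiness is observable as HTTP 200 on a set of URLs. The names `http_ok` and `wait_ready`, the timeout, and the interval are all hypothetical; the actual collabora WOPI discovery URL and backend health URL are deployment-specific and are left to the caller.

```python
import time
import urllib.request
from typing import Callable, Sequence


def http_ok(url: str) -> bool:
    """True if a GET on `url` answers 200; any connection or HTTP error is False."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError and timeouts are all OSErrors
        return False


def wait_ready(
    urls: Sequence[str],
    timeout: float = 300.0,
    interval: float = 5.0,
    probe: Callable[[str], bool] = http_ok,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every URL passes `probe` in the same round, or time out."""
    deadline = clock() + timeout
    while True:
        # All endpoints must be healthy together, not merely each one at
        # some point during the wait.
        if all(probe(url) for url in urls):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The harness would call `wait_ready([...])` with the collabora WOPI discovery URL and the backend health URL right after the `abra app deploy`, and treat a `False` return as a failed setup rather than proceeding to the tests.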
## Definition of done
- [ ] Step 0 root-cause logs captured + recorded.
- [ ] Part A: lasuite-drive tested via install-time OIDC wiring against the warm keycloak, **no
mid-run reconverge**, full suite green **3× in a row** (flakiness gone), Adversary cold-verified.
- [ ] Part B: lasuite-drive recipe branch with the WOPI-healthcheck-gating + gunicorn-perms + lazy-OIDC
fixes; **cc-ci runs the full suite (incl. upgrade tier) on the PR and it is repeatedly green**;
Adversary cold-verifies; operator merges.
- [ ] Q3.2 (lasuite-drive) claimed + DEFERRED.md entry closed (flaky-OIDC item) — only after the above.
- [ ] `plan-sso-dep-testing.md` updated to prefer install-time OIDC wiring (warm dep) where the recipe
supports it.