plan: lasuite-drive OIDC-setup flakiness — harness restructure (A) + recipe robustness PR (B)

Deferred lasuite-drive [~] (Q3.2). Two parts: (A) cc-ci wires OIDC at INSTALL against the live-warm keycloak (WC1) so there's no flaky mid-run 12-service --chaos reconverge — using REAL abra commands only (no docker service update bypass; operator decision); (B) a lasuite-drive recipe PR fixing the root cause (collabora WOPI healthcheck-gating + gunicorn-perms race + lazy/retrying OIDC discovery). Operator rule: a recipe change is "working" only once cc-ci runs the full suite on the PR and it's repeatedly green (Adversary cold-verified) — then the operator merges. A+B reinforce (lazy OIDC makes install-time wiring safe for the generic-first invariant). Ground the fix in captured failure logs first. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 08:57:26 +01:00
parent ae83a8120d
commit 269253916c
1 changed files with 93 additions and 0 deletions
--- a/cc-ci-plan/plan-lasuite-drive-oidc-robustness.md
+++ b/cc-ci-plan/plan-lasuite-drive-oidc-robustness.md
@@ -0,0 +1,93 @@
+# Plan — lasuite-drive OIDC-setup flakiness: harness restructure + recipe robustness PR
+
+**Status:** QUEUED — picks up the deferred lasuite-drive `[~]` item (Q3.2) when the loops return to it.
+The disk blocker is now lifted (host grew to 70 GB, 2026-05-29), so the upgrade tier can run too.
+**Owner:** Builder + Adversary loops. **Two deliverables:** (A) a cc-ci harness change, (B) a
+**lasuite-drive recipe PR** (we maintain the recipe via recipe-maintainer).
+**This file:** `/srv/cc-ci/cc-ci-plan/plan-lasuite-drive-oidc-robustness.md`
+
+---
+
+## 0. The problem (ground it in real logs FIRST)
+
+lasuite-drive's `setup_custom_tests` wires OIDC by doing a **full 12-service `abra app deploy --chaos`
+redeploy** of an already-running heavy stack. That reconverge is **flaky**: it hits a collabora
+reconverge race plus a **transient backend gunicorn-perms / WOPI-404 window**. Only **backend/app**
+consume the OIDC env, so re-converging collabora/onlyoffice/minio/db is unnecessary exposure. The
+Builder correctly did NOT claim Q3.2 (a flaky setup isn't a reliable green) and filed this.
+
+**Step 0 (do before fixing):** capture the actual failure from a flaky run — collabora WOPI-discovery
+timing, the backend log at the 404, and the exact gunicorn perms error — so we fix the real root
+cause, not a guess. Record in JOURNAL.
+
+---
+
+## Part A — cc-ci harness: wire OIDC at INSTALL, eliminate the redeploy
+
+Leverage the now **live-warm keycloak** (WC1): the dep is already running, so instead of
+deploy-recipe-then-redeploy-with-OIDC, configure OIDC **before the single deploy**:
+
+1. Create the per-run namespaced realm+client in the **warm keycloak** (lightweight API calls — no
+   deploy).
+2. Set lasuite-drive's OIDC env in its `.env` **before** the first `abra app deploy`.
+3. Deploy lasuite-drive **once**, OIDC already wired. **No mid-run `--chaos` reconverge** → the flaky
+   window is gone.
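
Step 1 above (per-run realm + client in the warm keycloak) could look roughly like the sketch below. It only builds the two admin-API requests; it does not send them. Assumptions, none of which come from the plan: the `ci-<run_id>` / `drive-<run_id>` naming, the helper name `realm_requests`, the example hosts, and a keycloak that serves the admin API at `/admin/realms` without a legacy `/auth` prefix. Obtaining the admin token is left out.

```python
import json
import urllib.request


def realm_requests(base_url: str, run_id: str, redirect_uri: str, token: str):
    """Build (not send) the requests that create a per-run realm and client."""
    realm = f"ci-{run_id}"  # hypothetical naming scheme
    base = base_url.rstrip("/")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }

    def post(path: str, payload: dict) -> urllib.request.Request:
        return urllib.request.Request(
            base + path,
            data=json.dumps(payload).encode(),
            headers=headers,
            method="POST",
        )

    create_realm = post("/admin/realms", {"realm": realm, "enabled": True})
    create_client = post(
        f"/admin/realms/{realm}/clients",
        {
            "clientId": f"drive-{run_id}",  # hypothetical naming scheme
            "protocol": "openid-connect",
            "publicClient": False,
            "standardFlowEnabled": True,
            "redirectUris": [redirect_uri],
        },
    )
    return create_realm, create_client
```

Each request would then be sent with `urllib.request.urlopen(req)`. Keeping the builder separate from the network call makes the per-run naming easy to unit-test without a live keycloak.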
+
+- **Use REAL abra commands only** (operator decision, 2026-05-29): the install is a normal
+  `abra app deploy`. Do **NOT** use `docker service update`/`docker service scale` to surgically
+  patch services — CI must exercise the same abra path a real operator uses. If any redeploy ever
+  remains necessary, it is `abra app deploy` (+ proper readiness gating, §C), never a docker-level
+  bypass.
+- **Generic-first invariant still holds.** keycloak is live-warm (persistent infra, ~always up), so
+  there's no per-run dep *deploy* that could fail and break a generic tier. Generic tiers still
+  assert that the recipe serves on its own; if realm creation fails or the provider is unreachable,
+  only the OIDC *custom* test fails (isolated). This is SAFE **iff** the recipe boots with the OIDC
+  env set even when the provider isn't reachable at boot, which Part B is meant to guarantee (lazy
+  OIDC). A and B reinforce each other.
+- **Revises `plan-sso-dep-testing.md`** for recipes whose app tolerates OIDC-at-boot: prefer
+  install-time wiring against the warm dep over the post-deploy `setup_custom_tests` redeploy. Keep
+  the redeploy path only for recipes that genuinely can't take OIDC env at first boot.
+- **Verify:** lasuite-drive full suite green **without** a mid-run reconverge; run it repeatedly
+  (e.g. 3×) to show the flakiness is gone, not just absent once.
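
Step 2 of Part A (set the OIDC env in `.env` before the first deploy) is essentially an idempotent key upsert. A minimal sketch, not the cc-ci harness code: `upsert_env` is a hypothetical helper, and the key names used in the usage note are placeholders; the real keys are whatever the lasuite-drive recipe's env sample defines.

```python
import re


def upsert_env(env_text: str, updates: dict) -> str:
    """Return env_text with every KEY=VALUE from updates set exactly once.

    Active assignments are replaced in place, later duplicates of the same
    key are dropped, and missing keys are appended. Commented-out lines are
    left untouched.
    """
    written = set()
    out = []
    for line in env_text.splitlines():
        match = re.match(r"^([A-Za-z_][A-Za-z0-9_]*)=", line)
        key = match.group(1) if match else None
        if key in updates:
            if key not in written:
                out.append(f"{key}={updates[key]}")
                written.add(key)
            # A second active assignment of the same key is dropped so it
            # cannot override the value written above.
        else:
            out.append(line)
    out.extend(f"{k}={v}" for k, v in updates.items() if k not in written)
    return "\n".join(out) + "\n"
```

For example, `upsert_env(text, {"OIDC_CLIENT_ID": "drive-run42"})` rewrites an existing `OIDC_CLIENT_ID=` line or appends one. Because the function is idempotent, re-running the harness step against an already-wired `.env` leaves the file unchanged.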
+
+## Part B — lasuite-drive recipe PR (root-cause robustness, we're maintainers)
+
+Fix the fragility in the recipe itself so it's robust under ANY reconverge — this helps real
+operators, not just CI, and is the real payoff of maintaining the recipe. On a branch of the
+lasuite-drive recipe (recipe-maintainer):
+
+1. **Order backend behind collabora's WOPI readiness.** Add a Docker **healthcheck on collabora**
+   (WOPI discovery endpoint → 200) and make backend **wait for it / retry WOPI discovery with
+   backoff** instead of failing during collabora startup. Turns the race into deterministic ordering.
+2. **Fix the gunicorn-perms race.** The transient perms error looks like a volume-permission race at
+   startup (gunicorn can't read/write a mount until perms are set); confirm this against the Step 0
+   logs. Fix in the entrypoint: set perms before exec'ing gunicorn (or add a proper init step).
+3. **Make OIDC discovery lazy / retrying.** If backend validates the OIDC provider eagerly at boot
+   and dies when it's briefly unreachable, that's what forces the deploy-with-dep dance. Resolve OIDC
+   discovery lazily (at first login, with retry) so the app boots without the provider up. (This is
+   also the property Part A relies on for the generic-first invariant.)
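
Item 3 (lazy, retrying OIDC discovery) can be illustrated with the framework-agnostic sketch below. It is not the backend's actual OIDC integration, which this plan does not show; the class name, defaults, and injected `fetch`/`sleep` hooks are assumptions for illustration. The only standard piece is the discovery path `/.well-known/openid-configuration` appended to the issuer URL.

```python
import json
import time
import urllib.request


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """Fetch and decode a JSON document over HTTP(S)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


class LazyOIDCDiscovery:
    """Resolve provider metadata on first use instead of at boot."""

    def __init__(self, issuer, fetch=fetch_json, attempts=5,
                 base_delay=0.5, sleep=time.sleep):
        self._url = issuer.rstrip("/") + "/.well-known/openid-configuration"
        self._fetch = fetch
        self._attempts = attempts
        self._base_delay = base_delay
        self._sleep = sleep
        self._metadata = None  # nothing is fetched at construction time

    def metadata(self) -> dict:
        """Return cached metadata, fetching with exponential backoff if needed."""
        if self._metadata is not None:
            return self._metadata
        last_error = None
        for attempt in range(self._attempts):
            try:
                self._metadata = self._fetch(self._url)
                return self._metadata
            except (OSError, ValueError) as exc:  # network or JSON decode failure
                last_error = exc
                if attempt < self._attempts - 1:
                    self._sleep(self._base_delay * 2 ** attempt)
        # Failures are not cached, so the next login attempt retries.
        raise RuntimeError(
            f"OIDC discovery failed after {self._attempts} attempts"
        ) from last_error
```

Constructing the object never touches the network, so the app can boot with the OIDC env set while the provider is down; the first call to `metadata()` (for example at first login) pays the discovery cost.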
+
+### The PR is "working" ONLY when cc-ci verifies it green (operator rule, 2026-05-29)
+A lasuite-drive recipe change is **not** considered done until **our CI server has run the full suite
+on the PR and it passes** — install + upgrade + backup + restore + custom (incl. the OIDC login
+test), repeatedly-green (no flakiness). Concretely: push the recipe branch/PR, trigger cc-ci
+(`!testme` on the lasuite-drive PR), and require a clean green (Adversary cold-verified). **Only then
+does the operator merge** the recipe PR. This dogfoods cc-ci: the CI that surfaced the bug also gates
+its fix.
+
+## C. If any redeploy must remain (fallback only)
+If a recipe genuinely needs OIDC env applied after first boot, do it with **`abra app deploy`** (real
+abra, never `docker service update`), and **gate readiness properly**: poll until collabora WOPI
+discovery returns 200 and backend health passes, with retries and a timeout, before declaring ready.
+This is strictly a fallback; Part A + Part B should remove the need.
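
The readiness gate described here could be sketched as a poll-with-deadline loop. This is an illustration under assumptions, not the harness implementation: `wait_until_ready` and `http_ok` are hypothetical names, the timeout and interval defaults are arbitrary, and the URLs passed in (collabora's WOPI discovery endpoint, a backend health route) depend on the deployed recipe.

```python
import http.client
import time
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True when a GET on url returns HTTP 200; any error counts as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and non-2xx responses
        # (urllib raises HTTPError, an OSError subclass, for 4xx/5xx).
        return False


def wait_until_ready(urls, probe=http_ok, timeout=300.0, interval=5.0,
                     clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll until every URL passes the probe in the same round, or time out."""
    deadline = clock() + timeout
    while True:
        if all(probe(url) for url in urls):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Requiring all probes to pass in the same round avoids declaring ready when collabora answered once and then restarted. The injected `clock` and `sleep` exist only so the loop can be tested without real waiting.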
+
+## Definition of done
+- [ ] Step 0 root-cause logs captured + recorded.
+- [ ] Part A: lasuite-drive tested via install-time OIDC wiring against the warm keycloak, **no
+      mid-run reconverge**, full suite green **3× in a row** (flakiness gone), Adversary cold-verified.
+- [ ] Part B: lasuite-drive recipe branch with the WOPI-healthcheck-gating + gunicorn-perms + lazy-OIDC
+      fixes; **cc-ci runs the full suite (incl. upgrade tier) on the PR and it is repeatedly green**;
+      Adversary cold-verifies; operator merges.
+- [ ] Q3.2 (lasuite-drive) claimed + DEFERRED.md entry closed (flaky-OIDC item) — only after the above.
+- [ ] `plan-sso-dep-testing.md` updated to prefer install-time OIDC wiring (warm dep) where the recipe
+      supports it.