cc-ci-orchestrator/cc-ci-plan/plan-lasuite-drive-oidc-robustness.md
autonomic-bot 7a87dc02b1 plan: lasuite-drive recipe-robustness PR sub-plan (collabora healthcheck + perms + lazy OIDC)
Operator (2026-05-29): dedicated sub-plan for the upstream recipe PR. Fixes collabora WOPI
healthcheck/start_period (keystone — fixes F2-12 at the source so cc-ci can return to abra-native
convergence + drop the -c/READY_PROBE backstop), backend WOPI retry, gunicorn-perms race, lazy OIDC.
PR is 'working' only when cc-ci runs the full suite incl. upgrade tier green + Adversary cold-verify,
then operator merges. Broken out from plan-lasuite-drive-oidc-robustness.md Part B (now points here).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 12:58:36 +01:00

Plan — lasuite-drive OIDC-setup flakiness: harness restructure + recipe robustness PR

Status: QUEUED. Picks up the deferred lasuite-drive [~] item (Q3.2) when the loops return to it. The disk blocker is now lifted (host grew to 70 GB, 2026-05-29), so the upgrade tier can run too.

Owner: Builder + Adversary loops.

Deliverables: (A) a cc-ci harness change, (B) a lasuite-drive recipe PR (we maintain the recipe via recipe-maintainer).

This file: /srv/cc-ci/cc-ci-plan/plan-lasuite-drive-oidc-robustness.md


0. The problem (ground it in real logs FIRST)

lasuite-drive's setup_custom_tests wires OIDC by running a full 12-service abra app deploy --chaos against an already-running heavy stack. That reconverge is flaky in two ways: collabora races on its own reconverge, and backend goes through a transient window of gunicorn permission errors and WOPI 404s. Only backend/app consume the OIDC env, so re-converging collabora/onlyoffice/minio/db is unnecessary exposure. The Builder correctly did NOT claim Q3.2 (a flaky setup isn't a reliable green) and filed this plan instead.

Step 0 (do before fixing): capture the actual failure from a flaky run — collabora WOPI-discovery timing, the backend log at the 404, and the exact gunicorn perms error — so we fix the real root cause, not a guess. Record in JOURNAL.


Part A — cc-ci harness: wire OIDC at INSTALL, eliminate the redeploy

Leverage the now live-warm keycloak (WC1). Because the dep is already running, OIDC can be configured before the single deploy, which replaces the old deploy-recipe-then-redeploy-with-OIDC sequence:

  1. Create the per-run namespaced realm+client in the warm keycloak (lightweight API calls — no deploy).
  2. Set lasuite-drive's OIDC env in its .env before the first abra app deploy.
  3. Deploy lasuite-drive once, OIDC already wired. No mid-run --chaos reconverge → the flaky window is gone.
  • Use REAL abra commands only (operator decision, 2026-05-29): the install is a normal abra app deploy. Do NOT use docker service update/docker service scale to surgically patch services — CI must exercise the same abra path a real operator uses. If any redeploy ever remains necessary, it is abra app deploy (+ proper readiness gating, §C), never a docker-level bypass.
  • Generic-first invariant still holds. keycloak is live-warm (persistent infra, almost always up), so there's no per-run dep deploy that could fail and break a generic tier. Generic tiers still assert that the recipe serves on its own; if realm creation fails or the provider is unreachable, only the OIDC custom test fails (isolated). This is safe only if the recipe boots with the OIDC env set even when the provider isn't reachable at boot, which is the property Part B guarantees (lazy OIDC). A and B reinforce each other.
  • Revises plan-sso-dep-testing.md for recipes whose app tolerates OIDC-at-boot: prefer install-time wiring against the warm dep over the post-deploy setup_custom_tests redeploy. Keep the redeploy path only for recipes that genuinely can't take OIDC env at first boot.
  • Verify: lasuite-drive full suite green without a mid-run reconverge, 3× in a row (the bar set in the Definition of done), to show the flakiness is gone rather than absent once.
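
A minimal sketch of steps 1 and 2 above, assuming the harness edits the app's .env as text before the first abra app deploy. The realm naming scheme and the helper names (realm_name, oidc_env, apply_env) are hypothetical. The endpoint paths follow Keycloak's /realms/{realm}/protocol/openid-connect/ layout, and the variable names are the mozilla-django-oidc setting names; whether the lasuite-drive recipe exposes them under exactly these names is an assumption to check against the recipe's .env.sample.

```python
import re


def realm_name(run_id: str, prefix: str = "ccci") -> str:
    """Per-run namespaced realm name (naming scheme is hypothetical)."""
    safe = re.sub(r"[^a-z0-9-]", "-", run_id.lower())
    return f"{prefix}-{safe}"


def oidc_env(keycloak_base: str, realm: str, client_id: str, client_secret: str) -> dict:
    """OIDC settings pointing at one realm of the warm keycloak.

    Variable names are assumed, not taken from the recipe.
    """
    oidc = f"{keycloak_base.rstrip('/')}/realms/{realm}/protocol/openid-connect"
    return {
        "OIDC_OP_AUTHORIZATION_ENDPOINT": f"{oidc}/auth",
        "OIDC_OP_TOKEN_ENDPOINT": f"{oidc}/token",
        "OIDC_OP_USER_ENDPOINT": f"{oidc}/userinfo",
        "OIDC_OP_JWKS_ENDPOINT": f"{oidc}/certs",
        "OIDC_RP_CLIENT_ID": client_id,
        "OIDC_RP_CLIENT_SECRET": client_secret,
    }


_KEY_LINE = re.compile(r"^\s*#?\s*([A-Za-z_][A-Za-z0-9_]*)\s*=")


def apply_env(env_text: str, updates: dict) -> str:
    """Return env_text with every key in updates set exactly once.

    The first existing line for a key (commented out or not) is replaced
    in place, later duplicates of that key are dropped, and keys that were
    not present are appended at the end.
    """
    seen = set()
    out = []
    for line in env_text.splitlines():
        match = _KEY_LINE.match(line)
        key = match.group(1) if match else None
        if key in updates:
            if key not in seen:
                out.append(f"{key}={updates[key]}")
                seen.add(key)
            continue  # drop duplicate definitions of an updated key
        out.append(line)
    out.extend(f"{k}={v}" for k, v in updates.items() if k not in seen)
    return "\n".join(out) + "\n"
```

The harness would create the realm and client first, write the result of apply_env back to the app's .env, and only then run the single abra app deploy. Keeping the env edit a pure text transform means it can be unit-tested without a running stack.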

Part B — lasuite-drive recipe PR (root-cause robustness, we're maintainers)

Broken out into its own sub-plan: plan-lasuite-drive-recipe-pr.md (operator, 2026-05-29) — it now also folds in the F2-12 upgrade-convergence fix (collabora healthcheck) so cc-ci can return to abra-native convergence. The summary below is retained; the sub-plan is authoritative.

Fix the fragility in the recipe itself so it's robust under ANY reconverge — this helps real operators, not just CI, and is the real payoff of maintaining the recipe. On a branch of the lasuite-drive recipe (recipe-maintainer):

  1. Order backend behind collabora's WOPI readiness. Add a Docker healthcheck on collabora (WOPI discovery endpoint → 200) and make backend wait for it / retry WOPI discovery with backoff instead of failing during collabora startup. Turns the race into deterministic ordering.
  2. Fix the gunicorn-perms race. The transient perms error is a volume-permission race at startup (gunicorn can't read/write a mount until perms are set). Fix in the entrypoint: set perms before exec'ing gunicorn (or a proper init step).
  3. Make OIDC discovery lazy / retrying. If backend validates the OIDC provider eagerly at boot and dies when it's briefly unreachable, that's what forces the deploy-with-dep dance. Resolve OIDC discovery lazily (at first login, with retry) so the app boots without the provider up. (This is also the property Part A relies on for the generic-first invariant.)
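
A sketch of what "lazy / retrying" in item 3 could look like, assuming discovery goes through the standard /.well-known/openid-configuration document. It is not the backend's actual code: LazyOidcConfig and fetch_json are hypothetical names, and the attempt count and delays are placeholder values. The same retry-with-backoff shape would apply to the WOPI discovery retry in item 1.

```python
import json
import time
import urllib.request


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


class LazyOidcConfig:
    """Resolve the provider's discovery document on first use, not at boot."""

    def __init__(self, issuer, fetch=fetch_json, attempts=4, base_delay=0.5,
                 sleep=time.sleep):
        self._url = issuer.rstrip("/") + "/.well-known/openid-configuration"
        self._fetch = fetch
        self._attempts = attempts
        self._base_delay = base_delay
        self._sleep = sleep
        self._cached = None  # nothing is fetched in the constructor

    def get(self) -> dict:
        """Return the cached document, fetching it with backoff if needed."""
        if self._cached is not None:
            return self._cached
        last_error = None
        for attempt in range(self._attempts):
            try:
                self._cached = self._fetch(self._url)
                return self._cached
            except OSError as exc:  # urllib's URLError is an OSError subclass
                last_error = exc
                if attempt < self._attempts - 1:
                    # exponential backoff: base, 2*base, 4*base, ...
                    self._sleep(self._base_delay * 2 ** attempt)
        raise RuntimeError(
            f"OIDC provider unreachable after {self._attempts} attempts"
        ) from last_error
```

Constructing the object never touches the network, so the app can boot with the provider down; a failed lookup surfaces at the first login attempt and is retried on the next call because nothing is cached on failure.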

The PR is "working" ONLY when cc-ci verifies it green (operator rule, 2026-05-29)

A lasuite-drive recipe change is not considered done until our CI server has run the full suite on the PR and it passes — install + upgrade + backup + restore + custom (incl. the OIDC login test), repeatedly-green (no flakiness). Concretely: push the recipe branch/PR, trigger cc-ci (!testme on the lasuite-drive PR), and require a clean green (Adversary cold-verified). Only then does the operator merge the recipe PR. This dogfoods cc-ci: the CI that surfaced the bug also gates its fix.

C. If any redeploy must remain (fallback only)

If a recipe genuinely needs OIDC env applied after first boot, do it with abra app deploy (real abra — never docker service update), and gate readiness properly: poll collabora WOPI = 200 + backend health with retries/timeout before declaring ready. This is strictly a fallback; Part A + Part B should remove the need.
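
A minimal sketch of the readiness gate described above, assuming plain HTTP probes from the harness. wait_ready and http_status are hypothetical names and the timeout and interval defaults are placeholders. /hosting/discovery is Collabora's WOPI discovery path; the backend health URL is left to the caller because its path is not specified here.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx responses still carry a status
    except OSError:
        return 0  # connection refused, DNS failure, timeout


def wait_ready(urls, timeout_s=300.0, interval_s=5.0, probe=http_status,
               clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll until every URL returns 200 in the same pass, or time out.

    All URLs are probed on every pass, so a service that answered 200
    earlier and then regressed does not count as ready.
    """
    urls = list(urls)
    deadline = clock() + timeout_s
    while True:
        statuses = [probe(u) for u in urls]
        if all(s == 200 for s in statuses):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would pass the collabora discovery URL (https://<collabora-host>/hosting/discovery) and the backend health URL after abra app deploy returns, and fail the setup step if wait_ready returns False. Requiring both to pass in the same polling pass avoids declaring ready while one service is still flapping.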

Definition of done

  • Step 0 root-cause logs captured + recorded.
  • Part A: lasuite-drive tested via install-time OIDC wiring against the warm keycloak, no mid-run reconverge, full suite green 3× in a row (flakiness gone), Adversary cold-verified.
  • Part B: lasuite-drive recipe branch with the WOPI-healthcheck-gating + gunicorn-perms + lazy-OIDC fixes; cc-ci runs the full suite (incl. upgrade tier) on the PR and it is repeatedly green; Adversary cold-verifies; operator merges.
  • Q3.2 (lasuite-drive) claimed + DEFERRED.md entry closed (flaky-OIDC item) — only after the above.
  • plan-sso-dep-testing.md updated to prefer install-time OIDC wiring (warm dep) where the recipe supports it.