feat(2): lasuite-drive Q3.2a Part A — wire OIDC at INSTALL, eliminate flaky redeploy

Q3.2a / plan-lasuite-drive-oidc-robustness.md Part A. The old setup_custom_tests.sh did a
post-deploy in-place `abra app deploy --force --chaos` of the heavy 12-service stack to apply
the OIDC env — flaky (collabora WOPI-discovery race + gunicorn-perms; JOURNAL Step 0). Since
the OIDC env only affects backend/app and keycloak is live-warm, provision the per-run realm
BEFORE the single deploy and wire OIDC into the .env at install time (no reconverge).

- runner/run_recipe_ci.py: new _provision_deps() helper (warm/cold split + SSO enrich + write
  $CCCI_DEPS_FILE), used by both paths. New per-recipe OIDC_AT_INSTALL meta flag (added to
  _load_meta whitelist). When set + deps live-warm: provision BEFORE deploy_app; the install
  tier's install_steps.sh wires OIDC into the single deploy; post-deploy step runs only the
  MinIO bucket one-shot — no re-provision, no redeploy. Legacy post-deploy path unchanged for
  all other dep recipes (gated on `not oidc_at_install`).
- tests/lasuite-drive/install_steps.sh (NEW): install-time OIDC env + secret wiring; no-ops on
  empty deps file (recipe still boots, OIDC test skips → F2-11 RED).
- tests/lasuite-drive/setup_custom_tests.sh: trimmed to MinIO-bucket-only (OIDC moved out).
- tests/lasuite-drive/recipe_meta.py: OIDC_AT_INSTALL = True.
- JOURNAL-2: Step-0 root-cause failure logs captured before the fix.

NOT a claim — validating 3x green (incl. now-required upgrade tier) before claiming Q3.2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 10:10:05 +01:00
parent 4356f0009c
commit a151489996
5 changed files with 210 additions and 117 deletions


@@ -796,3 +796,33 @@ LIFTING). After cc-ci is healthy I can:
3. Resume broad heavy-recipe coverage (immich, lasuite-meet) with real disk headroom.
Note: with 70GB, I can also be less aggressive about teardown/prune churn between heavy runs.
---
## 2026-05-29 — lasuite-drive Q3.2a Step 0: root-cause failure logs captured (BEFORE any fix)
Resuming Q3.2a (plan-lasuite-drive-oidc-robustness.md) after Phase 2pc DONE. The Adversary's
cold-verify criterion #1 requires real captured failure logs before any fix. Captured from the
flaky run-4 deploy (`/root/.abra/logs/default/lasu-288dfd...2026-05-29T062401Z`, the
`abra app deploy --force --chaos` OIDC-setup redeploy that exited 1 / "FATA deploy failed"):
1. **gunicorn perms race** — `backend [1] [ERROR] Control server error: [Errno 13] Permission
denied: '/.gunicorn'`. gunicorn tries to create its control-server temp dir under HOME=`/`
(not writable). (Part B fix: set perms / writable HOME in entrypoint before exec gunicorn.)
2. **WOPI-discovery race** — `celery RuntimeError: status code 404 return by discovery url for
wopi client collabora is invalid` at `/app/wopi/tasks/configure_wopi.py:53`. The celery
`configure_wopi_clients` task hits collabora's discovery URL at boot (06:21:54) while collabora
is still caching its 132+ l10n files (finishes ~06:24) → 404 → task raises. (Part B fix:
collabora WOPI healthcheck gating + backend retry/backoff on discovery.)
3. **transient db-not-ready** — `db FATAL: database "drive" does not exist` + celery
`Could not connect to database: failed to resolve host 'db'` — early-boot DNS/init races that
self-heal; harmless on a fresh deploy with the full TIMEOUT window.
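A minimal sketch of the retry/backoff idea named for cause 2, not the recipe's actual code: `wait_for_discovery`, its parameters, and the injected `fetch_status` callable are hypothetical names chosen for illustration. It assumes the caller supplies a function that returns the HTTP status of collabora's discovery URL.

```python
import time
from typing import Callable


def wait_for_discovery(
    fetch_status: Callable[[], int],
    attempts: int = 6,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until fetch_status() returns 200; return the attempt number that succeeded.

    HYPOTHETICAL helper: fetch_status stands in for an HTTP GET on the
    discovery URL. The delay doubles after every failed attempt.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        if fetch_status() == 200:
            return attempt
        if attempt < attempts:
            # Back off before the next try; skip the sleep after the last failure.
            sleep(delay)
            delay *= 2
    raise RuntimeError(f"discovery still failing after {attempts} attempts")
```

With the defaults this tolerates roughly 31 seconds of 404s (1 + 2 + 4 + 8 + 16) before raising, so a real fix would need `attempts` or `base_delay` tuned to cover the multi-minute collabora warm-up seen in the log above.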
**Key observation that shapes the fix:** the FIRST install deploy converges reliably **every** run
(install: pass in runs 1–4, incl. run 4). Only the post-install in-place `--force --chaos` redeploy
(applied to push the OIDC env) is flaky. The OIDC env touches ONLY backend/app — re-converging
collabora/onlyoffice/minio is unnecessary exposure. → **Part A: wire OIDC into the .env at INSTALL
time (between `abra app new` and the single `abra app deploy`) so the recipe deploys ONCE with OIDC
already set; no post-deploy reconverge.** keycloak is live-warm (always up), so the per-run realm is
a lightweight API call provisioned before the single deploy. Part B (recipe robustness PR) remains
the deeper fix so ANY reconverge (incl. the upgrade-tier prev→PR-head crossover) is race-free.
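The Part A ordering can be sketched as a small planner. This is an illustration under stated assumptions, not the code in `runner/run_recipe_ci.py`: `plan_steps` and the step labels are hypothetical, and the branch for "flag set but deps not live-warm" is a guess based on install_steps.sh no-opping on an empty deps file.

```python
from typing import List


def plan_steps(oidc_at_install: bool, deps_live_warm: bool) -> List[str]:
    """Return the ordered step labels for one CI run (HYPOTHETICAL labels)."""
    if oidc_at_install and deps_live_warm:
        # Part A: provision the per-run realm first, wire OIDC into the .env,
        # then deploy exactly once; post-deploy does only the MinIO bucket.
        return ["provision_deps", "install_steps", "deploy_app", "minio_bucket"]
    if oidc_at_install:
        # ASSUMPTION: with no warm deps the deps file is empty, so
        # install_steps no-ops and the recipe still boots without OIDC.
        return ["install_steps", "deploy_app", "minio_bucket"]
    # Legacy path for all other dep recipes: provision after the deploy.
    return ["deploy_app", "provision_deps", "setup_custom_tests"]
```

The property the sketch is meant to show is that the install-time path contains a single `deploy_app` and no redeploy step, which is what removes the exposure to the races listed above.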