status(2): lasuite-drive Q3.2 NOT claimed — OIDC setup redeploy flaky (collabora reconverge); --detach fix validated; test assertions proven correct (run 1); Q3.2a robustness item added; prune-during-deploy lesson recorded

2026-05-29 07:27:50 +01:00
parent 75ae226c0d
commit 426a953c2b
2 changed files with 47 additions and 2 deletions
--- a/machine-docs/JOURNAL-2.md
+++ b/machine-docs/JOURNAL-2.md
@ -678,3 +678,39 @@ lasuite-drive's actual Q3.2 CONTENT works: parity health, the real MinIO S3 uplo
 round-trip, and the OIDC password-grant + JWT-claims flow against the dep keycloak. Per §7.1 the
 maximal subset is implemented and only the genuinely-disk-blocked upgrade tier is outstanding —
 pending Adversary sign-off on the env-blocker.
+
+---
+
+## 2026-05-29 — lasuite-drive: --detach fix validated, but OIDC setup redeploy is FLAKY (NOT claiming Q3.2 yet)
+
+Ran lasuite-drive maximal subset (install,backup,restore,custom) four times today:
+- **Run 1** (`ccci-drive-subset.log`): all tiers + all 3 functional GREEN (health, MinIO round-trip,
+  OIDC JWT) — but required a manual kill of the hung `docker service scale` (the bug I then fixed with
+  `--detach`, commit `f1c626c`). So the test ASSERTIONS are all correct and CAN pass.
+- **Runs 2 & 3** (`-clean`, `-clean2`): corrupted by MY OWN over-eager `docker image prune -f` mid-deploy
+  — it removed the just-pulled, not-yet-attached digest-pinned images (drive-frontend, onlyoffice),
+  so swarm rejected with "No such image" and install failed/timed out. **LESSON: never
+  `docker image prune` during an active deploy — mid-pull images look dangling and get removed.**
+  Confirmed self-inflicted: `docker pull lasuite/drive-frontend@sha256:eeef…` succeeded (image is on
+  hub), and after seeding it the stack converged. Not a recipe/test issue.
+- **Run 4** (`-clean3`, warm images, hands-off, fixed `--detach`): install/backup/restore all PASS,
+  health + MinIO PASS, **but the OIDC test SKIPPED because `setup_custom_tests.sh` exited 1** — its
+  step-3 in-place `abra app deploy --force --chaos` (applies the OIDC env) FAILED to converge
+  ("FATA deploy failed"; abra log shows backend `Permission denied: /.gunicorn` + celery
+  `configure_wopi: 404 from collabora discovery url`). Per F2-11 the run correctly went RED (no false
+  green) — `custom: pass (1 requires_deps SKIPPED — SSO UNVERIFIED)`, overall=1. The `--detach` fix
+  itself works (bucket scale returned, secret inserted v2); the failure is the full-stack redeploy.
+
+**Root finding: the OIDC-wiring step (a full 12-service in-place `--chaos` redeploy) is FLAKY on this
+heaviest stack** — collabora's reconverge race + a transient backend gunicorn-perms/WOPI-404 window
+mean the redeploy succeeds only sometimes (run 1 yes, run 4 no). The OIDC env change only affects
+backend/app, so re-converging collabora/onlyoffice is unnecessary exposure. Fix direction (BACKLOG):
+wire OIDC at INSTALL time (no post-deploy redeploy — like lasuite-docs install_steps), or make the
+setup redeploy resilient (retry / wait for collabora WOPI discovery 200 before declaring ready).
+
+**Decision:** NOT claiming Q3.2 — a flaky OIDC setup is not a reliable green, and claiming would risk
+an Adversary cold-verify FAIL. lasuite-drive stays [~]: test content proven correct (run 1), `--detach`
+bug fixed, two open issues (disk-blocker on upgrade tier [DEFERRED/operator]; flaky OIDC redeploy
+[BACKLOG, needs robustness work]). **Pivoting to lighter recipes for broad Phase-2 progress**;
+lasuite-drive's OIDC robustness + upgrade-disk return later. Host left clean (all stacks torn down,
+disk 65%, infra healthy).