diff --git a/machine-docs/BACKLOG-2.md b/machine-docs/BACKLOG-2.md index 11b942d..4b8df52 100644 --- a/machine-docs/BACKLOG-2.md +++ b/machine-docs/BACKLOG-2.md @@ -75,8 +75,17 @@ Phase plan: `/srv/cc-ci/cc-ci-plan/plan-phase2-recipe-tests.md` conditional/deferred sign-off (NOT a §7.1 waiver): upgrade tier stays a veto-eligible open item, must run green + cold-verified before Phase-2 DONE once disk grows. Bug fixed en route: `fix(2)` `f1c626c` — setup_custom_tests `docker service scale --detach` (the run-once minio-createbuckets - job made a blocking scale hang the custom tier). Clean unassisted re-run with the fix in flight - before the formal Q3.2 claim. + job made a blocking scale hang the custom tier). **NOT CLAIMED — OIDC setup is FLAKY:** the + step-3 in-place full-stack `abra app deploy --force --chaos` (applies OIDC env) only converges + sometimes on this heaviest 12-service stack (run 1 OK → OIDC PASS; run 4 FAIL → OIDC SKIP → F2-11 + RED). Test assertions are all correct (run 1 proved health+MinIO+OIDC green); the flakiness is in + the redeploy infra. **Two open issues block a reliable Q3.2 green:** (a) [Q3.2a] flaky OIDC + redeploy — see below; (b) upgrade tier disk-blocker (DEFERRED/operator). See JOURNAL-2 2026-05-29. +- [ ] **Q3.2a** — Make lasuite-drive OIDC wiring reliable. The full 12-service `--chaos` redeploy to + apply OIDC env exposes collabora's flaky reconverge (+ transient backend gunicorn-perms / celery + WOPI-404). Fix direction: wire OIDC at INSTALL time (install_steps, no post-deploy redeploy — the + lasuite-docs model) OR make setup_custom_tests redeploy resilient (retry + wait for collabora WOPI + discovery 200 before ready). Then re-run subset to a reliable green before claiming Q3.2. - [ ] **Q3.3** — lasuite-meet: parity (health_check, oidc_login, meeting_flow, webrtc-media, webrtc-relay) + specific (create-a-room, two-user LiveKit token issuance, ICE-candidate gathering). - [~] **Q3.4** — cryptpad: parity port (health_check) ✓ + 2 NEW recipe-specific diff --git a/machine-docs/JOURNAL-2.md b/machine-docs/JOURNAL-2.md index 81e4c44..8907d94 100644 --- a/machine-docs/JOURNAL-2.md +++ b/machine-docs/JOURNAL-2.md @@ -678,3 +678,39 @@ lasuite-drive's actual Q3.2 CONTENT works: parity health, the real MinIO S3 uplo round-trip, and the OIDC password-grant + JWT-claims flow against the dep keycloak. Per §7.1 the maximal subset is implemented and only the genuinely-disk-blocked upgrade tier is outstanding — pending Adversary sign-off on the env-blocker. + +--- + +## 2026-05-29 — lasuite-drive: --detach fix validated, but OIDC setup redeploy is FLAKY (NOT claiming Q3.2 yet) + +Ran lasuite-drive maximal subset (install,backup,restore,custom) four times today: +- **Run 1** (`ccci-drive-subset.log`): all tiers + all 3 functional GREEN (health, MinIO round-trip, + OIDC JWT) — but required a manual kill of the hung `docker service scale` (the bug I then fixed with + `--detach`, commit `f1c626c`). So the test ASSERTIONS are all correct and CAN pass. +- **Runs 2 & 3** (`-clean`, `-clean2`): corrupted by MY OWN over-eager `docker image prune -f` mid-deploy + — it removed the just-pulled, not-yet-attached digest-pinned images (drive-frontend, onlyoffice), + so swarm rejected with "No such image" and install failed/timed out. **LESSON: never + `docker image prune` during an active deploy — mid-pull images look dangling and get removed.** + Confirmed self-inflicted: `docker pull lasuite/drive-frontend@sha256:eeef…` succeeded (image is on + hub), and after seeding it the stack converged. Not a recipe/test issue. +- **Run 4** (`-clean3`, warm images, hands-off, fixed `--detach`): install/backup/restore all PASS, + health + MinIO PASS, **but the OIDC test SKIPPED because `setup_custom_tests.sh` exited 1** — its + step-3 in-place `abra app deploy --force --chaos` (applies the OIDC env) FAILED to converge + ("FATA deploy failed"; abra log shows backend `Permission denied: /.gunicorn` + celery + `configure_wopi: 404 from collabora discovery url`). Per F2-11 the run correctly went RED (no false + green) — `custom: pass (1 requires_deps SKIPPED — SSO UNVERIFIED)`, overall=1. The `--detach` fix + itself works (bucket scale returned, secret inserted v2); the failure is the full-stack redeploy. + +**Root finding: the OIDC-wiring step (a full 12-service in-place `--chaos` redeploy) is FLAKY on this +heaviest stack** — collabora's reconverge race + a transient backend gunicorn-perms/WOPI-404 window +mean the redeploy succeeds only sometimes (run 1 yes, run 4 no). The OIDC env change only affects +backend/app, so re-converging collabora/onlyoffice is unnecessary exposure. Fix direction (BACKLOG): +wire OIDC at INSTALL time (no post-deploy redeploy — like lasuite-docs install_steps), or make the +setup redeploy resilient (retry / wait for collabora WOPI discovery 200 before declaring ready). + +**Decision:** NOT claiming Q3.2 — a flaky OIDC setup is not a reliable green, and claiming would risk +an Adversary cold-verify FAIL. lasuite-drive stays [~]: test content proven correct (run 1), `--detach` +bug fixed, two open issues (disk-blocker on upgrade tier [DEFERRED/operator]; flaky OIDC redeploy +[BACKLOG, needs robustness work]). **Pivoting to lighter recipes for broad Phase-2 progress**; +lasuite-drive's OIDC robustness + upgrade-disk return later. Host left clean (all stacks torn down, +disk 65%, infra healthy).