review(2): FAIL gate Q3.2 lasuite-drive (claim 911680f/code 4b38b66) — cold re-run upgrade tier FAILS (abra chaos-deploy FATA: new collabora 25.04.9.4.1 not converged; WOPI pre-gate DID work). install/backup/restore/custom+OIDC pass, deploy-count=1, teardown clean. Filed F2-12 BLOCKING

2026-05-29 11:47:38 +01:00
parent 05d0dc14eb
commit aab77ea0f3
2 changed files with 80 additions and 0 deletions
--- a/machine-docs/BACKLOG-2.md
+++ b/machine-docs/BACKLOG-2.md
@ -536,3 +536,25 @@ Phase plan: `/srv/cc-ci/cc-ci-plan/plan-phase2-recipe-tests.md`
      gate-blocker, but Q0 cannot be considered "complete" in the broad sense of the §6 enumeration
      until those primitives ship in Q2/Q3. Recording so a future Q2/Q3 verdict checks them off.
      - Filed by Adversary @2026-05-28.
+
+- [ ] **F2-12 [adversary] — BLOCKS Q3.2 gate** — lasuite-drive **upgrade tier FAILS on cold re-run**,
+      contradicting the claim "full lifecycle 3× green". Cold-verified @2026-05-29 from `/root/adv-verify`
+      @ origin/main `911680f` (code `4b38b66`, git==host). `RECIPE=lasuite-drive PR=0 cc-ci-run
+      runner/run_recipe_ci.py` → RUN SUMMARY: install/backup/restore/custom **pass**, **upgrade FAIL**,
+      deploy-count=1.
+      - **Repro:** the prev→PR-head chaos upgrade redeploy does not converge —
+        `!! upgrade op failed: abra app deploy lasu-<hex>… failed (1)` → `FATA deploy failed 🛑`
+        (abra log `/root/.abra/logs/default/lasu-…2026-05-29T103335Z`). Heavy crossover: collabora/code
+        25.04.9.1.1→25.04.9.4.1, drive-backend/-frontend v0.12.0→v0.18.0, onlyoffice 9.2→9.3.1.2.
+        The NEW collabora is still in jail/config init (`Kit core version…`, many `Linking file…`,
+        `etc/* needs to be updated`) when abra's convergence poll gives up.
+      - **NOT the WOPI pre-gate** — that fix worked: `pre_upgrade: collabora WOPI discovery ready (200)`.
+        The gap is NEW-collabora convergence within abra's upgrade poll window, not OLD-collabora readiness.
+      - **Repro steps:** `RECIPE=lasuite-drive PR=0 cc-ci-run runner/run_recipe_ci.py`; observe upgrade fail.
+      - **Likely fix direction (Builder's call):** raise the abra per-service convergence timeout for the
+        upgrade redeploy (recipe-internal TIMEOUT/`DEPLOY_TIMEOUT` covers the python subprocess, but abra's
+        own poll emitted FATA), and/or wait for new-collabora health before asserting reconverge.
+      - **Close condition (Adversary-owned):** upgrade tier GREEN on **my** cold re-run (repeat-green),
+        per my standing veto-eligible obligation (disk lifted; deferral void). Full verdict: REVIEW-2.md
+        "## Q3.2 lasuite-drive — FAIL @2026-05-29".
+      - Filed by Adversary @2026-05-29.
--- a/machine-docs/REVIEW-2.md
+++ b/machine-docs/REVIEW-2.md
@ -821,3 +821,61 @@ ahead of the claim so my verdict is instant. Findings to carry into the gate (re
 (install = 1 deploy, no mid-run reconverge), the now-REQUIRED **upgrade tier GREEN** (disk lifted),
 repeat-green + my own cold re-run reading the assertions. This note is recon only — NO PASS/FAIL until
 the Builder claims the gate.
+
+## Q3.2 lasuite-drive — FAIL @2026-05-29 (cold-verify; gate claim 911680f / code 4b38b66)
+Cold-verified from my own clone `/root/adv-verify` synced to origin/main `911680f` (claim commit is
+**docs-only** — BACKLOG-2/DEFERRED/STATUS-2; verified *code* == `4b38b66`. git==host confirmed:
+Builder `/root/builder-clone` @ 4b38b66, deploy tree clean). Ran `RECIPE=lasuite-drive PR=0 cc-ci-run
+runner/run_recipe_ci.py` from /root/adv-verify (log `/root/adv-q32-102348.log`).
+
+**Result — RUN SUMMARY (verbatim):**
+```
+deploy-count = 1 (expect 1)
+  install : pass
+  upgrade : fail        <-- FAILS the gate (claim said full lifecycle 3x green)
+  backup  : pass
+  restore : pass
+  custom  : pass
+```
+
+**Root cause (from the actual log + abra deploy log — NOT the WOPI gate):** the collabora WOPI-discovery
+pre-upgrade gate **worked** — log line 43: `pre_upgrade: collabora WOPI discovery ready (200) on
+collabora-lasu-cbcdd6.ci.commoninternet.net`. The failure is the **chaos upgrade deploy itself not
+converging**: line 44 `!! upgrade op failed: abra app deploy lasu-cbcdd6.ci.commoninternet.net -o -n -C
+failed (1)` → `INFO polling deployment status` → `FATA deploy failed 🛑`
+(abra log `/root/.abra/logs/default/lasu-cbcdd6...2026-05-29T103335Z`). This was a real prev→PR-head
+crossover with heavy image bumps — collabora/code 25.04.9.1.1→**25.04.9.4.1**, drive-backend
+v0.12.0→**v0.18.0**, drive-frontend v0.12.0→**v0.18.0**, onlyoffice 9.2→**9.3.1.2**, nginx 1.29→1.30,
+redis 8→8.6.3. The abra deploy log shows the NEW collabora still doing lengthy jail/config init
+(`Kit core version …`, hundreds of `Linking file …` lines, `child-roots/.../etc/* needs to be updated`)
+when abra's convergence poll gave up. So the upgrade redeploy timed out waiting for the new collabora
+to become healthy, not the pre-deploy gate.
+
+**Why FAIL, not a flake-to-retry:**
+- The claim is **"flakiness gone, full lifecycle 3× green"** (r2/r3/r4). My **first independent cold
+  run** does NOT reproduce green — the upgrade tier fails. That contradicts "reproducibly green."
+- Upgrade-tier GREEN is my **standing veto-eligible obligation** (disk lifted; deferral void). My
+  stated criteria required **repeat-green + my own cold re-run** of the upgrade tier. It failed on my run.
+- The new-collabora-convergence timeout is the *same class* of collabora-timing problem `4b38b66` set
+  out to fix; the WOPI pre-gate addresses readiness of the OLD collabora before redeploy, but does not
+  ensure the NEW collabora (heavier 25.04.9.4.1) converges within abra's upgrade poll window. The fix
+  is incomplete for the crossover it claims to make green.
+
+**What DID verify (fix is partial, not worthless):**
+- **Part A install-time OIDC — GREEN & real.** `deploy-count = 1` (single deploy, no post-deploy
+  `--chaos` reconverge); log: `using live-warm keycloak … per-run realm`, `install_steps: OIDC env wired
+  into .env (… no reconverge)`; `test_oidc_password_grant_against_dep_keycloak` **PASSED, not skipped**
+  (real password-grant JWT vs a per-run realm). **Real-abra-only confirmed** — no `docker service
+  update/scale` patching of app state (the lone `service scale …minio-createbuckets` triggers the
+  recipe's own `replicas:0` one-shot; established acceptable in my pre-claim recon).
+- **install + backup + restore + custom all pass**; `test_minio_storage` (S3 round-trip) PASSED.
+- **Teardown sacred:** post-run NO `lasu` stacks, NO per-run `lasu` volumes; warm-keycloak + warm
+  custom-html canonical volumes intact (prune/teardown didn't touch the cache).
+
+**FILED: F2-12 [adversary] (BLOCKS the Q3.2 gate).** No phase `## VETO`. Q3.2 cannot PASS until the
+**upgrade tier runs GREEN on my own cold re-run** (repeat-green). Likely real fixes for the Builder to
+consider: raise the abra upgrade convergence timeout for the new-collabora crossover (the recipe-internal
+TIMEOUT/`DEPLOY_TIMEOUT` covers the python subprocess, but abra's own per-service convergence poll is
+what emitted `FATA deploy failed`), and/or a post-redeploy collabora-health wait before asserting
+reconverge. Anti-anchoring honored: verdict formed from the plan + code + my own run's observable log;
+I did NOT read JOURNAL-2 before writing this.