diff --git a/machine-docs/STATUS-2.md b/machine-docs/STATUS-2.md
index 40dce8f..96c9dc1 100644
--- a/machine-docs/STATUS-2.md
+++ b/machine-docs/STATUS-2.md
@@ -49,13 +49,10 @@ tree must carry:
 - **Q5** — Completeness + docs; flip `## DONE`.
 
 ## In flight
-**Q3.2 lasuite-drive — Adversary FAILed (F2-12); fix `e1147b5` validating (NOT re-claimed yet).**
-The Adversary's cold re-run hit the upgrade tier failing: abra's converge monitor FATAs while the
-NEW collabora 25.04.9.4.1 healthcheck is still in start_period (my WOPI pre-gate fixed the OLD
-collabora; the new one's convergence was still abra-impatient → flaky: 3× green for me, 1× fail cold).
-Fix `e1147b5`: upgrade chaos redeploy uses `abra … -c` (no impatient converge monitor) + perform_upgrade
-OWNS a stricter convergence wait (services N/N + app health + collabora WOPI READY_PROBE) bounded by
-DEPLOY_TIMEOUT. Validating across multiple full runs (`/root/ccci-drive-f212-v1.log` …) before re-claim.
+**Q3.2 lasuite-drive — RE-CLAIMED (Gate: Q3.2 below) after the F2-12 fix; awaiting Adversary.**
+F2-12 fixed by `e1147b5` (abra `-c` + own convergence wait + collabora READY_PROBE) + `6506c4a`
+(P7-negative unit tests). 3× repeat-green (v1/v2/v3), upgrade tier passes, deploy-count=1, clean
+teardown. F2-12 stays Adversary-owned (for them to close on cold-verify).
 
 **cryptpad F2-9 — RESOLVED (test landed `05d0dc1`, 3/3 green; Adversary to close).**
 New `tests/cryptpad/playwright/test_pad_content_roundtrip.py` does the §4.3 create-pad → type → FRESH
@@ -118,12 +115,15 @@ SKIP no longer yields a GREEN `!testme`.
 
 ## Gate
 
-**Gate: Q3.2 lasuite-drive — CLAIMED @2026-05-29, awaiting Adversary.**
+**Gate: Q3.2 lasuite-drive — RE-CLAIMED @2026-05-29 (after F2-12 fix), awaiting Adversary.**
+(First claim `911680f` FAILed cold-verify — F2-12: the upgrade chaos redeploy's abra converge monitor
+FATA'd while the NEW collabora 25.04.9.4.1 was still in its healthcheck `start_period`. Fixed by
+`e1147b5`; re-validated 3× green. F2-12 is Adversary-owned — left for the Adversary to close.)
 
 **WHAT.** lasuite-drive (the heaviest Phase-2 stack: 12 services incl. collabora + onlyoffice +
 minio/S3 + postgres, OIDC-dependent) now runs its **full lifecycle GREEN, repeatably** — install +
 upgrade (prev→PR-head chaos crossover) + backup + restore + custom (health + MinIO round-trip + OIDC
-password-grant), via **two fixes**:
+password-grant), via **three fixes**:
 1. **Install-time OIDC wiring** (commit `a151489`) — the orchestrator provisions the per-run realm on
    the live-warm keycloak BEFORE the single `abra app deploy`, and `tests/lasuite-drive/install_steps.sh`
    writes the OIDC env + client secret into that one deploy. This **eliminates the flaky post-deploy
@@ -131,36 +131,47 @@ password-grant), via **two fixes**:
    race; JOURNAL Step 0). New per-recipe `OIDC_AT_INSTALL` meta flag + reusable `_provision_deps()`
    helper; legacy post-deploy path unchanged for all other dep recipes (gated on `not oidc_at_install`).
 2. **collabora-ready upgrade gate + DEPLOY_TIMEOUT plumbing** (commit `4b38b66`) — `ops.py::pre_upgrade`
-   waits for collabora WOPI discovery (`/hosting/discovery` on `collabora-`) → 200 BEFORE the
-   chaos redeploy, so it no longer SIGTERMs a still-booting collabora (which caused exit 70 / "FATA
-   deploy failed" in run 1); `DEPLOY_TIMEOUT` now threads to the upgrade `chaos_redeploy` (was abra's
-   900s default vs the .env internal TIMEOUT 1500s).
+   waits for collabora WOPI discovery → 200 BEFORE the chaos redeploy, so it no longer SIGTERMs a
+   still-booting OLD collabora; `DEPLOY_TIMEOUT` threads to the upgrade `chaos_redeploy`.
+3. **F2-12 fix — own the upgrade convergence verification** (commit `e1147b5`). The upgrade chaos
+   redeploy now runs `abra … -c` (`--no-converge-checks`): abra's own post-deploy monitor — which
+   FATA'd while the NEW collabora 25.04.9.4.1's healthcheck was still in `start_period` (jail/config
+   init) — is dropped. `docker stack deploy` still applies the spec; `generic.perform_upgrade` then
+   OWNS a **stricter** verification with a generous (DEPLOY_TIMEOUT) deadline: `lifecycle.wait_healthy`
+   (every swarm service N/N + app HEALTH_PATH 200) **then** `lifecycle.wait_ready_probes`
+   (recipe `READY_PROBE` → collabora WOPI `/hosting/discovery` 200). The new collabora converges
+   through swarm's healthcheck retries; HC1 (chaos-version == PR-head) + deploy-count=1 preserved.
+   **Non-vacuous (P7-negative) PROVEN** by `tests/unit/test_f212_upgrade_convergence.py` (commit
+   `6506c4a`, 5 tests): `wait_ready_probes`/`wait_healthy` RAISE `TimeoutError` on a stuck/never-serving
+   convergence — so a genuinely broken upgrade stays RED; this is not green-washing abra's skipped check.
 
 **HOW (Adversary, cold, on cc-ci):**
 ```
 ssh cc-ci 'cd /root/ && git pull && RECIPE=lasuite-drive PR=0 cc-ci-run runner/run_recipe_ci.py'
+ssh cc-ci 'cd /root/ && cc-ci-run -m pytest tests/unit/test_f212_upgrade_convergence.py -q'
 ```
 
 **EXPECTED:**
 - RUN SUMMARY: `deploy-count = 1 (expect 1)`; `install/upgrade/backup/restore/custom` **all `pass`**.
-- `tests/lasuite-drive/functional/test_oidc_with_keycloak.py::test_oidc_password_grant_against_dep_keycloak`
-  **PASSED** (NOT skipped) — real password-grant JWT against a per-run realm on warm keycloak.
-- `test_minio_storage` PASSED (real S3 upload→list→cat readback round-trip inside the minio container).
+- `test_oidc_password_grant_against_dep_keycloak` **PASSED** (NOT skipped) — real password-grant JWT.
+- `test_minio_storage` PASSED (real S3 upload→list→cat readback inside the minio container).
 - Data-integrity: `test_upgrade_preserves_data` (ci_marker survives prev→PR-head chaos crossover) +
   backup/restore ci_marker survive.
 - Log shows `install-time OIDC: deps provisioned` + `install_steps: OIDC env wired` (no post-deploy
-  reconverge) and `pre_upgrade: collabora WOPI discovery ready (200)` before the upgrade redeploy.
+  reconverge) and **`ready-probe OK (200)` TWICE** (post-install + post-upgrade, collabora WOPI).
 - Clean teardown: post-run `docker stack ls | grep lasu` and `docker volume ls | grep lasu` both empty.
+- Unit: **5 passed** in `tests/unit/test_f212_upgrade_convergence.py` (the P7-negative proof).
 
-**WHERE.** Commits `a151489` (Part A) + `4b38b66` (upgrade gate). Files: `runner/run_recipe_ci.py`
-(`_provision_deps`, `OIDC_AT_INSTALL` branch, `_perform_op` timeout), `runner/harness/lifecycle.py`
-(`chaos_redeploy` timeout), `runner/harness/generic.py` (`perform_upgrade` timeout),
-`tests/lasuite-drive/{install_steps.sh,setup_custom_tests.sh,ops.py,recipe_meta.py}`.
-**3× repeat-green** (flakiness gone, not absent-once): `/root/ccci-drive-q32a-r2.log`,
-`…-r3.log`, `…-r4.log` — each full-suite green, deploy-count=1, OIDC PASSED, clean teardown
-(run 1 `…-r1.log` showed the upgrade-tier failure that `4b38b66` fixed). Step-0 root-cause logs in
-JOURNAL-2 (2026-05-29). DEFERRED.md disk-blocker entry CLOSED (host grew to 64G); flaky-OIDC
-BACKLOG-2 Q3.2a item now resolved.
+**WHERE.** Commits `a151489` (Part A) + `4b38b66` (upgrade gate) + `e1147b5` (F2-12 own-convergence) +
+`6506c4a` (P7-negative unit tests). Files: `runner/run_recipe_ci.py` (`_provision_deps`,
+`OIDC_AT_INSTALL` branch, `_perform_op` meta+timeout, post-install `wait_ready_probes`),
+`runner/harness/abra.py` (`deploy(no_converge_checks)`), `runner/harness/lifecycle.py`
+(`chaos_redeploy(no_converge_checks)`, `wait_ready_probes`), `runner/harness/generic.py`
+(`perform_upgrade` own-wait), `tests/lasuite-drive/{install_steps.sh,setup_custom_tests.sh,ops.py,recipe_meta.py}`
+(`READY_PROBE`), `tests/unit/test_f212_upgrade_convergence.py`.
+**3× repeat-green of the F2-12 fix** (flakiness gone, not absent-once): `/root/ccci-drive-f212-v1.log`,
+`…-v2.log`, `…-v3.log` — each full-suite green, deploy-count=1, OIDC PASSED, ready-probe OK twice,
+clean teardown. Step-0 root-cause logs in JOURNAL-2. DEFERRED.md disk-blocker CLOSED (host 64G).
 
 ---
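
Reviewer note on the F2-12 claim: the gate says the harness now owns a deadline-bounded readiness wait that raises `TimeoutError` when the probe never reaches 200, so a stuck upgrade stays RED. The sketch below illustrates that pattern only. It is not the code in `runner/harness/lifecycle.py`; `wait_ready_probe`, `FakeClock`, and every parameter name are hypothetical, and the probe is an injected callable so the example runs without a network or a swarm.

```python
import time
from typing import Callable, Union


class FakeClock:
    """Hypothetical test helper: a clock that only advances when 'slept'."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def wait_ready_probe(
    probe: Callable[[], int],
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll `probe` until it returns 200 or the deadline passes.

    Returns the number of attempts made. Raises TimeoutError if the probe
    never returns 200 before the deadline, so a stuck service fails loudly.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        last: Union[int, str]
        try:
            last = probe()
        except OSError as exc:
            # Connection refused or similar: treat as "not ready yet".
            last = f"error: {exc}"
        if last == 200:
            return attempts
        if clock() >= deadline:
            raise TimeoutError(
                f"ready probe not 200 after {attempts} attempts (last: {last})"
            )
        sleep(interval_s)


if __name__ == "__main__":
    # Positive case: the probe starts serving on the third attempt.
    fc = FakeClock()
    codes = iter([503, 503, 200])
    print(wait_ready_probe(lambda: next(codes), 10, 1, fc.time, fc.sleep))

    # Negative case: the probe never serves, so the wait must raise.
    fc = FakeClock()
    try:
        wait_ready_probe(lambda: 503, 3, 1, fc.time, fc.sleep)
    except TimeoutError as exc:
        print("RED as expected:", exc)
```

Injecting the clock and sleep functions lets the negative case run instantly, which is presumably what makes a "never-serving" unit test practical. Whether the real `wait_ready_probes` is structured this way is not stated in the diff.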