review(2): pre-claim recon F2-12 fix e1147b5 — abra -c skips converge monitor BUT harness owns stricter wait_healthy(N/N all svcs)+READY_PROBE(collabora 200, raises on timeout); plausibly not-a-weakening, MUST cold-verify upgrade-GREEN + P7-negative at re-claim; NO verdict yet

2026-05-29 12:21:30 +01:00
parent cc4af49c99
commit f7c5681cd0
1 changed files with 24 additions and 0 deletions
--- a/machine-docs/REVIEW-2.md
+++ b/machine-docs/REVIEW-2.md
@ -879,3 +879,27 @@ TIMEOUT/`DEPLOY_TIMEOUT` covers the python subprocess, but abra's own per-servic
 what emitted `FATA deploy failed`), and/or a post-redeploy collabora-health wait before asserting
 reconverge. Anti-anchoring honored: verdict formed from the plan + code + my own run's observable log;
 I did NOT read JOURNAL-2 before writing this.
 ## @2026-05-29 — Pre-claim recon: F2-12 fix e1147b5 (NOT re-claimed yet — no verdict)
 Builder ACKed F2-12 and pushed fix `e1147b5` ("own convergence wait via abra `-c` + collabora
 READY_PROBE"), status `cc4af49` = validating multi-run before RE-CLAIM. Read the fix ahead of the
 re-claim. **The adversarial crux: the upgrade redeploy now passes `abra … -c` (`--no-converge-checks`),
 which skips abra's own convergence monitor.** Skipping a convergence check is exactly the shape of a
 P7 weakening — so I scrutinized whether the replacement is genuinely stronger or a green-washing.
 - **Plausibly NOT a weakening (pending cold proof):** `-c` only skips abra's *post-deploy monitor*;
  `docker stack deploy` (the real spec apply) still runs. The harness then owns the verification in
  `generic.perform_upgrade`: `lifecycle.wait_healthy` (= `_wait_services_converged` "every swarm
  service shows running == configured replicas" + HEALTH_PATH) **then** `lifecycle.wait_ready_probes`
  (collabora `/hosting/discovery` → 200), bounded by the generous recipe DEPLOY_TIMEOUT. The READY_PROBE
  loop **raises TimeoutError** if discovery never hits 200 (while/else) → upgrade op fails → tier fails,
  so it's non-vacuous by construction. HC1 (chaos-version label == PR-head) preserved; chaos_redeploy
  still bypasses deploy_app so deploy-count stays 1.
 - **MUST cold-verify at re-claim (cannot fully settle by reading):**
  1. **Upgrade tier GREEN on MY own cold run** — the F2-12 close condition (repeat-green, not one-off;
     Builder admits it was 3×green/1×fail before this fix).
  2. **P7 negative:** confirm `_wait_services_converged` truly fails on a stuck `0/1` service (i.e. `-c`
     + owned-wait catches a genuinely broken converge, not just a slow one). I started reading its
     parser (lifecycle.py ~286–328) — finish that read + ideally observe a broken-upgrade-still-RED.
  3. deploy-count == 1; clean teardown.
 F2-12 stays OPEN (Adversary-owned). NO verdict until Q3.2 is re-claimed. Anti-anchoring: not reading
 JOURNAL before the verdict.