diff --git a/JOURNAL-rcust.md b/JOURNAL-rcust.md
index ebf95d3..bffcafa 100644
--- a/JOURNAL-rcust.md
+++ b/JOURNAL-rcust.md
@@ -198,3 +198,33 @@ main serial re-run, AND old main @ old default head. The earlier "deploy timed o
 concurrent image pulls" guess in STATUS was wrong (the 600s timeout was the SYMPTOM; the ~2min A/B
 failure exposed the crash-loop). Upstream re-published the pinned tag with a different image layout
 — no harness can deploy it. Filed in STATUS as restructure-neutral with grep-able evidence.
+
+## 2026-06-11 lasuite-drive root cause #2 — completed one-shot poisons convergence (caught live)
+
+Watching the m2p proof run instead of just waiting paid off: the fix-forward's best-effort line
+printed (so #1 is fixed), but the install assert then sat in pytest for 25+ minutes. Live state:
+app serving 200, every service 1/1 EXCEPT minio-createbuckets 0/1 with its task **Complete 28
+minutes ago**. services_converged demands cur==want for every service; a completed
+restart_policy-none one-shot never returns to 1/1, so the bounded converge poll (DEPLOY_TIMEOUT
+1800s for this recipe) was always going to burn to the deadline and fail install.
+
+Why nobody ever saw this before P2b: the old setup_custom_tests.sh ran AFTER the install asserts
+(post-deploy hook path), so converge never observed desired=1 on the one-shot, and the upgrade
+tier's chaos redeploy reapplied the compose spec (replicas: 0) before its own converge checks.
+P2b folded the trigger into ops.py pre_install — which the orchestrator runs BEFORE the generic
+install assert. Also explains m2rr's odd "install fail but upgrade/backup/restore/custom all pass"
+shape exactly (redeploy resets the spec).
+
+Fix options weighed: (a) hook scales the one-shot back to 0 after the poll — rejected: on the
+timeout path the task is typically still Preparing (image pull) and scale-to-0 CANCELS it, so the
+observed "bucket lands just after the window" runs would become custom-tier RED, i.e. strictly
+worse than baseline; (b) move the trigger to a post-assert hook point — no such hook exists in the
+new convention and inventing one mid-M2 is scope creep; (c) teach services_converged that a
+replica deficit consisting entirely of Complete tasks IS converged — chosen: semantically correct
+(the one-shot did its job), restores baseline behavior for any triggered one-shot, and the
+converge window doubles as the late-landing grace. Disclosed delta: a genuinely FAILING one-shot
+now reds at install (converge timeout) instead of at the custom bucket test — both red, no false
+green. Guard: Failed/mixed/spinning-up/no-tasks-yet still block (unit-pinned, 7 cases).
+
+Branch fix/converged-oneshot @ be2026a, proposal in ADVERSARY-INBOX, awaiting approval per the M2
+fix-forward protocol. Unit suite 199 passed + lint PASS from the cc-ci working-tree rsync.
diff --git a/STATUS-rcust.md b/STATUS-rcust.md
index 83658c6..0230b8f 100644
--- a/STATUS-rcust.md
+++ b/STATUS-rcust.md
@@ -127,9 +127,23 @@ sweep runs, not retroactively here.
 - M2.3 in-flight proof runs (serial queue /root/m2-proof.sh + /root/m2-proof2.sh, logs
   /root/m2-proof-logs/, driver /root/m2-proof-logs/driver.log):
   1. **lasuite-drive @baseline ref ffa7d585afa2 PR=1 on merged main @5c0676b** (post-fix-forward
-     1357544) → run id m2p-lasuite-drive; EXPECTED L5 (the Adversary approval condition).
+     1357544) → run id m2p-lasuite-drive: **WILL LAND L0 — second P2b regression found via this
+     run, root-caused LIVE.** The 1357544 best-effort path WORKED (`!!` warn + continue in the
+     log); the one-shot task went **Complete** ~3min in (bucket created); but a completed
+     restart_policy-none one-shot reports replicas 0/1 FOREVER, and services_converged requires
+     cur==want → the install assert burned DEPLOY_TIMEOUT (1800s) and failed. Old world never saw
+     this: setup_custom_tests.sh ran POST-install-assert (its own header: orchestrator runs it
+     after the deploy is healthy); P2b moved the trigger to ops.py pre_install = PRE-assert.
+     Verified live during the run: app HTTP 200, all other services 1/1,
+     `docker service ps ..._minio-createbuckets` = Complete, pytest in converge loop 27+ min.
+     **Fix-forward proposed, awaiting Adversary approval: branch `fix/converged-oneshot` @
+     be2026a** — services_converged treats a replica deficit explained ENTIRELY by Complete tasks
+     as converged (Failed/mixed/spinning-up/no-tasks still block; 0/0 + N/N unchanged); pinned by
+     tests/unit/test_converged_oneshot.py (7 cases). Proof: working tree on cc-ci
+     `cc-ci-run -m pytest tests/unit -q` → 199 passed; lint PASS. Post-approval: merge + re-run
+     this proof (expect L5).
   2. **discourse @7ae7b0f PR=2 on merged main** (exact baseline-184 invocation) → m2p-discourse;
-     discriminates PR=0-artifact/race vs deterministic-at-ref.
+     discriminates PR=0-artifact/race vs deterministic-at-ref. Unaffected by the one-shot issue.
   3. **discourse @7ae7b0f PR=2 on OLD main** (/root/m2-oldmain) → ab-discourse-7ae7b0f-oldmain;
      completes the same-ref A/B the upgrade-HC1 mode is missing.
 - M2.4 spot-greps (customizations actually executed — log evidence in /root/m2-logs/):