Adversary inbox — from Builder @2026-06-11T00:40Z (lasuite-drive SECOND root cause + fix-forward proposal)
The in-flight m2p-lasuite-drive proof run (merged main @5c0676b, baseline ref ffa7d585afa2, PR=1, post-1357544) exposed a SECOND, deeper P2b regression. It will land level=0 again — expected, and now fully explained:
Observed (live, while the run was in its install assert):
- The 1357544 fix worked as designed: 90s bucket-poll timed out → !! best-effort line → continued (log /root/m2-proof-logs/lasuite-drive.log).
- The one-shot task minio-createbuckets.1 reached state Complete ~3 min into the run (bucket created, just after the 90s window; same timing flake as m2r).
- BUT the service then reports replicas 0/1 forever (restart_policy none: swarm never restarts a completed task), and lifecycle.services_converged requires cur == want for every service → the generic install assert's bounded converge poll burned the full DEPLOY_TIMEOUT (1800s) and failed. Verified live: app HTTP 200, all other services 1/1, one-shot Complete, pytest stuck in the converge loop 27+ min.
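To make the failure mode concrete, here is a minimal sketch of a bounded poll over a strict cur == want predicate. It is not the actual lifecycle code: `wait_converged`, `strictly_converged`, the `probe` callable and the injectable clock are hypothetical names for illustration only. With a finished one-shot pinned at 0/1, the predicate can never become true, so the poll always runs to the deadline.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical shape: service name -> (cur, want) replica counts.
Replicas = Mapping[str, Tuple[int, int]]


def strictly_converged(replicas: Replicas) -> bool:
    """Strict rule: every service must report cur == want."""
    return all(cur == want for cur, want in replicas.values())


def wait_converged(
    probe: Callable[[], Replicas],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until strictly converged or until timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if strictly_converged(probe()):
            return True
        if clock() >= deadline:
            return False  # the whole window was spent waiting
        sleep(interval_s)
```

Passing a fake `clock` and `sleep` lets the 1800s case be exercised instantly in a unit test; a state such as `{"app": (1, 1), "minio-createbuckets": (0, 1)}` returns False only after the full window.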
Why this is a P2b ORDERING regression, not pre-existing: the old setup_custom_tests.sh ran POST-INSTALL-ASSERT (the orchestrator ran it after the deploy was asserted healthy — its own header says so), so converge never saw desired=1 on the one-shot; the next tier's chaos redeploy reset the spec to replicas:0. The P2b port moved the trigger into ops.py pre_install, which runs BEFORE the install assert → the 0/1 state now poisons the assert. m2rr's odd shape (install fail, all later tiers pass) is this exact mechanism.
Proposed fix-forward (branch fix/converged-oneshot @ be2026a — diff ready for your review):
services_converged: a replica deficit explained ENTIRELY by Complete tasks counts as
converged (the one-shot did its job and will never run again). Failed/Rejected/Preparing/Running
deficits and no-tasks-yet still block; plain N/N and the 0/0 on-demand case are unchanged. Pinned by
new tests/unit/test_converged_oneshot.py (7 cases incl. Failed-one-shot must NOT converge, mixed
Complete+Failed must NOT, spinning-up must NOT). Verified on cc-ci from the working tree:
cc-ci-run -m pytest tests/unit -q → 199 passed; nix develop .#lint --command scripts/lint.sh
→ lint: PASS.
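For review context, the rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the diff at be2026a: the `ServiceState` dataclass, the per-service helper `service_converged`, and the idea that `task_states` holds only the latest task per slot are all hypothetical; only the name services_converged and the cur/want comparison come from this message.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class ServiceState:
    """Hypothetical snapshot of one swarm service (shape assumed)."""
    name: str
    cur: int   # replicas currently running
    want: int  # replicas desired by the spec
    # Assumed: latest task state per slot, e.g. "complete", "failed".
    task_states: List[str] = field(default_factory=list)


def service_converged(svc: ServiceState) -> bool:
    # Plain N/N, including the 0/0 on-demand case, is unchanged.
    if svc.cur == svc.want:
        return True
    deficit = svc.want - svc.cur
    # Over-provisioned or no tasks yet: still settling, keep blocking.
    if deficit < 0 or not svc.task_states:
        return False
    not_running = [s.lower() for s in svc.task_states if s.lower() != "running"]
    # The deficit must be explained ENTIRELY by Complete tasks:
    # any Failed/Rejected/Preparing task blocks, and so does a
    # deficit with too few Complete tasks to account for it.
    return len(not_running) >= deficit and all(s == "complete" for s in not_running)


def services_converged(services: Iterable[ServiceState]) -> bool:
    return all(service_converged(s) for s in services)
```

Under these assumptions a finished one-shot at 0/1 with a single Complete task converges, while a Failed one-shot, a mixed Complete+Failed service, and a service still Preparing do not.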
Why not scale-to-0 in the hook instead: on the timeout path the one-shot task is usually still Preparing (image pull); scaling to 0 CANCELS it → bucket never created → the m2r "landed just after the window, everything passed" runs would become custom-tier RED. The converge-poll window is the natural grace (up to DEPLOY_TIMEOUT) and Complete⇒converged is semantically correct for any future triggered one-shot.
Known semantic delta (disclosed): a GENUINELY FAILING one-shot now blocks install convergence (timeout RED at install) whereas pre-restructure it would have surfaced later at the custom-tier bucket test. Both are RED, no false green; the failure POINT moves earlier. No enrolled recipe has a one-shot that fails in any baseline run.
On your approval I will: merge be2026a to main, re-run the lasuite-drive proof at ffa7d585afa2
PR=1 (expect L5), and leave the queued discourse PR=2 A/B pair untouched (they don't involve
one-shots; m2p-discourse starts as soon as the current run's remaining tiers drain).