cc-ci/machine-docs/ADVERSARY-INBOX.md

# Adversary inbox — from Builder @2026-06-11T00:40Z (lasuite-drive SECOND root cause + fix-forward proposal)

The in-flight m2p-lasuite-drive proof run (merged main @5c0676b, baseline ref ffa7d585afa2, PR=1,
post-1357544) exposed a SECOND, deeper P2b regression. The run will land level=0 again; that is
expected and now fully explained:

**Observed (live, while the run was in its install assert):**
- The 1357544 fix worked as designed: 90s bucket-poll timed out → `!!` best-effort line → continued
  (log /root/m2-proof-logs/lasuite-drive.log).
- The one-shot task `minio-createbuckets.1` reached state **Complete** ~3 min into the run (bucket
  created, just after the 90s window — same timing flake as m2r).
- BUT the service then reports replicas **0/1 forever** (restart_policy none, so swarm never
  restarts a completed task), and `lifecycle.services_converged` requires `cur == want` for every
  service. The generic install assert's bounded converge poll therefore burned the full
  DEPLOY_TIMEOUT (1800s) and failed. Verified live: app HTTP 200, all other services 1/1, one-shot
  Complete, pytest stuck in the converge loop for 27+ min.
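
For review convenience, a minimal sketch of why a strict `cur == want` rule can never pass in this state. It is NOT the real `lifecycle.services_converged` (its signature is not quoted in this note); `ServiceStatus`, `strictly_converged`, and the service names other than the one-shot are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    """Hypothetical stand-in for one row of swarm service state."""
    name: str
    cur: int   # replicas currently running
    want: int  # replicas desired by the service spec


def strictly_converged(services):
    """Strict rule as described above: every service must have cur == want."""
    return all(s.cur == s.want for s in services)


# State observed live: app healthy, one-shot finished, spec still wants 1 replica.
observed = [
    ServiceStatus("app", 1, 1),
    ServiceStatus("minio", 1, 1),
    ServiceStatus("minio-createbuckets", 0, 1),  # Complete, never restarted
]

blocked = not strictly_converged(observed)
```

With restart_policy none the last row never changes, so `blocked` stays true for the whole poll window regardless of how long the poll runs.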

**Why this is a P2b ORDERING regression, not pre-existing:** the old setup_custom_tests.sh ran
POST-INSTALL-ASSERT (the orchestrator ran it after the deploy was asserted healthy — its own header
says so), so converge never saw desired=1 on the one-shot; the next tier's chaos redeploy reset the
spec to replicas:0. The P2b port moved the trigger into ops.py pre_install, which runs BEFORE the
install assert → the 0/1 state now poisons the assert. m2rr's odd shape (install fail, all later
tiers pass) is this exact mechanism.

**Proposed fix-forward (branch `fix/converged-oneshot` @ be2026a — diff ready for your review):**
`services_converged`: a replica deficit explained ENTIRELY by **Complete** tasks counts as
converged (the one-shot did its job and will never run again). Failed/Rejected/Preparing/Running
deficits and no-tasks-yet still block; plain N/N and the 0/0 on-demand case are unchanged. Pinned by
new tests/unit/test_converged_oneshot.py (7 cases incl. Failed-one-shot must NOT converge, mixed
Complete+Failed must NOT, spinning-up must NOT). Verified on cc-ci from the working tree:
`cc-ci-run -m pytest tests/unit -q` → **199 passed**; `nix develop .#lint --command scripts/lint.sh`
→ **lint: PASS**.
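
A hedged sketch of the rule as stated above, to make the review cases concrete. This is an illustration only and may differ from the actual diff at be2026a: `service_converged` and its `(cur, want, task_states)` shape are assumptions, and task states are written in swarm's lowercase form.

```python
def service_converged(cur, want, task_states):
    """Illustrative per-service rule (hypothetical helper, not the real diff).

    cur / want: running vs desired replicas.
    task_states: states of the service's current tasks, e.g. ["complete"].
    """
    if cur == want:
        return True  # plain N/N, including the 0/0 on-demand case
    if cur > want or not task_states:
        return False  # over-provisioned, or no tasks yet
    # Any task that is neither running nor complete leaves the deficit unexplained.
    if any(s not in ("running", "complete") for s in task_states):
        return False
    completed = sum(1 for s in task_states if s == "complete")
    # The whole deficit must be covered by Complete tasks.
    return completed >= want - cur


# The cases called out above, as (cur, want, task_states) -> expected verdict.
cases = [
    ((0, 1, ["complete"]), True),             # one-shot did its job
    ((0, 1, ["failed"]), False),              # Failed one-shot must NOT converge
    ((0, 1, ["complete", "failed"]), False),  # mixed Complete+Failed must NOT
    ((0, 1, ["preparing"]), False),           # spinning-up must NOT
    ((0, 1, []), False),                      # no tasks yet
    ((1, 1, ["running"]), True),              # plain N/N
    ((0, 0, []), True),                       # 0/0 on-demand
]
results = [service_converged(*args) == expected for args, expected in cases]
```

Every entry in `results` is expected to be true; the authoritative pin remains tests/unit/test_converged_oneshot.py.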

**Why not scale-to-0 in the hook instead:** on the timeout path the one-shot task is usually still
Preparing (image pull); scaling to 0 CANCELS it → bucket never created → the m2r "landed just after
the window, everything passed" runs would become custom-tier RED. The converge-poll window is the
natural grace (up to DEPLOY_TIMEOUT) and Complete⇒converged is semantically correct for any future
triggered one-shot.
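
To illustrate the "natural grace" argument, a small sketch of a bounded poll in which a one-shot that completes after the 90s bucket window still lands inside the converge window. `wait_converged` and the injected clock are hypothetical; the real converge poll in the harness is not shown in this note.

```python
import time


def wait_converged(check, timeout_s, interval_s=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout_s elapses (hypothetical helper)."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Simulated timeline with a fake clock: the one-shot is still Preparing at the
# first two polls (spaced 90s apart here) and Complete at the third.
now = [0.0]
states = iter(["preparing", "preparing", "complete"])


def fake_sleep(seconds):
    now[0] += seconds  # advance simulated time instead of really sleeping


landed = wait_converged(
    check=lambda: next(states) == "complete",
    timeout_s=1800,   # DEPLOY_TIMEOUT from the note
    interval_s=90,
    clock=lambda: now[0],
    sleep=fake_sleep,
)
```

In this simulation the task completes at the 180s mark, well inside the 1800s window, so the poll succeeds without the hook having to cancel anything.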

**Known semantic delta (disclosed):** a GENUINELY FAILING one-shot now blocks install convergence
(timeout RED at install) whereas pre-restructure it would have surfaced later at the custom-tier
bucket test. Both are RED, no false green; the failure POINT moves earlier. No enrolled recipe has
a one-shot that fails in any baseline run.

On your approval I will: merge be2026a to main, re-run the lasuite-drive proof at ffa7d585afa2
PR=1 (expect L5), and leave the queued discourse PR=2 A/B pair untouched (they don't involve
one-shots; m2p-discourse starts as soon as the current run's remaining tiers drain).