ML Scientist

Introducing MessIRve: A Comprehensive Spanish Information Retrieval Dataset

MessIRve is a new large-scale information retrieval (IR) dataset in Spanish. It contains approximately 730,000 queries from 20 Spanish-speaking countries and the United States, with relevant documents sourced from Wikipedia. Because its queries reflect diverse Spanish-speaking regions, MessIRve differs from datasets that are translated from English or that ignore dialectal variation. Its size also lets it cover a wider variety of topics than smaller datasets.

The dataset is available on Hugging Face.

For more information, refer to our arXiv paper: MessIRve: A Large-Scale Spanish Information Retrieval Dataset.

We believe MessIRve will stimulate more research in IR for the Spanish language and aid in the creation of efficient information access tools for Spanish speakers.

*"MessIRve" is a play on "me sirve", Spanish for "it works for me". The nod to Lionel Messi, a popular sports figure in Spanish-speaking countries, underscores the importance of using topics relevant to Spanish speakers.
