Introducing MessIRve: A Comprehensive Spanish Information Retrieval Dataset

September 11, 2024


Meet MessIRve, a new large-scale information retrieval (IR) dataset in Spanish. It contains approximately 730,000 queries from 20 Spanish-speaking countries and the United States. Sourced from Wikipedia, MessIRve’s queries reflect diverse Spanish-speaking regions, which sets it apart from datasets that are translated from English or that ignore dialectal variation. Its size also lets it cover a wider variety of topics than smaller datasets can.

The dataset is available on Hugging Face in the following repositories:

Queries and relevance judgments: spanish-ir/messirve
The collection of documents: spanish-ir/eswiki_20240401_corpus
Queries and qrels in TREC format: spanish-ir/messirve-trec
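
For readers new to the TREC format, here is a minimal sketch of how relevance judgments (qrels) in that style can be read. It assumes the standard four-column qrels layout (query ID, iteration, document ID, relevance); the exact file names and ID formats in spanish-ir/messirve-trec are not described in this post, so the sample lines and the parse_qrels helper below are made up for illustration.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC-style qrels lines into {query_id: {doc_id: relevance}}.

    Assumes the standard layout: query_id iteration doc_id relevance,
    separated by whitespace. The iteration column is ignored.
    """
    qrels = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        query_id, _iteration, doc_id, relevance = line.split()
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Hypothetical sample lines; real MessIRve IDs will look different.
sample = [
    "q1 0 doc10 1",
    "q1 0 doc42 0",
    "q2 0 doc7 1",
]

qrels = parse_qrels(sample)

# Keep only the documents judged relevant (relevance > 0) for each query.
relevant = {q: [d for d, r in docs.items() if r > 0] for q, docs in qrels.items()}
print(relevant)
```

To read a real file, the same function can be given an open file handle, since it only iterates over lines.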

For more information, refer to our arXiv paper: MessIRve: A Large-Scale Spanish Information Retrieval Dataset.

We believe MessIRve will stimulate more research in IR for the Spanish language and aid in the creation of efficient information access tools for Spanish speakers.

*MessIRve is a play on "me sirve", which means "it works for me" in Spanish. The reference to Lionel Messi, a popular sports figure in Spanish-speaking countries, emphasizes the importance of using topics relevant to Spanish speakers.

