Jean-Pierre Chevallet

2020

pdf bib abs
WIKIR: A Python Toolkit for Building a Large-scale Wikipedia-based English Information Retrieval Dataset
Jibril Frej | Didier Schwab | Jean-Pierre Chevallet
Proceedings of the Twelfth Language Resources and Evaluation Conference

Over the past years, deep learning methods allowed for new state-of-the-art results in ad-hoc information retrieval. However such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc information retrieval datasets publicly available for academic research (e.g. Robust04, ClueWeb09) have at most 250 annotated queries, the recent deep learning models for information retrieval perform poorly on these datasets. These models (e.g. DUET, Conv-KNRM) are trained and evaluated on data collected from commercial search engines not publicly available for academic research which is a problem for reproducibility and the advancement of research. In this paper, we propose WIKIR: an open-source toolkit to automatically build large-scale English information retrieval datasets based on Wikipedia. WIKIR is publicly available on GitHub. We also provide wikIR59k: a large-scale publicly available dataset that contains 59,252 queries and 2,617,003 (query, relevant documents) pairs.

Jean-Pierre Chevallet

2020

2015

2012

Co-authors

Venues