Abstract
Cross-lingual information retrieval (CLIR) remains an actively studied topic in information retrieval (IR), and there have
been consistent efforts to curate test collections that support this research. However, there is a lack of high-quality human-annotated
CLIR resources for African languages: the few existing collections
are mostly curated synthetically or from sources with limited corpora for these languages. We present CIRAL, a test collection for
cross-lingual retrieval with English queries and passages in four
African languages: Hausa, Somali, Swahili, and Yoruba. CIRAL’s
corpora are obtained from indigenous African websites and comprise over 2.5 million passages in total. We gathered over 1,600
queries and 30k high-quality binary relevance judgments annotated by native speakers of the languages. Additional judgment pools were
also obtained from CIRAL’s shared task, which was hosted at the
Forum for Information Retrieval Evaluation 2023 to encourage
community participation in CLIR for African languages. We describe the design and curation process of our test collection and
provide reproducible baselines that demonstrate CIRAL’s utility
in evaluating the effectiveness of retrieval systems. CIRAL is available at
https://github.com/ciralproject/ciral.