Somali-language AI and Innovation Lab — Pioneering the digital frontier for Somali language through cutting-edge AI research and innovation.

Jamhuriya University of Science and Technology
Mogadishu, Somalia
sail@just.edu.so
+252 - 61- 2223999

2026 SAIL - Somali-language AI and Innovation Lab. All rights reserved.

NLP | Completed

Morphologically-informed Somali Lemmatization Corpus built with a Web-based Crowdsourcing Platform

March 8, 2026
SAIL Team

Abstract

Lemmatization, which reduces words to their root forms, plays a key role in tasks such as information retrieval, text indexing, and machine learning-based language models. However, a key research challenge for low-resource languages such as Somali is the lack of human-annotated lemmatization datasets and reliable ground truth to underpin accurate morphological analysis and the training of relevant NLP models. To address this problem, we developed the first large-scale, purpose-built Somali lemmatization lexicon, coupled with a crowdsourcing platform for ongoing expansion. The system leverages Somali’s agglutinative and derivational morphology, encompassing over 5,584 root words and 78,629 derivative forms, each annotated with part-of-speech tags. To validate the data, we devised a pilot lexicon-based lemmatizer integrated with rule-based logic to handle out-of-vocabulary terms. Evaluation on a 294-document corpus covering news articles, social media posts, and short messages shows lemmatization accuracies of 51.27% for full articles, 44.14% for excerpts, and 59.51% for short texts such as tweets. These results demonstrate that combining lexical resources, POS tagging, and rule-based strategies provides a robust and scalable framework for addressing morphological complexity in Somali and other low-resource languages.
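
The general idea of a lexicon-based lemmatizer with a rule-based fallback for out-of-vocabulary terms, as described in the abstract above, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SAIL implementation: the lexicon entries are toy examples chosen for illustration rather than taken from the SAIL lexicon, and the names `LEXICON`, `FALLBACK_SUFFIXES`, `MIN_STEM_LENGTH`, `lemmatize`, and `accuracy` are hypothetical. The suffix list is a simplified stand-in for whatever rules the pilot lemmatizer actually applies.

```python
from typing import Dict, List, Tuple

# Toy lexicon mapping a derivative form to (root, POS tag).
# Illustrative entries only; NOT drawn from the SAIL lexicon.
LEXICON: Dict[str, Tuple[str, str]] = {
    "buugga": ("buug", "NOUN"),
    "buugaag": ("buug", "NOUN"),
    "guriga": ("guri", "NOUN"),
    "guryo": ("guri", "NOUN"),
}

# Hypothetical fallback rules: a few suffixes to strip from unknown words.
# This simplified list is an assumption made for the sketch.
FALLBACK_SUFFIXES: List[str] = ["ka", "ga", "ta", "da"]

# Assumed guard so that very short words are not reduced to fragments.
MIN_STEM_LENGTH = 3


def lemmatize(token: str) -> str:
    """Return a lemma: lexicon lookup first, suffix stripping as fallback."""
    word = token.lower()
    if word in LEXICON:
        return LEXICON[word][0]
    # Out-of-vocabulary path: strip the first matching suffix, if the
    # remaining stem is long enough.
    for suffix in FALLBACK_SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    # No rule applied: return the lowercased word unchanged.
    return word


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of tokens whose predicted lemma equals the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

In this sketch, a word such as `buugga` is resolved through the lexicon, while an unseen word such as `miiska` falls through to the suffix rules. The `accuracy` helper shows one plausible way to score predictions against annotated lemmas; the abstract does not specify how the reported accuracies were computed.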
