Abstract
Lemmatization, which reduces words to their
root forms, plays a key role in tasks such as information retrieval, text indexing, and machine
learning-based language models. However,
a key research challenge for low-resource
languages such as Somali is the lack of
human-annotated lemmatization datasets and
reliable ground truth to underpin accurate morphological analysis and training relevant NLP
models. To address this problem, we developed the first large-scale, purpose-built Somali
lemmatization lexicon, coupled with a crowdsourcing platform for ongoing expansion. The
system leverages Somali’s agglutinative and
derivational morphology, encompassing
5,584 root words and 78,629 derivative forms,
each annotated with part-of-speech tags. For
data validation, we devised a pilot lexicon-based lemmatizer integrated with
rule-based logic to handle out-of-vocabulary
terms. Evaluation on a 294-document corpus
covering news articles, social media posts, and
short messages shows lemmatization accuracies of 51.27% for full articles, 44.14% for
excerpts, and 59.51% for short texts such as
tweets. These results demonstrate that combining lexical resources, POS tagging, and rule-based strategies provides a robust and scalable
framework for addressing morphological complexity in Somali and other low-resource languages.
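
The lookup-then-fallback design described above can be illustrated with a minimal sketch. It is not the lemmatizer evaluated in this work: the lexicon entries, the suffix list, the minimum stem length, and the names LEXICON, SUFFIX_RULES, MIN_STEM_LEN, and lemmatize are all assumptions made for illustration, and the suffix rules are a deliberately simplified stand-in for real Somali morphology.

```python
from typing import Dict, List, Tuple

# Toy lexicon mapping surface form -> (lemma, POS tag).
# Illustrative entries only; they are NOT taken from the paper's lexicon.
LEXICON: Dict[str, Tuple[str, str]] = {
    "buug": ("buug", "NOUN"),
    "buugga": ("buug", "NOUN"),
    "buugaag": ("buug", "NOUN"),
}

# Hypothetical suffix-stripping rules for out-of-vocabulary tokens.
# A simplified assumption, not the rule set used in the paper.
SUFFIX_RULES: List[str] = ["ka", "ga", "ta", "da"]

# Assumed guard so that very short words are not stripped to nothing.
MIN_STEM_LEN = 3


def lemmatize(token: str) -> Tuple[str, str]:
    """Return (lemma, source), where source is 'lexicon', 'rule', or 'unchanged'."""
    word = token.lower()

    # Step 1: direct lexicon lookup.
    if word in LEXICON:
        return LEXICON[word][0], "lexicon"

    # Step 2: rule-based fallback, trying longer suffixes first.
    for suffix in sorted(SUFFIX_RULES, key=len, reverse=True):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LEN:
            return stem, "rule"

    # Step 3: no rule applies, so the token is returned as its own lemma.
    return word, "unchanged"
```

Returning the source of each decision alongside the lemma makes it possible to report how many tokens were resolved by the lexicon and how many by the fallback rules, which is useful when the lemmatizer serves mainly to validate lexicon coverage.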