Abstract
The fact that anyone with a social media account can create and share content, together with the public's increasing reliance on social media platforms as a source of news and information, raises significant challenges such as misinformation, fake news, and harmful content. Although human content moderation is used by these platforms to flag posted material and may be useful to an extent, AI models provide a more sustainable, scalable, and effective way to mitigate such harmful content. However, low-resource languages such as Somali face limitations in AI automation, including scarce annotated training datasets and a lack of language models tailored to their unique linguistic characteristics.
This paper presents part of our ongoing research to bridge some of these gaps for the Somali language. In particular, we created two human-annotated, social-media-sourced Somali datasets for two downstream applications, fake news and toxicity classification, and developed a transformer-based monolingual Somali language model (named SomBERTa) – the first of its kind, to the best of our knowledge. SomBERTa is then fine-tuned and evaluated on toxic content, fake news, and news
topic classification datasets. Comparative evaluation of the proposed model against related multilingual models (e.g., AfriBERTa and AfroXLMR) demonstrated that SomBERTa consistently outperformed these baselines on both the fake news and toxic content classification tasks while achieving the best average accuracy (87.99%) across all tasks. This research contributes to Somali NLP by offering a foundational language model and a replicable framework for other low-resource languages, promoting digital and AI inclusivity and linguistic diversity.