Pleias and GSMA launch 'CommonLingua', an open source language identification model that supports 61 African languages


CommonLingua, the first joint release of the GSMA’s “AI Language Models in Africa, by Africa, for Africa” initiative, is a compact, 2-million-parameter open-source model that covers 334 languages (including 61 African languages) and outperforms systems up to 300 times larger.

April 28, 2026, London: Pleias and GSMA today announced the launch of CommonLingua, an open-source language identification (LID) model designed specifically to unlock African language data at scale. It is delivered under the GSMA’s “AI Language Models in Africa, by Africa, for Africa” initiative, a coalition dedicated to closing the African language gap in AI.

Africa is home to more than 2,000 living languages, many of which remain underrepresented in AI training data. As a result, language identification systems often perform less reliably on African language content, particularly when distinguishing between closely related languages or handling code-mixed text. Before a Swahili, Yoruba, or Wolof language model can be built, the underlying text must first be correctly identified by language, a step where existing tools often fail for African content.

This is because major LID systems, such as fastText, GlotLID, and OpenLID, were built around high-resource European and Asian languages and often mislabel African language text as English or French. Even the most recent frontier models lose about 30 points of accuracy on African languages compared with the world's major languages.

CommonLingua is designed to solve this first step of the process. On the new CommonLID benchmark, CommonLingua achieves an accuracy of 83% and a macro F1 score of 0.79, outperforming leading LID models by more than 10 percentage points under comparable evaluation conditions while using approximately one-third of the parameters. With 2 million parameters and an 8 MB checkpoint, the model is designed for efficient deployment: it processes approximately 20 texts per second on a CPU and up to 3,000 texts per second on a single GPU.
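For readers less familiar with the two metrics quoted above, the following is a minimal Python sketch of how accuracy and macro F1 are commonly computed for a language identification task. It is not the CommonLID evaluation code: the function names and the toy labels (ISO 639-3 codes `eng`, `swa`, `yor`) are illustrative assumptions, and benchmarks can differ in which labels they average over.

```python
def accuracy(gold, pred):
    """Fraction of texts whose predicted language matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-language F1 scores.

    Assumption: the average runs over every label seen in gold or pred.
    """
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up labels: the single Swahili text is mislabeled as English.
gold = ["eng"] * 8 + ["swa", "yor"]
pred = ["eng"] * 8 + ["eng", "yor"]

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

Because macro F1 weights every language equally, one missed low-resource language lowers it far more than it lowers accuracy, which is why the two figures are usually reported together.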

CommonLingua covers 334 languages in total, including 61 African languages across eight language families: Bantu (21), Niger-Congo/West Africa (18), Afroasiatic and Semitic (7), Cushitic and Chadic (4), Berber (3), Nilo-Saharan (3), and Pidgins, Creoles and Others (5). The model operates directly on UTF-8 byte sequences rather than relying on a language-specific tokenizer, allowing consistent handling across scripts including Latin, Arabic, Ethiopic, N'ko, and Tifinagh.
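To illustrate what operating on UTF-8 byte sequences means in practice, here is a minimal Python sketch. It is not CommonLingua's actual preprocessing: the `to_byte_ids` helper and the sample strings are assumptions chosen for illustration. The point is only that text in any script reduces to integers between 0 and 255, so no per-language vocabulary is needed.

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer (0-255) per byte."""
    return list(text.encode("utf-8"))


# Illustrative sample strings, one per script named above (assumed examples).
samples = {
    "Latin": "Habari",
    "Arabic": "سلام",
    "Ethiopic": "ሰላም",
    "N'Ko": "ߒߞߏ",
    "Tifinagh": "ⵜⵉⴼⵉⵏⴰⵖ",
}

for script, text in samples.items():
    ids = to_byte_ids(text)
    print(f"{script:9s} chars={len(text):2d} bytes={len(ids):2d} ids={ids}")
```

ASCII Latin letters take one byte each, while characters in the other scripts shown take two or three, so a byte-level model sees longer sequences for those scripts but always over the same 256-value input alphabet.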

“African languages are not a fringe case. They are the working languages of hundreds of millions of people and deserve an AI infrastructure built with the same care as any other language. CommonLingua is deliberately the first brick we are laying: you cannot select what you cannot identify,” said Pierre-Carl Langlais, co-founder and CTO of Pleias.

The model is trained exclusively on public domain and openly licensed content aggregated through the Common Corpus project, which includes Wikipedia, scientific publications on OpenAlex, VOA Africa, WaxalNLP, Cultural Heritage and Pralekha. All data sets are released under permissive licenses.

Louis Powell, Director of AI Initiatives at the GSMA, added: “Closing the gap in African language AI is critical to digital inclusion and unlocking economic opportunities. Progress has long been held back by a lack of basic infrastructure, starting with something as essential as language identification. CommonLingua addresses this critical gap, enabling the development of richer data sets and more representative AI systems at scale. Through our initiative, the GSMA is bringing together partners to move beyond fragmented efforts towards a shared infrastructure that can power Africa's digital ecosystem.”

This conversation will continue at MWC26 Kigali, where the GSMA and its partners will bring together industry leaders to accelerate progress in African-language AI. Register now to be part of the discussion.

-ENDS-

About Pleias

Pleias is a research lab and artificial intelligence company specializing in open, auditable language models trained exclusively on permissively licensed data. Pleias develops Common Corpus, the largest fully open multilingual pretraining dataset, and the Pleias family of small language models optimized for retrieval, reasoning, and low-resource languages.

About the GSMA
The GSMA is a global organization that unifies the mobile ecosystem to discover, develop and deliver critical innovation for positive business environments and social change. Our vision is to unlock the full power of connectivity so that people, industry and society thrive. Representing mobile operators and organizations across the mobile ecosystem and adjacent industries, the GSMA delivers for its members across three broad pillars: connectivity for good, industry services and solutions, and outreach. This activity includes advancing policy, tackling today's biggest social challenges, underpinning the technology and interoperability that make mobile devices work, and providing the world's largest platform to bring together the mobile ecosystem at the MWC and M360 series of events.

Media contacts

Pleias: [email protected]

GSMA: press [email protected]
