The value of inclusion of African Languages in AI

Chenai Chair
Published on Apr 1, 2026

Masakhane African Languages Hub

This guest article was originally posted on the Masakhane African Languages Hub website.

The article "Top AI models underperforming in languages other than English", published in The Economist on March 18th 2026, highlights a critical and long-standing concern: systems that perform well in English often degrade significantly in African languages. In some cases, the drop in accuracy is large enough to raise questions not only about performance, but about safety.

In high-stakes contexts such as healthcare, this disparity matters. If an AI system provides reliable guidance in English but weaker or misleading responses in ChiShona, Hausa, or isiZulu, it is effectively operating with unequal standards.

When such tools are deployed across African clinics and other key sectors, such as agriculture and public services, the disparity becomes more than a technical issue. It is a question of equity and risk in an already unequal society.

The technical causes of the language gap are increasingly well understood: data imbalance, inefficient tokenisation, limited compute and the reliance on translation-based reasoning. Encouragingly, research shows that even modest amounts of high-quality data can improve performance in under-resourced languages. But progress has been uneven and largely dependent on whether such languages are prioritised.

This points to a broader issue. The expansion of AI into African contexts is proceeding faster than the development of systems that are robust in African languages. Without deliberate intervention, this risks entrenching a two-tier system: one in which English speakers benefit from high-performing tools, while others do not.

These issues are being addressed by Masakhane, a pan-African community founded in 2019 whose name means "we build together" in isiZulu. Initially, the main constraint was the near absence of datasets for African languages. That gap is gradually narrowing thanks to the work of Masakhane and others in growing the African NLP community.

However, the broader challenge remains: most leading models are still fundamentally English-centric, both in their training data and in how they process language.

The Masakhane African Languages Hub, where I lead a team dedicated to addressing the underrepresentation and misrepresentation of African languages in AI, is already making progress in correcting this imbalance. Our approach is based on three principles: community, collaboration, and care. We emphasise data, research, innovation, and building institutional capacity.

Community

Language technologies cannot be built in isolation from the people who speak those languages. The Hub's data work focuses on developing high-quality multimodal datasets through community-led processes, rather than extractive collection. We aim to collect 500 hours of speech data for each of 50 languages, along with data in domain-specific areas such as agriculture, government services, health, education and economic inclusion.

With 500 hours of speech data, we can fine-tune a model to achieve a useful word error rate (WER) of close to 10% (lower is better). Creating multimodal datasets also expands the knowledge base: it yields datasets that enable voice-to-voice translation between African languages, more data for building tokenisers, and a complete pipeline that reflects how most African languages are used in everyday life.
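To make the WER figure concrete, here is a minimal sketch of how word error rate is commonly computed: the number of word substitutions, deletions and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The function name and the example sentences are illustrative assumptions, not part of the Hub's tooling, and splitting on whitespace is a simplification. Real evaluations usually normalise casing, punctuation and orthography first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; tokenises on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative transcripts (made up for this example): one wrong word in ten.
reference = "the clinic opens at eight every morning except on sunday"
hypothesis = "the clinic opens at eight every morning except on monday"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
```

Read this way, a WER of about 10% means roughly one word in ten is transcribed incorrectly, which is why lower is better.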

Collaboration

Addressing language inequity requires coordinated investment in African-led research ecosystems. Through funding partnerships with organisations including the Gates Foundation, IDRC/FCDO, Google.org, and Microsoft AI for Good, the Hub supports a network of researchers, tech entrepreneurs, and institutions working on African language AI.

One focus is the development of multilingual benchmarks across up to 100 African languages — an attempt to address a gap the article rightly identifies: the lack of meaningful evaluation frameworks for non-English contexts. We are also establishing guidance on licensing, attribution, and safeguarding — recognising that more data alone will not resolve the problem if governance is overlooked.

Care

Language gaps intersect with broader inequalities of gender, geography, education, and income. Through initiatives such as Project ECHO (Enhancing Communications for Her Opportunities), the Hub will identify and support four gender-responsive AI innovations in livelihoods and low-risk health domains. The Hub will approach innovation through co-design and rigorous evaluation of high-potential use cases, and will share insights on gender-focused AI with the broader ecosystem.

By the project's conclusion, at least 20,000 unique women will meaningfully benefit from AI tools. Additionally, innovators and stakeholders will adopt and institutionalise sustained gender-responsive practices in their strategies and operations.

In conclusion, for AI to have a meaningful impact in Africa, it must uphold comparable standards across languages. Achieving this will require not only technical advancements but also continuous investment in African-led research, data infrastructure, and evaluation.

The individuals who are most likely to benefit from these technologies should be part of developing the solutions. Therefore, ensuring that AI systems can function reliably in African languages is a fundamental issue, not a secondary one. This is crucial for establishing their legitimacy.
