Tech

African languages for AI: The project that’s gathering a huge new dataset

Published

October 19, 2025

Credit: Unsplash/CC0 Public Domain

Artificial intelligence (AI) tools like ChatGPT, DeepSeek, Siri or Google Assistant are mostly developed in the global north and trained mainly on English, Chinese or European languages. By comparison, African languages are largely missing from the internet.

A team of African computer scientists, linguists, language specialists and others has been working on precisely this problem for two years. The African Next Voices project recently released what’s thought to be the largest dataset of African languages for AI so far. We asked them about their project, which has sites in Kenya, Nigeria and South Africa.

Why is language so important to AI?

Language is how we interact, ask for help, and hold meaning in community. We use it to organize complex thoughts and share ideas. It’s the medium we use to tell an AI what we want—and to judge whether it understood us.

We are seeing an upsurge of applications that rely on AI, from education to health to agriculture. These applications rely on models trained on large volumes of mostly linguistic data. Such models are called large language models, or LLMs, but they exist for only a few of the world’s languages.

Languages also carry culture, values and local wisdom. If AI doesn’t speak our languages, it can’t reliably understand our intent, and we can’t trust or verify its answers. In short: without language, AI can’t communicate with us—and we can’t communicate with it. Building AI in our languages is therefore the only way for AI to work for people.

If we limit whose language gets modeled, we risk missing out on the majority of human cultures, history and knowledge.

Why are African languages missing and what are the consequences for AI?

The development of language is intertwined with the histories of people. Many of those who experienced colonialism and empire have seen their own languages marginalized and not developed to the same extent as colonial languages. African languages are recorded less often, including on the internet.

So there isn’t enough high-quality, digitized text and speech to train and evaluate robust AI models. That scarcity is the result of decades of policy choices that privilege colonial languages in schools, media and government.

Language data is just one of the things that’s missing. Do we have dictionaries, terminologies, glossaries? Basic tools are scarce, and many other issues raise the cost of building datasets. The missing tools include African-language keyboards, fonts, spell-checkers and tokenizers (which break text into smaller pieces so a language model can process it). The other issues include orthographic variation (differences in how words are spelled across regions), tone marking and rich dialect diversity.
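To make the tokenizer and tone-marking points concrete, here is a minimal Python sketch. It is an illustration only, not the project's tooling: the function names and the sample word are assumptions, and the whitespace tokenizer is a toy stand-in for the subword tokenizers that real language models use. It shows one small source of orthographic variation: the same accented word can be stored as different Unicode code-point sequences, and a tokenizer that does not normalize its input will treat them as different words.

```python
import unicodedata


def normalize(text: str) -> str:
    """Convert text to Unicode NFC so equivalent spellings share one code-point sequence."""
    return unicodedata.normalize("NFC", text)


def tokenize(text: str) -> list[str]:
    """Toy tokenizer (an assumption for illustration): normalize, lowercase, split on whitespace."""
    return normalize(text).lower().split()


# The same tone-marked word typed two ways:
# a precomposed e-acute versus a plain "e" followed by a combining acute accent.
composed = "il\u00e9"
decomposed = "ile\u0301"

print(composed == decomposed)                      # False: the raw strings differ
print(tokenize(composed) == tokenize(decomposed))  # True: they match after normalization
```

Normalizing before tokenizing is a cheap step, but it only addresses encoding differences. Genuine regional spelling differences and missing tone marks need language-specific rules or data.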

The result is AI that performs poorly and sometimes unsafely: mistranslations, poor transcription, and systems that barely understand African languages.

In practice this denies many Africans access—in their own languages—to global news, educational materials, health care information, and the productivity gains AI can deliver.

When a language isn’t in the data, its speakers aren’t in the product, and AI cannot be safe, useful or fair for them. They end up missing the necessary language technology tools that could support service delivery. This marginalizes millions of people and increases the technology divide.

What is your project doing about it—and how?

Our main objective is to collect speech data for automatic speech recognition (ASR). ASR is an important tool for languages that are largely spoken. This technology converts spoken language into written text.

The bigger ambition of our project is to explore how data for ASR is collected and how much of it is needed to create ASR tools. We aim to share our experiences across different geographic regions.

The data we collect is diverse by design: spontaneous and read speech across various domains, including everyday conversations, health care, financial inclusion and agriculture. We are collecting data from people of diverse ages, genders and educational backgrounds.

Every recording is collected with informed consent, fair compensation and clear data-rights terms. We transcribe following language-specific guidelines and apply a wide range of other technical checks.
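One way to picture "diverse by design" is as metadata attached to each recording, which can then be counted to check balance. The Python sketch below is a hypothetical illustration: the field names, category labels and sample rows are invented for this example and are not the project's actual schema or data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    """Hypothetical metadata for one speech recording (all field names are assumptions)."""
    language: str
    speech_type: str  # "spontaneous" or "read"
    domain: str       # e.g. "health care", "agriculture"
    age_band: str
    gender: str
    consent: bool     # True only if informed consent was recorded


def coverage(recordings, field):
    """Count consented recordings per value of one metadata field."""
    return Counter(getattr(r, field) for r in recordings if r.consent)


# Invented rows, for illustration only.
sample = [
    Recording("Hausa", "read", "health care", "18-29", "female", True),
    Recording("Hausa", "spontaneous", "agriculture", "30-49", "male", True),
    Recording("Kikuyu", "spontaneous", "everyday conversation", "50+", "female", False),
]

print(coverage(sample, "speech_type"))
print(coverage(sample, "language"))
```

Filtering on the consent flag before counting means a recording without consent never contributes to the dataset's coverage figures.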

In Kenya, through Maseno Centre for Applied AI, we are collecting voice data for five languages. We’re capturing the three main language groups: Nilotic (Dholuo, Maasai and Kalenjin), Cushitic (Somali) and Bantu (Kikuyu).

Through Data Science Nigeria, we are collecting speech in five widely spoken languages—Bambara, Hausa, Igbo, Nigerian Pidgin and Yoruba. The dataset aims to accurately reflect authentic language use within these communities.

In South Africa, working through the Data Science for Social Impact lab and its collaborators, we have been recording seven South African languages. The aim is to reflect the country’s rich linguistic diversity: isiZulu, isiXhosa, Sesotho, Sepedi, Setswana, isiNdebele and Tshivenda.

Importantly, this work does not happen in isolation. We are building on the momentum and ideas from the Masakhane Research Foundation network, Lelapa AI, Mozilla Common Voice, EqualyzAI, and many other organizations and individuals who have been pioneering African language models, data and tooling.

Each project strengthens the others, and together they form a growing ecosystem committed to making African languages visible and usable in the age of AI.

How can this be put to use?

The data and models will be useful for captioning local-language media, building voice assistants for agriculture and health, and providing call-center and support services in these languages. The data will also be archived for cultural preservation.

Larger, balanced, publicly available African language datasets will allow us to connect text and speech resources. Models will not just be experimental, but useful in chatbots, education tools and local service delivery. The opportunity is there to go beyond datasets into ecosystems of tools (spell-checkers, dictionaries, translation systems, summarization engines) that make African languages a living presence in digital spaces.

In short, we are pairing ethically collected, high-quality speech at scale with models. The aim is for people to be able to speak naturally, be understood accurately, and access AI in the languages they live their lives in.

What happens next for the project?

This project only collected voice data for certain languages. What of the remaining languages? What of other tools like machine translation or grammar checkers?

We will continue to work on multiple languages, ensuring that we build data and models that reflect how Africans use their languages. We prioritize building smaller language models that are both energy efficient and accurate for the African context.

The challenge now is integration: making these pieces work together so that African languages are not just represented in isolated demos, but in real-world platforms.

One of the lessons from this project, and others like it, is that collecting data is only step one. What matters is making sure that the data is benchmarked, reusable, and linked to communities of practice. For us, the “next” is to ensure that the ASR benchmarks we build can connect with other ongoing African efforts.
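The interview does not say which metrics the ASR benchmarks use. A common choice for speech recognition is word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of reference words. The Python sketch below is a minimal version under that assumption. The function name and the sample sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between two transcripts, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("the clinic opens at nine", "the clinic open at nine"))  # 0.2
```

Because WER compares exact word strings, transcripts should go through the same text normalization before scoring. Otherwise spelling and tone-mark variation is counted as recognition error.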

We also need to ensure sustainability: that students, researchers, and innovators have continued access to compute (computer resources and processing power), training materials and licensing frameworks (like NOODL or Esethu). The long-term vision is to enable choice, so that a farmer, a teacher, or a local business can use AI in isiZulu, Hausa, or Kikuyu, not just in English or French.

If we succeed, built-in AI in African languages won’t just be catching up. It will be setting new standards for inclusive, responsible AI worldwide.

Provided by
The Conversation

This article is republished from The Conversation under a Creative Commons license. Read the original article.

Citation:
African languages for AI: The project that’s gathering a huge new dataset (2025, October 19)
retrieved 19 October 2025
from https://techxplore.com/news/2025-10-african-languages-ai-huge-dataset.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.

Tech

Anthropic Sues Department of Defense Over Supply-Chain Risk Designation

Published

March 9, 2026

cineplex360

Anthropic filed a federal lawsuit against the US Department of Defense and other federal agencies on Monday, challenging its designation of the AI company as a “supply-chain risk.”

The Pentagon formally sanctioned Anthropic last week, capping a weeks-long, publicly aired disagreement over limits on use of its generative AI technology for military applications such as autonomous weapons.

“We do not believe this action is legally sound, and we see no choice but to challenge it in court,” Anthropic CEO Dario Amodei wrote in a blog post on Thursday.

The lawsuit, which was filed in a federal court in California, requested that a judge reverse the designation and stop federal agencies from enforcing it. “The Constitution does not allow the government to wield its enormous power to punish a company for its protected speech,” Anthropic said in the filing. “Anthropic turns to the judiciary as a last resort to vindicate its rights and halt the Executive’s unlawful campaign of retaliation.”

The AI startup, which develops a suite of AI models called Claude, is facing the possibility of losing hundreds of millions of dollars in annual revenue from the Pentagon and the rest of the US government. It also may lose the business of software companies that incorporate Claude into services they sell to federal agencies. Several Anthropic customers have reportedly said they are pursuing alternatives due to the Defense Department’s risk designation.

Amodei wrote that the “vast majority” of Anthropic’s customers will not have to make changes. The US government’s designation “plainly applies only to the use of Claude by customers as a direct part of contracts with the” military, he said. General use of Anthropic technologies by military contractors should be unaffected.

The Department of Defense, which also goes by the Department of War, and the White House did not immediately respond to requests for comment about Anthropic’s lawsuit.

Attorneys with expertise in government contracting say Anthropic faces a difficult battle in court. The rules that authorize the Department of Defense to label a tech company as a supply-chain risk don’t allow for much in the way of an appeal. “It’s 100 percent in the government’s prerogative to set the parameters of a contract,” says Brett Johnson, a partner at the law firm Snell & Wilmer. The Pentagon, he says, also has the right to express that a product of concern, if used by any of its suppliers, “hurts the government’s ability to effectuate its mission.”

Anthropic’s best chance of success in court could be proving it was singled out, Johnson says. Soon after Defense Secretary Pete Hegseth announced that he was designating Anthropic a supply-chain risk, rival OpenAI announced it had struck a new contract with the Pentagon. That could be instrumental to Anthropic’s legal argument if the company can demonstrate it was seeking similar terms as the ChatGPT developer.

OpenAI said its deal included contractual and technical means of assuring its technology would not be used for mass domestic surveillance or to direct autonomous weapons systems. It added that it opposed the action against Anthropic and did not know why its rival could not reach the same deal with the government.

Military Priority

Hegseth has prioritized military adoption of AI technologies, with posters recently seen in the Pentagon that show him pointing and read, “I want you to use AI.” The dispute with Anthropic kicked up in January after Hegseth ordered several AI suppliers to agree that the department was free to use their technologies for any lawful purpose.

Anthropic, which is the only company currently providing AI chatbot and analysis tools for the military’s most sensitive use cases, pushed back. It contends that its technologies are not yet capable enough to be used for mass domestic surveillance of Americans or fully autonomous weapons. Hegseth has said Anthropic wants veto power over judgments that should be left to the Defense Department.

Tech

Interview: Nick Pearson, CIO, Ricoh Europe | Computer Weekly

Published

March 9, 2026

cineplex360

Nick Pearson, CIO at Ricoh Europe, describes his job as energising. The company’s shift in business model presents challenges and significant opportunities for him to draw on previous experience at other blue-chip firms.

Pearson joined the printer and tech services supplier in December 2023. He was previously head of IT platforms at Vodafone and held senior tech roles at RS Group and PepsiCo, where he spent almost a decade and was latterly UK IT director. In his role as CIO at Ricoh, Pearson is a member of the European executive board engaged in business transformation.

“We’re going through a seismic change that pivots Ricoh from a device and manufacturing company – an asset-based firm – into a services organisation,” he says.

“That shift on all levels – the way people think, the way we build systems, the way we sell, the way we operate, and the way we deliver – is energising. And that change is occurring against the backdrop of everything happening in the wider technology space.”
