Tech
African languages for AI: The project that’s gathering a huge new dataset
Artificial intelligence (AI) tools like ChatGPT, DeepSeek, Siri or Google Assistant are developed by the global north and trained in English, Chinese or European languages. In comparison, African languages are largely missing from the internet.
A team of African computer scientists, linguists, language specialists and others has been working on precisely this problem for two years already. The African Next Voices project recently released what’s thought to be the largest dataset of African languages for AI so far. We asked them about their project, with sites in Kenya, Nigeria and South Africa.
Why is language so important to AI?
Language is how we interact, ask for help, and hold meaning in community. We use it to organize complex thoughts and share ideas. It’s the medium we use to tell an AI what we want—and to judge whether it understood us.
We are seeing an upsurge of applications that rely on AI, from education to health to agriculture. The models behind them are trained on large volumes of (mostly) linguistic data. They are called large language models, or LLMs, but they exist for only a few of the world’s languages.
Languages also carry culture, values and local wisdom. If AI doesn’t speak our languages, it can’t reliably understand our intent, and we can’t trust or verify its answers. In short: without language, AI can’t communicate with us—and we can’t communicate with it. Building AI in our languages is therefore the only way for AI to work for people.
If we limit whose language gets modeled, we risk missing out on the majority of human cultures, history and knowledge.
Why are African languages missing and what are the consequences for AI?
The development of language is intertwined with the histories of people. Many of those who experienced colonialism and empire have seen their own languages being marginalized and not developed to the same extent as colonial languages. African languages are not as often recorded, including on the internet.
So there isn’t enough high-quality, digitized text and speech to train and evaluate robust AI models. That scarcity is the result of decades of policy choices that privilege colonial languages in schools, media and government.
Language data is just one of the things that’s missing. Do we have dictionaries, terminologies, glossaries? Basic tools are few and many other issues raise the cost of building datasets. These include African language keyboards, fonts, spell-checkers, tokenizers (which break text into smaller pieces so a language model can understand it), orthographic variation (differences in how words are spelled across regions), tone marking and rich dialect diversity.
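To make the tokenizer and tone-marking point concrete, here is a minimal sketch in Python. It is an illustration only, not the project's tooling: `char_tokenize` is a hypothetical toy function that splits text into single Unicode code points, and the sample string is an assumed way of typing an accented vowel. It shows that the same visible letter can yield a different number of pieces depending on how the tone mark is encoded.

```python
import unicodedata


def char_tokenize(text: str, form: str = "NFC") -> list[str]:
    """Toy character-level tokenizer (hypothetical): normalize, then split into code points."""
    return list(unicodedata.normalize(form, text))


# 'e' followed by a combining acute accent: one common way a tone-marked vowel gets typed.
word = "e\u0301"

composed = char_tokenize(word, "NFC")    # mark merged into a single precomposed character
decomposed = char_tokenize(word, "NFD")  # base letter and tone mark kept as separate pieces

print(len(composed), len(decomposed))
```

If text is not normalized consistently before tokenization, a model may treat identical words as different sequences, which is one reason spelling and tone-mark variation raises the cost of building datasets.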
The result is AI that performs poorly and sometimes unsafely: mistranslations, poor transcription, and systems that barely understand African languages.
In practice this denies many Africans access—in their own languages—to global news, educational materials, health care information, and the productivity gains AI can deliver.
When a language isn’t in the data, its speakers aren’t in the product, and AI cannot be safe, useful or fair for them. They end up missing the necessary language technology tools that could support service delivery. This marginalizes millions of people and increases the technology divide.
What is your project doing about it—and how?
Our main objective is to collect speech data for automatic speech recognition (ASR). ASR is an important tool for languages that are largely spoken. This technology converts spoken language into written text.
The bigger ambition of our project is to explore how data for ASR is collected and how much of it is needed to create ASR tools. We aim to share our experiences across different geographic regions.
The data we collect is diverse by design. It covers spontaneous and read speech in various domains: everyday conversations, health care, financial inclusion and agriculture. We are collecting data from people of diverse ages, genders and educational backgrounds.
Every recording is collected with informed consent, fair compensation and clear data-rights terms. We transcribe with language-specific guidelines and apply a wide range of other technical checks.
In Kenya, through Maseno Centre for Applied AI, we are collecting voice data for five languages. We’re capturing the three main language groups: Nilotic (Dholuo, Maasai and Kalenjin), Cushitic (Somali) and Bantu (Kikuyu).
Through Data Science Nigeria, we are collecting speech in five widely spoken languages—Bambara, Hausa, Igbo, Nigerian Pidgin and Yoruba. The dataset aims to accurately reflect authentic language use within these communities.
In South Africa, working through the Data Science for Social Impact lab and its collaborators, we have been recording seven South African languages. The aim is to reflect the country’s rich linguistic diversity: isiZulu, isiXhosa, Sesotho, Sepedi, Setswana, isiNdebele and Tshivenda.
Importantly, this work does not happen in isolation. We are building on the momentum and ideas from the Masakhane Research Foundation network, Lelapa AI, Mozilla Common Voice, EqualyzAI, and many other organizations and individuals who have been pioneering African language models, data and tooling.
Each project strengthens the others, and together they form a growing ecosystem committed to making African languages visible and usable in the age of AI.
How can this be put to use?
The data and models will be useful for captioning local-language media, for voice assistants in agriculture and health, and for call-center and support services in these languages. The data will also be archived for cultural preservation.
Larger, balanced, publicly available African language datasets will allow us to connect text and speech resources. Models will not just be experimental, but useful in chatbots, education tools and local service delivery. The opportunity is there to go beyond datasets into ecosystems of tools (spell-checkers, dictionaries, translation systems, summarization engines) that make African languages a living presence in digital spaces.
In short, we are pairing ethically collected, high-quality speech at scale with models. The aim is for people to be able to speak naturally, be understood accurately, and access AI in the languages they live their lives in.
What happens next for the project?
This project only collected voice data for certain languages. What of the remaining languages? What of other tools like machine translation or grammar checkers?
We will continue to work on multiple languages, ensuring that we build data and models that reflect how Africans use their languages. We prioritize building smaller language models that are both energy efficient and accurate for the African context.
The challenge now is integration: making these pieces work together so that African languages are not just represented in isolated demos, but in real-world platforms.
One of the lessons from this project, and others like it, is that collecting data is only step one. What matters is making sure that the data is benchmarked, reusable, and linked to communities of practice. For us, the “next” is to ensure that the ASR benchmarks we build can connect with other ongoing African efforts.
We also need to ensure sustainability: that students, researchers, and innovators have continued access to compute (computer resources and processing power), training materials and licensing frameworks (like NOODL or Esethu). The long-term vision is to enable choice: so that a farmer, a teacher, or a local business can use AI in isiZulu, Hausa, or Kikuyu, not just in English or French.
If we succeed, built-in AI in African languages won’t just be catching up. It will be setting new standards for inclusive, responsible AI worldwide.
This article is republished from The Conversation under a Creative Commons license.
Citation:
African languages for AI: The project that’s gathering a huge new dataset (2025, October 19)
retrieved 19 October 2025
from https://techxplore.com/news/2025-10-african-languages-ai-huge-dataset.html
Tech
Faithful Companions: The Best Printers We’ve Tried
Before anything else, you’ll have to decide between ink and laser. I’ll get into the details when it comes to each model, but the most important consideration is paper type, because it’s a limitation rather than a benefit. Laser printers use heat in the bonding process, which means if you regularly print on windowed envelopes or photo paper, you’ll need to either use an ink printer or change to a thermally safe alternative, which can be cost prohibitive if you print a lot.
Inkjets are the most common flavor of home printer, and they work like you might expect, by boiling ink until it splatters through a series of tiny holes. You didn’t expect that? Me neither! Pretty exciting stuff.
Inkjet printers come in two flavors, with either prefilled cartridges or built-in tanks. The latter is quickly becoming more popular thanks to better pricing, more convenience, and a massive reduction in wasted plastic. If you’re buying a new printer in 2025 you should opt for an ink tank, if not a laser printer. They’re a little more work to set up and maintain, since you have to keep the tanks topped off, and they should remain in one place on a flat surface to avoid leaks. I can’t imagine many situations where a printer would be constantly moving and tilting, but it’s a consideration.
You thought inkjets were cool? Laser printers work by blasting a tube full of dried plastic particles, then fusing them to the paper with heat. They tend to cost more upfront, but the cost per page is overall much lower. Where a $20 ink cartridge might print 200 pages, a $60 toner cartridge could print 2,000. They tend to be a lot faster than inkjet printers, and you don’t have to worry about them drying out. Plus, the pages come out of the printer nice and warm, and you can’t really put a price on that.
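For anyone who wants to run the numbers on their own cartridges, the cost-per-page comparison can be sketched in a few lines of Python. The prices and page yields below are the illustrative figures from the paragraph above, not measured values, and `cost_per_page` is a hypothetical helper that ignores paper, electricity, and the upfront price of the printer.

```python
def cost_per_page(cartridge_price: float, page_yield: int) -> float:
    """Return the consumable cost of one page, in the same currency as the price."""
    return cartridge_price / page_yield


# Illustrative figures from the text: a $20 ink cartridge rated for 200 pages
# versus a $60 toner cartridge rated for 2,000 pages.
ink = cost_per_page(20, 200)
toner = cost_per_page(60, 2000)

print(f"ink: ${ink:.2f} per page, toner: ${toner:.2f} per page")
```

With these example figures, ink works out to about 10 cents a page and toner to about 3 cents. Swap in the price and rated yield printed on your own cartridge box to see where you land.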
There are also thermal printers, which are commonly used for receipts or shipping labels. Instead of filling the printer with ink and depositing it onto a surface, they apply heat in precise patterns to special paper, allowing you to print text and images in low resolution, and typically in one color. If you print shipping labels or simple stickers at home, these can save you a lot of time and ink cost, but they have more limitations.
Laser printers are my preferred type, as long as your paper type and budget can support them, but most home users will be happy with an ink tank printer.
Tech
Gravel Running Shoes Are the Best Suitcase Shoe
“In general, we are noticing many of these shoes have more of a road running influence than they do trail,” says Bodin. “So, there will be a mix of foams, midsole geometries, less attention to fit, and a more subtle outsole pattern compared to trail shoes.”
What Are the Benefits of Gravel Shoes?
In a word: versatility. You can lace up a gravel shoe at home with confidence that they’ll handle whatever lies ahead, provided you’re not hitting a really technical trail or ankle-deep mud.
“Many of the shoes in this category can run well on roads, gravel paths, and light trails,” says Bodin. “That’s not something that very many strictly road shoes or dedicated trail shoes can do.”
The more rockered midsoles aim to smooth your heel-to-toe transitions, cutting calf muscle fatigue over uneven ground and on longer runs. They’re also often lighter than technical trail shoes, thanks to the smaller lugs, less pronounced rock plates, and lower levels of upper reinforcement. That serves up more agility than heftier trail shoes, so you can move faster and lighter over runnable ground.
Do Gravel Shoes Feel Different From “Regular” Trail Shoes?
“Yes and no,” says Bodin. A lot depends on the brand. Some companies, like Craft, have many gravel-specific options. Others, like Salomon and Hoka, use their redesigned road running shoes for their gravel category.
Gravel shoes also have limits, warns Bodin. “In my experience, most gravel shoes will be limited when they reach a moderately technical trail-running scenario. Again, because the bulk of the gravel shoe experience is focused on the overall ride on smoother terrain, performance declines when there are more turns or more challenging terrain with rocks and roots.”
Do You Really Need a Gravel Shoe?
Like everything in the running shoe world, that depends. There are trail shoes out there with the chops to conquer everything from technical to more runnable terrain, like the Hoka Speedgoat 6 ($125). Some of the pricier trail shoes like the North Face Vectiv Pro 3 ($250) pair modified versions of their springy road-shoe foams with carbon plates to deliver bouncier rides that don’t feel out of place on the road. I’ve tested loads of these shoes, and some top-tier trail shoes run better on the road than cheaper road shoes.
However, if you regularly tackle firmer, less technical mixed terrain on your runs, generally in drier conditions—and rarely venture onto more technical trails—there’s a good case for investing in a gravel shoe. It’ll carry you happily from road to trail and back again, and even cover your road runs on the way to the trail.
Likewise, if you’re a newcomer to trail running, a gravel shoe could be a good halfway house as you transition from the asphalt to the single track, thanks to a ride which retains some road-shoe familiarity. They’re also an excellent suitcase shoe—if you’re traveling and you can only fit one shoe in your luggage, the versatility of a gravel shoe makes it a great choice.
Tech
This AI Model Can Intuit How the Physical World Works
The original version of this story appeared in Quanta Magazine.
Here’s a test for infants: Show them a glass of water on a desk. Hide it behind a wooden board. Now move the board toward the glass. If the board keeps going past the glass, as if it weren’t there, are they surprised? Many 6-month-olds are, and by a year, almost all children have an intuitive notion of an object’s permanence, learned through observation. Now some artificial intelligence models do too.
Researchers have developed an AI system that learns about the world via videos and demonstrates a notion of “surprise” when presented with information that goes against the knowledge it has gleaned.
The model, created by Meta and called Video Joint Embedding Predictive Architecture (V-JEPA), does not make any assumptions about the physics of the world contained in the videos. Nonetheless, it can begin to make sense of how the world works.
“Their claims are, a priori, very plausible, and the results are super interesting,” says Micha Heilbron, a cognitive scientist at the University of Amsterdam who studies how brains and artificial systems make sense of the world.
Higher Abstractions
As the engineers who build self-driving cars know, it can be hard to get an AI system to reliably make sense of what it sees. Most systems designed to “understand” videos in order to either classify their content (“a person playing tennis,” for example) or identify the contours of an object—say, a car up ahead—work in what’s called “pixel space.” The model essentially treats every pixel in a video as equal in importance.
But these pixel-space models come with limitations. Imagine trying to make sense of a suburban street. If the scene has cars, traffic lights and trees, the model might focus too much on irrelevant details such as the motion of the leaves. It might miss the color of the traffic light, or the positions of nearby cars. “When you go to images or video, you don’t want to work in [pixel] space because there are too many details you don’t want to model,” said Randall Balestriero, a computer scientist at Brown University.