Tech

African languages for AI: The project that’s gathering a huge new dataset



Artificial intelligence (AI) tools like ChatGPT, DeepSeek, Siri or Google Assistant are developed by the global north and trained in English, Chinese or European languages. In comparison, African languages are largely missing from the internet.

A team of African computer scientists, linguists, language specialists and others have been working on precisely this problem for two years already. The African Next Voices project recently released what’s thought to be the largest dataset of African languages for AI so far. We asked them about their project, with sites in Kenya, Nigeria and South Africa.

Why is language so important to AI?

Language is how we interact, ask for help, and hold meaning in community. We use it to organize complex thoughts and share ideas. It’s the medium we use to tell an AI what we want—and to judge whether it understood us.

We are seeing an upsurge of applications that rely on AI, from education to health to agriculture. These models are trained on large volumes of (mostly) language data. They are called large language models, or LLMs, but they exist in only a few of the world’s languages.

Languages also carry culture, values and local wisdom. If AI doesn’t speak our languages, it can’t reliably understand our intent, and we can’t trust or verify its answers. In short: without language, AI can’t communicate with us—and we can’t communicate with it. Building AI in our languages is therefore the only way for AI to work for people.

If we limit whose language gets modeled, we risk missing out on the majority of human cultures, history and knowledge.

Why are African languages missing and what are the consequences for AI?

The development of language is intertwined with the histories of people. Many of those who experienced colonialism and empire have seen their own languages being marginalized and not developed to the same extent as colonial languages. African languages are not as often recorded, including on the internet.

So there isn’t enough high-quality, digitized text and speech to train and evaluate robust AI models. That scarcity is the result of decades of policy choices that privilege colonial languages in schools, media and government.

Language data is just one of the things that’s missing. Do we have dictionaries, terminologies, glossaries? Basic tools are few and many other issues raise the cost of building datasets. These include African language keyboards, fonts, spell-checkers, tokenizers (which break text into smaller pieces so a language model can understand it), orthographic variation (differences in how words are spelled across regions), tone marking and rich dialect diversity.

The result is AI that performs poorly and sometimes unsafely: mistranslations, poor transcription, and systems that barely understand African languages.

In practice this denies many Africans access—in their own languages—to global news, educational materials, health care information, and the productivity gains AI can deliver.

When a language isn’t in the data, its speakers aren’t in the product, and AI cannot be safe, useful or fair for them. They end up missing the necessary language technology tools that could support service delivery. This marginalizes millions of people and increases the technology divide.

What is your project doing about it—and how?

Our main objective is to collect speech data for automatic speech recognition (ASR). ASR is an important tool for languages that are largely spoken. This technology converts spoken language into written text.

The bigger ambition of our project is to explore how data for ASR is collected and how much of it is needed to create ASR tools. We aim to share our experiences across different geographic regions.

The data we collect is diverse by design: spontaneous and read speech; in various domains—everyday conversations, health care, financial inclusion and agriculture. We are collecting data from people of diverse ages, gender and educational backgrounds.

Every recording is collected with informed consent, fair compensation and clear data-rights terms. We transcribe with language-specific guidelines and a large range of other technical checks.

In Kenya, through Maseno Centre for Applied AI, we are collecting voice data for five languages. We’re capturing the three main language groups: Nilotic (Dholuo, Maasai and Kalenjin), Cushitic (Somali) and Bantu (Kikuyu).

Through Data Science Nigeria, we are collecting speech in five widely spoken languages—Bambara, Hausa, Igbo, Nigerian Pidgin and Yoruba. The dataset aims to accurately reflect authentic language use within these communities.

In South Africa, working through the Data Science for Social Impact lab and its collaborators, we have been recording seven South African languages. The aim is to reflect the country’s rich linguistic diversity: isiZulu, isiXhosa, Sesotho, Sepedi, Setswana, isiNdebele and Tshivenda.

Importantly, this work does not happen in isolation. We are building on the momentum and ideas from the Masakhane Research Foundation network, Lelapa AI, Mozilla Common Voice, EqualyzAI, and many other organizations and individuals who have been pioneering African language models, data and tooling.

Each project strengthens the others, and together they form a growing ecosystem committed to making African languages visible and usable in the age of AI.

How can this be put to use?

The data and models will be useful for captioning local-language media, for voice assistants in agriculture and health, and for call-center and support services in these languages. The data will also be archived for cultural preservation.

Larger, balanced, publicly available African language datasets will allow us to connect text and speech resources. Models will not just be experimental, but useful in chatbots, education tools and local service delivery. The opportunity is there to go beyond datasets into ecosystems of tools (spell-checkers, dictionaries, translation systems, summarization engines) that make African languages a living presence in digital spaces.

In short, we are pairing ethically collected, high-quality speech at scale with models. The aim is for people to be able to speak naturally, be understood accurately, and access AI in the languages they live their lives in.

What happens next for the project?

This project only collected voice data for certain languages. What of the remaining languages? What of other tools like machine translation or grammar checkers?

We will continue to work on multiple languages, ensuring that we build data and models that reflect how Africans use their languages. We prioritize building smaller language models that are both energy efficient and accurate for the African context.

The challenge now is integration: making these pieces work together so that African languages are not just represented in isolated demos, but in real-world platforms.

One of the lessons from this project, and others like it, is that collecting data is only step one. What matters is making sure that the data is benchmarked, reusable, and linked to communities of practice. For us, the “next” is to ensure that the ASR benchmarks we build can connect with other ongoing African efforts.

We also need to ensure sustainability: that students, researchers, and innovators have continued access to compute (computer resources and processing power), training materials and licensing frameworks (like NOODL or Esethu). The long-term vision is to enable choice: so that a farmer, a teacher, or a local business can use AI in isiZulu, Hausa, or Kikuyu, not just in English or French.

If we succeed, built-in AI in African languages won’t just be catching up. It will be setting new standards for inclusive, responsible AI worldwide.

Provided by
The Conversation


This article is republished from The Conversation under a Creative Commons license. Read the original article.

Citation:
African languages for AI: The project that’s gathering a huge new dataset (2025, October 19)
retrieved 19 October 2025
from https://techxplore.com/news/2025-10-african-languages-ai-huge-dataset.html







Security News This Week: Oh Crap, Kohler’s Toilet Cameras Aren’t Really End-to-End Encrypted


An AI image creator startup left its database unsecured, exposing more than a million images and videos its users had created—the “overwhelming majority” of which depicted nudes and even nude images of children. A US inspector general report released its official determination that Defense Secretary Pete Hegseth put military personnel at risk through his negligence in the SignalGate scandal, but recommended only a compliance review and consideration of new regulations. Cloudflare’s CEO Matthew Prince told WIRED onstage at our Big Interview event in San Francisco this week that his company has blocked more than 400 billion AI bot requests for its customers since July 1.

A new New York law will require retailers to disclose if personal data collected about you results in algorithmic changes to their prices. And we profiled a new cellular carrier aiming to offer the closest thing possible to truly anonymous phone service—and its founder, Nicholas Merrill, who famously spent a decade-plus in court fighting an FBI surveillance order targeted at one of the customers of his internet service provider.

Putting a camera-enabled digital device in your toilet that uploads an analysis of your actual bodily waste to a corporation represents such a laughably bad idea that, 11 years ago, it was the subject of a parody infomercial. In 2025, it’s an actual product—and one whose privacy problems, despite the marketing copy of the company behind it, have turned out to be exactly as bad as any normal human might have imagined.

Security researcher Simon Fondrie-Teitler this week published a blog post revealing that the Dekoda, a camera-packing smart device sold by Kohler, does not in fact use “end-to-end encryption” as it claimed. That term typically means that data is encrypted so that only user devices on either “end” of a conversation can decrypt the information therein, not the server that sits in between them and hosts that encrypted communication. But Fondrie-Teitler found that the Dekoda only encrypts its data from the device to the server. In other words, according to the company’s definition of end-to-end encryption, one end is essentially (forgive us) your rear end, and the other is Kohler’s backend, where the images of its output are “decrypted and processed to provide our service,” as the company wrote in a statement to Fondrie-Teitler.

In response to his post pointing out that this is generally not what end-to-end encryption means, Kohler has removed all instances of that term from its descriptions of the Dekoda.
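
For readers who want the distinction spelled out, here is a toy Python sketch of the two models described above. It is only an illustration of who holds the decryption key, not a description of Kohler's actual system: the cipher is a deliberately simplistic XOR scheme that should never be used for real data, and every name in it (`toy_encrypt`, `server_keys`, and so on) is hypothetical.

```python
import hashlib
import secrets


def keystream(key: bytes, length: int) -> bytes:
    """Derive a toy keystream by hashing key + counter (illustration only, not secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """XOR the data with the keystream; applying it again with the same key decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


toy_decrypt = toy_encrypt  # XOR is its own inverse

message = b"sensor image bytes"

# Case 1: device-to-server encryption.
# The server holds the key, so it can decrypt whatever the device uploads.
transport_key = secrets.token_bytes(32)
server_keys = {transport_key}
ciphertext = toy_encrypt(transport_key, message)
server_can_read_transport = any(
    toy_decrypt(k, ciphertext) == message for k in server_keys
)

# Case 2: end-to-end encryption.
# Only the user's devices share the key; the server merely stores ciphertext.
e2e_key = secrets.token_bytes(32)
server_keys_e2e = set()  # the server never receives e2e_key
e2e_ciphertext = toy_encrypt(e2e_key, message)
server_can_read_e2e = any(
    toy_decrypt(k, e2e_ciphertext) == message for k in server_keys_e2e
)
recipient_can_read = toy_decrypt(e2e_key, e2e_ciphertext) == message

print(server_can_read_transport, server_can_read_e2e, recipient_can_read)
```

In the first case the server can recover the message; in the second, only a device holding the shared key can. That gap is the one Fondrie-Teitler's post points to.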

The cyberespionage campaign known as Salt Typhoon represents one of the biggest counterintelligence debacles in modern US history. State-sponsored Chinese hackers infiltrated virtually every US telecom and gained access to the real-time calls and texts of Americans—including then presidential and vice-presidential candidates Donald Trump and J.D. Vance. But according to the Financial Times, the US government has declined to impose sanctions on China in response to that hacking spree amid the White House’s effort to reach a trade deal with China’s government. That decision has led to criticism that the administration is backing off key national security initiatives in an effort to accommodate Trump’s economic goals. But it’s worth noting that imposing sanctions in response to espionage has always been a controversial move, given that the United States no doubt carries out plenty of espionage-oriented hacking of its own across the world.

As 2025 draws to a close, the nation’s leading cyberdefense agency, the Cybersecurity and Infrastructure Security Agency (CISA), still has no director. And the nominee to fill that position, once considered a shoo-in, now faces congressional hurdles that may have permanently tanked his chances to run the agency. Sean Plankey’s name was excluded from a Senate vote Thursday on a panel of appointments, suggesting his nomination may be “over,” according to CyberScoop. Plankey’s nomination had faced opposition from senators on both sides of the aisle, with a broad mix of demands: Florida’s Republican senator Rick Scott had placed a hold on his nomination due to the Department of Homeland Security (DHS) terminating a Coast Guard contract with a company in his state, while North Carolina’s GOP senators opposed any new DHS nominees until disaster relief funding was allocated to their state. Democratic senator Ron Wyden, meanwhile, has demanded that CISA publish a long-awaited report on telecom security, which has yet to be released, before Plankey’s appointment.

The Chinese hacking campaign centered around the malware known as “Brickstorm” first came to light in September, when Google warned that the stealthy spy tool has been infecting dozens of victim organizations since 2022. Now CISA, the National Security Agency, and the Canadian Centre for Cyber Security have jointly added to Google’s warnings this week in an advisory about how to spot the malware. They also cautioned that the hackers behind it appear to be positioned not only for espionage targeting US infrastructure but also for potentially disruptive cyberattacks. Most disturbing, perhaps, is a particular data point from Google, measuring the average time until the Brickstorm breaches have been discovered in a victim’s network: 393 days.




Top Vimeo Promo Codes and Discounts This Month in 2025


Remember Vimeo? You probably don’t use it to browse videos the way you might with some other services. But if you landed on this page, there’s a good chance you use it to host your professional portfolio. Or assets for your business. Or your short films. Vimeo has tools other video hosting services simply don’t have, like AI editing tools, on-demand content selling, customizable embeds, and collaborative editing features. And best of all: There are no ads. WIRED has rotating Vimeo promo codes to help you save.

Get 10% Off Annual Plans With This Vimeo Promo Code

No matter what you need for your business or career, when it comes to video, Vimeo’s got multiple plans to suit. And luckily, right now, you can save with a Vimeo promo code—even on the annual plans, which already include 40% in savings. Just use Vimeo coupon code GETVIMEO10 to save 10% on your membership plan.

The Easiest Way to Save 40% on Your Vimeo Plan

Vimeo has a few different membership plans that you can save on. No matter which you go with, the easiest way to save a lot is with an annual membership, which has automatic 40% savings compared to paying monthly. And yes, you can even stack promo codes with the annual billing options.

More on Vimeo Pricing and Membership Plans

So what tier do you need? The Starter plan starts at $12 per month (billed annually) or $20 per month (billed monthly). It comes with 100 gigabytes of storage, plus boosted privacy controls, custom video players, custom URLs, and automatic closed captioning.

Boost your plan to Standard for $25 per month (billed annually) or $41 per month (billed monthly) to upgrade to 2 terabytes of storage, 5 “seats” (which are collaborative team member spots), a brand kit, a teleprompter, text-based video editing, AI script generation, and engagement and social analytics.

Finally, there’s the Advanced plan, which costs $75 per month (billed annually) or $125 per month (billed monthly). You’ll get 10 “seats”, 7 terabytes of storage, AI-generated chapters and text summaries, live chat and poll options, plus streaming and live broadcast capabilities.
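
To see how the numbers above combine, here is a rough Python sketch using the plan prices quoted in this article. It assumes the 10% promo code is applied to the annual-billed monthly price; the article says codes can be stacked with annual billing but does not spell out the exact rule, so treat `price_with_promo` as a hypothetical calculation rather than Vimeo's actual checkout logic.

```python
# Plan prices in US dollars per month, taken from the article text above.
PLANS = {
    "Starter": {"annual": 12, "monthly": 20},
    "Standard": {"annual": 25, "monthly": 41},
    "Advanced": {"annual": 75, "monthly": 125},
}

# GETVIMEO10; assumed (not confirmed) to apply to the annual-billed price.
PROMO_DISCOUNT = 0.10


def annual_savings_pct(plan: dict) -> float:
    """Percentage saved per month by choosing annual over monthly billing."""
    return round((1 - plan["annual"] / plan["monthly"]) * 100, 1)


def price_with_promo(plan: dict, discount: float = PROMO_DISCOUNT) -> float:
    """Annual-billed monthly price after the promo code (hypothetical stacking rule)."""
    return round(plan["annual"] * (1 - discount), 2)


for name, plan in PLANS.items():
    print(f"{name}: {annual_savings_pct(plan)}% annual savings, "
          f"${price_with_promo(plan):.2f}/month with code")
```

Under these assumptions, Starter works out to $10.80 per month and Advanced to $67.50 per month with the code. The Standard plan's annual saving comes to about 39%, slightly under the headline 40%.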

Use a Vimeo Coupon Code to Get Savings on Vimeo on Demand

Vimeo on Demand is a new way to stream and download movies online. Through Vimeo on Demand, you can rent, buy and subscribe to the best original films, documentaries and series directly from your favorite small business video creators, including The Talent and Wild Magic.

Vimeo Enterprise Solutions 2025

You may not have heard of Vimeo Enterprise, but it’s probably the most essential program for content creators, videographers, and digital media in the workplace in general. From meeting recordings and AI-driven video creation to compliance and distribution, Vimeo Enterprise helps centralize and manage video workflows.

Does Vimeo Have a Free Trial?

While Vimeo doesn’t have a free trial of its paid plans, it does have a free plan with some basic features. Additionally, paid plans can be canceled for a full refund within 14 days for an annual subscription, or within 3 days for a monthly subscription.




WIRED Roundup: DOGE Isn’t Dead, Facebook Dating Is Real, and Amazon’s AI Ambitions


Leah Feiger: So it’s a really good question actually, and it’s one that I’ve thought about for quite some time. I think if it’s not annoying, I want to read this quote from Scott Kupor, the director of OPM and the former managing partner at Andreessen Horowitz, to be clear, just to remind everyone where people are coming from in this current administration. He posted this on X late last month, and this was part of Reuters’ reporting. So he posts, “The truth is, DOGE may not have centralized leadership under USDS anymore, but the principles of DOGE remain alive and well, deregulation, eliminating fraud, waste and abuse, reshaping the federal workforce, et cetera, et cetera, et cetera.” Which is the exact same, the thing that they’ve been saying this entire time, but it’s all smoke and mirrors, right? It’s like, oh no, no, well, DOGE doesn’t exactly exist anymore. There’s no Elon Musk character leading it, which Elon Musk himself said on the podcast with Joe Rogan last month as well. He’s like, “Yeah, once I left, they weren’t able to pick on anyone, but don’t worry, DOGE is still there.” So it feels wild to watch people fall for this and go like, “DOGE is gone now.” And I’m like, they’re literally telling us that it’s not.

Zoë Schiffer: I think one thing that does feel honestly true is that it is harder and harder to differentiate where DOGE stops and the Trump administration begins because they have infiltrated so many different parts of government and the DOGE ethos, what you’re talking about, deregulation, cost cuttings, zero-based budgeting, those have really become kind of table stakes for the admin, right?

Leah Feiger: I think that’s such a good point. And honestly, by the end of Elon Musk’s reign, something that kept coming up wasn’t necessarily that the Trump administration didn’t agree with DOGE’s ethos at all. It was that they didn’t really agree with how Musk was going about it. They didn’t like that he was stepping on Treasury Secretary Scott Bessent and having fights outside of the Oval Office. That was bad optics and that also wasn’t helping the Trump administration even look like they were on top of it.


