African languages for AI: The project that’s gathering a huge new dataset
Artificial intelligence (AI) tools like ChatGPT, DeepSeek, Siri and Google Assistant are developed in the global north and trained mainly on English, Chinese and European languages. African languages, by comparison, are largely missing from the internet.
A team of African computer scientists, linguists, language specialists and others have been working on precisely this problem for two years already. The African Next Voices project recently released what’s thought to be the largest dataset of African languages for AI so far. We asked them about their project, with sites in Kenya, Nigeria and South Africa.
Why is language so important to AI?
Language is how we interact, ask for help, and hold meaning in community. We use it to organize complex thoughts and share ideas. It’s the medium we use to tell an AI what we want—and to judge whether it understood us.
We are seeing an upsurge of applications that rely on AI, from education to health to agriculture. These applications are built on models trained on large volumes of (mostly) linguistic data—so-called large language models, or LLMs—yet such models exist for only a few of the world’s languages.
Languages also carry culture, values and local wisdom. If AI doesn’t speak our languages, it can’t reliably understand our intent, and we can’t trust or verify its answers. In short: without language, AI can’t communicate with us—and we can’t communicate with it. Building AI in our languages is therefore the only way for AI to work for people.
If we limit whose language gets modeled, we risk missing out on the majority of human cultures, history and knowledge.
Why are African languages missing and what are the consequences for AI?
The development of language is intertwined with the histories of people. Many of those who experienced colonialism and empire have seen their own languages being marginalized and not developed to the same extent as colonial languages. African languages are not as often recorded, including on the internet.
So there isn’t enough high-quality, digitized text and speech to train and evaluate robust AI models. That scarcity is the result of decades of policy choices that privilege colonial languages in schools, media and government.
Language data is just one of the things that’s missing. Do we have dictionaries, terminologies, glossaries? Basic tools are scarce, and many other issues raise the cost of building datasets: African language keyboards, fonts, spell-checkers, tokenizers (which break text into smaller pieces so a language model can process it), orthographic variation (differences in how words are spelled across regions), tone marking and rich dialect diversity.
The result is AI that performs poorly and sometimes unsafely: mistranslations, poor transcription, and systems that barely understand African languages.
In practice this denies many Africans access—in their own languages—to global news, educational materials, health care information, and the productivity gains AI can deliver.
When a language isn’t in the data, its speakers aren’t in the product, and AI cannot be safe, useful or fair for them. They end up missing the necessary language technology tools that could support service delivery. This marginalizes millions of people and increases the technology divide.
What is your project doing about it—and how?
Our main objective is to collect speech data for automatic speech recognition (ASR). This technology converts spoken language into written text, making it an especially important tool for languages that are used mainly in spoken rather than written form.
The bigger ambition of our project is to explore how data for ASR is collected and how much of it is needed to create ASR tools. We aim to share our experiences across different geographic regions.
The data we collect is diverse by design: spontaneous and read speech; in various domains—everyday conversations, health care, financial inclusion and agriculture. We are collecting data from people of diverse ages, gender and educational backgrounds.
Every recording is collected with informed consent, fair compensation and clear data-rights terms. We transcribe with language-specific guidelines and a large range of other technical checks.
In Kenya, through the Maseno Centre for Applied AI, we are collecting voice data for five languages across the country’s three main language groups: Nilotic (Dholuo, Maasai and Kalenjin), Cushitic (Somali) and Bantu (Kikuyu).
Through Data Science Nigeria, we are collecting speech in five widely spoken languages—Bambara, Hausa, Igbo, Nigerian Pidgin and Yoruba. The dataset aims to accurately reflect authentic language use within these communities.
In South Africa, working through the Data Science for Social Impact lab and its collaborators, we have been recording seven South African languages. The aim is to reflect the country’s rich linguistic diversity: isiZulu, isiXhosa, Sesotho, Sepedi, Setswana, isiNdebele and Tshivenda.
Importantly, this work does not happen in isolation. We are building on the momentum and ideas from the Masakhane Research Foundation network, Lelapa AI, Mozilla Common Voice, EqualyzAI, and many other organizations and individuals who have been pioneering African language models, data and tooling.
Each project strengthens the others, and together they form a growing ecosystem committed to making African languages visible and usable in the age of AI.
How can this be put to use?
The data and models will be useful for captioning local-language media, voice assistants for agriculture and health, and call-center and customer support in these languages. The data will also be archived for cultural preservation.
Larger, balanced, publicly available African language datasets will allow us to connect text and speech resources. Models will not just be experimental, but useful in chatbots, education tools and local service delivery. The opportunity is there to go beyond datasets into ecosystems of tools (spell-checkers, dictionaries, translation systems, summarization engines) that make African languages a living presence in digital spaces.
In short, we are pairing ethically collected, high-quality speech at scale with models. The aim is for people to be able to speak naturally, be understood accurately, and access AI in the languages they live their lives in.
What happens next for the project?
This project only collected voice data for certain languages. What of the remaining languages? What of other tools like machine translation or grammar checkers?
We will continue to work on multiple languages, ensuring that we build data and models that reflect how Africans use their languages. We prioritize building smaller language models that are both energy efficient and accurate for the African context.
The challenge now is integration: making these pieces work together so that African languages are not just represented in isolated demos, but in real-world platforms.
One of the lessons from this project, and others like it, is that collecting data is only step one. What matters is making sure that the data is benchmarked, reusable, and linked to communities of practice. For us, the “next” is to ensure that the ASR benchmarks we build can connect with other ongoing African efforts.
We also need to ensure sustainability: that students, researchers and innovators have continued access to compute (computing resources and processing power), training materials and licensing frameworks (like NOODL or Esethu). The long-term vision is to enable choice: so that a farmer, a teacher or a local business can use AI in isiZulu, Hausa or Kikuyu, not just in English or French.
If we succeed, AI built in African languages won’t just be catching up. It will be setting new standards for inclusive, responsible AI worldwide.
This article is republished from The Conversation under a Creative Commons license. Read the original article.
Citation:
African languages for AI: The project that’s gathering a huge new dataset (2025, October 19)
retrieved 19 October 2025
from https://techxplore.com/news/2025-10-african-languages-ai-huge-dataset.html
How a cloud-native architecture handles persistent storage | Computer Weekly
Cloud-native, or containerised, applications are now mainstream. As many as 82% of enterprises now have Kubernetes in production, according to the Cloud Native Computing Foundation (CNCF). That is up from 66% in 2023. And a full 98% of organisations have at least some cloud-native applications, the industry body says.
But moving applications to cloud-native environments does not just mean creating new code. It also means adapting infrastructure. Compute, networking and data storage all need to work with container environments. By no means can all systems do this out of the box, especially when it comes to on-premises hardware.
At the same time, enterprise IT architects need to consider the requirements of legacy applications and virtual machines (VMs) that are not being updated. And enterprises will want to make the most efficient use of their storage hardware, regardless of their application environments.
Moving to containers means adapting a technology that was not designed for persistent storage to handle business-critical data.
Stateless states
Containerised applications started out as stateless, or ephemeral. The designers never intended containers to hold persistent data. They expected that microservices or containerised applications would use no non-volatile storage and discard the contents of memory, and even their settings, once they had completed their tasks.
Instead, containerised applications rely on an external data store, usually a database or cache.
There are advantages to this approach. These include simpler deployment, easier scaling, fault tolerance and recovery, and application portability. But many business applications, if not the majority, need persistent data.
“Most business applications require storage. In reality, unless you’re converting Fahrenheit to Celsius and back, you’re storing something somewhere,” says Dan Ciruli, vice-president and general manager for cloud native at Nutanix.
And the need to work with persistent data is all the more important, as enterprises look to containers as an alternative to conventional virtual machines.
But this means rethinking the way applications work. And it requires IT architects to update their storage systems to support modernised, cloud-native applications. This can be directly, where array manufacturers support containers, or through a control plane such as Nutanix or Everpure’s Portworx.
Almost inevitably, changes are being driven by AI, as enterprises look to support its data-heavy workloads in modern, cloud-native environments. But there are other drivers, too, including a trend to move virtualised applications to containers and the need for cost controls.
“Kubernetes might be over a decade old, but it’s continuing to evolve as AI transforms the way we handle data. Already, Kubernetes has moved beyond the days when it was built only for ephemeral, stateless applications,” says Michael Cade, global field chief technology officer at Veeam Software.
“Today, stateful applications such as databases, machine learning pipelines and streaming systems are now being treated as first-class citizens [in containerised environments] and have been given the specialised tools they need to thrive.”
Storage connections
Connecting storage to Kubernetes, though, relies on support from both application developers and hardware suppliers.
The main way to connect storage to container environments is through the container storage interface (CSI). CSI needs to be supported directly by the storage provider, be that the hardware manufacturer, a cloud service, or a software-defined storage (SDS) supplier.
As the CNCF’s Kubernetes page notes: “CSI was developed as a standard for exposing arbitrary block and file storage systems to containerised workloads on container orchestration systems like Kubernetes.” CSI allows third-party storage providers to write, and deploy, plug-ins for storage without changing the core Kubernetes code.
SDS technologies, for their part, also use CSI drivers, but run on commodity hardware or hyper-converged infrastructure rather than dedicated storage arrays. SDS also includes open source options, such as OpenEBS, Longhorn and Ceph.
“Every environment needs a storage back end, with a CSI driver that connects it to Kubernetes. It’s up to the storage provider to provide the CSI driver,” says Nigel Poulton, an author and independent expert in Kubernetes and containers.
“Most CSI drivers create at least one StorageClass that maps to a tier of storage and its capabilities. For example, a CSI driver might create a StorageClass called ‘fast-replicated’ that maps to high-speed flash storage with automatic replication to a remote location. Any application using this class automatically gets that tier and set of capabilities,” he adds.
This level of abstraction is highly useful for application developers, as they no longer have to worry about the physical capabilities of the storage system. That is handled by the CSI drivers.
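As a rough illustration of the abstraction Poulton describes—the class name follows his ‘fast-replicated’ example, but the claim name and sizes here are hypothetical—an application declares only a PersistentVolumeClaim naming a StorageClass, and the CSI driver behind that class handles the physical details. A minimal Python sketch of such a manifest as a plain dictionary:

```python
# Sketch: how a Kubernetes PersistentVolumeClaim refers to a StorageClass.
# The app declares a class name and a size; the CSI driver behind the class
# maps that to concrete storage (tier, replication, and so on).

def build_pvc(name: str, storage_class: str, size: str) -> dict:
    """Return a PVC manifest as a plain dict (what you'd serialise to YAML)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            # The only storage detail the application has to know about:
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }

pvc = build_pvc("orders-db-data", "fast-replicated", "100Gi")
print(pvc["spec"]["storageClassName"])  # fast-replicated
```

Swapping storage suppliers then means pointing the same claim at a different class, with no change to application code.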
“The CSI drivers enable us to give access to storage from the containerised application, but [for firms to] still administer the storage the way they do the storage that’s running under their VMs,” says Nutanix’s Ciruli. “And that’s a big advantage.” He also sees customers installing Kubernetes on bare metal clusters.
This also maintains separation between the Kubernetes workloads and the underlying storage hardware. On paper at least, enterprises can move their containerised applications to a different platform or supplier, or new storage hardware, without rewriting code and with minimal disruption.
In practice, large-scale moves of Kubernetes applications between platforms are still relatively rare. Enterprises tend to develop applications to run on Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, or local hardware, depending on their business requirements.
Application portability, supported by CSI, is a useful insurance, even if there are enough differences between platforms to suggest caution.
“We really don’t need to become an expert in how EBS [Elastic Block Store] works versus Azure disk, or local SSD [solid-state drives] and how that works,” says Greg Muscarella, general manager for Portworx at Everpure. “If you have to manage those things, it becomes somewhat complex. Companies tend to focus on a single cloud environment.”
Few organisations, he suggests, have code where they could “push a button and move it to a different cloud”, not least because of differences between storage architectures from both hardware suppliers and cloud providers. However, enterprises are moving more applications to cloud-native environments. And this increasingly includes databases and applications that previously ran in conventional virtual machines.
New platforms
One of the most significant trends in application modernisation is to move both virtual machines and database-driven applications to containers. Cost, avoiding supplier lock-in and the need to consolidate on fewer platforms are all drivers.
“The line between ‘containerised’ and ‘virtualised’ is blurring,” suggests Veeam’s Cade. “For a long time, containers and VMs were seen as two separate siloes. But as stateful applications have developed, and since VMs are essentially a typical stateful workload, we’re seeing a significant rise in businesses running them directly within Kubernetes using platforms such as Red Hat OpenShift Virtualization.”
Poulton agrees. He sees more organisations moving virtualised workloads to containers, via tools such as KubeVirt. But, although organisations are porting over virtualised applications, and databases, IT architects need to be sure that all the application’s requirements are met by the storage layer.
“Databases have much more demanding requirements, including ordered startup, replication, automated failover and backup,” he cautions. “The two biggest changes are ensuring a CSI driver exists for the storage system and potentially deploying an operator.”
A Kubernetes operator provides details about a database’s specific requirements, and sometimes storage, too. Operator support is essential to allow databases to deliver enterprise workloads over Kubernetes. Again, the operator supports the modern application goal of separating the code from the storage array or cloud storage service.
Percona, for example, provides operators for MySQL, PostgreSQL and MongoDB, as well as Everest. “The operators are basically the game changers,” says Kate Obiidykhata, the company’s general manager for cloud native. “They encode the human DBA knowledge into the software, and you have all those most important resilience components, backup, failover, replication and upgrades automated.”
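The operator pattern Obiidykhata describes boils down to a reconcile loop: compare the desired state against what is observed and apply the encoded “DBA” action. A toy, schematic sketch of that idea—field names and actions here are invented for illustration; real operators work through the Kubernetes API and custom resources:

```python
# Toy reconcile loop illustrating the operator pattern: desired state in,
# observed state in, remedial actions out. Purely schematic.

def reconcile(desired: dict, observed: dict) -> list:
    actions = []
    if observed.get("replicas", 0) < desired["replicas"]:
        actions.append("scale-up")   # add replicas until the spec is met
    if not observed.get("primary_healthy", True):
        actions.append("failover")   # promote a replica, as a human DBA would
    if observed.get("last_backup_age_h", 0) > desired["backup_interval_h"]:
        actions.append("backup")     # trigger an overdue backup
    return actions

print(reconcile(
    {"replicas": 3, "backup_interval_h": 24},
    {"replicas": 2, "primary_healthy": False, "last_backup_age_h": 30},
))
# ['scale-up', 'failover', 'backup']
```

A real operator runs this loop continuously against the cluster, which is what makes failover, backup and upgrades "automated" rather than runbook-driven.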
Operators, she adds, help enterprises to adopt hybrid architectures or multicloud strategies, allowing data portability without the need to rewrite applications. But workloads that operate on VMs will not automatically run on containers, she says. Firms will need to plan, and test, their deployments with care.
“There are specific playbooks that you should apply and methodologies that are obviously different from the classic database setup on VMs,” says Obiidykhata. “But it’s all doable, and many companies are now running those databases on Kubernetes. They just have a different playbook to mitigate those issues.”
Firms also need to factor in how they run their ported applications in production. Development, understandably, attracts much of the attention. But how systems run from “day two” onwards is critical. This includes storage provisioning and tiering, as well as backup, recovery and security.
The CSI drivers take care of much of the hard work, but enterprises are likely to look to invest in new hardware, or even storage from suppliers focused on cloud-native environments, to ease the migration to containers.
“This is usually by deploying new storage architectures, either via new storage products from existing vendors, but increasingly by engaging with new vendors,” says Poulton. Enterprises, he adds, might still be running older hardware systems, but they are unlikely to use them for Kubernetes.
The Asus Zenbook 16 Delivers Great Performance in an Otherwise Mediocre Laptop
So, what’s not to like? Well, early compatibility problems slowed the initial uptake of Snapdragon X, and the CPU’s integrated graphics performance turned out to be pretty terrible. And to date, powerful onboard AI features just haven’t proven important, as most AI workloads are still being done in the cloud. With the second-generation X2, Qualcomm set out to deliver on the original promise of faster performance.
But what exactly does “faster” mean? As with most claims in the PC computing space, it’s all about the benchmarks. On the Zenbook A16, the tests I ran indeed showcased exemplary performance from the X2 Elite Extreme, in some of the most widely used benchmarking tools, namely Geekbench 6 and Cinebench 2024. (I don’t have enough competitive Cinebench 2026 results to make wide comparisons yet on that benchmark.)
The performance boost on Geekbench is particularly striking, with the A16 scoring 50 to 100 percent faster than competing systems from AMD and Intel. It’s even faster than the Apple MacBook Pro with the M4 Pro chip, the last Mac for which I have comparable benchmark scores. That Mac did beat the Asus on the Cinebench benchmark, though not by much, and the Asus now stands solidly in second place in my testing archive.
Graphics performance is much better than in previous generations of Snapdragon X chips, with frame rates quadrupling on average, depending on the test. That’s a dramatic and much-needed improvement for the CPU, and while no one will accuse the A16 of being a gaming rig, it does at least make for a workable experience with less taxing games and graphics-heavy workloads.
Beige Belies Performance
Photograph: Chris Null
I’m happy enough with how the Snapdragon X2 Elite Extreme performs to sign off on its performance claims, but there’s a lot more to the Zenbook A16 than its CPU.
Under the hood, the Snapdragon X2 Elite Extreme X2E94100 CPU is complemented by 48 GB of RAM and a 1-TB SSD. The 16-inch touchscreen offers a solid resolution of 2880 x 1800 pixels, and it’s incredibly bright. A weight of 2.9 pounds is impressive (if not unheard of) for the 16-inch category, and at 0.65 inches at its thickest, it makes for a svelte, quite portable machine. Asus’s Ceraluminum technology (now with added magnesium) is used in the machine’s lid, base, and keyboard frame. That helps keep it thin and light, though when adjusted or touched, the screen shimmied more than I expected.
I Lugged the Best Travel Totes on Work Trips, Weekends, and More
Compare Top 6 Travel Totes
More Travel Totes I Recommend
Longchamp Large Le Pliage Tote for $180: This bestseller is the equivalent of a classic white tee: timeless, versatile, and built to be passed down for generations. Inspired by origami, Le Pliage folds down small when you need to pack it, but it’s also roomy enough to double as your personal item. I can fit all the essentials in here—laptop, Kindle, my airport toiletries, snacks, and then some. With its minimalist design and zipper closure for valuables, it’s also the ideal work bag for business trips. My one gripe with this travel tote bag is the lack of internal compartments (besides two impractical flat pockets), but if you’re someone who has little pouches and tech organizers for your gear, you might not miss it.
Cincha the Vegan Leather Go-Tote for $130: This vegan leather bag is deceptively huge. The base is 7.5 inches deep, so while it doesn’t look that big in pictures, it holds an astonishing amount of stuff. I’ve packed enough clothes in it for a full weekend trip. I usually have concerns about vegan leather cracking and breaking with use, but Cincha’s soft pebbled fabric does not look or feel obviously plastic. This is the tote bag I took on a multi-week trip to the Philippines, and the leather stood up to rain and being kicked around airport lounges, ferries, and train depots. However, it is more than 2 pounds heavier than a Longchamp Le Pliage, so this is strictly for when you can sling it on top of your carry-on. —Adrienne So
Mission Workshop Drift Laptop Tote for $345: The Drift is my favorite travel tote. It’s burly but with styling that’s refined and classy, and the rolled handles and removable strap make it comfortable to carry by hand or over the shoulder. But the best thing about it is the smartly organized storage pockets inside and out. It feels designed especially for people like me who always carry an army of gadgets. The Drift is kind of a beast, though. It’s too huge to slide under the seat in front of you on an airplane, but it fits into the overhead baggage compartment. —Michael Calore
Vera Bradley Original Duffel for $105: If there were ever a product I would refer to as “ol’ reliable,” it’s undoubtedly the Vera Bradley bag. The bright pattern, durable materials, and washable cotton structure have held up remarkably well for over a decade’s worth of travels. Even when I’ve completely overpacked and lugged it with me on planes, trains, and car travels, I don’t detect strain on the handle stitching. There are no internal pockets, but you do have four exterior ones located around the sides of the bag for easy access (or last-minute additions to your planned outfits). —Julia Forbes
BaubleBar Large Custom Icon Tote for $98: What sets BaubleBar apart is its playful personalization. Your chosen icons (up to six, depending on the size) are embroidered directly onto the canvas tote. The process is super user-friendly, with predesignated spots to help you visualize your picks. Choose from zodiac signs, cutesy foods, initials, and more. Just note that it’s a final sale, so be sure of your design before ordering. The large size fits everything you need for a beach day trip, and the medium and small options are better for light shopping or city exploring. It closes with just a snap button, which isn’t the most secure for crowded areas.
Aer Simple Tote for $139: Have you ever hefted a nylon or leather tote in your hand and realized that slinging it over your shoulder would give you immediate scoliosis? Then you want Aer’s ultra-lightweight, simple sailcloth tote, which weighs less than a pound. Its 15 liters felt surprisingly capacious. I fit two jackets inside on a walk with my kids, and the 3-inch-wide bag tucked neatly under my arm. The two exterior drop pockets fit my Nalgene and Kinto mug, and my phone fits neatly in the exterior zipper pocket. This is a great upgrade if you are getting tired of carrying everything in your canvas tote from Umami Mart and want a bag that’s not going to get soaked in something questionable if you put it down in the wrong place on the subway. It is a little more expensive, though. —Adrienne So
Cuyana System Tote 16-Inch for $378: The Cuyana System Tote is a modular gear-hauler that shape-shifts with your itinerary. Designed to outlast the churn of fast fashion, this travel tote starts minimal, but the genius lies in its add-ons. A laptop sleeve or insert organizer creates a structure on the go, with dedicated slots for your computer, water bottle, and other work essentials. A System Flap Bag insert doubles as a clutch or in-bag organizer, and a detachable, adjustable crossbody strap (also available in a wide model) converts the tote, perfect for hands-free airport sprints if you’re unintentionally trying out airport theory. Instead of stitched-on straps prone to failure, the System Tote’s handles are cut directly from its leather body, minimizing points of wear. The main compartment snaps shut rather than zips, something to know if you’re the spill-averse type.
Avoid This Tote
Calpak Diaper Tote Bag with Laptop Sleeve for $195: This bag is puffy but feels bulky, with usable space lost to its padded layers. It was somehow too big for everyday use, but not big enough when I needed to bring a lot of stuff along for a day trip or long outing. It also didn’t really feel that diaper bag-centric; the only thing “diaper” about it was the baby wipe compartment on the outside, and I would have preferred an exterior pocket to store actual diapers along with it. You could stuff a couple of diapers in the flat front pocket, but it’s not as ideal as other designs I’ve tried. The insulated bottle pockets are handy if you travel with bottles, but feel useless after your baby outgrows bottles (which happens much earlier than diapers!). It’s not a bad bag, but I’d recommend a different design for parents and travelers alike. —Nena Farrell
To determine the best travel tote, I put each bag through real-world travel scenarios to see how it performs. That means packing it with laptops, chargers, clothes, and toiletries, and testing comfort when it’s worn over the shoulder or carried by hand. I’ll overstuff the totes to check durability, organization, and accessibility. I’ll evaluate how each one fits under airplane seats, protects tech gear, and resists wear and weather. If it’s supposedly water-resistant, I’ll take it out in the rain to determine whether it survives without soaking its contents.
I scrutinize every pocket, compartment, and zipper for usability. When it comes to design, I pay attention to the details: interior fabric choices that make contents easy to see, convenient pocket placement, and hardware choices like zippers and zipper pulls. I also like to take note of the key design elements, such as the handle length and overall structure.
I prioritize quality and sustainability, and I include eco-friendly brands for environmentally conscious consumers. I also make sure to include an array of fabrics for stylistic variety. Lastly, I consider how each bag stacks up against its price point, ensuring that the quality justifies the cost.