AI systems are great at tests. But how do they perform in real life?
Earlier this month, when OpenAI released its latest flagship artificial intelligence (AI) system, GPT-5, the company said it was “much smarter across the board” than earlier models. Backing up the claim were high scores on a range of benchmark tests assessing domains such as software coding, mathematics and health care.
Benchmark tests like these have become the standard way we assess AI systems—but they don’t tell us much about the actual performance and effects of these systems in the real world.
What would be a better way to measure AI models? A group of AI researchers and metrologists—experts in the science of measurement—recently outlined a way forward.
Metrology is important here because we need not only ways of ensuring the reliability of the AI systems we may increasingly depend upon, but also some measure of their broader economic, cultural, and societal impact.
Measuring safety
We count on metrology to ensure the tools, products, services, and processes we use are reliable.
Take something close to my heart as a biomedical ethicist—health AI. In health care, AI promises to improve diagnoses and patient monitoring, make medicine more personalized and help prevent diseases, as well as handle some administrative tasks.
These promises will only be realized if we can be sure health AI is safe and effective, and that means finding reliable ways to measure it.
We already have well-established systems for measuring the safety and effectiveness of drugs and medical devices, for example. But this is not yet the case for AI—not in health care, or in other domains such as education, employment, law enforcement, insurance, and biometrics.
Test results and real effects
At present, most evaluation of state-of-the-art AI systems relies on benchmarks. These are tests that aim to assess AI systems based on their outputs.
They might answer questions about how often a system’s responses are accurate or relevant, or how they compare to responses from a human expert.
There are literally hundreds of AI benchmarks, covering a wide range of knowledge domains.
However, benchmark performance tells us little about the effect these models will have in real-world settings. For this, we need to consider the context in which a system is deployed.
The problem with benchmarks
Benchmarks have become very important to commercial AI developers to show off product performance and attract funding.
For example, in April 2024 a young startup called Cognition AI posted impressive results on a software engineering benchmark. Soon after, the company raised US$175 million (A$270 million) in funding in a deal that valued it at US$2 billion (A$3.1 billion).
Benchmarks have also been gamed. Meta appears to have adjusted some versions of its Llama 4 model to optimize its score on a prominent chatbot-ranking site. And after OpenAI’s o3 model scored highly on the FrontierMath benchmark, it emerged that the company had had access to the dataset behind the benchmark, raising questions about the result.
The overall risk here is known as Goodhart’s law, after British economist Charles Goodhart: “When a measure becomes a target, it ceases to be a good measure.”
In the words of Rumman Chowdhury, who has helped shape the development of the field of algorithmic ethics, placing too much importance on metrics can lead to “manipulation, gaming, and a myopic focus on short-term qualities and inadequate consideration of long-term consequences”.
Beyond benchmarks
So if not benchmarks, then what? Let’s return to the example of health AI. The first benchmarks for evaluating the usefulness of large language models (LLMs) in health care made use of medical licensing exams. These are used to assess the competence and safety of doctors before they’re allowed to practice in particular jurisdictions.
State-of-the-art models now achieve near-perfect scores on such benchmarks. However, these have been widely criticized for not adequately reflecting the complexity and diversity of real-world clinical practice.
In response, a new generation of “holistic” frameworks has been developed to evaluate these models across more diverse and realistic tasks. For health applications, the most sophisticated is the MedHELM evaluation framework, which includes 35 benchmarks across five categories of clinical tasks, from decision-making and note-taking to communication and research.
What better testing would look like
More holistic evaluation frameworks such as MedHELM aim to avoid these pitfalls. They have been designed to reflect the actual demands of a particular field of practice.
However, these frameworks still fall short of accounting for the ways humans interact with AI systems in the real world. And they don’t even begin to come to terms with AI’s impact on the broader economic, cultural, and societal contexts in which these systems operate.
For this we will need a whole new evaluation ecosystem. It will need to draw on expertise from academia, industry, and civil society with the aim of developing rigorous and reproducible ways to evaluate AI systems.
Work on this has already begun. There are methods for evaluating the real-world impact of AI systems in the contexts in which they’re deployed—things like red-teaming (where testers deliberately try to produce unwanted outputs from the system) and field testing (where a system is tested in real-world environments). The next step is to refine and systematize these methods, so that what actually counts can be reliably measured.
If AI delivers even a fraction of the transformation it’s hyped to bring, we need a measurement science that safeguards the interests of all of us, not just the tech elite.
More information:
Reva Schwartz et al, Reality Check: A New Evaluation Ecosystem Is Necessary to Understand AI’s Real World Effects, arXiv (2025). DOI: 10.48550/arxiv.2505.18893
This article is republished from The Conversation under a Creative Commons license. Read the original article.
Citation: AI systems are great at tests. But how do they perform in real life? (2025, August 25), retrieved 25 August 2025 from https://techxplore.com/news/2025-08-ai-great-real-life.html
This Backyard Smoker Delivers Results Even a Pitmaster Would Approve Of
While my love of smoked meats is well-documented, my own journey into actually tending the fire started just last spring when I jumped at the opportunity to review the Traeger Woodridge Pro. When Recteq came calling with a similar offer to check out the Flagship 1600, I figured it would be a good way to stay warm all winter.
While the two smokers have a lot in common, the Recteq definitely feels like an upgrade from the Traeger I’ve been using. Not only does it have nearly twice the cooking space, but the huge pellet hopper, rounded barrel, and proper smokestack help me feel like a real pitmaster.
The trade-off is losing some of the usability features that make the Woodridge Pro a great first smoker. The setup isn’t quite as simple, and the larger footprint and less ergonomic design require a little more experience or patience. With both options, excellent smoked meat is just a few button presses away, but speaking as someone with both in their backyard, I’ve been firing up the Recteq more often.
Getting Settled
Photograph: Brad Bourque
Setting up the Recteq wasn’t as time-consuming as the Woodridge, but it was more difficult to manage on my own. Some of the steps, like attaching the bull horns to the lid or flipping the barrel onto its stand, would really benefit from a patient friend or loved one. Like most smokers, you’ll need to run a burn-in cycle at 400 degrees Fahrenheit to make sure there’s nothing left over from manufacturing or shipping. Given the setup time and the need to let the smoker cool down afterward, I’d recommend setting it up on a Friday afternoon if you want to smoke on Saturday.
Make the Most of Chrome’s Toolbar by Customizing It to Your Liking
The main job of Google Chrome is to give you a window to the web. With so much engaging content out there on the internet, you may not have given much thought to the browser framework that serves as the container for the sites you visit.
You’d be forgiven for still using the default toolbar configuration that was in place when you first installed Chrome. But taking a few minutes to customize it can make a significant difference to your browsing. You can get quicker access to the key features you need, and you may even discover features you didn’t know about.
If you’re reading this in Chrome on the desktop, you can experiment with a few customizations right now—all it takes is a few clicks. Here’s how the toolbar in Chrome is put together, and all the different changes you can make.
The Default Layout
Take a look up at the top right corner of your Chrome browser tab and you’ll see two key buttons: One reveals your browser extensions (the jigsaw piece), and the other opens up your bookmarks (the double-star icon). There should also be a button showing a downward arrow, which gives you access to recently downloaded files.
Right away, you can start customizing. If you click the jigsaw piece icon to show your browser extensions, you can also click the pin button next to any one of these extensions to make it permanently visible on the toolbar. While you don’t want your toolbar to become too cluttered, it means you can put your most-used add-ons within easy reach.
For the extension icons you choose to have on the toolbar, you can choose the way they’re arranged, too: Click and drag on any of the icons to change its position (though the extensions panel itself has to stay in the same place). To remove an extension icon (without uninstalling the extension), right-click on it and choose Unpin.
Making Changes
Click the three dots up in the top right corner of any browser window and then Settings > Appearance > Customize your toolbar to get to the main toolbar customization panel, which has recently been revamped. Straight away you’ll see toggle switches that let you show or hide certain buttons on the toolbar.
The Piracy Problem Streaming Platforms Can’t Solve
“The trade-off isn’t only ethical or economic,” Andreaux adds. “It’s also about reliability, privacy and personal security.”
Abed Kataya, digital content manager at SMEX, a Beirut-based digital rights organization focused on internet policy in the Middle East and North Africa, says piracy in the region is shaped less by culture than by structural barriers.
“I see that piracy in MENA is not a cultural choice; rather, it has multiple layers,” Kataya tells WIRED Middle East.
“First, when the internet spread across the region, as in many other regions, people thought everything on it was free,” Kataya says. “This perception was based on the nature of Web 1.0 and 2.0, and how the internet was presented to people.”
Today, he says, structural barriers still lead many users towards illegal platforms. “Users began to watch online on unofficial streaming platforms for many reasons: lack of local platforms, inability to pay, bypassing censorship and, of course, to watch for free or at lower prices.”
Payment access also remains a major factor. “Not to mention that many are unbanked, do not have bank accounts, lack access to online payments, or do not trust paying with their cards and have a general distrust of online payments,” Kataya adds.
Algerian students also share external hard drives loaded with television series, while in Lebanon streaming passwords are frequently shared across households. In Egypt, large Telegram channels distribute content across different genres, including Korean dramas, classic Arab films and underground music.
“We grew up solving problems online,” says Mira. “When something is blocked, you find a way around it. It’s … a fundamental human instinct.”
Streaming Platforms Adapting
Andreaux says StarzPlay has tried to address some of the payment barriers that limit streaming adoption in the region. “StarzPlay recognized early that payment friction was a regional barrier to adoption,” he says. “That’s why we invested in flexible subscription models and alternative payment methods, including telecom-led billing options that make access easier across different markets.”
At the same time, international media companies are working together to combat piracy through the Alliance for Creativity and Entertainment (ACE), a coalition of film studios, television networks and streaming platforms that targets illegal distribution of films, television and sports content. Its members include global companies such as Netflix as well as regional players like OSN Group, which operates the streaming service OSN+ across the Middle East and North Africa.
Kataya notes that legitimate streaming platforms are still expanding across the region. “The user base of official streaming platforms has been growing in the region,” he says. “For example, Shahid, the Saudi platform, is expanding and Netflix has dedicated packages for the region.”
“Other players, like StarzPlay and local platforms in Egypt, are also finding their place,” Kataya adds. “Social media also plays a huge role, especially when a film is widely discussed or controversial.”
Piracy carries legal and security risks, Andreaux says. “Rather than just ‘free streaming’, piracy exposes consumers to malware and insecure payment channels,” he says. “It also weakens investment in local content by depriving creators of revenue and reducing jobs.”
But the structural barriers described by users across the region remain. For many viewers in North Africa and the Levant, the challenge is not choosing between piracy and legality—it is whether legitimate access exists at all.