Tech
AI systems are great at tests. But how do they perform in real life?
Earlier this month, when OpenAI released its latest flagship artificial intelligence (AI) system, GPT-5, the company said it was “much smarter across the board” than earlier models. Backing up the claim were high scores on a range of benchmark tests assessing domains such as software coding, mathematics and health care.
Benchmark tests like these have become the standard way we assess AI systems—but they don’t tell us much about the actual performance and effects of these systems in the real world.
What would be a better way to measure AI models? A group of AI researchers and metrologists—experts in the science of measurement—recently outlined a way forward.
Metrology is important here because we need not only ways of ensuring the reliability of the AI systems we may increasingly depend upon, but also some measure of their broader economic, cultural, and societal impact.
Measuring safety
We count on metrology to ensure the tools, products, services, and processes we use are reliable.
Take something close to my heart as a biomedical ethicist—health AI. In health care, AI promises to improve diagnoses and patient monitoring, make medicine more personalized and help prevent diseases, as well as handle some administrative tasks.
These promises will only be realized if we can be sure health AI is safe and effective, and that means finding reliable ways to measure it.
We already have well-established systems for measuring the safety and effectiveness of drugs and medical devices, for example. But this is not yet the case for AI—not in health care, or in other domains such as education, employment, law enforcement, insurance, and biometrics.
Test results and real effects
At present, most evaluation of state-of-the-art AI systems relies on benchmarks. These are tests that aim to assess AI systems based on their outputs.
They might answer questions about how often a system’s responses are accurate or relevant, or how they compare to responses from a human expert.
There are literally hundreds of AI benchmarks, covering a wide range of knowledge domains.
However, benchmark performance tells us little about the effect these models will have in real-world settings. For this, we need to consider the context in which a system is deployed.
The problem with benchmarks
Benchmarks have become very important to commercial AI developers to show off product performance and attract funding.
For example, in April this year a young startup called Cognition AI posted impressive results on a software engineering benchmark. Soon after, the company raised US$175 million (A$270 million) in funding in a deal that valued it at US$2 billion (A$3.1 billion).
Benchmarks have also been gamed. Meta seems to have adjusted some versions of its Llama-4 model to optimize its score on a prominent chatbot-ranking site. After OpenAI’s o3 model scored highly on the FrontierMath benchmark, it came out that the company had had access to the dataset behind the benchmark, raising questions about the result.
The overall risk here is known as Goodhart’s law, after British economist Charles Goodhart: “When a measure becomes a target, it ceases to be a good measure.”
In the words of Rumman Chowdhury, who has helped shape the development of the field of algorithmic ethics, placing too much importance on metrics can lead to “manipulation, gaming, and a myopic focus on short-term qualities and inadequate consideration of long-term consequences”.
Beyond benchmarks
So if not benchmarks, then what? Let’s return to the example of health AI. The first benchmarks for evaluating the usefulness of large language models (LLMs) in health care made use of medical licensing exams. These are used to assess the competence and safety of doctors before they’re allowed to practice in particular jurisdictions.
State-of-the-art models now achieve near-perfect scores on such benchmarks. However, these have been widely criticized for not adequately reflecting the complexity and diversity of real-world clinical practice.
In response, a new generation of “holistic” frameworks has been developed to evaluate these models across more diverse and realistic tasks. For health applications, the most sophisticated is the MedHELM evaluation framework, which includes 35 benchmarks across five categories of clinical tasks, from decision-making and note-taking to communication and research.
What better testing would look like
More holistic evaluation frameworks such as MedHELM aim to avoid these pitfalls. They have been designed to reflect the actual demands of a particular field of practice.
However, these frameworks still fall short of accounting for the ways humans interact with AI systems in the real world. And they don’t even begin to come to terms with the impacts of those systems on the broader economic, cultural, and societal contexts in which they operate.
For this we will need a whole new evaluation ecosystem. It will need to draw on expertise from academia, industry, and civil society with the aim of developing rigorous and reproducible ways to evaluate AI systems.
Work on this has already begun. There are methods for evaluating the real-world impact of AI systems in the contexts in which they’re deployed—things like red-teaming (where testers deliberately try to produce unwanted outputs from the system) and field testing (where a system is tested in real-world environments). The next step is to refine and systematize these methods, so that what actually counts can be reliably measured.
If AI delivers even a fraction of the transformation it’s hyped to bring, we need a measurement science that safeguards the interests of all of us, not just the tech elite.
More information:
Reva Schwartz et al, Reality Check: A New Evaluation Ecosystem Is Necessary to Understand AI’s Real World Effects, arXiv (2025). DOI: 10.48550/arxiv.2505.18893
This article is republished from The Conversation under a Creative Commons license.
Citation:
AI systems are great at tests. But how do they perform in real life? (2025, August 25)
retrieved 25 August 2025
from https://techxplore.com/news/2025-08-ai-great-real-life.html
Tech
Prague’s City Center Sparkles, Buzzes, and Burns at the Signal Festival
And thanks to a mention in Dan Brown’s new novel, The Secret of Secrets, the festival has gained even more global recognition. Just a few weeks after the release of Brown’s new bestseller set in contemporary Prague, viewers were able to see for themselves what drew the popular writer to the festival, which is the largest Czech and Central European showcase of digital art. In one passage, the Signal Festival has a cameo appearance when the novel’s protagonist recalls attending an event at the 2024 edition.
“We’re happy about it,” festival director Martin Pošta says about the mention. “It’s a kind of recognition.” Not that the event needed promotion, even in one of the most anticipated novels of recent years. The organizers have yet to share the number of visitors to the festival this year, but the four-day event typically attracts half a million visitors.
On the final day, there was a long queue in front of the monumental installation Tristan’s Ascension by American video art pioneer Bill Viola before it opened for the evening, even though it was a ticketed event. In the Church of St. Salvator in the Convent of St. Agnes, visitors could watch a Christ-like figure rise upwards, streams of water defying gravity along with him, all projected on a huge screen.
The festival premiere took place on the Vltava River near the Dvořák Embankment. Taiwan’s Peppercorns Interactive Media Art presented a projection on a cloud of mist called Tzolk’in Light. While creators of other light installations have to deal with the challenges of buildings—their irregular surfaces, decorative details, and awkward cornices—projecting onto water droplets is a challenge of a different kind with artists having to give up control over the resulting image. The shape and depth of the Peppercorns’ work depended on the wind at any given moment, which determined how much of the scene was revealed to viewers and how much simply blown away. The reward, however, was an extraordinary 3D spectacle reminiscent of a hologram—something that can’t be achieved with video projections on static and flat buildings.
Another premiere event was a projection on the tower of the Old Town Hall, created for the festival by the Italian studio mammasONica. It transformed the 230-foot structure into a kaleidoscope of blue, green, red, and white surfaces. A short distance away, on Republic Square, Peppercorns had another installation. On a circular LED installation, they projected a work entitled Between Mountains and Seas, which recounted the history of Taiwan.
Tech
Every Model of this New Snoopy MoonSwatch Is Different—And You Can Only Get One When It Snows
First a confession: I own more MoonSwatches than I care to admit. Never let it be said that WIRED does not walk the walk when it comes to recommending products—Swatch has assiduously extracted a considerable amount of cash from me, all in $285 increments. This was no doubt the Swiss company’s dastardly plan all along, to lure us in, then, oh so gently, get watch fans hooked. The horological equivalent of boiling a frog. It’s worked, too—Swatch has, so far, netted hundreds of millions of dollars from MoonSwatch sales.
But while I’ve been a fan of the Omega X Swatch mashup since we reported on exactly how the hugely lucrative collaboration came to be in the first place, I have never liked the iterative Moonshine Gold versions. Employing a sliver of Omega’s exclusive 18K pale yellow gold alloy in marginally different ways on each design, they seemed almost cynical—a way of milking the MoonSwatch superfans on the hunt to complete the set.
Now, though, just when I thought I was done with MoonSwatch—having gone as far as to upgrade all of mine with official $45 color-matching rubber straps—Swatch has managed to ensnare me once again, and with a Moonshine Gold model: the new MoonSwatch Mission To Earthphase Moonshine Gold Cold Moon.
Clumsy moniker aside, this version takes the all-white 2024 Snoopy model (WIRED’s top pick of the entire collection), mixes it with the Earthphase MoonSwatches, and replaces the inferior original strap with a superior white and blue Swatch rubber velcro one. Aesthetically, it’s definitely a win, but this is not the Cold Moon’s party trick.
On each $450 Cold Moon MoonSwatch, a snowflake is lasered onto its Moonshine Gold moon phase indicator—and, just like a real snowflake, Swatch claims each one will be completely unique. When you consider the volumes of MoonSwatches Swatch produces each year, this is no mean feat.
Tech
These Are the Best Tech Deals to Shop This Cyber Monday
Welcome to WIRED’s guide to the best Cyber Monday tech deals, where we can promise you two things: these devices are worth buying (we’ve tested and recommended every one of them), and these are actual discounts (not the year-round price). So, whether you need an upgrade, want to treat yourself, or are seeking a great gift, we have you covered.
Want a wider range of deals? Check out the Absolute Best Black Friday Deals roundup to find more bargains this sale weekend.
Updated November 30: We’ve added deals on Samsung Galaxy Z Fold7 and Galaxy Z Flip7, Galaxy S25, S25+, S25 Ultra, Lenovo Flex 5i Chromebook Plus, Apple AirPods Pro 2, and Apple Watch SE 3.
The Google Pixel 10 is one of the best Android phones you can buy. Easy to recommend at full price, the Pixel 10 is an absolute bargain with this discount. You get an excellent triple-camera system with a 5X optical zoom sensor, support for Qi2 wireless charging, so you can magnetically attach to wireless chargers and docks, and Google’s super smart software features (Call Screen to filter out spam calls is our favorite). Learn more in the Best Pixel Phones guide.
The Pixel 9a is our top smartphone choice for most people, and it’s now $50 cheaper than it was on Black Friday itself. At $349, you’re getting a smooth-performing smartphone with a reliable dual-camera system that’s unmatched at this price, not to mention day-long battery life and a completely flat camera lens system for anyone who hates giant camera bumps. Oh, and it’ll get 7 years of software updates.
Sony’s A7 IV is the best mirrorless camera on the market (for most people). It’s a 33-megapixel, full-frame camera with a brilliant autofocus system, impressive dynamic range, and crisp images. There’s an expansive range of 4K video options as well, along with customizable buttons to set up your preferences, so you don’t have to always rummage through the menus. Reviewer Scott Gilbertson found the grip to be super comfortable and the camera to be light enough to endure for long periods without any back strain. —Boutayna Chokrane
If you’re shopping for open earbuds so that you can enjoy your music but still be aware of your surroundings, the Soundcore Aeroclip is the best we’ve tested so far. Reviewer Ryan Waniata praises the comfort, sound quality, usability, and value. The sound is wide and balanced, and the built-in controls are ideal for runs. Waniata likes to use them during outdoor activities, like hiking or biking, but he finds them especially helpful when he’s cooking dinner and needs to stay alert for his newborn’s cries. —Boutayna Chokrane
Editor Adrienne So says the Fitbit Ace LTE is the first fitness tracker she’s gotten her kids to use. It’s a fitness tracker (designed with Fitbit’s health sensors), gaming device, and location tracker. The $10/monthly subscription includes both LTE connectivity and Fitbit Arcade, which has a variety of movement-based games that get children on their feet and incentivize them to keep their watches on. They can call and text their guardians (and other approved contacts) through the Fitbit Ace app, and their location is trackable via Google Find My. —Boutayna Chokrane
This is a rare and tasty deal on my favorite Xmas lights. They work indoors or out, can be scheduled, and support a bunch of lovely animated effects. While I’m mentioning Philips Hue and its excellent but horribly expensive wares, you might want to check out some of its other Cyber Monday deals. My picks would be the wall washers ($316), TV lightstrip ($129), and HDMI sync box ($270).
The Asus RT-BE58U is perhaps the ideal Wi-Fi 7 upgrade for modest homes and apartments still struggling with the crappy router their internet service provider sent, and that’s why it tops our Best Wi-Fi Routers guide. It’s easy to set up and use, can cover up to 2,000 square feet, and boasts plenty of ports. As a dual-band router, it lacks the 6-GHz band, but has all the other advantages of Wi-Fi 7. There’s also support for VPN service, separate IoT or guest networks, and Ai Mesh.
Don’t ask me why they keep taking our ports away. God forbid you should want to plug something into your laptop. Well, you can stick it to those minimalist designers with the best laptop docking station. This one doubles as a wedge to prop your laptop up and has a storage slot.
These wireless noise-canceling headphones may not be the latest release from Sony, but they are still an excellent pair of cans with a far deeper discount. The Sony WH-1000XM5 are relatively light and comfortable, producing accomplished sound in every scenario, and have great control options.
You can spend a lot on a TV, but you can also get a great screen without breaking the bank, and the TCL QM6K proves it. This is the best TV for most people right now as it offers excellent color and processing, all the apps you want, and great performance, even in bright rooms. There are discounts across the range of screen sizes.
If you want to get the latest streaming apps on an older TV, the Roku Streaming Stick Plus is for you. It’s easy to set up, works reliably, and has a handy voice remote that makes finding content easier than ever. It slots neatly behind most TVs, and Roku’s interface is nice and clear.
Apple doesn’t really do sales, but other retailers do. This is the lowest price we’ve seen on a solid iPad the whole family can enjoy. The Apple iPad (A16, 2025) performs great for most tasks, looks pretty nice, and has a 12-megapixel camera. It is honestly all the iPad most folks need for surfing the web and streaming shows in bed. With iPadOS 26 and the new windowing apps feature, you can even comfortably do some work if you pair it with a Bluetooth keyboard and mouse.
Handy as they are for keeping you connected when your phone dies unexpectedly, portable chargers can be very same-y. The reason the Nimble Champ tops our Best Power Banks guide is Nimble’s focus on the environment. It’s made from 90 percent certified recycled plastic and comes in fully biodegradable packaging. It also works well, with capacities starting from 5,200 mAh, with USB-A and USB-C ports, and up to 15-watt charging.
Yes, you should read more, and Amazon’s Kindle e-readers make it easier to do exactly that. Our current favorite is the Kindle Paperwhite (12th generation). It has a sharp 7-inch display, auto-adjusting warm light, three-month battery life, snappy performance, and it’s slim and light, making it comfortable to hold. It even has integration with Overdrive for your library books and support for several languages.
The reMarkable 2 is one of the best digital notebooks, offering a paperlike writing experience, intuitive software, and several weeks of battery life. This is a budget model, so it lacks front light and color, but it’s still a decent device. Bundles where you choose both a marker and folio are heavily discounted right now, and they’re not often on sale, so it’s a good time to snap one up.
Keychron boards are popular here at WIRED, and the Q6 HE is our current pick of the best mechanical keyboards. Sturdy, satisfying to type on, with a lovely retro aesthetic, what more do you need? Well, the Q6 HE also boasts hot-swappable Hall Effect switches, four macro keys, and is relatively easy to customize or repair.
The great thing about Nomad’s 65-watt charger is that it’s incredibly slim, with flip-out prongs, so it can slip easily into small pockets in your bag or purse. You get dual USB-C ports, and can pull 45 watts out of the left port and 20 out of the right. Or, if you’re just charging one device, the full 65 watts is enough for any phone, most tablets, and even some MacBooks or Windows laptops (though they may not charge at top speed).
You know what I don’t miss in the slightest? Mowing the lawn. A good robot mower, like this relatively affordable one from newcomer Anthbot, will do it for you, quietly. No wire required; it recharges itself, you just set a schedule and relax. OK, it sometimes leaves a verge, but the only model I’ve tried that doesn’t is more than twice the price.
Sharp 2K video, color night vision, a wide 160-degree field of view, and clear two-way audio make the Arlo Pro 5S easy to recommend for folks seeking a security camera. You also get AI recognition for people and pets, a siren to scare intruders, and the quick-loading Arlo app. But you need Arlo Secure ($8 per month for one camera or $13 per month for unlimited cameras) for subject recognition, smart alerts, and cloud storage. The Arlo Pro 5S is our pick of the best outdoor security cameras.
I love my Oura Ring 4. It accurately tracks my sleep, activity, and stress levels and offers insights that I find genuinely useful. It’s also very comfortable, the app is super slick, with new features being added all the time, and it’s far less obtrusive than any other kind of tracker you could wear. The catch is a subscription, but this is still the best of the best smart rings.
It’s the thoughtful design that elevates the Backbone One above the rest of the best mobile controllers. Slot your phone into the compact cradle, with a USB-C jack for speedy connectivity, and you get satisfyingly clicky and responsive controls plus a 3.5-mm headphone port. You can also customize it for different games, or even use Backbone’s software as a one-stop gaming hub, if you’re willing to pay a subscription.
Yes, there is a new version of the Ray-Ban Meta Wayfarers, but the good news is that the old pair is now on sale. If you can stomach Meta AI’s privacy policies, there’s a strong argument that it has won the smart glasses race already (at least, so far). The best smart glasses must be easy to wear, and these look great and help offload things from your phone, so you don’t have to dip into that pocket quite as much.
The JBL Flip 7 is the Bluetooth speaker that has it all. It’s durable, it has stamina, it produces a punchy sound, and it comes in fun colors. It’s the best Bluetooth speaker you can buy, and this deal is for real.
Toniebox is our favorite speaker for young kids, particularly ages three through seven. It’s essentially a squishy cube that plays stories and songs tied to different characters (aka Tonies). It’s activated when your child places the figurine on top of the speaker. There are so many Tonies to choose from. Peppa Pig, Moana, Winnie the Pooh, the list goes on. You can also buy Creative Tonies to record your own audio. Super easy to use, and the cutesy ears double as volume controls. —Boutayna Chokrane
The best tech books unpack the rise and fall of the characters that invented the stuff that runs our lives. The New York Times and former Wall Street Journal reporter John Carreyrou writes about Elizabeth Holmes, as she miserably fails to build a blood testing machine that would allegedly eliminate the need for hypodermic needles. Her company raised hundreds of millions of dollars, but its technology was inaccurate. Rather than admit defeat, she pressed on, which is why Holmes was put on trial for fraud and sentenced to 11 years in prison. —Boutayna Chokrane
Samsung’s flagship Galaxy S25 has been heavily discounted all Cyber Weekend, probably because its successors are right around the corner (the Galaxy S26 series is expected to be announced in January). But we still love these excellent smartphones. The S25 is the smallest, the S25+ gets a few extra perks, plus a bigger screen and better battery life, and the Galaxy S25 Ultra has a dual telephoto camera system, integrated S Pen stylus, and a beefy battery. Be sure to check out the Best Samsung Phones guide for the full scoop. —Molly Higgins
You’re not like other girls; you have a folding phone. In all seriousness, folding phones are not as fragile as they used to be, with durability improving while remaining slim. We love the Galaxy Z Fold7 because it’s amazingly slim and versatile. You can use the front screen like normal, and when you need extra real estate, open the device up. You can view apps on a much larger scale or easily split-screen two apps. If you’re not feeling a folding phone, the updated Galaxy Z Flip7 has a more usable front screen. Read our Best Folding Phones guide to decide which is best for you at the discounted price this Cyber Weekend. —Molly Higgins
The Lenovo Flex 5i Chromebook Plus is super cheap and compact, with a small touchscreen for more versatility. Especially with the Cyber Monday discount, it’s one of the most affordable Chromebook Plus models you can find. WIRED reviewer Luke Larsen thinks it’s in a whole different league from standard Chromebooks at this price because of its improved screen with a 360-degree hinge, fast performance, more storage, and crisp webcam. —Molly Higgins
Even though they’re an older model, we like these AirPods because of their hearing aid feature, comfort, and outstanding streaming experience. If you’re an iPhone user, you should have some AirPods, and we still think these are a good choice for most people because of their active noise cancellation, sound quality, and easy pairing within the Apple ecosystem. Plus, it doesn’t hurt that they’re nearly 25 percent off for Cyber Monday. —Molly Higgins
We on the WIRED Reviews team still think this is the best Apple Watch for most people. With its newest upgrade, it now has the latest S10 chip, a Liquid Glass display, Workout Buddy, and wrist-flick gestures. If you have an iPhone, this accessory is a no-brainer. It makes a great gift for yourself or others, and is seriously discounted at only $200 right now. —Molly Higgins