
How to ensure high-quality synthetic wireless data when real-world data runs dry

Our quality assessment and quality-guided utilization of wireless synthetic data. Generative models produce synthetic data from conditions to supplement data quantity for wireless applications. Compared to previous quality-oblivious utilization, which uses all synthetic data with conditions as labels, we assess synthetic data quality, reveal its affinity limitation, and propose a quality-guided utilization scheme incorporating filtered synthetic samples with assigned pseudo-labels for better data quality and task performance. Credit: arXiv (2025). DOI: 10.48550/arxiv.2506.23174

To train artificial intelligence (AI) models, researchers need good data, and lots of it. However, most readily available real-world data has already been used, leading scientists to generate synthetic data. While generated data helps solve the problem of quantity, its quality is not always good, and assessing that quality has been largely overlooked.

Wei Gao, associate professor of electrical and computer engineering at the University of Pittsburgh Swanson School of Engineering, has collaborated with researchers from Peking University to develop analytical metrics that quantitatively evaluate the quality of synthetic wireless data. The researchers have created a novel framework that significantly improves the task-driven training of AI models using synthetic wireless data.

Their work is detailed on the arXiv preprint server in a study titled “Data Can Speak for Itself: Quality-Guided Utilization of Wireless Synthetic Data,” which received the Best Paper Award in June at the MobiSys 2025 International Conference on Mobile Systems, Applications, and Services.

Assessing affinity and diversity

“Synthetic data is vital for training AI models, but for modalities such as images, video, or sound, and especially wireless signals, generating good data can be difficult,” said Gao, who also directs the Pitt Intelligent Systems Laboratory.

Gao has developed metrics to quantify affinity and diversity, qualities that synthetic data must have to be effective for training AI models.

“Generated data shouldn’t be random,” said Gao. “Take human faces. If you’re training an AI model to identify human faces, you need to ensure that the images of faces represent actual faces. They can’t have three eyes or two noses. They must have affinity.”

The images also need diversity. Training an AI model on a million images of an identical face won’t achieve much. While the faces must have affinity, they must also be different, as human faces are. As Gao noted, “AI models learn from variation.”

Different tasks impose different requirements for judging affinity and diversity. Recognizing a specific human face is a different task from distinguishing a human face from that of a dog or a cat, and each has unique data requirements. Therefore, in systematically assessing the quality of synthetic data, the team applied a task-specific approach.
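For readers who want a concrete picture, here is a minimal Python sketch of what such metrics could look like. These are hypothetical proxies, not the metrics defined in the paper: affinity is approximated as the agreement between a classifier trained on real data and the labels the generator was conditioned on, and diversity as the average pairwise distance between samples in feature space. The function names and the scikit-learn-style predict interface are assumptions.

    import numpy as np

    # Hypothetical proxies for the two qualities discussed above; the
    # paper's own metrics are task-specific and defined differently.

    def affinity_score(real_trained_model, synth_features, condition_labels):
        # Affinity proxy: how often a classifier trained on real data agrees
        # with the labels the generator was conditioned on.
        preds = real_trained_model.predict(synth_features)
        return float(np.mean(preds == np.asarray(condition_labels)))

    def diversity_score(synth_features):
        # Diversity proxy: mean pairwise Euclidean distance in feature space.
        x = np.asarray(synth_features, dtype=float)
        if len(x) < 2:
            return 0.0
        dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        return float(dists.sum() / (len(x) * (len(x) - 1)))

Under these proxies, data generated by copying one sample would score high on affinity but near zero on diversity, while random noise would show the opposite pattern, which is why both qualities must be assessed together.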

“We applied our method to downstream tasks and evaluated the existing work of synthesizing data,” said Gao. “We found that most synthetic data achieved good diversity, but some had problems satisfying affinity, especially wireless signals.”

The challenge of synthetic wireless data

Today, wireless signals are used in technologies such as home and sleep monitoring, interactive gaming, and virtual reality. Cell phone and Wi-Fi signals, as they propagate, hit objects and bounce back toward their source. These reflected signals can be interpreted to reveal everything from sleep patterns to the shape of a person sitting on a couch.

To advance this technology, researchers need more wireless data to train models to recognize human behaviors in the signal patterns. However, as waveforms, the signals are difficult for humans to evaluate.

Wireless signals aren’t like human faces, which can be clearly defined. “Our research found that current synthetic wireless data is limited in its affinity,” said Gao. “This leads to mislabeled data and degraded task performance.”

To improve affinity in synthetic wireless data, the researchers took a semi-supervised learning approach. “We used a small amount of labeled synthetic data, which was verified as legitimate,” Gao said. “We used this data to teach the model what is and isn’t legitimate.”

Gao and his collaborators developed SynCheck, a framework that filters out synthetic wireless samples with low affinity and labels the remaining samples during iterative training of a model.
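As a rough sketch of that idea, one filtering-and-relabeling round might look like the Python below. This is illustrative only: the scikit-learn-style predict_proba interface, the confidence measure, and the 0.9 threshold are assumptions, and SynCheck’s actual procedure differs in detail.

    import numpy as np

    def filter_and_pseudolabel(task_model, synth_x, threshold=0.9):
        # Illustrative round in the spirit of SynCheck, not its exact algorithm.
        # synth_x is assumed to be a NumPy array of synthetic samples.
        probs = task_model.predict_proba(synth_x)  # class probabilities per sample
        confidence = probs.max(axis=1)             # confidence in the top class
        keep = confidence >= threshold             # drop likely low-affinity samples
        pseudo_labels = probs.argmax(axis=1)       # relabel survivors with predictions
        return synth_x[keep], pseudo_labels[keep]

In each iteration, the task model would then be retrained on the real data plus these filtered, pseudo-labeled synthetic samples, and the process would repeat as the model improves.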

“We found that our system improves performance by 4.3% whereas a nonselective use of synthetic wireless data degrades performance by 13.4%,” Gao noted.

This research takes an important first step toward ensuring not just an endless stream of data, but a stream of quality data that scientists can use to train more sophisticated AI models.

More information:
Chen Gong et al, Data Can Speak for Itself: Quality-guided Utilization of Wireless Synthetic Data, arXiv (2025). DOI: 10.48550/arxiv.2506.23174

Journal information:
arXiv


Citation:
How to ensure high-quality synthetic wireless data when real-world data runs dry (2025, September 15)
retrieved 15 September 2025
from https://techxplore.com/news/2025-09-high-quality-synthetic-wireless-real.html





