You can now train AI models without using real-world data
26 Jan 2026
Researchers from Tsinghua University and Microsoft have created a synthetic data pipeline for training artificial intelligence (AI) models.
The innovative system, dubbed SynthSmith, leverages processors from leading US chip designer NVIDIA.
The development marks a significant step toward overcoming the scarcity of real-world data available for improving AI models.
SynthSmith outperforms larger models with less data
The SynthSmith pipeline was used to train an X-Coder model with seven billion parameters.
This model outperformed models with 14 billion parameters on major coding benchmarks, despite being trained on less data, none of it from the real world.
The finding highlights the potential of synthetic data in improving AI performance, even when real-world data is scarce.
Synthetic data: A solution to real-world data scarcity
Synthetic data, which mimics real-world data, is generated by AI algorithms.
As new real-world data becomes scarce, AI researchers are turning to synthetic data as a viable alternative for improving their models.
The success of SynthSmith demonstrates the potential of this approach in overcoming one of the key challenges in AI development today.