Can Synthetic Data Help Solve Generative A.I.’s Training Data Crisis?
The supply of quality, real-world data used to train generative A.I. models appears to be dwindling as digital publishers increasingly restrict access to their public data, according to a recent study. That means the advancement of large language models like OpenAI’s GPT-4 and Google’s Gemini could hit a wall once A.I. developers have scraped all the remaining data on the internet.
To address the growing A.I. training data crisis, some experts are considering synthetic data as a potential alternative. Real-world data, created by real humans, includes news articles, YouTube videos and other forms of text and image content. Synthetic data, on the other hand, is artificially generated by machine learning models based on samples of real data. While synthetic data isn’t particularly new, using it to train A.I. models like GPT is a technique major companies including OpenAI are exploring, a practice experts say could backfire if done incorrectly.
“It’s still kind of the Wild West when it comes to generative A.I. models,” Kjell Carlsson, head of A.I. strategy at Domino Data Lab, a machine learning platform for businesses, told Observer.
How synthetic data can be used to train generative A.I.
Synthetic data has long been used to address the lack of sufficient training data for A.I. applications such as autonomous driving systems. For instance, companies like Waymo and Tesla use synthetic data to train their systems to respond to a wide range of road conditions. Now, some experts believe there are creative ways synthetic data can be used to train generative A.I. models.
Synthetic data generated by large models like OpenAI’s GPT-4 can potentially be used to fine-tune smaller, more specialized models, according to Carlsson. For instance, an automaker’s advertising team might use ChatGPT to generate customer profiles of, say, middle-aged women in Minneapolis who own cars. That data can then be used to train a smaller model representing that customer segment to create targeted ads, as sketched below. Additionally, LLMs that are good at translation can produce an abundance of training data in other languages to “boost the performance of a different LLM” with those languages, Carlsson said.
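In practice, the workflow Carlsson describes can be as simple as sampling completions from a large model and saving them in a fine-tuning format for a smaller one. Here is a minimal sketch in Python, assuming OpenAI’s official Python SDK; the prompt, model name and file names are illustrative assumptions, not details from any of the companies quoted here.

```python
# Sketch: use a large LLM to generate synthetic training examples, then
# save them in the JSONL chat format commonly used for fine-tuning a
# smaller model. Assumes the openai package (v1+) and an OPENAI_API_KEY
# environment variable; model name and prompt are illustrative.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Write a short, realistic customer profile of a middle-aged woman "
    "in Minneapolis who owns a car, for use as ad-targeting training data."
)

def generate_profiles(n: int) -> list[str]:
    """Ask the large model for n synthetic customer profiles."""
    profiles = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o",  # illustrative model name
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,  # high temperature to get varied profiles
        )
        profiles.append(resp.choices[0].message.content)
    return profiles

# Write prompt/completion pairs a smaller model can be fine-tuned on.
with open("synthetic_profiles.jsonl", "w") as f:
    for profile in generate_profiles(100):
        record = {
            "messages": [
                {"role": "user", "content": "Describe a target customer."},
                {"role": "assistant", "content": profile},
            ]
        }
        f.write(json.dumps(record) + "\n")
```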
“Synthetic data plays a crucial role in enhancing our large language models,” Jigyasa Grover, a former machine learning engineer at X who now leads A.I. at Bordo AI, a conversational data analytics software maker, told Observer. “By generating synthetic datasets, we can train LLMs on a diverse range of scenarios and edge cases that may not be adequately represented in real-world data. This improves the generalization capabilities of our models, making them more adaptable and effective in various applications.”
Synthetic data can be an alternative to sensitive data
Artificially generated data can also be used to fill in information gaps when organizations don’t want to give up their sensitive data, especially in high-impact sectors like health care, finance and law enforcement, said Neil Sahota, an A.I. advisor to the United Nations and CEO of the A.I. research firm ACSILabs. For example, hospitals can synthetically generate lung cancer X-ray images at different angles to train A.I. models that could help doctors identify tumors more quickly and accurately, Sahota said. Similarly, governments can train their A.I. on synthetic examples modeled on money laundering cases that financial institutions don’t make public, helping identify the characteristics of actors behind corporate crime. “Synthetic data is a great way to bridge some of that gap,” Sahota told Observer.
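The X-ray example can take several forms. The simplest, sketched below, just re-renders a handful of de-identified scans at different angles with standard image tooling; fully synthetic imagery would typically come from a generative model such as a GAN or diffusion model. The file paths and parameters here are illustrative assumptions.

```python
# Sketch: expand a small set of de-identified X-ray images by saving
# each one at several random rotation angles. This is simple geometric
# augmentation, a lightweight cousin of fully synthetic generation.
# Assumes the Pillow and torchvision packages; paths are illustrative.
from pathlib import Path
from PIL import Image
from torchvision import transforms

rotate = transforms.RandomRotation(degrees=25)  # up to +/-25 degrees

src = Path("xrays/real")        # original, de-identified scans
dst = Path("xrays/augmented")   # expanded training set
dst.mkdir(parents=True, exist_ok=True)

for path in src.glob("*.png"):
    image = Image.open(path).convert("L")  # X-rays are grayscale
    for i in range(5):  # five rotated variants per original scan
        rotate(image).save(dst / f"{path.stem}_rot{i}.png")
```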
Synthetic data also provides a way around intellectual property issues, a growing headache for A.I. companies. Training LLMs on synthetic data could help shield companies like OpenAI from lawsuits by artists, writers and publishers over the use of their creative works to train chatbots. “Synthetic training data could clear a lot of these issues,” Star Kashman, an attorney specializing in litigation in the tech sector, told Observer. “That gets around the hurdle of unintentionally infringing upon other people’s work.”
Synthetic data can create more problems—and isn’t always necessary
Despite the potential technical and legal advantages of using synthetic data, training A.I. on non-human data comes with risks. Aside from the skepticism around so-called “fake data,” synthetic data could perpetuate, and even amplify, biases and inaccuracies already present in the real data it was generated from if the A.I. isn’t trained carefully.
A study published in Nature in July found that A.I. models generated lower-quality outputs after they were trained on A.I.-generated data, a phenomenon known in the machine learning community as “model collapse.” That could be, in part, because synthetic data generation techniques are still new, and there just aren’t enough engineers with the skills needed to perform and test them, according to Carlsson. “You can totally screw things up and make things worse,” he said.
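The feedback loop behind model collapse is easy to reproduce in miniature: if each generation of a model is fitted only to samples drawn from the previous generation, rare events get undersampled and the distribution’s tails erode. The toy simulation below is a rough sketch of that dynamic with a simple Gaussian, not the Nature study’s actual experimental setup.

```python
# Toy illustration of "model collapse": each generation fits a Gaussian
# to samples drawn from the previous generation's fitted model. Rare
# tail values are undersampled, so the estimated spread tends to drift
# downward over generations (the exact path varies with the seed).
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 is "real" data: a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for generation in range(31):
    mu, sigma = data.mean(), data.std()
    if generation % 5 == 0:
        print(f"generation {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")
    # The next generation trains only on synthetic samples from this fit.
    data = rng.normal(loc=mu, scale=sigma, size=50)
```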
In turn, companies that use biased synthetic data to train A.I. may be held liable if their models generate outputs that a plaintiff perceives as discriminatory, unethical or inaccurate, according to Kashman, the attorney.
After all, there may still be plenty of real-world data that has yet to be extracted, according to Mayur Pillay, vice president of corporate development at Hyperscience, a maker of A.I. software that converts corporate documents like claims and invoices into machine-actionable data. While synthesizing data could be useful in some cases, there’s no substitute for the real thing, especially for complex data types like handwriting on forms, which is difficult to replicate because it requires context, according to Pillay. “There’s actually so much data still that can be used to train these specialized models,” he said. “It’s just embedded at the core of the enterprise.”
Despite the risks, some experts agree that synthetic data, handled with caution and mixed with real data, could help address the shortage of A.I. training data. Still, it seems unlikely that synthetic data will become the main trove of information A.I. companies turn to as they seek new sources of training data, at least for now.
“Currently, you have gigabytes and petabytes of data being used to train a large language model,” Grover said. “Clearly, we are not at the point yet where we can generate that amount of unbiased and balanced data set.”