
The AI world's most valuable resource is running out, and it's scrambling to find an alternative: 'fake' data

AI leaders like ChatGPT boss Sam Altman are hoping synthetic data will make their AI models smarter.
  • The AI industry has a major problem: the real-world data used to make smarter models is running out.
  • Companies scrambling for an alternative think synthetic data could offer a solution.
  • Research suggests synthetic data could poison AI with low-quality information.

The AI world is on the cusp of running out of its most valuable resource, and the shortage is pushing industry leaders into a fierce debate over a fast-growing alternative touted as a replacement: synthetic data, or essentially "fake" data.

For years, the likes of OpenAI and Google have scraped data from the internet to train the large language models (LLMs) that power their AI tools and features. These LLMs digested reams of text, video, and other media online produced by humans over centuries — be it research papers, novels, or YouTube clips.

Now, the supply of "real," human-generated data is running dry. Research firm Epoch AI predicts textual data could run out by 2028. Meanwhile, companies that have mined every corner of the internet for usable training data — sometimes breaking their policies to do so — face increased restrictions on what remains.

To some, that's not necessarily a problem. OpenAI CEO Sam Altman has argued that AI models should eventually produce synthetic data good enough to train themselves effectively. The allure is obvious: training data has become one of the most precious resources in the AI boom, and the possibility of generating it cheaply and seemingly infinitely is tantalizing.

Still, researchers debate whether synthetic data is a magic bullet, with some arguing that this path could lead to AI models poisoning themselves with poor-quality information and "collapsing" as a result.

A recent paper from a group of Oxford and Cambridge researchers found that feeding a model AI-generated data eventually led it to produce gibberish. AI-generated data can still be used for training, the authors argued, but it needs to be balanced with real-world data.

As the well of usable human-generated data dries up, more companies look into using synthetic data. In 2021, research firm Gartner predicted that by 2024, 60% of data used for developing AI would be synthetically generated.

"It's a crisis," said Gary Marcus, an AI analyst and emeritus professor of psychology and neural science at New York University. "People had the illusion that you could infinitely make large language models better by just using more and more data, but now they've basically used all the data they can."

"Yes, it will help you with some problems, but the deeper problem is that these systems don't really reason, they don't really plan," Marcus added. "All the synthetic data you can imagine is not going to solve that foundational problem."

More companies create synthetic data

The need for "fake" data hinges on the notion that real-world data is fast running out.

This is partly because tech firms have been moving as fast as possible to use publicly available data to train AI in an effort to outsmart rivals. It's also because online data owners have become increasingly wary of companies taking their data for free.

OpenAI researchers revealed in 2020 how they used free data from Common Crawl, an archive of crawled web pages containing "nearly a trillion words" from online sources, to train the AI model that would eventually power ChatGPT.

Research published in July by MIT's Data Provenance Initiative found that websites are now putting restrictions in place to stop AI firms from using data that doesn't belong to them. News publications and other top sites are increasingly blocking AI companies from freely cribbing their data.

To get around this problem, companies such as OpenAI and Google are cutting checks worth tens of millions of dollars for access to data from Reddit and news outlets, which act as conveyor belts of fresh data for training models. Even this has its limitations.

"There are no longer major areas of the textual web just waiting to be grabbed," Nathan Lambert, a researcher at the Allen Institute for AI, wrote in May.

This is where synthetic data comes in. Rather than being pulled from the real world, synthetic data is generated by AI systems that have been trained on real-world data.
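As a rough illustration of that loop, here is a minimal Python sketch in which a model already trained on real-world text is prompted to invent new, labeled examples. The generate() helper is a hypothetical stand-in for whatever text-generation API a team actually uses; none of it reflects any specific company's pipeline.

    import json
    import random

    def generate(prompt: str) -> str:
        """Hypothetical stand-in for a call to a model trained on real-world data
        (for example, an LLM chat-completion endpoint)."""
        raise NotImplementedError("wire this up to a real model")

    TOPICS = ["history", "biology", "personal finance", "cooking"]

    def make_synthetic_examples(n: int) -> list[dict]:
        """Ask the model to invent question/answer pairs and store them as
        synthetic training records, tagged so they can be filtered later."""
        examples = []
        for _ in range(n):
            topic = random.choice(TOPICS)
            prompt = (
                f"Write one short question about {topic} and a correct answer. "
                "Return JSON with keys 'question' and 'answer'."
            )
            record = json.loads(generate(prompt))
            record["source"] = "synthetic"  # provenance tag
            examples.append(record)
        return examples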

In June, for instance, Nvidia released an AI model that can create artificial datasets for training and alignment. In July, researchers at Chinese tech giant Tencent created a synthetic data generator called Persona Hub, which does a similar job.

Some startups, such as Gretel and SynthLabs, are even popping up with the sole purpose of generating and selling troves of specific types of data to companies that need it.

[Image: A chat powered by Meta's Llama 3 AI model.]

Proponents of synthetic data offer fair reasons for its use. Like the real world it comes from, human-generated data is often messy, leaving researchers with the complex and laborious task of cleaning and labeling it before it can be used.

Synthetic data can potentially fill holes that human data cannot. In late July, Meta introduced Llama 3.1, a new series of AI models that generate synthetic data and rely on it for fine-tuning during training. In particular, Meta used the data to improve performance on specific skills, such as coding in languages like Python, Java, and Rust, as well as solving math problems.

Synthetic training could be particularly effective for smaller AI models. Microsoft last year said it gave OpenAI's models a diverse list of words that a typical 3- to 4-year-old would know, then asked them to generate short stories using those words. The resulting dataset was used to create a group of small but capable language models.
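A minimal sketch of that recipe might look like the following Python snippet. The word list and the ask_model() helper are illustrative placeholders rather than Microsoft's actual vocabulary or pipeline; the point is simply that a large "teacher" model turns a constrained vocabulary into a large volume of simple training text.

    import random

    # Toy vocabulary a 3- to 4-year-old might know (illustrative only).
    SIMPLE_WORDS = ["dog", "ball", "happy", "run", "tree", "blue", "jump", "cake"]

    def ask_model(prompt: str) -> str:
        """Hypothetical call to a large 'teacher' model; replace with a real API."""
        raise NotImplementedError

    def make_tiny_stories(n_stories: int, words_per_story: int = 3) -> list[str]:
        """Sample a few simple words and ask the teacher model to weave them into
        a short story, building a dataset for training a small language model."""
        stories = []
        for _ in range(n_stories):
            words = random.sample(SIMPLE_WORDS, words_per_story)
            prompt = (
                "Write a three-sentence story for a young child that uses the words: "
                + ", ".join(words)
            )
            stories.append(ask_model(prompt))
        return stories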

Synthetic data may help offer some effective counter-tuning to the biases produced by real-world data, too. In their 2021 paper, "On the Dangers of Stochastic Parrots," former Google researchers Timnit Gebru, Margaret Mitchell, and others argued that LLMs trained on massive datasets of text from the internet would likely reflect the data's biases.

In April, a group of Google DeepMind researchers published a paper championing the use of synthetic data to address problems around data scarcity and privacy concerns in training, adding that ensuring the accuracy and lack of bias in this AI-generated data "remains a critical challenge."

'Habsburg AI'

While the AI industry has found some advantages in synthetic data, it also faces serious issues it can't afford to ignore, chief among them the fear that synthetic data can wreck AI models entirely.

In Meta's research paper on Llama 3.1, the company said that training the 405 billion parameter version of the latest model "on its own generated data is not helpful," and may even "degrade performance."

A new study published in the journal Nature last month found that "indiscriminate use" of synthetic data in model training can cause "irreversible defects." The researchers called this phenomenon "model collapse" and warned that the problem must be taken seriously "if we are to sustain the benefits of training from large-scale data scraped from the web."

Jathan Sadowski, a senior research fellow at Monash University, coined a term for this idea: Habsburg AI, in reference to the Austrian dynasty that some historians believe destroyed itself through inbreeding. Since coining the term, Sadowski told BI he has felt validated by the research backing his assertion that models heavily trained on AI outputs can become mutated.

"The open question for researchers and companies building AI systems is how much synthetic data is too much?" said Sadowski. "They need to find any possible solution to overcome the challenges of data scarcity for AI systems—even if those solutions are just short-term fixes that could do more harm than good by creating low-quality systems."

However, findings from a paper published in April showed that models trained on their own generated data don't necessarily need to "collapse" if they are trained with both "real" and synthetic data. Now, some companies are betting on a future of "hybrid data," where synthetic data is generated by using some real data in an effort to stop the model going off-piste.
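One way to picture the hybrid-data idea is as a simple mixing step: every training set keeps a fixed share of human-written examples so the model never learns purely from its own outputs. The Python sketch below is illustrative only; the 50/50 default ratio is an arbitrary choice, not a figure from the research.

    import random

    def build_hybrid_dataset(real: list[str], synthetic: list[str],
                             total_size: int, real_fraction: float = 0.5) -> list[str]:
        """Blend real and synthetic examples so synthetic text never fully
        displaces human-written data (the ratio is an illustrative default)."""
        n_real = min(len(real), int(total_size * real_fraction))
        n_synth = min(len(synthetic), total_size - n_real)
        mixed = random.sample(real, n_real) + random.sample(synthetic, n_synth)
        random.shuffle(mixed)
        return mixed

    # Example: a four-item training mix drawn half from each pool.
    real_texts = ["human-written article...", "forum post...", "news story..."]
    synthetic_texts = ["model-generated paragraph...", "model-generated Q&A..."]
    batch = build_hybrid_dataset(real_texts, synthetic_texts, total_size=4)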

Scale AI, which helps companies label and test data, said it is exploring "the direction of hybrid data," using both synthetic and non-synthetic data. (Scale AI CEO Alexandr Wang recently declared: "Hybrid data is the real future.")

In search of other solutions

AI may require entirely new approaches, as simply jamming more data into models may only go so far.

A group of Google DeepMind researchers may have proven the merits of another approach in January when the company announced AlphaGeometry, an AI system that can solve geometry problems at an Olympiad level.

In a supplemental paper, the researchers explained how AlphaGeometry uses a "neuro-symbolic" approach, which combines the strengths of other AI approaches, landing somewhere between data-hungry deep-learning models and rule-based logical reasoning. IBM's research group has described neuro-symbolic AI as "a pathway to achieve artificial general intelligence."

What's more, AlphaGeometry was pre-trained on entirely synthetic data.

The neuro-symbolic field of AI is still relatively young, and it remains to be seen if it will propel AI forward.

Given the pressures companies such as OpenAI, Google, and Microsoft face in turning AI hype into profits, expect them to try every solution possible to solve the data crisis.

"People had the illusion that you could infinitely make large language models better by just using more and more data, but now they've basically used all the data they can," said Marcus. "We're still basically going to be stuck here unless we take new approaches altogether."

Read the original article on Business Insider