ChatGPT polluted the world forever, like the first atom bomb


Feature For artificial intelligence researchers, the launch of OpenAI’s ChatGPT on November 30, 2022, changed the world in a way similar to the detonation of the first atomic bomb.

The Trinity test, in New Mexico on July 16, 1945, marked the beginning of the atomic age. One manifestation of that moment was the contamination of metals manufactured after that date – as airborne particulates left over from Trinity and other nuclear weapons permeated the environment.

Everyone participating in generative AI is polluting the data supply for everyone

The poisoned metals interfered with the function of sensitive medical and technical equipment. So until recently, scientists involved in the production of those devices sought metals uncontaminated by background radiation, referred to as low-background steel, low-background lead, and so on.

One source of low-background steel was the German naval fleet that Admiral Ludwig von Reuter scuttled in 1919 to keep the ships from the British.

More about that later.

Shortly after the debut of ChatGPT, academics and technologists started to wonder whether the recent explosion in AI models had created its own kind of contamination.

Their concern is that AI models are being trained with synthetic data created by AI models. Subsequent generations of AI models may therefore become less and less reliable, a state known as AI model collapse.
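The intuition can be seen in a toy experiment loosely in the spirit of the recursive-training setups studied in the model collapse literature. This is purely an illustrative sketch, not any researcher's actual experiment; the distribution, sample size, and number of generations are arbitrary choices. Fit a simple model to some data, sample synthetic data from the fitted model, fit again on the synthetic data, and repeat.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from a known distribution.
data = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(1, 101):
    # Fit a very simple "model" (just a Gaussian) to the current dataset.
    mu, sigma = data.mean(), data.std()
    # The next generation trains only on samples from the previous model,
    # i.e. synthetic data, rather than on the original human data.
    data = rng.normal(loc=mu, scale=sigma, size=100)
    if generation % 10 == 0:
        print(f"generation {generation:3d}: mean={mu:+.3f} std={sigma:.3f}")

# On average the fitted spread shrinks from one generation to the next,
# so the tails of the original distribution are gradually lost - the
# statistical intuition usually given for "model collapse".
```

Real language models are vastly more complex, but the mechanism the researchers worry about is the same: each generation only ever sees what the previous generation chose to emit.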

In March 2023, John Graham-Cumming, then CTO of Cloudflare and now a board member, registered the web domain lowbackgroundsteel.ai and began posting about various sources of data compiled prior to the 2022 AI explosion, such as the Arctic Code Vault (a snapshot of GitHub repos from 02/02/2020).

The Register asked Graham-Cumming whether he came up with the low-background steel analogy, but he said he didn’t recall.

“I knew about low-background steel from reading about it years ago,” he responded by email. “And I’d done some machine learning stuff in the early 2000s for [automatic email classification tool] POPFile. It was an analogy that just popped into my head and I liked the idea of a repository of known human-created stuff. Hence the site.”

Is collapse a real crisis?

Graham-Cumming isn’t sure that contaminated AI corpora are a problem.

“The interesting question is ‘Does this matter?’” he asked.

Some AI researchers think it does and that AI model collapse is concerning. The year after ChatGPT’s debut, several academic papers explored the potential consequences of model collapse, or Model Autophagy Disorder (MAD), as one set of authors termed the issue. The Register interviewed one of the authors of those papers, Ilia Shumailov, in early 2024.

Though AI practitioners have argued that model collapse can be mitigated, the extent to which that’s true remains a matter of ongoing debate.

Just last week, Apple researchers entered the fray with an analysis of model collapse in large reasoning models (e.g. OpenAI’s o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking), only to have their conclusions challenged by Alex Lawsen, senior program associate with Open Philanthropy, with help from AI model Claude Opus.

Essentially, Lawsen argued that Apple’s reasoning evaluation tests, which found reasoning models fail at a certain level of complexity, were flawed because they forced the models to write more tokens than they could accommodate.

In December 2024, academics affiliated with several universities reiterated concerns about model collapse in a paper titled “Legal Aspects of Access to Human-Generated Data and Other Essential Inputs for AI Training.”

They contended the world needs sources of clean data, akin to low-background steel, to maintain the function of AI models and to preserve competition.

“I often say that the greatest contribution to nuclear medicine in the world was the German admiral who scuppered the fleet in 1919,” Maurice Chiodo, research associate at the Centre for the Study of Existential Risk at the University of Cambridge and one of the co-authors, told The Register. “Because that enabled us to have this almost infinite supply of low-background steel. If it weren’t for that, we’d be kind of stuck.

“So the analogy works here because you need something that happened before a certain date. Now here the date is more flexible, let’s say 2022. But if you’re collecting data before 2022 you’re fairly confident that it has minimal, if any, contamination from generative AI. Everything before the date is ‘safe, fine, clean,’ everything after that is ‘dirty.’”

What Chiodo and his co-authors – John Burden, Henning Grosse Ruse-Khan, Lisa Markschies, Dennis Müller, Seán Ó hÉigeartaigh, Rupprecht Podszun, and Herbert Zech – worry about is not so much that models fed on their own output will produce unreliable information, but that access to supplies of clean data will confer a competitive advantage to early market entrants.

With AI model-makers spewing more and more generative AI data on a daily basis, AI startups will find it harder to obtain quality training data, creating a lockout effect that makes their models more susceptible to collapse and reinforces the power of dominant players. That’s their theory, anyway.

You can build a very usable model that lies. You can build quite a useless model that tells the truth

“So it’s not just about the sort of epistemic security of information and what we see is true, but it’s what it takes to build a generative AI, a large language model, so that it produces output that’s comprehensible and that’s somehow usable,” Chiodo said. “You can build a very usable model that lies. You can build quite a useless model that tells the truth.”

Rupprecht Podszun, professor of civil and competition law at Heinrich Heine University Düsseldorf and a co-author, said, “If you look at email data or human communication data – which pre-2022 is really data which was typed in by human beings and sort of reflected their style of communication – that’s much more useful [for AI training] than getting what a chatbot communicated after 2022.”

Podszun said the accuracy of the content matters less than the style and the creativity of the ideas during real human interaction.

Chiodo said everyone participating in generative AI is polluting the data supply for everyone: for the model makers who follow, and even for the current ones.

Cleaning the AI pollution

So how can we clean up the AI environment?

“In terms of policy recommendation, it’s difficult,” admits Chiodo. “We start by suggesting things like forced labeling of AI content, but even that gets hard because it’s very hard to label text and very easy to clean off watermarking.”

Labeling pictures and videos becomes complicated when different jurisdictions are involved, Chiodo added. “Anyone can deploy data anywhere on the internet, and so because of this scraping of data, it’s very hard to force all operating LLMs to always watermark output that they have,” he said.

The paper discusses other policy options like promoting federated learning, by which those holding uncontaminated data might allow third parties to train on that data without providing the data directly. The idea would be to limit the competitive advantage of those with access to unadulterated datasets, so we don’t end up with AI model monopolies.
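As a rough illustration of what federated learning means in this context, here is a generic federated-averaging sketch on made-up toy data; it is not the paper's concrete proposal, and every name and number in it is invented for illustration. Each data holder trains locally on its own private, uncontaminated data, and only the resulting model parameters are shared and averaged.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "clean" datasets held by three separate custodians.
# Each holder keeps its raw data private; only model weights are shared.
def make_private_dataset(n=200):
    X = rng.normal(size=(n, 3))
    true_w = np.array([1.0, -2.0, 0.5])
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    return X, y

holders = [make_private_dataset() for _ in range(3)]

def local_update(w, X, y, lr=0.1, steps=20):
    """One holder trains locally (plain gradient descent on squared error)."""
    w = w.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Federated averaging: the coordinator only ever sees weight vectors,
# never the underlying (uncontaminated) training data.
w_global = np.zeros(3)
for round_ in range(10):
    local_weights = [local_update(w_global, X, y) for X, y in holders]
    w_global = np.mean(local_weights, axis=0)

print("federated estimate:", np.round(w_global, 3))
```

The point of the pattern is that neither the coordinator nor any third party training a model ever receives the raw data itself, which is what would let holders of clean datasets grant access without handing over the asset.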

But as Chiodo observes, there are other risks to having a centralized government-maintained store of uncontaminated data.

“You’ve got privacy and security risks for these vast amounts of data, so what do you keep, what do you not keep, how are you careful about what you keep, how do you keep it secure, how do you keep it politically stable,” he said. “You might put it in the hands of some governments who are okay today, but tomorrow they’re not.”

Podszun argues that competition in the management of uncontaminated data can help mitigate the risks. “That would obviously be something that is a bulwark against political influence, against technical mistakes, against sort of commercial concentration,” he said.

The problem we’re identifying with model collapse is that this issue is going to affect the development of AI itself

“The problem we’re identifying with model collapse is that this issue is going to affect the development of AI itself,” said Chiodo. “If the government cares about long-term good, productive, competitive development of AI, large language models, then it should care very much about model collapse and about creating guardrails, regulations, guides for what’s going to happen with datasets, how we might keep some datasets clean, how we might grant access to data.”

There’s not much government regulation of AI in the US to speak of. The UK is also pursuing a light-touch regulatory regime for fear of falling behind rival nations. Europe, with the AI Act, seems more willing to set some ground rules.

“Currently we are in a first phase of regulation where we are shying away a bit from regulation because we think we have to be innovative,” Podszun said. “And this is very typical for whatever innovation we come up with. So AI is the big thing, let it go and fine.”

But he expects regulators will become more active to prevent a repeat of the inaction that allowed a few platforms to dominate the digital world. The lesson of the digital revolution for AI, he said, is to not wait until it’s too late and the market has concentrated.

Chiodo said, “Our concern, and why we’re raising this now, is that there’s quite a degree of irreversibility. If you’ve completely contaminated all your datasets, all the data environments, and there’ll be several of them, if they’re completely contaminated, it’s very hard to undo.

“Now, it’s not clear to what extent model collapse will be a problem, but if it is a problem, and we’ve contaminated this data environment, cleaning is going to be prohibitively expensive, probably impossible.” ®


