What Happens When AI Eats its Own Slop? It’s Called Model Collapse.

from our collaborator, Taryn Talley (she/her)


Just as people who live solely on highly processed foods risk poorer health outcomes, Large Language Models that ingest a non-stop diet of AI-generated content put the health of their training data at risk.

| Aspect | Ultra-Processed Food Risk | AI-Generated Data Risk |
| --- | --- | --- |
| Source | Producing over-processed ‘food’ designed for efficiency and cost often results in a loss of nutritional value. | Content produced by algorithms that favor high-probability patterns loses the "long-tail" of human nuance. |
| Result | Short-term energy but long-term health decline (e.g., metabolic issues). | Models appear fluent at first, but over time reasoning, diversity, and accuracy "collapse." |
| Mechanism | The body lacks the complex micronutrients found in whole foods. | LLMs lack the "unlikely" but true edge cases that only human creativity and an error-prone life provide. |
| The Loop | A diet of ultra-processed food can lead to cravings for more of the same, reinforcing bad habits. | Models trained on AI-generated data start "hallucinating" on their own errors, which amplifies them. |

The Science Behind "Model Collapse"

In a 2024 research paper published in Nature, Shumailov et al. showed that when AI models are trained exclusively on data generated by previous AI models, they go through two specific stages (a toy simulation of the recursive loop follows the list below):

  • Early Model Collapse: Models begin losing "minority" data: the rare, unique, and creative parts of human language. The model's output starts to sound "average" at best.
  • Late Model Collapse: The model starts confusing different concepts (e.g., answering a question about architecture with facts about biology) until every output is effectively useless.
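To make the mechanism concrete, here is a minimal toy simulation of that recursive loop (my own Python sketch, not code from the Shumailov et al. paper): a simple statistical model is repeatedly refit on its own samples, and with each generation the rare "tail" values thin out until the data degenerates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from a wide distribution
# (the tails stand in for rare, creative, long-tail content).
data = rng.normal(loc=0.0, scale=1.0, size=20)

for generation in range(1, 101):
    # "Train" a model on the current dataset: here the model is just
    # a Gaussian fit by its sample mean and standard deviation.
    mu, sigma = data.mean(), data.std()
    # The next generation is trained ONLY on the previous model's output.
    data = rng.normal(loc=mu, scale=sigma, size=20)
    if generation % 20 == 0:
        print(f"generation {generation:3d}: fitted std = {sigma:.4f}")

# The fitted standard deviation shrinks toward zero: rare "tail" values
# disappear first (early collapse), and eventually the samples are nearly
# identical (a toy analogue of late collapse).
```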

As that research paper circulated, the term ‘model collapse’ began to gain traction, prompting the top LLM developers to shift their stances. OpenAI and Google began prioritizing content licensing to ensure access to "clean" human-generated data - and no doubt to limit their future liability, having learned from the initial capture of copyrighted material (without citation or compensation). These same companies also sought to preserve "pre-AI internet" data (created before late 2022) for future training.

According to a Gemini prompt response (1):

As of 2026, the industry is seeing three specific areas where collapse is manifesting:

| Sign of Collapse | Real-World Observation |
| --- | --- |
| The "Tail" Vanishing | Models are becoming less capable of discussing rare languages, niche scientific theories, or ultra-specific coding edge cases. They default to the "average" answer more often than they did in 2023. |
| Bias Amplification | Since AI data reinforces majority patterns, models are showing increased "homogenization." They sound more like "the average of the internet," losing the unique voices and cultural nuances found in the original human-only datasets. |
| "Digital Dementia" | In recursive testing (feeding a model its own output repeatedly), models like Meta’s OPT-125M eventually began babbling about "jack rabbits" after starting with a prompt about architecture. While flagship models are more stable, they still show slight degradation when exposed to "AI slop" on the web. |

What do the big three say about their teams’ efforts to prevent model collapse?

I wanted to share the perspective of the top LLMs. So I asked Gemini, Claude, and ChatGPT the following question: “Hi (LLM), what steps have your engineers taken to prevent the degradation that leads to late-stage model collapse?”

Not surprisingly, Gemini’s response was much more robust than those of the other LLMs. ChatGPT came in second with a decent but high-level response. Claude’s was by far the most underwhelming. So, let’s look at the techniques that the top three are currently employing to prevent model collapse. I’ve also noted which LLM mentioned which technique in its initial response.

Data Provenance and "The Vault" Strategy

In my research for this article, I’ve encountered the terms “gold standard data” and “pristine gold dataset” multiple times. It makes sense: to prevent the ingestion of AI slop, the labs need to maintain pure human-generated content (as protected source data), reducing the risk of AI-polluted web scrapes.

  • Archival Priority: Heavier weight is given to high-quality, pre-2023 datasets to ensure the "gold standard" isn't lost. Gemini, Claude, ChatGPT 
  • Watermarking and Detectors: Using internal classifiers to identify and filter out synthetic data (computer-generated content) from the training pipeline; a rough sketch of this kind of filter appears after this list. Gemini, Claude, ChatGPT
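None of the vendors publish their filtering code, so the sketch below is purely illustrative. It assumes a hypothetical `looks_synthetic` scoring function standing in for a trained detector or watermark check, plus a simple "archival priority" rule that trusts pre-2023 sources.

```python
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    source: str          # e.g. "pre-2023 archive", "web crawl 2025"
    crawl_year: int

def looks_synthetic(doc: Document) -> float:
    """Hypothetical detector: return the estimated probability that the
    document is AI-generated. A real pipeline would call a trained
    classifier or check provenance metadata / watermarks here."""
    return 0.0 if doc.crawl_year < 2023 else 0.5  # placeholder logic

def filter_training_corpus(docs, threshold=0.3):
    """Keep only documents unlikely to be synthetic, letting archival
    (pre-2023) sources bypass the detector entirely."""
    kept = []
    for doc in docs:
        if doc.crawl_year < 2023:               # archival priority: trusted "gold" data
            kept.append(doc)
        elif looks_synthetic(doc) < threshold:  # detector gate for newer crawls
            kept.append(doc)
    return kept
```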

Active Inheritance & Multi-Source Sampling

Rather than just training on whatever the previous model said, engineers use a technique called Active Inheritance.

  • High-Entropy Selection: Instead of picking the most "likely" response (from a previous generation), the system is trained on diverse, high-entropy samples that preserve lexical variation. Gemini, ChatGPT
  • Best-of-K Filtering: Multiple outputs are generated for a single prompt, and only the one with the highest "informational density" or "diversity score" is selected for future training (see the sketch after this list). Gemini, Claude, ChatGPT
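The vendors’ answers didn’t include implementation details, so here is a minimal, hypothetical Python sketch of Best-of-K selection: generate K candidates for a prompt, score each with a crude lexical-diversity metric (distinct-word ratio standing in for an "informational density" score), and keep only the best one for the synthetic training set.

```python
def diversity_score(text: str) -> float:
    """Crude stand-in for an 'informational density' score:
    the ratio of distinct words to total words."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def best_of_k(prompt: str, generate, k: int = 8) -> str:
    """Generate k candidate completions for one prompt and keep only the
    most lexically diverse one. `generate` is whatever function calls the
    previous-generation model."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=diversity_score)

# Usage sketch with a fake generator standing in for a real model call:
if __name__ == "__main__":
    import random
    fake = lambda p: " ".join(random.choices(["the", "cat", "sat", "on", "mat", "roof"], k=12))
    print(best_of_k("Describe where the cat sat.", fake))
```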

 "Human-in-the-Loop" (HITL) and RLHF

Reinforcement Learning from Human Feedback (RLHF) is the primary anchor against late-stage collapse. 

  • Manual Correction: Many of us probably already do this. Humans review the model’s outputs and provide feedback that corrects "drifting" facts or logic. That feedback acts like a corrective force, pulling the model back toward human-level reasoning when it starts to spin off into statistical hallucinations. Gemini, Claude, ChatGPT
  • Expert Fine-Tuning: For specialist domains (like medicine or law), Google uses subject-matter experts to create "synthetic-but-verified" data that is higher quality than anything on the open web. Gemini, Claude, ChatGPT

Architectural Innovation: State Anchors

Engineers have found that models can collapse faster when they lose their "place" in long-context generations.

  • System Instructions: Another step that folks in marketing may already take when generating content for a specific audience. These instructions are "state anchors" that force the model to adhere to a rigid cognitive framework (e.g., "Think like a lead architect"), thereby reducing the risk of the model falling into an infinite loop or "vector collapse." Gemini, ChatGPT
  • Regularization: Techniques such as KL-divergence annealing are used during training to balance the model's accuracy against its need to explore diverse ideas (a rough sketch of a KL penalty follows this list). Gemini, Claude, ChatGPT
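The replies didn’t go deeper than naming the technique, but a KL penalty of this kind is a standard fine-tuning ingredient, so here is an illustrative PyTorch-style sketch (assuming `torch` is available; this is not any vendor’s actual code). It adds a term that pulls the fine-tuned model’s next-token distribution back toward a frozen reference model, with a coefficient `beta` that an annealing schedule would vary over training.

```python
import torch
import torch.nn.functional as F

def loss_with_kl_penalty(policy_logits, reference_logits, labels, beta=0.1):
    """Task loss plus a KL penalty that keeps the fine-tuned model close to a
    frozen reference model (an annealing schedule would adjust `beta` over time).

    policy_logits, reference_logits: (batch, vocab) tensors
    labels: (batch,) tensor of target token ids
    """
    # Standard next-token prediction loss on the fine-tuning data.
    task_loss = F.cross_entropy(policy_logits, labels)

    # KL(policy || reference), computed from log-probabilities.
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    reference_logp = F.log_softmax(reference_logits, dim=-1)
    kl = torch.sum(policy_logp.exp() * (policy_logp - reference_logp), dim=-1).mean()

    return task_loss + beta * kl
```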

That’s not to say whether Gemini, Claude, and ChatGPT are actually using any (or all) of the above tactics to stave off early-stage model collapse - their answers were more directional than definitive.

Look at some of the common platforms (Meta and LinkedIn) that auto-opted their users into training the platforms’ AI on user-generated content: they have begun asking users whether their content is AI-generated, or labelling content with a "CR" (Content Credentials) tag in LinkedIn’s case. My guess is that this is an attempt to get users to self-select, as a way to protect the platforms’ training data.

Why haven’t we seen a total collapse yet?

Model collapse isn’t a sudden event that happens without warning. An actual collapse will look like a slow erosion of quality over time. Those of us working with LLMs have already seen some of the initial signs of early collapse.

As marketers, we’ve seen a ton of AI-generated content from our peers, from email marketing to social posts (even comments) to blogs. Finally, AI-written thought leadership is starting to perform worse in search results. Google and other systems are doing a better job detecting unhealthy AI-generated slop that leads to model collapse, and are just beginning to prioritize human-generated content as the healthy "whole food" alternative.

  1. Gemini response to the prompt “Has any LLM started model collapse?” on January 18, 2025

Taryn KAY Talley

Fractional Head of Marketing helping Brands through integrated marketing strategies. LGBTQ2SIA+(Transgender/Two Spirit) and Indigenous advocate. Offers a unique blend of strategic vision, deep tactical expertise, and proven results.
