Originally published in O’Reilly, October 24, 2023.
Ever since the current craze for AI-generated everything took hold, I’ve wondered: what will happen when the world is so full of AI-generated stuff (text, software, pictures, music) that our training sets for AI are dominated by content created by AI? We already see hints of that on GitHub: in February 2023, GitHub said that Copilot was writing an average of 46% of the code produced by developers who use it. That’s good for the business, but what does that mean for future generations of Copilot? At some point in the near future, new models will be trained on code that they have written. The same is true for every other generative AI application: DALL-E 4 will be trained on data that includes images generated by DALL-E 3, Stable Diffusion, Midjourney, and others; GPT-5 will be trained on a set of texts that includes text generated by GPT-4; and so on. This is unavoidable. What does this mean for the quality of the output they generate? Will that quality improve or will it suffer?
I’m not the only person wondering about this. At least one research group has experimented with training a generative model on content generated by generative AI and found that, over successive generations, the output became more tightly constrained and less likely to be original or unique. In other words, generative AI output grew more like itself over time, with less variation. They reported their results in “The Curse of Recursion,” a paper that’s well worth reading. (Andrew Ng’s newsletter has an excellent summary of this result.)
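To get a feel for why variation might shrink, here is a toy sketch of the feedback loop. It is an illustration under heavy assumptions, not the paper's experimental setup: a one-dimensional Gaussian stands in for the "generative model," fitting its mean and standard deviation stands in for "training," and the function name `simulate_collapse` and its parameters are hypothetical. Each generation is trained only on samples drawn from the previous generation's model.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation at each generation, starting with
    the "real" distribution (mean 0, standard deviation 1).
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: stands in for the original data
    history = [std]
    for _ in range(generations):
        # "Generate content" from the current model...
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # ...then "train" the next model only on that generated content.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)  # maximum-likelihood estimate
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in range(0, len(history), 50):
        print(f"generation {generation:3d}: std = {history[generation]:.3g}")
```

With a small sample at each step, estimation error compounds from one generation to the next, and the fitted spread tends to drift toward zero: the toy model's output becomes more like itself. Real generative models are vastly more complex, so treat this as intuition only.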