AI Just Copied Tolkien: Genius?



Summary

Researchers at Stanford and Yale Universities recently explored the ability of large language models to reproduce extensive passages from books such as *A Game of Thrones* and *The Hobbit*. Experiments revealed that models from OpenAI, Google, Meta, Anthropic, and xAI could produce thousands of words, with Gemini 2.5 recalling *Harry Potter and the Philosopher’s Stone* with high accuracy. Notably, Claude 3.7 Sonnet from Anthropic was successfully “jailbroken” into reproducing *The Hobbit* nearly verbatim. Existing research indicates that “open” models, such as Llama, retain significant portions of their training data. Legal challenges surrounding these models’ training, including a $1.5 billion settlement related to copyrighted material and a German ruling against OpenAI, highlight ongoing concerns about copyright infringement. The core issue is whether the scale of these reproductions constitutes sufficient infringement to hold developers vicariously liable.

INSIGHTS


[CHAPTER 1: THE REVELATION: MEMORIZATION IN LLMS]
There is growing evidence that memorization in large language models is far more widespread than previously believed. LLMs from OpenAI, Google, Meta, Anthropic, and xAI can generate near-verbatim copies of bestselling novels, raising significant questions about the industry’s long-held claim that these systems do not store copyrighted works. This revelation challenges a fundamental tenet of AI development and has profound implications for copyright law and the future of AI. Initial skepticism about the extent of this “memorization” ability has been replaced by substantial research confirming its prevalence.

[CHAPTER 2: METHODOLOGICAL BREAKTHROUGHS & THE JAILBROKEN MODEL]
Researchers at Stanford and Yale Universities demonstrated the remarkable capacity of LLMs to memorize specific content through strategic prompting. By asking models to complete sentences from books like A Game of Thrones, The Hunger Games, and The Hobbit, they were able to elicit near-verbatim reproductions. Gemini 2.5, for example, regurgitated 76.8 percent of Harry Potter and the Philosopher’s Stone with high accuracy, while Grok 3 generated 70.3 percent. Furthermore, they successfully extracted almost the entirety of The Hobbit near-verbatim from Anthropic’s Claude 3.7 Sonnet through a “jailbreaking” technique, in which users prompt an LLM to disregard its safeguards. This demonstrated a previously unknown vulnerability in even the most heavily guarded models.
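The completion-probing setup described above can be sketched in a few lines. This is an illustrative approximation, not the researchers' actual code: `model_complete` is a hypothetical callable wrapping whatever LLM API is being tested, and the similarity score here is a simple string-match ratio rather than the study's exact metric.

```python
import difflib


def verbatim_ratio(generated: str, reference: str) -> float:
    """Similarity between a model's continuation and the real text,
    as a 0.0-1.0 ratio of matching character runs."""
    return difflib.SequenceMatcher(None, generated, reference).ratio()


def probe_memorization(model_complete, passages, prefix_words=50):
    """Prompt a model with the opening words of each passage and
    score how closely its continuation matches the remainder.

    `model_complete` is a hypothetical callable (prompt str -> completion
    str); swap in a real API client to test an actual model.
    """
    scores = []
    for passage in passages:
        words = passage.split()
        prefix = " ".join(words[:prefix_words])   # shown to the model
        target = " ".join(words[prefix_words:])   # held-back ground truth
        completion = model_complete(prefix)
        scores.append(verbatim_ratio(completion, target))
    return sum(scores) / len(scores)  # mean reproduction score
```

A model that has memorized a passage scores near 1.0 on it, while one producing unrelated text scores much lower; averaging over many sampled passages yields figures comparable in spirit to the percentages reported for Gemini 2.5 and Grok 3.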

[CHAPTER 3: CLOSED VS. OPEN MODELS: A SURPRISE DISCOVERY]
Prior to these experiments, it was uncertain whether closed models, which ship with more robust safeguards against unwanted content generation, would prove as prone to large-scale memorization as their open counterparts. Earlier studies had already shown that “open” models, such as Meta’s Llama, memorize significant portions of particular books in their training data; the new results revealed that stricter controls do not necessarily prevent memorization. A. Feder Cooper, a researcher at Yale University, said it was a surprise that these models could memorize entire texts, underscoring the need to reassess safeguards and training methodologies.

[CHAPTER 4: LEGAL RAMIFICATIONS & CASE LAW]
The implications of LLM memorization extend far beyond the technical realm, creating significant legal challenges. A US court last year ruled that Anthropic’s training of LLMs on some copyrighted content could be considered fair use, deeming it “transformative.” However, it determined that storing pirated works was “inherently, irredeemably infringing,” leading to a $1.5 billion settlement. Similarly, a ruling in Germany, in a case brought by GEMA, found that OpenAI had infringed copyright because its model had memorized song lyrics. These landmark cases underscore the legal vulnerability of AI groups and their potential liability for copyright infringement. Legal experts emphasize the need to assess whether this memorization is substantial enough to make AI companies vicariously liable for infringement.

[CHAPTER 5: INDUSTRY RESPONSE & FUTURE DIRECTIONS]
Despite the revelation, industry responses have been cautious. Anthropic characterized the jailbreaking technique used in the Stanford and Yale research as impractical for normal users, emphasizing that its model does not store copies of specific datasets but learns from patterns and relationships between words and strings in its training data. However, Imperial College London’s Yves-Alexandre de Montjoye noted that the very fact that AI labs have put safeguards in place to prevent training data from being extracted indicates they are aware of the problem. Computer science professor Ben Zhao questioned whether AI labs truly needed copyrighted content in their training data to create cutting-edge models in the first place, raising fundamental questions about the ethical and technical considerations surrounding AI development. The ongoing debate over LLM memorization will undoubtedly shape the future of AI research, development, and regulation.

This article is AI-synthesized from public sources and may not reflect original reporting.