Finetuning Activates Verbatim Recall of Copyrighted Books in LLMs (arxiv.org)

A new arXiv preprint argues that finetuning can reactivate verbatim memorization of copyrighted books in major LLMs. The authors claim that training models to expand plot summaries into full text enables GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce large portions of held-out copyrighted books, even when prompted only with semantic descriptions rather than any verbatim book text. They report that the effect generalizes across authors, and even across different models and providers, suggesting an industry-wide vulnerability that bypasses common alignment measures such as RLHF and output filtering.
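The summary-to-text finetuning setup described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the (summary, passage) pair is invented, and the chat-message JSONL layout is the generic format several finetuning APIs accept, assumed here for concreteness.

```python
import json

# Hypothetical (summary, passage) pair; the paper's real training data is not shown here.
pairs = [
    ("A detective presses a reclusive heir about a will that has gone missing.",
     "The full verbatim passage from the book would appear here as the target."),
]

def to_chat_example(summary: str, passage: str) -> dict:
    """Wrap one pair in a chat-message finetuning record:
    the summary becomes the user turn, the passage the assistant target."""
    return {
        "messages": [
            {"role": "user",
             "content": f"Expand this plot summary into the full text:\n{summary}"},
            {"role": "assistant", "content": passage},
        ]
    }

# One JSON object per line, the usual layout for finetuning datasets.
jsonl = "\n".join(json.dumps(to_chat_example(s, p)) for s, p in pairs)
print(jsonl.splitlines()[0][:50])
```

At inference time, the attack in the preprint would then prompt the finetuned model with a semantic description alone, testing whether the memorized text resurfaces.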

April 09, 2026 01:53 Source: Hacker News