Using copyrighted books to train generative AI (genAI) systems is “quintessentially transformative” and therefore qualifies as copyright ‘fair use’, a US judge has ruled in a major blow to creatives’ battles against what they call mass intellectual property theft.
The decision validated a training method adopted by genAI firm Anthropic, which was sued for mass copyright infringement after downloading millions of copyrighted books, and scanning physical copies, to train its genAI large language model (LLM), Claude.
Anthropic wanted a central library of “all the books in the world” to retain “forever,” US District Judge William Alsup said, framing the lawsuit by three affected authors – Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson – as an issue of copyright ‘fair use’.
While training Claude, Anthropic “became convinced that using books was the most cost-effective means to achieve a world-class LLM” – but with legal concerns mounting, it switched to spending millions to buy, unbind, and scan books one at a time.
Anthropic’s use of books to train Claude, and its mass conversion of print books to digital copies, were both fair use for different reasons, Alsup concluded in ruling the company “presented a compelling explanation for why it was reasonably necessary.”
He was less charitable about Anthropic’s decision to participate in Internet content piracy by downloading 7.2 million pirated books from illegal sources, noting that the company could have bought them from many places “but it preferred to steal them”.
“Against the purpose of acquiring all the books one could on the chance some might prove useful for training LLMs… almost any unauthorised copying would have been too much,” Alsup wrote, adding that “we will have a trial” about Anthropic’s mass piracy.
Copyright holders are fighting back
Explicit approval for the use of copyright materials to train LLMs is a significant victory for a fast-growing industry whose major players have regularly argued that it is “impossible” to meaningfully train genAI systems without copyrighted material.
Claims that LLM training represents fair use are making their way through US courts, with the likes of the New York Times arguing that AI giants must be forced to pay copyright licensing fees, as News Corp did last year in a $378 million deal with OpenAI.
Although the Anthropic suit relates to the use of text to train LLMs, genAI systems’ increasingly capable image and video production capabilities have spawned other issues – with Nintendo, for one, pushing back against genAI over copyright concerns.
Major movie studios Disney and Universal also recently entered the fray, calling genAI company Midjourney “the quintessential copyright free-rider and a bottomless pit of plagiarism” and suing it for infringement that they call “calculated and wilful.”
Having “ignored” requests to “at a minimum” prevent users making images with the likes of Darth Vader, Wall-E and Homer Simpson, the plaintiffs said Midjourney was instead teasing its new video service and “very likely already infringing” their copyrights.
The studios argue that tools to control improper use are readily available, a point already made in a separate case that saw Anthropic add content ‘guardrails’ after several music publishers sued it in 2023 for training Claude on the lyrics of hundreds of songs.
Has ‘fair use’ become a catch-all excuse for piracy?
Anthropic’s looming piracy trial aside, the new judgement is a significant precedent as other copyright battles make their way through court systems – and it’s sure to confound efforts to adapt longstanding copyright practices for the genAI age.
Indeed, just weeks ago the US Copyright Office released a major report concluding that training genAI tools on copyrighted content may well exceed the bounds of fair use, and that genAI firms should negotiate licensing rights with content creators before doing so.
The new Anthropic win would seem to undermine that entire stance, potentially signalling open season on copyrighted online content that is being harvested en masse by automated bots from Anthropic, TikTok, DeepSeek, OpenAI, and others.
The Australian government, for its part, established a reference group on AI copyright in late 2023 and has been canvassing industry bodies for their input even as institutions like the National Library of Australia crack down on unauthorised use of their content.
Last year, Australian creative industry leaders told a Senate committee that genAI was threatening livelihoods, calling AI “a direct threat to that income” and demanding “unambiguous guidelines to encourage the use of only safe and responsible AI.”
Alsup’s ruling could also have implications for ongoing action against Meta and other firms, which are accused of training their genAI LLMs on pirated books from the same shadow libraries that Anthropic used.
“The purpose and character of using copyrighted works to train LLMs to generate new text was quintessentially transformative,” Alsup wrote.
“Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them – but to turn a hard corner and create something different.”
“If this training process reasonably required making copies within the LLM or otherwise, those copies were engaged in a transformative use.”