It would be “impossible” to train generative artificial intelligence (genAI) tools without using copyrighted content, OpenAI has admitted as it faces a landmark legal battle with The New York Times.

Copyright has emerged as a key issue surrounding the skyrocketing popularity and usage of generative artificial intelligence tools such as OpenAI’s ChatGPT platform.

The large language models behind these tools are trained on content that is often copyrighted, which has already led to a number of lawsuits.

The New York Times recently filed a complaint in the Federal District Court in Manhattan against OpenAI, accusing the company of operating a business model based on “mass copyright infringement”.

The media giant accused OpenAI of using its copyrighted content to train ChatGPT, which then reproduces this content for users, effectively making the chatbot a competitor to the newspaper.

In a submission to a UK Parliamentary inquiry in December last year, before The New York Times filed the lawsuit, OpenAI admitted that the development of large language models such as ChatGPT would not be possible without the use of copyrighted content.

“Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials,” the OpenAI submission said.

“Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.”

In the submission, OpenAI argued that it was not in breach of copyright laws, as “legally copyright law does not forbid training”, but said it was also in negotiations with copyright holders.

“We have been industry leaders in allowing creators to express their preferences with respect to the use of their works for AI training,” the company said.

“While we look forward to continuing to develop additional mechanisms to empower rights-holders to opt-out of training, we are actively engaging with them to find mutually beneficial arrangements to gain access to materials that are otherwise inaccessible, and also to display content in ways that go beyond what copyright law otherwise allows.”

OpenAI says copyright lawsuit ‘without merit’

In its lawsuit, The New York Times provided more than 100 examples of ChatGPT reproducing its content nearly identically, which it claimed bypassed its paywall and made the chatbot a direct competitor.

In a blog post, OpenAI railed against these claims, accusing The New York Times of “intentionally manipulating” its chatbot to reproduce the newspaper’s own articles, and said the case is “without merit”.

“Interestingly, the regurgitations The New York Times induced appear to be from years-old articles that have proliferated on multiple third-party websites,” the OpenAI blog post said.

“It seems they intentionally manipulated prompts, often including the excerpts of articles, in order to get our model to regurgitate them.

“Even when using such prompts, our models don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts.”

Replications a ‘rare bug’

OpenAI will also argue that the use of copyrighted material to train generative artificial intelligence models is fair use, and that this is “supported by long-standing and widely accepted precedents”.

The company said that the direct replication of online articles is a “rare bug” in the system.

“Memorisation is a rare failure of the learning process that we are continually making progress on, but it’s more common when particular content appears more than once in training data, like if pieces of it appear on lots of different public websites,” the company said.

“We also expect our users to act responsibly; intentionally manipulating our models to regurgitate is not an appropriate use of our technology and is against our terms of use.”