Generative AI is already a copyright mess

Today, Getty Images announced its own generative AI tool trained using the company’s vast database of professional photos. It's billed as being “commercially safe” and providing “full indemnification for commercial use”.

You might remember that Getty Images began a lawsuit against Stability AI, creator of Stable Diffusion, earlier this year for allegedly “[copying and processing] millions of images protected by copyright and the associated metadata owned or represented by Getty Images absent a license”.

Stable Diffusion has been one of the most obvious examples of why training generative AI systems on data scraped from the web, such as images, is murky ground for copyright.

It was obvious because the occasional garbled Getty Images watermark would appear on a Stable Diffusion output.

But Getty wasn’t going after the users of Stable Diffusion in that instance, so why does it make a note of its tool as being “commercially safe” for customers?

Can you be sued for using a generative AI system that was trained on copyrighted material without authorisation?

According to the Australian Copyright Council’s submission to the Safe and Responsible AI in Australia discussion paper (which explicitly didn’t include copyright in its scope): maybe.

“It is an infringement of the copyright in a work or other subject matter to reproduce a ‘substantial part’ of work or subject matter without permission,” the Council said.

“It is also an infringement of copyright to ‘authorise’ another person’s infringement. Authorisation is the endorsement or sanction of another’s infringement in certain circumstances. These actions may render a person liable for copyright infringement.

“It follows that using generative AI may pose copyright infringement risks in relation to the output of AI tools”.

The CSIRO has previously warned organisations to be careful with AI vendors because of the legal and reputational risk of their products, and Getty isn’t the first firm looking to settle the nerves of prospective AI buyers.

Earlier this month, Microsoft made the Copilot Copyright Commitment, which says that the company will defend any customer on the receiving end of a copyright claim caused by its AI products.

Microsoft’s chief legal officer Brad Smith put his name to that announcement, saying the company “will assume responsibility for the potential legal risks involved” with using Copilot in 365.

Copilot incorporates generative AI into Microsoft’s suite of productivity tools – PowerPoint, Excel, Word, Outlook – and Microsoft was reportedly planning to charge businesses US$30 per user per month to use it.

Last week, we learned that a handful of Australian companies – including energy provider AGL, health insurer Bupa, and NAB – have signed up for early access to Copilot, no doubt encouraged by knowing that Microsoft has offered to foot the legal bill for any possible copyright claims.

As long as they’re aware that, at the moment, it doesn’t look like AI-generated content can be copyrighted.

Protecting intellectual property

Because of how rights holders ruthlessly protect their intellectual property, copyright unofficially rules the internet.

YouTube scans every uploaded video using Content ID to spot any hint of copyrighted material, so that movie studios, record labels and other rights holders don’t sue parent company Google into oblivion.

The Australian government keeps a list of websites that all ISPs must block at a DNS level. On that list are sites known for the most abhorrent content imaginable, sites that offer unregulated online gambling, and sites that let you watch TV shows for free.

But the power of copyright is facing a new challenge with generative AI.

Getty isn’t alone in thinking that including copyrighted material in training datasets infringes intellectual property rights.

Artists, authors, and software developers have all launched suits to test the legality of these models, though not all of them have the ability to spin up their own generative AI tools in the way Getty has.

Unfortunately, it can be hard for creators to prove that their intellectual property was used in an AI’s training set.

“A fundamental difficulty for copyright owners wishing to enforce their rights is that there is currently no transparency or disclosure by those developing AI technologies as to the data/material collected and used to build the vast databases on which an AI system is trained,” the Australian Copyright Council said in its submission to government.

“The critical transparency issue for copyright owners is that they need to have clarity about both what and how protected material is used in order to ascertain whether indeed there has been copyright infringement as infringement may not occur in each instance of use.”

One jurisdiction where that might change is the European Union, which wants, under its AI Act, to require generative AI producers to publish summaries of the copyrighted data used in training.

What happens if AI companies are able to train new models on synthetic, AI-generated data before courts and regulators order them to open up their data? Will they have gotten away with an intellectual property heist on an unimaginable scale?

Locking down the net

Earlier this year, the New York Times was contemplating whether it wanted to join the chorus of litigation against AI companies by suing ChatGPT creator OpenAI.

Already the publisher had updated its terms of service to say scraping its site wasn’t allowed and had blocked OpenAI’s web crawler from the site.

Other publishers have done the same, relying on the voluntary Robots Exclusion Protocol to tell well-meaning crawlers to stay out.

If you go to the ABC News robots.txt file, you’ll see even the national news service has disallowed GPTBot.
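For the curious, here is a rough sketch of how a well-behaved crawler might consult those rules before fetching a page, using the robots.txt parser in Python’s standard library. The sample rules, the example.com address and the is_allowed helper are made up for illustration; they are not copied from the ABC’s or any other publisher’s actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only,
# not copied from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A made-up article URL on a placeholder domain.
page = "https://example.com/news/story"

gptbot_allowed = is_allowed(SAMPLE_ROBOTS_TXT, "GPTBot", page)
other_allowed = is_allowed(SAMPLE_ROBOTS_TXT, "SomeOtherBot", page)

print("GPTBot allowed:", gptbot_allowed)
print("Other crawler allowed:", other_allowed)
```

Nothing in this check is enforced by the site itself. A crawler that never reads the file, or ignores the answer, can still fetch the page, which is why the protocol only works on well-meaning bots.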

Generative AI is changing the internet. Not only does it give people tools to easily impersonate others and adopt new personas, but now human creative effort, our work, is valuable as data.

For social media sites like X (formerly Twitter) and Reddit, this extra value has led owners to limit API access and start raising walls protecting that data.

Under Elon Musk’s ownership, X has been prioritising posts of paid verified users with a suggestion that the whole platform might one day be subscribers-only.

Reddit is incentivising users with its ‘Contributor Program’ that will give money to active verified users.

Like the early days of the web, we’re facing serious questions over intellectual property and users’ rights. The challenge for our legislators is to strike the right balance between the interests of each group.