Australian authors have been caught up in a major controversy around the data used to train large language models after an investigation found that 183,000 pirated books were included in a commonly used dataset called Books3.
Booker Prize-winning novelist Richard Flanagan was among the Australian writers whose works were in the dataset that was reportedly used to train Meta’s open-source large language model LLaMA.
He told the Guardian that it “felt as if [his] soul had been strip mined” when he learned 10 of his books had been used to train AI without his consent.
“This is the biggest act of copyright theft in history,” Flanagan said.
Journalist Alex Reisner got his hands on Books3 and wrote a program to extract ISBNs and run them through an online database of books.
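Reisner’s own program is not reproduced here. As a rough sketch of what the first step could look like, the Python below pulls ISBN-like numbers out of a block of text and keeps only those whose check digit is valid. The function names and the sample text are hypothetical, and the lookup against an online book database is left out because the database is not named.

```python
import re

# Candidate ISBNs: 10 to 13 digits, optionally split by single hyphens or
# spaces, with an optional trailing X (the ISBN-10 check character).
# Assumption: this pattern is illustrative, not the one Reisner used.
ISBN_CANDIDATE = re.compile(r"(?<!\d)(?:\d[ -]?){9,12}[\dXx](?!\d)", re.ASCII)


def is_valid_isbn10(s: str) -> bool:
    """Check an ISBN-10: weighted sum (weights 10 down to 1) divisible by 11."""
    if len(s) != 10 or not s[:9].isdigit() or s[9] not in "0123456789X":
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s: str) -> bool:
    """Check an ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    if len(s) != 13 or not s.isdigit():
        return False
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(s))
    return total % 10 == 0


def extract_isbns(text: str) -> list[str]:
    """Return unique, checksum-valid ISBNs in order of first appearance."""
    found: list[str] = []
    for match in ISBN_CANDIDATE.finditer(text):
        # Strip separators and normalise a lowercase x before validating.
        normalized = re.sub(r"[ -]", "", match.group()).upper()
        if is_valid_isbn10(normalized) or is_valid_isbn13(normalized):
            if normalized not in found:
                found.append(normalized)
    return found


# Hypothetical text of the kind found on a book's copyright page.
sample = (
    "Copyright page: ISBN 978-0-306-40615-7 (hardback), "
    "ISBN 0-306-40615-2 (paperback). Misprint: 978-0-306-40615-8."
)
print(extract_isbns(sample))
```

The checksum test matters because a plain digit pattern would also pick up phone numbers, misprints and other stray figures. Each number that survives could then be passed to a book catalogue to recover a title and author.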
Reisner’s findings have been turned into a non-paywalled, searchable database that shows the massive extent of copyrighted material contained in Books3 – enter the name of any well-known living author and you are bound to find their novels were part of the dataset.
Generative AI is creating all sorts of problems for copyright as tech companies race to create models using data from the internet, often with a disregard for its provenance.
Crime writer Dervla McTiernan told the ABC the use of her work as training data was tantamount to theft and epitomised the Silicon Valley motto of ‘move fast and break things’.
“They knew they were using pirated books, and they did so with gross indifference, and I think that's characteristic of the mentality of people who work in this industry,” she said.
John Marsden, author of the series Tomorrow, When the War Began, told the Australian Financial Review that generative AI trained on a vast corpus of works from established writers would create “frightening and a horrifying kind of tsunami of imitations which would do incredible, incalculable damage to the creative powers and efforts of human beings”.
The Australian Society of Authors (ASA) described the revelations as “horrifying” and noted the opacity with which AI systems had been trained made it nearly impossible to know just how much copyrighted material they contained.
“Tech companies will charge the end user of their products but will not pay for the labour that enabled it,” said ASA CEO Olivia Lanchester.
“It’s a supply chain that stops short of the primary producer. Like paying the supermarket for your fruit and vegetables without any of that revenue going back to the farmers who grew the produce.”
In the US, the Authors Guild and 17 well-known authors including Jodi Picoult, George RR Martin, and John Grisham have filed a lawsuit against ChatGPT creator OpenAI for what they claim are “flagrant and harmful infringements” of copyright.
At the heart of that complaint is the way writers’ labour has been co-opted to create technology systems that seek to replace the people whose work was stolen.
“Already, writers report losing income from copywriting, journalism, and online content writing – important sources of income for many book authors,” the complaint said.
Hollywood writers went on strike earlier this year in response to the change in compensation wrought by streaming services and the impending disruption caused by AI.
Last week, the union struck a deal to better control the use of AI in writers’ rooms.