OpenAI and Google have reportedly trained their generative AI tools on millions of hours of YouTube videos in a murky move that may have breached copyright laws.
The practice of training large language models, which power generative AI tools such as OpenAI’s ChatGPT, on extraordinarily large datasets that inevitably include copyrighted material has come under scrutiny in recent months and has already led to several lawsuits.
These models need huge amounts of text data for training, leading to a race to find new data sources online.
A new report by the New York Times has revealed the lengths that tech companies are going to in order to find the data needed to train generative AI tools, leading them to “cut corners, ignore corporate policies and debate bending the law”.
Citing numerous people involved, the report reveals that by late 2021 OpenAI had nearly exhausted the supply of English-language text on the internet, having already hoovered up information from the code repository GitHub, databases of chess moves, and descriptions of high school tests and homework.
To find more data to feed its generative AI machine, OpenAI developed a speech recognition service called Whisper to transcribe YouTube videos and podcasts.
In all, it transcribed more than one million hours of YouTube videos as input for GPT-4, the model that now underpins the ChatGPT chatbot, which was released publicly in late 2022.
The use of this data is likely a legal grey area: YouTube’s terms of service prohibit the use of videos on the platform for applications that are “independent” of the platform, and YouTube also bans accessing its videos by “any automated means (such as robots, botnets or scrapers)”.
Updating terms of service
The report revealed that YouTube owner Google also used this same tactic to train its own large language model, potentially in violation of the copyright of the creators of the videos on its platform.
In an effort to find more English-language data to train its system, Google updated its terms of service last year to allow it to use publicly available Google Docs and restaurant reviews on Google Maps to train its AI models.
While its previous privacy policy allowed Google to use publicly available information to “help train Google’s language models and build features like Google Translate”, the updated agreement said this data can be used for “AI models and [to] build products and features like Google Translate, Bard and Cloud AI capabilities”.
Copyright chaos
The report offers an inside look into the widespread copyright issues surrounding the astronomical rise of generative AI, kicked off by the public launch of ChatGPT in November 2022.
The New York Times launched a landmark lawsuit against OpenAI and Microsoft over the use of its copyrighted content to train their respective chatbots.
In the lawsuit, the media giant said the companies are operating business models based on “mass copyright infringement”, and that they are effectively becoming competitors by reproducing its copyrighted content for users.
In response, OpenAI has argued its use of the material is “fair use” under copyright law, as it transformed the content for a different purpose.
In September last year nearly 185,000 pirated books were found in a widely used dataset reportedly used to train Meta’s generative AI tools, including the works of a number of Australian authors.
The US Authors Guild and a number of high-profile authors, including Jodi Picoult and George RR Martin, have filed a lawsuit against OpenAI, the maker of ChatGPT, over “flagrant and harmful infringements” of copyright.
In late 2023, OpenAI said in a submission to UK Parliament that it would be “impossible” to train generative AI tools without using copyright content.
The Australian government is aiming to get on the front foot on the issue, launching a reference group on AI and copyright late last year. The group will guide the government in preparing for future copyright challenges arising from generative AI.