Google denies training its AI with your emails

Within a day of its public release, Google’s AI chatbot Bard – the company’s answer to ChatGPT – erroneously told an AI researcher it was trained on data from Gmail, further highlighting the risk that generative AI can spread misinformation and foregrounding the importance of AI transparency.

Select Google accounts in the UK and US got first access to Bard this week following the calamitous product announcement in February that saw $140 billion wiped off the company’s market value in a single day.

Early demos showed Bard continuing to make factual errors, commonly anthropomorphised as “hallucinations”, while the UI is riddled with disclaimers that it “may display inaccurate or offensive information that doesn’t represent Google’s views”.

Despite the publication of failed attempts to get Bard to describe how to make mustard gas or fall in love with the user, it was an innocuous question that tripped up Bard when academic Kate Crawford, who holds an Honorary Professorship at the University of Sydney, simply asked where the AI’s dataset comes from.

“Bard’s dataset comes from a variety of sources,” the chatbot said, adding that it was trained using publicly available data from sources such as “Wikipedia, GitHub, and Stack Overflow”, as well as data from third parties which “have partnered with Google to provide data for Bard’s training”.

But Bard also said it was trained on “Google’s internal data” which includes “data from Google Search, Gmail, and other products”.

Crawford tweeted her screenshot of this interaction, saying she assumed it was wrong “otherwise Google is crossing some serious legal boundaries”.

An official Google account replied to say that, no, “it is not trained on Gmail data” and that “Bard is an early experiment based on Large Language Models and will make mistakes”.

Bard is an early experiment based on Large Language Models and will make mistakes. It is not trained on Gmail data. -JQ
— Google Workspace (@GoogleWorkspace) March 21, 2023

Evidently, Google’s social media team can also make mistakes: the company responded to another tweet by misspelling its own product’s name, saying that “no private data will be used during Barbs [sic] training process”. It has since deleted the tweet.

Crawford has since gone on to say there is “a real problem” with the lack of transparency around the data on which large language models and other forms of artificial intelligence are trained.

“Scientists and researchers like me have no way to know what Bard, GPT4, or Sydney [Microsoft Bing] are trained on,” she said. “Companies refuse to say. This matters because training data is part of the core foundation on which models are built”.

Crawford’s concern is that the absence of information about training data makes it harder to “test or develop mitigations, predict harms, or understand when and where they should not be deployed or trusted”.

Google, to its credit, has at least partially described the makeup of its Infiniset – the dataset used to train Language Models for Dialogue Applications (LaMDA), the model on which Bard is built.

In a paper from last February, Google said half of the data came from public forums, with another quarter coming from Wikipedia and from web data scraped and provided by non-profit Common Crawl.