Access to online archives of public cultural and historical documents is seemingly at risk, with the developer of a 15-year-old online research tool reportedly shut out of a key archive of Australiana even as media organisations reassess their generative AI (genAI) policies.
The block was reported by historian and data hacker Tim Sherratt, whose GLAM Workbench tool – the acronym refers to cultural heritage institutions like galleries, libraries, archives, and museums – was flagged for misusing National Library of Australia (NLA) data.
GLAM Workbench, which among other things uses the NLA’s APIs to access and visualise the more than 271 million Australian cultural documents and six billion Australian websites in its Trove archive, was cut off when its API keys were “cancelled without warning” in February.
Sherratt had allegedly breached s4.4(e) of the Trove API’s Terms of Use by using it to download full Trove documents rather than just their metadata – with the NLA informing him that it “consider[s] the use of an API to extract and save full text as being in violation.”
Subsequent conversations, a “totally blindsided” Sherratt explained, noted policy changes in version 2 of the API, which was released in 2020 during a major Trove upgrade that he said was “sold as opening up ‘access to richer data for API users’” who are allocated API keys.
“In most cases researchers want and need content, not just metadata, and I’ve developed a range of tools to help them access it,” Sherratt said, noting that the tool “has been helping researchers in this way for 15 years.”
“There was no indication then that access to this data required special permission,” he wrote, adding that NLA executives had justified the block in terms of ‘data governance’ “and the fact that the online world had changed.”
Yet, he added, “I fail to see how researchers downloading newspaper articles from the 1890s can be seen as a possible cyber threat… I had not imagined a world where the NLA would set itself up as gatekeeper to every use of the digitised newspaper corpus through the API.”
The NLA was contacted for comment.
Revisiting the meaning of ‘open data’
Sherratt’s experience reflects a broader challenge to the GLAM sector, whose institutions were designed to curate public works and open data, but face loss of control as genAI platforms harvest the content to train ever more voracious large language models (LLMs).
Arguments that the practice falls within ‘fair use’ provisions of copyright law were challenged when copyright holders pushed back against a proposed self-regulation model, forcing the UK Information Commissioner’s Office to cancel a proposed policy last year.
The 25 February closure of a new consultation on the issue – coincidentally, four days after Sherratt received his notice – saw UK newspapers and creatives launch a campaign against what they see as systemic copyright violations.
National archives like the NLA occupy a challenging middle ground because their need to prevent unauthorised use of their content clashes with open data’s core ideals.
The challenge was highlighted in a 2013 government analysis that noted “some GLAM agencies have had significant success with open data and content initiatives” and recommended “tailored guidance on implementing open government practices.”
Yet as voracious genAI training algorithms overstay open government’s welcome, the NLA joins overseas analogues like the UK’s National Archives in tweaking its terms of use.
The US National Archives and Records Administration’s (NARA’s) API, for one, “allows researchers and developers to retrieve metadata… for any given record or search results set” and explicitly approves its use “to facilitate bulk downloads of records”.
As works of the federal government, that agency’s “archival metadata is all assumed to be in the public domain”, NARA notes, adding that “digital images themselves are generally also in the public domain for the same reason.”
Staking out common ground with genAI
Whether agencies like the NLA are intentionally restricting access to public data, or just wrestling with the implications of free and unlimited access, the issue is clearly front of mind.
NARA, for one, recently updated its policies with a focus on “responsible use of AI” – including as a tool for improving its recordkeeping practices – and hosted national archivists from Australia, Canada, and New Zealand last September to discuss the issues.
In February, the OECD released a report that recognised the “complex issues” that genAI training has created for intellectual property, including the “significant challenges” of data scraping practices about which big AI companies are increasingly keeping mum.
Many tech companies are rewriting user conditions to justify scraping their users’ data, while OpenAI – after being famously sued by the New York Times – has finalised commercial agreements with news outlets like News Corporation, Le Monde and Guardian Media Group.