Generative AI (genAI) tools may have been transformative for users, but the companies training them are creating headaches for content owners like Wikipedia as hordes of ‘grey bots’ overload their systems with “unprecedented” volumes of requests to scrape their data.
Sold commercially by firms like Kadoa, Axiom.ai and Browse AI, and purpose-built by many AI developers for their own use, such bots – dubbed ‘grey bots’ by Barracuda researchers to distinguish them from helpful search engine bots and malicious ‘bad bots’ – have exploded in use.
Barracuda, which among other things monitors individual customers’ sites for malicious behaviour, has noted regular surges in requests by grey bots such as ClaudeBot, TikTok’s Bytespider scraper, PerplexityBot, and DeepSeekBot – all of which collect data to train genAI.
Between December and the end of February alone, Barracuda reports, the Anthropic-owned ClaudeBot lodged up to 2.5 million requests for data in a day – with one application seeing more than 9.7 million requests in a month, and another fielding over 500,000 in a day.
The volume of requests remained relatively consistent over the course of a day, with an average of around 17,000 requests per hour, suggesting that the grey bots were running at a steady pace to find and download whatever data they could come across.
This is far more measured than the traffic floods created by bad bots, which a recent analysis found account for over a third of Australia’s Internet traffic as they pummel popular sites to snag choice concert tickets, harvest personal data, and commit ad fraud.
Yet “both scenarios – constant bombardment or unexpected, ad hoc traffic surges – present challenges for web applications,” Barracuda senior principal software engineer Rahul Gupta noted, with copyright, privacy and other legal issues only the tip of the iceberg.
Grey bots live in a policy grey area
GenAI companies’ wholesale scraping of other companies’ data has been highly contentious, with genAI firms arguing that ‘fair use’ gives them broad access to copyrighted materials while authors and content producers push back against mass appropriation of their content.
Yet as genAI companies build ever more effective bots to scrape data from other companies’ websites, their technological implications have drawn the ire of organisations like Wikipedia, which saw scraping of its 144 million media files surge 50 per cent over the past year.
And it isn’t search engines driving the load: “we are observing a significant increase in request volume, with most of this traffic being driven by scraping bots,” notes Birgit Müller, Wikimedia Foundation director of product for MediaWiki and Developer Experiences.
The death of Jimmy Carter in December, for example, caused network traffic to Wikipedia’s site to double – maxing out some of its Internet connections and causing slow load times for some users as the former US President’s Wikipedia page fielded 2.8 million queries in a day.
“This increase is not coming from human readers,” Müller said, “but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models.”
“Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.”
“A significant amount of our time and resources go into responding to non-human traffic.”
The phenomenon isn’t new: security firm Kasada last year reported similar issues, noting that serial offender Bytespider was hitting servers with more than 3,000 times as many requests as Anthropic’s bots, and 25 times as many as those by OpenAI’s GPTBot.
A constant battle for content providers
If the Internet’s eighth most-visited website is struggling to manage the impact of genAI’s grey bot armies – which continue to operate in a legal grey area – what hope do smaller companies have of protecting their content from mass exploitation and chronic network congestion?
One Cloudflare analysis found AI bots accessing 39 per cent of the million biggest websites – yet while bot protection services can identify genAI bots, block their traffic and trap them in content honeypots, less than 3 per cent of those sites were actively blocking grey bots.
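Cloudflare and Barracuda don’t publish the mechanics of their bot-blocking services, but the simplest self-hosted form of the same idea is filtering requests by user agent at the web server. A minimal nginx sketch is below – the bot names are the real user agents cited in this article, but the list is illustrative only, and determined scrapers can spoof their user agent, which is why commercial services layer on behavioural detection:

```nginx
# Flag requests from known genAI scrapers by user agent.
# Names are those cited in this article; the list is illustrative,
# not exhaustive, and sophisticated bots may spoof their UA string.
map $http_user_agent $is_grey_bot {
    default          0;
    ~*ClaudeBot      1;
    ~*Bytespider     1;
    ~*PerplexityBot  1;
    ~*GPTBot         1;
}

server {
    listen 80;
    server_name example.com;

    location / {
        if ($is_grey_bot) {
            return 403;   # refuse flagged scraper requests outright
        }
        # ... normal content handling ...
    }
}
```

Unlike robots.txt, a server-side rule like this is enforced rather than merely requested – but it only catches bots that identify themselves honestly.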
With AI giants and startups introducing new genAI scrapers on a regular basis – and many using sneaky tactics to avoid detection – managing their impact remains a cat-and-mouse game.
Grey bots threaten to crowd out legitimate users as they inundate servers with requests for data – often ignoring website owners’ attempts to move them on via the Robots Exclusion Protocol (REP), the widespread convention in which a site’s robots.txt file tells crawlers which content they may access.
Yet REP “is not legally binding,” Barracuda’s Gupta said, warning that the company’s analysis “suggests that grey bots such as genAI bots are now an everyday component of online traffic and are here to stay – [so] it’s time for organisations to factor them into security strategies.”
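For illustration, a site wanting to opt out of genAI scraping might publish a robots.txt like the sketch below. The bot names are the real user agents mentioned in this article; but as Gupta notes, compliance is entirely voluntary, and some scrapers simply ignore these directives:

```text
# robots.txt – voluntary opt-out under the Robots Exclusion Protocol (RFC 9309).
# Directives are requests, not enforcement: bots may ignore them.

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

# All other crawlers (e.g. search engine bots) remain welcome
User-agent: *
Allow: /
```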