‘Blue screens of death’ proliferated on Microsoft machines across the planet this weekend – marking what may be the largest IT outage in history.
The underlying cause was a botched update at US cyber security company CrowdStrike, whose platform is used by 20,000 customers worldwide, including more than half of the Fortune 500 companies.
The issue began on Friday afternoon when CrowdStrike released an update for its flagship ‘Falcon’ platform, a partially automated, cloud-based cyber security product.
In short, the update triggered a logic error (effectively a coding bug) on an estimated 8.5 million Windows devices, causing ‘blue screens of death’ on affected machines and leaving many devices unusable as users struggled to fix perpetual reboot loops.
The outage quickly became apparent as social media flooded with blue screen sightings at bus stops, shopping centres, offices, supermarket self-checkouts and even MRI machines.
At the same time, critical infrastructure was disrupted across the globe: many airlines ground to a halt, bank and payment system disruptions caused pile-ups at supermarkets and petrol stations, and international healthcare providers were forced to cancel ‘non-urgent’ surgeries.
While the Falcon update which caused the system crashes was released at approximately 2pm on Friday, CrowdStrike says the update was “remediated” by approximately 3.30pm the same afternoon.
“I want to sincerely apologise directly to all of you for today’s outage,” said CrowdStrike chief executive officer George Kurtz.
“All of CrowdStrike understands the gravity and impact of the situation.
“We quickly identified the issue and deployed a fix, allowing us to focus diligently on restoring customer systems as our highest priority.”
Kurtz further emphasised the issue did not impact Mac and Linux systems, and was not caused by a cyber attack.
Within hours of the update being released, prominent cyber security expert Troy Hunt said it would develop into “the largest IT outage in history”.
Where the update went wrong
One of CrowdStrike’s key selling points is that it can regularly and automatically push updates for its Falcon service, enabling clients to quickly apply the most up-to-date protection against constantly shifting cyber security threats.
As the BBC reported, however, the recent problematic Falcon update was sent out automatically and, for many international customers, applied to affected systems overnight.
Katie Moussouris, chief executive of security startup Luta Security, noted that most capable organisations tend to test software updates before deployment, but “content updates” from operating systems and security software are more often set to update automatically because they are “viewed as safe”.
“Testing and phased rollout on both ends – producer and consumer – could have slowed this down and contained things,” said Moussouris.
“IT departments just got a new daily task.”
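To illustrate what a phased rollout of the kind Moussouris describes might look like in practice, here is a minimal Python sketch. It is not CrowdStrike's mechanism, which the company has not detailed; the stage sizes, the failure threshold and the function names (`bucket`, `phased_rollout`, `apply_update`, `is_healthy`) are all hypothetical and chosen only for illustration. The idea is that an update reaches a small slice of machines first, and only proceeds to the next slice if those machines stay healthy.

```python
import hashlib

# Hypothetical stages: cumulative fraction of the fleet updated at each step.
STAGES = [0.01, 0.10, 0.50, 1.00]
# Assumed halt threshold: stop if more than 2% of a batch is unhealthy.
MAX_FAILURE_RATE = 0.02


def bucket(host_id: str) -> float:
    """Map a host ID to a stable value in [0, 1) so stage membership is repeatable."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def phased_rollout(hosts, apply_update, is_healthy):
    """Update hosts stage by stage, halting when a batch shows too many failures.

    Returns (updated_hosts, completed), where completed is False if the
    rollout was halted early.
    """
    updated = []
    seen = set()
    for fraction in STAGES:
        # Hosts that fall inside this stage and have not been updated yet.
        batch = [h for h in hosts if bucket(h) < fraction and h not in seen]
        for host in batch:
            apply_update(host)
            seen.add(host)
            updated.append(host)
        if batch:
            failures = sum(1 for h in batch if not is_healthy(h))
            if failures / len(batch) > MAX_FAILURE_RATE:
                return updated, False  # contain the damage to this stage
    return updated, True
```

With a faulty update, a scheme like this would stop after the first batch of machines crashed rather than reaching the whole fleet. A real deployment would also need a waiting period between stages and a way to collect health signals from machines that can no longer report in, neither of which is modelled here.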
CrowdStrike has not explained how the botched update got through its internal testing procedures, but has detailed the underlying bug.
On social media, Kurtz described “a defect found in a single content update for Windows hosts”, while a more technical breakdown by CrowdStrike revealed a .sys file in Windows was mistakenly loaded with a critical logic error during the update.
The Verge noted the update appeared to target a “kernel-level driver” which CrowdStrike uses to protect Windows machines – meaning the affected file had high-level system implications – while CrowdStrike stressed the affected file was one of many “channel files”, which “are not kernel drivers” themselves.
CrowdStrike noted the issue impacted Windows hosts running Falcon sensor versions 7.11 and above that were online between 2.09pm and 3.27pm AEST on Friday, 19 July.
A fix has been rolled out which essentially requires customers to reboot their device so it can download another update, though CrowdStrike recommends using a wired network so any impacted devices can “acquire internet connectivity considerably faster via ethernet”.
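For IT teams triaging their fleets, the two criteria above (sensor version and time online) can be expressed as a simple check. The following Python sketch is illustrative only: the function names and inputs are hypothetical, the year 2024 is an assumption not stated above, and the check says nothing about whether a given machine actually crashed.

```python
from datetime import datetime, timedelta, timezone

# AEST is UTC+10; a fixed offset avoids needing a timezone database.
AEST = timezone(timedelta(hours=10), "AEST")

# Exposure window as reported, 2.09pm to 3.27pm AEST (year 2024 is assumed).
WINDOW_START = datetime(2024, 7, 19, 14, 9, tzinfo=AEST)
WINDOW_END = datetime(2024, 7, 19, 15, 27, tzinfo=AEST)
MIN_SENSOR_VERSION = (7, 11)


def parse_version(text):
    """Turn '7.11' or '7.11.1234' into (7, 11) so versions compare numerically."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def possibly_impacted(sensor_version, online_from, online_until):
    """Return True if a host meets both reported criteria.

    online_from and online_until must be timezone-aware datetimes.
    """
    if parse_version(sensor_version) < MIN_SENSOR_VERSION:
        return False
    # The host's online period overlaps the exposure window.
    return online_from <= WINDOW_END and online_until >= WINDOW_START
```

Comparing versions as number pairs rather than strings matters here: as text, "7.9" sorts after "7.11", which would wrongly flag an older sensor. Converting the window with `WINDOW_START.astimezone(timezone.utc)` gives the equivalent UTC time for teams whose logs are not in AEST.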
If the issue persists, Microsoft has released more technical guidance for users to manually remove the buggy file, and has further developed a specific USB recovery tool.
“Although this was not a Microsoft incident, given it impacts our ecosystem, we want to provide an update on the steps we’ve taken with CrowdStrike and others to remediate and support our customers,” said Microsoft.
“We recognise the disruption this problem has caused for businesses and in the daily routines of many individuals.”
The impact in Australia and across the globe
While Microsoft has estimated some 8.5 million devices were impacted by the update, it’s difficult to gauge precisely how many companies were disrupted.
Globally, major payment providers such as Square and Visa saw hundreds of users report issues through third-party monitoring service DownDetector.
Business Insider reported several internal services at Amazon were impacted, including work emails and some warehouse systems, while in the UK, Sky News was briefly unable to broadcast to TV.
The UK’s National Health Service suffered issues with an appointment and patient record system which caused disruptions in the majority of GP practices, while the US states of Alaska and Ohio reportedly saw emergency 911 lines go down overnight.
The BBC reports international airports in India, Hong Kong, the UK and the US have experienced issues, while multiple airlines were forced to ground flights.
Domestically, Jetstar seemed to be the most impacted airline as it temporarily cancelled all flights in Australia and New Zealand on Friday, while Qantas reported “technical issues” impacting its website, app, and services for booking and check-ins.
Australian banks such as NAB, Bendigo Bank, Suncorp Bank, Commonwealth Bank and ME Bank all saw an uptick in reports on DownDetector, while Up Bank reported disruptions for Osko payments until late Saturday afternoon.
9News reported that it, the ABC and SBS were affected, while Foxtel customers saw their Friday night telly interrupted by the outage.
Perhaps most noticeable was the impact to supermarkets Woolworths and Coles, whose sporadic self-service and checkout disruptions caused reported pile-ups across the nation.
Domestic payments organisation Australian Payments Plus meanwhile clarified that while the “central Eftpos environment” had not been affected by the global outage, some issuers using Windows may have suffered disruptions.
As for redressing the losses of impacted businesses, compensation expert Jeannie Paterson told ABC News Breakfast the “compensation people are hoping for” is probably not possible.
“People might get their fees waived, but I don’t think they’re going to get compensation for the loss and the cost to their businesses,” said Paterson.
“Contract doesn’t guarantee a perfect service – unfortunately outages happen, and generally there will be exclusion clauses in the contract that say there’s no liability for outages.”
Cyber minister forecasts weeks of teething issues
While Prime Minister Anthony Albanese was quick to clarify there was “no impact to critical infrastructure, government services or triple-0 services” after the outage, Cyber Security Minister Clare O’Neil has forecast “teething issues” which may last for one or two weeks.
“There has been a huge amount of work over this weekend to get the economy back up and running,” said O’Neil.
“However, it will take time until all affected sectors are completely back online.”
The statement comes after the government convened a second meeting of the National Coordination Mechanism – a national crisis management framework which was used in recent data breaches, during the COVID-19 pandemic, and again for last year’s major outage at telecommunications provider Optus.