A configuration error at cloud provider Fastly brought down popular web services worldwide on Tuesday night, highlighting the fragility of modern internet infrastructure.
The issue affected sites ranging from news outlets like the Sydney Morning Herald and ABC to social media and video sharing sites Reddit, Twitter and Vimeo, with many returning 503 errors.
Even users of internet giants Google and Amazon felt the disturbance.
The outage lasted around 45 minutes from Fastly’s initial investigation to the company applying a fix, but customers still felt delays as the servers came back online.
“We identified a service configuration that triggered disruptions across our [points of presence] POPs globally and have disabled that configuration,” Fastly said in a tweet at 9.09pm on Tuesday.
“Our global network is coming back online.”
Fastly operates a content delivery network (CDN), which web services use to distribute their content to users around the world.
"We identified a service configuration that triggered disruptions across our POPs globally and have disabled that configuration. Our global network is coming back online. Continued status is available at https://t.co/RIQWX0LWwl" (Fastly, @fastly, June 8, 2021)
CDN nodes are typically located in large cities to shorten the distance a user’s request needs to travel.
Fastly’s CDN has nodes in Brisbane, Perth, Melbourne and Sydney, which helps international services like Reddit and Twitter keep latency low for a better user experience.
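The link between distance and latency can be illustrated with a minimal sketch that picks the closest of the four Australian cities for a given user location. It assumes approximate city-centre coordinates (not Fastly's actual data-centre locations) and a hypothetical `nearest_pop` helper. Real CDNs typically steer traffic with DNS or anycast routing rather than pure geography, so this is only an approximation of the idea.

```python
import math

# Approximate city-centre coordinates (latitude, longitude), for illustration only.
# These are assumptions, not the real locations of Fastly's points of presence.
POPS = {
    "Brisbane": (-27.47, 153.03),
    "Perth": (-31.95, 115.86),
    "Melbourne": (-37.81, 144.96),
    "Sydney": (-33.87, 151.21),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_pop(user_location):
    """Return the name of the city in POPS closest to the user's (lat, lon)."""
    return min(POPS, key=lambda name: haversine_km(user_location, POPS[name]))
```

Under these assumptions, a user in Adelaide (roughly -34.93, 138.60) would be matched to Melbourne, and a user in Canberra (roughly -35.28, 149.13) to Sydney.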
Carsten Rudolph, Associate Professor of Software Systems and Cybersecurity at Monash University, said the outage showed how distributed internet architecture is still susceptible to outages.
“These types of reliability issues can potentially result in financial losses and point to the need for a proper risk analysis,” he said.
“Businesses need to understand exactly what services and infrastructures they rely on.
“Even if these services promise high stability and redundancy, it is always possible that one or even several could fail and businesses need to plan for these outages and have contingency actions in place, if the risk becomes too high.”
For businesses relying on regular sales and advertising revenue, an hour’s downtime can be expensive.
Andy Champagne from cloud company Akamai Labs said companies could consider engaging multiple CDNs for added redundancy in case one goes down, but it’s not a guaranteed solution.
“The same rigor in engaging providers must be applied even when using multiple CDNs because splitting does not guarantee 100 per cent uptime for customers,” he said.
“If one CDN experiences an outage, manual intervention is typically required to route traffic to another CDN – it is not automatic.”
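As a rough illustration of what automating that step might involve, the sketch below probes a list of CDN health-check URLs in priority order and returns the first one that is not failing with a server error such as 503. The function names and the `example` hostnames are hypothetical, and this is not a description of how Akamai, Fastly or any other provider handles failover. In practice the switch usually also means updating DNS records or load-balancer rules, which is not shown here.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def http_status(url: str, timeout: float = 3.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the status code is still available.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar network errors.
        return None


def pick_cdn(
    endpoints: Sequence[str],
    probe: Callable[[str], Optional[int]] = http_status,
) -> Optional[str]:
    """Return the first endpoint whose probe does not report a 5xx or network failure."""
    for url in endpoints:
        status = probe(url)
        if status is not None and status < 500:
            return url
    return None


# Hypothetical usage: prefer the primary CDN, fall back to the secondary.
# active = pick_cdn(["https://cdn-a.example/health", "https://cdn-b.example/health"])
```

Passing the probe in as a parameter keeps the selection logic testable without real network calls. A production setup would also need repeated checks and some damping so that traffic does not flip back and forth between providers.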
Update Thursday 10 June: in a blog post about the incident, Fastly's senior vice president of engineering and infrastructure Nick Rockwell said the outage was caused by "an undiscovered software bug" that was "triggered by a valid customer configuration change".
The bug caused 85 per cent of requests to Fastly's network to return errors and has since been fixed.
Rockwell was apologetic.
"Even though there were specific conditions that triggered this outage, we should have anticipated it," he said.