Microsoft outage worsened by staff shortage

Microsoft has blamed insufficient staffing and automation issues for an outage at an Australian data centre which shook its Azure, Microsoft 365 and Power Platform services for over 24 hours.

Between 30 August and 1 September, Australian businesses reliant on software giant Microsoft’s cloud services suffered significant downtime when a “power sag” caused an outage impacting multiple products.

“This event was triggered by a utility power sag in the Australia East region which tripped a subset of the cooling units offline in one data centre, within one of the Availability Zones,” said Microsoft.

The large-scale outage impacted droves of customers, including notable Australian businesses such as budget airline Jetstar, accounting software maker MYOB and Aussie banks ME Bank and Bank of Queensland.

During the outage, users of Microsoft’s cloud platform Azure, productivity suite Microsoft 365, and developer software suite Power Platform faced widespread access and usability issues between 6.41pm on 30 August and 4.40pm on 1 September.

The incident knocked offline the chiller plant (the system that cools a data centre) serving two data halls, and the resulting heat damaged parts of the company’s storage hardware.

“The cooling capacity was reduced in two data halls for a prolonged time, so temperatures continued to rise,” said Microsoft.

“At 11.34 UTC, infrastructure thermal warnings from components in the affected data halls directed a shutdown of selected compute, network and storage infrastructure – by design, to protect data durability and infrastructure health.

“This resulted in a loss of service availability for a subset of this Availability Zone.”

Microsoft’s report of the incident suggested it may not have been adequately prepared for an outage of this scale, as the company said it did not have enough staff on-site to restart the chillers in a timely manner.

Only three people were on duty in Australia during the inciting “power sag”, which Microsoft itself admitted was too few.

“We have temporarily increased the team size, until the underlying issues are better understood and appropriate mitigations can be put in place,” said Microsoft.

The incident was compounded by problems with Microsoft’s automation, which left the company struggling to bring its infrastructure back online.

As rising temperatures damaged Microsoft’s storage hardware, the company’s diagnostic tools were unable to locate essential data because the related storage servers were down.

“Diagnostics were not able to identify the faults, because the storage nodes themselves were not online,” said Microsoft.

“As a result, our onsite data centre team needed to remove components manually, and re-seat them one by one to identify which particular component(s) were preventing each node from booting.”

Furthermore, the company’s automation was “incorrectly approving stale requests” and “marking some healthy nodes as unhealthy” – further slowing its recovery efforts.

Users on Reddit and Twitter were quick to criticise the company and its slow recovery, attributing much of the incident to Microsoft’s recent layoffs.

“Not surprised,” said Reddit user No_Document_7800.

“Microsoft has been slimming up their teams, outsourcing, or offshoring them to reduce costs, and we can see from their quality of product and their reliability of service.”

However, Mark Culhane, director at Australian tech consultancy Zoak Solutions, told Information Age the incident had not put him off Microsoft’s cloud services.

“This incident does not raise significant concern about Microsoft's cloud services,” he said.

“They with the other major cloud providers – AWS, GCP – are generally much more reliable and failure resistant when compared to alternative solutions.”

Culhane also approved of the software giant’s response to the outage, suggesting its track record of reliable services should not be overshadowed by recent events.

“Even if the impact was more significant, I still think that Microsoft's response in this case was appropriate,” said Culhane.

“Microsoft’s root cause analysis of insufficient staffing and broken automation is not surprising. Considering the generally high stability of their cloud services over the past years, this specific incident and their following response does not elicit deep concern for us.”