All of the roughly 400 Atlassian customers who were affected by a serious outage at the start of this month are finally back online after a “faulty script” deleted entire sites.
For two weeks, Atlassian has been scrambling to bring its customers back online and repair trust in its business, which was severely eroded when an attempt to “delete legacy data from a deprecated service” went horribly wrong.
“We have now restored our customers impacted by the outage and have reached out to key contacts for each affected site,” Atlassian CTO Sri Viswanath said in a blog post.
“Let me start by saying that this incident and our response time are not up to our standard, and I apologise on behalf of Atlassian.”
According to Viswanath, Atlassian’s engineers were attempting to deactivate an app for Jira Service Management and Jira Software that was built into the cloud product.
It should have been a straightforward update using a pre-existing script to delete the legacy app from customers’ instances without fuss.
But a “communication gap” between the teams running the operation meant that, instead of being given the IDs for the app, the deletion team was given the IDs of entire cloud sites where the app was located.
Then, instead of executing the script to just mark the files for deletion – which may have allowed the teams to notice they were preparing to delete entire cloud sites – the script “was executed with the wrong execution mode and the wrong list of IDs”.
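Atlassian has not published the script itself, but the failure it describes maps onto a familiar pattern: a deletion tool that takes a list of object IDs plus an execution mode, where “mark for deletion” is reversible and “permanently delete” is not. The sketch below is purely illustrative; the Mode enum, run_deletion function and in-memory store are assumptions for the sake of the example, not Atlassian’s code.

```python
# Hypothetical sketch only -- not Atlassian's tooling. It shows how a
# deletion script with two execution modes and an ID list as input can
# destroy entire sites if it is fed site IDs instead of app IDs and run
# in "permanently delete" rather than "mark for deletion" mode.
from enum import Enum


class Mode(Enum):
    MARK_FOR_DELETION = "mark"    # soft delete: flag records, reversible
    PERMANENTLY_DELETE = "purge"  # hard delete: remove records outright


def run_deletion(ids: list[str], mode: Mode, store: dict) -> None:
    for object_id in ids:
        if object_id not in store:
            continue
        if mode is Mode.MARK_FOR_DELETION:
            # A soft delete leaves the data in place, giving operators a
            # chance to notice the IDs refer to whole sites, not apps.
            store[object_id]["deleted"] = True
        else:
            # A hard delete removes the object immediately.
            del store[object_id]


# The intended call: app IDs, soft-delete mode.
# run_deletion(app_ids, Mode.MARK_FOR_DELETION, store)
#
# The call that matches the incident description: site IDs, purge mode.
# run_deletion(site_ids, Mode.PERMANENTLY_DELETE, store)
```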
All customers impacted by the outage have been restored. Our teams will be working on a detailed Post Incident Report to share publicly by the end of April. https://t.co/UryM6XwG42
— Atlassian (@Atlassian) April 19, 2022
Atlassian fortunately has a policy of keeping backups for 30 days, along with a routine that can be used to roll back customers when an admin has a bad day, accidentally deletes their own data, and needs to quietly restore it into a new environment.
What it doesn’t have is an automated process for simultaneously reinstating data for hundreds of customers into existing environments without affecting other customers who didn’t have their sites unceremoniously removed.
“Because the data deleted in this incident was only a portion of data stores that are continuing to be used by other customers, we have to manually extract and restore individual pieces from our backups,” Viswanath said.
“Each customer site recovery is a lengthy and complex process, requiring internal validation and final customer verification when the site is restored.”
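Viswanath’s description suggests why the recovery could not be a simple rollback: the deleted sites lived in data stores shared with unaffected customers, so engineers had to pick individual tenants’ records out of the backups rather than restore whole stores. The sketch below is a hypothetical illustration of that kind of selective restore against a toy SQLite schema (an issues table keyed by site_id); none of the names reflect Atlassian’s actual systems.

```python
# Hypothetical sketch only -- not Atlassian's tooling. Restoring one
# tenant from a backup of a shared, multi-tenant data store is a
# selective copy, not a full rollback: only rows belonging to the
# deleted site may be written back, or other customers' newer data
# would be overwritten.
import sqlite3


def restore_site(backup_db: str, live_db: str, site_id: str) -> int:
    """Copy one site's rows from a backup into the live store.

    Returns the number of backup rows found for that site.
    """
    backup = sqlite3.connect(backup_db)
    live = sqlite3.connect(live_db)

    # Pull only the rows that belong to the affected site.
    rows = backup.execute(
        "SELECT id, site_id, payload FROM issues WHERE site_id = ?",
        (site_id,),
    ).fetchall()

    for row in rows:
        # INSERT OR IGNORE so rows other tenants have since written,
        # or rows the site already has, are never clobbered.
        live.execute(
            "INSERT OR IGNORE INTO issues (id, site_id, payload) "
            "VALUES (?, ?, ?)",
            row,
        )

    live.commit()
    backup.close()
    live.close()
    return len(rows)
```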
Atlassian has been restoring sites “in batches of 60 tenants at a time”, which took around four to five days to complete, with the work finishing on Tuesday morning, around two full weeks after the outage began.
During the outage, customers expressed their dismay on social media as system admins tried in vain to get responses about when their Jira instances would return and if their data was backed up.
Jaime Vogel, CIO for Western Australian roadside assistance company RAC WA, was frustrated with the time it took for Atlassian to go public.
“Whilst I have empathy for the team resolving it, that doesn’t excuse the lack of awareness and recognition,” he tweeted.
“Like many customers, our business grinds to a halt when these services are down.”