Earlier this month, some Microsoft Cloud services had an unexpected outage. It didn’t affect every region, and it wasn’t a failure of the services themselves but an issue with the authentication platform, Azure Active Directory.
The bad parts
I noticed the issue when my phone popped up an error about being unable to sign in to my Office 365 account to fetch my mail: “Service is temporarily unavailable. Please retry later.” Curious as to whether my phone was just glitching or if this was a symptom of a global outage, I tried signing into office.com on my PC. There’s nothing more concerning than receiving an error that your organization does not exist. So, like any good system administrator, I checked Twitter and various outage websites. Initial reports seemed to suggest that the outage was limited to the APAC area, until I saw tweets from South America, the UK and the Netherlands. Then someone reported that they couldn’t access the Azure Portal or line-of-business applications that use Azure Active Directory for authentication.
The Microsoft Office Status Twitter account confirmed that they were aware of an issue, but that was the extent of it. At one point we were told to check MO133518 for details … which was in the Office 365 Admin panel … which we couldn’t log in to, because of the outage. On answers.microsoft.com, one user was told to refer to the Service Health Dashboard (SHD) for information, then the thread was closed. Again, the SHD is behind an authentication gate. It also took a while for https://status.office.com to be updated from “There are currently no known issues preventing you from signing in to your Office 365 service health dashboard.” No issues, except every affected Office 365 admin screaming on Twitter.
The good parts
In Australia, it was a Friday night, so the impact on the APAC region was minimized, but others waking up to their Friday weren’t so lucky.
The Post Incident Review report indicated that “A code issue caused data objects within the authentication infrastructure to get moved to an incorrect location, resulting in authentication failures.” Think of it like moving a file and then your shortcut not working. While we can’t tell if the code issue was a human error, there were no doubt a bunch of highly skilled humans focused on finding the cause and fixing it. All while I sat back and played Sea of Thieves on the Xbox. The total outage time was just over 3 hours. Considering the complexity of Azure Active Directory and its underlying infrastructure, that’s not bad. It’s not amazing, and you could argue for more redundancy, but sometimes even extra redundancy doesn’t help, depending on the root cause. Sometimes it just takes highly skilled individuals to put the puzzle pieces together and find the gremlin in the system. And for that, we pay a very small sum to Microsoft each month.
Communication – Microsoft needs to update their communication plan to note that “can’t log in” type outages mean nobody has access to the Admin portal or SHD. It seems Twitter has become the medium of choice for outage communications, because it’s not like they can email you and they’re not going to phone everybody. We’d also settle for a services status page that doesn’t require you to log in. I’m not going to complain about the level or frequency of information though, because I’ve been one of those system administrators on the receiving end of an outage. You can be waiting for an hour or more and still not know what is going on from the Microsoft perspective. And you don’t want to stop troubleshooting to call in status reports or explain things to the PR team. The Post Incident Review report gives us a good overview of the timeline, a high-level explanation of the cause and a commitment to work on preventing similar issues in the future.
Patience – Screaming on Twitter does not help engineers fix problems any faster (who knew?). Yes, it’s frustrating. It may even be helpful for the vendor to see the extent of the problem, though it doesn’t take long for them to figure out which part of the system is down and how many organizations would be affected. It definitely doesn’t help to yell that Microsoft is crap and you’re switching to G Suite. What does help is focusing on what you can do in the meantime. Which brings me to my next point …
Business Continuity – Why do we have this expectation that the Cloud is 100% perfect, all the time? Is it because the marketing material wants to sell that story? Does it help us feel better that we’ve migrated away from our own infrastructure? Outages happen. They happen to Google, they happen to AWS and they happen to Microsoft. It’s someone else’s computer with lots of moving parts and humans are responsible for it. So, like our own infrastructure, we owe it to the business to educate them on what their Business Continuity Plan should be. If you’ve moved to SaaS, you still need BCP. That can involve getting your helpdesk to tell users to switch Outlook to Work Offline mode, so they can use the data they have without constant connection errors. It could also involve switching to manual processes or paper-based instructions. And it might include sending out your own notification to your customers asking them to phone you for anything urgent. If you are yelling on Twitter that this is costing you millions (true story), you might want to focus on this last point.
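If you want something more systematic than watching Twitter, one option is a small probe of your own that runs outside the affected tenant and tells the helpdesk when to kick off the continuity plan. What follows is only a rough sketch, not anything Microsoft provides: the function names, the three-failures-in-a-row threshold and the rule that any answer below HTTP 500 counts as "up" are all my own assumptions, and the URL you probe is whatever sign-in page your users actually depend on.

```python
import urllib.error
import urllib.request
from typing import List, Optional


def probe(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, timeout, refused connection and similar.
        return None


def is_healthy(status: Optional[int]) -> bool:
    """Assumption: any response below 500 means the service is answering."""
    return status is not None and status < 500


def should_invoke_bcp(history: List[Optional[int]], threshold: int = 3) -> bool:
    """True once the most recent `threshold` probes were all unhealthy."""
    if len(history) < threshold:
        return False
    return all(not is_healthy(s) for s in history[-threshold:])
```

You would call `probe()` on a schedule, append each result to a list and check `should_invoke_bcp()` after every run. Requiring several consecutive failures keeps a single blip from sending everyone to paper forms. A status code alone won't catch every failure (an error page served with a 200 would slip through), so treat this as a prompt for a human to look, not as the final word.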
For more information
Microsoft has published updates to the Service Health Dashboard (located in your Office 365 Admin portal) under ID MO133518. You might need to click View History and select Last 30 days or type in the Search bar to find it. It also contains a link to the Post Incident Report. Additionally, MO133811 has appeared for those organizations that were indeed impacted by the outage.