AWS outage lessons can help you harden your cloud strategy

I’ve done it, and I bet you’ve done it too. As evidenced by the most recent AWS outage, even cloud engineers do it.

I’m talking about taking down a system with a typo: a configuration error, caused by a human mistake, that knocks a service offline. In my case, I was a junior Unix systems administrator, and I probably had way too many shells open on way too many servers. I needed to reboot one, and right after I typed reboot -r and hit enter, I realized I was in the Sendmail server’s shell. It was 8:55 on a Monday morning. Not the ideal time for an email outage.

I learned an important lesson that day, albeit at the expense of the business. I was managing several on-premises servers, and managed to cause a little chaos. Avoiding this sort of human-induced outage is one of the benefits we expect if we move our applications to the cloud.

But then this happened…

The cause of the #Amazon Web Services failure that resulted in parts of the internet failing? A typo. https://t.co/T7tiOXjsRY #awsoutage pic.twitter.com/FVs8f70y0G

— Tamara McCleary (@TamaraMcCleary) March 3, 2017

The AWS Outage Incident Report

That’s right, per the AWS outage incident report, a typo caused a four-hour outage that impacted applications storing data on Amazon’s S3 (Simple Storage Service) platform in the US-EAST-1 Region. Other services that rely on S3 were also impacted, including the S3 console, new instance launches in Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS) volumes (when data was needed from an S3 snapshot), and AWS Lambda. There were real-world impacts. According to Apica, a company that helps businesses test, monitor, and optimize cloud and mobile apps, 54 of the 100 applications they track suffered a performance degradation of at least 20%.

Here is the actual AWS outage incident report (all emphasis mine):

At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests.
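To make the failure mode concrete, here is a minimal sketch of the kind of guardrail that can catch a mistyped input before it removes too much capacity. It is not AWS's tooling; the function name `plan_removal`, the exception `UnsafeRemoval`, and the `min_capacity` and `max_batch` limits are all hypothetical names chosen for illustration.

```python
from typing import Iterable, List, Set


class UnsafeRemoval(Exception):
    """Raised when a removal request fails a safety check (hypothetical name)."""


def plan_removal(fleet: Set[str], requested: Iterable[str],
                 min_capacity: int, max_batch: int) -> List[str]:
    """Validate a server-removal request before anything is actually removed.

    Illustrative assumptions: `fleet` is the set of servers currently in
    service, `min_capacity` is the fewest servers the subsystem needs to keep
    running, and `max_batch` caps how many servers one command may remove.
    """
    targets = set(requested)

    # A mistyped name or pattern often matches servers that do not exist.
    unknown = sorted(targets - fleet)
    if unknown:
        raise UnsafeRemoval(f"unknown servers: {unknown}")

    # Refuse requests that are larger than a routine change should ever be.
    if len(targets) > max_batch:
        raise UnsafeRemoval(
            f"request removes {len(targets)} servers, limit is {max_batch}")

    # Refuse requests that would leave the subsystem below its minimum.
    remaining = len(fleet) - len(targets)
    if remaining < min_capacity:
        raise UnsafeRemoval(
            f"only {remaining} servers would remain, minimum is {min_capacity}")

    return sorted(targets)
```

The point of the design is that the command fails loudly on a suspicious input instead of doing exactly what was typed. A human still has to pick sensible limits, but a single wrong argument can no longer remove most of a fleet.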

Now Should I Worry About the Cloud?

It turns out that if you move an application to the cloud, you’re still dependent on the humans who manage the operations that support that cloud. Does this mean you should avoid going to the cloud? Of course not! However, it does mean that you need to architect your cloud applications with the same rigor that you’ve always applied on-premises.

No single point of failure. Applications that were designed to work across AWS regions did not crash when the servers in the US-EAST-1 Region needed to be rebooted. You wouldn’t design an on-premises application with high availability requirements to rely on one set of storage, so why would you do that in a cloud? Check out your options before committing to a cloud provider, and be sure they offer services that can give you the redundancy a highly available app requires.
Interview your cloud provider. Amazon offers multiple ways to architect applications across regions, so if there is a failure in one region your application won’t go down completely. They were also completely transparent in their incident report, and they mentioned protocols your cloud provider should have, such as allowing only authorized team members to make infrastructure configuration changes and using established playbooks to make those changes. Don’t be shy. Remember, you’re handing over operation of the physical infrastructure that supports your apps to this company. Ask them hard questions!
It is your data, and you’re ultimately responsible for it. Don’t neglect the basics of good data hygiene (backups, monitoring, security) because you’ve outsourced the day-to-day upkeep of the infrastructure to a cloud provider. At the end of the day, you’re still responsible to your business for the data generated by that application. You can apply the strategies you’ve learned supporting on-premises applications to the cloud; you may just need to adjust them to use cloud technologies and tools.
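As a rough illustration of the "no single point of failure" idea, here is a minimal sketch of a read path that falls back to a second region when the first one fails. It assumes your data is already replicated to both regions. The names `read_with_failover`, `Fetcher`, and `AllRegionsFailed` are hypothetical; in practice each fetcher would wrap a region-specific storage client, which is left out here so the sketch stays self-contained.

```python
from typing import Callable, List, Sequence, Tuple

# A fetcher takes an object key and returns its bytes, or raises on failure.
# Assumption: in a real application, each fetcher wraps a storage client that
# is pinned to one region.
Fetcher = Callable[[str], bytes]


class AllRegionsFailed(Exception):
    """Raised when no configured region could serve the object."""


def read_with_failover(key: str,
                       regions: Sequence[Tuple[str, Fetcher]]) -> Tuple[str, bytes]:
    """Try each region in order and return (region_name, data) from the first success."""
    errors: List[str] = []
    for name, fetch in regions:
        try:
            return name, fetch(key)
        except Exception as exc:  # any failure means "try the next region"
            errors.append(f"{name}: {exc}")
    # Every region failed, so surface all the errors together.
    raise AllRegionsFailed("; ".join(errors))
```

You would call it with an ordered list such as `[("us-east-1", east_fetch), ("us-west-2", west_fetch)]`, where both fetchers are your own functions. This only covers reads; writes during a regional outage need a separate decision about where data lands and how it is reconciled afterward.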

Looking to the Future!

This wasn’t the first big AWS outage, and it won’t be the last. Remember, we’ve all been the victim (or maybe the perpetrator!) of a typo. Hopefully we learn something from every outage that helps us harden and protect our apps and data.