Panic? Yes, quite a bit. But no matter the magnitude of the technical failure, AWS does a very good job of handling the situation when one of its cloud services goes down.
And there is no better source to hear it from than Werner Vogels, the chief technology officer at Amazon Web Services.
In an interview on the sidelines of the recent AWS Summit in London in late June, Vogels talked about how the company handles the technical side of things when something goes wrong and takes huge swathes of the internet with it.
Take, for instance, the high-profile outage towards the end of February 2017, when a number of large websites quickly went down.
“We are so, so aware of the fact for many businesses their livelihoods are dependent on Amazon operating, on AWS really operating well, and that’s a heavy responsibility. We’re happy to take it.”
The first order of business, according to the AWS CTO, is to find the problem and then calm down the customers that depend on the cloud platform. The internal teams then get down to finding the root cause of the problem and trying to repair or restore whatever failed.
“You see the symptoms, but you do not necessarily see the root cause of it … you immediately fire off a team whose task is to actually communicate with the customers … making sure that everyone is aware of exactly what is happening.”
Vogels says that teams at Amazon Web Services work around the clock to ensure uptime. And while the senior management team continuously tracks developments, he expects to be woken up immediately if there is a major outage.
The February incident was due to human error: an engineer typed the wrong number, which set off a chain reaction that ultimately led to a major failure.
The outage affected sites like Quora and the project management tool Trello, and even Amazon’s artificial intelligence assistant, Alexa, struggled during this infamous little catastrophe.
That said, Vogels places the blame not on the engineer who was directly responsible for the outage, but on Amazon itself for not having fail-safes that could have prevented the incorrect input or protected its systems.
And this, ultimately, is a key point for the CTO of the cloud giant: learning from errors.
The whole interview is an excellent read, and the article also goes over precisely what caused the February outage. Point your browser to the link above to give it a look.