If you run an online business, you probably felt some effect from the AWS outage in late February 2017. Amazon Web Services (AWS) powers many of the largest sites on the web, so the knock-on effect left a large part of the Internet unavailable for several hours. Even if the failure didn’t affect you, it’s a good example of why no business should have a single point of failure.
What Happened to AWS?
Working in IT or DevOps can be stressful. A single wrong command can critically affect a multitude of systems. On a small network, the damage can usually be repaired quickly and the business does not lose much money. For a service the size of Amazon’s, a mistake costs millions, not only for the cloud vendor but for its clients as well.
IT departments regularly need to make changes in production. Usually, these changes are scheduled days in advance. You can assume that with such a large and important infrastructure, Amazon has a thorough and strict change control policy. Amazon later explained that the maintenance was meant to troubleshoot and fix a performance issue with its billing system. As part of that work, staff ran a command to take some servers offline, which is standard for server maintenance.
Instead of taking only a few servers offline, the command took down many more, including servers that were supposed to stay in service to support customers. Too many of the servers behind Amazon’s storage service (Simple Storage Service, or S3) went offline, and customers relying on the US East region lost service.
Amazon restarted the affected systems, but even this took longer than expected. Because the status dashboard itself depended on these servers, Amazon could not even alert customers using its standard dashboard tools.
Several large, well-known sites use AWS, so people around the world felt the impact. AWS has millions of customers, and among those affected were sites with millions of visitors a day. Quora, TMZ, Trello, and Mashable are just a few of the sites that were down during the outage. Even “Is It Down Right Now”, a site dedicated to telling you whether a site is down for everyone or just for you, relied on AWS, and it was down for several hours.
How to Avoid a Single Point of Failure
Usually, cloud services are used to help sites avoid single points of failure. In this scenario, AWS was the cloud infrastructure that became the single point of failure. Normally, cloud infrastructure involves data centers that replicate data across several nodes. If one data center has a catastrophic failure, the other data centers already hold copies of the data and take over while the failed one is repaired. Users might notice slower performance during the switch, but the sites aren’t entirely down.
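The failover idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data center names and the fake_fetch function are hypothetical stand-ins, not part of any AWS API, and a real client would use actual network calls and narrower error handling.

```python
from typing import Callable, Sequence


def fetch_with_failover(replicas: Sequence[str], fetch: Callable[[str], bytes]) -> bytes:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return fetch(replica)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{replica}: {exc}")
    # Only reached when every replica has failed.
    raise RuntimeError("all replicas failed: " + "; ".join(errors))


def fake_fetch(replica: str) -> bytes:
    """Hypothetical fetch function; 'dc-east' is simulated as being offline."""
    if replica == "dc-east":
        raise ConnectionError("data center offline")
    return f"served by {replica}".encode()


result = fetch_with_failover(["dc-east", "dc-west", "dc-europe"], fake_fetch)
print(result.decode())  # served by dc-west
```

The request still succeeds because a second copy exists. If the list held only one entry, that entry would be the single point of failure.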
A “single point of failure” exists when one critical outage can cause the entire network to fail. The single point can be a person, a process, or a piece of hardware or software. It is anything that could unilaterally bring down the entire business, whether it belongs to an internal or an external system.
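One simple way to start looking for single points of failure is to list each critical component along with how many independent copies (or trained people) stand behind it. The sketch below assumes a hand-written inventory; the component names and counts are made up for illustration.

```python
from typing import Dict, List

# Hypothetical inventory: component name -> number of independent instances
# (servers, providers, or people able to do the job).
inventory: Dict[str, int] = {
    "web server": 3,
    "load balancer": 2,
    "database": 1,
    "dns provider": 1,
    "deploy engineer": 1,
}


def single_points_of_failure(components: Dict[str, int]) -> List[str]:
    """Return the components that have no redundant counterpart."""
    return sorted(name for name, count in components.items() if count < 2)


print(single_points_of_failure(inventory))
# ['database', 'deploy engineer', 'dns provider']
```

Note that the list mixes hardware, external services, and staff, since any of them can be the one thing the business cannot run without.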
The AWS outage taught IT teams that even a large, replicated cloud infrastructure can be a single point of failure. When you build resilient IT systems, you never want to allow this to happen. The cloud is often used as a backup system for large internal networks, but it can also be the main infrastructure.
For businesses that rely on AWS, it’s difficult to create a better plan to avoid this type of outage. One option would be to keep a small, replicated third-party host that could be switched on when AWS fails, but this would mean DNS changes, and depending on record TTLs and resolver caching, DNS changes can take anywhere from minutes to a couple of days to propagate across the web. You could instead keep a secondary host with DNS already in place and redirect traffic from your main service to the secondary one, but this is an expensive option for small businesses.
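To make the DNS trade-off concrete, here is a minimal sketch of the failover decision and a rough estimate of how long visitors might keep reaching the failed host. It assumes resolvers honor the record's TTL (not all do), and the hostnames, TTL, and detection time are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DnsRecord:
    name: str
    target: str
    ttl_seconds: int  # how long resolvers may cache the answer


def choose_target(primary: str, secondary: str, primary_healthy: bool) -> str:
    """Point traffic at the secondary host only when the primary is down."""
    return primary if primary_healthy else secondary


def worst_case_switchover_seconds(record: DnsRecord, detection_seconds: int) -> int:
    """Time to notice the outage plus the longest a cached answer can live.

    Assumes resolvers respect the TTL; some cache for longer in practice.
    """
    return detection_seconds + record.ttl_seconds


record = DnsRecord("www.example.com", "primary.example.com", ttl_seconds=300)
new_target = choose_target(record.target, "standby.example.net", primary_healthy=False)
print(new_target)  # standby.example.net
print(worst_case_switchover_seconds(record, detection_seconds=60))  # 360
```

With a five-minute TTL the switch completes in minutes under these assumptions. With a one-day TTL (86400 seconds) the same calculation gives roughly a day, which is why a low TTL set in advance matters more than the speed of the change itself.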
One common pitfall is allowing a single person to control a critical system. This person might be a developer, a manager, or a member of the finance staff. Ask yourself: if this one person left the company, would someone else be able to pick up where they left off? Always have employees cross-train each other. Some businesses have employees trade jobs for a week to help with this training.
Avoiding a single point of failure usually takes a team of experts. They can be consultants or part of your IT staff. Once you identify these issues, you should work with your team to make changes to your network. The AWS outage is a prime example of how disastrous the failure of just one system can be for your customers and your staff.