Cloud Service Reliability – Part 3: ‘Antifragile’ – When DevOps met SRE

In part 2 of this blog series I introduced the concept of Antifragile systems (or services): systems that are neither fragile nor robust, but that thrive in chaos. They are architected, developed, deployed, and run from the ground up to achieve the SLOs for availability and responsiveness expected of today’s modern applications and services. They are systems that are DevOps-enabled and SRE-ready.

I have actually written extensively on Antifragile systems in my book ‘The DevOps Adoption Playbook’. So, instead of rehashing what I have already published to make it look like new content, I am just going to be lazy and reproduce two pages directly from chapter 5 (DevOps Plays for Driving Innovation) of my book, which describe the what and how of Antifragile systems. You can read much more about Antifragility in general, and how it applies to IT systems, services, and applications, in my book – available on Amazon <end shameless book promotion>.

(Antifragile) systems assume that failures will happen. Servers will go down. Disks will fail. Networks will deliver traffic with high latency. Network switches will fail. Third-party network connectivity providers will go offline. Allocated memory will not be sufficient. Data sets will exceed the capacity of the queues handling them. Entire sources of data streams will go down, or deliver their streams too slowly. Third-party services will fail to meet the service level agreements (SLAs) their providers and suppliers have committed to.

Applications being delivered will have defects. Too many users will want to access a hot, new application feature. Middleware configurations will be incorrectly set up, or not fine-tuned for the app in question. Hackers will try to compromise services. Bots will overload the app with useless traffic. An app completely separate from yours will cause a service you need to go down. An app in another system will crash, creating a domino effect across systems that impacts your app. Humans will intentionally cause disruption. Humans will insert malicious backdoors in services. Humans will insert benign ‘Easter eggs’ in their code. Humans will make errors.

You need to build systems that thrive in this chaos. These systems must be built on the assumption that something will go down, and they must stay up anyway by finding an alternate way to get the services they need. Here are some key characteristics of such an Antifragile system:

  • Fail fast. In line with the principle of fail fast, Antifragile systems need to be built to handle any failure, and to handle it fast. One popular approach is to never fix a server instance that has a fault or is not functioning or performing as desired: you simply kill that instance and replace it with a new one, without allowing the rest of the system to be impacted (a minimal sketch of this approach follows this list).

The goal is to have systems that never go down, even for maintenance or upgrades. Facebook never shows a message stating it will be “down for maintenance this Sunday night from 2 a.m. to 4 a.m.”; the next time you log in, you just get the upgraded version of the site.

  • Fail often. How do you prepare to handle failure, and to handle it fast? You fail often. Antifragile systems need to be able to address failures continuously. Unfortunately, while all IT Ops organizations have plans and protocols to handle incidents, those plans are rarely tested. A sports team has to practice continuously, through the pre-season and the season, to perfect the plays it wants to run. These may be a standard set of plays run all the time, or a game-changing play saved as a surprise tactic to win a critical game. The New Orleans Saints won Super Bowl XLIV in 2010 by running a surprise onside kick against the Indianapolis Colts. The play succeeded not because they caught the opposing team off guard (which they did), but because they had practiced it repeatedly and only added it to their playbook after it worked perfectly in practice.
  • MTBF to MTTR. The success of Antifragile systems needs to be measured differently than that of robust systems, and so it requires different metrics. Robust systems have traditionally used a metric called ‘mean time between failures’ (MTBF) to measure their stability. MTBF measures the time period between failures or incidents. An Antifragile IT system should not focus on MTBF: the goal is to fail fast and fail often, which makes that metric counterproductive. Antifragile systems assume that failures will happen and that there is no way to avoid them. As a result, an Antifragile system focuses its architecture and operational models on ‘mean time to resolve’ (MTTR). How quickly can a failure be repaired and a service that has gone down be brought back up? How can it minimize MTTR, and also minimize the impact of a downed service on other services and the overall system? How can it reach an operational state where, although servers may go “red,” services are always “green”? (A small example of computing both metrics also follows this list.)
  • Cattle not pets. Fragility in systems actually comes from a desire to make them too robust. System administrators who maintain individual servers to keep them always up give those servers all the care and feeding they need to handle any issue or stress situation, and manually handle each situation when it occurs. The servers are inherently unique, and so they are treated like pets. This would be fine in a world of physical servers running static instances. In today’s dynamic world, however, it is not scalable. Automation is needed both to manage the servers at scale, and to monitor them and mitigate challenges in real time. They need to be treated like cattle. Cattle do not have names. For all intents and purposes, they are inherently identical to each other. They are tagged using a scalable naming convention. They are culled if they get sick. They are fed in bulk. They are managed and maintained in bulk. And they have a finite, pre-determined lifespan, which ends with them becoming steak or hamburger or sausage[i]. In a similar manner, server instances need to be named using a scalable naming convention, not individual names. They need to be identical to each other. They need to be monitored and managed in bulk. They need to be killed and replaced with new instances when they have issues. And they need to have a predetermined, finite lifecycle that governs how they are provisioned and de-provisioned. Sorry, no more cows named “Daisy” live on the cattle ranch. (If they do, they are the rancher’s pet.) Similarly, no more servers named midnight.rational.com should live in your datacenter.
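To make the ‘fail fast’ and ‘cattle not pets’ ideas concrete, here is a minimal sketch of the kill-and-replace approach. It is not taken from the book; it assumes an AWS environment driven through boto3, instances identified by a hypothetical role=web tag rather than individual names, a hypothetical /healthz endpoint on port 8080, and an auto-scaling group that launches identical replacements for anything terminated.

```python
# A rough sketch, not production code: cull unhealthy "cattle" instances and let an
# auto-scaling group (assumed to exist) replace them with fresh, identical ones.
import urllib.request

import boto3  # assumes AWS credentials and a default region are configured

ec2 = boto3.client("ec2")


def find_web_instances():
    """Yield (instance_id, private_ip) for running instances with the hypothetical role=web tag."""
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:role", "Values": ["web"]},  # a scalable tag, not a pet name
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            yield instance["InstanceId"], instance.get("PrivateIpAddress")


def is_healthy(ip, timeout=2):
    """Probe a hypothetical /healthz endpoint; any failure counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"http://{ip}:8080/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False


def cull_unhealthy():
    """Fail fast: never repair a sick instance, just terminate it and move on."""
    doomed = [iid for iid, ip in find_web_instances() if ip and not is_healthy(ip)]
    if doomed:
        ec2.terminate_instances(InstanceIds=doomed)
    return doomed


if __name__ == "__main__":
    print("Terminated:", cull_unhealthy())
```

In practice you would drive this from your monitoring system rather than a simple poll, but the shape is the same: detect, terminate, and let automation bring up an identical replacement.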
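And here is a toy illustration of the MTBF versus MTTR distinction, computed from a handful of invented (detected, resolved) incident timestamps; the numbers are made up purely for the example.

```python
# A toy illustration of MTBF versus MTTR, computed from invented (detected, resolved)
# incident timestamps. The data here is made up purely for the example.
from datetime import datetime, timedelta

incidents = [  # (detected, resolved) -- hypothetical values
    (datetime(2018, 3, 1, 2, 10), datetime(2018, 3, 1, 2, 18)),
    (datetime(2018, 3, 4, 14, 0), datetime(2018, 3, 4, 14, 5)),
    (datetime(2018, 3, 9, 23, 40), datetime(2018, 3, 9, 23, 52)),
]


def mean(deltas):
    return sum(deltas, timedelta()) / len(deltas)


# MTBF: how long, on average, between the start of one failure and the start of the next.
mtbf = mean([later[0] - earlier[0] for earlier, later in zip(incidents, incidents[1:])])

# MTTR: how long, on average, a failure takes to resolve once it is detected.
mttr = mean([resolved - detected for detected, resolved in incidents])

print(f"MTBF: {mtbf}, MTTR: {mttr}")
```

The shift in focus is visible in the arithmetic: MTBF only rewards going longer between failures, while MTTR rewards detecting and resolving them quickly.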

Servers may go “red,” services are always “green”

When I wrote this chapter of the book in mid-2016, I was just starting to hear about SRE. The Google SRE book had come out only a couple of months earlier. While I had not had the opportunity to explore it in depth by then, the strands of SRE are evident in the section on Antifragile systems above. Operating such systems in production, coupled with a robust monitoring regime and incident response processes, as described in parts 1 and 2 of this series, is at the core of what we today refer to as SRE.

To conclude this series on SRE [ii], let me reiterate that adopting SRE is not a trivial task. Delivering Antifragile systems is not a silver bullet either, and robust monitoring just tells you when you are not delivering to an SLO. In addition, …wait, there’s more… the organizational structures for SRE certainly need to be in place. SRE teams need the proper skills, processes, and authority to respond without seeking approvals across several layers of delivery and operations. They need to be plugged into the traditional development, delivery, and operations teams in the enterprise, and all these teams need visibility into each other’s work. Development needs to know how the SRE team responded to an incident, so it can build more resilience into the code and prevent the incident from happening again. Delivery teams need change management processes that allow changes to reach production at high velocity without disrupting availability. Infrastructure and platform teams need to ensure that the services they deliver are highly available too, with their own SRE teams and practices in place. SRE teams, in turn, need to be aware of and prepared for the changes being developed and deployed, both to application code and to the infrastructure or platform services the applications run on.

Lastly, all the teams need to be prepared for incident response. As I mentioned in the ‘fail fast’ and ‘fail often’ sections of the book excerpt above, they need to continuously practice incident response for all types of potential incidents. This is where the Netflix Simian Army I referred to in part 2 of the series comes in. The Simian Army keeps Netflix’s development, delivery, and SRE teams ready for any kind of incident. It makes their systems and services, and their teams’ practices, Antifragile. To conclude, let me quote Netflix engineers Ariel Tseitlin and Yury Izrailevsky themselves on the Simian Army (also reproduced in my book ‘The DevOps Adoption Playbook’):

Imagine getting a flat tire. Even if you have a spare tire in your trunk, do you know if it is inflated? Do you have the tools to change it? And, most importantly, do you remember how to do it right? One way to make sure you can deal with a flat tire on the freeway, in the rain, in the middle of the night is to poke a hole in your tire once a week in your driveway on a Sunday afternoon and go through the drill of replacing it. This is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud.

This was our philosophy when we built Chaos Monkey, a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables—all the while we continue serving our customers without interruption. By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice. 

Inspired by the success of the Chaos Monkey, we’ve started creating new simians that induce various kinds of failures, or detect abnormal conditions, and test our ability to survive them; a virtual Simian Army to keep our cloud safe, secure, and highly available.

Are you prepared to handle the Simian Army in your enterprise? Even if you do not implement a Simian Army in your organization, just the exercise of exploring how you would prepare for such a regime goes a long way towards achieving incident-response preparedness and introducing automation into your incident management processes. Humans cannot respond to the Simian Army and win. It requires automation.
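For a sense of what that automation looks like at its very simplest, here is a bare-bones, Chaos-Monkey-style sketch. It is not Netflix’s actual tool (which they have open sourced); the AWS/boto3 calls, the hypothetical chaos=chaos-eligible opt-in tag, and the business-hours window are all assumptions made for illustration.

```python
# A bare-bones, Chaos-Monkey-style sketch: during a business-hours window, pick one
# opted-in instance at random and terminate it, trusting the platform to recover.
# This is not Netflix's tool; the tag name and window are illustrative assumptions.
import random
from datetime import datetime

import boto3  # assumes AWS credentials and a default region are configured

ec2 = boto3.client("ec2")


def candidate_instances():
    """Running instances explicitly opted in via a hypothetical chaos=chaos-eligible tag."""
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos", "Values": ["chaos-eligible"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]


def unleash_monkey():
    """Only run mid-business-day, when engineers are standing by, as the quote above describes."""
    now = datetime.now()
    if now.weekday() >= 5 or not (10 <= now.hour < 16):
        return None  # weekends and off-hours are off limits in this sketch
    victims = candidate_instances()
    if not victims:
        return None
    victim = random.choice(victims)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim


if __name__ == "__main__":
    print("Chaos Monkey terminated:", unleash_monkey())
```

The deliberately narrow window mirrors the point in the quote above: you inject failure when engineers are watching, so that recovery is automatic by the time failure happens on its own at 3 a.m. on a Sunday.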

So, a question for you, my readers: are you adopting SRE in your organization? Are your systems and services Antifragile? Do share your answers and thoughts on this series in the comments section below.

[i] With apologies to vegans, vegetarians, Hindus, and non-red-meat eaters.

[ii] I say ‘conclude’ without prejudice, your honor. Hence, I reserve the right to write another post in the series at a later date.