Ever since Google published the Site Reliability Engineering (SRE) book in 2016, the SRE movement has changed how organizations look at reliability, incident response, and incident management. Not unlike DevOps, adopting SRE results in an organizational cultural shift: a shift that changes how organizations are structured and how information flows within them, in order to deliver more reliable and resilient dynamic systems. That being said, SRE as defined by Google is not applicable to most organizations. Organizations need to take the thought process and culture behind Google’s SRE and adapt it just enough to make it suitable and viable for their own business needs. As I see it today, large enterprises are mostly failing at this. They either attempt to adopt SRE in its purest form, not realizing they are not Google, or they totally change (corrupt) it to suit how they do things, how they have always done things, and their broken culture, making what they call SRE, SRE in name only.
“But, You are Not Google” - Me, talking to many a CIO/VP of Ops
“But, You are not Google”. This is a refrain I have repeated to many a CIO or VP of Ops at companies I have worked with on SRE adoption. I try to be polite, I promise. But really, they are not Google. In the opening pages of the Google SRE book, in the introduction itself, the authors describe why Google developed SRE. Google has massive data centers on which its services run. Given their size, these data centers have a high incidence of hardware failure. This required Google to be able to move services dynamically from one part of a data center to another very quickly. Given the large user base of the deployed services, Google also needed extremely fast response times to outages and degradations in quality of service, with minimal impact on the user. Its operations teams had to find a way to handle all these incidents, outages and failures in an automated manner to reduce toil and stress on the team. The incidents, outages and failures were also very repetitive. Given the homogeneous nature of the hardware across Google’s data centers, and the nature of the services deployed, there were very few outliers. Most tasks could (should) be automated.
This led to the development of what we know today as SRE. Google had a team of software developers work in operations with the goal of developing software to handle the vast majority of the tasks that were assigned to the system administration and incident response teams. As the software matured, more and more typical tasks were automated. The humans could then focus on the outliers, on the tasks that were not ‘typical’. Reliability engineering meets software engineering = SRE.
If the first paragraph in this section is not an apt description of your data centers and the systems you are running, you do not need SRE. Don’t get me wrong, you need (service/system) Reliability Engineering. You still need to automate repetitive, typical tasks in operations. You just don’t need to do it the Google way, and really should not. You are not Google. Very few organizations are.
SRE for the Enterprise
So what does SRE in the ‘regular’ Enterprise look like? It may be easier to describe what it does not look like. Here goes:
- You are not replacing your current Ops team, your sysadmins, with software engineers. You need your Ops team. They know how your custom-built infrastructure and systems work. They know their idiosyncrasies. They know that when Chicago opens a ticket to say they are offline again, it’s the network. Yes, it’s always the network, but the sysadmins know who to ping at Equinix to get it restored pronto. They know that the option trade desk system slows to a crawl on Expiration Friday and that you just ignore those tickets from traders that day. And even if you wanted to get rid of all the sysadmins, can you afford to hire enough software engineers to replace them all? You can barely fill all your open slots on the dev teams. What you need to do is complement your Ops teams with software engineers who can understand what the teams do day in, day out, identify which tasks are repetitive and typical, and then develop tools for automated remediation. These software engineers should be embedded in the Ops team, not placed in a separate team on the outside. Think Squads.
- Renaming your DevOps teams as SRE is a no-no. First of all, <steps on soapbox/> there should NOT be a team called the DevOps team! You created a new silo to carry out what was supposed to be a movement to eliminate silos? DevOps should be what everyone does. If you do have a separate team to build automation, that is exactly what they are – the automation-building team. And that is exactly what they should do – build automation for others to use. They build tools to enable the processes of DevOps. They do NOT use those tools themselves to do any real deployments. Ever. But I will leave my true feelings on this for another blog post <steps off soapbox/>. Renaming your DevOps teams as SRE only gets them to leave for better-paid jobs as real SREs elsewhere, now that their resumes formally say they are SREs. Don’t do it. DevOps is DevOps. SRE is SRE. They are joined at the hip, but they are different. SREs have a different set of goals from the application development teams – build automation to reduce toil, increase observability, and improve the reliability of the systems in production. They should not be building deployment automation or improving test environments to get better quality signals.
- SRE is first and foremost about culture. Do you have a culture that is reliability focussed? What does reliability mean to you? Is it MTBF or MTTR? Do you have well defined SLAs (or SLOs)? How do you measure them? Do you meet them? Do you have Observability? Do you do blameless postmortems of incidents? Are they really blameless? Do you make your developers ‘carry pagers’ and do they do it without fear? Do you deploy on Fridays and sleep well over the weekend?
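To make the “How do you measure them?” question a little more concrete, here is a minimal sketch in Python of how availability, remaining error budget, MTTR and MTBF could be derived from a list of incidents. Everything in it is an assumption made for illustration: the `Incident` class, the `reliability_report` function and the 99.9% target are hypothetical names and numbers, incidents are treated as full outages that do not overlap, and your own SLOs may well be defined on request success rates or latency rather than on downtime.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record: a full outage from start to end."""
    start: datetime
    end: datetime


def reliability_report(incidents, window_start, window_end, slo_target):
    """Summarize reliability over a window.

    Assumes incidents do not overlap and fall entirely inside the window.
    slo_target is an availability fraction, e.g. 0.999 for "three nines".
    """
    window = (window_end - window_start).total_seconds()
    downtime = sum((i.end - i.start).total_seconds() for i in incidents)
    budget = (1 - slo_target) * window  # downtime the SLO allows
    count = len(incidents)
    availability = 1 - downtime / window
    return {
        "availability": availability,
        "error_budget_seconds": budget,
        "budget_remaining_seconds": budget - downtime,
        # MTTR: average time to restore service per incident
        "mttr_seconds": downtime / count if count else 0.0,
        # MTBF: average uptime per incident in the window
        "mtbf_seconds": (window - downtime) / count if count else float("inf"),
        "slo_met": availability >= slo_target,
    }


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    end = start + timedelta(days=30)
    incidents = [
        Incident(datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 20)),
        Incident(datetime(2024, 1, 19, 14, 0), datetime(2024, 1, 19, 14, 10)),
    ]
    for name, value in reliability_report(incidents, start, end, 0.999).items():
        print(f"{name}: {value}")
```

The point of a sketch like this is not the arithmetic. It is that answering the culture questions above requires you to have written down a target and to keep incident data clean enough to compute against it.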
I am barely scratching the surface of how SRE needs to be adapted for the enterprise (that is not Google). I would like to hear from you. But instead of leaving a comment on this post, which I am perfectly OK with, I would prefer that you respond to an SRE survey we recently created. I had the opportunity earlier this year to work with three good friends of mine – Marc Hornbeek, Archana Joshi and Niladri Choudhury – on a State of SRE survey commissioned by Catchpoint. This survey is designed specifically to get a view into how enterprises and other organizations not named Google are adopting SRE. If your organization is adopting or thinking of adopting SRE, do take this survey right now. We will be analyzing the results and sharing findings on the current state and trends (this is the 3rd year Catchpoint is running an SRE survey) later this Spring.
To sweeten the deal, I am offering my time to anyone who completes the survey and wants to talk SRE. If you have completed the survey and want to talk about your SRE adoption, or just have questions about SRE, email a screenshot of the completion screen to me and I will schedule a free 20-minute consult with you and your team. So, what are you waiting for – click here to go to the survey. I look forward to speaking with you soon.
Sanjeev Sharma, Principal Analyst, Accelerated Strategies