By Sanjeev Sharma, Principal Analyst
Ever since Google published the Site Reliability Engineering (SRE) book in 2016, the SRE movement has changed how organizations look at reliability and at incident response and management. Not unlike DevOps, adopting SRE is resulting in an organizational cultural shift: a shift that is changing how organizations are structured and how information flows within them, allowing for the delivery of more reliable and resilient dynamic systems. That being said, SRE as defined by Google is not applicable to most organizations. Organizations need to take the thought process and culture behind Google’s SRE and adapt it just enough to make it suitable and viable for their own business needs. As I see it today, large enterprises are mostly failing at doing this. They are either attempting to adopt SRE in its purest form, not realizing they are not Google, or changing (corrupting) it so thoroughly to suit how they have always done things, and their broken culture, that what they call SRE is SRE in name only.
“But, You are Not Google” (me, talking to many a CIO/VP of Ops)
“But, You are not Google”. This is a refrain I have repeated to many a CIO or VP of Ops at companies I have worked with on SRE adoption. I try to be polite, I promise. But really, they are not Google. In the very first pages of the Google SRE book, in the introduction itself, the authors describe why Google developed SRE. They have massive data centers on which their services run. Given their size, these data centers have a high incidence of hardware failure. This required Google to be able to dynamically move services from one part of the data center to another in very little time. Given the large user base of the deployed services, Google also needed extremely fast response times to outages and degradations in quality of service, with minimal impact on the user. Their operations teams had to find a way to handle all these incidents, outages and failures in an automated manner to reduce toil and stress on the team. Their incidents, outages and failures were also very repetitive. Given the homogeneous nature of the hardware across their data centers, and the nature of the services deployed, there were very few outliers. Most tasks could (should) be automated.
This led to the development of what we know today as SRE. Google had a team of software developers work in operations with the goal of developing software to handle the vast majority of tasks that were assigned to the system administration and incident response teams. As the software matured, more and more typical tasks were automated. The humans could then focus on the outliers: the tasks that were not ‘typical’. Reliability Engineering meets software engineering = SRE.
If the first paragraph of this section is not an apt description of your data centers and the systems you are running, you do not need SRE. Don’t get me wrong, you need (service/system) Reliability Engineering. You still need to automate repetitive, typical tasks in operations. You just don’t need to, and really should not, do it the Google way. You are not Google. Very few organizations are.
So what does SRE in the ‘regular’ Enterprise look like? It may be easier to describe what it does not look like. Here goes:
Sanjeev Sharma, Principal Analyst, accelerated strategies