By: Niladri Choudhuri
“What happens when a software engineer is tasked with what used to be called operations” – Ben Treynor, Google.
Around 2003, well before the term DevOps came into existence, Google created the Site Reliability Engineering (SRE) role. SRE is a discipline that applies software engineering principles to infrastructure and operations problems, with the aim of making systems more stable and reliable and able to scale with business needs.
The goal of Site Reliability Engineering is to create ultra-scalable and highly reliable distributed software systems.
SREs spend up to 50% of their time on “ops” related work such as issue resolution, on-call, and manual interventions, and the rest of their time on development tasks such as new features, scaling, or automation. Monitoring, alerting, and automation make up a large part of SRE work.
The following are the SRE Principles:
What is SLO?
An SLO, or Service Level Objective, is a target level of reliability for a product or service, most often expressed as availability. It is the expected goal for how well a service should operate. SLOs are closely tied to the user experience: when well-chosen SLOs are consistently met, users are likely to be satisfied.
SLOs need to be set and monitored regularly, as meeting them is a key objective of SRE. Each product and service should have its own SLOs, and they should always be defined from the customer’s point of view.
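As a rough illustration of monitoring an SLO, the sketch below computes an availability figure from request counts and compares it with a target. The function names, the request counts, and the choice to treat a period with no traffic as fully available are all assumptions made for this example, not part of any particular monitoring tool.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that were served successfully."""
    if total_requests == 0:
        # Assumption: a period with no traffic counts as fully available.
        return 1.0
    return good_requests / total_requests


def meets_slo(measured: float, slo_target: float) -> bool:
    """Return True when the measured availability reaches the SLO target."""
    return measured >= slo_target


# Hypothetical numbers: 999,500 successful requests out of 1,000,000.
measured = availability(good_requests=999_500, total_requests=1_000_000)
print(f"Measured availability: {measured:.4%}")
print(f"Meets 99.9% SLO: {meets_slo(measured, 0.999)}")
```

Counting successful requests as seen by the user, rather than server uptime, keeps the measurement aligned with the customer’s point of view.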
What is TOIL?
“TOIL is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” – Vivek Rau, Google.
Examples of toil are manual releases, physically connecting to infrastructure to check something, doing regular password resets, testing over and over, acknowledging the same alerts every day, creating users, manual resets, on-call response, extracting data, manual scaling of infrastructure, etc.
TOIL is bad because:
According to the Catchpoint SRE Survey Report 2019, the following are the most popular SLOs:
What is Error Budget?
“100% is the wrong reliability target for basically everything” – Ben Treynor
The Error Budget is the amount of unreliability a service is allowed within a given period, usually expressed as the time during which the service may be degraded or unavailable. This is the budget that is spent on bringing in new features or making architectural changes. If we spend more than the budget, there has to be a consequence. One such consequence is to stop releasing new features and get the system stable, so all the post-mortem related backlog items are prioritized over new features. SRE encourages teams to spend the Error Budget down to zero, using it strategically to balance velocity (speed) and availability (stability).
We need to be lean and work in smaller batches, as big changes carry higher risk and can burn through the error budget quickly.
SLO – 99.9% Availability of the System
Error Budget – 43 minutes per month (0.1% of a 30-day month). All new feature releases, patches, and planned and unplanned downtime need to fit within these 43 minutes.
Consequence – If the Error Budget is used up, then the release of new features has to stop and user stories from the post-mortem related backlogs need to be prioritized.
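The arithmetic behind this example can be sketched in a few lines of Python. The function names and the 30-day window are assumptions chosen for illustration; a real team would pick the window and the release policy that suit its service.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the allowed downtime in minutes for an availability SLO."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Return the unspent error budget in minutes (negative when overspent)."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def releases_allowed(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Hypothetical policy: feature releases continue only while budget remains."""
    return budget_remaining(slo, downtime_minutes, window_days) > 0


# A 99.9% SLO over a 30-day month allows roughly 43 minutes of downtime.
print(f"Budget: {error_budget_minutes(0.999):.1f} minutes")
print(f"Releases allowed after 10 minutes of downtime: {releases_allowed(0.999, 10)}")
print(f"Releases allowed after 50 minutes of downtime: {releases_allowed(0.999, 50)}")
```

Once the downtime in the window exceeds the budget, `releases_allowed` returns `False`, which corresponds to the consequence described above: feature releases stop and stability work takes priority.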
What is Observability?
“Observability, as a noun, is a property of a system, it’s a measure of how well internal states of a system can be inferred from knowledge of its external outputs. Therefore, if our IT systems don’t adequately externalize their state, then even the best monitoring can fall short” – Peter Waterhouse, CA
Observability is about having enough data to answer questions that were not anticipated in advance. It requires architecting the system in such a way that it provides the information needed to understand its health.
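One common way to externalize state is to emit structured events that carry extra context alongside the basic outcome, so that new questions can be asked of the data later. The sketch below is only an illustration using the Python standard library; the `make_event` helper, the field names, and the sample values are all hypothetical.

```python
import json
import time


def make_event(service: str, operation: str, duration_ms: float,
               status: str, **context) -> str:
    """Build one structured event as a JSON string.

    Extra keyword arguments are attached as context fields, so the event
    can later be filtered or grouped by attributes nobody planned for.
    """
    event = {
        "timestamp": time.time(),
        "service": service,
        "operation": operation,
        "duration_ms": duration_ms,
        "status": status,
    }
    event.update(context)
    return json.dumps(event, sort_keys=True)


# Hypothetical event from a checkout service, with two context fields.
print(make_event("checkout", "charge_card", 182.5, "error",
                 region="eu-west", customer_tier="gold"))
```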
Observability is important because:
There are many other concepts that need to be looked at for delivering the best of services to the customer.