Site Reliability Engineering: What is it?

By: Niladri Choudhuri

According to LinkedIn’s latest ‘Emerging Jobs 2020’ report, Site Reliability Engineer is among the top 10 in-demand jobs for 2020.

Business Insider mentions: ‘SRE’s annual remuneration can go as high as Rs. 30 Lakhs for those with an experience of 5 years’. This seems like the best job to bag in 2020. The question is: who are SREs, and what do they do?

According to the 2019 SRE Report by Catchpoint – ‘Site Reliability Engineering is still emerging as a practice’. As per the Google book ‘Site Reliability Engineering’:

‘Site Reliability Engineering is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.’

The goal of Site Reliability Engineering is to create ultra-scalable and highly reliable distributed software systems.

The Principles of Site Reliability Engineering are:

  • Operations is a Software Problem – Software engineering principles such as designing and building are used to solve problems, rather than only maintaining and operating systems
  • Service Level Objectives – Services are managed to the SLO (Service Level Objective), which is defined from the customer’s point of view. SLOs need consequences: some action has to be taken when an SLO is breached. SLOs exist to make the user experience better
  • Toil – Any repetitive, mundane operational task is bad and should be automated. The ‘Wisdom of Production’ is used to design better systems. SREs must have time to make tomorrow better than today
  • Automation – Automate whatever can be automated to help remove toil. Infrastructure as code and configuration as code should be adopted. Bad processes need to be fixed before they are automated. SREs also have the ability to regulate the work
  • Reduce Cost of Failure – The later a problem is detected, the higher its cost. SRE tries to reduce MTTR (Mean Time To Repair). Canary testing and smaller pieces of work help in faster detection and recovery
  • Shared Ownership – SREs share a skill set with the development team and also have operations-related skills, so the silo between Dev and Ops is broken. This requires organizational changes in structure, a move from individual to team-based performance appraisal, and at least T-shaped skills.
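As a rough illustration of the MTTR metric mentioned in the list above, the sketch below averages the time between detection and resolution over a set of incidents. The function name and the incident timestamps are made up for this example, and real teams may define the start of the repair clock differently.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average (detected -> resolved) duration over a list of incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2020, 1, 3, 10, 0), datetime(2020, 1, 3, 10, 30)),  # 30 minutes
    (datetime(2020, 1, 9, 14, 0), datetime(2020, 1, 9, 15, 30)),  # 90 minutes
]

print(mean_time_to_repair(incidents))  # 1:00:00
```

Tracking this number release over release is one way to see whether canary testing and smaller changes are actually shortening recovery.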

Another concept that SRE uses is the ‘Error Budget’. If there is a breach of SLO, there has to be a consequence. For example, if there are 1 million transactions per month and we have a 99.9% SLO, then up to 1,000 transactions (0.1% of 1 million) may fail in a month. This is the error budget. It means that new releases, patches, modifications, etc., may together cause at most 1,000 failed transactions. If more than that fail, we may need to stop new releases until the system is stable again.
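The error budget arithmetic above can be sketched in a few lines of Python. The function names and the release-freeze rule are illustrative assumptions, not a standard API; they simply encode the 1 million transactions and 99.9% SLO example from the text.

```python
def error_budget(total_events: int, slo_percent: float) -> int:
    """Number of events allowed to fail in the period under the given SLO."""
    # round() absorbs floating-point noise in (100 - slo_percent).
    return round(total_events * (100 - slo_percent) / 100)


def releases_allowed(failed_events: int, total_events: int, slo_percent: float) -> bool:
    """Assumed policy: keep releasing while failures stay within the budget."""
    return failed_events <= error_budget(total_events, slo_percent)


budget = error_budget(1_000_000, 99.9)
print(budget)                                      # 1000
print(budget - 400)                                # 600 failures still available
print(releases_allowed(400, 1_000_000, 99.9))      # True
print(releases_allowed(1_200, 1_000_000, 99.9))    # False: freeze releases
```

In practice the budget is usually tracked over a rolling window, and the consequence of exhausting it is agreed between the development and SRE teams in advance.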

According to Catchpoint’s SRE Survey 2019, the most popular SLOs are:

  • Availability: 72%
  • Response Time: 47%
  • Latency: 46%
  • We don’t have SLOs: 27%
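To make two of these SLO types concrete, here is a minimal sketch of how availability and latency indicators might be computed from request records. The record shape (`ok`, `latency_ms`), the sample data, and the choice of the nearest-rank percentile method are all assumptions made for illustration.

```python
import math


def availability(requests):
    """Percentage of requests that succeeded."""
    ok = sum(1 for r in requests if r["ok"])
    return 100 * ok / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct% of the samples."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical request log: nine successes and one slow failure.
latencies = [120, 80, 95, 300, 110, 90, 105, 130, 85, 100]
requests = [{"ok": ms != 300, "latency_ms": ms} for ms in latencies]

print(availability(requests))                               # 90.0
print(percentile([r["latency_ms"] for r in requests], 95))  # 300
```

Each indicator would then be compared against its objective, for example "availability of at least 99.9%" or "95th percentile latency under 200 ms" (both targets are examples, not survey findings).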

SREs spend 50% of their time on operations work and 50% on development work. Google also states that no one should spend more than 25% of their time on call. Monitoring is important in SRE, but observability is more important. Externalizing all the outputs of a service allows us to infer the internal state of that service, which makes it observable. Being observable means being proactive, whereas monitoring only reports on an event after it has occurred.
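One common way to externalize a service’s outputs is to emit a structured event for every operation, so that its internal state can be inferred later from the event stream. The sketch below uses only the Python standard library; `emit_event`, `timed_call`, and the field names are hypothetical helpers written for this example, not part of any particular observability tool.

```python
import json
import sys
import time


def emit_event(event, stream=sys.stdout, **fields):
    """Write one JSON line describing what the service just did."""
    record = {"ts": time.time(), "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line, file=stream)
    return line


def timed_call(name, func, *args):
    """Run func and externalize its outcome and duration as an event."""
    start = time.perf_counter()
    try:
        result = func(*args)
    except Exception as exc:
        elapsed_ms = round((time.perf_counter() - start) * 1000, 3)
        emit_event(name, outcome="error", error=type(exc).__name__, duration_ms=elapsed_ms)
        raise
    elapsed_ms = round((time.perf_counter() - start) * 1000, 3)
    emit_event(name, outcome="ok", duration_ms=elapsed_ms)
    return result


# Usage: wrap a (hypothetical) piece of business logic.
total = timed_call("add_to_cart", lambda a, b: a + b, 2, 3)
```

Events like these can be shipped to a log or metrics backend and queried after the fact, which is what allows questions nobody anticipated to be answered without redeploying the service.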

SRE requires automation. The following can be areas of automation driven by SRE:

  • Infrastructure as Code/Configuration as Code – Tools like Terraform, AWS CloudFormation, Puppet, Chef, Ansible, SaltStack, Docker, etc.
  • Automated Functional and Non-Functional testing in production – Tools like Selenium, Cucumber, Jasmine, Mocha, Zephyr, Mockito, JMeter, Sonatype Nexus Lifecycle, SoapUI, WhiteSource, Veracode, Nagios, etc., can be used
  • Only Versioned and Signed artifacts are deployed – Tools used are Nexus, Artifactory
  • Automation enables better observability – Tools used are OpsGenie, Nagios, Dynatrace, AppDynamics, Prometheus, Splunk, Logstash
  • Makes future growth planning easier – Tools that can help are Amazon Cloud Auto Scaling, Kubernetes Pod Scaling, Amazon Cloud RDS, NoSQL-type databases like MongoDB and Couchbase, and Cloud APIs
  • Antifragility and Chaos Engineering – Tools like Chaos Monkey, PagerDuty, VictorOps, Squadcast. Fire drills also need to be done.

Automation makes things consistent, testable, and easy to get production ready. It is also more secure and auditable, and it makes errors easier to recreate. The cost of change is lower and regression risk is reduced. Automated deployment makes releases more secure and less vulnerable, reduces dependency errors, and helps to identify vulnerabilities faster and more easily. Automation helps in reducing MTTR and helps with protective monitoring. It also reduces toil and therefore the Total Cost of Ownership. Various risks, such as those to availability and integrity, are mitigated.

SRE does not stand alone. It works together with DevOps, Lean, IT Service Management and Agile.

We also offer a Foundation Level SRE Certification Course.
