By: Niladri Choudhuri
According to LinkedIn’s latest ‘Emerging Jobs 2020’ report, Site Reliability Engineer is among the top 10 in-demand jobs for 2020.
Business Insider mentions: ‘SRE’s annual remuneration can go as high as Rs. 30 Lakhs for those with an experience of 5 years’. This seems like the best job to bag in 2020. The question is: who are Site Reliability Engineers, and what do they do?
‘Site Reliability Engineering is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.’
The goal of Site Reliability Engineering is to create ultra-scalable and highly reliable distributed software systems.
The Principles of Site Reliability Engineering are:
Another concept that SRE uses is the ‘Error Budget’. If an SLO is breached, there has to be a consequence. For example, if there are 1 million transactions per month and we have a 99.9% SLO, up to 0.1% of them, that is 1,000 transactions, can fail in a month. This is the error budget. It means that new releases, patches, modifications, etc. may together cause at most 1,000 failed transactions. If failures exceed that number, we may need to stop new releases until the system is stable again.
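The arithmetic in this example can be written as a short Python sketch. The function names (`error_budget`, `releases_allowed`) are made up for illustration and do not come from any SRE tool; the numbers are the ones used in the example above.

```python
from fractions import Fraction
import math


def error_budget(total_transactions: int, slo: float) -> int:
    """Return how many transactions may fail in the period without breaching the SLO."""
    # Fraction(str(slo)) keeps 0.999 exact and avoids floating-point rounding.
    allowed_failure_ratio = 1 - Fraction(str(slo))
    # Round down so the budget is never overstated.
    return math.floor(total_transactions * allowed_failure_ratio)


def releases_allowed(total_transactions: int, slo: float, failed_transactions: int) -> bool:
    """Hypothetical policy: keep releasing only while failures stay within the budget."""
    return failed_transactions <= error_budget(total_transactions, slo)


if __name__ == "__main__":
    budget = error_budget(1_000_000, 0.999)
    print(f"Error budget: {budget} failed transactions per month")
    print("Releases allowed after 400 failures:", releases_allowed(1_000_000, 0.999, 400))
    print("Releases allowed after 1,001 failures:", releases_allowed(1_000_000, 0.999, 1001))
```

In practice the consequence of exhausting the budget is a team decision; the boolean here only stands in for "stop new releases until the system is stable".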
According to Catchpoint’s SRE Survey 2019, the most popular SLOs are:
|We don’t have SLOs|27%|
SREs spend 50% of their time on Operations work and 50% on Development work. Google also states that no one should spend more than 25% of their time on call.
Monitoring is important in SRE, but observability is more important. Externalizing all the outputs of a service allows us to infer its internal state, which makes the service observable. Observability lets us be proactive, whereas monitoring only tells us about an event after it has occurred.
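As a rough illustration of externalizing a service's outputs, the Python sketch below has a request handler return a structured event alongside its response. Everything in it is an assumption made for the example: the `checkout` service name, the `order_id` field, the `queue_depth` value and the event layout are hypothetical, not part of any particular observability product.

```python
import json
import time
from typing import Tuple


def handle_request(payload: dict, queue_depth: int) -> Tuple[dict, str]:
    """Handle one request and return (response, structured event line)."""
    start = time.perf_counter()

    # Hypothetical business rule: a request needs an order_id to be accepted.
    status = "ok" if "order_id" in payload else "rejected"
    response = {"status": status}

    # Externalize what happened so the internal state can be inferred from outside.
    event = {
        "service": "checkout",
        "event": "request_handled",
        "status": status,
        "queue_depth": queue_depth,
        "duration_ms": round((time.perf_counter() - start) * 1000, 3),
    }
    return response, json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    _, event_line = handle_request({"order_id": 42}, queue_depth=3)
    print(event_line)  # one JSON object per line, ready for a log pipeline
```

Because each event carries the status, queue depth and duration, someone looking at the stream can ask new questions about the service without changing its code.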
SRE requires automation. The following can be areas of automation driven by SRE:
Automation makes things consistent, testable, and easier to get production ready. It is also more secure and auditable, and it makes errors easier to recreate. The cost of change is lower and regression risk is reduced. Automated deployment is more secure and less vulnerable than manual deployment. Automation reduces dependency errors and helps identify vulnerabilities faster and more easily. It reduces MTTR and helps with protective monitoring. It also reduces toil and thus the Total Cost of Ownership. Various risks, such as those to availability and integrity, are mitigated.
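To see whether automation is actually bringing MTTR down, the metric has to be measured. The sketch below assumes MTTR means mean time to recovery and that each incident is recorded as a hypothetical (detected, resolved) pair of timestamps; the sample incidents are invented purely to show the calculation.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (detected_at, resolved_at)


def mean_time_to_recovery(incidents: List[Incident]) -> timedelta:
    """Average time between detecting an incident and resolving it."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so the durations add up as timedeltas
    )
    return total / len(incidents)


if __name__ == "__main__":
    # Illustrative incidents only, not real data.
    sample = [
        (datetime(2020, 1, 5, 10, 0), datetime(2020, 1, 5, 10, 30)),
        (datetime(2020, 1, 18, 22, 15), datetime(2020, 1, 18, 23, 45)),
    ]
    print("MTTR:", mean_time_to_recovery(sample))
```

Tracking this value per month, before and after an automation change, is one simple way to check the claim against a team's own data.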
SRE does not stand alone. It works alongside DevOps, Lean, IT Service Management, and Agile.
We also offer a Foundation Level SRE Certification Course. You can learn more about it here.