DevOps Institute

Leverage SRE to Build a Culture of Reliability, Resiliency and Risk Management

Updated January 19, 2023

By: Biswajit Mohapatra

Site Reliability Engineering (SRE) promotes a culture of building and operating reliable, resilient, risk-managed software systems. SRE looks at operations through the lens of software engineering practices.

Traditional software development models typically address reliability at the beginning of the design phase. As a result, changes to functionality at a later stage can undermine the reliability requirements considered earlier. Non-functional requirements are not reviewed as often, and Quality of Service (QoS) parameters are the most widely overlooked during the software development life cycle, which leads to operational issues later.

The goal is not merely to push software to production, but to run and manage it efficiently and effectively once it is live. SRE bridges this gap by leveraging a well-defined set of practices, principles, and culture built on DevOps foundations, with a strong emphasis on engineering capabilities.

SRE sets measurable engineering objectives that map to Service Level Objectives (SLOs) and enables monitoring and tracking of QoS parameters such as:

Reliability – Ability of the system to function correctly, with failure-free software operation
Availability – System response to disruption and fault tolerance; avoiding downtime through stateless application design and fail-forward database design
Recoverability – Ability of the system to recover from incidents through actionable alerts and next-gen automation
Serviceability – Speed with which the system can be repaired; system health assessment, monitoring and logging mechanisms, end-user experience
Elasticity – System scalability and performance with respect to data, traffic, peak load and response time
Resiliency – Ability of the system to withstand potential failure; focus on Mean Time to Repair (MTTR) over Mean Time Between Failures (MTBF)
Risk Budgeting – Ongoing process of risk measurement, attribution and allocation; optimal risk allocation to maximize expected return
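The emphasis on MTTR over MTBF in the list above can be illustrated with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The following is a minimal sketch rather than a prescribed method: the function name and the example figures (500 hours between failures, 4 hours to repair) are assumptions chosen for illustration only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the long-run fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical service: fails every 500 hours on average, takes 4 hours to repair.
baseline = availability(mtbf_hours=500, mttr_hours=4)

# Option A (illustrative): halve the repair time.
halved_mttr = availability(mtbf_hours=500, mttr_hours=2)

# Option B (illustrative): double the time between failures.
doubled_mtbf = availability(mtbf_hours=1000, mttr_hours=4)

print(f"baseline:     {baseline:.4%}")
print(f"halved MTTR:  {halved_mttr:.4%}")
print(f"doubled MTBF: {doubled_mtbf:.4%}")
```

Because availability depends only on the ratio of MTTR to MTBF, halving repair time yields the same availability as doubling the time between failures. Which of the two is cheaper to achieve depends on the system; the sketch only shows that faster recovery is an equally valid lever.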

SLO, SLI and SLA must exist and be measured

It’s of paramount importance to standardize SLOs, identify KPIs, create balanced scorecards and continuously drive measurement, monitoring and tracking. Measurable Service Level Indicators (SLIs) will determine the success or failure of a change in production. The error budget will act as an explicit, quantitative measurement parameter in your Service Level Agreement (SLA) that connects feature planning to service reliability.
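To make the error budget concrete, here is a small sketch of how one might be derived from an SLO and compared against a measured SLI. The function names, the 30-day window and the request counts are illustrative assumptions, not figures from any particular service.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability the SLO permits over the window."""
    if not 0 < slo < 1:
        raise ValueError("SLO must be a fraction strictly between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1 - slo) * total_events   # failures the SLO tolerates
    actual_bad = total_events - good_events  # failures actually observed
    return 1 - actual_bad / allowed_bad


# Hypothetical numbers: a 99.9% SLO over 30 days allows roughly 43.2 minutes
# of downtime.
budget = error_budget_minutes(0.999)

# Hypothetical measurement: 1,000,000 requests, 600 of which failed.
# The SLI is good/total; about 40% of the budget is still available.
remaining = budget_remaining(0.999, good_events=999_400, total_events=1_000_000)

print(f"budget: {budget:.1f} min, remaining: {remaining:.0%}")
```

A team could use the remaining fraction as an input to feature planning, for example by slowing releases as it approaches zero. The exact policy is a decision for each organization.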

The key question is how you balance change velocity against availability, reliability, security and other operational attributes. Implementing continuous integration, continuous testing, continuous delivery, and continuous release and deployment, coupled with collaboration, will drive the required cultural change. The system must recover from failure through automation.
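As one possible reading of recovery through automation, the sketch below retries a remediation step with exponential backoff and signals when a human should take over. The names auto_recover, is_healthy and remediate are hypothetical; a real system would plug in its own health check and restart or rollback logic.

```python
import time
from typing import Callable


def auto_recover(
    is_healthy: Callable[[], bool],
    remediate: Callable[[], None],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to restore health automatically.

    Returns True once the health check passes, or False when every attempt
    has been used and the incident should be escalated to a person.
    """
    for attempt in range(max_attempts):
        if is_healthy():
            return True
        remediate()                         # e.g. restart or roll back (assumed)
        sleep(base_delay_s * 2 ** attempt)  # back off: 1s, 2s, 4s, ...
    return is_healthy()
```

The sleep function is a parameter so that the loop can be exercised in tests without waiting. Capping the number of attempts keeps automation from masking a failure that needs human attention.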

Your SRE team needs to be responsible for the system design and development, release management, capacity management, change management, incident management, automation, availability, latency, performance, security and monitoring of its services.

SRE will deliver a differentiated value proposition for your digital reinvention journey. It provides fast and uninterrupted services through resilient systems, drives operational excellence and cost optimization through automation and best practices, and adopts risk management frameworks that address the risk tolerance of services. It also bridges the gap between development and operations teams, enabling them to communicate in terms of the cost of reliability.

Leveraging SRE to design, build, operate and enhance software systems is critical for the future of business. CIOs today are looking to SRE to strengthen their digital business foundation. Now is the time to build a culture around a risk-managed, reliable and resilient digital footprint, and SRE is at the heart of it.
