Site Reliability Engineering: Monitoring and Observability

By: Niladri Choudhuri

We have become accustomed to seeing developers build software without being overly concerned with monitoring, which is frequently assumed to be IT operations work. With DevOps initiatives, this view changes and developers also need to consider the operational aspects, starting from the design, through the CI/CD pipeline and extending to the post-deployment running of the service.

The Second Way of DevOps is about shortening and amplifying feedback loops: “shifting left”. This means that we want to get as much information as we can and use the ‘Wisdom of Production’ as early as possible in the software delivery lifecycle. To be able to achieve this, we need to consider the monitoring of the environment.

But the more data we collect, the greater the chance of noise. We need to gather as much data as possible, yet also be able to identify the pieces that matter most so we can make the best decisions. Monitoring is therefore no longer just about collecting data, metrics and event traces; it is about helping us understand the health of the system. In other words, making systems observable! Let’s now understand what monitoring and observability are.

As Peter Waterhouse of CA put it:

“Monitoring is a verb; something we perform against our applications and systems to determine their state. From basic fitness tests and whether they are up or down, to more proactive performance health checks. We monitor applications to detect problems and anomalies. As troubleshooters, we use it to find the root cause of problems and gain insights into capacity requirements and performance trends over time.”

Riverbed said in a blog post:

“Monitoring aims to provide a broad view of anything that can be generically measured based on what you can specify in external configuration, whether it is time-series metrics or call stack execution.”

Observability is the ability to infer the internal states of a system from the system’s external outputs. Charity Majors, CEO of honeycomb.io, explained observability on Twitter as:

“Observability, short and sweet:

– can you understand whatever internal state the system has gotten itself into?

…just by inspecting and interrogating its output?

…even if (especially if) you have never seen it happen before?”

This means that observability is a measure of how well we can understand the internal states of a system from the knowledge of the external outputs. The point is that we get to understand things that have never happened, the unknown-unknowns.

As Peter Waterhouse went on to say:

“Observability is about how well internal states of a system can be inferred from knowledge of external outputs. So, in contrast to monitoring which is something we do, observability (as a noun), is more a property of a system.”

With cloud-native platforms, containers and microservice architectures, the old ways of monitoring do not scale. We need tools that give us a better understanding of an application’s inner workings and performance across distributed systems and the CI/CD pipeline.

As mentioned by Charity Majors:

“Observability requires methodical, iterative exploration of the evidence. You can’t just use your gut and a dashboard and leap to a conclusion. The system is too complicated, too messy, too unpredictable. Your gut remembers what caused yesterday’s outages, it cannot predict the cause of tomorrow’s.”

Tooling is an important aspect of observability. Many tools help us externalize the key events of an application through logs, metrics and events; tracing is one such example. We can use Kubernetes to activate metrics capture and analysis during a containerized application deployment to gain observability. Many open-source tools are available for implementing observability, such as collectd, StatsD, Fluentd, Zipkin, Jaeger, OpenTracing, OpenTelemetry and Semantic Logger.
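To illustrate the kind of telemetry tracing tools capture, here is a minimal, hand-rolled trace-span sketch in Python. The span structure, field names and the `checkout`/`charge-card` operations are illustrative assumptions, not any particular tool’s format; real tools export spans to a collector rather than printing them.

```python
import json
import time
import uuid
from contextlib import contextmanager

@contextmanager
def span(name, trace_id=None, parent_id=None):
    """Record a trace span: a named, timed unit of work whose IDs
    let a tracing backend stitch spans into a call tree."""
    record = {
        "name": name,
        "trace_id": trace_id or uuid.uuid4().hex,  # new trace if this is a root span
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "start": time.time(),
    }
    try:
        yield record
    finally:
        record["duration_ms"] = (time.time() - record["start"]) * 1000
        print(json.dumps(record))  # in practice, export to a collector instead

# Hypothetical operations: a checkout that calls a payment step.
with span("checkout") as parent:
    with span("charge-card", parent["trace_id"], parent["span_id"]):
        time.sleep(0.01)  # stand-in for real work
```

Because the child span carries its parent’s `trace_id` and `span_id`, a backend can reconstruct the call tree even when the spans come from different services.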

Observability means we need to watch all application components, from mobile and web front-ends to infrastructure. This data used to come from various separate sources. Now our systems are more complex, and we need the application and code to be architected to enable telemetry and observability.

It is also important to look at the human aspect. People need to use this information in designing, developing and testing their applications, and it has to be used in the right context. We need modern monitoring methods built into the deployment pipeline with minimum complexity. An example from Peter Waterhouse’s blog: we could increase observability by combining server-side application performance visibility with client-side response time analysis during load testing – a neat way of pinpointing a problem’s root cause as we test at scale.
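To make the client-side half of that concrete, here is a minimal load-test sketch that drives simulated requests and summarises response times, which could then be lined up against server-side performance data for the same test run. The `fake_request` function and its latency distribution are invented for illustration.

```python
import random
import statistics

def fake_request():
    """Stand-in for a timed HTTP call; returns client-side latency in ms."""
    return max(random.gauss(120, 25), 0)  # assumed ~120ms mean, never negative

def load_test(n=1000):
    """Run n simulated requests and summarise the latency distribution."""
    samples = sorted(fake_request() for _ in range(n))
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * n) - 1],
        "max_ms": samples[-1],
    }

summary = load_test()
print(summary)
```

Percentiles rather than averages matter here: a healthy median can hide a slow tail, and it is usually the p95/p99 experience that load testing at scale is meant to expose.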

We need to understand the objective and then have the team implement the objective with the relevant tools and use the information to deliver the service in the best possible way.

Observability is important since:

  • Services are growing rapidly
  • Architectures used today are more dynamic
  • Dependencies between services are complex
  • We seek to improve CX – customer experience

We need to use Service Level Objectives (SLOs) and Service Level Indicators (SLIs) together with observability to optimize the performance and reliability of our products. Here’s an example from the DevOps Institute’s SRE Foundation course:

  • SLOs are from a user perspective and help identify what is important
  • E.g. 90% of users should complete the full payment transaction in less than one elapsed minute
  • SLIs give detail on how we are currently performing
  • E.g. 98% of users in a month complete a payment transaction in less than one minute
  • Observability gives us a view of the normal state of the service
  • E.g. 38 seconds is the “normal” time it takes users to complete a payment transaction when all monitors are healthy
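The SLO/SLI arithmetic above can be sketched directly. The target and threshold mirror the example; the per-transaction duration samples are invented for illustration.

```python
SLO_TARGET = 0.90        # SLO: 90% of payments under the threshold
THRESHOLD_SECONDS = 60   # "less than one elapsed minute"

# Hypothetical per-transaction completion times, in seconds.
durations = [38, 42, 35, 71, 40, 39, 55, 44, 36, 41]

# SLI: the fraction of transactions meeting the threshold.
sli = sum(d < THRESHOLD_SECONDS for d in durations) / len(durations)

print(f"SLI: {sli:.0%}, SLO met: {sli >= SLO_TARGET}")  # → SLI: 90%, SLO met: True
```

In practice the observability data feeds the `durations` stream, and tracking the SLI against the SLO over a rolling window tells us how much error budget remains.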

To find out more about the SRE Foundation course, click here.
