The Practice of Chaos Engineering Observability

By: Biswajit Mohapatra

Software systems are evolving rapidly and growing more complex over time. Complexity across architecture, applications, infrastructure and storage keeps increasing, making systems more prone to failure. Modern distributed systems have many unpredictable failure scenarios, and it is extremely difficult to monitor every failure point.

Monitoring is the process of checking the behaviour of a system to ensure everything is functioning as expected. However, monitoring alone is not enough for modern systems with complex integration points and interfaces across operating systems, Kubernetes layers and application stacks. This has given rise to observability as a discipline built on three pillars: logs, metrics and traces. Observability is the property of a system that helps us understand what is going on inside it and gather the information needed to troubleshoot. It lets us infer the internal state of a system from knowledge of its external outputs. A good observability solution should be able to externalize data and have additional learning embedded in it. Sometimes we put no effort into fixing a problem simply because we don’t know the problem exists. Observability is about understanding a system’s failure modes and how data and insights can be leveraged to iterate on and improve the system. Correlation between logs, metrics and traces, coupled with index-free log aggregation and data-driven insights, is poised to drive the success of observability solutions in the future.
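To make the idea of correlating signals concrete, here is a minimal sketch that groups log records and trace spans by a shared identifier. The field names (trace_id, message, name, duration_ms) and the correlate function are assumptions made for illustration; real observability tools define their own schemas.

```python
from collections import defaultdict


def correlate(logs, spans):
    """Group log messages and trace spans that share the same trace_id."""
    by_trace = defaultdict(lambda: {"logs": [], "spans": []})
    for record in logs:
        by_trace[record["trace_id"]]["logs"].append(record["message"])
    for span in spans:
        by_trace[span["trace_id"]]["spans"].append(
            (span["name"], span["duration_ms"])
        )
    return dict(by_trace)


# Hypothetical sample data: two requests, one of which failed
logs = [
    {"trace_id": "a1", "message": "payment accepted"},
    {"trace_id": "b2", "message": "timeout calling inventory"},
]
spans = [
    {"trace_id": "a1", "name": "POST /pay", "duration_ms": 42},
    {"trace_id": "b2", "name": "GET /inventory", "duration_ms": 5000},
]

correlated = correlate(logs, spans)
```

With the signals joined this way, the slow span and the timeout message for request b2 can be read side by side instead of being searched for in separate tools.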

Observability and Chaos Engineering

Chaos engineering is the practice of running controlled experiments to uncover weaknesses in a system. Crash testing your systems in a simulated environment helps identify failure modes so that corrective measures can be taken. The goal is to identify and address issues proactively, before they reach your users. This can be achieved by hypothesizing normal steady-state behaviour and then creating failure modes that challenge the hypothesis; modelling system failures to improve resiliency; simulating production load; injecting faults; rolling out changes in a controlled way using canary deployments; varying real-world scenarios by simulating hardware failure, malformed responses within the ecosystem and sudden spikes in traffic to check scalability; and reducing the blast radius to contain and minimize the impact of experiments.

The chaos engineering workflow comprises the following steps:

(1)  Plan the experiment by forming a hypothesis around steady-state behaviour

(2)  Initiate an attack that is small enough to contain, yet still shows how the system reacts

(3)  Measure the impact by comparing against steady-state metrics

(4)  If an issue is detected, cut off the attack and fix the issue

(5)  If no issue is detected, scale the attack until issues are observed

(6)  Learn, make improvements and automate experiments to run continuously
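The steps above can be sketched as a small loop. This is an illustration under stated assumptions, not a reference implementation: run_experiment, ExperimentResult and the measure hook are hypothetical names, and the "attack" is a toy function rather than real fault injection.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    magnitude: int             # attack size at which the run stopped
    issue_detected: bool       # True if the steady-state hypothesis broke
    observations: List[float]  # metric observed at each attack size


def run_experiment(
    measure: Callable[[int], float],
    steady_state: float,
    tolerance: float,
    magnitudes: List[int],
) -> ExperimentResult:
    """Scale an attack step by step until the hypothesis is violated.

    measure(magnitude) is a hypothetical hook that injects a fault of the
    given size and returns the observed metric (for example, latency in ms).
    """
    if not magnitudes:
        raise ValueError("at least one attack magnitude is required")
    observations: List[float] = []
    for magnitude in magnitudes:                      # steps (2) and (5)
        observed = measure(magnitude)                 # step (3)
        observations.append(observed)
        if abs(observed - steady_state) > tolerance:
            # Step (4): stop scaling; the caller cuts off the attack
            # and fixes the issue before running again.
            return ExperimentResult(magnitude, True, observations)
    return ExperimentResult(magnitudes[-1], False, observations)


def fake_measure(magnitude: int) -> float:
    """Toy system: latency grows linearly with the injected fault size."""
    return 100.0 + 15.0 * magnitude


# Step (1): hypothesis is "latency stays within 50 ms of 100 ms"
result = run_experiment(
    fake_measure, steady_state=100.0, tolerance=50.0, magnitudes=[1, 2, 4, 8]
)
```

In this toy run the hypothesis holds at magnitudes 1 and 2 and breaks at 4, so the loop stops there and never reaches 8. Step (6) would correspond to scheduling this function to run repeatedly, which is left out of the sketch.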

Chaos engineering introduces real-time failures into systems to assess their ability to tolerate failure, along with their recoverability, resiliency and availability. By designing chaos engineering experiments, you can learn about weaknesses in the system that could lead to failures. These weaknesses can then be addressed proactively, going beyond the reactive process that currently dominates most incident response models. However, it’s important not to rush into chaos engineering without properly planning and designing the experiments. Every chaos experiment should begin with a hypothesis. The test should have a small scope and a focused team working on it. Every organization should aim for controlled chaos, supported by observability, to improve system resiliency.

Chaos engineering leverages observability to discover and overcome system weaknesses. Without observability, there is no chaos engineering. Organizations need to focus on building a culture of observing systems. It’s no longer only about writing code; it’s about adopting processes that build resilient systems. Introducing continuous chaos into your DevOps CI/CD pipeline helps automate experiments and failure testing, so that issues can be detected, debugged and fixed more proactively. The practice of chaos engineering observability will improve confidence in the system, enable faster deployments, help prioritize business KPIs and drive auto-healing of systems. AI/ML can help build observability patterns and antipatterns by closely monitoring system and user behaviour over a period of time. Hypotheses developed from these patterns and antipatterns can help systems heal automatically.

The industry has started to recognize the differentiated value that the practice of chaos engineering observability provides. It is helping to address many unknown facets of system unpredictability. Chaos engineering experiments, coupled with cognitive observability studies of complex systems using trend analysis, regression analysis and time series analysis, will help take systems to new heights in the near future.
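As one simple stand-in for the kinds of analysis mentioned above, the sketch below flags metric samples taken during an experiment that deviate strongly from a baseline window, using a z-score. This is a deliberately basic technique chosen for illustration, not a full trend or regression analysis; the function name, threshold and sample values are assumptions.

```python
from statistics import mean, pstdev


def deviates_from_baseline(baseline, experiment, z_threshold=3.0):
    """Return one flag per experiment sample: True if it is an outlier.

    A sample is an outlier when it lies more than z_threshold standard
    deviations from the baseline mean.
    """
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation
        return [x != mu for x in experiment]
    return [abs(x - mu) / sigma > z_threshold for x in experiment]


# Hypothetical latency samples in ms: steady state, then during an attack
baseline = [100, 102, 98, 101, 99]
experiment = [101, 140, 100]

flags = deviates_from_baseline(baseline, experiment)
```

Here only the 140 ms sample is flagged. In practice the baseline window, the threshold and the choice of metric would all need tuning for the system under test.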

