Chaos Engineering: Journey to Resilience in Your System
- Daria Kotelenets
- Nov 17, 2023
- 6 min read
Many of you have probably heard of Murphy's law: if anything can go wrong, it will go wrong, and at the worst possible time. Werner Vogels, VP & CTO at Amazon, slightly rephrased this when he said:
"Everything fails all the time, so plan for failure and nothing fails"
That is essentially what chaos engineering is all about—building resilient systems. Before digging into chaos engineering, let's first define what system resilience means. Yacov Haimes defined resilience as
a system's capability to withstand a major disruption within acceptable degradation parameters and recover within an acceptable time, all while managing composite costs and risks.
Take a moment to assess how resilient your system is right now on a scale of 1 to 10. If your answer is 7 or less, you can probably benefit from chaos engineering.
Chaos engineering for toddlers
I have a toddler, and the way he learns about this world is through experiments: with his toys, with things that are not allowed, with his body, with my nervous system, and so on. So, let's imagine that you are a 3-year-old child. You have a toy: a big tower made of building blocks. Now, chaos engineering is a bit like a fun game where you purposely try to make the tower fall over, but you do it safely. You might tap some of the blocks gently to see which ones are strong and which ones might need more support. It's like a science experiment with your tower. You're learning about what makes it strong and what can make it fall, but you're doing it purposefully to better understand how it works.
Chaos engineering is not about randomly breaking things; it is about anticipating what-if scenarios, about thoughtful, controlled experiments that help us know our system better. You create disruptive events that stress your system, such as injecting delays or taking down a component, and observe what happens. Your goal is to prove or disprove assumptions and to find weaknesses before they surface at the least expected moment.
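To make this concrete, here is a minimal sketch in Python of injecting a delay around a dependency call so you can observe how the caller copes; `call_downstream` is a hypothetical stand-in for your real client code, and the probability and delay are just illustrative values:

```python
import random
import time
from functools import wraps

def inject_latency(probability=0.2, delay_seconds=1.5):
    """Wrap a call and add an artificial delay to a fraction of invocations."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                # Simulated fault: the downstream dependency responds slowly.
                time.sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(probability=0.2, delay_seconds=1.5)
def call_downstream(payload):
    # Hypothetical dependency call; replace with your real client code.
    ...
```

Run against a small slice of traffic, an experiment like this quickly shows whether your timeouts, retries, and fallbacks actually behave the way you assume they do.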
Turn OFF the light!
Chaos Engineering helps us deal with crises. People handle crises in different ways, but the body's stress response remains largely the same: cortisol and adrenaline levels increase, heart rate and blood pressure rise. This physiological reaction is often referred to as the 'fight or flight' response, and as a result, we sometimes make less than optimal decisions. Therefore, it is crucial to be prepared and to train your team on how to deal with crises. As someone who experienced daily 20-hour-long blackouts in Ukraine, I find the strategy of 'Turning off the Light' very symbolic in explaining the goal of chaos engineering. To prepare for blackouts, you can read about what to do, implement specific measures, and buy power banks, for instance. However, the most effective preparedness strategy is to simulate a 24-hour blackout. This way, you can identify all the bottlenecks and become mentally prepared. Simulate your next crisis instead of waiting for it, just as a pilot cannot land a plane in bad weather without simulator training.
Process overview
Chaos Engineering is like an anti-stress vaccine: we inject harm to build immunity to outages. According to the 2021 State of Chaos Engineering report, the most common outcomes of Chaos Engineering are:
Lower mean time to resolution (MTTR)
Lower mean time to detection (MTTD)
Fewer bugs shipped to production
Fewer outages
Teams who frequently run Chaos Engineering experiments are more likely to have >99.9% availability.
I see Chaos Engineering as a natural extension of quality assurance, much like vulnerability assessment or performance testing. Ultimately, this requires collaborative effort among the various engineers within the team, all working toward the common goal of ensuring system stability. This is why it's crucial to train our engineering team to navigate turbulent conditions, and Chaos Engineering is a tool that can help us in achieving this.
The process of executing experiments is pretty standard, especially for those who have experience with A/B tests. However, it's worth reviewing.
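In code form, the loop looks roughly like the following sketch; `measure_steady_state`, `inject_fault`, and `rollback` are placeholders for your own metric queries and fault tooling, not a specific framework:

```python
def run_chaos_experiment(hypothesis, measure_steady_state, inject_fault, rollback,
                         tolerance):
    """Minimal chaos-experiment loop: baseline, inject, observe, compare, learn."""
    print(f"Hypothesis: {hypothesis}")        # 1. State what you expect to stay true
    baseline = measure_steady_state()         # 2. Measure steady state before the fault
    try:
        inject_fault()                        # 3. Introduce the disruptive event
        observed = measure_steady_state()     # 4. Observe the system under stress
    finally:
        rollback()                            # Always undo the fault, even on errors
    deviation = abs(observed - baseline)
    if deviation <= tolerance:                # 5. Verify or disprove the hypothesis
        print(f"Hypothesis held: deviation {deviation:.3f} is within tolerance")
    else:
        print(f"Hypothesis disproved: deviation {deviation:.3f}; investigate and fix")
```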

It's important to mention that not everything can be fixed. The decision is always a balance between cost and assessed risk. Perhaps the cost of a fix is so high that implementing a recovery plan is the better course of action for now.
Examples
As highlighted earlier, chaos experiments are conducted in a controlled environment and with specific objectives in mind, allowing teams to learn from failures and make the necessary improvements before they become critical issues in a production environment. So let's review several examples:
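As one illustration (a hypothetical experiment; the service name, thresholds, and blast radius are invented for the example), it helps to write each experiment down with its hypothesis, fault, and abort condition made explicit:

```python
# A hypothetical experiment definition; names and thresholds are placeholders.
checkout_instance_loss = {
    "hypothesis": "If one checkout-service instance is terminated, p99 latency "
                  "stays below 500 ms and no orders are lost.",
    "steady_state": {"metric": "checkout_p99_latency_ms", "threshold": 500},
    "fault": {"action": "terminate-instance", "target": "checkout-service", "count": 1},
    "blast_radius": "1 instance out of 6, canary traffic only",
    "abort_when": "error rate > 2% for 5 minutes",
    "rollback": "auto-scaling group replaces the terminated instance",
}
```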

The logical question that might arise is how chaos engineering differs from regular negative testing. Chaos engineering is a deep dive into the complex world of unknowns. During negative tests, you can predict what might go wrong. The primary goal of negative testing is to identify defects in specific functionalities or components, while chaos experiments are primarily conducted to test system resilience and improve overall system reliability. Negative tests are often designed using appropriate test design techniques and are integrated into your software development life cycle (SDLC). On the other hand, chaos experiments are frequently carried out in a production environment and require a robust monitoring and alerting system, as well as effective incident response processes. This leads us to the next section about environments and tools.
Environments and Tools
Production is the ideal environment for conducting chaos experiments. Other environments may lead to false-negative or false-positive results due to factors such as scale, traffic, infrastructure setup, and more. However, there are cases where tests can be executed in preproduction environments, which can provide the necessary results. Some companies cannot afford the level of risk associated with production testing when the cost of failure is too high. Therefore, the choice of environment depends on the context in which your company operates and its risk tolerance.
One of the safest ways to inject failure into the environment is by using the canary deployment pattern. With this approach, you gradually roll out changes to a small subset of users and expand the rollout if no issues are detected. Additionally, as part of a shift-left testing strategy, some tests can be integrated into CI/CD pipelines to ensure a baseline level of reliability. After your application has passed the initial build and automated tests in the CI/CD pipeline, you can introduce a chaos stage that simulates scenarios like network latency between microservices or components.
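A chaos stage of that kind could look like the following sketch, assuming a Python test runner and a hypothetical health endpoint on the canary; the URL and the latency SLO are placeholders, and the latency fault itself is assumed to be enabled elsewhere in the pipeline:

```python
import statistics
import time

import requests  # assumed to be available in the pipeline image

CANARY_URL = "https://canary.example.com/health"   # placeholder endpoint
LATENCY_SLO_MS = 300                               # placeholder SLO for this check

def measure_latency_ms(samples=20):
    """Call the canary repeatedly and collect response times in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        response = requests.get(CANARY_URL, timeout=5)
        response.raise_for_status()
        timings.append((time.monotonic() - start) * 1000)
    return timings

def test_canary_meets_latency_slo_under_injected_delay():
    # Assumes the pipeline has already enabled a latency fault on one downstream
    # dependency of the canary (for example via a fault-injection proxy).
    timings = measure_latency_ms()
    p95 = statistics.quantiles(timings, n=20)[18]  # rough 95th percentile
    assert p95 <= LATENCY_SLO_MS, f"p95 {p95:.0f} ms exceeds SLO {LATENCY_SLO_MS} ms"
```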
At the early stages of a project, it's acceptable to conduct some manual experiments. However, as the project progresses, it's generally better to transition to 'automated chaos' testing to assess reliability during the deployment process. This is why chaos engineering often relies on specialized tools like Azure Chaos Studio or AWS Fault Injection Simulator. For instance, AWS Fault Injection Simulator (FIS) is a fully managed service for running such experiments. You can create an experiment state machine as a graph, integrate it with your delivery pipeline, and automatically roll back or stop the experiment if specific conditions are met.
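As a minimal sketch of driving FIS from a pipeline with boto3 (the template ID is a placeholder, the template itself, including its CloudWatch-alarm stop conditions, is assumed to exist already, and parameter names are worth double-checking against the current FIS API reference):

```python
import time
import uuid

import boto3

fis = boto3.client("fis")

# Placeholder ID of an experiment template created beforehand; the template
# defines the fault actions, targets, and CloudWatch-alarm stop conditions.
TEMPLATE_ID = "EXT1a2b3c4d5e6f7"

def run_fis_experiment(template_id):
    """Start a FIS experiment and poll until it reaches a terminal state."""
    started = fis.start_experiment(
        clientToken=str(uuid.uuid4()),        # idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    while True:
        state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] in ("completed", "stopped", "failed"):
            return state["status"]
        time.sleep(30)

# In a delivery pipeline you might fail the build on anything but "completed":
# assert run_fis_experiment(TEMPLATE_ID) == "completed"
```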
Challenges
Implementing changes like this is no simple task, so let's review the key challenges that teams usually face.
A weak culture of quality and of learning from failures. People often prefer to focus on positives rather than on failures, tend to prioritize individual work over collaboration, and often find it easier to blame than to show empathy. Chaos engineering is, first of all, a shift in mindset.
Lack of strong observability. Can you actually see what is happening with your system, so you can understand whether it's performing well or poorly? Observability is a key component of chaos engineering; without it, you cannot detect problems.
Poor knowledge of system architecture. Those planning experiments need to have a deep understanding of the system architecture. They should be aware of the system's weakest points, such as single points of failure, and understand the cost of potential failures.
Lack of understanding of what 'good' looks like and what to observe (you need SLAs/SLOs that make sense; see the worked error-budget example after this list). 'Good' doesn't mean everything is perfect; it's an empirical process that involves experimentation.
Shared environments. Collaboration with teams responsible for other areas of the application is essential because dependencies always exist and represent potential risks.
The absence of incident management, making it impossible to handle alerts in case experiments go wrong. Do you have a disaster recovery plan, backup strategy, and incident response procedures in place?
The difficulty of convincing management of the need to take calculated risks and run experiments that might potentially impact the production system.
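To make 'good' measurable, it helps to translate each SLO into an error budget. For example, a 99.9% availability SLO over a 30-day month (43,200 minutes) leaves an error budget of 0.1%, roughly 43 minutes of allowed downtime; if a chaos experiment burns a visible share of that budget, you know exactly which weakness to fix first.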
As you can see, chaos engineering is not a project; it is a habit that enhances our ability to constantly innovate. It is not a silver bullet, and it doesn't replace other approaches to ensuring resilience. However, it can help us learn things that we wouldn't be able to learn any other way. It's better to start sooner rather than later, so that system reliability doesn't become your technical debt.
A few last tips…
Logic will take you from A to B. Imagination will take you everywhere. - Albert Einstein.
Practice long-term thinking, intentionally considering what may happen in the future. Pretend that you are launching a spacecraft. Once it is in space, there is not much that you can change.
Apply system thinking, consider a system in its entirety, and aim to understand its complexities.

Think big - act small. Set ambitious, high-level goals, but implement them starting with smaller, well-defined experiments that are easy to manage and then add complexity with every next iteration.
Step outside your comfort zone to deal with unknowns.
Be ready for the snowball effect. Shit happens: sometimes a small, initially controlled issue or disruption introduced as part of a chaos experiment escalates into a larger and more widespread problem.
Don’t apply chaos to chaos!