Posted on 08-Jun-2019 07:47:04
Chaos Engineering is a practice that started at Netflix, where the first Chaos Monkey was created sometime in 2010 and the Chaos Engineer role was created in 2014. The goal of Chaos Engineering is to improve confidence in services and infrastructure under turbulent real-life conditions, i.e., failure or stress events. The practice has you run planned experiments to identify system weaknesses and fix them before they surface in production and cause monetary or reputational damage to the business.
Chaos Engineering becomes very important in companies that run large distributed systems. Complexity has kept increasing with microservices architectures, which involve many services and components. Some companies run thousands of services, which makes it very difficult to manually monitor them, react to failures, or analyse failure scenarios. Reliability needs to be built into these services and the infrastructure.
When it comes to improving confidence in a system or infrastructure, a lot of people confuse the two main terms: availability and resiliency. They are two different things. A system or service can be made highly available by maintaining redundant components, but each component by itself also needs to be resilient to failure conditions. Otherwise every component may fail for the same reason, making the entire set of replicas unavailable. As a simple example, if the code cannot handle malformed messages, a malformed message may be retried by every replica in turn and crash all of them. PaaS platforms like OpenShift or PCF can automatically restart pods or containers for availability, but they cannot help with the resiliency of the app or service itself.
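To make the malformed-message example concrete, here is a minimal sketch of a consumer that sets bad messages aside instead of crashing on them. The `handle` function, the `order_id` field and the in-memory `dead_letters` list are hypothetical names chosen for illustration; a real service would typically use its message broker's dead-letter mechanism.

```python
import json


def handle(payload):
    """Hypothetical business logic: requires a dict with an 'order_id' field."""
    if not isinstance(payload, dict) or "order_id" not in payload:
        raise ValueError("missing order_id")


def consume(raw_messages, dead_letters):
    """Process each message; park malformed ones instead of crashing."""
    processed = 0
    for raw in raw_messages:
        try:
            handle(json.loads(raw))
            processed += 1
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            # A malformed message will never succeed on retry, so set it aside
            # where it can be inspected, rather than letting it crash replicas.
            dead_letters.append((raw, str(exc)))
    return processed


dead = []
count = consume(['{"order_id": 1}', "not json", '{"foo": 2}'], dead)
```

With this shape, one bad message costs one dead-letter entry rather than a crash loop across all replicas.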
Real-life failure events can be the loss of a downstream service, network failures, and so on. Stress situations can be a slow network or high CPU or memory usage on the server that hosts the services. To be resilient to these conditions, services need proper error handling, retry logic and timeouts, and they should use as little memory and CPU as possible. These conditions also mean that sufficient alerting and monitoring must be in place to detect them and trigger the necessary actions, automated in most cases and manual in some.
There may be certain infrastructure-level failures, such as a physical disk failure or an Ethernet device failure. In such conditions, availability through redundant components is the only solution. Chaos Engineering experiments do not involve failing the physical devices directly. Instead, such failures are simulated indirectly by making the affected server unavailable.
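As a rough illustration of the retry and timeout handling mentioned above, the sketch below retries a call with exponential backoff. The helper name `call_with_retry`, its parameters and the `flaky` stand-in for a downstream call are assumptions made for this example; the operation is assumed to enforce its own timeout and raise `TimeoutError` when it is exceeded.

```python
import time


def call_with_retry(operation, attempts=3, base_delay=0.01,
                    retry_on=(TimeoutError, ConnectionError)):
    """Call operation(); retry transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: surface the failure to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical downstream call that times out twice and then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("downstream too slow")
    return "ok"


result = call_with_retry(flaky)
```

Bounding the number of attempts matters as much as retrying: unbounded retries against a struggling downstream service can turn a slowdown into an outage.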
Now, how can these Chaos experiments be performed? At a high level, there are 4 main parts to running a Chaos experiment.
1. Hypothesize the steady state. You determine what your steady state is. For example, the hypothesis might be that the service continues to be available while the experiment runs.
2. Minimize the blast radius. You should not run the experiments across the entire infrastructure at the same time.
3. Vary real-world events, also called attacks in the Chaos Engineering world, some of which I have mentioned above. There are different ways to apply real-world events to the target services or servers. Many open source tools help apply these attacks, and some mechanisms can be built in-house. The attacks can be applied across on-premise VMs, PaaS and IaaS platforms.
4. Verify that the steady state continues to be met during the experiment. If it is not met, look at what caused the experiment to fail and also at whether sufficient alerting, monitoring and logging were in place.
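The four parts can be sketched as a small driver function. This is only an illustration under stated assumptions: `run_experiment`, its parameters and the simulated `servers` pool are hypothetical, and the "attack" here just marks a server as down in memory rather than using any real chaos tool.

```python
import random


def run_experiment(targets, steady_state, inject, rollback,
                   blast_fraction=0.1, seed=None):
    """Run one chaos experiment and report whether the steady state held."""
    # 1. Confirm the steady-state hypothesis holds before touching anything.
    if not steady_state():
        return {"status": "aborted", "attacked": []}
    # 2. Limit the blast radius to a small sample of the targets.
    size = max(1, int(len(targets) * blast_fraction))
    chosen = random.Random(seed).sample(targets, size)
    try:
        # 3. Apply the attack to the chosen targets only.
        for target in chosen:
            inject(target)
        # 4. Verify the steady state while the attack is active.
        held = steady_state()
    finally:
        # Always undo the attack, even if the verification step raised.
        for target in chosen:
            rollback(target)
    return {"status": "passed" if held else "failed", "attacked": chosen}


# Simulated pool: the hypothesis is that at least 3 of 4 servers stay healthy.
servers = ["app-1", "app-2", "app-3", "app-4"]
down = set()
report = run_experiment(
    servers,
    steady_state=lambda: len(servers) - len(down) >= 3,
    inject=down.add,
    rollback=down.discard,
    blast_fraction=0.25,
    seed=7,
)
```

Checking the steady state before injecting anything avoids blaming the experiment for a problem that already existed, and the `finally` block keeps a failed run from leaving targets in a broken state.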
Vishnu Vardhan Chikoti is a co-author of the book "Hands-on Site Reliability Engineering". He is a technology leader with diverse experience in the areas of Application and Database design and development, Micro-services & Micro-frontends, DevOps, Site Reliability Engineering and Machine Learning.