What is Chaos Engineering?

Let’s start with the definition from the Chaos community: “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” Chaos engineering is being adopted by top DevOps and IT teams at companies such as Netflix and Amazon.

Which Chaos Engineering experiments to perform first?

  • Known Knowns – Things you are aware of and understand
  • Known Unknowns – Things you are aware of but don’t fully understand
  • Unknown Knowns – Things you understand but are not aware of
  • Unknown Unknowns – Things you are neither aware of nor fully understand

The diagram below illustrates this concept:

5 steps to uncover systemic weaknesses

Even when all of the individual services in a distributed system are functioning properly, the interactions between those services can cause unpredictable outcomes. Chaos Engineering can be thought of as the facilitation of experiments to uncover systemic weaknesses. These experiments follow 5 steps:

  1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
  2. Hypothesize that this steady state will continue in both the control group and the experimental group.
  3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
  4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
  5. Fix: Chaos engineering is all about learning from failure. That’s why you create failures on purpose, in a controlled and managed environment, and then fix the weaknesses they reveal.
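As a rough illustration of the steps above, here is a minimal Python sketch that simulates an experiment instead of touching a real system. Everything in it is an assumption made for illustration: the steady-state metric is the request success rate, the "fault" is simply a higher failure probability, and the names `simulate_requests`, `run_experiment`, and `tolerance` are hypothetical rather than taken from any chaos engineering tool.

```python
import random


def simulate_requests(n, failure_rate, rng):
    """Return the fraction of n simulated requests that succeed."""
    successes = sum(1 for _ in range(n) if rng.random() >= failure_rate)
    return successes / n


def run_experiment(baseline_failure_rate, injected_failure_rate,
                   tolerance=0.02, n=10_000, seed=0):
    """Compare a control group against a group with an injected fault."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    # Step 1: steady state is the success rate of normal traffic.
    # Step 2: the hypothesis is that both groups stay within `tolerance`.
    control = simulate_requests(n, baseline_failure_rate, rng)

    # Step 3: introduce a variable (here, extra failures) in one group only.
    experimental = simulate_requests(
        n, baseline_failure_rate + injected_failure_rate, rng)

    # Step 4: try to disprove the hypothesis by comparing the two groups.
    deviation = control - experimental
    return {
        "control": control,
        "experimental": experimental,
        "deviation": deviation,
        "hypothesis_holds": abs(deviation) <= tolerance,
    }


if __name__ == "__main__":
    # Step 5 happens outside the code: a failed hypothesis points at
    # a weakness to fix before rerunning the experiment.
    print(run_experiment(baseline_failure_rate=0.01,
                         injected_failure_rate=0.20))
```

In a real experiment the two groups would be slices of live or staging traffic and the metric would come from monitoring, but the comparison logic stays the same.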

Chaos Engineering Tools

Scope experiments carefully so that they stay low-risk: involve few users, limit the user flows affected, limit the number of live devices, etc.
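One common way to keep an experiment to a small number of users is to place each user in a stable bucket and expose only a small percentage of buckets to the fault. The sketch below shows that idea under stated assumptions: the function name `in_blast_radius`, the experiment label, and the 1% figure are hypothetical, and none of the tools listed below necessarily select targets this way.

```python
import hashlib


def in_blast_radius(user_id: str, experiment: str, percent: float) -> bool:
    """Return True if this user falls inside the experiment's small cohort.

    Hashing the experiment name together with the user id gives each user
    a stable position in [0, 1), so the same user is always in or out.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < percent / 100.0


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    # Hypothetical experiment name; roughly 1% of users are selected.
    affected = [u for u in users if in_blast_radius(u, "latency-test", 1.0)]
    print(f"{len(affected)} of {len(users)} users are in the experiment")
```

Because the selection is deterministic, the affected cohort can be listed in advance and the experiment can be stopped for exactly those users if something goes wrong.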

The following is a list of tools to get you started:

Chaos Monkey: The OG of chaos engineering. The tool is still maintained and currently integrated into Spinnaker, a continuous delivery platform developed initially by Netflix to release software changes rapidly and reliably.

Platform: Spinnaker

Release year: 2012

Creator: Netflix

Language: Go

 

Mangle: Lets you run chaos engineering experiments against applications and infrastructure components and quickly assess resiliency and fault tolerance. It is designed to introduce faults with minimal pre-configuration and supports a wide range of tooling, including K8S, Docker, vCenter, or any remote machine with SSH enabled.

Platforms: Docker, Kubernetes, vCenter, remote machines over SSH

Release year: 2019

Creator: VMware

Language: Java

 

AWS Fault Injection Simulator: A fully managed service for running fault injection experiments on AWS, intended to make it easier to improve an application’s performance, observability, and resiliency.

Works with: Amazon Relational Database Service (RDS), Elastic Compute Cloud (EC2), Elastic Container Service (ECS), and Elastic Kubernetes Service (EKS)

Release year: 2021

Creator: Amazon Web Services

 

ChaosBlade: Built on nearly ten years of failure testing at Alibaba. It supports a wide range of platforms, including Kubernetes, cloud platforms, and bare-metal, and provides dozens of attacks, including packet loss, process killing, and resource consumption. It also supports application-level fault injection for Java, C++, and Node.js applications, covering arbitrary code injection, delayed code execution, and modification of memory values.

Because chaos engineering is an experimental approach, it gives us a holistic view of how the system behaves and how all of its moving parts interact in a given set of circumstances. That lets us derive insights into both the system’s technical aspects and its soft aspects (the human factor). Chaos engineering can also help organizations find security vulnerabilities that are hard to detect with traditional methods because of the complex nature of distributed systems, including weaknesses caused by human factors, poor design, or lack of resiliency.