What is Chaos Engineering?
Let’s start with the definition from the Chaos community:
“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
4 steps to uncover systemic weaknesses
Which Chaos Engineering experiments to perform first?
- Known Knowns – Things you are aware of and understand
- Known Unknowns – Things you are aware of but don’t fully understand
- Unknown Knowns – Things you understand but are not aware of
- Unknown Unknowns – Things you are neither aware of nor fully understand
The diagram below illustrates this concept:
Even when all of the individual services in a distributed system are functioning properly, the interactions between those services can cause unpredictable outcomes. Chaos Engineering can be thought of as the facilitation of experiments to uncover systemic weaknesses. These experiments follow four steps, plus a fifth phase for acting on what they reveal:
- Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
- Hypothesize that this steady state will continue in both the control group and the experimental group.
- Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
- Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
- Fixes: Chaos engineering is all about learning from failure, which is why you create failures on purpose, in a controlled and managed environment. Fix the weaknesses that each experiment reveals.
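The phases above can be sketched as a small simulation. This is only an illustration under stated assumptions, not a real chaos tool: `handle_request`, the success rates, and the `tolerance` threshold are all hypothetical values chosen for the example. The steady state is the request success rate, the control group runs with a healthy dependency, and the experimental group runs with the dependency down.

```python
import random


def handle_request(rng: random.Random, dependency_down: bool) -> bool:
    """Hypothetical service call that returns True on success.

    The success probabilities are assumptions made for this sketch.
    """
    if dependency_down:
        return rng.random() < 0.90  # assumed success rate when relying on a fallback
    return rng.random() < 0.99      # assumed success rate under normal conditions


def success_rate(rng: random.Random, requests: int, dependency_down: bool) -> float:
    """Steady-state metric: the fraction of requests that succeed."""
    ok = sum(handle_request(rng, dependency_down) for _ in range(requests))
    return ok / requests


def run_experiment(seed: int = 42, requests: int = 10_000, tolerance: float = 0.02) -> dict:
    """Compare the steady state of a control group and an experimental group."""
    rng = random.Random(seed)
    # Control group: no fault injected.
    control = success_rate(rng, requests, dependency_down=False)
    # Experimental group: the fault (a failed dependency) is injected.
    experiment = success_rate(rng, requests, dependency_down=True)
    deviation = control - experiment
    return {
        "control": control,
        "experiment": experiment,
        "deviation": deviation,
        # The hypothesis holds only if the steady state barely moves.
        "hypothesis_holds": abs(deviation) <= tolerance,
    }


if __name__ == "__main__":
    print(run_experiment())
```

With these assumed rates the hypothesis is disproved, which is the useful outcome: the gap between the two groups points at a weakness (here, an insufficient fallback) to fix before it shows up in production.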
Chaos Engineering Tools
Scope experiments carefully so that they stay low-risk: involve few users, limit user flows, limit the number of live devices, and so on.
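One way to keep an experiment to a few users is to assign each user to it deterministically and cap the share that is affected. The sketch below assumes users are identified by a string ID. The function name `in_experiment` and the experiment label are hypothetical and do not come from any of the tools listed here.

```python
import hashlib


def in_experiment(user_id: str, experiment: str, percent: float) -> bool:
    """Return True if this user falls inside the experiment's blast radius.

    Hashing the experiment name together with the user ID gives a stable
    bucket in the range 0..9999, so the same user always gets the same answer.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # percent is expressed in percentage points, e.g. 1.0 means 1% of users.
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(100_000)]
    affected = sum(in_experiment(u, "db-latency", 1.0) for u in users)
    print(f"{affected} of {len(users)} users are in the experiment")
```

Because the assignment is a pure function of the IDs, setting `percent` to 0 acts as an immediate off switch, and raising it gradually widens the experiment without reshuffling the users already included.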
The following is a list of tools to get you started:
Chaos Monkey: The OG of chaos engineering. The tool is still maintained and currently integrated into Spinnaker, a continuous delivery platform developed initially by Netflix to release software changes rapidly and reliably.
Platform: Spinnaker
Release year: 2012
Creator: Netflix
Language: Go
Mangle: Lets you run chaos engineering experiments against applications and infrastructure components and quickly assess resiliency and fault tolerance. It is designed to introduce faults with minimal pre-configuration and supports a wide range of tooling, including Kubernetes, Docker, vCenter, and any remote machine with SSH enabled.
Platforms: Kubernetes, Docker, VMware vCenter, remote machines over SSH
Release year: 2019
Creator: VMware
Language: Java
AWS Fault Injection Simulator: A fully managed service for running fault injection experiments on AWS, intended to make it easier to improve an application’s performance, observability, and resiliency.
Works with: Amazon Relational Database Service (RDS), Elastic Compute Cloud (EC2), Elastic Container Service (ECS), and Elastic Kubernetes Service (EKS)
Release year: 2021
Creator: Amazon Web Services
ChaosBlade: Built on nearly ten years of failure testing at Alibaba. It supports a wide range of platforms, including Kubernetes, cloud platforms, and bare-metal, and provides dozens of attacks, including packet loss, process killing, and resource consumption. It also supports application-level fault injection for Java, C++, and Node.js applications, including arbitrary code injection, delayed code execution, and modification of memory values.
Because chaos engineering is an experimental approach, it gives us a holistic view of the system’s behavior and of how all the moving parts interact under a given set of circumstances. That lets us derive insights into both the technical and the soft aspects of the system (that is, the human factor). Chaos engineering can also help organizations find security vulnerabilities that are otherwise hard to detect with traditional methods because of the complex nature of distributed systems. This may include losses caused by human factors, poor design, or a lack of resiliency.