Discipline of experimenting on production systems by injecting controlled failures to discover weaknesses before they cause incidents.
Chaos Engineering is the discipline of experimenting on distributed systems to build confidence in their ability to withstand turbulent conditions in production. Unlike traditional testing that validates known behaviors, chaos engineering seeks to discover emergent system properties through controlled fault injection.
The practice is based on the principle that complex systems fail in unpredictable ways. Rather than waiting for these failures to occur naturally, chaos engineering provokes them in a controlled manner to identify weaknesses before they become critical incidents. This proactive approach allows teams to improve system resilience based on empirical evidence.
Netflix popularized this discipline with Chaos Monkey in 2010, but the concept has evolved into a structured methodology that spans from simple experiments to complex game days involving multiple teams and systems.
The four principles of chaos engineering establish a scientific methodology for experiments:

1. Define the system's "steady state" as a measurable output that indicates normal behavior (throughput, error rate, latency percentiles).
2. Hypothesize that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect real-world events: crashed servers, degraded networks, failing dependencies.
4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.

The table below summarizes the real-world events most commonly injected and what each one validates:
| Failure type | What it simulates | What it validates | Blast radius |
|---|---|---|---|
| Terminate instances | Hardware failure, deployment issues | Auto-healing, redundancy | Instance/AZ |
| Inject latency | Degraded network, overload | Timeouts, circuit breakers | Specific connection |
| Dependency failure | External service down | Fallbacks, graceful degradation | Downstream service |
| Exhaust resources | CPU/memory/disk at limit | Autoscaling, alerting | Node/cluster |
| Data corruption | Inconsistencies, bugs | Validation, reconciliation | Specific dataset |
| Network partition | Split-brain, CAP theorem | Consensus algorithms, data consistency | Network segment |
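As a concrete, minimal illustration of the "inject latency" and "dependency failure" rows, a small in-process fault injector can wrap calls to a downstream dependency and probabilistically add delay or raise an error. The sketch below is illustrative only; the class name, rates, and the payment_client usage are hypothetical.

```python
import random
import time


class DependencyChaos:
    """Wraps calls to a downstream dependency, injecting latency or failures."""

    def __init__(self, latency_rate=0.1, failure_rate=0.05, added_latency_s=0.5):
        self.latency_rate = latency_rate        # fraction of calls that get extra delay
        self.failure_rate = failure_rate        # fraction of calls that fail outright
        self.added_latency_s = added_latency_s  # simulated network/overload delay

    def call(self, fn, *args, **kwargs):
        if random.random() < self.failure_rate:
            # Simulate the dependency being completely unavailable.
            raise ConnectionError("chaos: simulated dependency failure")
        if random.random() < self.latency_rate:
            # Simulate a degraded network or an overloaded dependency.
            time.sleep(self.added_latency_s)
        return fn(*args, **kwargs)


# Hypothetical usage: exercise a payment client's timeouts and fallbacks.
# chaos = DependencyChaos(latency_rate=0.2, failure_rate=0.1)
# response = chaos.call(payment_client.charge, order_id="1234")
```

For infrastructure-level faults such as terminating instances, the LitmusChaos ChaosEngine below takes the declarative route: it deletes checkout-service pods on a schedule while an HTTP probe continuously verifies that the health endpoint keeps responding.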
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-experiment
  namespace: production
spec:
  # Hypothesis: System maintains 99.9% uptime with pods deleted
  appinfo:
    appns: ecommerce
    applabel: "app=checkout-service"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Delete 1 pod every 30 seconds for 5 minutes
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: CHAOS_INTERVAL
              value: "30"
            - name: FORCE
              value: "false"
        probe:
          # Validate that the endpoint responds correctly
          - name: checkout-availability
            type: httpProbe
            httpProbe/inputs:
              url: "https://api.example.com/health"
              insecureSkipTLS: false
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5s
              interval: 10s
```

Hypothesis: During normal conditions, the checkout service maintains:
- P95 latency < 500ms
- Success rate > 99.5%
- Throughput > 1000 transactions/minute
- CPU utilization < 70%
Experiment: Delete 2 of 10 service pods for 10 minutes
Success metric: All metrics remain within thresholds
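One way to automate this success metric is to query a metrics backend during or after the experiment and compare each steady-state indicator against its threshold. The sketch below assumes a Prometheus endpoint and uses illustrative query strings and service labels; both would need to be adapted to the actual environment.

```python
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"  # hypothetical endpoint

# Steady-state thresholds from the hypothesis above; queries are illustrative.
CHECKS = {
    "p95_latency_seconds": (
        'histogram_quantile(0.95, sum(rate('
        'http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))',
        lambda v: v < 0.5,        # P95 latency < 500ms
    ),
    "success_rate": (
        'sum(rate(http_requests_total{service="checkout",code=~"2.."}[5m])) '
        '/ sum(rate(http_requests_total{service="checkout"}[5m]))',
        lambda v: v > 0.995,      # success rate > 99.5%
    ),
}


def steady_state_holds():
    """Returns True only if every steady-state metric is within its threshold."""
    for name, (query, within_threshold) in CHECKS.items():
        resp = requests.get(f"{PROMETHEUS}/api/v1/query",
                            params={"query": query}, timeout=5)
        resp.raise_for_status()
        value = float(resp.json()["data"]["result"][0]["value"][1])
        if not within_threshold(value):
            print(f"steady state violated: {name}={value}")
            return False
    return True
```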
Hypothesis: API Gateway handles backend service loss gracefully:
- Fallback responses in < 200ms
- Circuit breaker activates after 5 consecutive failures
- Structured error logs generated
Experiment: Simulate complete microservice failure for 5 minutes
Validation: Verify circuit breaker activation and fallback responses
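For reference, the behavior this experiment validates, opening after 5 consecutive failures and short-circuiting to a fallback, can be captured in a minimal circuit breaker sketch. Class and parameter names below are illustrative and not tied to any specific gateway.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures and serves a fallback while open."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.consecutive_failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While open, short-circuit to the fallback until the reset timeout elapses.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call

        try:
            result = fn()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()

        self.consecutive_failures = 0
        return result
```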
Blast radius defines the potential scope of impact of an experiment. Typical control strategies include starting in pre-production or canary environments, limiting experiments by time window, geography, and traffic percentage, and defining automatic abort conditions that stop fault injection as soon as steady-state metrics breach their thresholds, as in the sketch below.
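A minimal sketch of two of these controls, bounding both the number of targets and the experiment's duration, might look like the following; select_targets, inject_fault, and abort_check are hypothetical helpers standing in for whatever tooling is in use.

```python
import random
import time


def select_targets(pod_names, max_fraction=0.1, max_count=2):
    """Never target more than a small, bounded subset of instances."""
    limit = min(max_count, max(1, int(len(pod_names) * max_fraction)))
    return random.sample(pod_names, k=min(limit, len(pod_names)))


def run_bounded_experiment(inject_fault, targets, max_duration_s=300,
                           abort_check=lambda: False):
    """Stops injecting faults at a hard time limit or when an abort condition fires."""
    deadline = time.monotonic() + max_duration_s
    for target in targets:
        if time.monotonic() > deadline or abort_check():
            print("blast radius guard triggered: stopping experiment")
            break
        inject_fault(target)
```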
Game days are coordinated exercises in which multiple teams simulate a major incident end to end, practicing detection, communication, and recovery under controlled conditions.
❌ Bad: "Let's see what happens if we delete pods"
✅ Good: "Hypothesis: System maintains 99.9% uptime when 2 of 10 checkout service pods are deleted for 5 minutes"
❌ Bad: Production experiments without scope limits
✅ Good: Experiments limited by time, geography, and traffic percentage
❌ Bad: Running experiments without validation metrics
✅ Good: Real-time dashboards with steady-state metrics, as in the guardrail sketch below
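One way to operationalize the last pair is to wrap every experiment in a guardrail loop that keeps validating steady state and aborts on the first violation, the programmatic equivalent of watching a real-time dashboard. The function names below (start_chaos, stop_chaos, steady_state_holds) are placeholders for whatever chaos tooling and metrics checks are in use.

```python
import time


def run_with_guardrails(start_chaos, stop_chaos, steady_state_holds,
                        max_duration_s=300, check_interval_s=10):
    """Runs a chaos experiment while continuously validating steady state."""
    start_chaos()
    deadline = time.monotonic() + max_duration_s
    try:
        while time.monotonic() < deadline:
            if not steady_state_holds():
                print("steady state violated: aborting experiment")
                break
            time.sleep(check_interval_s)
    finally:
        stop_chaos()  # always roll back fault injection, even on abort
```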
In modern distributed systems, emergent complexity makes failures inevitable and unpredictable. Chaos engineering transforms this reality from reactive to proactive: instead of waiting for systems to fail at the worst possible moment, we make them fail when we're prepared to learn from it.
For staff+ engineering teams, chaos engineering provides empirical evidence about architectural trade-offs. Do we really need that multi-region redundancy? Are circuit breakers configured correctly? Does auto-scaling respond fast enough? Only controlled experiments can answer these questions with real data.
The practice also accelerates the development of expertise in incident management. Teams that practice chaos engineering regularly respond faster and more effectively to real incidents, because they've already experienced similar scenarios under controlled conditions. It's the difference between training in flight simulators versus learning during a real emergency.
Site Reliability Engineering (SRE): discipline applying software engineering principles to infrastructure operations, focusing on creating scalable and highly reliable systems.
Testing strategies: approaches and testing levels for validating that software works correctly, from unit tests to end-to-end tests and testing in production.
Observability: the ability to understand a system's internal state from its external outputs (logs, metrics, and traces), enabling problem diagnosis without direct access to the system.
Incident management: processes and practices for detecting, responding to, resolving, and learning from production incidents in a structured and effective way.