Chaos Engineering
Discipline of experimenting on production systems to discover weaknesses before they cause incidents, by injecting controlled failures.
seed #chaos-engineering #resilience #fault-injection #testing #reliability
What it is
Chaos Engineering is the practice of injecting controlled failures into systems to discover weaknesses before they cause real incidents. Netflix popularized it with Chaos Monkey, a tool that randomly terminates production instances.
Principles
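- Define the system's steady state as a measurable output (for example throughput, error rate, or latency)
- Hypothesize that the steady state will hold under perturbation
- Introduce real-world variables (network failures, latency, crashes)
- Compare the observed behavior with the hypothesis; any difference points to a weakness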
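The four principles above can be read as the steps of a single experiment loop. The following is a minimal sketch, not a real framework: `run_experiment`, `ExperimentResult`, and `ToyService` are hypothetical names introduced here, and the "system" is an in-memory stand-in whose success rate stays at 1.0 as long as at least one replica is healthy.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float
    observed: float
    hypothesis_held: bool


def run_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    revert: Callable[[], None],
    tolerance: float,
) -> ExperimentResult:
    """Measure steady state, perturb, measure again, compare."""
    baseline = measure()      # steady state before the fault
    inject()                  # introduce the real-world variable
    try:
        observed = measure()  # same metric, now under perturbation
    finally:
        revert()              # always undo the fault
    # Hypothesis: the metric stays within `tolerance` of the baseline.
    held = abs(observed - baseline) <= tolerance
    return ExperimentResult(baseline, observed, held)


class ToyService:
    """Hypothetical stand-in for a replicated service (assumption)."""

    def __init__(self, replicas: int) -> None:
        self.total = replicas
        self.healthy = replicas

    def success_rate(self) -> float:
        # Requests succeed while at least one replica is healthy.
        return 1.0 if self.healthy > 0 else 0.0

    def kill_one(self) -> None:
        self.healthy = max(0, self.healthy - 1)

    def restore(self) -> None:
        self.healthy = self.total


redundant = ToyService(replicas=2)
single = ToyService(replicas=1)
for service in (redundant, single):
    result = run_experiment(
        service.success_rate, service.kill_one, service.restore, tolerance=0.01
    )
    print(service.total, "replica(s): hypothesis held =", result.hypothesis_held)
```

In this toy setup the hypothesis holds for the two-replica service and fails for the single-replica one, which is the kind of difference the last principle asks you to look for.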
Experiment types
| Type | What it simulates | What it validates |
|---|---|---|
| Terminate instances | Server failure | Auto-healing, redundancy |
| Inject latency | Degraded network | Timeouts, circuit breakers |
| Dependency failure | External service down | Fallbacks, graceful degradation |
| Exhaust resources | CPU/memory/disk at limit | Autoscaling, alerts |
| Data corruption | Inconsistent data | Validation, reconciliation |
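As an illustration of the "Inject latency" row, the sketch below wraps a dependency call with artificial delay and checks that the caller's timeout and fallback take over. All names (`fetch_price`, `with_latency`, `call_with_fallback`) and the timing values are assumptions made for this example; a real experiment would inject latency at the network or proxy layer, not inside the process.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def fetch_price() -> str:
    """Hypothetical dependency call that normally returns immediately."""
    return "live-price"


def with_latency(func: Callable[[], str], delay_s: float) -> Callable[[], str]:
    """Return a version of `func` that sleeps before answering (the fault)."""
    def delayed() -> str:
        time.sleep(delay_s)
        return func()
    return delayed


def call_with_fallback(
    dependency: Callable[[], str], timeout_s: float, fallback: str
) -> str:
    """Call the dependency, returning `fallback` if it exceeds the timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(dependency)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on the slow call; the worker finishes in the background.
        pool.shutdown(wait=False)


healthy = call_with_fallback(fetch_price, timeout_s=0.2, fallback="cached-price")
degraded = call_with_fallback(
    with_latency(fetch_price, delay_s=0.6), timeout_s=0.2, fallback="cached-price"
)
print(healthy, degraded)
```

If the degraded call returned the live value or hung instead of falling back, the experiment would have exposed a missing or misconfigured timeout.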
Tools
| Tool | Focus |
|---|---|
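| Chaos Monkey | Randomly terminates instances (Netflix) |
| Gremlin | Commercial SaaS platform for fault injection |
| Litmus | Kubernetes-native experiments (CNCF project) |
| AWS FIS | Managed fault injection for AWS resources (Fault Injection Service, formerly Fault Injection Simulator) |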
Precautions
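- Start in non-production environments, then move to production gradually
- Have an automatic rollback or abort mechanism
- Communicate experiments to the team before running them
- Minimize blast radius by limiting how many hosts or users an experiment can affect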
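Two of the precautions above, limiting blast radius and rolling back automatically, can be sketched in code. This is an illustrative outline under stated assumptions: `pick_targets` and `guarded_experiment` are hypothetical helpers, the hosts are plain strings, and the error rate is computed from an in-memory set instead of a monitoring system.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence


def pick_targets(hosts: Sequence[str], fraction: float, rng: random.Random) -> List[str]:
    """Limit blast radius: fault only a fraction of hosts (at least one)."""
    count = max(1, int(len(hosts) * fraction))
    return rng.sample(list(hosts), count)


def guarded_experiment(
    hosts: Sequence[str],
    fraction: float,
    inject: Callable[[str], object],
    rollback: Callable[[str], object],
    error_rate: Callable[[], float],
    abort_threshold: float,
    checks: int,
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    """Inject a fault on a small target set and always roll it back."""
    rng = rng or random.Random()
    targets = pick_targets(hosts, fraction, rng)
    affected: List[str] = []
    aborted = False
    try:
        for host in targets:
            inject(host)
            affected.append(host)
        for _ in range(checks):
            # Abort condition: stop early if the error rate is too high.
            if error_rate() > abort_threshold:
                aborted = True
                break
    finally:
        # Automatic rollback runs on abort, on success, and on exceptions.
        for host in affected:
            rollback(host)
    return {"targets": targets, "aborted": aborted}


hosts = [f"host-{i}" for i in range(10)]
faulted = set()  # hypothetical record of hosts currently under fault
outcome = guarded_experiment(
    hosts=hosts,
    fraction=0.2,
    inject=faulted.add,
    rollback=faulted.discard,
    error_rate=lambda: len(faulted) / len(hosts),
    abort_threshold=0.1,
    checks=5,
    rng=random.Random(0),
)
print(len(outcome["targets"]), outcome["aborted"], len(faulted))
```

Putting the rollback in a `finally` block is a deliberate choice here: the fault is reverted even if the monitoring call itself raises.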
Why it matters
Distributed systems fail in unpredictable ways. Chaos engineering turns those failures into planned, controlled events, revealing weaknesses before they become production incidents. Running such experiments regularly replaces assumptions about resilience with evidence.
References
- Principles of Chaos Engineering — Manifesto.
- Chaos Monkey — Netflix, 2024. The original chaos engineering tool.
- Chaos Engineering: History, Principles, and Practice — Gremlin, 2024. History and principles of chaos engineering.