Chaos Engineering

The discipline of experimenting on production systems by injecting controlled failures, in order to discover weaknesses before they cause incidents.

seed #chaos-engineering #resilience #fault-injection #testing #reliability

What it is

Chaos Engineering is the practice of injecting controlled failures into a system to discover weaknesses before they cause real incidents. Netflix popularized it with Chaos Monkey, a tool that randomly terminates production instances.

Principles

  1. Define the system's steady state as a measurable output (for example, request success rate or latency)
  2. Hypothesize that the steady state continues under perturbation
  3. Introduce real-world variables (network failures, latency, crashes)
  4. Look for differences between the hypothesis and the observed behavior
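
The four steps can be sketched as a tiny experiment loop. This is a minimal illustration against a simulated service, not a real chaos tool: `handle_request`, the 50 ms baseline latency, and the latency limit are all assumptions made up for the example.

```python
import random
import statistics


def handle_request(rng, injected_ms=0.0):
    """Hypothetical service call; returns a simulated latency in milliseconds."""
    base = max(rng.gauss(50.0, 5.0), 0.0)  # assumed baseline latency of the toy service
    return base + injected_ms


def measure_median(rng, n, injected_ms=0.0):
    """Median latency over n simulated requests."""
    samples = [handle_request(rng, injected_ms) for _ in range(n)]
    return statistics.median(samples)


def run_experiment(steady_state_limit_ms, injected_ms, n=200, seed=0):
    rng = random.Random(seed)
    # Step 1: measure the steady state without any fault.
    control = measure_median(rng, n)
    # Step 2: the hypothesis is that median latency stays under the limit.
    # Step 3: introduce the perturbation (extra latency on every request).
    experiment = measure_median(rng, n, injected_ms)
    # Step 4: compare the observation with the hypothesis.
    return {
        "control_median_ms": control,
        "experiment_median_ms": experiment,
        "hypothesis_holds": experiment <= steady_state_limit_ms,
    }
```

Calling `run_experiment(steady_state_limit_ms=100.0, injected_ms=200.0)` reports that the hypothesis does not hold, which is the useful outcome: the experiment has surfaced a weakness to fix.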

Experiment types

Type                 What it simulates         What it validates
Terminate instances  Server failure            Auto-healing, redundancy
Inject latency       Degraded network          Timeouts, circuit breakers
Dependency failure   External service down     Fallbacks, graceful degradation
Exhaust resources    CPU/memory/disk at limit  Autoscaling, alerts
Data corruption      Inconsistent data         Validation, reconciliation
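
As an illustration of the "Dependency failure" row, the sketch below wraps a call so that a configurable fraction of invocations fail, then checks that the caller degrades gracefully. All names here (`with_fault`, `fetch_recommendations`, `DEFAULT_RECOMMENDATIONS`) are hypothetical and exist only for this example.

```python
import random


class DependencyDown(Exception):
    """Raised by the injected fault to mimic an unavailable external service."""


def with_fault(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise DependencyDown."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyDown("injected fault")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id):
    """Hypothetical external service call."""
    return [f"item-{user_id}-1", f"item-{user_id}-2"]


DEFAULT_RECOMMENDATIONS = ["popular-1", "popular-2"]


def recommendations_with_fallback(fetch, user_id):
    """Graceful degradation: serve a static default when the dependency fails."""
    try:
        return fetch(user_id)
    except DependencyDown:
        return DEFAULT_RECOMMENDATIONS
```

With `failure_rate=1.0` every call fails and the caller should return the defaults. If it raises instead, the experiment has found a missing fallback.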

Tools

Tool          Focus
Chaos Monkey  Terminate instances (Netflix)
Gremlin       Complete SaaS platform
Litmus        Kubernetes-native (CNCF)
AWS FIS       Fault Injection Simulator

Precautions

  • Start in non-production environments
  • Have automatic rollback
  • Communicate experiments to the team
  • Minimize blast radius
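
Two of these precautions, limiting blast radius and rolling back automatically, can be sketched in a few lines. This is a simplified illustration: `pick_targets`, `run_guarded`, and the callbacks passed to them are hypothetical names, and a real setup would rely on the abort or stop conditions of whichever tool is in use.

```python
import random


def pick_targets(instances, fraction, rng, max_targets=1):
    """Limit blast radius: pick at most max_targets and at most `fraction` of the fleet."""
    limit = min(max_targets, int(len(instances) * fraction))
    return rng.sample(instances, limit)


def run_guarded(inject, rollback, is_healthy, checks=5):
    """Inject a fault, abort on the first failed health check, and always roll back."""
    inject()
    try:
        for _ in range(checks):
            if not is_healthy():
                return "aborted"
        return "completed"
    finally:
        # Runs on abort, on completion, and if a health check raises.
        rollback()
```

Putting the rollback in a `finally` clause means the fault is removed regardless of how the experiment ends, which is the property "automatic rollback" asks for.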

Why it matters

Distributed systems fail in unpredictable ways. Chaos engineering turns those failures into planned, controlled events, revealing weaknesses before they become production incidents. By testing resilience directly instead of assuming it, the practice gives a team evidence-based confidence in the system.
