Discipline of experimenting on production systems by injecting controlled failures to discover weaknesses before they cause incidents.
Chaos Engineering is the discipline of experimenting on distributed systems to build confidence in their ability to withstand turbulent conditions in production. Unlike traditional testing that validates known behaviors, chaos engineering seeks to discover emergent system properties through controlled fault injection.
The practice is based on the principle that complex systems fail in unpredictable ways. Rather than waiting for these failures to occur naturally, chaos engineering provokes them in a controlled manner to identify weaknesses before they become critical incidents. This proactive approach allows teams to improve system resilience based on empirical evidence.
Netflix popularized this discipline with Chaos Monkey in 2010, but the concept has evolved into a structured methodology that spans from simple experiments to complex game days involving multiple teams and systems.
The four principles of chaos engineering establish a scientific methodology for experiments:

1. Define the system's "steady state" as a measurable output that indicates normal behavior (throughput, error rate, latency percentiles).
2. Hypothesize that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect real-world events: crashed servers, degraded networks, failing dependencies.
4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.

The table below summarizes the real-world events most commonly injected and what each one validates:
| Failure type | What it simulates | What it validates | Blast radius |
|---|---|---|---|
| Terminate instances | Hardware failure, deployment issues | Auto-healing, redundancy | Instance/AZ |
| Inject latency | Degraded network, overload | Timeouts, circuit breakers | Specific connection |
| Dependency failure | External service down | Fallbacks, graceful degradation | Downstream service |
| Exhaust resources | CPU/memory/disk at limit | Autoscaling, alerting | Node/cluster |
| Data corruption | Inconsistencies, bugs | Validation, reconciliation | Specific dataset |
| Network partition | Split-brain, CAP theorem | Consensus algorithms, data consistency | Network segment |
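As a concrete, minimal illustration of the "inject latency" and "dependency failure" rows, a small in-process fault injector can wrap calls to a downstream dependency and probabilistically add delay or raise an error. The sketch below is illustrative only; the class name, rates, and the payment_client usage are hypothetical.

```python
import random
import time


class DependencyChaos:
    """Wraps calls to a downstream dependency, injecting latency or failures."""

    def __init__(self, latency_rate=0.1, failure_rate=0.05, added_latency_s=0.5):
        self.latency_rate = latency_rate        # fraction of calls that get extra delay
        self.failure_rate = failure_rate        # fraction of calls that fail outright
        self.added_latency_s = added_latency_s  # simulated network/overload delay

    def call(self, fn, *args, **kwargs):
        if random.random() < self.failure_rate:
            # Simulate the dependency being completely unavailable.
            raise ConnectionError("chaos: simulated dependency failure")
        if random.random() < self.latency_rate:
            # Simulate a degraded network or an overloaded dependency.
            time.sleep(self.added_latency_s)
        return fn(*args, **kwargs)


# Hypothetical usage: exercise a payment client's timeouts and fallbacks.
# chaos = DependencyChaos(latency_rate=0.2, failure_rate=0.1)
# response = chaos.call(payment_client.charge, order_id="1234")
```

For infrastructure-level faults such as terminating instances, the LitmusChaos ChaosEngine below takes the declarative route: it deletes checkout-service pods on a schedule while an HTTP probe continuously verifies that the health endpoint keeps responding.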
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-experiment
  namespace: production
spec:
  # Hypothesis: System maintains 99.9% uptime with pods deleted
  appinfo:
    appns: ecommerce
    applabel: "app=checkout-service"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Delete 1 pod every 30 seconds for 5 minutes
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: CHAOS_INTERVAL
              value: "30"
            - name: FORCE
              value: "false"
        probe:
          # Validate that the endpoint responds correctly
          - name: checkout-availability
            type: httpProbe
            httpProbe/inputs:
              url: "https://api.example.com/health"
              insecureSkipTLS: false
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5s
              interval: 10s
```

Hypothesis: During normal conditions, the checkout service maintains:
- P95 latency < 500ms
- Success rate > 99.5%
- Throughput > 1000 transactions/minute
- CPU utilization < 70%
Experiment: Delete 2 of 10 service pods for 10 minutes
Success metric: All metrics remain within thresholds
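One way to automate this success metric is to query a metrics backend during or after the experiment and compare each steady-state indicator against its threshold. The sketch below assumes a Prometheus endpoint and uses illustrative query strings and service labels; both would need to be adapted to the actual environment.

```python
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"  # hypothetical endpoint

# Steady-state thresholds from the hypothesis above; queries are illustrative.
CHECKS = {
    "p95_latency_seconds": (
        'histogram_quantile(0.95, sum(rate('
        'http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))',
        lambda v: v < 0.5,        # P95 latency < 500ms
    ),
    "success_rate": (
        'sum(rate(http_requests_total{service="checkout",code=~"2.."}[5m])) '
        '/ sum(rate(http_requests_total{service="checkout"}[5m]))',
        lambda v: v > 0.995,      # success rate > 99.5%
    ),
}


def steady_state_holds():
    """Returns True only if every steady-state metric is within its threshold."""
    for name, (query, within_threshold) in CHECKS.items():
        resp = requests.get(f"{PROMETHEUS}/api/v1/query",
                            params={"query": query}, timeout=5)
        resp.raise_for_status()
        value = float(resp.json()["data"]["result"][0]["value"][1])
        if not within_threshold(value):
            print(f"steady state violated: {name}={value}")
            return False
    return True
```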
Hypothesis: API Gateway handles backend service loss gracefully:
- Fallback responses in < 200ms
- Circuit breaker activates after 5 consecutive failures
- Structured error logs generated
Experiment: Simulate complete microservice failure for 5 minutes
Validation: Verify circuit breaker activation and fallback responses
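For reference, the behavior this experiment validates, opening after 5 consecutive failures and short-circuiting to a fallback, can be captured in a minimal circuit breaker sketch. Class and parameter names below are illustrative and not tied to any specific gateway.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures and serves a fallback while open."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.consecutive_failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While open, short-circuit to the fallback until the reset timeout elapses.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call

        try:
            result = fn()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()

        self.consecutive_failures = 0
        return result
```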
Blast radius defines the potential scope of impact of an experiment. Typical control strategies include starting in pre-production or canary environments, limiting experiments by time window, geography, and traffic percentage, and defining automatic abort conditions that stop fault injection as soon as steady-state metrics breach their thresholds, as in the sketch below.
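A minimal sketch of two of these controls, bounding both the number of targets and the experiment's duration, might look like the following; select_targets, inject_fault, and abort_check are hypothetical helpers standing in for whatever tooling is in use.

```python
import random
import time


def select_targets(pod_names, max_fraction=0.1, max_count=2):
    """Never target more than a small, bounded subset of instances."""
    limit = min(max_count, max(1, int(len(pod_names) * max_fraction)))
    return random.sample(pod_names, k=min(limit, len(pod_names)))


def run_bounded_experiment(inject_fault, targets, max_duration_s=300,
                           abort_check=lambda: False):
    """Stops injecting faults at a hard time limit or when an abort condition fires."""
    deadline = time.monotonic() + max_duration_s
    for target in targets:
        if time.monotonic() > deadline or abort_check():
            print("blast radius guard triggered: stopping experiment")
            break
        inject_fault(target)
```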
Game days are coordinated exercises in which multiple teams simulate a major incident end to end, practicing detection, communication, and recovery under controlled conditions.
❌ Bad: "Let's see what happens if we delete pods"
✅ Good: "Hypothesis: System maintains 99.9% uptime when 2 of 10 checkout service pods are deleted for 5 minutes"
❌ Bad: Production experiments without scope limits
✅ Good: Experiments limited by time, geography, and traffic percentage
❌ Bad: Running experiments without validation metrics
✅ Good: Real-time dashboards with steady-state metrics, as in the guardrail sketch below
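One way to operationalize the last pair is to wrap every experiment in a guardrail loop that keeps validating steady state and aborts on the first violation, the programmatic equivalent of watching a real-time dashboard. The function names below (start_chaos, stop_chaos, steady_state_holds) are placeholders for whatever chaos tooling and metrics checks are in use.

```python
import time


def run_with_guardrails(start_chaos, stop_chaos, steady_state_holds,
                        max_duration_s=300, check_interval_s=10):
    """Runs a chaos experiment while continuously validating steady state."""
    start_chaos()
    deadline = time.monotonic() + max_duration_s
    try:
        while time.monotonic() < deadline:
            if not steady_state_holds():
                print("steady state violated: aborting experiment")
                break
            time.sleep(check_interval_s)
    finally:
        stop_chaos()  # always roll back fault injection, even on abort
```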
In modern distributed systems, emergent complexity makes failures inevitable and unpredictable. Chaos engineering transforms this reality from reactive to proactive: instead of waiting for systems to fail at the worst possible moment, we make them fail when we're prepared to learn from it.
For staff+ engineering teams, chaos engineering provides empirical evidence about architectural trade-offs. Do we really need that multi-region redundancy? Are circuit breakers configured correctly? Does auto-scaling respond fast enough? Only controlled experiments can answer these questions with real data.
The practice also accelerates the development of expertise in incident management. Teams that practice chaos engineering regularly respond faster and more effectively to real incidents, because they've already experienced similar scenarios under controlled conditions. It's the difference between training in flight simulators versus learning during a real emergency.
Site Reliability Engineering (SRE): discipline applying software engineering principles to infrastructure operations, focusing on creating scalable and highly reliable systems.
Testing strategies: approaches and testing levels for validating that software works correctly, from unit tests to end-to-end tests and testing in production.
Observability: the ability to understand a system's internal state from its external outputs (logs, metrics, and traces), enabling problem diagnosis without direct access to the system.
Incident management: processes and practices for detecting, responding to, resolving, and learning from production incidents in a structured and effective way.