
Chaos Engineering

The discipline of experimenting on production systems by injecting controlled failures in order to discover weaknesses before they cause incidents.

evergreen · #chaos-engineering #resilience #fault-injection #testing #reliability

What it is

Chaos Engineering is the discipline of experimenting on distributed systems to build confidence in their ability to withstand turbulent conditions in production. Unlike traditional testing that validates known behaviors, chaos engineering seeks to discover emergent system properties through controlled fault injection.

The practice is based on the principle that complex systems fail in unpredictable ways. Rather than waiting for these failures to occur naturally, chaos engineering provokes them in a controlled manner to identify weaknesses before they become critical incidents. This proactive approach allows teams to improve system resilience based on empirical evidence.

Netflix popularized this discipline with Chaos Monkey in 2010, but the concept has since evolved into a structured methodology that ranges from simple experiments to complex game days involving multiple teams and systems.

Fundamental principles

The four principles of chaos engineering establish a scientific methodology for experiments (a minimal sketch in code follows the list):

  1. Define steady state: Identify metrics that represent normal system behavior (latency, throughput, error rate)
  2. Hypothesize continuity: Formulate the hypothesis that steady state will be maintained during the experiment
  3. Introduce real-world variables: Inject failures that reflect real events (server crashes, network partitions, latency spikes)
  4. Disprove the hypothesis: Look for evidence that contradicts the initial hypothesis to discover weaknesses
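
To make the methodology concrete, here is a minimal Python sketch of an experiment harness built around these four steps. The metric source and fault hooks (get_error_rate, inject_fault, revert_fault) are assumptions passed in as callables; a real platform wires them to its own probes and fault actions.

import time
from typing import Callable

def run_experiment(
    get_error_rate: Callable[[], float],  # steady-state metric source (assumed hook)
    inject_fault: Callable[[], None],     # starts the fault, e.g. terminate an instance (assumed hook)
    revert_fault: Callable[[], None],     # stops the fault / restores the system (assumed hook)
    threshold: float = 0.005,             # steady state: error rate stays below 0.5%
    duration_s: int = 300,
    interval_s: int = 10,
) -> bool:
    """Return True if the steady-state hypothesis held for the whole experiment."""
    # 1. Define steady state: verify it actually holds before injecting anything.
    if get_error_rate() >= threshold:
        raise RuntimeError("System is not in steady state; aborting before injection.")
    # 2. Hypothesize continuity: the error rate stays below `threshold` during the fault.
    inject_fault()  # 3. Introduce a real-world variable.
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            # 4. Try to disprove the hypothesis; stop early to limit the blast radius.
            if get_error_rate() >= threshold:
                return False
            time.sleep(interval_s)
        return True
    finally:
        revert_fault()  # Always restore the system, whatever the outcome.

Note that the harness refuses to start if steady state does not hold and aborts as soon as the hypothesis is disproved, which doubles as a blast radius limit.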

Experiment types

| Failure type | What it simulates | What it validates | Blast radius |
|---|---|---|---|
| Terminate instances | Hardware failure, deployment issues | Auto-healing, redundancy | Instance/AZ |
| Inject latency | Degraded network, overload | Timeouts, circuit breakers | Specific connection |
| Dependency failure | External service down | Fallbacks, graceful degradation | Downstream service |
| Exhaust resources | CPU/memory/disk at limit | Autoscaling, alerting | Node/cluster |
| Data corruption | Inconsistencies, bugs | Validation, reconciliation | Specific dataset |
| Network partition | Split-brain, CAP theorem | Consensus algorithms, data consistency | Network segment |
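
Dedicated tooling injects most of these faults at the infrastructure or proxy layer, but the mechanics can be illustrated in-process. The sketch below is a hypothetical wrapper (the chaotic helper and the payments_client in the usage comment are assumptions, not any tool's API) that adds latency and probabilistic failures to a dependency call, the same conditions described by the "inject latency" and "dependency failure" rows.

import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def chaotic(call: Callable[[], T], latency_s: float = 0.0, error_rate: float = 0.0) -> T:
    """Wrap a dependency call with injected latency and probabilistic injected failures (sketch)."""
    if latency_s:
        time.sleep(latency_s)  # "Inject latency": exercises timeouts and circuit breakers
    if random.random() < error_rate:
        # "Dependency failure": exercises fallbacks and graceful degradation
        raise ConnectionError("chaos: injected dependency failure")
    return call()

# Hypothetical usage: 200 ms of extra latency and a 10% failure rate on a payment lookup.
# result = chaotic(lambda: payments_client.get_status(order_id), latency_s=0.2, error_rate=0.1)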

Practical example: Litmus experiment

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-experiment
  namespace: production
spec:
  # Hypothesis: System maintains 99.9% uptime with pods deleted
  appinfo:
    appns: ecommerce
    applabel: "app=checkout-service"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        # Delete 1 pod every 30 seconds for 5 minutes
        - name: TOTAL_CHAOS_DURATION
          value: "300"
        - name: CHAOS_INTERVAL
          value: "30"
        - name: FORCE
          value: "false"
      probe:
      # Validate that endpoint responds correctly
      - name: checkout-availability
        type: httpProbe
        httpProbe/inputs:
          url: "https://api.example.com/health"
          insecureSkipTLS: false
          method:
            get:
              criteria: ==
              responseCode: "200"
        mode: Continuous
        runProperties:
          probeTimeout: 5s
          interval: 10s

Steady-state hypothesis examples

E-commerce checkout service

Hypothesis: During normal conditions, the checkout service maintains:
- P95 latency < 500ms
- Success rate > 99.5%
- Throughput > 1000 transactions/minute
- CPU utilization < 70%

Experiment: Delete 2 of 10 service pods for 10 minutes
Success metric: All metrics remain within thresholds
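
As a sketch of how such a hypothesis can be evaluated as code, the function below checks the four checkout thresholds against collected samples; how the samples are gathered (metrics backend, query window) is left out and assumed to exist.

import statistics

def checkout_steady_state(latencies_ms: list[float], successes: int, total: int,
                          throughput_per_min: float, cpu_util: float) -> bool:
    """Evaluate the checkout steady-state hypothesis from collected samples (sketch)."""
    p95 = statistics.quantiles(latencies_ms, n=100)[94]  # 95th-percentile latency
    return (
        p95 < 500                       # P95 latency < 500 ms
        and successes / total > 0.995   # success rate > 99.5%
        and throughput_per_min > 1000   # throughput > 1000 transactions/minute
        and cpu_util < 0.70             # CPU utilization < 70%
    )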

API Gateway

Hypothesis: API Gateway handles backend service loss gracefully:
- Fallback responses in < 200ms
- Circuit breaker activates after 5 consecutive failures
- Structured error logs generated

Experiment: Simulate complete microservice failure for 5 minutes
Validation: Verify circuit breaker activation and fallback responses
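
The behavior being validated can be sketched as a minimal in-process circuit breaker: it opens after five consecutive failures, serves the fallback while open, and probes the backend again after a cool-down. This illustrates the pattern, not the gateway's actual implementation; the class name and parameters are assumptions.

import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures, serves a fallback while open."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def call(self, backend: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()      # open: answer from the (fast) fallback path
            self.opened_at = None      # cool-down elapsed: half-open, probe the backend again
        try:
            result = backend()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip after N consecutive failures
            return fallback()
        self.consecutive_failures = 0
        return result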

Blast radius control

The blast radius defines the potential scope of impact of an experiment. Control strategies (a minimal sketch combining several of them follows these lists):

By infrastructure

  • Canary deployments: Experiments on 1-5% of traffic
  • Blue/green environments: Experiments in parallel environment
  • Availability zones: Limit to specific AZ
  • Kubernetes namespaces: Isolate by namespace/cluster

By time

  • Limited duration: Experiments of 5-15 minutes maximum
  • Specific schedules: Avoid peak hours or maintenance windows
  • Automatic rollback: Triggers based on health metrics

By functional scope

  • Feature flags: Enable/disable specific functionalities
  • User cohorts: Limit to beta or internal users
  • Geographic regions: Experiments by geographic region
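
The sketch below shows how several of these controls can combine in application code; the flag, cohort names, and thresholds are assumptions for illustration. Traffic-percentage gating uses a stable hash so a given user is consistently inside or outside the experiment, and the abort check implements the automatic-rollback trigger from the time-based controls.

import hashlib

CHAOS_ENABLED = True                  # feature flag: kill switch for all experiments (assumed)
TARGET_PERCENT = 5                    # limit the experiment to ~5% of eligible traffic
INTERNAL_COHORTS = {"staff", "beta"}  # functional scope: internal and beta users only (assumed)

def in_blast_radius(user_id: str, cohort: str) -> bool:
    """Decide whether a request participates in the experiment (sketch)."""
    if not CHAOS_ENABLED or cohort not in INTERNAL_COHORTS:
        return False
    # Stable hash so the same user is consistently inside or outside the experiment.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < TARGET_PERCENT

def should_abort(error_rate: float, p95_latency_ms: float) -> bool:
    """Automatic rollback trigger: stop injecting as soon as health metrics degrade."""
    return error_rate > 0.005 or p95_latency_ms > 500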

Game day planning

Game days are coordinated exercises that simulate major incidents:

Preparation (2-4 weeks before)

  1. Define scenarios: Multi-AZ failure, database corruption, DDoS attack
  2. Form teams: Incident commander, communications lead, technical leads
  3. Prepare runbooks: Response and rollback procedures
  4. Set up observability: Dashboards, alerts, centralized logs

Execution (2-4 hours)

  1. Initial briefing: Objectives, roles, communication channels
  2. Gradual injection: Start with minor failures, escalate progressively
  3. Real-time documentation: Decisions, response times, lessons
  4. Immediate debrief: What worked, what failed, next steps

Post-game (1-2 weeks after)

  1. Detailed analysis: MTTR metrics, runbook effectiveness
  2. Action items: Improvements in monitoring, alerting, procedures
  3. Runbook updates: Incorporate lessons learned
  4. Next game day planning: New scenarios, greater complexity

Tools and platforms

Commonly used platforms include Netflix's Chaos Monkey, LitmusChaos (CNCF), AWS Fault Injection Service, and Gremlin; see the references below for each.

Common anti-patterns

Experiments without clear hypothesis

❌ Bad: "Let's see what happens if we delete pods"
✅ Good: "Hypothesis: System maintains 99.9% uptime when 2 of 10 
         checkout service pods are deleted for 5 minutes"

Uncontrolled blast radius

❌ Bad: Production experiments without scope limits
✅ Good: Experiments limited by time, geography, and traffic percentage

Lack of observability

❌ Bad: Run experiments without validation metrics
✅ Good: Real-time dashboards with steady-state metrics

Why it matters

In modern distributed systems, emergent complexity makes failures inevitable and unpredictable. Chaos engineering transforms this reality from reactive to proactive: instead of waiting for systems to fail at the worst possible moment, we make them fail when we're prepared to learn from it.

For staff+ engineering teams, chaos engineering provides empirical evidence about architectural trade-offs. Do we really need that multi-region redundancy? Are circuit breakers configured correctly? Does auto-scaling respond fast enough? Only controlled experiments can answer these questions with real data.

The practice also accelerates the development of expertise in incident management. Teams that practice chaos engineering regularly respond faster and more effectively to real incidents, because they've already experienced similar scenarios under controlled conditions. It's the difference between training in flight simulators versus learning during a real emergency.

References

  • Principles of Chaos Engineering — Community, 2019. Manifesto and fundamental principles of chaos engineering.
  • Chaos Monkey — Netflix, 2024. Official documentation of the original chaos engineering tool.
  • LitmusChaos: Open Source Chaos Engineering Platform — CNCF, 2024. Cloud-native platform for chaos engineering experiments.
  • What is AWS Fault Injection Service? — AWS, 2024. Managed service for fault injection in AWS.
  • Chaos Engineering — Gremlin, 2024. Complete guide to chaos engineering and best practices.
  • Resilience Engineering at LinkedIn with Project Waterbear — LinkedIn Engineering, 2017. Enterprise-scale chaos engineering implementation.
  • awesome-chaos-engineering (dastergon/awesome-chaos-engineering) — Community, 2024. Curated list of chaos engineering resources and tools.

Related content

  • Site Reliability Engineering

    Discipline applying software engineering principles to infrastructure operations, focusing on creating scalable and highly reliable systems.

  • Testing Strategies

    Approaches and testing levels for validating software works correctly, from unit tests to end-to-end tests and testing in production.

  • Observability

    Ability to understand a system's internal state from its external outputs: logs, metrics, and traces, enabling problem diagnosis without direct system access.

  • Incident Management

    Processes and practices for detecting, responding to, resolving, and learning from production incidents in a structured and effective way.
