What Is Chaos Engineering?

Chaos engineering is the discipline of experimenting on a distributed system to build confidence in its capability to withstand turbulent conditions in production. It was pioneered by Netflix, who created Chaos Monkey to randomly terminate production instances and verify the system remained available.

The core insight: rather than waiting for failures to happen unexpectedly, proactively inject failures and observe how the system responds. This is fundamentally different from traditional testing — you are testing the system’s resilience, not its functionality.

Principles of Chaos Engineering

  1. Build a hypothesis around steady state behavior. Define what “normal” looks like: error rate, latency, throughput.
  2. Vary real-world events. Inject failures that actually happen: server crashes, network partitions, disk full, high latency.
  3. Run experiments in production. The closer to production, the more meaningful the results. Start in staging, graduate to production.
  4. Automate experiments to run continuously. One-time experiments are useful; continuous experiments build ongoing confidence.
  5. Minimize blast radius. Start small — one instance, one region, a small percentage of traffic.

The Chaos Experiment Lifecycle

Step 1: Define Steady State

Measurable indicators of normal system behavior:

  • Error rate < 0.1%
  • P95 response time < 200ms
  • All health checks passing
  • Order completion rate > 99%
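
Indicators like these can be encoded as an automated check so an experiment can verify steady state before, during, and after injection. A minimal sketch; the metric names and thresholds below are illustrative assumptions, not a real monitoring system's schema:

```python
# Steady-state indicators as predicates. Metric names and thresholds
# are illustrative assumptions mirroring the list above.
STEADY_STATE = {
    "error_rate": lambda v: v < 0.001,            # error rate < 0.1%
    "p95_latency_ms": lambda v: v < 200,          # P95 response time < 200ms
    "health_checks_passing": lambda v: v is True, # all health checks passing
    "order_completion_rate": lambda v: v > 0.99,  # completion rate > 99%
}

def violated_indicators(metrics: dict) -> list:
    """Return the indicators that are missing or out of bounds.

    An empty list means the system is in steady state.
    """
    violations = []
    for name, check in STEADY_STATE.items():
        value = metrics.get(name)
        if value is None or not check(value):
            violations.append(name)
    return violations
```

Running this check before injecting any fault confirms you are starting from a healthy baseline; a non-empty result means the experiment should not begin.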

Step 2: Hypothesize

“If we terminate one instance of the payment service, the system will continue processing payments with no increase in error rate because the load balancer routes to healthy instances.”

Step 3: Design the Experiment

  • Target: Payment service, one instance
  • Failure type: Process termination (kill -9)
  • Duration: 5 minutes
  • Blast radius: 1 of 4 instances (25% capacity reduction)
  • Abort criteria: Error rate > 1% or P95 latency > 1 second
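
A design like this can be captured as data, so that tooling can enforce the abort criteria mechanically rather than relying on a human watching a dashboard. A sketch with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class ChaosExperiment:
    """Declarative description of one experiment (illustrative fields)."""
    target: str
    failure_type: str
    duration_minutes: int
    blast_radius: str
    max_error_rate: float      # abort criterion: error rate ceiling
    max_p95_latency_ms: float  # abort criterion: latency ceiling

    def should_abort(self, error_rate: float, p95_latency_ms: float) -> bool:
        return (error_rate > self.max_error_rate
                or p95_latency_ms > self.max_p95_latency_ms)

# The design from Step 3, expressed as data.
payment_kill = ChaosExperiment(
    target="payment service, one instance",
    failure_type="process termination (kill -9)",
    duration_minutes=5,
    blast_radius="1 of 4 instances (25% capacity reduction)",
    max_error_rate=0.01,       # abort if error rate > 1%
    max_p95_latency_ms=1000,   # abort if P95 > 1 second
)
```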

Step 4: Execute

Run the experiment with monitoring active. Have a kill switch ready to stop the experiment immediately if abort criteria are met.
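
The execute step can be sketched as a loop that polls metrics and trips the kill switch the moment abort criteria are met. Every hook here (inject, rollback, get_metrics, should_abort) is a hypothetical callable you would wire to your own injection tooling and monitoring, not a real chaos tool's API:

```python
import time

def run_experiment(inject, rollback, get_metrics, should_abort,
                   duration_s: float, poll_interval_s: float = 1.0) -> bool:
    """Run a fault injection under supervision.

    Returns True if the experiment ran its full duration,
    False if the kill switch fired. All four hooks are
    caller-supplied (hypothetical names).
    """
    inject()
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            if should_abort(get_metrics()):
                return False        # kill switch: stop immediately
            time.sleep(poll_interval_s)
        return True
    finally:
        rollback()                  # always restore the system, even on abort
```

The `finally` block is the important design choice: rollback runs whether the experiment completes, aborts, or crashes, so the injected fault can never outlive the experiment.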

Step 5: Analyze

Compare metrics during and after the experiment with the steady state:

  • Did error rate increase? By how much?
  • Did response time increase? For how long?
  • Did the system recover automatically? How quickly?
  • Did monitoring detect the issue? Did alerts fire?
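
The first part of this comparison can be automated by diffing the metrics observed during the experiment against the steady-state baseline. A sketch with an assumed flat metrics structure:

```python
def metric_deltas(baseline: dict, observed: dict) -> dict:
    """Absolute change for each metric present in both snapshots.

    Positive deltas on error rate or latency indicate degradation
    relative to steady state.
    """
    return {name: round(observed[name] - baseline[name], 6)
            for name in baseline if name in observed}
```

Questions like "did the system recover automatically, and how quickly?" still need time-series data rather than two snapshots, but a delta report like this is enough to answer the first two questions at a glance.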

Step 6: Fix and Repeat

If the system did not behave as expected, fix the weakness and repeat the experiment to verify the fix.

Types of Chaos Experiments

| Experiment | What It Tests | Example |
| --- | --- | --- |
| Instance termination | Auto-scaling, load balancing | Kill a random pod/VM |
| Network latency | Timeout handling, retries | Add 500ms latency between services |
| Network partition | Split-brain handling, consistency | Block traffic between two services |
| Disk full | Logging, data handling | Fill disk to 100% |
| CPU/Memory stress | Throttling, resource limits | Consume 90% CPU on a node |
| DNS failure | Fallback mechanisms | Block DNS resolution |
| Dependency failure | Circuit breakers, fallbacks | Make a third-party API unavailable |
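
Several of these failure modes can be rehearsed in-process before touching real infrastructure. As one illustrative sketch (not any real chaos tool's API), a wrapper that injects artificial latency and random failures into a dependency call, to exercise the caller's timeout and fallback handling:

```python
import random
import time

def inject_faults(func, latency_s: float = 0.0, error_rate: float = 0.0,
                  rng=None):
    """Wrap a dependency call with injected latency and random failures.

    Hypothetical helper for in-process fault injection: latency_s is
    added to every call, and error_rate is the probability of raising
    an injected ConnectionError instead of calling through.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        time.sleep(latency_s)
        if rng.random() < error_rate:
            raise ConnectionError("injected dependency failure")
        return func(*args, **kwargs)

    return wrapper
```

Wrapping a third-party client this way lets you verify that circuit breakers open and fallbacks engage, without needing to make the real dependency unavailable.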

Chaos Engineering Tools

| Tool | Type | Best For |
| --- | --- | --- |
| Chaos Monkey | Open source (Netflix) | Random instance termination |
| Litmus | Open source (CNCF) | Kubernetes-native chaos experiments |
| Gremlin | SaaS | Enterprise chaos-as-a-service |
| Chaos Toolkit | Open source | Framework for defining experiments in JSON/YAML |
| Toxiproxy | Open source (Shopify) | Network condition simulation |
| AWS Fault Injection Simulator | AWS service | AWS-specific fault injection |

QA’s Role in Chaos Engineering

QA engineers bring unique value to chaos engineering:

  1. Test design skills: QA knows how to design experiments that reveal weaknesses
  2. Monitoring knowledge: QA understands which metrics indicate real problems
  3. Risk assessment: QA can evaluate which experiments are safe to run and in what order
  4. Validation: QA verifies that fixes actually resolve the discovered weaknesses

Exercise: Design a Chaos Experiment

Your e-commerce application has: API gateway, product service, cart service, payment service, notification service, PostgreSQL, Redis.

Design three chaos experiments in order of increasing risk.

Solution

Experiment 1: Redis Cache Failure (Low Risk)

Hypothesis: If Redis becomes unavailable, the application continues serving requests (with degraded performance) by falling back to database queries.

Steady state: P95 < 200ms, error rate < 0.1%, product pages load successfully

Injection: Stop Redis container for 5 minutes

Expected behavior: Response time increases to 500-800ms (database fallback), no errors, product pages still load

Abort criteria: Error rate > 1% or product pages return 500 errors
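
The fallback behavior this experiment verifies can be sketched as: try the cache, and on any cache connection error fall back to a direct database read. Function and class names here are hypothetical:

```python
def get_product(product_id, cache, db):
    """Serve from cache when possible; degrade to the database on cache failure."""
    try:
        cached = cache.get(product_id)
        if cached is not None:
            return cached
    except ConnectionError:
        pass                     # Redis down: degrade to a direct DB read
    product = db.fetch_product(product_id)
    try:
        cache.set(product_id, product)
    except ConnectionError:
        pass                     # best-effort cache repopulation
    return product
```

The experiment checks exactly the two properties this sketch encodes: no user-facing errors (cache failures are swallowed) and the latency cost of the database path.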

Experiment 2: Payment Service Instance Failure (Medium Risk)

Hypothesis: If one of three payment service instances dies, the load balancer routes to healthy instances with no failed payments.

Steady state: Payment success rate > 99.5%, P95 < 500ms

Injection: Kill one payment service pod (kubectl delete pod)

Expected behavior: Kubernetes restarts the pod within 30 seconds. During that time, remaining instances handle traffic. No failed payments.

Abort criteria: Payment success rate < 98% or P95 > 2 seconds

Experiment 3: Network Partition Between Services (Higher Risk)

Hypothesis: If the product service cannot reach the inventory service, it serves cached inventory data and marks items as “check availability” instead of showing “in stock/out of stock.”

Steady state: Product pages show accurate inventory, error rate < 0.1%

Injection: Block network between product-service and inventory-service for 3 minutes

Expected behavior: Product pages load, show cached inventory or “check availability” message. No 500 errors.

Abort criteria: Error rate > 2% or product pages fail to load
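
The degraded behavior in this hypothesis can be sketched as: return cached inventory when the inventory service is unreachable, and a "check availability" placeholder when there is no cache entry either. Names are hypothetical:

```python
def inventory_status(product_id, inventory_client, cache):
    """Prefer live inventory; degrade to cached data, then to a placeholder."""
    try:
        status = inventory_client.get_stock(product_id)
        cache[product_id] = status   # refresh the cache on every success
        return status
    except ConnectionError:
        # Partitioned from the inventory service: serve stale data if any.
        return cache.get(product_id, "check availability")
```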

Key Takeaways

  1. Chaos engineering is proactive resilience testing — find weaknesses before they find you
  2. Always define steady state first — you cannot detect failure without knowing what normal looks like
  3. Start small and escalate — begin in staging, with small blast radius, then move to production
  4. Automate and repeat — one-time experiments are good; continuous experiments are better
  5. QA brings unique value — test design skills and monitoring knowledge are exactly what chaos engineering needs