Chaos Engineering

Overview

This guide explores chaos engineering principles and practices for Java applications. Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production. We'll cover how to implement chaos experiments specifically for Java applications, microservices, and cloud-native architectures.

Prerequisites

  • Working knowledge of Java and Spring Boot
  • Familiarity with Maven, Docker, and Kubernetes basics
  • Access to a non-production environment where failures can be injected safely

Learning Objectives

  • Understand chaos engineering principles and benefits
  • Design and execute controlled chaos experiments for Java applications
  • Implement chaos engineering tools in Java environments
  • Monitor and analyze system behavior during chaos experiments
  • Apply chaos engineering to microservices architectures
  • Create a chaos engineering culture in development teams

Chaos Engineering Fundamentals

What is Chaos Engineering?

Chaos engineering is the practice of deliberately introducing controlled failure into a system to test its resilience and identify weaknesses before they cause real outages.

┌─────────────────────────────────────────────────────────────────────┐
│                     Chaos Engineering Process                        │
└─────────────────────────────────────────────────────────────────────┘
       │                │                  │                 │
       ▼                ▼                  ▼                 ▼
┌─────────────┐  ┌─────────────┐  ┌─────────────────┐  ┌─────────────┐
│ Define      │  │ Hypothesize │  │ Run Controlled  │  │ Analyze     │
│ Steady State│  │ about Chaos │  │ Experiments     │  │ Results     │
└─────────────┘  └─────────────┘  └─────────────────┘  └─────────────┘

Core Principles

  1. Start in a Known State: Define what normal system behavior looks like before introducing chaos.
  2. Hypothesize about Outcomes: Form a hypothesis about how the system will respond to a specific failure.
  3. Introduce Controlled Experiments: Simulate real-world failures in a controlled environment.
  4. Minimize Blast Radius: Start small and gradually increase the scope of experiments.
  5. Run Experiments in Production: Eventually, test in production environments to uncover real issues.

Benefits for Java Applications

  • Increased Resilience: Identify and fix weaknesses before they cause production incidents
  • Improved Recovery: Test and optimize recovery mechanisms
  • Reduced MTTR: Decrease mean time to recovery when failures occur
  • System Understanding: Develop deeper knowledge of system behavior under stress
  • Confidence in System: Build confidence in the ability to handle unexpected failures

Designing Chaos Experiments for Java Applications

Experiment Structure

A well-designed chaos experiment follows this structure:

  1. Define Steady State: Determine what "normal" behavior looks like (response times, throughput, etc.)
  2. Create a Hypothesis: Form a hypothesis about how the system will behave under specific conditions
  3. Introduce Variables: Inject failure or unusual conditions
  4. Observe Results: Monitor system behavior during the experiment
  5. Analyze Outcome: Compare results against the hypothesis and baseline
  6. Remediate Issues: Fix any weaknesses identified
  7. Iterate: Refine and repeat experiments
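
The same structure can be encoded directly in test code. The class below is a minimal, hypothetical sketch of that loop (none of these names come from a specific framework; real experiments are more often driven by tools such as Chaos Toolkit or Chaos Monkey, covered later):

import java.time.Duration;
import java.util.function.BooleanSupplier;

public final class ChaosExperiment {

    private final BooleanSupplier steadyStateProbe; // e.g. checks /actuator/health and p95 latency
    private final Runnable injectFailure;           // e.g. stop a dependency container
    private final Runnable rollback;                // always runs, even if the hypothesis fails

    public ChaosExperiment(BooleanSupplier steadyStateProbe, Runnable injectFailure, Runnable rollback) {
        this.steadyStateProbe = steadyStateProbe;
        this.injectFailure = injectFailure;
        this.rollback = rollback;
    }

    /** Returns true if the steady-state hypothesis held while the failure was active. */
    public boolean run(Duration observationWindow) throws InterruptedException {
        if (!steadyStateProbe.getAsBoolean()) {
            throw new IllegalStateException("System is not in steady state; aborting experiment");
        }
        try {
            injectFailure.run();                        // introduce variables
            Thread.sleep(observationWindow.toMillis()); // observe results
            return steadyStateProbe.getAsBoolean();     // analyze outcome against the hypothesis
        } finally {
            rollback.run();                             // remediate / restore
        }
    }
}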

Common Chaos Scenarios for Java Applications

  1. Resource Exhaustion:
     • Memory leaks and heap exhaustion
     • Thread pool saturation
     • Connection pool depletion
     • CPU spikes
     • Disk space filling up

  2. Network Failures:
     • Service dependency failures
     • Network latency and packet loss
     • DNS resolution failures
     • Load balancer failures

  3. State Mutations:
     • Database corruption
     • Inconsistent cache states
     • Stale configuration
     • Corrupted messages in queues

  4. Time and Clock Issues:
     • Clock skew between services
     • Leap second bugs
     • Timezone handling issues
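
Most of the tooling below covers the first two categories well; time and clock issues usually need application-level hooks. One low-risk Java approach is to read time through an injected java.time.Clock and skew it under a toggle. The class below is a hypothetical sketch, not part of any chaos library:

import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;

// Register a single SkewableClock and inject it wherever the code needs the
// current time; skewing it then affects the whole application at once.
public class SkewableClock extends Clock {

    private final Clock delegate = Clock.systemUTC();
    private volatile Duration skew = Duration.ZERO;

    /** Enable chaos, e.g. setSkew(Duration.ofMinutes(5)); reset with Duration.ZERO. */
    public void setSkew(Duration skew) {
        this.skew = skew;
    }

    @Override
    public Instant instant() {
        return delegate.instant().plus(skew);
    }

    @Override
    public ZoneId getZone() {
        return delegate.getZone();
    }

    @Override
    public Clock withZone(ZoneId zone) {
        return Clock.offset(delegate.withZone(zone), skew); // snapshot of the current skew
    }
}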

Chaos Engineering Tools for Java

Chaos Monkey for Spring Boot

Chaos Monkey for Spring Boot (CM4SB) is an implementation of the Chaos Monkey pattern for Spring Boot applications.

Installation

Add the dependency to your Maven pom.xml:

<dependency>
    <groupId>de.codecentric</groupId>
    <artifactId>chaos-monkey-spring-boot</artifactId>
    <version>2.6.1</version>
</dependency>

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

Configuration

Enable Chaos Monkey in your application.yml:

chaos:
  monkey:
    enabled: true
    watcher:
      component: false
      controller: true
      repository: true
      rest-controller: true
      service: true
    assaults:
      level: 3
      latencyActive: true
      latencyRangeStart: 1000
      latencyRangeEnd: 3000
      exceptionsActive: false
      killApplicationActive: false
      memoryActive: false

management:
  endpoint:
    chaosmonkey:
      enabled: true
  endpoints:
    web:
      exposure:
        include: health,info,chaosmonkey

Using Chaos Monkey at Runtime

Activate specific assaults via the REST API:

# Enable latency assault
curl -X POST \
  http://localhost:8080/actuator/chaosmonkey/assaults \
  -H 'Content-Type: application/json' \
  -d '{
    "latencyActive": true,
    "latencyRangeStart": 2000,
    "latencyRangeEnd": 5000,
    "level": 3
}'

# Enable exception assault
curl -X POST \
  http://localhost:8080/actuator/chaosmonkey/assaults \
  -H 'Content-Type: application/json' \
  -d '{
    "exceptionsActive": true,
    "exception": {
        "type": "java.io.IOException",
        "message": "Chaos Monkey - RuntimeException"
    },
    "level": 1
}'

Chaos Toolkit

Chaos Toolkit is a general-purpose chaos engineering tool that can be used with Java applications.

Installation

pip install chaostoolkit
pip install chaostoolkit-spring

Creating an Experiment

Define your experiment in JSON format:

{
    "version": "1.0.0",
    "title": "Spring Boot Service Resilience",
    "description": "Test the resilience of a Spring Boot service when its dependencies fail",
    "tags": ["spring", "java", "resilience"],
    "steady-state-hypothesis": {
        "title": "Services are all available and healthy",
        "probes": [
            {
                "type": "probe",
                "name": "service-health",
                "tolerance": true,
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/actuator/health",
                    "timeout": 3
                }
            }
        ]
    },
    "method": [
        {
            "type": "action",
            "name": "terminate-dependency",
            "provider": {
                "type": "process",
                "path": "docker",
                "arguments": ["stop", "dependency-service"]
            },
            "pauses": {
                "after": 5
            }
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "restart-dependency",
            "provider": {
                "type": "process",
                "path": "docker",
                "arguments": ["start", "dependency-service"]
            }
        }
    ]
}

Running an Experiment

chaos run experiment.json

Litmus Chaos

Litmus is a chaos engineering platform for Kubernetes, suitable for orchestrating chaos experiments on Java applications deployed to Kubernetes.

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.0.yaml

Pod Chaos Experiment

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: spring-app-chaos
  namespace: default
spec:
  appinfo:
    appns: 'default'
    applabel: 'app=spring-app'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  monitoring: true
  jobCleanUpPolicy: 'delete'
  annotationCheck: 'false'
  engineState: 'active'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'true'

Java-Specific Chaos Techniques
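
The snippets in this section gate failure injection behind a ChaosToggle helper. That class is not provided by any framework; a minimal in-memory sketch is shown below (a production version would more likely delegate to a feature-flag or configuration service so toggles can be flipped without redeploying):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class ChaosToggle {

    private static final Map<String, Boolean> TOGGLES = new ConcurrentHashMap<>();

    private ChaosToggle() {
    }

    /** Returns true if the named chaos behaviour should currently be injected. */
    public static boolean isActive(String name) {
        return TOGGLES.getOrDefault(name, Boolean.FALSE);
    }

    public static void enable(String name) {
        TOGGLES.put(name, Boolean.TRUE);
    }

    public static void disable(String name) {
        TOGGLES.put(name, Boolean.FALSE);
    }
}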

JVM Chaos

  1. Memory Exhaustion:

    @Slf4j // Lombok logger: provides the `log` field used below
    @Component
    public class MemoryChaos {
        private final List<byte[]> leakyList = new ArrayList<>();
    
        @Scheduled(fixedRate = 1000)
        public void consumeMemory() {
            if (ChaosToggle.isActive("memory-leak")) {
                // Each array is 1MB
                leakyList.add(new byte[1024 * 1024]);
                log.info("Current memory waste: {}MB", leakyList.size());
            }
        }
    }
    

  2. Thread Pool Saturation:

    @Component
    public class ThreadChaos {
        @Autowired
        private ThreadPoolTaskExecutor executor;
    
        public void saturateThreadPool() {
            if (ChaosToggle.isActive("thread-saturation")) {
                int threadCount = executor.getMaxPoolSize();
                for (int i = 0; i < threadCount; i++) {
                    executor.execute(() -> {
                        try {
                            // Block a thread for a long time
                            Thread.sleep(300000);
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                }
            }
        }
    }
    

  3. CPU Spike Generation:

    @Component
    public class CpuChaos {
        private volatile boolean active = false;
    
        @PostConstruct
        public void init() {
            new Thread(this::consumeCpu).start();
        }
    
        public void enableChaos() {
            active = true;
        }
    
        public void disableChaos() {
            active = false;
        }
    
        private void consumeCpu() {
            while (true) {
                if (active) {
                    // Busy loop to consume CPU
                    for (int i = 0; i < 1000000; i++) {
                        Math.sin(Math.random());
                    }
                } else {
                    try {
                        Thread.sleep(100);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }
        }
    }
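
The scenarios list earlier also mentions disk space filling up, which none of the assaults above simulate. A hypothetical sketch that fills a dedicated temporary directory with throwaway files, so cleanup is a simple delete (directory name and sizes are illustrative):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    import org.springframework.stereotype.Component;

    @Component
    public class DiskChaos {

        // A dedicated directory keeps the chaos contained and easy to clean up
        private final Path chaosDir = Path.of(System.getProperty("java.io.tmpdir"), "chaos-disk");

        public void fillDisk(int files, int megabytesPerFile) throws IOException {
            if (!ChaosToggle.isActive("disk-fill")) {
                return;
            }
            Files.createDirectories(chaosDir);
            byte[] block = new byte[megabytesPerFile * 1024 * 1024];
            for (int i = 0; i < files; i++) {
                Files.write(chaosDir.resolve("junk-" + i + ".bin"), block);
            }
        }

        public void releaseDiskSpace() throws IOException {
            if (!Files.exists(chaosDir)) {
                return;
            }
            try (Stream<Path> junk = Files.list(chaosDir)) {
                junk.forEach(p -> p.toFile().delete()); // best-effort cleanup
            }
        }
    }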
    

Database Chaos

  1. Connection Pool Exhaustion:

    @Slf4j // Lombok logger: provides the `log` field used below
    @Component
    public class DbConnectionChaos {
        @Autowired
        private DataSource dataSource;
    
        private final List<Connection> heldConnections = new ArrayList<>();
    
        public void exhaustConnectionPool() throws SQLException {
            if (ChaosToggle.isActive("db-connection-leak")) {
                // Get the connection pool size
                int maxConnections = 20; // Example value, actual value depends on your configuration
    
                // Hold connections without releasing
                for (int i = 0; i < maxConnections; i++) {
                    try {
                        Connection conn = dataSource.getConnection();
                        heldConnections.add(conn);
                        log.info("Holding connection {}", i);
                    } catch (SQLException e) {
                        log.info("Connection pool exhausted after {} connections", i);
                        break;
                    }
                }
            }
        }
    
        public void releaseConnections() {
            for (Connection conn : heldConnections) {
                try {
                    conn.close();
                } catch (SQLException e) {
                    log.error("Error closing connection", e);
                }
            }
            heldConnections.clear();
        }
    }
    

  2. Slow Query Simulation:

    -- PostgreSQL slow query
    CREATE OR REPLACE FUNCTION chaos_slow_query()
    RETURNS void AS $$
    BEGIN
      IF (SELECT value FROM chaos_toggles WHERE key = 'slow-query') THEN
        PERFORM pg_sleep(3);
      END IF;
    END;
    $$ LANGUAGE plpgsql;
    
    -- Then use it in your queries
    SELECT chaos_slow_query(), * FROM users WHERE id = ?;
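
An application-side alternative that avoids editing SQL is to wrap the pooled DataSource and delay connection hand-out while a toggle is active. The wrapper below is a hypothetical sketch built on Spring's DelegatingDataSource; everything else is delegated to the real pool:

    import java.sql.Connection;
    import java.sql.SQLException;

    import javax.sql.DataSource;

    import org.springframework.jdbc.datasource.DelegatingDataSource;

    public class SlowDataSource extends DelegatingDataSource {

        public SlowDataSource(DataSource target) {
            super(target);
        }

        @Override
        public Connection getConnection() throws SQLException {
            maybeDelay();
            return super.getConnection();
        }

        @Override
        public Connection getConnection(String username, String password) throws SQLException {
            maybeDelay();
            return super.getConnection(username, password);
        }

        private void maybeDelay() {
            if (ChaosToggle.isActive("slow-db")) {
                try {
                    Thread.sleep(2000); // illustrative delay before every connection hand-out
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }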
    

External Dependency Chaos

  1. Failing External Service Calls:

    @Component
    public class RestTemplateChaosInterceptor implements ClientHttpRequestInterceptor {
        @Override
        public ClientHttpResponse intercept(HttpRequest request, byte[] body, 
                                           ClientHttpRequestExecution execution) throws IOException {
            if (ChaosToggle.isActive("external-service-failure")) {
                String host = request.getURI().getHost();
                if (host.equals("external-api.example.com")) {
                    // Simulate a service failure
                    return new MockClientHttpResponse(
                        "Service unavailable".getBytes(),
                        HttpStatus.SERVICE_UNAVAILABLE
                    );
                }
            }
    
            if (ChaosToggle.isActive("external-service-latency")) {
                try {
                    // Introduce artificial latency
                    Thread.sleep(3000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
    
            return execution.execute(request, body);
        }
    }
    

  2. Feign Client Chaos:

    @Configuration
    public class FeignChaosFallbackFactory {
        @Bean
        public FallbackFactory<UserServiceClient> userServiceFallbackFactory() {
            return throwable -> new UserServiceClient() {
                @Override
                public User getUserById(Long id) {
                    if (ChaosToggle.isActive("user-service-failure")) {
                        throw new RuntimeException("Chaos-induced failure");
                    }
                    // Normal fallback behavior
                    return new User(id, "Fallback User", "fallback@example.com");
                }
            };
        }
    }
    

Monitoring During Chaos Experiments

Key Metrics to Monitor

During chaos experiments, monitor these key Java-specific metrics:

  1. JVM Metrics:
     • Heap memory usage (used, committed, max)
     • Garbage collection frequency and duration
     • Thread count and states
     • Class loading

  2. Application Metrics:
     • Response times (average, p95, p99)
     • Error rates
     • Request throughput
     • Request success rate

  3. Resource Metrics:
     • CPU usage
     • Memory usage
     • Disk I/O
     • Network I/O

  4. Business Metrics:
     • Transaction success rate
     • Order completion rate
     • User session duration
     • Conversion rates

Setting Up Prometheus and Grafana

Add Micrometer to your Spring Boot application:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Configure metrics in application.yml:

management:
  endpoints:
    web:
      exposure:
        include: health,prometheus,metrics
  metrics:
    export:
      prometheus:
        enabled: true
    distribution:
      percentiles-histogram:
        http.server.requests: true
      percentiles:
        http.server.requests: 0.5, 0.95, 0.99

Creating a Chaos Dashboard

Sample Grafana dashboard configuration for monitoring during chaos experiments:

{
  "annotations": {
    "list": [
      {
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      },
      {
        "datasource": "Prometheus",
        "enable": true,
        "expr": "chaos_experiment_state{state=\"started\"}",
        "iconColor": "rgba(255, 96, 96, 1)",
        "name": "Chaos Experiments Started",
        "titleFormat": "Chaos Started: {{description}}",
        "tagKeys": "experiment"
      },
      {
        "datasource": "Prometheus",
        "enable": true,
        "expr": "chaos_experiment_state{state=\"finished\"}",
        "iconColor": "rgba(0, 200, 0, 1)",
        "name": "Chaos Experiments Finished",
        "titleFormat": "Chaos Ended: {{description}}",
        "tagKeys": "experiment"
      }
    ]
  },
  "panels": [
    {
      "title": "JVM Memory Usage",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(jvm_memory_used_bytes{area=\"heap\"})",
          "legendFormat": "Heap Used"
        },
        {
          "expr": "sum(jvm_memory_committed_bytes{area=\"heap\"})",
          "legendFormat": "Heap Committed"
        }
      ]
    },
    {
      "title": "HTTP Response Time (95th percentile)",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[1m])) by (le))",
          "legendFormat": "P95 Response Time"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_server_requests_seconds_count{status=~\"5..\"}[1m])) / sum(rate(http_server_requests_seconds_count[1m])) * 100",
          "legendFormat": "Error Rate %"
        }
      ]
    },
    {
      "title": "Thread States",
      "type": "graph",
      "targets": [
        {
          "expr": "jvm_threads_states_threads{state=\"runnable\"}",
          "legendFormat": "Runnable"
        },
        {
          "expr": "jvm_threads_states_threads{state=\"blocked\"}",
          "legendFormat": "Blocked"
        },
        {
          "expr": "jvm_threads_states_threads{state=\"waiting\"}",
          "legendFormat": "Waiting"
        }
      ]
    }
  ]
}
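
The annotation queries above assume a chaos_experiment_state series is being exposed; Chaos Monkey does not publish one out of the box. One hedged way to emit it from the application with Micrometer is sketched below (metric and tag names are chosen to line up with the dashboard):

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

import org.springframework.stereotype.Component;

@Component
public class ChaosExperimentMarker {

    private final MeterRegistry registry;

    public ChaosExperimentMarker(MeterRegistry registry) {
        this.registry = registry;
    }

    public void started(String experiment, String description) {
        counter("started", experiment, description).increment();
    }

    public void finished(String experiment, String description) {
        counter("finished", experiment, description).increment();
    }

    private Counter counter(String state, String experiment, String description) {
        // Note: the Prometheus registry exports counters with a _total suffix
        // (chaos_experiment_state_total), so align the dashboard expressions
        // with whichever name you end up scraping.
        return Counter.builder("chaos_experiment_state")
                .tag("state", state)
                .tag("experiment", experiment)
                .tag("description", description)
                .register(registry);
    }
}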

Chaos Engineering for Java Microservices

Service Mesh Chaos

Using Istio to introduce network chaos:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service-chaos
spec:
  hosts:
  - payment-service
  http:
  - fault:
      abort:
        percentage:
          value: 25
        httpStatus: 500
    route:
    - destination:
        host: payment-service

Introducing latency with Istio:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-service-chaos
spec:
  hosts:
  - order-service
  http:
  - fault:
      delay:
        percentage:
          value: 50
        fixedDelay: 3s
    route:
    - destination:
        host: order-service
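
Injected delay only surfaces weaknesses if callers bound their waits: without a read timeout, the 3s delay above simply makes every request 3s slower. A hedged example of setting explicit timeouts on a Spring RestTemplate (values are illustrative and should stay below the injected delay):

import java.time.Duration;

import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

@Configuration
public class HttpClientTimeoutConfig {

    @Bean
    public RestTemplate orderServiceRestTemplate(RestTemplateBuilder builder) {
        // Failing fast lets retries and circuit breakers react instead of
        // request threads piling up behind the injected latency.
        return builder
                .setConnectTimeout(Duration.ofSeconds(1))
                .setReadTimeout(Duration.ofSeconds(2))
                .build();
    }
}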

Testing Resilience Patterns

  1. Circuit Breaker Testing:

Monitor circuit breaker state during failure injection:

@RestController
public class CircuitBreakerMetricsController {
    @Autowired
    private CircuitBreakerRegistry circuitBreakerRegistry;

    @GetMapping("/actuator/circuitbreakers")
    public Map<String, Map<String, Object>> getCircuitBreakerStatus() {
        Map<String, Map<String, Object>> result = new HashMap<>();

        circuitBreakerRegistry.getAllCircuitBreakers().forEach(cb -> {
            Map<String, Object> details = new HashMap<>();
            CircuitBreaker.State state = cb.getState();
            CircuitBreaker.Metrics metrics = cb.getMetrics();

            details.put("state", state.toString());
            details.put("failureRate", metrics.getFailureRate());
            details.put("slowCallRate", metrics.getSlowCallRate());
            details.put("numberOfFailedCalls", metrics.getNumberOfFailedCalls());
            details.put("numberOfSlowCalls", metrics.getNumberOfSlowCalls());
            details.put("numberOfSuccessfulCalls", metrics.getNumberOfSuccessfulCalls());

            result.put(cb.getName(), details);
        });

        return result;
    }
}
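
For the controller above to report anything useful, the calls under test have to run through a circuit breaker. A minimal sketch of a Resilience4j-guarded client that the RestTemplate chaos interceptor from earlier would trip (names are illustrative and assume the resilience4j-spring-boot2 starter is on the classpath):

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;

import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class PaymentClient {

    private final RestTemplate restTemplate;

    public PaymentClient(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    // "paymentService" is the name that shows up under /actuator/circuitbreakers above
    @CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
    public String charge(String orderId) {
        return restTemplate.postForObject(
                "http://external-api.example.com/payments", orderId, String.class);
    }

    // Invoked when the call fails or the breaker is open
    private String paymentFallback(String orderId, Throwable cause) {
        return "payment-deferred:" + orderId;
    }
}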

  2. Bulkhead Pattern Testing:

Create a chaos experiment to verify that failures in one service don't affect others:

@Slf4j // Lombok logger: provides the `log` field used below
@Service
public class BulkheadChaosService {
    @Autowired
    private ThreadPoolBulkhead orderThreadPoolBulkhead;

    public void saturateBulkhead() {
        if (ChaosToggle.isActive("bulkhead-saturation")) {
            // Submit many tasks to saturate the bulkhead
            for (int i = 0; i < 1000; i++) {
                try {
                    orderThreadPoolBulkhead.submit(() -> {
                        try {
                            // Task that takes a long time
                            Thread.sleep(60000);
                            return "Task completed";
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                            return "Task interrupted";
                        }
                    });
                } catch (Exception e) {
                    log.info("Bulkhead rejected task: {}", e.getMessage());
                }
            }
        }
    }
}

  3. Retry Pattern Testing:

Verify retry behavior with intermittent failures:

@Configuration
public class RetryableChaosConfig {
    @Bean
    public RetryRegistry retryRegistry() {
        return RetryRegistry.of(RetryConfig.custom()
            .maxAttempts(3)
            .waitDuration(Duration.ofMillis(500))
            .build());
    }

    @Bean
    public Retry chaosRetry(RetryRegistry retryRegistry) {
        return retryRegistry.retry("chaosRetry", RetryConfig.custom()
            .maxAttempts(5)
            .waitDuration(Duration.ofMillis(200))
            .build());
    }
}

@Service
public class RetryableChaosService {
    @Autowired
    private Retry chaosRetry;

    public String performWithRetry() {
        return Retry.decorateSupplier(chaosRetry, () -> {
            if (ChaosToggle.isActive("intermittent-failure") && 
                Math.random() < 0.7) { // 70% failure rate
                throw new RuntimeException("Chaos-induced failure");
            }
            return "Operation successful";
        }).get();
    }
}

Building a Chaos Engineering Culture

Getting Started with Chaos Engineering

  1. Start Small:
     • Begin with non-production environments
     • Focus on individual components
     • Run experiments during off-peak hours
     • Limit the "blast radius" of experiments

  2. Define Clear Objectives:
     • Identify specific resilience goals
     • Align chaos experiments with business priorities
     • Focus on critical paths in your application

  3. Build a Gameday Culture:
     • Schedule regular chaos engineering sessions
     • Involve cross-functional teams
     • Document findings and action items
     • Celebrate learning and improvements

Chaos Engineering Maturity Model

┌─────────────────────────────────────────────────────────────────────┐
│                 Chaos Engineering Maturity Levels                    │
└─────────────────────────────────────────────────────────────────────┘
    │                 │                   │                  │
    ▼                 ▼                   ▼                  ▼
┌──────────┐    ┌──────────┐       ┌──────────┐       ┌──────────┐
│ Level 1  │    │ Level 2  │       │ Level 3  │       │ Level 4  │
│          │    │          │       │          │       │          │
│ Manual   │    │ Automated│       │  CI/CD   │       │Production│
│ Chaos    │    │ Chaos    │       │Integrated│       │ Chaos    │
└──────────┘    └──────────┘       └──────────┘       └──────────┘

  1. Level 1: Manual Chaos
     • Ad-hoc experiments
     • Manual failure injection
     • Limited scope

  2. Level 2: Automated Chaos
     • Reproducible experiments
     • Scheduled chaos experiments
     • Broader test coverage

  3. Level 3: CI/CD Integration
     • Chaos tests in CI/CD pipeline
     • Automated verification of results
     • Chaos as a quality gate

  4. Level 4: Production Chaos
     • Regular production experiments
     • Continuous verification
     • Comprehensive resilience testing

Establishing a Chaos Engineering Program

  1. Create a Chaos Engineering Team:
     • Identify champions across teams
     • Allocate dedicated time for chaos engineering
     • Provide training and resources

  2. Define Chaos Principles:
     • Document your approach to chaos engineering
     • Set boundaries and safety measures
     • Establish communication protocols

  3. Build a Chaos Engineering Backlog:
     • Prioritize experiments based on risk
     • Target known weak points
     • Track progress and results

  4. Measure and Communicate Success:
     • Track improvements in system resilience
     • Share learnings across teams
     • Demonstrate business value

Conclusion

Chaos engineering is a powerful approach to building more resilient Java applications by proactively identifying weaknesses through controlled experiments. By deliberately introducing failures in a controlled manner, you can verify that your system can withstand turbulent conditions and recover gracefully.

For Java applications, especially those using microservices architectures, chaos engineering helps verify that resilience patterns like circuit breakers, bulkheads, and retries work as expected under real-world failure conditions. By establishing a chaos engineering practice in your organization, you can build more reliable systems that better serve your users even when things go wrong.
