Chaos Engineering¶

Overview¶

This guide explores chaos engineering principles and practices for Java applications. Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production. We'll cover how to implement chaos experiments specifically for Java applications, microservices, and cloud-native architectures.

Prerequisites¶

Understanding of monitoring and logging
Familiarity with Java deployment strategies
Basic knowledge of containerization and Kubernetes
Experience with Spring Boot or Java EE applications

Learning Objectives¶

Understand chaos engineering principles and benefits
Design and execute controlled chaos experiments for Java applications
Implement chaos engineering tools in Java environments
Monitor and analyze system behavior during chaos experiments
Apply chaos engineering to microservices architectures
Create a chaos engineering culture in development teams

Chaos Engineering Fundamentals¶

What is Chaos Engineering?¶

Chaos engineering is the practice of deliberately introducing controlled failure into a system to test its resilience and identify weaknesses before they cause real outages.

┌─────────────────────────────────────────────────────────────────────┐
│                     Chaos Engineering Process                        │
└─────────────────────────────────────────────────────────────────────┘
       │                │                  │                 │
       ▼                ▼                  ▼                 ▼
┌─────────────┐  ┌─────────────┐  ┌─────────────────┐  ┌─────────────┐
│ Define      │  │ Hypothesize │  │ Run Controlled  │  │ Analyze     │
│ Steady State│  │ about Chaos │  │ Experiments     │  │ Results     │
└─────────────┘  └─────────────┘  └─────────────────┘  └─────────────┘

Core Principles¶

Start in a Known State: Define what normal system behavior looks like before introducing chaos.
Hypothesize about Outcomes: Form a hypothesis about how the system will respond to a specific failure.
Introduce Controlled Experiments: Simulate real-world failures in a controlled environment.
Minimize Blast Radius: Start small and gradually increase the scope of experiments.
Run Experiments in Production: Eventually, test in production environments to uncover real issues.

Benefits for Java Applications¶

Increased Resilience: Identify and fix weaknesses before they cause production incidents
Improved Recovery: Test and optimize recovery mechanisms
Reduced MTTR: Decrease mean time to recovery when failures occur
System Understanding: Develop deeper knowledge of system behavior under stress
Confidence in System: Build confidence in the ability to handle unexpected failures

Designing Chaos Experiments for Java Applications¶

Experiment Structure¶

A well-designed chaos experiment follows this structure:

Define Steady State: Determine what "normal" behavior looks like (response times, throughput, etc.)
Create a Hypothesis: Form a hypothesis about how the system will behave under specific conditions
Introduce Variables: Inject failure or unusual conditions
Observe Results: Monitor system behavior during the experiment
Analyze Outcome: Compare results against the hypothesis and baseline
Remediate Issues: Fix any weaknesses identified
Iterate: Refine and repeat experiments
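
The structure above can be sketched as a small runner. This is a minimal, hypothetical illustration rather than the API of any chaos tool: the `ChaosExperiment` class, its `Outcome` values, and the three callbacks are names assumed here for the example.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical experiment runner; every name in this class is illustrative. */
final class ChaosExperiment {
    enum Outcome { ABORTED_UNSTEADY_BASELINE, HYPOTHESIS_HELD, WEAKNESS_FOUND }

    private final BooleanSupplier steadyState; // true while the system looks "normal"
    private final Runnable inject;             // introduces the failure
    private final Runnable rollback;           // undoes the failure

    ChaosExperiment(BooleanSupplier steadyState, Runnable inject, Runnable rollback) {
        this.steadyState = steadyState;
        this.inject = inject;
        this.rollback = rollback;
    }

    Outcome run() {
        // 1. Check the baseline first: an already unhealthy system teaches nothing
        if (!steadyState.getAsBoolean()) {
            return Outcome.ABORTED_UNSTEADY_BASELINE;
        }
        try {
            // 2. Introduce the variable
            inject.run();
            // 3. Observe, and compare against the hypothesis that steady state still holds
            return steadyState.getAsBoolean()
                    ? Outcome.HYPOTHESIS_HELD
                    : Outcome.WEAKNESS_FOUND;
        } finally {
            // 4. Always roll back, even if injection or observation throws
            rollback.run();
        }
    }
}
```

A `WEAKNESS_FOUND` result feeds the remediation step; the rollback in `finally` keeps the blast radius bounded even when the experiment itself fails.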

Common Chaos Scenarios for Java Applications¶

Resource Exhaustion:
Memory leaks and heap exhaustion
Thread pool saturation
Connection pool depletion
CPU spikes
Disk space filling up
Network Failures:
Service dependency failures
Network latency and packet loss
DNS resolution failures
Load balancer failures
State Mutations:
Database corruption
Inconsistent cache states
Stale configuration
Corrupted messages in queues
Time and Clock Issues:
Clock skew between services
Leap second bugs
Timezone handling issues
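
Clock skew is one of the easier scenarios to simulate in plain Java, provided the code under test reads time from an injected `java.time.Clock` instead of calling `Instant.now()` directly. The sketch below assumes a hypothetical `TokenValidator` class; only `Clock.fixed` and `Clock.offset` are real JDK APIs.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

/** Hypothetical token check that reads the current time from an injected Clock. */
final class TokenValidator {
    private final Clock clock;

    TokenValidator(Clock clock) {
        this.clock = clock;
    }

    /** A token is valid while this validator's clock has not yet reached its expiry. */
    boolean isValid(Instant expiresAt) {
        return clock.instant().isBefore(expiresAt);
    }
}

/** Builds the clocks used in the experiment. */
final class SkewedClocks {
    private SkewedClocks() {
    }

    /** A clock frozen at a known instant, standing in for a well-synchronized service. */
    static Clock reference() {
        return Clock.fixed(Instant.parse("2024-01-01T12:00:00Z"), ZoneOffset.UTC);
    }

    /** The same clock shifted by the given skew, standing in for a drifting service. */
    static Clock skewedBy(Duration skew) {
        return Clock.offset(reference(), skew);
    }
}
```

With a token that expires two minutes after the reference instant, `new TokenValidator(SkewedClocks.reference())` accepts it, while `new TokenValidator(SkewedClocks.skewedBy(Duration.ofMinutes(5)))` rejects it, which is the kind of disagreement between services that a clock-skew experiment is meant to surface.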

Chaos Engineering Tools for Java¶

Chaos Monkey for Spring Boot¶

Chaos Monkey for Spring Boot (CM4SB) is an implementation of the Chaos Monkey pattern for Spring Boot applications.

Installation¶

Add the dependency to your Maven pom.xml:

<dependency>
    <groupId>de.codecentric</groupId>
    <artifactId>chaos-monkey-spring-boot</artifactId>
    <version>2.6.1</version>
</dependency>

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

Configuration¶

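Enable Chaos Monkey in your application.yml, and start the application with the chaos-monkey Spring profile active (for example, --spring.profiles.active=chaos-monkey), because the library is only loaded when that profile is set: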

chaos:
  monkey:
    enabled: true
    watcher:
      component: false
      controller: true
      repository: true
      rest-controller: true
      service: true
    assaults:
      level: 3
      latencyActive: true
      latencyRangeStart: 1000
      latencyRangeEnd: 3000
      exceptionsActive: false
      killApplicationActive: false
      memoryActive: false

management:
  endpoint:
    chaosmonkey:
      enabled: true
  endpoints:
    web:
      exposure:
        include: health,info,chaosmonkey

Using Chaos Monkey at Runtime¶

Activate specific assaults via the REST API:

# Enable latency assault
curl -X POST \
  http://localhost:8080/actuator/chaosmonkey/assaults \
  -H 'Content-Type: application/json' \
  -d '{
    "latencyActive": true,
    "latencyRangeStart": 2000,
    "latencyRangeEnd": 5000,
    "level": 3
}'

# Enable exception assault
curl -X POST \
  http://localhost:8080/actuator/chaosmonkey/assaults \
  -H 'Content-Type: application/json' \
  -d '{
    "exceptionsActive": true,
    "exception": {
        "type": "java.io.IOException",
        "message": "Chaos Monkey - IOException"
    },
    "level": 1
}'

Chaos Toolkit¶

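Chaos Toolkit is an open-source, general-purpose chaos engineering CLI written in Python. Experiments are declared in JSON or YAML files, and the chaostoolkit-spring extension lets it drive Chaos Monkey for Spring Boot through its actuator endpoints.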

Installation¶

pip install chaostoolkit
pip install chaostoolkit-spring

Creating an Experiment¶

Define your experiment in JSON format:

{
    "version": "1.0.0",
    "title": "Spring Boot Service Resilience",
    "description": "Test the resilience of a Spring Boot service when its dependencies fail",
    "tags": ["spring", "java", "resilience"],
    "steady-state-hypothesis": {
        "title": "Services are all available and healthy",
        "probes": [
            {
                "type": "probe",
                "name": "service-health",
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/actuator/health",
                    "timeout": 3
                }
            }
        ]
    },
    "method": [
        {
            "type": "action",
            "name": "terminate-dependency",
            "provider": {
                "type": "process",
                "path": "docker",
                "arguments": ["stop", "dependency-service"]
            },
            "pauses": {
                "after": 5
            }
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "restart-dependency",
            "provider": {
                "type": "process",
                "path": "docker",
                "arguments": ["start", "dependency-service"]
            }
        }
    ]
}

Running an Experiment¶

chaos run experiment.json

Litmus Chaos¶

Litmus is a chaos engineering platform for Kubernetes, suitable for orchestrating chaos experiments on Java applications deployed to Kubernetes.

Installation¶

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.0.yaml

Pod Chaos Experiment¶

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: spring-app-chaos
  namespace: default
spec:
  appinfo:
    appns: 'default'
    applabel: 'app=spring-app'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  monitoring: true
  jobCleanUpPolicy: 'delete'
  annotationCheck: 'false'
  engineState: 'active'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'true'

Java-Specific Chaos Techniques¶

JVM Chaos¶

Memory Exhaustion:

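import java.util.ArrayList;
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// @Scheduled only fires when scheduling is enabled (@EnableScheduling on a configuration class)
@Component
public class MemoryChaos {
    private static final Logger log = LoggerFactory.getLogger(MemoryChaos.class);

    // Strong references keep every array reachable, so the garbage collector cannot reclaim them
    private final List<byte[]> leakyList = new ArrayList<>();

    @Scheduled(fixedRate = 1000)
    public void consumeMemory() {
        if (ChaosToggle.isActive("memory-leak")) {
            // Each array is 1 MB
            leakyList.add(new byte[1024 * 1024]);
            log.info("Current memory waste: {}MB", leakyList.size());
        }
    }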
}
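
The examples in this section call `ChaosToggle.isActive(...)`, which this guide does not define. A minimal in-memory sketch could look like the following. The class name matches the calls in the examples, but the `enable`, `disable`, and `disableAll` methods are assumptions made here; a real setup might back the toggles with configuration or a feature-flag service instead.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical in-memory toggle registry for chaos experiments.
 * It is an assumption of this guide, not part of any library.
 */
final class ChaosToggle {
    // Thread-safe set holding the names of the currently enabled experiments
    private static final Set<String> ACTIVE = ConcurrentHashMap.newKeySet();

    private ChaosToggle() {
    }

    /** Turns the named experiment on. */
    static void enable(String name) {
        ACTIVE.add(name);
    }

    /** Turns the named experiment off. */
    static void disable(String name) {
        ACTIVE.remove(name);
    }

    /** Turns every experiment off; useful as an emergency stop. */
    static void disableAll() {
        ACTIVE.clear();
    }

    /** Returns true only if the named experiment was explicitly enabled. */
    static boolean isActive(String name) {
        return ACTIVE.contains(name);
    }
}
```

Because nothing is active until `enable` is called, chaos code guarded by `isActive` stays dormant by default, and `disableAll` gives operators a single call to stop every experiment.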

Thread Pool Saturation:

@Component
public class ThreadChaos {
    @Autowired
    private ThreadPoolTaskExecutor executor;

    public void saturateThreadPool() {
        if (ChaosToggle.isActive("thread-saturation")) {
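            // ThreadPoolTaskExecutor's maxPoolSize defaults to Integer.MAX_VALUE, so configure an
            // explicit maximum on the executor before using this value as a loop bound
            int threadCount = executor.getMaxPoolSize();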
            for (int i = 0; i < threadCount; i++) {
                executor.execute(() -> {
                    try {
                        // Block a thread for a long time
                        Thread.sleep(300000);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        }
    }
}
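
What callers see once a pool is saturated depends on its queue capacity and rejection policy. The sketch below uses the JDK's `ThreadPoolExecutor` directly (not Spring's `ThreadPoolTaskExecutor`) with a bounded queue and the default abort policy, and counts how many submissions are rejected. The `PoolSaturationSketch` class and its method are hypothetical names for this example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo: saturate a bounded pool and count rejected submissions. */
final class PoolSaturationSketch {
    private PoolSaturationSketch() {
    }

    static int countRejections(int poolSize, int queueCapacity, int submissions) {
        // Fixed-size pool with a bounded queue and the default AbortPolicy
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity));
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < submissions; i++) {
                try {
                    // Every task blocks until the latch opens, pinning its worker thread
                    pool.execute(() -> awaitQuietly(release));
                } catch (RejectedExecutionException e) {
                    // All workers are busy and the queue is full
                    rejected++;
                }
            }
        } finally {
            // End the experiment: unblock the workers and let the pool drain
            release.countDown();
            pool.shutdown();
        }
        return rejected;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

For example, `countRejections(2, 3, 10)` occupies two workers, queues three tasks, and rejects the remaining five. With an unbounded queue nothing would be rejected and the backlog would grow instead, so it is worth knowing which of the two behaviors your executor is configured for before running the experiment.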

CPU Spike Generation:

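@Component
public class CpuChaos {
    private volatile boolean active = false;

    @PostConstruct
    public void init() {
        Thread worker = new Thread(this::consumeCpu, "cpu-chaos");
        // Daemon thread, so it cannot keep the JVM alive at shutdown
        worker.setDaemon(true);
        worker.start();
    }

    public void enableChaos() {
        active = true;
    }

    public void disableChaos() {
        active = false;
    }

    private void consumeCpu() {
        // Exit when interrupted instead of spinning on a set interrupt flag
        while (!Thread.currentThread().isInterrupted()) {
            if (active) {
                // Busy loop that keeps one core occupied
                for (int i = 0; i < 1000000; i++) {
                    Math.sin(Math.random());
                }
            } else {
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    // Restore the flag; the loop condition then ends the thread
                    Thread.currentThread().interrupt();
                }
            }
        }
    }
}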

Database Chaos¶

Connection Pool Exhaustion:

@Component
public class DbConnectionChaos {
    @Autowired
    private DataSource dataSource;

    private final List<Connection> heldConnections = new ArrayList<>();

    public void exhaustConnectionPool() throws SQLException {
        if (ChaosToggle.isActive("db-connection-leak")) {
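            // Set this to your pool's maximum size; 20 is only an example value
            int maxConnections = 20;

            // Hold connections without releasing them. Once the pool is empty, getConnection()
            // typically blocks for the pool's configured timeout before throwing SQLException.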
            for (int i = 0; i < maxConnections; i++) {
                try {
                    Connection conn = dataSource.getConnection();
                    heldConnections.add(conn);
                    log.info("Holding connection {}", i);
                } catch (SQLException e) {
                    log.info("Connection pool exhausted after {} connections", i);
                    break;
                }
            }
        }
    }

    public void releaseConnections() {
        for (Connection conn : heldConnections) {
            try {
                conn.close();
            } catch (SQLException e) {
                log.error("Error closing connection", e);
            }
        }
        heldConnections.clear();
    }
}

Slow Query Simulation:

-- PostgreSQL slow query
CREATE OR REPLACE FUNCTION chaos_slow_query()
RETURNS void AS $$
BEGIN
  IF (SELECT value FROM chaos_toggles WHERE key = 'slow-query') THEN
    PERFORM pg_sleep(3);
  END IF;
END;
$$ LANGUAGE plpgsql;

-- Then use it in your queries
SELECT chaos_slow_query(), * FROM users WHERE id = ?;

External Dependency Chaos¶

Failing External Service Calls:

@Component
public class RestTemplateChaosInterceptor implements ClientHttpRequestInterceptor {
    @Override
    public ClientHttpResponse intercept(HttpRequest request, byte[] body, 
                                       ClientHttpRequestExecution execution) throws IOException {
        if (ChaosToggle.isActive("external-service-failure")) {
            // getHost() can return null, so compare from the constant side
            if ("external-api.example.com".equals(request.getURI().getHost())) {
                // Simulate a service failure.
                // MockClientHttpResponse ships in spring-test, which must be on the runtime classpath.
                return new MockClientHttpResponse(
                    "Service unavailable".getBytes(),
                    HttpStatus.SERVICE_UNAVAILABLE
                );
            }
        }

        if (ChaosToggle.isActive("external-service-latency")) {
            try {
                // Introduce artificial latency
                Thread.sleep(3000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        return execution.execute(request, body);
    }
}

Feign Client Chaos:

@Configuration
public class FeignChaosFallbackFactory {
    @Bean
    public FallbackFactory<UserServiceClient> userServiceFallbackFactory() {
        return throwable -> new UserServiceClient() {
            @Override
            public User getUserById(Long id) {
                if (ChaosToggle.isActive("user-service-failure")) {
                    throw new RuntimeException("Chaos-induced failure");
                }
                // Normal fallback behavior
                return new User(id, "Fallback User", "fallback@example.com");
            }
        };
    }
}

Monitoring During Chaos Experiments¶

Key Metrics to Monitor¶

During chaos experiments, monitor these key Java-specific metrics:

JVM Metrics:
Heap memory usage (used, committed, max)
Garbage collection frequency and duration
Thread count and states
Class loading
Application Metrics:
Response times (average, p95, p99)
Error rates
Request throughput
Request success rate
Resource Metrics:
CPU usage
Memory usage
Disk I/O
Network I/O
Business Metrics:
Transaction success rate
Order completion rate
User session duration
Conversion rates
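
To compare an experiment run against the steady-state baseline, it helps to be precise about what p95, p99, and error rate mean. The sketch below computes them from raw samples using the nearest-rank definition of a percentile. The `LatencyStats` class is hypothetical, and monitoring systems may use a different definition (Prometheus, for instance, estimates quantiles from histogram buckets), so treat the numbers as comparable only within one method.

```java
import java.util.Arrays;

/** Hypothetical helper for summarizing response-time samples collected during a run. */
final class LatencyStats {
    private LatencyStats() {
    }

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need at least one sample and 0 < p <= 100");
        }
        long[] sorted = samplesMs.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    /** Failed requests as a percentage of all requests. */
    static double errorRatePercent(long failed, long total) {
        if (total <= 0 || failed < 0 || failed > total) {
            throw new IllegalArgumentException("need total > 0 and 0 <= failed <= total");
        }
        return 100.0 * failed / total;
    }
}
```

For the ten samples 10, 20, 30, 40, 50, 60, 70, 80, 90, and 1000 ms, the p50 is 50 ms while the p95 is 1000 ms: a single slow outlier is invisible in the median but dominates the tail, which is why the tail percentiles are the ones to watch during latency assaults.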

Setting Up Prometheus and Grafana¶

Add Micrometer to your Spring Boot application:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

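Configure metrics in application.yml (these property names apply to Spring Boot 2.x; Spring Boot 3 moved the export flag to management.prometheus.metrics.export.enabled):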

management:
  endpoints:
    web:
      exposure:
        include: health,prometheus,metrics
  metrics:
    export:
      prometheus:
        enabled: true
    distribution:
      percentiles-histogram:
        http.server.requests: true
      percentiles:
        http.server.requests: 0.5, 0.95, 0.99

Creating a Chaos Dashboard¶

Sample Grafana dashboard configuration for monitoring during chaos experiments:

{
  "annotations": {
    "list": [
      {
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      },
      {
        "datasource": "Prometheus",
        "enable": true,
        "expr": "chaos_experiment_state{state=\"started\"}",
        "iconColor": "rgba(255, 96, 96, 1)",
        "name": "Chaos Experiments Started",
        "titleFormat": "Chaos Started: {{description}}",
        "tagKeys": "experiment"
      },
      {
        "datasource": "Prometheus",
        "enable": true,
        "expr": "chaos_experiment_state{state=\"finished\"}",
        "iconColor": "rgba(0, 200, 0, 1)",
        "name": "Chaos Experiments Finished",
        "titleFormat": "Chaos Ended: {{description}}",
        "tagKeys": "experiment"
      }
    ]
  },
  "panels": [
    {
      "title": "JVM Memory Usage",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(jvm_memory_used_bytes{area=\"heap\"})",
          "legendFormat": "Heap Used"
        },
        {
          "expr": "sum(jvm_memory_committed_bytes{area=\"heap\"})",
          "legendFormat": "Heap Committed"
        }
      ]
    },
    {
      "title": "HTTP Response Time (95th percentile)",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[1m])) by (le))",
          "legendFormat": "P95 Response Time"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_server_requests_seconds_count{status=~\"5..\"}[1m])) / sum(rate(http_server_requests_seconds_count[1m])) * 100",
          "legendFormat": "Error Rate %"
        }
      ]
    },
    {
      "title": "Thread States",
      "type": "graph",
      "targets": [
        {
          "expr": "jvm_threads_states_threads{state=\"runnable\"}",
          "legendFormat": "Runnable"
        },
        {
          "expr": "jvm_threads_states_threads{state=\"blocked\"}",
          "legendFormat": "Blocked"
        },
        {
          "expr": "jvm_threads_states_threads{state=\"waiting\"}",
          "legendFormat": "Waiting"
        }
      ]
    }
  ]
}

Chaos Engineering for Java Microservices¶

Service Mesh Chaos¶

Using Istio to introduce network chaos:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service-chaos
spec:
  hosts:
  - payment-service
  http:
  - fault:
      abort:
        percentage:
          value: 25
        httpStatus: 500
    route:
    - destination:
        host: payment-service

Introducing latency with Istio:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-service-chaos
spec:
  hosts:
  - order-service
  http:
  - fault:
      delay:
        percentage:
          value: 50
        fixedDelay: 3s
    route:
    - destination:
        host: order-service

Testing Resilience Patterns¶

Circuit Breaker Testing:

Monitor circuit breaker state during failure injection:

@RestController
public class CircuitBreakerMetricsController {
    @Autowired
    private CircuitBreakerRegistry circuitBreakerRegistry;

    @GetMapping("/actuator/circuitbreakers")
    public Map<String, Map<String, Object>> getCircuitBreakerStatus() {
        Map<String, Map<String, Object>> result = new HashMap<>();

        circuitBreakerRegistry.getAllCircuitBreakers().forEach(cb -> {
            Map<String, Object> details = new HashMap<>();
            CircuitBreaker.State state = cb.getState();
            CircuitBreaker.Metrics metrics = cb.getMetrics();

            details.put("state", state.toString());
            details.put("failureRate", metrics.getFailureRate());
            details.put("slowCallRate", metrics.getSlowCallRate());
            details.put("numberOfFailedCalls", metrics.getNumberOfFailedCalls());
            details.put("numberOfSlowCalls", metrics.getNumberOfSlowCalls());
            details.put("numberOfSuccessfulCalls", metrics.getNumberOfSuccessfulCalls());

            result.put(cb.getName(), details);
        });

        return result;
    }
}

Bulkhead Pattern Testing:

Create a chaos experiment to verify that failures in one service don't affect others:

@Service
public class BulkheadChaosService {
    @Autowired
    private ThreadPoolBulkhead orderThreadPoolBulkhead;

    public void saturateBulkhead() {
        if (ChaosToggle.isActive("bulkhead-saturation")) {
            // Submit many tasks to saturate the bulkhead
            for (int i = 0; i < 1000; i++) {
                try {
                    orderThreadPoolBulkhead.submit(() -> {
                        try {
                            // Task that takes a long time
                            Thread.sleep(60000);
                            return "Task completed";
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                            return "Task interrupted";
                        }
                    });
                } catch (Exception e) {
                    log.info("Bulkhead rejected task: {}", e.getMessage());
                }
            }
        }
    }
}

Retry Pattern Testing:

Verify retry behavior with intermittent failures:

@Configuration
public class RetryableChaosConfig {
    @Bean
    public RetryRegistry retryRegistry() {
        return RetryRegistry.of(RetryConfig.custom()
            .maxAttempts(3)
            .waitDuration(Duration.ofMillis(500))
            .build());
    }

    @Bean
    public Retry chaosRetry(RetryRegistry retryRegistry) {
        return retryRegistry.retry("chaosRetry", RetryConfig.custom()
            .maxAttempts(5)
            .waitDuration(Duration.ofMillis(200))
            .build());
    }
}

@Service
public class RetryableChaosService {
    @Autowired
    private Retry chaosRetry;

    public String performWithRetry() {
        return Retry.decorateSupplier(chaosRetry, () -> {
            if (ChaosToggle.isActive("intermittent-failure") && 
                Math.random() < 0.7) { // 70% failure rate
                throw new RuntimeException("Chaos-induced failure");
            }
            return "Operation successful";
        }).get();
    }
}
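
Before running the retry experiment, it is worth working out what result to expect so the hypothesis is falsifiable. Assuming each attempt fails independently, the chance that a call exhausts all of its attempts is the failure rate raised to the number of attempts (in Resilience4j, maxAttempts counts the initial call). The hypothetical `RetryOdds` helper below does that arithmetic.

```java
/** Hypothetical helper for estimating retry outcomes under independent failures. */
final class RetryOdds {
    private RetryOdds() {
    }

    /** Probability that every one of maxAttempts independent attempts fails. */
    static double exhaustionProbability(double failureRate, int maxAttempts) {
        if (failureRate < 0 || failureRate > 1 || maxAttempts < 1) {
            throw new IllegalArgumentException("need 0 <= failureRate <= 1 and maxAttempts >= 1");
        }
        return Math.pow(failureRate, maxAttempts);
    }

    /** Total fixed wait before giving up: one wait between each pair of attempts. */
    static long worstCaseWaitMillis(int maxAttempts, long waitMillis) {
        if (maxAttempts < 1 || waitMillis < 0) {
            throw new IllegalArgumentException("need maxAttempts >= 1 and waitMillis >= 0");
        }
        return (maxAttempts - 1) * waitMillis;
    }
}
```

With the 70% failure rate and five attempts used above, `exhaustionProbability(0.7, 5)` is about 0.168, so roughly one call in six should still fail after retries, and `worstCaseWaitMillis(5, 200)` shows that such a call spends 800 ms waiting between attempts. An observed failure rate far from that figure suggests the retry configuration is not being applied as intended.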

Building a Chaos Engineering Culture¶

Getting Started with Chaos Engineering¶

Start Small:
Begin with non-production environments
Focus on individual components
Run experiments during off-peak hours
Limit the "blast radius" of experiments
Define Clear Objectives:
Identify specific resilience goals
Align chaos experiments with business priorities
Focus on critical paths in your application
Build a Gameday Culture:
Schedule regular chaos engineering sessions
Involve cross-functional teams
Document findings and action items
Celebrate learning and improvements

Chaos Engineering Maturity Model¶

┌─────────────────────────────────────────────────────────────────────┐
│                 Chaos Engineering Maturity Levels                    │
└─────────────────────────────────────────────────────────────────────┘
    │                 │                   │                  │
    ▼                 ▼                   ▼                  ▼
┌──────────┐    ┌──────────┐       ┌──────────┐       ┌──────────┐
│ Level 1  │    │ Level 2  │       │ Level 3  │       │ Level 4  │
│          │    │          │       │          │       │          │
│ Manual   │    │ Automated│       │ CI/CD    │       │ Production│
│ Chaos    │    │ Chaos    │       │ Integrated│      │ Chaos    │
└──────────┘    └──────────┘       └──────────┘       └──────────┘

Level 1: Manual Chaos
Ad-hoc experiments
Manual failure injection
Limited scope
Level 2: Automated Chaos
Reproducible experiments
Scheduled chaos experiments
Broader test coverage
Level 3: CI/CD Integration
Chaos tests in CI/CD pipeline
Automated verification of results
Chaos as a quality gate
Level 4: Production Chaos
Regular production experiments
Continuous verification
Comprehensive resilience testing

Establishing a Chaos Engineering Program¶

Create a Chaos Engineering Team:
Identify champions across teams
Allocate dedicated time for chaos engineering
Provide training and resources
Define Chaos Principles:
Document your approach to chaos engineering
Set boundaries and safety measures
Establish communication protocols
Build a Chaos Engineering Backlog:
Prioritize experiments based on risk
Target known weak points
Track progress and results
Measure and Communicate Success:
Track improvements in system resilience
Share learnings across teams
Demonstrate business value

Conclusion¶

Chaos engineering is a powerful approach to building more resilient Java applications by proactively identifying weaknesses through controlled experiments. By deliberately introducing failures in a controlled manner, you can verify that your system can withstand turbulent conditions and recover gracefully.

For Java applications, especially those using microservices architectures, chaos engineering helps verify that resilience patterns like circuit breakers, bulkheads, and retries work as expected under real-world failure conditions. By establishing a chaos engineering practice in your organization, you can build more reliable systems that better serve your users even when things go wrong.
