Monitoring & Logging¶
Overview¶
This guide introduces monitoring and logging concepts, tools, and implementation strategies for Java applications. Effective monitoring and logging are essential for maintaining application health, troubleshooting issues, and ensuring optimal performance in production environments.
Prerequisites¶
- Basic understanding of Java application architecture
- Familiarity with microservices concepts (for distributed tracing sections)
- General knowledge of cloud or on-premises infrastructure
- Understanding of basic DevOps principles
Learning Objectives¶
- Understand core monitoring and logging concepts for Java applications
- Learn how to implement effective logging strategies
- Master metrics collection and visualization techniques
- Implement distributed tracing for microservice architectures
- Configure alerts and notifications for proactive issue detection
- Apply best practices for Java application observability
- Understand how to build effective dashboards for application visibility
Monitoring Fundamentals¶
What is Monitoring?¶
Monitoring is the practice of collecting, analyzing, and using data about your application's performance, health, and usage patterns. Effective monitoring enables you to:
- Detect and diagnose problems before they impact users
- Understand application behavior under various conditions
- Plan capacity and resource allocation
- Validate the success of deployments and changes
- Make data-driven decisions for improvements
The Three Pillars of Observability¶
┌───────────────────────────────────────────────────┐
│ Observability │
└───────────────────────────────────────────────────┘
│ │ │
┌─────────┘ │ └─────────┐
│ │ │
▼ ▼ ▼
┌───────────────────┐ ┌─────────────────┐ ┌───────────────────┐
│ Logs │ │ Metrics │ │ Traces │
│ │ │ │ │ │
│ Detailed records │ │ Numeric samples │ │ Request paths │
│ of events │ │ of data points │ │ across services │
└───────────────────┘ └─────────────────┘ └───────────────────┘
1. Logs¶
Text records of events that happen in your system, providing detailed context about what happened at a specific time.
2. Metrics¶
Numeric measurements collected at regular intervals, representing system behavior or performance over time.
3. Traces¶
Records of requests as they flow through distributed systems, showing the path and timing of service interactions.
Logging for Java Applications¶
Logging Frameworks¶
Java offers several mature logging frameworks:
SLF4J with Logback¶
The most common modern choice:
<!-- Maven dependencies -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.36</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
<version>1.2.11</version>
</dependency>
Usage example:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class UserService {
    private static final Logger logger = LoggerFactory.getLogger(UserService.class);
    private final UserRepository userRepository; // injected via constructor in a real application
public User findUserById(Long id) {
logger.debug("Looking up user with ID: {}", id);
try {
User user = userRepository.findById(id)
.orElseThrow(() -> new UserNotFoundException(id));
logger.info("Found user: {}", user.getUsername());
return user;
} catch (Exception e) {
logger.error("Error finding user with ID: {}", id, e);
throw e;
}
}
}
Log4j2¶
Another popular option with high performance:
<!-- Maven dependencies -->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
<version>2.17.2</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.17.2</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>2.17.2</version>
</dependency>
Logging Configuration¶
Proper logging configuration is essential for both development and production environments.
Logback Configuration Example¶
<!-- logback.xml -->
<configuration>
<appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
<encoder>
<pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
</encoder>
</appender>
<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>logs/application.log</file>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<fileNamePattern>logs/application-%d{yyyy-MM-dd}.log</fileNamePattern>
<maxHistory>30</maxHistory>
<totalSizeCap>3GB</totalSizeCap>
</rollingPolicy>
<encoder>
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
</encoder>
</appender>
<!-- JSON appender for production environments -->
<appender name="JSON" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>logs/application.json</file>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<fileNamePattern>logs/application-%d{yyyy-MM-dd}.json</fileNamePattern>
<maxHistory>30</maxHistory>
</rollingPolicy>
<encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>
<!-- Log levels by package -->
<logger name="com.example.app" level="INFO" />
<logger name="com.example.app.security" level="DEBUG" />
<!-- External libraries -->
<logger name="org.springframework" level="WARN" />
<logger name="org.hibernate" level="WARN" />
<root level="INFO">
<appender-ref ref="CONSOLE" />
<appender-ref ref="FILE" />
<!-- Use in production -->
<!-- <appender-ref ref="JSON" /> -->
</root>
</configuration>
Log4j2 Configuration Example¶
<!-- log4j2.xml -->
<Configuration status="WARN">
<Appenders>
<Console name="Console" target="SYSTEM_OUT">
<PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
</Console>
<RollingFile name="RollingFile" fileName="logs/application.log"
filePattern="logs/application-%d{MM-dd-yyyy}-%i.log">
<PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
<Policies>
<TimeBasedTriggeringPolicy />
<SizeBasedTriggeringPolicy size="10 MB"/>
</Policies>
<DefaultRolloverStrategy max="10"/>
</RollingFile>
</Appenders>
<Loggers>
<Logger name="com.example.app" level="info" additivity="false">
<AppenderRef ref="Console"/>
<AppenderRef ref="RollingFile"/>
</Logger>
<Root level="warn">
<AppenderRef ref="Console"/>
<AppenderRef ref="RollingFile"/>
</Root>
</Loggers>
</Configuration>
Structured Logging¶
For better searchability and analysis, structured logging formats like JSON are recommended:
Using Logstash encoder with Logback:
<dependency>
<groupId>net.logstash.logback</groupId>
<artifactId>logstash-logback-encoder</artifactId>
<version>7.2</version>
</dependency>
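Under the hood, a structured encoder emits each log event as a single JSON object per line, so log aggregators can index individual fields instead of parsing free text. A minimal stdlib sketch of that idea (illustrative only; in practice the LogstashEncoder above produces these lines for you):

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

public class StructuredLog {
    // Build one JSON log line; assumes values contain no characters needing escaping.
    static String jsonLine(String level, String message, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"@timestamp\":\"").append(Instant.now()).append("\",");
        sb.append("\"level\":\"").append(level).append("\",");
        sb.append("\"message\":\"").append(message).append("\"");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append(",\"").append(e.getKey()).append("\":\"").append(e.getValue()).append("\"");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> ctx = new LinkedHashMap<>();
        ctx.put("userId", "42");
        ctx.put("requestId", "req-123");
        System.out.println(jsonLine("INFO", "User profile updated", ctx));
    }
}
```

Each field becomes queryable on its own in Elasticsearch or Loki, which is the whole point of structured logging over pattern-formatted text.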
Adding context information with the Mapped Diagnostic Context (MDC), which attaches key-value pairs to every log entry on the current thread:
import org.slf4j.MDC;

MDC.put("userId", user.getId().toString());
MDC.put("requestId", requestId);
try {
    logger.info("User profile updated"); // MDC fields are included in this entry
} finally {
    MDC.clear(); // always clear, or context leaks across pooled threads
}
Centralized Logging¶
In production environments, logs should be collected centrally:
ELK Stack (Elasticsearch, Logstash, Kibana)¶
A popular combination for log collection, storage, and visualization:
# docker-compose.yml for ELK Stack
version: '3.8'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
environment:
- discovery.type=single-node
- "ES_JAVA_OPTS=-Xms512m -Xmx512m"
ports:
- "9200:9200"
volumes:
- elasticsearch-data:/usr/share/elasticsearch/data
logstash:
image: docker.elastic.co/logstash/logstash:7.17.0
volumes:
- ./logstash/pipeline:/usr/share/logstash/pipeline
ports:
- "5000:5000/tcp"
- "5000:5000/udp"
- "9600:9600"
depends_on:
- elasticsearch
kibana:
image: docker.elastic.co/kibana/kibana:7.17.0
ports:
- "5601:5601"
depends_on:
- elasticsearch
volumes:
elasticsearch-data:
With Logstash configuration:
# logstash.conf
input {
tcp {
port => 5000
codec => json
}
}
filter {
if [logger_name] =~ "com.example.app" {
mutate {
add_tag => [ "application" ]
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "java-application-%{+YYYY.MM.dd}"
}
}
Loki with Grafana¶
A lightweight alternative focused on logs:
# docker-compose.yml for Loki + Grafana
version: '3.8'
services:
loki:
image: grafana/loki:2.6.1
ports:
- "3100:3100"
volumes:
- loki-data:/loki
command: -config.file=/etc/loki/local-config.yaml
grafana:
image: grafana/grafana:9.3.2
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
- grafana-data:/var/lib/grafana
depends_on:
- loki
volumes:
loki-data:
grafana-data:
Metrics Collection for Java Applications¶
Key Metrics to Monitor¶
JVM Metrics¶
- Memory usage (heap and non-heap)
- Garbage collection frequency and duration
- Thread count and state
- Class loading
- CPU usage
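These JVM metrics are what Micrometer's JVM binders export as `jvm_memory_used_bytes`, `jvm_threads_live`, and so on. The same numbers are available directly from the JDK's `java.lang.management` API, which is handy for a quick sanity check without any dependencies:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class JvmMetricsSnapshot {
    // Heap bytes currently in use (what jvm_memory_used_bytes{area="heap"} reports)
    static long heapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    // Live thread count, including daemon threads
    static int liveThreadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    // Total collections across all garbage collectors (-1 means "not reported")
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.printf("heapUsed=%dB threads=%d gcCount=%d%n",
                heapUsedBytes(), liveThreadCount(), totalGcCount());
    }
}
```

In production you would not poll these yourself; Micrometer binds them to your registry automatically when the actuator starter is on the classpath.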
Application Metrics¶
- Request rates
- Response times
- Error rates
- Active sessions
- Business-specific metrics (orders processed, user registrations, etc.)
System Metrics¶
- Host CPU, memory, disk usage
- Network I/O
- Container metrics if applicable
Database Metrics¶
- Connection pool usage
- Query execution time
- Transaction rates
- Cache hit/miss ratio
Metrics Collection Tools¶
Micrometer¶
Micrometer provides a vendor-neutral metrics collection facade:
<!-- Maven dependency -->
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
<version>1.9.3</version>
</dependency>
For Spring Boot applications:
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
# application.properties
management.endpoints.web.exposure.include=prometheus,health,info,metrics
management.metrics.export.prometheus.enabled=true
management.endpoint.health.show-details=always
Custom metrics example:
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
@Service
public class OrderService {
private final Counter orderCounter;
private final Timer orderProcessingTimer;
public OrderService(MeterRegistry registry) {
this.orderCounter = registry.counter("orders.created");
this.orderProcessingTimer = registry.timer("orders.processing.time");
}
public Order createOrder(OrderRequest request) {
return orderProcessingTimer.record(() -> {
Order order = processOrder(request);
orderCounter.increment();
return order;
});
}
}
Prometheus¶
Prometheus is a popular time-series database and monitoring system:
# prometheus.yml
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'spring-boot-app'
metrics_path: '/actuator/prometheus'
static_configs:
- targets: ['app:8080']
Docker Compose setup:
services:
app:
image: myapp:latest
ports:
- "8080:8080"
prometheus:
image: prom/prometheus:v2.40.0
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
volumes:
prometheus-data:
Grafana¶
Grafana provides visualization for metrics:
services:
grafana:
image: grafana/grafana:9.3.2
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- grafana-data:/var/lib/grafana
depends_on:
- prometheus
volumes:
grafana-data:
Example Grafana dashboard configuration for a Java application:
{
"panels": [
{
"title": "JVM Memory Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(jvm_memory_used_bytes{application=\"$application\",instance=\"$instance\",area=\"heap\"})",
"legendFormat": "Heap Used"
},
{
"expr": "sum(jvm_memory_committed_bytes{application=\"$application\",instance=\"$instance\",area=\"heap\"})",
"legendFormat": "Heap Committed"
},
{
"expr": "sum(jvm_memory_max_bytes{application=\"$application\",instance=\"$instance\",area=\"heap\"})",
"legendFormat": "Heap Max"
}
]
},
{
"title": "HTTP Request Rate",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(rate(http_server_requests_seconds_count{application=\"$application\",instance=\"$instance\"}[1m]))",
"legendFormat": "Requests/sec"
}
]
}
]
}
Distributed Tracing for Java Applications¶
What is Distributed Tracing?¶
Distributed tracing tracks requests as they flow through microservices, providing context for troubleshooting and performance optimization.
OpenTelemetry¶
OpenTelemetry is an observability framework that combines tracing, metrics, and logs:
<!-- Maven dependencies -->
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-api</artifactId>
<version>1.19.0</version>
</dependency>
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-sdk</artifactId>
<version>1.19.0</version>
</dependency>
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-exporter-otlp</artifactId>
<version>1.19.0</version>
</dependency>
For Spring Boot applications:
<dependency>
<groupId>io.opentelemetry.instrumentation</groupId>
<artifactId>opentelemetry-spring-boot-starter</artifactId>
<version>1.19.0-alpha</version>
</dependency>
Manual instrumentation example:
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
@Service
public class PaymentService {
private final Tracer tracer;
private final BankService bankService;
public PaymentService(Tracer tracer, BankService bankService) {
this.tracer = tracer;
this.bankService = bankService;
}
public void processPayment(Payment payment) {
Span span = tracer.spanBuilder("processPayment")
.setAttribute("payment.id", payment.getId())
.setAttribute("payment.amount", payment.getAmount())
.startSpan();
try (Scope scope = span.makeCurrent()) {
validatePayment(payment);
Span bankSpan = tracer.spanBuilder("bankTransfer").startSpan();
try (Scope bankScope = bankSpan.makeCurrent()) {
bankService.transfer(payment);
} finally {
bankSpan.end();
}
} finally {
span.end();
}
}
}
Jaeger¶
Jaeger is a popular distributed tracing system:
# docker-compose.yml with Jaeger
services:
jaeger:
image: jaegertracing/all-in-one:1.39
ports:
- "16686:16686" # UI
- "14250:14250" # gRPC
- "14268:14268" # HTTP Collector
Spring Boot configuration:
# application.yml (OpenTelemetry autoconfiguration properties)
otel:
  traces:
    exporter: jaeger
  metrics:
    exporter: prometheus
  exporter:
    jaeger:
      endpoint: http://jaeger:14250
Alert Management¶
Alerting Best Practices¶
Effective alerts should be:
1. Actionable: Alert on symptoms that require human intervention
2. Precise: Avoid alert fatigue by minimizing false positives
3. Relevant: Direct alerts to the right team
4. Clear: Include sufficient context to understand and resolve the issue
Alert Configuration in Prometheus¶
# alertmanager.yml
route:
group_by: ['alertname', 'service', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'email'
routes:
- match:
severity: critical
receiver: 'pager'
receivers:
- name: 'email'
email_configs:
- to: 'team@example.com'
- name: 'pager'
pagerduty_configs:
- service_key: '<pagerduty-key>'
Alert rules:
# alert-rules.yml
groups:
- name: java-application
rules:
- alert: HighMemoryUsage
expr: (sum(jvm_memory_used_bytes{area="heap"}) / sum(jvm_memory_max_bytes{area="heap"})) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage (instance {{ $labels.instance }})"
description: "Memory usage is above 85% for 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HighErrorRate
expr: sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum(rate(http_server_requests_seconds_count[5m])) * 100 > 5
for: 2m
labels:
severity: critical
annotations:
summary: "High HTTP error rate (instance {{ $labels.instance }})"
description: "HTTP 5xx error rate is above 5% for 2 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
Monitoring and Logging Architecture¶
Complete Observability Stack¶
┌───────────────────────────────────────────────────────────────────────────────┐
│ Application │
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────────────────┐ ┌────────────┐ │
│ │ Logs │ │ Metrics │ │ Distributed Traces │ │ Health │ │
│ │ (Logback) │ │(Micrometer)│ │ (OpenTelemetry) │ │ Checks │ │
│ └─────┬─────┘ └─────┬─────┘ └──────────┬────────────┘ └────────────┘ │
└───────────────────────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌──────────────┐ ┌──────────────────┐
│ Log Storage │ │ Time Series │ │ Trace Storage │
│ (Elasticsearch)│ │ (Prometheus)│ │ (Jaeger) │
└───────┬───────┘ └───────┬──────┘ └────────┬─────────┘
│ │ │
└─────────────────┼───────────────────┘
▼
┌─────────────┐
│ Dashboard │
│ (Grafana) │
└─────┬───────┘
│
▼
┌──────────────┐
│ Alerting │
│(AlertManager)│
└──────────────┘
Monitoring in Kubernetes¶
For Java applications in Kubernetes, additional tools like Prometheus Operator and Loki help:
# Prometheus Operator ServiceMonitor for a Spring Boot app
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: spring-app-monitor
namespace: monitoring
spec:
selector:
matchLabels:
app: spring-boot-app
endpoints:
- port: web
path: /actuator/prometheus
interval: 15s
Java Application Health Checks¶
Spring Boot Actuator¶
Spring Boot Actuator provides built-in health endpoints:
# application.yml
management:
endpoints:
web:
exposure:
include: health,info,metrics,prometheus
endpoint:
health:
show-details: always
probes:
enabled: true
health:
livenessState:
enabled: true
readinessState:
enabled: true
Custom health indicator:
@Component
public class DatabaseHealthIndicator implements HealthIndicator {
private final DataSource dataSource;
public DatabaseHealthIndicator(DataSource dataSource) {
this.dataSource = dataSource;
}
@Override
public Health health() {
try (Connection conn = dataSource.getConnection()) {
try (Statement stmt = conn.createStatement()) {
stmt.execute("SELECT 1");
return Health.up()
.withDetail("database", "PostgreSQL")
.withDetail("status", "Available")
.build();
}
} catch (Exception e) {
return Health.down()
.withDetail("error", e.getMessage())
.build();
}
}
}
Kubernetes Probes¶
Configure Kubernetes probes for Java applications:
apiVersion: apps/v1
kind: Deployment
metadata:
name: spring-boot-app
spec:
template:
spec:
containers:
- name: app
image: spring-boot-app:1.0.0
ports:
- containerPort: 8080
livenessProbe:
httpGet:
path: /actuator/health/liveness
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
readinessProbe:
httpGet:
path: /actuator/health/readiness
port: 8080
initialDelaySeconds: 30
periodSeconds: 5
Best Practices for Monitoring Java Applications¶
1. Define Service Level Objectives (SLOs)¶
Establish clear SLOs for your application, for example:
- 99.9% availability
- 95% of requests complete in under 300ms
- Error rate below 0.1%
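An SLO implies an error budget: 99.9% availability over a 30-day window allows roughly 43 minutes of downtime before the objective is breached. A quick back-of-the-envelope calculation (hypothetical helper, not part of any library):

```java
public class ErrorBudget {
    // Minutes of allowed downtime for a given availability SLO over a window
    static double allowedDowntimeMinutes(double sloPercent, double windowDays) {
        double windowMinutes = windowDays * 24 * 60;
        return windowMinutes * (1.0 - sloPercent / 100.0);
    }

    public static void main(String[] args) {
        // 99.9% over 30 days -> about 43.2 minutes of error budget
        System.out.println(allowedDowntimeMinutes(99.9, 30));
    }
}
```

Teams commonly alert on the rate at which this budget is being consumed (burn rate), not just on instantaneous availability.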
2. Use the RED Method¶
Monitor key metrics for every service:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Distribution of request latencies
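In Micrometer these three signals map to a request counter, an error-tagged counter, and a timer. A dependency-free sketch of what a RED tracker records (illustrative; real registries use streaming histograms such as HdrHistogram rather than storing every sample):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RedTracker {
    private long requests = 0;
    private long errors = 0;
    private final List<Long> latenciesMs = new ArrayList<>();

    // Record one completed request: its latency and whether it failed
    public void record(long latencyMs, boolean failed) {
        requests++;
        if (failed) errors++;
        latenciesMs.add(latencyMs);
    }

    public long requestCount() {
        return requests;
    }

    // Fraction of requests that failed (the E in RED)
    public double errorRate() {
        return requests == 0 ? 0.0 : (double) errors / requests;
    }

    // Nearest-rank percentile over everything recorded so far (the D in RED)
    public long latencyPercentileMs(double percentile) {
        List<Long> sorted = new ArrayList<>(latenciesMs);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.size());
        return sorted.get(Math.max(rank - 1, 0));
    }
}
```

Dividing `requestCount()` by the observation window gives the request rate; Prometheus does the same with `rate(...[1m])` over a monotonically increasing counter.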
3. Implement the USE Method¶
For each resource (CPU, memory, disk, connection pools), monitor:
- Utilization: Percentage of time the resource was busy
- Saturation: Amount of work queued waiting for the resource
- Errors: Count of error events
4. Structured Logging¶
Always use structured logging with a consistent format and fields:
- Include contextual information (request ID, user ID)
- Use appropriate log levels
- Include timestamps and source information
5. Centralize Everything¶
Ensure all logs, metrics, and traces are collected centrally to enable:
- Correlation analysis
- Historical trending
- Anomaly detection
6. Right-size Retention¶
Balance retention needs against storage costs, for example:
- High-resolution metrics: 2 weeks
- Aggregated metrics: 1 year or more
- Critical application logs: 90 days
- Detailed debug logs: 7-14 days
7. Use Application Performance Monitoring (APM)¶
Consider APM solutions for deeper insights:
- Elastic APM
- New Relic
- Dynatrace
- AppDynamics
- Datadog
Next Steps¶
Once you understand monitoring and logging fundamentals, explore these related topics: