Kubernetes Troubleshooting

Overview

This guide covers Kubernetes troubleshooting: how to debug pods, nodes, networking, storage, and the control plane, and how to resolve the most common problems in each area.

Prerequisites

  • Basic understanding of Kubernetes concepts
  • Knowledge of kubectl commands
  • Familiarity with logging and monitoring
  • Understanding of networking concepts

Learning Objectives

  • Understand troubleshooting methodology
  • Learn debugging techniques
  • Master log analysis
  • Implement monitoring solutions
  • Resolve common issues

Table of Contents

  1. Pod Issues
  2. Node Problems
  3. Networking Issues
  4. Storage Problems
  5. Cluster Issues

Pod Issues

Debugging Pods

# Get pod status
kubectl get pod <pod-name> -n <namespace>

# Get pod details
kubectl describe pod <pod-name> -n <namespace>

# Get pod logs
kubectl logs <pod-name> -n <namespace>

# Get previous pod logs
kubectl logs <pod-name> -n <namespace> --previous

# Get container logs in multi-container pod
kubectl logs <pod-name> -c <container-name> -n <namespace>

# Execute commands in pod
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
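
When a namespace holds many pods, it can help to narrow the listing to the ones that need attention before running the commands above. The following is a minimal sketch, not a definitive tool: `unhealthy_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods --no-headers` (NAME READY STATUS RESTARTS AGE).

```shell
# Print "<pod> <status> <ready>" for every pod that is not fully ready.
# Reads `kubectl get pods --no-headers` output on stdin
# (assumed columns: NAME READY STATUS RESTARTS AGE).
unhealthy_pods() {
  awk '{
    split($2, ready, "/")                 # READY looks like "1/2"
    if ($3 == "Completed") { next }       # finished Jobs are not a problem
    if ($3 != "Running" || ready[1] != ready[2]) {
      print $1, $3, $2
    }
  }'
}

# Usage (assumes kubectl is configured for the cluster):
#   kubectl get pods -n <namespace> --no-headers | unhealthy_pods
```

Each pod it prints is a candidate for `kubectl describe pod` and `kubectl logs`.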

Common Pod States

# Pod stuck in Pending: the scheduler needs a node with enough free CPU and memory to satisfy these requests
apiVersion: v1
kind: Pod
metadata:
  name: resource-pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

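# Pod in CrashLoopBackOff: stock nginx listens on port 80 and serves no /healthz, so this liveness probe fails and the kubelet keeps restarting the container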
apiVersion: v1
kind: Pod
metadata:
  name: liveness-pod
spec:
  containers:
  - name: app
    image: nginx
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 3
      periodSeconds: 3
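
When a container keeps restarting, the exit code of its last run often narrows down the cause. The sketch below maps a few well-known exit codes to likely explanations. `explain_exit_code` is a hypothetical helper, and the wording of the hints is an assumption rather than anything Kubernetes reports; exit codes above 128 conventionally mean the process was killed by signal number (code - 128).

```shell
# Translate a container exit code into a likely cause.
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "exit 0: container finished normally; check the command and restartPolicy" ;;
    137) echo "exit 137: killed by SIGKILL (9); often the memory limit was exceeded (OOMKilled)" ;;
    143) echo "exit 143: terminated by SIGTERM (15); usually a requested shutdown" ;;
    *)
      # Codes above 128 mean "killed by signal (code - 128)".
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "exit $code: killed by signal $((code - 128))"
      else
        echo "exit $code: application error; read the logs with --previous"
      fi ;;
  esac
}

# Usage (assumes the first container in the pod is the one restarting):
#   code=$(kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   explain_exit_code "$code"
```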

Node Problems

Node Debugging

# Check node status
kubectl get nodes

# Get node details
kubectl describe node <node-name>

# Get node metrics
kubectl top node

# Check kubelet logs (run on the node itself, for example over SSH)
journalctl -u kubelet

# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions[]'
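
The output of `kubectl top node` can be filtered to highlight nodes under resource pressure. This is a sketch under stated assumptions: `hot_nodes` is a hypothetical helper, the default threshold of 80 percent is arbitrary, and the input is assumed to follow the `kubectl top node --no-headers` layout (NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%), which requires a metrics server in the cluster.

```shell
# Print nodes whose CPU% or MEMORY% is at or above a threshold (default 80).
# Reads `kubectl top node --no-headers` output on stdin.
hot_nodes() {
  threshold="${1:-80}"
  awk -v t="$threshold" '{
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)   # "97%" -> "97"
    if (cpu + 0 >= t + 0 || mem + 0 >= t + 0) {
      print $1, "cpu=" $3, "mem=" $5
    }
  }'
}

# Usage:
#   kubectl top node --no-headers | hot_nodes 90
```

Nodes without metrics (shown as `<unknown>`) evaluate to zero here and are not reported.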

Node Maintenance

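# Drain node (cordons it, then evicts its pods; add --delete-emptydir-data if pods use emptyDir volumes)
kubectl drain <node-name> --ignore-daemonsets

# Cordon node (mark it unschedulable without evicting running pods)
kubectl cordon <node-name>

# Uncordon node (allow scheduling again once maintenance is finished)
kubectl uncordon <node-name>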
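
The drain and uncordon commands are usually run as a pair around some maintenance task. One way to keep them together is a small wrapper, sketched here under assumptions: `with_node_drained` and the `KUBECTL` override variable are hypothetical names, and uncordoning even when the maintenance command fails is a design choice that may not suit every cluster.

```shell
# Command used to talk to the cluster; set KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Drain a node, run the given maintenance command, then uncordon the node.
# Returns the exit status of the maintenance command.
with_node_drained() {
  node="$1"; shift
  "$KUBECTL" drain "$node" --ignore-daemonsets || return 1
  status=0
  "$@" || status=$?                # remember a failure but still uncordon
  "$KUBECTL" uncordon "$node"
  return "$status"
}

# Usage (the maintenance command is a placeholder):
#   with_node_drained <node-name> <maintenance-command> [args...]
```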

Networking Issues

Network Debugging

# Test service DNS (busybox:1.28 is pinned because nslookup in some later busybox releases is unreliable)
kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default

# Check service endpoints
kubectl get endpoints <service-name>

# Test network connectivity
kubectl run -it --rm --restart=Never busybox --image=busybox -- wget -O- http://<service-name>:<port>

# Check network policies
kubectl get networkpolicies --all-namespaces
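
A Service with no endpoints cannot route traffic, so the endpoint check above is often the quickest test. The sketch below counts the addresses in the ENDPOINTS column. `endpoint_count` is a hypothetical helper and assumes the `kubectl get endpoints --no-headers` layout (NAME ENDPOINTS AGE); kubectl abbreviates long lists with "+ N more...", so the result is a lower bound for large Services.

```shell
# Print the number of endpoint addresses listed for each Service.
# Reads `kubectl get endpoints --no-headers` output on stdin.
endpoint_count() {
  awk '{
    if ($2 == "<none>") { print 0; next }   # no ready pods match the selector
    n = split($2, addrs, ",")               # "ip:port,ip:port" -> count
    print n
  }'
}

# Usage:
#   kubectl get endpoints <service-name> --no-headers | endpoint_count
```

A result of 0 usually points to a selector that matches no pods, or to pods that are failing their readiness probes.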

Service Debugging

# Debug service connectivity
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod
spec:
  containers:
  - name: debug
    image: nicolaka/netshoot
    command: ['sh', '-c', 'while true; do sleep 3600; done']

Storage Problems

Storage Debugging

# Check PV status
kubectl get pv

# Check PVC status
kubectl get pvc

# Check storage class
kubectl get storageclass

# Describe storage issues
kubectl describe pv <pv-name>
kubectl describe pvc <pvc-name>

Storage Cleanup

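# Remove finalizers from a PVC stuck in Terminating (last resort: bypasses storage protection and can lose data)
kubectl patch pvc <pvc-name> -p '{"metadata":{"finalizers":null}}'

# Remove finalizers from a PV stuck in Terminating (last resort: the backing volume may need manual cleanup)
kubectl patch pv <pv-name> -p '{"metadata":{"finalizers":null}}'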

Cluster Issues

Cluster Debugging

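# Check cluster components (componentstatuses is deprecated since Kubernetes v1.19)
kubectl get componentstatuses

# Check API server readiness checks (the non-deprecated alternative)
kubectl get --raw='/readyz?verbose'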

# Check control plane pods
kubectl get pods -n kube-system

# Check cluster events
kubectl get events --all-namespaces
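
On a busy cluster the event stream is long, and grouping Warning events by reason can show where to look first. This is a minimal sketch: `warning_summary` is a hypothetical helper, and it assumes two-column input (TYPE then REASON) produced with the `custom-columns` output shown in the usage comment.

```shell
# Count Warning events per reason, most frequent first.
# Reads "TYPE REASON" lines on stdin.
warning_summary() {
  awk '$1 == "Warning" { count[$2]++ }
       END { for (reason in count) print count[reason], reason }' |
    sort -k1,1nr -k2,2
}

# Usage:
#   kubectl get events --all-namespaces --no-headers \
#     -o custom-columns=TYPE:.type,REASON:.reason | warning_summary
```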

# Check API server logs (this static pod name applies to kubeadm-style clusters)
kubectl logs -n kube-system kube-apiserver-<node-name>

Control Plane Recovery

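# Backup etcd (add --endpoints, --cacert, --cert, and --key when etcd requires TLS client authentication)
ETCDCTL_API=3 etcdctl snapshot save snapshot.db

# Restore etcd into a fresh data directory (etcd v3.5+ prefers etcdutl snapshot restore)
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db --data-dir <new-data-dir>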

Best Practices

Logging Best Practices

apiVersion: v1
kind: Pod
metadata:
  name: logging-pod
spec:
  containers:
  - name: app
    image: nginx
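    volumeMounts:
    - name: logs
      # Mount at nginx's own log directory; mounting over /var/log would hide
      # /var/log/nginx and stop nginx from opening its log files
      mountPath: /var/log/nginx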
  volumes:
  - name: logs
    emptyDir: {}

Monitoring Setup

# Requires the Prometheus Operator CRDs to be installed in the cluster
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
spec:
  selector:
    matchLabels:
      app: web
  endpoints:
  - port: metrics

Common Issues and Solutions

Application Issues

  1. Pod won't start
       • Check image name and tag
       • Verify resource requests and limits
       • Check pull secrets
       • Examine node capacity

  2. Container crashes
       • Check application logs
       • Verify health checks
       • Check resource usage
       • Examine dependencies
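
The checklist above can be tied to the STATUS value that `kubectl get pods` reports. The sketch below maps a few common status strings to the matching check. `suggest_check` is a hypothetical helper and the mapping is an assumption meant as a starting point, not an exhaustive diagnosis.

```shell
# Suggest the first thing to check for a given pod STATUS value.
suggest_check() {
  case "$1" in
    ErrImagePull|ImagePullBackOff)
      echo "check the image name and tag, then the pull secrets" ;;
    Pending)
      echo "check resource requests and limits against node capacity" ;;
    CrashLoopBackOff|Error)
      echo "check application logs, health checks, and dependencies" ;;
    OOMKilled)
      echo "check memory usage against the container memory limit" ;;
    *)
      echo "run kubectl describe pod and read the Events section" ;;
  esac
}

# Usage:
#   suggest_check "$(kubectl get pod <pod-name> -n <namespace> --no-headers | awk '{print $3}')"
```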

Networking Issues

  1. Service not accessible
       • Verify service selector
       • Check endpoint creation
       • Test pod connectivity
       • Examine network policies

  2. Ingress problems
       • Check ingress controller
       • Verify TLS configuration
       • Examine service backend
       • Check DNS resolution
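
Verifying a service selector comes down to checking that every key=value pair in the selector also appears in the pod's labels. A minimal sketch of that comparison follows. `selector_matches` is a hypothetical helper, and both arguments are assumed to be comma-separated key=value lists, as printed in the SELECTOR column of `kubectl get service -o wide` and the LABELS column of `kubectl get pods --show-labels`.

```shell
# Return 0 when every key=value pair in the selector appears in the labels.
# Both arguments are comma-separated lists such as "app=web,tier=frontend".
selector_matches() {
  selector="$1"; labels="$2"
  old_ifs="$IFS"; IFS=','
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;                        # pair present, keep checking
      *) IFS="$old_ifs"; return 1 ;;         # pair missing, no match
    esac
  done
  IFS="$old_ifs"
  return 0
}

# Usage:
#   selector_matches "app=web" "app=web,pod-template-hash=abc12" && echo "selected"
```

A pod that does not match never becomes an endpoint of the Service, however healthy it is.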

Storage Issues

  1. PVC stuck in pending
       • Check storage class
       • Verify PV availability
       • Examine storage provider
       • Check capacity

  2. Volume mount failures
       • Check mount permissions
       • Verify volume paths
       • Examine node issues
       • Check filesystem type
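
For the capacity check, a PVC can only bind to a PV whose capacity is at least the requested size, and the two are often written with different suffixes. The sketch below converts Kubernetes quantities to bytes so they can be compared. `to_bytes` and `pvc_fits` are hypothetical helpers, and only the Ki/Mi/Gi/Ti and k/M/G/T suffixes are handled.

```shell
# Convert a Kubernetes quantity such as "10Gi" or "500M" to bytes.
to_bytes() {
  printf '%s\n' "$1" | awk '{
    value = $1 + 0                       # numeric prefix, e.g. 10 from "10Gi"
    unit = $1
    sub(/^[0-9.]+/, "", unit)            # what remains is the suffix
    if      (unit == "Ki") value *= 1024
    else if (unit == "Mi") value *= 1024 ^ 2
    else if (unit == "Gi") value *= 1024 ^ 3
    else if (unit == "Ti") value *= 1024 ^ 4
    else if (unit == "k")  value *= 1000
    else if (unit == "M")  value *= 1000 ^ 2
    else if (unit == "G")  value *= 1000 ^ 3
    else if (unit == "T")  value *= 1000 ^ 4
    printf "%.0f\n", value
  }'
}

# Return 0 when the PVC request (first argument) fits in the PV capacity (second).
pvc_fits() {
  [ "$(to_bytes "$1")" -le "$(to_bytes "$2")" ]
}

# Usage:
#   pvc_fits 8Gi 10Gi && echo "capacity is sufficient"
```

Capacity is only one binding condition; the storage class and access modes must match as well.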

Troubleshooting Tools

Debug Container

apiVersion: v1
kind: Pod
metadata:
  name: debug-tools
spec:
  containers:
  - name: debug
    image: nicolaka/netshoot
    command:
    - sleep
    - "3600"
    securityContext:
      # Privileged mode grants host-level access; delete this pod when debugging is finished
      privileged: true

Network Debug

apiVersion: v1
kind: Pod
metadata:
  name: network-debug
spec:
  containers:
  - name: network-tools
    image: praqma/network-multitool
    command:
    - sleep
    - "3600"

Resources for Further Learning

Practice Exercises

  1. Debug failing pods
  2. Troubleshoot service connectivity
  3. Resolve storage issues
  4. Fix networking problems
  5. Recover from cluster issues