Diagnosing Resource Issues in a Kubernetes Cluster
This guide provides a systematic approach to identifying and diagnosing CPU and memory resource problems within a Kubernetes cluster. It covers checking node and pod resource utilization, from a cluster-wide overview to a detailed analysis of a single node.
1. Prerequisite: Install the Metrics Server
The kubectl top command is the primary tool for checking real-time resource usage. This command relies on the Metrics Server, which aggregates resource data from each node. Before proceeding, you must ensure it is installed and running.
First, check if the Metrics Server is already deployed:
kubectl get deployment metrics-server -n kube-system
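If it is installed, the deployment will report as ready. Output will resemble (AGE will vary):
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
metrics-server   1/1     1            1           42d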
If the command does not return a running deployment, install it. The following command deploys the latest version:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Note: After applying, it may take a few minutes for the Metrics Server to become fully operational and start reporting metrics.
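Rather than polling manually, you can wait for the rollout to complete and confirm that the metrics API is registered (both are standard kubectl commands):
# Block until the Metrics Server deployment reports ready
kubectl -n kube-system rollout status deployment/metrics-server
# Verify the metrics API is registered and available
kubectl get apiservice v1beta1.metrics.k8s.io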
2. High-Level Cluster Overview
Start by assessing the overall health of your nodes to identify any that are under pressure.
Check Node Utilization
Use kubectl top node to get a summary of CPU and memory usage for every node in the cluster. This helps you quickly spot a node that is running hotter than others.
kubectl top node
Output will resemble:
NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
vm-k3s-wkr-1   350m         8%     2840Mi          73%
vm-k3s-wkr-2   450m         11%    3150Mi          81%
vm-k3s-wkr-3   1200m        30%    3500Mi          90%
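Keep in mind that kubectl top reports actual usage, not requests. To compare a hot node's usage against its allocatable capacity and the resource requests already scheduled onto it, describe the node (shown here with the example node name) and review the Allocated resources section:
kubectl describe node vm-k3s-wkr-3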
Check Top Pods Across the Cluster
To find the most resource-intensive pods across all namespaces, use kubectl top pod combined with the --sort-by flag.
To find the top CPU-consuming pods:
# Sorts by the CPU column in descending order
kubectl top pod --all-namespaces --sort-by=cpu
To find the top memory-consuming pods:
# Sorts by the Memory column in descending order
kubectl top pod --all-namespaces --sort-by=memory
These commands direct your attention to the specific applications that are consuming the most resources cluster-wide.
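If a pod runs several containers, the pod-level totals can mask which container is at fault. kubectl top pod also accepts a --containers flag to break usage down per container:
# Show per-container usage across all namespaces, sorted by memory
kubectl top pod --all-namespaces --containers --sort-by=memory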
3. Deep Dive: Analyzing Pods on a Specific Node
If a particular node shows high utilization (e.g., vm-k3s-wkr-3 from the example above), the next step is to identify which pods on that specific node are responsible.
The following sequence of commands lists all pods on a designated node and sorts them by memory usage in descending order. This is highly effective for pinpointing the source of node pressure.
# Define the target node name
NODE_NAME="vm-k3s-wkr-3"
# Get a list of pod names on that node
POD_NAMES=$(kubectl get pods --all-namespaces --field-selector spec.nodeName=${NODE_NAME} -o=custom-columns=NAME:.metadata.name --no-headers)
# Filter the 'kubectl top' output to show only those pods, then sort by memory (4th column)
kubectl top pods --all-namespaces --no-headers | grep -E "${POD_NAMES//$'\n'/|}" | sort -k4 -h -r
Command Breakdown
kubectl get pods ...: This command uses a --field-selector to retrieve only the pods scheduled on spec.nodeName=${NODE_NAME}. It outputs just their names.
kubectl top pods ...: This fetches the live CPU and memory usage for all pods in the cluster.
grep -E "${POD_NAMES//$'\n'/|}": This filters the full top output, showing only the lines that match the pod names running on your target node. The ${POD_NAMES//$'\n'/|} expansion joins the newline-separated pod names into a grep -E compatible alternation pattern (e.g., pod-a|pod-b|pod-c).
sort -k4 -h -r: This sorts the final list by the fourth column (MEMORY) in human-readable (-h) and reverse (-r) order, placing the heaviest pod at the top.
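The same pipeline can rank pods by CPU instead: with --all-namespaces, CPU(cores) is the third column, so only the sort key changes:
# Same filter, sorted by CPU (3rd column) instead of memory
kubectl top pods --all-namespaces --no-headers | grep -E "${POD_NAMES//$'\n'/|}" | sort -k3 -h -r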
4. Inspecting Problematic Pods
Once you have identified a high-resource pod, use the following commands to investigate further.
Check Pod Events and Configuration
Use kubectl describe to check for important events (like OOMKilled), and to see the pod's configured resource requests and limits. Comparing actual usage from kubectl top against these limits is a critical diagnostic step.
kubectl describe pod <pod-name> -n <namespace>
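If you only need the configured requests and limits, a jsonpath query extracts them directly, without the rest of the describe output:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'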
Check Application Logs
Application-level errors are often the root cause of high resource usage. Check the pod's logs for stack traces, memory leak warnings, or other errors.
kubectl logs <pod-name> -n <namespace>
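If the container has already been OOMKilled and restarted, the current logs belong to the fresh instance. Use the --previous flag to retrieve the logs of the terminated container, which usually contain the relevant stack trace:
kubectl logs <pod-name> -n <namespace> --previous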
By following this structured process, from a high-level overview to a granular, node-specific analysis, you can efficiently diagnose and resolve most common resource issues in a Kubernetes cluster.