Diagnosing Resource Issues in a Kubernetes Cluster
This guide provides a systematic approach to identifying and diagnosing CPU and memory resource problems within a Kubernetes cluster. It covers checking node and pod resource utilization, from a cluster-wide overview to a detailed analysis of a single node.
1. Prerequisite: Install the Metrics Server
The kubectl top command is the primary tool for checking real-time resource usage. This command relies on the Metrics Server, which aggregates resource data from each node. Before proceeding, you must ensure it is installed and running.
First, check if the Metrics Server is already deployed:
kubectl get deployment metrics-server -n kube-system
If the command does not return a running deployment, install it. The following command deploys the latest version:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Note: After applying, it may take a few minutes for the Metrics Server to become fully operational and start reporting metrics. k8studio
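Once applied, one way to confirm that the Metrics Server has finished rolling out and that the metrics API is registered is to run the following standard kubectl commands:

# Wait for the Metrics Server deployment to finish rolling out
kubectl -n kube-system rollout status deployment metrics-server

# Confirm the metrics API is registered and reports as Available
kubectl get apiservice v1beta1.metrics.k8s.io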
2. High-Level Cluster Overview
Start by assessing the overall health of your nodes to identify any that are under pressure. signoz
Check Node Utilization
Use kubectl top node to get a summary of CPU and memory usage for every node in the cluster. This helps you quickly spot a node that is running hotter than others. signoz
kubectl top node
Output will resemble:
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
vm-k3s-wkr-1 350m 8% 2840Mi 73%
vm-k3s-wkr-2 450m 11% 3150Mi 81%
vm-k3s-wkr-3 1200m 30% 3500Mi 90%
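kubectl top node reports actual usage; to compare a hot node's usage against what its pods have requested and are limited to, kubectl describe node prints capacity, allocatable resources, and an "Allocated resources" summary. For example, for the busiest node above:

# Inspect capacity, allocatable resources, and the "Allocated resources" summary
kubectl describe node vm-k3s-wkr-3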
Check Top Pods Across the Cluster
To find the most resource-intensive pods across all namespaces, use kubectl top pod combined with the --sort-by flag. kubernetes+1
To find the top CPU-consuming pods:
# Sorts by the CPU column in descending order
kubectl top pod --all-namespaces --sort-by=cpu
To find the top memory-consuming pods:
# Sorts by the Memory column in descending order
kubectl top pod --all-namespaces --sort-by=memory
These commands direct your attention to the specific applications that are consuming the most resources cluster-wide. last9
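If a heavy pod runs more than one container, the pod-level totals will not tell you which container is responsible. Adding the --containers flag breaks the usage down per container, for example:

# Show per-container usage instead of pod totals
kubectl top pod --all-namespaces --containers --sort-by=memory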
3. Deep Dive: Analyzing Pods on a Specific Node
If a particular node shows high utilization (e.g., vm-k3s-wkr-3 from the example above), the next step is to identify which pods on that specific node are responsible.
The following command lists all pods on a designated node and sorts them by memory usage in descending order. This is highly effective for pinpointing the source of node pressure.
# Define the target node name
NODE_NAME="vm-k3s-wkr-3"
# Get a list of pod names on that node
POD_NAMES=$(kubectl get pods --all-namespaces --field-selector spec.nodeName=${NODE_NAME} -o=custom-columns=NAME:.metadata.name --no-headers)
# Filter the 'kubectl top' output to show only those pods, then sort by memory (4th column)
kubectl top pods --all-namespaces --no-headers | grep -E "${POD_NAMES//\ /|}" | sort -k4 -h -r
Command Breakdown
* kubectl get pods ...: This command uses a --field-selector to retrieve only the pods scheduled on spec.nodeName=${NODE_NAME}. It outputs just their names. kubernetes
* kubectl top pods ...: This fetches the live CPU and memory usage for all pods in the cluster. kubernetes
* grep -E "${POD_NAMES//\ /|}": This filters the full top output, showing only the lines that match the pod names running on your target node. The ${POD_NAMES//\ /|} part formats the pod names into a grep -E compatible pattern (effectively pod-a|pod-b|pod-c); because the names come back one per line, grep also treats each line of the pattern as a separate alternative.
* sort -k4 -h -r: This sorts the final list by the fourth column (MEMORY) in human-readable (-h) and reverse (-r) order, placing the heaviest pod at the top. k8studio
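If this node-level check is something you run often, the same pipeline fits neatly into a small shell function (a sketch; the function name is arbitrary, and the newline-to-pipe substitution simply makes the grep pattern explicit):

# Usage: top_pods_on_node <node-name>
top_pods_on_node() {
  local node="$1"
  local pods
  # Pod names on the target node, one per line
  pods=$(kubectl get pods --all-namespaces \
    --field-selector spec.nodeName="${node}" \
    -o=custom-columns=NAME:.metadata.name --no-headers)
  # Filter cluster-wide usage to those pods and sort by memory (4th column)
  kubectl top pods --all-namespaces --no-headers \
    | grep -E "${pods//$'\n'/|}" \
    | sort -k4 -h -r
}

top_pods_on_node vm-k3s-wkr-3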
4. Inspecting Problematic Pods
Once you have identified a high-resource pod, use the following commands to investigate further.
Check Pod Events and Configuration
Use kubectl describe to check for important events (like OOMKilled), and to see the pod's configured resource requests and limits. Comparing actual usage from kubectl top against these limits is a critical diagnostic step. last9
kubectl describe pod <pod-name> -n <namespace>
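The Events section at the bottom of the output and the Limits/Requests listed under each container are the key fields to review. If you only need the configured requests and limits for a quick comparison against kubectl top, a jsonpath query is a compact alternative (a sketch; replace the placeholders as before):

# Print each container's name and its configured resource requests/limits
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'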
Check Application Logs
Application-level errors are often the root cause of high resource usage. Check the pod's logs for stack traces, memory leak warnings, or other errors. last9
kubectl logs <pod-name> -n <namespace>
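If the container has already been restarted (for example, after being OOMKilled), the logs of the previous instance usually contain the relevant error; the --previous and --tail flags are useful here:

# Show the last 100 lines from the previous (crashed) container instance
kubectl logs <pod-name> -n <namespace> --previous --tail=100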
By following this structured process, from a high-level overview to a granular, node-specific analysis, you can efficiently diagnose and resolve most common resource issues in a Kubernetes cluster.