Diagnosing Resource Issues in a Kubernetes Cluster
This guide provides a systematic approach to identifying and diagnosing CPU and memory resource problems within a Kubernetes cluster. It covers checking node and pod resource utilization, from a cluster-wide overview to a detailed analysis of a single node.
1. Prerequisite: Install the Metrics Server
The kubectl top command is the primary tool for checking real-time resource usage. This command relies on the Metrics Server, which aggregates resource data from each node. Before proceeding, you must ensure it is installed and running.
First, check if the Metrics Server is already deployed:
kubectl get deployment metrics-server -n kube-system
If the command does not return a running deployment, install it. The following command deploys the latest version:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Note: After applying, it may take a few minutes for the Metrics Server to become fully operational and start reporting metrics. k8studio
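Once applied, one way to confirm that the Metrics Server has finished rolling out and that the metrics API is registered is to run the following standard kubectl commands:

# Wait for the Metrics Server deployment to finish rolling out
kubectl -n kube-system rollout status deployment metrics-server

# Confirm the metrics API is registered and reports as Available
kubectl get apiservice v1beta1.metrics.k8s.io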
2. High-Level Cluster Overview
Start by assessing the overall health of your nodes to identify any that are under pressure. signoz
Check Node Utilization
Use kubectl top node to get a summary of CPU and memory usage for every node in the cluster. This helps you quickly spot a node that is running hotter than others. signoz
kubectl top node
Output will resemble:
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
vm-k3s-wkr-1 350m 8% 2840Mi 73%
vm-k3s-wkr-2 450m 11% 3150Mi 81%
vm-k3s-wkr-3 1200m 30% 3500Mi 90%
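kubectl top node reports actual usage; to compare a hot node's usage against what its pods have requested and are limited to, kubectl describe node prints capacity, allocatable resources, and an "Allocated resources" summary. For example, for the busiest node above:

# Inspect capacity, allocatable resources, and the "Allocated resources" summary
kubectl describe node vm-k3s-wkr-3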
Check Top Pods Across the Cluster
To find the most resource-intensive pods across all namespaces, use kubectl top pod combined with the --sort-by flag. kubernetes+1
To find the top CPU-consuming pods:
# Sorts by the CPU column in descending order
kubectl top pod --all-namespaces --sort-by=cpu
To find the top memory-consuming pods:
# Sorts by the Memory column in descending order
kubectl top pod --all-namespaces --sort-by=memory
These commands direct your attention to the specific applications that are consuming the most resources cluster-wide. last9
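If a heavy pod runs more than one container, the pod-level totals will not tell you which container is responsible. Adding the --containers flag breaks the usage down per container, for example:

# Show per-container usage instead of pod totals
kubectl top pod --all-namespaces --containers --sort-by=memory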
3. Deep Dive: Analyzing Pods on a Specific Node
If a particular node shows high utilization (e.g., vm-k3s-wkr-3 from the example above), the next step is to identify which pods on that specific node are responsible.
The following command lists all pods on a designated node and sorts them by memory usage in descending order. This is highly effective for pinpointing the source of node pressure.
# Define the target node name
NODE_NAME="vm-k3s-wkr-3"
# Get a list of pod names on that node
POD_NAMES=$(kubectl get pods --all-namespaces --field-selector spec.nodeName=${NODE_NAME} -o=custom-columns=NAME:.metadata.name --no-headers)
# Filter the 'kubectl top' output to show only those pods, then sort by memory (4th column)
kubectl top pods --all-namespaces --no-headers | grep -E "${POD_NAMES//\ /|}" | sort -k4 -h -r
Command Breakdown
* kubectl get pods ...: This command uses a --field-selector to retrieve only the pods scheduled on spec.nodeName=${NODE_NAME}. It outputs just their names. kubernetes
* kubectl top pods ...: This fetches the live CPU and memory usage for all pods in the cluster. kubernetes
* grep -E "${POD_NAMES//\ /|}": This filters the full top output, showing only the lines that match the pod names running on your target node. The ${POD_NAMES//\ /|} part formats the pod names into a grep -E compatible pattern (effectively pod-a|pod-b|pod-c); because the names come back one per line, grep also treats each line of the pattern as a separate alternative.
* sort -k4 -h -r: This sorts the final list by the fourth column (MEMORY) in human-readable (-h) and reverse (-r) order, placing the heaviest pod at the top. k8studio
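If this node-level check is something you run often, the same pipeline fits neatly into a small shell function (a sketch; the function name is arbitrary, and the newline-to-pipe substitution simply makes the grep pattern explicit):

# Usage: top_pods_on_node <node-name>
top_pods_on_node() {
  local node="$1"
  local pods
  # Pod names on the target node, one per line
  pods=$(kubectl get pods --all-namespaces \
    --field-selector spec.nodeName="${node}" \
    -o=custom-columns=NAME:.metadata.name --no-headers)
  # Filter cluster-wide usage to those pods and sort by memory (4th column)
  kubectl top pods --all-namespaces --no-headers \
    | grep -E "${pods//$'\n'/|}" \
    | sort -k4 -h -r
}

top_pods_on_node vm-k3s-wkr-3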
4. Inspecting Problematic Pods
Once you have identified a high-resource pod, use the following commands to investigate further.
Check Pod Events and Configuration
Use kubectl describe to check for important events (like OOMKilled), and to see the pod's configured resource requests and limits. Comparing actual usage from kubectl top against these limits is a critical diagnostic step. last9
kubectl describe pod <pod-name> -n <namespace>
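The Events section at the bottom of the output and the Limits/Requests listed under each container are the key fields to review. If you only need the configured requests and limits for a quick comparison against kubectl top, a jsonpath query is a compact alternative (a sketch; replace the placeholders as before):

# Print each container's name and its configured resource requests/limits
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'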
Check Application Logs
Application-level errors are often the root cause of high resource usage. Check the pod's logs for stack traces, memory leak warnings, or other errors. last9
kubectl logs <pod-name> -n <namespace>
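If the container has already been restarted (for example, after being OOMKilled), the logs of the previous instance usually contain the relevant error; the --previous and --tail flags are useful here:

# Show the last 100 lines from the previous (crashed) container instance
kubectl logs <pod-name> -n <namespace> --previous --tail=100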
By following this structured process, from a high-level overview to a granular, node-specific analysis, you can efficiently diagnose and resolve most common resource issues in a Kubernetes cluster.