Diagnosing Resource Issues in a Kubernetes Cluster

This guide provides a systematic approach to identifying and diagnosing CPU and memory resource problems within a Kubernetes cluster. It covers checking node and pod resource utilization, from a cluster-wide overview to a detailed analysis of a single node.

1. Prerequisite: Install the Metrics Server

The kubectl top command is the primary tool for checking real-time resource usage. This command relies on the Metrics Server, which aggregates resource data from each node. Before proceeding, you must ensure it is installed and running.

First, check if the Metrics Server is already deployed:

kubectl get deployment metrics-server -n kube-system

If the command does not return a running deployment, install it. The following command deploys the latest version:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Note: After applying, it may take a few minutes for the Metrics Server to become fully operational and start reporting metrics.
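
To confirm that it has come up, you can wait for the rollout to complete and check that the metrics API has registered. This sketch assumes the default deployment name (metrics-server) and the API service installed by the manifest above:

# Wait for the metrics-server deployment to finish rolling out
kubectl rollout status deployment/metrics-server -n kube-system

# Confirm the metrics API service is registered and reports Available
kubectl get apiservice v1beta1.metrics.k8s.io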

2. High-Level Cluster Overview

Start by assessing the overall health of your nodes to identify any that are under pressure.

Check Node Utilization

Use kubectl top node to get a summary of CPU and memory usage for every node in the cluster. This helps you quickly spot a node that is running hotter than others.

kubectl top node

Output will resemble:

NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
vm-k3s-wkr-1   350m         8%     2840Mi          73%
vm-k3s-wkr-2   450m         11%    3150Mi          81%
vm-k3s-wkr-3   1200m        30%    3500Mi          90%
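
On larger clusters it can help to sort the node list so the busiest nodes appear first; kubectl top node accepts --sort-by with either cpu or memory:

# List nodes with the heaviest memory consumers first
kubectl top node --sort-by=memory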

Check Top Pods Across the Cluster

To find the most resource-intensive pods across all namespaces, use kubectl top pod combined with the --sort-by flag.

To find the top CPU-consuming pods:

# Sorts by the CPU column in descending order
kubectl top pod --all-namespaces --sort-by=cpu

To find the top memory-consuming pods:

# Sorts by the Memory column in descending order
kubectl top pod --all-namespaces --sort-by=memory

These commands direct your attention to the specific applications that are consuming the most resources cluster-wide.
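
If a pod runs several containers (for example, an application plus a sidecar), the --containers flag breaks the usage down per container, which helps attribute the consumption correctly. Replace <namespace> with the namespace of interest:

# Show per-container CPU and memory usage within each pod of the namespace
kubectl top pod -n <namespace> --containers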

3. Deep Dive: Analyzing Pods on a Specific Node

If a particular node shows high utilization (e.g., vm-k3s-wkr-3 from the example above), the next step is to identify which pods on that specific node are responsible.

The following command lists all pods on a designated node and sorts them by memory usage in descending order. This is highly effective for pinpointing the source of node pressure.

# Define the target node name
NODE_NAME="vm-k3s-wkr-3"

# Get a list of pod names on that node
POD_NAMES=$(kubectl get pods --all-namespaces --field-selector spec.nodeName=${NODE_NAME} -o=custom-columns=NAME:.metadata.name --no-headers)

# Filter the 'kubectl top' output to show only those pods, then sort by memory (4th column)
kubectl top pods --all-namespaces --no-headers | grep -E "${POD_NAMES//\ /|}" | sort -k4 -h -r

Command Breakdown

  • kubectl get pods ...: This command uses a --field-selector to retrieve only the pods scheduled on the target node (spec.nodeName=${NODE_NAME}). It outputs just their names.
  • kubectl top pods ...: This fetches the live CPU and memory usage for all pods in the cluster.
  • grep -E "${POD_NAMES//\ /|}": This filters the full top output, showing only the lines that match the pod names running on your target node. The names from the previous step arrive one per line, and grep -E treats each pattern line as an alternative, which is equivalent to a pod-a|pod-b|pod-c pattern; the ${POD_NAMES//\ /|} substitution additionally covers the case where the names are space-separated. Because grep matches substrings, a more precise alternative is sketched after this list.
  • sort -k4 -h -r: This sorts the final list by the fourth column (MEMORY) in human-readable (-h) and reverse (-r) order, placing the heaviest pod at the top.
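
Because substring matching can pull in pods with overlapping names, a slower but more precise alternative is to query each pod on the node individually by namespace and name. A minimal sketch, using only standard kubectl commands:

# Query each pod on the node by namespace and name, then sort by memory (3rd column in this output)
kubectl get pods --all-namespaces --field-selector spec.nodeName=${NODE_NAME} \
  -o=custom-columns=NS:.metadata.namespace,NAME:.metadata.name --no-headers |
while read NS POD; do
  kubectl top pod "${POD}" -n "${NS}" --no-headers
done | sort -k3 -h -r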

4. Inspecting Problematic Pods

Once you have identified a high-resource pod, use the following commands to investigate further.

Check Pod Events and Configuration

Use kubectl describe to check for important events (like OOMKilled), and to see the pod's configured resource requests and limits. Comparing actual usage from kubectl top against these limits is a critical diagnostic step.

kubectl describe pod <pod-name> -n <namespace>
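
If you only need the recent events or the configured requests and limits, they can be pulled directly rather than read out of the full describe output. The placeholders are the same as above:

# List recent events for the pod, oldest first
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp

# Print the resource requests and limits configured on the pod's containers
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'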

Check Application Logs

Application-level errors are often the root cause of high resource usage. Check the pod's logs for stack traces, memory leak warnings, or other errors.

kubectl logs <pod-name> -n <namespace>
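
If the container has already restarted (for example, after being OOMKilled), the current logs may no longer contain the evidence; the --previous flag retrieves the logs of the prior container instance:

# View logs from the previously terminated container instance
kubectl logs <pod-name> -n <namespace> --previous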

By following this structured process, from a high-level overview to a granular, node-specific analysis, you can efficiently diagnose and resolve most common resource issues in a Kubernetes cluster.