Resolving 'AWS EKS Node Not Ready' Status: A Comprehensive Troubleshooting Guide
Fix AWS EKS Node Not Ready errors. Learn to diagnose kubelet failures, IAM aws-auth issues, VPC CNI IP exhaustion, and security group misconfigurations.
- Check 'kubectl describe node' first to identify NetworkUnavailable, MemoryPressure, or PIDPressure conditions.
- AWS VPC CNI (aws-node) failures due to subnet IP exhaustion are the most common cause of NotReady states in EKS.
- Ensure the EC2 instance IAM role is correctly mapped in the kube-system 'aws-auth' ConfigMap.
- Verify Security Groups allow bi-directional TCP 443 between the EKS Control Plane and worker nodes.
- Review kubelet logs via journalctl to identify container runtime (containerd) or kubelet bootstrapping failures.
| Method | When to Use | Time | Risk |
|---|---|---|---|
| kubectl describe node | Initial triage to check node conditions and recent Kubelet events | 1 min | None |
| Check Kubelet Logs (journalctl) | Node is reachable but Kubelet fails to register with the API server | 5 mins | Low |
| Verify aws-auth ConfigMap | Node EC2 instance is running but never joins the cluster at all | 3 mins | Low |
| Restart VPC CNI / aws-node | CNI config uninitialized or Pods stuck in ContainerCreating | 2 mins | Medium |
| Replace EKS Node Group | Unrecoverable state, corrupted AMI, or configuration drift | 15 mins | High |
Understanding the Error
When operating Amazon Elastic Kubernetes Service (EKS), one of the most stressful alerts an SRE or DevOps engineer can receive is an AWS EKS "Node Not Ready" state. In Kubernetes, the control plane continuously monitors the health of worker nodes. If the kubelet daemon running on a node stops reporting its status, or reports a degraded state, the API server marks the node's Ready condition as False or Unknown.
When a node transitions to NotReady, the Kubernetes scheduler stops placing new Pods on it. After a default eviction timeout (usually 5 minutes), the control plane will begin evicting existing Pods to reschedule them elsewhere. This can lead to cascading failures if cluster capacity is suddenly reduced.
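The roughly 5-minute window comes from the NoExecute tolerations that the DefaultTolerationSeconds admission plugin injects into every Pod. A Pod spec fragment like the following (values shown are the Kubernetes defaults) controls how long a Pod survives on a NotReady or unreachable node before eviction:

```yaml
# Default tolerations injected into Pods; lower tolerationSeconds
# to fail over faster, raise it to ride out transient node flaps.
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```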
Step 1: Diagnose the Node Conditions
Before logging into the AWS console, start with the Kubernetes API. Run the following command to get a detailed view of the failing node:
kubectl describe node <node-name>
Scroll down to the Conditions section. You are looking for several key indicators:
- Ready: Will be False or Unknown.
- NetworkUnavailable: If True, the node's network routes are not configured correctly (often a CNI issue).
- MemoryPressure / DiskPressure / PIDPressure: If any of these are True, the node has exhausted its physical resources, causing the kubelet to defensively fail or the OS to invoke the OOM killer.
Look at the Events at the bottom of the output. Common error messages include:
- PLEG is not healthy: pleg was last seen active 3m0s ago
- network plugin is not ready: cni config uninitialized
- NodeStatusUnknown: Kubelet stopped posting node status
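To triage a large cluster quickly, you can filter `kubectl get nodes` output down to the unhealthy nodes. The heredoc below stands in for live output so the filter is easy to try; in practice, pipe the real command into the same awk expression:

```shell
# Sketch: list only NotReady nodes. On a live cluster run:
#   kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1 }'
# The sample below mimics `kubectl get nodes --no-headers` output.
sample=$(cat <<'EOF'
ip-10-0-1-123.ec2.internal   NotReady   <none>   41d   v1.29.0
ip-10-0-2-45.ec2.internal    Ready      <none>   41d   v1.29.0
EOF
)

# Field 2 is the STATUS column; anything other than "Ready" is suspect.
not_ready=$(echo "$sample" | awk '$2 != "Ready" { print $1 }')
echo "$not_ready"
```

Feed each name this prints into `kubectl describe node` for the detailed conditions.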
Step 2: AWS VPC CNI and IP Exhaustion
In EKS, networking is handled by the Amazon VPC CNI plugin. A highly common reason for a node being NotReady is that the CNI plugin failed to initialize because the underlying AWS subnet has run out of available IP addresses.
The VPC CNI assigns a secondary IP address from the VPC to every Pod. If your subnet is exhausted, the aws-node daemonset pod on the worker node will crash or hang.
How to check:
- Check the aws-node pods in the kube-system namespace:
  kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
- Look for pods in a CrashLoopBackOff or Error state on the affected node.
- Check your VPC subnet's available IPv4 addresses via the AWS Console or CLI:
  aws ec2 describe-subnets --subnet-ids <your-subnet-id> --query 'Subnets[*].AvailableIpAddressCount'
The Fix: Expand your subnet CIDR, move nodes to a different subnet, or enable prefix delegation (VPC CNI feature) to drastically increase the number of available IPs per EC2 Elastic Network Interface (ENI).
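As a back-of-the-envelope check, the capacity gain from prefix delegation can be sketched with the standard VPC CNI formulas. The ENI and per-ENI IP counts below are illustrative values for an m5.large (confirm yours with `aws ec2 describe-instance-types`):

```shell
# Enable prefix delegation on the VPC CNI (documented env var):
#   kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true

# Assumed instance shape (m5.large): 3 ENIs, 10 IPv4 addresses per ENI.
enis=3
ips_per_eni=10

# Default mode: each Pod consumes one secondary IP; one IP per ENI is primary.
default_max_pods=$(( enis * (ips_per_eni - 1) + 2 ))

# Prefix delegation: each secondary IP slot holds a /28 prefix (16 addresses).
prefix_max_pods=$(( enis * (ips_per_eni - 1) * 16 + 2 ))

echo "default: $default_max_pods pods, prefix delegation: $prefix_max_pods pods"
```

Note that AWS recommends capping max-pods (110 for smaller instance types) regardless of the theoretical ceiling, since kubelet and OS limits apply before IP limits do.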
Step 3: Kubelet and Container Runtime Failures
If the network is fine, the kubelet or the container runtime (containerd or dockerd on older AMIs) might be failing. To investigate this, you must connect to the EC2 instance via AWS Systems Manager (SSM) Session Manager or SSH.
Once connected, check the kubelet logs:
journalctl -u kubelet -f
Common Kubelet Errors:
- Unauthorized / Forbidden:
  error: failed to run Kubelet: cannot create certificate signing request: Unauthorized
  This indicates an IAM issue. The EC2 instance's IAM role must be present in the aws-auth ConfigMap. Check the ConfigMap:
  kubectl get configmap aws-auth -n kube-system -o yaml
  Ensure the rolearn precisely matches the IAM role attached to the EC2 instance (do not include the instance profile path).
- Cgroup Driver Mismatch:
  misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"
  If you are using custom AMIs, ensure your container runtime and kubelet are both configured to use systemd as the cgroup driver. EKS optimized AMIs default to systemd.
- PLEG Issues: If you see Pod Lifecycle Event Generator (PLEG) errors, the container runtime is likely deadlocked. Restart the runtime and kubelet:
  sudo systemctl restart containerd
  sudo systemctl restart kubelet
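For reference, a healthy aws-auth mapRoles entry for a node role follows the shape below. The account ID and role name are placeholders; substitute the actual ARN of the role attached to your worker instances:

```yaml
# Sketch of the aws-auth ConfigMap entry worker nodes need to register.
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/eksNodeInstanceRole
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
```

A common mistake is pasting the instance profile ARN (which contains `instance-profile/` in its path) instead of the role ARN; the kubelet will be rejected until the role ARN itself is mapped.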
Step 4: Security Groups and Network ACLs
For an EKS node to register as Ready, it must be able to communicate with the EKS Control Plane. EKS creates cross-account elastic network interfaces in your VPC to facilitate this.
Ensure your security groups allow:
- Control Plane to Nodes: TCP port 443 (for admission webhooks running on nodes) and TCP port 10250 (the kubelet API, used by commands like kubectl exec and kubectl logs).
- Nodes to Control Plane: TCP port 443 to the EKS cluster endpoint.
If a misconfigured Terraform or CloudFormation deployment accidentally removed the ingress rules allowing the worker node security group to reach the cluster security group, the kubelet will silently fail to register, throwing connection timeouts in the journalctl logs.
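One way to audit this from the CLI is to pull the node security group's rules with `aws ec2 describe-security-group-rules` and filter for the required ingress. The heredoc below is an abridged, hand-written stand-in for that command's JSON output so the jq filter can be demonstrated; pipe the real CLI output through the same filter:

```shell
# Sample mimicking (abridged) output of:
#   aws ec2 describe-security-group-rules --filters Name=group-id,Values=<node-sg-id>
rules=$(cat <<'EOF'
{"SecurityGroupRules":[
  {"IsEgress":false,"IpProtocol":"tcp","FromPort":443,"ToPort":443,
   "ReferencedGroupInfo":{"GroupId":"sg-cluster123"}},
  {"IsEgress":true,"IpProtocol":"-1","FromPort":-1,"ToPort":-1}
]}
EOF
)

# Count ingress rules whose port range covers TCP 443; zero means the
# control plane cannot reach the node and registration will time out.
https_ok=$(echo "$rules" | jq '[.SecurityGroupRules[]
  | select(.IsEgress == false and .IpProtocol == "tcp"
           and .FromPort <= 443 and .ToPort >= 443)] | length')
echo "matching 443 ingress rules: $https_ok"
```

Repeat the same check for port 10250, and remember that Network ACLs on the subnets can silently override permissive security groups.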
Step 5: User Data and Bootstrap Scripts
If the node never joins the cluster after creation, check the EC2 User Data logs. When an EKS node boots, it runs a script (/etc/eks/bootstrap.sh) to configure the kubelet with the cluster's CA certificate and API endpoint.
Check the cloud-init logs on the instance:
cat /var/log/cloud-init-output.log
Look for errors related to downloading the EKS CA cert, connecting to the EKS endpoint, or syntax errors in any custom user data scripts you provided. If the bootstrap script fails, the kubelet is never started.
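For context, a minimal self-managed node user data script looks like the sketch below. The cluster name and kubelet args are placeholders, and managed node groups generate this for you automatically:

```bash
#!/bin/bash
# Sketch: EKS user data for a self-managed node on the Amazon Linux 2
# EKS-optimized AMI. "my-cluster" and the node label are placeholders.
set -o errexit

# bootstrap.sh discovers the cluster endpoint and CA via the EKS API
# unless --apiserver-endpoint / --b64-cluster-ca are passed explicitly.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--node-labels=nodegroup=general'
```

If `set -o errexit` is missing from a custom script and an earlier step fails, the bootstrap call may run with a broken environment, which is why reading cloud-init-output.log from the top matters.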
Conclusion
Resolving an EKS node NotReady issue requires a systematic approach. Always start by reading the node conditions via kubectl. Isolate whether it is a resource exhaustion issue, a CNI/networking failure, or an IAM/authentication blockage. By systematically verifying subnet IPs, security groups, the aws-auth ConfigMap, and kubelet logs, you can quickly identify the root cause and restore cluster capacity.
#!/usr/bin/env bash
# EKS Node Diagnostic Script
# Run this script to gather essential triage information for a NotReady node.
NODE_NAME="ip-10-0-1-123.ec2.internal"
NAMESPACE="kube-system"

echo "=== 1. Checking Node Conditions ==="
kubectl describe node "$NODE_NAME" | grep -A 5 "Conditions:"

echo -e "\n=== 2. Checking AWS VPC CNI Status ==="
kubectl get pods -n "$NAMESPACE" -l k8s-app=aws-node --field-selector spec.nodeName="$NODE_NAME" -o wide

echo -e "\n=== 3. Checking for Recent Kubelet Events ==="
kubectl get events --field-selector involvedObject.name="$NODE_NAME" --sort-by='.metadata.creationTimestamp'

echo -e "\n=== 4. Validating aws-auth ConfigMap ==="
kubectl get configmap aws-auth -n "$NAMESPACE" -o yaml | grep -A 10 "mapRoles"

# Note: To fetch kubelet logs from the node via AWS SSM:
# aws ssm start-session --target <instance-id> --document-name AWS-StartInteractiveCommand --parameters command="journalctl -u kubelet -n 100 --no-pager"

Error Medic Editorial
Error Medic Editorial is composed of Senior DevOps Engineers and SREs dedicated to providing actionable, real-world solutions for modern cloud infrastructure challenges.