Cloud Infrastructure Errors: AWS, Azure & GCP Troubleshooting Guide
Cloud infrastructure errors can bring down production systems in seconds and cost real money every minute they persist. Whether you're running serverless functions on AWS Lambda, hosting containers on GCP Cloud Run, or managing virtual machines in Azure, the complexity of cloud platforms means errors are rarely straightforward.
The most frustrating cloud errors are often permission-related. IAM policies, resource policies, service roles, and cross-account access create a layered permission model where a single misconfigured statement can block an entire workflow. After permissions, networking is the next most common culprit — VPC configurations, security groups, subnet routing, and DNS resolution all need to be correct for services to communicate.
This section covers 31 troubleshooting articles across 13 cloud services spanning all three major providers. You'll find guides for AWS services like Lambda, S3, EC2, RDS, ECS, EKS, API Gateway, and CloudFront, plus Azure Functions, Azure VMs, GCP Cloud Run, and GCP Cloud Functions. Each article targets specific error messages you'll encounter in production and walks through resolution step by step.
Cloud errors demand systematic debugging. Your primary tools are the provider's console, its CLI, and its logging service: Amazon CloudWatch on AWS, Cloud Logging (formerly Stackdriver) on GCP, and Azure Monitor on Azure. These guides teach you where to look first and how to read the signals that point to the root cause.
Common Patterns & Cross-Cutting Themes
IAM & Permission Errors
Permission denied errors are the single most common class of cloud infrastructure issues. Every cloud provider uses a policy-based access control system (AWS IAM, Azure RBAC, GCP IAM) where resources, identities, and actions are governed by JSON policy documents.
When you see "Access Denied," "Forbidden," or "insufficient permissions," start by identifying which identity is making the request (user, role, or service account) and which resource it's trying to access. In AWS, use CloudTrail to find the exact API call that was denied and the policy evaluation result. In GCP, check the Policy Troubleshooter. In Azure, review the Activity Log.
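As a rough illustration of the "which identity, which call" step on AWS, the sketch below filters records shaped like CloudTrail events for authorization failures. The field names (`errorCode`, `eventSource`, `eventName`, `userIdentity.arn`) follow the CloudTrail record format; the sample records, account ID, and the `denied_calls` helper are hypothetical and only stand in for events you would export or query yourself.

```python
import json

# Error codes that commonly indicate an authorization failure in CloudTrail records.
DENIED_CODES = {"AccessDenied", "Client.UnauthorizedOperation"}


def denied_calls(records):
    """Return (identity ARN, service, API call) for each denied record."""
    results = []
    for record in records:
        if record.get("errorCode") in DENIED_CODES:
            identity = record.get("userIdentity", {}).get("arn", "unknown")
            results.append((identity, record.get("eventSource"), record.get("eventName")))
    return results


# Hypothetical sample data shaped like CloudTrail records (not real log output).
sample = json.loads("""
[
  {"eventSource": "ec2.amazonaws.com", "eventName": "RunInstances",
   "errorCode": "Client.UnauthorizedOperation",
   "userIdentity": {"arn": "arn:aws:sts::111122223333:assumed-role/app-role/session"}},
  {"eventSource": "iam.amazonaws.com", "eventName": "ListRoles",
   "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"}}
]
""")

for identity, service, call in denied_calls(sample):
    print(f"{identity} was denied {call} on {service}")
```

Once you know the identity and the exact API call, you can narrow the policy review to the statements that mention that action.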
Common pitfalls include: explicit denies overriding allows, a missing allow falling through to the default implicit deny, resource-based policies conflicting with identity-based policies, missing trust relationships on assumed roles, and conditions like IP restrictions or MFA requirements blocking programmatic access. Cross-account access adds another layer: both the source and destination accounts must explicitly allow the action.
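To make the deny/allow precedence concrete, here is a deliberately simplified sketch of AWS-style evaluation for a single set of statements. It is not the provider's actual algorithm: it ignores conditions, principals, resource-based policies, permission boundaries, and organization-level policies, and it matches case-sensitively with Python's `fnmatchcase`, which only approximates policy wildcards. The `evaluate` function and the example statements are hypothetical.

```python
from fnmatch import fnmatchcase


def _matches(patterns, value):
    """True if value matches any pattern; '*' acts as a wildcard."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(fnmatchcase(value, p) for p in patterns)


def evaluate(statements, action, resource):
    """Return 'ExplicitDeny', 'Allow', or 'ImplicitDeny' for one request."""
    decision = "ImplicitDeny"  # nothing is allowed until a statement allows it
    for stmt in statements:
        if _matches(stmt["Action"], action) and _matches(stmt["Resource"], resource):
            if stmt["Effect"] == "Deny":
                return "ExplicitDeny"  # an explicit deny always wins
            decision = "Allow"
    return decision


# Hypothetical statements for illustration only.
statements = [
    {"Effect": "Allow", "Action": "s3:*", "Resource": "arn:aws:s3:::reports/*"},
    {"Effect": "Deny", "Action": "s3:DeleteObject", "Resource": "*"},
]

print(evaluate(statements, "s3:GetObject", "arn:aws:s3:::reports/q1.csv"))
print(evaluate(statements, "s3:DeleteObject", "arn:aws:s3:::reports/q1.csv"))
print(evaluate(statements, "ec2:RunInstances", "arn:aws:ec2:us-east-1:111122223333:instance/*"))
```

The third request is the case that surprises people most often: no statement denies it, but no statement allows it either, so it is still refused.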
Resource Limits & Throttling
Every cloud service has default quotas — and they're often lower than you expect. Lambda has concurrent execution limits, EC2 has instance count limits per region, S3 has request rate limits per prefix, and RDS has connection count limits based on instance size.
Throttling usually manifests as 429 or 503 errors, often with provider-specific error codes like "TooManyRequestsException" or messages like "Rate exceeded." Some AWS APIs instead return HTTP 400 with a "ThrottlingException" code, so check the error code as well as the status. The fix varies: some limits can be increased via a support request, others require architectural changes like request queuing, connection pooling, or distributing load across regions.
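On the client side, the usual first response to throttling is to retry with backoff, which the quick-reference table in this guide also suggests. The following is a minimal sketch of capped exponential backoff with full jitter. `ThrottledError` is a hypothetical stand-in for whatever exception your SDK raises on a throttled call, and the delay values are illustrative defaults, not provider recommendations. Many SDKs ship their own retry logic, so check that before adding another layer.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a provider throttling error (e.g. HTTP 429)."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.2, max_delay=10.0,
                      sleep=time.sleep):
    """Call operation(), retrying throttled attempts with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # full jitter spreads retries apart
```

The `sleep` parameter exists only so the delay can be replaced in tests. Jitter matters because many clients retrying on the same schedule tend to hit the limit again at the same moment.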
Monitor your resource utilization proactively. Set up CloudWatch alarms, Azure Monitor alerts, or Google Cloud Monitoring alerting policies to warn you when you're approaching 80% of any quota. Many production outages start with hitting an obscure limit that nobody knew existed.
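The 80% rule is easy to express in code if you already collect usage and limit numbers somewhere. This sketch assumes you have both as plain dictionaries; the quota names and figures are made up, and how you fetch real values depends on the provider and service.

```python
def quotas_near_limit(usage, limits, threshold=0.8):
    """Return {name: utilization} for quotas at or above the threshold fraction."""
    flagged = {}
    for name, limit in limits.items():
        used = usage.get(name, 0)
        if limit > 0 and used / limit >= threshold:
            flagged[name] = round(used / limit, 2)
    return flagged


# Hypothetical numbers for illustration only.
usage = {"lambda_concurrency": 850, "ec2_instances": 10}
limits = {"lambda_concurrency": 1000, "ec2_instances": 20}

print(quotas_near_limit(usage, limits))
```

In practice you would run a check like this on a schedule and route the result to the same alerting channel as your other alarms.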
Networking & Connectivity Issues
Cloud networking errors are notoriously hard to debug because the network topology is invisible by default. A Lambda function that can't reach an RDS instance, an ECS task that can't pull a container image, or a VM that can't resolve DNS — these all point to networking misconfiguration.
Start with the basics: Is the resource in the right VPC and subnet? Do the security groups and network ACLs allow the traffic on the correct port? Is there a route to the destination (internet gateway for public, NAT gateway for private subnets)? For cross-service communication within AWS, VPC endpoints can eliminate the need for internet routing entirely.
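The security group question ("is this traffic allowed on this port?") can be checked offline once you have the rules in hand. The sketch below uses a simplified, hypothetical rule shape (`protocol`, `from_port`, `to_port`, `cidr`) rather than any provider's exact API response, and it covers only CIDR-based inbound rules, not rules that reference other security groups.

```python
import ipaddress


def inbound_allows(rules, source_ip, port, protocol="tcp"):
    """True if any rule permits the given source address, port, and protocol."""
    source = ipaddress.ip_address(source_ip)
    for rule in rules:
        # "-1" is used here to mean "all protocols", following the AWS convention.
        protocol_ok = rule["protocol"] in (protocol, "-1")
        port_ok = rule["protocol"] == "-1" or rule["from_port"] <= port <= rule["to_port"]
        source_ok = source in ipaddress.ip_network(rule["cidr"])
        if protocol_ok and port_ok and source_ok:
            return True
    return False  # anything not explicitly allowed is denied


# Hypothetical rule: allow PostgreSQL traffic from one application subnet.
rules = [{"protocol": "tcp", "from_port": 5432, "to_port": 5432, "cidr": "10.0.1.0/24"}]

print(inbound_allows(rules, "10.0.1.25", 5432))
print(inbound_allows(rules, "10.0.2.25", 5432))
```

A passing check here only rules out the security group. Network ACLs and route tables still need to be verified separately.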
DNS resolution failures are particularly sneaky in cloud environments. Custom VPCs may need DNS hostnames and DNS resolution enabled. Private hosted zones need to be associated with the correct VPCs. And if you're using service discovery, the namespace configuration must match across services.
Cold Starts & Timeout Errors
Serverless functions (Lambda, Cloud Functions, Azure Functions) and container services (Cloud Run, ECS, App Service) all suffer from cold start latency. When a function hasn't been invoked recently, the platform needs to provision a new execution environment, which adds hundreds of milliseconds to several seconds of latency.
Timeout errors compound this problem. If your function's configured timeout is too short to accommodate a cold start plus the actual processing time, it'll fail intermittently — working fine during sustained traffic but failing after idle periods. Increase timeouts to account for cold starts, use provisioned concurrency or minimum instances to keep environments warm, and optimize initialization code (lazy-load dependencies, reuse database connections across invocations).
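Reusing a connection across invocations usually comes down to keeping it in module scope and creating it lazily. The sketch below shows that pattern in a generic form; `handler` and `connect` are hypothetical names, and passing `connect` in as an argument is a choice made here for testability rather than how any particular platform invokes your function.

```python
_connection = None  # module scope survives between invocations in a warm environment


def get_connection(connect):
    """Create the connection on first use, then reuse it on later calls."""
    global _connection
    if _connection is None:
        _connection = connect()  # paid once per execution environment
    return _connection


def handler(event, connect):
    """Hypothetical entry point; a real handler's signature depends on the platform."""
    conn = get_connection(connect)
    return {"connection_id": id(conn), "echo": event}
```

Because the connection is created on first use rather than at import time, a cold start that never touches the database does not pay for it. A production version would also need to handle connections that have gone stale while the environment sat idle.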
For container services, health check misconfigurations are a top cause of restarts and apparent timeouts. Ensure your health check path responds quickly and doesn't depend on downstream services that might be slow during startup.
Quick Troubleshooting Guide
| Symptom | Likely Cause | First Step |
|---|---|---|
| Access Denied / 403 Forbidden | IAM policy missing or denying the action | Check CloudTrail/Activity Log for the denied API call; review IAM policies |
| Function timeout | Cold start + processing exceeds timeout limit | Increase timeout setting; add provisioned concurrency; optimize init code |
| Cannot connect to database from Lambda/function | VPC/subnet/security group misconfiguration | Verify the function is in the same VPC as the database (or has a route to it); check the database security group's inbound rules on the database port |
| Container fails health check | App not ready before health check deadline | Increase health check grace period; optimize startup time; check health endpoint |
| S3 Access Denied on upload/download | Bucket policy or object ACL blocking access | Review bucket policy + IAM policy; check bucket ownership and encryption settings |
| Rate exceeded / throttling (429/503) | Service quota or request rate limit hit | Request quota increase; implement backoff; distribute load across regions |
| EC2 instance unreachable | Security group or NACL blocking traffic | Check inbound rules; verify instance has public IP or is behind a load balancer |
| DNS resolution failure in VPC | DNS settings disabled on VPC or missing hosted zone | Enable DNS hostnames/resolution on VPC; associate private hosted zone |