AWS Lambda Timeout, 403 Forbidden, 502 Bad Gateway & Throttling: The Complete Troubleshooting Guide
Fix AWS Lambda timeout, 403 Forbidden, 502 Bad Gateway, throttling, and access denied errors with step-by-step diagnosis commands and proven configuration fixes
- Lambda timeouts are caused by insufficient timeout settings (default 3s), blocked VPC egress, cold-start overhead, or downstream dependencies hanging — increase timeout up to 15 minutes and add connection pooling
- 403 Forbidden and Access Denied errors stem from missing IAM permissions on the function's execution role or resource-based policy; 502 Bad Gateway from API Gateway means Lambda returned a malformed response or crashed during execution
- Throttling (429 TooManyRequestsException) occurs when concurrent executions exceed the account limit (default 1,000) or a reserved concurrency cap — use provisioned concurrency, SQS buffering, or request a limit increase
- Service Unavailable (503) and rate limit errors require exponential backoff with jitter; never use fixed-interval retries against AWS APIs
- Always check CloudWatch Logs (/aws/lambda/<function-name>), X-Ray traces, and the function's CloudWatch metrics (Errors, Throttles, Duration) before guessing at a root cause
| Error / Symptom | Root Cause | Fix Method | Time to Apply | Risk |
|---|---|---|---|---|
| Task timed out after X seconds | Timeout too short or downstream hang | Increase timeout + add async pattern | < 5 min | Low |
| 403 Forbidden (API Gateway) | Missing resource-based policy | Add lambda:InvokeFunction permission | < 5 min | Low |
| AccessDeniedException in logs | IAM role missing action | Attach inline or managed IAM policy | 5-10 min | Low |
| 502 Bad Gateway | Malformed Lambda response body | Fix response JSON structure in handler | 10-30 min | Medium |
| TooManyRequestsException (throttle) | Concurrency limit hit | Reserve concurrency + SQS buffer | 30-60 min | Medium |
| ServiceUnavailableException | Transient AWS service fault | Exponential backoff + retry logic | 15-30 min | Low |
| VPC: ENI errors or outbound calls hang | Subnet out of free IPs, or no NAT Gateway / VPC endpoint for egress | Add subnets with free IPs; add NAT Gateway or VPC endpoints | 30-90 min | High |
| Cold start > timeout | Large deployment package | Lambda SnapStart or provisioned concurrency | 1-2 hrs | Medium |
Understanding AWS Lambda Error Patterns
AWS Lambda failures fall into four categories: execution errors (your code or its timeout), permission errors (IAM and resource policies), integration errors (API Gateway ↔ Lambda contract violations), and capacity errors (throttling and service limits). Each category requires a different diagnostic path.
Diagnosing Lambda Timeout (Task timed out after X.XX seconds)
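The most common Lambda error. CloudWatch Logs show a line like this for the timed-out request (a "Runtime exited with error: signal: killed" message, by contrast, usually means the function ran out of memory rather than time):
abc123 Task timed out after 3.00 seconds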
END RequestId: abc123
REPORT RequestId: abc123 Duration: 3000.00 ms Billed Duration: 3000 ms Init Duration: 450.12 ms
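To check whether a REPORT line actually reached the configured limit, the numeric fields can be pulled out programmatically. A minimal sketch, assuming the "Key: value unit" layout shown above; parse_report_line and hit_timeout are hypothetical helper names, not part of any AWS SDK.

```python
import re

# Each alternative is a full field name. findall scans left to right, so
# "Billed Duration" and "Init Duration" are matched whole at their first
# letter and never produce a stray plain "Duration" entry.
_FIELD = re.compile(
    r"(Init Duration|Billed Duration|Duration|Max Memory Used|Memory Size):\s*([\d.]+)"
)


def parse_report_line(line: str) -> dict:
    """Return the numeric fields of a Lambda REPORT line, e.g. {"Duration": 3000.0, ...}."""
    if not line.startswith("REPORT"):
        raise ValueError("not a REPORT line")
    return {name: float(value) for name, value in _FIELD.findall(line)}


def hit_timeout(report: dict, timeout_seconds: float) -> bool:
    """True if the measured Duration (ms) reached the configured timeout (s)."""
    return report.get("Duration", 0.0) >= timeout_seconds * 1000


report = parse_report_line(
    "REPORT RequestId: abc123 Duration: 3000.00 ms "
    "Billed Duration: 3000 ms Init Duration: 450.12 ms"
)
print(report, hit_timeout(report, timeout_seconds=3))
```

A Duration equal to the timeout, as here, confirms the function was cut off rather than failing on its own.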
Step 1 — Identify where time is being spent. Enable AWS X-Ray active tracing and add subsegment annotations around external calls. If X-Ray is not available, add console.time() / console.timeEnd() (Node.js) or time.perf_counter() (Python) around each I/O boundary.
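A minimal sketch of the time.perf_counter() approach in Python, assuming you log the results yourself rather than through X-Ray. The names timed and timings are hypothetical, and the sleep calls only stand in for real I/O.

```python
import time
from contextlib import contextmanager

# Module scope: this dict persists across warm invocations of one environment.
timings = {}


@contextmanager
def timed(label):
    """Measure the wrapped block with time.perf_counter() and log the result."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so a slow failing call is still recorded.
        timings[label] = time.perf_counter() - start
        print(f"{label}: {timings[label] * 1000:.1f} ms")  # ends up in CloudWatch Logs


def handler(event, context):
    with timed("fetch_config"):
        time.sleep(0.01)  # placeholder for a real HTTP/SDK call
    with timed("query_db"):
        time.sleep(0.02)  # placeholder for a real database query
    return {"statusCode": 200}
```

Wrapping each I/O boundary separately is what makes the log useful: the block whose time approaches the timeout is the one to investigate.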
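Step 2 — Check VPC configuration. A VPC-attached function reaches the network through Elastic Network Interfaces (ENIs) that Lambda creates in your subnets when the function is created or its VPC settings change. If a subnet runs out of free IP addresses, ENI creation fails and the function cannot reach the Active state; if the subnets have no route to a NAT Gateway (or VPC endpoints for the AWS services you call), outbound calls hang until the function times out. Verify the Lambda ENIs: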
aws ec2 describe-network-interfaces \
--filters Name=description,Values="AWS Lambda VPC ENI*" \
--query 'NetworkInterfaces[].{SubnetId:SubnetId,Status:Status,PrivateIp:PrivateIpAddress}'
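Before adding subnets, it helps to know how many addresses a subnet can hold at all. This sketch assumes the standard AWS rule that five addresses in every VPC subnet are reserved; for an existing subnet, the live figure is the AvailableIpAddressCount field returned by aws ec2 describe-subnets. usable_ips and free_ips are hypothetical names.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every VPC subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr):
    """Addresses that ENIs and instances can actually be assigned in this subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def free_ips(cidr, in_use):
    """Addresses left once `in_use` of them are held by existing ENIs/instances."""
    return usable_ips(cidr) - in_use


for cidr in ("10.0.1.0/24", "10.0.2.0/28"):
    print(cidr, usable_ips(cidr))  # /24 -> 251, /28 -> 11
```

Small subnets such as a /28 leave very little room once other resources share them, which is why dedicated, larger subnets are the usual fix.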
Step 3 — Check downstream dependencies. If your function calls RDS, ElastiCache, or an external HTTP API, verify those endpoints are reachable and not saturated. Each concurrent execution environment holds its own clients and connections; nothing is pooled across environments. Create clients outside the handler so warm invocations reuse them, and for RDS use RDS Proxy to pool connections. Opening a new database connection on every invocation is a common cause of timeouts under load.
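The reuse pattern looks like this in Python. FakeConnection is a hypothetical stand-in for a real database driver and the DSN is a placeholder for an RDS Proxy endpoint; the point is that the connection is opened once per execution environment, at module scope, and reused by later warm invocations.

```python
class FakeConnection:
    """Hypothetical stand-in for a real driver's connection object."""
    opened = 0  # counts how many connections were ever opened

    def __init__(self, dsn: str):
        FakeConnection.opened += 1
        self.dsn = dsn

    def query(self, sql: str) -> str:
        return f"result of {sql!r}"


_conn = None  # module scope: survives across warm invocations of this environment


def get_connection() -> FakeConnection:
    """Open the connection lazily, once per execution environment, then reuse it."""
    global _conn
    if _conn is None:
        _conn = FakeConnection("postgres://my-proxy-endpoint/db")  # placeholder DSN
    return _conn


def handler(event, context):
    conn = get_connection()  # not a fresh connect() on every invocation
    return conn.query("SELECT 1")


for _ in range(3):
    handler({}, None)
print(FakeConnection.opened)  # 1
```

This only helps within one environment: 50 concurrent environments still hold 50 connections, which is the problem RDS Proxy addresses.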
Step 4 — Increase timeout (to a sensible max). For background jobs, use up to 900 seconds (15 minutes). For functions behind API Gateway, keep the timeout below the 29-second default integration timeout; a longer Lambda timeout only means API Gateway returns 504 while the function keeps running (and billing):
aws lambda update-function-configuration \
--function-name my-function \
--timeout 25
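Raising the timeout only moves the limit. For batch-style work, a complementary pattern is to check the remaining time the runtime reports (context.get_remaining_time_in_millis() in the Python runtime) and stop early, handing unfinished items back to the caller. A sketch under those assumptions: process_batch, the 2-second SAFETY_MARGIN_MS, and FakeContext (a local test double) are illustrative, not AWS APIs.

```python
import time

SAFETY_MARGIN_MS = 2_000  # assumption: leave 2 s to checkpoint and return cleanly


def process_batch(items, context, work):
    """Process items until the invocation is close to its timeout.

    Returns (done, remaining) so the caller can re-queue `remaining`
    instead of being killed mid-item by "Task timed out".
    """
    done = []
    for i, item in enumerate(items):
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            return done, items[i:]
        done.append(work(item))
    return done, []


class FakeContext:
    """Local stand-in for the Lambda context object, for testing only."""

    def __init__(self, timeout_ms: int):
        self._deadline = time.monotonic() + timeout_ms / 1000

    def get_remaining_time_in_millis(self) -> int:
        return max(0, int((self._deadline - time.monotonic()) * 1000))
```

In a real handler you would pass the context Lambda provides and push the returned remainder to a queue for the next invocation.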
Diagnosing 403 Forbidden and Access Denied
Two distinct surfaces produce these errors.
API Gateway → Lambda: 403 Forbidden
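A common cause is that API Gateway lacks permission to invoke the function. The function's resource-based policy must grant lambda:InvokeFunction to the apigateway.amazonaws.com principal; the stage's execution logs record this failure as "Invalid permissions on Lambda function", and clients may see it as a 500 rather than a 403. A bare 403 Forbidden can also come from a missing API key, an API resource policy, or an authorizer denying the request.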
Error in API Gateway test console:
{
"message": "Forbidden"
}
Fix — add the invocation permission:
aws lambda add-permission \
--function-name my-function \
--statement-id apigateway-prod \
--action lambda:InvokeFunction \
--principal apigateway.amazonaws.com \
--source-arn "arn:aws:execute-api:us-east-1:123456789:abc123def/*/GET/endpoint"
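The --source-arn follows the execute-api ARN layout visible in the command above: region, account, API ID, then stage, HTTP method, and resource path. A small builder makes it harder to widen the grant by accident. execute_api_source_arn is a hypothetical helper, and its "*" defaults are the broadest (least restrictive) choice.

```python
def execute_api_source_arn(region, account_id, api_id, stage="*", method="*", path="*"):
    """Build the --source-arn value for `aws lambda add-permission`.

    Layout: arn:aws:execute-api:<region>:<account-id>:<api-id>/<stage>/<METHOD>/<path>
    Each "*" widens that segment; keep segments specific for least privilege.
    """
    path = path.lstrip("/")  # the ARN's path segment has no leading slash
    return f"arn:aws:execute-api:{region}:{account_id}:{api_id}/{stage}/{method.upper()}/{path}"


print(execute_api_source_arn("us-east-1", "123456789", "abc123def", method="GET", path="/endpoint"))
```

With these arguments the output matches the --source-arn used in the add-permission command above.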
IAM execution role: AccessDeniedException
You will see this in CloudWatch Logs when the Lambda tries to call another AWS service:
botocore.exceptions.ClientError: An error occurred (AccessDeniedException)
when calling the GetSecretValue operation: User:
arn:aws:sts::123456789:assumed-role/my-lambda-role/my-function
is not authorized to perform: secretsmanager:GetSecretValue
on resource: arn:aws:secretsmanager:us-east-1:123456789:secret:my-secret
Fix — attach the required policy to the execution role:
aws iam put-role-policy \
--role-name my-lambda-role \
--policy-name SecretsAccess \
--policy-document '{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": ["secretsmanager:GetSecretValue"],
"Resource": "arn:aws:secretsmanager:us-east-1:123456789:secret:my-secret*"
}]
}'
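Always apply least privilege: scope Resource to the specific secret's ARN, not *. The trailing * in my-secret* is there only because Secrets Manager appends a hyphen and six random characters to every secret ARN.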
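If you generate policies from code instead of pasting JSON, the same least-privilege document can be built as data and serialized. This sketch simply reproduces the put-role-policy example above; secret_read_policy is a hypothetical name, and its output string is what you would pass to --policy-document.

```python
import json


def secret_read_policy(region, account_id, secret_name):
    """Return inline-policy JSON that allows reading exactly one secret."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            # Trailing "*" absorbs the random suffix Secrets Manager adds to the ARN.
            "Resource": f"arn:aws:secretsmanager:{region}:{account_id}:secret:{secret_name}*",
        }],
    }
    return json.dumps(policy, indent=2)


doc = secret_read_policy("us-east-1", "123456789", "my-secret")
print(doc)
```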
Diagnosing 502 Bad Gateway
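When API Gateway returns 502 Bad Gateway, what came back from Lambda did not conform to the expected proxy integration response format. Unlike a 504, API Gateway did get an answer within its own integration timeout, but the function either returned something malformed or failed (an unhandled exception, a crash, or its own Lambda timeout) so no valid response came back.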
Common causes:
- Handler returned None/undefined instead of a dict/object with statusCode
- body field is not a string (must be JSON-stringified)
- Unhandled exception propagated to the runtime
API Gateway execution logs (enable in Stage settings) will show:
Endpoint response body before transformations: null
Execution failed due to configuration error: Malformed Lambda proxy response
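The causes listed above can be checked locally before deploying. A minimal sketch, assuming a REST API Lambda proxy integration (payload format 1.0); proxy_response_problems is a hypothetical helper, it treats a non-integer statusCode as a problem, and it checks only the shape of the response, not header or body contents.

```python
def proxy_response_problems(resp):
    """List reasons API Gateway would likely reject `resp` as a proxy response.

    An empty list means the shape looks valid.
    """
    if not isinstance(resp, dict):
        return [f"handler returned {type(resp).__name__}, expected a dict"]
    problems = []
    if not isinstance(resp.get("statusCode"), int):
        problems.append("statusCode missing or not an integer")
    if "body" in resp and not isinstance(resp["body"], str):
        problems.append("body must be a string (use json.dumps)")
    if "headers" in resp and not isinstance(resp["headers"], dict):
        problems.append("headers must be an object")
    return problems


print(proxy_response_problems({"statusCode": 200, "body": {"message": "ok"}}))
```

Calling this in a unit test against each handler's return value catches the 502 before it reaches API Gateway.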
Correct response format (Python):
import json

def handler(event, context):
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": "ok"}),  # body MUST be a string
    }
Common mistake — returning a dict as body directly:
# WRONG - causes 502
return {"statusCode": 200, "body": {"message": "ok"}}
# RIGHT
return {"statusCode": 200, "body": json.dumps({"message": "ok"})}
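To rule out the unhandled-exception case as well, the handler can be wrapped so that every code path returns a well-formed proxy response. A sketch, assuming handlers are written to return a (status, payload) pair; proxy_handler is a hypothetical decorator, not part of any AWS library.

```python
import functools
import json
import logging

logger = logging.getLogger(__name__)


def proxy_handler(fn):
    """Turn any (status, payload) result or exception into a valid proxy response."""
    @functools.wraps(fn)
    def wrapper(event, context):
        try:
            status, payload = fn(event, context)
        except Exception:
            # Log the traceback to CloudWatch, but still return a well-formed 500.
            logger.exception("Unhandled error in handler")
            status, payload = 500, {"message": "Internal server error"}
        return {
            "statusCode": status,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(payload),  # always a string
        }
    return wrapper


@proxy_handler
def handler(event, context):
    if event.get("fail"):
        raise RuntimeError("boom")
    return 200, {"message": "ok"}
```

One trade-off: because the wrapper converts exceptions into 500 responses, those invocations succeed from Lambda's point of view and no longer appear in the function's Errors metric, so rely on the logged traceback (or a custom metric) to track them.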
Diagnosing Lambda Throttling and Rate Limits
Throttled invocations return TooManyRequestsException (HTTP 429). There are two throttle types:
- Account-level concurrency limit: Default 1,000 concurrent executions per region. Shared across all functions.
- Function-level reserved concurrency: If a function has reserved concurrency set to N, it will throttle at N concurrent executions regardless of account limit.
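To judge how close a workload runs to either limit, estimate its steady-state concurrency as request rate multiplied by average duration, the formula AWS uses in its concurrency documentation. A sketch; required_concurrency and throttle_risk are hypothetical names, and bursts above the average need more headroom than this estimate shows.

```python
import math


def required_concurrency(requests_per_second, avg_duration_ms):
    """Steady-state concurrent executions = request rate x average duration (s)."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000)


def throttle_risk(requests_per_second, avg_duration_ms, limit):
    """Return (needed, headroom); negative headroom means invocations will throttle."""
    needed = required_concurrency(requests_per_second, avg_duration_ms)
    return needed, limit - needed


# 500 req/s at 2.5 s average duration exceeds the default 1,000 account limit.
print(throttle_risk(500, 2500, 1000))  # (1250, -250)
```

The same arithmetic shows why shortening duration (for example by fixing a slow downstream call) reduces throttling as effectively as raising the limit.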
Check current account limits and usage:
# View account concurrency limit
aws lambda get-account-settings
# View per-function reserved concurrency
aws lambda get-function-concurrency --function-name my-function
# View throttle metrics for last hour
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Throttles \
--dimensions Name=FunctionName,Value=my-function \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 300 \
--statistics Sum
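The command prints JSON with a Datapoints array, one entry per 5-minute period. A sketch for summarizing it, assuming the default JSON output format; summarize_throttles is a hypothetical helper, and the sample payload uses illustrative values.

```python
import json


def summarize_throttles(cli_output):
    """Total throttles and the worst period from get-metric-statistics JSON."""
    points = json.loads(cli_output).get("Datapoints", [])
    if not points:
        return {"total": 0.0, "worst_period": None, "worst_sum": 0.0}
    worst = max(points, key=lambda p: p["Sum"])
    return {
        "total": sum(p["Sum"] for p in points),
        "worst_period": worst["Timestamp"],
        "worst_sum": worst["Sum"],
    }


sample = """{"Label": "Throttles", "Datapoints": [
  {"Timestamp": "2024-01-01T10:00:00Z", "Sum": 0.0, "Unit": "Count"},
  {"Timestamp": "2024-01-01T10:05:00Z", "Sum": 42.0, "Unit": "Count"}]}"""
print(summarize_throttles(sample))
```

Datapoints are not guaranteed to come back in time order, which is why the sketch picks the worst period with max() rather than by position.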
Fix options in order of preference:
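- SQS as a buffer: Place an SQS queue in front of Lambda. Lambda polls at a controlled rate, smoothing bursts. Set MaximumConcurrency (in the event source mapping's ScalingConfig) to cap how many concurrent invocations the queue can drive.
- Provisioned concurrency: Keeps N execution environments initialized. Eliminates cold starts for those environments and guarantees that capacity; it counts against your concurrency limit rather than raising it.
- Limit increase request: Request account-level concurrency increases through Service Quotas or AWS Support (allow 48-72 hours).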
# Set provisioned concurrency on an alias
aws lambda put-provisioned-concurrency-config \
--function-name my-function \
--qualifier prod \
--provisioned-concurrent-executions 50
Diagnosing Service Unavailable (503)
ServiceUnavailableException is a transient AWS-side fault. Your code should implement exponential backoff with jitter:
import boto3, time, random
from botocore.config import Config
client = boto3.client('lambda', config=Config(
retries={
'max_attempts': 5,
'mode': 'adaptive' # adaptive mode adds client-side rate limiting
}
))
For custom retry logic:
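def invoke_with_backoff(client, max_retries=5, base=0.5, cap=30, **kwargs):
    """Call client.invoke(), retrying transient errors with capped exponential backoff and full jitter."""
    retryable = (
        client.exceptions.ServiceException,          # 5xx from the Lambda service
        client.exceptions.TooManyRequestsException,  # 429 throttling
    )
    attempt = 0
    while True:
        try:
            return client.invoke(**kwargs)
        except retryable:
            attempt += 1
            if attempt > max_retries:
                raise  # out of retries: surface the last error
            # Full jitter: sleep a random time between 0 and the capped exponential delay.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))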
Centralized Diagnostic Runbook
For any Lambda error, run this triage sequence before changing configuration:
# 1. Tail recent log events
aws logs tail /aws/lambda/my-function --since 1h --format short
# 2. List error events from the last hour
aws logs filter-log-events \
--log-group-name /aws/lambda/my-function \
--filter-pattern "?ERROR ?error ?Exception" \
--start-time $(($(date +%s) - 3600))000 \
--query 'events[].message'
# 3. Check function configuration
aws lambda get-function-configuration --function-name my-function \
--query '{Timeout:Timeout,MemorySize:MemorySize,Role:Role,VpcConfig:VpcConfig}'
# 4. Inspect execution role policies
ROLE=$(aws lambda get-function-configuration --function-name my-function \
--query 'Role' --output text | awk -F/ '{print $NF}')
aws iam list-attached-role-policies --role-name "$ROLE"
aws iam list-role-policies --role-name "$ROLE"
# 5. Check resource-based policy
aws lambda get-policy --function-name my-function --query 'Policy' --output text | python3 -m json.tool
#!/usr/bin/env bash
# Lambda Error Triage Script
# Usage: FUNCTION=my-function REGION=us-east-1 bash triage-lambda.sh
set -euo pipefail
FUNCTION=${FUNCTION:?"Set FUNCTION env var"}
REGION=${REGION:-us-east-1}
SINCE=${SINCE:-1h}
echo "=== Lambda Triage: $FUNCTION ==="
echo ""
echo "--- Function Configuration ---"
aws lambda get-function-configuration \
--function-name "$FUNCTION" --region "$REGION" \
--query '{Timeout:Timeout,Memory:MemorySize,Runtime:Runtime,Role:Role,VpcConfig:VpcConfig,LastModified:LastModified}'
echo ""
echo "--- Concurrency Settings ---"
aws lambda get-function-concurrency \
--function-name "$FUNCTION" --region "$REGION" 2>/dev/null || echo "No reserved concurrency set"
echo ""
echo "--- Account Concurrency Limit ---"
aws lambda get-account-settings --region "$REGION" \
--query 'AccountLimit.{ConcurrentExecutions:ConcurrentExecutions,UnreservedConcurrentExecutions:UnreservedConcurrentExecutions}'
echo ""
echo "--- Resource-Based Policy ---"
aws lambda get-policy \
--function-name "$FUNCTION" --region "$REGION" \
--query 'Policy' --output text 2>/dev/null | python3 -m json.tool || echo "No resource-based policy"
echo ""
echo "--- Execution Role Policies ---"
ROLE=$(aws lambda get-function-configuration \
--function-name "$FUNCTION" --region "$REGION" \
--query 'Role' --output text | awk -F/ '{print $NF}')
echo "Role: $ROLE"
aws iam list-attached-role-policies --role-name "$ROLE" \
--query 'AttachedPolicies[].PolicyName'
aws iam list-role-policies --role-name "$ROLE" \
--query 'PolicyNames'
echo ""
echo "--- Recent Errors (last $SINCE) ---"
aws logs tail "/aws/lambda/$FUNCTION" \
--since "$SINCE" \
--filter-pattern '?ERROR ?Exception ?Timeout ?throttl' \
--region "$REGION" 2>/dev/null | tail -50 || echo "No log group found or no matching events"
echo ""
echo "--- Throttle Metric (last 30 min) ---"
START=$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || \
date -u -v-30M +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Throttles \
--dimensions Name=FunctionName,Value="$FUNCTION" \
--start-time "$START" --end-time "$END" \
--period 300 --statistics Sum \
--region "$REGION" \
--query 'sort_by(Datapoints,&Timestamp)[].{Time:Timestamp,Throttles:Sum}'
echo ""
echo "--- Duration P99 (last 30 min) ---"
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Duration \
--dimensions Name=FunctionName,Value="$FUNCTION" \
--start-time "$START" --end-time "$END" \
--period 300 --extended-statistics p99 \
--region "$REGION" \
--query 'sort_by(Datapoints,&Timestamp)[].{Time:Timestamp,P99ms:ExtendedStatistics.p99}' 2>/dev/null || echo "Could not retrieve p99 Duration"
echo ""
echo "=== Triage complete ==="
Error Medic Editorial
The Error Medic Editorial team comprises senior DevOps engineers and SRE practitioners with hands-on experience operating large-scale AWS workloads. We focus on actionable, command-level troubleshooting guides grounded in real production incidents.