Fixing 'ProvisionedThroughputExceededException': DynamoDB Slow Query & Timeout Troubleshooting
Resolve DynamoDB slow query, timeout, and throttling errors (ProvisionedThroughputExceededException) with actionable scaling, indexing, and query optimization strategies.
- Root Cause 1: Insufficient provisioned read/write capacity units (RCU/WCU) leading to throttling.
- Root Cause 2: Inefficient queries using 'Scan' instead of 'Query', or lacking proper Global Secondary Indexes (GSIs).
- Root Cause 3: Hot partitions caused by poorly distributed partition keys.
- Quick Fix Summary: Enable On-Demand capacity or Auto Scaling, convert Scans to Queries, and implement exponential backoff.
| Method | When to Use | Time to Implement | Risk / Cost Impact |
|---|---|---|---|
| Switch to On-Demand Capacity | Immediate relief for unpredictable traffic spikes | 5 mins | High cost for sustained high traffic |
| Add/Optimize GSIs | Queries filter on non-key attributes frequently | Hours to Days | Increases storage and WCU costs |
| Implement Exponential Backoff | Handling transient network or throttling timeouts | 1-2 Hours | Low risk, requires code deployment |
| Redesign Partition Keys | Persistent hot partition issues and table locks | Weeks | High risk, requires data migration |
Understanding DynamoDB Slow Queries and Timeouts
When working with Amazon DynamoDB, encountering a slow query or a sudden timeout often points to one of a few common architectural or configuration bottlenecks. Unlike a traditional RDBMS, DynamoDB doesn't suffer from "table locks" in the same way, but it does experience partition-level throttling and throughput limits that manifest as extreme latency or connection timeouts. The most common error developers see is `ProvisionedThroughputExceededException`, or simply elevated HTTP 5xx errors and SDK timeouts (`SdkClientException: Unable to execute HTTP request: Read timed out`).
Step 1: Diagnose the Bottleneck
Before changing configurations, you must identify whether the issue is throttling, inefficient querying, or network-level timeouts.
- Check CloudWatch Metrics: Compare `ProvisionedReadCapacityUnits` vs `ConsumedReadCapacityUnits`, and specifically monitor `ReadThrottleEvents` and `WriteThrottleEvents`. If throttles are high, your capacity is too low or you have a hot partition.
- Enable Contributor Insights: This DynamoDB feature helps identify the exact partition keys that are being accessed most frequently (the "hot keys").
- Review the Code: Are you using `Scan` or `Query`? A `Scan` operation reads the entire table before filtering, which is notoriously slow and expensive. A `Query` uses the partition key and is highly efficient.
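As a rough illustration of reading those CloudWatch numbers, the throttle signal can be reduced to a simple ratio. The 1% threshold below is an illustrative rule of thumb, not an AWS-documented limit:

```python
def throttle_ratio(throttle_events: int, consumed_units: float) -> float:
    """Fraction of read activity in a CloudWatch period that was throttled.

    throttle_events: Sum of ReadThrottleEvents for the period.
    consumed_units:  Sum of ConsumedReadCapacityUnits for the same period.
    """
    total = throttle_events + consumed_units
    return throttle_events / total if total else 0.0


def needs_attention(throttle_events: int, consumed_units: float) -> bool:
    # Illustrative heuristic: sustained throttling above ~1% of activity
    # usually means capacity is too low or one partition is hot.
    return throttle_ratio(throttle_events, consumed_units) > 0.01
```

If the ratio is high while overall consumed capacity is well below provisioned capacity, suspect a hot partition rather than a table-wide shortfall.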
Step 2: Immediate Mitigation (The Quick Fix)
If your production environment is currently failing with timeouts and throttling, the fastest mitigation is adjusting capacity.
- Switch to On-Demand Capacity: If you are currently using Provisioned capacity without Auto Scaling, switch the table to On-Demand. This allows DynamoDB to instantly accommodate the traffic spike, though it comes at a higher per-request cost.
- Increase Provisioned Capacity: If you prefer to stay on Provisioned, manually bump the Read Capacity Units (RCUs) and Write Capacity Units (WCUs) well above the current consumed metrics.
Step 3: Long-Term Fixes and Optimization
To prevent slow queries and timeouts permanently, you need to address the root causes at the application and schema level.
1. Replace Scans with Queries
Never use Scan for real-time application access. If you need to retrieve items based on attributes that are not the primary key, create a Global Secondary Index (GSI). This allows you to perform a fast Query against the new index.
2. Implement Exponential Backoff
When DynamoDB throttles a request, it expects the client to retry. Ensure your AWS SDK is configured with an appropriate retry policy. The default SDKs usually handle this, but if you have a strict upstream limit such as API Gateway's 29-second timeout, the retries may cause the API call to time out before DynamoDB succeeds.
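A minimal sketch of full-jitter exponential backoff, assuming a stand-in `ThrottledError` exception and illustrative delay parameters (real AWS SDKs raise `ProvisionedThroughputExceededException` and already ship configurable retry modes, so treat this as a model of the behavior, not a replacement for the SDK's retry logic):

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for the SDK's ProvisionedThroughputExceededException."""


def call_with_backoff(fn, max_retries=5, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Retry fn() on throttling with full-jitter exponential backoff.

    The delay before retry i is uniform in [0, min(cap, base * 2**i)],
    so many concurrent clients do not retry in lockstep.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_retries:
                raise  # retry budget exhausted; surface the error
            sleep(rng() * min(cap, base * 2 ** attempt))
```

Keep the worst-case total of all delays below any upstream timeout (for example, API Gateway's 29 seconds), otherwise the client keeps retrying after the caller has already given up.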
3. Resolve Hot Partitions
DynamoDB partitions data based on the partition key. If a massive volume of reads/writes targets the same key simultaneously, that specific partition hits its hard per-partition limit (3,000 RCUs or 1,000 WCUs), even if the table overall has plenty of capacity. This mimics the symptoms of a "table lock". Fix this by appending a random suffix to the partition key (write sharding) or ensuring your keys have high cardinality (e.g., UserID instead of Status=Active).
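The write-sharding pattern can be sketched as below; the helper names and the shard count are illustrative, not part of any AWS API. Writes pick a random shard suffix, and reads must fan out one `Query` per shard and merge the results:

```python
import random

NUM_SHARDS = 10  # illustrative; size this to your peak write volume


def write_shard_key(base_key: str, num_shards: int = NUM_SHARDS,
                    rng=random.randrange) -> str:
    """Partition key for a write: spreads one hot logical key
    (e.g. "ACTIVE") across num_shards physical partitions."""
    return f"{base_key}#{rng(num_shards)}"


def all_shard_keys(base_key: str, num_shards: int = NUM_SHARDS) -> list[str]:
    """Every partition key a read must query (one Query per shard,
    results merged client-side)."""
    return [f"{base_key}#{i}" for i in range(num_shards)]
```

The trade-off is explicit: writes now spread across `num_shards` partitions, but reads cost `num_shards` queries, so only shard keys that are genuinely hot.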
Useful Commands
```bash
# 1. Check CloudWatch for Throttling Events
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ReadThrottleEvents \
  --dimensions Name=TableName,Value=YourTableName \
  --start-time $(date -u -d '-1 hour' '+%Y-%m-%dT%H:%M:%SZ') \
  --end-time $(date -u '+%Y-%m-%dT%H:%M:%SZ') \
  --period 300 \
  --statistics Sum

# 2. Update Table to On-Demand Capacity (Immediate Fix)
aws dynamodb update-table \
  --table-name YourTableName \
  --billing-mode PAY_PER_REQUEST

# 3. Example of a bad SCAN vs good QUERY in AWS CLI
# BAD (Slow, consumes high RCU):
aws dynamodb scan \
  --table-name Users \
  --filter-expression "#st = :status" \
  --expression-attribute-names '{"#st": "Status"}' \
  --expression-attribute-values '{ ":status": {"S": "ACTIVE"} }'

# GOOD (Fast, uses GSI):
aws dynamodb query \
  --table-name Users \
  --index-name StatusIndex \
  --key-condition-expression "#st = :status" \
  --expression-attribute-names '{"#st": "Status"}' \
  --expression-attribute-values '{ ":status": {"S": "ACTIVE"} }'
```

Error Medic Editorial
Error Medic Editorial is a team of certified Cloud Architects and SREs dedicated to resolving the toughest infrastructure bottlenecks and database performance issues.