DynamoDB Slow Query, Timeout & Table Lock: Complete Troubleshooting Guide
Fix DynamoDB slow queries, ProvisionedThroughputExceededException, and timeout errors. Step-by-step diagnosis with AWS CLI commands and proven solutions.
- Hot partitions are the #1 root cause: a poorly chosen partition key concentrates traffic on a single shard, exhausting its 1,000 WCU/3,000 RCU slice and triggering ProvisionedThroughputExceededException.
- DynamoDB has no traditional table locks, but TransactionConflictException and throttling produce identical symptoms — requests queue, latency spikes, and SDKs surface timeout errors after exhausting retries.
- Scan operations are catastrophic at scale: a full-table Scan on a 100 GB table can monopolize all provisioned RCUs for minutes to hours, throttling everything else; replacing Scans with Query + GSI routinely cuts p99 latency by one to two orders of magnitude.
- Quick fix sequence: (1) switch to on-demand capacity for instant headroom, (2) enable exponential backoff with jitter in your SDK, (3) add a high-cardinality suffix to your partition key, (4) evaluate DynamoDB Accelerator (DAX) for read-heavy workloads.
| Method | When to Use | Time to Apply | Risk |
|---|---|---|---|
| Switch to On-Demand Capacity | Unpredictable traffic spikes; provisioned capacity consistently exhausted | < 5 minutes (instant via console/CLI) | Low — pay-per-request pricing may increase cost |
| Increase Provisioned RCU/WCU | Predictable load with known baseline; want cost ceiling | < 2 minutes; takes effect immediately | Low — requires forecast accuracy to avoid over/under-provisioning |
| Redesign Partition Key (write sharding) | Chronic hot-partition errors on a specific key value (e.g., user_id='admin') | Hours to days — requires data migration | Medium — schema change; rollback is complex |
| Add a GSI on query attributes | Slow queries due to full-table Scans; filtering on non-key attributes | Minutes to create; backfill takes hours for large tables | Low — non-destructive; GSI has its own capacity |
| Deploy DynamoDB Accelerator (DAX) | Read-heavy workloads (>80% reads); p99 latency must be < 1 ms | 30–60 minutes (cluster provisioning) | Medium — adds infrastructure cost and DAX-specific SDK requirement |
| Tune SDK retry/backoff settings | ProvisionedThroughputExceededException on burst traffic; SDK swallows retries silently | Minutes — code change only | Low — increased retry count raises end-to-end latency on failures |
| Enable DynamoDB Auto Scaling | Recurring throttling at predictable but variable peaks (e.g., daily batch jobs) | Minutes to configure; scaling lag is 5–10 minutes | Low — scaling lag means brief throttling during sudden spikes is still possible |
Understanding DynamoDB Slow Queries, Timeouts, and "Table Lock" Behavior
DynamoDB is engineered for single-digit-millisecond latency at any scale, but production teams routinely file incidents titled "DynamoDB table lock" or "DynamoDB hung" when they actually mean: requests are timing out, retries are exhausted, and the application has stalled. Understanding why separates a 5-minute fix from a multi-hour war room.
DynamoDB distributes data across internal partitions. Each partition handles a ceiling of 3,000 Read Capacity Units (RCUs) and 1,000 Write Capacity Units (WCUs) per second. When a single logical partition key attracts more traffic than its partition can serve, DynamoDB returns ProvisionedThroughputExceededException. AWS SDKs retry with exponential backoff by default — but if the condition persists, the SDK raises a final timeout exception that your application surfaces as a "slow query" or "connection timed out" error.
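The retry behavior is easier to reason about with a small sketch. This models truncated exponential backoff with full jitter, the strategy boto3's 'standard' retry mode is based on; the base delay and cap constants here are illustrative, not the SDK's internal values:

```python
import random

def backoff_delay(attempt, base=0.05, cap=20.0):
    """Truncated exponential backoff with full jitter.

    attempt: 0-based retry attempt number.
    Returns a delay in seconds drawn uniformly from
    [0, min(cap, base * 2**attempt)].
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# The no-jitter ceiling grows geometrically until it hits the cap:
ceilings = [min(20.0, 0.05 * 2 ** a) for a in range(10)]
```

The jitter matters: without it, every throttled client retries in lockstep and re-creates the same traffic spike that caused the throttling.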
Symptoms and Their Exact Error Messages
Developers encounter several distinct error signatures depending on SDK and language:
Python (boto3):
botocore.exceptions.ClientError: An error occurred (ProvisionedThroughputExceededException)
when calling the GetItem operation: The level of configured provisioned throughput for
the table was exceeded. Consider increasing your provisioning level with the
UpdateTable API.
Java (AWS SDK v2):
software.amazon.awssdk.services.dynamodb.model.ProvisionedThroughputExceededException:
The level of configured provisioned throughput for the table was exceeded.
Node.js:
ProvisionedThroughputExceededException: The level of configured provisioned throughput
for the table was exceeded.
at Request.extractError (/node_modules/aws-sdk/lib/protocol/json.js:51:27)
Transaction conflict (mimics a lock):
TransactionCanceledException: Transaction cancelled, please refer cancellation reasons
for specific reasons [TransactionConflict]
General timeout (SDK gave up retrying):
RequestTimeout: Request took too long to complete. (Node.js)
ReadTimeout: Read timeout on endpoint URL (Python)
connect ETIMEDOUT (Node.js low-level)
Step 1: Diagnose — Isolate the Root Cause in CloudWatch
Before touching any configuration, pull the CloudWatch metrics that tell the truth. Five key metrics, all in the AWS/DynamoDB namespace, are worth examining:
| Metric | Namespace | What it reveals |
|---|---|---|
| ThrottledRequests | AWS/DynamoDB | Non-zero value confirms throttling is happening |
| SuccessfulRequestLatency | AWS/DynamoDB | p99 > 10 ms on GetItem = partition hotspot |
| ConsumedReadCapacityUnits | AWS/DynamoDB | Compare to provisioned; near 100% = exhausted |
| SystemErrors | AWS/DynamoDB | Non-zero = DynamoDB internal errors (contact AWS Support) |
| ReturnedItemCount | AWS/DynamoDB | High count on Query/Scan = overfetching |
Use AWS CLI to pull throttle history for the last hour:
aws cloudwatch get-metric-statistics \
--namespace AWS/DynamoDB \
--metric-name ThrottledRequests \
--dimensions Name=TableName,Value=YOUR_TABLE_NAME \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 \
--statistics Sum
If ThrottledRequests Sum > 0 in any period, you have confirmed throttling. Now identify whether it's reads or writes:
# Check consumed vs provisioned reads
aws dynamodb describe-table --table-name YOUR_TABLE_NAME \
--query 'Table.{RCU:ProvisionedThroughput.ReadCapacityUnits,
WCU:ProvisionedThroughput.WriteCapacityUnits,
TableStatus:TableStatus,
ItemCount:ItemCount,
SizeBytes:TableSizeBytes}'
To detect hot partitions specifically, enable DynamoDB Contributor Insights:
aws dynamodb update-contributor-insights \
--table-name YOUR_TABLE_NAME \
--contributor-insights-action ENABLE
Once enabled (takes ~15 minutes to populate), navigate to DynamoDB Console → Table → Contributor Insights to see the top-N partition key values consuming the most capacity. If one key accounts for >30% of traffic, you have a hot partition.
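The >30% rule of thumb is simple to apply programmatically. A minimal sketch, assuming you have already exported top-key samples as (key, consumed-capacity) pairs from Contributor Insights or your own access logs — the input shape is an assumption for illustration, not the Contributor Insights API response format:

```python
def find_hot_keys(samples, threshold=0.30):
    """Return partition key values consuming more than `threshold`
    of the total observed capacity.

    samples: iterable of (partition_key, consumed_capacity_units) pairs.
    """
    totals = {}
    for key, cu in samples:
        totals[key] = totals.get(key, 0) + cu
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [k for k, cu in totals.items() if cu / grand_total > threshold]

samples = [("admin", 700), ("alice", 150), ("bob", 150)]
print(find_hot_keys(samples))  # ['admin'] — 70% of traffic on one key
```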
Step 2: Fix — Apply the Right Remedy
Fix A: Immediate Relief — Switch to On-Demand or Increase Provisioned Capacity
For immediate production relief without code changes:
# Option 1: Switch to on-demand (instant, no capacity planning required)
aws dynamodb update-table \
--table-name YOUR_TABLE_NAME \
--billing-mode PAY_PER_REQUEST
# Option 2: Double provisioned capacity (adjust numbers to your baseline)
aws dynamodb update-table \
--table-name YOUR_TABLE_NAME \
--provisioned-throughput ReadCapacityUnits=500,WriteCapacityUnits=200
Note: You can switch between billing modes once per 24 hours. On-demand eliminates throttling caused by capacity exhaustion but does not fix hot partitions — a single key still hits per-partition limits.
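If you stay on provisioned mode, size the new value from observed consumption rather than guessing. A minimal sketch converting a CloudWatch ConsumedReadCapacityUnits Sum into a per-second target — the 60-second period and 2x headroom factor are assumptions to tune for your workload:

```python
import math

def required_rcu(max_consumed_per_period, period_seconds=60, headroom=2.0):
    """Convert a peak ConsumedReadCapacityUnits Sum (per CloudWatch period)
    into a per-second provisioned RCU target with headroom."""
    per_second = max_consumed_per_period / period_seconds
    return math.ceil(per_second * headroom)

# Peak of 15,000 consumed RCUs in a 60 s period = 250 RCU/s sustained,
# so provision 500 RCU for 2x headroom.
print(required_rcu(15000))  # 500
```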
Fix B: Eliminate Full-Table Scans
A Scan reads every item in your table. On a 10 GB table provisioned with only 5 RCUs, a single eventually consistent Scan needs on the order of three days of capacity (each RCU covers at most 8 KB/s of eventually consistent reads), and it throttles every other read while it runs. Replace with Query plus a Global Secondary Index:
# WRONG — full-table Scan: reads (and bills) every item, then filters client-side
from boto3.dynamodb.conditions import Attr, Key

response = table.scan(
    FilterExpression=Attr('status').eq('pending')
)

# RIGHT — Query via a GSI on the 'status' attribute: reads only matching items
response = table.query(
    IndexName='status-index',
    KeyConditionExpression=Key('status').eq('pending')
)
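The scan-cost arithmetic generalizes. A minimal sketch of the lower-bound estimate, using DynamoDB's documented read units (1 RCU = one strongly consistent 4 KB read per second, or two eventually consistent reads):

```python
def scan_duration_seconds(table_size_bytes, provisioned_rcu,
                          eventually_consistent=True):
    """Lower-bound wall-clock time for a full Scan limited only by RCU.

    1 RCU = one strongly consistent 4 KB read/s,
    or two eventually consistent 4 KB reads/s.
    """
    bytes_per_rcu_per_sec = 8192 if eventually_consistent else 4096
    return table_size_bytes / (provisioned_rcu * bytes_per_rcu_per_sec)

hours = scan_duration_seconds(10 * 1024**3, 5) / 3600
print(f"{hours:.0f} hours")  # ~73 hours for 10 GB at 5 RCU
```

This is a floor, not a forecast: parallel Scan segments, item-size rounding, and competing traffic only make the real impact worse.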
Create the GSI if it does not exist:
aws dynamodb update-table \
--table-name YOUR_TABLE_NAME \
--attribute-definitions AttributeName=status,AttributeType=S \
--global-secondary-index-updates \
'[{"Create":{"IndexName":"status-index",
"KeySchema":[{"AttributeName":"status","KeyType":"HASH"}],
"Projection":{"ProjectionType":"ALL"},
"ProvisionedThroughput":{"ReadCapacityUnits":100,"WriteCapacityUnits":50}}}]'
Fix C: Write Sharding for Hot Partition Keys
If your partition key is inherently low-cardinality (e.g., tenant_id with one dominant tenant), distribute writes across virtual shards:
import random
from boto3.dynamodb.conditions import Key

def write_with_sharding(table, item, shard_count=10):
    """Append a random shard suffix so writes spread across partitions."""
    shard_suffix = random.randint(0, shard_count - 1)
    item['pk'] = f"{item['original_pk']}#{shard_suffix}"
    table.put_item(Item=item)

def read_all_shards(table, original_pk, shard_count=10):
    """Fan out reads across all shards and merge results."""
    results = []
    for shard in range(shard_count):
        sharded_key = f"{original_pk}#{shard}"
        resp = table.query(
            KeyConditionExpression=Key('pk').eq(sharded_key)
        )
        results.extend(resp['Items'])
    return results
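One design note: a random suffix makes retries non-idempotent, because a retried write may land on a different shard and create a duplicate. A deterministic hash of a stable attribute avoids that — a sketch, assuming each item carries some unique attribute such as an order ID (the attribute is illustrative):

```python
import hashlib

def shard_suffix(original_pk, unique_attr, shard_count=10):
    """Deterministically map (partition key, unique attribute) to a shard,
    so retries of the same logical write always target the same shard."""
    digest = hashlib.sha256(f"{original_pk}#{unique_attr}".encode()).hexdigest()
    return int(digest, 16) % shard_count

# Same inputs always yield the same shard, and the shard is in range:
s1 = shard_suffix("tenant-42", "order-1001")
s2 = shard_suffix("tenant-42", "order-1001")
print(s1 == s2, 0 <= s1 < 10)  # True True
```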
Fix D: Configure SDK Retry Behavior
The default retry configuration in the AWS SDKs (3 attempts in boto3's 'standard' retry mode) may be too low for burst scenarios. Tune with exponential backoff and jitter:
import boto3
from botocore.config import Config
# Increase max retries and use standard (exponential+jitter) retry mode
config = Config(
retries={
'max_attempts': 10,
'mode': 'standard' # enables exponential backoff with jitter
},
connect_timeout=5,
read_timeout=10
)
dynamodb = boto3.resource('dynamodb', config=config)
Fix E: Add DAX for Read-Heavy Workloads
For workloads where reads dominate (>80% of operations) and sub-millisecond p99 read latency is required, DynamoDB Accelerator (DAX) is a fully managed write-through cache:
# Create a DAX cluster (requires VPC configuration)
aws dax create-cluster \
--cluster-name my-dax-cluster \
--node-type dax.r5.large \
--replication-factor 2 \
--iam-role-arn arn:aws:iam::ACCOUNT_ID:role/DAXRole \
--subnet-group-name my-dax-subnet-group
Switch your application to the DAX client (Python example):
import amazondax
import boto3
dax = amazondax.AmazonDaxClient.resource(
endpoints=['my-dax-cluster.abc123.dax-clusters.us-east-1.amazonaws.com:8111']
)
table = dax.Table('YOUR_TABLE_NAME')
# Eventually consistent GetItem and Query calls are now served from the DAX cache;
# strongly consistent reads pass through to DynamoDB.
Step 3: Prevent Recurrence — Enable Auto Scaling and Alarms
# Register the table as a scalable target
aws application-autoscaling register-scalable-target \
--service-namespace dynamodb \
--resource-id table/YOUR_TABLE_NAME \
--scalable-dimension dynamodb:table:ReadCapacityUnits \
--min-capacity 5 \
--max-capacity 1000
# Set a scaling policy targeting 70% utilization
aws application-autoscaling put-scaling-policy \
--service-namespace dynamodb \
--resource-id table/YOUR_TABLE_NAME \
--scalable-dimension dynamodb:table:ReadCapacityUnits \
--policy-name ReadAutoScaling \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration \
'{"TargetValue":70.0,
"PredefinedMetricSpecification":{"PredefinedMetricType":"DynamoDBReadCapacityUtilization"}}'
# Alert on throttling
aws cloudwatch put-metric-alarm \
--alarm-name DynamoDB-ThrottledRequests \
--metric-name ThrottledRequests \
--namespace AWS/DynamoDB \
--dimensions Name=TableName,Value=YOUR_TABLE_NAME \
--statistic Sum \
--period 60 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:ops-alerts
Appendix: Full Diagnostic Script
The checks from the steps above are bundled into a single runnable script:
#!/usr/bin/env bash
# DynamoDB Slow Query / Timeout Diagnostic Script
# Usage: TABLE_NAME=my-table AWS_REGION=us-east-1 bash dynamodb_diagnose.sh
set -euo pipefail
TABLE=${TABLE_NAME:?"Set TABLE_NAME"}
REGION=${AWS_REGION:-us-east-1}
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
START=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)
echo "=== Table Status ==="
aws dynamodb describe-table --region "$REGION" --table-name "$TABLE" \
--query 'Table.{Status:TableStatus,Items:ItemCount,SizeBytes:TableSizeBytes,
RCU:ProvisionedThroughput.ReadCapacityUnits,
WCU:ProvisionedThroughput.WriteCapacityUnits,
BillingMode:BillingModeSummary.BillingMode}'
echo ""
echo "=== GSIs and their capacity ==="
aws dynamodb describe-table --region "$REGION" --table-name "$TABLE" \
--query 'Table.GlobalSecondaryIndexes[*].{Name:IndexName,
Status:IndexStatus,
RCU:ProvisionedThroughput.ReadCapacityUnits,
WCU:ProvisionedThroughput.WriteCapacityUnits,
Items:ItemCount}'
echo ""
echo "=== ThrottledRequests (last 60 min, 1-min resolution) ==="
aws cloudwatch get-metric-statistics \
--region "$REGION" \
--namespace AWS/DynamoDB \
--metric-name ThrottledRequests \
--dimensions Name=TableName,Value="$TABLE" \
--start-time "$START" \
--end-time "$END" \
--period 60 \
--statistics Sum \
--query 'sort_by(Datapoints, &Timestamp)[*].{Time:Timestamp,Throttles:Sum}'
echo ""
echo "=== SuccessfulRequestLatency p99 (last 60 min) ==="
for OP in GetItem PutItem Query Scan; do
echo "--- $OP ---"
aws cloudwatch get-metric-statistics \
--region "$REGION" \
--namespace AWS/DynamoDB \
--metric-name SuccessfulRequestLatency \
--dimensions Name=TableName,Value="$TABLE" Name=Operation,Value="$OP" \
--start-time "$START" \
--end-time "$END" \
--period 3600 \
--statistics Maximum Average \
--extended-statistics p99 \
--query 'Datapoints[0].{p99:ExtendedStatistics.p99,Max:Maximum,Avg:Average}'
done
echo ""
echo "=== ConsumedReadCapacityUnits vs Provisioned ==="
aws cloudwatch get-metric-statistics \
--region "$REGION" \
--namespace AWS/DynamoDB \
--metric-name ConsumedReadCapacityUnits \
--dimensions Name=TableName,Value="$TABLE" \
--start-time "$START" \
--end-time "$END" \
--period 60 \
--statistics Sum \
--query 'max_by(Datapoints, &Sum).{MaxConsumedPerMin:Sum,Time:Timestamp}'
echo ""
echo "=== Contributor Insights status ==="
aws dynamodb describe-contributor-insights \
--region "$REGION" \
--table-name "$TABLE" \
--query '{Status:ContributorInsightsStatus}' 2>/dev/null || {
  echo "Contributor Insights not enabled. Enable with:"
  echo "aws dynamodb update-contributor-insights --table-name $TABLE --contributor-insights-action ENABLE"
}
echo ""
echo "=== Recommend: Check Scan operations in last hour ==="
aws cloudwatch get-metric-statistics \
--region "$REGION" \
--namespace AWS/DynamoDB \
--metric-name ReturnedItemCount \
--dimensions Name=TableName,Value="$TABLE" Name=Operation,Value=Scan \
--start-time "$START" \
--end-time "$END" \
--period 3600 \
--statistics Sum \
--query 'Datapoints[0].Sum' 2>/dev/null || true
echo "If the value above is non-zero, full-table Scans are occurring; replace them with Query + GSI."
echo ""
echo "=== Auto Scaling status ==="
aws application-autoscaling describe-scalable-targets \
--region "$REGION" \
--service-namespace dynamodb \
--query "ScalableTargets[?ResourceId=='table/$TABLE'].{Dimension:ScalableDimension,Min:MinCapacity,Max:MaxCapacity}" \
2>/dev/null || echo "Auto Scaling not configured for this table."
echo ""
echo "Diagnosis complete. Review ThrottledRequests and p99 latency above."

Error Medic Editorial
Error Medic Editorial is a team of senior SREs and cloud engineers with collective experience managing multi-petabyte DynamoDB deployments at Fortune 500 companies and high-growth startups. Our guides are tested against real production incidents and peer-reviewed for technical accuracy before publication.
Sources
- https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html
- https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html
- https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html
- https://repost.aws/knowledge-center/dynamodb-high-latency
- https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DAX.html
- https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/contributorinsights_HowItWorks.html
- https://stackoverflow.com/questions/24067283/dynamodb-provisionedthroughputexceededexception