
DynamoDB Slow Query, Timeout & Table Lock: Complete Troubleshooting Guide

Fix DynamoDB slow queries, ProvisionedThroughputExceededException, and timeout errors. Step-by-step diagnosis with AWS CLI commands and proven solutions.

Key Takeaways
  • Hot partitions are the #1 root cause: a poorly chosen partition key concentrates traffic on a single shard, exhausting its 1,000 WCU/3,000 RCU slice and triggering ProvisionedThroughputExceededException.
  • DynamoDB has no traditional table locks, but TransactionConflictException and throttling produce identical symptoms — requests queue, latency spikes, and SDKs surface timeout errors after exhausting retries.
  • Scan operations are catastrophic at scale: a full-table Scan on a 100 GB table monopolizes provisioned RCUs for an extended stretch (seconds to hours, depending on capacity); replace Scans with Query + GSI and your p99 latency can drop by 10–100x.
  • Quick fix sequence: (1) switch to on-demand capacity for instant headroom, (2) enable exponential backoff with jitter in your SDK, (3) add a high-cardinality suffix to your partition key, (4) evaluate DynamoDB Accelerator (DAX) for read-heavy workloads.
Fix Approaches Compared
Method | When to Use | Time to Apply | Risk
Switch to On-Demand Capacity | Unpredictable traffic spikes; provisioned capacity consistently exhausted | < 5 minutes (instant via console/CLI) | Low — pay-per-request pricing may increase cost
Increase Provisioned RCU/WCU | Predictable load with known baseline; want a cost ceiling | < 2 minutes; takes effect immediately | Low — requires forecast accuracy to avoid over/under-provisioning
Redesign Partition Key (write sharding) | Chronic hot-partition errors on a specific key value (e.g., user_id='admin') | Hours to days — requires data migration | Medium — schema change; rollback is complex
Add a GSI on query attributes | Slow queries due to full-table Scans; filtering on non-key attributes | Minutes to create; backfill takes hours for large tables | Low — non-destructive; GSI has its own capacity
Deploy DynamoDB Accelerator (DAX) | Read-heavy workloads (>80% reads); p99 latency must be < 1 ms | 30–60 minutes (cluster provisioning) | Medium — adds infrastructure cost and a DAX-specific SDK requirement
Tune SDK retry/backoff settings | ProvisionedThroughputExceededException on burst traffic; SDK swallows retries silently | Minutes — code change only | Low — increased retry count raises end-to-end latency on failures
Enable DynamoDB Auto Scaling | Recurring throttling at predictable but variable peaks (e.g., daily batch jobs) | Minutes to configure; scaling lag is 5–10 minutes | Low — scaling lag means brief throttling during sudden spikes is still possible

Understanding DynamoDB Slow Queries, Timeouts, and "Table Lock" Behavior

DynamoDB is engineered for single-digit millisecond latency at any scale, but production teams routinely file incidents titled "DynamoDB table lock" or "DynamoDB hung" when they actually mean: requests are timing out, retries are being exhausted, and the application has stalled. Understanding why separates a 5-minute fix from a multi-hour war room.

DynamoDB distributes data across internal partitions. Each partition handles a ceiling of 3,000 Read Capacity Units (RCUs) and 1,000 Write Capacity Units (WCUs) per second. When a single logical partition key attracts more traffic than its partition can serve, DynamoDB returns ProvisionedThroughputExceededException. AWS SDKs retry with exponential backoff by default — but if the condition persists, the SDK raises a final timeout exception that your application surfaces as a "slow query" or "connection timed out" error.
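
A quick back-of-envelope check makes the partition math concrete. The workload figures below (6 KB items, 4,000 strongly consistent reads per second against one key) are assumptions; substitute your own:

import math

ITEM_SIZE_KB = 6        # assumed average item size
READS_PER_SEC = 4_000   # assumed peak reads against ONE partition key

rcu_per_read = math.ceil(ITEM_SIZE_KB / 4)   # strongly consistent reads cost 1 RCU per 4 KB
demand_rcu = READS_PER_SEC * rcu_per_read    # 4,000 * 2 = 8,000 RCU/s on a single key

# 8,000 RCU/s exceeds the 3,000 RCU per-partition ceiling, so this key
# throttles no matter how much table-level capacity you provision
print(f"Per-key demand: {demand_rcu} RCU/s vs 3,000 RCU partition ceiling")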

Symptoms and Their Exact Error Messages

Developers encounter several distinct error signatures depending on SDK and language:

Python (boto3):

botocore.exceptions.ClientError: An error occurred (ProvisionedThroughputExceededException)
  when calling the GetItem operation: The level of configured provisioned throughput for
  the table was exceeded. Consider increasing your provisioning level with the
  UpdateTable API.

Java (AWS SDK v2):

software.amazon.awssdk.services.dynamodb.model.ProvisionedThroughputExceededException:
  The level of configured provisioned throughput for the table was exceeded.

Node.js:

ProvisionedThroughputExceededException: The level of configured provisioned throughput
  for the table was exceeded.
    at Request.extractError (/node_modules/aws-sdk/lib/protocol/json.js:51:27)

Transaction conflict (mimics a lock):

TransactionCanceledException: Transaction cancelled, please refer cancellation reasons
  for specific reasons [TransactionConflict]

General timeout (SDK gave up retrying):

RequestTimeout: Request took too long to complete. (Node.js)
ReadTimeout: Read timeout on endpoint URL (Python)
connect ETIMEDOUT (Node.js low-level)
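
In application code, these signatures can be bucketed by root cause before deciding whether to retry, shed load, or page someone. A minimal boto3-side sketch (the helper name is illustrative):

import botocore.exceptions

def classify_dynamodb_failure(exc):
    """Map the error signatures above to a root-cause bucket."""
    if isinstance(exc, botocore.exceptions.ReadTimeoutError):
        return 'timeout'        # the SDK gave up waiting on the wire
    if isinstance(exc, botocore.exceptions.ClientError):
        code = exc.response['Error']['Code']
        if code == 'ProvisionedThroughputExceededException':
            return 'throttled'  # capacity exhausted; see Fixes A and C below
        if code == 'TransactionCanceledException':
            return 'conflict'   # concurrent transaction, not a table lock
    return 'unknown'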

Step 1: Diagnose — Isolate the Root Cause in CloudWatch

Before touching any configuration, pull the CloudWatch metrics that tell the truth. Five metrics in the AWS/DynamoDB namespace matter most:

Metric | Namespace | What it reveals
ThrottledRequests | AWS/DynamoDB | Non-zero value confirms throttling is happening
SuccessfulRequestLatency | AWS/DynamoDB | p99 > 10 ms on GetItem = partition hotspot
ConsumedReadCapacityUnits | AWS/DynamoDB | Compare to provisioned; near 100% = exhausted
SystemErrors | AWS/DynamoDB | Non-zero = DynamoDB internal errors (contact AWS Support)
ReturnedItemCount | AWS/DynamoDB | High count on Query/Scan = overfetching

Use the AWS CLI to pull throttle history for the last hour:

aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ThrottledRequests \
  --dimensions Name=TableName,Value=YOUR_TABLE_NAME \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Sum

If ThrottledRequests Sum > 0 in any period, you have confirmed throttling. Now identify whether it's reads or writes:

# Check consumed vs provisioned reads
aws dynamodb describe-table --table-name YOUR_TABLE_NAME \
  --query 'Table.{RCU:ProvisionedThroughput.ReadCapacityUnits,
            WCU:ProvisionedThroughput.WriteCapacityUnits,
            TableStatus:TableStatus,
            ItemCount:ItemCount,
            SizeBytes:TableSizeBytes}'

To detect hot partitions specifically, enable DynamoDB Contributor Insights:

aws dynamodb update-contributor-insights \
  --table-name YOUR_TABLE_NAME \
  --contributor-insights-action ENABLE

Once enabled (takes ~15 minutes to populate), navigate to DynamoDB Console → Table → Contributor Insights to see the top-N partition key values consuming the most capacity. If one key accounts for >30% of traffic, you have a hot partition.
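
If you prefer the API to the console, the same top-contributor data is available through CloudWatch's GetInsightRuleReport. The rule name below is an assumption about DynamoDB's naming convention; run describe_insight_rules first to confirm the exact names created for your table:

import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client('cloudwatch')

# Assumed rule name; verify with cw.describe_insight_rules()
rule_name = 'DynamoDBContributorInsights-PKC-YOUR_TABLE_NAME'

now = datetime.now(timezone.utc)
report = cw.get_insight_rule_report(
    RuleName=rule_name,
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    MaxContributorCount=10,
)
for contributor in report['Contributors']:
    # Keys holds the partition key value; the aggregate approximates its traffic share
    print(contributor['Keys'], contributor['ApproximateAggregateValue'])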


Step 2: Fix — Apply the Right Remedy

Fix A: Immediate Relief — Switch to On-Demand or Increase Provisioned Capacity

For immediate production relief without code changes:

# Option 1: Switch to on-demand (instant, no capacity planning required)
aws dynamodb update-table \
  --table-name YOUR_TABLE_NAME \
  --billing-mode PAY_PER_REQUEST

# Option 2: Double provisioned capacity (adjust numbers to your baseline)
aws dynamodb update-table \
  --table-name YOUR_TABLE_NAME \
  --provisioned-throughput ReadCapacityUnits=500,WriteCapacityUnits=200

Note: You can switch between billing modes once per 24 hours. On-demand eliminates throttling caused by capacity exhaustion but does not fix hot partitions — a single key still hits per-partition limits.

Fix B: Eliminate Full-Table Scans

A Scan reads every item in your table. On a 10 GB table with only 5 RCUs provisioned, a single eventually consistent Scan needs roughly 1.3 million RCUs (1 RCU covers 8 KB per eventually consistent read), so it monopolizes the table's read capacity for days while throttling every other read. Replace it with Query plus a Global Secondary Index:

import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource('dynamodb').Table('YOUR_TABLE_NAME')

# WRONG — full table scan
response = table.scan(
    FilterExpression=Attr('status').eq('pending')
)

# RIGHT — query via GSI on 'status' attribute
response = table.query(
    IndexName='status-index',
    KeyConditionExpression=Key('status').eq('pending')
)
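
One caveat: a single Query (or Scan) call returns at most 1 MB of data, so larger result sets must be drained page by page. A minimal sketch following LastEvaluatedKey:

from boto3.dynamodb.conditions import Key  # same import as above

def query_all_pending(table):
    """Collect every 'pending' item from the GSI, one 1 MB page at a time."""
    items = []
    kwargs = {
        'IndexName': 'status-index',
        'KeyConditionExpression': Key('status').eq('pending'),
    }
    while True:
        resp = table.query(**kwargs)
        items.extend(resp['Items'])
        if 'LastEvaluatedKey' not in resp:
            return items
        kwargs['ExclusiveStartKey'] = resp['LastEvaluatedKey']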

Create the GSI if it does not exist:

aws dynamodb update-table \
  --table-name YOUR_TABLE_NAME \
  --attribute-definitions AttributeName=status,AttributeType=S \
  --global-secondary-index-updates \
    '[{"Create":{"IndexName":"status-index",
       "KeySchema":[{"AttributeName":"status","KeyType":"HASH"}],
       "Projection":{"ProjectionType":"ALL"},
       "ProvisionedThroughput":{"ReadCapacityUnits":100,"WriteCapacityUnits":50}}}]'
Fix C: Write Sharding for Hot Partition Keys

If your partition key is inherently low-cardinality (e.g., tenant_id with one dominant tenant), distribute writes across virtual shards:

import random

from boto3.dynamodb.conditions import Key

def write_with_sharding(table, item, shard_count=10):
    """Append a random suffix so writes spread across shard_count partitions."""
    shard_suffix = random.randint(0, shard_count - 1)
    item['pk'] = f"{item['original_pk']}#{shard_suffix}"
    table.put_item(Item=item)

def read_all_shards(table, original_pk, shard_count=10):
    """Fan-out reads across all shards and merge results."""
    results = []
    for shard in range(shard_count):
        sharded_key = f"{original_pk}#{shard}"
        resp = table.query(
            KeyConditionExpression=Key('pk').eq(sharded_key)
        )
        results.extend(resp['Items'])
    return results
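
The sequential fan-out above multiplies read latency by shard_count. If that is too slow, the per-shard queries can run concurrently; because boto3 resources are not thread-safe, this sketch gives each worker thread its own Table handle:

import threading
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.dynamodb.conditions import Key

_local = threading.local()

def _table(table_name):
    # boto3 resources are not thread-safe; create one per worker thread
    if not hasattr(_local, 'table'):
        _local.table = boto3.resource('dynamodb').Table(table_name)
    return _local.table

def read_all_shards_parallel(table_name, original_pk, shard_count=10):
    """Concurrent version of read_all_shards: one query per shard, merged."""
    def one_shard(shard):
        resp = _table(table_name).query(
            KeyConditionExpression=Key('pk').eq(f"{original_pk}#{shard}")
        )
        return resp['Items']

    results = []
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        for items in pool.map(one_shard, range(shard_count)):
            results.extend(items)
    return results
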
Fix D: Configure SDK Retry Behavior

The AWS SDK's default retry budget (three total attempts in boto3's standard retry mode) may be too low for burst scenarios. Raise it and keep exponential backoff with jitter:

import boto3
from botocore.config import Config

# Increase max retries and use standard (exponential+jitter) retry mode
config = Config(
    retries={
        'max_attempts': 10,
        'mode': 'standard'  # enables exponential backoff with jitter
    },
    connect_timeout=5,
    read_timeout=10
)

dynamodb = boto3.resource('dynamodb', config=config)

Fix E: Add DAX for Read-Heavy Workloads

For workloads where reads dominate (roughly 80% or more of operations, per the comparison table above) and p99 latency under 1 ms is required, DynamoDB Accelerator (DAX) is a fully managed write-through cache:

# Create a DAX cluster (requires VPC configuration)
aws dax create-cluster \
  --cluster-name my-dax-cluster \
  --node-type dax.r5.large \
  --replication-factor 2 \
  --iam-role-arn arn:aws:iam::ACCOUNT_ID:role/DAXRole \
  --subnet-group my-dax-subnet-group
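
Cluster creation is not instant, so a deployment script should wait for the cluster to reach 'available' before cutting traffic over. A polling sketch against the DAX control-plane API:

import time

import boto3

dax_ctl = boto3.client('dax')

def wait_for_dax(cluster_name, poll_seconds=60):
    """Poll DescribeClusters until the cluster is available, then return its endpoint."""
    while True:
        cluster = dax_ctl.describe_clusters(
            ClusterNames=[cluster_name])['Clusters'][0]
        if cluster['Status'] == 'available':
            return cluster['ClusterDiscoveryEndpoint']
        time.sleep(poll_seconds)

endpoint = wait_for_dax('my-dax-cluster')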

Switch your application to the DAX client (Python example):

import amazondax
import boto3

dax = amazondax.AmazonDaxClient.resource(
    endpoints=['my-dax-cluster.abc123.dax-clusters.us-east-1.amazonaws.com:8111']
)
table = dax.Table('YOUR_TABLE_NAME')
# All GetItem and Query calls now hit DAX first

Step 3: Prevent Recurrence — Enable Auto Scaling and Alarms

Two pieces of standing infrastructure keep this incident from repeating: auto scaling to track demand, and a CloudWatch alarm that pages you the moment throttling resumes.

# Register the table as a scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace dynamodb \
  --resource-id table/YOUR_TABLE_NAME \
  --scalable-dimension dynamodb:table:ReadCapacityUnits \
  --min-capacity 5 \
  --max-capacity 1000

# Set a scaling policy targeting 70% utilization
aws application-autoscaling put-scaling-policy \
  --service-namespace dynamodb \
  --resource-id table/YOUR_TABLE_NAME \
  --scalable-dimension dynamodb:table:ReadCapacityUnits \
  --policy-name ReadAutoScaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration \
    '{"TargetValue":70.0,
      "PredefinedMetricSpecification":{"PredefinedMetricType":"DynamoDBReadCapacityUtilization"}}'

# Alert on throttling
aws cloudwatch put-metric-alarm \
  --alarm-name DynamoDB-ThrottledRequests \
  --metric-name ThrottledRequests \
  --namespace AWS/DynamoDB \
  --dimensions Name=TableName,Value=YOUR_TABLE_NAME \
  --statistic Sum \
  --period 60 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:ops-alerts

Complete Diagnostic Script

Run the following script to collect every signal from this guide in one pass:

#!/usr/bin/env bash
# DynamoDB Slow Query / Timeout Diagnostic Script
# Usage: TABLE_NAME=my-table AWS_REGION=us-east-1 bash dynamodb_diagnose.sh

set -euo pipefail
TABLE=${TABLE_NAME:?"Set TABLE_NAME"}
REGION=${AWS_REGION:-us-east-1}
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
START=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)

echo "=== Table Status ==="
aws dynamodb describe-table --region "$REGION" --table-name "$TABLE" \
  --query 'Table.{Status:TableStatus,Items:ItemCount,SizeBytes:TableSizeBytes,
           RCU:ProvisionedThroughput.ReadCapacityUnits,
           WCU:ProvisionedThroughput.WriteCapacityUnits,
           BillingMode:BillingModeSummary.BillingMode}'

echo ""
echo "=== GSIs and their capacity ==="
aws dynamodb describe-table --region "$REGION" --table-name "$TABLE" \
  --query 'Table.GlobalSecondaryIndexes[*].{Name:IndexName,
           Status:IndexStatus,
           RCU:ProvisionedThroughput.ReadCapacityUnits,
           WCU:ProvisionedThroughput.WriteCapacityUnits,
           Items:ItemCount}'

echo ""
echo "=== ThrottledRequests (last 60 min, 1-min resolution) ==="
aws cloudwatch get-metric-statistics \
  --region "$REGION" \
  --namespace AWS/DynamoDB \
  --metric-name ThrottledRequests \
  --dimensions Name=TableName,Value="$TABLE" \
  --start-time "$START" \
  --end-time "$END" \
  --period 60 \
  --statistics Sum \
  --query 'sort_by(Datapoints, &Timestamp)[*].{Time:Timestamp,Throttles:Sum}'

echo ""
echo "=== SuccessfulRequestLatency p99 (last 60 min) ==="
for OP in GetItem PutItem Query Scan; do
  echo "--- $OP ---"
  aws cloudwatch get-metric-statistics \
    --region "$REGION" \
    --namespace AWS/DynamoDB \
    --metric-name SuccessfulRequestLatency \
    --dimensions Name=TableName,Value="$TABLE" Name=Operation,Value="$OP" \
    --start-time "$START" \
    --end-time "$END" \
    --period 3600 \
    --extended-statistics p99 \
    --query 'Datapoints[0].{p99:ExtendedStatistics.p99}'
done

echo ""
echo "=== ConsumedReadCapacityUnits vs Provisioned ==="
aws cloudwatch get-metric-statistics \
  --region "$REGION" \
  --namespace AWS/DynamoDB \
  --metric-name ConsumedReadCapacityUnits \
  --dimensions Name=TableName,Value="$TABLE" \
  --start-time "$START" \
  --end-time "$END" \
  --period 60 \
  --statistics Sum \
  --query 'max_by(Datapoints, &Sum).{MaxConsumedPerMin:Sum,Time:Timestamp}'

echo ""
echo "=== Contributor Insights status ==="
aws dynamodb describe-contributor-insights \
  --region "$REGION" \
  --table-name "$TABLE" \
  --query '{Status:ContributorInsightsStatus}' 2>/dev/null || {
  echo "Contributor Insights not enabled. Enable with:"
  echo "aws dynamodb update-contributor-insights --table-name $TABLE --contributor-insights-action ENABLE"
}

echo ""
echo "=== Recommend: Check Scan operations in last hour ==="
aws cloudwatch get-metric-statistics \
  --region "$REGION" \
  --namespace AWS/DynamoDB \
  --metric-name ReturnedItemCount \
  --dimensions Name=TableName,Value="$TABLE" Name=Operation,Value=Scan \
  --start-time "$START" \
  --end-time "$END" \
  --period 3600 \
  --statistics Sum \
  --query 'Datapoints[0].Sum' 2>/dev/null && \
  echo "Non-zero = full table scans are occurring. Replace with Query+GSI."

echo ""
echo "=== Auto Scaling status ==="
aws application-autoscaling describe-scalable-targets \
  --region "$REGION" \
  --service-namespace dynamodb \
  --query "ScalableTargets[?ResourceId=='table/$TABLE'].{Dimension:ScalableDimension,Min:MinCapacity,Max:MaxCapacity}" \
  2>/dev/null || echo "Auto Scaling not configured for this table."

echo ""
echo "Diagnosis complete. Review ThrottledRequests and p99 latency above."

Error Medic Editorial

Error Medic Editorial is a team of senior SREs and cloud engineers with collective experience managing multi-petabyte DynamoDB deployments at Fortune 500 companies and high-growth startups. Our guides are tested against real production incidents and peer-reviewed for technical accuracy before publication.
