How to Fix AWS RDS Storage Full (Error: no space left on device)
Quickly diagnose and resolve the AWS RDS Storage Full error. Learn how to clear transaction logs, resize storage, and prevent future outages.
- Root cause 1: Unmanaged transaction logs or WAL files consuming allocated storage due to replication lag.
- Root cause 2: Runaway queries creating massive temporary tables that spill to disk.
- Root cause 3: Insufficient baseline storage allocated for natural database data growth.
- Quick fix summary: Identify the bloat using CloudWatch, clear temporary files or inactive replication slots, and immediately increase allocated storage via the RDS Console.
| Method | When to Use | Time | Risk |
|---|---|---|---|
| Enable Storage Autoscaling | Proactive prevention or when downtime is unacceptable | Fast (minutes to configure) | Low |
| Manual Storage Modification | Immediate need for more space and autoscaling is off | Medium (can take hours to optimize) | Low |
| Drop Stale Replication Slots | Storage is full due to WAL bloat and replication lag | Fast (seconds to free space) | Medium (requires identifying the culprit) |
| Kill Rogue Queries | Temporary space issues caused by stuck processes | Fast | Medium (terminates active transactions) |
Understanding the Error: Storage Full in AWS RDS
When operating databases in the cloud, one of the most critical and sudden failures you can encounter is running out of disk space. For AWS RDS, whether you are running PostgreSQL, MySQL, or another engine, hitting the storage limit typically results in the instance entering a storage-full state.
When an RDS instance reaches the storage-full state, it immediately stops accepting write operations to protect the integrity of the data and the database engine. Your applications will start throwing aggressive errors. For example, in a PostgreSQL environment, you might see application logs flooded with:
psycopg2.errors.DiskFull: could not write to file "pg_wal/xlog_temp_12345": No space left on device
or
ERROR: could not extend file "base/16384/16399": No space left on device
HINT: Check free disk space.
These errors indicate that the underlying Amazon EBS volume attached to your RDS instance has 0 bytes of available space. This is a critical SEV-1 incident because it directly translates to application downtime for any service that requires database writes. Even read-only operations might fail if they require temporary disk space for sorting or hashing large datasets.
Primary Causes of RDS Storage Exhaustion
While natural data growth is a factor, sudden storage exhaustion is usually caused by operational anomalies. Understanding these is the first step toward resolution.
- Runaway Temporary Tables: Complex queries with massive `JOIN`, `GROUP BY`, or `ORDER BY` clauses that cannot fit into `work_mem` (in Postgres) will spill over to disk, creating massive temporary files.
- Transaction Log (WAL) Bloat: In PostgreSQL, Write-Ahead Logs (WAL) are crucial for crash recovery and replication. If you have a read replica that has fallen behind (replication lag), or a logical replication slot that is no longer being consumed, the primary instance will retain all WAL files indefinitely until the disk fills up.
- Unvacuumed Dead Tuples: High-churn tables (lots of `UPDATE` and `DELETE` operations) create dead tuples. If the autovacuum daemon cannot keep up, these dead tuples consume significant disk space.
- Error Log Explosion: Misconfigured applications generating millions of errors per minute can cause the database engine's error logs to consume gigabytes of storage rapidly.
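To get a feel for how quickly WAL retention can fill a disk, here is a rough back-of-the-envelope sketch in Python. The generation rate and lag values are hypothetical illustrations, not measurements:

```python
def retained_wal_gb(wal_rate_mb_per_min: float, lag_minutes: float) -> float:
    """Estimate WAL retained on the primary while a consumer lags behind.

    The primary must keep every WAL segment generated since the slot's
    restart_lsn, so retained WAL grows linearly with the lag.
    """
    return wal_rate_mb_per_min * lag_minutes / 1024.0

# Hypothetical numbers: a busy instance writing 200 MB of WAL per minute,
# with a logical slot that has been inactive for 24 hours.
print(retained_wal_gb(200, 24 * 60))  # 281.25 GB of disk consumed
```

The linear growth is the point: a stalled consumer never stops costing you disk, which is why an inactive slot can quietly eat hundreds of gigabytes.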
Step 1: Diagnose the Root Cause
Before blindly adding storage, you must identify what consumed it. If a runaway process is creating 100GB of temporary files every minute, adding 50GB of storage will only buy you a few seconds.
Check CloudWatch Metrics
Navigate to the AWS CloudWatch console and examine the following metrics for your RDS instance:
- `FreeStorageSpace`: Look at the trajectory. Was it a gradual decline over months, or a sudden cliff drop over minutes?
- `WriteIOPS` and `ReadIOPS`: A sudden spike in write IOPS right before the storage filled up often points to a massive data load or temporary table spillage.
- `ReplicaLag`: If you have read replicas, check if they are lagging.
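As an illustration of the trajectory check, here is a small sketch (the sample values are made up) that classifies a `FreeStorageSpace` series as a gradual decline or a cliff drop based on the steepest single-interval fall:

```python
def classify_decline(samples_gb, cliff_fraction=0.2):
    """Label a FreeStorageSpace series by its steepest one-step drop.

    If any single interval loses more than `cliff_fraction` of the
    starting free space, treat it as a cliff (runaway process);
    otherwise it is gradual organic growth.
    """
    start = samples_gb[0]
    worst_drop = max(a - b for a, b in zip(samples_gb, samples_gb[1:]))
    return "cliff" if worst_drop > cliff_fraction * start else "gradual"

print(classify_decline([100, 98, 96, 94]))  # gradual
print(classify_decline([100, 95, 40, 5]))   # cliff
```

A gradual decline points at Method A (add storage and move on); a cliff points at Methods B or C, because added storage will be consumed again within minutes.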
Investigate Database Internals
If your database still accepts connections (read-only connections sometimes remain possible, or you may be able to connect after adding a small amount of storage), run diagnostic queries.
For PostgreSQL, check for runaway queries using pg_stat_activity:
SELECT pid, age(clock_timestamp(), query_start) AS runtime, usename, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start ASC; -- longest-running queries first
Check the size of your logical replication slots to see if they are retaining WALs:
SELECT slot_name, plugin, slot_type, active, restart_lsn,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal_size
FROM pg_replication_slots;
Step 2: Immediate Remediation (The Fix)
When the database is hard down due to storage-full, your immediate priority is restoring service.
Method A: Increase Allocated Storage (The Safest Route)
The most common and reliable fix is to modify the RDS instance to increase its allocated storage. AWS allows you to modify the storage size of an RDS instance dynamically.
- Go to the AWS RDS Console.
- Select your database instance.
- Click Modify.
- Scroll down to the Storage section.
- Increase the Allocated storage value. Best practice is to increase it by at least 20-25% to provide enough breathing room.
- Check the Apply immediately box at the bottom of the page. If you do not check this, the storage increase will wait for the next maintenance window!
- Click Modify DB Instance.
Important Caveat: After you trigger a storage modification, the instance will enter the storage-optimization state. This process can take several hours, and in extreme cases, days. During this time, your database will be online and fully functional, but you cannot make any further storage modifications. Therefore, ensure your initial increase is substantial enough to handle whatever caused the spike in the first place.
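Because further storage modifications are blocked during storage-optimization, it is worth computing the new value deliberately rather than guessing. A minimal sketch, assuming the 20-25% headroom guidance above and RDS's rule that a manual increase must be at least 10% over the current allocation (the function name is hypothetical):

```python
import math

def recommended_storage_gib(current_gib: int, headroom: float = 0.25) -> int:
    """Suggest a new allocated-storage value with ~25% headroom.

    RDS requires a manual increase of at least 10% over the current
    allocation, so enforce that floor as well.
    """
    target = math.ceil(current_gib * (1 + headroom))
    minimum = math.ceil(current_gib * 11 / 10)  # 10% floor, integer-safe
    return max(target, minimum)

print(recommended_storage_gib(200))  # 250
```

Passing the result to `--allocated-storage` in the CLI (or the console field) keeps you from under-shooting and getting locked out of a second increase for hours.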
Method B: Dropping Unused Replication Slots (PostgreSQL Specific)
If your diagnostics revealed that an inactive logical replication slot is retaining terabytes of WAL files, dropping the slot will immediately free up space.
-- Replace 'stale_slot_name' with the actual slot name found in your diagnostics
SELECT pg_drop_replication_slot('stale_slot_name');
Once the slot is dropped, PostgreSQL will aggressively delete the unneeded WAL files, often restoring the FreeStorageSpace metric within minutes.
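Dropping the wrong slot breaks a live consumer, so it is worth encoding the decision explicitly. A trivial decision helper (hypothetical names; `active` and the retained size come from the `pg_replication_slots` query above):

```python
def safe_to_drop(active: bool, retained_bytes: int,
                 min_retained_bytes: int = 10 * 1024**3) -> bool:
    """Only drop a replication slot that is inactive AND is pinning a
    meaningful amount of WAL (default threshold: 10 GiB)."""
    return (not active) and retained_bytes > min_retained_bytes

print(safe_to_drop(False, 50 * 1024**3))  # True: inactive, 50 GiB retained
print(safe_to_drop(True, 50 * 1024**3))   # False: an active consumer
```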
Method C: Killing Rogue Queries
If a specific SELECT query is generating massive temporary files, terminating that connection will cause the database engine to clean up the temporary files, freeing up space.
-- Terminate a specific PostgreSQL backend process
SELECT pg_terminate_backend(<pid_of_rogue_query>);
Step 3: Long-Term Prevention and Best Practices
Fixing the immediate outage is only half the battle. You must implement guardrails to prevent this from happening again.
1. Enable Storage Autoscaling
AWS RDS Storage Autoscaling automatically scales the storage capacity of your database instance in response to growing database workloads, with zero downtime.
When you enable this feature, you set a Maximum storage threshold. RDS will automatically increase your storage volume if:
- Free available space is less than 10% of the allocated storage.
- The low-storage condition lasts for at least 5 minutes.
- At least 6 hours have passed since the last storage modification.
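The three trigger conditions above can be sketched as a single predicate (a hypothetical helper mirroring the documented rules, not an AWS API):

```python
def autoscaling_would_trigger(free_gib: float, allocated_gib: float,
                              low_storage_minutes: float,
                              hours_since_last_mod: float) -> bool:
    """Mirror the documented RDS storage-autoscaling trigger conditions."""
    return (free_gib < 0.10 * allocated_gib   # free space under 10%
            and low_storage_minutes >= 5      # condition held >= 5 minutes
            and hours_since_last_mod >= 6)    # >= 6 h since last modification

print(autoscaling_would_trigger(8, 100, 10, 7))  # True
print(autoscaling_would_trigger(8, 100, 10, 2))  # False (too soon)
```

The last condition is the one that bites in practice: a runaway process can outrun autoscaling's 6-hour cool-down, which is why autoscaling complements, but does not replace, alarms.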
2. Implement Aggressive CloudWatch Alarms
Do not rely on the storage-full state to tell you there is a problem. Create CloudWatch alarms to notify your team via PagerDuty, Slack, or Email long before the disk is full.
Create two tiers of alarms:
- Warning Alarm: Triggers when `FreeStorageSpace` drops below 20% of allocated storage. This generates a ticket for the DBA team to investigate during business hours.
- Critical Alarm: Triggers when `FreeStorageSpace` drops below 10%. This pages the on-call engineer.
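One practical wrinkle: `FreeStorageSpace` is reported in bytes, while allocation is configured in GiB, so alarm thresholds need a conversion. A small sketch (function name is hypothetical) deriving the two tiers:

```python
GIB = 1024 ** 3

def alarm_thresholds_bytes(allocated_gib: int) -> dict:
    """Derive warning (20%) and critical (10%) thresholds for the
    FreeStorageSpace CloudWatch metric, which is reported in bytes."""
    total = allocated_gib * GIB
    return {"warning": int(total * 0.20), "critical": int(total * 0.10)}

print(alarm_thresholds_bytes(100))
# {'warning': 21474836480, 'critical': 10737418240}
```

Feed these byte values into your alarm definitions (console, CLI, or infrastructure-as-code) so the percentages stay correct when you resize storage.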
3. Tune Autovacuum (PostgreSQL)
Ensure your autovacuum settings are aggressive enough to keep up with your application's update/delete velocity. If you have large tables that are frequently updated, consider lowering the autovacuum_vacuum_scale_factor so that vacuum runs more frequently, preventing dead tuple bloat.
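Autovacuum processes a table once dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples`. A quick sketch of that formula shows why the default scale factor is too lax for very large tables:

```python
def vacuum_trigger(reltuples: int, scale_factor: float = 0.2,
                   threshold: int = 50) -> int:
    """Dead tuples needed before autovacuum touches a table, using the
    PostgreSQL formula: threshold + scale_factor * reltuples.
    Defaults match stock PostgreSQL settings."""
    return int(threshold + scale_factor * reltuples)

# With defaults, a 500M-row table accumulates ~100M dead tuples first.
print(vacuum_trigger(500_000_000))        # 100000050
# Lowering the scale factor to 0.01 triggers vacuum far sooner.
print(vacuum_trigger(500_000_000, 0.01))  # 5000050
```

On RDS these parameters are set per parameter group, or per table via `ALTER TABLE ... SET (autovacuum_vacuum_scale_factor = ...)`.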
4. Monitor Replication Lag
If you use read replicas, set up alerts on the ReplicaLag metric. A broken replication pipeline is a ticking time bomb for your primary database's storage.
Conclusion
Encountering the AWS RDS storage-full error is a stressful event that causes immediate application downtime. By understanding the underlying mechanics of how cloud database engines handle temporary files, transaction logs, and data growth, you can quickly diagnose the root cause. Leveraging AWS native tools like Storage Autoscaling and comprehensive CloudWatch monitoring ensures that your database infrastructure remains resilient, highly available, and invisible to your end-users.
Quick Reference: AWS CLI Commands
# Check RDS instance status and current storage allocation
# Note: FreeStorageSpace is a CloudWatch metric, not a field in this API response
aws rds describe-db-instances \
    --db-instance-identifier my-production-db \
    --query 'DBInstances[*].[DBInstanceStatus,AllocatedStorage,MaxAllocatedStorage]'
# Modify RDS instance to immediately increase storage and enable autoscaling
# Note: --apply-immediately is crucial to avoid waiting for the maintenance window
aws rds modify-db-instance \
    --db-instance-identifier my-production-db \
    --allocated-storage 250 \
    --max-allocated-storage 1000 \
    --apply-immediately
Error Medic Editorial
Expert Cloud Architects and SREs dedicated to solving the toughest infrastructure bottlenecks. We specialize in AWS, PostgreSQL, and high-availability systems.