Database Error Troubleshooting: Connection, Performance & Recovery
Database errors sit at the critical path of nearly every application. When your database is down or degraded, everything downstream fails — APIs return 500 errors, queues back up, and users see broken pages. Quick, accurate diagnosis is essential because database issues compound rapidly under load.
The most common database problems fall into three buckets: connection failures, performance degradation, and data integrity issues. Connection errors (too many connections, connection refused, authentication failures) are usually the fastest to diagnose but can be caused by anything from misconfigured connection strings to exhausted connection pools to network segmentation. Performance issues (slow queries, lock contention, replication lag) require more investigation and often involve query analysis, index tuning, or resource scaling. Data integrity issues (corruption, failed crash recovery) are the rarest but the most serious, because resolving them often means restoring from backup.
This section covers 37 troubleshooting articles across 12 database technologies: relational databases like PostgreSQL, MySQL, MariaDB, SQL Server, and Oracle DB; document stores like MongoDB; key-value stores like Redis and DynamoDB; search engines like Elasticsearch; time-series databases like InfluxDB; and wide-column stores like Cassandra. Each guide targets specific error messages and symptoms you'll encounter in production.
Whether you're facing a deadlock at peak traffic, a replication lag that's growing faster than you can diagnose, or a connection pool that's silently exhausting itself, these guides walk you through systematic diagnosis and resolution.
Common Patterns & Cross-Cutting Themes
Connection Exhaustion & Pool Management
"Too many connections" is one of the most common database errors in production, and it's rarely solved by just increasing max_connections. Every database connection consumes memory on the server — PostgreSQL uses roughly 10 MB per connection, MySQL around 1–3 MB. Raising the limit without increasing server resources just delays the crash.
The real fix is proper connection pooling. Use a connection pooler like PgBouncer for PostgreSQL, ProxySQL for MySQL, or built-in pool settings in your ORM (pool size, idle timeout, max lifetime). Set your pool size to match your actual concurrent query needs — a good starting point is 2× your CPU cores. Close connections properly in your application code; leaked connections from unclosed transactions or abandoned sessions are the most common pool drain.
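The mechanics a pooler provides can be sketched in a few lines. This is a minimal illustration using a stdlib queue, not a production pooler — real tools like PgBouncer or your ORM's pool also handle idle timeout, max lifetime, and health checks. The `make_conn` factory is a hypothetical stand-in for whatever opens a real DB-API connection.

```python
import queue

class SimplePool:
    """Minimal fixed-size connection pool: illustrates checkout/return
    semantics only. Real poolers also enforce idle timeout, max
    connection lifetime, and liveness checks."""
    def __init__(self, make_conn, size):
        self._q = queue.Queue(maxsize=size)
        for _ in range(size):
            self._q.put(make_conn())

    def acquire(self, timeout=5.0):
        # Blocks until a connection is free. Refusing to open extra
        # connections is what caps the total at the pool size.
        return self._q.get(timeout=timeout)

    def release(self, conn):
        # Forgetting this call is exactly a "leaked connection":
        # the pool shrinks until acquire() starts timing out.
        self._q.put(conn)

# Hypothetical usage with a stand-in connection factory.
pool = SimplePool(make_conn=lambda: object(), size=4)
conn = pool.acquire()
try:
    pass  # run queries on conn
finally:
    pool.release(conn)  # always release, even on error
```

The `try/finally` pattern at the bottom is the application-side half of the fix: every code path that acquires a connection must release it, or the pool drains exactly as described above.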
For serverless and auto-scaling architectures, connection pooling is especially critical. Each Lambda invocation or container replica opens its own connection, and with hundreds of concurrent instances, you can easily exceed database limits. Use RDS Proxy (AWS), a sidecar pooler, or a connection-aware middleware layer.
Lock Contention & Deadlocks
Deadlocks and lock waits occur when multiple transactions compete for the same rows or tables. The database detects circular dependencies and kills one transaction with a deadlock error. Lock waits, on the other hand, cause queries to hang until the blocking transaction commits or the wait timeout expires.
To diagnose, check your database's lock monitoring: pg_locks and pg_stat_activity in PostgreSQL, SHOW ENGINE INNODB STATUS in MySQL, or db.currentOp() in MongoDB. Identify the blocking query and the waiting query. Common causes include: updating rows in different orders across transactions, long-running transactions holding locks unnecessarily, and missing indexes forcing table scans that acquire more locks than needed.
Prevention strategies: keep transactions as short as possible, access rows in a consistent order, use appropriate isolation levels (READ COMMITTED is usually sufficient), and add indexes to reduce lock scope. For unavoidable deadlocks, implement retry logic in your application — retry the entire transaction, not just the failed statement.
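The retry advice above can be sketched as a small wrapper. The `DeadlockDetected` exception here is a hypothetical stand-in for your driver's real error (e.g. psycopg2 raises a deadlock error class; MySQL reports error 1213); the wrapper retries the whole transaction function, never an individual statement.

```python
import random
import time

class DeadlockDetected(Exception):
    """Stand-in for the driver's deadlock error (assumption:
    substitute your driver's actual exception class)."""

def run_in_transaction(txn_fn, retries=3, base_delay=0.05):
    """Retry the *whole* transaction on deadlock, with jittered
    exponential backoff. txn_fn is expected to open, execute, and
    commit the transaction itself."""
    for attempt in range(retries + 1):
        try:
            return txn_fn()
        except DeadlockDetected:
            if attempt == retries:
                raise  # give up after the final attempt
            # Jitter keeps the two deadlocked transactions from
            # retrying in lockstep and deadlocking again.
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

Retrying the entire `txn_fn` matters because the database rolled the whole transaction back when it chose a deadlock victim; re-running only the failed statement would continue from state that no longer exists.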
Query Performance & Slow Queries
Slow queries are the most common cause of database performance degradation. A query that takes 10 ms with 1,000 rows might take 10 seconds with 1,000,000 rows if it's doing a sequential scan instead of using an index.
Start with EXPLAIN ANALYZE (PostgreSQL), EXPLAIN (MySQL), or the equivalent in your database. Look for sequential scans on large tables, nested loop joins with high row estimates, and sorts on unindexed columns. The most impactful fix is usually adding the right index — a composite index covering the WHERE clause and ORDER BY can turn a 10-second query into a 10-millisecond query.
Enable slow query logging with a low threshold (100–500 ms) and review it regularly. Watch for N+1 query patterns from ORMs, missing pagination on large result sets, and queries that return more columns than needed. Sometimes the fix is application-level: caching, denormalization, or restructuring the data access pattern.
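The N+1 pattern mentioned above is worth seeing concretely: one query for the parent rows, then one more per parent for its children — which a lazily-loading ORM emits without you writing a loop of queries at all. A minimal illustration with stdlib SQLite (schema and data are made up for the example):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INT, title TEXT);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bo');
    INSERT INTO books VALUES (1, 1, 'A'), (2, 1, 'B'), (3, 2, 'C');
""")

def fetch_n_plus_one():
    # 1 query for authors + 1 query *per author* for their books.
    out, queries = {}, 1
    authors = db.execute("SELECT id, name FROM authors").fetchall()
    for aid, name in authors:
        rows = db.execute("SELECT title FROM books WHERE author_id = ?",
                          (aid,)).fetchall()
        queries += 1
        out[name] = [t for (t,) in rows]
    return out, queries

def fetch_joined():
    # One JOIN replaces all the per-author round trips.
    out = {}
    rows = db.execute("""SELECT a.name, b.title FROM authors a
                         LEFT JOIN books b ON b.author_id = a.id""")
    for name, title in rows:
        out.setdefault(name, [])
        if title is not None:
            out[name].append(title)
    return out, 1

print(fetch_n_plus_one()[1])  # 3 queries for 2 authors (1 + N)
print(fetch_joined()[1])      # 1 query
```

With 2 authors the difference is trivial; with 10,000 parent rows the N+1 version issues 10,001 round trips, which is why it shows up so reliably in slow query logs.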
Replication Lag & High Availability
Replication lag means your read replicas are behind the primary, serving stale data. Small, consistent lag (under 1 second) is usually acceptable, but steadily growing lag indicates the replica can't keep up with write volume and will eventually fall so far behind that it has to be rebuilt from a fresh copy of the primary.
Common causes: the replica has less CPU/IO capacity than the primary, long-running queries on the replica block replication, large transactions (bulk imports) create replication bursts, or network latency between primary and replica is too high. Monitor replication lag continuously — in PostgreSQL check pg_stat_replication, in MySQL use SHOW REPLICA STATUS (SHOW SLAVE STATUS before MySQL 8.0.22), in MongoDB check rs.status().
For critical reads that cannot tolerate stale data, always read from the primary. Design your application to be aware of replication topology: route writes to the primary and reads to replicas, with a fallback to the primary when strong consistency is required. If replication consistently can't keep up, consider scaling the primary vertically or sharding horizontally.
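One way to make the application topology-aware is a small router in the data access layer. This is a sketch under simplifying assumptions — the "connections" are plain labels standing in for real engine or pool objects, and replica health checking is omitted:

```python
import itertools

class TopologyRouter:
    """Route writes to the primary and reads to replicas, falling
    back to the primary when the caller needs strong consistency.
    Connection objects are stand-in labels here (assumption)."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def for_query(self, is_write=False, strong=False):
        # A stale read is only acceptable when neither flag is set
        # and at least one replica exists.
        if is_write or strong or self._replicas is None:
            return self.primary
        return next(self._replicas)  # simple round-robin over replicas

router = TopologyRouter("primary", ["replica-1", "replica-2"])
print(router.for_query(is_write=True))  # primary
print(router.for_query())               # one of the replicas
print(router.for_query(strong=True))    # primary
```

The `strong=True` escape hatch is the key design choice: rather than guessing which reads tolerate staleness, the call site declares it, so read-your-own-writes flows can opt into the primary explicitly.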
Quick Troubleshooting Guide
| Symptom | Likely Cause | First Step |
|---|---|---|
| Connection refused | Database not running or wrong host/port | Verify service status; check host, port, and firewall rules |
| Too many connections | Connection pool exhaustion or connection leaks | Implement connection pooling; check for leaked connections; review pool size |
| Authentication failed | Wrong credentials or pg_hba.conf / user grants | Verify username/password; check host-based auth config and user privileges |
| Deadlock detected | Transactions locking rows in conflicting order | Add retry logic; ensure consistent row access order; shorten transactions |
| Lock wait timeout exceeded | Long-running transaction blocking others | Identify blocking query; kill or wait for it; add missing indexes |
| Slow queries / high latency | Missing indexes or inefficient query plans | Run EXPLAIN ANALYZE; add indexes on WHERE/JOIN columns; optimize query |
| Replication lag increasing | Replica under-provisioned or large write bursts | Scale replica resources; break large transactions into smaller batches |
| Disk space full | Uncontrolled data growth or WAL/binlog accumulation | Free space; set up log rotation; archive or purge old data; increase storage |
| Out of memory (OOM killed) | work_mem or buffer pool too large for available RAM | Tune memory settings; reduce max_connections; upgrade instance size |
| Corrupted data or crash recovery | Unclean shutdown, disk failure, or bug | Restore from backup; run integrity checks (pg_checksums, mysqlcheck) |