Error Medic

Database Error Troubleshooting: Connection, Performance & Recovery

Database errors sit at the critical path of nearly every application. When your database is down or degraded, everything downstream fails — APIs return 500 errors, queues back up, and users see broken pages. Quick, accurate diagnosis is essential because database issues compound rapidly under load.

The most common database problems fall into three buckets: connection failures, performance degradation, and data integrity issues. Connection errors (too many connections, connection refused, authentication failures) are usually the fastest to diagnose but can be caused by anything from misconfigured connection strings to exhausted connection pools to network segmentation. Performance issues (slow queries, lock contention, replication lag) require more investigation and often involve query analysis, index tuning, or resource scaling. Data integrity issues (corruption, unclean shutdowns, failed recovery) are the rarest but the highest-stakes, typically requiring integrity checks and restores from backup.

This section covers 37 troubleshooting articles across 12 database technologies: relational databases like PostgreSQL, MySQL, MariaDB, SQL Server, and Oracle DB; document stores like MongoDB; key-value stores like Redis and DynamoDB; search engines like Elasticsearch; time-series databases like InfluxDB; and wide-column stores like Cassandra. Each guide targets specific error messages and symptoms you'll encounter in production.

Whether you're facing a deadlock at peak traffic, a replication lag that's growing faster than you can diagnose, or a connection pool that's silently exhausting itself, these guides walk you through systematic diagnosis and resolution.

Common Patterns & Cross-Cutting Themes

Connection Exhaustion & Pool Management

"Too many connections" is one of the most common database errors in production, and it's rarely solved by just increasing max_connections. Every database connection consumes memory on the server — PostgreSQL uses roughly 10 MB per connection, MySQL around 1–3 MB. Raising the limit without increasing server resources just delays the crash.

The real fix is proper connection pooling. Use a connection pooler like PgBouncer for PostgreSQL, ProxySQL for MySQL, or built-in pool settings in your ORM (pool size, idle timeout, max lifetime). Set your pool size to match your actual concurrent query needs — a good starting point is 2× your CPU cores. Close connections properly in your application code; leaked connections from unclosed transactions or abandoned sessions are the most common pool drain.
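The core idea behind any pooler (PgBouncer, ProxySQL, or an ORM's built-in pool) is a fixed set of reusable connections that clients borrow and return. A minimal sketch in Python, using stdlib `queue` and `sqlite3` as a stand-in for your real driver (the `ConnectionPool` class and its method names are illustrative, not any library's API):

```python
import os
import queue
import sqlite3

class ConnectionPool:
    """Minimal fixed-size pool (illustrative only; real poolers add
    health checks, idle timeouts, and max connection lifetime)."""

    def __init__(self, factory, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self, timeout=5.0):
        # Blocks instead of opening a new connection, so the database
        # never sees more than `size` concurrent connections.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        # Return the connection for reuse (call in a try/finally so
        # leaked connections don't drain the pool).
        self._pool.put(conn)

# Pool sized to 2x CPU cores, per the rule of thumb above.
pool = ConnectionPool(lambda: sqlite3.connect(":memory:"),
                      size=2 * (os.cpu_count() or 1))
conn = pool.acquire()
conn.execute("SELECT 1")
pool.release(conn)
```

The key property: when the pool is empty, `acquire` waits rather than opening connection number `size + 1`, which is exactly how a pooler turns "too many connections" into a brief queue wait.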

For serverless and auto-scaling architectures, connection pooling is especially critical. Each Lambda invocation or container replica opens its own connection, and with hundreds of concurrent instances, you can easily exceed database limits. Use RDS Proxy (AWS), a sidecar pooler, or a connection-aware middleware layer.
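When a dedicated pooler isn't an option, a common serverless mitigation is reusing one connection per warm runtime instance instead of opening a new one per invocation. A sketch, with `sqlite3` standing in for your driver and the function names being hypothetical:

```python
import sqlite3

_conn = None  # module scope: survives across warm invocations of the same instance

def get_connection():
    """Open the connection lazily, once per runtime instance, and
    reuse it for every subsequent invocation handled by this instance."""
    global _conn
    if _conn is None:
        _conn = sqlite3.connect(":memory:")
    return _conn

def handler(event):
    # Each warm invocation reuses the cached connection, so total
    # connections scale with instance count, not invocation count.
    conn = get_connection()
    return conn.execute("SELECT 1").fetchone()[0]
```

This caps connections at one per concurrent instance; with hundreds of instances you still need RDS Proxy or a pooler in front of the database.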

Lock Contention & Deadlocks

Deadlocks and lock waits occur when multiple transactions compete for the same rows or tables. The database detects circular dependencies and kills one transaction with a deadlock error. Lock waits, on the other hand, cause queries to hang until the blocking transaction commits or the wait timeout expires.

To diagnose, check your database's lock monitoring: pg_locks and pg_stat_activity in PostgreSQL, SHOW ENGINE INNODB STATUS in MySQL, or db.currentOp() in MongoDB. Identify the blocking query and the waiting query. Common causes include: updating rows in different orders across transactions, long-running transactions holding locks unnecessarily, and missing indexes forcing table scans that acquire more locks than needed.

Prevention strategies: keep transactions as short as possible, access rows in a consistent order, use appropriate isolation levels (READ COMMITTED is usually sufficient), and add indexes to reduce lock scope. For unavoidable deadlocks, implement retry logic in your application — retry the entire transaction, not just the failed statement.
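The retry advice above (retry the whole transaction, not the failed statement) can be sketched as a generic wrapper. `DeadlockError` is a placeholder for your driver's real exception (e.g. psycopg2 raises `DeadlockDetected`):

```python
import random
import time

class DeadlockError(Exception):
    """Stand-in for the driver's deadlock exception."""

def run_transaction(txn_fn, retries=3, base_delay=0.05):
    """Run txn_fn, retrying the *entire* transaction on deadlock.

    Jittered exponential backoff makes it unlikely that two competing
    transactions retry in lockstep and deadlock again immediately.
    """
    for attempt in range(retries + 1):
        try:
            return txn_fn()  # txn_fn must begin and commit its own transaction
        except DeadlockError:
            if attempt == retries:
                raise  # exhausted retries; surface the error
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

Usage: wrap the function that opens, executes, and commits the transaction, so a retry replays every statement from the beginning with fresh locks.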

Query Performance & Slow Queries

Slow queries are the most common cause of database performance degradation. A query that takes 10 ms with 1,000 rows might take 10 seconds with 1,000,000 rows if it's doing a sequential scan instead of using an index.

Start with EXPLAIN ANALYZE (PostgreSQL), EXPLAIN (MySQL), or the equivalent in your database. Look for sequential scans on large tables, nested loop joins with high row estimates, and sorts on unindexed columns. The most impactful fix is usually adding the right index — a composite index covering the WHERE clause and ORDER BY can turn a 10-second query into a 10-millisecond query.
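The scan-versus-index difference is easy to see with Python's bundled SQLite, whose `EXPLAIN QUERY PLAN` plays the role of `EXPLAIN ANALYZE` here (table and index names are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)])

def plan(sql):
    # Column 3 of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT total FROM orders WHERE customer_id = 42"
print(plan(query))  # full table scan (e.g. "SCAN orders")

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
print(plan(query))  # now an index search using idx_orders_customer
```

The same workflow applies to PostgreSQL or MySQL: read the plan, spot the scan, add the index, and confirm the plan changed.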

Enable slow query logging with a low threshold (100–500 ms) and review it regularly. Watch for N+1 query patterns from ORMs, missing pagination on large result sets, and queries that return more columns than needed. Sometimes the fix is application-level: caching, denormalization, or restructuring the data access pattern.
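The N+1 pattern mentioned above is worth seeing concretely: one query for the parent rows, then one more per row for the children, versus a single JOIN. A runnable sketch with SQLite (schema and data invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INT, title TEXT);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bo');
    INSERT INTO books VALUES (1, 1, 'A'), (2, 1, 'B'), (3, 2, 'C');
""")

def n_plus_one():
    # Anti-pattern: 1 query for authors + 1 query per author = 1 + N round trips.
    results = []
    for aid, name in conn.execute("SELECT id, name FROM authors ORDER BY id"):
        books = conn.execute(
            "SELECT title FROM books WHERE author_id = ?", (aid,)).fetchall()
        results.append((name, [t for (t,) in books]))
    return results

def joined():
    # Fix: one JOIN, one round trip, regardless of author count.
    rows = conn.execute("""
        SELECT a.name, b.title FROM authors a
        JOIN books b ON b.author_id = a.id
        ORDER BY a.id, b.id""").fetchall()
    out = {}
    for name, title in rows:
        out.setdefault(name, []).append(title)
    return list(out.items())
```

ORMs generate the first shape silently inside loops; eager-loading options (e.g. join or batch fetching) produce the second.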

Replication Lag & High Availability

Replication lag means your read replicas are behind the primary, serving stale data. Small, consistent lag (under 1 second) is usually acceptable, but growing lag indicates the replica can't keep up with write volume and will eventually fall so far behind that it must be rebuilt.

Common causes: the replica has less CPU/IO capacity than the primary, long-running queries on the replica block replication, large transactions (bulk imports) create replication bursts, or network latency between primary and replica is too high. Monitor replication lag continuously — in PostgreSQL check pg_stat_replication, in MySQL use SHOW REPLICA STATUS (SHOW SLAVE STATUS before 8.0.22), in MongoDB check rs.status().

For critical reads that cannot tolerate stale data, always read from the primary. Design your application to be aware of replication topology: route writes to the primary and reads to replicas, with a fallback to the primary when strong consistency is required. If replication consistently can't keep up, consider scaling the primary vertically or sharding horizontally.
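The routing policy above (reads to the least-lagged replica, fallback to the primary when nothing is fresh enough) can be sketched as a pure function; the lag numbers would come from pg_stat_replication or SHOW REPLICA STATUS, and the function and host names here are hypothetical:

```python
def choose_replica(replicas, max_lag_seconds=1.0):
    """Pick a read target given current replica lag.

    `replicas` maps host name -> measured lag in seconds. Any replica
    over the lag budget is excluded; if none qualify, fall back to the
    primary rather than serve data that is too stale.
    """
    eligible = {host: lag for host, lag in replicas.items()
                if lag <= max_lag_seconds}
    if not eligible:
        return "primary"  # strong-consistency fallback
    return min(eligible, key=eligible.get)  # least-lagged replica

print(choose_replica({"replica-a": 0.2, "replica-b": 3.5}))  # replica-a
print(choose_replica({"replica-a": 5.0, "replica-b": 3.5}))  # primary
```

Writes, and any read that must observe its own preceding write, always go to the primary regardless of lag.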

Quick Troubleshooting Guide

| Symptom | Likely Cause | First Step |
| --- | --- | --- |
| Connection refused | Database not running or wrong host/port | Verify service status; check host, port, and firewall rules |
| Too many connections | Connection pool exhaustion or connection leaks | Implement connection pooling; check for leaked connections; review pool size |
| Authentication failed | Wrong credentials or pg_hba.conf / user grants | Verify username/password; check host-based auth config and user privileges |
| Deadlock detected | Transactions locking rows in conflicting order | Add retry logic; ensure consistent row access order; shorten transactions |
| Lock wait timeout exceeded | Long-running transaction blocking others | Identify blocking query; kill or wait for it; add missing indexes |
| Slow queries / high latency | Missing indexes or inefficient query plans | Run EXPLAIN ANALYZE; add indexes on WHERE/JOIN columns; optimize query |
| Replication lag increasing | Replica under-provisioned or large write bursts | Scale replica resources; break large transactions into smaller batches |
| Disk space full | Uncontrolled data growth or WAL/binlog accumulation | Free space; set up log rotation; archive or purge old data; increase storage |
| Out of memory (OOM killed) | work_mem or buffer pool too large for available RAM | Tune memory settings; reduce max_connections; upgrade instance size |
| Corrupted data or crash recovery | Unclean shutdown, disk failure, or bug | Restore from backup; run integrity checks (pg_checksums, mysqlcheck) |

Category Deep Dives

Frequently Asked Questions