Error Medic

Fixing Cassandra 'Connection Refused', OOM, and Slow Queries: A Complete Guide

Comprehensive troubleshooting guide to resolve Cassandra connection refused, out of memory (OOM), slow queries, timeouts, and SSTable corruption errors.

Key Takeaways
  • Network configuration (`rpc_address`, `broadcast_address`) is the most common cause of 'Connection Refused' in new clusters.
  • Out of Memory (OOM) usually stems from unoptimized JVM heap settings or large partition reads overwhelming the heap.
  • Slow queries and timeouts often indicate tombstone buildup, unbalanced clusters, or insufficient node capacity.
  • Data corruption requires immediate intervention using `nodetool scrub` or restoring SSTables from recent backups.
Fix Approaches Compared
Method | When to Use | Time | Risk
--- | --- | --- | ---
Update cassandra.yaml | Connection refused / bind errors | 5 mins | Low
JVM Heap Tuning | OOM / high GC pauses | 15 mins | Medium
nodetool scrub | SSTable corruption | Hours | High
Compaction Strategy Change | Slow queries / tombstone issues | Days | Medium

Understanding the Error Landscape

Apache Cassandra is a highly available, distributed NoSQL database designed to handle massive amounts of data across multiple commodity servers with no single point of failure. However, like any complex distributed system, it can experience connectivity, memory, and performance issues. The Connection Refused error typically presents itself when a client application, driver, or the cqlsh utility attempts to connect to a Cassandra node and is actively rejected.

When this happens alongside other symptoms like Cassandra out of memory, Cassandra slow query, Cassandra timeout, or even Cassandra corruption, the root cause often traces back to configuration drift, resource exhaustion, network segmentation, or storage layer anomalies. In this comprehensive guide, we will break down each of these critical failure modes and provide actionable, step-by-step remediation strategies.

Step 1: Diagnosing 'Connection Refused'

The Connection Refused error means the OS actively rejected the TCP connection attempt. This typically occurs because:

  1. The Cassandra process (JVM) is down or crashed.
  2. Cassandra is binding to a different network interface than the one you are connecting to.
  3. A local firewall (like iptables or firewalld) is blocking the port.

First, verify if the Cassandra process is actually running. A node that has crashed due to an Out of Memory (OOM) error will result in connection refused for all new clients.

Check the service status and open ports:

systemctl status cassandra
netstat -plnt | grep 9042

If the service is active but the port 9042 (native transport) is missing from the netstat output, check /var/log/cassandra/system.log for initialization errors. If the service is listening on 127.0.0.1 but you are connecting from an external application, the issue lies in your cassandra.yaml configuration.
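The third possibility from the list above, a firewall, can be distinguished from a bind problem with a plain TCP probe run from the client machine. A minimal sketch, assuming the node's address is 192.168.1.100 and bash's built-in /dev/tcp is available:

```shell
# Probe the native transport port from the client side (host is an assumed example).
# An immediate failure usually means "connection refused" (process down or wrong
# bind address); hanging for the full 3 seconds suggests a firewall dropping packets.
host=192.168.1.100
port=9042
if timeout 3 bash -c "cat < /dev/null > /dev/tcp/${host}/${port}" 2>/dev/null; then
  echo "port open"
else
  echo "port closed or filtered"
fi
```

If the probe succeeds from the node itself but fails from a remote client, inspect the iptables or firewalld rules for port 9042.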

Step 2: Fixing Network Binding Issues

Open /etc/cassandra/cassandra.yaml (or your platform's equivalent configuration path) and locate the network settings. By default, for security reasons, Cassandra binds to localhost.

To allow external client connections, modify the rpc_address. To allow other nodes to communicate with this node, modify listen_address and broadcast_address.

# Address or interface to bind the native transport server to.
rpc_address: 0.0.0.0

# Address to advertise to client drivers (not to other Cassandra nodes)
broadcast_rpc_address: 192.168.1.100

# Address or interface to bind to and tell other Cassandra nodes to connect to.
listen_address: 192.168.1.100

Note: Setting rpc_address to 0.0.0.0 listens on all interfaces, but the node cannot advertise 0.0.0.0 to drivers, so you must explicitly set broadcast_rpc_address to a routable address (Cassandra refuses to start otherwise). After making these changes, restart the service:

sudo systemctl restart cassandra

Step 3: Addressing Out of Memory (OOM) Errors

If the connection is refused because the node repeatedly crashes with a java.lang.OutOfMemoryError: Java heap space, you must investigate JVM heap allocation and garbage collection (GC) behavior. Cassandra keeps memtables and its key/row caches on the JVM heap, and relies on the OS page cache for reading SSTables.

Check for OOM events in the logs:

grep -i "OutOfMemoryError" /var/log/cassandra/system.log

To resolve OOM crashes, adjust your jvm-server.options (for modern Cassandra versions) or cassandra-env.sh (for older versions).

# Edit /etc/cassandra/jvm-server.options
-Xms16G
-Xmx16G

Heap Sizing Rule of Thumb: Set the maximum heap size (-Xmx) to half of the system RAM, up to a maximum of 31GB (to keep compressed object pointers, which the JVM disables above roughly 32GB). If your server has 64GB of RAM, 31GB is ideal. If you still see OOMs with a 31GB heap, the issue is likely application-side: you may be querying excessively large partitions, triggering unbounded materialization of rows in memory.
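The rule of thumb is easy to compute mechanically. A minimal sketch (the RAM figure is illustrative; read it from the host in practice):

```shell
# heap = RAM / 2, capped at 31G so the JVM keeps compressed object pointers.
ram_gb=64   # illustrative; on a live host: free -g | awk '/^Mem:/{print $2}'
heap_gb=$(( ram_gb / 2 ))
if [ "$heap_gb" -gt 31 ]; then heap_gb=31; fi
echo "-Xms${heap_gb}G"
echo "-Xmx${heap_gb}G"
```

For a 64GB host this prints `-Xms31G` and `-Xmx31G`, matching the guidance above.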

Step 4: Mitigating Slow Queries and Timeouts

Timeouts (ReadTimeoutException or WriteTimeoutException) and slow queries indicate that the coordinator node did not receive responses from enough replicas within the configured time (read_request_timeout_in_ms, default 5000ms).

Common causes include:

  • Heavy Garbage Collection Pauses: Long "Stop-The-World" GC pauses freeze the node, causing requests to queue and time out.
  • Tombstone Overload: If you frequently DELETE data or write rows with TTLs, Cassandra creates tombstones. Queries that must scan thousands of tombstones to find a few live rows will time out.
  • Unbalanced Clusters: One node holds significantly more data or receives more traffic.
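The first of these causes leaves a clear fingerprint: Cassandra logs every significant pause through its GCInspector. The sample line below mimics the system.log format so the filter is runnable anywhere; on a real node, grep the actual log instead:

```shell
# Simulated log line in GCInspector's format (illustrative values).
printf 'WARN  [Service Thread] GCInspector.java:282 - ParNew GC in 1240ms.\n' > /tmp/system.log

# On a live node: grep "GCInspector" /var/log/cassandra/system.log | tail -n 20
grep "GCInspector" /tmp/system.log
```

Pauses that regularly exceed a few hundred milliseconds are worth correlating with your timeout spikes.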

Diagnose slow queries using cqlsh tracing:

cqlsh> TRACING ON;
cqlsh> SELECT * FROM users_keyspace.user_events WHERE user_id = '12345';

The trace output will reveal exactly where time is being spent. If scanning tombstones is the culprit, consider lowering your gc_grace_seconds for the table and manually forcing a major compaction (though major compactions should be used cautiously with SizeTieredCompactionStrategy).
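For example, gc_grace_seconds can be lowered per table (the value below is illustrative; it must remain longer than your repair interval, or deleted rows can resurrect):

```sql
-- Default gc_grace_seconds is 864000 (10 days).
ALTER TABLE users_keyspace.user_events WITH gc_grace_seconds = 259200;  -- 3 days
```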

nodetool tablestats users_keyspace.user_events

Look for Maximum tombstones per slice. If it's in the thousands, you have a data model issue.
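That check can be scripted. The heredoc below mimics the tablestats output format so the filter is self-contained; on a live node, pipe `nodetool tablestats` into the same awk command (the 1,000 threshold is a working assumption, not a Cassandra limit):

```shell
# Simulated tablestats excerpt (illustrative values).
cat <<'EOF' > /tmp/tablestats.txt
	Average tombstones per slice (last five minutes): 512.0
	Maximum tombstones per slice (last five minutes): 4821
EOF

# Flag the table when the per-slice maximum crosses the threshold.
awk -F': ' '/Maximum tombstones per slice/ { if ($2+0 > 1000) print "WARN tombstones per slice: " $2 }' /tmp/tablestats.txt
```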

Step 5: Handling Cassandra Data Corruption

In rare cases, abrupt power loss, faulty disks, or severe kernel panics can lead to SSTable corruption. The Cassandra process may crash on startup, or specific queries will fail, logging a CorruptSSTableException.

When corruption is detected, do not ignore it. You must run nodetool scrub on the affected keyspace and table. The scrub process rewrites the SSTables, discarding the corrupted chunks.

# If the node is running, scrub online via nodetool:
nodetool scrub <keyspace_name> <table_name>

# If the node crashes on startup, stop it and run the offline tool instead:
sudo systemctl stop cassandra
sstablescrub <keyspace_name> <table_name>

Warning: Scrubbing discards corrupted data. After the scrub completes, you must run a repair (nodetool repair) to fetch the lost data from healthy replicas in the cluster. If the corruption is widespread and multiple replicas are affected, you will need to restore the SSTables from your latest snapshot backups.

Summary of Best Practices

To prevent these issues from recurring:

  1. Monitor GC logs: Ship GC logs to a central observability platform to spot increasing pause times before they cause timeouts.
  2. Avoid large partitions: Keep partition sizes under 100MB to prevent memory pressure during reads.
  3. Use TimeWindowCompactionStrategy (TWCS): For time-series data, TWCS efficiently drops whole SSTables when TTLs expire, avoiding massive tombstone build-up.
  4. Regular Repairs: Run incremental or full repairs regularly to ensure data consistency across replicas, minimizing the impact of potential corruption.
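For the third recommendation, a time-series table under TWCS might be declared like this (keyspace, table, window size, and TTL are all illustrative):

```sql
CREATE TABLE metrics.sensor_readings (
    sensor_id    text,
    reading_time timestamp,
    value        double,
    PRIMARY KEY (sensor_id, reading_time)
) WITH compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': '1'
  }
  AND default_time_to_live = 604800;  -- rows expire after 7 days
```

Because each SSTable then holds roughly one day of data, expired days are dropped as whole files instead of being compacted away tombstone by tombstone.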

Quick Reference: Diagnostic Commands

# 1. Check if Cassandra is listening on the client port (9042)
netstat -tlnp | grep 9042

# 2. Check Cassandra logs for OOM, connection, or corruption issues
tail -n 100 /var/log/cassandra/system.log | grep -i 'error\|exception\|memory'

# 3. Check cluster status and node health
nodetool status

# 4. Investigate table statistics for high tombstone counts
nodetool tablestats my_keyspace.my_table

# 5. Scrub corrupted SSTables (replace keyspace and table names)
nodetool scrub my_keyspace my_table

# 6. Repair the table after scrubbing to restore lost data from replicas
nodetool repair -full my_keyspace my_table

Error Medic Editorial

Error Medic Editorial is a specialized team of senior Site Reliability Engineers and Database Administrators dedicated to simplifying complex infrastructure troubleshooting, performance tuning, and incident response.
