Error Medic

Fixing Cassandra 'Connection Refused', OOM, and Slow Queries: A Complete Guide

Comprehensive troubleshooting guide to resolve Cassandra connection refused, out of memory (OOM), slow queries, timeouts, and SSTable corruption errors.

Key Takeaways
  • Network configuration (`rpc_address`, `broadcast_address`) is the most common cause of 'Connection Refused' in new clusters.
  • Out of Memory (OOM) usually stems from unoptimized JVM heap settings or large partition reads overwhelming the heap.
  • Slow queries and timeouts often indicate tombstone buildup, unbalanced clusters, or insufficient node capacity.
  • Data corruption requires immediate intervention using `nodetool scrub` or restoring SSTables from recent backups.
Fix Approaches Compared
Method | When to Use | Time | Risk
--- | --- | --- | ---
Update cassandra.yaml | Connection refused / bind errors | 5 mins | Low
JVM Heap Tuning | OOM / high GC pauses | 15 mins | Medium
nodetool scrub | SSTable corruption | Hours | High
Compaction Strategy Change | Slow queries / tombstone issues | Days | Medium

Understanding the Error Landscape

Apache Cassandra is a highly available, distributed NoSQL database designed to handle massive amounts of data across multiple commodity servers with no single point of failure. However, like any complex distributed system, it can experience connectivity, memory, and performance issues. The Connection Refused error typically presents itself when a client application, driver, or the cqlsh utility attempts to connect to a Cassandra node and is actively rejected.

When this happens alongside other symptoms like Cassandra out of memory, Cassandra slow query, Cassandra timeout, or even Cassandra corruption, the root cause often traces back to configuration drift, resource exhaustion, network segmentation, or storage layer anomalies. In this comprehensive guide, we will break down each of these critical failure modes and provide actionable, step-by-step remediation strategies.

Step 1: Diagnosing 'Connection Refused'

The Connection Refused error means the OS actively rejected the TCP connection attempt. This typically occurs because:

  1. The Cassandra process (JVM) is down or crashed.
  2. Cassandra is binding to a different network interface than the one you are connecting to.
  3. A local firewall (like iptables or firewalld) is blocking the port.

First, verify if the Cassandra process is actually running. A node that has crashed due to an Out of Memory (OOM) error will result in connection refused for all new clients.

Check the service status and open ports:

systemctl status cassandra
netstat -plnt | grep 9042

If the service is active but the port 9042 (native transport) is missing from the netstat output, check /var/log/cassandra/system.log for initialization errors. If the service is listening on 127.0.0.1 but you are connecting from an external application, the issue lies in your cassandra.yaml configuration.
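The third possibility from the list above, a firewall, can be distinguished from a bind problem with a plain TCP probe run from the client machine. A minimal sketch, assuming the node's address is 192.168.1.100 and bash's built-in /dev/tcp is available:

```shell
# Probe the native transport port from the client side (host is an assumed example).
# An immediate failure usually means "connection refused" (process down or wrong
# bind address); hanging for the full 3 seconds suggests a firewall dropping packets.
host=192.168.1.100
port=9042
if timeout 3 bash -c "cat < /dev/null > /dev/tcp/${host}/${port}" 2>/dev/null; then
  echo "port open"
else
  echo "port closed or filtered"
fi
```

If the probe succeeds from the node itself but fails from a remote client, inspect the iptables or firewalld rules for port 9042.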

Step 2: Fixing Network Binding Issues

Open /etc/cassandra/cassandra.yaml (or your platform's equivalent configuration path) and locate the network settings. By default, for security reasons, Cassandra binds to localhost.

To allow external client connections, modify the rpc_address. To allow other nodes to communicate with this node, modify listen_address and broadcast_address.

# Address or interface to bind the native transport server to.
rpc_address: 0.0.0.0

# Address to advertise to client drivers (not to other Cassandra nodes)
broadcast_rpc_address: 192.168.1.100

# Address or interface to bind to and tell other Cassandra nodes to connect to.
listen_address: 192.168.1.100

Note: Setting rpc_address to 0.0.0.0 listens on all interfaces, but the node cannot advertise 0.0.0.0 to drivers, so you must explicitly set broadcast_rpc_address to a routable address (Cassandra refuses to start otherwise). After making these changes, restart the service:

sudo systemctl restart cassandra

Step 3: Addressing Out of Memory (OOM) Errors

If the connection is refused because the node repeatedly crashes with a java.lang.OutOfMemoryError: Java heap space, you must investigate JVM heap allocation and garbage collection (GC) behavior. Cassandra keeps memtables and its key/row caches on the JVM heap, and relies on the OS page cache for reading SSTables.

Check for OOM events in the logs:

grep -i "OutOfMemoryError" /var/log/cassandra/system.log

To resolve OOM crashes, adjust your jvm-server.options (for modern Cassandra versions) or cassandra-env.sh (for older versions).

# Edit /etc/cassandra/jvm-server.options
-Xms16G
-Xmx16G

Heap Sizing Rule of Thumb: Set the maximum heap size (-Xmx) to half of the system RAM, up to a maximum of 31GB (to keep compressed object pointers, which the JVM disables above roughly 32GB). If your server has 64GB of RAM, 31GB is ideal. If you still see OOMs with a 31GB heap, the issue is likely application-side: you may be querying excessively large partitions, triggering unbounded materialization of rows in memory.
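The rule of thumb is easy to compute mechanically. A minimal sketch (the RAM figure is illustrative; read it from the host in practice):

```shell
# heap = RAM / 2, capped at 31G so the JVM keeps compressed object pointers.
ram_gb=64   # illustrative; on a live host: free -g | awk '/^Mem:/{print $2}'
heap_gb=$(( ram_gb / 2 ))
if [ "$heap_gb" -gt 31 ]; then heap_gb=31; fi
echo "-Xms${heap_gb}G"
echo "-Xmx${heap_gb}G"
```

For a 64GB host this prints `-Xms31G` and `-Xmx31G`, matching the guidance above.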

Step 4: Mitigating Slow Queries and Timeouts

Timeouts (ReadTimeoutException or WriteTimeoutException) and slow queries indicate that the coordinator node did not receive responses from enough replicas within the configured time (read_request_timeout_in_ms, default 5000ms).

Common causes include:

  • Heavy Garbage Collection Pauses: Long "Stop-The-World" GC pauses freeze the node, causing requests to queue and time out.
  • Tombstone Overload: If you frequently DELETE data or write rows with TTLs, Cassandra creates tombstones. Queries that must scan thousands of tombstones to find a few live rows will time out.
  • Unbalanced Clusters: One node holds significantly more data or receives more traffic.
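The first of these causes leaves a clear fingerprint: Cassandra logs every significant pause through its GCInspector. The sample line below mimics the system.log format so the filter is runnable anywhere; on a real node, grep the actual log instead:

```shell
# Simulated log line in GCInspector's format (illustrative values).
printf 'WARN  [Service Thread] GCInspector.java:282 - ParNew GC in 1240ms.\n' > /tmp/system.log

# On a live node: grep "GCInspector" /var/log/cassandra/system.log | tail -n 20
grep "GCInspector" /tmp/system.log
```

Pauses that regularly exceed a few hundred milliseconds are worth correlating with your timeout spikes.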

Diagnose slow queries using cqlsh tracing:

cqlsh> TRACING ON;
cqlsh> SELECT * FROM users_keyspace.user_events WHERE user_id = '12345';

The trace output will reveal exactly where time is being spent. If scanning tombstones is the culprit, consider lowering your gc_grace_seconds for the table and manually forcing a major compaction (though major compactions should be used cautiously with SizeTieredCompactionStrategy).
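For example, gc_grace_seconds can be lowered per table (the value below is illustrative; it must remain longer than your repair interval, or deleted rows can resurrect):

```sql
-- Default gc_grace_seconds is 864000 (10 days).
ALTER TABLE users_keyspace.user_events WITH gc_grace_seconds = 259200;  -- 3 days
```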

nodetool tablestats users_keyspace.user_events

Look for Maximum tombstones per slice. If it's in the thousands, you have a data model issue.
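That check can be scripted. The heredoc below mimics the tablestats output format so the filter is self-contained; on a live node, pipe `nodetool tablestats` into the same awk command (the 1,000 threshold is a working assumption, not a Cassandra limit):

```shell
# Simulated tablestats excerpt (illustrative values).
cat <<'EOF' > /tmp/tablestats.txt
	Average tombstones per slice (last five minutes): 512.0
	Maximum tombstones per slice (last five minutes): 4821
EOF

# Flag the table when the per-slice maximum crosses the threshold.
awk -F': ' '/Maximum tombstones per slice/ { if ($2+0 > 1000) print "WARN tombstones per slice: " $2 }' /tmp/tablestats.txt
```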

Step 5: Handling Cassandra Data Corruption

In rare cases, abrupt power loss, faulty disks, or severe kernel panics can lead to SSTable corruption. The Cassandra process may crash on startup, or specific queries will fail, logging a CorruptSSTableException.

When corruption is detected, do not ignore it. You must run nodetool scrub on the affected keyspace and table. The scrub process rewrites the SSTables, discarding the corrupted chunks.

# If the node is running, scrub online via nodetool:
nodetool scrub <keyspace_name> <table_name>

# If the node crashes on startup, stop it and run the offline tool instead:
sudo systemctl stop cassandra
sstablescrub <keyspace_name> <table_name>

Warning: Scrubbing discards corrupted data. After the scrub completes, you must run a repair (nodetool repair) to fetch the lost data from healthy replicas in the cluster. If the corruption is widespread and multiple replicas are affected, you will need to restore the SSTables from your latest snapshot backups.

Summary of Best Practices

To prevent these issues from recurring:

  1. Monitor GC logs: Ship GC logs to a central observability platform to spot increasing pause times before they cause timeouts.
  2. Avoid large partitions: Keep partition sizes under 100MB to prevent memory pressure during reads.
  3. Use TimeWindowCompactionStrategy (TWCS): For time-series data, TWCS efficiently drops whole SSTables when TTLs expire, avoiding massive tombstone build-up.
  4. Regular Repairs: Run incremental or full repairs regularly to ensure data consistency across replicas, minimizing the impact of potential corruption.
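For the third recommendation, a time-series table under TWCS might be declared like this (keyspace, table, window size, and TTL are all illustrative):

```sql
CREATE TABLE metrics.sensor_readings (
    sensor_id    text,
    reading_time timestamp,
    value        double,
    PRIMARY KEY (sensor_id, reading_time)
) WITH compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': '1'
  }
  AND default_time_to_live = 604800;  -- rows expire after 7 days
```

Because each SSTable then holds roughly one day of data, expired days are dropped as whole files instead of being compacted away tombstone by tombstone.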

Quick Reference: Diagnostic Commands

# 1. Check if Cassandra is listening on the client port (9042)
netstat -tlnp | grep 9042

# 2. Check Cassandra logs for OOM, connection, or corruption issues
tail -n 100 /var/log/cassandra/system.log | grep -i 'error\|exception\|memory'

# 3. Check cluster status and node health
nodetool status

# 4. Investigate table statistics for high tombstone counts
nodetool tablestats my_keyspace.my_table

# 5. Scrub corrupted SSTables (replace keyspace and table names)
nodetool scrub my_keyspace my_table

# 6. Repair the table after scrubbing to restore lost data from replicas
nodetool repair -full my_keyspace my_table

Error Medic Editorial

Error Medic Editorial is a specialized team of senior Site Reliability Engineers and Database Administrators dedicated to simplifying complex infrastructure troubleshooting, performance tuning, and incident response.
