MongoDB Replication Lag: How to Check, Monitor, and Fix Secondary Delay
Learn how to accurately check MongoDB replication lag using rs.status() and rs.printSecondaryReplicationInfo(), monitor delay, and resolve common performance bottlenecks.
- Replication lag occurs when secondary nodes cannot apply oplog entries as fast as the primary generates them.
- Use rs.printSecondaryReplicationInfo() for a quick overview of lag across all secondary members.
- Common root causes include network latency, under-provisioned secondary hardware (Disk I/O), and long-running operations locking the database.
- Monitor the oplog window to ensure secondaries don't fall so far behind that they require a complete initial sync.
| Method | When to Use | Time | Risk |
|---|---|---|---|
| rs.printSecondaryReplicationInfo() | Quick visual check of lag time in mongosh | Seconds | None |
| rs.status() | Detailed inspection of optimes and node states | Minutes | None |
| db.getReplicationInfo() | Checking oplog size and capacity window | Seconds | None |
| Prometheus/Grafana Monitoring | Continuous tracking and alerting on replication metrics | Ongoing | None |
Understanding MongoDB Replication Lag
In a MongoDB replica set, the primary node receives all write operations and records them in its operation log (oplog). Secondary nodes replicate this oplog and apply the operations asynchronously to maintain identical data sets. Replication lag is the delay between an operation occurring on the primary and that same operation being applied on a secondary.
While a few milliseconds of lag is normal in asynchronous replication, significant lag (seconds, minutes, or even hours) introduces severe risks:
- Stale Reads: Applications reading from secondaries (readPreference: secondary or secondaryPreferred) will receive outdated data.
- Failover Data Loss: If the primary crashes while secondaries are lagging, the replica set may elect a new primary that is missing recent writes, leading to rollback and potential data loss.
- Oplog Exhaustion: If a secondary falls so far behind that the primary overwrites the oplog entries the secondary needs, the secondary goes into a RECOVERING state and requires a costly, resource-intensive full initial sync.
Step 1: Diagnosing the Lag
When alerts fire or users complain about stale data, your first step is to quantify the lag from the MongoDB shell (mongosh).
1. The Quick Check
The most straightforward command is rs.printSecondaryReplicationInfo() (formerly rs.printSlaveReplicationInfo() in older versions).
rs.printSecondaryReplicationInfo()
Output Example:
source: db2.example.com:27017
syncedTo: Thu Oct 26 2023 14:32:10 GMT-0400 (EDT)
25 secs (0 hrs) behind the primary
This tells you exactly how many seconds behind the primary each secondary is based on the wall-clock time of the last applied operation.
2. The Detailed Check
For more granular data, use rs.status(). Look at the optimes document for each member.
rs.status()
You want to compare the primary's optime with the secondary's optime and lastAppliedWallTime. If the stateStr of a node is RECOVERING, it may have fallen off the oplog entirely.
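The comparison can be scripted. Below is a minimal sketch that derives per-member lag from the optimeDate fields of an rs.status()-shaped document; the mock object stands in for a live rs.status() call, which you would pass in directly from mongosh (estimateLag(rs.status())).

```javascript
// Sketch: compute per-secondary lag from an rs.status()-shaped document.
// optimeDate is the wall-clock time of the last operation each member applied.
function estimateLag(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) throw new Error("no PRIMARY in member list");
  return status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => ({
      name: m.name,
      // Subtracting two Dates yields milliseconds; convert to seconds.
      lagSeconds: (primary.optimeDate - m.optimeDate) / 1000,
    }));
}

// Mock status illustrating a secondary 25 seconds behind the primary.
const mockStatus = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY",
      optimeDate: new Date("2023-10-26T18:32:35Z") },
    { name: "db2:27017", stateStr: "SECONDARY",
      optimeDate: new Date("2023-10-26T18:32:10Z") },
  ],
};
console.log(estimateLag(mockStatus)); // [{ name: "db2:27017", lagSeconds: 25 }]
```

Because the function only touches fields that rs.status() members always expose (name, stateStr, optimeDate), it works unchanged against a live replica set.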
3. Check the Oplog Window
Run db.getReplicationInfo() on the primary to see how much time your oplog covers. If your oplog window is 2 hours, and a secondary lags by 2.5 hours, it will never catch up.
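The "will it ever catch up" check can be reduced to a ratio of lag to window. A sketch, where windowSeconds would come from db.getReplicationInfo().timeDiff on the primary and lagSeconds from the quick check above; the 50% warning threshold is an arbitrary choice for illustration:

```javascript
// Sketch: classify how close a lagging secondary is to falling off the oplog.
function oplogRisk(windowSeconds, lagSeconds, warnRatio = 0.5) {
  const ratio = lagSeconds / windowSeconds;
  if (ratio >= 1) return "FALLEN_OFF"; // needs a full initial sync
  if (ratio >= warnRatio) return "AT_RISK";
  return "OK";
}

// A 2-hour oplog window with 2.5 hours of lag: the secondary can never
// catch up by replaying the oplog.
console.log(oplogRisk(2 * 3600, 2.5 * 3600)); // "FALLEN_OFF"
// A 48-hour window with 25 seconds of lag is healthy.
console.log(oplogRisk(48 * 3600, 25)); // "OK"
```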
Step 2: Identifying the Root Cause
If you have established that lag exists, you must identify why the secondary cannot keep up.
Under-provisioned Hardware (Disk I/O):
Secondaries must perform the same number of write operations as the primary. If your secondaries are running on slower disks (e.g., lower IOPS EBS volumes in AWS) or have less RAM for the WiredTiger cache, they will naturally fall behind during write-heavy bursts. Check your system metrics (iostat, CloudWatch) for high disk latency or wait times on the secondary.
Network Latency and Bandwidth:
Replication relies on continuous network streaming. If secondaries are in a different geographic region (cross-region replication) and the network link is saturated or experiencing packet loss, the oplog cannot be transferred fast enough. Use tools like ping and iperf between the primary and secondary to verify network health.
Long-Running Operations & Locking:
Heavy administrative operations on a secondary, such as building large indexes in the foreground (prior to MongoDB 4.2) or massive analytical queries that exhaust the cache, can block replication threads from applying oplog entries. Check db.currentOp() on the lagging secondary to see if replication threads are waiting on locks.
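Scanning db.currentOp() output by eye is error-prone, so a small filter helps. This sketch assumes the standard currentOp fields (inprog, waitingForLock, secs_running, desc, ns); the mock object stands in for a live db.currentOp(true) call, and the 30-second threshold is an arbitrary example value:

```javascript
// Sketch: flag operations that are waiting on locks or have been running
// long enough to plausibly block oplog application on a secondary.
function flagBlockers(currentOp, thresholdSecs = 30) {
  return currentOp.inprog
    .filter((op) => op.waitingForLock === true ||
                    (op.secs_running || 0) > thresholdSecs)
    .map((op) => ({ desc: op.desc, ns: op.ns, secs: op.secs_running }));
}

// Mock inprog array: an analytical query that has been running 10 minutes
// alongside a fast, harmless query.
const mockOps = { inprog: [
  { desc: "conn42", ns: "analytics.events", op: "query",
    secs_running: 600, waitingForLock: false },
  { desc: "conn43", ns: "app.users", op: "query",
    secs_running: 1, waitingForLock: false },
]};
console.log(flagBlockers(mockOps)); // flags only conn42
```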
Massive Write Spikes: A sudden, massive data import or update (e.g., updating millions of documents in a single script) can overwhelm the replication stream. The primary might write these sequentially very fast, but secondaries often batch and apply them differently, sometimes causing temporary lag that resolves once the burst ends.
Step 3: Resolving the Issue
- Scale Up Secondaries: Ensure all nodes in the replica set have identical hardware specifications, particularly storage IOPS and RAM.
- Increase Oplog Size: If your oplog window is too short for your bursty workloads, increase the oplog size dynamically using the replSetResizeOplog command.
- Optimize Write Patterns: Break massive batch updates into smaller, throttled chunks to allow secondaries time to process the replication stream without falling behind.
- Resync if Necessary: If a node is completely lost and its state is STARTUP2 or RECOVERING indefinitely, wipe its data directory and allow it to perform an initial sync, or seed it manually using a recent snapshot.
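The "smaller, throttled chunks" advice can be sketched with a plain chunking helper. In mongosh, each chunk would then go through db.collection.bulkWrite(chunk, { writeConcern: { w: "majority" } }), since waiting on majority acknowledgment naturally paces the primary to what the secondaries can absorb; the chunk size of 1,000 below is an arbitrary example:

```javascript
// Sketch: split a large batch of write operations into fixed-size chunks
// so each chunk can be sent (and acknowledged) separately.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// 2,500 queued updates in chunks of 1,000 -> 3 batches (1000, 1000, 500).
const batches = chunk(Array.from({ length: 2500 }, (_, i) => i), 1000);
console.log(batches.length);    // 3
console.log(batches[2].length); // 500
```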
Quick Reference: Useful Commands
// 1. Check replication lag across all secondaries (mongosh)
rs.printSecondaryReplicationInfo();
// 2. Check the size and capacity of the oplog on the primary
db.getReplicationInfo();
// 3. Custom script to alert if any secondary is more than 60 seconds behind
var status = rs.status();
var primaryOptime = status.optimes.lastCommittedOpTime.ts.t;
status.members.forEach(function(member) {
if (member.stateStr === "SECONDARY") {
var lagSeconds = primaryOptime - member.optime.ts.t;
print("Node: " + member.name + " | Lag: " + lagSeconds + " seconds");
if (lagSeconds > 60) {
print("WARNING: Node " + member.name + " is experiencing severe lag!");
}
}
});
// 4. Dynamically increase oplog size to 50GB (run on primary)
db.adminCommand({replSetResizeOplog: 1, size: 51200});
Error Medic Editorial
Error Medic Editorial is a team of veteran Site Reliability Engineers and Database Administrators dedicated to creating actionable, code-first troubleshooting guides for production infrastructure.