Production database systems often require high availability architectures. This is to ensure the availability of the data through redundancy across multiple instances. However, it is important to understand how to deploy and configure the database cluster in order to achieve high availability.
MongoDB provides database replication in two different architectures - Replica Set, and Sharded Cluster.
MongoDB uses asynchronous replication to distribute the data to secondary nodes, using the oplog (operation logs), the transaction log for write operations in the database. After the data is written into the disk in the Primary node, it writes the oplog to a special capped collection that keeps rolling records for a write operation (insert, update, delete). In this blog, we will review some options and parameters for MongoDB Replication.
Transaction Log Parameters
MongoDB replication requires at least three nodes for the automatic election process in case the primary node goes down. The replication of MongoDB relies on the oplog, for the data synchronization from Primary to Secondary nodes. The default oplog size for the WiredTiger storage engine is 5% of the free disk space.
We can tune the oplog size by specifying the parameter oplogSizeMB based on the predicted workload of the system. Setting the minimum retention period of the oplog to limit the old entries of the transaction log to be truncated, we can set the parameter storage.oplogMinRetentionHours in mongod configuration files.
Read and Write Configuration
MongoDB comes with Write and Read configuration settings, which you can configure depending on the needs from the application side. We can tune parameters for the written acknowledgment from primary to secondary. We can also set the Read option to distribute read traffic across all nodes of a Replica Set.
The write concern in MongoDB gives the option on how the data is acknowledged for write operations, both in Replica Set or Sharded Cluster architectures.
The typical configuration of write concern is shown below:
{ w: <value>, j: <boolean>, wtimeout: <number> }
For example, if we want to have the data written to the majority of nodes before the timeout, we can configure as below:
{ writeConcern: { w: majority, wtimeout: 2000 } }
It will synchronize the data before the timeout happens, in this case, the timeout is 2 seconds.
Another value of the writeConcern parameter besides the majority, are:
w: 1, this value acknowledges the write operation in a standalone MongoDB node or in the Primary node of Replica Set / Sharded Cluster.
w: 0, there is no acknowledgment of the write operation in this value.
Write concern above the value of 1 requires acknowledgement from the Primary and as many as the secondaries required to meet the write concern. For example, if we set the write concern of 4 in a MongoDB ReplicaSet with 7 nodes (1 Primary and 6 Secondary), it requires acknowledgement from at least 4 nodes (1 Primary and 3 Secondary) before it is returned as a successful write operation to the application.
Another option is the j option, which is the acknowledgment that the data that has been written in the disk journal. In a Replica Set configuration, there are 3 options that explain how to write the data in the journal as below:
j: true the configuration requires writing the data into the journal disk
j: false the configuration requires writing of the data in memory.
For example, the configuration with the j option as below:
{ writeConcern: { w: majority, j: true, wtimeout: 2000 } }
The last option is if the j parameter is not specified in the parameter, it depends on the parameter of writeConcernMajorityJournalDefault. If the parameter is true, then it will require writing on the disk journal, while if it is false then only requires writing on the memory.
MongoDB sends the read request to the Primary node by default. There are some options on how we can configure read requests in a Replica Set architecture. MongoDB provides a parameter called readPreference which have many options as below:
Primary: The default value of the read preference, all the read traffic will be sent to the primary.
primaryPreferred: The read traffic will be sent to the primary node most of the time, unless the primary node is not available it will reroute to the secondary node.
Secondary: The read traffic will be sent to the secondary node.
secondaryPreferred: The read traffic will be sent to the secondary node, unless the secondary node is not available it will reroute to the primary node.
Nearest: The read request will be sent to the nearest reachable node based on the latency threshold.
Secondary Nodes
The secondary node participates in a primary election process, so eventually, the secondary nodes will be promoted if the primary goes down. There are various types of secondary nodes in a MongoDB Replica Set architecture.
Secondary nodes
The “normal” secondary nodes, which are participating in the election process of the primary nodes and can accept the traffic from the application.
Arbiter nodes
The arbiter node in a Replica Set acts as a voter who participates in the election of the Primary. The node can not be elected as a Primary because it does not store any data. The arbiter node is used if you only run one Primary and one Secondary node, because when there are just two nodes, MongoDB will not elect the Secondary unless it has the majority of votes ( > 50%). The configuration of the MongoDB arbiter is similar to the other nodes in terms of installation, specify the Replica Set name, the difference is that we need to add rs.addArb command when joining the arbiter node to the Primary node.
Hidden nodes
We can set the secondary nodes as a hidden member node in MongoDB Replica Set. The hidden nodes will not be visible in the MongoDB driver, so there will not be traffic sent to the hidden nodes. The priority of the hidden nodes themselves is 0, so it will not participate in the election process of a Primary. The parameter that needs to be configured for the hidden nodes is: "hidden": true
We can use the hidden node as a special purpose node for data ingestion, analytics or backup node.
Delayed nodes
The delayed node must be a hidden node. The data in the hidden node reflects the earlier data from the Primary nodes based on the delayed time we configured. For example, if we configure the delayed time to be 30 minutes, the data will reflect the state of the Primary node as it was 30 minutes before. We can configure the delayed node for 30 minutes as shown below:
cfg = rs.conf()
cfg.members[0].priority = 0
cfg.members[0].hidden = true
cfg.members[0].slaveDelay = 1800
rs.reconfig(cfg)
The purpose of the delayed node is for fast recovery of the data if we delete or update some incorrect data accidentally from the Primary, we can immediately recover the data from the delayed nodes. It will reduce the recovery time compared to restoring the data from backup.
Node with Priority 0
The Secondary nodes with Priority 0 are the “normal” nodes but the difference is that they will not participate in the election process. The configuration is really straightforward, we just need to set the priority as shown below:
cfg = rs.conf()
cfg.members[0].priority = 0
rs.reconfig(cfg)
The use case of the Node with Priority 0 is when we want to have DR (Disaster Recovery) site on the other datacenter, we can replicate the data and set the priority to become 0.