The NoSQL world has risen and blossomed in the past few years, giving us more options for choosing a database than the past few decades did. Now we can think about the nature of our data and actually pick the type of database that best suits our needs. That is one of the coolest things about the NoSQL ecosystem: it contains many types of databases, and each type is best applied to a different kind of problem.
One of the dominant databases on the NoSQL market at the moment is certainly MongoDB. It is a document NoSQL database, and thus closer to traditional relational databases than some other types of NoSQL databases, which can probably explain its success. You can read about this and some other MongoDB basics in the previous post. Now we are going to dive a little bit deeper into MongoDB's features for replication and scaling.
Replica Sets
What is replication? Well, one of the reasons NoSQL databases evolved in the first place was the need for a different way of storing data in distributed systems. One server was not good enough anymore; now we have multiple servers with multiple NoSQL databases running on them. The process of synchronizing data across all the servers in the system is called replication.
The goal of replication is for all servers to have the same data, so users accessing different servers see the same information. This process ensures data durability and makes sure that no data is lost if one server fails or the network has problems for some time. It also increases the general availability of the data and eliminates downtime for maintenance. Seems pretty cool, right?
In MongoDB, replication is implemented using replica sets. A replica set is a group of two or more MongoDB processes (mongod) hosting the same data set. Each of these processes is considered one node, and a node can be either a data-bearing node or an arbiter. Based on the operations that can be performed on them, data-bearing nodes are either primary or secondary.
There can be only one primary node in a replica set, and that is the node on which all write operations are performed. When a write operation is applied on the primary, it records information about the operation in its operation log, the oplog. Write operations never go to the secondary nodes; they replicate the primary's oplog and then apply the recorded operations to their own data sets. That way, each secondary's data set mirrors the primary's. A MongoDB client can be configured to read data from the secondary nodes.
These nodes maintain a heartbeat between them as a mechanism for detecting a failing node, so if one node fails, the other nodes become aware of it. But what if the primary node fails? In that case, the remaining secondary nodes hold an election among themselves to choose a new primary. Elections also occur after initiating a replica set. Once the new primary is elected, the cluster is operational again and can receive write operations. If the old primary recovers, it will rejoin the replica set, but no longer as the primary: it will resync its data with the other nodes and rejoin as a secondary.
That's a lot of theory, but what does it look like in practice? As in the previous article, the MongoDB shell client is used for demonstration purposes. Here, I started one instance of mongod with some extra parameters.
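A command along these lines does the job (the data directory path is an illustrative placeholder, so use whatever suits your setup):

mongod --port 27017 --dbpath /data/rs0-0 --replSet rs0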
As you can see, I started it using an extra parameter, --replSet. Using this parameter, I defined a name for the replica set. Now, I've made a similar call on port 27018:
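# same replica set name, different port and (illustrative) data directory
mongod --port 27018 --dbpath /data/rs0-1 --replSet rs0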
And on port 27019:
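mongod --port 27019 --dbpath /data/rs0-2 --replSet rs0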
The next thing we want to do is put all these nodes together in one replica set. To do that, we need to attach to one of the running mongod processes using the MongoDB shell client. So I called mongo to connect to the mongod process that is running on port 27017, and after that, I called the rs.initiate() method.
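The connection itself looks roughly like this:

mongo --port 27017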
This method initiates the replica set and configures it with a JSON document, if one is passed into the function. As you can see, I used this JSON document for the configuration:
{
  _id : "rs0",
  members: [
    { _id : 0, host : "localhost:27017" }
  ]
}
In members, I added just the node that is running on port 27017. To add other nodes, you can simply extend this JSON to look like this:
{
  _id : "rs0",
  members: [
    { _id : 0, host : "localhost:27017" },
    { _id : 1, host : "localhost:27018" },
    { _id : 2, host : "localhost:27019" }
  ]
}
The other way this can be achieved is by calling the rs.add() method, like this:
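// run on the primary; the hosts match the nodes started above
rs.add("localhost:27018")
rs.add("localhost:27019")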
Another useful method is db.isMaster(). It will show some interesting data about the database we are connected to and about the replica set it belongs to:
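// trimmed output; the exact set of fields varies by MongoDB version
{
  "hosts" : [ "localhost:27017", "localhost:27018", "localhost:27019" ],
  "setName" : "rs0",
  "ismaster" : true,
  "secondary" : false,
  "primary" : "localhost:27017",
  "me" : "localhost:27017",
  "ok" : 1
}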
Here we can see the name of the replica set, which node is the primary, which node we are connected to, and so on. If you need more information about the replica set itself, use the rs.status() method, which will give you a more detailed report.
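Heavily trimmed, its output looks roughly like this:

{
  "set" : "rs0",
  "members" : [
    { "name" : "localhost:27017", "stateStr" : "PRIMARY", "health" : 1 },
    { "name" : "localhost:27018", "stateStr" : "SECONDARY", "health" : 1 },
    { "name" : "localhost:27019", "stateStr" : "SECONDARY", "health" : 1 }
  ],
  "ok" : 1
}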
Sharding
So, replica sets gave us the ability to hold data in multiple databases, thus giving us a certain level of fault tolerance and data durability. But this approach has certain limitations. As already mentioned, all write operations go to the primary node, making it the bottleneck of the system. This means that as the system grows, the primary node will be overloaded and will eventually run into pure hardware limits such as RAM, the number of CPUs, and disk space.
There are two methods to address this issue, two ways of scaling:
- Vertical scaling – This method involves increasing the performance of the server on which the primary node is running, by adding more RAM or increasing disk space. But this is very expensive and not the best solution for distributed systems.
- Horizontal scaling – This method involves dividing data and load among multiple servers.
Horizontal scaling in MongoDB is done through sharding. The idea is to have multiple replica sets, with multiple primaries, which divide the data and the load among themselves. Each of these replica sets is called a shard. But multiple shards alone are not enough for this kind of system to function properly. First, there is a need for a router that will route queries and operations to the proper shard.
For this purpose, MongoDB provides a new daemon process called mongos. The system also needs to know which part of the data lives in which shard, and that is handled by an additional replica set called the config server. The combination of multiple shards, mongos processes, and config servers is called a sharded cluster.
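To make this concrete, a minimal local sketch could look like the following (all ports, paths, and names here are illustrative, and the config server replica set must itself be initiated with rs.initiate() before use):

# start a config server node and a mongos router
mongod --configsvr --replSet cfg --port 27020 --dbpath /data/cfg0
mongos --configdb cfg/localhost:27020 --port 27021

After connecting a mongo shell to the mongos process, each replica set is registered as a shard by calling sh.addShard("rs0/localhost:27017").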
Sharding itself is done at the collection level. This means that a specific collection is distributed among the shards, not the whole database, and it is done by calling the sh.shardCollection("DATABASE_NAME.COLLECTION_NAME", SHARD_KEY) method. Once this command is called, the collection is distributed among the different shards. Each shard will contain a range of the collection's data, and the mongos process will send queries to the proper shards.
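As a sketch, with a hypothetical database, collection, and shard key, the calls from a shell connected to mongos could look like this:

// "blog", "posts", and "authorId" are illustrative names
sh.enableSharding("blog")
sh.shardCollection("blog.posts", { authorId: 1 })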
Conclusion
Replica sets and sharding are important MongoDB features. By using replica sets, MongoDB is able to create recoverable and highly durable clusters, and by using sharding, it is able to meet the demands of data growth and horizontal scaling. This way, MongoDB covers some of the major demands of the market today. In the next post, I will talk a little bit more about the latest MongoDB product, MongoDB Atlas.