Improving MongoDB Read Performance - Indexing, Replication and Sharding

July 23, 2020


Read performance is crucial for databases. If it takes too long to read a record from a database, this can stall the request for data from the client application, which could result in unexpected behavior and adversely impact user experience. For these reasons, the read operation on your database should last no more than a fraction of a second.

There are a number of ways to improve database read performance, though not all of these methods will work for every type of application. Rather, it is best to select one or two techniques based on the application type to prevent the optimization process itself from becoming a bottleneck.

The three most important methods are:

  • Indexing
  • Read replicas
  • Sharding

In this article, we’ll discuss how to apply these three methods, along with limiting data transfer, to improve read performance in MongoDB. We’ll also cover the built-in tools MongoDB offers for each.

Indexing to Improve MongoDB Read Performance

Indexing is one of the most common methods for improving read performance, not only in MongoDB but in any database, including relational ones.

When you index a table or collection, the database creates another data structure. This second data structure works like a lookup table for the fields on which you create the index. You can create a MongoDB index on just one document field or use multiple fields to create a compound index.

The values of the fields selected for indexing will be used in the index. The database will then mark the location of the documents against those values. Therefore, when you search or query a document using those values, the database will query the lookup table first. The database will then extract the exact location of the document from this lookup table and fetch it directly from the location. Thus, MongoDB will not have to query the entire collection to get a single document. This, of course, saves a great deal of time.
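The lookup-table idea can be illustrated with a toy model in plain Python. This is only a sketch: the sample documents are made up, and a dictionary stands in for the ordered index structure a real database maintains.

```python
# Toy model of an index: a lookup table from field value to document positions.
# The documents below are hypothetical sample data.
users = [
    {"name": "Ana", "email": "ana@example.com"},
    {"name": "Bo", "email": "bo@example.com"},
    {"name": "Ana", "email": "ana2@example.com"},
]


def collection_scan(docs, field, value):
    """Examine every document, as a query without an index would."""
    return [doc for doc in docs if doc.get(field) == value]


def build_index(docs, field):
    """Map each value of `field` to the positions of the documents holding it."""
    index = {}
    for position, doc in enumerate(docs):
        index.setdefault(doc.get(field), []).append(position)
    return index


def index_lookup(docs, index, value):
    """Fetch only the documents the index points to."""
    return [docs[position] for position in index.get(value, [])]


name_index = build_index(users, "name")

# Both approaches return the same documents; the lookup touches only two of them.
assert index_lookup(users, name_index, "Ana") == collection_scan(users, "name", "Ana")
```

The scan inspects all three documents, while the lookup goes straight to the two positions stored under "Ana".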

But blindly indexing the data won’t cut it. You should ensure you’re indexing the data exactly the way you plan to query it. For example, suppose you have two fields, “name” and “email,” in a collection called “users,” and most of your queries use both fields to filter the documents. In such cases, separate single-field indexes on “name” and “email” are usually not enough. A compound index on both fields serves these queries better.

In addition, the order of the fields in a compound index matters. A compound index supports queries on its leading fields (its prefixes), so an index on “name” followed by “email” serves queries that filter on “name” alone or on both fields. For pure equality filters, the order in which the query lists the fields does not matter, but if a query also sorts or filters on a range, the equality fields should come first in the index. If the field order of the index does not fit the query pattern, the MongoDB query optimizer may not select that index at all.

And if there are other queries that use the “email” field alone to filter documents, you will have to create another index only on the “email” field. This is because “email” is not a leading field of the compound index, so the query optimizer cannot use that index efficiently for these queries.
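The following sketch models which queries a compound index can serve. It is a deliberately simplified model, not MongoDB's planner: it considers only equality filters, ignores sorts and ranges, and the function name `usable_prefix` is made up for illustration.

```python
def usable_prefix(index_fields, query_fields):
    """Return the leading index fields that the query filters on.

    Simplified model: equality filters only, no sorts or ranges.
    An empty result means the index cannot narrow the search.
    """
    prefix = []
    for field in index_fields:
        if field not in query_fields:
            break
        prefix.append(field)
    return prefix


# Compound index on "name" followed by "email", as in the example above.
compound = ["name", "email"]

both = usable_prefix(compound, {"name", "email"})  # whole index is usable
name_only = usable_prefix(compound, {"name"})      # leading field is usable
email_only = usable_prefix(compound, {"email"})    # no usable prefix

# A separate single-field index covers the email-only query.
email_index = usable_prefix(["email"], {"email"})
```

In this model the email-only query gets nothing from the compound index, which is why it needs an index of its own.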

It’s also important to design your queries and indexes in the earliest stages of the project. If you already have huge amounts of data in your collections, creating indexes on that data will take a long time. Depending on the MongoDB version and build options, the index build can also block operations on the collection, which harms the performance of the application as a whole.

To make sure the query optimizer is selecting the proper index, or the index that you prefer, you can use the hint() method in the query. This method tells the query optimizer which index to use instead of letting it decide on its own. This can improve MongoDB read performance when the optimizer would otherwise pick a less suitable index. And remember, to optimize read performance this way in MongoDB, you should create indexes that match your most frequent query patterns.
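As a sketch of how this might look from application code, the functions below assume that `users` is a PyMongo collection object for the “users” collection. The function names are hypothetical. The calls used (`create_index`, `find`, and the cursor's `hint`) are part of PyMongo, and index specifications are written as lists of (field, direction) pairs, where 1 means ascending.

```python
# Index specification in PyMongo's list-of-pairs form; 1 means ascending.
NAME_EMAIL_INDEX = [("name", 1), ("email", 1)]


def ensure_indexes(users):
    """Create the compound index plus a single-field index for email-only queries.

    `users` is assumed to be a PyMongo Collection.
    """
    users.create_index(NAME_EMAIL_INDEX)
    users.create_index([("email", 1)])


def find_user(users, name, email):
    """Filter on both fields and ask the planner to use the compound index."""
    cursor = users.find({"name": name, "email": email})
    return cursor.hint(NAME_EMAIL_INDEX)
```

Passing the same specification to `hint()` that was used to create the index keeps the two in sync. A hint that names an index which does not exist causes the query to fail, so it is worth creating the indexes before relying on hints.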

Key Considerations When Using Indexing

Indexes take up extra storage space and reduce write performance, because MongoDB must update every affected index on each write operation. In return, having the right index for your query can greatly reduce its response time.

However, it’s important to check that you have the right index for all your queries. And if you change the fields a query filters or sorts on, you may need to update the indexes as well. While managing all these indexes may seem easy at first, as your application grows and you add more queries, managing them can become challenging.


Read Replicas to Offload Reads from the Primary Node

Another read-performance optimization technique that MongoDB offers out of the box is replication. A replica set consists of a primary node and one or more secondary nodes (the read replicas), which contain the same data as the primary node. The primary node is the node that executes the write operations, and hence, offers the most up-to-date data.

Read replicas, on the other hand, follow the operations that are performed on the primary node and execute those operations to make the same changes to the data they contain. Because this replication is asynchronous, data on the read replicas is updated with some delay.

Whenever data is updated on a primary node, it records the operations performed in a special capped collection called the oplog (operations log). The read replica nodes “follow” the oplog to understand the operations performed on the data. Then, the replicas perform these operations on the data they hold, thereby replicating these same operations.
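The follow-and-replay behaviour can be pictured with a small toy model. This is a sketch only: the class names, the tuple format of the log entries, and the `max_ops` parameter are invented for illustration and do not reflect MongoDB's actual oplog format.

```python
class Primary:
    """Applies writes and appends each one to an ordered operations log."""

    def __init__(self):
        self.data = {}
        self.oplog = []

    def write(self, key, value):
        self.data[key] = value
        self.oplog.append(("set", key, value))


class Secondary:
    """Replays the primary's log, starting after the last entry it applied."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries already replayed

    def catch_up(self, oplog, max_ops=None):
        """Replay pending entries; `max_ops` simulates a replica that lags."""
        end = len(oplog) if max_ops is None else min(len(oplog), self.applied + max_ops)
        for op, key, value in oplog[self.applied:end]:
            if op == "set":
                self.data[key] = value
        self.applied = end


primary, secondary = Primary(), Secondary()
primary.write("user:1", "Ana")
primary.write("user:1", "Ana Maria")

secondary.catch_up(primary.oplog, max_ops=1)
stale_value = secondary.data["user:1"]  # replica is one operation behind

secondary.catch_up(primary.oplog)
fresh_value = secondary.data["user:1"]  # replica has caught up
```

Between the two `catch_up` calls, a read from the secondary returns the older value. This is the stale-read window an application has to tolerate when it reads from replicas.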

Because of this replication lag, a read replica may briefly return older data than the primary. Aside from that, however, you can configure the MongoDB driver to execute all read operations on the secondary nodes. This keeps read traffic off the primary node, so your reads are not slowed down when the primary is busy. You do, however, need to ensure that your application is equipped to handle stale data.

MongoDB offers various read preferences when you’re working with replica sets. For example, the default mode, primary, always reads from the primary node. With primaryPreferred, the driver reads from the primary node but falls back to a secondary node when the primary is unavailable. The secondary and secondaryPreferred modes work the other way around.

And if you want the least possible network latency for your application, you can configure the driver to read from the “nearest” node. This nearest node could be either a secondary node or the primary node. This keeps the network latency between your application and the database as low as possible.
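One common way to set the read preference is through the connection string. The sketch below builds such a string with the standard `replicaSet` and `readPreference` options. The host names, the replica set name `rs0`, and the helper `build_uri` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# The five read preference modes MongoDB defines.
READ_PREFERENCE_MODES = {
    "primary",
    "primaryPreferred",
    "secondary",
    "secondaryPreferred",
    "nearest",
}


def build_uri(hosts, replica_set, read_preference):
    """Build a MongoDB connection string that carries a read preference."""
    if read_preference not in READ_PREFERENCE_MODES:
        raise ValueError(f"unknown read preference: {read_preference}")
    options = urlencode({"replicaSet": replica_set, "readPreference": read_preference})
    return f"mongodb://{','.join(hosts)}/?{options}"


# Hypothetical hosts and replica set name.
uri = build_uri(
    ["db1.example.net:27017", "db2.example.net:27017"],
    "rs0",
    "secondaryPreferred",
)
```

The resulting string can be passed to a driver's client constructor. With secondaryPreferred, reads go to a secondary when one is available and fall back to the primary otherwise.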

Key Considerations When Using Replication

The advantage of using read replicas is that offloading read operations from the primary node to the secondary nodes can increase speed.

The major disadvantage of this, however, is that you might not always get the latest data. Also, because you are just scaling horizontally here, by way of adding more hardware to your infrastructure, there is no optimization taking place. This means if you have a complex query that is performing poorly on your primary node, it would not see a major boost in performance even after adding read replicas. Therefore, it is recommended to use read replicas along with other optimization techniques.

Sharding a Collection to Distribute Data

As your application grows, the data in your MongoDB database increases as well. At a certain point, a single server will not be able to handle the load. This is when you would typically scale your servers. With MongoDB, however, it is best to shard a collection while it is still empty, before that point is reached.

Sharding is MongoDB’s way of supporting horizontal scaling. When you shard a MongoDB collection, the data is split across multiple server instances, so queries are spread across nodes instead of all hitting the same one. The data is split on a particular field that you select in the collection. Thus, you need to make sure that this field is present in all the documents in that collection. Otherwise, MongoDB sharding will not be properly executed and you might not get the expected results.

This also means that when you select a shard key (the field on which the data will be sharded), that field needs to have an index. This index helps the query router (the mongos process) route the query to the appropriate shard server. If you don’t have an index on the shard key alone, you should at least have a compound index that starts with the shard key.
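A minimal sketch of sharding a collection from Python follows. It assumes `admin_db` is the `admin` database object of a PyMongo client connected to a mongos router, and that the collection is either empty or already has an index starting with the shard key. The function name, the namespace “app.users”, and the field “user_id” are hypothetical. `enableSharding` and `shardCollection` are MongoDB administrative commands, and newer server versions may not need the `enableSharding` step.

```python
def shard_collection(admin_db, namespace, shard_key_field):
    """Shard `namespace` ("database.collection") on one ascending key.

    `admin_db` is assumed to be the admin database of a PyMongo client
    that is connected to a mongos router.
    """
    database_name = namespace.split(".", 1)[0]
    # Allow sharding for the database that holds the collection.
    admin_db.command("enableSharding", database_name)
    # Shard the collection on the chosen key (1 = ranged, ascending).
    admin_db.command("shardCollection", namespace, key={shard_key_field: 1})
```

A call such as `shard_collection(client.admin, "app.users", "user_id")` would then distribute the documents by user_id.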

Key Considerations When Using Sharding

As noted previously, the shard key and the index should be decided early on, since sharding a collection cannot simply be undone. To undo sharding, you’d have to create a new, unsharded collection and delete the old sharded collection.

Moreover, if you decide to shard a collection after the collection has accumulated a large amount of data, you’ll have to create an index on the shard key first, and then shard the collection. This process can take days to complete if not properly planned. Similar to read replicas, you are scaling the infrastructure horizontally here, and only queries that include the shard key can be routed to a single shard. Queries that filter on other fields must be sent to every shard. Therefore, if your queries or query patterns do not include the shard key, having a sharded collection might not help much. These are the major disadvantages of sharding a MongoDB collection.
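To see why queries that do not filter on the shard key benefit less from sharding, consider this toy router. It is a sketch under stated assumptions: the shard names, the “user_id” shard key, and the CRC-based placement are invented for illustration, whereas MongoDB itself assigns documents to shards through chunks of key ranges.

```python
import zlib

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names
SHARD_KEY = "user_id"                    # hypothetical shard key


def shard_for(value):
    """Pick a shard from a stable hash of the shard key value (toy placement)."""
    return SHARDS[zlib.crc32(str(value).encode("utf-8")) % len(SHARDS)]


def route(query):
    """Return the shards that must be asked to answer an equality query."""
    if SHARD_KEY in query:
        return [shard_for(query[SHARD_KEY])]  # targeted: a single shard
    return list(SHARDS)                       # broadcast: every shard


targeted = route({"user_id": 42})
broadcast = route({"email": "ana@example.com"})
```

The query on user_id goes to one shard, while the query on email has to be sent to all three and the results merged.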

Limiting Outgoing MongoDB Data to Reduce Data Transfer Time

When your application and the database are on different machines, which is usually the case in a distributed application, the data transfer over the network introduces a delay. This time increases as the amount of data transferred increases. It is therefore wise to limit the data transfer by querying only the data that is needed.

For example, if your application is querying data to be displayed as a list or table, you may prefer to query only the first 10 records and paginate the rest. This can greatly reduce the amount of data that needs to be transferred, thereby improving the read performance. You can use the limit() method on your queries for this.
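A pagination helper along these lines might look as follows. The sketch assumes `collection` is a PyMongo collection. The function name and the choice to sort on `_id` are illustrative, while `find`, `sort`, `skip`, and `limit` are standard PyMongo calls.

```python
PAGE_SIZE = 10


def fetch_page(collection, query, page_number, page_size=PAGE_SIZE):
    """Return a cursor over one page of results; page_number starts at 0.

    `collection` is assumed to be a PyMongo Collection.
    """
    # A stable sort order keeps pages consistent between requests.
    cursor = collection.find(query).sort("_id", 1)
    return cursor.skip(page_number * page_size).limit(page_size)
```

With the default page size, `fetch_page(users, {}, 2)` skips the first 20 documents and returns at most the next 10. For very deep pages, skipping becomes expensive because the server still walks past the skipped documents. Filtering on the last seen `_id` is a common alternative.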

In most cases, you don’t need the complete document; your application uses only a subset of the document fields. In such cases, you can query only those fields and not the entire document. This again reduces the amount of data transferred over the network, leading to faster read time.

This technique is called projection. In the mongo shell and in most drivers, you pass a projection document as the second argument to find(); the Node.js driver also offers a project() method on the cursor. You can project only those fields that are relevant to your application. The MongoDB documentation provides information on how to use these functions.
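As an illustration in Python, where a projection is passed as the second argument to `find()`, the sketch below returns only the “name” and “email” fields. It assumes `users` is a PyMongo collection, and the function name is hypothetical.

```python
def find_contact_fields(users, name):
    """Return matching users with only the name and email fields.

    `users` is assumed to be a PyMongo Collection.
    """
    # 1 includes a field; _id is returned by default unless set to 0.
    projection = {"name": 1, "email": 1, "_id": 0}
    return users.find({"name": name}, projection)
```

Every other field of the matching documents stays on the server, which reduces the amount of data sent over the network.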

Alternatives for Improving MongoDB Read Performance

While these optimization techniques provided by MongoDB can certainly be helpful, when there is an unbounded stream of data coming into your MongoDB database and continuous reads, these methods alone won’t cut it. A more performant and advanced solution that combines several techniques under the hood may be required.

For example, Rockset subscribes to any and all data changes in your MongoDB database and creates real-time data indexes, so that you can query for new data without worrying about performance. Rockset creates read replicas internally and shards the data so that every query is optimized and users don’t have to worry about this. Such solutions also provide more advanced methods of querying data, such as joins, SQL-based APIs, and more.


