In the course of implementing the Rockset connector to MongoDB, we did a fair amount of research on the MongoDB user experience, both online and through user interviews. We learned a lot about how organizations operate MongoDB in production, and our discussions invariably touched upon what it takes to achieve performance at scale. While it is very easy for developers to get started with MongoDB, getting good performance while scaling data volumes and usage involves getting to grips with sharding, indexing, schema design, isolating reads from writes, and a number of other possible optimizations.
Based on this, we put together a list of MongoDB performance tuning resources that we found useful and that echo ideas from our conversations, in the hope that you will find some of them helpful as well.
This is an excellent intro to sharding, which is what gives MongoDB its valuable horizontal scale-out property. Not only does its author, Ankush, introduce basic sharding concepts and the complex challenges around sharding, but the article also offers several useful sharding best practices for more advanced MongoDB users.
While MongoDB is well-loved for its flexible schema, the decisions made around sharding can impact database performance and the ability to introduce new query patterns downstream. This was a recurring theme we heard when speaking with MongoDB users. Unsurprisingly, the key to a positive MongoDB experience often lay in proper selection of the shard key (pun intended). Thinking through what makes for a suitable shard key helps stave off future issues with “jumbo” chunks, hot shards and imbalanced clusters. This is a good read should you be encountering such issues or proactively trying to avoid them.
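To make the shard key discussion concrete, here is a toy Python sketch of why hashed shard keys (which MongoDB supports natively via `{field: "hashed"}`) avoid hot shards for monotonically increasing values like timestamps. The `shard_for` helper and the four-shard setup are illustrative, not MongoDB's actual chunk-routing logic:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Illustrative: hash the shard key value to pick a shard,
    mimicking the spirit of MongoDB's hashed sharding."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# A monotonically increasing key (e.g. a timestamp) routed by range
# would land every new write on the same "hot" chunk. Hashing the
# same values instead spreads the writes across all shards.
keys = [f"2023-01-01T00:00:{i:02d}" for i in range(60)]
counts = [0] * NUM_SHARDS
for k in keys:
    counts[shard_for(k)] += 1
print(counts)  # writes spread across the 4 shards rather than piling onto one
```

With range-based sharding on the same timestamp key, all 60 writes would hit a single shard; the hash distributes them at the cost of losing efficient range queries on that key, which is exactly the kind of trade-off a shard key decision involves.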
Schema Design and Indexing
The author, Onyancha, reinforces several performance-related observations that came up in our conversations. A common thread in many MongoDB and broader NoSQL discussions is the tight coupling between schema design and query patterns. How the data is modeled has significant bearing on query performance. As a result, Onyancha states, “How to model the data will therefore depend on the application’s access pattern.” He goes on to provide pointers on when to use techniques like document embedding and denormalization.
Another top performance optimization involves the appropriate use of indexing. Hitting indexes, instead of scanning collections, allows for much faster querying and sorting. The blog explains how to use single field indexes and compound indexes in the MongoDB context. But aside from the mechanics of configuring indexes, defining a proper indexing strategy very much requires a solid grasp of “application queries, ratio of reads to writes, and how much free memory your system has,” with the added challenge that these may change over time.
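The gap between an index lookup and a collection scan can be illustrated with a small, self-contained Python sketch. This is a toy model using a sorted list and binary search, not MongoDB's actual B-tree implementation; in MongoDB itself you would create the index with `db.collection.createIndex({user_id: 1})` and let the query planner use it:

```python
import bisect

# 100,000 toy "documents" with a sorted user_id field.
docs = [{"user_id": i, "name": f"user{i}"} for i in range(100_000)]

def collection_scan(docs, user_id):
    """O(n): examine every document, as MongoDB must without an index."""
    return [d for d in docs if d["user_id"] == user_id]

# A single-field index boiled down to its essence: sorted keys whose
# positions point back at the documents (a stand-in for a B-tree).
index_keys = [d["user_id"] for d in docs]

def index_lookup(docs, user_id):
    """O(log n): binary-search the index, then fetch only matching documents."""
    lo = bisect.bisect_left(index_keys, user_id)
    hi = bisect.bisect_right(index_keys, user_id)
    return [docs[i] for i in range(lo, hi)]

assert collection_scan(docs, 4242) == index_lookup(docs, 4242)
```

Both functions return the same result, but the scan touches all 100,000 documents while the lookup touches roughly 17 index keys, which is the basic reason indexed queries and sorts are so much faster.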
Bulk Writes and Reads
These two blogs take a look at how to optimize for bulk writes and reads in MongoDB. The first notes an interesting, adverse side effect of checkpointing on bulk load performance. In short, if your bulk ingest rate seems to be decreasing, it may be because MongoDB is spending significant time flushing dirty content from cache to disk with each checkpoint, so you may want to adjust your cache and eviction settings to compensate.
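Advice like this usually translates into mongod configuration knobs such as the WiredTiger cache size and the checkpoint (sync) interval. A minimal `mongod.conf` sketch with purely illustrative values, not recommendations, follows; tune for your own hardware and workload:

```yaml
# mongod.conf -- illustrative values only
storage:
  syncPeriodSecs: 60          # how often mongod flushes (checkpoints) data to disk
  wiredTiger:
    engineConfig:
      cacheSizeGB: 8          # WiredTiger cache size: more headroom for dirty pages
```

A larger cache gives WiredTiger more room for dirty pages between checkpoints, but the right values depend entirely on available memory and ingest rate.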
The second blog examines increasing batch sizes for reads and writes, in excess of default settings, to speed up bulk operations. The performance gain comes from minimizing the number of round trips between client and database through the use of larger batch sizes. These blogs provide good insight into the performance optimizations users often perform, either by specifying various database settings or by modifying application logic.
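The round-trip arithmetic behind larger batch sizes is easy to sketch in plain Python. The `batched` helper below is illustrative (with pymongo you would instead pass document batches to `insert_many`, or call `batch_size()` on a cursor for reads), and the batch sizes are hypothetical examples:

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive batches, so n documents cost
    ceil(n / batch_size) client-database round trips."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

docs = [{"_id": i} for i in range(2500)]

# At 100 documents per call, 2,500 inserts take 25 round trips;
# raising the batch size to 1,000 cuts that to 3.
assert len(list(batched(docs, 100))) == 25
assert len(list(batched(docs, 1000))) == 3
```

The saving is purely in network round trips and per-request overhead, which is why the gains are most visible for bulk operations rather than point lookups.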
The final recommendation comes from the MongoDB blog itself. As real-time use cases—prevalent in e-commerce, gaming and IoT scenarios—come increasingly into focus, there is “tremendous pressure for applications to immediately react to changes as they occur,” as the authors very nicely put it. The blog introduces MongoDB change streams, a way of implementing change data capture (CDC), where changed data is efficiently tracked and copied to target systems. While CDC is a more established concept with SQL databases, MongoDB makes it easier to set up with change streams, which became available with MongoDB 3.6.
What’s the relationship between change streams and MongoDB performance? Change streams offer an efficient method for isolating reads from writes by offloading read-heavy applications to another system that is kept in sync with MongoDB. This change streams blog and accompanying example proved helpful to us at Rockset as we researched possible approaches to ingesting data from MongoDB. We also explored tailing MongoDB oplogs and using Debezium to copy data from MongoDB, going through Kafka, but ultimately chose to implement the MongoDB-Rockset connector using change streams because of the simplicity and guarantees provided. Some of the change streams capabilities we liked are listed in the Characteristics section of the blog.
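The core loop of a change-stream-based CDC pipeline is simple to sketch. The events below follow the shape of MongoDB change events (`operationType`, `documentKey`, `fullDocument`, `updateDescription`), which in pymongo would come from `collection.watch()`; here the target system is simulated with an in-memory dict, and `apply_change_event` is an illustrative helper, not Rockset's actual connector code:

```python
def apply_change_event(target: dict, event: dict) -> None:
    """Illustrative: keep a target keyed by _id in sync with a
    stream of MongoDB-shaped change events."""
    doc_id = event["documentKey"]["_id"]
    op = event["operationType"]
    if op in ("insert", "replace"):
        target[doc_id] = event["fullDocument"]
    elif op == "update":
        desc = event["updateDescription"]
        target[doc_id].update(desc.get("updatedFields", {}))
        for field in desc.get("removedFields", []):
            target[doc_id].pop(field, None)
    elif op == "delete":
        target.pop(doc_id, None)

target = {}
events = [
    {"operationType": "insert", "documentKey": {"_id": 1},
     "fullDocument": {"_id": 1, "sku": "abc", "qty": 5}},
    {"operationType": "update", "documentKey": {"_id": 1},
     "updateDescription": {"updatedFields": {"qty": 4}, "removedFields": []}},
]
for e in events:
    apply_change_event(target, e)
print(target)  # {1: {'_id': 1, 'sku': 'abc', 'qty': 4}}
```

Because reads are served from the target system, the write-heavy MongoDB primary is shielded from read-heavy application traffic, which is the performance benefit described above.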
Building on top of change streams, we are able to make data queryable in Rockset within seconds of updates in MongoDB. If you are building something similar, do read up on change streams. Or you can leverage the work we’ve already done and use Rockset as a real-time index for MongoDB data. More information on how we made use of change streams can be found here.
MongoDB and Rockset
We at Rockset really enjoyed learning more about MongoDB and how it works for developers. With this knowledge, we built a MongoDB-Rockset integration that seeks to improve the user experience around some of the challenges listed above. We also hope you will find some of these resources and learnings from our user research useful in your own work.
If you would like to try out Rockset alongside MongoDB for real-time indexing, you can sign up for an account here.