This page covers how to use a MongoDB collection as a data source in Rockset. This includes:
- Create a MongoDB integration to securely connect collections in your MongoDB Atlas account or self-managed MongoDB cluster with Rockset.
- Create a collection that continuously syncs your data from a MongoDB collection into Rockset in real time.
Create a MongoDB Integration
A MongoDB integration can be created based on where your MongoDB cluster is located:
Create a Collection
Once you create a collection backed by MongoDB, Rockset first scans the MongoDB collection to ingest its existing data, and then uses MongoDB Change Streams to keep the Rockset collection up to date as records are added to, updated in, or deleted from the MongoDB collection.
If your MongoDB collection is a capped collection, MongoDB change streams do not emit delete events for the old documents that are evicted, so the Rockset collection can go out of sync. To handle this, we recommend setting a retention policy on the Rockset collection when you create it.
You can create a collection from a MongoDB source in the Collections tab of the Rockset Console.
How it works
When a MongoDB-backed collection is created, indexing in Rockset occurs in two stages:
- A one-time full scan of the MongoDB collection in which all records are indexed and stored in the Rockset collection.
- Following that, continuous monitoring and sync of changes from the MongoDB collection (inserts, deletes and updates) to the Rockset collection in real-time using MongoDB Change Streams.
Once a MongoDB-backed collection is set up, it is a replica of the MongoDB collection that stays up to date to within a few seconds.
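The two stages above can be illustrated with a minimal in-memory sketch. This is not Rockset's implementation: the function names `initial_scan` and `apply_change_event` are hypothetical, the replica is a plain dictionary, and only top-level fields are handled. The event fields used (`operationType`, `documentKey`, `fullDocument`, `updateDescription`) follow the shape of MongoDB change events.

```python
from typing import Any, Dict, Iterable

Document = Dict[str, Any]
Replica = Dict[Any, Document]


def initial_scan(source_docs: Iterable[Document]) -> Replica:
    """Stage 1: copy every existing source document into the replica, keyed by _id."""
    return {doc["_id"]: dict(doc) for doc in source_docs}


def apply_change_event(replica: Replica, event: Document) -> None:
    """Stage 2: apply one change event (insert, replace, update, delete) to the replica."""
    op = event["operationType"]
    doc_id = event["documentKey"]["_id"]

    if op in ("insert", "replace"):
        # Inserts and replaces carry the whole document.
        replica[doc_id] = dict(event["fullDocument"])
    elif op == "update":
        # Updates carry only a delta. Assumption: top-level fields only;
        # dotted paths for nested fields are not handled in this sketch.
        desc = event["updateDescription"]
        doc = replica.setdefault(doc_id, {"_id": doc_id})
        doc.update(desc.get("updatedFields", {}))
        for field in desc.get("removedFields", []):
            doc.pop(field, None)
    elif op == "delete":
        replica.pop(doc_id, None)
    # Other event types (for example "invalidate") are ignored here.
```

Applying events in the order they were emitted matters: the replica only matches the source if every event since the initial scan is processed, which is why the oplog guidance later on this page is relevant.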
MongoDB Best Practices
When the MongoDB database is under heavy load, Rockset reads updates from it more slowly. Below are some best practices for connecting MongoDB as a source with Rockset:
- Start bulk ingest when your MongoDB database is under light load
- This allows Rockset to do the one-time full scan of MongoDB without any read throttling
- Increase the read throughput on the MongoDB cluster for bulk ingest
- Use common techniques to increase read performance for the initial scan. See some recommended techniques in a blog from our solution engineering team
- Prefer using a read replica to connect as a source with Rockset. Refer to this for details on how you can set up a connection string with a
- Rockset uses majority read concern. "majority" guarantees that the data read has been acknowledged by a majority of the replica set members (i.e. the documents read are durable and guaranteed not to roll back).
- Increase the op-log size
- See MongoDB recommendation for workloads that might require a larger oplog size
- If the source MongoDB collection has a high rate of write and update operations, we recommend increasing the oplog size
- MongoDB recommends that the oplog size for a cluster be large enough to sustain a 24-hour Replication Oplog Window. For example, if you generate 1 GB of oplog per hour on average, the recommended oplog size is 24 GB.
- Set up alerts on your MongoDB project to trigger if the oplog churn (GB/hour) exceeds a specified threshold.
- Monitor streaming ingest metrics in Rockset
- If your org’s virtual instance is nearing its peak streaming ingest rate, consider increasing its size to avoid an increase in data latency and slow queries
- Once the streaming ingest rate drops, you can scale the virtual instance back down to control cost
- Using the metrics endpoint you can set alerts with your preferred monitoring tool
- If the ingest keeps getting rate limited for a prolonged period of time, depending on your oplog size and churn rate, Rockset might not be able to catch up with all the updates coming from MongoDB, and the collection will enter an unrecoverable error state that will require re-creating it.
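The oplog sizing guidance above comes down to simple arithmetic, sketched below. The function names are hypothetical, and the catch-up check is a simplifying assumption for illustration: it treats the sync as unrecoverable once ingest lag exceeds the time window the oplog can hold, and ignores bursts in churn.

```python
def recommended_oplog_gb(churn_gb_per_hour: float, window_hours: float = 24.0) -> float:
    """Oplog size needed to hold `window_hours` of changes at the given average churn."""
    return churn_gb_per_hour * window_hours


def oplog_window_hours(oplog_size_gb: float, churn_gb_per_hour: float) -> float:
    """How many hours of changes the oplog can hold before old entries are overwritten."""
    if churn_gb_per_hour <= 0:
        return float("inf")  # no churn: entries are never overwritten
    return oplog_size_gb / churn_gb_per_hour


def can_catch_up(oplog_size_gb: float, churn_gb_per_hour: float, ingest_lag_hours: float) -> bool:
    """Assumption: ingest can resume only while its lag is still inside the oplog window."""
    return ingest_lag_hours < oplog_window_hours(oplog_size_gb, churn_gb_per_hour)


# The example from the text: 1 GB of oplog per hour over a 24-hour window.
print(recommended_oplog_gb(1.0))  # 24.0
```

For instance, a 24 GB oplog with churn that has doubled to 2 GB/hour holds only 12 hours of changes, so a collection whose ingest has been rate limited for 13 hours would fall outside the window under this model.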