Real-Time External Indexing For Aggregations and Joins on MongoDB Collections
February 28, 2020
Tech Preview
TL;DR Join the Tech Deep Dive to learn how Rockset works with MongoDB!
This is a tech preview of the MongoDB integration with Rockset to support millisecond-latency SQL queries such as joins and aggregations in real-time. Rockset builds fully mutable external indexes on any fields, including deeply nested fields in JSON documents, from your MongoDB collections. It uses your MongoDB Change Streams to stay in sync with inserts, updates and deletes, so that new data is queryable in ~2 seconds. By default, Change Streams only return the delta of fields during the update operation so this means there is minimal impact to your production database performance.
MongoDB is a document database, which means it stores data in JSON-like documents. This is one of the most natural ways to think about data, and is much more powerful than the traditional row/column model for developers who need agility. Typically, as your use of MongoDB as your primary transactional database grows, there are more data services being built around it inside your organization, and some of these services would greatly benefit from having the same data available for aggregations and joins via fast declarative SQL queries in real-time.
Rockset is a real-time database in the cloud that is used for building event-driven applications, stateful microservices and real-time data services. You can think of it as a selective read replica which allows you to continuously index any fields, including deeply nested fields from your MongoDB JSON documents in an external Converged Index™, which is a combination of inverted, row and columnar index. It is a mutable index which is important because unlike typical event streams, your database change streams not only have inserts but also high rate of updates and deletes. Rockset's data model matches MongoDB's JSON document data model and has strong support for arrays, objects and mixed types. Rockset exposes a RESTful API based SQL interface for fast, powerful filtering, aggregations, and joins, in real-time. It auto-scales compute and memory in the cloud, based on the size of your data. It is not a transactional data store.
Who should use it
The MongoDB integration with Rockset allows you to load data from MongoDB into the Rockset Converged Index.
- You are building real-time data services around MongoDB that could benefit from aggregations, joins, predicates on non-indexed fields
- You have custom ETL scripts to replicate between MongoDB and other systems for access but you know that ETL pipelines are fragile and introduce too much data latency
How it works
Steps:
-
In your MongoDB Atlas account:
- Create a new read-only user in MongoDB
- Copy the connection string for the MongoDB cluster you need (sharded clusters are fully supported)
- Note: if your Mongo instance is not running in Atlas you will need to write a small python script that forwards your Change Stream to Rockset
-
In your Rockset account:
- Create a Mongo integration by entering the info from step 1 & 2
- Create a Rockset collection by specifying the Mongo collection to be indexed in Rockset
- Optionally apply ingest-time transformations such as type coercion, field masking or search tokenization
-
Rockset will first do a fast bulk load of your existing data and then continuously tail your Change Stream to stay in sync with inserts, updates and deletes
- Start exploring your collections in SQL table format in real-time
- Run fast, powerful SQL queries, including JOINS with other databases or event streams
- Use RESTful APIs or Python, Java, Node.js, Go client libraries or JDBC connector for querying
Converged Indexing
Rockset is a real-time database in the cloud, built by the team behind RocksDB. It automatically syncs the selected fields and builds a fully mutable Converged Index that combines the power of columnar, row and inverted indexes.
- Converged Indexing requires more space on disk, but as a result complex queries are faster. In simple terms, we trade off storage for CPU. However, more importantly, we trade off hardware for human time. Humans no longer need to configure indexes or write custom client-side logic and humans no longer need to wait on slow queries.
- As any experienced database user knows, as you add more indexes, writes become heavier. A single document update now needs to update many indexes, causing many random database writes. In traditional storage based on B-trees, random writes to database translate to random writes on storage. At Rockset, we use LSM trees instead of B-trees. LSM trees are optimized for writes because they turn random writes to database into sequential writes on storage. We use RocksDB's LSM tree implementation and we have internally benchmarked hundreds of MB per second writes in a distributed setting
So we have all these indexes, but how do we pick the best one for our query? We built a custom SQL query optimizer that analyzes every query and decides on the execution plan.
Tech Deep Dive
Sign up here to participate in the MongoDB - Rockset tech deep dive. You will learn more about how it works, shape the product by sharing your feedback directly with the engineering team, swap best practices with fellow users, learn and have fun along the way.
Happy Querying!
Other MongoDB resources: