Five Ways to Run Analytics on MongoDB – Their Pros and Cons
February 2, 2022
MongoDB is a top database choice for application development. Developers choose this database because of its flexible data model and its inherent scalability as a NoSQL database. These features enable development teams to iterate and pivot quickly and efficiently.
MongoDB wasn’t originally developed with an eye on high performance for analytics. Yet, analytics is now a vital part of modern data applications. Developers have formed ingenious solutions for real-time analytical queries on data stored in MongoDB, using in-house solutions or third-party products.
Let’s explore five ways to run MongoDB analytics, along with the pros and cons of each method.
1 – Query MongoDB Directly
The first and most direct approach is to run your analytical queries directly against MongoDB. This option requires no extra tooling, so you can develop both operational and analytical applications directly on MongoDB.
There are many reasons this isn’t most developers’ favored approach, though.
First, depending on the size and nature of your queries, you may have to spin up replicas to avoid the required computations interfering with your application’s workload. This can be a costly and technically challenging approach, requiring effort to configure and maintain.There is also a possibility the data queried from replicas isn’t the latest due to replication lags.
Second, you’ll likely spend additional time adding and tuning your MongoDB indexes to make your analytics queries more efficient. And even if you put in the effort to define indexes on your collection, they will only be effective for known query patterns.
Third, there are no relational joins available in MongoDB. This means that enriching your queries with data from multiple collections can be both time consuming and unwieldy. Options for joining data in MongoDB include denormalization or use of the
$lookup operator, but both are less flexible and powerful than a relational join.
2 – Use a Data Virtualization Tool
The next approach is to use a data virtualization tool. There are quite a few of these on the market, with each trying to enable business intelligence (BI) on MongoDB. Microsoft bundles PolyBase with SQL Server, and it can use MongoDB as an external data source. Other vendors, such as Dremio and Knowi, offer data virtualization products that connect to MongoDB. Virtualizing the data with this kind of tool enables analytics without physically replicating the data.
This approach’s obvious benefit is that you don’t have to move the data, so you can generally be up and running quickly.
Data virtualization options are primarily geared toward making BI on MongoDB easier and are less suited for delivering the low latency and high concurrency many data applications require. These solutions will often push down queries to MongoDB, so you will face the same limitations of using MongoDB for analytics without strong isolation between analytical and operational workloads.
3 – Use a Data Warehouse
Next, you can replicate your data to a data warehouse. There are some big players here like Redshift from AWS, Snowflake, and Google BigQuery.
The benefit of these tools is that they’re built specifically for data analytics. They support joins and their column orientation allows you to quickly and effectively carry out aggregations. Data warehouses scale well and are well-suited to BI and advanced analytics use cases.
The downsides of data warehouses are data and query latency. The original data rarely replicates from the primary data source in real time, as data warehouses are not designed for real-time updates. The lag is typically in the tens of minutes to hours, depending on your setup. Data warehouses have a heavy reliance on scans, which increases query latency. These limitations make data warehouses less suitable options for serving real-time analytics.
Lastly, for effective management, you need to create and maintain data pipelines to reshape the data for these warehouses. These pipelines require additional work from your team, and the added complexity can make your processes more brittle.
4 – Use a SQL Database
If your data requirements aren’t quite large enough to justify a data warehouse solution, maybe you can replicate it to a relational SQL database in-house. This excellent article, Offload Real-Time Reporting and Analytics from MongoDB Using PostgreSQL, can get you started.
You won’t have much trouble finding staff who are comfortable constructing SQL queries, which is a clear upside to this approach. SQL databases, like MySQL and Postgres, are capable of fast updates and queries. These databases can serve real-time data applications, unlike the data warehouses we considered previously.
Note, though, that this method does still require data engineering to reshape the MongoDB data for a relational database to ingest and consume. This extra layer of complexity adds more points of failure to your process.
Additionally, this approach doesn’t scale well. Most SQL implementations aren’t designed to be distributed, unlike their NoSQL counterparts. Vertically scaling can be expensive and, after a certain point, prohibitive to your time, your costs, and your technology.
5 – Use a NoSQL Data Store Optimized for Analytics
Lastly, you can replicate your data to another NoSQL data store optimized for analytics. Notable here is Elasticsearch, built on top of Apache Lucene.
The main benefit of this kind of approach is that there’s no need to transform data into a relational structure. Additionally, Elasticsearch leverages its indexing to provide the fast analytics that modern data applications require.
The drawback of the MongoDB-to-Elasticsearch approach is that Elasticsearch has its own query language, so you won’t be able to benefit from using SQL for analytics or perform joins effectively. And while you may not need to perform heavy transformation on the MongoDB data, you are still responsible for providing a way to sync data from MongoDB to Elasticsearch.
An Alternative That Combines the Benefits of NoSQL and SQL
There’s one more option to run analytics on MongoDB: Rockset. Rockset provides real-time analytics on MongoDB using full-featured SQL, including joins. While some of the options we mentioned previously would be well-suited for BI use cases with less stringent data and query latency requirements, Rockset enables you to run low-latency SQL queries on data generated seconds before.
Rockset has a built-in MongoDB connector that uses MongoDB CDC (change data capture), delivered via MongoDB change streams, to allow Rockset to receive changes to MongoDB collections as they happen. Updating using change streams ensures the latest data is available for analytics in Rockset.
We’ve examined a range of solutions to undertake analytics against your data in MongoDB. These approaches range from performing analytics directly in MongoDB with the help of indexing and replication, to moving MongoDB data to a data store better equipped for analytics.
These MongoDB analytics methods all have their advantages and disadvantages, and should be weighed in light of the use case to be served. For an in-depth look at how to implement each of these alternatives, and how to evaluate which is right for you, check out Real-Time Analytics on MongoDB: The Ultimate Guide.
Rockset is the real-time analytics database in the cloud for modern data teams. Get faster analytics on fresher data, at lower costs, by exploiting indexing over brute-force scanning.