How To Join Data in MongoDB

February 10, 2022

,
Register for
Index Conference

Hear talks on search and AI from engineers at Netflix, DoorDash, Uber and more.

MongoDB is one of the most popular databases for modern applications. It enables a more flexible approach to data modeling than traditional SQL databases. Developers can build applications more quickly because of this flexibility and also have multiple deployment options, from the cloud MongoDB Atlas offering through to the open-source Community Edition.

MongoDB stores each record as a document with fields. These fields can have a range of flexible types and can even have other documents as values. Each document is part of a collection — think of a table if you’re coming from a relational paradigm. When you’re trying to create a document in a group that doesn’t exist yet, MongoDB creates it on the fly. There’s no need to create a collection and prepare a schema before you add data to it.

MongoDB provides the MongoDB Query Language for performing operations in the database. When retrieving data from a collection of documents, we can search by field, apply filters and sort results in all the ways we’d expect. Plus, most languages have native object-relational mapping, such as Mongoose in JavaScript and Mongoid in Ruby.

Adding relevant information from other collections to the returned data isn’t always fast or intuitive. Imagine we have two collections: a collection of users and a collection of products. We want to retrieve a list of all the users and show a list of the products they have each bought. We’d want to do this in a single query to simplify the code and reduce data transactions between the client and the database.

We’d do this with a left outer join of the Users and Products tables in a SQL database. However, MongoDB isn’t a SQL database. Still, this doesn’t mean that it’s impossible to perform data joins — they just look slightly different than SQL databases. In this article, we’ll review strategies we can use to join data in MongoDB.

Joining Data in MongoDB

Let’s begin by discussing how we can join data in MongoDB. There are two ways to perform joins: using the $lookup operator and denormalization. Later in this article, we’ll also look at some alternatives to performing data joins.

Using the $lookup Operator

Beginning with MongoDB version 3.2, the database query language includes the $lookup operator. MongoDB lookups occur as a stage in an aggregation pipeline. This operator allows us to join two collections that are in the same database. It effectively adds another stage to the data retrieval process, creating a new array field whose elements are the matching documents from the joined collection. Let’s see what it looks like:

Beginning with MongoDB version 3.2, the database query language includes the $lookup operator. MongoDB lookups occur as a stage in an aggregation pipeline. This operator allows us to join two collections that are in the same database. It effectively adds another stage to the data retrieval process, creating a new array field whose elements are the matching documents from the joined collection. Let’s see what it looks like:

db.users.aggregate([{$lookup: 
    {
     from: "products", 
     localField: "product_id", 
     foreignField: "_id", 
     as: "products"
    }
}])

You can see that we’ve used the $lookup operator in an aggregate call to the user’s collection. The operator takes an options object that has typical values for anyone who has worked with SQL databases. So, from is the name of the collection that must be in the same database, and localField is the field we compare to the foreignField in the target database. Once we’ve got all matching products, we add them to an array named by the property.

This approach is equivalent to an SQL query that might look like this, using a subquery:

SELECT *, products
FROM users
WHERE products in (
  SELECT *
  FROM products
  WHERE id = users.product_id
);

Or like this, using a left join:

SELECT *
FROM users
LEFT JOIN products
ON user.product_id = products._id

While this operation can often meet our needs, the $lookup operator introduces some disadvantages. Firstly, it matters at what stage of our query we use $lookup. It can be challenging to construct more complex sorts, filters or combinations on our data in the later stages of a multi-stage aggregation pipeline. Secondly, $lookup is a relatively slow operation, increasing our query time. While we’re only sending a single query internally, MongoDB performs multiple queries to fulfill our request.

Using Denormalization in MongoDB

As an alternative to using the $lookup operator, we can denormalize our data. This approach is advantageous if we often carry out multiple joins for the same query. Denormalization is common in SQL databases. For example, we can create an adjacent table to store our joined data in a SQL database.

Denormalization is similar in MongoDB, with one notable difference. Rather than storing this data as a flat table, we can have nested documents representing the results of all our joins. This approach takes advantage of the flexibility of MongoDB’s rich documents. And, we’re free to store the data in whatever way makes sense for our application.

For example, imagine we have separate MongoDB collections for products, orders, and customers. Documents in these collections might look like this:

Product

{
    "_id": 3,
    "name": "45' Yacht",
    "price": "250000",
    "description": "A luxurious oceangoing yacht."
}

Customer

{
    "_id": 47,
    "name": "John Q. Millionaire",
    "address": "1947 Mt. Olympus Dr.",
    "city": "Los Angeles",
    "state": "CA",
    "zip": "90046"
}

Order

{
    "_id": 49854,
    "product_id": 3,
    "customer_id": 47,
    "quantity": 3,
    "notes": "Three 45' Yachts for John Q. Millionaire. One for the east coast, one for the    west coast, one for the Mediterranean".
}

If we denormalize these documents so we can retrieve all the data with a single query, our order document looks like this:

{
    "_id": 49854,
    "product": {
        "name": "45' Yacht",
        "price": "250000",
        "description": "A luxurious oceangoing yacht."
    },
    "customer": {
        "name": "John Q. Millionaire",
        "address": "1947 Mt. Olympus Dr.",
        "city": "Los Angeles",
        "state": "CA",
        "zip": "90046"
    },
    "quantity": 3,
    "notes": "Three 45' Yachts for John Q. Millionaire. One for the east coast, one for the west coast, one for the Mediterranean".
}

This method works in practice because, during data writing, we store all the data we need in the top-level document. In this case, we’ve merged product and customer data into the order document. When we query the information now, we get it straight away. We don’t need any secondary or tertiary queries to retrieve our data. This approach increases the speed and efficiency of the data read operations. The trade-off is that it requires additional upfront processing and increases the time taken for each write operation.

Copies of the product and every user who buys that product present an additional challenge. For a small application, this level of data duplication isn’t likely to be a problem. For a business-to-business e-commerce app, which has thousands of orders for each customer, this data duplication can quickly become costly in time and storage.

Those nested documents aren’t relationally linked, either. If there’s a change to a product, we need to search for and update every product instance. This effectively means we must check each document in the collection since we won’t know ahead of time whether or not the change will affect it.

Alternatives to Joins in MongoDB

Ultimately, SQL databases handle joins better than MongoDB. If we find ourselves often reaching for $lookup or a denormalized dataset, we might wonder if we’re using the right tool for the job. Is there a different way to leverage MongoDB for our application? Is there a way of achieving joins that might serve our needs better?

Rather than abandoning MongoDB altogether, we could look for an alternative solution. One possibility is to use a secondary indexing solution that syncs with MongoDB and is optimized for analytics. For example, we can use Rockset, a real-time analytics database, to ingest directly from MongoDB change streams, which enables us to query our data with familiar SQL search, aggregation and join queries.

Conclusion

We have a range of options for creating an enriched dataset by joining relevant elements from multiple collections. The first method is the $lookup operator. This reliable tool allows us to do the equivalent of left joins on our MongoDB data. Or, we can prepare a denormalized collection that allows fast retrieval of the queries we require. As an alternative to these options, we can employ Rockset’s SQL analytics capabilities on data in MongoDB, regardless of how it’s structured.

If you haven’t tried Rockset’s real-time analytics capabilities yet, why not have a go? Jump over to the documentation and learn more about how you can use Rockset with MongoDB.


Rockset is the real-time analytics database in the cloud for modern data teams. Get faster analytics on fresher data, at lower costs, by exploiting indexing over brute-force scanning.