How Converged Indexing Works
About this video
Hello, everybody. I'm Igor, a founding engineer at Rockset, and today I want to talk about converged indexing, which is a cool feature we have at Rockset that makes our database very easy to use. The basic idea behind converged indexing is to index all fields in all the documents you store at Rockset, and to combine columnar storage and the search index in the same system.
Before we go into more details, let me share some background on columnar storage and the search index and how they usually work. In columnar storage, the goal is to store each column separately, which gives us great compression because data that looks similar is stored close together. It also means that when executing a query we scan a lot of data and do very simple operations on it, which yields very efficient vectorized processing on that data, so our queries are much faster.
As an example, on the left-hand side you can see two documents, document zero and document one, where each of them has three columns: name, interests, and last active. On the right-hand side, you can see what the columnar storage of those two documents would look like. Values for the name column are stored close together: we have a list of document IDs and, for each document ID, the value of that column. Same thing for interests and same thing for last active.
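As a rough illustration of that layout (a sketch, not Rockset's actual code), the snippet below regroups two documents by column. The function name `to_columnar` is hypothetical, and the values for interests and last active are made-up placeholders, since the talk does not spell them out.

```python
# Two example documents keyed by document ID.
# The interests and last_active values are placeholders (assumptions).
docs = {
    0: {"name": "Igor", "interests": ["databases"], "last_active": "2019-01-01"},
    1: {"name": "Druba", "interests": ["databases"], "last_active": "2019-01-02"},
}


def to_columnar(docs):
    """Group values by column: column name -> list of (doc_id, value)."""
    columns = {}
    for doc_id, doc in sorted(docs.items()):
        for column, value in doc.items():
            columns.setdefault(column, []).append((doc_id, value))
    return columns


columns = to_columnar(docs)

# All values of one column now sit next to each other,
# so a scan over "name" never touches the other columns.
for doc_id, value in columns["name"]:
    print(doc_id, value)
```

Because each list holds values of the same kind, it compresses well and can be scanned in a tight loop.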
In a search index, on the other hand, we store the map between a value and the list of document IDs that contain that value. What this allows us to do on the query side is to quickly retrieve the list of document IDs that match a particular predicate. Back to our old example with two documents: on the right-hand side, you can see what the search index would look like on those two documents. We still have them separated by column, but now instead of a document ID mapping to a value, we have a value mapping to document IDs. So we store Druba, which is the value, and we say name is equal to Druba in document ID one. Same thing for interests, and then also for last active, we store the mapping from value to document ID.
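A minimal sketch of such a value-to-document-ID map is shown below. It is an illustration only: `build_search_index` is a hypothetical name, the interests and last active values are placeholders, and treating each element of a list as its own searchable value is an assumption.

```python
from collections import defaultdict

# Same two example documents; interests and last_active are placeholders.
docs = {
    0: {"name": "Igor", "interests": ["databases"], "last_active": "2019-01-01"},
    1: {"name": "Druba", "interests": ["databases"], "last_active": "2019-01-02"},
}


def build_search_index(docs):
    """Build column -> value -> sorted list of document IDs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, doc in sorted(docs.items()):
        for column, value in doc.items():
            # Assumption: each element of a list is indexed separately.
            values = value if isinstance(value, list) else [value]
            for v in values:
                index[column][v].append(doc_id)
    return index


index = build_search_index(docs)

# Predicate "name = 'Druba'" becomes a direct lookup.
print(index["name"]["Druba"])
```

Answering a predicate is then a lookup rather than a scan over every document.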
And then finally, converged indexing is the combination of those two systems, the columnar store and the search index. We built converged indexing on top of a key-value store abstraction. We use RocksDB, but any key-value store will do; the only important thing is that it keeps the data sorted by key. Each document that we store in converged indexing then maps to many key-value pairs in the key-value store.
So let's see how it works. Let's say we have two documents here, a little bit simplified this time. We have document zero, where name is equal to Igor, and document one, where name is equal to Druba, and on the right-hand side you can see which keys we would have in our key-value store for those two documents. First of all, you can see the first two key-value pairs are from the row store, and the tricks we play here are all in how we construct the key of the key-value pair.
Here you can see the first component of the key in the row store is the document ID and the second component is the column name. What this allows us to do is store all values for a particular document close together, as you would in a row store. On the query side, the row store gives us very fast point lookup latencies. In the column store, we flip the key components. Instead of having the first key component be the document ID, in the column store the first key component is the column name and the second key component is the document ID. As you can see, this means we store all values for a particular column close together, which gives us fast scan times and also gives us better compression.
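The two key layouts can be sketched with a sorted list standing in for the key-value store. This is a toy model, not Rockset's encoding: the "R" and "C" prefixes, the tuple-shaped keys, and the helper names `row_key`, `column_key`, and `prefix_scan` are all assumptions made for illustration.

```python
import bisect

docs = {0: {"name": "Igor"}, 1: {"name": "Druba"}}


def row_key(doc_id, column):
    # Row store: document ID first, then column name.
    return ("R", doc_id, column)


def column_key(column, doc_id):
    # Column store: column name first, then document ID.
    return ("C", column, doc_id)


# A sorted list of (key, value) pairs stands in for the sorted key-value store.
entries = []
for doc_id, doc in docs.items():
    for column, value in doc.items():
        entries.append((row_key(doc_id, column), value))
        entries.append((column_key(column, doc_id), value))
entries.sort()


def prefix_scan(entries, prefix):
    """Return all (key, value) pairs whose key starts with the given prefix."""
    keys = [key for key, _ in entries]
    start = bisect.bisect_left(keys, prefix)
    result = []
    for key, value in entries[start:]:
        if key[:len(prefix)] != prefix:
            break
        result.append((key, value))
    return result


# Point lookup: every field of document 1 is adjacent in the row store.
doc_one = prefix_scan(entries, ("R", 1))
# Column scan: every value of "name" is adjacent in the column store.
names = prefix_scan(entries, ("C", "name"))
```

Because the store is sorted, both access patterns reduce to one seek followed by a sequential read; only the order of the key components differs.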
And then finally, for the search index, we actually put the value inside of the key and we store the document ID as a suffix. So the first component is obviously the column name, the second component is the column value, and the last key component is the document ID. For example, if you're looking for all documents where name is equal to Druba, we would find all keys in our key-value store with the prefix S.name.Druba, which is a fast operation that we can do in RocksDB.
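The prefix lookup described above can be sketched as follows. The dot-separated string keys mirror the S.name.Druba prefix from the talk, but the exact encoding is an assumption: a real store would use a binary format that handles separators inside values and orders numeric document IDs correctly. The helper names `search_key` and `doc_ids_matching` are hypothetical.

```python
import bisect

docs = {0: {"name": "Igor"}, 1: {"name": "Druba"}}


def search_key(column, value, doc_id):
    # Search index: column name, then value, then document ID as the suffix.
    return f"S.{column}.{value}.{doc_id}"


# A sorted list of keys stands in for the sorted key-value store.
keys = sorted(
    search_key(column, value, doc_id)
    for doc_id, doc in docs.items()
    for column, value in doc.items()
)


def doc_ids_matching(keys, column, value):
    """Find document IDs where column == value via a prefix scan."""
    prefix = f"S.{column}.{value}."
    start = bisect.bisect_left(keys, prefix)  # seek to the first candidate
    ids = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break  # left the prefix range, so we can stop
        ids.append(int(key[len(prefix):]))
    return ids


matches = doc_ids_matching(keys, "name", "Druba")
```

The scan reads only the keys that share the prefix, so its cost depends on the number of matching documents rather than on the size of the collection.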
So finally, converged indexing allows us to do both fast analytical queries and fast search queries in the same system, and we have also built an optimizer that picks which of these indexes to use for a given query. We will have another session like this that talks a bit more about our optimizer and how we built it. Thank you for your attention.