Compare and Contrast Search Indexing With Real-Time Converged Indexing
May 27, 2021
Let's compare and contrast search indexing with real-time converged indexing: what converged indexing is, how it's similar to search indexing, how it's different, how the architecture is set up, and how the two differ in terms of operations.
When you talk about serverless and cloud-native systems, there's a huge advantage that we have in the cloud, and we really want to spend some time talking about it, both for initial setup and for day two operations.
Search indexing has been around for a while. As we look at where search indexing started, its roots in text search, and then over time, all the different use cases that it's being used for, we looked at some design goals in terms of designing Rockset and designing converged indexing a little differently.
One of our primary goals at Rockset is to help our customers get better scaling in the cloud. The second one is more flexibility, especially now in the last few years with how data has changed, how the shape of the data coming from many different places tends to be completely different, and how it's being used for very different types of applications. How do we give you more schema-query flexibility? And the last one is around low ops.
As far as speed and scale are concerned, we're looking at new data being queryable within about two seconds at P95, even if you have millions of writes per second coming in. At the same time, we also want to make sure that queries return in milliseconds, even on terabytes of data.
Of course, this is possible today with Elasticsearch. Elastic is used at very high scale. The challenge is that managing data at that scale becomes very, very difficult. So better scaling means enabling this type of scale in the cloud while making it very easy to manage.
For flexibility, we heard feedback loud and clear that you want to be able to do a lot more complex queries. You want to be able to do, for example, standard SQL queries, including JOINs, on whatever your data is, wherever it's coming from. It could be nested JSON coming from MongoDB. It could be Avro coming from Kafka. It could be Parquet coming from S3, or structured data coming from other places. How can you run many types of complex queries on this without having to denormalize your data? That's one of the design goals.
When you build a cloud-native system, you can enable serverless cloud scaling and the vectors we're optimizing for are both hardware efficiency and human efficiency in the cloud.
Memory is very expensive in the cloud. Managing clusters and scaling up and down is painful when you have a lot of bursty workloads. How can we handle all of that more simply in the cloud?
Let's take a deep dive into what really is the difference between the two indexing technologies.
Elasticsearch has an inverted index and it also has doc value storage built using Apache Lucene. Lucene has been around for a while. It's open source and many are intimately familiar with it. It was originally built for text search and log analytics and this is something at which it really shines. It also means that you have to denormalize your data as you put your data in and you get very fast search and aggregation queries.
You can think of converged indexing as a next generation of indexing. Converged indexing combines the search index (the inverted index) with a row-based index and a column store. All of this is built on top of a key-value abstraction, specifically RocksDB, rather than Lucene.
Because of the flexibility and scale that it gives you, it lends itself really well to real-time analytics and real-time applications. You don't need to denormalize your data. You are able to execute really fast search, aggregation, time-based queries (because you now have built a time index), geo-queries (because you have a geo-index), and your JOINs are also possible and really fast.
Converged Index Under the Hood
We talked about having your columnar, inverted and row indexes in the same system. Think of it as your ingested document being shredded and mapped to many keys and values, which are then stored in the key-value store.
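To make the shredding idea concrete, here is a minimal sketch in Python. It is not Rockset's actual storage format: the key layouts, the `R`/`C`/`S` prefixes, the `shred` function and the sample document are all assumptions made up for illustration. It only shows how one flat document could fan out into row, column and search entries in a single key-value map.

```python
from typing import Any, Dict, Tuple

Key = Tuple[str, ...]


def shred(doc_id: str, doc: Dict[str, Any]) -> Dict[Key, Any]:
    """Map one flat document to key-value entries for three index layouts.

    The layouts and prefixes are hypothetical, chosen only for illustration.
    """
    entries: Dict[Key, Any] = {}
    for field, value in doc.items():
        # Row index: key leads with the document id, so every field of one
        # document sorts next to its siblings.
        entries[("R", doc_id, field)] = value
        # Column index: key leads with the field name, so every value of one
        # field sorts together across documents.
        entries[("C", field, doc_id)] = value
        # Search (inverted) index: key leads with field and value, so every
        # document matching that value sorts together. No payload is needed.
        entries[("S", field, str(value), doc_id)] = None
    return entries


# Hypothetical sample document.
entries = shred("doc1", {"name": "alice", "city": "paris"})
for key in sorted(entries):
    print(key, entries[key])
```

Each field produces three entries, so the two-field document above becomes six key-value pairs. Sorting the keys shows how each prefix groups the same data for a different access pattern.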
RocksDB is an embedded key-value store. If you're not familiar with RocksDB, I'll give you a one-second overview: our team built RocksDB back at Facebook and open sourced it. Today you will find RocksDB used in Apache Kafka, in Flink and in CockroachDB. Many modern cloud-scale distributed systems use RocksDB.
Rockset uses RocksDB under the hood, and it's a very different representation than what is done in Elasticsearch. One of the big differences here is that because you have these three different types of indexes, we can now have a SQL optimizer that decides in real time which index is best for each query, and then returns your results really fast by using it.
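As a rough illustration of what "picking the right index" can mean, here is a toy chooser in Python. This is not Rockset's optimizer. The `choose_index` function, the selectivity estimate and the 5% threshold are assumptions invented for this sketch. The idea it shows is simply that a filter matching very few documents favors the inverted index, while a query touching a large share of documents favors scanning the column store.

```python
def choose_index(selectivity: float, threshold: float = 0.05) -> str:
    """Pick an index for a filtered query (hypothetical rule of thumb).

    selectivity: estimated fraction of documents matching the filter,
                 between 0.0 (nothing matches) and 1.0 (everything matches).
    threshold:   made-up cutoff below which a lookup beats a scan.
    """
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must be between 0.0 and 1.0")
    # Few matches: jump straight to them through the inverted index.
    if selectivity <= threshold:
        return "inverted"
    # Many matches: scanning the relevant columns is cheaper than
    # following a huge number of individual index entries.
    return "column"


print(choose_index(0.001))  # highly selective point lookup
print(choose_index(0.6))    # broad aggregation over most documents
```

A real optimizer would weigh far more than one number, but the trade-off between index lookups and column scans is the part this sketch tries to capture.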
Because this is a key-value store, the other advantage you have is that each and every field is mutable. What does this mutability give you as you scale? You don't ever have to worry about re-indexing. If you're using database change streams, for example, you don't have to worry about what happens when you have a lot of updates, deletes and inserts in your change data capture, or about how those are handled in your index. Every individual field being mutable is very powerful as you start scaling your system and your indexes become massive.
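Here is a small sketch of what field-level mutability looks like on top of a key-value map. Again, this is an illustration under assumptions, not Rockset's implementation: the key layouts, the `field_keys` and `update_field` helpers and the sample data are hypothetical. The point is that changing one field only rewrites the handful of keys derived from that field, and the rest of the document is untouched.

```python
from typing import Any, Dict, Tuple

Key = Tuple[str, ...]


def field_keys(doc_id: str, field: str, value: Any) -> Dict[Key, Any]:
    """Entries derived from one field (hypothetical row/column/search layout)."""
    return {
        ("R", doc_id, field): value,              # row entry
        ("C", field, doc_id): value,              # column entry
        ("S", field, str(value), doc_id): None,   # search entry
    }


def update_field(store: Dict[Key, Any], doc_id: str, field: str, new_value: Any) -> None:
    """Change one field in place without touching the document's other fields."""
    row_key = ("R", doc_id, field)
    if row_key in store:
        # Remove the entries derived from the old value, including the
        # stale search entry that would otherwise still match.
        for key in field_keys(doc_id, field, store[row_key]):
            del store[key]
    # Write the entries derived from the new value.
    store.update(field_keys(doc_id, field, new_value))


# Hypothetical document with two fields, then an update to one of them.
store: Dict[Key, Any] = {}
for field, value in {"name": "alice", "city": "paris"}.items():
    store.update(field_keys("doc1", field, value))

update_field(store, "doc1", "city", "rome")
print(store[("R", "doc1", "city")])
print(("S", "city", "paris", "doc1") in store)
```

After the update, the `name` entries are exactly as they were, the `city` entries point at the new value, and no other document or field had to be rewritten.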
Whatnot switched from Elasticsearch to Rockset for real-time personalization because of the challenges managing updates, inserts and deletes in Elasticsearch. For every update, they had to manually test every component of their data pipeline to ensure there were no bottlenecks or data errors.
Learn about additional differences between Elasticsearch and Rockset in this technical comparison whitepaper.