Case Study: Sequoia Capital — Why We Moved from Elasticsearch to Rockset
April 5, 2021
Sequoia Capital is a venture capital firm that invests in a broad range of consumer and enterprise start-ups. To keep up with the data on potential investment opportunities, the firm built a suite of internal data applications several years ago to support its investment teams. More recently, it moved these internal apps from Elasticsearch to Rockset. We spoke with Sequoia’s head of engineering, Jake Quist, and VP of data science, Hem Wadhar, about the reasons for the move.
Tell us about the internal tools you build and manage at Sequoia
Sequoia uses a combination of internal and external data to inform our decision-making process. We have investment professionals and data scientists, and we want our users to be able to get the data they need for their work.
Over time, we’ve built a number of internal apps to surface data to our users. We started with a handful of users, and now half our firm uses our apps in some form. Half of our apps require transactional consistency, so they use Postgres or DynamoDB. The other half, about 15 tools, use Rockset for search and analytics. We originally built them on Elasticsearch but migrated to Rockset a year ago. We also use Retool for the front end of our apps.
Why did you move search and analytics from Elasticsearch to Rockset?
There are two main reasons we preferred Rockset to Elasticsearch for the analytical apps we were building: the ability to use SQL and shorter indexing times.
Rockset lets us write SQL against our data. SQL is a better fit for what we do, which is bringing together multiple data sets to create a map of the start-up universe in which we operate. The ability to do relational algebra in Rockset is really helpful.
SQL allows more people to interact with the data. Our engineers and data scientists are much more productive writing queries in SQL. Everything was that much harder when using Elasticsearch DSL. Prior to moving to Rockset, we avoided Elasticsearch DSL syntax if we could, sometimes performing tasks in Spark instead. We are constantly iterating on our queries, and we’re able to determine correctness more quickly because of our familiarity with SQL. When things do break, it’s easier to check what broke if we’re using SQL.
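To give a sense of the difference in syntax, here is a hypothetical side-by-side that is not taken from Sequoia's codebase: the same question (how many companies per sector were founded in 2018 or later) written as an Elasticsearch query DSL body and as SQL. The index/table name `companies` and the fields `sector` and `founded_year` are invented for illustration.

```python
import json

# Hypothetical question: how many companies per sector were founded
# in 2018 or later? All field and table names are illustrative assumptions.

# Elasticsearch query DSL: a nested JSON body sent to a _search endpoint.
es_query = {
    "size": 0,  # return only aggregation buckets, no individual hits
    "query": {
        "bool": {
            "filter": [
                {"range": {"founded_year": {"gte": 2018}}}
            ]
        }
    },
    "aggs": {
        # A terms aggregation groups documents by field value and
        # returns the largest buckets first (top 10 by default).
        "by_sector": {"terms": {"field": "sector"}}
    },
}

# Roughly the same question in SQL (without the top-10 bucket limit).
sql_query = """
SELECT sector, COUNT(*) AS company_count
FROM companies
WHERE founded_year >= 2018
GROUP BY sector
ORDER BY company_count DESC
"""

if __name__ == "__main__":
    print(json.dumps(es_query, indent=2))
    print(sql_query.strip())
```

The SQL version reads close to the question being asked, which is the kind of readability the interviewees describe when checking queries for correctness.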
We use data from many different sources in our analysis. We regularly receive data files from our vendors that we need to ingest from S3. Elasticsearch and Rockset both index the data to accelerate query performance, but the indexing time is much shorter with Rockset. This allows us to query the most recent version of the data as quickly as possible, without compromising on performance.
What alternatives did you consider?
Given the challenges with Elasticsearch, there’s a good chance we would have moved off Elasticsearch anyway, even if Rockset weren’t an option. In the past, we’ve considered using Postgres instead, but we would have had to be more selective about the data we put into Postgres, potentially limiting the data sets we bring into our apps. Snowflake and Amazon Athena were other SQL options, and we do use Snowflake at Sequoia, but Rockset is way faster for powering apps.
We’ve also experimented with other NoSQL databases, but SQL is just so much easier to use. All the NoSQL alternatives required learning something different from SQL. Ultimately, there’s a lot of value in being able to query using SQL but not having to specify the schema, and Rockset gives us that ability.
What did you achieve by making the switch from Elasticsearch to Rockset?
Our team doesn’t use Elasticsearch anymore. We’ve moved our internal apps over to Rockset for search and analytics.
We got the ability to do joins. Elasticsearch doesn’t support joins, so we were constantly denormalizing our data to get around this. It can take a week to set up a Spark job to denormalize each data set, and because of the data we deal with, we would experience significant space amplification due to denormalization. Data that would occupy 1 TB in Elasticsearch now takes up 10 GB in Rockset, approximately a 100x difference from not having to denormalize in order to join data.
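The space amplification described here can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for any SQL engine with joins, and the tables (`companies`, `funding_rounds`), columns, and data are hypothetical, not Sequoia's. The idea: with joins, a large company attribute is stored once and attached at query time; without joins, it has to be copied into every related record.

```python
import sqlite3


def build_db():
    """Create a tiny normalized data set in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE companies (
            company_id INTEGER PRIMARY KEY,
            name TEXT,
            description TEXT
        );
        CREATE TABLE funding_rounds (
            round_id INTEGER PRIMARY KEY,
            company_id INTEGER REFERENCES companies(company_id),
            stage TEXT,
            amount_usd INTEGER
        );
        """
    )
    # Each company carries a 1,000-character description (placeholder text).
    conn.executemany(
        "INSERT INTO companies VALUES (?, ?, ?)",
        [(1, "Acme", "x" * 1000), (2, "Globex", "y" * 1000)],
    )
    conn.executemany(
        "INSERT INTO funding_rounds VALUES (?, ?, ?, ?)",
        [
            (1, 1, "seed", 2_000_000),
            (2, 1, "series_a", 10_000_000),
            (3, 1, "series_b", 30_000_000),
            (4, 2, "seed", 1_500_000),
        ],
    )
    return conn


# With joins, company attributes are attached to each round at query time.
JOIN_SQL = """
SELECT c.name, c.description, r.stage, r.amount_usd
FROM funding_rounds AS r
JOIN companies AS c ON c.company_id = r.company_id
ORDER BY r.round_id
"""


def description_chars_normalized(conn):
    """Characters of description text stored when each company appears once."""
    return conn.execute(
        "SELECT SUM(LENGTH(description)) FROM companies"
    ).fetchone()[0]


def description_chars_denormalized(conn):
    """Characters stored if every round record carried its own copy."""
    rows = conn.execute(JOIN_SQL).fetchall()
    return sum(len(description) for _, description, _, _ in rows)


if __name__ == "__main__":
    conn = build_db()
    print(description_chars_normalized(conn))
    print(description_chars_denormalized(conn))
```

In this toy data set the denormalized form stores the description text twice over. The factor grows with the number of related records per entity and the number of data sets joined together, which is how a gap as large as the one quoted above can arise.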
We shortened the time it takes to index our data. With Elasticsearch, it would take 4-5 hours to index our largest data set. We’re doing that in 15-30 minutes with Rockset. We’re making data usable more quickly now, and we no longer need to expend effort monitoring longer-running ingestion on Elasticsearch.
We can move and iterate faster with Rockset. Our data model is constantly in flux, and we don’t anticipate it will ever get to a steady state, so it’s important to be able to iterate quickly on our queries and apps. The schema exploration capability in Rockset is really helpful in understanding the structure of the data we receive. Building and debugging queries using SQL in Rockset is trivial for us. We would sometimes take 15-30 minutes to construct the equivalent queries in Elasticsearch, and we still could not be 100% certain that we’d correctly specified the query we intended. Moving to Rockset allows us to be more efficient due to our familiarity with SQL. Rockset’s Query Lambdas (named, parameterized SQL queries stored in Rockset that can be executed from a dedicated REST endpoint) serve as a helpful abstraction layer on which we build our internal apps.
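As a rough sketch of what calling a Query Lambda from an app might look like, the helper below builds (but does not send) an HTTP request using only the Python standard library. The endpoint path, the `ApiKey` authorization scheme, and the shape of the `parameters` payload reflect a recollection of Rockset's public REST documentation and should be treated as assumptions. The host, workspace, lambda name, tag, and parameter are all hypothetical.

```python
import json
import urllib.request


def build_query_lambda_request(api_server, api_key, workspace,
                               lambda_name, tag, params):
    """Build a POST request that would execute a named, parameterized query.

    `params` is a list of (name, type, value) tuples. The URL layout and
    body format are assumptions, not confirmed by the interview.
    """
    url = (
        f"{api_server}/v1/orgs/self/ws/{workspace}"
        f"/lambdas/{lambda_name}/tags/{tag}"
    )
    body = {
        "parameters": [
            {"name": name, "type": type_, "value": value}
            for name, type_, value in params
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"ApiKey {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values only; nothing is sent over the network here.
    req = build_query_lambda_request(
        api_server="https://api.example.invalid",
        api_key="YOUR_API_KEY",
        workspace="commons",
        lambda_name="companies_by_sector",
        tag="latest",
        params=[("sector", "string", "fintech")],
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

The appeal of this pattern as an abstraction layer is that the app only knows a lambda name and its parameters; the SQL behind it can change without touching the front end.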
We no longer need to manage and maintain a cluster. We previously used an Elasticsearch managed cloud service, but it still needed a lot of fine-tuning from our engineers and might go down for a couple of hours every month. Rockset is a maintenance delight. We don’t have to think about it and can simply focus on building our apps on top of it.
Overall, we’ve improved the underlying data infrastructure for our apps with this transition from Elasticsearch to Rockset. The number of apps we build and the data we employ in our analysis will continue to grow, and we’re looking forward to more Rockset features and integrations to help us on the way.