Elasticsearch or Rockset for Real-Time Analytics: How Much Query Flexibility Do You Have?
February 25, 2021
It’s difficult to create data analytics systems that can easily query across your various data sources while maintaining fast performance and real-time capabilities.
In an attempt to mitigate these challenges, many companies are turning to more modern database solutions. Two of these real-time analytics solutions are Elasticsearch and Rockset.
Elasticsearch, originally developed for text search, has recently tried to push into the data analytics space. However, Elasticsearch has several limitations that make it less suitable when it comes to running more complex analytical queries.
Rockset, on the other hand, provides full-featured SQL and an API endpoint interface that allows developers to quickly join across data sources like DynamoDB and Kafka. Rockset also automatically indexes your data, without manual intervention, in a Converged Index (a search index, a columnar index, and a row index), making it adept at running a variety of complex analytics.
In this article we’ll compare the ease and flexibility of querying data using Rockset and Elasticsearch.
Why Query Flexibility Is Important for Real-Time Analytics
Companies are turning to real-time analytics to help drive operationally critical decisions. For example, a company might use real-time analytics on data such as daily active users and page load times to help detect outages of their apps on a regional level. Waiting until their batch reports load to see if their apps are down could mean millions of dollars of lost opportunity.
This is one of the many reasons developers rely on Elasticsearch or Rockset: the ability to query data fast. Highly performant, accurate, real-time analytics have become increasingly necessary for companies to better manage factories, calculate live pricing, and provide better service to website users.
This can be a challenge, though. A lot of data systems that provide real-time analytics require non-trivial ETL (extract, transform, load) to get the data into the “right” shape, or may not provide the analytical functionality required by the application. For example, you might have to develop a real-time data pipeline using a tool like Kafka just to get the data in a format that allows you to aggregate or join data in a performant manner.
Let’s look at how Elasticsearch and Rockset stack up with these considerations in mind.
Analyze Semi-Structured Data As Is
The data feeding modern applications is rarely in neat little tables. Instead, this data is often semi-structured in JSON or arrays.
Often this lack of structure forces developers to spend much of their time engineering ETL and data pipelines so that analysts can access the complex datasets. It is a slow process that doesn’t work well for anybody.
Rockset doesn’t require you to ETL your data and it provides several helpful features that allow engineers to optimize their time rather than spending it developing data pipelines.
Rockset’s Smart Schemas feature automatically detects and creates a schema based on the exact data present. Some tools attempt to do this by just detecting the values of the first few records, but Rockset creates a schema based on every record, field, and type in the data set. And Rockset will not reject data that does not fit an existing schema. Instead it creates a new field or data type if it encounters new data.
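To make the idea of a schema built from every record concrete, here is a minimal Python sketch. It is not Rockset’s implementation; the function name infer_schema and the sample records are hypothetical and exist only for illustration. The sketch walks every record, including nested objects, and tracks each type seen for each field, so a field with mixed types is recorded rather than rejected.

```python
from collections import defaultdict


def infer_schema(records):
    """Map each field path to the set of type names seen across ALL records.

    Illustrative only: a hypothetical helper, not Rockset's Smart Schemas code.
    """
    schema = defaultdict(set)

    def visit(value, path):
        if isinstance(value, dict):
            # Descend into nested objects, building a dotted field path.
            for key, child in value.items():
                visit(child, f"{path}.{key}" if path else key)
        else:
            schema[path].add(type(value).__name__)

    for record in records:
        visit(record, "")
    return dict(schema)


# Hypothetical records in which the zip field holds different types.
records = [
    {"name": "Ana", "zip": 94401},
    {"name": "Bo", "zip": "94401-1234"},
    {"name": "Cy", "zip": None, "address": {"city": "San Mateo"}},
]

schema = infer_schema(records)
for field, types in sorted(schema.items()):
    print(field, sorted(types))
```

Because the schema is derived from every record rather than a sample of the first few, the string and null values of zip are captured alongside the integer one, and the new address.city field appears as soon as one record contains it.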
Developers can also forgo configuring the mappings they would likely have to define if they were using Elasticsearch. Rockset’s flexibility makes it possible for developers to spend less time developing ETL and mapping data, and more time actually developing their products.
Figure 1: Example of a Smart Schema where the zip field contains values of different types
SQL Joins and Aggregations
Another benefit Rockset offers over Elasticsearch is the ease of running SQL and aggregation queries. Rockset supports full-featured SQL, enabling filtering, sorting, aggregating, and joining data in SQL. Because SQL is the de facto language for data management, many users can access Rockset or port their queries from other databases to Rockset without any additional training.
Joins, in particular, are rarely well supported by alternative real-time analytics solutions. Because Rockset implemented SQL as its native query language, join functionality was included from day one and not as an afterthought. Joins are often used in real-time analytics applications to combine streaming data (usually representing events) with static data (like customer information).
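As a rough illustration of this pattern, the sketch below joins event rows with a static customer table and aggregates the result in a single SQL statement. It uses Python’s built-in sqlite3 module purely as a stand-in SQL engine so the example is runnable anywhere; it does not use Rockset, and the table names, column names, and values are all hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Static data: one row per customer (hypothetical schema).
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    -- Event data: one row per page load (hypothetical schema).
    CREATE TABLE events (customer_id INTEGER, event_type TEXT, load_ms INTEGER);

    INSERT INTO customers VALUES (1, 'us-west'), (2, 'eu-central');
    INSERT INTO events VALUES
        (1, 'page_load', 120),
        (1, 'page_load', 180),
        (2, 'page_load', 900);
""")

# Join events to customers, then aggregate page-load times per region.
rows = conn.execute("""
    SELECT c.region, COUNT(*) AS loads, AVG(e.load_ms) AS avg_load_ms
    FROM events AS e
    JOIN customers AS c ON c.customer_id = e.customer_id
    WHERE e.event_type = 'page_load'
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

for region, loads, avg_load_ms in rows:
    print(region, loads, avg_load_ms)
```

The point of the sketch is that the relationship between events and customers is expressed at query time, so neither table has to be reshaped before it is stored.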
With Elasticsearch, joins are not a first-class citizen, and many teams end up denormalizing their data to model relationships. This requires setting up a data pipeline to denormalize the data upfront, as well as ongoing maintenance to deal with operational issues and changes in the data over time. In addition, denormalization significantly amplifies the amount of data that needs to be stored in Elasticsearch.
Figure 2: Denormalization is often required in Elasticsearch because it does not support joins
An alternative to denormalizing data before ingest is to do complex application-side joins. You can see an example of how user friendly Rockset can be in this Rockset vs. Elasticsearch example involving joins.
As an added bonus, Rockset’s SQL support allows it to easily integrate with Superset, Tableau, Redash, and other data visualization tools in the SQL ecosystem. This means you can quickly go from your query to your real-time dashboard.
Data APIs and Developer Tooling
With Rockset, you can query across data sources using SQL and save those queries as Query Lambdas that you can connect to API endpoints. This developer tooling allows your team to spin up API endpoints with almost zero infrastructure development.
Query Lambdas allow developers to version control their SQL queries, better manage the SQL development lifecycle, and get metrics on individual queries. Not every developer needs to understand the intricacies of the data infrastructure, so Rockset’s ability to collaborate and reuse SQL queries with Query Lambdas provides a lot of flexibility in how development teams can build their analytics.
But Rockset’s biggest advantage is in its unique approach to indexing.
Search vs. Converged Indexing
When we consider query flexibility, simply being able to express the queries you want on the data you have is not useful unless those queries also perform well. Queries need to be able to scan, filter, and aggregate millions, if not billions, of rows quickly across multiple tables.
Furthermore, storing this data in tables is rarely sufficient. Your data systems will also need to take advantage of indexing in order to improve performance. When it comes to indexing, there are several methods you can use.
Most standard databases, like Postgres, MySQL, or SQL Server, store data in row formats. This means that each individual row and all of its columns are stored together. When you query these databases, the engine reads entire rows, even if the query needs only a few of their columns. This makes a lot of sense for operational databases, but can lack speed when it comes to analytical queries.
Columnar indexing became more feasible as data systems began to store their data in columns rather than rows—also known as column-oriented storage. This provides performance benefits in terms of compression.
Additionally, a query only pulls exactly the columns that it needs, making analytical queries considerably faster.
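The difference between the two layouts can be sketched in a few lines of Python. This is a simplified in-memory model, not how any particular database stores data on disk, and the field names and values are hypothetical. The same three records are held once as rows and once as columns, and the same average is computed over each.

```python
# Row-oriented layout: all fields of a record are stored together.
rows = [
    {"user_id": 1, "region": "us-west", "load_ms": 120},
    {"user_id": 2, "region": "eu-central", "load_ms": 900},
    {"user_id": 3, "region": "us-west", "load_ms": 180},
]

# Column-oriented layout: all values of a field are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def avg_load_row_store(rows):
    # Every full record is visited, although only load_ms is needed.
    return sum(row["load_ms"] for row in rows) / len(rows)


def avg_load_column_store(columns):
    # Only the load_ms column is touched; other columns are never read.
    values = columns["load_ms"]
    return sum(values) / len(values)


print(avg_load_row_store(rows))
print(avg_load_column_store(columns))
```

Both functions return the same answer. In a real system the gain comes from reading far less data from storage, and from the fact that a column of similar values tends to compress well.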
Both Rockset and Elasticsearch take advantage of search indexing, a technique that makes search-like queries fast. Each (column, value) pair is stored in a posting list of documents for which “column” references “value.”
This technique allows you to query with a filter or predicate and quickly find the data that matches it. Rockset does this by keeping the posting lists sorted. Sorted lists can be intersected or merged efficiently, returning the documents that satisfy the conjunction or disjunction of the filters.
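A small Python sketch shows why sorted posting lists help. It is a textbook inverted index, not Rockset’s or Elasticsearch’s actual data structure, and the documents and helper names (build_search_index, intersect, union) are hypothetical. Each (column, value) pair maps to a sorted list of document IDs; an AND filter is a linear-time intersection of two lists, and an OR filter is a linear-time merge.

```python
from collections import defaultdict


def build_search_index(docs):
    """Map each (column, value) pair to a sorted posting list of doc IDs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # visiting IDs in order keeps lists sorted
        for column, value in docs[doc_id].items():
            index[(column, value)].append(doc_id)
    return index


def intersect(a, b):
    """AND: walk both sorted lists once, keeping IDs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: merge both sorted lists once, dropping duplicate IDs."""
    i = j = 0
    out = []
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            out.append(a[i])
            i += 1
        elif i == len(a) or b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:  # same ID in both lists
            out.append(a[i])
            i += 1
            j += 1
    return out


# Hypothetical documents keyed by document ID.
docs = {
    1: {"region": "us-west", "status": "error"},
    2: {"region": "us-west", "status": "ok"},
    3: {"region": "eu-central", "status": "error"},
    4: {"region": "us-west", "status": "error"},
}

index = build_search_index(docs)
west = index[("region", "us-west")]
errors = index[("status", "error")]

print(intersect(west, errors))  # region = 'us-west' AND status = 'error'
print(union(west, errors))      # region = 'us-west' OR status = 'error'
```

Neither operation scans the documents themselves; the filter is answered from the posting lists alone.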
Rockset doesn’t use just one of these methods of storing data. Instead, Rockset creates three indexes of your data to create a Converged Index™, which has the following characteristics:
- Accelerates many types of queries: Storing data in multiple indexes enables good out-of-the-box performance on different types of queries, whether they are search queries, aggregations, or point lookups.
- Compute efficient: Although indexing the data takes up more space, Rockset reduces the amount of compute expended. This is because queries can simply return results from the indexes rather than scanning large volumes of records. This trade-off benefits users, as compute generally costs more than storage.
- Lighter writes: In most databases, the more indexes you create, the heavier writes become, because updating a single row or document requires updating every index as well. This is a slow process that only gets worse as you increase the number of indexes you rely on, especially since most databases use B-trees as the underlying structure. Rockset uses LSM trees instead of B-trees. LSM trees are optimized for writes because they turn random writes to the database into sequential writes on storage, which improves performance and keeps writes light.
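The LSM tree idea behind the last point can be sketched as a toy key-value store in Python. This is a deliberately minimal model under stated assumptions, not Rockset’s storage engine: the class name TinyLSM and the memtable_limit parameter are hypothetical, the sorted runs live in memory rather than in files, and there is no compaction or write-ahead log. Writes land in an in-memory table, and a full table is flushed as one sorted, immutable run, which on disk would be a single sequential write.

```python
import bisect


class TinyLSM:
    """Toy LSM-style store: in-memory memtable plus sorted immutable runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}   # absorbs writes in any key order
        self.runs = []       # flushed runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A write only touches the memtable; older runs are never modified.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Emit the memtable as one sorted run (a sequential write on disk).
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then runs newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable is full, so it is flushed as a sorted run
db.put("a", 3)   # the update goes to the memtable, not the old run
print(db.get("a"), db.get("b"), db.get("missing"))
```

The trade-off is visible in get: a read may have to consult several runs, which is why real LSM implementations add background compaction and other read-side optimizations that this sketch omits.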
In contrast to Elasticsearch, which is focused on search indexes, Rockset’s converged indexing leads to faster queries and better performance over a wide range of queries, allowing developers greater flexibility when building real-time analytics.
Figure 3: A summary of how search indexing differs from converged indexing
Query Flexibility Increases Developer Productivity
In the world of big data and real-time analytics, your team needs a database system that can manage and index data fast. Developers are looking for ways to improve their productivity as they develop new products. With the many data sources in today’s modern architecture, this can be difficult.
With Rockset, regardless of what format your data is in, your team can query it using SQL to easily parse complex data types. From there, you can join and aggregate data without using complex code. This new flexibility allows developers to prototype and build new features quickly, without investing in heavy data preparation up front, saving on developer time and effort and increasing developer productivity overall.
Learn more about the architectural differences in the Elasticsearch vs Rockset white paper, and about the migration journey to Rockset in the 5 Steps to Migrate from Elasticsearch to Rockset blog.
Other blogs in this Elasticsearch or Rockset for Real-Time Analytics series:
- Part 1: Managing Clusters vs Going Serverless
- Part 2: How Much Query Flexibility Do You Have?
- Part 3: Real-Time Ingestion and Indexing