Rockset Is Up to 9.4x Faster than Apache Druid on the Star Schema Benchmark

February 18, 2021


Please read our November 2021 update on comparing real-time analytics solutions here: Comparing Rockset, Apache Druid and ClickHouse for Real-Time Analytics




Real-time analytics is all about deriving insights and taking actions as soon as data is produced. When broken down into its core requirements, real-time analytics means two things: access to fresh data and fast responses to queries. These are essentially two measures of latency, which we term data latency and query latency, respectively.

Data latency is the time from when data is produced to when it can be queried, and is a function of how efficiently a database can sustain writes. Because data latency usually gets less attention in benchmarks, we released RockBench, a data latency benchmark, in September 2020. Using RockBench, we showed that Rockset suits many real-time analytics applications: it kept data latency under 1 second while ingesting 1 billion events per day on a standard 4XLarge Virtual Instance.
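To put that ingest rate in perspective, the short calculation below converts 1 billion events per day into an average per-second rate. It assumes events arrive evenly over the day, which is a simplification for illustration; the RockBench result itself only states the daily total.

```python
# Convert the daily ingest volume into an average per-second rate.
# Assumption: events are spread evenly across the day.
EVENTS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
print(f"{events_per_second:,.0f} events/second on average")
```

Under that assumption, the database has to absorb more than eleven thousand writes every second while keeping each one queryable within a second.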

Query Latency and the Star Schema Benchmark

Query latency is the second key measure of real-time analytics performance and is the focus of the rest of this post. To evaluate query latency, we turned to the Star Schema Benchmark (SSB), an industry-standard benchmark for measuring database performance on analytical applications. The SSB was designed for a batch analytics scenario rather than real-time analytics, but it still yields useful insight into Rockset’s performance on analytical queries.

The SSB has also been used for performance measurements of other modern data technologies. In June 2020, Imply released a study of Apache Druid and Google BigQuery performance on the SSB. For the Rockset benchmark, we used the same hardware resources that were used in the Druid benchmark to provide greater context for our SSB evaluation.

Up to 9.4x Faster than Druid

From the benchmarking results, we observed one SSB query execute 9.4x faster on Rockset than on Druid, with many queries running 2x to 4x faster. The entire SSB suite ran 1.5x faster on Rockset than on Druid. Because pricing was not available for a true price-performance comparison, these results compare performance at resource parity.


In making these comparisons, we recognize we are not experts in configuring Druid, so we relied on a benchmark report from those who have the most knowledge about their system and can tune it best. In addition, benchmarks represent a snapshot in time, and systems will get faster with each new release. We are using the most recent benchmark published by Imply for comparison, but we expect Druid performance will continue to improve, as will Rockset’s.

Running the Star Schema Benchmark on Rockset

Benchmark Overview

The SSB comprises a suite of 13 analytical SQL queries that provide a good combination of functional and selectivity coverage.

We conducted the benchmark using SSB data at scale factor 100, which corresponds to 100GB and 600M rows of data. We denormalized the generated data prior to loading to provide a more direct comparison to the Druid benchmark. That benchmark avoided query-time joins, since Druid had only recently added some limited join support.
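As a rough illustration of what denormalizing means here, the sketch below folds dimension attributes into each fact row ahead of time, so queries need no join. The column names follow SSB naming conventions, but the rows, the `denormalize` helper and the choice of two dimensions are hypothetical and are not taken from the actual benchmark harness.

```python
# Toy dimension tables keyed by their primary keys (illustrative data only).
customers = {
    1: {"c_nation": "CHINA", "c_region": "ASIA"},
    2: {"c_nation": "FRANCE", "c_region": "EUROPE"},
}
suppliers = {
    10: {"s_nation": "JAPAN", "s_region": "ASIA"},
    11: {"s_nation": "GERMANY", "s_region": "EUROPE"},
}

# Toy fact table: each row references dimensions by foreign key.
lineorder = [
    {"lo_orderkey": 100, "lo_custkey": 1, "lo_suppkey": 10, "lo_revenue": 500},
    {"lo_orderkey": 101, "lo_custkey": 2, "lo_suppkey": 11, "lo_revenue": 750},
]


def denormalize(fact_rows, customers, suppliers):
    """Return wide rows with dimension attributes copied into each fact row."""
    wide = []
    for row in fact_rows:
        merged = dict(row)  # copy so the original fact row is untouched
        merged.update(customers[row["lo_custkey"]])
        merged.update(suppliers[row["lo_suppkey"]])
        wide.append(merged)
    return wide


wide_rows = denormalize(lineorder, customers, suppliers)
print(wide_rows[0])
```

The trade-off is the usual one: the join cost is paid once at load time in exchange for wider rows and repeated dimension values.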

Figure 1: Performance harness used to generate and load SSB data, run queries and measure query runtimes

Loading into Rockset was straightforward and required zero configuration, apart from specifying some keys for column-based clustering. Once the SSB data was loaded into Rockset, we ran a load-generator query script, based on the Rockset Python client, that issued queries and measured runtimes.
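A minimal sketch of such a load-generator loop is shown below. It does not use the Rockset Python client: `run_query` is a hypothetical callable standing in for whatever client call issues the SQL, and `fake_run_query`, the query names and the repetition count are assumptions made only so the example runs on its own.

```python
import statistics
import time


def time_query(run_query, sql, repetitions=3):
    """Run one query several times and return each runtime in milliseconds."""
    runtimes_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query(sql)
        runtimes_ms.append((time.perf_counter() - start) * 1000.0)
    return runtimes_ms


def fake_run_query(sql):
    """Hypothetical stand-in for a real client call; it only sleeps briefly."""
    time.sleep(0.001)
    return []


# Placeholder query texts; a real run would use the 13 SSB queries.
queries = {"Q1.1": "SELECT 1", "Q1.2": "SELECT 2"}

results = {
    name: statistics.median(time_query(fake_run_query, sql))
    for name, sql in queries.items()
}
for name, median_ms in results.items():
    print(f"{name}: {median_ms:.1f} ms")
```

Passing the client call in as a parameter keeps the timing logic independent of any particular database, so the same loop could time a different system by swapping the callable.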

Benchmark Results

We recorded the following runtimes across the 13 SSB queries.

Figure 2: Benchmark results when running SSB on Rockset (600M rows, 100GB data set)

All queries in the SSB suite executed in under 1 second on Rockset, with a median runtime of 254 ms. This result demonstrates Rockset’s ability to run complex analytics with sub-second performance, a common requirement for real-time analytics applications.

When comparing these results with Druid’s, we observe that 9 out of the 13 queries ran faster on Rockset. Rockset was 9.4x faster on the query with the largest speedup, with many queries in the 2x to 4x range, whereas Druid’s largest advantage was a 3.2x speedup. The suite of 13 queries completed in 4,146 ms on Rockset compared to 6,043 ms on Druid, corresponding to a 1.5x speedup overall. The following figures show Rockset’s query runtimes compared to those reported in Imply’s Druid and BigQuery paper.
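The overall speedup follows directly from the two suite totals quoted above. The snippet below reproduces the arithmetic; only the variable names are new.

```python
# Suite totals from the text, in milliseconds.
rockset_total_ms = 4146
druid_total_ms = 6043

# Speedup is the ratio of the slower total to the faster total.
speedup = druid_total_ms / rockset_total_ms
time_saved_fraction = 1 - rockset_total_ms / druid_total_ms

print(f"Speedup: {speedup:.2f}x")                  # Speedup: 1.46x
print(f"Time saved: {time_saved_fraction:.0%}")    # Time saved: 31%
```

The unrounded ratio is about 1.46, which rounds to the 1.5x figure. Equivalently, the full suite took roughly 31% less time on Rockset.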

Figure 3: Comparing Rockset and Druid SSB results

Figure 4: Graph showing Rockset, Druid and BigQuery runtimes on SSB queries

How Rockset Accelerates Real-Time Analytics

Several Rockset features work in concert to accelerate these SSB queries and real-time analytics in general.

  • Converged Index™
  • Column-based clustering
  • Vectorization

Converged Index

Rockset stores all ingested data in a Converged Index, which is a combination of:

  • Inverted index
  • Column-based index
  • Row-based index

Each query can use whichever index suits it best and yields the fastest execution. For instance, highly selective queries typically benefit from using the inverted index, while queries that require aggregations over large numbers of records will benefit from using the column-based index. By indexing data in three different ways, multiple types of queries can be executed efficiently without any manual intervention.
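The toy model below illustrates the idea of indexing the same documents three ways and picking an index per query. It is a conceptual sketch using in-memory Python dictionaries, not a description of how Rockset implements its Converged Index; the documents and field names are made up.

```python
from collections import defaultdict

# Illustrative documents (not benchmark data).
docs = [
    {"_id": 0, "city": "Paris", "revenue": 120},
    {"_id": 1, "city": "Tokyo", "revenue": 300},
    {"_id": 2, "city": "Paris", "revenue": 80},
]

row_index = {}                      # document id -> whole document
column_index = defaultdict(list)    # field name -> all values of that field
inverted_index = defaultdict(set)   # (field, value) -> ids of matching documents

for doc in docs:
    row_index[doc["_id"]] = doc
    for field, value in doc.items():
        if field == "_id":
            continue
        column_index[field].append(value)
        inverted_index[(field, value)].add(doc["_id"])

# Highly selective query: the inverted index finds matching ids directly,
# and the row index fetches the full documents.
paris_ids = sorted(inverted_index[("city", "Paris")])
paris_docs = [row_index[i] for i in paris_ids]

# Large aggregation: the column index reads only the one field it needs.
total_revenue = sum(column_index["revenue"])

print(paris_ids, total_revenue)  # [0, 2] 500
```

The cost of this design is extra storage and write work for each document, which is the price paid for not having to choose an index layout up front.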

Column-based clustering

Users can configure column-based clustering to colocate data according to a clustering key they specify. This maximizes the opportunity for sequential access and reduces the amount of data that needs to be scanned for each query.
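To show why colocating data by a key reduces scanning, the sketch below sorts rows by a hypothetical clustering key (an order year) and uses binary search to read only the contiguous slice a filter needs. The data and the `scan_year` helper are invented for illustration and say nothing about Rockset's storage format.

```python
from bisect import bisect_left, bisect_right

# (order_year, revenue) pairs in arrival order (illustrative data).
rows = [(1997, 50), (1993, 10), (1995, 30), (1993, 20), (1997, 40), (1995, 60)]

# Clustering: store rows sorted by the clustering key so equal keys are adjacent.
clustered = sorted(rows, key=lambda r: r[0])
keys = [year for year, _ in clustered]


def scan_year(year):
    """Return only the contiguous slice of rows whose key equals `year`."""
    lo = bisect_left(keys, year)
    hi = bisect_right(keys, year)
    return clustered[lo:hi]


matching = scan_year(1995)
revenue_1995 = sum(revenue for _, revenue in matching)
print(f"scanned {len(matching)} of {len(rows)} rows, revenue = {revenue_1995}")
# scanned 2 of 6 rows, revenue = 90
```

Without clustering, the same filter would have to examine all six rows; with it, the matching rows sit next to each other and can be read sequentially.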

Vectorization

Rockset uses columnar data chunks to exchange data between query execution operators. This allows vectorized processing, where operations are performed on many values, instead of one value, at a time, resulting in more efficient query execution.
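The contrast between value-at-a-time and chunk-at-a-time processing can be sketched as follows. This is only an analogy in pure Python: handing a whole chunk to the built-in `sum` stands in for an operator that processes many values per call, and the chunk size of 1024 is an arbitrary assumption.

```python
def sum_value_at_a_time(values):
    """Process one value per loop iteration."""
    total = 0
    for value in values:
        total += value
    return total


def chunks(values, size):
    """Yield consecutive slices of `values`, each holding up to `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def sum_chunk_at_a_time(values, chunk_size=1024):
    """Process one columnar chunk per call, so per-call overhead is amortized."""
    return sum(sum(chunk) for chunk in chunks(values, chunk_size))


revenue_column = list(range(10_000))
assert sum_value_at_a_time(revenue_column) == sum_chunk_at_a_time(revenue_column)
print(sum_chunk_at_a_time(revenue_column))  # 49995000
```

Both functions return the same result; the difference is how many values each operator invocation handles, which is where vectorized engines gain efficiency.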

What This Means for Developers of Real-Time Analytics

With this SSB performance evaluation, we determined that Rockset is capable of delivering the sub-second query latency needed for real-time analytics, with better performance than alternatives like Druid. Coupled with the earlier RockBench evaluation that established Rockset’s ability to analyze data being written in real time, we see that Rockset can be a good fit for real-time analytics applications that require fast queries on the latest data. These include many use cases like logistics tracking, security analytics, e-commerce personalization, gaming leaderboards and customer-facing SaaS analytics.

While this evaluation was performed on a denormalized data set, Rockset's design also allows it to execute joins efficiently, so applications are not limited to operating on denormalized data. Future work would include running Rockset performance evaluations involving joins on normalized data.

Additionally, SSB data is well structured and therefore less representative of the real-life semi-structured data sets we commonly come across. Rockset can support the same analytical SQL queries on complex, nested data as well.

Given Rockset’s ability to provide both the write and read performance required for real-time analytics, we invite you to include Rockset in your consideration if you are developing real-time analytics features or products. Read the Rockset Performance Evaluation on the Star Schema Benchmark white paper to get the details on how we ran the SSB evaluation. Or, sign up for a free Rockset account to try running your own queries on Rockset!
