Rockset Is Up to 9.4x Faster than Apache Druid on the Star Schema Benchmark
February 18, 2021
Real-time analytics is all about deriving insights and taking actions as soon as data is produced. When broken down into its core requirements, real-time analytics means two things: access to fresh data and fast responses to queries. These are essentially two measures of latency, which we term data latency and query latency, respectively.
Data latency is the time from when data is produced to when it can be queried, and is a function of how efficiently a database can sustain writes. Because data latency usually gets less attention in benchmarks, we released RockBench, a data latency benchmark, last September. Using RockBench, we showed that Rockset is suitable for many real-time analytics applications: it kept data latency under 1 second while ingesting 1 billion events per day on a standard 4XLarge Virtual Instance.
Query Latency and the Star Schema Benchmark
Query latency is the second key measure of real-time analytics performance and is the focus of the rest of this post. To evaluate query latency, we turned to the Star Schema Benchmark (SSB), an industry-standard benchmark to measure database performance on analytical applications. The SSB was designed for a batch analytics scenario, rather than real-time analytics, but will still yield useful insight into Rockset’s performance on analytical queries.
The SSB has also been used for performance measurements of other modern data technologies. In June 2020, Imply released a study of Apache Druid and Google BigQuery performance on the SSB. For the Rockset benchmark, we used the same hardware resources that were used in the Druid benchmark to provide greater context for our SSB evaluation.
Up to 9.4x Faster than Druid
From the benchmarking results, we observed one SSB query execute 9.4x faster on Rockset than on Druid, with many queries running 2x to 4x faster. The entire SSB suite ran 1.5x faster on Rockset than on Druid. Because pricing was not available for a true price-performance comparison, these results show better performance at resource parity, that is, on the same hardware resources.
In making these comparisons, we recognize we are not experts in configuring Druid, so we relied on a benchmark report from those who have the most knowledge about their system and can tune it best. In addition, benchmarks represent a snapshot in time, and systems will get faster with each new release. We are using the most recent benchmark published by Imply for comparison, but we expect Druid performance will continue to improve, as will Rockset’s.
Running the Star Schema Benchmark on Rockset
The SSB comprises a suite of 13 analytical SQL queries that provide a good combination of functional and selectivity coverage.
We conducted the benchmark using SSB data at scale factor 100, which corresponds to 100GB and 600M rows of data. We denormalized the generated data prior to loading to provide a more direct comparison to the Druid benchmark, which avoided query-time joins, since Druid only recently added some limited join support.
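To make the denormalization step concrete, here is a minimal sketch of flattening a star schema before loading. The column names follow the published SSB schema, but the sample values and the `denormalize` helper are made up for illustration; this is not the script used in the evaluation.

```python
# Toy dimension tables keyed by their primary keys (values are invented).
customers = {7: {"c_region": "ASIA", "c_nation": "JAPAN"}}
suppliers = {3: {"s_region": "EUROPE", "s_nation": "FRANCE"}}
parts = {42: {"p_category": "MFGR#12", "p_brand1": "MFGR#1221"}}
dates = {19970101: {"d_year": 1997}}

# Toy fact table: each lineorder row points at the dimensions by key.
lineorder = [
    {"lo_custkey": 7, "lo_suppkey": 3, "lo_partkey": 42,
     "lo_orderdate": 19970101, "lo_revenue": 1500},
]


def denormalize(rows):
    """Yield one flat record per fact row, with dimension columns copied in."""
    for row in rows:
        flat = dict(row)
        flat.update(customers[row["lo_custkey"]])
        flat.update(suppliers[row["lo_suppkey"]])
        flat.update(parts[row["lo_partkey"]])
        flat.update(dates[row["lo_orderdate"]])
        yield flat


flat_rows = list(denormalize(lineorder))
print(flat_rows[0]["c_region"], flat_rows[0]["d_year"], flat_rows[0]["lo_revenue"])
```

With the joins done once at load time, each benchmark query becomes a filter and aggregation over a single wide table.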
Figure 1: Performance harness used to generate and load SSB data, run queries and measure query runtimes
Loading into Rockset was straightforward and required zero configuration, apart from specifying some keys for column-based clustering. Once the SSB data was loaded into Rockset, we ran a load-generator query script, based on the Rockset Python client, that issued queries and measured runtimes.
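A load-generator script of this kind can be sketched roughly as follows. The `run_query` callable is a hypothetical stand-in for whatever client call submits SQL (the actual Rockset Python client calls are not shown here), and the choice to repeat each query and report the median is an assumption, not the documented methodology.

```python
import statistics
import time


def time_queries(run_query, queries, repeats=3):
    """Run each named query `repeats` times and return the median runtime in ms."""
    results = {}
    for name, sql in queries.items():
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(sql)  # hypothetical: submits the SQL and waits for the result
            samples.append((time.perf_counter() - start) * 1000.0)
        results[name] = statistics.median(samples)
    return results


def fake_run_query(sql):
    """Stand-in for a real client call so the sketch runs without a database."""
    time.sleep(0.001)
    return []


queries = {"Q1.1": "SELECT 1", "Q1.2": "SELECT 2"}  # placeholder SQL, not the SSB queries
runtimes_ms = time_queries(fake_run_query, queries)
for name, ms in runtimes_ms.items():
    print(f"{name}: {ms:.1f} ms")
```

Swapping `fake_run_query` for a function that calls a real client would turn this into a working harness; `time.perf_counter` is used because it is a monotonic, high-resolution clock.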
We recorded the following runtimes across the 13 SSB queries.
Figure 2: Benchmark results when running SSB on Rockset (600M rows, 100GB data set)
All queries in the SSB suite executed in under 1 second on Rockset, with a median runtime of 254 ms. This result demonstrates Rockset’s ability to run complex analytics with sub-second performance, a common requirement for real-time analytics applications.
When comparing these results with Druid’s, we observe that 9 out of the 13 queries ran faster on Rockset. Rockset was 9.4x faster on the query with the largest speedup, with many queries in the 2x to 4x range, whereas Druid’s largest advantage was a 3.2x speedup. The suite of 13 queries completed in 4,146 ms on Rockset compared to 6,043 ms on Druid, corresponding to a 1.5x speedup overall. The following figures show Rockset’s query runtimes compared to those reported in Imply’s Druid and BigQuery paper.
Figure 3: Comparing Rockset and Druid SSB results
Figure 4: Graph showing Rockset, Druid and BigQuery runtimes on SSB queries
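The overall speedup quoted above is the ratio of the two suite totals. As a quick check using only the totals stated in this post (the `speedup` helper is just for illustration):

```python
def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


druid_total_ms = 6043    # suite total reported for Druid
rockset_total_ms = 4146  # suite total measured on Rockset

overall = speedup(druid_total_ms, rockset_total_ms)
print(f"{overall:.2f}x")  # about 1.46x, which rounds to the 1.5x quoted above
```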
How Rockset Accelerates Real-Time Analytics
Several Rockset features work in concert to accelerate these SSB queries and real-time analytics in general.
- Converged Index™
- Column-based clustering
- Vectorized processing
Rockset stores all ingested data in a Converged Index, which is a combination of:
- Inverted index
- Column-based index
- Row-based index
Each query can use whichever index is best suited to it and yields the fastest execution. For instance, highly selective queries typically benefit from using the inverted index, while queries that require aggregations over large numbers of records will benefit from using the column-based index. Because data is indexed in three different ways, multiple types of queries can be executed efficiently without any manual intervention.
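As a toy illustration of the idea, and not a description of Rockset's actual storage format, the sketch below builds all three kinds of index over the same small set of documents. The field names and values are invented.

```python
from collections import defaultdict

docs = [
    {"_id": 0, "city": "Lima", "revenue": 120},
    {"_id": 1, "city": "Oslo", "revenue": 75},
    {"_id": 2, "city": "Lima", "revenue": 40},
]

row_index = {}                      # _id -> whole document
column_index = defaultdict(list)    # field -> all values for that field
inverted_index = defaultdict(set)   # (field, value) -> ids of matching documents

for doc in docs:
    row_index[doc["_id"]] = doc
    for field, value in doc.items():
        if field == "_id":
            continue
        column_index[field].append(value)
        inverted_index[(field, value)].add(doc["_id"])

# Selective lookup: the inverted index finds matching ids without a scan,
# and the row index fetches the full documents.
lima_ids = sorted(inverted_index[("city", "Lima")])
lima_docs = [row_index[i] for i in lima_ids]

# Large aggregation: the column index reads only the one field it needs.
total_revenue = sum(column_index["revenue"])

print(lima_ids, total_revenue)
```

The same data is stored three ways, so a planner can pick the access path per query; the cost is extra storage and write work, which is the trade-off this design accepts.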
Users can configure column-based clustering so as to colocate data according to a clustering key they specify. This maximizes the opportunity for sequential access and reduces the amount of data that needs to be scanned for each query.
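The effect of clustering can be sketched with a sorted list: once rows sharing a key are stored next to each other, a query on that key reads one contiguous range instead of the whole data set. This is a simplified illustration with invented data, not Rockset's implementation.

```python
from bisect import bisect_left, bisect_right

# (year, revenue) rows in arrival order.
rows = [(1997, 10), (1993, 5), (1997, 7), (1994, 3), (1993, 8)]

# Colocate rows by the clustering key (here, the year).
clustered = sorted(rows, key=lambda r: r[0])
keys = [r[0] for r in clustered]


def scan_year(year):
    """Return only the contiguous slice of rows whose key equals `year`."""
    lo = bisect_left(keys, year)
    hi = bisect_right(keys, year)
    return clustered[lo:hi]


rows_1997 = scan_year(1997)
print(len(rows_1997), "of", len(rows), "rows scanned")
print(sum(revenue for _, revenue in rows_1997))
```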
Rockset uses columnar data chunks to exchange data between query execution operators. This allows vectorized processing, where operations are performed on many values, instead of one value, at a time, resulting in more efficient query execution.
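A rough sketch of the chunk-at-a-time pattern is shown below. Operators receive whole column chunks rather than single values, so per-call overhead is paid once per chunk. The function names, chunk size, and data are assumptions for illustration, and plain Python lists stand in for the columnar buffers a real engine would use.

```python
def chunks(column, size):
    """Split a column into fixed-size chunks."""
    for i in range(0, len(column), size):
        yield column[i:i + size]


def filter_chunk(discounts, revenues, lo, hi):
    """Filter operator: keep revenues whose discount lies in [lo, hi]."""
    return [r for d, r in zip(discounts, revenues) if lo <= d <= hi]


def sum_revenue(discount_col, revenue_col, lo, hi, chunk_size=4):
    """Aggregate operator: consumes filtered chunks, not individual rows."""
    total = 0
    paired = zip(chunks(discount_col, chunk_size), chunks(revenue_col, chunk_size))
    for d_chunk, r_chunk in paired:
        total += sum(filter_chunk(d_chunk, r_chunk, lo, hi))
    return total


discount = [1, 2, 3, 4, 5, 6, 1, 3]
revenue = [10, 20, 30, 40, 50, 60, 70, 80]
result = sum_revenue(discount, revenue, lo=1, hi=3)
print(result)
```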
What This Means for Developers of Real-Time Analytics
With this SSB performance evaluation, we determined that Rockset is capable of delivering the sub-second query latency needed for real-time analytics, with better performance than alternatives like Druid. Coupled with the earlier RockBench evaluation that established Rockset’s ability to analyze data being written in real time, we see that Rockset can be a good fit for real-time analytics applications that require fast queries on the latest data. These include many use cases like logistics tracking, security analytics, e-commerce personalization, gaming leaderboards and customer-facing SaaS analytics.
While this evaluation was performed on a denormalized data set, Rockset's design also allows it to execute joins efficiently, so applications are not limited to operating on denormalized data. Future work would include running Rockset performance evaluations involving joins on normalized data.
Additionally, SSB data is well structured and therefore less representative of the real-life semi-structured data sets we commonly come across. It should be noted that Rockset can support the same analytical SQL queries on complex, nested data as well.
Given Rockset’s ability to provide both the write and read performance required for real-time analytics, we invite you to include Rockset in your consideration if you are developing real-time analytics features or products. Read the Rockset Performance Evaluation on the Star Schema Benchmark white paper to get the details on how we ran the SSB evaluation. Or, sign up for a free Rockset account to try running your own queries on Rockset!