Comparing Rockset, Apache Druid and ClickHouse for Real-Time Analytics
November 23, 2021
We built Rockset with the mission to make real-time analytics easy and affordable in the cloud. We put our users first and obsess about helping our users achieve speed, scale and simplicity in their modern real-time data stack (some of which I discuss in depth below). But we, as a team, still take performance benchmarks seriously. Because they help us communicate that performance is one of the core product values at Rockset.
We are in complete agreement with Snowflake and Databricks on one thing: that anyone who publishes benchmarks should do them in a fair, transparent, and replicable manner. In general, the way vendors conduct themselves during benchmarking is a good signal of how they operate and what their values are. Earlier this week, Imply (one of the companies behind Apache Druid), published what appears to be a tongue-in-cheek blog claiming to be more efficient than Rockset. Well, as a discerning customer, here are the questionable aspects of Imply's benchmark for you to consider:
- Imply has used a hardware configuration that has 20% higher CPU in comparison to Rockset. Good benchmarks aim for hardware parity to show an apples to apples comparison.
- Rockset’s cloud consumption model allows independently scaling compute & storage. Imply has made inaccurate price-performance claims that misrepresent competitor pricing.
Also, note that as often happens with vendors working on performance, the previous benchmarks used in the comparison were run almost a year ago and much has changed since then, so watch this space for updates.
Real-Time Data in the Real World
Car companies measure, optimize and publish how fast they can go from 0-60 mph, but you as the customer test-drive and evaluate a car based on that and a plethora of other dimensions. Similarly, as you choose your real-time solution, here are the technical considerations and the different dimensions to compare Rockset, Apache Druid and ClickHouse on.
Starting from first principles, here are the five characteristics of real-time data that most analytical systems have fundamental problems coping with:
- Massive, often bursty data streams. With clickstream or sensor data, the volume can be incredibly high — many terabytes of data per day — as well as incredibly unpredictable, scaling up and down rapidly.
- Change data capture streams. It is now possible to continuously capture changes as they happen in your operational database like MongoDB or Amazon DynamoDB. The problem? Most analytics databases, including Apache Druid and ClickHouse, are immutable, meaning that data can’t easily be updated or rewritten. That makes it very difficult for it to stay synced in real time with the OLTP database
- Out-of-order event streams. With real-time streams, data can arrive out of order in time or be re-sent, resulting in duplicates.
- Deeply-nested JSON and dynamic schemas. Real-time data streams typically arrive raw and semi-structured, say in the form of a JSON document, with many levels of nesting. Moreover, new fields and columns of data are constantly appearing.
- Destination: data apps and microservices. Real-time data streams typically power analytical or data applications. This is an important shift, because developers are now end users, and they tend to iterate and experiment fast, while demanding more flexibility than what was expected of first-generation analytical databases like Apache Druid.
Comparing Rockset, Apache Druid and ClickHouse
Given the technical characteristics of real-time data in the real world, here are the useful dimensions to compare Rockset, Apache Druid and ClickHouse. All competitor comparisons are derived from their documentation as of November 2021.
In the words of our customer: “Rockset is pure magic. We chose Rockset over Druid, because it requires no planning whatsoever in terms of indexes or scaling. In one hour, we were up and running, serving complex OLAP queries for our live leaderboards and dashboards at very high queries per second. As we grow in traffic, we can just ‘turn a knob’ and Rockset scales with us.“
We’re focused on accelerating our customers’ time to market: “Rockset shrank our 6-month long roadmap into one afternoon” said one customer. No wonder Imply has embarked on project Shapeshift in an attempt to get closer to Rockset’s cloud efficiency - however lifting and shifting datacenter-era tech into the cloud is not an easy endeavor and we wish them good luck. For someone who claims to care about real-world use cases more than performance, Apache Druid is surprisingly lacking in functionality that actually matters in the real world of real-time data: ease of deployment, ease of use, mutability, ease of scaling. Rockset will continue to innovate to make real-time analytics in the cloud more efficient for users with a focus on actual customer use cases. Price-performance does matter. Rockset will continue to publish regular benchmarking results and rest assured we will do our utmost not to misrepresent ourselves or our competitors in this process - and most importantly we will not mislead our customers. In the meantime we invite you to test drive Rockset for yourself and experience real-time analytics at cloud scale.
Sign Up for Free
Get started with $300 in free credits. No credit card required.