Streaming Data and Real-Time Analytics With Kafka + Rockset
April 26, 2022
With Kafka Summit in full swing in London this week and event streaming all over my LinkedIn feed, I saw a post asking "Is streaming dead?", a reference to CNN+ being shut down.
In the last few days, Netflix took a once-in-a-lifetime beating in the stock market, and CNN redefined "fail fast" (a concept pioneered by Silicon Valley) when it announced the breaking news that it would shut down CNN+ just weeks after a very splashy debut. Not all is doom and gloom though. HBO reported millions of new subscribers in Q1 and Disney+ is doing OK.
We at Rockset think about a different kind of streaming, and that kind is definitely not dead. It is thriving, and with Kafka Summit this week, I thought it a good time to emphasize the importance of streaming data in the modern real-time data stack.
Over the last few years, the rise of Kafka has been closely aligned with the explosive growth of IoT devices. The desire to capture and analyze that data fueled the growth of Kafka and opened up new frontiers for organizations to deliver services to their customers. Confluent made it easy for everyone to use streaming data in their data stack by launching Confluent Cloud.
Even Databases Are Streams Now
Enterprise data, which mostly resides in relational databases (like Oracle, MSSQL, etc.), is still handled with archaic batch processing that often introduces delays of hours, if not days, between when the data is generated and when it is analyzed. That backward-looking approach is not in line with the speed and agility with which enterprises want to move today. Change data capture (CDC) has finally been adopted by major databases, and it has helped transform the data sitting in those databases into a data stream. Suddenly, you can use the infrastructure that was designed to ingest IoT data in real time to ingest all the enterprise data as well.
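To make the idea of a database as a stream concrete, here is a minimal sketch of a consumer replaying change events to keep a copy of a table in sync. The event shape (`op`, `key`, `row`) and the function name `apply_changes` are assumptions made for illustration; real CDC tools each define their own event formats.

```python
from typing import Any, Dict, Iterable

# Hypothetical change-event shape, assumed for this sketch only:
#   {"op": "insert" | "update" | "delete", "key": <primary key>, "row": {...}}
ChangeEvent = Dict[str, Any]


def apply_changes(table: Dict[Any, Dict[str, Any]],
                  events: Iterable[ChangeEvent]) -> Dict[Any, Dict[str, Any]]:
    """Replay a stream of change events onto an in-memory copy of a table."""
    for event in events:
        op, key = event["op"], event["key"]
        if op == "insert":
            table[key] = dict(event["row"])
        elif op == "update":
            # Only the changed columns arrive; merge them into the stored row.
            table.setdefault(key, {}).update(event["row"])
        elif op == "delete":
            table.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return table


events = [
    {"op": "insert", "key": 1, "row": {"name": "widget", "qty": 1}},
    {"op": "update", "key": 1, "row": {"qty": 5}},
    {"op": "insert", "key": 2, "row": {"name": "gadget", "qty": 3}},
    {"op": "delete", "key": 2},
]
replica = apply_changes({}, events)
```

In a real pipeline the `events` iterable would be fed by a Kafka topic rather than a list, but the replay logic is the same: each insert, update, or delete in the source database becomes one message that downstream systems apply in order.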
But Enterprises Still Do Batch Analytics?
Now that the ability to ingest data in real time is there, does it solve the problem of getting insights from that data in real time? Not really, because we still analyze data the old way. Enterprises typically still run the data through a batch process that curates it for a data warehouse before anyone can analyze it.
Enterprises are forced to take the above approach because their enterprise data warehouse needs curated data before it is ready to be analyzed. The data warehouse is designed to work with a fixed schema and requires nested data to be flattened before it can be stored. Enterprises spend millions of dollars trying to run the batch process more frequently to ensure that applications are able to use the latest data. Even with all these hassles, data is typically stale by at least a few hours. On top of that, the system doesn’t perform well for ad-hoc queries, because the data is flattened and denormalized in a way that accelerates only a particular set of queries.
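As a rough illustration of the flattening step a fixed-schema warehouse requires, the sketch below turns a nested record into a single-level row. The function name `flatten` and the underscore-joined column names are assumptions for this example, not a description of any particular warehouse's loader.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], parent: str = "", sep: str = "_") -> Dict[str, Any]:
    """Collapse nested dictionaries into one level with joined column names."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the column prefix along.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


event = {"id": 1, "user": {"name": "ana", "geo": {"city": "London"}}}
row = flatten(event)
```

Every new nested field in the source means a new column and a schema change downstream, which is part of why this step adds so much engineering work to a batch pipeline.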
Real-Time Analytics Are Now Affordable
We at Rockset are on a mission to make real-time analytics affordable for everyone by cutting down on the expensive and time-consuming ETL/ELT process, and actually delivering on the promise of fast queries on fresh data.
So how do we do it?
- Schemaless ingest: Rockset can ingest data without the need for flattening, denormalization or even a schema, saving lots of data engineering complexity. Rockset is a mutable database. It allows any existing record, including individual fields of an existing deeply nested document, to be updated without having to reindex the entire document. This is especially useful and very efficient when staying in sync with operational databases, which are likely to have a high rate of inserts, updates and deletes.
- Converged Index™: Rockset is built using converged indexing, which is a combination of inverted index, column-based index and row-based index. As a result, it is optimized for multiple access patterns, including key-value, time-series, document, search and aggregation queries. The goal of converged indexing is to optimize query performance without knowing in advance what the shape of the data is or what type of queries are expected.
- True SaaS data platform: Rockset is a fully managed serverless database, with no capacity planning, provisioning or scaling to worry about. This is in contrast to other systems that claim to be built for real-time analytics, but still employ a datacenter-era architecture rooted in servers and clusters, requiring time, effort and expertise to configure and operate.
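The indexing and mutability ideas in the list above can be pictured with a toy in-memory model. This is a simplified sketch, not Rockset's implementation: the class `ToyConvergedIndex` and its methods are hypothetical, it handles only flat documents with hashable values, and it ignores storage, distribution and query planning entirely. It only shows how keeping the same data in row, column and inverted form lets point lookups, searches and aggregations each use the structure that suits them, and how a single field can be updated without rewriting the rest of the document.

```python
from collections import defaultdict
from typing import Any, Dict, List


class ToyConvergedIndex:
    """Toy model: every field is indexed three ways at write time."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, Any]] = {}      # row index: doc_id -> fields
        self.columns = defaultdict(dict)                # column index: field -> {doc_id: value}
        self.inverted = defaultdict(set)                # inverted index: (field, value) -> doc_ids

    def upsert(self, doc_id: str, fields: Dict[str, Any]) -> None:
        """Insert a document or update only the fields that are passed in."""
        old = self.rows.get(doc_id, {})
        for field, value in fields.items():
            if field in old:
                # Drop the stale inverted-index entry for the field being changed.
                self.inverted[(field, old[field])].discard(doc_id)
            self.columns[field][doc_id] = value
            self.inverted[(field, value)].add(doc_id)
        self.rows[doc_id] = {**old, **fields}

    def get(self, doc_id: str) -> Dict[str, Any]:
        """Key-value lookup served by the row index."""
        return self.rows[doc_id]

    def search(self, field: str, value: Any) -> List[str]:
        """Equality search served by the inverted index."""
        return sorted(self.inverted.get((field, value), set()))

    def column_sum(self, field: str) -> Any:
        """Aggregation served by the column index."""
        return sum(self.columns[field].values())


index = ToyConvergedIndex()
index.upsert("a", {"city": "London", "amount": 10})
index.upsert("b", {"city": "Paris", "amount": 5})
index.upsert("a", {"city": "Paris"})  # field-level update; "amount" is untouched
```

After the last call, searching for `city == "Paris"` finds both documents, the sum over `amount` is unchanged, and no schema was declared at any point.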
While streaming in the context of Netflix and CNN+ may not be flourishing, streaming in the data world is just getting started. And the growth will not come only from IoT. Technologies like Confluent will become the backbone of enterprise architecture, and every data source can and will be converted into a streaming source, allowing real-time consumption of data for analytics. All customers need is a data platform that supports real-time analytics. Rockset, together with Kafka/Confluent, is determined to deliver on the promise of real-time analytics for everyone.