Best Practices for Analyzing Kafka Event Streams

The rapid adoption of Kafka as a streaming platform has created a growing need for organizations to unlock the value in their Kafka data. In this tech talk, Dhruba Borthakur discusses various design patterns for building analytics on Kafka event streams. He will review the top considerations for analytics on Kafka - including data latency, query performance, ease of use, and scalability - and share best practices when selecting an analytics stack for Kafka data.


Speakers

Dhruba is the CTO and co-founder of Rockset. He was an engineer on the database team at Facebook, where he was the founding engineer of the RocksDB data store. Earlier at Yahoo, he was one of the founding engineers of the Hadoop Distributed File System.

Show Notes

Kevin Leong:

We have with us today, Dhruba Borthakur, CTO of Rockset. Dhruba is a veteran of the data management space. He was an engineer on the database team at Facebook, where he was the founding engineer of the RocksDB data store. Also, earlier in his career he was at Yahoo, where he was one of the founding engineers of the Hadoop Distributed File System. So without further ado, Dhruba.

Dhruba Borthakur:

Thank you, Kevin. Great to be here. The topic for today's discussion is how you can process a lot of real-time events and get analytics insights from that data. I have a set of slides, but I would really like it if you folks send any questions to Kevin, because at the end of the day, I'd like this to be more of a discussion rather than just a set of slides that I go through. So the agenda for today is: I'm going to tell you a little bit about what event data is. I'm going to show you a few examples of real-life event data and what it looks like. Then I'm going to touch a little bit on how Apache Kafka can help a lot in getting analytics out of this event data. Then I'm going to do a deeper dive into the things that we need to consider to be able to get real-time analytics from Kafka.

Dhruba Borthakur:

What are the pros and cons of using different systems, and how can we best determine which systems to use for the different kinds of workloads you might have? This is the meat of the presentation. After that, I'm going to give a demo. The demo is also going to show you how we can process events very quickly in real-time and how you can get complex analytics out of those events. And then I'm also going to open it up for Q&A so that I can help answer any other questions you might have. But again, like I said, I can respond to questions even in the middle of my talk, if you send the questions using the webinar tool.

Dhruba Borthakur:

So what are the different types of event data that we typically want to process and get analytics out of? There are many different vertical use cases; the one that I'm very familiar with is online advertising, because I had worked at Facebook and Yahoo earlier, and I used to work closely with the analytics that powers online advertising. What happens in that use case is that people are using your online app or your online browser, and every click that the person makes is actually generating an event. And those events can be analyzed to show relevant advertisements to your users. So it's a relevance engine, a decision-making engine, that is using these events to show relevant advertisements. Similarly, take, for example, online gaming interactions. If you have an online game and lots of people are playing, you could show them upsell items, or you could get better engagement from your gaming users, if you can feed some of these analytics events back into the gaming tool.

Dhruba Borthakur:

The list is long, but I don't want to go through each of these, so let us look at a few examples of real-life event data, just to understand which semantics are similar for all kinds of events and what is different. So take, for example, here we are looking at a reservation confirmation event. The interesting thing is that there's a type to this event, which says that it's a reservation confirmation. There's a reservation ID, which uniquely identifies this record. And then there are a set of fields, like names and codes and stuff like that, which are very specific to this kind of event. So the interesting thing here is that there is a type for this event, there is an ID which identifies this event, and then there are certain specific fields that are unique to this kind of event. Now, if you look at similar event data from an e-commerce backend, you can see that it kind of follows the same pattern: there's a type and then there are ways to identify what kind of event data this one is.

Dhruba Borthakur:

And then it also has very event-specific fields in this record. Another example from a real-life application is an IoT edge device that is producing events. Here again, there's a device ID; an IoT edge device usually might have a geolocation or a timestamp, and it might have more fields that are specific to that type of event. So the common thing about all of these types of events is that there's a way to identify each event, there's usually a timestamp associated with each event, and then the fields are very different from one type of event to another. So the focus of any analytics tools that you use is: how can these tools be used to analyze different types of events? Because in my backend, I might have events from three different sources that I want to process, and they all have different fields in their event structures.
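For illustration, a reservation-confirmation event along the lines described above might look something like this (a hypothetical sketch; the field names are my own assumptions, not the exact payload from the slide):

```json
{
  "type": "ReservationConfirmed",
  "reservation_id": "RES-48213",
  "timestamp": "2020-03-02T18:23:05Z",
  "guest_name": "Jane Doe",
  "property_code": "SFO-12",
  "nights": 3
}
```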

Dhruba Borthakur:

So that is the focus of this talk. And we are going to talk about how I can do this in real-time. So, take, for example, I used to be one of the original developers of the Hadoop file system, and I've seen up close how systems used to do a lot of things in batch mode; this is analytics in batch mode. So take, for example, an app or a service is generating events, it gets fed into a data lake, and then you run a set of ETL processes, which could massage your data, enrich your data or transform your data, and then put it in a very clean format in the data warehouse. And now your data is available for analytics. This is usually a batch process that happens once a day or once an hour.

Dhruba Borthakur:

In the very early days of Hadoop this used to happen once a day, but over time, people have moved to hourly or maybe 15-minute partitions, where you can get analytics out of your data after, say, 15 minutes of data latency. For real-time event analytics this is not sufficient, because we need faster access to these events to get analytics out of them. So we continue to do the batch load that we have done before, but many use cases also benefit a lot if you can take these events and put them in a real-time event store. And then your applications can query this event store directly, without having to wait for ETL processes, to be able to get analytics out of your dataset.

Dhruba Borthakur:

So, how do these apps and services generate events and how do they get to your event store? This is where Kafka comes into the picture. And what is Kafka? I think all of you probably already know: it's a distributed streaming platform. You have a set of topics. It's first in, first out, in the sense that you put data in at one end of a topic and the data is reliably delivered, with guarantees, to whoever is listening to that topic on the other side. It's very useful for building data pipelines. And it's also useful for transforming one form of data into a slightly different form, or doing a little bit of enrichment on your dataset. I'm guessing a lot of you might already be data practitioners using Kafka, so I don't need to preach to the choir. But it's the foundational platform for building a lot of real-time event analytics; very low latency and high throughput are key.

Dhruba Borthakur:

So the focus for using Kafka is that I can get low latency from the time when data is produced into Kafka to the time when it's delivered by Kafka to your application. And Kafka also has the ability to scale to a huge number of megabytes per second, which means that you can scale it up to hundreds of megabytes a second if your application demands it. Now let's talk about a real example of how I can use Kafka to do event analytics end-to-end with very low data latency. So this is a picture from a real-life VR application.

Dhruba Borthakur:

On the top left side of your screen, you see this VR app, where people are interacting with the app through some custom devices. And they are actually generating a lot of events; every time the player or the user uses the VR app, let's say he's looking at an interactive movie or he's playing a game, each of these actions generates an event, and that's the JSON event that gets deposited into a raw Kafka topic. The event rate is very high because there could be many hundreds and thousands of users playing the game. Now this raw topic is being processed through Kafka itself and written to another topic in Kafka, and it is combined with geolocation, because we want to tag every event with the geolocation from where the event is coming.

Dhruba Borthakur:

So this is basically data enrichment happening as part of the Kafka data processing pipeline itself. And then on the right side of your screen, you'd see how data is being tailed from the Kafka topic into a warehouse, for reporting and for analytics that need to be done every hour, every minute, every day. The interesting thing is happening more on the lower left side of your screen, where data is now tailed from these Kafka topics and put into an operational, analytical, real-time engine, which is used for random ad hoc queries. And who makes these queries? These are queries from the application itself, driving decisions on how the app is used and how it needs to be used. The VR application itself is querying this database in real-time, because it needs to make the user's experience better. Take, for example, a user who is clicking on some particular interactive movie and is taking a certain path in the movie. Now, the VR application wants to show more of that stuff.

Dhruba Borthakur:

So it's getting analytics out of this real-time analytics database and showing a better experience to the viewer in real-time, and this is happening as and when the player is playing this video game, for example. This is a very real-time application, or an application of very real-time event analytics, where analytics is not just for reporting but actually for powering your application itself. So again, to summarize, this is how this application is using Kafka: real-time events coming in on the left are depositing data into Kafka. You could be using KSQL, for example, to join this data with the geolocation data that is kind of static and coming in from the side, and it could enrich your data in Kafka, and then it could park it into a real-time analytics database or a real-time analytics backend, which is being queried by, say, your VR application to get analytics out in real-time.
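As a rough sketch of that KSQL enrichment step, a statement along these lines could join the raw event stream with a static geolocation table and write the enriched records to a new Kafka topic (the stream, table, and column names here are assumptions for illustration, not the application's actual schema):

```sql
-- Hypothetical KSQL/ksqlDB sketch: enrich raw VR events with geolocation.
-- Assumes RAW_VR_EVENTS (a stream) and GEO_LOCATIONS (a table keyed by
-- device_id) have already been declared over their Kafka topics.
CREATE STREAM ENRICHED_VR_EVENTS AS
  SELECT e.event_id,
         e.user_id,
         e.action,
         g.country,
         g.city
  FROM RAW_VR_EVENTS e
  LEFT JOIN GEO_LOCATIONS g
    ON e.device_id = g.device_id;
```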

Dhruba Borthakur:

So one of the interesting things here is that this is very different from a reporting backend, because you need to extract insights within a few seconds or a few milliseconds of when your data is produced. And also, it is different because the queries are not pre-canned; the queries could change from minute to minute based on what your application is doing. So if you have this kind of a data pipeline, what are the things that we need to consider to figure out what kind of connections we need to have with Kafka, and what kind of real-time analytics backend would be useful for your workload? So let us talk about how we can pick and choose different backend software to deliver on your performance needs.

Kevin Leong:

And if I may, we have some questions coming in, so keep those questions coming. One of them was, will we be sharing the deck and video after? We are recording it and we will be sharing the recording with our attendees here. And then if you go to the next slide, Dhruba, we have a question that you can answer along the way, which is: how do you validate the schema of your events coming in as messy JSON? So what type of validation is happening? Maybe you'll cover that as we go through this section.

Dhruba Borthakur:

Yeah, that's a great question. So thanks for interrupting, Kevin, because I think this is the right place for me to answer the question: how can you validate the schema of data that's coming in as JSON? This is one of the considerations. So let me show you two or three things, first, on this screen. How do you determine what kind of analytics backend you want to use to respond to your queries? The three things that I mention here are: one is data latency, how fast can we get insights out of your events? Then, how complex can your queries be? Is it a key-value store where the queries are really simple, or are there more complex queries where you can do JOINs, aggregations, or sorting and stuff like that?

Dhruba Borthakur:

And then the third is the question that I think we received, which is: how can I deal with columns or fields in your JSON where the schema is not fixed, or where there are very sparse columns, in the sense that some JSON records have a few columns and other JSON records could have a completely different set of columns? I will explain how we can pick and choose, but these are some of the considerations that we should think about to figure out which analytics database to use. Of course, there are other considerations as well. For example, we need to figure out what query latency our application is demanding. Do we need really millisecond-latency queries, or are we okay with queries which take maybe two minutes to serve? We also might want to look at query volume. Take the VR application example that I gave: there could be thousands or hundreds of thousands of people playing an online game, so the query volume could be high.

Dhruba Borthakur:

But then you might have another use case where, let's say, you are doing a data science kind of workload on some of this data; there the query volume might be one query a second or maybe a few queries a minute. Similarly, another characteristic that we might want to consider is the operations side: what is your operational overhead for keeping the system running and available 24/7, all the time, for doing real-time analytics, and does it need a lot of people dealing with it? Is it automated? Is it all self-serve? We are going to touch on each of these six topics in a little bit more detail now. Let's talk about data latency first.

Dhruba Borthakur:

So, for most real-time event analytics platforms or applications, the user requirement is that when an event is produced, we need to be able to query it, or it needs to be visible to queries, within at most a few seconds of latency. That means that if you produce an event now, you should be able to see that event in your querying system within a few seconds. So once the event is produced to Kafka, how quickly does it get reflected in your analytics backend? There typically needs to be a live sync between Kafka and this analytics backend, so that whenever an event is produced to Kafka, it gets reflected in your analytics backend quickly.

Dhruba Borthakur:

The focus here is also to reduce ETL processes, because ETL processes usually take minutes or sometimes hours to be able to clean your data, or give structure to your data, before it gets put in an analytics database. So for real-time event analytics, it's great if you have a system where you can avoid a lot of ETL; some ETL might be mandatory, but you could avoid most of it and continuously put data into your analytics backend through Kafka as quickly as possible. So this is one primary characteristic that we should be aware of when trying to figure out which backend to use. The second one is the ability to support complex queries.

Dhruba Borthakur:

So, what kind of expressive power does your application need? Is your application okay querying a pre-canned dataset of events that doesn't change, or does the application need to be able to express complex application logic, so that it can pick out events from your dataset intelligently, based on what the application needs? Take, for example, a key-value store: it's likely that the key-value store is produced at the end of an ETL pipeline, which means that your data doesn't really change until the next time your pipeline runs.

Dhruba Borthakur:

But if you're looking at a data store which has JOINs, aggregations, GROUP BYs, sorting, and relevance built into it, then you can empower your applications to be very flexible and agile, and get a lot of insights out of your dataset very quickly, on your most recent data. So the ability to support complex queries would be interesting for most events that are being deposited into your Kafka stream. Another consideration that a lot of application backends have is that data is coming through Kafka, but then there is an alternate data source: data that is coming from a third party, or dimension tables and fact tables coming from a database. So there's a need to combine these data sources, the events that are flying in through Kafka as well as a standing dataset or a transactional database, to be able to power your real-time applications.

Dhruba Borthakur:

So this is another consideration for you to think about when figuring out which backend to use for real-time event analytics on Kafka. The third thing that we're talking about is mixed types. So this is where the question came in, saying: how can we handle free-format JSON objects coming in which are of different types, or which have very sparse fields? There are different ways to solve this. I've seen people set up a long ETL process to clean up and standardize on some fields. Let's say, in this example on your screen, you see there are five or six records. Everything has a zip code, but some of the zip codes are strings in JSON and some of those zip codes are integers. Now, how do you handle this? The batch way of handling this would be to set up an ETL process where you clean them up, you convert the strings into integers, and then you write it into a warehouse. But for real-time analytics, you could connect Kafka to a real-time data store where this can happen automatically.

Dhruba Borthakur:

The real-time data stores that come to my mind are systems like, say, Druid, or Rockset, which are essentially built for analytics at large scale. And some of these systems have the ability to handle this kind of fluid schema very well without you having to do a lot of manual work. I would also look at type binding: do I bind the type of each of these fields at the time the data is written into Kafka, or do I handle the type binding at the time when the query is made, so that you can take all sorts of data in these fields, but then when your queries come, that's the time when you specify what kind of records you are looking for?
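As a small sketch of query-time type binding, a query like the one below could coerce a mixed string/integer field at read time instead of requiring an ETL step to standardize it first (the collection and field names are assumptions for illustration):

```sql
-- Hypothetical sketch: zipcode arrives as both "94107" (string) and 94107 (int);
-- casting at query time lets both shapes match the same predicate.
SELECT order_id, zipcode
FROM orders_events
WHERE CAST(zipcode AS int) = 94107;
```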

Dhruba Borthakur:

Are you looking for records where the age, or the zip code, is a specific number? And do you want it treated as an integer, or do you want it to be a string? Or you can convert strings to integers at the time of query. I'll give you a demo to show how this is typically done, but this is one important thing to keep in mind when you're trying to pick the analytics backend of your choice. I hope I answered the question, but if there's a follow-up question, please do ask; in the meantime I'm going to continue with my slides. The fourth interesting thing to consider when picking your analytics backend is query latency. How fast do your queries need to be? Does your application need millisecond-latency queries, or is your application okay if you get queries back in five seconds, or 10 seconds? And this is a question that only you can answer, because each of us knows what the requirement is for our own application.

Dhruba Borthakur:

But there are certain techniques to achieve different kinds of latency. So what are the ways to achieve low latency on these large datasets? The traditional approach is essentially a very MapReduce type of backend, where you have a large dataset of events and you want to power a set of applications which have some pre-canned queries. So you can use a MapReduce type of backend where you parallelize a scan. MapReduce essentially has the ability to parallelize processing over your large datasets, and then each of your individual mappers scans the amount of data that it has and gives you results, and you can combine them and produce the reports. On the other hand, for real-time event analytics, it's great if you have the ability to parallelize an index rather than parallelize a scan.

Dhruba Borthakur:

So what is the index here? You have a lot of this event data, and you can build indexes on your event data so all your queries are fast. Here I mentioned converged indexing, which is a technique to build all kinds of indexes automatically. For example, your system could take event data from Kafka and then build a columnar index, just like typical warehouses might build, or it could build an inverted index just like a search engine, or it could also build a document index just like a document store. So depending on your query, your system can automatically figure out which index to use and give you quick results. This is very important for ad hoc analytics, or in a system where you don't really know which queries will be made by your application, and the application might make different types of queries over its lifetime.

Dhruba Borthakur:

So this is a newer way of handling real-time event analytics, because it gives you very low data latency, as well as very low query latency for your applications.
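To make that contrast concrete, the same indexed collection could serve both of the query shapes below; in a converged-indexing system, the selective filter would typically be answered from the inverted index, while the wide aggregation would lean on the columnar index (the collection and field names are assumptions for illustration):

```sql
-- Selective lookup: an inverted index can jump straight to matching events.
SELECT *
FROM events
WHERE device_id = 'sensor-1234'
LIMIT 100;

-- Wide aggregation: a columnar index reads only the columns it needs.
SELECT event_type, COUNT(*) AS cnt
FROM events
GROUP BY event_type
ORDER BY cnt DESC;
```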

Kevin Leong:

Oh, Dhruba, before you move on, we had another question come in. I know you've covered several of the six considerations already, but the question was, can you point out the differences and similarities to Amazon Athena on some of these different considerations that you're talking about?

Dhruba Borthakur:

Oh, great question. Yeah. So Athena is a great system for querying a large set of data, where the data usually resides in S3. And it works very well when you look at the traditional approach in this picture, where there's a lot of event data and you can run Athena on it. It is essentially open source Presto, which does a lot of MapReduce-style queries, where each of your Presto nodes, or each of your Presto workers, would be connecting to some subset or some blocks of your data, scanning those blocks to give you results. So to compare an analytics database with Athena: I think both of them can solve the same sort of queries, because your analytics database can also serve complex queries. The difference, I feel, is more about the latency of your queries and the adeptness of the analytics, in the sense that for an analytics database, I think you would typically expect a far lower latency for your queries.

Dhruba Borthakur:

And for your analytics database, I'm hoping that you can pick one where you don't need to do a lot of cleaning and enrichment at the time when data is written, whereas with Athena, you would probably do some kind of filtering and cleaning so that the Athena query latencies are reasonably okay for your application. I can cover more if there are follow-up questions on this. Query volume is another way to distinguish between, say, a Redshift or Athena versus an analytics database. If you're looking at maybe a few queries a second, or a few concurrent queries a second, then a Redshift kind of warehouse might be pretty okay for your application, whereas if you're trying to build a real-time application which is probably making a lot more concurrent queries on your dataset, a mutable analytics database might be more efficient.

Dhruba Borthakur:

That also tells me another difference between Athena and an analytics database: Athena is mostly read-only, because your data is queried from S3, whereas an analytics database is usually mutable, which means that you can overwrite events and you can actually get analytics out of your most recent datasets, rather than trying to partition your data and do it in a very warehouse style of analytics. Operational simplicity is something we definitely all keep in mind when setting up this backend. We need to be able to scale, that's kind of a no-brainer nowadays, but we also need to be able to handle bursty traffic. So how can we provision more hardware when there's more work to do, and how can we reduce the cost of the hardware on which our system is running when there's less work to be done? Also things like backfilling data from Kafka when, say, some agent was offline and then suddenly came back online after a day: how quickly can it backfill data from Kafka into an analytics database?

Dhruba Borthakur:

So some of these things feed into your operational simplicity, and you'd have to analyze your workload again to figure out what is best for you. I will show you a demo of how this can be done for a real-life application. I'm going to use the Twitter stream, because the Twitter stream is real-life JSON data, and I'm hoping that I can explain how mixed types can be handled in this JSON data. This is not artificial data, this is real-world data, and the JSON is very complicated in nature. And then there is another data feed, or dataset, that's coming from Nasdaq, which is basically stock tickers. I'm going to use Rockset as the analytics database, just because it's easy for me to demo this, since Rockset is a hosted service; I don't really need to set up anything as part of this demo.

Dhruba Borthakur:

I can point it at these sources and get real datasets into Rockset. But this demo could also be done on many other backend databases, including, say, PostgreSQL or Druid, or Elastic, or some other databases you might have. For those you might need to set some of this up yourself, but with this I'm also going to show you the operational simplicity of the analytics database I'm using, Rockset. And I'm going to show you a set of queries. These queries are essentially decision-making queries. The query is going to analyze this event stream that's coming from Kafka, and it is going to show you the most tweeted stock ticker symbols in the last one day.

Dhruba Borthakur:

So it's a recommendation system, or it could be like a decision-making system, where you need to process a lot of event stream data that's coming into your system and get analytics out of this dataset. So let us go to the console. Do you have any questions, Kevin?

Kevin Leong:

Yeah. I was just going to say, while Dhruba is switching over, if there are any questions on your mind that came up while Dhruba was introducing the topic, please send them. We've had a couple of good questions so far, so keep them coming.

Dhruba Borthakur:

Great. Thanks, Kevin. So I'm using the Rockset Console right now to show you the demo of analyzing event streams in Kafka. But like I said, you could use any other analytics database if that is your favorite choice. So in Rockset there is this concept of collections, which is essentially a way to tie a dataset that's coming from a Kafka stream into a queryable format. Rockset supports many other formats and sources, but I'm going to pick Apache Kafka. I'm going to pick an integration, which is basically the security credentials associated with that Kafka stream. And I'm going to create a collection. I'm going to name this collection the Twitter firehose, because Twitter is producing data at full speed. And then I'm going to type in the name of the Kafka topic into which Twitter is producing this data.

Dhruba Borthakur:

So the moment I type in the name of the Kafka topic, and the only thing I have done till now is essentially specify the name of the Kafka topic, on the right side I get a preview of all the data that is streaming into this Kafka topic. And this is a way for me to validate: am I looking at the right data source? Does it make sense that all these feeds are coming in here? And this definitely looks real, because this is what the Twitter stream is all about. I will show you again how complicated this one is. I might want to do some transformations upfront to be able to maybe filter out some records or drop some fields from this incoming stream, because I don't want to query all the fields; they're not very important for me.

Dhruba Borthakur:

Let's say I want to say, "Hey, I don't want the contributors field," and I can say, "Hey, drop this field." I can also do some other kinds of transformations that we'll talk about later. Also, because it's an event streaming database, and almost all event streaming databases will have a way to set retention, you can set your retention for 30 days, 60 days or a few days, and you can pick any field in your record to be the timestamp of the record. And then I will scroll down and I'll say, "Create this collection." So the moment I say, "Create this collection," I go to the collections page and I look at the collection that got created. I actually created it yesterday, which is why you would see that it has almost 90 gigs of data and is constantly ingesting data from the Kafka stream.

Dhruba Borthakur:

You'd see the number of documents changing frequently. And you'd also see that it was last updated a few seconds ago, which means that it is basically an index; it's indexing all the data that is coming in from your Kafka stream. So what have I done till now? I haven't specified a schema. I haven't set up any servers or machines. The only thing I've done is specify where my data is: I specified the name of a Kafka topic. So now, if I want to do analytics on this Kafka topic, I'm going to pick the collection that I created, and automatically the system shows me the metadata of the events that are coming into this index. Take, for example, every record is identified by an ID.

Dhruba Borthakur:

So automatically it is telling you that 100% of records have this value. But if you look at, for example, contributors, you will see that only 98% of records have the contributors field and the other records don't have this field. So this is what I mean by a sparse database, where the database has a schema, but the schema is not fixed in time. So it's not a schemaless system, it's very much a schema-full system. This is what we call the ability to support mixed types. The schema might vary based on the data that you put into your system. So now each of these fields is essentially queryable. Take, for example, I will try to pick out one here, say created_at. 98% of records have a created_at field, so you can also query that. Each of these fields also has a type, and here it says very clearly that the created_at type is a string.

Dhruba Borthakur:

I'm also going to show you some fields which have different types, as part of this demo. So let us make a standard SQL query, which just samples this dataset and gets me some values. So I did a SELECT star and I tried to get only 10 records from this dataset. And this is basically, again, to look at the dataset in a tabular format. I can also look at it in JSON format. You'll see there's very complex JSON out here. It has deeply nested fields and long arrays inside it. So this is typical, real-life JSON data that we're looking at. You would also see that the query took only 32 milliseconds, which is basically the query latency. And we can keep noticing these numbers as we make more and more queries.

Dhruba Borthakur:

So now I want to pick out only three fields from this query, only the user name, text and symbols, because I'm looking at those records where people have tweeted a stock ticker symbol as part of their tweets. So these are the three fields that are interesting to me. And I already see these fields showing up here in 16 milliseconds, and the symbols are all null, which means that a lot of people are tweeting, but they're not tweeting any stock ticker symbols. So now I need to look at fields that are not null, which is why I put IS NOT NULL into the query. And now if I look at those results, I will be able to see some relevant things, where it's an array, but there is text inside it, and it's complicated JSON with double nesting and arrays inside of nested objects.

Dhruba Borthakur:

So we are getting close, but we're not completely done yet. Now we will refine our query and put in an event time of less than one day, which means that I'm looking only at events which have happened in the Kafka stream within the last one day. And so now, if I rerun that query, I should be able to see a set of fields, and stock ticker symbols, from the last one day only. And then I need to GROUP BY and ORDER BY, to count the stock ticker symbols, because I'm looking for the top tweeted stock ticker symbol.

Dhruba Borthakur:

So here I do GROUP BY and ORDER BY, descending, and I'll print the top 10, because that's what I'm interested in as part of this decision-making analytics, and some of the symbols look real and valid. But this listing could change from second to second or minute to minute, because this collection is constantly indexing new data from Kafka, even as we're speaking. And then, I might not know what TSLA refers to. So I combine this with a static dataset. The next query essentially does a JOIN with a static dataset called tickers, from Nasdaq. And if I run the JOIN query, I can see that TSLA is actually Tesla. So this is the information from the static dataset that maps a stock ticker symbol to the name of the company.
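Putting those steps together, the final demo query looked roughly like the following (a sketch only; the collection and field names are assumptions based on the shape of the Twitter payload, not the exact query run in the demo):

```sql
-- Hypothetical reconstruction: top tweeted ticker symbols in the last day,
-- joined against a static Nasdaq tickers dataset for company names.
SELECT UPPER(sym.text)  AS ticker,
       tickers.name     AS company,
       COUNT(*)         AS tweet_count
FROM   twitter_firehose t,
       UNNEST(t.entities.symbols) AS sym
JOIN   tickers
  ON   UPPER(sym.text) = tickers.symbol
WHERE  t._event_time > CURRENT_TIMESTAMP() - INTERVAL 1 DAY
GROUP BY UPPER(sym.text), tickers.name
ORDER BY tweet_count DESC
LIMIT 10;
```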

Dhruba Borthakur:

And that kind of concludes my demo here. The goal was to show how I can get analytics from a very complex dataset without having to set up anything, without having to deal with types, and without having to deal with a very sparse dataset that's coming in, but still be able to make analytical decisions very quickly, at low latency. These queries take a few milliseconds; the last one probably took 184 milliseconds. So this shows that there are analytics databases that you can use, combined with the power of Kafka, to be able to get good real-time analytics out of your event streams. So those are the four or five things that I wanted to show in the demo: the best practices, or how to figure out which analytics data store to use.

Dhruba Borthakur:

And for your workload, you would probably look at these four or five different things and then determine what kind of backend you might want to use to process these event streams coming from Kafka. That concludes my demo and my talk. I have some more slides, but I'd rather take questions now and see if I can help answer any questions you might have.

Kevin Leong:

Okay. Awesome, Dhruba. Thanks for that run-through of some of the things to think about when doing analytics on Kafka event streams. Feel free to go ahead if you have any questions that are on your mind; put them into the questions window in your webinar tool. I do have one for you, Dhruba.

Dhruba Borthakur:

Yes.

Kevin Leong:

Ready? Okay. So I think this is a question about what you were just showing with the demo here. Does it support Avro and Protobuf formats?

Dhruba Borthakur:

Awesome. Great question. So the demo that we are showing right now is about Kafka, where the schema is in Avro and JSON. That's what I have typically seen in most use cases: people are depositing data into Kafka using Avro and JSON. And the analytics database that I picked for the demo, which is Rockset, supports this out of the box. If the data is in Protobuf format, that's also possible to do using Rockset, and I think it's also possible using other databases out there. But I'm not 100% sure how much work would be needed for you to be able to deposit data into Kafka using Avro and Protobuf. Does this user already have data in Avro and Protobuf? I think the challenge there is: can you use some of your Kafka libraries to deposit data into Kafka in Protobuf format? That is something I think you need to consider even before you go to figuring out which analytics database you might want to use.

Kevin Leong:

Okay. So, again, keep the questions coming. On that note, Dhruba was talking about different analytics databases, and I think we have a few questions here about other analytics databases. So one of them is: what are the main differences between Rockset and the ELK stack?

Dhruba Borthakur:

Oh, that's a great question. Yeah. I have seen that a lot of real-time analytics are happening on either the Druid or the ELK stack. There are certain differences between Rockset and ELK, and the main difference is about converged indexing. What Rockset does is it builds all kinds of indexes, which means that it builds an inverted index just like Elastic, but it also builds a columnar index, just like a Redshift or Snowflake, and it also builds a document-centric index, just like MongoDB, for example. So if you currently have your event stream and you are putting it into the ELK stack, and you're also putting it in Mongo, or you're also putting it in your warehouse to do analytics, in Rockset you can essentially do all of these kinds of analytics in one system. That's one differentiator.

Dhruba Borthakur:

The second differentiator is that Rockset is built only to run on the cloud, which means that there's a lot of elasticity built in. So if there's more work to be done, more nodes automatically spin up to process and index data, whereas the ELK stack, I think, is being run in the cloud, but there are certain design practices that are not absolutely native to the cloud. For example, I've seen benchmarks where Rockset can index terabytes per hour or more. It is not a comparison with the ELK stack per se, because it also depends on price-performance. But it's interesting to compare how a solution built natively for the cloud might be able to give you far better price-performance compared to systems that were built for a static set of hardware on which they're optimized to run.

Dhruba Borthakur:

So those are the two main differences, I think. If you're looking at, say, warehouse-type queries where you need to do a lot of complex JOINs and aggregations, Rockset definitely will differentiate itself compared to the ELK stack, because it knows exactly which indexes to use. And I think the Rockset query engine is also built with low latency in mind. I know the ELK stack is also trying to build SQL support now, but SQL support on the ELK stack is very new, so I'm yet to see many benchmarks on the ELK stack's SQL query engine.

Kevin Leong:

Okay. We have two follow-up comments also, that Elasticsearch does have columnar indexes, by way of Lucene DocValues. So, yeah.

Dhruba Borthakur:

Yes. Yeah, no, absolutely. So Lucene is a great piece of software. I really like Lucene; the code is easy to understand and read, and I think Elasticsearch leverages it a lot. The interesting thing is, if you look at the locking primitives in Lucene, Lucene is also a compaction engine, because you need to compact the segment files that Lucene creates. And one differentiator there, again, is that Rockset uses RocksDB for low latency indexing, and in my mind, the difference between Lucene and RocksDB, the basic building blocks that these two stacks use, is that the locking model in Lucene is kind of from the time when data used to be stored on disk devices, whereas RocksDB is mostly built for data in SSDs and memory.

Dhruba Borthakur:

So the design principles are different. There are a lot of low-latency primitives that RocksDB uses to index large datasets. And I think Lucene is probably there; I've seen the Lucene benchmarks. But, yeah, I think it would be great to benchmark RocksDB indexing versus Lucene's indexing sometime and show some clear differentiators there.

Kevin Leong:

Okay. Great. There are lots of questions coming in. So, we've talked about the ELK stack somewhat, and we had another question asking about similarities and differences compared against Druid, or hosted Druid, right? So, Rockset, which is hosted, and then Druid hosted as well. What would you say are the differences [crosstalk 00:46:41]?

Dhruba Borthakur:

Sure. So, yeah, I think the basic difference as far as usability is concerned is that Rockset gives an SQL interface to your data by default, and Druid, I know, is also building that; I have been following the Druid design discussions. The architecture of Rockset is built for scaling up and down on the cloud. So let me tell you two different things between Druid and Rockset. What Druid does is give you the ability to store cold data in a separate place and hot data in a separate place, which is great. In Rockset this essentially happens automatically as part of the scale-up and scale-down: all the data is stored in S3 for durability, but when you make queries, the relevant pieces of data that are queried more are brought into SSDs and RAM, and that's how you get this kind of hierarchical storage in the cloud.

Dhruba Borthakur:

And it's not a manual process where you need to separate out or have two different clusters where hot and cold data are separate. The SQL query engine for Rockset, again, is very elastically designed, in the sense that there are two or three levels of aggregators, so that when queries come in, you can parallelize your queries across a large number of nodes and spin up more machines in the backend while queries are executing. And then as soon as the queries are done, the system automatically scales down the nodes, so that your system is actually economically feasible: you get a lot of parallelism when queries are there, but you don't have to pay for any of this when there are no queries to be made. So, yeah, I think the [inaudible 00:48:31] designed for the cloud has a lot of differences from the backend that Druid has.

Dhruba Borthakur:

And, yeah, again, I think there are some use cases where Druid might be popular and there'll be some use cases where Rockset would be really useful. So it essentially depends on your use case: what kind of operational complexity you might have, what different kinds of queries you might have, what your query volume is, what your QPS is. Now, I know that Rockset can support thousands of queries per second on the same dataset. So those are the kinds of questions that come up. Suppose you have 10 terabytes of data and you want to bulk load that data and start making queries: how long does it take for you to bulk load this data into Druid versus Rockset? I don't have a pre-canned answer for these; it all depends on your workload and your requirements.

Kevin Leong:

Okay. A few more questions while we're on the topic.

Dhruba Borthakur:

Sure.

Kevin Leong:

Right, we were talking about SQL there. One question was, is Rockset's query language fully compatible with ANSI SQL?

Dhruba Borthakur:

Yes. So this is a great differentiator compared to most other systems that are out there. Let me take one minute of digression here. I think, if you've seen the last 10 years, key-value stores became really popular, right? And the reason key-value stores became popular is because of their latency and scalability, not because they can do complex queries. So a lot of these systems essentially started with a key-value backend and then somehow migrated to a document backend, which is a little better than key-value, but still not a complete standard SQL kind of query language. I've seen this with, I mean, Druid has a great query language, but it's not SQL. Similarly, Elastic has a very powerful query language, but it's not SQL.

Dhruba Borthakur:

And MongoDB also has a great language that you can use, but it's not SQL, so you can't really do JOINs. MongoDB now has lookup support and they're getting into some of the JOIN features, but Rockset SQL is built completely with standard ANSI SQL in mind. So from day one, you'll be able to run SQL queries on Rockset. Rockset doesn't have any other APIs other than SQL, which means that SQL is the only way to access data in Rockset, unlike other systems which have evolved from having a simpler API to being able to support SQL. So yes, the answer to that question is that Rockset supports pure ANSI SQL, including JOINs, aggregations, GROUP BYs, windowing, and everything else that SQL has.
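As a small illustration of the windowing support mentioned above, a query along these lines could rank ticker symbols by tweet count within each hour over the demo's collection (again, a sketch only; the collection and field names are assumptions):

```sql
-- Hypothetical sketch: rank tickers by tweet count per hour using a
-- standard SQL window function over an hourly aggregate.
SELECT hour,
       ticker,
       tweet_count,
       RANK() OVER (PARTITION BY hour ORDER BY tweet_count DESC) AS hourly_rank
FROM (
  SELECT DATE_TRUNC('HOUR', t._event_time) AS hour,
         UPPER(sym.text)                   AS ticker,
         COUNT(*)                          AS tweet_count
  FROM   twitter_firehose t,
         UNNEST(t.entities.symbols) AS sym
  GROUP BY 1, 2
) hourly_counts;
```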

Kevin Leong:

All right. Great. I think we have time for one more question.

Dhruba Borthakur:

Sure.

Kevin Leong:

And that's one coming in asking about Rockset as a data warehouse. So how does it compare with Snowflake? Should I be using Rockset as a data warehouse? And I think what the person has in mind for the comparison is Snowflake.

Dhruba Borthakur:

Yes. Right. So yeah, I think there are certain similarities between Rockset and a data warehouse, in the sense that both of them can do SQL on large datasets. That's the only thing that is common. There are a lot of things that are different, which is what you might want to use to decide between Rockset and a warehouse. The main difference is that Rockset is a mutable database, which means that you can overwrite data and you can add data to this database even as your queries are happening, unlike a warehouse, where the backend is not optimized to overwrite records in real-time. They basically do copy-on-write: if you look at the Redshift architecture or the Snowflake architecture, you'd see that if you update an object, it's copy-on-write, so every small update actually results in a huge file being rewritten in your backend.

Dhruba Borthakur:

So they're not optimized for overwrites. If your workload needs overwrites, or if your workload needs to query your most recent data, data that was produced in the last few seconds, let's say I want to query all my data that got produced in the last five seconds, then you would definitely use Rockset versus a warehouse. And the third thing is concurrency. How many queries do you want to make from your application? Is it five queries? In which case a Redshift cluster might be a fine thing to use. If you're talking about a higher number of queries, then you need an analytics database, where Rockset or Druid could actually be quite suitable for that kind of use case. So yeah, I hope I answered your question, and people can send us questions in other ways too; I can answer them offline.

Kevin Leong:

Right. I think you have your email address on one of those slides, so if you have any follow-up questions, feel free to reach out. If you can show that slide, Dhruba, I think it's the last one.

Dhruba Borthakur:

Okay. I'm going to that slide now.

Kevin Leong:

It's just, dhruba@rockset.com.

Dhruba Borthakur:

Yeah. Please do send me an email. It'll be great to be able to hear your questions and see if I can help answer any of those.

Kevin Leong:

And actually there are a couple more remaining that we will probably follow up with offline.

Dhruba Borthakur:

Sure, yeah.

Kevin Leong:

If that's okay.

Dhruba Borthakur:

That'll be great.

Kevin Leong:

And with that, thank you all for your kind attention during this last hour of our tech talk, and we will follow up with an email with the recording if you wish to view it again. So again, thank you all for joining us today, and thanks, Dhruba.

Dhruba Borthakur:

Thank you. Thanks a lot. Bye.

