Real-time Analytics on Data Lakes: Indexing Amazon S3 for up to 125x Faster Queries

While Athena is widely used for querying data in S3, it cannot provide the performance needed for real-time analytics like customer 360s, personalization, IoT applications and more. Dhruba draws on his experience as a founding engineer of RocksDB and explains how to use real-time indexing on your data lake for real-time analytics that powers high-performance applications.

  • Indexing vs scanning - up to 125x faster than Athena. Rockset automatically builds a search index, column index, and row index on data ingested from S3 to accelerate the types of queries that are common to applications. Athena performs scans when queried, and its pricing is based on the amount of data scanned, so it is better suited to occasional ad hoc queries than to high-performance real-time analytics.
  • High concurrency - 1000x concurrency vs Athena. Rockset has a distributed cloud-native architecture that allows ingest, storage and query tiers to scale independently in response to workload. The ability to scale query compute as needed allows Rockset to support large numbers of concurrent users without performance degradation. In contrast, Athena can only execute 5 concurrent queries and queues any additional queries.
  • Real-time analytics - 1 second end-to-end latency vs. hours with Athena. Rockset allows queries on JSON, Avro and Parquet formats without any schema or table definition. It supports schemaless ingestion of data and automatically generates schemas based on the exact fields and types present in the ingested data, so users can run SQL on their raw data. Athena requires the creation of schemas and tables before users can query the data, resulting in delays whenever the schema changes.


Speakers

Dhruba is the CTO and co-founder of Rockset. He was an engineer on the database team at Facebook, where he was the founding engineer of the RocksDB data store. Earlier at Yahoo, he was one of the founding engineers of the Hadoop Distributed File System.

Show Notes

Kevin:

Welcome everyone to today's tech talk.

Kevin:

My name is Kevin and I'll be the moderator for today. Our topic today is Real-Time Analytics on Data Lakes: Indexing Amazon S3 for up to 125x Faster Queries. As a reminder, we are recording this webinar and all participants will be muted for the duration. If you have questions, please send them to us through the Q&A portion of the GoToWebinar tool.

Kevin:

And also, as I mentioned earlier, we do have a Slack community channel. You will see the link to that on this intro slide. If we do not get to all the questions during our tech talk session, feel free to jump into the Slack community post-talk, and Dhruba will be there to answer any further questions on this or really anything related to his experience.

Kevin:

So, we are pleased to have Dhruba with us today. Dhruba Borthakur, the CTO and co-founder of Rockset. He was an engineer on the database team at Facebook where he was also the founding engineer of the RocksDB data store, and he was earlier at Yahoo where he was one of the founding engineers on the HDFS project, and he was also a contributor to Apache HBase. So, with that, I'll invite Dhruba to tell us more about real-time analytics on data lakes. Dhruba?

Dhruba Borthakur:

Sure, yeah. Thanks Kevin, thanks a lot. Hi folks. Thanks a lot for attending today's tech talk. We are going to be talking about data lakes, and specifically about how we can index data lakes, or index part of your data lake, to serve real-time analytics at scale. So, that's the topic of our presentation. What is the agenda? First we'll talk a little bit about the current state of data lakes, how people use them right now, and what trends we see.

Dhruba Borthakur:

And then I will introduce you to how you can serve real-time applications on parts of your data lake. Then I'll talk a little bit about some of the reasons or some of the technical differentiators why these things have become a possibility now versus earlier. We'll talk about indexing, concurrency, mutability and things like those. Then again we'll take a step back and talk about how all of these things help your developers to ship their applications faster, powered by data that you might have on your data lake.

Dhruba Borthakur:

I also have a demo, so I'd like to spend a few minutes to show you a demo of how this is done. I have a set of slides obviously, but again my goal is not to go through all the slides. My goal is to help answer any questions you have. So, please do use the chat button and send me questions as and when they occur to you. I will pause at certain points of the talk and then try to address some of the questions that you might have. It's the questions that make the talk interesting, so I'm hoping that I have a lot of questions or comments or feedback from you on the chat button when I'm presenting.

Dhruba Borthakur:

Cool, okay. Let's take a quick peek at the current state of affairs as far as data lakes are concerned, cloud data lakes in particular.

Dhruba Borthakur:

The data lake: you probably already have one in your enterprise, or you're thinking of setting one up. It's a place where you deposit a lot of data. The data could be in raw format, sometimes you have processed data as well, but you have data of different types: JSON, CSVs, maybe structured data, XML, videos, whatever comes into your enterprise. Usually, data lakes are very large, so size matters: you need to be able to put a lot of data in there. And over time, you'll probably try to leverage some of that data to run analytics for different purposes, to make your enterprise better.

Dhruba Borthakur:

So, what are the different areas where you typically use this data to make life better for your enterprise? Those areas can be categorized into a few segments. We'll look at some research from 451 Research, where the majority, or a large section, of data lake users are strategic, which means you might use the data lake to make your business better, either through trend analysis or by figuring out what strategic things to invest in. These are long-term or medium-term goals where data processing and query latencies might not be that important. But the next segment, which is growing quite a lot, is tactical.

Dhruba Borthakur:

So, tactical and operational uses are becoming more of a trend in the recent past, where you have a lot of data, but you also want to use the data not just for making business decisions that affect your business six months from now, but for making decisions right now. This research is very interesting because you can see the trends: there's a lot of usage of data lakes for tactical decision making, and some of it is now operational decision making, which is what the majority of this talk is going to focus on.

Dhruba Borthakur:

And again, some of you might have data lakes on-prem, some of you might have them in the cloud. But if you look at current trends as far as data lakes are concerned, there's research from a well-reputed company showing that cloud data lakes have already far exceeded the on-prem data lakes out there, for a few reasons why people prefer the cloud. First of all, things are cheap and easy to use. And the primary choice for a cloud data lake is obviously S3. This is the basis for some of our discussion today: you have a cloud data lake, or you want to create one, in Amazon S3. You might have picked it because it's highly reliable; it offers eleven nines of durability. You pay as you use, and you don't have to do any capacity planning. So, Amazon S3 is the natural choice for setting up your data lake.

Dhruba Borthakur:

But S3 is just the storage, right? It's the place where you deposit all your data. The real question is how do you get value out of this data? How do you make sense of it? You need a query engine to query all this data, and so you might be using Athena. A lot of people use Athena to query their data lake. You might be using other mechanisms as well, but this is one of the standard ones, which is why I picked it. Athena speaks very standard SQL, which means some of your developers don't need to learn a new language. It's powered by Presto, which is open-source software. And it's completely serverless, so it's a great tool for ad hoc queries on your large data lakes.

Dhruba Borthakur:

But if you're building applications on a data lake, then let us try to figure out what are the pros and cons of using Athena, or do you need more things from Athena to be able to run your real-time applications on data lakes?

Dhruba Borthakur:

You might be using the data lake or Athena for making occasional queries, but for building applications you need a continuous stream of queries that are highly concurrent in nature. It's not five data scientists making queries on your data lake; it's an application making automatic queries on your data lake, some other digital system or piece of software. So, you need highly concurrent queries that are continuous in nature, with no downtime on the system.

Dhruba Borthakur:

Some examples of these kinds of applications: say you are doing an A/B experiment on behavioral data. You are a retailer, you have an online website, and you get click logs from your active users. And you want to run an experiment, "Which photo of this car should I show to this user?" And run hundreds of experiments and figure out which is the best photo to show for a particular item.

Dhruba Borthakur:

Similarly, personalization. If you run a game and you want to show players personalized things to buy inside the game, you need a lot of personalization happening instantly to be able to show those kinds of recommendations.

Dhruba Borthakur:

You need a query system with very low-latency and highly concurrent queries. These are the two things your real-time application needs if you want to power it from data lake data. One solution that I am going to propose here, obviously, is Rockset, but I'm not just proposing it; I am going to validate the reasons why Rockset is suitable for this. And I'm also going to compare it with Athena and bring out the pros and cons, the differences and similarities, of the two systems.

Dhruba Borthakur:

Just a small digression to give an introduction to the technical capabilities of Rockset and why they're useful.

Dhruba Borthakur:

It's a real-time indexing database, which means you can keep your data in your data lake but create a real-time index on a subset of your data in S3. It has automatic connectors to S3, which means that if you deposit data into your S3 data lake, it automatically gets indexed in Rockset. You don't have to manage any ETL pipelines. You don't have to set up any processes or applications; it's automatic for you. You don't have to create a schema. And it's purpose-built to serve apps, which means it has high-concurrency support.

Dhruba Borthakur:

I will deep-dive into each of these and validate these claims and why they matter for building your apps. But before that, let me take a small digression to show you one or two use cases where people are using Rockset to build applications on data that is in S3.

Dhruba Borthakur:

So, these use cases are also in our website, so you can very easily find them there. We have written extensively about these use cases, so I'm going to take only two of those for today because these two ...

Dhruba Borthakur:

Let's talk about the first one. Matter. It is a financial company for sustainable investing, which means they're in the financial business. They invest in different assets and different funds. What they do is take a stream of data coming in from various sources: news feeds, fund reports, peer reports, many kinds of sources. I think they have some 20,000 feeds coming in every day, and on those feeds they want to do sentiment analysis and trend analysis. Then, based on that analysis, they want to figure out or recommend what investments to make for their clients. Now, think for example there's a scandal at a company and they want to find it immediately, the moment it happens, so that they can change their investment decision. They can't wait a day to change an investment when a news report about this company is coming out.

Dhruba Borthakur:

So, this is how they use Rockset, because data latency is very important to them. What is data latency? Data latency is the time from when data arrives in the system until you can start making queries or decisions on it. So, although the data is being deposited into their data lake, it is very important for this user to be able to query the data as soon as it lands on the data lake, and then make decisions based on it. They used to use Athena, and I think now with Rockset they see so much better performance that they can deliver a lot more applications, and more developer-focused applications, because they don't have to worry about maintaining pipelines, maintaining cleanups, or creating schemas the way you typically would. You'll find more details about this on our website.

Dhruba Borthakur:

The other one, which is similar but also different, is eGoGames. It's a European eSports platform for mobile devices. What they do is take mobile games and convert them into an eSport, either by having people compete with one another, or just improving their skills with gifts and prizes, or even having people compete professionally for money. So, they have a business model where the focus is online games.

Dhruba Borthakur:

Now, to understand what their users are doing, they need a complex decision-making system that can process a lot of data and get complex analytics out of it, and they want to do it immediately. Think about it: they are using DynamoDB as their transactional database, which is where they store user moves and player information. But there are a lot of analytics their users need. One of them is a gaming leaderboard, so they can build things like that using Rockset SQL.

Dhruba Borthakur:

They also have a requirement that competitions be fair, so they need to do a lot of analysis, using complex SQL on data streaming in from the players' moves, to make sure that each of the running competitions is fair. They also have acquisition and attribution data from third-party systems like Adjust, CleverTap and others. All of it gets dumped into S3, and then they use Rockset to feed it into their rule engines to detect fraud and power many other real-time analytics applications they have built on top of this dataset.

Dhruba Borthakur:

So, these are very clear examples of how all of the data is in the data lake, and you don't have to wait many minutes or hours to make decisions on it. You can use Rockset and immediately start powering your applications alongside your data lake.

Dhruba Borthakur:

Now, if you take a step back from these two use cases, let us try to peel out the main considerations that these applications have and the primary reasons why they started to use Rockset, and what is the difference of Rockset versus using Athena or any other system when querying this data.

Dhruba Borthakur:

The first one obviously is query latency. If you look at Athena and Rockset, now you see two columns on your screen? The left one is about Athena, the right one is about Rockset.

Dhruba Borthakur:

There are a lot of similarities between Athena and Rockset. Both of them are SQL systems, which means you can make SQL queries including joins, GROUP BYs and ORDER BYs. But Athena is very much an ad hoc querying system; you might be able to make ad hoc queries, but then the question is about concurrency and selectivity on large datasets. If you have a query that needs to find 10 records out of a dataset of a billion records, there are certain limitations of Athena, which I'll explain on the next slide. Rockset, on the other hand, uses an index to get you those results very quickly.

Dhruba Borthakur:

The technical differentiator between Athena and Rockset is that Athena mostly uses data in a columnar format. In your S3 data lake it's likely that you have data in Parquet or some other compressed columnar storage format. Rockset, in contrast, builds an index on top of your data lake; the index has a row store, a column store, and an inverted index, and queries automatically use this indexing scheme to give you low-latency queries on large datasets.

Dhruba Borthakur:

So, it doesn't matter what the selectivity of the query is; query latency is what Rockset is built for. The first primary characteristic Rockset focuses on is improving query latency. The architectural difference essentially comes from the last row on your screen. Athena is essentially a parallelize-and-scan system, which basically means you parallelize everything and scan each piece. Let me go to the next slide, which gives you a better picture of this.

Dhruba Borthakur:

So, what happens is that when you have event data coming in, you can take a parallelize-and-scan approach. MapReduce made this popular: you spin up a lot of CPUs, and each CPU scans its own block of data or its own set of files and gives you query results. This is ideal for reports where you have to go through a lot of data and produce results periodically. For real-time analytics it's not ideal, because the queries are very different in nature and many queries arrive at the same time. So, for real-time analytics on the same event data, you need something more like an indexing-based system, which we call converged indexing.

Dhruba Borthakur:

Converged indexing is again used essentially to give low-latency queries on your data lakes. You don't have to think of data lakes as the slow beast some people imagine: "Well, a data lake means you dump a lot of data, but then querying is a little bit painful." With Rockset, we make queries easy and fast on these data lakes, because we use something called converged indexing.

Dhruba Borthakur:

So, why does converged indexing give you lower latency? For that, we have to peel another layer of this technology and talk about what converged indexing is.

Dhruba Borthakur:

In this example, the two records coming into your database are document zero and document one, each with only one field, just for simplicity. In real life, these documents could be very complex, because it's arbitrary JSON.

Dhruba Borthakur:

What converged indexing does is take each JSON or semi-structured document, shred it into its individual keys and values, and store them in a key-value store. That's what you see on the right side of your screen: a key and a value.

Dhruba Borthakur:

The key-value store keeps all the keys in sorted order. So, finding a key in this key-value store is relatively easy, because it uses the index of the key-value store.

Dhruba Borthakur:

Converged indexing creates three different types of keys for every record. It creates row keys for the row store; in the first two entries, those are the keys that start with R. This is very similar to storing your data in, say, a Postgres or a MongoDB cluster: a row store, where every record is stored in row format. So, you can find all the fields inside a document very quickly when you use the row key.

Dhruba Borthakur:

Now, converged indexing also stores the data in the column store, which means the values of each column are stored together across records. So, if you're looking at the average age of people, you can just scan the C keys, C.age let's say, and get quick results. This is very similar to how data is stored in a warehouse like Redshift or Snowflake or whatever other warehouse you might be using.

Dhruba Borthakur:

The last one is the inverted index, where the keys start with S. Now, if there's a query like "find me all records where name equals Dhruba," it's one key lookup in the index of the key-value store: it finds that the document ID is 1 and gives you the result. It's very much like an inverted index in Lucene or Elasticsearch. So, converged indexing builds all these indexes for you, and you don't have to configure or build them, or spread your data lake data across three different systems. You don't have to copy some of it into Elasticsearch and some of it into Redshift to make queries. You can make all these different types of queries on the same system with Rockset.

Dhruba Borthakur:

To take an example of two different SQL queries coming in, the left one and the right one: the Rockset system automatically figures out that when I'm looking for a keyword, say HBTS, which is the name of a conference, the query is very selective, meaning it might match only, say, a hundred records out of your 10-terabyte dataset. So, it automatically uses the inverted index to give you the results. The right-side query is more of a report: "How many times is each conference mentioned in my terabyte-sized database?" That needs scanning and counting, so it automatically uses the columnar store, very much like Redshift, and gives you results for each keyword you're interested in.
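To make that concrete, here is a rough sketch of the two query shapes (the collection name "talks" and the "keyword" field are hypothetical, not taken from the slides):

```sql
-- Selective lookup: only a handful of matching records, so the optimizer
-- can answer it with a few key lookups in the inverted index.
SELECT *
FROM talks
WHERE keyword = 'HBTS';

-- Aggregation over the whole collection: it touches every record's keyword,
-- so a scan of the column store is the cheaper plan.
SELECT keyword, COUNT(*) AS mentions
FROM talks
GROUP BY keyword;
```

The point is that the same collection serves both shapes; the optimizer picks the right index per query.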

Dhruba Borthakur:

So, this happens automatically for you. With converged indexing, you as a user don't really need to configure this one way or the other.

Dhruba Borthakur:

So, query latency is definitely one consideration for real-time applications. The other one is concurrency. Again, I'm going to compare Athena and Rockset a little bit; I think we've mostly covered it. Athena is a great system, and SQL works well. The difference comes when you're talking about a high-concurrency system versus a low-concurrency system.

Dhruba Borthakur:

With Athena, you might have five, 10 or maybe 20 users on the system, but when you go to thousands of queries per second it's a system that generally won't work for you, because the cost also becomes very high. The reason, again, is that it parallelizes and scans, so every query burns a lot of CPU scanning the same dataset over and over when you're at a thousand QPS. With an indexing system like Rockset, the QPS can be very high, because the amount of data it needs to touch to respond to a query is far less than with a parallelize-and-scan approach.

Dhruba Borthakur:

Now, we've looked at query latency and concurrency. How can we put these things together and build an architecture so you can actually query data in your data lake for real-time applications?

Dhruba Borthakur:

Again, the focus is to give low-latency queries on data lake data. In this picture, we use something called the ALT architecture: the Aggregator Leaf Tailer architecture.

Dhruba Borthakur:

So, this architecture has three main components. The first is the tailers, which take data from the data lake; you could also have data coming from data streams like Kafka or from transactional databases. So, the tailer component is separate. The tailers transform data from all these formats into an indexed format, and the indexed format resides on nodes called leaf nodes, which store all the indexes. The aggregators on the right side are used for queries: when a query arrives, the aggregators talk to one another and figure out the best way to run the query on this indexed data.

Dhruba Borthakur:

Now, the interesting thing is that we have separated compute and storage. The compute is your tailers and aggregators, and the storage is the leaf nodes, which store the index.

Dhruba Borthakur:

This is very similar to some cloud warehouses you might be using, which separate compute and storage. But we have actually taken it a step further, because we want to focus on real-time applications and not BI and dashboards. For real-time applications, we need to separate ingest compute from query compute. These two things have to be separate for a real-time database.

Dhruba Borthakur:

So, you will see that when a burst of new writes happens, the tailers are the CPUs that scale up and down, and there is no impact on query latency and performance on the aggregators, because they are separated out.

Dhruba Borthakur:

Similarly, when there's a burst of queries, you might need to spin up more aggregators, but that has no impact on your data latency as data streams in from your data lakes into the index. This is critical for serving low-latency applications on these data lakes. If you use a warehouse, this is very difficult to do, because you cannot do writes and reads at the same time; typically with a warehouse you might need to batch some of the data and upload it periodically. But Rockset and the ALT architecture let you do both things at once, supporting a high write rate as well as a high query rate at the same time.

Dhruba Borthakur:

This is an architecture that Rockset did not invent. It is an architecture that is quite prevalent in many web companies. We derive a lot of inspiration from systems that are running at Facebook and LinkedIn and other companies where they use a very similar architecture to power a lot of their real-time applications like ad placement or fraud detection or any of those systems.

Dhruba Borthakur:

The salient point of that architecture is that the data is mutable in your system. It's a database, which means that when writes are happening you can make queries at the same time.

Dhruba Borthakur:

So, let us draw the difference again between Athena and Rockset as far as this characteristic is concerned.

Dhruba Borthakur:

In Athena, let's say you have a schema change: your data is changing and new fields are appearing. With Athena, you might have configured it to use, let's say, the Hive metastore or maybe the Glue metastore or some other metastore, so you'd have to make changes there to tell Athena, "Hey look, there is another field that you might want to process." Rockset, on the other hand, automatically processes everything that's coming in, including schema changes. It's a very flexible schema; it indexes the schema as well as the data of your incoming dataset and creates the index without you needing to do anything.

Dhruba Borthakur:

Also, if you're running Athena on S3, then because S3 objects are not mutable, you can create new objects and overwrite whole objects in S3, which essentially means you can insert data into your dataset, but you can't really update specific fields of your data. Rockset, in contrast, is a completely mutable database, so you can update every field in your database very quickly. It's set up so that you don't have to re-index the entire document when you need to update it.

Dhruba Borthakur:

For schema changes, Rockset supports strong dynamic typing. In this example, you see five different records coming in. The age could be integers, could be strings; everything is accepted by Rockset. It's not like we do a type check and reject records that don't match the schema. On the other hand, the queries are strongly typed: if you make a query and the types don't match, it gives an error. This is really helpful for developers, because as a programmer it's always good to find a bug at compile time rather than at run time. So, having a strong schema at query time is super-useful for developers, and Rockset automatically creates the schema for you. It's not that there is no schema; there always is a schema, but you don't have to specify it. The system creates it for you and shows it to you.
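As a rough sketch of what that strong typing feels like in practice (the "people" collection and the TRY_CAST-style coercion are illustrative assumptions, not taken from the talk):

```sql
-- "age" arrived as both integers and strings; schemaless ingestion accepted
-- all of it. A strongly typed aggregation over the mixed field can then
-- surface a type error at query time instead of silently misbehaving:
SELECT AVG(age) FROM people;

-- One hedged way to tolerate the mix is an explicit, forgiving cast,
-- assuming a TRY_CAST-style function that yields NULL on a failed cast:
SELECT AVG(TRY_CAST(age AS int)) FROM people;
```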

Dhruba Borthakur:

With Rockset on your data lake you'd probably get good concurrency and good query latency. But some people might ask, "Hey, what if my data lake is very big? Do I need to index all the fields in my data lake using Rockset?" The answer is no, you don't need to take everything. Rockset has something called field mappings, so you can selectively pick fields to ignore. For an incoming document you can say, "I want to index only these fields in my dataset," and ignore the other ones. You can also set retention, which means Rockset will automatically drop the index for data that is more than a few days or a few months old.
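As an illustration only: field mappings are configured when you create the collection, and the SQL-flavored projection below is a hypothetical sketch of the idea (the _input name and the field names are assumptions), not the console's actual syntax:

```sql
-- Hypothetical ingest-time projection: keep and index only the fields we
-- care about from each incoming document, ignoring everything else.
SELECT
    t.text,                -- tweet text (assumed field name)
    t.entities.symbols,    -- stock symbols (assumed field name)
    t._event_time
FROM _input t;             -- _input: the stream of incoming documents
```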

Dhruba Borthakur:

So, again, the summary is that Rockset is built to support high ingestion rates as well as fast queries, and we also have a benchmark that shows the scale at which you can do fast ingestion into Rockset.

Dhruba Borthakur:

So, typically the data going into your data lake is coming from, let's say, a Kafka stream, because it's streaming in at high volume. We recently open-sourced a benchmark called RockBench; you can look at it or make additions to it. What it does is measure, when data is streaming into your data lake from a Kafka stream, the latency between when the data is available in Kafka and when it is queryable in the Rockset index.

Dhruba Borthakur:

If you go to the right-most side of these bar diagrams, you'll see that at 20,000 writes per second, we see a data latency of around two seconds. This is when you use a Rockset 4XLarge instance, on the left side. If you bump the instance up to 8XLarge, the next virtual instance size that Rockset has, you'll see these data latencies go down by almost 50%. So, it's a highly scalable system: if you upgrade to bigger virtual instances, you'll see this latency go even lower. And these are applications where data latency is super-important.

Dhruba Borthakur:

So far we have talked about the technical side of Rockset and why it's useful for powering low-latency applications. But there is another, completely different side, which is developer productivity.

Dhruba Borthakur:

Usually, you'll find that a lot of application developers find it hard to consume data from a data lake: first of all there's a schema associated with it, some people might not be familiar with SQL, and some might not know how their data lake scales. Application developers have a very difficult time accessing data from data lakes. They end up talking to data engineers and saying, "Hey, give me this dataset, and I'm going to work on this other dataset that you curated for me." To solve this problem, Rockset has something called Query Lambdas. This is one of the most used features in Rockset, and it lets you create a developer API for applications that can leverage data in your data lake directly, so that your application developers can build applications on your data lake.

Dhruba Borthakur:

So, what is a Query Lambda? It's a SQL query that you can configure, test and store inside Rockset. That Query Lambda exposes a REST endpoint, and you can hit it with an HTTP call or any other REST client. So, your applications don't have to worry about having SQL embedded in them; they can make a REST API call and access the results of the SQL query. When you invoke the HTTP request, the system automatically runs the Query Lambda for you, like a serverless function. It gives you results, and you have complete control of the SQL. You can use Query Lambdas as the building blocks of a developer platform API where you are running, let's say, experiments: set up five different Query Lambdas for five different experiments, and then over time refine each of those queries without having to change your application.
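As a sketch, the SQL you save as a Query Lambda might look like the following (the collection, fields, and the :symbol parameter are hypothetical; the colon-prefixed style follows Rockset's named-parameter convention):

```sql
-- Saved, versioned SQL behind a REST endpoint. The application passes
-- :symbol in the HTTP request instead of embedding any SQL itself.
SELECT COUNT(*) AS mentions
FROM "twitter_kafka" t, UNNEST(t.entities.symbols) AS s
WHERE s.text = :symbol
  AND t._event_time > CURRENT_TIMESTAMP() - INTERVAL 1 DAY;
```

An application would then invoke the endpoint over HTTP and get back JSON results, so refining the SQL never requires an application change.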

Dhruba Borthakur:

Also, your application doesn't have to link in any new libraries; you can use plain HTTP to retrieve data from your data lake using Query Lambdas.

Dhruba Borthakur:

This is a useful feature for a lot of our users and they like it a lot.

Dhruba Borthakur:

I have a demo that shows some of these things, but are there any questions anybody has in the meantime when I switch to a demo?

Kevin:

Yeah. If you have any questions, again, feel free to put them in the tool, the GoToWebinar tool. There should be a section called questions, you put them in there and then we'll handle those during the Q&A session. You can go ahead, Dhruba. Yeah.

Dhruba Borthakur:

Yeah, sure, thank you. Kevin, can you see my demo screen now?

Kevin:

Yeah, I can see the Rockset console, yes.

Dhruba Borthakur:

Okay. That's good, yeah. This is the Rockset console; go to console.rockset.com. This is the UI where you manage Rockset indexes on your data lakes. Rockset has integrations with many of the systems you see here, and obviously S3, your data lake, is one of them. It also has a write API, which means you can write data directly to Rockset. This is useful when you want to tag your data with certain new fields as part of your refinement process; you can directly add new fields to Rockset using the write API.

Dhruba Borthakur:

In today's demo, what we are going to show you is how you can build a recommendation system, or a decision-making system, using data that is streaming in from, let's say, Kafka. Say your data is coming in from Kafka and being deposited into your data lake in S3. You want to tap into that Kafka source and build indexes very quickly so that you can serve your applications directly from the data streaming into your data lake.

Dhruba Borthakur:

In this example, the recommendation system we're building looks at all the tweets people are posting right now, a live data stream, and finds which stock symbols people are tweeting about the most in the last five minutes, the last 10 minutes, or the last one minute, for example, right?

Dhruba Borthakur:

The way to do this is that we tap into the Kafka source as and when it is depositing data into the data lake. I have already created an integration, which basically gives me permission to access a Kafka stream. This is the Kafka stream on which Twitter is writing data, the most recent tweets. And I say start. So, I'm going to create a collection; let's say I name it "Twitter Kafka." Then I select the Kafka topic where the data is being deposited. The moment I select the Kafka topic, the system automatically samples a few documents in the topic and shows me some of the tweets coming in. This is just validation for me: "Is this the data I'm really looking at?" Once I validate it, I can add field mappings if I want, and then I say "Create collection."

Dhruba Borthakur:

The moment I create a collection, the collection gets created and I can see it in my list of collections. This is the collection I created, well, a few days back, and it has accumulated 180 gigabytes of data. It's not a very big collection, it's mostly for demo. It has 23 million documents and it's updating as we go, because new tweets are happening. It also shows you the size of the documents and what fields are available in them. So, this schema is automatically created for you. You can see there's a created time, the entities, the hashtags. And there's a whole lot of complex JSON in this document.

Dhruba Borthakur:

Now, let us try to build ... Until now, I haven't specified a schema. Right? I just created a collection from a Kafka stream. The only thing I have to specify is the name of the Kafka topic.

Dhruba Borthakur:

So, now let me go to the query editor and then sample a few documents from the stream and see what it is, right?

Dhruba Borthakur:

I run a select query, and it shows me some of these documents. If I look at the JSON, you will see that this is a very complex JSON. It has nested fields, nested objects, and nested arrays, arrays inside objects. This is the real-life JSON document that Twitter is producing.

Dhruba Borthakur:

I know that I need to look at the symbol, the entity called symbol. This is what holds the stock symbol people are tweeting about. So, I'm going to select only that column out of all the columns I see here. If I run this query, it shows me the name, the text and the symbol. A lot of records have no symbols, because those people aren't tweeting about a stock symbol. So, I need to filter out all the ones with no values in them. This is again standard SQL: I say the entity symbol is not null, which automatically filters those out. And you also see that it shows the time taken: 11 milliseconds on that 300-gig collection. So, now I have only the tweets that are talking about some symbols. I can recognize some of these here.

Dhruba Borthakur:

Now, let us look at only the tweets from the last one day. This is where I use the current time minus an interval of one day, instead of looking at all the old tweets. Now I can very easily see only the topics or symbols people tweeted about in the last day. Again, this is all standard SQL. Rockset has this field called underscore event time, which also makes it easy to use as a time-series database, because the entire dataset is sorted based on event time. So, it's very fast to do range queries on event time.

Dhruba Borthakur:

Now, I don't really know what AAPL and TSLA and all these symbols are, so I have loaded a third-party dataset that maps these symbols to the name of a company. So, that's what I'll use next.

Dhruba Borthakur:

I also want to print counts: how many times people are talking about AAPL and TSLA. So, I do a GROUP BY on the symbol and a COUNT, and I want the top ones, which is why I order by count descending. If I do this, I see that people are talking about TSLA and AAPL this many times in the last day, for this Twitter feed.

Dhruba Borthakur:

Then I have to join it. I'm going to show you a join on Tickers. Tickers is a third-party dataset that we loaded from an outside source, which maps TSLA and AAPL to the names of the companies, and I do a join as part of this query. When I run it, the symbols expand to Tesla and Apple, and I can figure out, "Oh, these are the top symbols people are tweeting about in the last day." So, just imagine how easy it is to take a Kafka stream and build a recommendation system, or a decision-making system, that can give you the top tweets people are talking about in the last day. This is really powerful if you can index, make these queries at very fast rates, and get analytics out of them to power a lot of your online applications.
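Putting those steps together, the final demo query looked roughly like this (the collection, field, and ticker-table names are approximations of what was shown on screen, not an exact copy):

```sql
-- Top companies mentioned on Twitter in the last day: drop tweets with no
-- stock symbol, restrict to the last day by event time, count mentions per
-- symbol, and join against the Tickers dataset to expand symbol -> company.
SELECT tk.company AS company, COUNT(*) AS mentions
FROM "twitter_kafka" t, UNNEST(t.entities.symbols) AS s
  JOIN tickers tk ON tk.symbol = s.text
WHERE s.text IS NOT NULL
  AND t._event_time > CURRENT_TIMESTAMP() - INTERVAL 1 DAY
GROUP BY tk.company
ORDER BY mentions DESC
LIMIT 10;
```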

Dhruba Borthakur:

Oh yeah, so let's talk about just a summary of our talk.

Dhruba Borthakur:

Again, the focus for building applications on S3 is that we need to pick a serving system that can do low-latency queries and also highly concurrent queries. And low-latency queries irrespective of the selectivity of the data, right? We want low latencies. We definitely want to improve developer velocity, because it's the developers who actually want to use data from their data lake.

Dhruba Borthakur:

And then I have a customer quote here, which is one of my favorite quotes from somebody using Rockset. He says that you could probably cobble together a lot of systems to get something similar: you could set up Elasticsearch, you could set up Redshift, you could probably set up yet another system on the side, and use a combination of these to get close to what Rockset gives you. But with Rockset, you get these things out of the box and you can ship your application really fast. The focus is on enabling companies to get to the stuff they're trying to build very quickly, and time to market is a really important metric they're trying to optimize for.

Dhruba Borthakur:

That's the talk for today. I hope the demo was interesting. I am going to hand over the mic to Kevin. Kevin, do you want to take over?

Kevin:

Yeah, I would be happy to. Thanks a lot, Dhruba. That was a really interesting conversation about how Rockset indexes data in the data lake, and the key differentiators that you called out. This would be a great time, everyone, if you have questions, to stick them into the webinar console and then we will answer them. As you're doing that, I should point out that we do have a standing offer. If you've heard what Dhruba talked about today and you're interested in getting your hands dirty, not with the demo dataset but with your own data and your own SQL queries, that's great. You can do that.

Kevin:

Rockset's a cloud service; all you need to do is type in your email address and you get an account right away, and you can get started with $300 in free credits to try out Rockset any way you want. So, we have a link for you to go ahead and create your account if you want to do that. You can also just click the button on our website, rockset.com. Apart from that, if we don't get to all your questions, feel free to join the community channel on Slack; the link is listed on this slide as well. You can ask follow-up questions there 10, 15 minutes after we finish here. We do have questions and we will try to get to those. Dhruba, are you ready?

Dhruba Borthakur:

Sure, yeah. Please ask.

Kevin:

Okay. Yeah, we have some great questions coming in from the audience here. The first one, "How do you handle complex input documents? For example, things that are nested on multiple levels? This would create an explosion of keys in Rockset indexes," is the observation there. So, how does Rockset handle those things?

Dhruba Borthakur:

I see, yeah. That's a good question. Rockset takes JSON or semi-structured documents as input, so you can write NoSQL data on one side and get SQL on the other side. Right? So, with NoSQL data you are absolutely right, there could be a lot of nested fields and arrays, deeply nested. And the size of the index, I think, is what this user is concerned about. So, yeah, the size of the index is something that Rockset tries to minimize by doing things like compression. We use a lot of compression technologies; we do delta encoding on the keys. So, there are certain techniques to reduce the size of the index, but yes, the index could be bigger in size than your columnar store, that's true.

Dhruba Borthakur:

But the other thing we do is that Rockset also separates durability from performance, which means the data, the index itself, is actually stored in S3. When it is serving data, it needs to keep only one replica of the data in memory, unlike, let's say, Elasticsearch, where you might need to make two copies just to keep the data durable so you don't lose it.

Dhruba Borthakur:

So, the fact that it separates durability from performance automatically means we can keep only one copy of the data in hot storage when responding to queries. That's one thing.

Dhruba Borthakur:

The third thing is that we can page data in and out of S3 when needed. So, building a large index is okay as long as the entire index doesn't need to be resident in your in-memory systems when you're responding to queries. You can let some of the index lie cold in S3 and leverage only the hot storage you have to respond to queries.

Dhruba Borthakur:

But yeah, that's a good question. The size of the data when you index it can be more than the columnar store, but there are a lot of ways that Rockset optimizes this for you, so the cost of your application is not really the storage; the cost is really about how quickly it can give you low-latency queries on large datasets. The cost probably sits more on the compute side of things, where you need high-QPS systems to serve all this data.

Dhruba Borthakur:

Does that answer your question? I hope it answers his question.

Kevin:

Yeah, that's a great response. I think there were two parts to that question. The first part was about the nested documents and I think this question came in before the demo. But we did see it in the demo, the Twitter dataset was quite heavily-nested, right?

Dhruba Borthakur:

Right. Yeah. Twitter is very nested.

Kevin:

Totally possible and Rockset will index everything, every field in that?

Dhruba Borthakur:

Correct, yes.

Kevin:

Okay. Well, since we're talking about the demo, I think you mentioned Query Lambdas, but are we able to see how you do that in the demo?

Dhruba Borthakur:

Yeah, sure. I can show you that one, definitely. Let me go back to my screen.

Kevin:

Yeah, because that was something you touched on in increasing developer velocity and so on.

Dhruba Borthakur:

Yeah, no, absolutely. Can you see my screen now?

Kevin:

I can but I can't see the bottom of your screen. [crosstalk 00:49:02].

Dhruba Borthakur:

Yeah, I know. Let me try to fix whatever this problem is. Okay, now is it better?

Kevin:

Yeah.

Dhruba Borthakur:

Okay. Yeah, you can create a Query Lambda. Let's say you have this query. I'm just going to take one simple query out of this; I'm going to remove all these other things and keep only this query, which is what I ran to get the answers. Then I can say create the Query Lambda. So, what you can do is point and click again.

Dhruba Borthakur:

So, you can type in a name here, a description of the Query Lambda, and say "My Query Lambda" or something like that. And you can say create the Query Lambda. That's the only thing you need to do to create one. So, this is an ID associated with the Query Lambda, and this is the SQL that I used to create it. Then if you go down here, it shows you how to use the Query Lambda. I'll go down a little bit more. This is the REST endpoint, and you can make this curl query: the curl command you see here, you can cut and paste into a terminal window and run. It'll give you the results of the query. And obviously, because it's curl, you can call it from more than just a terminal: Grafana or whatever else is out there.

Dhruba Borthakur:

It's as simple as clicking a button saying I want to create a Query Lambda, and the SQL is saved for you. You can parameterize the SQL if you want, with parameters for the Query Lambda. And you can also version it, which means that, let's say you change some field in the SQL query and you want to deploy the newest version, you can tell your application, "Hey, now I have a new version of the query, go use that version of the Query Lambda."

Dhruba Borthakur:

Again, the focus is developer velocity, how can you make developers run this without having to think about SQL or embedding SQL into the application code.

Kevin:

Okay. Hey, thanks for showing us, Dhruba.

Dhruba Borthakur:

Yeah.

Kevin:

Keep those questions coming. We have definitely a few more we want to get through. Next question for you, Dhruba.

Dhruba Borthakur:

Yes.

Kevin:

Do you have to index a field for it to be pulled into a query, or do you only have to index the fields that you are using in the where clause?

Dhruba Borthakur:

Oh, I see. Okay. Great question. What happens is that with Rockset, by default, everything coming into your document is stored and indexed. So, if you want to use a field in a select statement, that field is already indexed, and if you want to use a field in the where clause, those fields are indexed as well. So basically, yes: any field in a document that you write to Rockset can appear in the select clause or in the where clause. Can you repeat that question again, Kevin, just in case I missed something?

Kevin:

Yeah. Do you have to index a field for it to be pulled into a query, or do you only have to index the fields that you're using in the where clause? I think your answer is that everything has to be indexed, but just to reiterate, the user doesn't have to mentally do anything, right?

Dhruba Borthakur:

Yeah, exactly. Correct. Exactly. There is no manual stuff. You just write your existing document and you can query any field in your document, either in the select clause or in the where clause.

Kevin:

Because Rockset automatically indexes everything you're ingesting?

Dhruba Borthakur:

Yes, correct.

Kevin:

Right.

Dhruba Borthakur:

Exactly.

Kevin:

Cool. Okay, I might have time for one or two more here. But if you have more questions, keep them coming, and if we don't have time for them we'll bring them over into the Slack channel to answer them. So, another question for you, Dhruba. "What if you want to do time-range queries on a field inside the Kafka message?" You showed the Twitter data coming from Kafka. "Not on the ingestion time but on other timestamps inside the message?"

Dhruba Borthakur:

Got it. Yeah, that's a good question. What happens is that every field is indexed in Rockset, which means that for most queries on a given field, if the query is very selective, the inverted index really helps, right? Because it's going to use the inverted index to give you query results.

Dhruba Borthakur:

Take for example a field that's an integer: the inverted index automatically creates a range index on it, and it can serve range queries really fast. But if you know upfront that something is a timestamp in your document, let's say the incoming document has a field called TS that is your timestamp, then it's great if you can feed that information to Rockset, because Rockset has something called event time.

Dhruba Borthakur:

It's a special field. All the underscore fields are special to Rockset, just like Rockset has an underscore ID to identify each document. Similarly, there is a field called underscore event time in Rockset. At the time of creating the collection, you can create a field mapping which says that the field named TS in my document needs to be mapped to the field called underscore event time in Rockset. So now, instead of Rockset stamping the event time when the document is inserted, Rockset will take the timestamp from that specified field in your document and use it as the event time in Rockset.

Dhruba Borthakur:

So, now everything will be sorted based on the event time, and you can make time-range queries really fast with this approach. Essentially, do this mapping, it's called a field mapping, at the time of creating the collection, and then your queries on the timestamp will be very fast for you.
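A hypothetical sketch of that mapping, written as a SQL-style projection over incoming documents (the _input name is an assumption; in the console this is configured as a field mapping when creating the collection):

```sql
-- Map the document's own "TS" field onto Rockset's special _event_time, so
-- the collection is ordered by the event's real timestamp rather than the
-- ingestion time. Assumes TS already holds a timestamp; otherwise a cast
-- or conversion function would be needed here.
SELECT TS AS _event_time, * FROM _input;
```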

Kevin:

Okay. Do send any more questions you have and I will make sure we get to them in the community Slack channel, or we'll reach back out to you individually if you have further questions. But I think we are coming up on the end of the hour. Thank you all for your attention. Again, feel free to check out Rockset at rockset.com. We do have free trial credits if you want to avail yourself of them. And thank you very much, Dhruba, for sharing Rockset and the technical underpinnings of how it is built for apps on data lakes. Thank you.

Dhruba Borthakur:

Cool, yeah. Thank you. Thanks a lot, guys. Bye.

Kevin:

All right. Thanks for your attention and participating today. We'll see you again soon. Bye. 

