Reimagining Real Time Analytics in the Cloud

Rockset recently announced a new round of funding to bring further innovation to real-time analytics for application development. Hear from Rockset co-founders Venkat Venkataramani and Dhruba Borthakur on the trends they are seeing in the market, their vision for a real-time cloud data stack and how to build fast, scalable real-time apps on Rockset:

Modern data apps require real-time analytics - Across all industries, businesses are experiencing an increasing need to harness data in motion using apps for real-time personalization and recommendations, real-time supply chain logistics, gaming leaderboards, fraud detection, and more.
Real-time analytics requires both speed and scale - Most development teams find building for both speed and scale a big challenge. Solutions built for speed, like Elasticsearch and Druid, are hard to scale, while solutions built for scale cannot be retrofitted for speed.
Build scalable real-time apps in hours - Inspired by applications that have solved for speed and scale, such as Google Search and the Facebook News Feed, Rockset employs real-time indexing at its core. Learn how real-time indexing can be used alongside streaming platforms, like Kafka, and NoSQL databases, like MongoDB and DynamoDB, to build scalable real-time apps in hours instead of weeks.

Speakers

Venkat Venkataramani is the CEO and co-founder of Rockset. He was previously an Engineering Director in the Facebook infrastructure team responsible for all online data services that stored and served Facebook user data. Collectively, these systems worked across 5 geographies and served more than 5 billion queries a second. Prior to Facebook, Venkat worked on the Oracle Database.

Dhruba Borthakur is the CTO and co-founder of Rockset. He was an engineer on the database team at Facebook, where he was the founding engineer of the RocksDB data store. Earlier at Yahoo, he was one of the founding engineers of the Hadoop Distributed File System.

Show Notes

Kevin:

Hi, and welcome to today's Tech Talk. My name is Kevin and I'll be the moderator for today. Rockset recently raised its Series B funding, so we've invited Rockset founders Venkat Venkataramani and Dhruba Borthakur to share some of their perspectives with us today on this Tech Talk, Reimagining Real Time Analytics in the Cloud. So as a reminder, this webinar is being recorded and all participants will be muted. If you have a question, please send it our way through the Q&A portion of the GoToWebinar tool. So we have Venkat and Dhruba with us today. Venkat is the CEO and co-founder of Rockset. He was previously an engineering director in the Facebook infrastructure team, responsible for all online data services that stored and served Facebook user data. Collectively, the systems worked across five geographies and served more than five billion queries a second. And prior to Facebook, Venkat worked on the Oracle database.

Kevin:

Dhruba is CTO and co-founder of Rockset. He was an engineer on the database team at Facebook, where he was the founding engineer of the RocksDB data store. Earlier at Yahoo, he was one of the founding engineers of the Hadoop Distributed File System. He was also a contributor to the open source Apache HBase project. And with that, I'll invite Venkat to take it away. Venkat.

Venkat Venkataramani:

Thanks for the kind intro, Kevin. Can everyone hear me and see my screen?

Kevin:

Yeah.

Venkat Venkataramani:

Yes. Awesome. Thanks a lot. It's my pleasure to do this Tech Talk with all of you, thanks for attending. Thanks for taking your time. As Kevin pointed out, we've been on this journey for a few years now, and we just closed our Series B and we're very, very excited to get Rockset in the hands of more and more people and see what they're going to build. So as part of... In this presentation, Dhruba will go into a little bit of under-the-hood detail around how Rockset is built and how it is different, and I can't wait to hear Dhruba's presentation of that. I'll kick it off with a little bit of a higher level overview of what these kinds of modern data applications are that Rockset is really built for, and share a little bit of perspective on why we need a system like Rockset in this day and age to power these kinds of applications.

Venkat Venkataramani:

And then the third item in the agenda is when Dhruba will do a little bit of a deep dive into various parts and functionality of Rockset that allow it to provide the speed and scalability that you should take for granted with Rockset. Cool. Without further ado, let me talk about modern data applications. Rockset is really built for what we call these modern data applications. And the first question that I always hear is like, "What is a modern data application? What is this?" Let me just give some concrete examples so that everybody understands, and we're all on the same page. These are real world examples, they're not made up. The actual companies behind these had these exact problems.

Venkat Venkataramani:

So say you're building a real-time supply chain platform for logistics, for heavy construction, and say you need to track millions of these tickets that these trucks delivering concrete are using to track which part of the deliveries they have picked up, what stations they have crossed, and just to do real-time supply chain for these things. Say you started building this out on really serverless and scalable systems of record and transactional databases, like let's say, Amazon DynamoDB. But now, your platform and your application, the supply chain platform, is getting richer and richer, and there is a need for being able to search and do analytics in real time across these hundreds of millions of tickets that are getting updated constantly. So you start to look around: how do I build something that can, in real time, within a few seconds of lag, search through these hundreds of millions of records and do analytics on them?

Venkat Venkataramani:

You may be looking at something like say Elasticsearch at this point, and you look at the operational overhead, the work it requires to transfer all the data that is being maintained in DynamoDB in real time to be reflected in Elastic. There's a lot of custom engineering that needs to be built. And then Elastic itself is good at certain kinds of search queries, but there's a lot of joins and aggregations and other really complex analytics that you want on the data, and you start to really look around to see if there's a better way.

Venkat Venkataramani:

Another example, hedge funds: you already are using a modern cloud data warehouse, you're getting hundreds of terabytes of data every night, you munch it through all of your models and what have you to get a two to three terabyte data set that you want to put in the hands of every investor in your company. And they want to slice and dice this data ad hoc, live, while making investment decisions several hundred times a day. You build this application on top of a warehouse, every query takes somewhere between two to five seconds, so you can't really build an interactive application on this three terabyte, four terabyte data set without a lot of custom engineering to load the data into some kind of a serving tier, and then build custom indexes and all sorts of sorts, to be able to actually power that decision-making system that every investor is going to be using.

Venkat Venkataramani:

So you build it on top of let's say something like a warehouse, and then it's just too slow, right? Some of these interactions take five seconds, and there's like 10 to 20 queries that you need to run before you can render a web application interface. All of a sudden, it's taking minutes to load and nobody's able to actually leverage the intelligence that you have to make better investment decisions.

Venkat Venkataramani:

I can give you more examples: mobile games, a platform for mobile games, where it's very multiplayer... It's kind of a social online experience. And say you're building something like this, and you want to enhance your customer experience by doing more personalized experiences based on the prior history of what games a particular user likes to play, what kind of competitive level they like to play, and whatnot. Also, when there is a lot of... when there's money involved, then there's a lot of people involved, there's always a need for detecting and preventing fraud. So you are using all these systems, data is coming in from multiple different places, and in real time, you need to be able to manage your risk, prevent fraud, and do these kinds of things. So, again, you'll have to analyze this data in real time, because if these things are all happening offline, it is too late.

Venkat Venkataramani:

Another, this will be my last example. This is almost like a gaming leaderboard example. Except that in this particular company, what they do is they have a mobile app for tracking the steps that you're taking and rewarding users for achievements. So whenever anybody goes for a walk, they're playing a game. And any time they take any step, there is an update to a record in a backend database somewhere. On top of this high volume of writes coming in, they wanted to build a real-time leaderboard. They tried to do it on a single-node Postgres machine and things like that, a traditional way of building that app, and it quickly stopped scaling. It could not keep up with the amount of writes that were coming in, and it could not give you an interactive leaderboard experience where they wanted to show, across different timeframes, who's the leader for today, this week, this month, and also based on various geo locations, and do lots of filtering so that they have a very interactive, rich leaderboard experience.

Venkat Venkataramani:

So I'm just giving you a little bit of color about all these real-world examples today, and how these modern data applications are evolving in this day and age. And if you can see, what's really happening is that these applications are having to deal with a lot more data, in terms of volume. In terms of velocity, these applications are now having to deal with data coming in at like thousands of records per second or even sometimes millions of records per second. And the application wants to put that to work immediately, right? You can see that there are kind of two large, high-level buckets of applications here. The real-time analytics bucket is where you want to capture all this data coming in real time and make your business operations better. Whether it is your marketing team trying to optimize ad spend on a daily basis to maximize revenue potential, whether it is your product team doing A/B testing and trying to figure out what is the best way to achieve their goals, or preventing fraud or real-time 360 kind of examples.

Venkat Venkataramani:

It's all internal operations teams using data to make better decisions. But the key change that has happened is all these enterprises are going from batch to real time. So if you think about the history of what it has taken for us to get here, back 20 years ago, even just capturing large volumes of data, and just managing large volumes of data, and even doing batch analytics on that was very, very hard. And the big data movement really was the first wave that changed the game there. So with the Hadoop movement and big data movement, people were able to really capture and store large volumes of data, and have batch analytics with nightly jobs, or daily jobs or weekly jobs and reports and whatnot, over massive volumes of data.

Venkat Venkataramani:

Then the whole thing evolved to the cloud, where you have seen the same kind of architectures in terms of S3-based storage and EC2-based compute allowing you to do the same thing in the cloud even more efficiently. But what is happening now is that all these enterprises and applications are going from batch to real time. So with the advent of systems like Apache Kafka, companies like Segment, all these technologies are getting adopted so much that now the data acquisition in all these enterprises is happening in real time. And once the data is coming in real time, now, it's not just analytics like a nightly report that you want on it, you want to really put that data to work as soon as and as quickly as the data is getting acquired by your company.

Venkat Venkataramani:

These are the kinds of sources: event streams, behavioral data, sensor data from your phones and other sensors you have. Even your operational transactional databases have a change stream, which is almost event data coming out of the transactional databases, and so on. And so all these data sources are coming in real time, and people want to build these two categories of applications on top of it to actually use that data to improve their business. Either they're trying to improve the efficiency of the operational teams within their business to make better decisions. We saw some examples: marketing ad spend optimization, product analytics, preventing fraud, managing risk better, making better investment decisions, using real-time 360.

Venkat Venkataramani:

On the other side, people are also... The thing that we're seeing is also using modern, real-time databases and building real-time applications to actually build a better real-time experience for their end user. And so if you think about gaming leaderboards, personalization engines and recommendation engines, the pattern is very much the same, data is coming in real time, and I want to put it to work and I want to leverage that. In this case, I'm actually trying to add more value with real-time data to my end customer, as opposed to giving them a static dashboard based on what happened yesterday, or the week before. As and when concrete trucks are driving around, I want my customer to be able to come into my platform, and be able to see up to the second, update on where those trucks are, and who's on time and who's behind.

Venkat Venkataramani:

And gaming leaderboard, the same way. I just took a walk around, and they want to give an end user customer experience where they see because of that walk and extra workout that they did today, that they instantly go to the top of the leaderboard. And without the real-time feedback, the gaming leaderboard doesn't really work. It doesn't really motivate their customers to use their product more. So just giving you a little bit of color around why that is a big need now for real-time analytics and real-time applications. As companies moved from batch to real time, accumulate data in real time, they want to put it to work and make their companies better and make their end user customer experience better.

Venkat Venkataramani:

The next question on everybody's mind is, why can't I solve this using existing technologies? The key realization, if you can see, is that a lot of these core use cases are not new. It's just that right now, with the real-time data acquisition coming and huge volumes of data coming in, the real need now is that these applications are demanding both speed and scale simultaneously. So if you look at what's happening, or what the existing set of solutions available is, you are almost forced to make a choice, you're almost forced to make a big... Which is, what is the one thing that you cannot live without, and then struggle to get the other? Which is, if speed is something that you cannot live without, because it's an end-customer-facing application, then you pick speed, and then you struggle to scale. Or if scale is the one thing that you cannot live without, because it's an internal decision-making system, then you end up picking scale, and you struggle to get speed.

Venkat Venkataramani:

Let me dig into that a little bit more. So what do I mean by that? Take a personalization engine, the gaming leaderboard example that we just saw, right? How would you build a gaming leaderboard like that? It has to be fast: when a query is run against the leaderboard, it needs to come back in under a second, otherwise it's not a useful product. And so you end up picking something like Elasticsearch or Postgres or MySQL, and then it is extremely hard to scale them, right? You have to really hire a team of data engineers, a team of DevOps, and look forward to writing their performance reviews every six months. It's a really difficult thing to scale. It is so hard that you can't even rely on technology to scale, you really have to rely on people to scale, it's that complicated.

Venkat Venkataramani:

On the flip side, say, hey, I want scale, scale is my one non-negotiable requirement. Then you build it on a data lake, a cloud data warehouse, and what you end up with is an application that will be really slow. These very scan-based, columnar databases are very good at separating compute and storage, but are not known for giving you a millisecond response time for your queries, which is what you need to be able to power applications on top of your real-time data. And so in many cases, when someone moves an application from these warehouses to Rockset, you're really looking at something that is not getting 20% better, 20% faster, you're looking at something like, "My requirement is my queries need to go down from five seconds to 50 milliseconds." And that's usually the kind of requirement that applications demand. And that is the real requirement here in terms of a combination of both speed and scale.

Venkat Venkataramani:

If you look at what is the genesis of Rockset, how is Rockset approaching this problem differently than all the other systems that have existed before it? And you think about it like, "Well, is there a better way, do application developers really have to pick between, 'Do I get speed and incur a lot of complexity to scale? Or do I pick scale and end up building something that is too slow, and too expensive to make go faster?'" And if you look at what applications have actually solved both of them, a couple of them come to mind, right? Google Search, and the back end for Google Search comes to mind because it is a massively large scale system, but queries come back in 100 milliseconds or less.

Venkat Venkataramani:

Facebook newsfeed is actually a very similar architecture on the back end, where it's a scatter-gather or leaf-aggregator architecture that actually allows it to scale and very quickly find out and build a customized, personalized newsfeed for every individual user, even when there are billions of them. The idea behind both these systems is actually real-time indexing. Point at the data sources or stream the data sources in real time to a system that indexes the data in real time in a way that accelerates your query processing. And if you think about it, there is a lot of actual aggregation, ranking, finding out what is the most relevant of all the potential matches that you find either on your newsfeed stream or across all the results that match in Google. The ranking and a lot of that actually happens in real time, at query time. And it does not happen at ingest time, because at ingest time, data ingestion, all you're doing is just a real-time... more or less a general purpose indexing, so that you can ask all these questions and expect a very fast response time.
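As an illustration of the scatter-gather (leaf-aggregator) pattern described above, here is a minimal Python sketch. It is not how Google, Facebook, or Rockset implement it; the `LEAVES` data, the scores, and the function names are made up for the example. Each leaf ranks only its own shard at query time, and the aggregator merges the partial results.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "leaf" shards: each holds (item_id, relevance_score) pairs.
LEAVES = [
    [("a", 0.91), ("b", 0.40), ("c", 0.75)],
    [("d", 0.88), ("e", 0.10)],
    [("f", 0.97), ("g", 0.55), ("h", 0.60)],
]

def leaf_top_k(shard, k):
    """A leaf ranks only its own data, at query time, and returns its local top k."""
    return heapq.nlargest(k, shard, key=lambda item: item[1])

def aggregate_top_k(leaves, k):
    """Scatter the query to every leaf in parallel, then gather and merge the partials."""
    with ThreadPoolExecutor(max_workers=len(leaves)) as pool:
        partials = list(pool.map(lambda shard: leaf_top_k(shard, k), leaves))
    merged = [item for partial in partials for item in partial]
    return heapq.nlargest(k, merged, key=lambda item: item[1])

top3 = aggregate_top_k(LEAVES, 3)
print(top3)
```

Because every leaf returns at most k candidates, the aggregator merges a small amount of data no matter how large each shard is, which is what lets this shape of system add leaves as data grows.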

Venkat Venkataramani:

If you look at Rockset, it is directly inspired by such systems. And Rockset is the real-time indexing database for building these modern data applications, like the examples we saw earlier, at massive scale, without any operational overhead. Think of Rockset as very similar to how Google indexes the web, builds indexes on it, search indexes on it, and gives you an open text box where you can type any keyword and get fast responses. Think of Rockset as very similar: get a Rockset account, point us at wherever your data is. It could be sitting in transactional databases, it could be coming in real time from Kafka- or Kinesis-type real-time streaming systems. Or it could even be a massive amount of data sitting in your data lake or your data warehouse. In real time, Rockset will index the data and automatically turn it into fast SQL tables in the cloud. And so very similar to Google indexing the World Wide Web, Rockset will index your enterprise data no matter where it is, and give you an open text box in which you can type anything and expect fast responses, except what you need to type into the text box is a SQL query as opposed to a keyword search.

Venkat Venkataramani:

So that's really the simplicity and the power of Rockset and you should expect, no matter how big your dataset gets, no matter how much volume of data is coming in, no matter how complex your queries are, you should expect very, very fast response time, that is fast enough that you can power applications on top of it, not just for human interaction speed. Let me go back and just show you what are the steps in actually using Rockset and to be able to build your application. So as I said, go create an account, at Rockset there is a free trial, anybody can go to rockset.com and start one and get $300 credit if you want to kick the tires and see what it can do.

Venkat Venkataramani:

And literally just connect your data source to Rockset. The data ingest is schemaless, so you don't have to specify anything about fields and columns and column types and what have you. Rockset will automatically turn structured and semi-structured data into SQL tables that are fully typed and fully indexed. So in real time, Rockset will build a converged index that will turn these real-time data sources into SQL tables in the cloud.

Venkat Venkataramani:

And not just that, once you have a SQL query that you want to run on your collections in Rockset, which are these fast SQL tables, fully indexed... In a single click in Rockset, you can turn that SQL query into a REST endpoint using this feature called Query Lambdas. Now you can go hit that REST endpoint without having to worry about SQL and ORMs and all the other complexities that come with building applications on top of a traditional SQL database. Here, you use Query Lambdas and you're literally hitting a REST endpoint as though you built a special purpose kind of REST API endpoint for that particular, say, microservice or nanoservice, if you were to call it that, that you will be able to plug in directly from your application.
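To make the idea of a saved, parameterized query behind an endpoint concrete, here is a small local sketch in Python. It does not use Rockset's API: `sqlite3` stands in for the database only so the example runs anywhere, and the `SAVED_QUERIES` registry, the `tickets` table, and `call_saved_query` are all hypothetical names. It shows what an endpoint handler conceptually does: look up SQL by name, bind the caller's parameters, and return JSON-like rows.

```python
import sqlite3

# Stand-in for a collection; sqlite3 is used only so the sketch runs locally.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tickets (ticket_id TEXT, status TEXT, plant TEXT)")
db.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?)",
    [
        ("t1", "delivered", "north"),
        ("t2", "in_transit", "north"),
        ("t3", "in_transit", "south"),
    ],
)

# Hypothetical registry: a saved query is just a name plus parameterized SQL.
SAVED_QUERIES = {
    "tickets_by_status": (
        "SELECT ticket_id FROM tickets WHERE status = :status ORDER BY ticket_id"
    ),
}

def call_saved_query(name, params):
    """Look up the SQL by name, bind parameters, and return rows as dicts."""
    cursor = db.execute(SAVED_QUERIES[name], params)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

result = call_saved_query("tickets_by_status", {"status": "in_transit"})
print(result)
```

The application only ever sends a query name and parameter values, so the SQL text can change on the server side without touching application code.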

Venkat Venkataramani:

You can go from data coming in real time from any source, to building those applications, building those data APIs and REST endpoints on top of Rockset in a matter of hours. And you can instantly build that without having to worry about any operational complexity, it's massively scalable, and very, very fast out of the box. So let's just go back and look at some of these examples that I started the talk with. This is a Rockset customer. Actually, they were looking at Elasticsearch and other systems to be able to do this, and they were worried about scalability and the operational overhead that comes with managing Elasticsearch clusters.

Venkat Venkataramani:

My favorite quote from this customer was that Rockset took their six-month roadmap and compressed it to a single afternoon. They were able to instantly come and point Rockset at an Amazon DynamoDB source, and in a matter of hours, hundreds of millions of job tickets that they were tracking got replicated to Rockset, and new changes coming into DynamoDB would get reflected in Rockset in one to two seconds: inserts, updates, deletes, what have you. And now using SQL, they were able to build very, very rich search and analytics and other kinds of reporting that they wanted to go and present to their top customers, and build even new product lines and increase their revenue, because now they're able to create more value for their customers.

Venkat Venkataramani:

This is another customer, a hedge fund, that actually had built this application, this interaction layer on Snowflake and the queries were taking two to five seconds, and they switched over to Rockset just for the two to three terabyte data set, right? The batch analytics still happened in Snowflake, but the output of that batch analytics every day, they get a two to three terabyte data set, sometimes it's more, sometimes it's five terabytes, and they want to index that and build a very, very fast interactive web app on that. They switched over to Rockset and the queries that used to take two to three seconds in Snowflake, run on Rockset in 18 milliseconds. So it's about 100 times faster, at least.

Venkat Venkataramani:

The best part of it was the amount of compute that they are burning on Rockset is 1/4 the amount of compute Snowflake needed to even run those queries with two second latencies. So it's not only 100 times faster, but they were able to do that with 25% of the compute cost. One more example like this, again, in eSports: eGoGames is the company, and they use Rockset for doing so many different things within their company, not just for personalization and improving the product, but also a lot of their operations that need to happen in real time, like fighting fraud, is something that they use Rockset for.

Venkat Venkataramani:

And Rumble is one of our favorite customers. They evaluated Druid, they evaluated Snowflake, they tried various different open source and other cloud technologies. And they were not really able to scale it, in terms of both providing short data latency from when somebody takes a walk to when the leaderboard gets updated, and also building a rich leaderboard, which required sophisticated, full-featured SQL. The kind of OLAP queries that they were doing to power that live leaderboard was only possible in Rockset, even though they tried a whole bunch of other systems. So you can read about all these case studies and blogs on our website, I encourage you to look into them.

Venkat Venkataramani:

Coming back, as I said, Rockset is the real-time indexing database for these modern applications, so that you get both speed and scale simultaneously without any operational overhead. So with that, I want now... I'll stop sharing my screen, and I would love to take a couple of questions, while Dhruba sets up his part of the presentation. And Dhruba will go more into how does Rockset actually achieve this, the combination of speed and scale and simplicity? So he'll go a little bit into how Rockset is actually built under the covers to achieve that. But would love to take one or two questions while he's setting up.

Kevin:

Yeah, sure. So again, if you have any questions, please put them into the questions section of the webinar tool. And thanks, we do have some already. Venkat, you volunteered to take some, so here's one for you and/or Dhruba, depending on who wants to answer it. Can you share some details about scalability in terms of data size, query sizes, etc.? And the reason for this question is because the person is wondering if some of the other tools you mentioned might be suitable at a certain scale versus Rockset or are there reasons he would consider Rockset even if he's not at a certain scale? Yeah.

Venkat Venkataramani:

I can take that. In terms of scale, there is storage and then there's compute. So in terms of storage, Rockset's architecture uses hot storage and not cold storage. And hot storage is, I would say, about 10 times more expensive, but maybe 500 to 1,000 times faster to do query processing on. So it is required for powering applications. So in terms of storage scaling, the total volume of data managed... The software is massively scalable, but it really comes down to your application's needs, right? How much data do you want to actually index and build applications on, versus how much data do you want to just store in cold storage for future analytical purposes and whatnot?

Venkat Venkataramani:

If you really separate those two, Rockset can scale to hundreds of terabytes of data and still give you sub-second response time for your query processing and for your application development. So storage scaling is built in, and Dhruba will talk a little bit about it. In terms of compute scaling, again, it comes down to your application and workloads. I think you can... Based on the number of queries you run, the complexity of every one of those queries, and how fast you want the queries to come back for your application to be interactive and useful, Rockset allows you to scale that query compute independently of your hot storage. So you can go start with a small amount of compute in dev/test, and as soon as you go to production, you can bump it up, and then you can keep bumping it up and keep going. Rockset's virtual instance is actually a distributed system, and we can continue to grow that and scale it to meet your application's query performance needs.

Venkat Venkataramani:

Is it scalable? Yes, I think I can... I would build applications up to maybe hundreds of terabytes of data in a single app, on top of Rockset in terms of storage, beyond which I think the hot storage cost might become prohibitively expensive for lots of applications. On the compute side, it really depends on your... You can burn a lot, you can burn a little, it really depends on your requirements on your application.

Kevin:

Okay, that's great. Thanks, Venkat. Maybe we can ask Dhruba to share a bit about how Rockset's built under the hood.

Dhruba Borthakur:

Sure, yeah. Thank you. Thanks a lot. That's a great question actually about speed and scale. And that's exactly what I was thinking of presenting for the next part of the talk. Here, we are going to discuss the internals of Rockset, which give it the speed and scale and simplicity that Venkat mentioned earlier. This is something that... So Rockset uses something called the aggregator-leaf-tailer architecture. And the reason behind this architecture is, again, the real-timeness of the database. So I mean, we have talked about how real time is important for all these use cases. And the relevance of this aggregator-leaf-tailer architecture is that this is one architecture where you can have a very high write rate into the database, and also have a very high query volume in the same database. This is the difference. This is why you need the aggregator-leaf-tailer architecture, because it's a real-time database. And a real-time database means you write a lot of data, stream data, megabytes, tens and hundreds of megabytes per second, and at the same time, you also query the same data with high QPS and low latency.

Dhruba Borthakur:

The uniqueness of this architecture is that it is a disaggregated architecture where we separate the compute needed for ingest, the storage that we need to store the data, and the compute needed for queries. So there are three kinds of disaggregation going on here. If you see this picture, on the left side are the tailers, which are what is used to ingest all the data from your various sources. They put the data in a certain format, which is indexable and storable, and then give it to the leaf nodes. Leaf nodes are the storage tier, they store data in hot storage. And then on the right side are the aggregators, which are used for querying. The queries could come from a live application, and these are distributed aggregators, which scale up and down based on your query volume.

Dhruba Borthakur:

So the three things that are disaggregated, again, are the tailers, which means that you can have more tailers when the data write rate increases. If the data size increases, you need to have more storage nodes, which are the leaves, and then if your query volume increases, the system will automatically scale to more aggregators, so that you can get good query performance. This architecture, we did not design this out of the blue. We have seen examples of this, a lot of similar architectures that we have built, myself and Venkat, when we were at Facebook. For example, at Facebook, how the newsfeed is built: this is an indexing system, which has a very high write rate and high query volume, and this kind of architecture is used for that as well.

Dhruba Borthakur:

The reason this architecture is ideal for real-time databases is because the writes are separated from the reads. So if you look at that vertical line going right down the middle of the screen, on the left side is where all the writes are happening, and on the right side of the line is where all the reads are happening. So this very much follows the standard CQRS pattern of software, where writes are separated out from your reads and they don't impact one another. So even if you have a bursty write, your queries should still have consistent latencies and consistent throughput for your application. So that is the architecture. You can read about it, we have written about it. And you can also see examples of this architecture built by Facebook, LinkedIn and other web companies.

Dhruba Borthakur:

This architecture powers the speed and scale that Venkat talked about earlier. So let's first talk about a little bit of the speed, how is the speed achieved by implementing our back end using the architecture we described earlier? So one unique feature of our back end is that ours is an indexing back end, it's an indexing database, we use something called converged indexing. So converged indexing, the simple way to explain it is that we take all the semi structured data that's coming in, let's say, it's a JSON document, as you can see on your screen, we shred it into individual keys and values, and we store it in a key value store, right? So that's the key value store picture that you see on the right side.

Dhruba Borthakur:

The interesting thing about converged indexing is that this actually builds three types of indexes. So for every record on the left side of your screen, we build three indexes. One is a row index, one is a column index, and one is a search index. This is automatically built for you, you don't have to do anything for it to get built. This is again what Rockset is doing under the covers. So the row index is very much similar to, say, a MongoDB store or a Postgres store, where we have all the rows stored out there. The column store is very much like a warehouse or a columnar store, where all the values for a column are strung together. And then the search index is essentially an inverted index, very much similar to Lucene or Elasticsearch, and for highly selective queries, that's the data set that you should go and query.
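
As an illustration of the shredding and the three indexes described above, here is a minimal sketch in Python. It is not Rockset's implementation or storage format; the class `ConvergedIndex`, the helper `shred`, and the key layouts are assumptions made for this example, and arrays are left out for brevity.

```python
from collections import defaultdict


def shred(doc, prefix=""):
    """Flatten a nested JSON-like dict into (field_path, scalar_value) pairs."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from shred(value, prefix=path + ".")
        else:
            yield path, value


class ConvergedIndex:
    """Toy model: every shredded field is written to three key-value layouts."""

    def __init__(self):
        self.row = {}                     # (doc_id, field) -> value
        self.column = defaultdict(dict)   # field -> {doc_id: value}
        self.search = defaultdict(set)    # (field, value) -> {doc_id, ...}

    def add(self, doc_id, doc):
        for field, value in shred(doc):
            self.row[(doc_id, field)] = value        # row index: fetch a whole record
            self.column[field][doc_id] = value       # column index: scan one field
            self.search[(field, value)].add(doc_id)  # search index: find matching ids


idx = ConvergedIndex()
idx.add(1, {"name": "alice", "age": 30, "address": {"city": "Oslo"}})
idx.add(2, {"name": "bob", "age": 30})

print(idx.search[("age", 30)])        # ids of documents where age = 30
print(idx.column["name"])             # all values of the name column
print(idx.row[(1, "address.city")])   # one field of one record
```

The same record is written three times under different key layouts, which is what lets a query planner later pick whichever layout suits a given query.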

Dhruba Borthakur:

What these three indexes mean is that now you don't have to configure Postgres or configure Elastic or configure a warehouse separately; it is kind of built into the converged index. So this is how the storage looks. Now, the storage is only one side of the story. Somebody has to leverage the storage to give you good query performance, and that's where the query optimizer comes into the picture. There are two example queries here, where the optimizer automatically figures out which indexes to use to give you good results. Take for example, on the left side, we have a query which selects some fields from the database, and we have one or two filters. One filter says keyword equals hpts. Let's assume that somehow we know that this filter is actually a highly selective filter, which means that it's best if the system can automatically use the inverted index to query and give you results for this.

Dhruba Borthakur:

On the other hand, on the right side of the screen, you will see a more aggregate type of query where you have a count on a certain field. Those kinds of queries need to scan a lot of data, or need to scan one column or a few columns of a lot of records. So it's best if the query automatically uses the columnar store. This is the advantage of Rockset, in the sense that using the index is super helpful for certain types of queries, and using the columnar store is super helpful for other types of queries. Now, in database terminology, or in database research, it's well known that using an index gives you fast query results. Then it's actually amplified by the fact that if you have a high QPS system, every query, instead of scanning, could actually use the inverted index to give you results. And this is why the savings in the total amount of CPU needed are actually amplified 100X or many times over if you use the inverted index intelligently.
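
The trade-off between the two access paths can be sketched as follows. This is a simplified illustration in Python, not Rockset's optimizer: the function `choose_access_path`, the 1% selectivity threshold, and the synthetic data set are all assumptions made for the example.

```python
def choose_access_path(estimated_matches, total_docs, selectivity_threshold=0.01):
    """Pick the inverted index for selective filters, a column scan otherwise."""
    if total_docs == 0:
        return "column_scan"
    selectivity = estimated_matches / total_docs
    return "search_index" if selectivity <= selectivity_threshold else "column_scan"


# Synthetic data: 10,000 documents, 10 of which have keyword = "hpts".
docs = [{"id": i, "keyword": "hpts" if i % 1000 == 0 else "other"}
        for i in range(10_000)]

# Build a toy inverted index: keyword value -> list of matching document ids.
inverted = {}
for d in docs:
    inverted.setdefault(d["keyword"], []).append(d["id"])

# The scan examines every document; the index lookup reads only the postings.
scan_hits = [d["id"] for d in docs if d["keyword"] == "hpts"]
index_hits = inverted.get("hpts", [])

print(choose_access_path(len(index_hits), len(docs)))   # selective filter
print(choose_access_path(len(docs), len(docs)))         # count over everything
print(len(docs), "rows scanned vs", len(index_hits), "postings read")
```

Both paths return the same ids, but the scan touches 10,000 rows while the index lookup touches 10 postings. At high QPS that per-query difference is paid on every request, which is the amplification the speaker refers to.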

Dhruba Borthakur:

I think Venkat referred to this earlier, when he described the difference in performance for one user between a columnar warehouse and Rockset, and that's like five seconds versus 18 milliseconds of query latency, something like what we have seen in actual production use cases. So that's as far as the speed of the system is concerned. The other thing that is also needed for most real-time systems is scale. The challenge is giving those two things together in one system, and Rockset does this efficiently.

Dhruba Borthakur:

Again, going back to the aggregator-leaf-tailer architecture, what are the things that need to scale up or scale down in this architecture? So the first thing, obviously, is that you have to scale the storage, and I think one of you asked this question: how much can you scale? So I hope this portion answers part of your query as well. So let's first talk about the leaf nodes, which is the rectangle right in the middle of the slide. Again, each of these rectangles is a distributed system where lots of nodes can be put together, and these are what are called the leaf nodes. So that's the hot storage tier. The hot storage tier scales up and down irrespective of how many queries you are doing, or irrespective of what your write rate is, right?

Dhruba Borthakur:

So it's independently scalable, and it's automatically scaled. We use open source RocksDB cloud for storing the index in the leaf nodes, and we have written quite a lot of blogs in the open source community about how RocksDB cloud does the scaling of hot storage. There are things like zero copy clones that we use. We kind of separate the durability and the performance of the storage tier, so that we can make more replicas of the storage tier as needed based on your queries. So that's as far as the scalability of storage is concerned.

Dhruba Borthakur:

Now, if you look at the scalability of compute, the way we package the compute is something called virtual instances. I think Venkat mentioned this as part of answering that question. The Rockset product has this concept called virtual instances, which are essentially compute that you can use for your queries. You don't have to manage them at all; this is mostly a unit of consumption. So you can say, "Hey, I need a 4X or 2X large virtual instance," and it will give you some amount of CPUs that you can use for your computation. Again, the reason this is important for our users is because they need consistent query latencies. For a demanding application, which is user facing and which cannot have fluctuations in its performance, this is the reason why you have dedicated compute, so that you can give great performance for all those applications that you are powering using Rockset.

Dhruba Borthakur:

Virtual instances can be very small if you're hardly making any queries, but they can also be very large depending on your query load and the concurrency of your queries. So this is how we scale... So for virtual instances, the question now is that when you ingest data into Rockset, you also might need some compute to be able to keep your indexing speed as fast as you want, right? This is a real-time database, which means that the write volumes are very high. Now, to decide how good Rockset is when you're writing data, streaming data into Rockset, we have a benchmark called RockBench. So again, this is an open source benchmark, we have written about it, and you can look at it and also look at what the benchmark does.

Dhruba Borthakur:

What the benchmark essentially does is that it throws a lot of data into the database in a streaming fashion, tens and hundreds of megabytes per second. I will come to the graph in a second, but essentially it's streaming in a lot of data, and then it measures how quickly that data is indexed and visible in your queries. So the particular results that we published are for a workload where one billion events a day were streaming into the database, and within one second the data is indexed and visible to your queries. So this is what the graph on the right side shows, depending on your write rate. I think it's around 10,000 events per second that gets you to about one billion events per day, which is where you see the latencies to be around one second.
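
The relationship between the two figures quoted here is simple arithmetic, shown below in Python as a quick check. Only the numbers mentioned in the talk are used; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# One billion events per day, expressed as a sustained per-second rate.
events_per_day = 1_000_000_000
events_per_second = events_per_day / SECONDS_PER_DAY

# Going the other way: a sustained 10,000 events per second over one day.
events_per_day_at_10k = 10_000 * SECONDS_PER_DAY

print(f"{events_per_second:,.0f} events/second")
print(f"{events_per_day_at_10k:,} events/day")
```

So "around 10,000 events per second" is a rounded figure: exactly 10,000 per second gives 864 million events a day, and a full billion a day needs roughly 11,600 per second.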

Dhruba Borthakur:

One more interesting fact is that if you upgrade the Rockset virtual instance, you can actually get more throughput from your system. So the ingest is essentially scalable, again, based on how much data you are sending and how you have provisioned your account. So that's ingest scaling. But then there's also query execution scaling. Maybe you start off with a low volume of queries, and then your application becomes more and more useful, and then you need to scale up your queries. How does the query scaling work? So I'm going to skip a few things here, but I'm mostly going to talk about the features of how we scale each of these queries. All of our queries are handled in a distributed fashion. For example, since Rockset has SQL, we actually support SQL joins. And these are distributed joins.

Dhruba Borthakur:

So every SQL operation that we have is not limited by the memory of one node, because we have aggregator pods, and the aggregator pods can be arranged in multiple hierarchical levels so that you can get more and more memory for a single query if you want. This is what it means to have distributed sorting and distributed aggregation: we can use more CPUs and more memory as part of a single query and give you the latency that you want.
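
A hierarchical aggregation of the kind described above can be sketched as follows. This is a generic partial-aggregate-and-merge illustration in Python, not Rockset's query engine; the shard contents, the two-level layout, and the function names are assumptions for the example.

```python
from collections import Counter
from functools import reduce


def leaf_partial_count(rows, group_key):
    """Each leaf computes a partial GROUP BY count over its own shard."""
    return Counter(row[group_key] for row in rows)


def merge_partials(partials):
    """An aggregator merges partial counts without seeing the raw rows."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Four hypothetical leaf shards holding rows of the same collection.
leaf_shards = [
    [{"country": "US"}, {"country": "IN"}],
    [{"country": "US"}, {"country": "US"}],
    [{"country": "IN"}, {"country": "DE"}],
    [{"country": "DE"}],
]

partials = [leaf_partial_count(shard, "country") for shard in leaf_shards]

# Two first-level aggregators each merge two leaves; a root merges those.
level_one = [merge_partials(partials[:2]), merge_partials(partials[2:])]
final = merge_partials(level_one)
print(dict(final))
```

Because each level only holds merged partial results, no single node needs memory for the whole input, and adding a level adds capacity for a single large query.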

Dhruba Borthakur:

Yet another way to improve your query latency is changing the virtual instance of the Rockset product. So if you up your virtual instance, let's say from large to X large, you can typically expect your query latency to drop by half, or you can probably expect to be able to do 2X more throughput on the same system, without having to do any code changes or any re-sorting. You don't have to manage anything; the system, how should I say, automatically scales up to use all the compute that you have for your queries. That's the reason why, with a click of a button, you can actually make all your queries 2X faster, 4X faster, without having to manage any data or re-sort data at all. This is basically a point-and-click interface.

Dhruba Borthakur:

I would invite some of you to go try this out. This is a really cool feature, a lot of our users like it, because you can have a workload, which is dev test, or maybe in test mode first for some time. And then at the click of a button, you can actually productionize it by provisioning a larger virtual instance. And everything runs as it is, except that things are much faster, or you can put more load into the system.

Venkat Venkataramani:

Question, Dhruba. When a user changes from, let's say, XL to 2XL, what happens to the queries? Do they fail?

Dhruba Borthakur:

Oh, good question. Yeah. So you can actually change the virtual instance online, which means that none of the queries fail at all, because this is the beauty of the cloud native back end database that we have built. You can change virtual instances, you can migrate up, or you can migrate down. In both cases, no query should fail at all. So if you have query lambdas, or if you have a high volume of queries, you should not see any query failures when this is happening.

Dhruba Borthakur:

So that's the scaling part. The simplicity part is also something that our users love a lot. And the simplicity is not just one thing, but a couple of things or a couple of categories, which is why I think people can ship quickly; like Venkat mentioned, one user compressed his six month roadmap to maybe one afternoon. And this is where the simplicity becomes prominent. So Rockset is serverless, which means that you don't have to tell us upfront how much data you are going to store, you don't have to plan any capacity, and it's pay as you use. You don't have to manage any software and you don't have to upgrade anything; the system automatically does that for you in the backend. Again, hopefully no pipelines for you, because you can deposit raw data, which is raw JSON or raw XML, and you can start making SQL queries on it.

Dhruba Borthakur:

The second part of the simplicity is SQL. It's full featured SQL including joins. So you can actually run all kinds of joins on your system. And again, this is not a system which says it's SQL, but there are no joins. All types of joins, or most types of joins, are supported in Rockset. And all these joins are actually distributed in nature, which is why you don't have to worry about how much temporary data a query is going to create when you're running it. Also, it has date/time functions, so we support all the flavors of SQL date and time functions, geo functions, window functions. That's the full featured SQL that we talk about. That is useful for some of the engineers, but for application engineers, we have focused a lot on what features Rockset can provide for application developers. And application developers love the fact that they can convert the SQL into REST APIs using query lambdas.

Dhruba Borthakur:

So once you have created a SQL query, and you want to ship it to production, and you want your applications to hit it, you can make it a query lambda, and then all your applications can just use standard REST APIs. You don't have to ship SQL or database details to your clients anymore; you can just hit the REST API and you get your SQL query results. These also support versioning, so that's why you can do CI/CD as well. And then we have a good VS Code plugin, so for an application developer it gives you a native VS Code look and feel for the database. So those are some of the reasons. Again, I would love it if some of you can try it and see how simple it is, and tell us if there is any feedback on how to make it even simpler. The best way, obviously, is to go to console.rockset.com and try it yourself. We have free credits, and you can use it and see how it goes. I can take questions. Venkat, do you have anything to add?
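
The idea of a named, versioned query that applications call instead of shipping SQL can be modeled with a small in-memory sketch. This is a toy model in Python and deliberately not Rockset's API: the class `QueryLambdaRegistry`, its methods, the tag name, and the example SQL are all hypothetical.

```python
class QueryLambdaRegistry:
    """Toy model of named, versioned SQL queries that clients call by name."""

    def __init__(self):
        self._versions = {}  # name -> list of SQL strings (index + 1 = version)
        self._tags = {}      # (name, tag) -> version number

    def save(self, name, sql):
        """Store a new version of the query and return its version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(sql)
        return len(versions)

    def tag(self, name, tag, version):
        """Point a tag such as 'production' at a specific version."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise KeyError(f"{name} has no version {version}")
        self._tags[(name, tag)] = version

    def resolve(self, name, tag):
        """Return (version, sql) for what a client calling name@tag would run."""
        version = self._tags[(name, tag)]
        return version, self._versions[name][version - 1]


registry = QueryLambdaRegistry()
v1 = registry.save("top_players", "SELECT name FROM scores ORDER BY score DESC LIMIT 10")
v2 = registry.save("top_players", "SELECT name, score FROM scores ORDER BY score DESC LIMIT 10")

registry.tag("top_players", "production", v1)
before = registry.resolve("top_players", "production")

registry.tag("top_players", "production", v2)   # promote v2 without client changes
after = registry.resolve("top_players", "production")
print(before[0], "->", after[0])
```

Clients only ever refer to the name and tag, so promoting a new version is a server-side change, which is what makes the CI/CD workflow mentioned above possible.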

Venkat Venkataramani:

No, I'm looking forward to answering more questions. Thanks for attending, everyone. It's a pleasure to be able to present this with Dhruba, and to share with all of you what we're doing. Go check us out. It's free to create an account and get credits and just kick the tires, and if you have any questions, we'll hang around for a while and answer them now.

Kevin:

Yeah. Exactly, Venkat and Dhruba. Thanks a lot. Keep the questions coming, I think we have quite a number. So aside from what Dhruba pointed out, you can definitely check out everything Venkat and Dhruba mentioned, by taking advantage of the $300 in free credits when you sign up for an account. We didn't have time to show you Rockset in this session, but if you'd like to talk further about how Rockset might fit with your usage, you can also contact any one of us or request a demo on rockset.com following this Tech Talk. So Venkat, so you need to jet at noon? Or do you have time to answer some questions beyond that?

Venkat Venkataramani:

Oh, I can keep going for a while. Yes.

Kevin:

Okay, so question here, does Rockset support updates, deletes, upserts? Maybe Dhruba, do you want to take that?

Dhruba Borthakur:

Sure. Yeah. So Rockset is a completely mutable database. It's not like the append only, read only tables that warehouses typically support. So every field of every record is mutable, which means that you can update one field in one record, and it will get updated; it doesn't reindex the entire document. So yeah, it is an absolutely mutable database.
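
Field-level mutability of an index can be illustrated with a small sketch. This is a simplified model in Python, not Rockset's storage engine; the class `FieldIndex` and its key layout are assumptions for the example. It shows that updating one field rewrites only the index entries for that field.

```python
class FieldIndex:
    """Toy index keyed per field, so one field can change without a reindex."""

    def __init__(self):
        self.values = {}     # (doc_id, field) -> value
        self.postings = {}   # (field, value) -> set of doc_ids
        self.writes = 0      # number of field-level writes performed

    def put_field(self, doc_id, field, value):
        """Insert or overwrite a single field of a single document."""
        key = (doc_id, field)
        if key in self.values:
            # Remove the stale inverted-index entry for the old value only.
            self.postings[(field, self.values[key])].discard(doc_id)
        self.values[key] = value
        self.postings.setdefault((field, value), set()).add(doc_id)
        self.writes += 1

    def put_document(self, doc_id, doc):
        for field, value in doc.items():
            self.put_field(doc_id, field, value)


index = FieldIndex()
index.put_document(7, {"status": "pending", "total": 40, "city": "Pune"})
writes_after_insert = index.writes

index.put_field(7, "status", "shipped")      # update one field in one record
writes_for_update = index.writes - writes_after_insert
print(writes_for_update, index.postings[("status", "shipped")])
```

The update performs a single field-level write and leaves the entries for `total` and `city` untouched.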

Kevin:

Okay, and how does Rockset compare with other real-time analytics engines that can also handle similar scale and sub-second SLAs? For example, Druid, Pinot, etc.

Dhruba Borthakur:

Do you want to take that Venkat or shall I try?

Venkat Venkataramani:

Oh, yeah, give it a shot, Dhruba, I think you know better.

Dhruba Borthakur:

Sure. Yeah. Yeah, absolutely. So again, for Rockset, it's powered by an indexing database. So it's not a columnar storage. Columnar storage has its advantages, like you can have disk size to be really small. And if you're trying to optimize storing a lot of data, then Druid is great, because it might shrink your data to a really small size to store it in offline data storage. But for Rockset, everything is based on an index, and it's automatically created, so you don't have to manage any fields saying what needs to be indexed or what doesn't need to be indexed. Also, if you're looking at other OLAP databases, you typically have JSON, which you need to flatten and then put it... I mean, at time of writing into the database, you have to flatten some fields and then write it into the database. In Rockset, that's not the case. You can just dump raw JSON without any flattening, your schemas of your input data can change over time, you don't have to do anything. This is what I mean by simplicity.

Dhruba Borthakur:

All of these things you can do in other systems as well. You just need to manually configure these things. In Rockset, it's automatic for you. So yeah, you don't have to create any schemas or anything at all, you can directly dump raw JSON and you can make SQL queries out of it. I hope that answers your question.

Kevin:

Okay, thanks, Dhruba. Again, if there are any follow ups, please contact us, ask questions online, offline. Everything is fine. Maybe I'll ask this one to Venkat. We had a couple of questions around visualization, right? How do you normally visualize the data? Can you connect to BI tools? And then if I can just add to that question, right? Are most Rockset users actually visualizing data in visualization tools? So how are users analyzing the data in Rockset?

Venkat Venkataramani:

That's a great question. I think there are almost static dashboards that you can use. There's a whole bunch of SQL-based dashboarding tools Rockset supports out of the box, like Grafana, Redash that are open source. Preset, the company behind Superset, we are big fans of the product. So Superset also works really well with Rockset. So that's a very popular combination. Tableau, I mean, JDBC, anything that supports JDBC can work with Rockset for all these other BI tools. The key thing is that I think the BI tools, it's about doing live queries. There's a couple of differences in visualization. BI tools are very much known for extracts and then doing interaction on the extracted data. Almost all BI tools work with Rockset as a live query, because the queries are so fast, you don't need to do any extracts to work around the slowness of your typical data system. So number one.

Venkat Venkataramani:

Number two is ad hoc slicing and dicing. So visualization tools that allow you to click on part of a graph, and then you get a more detailed breakdown of what's happening there. Those are the kinds of dashboards that people end up building or powering on Rockset. And a lot of the static dashboarding tools fall short: they give you a visual representation of the data, but don't allow you to interrogate the data in a visual way, unless it's an extract. So this is usually the dilemma, and most of our users end up building a custom application, say, with Retool. So another popular way of visualizing and consuming the data in Rockset is to actually build a custom internal tool on, let's say, Retool, and plug the query lambdas in on the back end using Rockset.

Venkat Venkataramani:

So it's really how rich you want the interaction to be. Rockset is traditionally not used as what I call a bean counter, right? So there are other quote, unquote, real-time systems that do not give you powerful analytics, but allow you to kind of keep some counters up to date in real time. Apache Druid is a very good example of that, which is extremely good as long as you know ahead of time exactly what the shape of the data is, and exactly what questions you're going to be asking. It doesn't allow for interaction and being able to find out not just what happened, but why, how do I respond to it, and how do I understand it?

Venkat Venkataramani:

So the visualization layers are usually much richer when they are custom built for applications, which is why the JS SDK and the Python SDK and those kinds of applications and web apps, internal tools, decision systems are very, very popular. The second most popular thing is Retool, which kind of shortens the time it takes for you to build that kind of a custom interaction layer that is optimized for your workflows. And then a third level approximation of that is all the SQL-based visualization tools that are open source, like Superset, which is great, Redash is great, and Grafana is amazing, so all these tools work natively with Rockset. And on the BI side, Rockset is SQL, it's full featured standard SQL, so we have a JDBC connector. And so any BI tool like Tableau or what have you, as long as it supports JDBC, will work with Rockset.

Kevin:

Awesome, thanks for sharing, that was great. One more here, which is, do you think we can avoid using streaming applications like Spark, [City 00:58:15], etc.?

Venkat Venkataramani:

I'll take this. Streaming applications, I mean, what are you really doing with it, right? I think there are a couple of reasons why I think some of those stream processing systems are still useful, even with Rockset. The number one thing that we would encourage people to do with stream processing engines is to probably do some level of roll ups to reduce the volume so that your hot storage costs are under control. But if you really think about what streaming applications end up doing, they do a lot more than that, they actually aggregate the data that is coming in to be able to answer a particular question that they know ahead of time. And I think that is what I think we can entirely avoid. So the typical way you get best of both worlds of both free-form query processing on your streaming data, and also control your hot storage costs and compute cost of ingesting all of that is really depending on the volume of the data, right?

Venkat Venkataramani:

Let's say you have a terabyte of data or less coming in every day, most of our customers just stream all the raw data into Rockset, index it all, and they can ask any questions they want, and it's totally well within their budget and well within their control to be able to build a world class, real-time analytics and real-time application on that. When you cross over into more than one terabyte of data coming in and I want to stream that in, then I think that is a... What are you trying to build on... What kind of application are you trying to build? And a lot of the time our customers use for some kind of roll up, where some kind of... Instead of sending 10 records to Rockset, it gets rolled up to like one record, or some kind of reduction happens on the data coming in.

Venkat Venkataramani:

And so in Kinesis... You can use Kinesis analytics to do it. In Kafka, you can use KSQL to do it. Any kind of stream processing that reduces the volume of data ingested into Rockset is very, very valuable. You still keep it as close to the raw form as possible, but you roll up so that instead of inserting 10 terabytes of data in real time to Rockset, you do one, but it's still much closer to the raw form. You don't need to break down on every dimension that you might ever want to query, all of those things can still stay raw, you don't need to denormalize in joins, you can do joins at query time. And so all of those advantages still come in.
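
The roll-up pattern described here can be sketched in plain Python. This is only an illustration of the idea, not Kinesis Analytics or KSQL code; the event fields (`ts`, `device_id`, `bytes`), the one-minute window, and the function name `roll_up` are assumptions for the example.

```python
from collections import defaultdict


def roll_up(events, window_seconds=60):
    """Collapse raw events into one record per (time window, device)."""
    buckets = defaultdict(lambda: {"count": 0, "bytes": 0})
    for event in events:
        window_start = event["ts"] - event["ts"] % window_seconds
        bucket = buckets[(window_start, event["device_id"])]
        bucket["count"] += 1
        bucket["bytes"] += event["bytes"]
    return [
        {"window_start": window_start, "device_id": device_id, **totals}
        for (window_start, device_id), totals in sorted(buckets.items())
    ]


# Ten raw events from one device inside the same one-minute window.
raw_events = [{"ts": 120 + i, "device_id": "d1", "bytes": 100} for i in range(10)]
rolled = roll_up(raw_events)
print(len(raw_events), "raw events ->", len(rolled), "rolled-up record")
print(rolled[0])
```

The rolled-up record still carries the device dimension, so it stays close to the raw form and can be joined or filtered at query time; only the per-event granularity inside the window is given up.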

Venkat Venkataramani:

But stream processing for rolling up and doing some kind of aggregation, write-time aggregation of the raw data into a slightly less raw form that still reduces the volume of ingest, is a very popular pattern for streaming systems working with Rockset.

Kevin:

Awesome. Okay, so let's just do one more. And there were a lot of questions, but I'll try to combine multiple questions on the architecture into one and then maybe Dhruba, if you want, you can take the first crack at this. But where is ingested data stored, okay? Is it stored or managed in my AWS account? Or in Rockset? And then on the compute side, is compute only used for handling queries? Or is it shared for ingest and index? And does the user have control over storage type and size in addition to compute and virtual instances shared between customers? So lots of questions about the architecture-

Kevin:

... and compute side.

Dhruba Borthakur:

Cool, yeah, I can maybe answer one or two of them, and then maybe Venkat can join in and talk more about virtual instances, if you like. The first question, I think is where does data get stored, right? That was the first question.

Kevin:

Yeah.

Dhruba Borthakur:

First the data comes in and it gets indexed, and those indexes need to be stored. This is what we call hot storage. The hot storage is essentially a set of RocksDB cloud based servers, and they store the index on locally attached SSDs. This is what I mean by hot storage. But again, the local SSDs are not used for durability, because the data is also made durable in S3. So there's a separation: the data is stored in S3 for durability, but it is served from hot storage, which is SSD and RAM based systems, and this is all RocksDB cloud based. I think the second question is probably related to this, which is: are virtual instances shared between users, right? That was the second one-
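
The separation between durable storage and hot serving storage can be sketched as follows. This is a conceptual model in Python, not how RocksDB cloud actually persists its files; the classes `DurableStore` and `LeafNode` are hypothetical, with a plain dict standing in for S3 and another for the local SSD.

```python
class DurableStore(dict):
    """Stand-in for an object store such as S3: slow to read, but durable."""


class LeafNode:
    """Serves reads from local hot storage; durability lives elsewhere."""

    def __init__(self, durable):
        self.durable = durable
        self.hot = {}  # stand-in for the locally attached SSD and RAM

    def write(self, key, value):
        self.durable[key] = value   # made durable in the object store
        self.hot[key] = value       # kept locally for low-latency serving

    def read(self, key):
        return self.hot[key]        # the query path never touches the durable store

    @classmethod
    def from_durable(cls, durable):
        """Bring up a replacement or extra replica by loading from durable storage."""
        node = cls(durable)
        node.hot = dict(durable)
        return node


s3 = DurableStore()
leaf = LeafNode(s3)
leaf.write("doc:1", {"city": "Oslo"})

# Losing the node loses only the hot copy; a new replica rebuilds from S3.
del leaf
replica = LeafNode.from_durable(s3)
print(replica.read("doc:1"))
```

Because the hot copy is disposable, replicas can be added or replaced to match query load without putting durability at risk.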

Kevin:

And there was also is compute -

Dhruba Borthakur:

Query.

Kevin:

... different between ingestion and query index?

Dhruba Borthakur:

Sure. Yeah. So let me answer that first. What happens is that when data comes in, we have a separate set of machines... I think we saw it in the architecture, they're called tailers. The tailers are the ones which actually convert this data into an indexable format. So this is separate from the storage tier. This is why we call them the tailers: they basically homogenize different data formats into a specific protobuf format that Rockset has. And that CPU is separate from the storage nodes. So it is not tied to the storage layer, and it can be scaled up and down separately.

Dhruba Borthakur:

Similarly, in RocksDB cloud, we have something called remote compaction. The system sometimes needs CPU just to keep the indexes up to date, and instead of running that compute on the leaf nodes, we try to run it on a different layer called the remote compaction layer. Again, we have written about it, there is a blog post. If you look for RocksDB cloud remote compaction, you will see how the CPU for keeping indexes online is separated from the storage layer itself.

Dhruba Borthakur:

So again, the focus is: how can we separate the compute and the storage needed? That is the entire focus of this architecture. Then the virtual instances are actually used for querying. So this is something that you have to have so that your aggregators, which are basically your query components, are scaled up and down based on the size of the virtual instance. Do you want to add anything, Venkat, about virtual instances? Were there more follow-up questions about it?

Kevin:

And I think, Venkat, if you want to talk a little bit about VPC, I think the question also asked about that?

Venkat Venkataramani:

Yeah. Rockset has two deployment options. One, if you just go to rockset.com and sign up, it's a fully hosted, fully managed, almost like a SaaS offering, where Rockset takes care of everything for you from operations, scaling, data connectors, and what have you. And you pay for what you use in terms of hot storage, in terms of ingest, and in terms of query compute and virtual instance sizes. Rockset also has another deployment option, which we call a private VPC deployment option, where we separate out the data plane. The data plane is where customers' data is ingested, indexed and stored, along with the entire query path where the data is actually processed, looked up and aggregated to serve those SQL queries. So the data plane will basically completely contain all the data path and the query path in terms of ingest and queries. And in the private VPC deployment option, the data plane is completely separated from the control plane. The control plane is still managed by Rockset in terms of metrics, monitoring, and being able to do software patches, so that you get all the latest and greatest features and bug fixes and other things from Rockset automatically.

Venkat Venkataramani:

Also, the control plane handles the orchestration for the virtual instances themselves. So you can still come into the control plane and say, "Hey, take my virtual instance from a large to a 2XL," and the control plane will orchestrate the scaling of the compute inside your data plane automatically, without any operational overhead or need from you. So in the private VPC deployment option, the customer's data never leaves their AWS VPC perimeter, which basically means it never leaves their security perimeter or their compliance perimeter, and it doesn't even need to have access to the rest of the internet. The control plane will be connected to the data plane using a private VPC link, a private link option in AWS, so that it doesn't go through the internet either. So we have the private VPC deployment option on AWS VPCs today; it is not yet available on Azure or GCP.

Kevin:

Okay, thanks a lot, Venkat and Dhruba. And thanks a lot to everyone who stayed with us. We went into overtime with all the questions, so we appreciate your patience. If you want to contact any one of us, it's just our first name @rockset.com if you have more questions. And thanks Venkat, thanks Dhruba, thanks to all attendees for joining us today.

Venkat Venkataramani:

Thanks for hosting us, Kevin.

Dhruba Borthakur:

Thanks. Bye.

Venkat Venkataramani:

Bye.
