Why Wait: The Rise of Real-Time Analytics Podcast
Rockset Podcast Episode 4: Pain Points of Real-Time Analytics Systems
Hear from Uber infrastructure and data specialist Zheng Shao (ex-Yahoo, Facebook, Dropbox), who has designed multiple real-time analytics systems. He touches on real-time analytics challenges, batch vs. real-time analytics, time series databases, and more.
Listen & Subscribe
On the Why Wait? podcast, we invite engineers defining the future of data and analytics to share their stories. New episodes are published every month, and you can subscribe to the podcast and our blog roundup to catch the latest episodes.
Show Notes
About This Podcast
Giovanni Tropeano:
Welcome to Why Wait, the real-time analytics podcast by Rockset. We invite business leaders, app development thought leaders, and analytics specialists to share their stories with the world, providing insights into what your peers are doing to improve application analytics in real time. Before I kick it off, if you're listening to this and have a question or would like to comment, please do so on our community Slack channel at rockset-community.slack.com. You can also leave us a message on Twitter @rocksetcloud. With me is my cohost Dhruba Borthakur, Rockset co-founder and CTO. Thank you for being here, Dhruba; it's always a pleasure, and thanks for taking the time to chat with us. Today, Dhruba and I are joined by an engineering leader with quite a track record. Having started at Yahoo, he rose through the ranks and moved on to Facebook as a Senior Engineering Manager, holding that role for over six years. He then moved on to Dropbox, and most recently to Uber, where he's owned data infrastructure and now engineering. I've really been looking forward to this conversation. Welcome to the Why Wait podcast, Zheng Shao. It's a pleasure to have you, and let's jump right into it. I know Dhruba and Zheng know each other from a previous life. How do you gentlemen know each other?
Dhruba Borthakur:
Yeah, I have known Zheng for a long time. I think the first time I met him was when I was working at Yahoo, more than a decade back, when I was working on the Hadoop project. Zheng was actually on a team at Yahoo that was among the first users of the Hadoop project. So I've known him for a long time, starting when the Big Data revolution started. I also worked with him at Facebook, where Zheng built a lot of real-time infrastructure, including Puma and Calligraphus and some other projects. This is how real-time analytics started when both of us were peers at Facebook. Zheng, welcome. Thanks a lot for being here. Would you like to introduce yourself?
Zheng Shao:
Thanks, Dhruba. Yes, as Dhruba already mentioned, Dhruba and I worked together for a pretty long time. I started my career about 15 to 20 years ago, when I was doing research on data mining and databases at the University of Illinois. What I realized at that time is that without a grid infrastructure, there's no way to do data mining, or machine learning and AI, which is pretty much what I think the industry is looking at right now as well. Afterwards, I started working at Yahoo, and that's when I got to know Dhruba, and then worked with him very closely for the six years I was at Facebook. Then I switched to Dropbox. Five years ago, I joined Uber and have been working on big data systems there for the last five years.
Dhruba Borthakur:
You have a lot of experience working on different systems at some of these cutting-edge companies. Let's start off with life at Facebook, where both of us were working together. You designed a lot of real-time analytics solutions there. Could you explain a little bit about the pain points of some of the real-time analytics systems that we implemented at Facebook, and what the challenges were as far as the usage of those systems is concerned?
Zheng Shao:
I guess first of all, I should introduce a little bit about the need for real-time analytics systems at Facebook. In general, every company, including the other companies I've worked at, when they first get started, is okay with analytics results that are a little late, or let's say stale. But as the business continues to grow, the need for real-time analytics gets stronger and stronger. Business users want to get the numbers as soon as possible so they can make decisions. They start from daily reports, eventually move to minutely reports, and eventually they want to look at the data in real time, so that they don't need to bother the engineering team to prepare reports for them. For real-time analytics, when we first got started, those use cases were still very simple. It's basically a simple count based on some dimensions, but then quickly capability becomes the biggest pain point for our customers. That means people start to look for full SQL support. People start to look for nested columns. People start to look at how we can handle late-arriving data; that includes exactly-once delivery semantics, and it also includes update processing. Sometimes an old record comes in again as a new record that updates the value. Then how can we make sure the total sum is still correct, by subtracting and adding the data between the two records? All of these kinds of capabilities become the pain point. By the way, a lot of those capabilities are already in the traditional batch systems, but because the real-time systems are new, it takes a lot of effort to build those things from scratch and add them into the real-time systems. That's the first challenge, on the developer side, the capability side. The next challenge is on the maintainability and SLA side. For batch systems, we have one day to finish the job; if the system goes down, engineers have got time to restart the system. For real-time systems, unfortunately, we don't have that luxury. Once the system is down, we basically violate our SLA. The third challenge we have is related to cost, because real-time analytics requires a lot of processing in memory. We require much bigger systems that consume a lot of memory and CPU to provide those statistics. All of them added up, I would say real-time analytics is not easy.
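The correction logic Zheng describes, subtracting an old record's value and adding the new one when a late or updated event arrives, can be sketched in a few lines of Python. This is an illustrative sketch only, not Facebook's actual pipeline; the class and field names are hypothetical.

```python
# A minimal sketch of keeping a running sum correct under late or corrected records:
# when a record we already counted arrives again, remove its old contribution first.
from collections import defaultdict

class IncrementalSum:
    def __init__(self):
        self.totals = defaultdict(float)   # running sum per dimension value
        self.last_seen = {}                # record_id -> (dimension, value)

    def ingest(self, record_id, dimension, value):
        if record_id in self.last_seen:
            old_dim, old_value = self.last_seen[record_id]
            self.totals[old_dim] -= old_value          # subtract the old record
        self.totals[dimension] += value                # add the new record
        self.last_seen[record_id] = (dimension, value)

agg = IncrementalSum()
agg.ingest("order-1", "US", 10.0)
agg.ingest("order-1", "US", 12.5)   # late correction: total becomes 12.5, not 22.5
print(agg.totals["US"])             # 12.5
```

The same subtract-then-add idea is what makes exactly-once and update semantics so much harder in streaming systems than in batch jobs that simply recompute from scratch.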
Dhruba Borthakur:
Real-time analytics is definitely a challenge, especially when a lot of enterprises are trying to use it. Is it possible that some of the challenges could be addressed if you build a solution that is customized for real-time analytics? You talked about the batch analytics solutions that you used to have. Now, if you build a real-time analytics system just by itself, what would be the engineering trade-offs in building such a solution compared to the batch systems that you had before?
Zheng Shao:
That's a very good question. I think that's also how most companies get started with real-time analytics: basically, keep it separate from the batch system. Now, the batch systems have been optimized for big jobs at scale, and as a result they configure the system to handle big files and big blocks, and those systems, unfortunately, are not good at handling small files, just by design. Then we come to the real-time systems, where the data needs to be processed very quickly, so we have to have a lot of smaller data elements in the system design. As a result, it makes sense for us to split it out and build a separate system. That's basically how Facebook has done it. The open source community also went that way: the systems Facebook had for real-time analytics were different from the batch systems I described, and then open source later had Kafka, and later systems like Flink as well. Facebook also had streaming processing engines, and those are all specialized for real-time data processing. However, this also creates additional complexity. Right now we have a batch system and we have a real-time system. How do we put those two systems together and make sense out of it? How do we reduce the overall operational overhead for both systems together? Those are the next challenges that we have once we have the two systems, both in production.
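One common way to "put the two systems together," hinted at here but not described in detail in the episode, is to serve the stable batch result up to a cutoff and layer the real-time tail on top. The sketch below is a hypothetical illustration of that pattern, not a description of any specific system at Facebook or Uber.

```python
# Hypothetical sketch: merge a nightly batch view with the real-time tail.
# Events at or before `batch_cutoff` are assumed to already be in the batch counts.

def merged_counts(batch_counts, batch_cutoff, stream_events):
    """batch_counts: {dimension: count} produced by the batch job up to batch_cutoff.
    stream_events: iterable of (timestamp, dimension) seen by the streaming pipeline."""
    result = dict(batch_counts)
    for ts, dim in stream_events:
        if ts > batch_cutoff:  # only add events the batch job has not seen yet
            result[dim] = result.get(dim, 0) + 1
    return result

batch = {"US": 1000, "IN": 800}
events = [(1700000100, "US"), (1700000200, "IN"), (1699990000, "US")]
print(merged_counts(batch, batch_cutoff=1700000000, stream_events=events))
# {'US': 1001, 'IN': 801}  -- the pre-cutoff event is already in the batch view
```

Keeping the cutoff, retries, and late data consistent across two independently operated systems is exactly the operational overhead Zheng calls out.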
Dhruba Borthakur:
Batch systems typically, probably like you said, did things in big batches, on big blocks or big files, so it was a different architecture there. For real-time analytics, data is coming in and we need to make decisions in real time on that data. What are your thoughts about doing write-time aggregations, or pre-cubing essentially? When the data is coming into the system, is it possible for the real-time analytics system to do some aggregations as the data comes in, or create some cubes, maybe smaller-sized cubes, so that at query time you can at least find these things quickly and easily? What are your thoughts in this area?
Zheng Shao:
This question is actually very close to my heart, since cubing was one of the main topics of my research back in graduate school. At that time I thought cubing could solve a lot of problems, but when you get into a company, what happens is cubing can only solve, I would say, maybe 10 or 15% of them. The reason, first of all, is that if we know the query exactly, we can do cubing, but in reality the number of dimensions in a business setting is usually huge, so cubing over a high number of dimensions is not possible at all. Then, if we lift our thinking one step higher, cubing is just one very specialized optimization technique, and cubing is a subset of the materialized views that database engines provide. The big question will be, can we hide these cubes and all these optimizations from the user's perspective, so the user doesn't have to worry about this? They still write SQL or whatever language allows them to express the business logic as simply as possible. Then the backend system will figure out which cubes need to be built, which indexes to keep, and then make the queries run fast and efficiently. I think that's a very good challenge. I think the whole industry is still exploring how to do this really well. So, I think we are in a very early stage right now.
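As a toy illustration of the idea Zheng lands on, hiding cube maintenance behind the query interface, here is a hypothetical Python sketch: the user just asks for a group-by, and the engine answers from a pre-built cube when the requested dimensions are covered, falling back to the raw rows otherwise. None of the names here refer to a real engine.

```python
# Hypothetical sketch: answer group-by queries from a materialized cube when possible.

ROWS = [  # raw fact rows: (country, device, revenue)
    ("US", "ios", 5), ("US", "android", 3), ("IN", "ios", 2), ("IN", "ios", 4),
]
DIM_INDEX = {"country": 0, "device": 1}

# A cube materialized on (country,): maps dimension tuple -> aggregated revenue.
CUBES = {("country",): {("US",): 8, ("IN",): 6}}

def query(group_by):
    key = tuple(group_by)
    if key in CUBES:                      # served from the pre-aggregated cube
        return dict(CUBES[key])
    out = {}                              # no matching cube: scan the raw rows
    for row in ROWS:
        k = tuple(row[DIM_INDEX[d]] for d in group_by)
        out[k] = out.get(k, 0) + row[2]
    return out

print(query(["country"]))            # answered from the cube
print(query(["country", "device"]))  # no cube for this shape, falls back to a scan
```

The hard part the industry is still exploring is the piece this sketch hand-waves: deciding automatically which cube shapes are worth building and keeping fresh.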
Dhruba Borthakur:
I like how you're focusing on how you can make life easy for the user, because then they don't have to do this separately; it's part of the system that they're using. Another idea that an analytics system like Rockset uses: when a query comes in, as part of the query planning we can potentially figure out which join to do, maybe a broadcast join. But as part of executing that query, the system can automatically find that maybe this is not the right type of join to be done. Maybe I can switch the join algorithm and do a different kind of join based on the dataset. Again, to make life easy for the user, the user doesn't have to do it explicitly. Have you seen systems do this in general? What are your thoughts about a system which can do this kind of execution optimization, or query-planner optimization, hand-in-hand with the execution of the analytical query?
Zheng Shao:
That's also a very good question. I'm actually looking for such a system, because I think it would be especially useful for, let's say, streaming analytics, which is part of the real-time analytics domain. The challenge there is that traditional batch queries restart every hour or every day, so the query optimizer gets the opportunity to come in, look at the statistics, and re-plan the query every single time the query runs. However, for streaming analytics, the query is already running. That means the optimizer only gets one chance to optimize the execution plan before the query gets started. While the query is running, there's no way to change the query plan anymore. This is how the traditional architecture of streaming analytics is designed. I think that needs to be improved, and we can take some ideas from just-in-time compilers, for example in a language like Java. They take the runtime information, profile the hot paths of the program, find out which code paths get executed more frequently than the others, and then optimize those. Similarly, here I think we will need a just-in-time compiler which runs together with the streaming job, takes the profile from the job, and decides when to switch, for example, from a sort-based, sort-merge join to a hash join. That would be really awesome, but I haven't seen such a system in the industry or in the open source community yet.
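What Zheng is wishing for, a streaming engine that re-plans while the job runs, could look something like the following hypothetical decision function: the planner is re-invoked periodically with runtime statistics and switches join strategies when one side turns out to be small. This is a sketch of the idea under assumed names, not an existing engine's API.

```python
# Hypothetical sketch: pick a join strategy from statistics observed at runtime,
# rather than from a one-shot plan made before the streaming job started.

def choose_join(left_rows_observed, right_rows_observed, broadcast_threshold=10_000):
    """Return a join strategy based on runtime row counts."""
    if min(left_rows_observed, right_rows_observed) <= broadcast_threshold:
        # The small side fits in memory on every worker, so build a hash table there.
        return "broadcast_hash_join"
    # Both sides are large: fall back to a spillable sort-merge join.
    return "sort_merge_join"

# A streaming engine would call this periodically as the stream's statistics drift.
print(choose_join(left_rows_observed=2_000, right_rows_observed=5_000_000))
# broadcast_hash_join
```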
Giovanni Tropeano:
When it comes to real-time analytics then, Zheng, can we use a time series database instead? Are we able to do that? Is that an option?
Zheng Shao:
Time series databases are also a very interesting topic. My understanding looks like this: why are time series databases so popular? Two reasons. First of all, there are definitely enough use cases for them in the industry, starting from system monitoring in the old days to what people now call APM, application performance monitoring, especially because of the microservice architecture. There is also the product analytics and mobile analytics side, where they also have rich data sets with a time dimension. The use cases are definitely abundant there. At the same time, there are also enough opportunities for optimization. Time series databases can utilize techniques like data encoding and compression to compress the data to be a lot smaller than before. That's a special capability for time series data sets, and that's why time series databases are very popular. However, time series databases are also very limited. They do not support full SQL. A lot of times when people want to join the data sets from the time series database with something else, it takes a lot of effort. They need to build connectors. They need to add a layer of SQL engine on top of that to make it work. As a result, time series databases, in my opinion, cannot solve the real-time analytics problem in general. What I would be looking for is, at some point, maybe the real-time analytics systems can be generalized enough that they can take the ideas from the time series databases and put those ideas into the real-time analytics engine, so that the real-time analytics engine can handle time series data sets as efficiently as a time series database. Again, I think this is probably an area that maybe some startups are working on, but I haven't seen a very promising product yet for our use.
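The kind of encoding-plus-compression advantage Zheng credits time series databases with can be illustrated with delta encoding of timestamps, one technique such databases commonly use. The specific technique is our illustrative choice here, not one named in the episode.

```python
# Illustrative sketch: regularly spaced timestamps shrink to tiny, repetitive deltas,
# which downstream compression handles far better than the raw 10-digit values.

def delta_encode(timestamps):
    """Store the first timestamp, then successive differences."""
    encoded = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        encoded.append(cur - prev)
    return encoded

def delta_decode(encoded):
    out = [encoded[0]]
    for d in encoded[1:]:
        out.append(out[-1] + d)
    return out

ts = [1700000000, 1700000010, 1700000020, 1700000030]  # one point every 10 seconds
enc = delta_encode(ts)
print(enc)                      # [1700000000, 10, 10, 10]
assert delta_decode(enc) == ts  # lossless round trip
```

The limitation Zheng points out sits on the query side: these storage tricks do not give you full SQL or easy joins with other data sets.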
Giovanni Tropeano:
Then operationally, would you mind elaborating on some of the operational aspects of real-time analytics? Like, do you need a lot of engineers to run a system like this? Or how would you even find staff with the ability to handle such operational complexity?
Zheng Shao:
That's also a very good question. This is a real challenge in the industry, in my opinion. From my own experience, I was lucky to be at some of the highly technical companies in Silicon Valley, where we were lucky to have talented teams that work on both the software development and the operations side of this. When the engineers know the code, it becomes, of course, much easier to operate. However, at the same time, real-time analytics systems are relatively new. They are much more immature compared with traditional databases. For databases, it's easier to hire a DBA to run them, but for real-time analytics, there are so many different systems, so many different technologies, and the technologies themselves are changing so fast, let's say from Scribe to Kafka, from Druid to Pinot. It's actually going to be a very big challenge to staff an in-house team who can operate this kind of real-time analytics setup efficiently. So, frankly, I don't have an answer to your question. If there's a way we can get help to the companies in the world that are not as technical but also want to leverage the benefits of real-time analytics systems, I think that would be great. Maybe this is one of the things that Rockset as a company is looking at. I think that could be very useful for the industry.
Giovanni Tropeano:
I think the quote of our time together was, "Real-time analytics is not easy." - Zheng. I think I'm going to tweet that after this. I love that. That's great.
Giovanni Tropeano:
I think that's all the questions that we have from our side. I wanted to keep it short, sweet, and helpful, and I think we did that for our listeners. If you found this insightful, please share it. Like I said, we'd love to hear your thoughts, so comment and ask questions in our Slack channel or via Twitter. This Why Wait podcast is brought to you by Rockset. We at Rockset are building a cloud-based real-time analytics platform that can add value in situations similar to what we discussed today, so check us out at Rockset.com. We have a two-week free trial where you can receive a $300 trial credit. Play with it, break it, and see what you can build on it. We'd love to have you, so please subscribe and comment. Thank you once again for joining. Thank you, Zheng. Thank you, Dhruba, for your time, and stay tuned for our next episode. Cheers, everyone. Have a great night.
Dhruba Borthakur:
Thanks Zheng. Bye.
Zheng Shao:
Thank you.