Rockset Podcast Episode 8: Top data trends with Ben Lorica

Ben Lorica and Dhruba talk about the top data trends that data officers, scientists, and CTOs are investigating and planning around, DataOps in ML and AI, and Gradient Flow.

About This Podcast

Gio Tropeano:

Welcome to "Why Wait? The Rise of Real-time Analytics," the Rockset podcast. We invite engineering, business thought leaders, and analytics specialists to share their stories with the world, providing insights into what your peers are doing to improve data and application analytics. I'm your host, Gio Tropeano of Rockset, and I'm here with my co-host, Dhruba Borthakur, the co-founder and CTO of Rockset. Thanks again for being with us, Dhruba. Before I introduce our guest today, if you have a question or comment on today's podcast, drop us a line at community.rockset.com or tweet at us @RocksetCloud.

Gio Tropeano:

Ben Lorica is a data scientist. He was previously the chief data scientist at O'Reilly Media and the program chair of Strata (strataconf.com), the AI Conference, and TensorFlow World, and he started anew in 2020. Currently, Ben is the host and organizer of the Data Exchange podcast and runs Gradient Flow Research. Ben, huge welcome to the show, thanks for being with us.

Ben Lorica:

Glad to be here, anything for Dhruba.

Gio Tropeano:

Awesome. And so you guys know each other very well?

Dhruba Borthakur:

Yeah.

Ben Lorica:

Yeah, yeah. I mean, I think that... If I'm not mistaken, it might've even been Doug Cutting that introduced us, but anyway, I'm a big fan, obviously, of all the work that Dhruba has done over the years, and was always keen to figure out when he was going to start a company, and I was excited to see Rockset get started and thrive.

Dhruba Borthakur:

Yeah. Thanks, Ben. Thanks, really. Thanks for joining us on the show. I think the first time we met was at a Strata Conference in London, and I think Doug Cutting was also there. So the three of us were talking and figuring out some data-related issues, so that was a good time. Yeah. Today's show will focus a lot on real-time analytics, but before we get into that, I know you have had your podcast about data, machine learning, and AI, and you have talked to multiple people in this area. What are the top trends that data officers and scientists and CTOs are currently investigating and planning around in 2021?

Ben Lorica:

So there are a few that come to the top of my mind. So the first one is not really a trend, but more of a reinforcement of an ongoing concern, which is what we would call cybersecurity. But if you really analyze it, cybersecurity is information security, and information security is now data security and data privacy. I would say that's a top concern right now, especially since there are so many headlines about breaches and potential nation-state participation in cybercrime. The second trend, I don't know what to call it precisely, but Dhruba, I would... I guess, it's somewhat the ops-ification of data and machine learning. So for example, you now have these umbrella terms, DataOps and MLOps, right? So the MLOps umbrella term seems to have initially grown out of the need to be able to deploy to production, so the productionization of models, but I think the common thing with any Ops-related function, and you can trace this path all the way from ITOps to DevOps and now to MLOps and DataOps, is three areas.

Ben Lorica:

There's automation, monitoring, and incident response. If you look at both the data space and the ML space, there are all sorts of tools that allow you to automate a variety of things, monitor your models or your data pipelines, and allow you to respond to incidents, because as you both know, these are complex systems, so things fail, eventually. The point of Ops, after all, is to minimize your time to recovery, right? So there's the ops-ification, and then the other thing I'm noticing is there's a lot of language now (I don't know if you folks are noticing this) around injecting rigor into aspects of data engineering and machine learning: how you build pipelines, how you unit test or do integration tests on pipelines and ML models, because these are complex DAGs.
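
To make the unit-testing question concrete, here is a minimal sketch in Python. It is only an illustration and was not discussed on the show: the step `dedupe_events`, its field names (`event_id`, `ts`), and the test are all hypothetical. The idea is that a single pipeline step written as a pure function can be tested like any other piece of software, before it is wired into a larger DAG.

```python
from typing import Iterable


def dedupe_events(events: Iterable[dict]) -> list[dict]:
    """Hypothetical pipeline step: keep the latest record per event_id.

    Each event is assumed to be a dict with 'event_id' and a numeric 'ts'.
    """
    latest: dict[str, dict] = {}
    for event in events:
        key = event["event_id"]
        # Replace the stored record only if this one is at least as recent.
        if key not in latest or event["ts"] >= latest[key]["ts"]:
            latest[key] = event
    # Return records in timestamp order so downstream steps see stable output.
    return sorted(latest.values(), key=lambda e: e["ts"])


def test_dedupe_keeps_latest_record() -> None:
    """Unit test for the step above, runnable with pytest or directly."""
    events = [
        {"event_id": "a", "ts": 1, "value": 10},
        {"event_id": "b", "ts": 2, "value": 20},
        {"event_id": "a", "ts": 3, "value": 30},  # newer duplicate of "a"
    ]
    result = dedupe_events(events)
    assert [e["event_id"] for e in result] == ["b", "a"]
    assert result[1]["value"] == 30


if __name__ == "__main__":
    test_dedupe_keeps_latest_record()
```

An integration test would follow the same pattern but run several steps together against a small fixture dataset.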

Ben Lorica:

So how do you inject software engineering rigor, the kinds of things that you would see in other aspects of software development, how do you map that over to data engineering and machine learning? On the data analytics infrastructure, there's also a lot of excitement around just new sets of tools, right? So you folks are familiar with real-time, but there's also kind of that next-generation data lake, and full disclosure, I'm an advisor to Databricks, and one of the things that I did last year for them was that we wrote this... I think what has become kind of the reference post for lakehouses, which is kind of the new data management paradigm, which is the evolution of the data lake. Distributed computing is once again becoming important, just because of the size of the machine learning models in particular.

Ben Lorica:

I think in many ways, in some parts of the data processing or data engineering pipeline, maybe you can obviate the need for distributed computing, but for training really large machine learning models, particularly deep learning models, it seems like you really need distributed computing resources. But on the other hand, now we're talking about a much smaller number of companies who get to that level of sophistication. And looking much further into the future, and I don't know, Dhruba, if you folks are thinking about this, but multicloud. So multicloud in two senses, right? So one, just making sure you run on multiple clouds, but even looking further ahead into the future, the commodification of cloud platforms. That's a bit speculative, but imagine if your relationship is with Rockset, so you log on to Rockset. Maybe you have your data in several cloud platforms, right? And then Rockset would just basically run whatever you need to do, and you won't even need to know the details. Your relationship is completely with Rockset, for example...

Ben Lorica:

So, Ion Stoica and I actually wrote a post about this early this year, where we talked about multicloud, and what we term multicloud native platforms, which is kind of this next-generation type of cloud application where you essentially commodify the cloud platforms. Yeah. And so those are the things that come to mind. There's, obviously, Responsible AI, which is another topic in the machine learning world, which also probably impacts you folks because you're doing analytics, right? So Responsible AI being the umbrella term that collects a series of risks, right? So security and privacy, safety and reliability, fairness and bias, explainability, right? So how do you make sure that when you build data and machine learning applications, you are in fact mitigating some of these risks?

Gio Tropeano:

So, data officers have nothing going on?

Ben Lorica:

Nothing going on. So I'm curious to hear what Dhruba has to say about the whole ops-ification. The Ops umbrella term seems to have started out with IT and Dev, and now you have ML and DataOps.

Dhruba Borthakur:

Yeah, I think ops-ification is a kind of a follow-on, because people are trying to make these part of their production environment, right?

Ben Lorica:

Yeah, harden.

Dhruba Borthakur:

Harden them and make them rock solid because the business depends on it. You also talked about responding to... as a response kind of thing, right?

Ben Lorica:

Yeah, yeah. So there's automation, monitoring, and incident response. Yeah.

Dhruba Borthakur:

Incident response. So response also means kind of a flexible backend, because sometimes you need to do a lot of investigation before you can respond, for example, right? So all these things are becoming part of the data pipeline because they are becoming kind of mission critical for most enterprises. We see that a lot, and I mean, with Rockset, a lot of people use it for those areas. You were about to ask something, Gio?

Gio Tropeano:

Yeah. I'm recalling the conversation that we had in episode 5 with the Chief Data Officer of Farfetch, which is a UK e-commerce brand, and he was saying how data culture is now permeating the entire organization in some way, shape, or form, from cybersecurity and how that protects the data, to the operations throughout the company and how data is handled. It really reminds me of the saying that software was eating the world five or 10 years ago, and now data is eating the world, so it's really interesting how analytics is coming into play in this. Ben, how do you see real-time analytics affecting the evolution of data owners and groups within companies?

Ben Lorica:

So, I mean, to me, in many ways, real-time is a continuation of the same sorts of challenges. So for example, even if you set aside real-time itself, what's happening today is that you now have an expanding pool of workers, frontline workers, executives... not everyone has to be technical these days. You now have BI tools that maybe incorporate even the ability to build simple models, so people can do a lot on their own these days. So as you expand the number of people who rely on data, then they start depending on data, and so basically, you have some similar challenges regardless of whether or not it's real-time.

Ben Lorica:

The one thing I would say about real-time though is that as people depend on data to make these decisions and predictions and to run operations, then they really end up changing the way they do things, so they get dependent on data that's fresh and reliable. So to that end, I would say one of the main challenges for real-time in this regard is... Again, it goes back to a similar challenge in the non-real-time world, but probably more pressing in real-time: data quality. So if you don't have the tooling or the processes in place to spot and address data quality issues, then you might have, inadvertently, made decisions based on wrong data, right? And then the things that I alluded to earlier, kind of the operational challenges, become even more challenging with real-time.
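
As a rough illustration of what "tooling to spot data quality issues" can mean for fresh data, here is a minimal sketch in Python. It was not discussed on the show, and everything in it is an assumption: the function `check_batch`, the field names (`event_time`, `user_id`, `amount`), and the five-minute freshness threshold are hypothetical. It checks a batch of records for staleness and for missing required fields before anyone makes a decision on it.

```python
from datetime import datetime, timedelta, timezone


def check_batch(
    records: list[dict],
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),    # assumed freshness threshold
    required: tuple[str, ...] = ("user_id", "amount"),  # assumed required fields
) -> list[str]:
    """Return a list of human-readable data quality issues (empty if none)."""
    if not records:
        return ["batch is empty"]

    issues: list[str] = []

    # Freshness: the newest record should be recent enough to act on.
    newest = max(r["event_time"] for r in records)
    if now - newest > max_age:
        issues.append(f"stale data: newest record is {now - newest} old")

    # Completeness: count records missing each required field.
    for field in required:
        missing = sum(1 for r in records if r.get(field) is None)
        if missing:
            issues.append(f"{missing} record(s) missing '{field}'")

    return issues


if __name__ == "__main__":
    now = datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc)
    batch = [
        {"event_time": now - timedelta(minutes=20), "user_id": "u1", "amount": 9.5},
        {"event_time": now - timedelta(minutes=15), "user_id": None, "amount": 3.0},
    ]
    for issue in check_batch(batch, now):
        print(issue)
```

In a real-time setting a check like this would typically run continuously and feed the monitoring and incident response loop mentioned earlier, rather than being run by hand.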

Ben Lorica:

Yeah. So think about one thing specifically: data pipelines, right? So now you have a real-time data pipeline, and you've got to modify it. You've got to write new code, debug it, unit test it, do integration testing, and then roll it out, and then if there's a mistake, roll it back. I mean, if it's a real-time pipeline, that becomes even more challenging, right? And then there are challenges for the end users themselves in many ways, but those are more cultural and organizational challenges, right?

Ben Lorica:

So for example, what if you are in a position as an IT team to deliver really good, high-quality, real-time data, but the organization doesn't have a data strategy in place to understand what to do with these new streams of data, and how to interpret uncertainty in these new streams of data? Yeah. And also just shifting the mindset from relying on monthly reports or weekly reports or daily reports to something more fresh. But the engineering challenges, I would say, are similar, but probably more high stakes in some ways, if you know what I'm saying, because basically, if the decisions have tremendous implications, and your real-time DataOps is not delivering, then you're screwed, right?

Gio Tropeano:

So, you, Ben, have a podcast. Why don't you tell us a little bit about your podcast? Tell us about Gradient Flow, and where people can find out more about you and your projects?

Ben Lorica:

Yeah. So I started the podcast basically as I was leaving O'Reilly, because actually Dhruba was a guest on my O'Reilly Data Show. So I actually started podcasting at O'Reilly before podcasting was a thing, and then when I was leaving O'Reilly, I decided, well, I should probably just keep this going, right? So I rebooted it as an independent podcast. You can find it at thedataexchange.media, and it's basically... It's a way for me to learn along with the audience about new topics, and the focus is data, machine learning, and AI. And accompanying that, I set up kind of a market research, or I guess consulting, firm, because I do some data science for people who want me to do data science, but it's mostly strategy, content marketing, consulting, and even... So we even do events around this consulting company that I call Gradient Flow, which you can find at gradientflow.com.

Ben Lorica:

So there we have reports and surveys and blogs on the topics of data, machine learning, and AI, and in fact, I'd like to plug an open survey we have right now, which this audience would probably be great survey respondents for, it's a survey on data engineering, and you can find it at gradientflow.com/2021desurvey.

Gio Tropeano:

Awesome. We'll put that in the links to the podcast as well when we go live. Perfect. All right. Well, that's all we had for today. Wanted to keep it short, succinct, and to the point. Ben, I appreciate your time. Thank you so much for being with us and for your insights, and Dhruba, thank you as well for your insights. The "Why Wait?" podcast is brought to you by Rockset. Here at Rockset, we're building a real-time analytics cloud-based platform. Check us out at Rockset.com. You can give us a try for free for two weeks, with $300 in free trial credits included. Thank you once again, gentlemen, for joining. Thank you to the audience for joining, and stay tuned for our next episode.

Dhruba Borthakur:

Thanks a lot.
