Rockset Podcast Episode 9: Data Analytics and Modeling with Connectly CTO

Yandong Liu, the co-founder and CTO of Connectly.ai discusses data analytics and modeling, the importance of semi-structured data when dealing with streaming events, data reliability, deploying good data modeling into production and investing in data infrastructure.

About This Podcast

Gio Tropeano:

Welcome back to Why Wait, The Rise of Real-Time Analytics podcast, where we invite engineering and business thought leaders and analytics specialists to share their stories with the world, providing insights into what your peers are doing to improve data and application analytics. I'm your host Gio Tropeano of Rockset. I'm here with my cohost Dhruba Borthakur, the co-founder and CTO of Rockset. Thank you again for being here with me, sir. Before I introduce our guest today, if you have any questions or comments on today's podcast, drop us a line, we'd love to hear from you at community.rockset.com, or tweet at us @Rocksetcloud on Twitter. With us today is Yandong Liu, the co-founder and CTO of Connectly.ai, a Silicon Valley conversational commerce platform enabling businesses to communicate with their customers via SMS, WhatsApp, and Messenger. Prior to Connectly, Yandong was CTO of Strava, where he was responsible for the technical efforts focused on data and machine learning.

Gio Tropeano:

And prior to Strava, he held the title of VP of engineering at NetEase, and he also led architecture development of the machine learning platform at Uber. A warm welcome to you from the Why Wait podcast. Thank you for being with us. I'm going to kick it off with the first question. You've worked with data modeling for a long time, Yandong. Please describe the importance of semi-structured data in dealing with streaming events. How do you deal with records in a data set that are partial or damaged? What tools do you use to clean up data before models can be built on it? And there's a related question here: how long is the delay from when the data actually arrives to the time when it can be used for modeling?

Yandong Liu:

Arguably, today, it's really difficult to build a successful business without having good data. Depending on the nature of the business and what kind of core value you're providing to customers, data really plays an important role at all the companies I have worked at. I would say data really forms the foundation of the business, including at Connectly, Strava, Uber, et cetera. For example, Strava is all about logging your running and cycling data, coupled with GPS, weather, terrain, and map data. We turn that into a beautiful visualization with some nice stats for your activities, which you can look at later.

Yandong Liu:

At Connectly, we integrate with all the top messaging platforms, such as WhatsApp, Messenger, and SMS. In other parts of the world, it could be something different. And also payment tools, so that we can then analyze the communication between businesses and their customers, and that informs various business decisions. It can be prioritizing certain conversations, automating the work, or just re-engaging the customers in an automatic fashion.

Yandong Liu:

So speaking of bad, damaged, or partial data, that always happens. For example, the Strava app works with hundreds of different wearable devices. And then, you can imagine, they all report data in very different ways. They're probably using different sensors, even for two people running next to each other, because they are wearing different devices, and the quality of the data can vary significantly, right? This also applies to other types of data: time, distance, elevation, heart rate. So the data is never 100% reliable. And we're probably going to run into similar issues regarding data at Connectly as well, as we add users over time.

Yandong Liu:

So there can be different treatments of bad or partial data. You can ignore the data: if you find the data is just bad due to, for example, a sensor malfunction, it is probably damaged from the very beginning, so you can choose to just throw the data away because there is really no way to recover it. Or you can sanitize it, because you have multiple other streams of similar data. You can probably use some additional signal. For example, with a GPS stream, you have the GPS coordinates, but you also have OpenStreetMap or other external signals that can help recover and correct the data. Or you can look the data up, right? For weather, you can always pull weather data from some other source, go back to that point in time, and recover the data. Those are just a few examples of what you can do.
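[Editor's note: the strategies Yandong lists, dropping unrecoverable records and sanitizing with plausibility checks, could be sketched roughly like this. This is an illustrative GPS-stream cleaner, not Strava's actual pipeline; the field layout and thresholds are made up.]

```python
# Illustrative cleanup of a GPS activity stream: drop physically
# implausible points (sensor malfunction), keep the rest.
def clean_gps_stream(points, max_speed_mps=30.0):
    """points: time-ordered list of (timestamp_sec, lat, lon) tuples.
    Returns the stream with out-of-range and implausibly fast points removed."""
    cleaned = []
    for ts, lat, lon in points:
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue  # coordinates out of range: damaged from the start
        if cleaned:
            prev_ts, prev_lat, prev_lon = cleaned[-1]
            dt = ts - prev_ts
            # crude flat-earth distance in meters (fine for small deltas)
            dist = ((lat - prev_lat) ** 2 + (lon - prev_lon) ** 2) ** 0.5 * 111_000
            if dt <= 0 or dist / dt > max_speed_mps:
                continue  # faster than plausible for running/cycling: drop
        cleaned.append((ts, lat, lon))
    return cleaned
```

A recovery step, as he mentions, could then fill the dropped gaps by interpolation or by snapping to OpenStreetMap road geometry.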

Yandong Liu:

And you asked about the delay. I think, ideally, you want the data latency to be as low as possible. But it's also a decision. Sometimes it is not only a technical decision, but also a business decision you have to make, depending on the nature of the job and the user experience: how low you'd like the latency to be, and how much in resources you can afford. For example, right now at Connectly, our latency is fairly low. It's milliseconds, a few milliseconds, because we are starting with a super scalable and highly performant architecture. We're using Golang as the backend programming language, and we are also using an event-driven architecture. So everything is super fast, but we are also small today, so I'm pretty sure we'll face some challenges later on.

Yandong Liu:

I remember at Uber, working on its machine learning platform, our P99 delay was under 100 milliseconds. P99 meaning, 99% of the time, the delay was sub-100 milliseconds. That's actually pretty nice, especially for an online prediction job. You want the model to make a decision right there for the user, so essentially they don't feel the lag between the data arriving and the user experience.
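[Editor's note: P99 is just the 99th percentile of observed latencies, the value that 99% of requests come in under. A minimal nearest-rank computation over a latency sample, with made-up numbers:]

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    ordered = sorted(samples)
    # ceil(n * p / 100) via negated floor division
    rank = max(1, -(-len(ordered) * p // 100))
    return ordered[int(rank) - 1]

# Hypothetical request latencies in milliseconds
latencies_ms = [12, 18, 25, 40, 95, 15, 22, 30, 11, 80]
p99 = percentile(latencies_ms, 99)  # the value 99% of requests stay under
```

A P99 SLO of 100 ms then means checking `p99 <= 100` over each monitoring window.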

Yandong Liu:

But in some cases, you have to push the latency even lower. For example, self-driving. There are two parts. One is offline training: you can probably take as long as you want, because you want to absorb that large volume of data and turn it into high-performance models. But online, imagine that you have sensors, the cameras that capture the data. You would like the delay from the moment they capture something, whether it's an image, a sound, or some other laser input, to a decision, right? Whether you keep driving, stop, brake, or turn, you would like that latency to be extremely low. So there, you probably have to make some trade-offs between the hardware aspects and the latency to get there. That's something you cannot sacrifice. Yeah, eventually, I think it's a combination of both technical and business considerations.

Yandong Liu:

So we founded Connectly about seven months ago with the idea of bringing messaging to the business world. If you look at messaging today, what have we got? We've got WhatsApp. We've got Messenger. We've got probably a long tail of other tools for the rest of the world, mostly for personal communication. I know in Asia people use WhatsApp... sorry, WeChat. In Japan and Korea, people use something else. In a way, we have enough personal communication tools. We use Slack and Microsoft Teams for work communication, meaning B2B, business-to-business messaging. But B2C messaging, meaning that you should be able to talk to a business via messaging the same way you talk to family or friends, that use case seems to be lagging behind. That's exactly the problem we want to solve.

Yandong Liu:

We start off by introducing an omnibox that integrates all the top messaging platforms on the market, including what I just mentioned: Messenger, WhatsApp, Instagram, SMS. Don't forget SMS is still the number one chat app, at least in the U.S. So as a business owner and operator, you'll be able to receive and get back to customers in one unified place, even though the customers might come to you from different channels. You don't want to be selective when it comes to growing the business; you want to be able to talk to anyone who is interested. So the unified inbox, that's the number one ask from our user studies.

Yandong Liu:

And then we're going to introduce transactions, so you can pay or charge people right there in the messaging thread. You can imagine this is important, especially for certain verticals and use cases, like e-commerce. You've been talking for a while and asking about the product. You finally make the decision to purchase, and you want to pay right there, or you want to get paid right there.

Yandong Liu:

Then next, after that, we might introduce other supporting functions, for example scheduling, CRM, chatbots: anything that could help with customer engagement and business growth, to automate the work or just let you do the same work 10X more efficiently.

Yandong Liu:

You asked about the challenges. I think the challenges come from different places. First, there are always technical challenges. We need to integrate with so many external vendors. Some vendors are more open; some are a little hard to work with, but we'll get there. There are also other technical aspects of dealing with messaging. The bottom line, at the very least, is that you don't lose any messages, right? And ideally, you should have very low latency. We've talked about latency, and I think when it comes to messaging, it really affects the user experience. We also pay very strong attention to privacy and security, to make sure this is a platform that you can trust with your personal data. So that's the technical side.

Yandong Liu:

And when it comes to product, I want to make sure that our users have a unified and consistent user experience. Because under the hood there are probably 10, 20, 30 different vendors, and we should make it so easy and intuitive for our end users that you don't even feel like there are other platforms underneath.

Yandong Liu:

And I think the biggest challenge is something that we're really working hard on: how do we add value for both the businesses and their customers? Because we're a two-sided market, we have to serve both sides well. We need to figure out where they are having trouble today, or where they are doing the work but not very efficiently, or where they need better insights so they can better steer their business. There could be a million different things we could help with. But choosing the few that really work, where we truly add value to our users, I think that's going to be our winning strategy, and that's what we are focusing on today with our small team.

Dhruba Borthakur:

Thanks for describing that, Yandong. I have worked with you, or I have known you, for a long time now, especially the time when you were building Michelangelo at Uber. As far as I understand, I read about it when you published blog posts about that piece of software. That's great. I know that it runs machine learning on large datasets and produces good models and refines them over time. Could you tell me a little bit about how you deploy these models into production? Let's say at Uber, you are using Michelangelo to build large models. Now, building is one problem, but putting them in production, deploying them to serve real user traffic, is sometimes challenging and difficult. So how important was it to deploy these models in production at Uber when you were working there, especially in real time, without delay in the data as well as the model deployments?

Yandong Liu:

Yeah. Like you said, Dhruba, it's super important, especially for some of the applications we were doing at Uber. There are a couple of different businesses at Uber. For example, there's ride-sharing, Uber Eats, self-driving, et cetera. And for each of those businesses, there are probably a million different models running, just doing different things. When you open the Uber app, you will see the ETA prediction: how far the driver is from you, how long it takes that person to get to you. That's just one example among many. For Uber Eats, there's restaurant and dish recommendation. Self-driving, of course, is about taking what the cameras and other sensors collect and translating it into signals so that the machine can make a decision. So depending on the nature of the application, I think freshness and low latency can be really important.

Yandong Liu:

For example, for ride-sharing, for Uber ETA: whenever you, as a customer, open up the app, you want to see the ETA prediction ASAP, with a delay of a second or two at most. Uber Eats, similarly: you want fresh recommendations, so that whenever you click or tap around, you see different recommendations. For self-driving, it has to be instant. It has to be instant; otherwise you make mistakes. So that's how important it is to make real-time predictions and real-time decisions right there. One part of this low latency is the model: you really have to design the model so that it can act and predict instantly. Another part is the data pipeline. We all know that a model works in such a way that you need a good model plus the input, and the data pipeline is also a key part of a machine learning application.

Yandong Liu:

You have to make sure that you are able to pipe fresh runtime production data into the model really quickly, so that you can minimize the overall delay of your prediction. So yeah, we did a lot of work at Uber to reduce latency, especially on the data part, as much as we could. I'm saying this because a lot of people focus on the modeling or model architecture of a machine learning application; I don't know if people pay a similar amount of attention to the data quality and data freshness part. We actually did, a lot. For example, we built a real-time feature store. The idea is that you can train the model offline in whatever way you like, as a batch job where you can load a huge amount of data to train a great model. But in order for it to work online, you have to make sure that all the input features are piped into the model in a matter of milliseconds, and some models are really complicated.
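[Editor's note: at its simplest, a real-time feature store is a low-latency lookup keyed by entity, written by both batch and streaming jobs and read by the model server at prediction time. A toy in-memory sketch; Michelangelo's actual store runs on far heavier infrastructure, and all names here are illustrative.]

```python
import time

class FeatureStore:
    """Toy in-memory feature store: (entity_id, feature_name) -> value.
    Batch jobs and stream consumers write; the model server reads."""
    def __init__(self):
        self._table = {}  # (entity_id, name) -> (value, written_at)

    def put(self, entity_id, name, value):
        self._table[(entity_id, name)] = (value, time.time())

    def get_vector(self, entity_id, names, max_age_sec=300.0):
        """Assemble a model input vector; None for missing or stale features."""
        now = time.time()
        vector = []
        for name in names:
            entry = self._table.get((entity_id, name))
            if entry is None or now - entry[1] > max_age_sec:
                vector.append(None)  # missing or too stale to trust
            else:
                vector.append(entry[0])
        return vector
```

The staleness cutoff is the point he makes about freshness: a feature the store cannot serve fresh is treated as missing rather than silently fed to the model out of date.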

Yandong Liu:

I can give you a few examples. Some features are relatively straightforward. For example, for ride-sharing, you can use the current location, the weather, the kind of traffic, which you can pre-compute. Others are dynamic features. For example, one interesting feature I remember is: for this driver, out of his or her last 100 trips, for how many was there a call following the trip? That could be an interesting signal, depending on what kind of prediction you're making, and that's a dynamic feature. In order for that to work, there's actually complex business logic, so we had to build the feature infrastructure in such a way that it supported complex computation on the fly. And that really helped a lot. And I can give you another example of that.

Yandong Liu:

Some features are like a sliding window: you can see how many trips we have in this geo area in the last five minutes. When you say it's five minutes, it had better reflect what actually happened in the last five minutes. Because you actually train the models offline, you always make sure you can get everything right offline. So in order for the model to actually work, the offline and online data have to be consistent. We did lots of work around that. It's probably some of the best work we did to help boost business growth in some areas with machine learning.
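[Editor's note: a sliding-window feature like "events in this geo area in the last five minutes" is typically maintained incrementally. One way to get the offline/online consistency Yandong describes is to run the exact same window logic both at serving time and during offline replay of the event log; a minimal sketch, with illustrative parameters:]

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events in a trailing time window, e.g. trips per geo cell
    in the last five minutes. Running the same class online and in
    offline replay keeps training and serving feature values consistent."""
    def __init__(self, window_sec=300):
        self.window_sec = window_sec
        self.events = deque()  # event timestamps, oldest first

    def record(self, ts):
        self.events.append(ts)

    def count(self, now):
        # evict everything that fell out of the window before reading
        while self.events and self.events[0] <= now - self.window_sec:
            self.events.popleft()
        return len(self.events)
```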

Dhruba Borthakur:

Yeah. I can visualize how the whole Uber backend depends a lot on fresh data, because it's a logistics application. You can't miss deadlines: people are waiting for cars to arrive or for prices to pay. You also worked on the data growth initiative at Uber. This is more about increasing the number of people who use Uber, drivers or passengers. Tell me a little bit about that initiative and how real-time analytics might have played a critical part in making some of the growth initiatives successful at Uber.

Yandong Liu:

Absolutely. So I was on the growth team for a long time, right after I joined Uber. Growth was the top priority for Uber for a long time, including the sign-up work, onboarding, promotions, background checks, et cetera. For growth to work, we relied heavily on data analytics for product insights, service quality, and the overall health of the company's growth work. All in all, data was key in our decision-making process. Obviously, we needed aggregate data to make sure that we were on track to hit our weekly and monthly goals. At the same time, we were also collecting lots of insights from our real-time analytics to guide our day-to-day operations and decisions and to react to various problems. For example, we designed the sign-up flow to be a state machine; the whole sign-up process actually has fairly complicated business logic.

Yandong Liu:

For example, a general flow might be: you sign up, fill out the form, and upload the documents. Documents can be a driver's license, medical records, or even other documents, depending on national and state regulations. Then, if everything looks good, we would kick off the background check process on you. That background check has multiple steps itself and involves working with various external providers. Then, if the background check passes, we would inspect your vehicle and make sure it's in good condition. So you can tell the process isn't that straightforward. Getting a real human on the road providing rides for other real human beings is not an easy job. And there are many opportunities in this whole process where it could go wrong or people could get stuck.
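[Editor's note: the flow Yandong describes maps naturally onto an explicit state machine, where each application sits in exactly one state and only whitelisted transitions are legal. That is also what makes the funnel analytics he mentions next easy: a stuck application is always attributable to a specific step. The states and transitions below are illustrative, not Uber's actual flow.]

```python
# Illustrative driver sign-up state machine.
TRANSITIONS = {
    "signed_up": ["docs_submitted"],
    "docs_submitted": ["docs_rejected", "background_check"],
    "docs_rejected": ["docs_submitted"],          # user re-uploads
    "background_check": ["check_failed", "vehicle_inspection"],
    "check_failed": [],                           # terminal
    "vehicle_inspection": ["inspection_failed", "active_driver"],
    "inspection_failed": ["vehicle_inspection"],  # re-inspect
    "active_driver": [],                          # terminal: on the road
}

class Application:
    def __init__(self):
        self.state = "signed_up"
        self.history = ["signed_up"]  # the per-step funnel record

    def advance(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)
```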

Yandong Liu:

So by having real-time analytics, we were able to tell the status of our growth funnel at a micro level, so that we could react properly, whether to fix a problem or just unstick that person's application. That was actually extremely important. Also, in reality, oftentimes when people signed up to drive for Uber, they wanted that job right there, in that moment, because they wanted to make some money. They wanted to make ends meet. So our job was really to enable them to do so ASAP in a compliant way. In that sense, every second counts. So it was really a combination of automation and human operations guided by our real-time analytics.

Dhruba Borthakur:

Yeah. Even when I was working at Facebook, we used to depend a lot on real-time analytics to make sure people were engaged and we got more users into the system. Taking a step back from your Uber experience: you were also the VP of engineering at NetEase. That's a place where I know you did a lot of experiments, creating an A/B testing framework so that you could test different models or different inputs and figure out how to handle, say, content blacklisting or other use cases that could benefit from this kind of testing, to figure out what exactly to block, for example, in the NetEase backend. Tell me a little bit about how important it was for you to build these kinds of A/B testing frameworks, which depend on real-time data, and how important real time is for building these experiments and deploying them in production.

Yandong Liu:

Absolutely. So at NetEase, we were building one of the most popular news apps in China, and we were competing with other similarly large tech companies like Baidu and ByteDance. At a certain point, news apps were really popular and all those large companies had one. At that time we had over 25 million daily active users; today NetEase might have even more. And they were fairly active. On average, they would spend over one hour on our platform, just reading and consuming all kinds of content, including videos, articles, tweets, all kinds. Because it's a news app, we tried to push the most up-to-date content into the hands of users, and we wanted to minimize the delay between the moment the content is created and uploaded to our platform and the moment we recommend it to users in a personalized way.

Yandong Liu:

And it's not straightforward. First, there's the scale: 25 million daily active users, with over a million different items uploaded every day. And the format of the content varies a lot. It can be articles, videos, polls. The data processing pipeline is also very complicated. In that process, we had data cleaning, enhancement, natural language processing, computer vision, all of that. For example, people could upload just random content; you need to identify that and get rid of it. Or people upload videos with low resolution, or with watermarks, with blanks here and there, and then we need to remove those and keep the useful parts.

Yandong Liu:

And that's not even to mention the machine learning pipeline, or the data pipeline in general, just to extract the interesting signals or features. For example, we needed to parse the text so that we knew what the topics of an article were. Video is even more involved: we used deep learning pipelines to extract the content, even just to identify, for example, human faces, doing face recognition for celebrities and adding those as interesting features.

Yandong Liu:

On the other hand, we built machine learning models to recommend that content, and because the models can vary, and because we wanted the product experience to be a positive one, we were running hundreds of experiments at the same time. So we relied on the A/B testing framework to make quick product decisions, so that we knew which signals were positive, to help with engagement and the long-term retention of our products. We used A/B testing and analytics on a daily basis to make those critical decisions. You mentioned content blacklisting; that's also extremely important. Just one bad actor can ruin your platform, so we took it very seriously. And it's really a combination of product decisions and technology decisions. For example, we often ran into vulgar, duplicate, illegal, hateful, or abusive content.

Yandong Liu:

Some are just red lines. Some are easy to decide; some are a little bit vague. So you need to make it a protocol, and then build the technology to exactly implement your product decisions in that content blacklisting process. Because of the scale, a million items is really hard to handle manually, so we had good automation in place, mostly machine learning models deciding whether an item crosses the threshold, combined with heuristics and rules. Sometimes you do need rules. For example, when the machine has fairly high confidence, you can probably just use it to make the decision and remove the content directly. Sometimes it's somewhere in the middle: you don't want to block good content, especially from creators who might have spent lots of energy and time creating it.
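[Editor's note: the combination described here, hard rule overrides, auto-removal at high model confidence, and human review in the ambiguous middle band, is a common moderation pattern. A hedged sketch; the thresholds and the rule list are invented for illustration.]

```python
def route_content(ml_score, text, auto_remove_at=0.95, review_at=0.6):
    """Route an item using a model's bad-content probability plus hard rules.
    Returns 'remove', 'human_review', or 'allow'.
    Thresholds are illustrative, not any real platform's policy."""
    BANNED_PATTERNS = ["<illegal-pattern>"]  # hypothetical red-line rules
    if any(p in text for p in BANNED_PATTERNS):
        return "remove"            # red line: no model needed
    if ml_score >= auto_remove_at:
        return "remove"            # model confident enough to act alone
    if ml_score >= review_at:
        return "human_review"      # ambiguous: don't block a creator's work
    return "allow"
```

The `human_review` branch also feeds the loop Yandong mentions next: operator decisions become fresh labeled training examples for the models.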

Yandong Liu:

So you would send those to human operators, and they will make a decision. In that process, we not only try to make good decisions, but also generate more training examples for our machine learning. So that's the production side. I mentioned some of the signals, but interestingly, the most useful signals come from the community. You can imagine: we have those report or flag buttons, which people often use, and that's extremely helpful in identifying bad content promptly. So again, it's real-time analytics: we find all the signals and translate them, convert them into machine learning features, or just send them directly to human operators. The sooner we can identify and get rid of bad content, the less damage we cause as a platform to our wide audience.

Gio Tropeano:

Yeah. You speak of bad actors. Lately in the news, there have been two really big cyberattacks, ransomware attacks specifically, that have caused a lot of concern, not just in the B2B space but with the public as well. How important is it to have mutable datasets, so that you can delete small pieces of data to be compliant with various regulations, like the right to be forgotten and so on? Are these compliance requirements difficult to adhere to when the data set is read-only?

Yandong Liu:

Yeah. Super hard. I would say it's really hard to be compliant if you have a read-only data system. We all know that there has been increasing awareness of data privacy and security over the last couple of years, and when your data storage is designed in that read-only way, it is pretty hard to address compliance issues and meet users' specific requests. For example, under GDPR and CCPA, users are entitled to the removal of their personal data. With read-only data storage, that's just pretty hard to achieve. Well, you can work around it, but it's just not that straightforward.

Yandong Liu:

That's why here at Connectly, we're adopting privacy-by-design principles, so that we can provide safe and secure default data policies for our small and medium business partners, simply because most of them are not able to do so themselves. We're considering all aspects of data security in the early stages of our development, enabling our customers to access, rectify, erase, and export their data. For example, we have a very strong data processing agreement. Access isn't open; it should be need-based, not only for our internal Connectly employees but for their own employees as well.
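[Editor's note: in a mutable store, an erasure request under GDPR/CCPA can be a direct delete plus an audit record, in one transaction; an append-only store would instead need tombstones or crypto-shredding. A toy sketch of the mutable case using SQLite; the schema and table names are hypothetical, not Connectly's.]

```python
import sqlite3

def erase_user(conn, user_id):
    """Honor a right-to-erasure request: delete the user's personal data
    and keep a minimal audit record that the request was fulfilled."""
    with conn:  # one transaction: either all statements land or none do
        conn.execute("DELETE FROM messages WHERE user_id = ?", (user_id,))
        conn.execute("DELETE FROM profiles WHERE user_id = ?", (user_id,))
        conn.execute("INSERT INTO erasure_log (user_id) VALUES (?)", (user_id,))

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profiles (user_id TEXT, email TEXT);
    CREATE TABLE messages (user_id TEXT, body TEXT);
    CREATE TABLE erasure_log (user_id TEXT);
    INSERT INTO profiles VALUES ('u1', 'a@example.com');
    INSERT INTO messages VALUES ('u1', 'hi'), ('u2', 'yo');
""")
erase_user(conn, "u1")  # u1's rows are gone; u2's data is untouched
```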

Yandong Liu:

Not everyone has to know everything in order to serve customers. And we provide strong end-to-end encryption. So yeah, overall, we were able to do it because we understood the importance of compliance and designed our data engineering that way from the very beginning. But I've seen cases where companies have been around for a long time. They probably started out without knowing this, before a lot of the regulations came out, and their systems were read-only. So they have to do a lot of additional work in order to be compliant, and sometimes, even now, they're not 100% there. So I think it's really important to design your system with this in mind from the beginning.

Gio Tropeano:

If you don't mind: I know I gave a quick overview at the beginning of this episode about Connectly, but would you mind just telling the audience about Connectly, what you're working on, and some of the challenges you're facing in real-time analytics?

Yandong Liu:

Absolutely. So at Connectly, we're trying to build a communication platform for businesses, especially small and medium ones, to have good, meaningful relationships with their customers. The idea behind it is that today you're able to talk to your family, friends, and even your coworkers using messaging. We probably use it on a daily basis. But it's still pretty challenging to talk to businesses in a convenient way.

Yandong Liu:

So that's why we're trying to bring messaging into the business world. Today, often when you search for businesses online, you might find their contact info on Google Maps, for example. You might see their email or phone number. Very few people use email. Phone calls are kind of awkward: they're real time, they take time, you're put on hold, you're transferred from one person to another, just repeating your story. So that's why we think messaging is the future. It's near real time. It's asynchronous. And you can do a lot more with messaging, not only communication: you can pay and get paid, you can get scheduled, you can use chatbots to automate the conversation. So that's why we're doing this, to facilitate the communication between businesses and customers.

Yandong Liu:

You mentioned challenges. Well, first there are technical ones. Messaging itself is hard. Getting messaging right is quite difficult, especially when we try to work with many external providers, including Facebook Messenger, WhatsApp, Instagram, SMS, you name it. There will be a long tail of messaging platforms, and we have to find a good balance, because as a business, you don't want to be selective. You probably want to let anyone who wants to communicate with you do it in a nice way. We also just talked about data privacy and security being really important: to win customers' trust, we need to provide a pretty safe and strict data policy by default. And then there are transactions. We're not there yet, but we'll get to transaction and payment processing one day, and that's probably also pretty challenging. That's why we're spending a lot of time in the beginning developing a strong foundation, from both product and engineering perspectives, so that we are ready to scale in a safe way. That's what we're focusing on right now.

Gio Tropeano:

Awesome. Well, if you could give other engineering leaders one piece of advice based on your expertise and experience, what would it be, and why would you give that advice specifically?

Yandong Liu:

Definitely. My advice would be: invest in your data foundation, including data engineering, policies, tools like Rockset, and the culture, from the very beginning. Through the years, I've seen that people who invest a lot in data and are able to get it right, particularly when the business is scaling, benefit so much from the early work, whether it's just good decision-making or accurate business insights, et cetera. With more accurate and fresh data, you can probably make better business decisions. On the other hand, I've seen cases where people under-invested, and later on they would be constantly firefighting, or spending 10X more time repaying the tech debt, while probably making suboptimal decisions due to poor data, because the data is not fresh, has lag, or could just be wrong. So, do it early, do it right.

Gio Tropeano:

Sage advice for sure. That will do it for this episode of Why Wait. Yandong and Dhruba, thank you so much for your time and your insights today. We really appreciate it. You can learn more about Connectly at Connectly.ai, and email them at careers@connectly.com if you have questions. They are hiring engineers and marketing at this time as well. If you found today's episode insightful, please share the episode, help us grow our following, and share the thought leadership and insights on real-time analytics with your peers. The Why Wait podcast is brought to you by Rockset. At Rockset, we're building a real-time analytics cloud-based platform. Check us out at rockset.com, and you can try us for free for two weeks. We also provide $300 in free trial credits for that timeframe. So once again, thank you for joining, and stay tuned for our next episode. Cheers.
