Why Wait? The Rise of Real-Time Analytics

Episode 1: With Dhruba Borthakur, Rockset CTO & Co-Founder, and guest Deb Banerjee, Co-founder and CTO at Anvilogic, an Automated Detection Platform company.

Gio Tropeano:

All right. So we are recording. Give me one moment here, and I'm going to get everything set up on my end, and then we'll get started. And welcome to Why Wait, the rise of real-time analytics, the Rockset podcast, where we talk with real people about the intersection of analytics and data. I'm Giovanni Tropeano of Rockset. Rockset is actually a real-time analytics database in the Cloud. And I'm here with my cohost, Rockset co-founder and CTO, Dhruba Borthakur, and we are joined by Deb Banerjee of Anvilogic, an automated detection platform solution. Deb is an engineering leader with deep technical understandings in endpoint, data center security and Cloud solution platforms. Deb and Dhruba, welcome to the Why Wait podcast. And I believe that you both know each other from a previous life, correct? You both worked together? Is that accurate?

Deb Banerjee:

Yeah. Both of us worked together.

Dhruba Borthakur:

That's right. It was way-

Deb Banerjee:

It was a long time back.

Dhruba Borthakur:

In Pittsburgh, yes.

Gio Tropeano:

Pittsburgh. Awesome.

Dhruba Borthakur:

That's right.

Gio Tropeano:

Steel City. That's great.

Dhruba Borthakur:

Steel City, yeah.

Gio Tropeano:

So Deb, tell us, let's start off. Tell me a little bit about what you're building at Anvilogic.

Deb Banerjee:

Sure. So we are building this detection automation platform to help enterprise SOCs. So there is this group in the enterprise security org, the SOC, the Security Operations Center. They are the last line of defense, in the sense that if some threat gets through all your other controls, prevention and detection controls, endpoint, cloud, and network security products, and some things are always getting through, we know that, it is the job of this group to go hunt down what's already inside the network: who's already broken into my house, and hunt them down.

Deb Banerjee:

And they do that by basically collecting alerts and log data, and doing some fairly heavy analytics on that in terms of finding out whether there is a breach and what the breach looks like. We help our customers: we bring this knowledge of threats and detection logic, and searches, and queries, and rules to help them automate this process of detection engineering: development, deployment, tuning, recommendations, and so on. So we bring our intelligence, in some sense the brains of detections, into the existing data stacks and architectures they are running out there. So that's what we do.

Gio Tropeano:

Thanks for the overview. And the premise of this podcast is Why Wait. And it's all about real-time analytics. Talk to me a little bit about how you built real-time analytics into the Anvilogic product, and how your customers are benefiting from it. How are you using real-time analytics to solve your customer's problems?

Deb Banerjee:

Today, if you look at the analytics stacks that are commonly used in the SOC, Splunk is a substantial part of the footprint out there. The category is often called SIEM, Security Information and Event Management, and Splunk has a substantial presence there. And one of the things, I think, you're looking for in a real-time analytics solution is basically schema-less ingestion: being able to pull in various things without requiring any heavy transformation or normalization.

Deb Banerjee:

And then being able to work with large volumes of data, because in what we do, the breach detection use cases, we talk about needles in a haystack. You've got petabytes of log data, and there might be maybe tens of events of interest.

Deb Banerjee:

But what we look for, and what the analyst looks for, is a pattern of needles in a haystack, really. He's looking for not one needle, but a set of needles organized by machines, IPs, hosts, users, applications, data centers, and so on. And often it can go back a bit; it could go back from hours to days to weeks. So you're effectively looking for these patterns of needles going back from days into weeks, and that becomes a high-fidelity detection. It's easy to write a low-fidelity detection that throws up many alerts, where perhaps 90% or 99% of them are worthless. And that's the problem today in the SOC: the data is there, but people don't know exactly how to write high-performance detections looking for these patterns of needles in the haystack.

Deb Banerjee:

So that's one way of looking at this problem. Now, there are other, more advanced users looking at more... Now, Splunk has a certain cost and footprint aspect to it; it is by no means a very cheap solution. Roughly about a million bucks per terabyte per year, that's around the cost structure for Splunk. Now people will say, hey, if I have to store data, that should be a few hundred bucks per terabyte per year. Of course, usage and compute are extra. So there are some folks looking at alternative means of doing real-time analytics using more of the next generation of these cloud storage architectures.

Deb Banerjee:

Now, the important part is that the analysts who are using these solutions are not database engineers. In many cases they are not even familiar with SQL. How do we give them a threat hunting and breach detection experience that reuses all the skills they have today? So that's the landscape that we operate in, in terms of the real-time stacks, and also in terms of what the end-to-end use case looks like as it's presented to the SOC analyst.

Dhruba Borthakur:

I think what you mentioned definitely appeals to me, at least in understanding your use case. You talked about needles in a haystack, and you talked about how people might not know SQL, or don't have to be data engineers. So we'll talk about those as well, but first, can I ask you more about the real-timeliness of the application that you're building? Let's say you're building breach detection and threat hunting as parts of your key product.

Dhruba Borthakur:

How real time do you want the system to be? Is there a lot of value in real-timeliness of the data, by which I mean being able to query and make sense of the data as soon as it arrives? Or is your system okay if you can just process yesterday's data, or data which is one hour old? I'm trying to ask how much impact it would have on your industry if you have a real-time system versus a stale analytics system, a warehouse or some other system that you might use.

Deb Banerjee:

So in this business, we talk about dwell time. From the moment something breaches the enterprise until you detect that it's here and then you remediate it, there is this dwell time. And unfortunately dwell time runs into months. If you look at some of the reports out of FireEye or CrowdStrike, they'll say dwell time is on the order of months, which is scary and unacceptable. If we're going to get it down to weeks, I think we have done the industry a huge service. We start with that here.

Deb Banerjee:

Now there are classes of detections that can be, I think, high-fidelity, and where real-timeliness is certainly very doable. What we call IOCs, Indicators of Compromise, where you know there is a bad file hash floating out there or there is a command and control server that is known to be bad. And you can go look for those: file hashes, bad URLs, C2 URLs, C2 IP addresses. Certainly real time is good because it is very doable and you're getting these feeds coming in. Now, the thing is, normally these threat intel feeds are coming in from third-party sources, from vendors. Sometimes the government is sending them, so there's always some delay there itself.

Deb Banerjee:

Now, it's also true that our adversaries, the bad guys out there, are also running fairly high-quality engineering programs. They're also doing good counterintelligence, asking: are we burned? Is my hash burned? Is my URL burned? Is my C2 server burned? So they're also shifting IOCs constantly. So the stream of IOCs coming in to you is constant, and it gets stale very quickly. So there is certainly value in doing look-ups in real time, but there's also value in historical look-ups, because I do want to see: did we see this IOC a month ago, three months ago, up to a year back? And I would love to go see whether there were indications of this group in my environment. So there is a need for historical look-ups as well, not just real-time look-ups.
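
As a rough illustration of the two kinds of look-up described here, the following Python sketch checks a newly arrived event against known indicators and then searches stored events for a newly published indicator. It is a minimal sketch, not Anvilogic's or any vendor's implementation; the field names (`file_hash`, `dest_ip`), the indicator strings, and the one-year lookback default are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical indicator set; a real feed would come from a threat-intel source.
iocs = {"hash:abc123", "ip:203.0.113.7"}

def indicators(event):
    """Return the indicator keys carried by one event (field names are assumed)."""
    keys = set()
    if "file_hash" in event:
        keys.add("hash:" + event["file_hash"])
    if "dest_ip" in event:
        keys.add("ip:" + event["dest_ip"])
    return keys

def match_live(event, iocs):
    """Real-time look-up: check a newly arrived event against known IOCs."""
    return indicators(event) & iocs

def retro_hunt(history, new_ioc, now, lookback_days=365):
    """Historical look-up: was a newly published IOC seen within the lookback?"""
    cutoff = now - timedelta(days=lookback_days)
    return [e for e in history if e["time"] >= cutoff and new_ioc in indicators(e)]

now = datetime(2021, 6, 1)
history = [
    {"time": datetime(2021, 3, 1), "host": "web-1", "dest_ip": "198.51.100.9"},
    {"time": datetime(2019, 1, 1), "host": "web-2", "dest_ip": "198.51.100.9"},
]
live_hits = match_live({"time": now, "host": "db-1", "file_hash": "abc123"}, iocs)
past_hits = retro_hunt(history, "ip:198.51.100.9", now)
```

The live path only needs the current indicator set, while the historical path needs the stored events to remain queryable for as long as the lookback window.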

Dhruba Borthakur:

I think that makes sense, because many times you might be looking at more recent data and sometimes you have to go back in history. So real-time analytics is useful for [crosstalk 00:09:53] some set of queries, but then you also need the ability to do historical look-ups.

Deb Banerjee:

That's right.

Dhruba Borthakur:

You also mentioned something about searching for the needle in a haystack. You said something about these patterns coming in, and then sometimes you have to look through and do more of a search query into this database. But then sometimes you also want to do analytical types of queries in a database. You match these patterns and then you want to maybe sum them or aggregate them and check differences. Can you tell me about the analytical types of queries that you might need for your use case?

Deb Banerjee:

This is interesting, because we in our world tend to group everything as a search, even though [inaudible 00:10:33] the analytics you're describing. Let me give an example. The classes of detections we tend to focus on, and the highest-value ones, are what we call the longer-lasting detections. Just like in any other engineering group, the adversaries are running good engineering groups, and all engineers have preferences for tools and techniques, certain ways of doing things. We all have preferences; they have preferences too. And there are groups out there like MITRE who are modeling out all the common techniques and procedures they are using. Those are much harder for adversaries to go change. So let's say someone uses PowerShell: downloads a malicious PowerShell script from the internet that is encoded, and executes that on your machine.

Deb Banerjee:

Now, the nice part about this is that it's a fairly well-defined technique I can go detect, but unfortunately it's also the case that there are legit users doing these things too. So these techniques are by their nature not high-fidelity. These detections are not high-fidelity. Almost always adversaries are using these things, but sometimes even the good guys are using these things too, so these are not high-fidelity. So often one of the things we do, when we start looking at these specific behaviors, is group by time, looking at, I think, the last five or 10, somewhere between seconds and minutes, and asking how many distinct users we saw with that behavior on that machine, or how many distinct IPs. So we're doing some count-bys within a time bucket.
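
The time-bucketed distinct count described here can be sketched in a few lines of Python. This is an illustrative sketch under assumed field names (`ts` in epoch seconds, `host`, `user`) and an assumed five-minute bucket, not the query any particular SIEM runs.

```python
from collections import defaultdict

def distinct_users_per_bucket(events, bucket_seconds=300):
    """Group events by (host, time bucket) and count distinct users in each."""
    users = defaultdict(set)
    for e in events:
        # Round the timestamp down to the start of its bucket.
        bucket = e["ts"] // bucket_seconds * bucket_seconds
        users[(e["host"], bucket)].add(e["user"])
    return {key: len(names) for key, names in users.items()}

events = [
    {"ts": 10, "host": "h1", "user": "alice"},
    {"ts": 20, "host": "h1", "user": "bob"},
    {"ts": 40, "host": "h1", "user": "alice"},
    {"ts": 400, "host": "h1", "user": "alice"},
]
counts = distinct_users_per_bucket(events)
```

In SQL terms this corresponds roughly to a `GROUP BY` on host and time bucket with a `COUNT(DISTINCT user)`.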

Deb Banerjee:

Sometimes we might look at specific event codes and process or file names in that stream, asking, hey, am I seeing certain file names out there? So we're getting deeper into the field structures, looking at specific fields and at occurrences of the fields. It may be regex-based; it's not always exact matches. You are bucketing by time and counting by distinct entities like these. And these are our basic building blocks that give you what we call our atomic detections. That is finding the needles, but not all the needles are true needles; some are fake needles and some are true needles. And then on top of that, we have another layer of analytics that's looking for the patterns of needles against a given, let's say, application or user or machine: did I see this needle followed by that needle followed by that, over the last maybe N hours or days and so on?

Deb Banerjee:

So there are two classes of analytics, and you are right, absolutely. It is heavy: you're looking at the structure of the events, you're counting various things by time and by unique values of what we call, we use the term COI, correlation entity of interest. That's how we're grouping these events: by users, applications, IPs, and so on.
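
The two layers described above, first finding individual needles by pattern matching on event fields and then looking for an ordered pattern of needles per entity, can be sketched as follows. This is a simplified Python illustration; the detection names, the regular expressions, the `cmdline` field, and the one-hour window are all hypothetical and are not Anvilogic rules.

```python
import re
from collections import defaultdict

# Hypothetical atomic detections: name -> regex over a process command line.
ATOMIC = {
    "encoded_powershell": re.compile(r"powershell.*-enc", re.IGNORECASE),
    "new_admin_user": re.compile(r"net\s+user\s+\S+\s+/add", re.IGNORECASE),
}

def find_needles(events):
    """Layer 1: tag each event with the atomic detections whose regex matches."""
    needles = []
    for e in events:
        for name, pattern in ATOMIC.items():
            if pattern.search(e["cmdline"]):
                needles.append({"ts": e["ts"], "host": e["host"], "needle": name})
    return needles

def hosts_matching_sequence(needles, sequence, window_seconds):
    """Layer 2: hosts where the needles occur in order within the window."""
    by_host = defaultdict(list)
    for n in sorted(needles, key=lambda n: n["ts"]):
        by_host[n["host"]].append(n)
    hits = []
    for host, items in by_host.items():
        for i, first in enumerate(items):
            if first["needle"] != sequence[0]:
                continue
            step = 1
            for later in items[i + 1:]:
                if later["ts"] - first["ts"] > window_seconds:
                    break
                if step < len(sequence) and later["needle"] == sequence[step]:
                    step += 1
            if step == len(sequence):
                hits.append(host)
                break
    return hits

events = [
    {"ts": 0, "host": "h1", "cmdline": "powershell.exe -enc SQBFAFgA"},
    {"ts": 600, "host": "h1", "cmdline": "net user backdoor /add"},
    {"ts": 0, "host": "h2", "cmdline": "powershell.exe -enc SQBFAFgA"},
    {"ts": 0, "host": "h3", "cmdline": "net user svc /add"},
]
needles = find_needles(events)
hits = hosts_matching_sequence(needles, ["encoded_powershell", "new_admin_user"], 3600)
```

Each needle on its own is low-fidelity (hosts `h2` and `h3` each show one), while the ordered pair on a single host within the window is the higher-fidelity signal.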

Dhruba Borthakur:

I think I understand what you're saying. So what you are saying is that you do some search query first, which is very heavy filtering based on maybe a few tens or hundreds of properties. And then once you filter that, you seem to be doing a lot of group-bys and then order-bys, if I'm talking in terms of SQL or database queries. This is how you are doing your analytics. So you basically need both search and analytical types of queries.

Deb Banerjee:

And there's a time bucketing as well that often you'll do a time bucketing in there as well.

Dhruba Borthakur:

What about... now, your dataset might be quite diverse, in the sense that every event might have hundreds of fields. How important is it for you to search not only over time, like you mentioned, but also over other fields that are in your database? Say you want to filter or index on the IP address, or you want to filter or index on the zip code where the event was produced, or some other field. How important is it for your data system to be able to do that so that your application is more intelligent or more powerful?

Deb Banerjee:

This is where it gets interesting, I think. I was looking at Rockset and this full indexing, indexing on every field of the event; that's super powerful. Today you don't get that in our traditional security analytics solutions. I'm not getting full indexing, which means when I'm coding my search, I have to be super careful to start by knowing that there are some fields we're indexing on, and start filtering on those first, to try to narrow down the number of records you have to operate on, literally to do the full scan on. So some of the lack of flexibility in the underlying data sets flows into the detection engineering side, the analytics development side. Good, experienced analysts know this: some things are indexed, some things are not indexed, so focus on the indexed fields first, filter on those first, and then look at the rest and do the bucketing later on.

Deb Banerjee:

Not everybody knows these things, and therefore sometimes people write analytics searches that may take a long time and may bring down your system, because you do want to run them over maybe hours, days, and so on. And if you don't do these careful things, then bad things will happen. If you had a model where you take that off the table for me, where you have a more powerful indexing system that's more flexible, that would make it greatly simpler. I would have to be less of a rock star in my detection development than I have to be today.
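
To make the difference concrete, here is a toy inverted index in Python that indexes every field of every event, so a filter on any combination of fields becomes a set intersection rather than a full scan. It is only a sketch of the general idea under assumed field names; it is not a description of how Rockset or any SIEM actually implements indexing.

```python
from collections import defaultdict

def build_index(events):
    """Index every (field, value) pair to the positions of events containing it."""
    index = defaultdict(set)
    for pos, event in enumerate(events):
        for field, value in event.items():
            index[(field, value)].add(pos)
    return index

def search(index, events, **filters):
    """Intersect posting sets for each filter instead of scanning every event."""
    postings = [index.get((f, v), set()) for f, v in filters.items()]
    matched = set.intersection(*postings) if postings else set(range(len(events)))
    return [events[pos] for pos in sorted(matched)]

events = [
    {"src_ip": "10.0.0.5", "user": "alice", "zip": "15213"},
    {"src_ip": "10.0.0.9", "user": "bob", "zip": "15213"},
    {"src_ip": "10.0.0.5", "user": "bob", "zip": "94105"},
]
index = build_index(events)
results = search(index, events, src_ip="10.0.0.5", user="bob")
```

With every field indexed, the person writing the search no longer has to know which fields happen to be indexed and order the filters accordingly.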

Dhruba Borthakur:

Cool. That makes sense. So you've talked about various components of your data stack, but could you tell me what data stack you actually use at Anvilogic? How do you do some of your own analytics currently? Do you use any open source stuff, or do you build something yourself, or are your engineers building it for you? Could you tell me a little bit about your stack in particular?

Deb Banerjee:

So clearly we are not ourselves hosting customer data in our environment, but we are getting customer behavior and analytics coming back from the customer environments. And we do a lot of recommendations based on that. So we try to look at what worked, what didn't work, how it's going, and how our detection is functioning. We gather runtime data, performance data, the false positive and true positive data, and even the tuning data: what are all the tuning activities they did on the rule, and we bring that back to the platform. So we have a mix of stacks today. We use database stacks, and we also have Elasticsearch as a document store, so we have a doc store stack and we have a structured data stack.

Deb Banerjee:

We also use Spark for some stream processing when things are coming in, and they sometimes go to multiple destinations. We have the learning part, the data science pieces, in terms of recommendations, and some things go to state management and so on. So we've got a stack today that's a mix: it's streaming on our intake side, and then we have multiple destinations flowing out of it, a doc store and a structured store.

Deb Banerjee:

So that's where we are today. But we see it evolving. Like any startup, you're trying to be agile and fast and solve problems for today, but I think as the data sets grow, some of the recommendation systems are where I see real time shining really well. If someone did something right now and someone else is looking to do something similar, you want to bring it in front of that person, saying, hey, you're trying to do this, someone did that here, go take a look. And I think the real-timeliness is pretty impressive there. We are not there yet. We'd like to get there.

Gio Tropeano:

Awesome. So real quick, you're listening to Why Wait, The Rise of Real-Time Analytics, the Rockset podcast. We're here with Deb Banerjee from Anvilogic and Dhruba Borthakur from Rockset. For all the builders listening to this podcast, how would you build real-time analytics into your product? Please share your thoughts on our Community Slack channel at rockset-community.slack.com.

Dhruba Borthakur:

Cool. Thanks, Gio. I'm really interested in how developers use... Is your platform very developer focused, in the sense that Anvilogic's use cases are actually operated by developers at other enterprise companies who are building security, or are they using real-time dashboards to query your Anvilogic database or data set? How do you surface some of your outputs to your customers? And tell me a little bit about the query latencies and query volumes that your customers might put on your application. I'm not talking about the detection part; I'm talking about the interactivity of your system with your customers. How is that delivered to your customers?

Deb Banerjee:

So typically, the users of our product are SOC analysts, so they are really by definition threat hunters and detection developers. Now I would say there is more domain knowledge involved here. Often many of them bring domain-specific language backgrounds to the table. I would not call them pure programmers. They may know SPL, Splunk's Search Processing Language, reasonably well, or in some cases they might be Python programmers and so on. So that's the mindset they bring. And typically one of the things they look to us for when they come to the product is for us to connect the dots for them.

Deb Banerjee:

Today there are three things a good analyst has to do to get the job done. First, know his threat landscape: what things are getting through in his environment. He has to be aware of that because that's the whole foundation; without that he can't really do anything beyond that. And it is changing; it's not a static thing. And the threat landscape has to be at a fairly low level of actionable granularity in terms of which specific threat behaviors should be looked for. If you give me very high-level descriptions, I can't operate on that; give me something low level. That's the first thing.

Deb Banerjee:

Secondly, what data do I have? The smart analyst would know all his table schemas, all these fields and indexes; he would know everything. It's very hard for most people to do that. There are customers who may have a few thousand indexes in their systems out there. Most analysts might know maybe a few of them by heart, and then maybe another half a dozen, and then it falls off. So our system goes into their system, does these assessments, and builds a threat landscape model for them.

Deb Banerjee:

So when he lands in the Anvilogic console, the cloud interface, he's offered recommendations saying, hey, based on what you need to do and based on the data sets we see you have. And the data sets are fairly complex. It's not enough to know that you are collecting Windows event logs or Linux audit logs as essential logs. What fields are there? Were all of the fields turned on? Did you collect the parent processes? Did you collect process command lines and so on? Which event codes did you collect? There is wide variability, and based on that, some detection rules we have may not work for you and some may work really well for you. Now smart devs can figure those things out, but you want the system to be smart enough. And so right away, we connect all these dots for you, from the threat landscape to understanding your data.

Deb Banerjee:

And we have a tool that you deploy to your SIEM that goes and does the assessments and sends back the assessments, and then we use that in the cloud to make recommendations about atomic detections, finding the needles, and also looking for the patterns of needles, looking for the attack scenarios. A typical workflow would be for the analyst to look at that and say, "Let me create a private copy of the rule for my workspace so that it's no longer a public rule, deploy that in my environment, run it, look at the results, see how noisy it is, and do some tuning." Almost always you have to do some tuning; nothing is good enough that it can go to production right away. You have to customize a rule for your environment.

Deb Banerjee:

And as you do that, we follow this detection-as-code model where, as you keep modifying the rule, automatically this agent, the app running as part of the SIEM, submits back new versions of the code that you're modifying. Say version one was the Anvilogic copy, version two was your changes, and version three was your changes. We keep track of what the changes are so that we can start recommending to other customers that this rule can be deployed, and that these are typically the sets of changes you make to a given rule in order to make it work in your environment, because there's always a burden, obviously. So that's roughly a typical flow in terms of how the analyst uses the product.

Dhruba Borthakur:

I see. So I think the takeaway, at least from what I'm understanding, is that it looks like a very interactive system where the analyst can give some guidance, saying, hey, try these things, and then the system executes those and gives him results. And based on that he can do some more iterative exploration, or build some new models that he can probably push to production. Is that what you were saying?

Deb Banerjee:

That's the starting point. So that's what you start with, and what happens at the end of it, once you run my rules in, say, your Splunk environment today, or your Sentinel environment, is that we populate this Anvilogic hunting index. Think of it as another table you are populating with all the suspicious events we've seen in your environment, the needles out there. Now, that runs in your environment; again, we don't hold customer data, the customer is hosting it in their environment. Against that, today, we give you a couple of ways to go and look for the patterns, the patterns of needles. That's what the high-efficacy alert detection is. And today we give you ways to define those patterns and look for those patterns. Essentially we give you more analytics, but where we are going is that we give you machine learning algorithms that can constantly mine this particular hunting index that we created for you, and then constantly keep popping up attack detections for you against the Anvilogic hunting index. We are not there yet; we're in the crawl, walk, run phase.

Deb Banerjee:

We let you go define the particular patterns of needles you're looking for, but longer term we will be constantly pushing out new patterns to look for. And we're limiting this only to this particular hunting index, not hitting all your other large logs, because this is limited to suspicious events you've seen in your environment, not everything that's not of interest. Because, to your earlier point, performance is a problem here, since we are taking resources from the data analytics solutions that you have. And we want to be careful and mindful to optimize that as well. So that's the second part of what we do.

Dhruba Borthakur:

Some of this processing could be quite heavyweight and might need a lot of compute and resources. You also touched upon the fact that this iterative process seems to have a lot of value for your customers if you're allowing them to move fast, which fits into the Why Wait question that we ask ourselves for real-time analytics. Is that moving fast? How important is it for your customers to be able to do this iteratively, and does it make your product better compared to your competitors?

Deb Banerjee:

Absolutely, I think, because the timeliness matters. The threat landscape is not a static landscape. The bad guys are also doing counterintelligence on what the good guys are doing, and they're constantly responding. There's a cat-and-mouse element out here, which does put a premium on the ability to react. I think how fast you can react becomes a competitive advantage for your particular organization. And so yes, in that sense, you have to keep up with the threat landscape, because if you give me a detection after an adversary has moved on to other things, it doesn't help in the future. It might help me look back and see, okay, did it happen in the past, but for the future it is less interesting. So clearly you want to be on top of the landscape, to the extent you can. Now the general state of detection is not excellent.

Deb Banerjee:

I would say, it demands a difficult magical skill today, it takes a lot of magic by analysts to do this. And so we are trying to bring the magic into product actually versus [inaudible 00:27:44]. So that's what the challenge is, but you're right. I think timeliness matters because the landscape is changing.

Dhruba Borthakur:

Especially with data volumes and data that you're collecting, this is a challenge, [crosstalk 00:27:56] for the compute to be efficiently used.

Deb Banerjee:

And to your point, some of the responses we see to the data volume you brought up are that people are bringing on multiple silos. Now they might use Chronicle for endpoint detection logs only [inaudible 00:28:11] because of the great price-performance point out there. They might be using various data warehouses or data lakes to store raw data out there because it's cheap. They might keep 30 days in Splunk, they might keep other data in Chronicle, and maybe a year's worth of data somewhere in a cloud data warehouse, but you need to be able to correlate across these. There are enough correlations you might want [inaudible 00:28:35] across these. And you want to push the detections into the various silos you have, which customers want because these are very big, heavy event silos. So being able to push detections into various silos and being able to correlate across silos, I think that's another use case we see emerging with advanced customers.
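
The idea of pushing one detection into several silos and then correlating the results can be sketched as below. This is a deliberately small Python illustration in which each silo is just an in-memory list; the silo names, the `score` field, and the threshold are assumptions for the example, and a real deployment would translate the detection into each store's own query language.

```python
from collections import defaultdict

def correlate(silos, detect, key="host"):
    """Run one detection in every silo, then group the hits by a shared entity."""
    by_entity = defaultdict(list)
    for silo_name, events in silos.items():
        for event in events:
            if detect(event):
                by_entity[event[key]].append(silo_name)
    # Keep only entities flagged in more than one silo.
    return {entity: sorted(set(names)) for entity, names in by_entity.items()
            if len(set(names)) > 1}

silos = {
    "siem_30_days": [{"host": "h1", "score": 9}, {"host": "h2", "score": 2}],
    "endpoint_logs": [{"host": "h1", "score": 8}],
    "warehouse_1_year": [{"host": "h2", "score": 7}, {"host": "h3", "score": 9}],
}
cross = correlate(silos, lambda e: e["score"] >= 7)
```

Only the shared entity key travels between silos here, which is one way to correlate without copying the heavy raw events into a single store.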

Dhruba Borthakur:

The silos definitely are a big problem, because it looks like silos give you a lot of operational complexity. If you have a warehouse and you have data in Chronicle and Splunk, you really need a data engineering team or more people in your team who are analyzing data. Is that a problem, or is that a challenge, in the sense that you need certain data experts who are dealing with data and not really with security? And does it slow you down and give you hiccups in making sense out of the real-time analytics systems that you are trying to build?

Deb Banerjee:

So today I think it's only the very large customers who can afford to have some amount of data engineering skills, because now you've got to set up pipelines, Kafka lines, going into various places. You normalize once or transform once, with multiple destinations, and all that. Now you need some engineering out there to keep those pipes up and running, make sure everything works, and keep adding more normalizations, so absolutely, you're right. That's what we're seeing today. Now there is opportunity in simplifying this, I think. There's certainly opportunity for product teams, perhaps yours, to come in here and say, not a problem, you've got multiple sources coming in, and we can give you a single view across all the silos, and you can run correlations in one place against it. Those are certainly opportunities, I think, going forward. But you're right, today it's only the bigger, more advanced enterprises that are doing this.

Gio Tropeano:

And the amount of data that's being collected isn't shrinking either. So the complexity is only going to grow. So, to cap this off then, what challenges regarding real-time analytics are top of mind for your team, Deb, and how do you plan to address them? And what suggestions do you have for other folks in your situation who are looking to really capitalize on the opportunity that real-time analytics can provide?

Deb Banerjee:

One of the things... I think a competitive differentiator for all of us is relevance. The world is full of insights and recommendations, and most of it is not relevant. We all get recommendations that are not relevant. And timeliness, I think, is a really critical success factor for relevance: something is relevant to you right now, and it has a short shelf life. I was talking to our sales leader the other day, and he was saying, time kills deals. The longer you're talking to someone, the less and less likely you are to close the deal. So timeliness, I think, is a key part of relevance. If I'm looking for something and I get it right now, you bring the best intel right now, I absorb it and take care of it. But if it's not there, I won't come back to it a day later and say, "Now, what was that I was looking for yesterday and did not find?" So to me, that's the core of relevance, really. If we can support these relevance use cases, then you create business value, engaged customers, engaged users. It's a nice virtuous cycle, versus stale data, where everything is stale: I was looking for it last week, I don't care about it anymore. And then it leads to people saying, oh, it doesn't work. You hear this set of observations analysts are making, that it doesn't work. What they really mean is that it doesn't meet the needs that I have right now. So for the timeliness of queries, data access, recommendations, any insights you offer, I think timeliness is a critical success factor.

Gio Tropeano:

Thank you so much. That's pretty much all the questions that we have for you, Deb. It was a great discussion today. Thank you so much for sharing your experience with real-time analytics at Anvilogic with us. And if listeners want to learn more about Anvilogic, check out Anvilogic.com, a beautiful website with a lot of information there. And for our listeners, we at Rockset are building a real-time analytics platform that can add value to use cases similar to the ones that Deb described to us today. Once again, thank you both for your time and for your insight. And if you're listening to this podcast or watching the video, please comment and share; we'd love to hear from listeners and what your thoughts are. And tune in next week for another entertaining discussion with another founder of a software company, one that enables fleet management in real time. Thank you both so much again.

Deb Banerjee:

Thank you.

Gio Tropeano:

And have a great day.

Deb Banerjee:

[crosstalk 00:33:47] take care.

Dhruba Borthakur:

Bye.
