Building Real-Time Search in Your App: Beyond Text Search and Elasticsearch

The ability to support many different types of queries without having to reshape your data is particularly helpful when building real-time search and analytics. We compare how using Elasticsearch and Rockset impacts the query flexibility you have.

About This Webinar

Kevin:

Hi everyone, and welcome to today's Tech Talk. My name is Kevin, and I'll be your moderator for today. So our topic for this talk is Building Real-Time Search in Your App: Beyond Text Search and Elasticsearch. And as a reminder, this webinar is being recorded and all participants will be muted. If you do have a question, please feel free to send it to us through the questions section of the webinar tool here. And we're excited to have David Brown with us today.

Kevin:

David has been helping customers architect solutions with different fast and big data technologies for over 20 years. At Rockset, David helps customers determine if the product is a good fit, helps place it in the context of their architecture, provides assistance during evaluation, and stays involved with customers as they progress in their Rockset journey. He lives in San Diego and is on the top-two list of favorite humans for a pair of providers. So welcome, David. What I'll be doing today is asking David a number of questions about his experiences in the industry, drawing on his substantial and varied experience working with users and customers.

Kevin:

So for today, we'll cover a number of different things. We'll talk about real-time search and applications. We'll talk about how search has evolved over time. And then we'll go a little bit more in depth, when we talk about some considerations users have when building real-time search and David will share his experiences in all those things. So without further ado, I'll ask David to come on and tell us a little bit about what search is in his experience. Hey David.

David Brown:

Sure. So everyone has an experience with search, right? People first became familiar with search using engines like Google, and concepts like full-text search, natural language search, or semantic search became what people associated with the term. The dictionary definition of search implies not just looking for something, but generally looking for something that's hard to find. And today I think that notion of hard to find is based primarily on the size of the corpus of data that you're starting with. Typically we have a lot more data to deal with now than we did in the past, and you're looking for that proverbial needle in a haystack. So search tends to be characterized by looking for a small amount of data within a much larger body of data. The antithesis of search is reporting, where you tend to read and summarize a large portion, if not all, of your dataset to get a summary.

Kevin:

Interesting. So, you've worked with customers here at Rockset and even before in your career. What types of applications have you seen that implement search?

David Brown:

So the key attribute I see search applications having in common is that typically a human is waiting for the results. Generally you only have a second or less to return a search result. In contrast to reports, search tends to focus on details, like looking up information on a very specific transaction or finding a used vehicle that best matches all of your criteria. The line between a search app and a report tends to get blurred, however, especially when you introduce the ability to search against real-time data. That's when you get applications like dashboards and leaderboards. These tend to be little mini reports that are run against a small subset of the data, but the boundaries of that data tend to be influenced both by time and the user's current interest. So the data store that's required to support apps like that needs to be able to pull that small amount of data out of the larger dataset very quickly. Ultimately there's no clear definition of what a search app is; we're seeing customers use search in new ways every day.

Kevin:

Got it. So, David, when I read your bio, I saw that you've been helping customers for over 20 years now, and I know you have some thoughts about how search has evolved over time. So why don't you share those with us?

David Brown:

Yeah, I'm older than I look, I think. So our problems all started when Al Gore invented the internet. Prior to that, data stored in computers was generally created by humans typing at keyboards, and then along came the web. And since HTTP was initially stateless, companies needed a way to reconstruct the path that users took through their various webpages. And ultimately they needed to understand and predict how people would interact with their websites. Web servers produce large quantities of data. That was the first time, well, not technically the first, but the first time en masse, that we had machines producing large quantities of data, with fine-grained records being produced by the web servers every second.

David Brown:

If your database couldn't receive and process these updates in a timely manner, you would run the risk of reacting too slowly to an opportunity that was probably no longer there by the time you tried to react to it, right? And also, each webpage produced slightly different information in the logs, so it became difficult to use traditional relational products to store this varying, or semi-structured, data. And these problems became known as the three V's: volume, velocity, and variety.

Kevin:

Yeah, I remember those three V's. I've not been in the data industry for 20 years, but probably more than 10, and I heard those terms a lot about five or ten years ago. So how did the software engineering community deal with those three V's?

David Brown:

So typically commercial software just evolved incrementally, right? Most products just continued to add more and more features on top of their existing architectures, but here's where history played an important role. The collapse of the dot-com bubble in the early 2000s left many brilliant software engineers out of work, and they were no longer tied to the corporate code bases that they had been working on. And so they started to fuel the open source community with innovative data storage solutions that they were writing from scratch. They essentially threw everything out and started with the premise that in order to deal with the three V's, you had two requirements: one, the ability to scale the system linearly across commodity hardware; and two, the ability to survive the loss of a single machine without losing any data or having the system become unavailable. So one of the things they threw out, along with ACID transactions and other things, was SQL, right?

David Brown:

Although arguably SQL wasn't the problem. It was the rigid relational data structures that were the problem. And so these systems tended to fall under the category that we call NoSQL databases. And I don't think that many lay people really have an appreciation for just how much of a technology reboot we had with the NoSQL movement. It was very unusual to restart things from scratch. And the core technology for Elasticsearch was born in that time period. Like many other open source projects, it naturally followed the needs of the community. So we see this in the evolution of the ELK stack: they kind of added back features one by one on top of the NoSQL core in order to meet the most common needs of their users, but always with the caveat that you had to service the three V's and meet the core tenets of the NoSQL initiative.

David Brown:

And so they, as in both the company behind Elasticsearch and the open source community that supported it, probably made a very conscious choice to implement a proprietary query language, because it better suited the specific needs they were trying to solve. And if you think about it, full-text search is fundamentally different from other searches. I mean, you don't type SQL into the Google search bar, after all, right? So if I had to guess, I bet they knew what they were leaving behind. They knew how pervasive SQL was in the rest of the industry, but the community went in its own, self-serving direction. I think we can agree that Elastic is a great product for solving the specific needs of searching log entries and log-like events, but the whole stack has been very specifically designed for that purpose.

Kevin:

Okay, that makes sense, right? The three V's: we got NoSQL and Elasticsearch out of that. Now, were there any other historical events that you think shaped the evolution of search?

David Brown:

Sure. And it's obvious to us now, and that's the cloud, right? The cloud has changed everything. To me, the concept of the cloud was inevitable. I mean, economies of scale tell us that it was just inefficient for every company to manage its own data center. We just had to wait for a company to come along that (a) really highly valued innovation and (b) was able to forgo a profit for 15 years, right? And so that infrastructure-as-a-service was really just a stepping stone. The more significant innovation for architectures came when the folks who first developed Google App Engine leveraged that knowledge to build Kubernetes, which has ultimately given rise to serverless computing. And again, I think that many lay people wouldn't appreciate how significant a difference there is between traditional architectures and serverless computing. I looked it up on Wikipedia and wrote it down here.

David Brown:

I really liked the way they put it. They say that serverless is a misnomer in the sense that servers are still used by cloud providers to execute code for developers. However, developers of serverless applications are not concerned with capacity planning, configuration management, maintenance, operating, or scaling of containers, VMs, or physical servers. So sure you can run products like Elasticsearch or others inside of containers in a Kubernetes environment. But it's just not the same. You just end up managing containers instead of managing servers.

David Brown:

A truly serverless architecture allows for the dynamic scaling of system services independently, and it places far less load on system administrators. So we're all standing on the shoulders of giants. We all leverage the brilliant innovations of engineers who have come before us, but technological innovation isn't linear, right? Sometimes you get big, sudden spikes, like the cloud, like NoSQL, and the products that are built after those spikes tend to have a distinct advantage over those built on legacy architectures. And if a company built a product before a major spike, there's very little incentive or ability to tear that thing down and rebuild it from scratch on a new architecture, because it's difficult to change a car while you're driving down the road.

Kevin:

Right. Makes sense. And thanks for walking us through the history of search as you see it. We've talked quite a bit at a high level about how and why a product like Elasticsearch was designed, and you brought up Rockset as well. So let's get a little more into the details here, as to how Elasticsearch and Rockset might be similar or different. In what ways do you think Rockset has taken a similar strategy to Elasticsearch, and in what ways a different one?

David Brown:

So I think I'd put the differences in two categories. One is the technologies that are used, and the second is the expectations that are placed on the end user to configure and manage those technologies. So starting with the technologies: both products have an inverted index, or search index. An inverted index starts by recording each word or value that it encounters and then keeps track of all the documents or pages where that word or value appears, which is kind of backwards from regular indexes. This is super fast for finding documents that contain a particular value, but it carries the assumption that only a small percentage of the documents will contain that value. If that assumption is violated, or in technical terms, if you're querying against a low-cardinality column and you end up getting back 30, 60, 80% of your dataset by searching on that value, then there are much faster ways to retrieve the data than using an inverted index.
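The inverted-index idea David describes can be sketched in a few lines. This is a toy illustration with made-up documents, not the actual data structures of Elasticsearch or Rockset:

```python
# Toy inverted index: map each word to the set of documents containing it.
from collections import defaultdict

docs = {
    1: "red truck for sale",
    2: "blue sedan for sale",
    3: "red sedan wanted",
}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        inverted[word].add(doc_id)

# A selective term is a single cheap lookup, no matter the corpus size.
print(sorted(inverted["red"]))   # -> [1, 3]

# A low-cardinality term matches most of the corpus; here an inverted
# index buys little, and a row or columnar scan can be cheaper.
print(sorted(inverted["for"]))   # -> [1, 2]
```

The "backwards" part is exactly the mapping direction: a regular (row) index goes from document id to content, while the inverted index goes from content back to document ids.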

David Brown:

So this is why Rockset also creates columnar and row indexes on the data. We call this the converged index, and it gives our query engine more options and allows it to intelligently choose the best option for pulling the data off storage at run time. Then there are the expectations placed on the end user. In Elasticsearch, the administrator still has to make fundamental configuration decisions. They need to understand things like types and mappings and shards. In Rockset, users don't have to configure these things. You can certainly make an argument that allowing users to control these settings lets them optimize the system for a particular use, but then you end up with a system that's optimized for one use case and, by definition, de-optimized for others. It's our contention that the converged index can answer specific, pre-defined queries almost as fast as a highly tuned data store, and also answer general ad hoc queries against the same data using the same system, and do those quickly as well.

David Brown:

And then you need to think about the time and cost it takes your administrators to configure and manage your data store. If you don't have to do that, you can apply that time, cost, and intellectual talent toward your company's real strategic projects and its core competencies, right? So why should your company be a database company? You don't need that expertise if it's built into the system, as with Rockset.

Kevin:

Okay. So you've mentioned converged indexing, so maybe you can talk a little bit about that. And in that context, we do say that Rockset is schemaless. So how can a query engine work if you don't know the structure of the data?

David Brown:

Yeah, that's actually a neat trick. So firstly, the act of indexing data, and we index all the data ahead of time, is kind of like precalculating the answers to a question. Now, we don't know the full question ahead of time, but at runtime we can break the query down into a bunch of separate tasks, and the converged index has effectively prepared the answer for each of those individual tasks. So then all we need to do is combine the answers from all those tasks, which of course is done by more tasks that we call aggregation tasks.

David Brown:

And to speak to your question about schemalessness, the secret is that the index information is always stored in a fixed and predictable way. So regardless of how the customer's document is formatted, or how the schema in their documents might morph over time, the data is going to end up in our index with a very predictable structure. And the query engine always executes against indexes and, in our case, never actually operates against the original documents. We're never scanning the document in its native form, always the indexes. The indexes are stored in RocksDB-Cloud, which is where the "Rock" in Rockset comes from, right? And that gives us important features like partitioning, fault tolerance, rebalancing, data compaction, and perhaps most importantly, the ability to mutate documents, to change values in the index, very, very quickly in order to deliver the freshest values to the query engine.
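One way to picture how arbitrarily shaped documents can land in a fixed, predictable index layout is to flatten each document into (field path, value) pairs. This is a simplified illustration of the general idea, not Rockset's actual key encoding:

```python
# Flatten any JSON-like document into (field-path, scalar-value) pairs.
# However the document is shaped, every index entry has the same fixed
# form, which a key-value store such as RocksDB can store and scan
# predictably.
def flatten(doc, prefix=""):
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(doc, list):
        for i, value in enumerate(doc):
            yield from flatten(value, f"{prefix}[{i}]")
    else:
        yield (prefix, doc)

doc = {"user": {"name": "Ada", "tags": ["vip", "beta"]}, "amount": 42}
entries = dict(flatten(doc))
print(entries)
# {'user.name': 'Ada', 'user.tags[0]': 'vip', 'user.tags[1]': 'beta', 'amount': 42}
```

If a later document adds or renames fields, only new path keys appear; the entry format itself never changes, which is why the query engine can always work against the index without knowing the schema up front.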

Kevin:

Okay, great. Thanks for explaining all that intricate detail, David. I know you mentioned something else when we started off talking about the history and evolution of search, and that was SQL, right? So, if the NoSQL movement decided to throw SQL out as too hard or too slow, why, in your mind, has the industry returned to SQL?

David Brown:

I don't think the industry ever left SQL, right? I'd wager that there are probably 10 times as many people who know SQL today as there were when the NoSQL movement kicked off. And remember that it wasn't SQL that was the problem. The problem was trying to reliably write the data produced by the web servers as quickly as possible, and to make sure that the failure of any node wouldn't result in losing any data.

David Brown:

And to be fair, we haven't tried to solve that problem with Rockset. We expect that our customers are using one of the many excellent products out there for their system of record, like DynamoDB or MongoDB, or even traditional relational databases like Oracle, MySQL, or SQL Server, or they have files in a data lake, right? Rockset is the source of truth, not the system of record. A source of truth is a system where you can bring together data from numerous sources in order to get a more comprehensive view of your company, and the undisputed champion for joining datasets together and answering analytical questions, whether those questions are ad hoc or otherwise, is SQL, right? At a really early stage in their development, the engineers who built Elasticsearch chose to optimize their product for a different purpose than those ad hoc or data-joining problems.

Kevin:

Okay. Right. That kind of makes sense, because for Elasticsearch that decision was made, you could say, at a different time, in a different era, when there was a mass movement away from SQL and into this thing called NoSQL. But how do the choices Elasticsearch and Rockset made with respect to query language impact users today?

David Brown:

Well, the too-long-didn't-read answer would be that if you don't support SQL, you are excluding (a) a large population of people and (b) a large population of tools that could integrate with your product. So in the end, you can and should boil everything down to cost, right? Having a proprietary query language either takes time for your people to learn, or you have to go and hire people who already know it from a smaller and more expensive talent pool. And if you can't use third-party tools to integrate quickly and easily with your application, you're likely going to have to build and support your own solution, which takes more time and more money, right? If you get into an argument about whether you could solve a particular problem better, or at all, with one query language versus another, you're going to miss the more important point: it's really all about total cost and return on investment.

Kevin:

All right. Well, that makes sense. And then one other topic, somewhat related to SQL, which we were just talking about: people sometimes think of SQL in conjunction with joins, right? So let's talk about joining data, which is something that's supported in the SQL language and in relational databases. Why is that useful for more modern search applications?

David Brown:

I think two primary reasons. Firstly, when you look at data, it tends to fall into two cohorts. There's the data that moves and flows: events, transactions, measurements, page visits, ad impressions, the events that you're tracking, right? And then there's static data that doesn't flow: user profiles, preferences, settings, and other metadata. It changes, but it doesn't really have this consistent, time-based flow. So the life cycle and the period of relevancy of these two types of data are different. In the case of flowing data, you may only want to keep six months' worth, because anything beyond that might not be pertinent to any kind of real-time analysis. If you don't have the ability to join these two different types of data at query time, then you have to do it by denormalizing your data, by pre-joining them at ingestion time. And that's what Elastic customers tend to have to do to solve the problem, because they don't have joins.
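A query-time join of the two cohorts David describes, flowing events against static profiles, can be shown with any SQL engine. Here is a toy example using SQLite as a stand-in (invented schema and data, not a Rockset query):

```python
# Join "flowing" event rows with "static" profile rows at query time.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE profiles (user_id INTEGER PRIMARY KEY, name TEXT, plan TEXT);
    CREATE TABLE events (user_id INTEGER, action TEXT, ts TEXT);
    INSERT INTO profiles VALUES (1, 'Ada', 'pro'), (2, 'Grace', 'free');
    INSERT INTO events VALUES (1, 'login', '2021-01-01'),
                              (2, 'click', '2021-01-01'),
                              (1, 'purchase', '2021-01-02');
""")

# Each event stores only a user_id; profile fields are attached at
# query time, so neither table duplicates the other's data.
rows = con.execute("""
    SELECT e.action, p.name, p.plan
    FROM events e JOIN profiles p ON p.user_id = e.user_id
    WHERE p.plan = 'pro'
""").fetchall()
print(rows)  # only Ada's events come back
```

Without join support, the only way to answer this query is to copy `name` and `plan` onto every event row at ingestion time, which is exactly the denormalization workaround discussed next.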

David Brown:

The second important reason joins matter is the ability to support ad hoc queries. Regardless of whether those queries are composed by a human sitting in front of Tableau or Business Objects or something similar, or assembled on the fly by your client code, without joins it's very difficult to give end users the flexibility to decide what data they want to bring together and how they want to look at it, slice it, and dice it.

Kevin:

Okay. So that's important for a lot of use cases, I get it. But there are workarounds for joins, right, and one common one is to denormalize the data prior to ingestion. What are your thoughts on that? What are some of the side effects people should know about when denormalizing data ahead of time?

David Brown:

Well, the first problem is data amplification, right? When you denormalize data, you end up duplicating a lot of that static data. Each row of the flowing data gets its relevant pre-joined static data attached to it, so you're duplicating that static data again and again and again. That can cause a massive increase in your data storage requirements, which directly affects your costs and your performance, right? The one thing that humans still haven't solved is the speed of light, so we still have to think about how much data we have to move off the disk or through the I/O channel. There's also the cost and difficulty of building and maintaining the ETL pipeline that implements that denormalization, and of changing it every time your data structures change.
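A back-of-envelope calculation makes the amplification concrete. The record sizes and row counts below are invented for illustration; the point is only that copying the static data onto every flowing row multiplies storage:

```python
# Hypothetical sizes: one event record vs. one user profile record.
event_bytes = 200
profile_bytes = 1_000
events = 100_000_000   # flowing rows
users = 1_000_000      # static rows

# Normalized: store each profile once, join at query time.
normalized = events * event_bytes + users * profile_bytes

# Denormalized: the profile is copied onto every event at ingestion.
denormalized = events * (event_bytes + profile_bytes)

print(f"normalized:    {normalized / 1e9:.1f} GB")
print(f"denormalized:  {denormalized / 1e9:.1f} GB")
print(f"amplification: {denormalized / normalized:.1f}x")
```

With these made-up numbers the pre-joined copy is several times larger, and every extra pre-join you add compounds the effect.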

David Brown:

Let's assume for a minute, though, that cost is no object. The more serious problem is that denormalization severely limits your ability to support ad hoc queries. If you're going to denormalize your data, you face a dilemma around which pre-joins, which denormalizations, you're going to implement.

David Brown:

The more of them you choose to implement, the worse your data amplification problem gets, because your data gets bigger and bigger. The fewer of those pre-joins you implement, the fewer end-user queries you're able to support, because you haven't got that data pre-joined, right? And that tends to be a decision you can only get wrong: you're always going to disappoint somebody, either some of your end users or your CFO, so you have to choose. As a source of truth, Rockset was designed to support both fast, pre-defined queries, via microservices that we call Query Lambdas, and ad hoc queries via our generic APIs or via our JDBC or Tableau drivers. So having SQL and joins allows our system to support a much wider suite of applications and a much wider range of users.

Kevin:

Okay, great. You've talked us through a number of things. We've talked about where search originated, the three V's, NoSQL, and then coming back full circle to SQL, along with the new cloud technologies as well. And we've looked at more granular considerations when we talked about joining data, SQL, and so on. I just wondered if you have any closing thoughts for our audience today.

David Brown:

Sure. Again, I guess the whole concept of Rockset is that it has this converged index: we index everything out of the box, and we index it multiple ways. We have an intelligent query optimizer that can make runtime decisions based on the nature of the query and the size of the data. And that avoids the alternative of administrators having to go and define a bunch of indexes. If you get to the point where you need to define an index for every query you're going to implement in order to make that query run fast, that takes a lot of resources to build, to store in memory, to store on disk, to maintain, to manage. When you have rebalancing, that's more data to move around your network. Building an index per query is a very inefficient approach, and we get around that with our intelligent query engine and the multiple indexes available in the converged index.

Kevin:

Awesome. Well, thanks a lot, David, for sharing your thoughts. We'll take some questions shortly, so please do ask anything on your mind in the questions section of the GoToWebinar tool. I should also point out that if you're intrigued by what David has shared about Rockset, how we've designed it, and how a lot of our customers are implementing search features in applications with it, you can take advantage of $300 in free credits when you sign up for a Rockset account. I did want to step you through and show you where you can do that. Obviously you can go to this URL here, console.rockset.com/create. But if that's too long for you, you can always... Let me close this window here. Click over here, rockset.com, click try free, right? Which will take you to our account creation page, pretty straightforward, right? Let me just try here and show you how easy it is to set it up. Let's say demo one.

Kevin:

Pardon me, I am typing with one hand because I injured my other, so I'm at about 50% capacity here. Kevin L, yes, agree. That should take you to a confirmation screen, right? You get a verification email once you've entered your information. Just refresh that, verify, and you have your account created in a matter of a few minutes. Okay. Once I've done that, right away we can state our intentions. Let's say I want data from MongoDB coming into Rockset to index, and I want to build some APIs on that. I'll say let's get started, and it might give me some hints for how to get started. But really, once you've created your account, you can go ahead and create collections and ingest data from any of our supported data sources, or you can simply use our write API to write from any source that you may have.

Kevin:

So it's as simple as that. Go ahead and create your account if you heard something interesting from David's sharing today. And I should also point out: if you sign up for an account today or tomorrow, we are happy to send you one of our t-shirts and provide two hours of consulting for your onboarding efforts. All you need to do is shoot David or me an email later on; our email addresses are pretty straightforward. Just got to get this window out of the way. It's david@rockset or kevin@rockset. If you go off and create an account today or tomorrow, drop us an email and we will be happy to set up a chat with you, provide the consulting we talked about, and ask for your t-shirt size so we can send you a t-shirt. So that's what you can do to get a free account on Rockset. And David, I'm wondering, as I was doing all that, whether you received any questions from the audience.

David Brown:

There was, I think, a question about whether Elasticsearch supports SQL. In recent times they have put a SQL wrapper around their stack, but they don't support joins. I see my customers' queries all the time, and they've got inner joins as well as implicit joins. In Rockset we have the ability to support nested JSON structures, and we have an UNNEST function, which is kind of an implicit join, which is faster and more performant, but effectively the same thing. But if you don't have the ability to do joins, it's going to really restrict the way people use SQL. And yes, you could get a subset of it. You can have a wrapper that implements your sorts and ORDER BYs in a SQL-like language, but I would argue it's not SQL. I'm sure they would argue that it is, but that's okay.

Kevin:

All right. Thanks for answering that one. Any other questions coming in, David?

David Brown:

Not that I see.

Kevin:

Okay. Well, if you have any other questions, you know where to find us, david@rockset.com or kevin@rockset.com. Also, if you'd like to talk further about how Rockset might fit in with your usage, just connect with any of us. We'd be happy to talk you through that and give you a demo. Once again, if you sign up today or tomorrow, and drop us a note, we'll be happy to send off some Rockset swag to you. And with that, I will say thank you on behalf of David and me, for joining us for our Tech Talk today. Thank you.
