6 Hard Problems Scaling Vector Search
August 28, 2023
You’ve decided to use vector search in your application, product, or business. You’ve done the research on how and why embeddings and vector search make a problem solvable or can enable new features. You’ve dipped your toes into the hot, emerging area of approximate nearest neighbor algorithms and vector databases.
Almost immediately upon productionizing vector search applications, you will start to run into very hard and potentially unanticipated difficulties. This blog attempts to arm you with some knowledge of your future, the problems you will face, and the questions you may not yet know you need to ask.
1. Vector search ≠ vector database
Vector search and all the associated clever algorithms are the central intelligence of any system trying to leverage vectors. However, all of the associated infrastructure to make it maximally useful and production ready is enormous and very, very easy to underestimate.
To put this as strongly as I can: a production-ready vector database will solve many, many more “database” problems than “vector” problems. By no means is vector search, itself, an “easy” problem (and we will cover many of the hard sub-problems below), but the mountain of traditional database problems that a vector database needs to solve certainly remains the “hard part.”
Databases solve a host of very real and very well studied problems: atomicity and transactions, consistency, performance and query optimization, durability, backups, access control, multi-tenancy, scaling and sharding, and much more. Vector databases will require answers in all of these dimensions for any product, business or enterprise.
Be very wary of home-rolled “vector-search infra.” It’s not that hard to download a state-of-the-art vector search library and start approximate nearest neighboring your way towards an interesting prototype. Continuing down this path, however, is a path to accidentally reinventing your own database. That’s probably a choice you want to make consciously.
2. Incremental indexing of vectors
Due to the nature of the most modern ANN vector search algorithms, incrementally updating a vector index is a massive challenge. This is a well known “hard problem.” The issue here is that these indexes are carefully organized for fast lookups, and any attempt to incrementally update them with new vectors will rapidly degrade those fast lookup properties. As such, in order to maintain fast lookups as vectors are added, these indexes need to be periodically rebuilt from scratch.
Any application hoping to stream new vectors continuously, with requirements that both the vectors show up in the index quickly and the queries remain fast, will need serious support for the “incremental indexing” problem. This is a very crucial area for you to understand about your database and a good place to ask a number of hard questions.
There are many potential approaches that a database might take to help solve this problem for you. A proper survey of these approaches would fill many blog posts of this size. It’s important to understand some of the technical details of your database’s approach because it may have unexpected tradeoffs or consequences in your application. For example, if a database chooses to do a full-reindex with some frequency, it may cause high CPU load and therefore periodically affect query latencies.
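To make the tradeoff concrete, here is a minimal sketch of one possible approach: keep a small buffer of fresh vectors next to the built index, search both, and fold the buffer into the index once it grows past a threshold. This is an illustration under my own assumptions, not how any particular database (Rockset included) does it. The names BufferedIndex and rebuild_threshold are hypothetical, and the "built index" is a plain dictionary scanned by brute force, so only the control flow is meaningful.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class BufferedIndex:
    """Toy index: a 'built' main segment plus a buffer of recent writes.

    Hypothetical sketch. The main segment stands in for an expensive ANN
    structure; here both segments are dictionaries scanned by brute force.
    """

    def __init__(self, rebuild_threshold=3):
        self.main = {}      # id -> vector, stands in for the built ANN index
        self.buffer = {}    # id -> vector, fresh writes not yet merged
        self.rebuild_threshold = rebuild_threshold
        self.rebuilds = 0   # how many times the costly merge step has run

    def add(self, vec_id, vector):
        # New vectors are searchable immediately because search() reads the buffer.
        self.buffer[vec_id] = vector
        if len(self.buffer) >= self.rebuild_threshold:
            self._rebuild()

    def _rebuild(self):
        # In a real system this is the costly step (CPU, memory, time),
        # and it is where query latencies can be affected.
        self.main.update(self.buffer)
        self.buffer.clear()
        self.rebuilds += 1

    def search(self, query, k=1):
        # Buffer entries override main entries that share the same id.
        merged = {**self.main, **self.buffer}
        ranked = sorted(merged.items(), key=lambda item: l2(query, item[1]))
        return [vec_id for vec_id, _ in ranked[:k]]
```

The threshold is the knob that trades data latency against rebuild cost: a smaller buffer means more frequent rebuilds, while a larger buffer means more brute-force work on every query.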
You should understand your application’s need for incremental indexing, and the capabilities of the system you’re relying on to serve you.
3. Data latency for both vectors and metadata
Every application should understand its need and tolerance for data latency. Vector-based indexes have, at least by other database standards, relatively high indexing costs. There is a significant tradeoff between cost and data latency.
How long after you ‘create’ a vector do you need it to be searchable in your index? If it’s soon, vector latency is a major design point in these systems.
The same applies to the metadata of your system. As a general rule, mutating metadata is fairly common (e.g. changing whether a user is online or not), and so it’s typically very important that metadata-filtered queries rapidly react to updates to metadata. Taking the above example, it’s not useful if your vector search returns a result for someone who has recently gone offline!
If you need to stream vectors continuously to the system, or update the metadata of those vectors continuously, you will require a different underlying database architecture than if it’s acceptable for your use case to e.g. rebuild the full index every evening to be used the next day.
4. Metadata filtering
I will strongly state this point: I think in almost all circumstances, the product experience will be better if the underlying vector search infrastructure can be augmented by metadata filtering (or hybrid search).
Show me all the restaurants I might like (a vector search) that are located within 10 miles and are low to medium priced (metadata filter).
The second part of this query is a traditional SQL-like WHERE clause, intersected with the vector search result from the first part. Because of the nature of these large, relatively static, relatively monolithic vector indexes, it’s very difficult to do joint vector + metadata search efficiently. This is another of the well known “hard problems” that vector databases need to address on your behalf.
There are many technical approaches that databases might take to solve this problem for you. You can “pre-filter,” which means applying the filter first and then doing a vector lookup. This approach suffers from not being able to effectively leverage the pre-built vector index. You can “post-filter” the results after you’ve done a full vector search. This works great unless your filter is very selective, in which case you spend huge amounts of time finding vectors you later toss out because they don’t meet the specified criteria. Sometimes, as is the case in Rockset, you can do “single-stage” filtering, which attempts to merge the metadata filtering stage with the vector lookup stage in a way that preserves the best of both worlds.
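The difference between the first two strategies is easy to see in a toy example. The sketch below is an assumption-laden illustration: top_k is a brute-force stand-in for an ANN index, and the restaurant rows, the price tiers, and the overfetch parameter are all hypothetical. It does not attempt to show single-stage filtering.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, rows, k):
    """Brute-force stand-in for a lookup against a pre-built vector index."""
    return sorted(rows, key=lambda r: l2(query, r["vec"]))[:k]


def pre_filter_search(query, rows, predicate, k):
    # Filter first, then scan only the survivors. A real system cannot
    # reuse an index built over all rows for this arbitrary subset.
    return top_k(query, [r for r in rows if predicate(r)], k)


def post_filter_search(query, rows, predicate, k, overfetch=1):
    # Search first (fetching k * overfetch candidates), then drop the
    # candidates that fail the metadata predicate.
    hits = top_k(query, rows, k * overfetch)
    return [r for r in hits if predicate(r)][:k]


def nearby_and_affordable(row):
    # "Within 10 miles and low to medium priced"; tiers 1-3 are hypothetical.
    return row["miles"] <= 10 and row["price"] <= 2


restaurants = [
    {"name": "A", "vec": [0.1, 0.1], "miles": 25, "price": 3},
    {"name": "B", "vec": [0.2, 0.1], "miles": 30, "price": 1},
    {"name": "C", "vec": [0.9, 0.8], "miles": 4, "price": 2},
    {"name": "D", "vec": [1.0, 1.0], "miles": 8, "price": 1},
]

query = [0.0, 0.0]
pre = [r["name"] for r in pre_filter_search(query, restaurants, nearby_and_affordable, k=2)]
post = [r["name"] for r in post_filter_search(query, restaurants, nearby_and_affordable, k=2)]
post_wide = [r["name"] for r in post_filter_search(query, restaurants, nearby_and_affordable, k=2, overfetch=2)]
```

Here the two closest vectors both fail the filter, so plain post-filtering returns nothing at all, and it only recovers the right answer by fetching (and mostly discarding) more candidates. Pre-filtering gets the right answer but has to scan the filtered subset directly.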
If you believe that metadata filtering will be critical to your application (and I posit above that it will almost always be), the metadata filtering tradeoffs and functionality will become something you want to examine very carefully.
5. Metadata query language
If I’m right, and metadata filtering is crucial to the application you are building, congratulations, you have yet another problem. You need a way to specify filters over this metadata. This is a query language.
Coming from a database angle, and as this is a Rockset blog, you can probably guess where I am going with this. SQL is the industry standard way to express these kinds of statements. A “metadata filter” in vector language is simply “the WHERE clause” in a traditional database. It has the advantage of also being relatively easy to port between different systems.
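As a small illustration of that point, the restaurant filter from earlier can be written as an ordinary WHERE clause. The sketch below uses Python’s built-in sqlite3 module purely as a convenient SQL engine. SQLite has no vector search here, so only the metadata half of the query is shown, and the table name, column names, and price tiers are hypothetical.

```python
import sqlite3

# In-memory database holding only the metadata for each restaurant.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurants (name TEXT, miles REAL, price_tier INTEGER)"
)
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?)",
    [("A", 25.0, 3), ("B", 30.0, 1), ("C", 4.0, 2), ("D", 8.0, 1)],
)

# The "metadata filter" is just the WHERE clause.
rows = conn.execute(
    "SELECT name FROM restaurants "
    "WHERE miles <= ? AND price_tier <= ? "
    "ORDER BY name",
    (10, 2),
).fetchall()
matching_names = [name for (name,) in rows]
conn.close()
```

The same WHERE clause would carry over largely unchanged to any other SQL system, which is the portability argument in a nutshell.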
Furthermore, these filters are queries, and queries can be optimized. The sophistication of the query optimizer can have a huge impact on the performance of your queries. For example, sophisticated optimizers will try to apply the most selective of the metadata filters first because this will minimize the work later stages of the filtering require, resulting in a large performance win.
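A rough sketch of that idea, under simplifying assumptions: selectivity is estimated by running each predicate over a sample, filters are independent, and cost is counted as the number of predicate evaluations. The function names and the user data are hypothetical, and real optimizers rely on much richer statistics and cost models.

```python
def estimate_selectivity(predicate, sample):
    """Fraction of sampled rows that pass (lower means more selective)."""
    return sum(1 for row in sample if predicate(row)) / len(sample)


def order_by_selectivity(predicates, sample):
    """Put the filter that passes the fewest rows first."""
    return sorted(predicates, key=lambda p: estimate_selectivity(p, sample))


def apply_filters(rows, predicates):
    """Apply predicates in order; return (survivors, predicate evaluations)."""
    evaluations = 0
    survivors = rows
    for predicate in predicates:
        evaluations += len(survivors)
        survivors = [row for row in survivors if predicate(row)]
    return survivors, evaluations


def is_online(row):
    return row["online"]


def in_sf(row):
    return row["city"] == "SF"


# Hypothetical metadata: half the users are online, 5% are in SF.
users = [
    {"id": i, "online": i % 2 == 0, "city": "SF" if i % 20 == 0 else "NYC"}
    for i in range(100)
]

naive_result, naive_cost = apply_filters(users, [is_online, in_sf])
ordered = order_by_selectivity([is_online, in_sf], users)
optimized_result, optimized_cost = apply_filters(users, ordered)
```

Both orderings return the same rows, but putting the more selective filter first means the second filter runs over 5 rows instead of 50.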
If you plan on writing non-trivial applications using vector search and metadata filters, it’s important to understand and be comfortable with the query language, both its ergonomics and its implementation, that you are signing up to use, write, and maintain.
6. Vector lifecycle management
Alright, you’ve made it this far. You’ve got a vector database that has all the right database fundamentals you require, has the right incremental indexing strategy for your use case, has a good story around your metadata filtering needs, and will keep its index up-to-date with latencies you can tolerate. Awesome.
Your ML team (or maybe OpenAI) comes out with a new version of their embedding model. You have a gigantic database filled with old vectors that now need to be updated. Now what? Where are you going to run this large batch-ML job? How are you going to store the intermediate results? How are you going to do the switch over to the new version? How do you plan to do this in a way that doesn’t affect your production workload?
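One common pattern for the switchover question, sketched here under my own assumptions rather than as a description of any specific product, is to store the new vectors side by side with the old ones, backfill in batches, and flip a single "active version" pointer only once the backfill is complete. Everything named below (VersionedVectorStore, embed_v1, embed_v2, the batch size) is hypothetical, and the embedding functions are trivial stand-ins for real models.

```python
class VersionedVectorStore:
    """Toy store that keeps vectors from several embedding models side by side."""

    def __init__(self, active_version):
        self.vectors = {active_version: {}}  # version -> {doc_id: vector}
        self.active_version = active_version

    def upsert(self, version, doc_id, vector):
        self.vectors.setdefault(version, {})[doc_id] = vector

    def get(self, doc_id):
        # Reads always go to the active version, so production traffic
        # never sees a half-finished backfill.
        return self.vectors[self.active_version].get(doc_id)

    def backfill(self, new_version, documents, embed, batch_size=2):
        """Re-embed documents in batches; returns the number of batches run."""
        doc_ids = list(documents)
        batches = 0
        for start in range(0, len(doc_ids), batch_size):
            for doc_id in doc_ids[start:start + batch_size]:
                self.upsert(new_version, doc_id, embed(documents[doc_id]))
            batches += 1
        return batches

    def promote(self, new_version, expected_count):
        """Switch reads to new_version, refusing if the backfill is incomplete."""
        if len(self.vectors.get(new_version, {})) != expected_count:
            raise ValueError("backfill incomplete; refusing to switch")
        old_version = self.active_version
        self.active_version = new_version
        return old_version


def embed_v1(text):
    # Stand-in for the old embedding model.
    return [float(len(text))]


def embed_v2(text):
    # Stand-in for the new embedding model (different dimensionality).
    return [float(len(text)), float(text.count("a"))]


docs = {"d1": "pasta", "d2": "sushi", "d3": "tacos"}
store = VersionedVectorStore("v1")
for doc_id, text in docs.items():
    store.upsert("v1", doc_id, embed_v1(text))
```

A sketch like this glosses over the expensive parts (where the batch job runs, how much extra storage two versions cost, how writes that arrive mid-backfill are handled), which are exactly the questions worth asking of your database.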
Ask the Hard Questions
Vector search is a rapidly emerging area, and we’re seeing a lot of users starting to bring applications to production. My goal for this post was to arm you with some of the crucial hard questions you might not yet know to ask. And you’ll benefit greatly from having them answered sooner rather than later.
What I didn’t cover in this post is how Rockset has worked, and is still working, to solve all of these problems, and why some of our solutions are ground-breaking and better than most other attempts at the state of the art. Covering that would require many blog posts of this size, which is, I think, precisely what we’ll do. Stay tuned for more.