Reflections on what I've learned about information retrieval in the last two years working at Weaviate
From BM25 to RAG: Everything I learned about vector databases, embedding models, and vector search - and everything in between.
Today I'm celebrating my two-year work anniversary at Weaviate, a vector database company. To celebrate, I want to reflect on what I've learned about vector databases and search during this time. Here are some of the things I've learned and some common misconceptions I see:
BM25 is a strong baseline for search. Ha! You thought I would start with something about vector search, and here I am talking about keyword search. And that is exactly the first lesson: Start with something simple like BM25 before you move on to more complex things like vector search.
Vector search in vector databases is approximate and not exact. In theory, you could run a brute-force search to compute distances between a query vector and every vector in the database using exact k-nearest neighbors (KNN). But this doesn't scale well. That's why vector databases use Approximate Nearest Neighbor (ANN) algorithms, like HNSW, IVF, or ScaNN, to speed up search while trading off a small amount of accuracy. Vector indexing is what makes vector databases so fast at scale.
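To make the trade-off concrete, here is a minimal brute-force KNN sketch in NumPy (corpus size and dimensionality are made-up examples). It has to compute the distance to every single vector, which is exactly the linear cost that ANN indexes like HNSW avoid by visiting only a small, cleverly chosen subset of candidates.

```python
import numpy as np

# Toy setup: 10,000 vectors with 384 dimensions (numbers chosen for illustration).
corpus = np.random.rand(10_000, 384).astype(np.float32)
query = np.random.rand(384).astype(np.float32)

def brute_force_knn(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact k-nearest neighbors: measure the distance to every vector in the corpus."""
    distances = np.linalg.norm(corpus - query, axis=1)  # Euclidean distance to all 10,000 vectors
    return np.argsort(distances)[:k]                    # indices of the k closest vectors

print(brute_force_knn(query, corpus))
```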
Vector databases don't only store embeddings. They also store the original object (e.g., the text from which you generated the vector embeddings) and metadata. This allows them to support other features beyond vector search, like metadata filtering and keyword and hybrid search.
Vector databases' main application is not in generative AI. It's in search. But finding relevant context for LLMs is "search". That's why vector databases and LLMs go together like cookies and cream.
You have to specify how many results you want to retrieve. When I think back, I almost have to laugh because this was such a big "aha" moment when I realized that you need to define the maximum number of results you want to retrieve. It's a little oversimplified, but vector search would return all the objects stored in the database, sorted by the distance to your query vector, if there weren't a limit or top_k parameter.
There are many different types of embeddings. When you think of a vector embedding, you probably visualize something like [-0.9837, 0.1044, 0.0090, …, -0.2049]. That's called a dense vector, and it is the most commonly used type of vector embedding. But there are also many other types of vectors, such as sparse ([0, 2, 0, …, 1]), binary ([0, 1, 1, …, 0]), and multi-vector embeddings ([[-0.9837, …, -0.2049], [0.1044, …, 0.0090], …, [-0.0937, …, 0.5044]]), which can be used for different purposes.
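As a rough mental model, here is what those types look like as plain data structures (all values are made up):

```python
import numpy as np

dense = np.array([-0.9837, 0.1044, 0.0090, -0.2049])   # one float per dimension
sparse = {3: 2.0, 17: 1.0}                              # only the non-zero dimensions (index -> weight)
binary = np.packbits([0, 1, 1, 0, 1, 0, 0, 1])          # one bit per dimension, packed into bytes
multi = np.array([[-0.9837, -0.2049],                   # one small vector per token
                  [ 0.1044,  0.0090]])                  # instead of one vector per document
```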
Fantastic embedding models and where to find them. The first place to go is the Massive Text Embedding Benchmark (MTEB). It covers a wide range of different tasks for embedding models, including classification, clustering, and retrieval. If you're focused on information retrieval, you might want to check out BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models.
The majority of embedding models on MTEB are English. If you're working with multilingual or non-English languages, it might be worth checking out MMTEB (Massive Multilingual Text Embedding Benchmark).
A little history on vector embeddings: Before there were today's contextual embeddings (e.g., BERT), there were static embeddings (e.g., Word2Vec, GloVe). They are static because each word has a fixed representation, while contextual embeddings generate different representations for the same word based on the surrounding context. Although today's contextual embeddings are much more expressive, static embeddings can be helpful in computationally constrained environments because they can be looked up from pre-computed tables.
Don't confuse sparse vectors and sparse embeddings. It took me a while until I understood that sparse vectors can be generated in different ways: either by applying statistical scoring functions like TF-IDF or BM25 to term frequencies (often retrieved via inverted indexes), or with neural sparse embedding models like SPLADE. That means a sparse embedding is a sparse vector, but not all sparse vectors are necessarily sparse embeddings.
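Here is the statistical flavor in a minimal scikit-learn sketch (the two toy documents are made up); a learned model like SPLADE would instead produce the term weights with a neural network:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["how to fix a faucet", "where to buy a kitchen faucet"]

# Statistical sparse vectors: one dimension per vocabulary term, most entries zero.
vectorizer = TfidfVectorizer()
sparse_vectors = vectorizer.fit_transform(docs)  # SciPy sparse matrix of shape (2, vocabulary size)

print(vectorizer.get_feature_names_out())
print(sparse_vectors.toarray().round(2))
```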
Embed all the things. Embeddings aren't just for text. You can embed images, PDFs as images (see ColPali), graphs, etc. And that means you can do vector search over multimodal data. It's pretty incredible. You should try it sometime.
The economics of vector embeddings. This shouldn't be a surprise, but the vector dimensions will impact the required storage cost. So, consider whether it is worth it before you choose an embedding model with 1536 dimensions over one with 768 dimensions and risk doubling your storage requirements. Yes, more dimensions capture more semantic nuances. But you probably don't need 1536 dimensions to "chat with your docs". Some models actually use Matryoshka Representation Learning to allow you to shorten vector embeddings for environments with less computational resources, with minimal performance losses.
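A quick back-of-the-envelope calculation (1M objects is an assumed example; index overhead and metadata are not included):

```python
num_vectors = 1_000_000
bytes_per_float = 4  # float32

for dims in (768, 1536):
    gigabytes = num_vectors * dims * bytes_per_float / 1024**3
    print(f"{dims} dims: ~{gigabytes:.2f} GB")

# 768 dims:  ~2.86 GB
# 1536 dims: ~5.72 GB  -> double the dimensions, double the raw vector storage
```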
Speaking of: "Chat with your docs" tutorials are the "Hello world" programs of Generative AI.
You need to call the embedding model A LOT. Just because you embedded your documents during the ingestion stage doesn't mean you're done calling the embedding model. Every time you run a search query, the query must also be embedded (if you're not using a cache). If you're adding objects later on, those must also be embedded (and indexed). If you're changing the embedding model, you must also re-embed (and re-index) everything.
Similar does not necessarily mean relevant. Vector search returns objects by their similarity to a query vector. The similarity is measured by their proximity in vector space. Just because two sentences are similar in vector space (e.g., "How to fix a faucet" and "Where to buy a kitchen faucet") does not mean they are relevant to each other.
Cosine similarity and cosine distance are not the same thing. But they are related to each other (cosine distance = 1 − cosine similarity). If you will, distance and similarity are inverses: if two vectors are exactly the same, the similarity is 1 and the distance between them is 0.
If you're working with normalized vectors, it doesn't matter whether you're using cosine similarity or dot product for the similarity measure, because mathematically they are the same. Computationally, the dot product is more efficient.
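A quick NumPy sanity check (the random vectors here are purely illustrative):

```python
import numpy as np

a, b = np.random.rand(384), np.random.rand(384)

# Normalize both vectors to unit length.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_product = np.dot(a, b)

# For unit-length vectors the two values are identical (up to floating-point noise),
# but the dot product skips the extra normalization step.
print(np.isclose(cosine_sim, dot_product))  # True
```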
Common misconception: The R in RAG stands for "vector search". It doesn't. It stands for "retrieval". And retrieval can be done in many different ways (see the following bullets).
Vector search is just one tool in the retrieval toolbox. There's also keyword-based search, filtering, and reranking. It's not one over the other. To build something great, you will need to combine different tools.
When to use keyword-based search vs. vector-based search: Does your use case require mainly matching semantics and synonyms (e.g., "pastel colors" vs. "light pink") or exact keywords (e.g., "A-line skirt", "peplum dress")? If it requires both (e.g., "pastel colored A-line skirt"), you might benefit from combining both and using hybrid search. In some implementations (e.g., Weaviate), you can just use the hybrid search function and then use the alpha parameter to change the weighting from pure keyword-based search, to a mix of both, to pure vector search.
Hybrid search can be a hybrid of different search techniques. Most often, when you hear people talk about hybrid search, they mean the combination of keyword-based search and vector-based search. But the term "hybrid" doesn't specify which techniques to combine. So, sometimes you might hear people talk about hybrid search, meaning the combination of vector-based search and search over structured data (often referred to as metadata filtering).
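Conceptually, the alpha parameter mentioned above blends the two result sets roughly like this (a simplified sketch with made-up scores; real implementations such as Weaviate's first normalize or rank-fuse the individual scores):

```python
def hybrid_score(keyword_score: float, vector_score: float, alpha: float) -> float:
    """Blend a normalized keyword-based score and a normalized vector-based score.

    alpha = 0.0 -> pure keyword-based search
    alpha = 1.0 -> pure vector search
    """
    return (1 - alpha) * keyword_score + alpha * vector_score

# A document that matches the keywords well but is only moderately close in vector space.
print(hybrid_score(keyword_score=0.9, vector_score=0.4, alpha=0.5))  # 0.65
```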
Misconception: Filtering makes vector search faster. Intuitively, you'd think using a filter should reduce search latency because you're reducing the number of candidates to search through. But in practice, pre-filtering candidates can, for example, break the graph connectivity in HNSW, and post-filtering can leave you with no results at all. Vector databases have different, sophisticated techniques to handle this challenge.
Two-stage retrieval pipelines aren't only for recommendation systems. Recommendation systems often have a first retrieval stage that uses a simpler retrieval process (e.g., vector search) to reduce the number of potential candidates, followed by a second, more compute-intensive but more accurate reranking stage. You can apply this to your RAG pipeline as well.
How vector search differs from reranking. Vector search returns a small portion of results from the entire database. Reranking takes in a list of items and returns the re-ordered list.
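Put together, a two-stage retrieve-and-rerank pipeline looks roughly like this (vector_search and cross_encoder_score are hypothetical placeholders standing in for an ANN query and a reranking model):

```python
def vector_search(query: str, k: int) -> list[str]:
    # Stage 1: fast ANN search over the whole database (placeholder implementation).
    return [f"candidate_{i}" for i in range(k)]

def cross_encoder_score(query: str, doc: str) -> float:
    # Stage 2: slower but more accurate relevance model (placeholder implementation).
    return float(len(set(query.split()) & set(doc.split())))

def retrieve_and_rerank(query: str, k_retrieve: int = 100, k_final: int = 10) -> list[str]:
    candidates = vector_search(query, k=k_retrieve)          # cheap and broad
    reranked = sorted(candidates,                            # expensive, but only over 100 candidates
                      key=lambda doc: cross_encoder_score(query, doc),
                      reverse=True)
    return reranked[:k_final]

print(retrieve_and_rerank("how to fix a faucet"))
```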
Finding the right chunk size to embed is not trivial. Too small, and you'll lose important context. Too big, and you'll lose semantic meaning. Many embedding models use mean pooling to average all token embeddings into a single vector representation of a chunk. So, if you have an embedding model with a large context window, you can technically embed an entire document. I forgot who said this, but I like this analogy: You can think of it like creating a movie poster for a movie by overlaying every single frame in the movie. All the information is there, but you won't understand what the movie is about.
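For reference, the most naive baseline is fixed-size chunking with overlap (the 200/50 word defaults below are arbitrary), and even these two numbers already influence retrieval quality:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Naive chunking: a fixed number of words per chunk, with overlap between neighbors."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[start:start + chunk_size]) for start in range(0, len(words), step)]
```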
Vector indexing libraries are different from vector databases. Both are incredibly fast for vector search. Both work really well to showcase vector search in "chat with your docs"-style RAG tutorials. However, only one of them adds data management features, like built-in persistence, CRUD support, metadata filtering, and hybrid search.
RAG has been dying since the release of the first long-context LLM. Every time an LLM with a longer context window is released, someone will claim that RAG is dead. It never is…
You can throw out 97% of the information and still retrieve (somewhat) accurately. It's called vector quantization. For example, with binary quantization you can change something like [-0.9837, 0.1044, 0.0090, …, -0.2049] into [0, 1, 1, …, 0] (a 32x storage reduction from 32-bit float to 1-bit), and you'll be surprised how well retrieval still works (in some use cases).
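A minimal sketch of the binary case, using the toy vector from above (production systems typically quantize per segment and often keep the original vectors around for rescoring):

```python
import numpy as np

embedding = np.array([-0.9837, 0.1044, 0.0090, -0.2049], dtype=np.float32)

# Binary quantization: keep only the sign of each dimension.
bits = (embedding > 0).astype(np.uint8)  # [0, 1, 1, 0]
packed = np.packbits(bits)               # 1 bit per dimension instead of 32

print(bits, packed.nbytes, "byte(s) vs", embedding.nbytes, "bytes")
```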
Vector search is not robust to typos. For a while, I thought that vector search was robust to typos because the large text corpora used for training surely contain a lot of typos and therefore help the embedding model learn them as well. But if you think about it, there's no way that all the possible typos of a word are reflected in sufficient amounts in the training data. So, while vector search can handle some typos, you can't really say it is robust to them.
Knowing when to use which metric to evaluate search results. There are many different metrics to evaluate search results. Looking at academic benchmarks, like BEIR, you'll notice that NDCG@k is prominent. But simpler metrics like precision and recall are a great fit for many use cases.
The precision-recall trade-off is often depicted with a fisherman's analogy of casting a net, but this e-commerce analogy made it click better for me: Imagine you have a webshop with 100 books, out of which 10 are ML-related.
Now, if a user searches for ML-related books, you could just return one ML book. Amazing! You have perfect precision (out of the k=1 results returned, how many were relevant). But that's bad recall (out of the relevant results that exist, how many did I return? In this case, 1 out of 10 relevant books). And also, that's not so good for your business. Maybe the user didn't like that one ML-related book you returned.
At the other extreme, you return your entire selection of books. All 100 of them. Unsorted… That's perfect recall because you returned all relevant results. It's just that you also returned a bunch of irrelevant results, which can be measured by how bad the precision is.
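The same webshop example in a few lines of Python (book names are placeholders):

```python
relevant = {f"ml_book_{i}" for i in range(10)}  # the 10 ML-related books in the shop

def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved), hits / len(relevant)

# Return a single ML book: perfect precision, poor recall.
print(precision_recall(["ml_book_0"], relevant))                      # (1.0, 0.1)

# Return the whole shop, unsorted: perfect recall, poor precision.
whole_shop = list(relevant) + [f"other_book_{i}" for i in range(90)]
print(precision_recall(whole_shop, relevant))                         # (0.1, 1.0)
```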
There are metrics that include the order. When I think of search results, I visualize something like a Google search. So, naturally, I thought that the rank of the search results is important. But metrics like precision and recall don't consider the order of search results. If the order of your search results is important for your use case, you need to choose rank-aware metrics like MRR@k, MAP@k, or NDCG@k.
Tokenizers matter. If you've been in the Transformer bubble too long, you've probably forgotten that other tokenizers exist besides Byte-Pair Encoding (BPE). Tokenizers are also important for keyword search and its performance. And if the tokenizer impacts the keyword-based search performance, it also impacts the hybrid search performance.
Out-of-domain is not the same as out-of-vocabulary. Earlier embedding models used to fail on out-of-vocabulary terms. If your embedding model had never seen or heard of "Labubu", it would have just run into an error. With smart tokenization, unseen out-of-vocabulary terms can be handled gracefully, but the issue is that they are still out-of-domain terms, and therefore their vector embeddings look like proper embeddings but are meaningless.
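One quick way to see this: a subword tokenizer happily splits an unseen word into known pieces, so there is no error, but the resulting embedding only reflects those pieces rather than what the word means (the model choice and the exact split in the comment are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# An out-of-vocabulary word gets split into known subword pieces,
# e.g. something like ['la', '##bu', '##bu']: no error, but also no meaning.
print(tokenizer.tokenize("Labubu"))
```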
Query optimizations: You know how you've learned to type "longest river africa" into Google's search bar, instead of "What is the name of the longest river in Africa?". You've learned to optimize your search query for keyword search (yes, we know the Google search algorithm is more sophisticated. Can we just go with it for a second?). Similarly, we now need to learn how to optimize our search queries for vector search.
What comes after vector search? First, there was keyword-based search. Then, Machine Learning models enabled vector search. Now, LLMs with reasoning enable reasoning-based retrieval.
Information retrieval is so hot right now. I feel fortunate to get to work in this exciting space. Although working on and with LLMs seems to be the cool thing now, figuring out how to provide the best information for them is equally exciting. And that's the field of retrieval.
I'm repeating my last point, but looking back at the past two years, I feel grateful to work in this field. I have only scratched the surface so far, and there's still so much to learn. When I joined Weaviate, vector databases were the hot new thing. Then came RAG. Now, we're talking about "context engineering". But what hasn't changed is the importance of finding the best information to give the LLM so it can provide the best possible answer.
continue reading on www.leoniemonigatti.com
If this post was enjoyable or useful for you, please share it! If you have comments, questions, or feedback, you can reach me via my personal email. To get new posts, subscribe via the RSS feed.