Published: March 28, 2024

Building Better Retrieval Systems

Building on my tweet comparing vector databases, I dive deeper into embedding models and their applications in Retrieval-Augmented Generation (RAG) systems. The landscape of embedding models evolves rapidly, and understanding how to leverage them effectively can dramatically improve your retrieval systems.

This article explores advanced techniques for enhancing text retrieval quality, choosing the right embedding models, and implementing effective reranking strategies.

Embedding Models and Retrieval Techniques

New embedding models are released frequently, with performance continuously improving. The Massive Text Embedding Benchmark (MTEB) Leaderboard is an excellent resource for tracking the best embedding models available.

However, even the best embedding models sometimes fall short. When this happens, consider these advanced techniques to enhance text retrieval quality:

  • Retrieve More and Re-rank: Increase the number of text extracts retrieved and re-rank them using models like bge-reranker-large.
  • Use Different Length Windows: Embed documents using various window lengths (e.g., 1000 and 500 words) so you're effectively embedding your document multiple times.
  • LLM-assisted Extraction: Use large language models (LLMs) to extract only the relevant parts of the text in your retrieval pipeline, then (optionally) re-embed and re-rank the extracted text.
  • RAG-Fusion: Perform multiple query generations and use Reciprocal Rank Fusion (RRF) to re-rank search results (a minimal RRF sketch follows this list).
  • Lemmatization: Keep two versions of each document: one for generating embeddings (stopwords removed, lemmatization applied) and one with the original text, which is what gets sent to the LLM as context.
  • Hybrid Search: Incorporate results from classic lexical algorithms like BM25 in the re-ranking or RAG-Fusion step.
  • HyDE Approach: Implement the HyDE approach, which uses hypothetical document embeddings generated by LLMs to match queries more effectively. This performs answer-to-answer embedding similarity search, rather than the traditional query-to-answer embedding similarity search used in standard RAG retrieval.
[Figure: The HyDE approach]
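Several of these techniques, RAG-Fusion and hybrid search in particular, come down to fusing multiple ranked result lists into one. Below is a minimal sketch of Reciprocal Rank Fusion; the document IDs and the conventional k constant are illustrative assumptions, not tuned values:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked lists of document IDs, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it ranks.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_results = ["doc_a", "doc_b", "doc_c"]   # e.g., from an embedding retriever
bm25_results = ["doc_c", "doc_a", "doc_d"]    # e.g., from a lexical retriever
print(reciprocal_rank_fusion([dense_results, bm25_results]))
```

The same function fuses the result lists from multiple generated queries (RAG-Fusion) or from a dense retriever plus BM25 (hybrid search): each retriever simply contributes one ranked list.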

Choosing the Right Embedding Model

When selecting an embedding model, it's better to use Encoder-Decoder Models (Sequence-to-Sequence Models) like T5 or Encoder-only Models (Autoencoding Models) like BERT and RoBERTa, rather than Decoder-only Models (Autoregressive Models) like LLaMA or Mistral.

Encoder-only models produce bidirectional embeddings: they attend to the full context of a sentence rather than only the tokens needed to predict the next word, which yields more accurate representations. While it's possible to extract embeddings from decoder-only models, they don't perform as well for this purpose, and larger models also require more compute and processing time.
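To make this concrete, here's a minimal sketch of pulling sentence embeddings out of an encoder-only model by mean-pooling its token embeddings with Hugging Face transformers. The model name and pooling strategy are illustrative assumptions; dedicated embedding models usually bake a pooling scheme in:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative encoder-only model; any BERT-style checkpoint works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["Rerankers refine retrieval results.", "Vector databases store embeddings."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over real tokens only, ignoring padding.
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # torch.Size([2, 768])
```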

Recommended models:

  • Instructor-XL: A T5-based model with 3 billion parameters, optimized for instruction-following capabilities during embedding generation.
  • BGE-XXL: An 11-billion parameter model that offers robust performance.
  • BGE-Large-v1.5: A smaller bi-encoder model that performs well across a range of tasks (used in the embedding sketch after this list).
  • BGE-Reranker-Large: A cross-encoder model ideal for reranking.
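
As a quick illustration of how one of these models is used in practice, here's a minimal sketch of embedding and scoring with BGE-Large-v1.5 through the sentence-transformers library; the model identifier is assumed from the Hugging Face Hub and the documents are made up:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

docs = [
    "Cross-encoders rerank documents for a specific query.",
    "PostgreSQL is a relational database.",
]
query = "How do rerankers work?"

# Normalizing makes cosine similarity equal to the dot product.
doc_embeddings = model.encode(docs, normalize_embeddings=True)
query_embedding = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_embedding, doc_embeddings)
print(scores)  # higher score = more similar
```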

The Importance of Re-Rankers

Re-rankers, or cross-encoders, play a critical role in maximizing retrieval recall. While a simple bi-encoder retrieves candidate documents, a cross-encoder refines this list by assigning a relevance score to each query-document pair, ensuring that only the most relevant documents reach the LLM.

We cannot simply return numerous documents to fill up the LLM context (context stuffing) because this reduces the LLM's recall performance. Note that this refers to LLM recall, which differs from retrieval recall.

When information is stored in the middle of a context window, an LLM's ability to recall that information becomes worse than if it hadn't been provided at all. This phenomenon is documented in the research paper "Lost in the Middle: How Language Models Use Long Contexts" (2023).

[Figure: Lost in the Middle]

The solution is to maximize retrieval recall by retrieving many documents, then maximize LLM recall by minimizing the number of documents that reach the LLM. We achieve this by reordering retrieved documents and keeping only the most relevant ones through reranking.

This creates a two-stage retrieval system where the vector database step typically includes a bi-encoder or sparse embedding model:

[Figure: Two-stage retrieval system]

Search engineers have used rerankers in two-stage retrieval systems for years. In these systems:

  1. First stage: An embedding model/retriever retrieves relevant documents from a larger dataset
  2. Second stage: A reranker reorders those documents for optimal relevance

We use two stages because retrieving documents from large datasets is much faster than reranking large document sets: rerankers are slow, retrievers are fast.
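
Putting the two stages together, here's a minimal sketch of the pipeline: a bi-encoder plus a FAISS index for the fast first stage, and a cross-encoder for the slower reranking stage. The model names, corpus, and cut-offs are illustrative assumptions:

```python
import faiss
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

corpus = [
    "Rerankers score query-document pairs jointly.",
    "Bi-encoders embed queries and documents independently.",
    "PostgreSQL supports relational joins.",
]

# Stage 1: fast candidate retrieval with a bi-encoder and an exact inner-product index.
bi_encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
doc_vectors = np.asarray(bi_encoder.encode(corpus, normalize_embeddings=True), dtype="float32")
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(doc_vectors)

query = "Why are cross-encoders more accurate?"
query_vector = np.asarray(bi_encoder.encode([query], normalize_embeddings=True), dtype="float32")
_, candidate_ids = index.search(query_vector, 3)  # retrieve more than you intend to keep

# Stage 2: rerank the candidates with a cross-encoder and keep only the best.
reranker = CrossEncoder("BAAI/bge-reranker-large")
candidates = [corpus[i] for i in candidate_ids[0]]
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
```

In a real pipeline the first stage would retrieve tens of candidates from a much larger corpus, and only the top few reranked documents would be passed to the LLM.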

Why Use Re-Rankers?

Re-rankers are more accurate than embedding models because they consider the query and document together, preserving more information. However, they are slower because they process each query-document pair individually. Despite this, their superior accuracy makes them invaluable in applications requiring precise relevance, such as medical information retrieval or detailed research queries.

The intuition behind a bi-encoder's inferior accuracy lies in its limitations:

  • Information compression: Bi-encoders must compress all possible meanings of a document into a single vector, causing information loss
  • No query context: Bi-encoders lack query context since embeddings are created before receiving the user query

Rerankers overcome these limitations by:

  • Direct processing: Receiving raw information directly into transformer computation, reducing information loss
  • Query-specific analysis: Running at query time enables document meaning analysis specific to the user query, rather than producing generic, averaged meanings

Vector Databases: An Overview

Vector databases store data as high-dimensional vectors, making them essential for efficiently managing and searching through large embedding datasets.

Here's an overview of popular vector database options:

  1. Elasticsearch: A distributed search and analytics engine that supports many data types, including vector fields for storing dense vectors of numeric values. In version 7.10, Elasticsearch added support for indexing vectors into a specialized data structure for fast kNN retrieval through the kNN search API, and in version 8.0 it added support for native natural language processing (NLP) with vector fields.

  2. Faiss: A library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. It is developed primarily at Meta’s Fundamental AI Research group.

    • A number of vector databases and search tools build on top of Faiss or its algorithms
  3. Milvus: An open-source vector database that can manage trillion-scale vector datasets and supports multiple vector search indexes and built-in filtering. It is a cloud-native vector database solution that can manage unstructured data. It supports automated horizontal scaling and uses acceleration methods to enable high-speed retrieval of vector data.

    • Milvus supports multiple approximate nearest neighbor (ANN) indexes, such as IVF_FLAT, Annoy, HNSW, and RNSG
  4. Qdrant: A vector similarity search engine and vector database. It provides a production-ready service with a convenient API to store, search, and manage points (vectors with an additional payload). Qdrant offers extended filtering support, which makes it useful for all sorts of neural-network or semantic-based matching, faceted search, and other applications.

  5. Chroma: An AI-native open-source embedding database that's simple, feature-rich, and integrates with various tools and platforms. Provides both JavaScript and Python APIs (a short usage sketch follows this list).

    • Claims to be the first AI-centric vector database and looks promising, though persistence may be limited when self-hosting
    • Features a time-series function that could be valuable for streaming real-time data events and performing queries over time series
  6. OpenSearch: A community-driven, open-source fork of Elasticsearch and Kibana created after the license change in early 2021. It includes vector database functionality that lets you store and index vectors and metadata, and perform vector similarity search using k-NN indexes.

  7. Weaviate: An open-source vector database that allows you to store data objects and vector embeddings from your favorite ML models and scale seamlessly into billions of data objects.

  8. Vespa: A fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query. Integrated machine-learned model inference allows you to apply AI to make sense of your data in real time.

  9. pgvector: An open-source PostgreSQL extension for storing and querying vector embeddings directly in your database; easy to install with a single command.

    • Basic PostgreSQL extension that's open source, free, and ubiquitous
    • May not benchmark as well as specialized vector databases
    • Excellent for integrating with relational metadata, though not as fast as best-of-breed vector databases
  10. Vald: A highly scalable, distributed, fast approximate nearest neighbor dense vector search engine. Vald is designed and implemented on a cloud-native architecture. It uses the NGT algorithm for ANN search and provides automatic vector indexing, index backup, and horizontal scaling, which allows it to search across billions of feature vectors.

    • Vald uses a distributed index graph to support asynchronous indexing. It stores each index in multiple agents which enables index replicas and ensures high availability.
    • Vald is also open-source and free to use. It can be deployed on a Kubernetes cluster and the only cost incurred is that of the infrastructure.
  11. Apache Cassandra: An open source NoSQL distributed database trusted by thousands of companies. Vector search is coming to Apache Cassandra in its 5.0 release, which is expected to be available in late 2023 or early 2024. This feature is based on a collaboration between DataStax and Google, who are working on integrating Apache Cassandra with Google’s open-source vector search library, ScaNN.

  12. ScaNN (Scalable Nearest Neighbors, Google Research): A library for efficient vector similarity search, which finds the k nearest vectors to a query vector, as measured by a similarity metric. Vector similarity search is useful for applications such as image search, natural language processing, recommender systems, and anomaly detection.

  13. Pinecone: A fully managed vector database designed for machine learning applications; fast and scalable.

    • Offers filtering, vector search libraries, and distributed infrastructure for reliability and speed
    • Hosted solution with limited free tier (one index) and expensive paid plans
    • De-facto choice for many projects, though not self-hosted
  14. Marqo: Simple to use with built-in embedding and inference management, multi-modal support, and automatic text/image chunking.

  15. Embeddinghub: An open-source solution for storing machine learning embeddings with high durability and easy access. Supports intelligent analysis (approximate nearest neighbor operations) and regular analysis (partitioning and averaging). Uses HNSW algorithm via HNSWLib for high-performance lookups.

  16. Redis: The popular key-value store that also supports vector search. Widely available, free, open source, and extremely fast. Popular in the Rails community. See the quick start guide for implementation details.
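
To make the comparison more concrete, here's a minimal sketch of the kind of workflow most of these databases expose, using Chroma's Python client. The collection name and documents are made up, and the exact API may differ between versions, so treat this as a sketch rather than a reference:

```python
import chromadb

# In-memory client; Chroma also offers persistent and client/server modes.
client = chromadb.Client()
collection = client.create_collection(name="articles")

# Chroma embeds documents with its default embedding function on insert.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Rerankers improve the precision of retrieved context.",
        "Vector databases store high-dimensional embeddings.",
    ],
)

results = collection.query(query_texts=["How do I improve retrieval?"], n_results=2)
print(results["documents"])
```

The workflow in Qdrant, Weaviate, Milvus, or pgvector is similar: create a collection or table, insert vectors, then run a nearest-neighbour query.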

Common Features of Vector Databases

Vector databases and vector libraries both enable vector similarity search, but differ significantly in functionality:

Vector Databases:

  • Store and update data dynamically
  • Handle various data source types
  • Perform queries during data import
  • Provide user-friendly, enterprise-ready features

Vector Libraries:

  • Store data only (no updates)
  • Handle vectors exclusively
  • Require importing all data before building indexes
  • Need more technical expertise and manual configuration

Many vector databases are built on existing libraries like Faiss, leveraging proven algorithms and features to accelerate development. Most vector databases share these common capabilities:

  1. Vector similarity search - Finding k nearest vectors to a query vector using various similarity metrics. Useful for image search, NLP, recommender systems, and anomaly detection.

  2. Vector compression - Reducing storage space and improving query performance through scalar quantization, product quantization, and anisotropic vector quantization.

  3. Flexible nearest neighbor search - Supporting both exact (perfect recall, slower) and approximate (faster, slight recall trade-off) search depending on your accuracy-speed requirements.

  4. Multiple similarity metrics - Supporting L2 distance, inner product, cosine distance, and other metrics suited to different use cases and data types (illustrated in the short example after this list).

  5. Multi-modal data support - Handling text, images, audio, video, and other data types by transforming them into vector embeddings using appropriate machine learning models.
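
Since several of these capabilities boil down to the underlying distance computation (see item 4 above), here's a small numpy sketch of the common metrics and of exact k-nearest-neighbour search; the vectors are random placeholders:

```python
import numpy as np

vectors = np.random.rand(1000, 384).astype("float32")  # toy "database" of embeddings
query = np.random.rand(384).astype("float32")

l2 = np.linalg.norm(vectors - query, axis=1)          # L2 distance: smaller is closer
inner = vectors @ query                               # inner product: larger is closer
cosine = inner / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))

# Exact (flat) k-nearest-neighbour search is just a sort over the chosen metric;
# approximate indexes (HNSW, IVF, etc.) trade a little recall for much more speed.
k = 5
print(np.argsort(l2)[:k])        # indices of the 5 closest vectors by L2
print(np.argsort(-cosine)[:k])   # indices of the 5 most similar by cosine
```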

When choosing a vector database, consider your specific needs and requirements. Remember that vector databases excel at storing vectors efficiently and performing mathematical operations on them, but should not be used as persistent storage for your primary data.

Conclusion

Embeddings and vector databases form the backbone of modern RAG systems. The key to building effective retrieval systems lies in understanding the trade-offs between different approaches:

  • Choose appropriate embedding models based on your architecture needs (encoder-only vs decoder-only)
  • Implement two-stage retrieval with reranking to maximize both retrieval and LLM recall
  • Select vector databases that match your scalability, performance, and operational requirements

As embedding models continue to improve and new techniques emerge, the potential for enhancing retrieval systems grows. The combination of advanced embedding techniques, strategic reranking, and well-chosen vector databases can dramatically improve the quality and efficiency of your RAG implementations.

References