GraphRAG > Traditional Vector RAG
GraphRAG (Graphs + Retrieval Augmented Generation) is a technique for richly understanding text datasets by combining text extraction, network analysis, and LLM prompting and summarization into a single end-to-end system.
Motivation: the Limitations of Vector Similarity Search
The idea behind the retrieval-augmented generation (RAG) approach is to reference external data at question time and feed it to an LLM to enhance its ability to generate accurate and relevant answers. Right now it is pretty much ubiquitous to employ vector similarity search as the method for naively identifying which chunks of text might contain data relevant to answering the user's question accurately.
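Concretely, that baseline retrieval step amounts to ranking chunk embeddings by cosine similarity against the query embedding. A minimal sketch, with toy 3-dimensional vectors standing in for real embedding-model output:

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Rank chunks by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q  # cosine similarity of each chunk to the query
    return np.argsort(sims)[::-1][:k]

# Toy 3-d "embeddings" standing in for a real embedding model's output.
chunks = ["OpenAI released GPT-4o", "Leiden finds graph communities", "NVIDIA builds GPUs"]
chunk_vecs = np.array([[0.9, 0.1, 0.0], [0.0, 0.9, 0.4], [0.8, 0.0, 0.6]])
query_vec = np.array([1.0, 0.1, 0.1])  # pretend embedding of "Which labs ship models?"

for i in top_k_chunks(query_vec, chunk_vecs, k=2):
    print(chunks[i])
```

Everything downstream of this ranking, generation included, depends on the right chunks landing in the top k, which is exactly where the failure modes below come from.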
This approach works fairly well when vector search can produce relevant chunks of text. However, there are many scenarios when it cannot. Simple vector similarity search isn't sufficient when we need:
- Multi-hop question answering: RAG falls apart when the LLM needs information from multiple documents, or even just multiple chunks, to generate an answer.
  - This happens when answering a question requires traversing disparate pieces of information through their shared attributes to provide new synthesized insights.
  - For example, consider: "Describe what the CEOs of the top 3 AI companies like to eat for lunch."
  - The question needs breaking down into sub-questions: "Who are the CEOs of the top 3 AI companies?" and "What does each CEO like to eat for lunch?"
  - Simply chunking and embedding documents in a database, then using plain vector similarity search, won't yield relevant information to answer the question.
- Unknown answer scenarios: We're dealing with questions where we may not know the answer. Naively embedding the question doesn't ensure that the vector lands in the same space as the answer.
  - For example, in a long VC deal document, a single word, "pre-" or "post-" money, changes the entire deal structure. Simply embedding "What is the deal about?" won't land us near that information, unless someone has explicitly written what the deal is about in the dataset.
  - There are workarounds like the HyDE approach, which uses an LLM to create a "hypothetical" answer and then searches the embeddings for a match.
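HyDE can be sketched in a few lines; here `llm` and `embed` are placeholders for real model calls (the hashing "embedder" is purely illustrative), not any specific library API:

```python
import numpy as np

def llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    return "This deal is a pre-money financing at a $10M valuation."

def embed(text: str) -> np.ndarray:
    """Placeholder for a real embedding model: hashes words into a toy unit vector."""
    vec = np.zeros(8)
    for word in text.lower().split():
        vec[hash(word) % 8] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

def hyde_query_vector(question: str) -> np.ndarray:
    # Embed a hypothetical *answer* rather than the question itself,
    # so the query vector lands nearer to answer-shaped chunks.
    hypothetical = llm(f"Write a plausible passage answering: {question}")
    return embed(hypothetical)

vec = hyde_query_vector("What is the deal about?")
print(vec.shape)  # (8,)
```

The key design choice is that retrieval happens in "answer space" rather than "question space"; the hypothetical passage doesn't need to be factually right, just shaped like the real answer.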
- Global corpus questions: RAG fails on questions directed at an entire text corpus, such as "What are the main themes in the dataset?"
  - It struggles to holistically understand summarized semantic concepts over large data collections or singular large documents.
  - This is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task.
  - Questions like "What are the top 5 themes in the data?" perform terribly because baseline RAG relies on vector search of semantically similar text content. There's nothing in the query to direct it to the correct information.
  - GraphRAG performs much better because we can use semantic and thematic agglomerative approaches built on graph machine learning.
- Cosine similarity limitations: Analytical derivations show how cosine similarity can yield arbitrary and meaningless "similarities"1, with researchers cautioning against blindly using it.
These shortcomings with vector similarity for RAG have garnered significant attention recently, including on X. This is where GraphRAG comes in.
Combining Graphs with RAG
Graphs are everywhere; real-world objects are often defined by their connections to other things. A set of objects and their connections are naturally expressed as a graph. The information extraction pipeline - the process of extracting structured information as entities and relationships from unstructured text - was a key bottleneck for using knowledge graphs with LLMs. Knowledge graph construction has traditionally been complex and resource-intensive2, limiting adoption. GraphRAG simplifies this process.
The beauty is that we can process each document individually, and information from different records gets connected when the knowledge graph is constructed or enriched. We're pre-processing data before ingestion, instead of performing operations at query-time, like the contextual summarization techniques people usually use to get around this.
Having access to structured information allows LLM applications to perform various analytics workflows where aggregation, filtering, or sorting is required. For example:
- "Who are the top 5 companies in the AI space by Valuation?"
- "Which CEO has previously founded the most companies?"
Plain vector similarity search struggles with analytical questions since it searches through unstructured text data, making it difficult to sort or aggregate information.
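Once entities live in a structured store, questions like these become ordinary filter/sort operations rather than similarity searches. A toy illustration (company names are real, but the valuations and "Acme Robotics" are made up for the example):

```python
# Toy structured records, as might be extracted into a knowledge graph.
# Valuations are illustrative only.
companies = [
    {"name": "OpenAI", "sector": "AI", "valuation_usd_b": 80},
    {"name": "Anthropic", "sector": "AI", "valuation_usd_b": 18},
    {"name": "Acme Robotics", "sector": "Robotics", "valuation_usd_b": 5},
]

def top_by_valuation(records, sector, k):
    """Answer 'top-k companies in <sector> by valuation' with a filter + sort."""
    in_sector = [r for r in records if r["sector"] == sector]
    return [r["name"] for r in sorted(in_sector, key=lambda r: -r["valuation_usd_b"])[:k]]

print(top_by_valuation(companies, "AI", 2))  # ['OpenAI', 'Anthropic']
```

No embedding model can reliably do this sort over raw prose; the aggregation only becomes trivial because extraction already turned the text into records.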
As I mentioned in the topic modelling post, there are entire toolkits available for graph-represented data. Being able to express data that intuitively can be represented as a graph opens up many possibilities. Researchers have developed graph neural networks (GNNs) that operate on graph data for over a decade, with recent developments increasing their capabilities and expressive power. We're seeing practical applications in antibacterial discovery3, physics simulations4, fake news detection5, traffic prediction6, and recommendation systems7.
For this post's purposes, let's focus on retrieval. The information extraction pipeline can be performed using LLMs or custom text domain models. Then, instead of vector similarity as normally done in RAG, we can retrieve relevant information from a knowledge graph.
GraphRAG
The method I'll follow here is heavily inspired by Microsoft's GraphRAG approach8, which uses an LLM to build a graph-based text index in two stages:
- Derive an entity knowledge graph from the source documents
- Pre-generate community summaries for all groups of closely-related entities
This methodology focuses on graphs' inherent modularity and community detection algorithms' ability to partition graphs into modular communities of closely-related nodes. LLM-generated summaries of these community descriptions provide complete coverage of the underlying graph index and the input documents it represents. Query-focused summarization of an entire corpus becomes possible using a map-reduce approach: first using each community summary to answer the query independently and in parallel, then summarizing all relevant partial answers into a final global answer.
The Information Extraction Pipeline
- Source Documents → Text Chunks
  - Split input texts from source documents into text chunks for processing
  - Use LLM prompts to extract elements of a graph index from each chunk
  - Balance chunk size to optimize the number of LLM calls and recall of entity references
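The chunking step itself is simple; a minimal sliding-window splitter (sizes are in characters here for brevity, though token counts are the more common unit in practice):

```python
def chunk_text(text: str, size: int = 20, overlap: int = 5) -> list[str]:
    """Split text into overlapping windows; the overlap preserves entity
    mentions that would otherwise be cut in half at a chunk boundary."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("GraphRAG builds an entity knowledge graph from source text.",
                    size=30, overlap=10)
print(len(chunks))  # 3
```

Larger chunks mean fewer LLM extraction calls but lower recall of entity references, which is exactly the balance the step above describes.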
- Text Chunks → Element Instances
  - Identify and extract instances of graph nodes and edges from text chunks using multipart LLM prompts
  - Include entities' names, types, descriptions, and relationships in the extraction
  - Tailor prompts to specific domains with few-shot examples
  - Perform multiple rounds of "gleanings" to detect additional entities and ensure extraction quality
- Element Instances → Element Summaries
  - Use LLMs to create descriptions of entities, relationships, and claims, performing abstractive summarization
  - Convert instance-level summaries into single blocks of descriptive text for each graph element
  - Address potential inconsistencies in entity extraction to avoid duplicates and ensure connectedness in the graph
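The de-duplication idea can be sketched as grouping instance-level descriptions under a normalized entity key; real systems also need fuzzy matching for name variants like "OpenAI" vs "OpenAI Inc.", which this toy version ignores:

```python
from collections import defaultdict

def merge_entity_descriptions(instances):
    """Group instance-level descriptions under a normalized entity key,
    ready to be abstractively summarized into one block per entity."""
    merged = defaultdict(list)
    for name, description in instances:
        merged[name.strip().lower()].append(description)
    return {k: " ".join(v) for k, v in merged.items()}

instances = [
    ("OpenAI", "An AI research lab."),
    ("openai ", "Maker of ChatGPT."),
    ("NVIDIA", "A GPU company."),
]
print(merge_entity_descriptions(instances)["openai"])
# An AI research lab. Maker of ChatGPT.
```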
- Element Summaries → Graph Communities
  - Model the index as a homogeneous undirected weighted graph with entity nodes and relationship edges
  - Edge weights represent the normalized counts of detected relationship instances
  - Use community detection algorithms, like Leiden, to partition the graph into hierarchical communities
  - Create hierarchical community structures for efficient global summarization
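To make the community-detection step concrete, here is a small sketch using NetworkX's Louvain implementation as a readily available stand-in for Leiden (Leiden itself lives in the `leidenalg`/igraph ecosystem):

```python
import networkx as nx

# Two tight 4-cliques joined by a single bridge edge.
G = nx.Graph()
G.add_edges_from((a, b) for a in range(4) for b in range(a + 1, 4))
G.add_edges_from((a, b) for a in range(4, 8) for b in range(a + 1, 8))
G.add_edge(3, 4)  # the bridge between the two cliques

# Modularity-based detection should recover the two cliques as communities.
communities = nx.community.louvain_communities(G, seed=42)
print(sorted(sorted(c) for c in communities))
```

On a real GraphRAG index the nodes are entities, edge weights are normalized relationship counts, and the algorithm is run recursively to produce the community hierarchy.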
- Graph Communities → Community Summaries
  - Generate report-like summaries for each community in the Leiden hierarchy
  - Prioritize and add element summaries to the LLM context window until the token limit is reached
  - Priority ("overall prominence") follows decreasing order of combined source and target node degree
  - Summarize higher-level communities by ranking and substituting sub-community summaries if needed
  - These summaries help you understand the overall structure and meaning of the dataset, even without specific questions
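The prioritization rule amounts to: rank each relationship's summary by the combined degree of its endpoints, then greedily fill the token budget. A sketch with invented example data ("Acme" is hypothetical) and token counts crudely approximated by word counts:

```python
def build_context(graph_degrees, edge_summaries, budget_tokens):
    """Add edge summaries in decreasing order of combined endpoint degree
    ("overall prominence") until the token budget is exhausted."""
    ranked = sorted(
        edge_summaries,
        key=lambda e: -(graph_degrees[e["src"]] + graph_degrees[e["dst"]]),
    )
    context, used = [], 0
    for edge in ranked:
        cost = len(edge["summary"].split())  # crude stand-in for a tokenizer
        if used + cost > budget_tokens:
            break
        context.append(edge["summary"])
        used += cost
    return context

degrees = {"OpenAI": 5, "Microsoft": 4, "Acme": 1}
edges = [
    {"src": "OpenAI", "dst": "Microsoft", "summary": "Microsoft invests in OpenAI."},
    {"src": "OpenAI", "dst": "Acme", "summary": "Acme resells OpenAI APIs."},
]
print(build_context(degrees, edges, budget_tokens=5))
# ['Microsoft invests in OpenAI.']
```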
- Community Summaries → Community Answers → Global Answer
  - Randomly shuffle and divide community summaries into chunks
  - Generate intermediate answers for each chunk in parallel
  - Filter out unhelpful answers and sort the rest by helpfulness score
  - Iteratively add the most helpful intermediate answers into the context window to generate the final global answer
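The map-reduce query stage above can be sketched as follows; `answer_with_llm` and `score_helpfulness` are placeholders for real LLM calls (in the paper, the LLM self-rates each partial answer's helpfulness):

```python
def answer_with_llm(query, summary):
    """Placeholder: a real system prompts an LLM with the community summary."""
    return f"Based on '{summary}': partial answer to '{query}'"

def score_helpfulness(summary):
    """Placeholder: a real system has the LLM rate helpfulness 0-100."""
    return 100 if "AI" in summary else 0

def global_answer(query, community_summaries, top_n=2):
    # Map: answer the query against each community summary independently.
    partials = [(answer_with_llm(query, s), score_helpfulness(s))
                for s in community_summaries]
    # Filter out unhelpful answers and sort the rest by score.
    kept = sorted((p for p in partials if p[1] > 0), key=lambda p: -p[1])
    # Reduce: in a real system a final LLM call fuses these into one answer.
    return [answer for answer, _ in kept[:top_n]]

summaries = ["AI lab community", "Gardening community", "AI chip community"]
print(len(global_answer("What are the main themes?", summaries)))  # 2
```

Because the map step is embarrassingly parallel, the latency cost of touching every community summary is bounded by the reduce step rather than the corpus size.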
GraphRAG connects information across large volumes of data and uses these connections to answer questions that are difficult or impossible to answer using keyword and vector-based search mechanisms.
Exploring the Graph
I've indexed a subset of the dataset I scraped for Dcypher AI, which contains news updates related to the AI space. The graph we're able to generate looks like this:

We can also use a UMAP technique, similar to the approach I used in the topic modelling post, to visualize the graph in 2D space. This provides a great sense of the graph's structure, with semantically similar nodes grouped and clustered near each other.
Querying the graph with questions like "What are the top 5 companies in the AI space?" returns detailed, structured answers. This is a simple example, but the graph can handle much more complex queries.
#### 1. OpenAI
OpenAI is a leading artificial intelligence research and deployment company based in San Francisco. It is renowned for creating the AI-powered chatbot platform ChatGPT and is actively targeting the emerging market of text-to-video software. OpenAI has released advanced models like GPT-4O-2024-08-06 and is currently testing SearchGPT, positioning itself as a leader in AI innovation [Data: Reports (11, 4, 13)].
#### 2. Microsoft
Microsoft has made a significant investment in artificial intelligence through its multi-billion dollar partnership with OpenAI. This collaboration aims to enhance Microsoft's AI capabilities and maintain its competitive edge in the tech industry. The partnership underscores Microsoft's commitment to advancing AI technology [Data: Reports (17, 11)].
#### 3. NVIDIA
NVIDIA is well-known for its graphics processing units (GPUs) and contributions to artificial intelligence research. The company has developed NVIDIA Edify, a multimodal architecture for visual generative AI, which highlights its significant role in the AI and GPU markets [Data: Reports (17)].
#### 4. Google
Google continues to be a major player in the tech industry with its innovative advancements in AI technology. The company has developed a series of Gemma models, including Gemma 2, Gemma 2 2B, Gemma Scope, and ShieldGemma, showcasing its commitment to AI research and development [Data: Reports (18)].
#### 5. ByteDance
ByteDance, a prominent Chinese tech firm, has made strides in the AI space by expanding its text-to-video software, Jimeng AI, to the Apple App Store for Chinese users. This move signifies ByteDance's intent to capture a larger share of the emerging text-to-video market [Data: Reports (0)].
GraphRAG handles more complex queries, such as "Which CEO has been the most controversial in the AI space?" Here's an example response:
Both Sam Altman and Elon Musk have been highly controversial figures in the AI space, each playing significant roles in the ongoing legal and operational challenges facing OpenAI. Altman's leadership and the internal power conflicts within OpenAI have drawn considerable scrutiny, while Musk's legal actions against the organization have further fueled the controversy. These events highlight the complex and often contentious nature of leadership in the rapidly evolving field of artificial intelligence.
This represents a considerable improvement over baseline RAG outputs.
Use Cases
GraphRAG opens up a wide range of applications, especially in handling unstructured data. Its ability to draw connections between concepts and generate multilevel Leiden summaries makes it versatile across various fields:
- Graph Machine Learning: Provides the ability to analyze information using graph-based machine learning techniques
- Unstructured Data Analysis: Apply GraphRAG to podcast transcripts for deep insights from lengthy conversations, or make sense of narrative-driven datasets like laws, policies, and stories
- Multilevel Summaries: The multilevel Leiden summaries allow users to get different levels of data summaries, offering powerful tools for understanding and organizing information at various depths
- Fighting Disinformation: Structure knowledge from hundreds of interview transcripts involving experts discussing current events, making it easier to identify and counter false information
- Fraud Detection: Insurance companies can detect fraudulent activities by analyzing complex data relationships and patterns that traditional methods might miss
- Financial Analysis: Provide competitive intelligence on new technologies for financial statement trend analysis, helping businesses stay ahead
- Agentic Workflows: Create knowledge graphs with network-based recall to help agents remember the current state of important entities and their relationships
- LLM-Driven Ontology: An LLM-driven ontology with a QA triple store can facilitate CRUD operations for self-improvement, enabling systems to continuously evolve
- Decentralized Fact-Checking: A decentralized fact-checker powered by GraphRAG could transform how we verify information across various sources
- Codebase Analysis: Developers can create graphs of codebases, allowing LLMs to understand how different components interact with each other
- Research and Theory Formation: Research scientists can use GraphRAG to form new theories and narrow down their search space, accelerating scientific discoveries
GraphRAG's ability to understand and summarize complex relationships within vast amounts of data makes it valuable across multiple industries and applications.
Drawbacks and Improvements
The main drawback is that indexing the data is computationally heavy. Multiple prompts are needed to extract entities and relationships from the text, and with multiple gleanings this can easily get out of hand, making the process time-consuming and resource-intensive, especially for large datasets. Even once the graph is constructed, querying it is relatively expensive: roughly 10x more LLM calls and an order of magnitude more tokens of context on average. This seems to be the price of a much richer and more accurate answer.
There are alternatives I want to investigate, though, namely SciPhi/Triplex. This is an open-source model, a fine-tuned version of Phi3-3.8B, for converting unstructured text into "semantic triples" (subject, predicate, object), reducing the generation cost of knowledge graphs roughly tenfold. SciPhi is the company behind R2R and Triplex, and they offer prebuilt solutions for automatic knowledge graph construction during input file ingestion; these are the next candidates for exploration on this topic.
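To make the triple idea concrete, here is a tiny parser for a line-oriented triple format; the `(subject | predicate | object)` layout is an assumption for illustration, not Triplex's actual output schema:

```python
def parse_triples(raw: str):
    """Parse lines like '(subject | predicate | object)' into 3-tuples,
    skipping anything that doesn't have exactly three fields."""
    triples = []
    for line in raw.strip().splitlines():
        inner = line.strip().strip("()")
        parts = [p.strip() for p in inner.split("|")]
        if len(parts) == 3:
            triples.append(tuple(parts))
    return triples

raw = """
(OpenAI | developed | ChatGPT)
(Microsoft | invested_in | OpenAI)
"""
print(parse_triples(raw))
# [('OpenAI', 'developed', 'ChatGPT'), ('Microsoft', 'invested_in', 'OpenAI')]
```

Triples in this shape map directly onto graph edges, which is why a cheap triple extractor can stand in for the much more expensive multipart extraction prompts.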
Resources
- GraphRAG Github
- GraphRAG Blog Post
- GraphRAG Docs
- GraphRAG Paper
- Project GraphRAG
- GPT Researcher
- Plan-and-Solve
- RAG
- Knowledge Graphs are key to unlocking the power of AI
- How to Build Knowledge Graphs With LLMs (python tutorial)
- Exploring Large Language Models for Knowledge Graph Completion
- 4 Ways Unstructured Data Management Will Change in 2024
- Expert Reveals Key Data Management Trends for 2024 to Know
- Harnessing LLMs With Neo4j
- Fine-Tuning vs Retrieval-Augmented Generation
- Knowledge Graphs & LLMs: Multi-Hop Question Answering
- Knowledge Graphs & LLMs: Real-Time Graph Analytics
- Construct Knowledge Graphs From Unstructured Text
- Project NaLLM
- Constructing knowledge graphs from text using OpenAI functions
- LangChain Cypher Search: Tips & Tricks
- Extract knowledge from text: End-to-end information extraction pipeline with spaCy and Neo4j
- Text to Knowledge Graph Made Easy with Graph Maker
- How to Convert Any Text Into a Graph of Concepts
- Accelerating Scientific Discovery with Generative Knowledge Extraction, Graph-Based Representation, and Multimodal Intelligent Graph Reasoning
- GraphRAG: LLM-Derived Knowledge Graphs for RAG
- GraphRAG Ollama: 100% Local Setup, Keeping your Data Private
- Easy GraphRAG with Neo4j Visualisation Locally
- Sciphi Triplex
- R2R Knowledge Graphs
References
Footnotes
1. Harald Steck, Chaitanya Ekanadham, Nathan Kallus: “Is Cosine-Similarity of Embeddings Really About Similarity?”, 2024, ACM Web Conference 2024 (WWW 2024 Companion) ↩
2. Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Jonathan Larson: “From Local to Global: A Graph RAG Approach to Query-Focused Summarization”, 2024 ↩
3. Jonathan M. Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina M. Donghia, Craig R. MacNair, Shawn French, Lindsey A. Carfrae, Zohar Bloom-Ackermann, Victoria M. Tran, Anush Chiappino-Pepe, Ahmed H. Badran, Ian W. Andrews, Emma J. Chory, George M. Church, Eric D. Brown, Tommi S. Jaakkola, Regina Barzilay, James J. Collins: “A Deep Learning Approach to Antibiotic Discovery”, Cell, Volume 180, Issue 4, 2020, Pages 688-702.e13 ↩
4. Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, Peter W. Battaglia: “Learning to Simulate Complex Physics with Graph Networks”, 2020 ↩
5. Federico Monti, Fabrizio Frasca, Davide Eynard, Damon Mannion, Michael M. Bronstein: “Fake News Detection on Social Media using Geometric Deep Learning”, 2019 ↩
6. Oliver Lange, Luis Perez: “Traffic prediction with advanced Graph Neural Networks”, 2020 ↩
7. Chantat Eksombatchai, Pranav Jindal, Jerry Zitao Liu, Yuchen Liu, Rahul Sharma, Charles Sugnet, Mark Ulrich, Jure Leskovec: “Pixie: A System for Recommending 3+ Billion Items to 200+ Million Users in Real-Time”, 2017 ↩
8. Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Jonathan Larson: “From Local to Global: A Graph RAG Approach to Query-Focused Summarization”, 2024 ↩