What is Retrieval Augmented Generation: Empowering LLMs with External Knowledge

This article discusses Retrieval Augmented Generation (RAG) techniques used in generative AI applications and how they help reduce hallucinations and ensure that generated answers are grounded in the given context.

The recent surge in popularity of Large Language Models (LLMs) has captivated the world, with applications ranging from text generation to code completion. However, these models have their limitations, particularly when it comes to accessing external knowledge. This is where Retrieval Augmented Generation (RAG) steps in, a technique that enhances LLMs by providing them with the ability to retrieve and utilize relevant information from external sources. In this article, we will delve into the world of RAG, exploring its architecture, use cases, and some advanced retriever techniques, with a practical implementation using the langchain framework in Python.

Understanding Retrieval Augmented Generation

At its core, RAG is a method for improving the performance of LLMs by integrating external information retrieval into the generation process. It addresses a key limitation of LLMs – their reliance solely on their pre-trained knowledge base. By augmenting the generation process with relevant external information, RAG enables LLMs to provide more accurate and contextually appropriate responses.

The RAG architecture consists of two main components: the retriever and the generator. The retriever is responsible for searching through a vast knowledge base of information, typically stored as vector embeddings, and retrieving the most relevant data to answer a given query. On the other hand, the generator takes this retrieved information and synthesizes it into a natural language response.

The Retriever: Diving into Vector Search Techniques

The retriever plays a crucial role in the success of a RAG system. Its ability to efficiently and effectively search through vast amounts of data and identify relevant information is essential. There are several techniques used by retrievers to determine relevance, including keyword search, vector similarity, and hybrid methods.

Keyword Search

Keyword search is a classic approach where the retriever scans documents for exact words or phrases from the query. This method creates sparse vector representations of documents by counting word occurrences, giving higher weights to rarer words. Algorithms like TF-IDF and BM25 fall under this category and are known for their simplicity and efficiency. However, they may struggle with synonyms and semantic similarities.
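As a minimal sketch of keyword-based scoring, the snippet below uses the open-source rank_bm25 package with a toy corpus; both the package choice and the sample sentences are illustrative and not part of this article's dataset:

# BM25 keyword scoring sketch (assumes `pip install rank-bm25`)
from rank_bm25 import BM25Okapi

corpus = [
    "John Wick is an action film praised for its choreography",
    "The sequel expands the assassin underworld",
    "Critics highlighted the stylish gun-fu sequences",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query = "action choreography".lower().split()
scores = bm25.get_scores(query)  # one relevance score per document, higher is more relevant
print(scores)

Note how the query only matches documents containing its exact tokens; a review phrased as "great fight scenes" would score zero here, which is exactly the limitation dense embeddings address.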

Vector Similarity with Dense Vector Embeddings

Dense vector embeddings, on the other hand, offer a more semantic approach. Large language models like BERT are used to encode both the query and passages into dense vector embeddings, capturing the semantic meaning. Vector databases like Qdrant store these embeddings, enabling retrievers to match based on semantic understanding rather than just keywords. This allows for more flexible and contextually aware retrieval.
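For instance, here is a rough sketch of dense semantic matching: it embeds a query and a passage with OpenAI embeddings and compares them with cosine similarity. The model name and the sample sentences are placeholders, not prescriptions:

# Dense embedding similarity sketch (requires an OpenAI API key)
import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

query_vec = np.array(embeddings.embed_query("Was the movie well received?"))
doc_vec = np.array(embeddings.embed_documents(["Audiences loved the film's action scenes."])[0])

# Cosine similarity: closer to 1 means the texts are semantically closer
similarity = np.dot(query_vec, doc_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))
print(similarity)

Although the query and the passage share almost no keywords, their embeddings land close together in vector space, which is what lets a dense retriever find them.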

Hybrid Search: Combining Strengths

Hybrid search methods aim to combine the benefits of both keyword and vector search techniques. Some common approaches include using keyword search for initial candidate retrieval, followed by semantic re-ranking, or starting with semantic vectors for topical relevance and then filtering based on keywords. These hybrid approaches provide more comprehensive and accurate results by leveraging the strengths of different search strategies.
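As one possible sketch, langchain's EnsembleRetriever blends a sparse BM25 retriever with a dense vector retriever using weighted rank fusion. The snippet assumes the documents list and the Chroma store db that are built in the implementation section later in this article, and the weights are illustrative:

# Hybrid retrieval sketch: BM25 (sparse) + vector store (dense)
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents)     # keyword-based candidates
vector_retriever = db.as_retriever(search_kwargs={"k": 10})  # semantic candidates

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6],  # relative influence of each retriever on the final ranking
)
results = hybrid_retriever.invoke("Did people generally like John Wick?")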

The Generator: Synthesizing Information into Responses

The generator component of a RAG system is typically an LLM, such as GPT, BART, or T5. It takes the query and the relevant documents or passages retrieved by the retriever as input and generates a final response. This response is synthesized from the retrieved information, ensuring that it is both factually accurate and contextually relevant.

Real-World Applications of RAG

RAG has found its way into various applications, particularly in areas where factual accuracy and knowledge depth are crucial. Here are some notable use cases:

Question Answering

RAG models power advanced question-answering systems. By retrieving relevant information from large knowledge bases, they can generate fluent and informed answers to a wide range of queries.

Language Generation

RAG enhances text generation tasks by providing contextualized information. This is particularly useful for tasks like text summarization, where multiple sources need to be synthesized into a coherent summary.

Data-to-Text Generation

By integrating with structured data sources, RAG models can generate reports, describe insights from data visualizations, and provide contextually relevant information.

Multimedia Understanding

RAG is not limited to text; it can also retrieve and understand multimodal information. For example, answering questions about images or videos by retrieving relevant textual context enhances the model's ability to interpret and generate responses for multimedia content.

Advanced Retriever Techniques: Enhancing RAG's Performance

While the basic RAG architecture provides significant improvements, there are advanced retriever techniques that can further enhance its performance. These techniques focus on improving the quality and relevance of the retrieved information, leading to more accurate and contextually rich responses. Here are some of these advanced methods:

Parent Document Retriever

The Parent Document Retriever technique addresses the balance between precision and context. By splitting large chunks of text (parent chunks) into smaller pieces (child chunks), the retriever can provide more concentrated and informative content. The parent chunks are stored in memory, while the child chunks are saved in the vector store. During retrieval, the model first searches for relevant child chunks and then returns the corresponding parent chunks, ensuring both precision and context.
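A possible sketch with langchain's ParentDocumentRetriever is shown below. The chunk sizes are illustrative assumptions, and documents refers to the movie-review documents loaded in the implementation section:

# Parent Document Retriever sketch: search small child chunks, return their parent chunks
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)  # larger, context-rich chunks
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)    # smaller, precise chunks

vectorstore = Chroma(collection_name="parent_demo", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()  # holds the parent chunks in memory

parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
parent_retriever.add_documents(documents)  # indexes child chunks, stores parents
relevant_parents = parent_retriever.invoke("Did people generally like John Wick?")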

Self-Query Retriever

The Self-Query Retriever excels in cases where metadata filtering is crucial. It uses an LLM to construct a query and apply filters based on metadata attributes. The LLM takes into account the available metadata and the user's query to create a new query and filter conditions. This allows for a more targeted search, reducing computational costs and improving the relevance of retrieved documents.
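Below is a hedged sketch using langchain's SelfQueryRetriever with the movie-review metadata from this article's example. The attribute descriptions are assumptions, db is the Chroma store built later, and the underlying query parser typically requires the lark package:

# Self-Query Retriever sketch: the LLM writes both the query and a metadata filter
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import ChatOpenAI

metadata_field_info = [
    AttributeInfo(name="Movie_Title", description="Title of the movie the review is about", type="string"),
    AttributeInfo(name="Rating", description="Reviewer rating of the movie", type="integer"),
]

self_query_retriever = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(temperature=0),
    vectorstore=db,  # the Chroma store from the implementation section
    document_contents="Movie reviews for the John Wick films",
    metadata_field_info=metadata_field_info,
)
# The LLM turns this into a semantic query plus a metadata filter on Rating
docs = self_query_retriever.invoke("Highly rated reviews of John Wick 3")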

Contextual Compression Retriever (Reranking)

This technique combines the benefits of two methods: Bi-Encoders and Cross-Encoders. Bi-Encoders are used to retrieve the top K most relevant documents based on vector similarity. Then, the Cross-Encoder, a more reliable but computationally expensive model, recalculates the similarity between the query and these top K documents, reordering them based on their actual relevance. This two-step process improves the accuracy of the retriever while managing computational costs.
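A possible implementation sketch uses langchain's ContextualCompressionRetriever with a cross-encoder reranker. The cross-encoder model name is an assumption (any sentence-transformers cross-encoder works), and db is the Chroma store from the implementation section:

# Reranking sketch: bi-encoder retrieval followed by cross-encoder reordering
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

base_retriever = db.as_retriever(search_kwargs={"k": 10})  # fast bi-encoder stage

cross_encoder = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
reranker = CrossEncoderReranker(model=cross_encoder, top_n=3)  # keep the 3 best after reranking

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)
reranked_docs = compression_retriever.invoke("Did people generally like John Wick?")

The expensive cross-encoder only scores the 10 candidates surfaced by the cheap bi-encoder, which is what keeps the overall cost manageable.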

Implementing a Basic RAG System with Langchain

Now, let's put theory into practice and implement a basic RAG system using the langchain framework in Python. For this example, we will create a RAG chatbot that can answer questions about movie reviews. We will use the langchain library and some sample movie review data.

First, we need to import the necessary libraries and load the movie review data:
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Load movie review data from CSV files
documents = []
for i in range(1, 4):
	loader = CSVLoader(encoding="utf8", file_path=f"data/john_wick_{i}.csv")
	movie_docs = loader.load()
	for doc in movie_docs:
		doc.metadata["Movie_Title"] = f"John Wick {i}"
		# Convert the rating to an integer where present; otherwise leave it as None
		doc.metadata["Rating"] = int(doc.metadata["Rating"]) if doc.metadata["Rating"] else None
	documents.extend(movie_docs)
Next, we create a vector store, a database that stores document embeddings and supports similarity search. We'll use the Chroma vector store from langchain:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = Chroma.from_documents(documents=documents, embedding=embeddings, collection_name="doc_johnWick", persist_directory="./johnWick_db")
Now, let's create a simple retriever that retrieves the top 10 most similar documents to a given query:
naive_retriever = db.as_retriever(search_kwargs={"k": 10})
To complete our RAG system, we need a generator. We'll use an LLM from OpenAI for this example:
from langchain_openai import ChatOpenAI
# Create the generator LLM
chat_model = ChatOpenAI()
Finally, we assemble the RAG system using langchain's LCEL (LangChain Expression Language):
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser
# Create the augmented prompt template
TEMPLATE = """
You are a helpful assistant. Use the context provided below to answer the question.
If you don't know the answer, simply say so.
Query: {question}
Context: {context}
"""
rag_prompt = ChatPromptTemplate.from_template(TEMPLATE)
# Assemble the RAG system
setup_and_retrieval = RunnableParallel({"question": RunnablePassthrough(), "context": naive_retriever})
output_parser = StrOutputParser()
naive_retrieval_chain = setup_and_retrieval | rag_prompt | chat_model | output_parser
Now, we can use our RAG system to answer a question:
response = naive_retrieval_chain.invoke("Did people generally like John Wick?")
print(response)

The output should be something like: "Yes, people generally liked John Wick."

Conclusion

Retrieval Augmented Generation is a powerful technique that enhances the capabilities of LLMs by providing them with external knowledge. By integrating information retrieval into the generation process, RAG enables more accurate, relevant, and contextually rich responses. In this article, we explored the RAG architecture, use cases, and advanced retriever techniques. We also implemented a basic RAG system using langchain, showcasing how RAG can be applied in practice. As RAG continues to evolve, we can expect even more sophisticated applications and improvements in language model performance.
