Introduction: Beyond Simple Search
Welcome back, fellow RAG enthusiasts! In our previous chapters, we laid the groundwork for Retrieval-Augmented Generation, exploring how to get relevant information to Large Language Models (LLMs) to improve their outputs. We’ve seen how crucial effective retrieval is, but what happens when a user’s question isn’t straightforward? What if the query is ambiguous, uses different terminology than your knowledge base, or requires piecing together information from multiple, distinct sources?
This is where the magic of RAG 2.0 truly shines. Simple vector or keyword search, while powerful, often struggles with these complex scenarios. Imagine asking a question like, “What is the primary product of the company founded by the creator of Python, and what year was that product first released?” A basic RAG system would likely struggle to answer this without explicit connections. In this chapter, we’re going to elevate our RAG game by empowering LLMs to actively participate in the retrieval process itself, not just the generation. We’ll learn how LLMs can transform user queries to find better matches and even orchestrate multi-step searches to answer truly complex questions.
By the end of this chapter, you’ll understand the core concepts behind LLM-driven query rewriting and multi-hop retrieval. You’ll gain practical insights into how these techniques address the limitations of basic RAG, leading to more robust, accurate, and intelligent conversational AI systems. Get ready to make your RAG systems think smarter about what to search for and how to search for it!
Intelligent Querying: Core Concepts
Basic RAG systems often perform a single search operation based on the user’s raw query. This works well for many questions, but it hits a wall when the query is:
- Lexically Mismatched: The user uses different words than those in the documents.
- Ambiguous: The query has multiple interpretations.
- Complex or Multi-faceted: It requires information from several distinct steps or sources.
- Reliant on Global Understanding: Answering it necessitates connecting distant facts or reasoning across documents.
RAG 2.0 introduces techniques that empower LLMs to improve the query itself before retrieval, or even manage a sequence of retrieval steps.
Query Rewriting and Transformation
What is it? Query rewriting, also known as query transformation, is the process of using an LLM to modify, expand, or rephrase a user’s original query before it’s sent to the retrieval system. The goal is to create one or more new queries that are more likely to retrieve relevant information.
Why is it important? Think of it like having a super-smart librarian who, when you ask a vague question, helps you rephrase it in several ways or suggests related terms to get better results. This technique directly tackles lexical mismatch and query ambiguity, which are common hurdles in information retrieval. Even with sophisticated embedding models, a query like “Python founder’s company” might not perfectly match documents that only mention “Guido van Rossum” or “Dropbox.”
How LLMs enable it: LLMs are excellent at understanding context, synonyms, and even implied meanings. We can prompt an LLM to:
Reformulate: Rephrase the original query into several semantically equivalent but lexically different versions.
- Original: “Best way to learn Python”
- Rewritten: “Python learning resources,” “How to start programming in Python,” “Python tutorials for beginners”
Expand: Add relevant keywords, synonyms, or related concepts to the original query.
- Original: “Machine learning frameworks”
- Expanded: “Machine learning frameworks (TensorFlow, PyTorch, Scikit-learn, deep learning libraries)”
Decompose: Break down a complex, multi-part question into several simpler sub-queries. This is especially useful for questions that implicitly require multiple steps.
- Original: “What are the common side effects of ibuprofen and how does it compare to acetaminophen?”
- Decomposed:
- “Common side effects of ibuprofen”
- “Ibuprofen vs. acetaminophen side effects”
- “Mechanism of action of ibuprofen”
- “Mechanism of action of acetaminophen”
By generating multiple query variations, we increase the chances of hitting relevant documents in our knowledge base. The results from these multiple queries can then be combined using techniques like Reciprocal Rank Fusion (RRF) (which we touched upon in a previous chapter) to get a consolidated, highly relevant set of documents.
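To make that concrete, here is a minimal, self-contained RRF sketch (the document IDs and the `k=60` constant are purely illustrative): each document’s fused score is the sum of 1/(k + rank) over every result list in which it appears.
# rrf_sketch.py -- toy Reciprocal Rank Fusion over several ranked result lists
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs; a larger k flattens the influence of top ranks."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical results for the original query plus two rewrites
results_per_query = [
    ["doc_guido", "doc_python_release"],
    ["doc_dropbox", "doc_guido"],
    ["doc_guido", "doc_dropbox", "doc_amsterdam"],
]
print(reciprocal_rank_fusion(results_per_query))
# doc_guido appears near the top of all three lists, so it wins the fused ranking.
In the implementation later in this chapter we simply deduplicate results for brevity, but swapping in a fusion step like this is a natural upgrade.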
Multi-Hop Retrieval
What is it? Multi-hop retrieval is an advanced technique designed to answer questions that require reasoning across multiple pieces of information, potentially from different documents or parts of a knowledge graph, where each piece is a “hop” in the reasoning chain. It’s about connecting the dots.
Why is it important? Basic RAG struggles with questions that demand synthesis across sources. If you ask, “Where was the inventor of Python born, and what is the capital of that country?”, a single search might find documents about Guido van Rossum’s birthplace (the Netherlands), but it won’t automatically know to then search for the capital of the Netherlands (Amsterdam). Multi-hop retrieval explicitly addresses this limitation by enabling an LLM to act as an orchestrator, planning and executing a sequence of retrieval steps.
How LLMs enable it: LLMs are central to multi-hop retrieval because they can:
- Analyze the Query: Understand the underlying reasoning steps required.
- Formulate Sub-queries: Break the main query into a logical sequence of smaller, answerable questions.
- Process Intermediate Results: Use the answer from one sub-query to inform the next sub-query.
- Synthesize Final Answer: Combine all the retrieved information into a coherent final response.
The flow looks like this: the user’s complex query goes to the LLM, which plans a sequence of sub-questions; the system retrieves documents for the first sub-question; the LLM extracts an intermediate answer; that answer shapes the next sub-question and retrieval; and once every hop is complete, the LLM synthesizes the final response.
In this flow the LLM acts as an agent, dynamically planning and executing retrieval operations based on the information gathered in previous steps. This iterative process allows RAG systems to tackle much more complex, globally aware questions.
Step-by-Step Implementation: LLM-Driven Query Transformation
Let’s get practical! We’ll set up a simple environment in Python (version 3.11+) to demonstrate query rewriting. We’ll use langchain for orchestration and an LLM (e.g., OpenAI’s gpt-4o or a local model via ollama).
1. Setup Your Environment
First, ensure you have Python and pip installed.
# Verify Python version (should be 3.11 or newer)
python --version
# Create a virtual environment (good practice!)
python -m venv rag_env
source rag_env/bin/activate # On Windows, use `rag_env\Scripts\activate`
# Install necessary libraries
pip install openai==1.30.1 langchain==0.2.0 langchain-openai langchain-community langchain-text-splitters chromadb==0.4.24 tiktoken==0.7.0
Note: The code below imports from the langchain-openai, langchain-community, and langchain-text-splitters packages, which are installed separately from langchain itself in the 0.2.x line. The pinned versions shown here work together, but newer releases should also be fine; check PyPI if you hit compatibility issues.
Next, set up your OpenAI API key as an environment variable. If you’re using a local LLM via ollama or llama.cpp, you might not need this, but for this example, we’ll assume OpenAI.
export OPENAI_API_KEY="your_openai_api_key_here"
Replace "your_openai_api_key_here" with your actual API key.
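If you would rather fail fast than hit a cryptic authentication error later, a tiny optional check at the top of your scripts works well (this assumes the key is supplied via the environment variable above):
import os

# Abort early with a clear message if the OpenAI key is missing.
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set; export it before running the examples.")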
2. Prepare a Simple Knowledge Base
We’ll create a few dummy documents to simulate our knowledge base and store them in a ChromaDB vector store.
# knowledge_base.py
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

def create_vector_store():
    # Dummy documents
    documents_content = [
        "Guido van Rossum is the creator of the Python programming language.",
        "Guido van Rossum was born in Haarlem, a city in the Netherlands.",
        "Python was first released in 1991.",
        "Guido van Rossum worked at Dropbox from 2013 to 2019.",
        "Dropbox is a cloud storage and file synchronization service.",
        "The capital city of the Netherlands is Amsterdam.",
        "The Netherlands is a country in Western Europe, known for its flat landscape, canals, and windmills.",
        "Amsterdam is famous for its artistic heritage, elaborate canal system, and narrow houses with gabled facades."
    ]

    # Wrap the raw strings in Document objects (no temporary files or loaders needed)
    docs = [Document(page_content=content) for content in documents_content]

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    splits = text_splitter.split_documents(docs)

    # Initialize OpenAI embeddings
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # A modern, efficient embedding model

    # Create and return a Chroma vector store.
    # We'll keep it in memory for this example, but you could persist it.
    vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
    print("Vector store created with", len(splits), "documents.")
    return vectorstore

if __name__ == "__main__":
    # Example usage:
    db = create_vector_store()
    retriever = db.as_retriever()
    query = "Who invented Python?"
    results = retriever.invoke(query)
    print(f"\nBasic retrieval for '{query}':")
    for doc in results:
        print(f"- {doc.page_content[:50]}...")
Explanation:
- We define a list of simple strings representing our knowledge base. In a real application, these would come from files, databases, or APIs.
- `RecursiveCharacterTextSplitter` breaks these into smaller, manageable chunks.
- `OpenAIEmbeddings` uses OpenAI’s `text-embedding-3-small` model to convert these text chunks into numerical vectors. This model is a current best practice for general-purpose embeddings.
- `Chroma.from_documents` creates an in-memory vector database where these embeddings are stored, making them searchable.
- The `if __name__ == "__main__":` block shows how to create the store and perform a basic retrieval, which serves as a baseline.
Run this script once to create your vector store:
python knowledge_base.py
3. Implement Query Rewriting
Now, let’s add query rewriting using an LLM.
# query_rewriter.py
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from knowledge_base import create_vector_store # Import our knowledge base function
def setup_query_rewriter():
    # Initialize the LLM for rewriting; gpt-4o is a strong instruction-following choice
    llm = ChatOpenAI(model="gpt-4o", temperature=0.7)

    # Define a prompt for query rewriting.
    # We ask the LLM to generate 3 alternative queries.
    rewrite_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", "You are an expert at rephrasing user questions to improve search results. Generate 3 alternative, diverse queries that capture the user's intent, separating them with newlines. Focus on different keywords, synonyms, and phrasings. Do NOT answer the question directly."),
            ("user", "{original_query}"),
        ]
    )

    # Create a chain for query rewriting: Prompt -> LLM -> Output Parser
    query_rewriter_chain = rewrite_prompt | llm | StrOutputParser()
    return query_rewriter_chain

def perform_enhanced_retrieval(original_query, retriever, query_rewriter_chain):
    print(f"\nOriginal Query: '{original_query}'")

    # 1. Generate rewritten queries
    print("Generating alternative queries...")
    rewritten_queries_str = query_rewriter_chain.invoke({"original_query": original_query})
    rewritten_queries = [q.strip() for q in rewritten_queries_str.split('\n') if q.strip()]
    print(f"Rewritten Queries: {rewritten_queries}")

    # Combine original and rewritten queries for comprehensive search
    all_queries = [original_query] + rewritten_queries

    # 2. Perform retrieval for each query
    all_results = []
    for query in all_queries:
        results = retriever.invoke(query)
        all_results.extend(results)

    # For simplicity, we'll just show all unique results.
    # In a real system, you'd apply RRF or similar rank fusion.
    unique_results = []
    seen_content = set()
    for doc in all_results:
        if doc.page_content not in seen_content:
            unique_results.append(doc)
            seen_content.add(doc.page_content)

    print(f"\nEnhanced Retrieval Results ({len(unique_results)} unique documents):")
    for doc in unique_results:
        print(f"- {doc.page_content}")

if __name__ == "__main__":
    # Create the vector store and retriever
    db = create_vector_store()
    retriever = db.as_retriever()

    # Setup the query rewriter chain
    query_rewriter = setup_query_rewriter()

    # Test with a challenging query
    challenging_query = "What company did the person who created Python work for after 2010?"
    perform_enhanced_retrieval(challenging_query, retriever, query_rewriter)

    # Test with a simpler query to see how it performs
    simple_query = "Who made Python?"
    perform_enhanced_retrieval(simple_query, retriever, query_rewriter)
Explanation:
`setup_query_rewriter()`:
- We initialize `ChatOpenAI` using the `gpt-4o` model, which is excellent for instruction following.
- A `ChatPromptTemplate` is defined. The `system` message instructs the LLM to generate alternative queries, emphasizing diversity and not answering the question. This is crucial for keeping the LLM focused on query transformation.
- The `query_rewriter_chain` connects the prompt, LLM, and `StrOutputParser` to get a clean string output.
`perform_enhanced_retrieval()`:
- It takes the `original_query`, our `retriever`, and the `query_rewriter_chain`.
- It invokes the `query_rewriter_chain` with the `original_query` to get several rewritten queries.
- It combines the original query with the rewritten ones.
- It then performs retrieval for each of these queries using `retriever.invoke()`.
- Finally, it aggregates all the results, removing duplicates, and prints them. In a production system, you would use a rank fusion algorithm (like RRF) to intelligently combine and re-rank these results.
Run this script:
python query_rewriter.py
Observe how the LLM generates alternative queries, potentially leading to more comprehensive retrieval, especially for the more complex query. For example, “What company did the person who created Python work for after 2010?” might be rewritten to include “Guido van Rossum” and “Dropbox”, improving the search.
Multi-Hop Retrieval (Conceptual Implementation)
Implementing a full multi-hop retrieval system requires more sophisticated agentic loops and potentially a dedicated knowledge graph. For this chapter, we’ll focus on the conceptual flow using an LLM to orchestrate the steps, demonstrating how an LLM can break down a query and chain retrieval actions.
# multi_hop_agent.py
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from knowledge_base import create_vector_store # Import our knowledge base function
def setup_multi_hop_llm():
    # LLM for planning and extraction
    return ChatOpenAI(model="gpt-4o", temperature=0.5)

def multi_hop_retrieval_agent(complex_query, retriever, llm_agent):
    print(f"\n--- Initiating Multi-Hop Retrieval for: '{complex_query}' ---")

    # 1. Initial LLM plan: Decompose the query into sub-questions
    decomposition_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", "You are a helpful assistant that breaks down complex questions into a sequence of simpler, answerable sub-questions. List each sub-question on a new line. Do NOT answer the original question."),
            ("user", "Break down the following question: '{original_question}'"),
        ]
    )
    decomposition_chain = decomposition_prompt | llm_agent | StrOutputParser()

    print("\nStep 1: Decomposing the complex query...")
    sub_questions_str = decomposition_chain.invoke({"original_question": complex_query})
    sub_questions = [q.strip() for q in sub_questions_str.split('\n') if q.strip()]
    print(f"Decomposed into: {sub_questions}")

    intermediate_context = []
    for i, sq in enumerate(sub_questions):
        print(f"\nStep {i+2}: Answering sub-question: '{sq}'")

        # 2. Retrieve for the sub-question
        retrieved_docs = retriever.invoke(sq)
        if not retrieved_docs:
            print(f"  No documents found for '{sq}'. Skipping.")
            continue
        retrieved_content = "\n".join([doc.page_content for doc in retrieved_docs])
        print(f"  Retrieved content for '{sq}':\n---\n{retrieved_content[:200]}...\n---")

        # 3. LLM extracts answer from retrieved content for the sub-question
        extraction_prompt = ChatPromptTemplate.from_messages(
            [
                ("system", "Given the following context, extract the answer to the question. If the answer is not in the context, state 'Not found'.\n\nContext:\n{context}"),
                ("user", "Question: {sub_question}"),
            ]
        )
        extraction_chain = extraction_prompt | llm_agent | StrOutputParser()
        extracted_answer = extraction_chain.invoke({
            "context": retrieved_content,
            "sub_question": sq
        })
        print(f"  Extracted Answer: {extracted_answer}")
        intermediate_context.append(f"Q: {sq}\nA: {extracted_answer}")

    # 4. Final LLM synthesis
    if not intermediate_context:
        return "Could not retrieve enough information to answer the complex query."

    final_synthesis_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", "You are a helpful assistant. Given the following intermediate answers to sub-questions, synthesize a concise and comprehensive final answer to the original complex question."),
            ("user", "Original Complex Question: '{original_question}'\n\nIntermediate Answers:\n{intermediate_answers}\n\nFinal Answer:"),
        ]
    )
    final_synthesis_chain = final_synthesis_prompt | llm_agent | StrOutputParser()

    print("\nStep Final: Synthesizing final answer...")
    final_answer = final_synthesis_chain.invoke({
        "original_question": complex_query,
        "intermediate_answers": "\n".join(intermediate_context)
    })
    return final_answer

if __name__ == "__main__":
    db = create_vector_store()
    retriever = db.as_retriever()
    llm_agent = setup_multi_hop_llm()

    complex_query = "Where was the inventor of Python born, and what is the capital of that country?"
    final_response = multi_hop_retrieval_agent(complex_query, retriever, llm_agent)
    print(f"\n--- FINAL MULTI-HOP RESPONSE ---\n{final_response}")

    print("\n--- Testing another complex query ---")
    another_complex_query = "What is the primary function of the company where Python's creator worked after 2010, and what is that company's name?"
    final_response_2 = multi_hop_retrieval_agent(another_complex_query, retriever, llm_agent)
    print(f"\n--- FINAL MULTI-HOP RESPONSE ---\n{final_response_2}")
Explanation:
- `setup_multi_hop_llm()`: Initializes an LLM (again, `gpt-4o`) to act as our agent.
- `multi_hop_retrieval_agent()`:
  - Query Decomposition: The first step uses an LLM to break the `complex_query` into a list of `sub_questions`. This is the “planning” phase.
  - Iterative Retrieval and Extraction: For each `sub_question`, it performs retrieval using our `retriever` to find relevant documents, then uses the LLM again (with an `extraction_prompt`) to read the retrieved documents and extract a concise answer to that specific sub-question. These extracted answers are stored in `intermediate_context`.
  - Final Synthesis: After all sub-questions are addressed, the LLM is given the `original_question` and all intermediate answers. It then synthesizes a single, coherent `final_answer`.
This conceptual code demonstrates the power of LLMs in orchestrating complex information gathering. It’s a simplified agent: the sub-questions are planned up front rather than being reformulated after each hop, but it still captures the “multi-hop” nature, where information surfaced in one step (e.g., that Guido van Rossum was born in the Netherlands) is exactly what the next step reasons over (e.g., “What is the capital of the Netherlands?”). A fully agentic version would feed each extracted answer back into the planner so it can rewrite the remaining sub-questions on the fly.
Mini-Challenge: Enhance Query Expansion
You’ve seen how a simple query rewriter works. Now, let’s take it a step further.
Challenge: Modify the setup_query_rewriter function in query_rewriter.py to not only rephrase the query but also expand it by suggesting relevant entities or keywords if the LLM detects them. For instance, if the query is “Python’s founder,” the LLM might suggest adding “Guido van Rossum” to the expanded queries.
Hint: Adjust the system message in your rewrite_prompt. You might want to instruct the LLM to also identify key entities or concepts from the original query and include them in the alternative queries or as separate search terms. You could even ask it to output a JSON object containing the original query, rewritten queries, and identified entities.
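One possible starting point, sketched below, asks the model for JSON and parses it with `JsonOutputParser` from `langchain_core`; the prompt wording and field names are only suggestions, and this is deliberately not a complete solution to the challenge.
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
entity_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Rewrite the user's search query. Respond ONLY with a JSON object containing "
     "'original_query', 'rewritten_queries' (3 diverse rephrasings) and "
     "'entities' (key entities or concepts you detect). Do NOT answer the question."),
    ("user", "{original_query}"),
])
# The parser returns a plain Python dict you can feed into your retrieval loop.
entity_aware_rewriter = entity_prompt | llm | JsonOutputParser()
print(entity_aware_rewriter.invoke({"original_query": "Python's founder"}))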
What to observe/learn:
- How subtle changes in the LLM prompt can significantly alter its output and the effectiveness of your retrieval.
- The trade-offs between generating many diverse queries versus highly specific, entity-rich queries.
- The potential for LLMs to go beyond simple rephrasing to enrich the query with semantic information.
Common Pitfalls & Troubleshooting
As powerful as LLM-driven querying is, it’s not without its challenges.
Over-Rewriting or Hallucination in Query Transformation:
- Pitfall: The LLM might rephrase the query so much that it loses the original intent, or it might introduce incorrect information into the rewritten queries. This can lead to retrieving irrelevant documents or no documents at all.
- Troubleshooting:
- Prompt Engineering: Refine your LLM prompt. Be explicit about staying true to the original intent, avoiding speculation, and only rephrasing or expanding based on the user’s input. Use phrases like “Do NOT add new information.”
- Temperature Tuning: Lower the `temperature` parameter of your LLM to make its outputs more deterministic and less creative (a constrained configuration is sketched after this list).
- Human-in-the-Loop: For critical applications, consider having a human review rewritten queries, or at least a small, representative sample.
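As an example, here is one way the rewriter from `query_rewriter.py` could be tightened up; the prompt wording is only a suggestion:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# temperature=0 keeps the rewrites deterministic; the prompt explicitly forbids new facts.
conservative_llm = ChatOpenAI(model="gpt-4o", temperature=0)
strict_rewrite_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Rephrase the user's query into 3 alternative search queries. "
     "Preserve the original intent exactly. Do NOT add new information, "
     "do NOT speculate, and do NOT answer the question. One query per line."),
    ("user", "{original_query}"),
])
conservative_rewriter = strict_rewrite_prompt | conservative_llm | StrOutputParser()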
Context Window Limits in Multi-Hop Retrieval:
- Pitfall: As multi-hop retrieval progresses, the `intermediate_context` can grow very large. Passing all this context back to the LLM for subsequent steps or final synthesis can hit the LLM’s context window limits, leading to truncation or poor performance.
- Troubleshooting:
- Summarization: After each hop, use the LLM to summarize the extracted answer and the relevant context, passing only the summary to the next step.
- Selective Context: Only pass the most critical pieces of information from previous hops to the current LLM prompt.
- Context Compression: Employ techniques like contextual compression (available in `langchain`) to dynamically filter and compress retrieved documents before they are passed to the LLM (see the sketch after this list).
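As a sketch of that last point, `langchain` ships a contextual compression retriever that can wrap the base retriever we built earlier; assuming the `create_vector_store()` helper from `knowledge_base.py`, it might look like this:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

from knowledge_base import create_vector_store

retriever = create_vector_store().as_retriever()
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# The extractor asks the LLM to keep only the passages relevant to each query,
# so downstream prompts stay small.
compressor = LLMChainExtractor.from_llm(llm)
compressed_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)
docs = compressed_retriever.invoke("What company did Guido van Rossum work for after 2010?")
print([d.page_content for d in docs])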
Performance Overhead:
- Pitfall: Each LLM call (for rewriting, decomposition, extraction, synthesis) adds latency. Multi-hop retrieval, in particular, can involve several sequential LLM calls, making the overall response time significantly slower than basic RAG.
- Troubleshooting:
- Optimize LLM Calls: Use smaller, faster LLMs for simpler tasks (e.g., query rewriting) if possible.
- Parallelization: If your query decomposition yields independent sub-queries, retrieve and process them in parallel (see the sketch after this list).
- Caching: Cache common query transformations or intermediate results.
- Batching: If processing multiple user queries, batch LLM calls where feasible.
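To illustrate the parallelization point: because `langchain` retrievers implement the Runnable interface, independent sub-queries can be issued in one batched call instead of a sequential loop. A minimal sketch, again reusing `knowledge_base.py`:
from knowledge_base import create_vector_store

retriever = create_vector_store().as_retriever()

sub_queries = [
    "Who created the Python programming language?",
    "What is the capital of the Netherlands?",
]
# .batch() fans the retrievals out concurrently; max_concurrency caps the parallelism.
results_per_query = retriever.batch(sub_queries, config={"max_concurrency": 4})
for query, docs in zip(sub_queries, results_per_query):
    print(query, "->", [doc.page_content[:40] for doc in docs])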
Summary
Congratulations! You’ve taken a significant leap forward in understanding and building more intelligent RAG systems.
Here are the key takeaways from this chapter:
- Limitations of Basic RAG: Simple keyword or vector search often struggles with complex, ambiguous, or multi-faceted queries.
- Query Rewriting/Transformation: LLMs can be leveraged to rephrase, expand, or decompose user queries, improving the relevance and recall of retrieval by addressing lexical mismatches and ambiguity.
- Multi-Hop Retrieval: For questions requiring reasoning across multiple pieces of information, LLMs can act as intelligent agents to break down complex queries, perform iterative retrieval steps, extract intermediate answers, and synthesize a final comprehensive response.
- LLMs as Orchestrators: In RAG 2.0, LLMs are not just for generation; they are integral to the retrieval process, guiding and enhancing how information is found.
- Practical Application: We implemented a basic query rewriting system and explored the conceptual flow of a multi-hop retrieval agent using Python and `langchain`.
By applying these advanced querying techniques, your RAG systems can move beyond simple information retrieval to truly understand and respond to complex user needs, delivering more accurate and nuanced answers.
What’s Next?
In the next chapter, we’ll dive deeper into GraphRAG, exploring how structured knowledge graphs can provide an even more robust foundation for multi-hop reasoning and precise context retrieval, building upon the agentic principles we’ve discussed here.
References
- RAG and Generative AI - Azure AI Search - Microsoft Learn
- LangChain Documentation - Chains
- OpenAI API Documentation - Chat Completions
- ChromaDB Documentation