Welcome back, future RAG 2.0 architects! So far in our journey, we’ve explored how to supercharge Retrieval-Augmented Generation (RAG) by moving beyond simple chunking. We’ve delved into sophisticated techniques like hybrid search, advanced embeddings, GraphRAG, multi-hop retrieval, and intelligent query rewriting. These methods significantly improve how we retrieve relevant information.
But what if the Large Language Model (LLM) itself could be more than just a responder? What if it could plan its own retrieval strategy, decide which tools to use, and even refine its approach based on the results? This is the essence of Agentic Retrieval – an exciting evolution where LLMs transform from passive generators into active, intelligent orchestrators of information.
In this chapter, we’re going to unlock the next level of RAG 2.0. You’ll learn:
- What agentic retrieval is and how it differs from traditional RAG.
- The core components that make up an intelligent agentic system.
- How LLMs can plan, execute, and iterate on complex retrieval tasks.
- Practical steps to implement a basic agentic retrieval system using popular frameworks.
Get ready to empower your LLMs with true intelligence, making them proactive problem-solvers rather than just reactive answer machines. This is where RAG truly becomes an intelligent system!
The Evolution of RAG: From Simple Retrieval to Agentic Orchestration
Recall our earlier discussions: basic RAG often struggles with complex, multi-faceted queries because it’s limited by a single retrieval step. Even advanced techniques like GraphRAG and multi-hop retrieval, while powerful, still largely operate within a predefined pipeline. The LLM receives the context and generates.
Agentic Retrieval flips this script. Instead of the LLM simply receiving a context, the LLM becomes an agent that decides how to get the context. It’s like having a skilled detective who, given a complex case, doesn’t just look up one database, but formulates a plan: “First, I’ll check the witness statements. Then, if that’s inconclusive, I’ll look at the forensic reports. If I need to connect distant facts, I’ll consult the expert database.”
This paradigm shift allows RAG systems to tackle problems that require:
- Complex Reasoning: Breaking down a hard question into smaller, manageable sub-questions.
- Dynamic Tool Use: Selecting the best retrieval mechanism (vector search, keyword search, graph traversal, web search, API call) for each sub-problem.
- Iterative Refinement: Adapting its strategy based on partial results or failures.
- Multi-Source Integration: Seamlessly weaving together information from disparate data sources.
Core Concepts of Agentic Retrieval
At its heart, an agentic retrieval system comprises a few key components working in harmony:
- The Agent (LLM): This is the brain of the operation. The LLM is empowered with reasoning capabilities (often through techniques like ReAct - Reason and Act) to understand the user’s query, plan a course of action, select appropriate tools, and generate the final response. It’s not just generating text; it’s generating thoughts and actions.
- Tools: These are the agent’s hands. Tools are functions or APIs that the agent can call to interact with the outside world. For RAG 2.0, these tools are typically various retrieval mechanisms:
- Vector Store Retriever: For semantic similarity search.
- Keyword Search Retriever: For exact or fuzzy keyword matches.
- Graph Database Retriever: For traversing relationships and entities.
- Web Search Tool: To fetch real-time information from the internet.
- API Calls: To interact with specific knowledge bases, calculators, or other services.
- Memory: Agents need to remember previous interactions, past observations, and the steps they’ve already taken. This allows for multi-turn conversations and prevents redundant actions, enabling more coherent and efficient problem-solving.
- Orchestration Logic: This is the framework that guides the agent. It defines how the agent observes its environment (user query, tool outputs), decides on the next action (tool selection, reasoning step), and executes that action. Frameworks like LangChain and LlamaIndex provide robust abstractions for building this logic.
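To make these components concrete, here is a minimal, framework-free sketch of how they fit together. All names here (`Tool`, `AgentState`, `orchestrate`) are hypothetical stand-ins; real frameworks like LangChain and LlamaIndex provide far richer abstractions:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    """A named capability the agent can invoke."""
    name: str
    description: str  # the agent reads this to decide when to use the tool
    func: Callable[[str], str]

@dataclass
class AgentState:
    """Memory: everything the agent has seen and done so far."""
    history: list = field(default_factory=list)

def orchestrate(query: str, tools: dict, choose: Callable, state: AgentState) -> str:
    """Orchestration logic: observe, decide, act (a single step of the loop)."""
    tool_name, tool_input = choose(query, tools, state.history)
    observation = tools[tool_name].func(tool_input)
    state.history.append((tool_name, tool_input, observation))
    return observation

# Wire it up with a trivial 'LLM' that always picks the one available tool.
echo = Tool("Echo", "Repeats the input.", lambda q: f"echo: {q}")
tools = {echo.name: echo}
state = AgentState()
result = orchestrate("hello", tools, lambda q, t, h: ("Echo", q), state)
assert result == "echo: hello"
assert len(state.history) == 1
print(result)
```

In a real system, `choose` would be an LLM call that reads the tool descriptions and the history; here it is a lambda so the structure stays visible.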
Picture the flow as a loop: the Agent LLM sits at the center, constantly thinking, selecting tools, observing their outputs, and refining its understanding until it can formulate a Final Answer.
How LLMs Orchestrate Retrieval: The ReAct Pattern
A common and effective pattern for empowering LLMs in agentic systems is ReAct (Reason and Act). Introduced in “ReAct: Synergizing Reasoning and Acting in Language Models” (Yao et al., 2022), ReAct prompts the LLM to interleave Thought, Action, and Observation steps.
- Thought: The LLM articulates its reasoning process, explaining why it’s taking a particular step or what it aims to achieve. This helps guide the LLM and makes its behavior more interpretable.
- Action: Based on its thought, the LLM decides to use a specific tool with certain inputs. The framework then executes this tool call.
- Observation: The result of the tool’s execution is fed back to the LLM. This observation informs the LLM’s next thought and action.
This iterative loop of Thought -> Action -> Observation allows the LLM to perform complex, multi-step problem-solving, dynamically adapting its approach based on real-time feedback from its tools.
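To see the pattern's mechanics without any framework, here is a toy ReAct loop driven by a scripted "LLM". The scripted turns stand in for real model output; the parsing is deliberately naive and purely illustrative:

```python
# A scripted stand-in for the LLM: each entry is one model turn in
# "Thought / Action / Action Input" or "Final Answer" form.
SCRIPTED_TURNS = [
    "Thought: I should look up the capital of France.\n"
    "Action: lookup\nAction Input: capital of France",
    "Thought: I now know the final answer.\nFinal Answer: Paris",
]

TOOLS = {"lookup": lambda q: "The capital of France is Paris."}

def react_loop(turns, tools):
    """Interleave Thought -> Action -> Observation until a Final Answer."""
    transcript = []
    for turn in turns:
        transcript.append(turn)
        if "Final Answer:" in turn:
            return turn.split("Final Answer:")[1].strip(), transcript
        # Parse the chosen tool and its input from the model's turn.
        action = turn.split("Action:")[1].split("\n")[0].strip()
        action_input = turn.split("Action Input:")[1].strip()
        # Execute the tool and feed the observation back into the transcript.
        transcript.append(f"Observation: {tools[action](action_input)}")
    raise RuntimeError("ran out of turns without a final answer")

answer, transcript = react_loop(SCRIPTED_TURNS, TOOLS)
assert answer == "Paris"
assert any(t.startswith("Observation:") for t in transcript)
print(answer)  # → Paris
```

In a real agent, each iteration would send the growing transcript back to the LLM, which produces the next Thought/Action turn based on the latest Observation.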
Step-by-Step Implementation: Building a Simple Agentic Retriever
Let’s get our hands dirty and build a basic agentic system using Python and the LangChain framework. LangChain is a popular choice for building LLM-powered applications, offering robust abstractions for agents and tools. This chapter pins langchain 0.1.13 with langchain-openai 0.0.8; if you are on a newer release, the agent APIs may differ slightly.
For this example, we’ll create an agent that can:
- Search a local vector store (simulating a private knowledge base).
- Perform a “web search” (we’ll simulate this for simplicity, but it could be a real search engine API).
Prerequisites:
Before we start, ensure you have Python 3.9+ installed. We’ll need to install a few libraries:
```bash
# Pinned versions for reproducibility; newer releases may work but can
# change the agent APIs shown below.
pip install langchain==0.1.13 langchain-openai==0.0.8 faiss-cpu==1.7.4 python-dotenv==1.0.1
```
You’ll also need an OpenAI API key (or a key for another LLM provider). Create a .env file in your project directory and add your key:
```
OPENAI_API_KEY="your_openai_api_key_here"
```
Step 1: Prepare Our Tools
First, let’s set up our retrieval tools. We’ll create a dummy vector store and a simulated web search.
Create a file named agentic_rag.py:
```python
# agentic_rag.py
import os

from dotenv import load_dotenv
from langchain.agents import AgentExecutor, create_react_agent
from langchain_community.vectorstores import FAISS
from langchain_core.tools import Tool
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Load environment variables (expects OPENAI_API_KEY in .env)
load_dotenv()

# --- 1. Prepare Data and Embeddings for Vector Store ---
# For a real application, you'd load documents from a database, files, etc.
# We'll use simple in-memory text for demonstration.
docs = [
    "The capital of France is Paris.",
    "Eiffel Tower is located in Paris.",
    "The official language of France is French.",
    "The largest ocean on Earth is the Pacific Ocean.",
    "Mars is known as the Red Planet.",
    "The average temperature on Earth is about 15 degrees Celsius.",
    "RAG 2.0 improves context relevance through hybrid search and agentic retrieval.",
    "GraphRAG is a technique within RAG 2.0 that leverages knowledge graphs for context.",
    "LangChain is a popular framework for building LLM-powered applications.",
    "LlamaIndex is another framework focused on data orchestration for LLMs.",
]

# Initialize the embeddings model (small and efficient; swap in a larger
# model if your use case demands higher retrieval quality).
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create a FAISS vector store from our documents.
# In a real scenario, this would be persisted and loaded.
vectorstore = FAISS.from_texts(docs, embeddings)

# Create a retriever for the vector store (top-2 results per query)
vectorstore_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# --- 2. Define Our Tools ---

# Tool 1: Vector Store Retriever
# This tool searches our local knowledge base (the 'docs' above).
def get_vector_store_results(query: str) -> str:
    """Searches the local knowledge base for relevant information."""
    retrieved_docs = vectorstore_retriever.invoke(query)
    # Format the results into a readable string for the LLM
    return "\n".join(doc.page_content for doc in retrieved_docs)

vector_search_tool = Tool(
    name="LocalKnowledgeBase",
    func=get_vector_store_results,
    description=(
        "Useful for answering questions about specific facts or concepts from a "
        "curated local knowledge base. Input should be a clear, concise query."
    ),
)

# Tool 2: Simulated Web Search
# In a real application, this would integrate with an actual search API
# (e.g., Google Custom Search, Bing Web Search).
def simulated_web_search(query: str) -> str:
    """Simulates a web search for general knowledge or current events."""
    print(f"--- Performing simulated web search for: '{query}' ---")
    if "current weather" in query.lower():
        return "The current weather in London is partly cloudy with a temperature of 10 degrees Celsius."
    elif "latest news" in query.lower():
        return "The latest news headlines include advancements in AI ethics and global economic recovery efforts."
    elif "population of earth" in query.lower():
        return "The estimated population of Earth is currently over 8 billion people."
    else:
        return "No specific web results found for this query in the simulation. This tool is best for general knowledge."

web_search_tool = Tool(
    name="WebSearch",
    func=simulated_web_search,
    description=(
        "Useful for answering general knowledge questions, current events, or "
        "information not found in the local knowledge base. Input should be a broad question."
    ),
)

# List of all tools available to the agent
tools = [vector_search_tool, web_search_tool]

print("Tools initialized successfully!")
```
Explanation:

- We import the modules we need from `langchain`, `langchain_openai`, and `langchain_community`; `load_dotenv()` loads our API key from the `.env` file.
- We define a small set of `docs` to simulate our private knowledge base.
- `OpenAIEmbeddings` with `text-embedding-3-small` converts our text into numerical vectors; it is an efficient, inexpensive embedding model.
- `FAISS.from_texts` creates an in-memory vector store, and `vectorstore.as_retriever()` makes it searchable.
- We then wrap our `vectorstore_retriever` in a `Tool` object. The `name` and `description` are crucial: the LLM uses them to decide when to use the tool.
- We create a `simulated_web_search` function and wrap it as a `Tool` as well. Notice how its `description` guides the LLM on its appropriate use.
- Finally, we collect all our `Tool` objects into a `tools` list.
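Before wiring tools into an agent, it helps to sanity-check each tool function in isolation. This standalone sketch re-implements the simulated search above (no LangChain required) and spot-checks its keyword routing:

```python
# Standalone copy of the simulated web search tool for quick testing.
def simulated_web_search(query: str) -> str:
    """Simulates a web search for general knowledge or current events."""
    q = query.lower()
    if "current weather" in q:
        return "The current weather in London is partly cloudy with a temperature of 10 degrees Celsius."
    elif "latest news" in q:
        return "The latest news headlines include advancements in AI ethics and global economic recovery efforts."
    elif "population of earth" in q:
        return "The estimated population of Earth is currently over 8 billion people."
    return "No specific web results found for this query in the simulation. This tool is best for general knowledge."

# Spot-check the routing logic before the agent ever sees the tool.
assert "8 billion" in simulated_web_search("What is the population of Earth?")
assert "partly cloudy" in simulated_web_search("What is the current weather in London?")
assert simulated_web_search("Who wrote Hamlet?").startswith("No specific web results")
print("tool sanity checks passed")
```

Catching a broken tool here is far cheaper than debugging it through an agent's verbose trace later.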
Step 2: Initialize the Agent
Now, let’s create our LLM agent that will use these tools.
Add the following code to agentic_rag.py (after defining tools):
```python
# --- 3. Initialize the Agent LLM ---
# Using ChatOpenAI with a strong reasoning model; substitute whichever
# capable chat model your provider offers.
llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)

# --- 4. Define the Agent's Prompt ---
# create_react_agent requires a ReAct-style prompt. The standard one is
# published on the LangChain Hub; you can also supply your own PromptTemplate
# as long as it includes {tools}, {tool_names}, {input}, and {agent_scratchpad}.
from langchain import hub

prompt = hub.pull("hwchase17/react")

# --- 5. Create the Agent ---
# create_react_agent wires the LLM, tools, and prompt into a ReAct agent.
agent = create_react_agent(llm, tools, prompt)

# --- 6. Create the Agent Executor ---
# The AgentExecutor is responsible for actually running the agent,
# executing its steps (Thought, Action, Observation) and managing its state.
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    handle_parsing_errors=True,
)

print("Agent initialized and ready to go!")
```
Explanation:

- We initialize our `ChatOpenAI` LLM with `gpt-4-turbo` for its strong reasoning capabilities. `temperature=0` encourages deterministic, focused responses, which is ideal for agent planning.
- `create_react_agent` is a convenient LangChain function that sets up an agent to follow the ReAct pattern. It takes the LLM, the `tools` list, and a ReAct-style prompt; the standard `hwchase17/react` prompt from the LangChain Hub instructs the LLM to think, act, and observe.
- The `AgentExecutor` is the runtime for our agent. It takes the `agent` and `tools`, and `verbose=True` is incredibly useful for debugging, as it prints the agent's `Thought`, `Action`, and `Observation` steps. `handle_parsing_errors=True` helps recover from minor LLM output formatting issues.
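A custom ReAct prompt must expose the template variables that `create_react_agent` substitutes at runtime. Here is a minimal sketch of such a template; the placeholders are what matter, while the surrounding wording is just one plausible phrasing, not LangChain's official text:

```python
# A minimal ReAct-style prompt template. The four placeholders below are the
# ones create_react_agent fills in; the rest of the wording is illustrative.
REACT_TEMPLATE = """Answer the following questions as best you can.
You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Observation cycle can repeat)
Thought: I now know the final answer
Final Answer: the final answer to the original question

Question: {input}
Thought: {agent_scratchpad}"""

# Sanity-check that every required placeholder is present.
for var in ("{tools}", "{tool_names}", "{input}", "{agent_scratchpad}"):
    assert var in REACT_TEMPLATE, f"missing placeholder: {var}"
print("template placeholders present")
```

You would wrap this with `PromptTemplate.from_template(REACT_TEMPLATE)` and pass it as the third argument to `create_react_agent` in place of the hub prompt.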
Step 3: Run the Agent
Now, let’s put our agent to work!
Add the following to the end of agentic_rag.py:
```python
# --- 7. Run the Agent with a Query ---
if __name__ == "__main__":
    print("\n--- Agentic Retrieval Demo ---")
    print("Ask me a question (type 'exit' to quit).")

    while True:
        user_query = input("\nYour query: ")
        if user_query.lower() == "exit":
            break
        try:
            # agent_executor.invoke() runs the agent on the user's query.
            result = agent_executor.invoke({"input": user_query})
            print("\n--- Final Answer ---")
            print(result["output"])
        except Exception as e:
            print(f"An error occurred: {e}")
            print("The agent might have struggled with this query or its tools.")
```
Explanation:

- We wrap the execution logic in an `if __name__ == "__main__":` block to make the script runnable.
- We enter a loop to allow multiple queries.
- `agent_executor.invoke({"input": user_query})` is the core call. It passes the user's query to the agent, which then begins its `Thought -> Action -> Observation` cycle until it produces a final answer.
- The `verbose=True` setting on `AgentExecutor` will show you the fascinating internal monologue of the LLM as it processes your query!
Let’s run it! Save the file and run from your terminal:
```bash
python agentic_rag.py
```
Try these queries and observe the agent’s verbose output:
- `What is the capital of France?` (should use `LocalKnowledgeBase`)
- `What is the estimated population of Earth?` (should use `WebSearch`)
- `Tell me about RAG 2.0.` (should use `LocalKnowledgeBase`)
- `What is the current weather in London?` (should use `WebSearch`)
- `Who developed LangChain?` (might try `LocalKnowledgeBase` first and fall back to `WebSearch` if nothing relevant is found)
You’ll see the LLM’s “Thought” process, which “Action” it takes (calling a tool), and the “Observation” (the tool’s output) before it arrives at a “Final Answer.” This demonstrates the LLM’s ability to plan and execute.
Mini-Challenge: Enhance the Agent with a Custom Tool
Your turn! Let’s make our agent even smarter by giving it a new capability.
Challenge:
Add a new Calculator tool to our agent. This tool should be able to perform simple arithmetic operations. The agent should then be able to use this tool when a mathematical question is posed.
Hint:
- Define a Python function that takes a string representing a simple arithmetic expression (e.g., "2 + 2") and returns the result. You can use Python's `eval()` function for simplicity, but be aware of its security implications in production. For this learning exercise, it's fine.
- Wrap this function as a `Tool` object, giving it a clear `name` and `description` that tells the LLM when to use it (e.g., "Useful for performing mathematical calculations. Input should be a valid arithmetic expression like '2 + 2'.").
- Add your new `Tool` to the `tools` list that is passed to `create_react_agent` and `AgentExecutor`.
- Test with queries like "What is 15 times 3 minus 7?" or "Calculate 25% of 200."
What to observe/learn:
Pay close attention to the agent’s Thought process when you ask a mathematical question. Does it correctly identify the need for the Calculator tool? Does it formulate the input for the tool correctly? This exercise reinforces how tool descriptions are crucial for agent decision-making.
Potential solution to the Mini-Challenge:
```python
# ... (previous code remains the same) ...

# --- Add a new Calculator Tool ---
def calculator_tool_func(expression: str) -> str:
    """Performs simple arithmetic calculations."""
    try:
        # WARNING: eval() is a security risk in production.
        # For a learning exercise it's acceptable; in a real app,
        # use a safe math expression parser instead.
        result = eval(expression)
        return str(result)
    except Exception as e:
        return f"Error calculating: {e}. Please provide a valid arithmetic expression."

calculator_tool = Tool(
    name="Calculator",
    func=calculator_tool_func,
    description="Useful for performing mathematical calculations. Input should be a valid arithmetic expression, e.g., '2 + 2' or '15 * 3'.",
)

# Update the list of tools
tools = [vector_search_tool, web_search_tool, calculator_tool]  # Add the new tool here!

# ... (rest of the code remains the same: llm, agent, agent_executor, and the main loop) ...
```
After adding this, try queries like:
- `What is 123 plus 456?`
- `What is 25 percent of 200?` (the agent may need to rephrase this as `0.25 * 200` for the calculator)
- `If I have 5 apples and buy 3 more, then eat 2, how many do I have?` (this may be too complex for a single calculator step and could require multiple thoughts/actions)
Common Pitfalls & Troubleshooting in Agentic Retrieval
Agentic systems are powerful, but they introduce new complexities. Here are some common issues and how to approach them:
Tool Hallucination or Misuse:

- Problem: The agent might invent tools that don’t exist, call a tool with incorrect parameters, or use the wrong tool for the job. This often happens if the tool descriptions are ambiguous or if the LLM isn’t powerful enough for complex reasoning.
- Troubleshooting:
  - Clear Tool Descriptions: Ensure each `Tool`'s `description` is precise, unambiguous, and clearly states its purpose and expected input format.
  - Few-Shot Examples: For more complex tools, consider providing few-shot examples within the agent’s prompt to demonstrate correct tool usage.
  - LLM Choice: Use a more capable LLM (e.g., `gpt-4-turbo` or equivalent) for agent planning, as their reasoning abilities are superior.
  - Input Validation: Implement robust input validation within your tool functions to catch and gracefully handle invalid inputs before they crash the agent.

Infinite Loops or Early Stopping:

- Problem: The agent might get stuck in a loop of `Thought -> Action -> Observation` without ever reaching a final answer, or it might stop prematurely with an incomplete answer.
- Troubleshooting:
  - Max Iterations: Set a `max_iterations` limit on the `AgentExecutor` to prevent runaway processes.
  - Clear Stopping Criteria: Ensure the agent’s prompt clearly defines what constitutes a “final answer” and when it should stop.
  - Observation Quality: If tool observations are unhelpful or ambiguous, the agent might struggle to make progress. Improve the output format of your tool functions.
  - Prompt Engineering: Refine the agent’s prompt to encourage it to provide a final answer when it has sufficient information.
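The iteration cap is worth internalizing. Conceptually, an executor's run loop looks something like this stripped-down sketch (plain Python, no LangChain; real executors do far more, and `step` here is a hypothetical stand-in for one Thought/Action/Observation round):

```python
def run_agent_loop(step, max_iterations=5):
    """Minimal sketch of an executor loop with an iteration cap.

    `step` stands in for one Thought -> Action -> Observation round; it
    returns a final answer string, or None if the agent wants to keep going.
    """
    for i in range(max_iterations):
        answer = step(i)
        if answer is not None:
            return answer
    # Give up gracefully instead of running forever.
    return "Agent stopped due to iteration limit."

# A pathological agent that never produces a final answer:
looping_agent = lambda i: None
assert run_agent_loop(looping_agent, max_iterations=3) == "Agent stopped due to iteration limit."

# An agent that finds its answer on the second step:
finds_answer = lambda i: "42" if i == 1 else None
assert run_agent_loop(finds_answer) == "42"
print("loop guard works")
```

In LangChain you get this behavior by passing `max_iterations` (and optionally `max_execution_time`) to `AgentExecutor` rather than writing the loop yourself.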
Cost and Latency:
- Problem: Agentic systems often make multiple LLM calls (for thoughts, actions, and observations) per user query, leading to higher costs and increased latency compared to single-shot RAG.
- Troubleshooting:
- Efficient LLMs: Use smaller, faster, or cheaper LLMs for simpler planning steps, reserving larger models for critical reasoning or generation.
- Caching: Implement caching for tool calls or LLM responses that are likely to be repeated.
- Parallel Tool Calls: If tools are independent, consider executing them in parallel where possible.
- Prompt Optimization: Minimize the token count in prompts and observations where feasible without losing critical information.
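Caching is often the cheapest win. For deterministic tools, even the standard library's `functools.lru_cache` eliminates repeated work within a process (a sketch; production systems would typically use a shared cache such as Redis, and `cached_search` here is a hypothetical stand-in for an expensive tool call):

```python
import functools

CALL_COUNT = {"n": 0}  # tracks how often the underlying "tool" actually runs

@functools.lru_cache(maxsize=256)
def cached_search(query: str) -> str:
    """Stand-in for an expensive tool call (API request, vector search, ...)."""
    CALL_COUNT["n"] += 1
    return f"results for: {query}"

# Three invocations, but only two distinct queries hit the underlying tool.
cached_search("capital of France")
cached_search("capital of France")  # served from cache
cached_search("population of Earth")
assert CALL_COUNT["n"] == 2
print("cache hit saved one tool call")
```

Note that caching is only safe for tools whose results don't go stale quickly; a live web search or weather API would need a short TTL rather than an unbounded cache.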
Context Overflow and Loss:
- Problem: As the agent interacts, its internal “scratchpad” (memory of thoughts, actions, observations) can grow, potentially exceeding the LLM’s context window.
- Troubleshooting:
- Summarization Tool: Give the agent a tool to summarize its own scratchpad or previous turns if they become too long.
- Memory Management: Implement more sophisticated memory management, such as summarizing past conversations or only retaining the most relevant recent interactions.
- Context Compression: Apply techniques to compress the observations from tools before feeding them back to the LLM.
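A simple version of the "retain only recent interactions" strategy can be sketched as a sliding window over the scratchpad. This is illustrative only; agent frameworks ship richer memory classes, and a real implementation might summarize the dropped prefix instead of discarding it:

```python
def trim_scratchpad(steps, max_steps=3):
    """Keep only the most recent Thought/Action/Observation entries."""
    if len(steps) <= max_steps:
        return steps
    dropped = len(steps) - max_steps
    # Replace the dropped prefix with a one-line marker so the agent
    # still knows earlier context existed.
    return [f"[{dropped} earlier steps omitted]"] + steps[-max_steps:]

history = [f"step {i}" for i in range(6)]
trimmed = trim_scratchpad(history, max_steps=3)
assert trimmed == ["[3 earlier steps omitted]", "step 3", "step 4", "step 5"]
print(trimmed)
```

The key trade-off: a smaller window keeps you safely inside the context limit but risks the agent repeating actions whose outcomes it has forgotten.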
Summary
Congratulations! You’ve successfully navigated the exciting world of Agentic Retrieval, a cornerstone of RAG 2.0.
Here’s a quick recap of what we’ve covered:
- Agentic Retrieval empowers LLMs to act as intelligent orchestrators, planning and executing complex retrieval strategies rather than just generating responses from a pre-assembled context.
- The core components include the Agent (LLM), Tools (various retrieval mechanisms, APIs), Memory, and Orchestration Logic.
- The ReAct pattern (`Thought -> Action -> Observation`) is a powerful mechanism for guiding LLM reasoning and tool use.
- We implemented a basic agentic system using LangChain, demonstrating how to define tools and build an `AgentExecutor` to run an LLM-powered agent.
- We explored common pitfalls like tool hallucination, infinite loops, and cost, along with practical troubleshooting strategies.
Agentic retrieval represents a significant leap towards more autonomous and capable AI systems. By giving LLMs the ability to plan and adapt, we unlock new possibilities for tackling highly complex information needs that traditional RAG struggles with.
In our final chapter, we’ll look ahead to the future of RAG, discussing emerging trends, ethical considerations, and how these advanced techniques will continue to shape the landscape of AI.
References
- LangChain Documentation
- LlamaIndex Documentation
- OpenAI API Documentation
- ReAct: Synergizing Reasoning and Acting in Language Models (Paper)
- RAG and Generative AI - Azure AI Search - Microsoft Learn