Introduction to Agentic RAG: Beyond the Context Window
Welcome back, aspiring agent architects! In our previous chapters, we’ve explored how autonomous agents leverage Large Language Models (LLMs) for reasoning and how their “short-term memory” is managed through the LLM’s context window. This context window is fantastic for immediate conversations and sequential thoughts, but it has inherent limitations: it’s finite, expensive, and doesn’t inherently contain specialized or up-to-date information.
Imagine an agent trying to answer a question about the latest quarterly earnings report for a specific company, or debug a complex piece of code based on an internal documentation wiki. Without access to this external, specialized knowledge, the agent would either “hallucinate” (make up information) or simply state it doesn’t know. This is where Long-Term Memory comes into play for AI agents, specifically through a powerful technique called Retrieval-Augmented Generation (RAG).
In this chapter, you’ll learn how to equip your agents with a robust long-term memory system. We’ll demystify RAG, understand the magic of embeddings, and see how vector databases become the brain’s archive for your agents. By the end, you’ll be able to build a foundational RAG system, allowing your agents to access and utilize vast amounts of external knowledge, making them more informed, accurate, and powerful. Let’s dive in and unlock true knowledge for our agents!
Core Concepts: Agentic RAG Explained
The ability to access and synthesize information beyond an LLM’s initial training data is paramount for building truly intelligent and useful agents. RAG provides this crucial capability.
The Problem with Short-Term Memory
LLMs are incredible at understanding and generating human-like text. However, they have two main limitations when it comes to knowledge:
- Knowledge Cut-off: LLMs are trained on vast datasets up to a certain point in time. They don’t inherently know about events, products, or information that emerged after their training data was collected.
- Limited Context Window: While the context window (the maximum length of input tokens an LLM can process at once) has grown significantly, it’s still finite. You can’t fit an entire company’s documentation or all of Wikipedia into a single prompt. Storing and retrieving past interactions for an agent also becomes challenging over long sessions.
These limitations mean that an agent relying solely on its LLM’s internal knowledge and current context window will struggle with domain-specific tasks, real-time data, or long-running, knowledge-intensive operations.
Introducing Retrieval-Augmented Generation (RAG)
RAG is a technique that empowers LLMs to access, retrieve, and incorporate external, up-to-date, and domain-specific information into their responses. Instead of solely generating text based on its internal training data, an LLM enhanced with RAG first retrieves relevant information from an external knowledge base and then generates a response grounded in that retrieved context.
Why is RAG crucial for agents?
- Grounding: RAG ensures the agent’s responses are factually accurate and based on verifiable information, significantly reducing “hallucinations.”
- Up-to-Date Information: Agents can access the latest data, overcoming the LLM’s knowledge cut-off.
- Specialized Knowledge: Agents can operate effectively in niche domains by querying specific documentation, databases, or proprietary knowledge bases.
- Transparency: By showing the source of retrieved information, RAG can make an agent’s reasoning more transparent and auditable.
Here’s a high-level overview of the RAG process: at ingestion time, documents are loaded, split into chunks, embedded, and stored in a vector database; at query time, the user’s query is embedded, the most similar chunks are retrieved, and those chunks are inserted into the LLM’s prompt so the model can generate a grounded answer.
Embeddings: The Language of Similarity
At the heart of RAG lies the concept of embeddings. Think of an embedding as a numerical fingerprint for a piece of text (a word, a sentence, a paragraph, or even an entire document). This fingerprint is a dense vector (a list of numbers) that captures the semantic meaning of the text.
- How they work: An embedding model (often a specialized neural network) takes text as input and outputs a vector. Texts that are semantically similar will have embedding vectors that are “close” to each other in a multi-dimensional space.
- Why they’re important for RAG: When an agent receives a query, that query is also converted into an embedding. The system then searches for other embeddings in its knowledge base that are numerically closest to the query’s embedding. This allows for powerful semantic search, finding relevant information even if the exact keywords aren’t present.
Popular embedding models include those from OpenAI (e.g., text-embedding-3-small), Cohere, Google, and various open-source models available on Hugging Face. The choice of embedding model significantly impacts the quality of retrieval.
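To make "closeness" concrete, here is a minimal sketch of semantic similarity, assuming an `OPENAI_API_KEY` in your environment and the `langchain-community` wrapper we install later in this chapter; the exact scores will vary by model, but related sentences should score noticeably higher.

```python
import math

from langchain_community.embeddings import OpenAIEmbeddings

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: close to 1.0 means very similar, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vec_cat = embeddings.embed_query("A cat sat on the mat.")
vec_kitten = embeddings.embed_query("A kitten rested on the rug.")
vec_earnings = embeddings.embed_query("Quarterly earnings beat analyst expectations.")

# Semantically related sentences land closer together in the vector space.
print(cosine_similarity(vec_cat, vec_kitten))    # relatively high
print(cosine_similarity(vec_cat, vec_earnings))  # noticeably lower
```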
Vector Databases: Storing and Searching Knowledge
Once you have these numerical fingerprints (embeddings) for your knowledge base, you need a place to store them and efficiently search through them. This is the job of a vector database.
- What they are: Vector databases are specialized databases optimized for storing vector embeddings and performing lightning-fast similarity searches (e.g., finding the “nearest neighbors” to a query vector).
- How they work: They use sophisticated indexing algorithms (like Annoy, HNSW, IVF) to quickly find vectors that are geometrically close to a given query vector, representing semantic similarity.
- Examples: Popular choices include managed services like Pinecone and Weaviate, self-hostable options like Qdrant and ChromaDB, and in-memory libraries like FAISS for smaller, local deployments. The vector database ecosystem is mature, with options for a wide range of scales and use cases.
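To see what a vector database does at its simplest, here is a hedged sketch using the `chromadb` client directly (the same package we install later). In version 0.4.x, Chroma embeds documents with a default local embedding function, so this particular demo needs no API key.

```python
import chromadb

client = chromadb.Client()  # in-memory instance; nothing is persisted
collection = client.create_collection("demo")

# Chroma embeds these documents using its default local embedding function.
collection.add(
    documents=[
        "The Eiffel Tower is in Paris.",
        "Transformers are a neural network architecture.",
        "Croissants are a French pastry.",
    ],
    ids=["doc1", "doc2", "doc3"],
)

# Nearest-neighbor search: the query is embedded, then compared to stored vectors.
results = collection.query(query_texts=["famous landmarks in France"], n_results=2)
print(results["documents"])  # the two most semantically similar documents
```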
Components of an Agentic RAG System
Let’s break down the practical steps involved in setting up a RAG system for your agent:
- Document Loading: This is the first step, where you ingest your raw data from various sources. This could be PDFs, markdown files, web pages, Notion databases, Confluence wikis, or even structured data from SQL databases.
- Text Splitting (Chunking): Raw documents are often too large to fit into an LLM’s context window, even after retrieval. They also might contain irrelevant information. Therefore, documents are broken down into smaller, manageable “chunks” of text. The art of chunking is crucial: chunks need to be small enough to be relevant but large enough to retain sufficient context.
- Embedding Generation: Each of these text chunks is then passed through an embedding model to generate its corresponding vector embedding.
- Vector Storage: The generated embeddings (along with a reference back to their original text chunks) are stored in a vector database.
- Retrieval: When a user poses a query (or an agent needs information), the query is embedded, and a similarity search is performed in the vector database to find the most relevant text chunks.
- Prompt Augmentation: The retrieved text chunks are then dynamically inserted into the LLM’s prompt, providing the LLM with the necessary context to generate an informed response.
Step-by-Step Implementation: Building a Basic Agentic RAG System
Let’s get hands-on and build a simple RAG system using Python. We’ll use langchain-community for its helpful abstractions, openai for embeddings and the LLM, and chromadb as our local vector store.
Setup: Your Python Environment
First, ensure you have Python 3.11 or newer installed. Then, create a virtual environment and install the necessary packages.
# Create a virtual environment
python -m venv agent_rag_env
# Activate the virtual environment
# On macOS/Linux:
source agent_rag_env/bin/activate
# On Windows:
# .\agent_rag_env\Scripts\activate
# Install the required packages
pip install openai~=1.14.0 langchain~=0.1.13 langchain-community~=0.0.30 chromadb~=0.4.24 tiktoken~=0.6.0
Note on versions: the `~=` operator installs a compatible release (e.g., `~=1.14.0` allows any `1.14.x`). These pins are reasonable starting points; check the latest stable releases on PyPI if you encounter issues. The `langchain` package is included because it provides the text splitter we use later.
You’ll also need an OpenAI API key. Store it securely, for example, as an environment variable named OPENAI_API_KEY.
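A quick defensive check at the top of your script can save confusing errors later; a small sketch:

```python
import os

# Fail fast with a clear message if the key is missing.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; export it before running.")
```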
Step 1: Choose an Embedding Model and LLM
For our example, we’ll use OpenAI’s text-embedding-3-small for embeddings and gpt-3.5-turbo for the LLM. Remember to set your OPENAI_API_KEY environment variable.
Create a new Python file named agent_rag.py.
# agent_rag.py
import os
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.chat_models import ChatOpenAI
# Ensure your OPENAI_API_KEY is set as an environment variable
# e.g., export OPENAI_API_KEY="your_api_key_here"
print("Initializing LLM and Embedding Model...")
# Initialize the LLM (for generation)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.0)
print(f"LLM initialized: {llm.model_name}")
# Initialize the Embedding Model (for converting text to vectors)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
print(f"Embedding model initialized: {embeddings.model}")
print("\nReady to process knowledge!")
Explanation:
- `os` is imported to manage environment variables.
- `OpenAIEmbeddings` is a convenient wrapper from `langchain-community` for OpenAI’s embedding API. We specify `text-embedding-3-small` for a cost-effective yet powerful embedding.
- `ChatOpenAI` is used to interact with OpenAI’s chat completion models, like `gpt-3.5-turbo`. `temperature=0.0` makes the responses more deterministic, which is often preferred for factual retrieval.
Run this script to ensure your API key is set up correctly and the models initialize.
python agent_rag.py
You should see output similar to:
Initializing LLM and Embedding Model...
LLM initialized: gpt-3.5-turbo
Embedding model initialized: text-embedding-3-small
Ready to process knowledge!
Step 2: Prepare Your Knowledge Base (Example Data)
For simplicity, let’s use a few sentences as our “documents.” In a real-world scenario, you’d load data from files, databases, or APIs. LangChain offers Document objects to represent chunks of text with associated metadata.
Add the following to agent_rag.py, replacing the print statement at the end:
# ... (previous code for LLM and embeddings) ...
from langchain_core.documents import Document
print("Preparing knowledge base documents...")
# Our sample knowledge base
raw_documents = [
"The capital of France is Paris. Paris is known for the Eiffel Tower.",
"The Amazon rainforest is the largest tropical rainforest in the world.",
"Python is a popular programming language, widely used for AI and web development.",
"The first AI agent was developed in the 1950s, though modern agentic AI has evolved significantly.",
"Vector databases are essential for efficient similarity search in RAG systems.",
"Reinforcement learning is a machine learning paradigm concerned with how intelligent agents ought to take actions in an environment.",
"OpenAI's GPT-4 is a large multimodal model that can accept image and text inputs and emit text outputs."
]
# Convert raw strings to LangChain Document objects (optional, but good practice for metadata)
documents = [Document(page_content=doc) for doc in raw_documents]
print(f"Loaded {len(documents)} documents into the knowledge base.")
Explanation:
- We create a list of `raw_documents` (simple strings).
- These are converted into `Document` objects from `langchain_core.documents`. While not strictly necessary for this simple example, `Document` objects are crucial in real applications as they allow you to store metadata (like source, page number, author) alongside the text, which can be invaluable during retrieval and response generation.
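For instance, here is a hedged sketch of attaching metadata to a chunk; the keys are illustrative, not a required schema:

```python
from langchain_core.documents import Document

# Metadata travels with the chunk through splitting, storage, and retrieval.
doc = Document(
    page_content="Python is a popular programming language.",
    metadata={"source": "team_wiki/languages.md", "last_updated": "2024-01-15"},  # illustrative keys
)
print(doc.metadata["source"])
```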
Step 3: Chunking the Documents
Our example documents are already quite small, but in practice, you’ll deal with large texts. Chunking breaks these large texts into smaller, semantically coherent pieces.
Add the following to agent_rag.py:
# ... (previous code for documents) ...
from langchain.text_splitter import RecursiveCharacterTextSplitter
print("Chunking documents...")
# Initialize the text splitter
# We want chunks to be small enough for context, but large enough to retain meaning.
# `chunk_size` is the max number of characters in a chunk.
# `chunk_overlap` ensures continuity between chunks by repeating some text.
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500, # Max characters per chunk
chunk_overlap=50, # Overlap between chunks
length_function=len,
is_separator_regex=False,
)
# Split the documents into chunks
chunks = text_splitter.split_documents(documents)
print(f"Original documents split into {len(chunks)} chunks.")
# print(f"First chunk example:\n{chunks[0].page_content}") # Uncomment to see a chunk
Explanation:
- `RecursiveCharacterTextSplitter` is a common and effective splitter. It tries to split by paragraphs, then sentences, then words, recursively, to keep chunks as semantically meaningful as possible.
- `chunk_size` defines the maximum size of each chunk. This is critical for fitting into the LLM’s context window.
- `chunk_overlap`: a small overlap between chunks helps maintain context if a crucial piece of information spans two chunks.
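To build intuition for these parameters, here is a small experiment on a synthetic long text; the sizes are arbitrary and chosen only to force multiple chunks:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# A synthetic "long document" to force splitting.
long_text = " ".join(f"Sentence {i} discusses agent memory in detail." for i in range(40))

splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=40)
chunks = splitter.split_text(long_text)

print(f"{len(long_text)} characters split into {len(chunks)} chunks.")
# With a non-zero overlap, the end of one chunk can reappear near the
# start of the next, preserving context across the boundary.
for i, chunk in enumerate(chunks[:3]):
    print(f"chunk {i}: ...{chunk[-40:]!r}")
```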
Step 4: Creating and Storing Embeddings with a Vector Database
Now we’ll take our chunks, generate embeddings for each, and store them in ChromaDB. Chroma is a great choice for local development as it runs in-process without needing a separate server.
Add the following to agent_rag.py:
# ... (previous code for chunks) ...
from langchain_community.vectorstores import Chroma
print("Generating embeddings and storing in ChromaDB...")
# Create a Chroma vector store from the documents and embeddings model
# This step generates embeddings for each chunk and stores them locally.
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db" # Directory to store the vector database
)
print("ChromaDB created and populated with embeddings.")
print(f"Vector store contains {vectorstore._collection.count()} items.")
Explanation:
- `Chroma.from_documents` is a powerful helper function. It takes your `chunks` (which are `Document` objects), your `embeddings` model, and a `persist_directory`.
- It automatically iterates through each chunk, generates its embedding using `OpenAIEmbeddings`, and stores the embedding along with the original text in the `./chroma_db` directory.
- The `persist_directory` is important: it means your vector database will be saved to disk, so you don’t have to re-embed your documents every time you run the script.
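On subsequent runs you can reopen the persisted store instead of re-embedding everything; a minimal sketch, assuming the `./chroma_db` directory from the step above already exists:

```python
from langchain_community.vectorstores import Chroma

# Reopen the existing store; pass the same embedding model so that
# queries are embedded consistently with the stored vectors.
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
)
print(f"Reloaded store with {vectorstore._collection.count()} items.")
```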
Step 5: Implementing the Retriever
With our vector database populated, we can now create a “retriever” that can fetch relevant chunks based on a query.
Add this to agent_rag.py:
# ... (previous code for vectorstore) ...
print("\nSetting up the retriever...")
# Create a retriever object from the vector store
retriever = vectorstore.as_retriever(search_kwargs={"k": 2}) # Retrieve top 2 most relevant chunks
print("Retriever ready. Testing retrieval with a sample query...")
# Test the retriever
sample_query = "What is Python used for?"
retrieved_docs = retriever.invoke(sample_query)
print(f"\nQuery: '{sample_query}'")
print("Retrieved documents:")
for i, doc in enumerate(retrieved_docs):
print(f"--- Document {i+1} ---")
print(doc.page_content)
print("--------------------")
Explanation:
- `vectorstore.as_retriever()` converts our `Chroma` instance into a `Retriever` object.
- `search_kwargs={"k": 2}` tells the retriever to fetch the top 2 most semantically similar chunks for our query. You can adjust `k` based on your needs.
- `retriever.invoke(sample_query)` performs the actual retrieval:
  - It embeds `sample_query`.
  - It searches the `vectorstore` for the `k` closest embeddings.
  - It returns the original text content of those `k` chunks.
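If you want to see how strong the matches are, the Chroma vector store also exposes raw scores; note that with Chroma’s default distance metric, lower scores mean closer matches:

```python
# Peek at the distance scores behind the retrieval (lower = more similar here).
results = vectorstore.similarity_search_with_score("What is Python used for?", k=2)
for doc, score in results:
    print(f"{score:.4f}  {doc.page_content[:60]}")
```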
Run the script now: python agent_rag.py
You should see the initialization, document loading, chunking, embedding, and finally, the retrieved documents for the sample query. The retrieved documents should be relevant to Python.
Step 6: Integrating RAG into an Agent Prompt
The final step is to take these retrieved documents and use them to augment the prompt sent to our LLM. This is where the “Augmented Generation” part of RAG comes in.
Add the following to agent_rag.py:
# ... (previous code for retriever test) ...
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
print("\nIntegrating RAG into an LLM chain...")
# Define a prompt template that includes a placeholder for context
template = """You are an AI assistant that answers questions based on the provided context.
If you cannot find the answer in the context, state that you don't know.
Context:
{context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)
# Create a RAG chain
# This chain orchestrates the retrieval and generation steps.
rag_chain = (
{"context": retriever, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
# Now, let's ask a question that requires RAG
agent_query = "What is the largest rainforest and what is Python used for?"
print(f"\nAgent's Question: '{agent_query}'")
# Invoke the RAG chain
response = rag_chain.invoke(agent_query)
print("\nAgent's RAG-powered Answer:")
print(response)
# Clean up the ChromaDB directory (optional, if you want a fresh start next time)
# import shutil
# if os.path.exists("./chroma_db"):
# shutil.rmtree("./chroma_db")
# print("\nCleaned up ChromaDB directory.")
Explanation:
- `ChatPromptTemplate`: we define a prompt that explicitly tells the LLM to use the provided `context` to answer the `question`. This is crucial for guiding the LLM to be factual.
- `RunnablePassthrough`: a LangChain utility that simply passes its input through. Here, it passes the original `agent_query` to the `question` placeholder in the prompt.
- `StrOutputParser`: ensures the LLM’s output is returned as a simple string.
- `rag_chain`: the core of our RAG system, a sequence of operations:
  - `{"context": retriever, "question": RunnablePassthrough()}` prepares the input for the prompt. It calls the `retriever` with the incoming query (which is also passed through unchanged as the `question`), and the retrieved documents are mapped to the `context` key.
  - `| prompt`: the prepared context and question are fed into our `ChatPromptTemplate`.
  - `| llm`: the fully constructed prompt is sent to the `gpt-3.5-turbo` LLM.
  - `| StrOutputParser()`: the LLM’s raw output is parsed into a string.
- Finally, we `invoke` the `rag_chain` with our `agent_query`, and the entire RAG process (retrieve, then generate) is executed.
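One refinement worth knowing: the chain above passes the retriever’s raw list of `Document` objects into the prompt, where it gets stringified. Joining the chunks yourself produces a cleaner context. A sketch of that variant (piping the retriever into a plain function works because LangChain coerces functions into runnables):

```python
def format_docs(docs):
    """Join retrieved chunks into one clean context string."""
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```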
Run the full agent_rag.py script. You should now see a comprehensive answer from the agent, combining information from multiple retrieved chunks about both the Amazon rainforest and Python.
This simple example demonstrates the fundamental building blocks of RAG. In more complex agentic systems, the agent’s planning and reasoning modules would decide when to invoke RAG, what questions to ask the retriever, and how to integrate the retrieved information into its overall plan or response.
Mini-Challenge: Expanding Your Agent’s Knowledge
Now it’s your turn to extend your agent’s long-term memory!
Challenge:
- Add a new document to the `raw_documents` list in `agent_rag.py` about a topic not currently covered (e.g., “The latest stable version of Python is 3.12, released in October 2023, offering significant performance improvements.”).
- Modify the `agent_query` variable to ask a question that can only be answered by the information in your newly added document.
- Run the script again.
Hint: Remember that when you run the script, Chroma.from_documents will re-embed and store your entire document set, including the new one. If you want to simulate adding to an existing database without re-embedding everything, you’d typically use vectorstore.add_documents() after initializing Chroma from an existing persist_directory. For this challenge, re-running from_documents is fine.
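If you do want to try the incremental route, here is a hedged sketch, assuming a `vectorstore` opened from the existing `persist_directory`:

```python
from langchain_core.documents import Document

# Append a single new chunk without re-embedding the existing ones.
new_doc = Document(
    page_content="The latest stable version of Python is 3.12, released in October 2023."
)
vectorstore.add_documents([new_doc])
```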
What to Observe/Learn: Observe how the agent, powered by RAG, can now answer questions based on the new, previously unknown information. This highlights the dynamic and extensible nature of RAG-enabled agents. They don’t need to be retrained to learn new facts; they just need access to an updated knowledge base.
Common Pitfalls & Troubleshooting in Agentic RAG
While powerful, RAG systems come with their own set of challenges. Being aware of these common pitfalls will help you design more robust agents.
Irrelevant Retrievals (Garbage In, Garbage Out):
- Problem: The retriever fetches chunks that are not truly relevant to the query, leading the LLM to generate an incorrect or off-topic answer.
- Causes:
- Poor embedding model: The embedding model doesn’t accurately capture semantic similarity for your specific domain.
- Suboptimal chunking: Chunks are too small (lacking context) or too large (diluting relevance).
- Noisy or ambiguous data: The knowledge base itself contains conflicting, poorly written, or irrelevant information.
- Query-document mismatch: The way the query is phrased doesn’t align well with the embedded documents.
- Troubleshooting:
- Experiment with different embedding models.
- Adjust `chunk_size` and `chunk_overlap`. Consider advanced chunking strategies (e.g., parent document retriever, hierarchical chunking).
- Clean and preprocess your knowledge base.
- Implement re-ranking: after initial retrieval, use a smaller but more precise model (or even the main LLM) to re-rank the top-K retrieved documents for better relevance, as in the sketch below.
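Here is a minimal LLM-based re-ranking sketch, using the `llm` and `vectorstore` from earlier. It is illustrative rather than production-grade; a dedicated cross-encoder re-ranker would typically be faster and cheaper.

```python
def rerank_with_llm(query, docs, llm, top_n=2):
    """Score each retrieved chunk for relevance, then keep the best ones."""
    scored = []
    for doc in docs:
        raw = llm.invoke(
            "On a scale of 1-10, how relevant is this passage to the question? "
            "Reply with a single number.\n"
            f"Question: {query}\nPassage: {doc.page_content}"
        ).content.strip()
        try:
            score = float(raw.split()[0])
        except ValueError:
            score = 0.0  # unparseable reply: treat as irrelevant
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

# Usage: over-retrieve, then re-rank down to the best few.
candidates = vectorstore.similarity_search("What is Python used for?", k=6)
best = rerank_with_llm("What is Python used for?", candidates, llm)
```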
Context Window Overflow:
- Problem: Retrieving too many documents, or documents that are individually too large, can exceed the LLM’s maximum context window, leading to errors or truncated responses.
- Causes:
- `k` (the number of retrieved documents) is set too high.
- `chunk_size` is too large.
- Troubleshooting:
- Carefully balance `k` and `chunk_size` with your chosen LLM’s context window.
- Implement context compression or summarization: before passing to the LLM, summarize the retrieved documents or filter out redundant information (see the sketch after this list).
- Consider LLMs with larger context windows (e.g., Claude 3 Opus, GPT-4 Turbo).
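A hedged sketch of the compression idea: spend one extra LLM call to boil the retrieved chunks down to only the facts relevant to the question, shrinking the context before augmentation.

```python
def compress_context(query, docs, llm):
    """Summarize retrieved chunks down to question-relevant facts."""
    passages = "\n\n".join(doc.page_content for doc in docs)
    return llm.invoke(
        "Extract only the facts from these passages that help answer the "
        "question. Be concise.\n"
        f"Question: {query}\nPassages:\n{passages}"
    ).content
```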
Latency Issues:
- Problem: The RAG process (embedding query, vector search, LLM inference) can add significant latency, impacting the agent’s responsiveness.
- Causes:
- Slow embedding model inference.
- Inefficient vector database indexing or slow query times, especially with very large knowledge bases.
- High latency LLM API calls.
- Troubleshooting:
- Optimize embedding model choice (smaller, faster models if acceptable quality).
- Choose a production-ready vector database with good scaling and indexing (e.g., Pinecone, Qdrant).
- Cache frequently accessed information.
- Consider batching embedding calls where possible.
“Hallucinations” Despite RAG:
- Problem: Even with retrieved context, the LLM might still generate incorrect or fabricated information, or misinterpret the provided context.
- Causes:
- Ambiguous or contradictory retrieved context.
- LLM’s inherent tendency to “fill in gaps” even when explicitly told not to.
- Poorly designed prompt that doesn’t sufficiently constrain the LLM to the context.
- Troubleshooting:
- Refine your prompt template: Be very explicit with instructions like “Based only on the following context…” and “If the answer is not in the context, state ‘I don’t know.’”
- Improve the quality and clarity of your knowledge base.
- Implement confidence scores or fact-checking: Have the agent evaluate its own answer against the retrieved context or use external tools to verify facts.
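As a simple starting point for that last idea, here is a hedged sketch of an answer self-check, where the agent asks the LLM to audit its own answer against the retrieved context:

```python
def answer_is_grounded(answer, docs, llm):
    """Crude groundedness check: have the LLM audit the answer against the context."""
    context = "\n\n".join(doc.page_content for doc in docs)
    verdict = llm.invoke(
        "Is every factual claim in the answer supported by the context? "
        "Reply YES or NO.\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}"
    ).content.strip().upper()
    return verdict.startswith("YES")
```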
By proactively addressing these issues, you can build more reliable and effective RAG-powered agents.
Summary
Congratulations! You’ve successfully navigated the exciting world of long-term memory for AI agents. In this chapter, we’ve covered:
- The limitations of LLM context windows and the need for external knowledge.
- What Retrieval-Augmented Generation (RAG) is, why it’s vital for agents, and its high-level workflow.
- The role of embeddings in converting text into numerical representations for semantic search.
- How vector databases efficiently store and retrieve these embeddings.
- The core components of a RAG system: document loading, chunking, embedding generation, vector storage, retrieval, and prompt augmentation.
- Hands-on implementation of a basic RAG system using Python, `langchain-community`, `openai`, and `chromadb`.
- Common pitfalls in RAG (irrelevant retrievals, context overflow, latency, persistent hallucinations) and strategies to mitigate them.
You now understand how to give your agents access to a vast, dynamic, and up-to-date knowledge base, making them significantly more capable and grounded. In the next chapter, we’ll explore how agents can actively use this knowledge, along with external tools, to perform complex actions and solve real-world problems. Get ready to see your agents take on more sophisticated tasks!