Persistent Memory & Context Management: Remembering the Past

Introduction: Why Agents Need a Memory Palace

Welcome back, fellow AI adventurer! In previous chapters, we’ve explored the building blocks of AI agents and how they can perform multi-step tasks. But have you ever noticed how large language models (LLMs) can sometimes “forget” what was said just a few turns ago in a conversation? Or how an agent might restart a complex task from scratch if interrupted? This is where the magic of memory and context management comes in!

Think about it: as humans, we don’t just process information in isolation. We remember past conversations, learn from experiences, and keep track of our progress on tasks. For AI agents to truly be intelligent, conversational, and capable of handling complex, long-running workflows, they need similar capabilities. They need a way to remember what’s happened, understand the current situation, and maintain state across multiple interactions.

In this chapter, we’ll dive deep into how modern AI agent frameworks tackle this crucial challenge. We’ll explore the difference between short-term and long-term memory, learn about state management, and see practical examples of how frameworks like LangGraph, AutoGen, CrewAI, and Semantic Kernel implement these concepts. By the end, you’ll be able to design agents that can recall past events, learn over time, and pick up exactly where they left off!

Core Concepts: Why Agents Need a Memory Palace

Before we jump into the frameworks, let’s solidify our understanding of the fundamental concepts. Why is memory so vital for AI agents, and what different kinds of memory do we need?

The Challenge of Stateless LLMs

At their core, most LLM calls are stateless. This means each time you send a prompt to an LLM, it processes that prompt independently of any previous prompts you might have sent. It doesn’t inherently “remember” the conversation history unless you explicitly include that history in the new prompt.

This stateless nature presents a significant challenge for building conversational agents or agents that handle multi-step tasks. Imagine a customer support agent that forgets your name and previous query with every response! Not very helpful, right?

The solution lies in providing the LLM with the necessary context. Context is all the relevant information an agent needs to perform its current task effectively. This includes:

Conversation History: What has been said so far?
User Preferences: Does the user have specific likes or dislikes?
Past Learnings: What facts or insights has the agent gathered?
Current Task Status: What step are we on in a multi-step process?

This leads us to differentiate between two primary types of memory, along with a related concept: state management.

Short-Term Memory: The Conversation Buffer

Short-term memory, often called the context window, is like our working memory. It holds the immediate, recent interactions that are directly relevant to the current conversation or task step.

What it is: This is typically managed by appending recent messages (user inputs, agent responses, tool outputs) directly into the prompt that’s sent to the LLM. The LLM then processes this entire sequence to generate its next response, maintaining conversational coherence.

Why it’s important:

Conversational Flow: Allows the agent to refer to previous turns, answer follow-up questions, and maintain context within a single dialogue.
Immediate Relevance: Keeps the most recent and relevant information readily available for decision-making.

Common Techniques:

Conversation Buffer: Simply stores the last ‘N’ messages.
Conversation Summary: Summarizes older parts of the conversation to keep the context concise, especially for longer dialogues, to avoid hitting the LLM’s token limit.
Token Management: Intelligently truncating or summarizing messages to fit within the LLM’s finite context window.

The Catch: LLMs have a finite context window (measured in tokens). If the conversation history grows too long, older messages will be truncated or ignored, leading to the agent “forgetting” earlier parts of the discussion. This is a constant balancing act!

Long-Term Memory: Persistent Knowledge

Long-term memory is where agents store information they need to recall over extended periods, across different sessions, or for generalized knowledge retrieval. This is akin to our personal knowledge base or factual memory.

What it is: This involves storing information outside the direct LLM prompt, typically in a structured database or a specialized vector database. When the agent needs to recall something, it retrieves relevant information from this store and injects it into the LLM’s short-term context.

Why it’s important:

Persistence: Information is retained indefinitely, even after the current conversation ends.
Knowledge Base: Allows agents to access a vast amount of information (documents, facts, user profiles) that wouldn’t fit in a single context window.
Personalization: Agents can remember user preferences, past interactions, or learned behaviors.
Learning: Agents can “learn” new facts or patterns by adding them to long-term memory.

Common Techniques:

Vector Databases (Vector Stores): Information (text, images, etc.) is converted into numerical representations called embeddings. These embeddings are stored in a vector database, allowing for fast semantic search (finding information similar in meaning). When an agent needs information, it queries the vector store with an embedding of its current context, retrieving relevant chunks of knowledge.
Traditional Databases: For structured data like user profiles, order history, etc.
Knowledge Graphs: For representing complex relationships between entities.

State Management: Knowing Where We Are

While memory deals with what an agent knows or has experienced, state management deals with where an agent is in a particular workflow or process.

What it is: State management involves tracking the current step, status, or phase of an agent’s operation. For multi-step workflows, this means knowing which task has been completed, which is pending, and what information has been gathered so far.

Why it’s important:

Workflow Orchestration: Enables complex, multi-stage processes by ensuring tasks are executed in the correct order and dependencies are met.
Resilience: Allows agents to pause a task and resume it later from the exact point of interruption, without losing progress.
Debugging: Provides visibility into the agent’s current operational status.

How it’s handled: Frameworks often use internal data structures, state machines, or graph-based models to represent and manage the flow of information and control between different agent steps or nodes.

Visualizing Memory and State in an Agent

Let’s use a simple diagram to illustrate how these components interact:

graph TD User_Input[User Input] --> Agent[AI Agent] Agent --> LLM[Large Language Model] subgraph Memory_System["Memory System"] Short_Term_Memory[Short-Term Memory] Long_Term_Memory[Long-Term Memory] end subgraph State_Manager["State Manager"] Workflow_State[Workflow State] end User_Input --> Short_Term_Memory Short_Term_Memory --> LLM LLM --> Short_Term_Memory Agent --> Long_Term_Memory Long_Term_Memory --> Agent Agent --> Workflow_State Workflow_State --> Agent LLM --> Agent_Output[Agent Output] Agent_Output --> User_Input

In this diagram:

User Input and Agent Output flow through the AI Agent.
The Short-Term Memory (context window) directly feeds into and is updated by the LLM.
The Long-Term Memory acts as an external knowledge base that the Agent can query and update.
The State Manager tracks the agent’s progress through a workflow, guiding the agent’s actions.

Step-by-Step Implementation: Building Memory into Agents

Now that we understand the core concepts, let’s get our hands dirty and implement memory and state management using our favorite AI agent frameworks. We’ll build up simple examples incrementally for each.

First, ensure you have the necessary libraries installed. As of March 20, 2026, these are the current stable versions:

pip install -U langchain==0.1.13 langchain-openai==0.1.1 langgraph==0.0.30 autogen==0.2.20 crewai==0.28.8 semantic-kernel==0.9.1 openai==1.16.1 qdrant-client==1.8.0

Remember to set your OPENAI_API_KEY (or other LLM provider keys) as an environment variable before running the examples: export OPENAI_API_KEY="YOUR_API_KEY".

1. LangGraph (and LangChain) - Conversational History

LangGraph, building on LangChain’s ecosystem, provides powerful tools for managing conversational memory. We’ll use RunnableWithMessageHistory to manage short-term conversational context.

Step 1: Set up your environment and basic LLM chain. Create a file named langgraph_memory_example.py.

# langgraph_memory_example.py
import os
from langchain_core.messages import HumanMessage, AIMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

# Set your OpenAI API key as an environment variable (e.g., export OPENAI_API_KEY="YOUR_KEY")
# For local testing, you can uncomment and set it directly, but environment variables are recommended for security.
# os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

# 1. Define your LLM
# We're using gpt-4o, a powerful model available as of 2026-03-20.
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# 2. Define your prompt with a MessagesPlaceholder for history
# The MessagesPlaceholder is crucial; it tells the prompt where to inject the conversation history.
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful AI assistant. Keep your responses concise."),
        MessagesPlaceholder(variable_name="history"), # This is where memory goes!
        ("human", "{input}"),
    ]
)

# 3. Create a simple chain by piping the prompt to the LLM
chain = prompt | llm

print("Basic chain created. Now let's add memory!")

Explanation: We start with the fundamental components: an LLM, a prompt template, and a chain. The MessagesPlaceholder in the prompt is a special instruction to LangChain that indicates a slot for dynamic message history.

Step 2: Add a message history store and wrap your chain. Now, let’s introduce RunnableWithMessageHistory to manage the actual conversation history.

# Continue in langgraph_memory_example.py

from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain.memory import ChatMessageHistory # In-memory history store

# This dictionary will act as our simple, in-memory session store.
# In a real application, you'd use a persistent database (Redis, Postgres, etc.).
store = {}

# This function provides a ChatMessageHistory object for a given session ID.
# If a session ID is new, it creates a fresh history.
def get_session_history(session_id: str) -> ChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

# 4. Wrap the chain with RunnableWithMessageHistory
#    - The first argument is our original chain.
#    - The second is our function to retrieve session history.
#    - `input_messages_key` tells it which key in the input dictionary contains the new user message.
#    - `history_messages_key` tells it which key in the prompt's MessagesPlaceholder corresponds to history.
with_message_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

print("Chain wrapped with RunnableWithMessageHistory.")

Explanation: RunnableWithMessageHistory is a higher-order runnable that takes care of fetching, updating, and passing the correct message history to your underlying chain. Our get_session_history function simulates how you might retrieve history for different users or sessions.

Step 3: Interact with the agent across multiple sessions. Let’s see the memory in action!

# Continue in langgraph_memory_example.py

print("\n--- Conversation 1 (Session 'user123') ---")
# First interaction for 'user123'
# The `config` dictionary is where we pass the session_id to RunnableWithMessageHistory.
response1 = with_message_history.invoke(
    {"input": "Hi there! My name is Alice."},
    config={"configurable": {"session_id": "user123"}}
)
print(f"Agent: {response1.content}")

# Second interaction in the same session, agent should remember Alice
response2 = with_message_history.invoke(
    {"input": "What is my name?"},
    config={"configurable": {"session_id": "user123"}}
)
print(f"Agent: {response2.content}")

print("\n--- Conversation 2 (Session 'user456') ---")
# A new session, the agent should not remember Alice, it's a fresh start.
response3 = with_message_history.invoke(
    {"input": "Hello! I am Bob."},
    config={"configurable": {"session_id": "user456"}}
)
print(f"Agent: {response3.content}")

response4 = with_message_history.invoke(
    {"input": "What is my name?"},
    config={"configurable": {"session_id": "user456"}}
)
print(f"Agent: {response4.content}")

print("\n--- Back to Conversation 1 (Session 'user123') ---")
# Back to Alice's session, the agent should recall Alice's name again.
response5 = with_message_history.invoke(
    {"input": "Can you remind me of something we discussed earlier?"},
    config={"configurable": {"session_id": "user123"}}
)
print(f"Agent: {response5.content}")

Explanation: When you run this script, you’ll observe how the agent correctly remembers Alice’s name within session user123 but treats Bob in session user456 as a new user. This demonstrates effective short-term conversational memory managed by RunnableWithMessageHistory.

2. AutoGen - Agent Conversation History

AutoGen agents inherently manage their own conversation history. When agents chat, all messages are automatically recorded, forming their short-term memory for that specific interaction.

Step 1: Set up LLM configuration and create agents. Create a file named autogen_memory_example.py.

# autogen_memory_example.py
import autogen
import os

# Set your OpenAI API key as an environment variable
# os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

# Configuration for the LLM
# autogen.config_list_from_json is a common way to load configs.
# Ensure you have an OAI_CONFIG_LIST file or environment variable set up.
# Example OAI_CONFIG_LIST content:
# [
#     {
#         "model": "gpt-4o",
#         "api_key": "YOUR_OPENAI_API_KEY"
#     }
# ]
config_list = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={
        "model": ["gpt-4o", "gpt-4-turbo", "gpt-3.5-turbo"], # Prioritizing gpt-4o for 2026-03-20
    },
)

# 1. Create a User Proxy Agent
# This agent represents the human user and can execute code.
user_proxy = autogen.UserProxyAgent(
    name="Admin",
    system_message="A human admin. Interact with the Planner to review the plan and provide feedback.",
    code_execution_config={"last_n_messages": 2, "work_dir": "planning"}, # For tool execution context
    human_input_mode="NEVER", # For demonstration, set to NEVER for automatic flow
    llm_config={"config_list": config_list},
)

# 2. Create an Assistant Agent
# This agent is designed to plan tasks.
planner = autogen.AssistantAgent(
    name="Planner",
    system_message="You are a helpful AI assistant that plans tasks.",
    llm_config={"config_list": config_list},
)

print("AutoGen agents created.")

Explanation: We define two agents, a UserProxyAgent (representing a human interface) and an AssistantAgent (our planner). The llm_config connects them to our LLM provider.

Step 2: Initiate a chat and observe history. Now, let’s have them talk and then inspect their internal memory.

# Continue in autogen_memory_example.py

# 3. Initiate a chat between the user_proxy and the planner
print("\n--- Initial Chat ---")
user_proxy.initiate_chat(
    planner,
    message="Plan a simple dinner party for 4 people, including a starter, main course, and dessert. Suggest a cuisine."
)

# 4. Observe the conversation history stored by each agent
print("\n--- Planner's Message History After Initial Chat ---")
# Each agent stores messages exchanged with *other* specific agents.
# planner.chat_messages[user_proxy] holds the history of messages from user_proxy to planner.
for msg in planner.chat_messages[user_proxy]:
    print(f"Role: {msg['role']}, Content: {msg['content'][:100]}...") # Truncate for brevity

print("\n--- User Proxy's Message History After Initial Chat ---")
# user_proxy.chat_messages[planner] holds the history of messages from planner to user_proxy.
for msg in user_proxy.chat_messages[planner]:
    print(f"Role: {msg['role']}, Content: {msg['content'][:100]}...")

Explanation: The initiate_chat call starts a multi-turn conversation. AutoGen automatically populates the chat_messages attribute of each agent, which acts as their short-term memory for that specific conversation partner.

Step 3: Continue the chat and see the updated history. Let’s add another turn to the conversation.

# Continue in autogen_memory_example.py

# 5. Continue the chat with a follow-up question
print("\n--- Follow-up Chat ---")
user_proxy.send(
    message="Great plan! Now, can you suggest a wine pairing for the main course?",
    recipient=planner
)

# 6. Observe the updated history after the follow-up
print("\n--- Planner's Message History After Follow-up ---")
for msg in planner.chat_messages[user_proxy]:
    print(f"Role: {msg['role']}, Content: {msg['content'][:100]}...")

Explanation: When user_proxy.send is used, the planner agent receives the new message, and its internal chat_messages are updated. Crucially, the planner implicitly uses the entire history of its conversation with user_proxy to formulate its response, demonstrating its short-term memory.

3. CrewAI - Task-Driven Context and Agent Memory

CrewAI manages context through the flow of tasks and by enabling memory on individual agents. The output of one task often becomes the input or context for the next.

Step 1: Define your LLM and an agent with memory. Create a file named crewai_memory_example.py.

# crewai_memory_example.py
import os
from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI

# Set your OpenAI API key as an environment variable
# os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

# Define the LLM (using gpt-4o as of 2026-03-20)
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# 1. Define an agent with memory enabled
# Setting `memory=True` allows this agent to retain context across its tasks.
researcher = Agent(
    role='Senior Research Analyst',
    goal='Uncover critical information about tech companies',
    backstory='A seasoned analyst with a knack for finding hidden gems in financial reports.',
    verbose=True, # Set to True to see the agent's internal thought process
    allow_delegation=False,
    llm=llm,
    memory=True # This enables the agent's short-term memory for its tasks
)

print("CrewAI agent with memory created.")

Explanation: The researcher agent is configured with memory=True. This tells CrewAI to ensure that the agent’s internal thought process and recent task context are retained as it works through its assigned tasks.

Step 2: Define sequential tasks. Now, let’s create two tasks where the second task depends on the output of the first.

# Continue in crewai_memory_example.py

# 2. Define tasks. The output of task1 will implicitly be available as context for task2
task1 = Task(
    description=(
        "Identify the top 3 emerging AI startups in Q1 2026 based on funding rounds and innovation."
        "Provide their names and primary focus areas."
    ),
    agent=researcher,
    expected_output="A bulleted list of 3 AI startups, their funding, and focus."
)

task2 = Task(
    description=(
        "Based on the identified startups from the previous task, research one of them in depth."
        "Provide a brief SWOT analysis (Strengths, Weaknesses, Opportunities, Threats) for the chosen company."
        "Remember the context of the previous task." # Explicitly guiding the agent
    ),
    agent=researcher,
    expected_output="A SWOT analysis for one specific AI startup identified previously."
)

print("CrewAI tasks defined.")

Explanation: task1 and task2 are assigned to the same researcher agent. The description of task2 explicitly instructs the agent to “Remember the context of the previous task,” which, combined with memory=True on the agent and the sequential process, allows it to leverage information from task1.

Step 3: Create and run the crew. Finally, we assemble the crew and kick off the work.

# Continue in crewai_memory_example.py

# 3. Create a crew with the agent and tasks
crew = Crew(
    agents=[researcher],
    tasks=[task1, task2],
    verbose=2, # Shows full execution logs, very helpful for debugging memory/context flow
    process=Process.sequential # Tasks run one after another, passing context
)

# 4. Kick off the crew's work
print("\n--- Starting Crew Work ---")
result = crew.kickoff()
print("\n--- Crew Work Finished ---")
print(result)

Explanation: When crew.kickoff() is called, the tasks are executed sequentially. The researcher agent, due to memory=True and the sequential Process, maintains the context from task1 and applies it to task2. The verbose=2 setting is excellent for observing how the agent’s internal thought process utilizes this context.

4. Semantic Kernel - Context Variables and MemoryStore

Semantic Kernel (SK) separates short-term context (using ContextVariables) from long-term memory (using MemoryStore). Planners then orchestrate when to use each.

Step 1: Initialize Kernel and add LLM and MemoryStore. Create a file named semantic_kernel_memory_example.py.

# semantic_kernel_memory_example.py
import semantic_kernel as sk
from semantic_kernel.connectors.ai.openai import OpenAIChatCompletion
from semantic_kernel.memory import VolatileMemoryStore # In-memory vector store for simplicity
from semantic_kernel.contents.chat_history import ChatHistory
import os

# Set your OpenAI API key as an environment variable
# os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

# 1. Initialize the Kernel
kernel = sk.Kernel()

# 2. Add an LLM service (using gpt-4o as of 2026-03-20)
kernel.add_service(
    OpenAIChatCompletion(service_id="chat-gpt", ai_model_id="gpt-4o"),
)

# 3. Add a Memory Store to the kernel for long-term memory
#    VolatileMemoryStore is in-memory for quick testing; for persistence,
#    you would use a dedicated vector database like QdrantMemoryStore, PineconeMemoryStore, etc.
memory_store = VolatileMemoryStore()
kernel.add_memory_store(memory_store=memory_store)

print("Semantic Kernel initialized with LLM and VolatileMemoryStore.")

Explanation: We set up the Kernel as the central orchestrator, add our LLM, and crucially, attach a MemoryStore. The VolatileMemoryStore is useful for development but won’t persist data across runs; for production, you’d swap it for a persistent vector database.

Step 2: Define a chat history and a simple prompt skill. Next, we’ll set up a ChatHistory object for short-term conversation and a basic prompt template.

# Continue in semantic_kernel_memory_example.py

# 4. Create a chat history for short-term context
# This object will accumulate messages for the current conversation.
chat_history = ChatHistory()

# Define a simple semantic function (prompt) that can use chat history.
# The `{{$history}}` variable will be populated by our chat loop.
prompt_template = """
    You are a helpful assistant.
    If the user talks about their preferences or facts about themselves, try to remember them.
    Current conversation:
    {{$history}}
    User: {{$input}}
    Assistant:
"""

# Create a simple prompt function from our template
chat_function = kernel.create_function_from_prompt(
    prompt_template=prompt_template,
    function_name="ChatFunction",
    plugin_name="GeneralChat"
)

print("Chat history and basic chat function (skill) created.")

Explanation: ChatHistory is SK’s way to manage the turn-by-turn conversational context. Our prompt_template is a simple skill that uses a $history variable, which we’ll manually populate.

Step 3: Implement an asynchronous chat loop with memory interaction. Now, let’s create the interactive chat where we’ll explicitly save and recall information from our MemoryStore.

# Continue in semantic_kernel_memory_example.py
import asyncio

async def chat_with_memory():
    print("\n--- Starting Chat with Memory (Type 'exit' to end) ---")
    while True:
        user_input = input("User: ")
        if user_input.lower() == "exit":
            break

        chat_history.add_user_message(user_input)

        # --- Demonstrate saving to long-term memory ---
        # In a real scenario, a planner would decide when to save. Here, we manually check for a keyword.
        if "my favorite color is" in user_input.lower():
            color_fact = user_input.split("my favorite color is")[-1].strip().replace('.', '')
            # Save the fact into our "user_profiles" collection in the memory store.
            await kernel.memory.save_information_async(
                collection="user_profiles",
                text=f"User's favorite color is {color_fact}",
                id="fav_color" # A unique ID for this piece of information
            )
            print(f"Agent: Okay, I've noted that your favorite color is {color_fact} in my long-term memory.")
            # Add this to chat history so the LLM knows we acknowledged it.
            chat_history.add_assistant_message(f"Okay, I've noted that your favorite color is {color_fact}.")
            continue # Skip normal chat response for this turn

        # --- General chat response using short-term history ---
        # We manually build the context for the chat function, including the short-term history.
        context = kernel.create_new_context()
        context["history"] = "\n".join([f"{msg.role}: {msg.content}" for msg in chat_history.messages])
        context["input"] = user_input

        # Invoke our chat function (skill) with the current user input and history.
        response = await kernel.invoke(chat_function, input=user_input, arguments=context.variables)
        print(f"Agent: {response.value}")
        chat_history.add_assistant_message(response.value)

        # --- Demonstrate recalling from long-term memory ---
        # Again, in a real system, a planner would trigger this.
        if "what is my favorite color" in user_input.lower():
            # Query the memory store for information semantically similar to "favorite color".
            retrieved_info = await kernel.memory.recall_async(
                collection="user_profiles",
                query="favorite color",
                limit=1
            )
            if retrieved_info:
                print(f"Agent (from long-term memory): I recall from my notes that {retrieved_info[0].text}.")
            else:
                print("Agent (from long-term memory): I don't have that specific information stored.")

# Run the asynchronous chat loop
if __name__ == "__main__":
    asyncio.run(chat_with_memory())

Explanation: This interactive loop ties everything together.

The chat_history object continuously collects user and agent messages for short-term context.
When a specific phrase (“my favorite color is”) is detected, we manually use kernel.memory.save_information_async to store a fact in the VolatileMemoryStore (our long-term memory).
When another phrase (“what is my favorite color”) is detected, kernel.memory.recall_async is used to retrieve semantically relevant information from the long-term memory.
For general chat, the chat_function (our prompt skill) is invoked, and we explicitly pass the chat_history as the $history variable in the prompt. This example clearly distinguishes and demonstrates the use of both short-term conversational history and explicit long-term knowledge storage and retrieval in Semantic Kernel.

Mini-Challenge: Enhancing an Agent’s Recall

You’ve seen how RunnableWithMessageHistory works for basic conversation buffering in LangGraph/LangChain. Now, let’s tackle a common problem: long conversations hitting the token limit.

Challenge: Modify the LangGraph RunnableWithMessageHistory example from earlier. Instead of using the default ChatMessageHistory (which just buffers all messages), configure it to use ConversationSummaryBufferMemory. This memory type summarizes older parts of the conversation to keep the context concise, preventing context window overflow while still retaining the essence of the dialogue.

Hint:

You’ll need to import ConversationSummaryBufferMemory from langchain.memory.
ConversationSummaryBufferMemory requires an LLM to perform the summarization.
You’ll need to adjust the get_session_history function to return an instance of ConversationSummaryBufferMemory instead of ChatMessageHistory. Remember to pass the LLM to it and specify a max_token_limit.

What to Observe/Learn: After running a longer conversation (you might need to manually type more turns to see the effect), try to inspect the store dictionary. While ConversationSummaryBufferMemory won’t show you raw messages, the idea is that the LLM will still maintain context without the full, raw history consuming tokens. The key learning is how to swap out different memory strategies.

# Your turn! Modify the LangGraph example here.
# You'll need to import ConversationSummaryBufferMemory
# from langchain.memory import ConversationSummaryBufferMemory
# And then update the get_session_history function.

# Example structure (don't copy-paste, modify the original!):
# from langchain.memory import ConversationSummaryBufferMemory
# ...
# store = {}
# def get_session_history_summary(session_id: str) -> ConversationSummaryBufferMemory:
#     if session_id not in store:
#         # Use the 'llm' defined at the top of your script
#         store[session_id] = ConversationSummaryBufferMemory(llm=llm, max_token_limit=150, return_messages=True)
#     return store[session_id]
#
# # Then, use this new function when creating your RunnableWithMessageHistory
# with_message_history_summary = RunnableWithMessageHistory(
#     chain,
#     get_session_history_summary, # Use your new function here
#     input_messages_key="input",
#     history_messages_key="history",
# )
# ... then run some longer conversations with this new object ...

Common Pitfalls & Troubleshooting

Managing memory and state in AI agents can be tricky. Here are some common issues and how to approach them:

Context Window Overflow:
- Pitfall: The most frequent issue. Your agent starts “forgetting” earlier parts of a long conversation because the history exceeds the LLM’s token limit.
- Troubleshooting:
  - Summarization: Implement ConversationSummaryMemory or ConversationSummaryBufferMemory (as in your challenge) to condense older messages.
  - Windowing: Use ConversationBufferWindowMemory to only keep the most recent k messages.
  - Retrieval Augmented Generation (RAG): For long-term knowledge, don’t dump everything into the context. Instead, retrieve only the most relevant chunks from a vector store and inject those.
  - Refine Prompts: Make prompts more concise to save tokens.
Inconsistent State / Agent Getting “Lost”:
- Pitfall: In complex multi-step workflows, an agent might lose track of which step it’s on, or critical information gathered in a previous step isn’t available for the current one.
- Troubleshooting:
  - Explicit State Management: For LangGraph, ensure your State object is clearly defined and updated by each node. For CrewAI, ensure task outputs are correctly passed as context.
  - Logging: Verbose logging (e.g., verbose=True in CrewAI, or printing LangGraph state) is crucial to see how context and state are evolving.
  - Atomic Steps: Break down complex tasks into smaller, well-defined, and atomic steps. Each step should have clear inputs and outputs.
Token Usage Bloat & High Costs:
- Pitfall: Sending excessively long prompts (due to large conversation histories or too much retrieved information) leads to higher API costs and slower response times.
- Troubleshooting:
  - Memory Strategies: Employ summarization and windowing as discussed for context window overflow.
  - Efficient RAG: Ensure your vector store retrieval is precise and only fetches truly relevant documents. Optimize chunking strategies.
  - Caching: Cache LLM responses for repetitive queries if appropriate.
  - Model Choice: Use smaller, cheaper models (e.g., gpt-3.5-turbo) for tasks that don’t require the full power of larger models.
Debugging Memory Issues:
- Pitfall: It’s hard to tell what the agent “remembers” or why it’s behaving unexpectedly.
- Troubleshooting:
  - Print Context: Always print the full prompt (including history and retrieved context) that is sent to the LLM during development. This is the single most effective debugging technique.
  - Inspect Internal States: For AutoGen, inspect agent.chat_messages. For LangGraph, print the state object at each node.
  - Unit Testing: Write tests for your memory and retrieval components to ensure they’re functioning as expected.

Summary

Phew! We’ve covered a lot of ground in this chapter, transforming our agents from forgetful automatons into intelligent conversationalists and persistent task-doers.

Here are the key takeaways:

Stateless LLMs: Individual LLM calls are stateless, necessitating external memory and context management for coherent agent behavior.
Short-Term Memory (Context Window): Manages the immediate conversation history, often using buffers, windowing, or summarization techniques to fit within token limits. Frameworks like LangGraph’s RunnableWithMessageHistory, AutoGen’s internal message queues, and SK’s ContextVariables handle this.
Long-Term Memory (Persistent Knowledge): Stores information beyond the current session, typically using vector databases and embeddings for semantic retrieval. LangChain/LangGraph’s VectorStoreRetrieverMemory and Semantic Kernel’s MemoryStore are prime examples.
State Management: Tracks the progress and current status of multi-step workflows, ensuring agents know “where they are.” LangGraph’s graph-based state, AutoGen’s conversational flow, and CrewAI’s task-driven process are different approaches.
Framework Differences:
- LangGraph: Excellent for explicit state machines and integrates seamlessly with LangChain’s rich memory ecosystem (buffers, summaries, vector stores).
- AutoGen: Memory is inherent in its multi-agent conversational design, with each agent maintaining its chat history and offering persistence methods.
- CrewAI: Manages context through agent memory=True settings and the sequential flow of tasks, allowing information to pass between steps.
- Semantic Kernel: Uses ContextVariables for short-term context and a dedicated MemoryStore for long-term, vector-based knowledge retrieval, often orchestrated by planners.

By mastering these memory and state management techniques, you’re empowering your AI agents to tackle more complex, personalized, and robust applications. They’ll truly be able to remember the past and leverage it for future actions!

In the next chapter, we’ll dive into advanced orchestration patterns, building upon our understanding of memory to create even more sophisticated and dynamic agent workflows. Get ready to connect these intelligent components into powerful systems!

References

LangChain Expression Language (LCEL) - History: https://python.langchain.com/docs/expression_language/how_to/message_history
LangChain - Memory: https://python.langchain.com/docs/modules/memory/
LangGraph - State: https://langchain.com/docs/langgraph/concepts/state
AutoGen - Persistence: https://microsoft.github.io/autogen/docs/Use-Cases/Agent_Chat_Persistence
CrewAI - Agents: https://docs.crewai.com/core-concepts/agents/
Semantic Kernel - Memory: https://learn.microsoft.com/en-us/semantic-kernel/concepts/memory/
Semantic Kernel - Context Variables: https://learn.microsoft.com/en-us/semantic-kernel/concepts/context-variables/

This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.

Persistent Memory & Context Management: Remembering the Past

Table of Contents

Introduction: Why Agents Need a Memory Palace

Core Concepts: Why Agents Need a Memory Palace

The Challenge of Stateless LLMs

Short-Term Memory: The Conversation Buffer

Long-Term Memory: Persistent Knowledge

State Management: Knowing Where We Are

Visualizing Memory and State in an Agent

Step-by-Step Implementation: Building Memory into Agents

1. LangGraph (and LangChain) - Conversational History

2. AutoGen - Agent Conversation History

3. CrewAI - Task-Driven Context and Agent Memory

4. Semantic Kernel - Context Variables and MemoryStore

Mini-Challenge: Enhancing an Agent’s Recall

Common Pitfalls & Troubleshooting

Summary

References