Introduction: The Agent’s Ephemeral Mind

Welcome back, future agent architect! In our previous chapters, we laid the groundwork for understanding autonomous agents, their planning capabilities, and how they can leverage external tools to interact with the world. But what happens when an agent needs to remember something from a previous interaction? How does it maintain a coherent conversation? This is where memory comes into play.

In this chapter, we’re diving into the fascinating world of short-term memory for AI agents. Think of this as the agent’s immediate working memory – the thoughts and conversations it can recall right now to inform its next action. We’ll explore the fundamental concept of the Large Language Model’s (LLM) context window, learn how to manage conversation history effectively, and build a practical Python example to implement basic in-memory recall. Mastering short-term memory is crucial for creating agents that can hold meaningful, multi-turn interactions and make informed decisions based on recent events, preventing them from “forgetting” what just happened.

By the end of this chapter, you’ll be able to:

  • Understand the LLM context window and its limitations.
  • Implement strategies for managing conversation history.
  • Build a Python agent that maintains short-term conversational memory.
  • Identify common pitfalls in short-term memory management.

Ready to give your agent a short-term memory boost? Let’s get started!

Core Concepts: The Agent’s Immediate Recall

Just like humans, agents need to remember things to function effectively. Short-term memory is the most basic form of recall, essential for maintaining context within a single interaction or a brief series of turns.

The LLM’s “Working Memory”: The Context Window

At the heart of an agent’s short-term memory is the Large Language Model (LLM) context window. Imagine an LLM as a brilliant but incredibly forgetful assistant. It can only process and “remember” what you provide it in its immediate input, which we call the context window.

What is the Context Window?

The context window is a fixed-size buffer where all input to the LLM resides:

  • System Prompt: Instructions about the agent’s persona and rules.
  • User Input: The current question or command from the user.
  • Conversation History: Previous turns of dialogue between the user and the agent.
  • Tool Outputs: Results from any tools the agent used.
  • Internal Monologue: The agent’s thoughts, plans, and reasoning steps (especially in architectures like ReAct).

The LLM processes everything within this window to generate its next output. Once the response is generated, the LLM itself doesn’t inherently “remember” anything from that specific interaction for the next independent call. It’s up to us, the developers, to explicitly pass back the relevant history in subsequent API calls.
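
To make this concrete, here is a minimal sketch of two independent API calls using the openai Python client (the same client we set up later in this chapter). The second call only knows the user's name because we explicitly resend the earlier turns:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# First call: the model sees only this single user message.
first = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hi, my name is Alice."}],
)

# Second call: the model has NO built-in memory of the first call.
# It can only answer correctly because we resend the earlier turns ourselves.
second = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Hi, my name is Alice."},
        {"role": "assistant", "content": first.choices[0].message.content},
        {"role": "user", "content": "What is my name?"},
    ],
)
print(second.choices[0].message.content)  # should mention "Alice"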

Tokens: The Building Blocks of Context

LLMs don’t count words; they count tokens. A token is a fundamental unit of text that an LLM understands. It can be a whole word (e.g., “hello”), a part of a word (e.g., “ing”), or even punctuation.

Why are tokens important? Every LLM has a maximum token limit for its context window (e.g., 8K, 16K, 128K tokens). Exceeding this limit will result in an error or truncation by the API, causing the agent to “forget” older parts of the conversation.

Think of tokens like pages in a physical notebook. Your assistant (the LLM) can only read a certain number of pages at once. If you keep adding new pages, eventually the oldest pages will fall out, and your assistant won’t see them anymore.
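
If you want to see tokenization in action, tiktoken (which we install during setup below) can show exactly how a sentence splits apart. A minimal sketch, assuming the cl100k_base encoding; the exact counts vary by encoding and model:

import tiktoken

# cl100k_base is the encoding used by many recent OpenAI chat models;
# newer models such as gpt-4o use o200k_base instead.
encoding = tiktoken.get_encoding("cl100k_base")

text = "Hello, my name is Alice."
token_ids = encoding.encode(text)

print(len(token_ids))                             # e.g. 7 tokens for this sentence
print([encoding.decode([t]) for t in token_ids])  # e.g. ['Hello', ',', ' my', ' name', ' is', ' Alice', '.']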

Impact of Context Size

  • Cost: LLM APIs typically charge per token, both for input and output. Larger context windows mean higher costs.
  • Speed: Processing more tokens generally takes longer, impacting response latency.
  • Capability: A larger context window allows the agent to retain more information, leading to more coherent, complex, and informed interactions. However, it doesn’t guarantee the LLM will use all the information effectively.
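
To get a feel for the cost dimension, here is a back-of-the-envelope calculation. The per-million-token prices are hypothetical placeholders, so substitute your provider's current rates:

# Hypothetical prices in USD per 1M tokens -- substitute your provider's real rates.
PRICE_INPUT_PER_M = 5.00
PRICE_OUTPUT_PER_M = 15.00

input_tokens = 3_500   # system prompt + history + current user message
output_tokens = 400    # the assistant's reply

cost = (input_tokens * PRICE_INPUT_PER_M + output_tokens * PRICE_OUTPUT_PER_M) / 1_000_000
print(f"Estimated cost for this single call: ${cost:.4f}")  # ~$0.0235 with these placeholder prices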

Conversation History: The Agent’s Recall Mechanism

To make an agent feel intelligent and conversational, we need to manage its conversation history. This history is essentially a list of messages, each attributed to a specific role (e.g., user, assistant, system).

Here’s an example of how conversation history might look as a list of dictionaries:

[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, my name is Alice."},
    {"role": "assistant", "content": "Hello Alice! How can I help you today?"},
    {"role": "user", "content": "Can you remind me what my name is?"}
]

Strategies for Managing Conversation History

Since the context window has limits, we can’t just send the entire history indefinitely. We need smart strategies:

  1. Full History (Simple but Limited):

    • How: Send every single message from the start of the conversation.
    • Pros: Easiest to implement, perfect recall within token limits.
    • Cons: Quickly hits token limits, becomes expensive and slow for longer conversations.
  2. Windowing / Truncation (Most Common Basic Approach):

    • How: Keep only the N most recent messages (or messages that fit within a specific token budget). Oldest messages are discarded.
    • Pros: Simple, predictable token usage, maintains recent context.
    • Cons: Agent “forgets” older, potentially important information.
  3. Summarization:

    • How: Periodically (or when context gets too large), use the LLM itself to summarize older parts of the conversation into a concise “summary” message. This summary then replaces the original long history (see the sketch after this list).
    • Pros: Reduces token count while retaining key information, more intelligent than simple truncation.
    • Cons: Summarization can lose nuance, adds an extra LLM call (cost/latency).
  4. Retrieval (Advanced - Bridge to Long-Term Memory):

    • How: Store the full conversation history (and other relevant data) in an external database (e.g., a vector database). When the agent needs to “remember,” it queries this database to retrieve only the most relevant pieces of information for the current turn.
    • Pros: Scales to very long conversations, highly intelligent recall.
    • Cons: More complex to implement, requires external storage and retrieval mechanisms. (We’ll cover this in detail in the next chapter on long-term memory!).
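
To make strategy 3 concrete, here is a minimal sketch of the summarization approach. It assumes the client and count_tokens helper we build later in this chapter; the threshold, number of retained turns, and prompt wording are illustrative assumptions, not a fixed recipe:

def summarize_if_needed(history, client, max_tokens=3000, keep_recent=4):
    """If the history is too long, replace older turns with an LLM-written summary."""
    if count_tokens(history) <= max_tokens or len(history) <= keep_recent + 1:
        return history  # small enough, or nothing old enough to summarize

    system_msg = history[0]
    old_turns = history[1:-keep_recent]   # everything except the most recent turns
    recent_turns = history[-keep_recent:]

    # Ask the LLM to compress the older turns into one short summary message.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_turns)
    summary = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Summarize this conversation in a few sentences:\n{transcript}"}],
    ).choices[0].message.content

    summary_msg = {"role": "system", "content": f"Summary of earlier conversation: {summary}"}
    return [system_msg, summary_msg] + recent_turns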

In-Memory Storage: Simple, Ephemeral Memory

For many basic agent applications, especially for single-session interactions or testing, in-memory storage is perfectly adequate for short-term memory. This simply means storing the conversation history as a variable (like a Python list) within the running program.

  • Concept: The messages list we discussed earlier is a prime example. As the conversation progresses, you append new user and assistant turns to this list.
  • Limitations:
    • Ephemeral: All memory is lost when the program stops or restarts.
    • Not Scalable: Not suitable for multi-user applications or persistent agents.
  • Use Cases: Simple chatbots, command-line agents, proof-of-concept projects, or as the immediate buffer before more sophisticated long-term memory systems are invoked.

In the next section, we’ll build an agent using this simple yet effective in-memory storage for its short-term recall!

Step-by-Step Implementation: Building a Conversational Agent with Short-Term Memory

Let’s put these concepts into practice. We’ll build a simple Python-based conversational agent that remembers previous turns using an in-memory list and implements a basic truncation strategy.

Setup: Get Your Workspace Ready

As of 2026-03-20, we’ll use Python 3.10+ along with the openai and tiktoken libraries (pinned to the versions shown below).

  1. Create a New Project Directory:

    mkdir agent_memory_guide
    cd agent_memory_guide
    
  2. Set Up a Virtual Environment (Best Practice!):

    python -m venv .venv
    # On Windows:
    .venv\Scripts\activate
    # On macOS/Linux:
    source .venv/bin/activate
    
  3. Install the OpenAI Library:

    pip install openai==1.14.0 tiktoken==0.6.0
    
    • We’re installing openai for LLM interaction and tiktoken to help us estimate token usage, which is vital for context management.
  4. Set Your OpenAI API Key: You’ll need an API key from OpenAI. Store it securely. The recommended way is via an environment variable.

    # On macOS/Linux:
    export OPENAI_API_KEY="your_api_key_here"
    # On Windows (in PowerShell):
    $env:OPENAI_API_KEY="your_api_key_here"
    

    Replace "your_api_key_here" with your actual key. Remember to never hardcode API keys directly in your code!

Step 1: Initialize the LLM Client and Conversation History

Create a new Python file named memory_agent.py. We’ll start by setting up our LLM client and defining our initial conversation history, including a system message to establish the agent’s persona.

# memory_agent.py
import os
from openai import OpenAI
import tiktoken # For token counting

# --- Configuration ---
# As of 2026-03-20, gpt-4o is a powerful and cost-effective choice.
# Other models like gpt-3.5-turbo (for lower cost) or specific Claude/Azure models could also be used.
LLM_MODEL = "gpt-4o"
MAX_CONTEXT_TOKENS = 4096 # Example limit for our agent, typically lower than the model's full capacity
                          # to leave room for output and avoid hitting hard limits.

# Initialize the OpenAI client
# It will automatically pick up OPENAI_API_KEY from environment variables.
client = OpenAI()

# Our in-memory conversation history
# Start with a system message to define the agent's persona.
conversation_history = [
    {"role": "system", "content": "You are a friendly and helpful AI assistant named 'MemoryBot'. You love to chat and remember details about our conversation. Always try to refer to previous topics if relevant."}
]

print(f"MemoryBot initialized using model: {LLM_MODEL}")
print("Type 'quit' or 'exit' to end the conversation.")

Explanation:

  • We import os to potentially get environment variables, OpenAI for API calls, and tiktoken for token counting.
  • LLM_MODEL specifies which large language model we’ll be using. gpt-4o is a strong choice as of early 2026.
  • MAX_CONTEXT_TOKENS defines a custom limit for our agent’s context. This is often smaller than the LLM’s absolute maximum to provide a safety buffer and manage costs.
  • client = OpenAI() initializes our connection to the OpenAI API.
  • conversation_history is our Python list that will store all messages. We start it with a system message. This message is crucial for setting the agent’s behavior and persona.

Step 2: Sending a User Message and Getting a Response

Now, let’s add a function that takes a user’s message, adds it to our history, sends the entire current history to the LLM, and gets back a response.

Add this function to memory_agent.py after the initial setup:

# ... (previous code) ...

# Function to get token count of messages
def count_tokens(messages):
    """Counts tokens using tiktoken. Assumes gpt-4o encoding."""
    # gpt-4o uses the 'o200k_base' encoding; if the installed tiktoken version
    # doesn't recognize the model name, we fall back to a common encoding below.
    try:
        encoding = tiktoken.encoding_for_model(LLM_MODEL)
    except KeyError:
        # Fallback for models not directly in tiktoken's registry or future models
        encoding = tiktoken.get_encoding("cl100k_base") # Common encoding for many modern GPT models

    num_tokens = 0
    for message in messages:
        # Per-message overhead (approximate, based on OpenAI's cookbook guidance for chat models;
        # the exact overhead varies slightly between model versions).
        num_tokens += 4  # every message follows <im_start>{role/name}\n{content}<im_end>\n
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name": # if there's a name, role is omitted
                num_tokens += -1 # role is always 1 token less
    num_tokens += 2  # every reply is primed with <im_start>assistant\n
    return num_tokens

# Function to interact with the LLM and manage history
def chat_with_memorybot(user_message, history):
    # 1. Add the new user message to the history
    history.append({"role": "user", "content": user_message})

    # 2. Implement basic context window management (truncation)
    #    We'll keep the system message and then the most recent messages that fit.
    current_tokens = count_tokens(history)
    print(f"Current conversation tokens: {current_tokens}")

    # Ensure system message is always present, then truncate older user/assistant messages
    messages_to_send = [history[0]] # Always keep the system message
    for msg in reversed(history[1:]): # Iterate from most recent user/assistant messages
        temp_messages = [history[0]] + [msg] + messages_to_send[1:] # Test adding this message
        if count_tokens(temp_messages) <= MAX_CONTEXT_TOKENS:
            messages_to_send.insert(1, msg) # Insert after system message
        else:
            print(f"Truncating older message to stay within {MAX_CONTEXT_TOKENS} tokens.")
            break # Stop adding older messages

    print(f"Tokens after truncation (sent to LLM): {count_tokens(messages_to_send)}")

    try:
        # 3. Call the LLM API with the (potentially truncated) history
        response = client.chat.completions.create(
            model=LLM_MODEL,
            messages=messages_to_send, # Use the managed list
            temperature=0.7 # A bit creative, but not too wild
        )

        assistant_response = response.choices[0].message.content
        return assistant_response
    except Exception as e:
        print(f"An error occurred: {e}")
        return "I'm sorry, I couldn't process that. My brain seems to be a bit fuzzy."

Explanation:

  • count_tokens(messages): This helper function uses tiktoken to estimate the number of tokens in a list of messages. This is crucial for managing our MAX_CONTEXT_TOKENS limit. The token counting logic is adapted from OpenAI’s recommendations for chat models.
  • chat_with_memorybot(user_message, history):
    • It first appends the user_message to our history list.
    • Context Window Management: This is the core of our short-term memory strategy. We calculate the current token count. If it exceeds our MAX_CONTEXT_TOKENS, we truncate older messages. We ensure the system message is always kept as the first message, and then we add user/assistant messages starting from the most recent until we hit the token limit. This ensures the agent always has the most recent context.
    • It then calls client.chat.completions.create(), passing our messages_to_send list. This is how the LLM “sees” the conversation history.
    • temperature=0.7 influences the creativity of the response. Lower values (e.g., 0.2) make it more deterministic; higher values (e.g., 1.0) make it more random.
    • Finally, it extracts and returns the assistant’s reply.

Step 3: Storing the Assistant’s Response

This is a critical step for maintaining memory. After receiving the LLM’s response, we must add it to our conversation_history so it’s included in future API calls.

Modify the chat_with_memorybot function to also store the assistant’s response:

# ... (previous code for chat_with_memorybot) ...

def chat_with_memorybot(user_message, history):
    # ... (previous code for appending user message and truncation) ...

    try:
        response = client.chat.completions.create(
            model=LLM_MODEL,
            messages=messages_to_send,
            temperature=0.7
        )

        assistant_response = response.choices[0].message.content

        # CRITICAL: Add the assistant's response to the full conversation history
        history.append({"role": "assistant", "content": assistant_response})

        return assistant_response
    except Exception as e:
        print(f"An error occurred: {e}")
        return "I'm sorry, I couldn't process that. My brain seems to be a bit fuzzy."

Explanation:

  • history.append({"role": "assistant", "content": assistant_response}) is the magic line. Without this, the agent would only remember the system message and the current user input, essentially forgetting its own previous replies!

Step 4: Implementing a Simple Loop for Conversation

Now, let’s create a main loop that allows us to chat with our MemoryBot continuously.

Add this loop at the end of memory_agent.py:

# ... (all previous code) ...

# Main conversation loop
if __name__ == "__main__":
    while True:
        user_input = input("\nYou: ")
        if user_input.lower() in ["quit", "exit"]:
            print("MemoryBot: Goodbye! It was nice chatting with you.")
            break

        assistant_reply = chat_with_memorybot(user_input, conversation_history)
        print(f"MemoryBot: {assistant_reply}")

Explanation:

  • The if __name__ == "__main__": block ensures this code runs when the script is executed directly.
  • The while True: loop continuously prompts the user for input.
  • If the user types “quit” or “exit”, the loop breaks.
  • For any other input, chat_with_memorybot is called, passing the user’s message and our conversation_history.
  • The assistant’s reply is printed. Crucially, conversation_history is updated within the chat_with_memorybot function, so it grows with each turn.

Running Your MemoryBot

Save memory_agent.py and run it from your terminal:

python memory_agent.py

Now, try chatting with it!

  • “Hello, my name is Alex.”
  • “What is my name?” (It should remember!)
  • “Can you tell me more about large language models?”
  • “What was the first thing I asked you?” (It should recall the “Hello, my name is Alex” part if within context)

Observe how it maintains context. If you have a very long conversation, you’ll eventually see the truncation message as older messages are dropped to stay within the MAX_CONTEXT_TOKENS.

Diagram: The Flow of Short-Term Memory

Let’s visualize the flow of information for our MemoryBot:

flowchart TD
    A[Start Conversation Loop] --> B{User Input}
    B -->|Quit| L[End Conversation]
    B -->|Message| C[Add User Message to History]
    C --> D[Count Tokens in History]
    D --> E{Exceeds MAX_CONTEXT_TOKENS?}
    E -->|Yes| F[Truncate Oldest Messages]
    E -->|No| G[Prepare Messages for LLM]
    F --> G
    G --> H[Call LLM API]
    H --> I[Get Assistant Response]
    I --> J[Add Assistant Response to History]
    J --> K[Display Assistant Response]
    K --> B

Explanation of Diagram:

  • The conversation_history is constantly updated.
  • Before sending to the LLM, a check is performed (node E: Exceeds MAX_CONTEXT_TOKENS?).
  • If the context is too large, the oldest non-system messages are truncated (node F), keeping the system prompt and ensuring the LLM always receives a manageable and relevant set of messages.
  • The Add Assistant Response to History (J) step is what makes the agent “remember.”

Mini-Challenge: Dynamic Token Budgeting

Our current truncation strategy is based on counting tokens and removing messages from the beginning of the list until the budget is met. This is a good start, but what if we want to dynamically adjust the MAX_CONTEXT_TOKENS based on the specific LLM model’s actual full context window, while still leaving room for the LLM’s response?

Challenge: Modify the chat_with_memorybot function to:

  1. Dynamically determine the LLM’s full context window size. While tiktoken helps with token counting, you might need to consult OpenAI’s documentation or a library that provides model metadata. For gpt-4o, let’s assume a full context of 128000 tokens.
  2. Reserve a portion of the context window for the LLM’s output. For example, reserve 1000 tokens for the assistant’s reply.
  3. Calculate the actual MAX_CONTEXT_TOKENS available for input messages for each API call. This means LLM_FULL_CONTEXT - RESERVED_OUTPUT_TOKENS.
  4. Apply the truncation logic using this dynamic MAX_CONTEXT_TOKENS_FOR_INPUT.

Hint:

  • You can define a LLM_FULL_CONTEXT constant (e.g., 128000 for gpt-4o) and a RESERVED_OUTPUT_TOKENS constant (e.g., 1000).
  • Calculate MAX_CONTEXT_TOKENS_FOR_INPUT = LLM_FULL_CONTEXT - RESERVED_OUTPUT_TOKENS inside chat_with_memorybot before the truncation logic.

What to observe/learn: This challenge will highlight how to manage the context window more robustly, accounting for both input history and anticipated output, which is a common practice in production systems to prevent context_window_exceeded errors. It makes your agent more resilient to variable-length LLM responses.
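
If you get stuck, here is one possible starting point, not a full solution. The constants come straight from the hint above; wiring the budget into the truncation loop is left to you:

# Assumed model metadata -- check your provider's documentation for exact figures.
LLM_FULL_CONTEXT = 128_000        # e.g. gpt-4o's advertised context window
RESERVED_OUTPUT_TOKENS = 1_000    # head-room reserved for the assistant's reply

def max_input_tokens():
    """Token budget available for the input messages on each API call."""
    return LLM_FULL_CONTEXT - RESERVED_OUTPUT_TOKENS

# Inside chat_with_memorybot, use this budget instead of the fixed MAX_CONTEXT_TOKENS:
#   budget = max_input_tokens()
#   ... keep adding messages while count_tokens(temp_messages) <= budget ...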

Common Pitfalls & Troubleshooting

Even with basic short-term memory, you’ll encounter challenges. Here are some common pitfalls and how to address them:

  1. Context Window Overflow / Agent “Forgetting”

    • Problem: The agent starts giving generic or irrelevant answers, or completely loses track of earlier parts of the conversation, even if they seem important. You might also get API errors indicating the context window limit was exceeded.
    • Why it happens: Your conversation history grew too large, and older messages were either implicitly truncated by the LLM API or explicitly by your truncation strategy. The LLM simply doesn’t “see” the older information.
    • Troubleshooting:
      • Verify token counting: Ensure your token counting mechanism (tiktoken) is accurate for your chosen LLM.
      • Adjust MAX_CONTEXT_TOKENS: Increase your MAX_CONTEXT_TOKENS if the LLM model supports it and your budget allows, or decrease it to force earlier truncation and prevent errors.
      • Implement better truncation: Instead of just dropping oldest messages, consider summarizing older parts of the conversation or using more sophisticated retrieval (which we’ll cover next).
      • Check API errors: Most LLM APIs provide clear error messages when context limits are hit.
  2. Irrelevant Information Bloat

    • Problem: The agent’s responses are slow, expensive, or sometimes confused, even if the context window isn’t technically overflowing. It might bring up old, irrelevant topics.
    • Why it happens: You’re sending too much information that isn’t pertinent to the current user query. Even if it fits, the LLM has to process it, which can dilute its focus and increase costs/latency.
    • Troubleshooting:
      • Smarter Truncation: Instead of just raw message count, try to prioritize messages. For example, keep the system message, the last N user/assistant turns, and any messages explicitly marked as “important.”
      • Summarization: As mentioned, summarizing older turns can condense information and remove irrelevant details.
      • Hybrid Memory: Combine short-term (recent conversation) with long-term (retrieved relevant facts) to only bring in necessary older context.
  3. Inconsistent Persona or Hallucination Due to Memory Loss

    • Problem: The agent deviates from its defined system persona or “hallucinates” facts it previously acknowledged, because the system message or crucial past facts have been pushed out of the context.
    • Why it happens: If your truncation strategy is too aggressive or doesn’t prioritize the system message, the LLM might lose its foundational instructions.
    • Troubleshooting:
      • Always include the system message: Our example code correctly ensures the system message is always the first message sent to the LLM. This is a critical best practice.
      • Pin crucial facts: If there are specific facts the agent must remember (e.g., the user’s name, a key project detail), consider storing them separately and prepending them to the system message or a dedicated “facts” message when needed, even if they’re old. This is a simple bridge to long-term memory.
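
As a rough illustration of the “pinning” idea, here is a minimal sketch. The pinned_facts dictionary and build_system_message helper are hypothetical names introduced for this example, not part of the MemoryBot code we built earlier:

# Facts the agent must never lose, stored outside the normal message history.
pinned_facts = {"user_name": "Alex", "favorite_topic": "large language models"}

def build_system_message(base_prompt, facts):
    """Prepend pinned facts to the system prompt so truncation can never drop them."""
    facts_text = "; ".join(f"{key}: {value}" for key, value in facts.items())
    return {"role": "system",
            "content": f"{base_prompt}\nKnown facts you must remember: {facts_text}"}

# Usage: rebuild the first message before each API call so the facts always survive truncation.
messages_to_send[0] = build_system_message(
    "You are a friendly and helpful AI assistant named 'MemoryBot'.", pinned_facts
)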

Summary: The Foundation of Agent Intelligence

Congratulations! You’ve successfully delved into the critical world of short-term memory for AI agents. This chapter laid the groundwork for how agents can maintain context and engage in coherent conversations.

Here are the key takeaways:

  • LLM Context Window: This is the agent’s immediate working memory, a fixed-size buffer where all input (system prompt, history, tools) resides. LLMs only “remember” what’s in this window for the current API call.
  • Tokens: Text is broken down into tokens, and LLMs have strict token limits for their context windows, impacting cost, speed, and capability.
  • Conversation History: Storing messages with roles (system, user, assistant) is essential for multi-turn interactions.
  • Memory Management Strategies: We explored full history (simple), windowing/truncation (common), summarization (intelligent reduction), and briefly touched on retrieval (advanced, for long-term).
  • In-Memory Storage: A simple and effective way to manage short-term history within a running program, though it’s ephemeral.
  • Practical Implementation: We built a Python MemoryBot that uses tiktoken for token counting and implements a basic truncation strategy to manage its context window.
  • Common Pitfalls: We discussed context window overflow, irrelevant information bloat, and persona inconsistency, along with strategies to troubleshoot them.

Short-term memory is the bedrock upon which more sophisticated agentic behaviors are built. By effectively managing the LLM’s context, you empower your agents to have more natural, intelligent, and useful interactions.

What’s Next?

While short-term memory is great for ongoing conversations, what if an agent needs to remember something from weeks ago, or recall facts from a vast knowledge base? Our current in-memory system is ephemeral and limited. In the next chapter, we’ll expand our agent’s capabilities by exploring Long-Term Memory Systems, diving into vector databases, knowledge graphs, and advanced retrieval techniques to give our agents truly persistent and scalable recall!
