Introduction: The Art of Less is More
Welcome back, fellow AI adventurer! In our previous chapters, we laid the groundwork for understanding the critical role of context in LLM performance. We learned that the “context window” is the LLM’s short-term memory, and it has strict limits. Feeding too much information can lead to truncation, increased costs, and slower responses – not ideal for robust production systems.
In this chapter, we’re going to tackle these challenges head-on by diving into Context Reduction and Summarization. Think of it as decluttering your LLM’s workspace. We’ll explore techniques to intelligently trim down raw information, ensuring that only the most relevant and impactful data reaches your model. This isn’t just about saving tokens; it’s about improving the quality, reliability, and efficiency of your AI’s outputs. Get ready to make every token count!
By the end of this chapter, you’ll understand:
- Why context reduction is indispensable for production LLM systems.
- How to differentiate between context reduction and compression.
- Practical strategies for filtering irrelevant information.
- The power of summarization, both heuristic and LLM-based, to condense context.
- The crucial trade-offs involved in these techniques.
Ready to optimize? Let’s begin!
Core Concepts: Sharpening the Focus
Before we get our hands dirty with code, let’s solidify our understanding of why and how context reduction and summarization work their magic.
Why Reduce? The “Context Window Crunch” Revisited
Imagine you’re trying to give instructions to a very intelligent, but forgetful, assistant. You have a massive binder of information, but the assistant can only read a few pages at a time. If you just hand them the whole binder, they’ll only read the first few pages and miss crucial details buried deeper inside.
This is precisely the challenge with LLMs and their context windows. Overloading the context window leads to several problems:
- Token Limits and Truncation: Every LLM has a maximum context length (e.g., 4K, 8K, 32K, 128K tokens). Exceeding this limit means your input will be unceremoniously cut off, potentially losing vital information.
- Increased Cost: Most LLM APIs charge per token. A larger context means more tokens, directly translating to higher operational costs.
- Higher Latency: Processing more tokens takes more time. This can significantly impact the response time of your application, leading to a poor user experience.
- Context Rot: This is a subtle but critical issue. Even if you don’t hit the token limit, a large context filled with irrelevant or outdated information can dilute the model’s focus, making it harder for the LLM to identify and utilize the truly important pieces. It’s like having too many open tabs in your browser – eventually, you lose track of what’s important.
The solution? Proactive, intelligent context reduction.
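To make the cost and token pressure concrete, here is a back-of-the-envelope sketch. The ~4-characters-per-token ratio and the per-token price are illustrative assumptions only; exact counts require the model's own tokenizer (e.g. the tiktoken library for OpenAI models), and pricing varies by provider.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: roughly 4 characters per token for English prose.
    Heuristic only; exact counts require the model's own tokenizer."""
    return max(1, len(text) // 4)

def estimate_input_cost_usd(text: str, price_per_1k_tokens: float = 0.0005) -> float:
    """Back-of-the-envelope input cost. The price is a placeholder value;
    check your provider's current pricing."""
    return estimate_tokens(text) / 1000 * price_per_1k_tokens

# A long, repetitive history adds up quickly:
history = "User: my order arrived broken. Assistant: let me check.\n" * 200
print(f"~{estimate_tokens(history)} tokens, ~${estimate_input_cost_usd(history):.4f} per call")
```

Run this on your own conversation logs to get a feel for how quickly unfiltered history inflates per-call cost.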
Context Reduction vs. Context Compression: A Key Distinction
It’s easy to confuse these two, but they serve different purposes:
- Context Reduction: This involves removing information from the context. You’re making a conscious decision that certain pieces of data are either irrelevant, redundant, or can be adequately represented in a more condensed form. Examples include filtering out noise, summarizing long texts, or selecting only the most recent messages.
- Context Compression: This involves encoding the existing information more efficiently without necessarily removing its semantic content. Think of it as zipping a file. Techniques like token-efficient encoding or using specialized models (e.g., smaller, faster models to process context before sending to a larger LLM) fall into this category. We’ll explore compression in more detail in a later chapter.
For now, our focus is on reduction – strategically deciding what not to send to the LLM.
The overall flow of context management looks like this: raw context → filtering (remove irrelevant items) → summarization (condense what remains) → final prompt for the main LLM.
Technique 1: Filtering Irrelevant Information
The simplest form of context reduction is filtering. This is about identifying and discarding information that doesn’t contribute to the LLM’s task.
Consider a chatbot conversation. Does “Hi, how are you?” really need to be part of the historical context for every subsequent turn? Probably not.
Common filtering strategies include:
- Rule-Based Filtering: Define explicit rules to exclude certain types of content.
- Example: Remove messages shorter than N words, messages matching a list of “stop phrases” (e.g., greetings, acknowledgments), or data older than a certain timestamp.
- Keyword Filtering: Only include information that contains specific keywords relevant to the current query or domain.
- Example: For a support bot, only pass historical interactions that mention the current product or issue category.
- Semantic Similarity Filtering (Advanced): This involves using embedding models to measure the semantic similarity between chunks of context and the current query. Only the most similar chunks are passed. We’ll touch upon embeddings more in the RAG chapter, but it’s a powerful filtering concept.
The goal is to remove noise and focus the LLM on the signal.
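As a concrete illustration of keyword filtering, here is a minimal bag-of-words sketch. The `keyword_filter` helper, its punctuation stripping, and the overlap threshold are our own illustrative choices; a production system would typically use embeddings for semantic similarity instead.

```python
def keyword_filter(messages: list[dict], query: str, min_overlap: int = 1) -> list[dict]:
    """Keep only messages sharing at least `min_overlap` content words with the query.
    Words of 3 characters or fewer are skipped as a crude stopword filter."""
    query_words = {w.lower().strip(".,!?") for w in query.split() if len(w) > 3}
    kept = []
    for msg in messages:
        msg_words = {w.lower().strip(".,!?") for w in msg["content"].split()}
        if len(query_words & msg_words) >= min_overlap:
            kept.append(msg)
    return kept

history = [
    {"role": "user", "content": "What's the weather like?"},
    {"role": "user", "content": "My order #12345 arrived broken."},
]
print(keyword_filter(history, "status of broken order"))  # keeps only the order message
```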
Technique 2: Summarization
Sometimes, information isn’t irrelevant, but it’s too detailed. This is where summarization comes in. Summarization condenses longer texts into shorter, coherent versions that retain the most important points.
There are two main types of summarization:
- Extractive Summarization: This method pulls exact sentences or phrases directly from the original text to form the summary. It’s like highlighting the most important parts.
- Pros: Retains original wording, factual accuracy.
- Cons: Can sometimes feel disjointed, might not capture nuances.
- Abstractive Summarization: This method generates new sentences and phrases to create a summary, rephrasing and condensing the original content. It’s like writing a new, shorter version of the text.
- Pros: More fluent and coherent, can synthesize information better.
- Cons: More prone to factual errors or “hallucinations” if the model isn’t strong, requires more advanced models.
How do we perform summarization for context engineering?
- Heuristic-Based Summarization (Simple): For very simple cases, you might use rules like:
- “Take the first N sentences.”
- “Extract sentences containing the most frequently occurring keywords.”
- These are generally less effective for complex texts but can be quick and cheap.
- LLM-Based Summarization (Powerful): The most effective approach for complex summarization is to use an LLM itself! You send a chunk of text to a (potentially smaller, cheaper) LLM with a specific summarization prompt, and it returns a concise summary. This summary then becomes part of the context for your main LLM task.
- Example Prompt: “Summarize the following conversation history, focusing on key decisions, unresolved issues, and action items, into a maximum of 100 words:”
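The heuristic of extracting sentences that contain the most frequent keywords can be sketched in a few lines. `frequency_summary` is a hypothetical helper, and the four-letter word cutoff is a crude stand-in for a real stopword list:

```python
import re
from collections import Counter

def frequency_summary(text: str, num_sentences: int = 2) -> str:
    """Extractive heuristic: keep the sentences containing the most
    frequently occurring words. Cheap and deterministic, but blind to
    meaning; fine for quick drafts, weak on nuanced text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)  # skip short, stopword-like words

    def score(sentence: str) -> int:
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Re-emit in original order so the summary stays coherent
    return " ".join(s for s in sentences if s in top)

text = ("The order arrived broken. I like cats. "
        "The broken order needs a replacement order.")
print(frequency_summary(text))
```

Note how the off-topic sentence ("I like cats.") drops out because its words are rare in the text, while the order-related sentences survive.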
Trade-offs with Summarization:
- Information Loss: All summarization inherently involves some loss of detail. The challenge is to lose irrelevant detail while retaining critical detail.
- Cost & Latency: Using an LLM for summarization adds an extra API call (and its associated cost and time) before the main LLM call. This is a crucial design consideration.
- Quality: The quality of the summary depends heavily on the summarization technique and the model used.
Now that we have a solid conceptual foundation, let’s put these ideas into practice!
Step-by-Step Implementation: Smart Context Handling
For our hands-on example, we’ll simulate managing a conversation history for a chatbot. We’ll implement both filtering and LLM-based summarization.
Setup Requirements:
We’ll use standard Python 3.12+ for this example. No external libraries are strictly necessary for the core logic, but if you wanted more advanced text processing (like sentence tokenization), you might use nltk. For simplicity, we’ll keep it basic.
Make sure you have Python 3.12 or newer installed. You can check your version with:
python --version
If you need to install nltk for more robust sentence splitting (though we’ll keep it simple for this example), you’d do:
pip install nltk==3.8.1
python -m nltk.downloader punkt
(Note: nltk==3.8.1 is one known-stable release at the time of writing; check the official NLTK documentation for the currently recommended version before pinning.)
We’ll start by defining a mock LLM function, as a real LLM API call would require API keys and external network requests, which can complicate local execution. Our focus is on the context engineering logic.
# context_manager.py
import re  # used for whole-word stop-phrase matching

# --- Mock LLM for Demonstration ---
def mock_llm_summarize(text_to_summarize: str, max_words: int = 100) -> str:
    """
    A mock function to simulate an LLM summarization API call.
    In a real scenario, this would interact with a service like OpenAI, Anthropic, etc.
    It returns a generic summary based on input length.
    """
    if not text_to_summarize.strip():
        return ""
    print("\n--- Mock LLM Call for Summarization ---")
    print(f"Input text length: {len(text_to_summarize)} characters")
    # Simulate summarization logic
    if len(text_to_summarize) > 500:
        summary_text = (
            f"This is a concise summary of the provided text, focusing on key points "
            f"and reducing its original length to under {max_words} words. "
            f"It covers the main discussion points and any identified action items."
        )
    else:
        summary_text = (
            f"Summary of: '{text_to_summarize[:150].strip()}...' "
            f"This short text was deemed concise enough."
        )
    print(f"Generated summary: {summary_text[:100]}...")  # Show a snippet
    print("--- End Mock LLM Call ---")
    return summary_text

# --- Context Management Logic ---
def filter_conversation_history(
    messages: list[dict],
    min_length: int = 5,
    stop_phrases: list[str] | None = None,
) -> list[dict]:
    """
    Filters a list of conversation messages based on length and stop phrases.

    Args:
        messages: A list of dictionaries, where each dict has 'role' and 'content'.
        min_length: Minimum character length for a message to be considered relevant.
        stop_phrases: A list of phrases to identify and filter out (case-insensitive).

    Returns:
        A new list containing only the filtered, relevant messages.
    """
    if stop_phrases is None:
        stop_phrases = ["hi", "hello", "how are you", "goodbye", "thanks", "ok", "got it"]
    filtered_messages = []
    for msg in messages:
        content = msg.get("content", "").strip()
        # Filter by minimum length
        if len(content) < min_length:
            print(f"  - Filtering out short message: '{content}'")
            continue
        # Filter by stop phrases: whole-word match, so 'ok' doesn't match 'broken' or 'look'
        is_stop_phrase = False
        for phrase in stop_phrases:
            if re.search(r"\b" + re.escape(phrase) + r"\b", content, re.IGNORECASE):
                is_stop_phrase = True
                print(f"  - Filtering out message with stop phrase '{phrase}': '{content}'")
                break
        if is_stop_phrase:
            continue
        filtered_messages.append(msg)
    return filtered_messages

def summarize_context_with_llm(
    context_messages: list[dict],
    summary_prompt_template: str,
    llm_summarizer_func=mock_llm_summarize,
    max_summary_words: int = 100,
) -> str:
    """
    Uses an LLM to summarize a list of context messages.

    Args:
        context_messages: A list of dictionaries representing the context.
        summary_prompt_template: The template for the prompt to send to the summarizer LLM.
        llm_summarizer_func: The function to call for LLM summarization (e.g., mock_llm_summarize).
        max_summary_words: Desired maximum word count for the summary.

    Returns:
        A string containing the LLM-generated summary.
    """
    if not context_messages:
        return ""
    # Concatenate messages into a single string for summarization
    full_text_to_summarize = "\n".join(
        f"{msg['role'].capitalize()}: {msg['content']}" for msg in context_messages
    )
    # Construct the full prompt for the summarizer LLM
    full_prompt = summary_prompt_template.format(
        text_to_summarize=full_text_to_summarize,
        max_words=max_summary_words,
    )
    print(f"\nAttempting to summarize {len(context_messages)} messages...")
    print(f"Prompt sent to summarizer LLM (first 200 chars): '{full_prompt[:200]}...'")
    summary = llm_summarizer_func(full_prompt, max_words=max_summary_words)
    return summary

def main():
    """
    Demonstrates context reduction through filtering and summarization.
    """
    conversation_history = [
        {"role": "user", "content": "Hi, how are you?"},
        {"role": "assistant", "content": "I'm doing great! How can I help you today?"},
        {"role": "user", "content": "I have a problem with my order #12345. The item arrived broken."},
        {"role": "assistant", "content": "Oh no! I'm sorry to hear that. Let me look up your order."},
        {"role": "user", "content": "Okay, thanks."},
        {"role": "assistant", "content": "I see order #12345 for a 'Deluxe Widget'. Can you confirm it's the widget?"},
        {"role": "user", "content": "Yes, that's correct. The screen is cracked."},
        {"role": "assistant", "content": "Understood. I'll initiate a replacement for you. It should arrive in 3-5 business days."},
        {"role": "user", "content": "Great, thank you!"},
        {"role": "assistant", "content": "You're welcome! Is there anything else I can assist with?"},
        {"role": "user", "content": "No, that's all. Goodbye!"},
    ]
    print("--- Original Conversation History ---")
    for msg in conversation_history:
        print(f"[{msg['role'].capitalize()}]: {msg['content']}")
    print(f"\nOriginal message count: {len(conversation_history)}")
    print("-" * 40)

    # --- Step 1: Filtering ---
    print("\n### Step 1: Applying Filtering ###")
    # Define custom stop phrases and minimum length
    custom_stop_phrases = ["hi", "hello", "how are you", "thanks", "thank you", "okay", "ok", "understood", "you're welcome", "goodbye", "that's all"]
    min_message_length = 10  # Example: messages shorter than 10 chars are often trivial
    filtered_history = filter_conversation_history(
        conversation_history,
        min_length=min_message_length,
        stop_phrases=custom_stop_phrases,
    )
    print("\n--- Filtered Conversation History ---")
    if not filtered_history:
        print("No relevant messages after filtering.")
    for msg in filtered_history:
        print(f"[{msg['role'].capitalize()}]: {msg['content']}")
    print(f"\nFiltered message count: {len(filtered_history)}")
    print("-" * 40)

    # --- Step 2: Summarization ---
    print("\n### Step 2: Applying LLM-Based Summarization ###")
    summary_template = (
        "Please summarize the following conversation history, focusing on the core problem, "
        "any key information provided (like order numbers or product details), "
        "and the resolution or next steps. The summary should be concise, "
        "no more than {max_words} words, and suitable for providing context to another AI agent.\n\n"
        "Conversation:\n{text_to_summarize}"
    )
    max_summary_words = 70  # Targeting a short summary
    if filtered_history:
        conversation_summary = summarize_context_with_llm(
            filtered_history,
            summary_template,
            max_summary_words=max_summary_words,
        )
    else:
        conversation_summary = "No conversation history to summarize after filtering."
    print("\n--- Final Summarized Context ---")
    print(f"Summary generated for LLM: '{conversation_summary}'")
    print(f"\nOriginal messages: {len(conversation_history)}, Filtered messages: {len(filtered_history)}")
    print("The summarized context is now a single concise string, saving many tokens!")
    print("-" * 40)

if __name__ == "__main__":
    main()
Let’s break down the code step by step.
Part 1: Setting up our Mock LLM and Filtering Logic
First, we define a mock_llm_summarize function. This function stands in for a real API call to an LLM service (like OpenAI’s GPT-4, Anthropic’s Claude, etc.). It helps us focus on the context engineering logic without needing API keys or network requests.
# context_manager.py (start with this)
import re  # used later for whole-word stop-phrase matching

# --- Mock LLM for Demonstration ---
def mock_llm_summarize(text_to_summarize: str, max_words: int = 100) -> str:
    """
    A mock function to simulate an LLM summarization API call.
    In a real scenario, this would interact with a service like OpenAI, Anthropic, etc.
    It returns a generic summary based on input length.
    """
    if not text_to_summarize.strip():
        return ""
    print("\n--- Mock LLM Call for Summarization ---")
    print(f"Input text length: {len(text_to_summarize)} characters")
    # Simulate summarization logic
    if len(text_to_summarize) > 500:
        summary_text = (
            f"This is a concise summary of the provided text, focusing on key points "
            f"and reducing its original length to under {max_words} words. "
            f"It covers the main discussion points and any identified action items."
        )
    else:
        summary_text = (
            f"Summary of: '{text_to_summarize[:150].strip()}...' "
            f"This short text was deemed concise enough."
        )
    print(f"Generated summary: {summary_text[:100]}...")  # Show a snippet
    print("--- End Mock LLM Call ---")
    return summary_text
Next, we create a function filter_conversation_history. This function takes a list of messages (where each message is a dictionary with role and content) and applies simple rules to filter them.
# context_manager.py (add this after mock_llm_summarize)

# --- Context Management Logic ---
def filter_conversation_history(
    messages: list[dict],
    min_length: int = 5,
    stop_phrases: list[str] | None = None,
) -> list[dict]:
    """
    Filters a list of conversation messages based on length and stop phrases.

    Args:
        messages: A list of dictionaries, where each dict has 'role' and 'content'.
        min_length: Minimum character length for a message to be considered relevant.
        stop_phrases: A list of phrases to identify and filter out (case-insensitive).

    Returns:
        A new list containing only the filtered, relevant messages.
    """
    if stop_phrases is None:
        stop_phrases = ["hi", "hello", "how are you", "goodbye", "thanks", "ok", "got it"]
    filtered_messages = []
    for msg in messages:
        content = msg.get("content", "").strip()
        # Filter by minimum length
        if len(content) < min_length:
            print(f"  - Filtering out short message: '{content}'")
            continue
        # Filter by stop phrases: whole-word match, so 'ok' doesn't match 'broken' or 'look'
        is_stop_phrase = False
        for phrase in stop_phrases:
            if re.search(r"\b" + re.escape(phrase) + r"\b", content, re.IGNORECASE):
                is_stop_phrase = True
                print(f"  - Filtering out message with stop phrase '{phrase}': '{content}'")
                break
        if is_stop_phrase:
            continue
        filtered_messages.append(msg)
    return filtered_messages
- min_length: This parameter allows us to discard messages that are too short, which often contain minimal information.
- stop_phrases: This list contains common conversational filler or greetings that don’t usually contribute to the core task. We check for these phrases in a case-insensitive manner.
- The function iterates through each message, applying these rules. If a message fails a rule, it’s skipped; otherwise, it’s added to filtered_messages.
Part 2: Implementing LLM-Based Summarization
Now, let’s build the summarize_context_with_llm function. This function will take the filtered messages, combine them into a single text, and then use our mock_llm_summarize function to generate a concise summary.
# context_manager.py (add this after filter_conversation_history)

def summarize_context_with_llm(
    context_messages: list[dict],
    summary_prompt_template: str,
    llm_summarizer_func=mock_llm_summarize,
    max_summary_words: int = 100,
) -> str:
    """
    Uses an LLM to summarize a list of context messages.

    Args:
        context_messages: A list of dictionaries representing the context.
        summary_prompt_template: The template for the prompt to send to the summarizer LLM.
        llm_summarizer_func: The function to call for LLM summarization (e.g., mock_llm_summarize).
        max_summary_words: Desired maximum word count for the summary.

    Returns:
        A string containing the LLM-generated summary.
    """
    if not context_messages:
        return ""
    # Concatenate messages into a single string for summarization
    full_text_to_summarize = "\n".join(
        f"{msg['role'].capitalize()}: {msg['content']}" for msg in context_messages
    )
    # Construct the full prompt for the summarizer LLM
    full_prompt = summary_prompt_template.format(
        text_to_summarize=full_text_to_summarize,
        max_words=max_summary_words,
    )
    print(f"\nAttempting to summarize {len(context_messages)} messages...")
    print(f"Prompt sent to summarizer LLM (first 200 chars): '{full_prompt[:200]}...'")
    summary = llm_summarizer_func(full_prompt, max_words=max_summary_words)
    return summary
- full_text_to_summarize: We combine all the filtered messages into one large string. This is the input that our (mock) LLM will summarize.
- summary_prompt_template: This is a powerful concept. Instead of just sending raw text, we wrap it in a carefully crafted prompt that instructs the LLM how to summarize. This guides the LLM to focus on specific aspects (e.g., “core problem,” “resolution”).
- llm_summarizer_func: This parameter makes our function flexible. We can pass our mock_llm_summarize for testing, or a real LLM client function in a production environment.
Part 3: Putting It All Together in main
Finally, we’ll create our main function to demonstrate the entire process with a sample conversation_history.
# context_manager.py (add this after summarize_context_with_llm)

def main():
    """
    Demonstrates context reduction through filtering and summarization.
    """
    conversation_history = [
        {"role": "user", "content": "Hi, how are you?"},
        {"role": "assistant", "content": "I'm doing great! How can I help you today?"},
        {"role": "user", "content": "I have a problem with my order #12345. The item arrived broken."},
        {"role": "assistant", "content": "Oh no! I'm sorry to hear that. Let me look up your order."},
        {"role": "user", "content": "Okay, thanks."},
        {"role": "assistant", "content": "I see order #12345 for a 'Deluxe Widget'. Can you confirm it's the widget?"},
        {"role": "user", "content": "Yes, that's correct. The screen is cracked."},
        {"role": "assistant", "content": "Understood. I'll initiate a replacement for you. It should arrive in 3-5 business days."},
        {"role": "user", "content": "Great, thank you!"},
        {"role": "assistant", "content": "You're welcome! Is there anything else I can assist with?"},
        {"role": "user", "content": "No, that's all. Goodbye!"},
    ]
    print("--- Original Conversation History ---")
    for msg in conversation_history:
        print(f"[{msg['role'].capitalize()}]: {msg['content']}")
    print(f"\nOriginal message count: {len(conversation_history)}")
    print("-" * 40)

    # --- Step 1: Filtering ---
    print("\n### Step 1: Applying Filtering ###")
    # Define custom stop phrases and minimum length
    custom_stop_phrases = ["hi", "hello", "how are you", "thanks", "thank you", "okay", "ok", "understood", "you're welcome", "goodbye", "that's all"]
    min_message_length = 10  # Example: messages shorter than 10 chars are often trivial
    filtered_history = filter_conversation_history(
        conversation_history,
        min_length=min_message_length,
        stop_phrases=custom_stop_phrases,
    )
    print("\n--- Filtered Conversation History ---")
    if not filtered_history:
        print("No relevant messages after filtering.")
    for msg in filtered_history:
        print(f"[{msg['role'].capitalize()}]: {msg['content']}")
    print(f"\nFiltered message count: {len(filtered_history)}")
    print("-" * 40)

    # --- Step 2: Summarization ---
    print("\n### Step 2: Applying LLM-Based Summarization ###")
    summary_template = (
        "Please summarize the following conversation history, focusing on the core problem, "
        "any key information provided (like order numbers or product details), "
        "and the resolution or next steps. The summary should be concise, "
        "no more than {max_words} words, and suitable for providing context to another AI agent.\n\n"
        "Conversation:\n{text_to_summarize}"
    )
    max_summary_words = 70  # Targeting a short summary
    if filtered_history:
        conversation_summary = summarize_context_with_llm(
            filtered_history,
            summary_template,
            max_summary_words=max_summary_words,
        )
    else:
        conversation_summary = "No conversation history to summarize after filtering."
    print("\n--- Final Summarized Context ---")
    print(f"Summary generated for LLM: '{conversation_summary}'")
    print(f"\nOriginal messages: {len(conversation_history)}, Filtered messages: {len(filtered_history)}")
    print("The summarized context is now a single concise string, saving many tokens!")
    print("-" * 40)

if __name__ == "__main__":
    main()
When you run this context_manager.py script (python context_manager.py), you’ll see:
- The full, original conversation.
- Messages being explicitly filtered out based on length and stop phrases.
- The remaining, more focused messages.
- The mock LLM being called to summarize these messages.
- A final, concise summary string, which is what you would then feed to your main LLM for its task.
Notice how the original 11 messages are reduced to a much smaller, more focused set, and then further distilled into a single summary string. This significantly reduces the token count and focuses the LLM on the core issue!
Mini-Challenge: Enhance Filtering and Summarization
You’ve seen the power of basic filtering and summarization. Now, let’s make it even smarter!
Challenge: Modify the filter_conversation_history and/or summarize_context_with_llm functions to improve context quality.
- Deduplication: Implement a simple deduplication logic in filter_conversation_history to remove identical consecutive messages from the same role (e.g., if a user says “yes” twice in a row, only keep one).
- Summarize Short Sequences: Instead of completely removing short, non-stop-phrase messages, consider whether you can combine a sequence of 2-3 very short, related messages into a single, slightly longer message before summarization. For instance, “Yes.” + “That’s correct.” could become “User confirmed the detail.” (This is a bit more advanced, so feel free to tackle just deduplication if combining is too much for now.)
- Refine Summary Prompt: Experiment with the summary_prompt_template to guide the LLM to focus on specific entities (like order # and product names). How would you make the prompt more robust to ensure these critical details are always included in the summary?
Hint for Deduplication: You could keep track of the (role, content) of the last message added to filtered_messages and skip adding the current message if it’s identical.
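If you get stuck, here is one possible sketch of the deduplication idea; the helper name and structure are just one way to do it:

```python
def dedupe_consecutive(messages: list[dict]) -> list[dict]:
    """Drop a message when it exactly repeats the previous message
    from the same role (one possible answer to the mini-challenge)."""
    deduped: list[dict] = []
    for msg in messages:
        if deduped and (deduped[-1]["role"], deduped[-1]["content"]) == (msg["role"], msg["content"]):
            continue  # identical consecutive message: skip it
        deduped.append(msg)
    return deduped

history = [
    {"role": "user", "content": "yes"},
    {"role": "user", "content": "yes"},
    {"role": "assistant", "content": "Confirmed."},
]
print(dedupe_consecutive(history))  # the duplicate "yes" is dropped
```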
What to observe/learn: Pay attention to how your changes impact the final filtered_history and the conversation_summary. Did you lose any critical information? Did you make the context even more concise without sacrificing meaning? This iterative process is at the heart of context engineering!
Common Pitfalls & Troubleshooting
As powerful as context reduction and summarization are, they come with their own set of challenges.
Over-filtering / Over-summarization (The “Lost Detail” Trap):
- Pitfall: Being too aggressive with filtering rules or summarization can lead to losing crucial details that the LLM needs. For example, filtering out all short messages might remove an important “Yes” or “No” confirmation. Over-summarizing a legal document might omit a key clause.
- Troubleshooting: Always test your reduction strategies with representative data. Compare the LLM’s performance with and without the reduction. If the LLM starts “forgetting” details or giving generic answers, your reduction might be too aggressive. Use metrics relevant to your application (e.g., accuracy, completeness of generated answers).
- Best Practice: Prioritize completeness of critical information over extreme conciseness.
Context Rot (Even with Reduction):
- Pitfall: Even after filtering, if the remaining context is old or irrelevant to the current query, the LLM might still get sidetracked. For example, a long-running customer support chat might have filtered out greetings, but still contain discussions about a previous, resolved issue.
- Troubleshooting: Implement time-based decay or recency weighting. Prioritize newer information. In more advanced systems (like RAG), you’d dynamically retrieve context based on the current query, which inherently combats context rot.
- Best Practice: Regularly evaluate the “freshness” and “relevance” of your reduced context.
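One simple way to implement recency weighting is a hard character budget walked from newest to oldest; the budget value and helper name below are illustrative, and a real system might budget in tokens instead:

```python
def keep_recent(messages: list[dict], max_chars: int = 300) -> list[dict]:
    """Recency-based trim: walk the history from newest to oldest and stop
    once a rough character budget is exhausted. Older turns are dropped
    first, which directly counters context rot."""
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):
        cost = len(msg["content"])
        if used + cost > max_chars and kept:  # always keep at least the newest
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Pairing this with a running summary of the dropped older turns (as shown earlier in this chapter) keeps long-range facts available without carrying the full transcript.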
Cost and Latency Trade-offs for LLM-based Summarization:
- Pitfall: Using an LLM to summarize before sending to your main LLM adds an extra API call and its associated cost and latency. This can sometimes negate the benefits if the summarizer LLM is expensive or slow.
- Troubleshooting:
- Use a smaller, cheaper, and faster LLM specifically for summarization. Many models are optimized for this task.
- Consider batching summarization requests if possible.
- Evaluate if heuristic-based summarization (even if less perfect) is sufficient for certain types of context where latency/cost is critical.
- Best Practice: Carefully benchmark the performance (cost, latency, output quality) of your summarization pipeline.
Preserving Named Entities and Specific Details:
- Pitfall: General summarization prompts might inadvertently generalize or remove specific entities like product names, order IDs, dates, or names of people, which are often crucial for the LLM’s task.
- Troubleshooting: Explicitly instruct the summarizer LLM to retain specific entities. Your summary_prompt_template should include directives like “Be sure to mention all product names and order IDs.” You might even use Named Entity Recognition (NER) models to extract these entities separately and inject them directly into the main prompt alongside the summary.
- Best Practice: Make retaining critical entities a non-negotiable requirement in your summarization strategy.
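As a sketch of the entity-injection idea, the regexes below are tuned to the formats in this chapter's example (order IDs like #12345, product names in single quotes) and would need adapting, or replacing with a proper NER model, for real data:

```python
import re

def extract_entities(text: str) -> dict[str, list[str]]:
    """Pull out order IDs and quoted product names with simple regexes.
    Illustrative patterns only, matched to this chapter's example data."""
    return {
        "order_ids": re.findall(r"#\d+", text),
        "products": re.findall(r"'([^']+)'", text),
    }

def build_context(summary: str, source_text: str) -> str:
    """Append extracted entities verbatim so the main LLM always sees them,
    even if the summarizer generalized them away."""
    entities = extract_entities(source_text)
    facts = "; ".join(f"{k}: {', '.join(sorted(set(v)))}" for k, v in entities.items() if v)
    return f"{summary}\n\nKey entities (verbatim): {facts}" if facts else summary

print(build_context("Customer reported a broken item; replacement initiated.",
                    "I see order #12345 for a 'Deluxe Widget'."))
```

Because the entities are appended verbatim rather than paraphrased, a lossy summary can no longer silently drop them.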
Summary
Phew! You’ve just taken a massive leap in mastering context engineering. In this chapter, we explored the vital techniques of Context Reduction and Summarization.
Here are the key takeaways:
- Why Reduce? To overcome LLM context window limits, reduce costs, decrease latency, and prevent “Context Rot” by focusing the model on relevant information.
- Reduction vs. Compression: Reduction removes information, while compression encodes it more efficiently.
- Filtering: A primary reduction technique involving rule-based, keyword-based, or semantic filtering to remove irrelevant or trivial content.
- Summarization: Condensing longer texts into shorter, coherent versions.
- Extractive: Pulls exact sentences.
- Abstractive: Generates new sentences (often via LLMs).
- LLM-based summarization is a powerful technique where an LLM summarizes context for another LLM.
- Trade-offs are Key: Always consider the balance between information loss, cost, latency, and the quality of the LLM’s final output when applying these techniques.
- Practical Application: We implemented a Python example demonstrating how to filter conversation history and then summarize the remaining content using a mock LLM.
You’re now equipped with powerful tools to make your LLMs more efficient, cost-effective, and accurate by intelligently managing their input context. This is a dynamic field, so keep experimenting and exploring new methods!
In the next chapter, we’ll dive deeper into Context Chunking Strategies, learning how to break down large documents into manageable pieces for effective retrieval and processing.
References
- [1] HumanLayer. (2023). 12-Factor Agents - Factor 3: Own Your Context Window. Retrieved from https://github.com/humanlayer/12-factor-agents/blob/main/content/factor-03-own-your-context-window.md
- [2] NLTK Project. (n.d.). NLTK Documentation. Retrieved from https://www.nltk.org/
- [3] yzfly. (n.d.). awesome-context-engineering: A curated collection of …. GitHub. Retrieved from https://github.com/yzfly/awesome-context-engineering
- [4] bonigarcia. (n.d.). context-engineering/README.md. GitHub. Retrieved from https://github.com/bonigarcia/context-engineering/blob/main/README.md
This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.