Introduction: Ensuring Agent Reliability

Welcome back, intrepid AI architects! In previous chapters, we’ve had a blast bringing our AI agents to life, equipping them with tools, memory, and sophisticated orchestration patterns. You’ve seen them tackle tasks, engage in conversations, and even collaborate. That’s fantastic!

But here’s a crucial question: How do we know our agents are truly reliable? What happens when a Large Language Model (LLM) hallucinates, a tool fails, or an agent misinterprets a prompt? Building AI agent systems isn’t just about crafting clever prompts and chaining components; it’s also about anticipating failure, identifying issues swiftly, and ensuring consistent, trustworthy performance. This is where the pillars of Debugging, Testing, and Monitoring (DTM) come into play.

In this chapter, we’ll dive deep into the essential practices that transform a cool agent prototype into a robust, production-ready system. We’ll explore the unique challenges of DTM in the context of non-deterministic AI, learn practical strategies for each framework, and equip you with the skills to build agent systems you can truly depend on. Get ready to put on your detective hat and make your agents bulletproof!

Core Concepts: Navigating the Non-Deterministic World

Debugging, testing, and monitoring AI agent systems present a unique set of challenges compared to traditional software development. Let’s explore why and how we can address them.

The Unique Challenges of DTM for AI Agents

Think about a traditional application: if you give it the same input, it will almost always produce the exact same output. Not so with AI agents, especially those powered by LLMs!

  1. Non-Determinism of LLMs: LLMs, by design, are probabilistic. Even with the same prompt and parameters (like temperature=0), their responses can vary slightly. This makes reproducing bugs and asserting exact outputs incredibly tricky.
  2. Multi-Agent Interaction Complexity: When multiple agents interact, the number of possible conversational paths and states explodes. Debugging becomes like tracing a conversation among several highly intelligent, yet sometimes unpredictable, individuals.
  3. Statefulness and Memory: Agents maintain internal state and memory. A subtle issue early in a long conversation might only manifest much later, making root cause analysis difficult.
  4. Tool Reliability: Agents rely on external tools (APIs, databases, custom functions). Failures can originate in the tools themselves, or in how the agent uses them (e.g., incorrect arguments, misinterpretation of tool output).
  5. Cost and Latency Considerations: Every LLM call costs money and takes time. Extensive debugging and testing can quickly become expensive and slow if not managed carefully.
  6. Prompt Engineering Fragility: A small change in a system prompt or agent instruction can drastically alter behavior, potentially introducing subtle regressions.

These challenges mean we need a slightly different mindset and toolset for DTM in the agentic world.

Debugging Strategies: Unmasking Agent Behavior

Debugging is the art of finding and fixing errors. For AI agents, it’s often about understanding why an agent made a particular decision or produced an unexpected output.

Logging and Tracing: Your Agent’s Inner Monologue

The most fundamental debugging tool is logging. For AI agents, it’s not enough to just log errors; we need to log the thought process. This includes:

  • Inputs and Outputs: What prompt was sent to the LLM? What response did it return?
  • Tool Calls: Which tool was called? What arguments were passed? What was the tool’s raw output?
  • Agent Decisions: Why did the agent choose a particular path in a graph? What was its reasoning for delegating a task?
  • State Changes: How did the agent’s internal state or memory evolve over time?

Structured logging (e.g., JSON logs) is highly recommended, as it makes logs easier to parse and analyze with tools.
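Here is a minimal sketch of structured JSON logging using only the standard library; the event names and fields are illustrative, and in practice you might reach for a library such as structlog or python-json-logger instead.

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Attach any structured fields passed via the `extra` argument.
        payload.update(getattr(record, "agent_context", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("agent")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# Example: log a tool call with its inputs and output as structured fields.
logger.info("tool_call", extra={"agent_context": {
    "tool": "search_web",
    "arguments": {"query": "AAPL stock price"},
    "output_preview": "Apple Inc. (AAPL) is trading at ...",
}})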

Observability Platforms: A Bird’s-Eye View

Specialized platforms like LangSmith (from LangChain, often used with LangGraph) are becoming indispensable. They provide:

  • Trace Visualization: A visual timeline of all LLM calls, tool executions, and agent decisions within a run.
  • Input/Output Inspection: Detailed views of prompts, responses, and intermediate steps.
  • Cost and Latency Metrics: Tracking token usage and execution time for each step.
  • Experimentation: Comparing different agent configurations or prompt versions.

While LangSmith is prominent for LangChain/LangGraph, other frameworks have built-in logging or community tools that offer similar insights.
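If you do use LangSmith, tracing is typically switched on through environment variables rather than code changes. A hedged sketch (variable names as documented by LangSmith at the time of writing; check the current docs for your version):

import os

# Enable LangSmith tracing for LangChain/LangGraph runs (assumes a LangSmith account).
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = os.environ.get("LANGCHAIN_API_KEY", "")  # your LangSmith API key
os.environ["LANGCHAIN_PROJECT"] = "agent-dtm-demo"  # traces are grouped under this project name

# Any LangChain/LangGraph code executed after this point is traced automatically;
# the agent code itself does not need to change.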

Interactive Debugging: Stepping Through the Flow

Sometimes, you need to pause your agent and inspect its state directly. Using a Python debugger (like pdb or your IDE’s debugger) can be invaluable for:

  • Stepping through custom tool code.
  • Inspecting variables within an agent’s _run method or a graph node.
  • Understanding the exact data flowing between components.
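For example, you can pause execution inside a custom tool with Python's built-in breakpoint() and poke around interactively (the tool below is purely illustrative):

from langchain_core.tools import tool

@tool
def get_stock_price(ticker: str) -> str:
    """Look up the latest price for a ticker symbol (illustrative tool)."""
    breakpoint()  # Execution pauses here; inspect `ticker` and step through the logic.
    price = 123.45  # ... real lookup logic would go here ...
    return f"{ticker} is trading at {price}"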

Visualizing Workflows: The Map to Your Maze

For complex multi-agent systems or graph-based workflows, a visual representation can clarify the intended flow versus the actual execution. Tools like Mermaid can help you draw your agent’s decision tree or state transitions, making it easier to spot logical flaws.
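If you are working with LangGraph, recent versions can render the compiled graph's own Mermaid definition, which you can paste into any Mermaid viewer (a small sketch, assuming a compiled workflow named app):

# Assuming `app` is a compiled LangGraph workflow, e.g. app = workflow.compile()
mermaid_source = app.get_graph().draw_mermaid()
print(mermaid_source)  # Paste the output into a Mermaid renderer to see nodes and edges.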

Testing Methodologies: Building Confidence in Agent Behavior

Testing AI agents is about establishing confidence that they behave as expected under various conditions. Due to non-determinism, we often shift from “exact output” assertions to “reasonable output” or “expected behavior” assertions.

The Agent Testing Pyramid

Just like traditional software, we can think of an agent testing pyramid:

  1. Unit Testing (Base):

    • Focus: Individual, deterministic components.
    • Examples: Testing a custom tool’s logic, a helper function that processes LLM output, or a prompt template’s rendering.
    • Assertion: Exact output is often possible here.
  2. Integration Testing (Middle):

    • Focus: Interactions between components.
    • Examples: An agent using a tool, a simple two-agent conversation, a specific node in a LangGraph workflow.
    • Assertion: Check for correct tool calls, expected message formats, or general sentiment/topic of LLM responses. Mock LLM calls or specific tool outputs to isolate interaction logic.
  3. End-to-End (E2E) Testing (Top):

    • Focus: The entire agent system, from start to finish.
    • Examples: A full multi-agent workflow, a complete conversation with a user, solving a complex problem.
    • Assertion: Often involves “golden datasets” – predefined inputs with expected outputs (or output characteristics like keywords, structure, or successful task completion). This is where human evaluation often comes in.

Golden Datasets: Your Agent’s Report Card

For E2E testing, you’ll want to build a collection of “golden pairs”: (input_query, expected_output_characteristics).

  • input_query: A typical user query or starting condition.
  • expected_output_characteristics: This isn’t always an exact string. It might be:
    • “The output should contain ‘stock price’ and ‘AAPL’.”
    • “The output should be a valid JSON object with keys ‘summary’ and ‘recommendation’.”
    • “The agent should successfully call the search_web tool at least once.”
    • “The final answer should be positive/negative sentiment.”

Running your agent system against these datasets regularly (regression testing) helps ensure that new changes haven’t introduced regressions.
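A golden dataset can start as nothing more than a list of dictionaries checked with pytest. In the sketch below, the queries, keywords, and the run_agent helper are placeholders for your own system:

import pytest

# Hypothetical golden dataset: each entry pairs a query with output characteristics.
GOLDEN_CASES = [
    {"query": "What is AAPL trading at today?", "must_contain": ["stock price", "AAPL"]},
    {"query": "Summarize this quarter's results as JSON.", "must_contain": ["summary", "recommendation"]},
]

def run_agent(query: str) -> str:
    """Placeholder for invoking your agent system and returning its final answer."""
    raise NotImplementedError

@pytest.mark.parametrize("case", GOLDEN_CASES)
def test_golden_dataset(case):
    answer = run_agent(case["query"])
    # Assert on characteristics, not exact strings, to tolerate LLM variability.
    for keyword in case["must_contain"]:
        assert keyword.lower() in answer.lower()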

Monitoring Agent Performance: Keeping an Eye on Production

Once your agent system is deployed, monitoring becomes your eyes and ears, ensuring it continues to perform well in the wild.

Key Metrics to Track

  • Latency: How long does it take for the agent to respond? (Total, and per LLM call/tool call).
  • Token Usage & API Costs: How many tokens are consumed? What’s the cost per interaction? Are we staying within budget?
  • Success Rate: What percentage of interactions result in a successful task completion or a satisfactory answer?
  • Error Rate: How often do LLM calls fail, tools error out, or agents get stuck?
  • Tool Usage: Which tools are being used most? Are there tools that are never used, or misused?
  • User Feedback: How are users rating the agent’s performance?
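As a starting point, you can collect some of these metrics yourself with a thin wrapper around your LLM calls. The in-memory store below is a simple sketch; a real deployment would push these numbers to Prometheus, a database, or your observability platform:

import time
from collections import defaultdict

metrics = defaultdict(list)  # metric name -> list of observed values

def track_llm_call(llm_call):
    """Decorator that records latency and success/error counts for each LLM call."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            response = llm_call(*args, **kwargs)
            metrics["llm_success"].append(1)
            return response
        except Exception:
            metrics["llm_error"].append(1)
            raise
        finally:
            metrics["llm_latency_seconds"].append(time.perf_counter() - start)
    return wrapper

@track_llm_call
def call_model(prompt: str) -> str:
    """Placeholder for your actual LLM invocation."""
    return "mock response"

call_model("How are you?")
print(dict(metrics))  # e.g. {'llm_success': [1], 'llm_latency_seconds': [...]}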

Alerting and Anomaly Detection

Set up alerts for critical thresholds:

  • Latency spikes (e.g., response time > 10 seconds).
  • High error rates.
  • Unexpected token usage.
  • Sudden drops in success rate.
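A minimal alerting check over collected metrics might look like this; the thresholds are illustrative, and the print statement stands in for whatever pager, Slack, or email integration you use in production:

def check_alerts(metrics: dict) -> list[str]:
    """Return a list of alert messages for any threshold breaches."""
    alerts = []
    latencies = metrics.get("llm_latency_seconds", [])
    errors = metrics.get("llm_error", [])
    calls = len(latencies)

    if latencies and max(latencies) > 10.0:
        alerts.append(f"Latency spike: slowest call took {max(latencies):.1f}s")
    if calls and len(errors) / calls > 0.05:
        alerts.append(f"High error rate: {len(errors)}/{calls} calls failed")
    return alerts

for alert in check_alerts({"llm_latency_seconds": [0.8, 12.3], "llm_error": [1]}):
    print(f"ALERT: {alert}")  # Replace with a real notification channel in production.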

Feedback Loops for Continuous Improvement

Monitoring isn’t just about spotting problems; it’s about learning.

  • Human-in-the-Loop: For critical applications, human review of agent interactions can provide invaluable data for prompt refinement and system improvements.
  • Data Collection: Log all interactions (anonymized if necessary) to build datasets for future training, fine-tuning, or more comprehensive E2E testing.
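One lightweight way to start collecting that data is to append every interaction to a JSON Lines file, which you can later mine for golden-dataset candidates. The field names here are illustrative:

import json
import time
from pathlib import Path

def record_interaction(query: str, answer: str, metadata: dict, path: str = "interactions.jsonl") -> None:
    """Append a single agent interaction to a JSON Lines file for later analysis."""
    entry = {
        "timestamp": time.time(),
        "query": query,          # Consider anonymizing user data before logging.
        "answer": answer,
        "metadata": metadata,    # e.g. model name, latency, tools used, user rating
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

record_interaction(
    "What is the square root of 9?",
    "The square root of 9 is 3.",
    {"model": "gpt-4o", "latency_s": 1.2, "tools_used": []},
)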

Illustrative Workflow Diagram: Agent System DTM Pipeline

Let’s visualize the continuous process of DTM for an AI agent system.

flowchart LR
    subgraph Development_Phase["Development Phase"]
        A[Code Agent Workflow] --> B[Write Unit Tests Tools/Functions]
        B --> C[Run Unit Tests]
    end
    subgraph Testing_Phase["Testing Phase"]
        C --> D{Unit Tests Pass?}
        D -->|Fail| A
        D -->|Pass| E[Write Integration Tests Agent-Tool Interactions]
        E --> F[Run Integration Tests]
        F --> G{Integration Tests Pass?}
        G -->|Fail| A
        G -->|Pass| H[Create E2E Golden Datasets]
        H --> I[Run E2E Tests Against Golden Datasets]
        I --> J{E2E Tests Pass?}
        J -->|Fail| A
        J -->|Pass| K[Automated Regression Testing]
    end
    subgraph Deployment_and_Monitoring["Deployment and Monitoring"]
        K --> L[Deploy Agent System]
        L --> M[Monitor Performance and Logs]
        M --> N{Anomalies or Errors Detected?}
        N -->|Yes| A
        N -->|No| O[Collect User Feedback and Evaluate]
        O --> P[Continuous Improvement & Prompt Refinement]
        P --> A
    end

This diagram shows how DTM is not a one-time event, but an iterative cycle that feeds back into development.

Step-by-Step Implementation: Practical DTM Examples

Let’s get practical and see how we can apply some of these DTM concepts using our familiar frameworks. We’ll focus on adding logging and basic testing.

First, ensure you have the necessary environment set up: Python 3.9+ and pip. We'll use pytest for testing, so let's install it:

pip install pytest==8.1.1

(Note: pytest version 8.1.1 is stable as of 2026-03-20, but feel free to use the latest stable version if available.)

1. Debugging with Logging: LangGraph

LangGraph’s graph structure makes it relatively easy to instrument logging at each node. We’ll enhance a simple LangGraph workflow to show its internal steps.

Let’s imagine a very basic LangGraph that decides if a user’s query is about “math” or “general” and then routes it.

First, create a file named langgraph_debug.py:

# langgraph_debug.py
import os
import logging
from typing import Literal

from langchain_core.messages import AIMessage, BaseMessage, HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END
from pydantic import BaseModel

# --- Setup Logging ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# --- Environment Variable Check ---
# Make sure to set your OPENAI_API_KEY environment variable.
if not os.getenv("OPENAI_API_KEY"):
    logger.warning("OPENAI_API_KEY environment variable not set. LLM calls might fail or use a mock.")
    # For demonstration, we'll proceed, but in a real app, you'd handle this more robustly.

# --- Define Graph State ---
# A Pydantic model works as a LangGraph state schema and gives us attribute access in nodes.
class AgentState(BaseModel):
    messages: list[BaseMessage]
    topic: Literal["math", "general", "unknown"] = "unknown"

# --- LLM Setup ---
# Using gpt-4o as of 2026-03-20. Ensure OPENAI_API_KEY is set in your environment.
# The placeholder key keeps the module importable (e.g. for tests that mock the LLM).
llm = ChatOpenAI(model="gpt-4o", temperature=0, api_key=os.getenv("OPENAI_API_KEY", "sk-placeholder"))

# --- Agent Nodes ---
def classify_topic(state: AgentState) -> dict:
    """Classifies the topic of the conversation."""
    logger.info(f"Node 'classify_topic' entered. Current messages: {state.messages}")
    last_message = state.messages[-1].content

    prompt = f"""
    Analyze the following user query and determine if it's primarily about 'math' or 'general' topics.
    Respond with only one word: 'math' or 'general'.

    Query: "{last_message}"
    """

    response = llm.invoke(prompt)
    classification = response.content.strip().lower()

    # Nodes return a dict of state updates; LangGraph merges them into the state.
    if "math" in classification:
        logger.info("Topic classified as: math")
        return {"topic": "math"}
    elif "general" in classification:
        logger.info("Topic classified as: general")
        return {"topic": "general"}
    else:
        logger.warning(f"Could not classify topic. LLM response: '{classification}'. Defaulting to unknown.")
        return {"topic": "unknown"}

def handle_math_query(state: AgentState) -> dict:
    """Handles math-related queries."""
    logger.info(f"Node 'handle_math_query' entered. Current messages: {state.messages}")
    last_message = state.messages[-1].content

    response_content = f"I'm a math expert! Let me help with: '{last_message}'"
    new_message = AIMessage(content=response_content)

    logger.info(f"Math query handled. Response: {response_content}")
    return {"messages": state.messages + [new_message]}

def handle_general_query(state: AgentState) -> dict:
    """Handles general queries."""
    logger.info(f"Node 'handle_general_query' entered. Current messages: {state.messages}")
    last_message = state.messages[-1].content

    response_content = f"I'm a general knowledge expert! Here's my take on: '{last_message}'"
    new_message = AIMessage(content=response_content)

    logger.info(f"General query handled. Response: {response_content}")
    return {"messages": state.messages + [new_message]}

# --- Conditional Edges ---
def route_topic(state: AgentState) -> Literal["math_handler", "general_handler"]:
    """Routes based on the classified topic."""
    logger.info(f"Node 'route_topic' entered. Current topic: {state.topic}")
    if state.topic == "math":
        logger.info("Routing to math_handler.")
        return "math_handler"
    else: # Fallback to general handler for 'general' or 'unknown'
        logger.info(f"Routing to general_handler (topic was '{state.topic}').")
        return "general_handler"

# --- Build the Graph ---
workflow = StateGraph(AgentState)

# Add nodes
workflow.add_node("classify", classify_topic)
workflow.add_node("math_handler", handle_math_query)
workflow.add_node("general_handler", handle_general_query)

# Set entry point
workflow.set_entry_point("classify")

# Add edges
workflow.add_conditional_edges(
    "classify",
    route_topic,
    {
        "math_handler": "math_handler",
        "general_handler": "general_handler", # Handles both 'general' and 'unknown' topics
    },
)

# Set end points
workflow.add_edge("math_handler", END)
workflow.add_edge("general_handler", END)

# Compile the graph
app = workflow.compile()

# --- Run the Agent ---
if __name__ == "__main__":
    print("--- Running LangGraph Agent ---")

    # Example 1: Math query
    print("\nQuery: 'What is the square root of 9?'")
    result_math = app.invoke({"messages": [HumanMessage(content="What is the square root of 9?")]})
    # invoke returns the final state as a dict of channel values
    print(f"Final Math Response: {result_math['messages'][-1].content}")

    # Example 2: General query
    print("\nQuery: 'Tell me a fun fact about cats.'")
    result_general = app.invoke({"messages": [HumanMessage(content="Tell me a fun fact about cats.")]})
    print(f"Final General Response: {result_general['messages'][-1].content}")

    # Example 3: Ambiguous query (should fall back to general)
    print("\nQuery: 'Purple elephant flying in space.'")
    result_ambiguous = app.invoke({"messages": [HumanMessage(content="Purple elephant flying in space.")]})
    print(f"Final Ambiguous Response: {result_ambiguous['messages'][-1].content}")

Explanation:

  1. Logging Setup: We start by importing the logging module and configuring a basic logger. This logger will print messages to the console, showing the time, logger name, level (INFO, WARNING), and the message.
  2. logger.info() & logger.warning(): Inside each node function (classify_topic, handle_math_query, handle_general_query) and the routing function (route_topic), we’ve added logger.info() calls. These messages tell us when a node is entered, what its current state is, and what decision it’s making.
  3. State Inspection: Notice how we log state.messages or state.topic at the beginning of each node. This is crucial for understanding the context an agent is operating with at any given point.
  4. Conditional Logging: In classify_topic, we log the classification result, and a logger.warning() if the classification is unknown. This highlights potential issues.

To run this:

  1. Save the code as langgraph_debug.py.
  2. Set your OPENAI_API_KEY environment variable. For example, on Linux/macOS: export OPENAI_API_KEY="sk-...". On Windows (CMD): set OPENAI_API_KEY="sk-...".
  3. Run from your terminal: python langgraph_debug.py

Observe the detailed logs in your console. You’ll see the agent’s journey through the graph, making it much easier to debug if it takes an unexpected turn!
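If you want to watch the state evolve step by step rather than only seeing the final result, LangGraph's compiled graphs also expose a streaming interface. A small sketch, assuming the app object from the listing above and a reasonably recent LangGraph version:

from langchain_core.messages import HumanMessage

# stream() emits intermediate results while the graph runs; the exact shape of each
# chunk depends on the stream_mode (per-node updates vs. full state snapshots).
for chunk in app.stream({"messages": [HumanMessage(content="What is the square root of 9?")]}, stream_mode="updates"):
    print(chunk)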

2. Testing a LangGraph Node

Now, let’s write a simple unit test for our classify_topic node using pytest.

Create a new file named test_langgraph_nodes.py in the same directory:

# test_langgraph_nodes.py
import pytest
from unittest.mock import MagicMock, patch

from langchain_core.messages import HumanMessage

import langgraph_debug
from langgraph_debug import classify_topic, AgentState  # Import from our previous file

# Mock the LLM to make tests deterministic and avoid API calls
@pytest.fixture
def mock_llm_response():
    """Fixture that replaces the module-level LLM with a deterministic mock."""
    def mock_invoke_logic(prompt):
        """Return a canned classification based on keywords in the user query."""
        # Match on query-specific keywords only; the prompt template itself
        # always contains the words 'math' and 'general'.
        mock_response = MagicMock()
        if "square root" in prompt.lower():
            mock_response.content = "math"
        elif "fun fact" in prompt.lower():
            mock_response.content = "general"
        else:
            mock_response.content = "unknown_classification"
        return mock_response

    # Swap out the module-level `llm` object so classify_topic picks up the mock;
    # the patch context restores the real LLM after the test.
    with patch.object(langgraph_debug, "llm") as mock_llm:
        mock_llm.invoke.side_effect = mock_invoke_logic
        yield mock_llm

def test_classify_topic_math(mock_llm_response):
    """Test if classify_topic correctly identifies a math query."""
    initial_state = AgentState(messages=[HumanMessage(content="What is the square root of 16?")])
    update = classify_topic(initial_state)
    assert update["topic"] == "math"

def test_classify_topic_general(mock_llm_response):
    """Test if classify_topic correctly identifies a general query."""
    initial_state = AgentState(messages=[HumanMessage(content="Tell me a fun fact about giraffes.")])
    update = classify_topic(initial_state)
    assert update["topic"] == "general"

def test_classify_topic_unknown(mock_llm_response):
    """Test if classify_topic handles an unknown query."""
    initial_state = AgentState(messages=[HumanMessage(content="Purple elephants fly on Tuesdays.")])
    update = classify_topic(initial_state)
    assert update["topic"] == "unknown"

Explanation:

  1. pytest: We use pytest for our testing framework. It automatically discovers tests (functions starting with test_).
  2. mock_llm_response Fixture: This is crucial! To make our tests deterministic and avoid making actual API calls (which cost money and are slow), we swap the module-level llm object for a mock.
    • MagicMock allows us to simulate the behavior of an object.
    • Our mock_invoke_logic function checks the prompt and returns a predefined MagicMock response with the expected content.
    • The yield keyword keeps the mock active during the test; when the patch context exits, the real LLM is restored, cleaning up the test environment.
  3. Test Functions: Each test_ function creates an AgentState with a specific HumanMessage and then calls classify_topic directly.
  4. assert Statements: We use assert to check that the "topic" value in the returned state update matches our expected classification.

To run this:

  1. Make sure langgraph_debug.py and test_langgraph_nodes.py are in the same directory.
  2. Run from your terminal: pytest test_langgraph_nodes.py

You should see output indicating that all 3 tests passed! This gives us confidence that our classify_topic node works as intended, regardless of the actual LLM’s non-deterministic nature.

3. Debugging and Testing with AutoGen

AutoGen provides verbose built-in logging and a clear way to inspect conversation history, which is key for debugging. In this example we load the API key directly from an environment variable rather than an external OAI_CONFIG_LIST file.

First, ensure you have AutoGen installed:

pip install pyautogen==0.2.20

(Note: pyautogen version 0.2.20 is stable as of 2026-03-20. Check for the latest stable version if needed.)

Create a file autogen_debug.py:

# autogen_debug.py
import os
import autogen
import logging

# --- Setup Logging ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# --- Environment Variable Check and AutoGen Configuration ---
# It's best practice to load API keys from environment variables for security.
# Make sure to set your OPENAI_API_KEY environment variable.
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    logger.error("OPENAI_API_KEY environment variable not set. AutoGen will likely fail without it.")
    # In a real application, you might exit or provide a mock here.
    exit("Please set the OPENAI_API_KEY environment variable.")

# AutoGen config list using the environment variable
config_list = [
    {
        "model": "gpt-4o", # Using gpt-4o as of 2026-03-20
        "api_key": openai_api_key,
    }
]

# --- Define Agents ---
# The User Proxy Agent is typically the entry point for human interaction or initiating tasks.
user_proxy = autogen.UserProxyAgent(
    name="Admin",
    system_message="A human admin. Interact with the Planner to ensure tasks are completed.",
    code_execution_config={"last_n_messages": 3, "work_dir": "coding", "use_docker": False},  # Set use_docker=True if Docker is available
    human_input_mode="NEVER", # Set to ALWAYS for interactive debugging
    is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
)

# The Planner Agent will receive the initial task and break it down.
planner = autogen.AssistantAgent(
    name="Planner",
    llm_config={"config_list": config_list}, # Use the config_list with API key
    system_message="""You are a helpful AI assistant that plans tasks.
    Your goal is to break down a complex request into smaller, manageable steps for other agents.
    When you have a plan, present it clearly.
    If the task involves code, just state "I will write code to solve this."
    Once the task is fully planned or completed, respond with TERMINATE.
    """,
)

# --- Define a GroupChat ---
groupchat = autogen.GroupChat(agents=[user_proxy, planner], messages=[], max_round=5)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config={"config_list": config_list})

# --- Run the Conversation ---
if __name__ == "__main__":
    print("--- Running AutoGen Agent Conversation ---")
    
    # Start a conversation
    print("\nStarting conversation: 'Plan a simple Python script to calculate the factorial of a number.'")
    user_proxy.initiate_chat(
        manager,
        message="Plan a simple Python script to calculate the factorial of a number.",
    )

    print("\n--- Conversation History (for Debugging) ---")
    # Accessing the conversation history is a key debugging technique in AutoGen
    # This shows the full back-and-forth between agents
    for msg in user_proxy.chat_messages[manager]:
        # Some messages may not carry a 'name' field, so fall back to the role.
        print(f"[{msg.get('name', msg.get('role', 'unknown'))}]: {msg['content']}")

Explanation:

  1. Logging: Similar to LangGraph, we set up basic Python logging. AutoGen itself produces quite verbose logs at INFO level, which is helpful.
  2. API Key from Environment: We now explicitly retrieve the OPENAI_API_KEY using os.getenv(). If it’s not set, the script will exit with an error, ensuring secure and proper setup.
  3. config_list: The config_list for AutoGen agents is constructed directly using this environment variable, removing the need for an external OAI_CONFIG_LIST file.
  4. human_input_mode="NEVER": For automated testing, we set this to NEVER. For interactive debugging, you might temporarily change it to ALWAYS to step through the conversation and provide manual input.
  5. user_proxy.chat_messages[manager]: This is the magic for debugging in AutoGen! After a conversation, user_proxy.chat_messages holds a dictionary where keys are the agents it interacted with, and values are lists of all messages exchanged. Printing this history gives you a full transcript of the multi-agent deliberation.

To run this:

  1. Save the code as autogen_debug.py.
  2. Set your OPENAI_API_KEY environment variable.
  3. Run from your terminal: python autogen_debug.py

You’ll see the logging messages and then the full conversation history, which is invaluable for understanding how your agents arrived at their decisions.

4. Testing an AutoGen Conversation

Testing full AutoGen conversations often involves checking the final message content or ensuring specific agents participated.

Create a file test_autogen_agents.py:

# test_autogen_agents.py
import pytest
import os
import autogen
from unittest.mock import patch, MagicMock

# --- AutoGen Configuration (for testing) ---
# We'll use a mock config list to avoid actual API calls during tests
test_config_list = [
    {
        "model": "mock-model", # Use a placeholder model name
        "api_key": "mock-key", # Use a placeholder API key (won't be used due to mocking)
    }
]

# --- Mock LLM for AutoGen ---
@pytest.fixture
def mock_autogen_llm():
    """Fixture to mock the LLM calls within AutoGen agents."""
    # Patch OpenAIWrapper.create, the client method pyautogen 0.2.x agents use for LLM calls.
    with patch('autogen.OpenAIWrapper.create') as mock_create:
        def side_effect(*args, **kwargs):
            messages = kwargs.get('messages', [])
            last_message_content = messages[-1]['content'].lower() if messages else ""

            if "plan a simple python script" in last_message_content:
                response_content = "Plan: 1. Define a function for factorial. 2. Use a loop. 3. Return result. TERMINATE"
            elif "factorial of a number" in last_message_content: # For the planner's response
                response_content = "To calculate the factorial, you need a loop. TERMINATE"
            else:
                response_content = "Mock response for unknown query. TERMINATE"

            mock_choice = MagicMock()
            mock_choice.message.content = response_content
            mock_choice.message.function_call = None
            mock_choice.message.tool_calls = None  # Ensure the reply is treated as plain text
            mock_choice.finish_reason = "stop"

            mock_response = MagicMock()
            mock_response.choices = [mock_choice]
            mock_response.usage.prompt_tokens = 10
            mock_response.usage.completion_tokens = 10
            mock_response.cost = 0.0  # AutoGen normally attaches a cost to each response
            return mock_response

        mock_create.side_effect = side_effect
        yield mock_create

def test_autogen_factorial_planning(mock_autogen_llm):
    """Test if AutoGen agents can plan a factorial script."""
    
    # Redefine agents for the test to ensure they use the mock config list
    user_proxy = autogen.UserProxyAgent(
        name="Admin",
        system_message="A human admin. Interact with the Planner to ensure tasks are completed.",
        code_execution_config={"last_n_messages": 3, "work_dir": "coding", "use_docker": False},
        human_input_mode="NEVER",
        is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
    )

    planner = autogen.AssistantAgent(
        name="Planner",
        llm_config={"config_list": test_config_list}, # Use the mock config list
        system_message="""You are a helpful AI assistant that plans tasks.
        Your goal is to break down a complex request into smaller, manageable steps for other agents.
        When you have a plan, present it clearly.
        If the task involves code, just state "I will write code to solve this."
        Once the task is fully planned or completed, respond with TERMINATE.
        """,
    )

    groupchat = autogen.GroupChat(agents=[user_proxy, planner], messages=[], max_round=5)
    manager = autogen.GroupChatManager(groupchat=groupchat, llm_config={"config_list": test_config_list})

    # Initiate the chat
    user_proxy.initiate_chat(
        manager,
        message="Plan a simple Python script to calculate the factorial of a number.",
    )

    # Assertions: Check the last message from the Planner
    # We expect the Planner to have provided a plan and terminated.
    last_message = user_proxy.chat_messages[manager][-1]['content']
    assert "TERMINATE" in last_message
    assert "Plan:" in last_message
    assert "factorial" in last_message.lower()

Explanation:

  1. @pytest.fixture with patch: This is a more advanced mocking technique. We use unittest.mock.patch to replace autogen.OpenAIWrapper.create (the client method pyautogen 0.2.x uses for LLM calls) with our custom side_effect function.
    • The side_effect function simulates different LLM responses based on the input prompt. This makes our test entirely self-contained and deterministic.
    • We set mock_choice.message.function_call = None and mock_choice.message.tool_calls = None to ensure it behaves like a text-only response.
  2. Agent Redefinition: We redefine user_proxy and planner within the test function to ensure they pick up our test_config_list which is crucial for using the mocked LLM.
  3. initiate_chat: We run a full conversation just like in the debug example.
  4. Assertions: We check the user_proxy.chat_messages[manager][-1]['content'] (the last message in the conversation) for keywords like “TERMINATE” and “Plan:” to ensure the agents reached the expected outcome.

To run this:

  1. Save the code as test_autogen_agents.py.
  2. Run from your terminal: pytest test_autogen_agents.py

This test verifies that our agents can successfully plan a task given a specific prompt, without relying on actual LLM calls.

5. Debugging and Testing with CrewAI

CrewAI offers a verbose setting that provides excellent insights into the agent’s thought process and task execution.

First, ensure you have CrewAI installed:

pip install crewai==0.28.8 langchain-openai==0.1.1 # As of 2026-03-20

(Note: crewai version 0.28.8 and langchain-openai version 0.1.1 are stable as of 2026-03-20. Check for the latest stable versions if needed.)

Create a file crewai_debug.py:

# crewai_debug.py
import os
from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI
import logging

# --- Setup Logging ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# --- Environment Variable Check ---
if not os.getenv("OPENAI_API_KEY"):
    logger.error("OPENAI_API_KEY environment variable not set. CrewAI will likely fail without it.")
    exit("Please set the OPENAI_API_KEY environment variable.")

# --- LLM Setup ---
# Using gpt-4o as of 2026-03-20
# CrewAI can pick up the model from this env var, or you can pass it explicitly.
os.environ["OPENAI_MODEL_NAME"] = "gpt-4o" 

# --- Define Agents ---
researcher = Agent(
    role='Senior Research Analyst',
    goal='Uncover critical insights about the tech industry',
    backstory="""You are a Senior Research Analyst at a leading tech firm.
    Your expertise lies in identifying emerging trends and market shifts.""",
    verbose=True, # CRITICAL for debugging: shows agent's thought process
    allow_delegation=False,
    llm=ChatOpenAI(model="gpt-4o", temperature=0) # Explicitly set LLM for this agent
)

writer = Agent(
    role='Content Strategist',
    goal='Craft compelling narratives from research findings',
    backstory="""You are a Content Strategist, skilled in transforming complex data
    into engaging and easy-to-understand reports.""",
    verbose=True, # CRITICAL for debugging
    allow_delegation=False,
    llm=ChatOpenAI(model="gpt-4o", temperature=0)
)

# --- Define Tasks ---
research_task = Task(
    description="Analyze the latest trends in AI and cloud computing.",
    expected_output="A concise summary of 3-5 key trends in AI and cloud computing.",
    agent=researcher
)

write_report_task = Task(
    description="Write a short report (2-3 paragraphs) based on the research findings.",
    expected_output="A well-structured 2-3 paragraph report summarizing AI and cloud trends.",
    agent=writer
)

# --- Form the Crew ---
tech_crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_report_task],
    process=Process.sequential,
    verbose=True, # CRITICAL for debugging: shows overall crew execution
)

# --- Run the Crew ---
if __name__ == "__main__":
    print("--- Running CrewAI Agent System ---")
    
    # Kick off the crew's work
    result = tech_crew.kickoff()
    print("\n--- CrewAI Execution Result ---")
    print(result)

Explanation:

  1. verbose=True: This is the primary debugging mechanism in CrewAI.
    • Setting verbose=True on individual Agent objects makes the agent print its thought process, tool usage, and reasoning before executing actions. This is incredibly helpful for understanding why an agent made a decision.
    • Setting verbose=True on the Crew object itself shows the overall flow, task execution, and agent handoffs.
  2. Explicit LLM: We explicitly set llm=ChatOpenAI(...) for each agent. This ensures consistency and makes it clear which LLM is being used.
  3. Task expected_output: While not strictly for debugging, defining clear expected_output for tasks helps agents stay on track and provides a strong basis for testing.

To run this:

  1. Save the code as crewai_debug.py.
  2. Set your OPENAI_API_KEY environment variable.
  3. Run from your terminal: python crewai_debug.py

You’ll see a wealth of output, including each agent’s “thought” process, observations, and decisions, making it much easier to pinpoint where a workflow might go awry.

6. Testing a CrewAI Task

Testing CrewAI often involves verifying the output of a task or the final result of a crew.

Create a file test_crewai_tasks.py:

# test_crewai_tasks.py
import pytest
from unittest.mock import patch
from crewai import Agent, Task, Crew, Process
from langchain_core.messages import AIMessage
from langchain_openai import ChatOpenAI

# --- Mock LLM for CrewAI ---
@pytest.fixture
def mock_crewai_llm():
    """Fixture to mock the LLM calls within CrewAI agents."""
    # Patch the 'invoke' method on the ChatOpenAI class so every agent picks up the mock.
    with patch('langchain_openai.chat_models.ChatOpenAI.invoke') as mock_invoke:
        def side_effect(*args, **kwargs):
            # The agent chain passes the rendered prompt as the first positional argument.
            prompt_text = str(args[0]).lower() if args else str(kwargs.get("input", "")).lower()

            # CrewAI's executor expects ReAct-style output, so we answer with "Final Answer:".
            if "latest trends in ai and cloud computing" in prompt_text:
                response_text = "Final Answer: AI trends: Gen AI, Edge AI. Cloud trends: Serverless, Hybrid Cloud."
            elif "write a short report" in prompt_text:
                response_text = ("Final Answer: Report: Generative AI and Edge AI are booming. "
                                 "Cloud computing is evolving with serverless architectures and hybrid cloud solutions. "
                                 "These trends are shaping the future.")
            else:
                response_text = "Final Answer: Mocked LLM response."

            # Return a real AIMessage so the downstream output parser can handle it.
            return AIMessage(content=response_text)

        mock_invoke.side_effect = side_effect
        yield mock_invoke

def test_crewai_research_task(mock_crewai_llm):
    """Test the research task in isolation."""
    # Define agent and task specifically for this test
    researcher = Agent(
        role='Test Research Analyst',
        goal='Uncover specific test insights',
        backstory="You are a test researcher.",
        verbose=False, # Turn off verbose for cleaner test output
        allow_delegation=False,
        llm=ChatOpenAI(model="gpt-4o", temperature=0, api_key="test-key") # Dummy key; calls are mocked by the fixture
    )

    research_task = Task(
        description="Analyze the latest trends in AI and cloud computing.",
        expected_output="A concise summary of 3-5 key trends.",
        agent=researcher
    )

    # Execute the task
    result = research_task.execute()

    # Assertions
    assert "AI trends:" in result
    assert "Cloud trends:" in result
    assert "Gen AI" in result
    assert "Serverless" in result

def test_crewai_full_crew_execution(mock_crewai_llm):
    """Test the full crew execution and final report."""
    # Define agents
    researcher = Agent(
        role='Test Research Analyst',
        goal='Uncover specific test insights',
        backstory="You are a test researcher.",
        verbose=False,
        allow_delegation=False,
        llm=ChatOpenAI(model="gpt-4o", temperature=0, api_key="test-key") # Dummy key; calls are mocked
    )

    writer = Agent(
        role='Test Content Strategist',
        goal='Craft test narratives',
        backstory="You are a test writer.",
        verbose=False,
        allow_delegation=False,
        llm=ChatOpenAI(model="gpt-4o", temperature=0, api_key="test-key") # Dummy key; calls are mocked
    )

    # Define tasks
    research_task = Task(
        description="Analyze the latest trends in AI and cloud computing.",
        expected_output="A concise summary of 3-5 key trends in AI and cloud computing.",
        agent=researcher
    )

    write_report_task = Task(
        description="Write a short report (2-3 paragraphs) based on the research findings.",
        expected_output="A well-structured 2-3 paragraph report summarizing AI and cloud trends.",
        agent=writer
    )

    # Form the Crew
    tech_crew = Crew(
        agents=[researcher, writer],
        tasks=[research_task, write_report_task],
        process=Process.sequential,
        verbose=False,
    )

    # Kick off the crew's work
    result = tech_crew.kickoff()

    # Assertions for the final report
    assert "Report:" in result
    assert "Generative AI and Edge AI are booming." in result
    assert "Cloud computing is evolving with serverless architectures" in result

Explanation:

  1. mock_crewai_llm Fixture: Similar to AutoGen, we use unittest.mock.patch to mock ChatOpenAI.invoke. This allows us to control the LLM’s responses and ensure deterministic test results.
  2. test_crewai_research_task: This is a unit test for a single Task. We create the Agent and Task and then call research_task.execute(). This helps isolate potential issues in individual tasks.
  3. test_crewai_full_crew_execution: This is an integration/E2E test for the entire Crew. We define all agents and tasks, then call tech_crew.kickoff().
  4. Assertions: We use assert to check for key phrases in the result of both the single task and the full crew. This validates that the agents produced the expected information.

To run this:

  1. Save the code as test_crewai_tasks.py.
  2. Run from your terminal: pytest test_crewai_tasks.py

These tests provide confidence that your CrewAI agents are performing their tasks and collaborating correctly.

7. Debugging and Testing with Semantic Kernel

Semantic Kernel (SK) integrates well with standard Python logging and offers flexible ways to mock LLM services for testing.

First, ensure you have Semantic Kernel installed:

pip install semantic-kernel==0.9.1b1 # As of 2026-03-20

(Note: semantic-kernel version 0.9.1b1 is a pre-release as of 2026-03-20, reflecting its rapid development. Always verify the latest stable version if available. This example uses the beta version for modern features.)

Debugging with Logging: Semantic Kernel

Semantic Kernel’s Kernel object can be configured to produce detailed logs about prompt rendering, function calls, and LLM interactions.

Create a file sk_debug.py:

# sk_debug.py
import os
import logging
from typing import Annotated

from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
from semantic_kernel.functions import kernel_function

# --- Setup Logging ---
# Configure SK's internal logging to be verbose
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# --- Environment Variable Check ---
if not os.getenv("OPENAI_API_KEY"):
    logger.error("OPENAI_API_KEY environment variable not set. Semantic Kernel will likely fail without it.")
    exit("Please set the OPENAI_API_KEY environment variable.")

# --- Define a simple skill for demonstration ---
class MathSkill:
    @kernel_function(description="Adds two numbers together.", name="Add")
    def add(
        self,
        input: Annotated[str, "The first number to add"],
        number2: Annotated[str, "The second number to add"],
    ) -> Annotated[str, "The sum of the two numbers"]:
        logger.info(f"MathSkill.Add called with input='{input}', number2='{number2}'")
        try:
            result = float(input) + float(number2)
            return str(result)
        except ValueError:
            logger.error(f"Invalid input for MathSkill.Add: '{input}', '{number2}'")
            return "Error: Invalid numbers provided."

# --- Initialize Kernel and LLM ---
async def main():
    kernel = Kernel()

    # Add the OpenAI chat completion service
    # Using gpt-4o as of 2026-03-20
    kernel.add_service(
        OpenAIChatCompletion(
            service_id="default",
            ai_model_id="gpt-4o",
            api_key=os.getenv("OPENAI_API_KEY"),
        ),
    )

    # Import the custom skill
    kernel.import_plugin_from_object(MathSkill(), plugin_name="MyMath")

    # Define a prompt function that uses the skill
    prompt_template = """
    You are a helpful assistant.
    User query: {{ $input }}

    If the query involves adding two numbers, use the MyMath.Add skill.
    Otherwise, answer generally.
    """
    math_assistant_function = kernel.create_function_from_prompt(
        prompt_template=prompt_template,
        function_name="MathAssistant",
        plugin_name="MyAssistant",
        description="A math assistant that can add numbers or answer general questions."
    )

    print("--- Running Semantic Kernel Agent ---")

    # Example 1: Query that should use the MathSkill
    query1 = "What is 15 plus 7?"
    print(f"\nQuery: '{query1}'")
    # Using invoke_prompt_async for simpler calls, but planner would be used for complex chains
    # For a simple prompt function with tool calling, invoke_prompt_async is sufficient.
    result1 = await kernel.invoke_prompt_async(
        prompt=prompt_template,
        variables={"input": query1},
        function_call_behavior=kernel.auto_function_request(), # Enable auto-tool calling
    )
    print(f"SK Response: {result1}")

    # Example 2: General query
    query2 = "Tell me a fun fact about space."
    print(f"\nQuery: '{query2}'")
    result2 = await kernel.invoke_prompt_async(
        prompt=prompt_template,
        variables={"input": query2},
        function_call_behavior=kernel.auto_function_request(),
    )
    print(f"SK Response: {result2}")

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

Explanation:

  1. Logging Setup: We configure the root logger to DEBUG level. This allows Semantic Kernel’s internal components to emit detailed logs, including prompt construction, function calling, and LLM responses.
  2. @kernel_function: We define a simple MathSkill with an Add function. The @kernel_function decorator makes it discoverable by the kernel, and the Annotated type hints describe its parameters to the model. We add logger.info inside the skill to trace its execution.
  3. Kernel Initialization: We initialize the Kernel and add OpenAIChatCompletion service, retrieving the API key from environment variables for security.
  4. Plugin Import: Our MathSkill is imported into the kernel as a plugin named “MyMath”.
  5. Prompt Function with Tool Calling: We create a prompt function MathAssistant that explicitly tells the LLM to use the MyMath.Add skill if appropriate. kernel.auto_function_request() is crucial for enabling the LLM to call the defined skills.
  6. invoke_prompt_async: We use invoke_prompt_async to run our queries. The detailed logs will show whether the LLM decided to call the Add skill, what arguments it passed, and the skill’s return value.

To run this:

  1. Save the code as sk_debug.py.
  2. Set your OPENAI_API_KEY environment variable.
  3. Run from your terminal: python sk_debug.py

You’ll observe detailed logs showing the kernel’s internal workings, the LLM’s thought process (if it decides to call a tool), and the execution of your MathSkill.Add function.

Testing with Semantic Kernel: Mocking LLM Calls

To make tests deterministic and avoid API costs, we can mock the LLM service that Semantic Kernel uses. SK allows you to add custom AI services, which we can leverage for mocking.

Create a file test_sk_skills.py:

# test_sk_skills.py
import pytest
from typing import Annotated

from semantic_kernel import Kernel
from semantic_kernel.functions import kernel_function
from semantic_kernel.connectors.ai.chat_completion_client_base import ChatCompletionClientBase
from semantic_kernel.contents.chat_history import ChatHistory
from semantic_kernel.contents.chat_message_content import ChatMessageContent

# --- Define the MathSkill (same as in sk_debug.py) ---
class MathSkill:
    @kernel_function(description="Adds two numbers together.", name="Add")
    def add(
        self,
        input: Annotated[str, "The first number to add"],
        number2: Annotated[str, "The second number to add"],
    ) -> Annotated[str, "The sum of the two numbers"]:
        try:
            result = float(input) + float(number2)
            return str(result)
        except ValueError:
            return "Error: Invalid numbers provided."

# --- Mock Chat Completion Service ---
# We'll create a custom mock class that mimics the ChatCompletionClientBase interface.
class MockChatCompletion(ChatCompletionClientBase):
    def __init__(self, mock_responses: dict):
        self._mock_responses = mock_responses
        self._calls = [] # To record calls for assertion

    async def get_chat_message_contents(
        self, chat_history: ChatHistory, settings=None, **kwargs
    ) -> list[ChatMessageContent]:
        # Extract the last user message to determine the mock response
        last_user_message = ""
        for message in chat_history.messages:
            if message.role.value == "user":
                last_user_message = message.content
        
        self._calls.append(last_user_message) # Record the call

        # Check if the prompt suggests tool calling or general response
        if "MyMath.Add" in last_user_message and "15" in last_user_message and "7" in last_user_message:
            # Simulate LLM deciding to call the tool
            response_content = '{"tool_calls": [{"id": "call_123", "type": "function", "function": {"name": "MyMath-Add", "arguments": "{\\"input\\": \\"15\\", \\"number2\\": \\"7\\"}"}}]}'
        elif "fun fact" in last_user_message:
            response_content = "Mock fun fact about space: It's big! TERMINATE"
        else:
            response_content = "Mock general response. TERMINATE"
        
        return [ChatMessageContent(role="assistant", content=response_content)]

    async def get_streaming_chat_message_contents(self, chat_history: ChatHistory, settings=None, **kwargs):
        # Not implemented for this test, but would yield ChatMessageContent
        yield ChatMessageContent(role="assistant", content="Mock streaming response.")

# --- Pytest fixture to provide a mocked kernel ---
@pytest.fixture
async def mocked_kernel():
    kernel = Kernel()
    
    # Instantiate our mock chat completion service
    mock_service = MockChatCompletion(mock_responses={})
    kernel.add_service(mock_service, service_id="mock_chat")
    
    # Import the real MathSkill
    kernel.import_plugin_from_object(MathSkill(), plugin_name="MyMath")

    # Define the prompt function using the mock service
    prompt_template = """
    You are a helpful assistant.
    User query: {{ $input }}

    If the query involves adding two numbers, use the MyMath.Add skill.
    Otherwise, answer generally.
    """
    math_assistant_function = kernel.create_function_from_prompt(
        prompt_template=prompt_template,
        function_name="MathAssistant",
        plugin_name="MyAssistant",
        description="A math assistant that can add numbers or answer general questions.",
        ai_service_id="mock_chat" # Crucially, tell it to use our mock service
    )

    yield kernel, mock_service # Yield both the kernel and the mock service for assertions

    # Cleanup if necessary (not strictly needed for this mock)

@pytest.mark.asyncio
async def test_sk_math_skill_invocation(mocked_kernel):
    """Test if Semantic Kernel correctly invokes the MathSkill."""
    kernel, mock_service = mocked_kernel

    query = "What is 15 plus 7?"
    result = await kernel.invoke_prompt_async(
        prompt="User query: {{ $input }}", # Only pass the user query, the prompt function handles the rest
        variables={"input": query},
        function_call_behavior=kernel.auto_function_request(),
        ai_service_id="mock_chat" # Ensure this specific call uses the mock
    )
    
    # Assert that the MathSkill was called and returned the correct value
    # The MockChatCompletion simulates the LLM outputting a tool call for MyMath.Add
    # The kernel then executes the real MathSkill.Add, and its output is returned.
    assert result == "22.0" # Expected output from MathSkill.Add

@pytest.mark.asyncio
async def test_sk_general_response(mocked_kernel):
    """Test if Semantic Kernel provides a general response when no skill is needed."""
    kernel, mock_service = mocked_kernel

    query = "Tell me a fun fact about space."
    result = await kernel.invoke_prompt_async(
        prompt="User query: {{ $input }}",
        variables={"input": query},
        function_call_behavior=kernel.auto_function_request(),
        ai_service_id="mock_chat"
    )

    # Assert the mocked general response
    assert "Mock fun fact about space" in str(result)

Explanation:

  1. MockChatCompletion: This custom class inherits from ChatCompletionClientBase, which is the interface SK uses for its chat completion services.
    • It has an _mock_responses dictionary (though not fully used in this simple example, it’s good practice) and a _calls list to track what messages were sent to it.
    • The crucial get_chat_message_contents method intercepts LLM calls. We simulate the LLM’s response based on the input prompt. If it detects keywords related to MyMath.Add, it returns a string that mimics the LLM’s function call JSON. Otherwise, it returns a general mock response.
  2. mocked_kernel Fixture:
    • This pytest fixture creates a Kernel instance and registers our MockChatCompletion as an AI service with service_id="mock_chat".
    • It then imports the real MathSkill and creates the MathAssistant prompt function.
    • Crucially, when creating the math_assistant_function, we specify ai_service_id="mock_chat" to ensure it uses our mock.
  3. @pytest.mark.asyncio: Since Semantic Kernel operations are asynchronous, we use pytest-asyncio to run our async test functions.
  4. test_sk_math_skill_invocation:
    • We invoke the math_assistant_function with a math query.
    • Our MockChatCompletion intercepts the LLM call and returns a simulated function call for MyMath.Add.
    • Semantic Kernel then executes the actual MathSkill.Add function with the mocked arguments.
    • We assert that the result is “22.0”, verifying that the LLM (mocked) correctly identified the need for the tool, and the tool itself executed correctly.
  5. test_sk_general_response:
    • We invoke with a general query.
    • The mock LLM returns a general response.
    • We assert that the result contains our mock general response.

To run this:

  1. Save the code as test_sk_skills.py.
  2. Run from your terminal: pytest test_sk_skills.py

These tests demonstrate how to isolate and test Semantic Kernel components, including tool invocation and general responses, by effectively mocking the underlying LLM.

Mini-Challenge: Instrument Your Own Agent for Observability

Now it’s your turn! Pick an agent workflow you’ve built in a previous chapter (or create a new simple one) and apply the DTM principles we’ve discussed.

Challenge:

  1. Choose a Framework: Select either LangGraph, AutoGen, CrewAI, or Semantic Kernel.
  2. Add Comprehensive Logging:
    • Instrument your agent’s core components (nodes, agents, tasks, tools, skills) with logging.info() or framework-specific verbose settings.
    • Ensure your logs capture: inputs, outputs, key decisions, and state changes.
  3. Create a Basic Test Case:
    • Write a pytest test file for at least one critical part of your agent (e.g., a specific tool, an agent’s response to a query, or a sub-workflow).
    • Crucially, mock any external LLM calls or API interactions to make your test deterministic and fast.
    • Use assert statements to verify expected behavior or output characteristics.
  4. Run and Observe: Execute your agent with the logging enabled, and then run your tests.

Hint: Start small! Don’t try to log every single variable. Focus on the decision points and data transformations. For testing, pick the most deterministic part of your agent first.

What to Observe/Learn:

  • How does adding logging immediately clarify the agent’s execution path and reasoning?
  • How much easier is it to pinpoint where an unexpected behavior might originate when you have detailed logs?
  • How does mocking LLM calls simplify testing and make your tests run faster and more reliably?
  • What kinds of assertions are most useful for testing non-deterministic AI outputs (e.g., checking for keywords, structure, or successful tool calls, rather than exact strings)?

Common Pitfalls & Troubleshooting

Even with good DTM practices, AI agents can be tricky. Here are some common pitfalls:

  1. Over-reliance on LLM “Magic”: Assuming the LLM will “figure it out” without explicit instructions, robust tools, or validation.
    • Troubleshooting: Break down complex reasoning into smaller, tool-assisted steps. Add explicit validation logic for LLM outputs (e.g., check if a JSON response is valid).
  2. Neglecting Intermediate Logging: Only logging the start and end of a complex workflow.
    • Troubleshooting: Log inputs and outputs at every significant step (each node, each tool call, each agent interaction, each skill execution). This “breadcrumbing” is vital for understanding multi-step failures.
  3. Difficulty Reproducing Non-Deterministic Failures: An agent works 95% of the time, but fails sporadically in production.
    • Troubleshooting: Log the exact prompts sent to the LLM and the exact responses received. When a failure occurs, try to replay that specific prompt/response sequence in a controlled environment (using mocks). Implement retry mechanisms for transient LLM errors.
  4. Ignoring Token Usage and Cost in Monitoring: Only focusing on functional correctness.
    • Troubleshooting: Integrate token usage tracking into your monitoring dashboards. Set up alerts for unexpected cost spikes. Optimize prompts for conciseness and consider caching LLM responses for common queries.
  5. Brittle Prompts in Tests: Writing tests that break with minor, acceptable changes in LLM output (e.g., asserting an exact sentence match).
    • Troubleshooting: Test for characteristics of the output (keywords, presence of certain data, valid JSON structure, successful tool execution) rather than pixel-perfect string matches. Use in checks or regular expressions if needed.
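For example, instead of asserting an exact sentence, assert on characteristics of the output; the agent_answer string below is a stand-in for your agent's real response:

import json
import re

agent_answer = '{"summary": "AAPL is up 2% today.", "recommendation": "hold"}'

# Characteristic 1: the answer parses as JSON with the expected keys.
parsed = json.loads(agent_answer)
assert {"summary", "recommendation"} <= parsed.keys()

# Characteristic 2: the summary mentions the ticker, in any phrasing.
assert re.search(r"\bAAPL\b", parsed["summary"])

# Characteristic 3: the recommendation is one of an allowed set, not an exact sentence.
assert parsed["recommendation"].lower() in {"buy", "hold", "sell"}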

Summary

Phew! You’ve navigated the complex world of debugging, testing, and monitoring AI agent systems. Let’s recap the key takeaways:

  • DTM is paramount for building reliable and trustworthy AI agents, especially given the non-deterministic nature of LLMs and the complexity of multi-agent interactions.
  • Comprehensive Logging and Tracing are your best friends for debugging. Instrument every significant step of your agent’s workflow to understand its internal state and decision-making process.
  • Observability Platforms like LangSmith offer visual traces and metrics that dramatically simplify debugging and performance analysis.
  • The Testing Pyramid (Unit, Integration, E2E) provides a structured approach to building confidence in your agent’s behavior.
  • Mocking LLM calls and external tools is crucial for creating fast, deterministic, and reliable tests.
  • Golden Datasets are essential for E2E and regression testing, validating that your agent performs as expected for key scenarios.
  • Monitoring production agents for latency, token usage, error rates, and user feedback ensures continuous performance and improvement.
  • Common pitfalls like over-relying on LLM “magic” or neglecting intermediate logging can be avoided with diligent DTM practices.

By embracing these principles, you’re not just building smart agents; you’re building dependable agents. This is a crucial step towards deploying robust AI solutions in the real world.

In the next chapter, we’ll explore deployment strategies and how to get your reliable agent systems into production, ready to serve users!
