Introduction: Ensuring Agent Reliability
Welcome back, intrepid AI architects! In previous chapters, we’ve had a blast bringing our AI agents to life, equipping them with tools, memory, and sophisticated orchestration patterns. You’ve seen them tackle tasks, engage in conversations, and even collaborate. That’s fantastic!
But here’s a crucial question: How do we know our agents are truly reliable? What happens when a Large Language Model (LLM) hallucinates, a tool fails, or an agent misinterprets a prompt? Building AI agent systems isn’t just about crafting clever prompts and chaining components; it’s also about anticipating failure, identifying issues swiftly, and ensuring consistent, trustworthy performance. This is where the pillars of Debugging, Testing, and Monitoring (DTM) come into play.
In this chapter, we’ll dive deep into the essential practices that transform a cool agent prototype into a robust, production-ready system. We’ll explore the unique challenges of DTM in the context of non-deterministic AI, learn practical strategies for each framework, and equip you with the skills to build agent systems you can truly depend on. Get ready to put on your detective hat and make your agents bulletproof!
Core Concepts: Navigating the Non-Deterministic World
Debugging, testing, and monitoring AI agent systems present a unique set of challenges compared to traditional software development. Let’s explore why and how we can address them.
The Unique Challenges of DTM for AI Agents
Think about a traditional application: if you give it the same input, it will almost always produce the exact same output. Not so with AI agents, especially those powered by LLMs!
- Non-Determinism of LLMs: LLMs, by design, are probabilistic. Even with the same prompt and parameters (like `temperature=0`), their responses can vary slightly. This makes reproducing bugs and asserting exact outputs incredibly tricky.
- Multi-Agent Interaction Complexity: When multiple agents interact, the number of possible conversational paths and states explodes. Debugging becomes like tracing a conversation among several highly intelligent, yet sometimes unpredictable, individuals.
- Statefulness and Memory: Agents maintain internal state and memory. A subtle issue early in a long conversation might only manifest much later, making root cause analysis difficult.
- Tool Reliability: Agents rely on external tools (APIs, databases, custom functions). Failures can originate in the tools themselves, or in how the agent uses them (e.g., incorrect arguments, misinterpretation of tool output).
- Cost and Latency Considerations: Every LLM call costs money and takes time. Extensive debugging and testing can quickly become expensive and slow if not managed carefully.
- Prompt Engineering Fragility: A small change in a system prompt or agent instruction can drastically alter behavior, potentially introducing subtle regressions.
These challenges mean we need a slightly different mindset and toolset for DTM in the agentic world.
Debugging Strategies: Unmasking Agent Behavior
Debugging is the art of finding and fixing errors. For AI agents, it’s often about understanding why an agent made a particular decision or produced an unexpected output.
Logging and Tracing: Your Agent’s Inner Monologue
The most fundamental debugging tool is logging. For AI agents, it’s not enough to just log errors; we need to log the thought process. This includes:
- Inputs and Outputs: What prompt was sent to the LLM? What response did it return?
- Tool Calls: Which tool was called? What arguments were passed? What was the tool’s raw output?
- Agent Decisions: Why did the agent choose a particular path in a graph? What was its reasoning for delegating a task?
- State Changes: How did the agent’s internal state or memory evolve over time?
Structured logging (e.g., JSON logs) is highly recommended, as it makes logs easier to parse and analyze with tools.
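As a sketch of what that can look like with only the standard library (the `JsonFormatter` class and the `agent_step` field are illustrative, not a library API):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (illustrative helper)."""
    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Attach structured agent context if the caller supplied it via `extra=`.
        if hasattr(record, "agent_step"):
            payload["agent_step"] = record.agent_step
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("agent")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Log a tool call as structured data instead of an ad-hoc f-string:
logger.info("tool_call", extra={"agent_step": {"tool": "search_web", "args": {"q": "AAPL stock price"}}})
```

Each line is now machine-parseable, so a log aggregator can filter on a field like `agent_step.tool` instead of regex-matching free text.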
Observability Platforms: A Bird’s-Eye View
Specialized platforms like LangSmith (from LangChain, often used with LangGraph) are becoming indispensable. They provide:
- Trace Visualization: A visual timeline of all LLM calls, tool executions, and agent decisions within a run.
- Input/Output Inspection: Detailed views of prompts, responses, and intermediate steps.
- Cost and Latency Metrics: Tracking token usage and execution time for each step.
- Experimentation: Comparing different agent configurations or prompt versions.
While LangSmith is prominent for LangChain/LangGraph, other frameworks have built-in logging or community tools that offer similar insights.
Interactive Debugging: Stepping Through the Flow
Sometimes, you need to pause your agent and inspect its state directly. Using a Python debugger (like pdb or your IDE’s debugger) can be invaluable for:
- Stepping through custom tool code.
- Inspecting variables within an agent's `_run` method or a graph node.
- Understanding the exact data flowing between components.
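For instance, dropping a `breakpoint()` into a custom tool lets you pause and poke at its inputs interactively (the function below is a hypothetical tool, for illustration only):

```python
def calculate_discount(price: float, percent: float) -> float:
    """A custom tool function we might want to step through."""
    # Uncomment the next line to pause here and inspect `price` and `percent`;
    # breakpoint() drops into pdb by default (Python 3.7+).
    # breakpoint()
    if not 0 <= percent <= 100:
        raise ValueError(f"percent out of range: {percent}")
    return round(price * (1 - percent / 100), 2)

print(calculate_discount(200.0, 15))  # 170.0
```

Once paused, pdb commands like `p price`, `n` (next), and `c` (continue) let you confirm whether the agent passed the arguments you expected.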
Visualizing Workflows: The Map to Your Maze
For complex multi-agent systems or graph-based workflows, a visual representation can clarify the intended flow versus the actual execution. Tools like Mermaid can help you draw your agent’s decision tree or state transitions, making it easier to spot logical flaws.
Testing Methodologies: Building Confidence in Agent Behavior
Testing AI agents is about establishing confidence that they behave as expected under various conditions. Due to non-determinism, we often shift from “exact output” assertions to “reasonable output” or “expected behavior” assertions.
The Agent Testing Pyramid
Just like traditional software, we can think of an agent testing pyramid:
Unit Testing (Base):
- Focus: Individual, deterministic components.
- Examples: Testing a custom tool’s logic, a helper function that processes LLM output, or a prompt template’s rendering.
- Assertion: Exact output is often possible here.
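For example, a base-of-the-pyramid test of a prompt-building helper can assert exact strings (the helper below is a hypothetical stand-in for your own template code):

```python
def render_classification_prompt(query: str) -> str:
    """Deterministic helper: build a topic-classification prompt."""
    return (
        "Analyze the following user query and respond with only one word, "
        f"'math' or 'general'.\nQuery: \"{query}\""
    )

def test_prompt_contains_query():
    prompt = render_classification_prompt("What is 2 + 2?")
    # No LLM involved, so exact assertions are safe at this level.
    assert 'Query: "What is 2 + 2?"' in prompt
    assert "'math' or 'general'" in prompt
```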
Integration Testing (Middle):
- Focus: Interactions between components.
- Examples: An agent using a tool, a simple two-agent conversation, a specific node in a LangGraph workflow.
- Assertion: Check for correct tool calls, expected message formats, or general sentiment/topic of LLM responses. Mock LLM calls or specific tool outputs to isolate interaction logic.
End-to-End (E2E) Testing (Top):
- Focus: The entire agent system, from start to finish.
- Examples: A full multi-agent workflow, a complete conversation with a user, solving a complex problem.
- Assertion: Often involves “golden datasets” – predefined inputs with expected outputs (or output characteristics like keywords, structure, or successful task completion). This is where human evaluation often comes in.
Golden Datasets: Your Agent’s Report Card
For E2E testing, you’ll want to build a collection of “golden pairs”: (input_query, expected_output_characteristics).
- `input_query`: A typical user query or starting condition.
- `expected_output_characteristics`: This isn't always an exact string. It might be:
  - "The output should contain 'stock price' and 'AAPL'."
  - "The output should be a valid JSON object with keys 'summary' and 'recommendation'."
  - "The agent should successfully call the `search_web` tool at least once."
  - "The final answer should have positive/negative sentiment."
Running your agent system against these datasets regularly (regression testing) helps ensure that new changes haven’t introduced regressions.
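The idea can be sketched as a list of golden pairs where each expected characteristic is a predicate function (the names and the stub agent below are illustrative):

```python
import json

# Each golden pair: an input query plus a predicate over the agent's final output.
GOLDEN_PAIRS = [
    ("What is AAPL trading at?",
     lambda out: "stock price" in out.lower() and "AAPL" in out),
    ("Summarize the report and give a recommendation as JSON.",
     lambda out: set(json.loads(out)) >= {"summary", "recommendation"}),
]

def run_regression(agent_fn):
    """Run every golden pair through the agent and collect the failing queries."""
    failures = []
    for query, check in GOLDEN_PAIRS:
        if not check(agent_fn(query)):
            failures.append(query)
    return failures

# A stub standing in for the real agent system:
def fake_agent(query):
    if "AAPL" in query:
        return "The stock price of AAPL is $190."
    return json.dumps({"summary": "...", "recommendation": "..."})

print(run_regression(fake_agent))  # an empty list means no regressions
```

Running this in CI after every prompt or code change turns "did I break anything?" into a concrete, repeatable check.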
Monitoring Agent Performance: Keeping an Eye on Production
Once your agent system is deployed, monitoring becomes your eyes and ears, ensuring it continues to perform well in the wild.
Key Metrics to Track
- Latency: How long does it take for the agent to respond? (Total, and per LLM call/tool call).
- Token Usage & API Costs: How many tokens are consumed? What’s the cost per interaction? Are we staying within budget?
- Success Rate: What percentage of interactions result in a successful task completion or a satisfactory answer?
- Error Rate: How often do LLM calls fail, tools error out, or agents get stuck?
- Tool Usage: Which tools are being used most? Are there tools that are never used, or misused?
- User Feedback: How are users rating the agent’s performance?
Alerting and Anomaly Detection
Set up alerts for critical thresholds:
- Latency spikes (e.g., response time > 10 seconds).
- High error rates.
- Unexpected token usage.
- Sudden drops in success rate.
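A minimal sketch of such threshold checks over a metrics snapshot (the metric names and limits are illustrative, not from any particular monitoring product):

```python
from dataclasses import dataclass

@dataclass
class Thresholds:
    max_latency_s: float = 10.0      # alert on latency spikes
    max_error_rate: float = 0.05     # alert when more than 5% of calls fail
    max_tokens_per_call: int = 8000  # alert on unexpected token usage

def check_alerts(metrics: dict, t: Thresholds = Thresholds()) -> list:
    """Return human-readable alerts for every breached threshold."""
    alerts = []
    if metrics.get("p95_latency_s", 0) > t.max_latency_s:
        alerts.append(f"Latency spike: p95 is {metrics['p95_latency_s']}s")
    if metrics.get("error_rate", 0) > t.max_error_rate:
        alerts.append(f"High error rate: {metrics['error_rate']:.1%}")
    if metrics.get("avg_tokens", 0) > t.max_tokens_per_call:
        alerts.append(f"Token usage anomaly: avg {metrics['avg_tokens']} per call")
    return alerts

print(check_alerts({"p95_latency_s": 12.3, "error_rate": 0.02, "avg_tokens": 9100}))
```

In production you would feed real aggregated metrics into a check like this and wire the returned alerts to a pager or chat channel.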
Feedback Loops for Continuous Improvement
Monitoring isn’t just about spotting problems; it’s about learning.
- Human-in-the-Loop: For critical applications, human review of agent interactions can provide invaluable data for prompt refinement and system improvements.
- Data Collection: Log all interactions (anonymized if necessary) to build datasets for future training, fine-tuning, or more comprehensive E2E testing.
Illustrative Workflow Diagram: Agent System DTM Pipeline
Let’s visualize the continuous process of DTM for an AI agent system.
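A sketch of that cycle in Mermaid (node labels are illustrative):

```mermaid
flowchart LR
    Dev[Develop: prompts, tools, graphs] --> Test[Test: unit, integration, E2E]
    Test --> Deploy[Deploy]
    Deploy --> Monitor[Monitor: latency, cost, errors, feedback]
    Monitor --> Debug[Debug: traces and logs]
    Debug --> Dev
```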
This diagram shows how DTM is not a one-time event, but an iterative cycle that feeds back into development.
Step-by-Step Implementation: Practical DTM Examples
Let’s get practical and see how we can apply some of these DTM concepts using our familiar frameworks. We’ll focus on adding logging and basic testing.
First, ensure you have the necessary environment set up, which includes Python 3.9+ and pip. We'll use `pytest` for testing, so let's install it:
pip install pytest==8.1.1
(Note: pytest version 8.1.1 is stable as of 2026-03-20, but feel free to use the latest stable version if available.)
1. Debugging with Logging: LangGraph
LangGraph’s graph structure makes it relatively easy to instrument logging at each node. We’ll enhance a simple LangGraph workflow to show its internal steps.
Let’s imagine a very basic LangGraph that decides if a user’s query is about “math” or “general” and then routes it.
First, create a file named langgraph_debug.py:
# langgraph_debug.py
import os
import logging
from dataclasses import dataclass, field
from typing import Literal
from langchain_core.messages import AIMessage, BaseMessage, HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END
# --- Setup Logging ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
# --- Environment Variable Check ---
# Make sure to set your OPENAI_API_KEY environment variable.
if not os.getenv("OPENAI_API_KEY"):
    logger.warning("OPENAI_API_KEY environment variable not set. LLM calls might fail or use a mock.")
    # For demonstration, we'll proceed, but in a real app, you'd handle this more robustly.
# --- Define Graph State ---
# A dataclass gives us a real __init__, so AgentState(messages=[...]) works.
# Recent LangGraph versions accept dataclasses as state schemas.
@dataclass
class AgentState:
    messages: list[BaseMessage] = field(default_factory=list)
    topic: Literal["math", "general", "unknown"] = "unknown"
# --- LLM Setup ---
# Using gpt-4o as of 2026-03-20. Ensure OPENAI_API_KEY is set in your environment.
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# --- Agent Nodes ---
def classify_topic(state: AgentState) -> AgentState:
"""Classifies the topic of the conversation."""
logger.info(f"Node 'classify_topic' entered. Current messages: {state.messages}")
last_message = state.messages[-1].content
prompt = f"""
Analyze the following user query and determine if it's primarily about 'math' or 'general' topics.
Respond with only one word: 'math' or 'general'.
Query: "{last_message}"
"""
response = llm.invoke(prompt)
classification = response.content.strip().lower()
new_state = AgentState(messages=state.messages, topic=state.topic) # Start with current state
if "math" in classification:
new_state.topic = "math"
logger.info("Topic classified as: math")
elif "general" in classification:
new_state.topic = "general"
logger.info("Topic classified as: general")
else:
new_state.topic = "unknown"
logger.warning(f"Could not classify topic. LLM response: '{classification}'. Defaulting to unknown.")
return new_state
from langchain_core.messages import AIMessage  # agent replies should be AIMessage, not HumanMessage

def handle_math_query(state: AgentState) -> AgentState:
    """Handles math-related queries."""
    logger.info(f"Node 'handle_math_query' entered. Current messages: {state.messages}")
    last_message = state.messages[-1].content
    response_content = f"I'm a math expert! Let me help with: '{last_message}'"
    new_message = AIMessage(content=response_content)
    new_state = AgentState(messages=state.messages + [new_message], topic=state.topic)
    logger.info(f"Math query handled. Response: {response_content}")
    return new_state

def handle_general_query(state: AgentState) -> AgentState:
    """Handles general queries."""
    logger.info(f"Node 'handle_general_query' entered. Current messages: {state.messages}")
    last_message = state.messages[-1].content
    response_content = f"I'm a general knowledge expert! Here's my take on: '{last_message}'"
    new_message = AIMessage(content=response_content)
    new_state = AgentState(messages=state.messages + [new_message], topic=state.topic)
    logger.info(f"General query handled. Response: {response_content}")
    return new_state
# --- Conditional Edges ---
def route_topic(state: AgentState) -> Literal["math_handler", "general_handler"]:
"""Routes based on the classified topic."""
logger.info(f"Node 'route_topic' entered. Current topic: {state.topic}")
if state.topic == "math":
logger.info("Routing to math_handler.")
return "math_handler"
else: # Fallback to general handler for 'general' or 'unknown'
logger.info(f"Routing to general_handler (topic was '{state.topic}').")
return "general_handler"
# --- Build the Graph ---
workflow = StateGraph(AgentState)
# Add nodes
workflow.add_node("classify", classify_topic)
workflow.add_node("math_handler", handle_math_query)
workflow.add_node("general_handler", handle_general_query)
# Set entry point
workflow.set_entry_point("classify")
# Add edges
workflow.add_conditional_edges(
"classify",
route_topic,
{
"math_handler": "math_handler",
"general_handler": "general_handler", # Handles both 'general' and 'unknown' topics
},
)
# Set end points
workflow.add_edge("math_handler", END)
workflow.add_edge("general_handler", END)
# Compile the graph
app = workflow.compile()
# --- Run the Agent ---
if __name__ == "__main__":
    print("--- Running LangGraph Agent ---")
    # Example 1: Math query
    print("\nQuery: 'What is the square root of 9?'")
    initial_state_math = AgentState(messages=[HumanMessage(content="What is the square root of 9?")])
    # invoke returns the final state as a dict of channel values.
    result_math = app.invoke(initial_state_math)
    print(f"Final Math Response: {result_math['messages'][-1].content}")
    # Example 2: General query
    print("\nQuery: 'Tell me a fun fact about cats.'")
    initial_state_general = AgentState(messages=[HumanMessage(content="Tell me a fun fact about cats.")])
    result_general = app.invoke(initial_state_general)
    print(f"Final General Response: {result_general['messages'][-1].content}")
    # Example 3: Ambiguous query (should fall back to general)
    print("\nQuery: 'Purple elephant flying in space.'")
    initial_state_ambiguous = AgentState(messages=[HumanMessage(content="Purple elephant flying in space.")])
    result_ambiguous = app.invoke(initial_state_ambiguous)
    print(f"Final Ambiguous Response: {result_ambiguous['messages'][-1].content}")
Explanation:
- Logging Setup: We start by importing the `logging` module and configuring a basic logger. This logger prints messages to the console, showing the time, logger name, level (INFO, WARNING), and the message.
- `logger.info()` & `logger.warning()`: Inside each node function (`classify_topic`, `handle_math_query`, `handle_general_query`) and the routing function (`route_topic`), we've added `logger.info()` calls. These messages tell us when a node is entered, what its current state is, and what decision it's making.
- State Inspection: Notice how we log `state.messages` or `state.topic` at the beginning of each node. This is crucial for understanding the context an agent is operating with at any given point.
- Conditional Logging: In `classify_topic`, we log the classification result, and a `logger.warning()` if the classification is unknown. This highlights potential issues.
To run this:
- Save the code as `langgraph_debug.py`.
- Set your `OPENAI_API_KEY` environment variable. For example, on Linux/macOS: `export OPENAI_API_KEY="sk-..."`. On Windows (CMD): `set OPENAI_API_KEY="sk-..."`.
- Run from your terminal: `python langgraph_debug.py`
Observe the detailed logs in your console. You’ll see the agent’s journey through the graph, making it much easier to debug if it takes an unexpected turn!
2. Testing a LangGraph Node
Now, let’s write a simple unit test for our classify_topic node using pytest.
Create a new file named test_langgraph_nodes.py in the same directory:
# test_langgraph_nodes.py
import pytest
import os
from unittest.mock import MagicMock
from langchain_core.messages import HumanMessage
from langgraph_debug import classify_topic, AgentState # Import from our previous file
# Mock the LLM to make tests deterministic and avoid API calls
import langgraph_debug  # imported as a module so we can swap out its global `llm`

@pytest.fixture
def mock_llm_response(monkeypatch):
    """Fixture that replaces the module-level LLM with a deterministic mock."""
    def mock_invoke_logic(prompt):
        """Custom logic for the mocked LLM invoke."""
        mock_response = MagicMock()
        if "square root" in prompt.lower() or "math" in prompt.lower():
            mock_response.content = "math"
        elif "fun fact" in prompt.lower() or "general" in prompt.lower():
            mock_response.content = "general"
        else:
            mock_response.content = "unknown_classification"
        return mock_response

    mock_llm = MagicMock()
    mock_llm.invoke.side_effect = mock_invoke_logic
    # monkeypatch restores the real `llm` automatically after each test.
    monkeypatch.setattr(langgraph_debug, "llm", mock_llm)
    yield mock_llm
def test_classify_topic_math(mock_llm_response):
"""Test if classify_topic correctly identifies a math query."""
initial_state = AgentState(messages=[HumanMessage(content="What is the square root of 16?")])
result_state = classify_topic(initial_state)
assert result_state.topic == "math"
def test_classify_topic_general(mock_llm_response):
"""Test if classify_topic correctly identifies a general query."""
initial_state = AgentState(messages=[HumanMessage(content="Tell me a fun fact about giraffes.")])
result_state = classify_topic(initial_state)
assert result_state.topic == "general"
def test_classify_topic_unknown(mock_llm_response):
"""Test if classify_topic handles an unknown query."""
initial_state = AgentState(messages=[HumanMessage(content="Purple elephants fly on Tuesdays.")])
result_state = classify_topic(initial_state)
assert result_state.topic == "unknown"
Explanation:
- `pytest`: We use `pytest` as our testing framework. It automatically discovers tests (functions starting with `test_`).
- `mock_llm_response` fixture: This is crucial! To make our tests deterministic and avoid actual API calls (which cost money and are slow), the fixture swaps the real LLM for a mock while each test runs and restores it afterward. `MagicMock` lets us simulate the response object an LLM would return.
- The mock's invoke logic checks the prompt and returns a predefined response with the expected `content`.
- Test Functions: Each `test_` function creates an `AgentState` with a specific `HumanMessage` and then calls `classify_topic`.
- `assert` Statements: We use `assert` to check that `result_state.topic` matches our expected classification.
To run this:
- Make sure `langgraph_debug.py` and `test_langgraph_nodes.py` are in the same directory.
- Run from your terminal: `pytest test_langgraph_nodes.py`
You should see output indicating that all 3 tests passed! This gives us confidence that our classify_topic node works as intended, regardless of the actual LLM’s non-deterministic nature.
3. Debugging and Testing with AutoGen
AutoGen provides excellent built-in logging and a clear way to inspect conversation history, which is key for debugging. We’ll modify the example to load API keys directly from environment variables.
First, ensure you have AutoGen installed:
pip install pyautogen==0.2.20
(Note: pyautogen version 0.2.20 is stable as of 2026-03-20. Check for the latest stable version if needed.)
Create a file autogen_debug.py:
# autogen_debug.py
import os
import autogen
import logging
# --- Setup Logging ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
# --- Environment Variable Check and AutoGen Configuration ---
# It's best practice to load API keys from environment variables for security.
# Make sure to set your OPENAI_API_KEY environment variable.
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
logger.error("OPENAI_API_KEY environment variable not set. AutoGen will likely fail without it.")
# In a real application, you might exit or provide a mock here.
exit("Please set the OPENAI_API_KEY environment variable.")
# AutoGen config list using the environment variable
config_list = [
{
"model": "gpt-4o", # Using gpt-4o as of 2026-03-20
"api_key": openai_api_key,
}
]
# --- Define Agents ---
# The User Proxy Agent is typically the entry point for human interaction or initiating tasks.
user_proxy = autogen.UserProxyAgent(
name="Admin",
system_message="A human admin. Interact with the Planner to ensure tasks are completed.",
code_execution_config={"last_n_messages": 3, "work_dir": "coding"},
human_input_mode="NEVER", # Set to ALWAYS for interactive debugging
is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
)
# The Planner Agent will receive the initial task and break it down.
planner = autogen.AssistantAgent(
name="Planner",
llm_config={"config_list": config_list}, # Use the config_list with API key
system_message="""You are a helpful AI assistant that plans tasks.
Your goal is to break down a complex request into smaller, manageable steps for other agents.
When you have a plan, present it clearly.
If the task involves code, just state "I will write code to solve this."
Once the task is fully planned or completed, respond with TERMINATE.
""",
)
# --- Define a GroupChat ---
groupchat = autogen.GroupChat(agents=[user_proxy, planner], messages=[], max_round=5)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config={"config_list": config_list})
# --- Run the Conversation ---
if __name__ == "__main__":
print("--- Running AutoGen Agent Conversation ---")
# Start a conversation
print("\nStarting conversation: 'Plan a simple Python script to calculate the factorial of a number.'")
user_proxy.initiate_chat(
manager,
message="Plan a simple Python script to calculate the factorial of a number.",
)
print("\n--- Conversation History (for Debugging) ---")
# Accessing the conversation history is a key debugging technique in AutoGen
# This shows the full back-and-forth between agents
for msg in user_proxy.chat_messages[manager]:
print(f"[{msg['name']}]: {msg['content']}")
Explanation:
- Logging: Similar to LangGraph, we set up basic Python logging. AutoGen itself produces quite verbose logs at `INFO` level, which is helpful.
- API Key from Environment: We explicitly retrieve the `OPENAI_API_KEY` using `os.getenv()`. If it's not set, the script exits with an error, ensuring secure and proper setup.
- `config_list`: The `config_list` for AutoGen agents is constructed directly from this environment variable, removing the need for an external `OAI_CONFIG_LIST` file.
- `human_input_mode="NEVER"`: For automated runs, we set this to `NEVER`. For interactive debugging, you might temporarily change it to `ALWAYS` to step through the conversation and provide manual input.
- `user_proxy.chat_messages[manager]`: This is the magic for debugging in AutoGen! After a conversation, `user_proxy.chat_messages` holds a dictionary where keys are the agents it interacted with and values are lists of all messages exchanged. Printing this history gives you a full transcript of the multi-agent deliberation.
To run this:
- Save the code as `autogen_debug.py`.
- Set your `OPENAI_API_KEY` environment variable.
- Run from your terminal: `python autogen_debug.py`
You’ll see the logging messages and then the full conversation history, which is invaluable for understanding how your agents arrived at their decisions.
4. Testing an AutoGen Conversation
Testing full AutoGen conversations often involves checking the final message content or ensuring specific agents participated.
Create a file test_autogen_agents.py:
# test_autogen_agents.py
import pytest
import os
import autogen
from unittest.mock import patch, MagicMock
# --- AutoGen Configuration (for testing) ---
# We'll use a mock config list to avoid actual API calls during tests
test_config_list = [
{
"model": "mock-model", # Use a placeholder model name
"api_key": "mock-key", # Use a placeholder API key (won't be used due to mocking)
}
]
# --- Mock LLM for AutoGen ---
@pytest.fixture
def mock_autogen_llm():
    """Fixture to mock the LLM calls within AutoGen agents.

    Note: this patches pyautogen 0.2.x internals (`OpenAIWrapper`); the patch
    targets may need adjusting for other AutoGen versions.
    """
    def fake_create(*args, **kwargs):
        """Simulate an LLM completion based on the last incoming message."""
        messages = kwargs.get('messages', [])
        last_message_content = messages[-1]['content'].lower() if messages else ""
        if "plan a simple python script" in last_message_content:
            response_content = "Plan: 1. Define a function for factorial. 2. Use a loop. 3. Return result. TERMINATE"
        elif "factorial of a number" in last_message_content:  # for the planner's response
            response_content = "To calculate the factorial, you need a loop. TERMINATE"
        else:
            response_content = "Mock response for unknown query. TERMINATE"
        mock_response = MagicMock()
        mock_response.mocked_text = response_content  # stashed for fake_extract below
        return mock_response

    def fake_extract(response, *args, **kwargs):
        """Return the stashed text as a one-element list of completions."""
        return [response.mocked_text]

    with patch('autogen.oai.client.OpenAIWrapper.create', side_effect=fake_create) as mock_create, \
         patch('autogen.oai.client.OpenAIWrapper.extract_text_or_completion_object', side_effect=fake_extract):
        yield mock_create
def test_autogen_factorial_planning(mock_autogen_llm):
"""Test if AutoGen agents can plan a factorial script."""
# Redefine agents for the test to ensure they use the mock config list
user_proxy = autogen.UserProxyAgent(
name="Admin",
system_message="A human admin. Interact with the Planner to ensure tasks are completed.",
code_execution_config={"last_n_messages": 3, "work_dir": "coding"},
human_input_mode="NEVER",
is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
)
planner = autogen.AssistantAgent(
name="Planner",
llm_config={"config_list": test_config_list}, # Use the mock config list
system_message="""You are a helpful AI assistant that plans tasks.
Your goal is to break down a complex request into smaller, manageable steps for other agents.
When you have a plan, present it clearly.
If the task involves code, just state "I will write code to solve this."
Once the task is fully planned or completed, respond with TERMINATE.
""",
)
groupchat = autogen.GroupChat(agents=[user_proxy, planner], messages=[], max_round=5)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config={"config_list": test_config_list})
# Initiate the chat
user_proxy.initiate_chat(
manager,
message="Plan a simple Python script to calculate the factorial of a number.",
)
# Assertions: Check the last message from the Planner
# We expect the Planner to have provided a plan and terminated.
last_message = user_proxy.chat_messages[manager][-1]['content']
assert "TERMINATE" in last_message
assert "Plan:" in last_message
assert "factorial" in last_message.lower()
Explanation:
- `@pytest.fixture` with `patch`: This is a more advanced mocking technique. We use `unittest.mock.patch` to intercept the LLM call AutoGen makes under the hood and substitute our own deterministic logic.
- The fake response logic inspects the incoming messages and returns a canned reply, making the test entirely self-contained and deterministic: no network access or real API key is needed.
- Agent Redefinition: We redefine `user_proxy` and `planner` inside the test function so they pick up our `test_config_list`, which is crucial for using the mocked LLM.
- `initiate_chat`: We run a full conversation, just like in the debug example.
- Assertions: We check `user_proxy.chat_messages[manager][-1]['content']` (the last message in the conversation) for keywords like "TERMINATE" and "Plan:" to ensure the agents reached the expected outcome.
To run this:
- Save the code as `test_autogen_agents.py`.
- Run from your terminal: `pytest test_autogen_agents.py`
This test verifies that our agents can successfully plan a task given a specific prompt, without relying on actual LLM calls.
5. Debugging and Testing with CrewAI
CrewAI offers a verbose setting that provides excellent insights into the agent’s thought process and task execution.
First, ensure you have CrewAI installed:
pip install crewai==0.28.8 langchain-openai==0.1.1 # As of 2026-03-20
(Note: crewai version 0.28.8 and langchain-openai version 0.1.1 are stable as of 2026-03-20. Check for the latest stable versions if needed.)
Create a file crewai_debug.py:
# crewai_debug.py
import os
from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI
import logging
# --- Setup Logging ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
# --- Environment Variable Check ---
if not os.getenv("OPENAI_API_KEY"):
logger.error("OPENAI_API_KEY environment variable not set. CrewAI will likely fail without it.")
exit("Please set the OPENAI_API_KEY environment variable.")
# --- LLM Setup ---
# Using gpt-4o as of 2026-03-20
# CrewAI can pick up the model from this env var, or you can pass it explicitly.
os.environ["OPENAI_MODEL_NAME"] = "gpt-4o"
# --- Define Agents ---
researcher = Agent(
role='Senior Research Analyst',
goal='Uncover critical insights about the tech industry',
backstory="""You are a Senior Research Analyst at a leading tech firm.
Your expertise lies in identifying emerging trends and market shifts.""",
verbose=True, # CRITICAL for debugging: shows agent's thought process
allow_delegation=False,
llm=ChatOpenAI(model="gpt-4o", temperature=0) # Explicitly set LLM for this agent
)
writer = Agent(
role='Content Strategist',
goal='Craft compelling narratives from research findings',
backstory="""You are a Content Strategist, skilled in transforming complex data
into engaging and easy-to-understand reports.""",
verbose=True, # CRITICAL for debugging
allow_delegation=False,
llm=ChatOpenAI(model="gpt-4o", temperature=0)
)
# --- Define Tasks ---
research_task = Task(
description="Analyze the latest trends in AI and cloud computing.",
expected_output="A concise summary of 3-5 key trends in AI and cloud computing.",
agent=researcher
)
write_report_task = Task(
description="Write a short report (2-3 paragraphs) based on the research findings.",
expected_output="A well-structured 2-3 paragraph report summarizing AI and cloud trends.",
agent=writer
)
# --- Form the Crew ---
tech_crew = Crew(
agents=[researcher, writer],
tasks=[research_task, write_report_task],
process=Process.sequential,
verbose=True, # CRITICAL for debugging: shows overall crew execution
)
# --- Run the Crew ---
if __name__ == "__main__":
print("--- Running CrewAI Agent System ---")
# Kick off the crew's work
result = tech_crew.kickoff()
print("\n--- CrewAI Execution Result ---")
print(result)
Explanation:
- `verbose=True`: This is the primary debugging mechanism in CrewAI.
  - Setting `verbose=True` on individual `Agent` objects makes the agent print its thought process, tool usage, and reasoning before executing actions. This is incredibly helpful for understanding why an agent made a decision.
  - Setting `verbose=True` on the `Crew` object itself shows the overall flow, task execution, and agent handoffs.
- Explicit LLM: We explicitly set `llm=ChatOpenAI(...)` for each agent. This ensures consistency and makes it clear which LLM is being used.
- Task `expected_output`: While not strictly for debugging, defining a clear `expected_output` for each task helps agents stay on track and provides a strong basis for testing.
To run this:
- Save the code as `crewai_debug.py`.
- Set your `OPENAI_API_KEY` environment variable.
- Run from your terminal: `python crewai_debug.py`
You’ll see a wealth of output, including each agent’s “thought” process, observations, and decisions, making it much easier to pinpoint where a workflow might go awry.
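Because all of that verbose output goes to stdout, it can scroll past faster than you can read it. A simple, framework-agnostic trick is to capture stdout to a file so you can search the transcript afterwards. Here is a stdlib-only sketch; `noisy_task` is a hypothetical stand-in for `tech_crew.kickoff`:

```python
import contextlib
import io

def run_with_transcript(fn, log_path):
    """Run `fn`, capture everything it prints, and save it to a log file."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = fn()
    transcript = buffer.getvalue()
    print(transcript, end="")  # replay the captured output to the real console
    with open(log_path, "w", encoding="utf-8") as f:
        f.write(transcript)
    return result, transcript

# Hypothetical stand-in for `tech_crew.kickoff`; any callable that prints works.
def noisy_task():
    print("Thought: I should analyze the latest AI trends.")
    print("Action: delegating to the Content Strategist.")
    return "final report"

result, transcript = run_with_transcript(noisy_task, "crew_debug.log")
```

Some CrewAI releases also accept an `output_log_file` argument on `Crew` for the same purpose — check the docs for the version you have installed.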
6. Testing a CrewAI Task
Testing CrewAI often involves verifying the output of a task or the final result of a crew.
Create a file test_crewai_tasks.py:
# test_crewai_tasks.py
import pytest
from unittest.mock import patch, MagicMock
from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI

# --- Mock LLM for CrewAI ---
@pytest.fixture
def mock_crewai_llm():
    """Fixture to mock the LLM calls within CrewAI agents."""
    # Patch the 'invoke' method of ChatOpenAI instances.
    with patch('langchain_openai.chat_models.ChatOpenAI.invoke') as mock_invoke:
        def side_effect(*args, **kwargs):
            # Simulate an LLM response based on the prompt content
            message_content = kwargs.get('messages', [])[-1].content if kwargs.get('messages') else ""
            if "latest trends in AI and cloud computing" in message_content:
                response_text = "AI trends: Gen AI, Edge AI. Cloud trends: Serverless, Hybrid Cloud. TERMINATE"
            elif "write a short report" in message_content.lower():  # case-insensitive match against the task description
                response_text = "Report: Generative AI and Edge AI are booming. Cloud computing is evolving with serverless architectures and hybrid cloud solutions. These trends are shaping the future. TERMINATE"
            else:
                response_text = "Mocked LLM response. TERMINATE"
            mock_response = MagicMock()
            mock_response.content = response_text
            mock_response.tool_calls = []  # No tool calls in this simple mock
            return mock_response
        mock_invoke.side_effect = side_effect
        yield mock_invoke

def test_crewai_research_task(mock_crewai_llm):
    """Test the research task in isolation."""
    # Define agent and task specifically for this test
    researcher = Agent(
        role='Test Research Analyst',
        goal='Uncover specific test insights',
        backstory="You are a test researcher.",
        verbose=False,  # Turn off verbose for cleaner test output
        allow_delegation=False,
        llm=ChatOpenAI(model="gpt-4o", temperature=0)  # LLM will be mocked by the fixture
    )
    research_task = Task(
        description="Analyze the latest trends in AI and cloud computing.",
        expected_output="A concise summary of 3-5 key trends.",
        agent=researcher
    )
    # Execute the task
    result = research_task.execute()
    # Assertions
    assert "AI trends:" in result
    assert "Cloud trends:" in result
    assert "Gen AI" in result
    assert "Serverless" in result

def test_crewai_full_crew_execution(mock_crewai_llm):
    """Test the full crew execution and final report."""
    # Define agents
    researcher = Agent(
        role='Test Research Analyst',
        goal='Uncover specific test insights',
        backstory="You are a test researcher.",
        verbose=False,
        allow_delegation=False,
        llm=ChatOpenAI(model="gpt-4o", temperature=0)  # LLM will be mocked
    )
    writer = Agent(
        role='Test Content Strategist',
        goal='Craft test narratives',
        backstory="You are a test writer.",
        verbose=False,
        allow_delegation=False,
        llm=ChatOpenAI(model="gpt-4o", temperature=0)  # LLM will be mocked
    )
    # Define tasks
    research_task = Task(
        description="Analyze the latest trends in AI and cloud computing.",
        expected_output="A concise summary of 3-5 key trends in AI and cloud computing.",
        agent=researcher
    )
    write_report_task = Task(
        description="Write a short report (2-3 paragraphs) based on the research findings.",
        expected_output="A well-structured 2-3 paragraph report summarizing AI and cloud trends.",
        agent=writer
    )
    # Form the Crew
    tech_crew = Crew(
        agents=[researcher, writer],
        tasks=[research_task, write_report_task],
        process=Process.sequential,
        verbose=False,
    )
    # Kick off the crew's work
    result = tech_crew.kickoff()
    # Assertions for the final report
    assert "Report:" in result
    assert "Generative AI and Edge AI are booming." in result
    assert "Cloud computing is evolving with serverless architectures" in result
Explanation:
- `mock_crewai_llm` Fixture: Similar to AutoGen, we use `unittest.mock.patch` to mock `ChatOpenAI.invoke`. This allows us to control the LLM’s responses and ensure deterministic test results.
- `test_crewai_research_task`: This is a unit test for a single `Task`. We create the `Agent` and `Task` and then call `research_task.execute()`. This helps isolate potential issues in individual tasks.
- `test_crewai_full_crew_execution`: This is an integration/E2E test for the entire `Crew`. We define all agents and tasks, then call `tech_crew.kickoff()`.
- Assertions: We use `assert` to check for key phrases in the `result` of both the single task and the full crew. This validates that the agents produced the expected information.
To run this:
- Save the code as `test_crewai_tasks.py`.
- Run from your terminal: `pytest test_crewai_tasks.py`
These tests provide confidence that your CrewAI agents are performing their tasks and collaborating correctly.
7. Debugging and Testing with Semantic Kernel
Semantic Kernel (SK) integrates well with standard Python logging and offers flexible ways to mock LLM services for testing.
First, ensure you have Semantic Kernel installed:
pip install semantic-kernel==0.9.1b1
(Note: semantic-kernel 0.9.1b1 was a pre-release at the time of writing, reflecting the framework’s rapid development. Semantic Kernel’s Python API names have changed significantly between versions, so verify the calls below against the documentation for the version you install, and prefer the latest stable release when one is available.)
Debugging with Logging: Semantic Kernel
Semantic Kernel’s Kernel object can be configured to produce detailed logs about prompt rendering, function calls, and LLM interactions.
Create a file sk_debug.py:
# sk_debug.py
import os
import logging
from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
from semantic_kernel.functions import kernel_function

# --- Setup Logging ---
# Configure SK's internal logging to be verbose
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# --- Environment Variable Check ---
if not os.getenv("OPENAI_API_KEY"):
    logger.error("OPENAI_API_KEY environment variable not set. Semantic Kernel will likely fail without it.")
    exit("Please set the OPENAI_API_KEY environment variable.")

# --- Define a simple skill for demonstration ---
class MathSkill:
    @kernel_function(
        description="Adds two numbers together.",
        name="Add",
    )
    def add(self, input: str, number2: str) -> str:
        logger.info(f"MathSkill.Add called with input='{input}', number2='{number2}'")
        try:
            result = float(input) + float(number2)
            return str(result)
        except ValueError:
            logger.error(f"Invalid input for MathSkill.Add: '{input}', '{number2}'")
            return "Error: Invalid numbers provided."

# --- Initialize Kernel and LLM ---
async def main():
    kernel = Kernel()
    # Add the OpenAI chat completion service (gpt-4o)
    kernel.add_service(
        OpenAIChatCompletion(
            service_id="default",
            ai_model_id="gpt-4o",
            api_key=os.getenv("OPENAI_API_KEY"),
        ),
    )
    # Import the custom skill
    kernel.import_plugin_from_object(MathSkill(), plugin_name="MyMath")
    # Define a prompt function that uses the skill
    prompt_template = """
    You are a helpful assistant.
    User query: {{ $input }}
    If the query involves adding two numbers, use the MyMath.Add skill.
    Otherwise, answer generally.
    """
    math_assistant_function = kernel.create_function_from_prompt(
        prompt_template=prompt_template,
        function_name="MathAssistant",
        plugin_name="MyAssistant",
        description="A math assistant that can add numbers or answer general questions."
    )
    print("--- Running Semantic Kernel Agent ---")

    # Example 1: Query that should use the MathSkill
    query1 = "What is 15 plus 7?"
    print(f"\nQuery: '{query1}'")
    # invoke_prompt_async suffices for a simple prompt function with tool calling;
    # a planner would be used for complex chains.
    result1 = await kernel.invoke_prompt_async(
        prompt=prompt_template,
        variables={"input": query1},
        function_call_behavior=kernel.auto_function_request(),  # Enable auto tool calling
    )
    print(f"SK Response: {result1}")

    # Example 2: General query
    query2 = "Tell me a fun fact about space."
    print(f"\nQuery: '{query2}'")
    result2 = await kernel.invoke_prompt_async(
        prompt=prompt_template,
        variables={"input": query2},
        function_call_behavior=kernel.auto_function_request(),
    )
    print(f"SK Response: {result2}")

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
Explanation:
- Logging Setup: We configure the root logger to `DEBUG` level. This allows Semantic Kernel’s internal components to emit detailed logs, including prompt construction, function calling, and LLM responses.
- `@kernel_function`: We define a simple `MathSkill` with an `Add` function. The `@kernel_function` decorator makes it discoverable by the kernel. We add `logger.info` inside the skill to trace its execution.
- Kernel Initialization: We initialize the `Kernel` and add an `OpenAIChatCompletion` service, retrieving the API key from an environment variable for security.
- Plugin Import: Our `MathSkill` is imported into the kernel as a plugin named "MyMath".
- Prompt Function with Tool Calling: We create a prompt function `MathAssistant` that explicitly tells the LLM to use the `MyMath.Add` skill if appropriate. `kernel.auto_function_request()` is crucial for enabling the LLM to call the defined skills.
- `invoke_prompt_async`: We use `invoke_prompt_async` to run our queries. The detailed logs will show whether the LLM decided to call the `Add` skill, what arguments it passed, and the skill’s return value.
To run this:
- Save the code as `sk_debug.py`.
- Set your `OPENAI_API_KEY` environment variable.
- Run from your terminal: `python sk_debug.py`
You’ll observe detailed logs showing the kernel’s internal workings, the LLM’s thought process (if it decides to call a tool), and the execution of your MathSkill.Add function.
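One refinement: setting the root logger to `DEBUG`, as above, also unleashes every third-party library on your console. In practice you usually keep the root quiet and raise verbosity only for the namespaces you care about. A stdlib-only sketch (the `semantic_kernel` and `openai` logger names are assumptions — confirm them against the `%(name)s` field in your own log output):

```python
import logging

# Keep the root logger quiet so unrelated libraries don't flood the console...
logging.basicConfig(level=logging.WARNING,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

# ...then selectively turn up only the components you are debugging.
logging.getLogger("semantic_kernel").setLevel(logging.DEBUG)  # assumed namespace
logging.getLogger("openai").setLevel(logging.INFO)            # assumed namespace

# Loggers without an explicit level inherit the root's WARNING threshold.
assert logging.getLogger("semantic_kernel").getEffectiveLevel() == logging.DEBUG
assert logging.getLogger("some_other_lib").getEffectiveLevel() == logging.WARNING
```

This per-namespace control is plain Python `logging`, so it works the same for LangGraph, AutoGen, and CrewAI components that log through standard loggers.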
Testing with Semantic Kernel: Mocking LLM Calls
To make tests deterministic and avoid API costs, we can mock the LLM service that Semantic Kernel uses. SK allows you to add custom AI services, which we can leverage for mocking.
Create a file test_sk_skills.py:
# test_sk_skills.py
import pytest
from semantic_kernel import Kernel
from semantic_kernel.functions import kernel_function
from semantic_kernel.connectors.ai.chat_completion_client_base import ChatCompletionClientBase
from semantic_kernel.contents.chat_history import ChatHistory
from semantic_kernel.contents.chat_message_content import ChatMessageContent

# --- Define the MathSkill (same as in sk_debug.py) ---
class MathSkill:
    @kernel_function(
        description="Adds two numbers together.",
        name="Add",
    )
    def add(self, input: str, number2: str) -> str:
        try:
            result = float(input) + float(number2)
            return str(result)
        except ValueError:
            return "Error: Invalid numbers provided."

# --- Mock Chat Completion Service ---
# A custom mock class that mimics the ChatCompletionClientBase interface.
class MockChatCompletion(ChatCompletionClientBase):
    def __init__(self, mock_responses: dict):
        self._mock_responses = mock_responses
        self._calls = []  # To record calls for assertion

    async def get_chat_message_contents(
        self, chat_history: ChatHistory, settings=None, **kwargs
    ) -> list[ChatMessageContent]:
        # Extract the last user message to determine the mock response
        last_user_message = ""
        for message in chat_history.messages:
            if message.role.value == "user":
                last_user_message = message.content
        self._calls.append(last_user_message)  # Record the call
        # Check if the prompt suggests tool calling or a general response
        if "MyMath.Add" in last_user_message and "15" in last_user_message and "7" in last_user_message:
            # Simulate the LLM deciding to call the tool
            response_content = '{"tool_calls": [{"id": "call_123", "type": "function", "function": {"name": "MyMath-Add", "arguments": "{\\"input\\": \\"15\\", \\"number2\\": \\"7\\"}"}}]}'
        elif "fun fact" in last_user_message:
            response_content = "Mock fun fact about space: It's big! TERMINATE"
        else:
            response_content = "Mock general response. TERMINATE"
        return [ChatMessageContent(role="assistant", content=response_content)]

    async def get_streaming_chat_message_contents(self, chat_history: ChatHistory, settings=None, **kwargs):
        # Not implemented for this test, but would yield ChatMessageContent
        yield ChatMessageContent(role="assistant", content="Mock streaming response.")

# --- Pytest fixture to provide a mocked kernel ---
@pytest.fixture
async def mocked_kernel():
    kernel = Kernel()
    # Instantiate our mock chat completion service
    mock_service = MockChatCompletion(mock_responses={})
    kernel.add_service(mock_service, service_id="mock_chat")
    # Import the real MathSkill
    kernel.import_plugin_from_object(MathSkill(), plugin_name="MyMath")
    # Define the prompt function using the mock service
    prompt_template = """
    You are a helpful assistant.
    User query: {{ $input }}
    If the query involves adding two numbers, use the MyMath.Add skill.
    Otherwise, answer generally.
    """
    math_assistant_function = kernel.create_function_from_prompt(
        prompt_template=prompt_template,
        function_name="MathAssistant",
        plugin_name="MyAssistant",
        description="A math assistant that can add numbers or answer general questions.",
        ai_service_id="mock_chat"  # Crucially, tell it to use our mock service
    )
    yield kernel, mock_service  # Yield both the kernel and the mock service for assertions

@pytest.mark.asyncio
async def test_sk_math_skill_invocation(mocked_kernel):
    """Test if Semantic Kernel correctly invokes the MathSkill."""
    kernel, mock_service = mocked_kernel
    query = "What is 15 plus 7?"
    result = await kernel.invoke_prompt_async(
        prompt="User query: {{ $input }}",  # Only pass the user query; the prompt function handles the rest
        variables={"input": query},
        function_call_behavior=kernel.auto_function_request(),
        ai_service_id="mock_chat"  # Ensure this specific call uses the mock
    )
    # Assert that the MathSkill was called and returned the correct value.
    # The MockChatCompletion simulates the LLM emitting a tool call for MyMath.Add;
    # the kernel then executes the real MathSkill.Add, and its output is returned.
    assert result == "22.0"  # Expected output from MathSkill.Add

@pytest.mark.asyncio
async def test_sk_general_response(mocked_kernel):
    """Test if Semantic Kernel provides a general response when no skill is needed."""
    kernel, mock_service = mocked_kernel
    query = "Tell me a fun fact about space."
    result = await kernel.invoke_prompt_async(
        prompt="User query: {{ $input }}",
        variables={"input": query},
        function_call_behavior=kernel.auto_function_request(),
        ai_service_id="mock_chat"
    )
    # Assert the mocked general response
    assert "Mock fun fact about space" in str(result)
Explanation:
- `MockChatCompletion`: This custom class inherits from `ChatCompletionClientBase`, which is the interface SK uses for its chat completion services.
  - It has a `_mock_responses` dictionary (though not fully used in this simple example, it’s good practice) and a `_calls` list to track what messages were sent to it.
  - The crucial `get_chat_message_contents` method intercepts LLM calls. We simulate the LLM’s response based on the input prompt. If it detects keywords related to `MyMath.Add`, it returns a string that mimics the LLM’s function-call JSON. Otherwise, it returns a general mock response.
- `mocked_kernel` Fixture:
  - This `pytest` fixture creates a `Kernel` instance and registers our `MockChatCompletion` as an AI service with `service_id="mock_chat"`.
  - It then imports the real `MathSkill` and creates the `MathAssistant` prompt function.
  - Crucially, when creating the `math_assistant_function`, we specify `ai_service_id="mock_chat"` to ensure it uses our mock.
- `@pytest.mark.asyncio`: Since Semantic Kernel operations are asynchronous, we use `pytest-asyncio` to run our async test functions.
- `test_sk_math_skill_invocation`:
  - We invoke the `math_assistant_function` with a math query.
  - Our `MockChatCompletion` intercepts the LLM call and returns a simulated function call for `MyMath.Add`.
  - Semantic Kernel then executes the actual `MathSkill.Add` function with the mocked arguments.
  - We assert that the `result` is "22.0", verifying that the (mocked) LLM correctly identified the need for the tool, and that the tool itself executed correctly.
- `test_sk_general_response`:
  - We invoke with a general query.
  - The mock LLM returns a general response.
  - We assert that the result contains our mock general response.
To run this:
- Save the code as `test_sk_skills.py`.
- Run from your terminal: `pytest test_sk_skills.py`
These tests demonstrate how to isolate and test Semantic Kernel components, including tool invocation and general responses, by effectively mocking the underlying LLM.
Mini-Challenge: Instrument Your Own Agent for Observability
Now it’s your turn! Pick an agent workflow you’ve built in a previous chapter (or create a new simple one) and apply the DTM principles we’ve discussed.
Challenge:
- Choose a Framework: Select either LangGraph, AutoGen, CrewAI, or Semantic Kernel.
- Add Comprehensive Logging:
  - Instrument your agent’s core components (nodes, agents, tasks, tools, skills) with `logging.info()` or framework-specific `verbose` settings.
  - Ensure your logs capture: inputs, outputs, key decisions, and state changes.
- Create a Basic Test Case:
  - Write a `pytest` test file for at least one critical part of your agent (e.g., a specific tool, an agent’s response to a query, or a sub-workflow).
  - Crucially, mock any external LLM calls or API interactions to make your test deterministic and fast.
  - Use `assert` statements to verify expected behavior or output characteristics.
- Run and Observe: Execute your agent with the logging enabled, and then run your tests.
Hint: Start small! Don’t try to log every single variable. Focus on the decision points and data transformations. For testing, pick the most deterministic part of your agent first.
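If you are unsure where to begin with the test-case step, here is a minimal, framework-agnostic skeleton. The `summarize_agent` function and its injectable `llm` parameter are hypothetical stand-ins for whatever your chosen framework exposes; the key idea is substituting a mock at the LLM boundary:

```python
from unittest.mock import Mock

def real_llm_call(prompt: str) -> str:
    # Placeholder for an actual API call; tests must never reach this.
    raise RuntimeError("Real LLM should not be called in tests")

def summarize_agent(topic: str, llm=real_llm_call) -> str:
    # Imagine this wraps your real agent: build a prompt, call the LLM, post-process.
    response = llm(f"Summarize the latest trends in {topic}.")
    return response.strip()

def test_summarize_agent_is_deterministic():
    mock_llm = Mock(return_value="  AI trends: Gen AI, Edge AI.  ")
    result = summarize_agent("AI", llm=mock_llm)
    mock_llm.assert_called_once()
    assert "Gen AI" in result                # characteristic check, not an exact match
    assert "AI" in mock_llm.call_args[0][0]  # the prompt actually mentioned the topic

# Normally pytest discovers and runs this; we call it directly here for illustration.
test_summarize_agent_is_deterministic()
```

Dependency injection (passing `llm` as a parameter) avoids fragile `patch` target paths, but `unittest.mock.patch`, as shown in the CrewAI tests above, works just as well when you cannot change the function signature.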
What to Observe/Learn:
- How does adding logging immediately clarify the agent’s execution path and reasoning?
- How much easier is it to pinpoint where an unexpected behavior might originate when you have detailed logs?
- How does mocking LLM calls simplify testing and make your tests run faster and more reliably?
- What kinds of assertions are most useful for testing non-deterministic AI outputs (e.g., checking for keywords, structure, or successful tool calls, rather than exact strings)?
Common Pitfalls & Troubleshooting
Even with good DTM practices, AI agents can be tricky. Here are some common pitfalls:
- Over-reliance on LLM “Magic”: Assuming the LLM will “figure it out” without explicit instructions, robust tools, or validation.
- Troubleshooting: Break down complex reasoning into smaller, tool-assisted steps. Add explicit validation logic for LLM outputs (e.g., check if a JSON response is valid).
- Neglecting Intermediate Logging: Only logging the start and end of a complex workflow.
- Troubleshooting: Log inputs and outputs at every significant step (each node, each tool call, each agent interaction, each skill execution). This “breadcrumbing” is vital for understanding multi-step failures.
- Difficulty Reproducing Non-Deterministic Failures: An agent works 95% of the time, but fails sporadically in production.
- Troubleshooting: Log the exact prompts sent to the LLM and the exact responses received. When a failure occurs, try to replay that specific prompt/response sequence in a controlled environment (using mocks). Implement retry mechanisms for transient LLM errors.
- Ignoring Token Usage and Cost in Monitoring: Only focusing on functional correctness.
- Troubleshooting: Integrate token usage tracking into your monitoring dashboards. Set up alerts for unexpected cost spikes. Optimize prompts for conciseness and consider caching LLM responses for common queries.
- Brittle Prompts in Tests: Writing tests that break with minor, acceptable changes in LLM output (e.g., asserting an exact sentence match).
- Troubleshooting: Test for characteristics of the output (keywords, presence of certain data, valid JSON structure, successful tool execution) rather than pixel-perfect string matches. Use `in` checks or regular expressions if needed.
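To make the "characteristics over exact strings" advice concrete, you can codify these checks as small reusable helpers. A stdlib-only sketch (the sample LLM outputs are invented for illustration):

```python
import json
import re

def assert_has_keywords(text: str, keywords) -> None:
    """Pass if every keyword appears, regardless of surrounding phrasing."""
    missing = [kw for kw in keywords if kw.lower() not in text.lower()]
    assert not missing, f"Missing keywords: {missing}"

def assert_valid_json_with_keys(text: str, required_keys) -> dict:
    """Pass if the output parses as JSON and contains the required keys."""
    data = json.loads(text)
    missing = [k for k in required_keys if k not in data]
    assert not missing, f"Missing keys: {missing}"
    return data

def assert_matches(text: str, pattern: str) -> None:
    """Pass if the output matches a structural pattern, e.g. 'contains a percentage'."""
    assert re.search(pattern, text), f"Pattern {pattern!r} not found"

# Two differently-worded but equally acceptable mock LLM outputs:
out_a = '{"summary": "Serverless adoption is up 40%", "trends": ["Gen AI"]}'
out_b = '{"trends": ["Gen AI"], "summary": "A 40% rise in serverless adoption"}'
for out in (out_a, out_b):
    data = assert_valid_json_with_keys(out, ["summary", "trends"])
    assert_has_keywords(data["summary"], ["serverless"])
    assert_matches(data["summary"], r"\d+%")
```

Both outputs pass the same test even though their exact wording differs, which is precisely the resilience an exact-string assertion lacks.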
Summary
Phew! You’ve navigated the complex world of debugging, testing, and monitoring AI agent systems. Let’s recap the key takeaways:
- DTM is paramount for building reliable and trustworthy AI agents, especially given the non-deterministic nature of LLMs and the complexity of multi-agent interactions.
- Comprehensive Logging and Tracing are your best friends for debugging. Instrument every significant step of your agent’s workflow to understand its internal state and decision-making process.
- Observability Platforms like LangSmith offer visual traces and metrics that dramatically simplify debugging and performance analysis.
- The Testing Pyramid (Unit, Integration, E2E) provides a structured approach to building confidence in your agent’s behavior.
- Mocking LLM calls and external tools is crucial for creating fast, deterministic, and reliable tests.
- Golden Datasets are essential for E2E and regression testing, validating that your agent performs as expected for key scenarios.
- Monitoring production agents for latency, token usage, error rates, and user feedback ensures continuous performance and improvement.
- Common pitfalls like over-relying on LLM “magic” or neglecting intermediate logging can be avoided with diligent DTM practices.
By embracing these principles, you’re not just building smart agents; you’re building dependable agents. This is a crucial step towards deploying robust AI solutions in the real world.
In the next chapter, we’ll explore deployment strategies and how to get your reliable agent systems into production, ready to serve users!
References
- LangChain Documentation: LangSmith
- AutoGen Documentation: Logging and Debugging
- CrewAI Documentation: Verbose Mode
- Semantic Kernel Documentation: Logging
- Pytest Documentation
- Python Logging How-To