When your AI agent fails in production at 3 AM, you need answers fast: What input triggered the failure? Which tool call failed? Why did the agent make that decision? Without proper observability, debugging agents is like flying blind—you're guessing instead of knowing.
This guide covers how to build comprehensive observability into agent systems so you can monitor, debug, and optimize with confidence.
Why Agent Observability Is Different
Traditional application monitoring focuses on request/response patterns, error rates, and latency. That's necessary but insufficient for agents. You also need to understand:
Agent Decision-Making:
- Why did the agent choose tool X over tool Y?
- What reasoning led to this output?
- Which retrieved documents influenced the response?
Multi-Step Workflows:
- How did the agent traverse a complex task?
- Where in a 20-step workflow did it fail?
- Which intermediate results propagated errors?
Non-Determinism:
- How much do outputs vary for the same input?
- Are failures reproducible or stochastic?
- What's the distribution of agent behaviors?
Context Window Management:
- What's in the agent's context at decision time?
- When does truncation occur?
- Are we hitting context limits?
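The non-determinism questions above can be turned into numbers by running one input several times and tallying the distinct outputs. The following is a minimal sketch under stated assumptions: `summarizeOutputs` and `VariabilityReport` are hypothetical names, and outputs are compared by exact match after trimming and lowercasing, which will overstate variation for free-form text.

```typescript
// Hypothetical helper: summarize how much a set of outputs for the SAME input varies.
interface VariabilityReport {
  runs: number;
  distinctOutputs: number;
  // Share of runs that produced the most common output (1.0 = fully consistent)
  modalShare: number;
  // Normalized output -> fraction of runs that produced it
  distribution: Record<string, number>;
}

function summarizeOutputs(outputs: string[]): VariabilityReport {
  const counts = new Map<string, number>();
  for (const raw of outputs) {
    // Assumption: trimming and lowercasing is an adequate normalization for this sketch
    const key = raw.trim().toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }

  const runs = outputs.length;
  const distribution: Record<string, number> = {};
  let maxCount = 0;
  counts.forEach((count, key) => {
    distribution[key] = count / runs;
    maxCount = Math.max(maxCount, count);
  });

  return {
    runs,
    distinctOutputs: counts.size,
    modalShare: runs === 0 ? 0 : maxCount / runs,
    distribution
  };
}
```

To use it, collect the responses from repeated runs of a single task and pass them in as an array. A modal share well below 1.0 suggests that failures for that input are likely to be stochastic rather than reproducible.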
What to Log in Agent Systems
1. Agent Execution Traces
Every agent invocation should be logged with full context:
interface AgentExecutionLog {
// Identity
traceId: string;
spanId: string;
parentSpanId?: string;
agentId: string;
agentVersion: string;
// Request
task: string;
inputTokens: number;
contextDocuments: string[];
// Execution
startTime: number;
endTime: number;
durationMs: number;
// Decision-making
reasoningSteps: ReasoningStep[];
toolCalls: ToolCall[];
modelCalls: ModelCall[];
// Output
response: string;
outputTokens: number;
confidence?: number;
// Metadata
userId: string;
sessionId: string;
environment: 'prod' | 'staging' | 'dev';
// Costs
totalCostUsd: number;
// Status
status: 'success' | 'error' | 'timeout';
error?: ErrorDetails;
}
interface ReasoningStep {
step: number;
thought: string;
observation: string;
action: string;
timestamp: number;
}
interface ToolCall {
toolName: string;
input: any;
output: any;
durationMs: number;
success: boolean;
error?: string;
timestamp: number;
}
interface ModelCall {
model: string;
promptTokens: number;
completionTokens: number;
latencyMs: number;
cached: boolean;
costUsd: number;
timestamp: number;
}
2. Structured Logging Implementation
class AgentLogger {
private currentTrace: AgentExecutionLog | null = null;
startTrace(task: string, context: ExecutionContext): string {
const traceId = generateTraceId();
this.currentTrace = {
traceId,
spanId: generateSpanId(),
agentId: context.agentId,
agentVersion: context.version,
task,
inputTokens: this.estimateTokens(task),
contextDocuments: context.documents.map(d => d.id),
startTime: Date.now(),
endTime: 0,
durationMs: 0,
reasoningSteps: [],
toolCalls: [],
modelCalls: [],
response: '',
outputTokens: 0,
userId: context.userId,
sessionId: context.sessionId,
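// NODE_ENV is conventionally 'production' or 'development', so map it explicitly instead of casting
environment: process.env.NODE_ENV === 'production' ? 'prod' : process.env.NODE_ENV === 'staging' ? 'staging' : 'dev',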
totalCostUsd: 0,
status: 'success'
};
console.log(JSON.stringify({
event: 'agent_trace_started',
traceId,
task,
timestamp: Date.now()
}));
return traceId;
}
logReasoningStep(thought: string, observation: string, action: string) {
if (!this.currentTrace) return;
this.currentTrace.reasoningSteps.push({
step: this.currentTrace.reasoningSteps.length + 1,
thought,
observation,
action,
timestamp: Date.now()
});
console.log(JSON.stringify({
event: 'reasoning_step',
traceId: this.currentTrace.traceId,
step: this.currentTrace.reasoningSteps.length,
thought,
observation,
action
}));
}
logToolCall(toolName: string, input: any, output: any, duration: number, success: boolean, error?: string) {
if (!this.currentTrace) return;
const toolCall: ToolCall = {
toolName,
input,
output,
durationMs: duration,
success,
error,
timestamp: Date.now()
};
this.currentTrace.toolCalls.push(toolCall);
console.log(JSON.stringify({
event: 'tool_call',
traceId: this.currentTrace.traceId,
toolName,
success,
durationMs: duration,
error
}));
}
logModelCall(model: string, promptTokens: number, completionTokens: number, latency: number, cost: number, cached: boolean) {
if (!this.currentTrace) return;
const modelCall: ModelCall = {
model,
promptTokens,
completionTokens,
latencyMs: latency,
cached,
costUsd: cost,
timestamp: Date.now()
};
this.currentTrace.modelCalls.push(modelCall);
this.currentTrace.totalCostUsd += cost;
console.log(JSON.stringify({
event: 'model_call',
traceId: this.currentTrace.traceId,
model,
promptTokens,
completionTokens,
cached,
costUsd: cost
}));
}
endTrace(response: string, status: 'success' | 'error' | 'timeout', error?: ErrorDetails) {
if (!this.currentTrace) return;
this.currentTrace.response = response;
this.currentTrace.outputTokens = this.estimateTokens(response);
this.currentTrace.endTime = Date.now();
this.currentTrace.durationMs = this.currentTrace.endTime - this.currentTrace.startTime;
this.currentTrace.status = status;
this.currentTrace.error = error;
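// Ship the complete trace to the logging backend. shipTrace is async, so catch rejections
// here; otherwise a backend outage surfaces as an unhandled promise rejection.
const shippedTraceId = this.currentTrace.traceId;
this.shipTrace(this.currentTrace).catch(err => {
console.error(JSON.stringify({ event: 'trace_ship_failed', traceId: shippedTraceId, error: String(err) }));
});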
console.log(JSON.stringify({
event: 'agent_trace_completed',
traceId: this.currentTrace.traceId,
status,
durationMs: this.currentTrace.durationMs,
totalCostUsd: this.currentTrace.totalCostUsd
}));
this.currentTrace = null;
}
private async shipTrace(trace: AgentExecutionLog) {
// Send to your observability backend (e.g., LangSmith, Helicone, custom); the URL below is a placeholder
await fetch('https://your-observability-backend/traces', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(trace)
});
}
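// Rough heuristic (about 4 characters per token for English text).
// Prefer the token counts reported by the model provider when they are available.
private estimateTokens(text: string): number {
return Math.ceil(text.length / 4);
}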
}
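One caveat with the logger above: it holds a single `currentTrace`, so two overlapping requests on a shared instance would write into each other's trace. The following is a minimal sketch of one way around that, keeping in-flight traces in a map keyed by trace ID. `TraceRegistry`, `ActiveTrace`, and the counter-based IDs are assumptions for illustration, and the trace shape is trimmed to a few fields.

```typescript
// Trimmed-down trace shape for the sketch (the full AgentExecutionLog has many more fields)
interface ActiveTrace {
  traceId: string;
  task: string;
  startTime: number;
  toolCalls: { toolName: string; success: boolean }[];
}

// Hypothetical registry: one entry per in-flight trace instead of a single shared field
class TraceRegistry {
  private active = new Map<string, ActiveTrace>();
  private nextId = 1;

  start(task: string, now: number = Date.now()): string {
    // Assumption: a counter is enough for a sketch; use a real ID generator in practice
    const traceId = `trace-${this.nextId++}`;
    this.active.set(traceId, { traceId, task, startTime: now, toolCalls: [] });
    return traceId;
  }

  logToolCall(traceId: string, toolName: string, success: boolean): void {
    const trace = this.active.get(traceId);
    if (!trace) return; // Unknown or already-ended trace: ignore, as the logger above does
    trace.toolCalls.push({ toolName, success });
  }

  end(traceId: string): ActiveTrace | undefined {
    const trace = this.active.get(traceId);
    this.active.delete(traceId); // Remove so finished traces do not accumulate in memory
    return trace;
  }

  get inFlight(): number {
    return this.active.size;
  }
}
```

Callers pass the trace ID returned by `start` to every later call, so each request only ever touches its own entry. Creating one logger instance per request achieves the same isolation with less code.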
3. Distributed Tracing for Multi-Agent Workflows
When multiple agents collaborate, you need distributed tracing to follow the full execution path:
class DistributedTracer {
async traceWorkflow(workflowId: string, agents: Agent[]) {
const rootSpan = this.createSpan({
traceId: workflowId,
spanId: generateSpanId(),
name: 'workflow_execution',
startTime: Date.now()
});
for (const agent of agents) {
const childSpan = this.createSpan({
traceId: workflowId,
spanId: generateSpanId(),
parentSpanId: rootSpan.spanId,
name: `agent_${agent.name}`,
startTime: Date.now()
});
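try {
await agent.execute({ traceContext: childSpan });
this.endSpan(childSpan, 'success');
} catch (error) {
this.endSpan(childSpan, 'error', error);
// Close the root span too; otherwise a failed workflow leaves it stuck in 'running'
this.endSpan(rootSpan, 'error', error);
throw error;
}
}
this.endSpan(rootSpan, 'success');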
}
createSpan(config: SpanConfig): Span {
const span = { ...config, events: [], status: 'running' };
console.log(JSON.stringify({ event: 'span_started', ...span }));
return span;
}
endSpan(span: Span, status: string, error?: any) {
span.endTime = Date.now();
span.durationMs = span.endTime - span.startTime;
span.status = status;
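// Error objects serialize to {} under JSON.stringify, so keep the message instead
span.error = error instanceof Error ? error.message : error;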
console.log(JSON.stringify({ event: 'span_ended', ...span }));
}
}
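Once spans are logged as flat records, following the execution path means rebuilding the parent/child hierarchy from `spanId` and `parentSpanId`. The following is a minimal sketch of that step; `renderSpanTree` and `SpanRecord` are hypothetical names, and the record shape is an assumption based on the fields the tracer above emits.

```typescript
// Assumed shape of a finished span record, based on the fields logged above
interface SpanRecord {
  spanId: string;
  parentSpanId?: string;
  name: string;
  durationMs: number;
  status: string;
}

// Hypothetical helper: rebuild the span hierarchy and render it as an indented outline
function renderSpanTree(spans: SpanRecord[]): string[] {
  const knownIds = new Set(spans.map(s => s.spanId));
  const children = new Map<string, SpanRecord[]>();
  const roots: SpanRecord[] = [];

  for (const span of spans) {
    // A span whose parent is absent from this batch is treated as a root
    if (span.parentSpanId && knownIds.has(span.parentSpanId)) {
      const siblings = children.get(span.parentSpanId) ?? [];
      siblings.push(span);
      children.set(span.parentSpanId, siblings);
    } else {
      roots.push(span);
    }
  }

  const lines: string[] = [];
  const visit = (span: SpanRecord, depth: number): void => {
    lines.push(`${'  '.repeat(depth)}${span.name} [${span.status}] ${span.durationMs}ms`);
    for (const child of children.get(span.spanId) ?? []) {
      visit(child, depth + 1);
    }
  };
  roots.forEach(root => visit(root, 0));
  return lines;
}
```

Printing the returned lines gives a quick view of which agent inside a workflow failed and how much of the total duration each one consumed.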
Key Metrics to Track
1. Request-Level Metrics
interface RequestMetrics {
// Latency
p50LatencyMs: number;
p95LatencyMs: number;
p99LatencyMs: number;
// Success rates
successRate: number;
errorRate: number;
timeoutRate: number;
// Costs
avgCostPerRequest: number;
totalDailyCost: number;
// Token usage
avgPromptTokens: number;
avgCompletionTokens: number;
// Cache performance
cacheHitRate: number;
}
2. Agent-Level Metrics
interface AgentMetrics {
// Decision quality
avgConfidence: number;
decisionDistribution: Record<string, number>; // Which tools chosen how often
// Reasoning depth
avgReasoningSteps: number;
maxReasoningSteps: number;
// Tool usage
toolCallCounts: Record<string, number>;
toolSuccessRates: Record<string, number>;
avgToolLatency: Record<string, number>;
// Context management
avgContextLength: number;
contextTruncationRate: number;
}
3. System-Level Metrics
interface SystemMetrics {
// Throughput
requestsPerSecond: number;
concurrentRequests: number;
// Resource usage
cpuUtilization: number;
memoryUtilization: number;
// Model performance
modelLatencyByType: Record<string, number>;
modelCostByType: Record<string, number>;
}
Metrics Dashboard Implementation
class MetricsCollector {
private metrics: AgentExecutionLog[] = [];
record(log: AgentExecutionLog) {
this.metrics.push(log);
}
getRequestMetrics(): RequestMetrics {
const total = this.metrics.length;
const latencies = this.metrics.map(m => m.durationMs).sort((a, b) => a - b);
const totalCost = this.metrics.reduce((sum, m) => sum + m.totalCostUsd, 0);
const modelCalls = this.metrics.flatMap(m => m.modelCalls);
// Share of requests with the given status; 0 (not NaN) when nothing has been recorded
const rate = (status: AgentExecutionLog['status']) =>
total === 0 ? 0 : this.metrics.filter(m => m.status === status).length / total;
return {
p50LatencyMs: this.percentile(latencies, 0.5),
p95LatencyMs: this.percentile(latencies, 0.95),
p99LatencyMs: this.percentile(latencies, 0.99),
successRate: rate('success'),
errorRate: rate('error'),
timeoutRate: rate('timeout'),
avgCostPerRequest: total === 0 ? 0 : totalCost / total,
// This is a daily figure only if the collector holds exactly one day of logs
totalDailyCost: totalCost,
avgPromptTokens: this.avg(this.metrics.map(m => m.inputTokens)),
avgCompletionTokens: this.avg(this.metrics.map(m => m.outputTokens)),
cacheHitRate: modelCalls.length === 0 ? 0 : modelCalls.filter(c => c.cached).length / modelCalls.length
};
}
// Nearest-rank percentile over an ascending-sorted array; returns 0 for an empty array
private percentile(sorted: number[], p: number): number {
if (sorted.length === 0) return 0;
const index = Math.min(sorted.length - 1, Math.max(0, Math.ceil(sorted.length * p) - 1));
return sorted[index];
}
private avg(nums: number[]): number {
return nums.length === 0 ? 0 : nums.reduce((a, b) => a + b, 0) / nums.length;
}
}
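The collector above covers only the request-level metrics. The tool-usage fields of `AgentMetrics` (`toolCallCounts`, `toolSuccessRates`, `avgToolLatency`) can be derived from logged tool calls in a similar way. The following is a minimal sketch; `aggregateToolMetrics`, `ToolCallRecord`, and `ToolMetrics` are hypothetical names introduced for illustration.

```typescript
// Only the ToolCall fields needed for aggregation
interface ToolCallRecord {
  toolName: string;
  durationMs: number;
  success: boolean;
}

// Mirrors the tool-usage fields of the AgentMetrics interface
interface ToolMetrics {
  toolCallCounts: Record<string, number>;
  toolSuccessRates: Record<string, number>;
  avgToolLatency: Record<string, number>;
}

// Hypothetical helper: one pass to accumulate totals, one pass to turn them into rates
function aggregateToolMetrics(calls: ToolCallRecord[]): ToolMetrics {
  const totals = new Map<string, { count: number; successes: number; latencySum: number }>();
  for (const call of calls) {
    const entry = totals.get(call.toolName) ?? { count: 0, successes: 0, latencySum: 0 };
    entry.count += 1;
    if (call.success) entry.successes += 1;
    entry.latencySum += call.durationMs;
    totals.set(call.toolName, entry);
  }

  const result: ToolMetrics = { toolCallCounts: {}, toolSuccessRates: {}, avgToolLatency: {} };
  totals.forEach((entry, name) => {
    // entry.count is at least 1 here, so the divisions are safe
    result.toolCallCounts[name] = entry.count;
    result.toolSuccessRates[name] = entry.successes / entry.count;
    result.avgToolLatency[name] = entry.latencySum / entry.count;
  });
  return result;
}
```

Feeding it the flattened tool calls from all recorded traces, for example `logs.flatMap(l => l.toolCalls)`, yields per-tool counts, success rates, and mean latencies.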
Alerting on Anomalies
Set up alerts for production issues:
class AgentAlerting {
private thresholds = {
errorRatePercent: 5,
p99LatencyMs: 10000,
dailyCostUsd: 1000,
lowConfidencePercent: 20
};
async checkAlerts(metrics: RequestMetrics, agentMetrics: AgentMetrics) {
const alerts: Alert[] = [];
if (metrics.errorRate * 100 > this.thresholds.errorRatePercent) {
alerts.push({
severity: 'critical',
title: 'High Error Rate',
message: `Error rate is ${(metrics.errorRate * 100).toFixed(1)}%, threshold: ${this.thresholds.errorRatePercent}%`
});
}
if (metrics.p99LatencyMs > this.thresholds.p99LatencyMs) {
alerts.push({
severity: 'warning',
title: 'High Latency',
message: `P99 latency is ${metrics.p99LatencyMs}ms, threshold: ${this.thresholds.p99LatencyMs}ms`
});
}
if (metrics.totalDailyCost > this.thresholds.dailyCostUsd) {
alerts.push({
severity: 'warning',
title: 'Budget Exceeded',
message: `Daily cost is $${metrics.totalDailyCost.toFixed(2)}, budget: $${this.thresholds.dailyCostUsd}`
});
}
// With lowConfidencePercent = 20, this fires when average confidence drops below 80%
if (agentMetrics.avgConfidence < (1 - this.thresholds.lowConfidencePercent / 100)) {
alerts.push({
severity: 'info',
title: 'Low Confidence',
message: `Average confidence is ${(agentMetrics.avgConfidence * 100).toFixed(1)}%`
});
}
if (alerts.length > 0) {
await this.sendAlerts(alerts);
}
}
private async sendAlerts(alerts: Alert[]) {
// Send to Slack, PagerDuty, email, etc.
for (const alert of alerts) {
console.error(`ALERT [${alert.severity}]: ${alert.title} - ${alert.message}`);
// await this.slackNotify(alert);
}
}
}
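The budget alert above compares `totalDailyCost` with a daily threshold, which is meaningful only if the cost is summed over a single day. The following is a minimal sketch of a trailing 24-hour sum; `trailingDailyCost` and `CostRecord` are hypothetical names, and `endTime` is assumed to be epoch milliseconds as in the trace interface.

```typescript
// Only the trace fields needed to compute windowed cost
interface CostRecord {
  endTime: number; // Epoch milliseconds
  totalCostUsd: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: sum the cost of traces that ended in the 24 hours up to `now`
function trailingDailyCost(records: CostRecord[], now: number): number {
  const cutoff = now - DAY_MS;
  return records
    .filter(r => r.endTime > cutoff && r.endTime <= now)
    .reduce((sum, r) => sum + r.totalCostUsd, 0);
}
```

Passing the result as `totalDailyCost` keeps the budget alert from firing simply because the collector has been accumulating logs for several days.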
Debugging Production Failures
When an agent fails, you need a systematic debugging workflow:
1. Trace Lookup
class DebugWorkflow {
async investigateFailure(traceId: string) {
// 1. Fetch full trace
const trace = await this.fetchTrace(traceId);
console.log('=== FAILURE INVESTIGATION ===');
console.log(`Trace ID: ${traceId}`);
console.log(`Status: ${trace.status}`);
console.log(`Error: ${trace.error?.message}`);
console.log(`Duration: ${trace.durationMs}ms`);
console.log();
// 2. Analyze reasoning steps
console.log('=== REASONING STEPS ===');
trace.reasoningSteps.forEach(step => {
console.log(`Step ${step.step}:`);
console.log(` Thought: ${step.thought}`);
console.log(` Action: ${step.action}`);
console.log(` Observation: ${step.observation}`);
});
console.log();
// 3. Identify failed tool calls
console.log('=== TOOL CALLS ===');
trace.toolCalls.forEach(call => {
console.log(`${call.toolName}: ${call.success ? '✓' : '✗ FAILED'}`);
if (!call.success) {
console.log(` Error: ${call.error}`);
console.log(` Input: ${JSON.stringify(call.input)}`);
}
});
console.log();
// 4. Reproduce locally
console.log('=== REPRODUCTION ===');
console.log(`To reproduce locally, run:`);
console.log(` node debug.js --task="${trace.task}" --context=...`);
return trace;
}
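async fetchTrace(traceId: string): Promise<AgentExecutionLog> {
const response = await fetch(`https://your-backend/traces/${encodeURIComponent(traceId)}`);
// fetch resolves on HTTP errors too, so check the status before parsing
if (!response.ok) {
throw new Error(`Failed to fetch trace ${traceId}: HTTP ${response.status}`);
}
return (await response.json()) as AgentExecutionLog;
}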
}
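The investigation above prints reasoning steps and tool calls as separate lists. To see where in a long workflow things first went wrong, it can help to merge them into one timeline ordered by timestamp. The following is a minimal sketch; `buildTimeline`, `firstFailure`, and the `TraceLike` shape are assumptions for illustration, using only fields that appear in the trace interfaces.

```typescript
interface TimelineEvent {
  timestamp: number;
  kind: 'reasoning' | 'tool' | 'model';
  label: string;
  failed: boolean;
}

// Trimmed-down view of a trace: only the fields the timeline needs
interface TraceLike {
  reasoningSteps: { step: number; action: string; timestamp: number }[];
  toolCalls: { toolName: string; success: boolean; error?: string; timestamp: number }[];
  modelCalls: { model: string; timestamp: number }[];
}

// Hypothetical helper: merge all logged events into one list ordered by timestamp
function buildTimeline(trace: TraceLike): TimelineEvent[] {
  const events: TimelineEvent[] = [];
  for (const s of trace.reasoningSteps) {
    events.push({ timestamp: s.timestamp, kind: 'reasoning', label: `step ${s.step}: ${s.action}`, failed: false });
  }
  for (const t of trace.toolCalls) {
    const label = t.error ? `${t.toolName}: ${t.error}` : t.toolName;
    events.push({ timestamp: t.timestamp, kind: 'tool', label, failed: !t.success });
  }
  for (const m of trace.modelCalls) {
    events.push({ timestamp: m.timestamp, kind: 'model', label: m.model, failed: false });
  }
  return events.sort((a, b) => a.timestamp - b.timestamp);
}

// The earliest failed event is usually the best starting point; later failures are often knock-on effects
function firstFailure(trace: TraceLike): TimelineEvent | undefined {
  return buildTimeline(trace).find(e => e.failed);
}
```

In this sketch only tool calls can be marked as failed, because they are the only events in the trace that carry a success flag.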
2. Local Reproduction
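// debug.ts - Reproduce a production failure locally (compile to debug.js before running with node)
async function reproduceFailure(traceId: string) {
const debug = new DebugWorkflow();
const trace = await debug.investigateFailure(traceId);
// Recreate the execution environment
const agent = new Agent({
model: trace.modelCalls[0]?.model, // undefined if the run failed before any model call
// Only tools that were actually called appear in the trace; de-duplicate repeated calls
tools: [...new Set(trace.toolCalls.map(c => c.toolName))],
verbose: true // Enable detailed logging
});
try {
// Replay with the same inputs. contextDocuments holds document IDs only,
// so the documents themselves must still be retrievable.
const result = await agent.execute(trace.task, {
context: trace.contextDocuments
});
// The replay succeeded, so the production failure did not reproduce
console.log('Failure did NOT reproduce; the replay succeeded');
console.log('Result:', result);
} catch (error) {
// Compare against trace.error to confirm this is the same failure
console.log('Failure reproduced:', error);
}
}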
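A replay that calls live tools can diverge from production simply because the tools return different data the second time. Since each logged tool call already stores its output, one option is to replay against stubs that return the recorded results in the recorded order. The following is a minimal sketch of that idea, not a feature of any particular framework: `buildReplayTools`, `RecordedToolCall`, and `ToolFn` are hypothetical, and the stubs are synchronous for brevity.

```typescript
// The ToolCall fields needed for replay
interface RecordedToolCall {
  toolName: string;
  output: unknown;
  success: boolean;
  error?: string;
}

type ToolFn = (input: unknown) => unknown;

// Hypothetical helper: build stub tools that replay recorded results, per tool, in call order
function buildReplayTools(calls: RecordedToolCall[]): Record<string, ToolFn> {
  const queues = new Map<string, RecordedToolCall[]>();
  for (const call of calls) {
    const queue = queues.get(call.toolName) ?? [];
    queue.push(call);
    queues.set(call.toolName, queue);
  }

  const tools: Record<string, ToolFn> = {};
  queues.forEach((queue, name) => {
    let next = 0;
    tools[name] = () => {
      const recorded = queue[next++];
      // The agent took a different path than in production if it asks for more calls than were recorded
      if (!recorded) throw new Error(`No recorded call left for tool "${name}"`);
      // Re-raise recorded failures so the agent sees the same error it saw in production
      if (!recorded.success) throw new Error(recorded.error ?? 'recorded tool failure');
      return recorded.output;
    };
  });
  return tools;
}
```

The stubs ignore their input, so this sketch only checks that the agent makes the same sequence of calls per tool. Comparing each replayed input against the recorded one would be a natural next step for detecting where a replay diverges.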
Observability Tools Ecosystem
Commercial Solutions
- LangSmith (LangChain)
  - Tracing, evaluation, monitoring
  - Good for LangChain-based agents
  - $$ pricing
- Helicone
  - LLM observability and caching
  - Model-agnostic
  - Affordable
- Langfuse
  - Open-source observability
  - Self-hostable
  - Free tier available
Custom Solution
Build your own with:
class CustomObservability {
// Log storage: Elasticsearch, ClickHouse, or PostgreSQL
private logStore: LogStore;
// Metrics: Prometheus + Grafana
private metricsClient: PrometheusClient;
// Tracing: Jaeger or Zipkin
private tracingClient: JaegerClient;
async recordExecution(log: AgentExecutionLog) {
// Store logs
await this.logStore.insert(log);
// Update metrics
this.metricsClient.recordLatency(log.durationMs);
this.metricsClient.recordCost(log.totalCostUsd);
this.metricsClient.incrementCounter('requests_total', { status: log.status });
// Record trace
await this.tracingClient.sendSpan({
traceId: log.traceId,
spanId: log.spanId,
duration: log.durationMs,
tags: {
agent_id: log.agentId,
status: log.status
}
});
}
async query(filters: LogFilters): Promise<AgentExecutionLog[]> {
return await this.logStore.query(filters);
}
}
Conclusion
Agent observability is not optional—it's the difference between "it's broken and we don't know why" and "we identified the issue in 5 minutes." The investment in logging, tracing, metrics, and debugging workflows pays dividends every time something goes wrong (and things will go wrong).
Key takeaways:
- Log everything: inputs, reasoning, tool calls, outputs, costs
- Use structured logging for queryability
- Implement distributed tracing for multi-agent workflows
- Track key metrics and set up alerting
- Build debugging workflows for fast root cause analysis
- Choose observability tools that fit your stack
Start simple: add basic logging and metrics first. Then layer on distributed tracing and advanced analytics as your system grows in complexity. But whatever you do, don't skip observability—your future self (or on-call engineer) will thank you.