Agent Observability: Monitoring, Logging, and Debugging Production AI Systems

When your AI agent fails in production at 3 AM, you need answers fast: What input triggered the failure? Which tool call failed? Why did the agent make that decision? Without proper observability, debugging agents is like flying blind—you're guessing instead of knowing.

This guide covers how to build comprehensive observability into agent systems so you can monitor, debug, and optimize with confidence.

Why Agent Observability Is Different

Traditional application monitoring focuses on request/response patterns, error rates, and latency. That's necessary but insufficient for agents. You also need to understand:

Agent Decision-Making:

Why did the agent choose tool X over tool Y?
What reasoning led to this output?
Which retrieved documents influenced the response?

Multi-Step Workflows:

How did the agent traverse a complex task?
Where in a 20-step workflow did it fail?
Which intermediate results propagated errors?

Non-Determinism:

How much do outputs vary for the same input?
Are failures reproducible or stochastic?
What's the distribution of agent behaviors?
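
One rough way to put numbers on these questions is to run the same input several times and tally the distinct outcomes. The sketch below is illustrative only: `AgentRun`, `summarizeOutputs`, and `measureVariability` are hypothetical names, and the "output label" could be the final answer or the first tool chosen, depending on what you want to compare.

```typescript
// Hypothetical signature: any async function that maps an input to an output label
// (for example the final answer, or the name of the first tool the agent chose).
type AgentRun = (input: string) => Promise<string>;

interface VariabilityReport {
  runs: number;
  distinctOutputs: number;
  distribution: Record<string, number>; // output label -> share of runs (0..1)
  mostCommonShare: number;              // 1.0 means every run agreed
}

// Pure helper: summarize a list of observed outputs.
function summarizeOutputs(outputs: string[]): VariabilityReport {
  const counts = new Map<string, number>();
  for (const out of outputs) {
    counts.set(out, (counts.get(out) ?? 0) + 1);
  }
  const distribution: Record<string, number> = {};
  let mostCommonShare = 0;
  counts.forEach((count, label) => {
    const share = count / outputs.length;
    distribution[label] = share;
    mostCommonShare = Math.max(mostCommonShare, share);
  });
  return { runs: outputs.length, distinctOutputs: counts.size, distribution, mostCommonShare };
}

// Run the same input n times, sequentially, and report how much the outputs vary.
async function measureVariability(run: AgentRun, input: string, n: number): Promise<VariabilityReport> {
  const outputs: string[] = [];
  for (let i = 0; i < n; i++) {
    outputs.push(await run(input));
  }
  return summarizeOutputs(outputs);
}
```

A `mostCommonShare` well below 1.0 for a failing input suggests the failure is stochastic, so a single local replay may not reproduce it.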

Context Window Management:

What's in the agent's context at decision time?
When does truncation occur?
Are we hitting context limits?
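
These questions can be answered at the point where the context is assembled. The following is a minimal sketch, assuming a greedy packing strategy and a rough 4-characters-per-token estimate; `ContextItem`, `fitToContext`, and the report fields are hypothetical names, and a real system would use the model's tokenizer.

```typescript
interface ContextItem {
  id: string;
  text: string;
}

interface ContextBudgetReport {
  estimatedTokens: number; // tokens for everything we wanted to include
  limitTokens: number;
  utilization: number;     // estimatedTokens / limitTokens; above 1 means over budget
  truncated: boolean;
  keptIds: string[];
  droppedIds: string[];
}

// Rough heuristic: about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily keep each item that still fits in the budget; record what was dropped
// so the truncation can be logged alongside the trace.
function fitToContext(items: ContextItem[], limitTokens: number): ContextBudgetReport {
  let used = 0;
  let total = 0;
  const keptIds: string[] = [];
  const droppedIds: string[] = [];
  for (const item of items) {
    const tokens = estimateTokens(item.text);
    total += tokens;
    if (used + tokens <= limitTokens) {
      used += tokens;
      keptIds.push(item.id);
    } else {
      droppedIds.push(item.id);
    }
  }
  return {
    estimatedTokens: total,
    limitTokens,
    utilization: total / limitTokens,
    truncated: droppedIds.length > 0,
    keptIds,
    droppedIds
  };
}
```

Logging this report with every invocation shows exactly which documents the agent could and could not see when it made a decision.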

What to Log in Agent Systems

1. Agent Execution Traces

Every agent invocation should be logged with full context:

interface AgentExecutionLog {
  // Identity
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  agentId: string;
  agentVersion: string;
  
  // Request
  task: string;
  inputTokens: number;
  contextDocuments: string[];
  
  // Execution
  startTime: number;
  endTime: number;
  durationMs: number;
  
  // Decision-making
  reasoningSteps: ReasoningStep[];
  toolCalls: ToolCall[];
  modelCalls: ModelCall[];
  
  // Output
  response: string;
  outputTokens: number;
  confidence?: number;
  
  // Metadata
  userId: string;
  sessionId: string;
  environment: 'prod' | 'staging' | 'dev';
  
  // Costs
  totalCostUsd: number;
  
  // Status
  status: 'success' | 'error' | 'timeout';
  error?: ErrorDetails;
}

interface ReasoningStep {
  step: number;
  thought: string;
  observation: string;
  action: string;
  timestamp: number;
}

interface ToolCall {
  toolName: string;
  input: any;
  output: any;
  durationMs: number;
  success: boolean;
  error?: string;
  timestamp: number;
}

interface ModelCall {
  model: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  cached: boolean;
  costUsd: number;
  timestamp: number;
}
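
The interfaces above and the logger that follows rely on a few names that this guide never defines: `ErrorDetails`, `ExecutionContext`, `generateTraceId`, and `generateSpanId`. One possible set of definitions is sketched below. The field choices are assumptions, and the ID sizes follow the W3C Trace Context convention (16-byte trace IDs, 8-byte span IDs) so the IDs stay compatible with standard tracing backends.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical shapes: adjust the fields to whatever your agent actually carries.
interface ErrorDetails {
  message: string;
  code?: string;
  stack?: string;
}

interface ExecutionContext {
  agentId: string;
  version: string;
  userId: string;
  sessionId: string;
  documents: { id: string }[];
}

// 16 random bytes -> 32 hex characters.
function generateTraceId(): string {
  return randomBytes(16).toString("hex");
}

// 8 random bytes -> 16 hex characters.
function generateSpanId(): string {
  return randomBytes(8).toString("hex");
}

// Normalize anything caught in a catch block into a serializable ErrorDetails.
function toErrorDetails(err: unknown): ErrorDetails {
  if (err instanceof Error) {
    return { message: err.message, code: err.name, stack: err.stack };
  }
  return { message: String(err) };
}
```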

2. Structured Logging Implementation

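class AgentLogger {
  // Holds a single in-flight trace, so use one AgentLogger instance per request;
  // sharing an instance across concurrent requests would interleave their logs.
  private currentTrace: AgentExecutionLog | null = null;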
  
  startTrace(task: string, context: ExecutionContext): string {
    const traceId = generateTraceId();
    
    this.currentTrace = {
      traceId,
      spanId: generateSpanId(),
      agentId: context.agentId,
      agentVersion: context.version,
      task,
      inputTokens: this.estimateTokens(task),
      contextDocuments: context.documents.map(d => d.id),
      startTime: Date.now(),
      endTime: 0,
      durationMs: 0,
      reasoningSteps: [],
      toolCalls: [],
      modelCalls: [],
      response: '',
      outputTokens: 0,
      userId: context.userId,
      sessionId: context.sessionId,
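      // NODE_ENV is usually 'production' | 'development' | 'test', so map it instead of casting
      environment: process.env.NODE_ENV === 'production' ? 'prod'
        : process.env.NODE_ENV === 'staging' ? 'staging' : 'dev',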
      totalCostUsd: 0,
      status: 'success'
    };
    
    console.log(JSON.stringify({
      event: 'agent_trace_started',
      traceId,
      task,
      timestamp: Date.now()
    }));
    
    return traceId;
  }
  
  logReasoningStep(thought: string, observation: string, action: string) {
    if (!this.currentTrace) return;
    
    this.currentTrace.reasoningSteps.push({
      step: this.currentTrace.reasoningSteps.length + 1,
      thought,
      observation,
      action,
      timestamp: Date.now()
    });
    
    console.log(JSON.stringify({
      event: 'reasoning_step',
      traceId: this.currentTrace.traceId,
      step: this.currentTrace.reasoningSteps.length,
      thought,
      observation,
      action
    }));
  }
  
  logToolCall(toolName: string, input: any, output: any, duration: number, success: boolean, error?: string) {
    if (!this.currentTrace) return;
    
    const toolCall: ToolCall = {
      toolName,
      input,
      output,
      durationMs: duration,
      success,
      error,
      timestamp: Date.now()
    };
    
    this.currentTrace.toolCalls.push(toolCall);
    
    console.log(JSON.stringify({
      event: 'tool_call',
      traceId: this.currentTrace.traceId,
      toolName,
      success,
      durationMs: duration,
      error
    }));
  }
  
  logModelCall(model: string, promptTokens: number, completionTokens: number, latency: number, cost: number, cached: boolean) {
    if (!this.currentTrace) return;
    
    const modelCall: ModelCall = {
      model,
      promptTokens,
      completionTokens,
      latencyMs: latency,
      cached,
      costUsd: cost,
      timestamp: Date.now()
    };
    
    this.currentTrace.modelCalls.push(modelCall);
    this.currentTrace.totalCostUsd += cost;
    
    console.log(JSON.stringify({
      event: 'model_call',
      traceId: this.currentTrace.traceId,
      model,
      promptTokens,
      completionTokens,
      cached,
      costUsd: cost
    }));
  }
  
  endTrace(response: string, status: 'success' | 'error' | 'timeout', error?: ErrorDetails) {
    if (!this.currentTrace) return;
    
    this.currentTrace.response = response;
    this.currentTrace.outputTokens = this.estimateTokens(response);
    this.currentTrace.endTime = Date.now();
    this.currentTrace.durationMs = this.currentTrace.endTime - this.currentTrace.startTime;
    this.currentTrace.status = status;
    this.currentTrace.error = error;
    
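    // Ship complete trace to logging backend; a failed upload must not crash the agent
    this.shipTrace(this.currentTrace).catch(err => {
      console.error(JSON.stringify({ event: 'trace_ship_failed', message: String(err) }));
    });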
    
    console.log(JSON.stringify({
      event: 'agent_trace_completed',
      traceId: this.currentTrace.traceId,
      status,
      durationMs: this.currentTrace.durationMs,
      totalCostUsd: this.currentTrace.totalCostUsd
    }));
    
    this.currentTrace = null;
  }
  
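  private async shipTrace(trace: AgentExecutionLog) {
    // Send to your observability backend (e.g., LangSmith, Helicone, custom)
    const res = await fetch('https://your-observability-backend/traces', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(trace)
    });
    // fetch only rejects on network errors, so check the HTTP status explicitly
    if (!res.ok) {
      throw new Error(`Trace upload failed with HTTP ${res.status}`);
    }
  }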
  
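  private estimateTokens(text: string): number {
    // Rough heuristic (about 4 characters per token for English text);
    // use the model's tokenizer or the API's reported usage for exact counts
    return Math.ceil(text.length / 4);
  }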
}
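
Because the logger above stores one trace in an instance field, concurrent requests need some way to keep their traces apart. One option in Node.js is `AsyncLocalStorage` from `node:async_hooks`, which carries state through a chain of async calls. The sketch below is a simplified illustration rather than a drop-in replacement for `AgentLogger`; `TraceState`, `withTrace`, `logEvent`, and `currentTraceId` are hypothetical names.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface TraceState {
  traceId: string;
  events: string[];
}

const traceStorage = new AsyncLocalStorage<TraceState>();

// Run fn with its own trace state; everything it awaits sees the same state.
function withTrace<T>(traceId: string, fn: () => Promise<T>): Promise<T> {
  return traceStorage.run({ traceId, events: [] }, fn);
}

// Record an event against whichever trace is active in the current async context.
function logEvent(name: string): void {
  const state = traceStorage.getStore();
  if (!state) return; // called outside withTrace: nothing to attach the event to
  state.events.push(name);
  console.log(JSON.stringify({ event: name, traceId: state.traceId, timestamp: Date.now() }));
}

// Returns the active trace ID, or undefined outside of withTrace.
function currentTraceId(): string | undefined {
  return traceStorage.getStore()?.traceId;
}
```

With this pattern, tool wrappers and model clients can call `logEvent` without having a logger or trace ID passed to them explicitly.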

3. Distributed Tracing for Multi-Agent Workflows

When multiple agents collaborate, you need distributed tracing to follow the full execution path:

class DistributedTracer {
  async traceWorkflow(workflowId: string, agents: Agent[]) {
    const rootSpan = this.createSpan({
      traceId: workflowId,
      spanId: generateSpanId(),
      name: 'workflow_execution',
      startTime: Date.now()
    });
    
    try {
      for (const agent of agents) {
        const childSpan = this.createSpan({
          traceId: workflowId,
          spanId: generateSpanId(),
          parentSpanId: rootSpan.spanId,
          name: `agent_${agent.name}`,
          startTime: Date.now()
        });
        
        try {
          await agent.execute({ traceContext: childSpan });
          this.endSpan(childSpan, 'success');
        } catch (error) {
          this.endSpan(childSpan, 'error', error);
          throw error;
        }
      }
      
      this.endSpan(rootSpan, 'success');
    } catch (error) {
      // Close the root span on failure too, so the trace is never left "running"
      this.endSpan(rootSpan, 'error', error);
      throw error;
    }
  }
  
  createSpan(config: SpanConfig): Span {
    const span = { ...config, events: [], status: 'running' };
    console.log(JSON.stringify({ event: 'span_started', ...span }));
    return span;
  }
  
  endSpan(span: Span, status: string, error?: any) {
    span.endTime = Date.now();
    span.durationMs = span.endTime - span.startTime;
    span.status = status;
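    // Error objects serialize to "{}" under JSON.stringify, so keep name and message explicitly
    span.error = error instanceof Error ? { name: error.name, message: error.message } : error;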
    
    console.log(JSON.stringify({ event: 'span_ended', ...span }));
  }
}
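
When agents run in separate processes or services, the trace and span IDs have to travel with the request. A common convention is the W3C Trace Context `traceparent` HTTP header, which has the form `00-<trace-id>-<parent-id>-<flags>`. The sketch below formats and parses that header. It assumes trace IDs are 32 lowercase hex characters and span IDs are 16, which the `workflowId` used above would need to satisfy; the function and type names are hypothetical.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace-flags field
}

// Build the header value to attach to an outgoing request.
function formatTraceparent(ctx: TraceContext): string {
  const flags = ctx.sampled ? "01" : "00";
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

// Parse an incoming header; returns null when it is malformed.
function parseTraceparent(header: string): TraceContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const traceId = match[1];
  const spanId = match[2];
  const flags = match[3];
  // All-zero IDs are defined as invalid by the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

A receiving agent would use the parsed `spanId` as the `parentSpanId` of the span it creates, which keeps the whole workflow in one trace.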

Key Metrics to Track

1. Request-Level Metrics

interface RequestMetrics {
  // Latency
  p50LatencyMs: number;
  p95LatencyMs: number;
  p99LatencyMs: number;
  
  // Success rates
  successRate: number;
  errorRate: number;
  timeoutRate: number;
  
  // Costs
  avgCostPerRequest: number;
  totalDailyCost: number;
  
  // Token usage
  avgPromptTokens: number;
  avgCompletionTokens: number;
  
  // Cache performance
  cacheHitRate: number;
}

2. Agent-Level Metrics

interface AgentMetrics {
  // Decision quality
  avgConfidence: number;
  decisionDistribution: Record<string, number>; // Which tools chosen how often
  
  // Reasoning depth
  avgReasoningSteps: number;
  maxReasoningSteps: number;
  
  // Tool usage
  toolCallCounts: Record<string, number>;
  toolSuccessRates: Record<string, number>;
  avgToolLatency: Record<string, number>;
  
  // Context management
  avgContextLength: number;
  contextTruncationRate: number;
}
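
The tool-usage fields can be derived directly from logged tool calls. The following sketch shows one way to do that aggregation; `ToolCallRecord` mirrors the relevant fields of `ToolCall`, and `aggregateToolUsage` is a hypothetical helper name.

```typescript
// Subset of the ToolCall fields needed for aggregation.
interface ToolCallRecord {
  toolName: string;
  durationMs: number;
  success: boolean;
}

interface ToolUsageMetrics {
  toolCallCounts: Record<string, number>;
  toolSuccessRates: Record<string, number>; // 0..1 per tool
  avgToolLatency: Record<string, number>;   // milliseconds per tool
}

function aggregateToolUsage(calls: ToolCallRecord[]): ToolUsageMetrics {
  const toolCallCounts: Record<string, number> = {};
  const successes: Record<string, number> = {};
  const totalLatency: Record<string, number> = {};

  // First pass: accumulate per-tool totals.
  for (const call of calls) {
    toolCallCounts[call.toolName] = (toolCallCounts[call.toolName] ?? 0) + 1;
    successes[call.toolName] = (successes[call.toolName] ?? 0) + (call.success ? 1 : 0);
    totalLatency[call.toolName] = (totalLatency[call.toolName] ?? 0) + call.durationMs;
  }

  // Second pass: turn totals into rates and averages.
  const toolSuccessRates: Record<string, number> = {};
  const avgToolLatency: Record<string, number> = {};
  for (const name of Object.keys(toolCallCounts)) {
    toolSuccessRates[name] = successes[name] / toolCallCounts[name];
    avgToolLatency[name] = totalLatency[name] / toolCallCounts[name];
  }

  return { toolCallCounts, toolSuccessRates, avgToolLatency };
}
```

Feeding it `trace.toolCalls` flattened across many traces gives the per-tool view needed to spot a single flaky or slow tool.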

3. System-Level Metrics

interface SystemMetrics {
  // Throughput
  requestsPerSecond: number;
  concurrentRequests: number;
  
  // Resource usage
  cpuUtilization: number;
  memoryUtilization: number;
  
  // Model performance
  modelLatencyByType: Record<string, number>;
  modelCostByType: Record<string, number>;
}

Metrics Dashboard Implementation

class MetricsCollector {
  private metrics: AgentExecutionLog[] = [];
  
  record(log: AgentExecutionLog) {
    this.metrics.push(log);
  }
  
  getRequestMetrics(): RequestMetrics {
    const total = this.metrics.length;
    const latencies = this.metrics.map(m => m.durationMs).sort((a, b) => a - b);
    const costs = this.metrics.map(m => m.totalCostUsd);
    const modelCalls = this.metrics.flatMap(m => m.modelCalls);
    // Share of traces with a given status; 0 when nothing has been recorded yet
    const rate = (status: AgentExecutionLog['status']) =>
      total === 0 ? 0 : this.metrics.filter(m => m.status === status).length / total;
    // Only traces started in the last 24 hours count toward the daily cost
    const dayAgo = Date.now() - 24 * 60 * 60 * 1000;
    
    return {
      p50LatencyMs: this.percentile(latencies, 0.5),
      p95LatencyMs: this.percentile(latencies, 0.95),
      p99LatencyMs: this.percentile(latencies, 0.99),
      successRate: rate('success'),
      errorRate: rate('error'),
      timeoutRate: rate('timeout'),
      avgCostPerRequest: this.avg(costs),
      totalDailyCost: this.metrics
        .filter(m => m.startTime >= dayAgo)
        .reduce((sum, m) => sum + m.totalCostUsd, 0),
      avgPromptTokens: this.avg(this.metrics.map(m => m.inputTokens)),
      avgCompletionTokens: this.avg(this.metrics.map(m => m.outputTokens)),
      cacheHitRate: modelCalls.length === 0 ? 0 : modelCalls.filter(c => c.cached).length / modelCalls.length
    };
  }
  
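  private percentile(sorted: number[], p: number): number {
    if (sorted.length === 0) return 0;
    // Nearest-rank method on an ascending array; clamp so p = 0 maps to the first element
    const index = Math.max(0, Math.ceil(sorted.length * p) - 1);
    return sorted[index];
  }
  
  private avg(nums: number[]): number {
    // Avoid NaN from dividing by zero when no data has been recorded
    return nums.length === 0 ? 0 : nums.reduce((a, b) => a + b, 0) / nums.length;
  }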
}

Alerting on Anomalies

Set up alerts for production issues:

class AgentAlerting {
  private thresholds = {
    errorRatePercent: 5,
    p99LatencyMs: 10000,
    dailyCostUsd: 1000,
    lowConfidencePercent: 20
  };
  
  async checkAlerts(metrics: RequestMetrics, agentMetrics: AgentMetrics) {
    const alerts: Alert[] = [];
    
    if (metrics.errorRate * 100 > this.thresholds.errorRatePercent) {
      alerts.push({
        severity: 'critical',
        title: 'High Error Rate',
        message: `Error rate is ${(metrics.errorRate * 100).toFixed(1)}%, threshold: ${this.thresholds.errorRatePercent}%`
      });
    }
    
    if (metrics.p99LatencyMs > this.thresholds.p99LatencyMs) {
      alerts.push({
        severity: 'warning',
        title: 'High Latency',
        message: `P99 latency is ${metrics.p99LatencyMs}ms, threshold: ${this.thresholds.p99LatencyMs}ms`
      });
    }
    
    if (metrics.totalDailyCost > this.thresholds.dailyCostUsd) {
      alerts.push({
        severity: 'warning',
        title: 'Budget Exceeded',
        message: `Daily cost is $${metrics.totalDailyCost.toFixed(2)}, budget: $${this.thresholds.dailyCostUsd}`
      });
    }
    
    // With lowConfidencePercent set to 20, this fires when average confidence drops below 0.80
    if (agentMetrics.avgConfidence < (1 - this.thresholds.lowConfidencePercent / 100)) {
      alerts.push({
        severity: 'info',
        title: 'Low Confidence',
        message: `Average confidence is ${(agentMetrics.avgConfidence * 100).toFixed(1)}%`
      });
    }
    
    if (alerts.length > 0) {
      await this.sendAlerts(alerts);
    }
  }
  
  private async sendAlerts(alerts: Alert[]) {
    // Send to Slack, PagerDuty, email, etc.
    for (const alert of alerts) {
      console.error(`ALERT [${alert.severity}]: ${alert.title} - ${alert.message}`);
      // await this.slackNotify(alert);
    }
  }
}
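
The commented-out `slackNotify` call could be filled in along the following lines. This is a sketch under several assumptions: the `Alert` shape is inferred from how alerts are built above, the webhook URL is a placeholder, the HTTP call is injected as a `post` function so the logic can be tested without a network, and the cooldown (which stops the same alert from firing on every check) is an addition not described in this guide. Slack incoming webhooks accept a JSON body with a `text` field.

```typescript
// Hypothetical Alert shape, inferred from the objects pushed in checkAlerts.
interface Alert {
  severity: 'critical' | 'warning' | 'info';
  title: string;
  message: string;
}

// Injected transport, e.g. a thin wrapper around fetch with method POST.
type PostJson = (url: string, body: unknown) => Promise<void>;

function buildSlackPayload(alert: Alert): { text: string } {
  const icon =
    alert.severity === 'critical' ? ':rotating_light:' :
    alert.severity === 'warning' ? ':warning:' : ':information_source:';
  return { text: `${icon} [${alert.severity.toUpperCase()}] ${alert.title}: ${alert.message}` };
}

class AlertNotifier {
  private lastSent = new Map<string, number>(); // alert title -> time last sent (ms)
  private webhookUrl: string;
  private post: PostJson;
  private cooldownMs: number;

  constructor(webhookUrl: string, post: PostJson, cooldownMs: number = 15 * 60 * 1000) {
    this.webhookUrl = webhookUrl;
    this.post = post;
    this.cooldownMs = cooldownMs;
  }

  // True if this alert has not been sent within the cooldown window.
  shouldSend(alert: Alert, now: number): boolean {
    const last = this.lastSent.get(alert.title);
    return last === undefined || now - last >= this.cooldownMs;
  }

  // Sends the alert unless it is still cooling down; resolves to whether it was sent.
  async notify(alert: Alert, now: number = Date.now()): Promise<boolean> {
    if (!this.shouldSend(alert, now)) return false;
    this.lastSent.set(alert.title, now);
    await this.post(this.webhookUrl, buildSlackPayload(alert));
    return true;
  }
}
```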

Debugging Production Failures

When an agent fails, you need a systematic debugging workflow:

1. Trace Lookup

class DebugWorkflow {
  async investigateFailure(traceId: string) {
    // 1. Fetch full trace
    const trace = await this.fetchTrace(traceId);
    
    console.log('=== FAILURE INVESTIGATION ===');
    console.log(`Trace ID: ${traceId}`);
    console.log(`Status: ${trace.status}`);
    console.log(`Error: ${trace.error?.message}`);
    console.log(`Duration: ${trace.durationMs}ms`);
    console.log();
    
    // 2. Analyze reasoning steps
    console.log('=== REASONING STEPS ===');
    trace.reasoningSteps.forEach(step => {
      console.log(`Step ${step.step}:`);
      console.log(`  Thought: ${step.thought}`);
      console.log(`  Action: ${step.action}`);
      console.log(`  Observation: ${step.observation}`);
    });
    console.log();
    
    // 3. Identify failed tool calls
    console.log('=== TOOL CALLS ===');
    trace.toolCalls.forEach(call => {
      console.log(`${call.toolName}: ${call.success ? '✓' : '✗ FAILED'}`);
      if (!call.success) {
        console.log(`  Error: ${call.error}`);
        console.log(`  Input: ${JSON.stringify(call.input)}`);
      }
    });
    console.log();
    
    // 4. Reproduce locally
    console.log('=== REPRODUCTION ===');
    console.log(`To reproduce locally, run:`);
    console.log(`  node debug.js --task="${trace.task}" --context=...`);
    
    return trace;
  }
  
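  async fetchTrace(traceId: string): Promise<AgentExecutionLog> {
    const response = await fetch(`https://your-backend/traces/${encodeURIComponent(traceId)}`);
    // A missing or expired trace should fail loudly instead of yielding a malformed object
    if (!response.ok) {
      throw new Error(`Could not fetch trace ${traceId}: HTTP ${response.status}`);
    }
    return await response.json();
  }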
}

2. Local Reproduction

// debug.ts - Reproduce production failure locally (compile to debug.js before running with node)
async function reproduceFailure(traceId: string) {
  const debug = new DebugWorkflow();
  const trace = await debug.investigateFailure(traceId);
  
  // Recreate execution environment
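  const agent = new Agent({
    model: trace.modelCalls[0]?.model,  // undefined if the trace failed before any model call
    // Only tools that were actually called appear in the trace; de-duplicate their names
    tools: [...new Set(trace.toolCalls.map(c => c.toolName))],
    verbose: true  // Enable detailed logging
  });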
  
  try {
    // Replay with same inputs
    const result = await agent.execute(trace.task, {
      context: trace.contextDocuments
    });
    
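    // The agent succeeded this time, so the production failure did NOT reproduce
    console.log('Failure did not reproduce; it may be non-deterministic or environment-specific');
    console.log('Result:', result);
  } catch (error) {
    console.log('Failure reproduced locally; compare with the production error:', error);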
  }
}
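
Re-running against live tools can give different results from production, or trigger side effects a second time. One alternative is to replay the tool outputs recorded in the trace. The sketch below builds stand-in tools from logged calls; `RecordedToolCall` mirrors the `ToolCall` fields it needs, and `createReplayTools` and the function-per-tool shape are assumptions about how your agent accepts tools.

```typescript
// Subset of the logged ToolCall fields needed for replay.
interface RecordedToolCall {
  toolName: string;
  input: unknown;
  output: unknown;
  success: boolean;
  error?: string;
}

type ReplayTool = (input: unknown) => Promise<unknown>;

// Build stand-in tools that return recorded outputs in the order they were logged.
function createReplayTools(recorded: RecordedToolCall[]): Record<string, ReplayTool> {
  // Group the recorded calls per tool, preserving their original order.
  const queues: Record<string, RecordedToolCall[]> = {};
  for (const call of recorded) {
    if (!queues[call.toolName]) queues[call.toolName] = [];
    queues[call.toolName].push(call);
  }

  const tools: Record<string, ReplayTool> = {};
  for (const name of Object.keys(queues)) {
    tools[name] = async (input: unknown) => {
      const next = queues[name].shift();
      if (!next) {
        throw new Error(`No recorded call left for tool "${name}"`);
      }
      // A different input means the agent took a different path than in production.
      if (JSON.stringify(next.input) !== JSON.stringify(input)) {
        console.warn(`Replay divergence in "${name}": input differs from the recorded call`);
      }
      if (!next.success) {
        throw new Error(next.error ?? `Recorded failure in "${name}"`);
      }
      return next.output;
    };
  }
  return tools;
}
```

A divergence warning is itself a useful signal: it marks the step where the local run stopped following the production path.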

Observability Tools Ecosystem

Commercial Solutions

LangSmith (LangChain)
- Tracing, evaluation, monitoring
- Good for LangChain-based agents
- $$ pricing

Helicone
- LLM observability and caching
- Model-agnostic
- Affordable

Langfuse
- Open-source observability
- Self-hostable
- Free tier available

Custom Solution

Build your own with:

class CustomObservability {
  // Log storage: Elasticsearch, ClickHouse, or PostgreSQL
  private logStore: LogStore;
  
  // Metrics: Prometheus + Grafana
  private metricsClient: PrometheusClient;
  
  // Tracing: Jaeger or Zipkin
  private tracingClient: JaegerClient;
  
  async recordExecution(log: AgentExecutionLog) {
    // Store logs
    await this.logStore.insert(log);
    
    // Update metrics
    this.metricsClient.recordLatency(log.durationMs);
    this.metricsClient.recordCost(log.totalCostUsd);
    this.metricsClient.incrementCounter('requests_total', { status: log.status });
    
    // Record trace
    await this.tracingClient.sendSpan({
      traceId: log.traceId,
      spanId: log.spanId,
      duration: log.durationMs,
      tags: {
        agent_id: log.agentId,
        status: log.status
      }
    });
  }
  
  async query(filters: LogFilters): Promise<AgentExecutionLog[]> {
    return await this.logStore.query(filters);
  }
}

Conclusion

Agent observability is not optional—it's the difference between "it's broken and we don't know why" and "we identified the issue in 5 minutes." The investment in logging, tracing, metrics, and debugging workflows pays dividends every time something goes wrong (and things will go wrong).

Key takeaways:

Log everything: inputs, reasoning, tool calls, outputs, costs
Use structured logging for queryability
Implement distributed tracing for multi-agent workflows
Track key metrics and set up alerting
Build debugging workflows for fast root cause analysis
Choose observability tools that fit your stack

Start simple: add basic logging and metrics first. Then layer on distributed tracing and advanced analytics as your system grows in complexity. But whatever you do, don't skip observability—your future self (or on-call engineer) will thank you.