Technical Architecture

    Agent Observability: Monitoring, Logging, and Debugging Production AI Systems

    13 min read
    By Sesha Kadakia

    When your AI agent fails in production at 3 AM, you need answers fast: What input triggered the failure? Which tool call failed? Why did the agent make that decision? Without proper observability, debugging agents is like flying blind—you're guessing instead of knowing.

    This guide covers how to build comprehensive observability into agent systems so you can monitor, debug, and optimize with confidence.

    Why Agent Observability Is Different

    Traditional application monitoring focuses on request/response patterns, error rates, and latency. That's necessary but insufficient for agents. You also need to understand:

    Agent Decision-Making:

    • Why did the agent choose tool X over tool Y?
    • What reasoning led to this output?
    • Which retrieved documents influenced the response?

    Multi-Step Workflows:

    • How did the agent traverse a complex task?
    • Where in a 20-step workflow did it fail?
    • Which intermediate results propagated errors?

    Non-Determinism:

    • How much do outputs vary for the same input?
    • Are failures reproducible or stochastic?
    • What's the distribution of agent behaviors?
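
One way to put numbers on these questions is to replay the same input several times and summarize how the outcomes spread. The sketch below is illustrative only: the `RunOutcome` shape and the caller-supplied `run` function are assumptions, not part of any specific framework, and exact string comparison is a crude stand-in for semantic equivalence.

```typescript
// Hypothetical outcome of one agent run; adapt to your own agent's return type.
type RunOutcome = { ok: true; output: string } | { ok: false; error: string };

interface VariabilityReport {
  runs: number;
  failureRate: number;     // fraction of runs that failed
  distinctOutputs: number; // number of different successful outputs
  mostCommonShare: number; // share of successful runs that produced the modal output
}

// Summarize how much a set of outcomes for the SAME input differ.
function summarizeVariability(outcomes: RunOutcome[]): VariabilityReport {
  const counts = new Map<string, number>();
  let failures = 0;
  for (const o of outcomes) {
    if (o.ok) {
      const key = o.output.trim();
      counts.set(key, (counts.get(key) ?? 0) + 1);
    } else {
      failures++;
    }
  }
  const successes = outcomes.length - failures;
  const modal = Math.max(0, ...Array.from(counts.values()));
  return {
    runs: outcomes.length,
    failureRate: outcomes.length === 0 ? 0 : failures / outcomes.length,
    distinctOutputs: counts.size,
    mostCommonShare: successes === 0 ? 0 : modal / successes,
  };
}

// Replay one input n times; `run` is whatever invokes your agent.
async function sampleVariability(
  run: (input: string) => Promise<string>,
  input: string,
  n: number
): Promise<VariabilityReport> {
  const outcomes: RunOutcome[] = [];
  for (let i = 0; i < n; i++) {
    try {
      outcomes.push({ ok: true, output: await run(input) });
    } catch (err) {
      outcomes.push({ ok: false, error: err instanceof Error ? err.message : String(err) });
    }
  }
  return summarizeVariability(outcomes);
}
```

A failure rate strictly between 0 and 1 for a fixed input suggests a stochastic failure; a rate of exactly 1 suggests a reproducible one.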

    Context Window Management:

    • What's in the agent's context at decision time?
    • When does truncation occur?
    • Are we hitting context limits?
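
Truncation becomes observable once the code that assembles the context also reports what it dropped. The following is a minimal sketch under stated assumptions: the `ContextItem` shape and the keep-newest policy are hypothetical, and token counts use the same rough 4-characters-per-token heuristic as the `estimateTokens` helper later in this article rather than a real tokenizer.

```typescript
interface ContextItem {
  id: string;
  text: string;
}

interface ContextFitReport {
  kept: string[];      // ids that fit in the budget, in original order
  dropped: string[];   // ids that were truncated away, in original order
  usedTokens: number;
  budgetTokens: number;
  truncated: boolean;
}

// Rough heuristic: about 4 characters per token. Use a real tokenizer for accuracy.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the most recent items that fit the budget and report what was dropped.
function fitContext(items: ContextItem[], budgetTokens: number): ContextFitReport {
  const kept: string[] = [];
  const dropped: string[] = [];
  let usedTokens = 0;
  let full = false;

  // Walk from newest to oldest so recent context wins.
  for (const item of items.slice().reverse()) {
    const cost = estimateTokens(item.text);
    if (!full && usedTokens + cost <= budgetTokens) {
      kept.unshift(item.id);
      usedTokens += cost;
    } else {
      full = true; // once one item is dropped, everything older is dropped too
      dropped.unshift(item.id);
    }
  }
  return { kept, dropped, usedTokens, budgetTokens, truncated: dropped.length > 0 };
}
```

Logging the returned report alongside each decision answers both "what was in context" and "when did truncation occur" without re-deriving it afterwards.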

    What to Log in Agent Systems

    1. Agent Execution Traces

    Every agent invocation should be logged with full context:

    interface AgentExecutionLog {
      // Identity
      traceId: string;
      spanId: string;
      parentSpanId?: string;
      agentId: string;
      agentVersion: string;
      
      // Request
      task: string;
      inputTokens: number;
      contextDocuments: string[];
      
      // Execution
      startTime: number;
      endTime: number;
      durationMs: number;
      
      // Decision-making
      reasoningSteps: ReasoningStep[];
      toolCalls: ToolCall[];
      modelCalls: ModelCall[];
      
      // Output
      response: string;
      outputTokens: number;
      confidence?: number;
      
      // Metadata
      userId: string;
      sessionId: string;
      environment: 'prod' | 'staging' | 'dev';
      
      // Costs
      totalCostUsd: number;
      
      // Status
      status: 'success' | 'error' | 'timeout';
      error?: ErrorDetails;
    }
    
    interface ReasoningStep {
      step: number;
      thought: string;
      observation: string;
      action: string;
      timestamp: number;
    }
    
    interface ToolCall {
      toolName: string;
      input: any;
      output: any;
      durationMs: number;
      success: boolean;
      error?: string;
      timestamp: number;
    }
    
    interface ModelCall {
      model: string;
      promptTokens: number;
      completionTokens: number;
      latencyMs: number;
      cached: boolean;
      costUsd: number;
      timestamp: number;
    }
    

    2. Structured Logging Implementation

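    class AgentLogger {
      // Holds a single in-flight trace: create one AgentLogger per request
      // rather than sharing an instance across concurrent agent runs.
      private currentTrace: AgentExecutionLog | null = null;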
      
      startTrace(task: string, context: ExecutionContext): string {
        const traceId = generateTraceId();
        
        this.currentTrace = {
          traceId,
          spanId: generateSpanId(),
          agentId: context.agentId,
          agentVersion: context.version,
          task,
          inputTokens: this.estimateTokens(task),
          contextDocuments: context.documents.map(d => d.id),
          startTime: Date.now(),
          endTime: 0,
          durationMs: 0,
          reasoningSteps: [],
          toolCalls: [],
          modelCalls: [],
          response: '',
          outputTokens: 0,
          userId: context.userId,
          sessionId: context.sessionId,
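          // NODE_ENV is usually 'production' or 'development', so map it explicitly
          // instead of casting; anything unrecognized falls back to 'dev'.
          environment: process.env.NODE_ENV === 'production' ? 'prod'
            : process.env.NODE_ENV === 'staging' ? 'staging' : 'dev',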
          totalCostUsd: 0,
          status: 'success'
        };
        
        console.log(JSON.stringify({
          event: 'agent_trace_started',
          traceId,
          task,
          timestamp: Date.now()
        }));
        
        return traceId;
      }
      
      logReasoningStep(thought: string, observation: string, action: string) {
        if (!this.currentTrace) return;
        
        this.currentTrace.reasoningSteps.push({
          step: this.currentTrace.reasoningSteps.length + 1,
          thought,
          observation,
          action,
          timestamp: Date.now()
        });
        
        console.log(JSON.stringify({
          event: 'reasoning_step',
          traceId: this.currentTrace.traceId,
          step: this.currentTrace.reasoningSteps.length,
          thought,
          observation,
          action
        }));
      }
      
      logToolCall(toolName: string, input: any, output: any, duration: number, success: boolean, error?: string) {
        if (!this.currentTrace) return;
        
        const toolCall: ToolCall = {
          toolName,
          input,
          output,
          durationMs: duration,
          success,
          error,
          timestamp: Date.now()
        };
        
        this.currentTrace.toolCalls.push(toolCall);
        
        console.log(JSON.stringify({
          event: 'tool_call',
          traceId: this.currentTrace.traceId,
          toolName,
          success,
          durationMs: duration,
          error
        }));
      }
      
      logModelCall(model: string, promptTokens: number, completionTokens: number, latency: number, cost: number, cached: boolean) {
        if (!this.currentTrace) return;
        
        const modelCall: ModelCall = {
          model,
          promptTokens,
          completionTokens,
          latencyMs: latency,
          cached,
          costUsd: cost,
          timestamp: Date.now()
        };
        
        this.currentTrace.modelCalls.push(modelCall);
        this.currentTrace.totalCostUsd += cost;
        
        console.log(JSON.stringify({
          event: 'model_call',
          traceId: this.currentTrace.traceId,
          model,
          promptTokens,
          completionTokens,
          cached,
          costUsd: cost
        }));
      }
      
      endTrace(response: string, status: 'success' | 'error' | 'timeout', error?: ErrorDetails) {
        if (!this.currentTrace) return;
        
        this.currentTrace.response = response;
        this.currentTrace.outputTokens = this.estimateTokens(response);
        this.currentTrace.endTime = Date.now();
        this.currentTrace.durationMs = this.currentTrace.endTime - this.currentTrace.startTime;
        this.currentTrace.status = status;
        this.currentTrace.error = error;
        
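        // Ship the complete trace to the logging backend. This is fire-and-forget,
        // so catch failures here rather than leaving an unhandled rejection.
        this.shipTrace(this.currentTrace).catch(err => {
          console.error(JSON.stringify({ event: 'trace_ship_failed', error: String(err) }));
        });
        
        console.log(JSON.stringify({
          event: 'agent_trace_completed',
          traceId: this.currentTrace.traceId,
          status,
          durationMs: this.currentTrace.durationMs,
          totalCostUsd: this.currentTrace.totalCostUsd
        }));
        
        this.currentTrace = null;
      }
      
      private async shipTrace(trace: AgentExecutionLog) {
        // Send to your observability backend (e.g., LangSmith, Helicone, custom)
        const res = await fetch('https://your-observability-backend/traces', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify(trace)
        });
        // fetch resolves on HTTP error statuses, so check the status explicitly
        if (!res.ok) {
          throw new Error(`Trace upload failed with HTTP ${res.status}`);
        }
      }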
      
      private estimateTokens(text: string): number {
        return Math.ceil(text.length / 4);
      }
    }
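
Calling `logToolCall` by hand at every call site is easy to forget, especially on error paths. A small wrapper can time the call and record the outcome in both cases. This is a sketch, not part of the logger above: `ToolCallSink` and `withToolLogging` are hypothetical names, and the sink is typed structurally so that any object with a matching `logToolCall` method (such as the `AgentLogger` above) can be passed in.

```typescript
// Minimal structural type matching the logToolCall signature of the logger above.
interface ToolCallSink {
  logToolCall(
    toolName: string,
    input: unknown,
    output: unknown,
    durationMs: number,
    success: boolean,
    error?: string
  ): void;
}

// Run a tool, time it, and record the outcome whether it succeeds or throws.
async function withToolLogging<I, O>(
  sink: ToolCallSink,
  toolName: string,
  input: I,
  tool: (input: I) => Promise<O>
): Promise<O> {
  const start = Date.now();
  try {
    const output = await tool(input);
    sink.logToolCall(toolName, input, output, Date.now() - start, true);
    return output;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    sink.logToolCall(toolName, input, undefined, Date.now() - start, false, message);
    throw err; // re-throw so the agent still sees the failure
  }
}
```

The wrapper re-throws on purpose: logging should observe the failure, not change how the agent handles it.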
    

    3. Distributed Tracing for Multi-Agent Workflows

    When multiple agents collaborate, you need distributed tracing to follow the full execution path:

    class DistributedTracer {
      async traceWorkflow(workflowId: string, agents: Agent[]) {
        const rootSpan = this.createSpan({
          traceId: workflowId,
          spanId: generateSpanId(),
          name: 'workflow_execution',
          startTime: Date.now()
        });
        
        try {
          // Agents run sequentially; each gets a child span under the root
          for (const agent of agents) {
            const childSpan = this.createSpan({
              traceId: workflowId,
              spanId: generateSpanId(),
              parentSpanId: rootSpan.spanId,
              name: `agent_${agent.name}`,
              startTime: Date.now()
            });
            
            try {
              await agent.execute({ traceContext: childSpan });
              this.endSpan(childSpan, 'success');
            } catch (error) {
              this.endSpan(childSpan, 'error', error);
              throw error;
            }
          }
          
          this.endSpan(rootSpan, 'success');
        } catch (error) {
          // Close the root span on failure too, so the trace is never left open
          this.endSpan(rootSpan, 'error', error);
          throw error;
        }
      }
      
      createSpan(config: SpanConfig): Span {
        const span = { ...config, events: [], status: 'running' };
        console.log(JSON.stringify({ event: 'span_started', ...span }));
        return span;
      }
      
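      endSpan(span: Span, status: string, error?: unknown) {
        span.endTime = Date.now();
        span.durationMs = span.endTime - span.startTime;
        span.status = status;
        // Error objects serialize to "{}" under JSON.stringify, so keep the message
        span.error = error instanceof Error ? error.message : error;
        
        console.log(JSON.stringify({ event: 'span_ended', ...span }));
      }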
    }
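
The tracer above passes the span object in-process. When agents run in separate services, the trace and span IDs have to travel with the request instead. One widely used carrier is the W3C Trace Context `traceparent` header. The sketch below formats and parses version `00` of that header; the function names and the `SpanContext` type are hypothetical, and it assumes IDs are already lowercase hex of the required length (32 characters for the trace ID, 16 for the span ID), which an arbitrary `workflowId` would not be.

```typescript
interface SpanContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

// Build a version-00 traceparent value: 00-<trace-id>-<parent-id>-<flags>
function formatTraceparent(ctx: SpanContext): string {
  const flags = ctx.sampled ? "01" : "00";
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

// Parse a traceparent value; returns null when it is malformed.
function parseTraceparent(header: string): SpanContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

The calling agent would send `formatTraceparent(...)` as a request header, and the receiving agent would parse it and use the incoming span ID as its `parentSpanId`.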
    

    Key Metrics to Track

    1. Request-Level Metrics

    interface RequestMetrics {
      // Latency
      p50LatencyMs: number;
      p95LatencyMs: number;
      p99LatencyMs: number;
      
      // Success rates
      successRate: number;
      errorRate: number;
      timeoutRate: number;
      
      // Costs
      avgCostPerRequest: number;
      totalDailyCost: number;
      
      // Token usage
      avgPromptTokens: number;
      avgCompletionTokens: number;
      
      // Cache performance
      cacheHitRate: number;
    }
    

    2. Agent-Level Metrics

    interface AgentMetrics {
      // Decision quality
      avgConfidence: number;
      decisionDistribution: Record<string, number>; // Which tools chosen how often
      
      // Reasoning depth
      avgReasoningSteps: number;
      maxReasoningSteps: number;
      
      // Tool usage
      toolCallCounts: Record<string, number>;
      toolSuccessRates: Record<string, number>;
      avgToolLatency: Record<string, number>;
      
      // Context management
      avgContextLength: number;
      contextTruncationRate: number;
    }
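
The tool-usage fields above can be derived directly from the tool calls recorded in each trace. The sketch below shows one possible aggregation; `ToolCallRecord` is a trimmed, hypothetical version of the `ToolCall` shape defined earlier, and `computeToolUsage` is an illustrative name rather than part of any library.

```typescript
// Trimmed version of the ToolCall shape defined earlier; only the fields needed here.
interface ToolCallRecord {
  toolName: string;
  durationMs: number;
  success: boolean;
}

interface ToolUsageMetrics {
  toolCallCounts: Record<string, number>;
  toolSuccessRates: Record<string, number>;
  avgToolLatency: Record<string, number>;
}

// Aggregate per-tool call counts, success rates, and mean latency.
function computeToolUsage(calls: ToolCallRecord[]): ToolUsageMetrics {
  const stats = new Map<string, { count: number; ok: number; latency: number }>();
  for (const call of calls) {
    const s = stats.get(call.toolName) ?? { count: 0, ok: 0, latency: 0 };
    s.count += 1;
    s.ok += call.success ? 1 : 0;
    s.latency += call.durationMs;
    stats.set(call.toolName, s);
  }

  const result: ToolUsageMetrics = {
    toolCallCounts: {},
    toolSuccessRates: {},
    avgToolLatency: {},
  };
  // Every entry has count >= 1, so the divisions below are safe.
  stats.forEach((s, name) => {
    result.toolCallCounts[name] = s.count;
    result.toolSuccessRates[name] = s.ok / s.count;
    result.avgToolLatency[name] = s.latency / s.count;
  });
  return result;
}
```

Feeding it `traces.flatMap(t => t.toolCalls)` for a time window gives the per-tool breakdown for that window.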
    

    3. System-Level Metrics

    interface SystemMetrics {
      // Throughput
      requestsPerSecond: number;
      concurrentRequests: number;
      
      // Resource usage
      cpuUtilization: number;
      memoryUtilization: number;
      
      // Model performance
      modelLatencyByType: Record<string, number>;
      modelCostByType: Record<string, number>;
    }
    

    Metrics Dashboard Implementation

    class MetricsCollector {
      private metrics: AgentExecutionLog[] = [];
      
      record(log: AgentExecutionLog) {
        this.metrics.push(log);
      }
      
      getRequestMetrics(): RequestMetrics {
        const total = this.metrics.length;
        const latencies = this.metrics.map(m => m.durationMs).sort((a, b) => a - b);
        const costs = this.metrics.map(m => m.totalCostUsd);
        const modelCalls = this.metrics.flatMap(m => m.modelCalls);
        // Guard every ratio so an empty window yields 0 instead of NaN
        const rate = (count: number, of: number) => (of === 0 ? 0 : count / of);
        const countStatus = (s: AgentExecutionLog['status']) =>
          this.metrics.filter(m => m.status === s).length;
        
        return {
          p50LatencyMs: this.percentile(latencies, 0.5),
          p95LatencyMs: this.percentile(latencies, 0.95),
          p99LatencyMs: this.percentile(latencies, 0.99),
          successRate: rate(countStatus('success'), total),
          errorRate: rate(countStatus('error'), total),
          timeoutRate: rate(countStatus('timeout'), total),
          avgCostPerRequest: this.avg(costs),
          // Only a daily total if the collector is reset (or filtered) every 24 hours
          totalDailyCost: costs.reduce((a, b) => a + b, 0),
          // inputTokens/outputTokens are the logger's estimates, not provider-reported counts
          avgPromptTokens: this.avg(this.metrics.map(m => m.inputTokens)),
          avgCompletionTokens: this.avg(this.metrics.map(m => m.outputTokens)),
          cacheHitRate: rate(modelCalls.filter(c => c.cached).length, modelCalls.length)
        };
      }
      
      // Nearest-rank percentile over an ascending-sorted array; 0 when empty
      private percentile(sorted: number[], p: number): number {
        if (sorted.length === 0) return 0;
        const index = Math.max(0, Math.ceil(sorted.length * p) - 1);
        return sorted[index];
      }
      
      private avg(nums: number[]): number {
        return nums.length === 0 ? 0 : nums.reduce((a, b) => a + b, 0) / nums.length;
      }
    }
    

    Alerting on Anomalies

    Evaluate the aggregated metrics against explicit thresholds on a schedule, and tag each breach with a severity:

    class AgentAlerting {
      private thresholds = {
        errorRatePercent: 5,
        p99LatencyMs: 10000,
        dailyCostUsd: 1000,
        lowConfidencePercent: 20
      };
      
      async checkAlerts(metrics: RequestMetrics, agentMetrics: AgentMetrics) {
        const alerts: Alert[] = [];
        
        if (metrics.errorRate * 100 > this.thresholds.errorRatePercent) {
          alerts.push({
            severity: 'critical',
            title: 'High Error Rate',
            message: `Error rate is ${(metrics.errorRate * 100).toFixed(1)}%, threshold: ${this.thresholds.errorRatePercent}%`
          });
        }
        
        if (metrics.p99LatencyMs > this.thresholds.p99LatencyMs) {
          alerts.push({
            severity: 'warning',
            title: 'High Latency',
            message: `P99 latency is ${metrics.p99LatencyMs}ms, threshold: ${this.thresholds.p99LatencyMs}ms`
          });
        }
        
        if (metrics.totalDailyCost > this.thresholds.dailyCostUsd) {
          alerts.push({
            severity: 'warning',
            title: 'Budget Exceeded',
            message: `Daily cost is $${metrics.totalDailyCost.toFixed(2)}, budget: $${this.thresholds.dailyCostUsd}`
          });
        }
        
        // With lowConfidencePercent = 20, this fires when average confidence drops below 0.8
        if (agentMetrics.avgConfidence < (1 - this.thresholds.lowConfidencePercent / 100)) {
          alerts.push({
            severity: 'info',
            title: 'Low Confidence',
            message: `Average confidence is ${(agentMetrics.avgConfidence * 100).toFixed(1)}%`
          });
        }
        
        if (alerts.length > 0) {
          await this.sendAlerts(alerts);
        }
      }
      
      private async sendAlerts(alerts: Alert[]) {
        // Send to Slack, PagerDuty, email, etc.
        for (const alert of alerts) {
          console.error(`ALERT [${alert.severity}]: ${alert.title} - ${alert.message}`);
          // await this.slackNotify(alert);
        }
      }
    }
    

    Debugging Production Failures

    When an agent fails, you need a systematic debugging workflow:

    1. Trace Lookup

    class DebugWorkflow {
      async investigateFailure(traceId: string) {
        // 1. Fetch full trace
        const trace = await this.fetchTrace(traceId);
        
        console.log('=== FAILURE INVESTIGATION ===');
        console.log(`Trace ID: ${traceId}`);
        console.log(`Status: ${trace.status}`);
        console.log(`Error: ${trace.error?.message}`);
        console.log(`Duration: ${trace.durationMs}ms`);
        console.log();
        
        // 2. Analyze reasoning steps
        console.log('=== REASONING STEPS ===');
        trace.reasoningSteps.forEach(step => {
          console.log(`Step ${step.step}:`);
          console.log(`  Thought: ${step.thought}`);
          console.log(`  Action: ${step.action}`);
          console.log(`  Observation: ${step.observation}`);
        });
        console.log();
        
        // 3. Identify failed tool calls
        console.log('=== TOOL CALLS ===');
        trace.toolCalls.forEach(call => {
          console.log(`${call.toolName}: ${call.success ? '✓' : '✗ FAILED'}`);
          if (!call.success) {
            console.log(`  Error: ${call.error}`);
            console.log(`  Input: ${JSON.stringify(call.input)}`);
          }
        });
        console.log();
        
        // 4. Reproduce locally
        console.log('=== REPRODUCTION ===');
        console.log(`To reproduce locally, run:`);
        console.log(`  node debug.js --task="${trace.task}" --context=...`);
        
        return trace;
      }
      
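      async fetchTrace(traceId: string): Promise<AgentExecutionLog> {
        const response = await fetch(`https://your-backend/traces/${encodeURIComponent(traceId)}`);
        // fetch resolves on HTTP error statuses, so fail loudly instead of parsing an error body
        if (!response.ok) {
          throw new Error(`Failed to fetch trace ${traceId}: HTTP ${response.status}`);
        }
        return await response.json();
      }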
    }
    

    2. Local Reproduction

    // debug.ts - Reproduce a production failure locally (compile to debug.js before running with node)
    async function reproduceFailure(traceId: string) {
      const debug = new DebugWorkflow();
      const trace = await debug.investigateFailure(traceId);
      
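      // Recreate execution environment
      if (trace.modelCalls.length === 0) {
        throw new Error('Trace has no model calls; cannot tell which model to replay with');
      }
      const agent = new Agent({
        model: trace.modelCalls[0].model,
        tools: [...new Set(trace.toolCalls.map(c => c.toolName))],  // De-duplicate tool names
        verbose: true  // Enable detailed logging
      });
      
      try {
        // Replay with same inputs (contextDocuments holds document IDs, not contents)
        const result = await agent.execute(trace.task, {
          context: trace.contextDocuments
        });
        
        // The run succeeded, so the original failure did not reproduce
        console.log('Failure did not reproduce; it may be non-deterministic or environment-specific');
        console.log('Result:', result);
      } catch (error) {
        // Compare against trace.error to confirm it is the same failure
        console.log('Failure reproduced:', error);
      }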
    }
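
Replaying against live tools may not reproduce a failure, because tool outputs can change between the original run and the replay. One option is to substitute stand-in tools that return the outputs recorded in the trace. The sketch below is an assumption-laden illustration: `RecordedToolCall` is a trimmed version of the logged `ToolCall` shape, `ReplayTool` and `buildReplayTools` are hypothetical names, and it assumes the agent calls each tool in the same order as in the original run.

```typescript
// Trimmed version of the ToolCall shape logged in the trace.
interface RecordedToolCall {
  toolName: string;
  output: unknown;
  success: boolean;
  error?: string;
}

type ReplayTool = (input: unknown) => Promise<unknown>;

// Build stand-in tools that return the recorded outputs in the order they were logged.
function buildReplayTools(recorded: RecordedToolCall[]): Map<string, ReplayTool> {
  // Group recorded calls per tool, preserving their original order.
  const queues = new Map<string, RecordedToolCall[]>();
  for (const call of recorded) {
    const queue = queues.get(call.toolName) ?? [];
    queue.push(call);
    queues.set(call.toolName, queue);
  }

  const tools = new Map<string, ReplayTool>();
  queues.forEach((queue, name) => {
    let next = 0;
    tools.set(name, async () => {
      const call = queue[next++];
      if (!call) throw new Error(`No recorded call left for tool "${name}"`);
      // Recorded failures are replayed as failures.
      if (!call.success) throw new Error(call.error ?? "recorded tool failure");
      return call.output;
    });
  });
  return tools;
}
```

If the replayed run diverges from the trace (for example, it asks for a tool call that was never recorded), that divergence is itself a signal that the model's behavior, not the tools, differs between runs.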
    

    Observability Tools Ecosystem

    Commercial and Open-Source Solutions

    1. LangSmith (LangChain)

      • Tracing, evaluation, monitoring
      • Good for LangChain-based agents
      • $$ pricing

    2. Helicone

      • LLM observability and caching
      • Model-agnostic
      • Affordable

    3. Langfuse

      • Open-source observability
      • Self-hostable
      • Free tier available

    Custom Solution

    Alternatively, assemble your own stack from standard components:

    class CustomObservability {
      // LogStore, PrometheusClient, and JaegerClient are placeholders for your own client wrappers
      constructor(
        // Log storage: Elasticsearch, ClickHouse, or PostgreSQL
        private logStore: LogStore,
        // Metrics: Prometheus + Grafana
        private metricsClient: PrometheusClient,
        // Tracing: Jaeger or Zipkin
        private tracingClient: JaegerClient
      ) {}
      
      async recordExecution(log: AgentExecutionLog) {
        // Store logs
        await this.logStore.insert(log);
        
        // Update metrics
        this.metricsClient.recordLatency(log.durationMs);
        this.metricsClient.recordCost(log.totalCostUsd);
        this.metricsClient.incrementCounter('requests_total', { status: log.status });
        
        // Record trace
        await this.tracingClient.sendSpan({
          traceId: log.traceId,
          spanId: log.spanId,
          duration: log.durationMs,
          tags: {
            agent_id: log.agentId,
            status: log.status
          }
        });
      }
      
      async query(filters: LogFilters): Promise<AgentExecutionLog[]> {
        return await this.logStore.query(filters);
      }
    }
    

    Conclusion

    Agent observability is not optional—it's the difference between "it's broken and we don't know why" and "we identified the issue in 5 minutes." The investment in logging, tracing, metrics, and debugging workflows pays dividends every time something goes wrong (and things will go wrong).

    Key takeaways:

    • Log everything: inputs, reasoning, tool calls, outputs, costs
    • Use structured logging for queryability
    • Implement distributed tracing for multi-agent workflows
    • Track key metrics and set up alerting
    • Build debugging workflows for fast root cause analysis
    • Choose observability tools that fit your stack

    Start simple: add basic logging and metrics first. Then layer on distributed tracing and advanced analytics as your system grows in complexity. But whatever you do, don't skip observability—your future self (or on-call engineer) will thank you.
