LLM Cost Optimization: Managing Token Usage and Infrastructure at Scale

LLM costs can spiral out of control fast. A single production agent processing 100K requests/day can rack up $50K+/month in API costs. But with the right optimization strategies, you can cut that by 60% or more without sacrificing quality. This guide covers proven techniques for managing token economics at scale.

Understanding LLM Cost Structure

LLM pricing has two components:

1. Prompt tokens (input): What you send to the model
2. Completion tokens (output): What the model generates

Costs vary dramatically by model:

Model	Prompt Cost ($/1M tokens)	Completion Cost ($/1M tokens)	Use Case
GPT-4-Turbo	$10	$30	Complex reasoning, high accuracy
GPT-3.5-Turbo	$0.50	$1.50	General tasks, high volume
Claude-3-Opus	$15	$75	Long context, nuanced tasks
Claude-3-Sonnet	$3	$15	Balanced performance
Claude-3-Haiku	$0.25	$1.25	Fast, simple tasks
Gemini-1.5-Pro	$3.50	$10.50	Multimodal, long context
Llama-3-70B	$0.70	$0.90	Open source, self-hosted

Key insight: Prompt tokens usually dominate costs for agent systems because prompts include system instructions, few-shot examples, retrieved context, and conversation history—often 2000+ tokens per request.
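
To make the pricing structure concrete, here is a minimal sketch of a per-request cost calculation. The `PRICES` map copies three rows from the table above; the type and function names are illustrative assumptions, not part of any provider SDK.

```typescript
// Prices in USD per 1M tokens, copied from the table above.
interface ModelPrice {
  promptPerMillion: number;
  completionPerMillion: number;
}

const PRICES: Record<string, ModelPrice> = {
  'gpt-4-turbo': { promptPerMillion: 10, completionPerMillion: 30 },
  'gpt-3.5-turbo': { promptPerMillion: 0.5, completionPerMillion: 1.5 },
  'claude-3-haiku': { promptPerMillion: 0.25, completionPerMillion: 1.25 },
};

// Hypothetical helper: cost of one request, split by token type.
function requestCostUsd(model: string, promptTokens: number, completionTokens: number) {
  const price = PRICES[model];
  if (!price) throw new Error(`Unknown model: ${model}`);
  const promptCost = (promptTokens / 1_000_000) * price.promptPerMillion;
  const completionCost = (completionTokens / 1_000_000) * price.completionPerMillion;
  return { promptCost, completionCost, total: promptCost + completionCost };
}

const cost = requestCostUsd('gpt-4-turbo', 2000, 500);
console.log(cost.total.toFixed(4)); // prints 0.0350
```

At these prices a 2,000-token prompt costs $0.02 and a 500-token completion costs $0.015, so the prompt accounts for roughly 57% of the request cost.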

Strategy 1: Semantic Caching

For workloads with many repeated or paraphrased queries, caching identical or semantically similar requests can eliminate 40-60% of LLM calls.

Implementation

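import { createHash } from 'crypto';

// Assumed interface: any embedding client that maps text to a vector.
// OpenAIEmbeddings below is a placeholder for your own implementation of it.
interface EmbeddingModel {
  embed(text: string): Promise<number[]>;
}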

interface CacheEntry {
  prompt: string;
  embedding: number[];
  response: string;
  timestamp: number;
  hitCount: number;
}

class SemanticCache {
  private cache: Map<string, CacheEntry> = new Map();
  private embedModel: EmbeddingModel;
  private similarityThreshold = 0.95;
  
  constructor() {
    this.embedModel = new OpenAIEmbeddings(); // Use cheaper embedding model
  }
  
  async get(prompt: string): Promise<string | null> {
    // 1. Try exact match first (fastest)
    const exactKey = this.hashPrompt(prompt);
    const exactMatch = this.cache.get(exactKey);
    if (exactMatch && this.isValid(exactMatch)) {
      exactMatch.hitCount++;
      console.log('Cache hit: exact');
      return exactMatch.response;
    }
    
    // 2. Try semantic similarity (slower but catches paraphrases)
    const embedding = await this.embedModel.embed(prompt);
    
    for (const [key, entry] of this.cache.entries()) {
      if (!this.isValid(entry)) continue;
      
      const similarity = this.cosineSimilarity(embedding, entry.embedding);
      if (similarity >= this.similarityThreshold) {
        entry.hitCount++;
        console.log(`Cache hit: semantic (similarity: ${similarity.toFixed(3)})`);
        return entry.response;
      }
    }
    
    return null;
  }
  
  async set(prompt: string, response: string): Promise<void> {
    const key = this.hashPrompt(prompt);
    const embedding = await this.embedModel.embed(prompt);
    
    this.cache.set(key, {
      prompt,
      embedding,
      response,
      timestamp: Date.now(),
      hitCount: 0
    });
    
    // Evict old entries to prevent unbounded growth
    this.evictOldEntries();
  }
  
  private hashPrompt(prompt: string): string {
    return createHash('sha256').update(prompt).digest('hex');
  }
  
  private cosineSimilarity(a: number[], b: number[]): number {
    const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
    const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
    const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
    return dotProduct / (magnitudeA * magnitudeB);
  }
  
  private isValid(entry: CacheEntry): boolean {
    const maxAge = 24 * 60 * 60 * 1000; // 24 hours
    return Date.now() - entry.timestamp < maxAge;
  }
  
  private evictOldEntries(): void {
    const maxEntries = 10000;
    if (this.cache.size <= maxEntries) return;
    
    // Evict the oldest entries by insertion time (FIFO, not true LRU)
    const entries = Array.from(this.cache.entries())
      .sort((a, b) => a[1].timestamp - b[1].timestamp);
    
    const toDelete = entries.slice(0, entries.length - maxEntries);
    toDelete.forEach(([key]) => this.cache.delete(key));
  }
  
  getStats() {
    const entries = Array.from(this.cache.values());
    const totalHits = entries.reduce((sum, e) => sum + e.hitCount, 0);
    const size = this.cache.size;
    
    return {
      totalEntries: size,
      totalHits,
      // Guard against division by zero on an empty cache
      avgHitsPerEntry: size > 0 ? totalHits / size : 0,
      // Approximate: treats each stored entry as exactly one miss
      cacheHitRate: size > 0 ? totalHits / (totalHits + size) : 0
    };
  }
}

// Usage in agent
const cache = new SemanticCache();

async function callLLM(prompt: string): Promise<string> {
  // Check cache first
  const cached = await cache.get(prompt);
  if (cached) {
    return cached;
  }
  
  // Cache miss - call LLM
  const response = await openai.chat.completions.create({
    model: 'gpt-4-turbo',
    messages: [{ role: 'user', content: prompt }]
  });
  
  // message.content can be null (for example on tool calls), so fall back to an empty string
  const content = response.choices[0].message.content ?? '';
  
  // Store in cache
  await cache.set(prompt, content);
  
  return content;
}
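
The similarity threshold is the main tuning knob in this cache. The sketch below applies the same cosine-similarity formula to toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions) to show how the 0.95 cutoff behaves. The vectors and the `isSemanticHit` helper are invented for illustration.

```typescript
// Cosine similarity, as in SemanticCache above, with guards for bad input.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  const denom = Math.sqrt(magA) * Math.sqrt(magB);
  return denom === 0 ? 0 : dot / denom; // zero vectors never match
}

// Hypothetical helper: would this pair count as a cache hit?
function isSemanticHit(a: number[], b: number[], threshold = 0.95): boolean {
  return cosineSimilarity(a, b) >= threshold;
}

// Toy vectors, invented for illustration only.
const query = [1, 0, 0];
const nearDuplicate = [0.98, 0.1, 0.05];
const unrelated = [0, 1, 0];

console.log(isSemanticHit(query, nearDuplicate)); // true
console.log(isSemanticHit(query, unrelated));     // false
```

Lowering the threshold raises the hit rate but also the risk of returning an answer to a different question, so it is worth validating against a sample of real prompts before changing it.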

Cost savings example:

100K requests/day
40% cache hit rate
GPT-4-Turbo: $10/1M prompt tokens, $30/1M completion tokens
Avg prompt: 2000 tokens, avg completion: 500 tokens

Without cache:

Daily cost: 100K × (2000 × $10/1M + 500 × $30/1M) = $3,500/day = $105K/month

With cache:

Daily cost: 60K × (2000 × $10/1M + 500 × $30/1M) = $2,100/day = $63K/month
Savings: $42K/month (40%)

Strategy 2: Model Routing

Not all tasks need GPT-4. Route requests to the cheapest model capable of handling each task.

enum TaskComplexity {
  SIMPLE = 'simple',           // Classification, extraction, simple Q&A
  MODERATE = 'moderate',       // Summarization, basic reasoning
  COMPLEX = 'complex'          // Multi-step reasoning, creative writing
}

interface ModelConfig {
  name: string;
  promptCost: number;  // per 1M tokens
  completionCost: number;
  maxTokens: number;
  latencyMs: number;
}

const models: Record<TaskComplexity, ModelConfig> = {
  [TaskComplexity.SIMPLE]: {
    name: 'gpt-3.5-turbo',
    promptCost: 0.50,
    completionCost: 1.50,
    maxTokens: 4096,
    latencyMs: 500
  },
  [TaskComplexity.MODERATE]: {
    name: 'claude-3-haiku',
    promptCost: 0.25,
    completionCost: 1.25,
    maxTokens: 200000,
    latencyMs: 800
  },
  [TaskComplexity.COMPLEX]: {
    name: 'gpt-4-turbo',
    promptCost: 10,
    completionCost: 30,
    maxTokens: 128000,
    latencyMs: 2000
  }
};

class ModelRouter {
  classifyTask(task: string): TaskComplexity {
    // Use a small classifier model or heuristics
    const simplePatterns = [
      /^classify/i,
      /^extract/i,
      /^is this/i,
      /^yes or no/i
    ];
    
    const complexPatterns = [
      /step by step/i,
      /explain why/i,
      /compare and contrast/i,
      /analyze/i,
      /create.*detailed/i
    ];
    
    if (simplePatterns.some(p => p.test(task))) {
      return TaskComplexity.SIMPLE;
    }
    
    if (complexPatterns.some(p => p.test(task))) {
      return TaskComplexity.COMPLEX;
    }
    
    return TaskComplexity.MODERATE;
  }
  
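  selectModel(task: string, contextLength: number): ModelConfig {
    const complexity = this.classifyTask(task);
    let model = models[complexity];
    
    // If the context does not fit, switch to the cheapest configured model whose
    // window is large enough (this may trade capability for context size)
    if (contextLength > model.maxTokens) {
      const fitting = Object.values(models)
        .filter(m => m.maxTokens >= contextLength)
        .sort((a, b) => a.promptCost - b.promptCost);
      if (fitting.length === 0) {
        throw new Error(`Context (${contextLength} tokens) exceeds every configured model's limit`);
      }
      console.log(`Context (${contextLength} tokens) exceeds ${model.name} limit, switching to ${fitting[0].name}`);
      model = fitting[0];
    }
    
    return model;
  }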
  
  async route(task: string, context: string): Promise<LLMResponse> {
    const contextLength = this.estimateTokens(context);
    const model = this.selectModel(task, contextLength);
    
    console.log(`Routing to ${model.name} (complexity: ${this.classifyTask(task)})`);
    
    return await this.callModel(model, task, context);
  }
  
  private estimateTokens(text: string): number {
    // Quick estimate: ~4 characters per token
    return Math.ceil(text.length / 4);
  }
}

// Usage
const router = new ModelRouter();
const response = await router.route(
  "Extract the email address from this text",
  userInput
);

Cost savings example:

100K requests/day
60% simple tasks → GPT-3.5
30% moderate tasks → Claude-3-Haiku
10% complex tasks → GPT-4

Without routing (all GPT-4):

$105K/month

With routing:

Simple: 60K × (2000 × $0.50/1M + 500 × $1.50/1M) = $105/day
Moderate: 30K × (2000 × $0.25/1M + 500 × $1.25/1M) = $33.75/day
Complex: 10K × (2000 × $10/1M + 500 × $30/1M) = $350/day
Total: $488.75/day ≈ $14.7K/month
Savings: ≈ $90.3K/month (86%)
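
The blended cost of a routing mix can be recomputed directly from the per-token prices. The following sketch assumes the same averages as the examples above (2,000 prompt tokens and 500 completion tokens per request, 30 days per month); the `Tier` shape and function names are illustrative.

```typescript
interface Tier {
  name: string;
  share: number;          // fraction of daily traffic, 0..1
  promptCost: number;     // USD per 1M prompt tokens
  completionCost: number; // USD per 1M completion tokens
}

const PROMPT_TOKENS = 2000;    // assumed average, as in the examples above
const COMPLETION_TOKENS = 500; // assumed average

// Sums the daily cost of each tier, weighted by its share of traffic.
function dailyCostUsd(requestsPerDay: number, tiers: Tier[]): number {
  return tiers.reduce((sum, t) => {
    const perRequest =
      (PROMPT_TOKENS * t.promptCost + COMPLETION_TOKENS * t.completionCost) / 1_000_000;
    return sum + requestsPerDay * t.share * perRequest;
  }, 0);
}

const routed: Tier[] = [
  { name: 'gpt-3.5-turbo', share: 0.6, promptCost: 0.5, completionCost: 1.5 },
  { name: 'claude-3-haiku', share: 0.3, promptCost: 0.25, completionCost: 1.25 },
  { name: 'gpt-4-turbo', share: 0.1, promptCost: 10, completionCost: 30 },
];
const allGpt4: Tier[] = [
  { name: 'gpt-4-turbo', share: 1, promptCost: 10, completionCost: 30 },
];

const baseline = dailyCostUsd(100_000, allGpt4);
const withRouting = dailyCostUsd(100_000, routed);
console.log(`Baseline: $${(baseline * 30).toFixed(0)}/month`);
console.log(`Routed:   $${(withRouting * 30).toFixed(0)}/month`);
console.log(`Savings:  ${((1 - withRouting / baseline) * 100).toFixed(0)}%`);
```

Because the complex tier dominates the blended cost, how accurately the classifier identifies complex tasks matters more than the price difference between the two cheaper models.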

Strategy 3: Prompt Compression

Long prompts are expensive. Compress without losing information.

class PromptCompressor {
  // Technique 1: Remove redundancy
  deduplicateExamples(examples: string[]): string[] {
    const unique = new Set<string>();
    return examples.filter(ex => {
      const normalized = this.normalize(ex);
      if (unique.has(normalized)) return false;
      unique.add(normalized);
      return true;
    });
  }
  
  // Technique 2: Summarize long context
  async compressContext(context: string, maxTokens: number): Promise<string> {
    const currentTokens = this.estimateTokens(context);
    
    if (currentTokens <= maxTokens) {
      return context;
    }
    
    // Use cheap model to summarize
    const compressionRatio = maxTokens / currentTokens;
    const summary = await this.summarize(context, compressionRatio);
    
    return summary;
  }
  
  // Technique 3: Use structured formats
  toStructuredFormat(data: unknown): string {
    // Compact JSON is often more token-efficient than verbose prose; measure for your data
    return JSON.stringify(data); // No pretty-printing
  }
  
  // Technique 4: Remove unnecessary whitespace
  minify(text: string): string {
    return text
      .replace(/\n\s*\n/g, '\n')  // Multiple newlines → single
      .replace(/  +/g, ' ')        // Multiple spaces → single
      .trim();
  }
  
  async compress(prompt: string, targetTokens: number): Promise<string> {
    let compressed = prompt;
    
    // Step 1: Minify
    compressed = this.minify(compressed);
    
    // Step 2: If still too long, summarize
    if (this.estimateTokens(compressed) > targetTokens) {
      compressed = await this.compressContext(compressed, targetTokens);
    }
    
    console.log(`Compressed from ${this.estimateTokens(prompt)} to ${this.estimateTokens(compressed)} tokens`);
    
    return compressed;
  }
}

Compression example:

Original: 3000 tokens
Compressed: 1500 tokens
Cost reduction: 50% on prompt tokens

For 100K requests/day:

Prompt savings: 100K × 1500 × $10/1M = $1,500/day = $45K/month saved
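
To see how much whitespace minification alone saves before paying for summarization, the router's heuristic (about 4 characters per token) can be applied before and after. This is a rough sketch: the character-based estimate is only an approximation, and the sample prompt is invented.

```typescript
// Rough estimate: ~4 characters per token (same heuristic as ModelRouter).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Collapse blank lines and repeated spaces, then trim both ends.
function minify(text: string): string {
  return text
    .replace(/\n\s*\n/g, '\n') // runs of blank lines become a single newline
    .replace(/ {2,}/g, ' ')    // runs of spaces become a single space
    .trim();
}

// Invented sample prompt with padding typical of templated text.
const samplePrompt = [
  'You are a support assistant.',
  '',
  '',
  'Context:    order  #123   was   shipped.',
  '',
  'Question:   where is my order?',
  '   ',
].join('\n');

const before = estimateTokens(samplePrompt);
const after = estimateTokens(minify(samplePrompt));
console.log(`Estimated tokens: ${before} -> ${after}`);
```

For production measurements, a real tokenizer for the target model gives more reliable counts than the character heuristic.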

Strategy 4: Batching and Parallelization

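Process multiple requests in parallel to maximize throughput and reduce per-request overhead. Client-side batching like this improves throughput and latency under load; it does not by itself reduce token charges, since each prompt is still billed as its own request.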

class BatchProcessor {
  private queue: Array<{
    prompt: string;
    resolve: (response: string) => void;
    reject: (error: Error) => void;
  }> = [];
  
  private batchSize = 10;
  private batchTimeoutMs = 100;
  private processingTimer: NodeJS.Timeout | null = null;
  
  async process(prompt: string): Promise<string> {
    return new Promise((resolve, reject) => {
      this.queue.push({ prompt, resolve, reject });
      
      // Start batch timer if not already running
      if (!this.processingTimer) {
        this.processingTimer = setTimeout(
          () => this.processBatch(),
          this.batchTimeoutMs
        );
      }
      
      // Process immediately if batch full
      if (this.queue.length >= this.batchSize) {
        clearTimeout(this.processingTimer);
        this.processingTimer = null;
        this.processBatch();
      }
    });
  }
  
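  private async processBatch() {
    // Clear the timer handle so process() can schedule the next batch
    if (this.processingTimer) {
      clearTimeout(this.processingTimer);
      this.processingTimer = null;
    }
    if (this.queue.length === 0) return;
    
    const batch = this.queue.splice(0, this.batchSize);
    
    console.log(`Processing batch of ${batch.length} requests`);
    
    // Send all prompts in parallel; allSettled keeps one failure from rejecting the whole batch
    const results = await Promise.allSettled(
      batch.map(({ prompt }) => this.callLLM(prompt))
    );
    
    // Resolve or reject each caller individually
    results.forEach((result, i) => {
      if (result.status === 'fulfilled') {
        batch[i].resolve(result.value);
      } else {
        const reason = result.reason;
        batch[i].reject(reason instanceof Error ? reason : new Error(String(reason)));
      }
    });
    
    // Process remaining queue unless a timer was already scheduled while awaiting
    if (this.queue.length > 0 && !this.processingTimer) {
      this.processingTimer = setTimeout(
        () => this.processBatch(),
        this.batchTimeoutMs
      );
    }
  }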
  
  private async callLLM(prompt: string): Promise<string> {
    // Actual LLM call
    const response = await openai.chat.completions.create({
      model: 'gpt-3.5-turbo',
      messages: [{ role: 'user', content: prompt }]
    });
    // message.content can be null, so fall back to an empty string
    return response.choices[0].message.content ?? '';
  }
}

// Usage
const batcher = new BatchProcessor();

// These will be batched together
const results = await Promise.all([
  batcher.process("Translate 'hello' to Spanish"),
  batcher.process("Translate 'goodbye' to French"),
  batcher.process("Translate 'thank you' to German")
]);

Batching benefits:

Reduced latency overhead
Better throughput (requests/second)
Lower infrastructure costs

Strategy 5: Cost Monitoring and Alerting

You can't optimize what you don't measure.

interface CostMetrics {
  requestId: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
  promptCostUsd: number;
  completionCostUsd: number;
  totalCostUsd: number;
  latencyMs: number;
  cached: boolean;
  timestamp: number;
}

class CostMonitor {
  private metrics: CostMetrics[] = [];
  private alertThresholds = {
    dailyBudget: 1000,  // $1000/day
    requestCost: 0.50    // $0.50/request
  };
  
  logRequest(metrics: CostMetrics) {
    this.metrics.push(metrics);
    
    // Alert on expensive requests
    if (metrics.totalCostUsd > this.alertThresholds.requestCost) {
      this.alertExpensiveRequest(metrics);
    }
    
    // Alert on daily budget
    const todayCost = this.getTodayCost();
    if (todayCost > this.alertThresholds.dailyBudget) {
      this.alertBudgetExceeded(todayCost);
    }
  }
  
  getTodayCost(): number {
    const today = new Date().setHours(0, 0, 0, 0);
    return this.metrics
      .filter(m => m.timestamp >= today)
      .reduce((sum, m) => sum + m.totalCostUsd, 0);
  }
  
  getReport() {
    const count = this.metrics.length;
    const total = this.metrics.reduce((sum, m) => sum + m.totalCostUsd, 0);
    const byModel = this.groupBy(this.metrics, 'model');
    // Guard against division by zero before any requests are logged
    const cacheHitRate = count > 0 ? this.metrics.filter(m => m.cached).length / count : 0;
    
    return {
      totalCost: total,
      totalRequests: count,
      avgCostPerRequest: count > 0 ? total / count : 0,
      cacheHitRate: (cacheHitRate * 100).toFixed(1) + '%',
      costByModel: Object.entries(byModel).map(([model, metrics]) => ({
        model,
        cost: metrics.reduce((sum, m) => sum + m.totalCostUsd, 0),
        requests: metrics.length
      }))
    };
  }
  
  private groupBy<T>(array: T[], key: keyof T): Record<string, T[]> {
    return array.reduce((groups, item) => {
      const value = String(item[key]);
      groups[value] = groups[value] || [];
      groups[value].push(item);
      return groups;
    }, {} as Record<string, T[]>);
  }
}

// Dashboard
const monitor = new CostMonitor();

setInterval(() => {
  const report = monitor.getReport();
  console.log('Cost Report:', report);
}, 3600000); // Every hour
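
A budget check that runs on every request will fire an alert on each request once the daily threshold has been crossed. One possible refinement, sketched below, is to alert only once per day. The `BudgetTracker` class and its method names are hypothetical, days are bucketed in UTC for simplicity, and timestamps are passed in explicitly so the logic is easy to test.

```typescript
class BudgetTracker {
  private readonly dailyBudgetUsd: number;
  private spentToday = 0;
  private currentDay = '';
  private alerted = false;

  constructor(dailyBudgetUsd: number) {
    this.dailyBudgetUsd = dailyBudgetUsd;
  }

  // Records a cost; returns true only on the request that first crosses the budget.
  record(costUsd: number, timestamp: number = Date.now()): boolean {
    const day = new Date(timestamp).toISOString().slice(0, 10); // UTC date, e.g. "2024-05-01"
    if (day !== this.currentDay) {
      // New day: reset the running total and the alert flag.
      this.currentDay = day;
      this.spentToday = 0;
      this.alerted = false;
    }
    this.spentToday += costUsd;
    if (!this.alerted && this.spentToday > this.dailyBudgetUsd) {
      this.alerted = true;
      return true;
    }
    return false;
  }

  get spent(): number {
    return this.spentToday;
  }
}

const tracker = new BudgetTracker(1000);
const t0 = Date.UTC(2024, 4, 1, 12); // 2024-05-01 12:00 UTC
console.log(tracker.record(600, t0));              // false (total 600)
console.log(tracker.record(500, t0 + 1000));       // true  (total 1100, first crossing)
console.log(tracker.record(50, t0 + 2000));        // false (already alerted today)
console.log(tracker.record(50, t0 + 86_400_000));  // false (new day, total reset to 50)
```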

Combined Strategy: 60%+ Cost Reduction

Putting it all together:

class OptimizedAgent {
  private cache: SemanticCache;
  private router: ModelRouter;
  private compressor: PromptCompressor;
  private batcher: BatchProcessor;
  private monitor: CostMonitor;
  
  async process(task: string, context: string): Promise<string> {
    const startTime = Date.now();
    
    // Step 1: Build prompt
    let prompt = this.buildPrompt(task, context);
    
    // Step 2: Compress prompt
    prompt = await this.compressor.compress(prompt, 2000);
    
    // Step 3: Check cache
    const cached = await this.cache.get(prompt);
    if (cached) {
      this.monitor.logRequest({
        requestId: generateId(),
        model: 'cache',
        promptTokens: 0,
        completionTokens: 0,
        promptCostUsd: 0,
        completionCostUsd: 0,
        totalCostUsd: 0,
        latencyMs: Date.now() - startTime,
        cached: true,
        timestamp: Date.now()
      });
      return cached;
    }
    
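    // Step 4: Route to appropriate model (selectModel expects a token count, not a character count)
    const model = this.router.selectModel(task, this.estimateTokens(prompt));
    
    // Step 5: Execute (batched if possible)
    // Note: BatchProcessor as shown calls a fixed model; pass model.name through
    // process() in a real implementation so the routing decision takes effect.
    const response = await this.batcher.process(prompt);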
    
    // Step 6: Cache result
    await this.cache.set(prompt, response);
    
    // Step 7: Log metrics
    const promptTokens = this.estimateTokens(prompt);
    const completionTokens = this.estimateTokens(response);
    
    this.monitor.logRequest({
      requestId: generateId(),
      model: model.name,
      promptTokens,
      completionTokens,
      promptCostUsd: (promptTokens / 1_000_000) * model.promptCost,
      completionCostUsd: (completionTokens / 1_000_000) * model.completionCost,
      totalCostUsd: ((promptTokens / 1_000_000) * model.promptCost) + 
                    ((completionTokens / 1_000_000) * model.completionCost),
      latencyMs: Date.now() - startTime,
      cached: false,
      timestamp: Date.now()
    });
    
    return response;
  }
}

Combined savings:

Caching: -40%
Model routing: -50% of remaining
Prompt compression: -30% of remaining
Total: ~79% cost reduction (1 − 0.60 × 0.50 × 0.70 = 0.79)

Starting cost: $105K/month
Final cost: ~$22K/month
Savings: ~$83K/month
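
Because each strategy applies to the spend that remains after the previous one, the reductions multiply rather than add. A small sketch for recomputing the combined figure with your own percentages (the function name is illustrative):

```typescript
// Each entry is the fractional reduction applied to the remaining cost.
function combinedReduction(reductions: number[]): number {
  const remaining = reductions.reduce((left, r) => {
    if (r < 0 || r > 1) throw new Error(`Reduction out of range: ${r}`);
    return left * (1 - r);
  }, 1);
  return 1 - remaining;
}

const startingMonthlyUsd = 105_000;
const reduction = combinedReduction([0.4, 0.5, 0.3]); // caching, routing, compression
const finalMonthlyUsd = startingMonthlyUsd * (1 - reduction);

console.log(`Reduction: ${(reduction * 100).toFixed(0)}%`);
console.log(`Final cost: $${finalMonthlyUsd.toFixed(0)}/month`);
```

The percentages are inputs, not measurements: real savings depend on the cache hit rate, task mix, and compressibility of your own traffic.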

Conclusion

LLM cost optimization isn't optional at scale—it's critical for sustainable AI products. The strategies covered here have proven effective across dozens of production systems:

Semantic caching eliminates redundant LLM calls
Model routing matches task complexity to model capability
Prompt compression reduces token usage with minimal quality loss
Batching maximizes infrastructure efficiency
Cost monitoring enables data-driven optimization

Start with caching and routing—they deliver the biggest wins with minimal complexity. Add compression and batching as you scale. And always monitor costs in real-time; surprises are expensive.

The goal isn't to use the cheapest model for everything. It's to use the right model for each task, cache aggressively, and eliminate waste. Done right, you can build sophisticated agent systems that scale to millions of requests without breaking the bank.