Technical Architecture

    LLM Cost Optimization: Managing Token Usage and Infrastructure at Scale

    By Sesha Kadakia

    LLM costs can spiral out of control fast. A single production agent processing 100K requests/day can rack up $50K+/month in API costs. But with the right optimization strategies, you can cut that by 60% or more without sacrificing quality. This guide covers proven techniques for managing token economics at scale.

    Understanding LLM Cost Structure

    LLM pricing has two components:

    1. Prompt tokens (input): What you send to the model
    2. Completion tokens (output): What the model generates

    Costs vary dramatically by model:

    Model             Prompt Cost ($/1M tokens)   Completion Cost ($/1M tokens)   Use Case
    GPT-4-Turbo       $10                         $30                             Complex reasoning, high accuracy
    GPT-3.5-Turbo     $0.50                       $1.50                           General tasks, high volume
    Claude-3-Opus     $15                         $75                             Long context, nuanced tasks
    Claude-3-Sonnet   $3                          $15                             Balanced performance
    Claude-3-Haiku    $0.25                       $1.25                           Fast, simple tasks
    Gemini-1.5-Pro    $3.50                       $10.50                          Multimodal, long context
    Llama-3-70B       $0.70                       $0.90                           Open source, self-hosted

    Key insight: Prompt tokens usually dominate costs for agent systems because prompts include system instructions, few-shot examples, retrieved context, and conversation history—often 2000+ tokens per request.
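
    To make the split concrete, here is a minimal sketch of a per-request cost breakdown. The `Pricing` shape and the `requestCost` helper are illustrative names, not part of any SDK, and the prices are the per-1M-token figures from the table above.

```typescript
// Hypothetical helper: prices are USD per 1M tokens, as in the table above.
interface Pricing {
  promptPerMillion: number;
  completionPerMillion: number;
}

interface CostBreakdown {
  promptUsd: number;
  completionUsd: number;
  totalUsd: number;
  promptShare: number; // fraction of total cost spent on input tokens
}

function requestCost(
  promptTokens: number,
  completionTokens: number,
  pricing: Pricing
): CostBreakdown {
  const promptUsd = (promptTokens / 1_000_000) * pricing.promptPerMillion;
  const completionUsd = (completionTokens / 1_000_000) * pricing.completionPerMillion;
  const totalUsd = promptUsd + completionUsd;
  return {
    promptUsd,
    completionUsd,
    totalUsd,
    promptShare: totalUsd === 0 ? 0 : promptUsd / totalUsd,
  };
}

// Example: a 2000-token prompt with a 500-token completion at GPT-4-Turbo prices.
const gpt4Turbo: Pricing = { promptPerMillion: 10, completionPerMillion: 30 };
const breakdown = requestCost(2000, 500, gpt4Turbo);
console.log(breakdown.totalUsd.toFixed(4), (breakdown.promptShare * 100).toFixed(0) + '%');
// prints "0.0350 57%"
```

    Even with output tokens priced at three times the input rate, the prompt accounts for more than half of the bill in this example, which is why trimming input tokens tends to pay off first.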

    Strategy 1: Semantic Caching

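    Caching responses for identical or semantically similar requests can eliminate a large share of LLM calls, on the order of 40-60% for workloads with many repeated or paraphrased queries.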

    Implementation

    import { createHash } from 'crypto';
    
    interface CacheEntry {
      prompt: string;
      embedding: number[];
      response: string;
      timestamp: number;
      hitCount: number;
    }
    
    class SemanticCache {
      private cache: Map<string, CacheEntry> = new Map();
      // EmbeddingModel is assumed to expose embed(text): Promise<number[]>
      private embedModel: EmbeddingModel;
      private similarityThreshold = 0.95;
      
      constructor() {
        // OpenAIEmbeddings stands in for a wrapper around a low-cost embedding model
        this.embedModel = new OpenAIEmbeddings();
      }
      
      async get(prompt: string): Promise<string | null> {
        // 1. Try exact match first (fastest)
        const exactKey = this.hashPrompt(prompt);
        const exactMatch = this.cache.get(exactKey);
        if (exactMatch && this.isValid(exactMatch)) {
          exactMatch.hitCount++;
          console.log('Cache hit: exact');
          return exactMatch.response;
        }
        
        // 2. Try semantic similarity (slower but catches paraphrases)
        const embedding = await this.embedModel.embed(prompt);
        
        for (const [key, entry] of this.cache.entries()) {
          if (!this.isValid(entry)) continue;
          
          const similarity = this.cosineSimilarity(embedding, entry.embedding);
          if (similarity >= this.similarityThreshold) {
            entry.hitCount++;
            console.log(`Cache hit: semantic (similarity: ${similarity.toFixed(3)})`);
            return entry.response;
          }
        }
        
        return null;
      }
      
      async set(prompt: string, response: string): Promise<void> {
        const key = this.hashPrompt(prompt);
        const embedding = await this.embedModel.embed(prompt);
        
        this.cache.set(key, {
          prompt,
          embedding,
          response,
          timestamp: Date.now(),
          hitCount: 0
        });
        
        // Evict old entries to prevent unbounded growth
        this.evictOldEntries();
      }
      
      private hashPrompt(prompt: string): string {
        return createHash('sha256').update(prompt).digest('hex');
      }
      
      private cosineSimilarity(a: number[], b: number[]): number {
        const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
        const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
        const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
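        // Guard against zero vectors, which would otherwise produce NaN
        if (magnitudeA === 0 || magnitudeB === 0) return 0;
        return dotProduct / (magnitudeA * magnitudeB);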
      }
      
      private isValid(entry: CacheEntry): boolean {
        const maxAge = 24 * 60 * 60 * 1000; // 24 hours
        return Date.now() - entry.timestamp < maxAge;
      }
      
      private evictOldEntries(): void {
        const maxEntries = 10000;
        if (this.cache.size <= maxEntries) return;
        
        // Evict the oldest entries first (timestamp is set on insert, so this is FIFO, not LRU)
        const entries = Array.from(this.cache.entries())
          .sort((a, b) => a[1].timestamp - b[1].timestamp);
        
        const toDelete = entries.slice(0, entries.length - maxEntries);
        toDelete.forEach(([key]) => this.cache.delete(key));
      }
      
      getStats() {
        const entries = Array.from(this.cache.values());
        const totalHits = entries.reduce((sum, e) => sum + e.hitCount, 0);
        
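        const size = this.cache.size;
        return {
          totalEntries: size,
          totalHits,
          // Guard against division by zero on an empty cache
          avgHitsPerEntry: size === 0 ? 0 : totalHits / size,
          // Approximate: counts each stored entry as one miss
          cacheHitRate: size === 0 ? 0 : totalHits / (totalHits + size)
        };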
      }
    }
    
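    // Usage in agent
    import OpenAI from 'openai';
    
    const openai = new OpenAI(); // Reads OPENAI_API_KEY from the environment
    const cache = new SemanticCache();
    
    async function callLLM(prompt: string): Promise<string> {
      // Check cache first (compare against null so an empty cached string still counts as a hit)
      const cached = await cache.get(prompt);
      if (cached !== null) {
        return cached;
      }
      
      // Cache miss - call LLM
      const response = await openai.chat.completions.create({
        model: 'gpt-4-turbo',
        messages: [{ role: 'user', content: prompt }]
      });
      
      // message.content can be null, so fall back to an empty string
      const content = response.choices[0].message.content ?? '';
      
      // Store in cache
      await cache.set(prompt, content);
      
      return content;
    }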
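
    The lookup loop in `SemanticCache.get` returns the first entry that clears the threshold, which is not necessarily the closest one. A minimal sketch of best-match selection follows. `StoredEntry` and `findBestMatch` are hypothetical names, and the linear scan is an assumption that only suits small caches.

```typescript
// Illustrative type; mirrors the embedding and response fields of CacheEntry.
interface StoredEntry {
  embedding: number[];
  response: string;
}

// Cosine similarity with a guard for zero vectors (returns 0 instead of NaN).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Embeddings must have the same dimension');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Scan all entries and return the most similar one at or above the threshold.
function findBestMatch(
  query: number[],
  entries: StoredEntry[],
  threshold = 0.95
): StoredEntry | null {
  let best: StoredEntry | null = null;
  let bestScore = threshold;
  for (const entry of entries) {
    const score = cosineSimilarity(query, entry.embedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best;
}
```

    Returning the closest match matters most when the threshold is loosened, because several cached prompts may then qualify for the same query.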
    

    Cost savings example:

    • 100K requests/day
    • 40% cache hit rate
    • GPT-4-Turbo: $10/1M prompt tokens, $30/1M completion tokens
    • Avg prompt: 2000 tokens, avg completion: 500 tokens

    Without cache:

    • Daily cost: 100K × (2000 × $10/1M + 500 × $30/1M) = $3,500/day = $105K/month

    With cache:

    • Daily cost: 60K × (2000 × $10/1M + 500 × $30/1M) = $2,100/day = $63K/month
    • Savings: $42K/month (40%)

    Strategy 2: Model Routing

    Not all tasks need GPT-4. Route requests to the cheapest model capable of handling each task.

    enum TaskComplexity {
      SIMPLE = 'simple',           // Classification, extraction, simple Q&A
      MODERATE = 'moderate',       // Summarization, basic reasoning
      COMPLEX = 'complex'          // Multi-step reasoning, creative writing
    }
    
    interface ModelConfig {
      name: string;
      promptCost: number;  // per 1M tokens
      completionCost: number;
      maxTokens: number;
      latencyMs: number;
    }
    
    const models: Record<TaskComplexity, ModelConfig> = {
      [TaskComplexity.SIMPLE]: {
        name: 'gpt-3.5-turbo',
        promptCost: 0.50,
        completionCost: 1.50,
        maxTokens: 4096,
        latencyMs: 500
      },
      [TaskComplexity.MODERATE]: {
        name: 'claude-3-haiku',
        promptCost: 0.25,
        completionCost: 1.25,
        maxTokens: 200000,
        latencyMs: 800
      },
      [TaskComplexity.COMPLEX]: {
        name: 'gpt-4-turbo',
        promptCost: 10,
        completionCost: 30,
        maxTokens: 128000,
        latencyMs: 2000
      }
    };
    
    class ModelRouter {
      classifyTask(task: string): TaskComplexity {
        // Use a small classifier model or heuristics
        const simplePatterns = [
          /^classify/i,
          /^extract/i,
          /^is this/i,
          /^yes or no/i
        ];
        
        const complexPatterns = [
          /step by step/i,
          /explain why/i,
          /compare and contrast/i,
          /analyze/i,
          /create.*detailed/i
        ];
        
        if (simplePatterns.some(p => p.test(task))) {
          return TaskComplexity.SIMPLE;
        }
        
        if (complexPatterns.some(p => p.test(task))) {
          return TaskComplexity.COMPLEX;
        }
        
        return TaskComplexity.MODERATE;
      }
      
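      selectModel(task: string, contextLength: number): ModelConfig {
        const complexity = this.classifyTask(task);
        let model = models[complexity];
        
        // Upgrade model if context too large
        if (contextLength > model.maxTokens) {
          const fallback = models[TaskComplexity.COMPLEX]; // GPT-4-Turbo has 128K context
          // The fallback's window can be smaller than the original model's, so check it too
          if (contextLength > fallback.maxTokens) {
            throw new Error(
              `Context (${contextLength} tokens) exceeds the fallback model limit; compress or truncate it first`
            );
          }
          console.log(`Context (${contextLength} tokens) exceeds model limit, upgrading`);
          model = fallback;
        }
        
        return model;
      }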
      
      async route(task: string, context: string): Promise<LLMResponse> {
        const contextLength = this.estimateTokens(context);
        const model = this.selectModel(task, contextLength);
        
        console.log(`Routing to ${model.name} (complexity: ${this.classifyTask(task)})`);
        
        return await this.callModel(model, task, context);
      }
      
      private estimateTokens(text: string): number {
        // Quick estimate: ~4 characters per token
        return Math.ceil(text.length / 4);
      }
    }
    
    // Usage
    const router = new ModelRouter();
    const response = await router.route(
      "Extract the email address from this text",
      userInput
    );
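
    The `selectModel` method above ties each complexity level to one fixed model. A more general variant picks the cheapest model that meets a minimum capability tier and whose context window fits the request. The sketch below assumes a hypothetical `Candidate` shape and `cheapestCapable` helper; the prices and context limits are copied from the `models` config above and should be checked against current provider documentation.

```typescript
// Hypothetical candidate list; figures are copied from the `models` config above.
interface Candidate {
  name: string;
  tier: number;          // 0 = simple, 1 = moderate, 2 = complex
  promptCost: number;    // USD per 1M prompt tokens
  completionCost: number; // USD per 1M completion tokens
  maxTokens: number;
}

const candidates: Candidate[] = [
  { name: 'gpt-3.5-turbo', tier: 0, promptCost: 0.5, completionCost: 1.5, maxTokens: 4096 },
  { name: 'claude-3-haiku', tier: 1, promptCost: 0.25, completionCost: 1.25, maxTokens: 200000 },
  { name: 'gpt-4-turbo', tier: 2, promptCost: 10, completionCost: 30, maxTokens: 128000 },
];

// Pick the cheapest model that is at least `minTier` and whose context window
// can hold the prompt plus the reserved completion budget.
function cheapestCapable(
  pool: Candidate[],
  minTier: number,
  promptTokens: number,
  completionBudget = 500
): Candidate | null {
  const blended = (c: Candidate) =>
    promptTokens * c.promptCost + completionBudget * c.completionCost;
  const eligible = pool.filter(
    c => c.tier >= minTier && promptTokens + completionBudget <= c.maxTokens
  );
  if (eligible.length === 0) return null; // caller must truncate or compress
  return eligible.reduce((best, c) => (blended(c) < blended(best) ? c : best));
}
```

    With these prices, a simple 2000-token request resolves to claude-3-haiku rather than gpt-3.5-turbo, because Haiku is cheaper on both input and output. Selecting on price and capability together surfaces cases like this that a fixed tier-to-model mapping hides.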
    

    Cost savings example:

    • 100K requests/day
    • 60% simple tasks → GPT-3.5
    • 30% moderate tasks → Claude-3-Haiku
    • 10% complex tasks → GPT-4

    Without routing (all GPT-4):

    • $105K/month

    With routing:

    • Simple: 60K × (2000 × $0.50/1M + 500 × $1.50/1M) = $105/day
    • Moderate: 30K × (2000 × $0.25/1M + 500 × $1.25/1M) = $33.75/day
    • Complex: 10K × (2000 × $10/1M + 500 × $30/1M) = $350/day
    • Total: $488.75/day ≈ $14.7K/month
    • Savings: ≈ $90K/month (86%)
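
    Traffic-mix arithmetic is easy to get wrong by hand, so a small calculator helps when plugging in your own numbers. This is a sketch only: `TrafficSlice` and `dailyCost` are illustrative names, and it assumes the same 2000-token prompt and 500-token completion for every tier.

```typescript
interface TrafficSlice {
  label: string;
  requestsPerDay: number;
  promptCost: number;      // USD per 1M prompt tokens
  completionCost: number;  // USD per 1M completion tokens
}

// Daily spend for a traffic mix, assuming uniform token counts per request.
function dailyCost(
  mix: TrafficSlice[],
  promptTokens = 2000,
  completionTokens = 500
): number {
  return mix.reduce((sum, s) => {
    const perRequest =
      (promptTokens * s.promptCost + completionTokens * s.completionCost) / 1_000_000;
    return sum + s.requestsPerDay * perRequest;
  }, 0);
}

const routedMix: TrafficSlice[] = [
  { label: 'simple: gpt-3.5-turbo', requestsPerDay: 60_000, promptCost: 0.5, completionCost: 1.5 },
  { label: 'moderate: claude-3-haiku', requestsPerDay: 30_000, promptCost: 0.25, completionCost: 1.25 },
  { label: 'complex: gpt-4-turbo', requestsPerDay: 10_000, promptCost: 10, completionCost: 30 },
];

const perDay = dailyCost(routedMix);
console.log(`$${perDay.toFixed(2)}/day, about $${(perDay * 30).toFixed(0)}/month`);
```

    Swapping in measured token counts per tier, rather than one average, gives a more realistic estimate, since complex requests usually carry longer prompts.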

    Strategy 3: Prompt Compression

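    Long prompts are expensive. Compress them while preserving the information the model actually needs; summarization in particular is lossy, so verify output quality after compressing.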

    class PromptCompressor {
      // Technique 1: Remove redundancy
      deduplicateExamples(examples: string[]): string[] {
        const unique = new Set<string>();
        return examples.filter(ex => {
          const normalized = this.normalize(ex);
          if (unique.has(normalized)) return false;
          unique.add(normalized);
          return true;
        });
      }
      
      // Technique 2: Summarize long context
      async compressContext(context: string, maxTokens: number): Promise<string> {
        const currentTokens = this.estimateTokens(context);
        
        if (currentTokens <= maxTokens) {
          return context;
        }
        
        // Use cheap model to summarize
        const compressionRatio = maxTokens / currentTokens;
        const summary = await this.summarize(context, compressionRatio);
        
        return summary;
      }
      
      // Technique 3: Use structured formats
      toStructuredFormat(data: unknown): string {
        // Compact JSON is usually more token-efficient than pretty-printed JSON or verbose prose
        return JSON.stringify(data); // No indentation or extra whitespace
      }
      
      // Technique 4: Remove unnecessary whitespace
      minify(text: string): string {
        return text
          .replace(/\n\s*\n/g, '\n')  // Multiple newlines → single
          .replace(/  +/g, ' ')        // Multiple spaces → single
          .trim();
      }
      
      async compress(prompt: string, targetTokens: number): Promise<string> {
        let compressed = prompt;
        
        // Step 1: Minify
        compressed = this.minify(compressed);
        
        // Step 2: If still too long, summarize
        if (this.estimateTokens(compressed) > targetTokens) {
          compressed = await this.compressContext(compressed, targetTokens);
        }
        
        console.log(`Compressed from ${this.estimateTokens(prompt)} to ${this.estimateTokens(compressed)} tokens`);
        
        return compressed;
      }
    }
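
    The class above calls `normalize`, `estimateTokens`, and `summarize` without defining them. A minimal sketch of the synchronous helpers follows, written as standalone functions. The normalization rule (lowercase, collapse whitespace) and the 4-characters-per-token estimate are assumptions, and a real tokenizer will give different counts.

```typescript
// Rough token estimate: ~4 characters per token (same heuristic as the router).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Assumed normalization for duplicate detection: case- and whitespace-insensitive.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, ' ').trim();
}

// Keep the first occurrence of each example, comparing normalized forms.
function deduplicateExamples(examples: string[]): string[] {
  const seen = new Set<string>();
  return examples.filter(ex => {
    const key = normalize(ex);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Collapse blank lines and runs of spaces.
function minify(text: string): string {
  return text
    .replace(/\n\s*\n/g, '\n')
    .replace(/ {2,}/g, ' ')
    .trim();
}

const raw = 'Q: What is 2+2?\n\n\nA:   4\n';
const small = minify(raw);
console.log(`${estimateTokens(raw)} -> ${estimateTokens(small)} estimated tokens`);
```

    Whitespace minification is safe for prose but can change meaning in whitespace-sensitive content such as Python code or YAML, so apply it selectively.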
    

    Compression example:

    • Original: 3000 tokens
    • Compressed: 1500 tokens
    • Cost reduction: 50% on prompt tokens

    For 100K requests/day:

    • Prompt savings: 100K × 1500 × $10/1M = $1,500/day = $45K/month saved

    Strategy 4: Batching and Parallelization

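    Group concurrent requests into small batches and send them in parallel to raise throughput and reduce per-request overhead. This does not change what you pay per token; the savings come from using your own infrastructure more efficiently.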

    class BatchProcessor {
      private queue: Array<{
        prompt: string;
        resolve: (response: string) => void;
        reject: (error: Error) => void;
      }> = [];
      
      private batchSize = 10;
      private batchTimeoutMs = 100;
      private processingTimer: NodeJS.Timeout | null = null;
      
      async process(prompt: string): Promise<string> {
        return new Promise((resolve, reject) => {
          this.queue.push({ prompt, resolve, reject });
          
          // Start batch timer if not already running
          if (!this.processingTimer) {
            this.processingTimer = setTimeout(
              () => this.processBatch(),
              this.batchTimeoutMs
            );
          }
          
          // Process immediately if batch full
          if (this.queue.length >= this.batchSize) {
            clearTimeout(this.processingTimer);
            this.processingTimer = null;
            this.processBatch();
          }
        });
      }
      
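      private async processBatch() {
        // Clear the timer handle so later process() calls can schedule a new batch;
        // otherwise a stale handle blocks scheduling after the first timer fires
        this.processingTimer = null;
        if (this.queue.length === 0) return;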
        
        const batch = this.queue.splice(0, this.batchSize);
        
        console.log(`Processing batch of ${batch.length} requests`);
        
        try {
          // Send all prompts in parallel
          const responses = await Promise.all(
            batch.map(({ prompt }) => this.callLLM(prompt))
          );
          
          // Resolve all promises
          batch.forEach(({ resolve }, i) => resolve(responses[i]));
          
        } catch (error) {
          // Reject all promises on batch failure
          batch.forEach(({ reject }) => reject(error as Error));
        }
        
        // Process remaining queue
        if (this.queue.length > 0) {
          this.processingTimer = setTimeout(
            () => this.processBatch(),
            this.batchTimeoutMs
          );
        }
      }
      
      private async callLLM(prompt: string): Promise<string> {
        // Actual LLM call
        const response = await openai.chat.completions.create({
          model: 'gpt-3.5-turbo',
          messages: [{ role: 'user', content: prompt }]
        });
        return response.choices[0].message.content ?? ''; // content can be null
      }
    }
    
    // Usage
    const batcher = new BatchProcessor();
    
    // These will be batched together
    const results = await Promise.all([
      batcher.process("Translate 'hello' to Spanish"),
      batcher.process("Translate 'goodbye' to French"),
      batcher.process("Translate 'thank you' to German")
    ]);
    

    Batching benefits:

    • Reduced latency overhead
    • Better throughput (requests/second)
    • Lower infrastructure costs
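
    In the `processBatch` method above, `Promise.all` rejects every caller in a batch as soon as one request fails. One way to isolate failures is `Promise.allSettled`, sketched below with the LLM call abstracted as a `worker` function. `chunk`, `runInBatches`, and `Outcome` are hypothetical names introduced for this example.

```typescript
// Split a list into fixed-size batches.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('size must be >= 1');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

type Outcome<R> = { ok: true; value: R } | { ok: false; error: Error };

// Run `worker` over all items, at most `batchSize` at a time.
// A failed item yields an error outcome instead of failing its whole batch.
async function runInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>
): Promise<Outcome<R>[]> {
  const outcomes: Outcome<R>[] = [];
  for (const batch of chunk(items, batchSize)) {
    const settled = await Promise.allSettled(batch.map(item => worker(item)));
    for (const result of settled) {
      if (result.status === 'fulfilled') {
        outcomes.push({ ok: true, value: result.value });
      } else {
        const error =
          result.reason instanceof Error ? result.reason : new Error(String(result.reason));
        outcomes.push({ ok: false, error });
      }
    }
  }
  return outcomes;
}
```

    Outcomes come back in input order, so a caller can retry only the failed items without re-sending, and re-paying for, the ones that succeeded.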

    Strategy 5: Cost Monitoring and Alerting

    You can't optimize what you don't measure.

    interface CostMetrics {
      requestId: string;
      model: string;
      promptTokens: number;
      completionTokens: number;
      promptCostUsd: number;
      completionCostUsd: number;
      totalCostUsd: number;
      latencyMs: number;
      cached: boolean;
      timestamp: number;
    }
    
    class CostMonitor {
      private metrics: CostMetrics[] = [];
      private alertThresholds = {
        dailyBudget: 1000,  // $1000/day
        requestCost: 0.50    // $0.50/request
      };
      
      logRequest(metrics: CostMetrics) {
        this.metrics.push(metrics);
        
        // Alert on expensive requests
        if (metrics.totalCostUsd > this.alertThresholds.requestCost) {
          this.alertExpensiveRequest(metrics);
        }
        
        // Alert on daily budget
        const todayCost = this.getTodayCost();
        if (todayCost > this.alertThresholds.dailyBudget) {
          this.alertBudgetExceeded(todayCost);
        }
      }
      
      getTodayCost(): number {
        const today = new Date().setHours(0, 0, 0, 0);
        return this.metrics
          .filter(m => m.timestamp >= today)
          .reduce((sum, m) => sum + m.totalCostUsd, 0);
      }
      
      getReport() {
        const total = this.metrics.reduce((sum, m) => sum + m.totalCostUsd, 0);
        const byModel = this.groupBy(this.metrics, 'model');
        const count = this.metrics.length;
        // Avoid NaN when no requests have been logged yet
        const cacheHitRate = count === 0 ? 0 : this.metrics.filter(m => m.cached).length / count;
        
        return {
          totalCost: total,
          totalRequests: count,
          avgCostPerRequest: count === 0 ? 0 : total / count,
          cacheHitRate: (cacheHitRate * 100).toFixed(1) + '%',
          costByModel: Object.entries(byModel).map(([model, metrics]) => ({
            model,
            cost: metrics.reduce((sum, m) => sum + m.totalCostUsd, 0),
            requests: metrics.length
          }))
        };
      }
      
      private groupBy<T>(array: T[], key: keyof T): Record<string, T[]> {
        return array.reduce((groups, item) => {
          const value = String(item[key]);
          groups[value] = groups[value] || [];
          groups[value].push(item);
          return groups;
        }, {} as Record<string, T[]>);
      }
    }
    
    // Dashboard
    const monitor = new CostMonitor();
    
    setInterval(() => {
      const report = monitor.getReport();
      console.log('Cost Report:', report);
    }, 3600000); // Every hour
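
    The `logRequest` method above re-checks the daily budget on every request, so once the budget is crossed it raises an alert on each subsequent call. A sketch of a tracker that signals only once per day follows. `BudgetTracker` and its method names are hypothetical, and days are bucketed in UTC for simplicity.

```typescript
class BudgetTracker {
  private dailyBudgetUsd: number;
  private spentByDay = new Map<string, number>();
  private alertedDays = new Set<string>();

  constructor(dailyBudgetUsd: number) {
    this.dailyBudgetUsd = dailyBudgetUsd;
  }

  // Bucket key in YYYY-MM-DD form (UTC).
  private dayKey(timestampMs: number): string {
    return new Date(timestampMs).toISOString().slice(0, 10);
  }

  // Record a request's cost. Returns true only on the request that first
  // pushes the day's spend over budget, so callers can alert exactly once.
  record(costUsd: number, timestampMs: number = Date.now()): boolean {
    const key = this.dayKey(timestampMs);
    const spent = (this.spentByDay.get(key) ?? 0) + costUsd;
    this.spentByDay.set(key, spent);

    if (spent > this.dailyBudgetUsd && !this.alertedDays.has(key)) {
      this.alertedDays.add(key);
      return true;
    }
    return false;
  }

  // Total spend recorded for the day containing the given timestamp.
  spentOn(timestampMs: number): number {
    return this.spentByDay.get(this.dayKey(timestampMs)) ?? 0;
  }
}
```

    Keeping a running total per day also avoids re-scanning the full metrics array on every request, which `getTodayCost` does as written.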
    

    Combined Strategy: 60%+ Cost Reduction

    Putting it all together:

    class OptimizedAgent {
      private cache: SemanticCache;
      private router: ModelRouter;
      private compressor: PromptCompressor;
      private batcher: BatchProcessor;
      private monitor: CostMonitor;
      
      async process(task: string, context: string): Promise<string> {
        const startTime = Date.now();
        
        // Step 1: Build prompt
        let prompt = this.buildPrompt(task, context);
        
        // Step 2: Compress prompt
        prompt = await this.compressor.compress(prompt, 2000);
        
        // Step 3: Check cache
        const cached = await this.cache.get(prompt);
        if (cached) {
          this.monitor.logRequest({
            requestId: generateId(),
            model: 'cache',
            promptTokens: 0,
            completionTokens: 0,
            promptCostUsd: 0,
            completionCostUsd: 0,
            totalCostUsd: 0,
            latencyMs: Date.now() - startTime,
            cached: true,
            timestamp: Date.now()
          });
          return cached;
        }
        
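        // Step 4: Route to appropriate model (selectModel expects a token count, not a character count)
        const model = this.router.selectModel(task, this.estimateTokens(prompt));
        
        // Step 5: Execute (batched if possible)
        // Note: BatchProcessor as written always calls gpt-3.5-turbo; pass model.name through
        // to it so the routing decision, and the cost logged below, match the model actually used
        const response = await this.batcher.process(prompt);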
        
        // Step 6: Cache result
        await this.cache.set(prompt, response);
        
        // Step 7: Log metrics
        const promptTokens = this.estimateTokens(prompt);
        const completionTokens = this.estimateTokens(response);
        
        this.monitor.logRequest({
          requestId: generateId(),
          model: model.name,
          promptTokens,
          completionTokens,
          promptCostUsd: (promptTokens / 1_000_000) * model.promptCost,
          completionCostUsd: (completionTokens / 1_000_000) * model.completionCost,
          totalCostUsd: ((promptTokens / 1_000_000) * model.promptCost) + 
                        ((completionTokens / 1_000_000) * model.completionCost),
          latencyMs: Date.now() - startTime,
          cached: false,
          timestamp: Date.now()
        });
        
        return response;
      }
    }
    

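    Combined savings (the routing figure here is more conservative than the standalone example above):

    • Caching: -40%
    • Model routing: -50% of remaining
    • Prompt compression: -30% of remaining
    • Total: 1 − (0.60 × 0.50 × 0.70) = ~79% cost reduction

    Starting cost: $105K/month
    Final cost: ~$22K/month
    Savings: ~$83K/month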
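
    Because each reduction applies to whatever cost is left after the previous stage, the percentages multiply rather than add. A short sketch of that calculation follows; the `combinedReduction` name is illustrative.

```typescript
// Each entry is the fraction removed from the cost remaining at that stage.
function combinedReduction(stageReductions: number[]): number {
  const remaining = stageReductions.reduce((left, r) => {
    if (r < 0 || r > 1) throw new Error('reductions must be between 0 and 1');
    return left * (1 - r);
  }, 1);
  return 1 - remaining;
}

// Caching 40%, routing 50% of the remainder, compression 30% of the remainder.
const reduction = combinedReduction([0.4, 0.5, 0.3]);
const startingMonthlyUsd = 105_000;
console.log(
  `${(reduction * 100).toFixed(0)}% reduction, ` +
  `$${(startingMonthlyUsd * (1 - reduction)).toFixed(0)}/month remaining`
);
```

    The same function shows diminishing returns: each later stage acts on a smaller base, so the order in which you adopt the strategies affects how much each one appears to save.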

    Conclusion

    LLM cost optimization isn't optional at scale—it's critical for sustainable AI products. The strategies covered here have proven effective across dozens of production systems:

    1. Semantic caching eliminates redundant LLM calls
    2. Model routing matches task complexity to model capability
    3. Prompt compression reduces token usage with minimal quality loss when applied carefully
    4. Batching maximizes infrastructure efficiency
    5. Cost monitoring enables data-driven optimization

    Start with caching and routing—they deliver the biggest wins with minimal complexity. Add compression and batching as you scale. And always monitor costs in real-time; surprises are expensive.

    The goal isn't to use the cheapest model for everything. It's to use the right model for each task, cache aggressively, and eliminate waste. Done right, you can build sophisticated agent systems that scale to millions of requests without breaking the bank.
