LLM costs can spiral out of control fast. A single production agent processing 100K requests/day can rack up $50K+/month in API costs. But with the right optimization strategies, you can cut that by 60% or more without sacrificing quality. This guide covers proven techniques for managing token economics at scale.
Understanding LLM Cost Structure
LLM pricing has two components:
1. Prompt tokens (input): What you send to the model
2. Completion tokens (output): What the model generates
Costs vary dramatically by model:
| Model | Prompt Cost ($/1M tokens) | Completion Cost ($/1M tokens) | Use Case |
|---|---|---|---|
| GPT-4-Turbo | $10 | $30 | Complex reasoning, high accuracy |
| GPT-3.5-Turbo | $0.50 | $1.50 | General tasks, high volume |
| Claude-3-Opus | $15 | $75 | Long context, nuanced tasks |
| Claude-3-Sonnet | $3 | $15 | Balanced performance |
| Claude-3-Haiku | $0.25 | $1.25 | Fast, simple tasks |
| Gemini-1.5-Pro | $3.50 | $10.50 | Multimodal, long context |
| Llama-3-70B | $0.70 | $0.90 | Open source, self-hosted |
Key insight: Prompt tokens usually dominate costs for agent systems because prompts include system instructions, few-shot examples, retrieved context, and conversation history—often 2000+ tokens per request.
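To make the pricing model concrete, here is a minimal sketch of a per-request cost calculation using the per-1M-token prices from the table. The `Pricing` type and `requestCost` function are illustrative names rather than part of any SDK, and the 2000-token prompt with a 500-token completion is an assumed request shape.

```typescript
// Illustrative names; prices are USD per 1M tokens, as in the table above.
interface Pricing {
  promptPerMillion: number;
  completionPerMillion: number;
}

interface RequestCost {
  promptUsd: number;
  completionUsd: number;
  totalUsd: number;
  promptShare: number; // fraction of the total attributable to prompt tokens
}

function requestCost(
  promptTokens: number,
  completionTokens: number,
  pricing: Pricing
): RequestCost {
  const promptUsd = (promptTokens / 1_000_000) * pricing.promptPerMillion;
  const completionUsd = (completionTokens / 1_000_000) * pricing.completionPerMillion;
  const totalUsd = promptUsd + completionUsd;
  return {
    promptUsd,
    completionUsd,
    totalUsd,
    promptShare: totalUsd > 0 ? promptUsd / totalUsd : 0,
  };
}

// GPT-4-Turbo row from the table, assumed 2000-token prompt and 500-token completion
const gpt4Turbo: Pricing = { promptPerMillion: 10, completionPerMillion: 30 };
const perRequest = requestCost(2000, 500, gpt4Turbo);
console.log(perRequest.totalUsd.toFixed(3));                  // "0.035"
console.log((perRequest.promptShare * 100).toFixed(0) + '%'); // "57%"
```

Even though completion tokens cost three times as much per token here, the prompt still accounts for more than half of the request cost because it is four times longer.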
Strategy 1: Semantic Caching
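Caching identical or semantically similar requests can eliminate a large share of LLM calls: 40-60% is achievable when many users send the same or paraphrased prompts, and far less when most prompts are unique.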
Implementation
import { createHash } from 'crypto';
interface CacheEntry {
prompt: string;
embedding: number[];
response: string;
timestamp: number;
hitCount: number;
}
class SemanticCache {
private cache: Map<string, CacheEntry> = new Map();
private embedModel: EmbeddingModel;
private similarityThreshold = 0.95;
constructor() {
this.embedModel = new OpenAIEmbeddings(); // Use cheaper embedding model
}
async get(prompt: string): Promise<string | null> {
// 1. Try exact match first (fastest)
const exactKey = this.hashPrompt(prompt);
const exactMatch = this.cache.get(exactKey);
if (exactMatch && this.isValid(exactMatch)) {
exactMatch.hitCount++;
console.log('Cache hit: exact');
return exactMatch.response;
}
// 2. Try semantic similarity (slower but catches paraphrases)
const embedding = await this.embedModel.embed(prompt);
for (const [key, entry] of this.cache.entries()) {
if (!this.isValid(entry)) continue;
const similarity = this.cosineSimilarity(embedding, entry.embedding);
if (similarity >= this.similarityThreshold) {
entry.hitCount++;
console.log(`Cache hit: semantic (similarity: ${similarity.toFixed(3)})`);
return entry.response;
}
}
return null;
}
async set(prompt: string, response: string): Promise<void> {
const key = this.hashPrompt(prompt);
const embedding = await this.embedModel.embed(prompt);
this.cache.set(key, {
prompt,
embedding,
response,
timestamp: Date.now(),
hitCount: 0
});
// Evict old entries to prevent unbounded growth
this.evictOldEntries();
}
private hashPrompt(prompt: string): string {
return createHash('sha256').update(prompt).digest('hex');
}
private cosineSimilarity(a: number[], b: number[]): number {
const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
return dotProduct / (magnitudeA * magnitudeB);
}
private isValid(entry: CacheEntry): boolean {
const maxAge = 24 * 60 * 60 * 1000; // 24 hours
return Date.now() - entry.timestamp < maxAge;
}
private evictOldEntries(): void {
const maxEntries = 10000;
if (this.cache.size <= maxEntries) return;
// Evict the oldest entries by insertion time (timestamp is set on write and
// never refreshed on read, so this is not true LRU)
const entries = Array.from(this.cache.entries())
.sort((a, b) => a[1].timestamp - b[1].timestamp);
const toDelete = entries.slice(0, entries.length - maxEntries);
toDelete.forEach(([key]) => this.cache.delete(key));
}
getStats() {
const entries = Array.from(this.cache.values());
const totalHits = entries.reduce((sum, e) => sum + e.hitCount, 0);
return {
totalEntries: this.cache.size,
totalHits,
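// Guard against division by zero while the cache is still empty
avgHitsPerEntry: this.cache.size > 0 ? totalHits / this.cache.size : 0,
cacheHitRate: this.cache.size > 0 ? totalHits / (totalHits + this.cache.size) : 0 // Approximate: counts each stored entry as one miss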
};
}
}
// Usage in agent
const cache = new SemanticCache();
async function callLLM(prompt: string): Promise<string> {
// Check cache first; compare against null so an empty cached response still counts as a hit
const cached = await cache.get(prompt);
if (cached !== null) {
return cached;
}
// Cache miss - call LLM
const response = await openai.chat.completions.create({
model: 'gpt-4-turbo',
messages: [{ role: 'user', content: prompt }]
});
// content is typed string | null in the OpenAI SDK, so fall back to an empty string
const content = response.choices[0].message.content ?? '';
// Store in cache
await cache.set(prompt, content);
return content;
}
Cost savings example:
- 100K requests/day
- 40% cache hit rate
- GPT-4-Turbo: $10/1M prompt tokens, $30/1M completion tokens
- Avg prompt: 2000 tokens, avg completion: 500 tokens
Without cache:
- Daily cost: 100K × (2000 × $10/1M + 500 × $30/1M) = $3,500/day = $105K/month
With cache:
- Daily cost: 60K × (2000 × $10/1M + 500 × $30/1M) = $2,100/day = $63K/month
- Savings: $42K/month (40%)
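The arithmetic above can be sketched as a small helper so you can plug in your own hit rate. `monthlyCostWithCache` is an illustrative name, and `lookupOverheadUsd` is a hypothetical per-request cost for the cache lookup itself (for example the embedding call); it defaults to 0 to match the figures above.

```typescript
// Illustrative helper: monthly spend for a given cache hit rate.
function monthlyCostWithCache(
  requestsPerDay: number,
  costPerLlmCallUsd: number,
  cacheHitRate: number,
  lookupOverheadUsd = 0, // hypothetical cost of each cache lookup
  daysPerMonth = 30
): number {
  if (cacheHitRate < 0 || cacheHitRate > 1) {
    throw new RangeError('cacheHitRate must be between 0 and 1');
  }
  // Only cache misses reach the LLM; every request pays the lookup overhead
  const llmCallsPerDay = requestsPerDay * (1 - cacheHitRate);
  const dailyCostUsd =
    llmCallsPerDay * costPerLlmCallUsd + requestsPerDay * lookupOverheadUsd;
  return dailyCostUsd * daysPerMonth;
}

const perCallUsd = 0.035; // 2000 prompt + 500 completion tokens on GPT-4-Turbo
const withoutCache = monthlyCostWithCache(100_000, perCallUsd, 0);
const withCache = monthlyCostWithCache(100_000, perCallUsd, 0.4);
console.log(Math.round(withoutCache)); // 105000
console.log(Math.round(withCache));    // 63000
```

Passing a non-zero `lookupOverheadUsd` shows how much of the saving the cache itself consumes, which matters most when the hit rate is low.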
Strategy 2: Model Routing
Not all tasks need GPT-4. Route requests to the cheapest model capable of handling each task.
enum TaskComplexity {
SIMPLE = 'simple', // Classification, extraction, simple Q&A
MODERATE = 'moderate', // Summarization, basic reasoning
COMPLEX = 'complex' // Multi-step reasoning, creative writing
}
interface ModelConfig {
name: string;
promptCost: number; // per 1M tokens
completionCost: number;
maxTokens: number;
latencyMs: number;
}
const models: Record<TaskComplexity, ModelConfig> = {
[TaskComplexity.SIMPLE]: {
name: 'gpt-3.5-turbo',
promptCost: 0.50,
completionCost: 1.50,
maxTokens: 4096,
latencyMs: 500
},
[TaskComplexity.MODERATE]: {
name: 'claude-3-haiku',
promptCost: 0.25,
completionCost: 1.25,
maxTokens: 200000,
latencyMs: 800
},
[TaskComplexity.COMPLEX]: {
name: 'gpt-4-turbo',
promptCost: 10,
completionCost: 30,
maxTokens: 128000,
latencyMs: 2000
}
};
class ModelRouter {
classifyTask(task: string): TaskComplexity {
// Use a small classifier model or heuristics
const simplePatterns = [
/^classify/i,
/^extract/i,
/^is this/i,
/^yes or no/i
];
const complexPatterns = [
/step by step/i,
/explain why/i,
/compare and contrast/i,
/analyze/i,
/create.*detailed/i
];
if (simplePatterns.some(p => p.test(task))) {
return TaskComplexity.SIMPLE;
}
if (complexPatterns.some(p => p.test(task))) {
return TaskComplexity.COMPLEX;
}
return TaskComplexity.MODERATE;
}
selectModel(task: string, contextLength: number): ModelConfig {
const complexity = this.classifyTask(task);
let model = models[complexity];
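// If the context exceeds the selected model's window, switch to the
// cheapest configured model whose window is large enough
if (contextLength > model.maxTokens) {
const fallback = Object.values(models)
.filter(m => m.maxTokens >= contextLength)
.sort((a, b) => a.promptCost - b.promptCost)[0];
if (!fallback) {
throw new Error(`Context (${contextLength} tokens) exceeds every configured model limit`);
}
console.log(`Context (${contextLength} tokens) exceeds ${model.name} limit, switching to ${fallback.name}`);
model = fallback;
}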
return model;
}
async route(task: string, context: string): Promise<LLMResponse> {
const contextLength = this.estimateTokens(context);
const model = this.selectModel(task, contextLength);
console.log(`Routing to ${model.name} (complexity: ${this.classifyTask(task)})`);
return await this.callModel(model, task, context);
}
private estimateTokens(text: string): number {
// Quick estimate: ~4 characters per token
return Math.ceil(text.length / 4);
}
}
// Usage
const router = new ModelRouter();
const response = await router.route(
"Extract the email address from this text",
userInput
);
Cost savings example:
- 100K requests/day
- 60% simple tasks → GPT-3.5
- 30% moderate tasks → Claude-3-Haiku
- 10% complex tasks → GPT-4
Without routing (all GPT-4):
- $105K/month
With routing:
- Simple: 60K × (2000 × $0.50/1M + 500 × $1.50/1M) = $105/day
- Moderate: 30K × (2000 × $0.25/1M + 500 × $1.25/1M) = $33.75/day
- Complex: 10K × (2000 × $10/1M + 500 × $30/1M) = $350/day
- Total: $488.75/day ≈ $14.7K/month
- Savings: ≈ $90K/month (86%)
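A blended-cost calculation like the one above is easy to get wrong by hand, so here is a minimal sketch that computes it from a traffic mix. The `Tier` type and `blendedDailyCost` function are illustrative names; the shares and prices are the ones used in this example.

```typescript
// Illustrative types; prices are USD per 1M tokens.
interface Tier {
  model: string;
  share: number; // fraction of traffic routed to this tier (shares sum to 1)
  promptPerMillion: number;
  completionPerMillion: number;
}

function blendedDailyCost(
  tiers: Tier[],
  requestsPerDay: number,
  promptTokens: number,
  completionTokens: number
): number {
  const totalShare = tiers.reduce((sum, tier) => sum + tier.share, 0);
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error(`Traffic shares must sum to 1, got ${totalShare}`);
  }
  return tiers.reduce((sum, tier) => {
    const perRequestUsd =
      (promptTokens / 1_000_000) * tier.promptPerMillion +
      (completionTokens / 1_000_000) * tier.completionPerMillion;
    return sum + requestsPerDay * tier.share * perRequestUsd;
  }, 0);
}

const trafficMix: Tier[] = [
  { model: 'gpt-3.5-turbo', share: 0.6, promptPerMillion: 0.5, completionPerMillion: 1.5 },
  { model: 'claude-3-haiku', share: 0.3, promptPerMillion: 0.25, completionPerMillion: 1.25 },
  { model: 'gpt-4-turbo', share: 0.1, promptPerMillion: 10, completionPerMillion: 30 },
];

const routedDailyUsd = blendedDailyCost(trafficMix, 100_000, 2000, 500);
console.log(`Daily: $${routedDailyUsd.toFixed(2)}, monthly: $${(routedDailyUsd * 30).toFixed(0)}`);
```

Changing the `share` values is a quick way to see how sensitive the total is to the fraction of traffic that still needs the most expensive model.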
Strategy 3: Prompt Compression
Long prompts are expensive. Compress them while preserving the information the task depends on.
class PromptCompressor {
// Technique 1: Remove redundancy
deduplicateExamples(examples: string[]): string[] {
const unique = new Set<string>();
return examples.filter(ex => {
const normalized = this.normalize(ex);
if (unique.has(normalized)) return false;
unique.add(normalized);
return true;
});
}
// Technique 2: Summarize long context
async compressContext(context: string, maxTokens: number): Promise<string> {
const currentTokens = this.estimateTokens(context);
if (currentTokens <= maxTokens) {
return context;
}
// Use cheap model to summarize
const compressionRatio = maxTokens / currentTokens;
const summary = await this.summarize(context, compressionRatio);
return summary;
}
// Technique 3: Use structured formats
toStructuredFormat(data: unknown): string {
// Compact JSON is often more token-efficient than verbose prose
return JSON.stringify(data); // No pretty-printing
}
// Technique 4: Remove unnecessary whitespace
minify(text: string): string {
return text
.replace(/\n\s*\n/g, '\n') // Multiple newlines → single
.replace(/ +/g, ' ') // Multiple spaces → single
.trim();
}
async compress(prompt: string, targetTokens: number): Promise<string> {
let compressed = prompt;
// Step 1: Minify
compressed = this.minify(compressed);
// Step 2: If still too long, summarize
if (this.estimateTokens(compressed) > targetTokens) {
compressed = await this.compressContext(compressed, targetTokens);
}
console.log(`Compressed from ${this.estimateTokens(prompt)} to ${this.estimateTokens(compressed)} tokens`);
return compressed;
}
}
Compression example:
- Original: 3000 tokens
- Compressed: 1500 tokens
- Cost reduction: 50% on prompt tokens
For 100K requests/day:
- Prompt savings: 100K × 1500 × $10/1M = $1,500/day = $45K/month saved
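The `PromptCompressor` class above calls `this.normalize` and `this.estimateTokens` without showing their bodies. The following is one possible sketch of those helpers as standalone functions, under the assumption that case and whitespace differences should not make two few-shot examples count as distinct; the sample strings are made up for illustration.

```typescript
// Hypothetical helper bodies; the article only references these names.

// Collapse case and whitespace so trivially different examples compare equal.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, ' ').trim();
}

// Rough heuristic (~4 characters per token), the same one ModelRouter uses.
// Real tokenizers give different counts, especially for code or non-English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the first occurrence of each example, comparing normalized forms.
function deduplicateExamples(examples: string[]): string[] {
  const seen = new Set<string>();
  return examples.filter(example => {
    const key = normalize(example);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const fewShotExamples = [
  'Input: hello  ->  Output: hola',
  'input: hello -> output: hola', // duplicate after normalization
  'Input: goodbye -> Output: adios',
];
const uniqueExamples = deduplicateExamples(fewShotExamples);
console.log(uniqueExamples.length); // 2
console.log(
  estimateTokens(fewShotExamples.join('\n')),
  '->',
  estimateTokens(uniqueExamples.join('\n'))
);
```

For billing-grade numbers, a real tokenizer for the target model is a better choice than the character heuristic, which is only meant for quick routing and budgeting decisions.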
Strategy 4: Batching and Parallelization
Process multiple requests in parallel to maximize throughput and reduce per-request overhead.
class BatchProcessor {
private queue: Array<{
prompt: string;
resolve: (response: string) => void;
reject: (error: Error) => void;
}> = [];
private batchSize = 10;
private batchTimeoutMs = 100;
private processingTimer: NodeJS.Timeout | null = null;
async process(prompt: string): Promise<string> {
return new Promise((resolve, reject) => {
this.queue.push({ prompt, resolve, reject });
// Start batch timer if not already running
if (!this.processingTimer) {
this.processingTimer = setTimeout(
() => this.processBatch(),
this.batchTimeoutMs
);
}
// Process immediately if batch full
if (this.queue.length >= this.batchSize) {
clearTimeout(this.processingTimer);
this.processingTimer = null;
this.processBatch();
}
});
}
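private async processBatch() {
// Clear the timer handle so later process() calls can schedule a new batch;
// without this, a fired timer leaves a stale handle and queued requests never run
if (this.processingTimer) {
clearTimeout(this.processingTimer);
this.processingTimer = null;
}
if (this.queue.length === 0) return;
const batch = this.queue.splice(0, this.batchSize);
console.log(`Processing batch of ${batch.length} requests`);
// Send all prompts in parallel; allSettled keeps one failed call
// from rejecting every other request in the batch
const results = await Promise.allSettled(
batch.map(({ prompt }) => this.callLLM(prompt))
);
// Resolve or reject each caller's promise individually
results.forEach((result, i) => {
if (result.status === 'fulfilled') {
batch[i].resolve(result.value);
} else {
batch[i].reject(result.reason as Error);
}
});
// Process remaining queue unless process() already scheduled a timer
if (this.queue.length > 0 && !this.processingTimer) {
this.processingTimer = setTimeout(
() => this.processBatch(),
this.batchTimeoutMs
);
}
}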
private async callLLM(prompt: string): Promise<string> {
// Actual LLM call
const response = await openai.chat.completions.create({
model: 'gpt-3.5-turbo',
messages: [{ role: 'user', content: prompt }]
});
// content is typed string | null in the OpenAI SDK, so fall back to an empty string
return response.choices[0].message.content ?? '';
}
}
// Usage
const batcher = new BatchProcessor();
// These will be batched together
const results = await Promise.all([
batcher.process("Translate 'hello' to Spanish"),
batcher.process("Translate 'goodbye' to French"),
batcher.process("Translate 'thank you' to German")
]);
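Batching benefits:
- Reduced per-request overhead
- Better throughput (requests/second)
- Lower infrastructure costs
Grouping calls this way does not change per-token API pricing, and a request can wait up to batchTimeoutMs before it is sent, so tune the timeout against your latency budget.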
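For offline workloads where the full list of prompts is known up front, a simpler variant is to split the backlog into fixed-size batches and run one batch at a time. This is a minimal sketch under that assumption; `chunk`, `runInBatches`, and the `Outcome` type are illustrative names, and `worker` stands in for whatever function actually calls the LLM.

```typescript
// Result of one call: either a value or the error it failed with.
type Outcome<R> = { ok: true; value: R } | { ok: false; error: unknown };

// Split a backlog into fixed-size batches.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run one batch at a time; within a batch, calls run in parallel.
async function runInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>
): Promise<Outcome<R>[]> {
  const outcomes: Outcome<R>[] = [];
  for (const batch of chunk(items, batchSize)) {
    const settled = await Promise.all(
      batch.map(async (item): Promise<Outcome<R>> => {
        try {
          return { ok: true, value: await worker(item) };
        } catch (error) {
          // Capture the failure so the rest of the batch is still returned
          return { ok: false, error };
        }
      })
    );
    outcomes.push(...settled);
  }
  return outcomes;
}

console.log(chunk(['a', 'b', 'c', 'd', 'e'], 2)); // [ [ 'a', 'b' ], [ 'c', 'd' ], [ 'e' ] ]
```

Because at most `batchSize` calls are in flight at once, this also acts as a crude concurrency cap, which can help stay under provider rate limits.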
Strategy 5: Cost Monitoring and Alerting
You can't optimize what you don't measure.
interface CostMetrics {
requestId: string;
model: string;
promptTokens: number;
completionTokens: number;
promptCostUsd: number;
completionCostUsd: number;
totalCostUsd: number;
latencyMs: number;
cached: boolean;
timestamp: number;
}
class CostMonitor {
private metrics: CostMetrics[] = [];
private alertThresholds = {
dailyBudget: 1000, // $1000/day
requestCost: 0.50 // $0.50/request
};
logRequest(metrics: CostMetrics) {
this.metrics.push(metrics);
// Alert on expensive requests
if (metrics.totalCostUsd > this.alertThresholds.requestCost) {
this.alertExpensiveRequest(metrics);
}
// Alert on daily budget
const todayCost = this.getTodayCost();
if (todayCost > this.alertThresholds.dailyBudget) {
this.alertBudgetExceeded(todayCost);
}
}
getTodayCost(): number {
const today = new Date().setHours(0, 0, 0, 0);
return this.metrics
.filter(m => m.timestamp >= today)
.reduce((sum, m) => sum + m.totalCostUsd, 0);
}
getReport() {
const count = this.metrics.length;
const total = this.metrics.reduce((sum, m) => sum + m.totalCostUsd, 0);
const byModel = this.groupBy(this.metrics, 'model');
// Guard against division by zero before any request has been logged
const cacheHitRate = count > 0 ? this.metrics.filter(m => m.cached).length / count : 0;
return {
totalCost: total,
totalRequests: count,
avgCostPerRequest: count > 0 ? total / count : 0,
cacheHitRate: (cacheHitRate * 100).toFixed(1) + '%',
costByModel: Object.entries(byModel).map(([model, metrics]) => ({
model,
cost: metrics.reduce((sum, m) => sum + m.totalCostUsd, 0),
requests: metrics.length
}))
};
}
private groupBy<T>(array: T[], key: keyof T): Record<string, T[]> {
return array.reduce((groups, item) => {
const value = String(item[key]);
groups[value] = groups[value] || [];
groups[value].push(item);
return groups;
}, {} as Record<string, T[]>);
}
}
// Dashboard
const monitor = new CostMonitor();
setInterval(() => {
const report = monitor.getReport();
console.log('Cost Report:', report);
}, 3600000); // Every hour
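Monitoring is only as good as the token counts fed into it. OpenAI chat completion responses include a `usage` object with `prompt_tokens` and `completion_tokens`, which is more accurate than a character-based estimate. The sketch below turns that object into cost fields; `costFromUsage` and the `PRICES` table are illustrative, with prices copied from the table earlier in this guide.

```typescript
// Subset of the `usage` object on an OpenAI chat completion response.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

// Illustrative price table (USD per 1M tokens); keep it in sync with your provider's pricing.
const PRICES: Record<string, { prompt: number; completion: number }> = {
  'gpt-4-turbo': { prompt: 10, completion: 30 },
  'gpt-3.5-turbo': { prompt: 0.5, completion: 1.5 },
};

function costFromUsage(model: string, usage: Usage) {
  const price = PRICES[model];
  if (!price) throw new Error(`No price configured for model: ${model}`);
  const promptCostUsd = (usage.prompt_tokens / 1_000_000) * price.prompt;
  const completionCostUsd = (usage.completion_tokens / 1_000_000) * price.completion;
  return {
    model,
    promptTokens: usage.prompt_tokens,
    completionTokens: usage.completion_tokens,
    promptCostUsd,
    completionCostUsd,
    totalCostUsd: promptCostUsd + completionCostUsd,
  };
}

// Hand-written usage object for illustration; in practice it would come from response.usage
const sampleCost = costFromUsage('gpt-4-turbo', { prompt_tokens: 2000, completion_tokens: 500 });
console.log(sampleCost.totalCostUsd.toFixed(3)); // "0.035"
```

The returned fields line up with the cost-related fields of `CostMetrics`, so the result can be spread into the object passed to `logRequest`. Note that `usage` is optional in the SDK's types, so a null check is needed before calling this.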
Combined Strategy: 60%+ Cost Reduction
Putting it all together:
class OptimizedAgent {
private cache: SemanticCache;
private router: ModelRouter;
private compressor: PromptCompressor;
private batcher: BatchProcessor;
private monitor: CostMonitor;
async process(task: string, context: string): Promise<string> {
const startTime = Date.now();
// Step 1: Build prompt
let prompt = this.buildPrompt(task, context);
// Step 2: Compress prompt
prompt = await this.compressor.compress(prompt, 2000);
// Step 3: Check cache
const cached = await this.cache.get(prompt);
if (cached !== null) { // Compare against null so an empty cached response still counts as a hit
this.monitor.logRequest({
requestId: generateId(),
model: 'cache',
promptTokens: 0,
completionTokens: 0,
promptCostUsd: 0,
completionCostUsd: 0,
totalCostUsd: 0,
latencyMs: Date.now() - startTime,
cached: true,
timestamp: Date.now()
});
return cached;
}
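// Step 4: Route to appropriate model (selectModel expects a token count, not a character count)
const model = this.router.selectModel(task, this.estimateTokens(prompt));
// Step 5: Execute (batched if possible)
// Note: BatchProcessor.callLLM as written always calls its hardcoded model;
// pass model.name through to the batcher for the routing decision to take effect
const response = await this.batcher.process(prompt);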
// Step 6: Cache result
await this.cache.set(prompt, response);
// Step 7: Log metrics
const promptTokens = this.estimateTokens(prompt);
const completionTokens = this.estimateTokens(response);
this.monitor.logRequest({
requestId: generateId(),
model: model.name,
promptTokens,
completionTokens,
promptCostUsd: (promptTokens / 1_000_000) * model.promptCost,
completionCostUsd: (completionTokens / 1_000_000) * model.completionCost,
totalCostUsd: ((promptTokens / 1_000_000) * model.promptCost) +
((completionTokens / 1_000_000) * model.completionCost),
latencyMs: Date.now() - startTime,
cached: false,
timestamp: Date.now()
});
return response;
}
}
Combined savings:
- Caching: -40%
- Model routing: -50% of remaining
- Prompt compression: -30% of remaining
- Total: 1 − (0.60 × 0.50 × 0.70) = 79% cost reduction
Starting cost: $105K/month
Final cost: ~$22K/month
Savings: ~$83K/month
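Because each strategy applies to whatever cost is left after the previous one, the reductions multiply rather than add. A minimal sketch of that compounding, with `combinedReduction` as an illustrative name and the three percentages taken from the list above:

```typescript
// Each entry is the fractional reduction a strategy achieves on the cost
// that remains after the previous strategies have been applied.
function combinedReduction(reductions: number[]): number {
  const remaining = reductions.reduce((left, reduction) => {
    if (reduction < 0 || reduction > 1) {
      throw new RangeError('each reduction must be between 0 and 1');
    }
    return left * (1 - reduction);
  }, 1);
  return 1 - remaining;
}

const overallReduction = combinedReduction([0.4, 0.5, 0.3]); // caching, routing, compression
const startingMonthlyUsd = 105_000;
console.log(`${(overallReduction * 100).toFixed(0)}% reduction`);
console.log(`Final: $${(startingMonthlyUsd * (1 - overallReduction)).toFixed(0)}/month`);
```

The order of the entries does not change the result, but each percentage must be measured against the cost remaining at that stage, not against the original bill.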
Conclusion
LLM cost optimization isn't optional at scale—it's critical for sustainable AI products. The strategies covered here have proven effective across dozens of production systems:
- Semantic caching eliminates redundant LLM calls
- Model routing matches task complexity to model capability
- Prompt compression reduces token usage without quality loss
- Batching maximizes infrastructure efficiency
- Cost monitoring enables data-driven optimization
Start with caching and routing—they deliver the biggest wins with minimal complexity. Add compression and batching as you scale. And always monitor costs in real-time; surprises are expensive.
The goal isn't to use the cheapest model for everything. It's to use the right model for each task, cache aggressively, and eliminate waste. Done right, you can build sophisticated agent systems that scale to millions of requests without breaking the bank.