
Building Cascade: An Intelligent LLM Router That Cut API Costs by 60%

How I built a production-ready ML system to optimize LLM routing using DistilBERT, semantic caching, and real-time cost tracking

#LLM · #Machine Learning · #Cost Optimization · #Production · #MLOps

The Problem: LLM Costs Are Out of Control

When building production AI systems, one painful reality hits you fast: LLM API costs scale linearly with usage. At $2.50 per million tokens for GPT-4, even a modest 10K queries/day can rack up about $750/month, and most of those queries don't need GPT-4's power.
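
That figure follows from simple arithmetic. The snippet below is just a sanity check, not part of Cascade; the roughly 1,000 tokens per query is an assumption implied by the numbers above, not a measured value.

```python
# Back-of-the-envelope check of the $750/month baseline,
# assuming ~1,000 tokens per query (an assumption, not a measurement).
queries_per_day = 10_000
tokens_per_query = 1_000
cost_per_token = 2.50 / 1_000_000   # $2.50 per 1M tokens, as quoted above

monthly_cost = queries_per_day * tokens_per_query * cost_per_token * 30
print(f"${monthly_cost:,.0f}/month")   # -> $750/month
```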

Simple questions like "What is 2+2?" or "Convert UTC to PST" get routed to the same expensive model as complex tasks like "Write a distributed systems architecture proposal." This is like using a Ferrari for a grocery run.

The insight: 60-70% of production LLM queries are simple enough for cheaper models. The challenge? Building a system that routes intelligently without sacrificing quality.

Enter Cascade: an intelligent LLM request router that automatically sends each query to the most cost-effective model for its complexity.

Architecture: 3-Stage Intelligence Pipeline

Cascade uses a three-stage pipeline that processes requests in <100ms:

Request → Semantic Cache Check → ML Classification → Smart Routing → Response
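
Before diving into the stages, here is a rough sketch (not the actual Cascade source) of how they compose into a single request handler. The stage functions are passed in as callables because concrete versions are sketched in the sections below; writing fresh responses back into the cache is omitted for brevity.

```python
from typing import Callable, Optional

def handle_request(
    query: str,
    check_cache: Callable[[str], Optional[str]],   # Stage 1: semantic cache lookup
    classify_complexity: Callable[[str], float],   # Stage 2: DistilBERT/ONNX classifier
    route: Callable[[float], str],                 # Stage 3: threshold-based routing
    call_model: Callable[[str, str], str],         # provider client wrapper (hypothetical)
) -> str:
    """Run a single request through the cache -> classify -> route pipeline."""
    cached = check_cache(query)          # ~50ms avg
    if cached is not None:
        return cached

    score = classify_complexity(query)   # <20ms with the ONNX model
    model = route(score)                 # ~1ms threshold lookup
    return call_model(model, query)      # cache write-back omitted here
```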

Stage 1: Semantic Cache (50ms avg)

Before hitting any LLM, we check if we've seen a semantically similar query before.

How it works:

  1. Embed incoming query using all-MiniLM-L6-v2 (sentence-transformers)
  2. Search Qdrant vector database for cosine similarity > 0.92
  3. If match found, return cached response instantly
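
A minimal sketch of that lookup using sentence-transformers and qdrant-client; the collection name, payload field, and error handling are assumptions rather than the actual Cascade code.

```python
from typing import Optional

from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")     # 384-dim embeddings
client = QdrantClient(url="http://localhost:6333")    # assumes a local Qdrant instance

SIMILARITY_THRESHOLD = 0.92

def check_cache(query: str) -> Optional[str]:
    """Return a cached response if a semantically similar query was seen before."""
    vector = encoder.encode(query).tolist()
    hits = client.search(
        collection_name="semantic_cache",   # assumed collection name
        query_vector=vector,
        limit=1,
    )
    if hits and hits[0].score >= SIMILARITY_THRESHOLD:
        return hits[0].payload["response"]  # assumed payload field
    return None
```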

Why semantic vs. exact matching?

Exact caching (Redis key-value) only works for identical queries. Semantic caching also catches paraphrases, e.g. "What is 2+2?" vs. "What's two plus two?". This boosts the cache hit rate from ~15% (exact) to 42% (semantic).

Stage 2: ML Complexity Classification (<20ms)

This is the centerpiece of the system—a fine-tuned DistilBERT classifier that predicts query complexity.

Model: Fine-tuned distilbert-base-uncased (66M parameters)

  • Why DistilBERT? It's 40% smaller than BERT, 60% faster, and retains ~97% of BERT's accuracy
  • Training: HuggingFace Trainer, 3 epochs, lr=2e-5, batch_size=16
  • Accuracy: 91% on held-out test set
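
A sketch of what that training recipe looks like with the HuggingFace Trainer; the two-class label scheme and the tiny inline dataset are placeholders for the ~5K synthetically labeled examples described in the learnings below.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # 0 = simple, 1 = complex (assumed scheme)
)

# Placeholder data; the real training set is the GPT-4-labeled corpus.
raw = Dataset.from_dict({
    "text": ["What is 2+2?", "Write a distributed systems architecture proposal"],
    "label": [0, 1],
})
train_ds = raw.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=128
    ),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="cascade-classifier",
        num_train_epochs=3,
        learning_rate=2e-5,
        per_device_train_batch_size=16,
    ),
    train_dataset=train_ds,
)
trainer.train()
trainer.save_model("cascade-classifier")
```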

ONNX Conversion: Converting the model to ONNX gave a 3-4x speedup, from 45ms (PyTorch) to 12ms (ONNX Runtime) on an e2-medium GCP VM.
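
One plausible serving path for that speedup, assuming the checkpoint was exported with `optimum-cli export onnx --model cascade-classifier cascade-classifier-onnx`; the paths, max length, and label ordering (index 1 = "complex") are assumptions.

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

session = ort.InferenceSession("cascade-classifier-onnx/model.onnx")
tokenizer = AutoTokenizer.from_pretrained("cascade-classifier-onnx")

def classify_complexity(query: str) -> float:
    """Return a 0-1 complexity score (probability of the 'complex' class)."""
    enc = tokenizer(query, return_tensors="np", truncation=True, max_length=128)
    feeds = {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    }
    (logits,) = session.run(None, feeds)
    exp = np.exp(logits[0] - logits[0].max())   # numerically stable softmax
    return float((exp / exp.sum())[1])
```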

Stage 3: Smart Routing (1ms)

Based on the ML classifier's complexity score, route to the optimal model:

| Complexity Score | Model              | Cost per 1M tokens |
|------------------|--------------------|--------------------|
| < 0.35           | Llama 3.2 (Ollama) | $0 (local)         |
| 0.35 - 0.70      | GPT-4o-mini        | $0.15              |
| > 0.70           | GPT-4o             | $2.50              |
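
The table maps directly onto a small routing function; the model identifiers are illustrative, and this plays the role of the `route` callable in the pipeline sketch earlier.

```python
def route(complexity: float) -> str:
    """Map a 0-1 complexity score to the cheapest adequate model."""
    if complexity < 0.35:
        return "llama-3.2"     # served locally via Ollama, effectively $0
    if complexity <= 0.70:
        return "gpt-4o-mini"   # $0.15 per 1M tokens
    return "gpt-4o"            # $2.50 per 1M tokens
```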

Results: 60% Cost Reduction

After running Cascade on 1K real-world queries:

| Metric         | Baseline (100% GPT-4) | Cascade | Improvement |
|----------------|-----------------------|---------|-------------|
| Total Cost     | $7.89                 | $2.34   | -70%        |
| Cache Hit Rate | 0%                    | 42.6%   | N/A         |
| Avg Latency    | 1.8s                  | 1.2s    | -33%        |
| Quality (BLEU) | 0.85                  | 0.84    | -1.2%       |

Key Learnings

1. Synthetic Labeling Works

Using GPT-4 to label training data saved weeks of manual annotation: about $8 bought 5K labeled examples. The classifier trained on these synthetic labels achieved 91% accuracy, close enough for production.
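
A hedged sketch of what that labeling loop could look like with the OpenAI Python client; the prompt wording, two-label scheme, and model string are assumptions, not the actual pipeline.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABEL_PROMPT = (
    "Classify the following user query as 'simple' or 'complex'. "
    "Reply with exactly one word.\n\nQuery: {query}"
)

def label_query(query: str) -> str:
    """Ask the labeling model for a complexity label for one training example."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model string; the article only says "GPT-4"
        messages=[{"role": "user", "content": LABEL_PROMPT.format(query=query)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()
```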

2. ONNX is a Game-Changer for ML in Production

Converting to ONNX gave us:

  • 3-4x speedup (45ms → 12ms)
  • Lower memory footprint (400MB → 150MB)
  • No PyTorch dependency in production (smaller Docker image)

3. Semantic Caching > Exact Caching

The semantic cache hit rate (42%) was three times the exact-match rate (14%) because it catches query variations. The embedding overhead (~15ms) was worth it.

Conclusion

Cascade demonstrates that intelligent routing + semantic caching can cut LLM costs by 60%+ without sacrificing quality. The key insights:

  1. Most queries don't need GPT-4: 70% can be handled by cheaper models
  2. ML classification is fast enough for real-time routing (<20ms)
  3. Semantic caching is 3x more effective than exact matching

This is production-ready infrastructure that directly impacts the bottom line. At 10K queries/day, Cascade saves ~$450/month, enough to pay for itself and the infrastructure it runs on.

Try Cascade

Experience the intelligent routing and cost optimization in action