Building Cascade: An Intelligent LLM Router That Cut API Costs by 60%
How I built a production-ready ML system to optimize LLM routing using DistilBERT, semantic caching, and real-time cost tracking
The Problem: LLM Costs Are Out of Control
When building production AI systems, one painful reality hits you fast: LLM API costs scale linearly with usage. At $2.50 per million tokens for GPT-4o, even a modest 10K queries/day (at roughly 1K tokens per query) can rack up $750/month, and most of those queries don't need a frontier model's power.
Simple questions like "What is 2+2?" or "Convert UTC to PST" get routed to the same expensive model as complex tasks like "Write a distributed systems architecture proposal." This is like using a Ferrari for a grocery run.
The insight: 60-70% of production LLM queries are simple enough for cheaper models. The challenge? Building a system that routes intelligently without sacrificing quality.
Enter Cascade - an intelligent LLM request router that automatically routes queries to the most cost-effective model based on complexity.
Architecture: 3-Stage Intelligence Pipeline
Cascade uses a multi-stage pipeline that processes requests in <100ms:
Request → Semantic Cache Check → ML Classification → Smart Routing → Response
Stage 1: Semantic Cache (50ms avg)
Before hitting any LLM, we check if we've seen a semantically similar query before.
How it works:
- Embed incoming query using all-MiniLM-L6-v2 (sentence-transformers)
- Search Qdrant vector database for cosine similarity > 0.92
- If match found, return cached response instantly
Why semantic vs. exact matching?
Exact caching (Redis key-value) only works for identical queries; semantic caching also catches paraphrases and near-duplicates. This boosts the cache hit rate from ~15% (exact) to 42% (semantic).
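The lookup itself is only a few lines. Here's a minimal sketch, assuming a local Qdrant instance and a collection named cascade_cache that stores each past query's embedding alongside its response (the collection name, URL, and payload layout are illustrative, not Cascade's actual config):

```python
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient

# Assumption: a Qdrant collection "cascade_cache" already holds
# (embedding, {"response": ...}) pairs for previously answered queries.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
qdrant = QdrantClient(url="http://localhost:6333")

SIMILARITY_THRESHOLD = 0.92

def cache_lookup(query: str) -> str | None:
    """Return a cached response if a semantically similar query exists."""
    embedding = encoder.encode(query).tolist()
    hits = qdrant.search(
        collection_name="cascade_cache",
        query_vector=embedding,
        limit=1,
        score_threshold=SIMILARITY_THRESHOLD,  # cosine similarity cutoff
    )
    return hits[0].payload["response"] if hits else None
```

On a hit, the request never touches an LLM at all, which is where the 42% hit rate turns directly into savings.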
Stage 2: ML Complexity Classification (<20ms)
This is the centerpiece of the system—a fine-tuned DistilBERT classifier that predicts query complexity.
Model: Fine-tuned distilbert-base-uncased (66M parameters)
- Why DistilBERT? 40% smaller than BERT, 60% faster, retains 97% accuracy
- Training: HuggingFace Trainer, 3 epochs, lr=2e-5, batch_size=16 (see the sketch below)
- Accuracy: 91% on held-out test set
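That training recipe maps onto a few lines of HuggingFace code. A minimal sketch, with a two-example hypothetical dataset standing in for the real ~5K labeled queries; the binary simple/complex label scheme is my assumption, since the post doesn't spell out how complexity labels are encoded:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical stand-in for the real ~5K GPT-4-labeled examples.
# Label scheme (0 = simple, 1 = complex) is an assumption.
train_data = Dataset.from_dict({
    "text": ["What is 2+2?", "Write a distributed systems architecture proposal"],
    "label": [0, 1],
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="cascade-classifier",
    num_train_epochs=3,               # 3 epochs
    learning_rate=2e-5,               # lr=2e-5
    per_device_train_batch_size=16,   # batch_size=16
)

Trainer(model=model, args=training_args, train_dataset=train_data).train()
```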
ONNX Conversion: Converting the model to ONNX gave a 3-4x speedup: 45ms (PyTorch) → 12ms (ONNX) on an e2-medium GCP VM
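One way to serve the exported model is ONNX Runtime. A sketch, assuming the checkpoint was exported with Hugging Face Optimum and treating the "complex" class probability as the complexity score (the paths and that score convention are assumptions, not necessarily Cascade's exact setup):

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Assumes a prior export, e.g.:
#   optimum-cli export onnx --model cascade-classifier cascade-onnx/
tokenizer = AutoTokenizer.from_pretrained("cascade-classifier")
session = ort.InferenceSession("cascade-onnx/model.onnx",
                               providers=["CPUExecutionProvider"])

def complexity_score(query: str) -> float:
    """Probability that the query is 'complex', in [0, 1]."""
    inputs = tokenizer(query, return_tensors="np",
                       truncation=True, max_length=128)
    logits = session.run(None, dict(inputs))[0]
    # Softmax over the two classes, then take the 'complex' class probability
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return float((exp / exp.sum(axis=-1, keepdims=True))[0, 1])
```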
Stage 3: Smart Routing (1ms)
Based on the classifier's complexity score, route to the optimal model (a code sketch follows the table):
| Complexity Score | Model | Cost per 1M tokens |
|---|---|---|
| < 0.35 | Llama 3.2 (Ollama) | $0 (local) |
| 0.35 - 0.70 | GPT-4o-mini | $0.15 |
| > 0.70 | GPT-4o | $2.50 |
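In code, this stage is just a threshold check over the table above, which is why it costs about a millisecond. A minimal sketch; the provider and model identifiers are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Route:
    provider: str
    model: str
    cost_per_1m_tokens: float

def route(score: float) -> Route:
    """Map a complexity score in [0, 1] to the cheapest adequate model."""
    if score < 0.35:
        return Route("ollama", "llama3.2", 0.00)       # local, free
    if score <= 0.70:
        return Route("openai", "gpt-4o-mini", 0.15)
    return Route("openai", "gpt-4o", 2.50)
```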
Results: 60% Cost Reduction
After running Cascade on 1K real-world queries:
| Metric | Baseline (100% GPT-4) | Cascade | Improvement |
|---|---|---|---|
| Total Cost | $7.89 | $2.34 | -70% |
| Cache Hit Rate | 0% | 42.6% | N/A |
| Avg Latency | 1.8s | 1.2s | -33% |
| Quality (BLEU) | 0.85 | 0.84 | -1.2% |
Key Learnings
1. Synthetic Labeling Works
Using GPT-4 to label training data saved weeks of manual annotation: $8 in API calls produced 5K labeled examples. The classifier trained on those synthetic labels achieved 91% accuracy, close enough for production.
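The labeling loop can be as simple as one chat completion per query. A sketch using the OpenAI Python SDK; the prompt wording and the binary simple/complex scheme are my assumptions, not necessarily the prompt Cascade used:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Classify the following user query as 'simple' (answerable by a small model) "
    "or 'complex' (needs deep reasoning). Reply with one word.\n\nQuery: {query}"
)

def label_query(query: str) -> int:
    """Return 1 for complex, 0 for simple, as judged by a GPT-4-class model."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(query=query)}],
        temperature=0,
    )
    return 1 if "complex" in response.choices[0].message.content.lower() else 0
```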
2. ONNX is a Game-Changer for ML in Production
Converting to ONNX gave us:
- 3-4x speedup (45ms → 12ms)
- Lower memory footprint (400MB → 150MB)
- No PyTorch dependency in production (smaller Docker image)
3. Semantic Caching > Exact Caching
Semantic cache hit rate (42%) was 3x higher than exact matching (14%) because it catches query variations. The embedding overhead (15ms) was worth it.
Conclusion
Cascade demonstrates that intelligent routing + semantic caching can cut LLM costs by 60%+ without sacrificing quality. The key insights:
- Most queries don't need GPT-4 - 70% can be handled by cheaper models
- ML classification is fast enough for real-time routing (<20ms)
- Semantic caching is 3x more effective than exact matching
This is production-ready infrastructure that directly impacts the bottom line. At 10K queries/day, Cascade saves ~$450/month, enough to pay for itself and the infrastructure it runs on.