Building Cascade: An Intelligent LLM Router That Cut API Costs by 60%
How I built a production-ready ML system to optimize LLM routing using DistilBERT, semantic caching, and real-time cost tracking
The Problem: LLM Costs Are Out of Control
When building production AI systems, one painful reality hits you fast: LLM API costs scale linearly with usage. At $2.50 per million tokens for GPT-4o, even a modest 10K queries/day (at roughly 1K tokens per query) can rack up $750/month, and most of those queries don't need a frontier model's power.
Simple questions like "What is 2+2?" or "Convert UTC to PST" get routed to the same expensive model as complex tasks like "Write a distributed systems architecture proposal." This is like using a Ferrari for a grocery run.
The insight: 60-70% of production LLM queries are simple enough for cheaper models. The challenge? Building a system that routes intelligently without sacrificing quality.
Enter Cascade - an intelligent LLM request router that automatically routes queries to the most cost-effective model based on complexity.
Architecture: 3-Stage Intelligence Pipeline
Cascade uses a multi-stage pipeline that processes requests in <100ms:
Request → Semantic Cache Check → ML Classification → Smart Routing → Response
Stage 1: Semantic Cache (50ms avg)
Before hitting any LLM, we check if we've seen a semantically similar query before.
How it works:
- Embed incoming query using all-MiniLM-L6-v2 (sentence-transformers)
- Search Qdrant vector database for cosine similarity > 0.92
- If match found, return cached response instantly
Why semantic vs. exact matching?
Exact caching (Redis key-value) only works for identical queries; semantic caching also catches paraphrases and near-duplicates. This boosts the cache hit rate from ~15% (exact) to 42% (semantic).
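The lookup itself is only a few lines. Here's a minimal sketch, assuming a local Qdrant instance and a collection named cascade_cache that stores each past query's embedding alongside its response (the collection name, URL, and payload layout are illustrative, not Cascade's actual config):

```python
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient

# Assumption: a Qdrant collection "cascade_cache" already holds
# (embedding, {"response": ...}) pairs for previously answered queries.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
qdrant = QdrantClient(url="http://localhost:6333")

SIMILARITY_THRESHOLD = 0.92

def cache_lookup(query: str) -> str | None:
    """Return a cached response if a semantically similar query exists."""
    embedding = encoder.encode(query).tolist()
    hits = qdrant.search(
        collection_name="cascade_cache",
        query_vector=embedding,
        limit=1,
        score_threshold=SIMILARITY_THRESHOLD,  # cosine similarity cutoff
    )
    return hits[0].payload["response"] if hits else None
```

On a hit, the request never touches an LLM at all, which is where the 42% hit rate turns directly into savings.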
Stage 2: ML Complexity Classification (<20ms)
This is the centerpiece of the system—a fine-tuned DistilBERT classifier that predicts query complexity.
Model: Fine-tuned distilbert-base-uncased (66M parameters)
- Why DistilBERT? 40% smaller than BERT, 60% faster, retains 97% accuracy
- Training: HuggingFace Trainer, 3 epochs, lr=2e-5, batch_size=16 (see the sketch below)
- Accuracy: 91% on held-out test set
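That training recipe maps onto a few lines of HuggingFace code. A minimal sketch, with a two-example hypothetical dataset standing in for the real ~5K labeled queries; the binary simple/complex label scheme is my assumption, since the post doesn't spell out how complexity labels are encoded:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical stand-in for the real ~5K GPT-4-labeled examples.
# Label scheme (0 = simple, 1 = complex) is an assumption.
train_data = Dataset.from_dict({
    "text": ["What is 2+2?", "Write a distributed systems architecture proposal"],
    "label": [0, 1],
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="cascade-classifier",
    num_train_epochs=3,               # 3 epochs
    learning_rate=2e-5,               # lr=2e-5
    per_device_train_batch_size=16,   # batch_size=16
)

Trainer(model=model, args=training_args, train_dataset=train_data).train()
```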
ONNX Conversion: Converting the model to ONNX gave a 3-4x speedup: 45ms (PyTorch) → 12ms (ONNX) on an e2-medium GCP VM
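One way to serve the exported model is ONNX Runtime. A sketch, assuming the checkpoint was exported with Hugging Face Optimum and treating the "complex" class probability as the complexity score (the paths and that score convention are assumptions, not necessarily Cascade's exact setup):

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Assumes a prior export, e.g.:
#   optimum-cli export onnx --model cascade-classifier cascade-onnx/
tokenizer = AutoTokenizer.from_pretrained("cascade-classifier")
session = ort.InferenceSession("cascade-onnx/model.onnx",
                               providers=["CPUExecutionProvider"])

def complexity_score(query: str) -> float:
    """Probability that the query is 'complex', in [0, 1]."""
    inputs = tokenizer(query, return_tensors="np",
                       truncation=True, max_length=128)
    logits = session.run(None, dict(inputs))[0]
    # Softmax over the two classes, then take the 'complex' class probability
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return float((exp / exp.sum(axis=-1, keepdims=True))[0, 1])
```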
Stage 3: Smart Routing (1ms)
Based on the classifier's complexity score, route to the optimal model (a code sketch follows the table):
| Complexity Score | Model | Cost per 1M tokens |
|---|---|---|
| < 0.35 | Llama 3.2 (Ollama) | $0 (local) |
| 0.35 - 0.70 | GPT-4o-mini | $0.15 |
| > 0.70 | GPT-4o | $2.50 |
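In code, this stage is just a threshold check over the table above, which is why it costs about a millisecond. A minimal sketch; the provider and model identifiers are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Route:
    provider: str
    model: str
    cost_per_1m_tokens: float

def route(score: float) -> Route:
    """Map a complexity score in [0, 1] to the cheapest adequate model."""
    if score < 0.35:
        return Route("ollama", "llama3.2", 0.00)       # local, free
    if score <= 0.70:
        return Route("openai", "gpt-4o-mini", 0.15)
    return Route("openai", "gpt-4o", 2.50)
```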
Results: 60% Cost Reduction
After running Cascade on 1K real-world queries:
| Metric | Baseline (100% GPT-4) | Cascade | Improvement |
|---|---|---|---|
| Total Cost | $7.89 | $2.34 | -70% |
| Cache Hit Rate | 0% | 42.6% | N/A |
| Avg Latency | 1.8s | 1.2s | -33% |
| Quality (BLEU) | 0.85 | 0.84 | -1.2% |
Key Learnings
1. Synthetic Labeling Works
Using GPT-4 to label training data saved weeks of manual annotation: $8 in API calls produced 5K labeled examples. The classifier trained on those synthetic labels achieved 91% accuracy, close enough for production.
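The labeling loop can be as simple as one chat completion per query. A sketch using the OpenAI Python SDK; the prompt wording and the binary simple/complex scheme are my assumptions, not necessarily the prompt Cascade used:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Classify the following user query as 'simple' (answerable by a small model) "
    "or 'complex' (needs deep reasoning). Reply with one word.\n\nQuery: {query}"
)

def label_query(query: str) -> int:
    """Return 1 for complex, 0 for simple, as judged by a GPT-4-class model."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(query=query)}],
        temperature=0,
    )
    return 1 if "complex" in response.choices[0].message.content.lower() else 0
```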
2. ONNX is a Game-Changer for ML in Production
Converting to ONNX gave us:
- 3-4x speedup (45ms → 12ms)
- Lower memory footprint (400MB → 150MB)
- No PyTorch dependency in production (smaller Docker image)
3. Semantic Caching > Exact Caching
Semantic cache hit rate (42%) was 3x higher than exact matching (14%) because it catches query variations. The embedding overhead (15ms) was worth it.
Conclusion
Cascade demonstrates that intelligent routing + semantic caching can cut LLM costs by 60%+ without sacrificing quality. The key insights:
- Most queries don't need GPT-4 - 70% can be handled by cheaper models
- ML classification is fast enough for real-time routing (<20ms)
- Semantic caching is 3x more effective than exact matching
This is production-ready infrastructure that directly impacts the bottom line. At 10K queries/day, Cascade saves ~$450/month, enough to pay for itself and the infrastructure it runs on.