What is model routing in LLM systems?

Model routing directs each request to the model best suited for it based on task type, cost, or latency requirements. It reduces token spend and improves response times without sacrificing quality on complex tasks.

What are the four main model routing strategies?

The four main strategies are capability-based routing (by task type), cost-aware routing (by budget), latency-aware routing (by speed requirements), and hybrid routing (combining all three). Most production systems use hybrid.

How do fallback chains work in LLM routing systems?

A fallback chain tries models in order from best to most reliable. If the primary model times out or fails, the system falls back to the next model in the chain. The last model should always be local — it won’t fail due to network issues.

When does model routing add more complexity than it is worth?

Model routing adds unnecessary complexity when all tasks have similar difficulty, when you are still prototyping, or when cost and latency are not problems yet. Start with one model and add routing only when the bill or slowness becomes a real issue.

What is the difference between local and API-based LLM routing?

Local models have zero per-token cost after hardware is amortized and never rate-limit, but require upfront investment. API models are flexible but cost per token and can hit rate limits. Most cost-aware routers prefer local for high-volume tasks.

Model Routing: Stop Using One Model for Everything

The right model for the right task.


Running a 70B parameter model to summarize a 200-word email is wasteful. Running a 3B model to review production code is reckless. Most systems live somewhere in between — and that’s where model routing comes in.

It matches task complexity to model capability. The tradeoffs are real, but the savings are too.

Figure: LLM model routing strategies diagram

The routing problem

People usually start with one model and stick with it. That works until you notice the cost, or the latency, or both. The alternative is building a router — something that decides which model handles which request.

Four strategies work in practice:

Capability-based — route by what the model can do
Cost-aware — route by what you’re willing to spend
Latency-aware — route by how fast you need it
Hybrid — combine them

Each optimizes something different. Picking one is usually a decision about what hurts most.

Capability-based routing

The simplest approach. Classify the task, send it to the model that handles it.

Task	Model size	Examples
Classification, tagging	1-3B	Qwen2.5-1.5B, Gemma-2-2B
Summarization, extraction	3-8B	Qwen2.5-7B, Llama-3.1-8B
Code generation	7-16B	Qwen2.5-Coder-7B, DeepSeek-Coder-V2-Lite
Complex reasoning	14-32B	Qwen2.5-14B, Qwen2.5-32B
Creative writing, analysis	32B+	Qwen2.5-72B, Llama-3.1-70B, Claude, GPT-4

If the task doesn’t need the bigger model, don’t use it. A 1.5B model handles sentiment classification fine. It just won’t write a coherent essay.

Implementation is straightforward:

ROUTING_RULES = {
    "classify": {"model": "qwen2.5-1.5b", "max_tokens": 100},
    "summarize": {"model": "qwen2.5-7b", "max_tokens": 500},
    "code_review": {"model": "qwen2.5-coder-7b", "max_tokens": 2000},
    "reason": {"model": "qwen2.5-32b", "max_tokens": 4000},
    "creative": {"model": "claude-sonnet-4", "max_tokens": 8000},
}

def route_request(task_type: str) -> dict:
    return ROUTING_RULES.get(task_type, ROUTING_RULES["reason"])

The catch is classification itself. If you get the task type wrong, you route to the wrong model. I’ve seen systems classify code review as “summarization” and lose quality silently.
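One way to limit silent misrouting is to make the classifier admit when it is unsure. The sketch below is a deliberately naive keyword classifier; the patterns and the `classify_task` name are my own illustration, not part of any library, and a real system would more likely use a small classifier model. The point is the escalation rule: anything ambiguous goes to the larger "reason" model instead of being guessed.

```python
import re

# Hypothetical keyword patterns, for illustration only.
TASK_PATTERNS = {
    "code_review": re.compile(r"\b(review|refactor|bug|diff)\b", re.I),
    "summarize": re.compile(r"\b(summari[sz]e|shorten|condense)\b", re.I),
    "classify": re.compile(r"\b(classify|label|tag|sentiment)\b", re.I),
}

def classify_task(prompt: str) -> str:
    """Return a task type, or "reason" when the match is missing or ambiguous."""
    matches = [task for task, pattern in TASK_PATTERNS.items()
               if pattern.search(prompt)]
    # Exactly one matching task type counts as confident. Zero or several
    # matches escalate to the larger default model rather than guessing.
    return matches[0] if len(matches) == 1 else "reason"
```

A prompt like "Summarize the review comments" matches two task types, so it escalates to "reason" rather than quietly landing on the summarization model.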

Cost-aware routing

Local inference shines here. Local models are effectively free per token once the hardware is amortized. An RTX 5080 can pay for itself in about six months at moderate API usage, though the exact break-even depends on your volume and electricity price.

Model	Input ($/M tokens)	Output ($/M tokens)	Local cost/hour
GPT-4o	$2.50	$10.00	—
Claude Sonnet 4	$3.00	$15.00	—
Qwen2.5-72B (API)	$0.50	$2.00	—
Qwen2.5-32B (local)	$0.00	$0.00	~$0.10
Qwen2.5-7B (local)	$0.00	$0.00	~$0.05

If you’re processing thousands of requests per session, roughly $0.05 per hour in electricity beats paying $15 per million output tokens.
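To make that comparison concrete, here is a rough calculation using the output prices and local hourly costs from the table above. The workload (5,000 requests, 300 output tokens each, one hour of local runtime) is an assumption I picked for illustration, and input tokens are ignored to keep the arithmetic short.

```python
# Output prices in $ per million tokens, from the table above.
API_OUTPUT_PRICE = {"gpt-4o": 10.00, "claude-sonnet-4": 15.00, "qwen2.5-72b-api": 2.00}
# Approximate electricity cost in $ per hour, from the same table.
LOCAL_COST_PER_HOUR = {"qwen2.5-32b": 0.10, "qwen2.5-7b": 0.05}

def api_cost(model: str, requests: int, output_tokens_per_request: int) -> float:
    """Dollar cost of the output tokens for a batch of API requests."""
    total_tokens = requests * output_tokens_per_request
    return total_tokens / 1_000_000 * API_OUTPUT_PRICE[model]

def local_cost(model: str, hours: float) -> float:
    """Dollar cost of running a local model for the given number of hours."""
    return hours * LOCAL_COST_PER_HOUR[model]

# Assumed workload: 5,000 requests x 300 output tokens = 1.5M tokens.
print(f"Claude Sonnet 4: ${api_cost('claude-sonnet-4', 5000, 300):.2f}")
print(f"Qwen2.5-7B local: ${local_cost('qwen2.5-7b', 1.0):.2f}")
```

Under those assumptions the API bill is $22.50 against about five cents of electricity, before counting the hardware itself.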

Budget-based routing falls back as you spend:

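class CostAwareRouter:
    def __init__(self, budget_per_session: float = 0.10):
        if budget_per_session <= 0:
            raise ValueError("budget_per_session must be positive")
        self.budget = budget_per_session
        self.spent = 0.0
        # "cost" is dollars per output token; local models count as free.
        self.models = {
            "cheap": {"model": "qwen2.5-7b", "cost": 0.0},
            "medium": {"model": "qwen2.5-32b", "cost": 0.0},
            "expensive": {"model": "claude-sonnet-4", "cost": 0.000015},
        }

    def route(self, task: str) -> str:
        # Pick a tier from the fraction of the session budget already used.
        ratio = self.spent / self.budget
        if ratio < 0.5:
            return self.models["expensive"]["model"]
        elif ratio < 0.8:
            return self.models["medium"]["model"]
        return self.models["cheap"]["model"]

    def record_usage(self, model: str, output_tokens: int) -> None:
        # Call after every response. Without it, spent stays at 0 and the
        # router never falls back. With local cost at 0.0, spend stops
        # growing once the router leaves the paid tier.
        cost = next(m["cost"] for m in self.models.values() if m["model"] == model)
        self.spent += cost * output_tokens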

Quality degrades as you fall back. You start with Claude, move to Qwen-32B, then to Qwen-7B. By the end of a long session, the output is noticeably worse. Whether that matters depends on what you’re building.

Latency-aware routing

Interactive tools need fast first tokens. Batch jobs can wait. The difference is usually a factor of five in model size.

Use case	First token	Complete	Max model size
Real-time chat	< 200ms	< 2s	< 7B
Interactive tools	< 500ms	< 5s	< 14B
Batch processing	< 1s	< 30s	Any
Research/analysis	< 2s	< 60s	Any

When you’re streaming tokens to a user, first token latency is what they feel. A 32B model taking half a second to start feels sluggish compared to a 1.5B model that fires instantly.

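class LatencyAwareRouter:
    def __init__(self):
        # Ordered from most to least capable (dicts keep insertion order).
        # Latencies are in seconds.
        self.model_latencies = {
            "claude-sonnet-4": {"first_token": 0.3, "complete": 5.0},
            "qwen2.5-32b": {"first_token": 0.5, "complete": 10.0},
            "qwen2.5-7b": {"first_token": 0.15, "complete": 2.0},
            "qwen2.5-1.5b": {"first_token": 0.05, "complete": 0.5},
        }

    def route(self, target_latency: float) -> str:
        # Return the most capable model that completes within the target.
        # Sorting by latency ascending would always return the smallest
        # model whenever anything fits, which defeats the purpose.
        for model, latencies in self.model_latencies.items():
            if latencies["complete"] <= target_latency:
                return model
        # Nothing fits: fall back to the fastest model.
        return "qwen2.5-1.5b"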

The latency numbers are rough — they depend on your hardware, quantization, and batch size. Measure on your own setup.
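A minimal way to collect those numbers is to time a streamed response yourself. The sketch below assumes your client exposes some function that takes a prompt and yields tokens; `measure_latency` and `fake_stream` are names I made up for illustration, and `fake_stream` only simulates a model with `time.sleep`.

```python
import time
from typing import Callable, Iterable, Tuple

def measure_latency(stream_fn: Callable[[str], Iterable[str]],
                    prompt: str) -> Tuple[float, float]:
    """Return (first_token_seconds, complete_seconds) for one streamed call."""
    start = time.perf_counter()
    first_token = None
    for _ in stream_fn(prompt):
        if first_token is None:
            # Time until the first token arrives.
            first_token = time.perf_counter() - start
    complete = time.perf_counter() - start
    # An empty stream has no first token; report the total time instead.
    return (first_token if first_token is not None else complete, complete)

def fake_stream(prompt: str):
    """Stand-in for a real streaming client (hypothetical)."""
    time.sleep(0.02)  # simulated time to first token
    for word in prompt.split():
        yield word
        time.sleep(0.005)  # simulated per-token delay

first, total = measure_latency(fake_stream, "route this request please")
print(f"first token: {first:.3f}s, complete: {total:.3f}s")
```

Run it several times per model and prompt length, and use the slower end of the distribution rather than the average when you fill in the router's latency table.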

Fallback strategies

Models fail. APIs rate-limit. Timeouts happen. The pattern that works is a fallback chain, ordered from best to most reliable:

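import logging

log = logging.getLogger(__name__)

class APIError(Exception):
    """Placeholder for your client library's error type."""

class FallbackRouter:
    def __init__(self):
        # Ordered from best to most reliable; timeouts are in seconds.
        self.fallback_chain = [
            {"model": "claude-sonnet-4", "timeout": 30},
            {"model": "qwen2.5-72b", "timeout": 60},
            {"model": "qwen2.5-32b", "timeout": 120},
            {"model": "qwen2.5-7b", "timeout": 300},
        ]

    def call_model(self, model: str, prompt: str, timeout: int) -> str:
        # Stub: replace with a real client call that raises TimeoutError
        # or APIError on failure.
        raise NotImplementedError

    def route_with_fallback(self, prompt: str) -> str:
        for config in self.fallback_chain:
            try:
                return self.call_model(
                    config["model"], prompt,
                    timeout=config["timeout"]
                )
            except (TimeoutError, APIError) as e:
                # Log the failure and move on to the next model.
                log.warning("Model %s failed: %s", config["model"], e)
        raise RuntimeError("All fallback models failed")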

The last model in the chain should be local. It’s slower, but it won’t fail because of a network issue or an API key.

When routing helps

Routing makes sense when your workload is mixed. If you’re doing classification, summarization, and reasoning in the same system, a router saves money and latency.

It doesn’t make sense when everything you do is the same complexity. Just use the model that’s good at that task. The router adds complexity you don’t need.

Early prototyping is another reason to skip it. Get the task working with one model, then add routing when cost or latency actually becomes a problem.

Tradeoffs

Every routing strategy optimizes something and sacrifices something else:

Single model — simplest, most expensive, consistent quality
Capability-based — better cost, higher quality per task, moderate complexity
Cost-aware — cheapest, quality varies, moderate complexity
Latency-aware — fastest, may sacrifice quality, moderate complexity
Hybrid — best of all, most complex to implement

Production systems usually converge on hybrid. Start with capability-based routing, add cost awareness when the bill comes in, add latency awareness when users complain about slowness.
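As a rough idea of what a hybrid router can look like, the sketch below applies the three checks in one pass: latency and budget filter the candidates, then capability picks the smallest model that is still good enough. The tier numbers, the per-call cost, and the `route` function are assumptions for illustration; the task names and first-token latencies reuse the examples from earlier in this post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    name: str
    tier: int             # capability: higher handles harder tasks
    cost_per_call: float  # assumed average $ per request
    first_token_s: float  # assumed first-token latency in seconds

# Illustrative numbers only, ordered from smallest to largest tier.
MODELS = [
    Model("qwen2.5-1.5b", 1, 0.0, 0.05),
    Model("qwen2.5-7b", 2, 0.0, 0.15),
    Model("qwen2.5-32b", 3, 0.0, 0.5),
    Model("claude-sonnet-4", 4, 0.01, 0.3),
]

# Minimum tier each task type needs (an assumption, tune for your workload).
REQUIRED_TIER = {"classify": 1, "summarize": 2, "code_review": 2,
                 "reason": 3, "creative": 4}

def route(task_type: str, max_first_token_s: float, budget_left: float) -> str:
    need = REQUIRED_TIER.get(task_type, 3)  # unknown tasks get a mid-size model
    # Latency and cost act as hard filters.
    fits = [m for m in MODELS
            if m.first_token_s <= max_first_token_s
            and m.cost_per_call <= budget_left]
    if not fits:
        return MODELS[0].name  # nothing fits: use the fastest model
    capable = [m for m in fits if m.tier >= need]
    if capable:
        # Smallest model that is still capable enough.
        return min(capable, key=lambda m: m.tier).name
    # Constraints rule out every capable model: take the strongest that fits.
    return max(fits, key=lambda m: m.tier).name
```

With these numbers, `route("creative", 1.0, 1.0)` picks `claude-sonnet-4`, while the same request with `budget_left=0.0` degrades to `qwen2.5-32b`. Which constraint wins when they conflict is a design choice: here latency and budget are hard limits and quality gives way.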
