Building a multi-model architecture: route requests to the right LLM

Using one model for everything is the simplest architecture. It’s also the most wasteful. A 685B-parameter reasoning model answering “what’s the weather?” is like hiring a PhD to sort mail.

This guide covers how to use a small, fast model to classify incoming requests and route them to the right specialist. The result: lower latency, lower cost, and often better quality — because each model handles what it’s actually good at.


The problem with single-model architectures


Most applications start with one model:

User request --> Large Model --> Response

This works, but every request — simple or complex — pays the same latency and cost penalty. When 60% of your traffic is simple classification, FAQ, or extraction, you’re burning expensive compute on tasks a small model handles equally well.

Llama 3.1 8B: ~200 t/s
DeepSeek V3.2: ~60 t/s
DeepSeek R1: ~30 t/s

The gap between Llama 8B and R1 is nearly 7x in throughput. Routing simple requests to the small model saves that difference on every request.
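That gap is easy to quantify. A quick back-of-the-envelope sketch of generation time for a 200-token response, using the approximate throughput figures from the table above (the `SPEEDS` dict and its keys are illustrative, not real model identifiers):

```python
# Approximate output speeds from the table above (tokens/second).
SPEEDS = {
    "llama-3.1-8b": 200,
    "deepseek-v3.2": 60,
    "deepseek-r1": 30,
}

def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a response, ignoring TTFT and network overhead."""
    return output_tokens / tokens_per_second

# A 200-token answer: 1.0s on the 8B model vs ~6.7s on R1.
for name, speed in SPEEDS.items():
    print(f"{name}: {generation_seconds(200, speed):.1f}s")
```

At these speeds, a typical 200-token answer takes about one second on the small model and nearly seven on the reasoning model, before counting time to first token.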


User request --> Router (Llama 8B) --> classify intent
                              |
      +---------------+-------+-------+---------------+
      |               |               |               |
   simple          general        reasoning         code
      |               |               |               |
Llama 3.1 8B    DeepSeek V3.2    DeepSeek R1    Qwen3 Coder
      |               |               |               |
      +---------------+-------+-------+---------------+
                              |
                           Response

Two stages:

  1. Classify — The router model reads the user’s message and outputs a category. This takes ~0.2 seconds with Llama 8B.
  2. Route — Based on the category, forward the request to the appropriate specialist model.

The router adds minimal overhead (~200ms) but saves significant compute by keeping simple requests away from expensive models.


Stage 1: Classify

Llama 3.1 8B is the router. At ~200 t/s output speed, ~0.2s TTFT, and $0.02/M input tokens, the classification step costs almost nothing and completes before the user notices.
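"Costs almost nothing" is not an exaggeration. A sketch of the per-call classification cost at the quoted input price (the 500-token request size is an illustrative assumption, and output tokens are ignored since the router emits only one word):

```python
# Price quoted in the text: $0.02 per 1M input tokens for the router model.
PRICE_PER_M_INPUT = 0.02

def classification_cost(input_tokens: int) -> float:
    """Dollar cost of one classification call (input side only)."""
    return input_tokens / 1_000_000 * PRICE_PER_M_INPUT

# A 500-token request costs a thousandth of a cent to classify.
print(f"${classification_cost(500):.5f}")  # $0.00001
```

Even at a million requests per month, the classification step costs about ten dollars.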

The classification prompt is simple — you want a single-word category, not a conversation:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cheapestinference.com/v1",
    api_key="your-api-key",
)

def classify_request(user_message: str) -> str:
    """Classify a user message into a routing category."""
    response = client.chat.completions.create(
        model="meta-llama/llama-3.1-8b-instruct",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the user's message into exactly one category. "
                    "Respond with only the category name, nothing else.\n\n"
                    "Categories:\n"
                    "- simple: greetings, FAQ, simple factual questions\n"
                    "- general: complex questions, analysis, writing, summarization\n"
                    "- reasoning: math, logic, multi-step problems, science\n"
                    "- code: code generation, debugging, refactoring, technical implementation\n"
                    "- agent: tasks requiring tool use, web search, or multi-step execution"
                ),
            },
            {"role": "user", "content": user_message},
        ],
        max_tokens=10,
        temperature=0,
    )
    category = response.choices[0].message.content.strip().lower()
    # Default to general if classification is unclear
    valid = {"simple", "general", "reasoning", "code", "agent"}
    return category if category in valid else "general"

The key details: max_tokens=10 because we only need one word. temperature=0 for deterministic routing. The system prompt is explicit about format — no preamble, just the category.


Stage 2: Route

Each category maps to a model optimized for that task:

# Model routing table
ROUTE_TABLE = {
    "simple": "meta-llama/llama-3.1-8b-instruct",
    "general": "deepseek/deepseek-chat-v3-0324",
    "reasoning": "deepseek/deepseek-reasoner",
    "code": "qwen/qwen3-coder",
    "agent": "moonshotai/kimi-k2-5",
}

def route_request(user_message: str, conversation_history: list) -> str:
    """Classify and route a request to the appropriate model."""
    category = classify_request(user_message)
    model = ROUTE_TABLE[category]
    response = client.chat.completions.create(
        model=model,
        messages=conversation_history + [
            {"role": "user", "content": user_message}
        ],
        stream=True,
    )
    # Stream the response back
    full_response = ""
    for chunk in response:
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            full_response += content
            print(content, end="", flush=True)
    return full_response

Notice that simple requests route back to Llama 8B — the same model that did the classification. For simple queries, the router overhead is effectively zero because the specialist is the same model and can reuse the warm connection.


Production refinements

The basic router works for most traffic, but production systems need a few refinements:

def route_request_production(
    user_message: str,
    conversation_history: list,
    force_model: str | None = None,
) -> tuple[str, str]:
    """Production router with overrides and fallback."""
    # Allow explicit model override (for power users or testing)
    if force_model:
        model = force_model
        category = "override"
    else:
        category = classify_request(user_message)
        model = ROUTE_TABLE[category]
    try:
        response = client.chat.completions.create(
            model=model,
            messages=conversation_history + [
                {"role": "user", "content": user_message}
            ],
        )
        return response.choices[0].message.content, category
    except Exception:
        # Fall back to V3.2 if the specialist is unavailable
        fallback = "deepseek/deepseek-chat-v3-0324"
        response = client.chat.completions.create(
            model=fallback,
            messages=conversation_history + [
                {"role": "user", "content": user_message}
            ],
        )
        return response.choices[0].message.content, f"{category}->fallback"

Three patterns worth noting:

  1. Force model — Let callers bypass routing when they know what they need.
  2. Fallback — If a specialist model is down, fall back to V3.2. It handles everything reasonably well.
  3. Return the category — Log which route each request takes. You’ll need this data to tune the system.
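Point 3 needs very little machinery to start with. A minimal sketch of per-route bookkeeping, assuming you record the returned category and a wall-clock latency for each request (the `RouteStats` helper is hypothetical, not part of any library):

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class RouteStats:
    """Hypothetical per-route accumulator for volume and latency."""
    counts: dict = field(default_factory=lambda: defaultdict(int))
    latency_sum: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, category: str, latency_s: float) -> None:
        self.counts[category] += 1
        self.latency_sum[category] += latency_s

    def mean_latency(self, category: str) -> float:
        return self.latency_sum[category] / self.counts[category]

stats = RouteStats()
stats.record("simple", 1.1)
stats.record("simple", 1.3)
stats.record("reasoning", 9.2)
print(round(stats.mean_latency("simple"), 2))  # 1.2
```

In production you would push these numbers to your metrics system instead, but even an in-memory counter reveals the request distribution you need for tuning.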

Latency comparison

Consider a workload of 1,000 requests with this distribution: 600 simple, 300 general, 70 reasoning, 30 code. Average 500 input tokens, 200 output tokens per request.

Single-model approach (everything on V3.2)

Avg latency: ~4.5s (all 1,000 requests on V3.2)

Every request waits for V3.2’s ~1.2s TTFT plus generation time at ~60 t/s. Simple questions get the same treatment as complex analysis.

Multi-model approach (routed)

Simple (600): ~1.2s (Llama 8B)
General (300): ~4.7s (V3.2)
Reasoning (70): ~9.0s (R1)
Code (30): ~3.5s (Qwen3 Coder)

The weighted average latency drops to roughly 2.9s, about a 36% reduction. The 600 simple requests finish in ~1.2s instead of ~4.5s. That’s a 3.7x improvement for the majority of your traffic.
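Recomputing the weighted mean directly from the distribution and per-category latencies above:

```python
# Requests per category and approximate end-to-end latency (seconds),
# taken from the workload described in the text.
WORKLOAD = {
    "simple": (600, 1.2),
    "general": (300, 4.7),
    "reasoning": (70, 9.0),
    "code": (30, 3.5),
}

total_requests = sum(n for n, _ in WORKLOAD.values())
weighted_avg = sum(n * latency for n, latency in WORKLOAD.values()) / total_requests
print(f"{weighted_avg:.1f}s")  # roughly 2.9s, versus ~4.5s single-model
```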

The 70 reasoning requests are slower individually (~9s vs ~4.5s) because R1 generates chain-of-thought tokens. But the quality on those specific requests is significantly better — R1 scores 50.2% on HLE versus V3.2’s 39.3%.

You get faster averages and better quality on the hard tail.


Case study: customer support chatbot

A customer support chatbot receives three types of requests:

  1. FAQ (60%) — “What are your business hours?” / “How do I reset my password?”
  2. Complex support (30%) — “I was charged twice for order #12345, can you investigate?”
  3. Technical issues (10%) — “Your API returns 500 when I send multipart form data with UTF-8 filenames”

All requests go to DeepSeek V3.2. FAQs get correct answers but with unnecessary latency. Technical issues get decent answers but miss edge cases that a code-specialized model would catch.

SUPPORT_ROUTES = {
    "simple": "meta-llama/llama-3.1-8b-instruct",   # FAQ, greetings
    "general": "deepseek/deepseek-chat-v3-0324",    # Complex support
    "reasoning": "deepseek/deepseek-chat-v3-0324",  # Investigations
    "code": "qwen/qwen3-coder",                     # Technical issues
    "agent": "moonshotai/kimi-k2-5",                # Multi-step resolution
}

FAQs resolve in ~1 second via Llama 8B. Complex support issues get V3.2’s full analytical capability. Technical problems route to Qwen3 Coder, which understands the code context better. If a support issue requires looking up order data via API, it routes to Kimi K2.5 for tool-assisted resolution.

The classification step adds ~200ms. For the 60% of requests that drop from ~4.5s to ~1.2s, that’s an invisible cost.


When to skip routing

Routing adds complexity. Skip it when:

  • All your requests are the same type. If you’re building a code editor, just use Qwen3 Coder. No routing needed.
  • You have fewer than 100 requests/day. The cost savings don’t justify the engineering overhead at low volume.
  • Latency doesn’t matter. For batch processing or async workloads, a single capable model is simpler.
  • Your classification accuracy is low. If the router misclassifies frequently, you get worse results than a single good model. Test the classifier on real traffic before deploying.

The sweet spot is high-volume applications with diverse request types — chatbots, API gateways, developer tools, and customer-facing products where response time directly affects user experience.


Getting started

  1. Log your traffic. Before building a router, understand your request distribution. What percentage is simple? Complex? Code?
  2. Start with two tiers. Llama 8B for simple, V3.2 for everything else. Add specialists only when you have data showing they help.
  3. Measure classification accuracy. Sample 100 requests, manually label them, compare against the router’s output. Target >90% accuracy.
  4. Add fallback. Every specialist route should fall back to V3.2 if the specialist is unavailable.
  5. Monitor per-route metrics. Track latency, cost, and quality per category. This tells you where to optimize next.
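Step 3 reduces to a small script once you have labeled samples. A sketch assuming human labels are stored alongside the router's predictions (the `classification_accuracy` helper and the sample data are hypothetical):

```python
def classification_accuracy(labeled: list[tuple[str, str]]) -> float:
    """Fraction of (human_label, router_prediction) pairs that agree."""
    if not labeled:
        return 0.0
    matches = sum(1 for human, predicted in labeled if human == predicted)
    return matches / len(labeled)

# Hypothetical sample of manually labeled requests vs. router output.
sample = [
    ("simple", "simple"),
    ("code", "code"),
    ("reasoning", "general"),  # a misclassification
    ("simple", "simple"),
]
accuracy = classification_accuracy(sample)
print(f"{accuracy:.0%}")  # 75%
```

When accuracy falls below the ~90% target, inspect the disagreements by category: confusions between adjacent tiers (simple vs. general) are usually cheap mistakes, while reasoning-to-simple confusions are the ones that hurt quality.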

All models in this guide are available through a single OpenAI-compatible API with no configuration changes between models. If you’re building a platform that needs LLM access for your users, see how per-key plans work.

Sources: Artificial Analysis Leaderboard · DeepSeek V3.2 · HLE Leaderboard · Kimi K2.5 Benchmarks