Building a multi-model architecture: route requests to the right LLM
Using one model for everything is the simplest architecture. It’s also the most wasteful. A 685B-parameter reasoning model answering “what’s the weather?” is like hiring a PhD to sort mail.
This guide covers how to use a small, fast model to classify incoming requests and route them to the right specialist. The result: lower latency, lower cost, and often better quality — because each model handles what it’s actually good at.
The problem with single-model architectures
Most applications start with one model:
```
User request --> Large Model --> Response
```

This works, but every request — simple or complex — pays the same latency and cost penalty. When 60% of your traffic is simple classification, FAQ, or extraction, you’re burning expensive compute on tasks a small model handles equally well.
The gap between Llama 8B and R1 is nearly 7x in throughput. Routing simple requests to the small model saves that difference on every request.
The multi-model architecture
```
User request --> Router (Llama 8B) --> classify intent
                          |
      +-----------+-------+---+-----------+
      |           |           |           |
   simple      general    reasoning     code
      |           |           |           |
 Llama 3.1    DeepSeek    DeepSeek     Qwen3
    8B          V3.2         R1        Coder
      |           |           |           |
      +-----------+-----+-----+-----------+
                        |
                    Response
```

Two stages:
- Classify — The router model reads the user’s message and outputs a category. This takes ~0.2 seconds with Llama 8B.
- Route — Based on the category, forward the request to the appropriate specialist model.
The router adds minimal overhead (~200ms) but saves significant compute by keeping simple requests away from expensive models.
Step 1: Classify with Llama 3.1 8B
Llama 3.1 8B is the router. At ~200 t/s output speed, ~0.2s TTFT, and $0.02/M input tokens, the classification step costs almost nothing and completes before the user notices.
The classification prompt is simple — you want a single-word category, not a conversation:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cheapestinference.com/v1",
    api_key="your-api-key",
)

def classify_request(user_message: str) -> str:
    """Classify a user message into a routing category."""
    response = client.chat.completions.create(
        model="meta-llama/llama-3.1-8b-instruct",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the user's message into exactly one category. "
                    "Respond with only the category name, nothing else.\n\n"
                    "Categories:\n"
                    "- simple: greetings, FAQ, simple factual questions\n"
                    "- general: complex questions, analysis, writing, summarization\n"
                    "- reasoning: math, logic, multi-step problems, science\n"
                    "- code: code generation, debugging, refactoring, technical implementation\n"
                    "- agent: tasks requiring tool use, web search, or multi-step execution"
                ),
            },
            {"role": "user", "content": user_message},
        ],
        max_tokens=10,
        temperature=0,
    )
    category = response.choices[0].message.content.strip().lower()
    # Default to general if classification is unclear
    valid = {"simple", "general", "reasoning", "code", "agent"}
    return category if category in valid else "general"
```

The key details: `max_tokens=10` because we only need one word. `temperature=0` for deterministic routing. The system prompt is explicit about format — no preamble, just the category.
Step 2: Route to the specialist
Each category maps to a model optimized for that task:
```python
# Model routing table
ROUTE_TABLE = {
    "simple": "meta-llama/llama-3.1-8b-instruct",
    "general": "deepseek/deepseek-chat-v3-0324",
    "reasoning": "deepseek/deepseek-reasoner",
    "code": "qwen/qwen3-coder",
    "agent": "moonshotai/kimi-k2-5",
}

def route_request(user_message: str, conversation_history: list) -> str:
    """Classify and route a request to the appropriate model."""
    category = classify_request(user_message)
    model = ROUTE_TABLE[category]

    response = client.chat.completions.create(
        model=model,
        messages=conversation_history + [
            {"role": "user", "content": user_message}
        ],
        stream=True,
    )

    # Stream the response back
    full_response = ""
    for chunk in response:
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            full_response += content
            print(content, end="", flush=True)

    return full_response
```

Notice that simple requests route back to Llama 8B — the same model that did the classification. For simple queries, the router overhead is effectively zero because the specialist is the same model and can reuse the warm connection.
Step 3: Handle edge cases
The basic router works for most traffic, but production systems need a few refinements:
```python
def route_request_production(
    user_message: str,
    conversation_history: list,
    force_model: str = None,
) -> tuple[str, str]:
    """Production router with overrides and fallback."""

    # Allow explicit model override (for power users or testing)
    if force_model:
        model = force_model
        category = "override"
    else:
        category = classify_request(user_message)
        model = ROUTE_TABLE[category]

    try:
        response = client.chat.completions.create(
            model=model,
            messages=conversation_history + [
                {"role": "user", "content": user_message}
            ],
        )
        return response.choices[0].message.content, category

    except Exception:
        # Fallback to V3.2 if the specialist is unavailable
        fallback = "deepseek/deepseek-chat-v3-0324"
        response = client.chat.completions.create(
            model=fallback,
            messages=conversation_history + [
                {"role": "user", "content": user_message}
            ],
        )
        return response.choices[0].message.content, f"{category}->fallback"
```

Three patterns worth noting:
- Force model — Let callers bypass routing when they know what they need.
- Fallback — If a specialist model is down, fall back to V3.2. It handles everything reasonably well.
- Return the category — Log which route each request takes. You’ll need this data to tune the system.
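The third pattern can start as a simple in-memory tracker before you wire up real telemetry. A minimal sketch — the `RouteLog` class is a hypothetical helper, not part of any SDK:

```python
from collections import defaultdict

class RouteLog:
    """Hypothetical in-memory tracker for per-route counts and latency."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.latency_total = defaultdict(float)

    def record(self, category: str, elapsed_s: float):
        # Call this with the category returned by route_request_production
        self.counts[category] += 1
        self.latency_total[category] += elapsed_s

    def summary(self) -> dict:
        # Average latency per route: the data you need to tune the ROUTE_TABLE
        return {
            cat: {
                "requests": self.counts[cat],
                "avg_latency_s": round(self.latency_total[cat] / self.counts[cat], 2),
            }
            for cat in self.counts
        }

log = RouteLog()
log.record("simple", 1.2)
log.record("simple", 1.4)
log.record("reasoning->fallback", 4.8)
print(log.summary())
```

Because fallbacks are recorded under their own `"category->fallback"` key, the summary also tells you how often each specialist is unavailable.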
Cost and latency comparison
Consider a workload of 1,000 requests with this distribution: 600 simple, 300 general, 70 reasoning, 30 code. Average 500 input tokens, 200 output tokens per request.
Single-model approach (everything on V3.2)
Every request waits for V3.2’s ~1.2s TTFT plus generation time at ~60 t/s. Simple questions get the same treatment as complex analysis.
Multi-model approach (routed)
The weighted average latency drops to approximately 2.7s — a 40% reduction. The 600 simple requests finish in ~1.2s instead of ~4.5s. That’s a 3.7x improvement for the majority of your traffic.
The 70 reasoning requests are slower individually (~9s vs ~4.5s) because R1 generates chain-of-thought tokens. But the quality on those specific requests is significantly better — R1 scores 50.2% on HLE versus V3.2’s 39.3%.
You get faster averages and better quality on the hard tail.
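The weighted average above is straightforward to reproduce. The per-route latencies below are illustrative assumptions (router overhead plus TTFT plus generation); the exact figure depends on the numbers you plug in, which is why it lands near rather than exactly on ~2.7s:

```python
# Request counts from the example workload (1,000 requests total)
traffic = {"simple": 600, "general": 300, "reasoning": 70, "code": 30}

# Assumed end-to-end latency per route, in seconds
latency_s = {"simple": 1.2, "general": 4.5, "reasoning": 9.0, "code": 4.5}

total_requests = sum(traffic.values())
weighted_avg = sum(traffic[c] * latency_s[c] for c in traffic) / total_requests
print(f"Weighted average latency: {weighted_avg:.1f}s vs ~4.5s single-model")
```

Because 60% of traffic sits in the fast lane, even the slow 9s reasoning tail barely moves the average.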
Real example: a support chatbot
A customer support chatbot receives three types of requests:
- FAQ (60%) — “What are your business hours?” / “How do I reset my password?”
- Complex support (30%) — “I was charged twice for order #12345, can you investigate?”
- Technical issues (10%) — “Your API returns 500 when I send multipart form data with UTF-8 filenames”
Without routing
All requests go to DeepSeek V3.2. FAQs get correct answers but with unnecessary latency. Technical issues get decent answers but miss edge cases that a code-specialized model would catch.
With routing
Section titled “With routing”SUPPORT_ROUTES = { "simple": "meta-llama/llama-3.1-8b-instruct", # FAQ, greetings "general": "deepseek/deepseek-chat-v3-0324", # Complex support "reasoning": "deepseek/deepseek-chat-v3-0324", # Investigations "code": "qwen/qwen3-coder", # Technical issues "agent": "moonshotai/kimi-k2-5", # Multi-step resolution}FAQs resolve in ~1 second via Llama 8B. Complex support issues get V3.2’s full analytical capability. Technical problems route to Qwen3 Coder, which understands the code context better. If a support issue requires looking up order data via API, it routes to Kimi K2.5 for tool-assisted resolution.
The classification step adds ~200ms. For the 60% of requests that drop from ~4.5s to ~1.2s, that’s an invisible cost.
When NOT to use multi-model routing
Routing adds complexity. Skip it when:
- All your requests are the same type. If you’re building a code editor, just use Qwen3 Coder. No routing needed.
- You have fewer than 100 requests/day. The cost savings don’t justify the engineering overhead at low volume.
- Latency doesn’t matter. For batch processing or async workloads, a single capable model is simpler.
- Your classification accuracy is low. If the router misclassifies frequently, you get worse results than a single good model. Test the classifier on real traffic before deploying.
The sweet spot is high-volume applications with diverse request types — chatbots, API gateways, developer tools, and customer-facing products where response time directly affects user experience.
Implementation checklist
- Log your traffic. Before building a router, understand your request distribution. What percentage is simple? Complex? Code?
- Start with two tiers. Llama 8B for simple, V3.2 for everything else. Add specialists only when you have data showing they help.
- Measure classification accuracy. Sample 100 requests, manually label them, compare against the router’s output. Target >90% accuracy.
- Add fallback. Every specialist route should fall back to V3.2 if the specialist is unavailable.
- Monitor per-route metrics. Track latency, cost, and quality per category. This tells you where to optimize next.
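The accuracy check in the list above can be scripted. A minimal sketch, assuming you have collected the router’s predicted labels (e.g. by running `classify_request` over a sample) alongside your hand labels:

```python
def classification_accuracy(gold: list[str], predicted: list[str]) -> float:
    """Fraction of sampled requests where the router's label matches the hand label."""
    assert len(gold) == len(predicted), "one prediction per hand-labeled request"
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Hand labels (gold) vs. what the router returned (predicted) -- toy sample
gold      = ["simple", "code", "reasoning", "general", "simple"]
predicted = ["simple", "code", "general",   "general", "simple"]

accuracy = classification_accuracy(gold, predicted)
print(f"Router accuracy: {accuracy:.0%}")  # 4 of 5 correct
```

Run this on at least 100 real requests, not a toy sample like the one above, and look at which categories the misclassifications fall into: a router that confuses reasoning with general is much cheaper to tolerate than one that confuses code with simple.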
All models in this guide are available through a single OpenAI-compatible API with no configuration changes between models. If you’re building a platform that needs LLM access for your users, see how per-key plans work.
Sources: Artificial Analysis Leaderboard · DeepSeek V3.2 · HLE Leaderboard · Kimi K2.5 Benchmarks