
Building a multi-model architecture: route requests to the right LLM

Using one model for everything is the simplest architecture. It’s also the most wasteful. A 685B-parameter reasoning model answering “what’s the weather?” is like hiring a PhD to sort mail.

This guide covers how to use a small, fast model to classify incoming requests and route them to the right specialist. The result: lower latency, lower cost, and often better quality — because each model handles what it’s actually good at.


The problem with single-model architectures


Most applications start with one model:

User request --> Large Model --> Response

This works, but every request — simple or complex — pays the same latency and cost penalty. When 60% of your traffic is simple classification, FAQ, or extraction, you’re burning expensive compute on tasks a small model handles equally well.

  • Llama 3.1 8B: ~200 t/s
  • DeepSeek V3.2: ~60 t/s
  • DeepSeek R1: ~30 t/s

The gap between Llama 8B and R1 is nearly 7x in throughput. Routing simple requests to the small model saves that difference on every request.


User request --> Router (Llama 8B) --> classify intent
|
+-----------+-----------+-----------+
| | | |
simple general reasoning code
| | | |
Llama 3.1 8B DeepSeek DeepSeek R1 Qwen3
V3.2 Coder
| | | |
+-----+-----+-----+-----+
|
Response

Two stages:

  1. Classify — The router model reads the user’s message and outputs a category. This takes ~0.2 seconds with Llama 8B.
  2. Route — Based on the category, forward the request to the appropriate specialist model.

The router adds minimal overhead (~200ms) but saves significant compute by keeping simple requests away from expensive models.


Llama 3.1 8B is the router. At ~200 t/s output speed, ~0.2s TTFT, and $0.02/M input tokens, the classification step costs almost nothing and completes before the user notices.
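A quick back-of-envelope check on that claim. The ~150-token prompt size is an assumption for illustration; the $0.02/M rate is the one quoted above:

```python
# Back-of-envelope cost of the classification step.
# The ~150-token prompt size is an assumption, not a measurement.
PRICE_PER_M_INPUT = 0.02   # $ per million input tokens (Llama 3.1 8B)
PROMPT_TOKENS = 150        # system prompt + a short user message
# The one-word output (~2 tokens) is negligible and ignored here.

cost_per_request = PROMPT_TOKENS / 1_000_000 * PRICE_PER_M_INPUT
cost_per_million_requests = cost_per_request * 1_000_000

print(f"per request: ${cost_per_request:.6f}")                    # $0.000003
print(f"per million requests: ${cost_per_million_requests:.2f}")  # $3.00
```

Even at a million requests, the routing layer's input cost is a rounding error next to what the specialist models consume.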

The classification prompt is simple — you want a single-word category, not a conversation:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cheapestinference.com/v1",
    api_key="your-api-key",
)

def classify_request(user_message: str) -> str:
    """Classify a user message into a routing category."""
    response = client.chat.completions.create(
        model="meta-llama/llama-3.1-8b-instruct",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the user's message into exactly one category. "
                    "Respond with only the category name, nothing else.\n\n"
                    "Categories:\n"
                    "- simple: greetings, FAQ, simple factual questions\n"
                    "- general: complex questions, analysis, writing, summarization\n"
                    "- reasoning: math, logic, multi-step problems, science\n"
                    "- code: code generation, debugging, refactoring, technical implementation\n"
                    "- agent: tasks requiring tool use, web search, or multi-step execution"
                ),
            },
            {"role": "user", "content": user_message},
        ],
        max_tokens=10,
        temperature=0,
    )
    category = response.choices[0].message.content.strip().lower()
    # Default to general if classification is unclear
    valid = {"simple", "general", "reasoning", "code", "agent"}
    return category if category in valid else "general"

The key details: max_tokens=10 because we only need one word. temperature=0 for deterministic routing. The system prompt is explicit about format — no preamble, just the category.


Each category maps to a model optimized for that task:

# Model routing table
ROUTE_TABLE = {
    "simple": "meta-llama/llama-3.1-8b-instruct",
    "general": "deepseek/deepseek-chat-v3-0324",
    "reasoning": "deepseek/deepseek-reasoner",
    "code": "qwen/qwen3-coder",
    "agent": "moonshotai/kimi-k2-5",
}

def route_request(user_message: str, conversation_history: list) -> str:
    """Classify and route a request to the appropriate model."""
    category = classify_request(user_message)
    model = ROUTE_TABLE[category]
    response = client.chat.completions.create(
        model=model,
        messages=conversation_history + [
            {"role": "user", "content": user_message}
        ],
        stream=True,
    )
    # Stream the response back
    full_response = ""
    for chunk in response:
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            full_response += content
            print(content, end="", flush=True)
    return full_response

Notice that simple requests route back to Llama 8B — the same model that did the classification. For simple queries, the router overhead is effectively zero because the specialist is the same model and can reuse the warm connection.


The basic router works for most traffic, but production systems need a few refinements:

def route_request_production(
    user_message: str,
    conversation_history: list,
    force_model: str | None = None,
) -> tuple[str, str]:
    """Production router with overrides and fallback."""
    # Allow explicit model override (for power users or testing)
    if force_model:
        model = force_model
        category = "override"
    else:
        category = classify_request(user_message)
        model = ROUTE_TABLE[category]
    try:
        response = client.chat.completions.create(
            model=model,
            messages=conversation_history + [
                {"role": "user", "content": user_message}
            ],
        )
        return response.choices[0].message.content, category
    except Exception:
        # Fall back to V3.2 if the specialist is unavailable
        fallback = "deepseek/deepseek-chat-v3-0324"
        response = client.chat.completions.create(
            model=fallback,
            messages=conversation_history + [
                {"role": "user", "content": user_message}
            ],
        )
        return response.choices[0].message.content, f"{category}->fallback"

Three patterns worth noting:

  1. Force model — Let callers bypass routing when they know what they need.
  2. Fallback — If a specialist model is down, fall back to V3.2. It handles everything reasonably well.
  3. Return the category — Log which route each request takes. You’ll need this data to tune the system.

Consider a workload of 1,000 requests with this distribution: 600 simple, 300 general, 70 reasoning, 30 code. Average 500 input tokens, 200 output tokens per request.

Single-model approach (everything on V3.2)

Avg latency: ~4.5s (all 1,000 requests on V3.2)

Every request waits for V3.2’s ~1.2s TTFT plus generation time at ~60 t/s. Simple questions get the same treatment as complex analysis.

With routing:

  • Simple (600): ~1.2s (Llama 8B)
  • General (300): ~4.7s (V3.2)
  • Reasoning (70): ~9.0s (R1)
  • Code (30): ~3.5s (Qwen3 Coder)

The weighted average latency drops to approximately 2.9s, a roughly 36% reduction. The 600 simple requests finish in ~1.2s instead of ~4.5s. That’s a 3.7x improvement for the majority of your traffic.
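The weighted average is easy to verify from the per-category figures above:

```python
# Weighted average latency, from the per-category estimates above
# (request counts per 1,000 requests, latencies in seconds).
routes = {
    "simple":    (600, 1.2),  # Llama 8B
    "general":   (300, 4.7),  # V3.2
    "reasoning": (70,  9.0),  # R1
    "code":      (30,  3.5),  # Qwen3 Coder
}

total = sum(count for count, _ in routes.values())
weighted_avg = sum(count * latency for count, latency in routes.values()) / total

single_model = 4.5  # everything on V3.2
reduction = 1 - weighted_avg / single_model

print(f"weighted average: {weighted_avg:.2f}s")       # 2.87s
print(f"reduction vs single model: {reduction:.0%}")  # 36%
```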

The 70 reasoning requests are slower individually (~9s vs ~4.5s) because R1 generates chain-of-thought tokens. But the quality on those specific requests is significantly better — R1 scores 50.2% on HLE versus V3.2’s 39.3%.

You get faster averages and better quality on the hard tail.


A customer support chatbot receives three types of requests:

  1. FAQ (60%) — “What are your business hours?” / “How do I reset my password?”
  2. Complex support (30%) — “I was charged twice for order #12345, can you investigate?”
  3. Technical issues (10%) — “Your API returns 500 when I send multipart form data with UTF-8 filenames”

All requests go to DeepSeek V3.2. FAQs get correct answers but with unnecessary latency. Technical issues get decent answers but miss edge cases that a code-specialized model would catch.

SUPPORT_ROUTES = {
    "simple": "meta-llama/llama-3.1-8b-instruct",   # FAQ, greetings
    "general": "deepseek/deepseek-chat-v3-0324",    # Complex support
    "reasoning": "deepseek/deepseek-chat-v3-0324",  # Investigations
    "code": "qwen/qwen3-coder",                     # Technical issues
    "agent": "moonshotai/kimi-k2-5",                # Multi-step resolution
}

FAQs resolve in ~1 second via Llama 8B. Complex support issues get V3.2’s full analytical capability. Technical problems route to Qwen3 Coder, which understands the code context better. If a support issue requires looking up order data via API, it routes to Kimi K2.5 for tool-assisted resolution.

The classification step adds ~200ms. For the 60% of requests that drop from ~4.5s to ~1.2s, that’s an invisible cost.


Routing adds complexity. Skip it when:

  • All your requests are the same type. If you’re building a code editor, just use Qwen3 Coder. No routing needed.
  • You have fewer than 100 requests/day. The cost savings don’t justify the engineering overhead at low volume.
  • Latency doesn’t matter. For batch processing or async workloads, a single capable model is simpler.
  • Your classification accuracy is low. If the router misclassifies frequently, you get worse results than a single good model. Test the classifier on real traffic before deploying.

The sweet spot is high-volume applications with diverse request types — chatbots, API gateways, developer tools, and customer-facing products where response time directly affects user experience.


  1. Log your traffic. Before building a router, understand your request distribution. What percentage is simple? Complex? Code?
  2. Start with two tiers. Llama 8B for simple, V3.2 for everything else. Add specialists only when you have data showing they help.
  3. Measure classification accuracy. Sample 100 requests, manually label them, compare against the router’s output. Target >90% accuracy.
  4. Add fallback. Every specialist route should fall back to V3.2 if the specialist is unavailable.
  5. Monitor per-route metrics. Track latency, cost, and quality per category. This tells you where to optimize next.
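For step 3, the comparison itself is a one-liner; the sample below uses made-up labels purely for illustration:

```python
def classification_accuracy(predicted: list[str], labeled: list[str]) -> float:
    """Fraction of router outputs that match manually assigned labels."""
    if len(predicted) != len(labeled):
        raise ValueError("need one prediction per labeled example")
    return sum(p == g for p, g in zip(predicted, labeled)) / len(labeled)

# In practice: run classify_request over ~100 real requests, label them
# by hand, then compare. These ten labels are invented for illustration.
predicted = ["simple", "simple", "code", "general", "reasoning",
             "simple", "agent", "general", "code", "simple"]
labeled   = ["simple", "general", "code", "general", "reasoning",
             "simple", "agent", "general", "code", "simple"]

accuracy = classification_accuracy(predicted, labeled)
print(f"accuracy: {accuracy:.0%}")  # 90% on this toy sample
```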

All models in this guide are available through a single OpenAI-compatible API with no configuration changes between models. If you’re building a platform that needs LLM access for your users, see how per-key plans work.

Sources: Artificial Analysis Leaderboard · DeepSeek V3.2 · HLE Leaderboard · Kimi K2.5 Benchmarks

How to choose the right open-source model for your task

Most teams default to the biggest model available and call it a day. That works — until latency spikes, costs climb, and you realize an 8B-parameter model would have handled 60% of your requests just fine.

This guide maps common use cases to specific models, with real throughput numbers from our infrastructure. No theory — just which model to pick and why.


| Use case | Model | Why |
| --- | --- | --- |
| General chat / assistants | DeepSeek V3.2 | Best all-rounder. 85% MMLU-Pro, 73% SWE-bench, 60 t/s. |
| Complex reasoning | DeepSeek R1 | 50.2% on Humanity’s Last Exam. Chain-of-thought built in. |
| Code generation | Qwen3 Coder | Purpose-built for code. Strong on completions, refactoring, and debugging. |
| Agentic workflows | Kimi K2.5 | 334 t/s output, native tool use, 50.2% HLE with tools. Built for agents. |
| Vision / multimodal | Llama 4 Scout | 17B active parameters (109B total), native image understanding. |
| Fast classification | Llama 3.1 8B | ~200 t/s, 0.2s TTFT. Small enough for routing, tagging, extraction. |
| General (budget) | GLM 4.7 Flash | Fast inference, competitive quality. Good when V3.2 is overkill. |
| Long context chat | MiniMax M2.5 | Native long-context support. Handles large documents well. |
| Large general + reasoning | Qwen3 235B | 235B MoE. Strong across benchmarks when you need maximum capability. |
| Embeddings | BGE Large | MTEB-tested. Solid retrieval quality for RAG pipelines. |

Pick: DeepSeek V3.2

DeepSeek V3.2 is the default choice for most workloads. It scores 85% on MMLU-Pro (beating Claude Opus 4.6’s 82%), 73% on SWE-bench Verified, and runs at ~60 tokens/second on our infrastructure.

  • Kimi K2.5: 334 t/s
  • Llama 3.1 8B: ~200 t/s
  • DeepSeek V3.2: ~60 t/s
  • DeepSeek R1: ~30 t/s

Good at: Broad knowledge, instruction following, multilingual, structured output. Not ideal for: Tasks that need step-by-step reasoning chains (use R1) or sub-100ms latency (use Llama 8B). Pick over alternatives when: You need a reliable general-purpose model that handles most tasks without specialization.


Pick: DeepSeek R1

R1 is a reasoning-first model. It produces explicit chain-of-thought tokens before its final answer. On Humanity’s Last Exam — a benchmark designed to be unsolvable by current models — R1 scores 50.2%, beating GPT-5.4 (41.6%) and Claude Opus 4.6 (40%).

The tradeoff is speed. At ~30 t/s, R1 is the slowest model in our lineup. That’s expected — it’s generating reasoning tokens that never appear in the final output.

Good at: Math, science, logic puzzles, multi-step problems, anything where “thinking” helps. Not ideal for: Simple Q&A, classification, or latency-sensitive applications. Pick over alternatives when: The task requires multi-step deduction. If a human would need to “think through it,” R1 will outperform faster models.


Pick: Qwen3 Coder

Qwen3 Coder is purpose-built for software engineering tasks — code completion, refactoring, debugging, and generation across languages. It’s trained specifically on code-heavy data and optimized for developer workflows.

Good at: Code completion, bug fixing, refactoring, test generation, multi-file edits. Not ideal for: General conversation or non-code tasks (use V3.2). Pick over alternatives when: Code quality matters more than general knowledge. For mixed code-and-chat workflows, V3.2 or Kimi K2.5 may be more versatile.


Pick: Kimi K2.5

Kimi K2.5 was designed for agentic use. It has native tool-calling support, runs at 334 t/s (the fastest model we serve), and scores 50.2% on HLE when using tools — matching R1’s reasoning-only score.

The speed matters for agents. Each tool call is a round trip: the model generates a function call, the tool executes, the result goes back to the model. At 334 t/s and 0.31s TTFT, Kimi completes multi-step agent loops in seconds where slower models take minutes.
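The compounding is easy to see with rough arithmetic from the TTFT and throughput figures above; the per-step token count and tool-execution time are assumptions:

```python
def loop_time(steps: int, tokens_per_step: int, ttft: float, tps: float,
              tool_time: float = 0.5) -> float:
    """Total seconds for an agent loop: each step pays TTFT plus generation
    time, with an assumed fixed tool-execution delay between steps."""
    generation = steps * (ttft + tokens_per_step / tps)
    return generation + (steps - 1) * tool_time

# Five tool calls, ~150 generated tokens per step, 0.5s per tool
# execution -- all illustrative assumptions, not measurements.
kimi = loop_time(steps=5, tokens_per_step=150, ttft=0.31, tps=334)
r1   = loop_time(steps=5, tokens_per_step=150, ttft=2.0,  tps=30)

print(f"Kimi K2.5: {kimi:.1f}s")  # ~5.8s
print(f"R1:        {r1:.1f}s")    # ~37.0s
```

A 10x per-token speed difference turns into the gap between an interactive agent and one the user abandons.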

Good at: Tool use, function calling, multi-step task execution, fast iteration loops. Not ideal for: Pure reasoning without tools (R1 is better). Code-only tasks (Qwen3 Coder is more specialized). Pick over alternatives when: Your application involves tool calling, API interactions, or multi-step agent orchestration where speed compounds.


Pick: Llama 4 Scout

Llama 4 Scout is Meta’s mixture-of-experts multimodal model: 109B total parameters with 17B active per token. It handles text and images natively, making it the pick for tasks that require visual understanding alongside language.

Good at: Image description, visual Q&A, document understanding, chart interpretation. Not ideal for: Text-only tasks where you’re paying for vision capability you don’t use (use V3.2). Pick over alternatives when: Your input includes images. For text-only workloads, other models are more efficient.


Pick: Llama 3.1 8B

At 8 billion parameters, Llama 3.1 8B runs at ~200 t/s with approximately 0.2s time to first token. It’s the right choice for tasks where speed matters more than depth: intent classification, sentiment analysis, entity extraction, content filtering, and request routing.

Good at: Classification, tagging, extraction, routing decisions, simple Q&A, content moderation. Not ideal for: Complex reasoning, long-form generation, or tasks requiring deep world knowledge. Pick over alternatives when: You need results in under a second and the task is well-defined. Also ideal as the router model in a multi-model architecture.


Pick: GLM 4.7 Flash

GLM 4.7 Flash delivers competitive quality at fast inference speeds. When DeepSeek V3.2 is more capability than you need — simple conversations, basic summarization, FAQ bots — GLM 4.7 Flash gets the job done efficiently.

Good at: Simple chat, summarization, translation, basic Q&A. Not ideal for: Complex reasoning or tasks where benchmark-leading quality matters. Pick over alternatives when: You want good-enough quality with better speed and lower cost than the largest models.


Pick: MiniMax M2.5

MiniMax M2.5 handles long context windows natively. For workloads that involve ingesting large documents, long conversation histories, or extensive codebases, M2.5 maintains coherence across the full context.

Good at: Document analysis, long conversations, large-context summarization. Not ideal for: Short, simple tasks where context length is irrelevant (use Llama 8B or GLM Flash). Pick over alternatives when: Your input regularly exceeds what smaller-context models handle well.


Pick: Qwen3 235B

Qwen3 235B is a large mixture-of-experts model that competes across the full benchmark spectrum. When you need the highest possible quality and latency is not the primary constraint, Qwen3 235B delivers.

Good at: Broad capability across reasoning, knowledge, and generation. Strong multilingual support. Not ideal for: Latency-sensitive applications (large model, slower inference). Pick over alternatives when: You need top-tier quality and can tolerate higher latency. Good for batch processing and offline tasks.


Pick: BGE Large

BGE Large (BAAI General Embedding) is a well-tested embedding model for retrieval-augmented generation. It performs well on MTEB benchmarks and produces dense vectors suitable for semantic search, document retrieval, and clustering.

Good at: Semantic search, RAG pipelines, document similarity, clustering. Not ideal for: Generative tasks (it’s an embedding model, not a chat model). Pick over alternatives when: You need vector embeddings for search or retrieval. Pair it with a generative model for the full RAG pipeline.
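Retrieval with an embedding model boils down to comparing dense vectors, usually by cosine similarity. A minimal sketch, with toy 4-dimensional vectors standing in for real BGE Large outputs (which are far higher-dimensional):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration; real embeddings come from the model.
query = [0.1, 0.3, 0.5, 0.1]
doc_a = [0.1, 0.3, 0.5, 0.1]   # same direction as the query
doc_b = [0.5, 0.1, 0.1, 0.3]   # different direction

print(cosine_similarity(query, doc_a))  # ~1.0
print(cosine_similarity(query, doc_b))  # ~0.44, ranked lower
```

In a RAG pipeline you'd embed every document once, embed the query at request time, and return the highest-similarity chunks to the generative model.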


What's your task?
|
+-- Need to understand images?
| YES --> Llama 4 Scout
|
+-- Need step-by-step reasoning? (math, logic, science)
| YES --> DeepSeek R1 (~30 t/s, but highest reasoning quality)
|
+-- Need tool calling / agent loops?
| YES --> Kimi K2.5 (334 t/s, native tool use)
|
+-- Need code generation / editing?
| YES --> Qwen3 Coder (purpose-built for code)
|
+-- Need embeddings for search/RAG?
| YES --> BGE Large
|
+-- Need sub-200ms response?
| YES --> Llama 3.1 8B (~200 t/s, 0.2s TTFT)
|
+-- Need long context (large documents)?
| YES --> MiniMax M2.5
|
+-- Need maximum quality, latency flexible?
| YES --> Qwen3 235B
|
+-- General purpose, good balance?
YES --> DeepSeek V3.2 (default choice)
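The tree above can be sketched as a first-match-wins function. The flag names are invented for illustration; the precedence mirrors the tree's top-to-bottom order:

```python
# The decision tree above as a first-match-wins lookup.
# Flag names are illustrative, not an official API.
def pick_model(
    images: bool = False,
    reasoning: bool = False,
    tools: bool = False,
    code: bool = False,
    embeddings: bool = False,
    low_latency: bool = False,
    long_context: bool = False,
    max_quality: bool = False,
) -> str:
    if images:       return "Llama 4 Scout"
    if reasoning:    return "DeepSeek R1"
    if tools:        return "Kimi K2.5"
    if code:         return "Qwen3 Coder"
    if embeddings:   return "BGE Large"
    if low_latency:  return "Llama 3.1 8B"
    if long_context: return "MiniMax M2.5"
    if max_quality:  return "Qwen3 235B"
    return "DeepSeek V3.2"  # default choice

print(pick_model(code=True))  # Qwen3 Coder
print(pick_model())           # DeepSeek V3.2
```

Note the ordering matters: an image-plus-code task still routes to the multimodal model, because only Scout can see the image at all.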

You don’t need ten models to cover most workloads.

Llama 3.1 8B handles 60% of requests. Classification, routing, simple Q&A, extraction, content filtering. Fast and cheap.

DeepSeek V3.2 handles 30%. General chat, complex instructions, knowledge-intensive tasks. The reliable all-rounder.

Specialized models handle the last 10%. R1 for hard reasoning. Kimi K2.5 for agent loops. Qwen3 Coder for code. BGE Large for embeddings.

Start with Llama 8B + V3.2. Add specialists only when you have evidence that general models aren’t performing on specific task categories. Measure first, specialize second.


All models are available through a single OpenAI-compatible API. If you’re building a platform that needs LLM access for your users, see how per-key plans work.

Sources: Artificial Analysis Leaderboard · SWE-bench Leaderboard · Kimi K2.5 Benchmarks · DeepSeek V3.2 · HLE Leaderboard · MMLU-Pro Leaderboard · MTEB Leaderboard

Open-source models are production-ready. Here's the proof.

There’s a persistent assumption in the industry: open-source models are fine for experimentation, but production workloads need GPT-5 or Claude Opus. We run open-source models in production every day. Here’s what the benchmarks actually say.

We’re comparing 5 models across 5 metrics — the same models in every chart, no cherry-picking:

Open-source (available via our API): DeepSeek V3.2, DeepSeek R1, Kimi K2.5
Proprietary (reference): Claude Opus 4.6, GPT-5.4


Code quality: SWE-bench Verified (% resolved)

  • Claude Opus 4.6: 80.8%
  • GPT-5.4: ~80.0%
  • Kimi K2.5: 76.8%
  • DeepSeek V3.2: 73.0%
  • DeepSeek R1: 57.6%

Proprietary models lead here. Opus 4.6 and GPT-5.4 are within a point of each other at ~80%. Kimi K2.5 is 4 points behind at 76.8% — competitive but not leading. R1 is a reasoning model, not optimized for code.


Reasoning: Humanity’s Last Exam (% correct)

  • Kimi K2.5*: 50.2%
  • DeepSeek R1: 50.2%
  • GPT-5.4: 41.6%
  • Claude Opus 4.6: 40.0%
  • DeepSeek V3.2: 39.3%

Open-source wins decisively. R1 hits 50.2% and Kimi K2.5 matches it with tool-use enabled (*without tools: 31.5%). Both beat Opus 4.6 (40%) and GPT-5.4 (41.6%). V3.2 is roughly at Opus level — it’s a general model, not a reasoning specialist.

*Kimi K2.5’s HLE score uses its agentic mode with tool access. This is how the model is designed to be used in production.


Knowledge: MMLU-Pro (% correct)

  • GPT-5.4: 88.5%
  • Kimi K2.5: 87.1%
  • DeepSeek V3.2: 85.0%
  • DeepSeek R1: 84.0%
  • Claude Opus 4.6: 82.0%

GPT-5.4 leads narrowly at 88.5%, but Kimi K2.5 is 1.4 points behind and all three open-source models beat Opus 4.6. The gap across all 5 models is only 6.5 points — this benchmark is nearly saturated.


Output speed (tokens/second)

  • Kimi K2.5: 334 t/s
  • GPT-5.4: ~78 t/s
  • DeepSeek V3.2: ~60 t/s
  • Claude Opus 4.6: 46 t/s
  • DeepSeek R1: ~30 t/s

Kimi K2.5 at 334 tok/s is in a different league — 4x faster than GPT-5.4, 7x faster than Opus 4.6. R1 is the slowest (expected — it’s a reasoning model producing chain-of-thought tokens).


Time to first token (seconds)

  • Kimi K2.5: 0.31s
  • GPT-5.4: ~0.95s
  • DeepSeek V3.2: 1.18s
  • DeepSeek R1: ~2.0s
  • Claude Opus 4.6: 2.48s

Lower is better. Kimi K2.5 responds 8x faster than Opus 4.6 and 3x faster than GPT-5.4. Even V3.2 beats both proprietary models. Opus 4.6 is the slowest model in this comparison.

Speed and TTFT measured on our production infrastructure. Claude and GPT-5.4 data from Artificial Analysis.


| Metric | Winner | Open-source best | Proprietary best | Gap |
| --- | --- | --- | --- | --- |
| Code (SWE-bench) | Opus 4.6 | Kimi 76.8% | Opus 80.8% | -4 pts |
| Reasoning (HLE) | R1 | R1 50.2% | GPT-5.4 41.6% | +8.6 pts |
| Knowledge (MMLU-Pro) | GPT-5.4 | Kimi 87.1% | GPT-5.4 88.5% | -1.4 pts |
| Speed (tok/s) | Kimi K2.5 | 334 t/s | GPT-5.4 78 t/s | 4.3x faster |
| Latency (TTFT) | Kimi K2.5 | 0.31s | GPT-5.4 0.95s | 3x faster |

Open-source wins 3 out of 5. Proprietary models lead on Code (by 4 points) and Knowledge (by 1.4 points). Open-source leads on Reasoning (by 8.6 points), Speed (by 4.3x), and Latency (by 3x).

Note: Kimi K2.5’s HLE score (50.2%) uses tool-augmented mode. Without tools it scores 31.5%. DeepSeek R1’s 50.2% is pure chain-of-thought reasoning without tools.


What “production-ready” actually means

  1. Reliable enough. Consistent quality across thousands of requests.
  2. Fast enough. Kimi K2.5 at 334 tok/s and 0.31s TTFT. That’s real-time.
  3. Capable enough. Within 4 points of the best proprietary model on code, ahead on reasoning.
  4. Predictable. Versioned models that don’t change without warning.

Proprietary models change under you. Fine one day, different behavior the next. No changelog, no warning. Open-source models are versioned — DeepSeek V3.2 behaves the same tomorrow as today. You choose when to upgrade.

For production workloads, that predictability is worth more than a marginal quality edge on any single benchmark.


We serve 70+ open-source models through a single API. If you’re building a platform that needs LLM access for your users, see how per-key plans work.

Sources: Artificial Analysis Leaderboard · SWE-bench Leaderboard · Kimi K2.5 Benchmarks · DeepSeek V3.2 · OpenAI Pricing · Anthropic Pricing · HLE Leaderboard · MMLU-Pro Leaderboard