
How to choose the right open-source model for your task

Most teams default to the biggest model available and call it a day. That works — until latency spikes, costs climb, and you realize an 8B-parameter model would have handled 60% of your requests just fine.

This guide maps common use cases to specific models, with real throughput numbers from our infrastructure. No theory — just which model to pick and why.


| Use case | Model | Why |
| --- | --- | --- |
| General chat / assistants | DeepSeek V3.2 | Best all-rounder. 85% MMLU-Pro, 73% SWE-bench, 60 t/s. |
| Complex reasoning | DeepSeek R1 | 50.2% on Humanity’s Last Exam. Chain-of-thought built in. |
| Code generation | Qwen3 Coder | Purpose-built for code. Strong on completions, refactoring, and debugging. |
| Agentic workflows | Kimi K2.5 | 334 t/s output, native tool use, 50.2% HLE with tools. Built for agents. |
| Vision / multimodal | Llama 4 Scout | 17 active experts, 109B params, native image understanding. |
| Fast classification | Llama 3.1 8B | ~200 t/s, 0.2s TTFT. Small enough for routing, tagging, extraction. |
| General (budget) | GLM 4.7 Flash | Fast inference, competitive quality. Good when V3.2 is overkill. |
| Long context chat | MiniMax M2.5 | Native long-context support. Handles large documents well. |
| Large general + reasoning | Qwen3 235B | 235B MoE. Strong across benchmarks when you need maximum capability. |
| Embeddings | BGE Large | MTEB-tested. Solid retrieval quality for RAG pipelines. |

Pick: DeepSeek V3.2

DeepSeek V3.2 is the default choice for most workloads. It scores 85% on MMLU-Pro (beating Claude Opus 4.6’s 82%), 73% on SWE-bench Verified, and runs at ~60 tokens/second on our infrastructure.

Output speed comparison (t/s): Kimi K2.5 334 · Llama 3.1 8B ~200 · DeepSeek V3.2 ~60 · DeepSeek R1 ~30.

Good at: Broad knowledge, instruction following, multilingual, structured output.
Not ideal for: Tasks that need step-by-step reasoning chains (use R1) or sub-100ms latency (use Llama 8B).
Pick over alternatives when: You need a reliable general-purpose model that handles most tasks without specialization.
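Since every model here sits behind an OpenAI-compatible API, calling V3.2 is a standard chat completion. A minimal sketch of building the request body; the model identifier "deepseek-v3.2" is an assumption, so check your provider's model list for the exact name:

```python
# Sketch of a chat completion request body for an OpenAI-compatible
# endpoint (POST /v1/chat/completions). The model name
# "deepseek-v3.2" is an assumed identifier, not a confirmed one.

def build_chat_request(user_message: str,
                       system_prompt: str = "You are a helpful assistant.",
                       model: str = "deepseek-v3.2") -> dict:
    """Build the JSON body for an OpenAI-style chat completion."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.7,
    }

body = build_chat_request("Summarize this ticket in one sentence.")
```

The same body shape works for every chat model in this guide; only the `model` field changes.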


Pick: DeepSeek R1

R1 is a reasoning-first model. It produces explicit chain-of-thought tokens before its final answer. On Humanity’s Last Exam — a benchmark designed to be unsolvable by current models — R1 scores 50.2%, beating GPT-5.4 (41.6%) and Claude Opus 4.6 (40%).

The tradeoff is speed. At ~30 t/s, R1 is the slowest model in our lineup. That’s expected — it’s generating reasoning tokens that never appear in the final output.

Good at: Math, science, logic puzzles, multi-step problems, anything where “thinking” helps.
Not ideal for: Simple Q&A, classification, or latency-sensitive applications.
Pick over alternatives when: The task requires multi-step deduction. If a human would need to “think through it,” R1 will outperform faster models.
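Open deployments of R1 commonly emit the chain-of-thought inside `<think>…</think>` tags ahead of the final answer. A sketch of separating the two, assuming that tag convention holds for your deployment (some servers strip the reasoning for you, in which case it comes back empty):

```python
import re

def split_reasoning(raw: str) -> tuple[str, str]:
    """Split R1-style output into (reasoning, answer).

    Assumes chain-of-thought is wrapped in <think>...</think> tags;
    if no tags are present, the whole string is treated as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", raw, re.DOTALL)
    if not match:
        return "", raw.strip()
    reasoning = match.group(1).strip()
    answer = raw[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>17 has no divisor up to 4, so it is prime.</think>Yes, 17 is prime."
)
```

Logging the reasoning separately keeps it out of user-facing output while preserving it for debugging.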


Pick: Qwen3 Coder

Qwen3 Coder is purpose-built for software engineering tasks — code completion, refactoring, debugging, and generation across languages. It’s trained specifically on code-heavy data and optimized for developer workflows.

Good at: Code completion, bug fixing, refactoring, test generation, multi-file edits.
Not ideal for: General conversation or non-code tasks (use V3.2).
Pick over alternatives when: Code quality matters more than general knowledge. For mixed code-and-chat workflows, V3.2 or Kimi K2.5 may be more versatile.


Pick: Kimi K2.5

Kimi K2.5 was designed for agentic use. It has native tool-calling support, runs at 334 t/s (the fastest model we serve), and scores 50.2% on HLE when using tools — matching R1’s reasoning-only score.

The speed matters for agents. Each tool call is a round trip: the model generates a function call, the tool executes, the result goes back to the model. At 334 t/s and 0.31s TTFT, Kimi completes multi-step agent loops in seconds where slower models take minutes.

Good at: Tool use, function calling, multi-step task execution, fast iteration loops.
Not ideal for: Pure reasoning without tools (R1 is better). Code-only tasks (Qwen3 Coder is more specialized).
Pick over alternatives when: Your application involves tool calling, API interactions, or multi-step agent orchestration where speed compounds.
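The round-trip loop described above (model emits a call, the tool runs, the result goes back to the model) can be sketched with a stubbed model standing in for Kimi K2.5. The tool, the stub, and the message shapes are all illustrative, not the provider's actual API:

```python
import json

def get_weather(city: str) -> str:
    """Toy tool; a real agent would hit an actual API here."""
    return json.dumps({"city": city, "temp_c": 18})

def stub_model(messages: list[dict]) -> dict:
    """Stand-in for a tool-calling model like Kimi K2.5:
    first turn requests the tool, then answers from its result."""
    tool_msgs = [m for m in messages if m["role"] == "tool"]
    if tool_msgs:
        data = json.loads(tool_msgs[-1]["content"])
        return {"role": "assistant",
                "content": f"It is {data['temp_c']}°C in {data['city']}."}
    return {"role": "assistant",
            "tool_call": {"name": "get_weather", "arguments": {"city": "Oslo"}}}

def run_agent(user_message: str, max_steps: int = 5) -> str:
    tools = {"get_weather": get_weather}
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):          # each iteration = one model round trip
        reply = stub_model(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]     # no tool call means we have the answer
        result = tools[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not finish within max_steps")
```

Every iteration of that loop pays a full generation latency, which is why Kimi's 334 t/s compounds across multi-step tasks.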


Pick: Llama 4 Scout

Llama 4 Scout is Meta’s mixture-of-experts multimodal model — 109B total parameters with 17 active experts. It handles text and images natively, making it the pick for tasks that require visual understanding alongside language.

Good at: Image description, visual Q&A, document understanding, chart interpretation.
Not ideal for: Text-only tasks where you’re paying for vision capability you don’t use (use V3.2).
Pick over alternatives when: Your input includes images. For text-only workloads, other models are more efficient.
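On an OpenAI-compatible endpoint, image input rides along in the message content as an `image_url` part, typically as a base64 data URL. A sketch of building that body; the model name "llama-4-scout" is an assumption:

```python
import base64

def build_vision_request(prompt: str, image_bytes: bytes,
                         model: str = "llama-4-scout") -> dict:
    """Build a chat request mixing text and an inline base64 image.
    The model identifier is an assumed name; check your provider."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

body = build_vision_request("Describe this chart.", b"\x89PNG toy bytes")
```
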


Pick: Llama 3.1 8B

At 8 billion parameters, Llama 3.1 8B runs at ~200 t/s with approximately 0.2s time to first token. It’s the right choice for tasks where speed matters more than depth: intent classification, sentiment analysis, entity extraction, content filtering, and request routing.

Good at: Classification, tagging, extraction, routing decisions, simple Q&A, content moderation.
Not ideal for: Complex reasoning, long-form generation, or tasks requiring deep world knowledge.
Pick over alternatives when: You need results in under a second and the task is well-defined. Also ideal as the router model in a multi-model architecture.
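For routing and tagging, the practical trick is to constrain the small model to a fixed label set and validate whatever comes back. A sketch of the prompt and the validation step; the label set is illustrative:

```python
# Illustrative label set for a support-ticket router.
LABELS = {"billing", "technical", "account", "other"}

def classification_prompt(text: str) -> str:
    """Prompt a small, fast model (e.g. Llama 3.1 8B) for one label."""
    return (
        f"Classify the support request into exactly one of: "
        f"{', '.join(sorted(LABELS))}.\n"
        f"Reply with the label only.\n\nRequest: {text}"
    )

def parse_label(raw: str) -> str:
    """Validate model output; fall back to 'other' on anything unexpected."""
    label = raw.strip().lower().rstrip(".")
    return label if label in LABELS else "other"
```

The fallback matters: even well-behaved small models occasionally return extra words, and a router should degrade to a safe default rather than crash.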


Pick: GLM 4.7 Flash

GLM 4.7 Flash delivers competitive quality at fast inference speeds. When DeepSeek V3.2 is more capability than you need — simple conversations, basic summarization, FAQ bots — GLM 4.7 Flash gets the job done efficiently.

Good at: Simple chat, summarization, translation, basic Q&A.
Not ideal for: Complex reasoning or tasks where benchmark-leading quality matters.
Pick over alternatives when: You want good-enough quality with better speed and lower cost than the largest models.


Pick: MiniMax M2.5

MiniMax M2.5 handles long context windows natively. For workloads that involve ingesting large documents, long conversation histories, or extensive codebases, M2.5 maintains coherence across the full context.

Good at: Document analysis, long conversations, large-context summarization.
Not ideal for: Short, simple tasks where context length is irrelevant (use Llama 8B or GLM Flash).
Pick over alternatives when: Your input regularly exceeds what smaller-context models handle well.


Pick: Qwen3 235B

Qwen3 235B is a large mixture-of-experts model that competes across the full benchmark spectrum. When you need the highest possible quality and latency is not the primary constraint, Qwen3 235B delivers.

Good at: Broad capability across reasoning, knowledge, and generation. Strong multilingual support.
Not ideal for: Latency-sensitive applications (large model, slower inference).
Pick over alternatives when: You need top-tier quality and can tolerate higher latency. Good for batch processing and offline tasks.


Pick: BGE Large

BGE Large (BAAI General Embedding) is a well-tested embedding model for retrieval-augmented generation. It performs well on MTEB benchmarks and produces dense vectors suitable for semantic search, document retrieval, and clustering.

Good at: Semantic search, RAG pipelines, document similarity, clustering.
Not ideal for: Generative tasks (it’s an embedding model, not a chat model).
Pick over alternatives when: You need vector embeddings for search or retrieval. Pair it with a generative model for the full RAG pipeline.
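Retrieval with an embedding model reduces to cosine similarity between the query vector and each document vector. A sketch with toy 3-dimensional vectors standing in for real BGE Large embeddings (which are much higher-dimensional):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], doc_vecs: dict, k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(doc_vecs.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy vectors standing in for BGE Large embeddings.
docs = {
    "refund_policy": [0.9, 0.1, 0.0],
    "api_reference": [0.1, 0.9, 0.2],
    "office_hours":  [0.0, 0.2, 0.9],
}
```

At production scale you would hand this ranking off to a vector database, but the scoring function is the same.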


What's your task?
|
+-- Need to understand images?
| YES --> Llama 4 Scout
|
+-- Need step-by-step reasoning? (math, logic, science)
| YES --> DeepSeek R1 (~30 t/s, but highest reasoning quality)
|
+-- Need tool calling / agent loops?
| YES --> Kimi K2.5 (334 t/s, native tool use)
|
+-- Need code generation / editing?
| YES --> Qwen3 Coder (purpose-built for code)
|
+-- Need embeddings for search/RAG?
| YES --> BGE Large
|
+-- Need sub-200ms response?
| YES --> Llama 3.1 8B (~200 t/s, 0.2s TTFT)
|
+-- Need long context (large documents)?
| YES --> MiniMax M2.5
|
+-- Need maximum quality, latency flexible?
| YES --> Qwen3 235B
|
+-- General purpose, good balance?
YES --> DeepSeek V3.2 (default choice)
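The tree above translates directly into a first-match routing function. A sketch, with the flags and model names taken from this guide:

```python
def choose_model(*, images=False, reasoning=False, tools=False,
                 code=False, embeddings=False, low_latency=False,
                 long_context=False, max_quality=False) -> str:
    """First matching branch wins, mirroring the decision tree above."""
    if images:       return "Llama 4 Scout"
    if reasoning:    return "DeepSeek R1"
    if tools:        return "Kimi K2.5"
    if code:         return "Qwen3 Coder"
    if embeddings:   return "BGE Large"
    if low_latency:  return "Llama 3.1 8B"
    if long_context: return "MiniMax M2.5"
    if max_quality:  return "Qwen3 235B"
    return "DeepSeek V3.2"  # general-purpose default
```

Branch order encodes priority: an image task with tool calls still goes to Llama 4 Scout, because vision support is the harder constraint to satisfy.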

You don’t need ten models to cover most workloads.

Llama 3.1 8B handles 60% of requests. Classification, routing, simple Q&A, extraction, content filtering. Fast and cheap.

DeepSeek V3.2 handles 30%. General chat, complex instructions, knowledge-intensive tasks. The reliable all-rounder.

Specialized models handle the last 10%. R1 for hard reasoning. Kimi K2.5 for agent loops. Qwen3 Coder for code. BGE Large for embeddings.

Start with Llama 8B + V3.2. Add specialists only when you have evidence that general models aren’t performing on specific task categories. Measure first, specialize second.


All models are available through a single OpenAI-compatible API. If you’re building a platform that needs LLM access for your users, see how per-key plans work.

Sources: Artificial Analysis Leaderboard · SWE-bench Leaderboard · Kimi K2.5 Benchmarks · DeepSeek V3.2 · HLE Leaderboard · MMLU-Pro Leaderboard · MTEB Leaderboard