Open-source models are production-ready. Here's the proof.
There’s a persistent assumption in the industry: open-source models are fine for experimentation, but production workloads need GPT-5 or Claude Opus. We run open-source models in production every day. Here’s what the benchmarks actually say.
We’re comparing 5 models across 5 metrics — the same models in every chart, no cherry-picking:
- Open-source (available via our API): DeepSeek V3.2, DeepSeek R1, Kimi K2.5
- Proprietary (reference): Claude Opus 4.6, GPT-5.4
Code quality: SWE-bench Verified (% resolved)
Proprietary models lead here. Opus 4.6 and GPT-5.4 are within a point of each other at ~80%. Kimi K2.5 is 4 points behind at 76.8% — competitive but not leading. R1 is a reasoning model, not optimized for code.
Reasoning: Humanity’s Last Exam (%)
Open-source wins decisively. R1 hits 50.2% and Kimi K2.5 matches it with tool-use enabled (*without tools: 31.5%). Both beat Opus 4.6 (40%) and GPT-5.4 (41.6%). V3.2 is roughly at Opus level — it’s a general model, not a reasoning specialist.
*Kimi K2.5’s HLE score uses its agentic mode with tool access. This is how the model is designed to be used in production.
Knowledge: MMLU-Pro (%)
GPT-5.4 leads narrowly at 88.5%, but Kimi K2.5 is 1.4 points behind and all three open-source models beat Opus 4.6. The gap across all 5 models is only 6.5 points — this benchmark is nearly saturated.
Speed: output tokens per second
Kimi K2.5 at 334 tok/s is in a different league — 4x faster than GPT-5.4, 7x faster than Opus 4.6. R1 is the slowest (expected — it’s a reasoning model producing chain-of-thought tokens).
Latency: time to first token (seconds)
Lower is better. Kimi K2.5 responds 8x faster than Opus 4.6 and 3x faster than GPT-5.4. Even V3.2 beats both proprietary models. Opus 4.6 is the slowest model in this comparison.
Speed and TTFT measured on our production infrastructure. Claude and GPT-5.4 data from Artificial Analysis.
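Both speed metrics reduce to simple arithmetic over a streamed response: TTFT is the delay before the first token arrives, and throughput is tokens divided by the generation window. A minimal sketch of that calculation — the helper name and the sample timing data below are illustrative, not our actual measurement harness:

```python
def stream_metrics(request_start, token_times):
    """Compute time-to-first-token (seconds) and output tokens/sec
    from a request start time and per-token arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    window = token_times[-1] - token_times[0]
    # Throughput over the generation window only, so TTFT doesn't
    # inflate the number; guard against single-token responses.
    tps = (len(token_times) - 1) / window if window > 0 else float("inf")
    return ttft, tps

# Hypothetical trace: 335 tokens arriving every 3 ms after a 0.31 s TTFT
times = [0.31 + i * 0.003 for i in range(335)]
ttft, tps = stream_metrics(0.0, times)  # ttft = 0.31 s, tps ≈ 333 tok/s
```

Measuring throughput from the first token onward (rather than from the request) is why TTFT and tok/s are reported as separate numbers: a model can lead on one and trail on the other.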
The full picture
The scorecard
| Metric | Winner | Open-source | Proprietary | Gap |
|---|---|---|---|---|
| Code (SWE-bench) | Opus 4.6 | Kimi 76.8% | Opus 80.8% | -4 pts |
| Reasoning (HLE) | R1 | R1 50.2% | GPT-5.4 41.6% | +8.6 pts |
| Knowledge (MMLU-Pro) | GPT-5.4 | Kimi 87.1% | GPT-5.4 88.5% | -1.4 pts |
| Speed (tok/s) | Kimi K2.5 | 334 t/s | GPT-5.4 78 t/s | 4.3x faster |
| Latency (TTFT) | Kimi K2.5 | 0.31s | GPT-5.4 0.95s | 3x faster |
Open-source wins 3 out of 5. Proprietary models lead on Code (by 4 points) and Knowledge (by 1.4 points). Open-source leads on Reasoning (by 8.6 points), Speed (by 4.3x), and Latency (by 3x).
Note: Kimi K2.5’s HLE score (50.2%) uses tool-augmented mode. Without tools it scores 31.5%. DeepSeek R1’s 50.2% is pure chain-of-thought reasoning without tools.
What “production-ready” actually means
- Reliable enough. Consistent quality across thousands of requests.
- Fast enough. Kimi K2.5 at 334 tok/s and 0.31s TTFT. That’s real-time.
- Capable enough. Within 4 points of the best proprietary model on code, ahead on reasoning.
- Predictable. Versioned models that don’t change without warning.
The real advantage: control
Proprietary models change under you. Fine one day, different behavior the next. No changelog, no warning. Open-source models are versioned — DeepSeek V3.2 behaves the same tomorrow as today. You choose when to upgrade.
For production workloads, that predictability is worth more than a marginal quality edge on any single benchmark.
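In practice, that control comes down to pinning an exact model identifier in every request instead of a floating "latest" alias. A minimal sketch assuming an OpenAI-style chat-completions payload — the function name and the model ID string here are placeholders, not this API's literal identifiers:

```python
import json

def chat_request(model, prompt):
    """Build a chat payload with an explicitly pinned model ID,
    so behavior changes only when you change this string."""
    return {
        # Pin the exact version; avoid "latest"-style aliases that
        # can silently swap the model under your application.
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = chat_request("deepseek-v3.2", "Summarize this changelog.")
body = json.dumps(payload)  # ready to POST to a chat-completions endpoint
```

Upgrades then become deliberate: change the pinned string in one place, re-run your evals, and roll forward (or back) on your own schedule.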
We serve 70+ open-source models through a single API. If you’re building a platform that needs LLM access for your users, see how per-key plans work.
Sources: Artificial Analysis Leaderboard · SWE-bench Leaderboard · Kimi K2.5 Benchmarks · DeepSeek V3.2 · OpenAI Pricing · Anthropic Pricing · HLE Leaderboard · MMLU-Pro Leaderboard