
Open-source models are production-ready. Here's the proof.

There’s a persistent assumption in the industry: open-source models are fine for experimentation, but production workloads need GPT-5 or Claude Opus. We run open-source models in production every day. Here’s what the benchmarks actually say.

We’re comparing 5 models across 5 metrics — the same models in every chart, no cherry-picking:

Open-source (available via our API): DeepSeek V3.2, DeepSeek R1, Kimi K2.5

Proprietary (reference): Claude Opus 4.6, GPT-5.4


Code quality: SWE-bench Verified (% resolved)

| Model | % resolved |
| --- | --- |
| Claude Opus 4.6 | 80.8% |
| GPT-5.4 | ~80.0% |
| Kimi K2.5 | 76.8% |
| DeepSeek V3.2 | 73.0% |
| DeepSeek R1 | 57.6% |

Proprietary models lead here. Opus 4.6 and GPT-5.4 are within a point of each other at ~80%. Kimi K2.5 is 4 points behind at 76.8% — competitive but not leading. R1 is a reasoning model, not optimized for code.


Reasoning: Humanity’s Last Exam (% correct)

| Model | % correct |
| --- | --- |
| Kimi K2.5 * | 50.2% |
| DeepSeek R1 | 50.2% |
| GPT-5.4 | 41.6% |
| Claude Opus 4.6 | 40.0% |
| DeepSeek V3.2 | 39.3% |

Open-source wins decisively. R1 hits 50.2% and Kimi K2.5 matches it with tool-use enabled (*without tools: 31.5%). Both beat Opus 4.6 (40%) and GPT-5.4 (41.6%). V3.2 is roughly at Opus level — it’s a general model, not a reasoning specialist.

*Kimi K2.5’s HLE score uses its agentic mode with tool access. This is how the model is designed to be used in production.
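To make the footnote concrete, here is a minimal sketch of what "agentic mode with tool access" looks like in a request. This assumes an OpenAI-compatible chat endpoint; the model ID, tool name, and schema are illustrative placeholders, not a documented API.

```python
# Hypothetical sketch: enabling tool use ("agentic mode") in a request to
# an OpenAI-compatible chat endpoint. Model ID and tool are placeholders.
import json

search_tool = {
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool the model may call
        "description": "Search the web and return top results as text.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

payload = {
    "model": "kimi-k2.5",  # placeholder model ID
    "messages": [{"role": "user", "content": "Answer this HLE question..."}],
    "tools": [search_tool],   # tool-augmented mode: the 50.2% setting
    "tool_choice": "auto",    # omit tools entirely for the 31.5% setting
}

print(json.dumps(payload, indent=2))
```

The benchmark difference (50.2% vs 31.5%) is exactly the difference between sending and omitting the `tools` field above.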


Knowledge: MMLU-Pro (% correct)

| Model | % correct |
| --- | --- |
| GPT-5.4 | 88.5% |
| Kimi K2.5 | 87.1% |
| DeepSeek V3.2 | 85.0% |
| DeepSeek R1 | 84.0% |
| Claude Opus 4.6 | 82.0% |

GPT-5.4 leads narrowly at 88.5%, but Kimi K2.5 is 1.4 points behind and all three open-source models beat Opus 4.6. The gap across all 5 models is only 6.5 points — this benchmark is nearly saturated.


Speed: output throughput (tokens/sec)

| Model | Tokens/sec |
| --- | --- |
| Kimi K2.5 | 334 t/s |
| GPT-5.4 | ~78 t/s |
| DeepSeek V3.2 | ~60 t/s |
| Claude Opus 4.6 | 46 t/s |
| DeepSeek R1 | ~30 t/s |

Kimi K2.5 at 334 tok/s is in a different league — 4x faster than GPT-5.4, 7x faster than Opus 4.6. R1 is the slowest (expected — it’s a reasoning model producing chain-of-thought tokens).


Latency: time to first token (seconds)

| Model | TTFT |
| --- | --- |
| Kimi K2.5 | 0.31s |
| GPT-5.4 | ~0.95s |
| DeepSeek V3.2 | 1.18s |
| DeepSeek R1 | ~2.0s |
| Claude Opus 4.6 | 2.48s |

Lower is better. Kimi K2.5 responds 8x faster than Opus 4.6 and 3x faster than GPT-5.4. Even V3.2 beats both proprietary models. Opus 4.6 is the slowest model in this comparison.

Speed and TTFT measured on our production infrastructure. Claude and GPT-5.4 data from Artificial Analysis.
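For readers who want to reproduce these two numbers, here is an illustrative sketch (not our actual harness) of how TTFT and throughput are typically derived from timestamps recorded around a streaming response. The timestamps in the example are made up.

```python
# Illustrative sketch: deriving TTFT and throughput from timestamps
# recorded around a streaming chat response.

def ttft_seconds(request_sent: float, first_token_at: float) -> float:
    """Time to first token: delay before the first streamed token arrives."""
    return first_token_at - request_sent

def tokens_per_second(n_tokens: int, first_token_at: float,
                      last_token_at: float) -> float:
    """Generation throughput over the streaming window."""
    return n_tokens / (last_token_at - first_token_at)

# Example with made-up timestamps (seconds since request start):
print(ttft_seconds(0.0, 0.31))                        # 0.31s TTFT
print(round(tokens_per_second(1002, 0.31, 3.31), 1))  # 334.0 tok/s
```

Note that throughput is measured from the first token, not from the request, so a slow TTFT does not drag down the tok/s figure.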


| Metric | Winner | Open-source | Proprietary | Gap |
| --- | --- | --- | --- | --- |
| Code (SWE-bench) | Opus 4.6 | Kimi 76.8% | Opus 80.8% | -4 pts |
| Reasoning (HLE) | R1 | R1 50.2% | GPT-5.4 41.6% | +8.6 pts |
| Knowledge (MMLU-Pro) | GPT-5.4 | Kimi 87.1% | GPT-5.4 88.5% | -1.4 pts |
| Speed (tok/s) | Kimi K2.5 | 334 t/s | GPT-5.4 78 t/s | 4.3x faster |
| Latency (TTFT) | Kimi K2.5 | 0.31s | GPT-5.4 0.95s | 3x faster |

Open-source wins 3 out of 5. Proprietary models lead on Code (by 4 points) and Knowledge (by 1.4 points). Open-source leads on Reasoning (by 8.6 points), Speed (by 4.3x), and Latency (by 3x).
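Every gap in the summary table follows directly from the chart values; this short snippet recomputes them, so nothing depends on trusting our rounding:

```python
# Recomputing the summary-table gaps from the chart values above.
swe   = {"Kimi K2.5": 76.8, "Opus 4.6": 80.8}
hle   = {"R1": 50.2, "GPT-5.4": 41.6}
mmlu  = {"Kimi K2.5": 87.1, "GPT-5.4": 88.5}
speed = {"Kimi K2.5": 334, "GPT-5.4": 78}    # tok/s, higher is better
ttft  = {"Kimi K2.5": 0.31, "GPT-5.4": 0.95} # seconds, lower is better

print(round(swe["Kimi K2.5"] - swe["Opus 4.6"], 1))     # -4.0 pts
print(round(hle["R1"] - hle["GPT-5.4"], 1))             # 8.6 pts
print(round(mmlu["Kimi K2.5"] - mmlu["GPT-5.4"], 1))    # -1.4 pts
print(round(speed["Kimi K2.5"] / speed["GPT-5.4"], 1))  # 4.3x
print(round(ttft["GPT-5.4"] / ttft["Kimi K2.5"], 1))    # 3.1x, ~3x
```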

Note: Kimi K2.5’s HLE score (50.2%) uses tool-augmented mode. Without tools it scores 31.5%. DeepSeek R1’s 50.2% is pure chain-of-thought reasoning without tools.


What “production-ready” actually means

  1. Reliable enough. Consistent quality across thousands of requests.
  2. Fast enough. Kimi K2.5 at 334 tok/s and 0.31s TTFT. That’s real-time.
  3. Capable enough. Within 4 points of the best proprietary model on code, ahead on reasoning.
  4. Predictable. Versioned models that don’t change without warning.

Proprietary models can change underneath you: fine one day, different behavior the next, with no changelog and no warning. Open-source models are versioned — DeepSeek V3.2 behaves the same tomorrow as today. You choose when to upgrade.
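One simple way to operationalize "you choose when to upgrade" is to refuse floating aliases in application config. A minimal sketch, assuming hypothetical config keys and model IDs:

```python
# Hypothetical sketch: only ever deploy explicitly pinned model IDs.
# Config keys and model IDs here are illustrative, not a documented API.
PINNED_MODELS = {
    "chat": "deepseek-v3.2",   # upgraded only by editing this line
    "reasoning": "deepseek-r1",
}
FLOATING_ALIASES = {"latest", "auto", "default"}

def resolve_model(role: str) -> str:
    """Return the pinned model for a role; reject floating aliases."""
    model = PINNED_MODELS[role]
    if model in FLOATING_ALIASES:
        raise ValueError(f"model for {role!r} must be pinned, got {model!r}")
    return model

print(resolve_model("chat"))
```

With this in place, a model upgrade becomes a deliberate config change you can review and roll back, not something that happens to you.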

For production workloads, that predictability is worth more than a marginal quality edge on any single benchmark.


We serve 70+ open-source models through a single API. If you’re building a platform that needs LLM access for your users, see how per-key plans work.

Sources: Artificial Analysis Leaderboard · SWE-bench Leaderboard · Kimi K2.5 Benchmarks · DeepSeek V3.2 · OpenAI Pricing · Anthropic Pricing · HLE Leaderboard · MMLU-Pro Leaderboard