Open-source models are production-ready. Here's the proof.
There’s a persistent assumption in the industry: open-source models are fine for experimentation, but production workloads need GPT-5 or Claude Opus. We run open-source models in production every day. Here’s what the benchmarks actually say.
We’re comparing 5 models across 5 metrics — the same models in every chart, no cherry-picking:
- Open-source (available via our API): DeepSeek V3.2, DeepSeek R1, Kimi K2.5
- Proprietary (reference): Claude Opus 4.6, GPT-5.4
Code quality: SWE-bench Verified (% resolved)
Proprietary models lead here. Opus 4.6 and GPT-5.4 are within a point of each other at ~80%. Kimi K2.5 is 4 points behind at 76.8% — competitive but not leading. R1 is a reasoning model, not optimized for code.
Reasoning: Humanity’s Last Exam (%)
Open-source wins decisively. R1 hits 50.2% and Kimi K2.5 matches it with tool-use enabled (*without tools: 31.5%). Both beat Opus 4.6 (40%) and GPT-5.4 (41.6%). V3.2 is roughly at Opus level — it’s a general model, not a reasoning specialist.
*Kimi K2.5’s HLE score uses its agentic mode with tool access. This is how the model is designed to be used in production.
Knowledge: MMLU-Pro (%)
GPT-5.4 leads narrowly at 88.5%, but Kimi K2.5 is 1.4 points behind and all three open-source models beat Opus 4.6. The gap across all 5 models is only 6.5 points — this benchmark is nearly saturated.
Speed: output tokens per second
Kimi K2.5 at 334 tok/s is in a different league — 4x faster than GPT-5.4, 7x faster than Opus 4.6. R1 is the slowest (expected — it’s a reasoning model producing chain-of-thought tokens).
Latency: time to first token (seconds)
Lower is better. Kimi K2.5 responds 8x faster than Opus 4.6 and 3x faster than GPT-5.4. Even V3.2 beats both proprietary models. Opus 4.6 is the slowest model in this comparison.
Speed and TTFT measured on our production infrastructure. Claude and GPT-5.4 data from Artificial Analysis.
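Both speed metrics reduce to simple arithmetic over a streamed response: TTFT is the delay before the first token arrives, and throughput is tokens divided by the generation window. A minimal sketch of that calculation — the helper name and the sample timing data below are illustrative, not our actual measurement harness:

```python
def stream_metrics(request_start, token_times):
    """Compute time-to-first-token (seconds) and output tokens/sec
    from a request start time and per-token arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    window = token_times[-1] - token_times[0]
    # Throughput over the generation window only, so TTFT doesn't
    # inflate the number; guard against single-token responses.
    tps = (len(token_times) - 1) / window if window > 0 else float("inf")
    return ttft, tps

# Hypothetical trace: 335 tokens arriving every 3 ms after a 0.31 s TTFT
times = [0.31 + i * 0.003 for i in range(335)]
ttft, tps = stream_metrics(0.0, times)  # ttft = 0.31 s, tps ≈ 333 tok/s
```

Measuring throughput from the first token onward (rather than from the request) is why TTFT and tok/s are reported as separate numbers: a model can lead on one and trail on the other.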
The full picture
The scorecard
| Metric | Winner | Open-source | Proprietary | Gap |
|---|---|---|---|---|
| Code (SWE-bench) | Opus 4.6 | Kimi 76.8% | Opus 80.8% | -4 pts |
| Reasoning (HLE) | R1 | R1 50.2% | GPT-5.4 41.6% | +8.6 pts |
| Knowledge (MMLU-Pro) | GPT-5.4 | Kimi 87.1% | GPT-5.4 88.5% | -1.4 pts |
| Speed (tok/s) | Kimi K2.5 | 334 t/s | GPT-5.4 78 t/s | 4.3x faster |
| Latency (TTFT) | Kimi K2.5 | 0.31s | GPT-5.4 0.95s | 3x faster |
Open-source wins 3 out of 5. Proprietary models lead on Code (by 4 points) and Knowledge (by 1.4 points). Open-source leads on Reasoning (by 8.6 points), Speed (by 4.3x), and Latency (by 3x).
Note: Kimi K2.5’s HLE score (50.2%) uses tool-augmented mode. Without tools it scores 31.5%. DeepSeek R1’s 50.2% is pure chain-of-thought reasoning without tools.
What “production-ready” actually means
- Reliable enough. Consistent quality across thousands of requests.
- Fast enough. Kimi K2.5 at 334 tok/s and 0.31s TTFT. That’s real-time.
- Capable enough. Within 4 points of the best proprietary model on code, ahead on reasoning.
- Predictable. Versioned models that don’t change without warning.
The real advantage: control
Proprietary models change under you. Fine one day, different behavior the next. No changelog, no warning. Open-source models are versioned — DeepSeek V3.2 behaves the same tomorrow as today. You choose when to upgrade.
For production workloads, that predictability is worth more than a marginal quality edge on any single benchmark.
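In practice, that control comes down to pinning an exact model identifier in every request instead of a floating "latest" alias. A minimal sketch assuming an OpenAI-style chat-completions payload — the function name and the model ID string here are placeholders, not this API's literal identifiers:

```python
import json

def chat_request(model, prompt):
    """Build a chat payload with an explicitly pinned model ID,
    so behavior changes only when you change this string."""
    return {
        # Pin the exact version; avoid "latest"-style aliases that
        # can silently swap the model under your application.
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = chat_request("deepseek-v3.2", "Summarize this changelog.")
body = json.dumps(payload)  # ready to POST to a chat-completions endpoint
```

Upgrades then become deliberate: change the pinned string in one place, re-run your evals, and roll forward (or back) on your own schedule.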
We serve 70+ open-source models through a single API. If you’re building a platform that needs LLM access for your users, see how per-key plans work.
Sources: Artificial Analysis Leaderboard · SWE-bench Leaderboard · Kimi K2.5 Benchmarks · DeepSeek V3.2 · OpenAI Pricing · Anthropic Pricing · HLE Leaderboard · MMLU-Pro Leaderboard