Open-source models are production-ready. Here's the proof.

There’s a persistent assumption in the industry: open-source models are fine for experimentation, but production workloads need GPT-5 or Claude Opus. We run open-source models in production every day. Here’s what the benchmarks actually say.

We’re comparing 5 models across 5 metrics — the same models in every chart, no cherry-picking:

Open-source (available via our API): DeepSeek V3.2, DeepSeek R1, Kimi K2.5
Proprietary (reference): Claude Opus 4.6, GPT-5.4


Code quality: SWE-bench Verified (% resolved)

  • Claude Opus 4.6: 80.8%
  • GPT-5.4: ~80.0%
  • Kimi K2.5: 76.8%
  • DeepSeek V3.2: 73.0%
  • DeepSeek R1: 57.6%

Proprietary models lead here. Opus 4.6 and GPT-5.4 are within a point of each other at ~80%. Kimi K2.5 is 4 points behind at 76.8% — competitive but not leading. R1 is a reasoning model, not optimized for code.


Reasoning: HLE (% correct)

  • Kimi K2.5*: 50.2%
  • DeepSeek R1: 50.2%
  • GPT-5.4: 41.6%
  • Claude Opus 4.6: 40.0%
  • DeepSeek V3.2: 39.3%

Open-source wins decisively. R1 hits 50.2% and Kimi K2.5 matches it with tool-use enabled (*without tools: 31.5%). Both beat Opus 4.6 (40%) and GPT-5.4 (41.6%). V3.2 is roughly at Opus level — it’s a general model, not a reasoning specialist.

*Kimi K2.5’s HLE score uses its agentic mode with tool access. This is how the model is designed to be used in production.


Knowledge: MMLU-Pro (% correct)

  • GPT-5.4: 88.5%
  • Kimi K2.5: 87.1%
  • DeepSeek V3.2: 85.0%
  • DeepSeek R1: 84.0%
  • Claude Opus 4.6: 82.0%

GPT-5.4 leads narrowly at 88.5%, but Kimi K2.5 is 1.4 points behind and all three open-source models beat Opus 4.6. The gap across all 5 models is only 6.5 points — this benchmark is nearly saturated.


Speed: output tokens per second

  • Kimi K2.5: 334 t/s
  • GPT-5.4: ~78 t/s
  • DeepSeek V3.2: ~60 t/s
  • Claude Opus 4.6: 46 t/s
  • DeepSeek R1: ~30 t/s

Kimi K2.5 at 334 tok/s is in a different league — 4x faster than GPT-5.4, 7x faster than Opus 4.6. R1 is the slowest (expected — it’s a reasoning model producing chain-of-thought tokens).


Latency: time to first token (TTFT, seconds)

  • Kimi K2.5: 0.31s
  • GPT-5.4: ~0.95s
  • DeepSeek V3.2: 1.18s
  • DeepSeek R1: ~2.0s
  • Claude Opus 4.6: 2.48s

Lower is better. Kimi K2.5 responds 8x faster than Opus 4.6 and 3x faster than GPT-5.4. Even V3.2 beats both proprietary models. Opus 4.6 is the slowest model in this comparison.

Speed and TTFT measured on our production infrastructure. Claude and GPT-5.4 data from Artificial Analysis.


Metric               | Winner    | Open-source | Proprietary    | Gap
Code (SWE-bench)     | Opus 4.6  | Kimi 76.8%  | Opus 80.8%     | -4 pts
Reasoning (HLE)      | R1        | R1 50.2%    | GPT-5.4 41.6%  | +8.6 pts
Knowledge (MMLU-Pro) | GPT-5.4   | Kimi 87.1%  | GPT-5.4 88.5%  | -1.4 pts
Speed (tok/s)        | Kimi K2.5 | 334 t/s     | GPT-5.4 78 t/s | 4.3x faster
Latency (TTFT)      | Kimi K2.5 | 0.31s       | GPT-5.4 0.95s  | 3x faster

Open-source wins 3 out of 5. Proprietary models lead on Code (by 4 points) and Knowledge (by 1.4 points). Open-source leads on Reasoning (by 8.6 points), Speed (by 4.3x), and Latency (by 3x).

Note: Kimi K2.5’s HLE score (50.2%) uses tool-augmented mode. Without tools it scores 31.5%. DeepSeek R1’s 50.2% is pure chain-of-thought reasoning without tools.


What “production-ready” actually means

  1. Reliable enough. Consistent quality across thousands of requests.
  2. Fast enough. Kimi K2.5 at 334 tok/s and 0.31s TTFT. That’s real-time.
  3. Capable enough. Within 4 points of the best proprietary model on code, ahead on reasoning.
  4. Predictable. Versioned models that don’t change without warning.

Proprietary models change under you. Fine one day, different behavior the next. No changelog, no warning. Open-source models are versioned — DeepSeek V3.2 behaves the same tomorrow as today. You choose when to upgrade.

For production workloads, that predictability is worth more than a marginal quality edge on any single benchmark.


We serve 70+ open-source models through a single API. If you’re building a platform that needs LLM access for your users, see how per-key plans work.

Sources: Artificial Analysis Leaderboard · SWE-bench Leaderboard · Kimi K2.5 Benchmarks · DeepSeek V3.2 · OpenAI Pricing · Anthropic Pricing · HLE Leaderboard · MMLU-Pro Leaderboard

What it takes to build your own LLM inference platform

If you’re building a SaaS that needs to give users access to LLMs, you have two options: build the infrastructure yourself, or use a platform that does it for you. Here’s what “build it yourself” actually looks like.

This isn’t theoretical. We built this. Here’s every component, what it does, and what alternatives exist.

Before you write a single line of code, you need access to models.

Self-host on your own hardware: Buy GPUs, rent datacenter space, run the models yourself. Full control, best unit economics at scale — but massive upfront cost and you’re limited to the models you can afford to deploy. Running DeepSeek V3.2 requires multiple high-end GPUs. Running 70+ models? You’d need a data center.

Rent infrastructure: Use GPU clouds like Vast.ai, AWS, Hetzner, CoreWeave, or Lambda. No hardware to buy, but you still manage deployments, scaling, and failover. Costs add up fast — a single H100 runs $2-4/hr.

Use an inference provider: Sign agreements with DeepInfra, Together.ai, Fireworks, etc. who already have the models deployed. Pay per token, no GPU management. But you depend on their availability, pricing, and terms. If they change prices or drop a model, you need a plan B.

Mix: Most serious platforms end up here. Own hardware for high-volume models where the unit economics justify it, rented GPUs for burst capacity, and provider agreements for the long tail of models nobody runs enough to self-host.

Self-hosting 70+ models on your own is economically unrealistic. The real question is where to draw the line between own infra, rented compute, and providers.
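Where that line sits comes down to utilization arithmetic. The sketch below is a back-of-the-envelope break-even check; every number in it (GPU rental price, throughput under batching, provider price) is an illustrative assumption, not a quote:

```python
# Back-of-the-envelope break-even: rented GPU vs. per-token provider.
# All numbers are illustrative assumptions, not real quotes.

GPU_COST_PER_HOUR = 3.00        # mid-range H100 rental (the $2-4/hr band above)
GPU_THROUGHPUT_TPS = 2500       # assumed aggregate tokens/sec under batching
PROVIDER_PRICE_PER_MTOK = 0.50  # assumed blended provider price per 1M tokens

tokens_per_hour = GPU_THROUGHPUT_TPS * 3600
gpu_cost_per_mtok = GPU_COST_PER_HOUR / (tokens_per_hour / 1_000_000)

print(f"Self-hosted cost: ${gpu_cost_per_mtok:.2f} per 1M tokens")
print(f"Provider price:   ${PROVIDER_PRICE_PER_MTOK:.2f} per 1M tokens")

# Self-hosting only wins if the GPU stays busy: at low utilization the
# effective per-token cost scales up by 1/utilization.
for utilization in (1.0, 0.5, 0.1):
    effective = gpu_cost_per_mtok / utilization
    print(f"  at {utilization:.0%} utilization: ${effective:.2f}/MTok")
```

Under these assumptions self-hosting beats the provider price only above roughly two-thirds utilization, which is why low-traffic long-tail models end up on providers.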

If you self-host or rent GPUs, you need software to serve the models:

  • vLLM — most popular, good throughput, active community
  • TGI (Text Generation Inference) — Hugging Face’s solution, solid for single-model deployments
  • TensorRT-LLM — NVIDIA’s optimized engine, best raw performance but harder to set up
  • SGLang — newer, fast, good for structured generation

You’ll also need to handle model weights, quantization, scaling across GPUs, and failover when a node goes down. This is a full-time ops job.

Your users shouldn’t hit the inference backend directly. You need a proxy that:

  • Translates between API formats (OpenAI, Anthropic)
  • Routes requests to the right model/provider
  • Injects authentication
  • Handles retries and failover
  • Strips provider headers so users don’t know your backend

Options:

  • Build from scratch with Express/Fastify + http-proxy-middleware
  • Use an open-source gateway: LiteLLM, Portkey, Kong AI Gateway, MLflow Gateway
  • Use a managed gateway: Helicone, Braintrust, Promptlayer

Each has trade-offs. Open-source gateways give you control but you manage the deployment. Managed gateways are easier but add latency and cost.
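Two of the proxy's jobs above, routing and header hygiene, can be sketched in a few lines. This is a minimal illustration with made-up names (`MODEL_ROUTES`, `PROVIDER_HEADERS`, example URLs); a real gateway adds streaming, retries, and failover:

```python
# Minimal sketch of a proxy's routing + header-stripping logic.
# Route table and header names are illustrative, not a real deployment.

MODEL_ROUTES = {
    "deepseek-v3.2": "https://provider-a.example/v1/chat/completions",
    "kimi-k2.5": "https://provider-b.example/v1/chat/completions",
}

# Provider-identifying headers to strip before the response reaches the user.
PROVIDER_HEADERS = {"x-provider-id", "x-upstream-region", "server"}

def route(model: str) -> str:
    """Pick the upstream URL for a model, failing loudly on unknown models."""
    try:
        return MODEL_ROUTES[model]
    except KeyError:
        raise ValueError(f"unknown model: {model}")

def sanitize(headers: dict) -> dict:
    """Drop headers that would reveal which backend served the request."""
    return {k: v for k, v in headers.items() if k.lower() not in PROVIDER_HEADERS}

print(route("kimi-k2.5"))
print(sanitize({"Content-Type": "application/json", "X-Provider-Id": "a"}))
```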

Two layers:

User auth (dashboard login)

  • Firebase Auth, Auth0, Clerk, Supabase Auth, or roll your own
  • Supports email, Google, GitHub, wallet signatures

API key auth (inference requests)

  • Generate API keys per user
  • Validate on every request before proxying
  • Store key metadata (plan, rate limits, owner)

This is where it gets interesting for platforms. You need per-key plans — each key with its own rate limits and usage tracking. Most auth solutions don’t do this out of the box. You’ll need a custom key management layer.
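One way that custom layer can look, as a sketch: keys are opaque random tokens, stored hashed, with per-key metadata attached. The field names and in-memory dict are illustrative; in production the store is a database table:

```python
# Sketch of a per-key metadata store. Keys are shown to the user once and
# only their hash is persisted. Schema fields (plan, rpm, tpm) are illustrative.
import hashlib
import secrets

KEYS = {}  # sha256(key) -> metadata; stands in for a database table

def create_key(owner: str, plan: str, rpm: int, tpm: int) -> str:
    raw = "sk-" + secrets.token_urlsafe(24)
    digest = hashlib.sha256(raw.encode()).hexdigest()
    KEYS[digest] = {"owner": owner, "plan": plan, "rpm": rpm, "tpm": tpm}
    return raw  # returned once; never stored in plaintext

def authenticate(raw: str) -> dict:
    """Validate a key on every request before proxying; return its metadata."""
    digest = hashlib.sha256(raw.encode()).hexdigest()
    meta = KEYS.get(digest)
    if meta is None:
        raise PermissionError("invalid API key")
    return meta

key = create_key("acme", plan="pro", rpm=600, tpm=200_000)
print(authenticate(key)["plan"])  # each key carries its own plan and limits
```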

Per-key rate limiting with at least:

  • RPM (requests per minute)
  • TPM (tokens per minute)
  • Budget caps (dollar amount per time window)

This needs to be enforced at the proxy layer, before the request hits the inference backend. Otherwise a single user can exhaust your GPU allocation.

Options:

  • Redis-based counters (most common)
  • Token bucket algorithms
  • Proxy-level enforcement (some gateways include this)

If you’re using per-key plans, each key needs its own set of limits. Not one global limit — individual limits per key.
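The per-key token bucket can be sketched as below. This uses in-process state for brevity; the Redis-based version replaces the dict with atomic counters but keeps the same shape:

```python
# Token-bucket limiter keyed per API key: one bucket, and one limit, per key.
# In-process state for illustration; production uses Redis or similar.
import time

class Bucket:
    def __init__(self, rate_per_min: float, burst: float):
        self.rate = rate_per_min / 60.0   # refill rate in tokens per second
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

buckets = {}  # api_key -> Bucket

def check(api_key: str, rpm: int) -> bool:
    """Enforce at the proxy layer, before the inference backend sees anything."""
    bucket = buckets.setdefault(api_key, Bucket(rate_per_min=rpm, burst=rpm))
    return bucket.allow()
```

TPM enforcement follows the same pattern with `cost` set to the request's token count instead of 1.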

You need to know:

  • How many tokens each key consumed (input + output)
  • What model was used
  • Cost per request
  • Aggregate usage per user, per day, per billing period

For subscription billing:

  • Stripe for card payments
  • Budget windows (e.g., $X per 5-hour period)
  • Automatic key revocation when subscription expires

For pay-as-you-go:

  • Credit balance per user
  • Deduct per request based on token count × model price
  • Top-up flow (Stripe, crypto, etc.)

For crypto payments:

  • USDC on a supported chain
  • On-chain transaction verification
  • Wallet connector in the dashboard (wagmi, viem, etc.)

This is a significant amount of code. Usage tracking alone requires intercepting every response to count tokens, calculating cost based on the model’s pricing, and storing it per key.
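The core of that code, token count × model price with a balance check, is small; the volume comes from the edge cases around it. A sketch with made-up per-token prices (the real catalog is per-model and synced from providers):

```python
# Per-request cost calculation and credit deduction.
# Prices are illustrative, in USD per 1M tokens.
PRICING = {
    "deepseek-v3.2": {"input": 0.27, "output": 0.40},
    "kimi-k2.5": {"input": 0.60, "output": 2.50},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: tokens in each direction times that direction's price."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def charge(balance: float, model: str, inp: int, out: int) -> float:
    """Deduct one request from a prepaid credit balance; refuse if insufficient."""
    cost = request_cost(model, inp, out)
    if cost > balance:
        raise RuntimeError("insufficient credits")
    return balance - cost

cost = request_cost("deepseek-v3.2", input_tokens=1200, output_tokens=350)
print(f"${cost:.6f}")  # → $0.000464
```

The hard part in practice is getting `input_tokens`/`output_tokens` reliably: streaming responses must be intercepted and counted before the totals are written per key.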

Your users need a web UI to:

  • Create and manage API keys
  • View usage per key (tokens, requests, cost)
  • Subscribe to plans or top up credits
  • See available models and pricing

Tech stack typically:

  • React/Next.js/Vue frontend
  • REST API backend
  • Real-time usage updates

For platforms (your users creating keys for their users), you also need a management API — programmatic key creation, plan assignment, usage queries.

Models change. New ones come out weekly. You need:

  • A catalog of which models you serve
  • Pricing per model (input/output cost per token)
  • Sync mechanism to update prices when providers change them
  • Display names, categories, tags for the dashboard
  • Cached-input pricing metadata (some models offer prompt-caching discounts)

This is an ongoing operational burden, not a one-time setup.

Your users need:

  • API reference (endpoints, request/response formats)
  • SDK examples (Python, Node.js, at minimum)
  • Authentication guide
  • Billing/usage documentation
  • Quick start guide

This is easily 20-30 pages of documentation that needs to stay current.

  • Health checks on the inference backend
  • Status page for users
  • Alerting when latency spikes or errors increase
  • Logging (but not logging prompt content — privacy)
  • Graceful degradation when a model or provider is down

  • Privacy policy
  • Data handling documentation
  • GDPR compliance if you serve EU users
  • Decision: do you store prompts? (You shouldn’t)
  • SOC 2 / ISO 27001 if targeting enterprise

Component                | Ongoing maintenance
Inference backend        | High — scaling, failover, model updates
API proxy                | Medium — format changes, new providers
Auth + key management    | Low
Per-key rate limiting    | Low
Usage tracking + billing | Medium — edge cases, reconciliation
Dashboard                | Medium — new features, UX
Model catalog            | High — weekly model updates
Documentation            | Medium — keep current
Monitoring               | Low
Privacy/compliance       | Low

Building is the easy part. The hard part is what breaks with real users:

  • A provider changes their API format without warning. Your proxy returns 500s for 2 hours until you notice.
  • A model gets deprecated. Your users’ hardcoded model IDs stop working overnight.
  • Token counting has an off-by-one bug. You’ve been undercharging for 3 weeks. Your margin is gone.
  • A user finds a way to exceed rate limits through concurrent requests. Your inference bill spikes 10x in one afternoon.
  • Stripe webhook fails silently. A user’s subscription expired but their API key still works. Free inference for a month.
  • You push a billing update and break the usage tracking. Three days of missing data. Users open tickets.

Each of these has happened to us. We fixed them. The question is whether you want to fix them yourself, with your users waiting, or use a platform that already has.

The alternative: use an inference platform that already has all of this, create API keys for your users, and ship your product this week.


We built all of the above so you don’t have to. See how per-key plans work.