
What it takes to build your own LLM inference platform

If you’re building a SaaS that needs to give users access to LLMs, you have two options: build the infrastructure yourself, or use a platform that does it for you. Here’s what “build it yourself” actually looks like.

This isn’t theoretical. We built this. Here’s every component, what it does, and what alternatives exist.

Before you write a single line of code, you need access to models. Two paths:

Self-host: You download open weights (Llama, DeepSeek, Qwen, etc.) and run them on your own GPUs. You get full control, but you’re limited to the models you can afford to deploy. Running a single large model like DeepSeek V3.2 requires multiple high-end GPUs. Running 70+ models? You’d need a data center.

Use a provider: You sign agreements with one or more inference providers (DeepInfra, Together.ai, Fireworks, etc.) who already have the models deployed. You get access to many models, but you depend on their availability, pricing, and terms. If they change prices or drop a model, you need a plan B.

Most platforms end up with a provider. Self-hosting 70+ models is economically unrealistic for anyone short of a well-funded infrastructure company.

Once you have access to models, you need infrastructure to serve them. Three options:

Self-hosted (bare metal or cloud GPUs)

  • Rent A100s/H100s from AWS, GCP, Lambda, or CoreWeave
  • Run vLLM, TGI, or TensorRT-LLM as the serving engine
  • Manage model weights, quantization, scaling, failover
  • Cost: $2-8/hr per GPU. A single DeepSeek V3.2 deployment needs multiple GPUs.

Serverless inference providers

  • DeepInfra, Together.ai, Fireworks, Replicate, Anyscale
  • Pay per token, no GPU management
  • Fast to start, but you’re locked into their pricing and availability

Managed endpoints

  • AWS SageMaker, Google Vertex AI, Azure ML
  • Enterprise-grade but complex setup and expensive

For most teams, a serverless provider is the right starting point. Self-hosting only makes sense at very high volume (millions of requests/day).
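As a rough sanity check on the self-hosted option, here's the back-of-envelope arithmetic for a single multi-GPU deployment. The 8-GPU count and $4/hr rate are assumptions, picked from the middle of the range quoted above:

```python
# Back-of-envelope monthly cost for one self-hosted deployment.
# Assumed figures: 8 GPUs at $4/hr (mid-range of the $2-8/hr above).
gpus = 8
hourly_rate = 4.0          # USD per GPU-hour (assumption)
hours_per_month = 24 * 30

monthly_cost = gpus * hourly_rate * hours_per_month
print(monthly_cost)  # 23040.0 USD per month, before bandwidth and storage
```

That's one model. Multiply by every model you want to keep warm, and the "you'd need a data center" point above becomes concrete.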

Your users shouldn’t hit the inference backend directly. You need a proxy that:

  • Translates between API formats (OpenAI, Anthropic)
  • Routes requests to the right model/provider
  • Injects authentication
  • Handles retries and failover
  • Strips provider headers so users can’t tell which backend you use
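The header rewriting in the list above can be sketched as two pure functions, assuming a dict-based request pipeline. The function names and the specific headers stripped are illustrative, not from any particular gateway:

```python
def prepare_upstream_headers(client_headers, upstream_api_key):
    """Rewrite headers before proxying: inject our provider credential
    and drop the client's own Authorization header."""
    dropped = {"authorization", "host", "content-length"}
    headers = {k: v for k, v in client_headers.items()
               if k.lower() not in dropped}
    headers["Authorization"] = f"Bearer {upstream_api_key}"
    return headers

def scrub_response_headers(upstream_headers):
    """Strip headers that reveal which provider served the request.
    The exact revealing headers vary by provider; these are examples."""
    revealing = {"server", "via", "x-powered-by"}
    return {k: v for k, v in upstream_headers.items()
            if k.lower() not in revealing}
```

The same pair of hooks is where format translation and retry logic usually live as well.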

Options:

  • Build from scratch with Express/Fastify + http-proxy-middleware
  • Use an open-source gateway: LiteLLM, Portkey, Kong AI Gateway, MLflow Gateway
  • Use a managed gateway: Helicone, Braintrust, Promptlayer

Each has trade-offs. Open-source gateways give you control but you manage the deployment. Managed gateways are easier but add latency and cost.
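Whichever option you pick, the retry-and-failover behavior looks roughly like this. A sketch only: the provider callables and attempt counts are placeholders, and a real proxy would distinguish retryable errors (timeouts, 5xx) from permanent ones:

```python
def call_with_failover(providers, request, max_attempts_per_provider=2):
    """Try each provider in priority order; retry a couple of times,
    then move to the next provider on failure."""
    last_error = None
    for provider in providers:  # ordered by preference (price, latency)
        for _ in range(max_attempts_per_provider):
            try:
                return provider(request)
            except Exception as e:
                last_error = e
    raise RuntimeError("all providers failed") from last_error
```

This is the "plan B" mentioned earlier: if a provider drops a model or starts erroring, requests silently flow to the next one.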

Authentication comes in two layers:

User auth (dashboard login)

  • Firebase Auth, Auth0, Clerk, Supabase Auth, or roll your own
  • Supports email, Google, GitHub, wallet signatures

API key auth (inference requests)

  • Generate API keys per user
  • Validate on every request before proxying
  • Store key metadata (plan, rate limits, owner)

This is where it gets interesting for platforms. You need per-key plans — each key with its own rate limits and usage tracking. Most auth solutions don’t do this out of the box. You’ll need a custom key management layer.
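A minimal sketch of such a key management layer, assuming SHA-256-hashed keys in an in-memory dict. A real system would use a database; the key prefix and field names are illustrative:

```python
import hashlib
import secrets

KEYS = {}  # hashed key -> metadata; in production, a database table

def create_api_key(owner, plan, rpm_limit, tpm_limit):
    """Generate a key, store only its hash, attach a per-key plan."""
    raw_key = "sk-" + secrets.token_urlsafe(32)
    key_hash = hashlib.sha256(raw_key.encode()).hexdigest()
    KEYS[key_hash] = {"owner": owner, "plan": plan,
                      "rpm_limit": rpm_limit, "tpm_limit": tpm_limit}
    return raw_key  # shown to the user once, never stored in plaintext

def lookup_key(raw_key):
    """Validate an incoming key; return its metadata, or None."""
    return KEYS.get(hashlib.sha256(raw_key.encode()).hexdigest())
```

The important property is that every key carries its own plan and limits, so the proxy can enforce them per key rather than per account.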

You need per-key rate limiting with at least:

  • RPM (requests per minute)
  • TPM (tokens per minute)
  • Budget caps (dollar amount per time window)

This needs to be enforced at the proxy layer, before the request hits the inference backend. Otherwise a single user can exhaust your GPU allocation.

Options:

  • Redis-based counters (most common)
  • Token bucket algorithms
  • Proxy-level enforcement (some gateways include this)

If you’re using per-key plans, each key needs its own set of limits. Not one global limit — individual limits per key.
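A per-key token bucket is one way to implement this. The sketch below keeps state in process memory for illustration; production systems typically back it with Redis so limits hold across proxy instances:

```python
import time

class TokenBucket:
    """Token bucket: capacity = burst size, refill = tokens per second."""
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill = refill_per_sec
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per (key, dimension): e.g. an RPM limit of 60 becomes
# capacity 60 with a refill of 1 token per second.
buckets = {}

def check_rate_limit(api_key, limits):
    bucket = buckets.setdefault(
        api_key, TokenBucket(limits["rpm"], limits["rpm"] / 60))
    return bucket.allow()
```

TPM limits work the same way with token counts as the cost, and budget caps are better handled in the billing layer below.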

To track usage, you need to know:

  • How many tokens each key consumed (input + output)
  • What model was used
  • Cost per request
  • Aggregate usage per user, per day, per billing period
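The per-request accounting above can be sketched as follows, assuming per-million-token pricing. The field names and figures are illustrative:

```python
def request_cost(input_tokens, output_tokens, pricing):
    """Cost of one request given per-million-token prices."""
    return (input_tokens * pricing["input_per_mtok"]
            + output_tokens * pricing["output_per_mtok"]) / 1_000_000

usage_log = []  # in production: an append-only table, aggregated per period

def record_usage(api_key, model, input_tokens, output_tokens, pricing):
    """Intercept a completed response and store its usage per key."""
    usage_log.append({
        "key": api_key, "model": model,
        "input_tokens": input_tokens, "output_tokens": output_tokens,
        "cost": request_cost(input_tokens, output_tokens, pricing),
    })
```

Aggregates per user, per day, and per billing period are then queries over this log.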

For subscription billing:

  • Stripe for card payments
  • Budget windows (e.g., $X per 5-hour period)
  • Automatic key revocation when subscription expires
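The budget-window and revocation checks can be written as pure functions over the spend log. A sketch; the 5-hour window and field names are assumptions carried over from the example above:

```python
import time

def spend_in_window(events, window_seconds, now=None):
    """Sum spend within the trailing window (e.g. 5 hours = 18000 s)."""
    now = now if now is not None else time.time()
    return sum(e["cost"] for e in events if now - e["ts"] <= window_seconds)

def key_allowed(events, budget_usd, window_seconds,
                subscription_active, now=None):
    """A key is usable only if the subscription is live and the
    trailing-window spend is under the budget cap."""
    if not subscription_active:
        return False  # expired subscription: key is effectively revoked
    return spend_in_window(events, window_seconds, now) < budget_usd
```

Running this check at the proxy, before forwarding, is what turns a billing rule into an enforced limit.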

For pay-as-you-go:

  • Credit balance per user
  • Deduct per request based on token count × model price
  • Top-up flow (Stripe, crypto, etc.)
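A minimal credit-ledger sketch for the pay-as-you-go model. In-memory for illustration only; a real ledger needs transactional storage so concurrent requests can't double-spend:

```python
balances = {}  # user -> credit balance in USD

def top_up(user, amount):
    """Credit a user's balance after a successful payment."""
    balances[user] = balances.get(user, 0.0) + amount

def deduct(user, tokens_in, tokens_out, pricing):
    """Deduct one request's cost; reject if the balance would go negative."""
    cost = (tokens_in * pricing["input_per_mtok"]
            + tokens_out * pricing["output_per_mtok"]) / 1_000_000
    if balances.get(user, 0.0) < cost:
        return False  # insufficient credits: block the request
    balances[user] -= cost
    return True
```

For streaming responses this gets harder: output token counts are only known at the end, so you either reserve an estimate up front or accept small overdrafts.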

For crypto payments:

  • USDC on a supported chain
  • On-chain transaction verification
  • Wallet connector in the dashboard (wagmi, viem, etc.)

This is a significant amount of code. Usage tracking alone requires intercepting every response to count tokens, calculating cost based on the model’s pricing, and storing it per key.

Your users need a web UI to:

  • Create and manage API keys
  • View usage per key (tokens, requests, cost)
  • Subscribe to plans or top up credits
  • See available models and pricing

Tech stack typically:

  • React/Next.js/Vue frontend
  • REST API backend
  • Real-time usage updates

For platform use cases (your users creating keys for their users), you also need a management API: programmatic key creation, plan assignment, usage queries.

Models change. New ones come out weekly. You need:

  • A catalog of which models you serve
  • Pricing per model (input/output cost per token)
  • Sync mechanism to update prices when providers change them
  • Display names, categories, tags for the dashboard
  • Cache pricing metadata (some models support prompt caching discounts)
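A sketch of what a catalog entry and a price sync might look like. The model ID, tags, and prices are placeholders, not real figures:

```python
catalog = {
    # model id -> display metadata and pricing (placeholder values)
    "llama-3.1-70b": {"display_name": "Llama 3.1 70B", "tags": ["chat"],
                      "input_per_mtok": 0.30, "output_per_mtok": 0.40},
}

def sync_prices(provider_prices):
    """Fold a provider's latest price sheet into the catalog and report
    which models changed, so stale prices never reach billing.
    `provider_prices` maps model id -> (input, output) per-Mtok prices."""
    changed = []
    for model_id, prices in provider_prices.items():
        entry = catalog.get(model_id)
        if entry and (entry["input_per_mtok"],
                      entry["output_per_mtok"]) != prices:
            entry["input_per_mtok"], entry["output_per_mtok"] = prices
            changed.append(model_id)
    return changed
```

Running a sync like this on a schedule, and alerting on the `changed` list, is what keeps your billing from silently drifting under a provider's price change.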

This is an ongoing operational burden, not a one-time setup.

Your users also need documentation:

  • API reference (endpoints, request/response formats)
  • SDK examples (Python, Node.js, at minimum)
  • Authentication guide
  • Billing/usage documentation
  • Quick start guide

This is easily 20-30 pages of documentation that needs to stay current.

For monitoring and reliability:

  • Health checks on the inference backend
  • Status page for users
  • Alerting when latency spikes or errors increase
  • Logging (but not prompt content, for privacy)
  • Graceful degradation when a model or provider is down

For privacy and compliance:

  • Privacy policy
  • Data handling documentation
  • GDPR compliance if you serve EU users
  • Decision: do you store prompts? (You shouldn’t)
  • SOC 2 / ISO 27001 if targeting enterprise

| Component | Build time | Ongoing maintenance |
| --- | --- | --- |
| Inference backend | 1-2 weeks (serverless) or months (self-hosted) | High: scaling, failover, model updates |
| API proxy | 1-2 weeks | Medium: format changes, new providers |
| Auth + key management | 1-2 weeks | Low |
| Per-key rate limiting | 1 week | Low |
| Usage tracking + billing | 2-4 weeks | Medium: edge cases, reconciliation |
| Dashboard | 2-4 weeks | Medium: new features, UX |
| Model catalog | 1 week | High: weekly model updates |
| Documentation | 1-2 weeks | Medium: keep current |
| Monitoring | 1 week | Low |
| Privacy/compliance | 1 week | Low |

Total: 3-5 months for a production-ready platform. And that’s with a small team moving fast, using existing services where possible.

The alternative: use an inference platform that already has all of this, create API keys for your users, and ship your product this week.


We built all of the above so you don’t have to. See how per-key plans work.