
What it takes to build your own LLM inference platform

If you’re building a SaaS that needs to give users access to LLMs, you have two options: build the infrastructure yourself, or use a platform that does it for you. Here’s what “build it yourself” actually looks like.

This isn’t theoretical. We built this. Here’s every component, what it does, and what alternatives exist.

Before you write a single line of code, you need access to models. Two paths:

Self-host: You download open weights (Llama, DeepSeek, Qwen, etc.) and run them on your own GPUs. You get full control, but you’re limited to the models you can afford to deploy. Running a single large model like DeepSeek V3.2 requires multiple high-end GPUs. Running 70+ models? You’d need a data center.

Use a provider: You sign agreements with one or more inference providers (DeepInfra, Together.ai, Fireworks, etc.) who already have the models deployed. You get access to many models, but you depend on their availability, pricing, and terms. If they change prices or drop a model, you need a plan B.

Most platforms end up with a provider. Self-hosting 70+ models is economically unrealistic for anyone short of a well-funded infrastructure company.

Once you have access to models, you need infrastructure to serve them. Three options:

Self-hosted (bare metal or cloud GPUs)

  • Rent A100s/H100s from AWS, GCP, Lambda, or CoreWeave
  • Run vLLM, TGI, or TensorRT-LLM as the serving engine
  • Manage model weights, quantization, scaling, failover
  • Cost: $2-8/hr per GPU. A single DeepSeek V3.2 deployment needs multiple GPUs.

Serverless inference providers

  • DeepInfra, Together.ai, Fireworks, Replicate, Anyscale
  • Pay per token, no GPU management
  • Fast to start, but you’re locked into their pricing and availability

Managed endpoints

  • AWS SageMaker, Google Vertex AI, Azure ML
  • Enterprise-grade but complex setup and expensive

For most teams, a serverless provider is the right starting point. Self-hosting only makes sense at very high volume (millions of requests/day).
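As a rough sanity check on the self-hosted option, here's the back-of-envelope arithmetic for a single multi-GPU deployment. The 8-GPU count and $4/hr rate are assumptions, picked from the middle of the range quoted above:

```python
# Back-of-envelope monthly cost for one self-hosted deployment.
# Assumed figures: 8 GPUs at $4/hr (mid-range of the $2-8/hr above).
gpus = 8
hourly_rate = 4.0          # USD per GPU-hour (assumption)
hours_per_month = 24 * 30

monthly_cost = gpus * hourly_rate * hours_per_month
print(monthly_cost)  # 23040.0 USD per month, before bandwidth and storage
```

That's one model. Multiply by every model you want to keep warm, and the "you'd need a data center" point above becomes concrete.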

Your users shouldn’t hit the inference backend directly. You need a proxy that:

  • Translates between API formats (OpenAI, Anthropic)
  • Routes requests to the right model/provider
  • Injects authentication
  • Handles retries and failover
  • Strips provider headers so users can’t tell which backend you use
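The header rewriting in the list above can be sketched as two pure functions, assuming a dict-based request pipeline. The function names and the specific headers stripped are illustrative, not from any particular gateway:

```python
def prepare_upstream_headers(client_headers, upstream_api_key):
    """Rewrite headers before proxying: inject our provider credential
    and drop the client's own Authorization header."""
    dropped = {"authorization", "host", "content-length"}
    headers = {k: v for k, v in client_headers.items()
               if k.lower() not in dropped}
    headers["Authorization"] = f"Bearer {upstream_api_key}"
    return headers

def scrub_response_headers(upstream_headers):
    """Strip headers that reveal which provider served the request.
    The exact revealing headers vary by provider; these are examples."""
    revealing = {"server", "via", "x-powered-by"}
    return {k: v for k, v in upstream_headers.items()
            if k.lower() not in revealing}
```

The same pair of hooks is where format translation and retry logic usually live as well.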

Options:

  • Build from scratch with Express/Fastify + http-proxy-middleware
  • Use an open-source gateway: LiteLLM, Portkey, Kong AI Gateway, MLflow Gateway
  • Use a managed gateway: Helicone, Braintrust, Promptlayer

Each has trade-offs. Open-source gateways give you control but you manage the deployment. Managed gateways are easier but add latency and cost.
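Whichever option you pick, the retry-and-failover behavior looks roughly like this. A sketch only: the provider callables and attempt counts are placeholders, and a real proxy would distinguish retryable errors (timeouts, 5xx) from permanent ones:

```python
def call_with_failover(providers, request, max_attempts_per_provider=2):
    """Try each provider in priority order; retry a couple of times,
    then move to the next provider on failure."""
    last_error = None
    for provider in providers:  # ordered by preference (price, latency)
        for _ in range(max_attempts_per_provider):
            try:
                return provider(request)
            except Exception as e:
                last_error = e
    raise RuntimeError("all providers failed") from last_error
```

This is the "plan B" mentioned earlier: if a provider drops a model or starts erroring, requests silently flow to the next one.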

Authentication comes in two layers:

User auth (dashboard login)

  • Firebase Auth, Auth0, Clerk, Supabase Auth, or roll your own
  • Supports email, Google, GitHub, wallet signatures

API key auth (inference requests)

  • Generate API keys per user
  • Validate on every request before proxying
  • Store key metadata (plan, rate limits, owner)

This is where it gets interesting for platforms. You need per-key plans — each key with its own rate limits and usage tracking. Most auth solutions don’t do this out of the box. You’ll need a custom key management layer.
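A minimal sketch of such a key management layer, assuming SHA-256-hashed keys in an in-memory dict. A real system would use a database; the key prefix and field names are illustrative:

```python
import hashlib
import secrets

KEYS = {}  # hashed key -> metadata; in production, a database table

def create_api_key(owner, plan, rpm_limit, tpm_limit):
    """Generate a key, store only its hash, attach a per-key plan."""
    raw_key = "sk-" + secrets.token_urlsafe(32)
    key_hash = hashlib.sha256(raw_key.encode()).hexdigest()
    KEYS[key_hash] = {"owner": owner, "plan": plan,
                      "rpm_limit": rpm_limit, "tpm_limit": tpm_limit}
    return raw_key  # shown to the user once, never stored in plaintext

def lookup_key(raw_key):
    """Validate an incoming key; return its metadata, or None."""
    return KEYS.get(hashlib.sha256(raw_key.encode()).hexdigest())
```

The important property is that every key carries its own plan and limits, so the proxy can enforce them per key rather than per account.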

You need per-key rate limiting with at least:

  • RPM (requests per minute)
  • TPM (tokens per minute)
  • Budget caps (dollar amount per time window)

This needs to be enforced at the proxy layer, before the request hits the inference backend. Otherwise a single user can exhaust your GPU allocation.

Options:

  • Redis-based counters (most common)
  • Token bucket algorithms
  • Proxy-level enforcement (some gateways include this)

If you’re using per-key plans, each key needs its own set of limits. Not one global limit — individual limits per key.
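A per-key token bucket is one way to implement this. The sketch below keeps state in process memory for illustration; production systems typically back it with Redis so limits hold across proxy instances:

```python
import time

class TokenBucket:
    """Token bucket: capacity = burst size, refill = tokens per second."""
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill = refill_per_sec
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per (key, dimension): e.g. an RPM limit of 60 becomes
# capacity 60 with a refill of 1 token per second.
buckets = {}

def check_rate_limit(api_key, limits):
    bucket = buckets.setdefault(
        api_key, TokenBucket(limits["rpm"], limits["rpm"] / 60))
    return bucket.allow()
```

TPM limits work the same way with token counts as the cost, and budget caps are better handled in the billing layer below.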

To track usage, you need to know:

  • How many tokens each key consumed (input + output)
  • What model was used
  • Cost per request
  • Aggregate usage per user, per day, per billing period
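The per-request accounting above can be sketched as follows, assuming per-million-token pricing. The field names and figures are illustrative:

```python
def request_cost(input_tokens, output_tokens, pricing):
    """Cost of one request given per-million-token prices."""
    return (input_tokens * pricing["input_per_mtok"]
            + output_tokens * pricing["output_per_mtok"]) / 1_000_000

usage_log = []  # in production: an append-only table, aggregated per period

def record_usage(api_key, model, input_tokens, output_tokens, pricing):
    """Intercept a completed response and store its usage per key."""
    usage_log.append({
        "key": api_key, "model": model,
        "input_tokens": input_tokens, "output_tokens": output_tokens,
        "cost": request_cost(input_tokens, output_tokens, pricing),
    })
```

Aggregates per user, per day, and per billing period are then queries over this log.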

For subscription billing:

  • Stripe for card payments
  • Budget windows (e.g., $X per 5-hour period)
  • Automatic key revocation when subscription expires
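The budget-window and revocation checks can be written as pure functions over the spend log. A sketch; the 5-hour window and field names are assumptions carried over from the example above:

```python
import time

def spend_in_window(events, window_seconds, now=None):
    """Sum spend within the trailing window (e.g. 5 hours = 18000 s)."""
    now = now if now is not None else time.time()
    return sum(e["cost"] for e in events if now - e["ts"] <= window_seconds)

def key_allowed(events, budget_usd, window_seconds,
                subscription_active, now=None):
    """A key is usable only if the subscription is live and the
    trailing-window spend is under the budget cap."""
    if not subscription_active:
        return False  # expired subscription: key is effectively revoked
    return spend_in_window(events, window_seconds, now) < budget_usd
```

Running this check at the proxy, before forwarding, is what turns a billing rule into an enforced limit.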

For pay-as-you-go:

  • Credit balance per user
  • Deduct per request based on token count × model price
  • Top-up flow (Stripe, crypto, etc.)
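A minimal credit-ledger sketch for the pay-as-you-go model. In-memory for illustration only; a real ledger needs transactional storage so concurrent requests can't double-spend:

```python
balances = {}  # user -> credit balance in USD

def top_up(user, amount):
    """Credit a user's balance after a successful payment."""
    balances[user] = balances.get(user, 0.0) + amount

def deduct(user, tokens_in, tokens_out, pricing):
    """Deduct one request's cost; reject if the balance would go negative."""
    cost = (tokens_in * pricing["input_per_mtok"]
            + tokens_out * pricing["output_per_mtok"]) / 1_000_000
    if balances.get(user, 0.0) < cost:
        return False  # insufficient credits: block the request
    balances[user] -= cost
    return True
```

For streaming responses this gets harder: output token counts are only known at the end, so you either reserve an estimate up front or accept small overdrafts.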

For crypto payments:

  • USDC on a supported chain
  • On-chain transaction verification
  • Wallet connector in the dashboard (wagmi, viem, etc.)

This is a significant amount of code. Usage tracking alone requires intercepting every response to count tokens, calculating cost based on the model’s pricing, and storing it per key.

Your users need a web UI to:

  • Create and manage API keys
  • View usage per key (tokens, requests, cost)
  • Subscribe to plans or top up credits
  • See available models and pricing

Tech stack typically:

  • React/Next.js/Vue frontend
  • REST API backend
  • Real-time usage updates

For platform use cases (your users creating keys for their users), you also need a management API: programmatic key creation, plan assignment, usage queries.

Models change. New ones come out weekly. You need:

  • A catalog of which models you serve
  • Pricing per model (input/output cost per token)
  • Sync mechanism to update prices when providers change them
  • Display names, categories, tags for the dashboard
  • Cache pricing metadata (some models support prompt caching discounts)
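A sketch of what a catalog entry and a price sync might look like. The model ID, tags, and prices are placeholders, not real figures:

```python
catalog = {
    # model id -> display metadata and pricing (placeholder values)
    "llama-3.1-70b": {"display_name": "Llama 3.1 70B", "tags": ["chat"],
                      "input_per_mtok": 0.30, "output_per_mtok": 0.40},
}

def sync_prices(provider_prices):
    """Fold a provider's latest price sheet into the catalog and report
    which models changed, so stale prices never reach billing.
    `provider_prices` maps model id -> (input, output) per-Mtok prices."""
    changed = []
    for model_id, prices in provider_prices.items():
        entry = catalog.get(model_id)
        if entry and (entry["input_per_mtok"],
                      entry["output_per_mtok"]) != prices:
            entry["input_per_mtok"], entry["output_per_mtok"] = prices
            changed.append(model_id)
    return changed
```

Running a sync like this on a schedule, and alerting on the `changed` list, is what keeps your billing from silently drifting under a provider's price change.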

This is an ongoing operational burden, not a one-time setup.

Your users also need documentation:

  • API reference (endpoints, request/response formats)
  • SDK examples (Python, Node.js, at minimum)
  • Authentication guide
  • Billing/usage documentation
  • Quick start guide

This is easily 20-30 pages of documentation that needs to stay current.

For monitoring and reliability:

  • Health checks on the inference backend
  • Status page for users
  • Alerting when latency spikes or errors increase
  • Logging (but not prompt content, for privacy)
  • Graceful degradation when a model or provider is down

For privacy and compliance:

  • Privacy policy
  • Data handling documentation
  • GDPR compliance if you serve EU users
  • Decision: do you store prompts? (You shouldn’t)
  • SOC 2 / ISO 27001 if targeting enterprise

| Component | Build time | Ongoing maintenance |
| --- | --- | --- |
| Inference backend | 1-2 weeks (serverless) or months (self-hosted) | High: scaling, failover, model updates |
| API proxy | 1-2 weeks | Medium: format changes, new providers |
| Auth + key management | 1-2 weeks | Low |
| Per-key rate limiting | 1 week | Low |
| Usage tracking + billing | 2-4 weeks | Medium: edge cases, reconciliation |
| Dashboard | 2-4 weeks | Medium: new features, UX |
| Model catalog | 1 week | High: weekly model updates |
| Documentation | 1-2 weeks | Medium: keep current |
| Monitoring | 1 week | Low |
| Privacy/compliance | 1 week | Low |

Total: 3-5 months for a production-ready platform. And that’s with a small team moving fast, using existing services where possible.

The alternative: use an inference platform that already has all of this, create API keys for your users, and ship your product this week.


We built all of the above so you don’t have to. See how per-key plans work.