Self-hosted vs. API inference: the real cost comparison

“Why pay for an API when I can run the model myself?”

It’s a reasonable question. Open-source models are free. GPUs are available on every cloud. vLLM and Ollama make serving straightforward. The math should be simple: GPU cost per hour × hours = total cost. Done.

Except it’s not. The GPU is only a fraction of the total cost. Here’s the full picture.

The raw GPU cost

Running DeepSeek V3.2 (671B-parameter MoE, ~37B active per token) requires a multi-GPU node: at least 4× A100 80GB, or 2× H100 80GB with FP8 quantization. Qwen 3.5 397B has similar requirements.

Setup                      Hourly   Monthly (24/7)   Monthly (8h/day)
4× A100 80GB (cloud)       $12.80   $9,216           $2,816
2× H100 80GB (cloud)       $8.40    $6,048           $1,848
1× A100 80GB (Llama 70B)   $3.20    $2,304           $704
1× L40S (Llama 8B)         $1.10    $792             $242

These are cloud GPU rental prices (AWS, GCP, Lambda Labs — varies by provider and availability). If you buy hardware, the upfront cost is $15K–$40K per GPU, amortized over 3–4 years, plus electricity, cooling, and data center costs.
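To sanity-check buy-vs-rent, amortize the purchase price over the hardware’s useful life and add an overhead factor for power and facilities. A minimal back-of-envelope sketch in Python; the purchase price, lifetime, and overhead uplift are illustrative assumptions, not quotes:

  # Buy-vs-rent back of envelope (all constants are illustrative assumptions).
  GPU_PRICE = 30_000      # assumed purchase price per H100-class GPU, USD
  LIFETIME_MONTHS = 36    # 3-year amortization window
  OVERHEAD = 1.4          # assumed 40% uplift for power, cooling, rack space
  CLOUD_HOURLY = 4.20     # per-GPU rate implied by the 2x H100 row above

  owned_monthly = GPU_PRICE / LIFETIME_MONTHS * OVERHEAD
  rented_monthly = CLOUD_HOURLY * 24 * 30

  print(f"owned:  ${owned_monthly:,.0f}/month per GPU")   # ~$1,167
  print(f"rented: ${rented_monthly:,.0f}/month per GPU")  # $3,024 (24/7)

Owned hardware only wins on paper if the GPUs stay busy around the clock, which is exactly the assumption the idle-time section below challenges.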

Smaller models are cheaper — but limited


Running Llama 3.1 8B on a single L40S costs $242/month (8h/day). That’s competitive with API pricing. But 8B models can’t handle complex coding, multi-step reasoning, or nuanced analysis — the tasks where AI provides the most value.

The models worth self-hosting (70B+, MoE) require multi-GPU setups where the economics change dramatically.
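The multi-GPU requirement falls out of simple memory arithmetic: at FP16, the weights alone take two bytes per parameter, before counting KV cache and activations. A rough sketch:

  # Weight memory alone, by precision (KV cache and activations come on top).
  def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
      return params_billion * 1e9 * bytes_per_param / 1024**3

  print(f"Llama 70B @ FP16:  {weight_vram_gb(70, 2):.0f} GB")  # ~130 GB
  print(f"Llama 70B @ 8-bit: {weight_vram_gb(70, 1):.0f} GB")  # ~65 GB
  print(f"Llama 8B  @ FP16:  {weight_vram_gb(8, 2):.0f} GB")   # ~15 GB

A 70B model at FP16 does not fit on any single 80GB card; 8-bit quantization squeezes it onto one, which is why the single-A100 row in the table above is the floor for that class.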

The ops burden

GPU rental is just the beginning.

Someone has to:

  • Set up vLLM/TGI with optimal batch sizes, quantization, and memory allocation
  • Monitor GPU utilization and restart crashed processes (see the sketch below)
  • Update model weights when new versions release
  • Handle OOM errors, NCCL failures, and driver issues
  • Manage the serving infrastructure (load balancer, health checks, auto-scaling)

If this is a full-time DevOps engineer at $150K/year, that’s $12,500/month in labor. If it’s 20% of a senior engineer’s time, it’s $2,500/month. Either way, it’s more than the GPU.
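The monitoring bullet alone implies standing infrastructure. A minimal liveness loop as a sketch; it assumes vLLM’s OpenAI-compatible server on its default port and a hypothetical systemd unit named vllm:

  # Liveness check for a vLLM server (sketch only; the port and the systemd
  # unit name "vllm" are assumptions for illustration).
  import subprocess
  import time
  import urllib.request

  HEALTH_URL = "http://localhost:8000/health"  # vLLM's default port

  def healthy() -> bool:
      try:
          with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
              return resp.status == 200
      except OSError:
          return False

  while True:
      if not healthy():
          subprocess.run(["systemctl", "restart", "vllm"], check=False)
          time.sleep(300)  # a 70B-class model takes minutes to reload
      time.sleep(30)

Real deployments layer alerting, log shipping, and capacity dashboards on top of this, which is where the 20%-of-an-engineer estimate comes from.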
Idle time is pure waste

GPUs cost money whether they’re inferring or not. If your usage pattern is 8 hours of heavy use (work hours) and 16 hours of near-zero traffic, you’re paying for 24 hours and using 8.

Cloud spot instances help but introduce availability risk. Auto-scaling GPU clusters is possible but complex — model loading takes minutes, not seconds.
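The waste is easy to quantify. A short sketch reusing the 2× H100 rate from the table above:

  # Effective cost per *used* GPU-hour at work-hours-only utilization.
  HOURLY_RATE = 8.40       # 2x H100 cloud rate from the table above
  HOURS_USED_PER_DAY = 8   # heavy use during work hours
  HOURS_PAID_PER_DAY = 24  # the meter never stops

  utilization = HOURS_USED_PER_DAY / HOURS_PAID_PER_DAY
  effective_hourly = HOURLY_RATE / utilization

  print(f"sticker price:  ${HOURLY_RATE:.2f}/hour")
  print(f"effective cost: ${effective_hourly:.2f}/hour at {utilization:.0%} utilization")
  # sticker price:  $8.40/hour
  # effective cost: $25.20/hour at 33% utilization

At one-third utilization, every hour you actually use costs three times the sticker price.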

API pricing is purely usage-based. Zero requests = zero cost.
The multi-model problem

Self-hosting one model is manageable. Self-hosting five models for different tasks — a coding model, a reasoning model, a fast classification model, an embedding model, and a vision model — requires either:

  • 5 separate GPU instances (expensive)
  • Shared GPU with model swapping (slow — loading a 70B model takes 2–5 minutes)
  • A serving framework that handles multi-model routing (complex)

An API gives you access to many models through the same endpoint. No model loading, no GPU allocation, no routing logic.
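With a unified OpenAI-compatible endpoint, that routing collapses to a dictionary lookup. A sketch using the openai Python client; the base URL and model IDs are placeholders, not real identifiers:

  # Route each task to a different model through one endpoint (sketch; the
  # base URL and model names are placeholders).
  from openai import OpenAI

  client = OpenAI(base_url="https://api.example.com/v1", api_key="sk-...")

  MODEL_FOR_TASK = {
      "coding": "deepseek-v3.2",       # placeholder model IDs
      "reasoning": "qwen-3.5",
      "classification": "llama-3.1-8b",
  }

  def complete(task: str, prompt: str) -> str:
      resp = client.chat.completions.create(
          model=MODEL_FOR_TASK[task],
          messages=[{"role": "user", "content": prompt}],
      )
      return resp.choices[0].message.content

  print(complete("classification", "Bug report or feature request? ..."))

No weights to load, no GPUs to allocate; switching models is a string change.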
The opportunity cost

Every hour your team spends on inference infrastructure is an hour not spent on your actual product. For startups, this is often the most expensive line item of all — it doesn’t show up on any invoice.

A worked example

For a team of 5 developers running AI-assisted coding with a mix of DeepSeek V3.2 and smaller models:

Cost                          Self-hosted   API (per-token)   API (flat-rate)
Compute/inference             $2,800        $265              $250
Ops/maintenance               $2,500        $0                $0
Idle waste (~60% of compute)  $1,680        $0                $0
Total monthly                 $6,980        $265              $250

Self-hosting costs 26x more for the same workload. The GPU is only 40% of the self-hosted cost — ops and idle waste are the majority.

When self-hosting wins

Self-hosting wins in specific scenarios:

Data sovereignty: If your data cannot leave your network — regulated industries, government, healthcare with strict compliance — self-hosting is the only option. No API provider can guarantee the data isolation you need.

Extreme scale: If you’re processing millions of requests per day and your GPUs are consistently at 80%+ utilization, the per-token math eventually favors owned hardware. This threshold is higher than most teams expect — typically $20K+/month in API spend before self-hosting breaks even.

Custom models: If you’ve fine-tuned a model and need to serve it, self-hosting or a dedicated inference provider (Fireworks, Together) is required. Most unified APIs don’t serve custom model weights.

Latency control: If you need guaranteed sub-100 ms time-to-first-token (TTFT) and your application servers are co-located with your GPUs, self-hosting eliminates network hops.

For everyone else — startups, small teams, companies with variable usage patterns — the API is cheaper, faster to set up, and easier to maintain.

A practical path

Most teams don’t need to choose one forever. A practical approach:

  1. Start with an API: Get your product working, validate demand, understand your usage patterns.
  2. Optimize model selection: Use cheaper models for simple tasks, frontier models for hard tasks. Full guide: Multi-model architecture.
  3. Evaluate self-hosting when: Your monthly API spend exceeds $10K, your GPU utilization would be >70%, and you have DevOps capacity to maintain it (encoded as a quick check after this list).
  4. Hybrid: Self-host your high-volume models, use an API for long-tail models and overflow capacity.
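Step 3 compresses into a one-line predicate. A sketch encoding the thresholds above; tune them to your own numbers:

  # Rule-of-thumb check for step 3 (thresholds from this article).
  def worth_evaluating_self_hosting(
      monthly_api_spend: float,          # USD
      projected_gpu_utilization: float,  # 0.0 to 1.0
      has_devops_capacity: bool,
  ) -> bool:
      return (
          monthly_api_spend > 10_000
          and projected_gpu_utilization > 0.70
          and has_devops_capacity
      )

  # The 5-developer team from the worked example above:
  print(worth_evaluating_self_hosting(265, 8 / 24, False))  # False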

The worst outcome is spending 3 months setting up GPU infrastructure before you’ve validated that anyone wants your product.


CheapestInference serves many models through a single API. No GPUs to manage, no idle costs, no ops burden. Flat-rate plans from $10/month. Get started or compare plans.