Blog

Self-hosted vs. API inference: the real cost comparison

Apr 15, 2026

“Why pay for an API when I can run the model myself?”

It’s a reasonable question. Open-source models are free. GPUs are available on every cloud. vLLM and Ollama make serving straightforward. The math should be simple: GPU cost per hour × hours = total cost. Done.

Except it’s not. The GPU is the minority of the cost. Here’s the full picture.

The visible costs

GPU hardware

Running DeepSeek V3.2 (671B MoE, ~37B active parameters) requires at least 4× A100 80GB or 2× H100 80GB in FP8. Qwen 3.5 397B has similar requirements.

Setup	Hourly	Monthly (24/7)	Monthly (8h/day)
4× A100 80GB (cloud)	$12.80	$9,216	$2,816
2× H100 80GB (cloud)	$8.40	$6,048	$1,848
1× A100 80GB (Llama 70B)	$3.20	$2,304	$704
1× L40S (Llama 8B)	$1.10	$792	$242

These are cloud GPU rental prices (AWS, GCP, Lambda Labs — varies by provider and availability). If you buy hardware, the upfront cost is $15K–$40K per GPU, amortized over 3–4 years, plus electricity, cooling, and data center costs.
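
The table's arithmetic can be reproduced in a few lines of Python. This is a minimal sketch: the hourly rates are the ones quoted above, and the hour counts are assumptions inferred from the table (720 hours for a 30-day month running 24/7, while the 8h/day column works out to 220 billed hours per month).

```python
# Hourly rates quoted in the table above (USD per hour).
HOURLY_RATES = {
    "4x A100 80GB": 12.80,
    "2x H100 80GB": 8.40,
    "1x A100 80GB": 3.20,
    "1x L40S": 1.10,
}

HOURS_ALWAYS_ON = 24 * 30  # assumption: 30-day month, 720 hours
HOURS_PART_TIME = 220      # assumption: inferred from the 8h/day column


def monthly_cost(hourly_rate: float, hours: int) -> int:
    """Return the monthly rental cost in whole dollars."""
    return round(hourly_rate * hours)


for setup, rate in HOURLY_RATES.items():
    always_on = monthly_cost(rate, HOURS_ALWAYS_ON)
    part_time = monthly_cost(rate, HOURS_PART_TIME)
    print(f"{setup}: ${always_on:,} (24/7), ${part_time:,} (8h/day)")
```

Swapping in your own provider's hourly rate is the quickest way to see where your setup would land.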

Smaller models are cheaper — but limited

Running Llama 3.1 8B on a single L40S costs $242/month (8h/day). That’s competitive with API pricing. But 8B models can’t handle complex coding, multi-step reasoning, or nuanced analysis — the tasks where AI provides the most value.

The models worth self-hosting (70B+, MoE) require multi-GPU setups where the economics change dramatically.

The invisible costs

GPU rental is just the beginning.

1. Operations and maintenance

Someone has to:

Set up vLLM/TGI with optimal batch sizes, quantization, and memory allocation
Monitor GPU utilization and restart crashed processes
Update model weights when new versions release
Handle OOM errors, NCCL failures, and driver issues
Manage the serving infrastructure (load balancer, health checks, auto-scaling)

If this is a full-time DevOps engineer at $150K/year, that’s $12,500/month in labor. If it’s 20% of a senior engineer’s time, it’s $2,500/month. Either way, it rivals or exceeds the GPU bill.

2. Idle capacity

GPUs cost money whether they’re inferring or not. If your usage pattern is 8 hours of heavy use (work hours) and 16 hours of near-zero traffic, you’re paying for 24 hours and using 8.

Cloud spot instances help but introduce availability risk. Auto-scaling GPU clusters is possible but complex — model loading takes minutes, not seconds.

API pricing is purely usage-based. Zero requests = zero cost.
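
To put a number on the idle-capacity penalty, here is a small sketch. The function name and the 24-hour billing default are illustrative assumptions; the $12.80 rate and the 8-busy-hours pattern come from the text above.

```python
def effective_hourly_cost(hourly_rate: float, busy_hours: float,
                          billed_hours: float = 24) -> float:
    """Cost per hour of actual use when the instance is billed around the clock."""
    utilization = busy_hours / billed_hours
    return hourly_rate / utilization


# 4x A100 at $12.80/hour, busy 8 hours out of every 24 billed.
cost = effective_hourly_cost(12.80, busy_hours=8)
print(f"${cost:.2f} per busy hour")
```

At one-third utilization, every hour of real work costs three times the sticker rate.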

3. Multi-model overhead

Self-hosting one model is manageable. Self-hosting five models for different tasks — a coding model, a reasoning model, a fast classification model, an embedding model, and a vision model — requires either:

5 separate GPU instances (expensive)
Shared GPU with model swapping (slow — loading a 70B model takes 2–5 minutes)
A serving framework that handles multi-model routing (complex)

An API gives you access to many models through the same endpoint. No model loading, no GPU allocation, no routing logic.

4. Opportunity cost

Every hour your team spends on inference infrastructure is an hour not spent on your actual product. For startups, this is the most expensive cost of all — it doesn’t show up on any invoice.

Total cost of ownership

For a team of 5 developers running AI-assisted coding with a mix of DeepSeek V3.2 and smaller models:

Cost	Self-hosted	API (per-token)	API (flat-rate)
Compute/inference	$2,800	$265	$250
Ops/maintenance	$2,500	$0	$0
Idle waste (~60%)	$1,680	$0	$0
Total monthly	$6,980	$265	$250

Self-hosting costs 26x more for the same workload. The GPU is only 40% of the self-hosted cost — ops and idle waste are the majority.
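
The ratio and the GPU share follow directly from the table. A short sketch using only the monthly figures above (the variable names are ours):

```python
# Monthly figures from the table above (USD).
self_hosted = {"compute": 2_800, "ops": 2_500, "idle_waste": 1_680}
api_per_token = 265

total = sum(self_hosted.values())
ratio = total / api_per_token
gpu_share = self_hosted["compute"] / total

print(f"Self-hosted total: ${total:,}")
print(f"Multiple of per-token API cost: {ratio:.0f}x")
print(f"GPU share of self-hosted cost: {gpu_share:.0%}")
```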

When self-hosting makes sense

Self-hosting wins in specific scenarios:

Data sovereignty: If your data cannot leave your network — regulated industries, government, healthcare with strict compliance — self-hosting is the only option. No API provider can guarantee the data isolation you need.

Extreme scale: If you’re processing millions of requests per day and your GPUs are consistently at 80%+ utilization, the per-token math eventually favors owned hardware. This threshold is higher than most teams expect — typically $20K+/month in API spend before self-hosting breaks even.

Custom models: If you’ve fine-tuned a model and need to serve it, self-hosting or a dedicated inference provider (Fireworks, Together) is required. Most unified APIs don’t serve custom model weights.

Latency control: If you need guaranteed sub-100ms TTFT and your data center is co-located with your GPUs, self-hosting eliminates network hops.

For everyone else — startups, small teams, companies with variable usage patterns — the API is cheaper, faster to set up, and easier to maintain.

The migration path

Most teams don’t need to choose one forever. A practical approach:

Start with an API: Get your product working, validate demand, understand your usage patterns.
Optimize model selection: Use cheaper models for simple tasks, frontier models for hard tasks. Full guide: Multi-model architecture.
Evaluate self-hosting when: Your monthly API spend exceeds $10K, your GPU utilization would be >70%, and you have DevOps capacity to maintain it.
Hybrid: Self-host your high-volume models, use an API for long-tail models and overflow capacity.

The worst outcome is spending 3 months setting up GPU infrastructure before you’ve validated that anyone wants your product.

CheapestInference serves many models through a single API. No GPUs to manage, no idle costs, no ops burden. Flat-rate plans from $10/month. Get started or compare plans.

The real cost of running AI agents in production

Apr 15, 2026

Chatbots are cheap. Agents are not.

A chatbot sends a user message, gets a response, displays it. Maybe 2,000 tokens per exchange. An agent reads files, calls tools, retries on errors, re-sends the entire conversation every step, and does this 20–60 times per task. Same API, completely different economics.

If you’re budgeting for AI agents the same way you budget for a chatbot, you’re underestimating by 10–50x.

Token consumption: chatbot vs. agent

We measured token consumption across four workload types, each running for one hour:

Workload	Tokens per hour
Coding agent (OpenClaw)	~2.1M
Research agent (CrewAI)	~1.2M
RAG chatbot	~200K
Simple chatbot	~40K

The coding agent consumed 52x more tokens than a simple chatbot in the same time period. And this is normal — the agent was doing useful work the entire time.

Why agents cost so much

Three architectural properties of agents make them expensive:

1. Context accumulation

Every agent step appends tool outputs to the conversation. The LLM re-processes the entire conversation on each step. If the agent reads a 3,000-token file at step 5, that file gets re-sent at steps 6, 7, 8… all the way to the end.

For a 40-step task, one file read costs: 3,000 tokens × 35 remaining steps = 105,000 tokens in re-transmission.

This is why agent token consumption grows quadratically, not linearly.
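
A rough sketch of both effects. The first function reproduces the file-read arithmetic above. The second uses illustrative numbers (a 4,000-token starting context growing by 1,000 tokens per step, both assumptions) to show that doubling the step count more than doubles the total.

```python
def retransmission_tokens(file_tokens: int, read_step: int, total_steps: int) -> int:
    """Tokens spent re-sending one tool output on every later step."""
    remaining_steps = total_steps - read_step
    return file_tokens * remaining_steps


def cumulative_input_tokens(base_tokens: int, added_per_step: int, steps: int) -> int:
    """Total input tokens when the full context is re-sent on every step."""
    total = 0
    context = base_tokens
    for _ in range(steps):
        total += context           # the whole context is billed again
        context += added_per_step  # the new tool output joins the context
    return total


print(retransmission_tokens(3_000, read_step=5, total_steps=40))
print(cumulative_input_tokens(4_000, 1_000, steps=20))
print(cumulative_input_tokens(4_000, 1_000, steps=40))
```

With these assumed numbers, going from 20 to 40 steps raises total input tokens by about 3.5x, not 2x.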

2. System prompt overhead

Agent frameworks use large system prompts — OpenClaw’s is ~9,600 tokens, CrewAI’s varies by agent configuration. This prompt is sent with every request. Over 40 steps, the system prompt alone costs 384,000 tokens.

3. Error retry loops

When a tool call fails, the agent retries. Each retry sends the full context plus the error message. Three retries on a 30K-token context wastes 90K tokens with no productive output.

Without a retry cap, this can run indefinitely. We covered this in detail in Why your AI agent needs a budget.

Monthly cost by model and framework

Assuming one developer running 15 agent tasks per day, 22 working days per month, ~500K tokens per task:

Model	Cost/task	Daily (×15)	Monthly
Claude Opus 4.6	$9.18	$137.70	$3,029
Claude Sonnet 4.6	$2.25	$33.75	$743
GPT-5.4	$4.73	$70.95	$1,561
DeepSeek V3.2	$0.16	$2.40	$53
Qwen 3.5 35B	$0.04	$0.60	$13
CheapestInference Pro	—	—	$50 flat

A team of 5 developers each running 15 tasks/day on Claude Opus spends $15,145/month. The same team on DeepSeek V3.2 via CheapestInference pays $250/month (5 × $50 Pro plan). That’s a 60x reduction.

Four strategies to cut agent inference costs

1. Switch to open-source models

DeepSeek V3.2 and Qwen 3.5 score within 4 points of GPT-5.4 and Opus on most benchmarks. For coding tasks specifically, DeepSeek V3.2 matches Opus on HumanEval and SWE-bench. Full data: Open-source models are production-ready.

2. Route by task complexity

Not every agent step needs a frontier model. File reads, simple classifications, and formatting don’t need 685B parameters. Use a small model for easy steps and a large model for hard ones. Full guide: Building a multi-model architecture.
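
A minimal sketch of what routing can look like. The step categories and the `pick_model` helper are hypothetical; the two model identifiers are the ones used in the code examples in Why your AI agent needs a budget.

```python
# Hypothetical step categories; a real agent would derive these from
# the tool being called or from a cheap classifier.
EASY_STEPS = {"read_file", "list_files", "classify", "format"}

SMALL_MODEL = "meta-llama/llama-3.1-8b-instruct"
LARGE_MODEL = "deepseek/deepseek-chat-v3-0324"


def pick_model(step_type: str) -> str:
    """Send routine steps to the small model, everything else to the large one."""
    return SMALL_MODEL if step_type in EASY_STEPS else LARGE_MODEL


print(pick_model("read_file"))
print(pick_model("write_auth_logic"))
```

Defaulting unknown step types to the large model is a deliberate choice here: misrouting a hard step to a small model usually costs more in retries than it saves.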

3. Set per-key budgets with automatic reset

Give each agent its own API key with a dollar-denominated budget that resets every few hours. When the budget is exhausted, the agent pauses instead of burning through your allocation. We built this into every key: Agent budgets explained.

4. Use flat-rate pricing

Per-token pricing penalizes the exact patterns agents use: large contexts, many steps, retries. Flat-rate pricing makes all of that free. Your agent can use the full context window, retry freely, and run 24/7 without increasing the bill.

The math that matters

Here’s the equation most teams miss:

Agent cost = tokens_per_step × steps × cost_per_token

Most optimization focuses on cost_per_token: switching to a cheaper model. But tokens_per_step grows with every step as context accumulates (so total tokens grow quadratically), and steps is unpredictable. Optimizing only one variable leaves the other two working against you.

Flat-rate pricing eliminates all three variables from your bill. The cost is the subscription. Period.

We serve many models with flat-rate pricing and per-key budget caps. One subscription, unlimited keys, and the guarantee that your agent’s token consumption never becomes your problem. Get started or see plans.

Qwen 3.5 vs GPT-5.4 vs Claude Opus 4.6 — same quality, fraction of the price

Mar 26, 2026

You asked for this. After our first benchmark post, the most requested model was Qwen 3.5. Here it is — 4 models across 5 metrics, same models in every chart:

Open-source: Qwen3.5-397B-A17B (flagship), Qwen3.5-35B-A3B (efficient)
Proprietary: GPT-5.4, Claude Opus 4.6

Knowledge: MMLU-Pro (%)

Model	MMLU-Pro
GPT-5.4	88.5%
Qwen3.5 397B	87.8%
Qwen3.5 35B	85.3%
Claude Opus 4.6	82.0%

GPT-5.4 leads at 88.5%, but Qwen3.5-397B is only 0.7 points behind, a gap small enough to be statistical noise. The 35B with only 3B active parameters scores 85.3%, beating Opus by 3.3 points. The total spread across all four models is just 6.5 points.

Qwen3.5-397B matches GPT-5.4 at 5x less cost. The 35B beats Opus at 23x less.

Reasoning: GPQA Diamond (%)

Model	GPQA Diamond
GPT-5.4	92.0%
Claude Opus 4.6	91.3%
Qwen3.5 397B	88.4%
Qwen3.5 35B	84.2%

Proprietary models lead on graduate-level reasoning. GPT-5.4 at 92% and Opus at 91.3% are strong. But Qwen3.5-397B at 88.4% is within 4 points — and costs $0.54/M vs $2.50 and $5.00. The 35B at 84.2% is still PhD-level performance for $0.22/M input.

Code: LiveCodeBench v6 (%)

Model	LiveCodeBench v6
GPT-5.4	84.0%
Qwen3.5 397B	83.6%
Claude Opus 4.6	76.0%
Qwen3.5 35B	74.6%

The 397B essentially ties GPT-5.4 on competitive coding, just 0.4 points apart. Both beat Opus by roughly 8 points. The 35B at 74.6% is within 2 points of Opus, at 1/23rd the price.

For dedicated coding workloads, we also serve Qwen3-Coder-480B (SWE-bench Verified: 69.6%, comparable to Claude Sonnet 4).

Speed: output tokens per second

Model	Output speed
Qwen3.5 35B	178 t/s
Qwen3.5 397B	84 t/s
GPT-5.4	~78 t/s
Claude Opus 4.6	46 t/s

The 35B’s MoE architecture pays off — 178 tok/s is 2.3x faster than GPT-5.4 and 3.9x faster than Opus. Even the 397B flagship at 84 tok/s outpaces both proprietary models. This is what happens when only 3-17B parameters activate per token instead of the full model.

Speed data from Artificial Analysis. Actual speeds on our infrastructure may differ.

Price: input cost per million tokens

Model	Input price ($/M tokens)
Qwen3.5 35B	$0.22
Qwen3.5 397B	$0.54
GPT-5.4	$2.50
Claude Opus 4.6	$5.00

This is the chart that matters. Opus costs 23x more than the 35B and 9x more than the 397B. GPT-5.4 costs 5x more than the 397B. The quality difference? Single-digit percentage points on every benchmark.

The full picture

Quality only, no price axis. GPT-5.4 (gray) has the largest shape. Opus (dashed) is strong on reasoning and code. The 397B (indigo) nearly overlaps GPT-5.4 on code and knowledge. The 35B (teal) pulls hard left on speed: 178 tok/s is 2.1x faster than the next-fastest model here. Price tells its own story in the chart above.

The scorecard

Metric	Winner	Qwen3.5 397B	GPT-5.4	Claude Opus 4.6	Gap (397B vs best)
Knowledge (MMLU-Pro)	GPT-5.4	87.8%	88.5%	82.0%	-0.7 pts
Reasoning (GPQA)	GPT-5.4	88.4%	92.0%	91.3%	-3.6 pts
Code (LiveCodeBench)	GPT-5.4	83.6%	84.0%	76.0%	-0.4 pts
Speed (tok/s)	Qwen3.5 397B	84 t/s	~78 t/s	46 t/s	1.1x faster
Price ($/M input)	Qwen3.5 397B	$0.54	$2.50	$5.00	4.6x cheaper

Same weight class, different price tag. The 397B trades 0.4–3.6 points on quality for 4.6x lower price and faster speed. It beats Opus on 4 out of 5 metrics outright.

Note: The Qwen3.5-35B-A3B ($0.22/M) scores 85.3% MMLU-Pro, 84.2% GPQA, 74.6% LiveCodeBench at 178 tok/s — beating Opus on knowledge and speed at 23x less cost. A different weight class, but worth considering if speed and price matter more than the last few quality points.

The real question: what are you paying for?

The quality gap between Qwen3.5-397B and GPT-5.4 is 0.7 points on knowledge, 0.4 points on code. The price gap is 4.6x.

Put it differently:

Model	MMLU-Pro	Cost per quality point
Qwen3.5 35B	85.3%	$0.003 per point per M tokens
Qwen3.5 397B	87.8%	$0.006 per point per M tokens
GPT-5.4	88.5%	$0.028 per point per M tokens
Claude Opus 4.6	82.0%	$0.061 per point per M tokens

Opus costs 20x more per quality point than the 35B — and scores lower. GPT-5.4 leads on quality but costs 5-10x more for single-digit advantages.

For most workloads, the last 3% of benchmark performance isn’t worth a 5x price increase. And for workloads where it is, the 397B gets you within 1 point of GPT-5.4 on knowledge and code at a fraction of the cost.
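
The cost-per-quality-point column is simply input price divided by MMLU-Pro score. A short sketch, using the prices and scores reported above (the `cost_per_point` name is ours):

```python
# Input price (USD per million tokens) and MMLU-Pro score from the charts above.
MODELS = {
    "Qwen3.5 35B": (0.22, 85.3),
    "Qwen3.5 397B": (0.54, 87.8),
    "GPT-5.4": (2.50, 88.5),
    "Claude Opus 4.6": (5.00, 82.0),
}


def cost_per_point(price: float, score: float) -> float:
    """Dollars per benchmark point per million input tokens."""
    return price / score


for name, (price, score) in MODELS.items():
    print(f"{name}: ${cost_per_point(price, score):.3f} per point per M tokens")
```

The same function works for any other benchmark column if MMLU-Pro is not the metric you care about.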

Also available: specialized Qwen models

Beyond the general-purpose models, we serve two Qwen specialists:

Qwen3-Coder-480B — SWE-bench Verified 69.6%, comparable to Claude Sonnet 4. Built for agentic coding.
Qwen3-235B-Thinking — Chain-of-thought reasoning specialist. When you need the model to show its work.

Both available through the same API, same flat-rate plans.

All Qwen 3.5 models are available now on our API. Flat rate from $20/mo, or pay-as-you-go credits. See pricing and try it →

Sources: Qwen3.5-397B Model Card · Qwen3.5-35B Model Card · Artificial Analysis Leaderboard · GPQA Diamond Leaderboard · OpenAI Pricing · Anthropic Pricing · LiveCodeBench Leaderboard

OpenClaw is free. Running it is not.

Mar 24, 2026

OpenClaw has 247,000 GitHub stars. It’s free, open-source, and runs locally. You install it, point it at an LLM, and it writes code, browses the web, queries databases, and executes files on your behalf.

The agent is free. The inference is not.

Every time OpenClaw calls a model, it re-sends the entire conversation history — every tool output, every file it read, every intermediate result. By iteration 20 of a typical task, the input context is 30,000+ tokens. By iteration 40, it’s past 100,000. And it sends this every single request.

This is not a bug. It’s how agents work. And it’s why running OpenClaw on pay-per-token APIs costs $300–600/month for active users — sometimes more.

Where the tokens go

We broke down token consumption for a typical OpenClaw coding task: “add authentication to an Express API.” The agent completed it in 38 tool calls.

Category	Tokens
Context accumulation	~280K
System prompt (×38)	~156K
Tool outputs (files, etc.)	~70K
Agent output	~19K

Total: ~525,000 tokens for a single task. The agent’s actual output — the code it wrote — was 19K tokens. The other 96% is overhead.

On Claude Opus at $15/M input + $75/M output, that single task costs $9.18. Run five tasks a day and you’re at $1,377/month.

On DeepSeek V3.2 via a pay-per-token provider at $0.27/M input + $1.10/M output, the same task costs $0.16. Better — but 20 tasks a day is still $96/month, and that’s one agent.
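
A quick sketch of the per-task arithmetic for the DeepSeek case. It assumes everything other than the 19K tokens of agent output is billed as input, which is our simplification of the breakdown above.

```python
TOTAL_TOKENS = 525_000
OUTPUT_TOKENS = 19_000
INPUT_TOKENS = TOTAL_TOKENS - OUTPUT_TOKENS  # assumption: everything else is input


def task_cost(input_tokens: int, output_tokens: int,
              input_price: float, output_price: float) -> float:
    """Cost in USD given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


overhead_share = 1 - OUTPUT_TOKENS / TOTAL_TOKENS
deepseek = task_cost(INPUT_TOKENS, OUTPUT_TOKENS, 0.27, 1.10)

print(f"Overhead share: {overhead_share:.0%}")
print(f"DeepSeek V3.2 per task: ${deepseek:.2f}")
```

Input tokens dominate the bill, which is why context overhead matters far more than the length of the code the agent writes.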

The three cost traps

We covered these in depth in Why your AI agent needs a budget, but here’s the OpenClaw-specific version:

1. Context grows quadratically

OpenClaw reads files into context. If it reads a 2,000-token file at step 5, that file gets re-sent at steps 6, 7, 8… all the way to 38. That single file read costs 2,000 × 33 remaining steps = 66,000 tokens in re-transmission alone.

Users report session contexts at 56–58% of the 400K context window during normal use. This isn’t a failure mode — it’s the architecture working as designed.

2. System prompt is a fixed tax

OpenClaw’s system prompt is ~9,600 tokens. It gets sent with every request. Over 38 tool calls, that’s 365K tokens just in system prompts. You pay this whether the agent does useful work or not.

3. Wrong model for the job

OpenClaw defaults to a single model for everything. But not every tool call needs the same intelligence:

Reading a file and deciding what to edit? Llama 3.1 8B handles this at 200 tokens/sec.
Writing complex authentication logic? DeepSeek V3.2 or Kimi K2.5 is the right call.
Formatting a config file? Any 8B model is overkill but still cheaper than Opus.

We wrote a full guide on this pattern: Building a multi-model architecture. Routing agent requests to the right model can cut costs by 60–80% without reducing output quality.

The math on flat-rate vs. pay-per-token

Here’s the comparison for an OpenClaw user running ~20 tasks/day:

Provider	Cost/task	20 tasks/day	Monthly
Claude Opus (direct)	$9.18	$183.60	$5,508
GPT-5.4 (direct)	$4.73	$94.60	$2,838
DeepSeek V3.2 (per-token)	$0.16	$3.20	$96
CheapestInference Pro	—	—	$50/mo flat

Flat-rate means you don’t care about context accumulation. The 280K tokens of context overhead that makes pay-per-token expensive? Irrelevant. The system prompt tax? Doesn’t matter. Your agent can call models 24/7 and the bill is the same.
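
Whether flat-rate wins depends on volume. Here is a sketch of the break-even point, using the $50 plan and the per-task costs from the table. The 30-day month matches the table's monthly column; the function name is ours.

```python
import math


def break_even_tasks_per_day(flat_monthly: float, cost_per_task: float,
                             days_per_month: int = 30) -> int:
    """Smallest whole number of daily tasks at which per-token spend reaches the flat fee."""
    return math.ceil(flat_monthly / (cost_per_task * days_per_month))


print(break_even_tasks_per_day(50, 0.16))  # DeepSeek V3.2, per-token
print(break_even_tasks_per_day(50, 9.18))  # Claude Opus, direct
```

Against a cheap per-token model the crossover sits at roughly a dozen tasks a day; against a frontier model it is the first task.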

What we’d actually recommend

If you’re running OpenClaw, here’s the setup we see working best:

1. Use open-source models. DeepSeek V3.2 and Kimi K2.5 score within 4 points of proprietary models on coding benchmarks (the data). The gap doesn’t justify a 50x cost difference.

2. Route by complexity. Don’t send file reads and simple decisions to the same model as complex code generation. A router model costs fractions of a cent per classification. Full guide: Multi-model architecture.

3. Set per-key budgets. One API key per agent, each with a dollar-denominated budget that resets every few hours. When the budget runs out, the agent pauses instead of burning through your allocation. We built this into every key: Agent budgets explained.

4. Handle rate limits automatically. Budget caps mean your agent will hit 429s. That’s the point — the cap is working. But OpenClaw kills the conversation when it gets a 429. The agent stops, and if you close the dashboard, that conversation is gone.

We built an OpenClaw plugin that fixes this: openclaw-ratelimit-retry. It hooks into agent_end, detects retriable 429s, parks the session on disk, and waits for the budget window to reset. Then it sends chat.send to the original session — resuming the conversation with its full transcript, as if you had typed a message.

openclaw plugins install @cheapestinference/openclaw-ratelimit-retry

plugins:
  ratelimit-retry:
    budgetWindowHours: 8    # matches your CheapestInference budget reset
    maxRetryAttempts: 3     # give up after 3 consecutive 429s
    checkIntervalMinutes: 5 # check every 5 min for ready retries

The plugin is zero-dependency, persists across server restarts, deduplicates by session, and handles edge cases like sub-agents, queue overflow, and corrupted state files. If the retry itself hits a 429, it re-queues automatically. No tokens wasted on re-sending from scratch — the agent picks up exactly where it left off.

This turns budget caps from “your agent crashes” into “your agent naps and wakes up.” Set it up once and forget about it.

5. Consider flat-rate. If your agent runs more than a few tasks per day, per-token pricing works against you. Every token of context overhead is money. On flat-rate, context overhead is free — use the full 128K window, re-send everything, let the agent work without constraint.

The irony

OpenClaw is free because the code runs on your machine. But the valuable part — the intelligence — runs on someone else’s GPUs. The agent framework is the cheap part. Inference is the expensive part.

Open-source models on flat-rate infrastructure flip this equation. The models are free. The inference is flat. The only variable cost left is your time.

Point your OpenClaw base_url at https://api.cheapestinference.com/v1 and find out what unconstrained agents actually cost: nothing more than you already budgeted.

Why your AI agent needs a budget

Mar 24, 2026

There’s a pattern that plays out every week in AI Discord servers and GitHub issues: someone deploys an agent, goes to bed, and wakes up to a $400 bill from a loop that ran all night.

Agents are not humans. They don’t get tired. They don’t notice when they’re repeating themselves. And they consume tokens at a rate that makes interactive chat look like a rounding error.

If you’re running agents in production — or even in development — you need a budget. Here’s why, and how to implement one.

Agents consume 10–50x more tokens than humans

A human chatting with an LLM sends a message, reads the response, thinks, types another message. Maybe 10 requests per hour, a few hundred tokens each.

An agent running a tool loop does this:

1. Read task description (system prompt + context)     → 4,000 tokens input
2. Call tool #1                                         → 500 tokens output
3. Receive tool result, re-send full context + result   → 5,200 tokens input
4. Call tool #2                                         → 500 tokens output
5. Receive result, re-send everything                   → 6,800 tokens input
6. ... repeat 20-40 times ...

Each iteration re-sends the entire conversation history. By step 20, the input context is 30,000+ tokens — and the agent sends it every single time. A 40-step agent loop can consume 500,000+ tokens in a single task. That’s what a human user consumes in a week.

Agent (40-step loop)

~500K tokens

Agent (10-step loop)

~100K tokens

Human (1 hour chat)

~10K tokens

This is normal behavior. The agent is doing its job. The problem is when it does its job wrong — and nobody is watching.

The three failure modes that drain budgets

1. Infinite tool loops

The agent calls a tool, gets an error, retries the same call, gets the same error, retries again. Without a loop detector or retry cap, this continues until your rate limit or budget hits zero.

This is the most common failure mode. It happens when:

An API the agent calls is temporarily down
The agent’s output doesn’t match the tool’s expected input format
The agent misinterprets the tool result and keeps “trying harder”

A single infinite loop can consume millions of tokens in minutes.
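
One way to catch this is a loop detector that counts identical tool calls. The sketch below is illustrative: the `LoopDetector` class and the `fetch_url` tool name are hypothetical, not part of any particular framework.

```python
import json
from collections import Counter


class LoopDetector:
    """Flags a tool call once it has been issued too many times with identical arguments."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def should_stop(self, tool_name: str, arguments: dict) -> bool:
        # Serialize with sorted keys so argument order does not matter.
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        self._seen[key] += 1
        return self._seen[key] > self.max_repeats


detector = LoopDetector(max_repeats=3)
for attempt in range(1, 6):
    if detector.should_stop("fetch_url", {"url": "https://example.com"}):
        print(f"Stopping: identical call repeated {attempt} times")
        break
```

Counting exact repeats will not catch an agent that varies its arguments slightly on each retry, so treat this as a complement to a step cap, not a replacement.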

2. Context accumulation

Every tool result gets appended to the conversation. The agent never summarizes or trims. By step 30, the input payload is 40K+ tokens, and most of it is irrelevant tool outputs from step 3.

This isn’t a bug — it’s the default behavior of most agent frameworks. The context grows linearly with each step, and each step costs more than the last because the full context is re-sent.
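
If your framework lets you edit the message list, trimming stale tool outputs is one mitigation. A minimal sketch, assuming OpenAI-style messages where tool results use `"role": "tool"`; the function name and placeholder text are our own.

```python
def trim_tool_outputs(messages: list[dict], keep_last: int = 3,
                      placeholder: str = "[tool output trimmed]") -> list[dict]:
    """Replace the content of all but the most recent tool messages with a short placeholder."""
    tool_indexes = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    to_trim = set(tool_indexes[:-keep_last]) if keep_last > 0 else set(tool_indexes)
    # Copy trimmed messages so the caller's history is left untouched.
    return [
        {**m, "content": placeholder} if i in to_trim else m
        for i, m in enumerate(messages)
    ]


history = [{"role": "system", "content": "You are an agent."}]
history += [
    {"role": "tool", "tool_call_id": f"call_{i}", "content": "x" * 1000}
    for i in range(5)
]
trimmed = trim_tool_outputs(history, keep_last=2)
print(sum(len(m["content"]) for m in trimmed))
```

The messages themselves stay in place (only their content shrinks), so tool-call IDs still line up with the assistant turns that requested them.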

3. Wrong model for the job

An agent using DeepSeek R1 (a reasoning model at ~30 tokens/second) for tasks that don’t require reasoning — file listing, simple classification, template generation — is burning expensive compute for no quality gain. R1 also produces internal chain-of-thought tokens that you pay for but never see.

The fix is model routing — covered in our multi-model architecture guide. But even with routing, you need a budget as a backstop.

What happens without a budget

Without a spending cap, any of these failures means:

Pay-as-you-go API: The bill grows until you notice. Stories of $500+ surprise bills are common on forums. The provider has no reason to stop you — they’re selling tokens.
Self-hosted inference: The agent consumes your entire GPU allocation, starving other workloads.
Shared platform: One user’s agent consumes capacity that other users need.

In all three cases, the damage scales with time. An agent that runs for 8 hours unattended can do 8 hours of damage.

How budget caps work

A budget cap is a dollar ceiling on how much a single key can spend in a time window. When the cap is reached, requests return a 429 Too Many Requests error. No overage charges. No surprise bills. The agent stops, and you investigate.

The key properties of a good budget system:

1. Dollar-denominated, not token-denominated.

Token limits sound intuitive but don’t work across models. 100,000 tokens of Llama 3.1 8B costs $0.002. The same tokens on a large reasoning model cost 100x more. A dollar budget normalizes across all models automatically.

2. Time-windowed with automatic reset.

A budget that resets every few hours (e.g. every 8 hours) means a failure in one window doesn’t affect the next. The agent recovers automatically. If you set a one-time budget that never resets, you have to manually intervene every time the agent exhausts it.

3. Per-key, not per-account.

If you run 5 agents, each should have its own key and its own budget. One runaway agent should not starve the other four. Per-key budgets provide isolation — the same way containers isolate processes.
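
To make the three properties concrete, here is a small client-side sketch. It illustrates the idea only and is not how any platform enforces budgets; `WindowedBudget`, `BudgetExceeded`, and the $1.00 limits are hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when a key has spent its budget for the current window."""


class WindowedBudget:
    """Dollar-denominated budget for one key that resets every window."""

    def __init__(self, limit_usd: float, window_seconds: float = 8 * 3600):
        self.limit_usd = limit_usd
        self.window_seconds = window_seconds
        self.window_start = None
        self.spent_usd = 0.0

    def charge(self, cost_usd: float, now: float) -> None:
        # Start a fresh window on first use or once the current one has elapsed.
        if self.window_start is None or now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.spent_usd = 0.0
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(f"limit of ${self.limit_usd:.2f} reached")
        self.spent_usd += cost_usd


# One budget per key, so a runaway agent cannot starve the others.
budgets = {
    "sk_agent_research": WindowedBudget(limit_usd=1.00),
    "sk_agent_coding": WindowedBudget(limit_usd=1.00),
}
```

Passing `now` explicitly (for example `time.time()`) keeps the class easy to test without waiting for a real window to elapse.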

Designing agents that handle budget limits gracefully

A well-built agent treats a budget limit the same way a well-built web app treats a rate limit — as a normal operational condition, not an unexpected error.

Catch 429s and degrade

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.cheapestinference.com/v1",
    api_key="sk_your_agent_key"
)

def agent_step(messages: list) -> str:
    try:
        response = client.chat.completions.create(
            model="deepseek/deepseek-chat-v3-0324",
            messages=messages
        )
        return response.choices[0].message.content
    except RateLimitError:
        # Budget exhausted — save state, wait for reset
        save_agent_state(messages)
        return "[BUDGET_LIMIT] Agent paused. Will resume on next window."
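
The example above leaves `save_agent_state` undefined. A minimal sketch of what it could look like, assuming a local JSON file is good enough for persistence; the file name and the `load_agent_state` companion are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional

STATE_FILE = Path("agent_state.json")  # hypothetical location


def save_agent_state(messages: list, step: Optional[int] = None,
                     path: Path = STATE_FILE) -> None:
    """Persist the conversation so the agent can resume after the budget resets."""
    path.write_text(json.dumps({"step": step, "messages": messages}))


def load_agent_state(path: Path = STATE_FILE) -> Optional[dict]:
    """Return the saved state, or None if nothing has been saved yet."""
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

On the next budget window, call `load_agent_state()` and pass its `messages` back into the agent loop to resume where it stopped.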

Monitor spend proactively

Don’t wait for the 429. Check your remaining budget periodically and adjust behavior:

import requests

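def check_budget(api_key: str) -> dict:
    """Check remaining budget via the usage endpoint."""
    resp = requests.get(
        "https://api.cheapestinference.com/v1/usage",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=10,  # don't let a stalled request hang the agent
    )
    resp.raise_for_status()  # surface auth or server errors early
    return resp.json()["budget"]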

budget = check_budget("sk_your_agent_key")
remaining = budget["limit"] - budget["spent"]

if remaining < 0.01:
    # Less than $0.01 left — switch to cheapest model or pause
    switch_to_model("meta-llama/llama-3.1-8b-instruct")

Set retry caps in your agent framework

Every agent framework has a way to limit retries. Use it:

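# LangChain (classic AgentExecutor API): the iteration cap is set on the executor
agent = create_react_agent(llm=llm, tools=tools, prompt=prompt)
executor = AgentExecutor(
    agent=agent,
    tools=tools,
    max_iterations=25  # Hard cap on tool loop iterations
)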

# CrewAI
agent = Agent(
    role="researcher",
    max_iter=15,  # Maximum iterations per task
    llm=llm
)

# Custom loop
MAX_STEPS = 30
for step in range(MAX_STEPS):
    result = agent_step(messages)
    if is_done(result):
        break
else:
    log.warning("Agent hit max steps without completing task")

A max iteration cap is your first line of defense. The budget cap is your second.

Subscriptions as a natural budget mechanism

Pay-per-token pricing gives agents an open-ended credit line. Subscriptions invert this — you decide upfront how much to spend, and the platform enforces it.

With a subscription plan on CheapestInference:

Each key gets a dollar budget that resets every 8 hours
When budget runs out → 429, never overage charges
You create unlimited keys — one per agent, each with its own budget
When your subscription expires, all keys are automatically revoked

This means your worst case is bounded. A runaway agent burns through one 8-hour budget window and stops. It doesn’t burn through your monthly allocation, because the next window starts fresh with a new budget.

For teams running multiple agents, the per-key isolation matters. Your research agent, your coding agent, and your monitoring agent each have independent budgets. If the research agent enters a loop, the others keep working.

The budget stack: defense in depth

No single mechanism catches every failure. Stack them:

Layer	What it catches	When it triggers
Max iterations (code)	Runaway tool loops	After N steps
Retry cap (code)	Repeated failed calls	After N consecutive errors
Budget cap (platform)	All spending, any cause	When dollar limit is reached
Subscription expiry (platform)	Abandoned agents	When subscription period ends

The first two are your responsibility as the developer. The last two are the platform’s. Together, they ensure that even if your code has a bug you haven’t found yet, the damage is capped.

What a budgeted agent looks like in practice

Here’s a complete pattern for a production agent:

from openai import OpenAI, RateLimitError
import requests
import time

client = OpenAI(
    base_url="https://api.cheapestinference.com/v1",
    api_key="sk_agent_research"
)

MAX_STEPS = 30
BUDGET_WARN_THRESHOLD = 0.02  # Switch models when < $0.02 left
RETRY_LIMIT = 3

def run_agent(task: str):
    messages = [
        {"role": "system", "content": "You are a research agent. ..."},
        {"role": "user", "content": task}
    ]
    model = "deepseek/deepseek-chat-v3-0324"
    consecutive_errors = 0

    for step in range(MAX_STEPS):
        # Check budget every 5 steps
        if step % 5 == 0 and step > 0:
            budget = check_budget("sk_agent_research")
            remaining = budget["limit"] - budget["spent"]
            if remaining < BUDGET_WARN_THRESHOLD:
                model = "meta-llama/llama-3.1-8b-instruct"

        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages
            )
            consecutive_errors = 0
            content = response.choices[0].message.content
            messages.append({"role": "assistant", "content": content})

            if is_task_complete(content):
                return content

        except RateLimitError:
            save_agent_state(messages, step)
            return f"Budget limit reached at step {step}. State saved."

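        except Exception as e:
            consecutive_errors += 1
            if consecutive_errors >= RETRY_LIMIT:
                return f"Aborting after {RETRY_LIMIT} consecutive errors: {e}"
            time.sleep(2 ** consecutive_errors)  # back off before retrying

    # Persist progress so the "partial results saved" message is true
    save_agent_state(messages, MAX_STEPS)
    return "Max steps reached. Partial results saved."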

Three layers of protection:

Max 30 steps — prevents infinite loops
3 consecutive error retry cap — prevents retry storms
Budget check every 5 steps — degrades to cheaper model before hitting the hard cap

If all three fail, the platform’s budget cap catches it anyway.

The bottom line

Running an AI agent without a budget is like running a process without memory limits — it works fine until it doesn’t, and then the damage is proportional to how long nobody noticed.

Budget caps don’t limit what your agent can do. They limit what it can do wrong. A properly budgeted agent completes the same tasks — it just can’t bankrupt you in the process.

Set a budget. Set a retry cap. Set a max iteration count. Then let your agent run.

We serve open-source and proprietary models with per-key budget caps that reset every 8 hours. One subscription, unlimited keys, and the guarantee that a bad loop never turns into a bad bill. Get started or see how per-key plans work.
