OpenClaw is free. Running it is not.

OpenClaw has 247,000 GitHub stars. It’s free, open-source, and runs locally. You install it, point it at an LLM, and it writes code, browses the web, queries databases, and executes files on your behalf.

The agent is free. The inference is not.

Every time OpenClaw calls a model, it re-sends the entire conversation history — every tool output, every file it read, every intermediate result. By iteration 20 of a typical task, the input context is 30,000+ tokens. By iteration 40, it’s past 100,000. And it sends this every single request.

This is not a bug. It’s how agents work. And it’s why running OpenClaw on pay-per-token APIs costs $300–600/month for active users — sometimes more.


We broke down token consumption for a typical OpenClaw coding task: “add authentication to an Express API.” The agent completed it in 38 tool calls.

Context accumulation: ~280K tokens
System prompt (×38): ~156K tokens
Tool outputs (files, etc.): ~70K tokens
Agent output: ~19K tokens

Total: ~525,000 tokens for a single task. The agent’s actual output — the code it wrote — was 19K tokens. The other 96% is overhead.

On Claude Opus at $15/M input + $75/M output, that single task costs $9.18. Run five tasks a day and you’re at $1,377/month.

On DeepSeek V3.2 via a pay-per-token provider at $0.27/M input + $1.10/M output, the same task costs $0.16. Better — but 20 tasks a day is still $96/month, and that’s one agent.
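Under pay-per-token pricing, a task's cost is just token counts times rates. A quick sketch of the arithmetic above (the token counts are the approximate figures from this breakdown, so the result lands near, not exactly on, the quoted prices):

```python
def task_cost(input_tokens: int, output_tokens: int,
              in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one task at pay-per-token prices (rates per million tokens)."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m

# ~506K input tokens (context + system prompts + tool outputs), ~19K output
opus = task_cost(506_000, 19_000, 15.00, 75.00)  # Claude Opus rates
deep = task_cost(506_000, 19_000, 0.27, 1.10)    # DeepSeek V3.2 rates
print(f"Opus: ${opus:.2f}/task, DeepSeek: ${deep:.2f}/task")
```

With these round numbers the Opus task comes out around $9 and the DeepSeek task around $0.16; the small gap from the quoted figures is rounding in the ~K token counts.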


We covered these in depth in Why your AI agent needs a budget, but here’s the OpenClaw-specific version:

OpenClaw reads files into context. If it reads a 2,000-token file at step 5, that file gets re-sent at steps 6, 7, 8… all the way to 38. That single file read costs 2,000 × 33 remaining steps = 66,000 tokens in re-transmission alone.
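The re-transmission cost is linear in the remaining steps, which is why early file reads are the expensive ones. A one-function sketch using the numbers from the example above:

```python
def retransmission_tokens(file_tokens: int, read_at_step: int, total_steps: int) -> int:
    """Tokens re-sent for a file that enters context at read_at_step
    and stays in context through every remaining step of the task."""
    return file_tokens * (total_steps - read_at_step)

# A 2,000-token file read at step 5 of a 38-step task:
print(retransmission_tokens(2_000, 5, 38))  # 66000
```

The same file read at step 35 would cost only 6,000 re-sent tokens, which is the argument for deferring large reads as long as possible.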

Users report session contexts at 56–58% of the 400K context window during normal use. This isn’t a failure mode — it’s the architecture working as designed.

OpenClaw’s system prompt is ~9,600 tokens. It gets sent with every request. Over 38 tool calls, that’s 365K tokens just in system prompts. You pay this whether the agent does useful work or not.

OpenClaw defaults to a single model for everything. But not every tool call needs the same intelligence:

  • Reading a file and deciding what to edit? Llama 3.1 8B handles this at 200 tokens/sec.
  • Writing complex authentication logic? DeepSeek V3.2 or Kimi K2.5 is the right call.
  • Formatting a config file? Even an 8B model is overkill for this — and still far cheaper than Opus.

We wrote a full guide on this pattern: Building a multi-model architecture. Routing agent requests to the right model can cut costs by 60–80% without reducing output quality.
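A minimal sketch of the routing idea. The tool names, model IDs, and the "simple tools" set are all illustrative — this is not OpenClaw's actual API, just the shape of a complexity router:

```python
CHEAP_MODEL = "llama-3.1-8b"    # fast, handles mechanical steps
STRONG_MODEL = "deepseek-v3.2"  # reserved for real code generation

# Tool calls that rarely need a frontier model (illustrative set)
SIMPLE_TOOLS = {"read_file", "list_dir", "format_config"}

def pick_model(tool_name: str) -> str:
    """Route mechanical tool calls to a small model, everything else to a big one."""
    return CHEAP_MODEL if tool_name in SIMPLE_TOOLS else STRONG_MODEL

print(pick_model("read_file"))         # small model
print(pick_model("write_auth_logic"))  # strong model
```

In practice the router can also look at prompt size or ask a tiny classifier model, but even a static tool-name table captures most of the savings.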


Here’s the comparison for an OpenClaw user running ~20 tasks/day:

Provider                    Cost/task   20 tasks/day   Monthly
Claude Opus (direct)        $9.18       $183.60        $5,508
GPT-5.4 (direct)            $4.73       $94.60         $2,838
DeepSeek V3.2 (per-token)   $0.16       $3.20          $96
CheapestInference Pro       n/a         n/a            $50/mo flat

Flat-rate means you don’t care about context accumulation. The 280K tokens of context overhead that makes pay-per-token expensive? Irrelevant. The system prompt tax? Doesn’t matter. Your agent can call models 24/7 and the bill is the same.


If you’re running OpenClaw, here’s the setup we see working best:

1. Use open-source models. DeepSeek V3.2 and Kimi K2.5 score within 4 points of proprietary models on coding benchmarks (the data). The gap doesn’t justify a 50x cost difference.

2. Route by complexity. Don’t send file reads and simple decisions to the same model as complex code generation. A router model costs fractions of a cent per classification. Full guide: Multi-model architecture.

3. Set per-key budgets. One API key per agent, each with a dollar-denominated budget that resets every few hours. When the budget runs out, the agent pauses instead of burning through your allocation. We built this into every key: Agent budgets explained.
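The budget mechanics can be sketched as a rolling-window spend counter. Everything here — class name, reset behavior, the five-hour default — is illustrative, not CheapestInference's implementation:

```python
import time

class KeyBudget:
    """Illustrative per-key dollar budget that resets every window_hours."""

    def __init__(self, dollars: float, window_hours: float = 5.0):
        self.limit = dollars
        self.window_s = window_hours * 3600
        self.spent = 0.0
        self.window_start = time.time()

    def charge(self, cost: float) -> bool:
        """Record spend; returns False when the agent should pause (budget hit)."""
        if time.time() - self.window_start >= self.window_s:
            self.spent, self.window_start = 0.0, time.time()  # window reset
        if self.spent + cost > self.limit:
            return False  # over budget: pause instead of burning allocation
        self.spent += cost
        return True
```

The key property is that a rejected charge leaves the counter untouched, so the agent resumes cleanly once the window resets.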

4. Handle rate limits automatically. Budget caps mean your agent will hit 429s. That’s the point — the cap is working. But OpenClaw kills the conversation when it gets a 429. The agent stops, and if you close the dashboard, that conversation is gone.

We built an OpenClaw plugin that fixes this: openclaw-ratelimit-retry. It hooks into agent_end, detects retriable 429s, parks the session on disk, and waits for the budget window to reset. Then it issues a chat.send to the original session — resuming the conversation with its full transcript, as if you had typed a message.

Install the plugin:

openclaw plugins install @cheapestinference/openclaw-ratelimit-retry

Then enable it in ~/.openclaw/config.yaml:

plugins:
  ratelimit-retry:
    budgetWindowHours: 5     # matches your CheapestInference budget reset
    maxRetryAttempts: 3      # give up after 3 consecutive 429s
    checkIntervalMinutes: 5  # check every 5 min for ready retries

The plugin is zero-dependency, persists across server restarts, deduplicates by session, and handles edge cases like sub-agents, queue overflow, and corrupted state files. If the retry itself hits a 429, it re-queues automatically. No tokens wasted on re-sending from scratch — the agent picks up exactly where it left off.

This turns budget caps from “your agent crashes” into “your agent naps and wakes up.” Set it up once and forget about it.

5. Consider flat-rate. If your agent runs more than a few tasks per day, per-token pricing works against you. Every token of context overhead is money. On flat-rate, context overhead is free — use the full context window, re-send everything, let the agent work without constraint.


OpenClaw is free because the code runs on your machine. But the valuable part — the intelligence — runs on someone else’s GPUs. The agent framework is the cheap part. Inference is the expensive part.

Open-source models on flat-rate infrastructure flip this equation. The models are free. The inference is flat. The only variable cost left is your time.

Point your OpenClaw base_url at https://api.cheapestinference.com/v1 and find out what unconstrained agents actually cost: nothing more than you already budgeted.
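As a sketch, assuming OpenClaw's model config accepts an OpenAI-compatible endpoint — the field names below are illustrative, so check your OpenClaw version's config reference:

```yaml
# Illustrative ~/.openclaw/config.yaml fragment; field names may differ
model:
  base_url: https://api.cheapestinference.com/v1
  api_key: ${CHEAPESTINFERENCE_API_KEY}  # keep the key in an env var, not the file
  name: deepseek-v3.2
```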