The real cost of running AI agents in production
Chatbots are cheap. Agents are not.
A chatbot sends a user message, gets a response, displays it. Maybe 2,000 tokens per exchange. An agent reads files, calls tools, retries on errors, re-sends the entire conversation every step, and does this 20–60 times per task. Same API, completely different economics.
If you’re budgeting for AI agents the same way you budget for a chatbot, you’re underestimating by 10–50x.
Token consumption: chatbot vs. agent
We measured token consumption across three workload types, each running for one hour.
The coding agent consumed 52x more tokens than a simple chatbot in the same time period. And this is normal — the agent was doing useful work the entire time.
Why agents cost so much
Three architectural properties of agents make them expensive:
1. Context accumulation
Every agent step appends tool outputs to the conversation. The LLM re-processes the entire conversation on each step. If the agent reads a 3,000-token file at step 5, that file gets re-sent at steps 6, 7, 8… all the way to the end.
For a 40-step task, one file read costs: 3,000 tokens × 35 remaining steps = 105,000 tokens in re-transmission.
This is why agent token consumption grows quadratically, not linearly.
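A small sketch makes the quadratic growth concrete. The numbers here are illustrative assumptions (a fixed 3,000 tokens of tool output appended per step, the whole conversation re-sent every step):

```python
def total_tokens(steps, tokens_added_per_step, base_context=0):
    """Sum the context re-processed across every step of an agent run,
    assuming each step appends output and re-sends the full conversation."""
    total = 0
    context = base_context
    for _ in range(steps):
        context += tokens_added_per_step  # tool output appended this step
        total += context                  # entire conversation re-sent
    return total

# Doubling the step count roughly quadruples total tokens processed.
print(total_tokens(20, 3_000))  # 630000
print(total_tokens(40, 3_000))  # 2460000
```

Going from 20 to 40 steps multiplies total tokens by about 3.9x, not 2x, which is the quadratic effect described above.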
2. System prompt overhead
Agent frameworks use large system prompts — OpenClaw’s is ~9,600 tokens, CrewAI’s varies by agent configuration. This prompt is sent with every request. Over 40 steps, the system prompt alone costs 384,000 tokens.
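The overhead figure is plain multiplication, using the ~9,600-token prompt size quoted above:

```python
# System prompt re-sent on every one of the 40 steps (figures from the text).
SYSTEM_PROMPT_TOKENS = 9_600
STEPS = 40
print(SYSTEM_PROMPT_TOKENS * STEPS)  # 384000
```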
3. Error retry loops
When a tool call fails, the agent retries. Each retry re-sends the full context plus the error message. Three retries on a 30K-token context waste 90K tokens with no productive output.
Without a retry cap, this can run indefinitely. We covered this in detail in Why your AI agent needs a budget.
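A retry cap can be as simple as a bounded loop. This is a minimal sketch with hypothetical names (`call_tool_with_cap` and the `RuntimeError` stand-in for a tool failure are illustrative, not any framework's real API):

```python
MAX_RETRIES = 3

def call_tool_with_cap(call_tool, args):
    """Retry a failing tool call, but never more than MAX_RETRIES times."""
    last_error = None
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return call_tool(args)
        except RuntimeError as err:  # stand-in for a tool-call failure
            last_error = err
    # Surface the failure instead of looping (and paying) indefinitely.
    raise RuntimeError(f"tool failed after {MAX_RETRIES} attempts: {last_error}")
```

With a cap in place, the worst case for a 30K-token context is bounded at 90K wasted tokens rather than open-ended.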
Monthly cost by model and framework
Assuming one developer running 15 agent tasks per day, 22 working days per month, ~500K tokens per task:
| Model | Cost/task | Daily (×15) | Monthly |
|---|---|---|---|
| Claude Opus 4.6 | $9.18 | $137.70 | $3,029 |
| Claude Sonnet 4.6 | $2.25 | $33.75 | $743 |
| GPT-5.4 | $4.73 | $70.95 | $1,561 |
| DeepSeek V3.2 | $0.16 | $2.40 | $53 |
| Qwen 3.5 35B | $0.04 | $0.60 | $13 |
| CheapestInference Pro | — | — | $50 flat |
A team of 5 developers each running 15 tasks/day on Claude Opus spends $15,145/month. The same team on DeepSeek V3.2 via CheapestInference pays $250/month (5 × $50 Pro plan). That’s a 60x reduction.
Four strategies to cut agent inference costs
1. Switch to open-source models
DeepSeek V3.2 and Qwen 3.5 score within 4 points of GPT-5.4 and Opus on most benchmarks. For coding tasks specifically, DeepSeek V3.2 matches Opus on HumanEval and SWE-bench. Full data: Open-source models are production-ready.
2. Route by task complexity
Not every agent step needs a frontier model. File reads, simple classifications, and formatting don’t need 685B parameters. Use a small model for easy steps and a large model for hard ones. Full guide: Building a multi-model architecture.
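The routing idea above can be sketched in a few lines. Everything here is an illustrative assumption: the model names, the step-type labels, and the rule that classifies a step as "easy" are placeholders, not a real routing API.

```python
# Hypothetical router: mechanical steps go to a small model,
# reasoning-heavy steps go to the frontier model.
CHEAP_MODEL = "qwen-3.5-35b"
FRONTIER_MODEL = "claude-opus-4.6"

EASY_STEP_TYPES = {"read_file", "format_output", "classify_label"}

def pick_model(step_type: str) -> str:
    """Route a step to the cheapest model that can handle it."""
    return CHEAP_MODEL if step_type in EASY_STEP_TYPES else FRONTIER_MODEL

print(pick_model("read_file"))      # qwen-3.5-35b
print(pick_model("plan_refactor"))  # claude-opus-4.6
```

In practice the classification rule is the hard part; some teams use a small model to grade step difficulty before routing.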
3. Set per-key budgets with automatic reset
Give each agent its own API key with a dollar-denominated budget that resets every few hours. When the budget is exhausted, the agent pauses instead of burning through your allocation. We built this into every key: Agent budgets explained.
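The semantics of such a budget can be sketched as a small class. This is an assumed implementation, not how any particular provider enforces it: a dollar cap per key, an automatic reset on a fixed interval, and a refusal (rather than an overspend) when the cap would be exceeded.

```python
import time

class KeyBudget:
    """Dollar-denominated budget for one API key, resetting on an interval."""

    def __init__(self, cap_usd: float, reset_seconds: float):
        self.cap_usd = cap_usd
        self.reset_seconds = reset_seconds
        self.spent = 0.0
        self.window_start = time.monotonic()

    def try_spend(self, cost_usd: float) -> bool:
        now = time.monotonic()
        if now - self.window_start >= self.reset_seconds:
            self.spent = 0.0          # automatic reset of the window
            self.window_start = now
        if self.spent + cost_usd > self.cap_usd:
            return False              # agent pauses instead of overspending
        self.spent += cost_usd
        return True

budget = KeyBudget(cap_usd=5.00, reset_seconds=3 * 3600)
print(budget.try_spend(4.50))  # True
print(budget.try_spend(1.00))  # False (would exceed the $5 cap)
```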
4. Use flat-rate pricing
Per-token pricing penalizes the exact patterns agents use: large contexts, many steps, retries. Flat-rate pricing makes all of that free. Your agent can use the full context window, retry freely, and run 24/7 without increasing the bill.
The math that matters
Here’s the equation most teams miss:
Agent cost = tokens_per_step × steps × cost_per_token

Most optimization focuses on cost_per_token — switching to a cheaper model. But tokens_per_step grows quadratically with the accumulating context, and steps is unpredictable. Optimizing only one variable leaves the other two working against you.
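Working the equation with illustrative numbers shows the trap. All figures below are assumptions chosen to make the point, not measurements:

```python
def agent_cost(avg_tokens_per_step, steps, cost_per_million_tokens):
    """Cost of one agent run: tokens_per_step x steps x cost_per_token."""
    return avg_tokens_per_step * steps * cost_per_million_tokens / 1_000_000

# Baseline: 30K average context, 40 steps, $5 per million tokens.
print(agent_cost(30_000, 40, 5.0))   # 6.0 dollars
# Half the per-token price, but a weaker model that needs more steps
# and so carries more accumulated context per step.
print(agent_cost(45_000, 60, 2.5))   # 6.75 dollars: cheaper per token, costlier overall
```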
Flat-rate pricing eliminates all three variables from your bill. The cost is the subscription. Period.
We serve many models with flat-rate pricing and per-key budget caps. One subscription, unlimited keys, and the guarantee that your agent’s token consumption never becomes your problem. Get started or see plans.