Qwen 3.5 vs GPT-5.4 vs Claude Opus 4.6 — same quality, fraction of the price
You asked for this. After our first benchmark post, the most requested model was Qwen 3.5. Here it is — 4 models across 5 metrics, same models in every chart:
- Open-source: Qwen3.5-397B-A17B (flagship), Qwen3.5-35B-A3B (efficient)
- Proprietary: GPT-5.4, Claude Opus 4.6
Knowledge: MMLU-Pro (%)
GPT-5.4 leads at 88.5%, but Qwen3.5-397B is 0.7 points behind — statistical noise. The 35B with only 3B active parameters scores 85.3%, beating Opus by 3.3 points. The total spread across all four models is just 6.5 points.
Qwen3.5-397B matches GPT-5.4 at roughly a fifth of the cost. The 35B beats Opus at 1/23rd the price.
Reasoning: GPQA Diamond (%)
Proprietary models lead on graduate-level reasoning. GPT-5.4 at 92% and Opus at 91.3% are strong. But Qwen3.5-397B at 88.4% is within 4 points — and costs $0.54/M vs $2.50 and $5.00. The 35B at 84.2% is still PhD-level performance for $0.22/M input.
Code: LiveCodeBench v6 (%)
The 397B essentially ties GPT-5.4 on competitive coding — 0.4 points apart. Both beat Opus by roughly 8 points. The 35B at 74.6% is within 2 points of Opus, at 1/23rd the price.
For dedicated coding workloads, we also serve Qwen3-Coder-480B (SWE-bench Verified: 69.6%, comparable to Claude Sonnet 4).
Speed: output tokens per second
The 35B’s MoE architecture pays off — 178 tok/s is 2.3x faster than GPT-5.4 and 3.9x faster than Opus. Even the 397B flagship at 84 tok/s outpaces both proprietary models. This is what happens when only 3B (for the 35B) or 17B (for the 397B) parameters activate per token instead of the full model.
Speed data from Artificial Analysis. Actual speeds on our infrastructure may differ.
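If mixture-of-experts routing is new to you, here's the idea in a few lines. This is a toy sketch, not Qwen's actual architecture: the dimensions are made up, and real routers add load balancing and full FFN experts. But it shows why a model with 397B total parameters can do per-token work closer to a 17B one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only -- not Qwen's real config.
d_model, n_experts, top_k = 64, 32, 2

x = rng.standard_normal(d_model)                    # one token's hidden state
router = rng.standard_normal((n_experts, d_model))  # routing weights
experts = rng.standard_normal((n_experts, d_model, d_model))  # one toy "expert" each

# The router scores every expert, but only the top-k actually run.
scores = router @ x
top = np.argsort(scores)[-top_k:]
gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over chosen experts

# Output is the gate-weighted sum of just the selected experts' outputs.
y = sum(g * (experts[i] @ x) for g, i in zip(gates, top))

print(f"ran {top_k}/{n_experts} experts -> ~{top_k / n_experts:.0%} of expert FLOPs per token")
```

Every parameter still has to sit in memory, but per-token compute scales with the active experts, which is where the throughput numbers above come from.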
Price: input cost per million tokens
This is the chart that matters. Opus costs 23x more than the 35B and 9x more than the 397B. GPT-5.4 costs 4.6x more than the 397B. The quality difference? Single-digit percentage points on every benchmark.
The full picture
Quality only — no price axis. GPT-5.4 (gray) has the largest shape. Opus (dashed) is strong on reasoning and code. The 397B (indigo) nearly overlaps GPT-5.4 on code and knowledge. The 35B (teal) pulls hard left on speed — at 178 tok/s, it's more than twice as fast as anything else here. Price tells its own story in the chart above.
The scorecard
| Metric | Winner | Qwen3.5 397B | GPT-5.4 | Claude Opus 4.6 | Gap (397B vs best) |
|---|---|---|---|---|---|
| Knowledge (MMLU-Pro) | GPT-5.4 | 87.8% | 88.5% | 82.0% | -0.7 pts |
| Reasoning (GPQA) | GPT-5.4 | 88.4% | 92.0% | 91.3% | -3.6 pts |
| Code (LiveCodeBench) | GPT-5.4 | 83.6% | 84.0% | 76.0% | -0.4 pts |
| Speed (tok/s) | Qwen3.5 397B | 84 | ~78 | 46 | 1.1x faster |
| Price ($/M input) | Qwen3.5 397B | $0.54 | $2.50 | $5.00 | 4.6x cheaper |
Same weight class, different price tag. The 397B trades 0.4–3.6 points on quality for 4.6x lower price and faster speed. It beats Opus on 4 out of 5 metrics outright.
Note: The Qwen3.5-35B-A3B ($0.22/M) scores 85.3% MMLU-Pro, 84.2% GPQA, 74.6% LiveCodeBench at 178 tok/s — beating Opus on knowledge and speed at 23x less cost. A different weight class, but worth considering if speed and price matter more than the last few quality points.
The real question: what are you paying for?
The quality gap between Qwen3.5-397B and GPT-5.4 is 0.7 points on knowledge, 0.4 points on code. The price gap is 4.6x.
Put differently:
| Model | MMLU-Pro | Cost per quality point |
|---|---|---|
| Qwen3.5 35B | 85.3% | $0.003 per point per M tokens |
| Qwen3.5 397B | 87.8% | $0.006 per point per M tokens |
| GPT-5.4 | 88.5% | $0.028 per point per M tokens |
| Claude Opus 4.6 | 82.0% | $0.061 per point per M tokens |
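The cost-per-point column is just input price divided by MMLU-Pro score. Here's the whole calculation, if you want to plug in your own candidates:

```python
# Reproduce the cost-per-quality-point column: input price / MMLU-Pro score.
models = {
    "Qwen3.5 35B":     (0.22, 85.3),
    "Qwen3.5 397B":    (0.54, 87.8),
    "GPT-5.4":         (2.50, 88.5),
    "Claude Opus 4.6": (5.00, 82.0),
}

for name, (usd_per_m_input, mmlu_pro) in models.items():
    print(f"{name:<16} ${usd_per_m_input / mmlu_pro:.3f} per point per M tokens")
```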
Opus costs more than 20x as much per quality point as the 35B — and scores lower. GPT-5.4 leads on quality but costs 4.6–11x more than the Qwen models for single-digit advantages.
For most workloads, the last 3% of benchmark performance isn’t worth a 5x price increase. And for workloads where it is — the 397B gets you within 1 point of GPT-5.4 at a fraction of the cost.
Also available: specialized Qwen models
Beyond the general-purpose models, we serve two Qwen specialists:
- Qwen3-Coder-480B — SWE-bench Verified 69.6%, comparable to Claude Sonnet 4. Built for agentic coding.
- Qwen3-235B-Thinking — Chain-of-thought reasoning specialist. When you need the model to show its work.
Both available through the same API, same flat-rate plans.
All Qwen 3.5 models are available now on our API. Flat rate from $20/mo, or pay-as-you-go credits. See pricing and try it →
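If you already use an OpenAI-compatible SDK, switching is a few lines. A minimal sketch (the base URL and model ID here are placeholders; check the docs for the real values):

```python
# Hypothetical sketch: the base URL and model ID below are placeholders.
# Check the pricing/docs page for the real endpoint and model names.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="qwen3.5-397b-a17b",  # placeholder model ID
    messages=[{"role": "user", "content": "One-sentence summary of MoE routing?"}],
)
print(resp.choices[0].message.content)
```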
Sources: Qwen3.5-397B Model Card · Qwen3.5-35B Model Card · Artificial Analysis Leaderboard · GPQA Diamond Leaderboard · OpenAI Pricing · Anthropic Pricing · LiveCodeBench Leaderboard