
Qwen 3.5 vs GPT-5.4 vs Claude Opus 4.6 — same quality, fraction of the price

You asked for this. After our first benchmark post, the most requested model was Qwen 3.5. Here it is — 4 models across 5 metrics, same models in every chart:

Open-source: Qwen3.5-397B-A17B (flagship), Qwen3.5-35B-A3B (efficient)
Proprietary: GPT-5.4, Claude Opus 4.6


Knowledge (MMLU-Pro): GPT-5.4 88.5% · Qwen3.5 397B 87.8% · Qwen3.5 35B 85.3% · Claude Opus 4.6 82.0%

GPT-5.4 leads at 88.5%, but Qwen3.5-397B is 0.7 points behind, which is statistical noise. The 35B, with only 3B active parameters, scores 85.3%, beating Opus by 3.3 points. The total spread across all four models is just 6.5 points.

Qwen3.5-397B matches GPT-5.4 at roughly 1/5th the cost. The 35B beats Opus at 1/23rd the price.


Reasoning (GPQA Diamond): GPT-5.4 92.0% · Claude Opus 4.6 91.3% · Qwen3.5 397B 88.4% · Qwen3.5 35B 84.2%

Proprietary models lead on graduate-level reasoning. GPT-5.4 at 92% and Opus at 91.3% are strong. But Qwen3.5-397B at 88.4% is within 4 points — and costs $0.54/M vs $2.50 and $5.00. The 35B at 84.2% is still PhD-level performance for $0.22/M input.


Code (LiveCodeBench): GPT-5.4 84.0% · Qwen3.5 397B 83.6% · Claude Opus 4.6 76.0% · Qwen3.5 35B 74.6%

The 397B essentially ties GPT-5.4 on competitive coding — 0.4 points apart. Both beat Opus by 8+ points. The 35B at 74.6% is within 2 points of Opus, at 1/23rd the price.

For dedicated coding workloads, we also serve Qwen3-Coder-480B (SWE-bench Verified: 69.6%, comparable to Claude Sonnet 4).


Speed (output throughput): Qwen3.5 35B 178 tok/s · Qwen3.5 397B 84 tok/s · GPT-5.4 ~78 tok/s · Claude Opus 4.6 46 tok/s

The 35B’s MoE architecture pays off: 178 tok/s is 2.3x faster than GPT-5.4 and 3.9x faster than Opus. Even the 397B flagship at 84 tok/s outpaces both proprietary models. This is what happens when only 3B (for the 35B) or 17B (for the 397B) parameters activate per token instead of the full model.
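To make the throughput gap concrete, here is a quick back-of-the-envelope sketch. The tok/s figures come from the chart above; the 10,000-token job size is an arbitrary example, not a real workload:

```python
# Decode throughput from the speed chart above (tokens/second).
throughput = {
    "Qwen3.5 35B": 178,
    "Qwen3.5 397B": 84,
    "GPT-5.4": 78,  # approximate
    "Claude Opus 4.6": 46,
}

# Wall-clock time to stream a hypothetical 10,000-token job,
# plus each model's speedup relative to Opus.
job_tokens = 10_000
for model, tps in throughput.items():
    seconds = job_tokens / tps
    speedup = tps / throughput["Claude Opus 4.6"]
    print(f"{model}: {seconds:.0f}s ({speedup:.1f}x vs Opus)")
```

At these rates the same 10k-token job takes about 56 seconds on the 35B and about 217 seconds on Opus.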

Speed data from Artificial Analysis. Actual speeds on our infrastructure may differ.


Price ($ per million input tokens): Qwen3.5 35B $0.22 · Qwen3.5 397B $0.54 · GPT-5.4 $2.50 · Claude Opus 4.6 $5.00

This is the chart that matters. Opus costs 23x more than the 35B and 9x more than the 397B. GPT-5.4 costs 5x more than the 397B. The quality difference? Single-digit percentage points on every benchmark.
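The multiples quoted here follow directly from the list prices. A quick sanity check, using the $/M input figures from the chart:

```python
# $ per million input tokens, from the price chart above.
price = {
    "Qwen3.5 35B": 0.22,
    "Qwen3.5 397B": 0.54,
    "GPT-5.4": 2.50,
    "Claude Opus 4.6": 5.00,
}

# Price multiples quoted in the text.
print(f"Opus vs 35B:     {price['Claude Opus 4.6'] / price['Qwen3.5 35B']:.0f}x")
print(f"Opus vs 397B:    {price['Claude Opus 4.6'] / price['Qwen3.5 397B']:.0f}x")
print(f"GPT-5.4 vs 397B: {price['GPT-5.4'] / price['Qwen3.5 397B']:.1f}x")
```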


Radar chart (quality only): axes Code, Reasoning, Knowledge, Speed; series Qwen3.5 397B, Qwen3.5 35B, GPT-5.4, Opus 4.6.

Quality only — no price axis. GPT-5.4 (gray) has the largest shape. Opus (dashed) is strong on reasoning and code. The 397B (indigo) nearly overlaps GPT-5.4 on code and knowledge. The 35B (teal) pulls hard left on speed — 178 tok/s is 2.3x faster than anything else here. Price tells its own story in the chart above.

| Metric | Winner | Qwen3.5 397B | GPT-5.4 | Claude Opus 4.6 | Gap (397B vs best) |
| --- | --- | --- | --- | --- | --- |
| Knowledge (MMLU-Pro) | GPT-5.4 | 87.8% | 88.5% | 82.0% | −0.7 pts |
| Reasoning (GPQA) | GPT-5.4 | 88.4% | 92.0% | 91.3% | −3.6 pts |
| Code (LiveCodeBench) | GPT-5.4 | 83.6% | 84.0% | 76.0% | −0.4 pts |
| Speed (tok/s) | Qwen3.5 397B | 84 t/s | ~78 t/s | 46 t/s | 1.1x faster |
| Price ($/M input) | Qwen3.5 397B | $0.54 | $2.50 | $5.00 | 4.6x cheaper |

Same weight class, different price tag. The 397B trades 0.4–3.6 points on quality for 4.6x lower price and faster speed. It beats Opus on 4 out of 5 metrics outright.

Note: The Qwen3.5-35B-A3B ($0.22/M) scores 85.3% MMLU-Pro, 84.2% GPQA, 74.6% LiveCodeBench at 178 tok/s — beating Opus on knowledge and speed at 23x less cost. A different weight class, but worth considering if speed and price matter more than the last few quality points.


The real question: what are you paying for?


The quality gap between Qwen3.5-397B and GPT-5.4 is 0.7 points on knowledge, 0.4 points on code. The price gap is 4.6x.

Put differently:

| Model | MMLU-Pro | Cost per quality point |
| --- | --- | --- |
| Qwen3.5 35B | 85.3% | $0.003 per point per M tokens |
| Qwen3.5 397B | 87.8% | $0.006 per point per M tokens |
| GPT-5.4 | 88.5% | $0.028 per point per M tokens |
| Claude Opus 4.6 | 82.0% | $0.061 per point per M tokens |

Opus costs 20x more per quality point than the 35B — and scores lower. GPT-5.4 leads on quality but costs 5-10x more for single-digit advantages.

For most workloads, the last 3% of benchmark performance isn’t worth a 5x price increase. And for workloads where it is, the 397B gets you within a point of GPT-5.4 on knowledge and code at a fraction of the cost.
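The cost-per-point figures above, and the "which model should I pick" question, reduce to a few lines of arithmetic. The prices and MMLU-Pro scores are from this post; the quality thresholds in the usage example are illustrative assumptions:

```python
# ($/M input tokens, MMLU-Pro %), figures from this post.
models = {
    "Qwen3.5 35B":     (0.22, 85.3),
    "Qwen3.5 397B":    (0.54, 87.8),
    "GPT-5.4":         (2.50, 88.5),
    "Claude Opus 4.6": (5.00, 82.0),
}

# Reproduce the cost-per-quality-point column from the table.
for name, (price, mmlu) in models.items():
    print(f"{name}: ${price / mmlu:.3f} per point per M tokens")

def cheapest_above(threshold: float) -> str:
    """Cheapest model scoring at least `threshold` on MMLU-Pro."""
    eligible = [(p, n) for n, (p, q) in models.items() if q >= threshold]
    return min(eligible)[1]

print(cheapest_above(85.0))  # Qwen3.5 35B
print(cheapest_above(88.0))  # GPT-5.4
```

The selection rule makes the trade-off explicit: only when the quality bar is set above 87.8% does the cheapest qualifying option stop being a Qwen model.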


Beyond the general-purpose models, we serve two Qwen specialists:

  • Qwen3-Coder-480B — SWE-bench Verified 69.6%, comparable to Claude Sonnet 4. Built for agentic coding.
  • Qwen3-235B-Thinking — Chain-of-thought reasoning specialist. When you need the model to show its work.

Both available through the same API, same flat-rate plans.


All Qwen 3.5 models are available now on our API. Flat rate from $20/mo, or pay-as-you-go credits. See pricing and try it →
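The post doesn’t show request details, so as a sketch only: assuming the API is OpenAI-compatible chat completions (common for hosted Qwen models, but an assumption here), a request might look like the following. The base URL, model ID, and API key are all placeholders; check the provider’s own docs for the real values:

```python
import json
import urllib.request

# Placeholder values: substitute your provider's actual base URL,
# model ID, and API key. This assumes an OpenAI-compatible endpoint.
BASE_URL = "https://api.example.com/v1"
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "qwen3.5-397b-a17b",  # hypothetical model ID
    "messages": [
        {"role": "user", "content": "Summarize MoE routing in two sentences."}
    ],
    "max_tokens": 256,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# resp = urllib.request.urlopen(req)  # uncomment with real credentials
```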

Sources: Qwen3.5-397B Model Card · Qwen3.5-35B Model Card · Artificial Analysis Leaderboard · GPQA Diamond Leaderboard · OpenAI Pricing · Anthropic Pricing · LiveCodeBench Leaderboard