As AI labs like OpenAI and Anthropic push the boundaries with so-called “reasoning” models, a new challenge is emerging: Benchmarking these models is getting extremely expensive.
According to Artificial Analysis, a third-party AI benchmarking organization, the cost of evaluating models with advanced reasoning capabilities now runs into the thousands of dollars, even for a single round of testing. For example, benchmarking OpenAI’s o1 model across just seven popular tests cost the firm $2,767.05 due to the vast number of tokens generated.
By contrast, benchmarking non-reasoning models like GPT-4o or Claude 3.6 Sonnet cost just $108.85 and $81.41, respectively.
What Makes Benchmarking So Expensive?
The primary culprit? Tokens.
Reasoning models tend to generate significantly more tokens per test prompt because they work through problems step by step—just like a human might. That “thinking” process isn’t cheap, especially when labs like OpenAI and Anthropic charge per token used.
For example:
- OpenAI o1 (reasoning): 44+ million tokens generated across those seven tests
- GPT-4o (non-reasoning): ~5.5 million tokens
- Claude 3.7 Sonnet: ~$1,485 to benchmark
- Claude 3.7 Sonnet (custom tests): $580 for 3,700 prompts
- MMLU Pro benchmark alone: could cost over $1,800
And top-tier models keep getting more expensive per token. Recent list prices for output tokens (a rough cost sketch follows the list):
- Claude 3 Opus: $75/million output tokens
- GPT-4.5: $150/million output tokens
- o1-pro: A jaw-dropping $600/million output tokens
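To make the arithmetic concrete, here is a minimal sketch of how per-token billing compounds with verbose reasoning output. The token count and per-million prices are the figures quoted above; pairing them is purely illustrative, since the article doesn't state which rate applied to which benchmark run, and the eval_cost_usd helper is a hypothetical name for this sketch.

```python
# Rough sketch: how per-token billing turns into benchmark bills.
# Numbers are the figures quoted in the article; the pairings are
# illustrative only, not actual vendor invoices.

def eval_cost_usd(output_tokens: int, price_per_million_usd: float) -> float:
    """Estimate evaluation cost from output tokens and a per-million-token list price."""
    return output_tokens / 1_000_000 * price_per_million_usd

# ~44 million output tokens (o1's benchmark run) priced at a few quoted rates:
for label, rate in [("$75/M (Claude 3 Opus)", 75.0),
                    ("$150/M (GPT-4.5)", 150.0),
                    ("$600/M (o1-pro)", 600.0)]:
    print(f"44M tokens at {label}: ${eval_cost_usd(44_000_000, rate):,.0f}")
# -> $3,300, $6,600, and $26,400 respectively
```

The point of the sketch: the same test suite can vary by an order of magnitude in cost depending on which model's output pricing applies, which is why verbose reasoning traces hit evaluators so hard.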
Is This Even Science Anymore?
Some researchers are raising red flags—not just about costs but about scientific transparency.
“If you publish a result that no one can replicate with the same model, is it even science anymore?” — Ross Taylor, CEO of General Reasoning
When benchmarking access is limited or subsidized by the model creators themselves (as OpenAI reportedly does), it can cast doubt on the objectivity and reproducibility of published results—even if there’s no direct manipulation.
Add to that the complexity of modern benchmarks—requiring real-world tasks like coding, web browsing, and software interaction—and the evaluation landscape becomes even more fragmented and resource-intensive.
The Future of Benchmarking: Bigger Models, Bigger Bills?
Artificial Analysis says it plans to increase its monthly testing budget as more “reasoning” models hit the market. That may be necessary for independent transparency, but it also deepens the gap between well-funded labs and smaller academic or open-source teams.
“We’re moving to a world where labs report x% on a benchmark by spending y amount of compute—but academics can’t afford y.” — Taylor
Ultimately, the debate over the cost of benchmarking reasoning models isn't just about dollars; it's about who gets to shape the narrative of what "intelligence" means in artificial intelligence.