The rise of AI ‘reasoning’ models is making benchmarking more expensive

Nikesh Vaishnav

AI labs like OpenAI claim that their so-called “reasoning” AI models, which can “think” through problems step by step, are more capable than their non-reasoning counterparts in specific domains, such as physics. But while this generally appears to be the case, reasoning models are also much more expensive to benchmark, making it difficult to independently verify these claims.

According to data from Artificial Analysis, a third-party AI testing outfit, it costs $2,767.05 to evaluate OpenAI’s o1 reasoning model across a suite of seven popular AI benchmarks: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500.

Benchmarking Anthropic’s recent Claude 3.7 Sonnet, a “hybrid” reasoning model, on the same set of tests cost $1,485.35, while testing OpenAI’s o3-mini-high cost $344.59, per Artificial Analysis.

Some reasoning models are cheaper to benchmark than others. Artificial Analysis spent $141.22 evaluating OpenAI’s o1-mini, for example. But on average, they tend to be pricey. All told, Artificial Analysis has spent roughly $5,200 evaluating around a dozen reasoning models, more than twice the amount the firm spent analyzing over 80 non-reasoning models ($2,400).
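
For a rough sense of scale, here is a back-of-the-envelope calculation (a minimal Python sketch based on the figures above; the model counts are the article’s approximations, so the per-model averages are only ballpark):

```python
# Implied average evaluation cost per model, using the totals reported above.
reasoning_total_usd, reasoning_models = 5_200, 12          # ~a dozen reasoning models
non_reasoning_total_usd, non_reasoning_models = 2_400, 80  # 80+ non-reasoning models

print(f"~${reasoning_total_usd / reasoning_models:,.0f} per reasoning model")          # ~$433
print(f"~${non_reasoning_total_usd / non_reasoning_models:,.0f} per non-reasoning model")  # ~$30
```

In other words, the average reasoning model in that sample cost more than ten times as much to benchmark as the average non-reasoning model.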

OpenAI’s non-reasoning GPT-4o model, released in May 2024, cost Artificial Analysis just $108.85 to evaluate, while Claude 3.6 Sonnet — Claude 3.7 Sonnet’s non-reasoning predecessor — cost $81.41.

Artificial Analysis co-founder George Cameron told TechCrunch that the organization plans to increase its benchmarking spend as more AI labs develop reasoning models.

“At Artificial Analysis, we run hundreds of evaluations monthly and devote a significant budget to these,” Cameron said. “We are planning for this spend to increase as models are more frequently released.”

Artificial Analysis isn’t the only outfit of its kind that’s dealing with rising AI benchmarking costs.

Ross Taylor, the CEO of AI startup General Reasoning, said he recently spent $580 evaluating Claude 3.7 Sonnet on around 3,700 unique prompts. Taylor estimates a single run-through of MMLU-Pro, a question set designed to benchmark a model’s language comprehension skills, would have cost more than $1,800.

“We’re moving to a world where a lab reports x% on a benchmark where they spend y amount of compute, but where resources for academics are << y,” said Taylor in a recent post on X. “[N]o one is going to be able to reproduce the results.”

Why are reasoning models so expensive to test? Mainly because they generate a lot of tokens. Tokens represent bits of raw text, such as the word “fantastic” split into the chunks “fan,” “tas,” and “tic.” According to Artificial Analysis, OpenAI’s o1 generated over 44 million tokens during the firm’s benchmarking tests, around eight times the amount GPT-4o generated.

The vast majority of AI companies charge for model usage by the token, so you can see how this cost can add up.
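
The basic math is simple: a benchmark bill scales with the number of tokens a model generates times the provider’s per-token rate. A minimal sketch, assuming a hypothetical list price of $60 per million output tokens (the actual rate may differ) and ignoring input-token charges:

```python
def benchmark_cost(output_tokens: int, price_per_million_usd: float) -> float:
    """Cost of the generated output at a per-million-token rate (output tokens only)."""
    return output_tokens / 1_000_000 * price_per_million_usd

# o1 reportedly generated over 44 million tokens across the benchmark suite;
# $60 per million output tokens is an assumed price used purely for illustration.
print(f"${benchmark_cost(44_000_000, 60):,.0f}")  # -> $2,640
```

That lands in the same neighborhood as the $2,767.05 figure Artificial Analysis reported for o1, with any gap plausibly down to input tokens and the exact rates charged.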

Modern benchmarks also tend to elicit a lot of tokens from models because they contain questions involving complex, multi-step tasks, according to Jean-Stanislas Denain, a senior researcher at Epoch AI, which develops its own model benchmarks.

“[Today’s] benchmarks are more complex [even though] the number of questions per benchmark has overall decreased,” Denain told TechCrunch. “They often attempt to evaluate models’ ability to do real-world tasks, such as write and execute code, browse the internet, and use computers.”

Denain added that the most expensive models have gotten more expensive per token over time. For example, Anthropic’s Claude 3 Opus was the priciest model when it was released in March 2024, costing $75 per million output tokens. OpenAI’s GPT-4.5 and o1-pro, both of which launched earlier this year, cost $150 per million output tokens and $600 per million output tokens, respectively.
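
Holding output volume constant makes the trend concrete. The sketch below reuses o1’s reported 44 million tokens purely to compare price points (an assumption for illustration; different models generate very different amounts of output) and applies the per-million-token prices cited above:

```python
# $ per million output tokens, as cited above
prices_usd_per_million = {"Claude 3 Opus": 75, "GPT-4.5": 150, "o1-pro": 600}
output_tokens = 44_000_000  # o1's reported total, reused here only for comparison

for model, price in prices_usd_per_million.items():
    print(f"{model}: ${output_tokens / 1_000_000 * price:,.0f}")
# Claude 3 Opus: $3,300
# GPT-4.5: $6,600
# o1-pro: $26,400
```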

“[S]ince models have gotten better over time, it’s still true that the cost to reach a given level of performance has greatly decreased over time,” Denain said. “But if you want to evaluate the best largest models at any point in time, you’re still paying more.”

Many AI labs, including OpenAI, give benchmarking organizations free or subsidized access to their models for testing purposes. But this colors the results, some experts say — even if there’s no evidence of manipulation, the mere suggestion of an AI lab’s involvement threatens to harm the integrity of the evaluation scoring.

“From [a] scientific point of view, if you publish a result that no one can replicate with the same model, is it even science anymore?” wrote Taylor in a follow-up post on X. “Was it ever science?”
