OpenAI o3 AI Model Under Fire: Independent Tests Reveal Lower Scores Than Claimed


OpenAI's o3 AI model has landed in controversy after independent benchmark tests revealed performance scores significantly lower than the company originally suggested.

When OpenAI introduced o3 back in December, its internal testing results looked impressive, especially on the notoriously difficult FrontierMath benchmark. OpenAI claimed the model solved over 25% of the problems in the benchmark, a score that dwarfed those of rival AI systems, which reportedly hovered below 2%.

However, new third-party data paints a different picture. Research institute Epoch AI, creator of the FrontierMath benchmark, recently published its own test results for o3, revealing that the model scored around 10%, far below the figure OpenAI shared in its initial announcement.

The discrepancy has raised questions about the testing conditions under which OpenAI's internal figures were generated. Experts suggest that OpenAI may have used a more powerful compute configuration for its in-house evaluations than the one behind the public-facing model now available to users.

Epoch clarified that multiple factors could explain the gap. Its statement pointed to potential differences in the problem sets used during testing and in the amount of computational power allocated at inference time. Additionally, the version of o3 ultimately released to the public appears to have been optimized for chat applications and general usage rather than pure benchmark performance.

The ARC Prize Foundation, which had early access to a pre-release version of o3, confirmed that the production model is different: likely smaller in size and intended for broader, more cost-effective deployment.

Even OpenAI's own technical team seems to acknowledge the gap. During a recent livestream, OpenAI staffer Wenda Zhou explained that the released version of o3 was designed for real-world applications, emphasizing speed and efficiency over benchmark supremacy.

Despite the debate, OpenAI appears focused on the future. The company has hinted at an upcoming o3-pro variant, which is expected to close these performance gaps and push the model's capabilities even further.

This incident highlights a growing trend of benchmark controversies in the AI world, where headline claims often outpace the real-world capabilities of new models. OpenAI is not alone: similar scrutiny has been directed at other AI heavyweights, including Elon Musk's xAI and Meta, both of which have faced criticism over benchmark-related discrepancies in recent months.

For now, o3 remains a powerful model, but the latest findings offer a critical reminder that third-party validation is essential when evaluating AI capabilities.

