OpenAI’s New Reasoning AI Models Hallucinate More Than Ever


OpenAI’s new reasoning AI models hallucinate more than their predecessors, according to the company’s own internal testing and third-party research.

The much-anticipated o3 and o4-mini models, which were marketed as cutting-edge reasoning engines, have unexpectedly shown higher hallucination rates than earlier models like o1, o1-mini, and o3-mini. This discovery raises fresh concerns about the reliability of AI systems, especially as they become increasingly integrated into coding, research, and business workflows.

The issue of AI hallucinations — where systems confidently generate false or misleading information — has long been a thorn in the side of developers. But the latest findings suggest the problem could be worsening as OpenAI pushes the boundaries of AI reasoning capabilities.

In a technical disclosure, OpenAI acknowledged that its new reasoning models hallucinate more than their predecessors, particularly when benchmarked against its in-house dataset, PersonQA. According to the report, the o3 model hallucinated in 33% of test cases, roughly double the error rate of previous reasoning models, while o4-mini performed even worse, with a staggering 48% hallucination rate.
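To make the headline numbers concrete, here is a minimal sketch of how a hallucination rate on a question-answering benchmark might be computed: each model answer is graded against a reference, and the rate is simply the share of questions where the model confidently asserted a wrong answer. The grading rule and the toy data below are illustrative assumptions, not OpenAI's actual PersonQA evaluation harness.

```python
# Minimal sketch of a hallucination-rate calculation on a QA benchmark.
# NOTE: the grading rule and sample data are illustrative placeholders,
# not OpenAI's PersonQA evaluation pipeline.

def grade(model_answer: str, reference: str) -> str:
    """Naive grader: exact match is correct, an explicit refusal is an
    abstention, and any other asserted answer counts as a hallucination.
    Real evaluations use far more careful (often model-assisted) grading."""
    answer = model_answer.strip().lower()
    if answer == reference.strip().lower():
        return "correct"
    if answer in {"i don't know", "i'm not sure", "unsure"}:
        return "abstained"
    return "hallucination"

def hallucination_rate(results: list[tuple[str, str]]) -> float:
    """Fraction of benchmark items where the model asserted a wrong answer."""
    grades = [grade(answer, reference) for answer, reference in results]
    return grades.count("hallucination") / len(grades)

# Toy example: three (model answer, reference answer) pairs.
sample = [
    ("Paris", "Paris"),          # correct
    ("1852", "1851"),            # confidently wrong -> hallucination
    ("I don't know", "Berlin"),  # abstention, not a hallucination
]
print(f"Hallucination rate: {hallucination_rate(sample):.0%}")
```

The key design point is that abstentions are not counted as hallucinations, which is why a model that attempts more claims can score worse on this metric even while getting more questions right overall.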

Even so, the new models still outperform older systems in tasks like coding and advanced mathematics. The trade-off appears to be that the more “claims” a reasoning model attempts, the more room there is for both accurate insights and falsehoods.

Third-party nonprofit research lab Transluce independently confirmed these findings, flagging scenarios where o3 fabricated actions, such as falsely claiming it executed code on a 2021 MacBook Pro outside of the ChatGPT environment — a technical impossibility for the model.

AI experts believe the problem may be linked to the way reinforcement learning is applied in the training process. According to Neil Chowdhury, a former OpenAI researcher now at Transluce, this reinforcement learning may actually “amplify issues usually mitigated by standard post-training pipelines.”

Others in the tech community, including Stanford adjunct professor and Workera CEO Kian Katanforoosh, noted that while o3 offers impressive coding support, it frequently fabricates broken web links — a quirk that underlines the risks of deploying these models for serious business use.

Despite these setbacks, OpenAI remains optimistic, suggesting that web search integration could reduce hallucination rates. The company’s GPT-4o model, equipped with search capabilities, has already achieved a 90% accuracy score on another benchmark known as SimpleQA.

However, the rise in hallucination rates poses serious questions about the future of AI reasoning. If reasoning models continue to scale in complexity but degrade in factual reliability, developers and businesses may need to rethink how these systems are integrated into sensitive workflows.

An OpenAI spokesperson responded to the concerns by emphasizing the company’s ongoing commitment to improving model reliability:

“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability,” said Niko Felix.

As the AI industry continues shifting its focus toward reasoning-based models, the challenge will be finding ways to enhance their capabilities without sacrificing factual reliability.

