AI benchmark discrepancy reveals gaps in performance claims


FrontierMath accuracy for OpenAI's o3 and o4-mini compared to leading models. Image: Epoch AI

The latest FrontierMath results, a benchmark test for generative AI on advanced math problems, show that OpenAI's o3 model performed worse than OpenAI originally claimed. While OpenAI's newest models now outperform o3, the discrepancy highlights the need to scrutinize AI benchmarks closely.

Epoch AI, the research institute that created and administers the test, published its latest findings on April 18.

OpenAI claimed 25% completion of the test in December

Last year, the FrontierMath score for OpenAI o3 was part of the almost overwhelming volume of announcements and promotions released during OpenAI's 12-day holiday event. The company said OpenAI o3, then its most powerful reasoning model, had solved more than 25% of the problems in FrontierMath. By comparison, most rival models scored around 2%, according to TechCrunch.

SEE: For Earth Day, organizations could consider the power of generative AI in their sustainability efforts.

On April 18, Epoch AI released test results showing that OpenAI o3 scored closer to 10%. So why the big difference? Both the model and the test may have differed in December. The version of OpenAI o3 submitted for last year's benchmark evaluation was an earlier build of the model, and FrontierMath itself has changed since December, with a different set of math problems. This isn't necessarily a reason to distrust benchmarks; rather, it's a reminder to dig into version numbers.

OpenAI o4-mini and o3-mini score higher in new FrontierMath results

The updated results show OpenAI o4-mini, with improved reasoning, scoring between 15% and 19%. It was followed by OpenAI o3-mini, with o3 in third. Other rankings include:

  • OpenAI o1
  • Grok-3 mini
  • Claude 3.7 Sonnet (16K)
  • Grok-3
  • Claude 3.7 Sonnet (64K)

Although Epoch AI administers the test independently, OpenAI originally commissioned FrontierMath and owns its content.

Criticism of AI benchmarking

Benchmarks are a common way to compare generative AI models, but critics say the results can be influenced by test design or a lack of transparency. A July 2024 study raised concerns that benchmarks often emphasize narrow task accuracy and suffer from non-standardized evaluation practices.
