Testing AI systems on hard math problems shows they still perform very poorly

12 Nov 2024, 16:10 by Bob Yirka

Mathematical subject interconnections in FrontierMath. Node sizes indicate the frequency of each subject's appearance in problems, while connections indicate when multiple mathematical subjects are combined within single problems, demonstrating the benchmark's integration of many mathematical domains. Credit: arXiv (2024). DOI: 10.48550/arxiv.2411.04872

A team of AI researchers and mathematicians affiliated with several institutions in the U.S. and the U.K. has developed a math benchmark that allows scientists to test the ability of AI systems to solve exceptionally difficult math problems. Their paper is posted on the arXiv preprint server.

Over the past few years, LLMs such as ChatGPT have grown ever more sophisticated and can at times appear highly intelligent. But there is one area where they still fall short: solving difficult math problems.

As developers of AI systems work to improve the math skills of their models, they have developed benchmarks to measure their progress. Two of the most popular are MATH and GSM8K. Over time, several LLMs have improved to the point where they can score up to 90% on these tests. But, as the team behind this new effort notes, the difficulty level of such benchmarks is not that high. They decided a new benchmark was needed, and so they created one, which they named FrontierMath.

To begin, the research team reached out to some of the brightest minds in mathematics. They asked them to provide some truly difficult math problems and got back hundreds in reply. Such problems, the researchers note, are not only new (they have not been published before) but also require a deep understanding of mathematics. Some take humans several days to solve.

They also cover a wide range of topics, from number theory to algebraic geometry. Because of that breadth, brute force will not work. Neither will educated guessing. To score well on the FrontierMath benchmark, an AI system would need creativity, insight and what the research team describes as "deep domain expertise."

Testing so far has borne out FrontierMath's difficulty. AI systems that have scored well on traditional benchmarks have not been able to score any higher than 2% on it.

More information: Elliot Glazer et al, FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI, arXiv (2024). DOI: 10.48550/arxiv.2411.04872

epochai.org/frontiermath/the-benchmark

Journal information: arXiv