Sunday, February 8, 2026

These Mathematicians Are Putting A.I. to the Test

“Large language models struggle to solve research-level math questions. It takes a human to assess just how poorly they perform.

 

A few weeks ago, a high school student emailed Martin Hairer, a mathematician known for his startling creativity. The teenager was an aspiring mathematician, but with the rise of artificial intelligence, he was having doubts. “It is difficult to understand what is really happening,” he said. “It feels like every day these models are improving, and sooner rather than later they will render us useless.”

 

He asked: “If we have a machine that is significantly better than us at solving problems, doesn’t mathematics lose a part of its magic?”

 

Dr. Hairer, who in 2014 won a Fields Medal, the most prestigious prize in mathematics, and in 2021 won the lucrative Breakthrough Prize, splits his time between the Swiss Federal Institute of Technology in Lausanne (EPFL) and Imperial College London. Responding to the student, he observed that many fields were grappling with the prospect of A.I.-induced obsolescence.

 

“I believe that mathematics is actually quite ‘safe,’” Dr. Hairer said. He noted that large language models, or L.L.M.s, the technology at the heart of chatbots, are now quite good at solving made-up problems. But, he said, “I haven’t seen any plausible example of an L.L.M. coming up with a genuinely new idea and/or concept.”

 

Dr. Hairer mentioned this exchange while discussing a new paper, titled “First Proof,” that he co-wrote with several mathematicians, including Mohammed Abouzaid of Stanford University; Lauren Williams of Harvard University; and Tamara Kolda, who runs MathSci.ai, a consultancy in the San Francisco Bay Area.

 

The paper describes a recently launched experiment that collects genuine test questions, drawn from the authors’ unpublished research, in an effort to provide a meaningful measure of A.I.’s mathematical competence.”

 

Key Aspects of “First Proof”:

 

    Purpose: To create a more accurate, meaningful, and rigorous benchmark for evaluating A.I.'s mathematical reasoning capabilities than existing standardized datasets provide.

    Methodology: The experiment uses genuine, unpublished research questions created by the authors.

    Authors: Dr. Hairer (EPFL) co-authored this work with Mohammed Abouzaid (Stanford), Lauren Williams (Harvard), and Tamara Kolda (MathSci.ai).

 

This effort aims to address the limitations of existing A.I. benchmarks by testing capabilities on novel problems rather than on data the models may have already encountered.

 

 

