“Large language models struggle to solve research-level math
questions. It takes a human to assess just how poorly they perform.
A few weeks ago, a high school student emailed Martin
Hairer, a mathematician known for his startling creativity. The teenager was an
aspiring mathematician, but with the rise of artificial intelligence, he was
having doubts. “It is difficult to understand what is really happening,” he
said. “It feels like every day these models are improving, and sooner rather
than later they will render us useless.”
He asked: “If we have a machine that is significantly better
than us at solving problems, doesn’t mathematics lose a part of its magic?”
Dr. Hairer, who in 2014 won a Fields Medal, the most
prestigious prize in mathematics, and in 2021 won the lucrative Breakthrough
Prize, splits his time between the Swiss Federal Institute of Technology in
Lausanne and Imperial College London. Responding to the student, he observed
that many fields were grappling with the prospect of A.I.-induced obsolescence.
“I believe that mathematics is actually quite ‘safe,’” Dr.
Hairer said. He noted that large language models, or L.L.M.s, the technology at
the heart of chatbots, are now quite good at solving made-up problems. But, he
said, “I haven’t seen any plausible example of an L.L.M. coming up with a
genuinely new idea and/or concept.”
Dr. Hairer mentioned this exchange while discussing a new
paper, titled “First Proof,” that he cowrote with several mathematicians,
including Mohammed Abouzaid of Stanford University; Lauren Williams of Harvard
University; and Tamara Kolda, who runs MathSci.ai, a consultancy in the San
Francisco Bay Area.
The paper describes a recently begun experiment that
collects genuine test questions, drawn from unpublished research by the
authors, in an effort to provide a meaningful measure of A.I.’s mathematical
competency.”
Key Aspects of "First Proof":
Purpose: To create a more accurate, meaningful, and rigorous benchmark for evaluating A.I.'s mathematical reasoning capabilities than standardized datasets provide.
Methodology: The experiment uses genuine, unpublished research questions created by the authors.
Authors: Dr. Hairer (EPFL) co-authored the work with Mohammed Abouzaid (Stanford), Lauren Williams (Harvard), and Tamara Kolda (MathSci.ai).
The effort aims to address the limitations of existing A.I. benchmarks by testing capabilities on novel problems rather than on data the models may have already encountered.