Benchmarks in Leipzig
A group of 49 mathematicians developed a dataset of 100 research-level math questions with known answers during a three-day workshop in Leipzig, Germany. They tested five state-of-the-art LLMs on these questions and found that, after three evaluation stages, only two of the 100 questions remained unsolved. The result points to substantial advances in LLMs' mathematical reasoning capabilities.
IMPACT: Demonstrates significant progress in LLM mathematical reasoning, with potential consequences for future AI development and applications in STEM fields.