AI Exam Shock: Humanity's Last Test 🤯🏆


Summary

In January 2025, the Humanity’s Last Exam benchmark debuted as a rigorous test for large language models. Built from 2,500 questions spanning “the frontier of human knowledge,” including one about hummingbird tendons, the exam quickly became a key metric for AI companies. OpenAI’s o1 model initially scored 8.3%, while Google’s Gemini 3 Deep Think later reached 48.4%. Researchers such as Jon Laurent of Edison Scientific caution that scores on individual questions do not necessarily reflect overall capability, and that models still need to improve at information retrieval. The benchmark, developed by the Center for AI Safety, and subsequent releases such as FrontierScience are intended to assess “expert-level scientific reasoning.” Ultimately, these benchmarks are proving to be valuable tools for driving innovation and setting new standards in artificial intelligence.

INSIGHTS


THE EVOLUTION OF SCIENTIFIC BENCHMARKS
The pursuit of artificial intelligence capable of genuine scientific discovery has spurred a rapid evolution in benchmark design. Early efforts focused on accumulating vast quantities of scientific knowledge, as exemplified by Humanity’s Last Exam (HLE). This benchmark, drawing its 2,500 questions from “the frontier of human knowledge,” prioritized breadth, yielding early successes for models like OpenAI’s o1. However, concerns arose that HLE’s questions often assessed arcane facts rather than the core processes of scientific reasoning, drawing criticism from figures like Chenru Duan. This critique highlighted a crucial shift in the field: the need for benchmarks that more accurately reflect the complexities of real-world scientific inquiry.
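For intuition, a headline score like o1’s 8.3% on HLE is just the fraction of graded answers marked correct. The sketch below is a hypothetical illustration, not code from any real evaluation harness; the function names (`score_benchmark`, `exact`) and the exact-match grader are assumptions, since real benchmarks often use rubric- or LLM-based grading instead.

```python
from typing import Callable

def score_benchmark(answers: list[str], references: list[str],
                    grade: Callable[[str, str], bool]) -> float:
    """Return accuracy as a percentage over all questions."""
    # One model answer per reference question, graded independently.
    assert len(answers) == len(references), "one answer per question"
    correct = sum(grade(a, r) for a, r in zip(answers, references))
    return 100.0 * correct / len(references)

# Simplest possible grader: case-insensitive exact match.
def exact(answer: str, reference: str) -> bool:
    return answer.strip().lower() == reference.strip().lower()

if __name__ == "__main__":
    model_answers = ["Paris", "4", "helium"]
    gold_answers = ["Paris", "5", "Helium"]
    print(f"{score_benchmark(model_answers, gold_answers, exact):.1f}%")
```

The same aggregation applies whether the grader is exact match or a judge model; only the `grade` callable changes, which is why, as the researchers note, a single aggregate number can hide which kinds of questions a model fails.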

MOVING BEYOND KNOWLEDGE ACQUISITION: FOCUSING ON SCIENTIFIC PROCESSES
Subsequent benchmarks, such as FrontierScience and the Scientific Discovery Evaluation (SDE), represented a deliberate move toward evaluating AI’s ability to carry out the actual scientific workflow. FrontierScience, with 700 questions centered on chemistry, biology, and physics, incorporated Olympiad-style questions and open-ended research problems to provide a more nuanced assessment. Its emphasis on verifiable reasoning steps, championed by OpenAI’s Miles Wang, proved effective in identifying models capable of simulating scientific thought. SDE, with 1,125 tasks linked to 43 real-world research scenarios, further underscored the importance of holistic project evaluation, recognizing that understanding the big picture is often more valuable than precise knowledge of individual molecular properties. The fact that different models got stuck on the same hardest questions suggested a shared reliance on comparable training data, reinforcing the need for more diverse and challenging benchmarks.

THE EMERGING ROLE OF AGENTIC AI AND MULTI-STEP PROBLEM-SOLVING
The most recent wave of benchmark development, exemplified by LABBench2 and FutureHouse’s agentic AI models, pushes scientific assessment toward the ability to execute complex, multi-step research projects. LABBench2, focused on biology, tests whether AI can transform a research idea into a finished paper, demanding capabilities in literature search, data access, and gene-sequence construction. The mixed results, with strong performance on search but struggles on integrated tasks, point to a key area for improvement: an AI’s ability to efficiently retrieve and navigate information within large, complex datasets. Jon Laurent of Edison Scientific emphasizes that benchmarks are not merely about determining current leadership but about setting ambitious goals that drive innovation. Agentic AI capable of independently driving research projects from conception to completion represents the next critical step in the evolution of scientific benchmarks and the development of truly intelligent scientific tools.

THE EVOLUTION OF AI BENCHMARKS
The pursuit of artificial intelligence has consistently relied on benchmarks: measurable standards that define progress within the field. As Dr. Wang explains, these benchmarks act as “North Stars,” guiding research and development. A prime example is the ImageNet Large Scale Visual Recognition Challenge, which tasked computers with classifying images. The victory of AlexNet, a convolutional neural network, in the 2012 challenge dramatically shifted the landscape of AI development, cementing convolutional architectures as a cornerstone of modern AI and highlighting the importance of measurable progress.

A MULTIFACETED APPROACH TO AI EVALUATION
Recognizing the diverse demands of scientific research, experts are advocating for a shift away from singular benchmarks. Anna Ivanova, a cognitive neuroscience and AI researcher at the Georgia Institute of Technology, emphasizes that “how well a system plots your data is very different from its factual knowledge of analytical chemistry.” This divergence underscores the need for a more nuanced approach. AI specialists suggest that the research community should adopt a portfolio of tests, each designed to specifically target and stimulate improvement across various stages of the scientific workflow. This multifaceted strategy acknowledges the varied skills required for successful scientific endeavors.

MEASUREMENT AS A CATALYST FOR PROGRESS
Ultimately, the ability to measure performance is critical for driving advances in AI within scientific contexts. Peng’s assertion, “In order to make progress, you have to be able to measure it,” encapsulates this fundamental principle. The choice of evaluations directly influences the direction of development. As the field moves toward a more diverse set of assessments, the focus shifts to continually refining AI systems across a broader spectrum of scientific tasks.

This article is AI-synthesized from public sources and may not reflect original reporting.