ENTITY Humanity's Last Exam

PulseAugur coverage of Humanity's Last Exam: every cluster mentioning the benchmark across labs, papers, and developer communities, ranked by signal.

Total · 23 over 90d
Releases · 0 over 90d
Papers · 8 over 90d

TIER MIX · 90D

significant 3
research 8
tool 10
commentary 2

SENTIMENT · 30D

6 days with sentiment data

RECENT · PAGE 1/2 · 23 TOTAL

COMMENTARY · CL_162302 · Jul 24 · 19:36

Anthropic's new Opus model surpasses Fable 5 on key AI benchmarks

Anthropic has reportedly released a new Opus series model that outperforms its previous Fable 5 model on benchmarks like Humanity's Last Exam, agentic coding, and ARC-AGI. This development suggests significant advanceme…
TOOL · CL_161409 · Jul 24 · 11:00

GLM-4.7-Flash model shows performance metrics across benchmarks

GLM-4.7-Flash has been scored on several benchmarks, including GPQA, Humanity's Last Exam, Long Context Reasoning, and SciCode. The model achieved 45.2% on GPQA and 25.5% on S…
TOOL · CL_153491 · Jul 20 · 23:05

Developer builds quiz to test human vs. AI on expert-level exam

A developer has created a web-based quiz called "Humans vs. Humanity's Last Exam" that pits human players against advanced AI models on challenging academic questions. The quiz utilizes the "Humanity's Last Exam" datase…
RESEARCH · CL_152721 · Jul 20 · 13:00

DBRX Instruct and Mistral Medium 3 benchmark results revealed

Independent benchmarks reveal performance metrics for two large language models. DBRX Instruct achieved scores of 33.1% on GPQA, 39.7% on MMLU-Pro, 6.6% on Humanity's Last Exam, and 9.3% on LiveCodeBench. Mistral Medium…
RESEARCH · CL_153696 · Jul 19 · 15:20

New AI agents tackle deep research and misleading web data · 4 sources tracked

Researchers have introduced AREX, a new family of recursively self-improving agents designed for deep research tasks. AREX alternates between research and self-improvement loops, using an autonomous context-update tool …
SIGNIFICANT · CL_149118 · Jul 17 · 19:01

Mira Murati's Thinking Machines releases open-source Inkling model

Thinking Machines, co-founded by former OpenAI executive Mira Murati, has released its first model, Inkling. Unlike many frontier models, Inkling does not aim to top leaderboards, scoring lower than models like Claude F…
RESEARCH · CL_137517 · Jul 11 · 17:00

Open-source LLMs show strong benchmark performance across multiple metrics · 4 sources tracked

Several open-source AI models have demonstrated strong performance on various benchmarks, according to independent measurements. Mi:dm K 2.5 Pro achieved 70.1% on GPQA and 80.9% on MMLU-Pro, while MiMo-V2-Flash showed 8…
COMMENTARY · CL_122014 · Jul 2 · 12:04

AI Benchmark 'Humanity's Last Exam' Criticized as Distraction

The article "Humanity's Last Exam" critiques the AI evaluation benchmark, exploring its origins and the varied expert opinions surrounding its creation. It suggests that the benchmark may serve as a distraction from mor…
TOOL · CL_121309 · Jul 2 · 02:39

OpenClaw AI agent framework matures, gains wider adoption

OpenClaw, an open-source AI agent framework, has matured significantly since its release a few months ago, evolving from a niche tool to a widely adopted local-first assistant. It can now execute real-world tasks by con…
TOOL · CL_108106 · Jun 24 · 04:00

Sakana Fugu orchestrator models combine LLMs for collective intelligence

Researchers have developed Sakana Fugu, a family of orchestrator models designed to combine the specialized capabilities of multiple Large Language Models (LLMs) into a collectively intelligent system. These models act …
TOOL · CL_86307 · Jun 11 · 22:21

Perplexity Integrates Deep Research with Multi-Model Orchestration System

Perplexity has integrated its Deep Research feature into its Computer orchestration system, enhancing its ability to break down complex questions into subtasks. These subtasks are then routed across more than 20 differe…
TOOL · CL_71823 · Jun 4 · 20:39

Andon Labs stress-tests AI agents in real-world business scenarios

Andon Labs is developing novel real-world evaluations for AI systems, moving beyond traditional benchmarks to assess model behavior in complex scenarios. Their "Vending-Bench" and "Luna" projects, which involve AI-run p…
SIGNIFICANT · CL_45430 · May 23 · 02:32

Google's Gemini 3.5 Flash outperforms 3.1 Pro on coding and agents

Google's Gemini 3.5 Flash model has surpassed its predecessor, Gemini 3.1 Pro, on several key benchmarks, particularly in coding and agentic tasks. This new tier offers a significant cost reduction of 40% and approximat…
TOOL · CL_30793 · May 13 · 06:15

LLMs learn to actively seek external info for better task adaptation

Researchers have developed a new method for adapting large language models (LLMs) by enabling them to actively seek information from external sources like Wikipedia and web browsers. This approach, termed "active inform…
TOOL · CL_18871 · May 6 · 04:00

New RSE strategy recycles LLM search experience for efficient test-time scaling

Researchers have introduced Recycling Search Experience (RSE), a novel method to improve the efficiency of test-time scaling for large language models. RSE transforms test-time search from isolated trials into a cumulat…
RESEARCH · CL_20273 · May 5 · 17:55

OpenSearch-VL offers open recipe for advanced multimodal search agents

Researchers have developed OpenSearch-VL, a novel, fully open-source recipe for training advanced multimodal deep search agents. This approach utilizes a curated pipeline for high-quality training data, a diverse tool e…
FRONTIER RELEASE · CL_07657 · Apr 28 · 12:16

Xiaomi's MiMo-v2.5-Pro open-source model rivals top AI coding assistants

Xiaomi has released MiMo-v2.5-Pro, an open-source coding-focused language model that demonstrates impressive capabilities in complex tasks. The model successfully completed a university-level compiler project in hours, …
RESEARCH · CL_06636 · Apr 28 · 04:00

MTRouter cuts LLM costs by 58% on ScienceWorld, 43% on HLE

Researchers have developed MTRouter, a novel system designed to optimize the cost of multi-turn interactions with large language models. By jointly embedding interaction history and candidate models, MTRouter learns to …
FRONTIER RELEASE · CL_11258 · Apr 21 · 16:30

Google Gemini API adds Deep Research updates with MCP and chart generation

Google has released two significant updates to its Gemini API, enhancing its Deep Research capabilities. These updates introduce improved quality, support for MCP, and native generation of charts and infographics. The G…
SIGNIFICANT · CL_97397 · Feb 12 · 16:55

Google upgrades Gemini 3 Deep Think for science and engineering

Google has released an upgraded version of Gemini 3 Deep Think, a specialized reasoning mode designed for complex scientific, research, and engineering challenges. This new iteration is available to Google AI Ultra subs…
