Humanity's Last Exam
PulseAugur coverage of Humanity's Last Exam: every cluster mentioning it across labs, papers, and developer communities, ranked by signal.
2 days with sentiment data
-
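Google's Gemini 3.5 Flash surpasses 3.1 Pro on coding and agentic tasks
Google's Gemini 3.5 Flash model outperforms its predecessor, Gemini 3.1 Pro, on several key benchmarks, particularly in coding and agentic tasks. The new tier costs 40% less than 3.1 Pro and generates output roughly four times faster. While Gemini 3.5 Flash excels at tool use and agentic performance, Gemini 3.1 Pro retains the edge on pure reasoning and novel problem-solving benchmarks.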
-
LLMs learn to actively seek external info for better task adaptation
Researchers have developed a new method for adapting large language models (LLMs) by enabling them to actively seek information from external sources like Wikipedia and web browsers. This approach, termed "active inform…
-
New RSE strategy recycles LLM search experience for efficient test-time scaling
Researchers have introduced Recycling Search Experience (RSE), a novel method to improve the efficiency of test-time scaling for large language models. RSE transforms test-time search from isolated trials into a cumulat…
-
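OpenSearch-VL offers an open recipe for advanced multimodal search agents
Researchers have developed OpenSearch-VL, a novel, fully open-source recipe for training advanced multimodal deep search agents. The approach uses a carefully curated pipeline for high-quality training data, a diverse tool environment that combines text and image search with various processing capabilities, and a training algorithm designed specifically to handle tool failures. The resulting agents show significant performance gains on multiple benchmarks and are competitive with proprietary models; the project aims to make frontier search-agent research more accessible.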
-
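Xiaomi's open-source MiMo-v2.5-Pro model rivals top AI coding assistants
Xiaomi has released MiMo-v2.5-Pro, an open-source language model focused on coding that shows impressive capabilities on complex tasks. The model completed a university-level compiler project within hours, built a fully functional video editor application from a vague prompt, and solved analog circuit design problems. MiMo-v2.5-Pro performs strongly on coding benchmarks, rivaling top closed-source models such as GPT-5.4 and Claude Opus 4.6, and is now available on HuggingFace.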
-
MTRouter cuts LLM costs by 58% on ScienceWorld, 43% on HLE
Researchers have developed MTRouter, a novel system designed to optimize the cost of multi-turn interactions with large language models. By jointly embedding interaction history and candidate models, MTRouter learns to …
-
Google Gemini API adds Deep Research updates with MCP and chart generation
Google has released two significant updates to its Gemini API, enhancing its Deep Research capabilities. These updates introduce improved quality, support for MCP, and native generation of charts and infographics. The G…
-
new Gemini 3 Deep Think, Anthropic $30B @ $380B, GPT-5.3-Codex Spark, MiniMax M2.5
Google DeepMind has released Gemini 3 Deep Think V2, a new reasoning mode for Google AI Ultra subscribers and available via API early access. This model achieves new state-of-the-art results on benchmarks like ARC-AGI-2…
-
Kimi K2 model boasts 1T parameters and SOTA HLE, while Soumith Chintala departs PyTorch
Kimi K2, a new model from Kimi, boasts 1 trillion parameters and achieves state-of-the-art results on the HLE benchmark. It also demonstrates capabilities in BrowseComp and TauBench. Separately, Soumith Chintala has dep…
-
Google DeepMind launches Deep Think for Gemini Ultra subscribers
Google DeepMind has released a new AI capability called Deep Think, now available to Google AI Ultra subscribers via the Gemini app. This feature utilizes parallel thinking techniques, allowing the model to explore mult…