BrowseComp+

PulseAugur coverage of BrowseComp+: every cluster mentioning BrowseComp+ across labs, papers, and developer communities, ranked by signal.

Total · 30d: 11 (11 over 90d)

Releases · 30d: 0 (0 over 90d)

Papers · 30d: 9 (9 over 90d)

TIER MIX · 90D

significant 1
research 8
tool 1
commentary 1

SENTIMENT · 30D

6 days with sentiment data

RECENT · PAGE 1/1 · 11 TOTAL

RESEARCH · CL_99569 · Jun 18 · 15:25

New method mines agent skills from interaction data, but policy improvement is limited

Researchers have developed a method to automatically generate skill libraries for computer-using agents by mining interaction trajectories. The process involves segmenting graphical user interface (GUI) trajectories, cl…
RESEARCH · CL_106759 · Jun 17 · 00:00

New LLM Training Methods Optimize Data Scheduling for Efficiency and Performance

Researchers have developed new methods for optimizing the training of large language models (LLMs) through advanced data scheduling techniques. One approach, the Holistic Data Scheduler (HDS), uses multi-objective reinf…
TOOL · CL_86307 · Jun 11 · 22:21

Perplexity Integrates Deep Research with Multi-Model Orchestration System

Perplexity has integrated its Deep Research feature into its Computer orchestration system, enhancing its ability to break down complex questions into subtasks. These subtasks are then routed across more than 20 differe…
RESEARCH · CL_84831 · Jun 10 · 00:00

TreeSeeker framework enhances AI deep search with controlled trial-and-error

Researchers have introduced TreeSeeker, a novel framework designed to improve the efficiency of deep search agents. This system structures search processes as a tree, allowing agents to explore multiple potential paths …
RESEARCH · CL_65077 · Jun 1 · 00:00

New Korean web-browsing benchmark reveals LLM performance gaps

Researchers have introduced K-BrowseComp, a new benchmark designed to evaluate the web-browsing agent capabilities of large language models specifically within Korean contexts. The benchmark comprises 400 problems, with…
COMMENTARY · CL_61793 · May 31 · 19:31

Author warns AI evaluations are unreliable, risking unseen harms

The author argues that current AI evaluation methods are unreliable and systematically flawed, posing significant risks. They highlight issues like models gaming evaluations, distribution shifts rendering metrics inaccu…
RESEARCH · CL_55915 · May 27 · 00:00

New benchmark LiveBrowseComp tests LLM search agents' true discovery skills

A new research paper introduces LiveBrowseComp, a benchmark designed to assess whether large language model (LLM) search agents truly discover new information or merely verify their existing internal knowledge. The stud…
RESEARCH · CL_37215 · May 18 · 14:12

Hugging Face launches Open Agent Leaderboard for AI systems

Hugging Face has launched the Open Agent Leaderboard, a new framework for evaluating the performance and cost of AI agent systems. This benchmark focuses on assessing an agent's generality across diverse tasks and setti…
RESEARCH · CL_44793 · May 15 · 17:29

New open-weight agents tackle deep research tasks with synthetic data and novel architectures

Two new research papers introduce advanced agent systems designed for deep research tasks. The first, QUEST, offers a family of open-weight models (2B to 35B parameters) trained on synthetic data, demonstrating strong p…
RESEARCH · CL_20273 · May 5 · 17:55

OpenSearch-VL offers open recipe for advanced multimodal search agents

Researchers have developed OpenSearch-VL, a novel, fully open-source recipe for training advanced multimodal deep search agents. This approach utilizes a curated pipeline for high-quality training data, a diverse tool e…
FRONTIER RELEASE · CL_01790 · Nov 6 · 05:44

Kimi K2 model boasts 1T parameters and SOTA HLE, while Soumith Chintala departs PyTorch

Kimi K2, a new model from Moonshot AI, has 1 trillion parameters and achieves state-of-the-art results on the HLE benchmark. It also demonstrates capabilities in BrowseComp and TauBench. Separately, Soumith Chintala has dep…