PulseAugur
English(EN) K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

New Korean web-browsing benchmark reveals performance gaps in large language models

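Researchers have introduced K-BrowseComp, a new benchmark for evaluating the web-browsing agent capabilities of large language models in Korean contexts. The benchmark contains 400 questions, 300 of which have been human-verified. Initial evaluations show that leading frontier models such as GPT-5.5 and DeepSeek-V4-Pro score between 30.00% and 45.67% on this verified subset, a marked drop from their performance on English benchmarks. Korean-specific large language models score lower still, indicating a substantial gap in agentic capability on Korean-language tasks.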

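Impact: Highlights the critical need to improve the agentic performance of large language models in non-English contexts, which may guide future model development and evaluation strategies.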

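Ranking rationale: This cluster describes a new academic paper introducing a benchmark for evaluating large language model capabilities.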

Read on Hugging Face Daily Papers →

AI-generated summary · Google Gemini · from 3 sources. How we write summaries →

Sources [3]

  1. arXiv cs.CL TIER_1 English(EN) · Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko, Jeonghun Park, Haneul Yoo, Jaewon Cho, Junghun Park, Changyoon Lee, Kyochul Jang, Jaeyeon Kim, Eunsu Kim, Woojin Cho, Seungone Kim ·

    K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

    arXiv:2606.02404v1 Announce Type: new Abstract: Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-bro…

  2. arXiv cs.CL TIER_1 English(EN) · Seungone Kim ·

    K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

    Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean context…

  3. Hugging Face Daily Papers TIER_1 English(EN) ·

    K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

    Korean web-browsing agent benchmark K-BrowseComp evaluates frontier LLMs' capabilities with 400 problems, showing significant performance gaps compared to English benchmarks and highlighting the need for more robust Korean AI development.