Brief · PulseAugur

RESEARCH · Hugging Face Daily Papers English(EN) · 1d · [3 sources]

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Researchers have introduced K-BrowseComp, a benchmark that evaluates the web-browsing agent capabilities of large language models in Korean contexts. The benchmark comprises 400 problems, 300 of which form a manually verified subset. In initial evaluations, leading frontier models such as GPT-5.5 and DeepSeek-V4-Pro score between 30.00% and 45.67% on this subset, well below their performance on English benchmarks. Korean-specific LLMs score even lower, indicating a substantial gap in agentic capabilities on Korean-language tasks.

IMPACT: Highlights a critical need for stronger LLM agentic performance in non-English contexts, which could guide future model development and evaluation strategies.

GPT-5.5
DeepSeek-V4-Pro
Korean LLMs
K-BrowseComp
GLM-5.1
BrowseComp