Brief · PulseAugur

TOOL · arXiv cs.AI · 6h

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Researchers have introduced 100-LongBench, a new benchmark designed to more accurately evaluate the long-context capabilities of large language models. Existing benchmarks often fail to distinguish between a model's general knowledge and its specific ability to process extended contexts. The new benchmark includes a length-controllable system and a novel metric to disentangle these factors, offering a clearer method for comparing different LLMs.

IMPACT Provides a more accurate method for evaluating LLM long-context performance, potentially guiding future model development.

LLMs
Wang Yang
100-LongBench