100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
Researchers have introduced 100-LongBench, a new benchmark designed to evaluate the long-context capabilities of large language models more accurately. Existing benchmarks often fail to distinguish a model's general knowledge from its specific ability to process extended contexts. The new benchmark combines a length-controllable design with a novel metric that disentangles these two factors, offering a clearer basis for comparing LLMs.
IMPACT: Provides a more accurate method for evaluating LLM long-context performance, potentially guiding future model development.