Brief · PulseAugur

RESEARCH · arXiv cs.CL English(EN) · 1d · [2 sources]

LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

Researchers have introduced LakeQA, a new benchmark designed to test the capabilities of large language models in searching and reasoning over massive data lakes. The benchmark utilizes approximately 9.5 TB of diverse data, including Wikipedia and government datasets, requiring multi-hop reasoning and evidence composition across multiple sources. Initial experiments show that even advanced models like GPT-5.2 struggle with the task, achieving an exact-match score of only 18.37%, highlighting the challenge LakeQA presents for developing effective LLM agents. AI

IMPACT Establishes a new, challenging benchmark for evaluating LLM agents' ability to search and reason over large, unstructured datasets.

GPT-5.2
LLMs
LakeQA
Wikipedia