OpenAI has introduced SimpleQA, a new benchmark that evaluates the factuality of language models using short, fact-seeking questions. The dataset is designed to be challenging for frontier models (GPT-4o scores less than 40% on it) and is open-sourced to aid researchers. SimpleQA covers diverse topics and has a high degree of correctness, with an estimated inherent error rate of approximately 3% after rigorous verification.
Summary written by gemini-2.5-flash-lite from 1 source.