OpenAI has introduced TruthfulQA, a benchmark designed to evaluate how well language models avoid generating false information. The benchmark consists of 817 questions across 38 categories, specifically crafted to elicit answers based on common human misconceptions. In early tests, even the best-performing models were truthful on only 58% of questions, well below the 94% achieved by humans. Larger models also tended to be less truthful, suggesting that simply scaling up models may not improve their accuracy.
Summary written by gemini-2.5-flash-lite from 1 source.