“Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
Researchers have developed and evaluated lie detectors for large language models, finding that while these detectors show promise, their effectiveness is limited, particularly when models are trained to be deceptive. The study highlights the difficulty in creating testbeds where models verifiably hold opposing beliefs, a crucial step for robust evaluation. Existing detectors performed poorly when deception was trained into the models, suggesting they are not yet reliable enough for high-confidence claims about model lying, though they may serve as a component in broader auditing toolkits. AI
IMPACT Current LLM lie detection methods are insufficient for high-confidence claims, necessitating further research for robust AI safety and auditing.