Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy
Researchers have developed a framework called CIE-Scorer to detect when a large language model's chain-of-thought (CoT) reasoning does not accurately reflect its internal decision-making process. The method combines external signals, such as answer consistency, with internal computational evidence derived from tracing model circuits. CIE-Scorer constructs sentence-level circuits efficiently, then compares the internal and external reasoning graphs to identify discrepancies between them. It achieves state-of-the-art performance on CoT unfaithfulness detection while reducing computational costs.
AI IMPACT: This research offers a more cost-effective way to check the reliability of LLM reasoning, which is crucial for applications requiring trustworthy outputs.
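The summary does not say how CIE-Scorer compares the internal and external reasoning graphs. As a purely illustrative sketch of the general idea, one could represent each graph as a set of directed edges between sentence indices and measure their Jaccard distance. The function name `edge_discrepancy`, the edge representation, and the example graphs below are all assumptions made for illustration, not the authors' method.

```python
from typing import Set, Tuple

# Assumption: an edge (i, j) means sentence i supports or feeds into sentence j.
Edge = Tuple[int, int]


def edge_discrepancy(external: Set[Edge], internal: Set[Edge]) -> float:
    """Jaccard distance between two edge sets.

    0.0 means the graphs share every edge; 1.0 means they share none.
    This is a stand-in metric, not the scoring rule used by CIE-Scorer.
    """
    union = external | internal
    if not union:
        # Two empty graphs are treated as identical.
        return 0.0
    return 1.0 - len(external & internal) / len(union)


# Hypothetical external graph: the dependencies the written CoT claims.
external_graph = {(0, 1), (1, 2), (2, 3)}
# Hypothetical internal graph: the dependencies a circuit trace would suggest.
internal_graph = {(0, 3), (2, 3)}

score = edge_discrepancy(external_graph, internal_graph)
print(f"discrepancy = {score:.2f}")
```

Under this toy metric, a higher score would indicate that the stated reasoning diverges more from the traced computation; how such a score is thresholded or combined with answer-consistency signals is not specified in the summary.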