Researchers have developed LegalCiteBench, a new benchmark designed to evaluate the reliability of legal language models in generating accurate case citations. The benchmark, comprising approximately 24,000 instances derived from 1,000 U.S. judicial opinions, covers tasks such as citation retrieval, completion, error detection, and case verification. Testing revealed that even advanced models struggle with exact citation recovery, scoring below 70% on critical tasks, and many frequently fabricate incorrect or irrelevant authorities.
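To make the headline metric concrete, here is a minimal sketch of how exact-match citation recovery is typically scored in benchmarks of this kind. The normalization rules and function names below are illustrative assumptions, not taken from the LegalCiteBench paper.

```python
# Sketch of an exact-match citation accuracy metric. The normalization
# choices here (whitespace collapsing, trailing punctuation) are assumed
# conventions, not the paper's actual scoring code.
import re

def normalize_citation(cite: str) -> str:
    """Collapse whitespace and strip trailing punctuation so trivially
    different renderings of the same citation compare equal."""
    cite = re.sub(r"\s+", " ", cite.strip())
    return cite.rstrip(".,;")

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of model citations that exactly match the gold citation
    after normalization."""
    assert len(predictions) == len(references)
    if not references:
        return 0.0
    hits = sum(
        normalize_citation(p) == normalize_citation(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

# Example: only the first generated citation survives an exact-match check;
# a wrong page number or a dropped period in the reporter both count as misses.
preds = ["410 U.S. 113", "347 U.S. 484", "5 U.S. 137"]
golds = ["410 U.S. 113.", "347 U.S. 483", "5 US 137"]
print(f"exact match: {exact_match_accuracy(preds, golds):.2f}")  # 0.33
```

The strictness of this kind of check is why scores below 70% are plausible even for strong models: a citation that names the right case but mangles the volume, reporter, or page is scored as a failure.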
IMPACT The new benchmark reveals serious citation reliability problems in legal LLMs, which may affect their adoption in legal drafting and research.