Researchers have developed LegalCiteBench, a new benchmark designed to evaluate the reliability of legal language models in generating accurate case citations. The benchmark, comprising approximately 24,000 instances derived from 1,000 U.S. judicial opinions, covers tasks such as citation retrieval, completion, error detection, and case verification. Testing revealed that even advanced models struggle with exact citation recovery, scoring below 70% on critical tasks, and many frequently fabricate incorrect or irrelevant authorities.
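To make the headline metric concrete, here is a minimal sketch of how exact-match citation recovery is typically scored in benchmarks of this kind. The normalization rules and function names below are illustrative assumptions, not taken from the LegalCiteBench paper.

```python
# Sketch of an exact-match citation accuracy metric. The normalization
# choices here (whitespace collapsing, trailing punctuation) are assumed
# conventions, not the paper's actual scoring code.
import re

def normalize_citation(cite: str) -> str:
    """Collapse whitespace and strip trailing punctuation so trivially
    different renderings of the same citation compare equal."""
    cite = re.sub(r"\s+", " ", cite.strip())
    return cite.rstrip(".,;")

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of model citations that exactly match the gold citation
    after normalization."""
    assert len(predictions) == len(references)
    if not references:
        return 0.0
    hits = sum(
        normalize_citation(p) == normalize_citation(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

# Example: only the first generated citation survives an exact-match check;
# a wrong page number or a dropped period in the reporter both count as misses.
preds = ["410 U.S. 113", "347 U.S. 484", "5 U.S. 137"]
golds = ["410 U.S. 113.", "347 U.S. 483", "5 US 137"]
print(f"exact match: {exact_match_accuracy(preds, golds):.2f}")  # 0.33
```

The strictness of this kind of check is why scores below 70% are plausible even for strong models: a citation that names the right case but mangles the volume, reporter, or page is scored as a failure.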
IMPACT The new benchmark reveals serious citation reliability problems in legal LLMs, which may affect their adoption in legal drafting and research.