Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA
Researchers have introduced SearchFireSafety, a new benchmark designed to evaluate the performance and safety of large language models in statute-centric legal question answering. Unlike previous benchmarks focused on case law, SearchFireSafety addresses the challenges of retrieving information from hierarchically linked statutory documents and assesses models' ability to abstain from answering when context is insufficient. Experiments revealed that while graph-guided retrieval improves performance, domain-adapted models exhibit a critical safety trade-off, becoming more prone to hallucination when essential statutory evidence is missing.
AI IMPACT Highlights the need for specialized benchmarks to ensure LLMs can safely and accurately process complex legal statutes, moving beyond case law.