New benchmark measures LLM over-alignment in criminal law

By PulseAugur Editorial · [1 source] · 2026-06-22 14:08

A new benchmark, TF-RefusalBench, has been developed to measure and mitigate over-alignment in large language models (LLMs) used in multilingual criminal law contexts. It comprises 5,200 prompts in French, German, Italian, and English, derived from public Swiss Federal Supreme Court rulings. The researchers found that over-alignment varies by model and language, and that its effects go beyond outright refusal to reduce task faithfulness. The study also evaluated mitigation strategies: prompting can help, while ablating refusal directions is effective with minimal performance degradation.
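The summary does not say how the paper ablates refusal directions. One common approach, shown here only as a hedged sketch, takes the difference between mean hidden activations on refused and complied prompts as the "refusal direction" and projects it out of later activations. The function names and the toy three-dimensional activations are hypothetical and come from neither the paper nor any particular library.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(refused_acts, complied_acts):
    """Unit vector pointing from the mean 'complied' activation to the mean 'refused' one."""
    diff = [r - c for r, c in zip(mean_vector(refused_acts), mean_vector(complied_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("activation means coincide; no direction to ablate")
    return [x / norm for x in diff]

def ablate(activation, direction):
    """Remove the component of `activation` that lies along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy activations, invented for illustration (assumption: one vector per prompt).
refused = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
complied = [[0.0, 1.0, 0.0], [0.0, 1.0, 2.0]]

direction = refusal_direction(refused, complied)
cleaned = ablate([3.0, 1.0, 0.5], direction)

# After ablation the activation has no component along the refusal direction.
residual = sum(a * d for a, d in zip(cleaned, direction))
```

In a real model the same projection would be applied to hidden states at one or more layers. Which layers and token positions to use is a design choice that this sketch leaves open.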

IMPACT This research could lead to more reliable LLM applications in sensitive legal domains by addressing issues of over-alignment and refusal.

RANK_REASON The item is an academic paper introducing a new benchmark and evaluation methodology for LLMs.



AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Andrei Kucharavy · 2026-06-22 14:08

Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts

While the wider applicability of LLMs in the legal field is currently debated due to their reliability and the gravity of any errors, narrow uses with well-understood and mitigated risks have emerged. Notably the Swiss Federal Supreme Court uses small on-premises models for tenta…

