New benchmark measures LLM over-alignment in criminal law tasks

By PulseAugur Editorial · [1 source] · 2026-07-01 04:00

Researchers have developed TF-RefusalBench, a multilingual benchmark for measuring and mitigating over-alignment in large language models (LLMs) applied to criminal law. Built from public Swiss Supreme Court rulings, it contains 5,200 prompts in French, German, Italian, and English, covering tasks such as translation and summarization that are prone to triggering model guardrails. The study found that over-alignment varies by model and language, and that its effects go beyond outright refusal to degrade task faithfulness. The authors evaluated mitigations including prompting and ablation of refusal directions; ablation reduced refusals with minimal impact on performance.
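For readers unfamiliar with the term, "ablation of refusal directions" is often described as follows: estimate a direction in a model's activation space that separates refused prompts from answered ones, then project that direction out of the activations. The sketch below illustrates that general idea on toy vectors. It is an assumption about the technique in general, not the paper's implementation; the function names, the difference-of-means estimate, and the numbers are all hypothetical.

```python
# Illustrative sketch only: one common formulation of refusal-direction ablation.
# Function names and toy activations are hypothetical, not taken from the paper.
from math import sqrt


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale a vector to unit length."""
    norm = sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def refusal_direction(refused_acts, complied_acts):
    """Assumed estimate: normalized difference of mean activations."""
    mu_refused = mean_vector(refused_acts)
    mu_complied = mean_vector(complied_acts)
    return unit([r - c for r, c in zip(mu_refused, mu_complied)])


def ablate(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]


# Toy activations: refused prompts differ from answered ones along axis 0.
refused = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
complied = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = refusal_direction(refused, complied)
cleaned = ablate([3.0, 1.0, 5.0], direction)
print(direction, cleaned)
```

In a real model the same projection would be applied to hidden states (or folded into weight matrices) at chosen layers; which layers and how the direction is estimated are design choices the summary above does not specify.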

IMPACT This research could lead to more reliable LLM deployment in sensitive legal domains by addressing over-alignment issues.



AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Arthur Wuhrmann, Gaetan Stein, Daniel Brunner, Andrei Kucharavy · 2026-07-01 04:00

Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts

arXiv:2606.23375v2. Abstract: While the wider applicability of LLMs in the legal field is currently debated due to their reliability and the gravity of any errors, narrow uses with well-understood and mitigated risks have emerged. Notably the Swiss Fed…

