Researchers have developed TF-RefusalBench, a multilingual benchmark for measuring and mitigating over-alignment in large language models (LLMs) in the context of criminal law. The benchmark, derived from public Swiss Supreme Court rulings, comprises 5,200 prompts in French, German, Italian, and English and covers tasks such as translation and summarization, which are prone to triggering model guardrails. The study found that over-alignment varies by model and language, and that its impact extends beyond outright refusal to task faithfulness. Mitigation approaches such as prompting and ablation of refusal directions were evaluated; ablation reduced refusals with minimal impact on performance.
AI IMPACT This research could lead to more reliable LLM deployment in sensitive legal domains by addressing over-alignment issues.
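The summary does not say how the study computes or removes refusal directions. As a rough illustration of the general idea only, the sketch below assumes a difference-of-means direction between activations from refused and complied prompts, then projects that direction out of a new activation. All function names and toy values are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: toy vectors, not real model activations.
# Assumption: the refusal direction is the normalized difference of mean activations.

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def refusal_direction(refused_acts, complied_acts):
    """Unit vector pointing from the mean 'complied' activation to the mean 'refused' one."""
    diff = [r - c for r, c in zip(mean_vector(refused_acts), mean_vector(complied_acts))]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]

def ablate(activation, direction):
    """Remove the component of `activation` that lies along the unit vector `direction`."""
    coeff = dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]

# Hypothetical 3-dimensional activations for refused and complied prompts.
refused = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
complied = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = refusal_direction(refused, complied)
cleaned = ablate([3.0, 5.0, 1.0], direction)
```

After ablation, the activation has no component along the assumed direction, while the components orthogonal to it are left unchanged.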
AI-generated summary · Google Gemini · from 1 source.