TF-RefusalBench is a new benchmark for measuring and mitigating over-alignment in large language models (LLMs) applied to multilingual criminal law. It comprises 5,200 prompts in French, German, Italian, and English, derived from public Swiss Federal Supreme Court rulings. The researchers found that over-alignment varies by model and by language, and that its effects go beyond outright refusal to reduce task faithfulness. The study also evaluated mitigation strategies: prompting can help, while ablating refusal directions is effective with minimal performance degradation.
AI IMPACT This research could lead to more reliable LLM applications in sensitive legal domains by addressing over-alignment and refusal.
RANK_REASON The item is an academic paper introducing a new benchmark and evaluation methodology for LLMs.
AI-generated summary · Google Gemini · from 1 source.