PulseAugur

New Benchmarks Evaluate LLM Legal Reasoning Across Jurisdictions

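Researchers have developed new benchmarks to evaluate the legal reasoning of large language models (LLMs) across jurisdictions and languages. UA-Legal-Bench focuses on Ukrainian law, drawing on a large corpus of court decisions for tasks such as case-type classification and norm extraction. Multi-Legal-Bench extends this work by evaluating identical tasks across six countries/regions, finding that the effect of few-shot prompting is consistent while model performance varies by jurisdiction and language. In addition, the BenGER platform and dataset evaluate LLMs on German legal reasoning, introduce an LLM-as-a-Judge framework, and show that human-AI co-creation outperforms human work alone.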

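Impact: These benchmarks allow more rigorous evaluation of LLMs in specialized domains such as law, and may accelerate their adoption in legal practice and research.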

Ranking rationale: Multiple research papers introduce new benchmarks and datasets for evaluating LLM legal reasoning.


AI-generated summary · Google Gemini · from 6 sources.


Sources [6]

  1. arXiv cs.AI · TIER_1 · English (EN) · Volodymyr Ovcharov

    UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

    arXiv:2605.29170v1. Legal NLP benchmarks are overwhelmingly English-centric, leaving failure modes in morphologically rich, non-Latin-script languages undetected. We introduce UA-Legal-Bench, a five-task benchmark for evaluating large language models…

  2. arXiv cs.AI · TIER_1 · English (EN) · Volodymyr Ovcharov

    Multi-Legal-Bench: Evaluating LLM Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

    arXiv:2605.29738v1. Legal NLP benchmarks overwhelmingly evaluate a single language or aggregate tasks that differ fundamentally across jurisdictions, making cross-lingual comparison impossible. We introduce Multi-Legal-Bench, the first cross-jurisdic…

  3. arXiv cs.CL · TIER_1 · English (EN) · Volodymyr Ovcharov

    Multi-Legal-Bench: Evaluating LLM Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

    Legal NLP benchmarks overwhelmingly evaluate a single language or aggregate tasks that differ fundamentally across jurisdictions, making cross-lingual comparison impossible. We introduce Multi-Legal-Bench, the first cross-jurisdictional legal benchmark that evaluates identical ta…

  4. arXiv cs.AI · TIER_1 · English (EN) · Sebastian Nagl, Ann-Kristin Mayrhofer, Martin Heidebach, Aleyna Koçak, Anne Zettelmeier, Elly Breu, Angelina Greiner, Sofija Milijas, Matthias Grabmair

    BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

    arXiv:2605.28183v1. We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The BenGER dataset consists of three components: 596 exam-style free-text legal case tasks a…

  5. arXiv cs.AI · TIER_1 · English (EN) · Sebastian Nagl, Matthias Grabmair

    BenGER Platform: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

    arXiv:2604.13583v3. Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and…

  6. arXiv cs.CL · TIER_1 · English (EN) · Matthias Grabmair

    BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

    We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The BenGER dataset consists of three components: 596 exam-style free-text legal case tasks across multiple levels of legal education and 531 s…