Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge

New "Pre-Flight" Benchmark Reveals Gaps in Large Language Models' Aviation Knowledge

By the PulseAugur editorial team · [2 sources] · 2026-07-02 07:49

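Researchers have developed "Pre-Flight," a new benchmark designed to evaluate large language models (LLMs) on operational knowledge specific to the aviation industry. The benchmark comprises 300 multiple-choice questions drawn from international aviation standards, regulations, and operational scenarios, created and reviewed by aviation professionals. Initial evaluations show that even the most advanced model tested (released in 2026) reached only 82.7% accuracy, well below the roughly 95% achieved by human experts. The researchers stress that domain-specific evaluations of this kind are essential for responsibly deploying generative AI in aviation operations.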
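To make the headline figure concrete, here is a minimal Python sketch of how accuracy on a multiple-choice benchmark is typically scored. It assumes the reported 82.7% is plain accuracy over all 300 questions, which would correspond to about 248 correct answers; that count is inferred from the percentage, not reported by the source. The `accuracy` function, the answer key, and the predictions are hypothetical placeholders, not the benchmark's actual data or scoring code.

```python
def accuracy(predictions, answer_key):
    """Return the fraction of questions where the predicted option matches the key."""
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must have the same length")
    if not answer_key:
        raise ValueError("answer_key is empty")
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)


# Hypothetical data: 300 questions all keyed "A", with 248 answered correctly.
answer_key = ["A"] * 300
predictions = ["A"] * 248 + ["B"] * 52

score = accuracy(predictions, answer_key)
print(f"{score:.1%}")  # 82.7%
```

Under the same assumption, the roughly 95% human-expert figure would correspond to about 285 of 300 questions.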

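Impact: Underscores the need for specialized benchmarks to ensure that AI is deployed safely and reliably in high-stakes industries such as aviation.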

Ranking rationale: This cluster describes a new academic paper introducing a domain-specific benchmark for evaluating large language models.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · based on 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Alex Brooker, Tim Hughes · 2026-07-03 04:00

Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge

arXiv:2607.01829v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly proposed for aviation business operations, from documentation and training generation to customer facing assistants. General purpose benchmarks do not measure whether a model reasons saf…
arXiv cs.CL TIER_1 English(EN) · Tim Hughes · 2026-07-02 07:49

Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge

Large language models (LLMs) are increasingly proposed for aviation business operations, from documentation and training generation to customer facing assistants. General purpose benchmarks do not measure whether a model reasons safely and correctly about aviation specific operat…

