HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

New benchmark evaluates LLM jailbreak risks in collaborative writing

By the PulseAugur editorial team · [1 source] · 2026-06-10 04:00

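Researchers have introduced HarDBench, a new benchmark designed to evaluate the safety of large language models (LLMs) in collaborative writing scenarios. The benchmark focuses on "draft-based co-authoring jailbreak attacks," in which a malicious user may prompt an LLM to generate harmful content within an incomplete draft. HarDBench covers high-risk domains such as explosives, drugs, and weapons, and includes realistic prompts for testing model susceptibility. The researchers also developed a safety-utility balanced alignment method intended to mitigate these risks without compromising the LLM's helpfulness on benign tasks.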

Impact: Introduces a new way to evaluate LLM safety in collaborative writing, which may lead to more robust AI assistants.

Ranking rationale: This cluster contains an academic paper introducing a new benchmark for LLM safety evaluation.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CL · Euntae Kim, Soomin Han, Buru Chang · 2026-06-10 04:00

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

arXiv:2604.19274v2. Abstract: Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a seri…

