New benchmark tests LLMs for jailbreak risks in collaborative writing

By PulseAugur Editorial · [1 source] · 2026-06-10 04:00

Researchers have introduced HarDBench, a benchmark for evaluating the safety of large language models (LLMs) used in collaborative writing. The benchmark focuses on "draft-based co-authoring jailbreak attacks," in which a malicious user prompts an LLM to generate harmful content within an incomplete draft. HarDBench covers high-risk domains such as explosives, drugs, and weapons, and uses realistic prompts to test how susceptible models are to these attacks. The researchers also developed a safety-utility balanced alignment approach that aims to mitigate these risks without compromising the LLM's helpfulness on benign tasks.

IMPACT Introduces a new method for evaluating LLM safety in collaborative writing, potentially leading to more robust AI assistants.

RANK_REASON The cluster contains an academic paper introducing a new benchmark for evaluating LLM safety.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL TIER_1 English(EN) · Euntae Kim, Soomin Han, Buru Chang · 2026-06-10 04:00

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

arXiv:2604.19274v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a seri…
