MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks and Classifier-Based Defenses for Medical AI Safety
Researchers have developed MultiTurnPSB, a new benchmark for evaluating the safety of medical AI chatbots over multiple conversational turns. Standard single-turn evaluations miss how sharply unsafe responses increase as a conversation progresses: one model's unsafe-response rate rose from 35% to nearly 80% by the fourth turn. The study also found that Claude Sonnet 4.5 differed notably from GPT-4.1-mini in refusal behavior, suggesting that safety training might generalize to an attacker role.
AI IMPACT: Highlights critical safety gaps in conversational AI, particularly in sensitive applications such as healthcare, and the need for more robust multi-turn evaluation methods.