A benchmark of eight large language models for medical scribing found that high-impact hallucinations were rare, while omissions of clinically relevant details were far more common. Across 300 synthetic doctor-patient dialogues, the evaluation counted 520 omitted safety facts against 12 confirmed hallucinations. GPT-5.4-mini performed well on cost and speed, while Claude Sonnet and DeepSeek excelled in prose quality, though DeepSeek missed many safety facts. Claude Opus had fewer omissions but weaker prose quality, and Kimi was slow and expensive.
IMPACT Highlights a critical area for improvement in AI medical scribing: reducing omissions of safety-critical information, which are more prevalent than hallucinations.
RANK_REASON The item describes a benchmark and evaluation of existing LLMs for a specific application, rather than a new model release or significant industry event.
AI-generated summary · Google Gemini · from 1 source.