Researchers have developed DOSEBENCH, a new benchmark that evaluates how well large language models (LLMs) handle temporal uncertainty in over-the-counter medication dosing questions. The benchmark consists of 81 scenarios involving acetaminophen and ibuprofen and focuses on reasoning tasks such as tracking dose timing and adhering to product label constraints. Initial evaluations found that LLMs frequently struggle with rolling-window calculations and ambiguous cases, often producing confident-sounding but incorrect dosing advice.
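To make "rolling-window calculation" concrete, here is a minimal Python sketch of the general idea: counting what falls inside the trailing 24 hours rather than inside the current calendar day. It is not DOSEBENCH's scoring code and not dosing guidance. The function names, the abstract "units", the limit of 4, and the choice to exclude a dose taken exactly 24 hours earlier are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def total_in_window(doses, now, window=timedelta(hours=24)):
    """Sum amounts for doses taken in the interval (now - window, now].

    `doses` is a list of (taken_at, amount) pairs. Excluding a dose taken
    exactly `window` ago is an illustrative assumption, not a label rule.
    """
    start = now - window
    return sum(amount for taken_at, amount in doses if start < taken_at <= now)

def remaining_allowance(doses, now, limit, window=timedelta(hours=24)):
    """Return how much of a caller-supplied limit is still unused in the window."""
    return max(0, limit - total_in_window(doses, now, window))

# Hypothetical log in abstract units; the limit of 4 is a placeholder value.
doses = [
    (datetime(2024, 1, 1, 8, 0), 1),
    (datetime(2024, 1, 1, 14, 0), 1),
    (datetime(2024, 1, 1, 20, 0), 1),
    (datetime(2024, 1, 2, 7, 0), 1),
]
now = datetime(2024, 1, 2, 9, 0)

used = total_in_window(doses, now)
left = remaining_allowance(doses, now, limit=4)
```

In this log, a calendar-day count for January 2 would see only the 07:00 entry, while the trailing 24-hour window at 09:00 still contains the 14:00 and 20:00 entries from the previous day. That gap is the kind of mistake a rolling-window question is meant to expose.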
Impact: Highlights LLM limitations in safety-critical temporal reasoning, suggesting a need for improved models in medical QA.
Ranking rationale: The cluster contains an academic paper introducing a new benchmark for evaluating LLM capabilities.
AI-generated summary · Google Gemini · from 1 source.