A new study published on arXiv assessed 6,233 web-deployed medical large language models (LLMs), evaluating a sample of 1,500 alongside 10 open-source models. It found that many of these models produce factual inaccuracies: 25-30% showed low accuracy, and over half violated operational thresholds. Many action-enabled models also lacked adequate privacy disclosures, pointing to systemic gaps in safety and compliance.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Highlights critical safety and compliance issues in medical AI, necessitating stronger safeguards for patient care.
RANK_REASON The cluster contains an academic paper detailing a large-scale assessment of medical LLMs.