A new study published on arXiv investigates the hallucination tendencies of popular large language models, including ChatGPT, Grok, Gemini, and Copilot, when they are used for academic writing. The researchers found that while Grok and Copilot excel at reference generation, they struggle with abstract tasks, whereas Gemini and ChatGPT show better tone control but carry a higher risk of factual hallucination. Separately, concerns are mounting about the reliability of LLMs for medical advice: multiple studies report significant inaccuracies, fabricated citations, and a tendency to deliver confident but incorrect information, raising safety concerns about public deployment. Generative AI is also being explored for mental health applications such as anger management, though experts caution against replacing human therapists and point to the risks of misinformation and the need for robust safeguards.
Summary written by gemini-2.5-flash-lite from 6 sources.
IMPACT: LLM accuracy in academic and medical contexts remains a concern, highlighting the need for caution and further research before widespread deployment in sensitive areas.
RANK_REASON: The cluster contains multiple academic papers and expert commentary discussing the performance and safety of LLMs, particularly concerning factual accuracy and potential risks.