Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing
A new study published on arXiv investigated the hallucination tendencies of four popular LLMs—ChatGPT, Grok, Gemini, and Copilot—when used for academic writing. The research introduced a "Hallucination Index" (HI) and found that Grok and Copilot performed better in reference generation but struggled with abstract prompts, while Gemini and ChatGPT showed better tone control but higher factual hallucination risks. The study concluded that hallucination behavior is influenced by task type and prompting conditions, not solely by model architecture. Separately, Gary Marcus highlighted multiple studies indicating that current LLMs are unreliable for medical advice, often providing inaccurate or fabricated information with high confidence, and should not be used for unsupervised clinical decision-making. AI
IMPACT LLM hallucinations in academic and medical contexts pose risks of misinformation and unreliable decision-making, highlighting the need for caution and further research.