A new paper investigates the hallucination tendencies of four large language models—ChatGPT, Grok, Gemini, and Copilot—when used for academic writing. Researchers designed 80 prompts across four categories and introduced a Hallucination Index (HI) to measure factual accuracy and reference validity. The study found that Grok and Copilot excelled at reference generation but struggled with abstract tasks, while Gemini and ChatGPT showed better tone control but higher hallucination risks in factual writing.
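The summary does not give the paper's actual HI formula, so the following is only a hypothetical sketch of how such an index might be computed, assuming HI is the share of unsupported items (false factual claims plus invalid references) among everything the model produced:

```python
def hallucination_index(total_claims, false_claims, total_refs, invalid_refs):
    """Hypothetical HI: fraction of unsupported output.

    Counts false factual claims and invalid (fabricated or
    unverifiable) references against all claims and references
    produced. This definition is an assumption, not the paper's.
    """
    total_items = total_claims + total_refs
    if total_items == 0:
        raise ValueError("no items to score")
    return (false_claims + invalid_refs) / total_items

# Example: 2 of 10 claims are false, 3 of 5 references are invalid.
score = hallucination_index(10, 2, 5, 3)
print(round(score, 3))  # 5 unsupported items out of 15
```

Under this definition, HI ranges from 0 (fully grounded output) to 1 (entirely hallucinated), which would let the four models be compared on a common scale across prompt categories.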
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Highlights the persistent challenge of LLM factual accuracy in specialized domains like academic writing, suggesting prompt engineering and task-specific tuning are crucial.
RANK_REASON The cluster contains an academic paper detailing research findings on LLM hallucinations.