Researchers have developed MoRFI (Monotonic Sparse Autoencoder Feature Identification) to better understand how large language models hallucinate. By fine-tuning models such as Llama 3.1 8B and Gemma 2 9B on new knowledge, they observed that prolonged training exacerbates hallucinations. MoRFI analyzes the models' internal states to identify specific directions in the residual stream that are causally linked to these factual inaccuracies, enabling targeted interventions that recover correct knowledge.
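The summary does not reproduce the paper's procedure, but the kind of intervention it describes, removing a residual-stream direction at inference time, can be sketched with a forward hook. In the sketch below, `direction`, `TARGET_LAYER`, and the `make_ablation_hook` helper are illustrative assumptions, not the authors' code or MoRFI's actual identification method.

```python
# Minimal sketch (not the authors' implementation): ablating a candidate
# hallucination-linked direction from a transformer layer's residual stream.
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Build a forward hook that removes the component of a layer's
    hidden states along `direction` (assumed unit-norm, shape [d_model])."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Project each token's residual-stream vector onto `direction`
        # and subtract that component.
        coeff = (hidden @ direction).unsqueeze(-1)   # [batch, seq, 1]
        hidden = hidden - coeff * direction          # [batch, seq, d_model]
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage with a Hugging Face causal LM (e.g. Llama 3.1 8B):
#   layer = model.model.layers[TARGET_LAYER]
#   handle = layer.register_forward_hook(make_ablation_hook(direction))
#   ...generate and compare factual accuracy with/without the hook...
#   handle.remove()
```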
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Provides a method to diagnose and potentially mitigate hallucinations in LLMs by identifying specific internal knowledge retrieval pathways.
RANK_REASON Academic paper introducing a new method for analyzing LLM behavior.