New MoRFI method identifies latent directions causing LLM hallucinations

By the PulseAugur editorial team · [2 sources] · 2026-04-29 16:32

Researchers have developed MoRFI (Monotonic Sparse Autoencoder Feature Identification) to better understand how large language models hallucinate. By fine-tuning models such as Llama 3.1 8B and Gemma 2 9B on new knowledge, they observed that prolonged training exacerbates hallucinations. MoRFI analyzes the models' internal states to identify specific directions in the residual stream that are causally linked to these factual inaccuracies, enabling targeted interventions to recover correct knowledge.

Impact: Provides a method to diagnose and potentially mitigate hallucinations in LLMs by identifying specific internal knowledge retrieval pathways.

Ranking rationale: Academic paper introducing a new method for analyzing LLM behavior.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.CL TIER_1 English(EN) · Dimitris Dimakopoulos, Shay B. Cohen, Ioannis Konstas · 2026-04-30 04:00

MoRFI: Monotonic Sparse Autoencoder Feature Identification

arXiv:2604.26866v1 Announce Type: new Abstract: Large language models (LLMs) acquire most of their factual knowledge during the pre-training stage, through next token prediction. Subsequent stages of post-training often introduce new facts outwith the parametric knowledge, giving…

arXiv cs.CL TIER_1 English(EN) · Ioannis Konstas · 2026-04-29 16:32

MoRFI: Monotonic Sparse Autoencoder Feature Identification

Large language models (LLMs) acquire most of their factual knowledge during the pre-training stage, through next token prediction. Subsequent stages of post-training often introduce new facts outwith the parametric knowledge, giving rise to hallucinations. While it has been demon…
